Personal Spotify Data Analysis: Data Preparation

Here, I read in and organized the raw data provided by Spotify. The raw data is provided as JSON, so I converted it into tidy data frames. I have not shared the raw data used here because it contains some potentially private or financial information, but the results of the processing below are public. More information about the data can be found in Spotify’s Understanding my Data page.

Streaming History

Spotify’s description:

A list of items (e.g. songs, videos, and podcasts) listened to or watched in the past year, including:

Date and time of when the stream ended in UTC format (Coordinated Universal Time zone).

Name of “creator” for each stream (e.g. the artist name if a music track).

Name of items listened to or watched (e.g. title of music track or name of video).

“msPlayed”- Stands for how many mili-seconds the track was listened.

Below is the code to prepare the streaming history data that was contained in two JSON files.

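# Assumes the tidyverse is attached and that `raw_data_dir` and `data_dir`
# are defined elsewhere (e.g. in a setup chunk).

# Read one streaming-history JSON file into a tidy data frame.
parse_streaming_history <- function(f) {
  rjson::fromJSON(file = f) %>%
    map(as_tibble) %>%
    bind_rows() %>%
    janitor::clean_names() %>%
    mutate(
      end_time = lubridate::ymd_hm(end_time),
      sec_played = ms_played / 1e3 # milliseconds to seconds
    ) %>%
    select(-ms_played)
}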

streaming_history <- map_chr(
  seq(0, 1),
  ~ glue::glue("StreamingHistory{.x}.json")
) %>%
  map(~ file.path(raw_data_dir, .x)) %>%
  map(parse_streaming_history) %>%
  bind_rows()

saveRDS(streaming_history, file.path(data_dir, "streaming_history.rds"))

There are four columns in this data frame:

end_time: when the stream ended (UTC)
artist_name: the artist
track_name: name of the song
sec_played: how long the track was played (in seconds)

end_time	artist_name	track_name	sec_played
2019-11-20 11:29:00	Panic! At The Disco	Hey Look Ma, I Made It	169.666
2019-11-20 11:32:00	Panic! At The Disco	Say Amen (Saturday Night)	189.186
2019-11-20 11:36:00	AJR	Drama	204.424
2019-11-20 11:39:00	Jon Bellion	Overwhelming	172.751
2019-11-20 13:25:00	Quinn XCII	Panama	84.845
2019-12-04 14:29:00	AJR	Three-Thirty	210.283
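
Because the raw files are not shared, the sketch below illustrates the same transformation on a single inline record. The JSON field names (`endTime`, `artistName`, `trackName`, `msPlayed`) and the `endTime` format are assumptions inferred from the cleaned column names and Spotify's description; the record itself is reconstructed from the first row of the table above, and `sample_json` and `sample_history` are names introduced only for this illustration.

```r
library(dplyr)
library(purrr)
library(tibble)

# Assumed shape of one raw record (field names inferred, not copied from the raw files).
sample_json <- '[
  {"endTime": "2019-11-20 11:29", "artistName": "Panic! At The Disco",
   "trackName": "Hey Look Ma, I Made It", "msPlayed": 169666}
]'

sample_history <- rjson::fromJSON(json_str = sample_json) %>%
  map(as_tibble) %>%            # one single-row tibble per record
  bind_rows() %>%               # stack the records
  janitor::clean_names() %>%    # camelCase to snake_case
  mutate(
    end_time = lubridate::ymd_hm(end_time),
    sec_played = ms_played / 1e3
  ) %>%
  select(-ms_played)
```

If the assumed field names hold, `sample_history` should have the same four columns as the table above.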

Inferences

Spotify’s description:

We draw certain inferences about your interests and preferences based on your usage of the Spotify service and using data obtained from our advertisers and other advertising partners. This includes a list of market segments with which you are currently associated. Depending on your settings, this data may be used to serve interest-based advertising to you within the Spotify service.

I am most excited to dig into this “Inferences” data set. From a skim, some interesting notes are:

3P_Politics - Any Republican_US, 3P_Politics - Registered Republican_US, 3P_Politics - Any Democrat_US, 3P_Politics - Registered Democrat_US
3P_Custom__CNN_US, 3P_Custom__Conservative Affinity TV News_US, 3P_Custom__DailyShow_16Oct2020_US
3p_Anime_Manga_Enthusiast_Es
3P_Women's Apparel_CA
3P_Alcohol Consumers_UK (and some others)
3P__Custom_Cigarette Buyers_12Dec2019_US [Do Not Use in 2021]
3P_Custom_Netflix_US
3P_Custom_Parents of Boys 6-11_US, 3P_Custom_Parents of Kids 5-13_28Apr2020_US, 3P_Custom_Parents of Toddlers_30Nov2020_US, etc.

This JSON file was just a single list, so I turned it into a one-column data frame.

inferences <- rjson::fromJSON(file = file.path(raw_data_dir, "Inferences.json"))
inferences <- tibble(inference = inferences$inference)

inference
1P_Custom_Auto_BMW
1P_Custom_Discovery_Streamers
1P_Custom_Passionate_Curators
1P_Custom_Soothing_Sounds
1P_Custom_T-Mobile_Switchers
1P_Custom_iPhone_11_Users

Search Queries

Spotify’s description:

A list of searches made, including: 1. The date and time the search was made. 2. Type of device/platform used (such as iOS, desktop). 3. Search Query shows what the user typed in the search field. 4. Search interaction URIs shows the list of Uniform Resource Identifiers (URI) of the search results the user interacted with.

The only quirk of this data was that there could be multiple search interaction URIs for a single query. I decided to keep each query as a single row and store its URIs in a list-column. For later analysis, it might be best to assign an index to each search query and unnest this column.

search_queries <- rjson::fromJSON(file = file.path(raw_data_dir, "SearchQueries.json")) %>%
  head() %>%
  map(function(x) {
    df <- as_tibble(x)
    df$searchInteractionURIs <- list(df$searchInteractionURIs)
    return(df)
  }) %>%
  bind_rows() %>%
  janitor::clean_names() %>%
  rename(search_interaction_uris = search_interaction_ur_is) %>%
  mutate(search_time = lubridate::ymd_hms(search_time))

platform	search_time	search_query	search_interaction_uris
IPHONE_ARM64	2020-09-23 12:10:35	stupid	spotify:track:23LIOXAz6QSzgPYH4dmZL7
IPHONE_ARM64	2020-09-28 13:12:25	cranber	spotify:artist:7t0rwkOPGlDPEhaOcVtOt9
IPHONE_ARM64	2020-09-30 11:25:01	blu a	spotify:track:2e7SEPyhuReSD8n9E1QOEq, spotify:album:5bDUKo7gGyXWDkbLXai0gI
IPHONE_ARM64	2020-09-30 11:25:01	blu a	spotify:track:2e7SEPyhuReSD8n9E1QOEq, spotify:album:5bDUKo7gGyXWDkbLXai0gI
IPHONE_ARM64	2020-10-15 15:57:13	coffee	spotify:playlist:37i9dQZF1DX6ziVCJnEm59, spotify:track:4CxmynXhw78QefruycvxG8
IPHONE_ARM64	2020-10-15 15:57:13	coffee	spotify:playlist:37i9dQZF1DX6ziVCJnEm59, spotify:track:4CxmynXhw78QefruycvxG8
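
The indexing and unnesting suggested earlier could look roughly like the following. This is a minimal sketch on a small hand-built stand-in (`search_queries_example`, using values from the table above and assuming exactly one row per query), not the processing actually used here; `query_idx` and `search_queries_long` are names introduced for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for `search_queries`: one row per query, with the
# interaction URIs stored in a list-column.
search_queries_example <- tibble(
  platform = "IPHONE_ARM64",
  search_query = c("stupid", "blu a"),
  search_interaction_uris = list(
    "spotify:track:23LIOXAz6QSzgPYH4dmZL7",
    c(
      "spotify:track:2e7SEPyhuReSD8n9E1QOEq",
      "spotify:album:5bDUKo7gGyXWDkbLXai0gI"
    )
  )
)

search_queries_long <- search_queries_example %>%
  mutate(query_idx = row_number()) %>%  # index each query before unnesting
  unnest(search_interaction_uris)       # one row per (query, URI) pair
```

After unnesting, rows that share a `query_idx` belong to the same search. Queries with no interaction URIs are dropped by default; passing `keep_empty = TRUE` to `unnest()` keeps them.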
