Data Preparation

Preparation of the raw data downloaded from Spotify.

December 7, 2020

Here, I read in and organized the raw data provided by Spotify. The raw data is provided as JSON, so I converted it into tidy data frames. I have not provided the raw data used here because it contains some potentially private or financial information, but the results of the processing below are public. More information about the data can be found on Spotify’s “Understanding my Data” page.

Streaming History

Spotify’s description:

A list of items (e.g. songs, videos, and podcasts) listened to or watched in the past year, including:

  1. Date and time of when the stream ended in UTC format (Coordinated Universal Time zone).
  2. Name of “creator” for each stream (e.g. the artist name if a music track).
  3. Name of items listened to or watched (e.g. title of music track or name of video).
  4. “msPlayed”- Stands for how many mili-seconds the track was listened.

The streaming history was split across two JSON files. The code below parses each file and combines the results into a single data frame.

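# Assumed setup: `%.%` followed by a braced block is taken to be the
# nakedpipe operator, which pipes the left-hand side through each call in turn.
# `raw_data_dir` and `data_dir` are assumed to be defined earlier.
library(dplyr)
library(purrr)
library(nakedpipe)

# Parse one StreamingHistory JSON file into a tidy data frame.
parse_streaming_history <- function(f) {
  rjson::fromJSON(file = f) %.% {
    map(as_tibble)          # one single-row tibble per stream
    bind_rows()
    janitor::clean_names()  # e.g. msPlayed becomes ms_played
    mutate(
      end_time = lubridate::ymd_hm(end_time),
      sec_played = ms_played / 1e3  # milliseconds to seconds
    )
    select(-ms_played)
  }
}

# Build the names StreamingHistory0.json and StreamingHistory1.json,
# parse each file, and stack the results.
streaming_history <- map_chr(
  seq(0, 1),
  ~ glue::glue("StreamingHistory{.x}.json")
) %>%
  map(~ file.path(raw_data_dir, .x)) %>%
  map(parse_streaming_history) %>%
  bind_rows()

saveRDS(streaming_history, file.path(data_dir, "streaming_history.rds"))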

There are four columns in this data frame:

| end_time | artist_name | track_name | sec_played |
|---|---|---|---|
| 2019-11-20 11:29:00 | Panic! At The Disco | Hey Look Ma, I Made It | 169.666 |
| 2019-11-20 11:32:00 | Panic! At The Disco | Say Amen (Saturday Night) | 189.186 |
| 2019-11-20 11:36:00 | AJR | Drama | 204.424 |
| 2019-11-20 11:39:00 | Jon Bellion | Overwhelming | 172.751 |
| 2019-11-20 13:25:00 | Quinn XCII | Panama | 84.845 |
| 2019-12-04 14:29:00 | AJR | Three-Thirty | 210.283 |
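
The parsing steps can also be written with the more common `%>%` pipe and tried on a small inline sample. This is a minimal sketch rather than the code used for this post: the two records reuse rows from the table above, while the raw key names other than `msPlayed` (`endTime`, `artistName`, `trackName`) and the object names `raw_json` and `demo_history` are assumptions for illustration.

```r
library(dplyr)
library(purrr)

# Two records shaped like the Spotify export (key names partly assumed).
raw_json <- '[
  {"endTime": "2019-11-20 11:29", "artistName": "Panic! At The Disco",
   "trackName": "Hey Look Ma, I Made It", "msPlayed": 169666},
  {"endTime": "2019-11-20 11:36", "artistName": "AJR",
   "trackName": "Drama", "msPlayed": 204424}
]'

demo_history <- rjson::fromJSON(json_str = raw_json) %>%
  map(tibble::as_tibble) %>%   # one single-row tibble per record
  bind_rows() %>%
  janitor::clean_names() %>%   # camelCase keys to snake_case columns
  mutate(
    end_time = lubridate::ymd_hm(end_time),  # parsed as UTC
    sec_played = ms_played / 1e3             # milliseconds to seconds
  ) %>%
  select(-ms_played)

print(demo_history)
```

Running the sketch on a couple of hand-written records is a cheap way to confirm the column names and the unit conversion before pointing the function at the full export.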

Inferences

Spotify’s description:

We draw certain inferences about your interests and preferences based on your usage of the Spotify service and using data obtained from our advertisers and other advertising partners. This includes a list of market segments with which you are currently associated. Depending on your settings, this data may be used to serve interest-based advertising to you within the Spotify service.

I am most excited to dig into this “Inferences” data set. From a skim, some interesting notes are:

This JSON file was just a single list, so I turned it into a one-column data frame.

inferences <- rjson::fromJSON(file = file.path(raw_data_dir, "Inferences.json"))
inferences <- tibble(inference = inferences$inference)

| inference |
|---|
| 1P_Custom_Auto_BMW |
| 1P_Custom_Discovery_Streamers |
| 1P_Custom_Passionate_Curators |
| 1P_Custom_Soothing_Sounds |
| 1P_Custom_T-Mobile_Switchers |
| 1P_Custom_iPhone_11_Users |
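
The single-list conversion described above can be tried without the private file. The following is a minimal sketch under stated assumptions: the inline JSON is a hypothetical stand-in for `Inferences.json`, its top-level key name `inferences` is a guess, and the labels are taken from the output shown above. To avoid depending on that key name, the sketch takes the first top-level element by position.

```r
library(tibble)

# Hypothetical stand-in for Inferences.json: one top-level array of labels.
# The key name "inferences" is an assumption, not confirmed by the source.
raw_json <- '{"inferences": [
  "1P_Custom_Auto_BMW",
  "1P_Custom_Discovery_Streamers",
  "1P_Custom_Passionate_Curators"
]}'

parsed <- rjson::fromJSON(json_str = raw_json)

# Use the first (and only) top-level element, whatever it is called,
# and store its labels in a one-column data frame.
demo_inferences <- tibble(inference = unlist(parsed[[1]]))

print(demo_inferences)
```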

Search Queries

Spotify’s description:

A list of searches made, including:

  1. The date and time the search was made.
  2. Type of device/platform used (such as iOS, desktop).
  3. Search Query shows what the user typed in the search field.
  4. Search interaction URIs shows the list of Uniform Resource Identifiers (URI) of the search results the user interacted with.

The only quirk of this data was that a single query could have multiple search interaction URIs. I decided to keep each query as a single row and store its URIs in a list-column. For later analysis, it might be best to assign a search query index and unnest this column.

search_queries <- rjson::fromJSON(file = file.path(raw_data_dir, "SearchQueries.json")) %>%
  head() %>%  # keep only the first six queries
  map(function(x) {
    df <- as_tibble(x)
    # Store the query's full URI vector as a list-column entry.
    df$searchInteractionURIs <- list(df$searchInteractionURIs)
    return(df)
  }) %>%
  bind_rows() %>%
  janitor::clean_names() %>%
  # clean_names() turns "URIs" into "ur_is", so restore a readable name.
  rename(search_interaction_uris = search_interaction_ur_is) %>%
  mutate(search_time = lubridate::ymd_hms(search_time))

| platform | search_time | search_query | search_interaction_uris |
|---|---|---|---|
| IPHONE_ARM64 | 2020-09-23 12:10:35 | stupid | spotify:track:23LIOXAz6QSzgPYH4dmZL7 |
| IPHONE_ARM64 | 2020-09-28 13:12:25 | cranber | spotify:artist:7t0rwkOPGlDPEhaOcVtOt9 |
| IPHONE_ARM64 | 2020-09-30 11:25:01 | blu a | spotify:track:2e7SEPyhuReSD8n9E1QOEq, spotify:album:5bDUKo7gGyXWDkbLXai0gI |
| IPHONE_ARM64 | 2020-09-30 11:25:01 | blu a | spotify:track:2e7SEPyhuReSD8n9E1QOEq, spotify:album:5bDUKo7gGyXWDkbLXai0gI |
| IPHONE_ARM64 | 2020-10-15 15:57:13 | coffee | spotify:playlist:37i9dQZF1DX6ziVCJnEm59, spotify:track:4CxmynXhw78QefruycvxG8 |
| IPHONE_ARM64 | 2020-10-15 15:57:13 | coffee | spotify:playlist:37i9dQZF1DX6ziVCJnEm59, spotify:track:4CxmynXhw78QefruycvxG8 |
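
The idea of assigning a search query index and unnesting the URI column could look roughly like the sketch below. It is not the code used for this post. The records are hypothetical R lists shaped the way `rjson` would plausibly return them, with values reused from the table above; the raw `searchTime` format and the names `raw_queries`, `demo_queries`, `demo_queries_long`, and `query_idx` are assumptions. Wrapping the URIs in `list()` while each row is built keeps a query with several URIs on exactly one row, because the tibble then never recycles the other fields across the URIs.

```r
library(dplyr)
library(purrr)

# Hypothetical parsed records; the searchTime format is assumed.
raw_queries <- list(
  list(
    platform = "IPHONE_ARM64",
    searchTime = "2020-09-23 12:10:35",
    searchQuery = "stupid",
    searchInteractionURIs = "spotify:track:23LIOXAz6QSzgPYH4dmZL7"
  ),
  list(
    platform = "IPHONE_ARM64",
    searchTime = "2020-09-30 11:25:01",
    searchQuery = "blu a",
    searchInteractionURIs = c(
      "spotify:track:2e7SEPyhuReSD8n9E1QOEq",
      "spotify:album:5bDUKo7gGyXWDkbLXai0gI"
    )
  )
)

# One row per query, with all of its URIs held in a single list-column cell.
demo_queries <- raw_queries %>%
  map(function(x) {
    tibble::tibble(
      platform = x$platform,
      search_time = lubridate::ymd_hms(x$searchTime),
      search_query = x$searchQuery,
      search_interaction_uris = list(as.character(x$searchInteractionURIs))
    )
  }) %>%
  bind_rows() %>%
  mutate(query_idx = row_number())  # index that survives unnesting

# Long form: one row per (query, URI) pair; keep_empty = TRUE retains
# queries that have no URIs at all.
demo_queries_long <- demo_queries %>%
  tidyr::unnest(search_interaction_uris, keep_empty = TRUE)

print(demo_queries_long)
```

With `query_idx` in place, rows in the long form can always be grouped back to the search that produced them, even when two searches share the same text and timestamp.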

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.