Preparation of the raw data downloaded from Spotify.
Here, I read in and organized the raw data provided by Spotify. The raw data is provided as JSON, so I converted it into tidy data frames. I have not shared the raw data used here because it contains some potentially private and financial information, but the results of the processing below are public. More information about the data can be found on Spotify’s Understanding my Data page.
Spotify’s description:
A list of items (e.g. songs, videos, and podcasts) listened to or watched in the past year, including:
- Date and time of when the stream ended in UTC format (Coordinated Universal Time zone).
- Name of “creator” for each stream (e.g. the artist name if a music track).
- Name of items listened to or watched (e.g. title of music track or name of video).
- “msPlayed”- Stands for how many mili-seconds the track was listened.
Below is the code to prepare the streaming history data that was contained in two JSON files.
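# Read one streaming-history JSON file into a tidy data frame.
# `%.%` is assumed to be the nakedpipe operator: each call inside the braces
# is piped into the next one.
parse_streaming_history <- function(f) {
  rjson::fromJSON(file = f) %.% {
    map(as_tibble)
    bind_rows()
    janitor::clean_names()
    mutate(
      end_time = lubridate::ymd_hm(end_time),
      sec_played = ms_played / 1e3  # milliseconds to seconds
    )
    select(-ms_played)
  }
}

# Build the two file names, parse each file, and stack the results.
streaming_history <- map_chr(
  seq(0, 1),
  ~ glue::glue("StreamingHistory{.x}.json")
) %>%
  map(~ file.path(raw_data_dir, .x)) %>%
  map(parse_streaming_history) %>%
  bind_rows()

saveRDS(streaming_history, file.path(data_dir, "streaming_history.rds"))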
There are four columns in this data frame:
- end_time: when the song finished playing
- artist_name: the artist
- track_name: name of the song
- sec_played: how long the song was played (in seconds)

end_time | artist_name | track_name | sec_played |
---|---|---|---|
2019-11-20 11:29:00 | Panic! At The Disco | Hey Look Ma, I Made It | 169.666 |
2019-11-20 11:32:00 | Panic! At The Disco | Say Amen (Saturday Night) | 189.186 |
2019-11-20 11:36:00 | AJR | Drama | 204.424 |
2019-11-20 11:39:00 | Jon Bellion | Overwhelming | 172.751 |
2019-11-20 13:25:00 | Quinn XCII | Panama | 84.845 |
2019-12-04 14:29:00 | AJR | Three-Thirty | 210.283 |
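To make the parsing steps above concrete, here is a minimal, self-contained sketch that runs the same transformations on a small inline JSON string instead of the real files. The JSON sample is illustrative: its field names (endTime, artistName, trackName, msPlayed) are assumed from the cleaned column names, and the values are back-calculated from two rows of the table above. It uses the ordinary `%>%` pipe so it can be run without any extra pipe package.

```r
library(dplyr)
library(purrr)
library(tibble)

# Illustrative sample only; field names are an assumption about the raw export.
raw_json <- '[
  {"endTime": "2019-11-20 11:29", "artistName": "Panic! At The Disco",
   "trackName": "Hey Look Ma, I Made It", "msPlayed": 169666},
  {"endTime": "2019-11-20 11:39", "artistName": "Jon Bellion",
   "trackName": "Overwhelming", "msPlayed": 172751}
]'

demo_history <- rjson::fromJSON(json_str = raw_json) %>%
  map(as_tibble) %>%          # one single-row tibble per stream
  bind_rows() %>%             # stack them into one data frame
  janitor::clean_names() %>%  # endTime -> end_time, msPlayed -> ms_played, ...
  mutate(
    end_time = lubridate::ymd_hm(end_time),  # timestamps have no seconds field
    sec_played = ms_played / 1e3             # milliseconds to seconds
  ) %>%
  select(-ms_played)

print(demo_history)
```

The result has the same four columns as the real data frame, which makes it a convenient fixture for trying out later analysis steps without touching the private raw files.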
Spotify’s description:
We draw certain inferences about your interests and preferences based on your usage of the Spotify service and using data obtained from our advertisers and other advertising partners. This includes a list of market segments with which you are currently associated. Depending on your settings, this data may be used to serve interest-based advertising to you within the Spotify service.
I am most excited to dig into this “Inferences” data set. From a skim, some interesting notes are:
- 3P_Politics - Any Republican_US, 3P_Politics - Registered Republican_US, 3P_Politics - Any Democrat_US, 3P_Politics - Registered Democrat_US
- 3P_Custom__CNN_US, 3P_Custom__Conservative Affinity TV News_US, 3P_Custom__DailyShow_16Oct2020_US
- 3p_Anime_Manga_Enthusiast_Es
- 3P_Women's Apparel_CA
- 3P_Alcohol Consumers_UK (and some others)
- 3P__Custom_Cigarette Buyers_12Dec2019_US [Do Not Use in 2021]
- 3P_Custom_Netflix_US
- 3P_Custom_Parents of Boys 6-11_US, 3P_Custom_Parents of Kids 5-13_28Apr2020_US, 3P_Custom_Parents of Toddlers_30Nov2020_US, etc.

This JSON file was just a single list, so I turned it into a one-column data frame.
inference |
---|
1P_Custom_Auto_BMW |
1P_Custom_Discovery_Streamers |
1P_Custom_Passionate_Curators |
1P_Custom_Soothing_Sounds |
1P_Custom_T-Mobile_Switchers |
1P_Custom_iPhone_11_Users |
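The code for this step is not shown, so the following is only a sketch of how a one-column data frame like the one above could be built. The inline JSON and its top-level "inferences" key are assumptions about the file's shape; `unlist()` is used so the same code also works if the file is a bare JSON array.

```r
library(tibble)

# Hypothetical stand-in for the inferences file; the "inferences" key is assumed.
raw_json <- '{"inferences": [
  "1P_Custom_Auto_BMW",
  "1P_Custom_Discovery_Streamers",
  "1P_Custom_Passionate_Curators"
]}'

raw <- rjson::fromJSON(json_str = raw_json)

# Flatten whatever list structure comes back into a plain character vector,
# then store it as the single column `inference`.
inferences_demo <- tibble(inference = unlist(raw, use.names = FALSE))

print(inferences_demo)
```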
Spotify’s description:
A list of searches made, including: 1. The date and time the search was made. 2. Type of device/platform used (such as iOS, desktop). 3. Search Query shows what the user typed in the search field. 4. Search interaction URIs shows the list of Uniform Resource Identifiers (URI) of the search results the user interacted with.
The only quirk of this data was that there could be multiple search interaction URIs for a single query. I decided to keep each query as a single row and store the URIs as a list-column. For later analysis, it might be best to assign a search query index and unnest this column.
search_queries <- rjson::fromJSON(file = file.path(raw_data_dir, "SearchQueries.json")) %>%
  head() %>%  # keep only the first six queries
  map(function(x) {
    df <- as_tibble(x)
    # Wrap the URIs in a list so they are stored as a list-column.
    df$searchInteractionURIs <- list(df$searchInteractionURIs)
    return(df)
  }) %>%
  bind_rows() %>%
  janitor::clean_names() %>%
  rename(search_interaction_uris = search_interaction_ur_is) %>%
  mutate(search_time = lubridate::ymd_hms(search_time))
platform | search_time | search_query | search_interaction_uris |
---|---|---|---|
IPHONE_ARM64 | 2020-09-23 12:10:35 | stupid | spotify:track:23LIOXAz6QSzgPYH4dmZL7 |
IPHONE_ARM64 | 2020-09-28 13:12:25 | cranber | spotify:artist:7t0rwkOPGlDPEhaOcVtOt9 |
IPHONE_ARM64 | 2020-09-30 11:25:01 | blu a | spotify:track:2e7SEPyhuReSD8n9E1QOEq, spotify:album:5bDUKo7gGyXWDkbLXai0gI |
IPHONE_ARM64 | 2020-09-30 11:25:01 | blu a | spotify:track:2e7SEPyhuReSD8n9E1QOEq, spotify:album:5bDUKo7gGyXWDkbLXai0gI |
IPHONE_ARM64 | 2020-10-15 15:57:13 | coffee | spotify:playlist:37i9dQZF1DX6ziVCJnEm59, spotify:track:4CxmynXhw78QefruycvxG8 |
IPHONE_ARM64 | 2020-10-15 15:57:13 | coffee | spotify:playlist:37i9dQZF1DX6ziVCJnEm59, spotify:track:4CxmynXhw78QefruycvxG8 |
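The later-analysis idea mentioned earlier (assign a search query index, then unnest the URI column) could look roughly like the sketch below. It is not part of the original processing: the small `search_queries_demo` data frame is built by hand from two rows of the table above, and `query_idx` is a hypothetical column name.

```r
library(dplyr)
library(tidyr)

# Hand-built stand-in for `search_queries`, with the URIs as a list-column.
search_queries_demo <- tibble(
  platform = c("IPHONE_ARM64", "IPHONE_ARM64"),
  search_time = lubridate::ymd_hms(c("2020-09-23 12:10:35", "2020-09-30 11:25:01")),
  search_query = c("stupid", "blu a"),
  search_interaction_uris = list(
    "spotify:track:23LIOXAz6QSzgPYH4dmZL7",
    c("spotify:track:2e7SEPyhuReSD8n9E1QOEq", "spotify:album:5bDUKo7gGyXWDkbLXai0gI")
  )
)

search_queries_long <- search_queries_demo %>%
  mutate(query_idx = row_number()) %>%  # index each query before expanding it
  unnest(search_interaction_uris)       # one row per (query, URI) pair

print(search_queries_long)
```

Assigning the index before unnesting means rows that came from the same query can still be grouped together afterwards, even when two different queries share the same text and timestamp.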