Annotating the data downloaded from Spotify with more specific and descriptive data using Spotify’s API.
Spotify actually has another API for collecting additional details about playlists, albums, artists, and tracks. (It also contains endpoints for a user to control their own playlists and song playback, but those are not relevant for this project.) Below, I collect as much data as possible about songs I listen to using this API.
To access the API, I used the ‘spotifyr’ R package. The instructions for setting up a developer account and acquiring access keys from Spotify are detailed in the documentation for ‘spotifyr’.
To make development and reproduction a bit faster, I wrapped all of the ‘spotifyr’ functions I used in memoise::memoise() from the ‘memoise’ package. I also cached the long-running data frame creation steps using the ‘mustashe’ package.
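As a rough illustration of what memoisation buys here, the sketch below wraps a slow function and calls it twice with the same argument. The function slow_lookup() and the counter n_calls are hypothetical stand-ins for an API call, not part of ‘spotifyr’; the wrappers used in this post (get_track_memo() and friends) were presumably built the same way, for example with memoise::memoise(spotifyr::get_track), but that is an assumption.

```r
library(memoise)

# Hypothetical stand-in for a slow API call; it counts how often it really runs.
n_calls <- 0
slow_lookup <- function(id) {
  n_calls <<- n_calls + 1
  Sys.sleep(0.1) # pretend to wait on the network
  paste0("details-for-", id)
}

# Wrap the function once, then call the wrapper instead of the original.
slow_lookup_memo <- memoise::memoise(slow_lookup)

first <- slow_lookup_memo("abc")  # runs slow_lookup()
second <- slow_lookup_memo("abc") # same argument, answered from the cache
```

After both calls, n_calls is 1 because the second result came from the cache. By default the cache lives in memory for the current R session; passing a cache such as cache_filesystem() keeps results on disk between sessions.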
Track and artist information must be queried using their unique IDs assigned by Spotify. Annoyingly, the personal data I have already downloaded does not contain these IDs with the tracks. Therefore, I instead used the API to collect all of my playlist information, extracted the songs (and their IDs) from the playlists, and used those IDs to query the API. Because of this limitation, I could not access the detailed information for every song in my streaming history, but this approach covers most of them and all of my most popular tracks.
In the following code, I collected the names and IDs of all of my playlists. A sample of the resultant data frame is presented.
my_playlists <- get_my_playlists_memo(limit = 50) %>%
  as_tibble() %>%
  janitor::clean_names() %>%
  select(name, id, uri, tracks_total)
name | id | uri | tracks_total |
---|---|---|---|
Coffee Shop Sounds | 3SoFTOMn9GQem4yXRiMsP3 | spotify:playlist:3SoFTOMn9GQem4yXRiMsP3 | 71 |
do not go gently | 4LZeYGsNyZ3pzIIhLkcSHb | spotify:playlist:4LZeYGsNyZ3pzIIhLkcSHb | 157 |
Deep Focus | 37i9dQZF1DWZeKCadgRdKQ | spotify:playlist:37i9dQZF1DWZeKCadgRdKQ | 190 |
Coding | 08nu8IFkbbXkiaYkofMVwv | spotify:playlist:08nu8IFkbbXkiaYkofMVwv | 175 |
Good Bakcground | 5uzutKLSRFtPqFaxjjRO56 | spotify:playlist:5uzutKLSRFtPqFaxjjRO56 | 3 |
The playlist IDs were then used to get all of the information about the playlists. The data column of the playlist_df data frame contains a data frame for each playlist with the track and album information.
# Get the tracks of one playlist and keep the track and album columns of interest.
get_playlist_info <- function(playlist_id) {
  get_playlist_memo(playlist_id = playlist_id)$tracks$items %>%
    as_tibble() %>%
    janitor::clean_names() %>%
    select(
      track_name, track_id, track_uri, track_artists, track_duration_ms,
      track_popularity, track_album_name, track_album_id, track_album_uri,
      track_album_release_date, track_album_release_date_precision,
      track_explicit
    )
}

# Cache the result with 'mustashe'; the block is re-run only if the code
# or `my_playlists` changes.
stash("playlist_df", depends_on = c("my_playlists"), {
  playlist_df <- my_playlists %>%
    mutate(data = map(id, get_playlist_info))
})
Then, I gathered the track names and IDs from each playlist and queried the API to obtain as much information from Spotify as possible. For each track, Spotify provides what it calls “features” and “analysis” (links are to the API documentation). The “features” data tend to be quantitative characteristics of the track, such as danceability, acousticness, and liveness. The “analysis” data decompose each song in time at different scales, such as into beats and segments.
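The gathering step itself is not shown in this post. A minimal sketch of one way to do it, using a toy stand-in for playlist_df (the playlist names, track names, and IDs below are made up), is to unnest the data column and keep each track once:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for playlist_df: one row per playlist, tracks nested in `data`.
toy_playlist_df <- tibble(
  name = c("Playlist A", "Playlist B"),
  data = list(
    tibble(track_name = c("Song 1", "Song 2"), track_id = c("id1", "id2")),
    tibble(track_name = c("Song 2", "Song 3"), track_id = c("id2", "id3"))
  )
)

# Flatten to one row per playlist entry, then keep each track only once.
toy_tracks <- toy_playlist_df %>%
  select(playlist = name, data) %>%
  unnest(data) %>%
  distinct(track_id, track_name)
```

Deduplicating matters because the same song can appear in several playlists, and each unique ID only needs to be queried once.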
The following code shows the functions that collect and clean the data returned by the API calls.
# Pull the fields of interest out of the track object returned by the API.
clean_track_info <- function(x) {
  tibble(
    album_id = x$album$id,
    album_name = x$album$name,
    duration = x$duration_ms,
    explicit = x$explicit,
    popularity = x$popularity,
    album_info = list(x$album),
    release_date = x$album$release_date,
    release_date_precision = x$album$release_date_precision
  )
}

# Standardize the column names and drop the redundant identifiers.
clean_track_features <- function(x) {
  janitor::clean_names(x) %>%
    select(-id, -uri)
}

# Drop the first two elements of the response and store each remaining
# component in its own list-column.
clean_track_analysis <- function(x) {
  x[-c(1:2)] %>%
    map(list) %>%
    as_tibble()
}

# Combine the track info, features, and analysis into a single row.
get_track_data <- function(id) {
  bind_cols(
    get_track_memo(id = id) %>% clean_track_info(),
    get_track_audio_features_memo(id = id) %>% clean_track_features(),
    get_track_audio_analysis_memo(id = id) %>% clean_track_analysis()
  )
}

# Cache results on disk (cache_path must already be defined).
get_track_data <- memoise(get_track_data, cache = cache_filesystem(cache_path))
Each track is run through get_track_data() to get and organize all of the data. The resulting data frame is cached for future analysis and some columns are displayed below as examples.
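The code that runs every track through the function is not shown here. A minimal sketch of that step, under stated assumptions, follows: fake_get_track_data() is a hypothetical stand-in for get_track_data() so the example runs without API access, and wrapping it in purrr::possibly() is my own addition to keep one failed lookup from stopping the whole run, not necessarily what was done for this post.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for get_track_data(): returns a one-row tibble
# and fails for one ID to mimic an API error.
fake_get_track_data <- function(id) {
  if (id == "bad") stop("track not found")
  tibble(duration = nchar(id) * 1000, explicit = FALSE)
}

# possibly() returns a default value on error instead of stopping.
safe_get_track_data <- possibly(fake_get_track_data, otherwise = NULL)

toy_ids <- tibble(track_id = c("id1", "bad", "id22"))

toy_track_info <- toy_ids %>%
  mutate(info = map(track_id, safe_get_track_data)) %>%
  filter(!map_lgl(info, is.null)) %>% # drop tracks whose lookup failed
  unnest(info)
```

The result has one row per successfully queried track, with the columns returned by the lookup placed next to track_id.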