Collecting song and artist data

Annotating the data downloaded from Spotify with more specific and descriptive data using Spotify’s API.

2021-01-16

Spotify actually has another API for collecting additional details about playlists, albums, artists, and tracks. (It also contains endpoints for a user to control their own playlists and song playback, but those are not relevant for this project.) Below, I collect as much data as possible about songs I listen to using this API.

Accessing the API and caching

To access the API, I used the ‘spotifyr’ R package. The instructions for setting up a developer account and acquiring access keys from Spotify are detailed in the documentation for ‘spotifyr’.

To make development and reproduction a bit faster, I wrapped all of the ‘spotifyr’ functions I used in memoise::memoise() from the ‘memoise’ package. I also cached the long-running data frame creation steps using the ‘mustashe’ package.
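As a rough sketch of the memoisation pattern, the example below wraps a slow stand-in function rather than a real ‘spotifyr’ call, so it runs without API credentials. The names slow_lookup(), slow_lookup_memo(), and the temporary cache_path are hypothetical and only for illustration; the same wrapper could be applied to any ‘spotifyr’ function.

```r
library(memoise)

# Hypothetical cache location; the path used in the original project is not shown.
cache_path <- file.path(tempdir(), "spotify-cache")

# Stand-in for a slow API call (an assumption, not a 'spotifyr' function).
n_calls <- 0
slow_lookup <- function(id) {
  n_calls <<- n_calls + 1 # count how often the underlying function really runs
  Sys.sleep(0.1)
  list(id = id, name = paste("track", id))
}

# Memoise with an on-disk cache so results survive between R sessions.
slow_lookup_memo <- memoise(slow_lookup, cache = cache_filesystem(cache_path))

first <- slow_lookup_memo("abc") # runs slow_lookup() and stores the result
second <- slow_lookup_memo("abc") # served from the cache; slow_lookup() is skipped
```

Because the cache lives on disk, re-knitting the post or restarting R does not repeat calls that have already been answered.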

Data collection

Track and artist information must be queried using the unique IDs assigned by Spotify. Annoyingly, the personal data I had already downloaded does not include these IDs with the tracks. Therefore, I instead used the API to collect all of my playlist information, extracted the songs (and their IDs) from the playlists, and used those IDs for the queries. Because of this limitation, I could not retrieve detailed information for every song in my streaming history, but this approach covers most of them and all of my most popular tracks.

List of my playlists

In the following code, I collected the names and IDs of all of my playlists. A sample of the resultant data frame is presented.

my_playlists <- get_my_playlists_memo(limit = 50) %>%
  as_tibble() %>%
  janitor::clean_names() %>%
  select(name, id, uri, tracks_total)

name                id                      uri                                      tracks_total
Coffee Shop Sounds  3SoFTOMn9GQem4yXRiMsP3  spotify:playlist:3SoFTOMn9GQem4yXRiMsP3  71
do not go gently    4LZeYGsNyZ3pzIIhLkcSHb  spotify:playlist:4LZeYGsNyZ3pzIIhLkcSHb  157
Deep Focus          37i9dQZF1DWZeKCadgRdKQ  spotify:playlist:37i9dQZF1DWZeKCadgRdKQ  190
Coding              08nu8IFkbbXkiaYkofMVwv  spotify:playlist:08nu8IFkbbXkiaYkofMVwv  175
Good Bakcground     5uzutKLSRFtPqFaxjjRO56  spotify:playlist:5uzutKLSRFtPqFaxjjRO56  3

Playlist information

The playlist IDs were then used to get all of the information about the playlists. The data column of the playlist_df data frame contains a data frame for each playlist with the track and album information.

get_playlist_info <- function(playlist_id) {
  get_playlist_memo(playlist_id = playlist_id)$tracks$items %>%
    as_tibble() %>%
    janitor::clean_names() %>%
    select(
      track_name, track_id, track_uri, track_artists, track_duration_ms,
      track_popularity, track_album_name, track_album_id, track_album_uri,
      track_album_release_date, track_album_release_date_precision,
      track_explicit
    )
}

stash("playlist_df", depends_on = c("my_playlists"), {
  playlist_df <- my_playlists %>%
    mutate(data = map(id, get_playlist_info))
})
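To get from the nested playlist data to a list of unique tracks to query, the nested data column can be unnested and de-duplicated. The sketch below uses a small made-up data frame (playlist_df_demo, with invented playlist and track IDs) in place of the real playlist_df, so it is only an illustration of the idea and not necessarily the exact code I ran.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for playlist_df: one row per playlist, tracks nested in `data`.
# All names and IDs here are invented for illustration.
playlist_df_demo <- tibble(
  name = c("Playlist A", "Playlist B"),
  id = c("p1", "p2"),
  data = list(
    tibble(track_name = c("Song 1", "Song 2"), track_id = c("t1", "t2")),
    tibble(track_name = c("Song 2", "Song 3"), track_id = c("t2", "t3"))
  )
)

# Flatten to one row per playlist-track pair, then keep each track once.
track_ids <- playlist_df_demo %>%
  select(playlist_name = name, data) %>%
  unnest(data) %>%
  distinct(track_id, track_name)
```

A song that appears in several playlists ("Song 2" here) is kept only once, so each track ID is sent to the API a single time.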

Track information

Then, I gathered the track names and IDs from each playlist and queried the API to obtain as much information from Spotify as possible. For each track, Spotify has what they call “features” and “analysis”, both described in the API documentation. The “features” data tend to be quantitative characteristics of the track such as danceability, acousticness, and liveness. The “analysis” data break each song down in time at different scales, such as into its beats and segments.

The following code shows the functions that collect and clean the data returned by the API calls.

clean_track_info <- function(x) {
  tibble(
    album_id = x$album$id,
    album_name = x$album$name,
    duration = x$duration_ms,
    explicit = x$explicit,
    popularity = x$popularity,
    album_info = list(x$album),
    release_date = x$album$release_date,
    release_date_precision = x$album$release_date_precision
  )
}

clean_track_features <- function(x) {
  janitor::clean_names(x) %>%
    select(-id, -uri)
}

clean_track_analysis <- function(x) {
  x[-c(1:2)] %>%
    map(list) %>%
    as_tibble()
}


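get_track_data <- function(id) {
  # Combine the track metadata, audio features, and audio analysis into one row.
  # Returning bind_cols() directly avoids an invisible return from an assignment.
  bind_cols(
    get_track_memo(id = id) %>% clean_track_info(),
    get_track_audio_features_memo(id = id) %>% clean_track_features(),
    get_track_audio_analysis_memo(id = id) %>% clean_track_analysis()
  )
}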

get_track_data <- memoise(get_track_data, cache = cache_filesystem(cache_path))
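The map(list) step in clean_track_analysis() is easy to miss, so here is a small illustration of what it does. The analysis_demo list below is a made-up stand-in for an API response (its field names and values are assumptions); the point is only that wrapping each element in a list lets as_tibble() produce a single row with list-columns, even when the elements have different lengths.

```r
library(purrr)
library(tibble)

# Toy stand-in for an audio-analysis response: a named list whose elements
# have different shapes. Field names and values are invented for illustration.
analysis_demo <- list(
  meta = list(status = "ok"),
  track = list(tempo = 120),
  bars = data.frame(start = c(0, 2), duration = c(2, 2)),
  beats = data.frame(start = c(0, 0.5, 1), duration = c(0.5, 0.5, 0.5))
)

# Drop the first two elements, wrap each remaining element in a list,
# and convert to a tibble with one row and one list-column per element.
analysis_row <- analysis_demo[-c(1:2)] %>%
  map(list) %>%
  as_tibble()
```

Without map(list), as_tibble() would try to treat the elements as ordinary columns and fail or misalign, because they do not share a common length.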

Each track is run through get_track_data() to get and organize all of the data. The resulting data frame is cached for future analysis and some columns are displayed below as examples.
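A minimal sketch of that step, assuming a data frame of track IDs, is shown below. To keep it runnable without API access, get_track_data_demo() is a hypothetical stand-in that returns a one-row tibble shaped roughly like the output of get_track_data(); the column values are invented. In the real project, this step would presumably also be wrapped in the caching described earlier.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for get_track_data(): returns one row per track,
# including a list-column like those produced from the analysis data.
get_track_data_demo <- function(id) {
  tibble(
    album_name = paste("Album for", id),
    duration = 180000L,
    beats = list(tibble(start = c(0, 0.5)))
  )
}

# Invented track IDs for illustration.
track_ids_demo <- tibble(
  track_id = c("t1", "t2", "t3"),
  track_name = c("Song 1", "Song 2", "Song 3")
)

# Query each track, then expand the one-row results into regular columns.
track_df_demo <- track_ids_demo %>%
  mutate(track_data = map(track_id, get_track_data_demo)) %>%
  unnest(track_data)
```

The result has one row per track, with the scalar fields as ordinary columns and the time decompositions kept as list-columns.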
