Personal Spotify Data Analysis: Collecting song and artist data

Spotify provides another API for collecting additional details about playlists, albums, artists, and tracks. (It also contains endpoints for a user to control their own playlists and song playback, but those are not relevant for this project.) Below, I use this API to collect as much data as possible about the songs I listen to.

Accessing the API and caching

To access the API, I used the ‘spotifyr’ R package. The instructions for setting up a developer account and acquiring access keys from Spotify are detailed in the documentation for ‘spotifyr’.

To make development and reproduction a bit faster, I wrapped all of the ‘spotifyr’ functions I used in memoise::memoise() from the ‘memoise’ package. I also cached the long-running data frame creation steps using the ‘mustashe’ package.
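As a rough illustration of the memoisation pattern described above, the sketch below wraps a slow function so that repeated calls with the same argument are served from an on-disk cache. The function `slow_lookup()` and the cache directory are hypothetical stand-ins, not the actual ‘spotifyr’ calls or paths used in this project.

```r
library(memoise)

# Hypothetical stand-in for a slow API call (e.g. a 'spotifyr' function).
n_calls <- 0
slow_lookup <- function(id) {
  n_calls <<- n_calls + 1 # count how often the real function runs
  Sys.sleep(0.1) # simulate network latency
  paste0("result-for-", id)
}

# Assumed cache location: a fresh temporary directory for this sketch.
cache_dir <- tempfile("spotify-cache-")
slow_lookup_memo <- memoise(slow_lookup, cache = cache_filesystem(cache_dir))

first <- slow_lookup_memo("abc") # runs slow_lookup() and stores the result
second <- slow_lookup_memo("abc") # read from the cache, no second call
```

Because the cache lives on disk, results would also survive an R restart if a permanent directory were used instead of a temporary one.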

Data collection

Track and artist information must be queried using the unique IDs assigned by Spotify. Annoyingly, the personal data I had already downloaded does not include these IDs. Therefore, I instead used the API to collect all of my playlist information, extracted the songs (and their IDs) from the playlists, and used those IDs for the queries. Because of this limitation, I could not access detailed information for every song in my streaming history, but this approach covers most of them, including all of my most popular tracks.

List of my playlists

In the following code, I collected the names and IDs of all of my playlists. A sample of the resultant data frame is presented.

my_playlists <- get_my_playlists_memo(limit = 50) %>%
  as_tibble() %>%
  janitor::clean_names() %>%
  select(name, id, uri, tracks_total)

name	id	uri	tracks_total
Coffee Shop Sounds	3SoFTOMn9GQem4yXRiMsP3	spotify:playlist:3SoFTOMn9GQem4yXRiMsP3	71
do not go gently	4LZeYGsNyZ3pzIIhLkcSHb	spotify:playlist:4LZeYGsNyZ3pzIIhLkcSHb	157
Deep Focus	37i9dQZF1DWZeKCadgRdKQ	spotify:playlist:37i9dQZF1DWZeKCadgRdKQ	190
Coding	08nu8IFkbbXkiaYkofMVwv	spotify:playlist:08nu8IFkbbXkiaYkofMVwv	175
Good Bakcground	5uzutKLSRFtPqFaxjjRO56	spotify:playlist:5uzutKLSRFtPqFaxjjRO56	3

Playlist information

The playlist IDs were then used to get all of the information about the playlists. The data column of the playlist_df data frame contains a data frame for each playlist with the track and album information.

get_playlist_info <- function(playlist_id) {
  get_playlist_memo(playlist_id = playlist_id)$tracks$items %>%
    as_tibble() %>%
    janitor::clean_names() %>%
    select(
      track_name, track_id, track_uri, track_artists, track_duration_ms,
      track_popularity, track_album_name, track_album_id, track_album_uri,
      track_album_release_date, track_album_release_date_precision,
      track_explicit
    )
}

stash("playlist_df", depends_on = c("my_playlists"), {
  playlist_df <- my_playlists %>%
    mutate(data = map(id, get_playlist_info))
})

Track information

Then, I gathered the track names and IDs from each playlist and queried the API to obtain as much information from Spotify as possible. For each track, Spotify provides what it calls “features” and “analysis” data. The “features” tend to be quantitative characteristics of the track, such as danceability, acousticness, and liveness. The “analysis” data decompose each song in time at different scales, such as into beats and segments.
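The gathering step could look roughly like the sketch below, which flattens a nested playlist data frame and keeps each track once even if it appears in several playlists. The toy data and the name `unique_tracks` are made up for illustration; only the column names `name`, `id`, `data`, `track_name`, and `track_id` mirror the data frames shown earlier.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for playlist_df: one row per playlist with a nested data frame.
toy_playlist_df <- tibble(
  name = c("Playlist A", "Playlist B"),
  id = c("p1", "p2"),
  data = list(
    tibble(track_name = c("Song 1", "Song 2"), track_id = c("t1", "t2")),
    tibble(track_name = c("Song 2", "Song 3"), track_id = c("t2", "t3"))
  )
)

# Unnest the per-playlist tracks and drop duplicates by track ID.
unique_tracks <- toy_playlist_df %>%
  select(playlist_name = name, data) %>%
  unnest(data) %>%
  distinct(track_id, .keep_all = TRUE) %>%
  select(track_name, track_id)
```

Deduplicating on `track_id` rather than `track_name` avoids merging different songs that happen to share a title.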

The following code shows the functions that collect and clean the data returned by the API calls.

clean_track_info <- function(x) {
  tibble(
    album_id = x$album$id,
    album_name = x$album$name,
    duration = x$duration_ms,
    explicit = x$explicit,
    popularity = x$popularity,
    album_info = list(x$album),
    release_date = x$album$release_date,
    release_date_precision = x$album$release_date_precision
  )
}

clean_track_features <- function(x) {
  janitor::clean_names(x) %>%
    select(-id, -uri)
}

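clean_track_analysis <- function(x) {
  # Drop the first two elements of the response, then wrap each remaining
  # element in a list so it becomes a list-column of a one-row tibble.
  x[-c(1:2)] %>%
    map(list) %>%
    as_tibble()
}

get_track_data <- function(id) {
  # Combine the three API responses for one track into a single row.
  bind_cols(
    get_track_memo(id = id) %>% clean_track_info(),
    get_track_audio_features_memo(id = id) %>% clean_track_features(),
    get_track_audio_analysis_memo(id = id) %>% clean_track_analysis()
  )
}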

get_track_data <- memoise(get_track_data, cache = cache_filesystem(cache_path))
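To make the list-wrapping step in clean_track_analysis() more concrete, here is a small sketch on a made-up response. The structure of `toy_analysis` is an assumption (two leading metadata elements followed by data frames of time intervals) and is far smaller than a real audio-analysis response.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical, heavily simplified audio-analysis response.
toy_analysis <- list(
  meta = list(status_code = 0),
  track = list(tempo = 120),
  bars = data.frame(start = c(0, 2), duration = c(2, 2)),
  beats = data.frame(start = c(0, 0.5, 1), duration = c(0.5, 0.5, 0.5))
)

# Same steps as clean_track_analysis(): drop the first two elements, wrap
# each remaining element in a list, and build a one-row tibble.
analysis_row <- toy_analysis[-c(1:2)] %>%
  map(list) %>%
  as_tibble()
```

Each remaining element ends up as a list-column holding one data frame, so a whole track fits in a single row that can be bound to the other per-track columns.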

Each track is run through get_track_data() to get and organize all of the data. The resulting data frame is cached for future analysis and some columns are displayed below as examples.
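A minimal sketch of this per-track step is shown below. It uses a hypothetical `fake_get_track_data()` in place of the real get_track_data() so that it runs without API access, and the toy track table is invented for illustration.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for get_track_data(): returns one row per track ID.
fake_get_track_data <- function(id) {
  tibble(duration = nchar(id) * 1000, explicit = FALSE)
}

toy_tracks <- tibble(
  track_name = c("Song 1", "Song 2"),
  track_id = c("t1", "t22")
)

# Query each track once, then expand the nested one-row results into columns.
track_info <- toy_tracks %>%
  mutate(info = map(track_id, fake_get_track_data)) %>%
  unnest(info)
```

With the real, memoised function in place of the stand-in, rerunning this step would mostly read from the cache rather than call the API again.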
