The Problem With Spotify Wrapped

Spotify Wrapped is nice, but it just doesn’t give me enough insight over just how bad my music taste is, and it’s only pre-year. I have a couple of questions I’d like answered:

  • Just how bad is my album/artist diversity?
  • How many times have I listened to each album, all the way through?
  • How much of my listening history is just Taylor Swift?
  • Do the songs I listen to significantly differ based on what time of day it is?

I recently realized that I had actually learned enough during my undergraduate courses and internships to actually answer all of these questions, so it’s time to dig in.

Getting Your Spotify Data

As of 2023, you can download all of your streaming history through Spotify’s privacy settings. There are three options, Account Data, Extended Streaming History, and Technical Log Information. For this project, we want Extended Streaming History since it will contain the UUIDs of the songs which we will need for the Spotify API. It’ll take up to 30 days to arrive, for me, it took around 3 weeks.

Parsing the Logs

I’m using Rust for this part. Outside of embedded programming and as a C/C++ replacement, I find Rust to be a nice general-purpose langugage with okay library support and very nice compile-time checks.

The Spotify download gives a nice table about what data to expect, a JSON array with each element being an object of the following form:

Technical fieldContains
ts This field is a timestamp indicating when the track stopped playing in UTC (Coordinated Universal Time). The order is year, month and day followed by a timestamp in military time
username This field is your Spotify username.
platform This field is the platform used when streaming the track (e.g. Android OS, Google Chromecast). ms_played This field is the number of milliseconds the stream was played.
conn_country This field is the country code of the country where the stream was played (e.g. SE - Sweden).
ip_addr_decrypted This field contains the IP address logged when streaming the track.
user_agent_decrypted This field contains the user agent used when streaming the track (e.g. a browser, like Mozilla Firefox, or Safari)
master_metadata_track_name This field is the name of the track.
master_metadata_album_artist_name This field is the name of the artist, band or podcast.
master_metadata_album_album_name This field is the name of the album of the track.
spotify_track_uri A Spotify URI, uniquely identifying the track in the form of “spotify:track:<base-62 string>” A Spotify URI is a resource identifier that you can enter, for example, in the Spotify Desktop client’s search box to locate an artist, album, or track.
episode_name This field contains the name of the episode of the podcast.
episode_show_name This field contains the name of the show of the podcast.
spotify_episode_uri A Spotify Episode URI, uniquely identifying the podcast episode in the form of “spotify:episode:<base-62 string>” A Spotify Episode URI is a resource identifier that you can enter, for example, in the Spotify Desktop client’s search box to locate an episode of a podcast.
reason_start This field is a value telling why the track started (e.g. “trackdone”)
reason_end This field is a value telling why the track ended (e.g. “endplay”).
shuffle This field has the value True or False depending on if shuffle mode was used when playing the track.
skipped This field indicates if the user skipped to the next song
offline This field indicates whether the track was played in offline mode (“True”) or not (“False”).
offline_timestamp This field is a timestamp of when offline mode was used, if used.
incognito_mode This field indicates whether the track was played in incognito mode (“True”) or not (“Falsets This field is a timestamp indicating when the track stopped playing in UTC (Coordinated Universal Time). The order is year, month and day followed by a timestamp in military time

So most of these fields aren’t too useful, but just in case we need any later, let’s put all of this data into our database. I’m using a simple MySQL instance locally and the sqlx library in Rust to execute SQL queries. But first, to parse the JSON, the natural choice is serde_json in Rust. Unfortunately, which fields are nullable isn’t very well documented, but a little trial and error, we arrive at this Rust struct for deserialization:

#[derive(Debug, Serialize, Deserialize)]
struct StreamingData {
    ts: String,
    username: String,
    platform: String,
    ms_played: i32,
    conn_country: String,
    ip_addr_decrypted: Option<String>,
    user_agent_decrypted: Option<String>,
    master_metadata_track_name: Option<String>,
    master_metadata_album_artist_name: Option<String>,
    master_metadata_album_album_name: Option<String>,
    spotify_track_uri: Option<String>,
    episode_name: Option<String>,
    episode_show_name: Option<String>,
    spotify_episode_uri: Option<String>,
    reason_start: String,
    reason_end: Option<String>,
    shuffle: Option<bool>,
    skipped: Option<bool>,
    offline: Option<bool>,
    offline_timestamp: Option<i64>,
    incognito_mode: Option<bool>
}

with the corresponding SQL table DDL:

CREATE TABLE streams (
    ts TIMESTAMP NOT NULL,
    username VARCHAR(255) NOT NULL,
    platform VARCHAR(255) NOT NULL,
    ms_played INT NOT NULL,
    conn_country VARCHAR(255) NOT NULL,
    ip_addr_decrypted VARCHAR(255),
    user_agent_decrypted VARCHAR(255),
    master_metadata_track_name VARCHAR(255),
    master_metadata_album_artist_name VARCHAR(255),
    master_metadata_album_album_name VARCHAR(255),
    spotify_track_uri VARCHAR(255),
    episode_name VARCHAR(255),
    episode_show_name VARCHAR(255),
    spotify_episode_uri VARCHAR(255),
    reason_start VARCHAR(255) NOT NULL,
    reason_end VARCHAR(255),
    shuffle BOOLEAN,
    skipped BOOLEAN,
    offline BOOLEAN,
    offline_timestamp BIGINT,
    incognito_mode BOOLEAN
);

With some bulk inserts, we’re able to load the data in under a minute, all 112,526 rows. Time to play around a little bit of the data.

Exploratory Data Analysis

Let’s first look at my top 10 streamed songs of all time:

SELECT
	master_metadata_track_name,
	master_metadata_album_artist_name,
	SUM(ms_played) / 3600000 AS hours_played
FROM streams
WHERE spotify_track_uri IS NOT NULL
GROUP BY spotify_track_uri, 1, 2
ORDER BY 3 DESC
LIMIT 10
master_metadata_track_name master_metadata_album_artist_name hours_played
On Some Emo Shit blink-182 27.7208
Beautiful Days Piano Masafumi Takada 25.0852
No Capes Waterparks 23.1735
Spring Day BTS 21.9136
Roses The Chainsmokers 21.6337
Rhapsody in Blue George Gershwin 19.9754
Reset (feat. Jinsil) Tiger JK 19.7157
Hawaii (Stay Awake) Waterparks 19.5447
It’s Time Imagine Dragons 18.6422
whatever it takes convolk 17.9429

Concerningly, if we don’t group by spotify_track_uri, we get a different result:

master_metadata_track_name master_metadata_album_artist_name hours_played
Spring Day BTS 30.9958
On Some Emo Shit blink-182 27.7208
Beautiful Days Piano Masafumi Takada 25.0852
whatever it takes convolk 24.5301
Sanctuary Joji 24.0277
No Capes Waterparks 23.1735
KIDS ON MOLLY Aries 21.8589
Rhapsody in Blue George Gershwin 21.7469
Roses The Chainsmokers 21.6337
Reset (feat. Jinsil) Tiger JK 19.7157

We see that Spring Day by BTS grows by over 9 hours. This is probably due to the fact that the same song has different tracks on Spotify. Let’s see all the different versions of Spring Day:

SELECT
	spotify_track_uri,
	master_metadata_album_album_name,
	SUM(ms_played) / 3600000 AS hours_played
FROM streams
WHERE master_metadata_track_name = 'Spring Day'
  AND master_metadata_album_artist_name = 'BTS'
GROUP BY 1, 2;
spotify_track_uri master_metadata_album_album_name hours_played
spotify:track:02q0ZnV2L4XByzEvWZJqBC YOU NEVER WALK ALONE 8.0562
spotify:track:0WNGsQ1oAuHzNTk8jivBKW You Never Walk Alone 21.9136
spotify:track:2j1fFjWHCI9KJSwcuYAOyF You Never Walk Alone 1.0259

This confirms our suspicions, we’re going to have to treat two songs as the same if they have the same title and artist, ignoring case, and even so, a couple of the same songs might be treated as different, and some different songs treated as the same if an aritst decides to name two different songs the same name, but there’s not much we can do.

Let’s also make sure not too much of our data is unusable, which is, what percent of streams by time has a NULL ID?

SELECT
( SELECT SUM(ms_played) WHERE spotify_track_uri IS NULL ) /
( SELECT SUM(ms_played) ) * 100
AS percent_null;
percent_null
0.57201341

0.57% is not too bad. Let’s move onto a more interesting question. I consider myself to be primarily an album listener, so what album have I listend to the most? I don’t want to be skewed by listening to a single song of the album, so let’s define the playtime of an album as the minimum playtime of any song in that album if all songs were listened to in that album, zero otherwise.

To once again avoid issues with the same album being identified with different IDs within Spotify, lets define an album uniquely defined as its album name and artist name, case insensitive.

This is the point where we need more information, since album info isn’t available in our Spotify download. It is found in the API, which we can query with our track IDs. We have 11,091 distinct track IDs to query, and the API allows bulk queries of up to 50 IDs each. The exact rate limit for the API isn’t known but we should be able to get away with 2 queries a second, so this should only take two minutes or so.

We have to do a lot of insertions into our database, so let’s factor out the logic in a helper function:

async fn bulk_insert<'a, T, F>(pool: &sqlx::Pool<sqlx::MySql>, table: &str, elements: &'a [T], row_closure: F)
where F: Fn(Separated<'_, 'a, MySql, &str>, &'a T) {
    const SQL_BATCH_SIZE: usize = 1000;

    for chunk in elements.chunks(SQL_BATCH_SIZE) {
        let mut query_builder: QueryBuilder<MySql> = QueryBuilder::new(format!("INSERT INTO {} ", table));
        query_builder.push_values(chunk, &row_closure);
        query_builder.build().execute(pool).await.unwrap();
    }
}

There’s a fun little Rust lifetime puzzle we had to solve in the function parameters to ensure that the query builder outlives the items being added to the query in the closure passed to this function.

In addition, we’re going to be doing a lot of API requests, so let’s factor out the behavior we want to exhibit based on the return code received into another function:

async fn spotify_request(token: &mut String, url: reqwest::Url) -> String {
    let mut backoff = 1;

    loop {
        let request = reqwest::Client::new().get(url.clone()).bearer_auth(&token);
        let request_copy = request.try_clone().unwrap();
        let response = request_copy.send().await.unwrap();
        let status = response.status();

        if !matches!(status, reqwest::StatusCode::TOO_MANY_REQUESTS) {
            backoff = 1;
        }

        match status {
            reqwest::StatusCode::UNAUTHORIZED => {
                println!("Invalid token! Getting a new one...");
                *token = get_new_token().await;
            }
            reqwest::StatusCode::TOO_MANY_REQUESTS => {
                let retry_after = response.headers().get("Retry-After");
                let timeout = match retry_after {
                    // Use the value of retry_after given to us from Spotify if it exists
                    Some(value) => {
                        value.to_str().unwrap().to_owned().parse().unwrap()
                    }
                    None => backoff
                };
                println!("Too many requests! Backing off {timeout} second(s)...");
                tokio::time::sleep(std::time::Duration::from_secs(timeout)).await;
                backoff *= 2;
            }
            reqwest::StatusCode::OK => {
                return response.text().await.unwrap();
            }
            _ => {
                panic!("Unhandled return code {}!", status);
            }
        }
    }
}

The full code is available on my GitHub, I won’t include most of it here since it’s a lot of boilerplate.

Finally, with all the data we need, let’s write a query to figure out my top albums, by the number of minimum playthroughs of a song in the album, with the requirement that all songs in the album have been played.

First, as discussed, we identify a song uniquely by the key (track name, artist name), which gives us multiple track ids. A little experimentation showed that all of these track ids for a given key (track name, artist name) pair yield the same album id, so we can pick one arbitarily, such as the first one that appears.

Next, we need to divide to aggregate all the time played for each key and divide it by the length of track. We then use a window function to count how many distinct songs for each album show up.

Finally, filtering on the albums where the number of distinct songs that show up is equal to the number of tracks in the album, we use aggregation to find the minimum times played for each album id.

Our resulting query is

WITH cte1 AS (
    SELECT
        master_metadata_track_name AS track_name,
        master_metadata_album_artist_name AS artist_name,
        JSON_UNQUOTE(JSON_EXTRACT(JSON_ARRAYAGG(spotify_track_uri), '$[0]')) AS track_uri,
        SUM(ms_played) AS total_ms_played
    FROM streams
    GROUP BY 1, 2
    HAVING total_ms_played > 0
),
cte2 AS (
    SELECT
        cte1.track_name,
        cte1.artist_name,
        cte1.total_ms_played / tracks.duration_ms AS times_played,
        albums.id AS album_id,
        albums.name AS album_name,
        albums.total_tracks,
        COUNT(track_name) OVER (PARTITION BY albums.id) AS played_tracks
    FROM cte1 JOIN tracks ON SUBSTRING(cte1.track_uri, 15) = tracks.id
              JOIN albums ON tracks.album_id = albums.id
    WHERE tracks.duration_ms > 0
)
SELECT DISTINCT
    album_name,
    artist_name,
    total_tracks,
    MIN(times_played) AS times_played
FROM cte2
WHERE played_tracks = total_tracks AND total_tracks >= 5
GROUP BY album_id, 1, 2, 3
ORDER BY 4 DESC
LIMIT 20;

and our result is

album_name artist_name total_tracks times_played
WELCOME HOME Aries 9 94.5799
BLOODLUST nothing,nowhere. 6 87.5505
Take Off Your Pants And Jacket blink-182 13 85.5395
Double Dare Waterparks 13 61.7403
deadroses blackbear 10 58.3813
Cluster Waterparks 5 36.1772
NINE blink-182 15 35.8173
Boys Like Girls BOYS LIKE GIRLS 12 35.0078
help blackbear 10 29.0632
5 Seconds Of Summer 5 Seconds of Summer 16 28.6338
BALLADS 1 Joji 12 26.2355
Heart Flip This Wild Life 8 25.7468
Entertainment Waterparks 10 24.3553
X&Y Coldplay 13 16.1856
Songs About Jane Maroon 5 12 14.8786
÷ (Deluxe) Ed Sheeran 16 14.0803
Greatest Hits Waterparks 17 13.8155
Last Young Renegade All Time Low 10 12.6903
Black Light Waterparks 6 12.5680
Astoria Marianas Trench 17 9.1888

That’s it for now! In the next part, we’ll be trying to see if there’s a statistically significant link between things such as time of day or day of week and the mood of songs, which is something the Spotify API provides. We’ll also be developing a standalone application so people can see their own stats.