An Analysis of My Entire Spotify Streaming History (2017-2023)

The Problem With Spotify Wrapped

Spotify Wrapped is nice, but it just doesn’t give me enough insight over just how bad my music taste is, and it’s only pre-year. I have a couple of questions I’d like answered:

Just how bad is my album/artist diversity?
How many times have I listened to each album, all the way through?
How much of my listening history is just Taylor Swift?
Do the songs I listen to significantly differ based on what time of day it is?

I recently realized that I had actually learned enough during my undergraduate courses and internships to actually answer all of these questions, so it’s time to dig in.

Getting Your Spotify Data

As of 2023, you can download all of your streaming history through Spotify’s privacy settings. There are three options, Account Data, Extended Streaming History, and Technical Log Information. For this project, we want Extended Streaming History since it will contain the UUIDs of the songs which we will need for the Spotify API. It’ll take up to 30 days to arrive, for me, it took around 3 weeks.

Parsing the Logs

I’m using Rust for this part. Outside of embedded programming and as a C/C++ replacement, I find Rust to be a nice general-purpose langugage with okay library support and very nice compile-time checks.

The Spotify download gives a nice table about what data to expect, a JSON array with each element being an object of the following form:

Technical field	Contains
ts	This field is a timestamp indicating when the track stopped playing in UTC (Coordinated Universal Time). The order is year, month and day followed by a timestamp in military time
username	This field is your Spotify username.
platform	This field is the platform used when streaming the track (e.g. Android OS, Google Chromecast). ms_played This field is the number of milliseconds the stream was played.
conn_country	This field is the country code of the country where the stream was played (e.g. SE - Sweden).
ip_addr_decrypted	This field contains the IP address logged when streaming the track.
user_agent_decrypted	This field contains the user agent used when streaming the track (e.g. a browser, like Mozilla Firefox, or Safari)
master_metadata_track_name	This field is the name of the track.
master_metadata_album_artist_name	This field is the name of the artist, band or podcast.
master_metadata_album_album_name	This field is the name of the album of the track.
spotify_track_uri	A Spotify URI, uniquely identifying the track in the form of “spotify:track:<base-62 string>” A Spotify URI is a resource identifier that you can enter, for example, in the Spotify Desktop client’s search box to locate an artist, album, or track.
episode_name	This field contains the name of the episode of the podcast.
episode_show_name	This field contains the name of the show of the podcast.
spotify_episode_uri	A Spotify Episode URI, uniquely identifying the podcast episode in the form of “spotify:episode:<base-62 string>” A Spotify Episode URI is a resource identifier that you can enter, for example, in the Spotify Desktop client’s search box to locate an episode of a podcast.
reason_start	This field is a value telling why the track started (e.g. “trackdone”)
reason_end	This field is a value telling why the track ended (e.g. “endplay”).
shuffle	This field has the value True or False depending on if shuffle mode was used when playing the track.
skipped	This field indicates if the user skipped to the next song
offline	This field indicates whether the track was played in offline mode (“True”) or not (“False”).
offline_timestamp	This field is a timestamp of when offline mode was used, if used.
incognito_mode	This field indicates whether the track was played in incognito mode (“True”) or not (“Falsets This field is a timestamp indicating when the track stopped playing in UTC (Coordinated Universal Time). The order is year, month and day followed by a timestamp in military time

So most of these fields aren’t too useful, but just in case we need any later, let’s put all of this data into our database. I’m using a simple MySQL instance locally and the sqlx library in Rust to execute SQL queries. But first, to parse the JSON, the natural choice is serde_json in Rust. Unfortunately, which fields are nullable isn’t very well documented, but a little trial and error, we arrive at this Rust struct for deserialization:

#[derive(Debug, Serialize, Deserialize)]
struct StreamingData {
    ts: String,
    username: String,
    platform: String,
    ms_played: i32,
    conn_country: String,
    ip_addr_decrypted: Option<String>,
    user_agent_decrypted: Option<String>,
    master_metadata_track_name: Option<String>,
    master_metadata_album_artist_name: Option<String>,
    master_metadata_album_album_name: Option<String>,
    spotify_track_uri: Option<String>,
    episode_name: Option<String>,
    episode_show_name: Option<String>,
    spotify_episode_uri: Option<String>,
    reason_start: String,
    reason_end: Option<String>,
    shuffle: Option<bool>,
    skipped: Option<bool>,
    offline: Option<bool>,
    offline_timestamp: Option<i64>,
    incognito_mode: Option<bool>
}

with the corresponding SQL table DDL:

CREATE TABLE streams (
    ts TIMESTAMP NOT NULL,
    username VARCHAR(255) NOT NULL,
    platform VARCHAR(255) NOT NULL,
    ms_played INT NOT NULL,
    conn_country VARCHAR(255) NOT NULL,
    ip_addr_decrypted VARCHAR(255),
    user_agent_decrypted VARCHAR(255),
    master_metadata_track_name VARCHAR(255),
    master_metadata_album_artist_name VARCHAR(255),
    master_metadata_album_album_name VARCHAR(255),
    spotify_track_uri VARCHAR(255),
    episode_name VARCHAR(255),
    episode_show_name VARCHAR(255),
    spotify_episode_uri VARCHAR(255),
    reason_start VARCHAR(255) NOT NULL,
    reason_end VARCHAR(255),
    shuffle BOOLEAN,
    skipped BOOLEAN,
    offline BOOLEAN,
    offline_timestamp BIGINT,
    incognito_mode BOOLEAN
);

With some bulk inserts, we’re able to load the data in under a minute, all 112,526 rows. Time to play around a little bit of the data.

Exploratory Data Analysis

Let’s first look at my top 10 streamed songs of all time:

SELECT
	master_metadata_track_name,
	master_metadata_album_artist_name,
	SUM(ms_played) / 3600000 AS hours_played
FROM streams
WHERE spotify_track_uri IS NOT NULL
GROUP BY spotify_track_uri, 1, 2
ORDER BY 3 DESC
LIMIT 10

master_metadata_track_name	master_metadata_album_artist_name	hours_played
On Some Emo Shit	blink-182	27.7208
Beautiful Days Piano	Masafumi Takada	25.0852
No Capes	Waterparks	23.1735
Spring Day	BTS	21.9136
Roses	The Chainsmokers	21.6337
Rhapsody in Blue	George Gershwin	19.9754
Reset (feat. Jinsil)	Tiger JK	19.7157
Hawaii (Stay Awake)	Waterparks	19.5447
It’s Time	Imagine Dragons	18.6422
whatever it takes	convolk	17.9429

Concerningly, if we don’t group by spotify_track_uri, we get a different result:

master_metadata_track_name	master_metadata_album_artist_name	hours_played
Spring Day	BTS	30.9958
On Some Emo Shit	blink-182	27.7208
Beautiful Days Piano	Masafumi Takada	25.0852
whatever it takes	convolk	24.5301
Sanctuary	Joji	24.0277
No Capes	Waterparks	23.1735
KIDS ON MOLLY	Aries	21.8589
Rhapsody in Blue	George Gershwin	21.7469
Roses	The Chainsmokers	21.6337
Reset (feat. Jinsil)	Tiger JK	19.7157

We see that Spring Day by BTS grows by over 9 hours. This is probably due to the fact that the same song has different tracks on Spotify. Let’s see all the different versions of Spring Day:

SELECT
	spotify_track_uri,
	master_metadata_album_album_name,
	SUM(ms_played) / 3600000 AS hours_played
FROM streams
WHERE master_metadata_track_name = 'Spring Day'
  AND master_metadata_album_artist_name = 'BTS'
GROUP BY 1, 2;

spotify_track_uri	master_metadata_album_album_name	hours_played
spotify:track:02q0ZnV2L4XByzEvWZJqBC	YOU NEVER WALK ALONE	8.0562
spotify:track:0WNGsQ1oAuHzNTk8jivBKW	You Never Walk Alone	21.9136
spotify:track:2j1fFjWHCI9KJSwcuYAOyF	You Never Walk Alone	1.0259

This confirms our suspicions, we’re going to have to treat two songs as the same if they have the same title and artist, ignoring case, and even so, a couple of the same songs might be treated as different, and some different songs treated as the same if an aritst decides to name two different songs the same name, but there’s not much we can do.

Let’s also make sure not too much of our data is unusable, which is, what percent of streams by time has a NULL ID?

SELECT
( SELECT SUM(ms_played) WHERE spotify_track_uri IS NULL ) /
( SELECT SUM(ms_played) ) * 100
AS percent_null;

percent_null
0.57201341

0.57% is not too bad. Let’s move onto a more interesting question. I consider myself to be primarily an album listener, so what album have I listend to the most? I don’t want to be skewed by listening to a single song of the album, so let’s define the playtime of an album as the minimum playtime of any song in that album if all songs were listened to in that album, zero otherwise.

To once again avoid issues with the same album being identified with different IDs within Spotify, lets define an album uniquely defined as its album name and artist name, case insensitive.

This is the point where we need more information, since album info isn’t available in our Spotify download. It is found in the API, which we can query with our track IDs. We have 11,091 distinct track IDs to query, and the API allows bulk queries of up to 50 IDs each. The exact rate limit for the API isn’t known but we should be able to get away with 2 queries a second, so this should only take two minutes or so.

We have to do a lot of insertions into our database, so let’s factor out the logic in a helper function:

async fn bulk_insert<'a, T, F>(pool: &sqlx::Pool<sqlx::MySql>, table: &str, elements: &'a [T], row_closure: F)
where F: Fn(Separated<'_, 'a, MySql, &str>, &'a T) {
    const SQL_BATCH_SIZE: usize = 1000;

    for chunk in elements.chunks(SQL_BATCH_SIZE) {
        let mut query_builder: QueryBuilder<MySql> = QueryBuilder::new(format!("INSERT INTO {} ", table));
        query_builder.push_values(chunk, &row_closure);
        query_builder.build().execute(pool).await.unwrap();
    }
}

There’s a fun little Rust lifetime puzzle we had to solve in the function parameters to ensure that the query builder outlives the items being added to the query in the closure passed to this function.

In addition, we’re going to be doing a lot of API requests, so let’s factor out the behavior we want to exhibit based on the return code received into another function:

async fn spotify_request(token: &mut String, url: reqwest::Url) -> String {
    let mut backoff = 1;

    loop {
        let request = reqwest::Client::new().get(url.clone()).bearer_auth(&token);
        let request_copy = request.try_clone().unwrap();
        let response = request_copy.send().await.unwrap();
        let status = response.status();

        if !matches!(status, reqwest::StatusCode::TOO_MANY_REQUESTS) {
            backoff = 1;
        }

        match status {
            reqwest::StatusCode::UNAUTHORIZED => {
                println!("Invalid token! Getting a new one...");
                *token = get_new_token().await;
            }
            reqwest::StatusCode::TOO_MANY_REQUESTS => {
                let retry_after = response.headers().get("Retry-After");
                let timeout = match retry_after {
                    // Use the value of retry_after given to us from Spotify if it exists
                    Some(value) => {
                        value.to_str().unwrap().to_owned().parse().unwrap()
                    }
                    None => backoff
                };
                println!("Too many requests! Backing off {timeout} second(s)...");
                tokio::time::sleep(std::time::Duration::from_secs(timeout)).await;
                backoff *= 2;
            }
            reqwest::StatusCode::OK => {
                return response.text().await.unwrap();
            }
            _ => {
                panic!("Unhandled return code {}!", status);
            }
        }
    }
}

The full code is available on my GitHub, I won’t include most of it here since it’s a lot of boilerplate.

Finally, with all the data we need, let’s write a query to figure out my top albums, by the number of minimum playthroughs of a song in the album, with the requirement that all songs in the album have been played.

First, as discussed, we identify a song uniquely by the key (track name, artist name), which gives us multiple track ids. A little experimentation showed that all of these track ids for a given key (track name, artist name) pair yield the same album id, so we can pick one arbitarily, such as the first one that appears.

Next, we need to divide to aggregate all the time played for each key and divide it by the length of track. We then use a window function to count how many distinct songs for each album show up.

Finally, filtering on the albums where the number of distinct songs that show up is equal to the number of tracks in the album, we use aggregation to find the minimum times played for each album id.

Our resulting query is

WITH cte1 AS (
    SELECT
        master_metadata_track_name AS track_name,
        master_metadata_album_artist_name AS artist_name,
        JSON_UNQUOTE(JSON_EXTRACT(JSON_ARRAYAGG(spotify_track_uri), '$[0]')) AS track_uri,
        SUM(ms_played) AS total_ms_played
    FROM streams
    GROUP BY 1, 2
    HAVING total_ms_played > 0
),
cte2 AS (
    SELECT
        cte1.track_name,
        cte1.artist_name,
        cte1.total_ms_played / tracks.duration_ms AS times_played,
        albums.id AS album_id,
        albums.name AS album_name,
        albums.total_tracks,
        COUNT(track_name) OVER (PARTITION BY albums.id) AS played_tracks
    FROM cte1 JOIN tracks ON SUBSTRING(cte1.track_uri, 15) = tracks.id
              JOIN albums ON tracks.album_id = albums.id
    WHERE tracks.duration_ms > 0
)
SELECT DISTINCT
    album_name,
    artist_name,
    total_tracks,
    MIN(times_played) AS times_played
FROM cte2
WHERE played_tracks = total_tracks AND total_tracks >= 5
GROUP BY album_id, 1, 2, 3
ORDER BY 4 DESC
LIMIT 20;

and our result is

album_name	artist_name	total_tracks	times_played
WELCOME HOME	Aries	9	94.5799
BLOODLUST	nothing,nowhere.	6	87.5505
Take Off Your Pants And Jacket	blink-182	13	85.5395
Double Dare	Waterparks	13	61.7403
deadroses	blackbear	10	58.3813
Cluster	Waterparks	5	36.1772
NINE	blink-182	15	35.8173
Boys Like Girls	BOYS LIKE GIRLS	12	35.0078
help	blackbear	10	29.0632
5 Seconds Of Summer	5 Seconds of Summer	16	28.6338
BALLADS 1	Joji	12	26.2355
Heart Flip	This Wild Life	8	25.7468
Entertainment	Waterparks	10	24.3553
X&Y	Coldplay	13	16.1856
Songs About Jane	Maroon 5	12	14.8786
÷ (Deluxe)	Ed Sheeran	16	14.0803
Greatest Hits	Waterparks	17	13.8155
Last Young Renegade	All Time Low	10	12.6903
Black Light	Waterparks	6	12.5680
Astoria	Marianas Trench	17	9.1888

That’s it for now! In the next part, we’ll be trying to see if there’s a statistically significant link between things such as time of day or day of week and the mood of songs, which is something the Spotify API provides. We’ll also be developing a standalone application so people can see their own stats.