If you know me, even a little bit, you’ll know that music is a big part of my life. I’ve played musical instruments since childhood and pride myself on my eclectic tastes. Needless to say, my choice of streaming service is something I take pretty seriously.
Since college I’ve slowly been transitioning from the many gigabytes of MP3s stored on my computer to a more “modern” approach to listening. My first foray into streaming services started with Pandora, which I honestly loved for a long time. It fit very well with my hands-off approach to listening to music and introduced me to artists I’d never listened to before.
Unfortunately, after listening to Pandora for 10+ years, I started noticing a lot of the same songs popping up. I couldn’t figure out whether their music library had stopped changing and expanding or the algorithm had become “stuck” in a rut. In any case, I decided to start exploring other streaming services to see what was out there.
I started with Spotify and tried to use it exactly like I did with Pandora. I was like, “wait a second, I have to find a playlist I want to listen to or create my own… Next!” I quickly rifled through pretty much all of the streaming services, but each of them had a quirk or two that just didn’t sit right with me.
I was up to my gills in free trials, but felt like nothing was going to actually work. I thought to myself, “maybe I should crawl back to Pandora and beg for them to take me back”. That was when I noticed a shimmer of hope. Spotify had an API! (Here’s the wiki on APIs, just in case)
After poking around the Spotify API, my hypothesis started to form.
Accomplishing this, a Pandora-style artist radio built on Spotify’s data, would allow me to take advantage of Spotify’s huge music library and prevent me from having to sift through playlists or suffer through Spotify’s half-witted attempt at an artist radio.
I told you I took my music seriously…
Let’s dig in.
Discovering Spotify’s Data
After an initial look at the data available, there were two areas that looked most promising: related artists and genres. Now, this might not sound like a lot, but as I’ll try to show, just these two pieces of information can create something pretty compelling.
But first, it might make sense to quickly talk about how Pandora’s radio works.
What Does Pandora Do?
Pandora is driven by something called the Music Genome Project. This is a project that uses around 450 musical attributes to distinguish, group, and recommend songs to Pandora’s users. Each song is broken down, not by genre or era, but by how the song sounds. Is the song slow? What is the time signature? Is there singing or is it just instrumental? Is the song happy or sad (major or minor key)? These are just some examples of what may be included in the 450 musical attributes.
It takes quite a bit of doing to calculate all of these audio features, but I think we can skip over all of this and just use the genre and related artist information to create essentially the same experience.
Related Artists
Spotify provides a list of up to 20 related artists that they call “Fans Also Like”. Given a single artist, Mumford & Sons for example, the 20 related artists listed are artists that Spotify users listen to frequently when they listen to Mumford. They add a few other things to arrive at this list, but that is the essence of it. Here’s what it looks like for Mumford:
Genres
In addition to related artists, Spotify provides genre details for each of its artists. This is a bit more complex than you’d think because Spotify uses genre like a tag. This means that a single artist typically is tagged with multiple genres instead of just one.
This simple “multi-genre” concept allows for quite a bit of information to be transferred about how similar certain artists are to others. We’ll take advantage of this later! For now try to take in the enormity of all the different genres and how they are related to each other.
Developing the Algorithm
Now that I’d gained an understanding of the data available, it was time to see if I could get this data to sing! I started with the related artist data because that seemed like a pretty close solution to what I was after: an artist-based “radio station” that played songs from that artist and other similar artists.
Leveraging the Related Artist Information
Only 20 related artists, like in the above example with Mumford & Sons, wasn’t enough to create a true radio-like experience. I also speculated that there were other artists that were closely related to the “seed” artist that weren’t identified by this initial list. So, naturally I wanted to expand the tree to include each of the related artists’ related artists. If you do this 2 or 3 times, you end up with thousands of artists to choose from (4,201 in our example), all at least loosely related to the original artist.
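To make the expansion step concrete, here is a minimal sketch of a breadth-first walk over related artists. It is not the code behind this post: `get_related` is a hypothetical stand-in for whatever call returns an artist's related list, and the toy dictionary below is made-up data rather than real Spotify output.

```python
from collections import deque

def expand_related(seed, get_related, rounds=2):
    """Expand the related-artist tree `rounds` times, starting from `seed`.

    `get_related` is a hypothetical callable: artist -> iterable of artists.
    Returns (edges, seen): the related lists that were fetched, and every
    artist discovered along the way.
    """
    edges = {}
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        artist, level = frontier.popleft()
        if level >= rounds:
            continue  # leaf node: discovered, but not expanded further
        related = list(get_related(artist))
        edges[artist] = related
        for other in related:
            if other not in seen:
                seen.add(other)
                frontier.append((other, level + 1))
    return edges, seen

# Made-up toy data standing in for API responses.
TOY_RELATED = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["D", "E"],
    "D": ["F"],
}

edges, seen = expand_related("A", lambda a: TOY_RELATED.get(a, []), rounds=2)
```

With two rounds the walk fetches related lists for the seed and its direct neighbors only, so "F" is never discovered here. Tracking `seen` keeps each artist from being fetched more than once.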
The real magic comes when you allow the tree to be bi-directional, instead of unidirectional. This insight allows artists from the furthest point in the tree (a leaf node) to connect with artists in previous layers. Bringing it together, you can collect a list of the most similar artists based on how strong their connection is to the original artist.
Arranging this data in a spectral layout gives an interesting way to arrive at a “physical” placement of artists in relation to one another. The ones that are closer to Mumford & Sons will tend to be more similar than those further away, based on the graph that we created above. Unfortunately, the spectral layout doesn’t produce a particularly cool-looking set of points (it’s more like a jumble roughly centered around Mumford). So instead, I decided to show you below the proximity of other artists to Mumford that resulted from this layout technique.
If you’re interested, I used NetworkX’s spectral layout to compute the positions of each artist in the graph and then used a simple nearest neighbors algorithm to find the points that were closest to the seed artist.
With this in hand, we’ve got a great start, but it’s not enough to create a satisfactory artist radio replica. So I turned to another technique that I like called Jaccard similarity, which divides the number of items two sets share by the number of items in their union. Using the same related artist data, we can look at the 20 artists Mumford is related to and compare that to the 20 artists any other artist is related to. Here’s an example comparing Mumford & Sons to The Lumineers to help (the closer to 1 the more similar the artists are).
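In code, the Jaccard comparison is only a few lines. This is a small sketch using placeholder artist names, not the actual related lists for Mumford & Sons or The Lumineers. Returning 0.0 when both sets are empty is my own choice for the edge case.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty lists count as "not similar"
    return len(a & b) / len(a | b)

# Placeholder names; real related lists would hold up to 20 artists each.
seed_related = {"artist_1", "artist_2", "artist_3", "artist_4"}
other_related = {"artist_3", "artist_4", "artist_5", "artist_6"}

score = jaccard(seed_related, other_related)  # 2 shared out of 6 distinct
```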
This approach provides a slightly different set of results, but alone still doesn’t feel complete. It was time to delve into the genre data and see what I could find.
Leveraging the Genre Information
Using the expanded tree of related artists from the previous steps, I was able to obtain a list of about 450 unique genres. Since each artist typically has multiple genres, you can start to build an association between similar genres based on how often they appear together. For example, if we observe that the genres “indie folk” and “stomp and holler” show up together for 1,351 different artists, but we only see “indie folk” and “new orleans blues” show up together for one artist, we can safely conclude that “stomp and holler” is much more closely related to “indie folk” than “new orleans blues” is. Of course, it’s not a perfect solution. But this approach gives us great insight into what’s going on with all of the different genres we see.
Leveraging NetworkX again, I created a force-directed graph of all of the genre connections, using the number of times each genre pair showed up for a single artist as the weight or strength of the connection. This provides a two-dimensional arrangement of all of the genres that I can query to find how similar two artists are based on their genres.
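Counting how often genres appear together can be sketched with the standard library. The artists and their genre tags below are toy data for illustration, and the function name is my own.

```python
from collections import Counter
from itertools import combinations

def genre_cooccurrence(artist_genres):
    """Count, for every pair of genres, how many artists carry both tags."""
    counts = Counter()
    for genres in artist_genres.values():
        # Sort so each pair has one canonical order, e.g. ("a", "b").
        for pair in combinations(sorted(set(genres)), 2):
            counts[pair] += 1
    return counts

# Toy data: placeholder artists with illustrative genre tags.
toy_artist_genres = {
    "artist_1": ["indie folk", "stomp and holler"],
    "artist_2": ["indie folk", "stomp and holler", "modern folk rock"],
    "artist_3": ["indie folk", "new orleans blues"],
}

pair_counts = genre_cooccurrence(toy_artist_genres)
```

Each count can then serve as the weight of the edge between two genres when building a graph.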
In the case of Mumford & Sons, they have the following genres:
- Modern Folk Rock
- Stomp and Holler
- UK Americana
For each genre of another “test” artist, we can find which of the above genres is closest to it in the layout and record that distance. Doing this for every genre of the “test” artist and averaging the distances gives a single score, where a lower score means the artist is more similar. Doing this for all of the candidate artists produces a complete list of scores that we can use to determine artist similarity. Interestingly, this alone does not produce a great artist radio, but it brings another unique approach to the table.
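Here is a rough sketch of that scoring step. It assumes the layout is already available as a dictionary mapping each genre to 2D coordinates. The coordinates and the “indie folk” test artist are invented for the example, and skipping genres that are missing from the layout is my own simplification.

```python
import math

def genre_score(seed_genres, test_genres, positions):
    """Average distance from each test genre to its nearest seed genre.

    `positions` maps genre -> (x, y). Lower scores mean more similar.
    Returns None when nothing can be compared.
    """
    seed_points = [positions[g] for g in seed_genres if g in positions]
    if not seed_points:
        return None
    nearest = [
        min(math.dist(positions[g], p) for p in seed_points)
        for g in test_genres
        if g in positions  # assumption: ignore genres not in the layout
    ]
    return sum(nearest) / len(nearest) if nearest else None

# Invented coordinates, standing in for a real force-directed layout.
positions = {
    "modern folk rock": (0.0, 0.0),
    "stomp and holler": (1.0, 0.0),
    "uk americana": (0.0, 1.0),
    "indie folk": (1.0, 1.0),
    "new orleans blues": (4.0, 0.0),
}
seed = ["modern folk rock", "stomp and holler", "uk americana"]

close_score = genre_score(seed, ["indie folk", "stomp and holler"], positions)
far_score = genre_score(seed, ["new orleans blues"], positions)
```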
We need to add a finishing touch to make all of this data work together!
Bringing It All Together
At this point, I’ve got three different approaches that give me a list of artists that are all similar to Mumford & Sons. Each approach seems to work reasonably well, but each also has its issues. My favorite approach when faced with these situations is to use the power of ensembling to supercharge the algorithm. Since each approach has unique information, the act of combining them into a single list will take advantage of all of that information while reducing the “error” associated with each method. This is the power of ensembling (a topic for another time).
Take a look at the top 10 artists each approach identified:
Since we don’t have any strong preference for a single method, a nice way of combining these lists is to rank order each list and then sum up the ranks for each artist across the methods. This weights each method equally and gets rid of some of the noise produced by each scoring method. All in all, this method finds 13 other artists that are more closely related to Mumford & Sons than are the original 20 suggested by Spotify. Cool!
Allowing the algorithm to select the 50 most closely related artists provides great variety and a very realistic artist radio listening experience.
To make this algorithm work smoothly for any artist, a few more details have to be worked out, but this is how the sausage is made. These are things like:
- Actually creating the playlist once we have a list of the top artists
  - I evenly distribute songs among the top artists and use each song’s popularity to determine how likely it is to become part of the playlist. This doesn’t mean that no “unpopular” songs are selected, they are just selected less frequently.
- Working out a caching strategy – Any individual artist radio pulls a lot of data, so I want to make sure not to piss off Spotify by unnecessarily hitting their API.
- Since Spotify has such a large music library there are certain artists and tracks that you’ll want to filter out, unless of course you’re interested in them.
  - Christmas music
  - Commentary Tracks
  - Remixes (there are a lot of these, and I prefer the original tracks)
  - Christian music (unless that’s desired)
  - Specific artists that you’re not thrilled about listening to.
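As a rough illustration of the playlist step, the sketch below picks the same number of songs per artist, weights each pick by popularity, and drops tracks whose titles contain unwanted words. It assumes each track is a dictionary with `name` and `popularity` keys. The keyword filter and the `+ 1` on the weights are my own simplifications, and the tracks are made up.

```python
import random

def pick_tracks(tracks_by_artist, per_artist,
                blocked_words=("remix", "christmas", "commentary"), rng=None):
    """Pick up to `per_artist` tracks per artist, weighted by popularity."""
    rng = rng or random.Random()
    playlist = []
    for tracks in tracks_by_artist.values():
        # Assumption: a simple title keyword check stands in for filtering.
        pool = [t for t in tracks
                if not any(word in t["name"].lower() for word in blocked_words)]
        for _ in range(min(per_artist, len(pool))):
            # +1 keeps zero-popularity tracks selectable, just rarely.
            weights = [t["popularity"] + 1 for t in pool]
            pick = rng.choices(pool, weights=weights, k=1)[0]
            pool.remove(pick)  # sample without replacement
            playlist.append(pick)
    rng.shuffle(playlist)  # avoid playing one artist's songs back to back
    return playlist

# Made-up tracks for two placeholder artists.
toy_tracks = {
    "artist_1": [
        {"name": "Song A", "popularity": 80},
        {"name": "Song A - Remix", "popularity": 90},
        {"name": "Song B", "popularity": 10},
    ],
    "artist_2": [
        {"name": "Christmas Song", "popularity": 70},
        {"name": "Song C", "popularity": 50},
    ],
}

playlist = pick_tracks(toy_tracks, per_artist=2, rng=random.Random(0))
```

Passing a seeded `random.Random` makes a playlist reproducible, which is handy when tuning the filters.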
Final Thoughts
There are some pretty cool things you can do with this once it’s working. For example, just like Pandora, instead of a single seed artist, you can pass the algorithm many seed artists and get a radio-like playlist that is as unique as your musical tastes. Once caching is working properly, you can generate playlists in seconds, even when using 100+ seed artists. Pretty sweet!
I’ve been using this algorithm for years now with very few tweaks. I am totally satisfied with the listening experience it provides, and don’t know what I would do without it.
I’d probably go back to Pandora if I’m being honest…
I’m working on making this artist radio more widely available, but until then feel free to reach out if you are interested in replicating this for yourself.
OK. That’s all I’ve got for now!
-Scott
Resources
The code for the data visualizations seen in this post.
The Mumford & Sons Artist Radio created from the techniques of this post.