This week, Google launched the beta of its music locker service where you can upload all your music to the cloud and listen to it from anywhere. According to Techcrunch, Google’s Paul Joyce revealed that the Music Beta killer feature is ‘Instant Mix,’ Google’s version of Genius playlists, where you can select a song that you like and the music manager will create a playlist based on songs that sound similar. I wondered how good this ‘killer feature’ of Music Beta really was and so I decided to try to evaluate how well Instant Mix works to create playlists.
Google’s Instant Mix, like many playlisting engines, creates a playlist of songs given a seed song. It tries to find songs that go well with the seed song. Unfortunately, there’s no solid objective measure to evaluate playlists. There’s no algorithm that we can use to say whether one playlist is better than another. A good playlist derived from a single seed will certainly have songs that sound similar to the seed, but there are many other aspects as well: the mix of the familiar and the new, surprise, emotional arc, song order, song transitions, and so on. If you are interested in the perils of playlist evaluation, check out this talk Dr. Ben Fields and I gave at ISMIR 2010: Finding a path through the jukebox. The Playlist tutorial. (Warning, it is a 300 slide deck). Adding to the difficulty in evaluating the Instant Mix is that since it generates playlists within an individual’s music collection, the universe of music that it can draw from is much smaller than a general playlisting engine such as we see with a system like Pandora. A playlist may appear to be poor because it is filled with songs that are poor matches to the seed, but in fact those songs actually may be the best matches within the individual’s music collection.
Evaluating playlists is hard. However, there is something that we can do that is fairly easy to give us an idea of how well a playlisting engine works compared to others. I call it the WTF test. It is really quite simple. You generate a playlist, and just count the number of head-scratchers in the list. If you look at a song in a playlist and say to yourself ‘How the heck did this song get in this playlist’ you bump the counter for the playlist. The higher the WTF count the worse the playlist. As a first order quality metric, I really like the WTF Test. It is easy to apply, and focuses on a critical aspect of playlist quality. If a playlist is filled with jarring transitions, leaving the listener with iPod whiplash as they are jerked through songs of vastly different styles, it is a bad playlist.
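The counting step of the WTF test can be sketched in a few lines of Python. This is only an illustration: in the evaluation the judgment is made by a human listener, so the `is_wtf` callback, the toy playlist and its genre labels are all hypothetical stand-ins for that judgment.

```python
def wtf_count(playlist, is_wtf):
    """Count the tracks in `playlist` that the listener flags as head-scratchers."""
    return sum(1 for track in playlist if is_wtf(track))


# Hypothetical toy data: a jazz seed, with the listener's judgment
# reduced to a simple genre check for the sake of the example.
playlist = [
    {"artist": "John Coltrane", "genre": "jazz"},
    {"artist": "Coldplay", "genre": "rock"},
    {"artist": "Dave Brubeck", "genre": "jazz"},
]

score = wtf_count(playlist, lambda track: track["genre"] != "jazz")
print(score)  # the higher the count, the worse the playlist
```

A real listener would of course be more forgiving than a strict genre match, which is why the count is best treated as a first order metric only.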
For this evaluation, I took my personal collection of music (about 7,800 tracks) and enrolled it into three systems: Google Music, iTunes and The Echo Nest. I then created a set of playlists using each system and counted the WTFs for each playlist. I picked seed songs based on my music taste (it is my collection of music so it seemed like a natural place to start).
I compared three systems: iTunes Genius, Google Instant Mix, and The Echo Nest playlisting API. All of them are black box algorithms, but we do know a little bit about them:
- iTunes Genius – this system seems to be a collaborative filtering algorithm driven from purchase data acquired via the iTunes music store. It may use play, skip and ratings to steer the playlisting engine. More details about the system can be found in: Smarter than Genius? Human Evaluation of Music Recommender Systems. This is a one button system – there are no user-accessible controls that affect the playlisting algorithm.
- Google Instant Mix – there is no data published on how this system works. It appears to be a hybrid system that uses collaborative filtering data along with acoustic similarity data. Since Google Music does give attribution to Gracenote, there is a possibility that some of Gracenote’s data is used in generating playlists. This is a one button system. There are no user-accessible controls that affect the playlisting algorithm.
- The Echo Nest playlist engine – this is a hybrid system that uses cultural, collaborative filtering data and acoustic data to build the playlist. The cultural data is gleaned from a deep crawl of the web. The playlisting engine takes into account artist popularity, familiarity, cultural similarity, and acoustic similarity along with a number of other attributes. There are a number of controls that can be set to control the playlists: variety, adventurousness, style, mood, energy. For this evaluation, the playlist engine was configured to create playlists with relatively low variety with songs by mostly mainstream artists. The configuration of the engine was not changed once the test was started.
For this evaluation I’ve used my personal iTunes music collection of about 7,800 songs. I think it is a fairly typical music collection. It has music of a wide variety of styles. It contains music of my taste (70s progrock and other dad-core, indie and numetal), music from my kids (radio pop, musicals), some jazz, and a whole bunch of Canadian music from my friend Steve. There’s also a bunch of podcasts. It has the usual set of metadata screwups that you see in real-life collections (3 different spellings of Björk for example). I’ve placed a listing of all the music in the collection at Paul’s Music Collection if you are interested in all of the details.
Although I’ve tried my best to be objective, I clearly have a vested interest in the outcome of this evaluation. I work for a company that has its own playlisting technology. I have friends that work for Google. I like Apple products. So feel free to be skeptical about my results. I will try to do a few things to make it clear that I did not fudge things. I’ll show screenshots of results from the 3 playlisting sources, as opposed to just listing songs. (I’m too lazy to try to fake screenshots). I’ll also give the API command I used for the Echo Nest playlists so you can generate those results yourself. Still, I won’t blame the skeptics. I encourage anyone to try a similar A/B/C evaluation on their own collection so we can compare results.
For each trial, I picked a seed song, generated a 25 song playlist using each system, and counted the WTFs in each list. I show the results as screenshots from each system and I mark each WTF that I see with a red dot.
Trial #1 – Miles Davis – Kind of Blue
I don’t have a whole lot of Jazz in my collection, so I thought this would be a good test to see if a playlister could find the Jazz amidst all the other stuff.
First up is iTunes Genius
This looks like an excellent mix. All jazz artists. The closest things to a WTF are the Blood, Sweat and Tears tracks, which are jazz-rock fusion, and the Norah Jones tracks, which are more coffee house, but none of these tracks rise to the WTF level. Well done iTunes! WTF score: 0
Next up is The Echo Nest.
As with iTunes, the Echo Nest playlist has no WTFs, all hardcore jazz. I’d be pretty happy with this playlist, especially considering the limited amount of Jazz in my collection. I think this playlist may even be a bit better than the iTunes playlist. It is a bit more hardcore Jazz. If you are listening to Miles Davis, Norah Jones may not be for you. Well done Echo Nest. WTF score: 0
If you want to generate a similar playlist via our API, use this command:
http://developer.echonest.com/api/v4/playlist/static?api_key=3YDUQHGT9ZVUBFBR0&format=json&limit=true&song_id=SOAQMYC12A8C13A0A8&type=song-radio&bucket=id%3ACAQHGXM12FDF53542C&variety=.12&artist_min_hotttnesss=.4
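For anyone who would rather script the request than paste the URL, here is a minimal sketch that assembles the same query string with Python's standard library. The parameter names and values are taken from the command above. The function name, the `YOUR_API_KEY` placeholder, and the reading of the `bucket` value as the identifier of an uploaded collection are my assumptions, not documented behavior. The sketch only builds the URL and does not send the request.

```python
from urllib.parse import urlencode

BASE_URL = "http://developer.echonest.com/api/v4/playlist/static"


def build_playlist_url(api_key, song_id, bucket_id, variety=0.12, min_hotttnesss=0.4):
    """Assemble a song-radio playlist request like the one shown above.

    `bucket_id` is assumed to identify the uploaded collection that,
    together with limit=true, restricts results to that collection.
    """
    params = [
        ("api_key", api_key),
        ("format", "json"),
        ("limit", "true"),
        ("song_id", song_id),  # the seed song
        ("type", "song-radio"),
        ("bucket", f"id:{bucket_id}"),  # urlencode escapes ':' as %3A
        ("variety", variety),  # low variety, as in this evaluation
        ("artist_min_hotttnesss", min_hotttnesss),  # mostly mainstream artists
    ]
    return f"{BASE_URL}?{urlencode(params)}"


# YOUR_API_KEY is a placeholder; substitute your own key.
url = build_playlist_url("YOUR_API_KEY", "SOAQMYC12A8C13A0A8", "CAQHGXM12FDF53542C")
print(url)
```

Swapping in a different `song_id` while leaving the other parameters alone mirrors how the engine configuration was held fixed across the trials.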
Next up is Google:
I’ve marked the playlist with red dots on the songs that I consider to be WTF songs. There are 18(!) songs on this 25 song playlist that are not justifiable. There’s electronica, rock, folk, Victorian era brass band and Coldplay. Yes, that’s right, there’s Coldplay on a Miles Davis playlist. WTF score: 18
After Trial 1 Scores are: iTunes: 0 WTFs, The Echo Nest 0 WTFs, Google Music: 18 WTFs
Trial #2 – Lady Gaga – Bad Romance
First up is iTunes:
Next up: The Echo Nest
Next up, Google Instant Mix
Google’s Instant Mix for Lady Gaga’s Bad Romance seems filled with non sequiturs. Tracks by Dave Brubeck (cool jazz) and Maynard Ferguson (big band jazz) are mixed in with tracks by Ice Cube and They Might Be Giants. The most appropriate track in the playlist is a 20 year old track by Madonna. I think I was pretty lenient in counting WTFs on this one. Even then, it scores pretty poorly. WTF Score: 13
After Trial 2 Scores are: iTunes: 2 WTFs, The Echo Nest 0 WTFs, Google Music: 31 WTFs
Trial #3 – The Nice – Rondo
First up: iTunes:
Next up is The Echo Nest:
Next up is Google Instant Mix:
I would not like to listen to this playlist. It has a number of songs that are just too far out. ABBA and Simon & Garfunkel are WTF enough, but this playlist takes WTF three steps further. First offense: including a song with the same title more than once. This playlist has two versions of ‘Side A-Popcorn’. That’s a no-no in playlisting (except for cover playlists). Next offense is the song ‘I Think I Love You’ by The Partridge Family. This track was not in my collection. It was one of the free tracks that Google gave me when I signed up. 70s bubblegum pop doesn’t belong on this list. However, as bad as The Partridge Family song is, it is not the worst track on the playlist. That award goes to ‘FM 2.0: The future of Internet Radio’. Yep, Instant Mix decided that we should conclude a prog rock playlist with an hour long panel about the future of online music. That’s a big WTF. I can’t imagine what algorithm would have led to that choice. Google really deserves extra WTF points for these gaffes, but I’ll be kind. WTF Score: 11
After Trial 3 Scores are: iTunes: 2 WTFs, The Echo Nest 0 WTFs, Google Music: 42 WTFs
Trial #4 – Kraftwerk – Autobahn
I don’t have too much electronica, but I like to listen to it, especially when I’m working. Let’s try a playlist based on the group that started it all.
First up, iTunes.
iTunes nails it here. Not a bad track. Perfect playlist for programming. Again, well done iTunes. WTF Score: 0
Next up, The Echo Nest
Another solid playlist, no WTFs. It is a bit more vocal heavy than the iTunes playlist, so I think I prefer the iTunes version a bit more. Still, nothing to complain about here. WTF Score: 0
Next Up Google
After listening to this playlist, I am starting to wonder if Google is just messing with us. They could do so much better by selecting songs at random within a top level genre than what they are doing now. This playlist only has 6 songs that can be considered OK; the rest are totally WTF. WTF Score: 18
After Trial 4 Scores are: iTunes: 2 WTFs, The Echo Nest 0 WTFs, Google Music: 60 WTFs
Trial #5 The Beatles – Polythene Pam
For the last trial I chose the song Polythene Pam by The Beatles. It is at the core of the amazing bit on side two of Abbey Road. The zenith of the Beatles’ music is (IMHO) the opening chords to this song. Let’s see how everyone does:
First up: iTunes
iTunes gets a bit WTF here. They can’t offer any recommendations based upon this song. This is totally puzzling to me since The Beatles have been available in the iTunes store for quite a while now. I tried to generate playlists seeded with many different Beatles songs and was not able to generate one playlist. Totally WTF. I think that not being able to generate a playlist for any Beatles song as seed should be worth at least 10 WTF points. WTF Score: 10
Next Up: The Echo Nest
No worries with The Echo Nest playlist. Probably not the most creative playlist, but quite serviceable. WTF Score: 0
Next up Google
Instant Mix scores better on this playlist than it has on the other four. That’s not because I think they did a better job on this playlist, it is just that since the Beatles cover such a wide range of music styles, it is not hard to make a justification for just about any song. Still, I do like the variety in this playlist. There are just two WTFs on this playlist. WTF Score: 2.
After Trial 5 Scores are: iTunes: 12 WTFs, The Echo Nest 0 WTFs, Google Music: 62 WTFs
(lower scores are better)
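As a sanity check on the arithmetic, the running totals reported after each trial can be turned back into per-trial scores and an overall WTF rate. The sketch below uses only the totals quoted in this post and the 25-song playlist length. The variable and function names are mine. Note that the 10 points charged to iTunes in Trial 5 are a penalty for producing no playlist at all, not a count of tracks, so a per-track rate is only meaningful for Google here.

```python
TRACKS_PER_PLAYLIST = 25

# Running totals after trials 1 to 5, as reported in the post.
running_totals = {
    "iTunes Genius": [0, 2, 2, 2, 12],
    "The Echo Nest": [0, 0, 0, 0, 0],
    "Google Instant Mix": [18, 31, 42, 60, 62],
}


def per_trial_scores(totals):
    """Recover each trial's score by differencing consecutive running totals."""
    return [curr - prev for prev, curr in zip([0] + totals[:-1], totals)]


for system, totals in running_totals.items():
    print(f"{system}: per trial {per_trial_scores(totals)}, total {totals[-1]}")

# Share of Google's 125 playlist slots that were marked as WTFs.
google_totals = running_totals["Google Instant Mix"]
google_rate = google_totals[-1] / (TRACKS_PER_PLAYLIST * len(google_totals))
print(f"Google WTF rate: {google_rate:.1%}")
```

With 62 WTFs across 125 playlist slots, Google's rate comes out just under one half.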
I learned quite a bit during this evaluation. First of all, Apple Genius is actually quite good. The last time I took a close look at iTunes Genius was 3 years ago. It was generating pretty poor recommendations. Today, however, Genius is generating reliable recommendations for just about any track I could throw at it, with the notable exception of Beatles tracks.
I was also quite pleased to see how well The Echo Nest playlister performed. Our playlist engine is designed to work with extremely large collections (10 million tracks) or with personal sized collections. It has lots of options to allow you to control all sorts of aspects of the playlisting. I was glad to see that even when operating in the very constrained situation of a single seed song with no user feedback, it performed well. I am certainly not an unbiased observer, so I hope that anyone who cares enough about this stuff will try to create their own playlists with The Echo Nest API and make their own judgements. The API docs are here: The Echo Nest Playlist API.
However, the biggest surprise of all in this evaluation is how poorly Google’s Instant Mix performed. Nearly half of all songs in Instant Mix playlists were head scratchers – songs that just didn’t belong in the playlist. These playlists were not usable. It is a bit of a puzzle as to why the playlists are so bad considering all of the smart people at Google. Google does say that this release is a Beta, so we can give them a little leeway here. And I certainly wouldn’t count Google out here. They are data kings, and once the data starts rolling in from millions of users, you can bet that their playlists will improve over time, just like Apple’s did. Still, when Paul Joyce said that the Music Beta killer feature is ‘Instant Mix’, I wonder if perhaps what he meant to say was “the feature that kills Google Music is ‘Instant Mix’.”