The KDD Cup is an annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining. This year’s KDD Cup is called “Learn the rhythm, predict the musical scores.” Yahoo! Music has contributed 300 million ratings given by over 1 million anonymized users to songs, albums, artists, and genres. The goal of this competition is to (1) accurately predict the ratings that users gave to various items and (2) separate loved songs from other songs.
This is a pretty exciting set of data; it is perhaps the largest set of music rating data ever released. With a dataset of this size, we should see Netflix Prize-sized advances in the music recommendation field. However, there’s one little gotcha: the data is entirely anonymized. Not only has the user data been anonymized, but so have all of the songs, albums, artists, and genres. So instead of getting ratings data like ‘user 1 rated Bon Jovi with five stars’, you get data like ‘user 1 rated artist 10 with five stars’. Here’s a sample of data for one user:
3|14                          # user ID 3 has 14 ratings
5980    90  3811  13:24:00    # item 5980 got a score of 90/100
11059   90  3811  13:24:00    # 3811 is a day offset from an
21931   90  3811  13:24:00    # undisclosed date
74262   90  3811  13:24:00    #
146781  90  3811  13:24:00    # 13:24 is the time on day 3811
173094  90  3811  13:24:00
175835  90  3811  13:24:00
180037  90  3811  13:24:00
194044  90  3811  13:24:00
267723  90  3811  13:24:00
290303  90  3811  13:24:00
366723  90  3811  13:24:00
432968  90  3811  13:24:00
451800  90  3811  13:24:00
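The per-user layout above (a ‘userID|count’ header line followed by count ‘itemID score dayOffset time’ rows) can be read with a short sketch like this. The field layout is assumed from the sample, and the `Rating` helper is my own invention, not part of any official KDD-Cup tooling:

```python
from collections import namedtuple

# Hypothetical record type for one rating row; field names are assumptions.
Rating = namedtuple("Rating", ["item_id", "score", "day", "time"])

def parse_ratings(lines):
    """Parse the per-user format: a 'userID|count' header line,
    then 'itemID score dayOffset time' rows for that user."""
    ratings = {}
    it = iter(lines)
    for header in it:
        user_id, count = header.strip().split("|")
        user_ratings = []
        for _ in range(int(count)):
            item_id, score, day, time = next(it).split()
            user_ratings.append(Rating(int(item_id), int(score), int(day), time))
        ratings[int(user_id)] = user_ratings
    return ratings

# A two-rating excerpt of the sample above:
sample = """3|2
5980 90 3811 13:24:00
11059 90 3811 13:24:00""".splitlines()

parsed = parse_ratings(sample)
print(parsed[3][0])  # Rating(item_id=5980, score=90, day=3811, time='13:24:00')
```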
Without any way to tie the item IDs to actual music items, this competition seems to be less about music recommendation and more about collaborative filtering (CF) algorithms. As Oscar Celma (who literally wrote the book on music recommendation) put it in the KDD Cup competition forum:
Without artist/song name, the dataset has no interest for me (e.g. it doesn’t make any sense not being able to understand what are you predicting). As it is now, this is not really a “music dataset” nor a competition about “music recommendation”, but simply a way to apply CF to a huge dataset. In a way, this is good for people doing research on CF. But, not being able to add *any* knowledge about the domain… it doesn’t make any sense, IMHO.
There is so much we could do if we had access to the artist and track names, using Music Information Retrieval techniques: we could analyse the audio (tempo, chords, melody, timbre, etc.), the scores, the lyrics, the artists’ connections, and much more. There is a growing community working on these topics, and attempting to do music recommendation without any contextual and/or content information other than the genres (which is a limited approach) is simply ignoring this whole branch of research.
The folks at Yahoo! who generously put together the dataset do understand how the lack of real, non-anonymized music data makes it difficult for a whole branch of researchers from the Music Information Retrieval community to participate in the competition. However, Noam Koenigstein, one of the organizers of this year’s KDD Cup, says that the aggressive anonymization of the data is required by their legal team due to recent lawsuits around large releases of user rating data (see the Netflix lawsuit), so their hands are tied. Noam goes on to say:
After working with this dataset for 6 months now, I can definitely say that there are differences between music CF and other types of CF. One example is that the temporal trends in popularity in music are different than in movies (Netflix). So a CF system that also considers temporal effects will be different in music. There are other differences as well, but I cannot reveal them right now.
I’m sure Noam is right that there are some interesting differences between the music rating data and other large rating sets, and I’m sure that exploring those differences will improve the state-of-the-art in CF systems. But Oscar and Amelie are right too: so much more could be learned if we knew what items were actually being rated.
Two very active research communities have been involved in music recommendation. The RecSys community takes a traditional recommender-systems approach, relying mostly on collaborative filtering techniques; to this community, mining user behavior is enough to make good recommendations. The Music Information Retrieval (MIR) community, on the other hand, focuses much more on the music itself, relying on content-based (CB) techniques built on the audio (or descriptions of the audio) to find the musical connections that recommendations are based on. Each approach has its own strengths and weaknesses: CF suffers from the cold-start problem, popularity feedback loops, and susceptibility to gaming, while CB tends to be computationally more expensive and has trouble separating good music from bad. The best systems tend to combine aspects of both approaches into hybrid systems.
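To make the contrast concrete, here is a minimal sketch of the CF side: item-item cosine similarity computed over co-rating users, with a similarity-weighted prediction. Note that nothing here needs to know what the items actually are, which is exactly Oscar’s point. The users, items, and ratings are invented for illustration; the real competition data is vastly larger and sparser:

```python
import math

# Toy user -> item -> rating data (hypothetical IDs, 0-100 scale as in the dataset)
ratings = {
    "u1": {"a": 90, "b": 80, "c": 10},
    "u2": {"a": 85, "b": 90},
    "u3": {"b": 70, "c": 20, "d": 60},
}

def item_vector(item):
    """The item's ratings, keyed by the users who rated it."""
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(item_x, item_y):
    """Cosine similarity between two items' rating vectors."""
    vx, vy = item_vector(item_x), item_vector(item_y)
    common = set(vx) & set(vy)
    if not common:
        return 0.0
    dot = sum(vx[u] * vy[u] for u in common)
    nx = math.sqrt(sum(v * v for v in vx.values()))
    ny = math.sqrt(sum(v * v for v in vy.values()))
    return dot / (nx * ny)

def predict(user, item):
    """Predict a rating as the similarity-weighted average of the
    user's ratings on other items (basic item-based CF)."""
    sims = [(cosine(item, other), r)
            for other, r in ratings[user].items() if other != item]
    num = sum(s * r for s, r in sims)
    den = sum(abs(s) for s, _ in sims)
    return num / den if den else 0.0
```

A content-based system would instead compare items by audio or metadata features, and a hybrid would blend the two scores; with anonymized item IDs, only the CF half is possible.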
The KDD-Cup dataset is a fantastic set of data, and I’m sure it will help the RecSys community improve the state-of-the-art in CF systems. The MIR community is also creating its own industrial-sized datasets for research, such as the recently released Million Song Dataset, which will be used to improve CB techniques. It is my hope that someday we’ll be able to offer a combined dataset that contains both massive rating data and massive content data. If we put all this data in the hands of researchers, there’s no telling what they’ll find. And perhaps that’s the real problem. As Jeremy Reed tweeted: Biomed researchers can obtain illegal substances for research, but we can’t get data because we’ll find users with bad taste!