The KDD Cup is an annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining. This year, the KDD-Cup is called “Learn the rhythm, predict the musical scores.” Yahoo! Music has contributed 300 million ratings from over 1 million anonymized users. The ratings are given to songs, albums, artists and genres. The goal for this competition is for submitters to (1) accurately predict the ratings that users gave to various items and (2) separate loved songs from other songs.
This is a pretty exciting set of data. It is perhaps the largest set of music rating data ever released. With a data set of this size we should see Netflix Prize-sized advances in the music recommendation field. However, there’s one little gotcha. The data is entirely anonymized. Not only has the user data been anonymized, but all of the songs, albums, artists and genres have been as well. So instead of getting ratings data like ‘user 1 rated Bon Jovi with five stars’, you get data like ‘user 1 rated artist 10 with five stars’. Here’s a sample of data for one user:
3|14                          # user ID 3 has 14 ratings
5980 90 3811 13:24:00         # item 5980 got a score of 90/100
11059 90 3811 13:24:00        # 3811 is a day offset from an
21931 90 3811 13:24:00        # undisclosed date
74262 90 3811 13:24:00        #
146781 90 3811 13:24:00       # 13:24 is the time on day 3811
173094 90 3811 13:24:00
175835 90 3811 13:24:00
180037 90 3811 13:24:00
194044 90 3811 13:24:00
267723 90 3811 13:24:00
290303 90 3811 13:24:00
366723 90 3811 13:24:00
432968 90 3811 13:24:00
451800 90 3811 13:24:00
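For readers who want to poke at data in this shape, here is a minimal parsing sketch. It assumes only what the sample above shows: a `user|count` header followed by that many whitespace-separated rating lines (item ID, score, day offset, time). The names `Rating`, `UserBlock` and `parse_blocks` are made up for this illustration, and the `#` annotations are treated as comments that may not appear in the real files.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Rating:
    item_id: int
    score: int   # on a 0-100 scale, per the "90/100" note in the sample
    day: int     # day offset from an undisclosed date
    time: str    # "HH:MM:SS" on that day


@dataclass
class UserBlock:
    user_id: int
    ratings: List[Rating]


def _strip_comment(line: str) -> str:
    """Drop anything after '#' and surrounding whitespace."""
    return line.split("#", 1)[0].strip()


def parse_blocks(lines: Iterable[str]) -> Iterator[UserBlock]:
    """Yield one UserBlock per 'user|count' header and its rating lines."""
    it = iter(lines)
    for raw in it:
        header = _strip_comment(raw)
        if not header:
            continue  # skip blank or comment-only lines between blocks
        user_str, count_str = header.split("|")
        expected = int(count_str)
        ratings: List[Rating] = []
        while len(ratings) < expected:
            nxt = next(it, None)
            if nxt is None:
                raise ValueError(f"user {user_str}: expected {expected} ratings")
            body = _strip_comment(nxt)
            if not body:
                continue
            item, score, day, time = body.split()
            ratings.append(Rating(int(item), int(score), int(day), time))
        yield UserBlock(int(user_str), ratings)


# The sample from the post, annotations included.
SAMPLE = """\
3|14 # user ID 3 has 14 ratings
5980 90 3811 13:24:00 # item 5980 got a score of 90/100
11059 90 3811 13:24:00
21931 90 3811 13:24:00
74262 90 3811 13:24:00 #
146781 90 3811 13:24:00
173094 90 3811 13:24:00
175835 90 3811 13:24:00
180037 90 3811 13:24:00
194044 90 3811 13:24:00
267723 90 3811 13:24:00
290303 90 3811 13:24:00
366723 90 3811 13:24:00
432968 90 3811 13:24:00
451800 90 3811 13:24:00
"""

if __name__ == "__main__":
    for block in parse_blocks(SAMPLE.splitlines()):
        print(block.user_id, len(block.ratings))
```

Note that nothing in the parsed result says what item 5980 actually is, which is exactly the gotcha: you get clean integers and no music.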
Without any way to tie the item IDs to actual music items, this competition seems to be less about music recommendation and more about collaborative filtering (CF) algorithms. As Oscar Celma (who literally wrote the book on music recommendation) put it in the KDD Cup competition forum:
Without artist/song name, the dataset has no interest for me (e.g. it doesn’t make any sense not being able to understand what are you predicting). As it is now, this is not really a “music dataset” nor a competition about “music recommendation”, but simply a way to apply CF to a huge dataset. In a way, this is good for people doing research on CF. But, not being able to add *any* knowledge about the domain… it doesn’t make any sense, IMHO.
Researcher Amelie Anglade adds:
There is so much we could do if we had access to the artist and track names, using Music Information Retrieval techniques: we could analyse the audio (tempo, chords, melody, timbre, etc.), the scores, the lyrics, the artists’ connections, and much more. There is a growing community working on these topics, and attempting to do music recommendation without any contextual and/or content information other than the genres (which is a limited approach) is simply ignoring this whole branch of research.
The folks at Yahoo! who have generously put together the dataset do understand how the lack of real, non-anonymized music data makes it difficult for a whole branch of researchers from the Music Information Retrieval community to participate in the competition. However, Noam Koenigstein, one of the organizers of this year’s KDD-Cup, says that the aggressive anonymization of the data is required by their legal team due to recent lawsuits around large releases of user rating data (see Netflix lawsuit) and that their hands are tied. Noam does go on to say that:
After working with this dataset for 6 months now, I can definitely say that there are differences between music CF and other types of CF. One example is the popularity temporal trends in music that are different than in movies (Netflix). So a CF system that considers also temporal effects will be different in music. There are other differences as well, but I can not reveal them right now.
I’m sure Noam is right that there are some interesting differences between the music rating data and other large rating sets, and I’m sure that exploring these differences will improve the state-of-the-art in CF systems, but Oscar and Amelie are right too – so much more could be learned if we had the ability to know what items were actually being rated.
There have been two very active research communities involved in music recommendation. The RecSys community takes a traditional recommender systems approach and relies mostly on collaborative filtering techniques to make recommendations. To this community, data mining of user behavior is enough to make good recommendations. The Music Information Retrieval (MIR) community, by contrast, focuses much more on the music itself, relying on content-based (CB) techniques based on the audio (or descriptions of the audio) to find musical connections to base recommendations on. Each approach has its own strengths and weaknesses: CF has the cold start problem, popularity feedback loops and susceptibility to hacking, while CB tends to be computationally more challenging and has trouble separating good music from bad. The best systems tend to combine aspects of both approaches into hybrid systems.
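To make the hybrid idea concrete, here is a toy sketch of one common way to blend the two signals: lean on the content-based score when an item has few ratings (the cold start case) and shift toward the CF score as ratings accumulate. The function name `hybrid_score`, the damping constant `k` and the weighting formula are assumptions chosen for illustration, not a description of any particular production system.

```python
from typing import Optional


def hybrid_score(
    cf_score: Optional[float],
    cb_score: float,
    n_ratings: int,
    k: float = 50.0,
) -> float:
    """Blend a CF prediction with a content-based one.

    The CF weight is n / (n + k): 0 for an unrated item, 0.5 at n == k,
    approaching 1 as the item collects many ratings. `k` is a hypothetical
    tuning knob for this sketch.
    """
    if cf_score is None or n_ratings <= 0:
        # Cold start: CF has nothing to say, so fall back to content alone.
        return cb_score
    w = n_ratings / (n_ratings + k)
    return w * cf_score + (1.0 - w) * cb_score


if __name__ == "__main__":
    # A brand-new track with no ratings versus a well-rated one.
    print(hybrid_score(None, 70.0, n_ratings=0))
    print(hybrid_score(90.0, 70.0, n_ratings=50))
```

With anonymized item IDs, the `cb_score` input is exactly the part that cannot be computed, which is why this dataset only exercises the CF half of such a blend.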
The KDD-cup data set is a fantastic set of data, and I’m sure this data will help the RecSys community improve the state-of-the-art in CF systems. The MIR community is also creating its own industrial-sized datasets for research, such as the recently released Million Song Dataset, which will be used to improve CB techniques. It is my hope that someday we’ll be able to offer a combined dataset that contains both massive rating data and massive content data. If we put all this data in the hands of researchers, there’s no telling what they’ll find. And perhaps that’s the real problem – as Jeremy Reed tweeted: “Biomed researchers can obtain illegal substances for research, but we can’t get data because we’ll find users with bad taste!”
#1 by jeremy on February 22, 2011 - 10:47 am
So I remember having chats with you five years ago, and you were much more of the strictly CF, not content-based methods, mindset. Has this strongly changed? Do you see a lot more value in the content, now?
Another interesting distinction is that there are two types of content-based methods.. surface audio features, and deeper analytical features. MFCCs versus chords. Beat detection vs. rhythmic comparisons.
Any thoughts on the relative value of each? I’ve always been a strong chord/rhythm proponent, at a time when most folks just looked at the surface audio features. But maybe with those surface features plus CF methods, that may be enough?
#2 by Thierry BM on February 22, 2011 - 11:31 am
I understand Oscar and Amelie’s disappointment at not being able to link the data to more resources and metadata. By the way, in the yahoo dataset R-1, artists are known by name, and the Million Song Dataset matches 91% of the ratings.
But isn’t it still amazingly exciting? Millions of ratings? What if someone actually predicts the ratings in a convincing manner? Isn’t it enough to build a new LastFM or Pandora? Isn’t CF powering most of the EN recommendation system?
Mostly, if MIR researchers don’t look at this data because they don’t find it open enough… won’t we be scooped by machine learners in our own field, simply because it changes our way of doing things and we don’t take the time to adapt?
#3 by brian on February 22, 2011 - 1:14 pm
“Isn’t CF powering most of the EN recommendation system?”
What, no. First, we did not have a recommender at all until the release of personal catalogs this past October. And even that recommender takes absolutely no usage data into account other than the usage of the individual or catalog you are recommending to. (That is, no anonymous users can influence your recommendations at this time.)
#4 by Thierry BM on February 22, 2011 - 1:27 pm
Sorry, my mistake, no (anonymous) user involved.
But still, how are most recommendations done? If it is from web crawling, metadata about the artist, … rather than audio features, it is not that far off from collaborative filtering.
If all users always like/dislike the same two artists at the same time, it is close to always seeing these two artists together on blog posts.
#5 by Oscar Celma on February 22, 2011 - 3:22 pm
Q: “But isn’t it still amazingly exciting? Millions of ratings?”
– No, as it is now (only IDs, with no strings at all).
Also, in the music recommendation domain most of the (real) feedback is implicit (play a song, skip it), and only some comes in the form of (explicit) ratings (e.g. love/ban).
Q: “What if someone actually predicts the ratings in a convincing manner?”
– Well, they will win a prize. But it doesn’t solve some other (real) problems in music recommendation, namely:
1) given a playlist a user is listening to, which song should be the next one? (choose one from a dataset of 10M tracks)?
2) given a new release, to whom should I recommend it?
3) I want to discover unknown music (to me). Please recommend good music I might like, along the Loooooong Tail of music.
…and a few other use cases.
In a way, music (instant) recommendations are much more dynamic than other domains, such as movies, books, etc. where it takes more than 3 minutes to consume the item (or even skip the item after the first 10 secs). C’mon, give me the next song! Now. Faster! :-)
Q: “Isn’t it enough to build a new LastFM or Pandora? Isn’t CF powering most of the EN recommendation system?”
– To my knowledge, a music recommender based entirely on CF is not as good as one that mixes in other ingredients (e.g. social tagging, editorial metadata, etc.).
Also, I doubt the systems you mention are mostly based on CF:
. Last.fm: makes use of social tagging. A lot.
. Pandora: originally only based on manual content annotation, and indeed since a few years ago using feedback from users (e.g. skip / love / ban song), but not CF per se.
. Echo Nest: Brian already replied :-)
All in all, I still think this is not a “music recommendation dataset”, but just a whole bunch of numbers, that will excite those that love: black-box-CFs, scalability issues, algorithm efficiency, and other interesting problems, none of them really tackling the “music recommendation problem”.
#6 by Thierry BM on February 22, 2011 - 3:40 pm
I mostly agree. Evidently, this dataset will not solve every aspect of music recommendation. And evidently, we could do even more if the data was somehow linked to metadata.
But I won’t believe that there is no music knowledge in these numbers. Think about clustering songs and users in a few meaningful categories, think about measuring how long is the long-tail, … yes, we won’t be able to fully analyze the results until we actually know what the artists were, but the algorithms will be ready. And for instance, for the clustering idea, just the number of clusters is already interesting!
It is also true that current radios don’t necessarily use CF, but still, if I can predict correctly your rating of a song, I am off to a pretty good start, whatever other data I add to fine-tune it.
And I still believe that if the MIR community does not pay attention to that data, it is a missed opportunity.
#7 by Norman Casagrande on February 22, 2011 - 12:08 pm
You’re totally right Paul: the best algorithms are the ones that combine different sources of information. True, the CF nature of musical data is different, but not to the point of justifying a special cup. In my experience improvements at CF-only level did help but not as much as integrating tags and other sources.
On top of that, debugging an algorithm without knowing about the context is a real pain.
#8 by Oscar Celma on February 22, 2011 - 2:40 pm
Norman, I completely agree on both things you say ( i) CF-only not as useful as combining with tags and ii) debugging without any knowledge of what you’re doing is non-sense!).
Still, even though the data is a black box itself, there are some “tags” (ok, only styles/genres) assigned to tracks, album and artists.
So, in a way, you can also add this little bit of info and combine it with the results from CF.
#9 by Norman Casagrande on February 23, 2011 - 6:14 am
Canonical tags are clearly useful, but extended tags help even more. If it is true that those exist only in the form of IDs, it is harder to draw relationships among them.
Finally, allow me to highlight the *and other sources* part.. ;)
#10 by Robin on February 22, 2011 - 12:27 pm
At our last meeting, the group was wondering how possible it might be to de-anonymize some parts of the Yahoo music data. Could you find “Dark Side of the Moon” or “Sgt. Pepper” and work outwards from there?
#11 by Malcolm Slaney on February 22, 2011 - 1:47 pm
I agree Paul.
But the massive effort devoted to the Netflix competition “showed” that content-based information did NOT give better recommendations. The competitors tried to incorporate content analysis into recommendation systems, but in the end it did NOT help. :-(
Admittedly, summarizing the content of a 2 hour movie is much harder than a 3 minute song. Come work for us and we’ll show you our data.
P.S. If you do de-anonymize the data please don’t tell anybody. We’ll NEVER be able to release data again.
#12 by Paul on February 22, 2011 - 2:07 pm
malcolm – netflix shows that it is hard to beat CF data when you have lots of it, but of course when you have little or no CF data (like with new content or long tail content) CF won’t help you at all. If you tried to use a CF algorithm to recommend the 10,000 new tracks that were probably pushed to the web today, you’d have pretty poor results.
P.S. thanks for the job offer, but I’m already up to my neck in good data ;)
#13 by brian on February 22, 2011 - 2:25 pm
The netflix thing was vastly different from what a music recommender / similarity system is up against in the real world, as you know. Netflix had a tiny amount of movies and had usage for each of them. And yes, movies themselves do not react to content & contextual analysis the same way music does.
This competition is great but with the understanding that it has absolutely nothing to do with “music” or “music recommendation.” By definition, this data only can predict stuff about things that already have usage. As you know, the recommendable music catalog of the world is much larger, and that is whole point of discovery, finding stuff that people haven’t found yet. So this is all fine as a regression or machine learning evaluation but will have no worth to a “pandora of the future” or really any music discovery platform that i know of. (I don’t want to get into the “what is the use of MIR other than to have better industry success” thing– but a recommender built out of this competition will have precisely 0 value in the industry that i am in)
A true music recommendation evaluation would be faced with a 10m sized catalog of songs, a large majority of which have no usage data. And quite a large percentage of those would have no contextual data either. Tons of duplicates, covers, soundslikes, birthday songs copied 200 times with different names, etc. The evaluation would have to be in user feedback on the new data. This is a big reason why we worked with Dan & Thierry on the million songs thing — actual context & content data about music at a scale that makes sense.
This is why context and content are necessary in recommenders, not because they make a recommendation from the Beatles to John Lennon “better”, but because they’ll find something that does not yet have much (or any) other kind of data. And why else would we all be building recommender systems if not for that kind of discovery?
#14 by jeremy on February 22, 2011 - 2:56 pm
“that is whole point of discovery, finding stuff that people haven’t found yet.”
Abso-positiv-frigging yes. Very well said, Brian.
#15 by brian on February 22, 2011 - 2:27 pm
yeah, also, don’t try to hire paul. he’s just terrible, you don’t want him, he wastes all his work day writing blog posts :)
#16 by jeremy on February 22, 2011 - 2:55 pm
“But the massive effort devoted to the Netflix competition ‘showed’ that content-based information did NOT give better recommendations. The competitors tried to incorporate content analysis into recommendation systems, but in the end it did NOT help. :-( Admittedly, summarizing the content of a 2 hour movie is much harder than a 3 minute song. Come work for us and we’ll show you our data.”
But there is a *huge* difference, Malcolm, in content-based analysis of a movie, vs. content-based analysis of a piece of music. The semantic gap in movie analysis is huge. However, the semantic gap in music is much smaller, mostly because music is self-semantic. What I mean is that there are clear, repeating, extractable patterns that you can find in music (rhythmic patterns, harmonic progressions, etc.) that very robustly self-describe what the music is about. And it is much easier to then take these extracted patterns and apply them, as Brian is saying, to songs without any CF data at all. Because the patterns themselves have semantic meaning.
Movies are not like that at all. You can’t really pull much out of a movie by way of self-descriptive content patterns. Or, if you can, it is much more difficult.
So the Netflix competition’s success on content really is not an indicator of what is possible with music. And that’s kinda the whole point of this KDD discussion.. are we doing recommendation, or are we doing *music* recommendation, i.e. taking advantage of those properties and aspects of music that are uniquely…musical.
#17 by Deb Chachra on February 22, 2011 - 3:37 pm
Cf: Jeremy Reed’s comment: As someone who actually is a biomedical engineer and who does research with human subjects, de-anonymizing subjects would be a major ethical violation. I can understand Yahoo’s point.
Also, it’s surprisingly difficult to get illegal substances. Or even syringes.
#18 by Amelie Anglade on February 22, 2011 - 7:41 pm
Yes de-anonymising subjects would be an ethical violation, however given the large amount of data and the fact that some of these users’ pages might not exist anymore or might come from various unidentified sources (have a look at this message from Oscar: http://tech.groups.yahoo.com/group/kddcup2011/message/6) it is not certain that once the artists and track names are de-anonymised we’d be able to de-anonymise the users themselves…
And anyway Yahoo! provides similar but smaller datasets in which we have access to the artists and track names: http://webscope.sandbox.yahoo.com/catalog.php?datatype=r
In that last case de-anonymising the artists and songs is made possible by checking the identity and affiliation of the researchers working with the data. Why can’t we do the same for that KDD Cup?
I don’t see any difference, except that as it is now this contest excludes a lot of potentially interesting approaches not limited to CF. After all Malcolm, that would be a great way to check if the non-CF methods do not bring better MUSIC recommendation as you seem to suggest…
#19 by Markus Weimer on February 28, 2011 - 12:55 am
KDD Cup is different from Webscope (the other Yahoo! data sets): The KDD Cup data comes with a different (much less strict) set of data governance rules that made it imperative to anonymize the data to the extent we did.
Unless we re-negotiate with millions of users, we cannot compromise the contract we have with them.
The reasoning works the other way around: Can we make sure that the users cannot be revealed if we add song names? The answer is no.
It might be old-fashioned, but Yahoo! holds the privacy of its users to the standard of making sure it stays intact, as opposed to making the plausible case that it might not be violated immediately.
#20 by Deb Chachra on February 23, 2011 - 1:40 pm
Amelie, I feel your frustration. Even as someone outside the field, it’s clear that Yahoo! is disenfranchising a large group of interested researchers and greatly attenuating the power of this contest (possibly so much that it’s irrelevant and mis-described as a ‘music recommendation’ contest). So that’s the value they’re leaving on the table, and it’s considerable.
And it doesn’t seem like there is a large probability that it’d be possible to de-anonymize users with just the addition of artist/song info. But per Paul’s link regarding the Netflix debacle, it’s quite possible that this risk is non-negligible. And yes, there are ways to mitigate this risk, such as only releasing the data to registered researchers, as you suggested.
So the real question is, “How serious is the risk?” If it’s just seen as something trivial–“oh noes, we’ll find users with bad taste!”–Yahoo!’s decision is utterly inexplicable. But from their point of view, the value of the risk is significant (hence Malcolm’s plea to Paul above): de-anonymizing users, however inadvertently and even for something as seemingly harmless as their taste in music, is pretty serious. It’s a fundamental tenet of social science research ethics–and regulation, at least in the US–that research subjects won’t have their identities revealed without their explicit consent.
Is Yahoo! making the right decision about withholding the artist/track data? I have no idea. But it’s hard to imagine that it’s possible to engage with them about changing the decision if the community doesn’t really consider the risk of user de-anonymization to be a serious one.
[I’m in a little bit of shock that I am publicly agreeing with Yahoo!’s lawyers on the grounds of research ethics… :)]
#21 by Dinesh Vadhia on February 23, 2011 - 4:08 pm
Talk about Yahoo! shooting themselves in the foot! I came across a similar situation with Yahoo! a few years back where their lawyers were insistent that data be fully anonymised even though there was zero possibility of de-anonymising. The fact that the data was practically useless was of no concern.
Having worked with the Netflix data and other similar data, mixing ratings with say, textual data does make a positive difference to the recommendations. Mixing ratings and low-level content also improves results.
#22 by Malcolm Slaney on February 24, 2011 - 1:37 pm
It’s worth repeating… AOL proved that releasing even *one* user’s personal data is a fireable offense. If I remember right, even a VP got fired over the search-log data release. Executives are loath to lose their jobs, especially when researchers go to them asking to give away some data, possibly put their jobs in jeopardy, and for a tenuous bit of future technology. I think it’s a good tradeoff, but a VP might not see it the same way we do.
I can not speak for Yahoo here, but this is probably true for everybody that has data. We have data from users and from our business partners that we have guaranteed to hold in confidence. The user data is a privacy issue. The song identification data is a business decision. The people that joined the MovieLens data and the Netflix data showed that they could identify one joint user, even though the user gave away his MovieLens data. It’s good to know that such privacy leaks are possible, but it makes it harder to educate VPs about why their data release will be ok.
In the data releases *I* have orchestrated I have chosen to not fight the business issue. Anonymizing the business data is easier than spending weeks convincing VPs that they won’t get fired, or that I am going to give away their business.
If you want more data, please collect it and give it away!!!
#23 by Dinesh Vadhia on February 28, 2011 - 5:25 am
If a really useful and usable data set cannot be released for whatever internal reasons then don’t – it just makes Yahoo! and others who do look silly.
#24 by Markus Weimer on February 28, 2011 - 1:42 pm
I’m not quite sure whether this is sarcasm or an honest suggestion. I’ll answer under the assumption of the latter, running the risk of making a fool out of myself:
The definition of “really useful and usable” data varies wildly between different fields of research. For the collaborative filtering community, this data set is a huge leap forward in terms of large, real world data available in their research. Also, to the best of my knowledge it is the first to contain meaningful hierarchies and time stamps of the available precision. The release of the data is a net-win for the research community.
However, I understand that for content based music analysis, this isn’t the right data set. This was the initial comment made on this blog and I agree with it.
Back to your argument: even if we were able to add song names, the data set would still not be “really useful and usable” for an even larger group of researchers in the social sciences. What good is the data without demographics?
My personal position on this is for companies to release as much data in as much detail as their ethical and legal restraints allow. Yahoo! did just that with this data set. While the researcher in me is sad that we can’t release all the data in the same detail we have it, the user in me appreciates Yahoo!’s concerns about my privacy.
I believe the “really useful and usable” data sets you ask for can only be gathered as part of research projects in academia where all users are fully aware of the fact that their behavior and data will be studied and more importantly: agree to it.
#25 by Noam Koenigstein on February 28, 2011 - 1:19 pm
This competition is focused on Collaborative Filtering (CF) based music recommendation.
As I said before, CF approaches do incorporate models that are different in each domain (movies, music, books, etc.).
Taking into account the importance of content based approaches in the MIR community, I still hope the community would not exclude music based CF research.