Posts Tagged data
Back in February I wrote a post about the KDD Cup (an annual Data Mining and Knowledge Discovery competition), asking whether this year’s cup was really about music recommendation since all the data identifying the music had been anonymized. The post received a number of really interesting comments about the nature of recommendation and whether context and content were really necessary for music recommendation, or whether user behavior was all you really needed. A few commenters suggested that it might be possible to de-anonymize the data using a constraint propagation technique.
Many voiced the opinion that such de-anonymizing of the data to expose user listening habits would indeed be unethical. Malcolm Slaney, the researcher at Yahoo! who prepared the dataset, offered the plea:
As far as I know, no one has de-anonymized the KDD Cup dataset; however, researcher Matthew J. H. Rattigan of the University of Massachusetts at Amherst has done the next best thing. He has published a paper called Reidentification of artists and genres the KDD cup that shows that by analyzing the relational structures within the dataset it is possible to identify the artists, albums, tracks and genres that are used in the anonymized dataset. Here’s an excerpt from the paper that gives an intuitive description of the approach:
For example, consider Artist 197656 from the Track 1 data. This artist has eight albums described by different combinations of ten genres. Each album is associated with several tracks, with track counts ranging from 1 to 69. We make the assumption that these albums and tracks were sampled without replacement from the discography of some real artist on the Yahoo! Music website. Furthermore, we assume that the connections between genres and albums are not sampled; that is, if an album in the KDD Cup dataset is attached to three genres, its real-world counterpart has exactly three genres (or “Categories”, as they are known on the Yahoo! Music site).
Under the above assumptions, we can compare the unlabeled KDD Cup artist with real-world Yahoo! Music artists in order to find a suitable match. The band Fischer Z, for example, is an unsuitable match, as their online discography only contains seven albums. An artist such as Meatloaf certainly has enough albums (56) to be a match, but none of those albums contain more than 31 tracks. The entry for Elvis Presley contains 109 albums, 17 of which boast 69 or more tracks; however, there is no consistent assignment of genres that satisfies our assumptions. The band Tool, however, is compatible with Artist 197656. The Tool discography contains 19 albums containing between 0 and 69 tracks. These albums are described by exactly 10 genres, which can be assigned to the unlabeled KDD Cup genres in a consistent manner. Furthermore, the match is unique: of the 134k artists in our labeled dataset, Tool is the only suitable match for Artist 197656.
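The compatibility test described in the excerpt can be sketched roughly as follows. This is a simplified illustration, not the paper's algorithm: it assumes each album is reduced to a (track count, genre count) pair, and it only checks that every anonymized album can be paired with a distinct real album that has the same number of genres and at least as many tracks. The harder requirement, that genre labels be assigned consistently across albums, is deliberately left out. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def is_compatible(anon_albums, real_albums):
    """Return True if every anonymized album can be paired with a distinct
    real album that has the same genre count and at least as many tracks.

    Each album is a (track_count, genre_count) tuple. This is a necessary
    condition only: consistent genre labeling is NOT checked here.
    """
    anon_by_genres = defaultdict(list)
    real_by_genres = defaultdict(list)
    for tracks, genres in anon_albums:
        anon_by_genres[genres].append(tracks)
    for tracks, genres in real_albums:
        real_by_genres[genres].append(tracks)

    for genres, anon_tracks in anon_by_genres.items():
        real_tracks = sorted(real_by_genres.get(genres, []), reverse=True)
        if len(real_tracks) < len(anon_tracks):
            return False  # not enough real albums with this genre count
        # Greedy check: the i-th largest anonymized album must fit inside the
        # i-th largest real album, since tracks were sampled without replacement.
        for anon_n, real_n in zip(sorted(anon_tracks, reverse=True), real_tracks):
            if anon_n > real_n:
                return False
    return True
```

Under this sketch, a candidate with too few albums (the Fischer Z case) or with no album large enough to hold a 69-track sample (the Meatloaf case) is rejected, while a candidate like Elvis Presley would pass and would need the genre-consistency check to be ruled out.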
Of course it is impossible for Matthew to evaluate his results directly, but he did create a number of synthetic, anonymized datasets drawn from Yahoo and was able to demonstrate very high accuracy for the top artists and a 62% overall accuracy.
The motivation for this type of work is not to turn the KDD Cup dataset into something that music recommendation researchers could use, but instead is to get a better understanding of data privacy issues. By understanding how large datasets can be de-anonymized, it will be easier for researchers in the future to create datasets that won’t easily yield their hidden secrets. The paper is an interesting read – so since you are done doing all of your reviews for RecSys and ISMIR, go ahead and give it a read:
Thanks to @ocelma for the tip.
Last night I was watching the pilot for Glee (a snarky TV version of High School Musical) with my three teenage daughters. I was surprised to hear the soundtrack filled with songs by the band Journey, songs that brought me back to my own high school years. The thing that I like the most about Journey is that many of their songs have this slow and gradual build up over the course of the whole song, as in this song, Lovin, Touchin Squeezin:
A number of my favorite songs have this slow build up. The canonical example is Zep’s ‘Stairway to Heaven’ – it starts with a slow acoustic guitar and over the course of 8 minutes builds to metal frenzy. I thought it would be fun to see if I could write a bit of software that could find the songs that have the same arc as ‘Stairway to Heaven’ or ‘Lovin, Touchin Squeezin’ – songs that have this slow build. With this ‘stairway detector’ I could build playlists filled with the songs that fire me up.
The obvious place to start is to look at how the loudness of a song changes over time. To do this I used the Echo Nest developer API to extract the loudness as a function of time for Journey’s Lovin, Touchin Squeezin:
In this plot the light green curve is the loudness, while the blue line is a windowed average of the loudness. This plot shows a nice rise in the volume over the course of the song. Compare that to a song like the Beatles’ ‘Ticket to Ride’, which doesn’t have this upward slope:
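The windowed average drawn in blue can be approximated with a simple moving average. The post doesn't say which kind of window or what width was used, so this is a minimal sketch under my own assumptions: a centered window of a hypothetical default width, applied to a plain list of loudness values in dB that has already been extracted (no Echo Nest API call is shown).

```python
def windowed_average(values, window=5):
    """Smooth a sequence with a centered moving average.

    Near the edges the window is truncated, so the output has the same
    length as the input. The window width is an illustrative default.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

A wider window gives a smoother curve at the cost of blurring short loud or quiet passages.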
From these two examples, it is pretty clear that we can build our stairway-detector just by looking at the average slope of the volume. The higher the slope, the bigger the build. Now, I suspect that there are lots of ways to find the average slope of a bumpy line – but I like to always try the simplest thing that could possibly work first – and for me the simplest thing was to just compare the average loudness of the second half of the song with the average loudness of the first half as a ratio. So for example, with the Journey song the average loudness of the second half of the song is -15.86 dB and the average of the first half of the song is -24.37 dB. Because these dB values are negative, it is the first-half average divided by the second-half average (-24.37 / -15.86) that comes out above 1 for a song that builds. This gives us a ratio of 1.54, while ‘Ticket to Ride’ gets a ratio of 1.06. Here’s the Journey song with averages shown:
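Here is a minimal sketch of that half-versus-half score. One detail worth spelling out: with the negative dB averages quoted for the Journey song, it is the first-half average divided by the second-half average that reproduces the 1.54 figure, so that is what this sketch computes. The function name is hypothetical, and the input is assumed to be a list of evenly spaced loudness samples in dB, all negative.

```python
def build_score(loudness_db):
    """Ratio of first-half to second-half average loudness.

    Assumes evenly spaced, negative dB samples. A score above 1 means the
    second half is louder (closer to 0 dB) than the first half.
    """
    n = len(loudness_db)
    if n < 2:
        raise ValueError("need at least two loudness samples")
    mid = n // 2
    first_avg = sum(loudness_db[:mid]) / mid
    second_avg = sum(loudness_db[mid:]) / (n - mid)
    if second_avg == 0:
        raise ValueError("second-half average is zero; ratio undefined")
    return first_avg / second_avg
```

A score below 1 flags the opposite shape, a song that winds down. The ratio only makes sense when all samples share the same sign, which is why the negative-dB assumption matters.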
With this newfound metric I analyzed a few thousand of the tracks in my personal collection to find the songs with the biggest crescendos. The biggest of all was this song by Muse with a whopping score of 3.07:
The metric isn’t perfect. For instance, I would have expected The Postal Service’s ‘Natural Anthem’ to have a high score because it has such a great build up, but it only gets a score of 1.19. Looking at the plot we can see why:
Of course, we can use this ratio to find tracks that go the other way, to find songs that gradually wind down. These seem to occur less frequently than the songs that build up. One example is Neutral Milk Hotel’s Two Headed Boy:
Despite the fact that I’m using a very naive metric to find the loudness slope, this stairway detector is pretty effective in finding songs that have that slow build. It’s another tool that I can use for helping to build interesting playlists. This is one of the really cool things about how the Echo Nest approaches music playlisting. By having an understanding of what the music actually sounds like, we can build much more interesting playlists than you get from genius-style playlists that only take into account artist co-occurrence.
The Echo Nest developer web services offer a number of interesting pieces of data about an artist, including similar artists, artist familiarity and artist hotness. Familiarity is an indication of how well known the artist is, while hotness (which we spell as the zoolanderish ‘hotttnesss’) is an indication of how much buzz the artist is getting right now. Top familiar artists are bands like Led Zeppelin, Coldplay, and The Beatles, while top ‘hottt’ artists are artists like Katy Perry, The Boy Least Likely To, and Mastodon.
I was interested in understanding how familiarity, hotness and similarity interact with each other, so I spent my Memorial Day morning creating a couple of plots to help me explore this. First, I was interested in learning how the familiarity of an artist relates to the familiarity of that artist’s similar artists. When you get the similar artists for an artist, is there any relationship between the familiarity of these similar artists and the seed artist? Since ‘similar artists’ are often used for music discovery, it seems to me that on average, the similar artists should be less familiar than the seed artist. If you like the very familiar Beatles, I may recommend that you listen to ‘Bon Iver’, but if you like the less familiar ‘Bon Iver’ I wouldn’t recommend ‘The Beatles’. I assume that you already know about them. To look at this, I plotted the average familiarity for the top 15 most similar artists for each artist along with the seed artist’s familiarity. Here’s the plot:
In this plot, I’ve taken the top 18,000 most familiar artists and ordered them by familiarity. The red line is the familiarity of the seed artist, and the green cloud shows the average familiarity of the similar artists. In the plot we can see that there’s a correlation between artist familiarity and the average familiarity of similar artists. We can also see that similar artists tend to be less familiar than the seed artist. This is exactly the behavior I was hoping to see. Our similar artist function yields similar artists that, in general, have an average familiarity that is less than the seed artist.
This plot can help us QA our artist similarity function. If we see that the average familiarity for similar artists deviates from the standard curve, there may be a problem with that particular artist. For instance, T-Pain has a familiarity of 0.869, while the average familiarity of T-Pain’s similar artists is 0.340. This is quite a bit lower than we’d expect – so there may be something wrong with our data for T-Pain. We can look at the similars for T-Pain and fix the problem.
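That QA check could be automated along these lines. This is only a sketch under assumptions of my own: the familiarity values are assumed to be already fetched into a plain dict (no Echo Nest API call is shown), 'deviates from the standard curve' is approximated by how far an artist's gap (seed familiarity minus average similar-artist familiarity) sits from the mean gap, and the two-standard-deviation threshold is arbitrary. All names here are hypothetical.

```python
from statistics import mean, pstdev


def average_similar_familiarity(similar_familiarities, top_n=15):
    """Average familiarity of the top_n most similar artists.

    The input list is assumed to be ordered from most to least similar.
    """
    top = similar_familiarities[:top_n]
    if not top:
        raise ValueError("artist has no similar artists")
    return mean(top)


def flag_outliers(artists, threshold=2.0):
    """Return names whose familiarity gap is unusually far from the mean gap.

    artists maps name -> (seed_familiarity, [similar artist familiarities]).
    The threshold (in standard deviations) is an illustrative choice.
    """
    gaps = {
        name: seed - average_similar_familiarity(similars)
        for name, (seed, similars) in artists.items()
    }
    gap_values = list(gaps.values())
    mu = mean(gap_values)
    sigma = pstdev(gap_values)
    if sigma == 0:
        return []  # every artist has the same gap, nothing stands out
    return sorted(n for n, g in gaps.items() if abs(g - mu) > threshold * sigma)
```

A global mean gap is cruder than comparing against the curve at a given familiarity rank, but it is enough to surface a case like a 0.869 seed with a 0.340 similar-artist average.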
For hotness, the desired behavior is less clear. If a listener starting from a medium-hot artist is looking for new music, it is unclear whether they’d like a hotter or colder artist. To see what we actually do, I looked at how the average hotness for similar artists compares to the hotness of the seed artist. Here’s the plot:
In this plot, the red curve shows the hotness of the top 18,000 most familiar artists. The shape of the curve is interesting: there are very few ultra-hot artists (artists with a hotness above 0.8) and very few familiar, ice-cold artists (with a hotness of less than 0.2). The average hotness of the similar artists seems to be somewhat correlated with the hotness of the seed artist, but markedly less so than in the familiarity plot. For hotness, if your seed artist is hot, you are likely to get less hot similar artists, while if the seed artist is not hot, you are likely to get hotter artists. That seems like reasonable behavior to me.
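The comparison between the two plots here is by eye. One way to put a number on 'somewhat correlated' would be a Pearson correlation between each seed artist's value and the average value of its similar artists, computed once for familiarity and once for hotness. The sketch below is a plain implementation of that standard formula; the function name is hypothetical and no figures from the plots are implied.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Passing the seed hotness values as xs and the similar-artist averages as ys would give a single coefficient to set beside the same calculation for familiarity.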
Well, there you have it. Some Monday morning explorations of familiarity, similarity and hotness. Why should you care? If you are building a music recommender, familiarity and hotness are really interesting pieces of data to have access to. There’s a subtle game a recommender has to play: it has to give a certain amount of familiar recommendations to gain trust, while also giving a certain number of novel recommendations in order to enable music discovery.