Rumor has it from some of the Echo Nest gang that went to Stockholm last week for new employee orientation that there is some sort of mandatory Karaoke requirement. Now for some, I’m sure this is great fun, but for others, like myself, not so much. I thought it would be best to prepare for my own mandatory Karaoke by finding some very short songs in order to minimize my time on stage. To do this I went through a database of the top Billboard songs of the last 60 years to find the shortest songs. Here are some of the top shortest popular songs of the last 60 years:
|Seconds||Artist and song||Date|
|76||Anna Kendrick Cups||2013-01-14|
|78||Zac Efron What I’ve Been Looking For (Reprise)||2006-02-13|
|83||Buchanan & Goodman Santa And The Satellite (Part I)||1957-12-25|
|92||Audrey Dear Elvis (Page 1)||1956-09-24|
|96||Fats Domino Whole Lotta Loving||1958-11-19|
|98||Glee Cast Isn’t She Lovely||2011-05-30|
|99||Maurice Williams & The Zodiacs Stay||1960-10-05|
|101||Swinging Blue Jeans, The Hippy Hippy Shake||1964-03-09|
|103||Peter, Paul & Mary Settle Down (Goin’ Down That Highway)||1963-01-21|
|105||Four Tops Ain’t That Love||1965-08-02|
|105||Fats Domino Shu Rah||1961-03-22|
|105||Chuck Berry Let It Rock||1960-02-03|
|107||Lucas Gabreel & Ashley Tisdale Bop To The Top||2006-02-13|
|107||Beach Boys, The Little Deuce Coupe||1963-08-19|
|107||Clyde McPhatter Lover Please||1962-03-05|
|108||Ventures, The Hawaii Five-O||1969-03-10|
|110||Glee Cast Sing!||2010-11-01|
|110||Glee Cast It’s My Life / Confessions Part II||2009-10-26|
|110||Ricky Nelson If You Can’t Rock Me||1963-04-22|
So it looks like my minimum possible karaoke pain will be 76 seconds if I go with Anna Kendrick’s Cups. Certainly better than Guns N’ Roses’ November Rain at 8:57 or Don McLean’s American Pie at 6:49. But better yet, I can go with Hawaii Five-O. That song is not only short, but has no vocals. With that song I’m sure to be pitch perfect!
I still remember the evening well. It was midnight during the summer of 1982. I was living in a thin-walled apartment, trying unsuccessfully to go to sleep while the people who lived upstairs were music bingeing on Rock Lobster by The B-52’s. They listened to the song continuously on repeat for hours, giving me the chance to ponder the rich world of undersea life, filled with manta rays, narwhals and dogfish.
We tend to binge on things we like – potato chips, Ben & Jerry’s, and Battlestar Galactica. Music is no exception. Sometimes we like a song so much that as soon as it’s over, we want to hear it again. But not all songs are equally replayable. Some songs have a secret, mysterious ingredient that makes us want to listen to them over and over again. What are these most replayed songs? Let’s look at some data to find out.
The Data – For this experiment I used a week’s worth of song play data from the summer of 2013 that consists of user / song / play-timestamp triples. This data set has on the order of 100 million of these triples for about a half million unique users and 5 million unique songs. To find replays I looked for consecutive plays of the same song by the same user within a time window (to ensure that the replays are in the same listening session). Songs with low numbers of plays or fans were filtered out.
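The replay detection described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the post: the function name `count_replays`, the 15-minute window and the sample plays are all assumptions, since the post does not say how large the time window was.

```python
from collections import defaultdict

# Hypothetical session window: the post does not say what window was used.
REPLAY_WINDOW_SECONDS = 15 * 60

def count_replays(plays, window=REPLAY_WINDOW_SECONDS):
    """Count plays and replays per song from (user, song, timestamp) triples.

    A play counts as a replay when the same user's previous play was the
    same song and started no more than `window` seconds earlier.
    """
    total_plays = defaultdict(int)
    replays = defaultdict(int)
    last_play = {}  # user -> (song, timestamp) of that user's previous play

    # Sort by user, then time, so each user's plays are seen in order.
    for user, song, ts in sorted(plays, key=lambda p: (p[0], p[2])):
        total_plays[song] += 1
        previous = last_play.get(user)
        if previous is not None:
            prev_song, prev_ts = previous
            if prev_song == song and ts - prev_ts <= window:
                replays[song] += 1
        last_play[user] = (song, ts)
    return dict(total_plays), dict(replays)

# Made-up plays for illustration only (timestamps in seconds).
plays = [
    ("alice", "rock_lobster", 0),
    ("alice", "rock_lobster", 410),
    ("alice", "rock_lobster", 820),
    ("alice", "royals", 1230),
    ("bob", "rock_lobster", 100),
    ("bob", "rock_lobster", 100_000),  # too far apart to be one session
]
total_plays, replays = count_replays(plays)
```

Sorting everything in memory is fine for a toy example; at 100 million triples you would more likely group by user in whatever batch system holds the data.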
For starters, I simply counted up the most replayed songs. As expected, this yields very boring results – the list of the top most replayed songs is exactly the same as the most played songs. No surprise here. The most played songs are also the most replayed songs.
Top Most Replayed Songs – (A boring result)
- Robin Thicke — Blurred Lines featuring T.I., Pharrell
- Jay-Z — Holy Grail featuring Justin Timberlake
- Miley Cyrus — We Can’t Stop
- Imagine Dragons — Radioactive
- Macklemore — Can’t Hold Us (feat. Ray Dalton)
To make this more interesting, instead of looking at the absolute number of replays, I adjusted for popularity by looking at the ratio of replays to the total number of plays for each song. This replay ratio tells us what percentage of plays of a song are replays. If we plot the replay ratio vs. the number of fans a song has, the outliers become quite clear. Some songs are replayed at a higher rate than others.
I made an interactive version of this graph; you can mouse over the songs to see what they are and click on the songs to listen to them.
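As a rough sketch, the replay ratio is just replays divided by total plays, computed per song after dropping songs with too few plays. The function name `replay_ratios`, the `min_plays` cutoff and the counts below are made up for illustration; the post does not give its filtering thresholds.

```python
def replay_ratios(total_plays, replays, min_plays=1):
    """Return (song, ratio) pairs sorted by replay ratio, highest first."""
    ratios = {
        song: replays.get(song, 0) / plays
        for song, plays in total_plays.items()
        if plays >= min_plays  # drop songs with too little data
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

# Made-up counts for illustration only.
total_plays = {"ocean_waves": 100, "blurred_lines": 10_000, "rare_song": 2}
replays = {"ocean_waves": 91, "blurred_lines": 700, "rare_song": 2}
ranked = replay_ratios(total_plays, replays, min_plays=50)
```

Here the heavily played song has far more replays in absolute terms, but the white-noise track comes out on top once popularity is divided out.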
Sorting the results by the replay ratio yields a much more interesting result. It surfaces a few classes of frequently replayed songs: background noise, children’s music, soft and smooth pop and Friday night party music. Here’s the color coded list of the top 20:
Top Replayed songs by percentage
- 91% replays White Noise For Baby Sleep — Ocean Waves
- 86% replays Eric West — Reckless (From Playing for Keeps)
- 86% replays Soundtracks For The Masters — Les Contes D’hoffmann: Barcarole
- 83% replays White Noise For Baby Sleep — Warm Rain
- 83% replays Rain Sounds — Relax Ocean Waves
- 82% replays Dennis Wilson — Friday Night
- 81% replays Sleep — Ocean Waves for Sleep – White Noise
- 74% replays White Noise Sleep Relaxation — White Noise Relaxation: Ocean Waves 7hz
- 74% replays Ween — Ocean Man
- 73% replays Children’s Songs Music — Whole World In His Hands
- 71% replays Glee Cast — Friday (Glee Cast Version)
- 63% replays Rain Sounds — Rain On the Window
- 63% replays Rihanna — Cheers (Drink To That)
- 60% replays Group 1 Crew — He Said (feat. Chris August)
- 59% replays Karsten Glück Simone Sommerland — Schlaflied für Anne
- 56% replays Monica — With You
- 54% replays Jessie Ware — Wildest Moments
- 53% replays Tim McGraw — I Like It, I Love It
- 53% replays Rain Sounds — Morning Rain In Sedona
- 52% replays Rain Sounds — Rain Sounds
It is no surprise that the list is dominated by background noise. There’s nothing like ambient ocean waves or rain sounds to help baby go to sleep in the noisy city. A five minute track of ambient white noise may be played dozens of times during every nap. It is not uncommon to find 8 hour long stretches of the same five minute white noise audio track played on auto repeat.
The most replayed song that isn’t background noise is Reckless by Eric West from the ‘shamelessly sentimental’ 2012 movie Playing for Keeps (4% rotten). 86% of the time this song is played it is a replay. This is the song that you can’t listen to just once. It is the Lay’s potato chip of music. Beware, if you listen to it, you may be caught in its web and you’ll never be able to escape. Listen at your own risk:
Luckily, most people don’t listen to this song even once. It is only part of the regular listening rotation of a couple hundred listeners. Still, it points to a pattern that we’ll see more of – overly sentimental music has high replay value.
Top Replayed Popular Songs
Perhaps even more interesting is to look at the top most replayed popular songs. We can do this by restricting the songs in the results to those that are by artists that have a significant fan base:
- 31% replays Miley Cyrus — The Climb
- 16% replays August Alsina — I Luv This sh*t featuring Trinidad James
- 15% replays Brad Paisley — Whiskey Lullaby
- 14% replays Tamar Braxton — The One
- 14% replays Chris Brown — Love More
- 14% replays Anna Kendrick — Cups (Pitch Perfect’s “When I’m Gone”)
- 13% replays Avenged Sevenfold — Hail to the King
- 13% replays Jay-Z — Big Pimpin’
- 13% replays Labrinth — Beneath Your Beautiful
- 13% replays Karmin — Acapella
- 12% replays Lana Del Rey — Summertime Sadness [Lana Del Rey vs. Cedric Gervais]
- 12% replays MGMT — Electric Feel
- 12% replays One Direction — Best Song Ever
- 12% replays Big Sean — Beware featuring Lil Wayne, Jhené Aiko
- 12% replays Chris Brown — Don’t Think They Know
- 11% replays Justin Bieber — Boyfriend
- 11% replays Avicii — Wake Me Up
- 11% replays 2 Chainz — Feds Watching featuring Pharrell
- 10% replays Paramore — Still Into You
- 10% replays Alicia Keys — Fire We Make
- 10% replays Lorde — Royals
- 10% replays Miley Cyrus — We Can’t Stop
- 10% replays Ciara — Body Party
- 9% replays Marc Anthony — Vivir Mi Vida
- 9% replays Ellie Goulding — Burn
- 9% replays Fantasia — Without Me
- 9% replays Rich Homie Quan — Type of Way
- 9% replays The Weeknd — Wicked Games (Explicit)
- 9% replays A$AP Ferg — Work REMIX
- 9% replays Jay-Z — Part II (On The Run) featuring Beyoncé
It is hard to believe, but the data doesn’t lie – more than 30% of the time after someone listens to Miley Cyrus’s The Climb they listen to it again right away – proving that there is indeed always going to be another mountain that you are going to need to climb. Miley Cyrus is well represented – her aptly named song We Can’t Stop is the most replayed song of the top ten most popular songs.
Here are the top 30 most replayed popular songs in Spotify and Rdio playlists for you to enjoy, but I’m sure you’ll never get to the end of the playlist. You’ll just get stuck repeating Best Song Ever or Boyfriend forever.
Here’s the Rdio version of the Top 30 Most Replayed popular songs:
Most Manually Replayed
More than once I’ve come back from lunch to find that I left my music player on auto repeat and it has played the last song 20 times while I was away. The song was playing, but no one was listening. It is more interesting to find song replays in which the replay is manually initiated. These are the songs that grabbed the attention of the listener enough to make them interact with their player and actually queue the song up again. We can find manually replayed songs by looking at replay timestamps. Replays generated by auto repeat will have a very regular gap between timestamps, while manual replays will have more irregular gaps.
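One simple way to act on that observation is to measure how much the gaps between consecutive plays vary. The sketch below uses the standard deviation of those gaps; the function name, the 2-second jitter threshold and the sample timestamps are assumptions for illustration, not values from the analysis.

```python
from statistics import pstdev

# Hypothetical threshold: gaps that vary by less than this many seconds
# are treated as machine-regular. The post gives no specific value.
AUTO_REPEAT_JITTER_SECONDS = 2.0

def looks_like_auto_repeat(timestamps, jitter=AUTO_REPEAT_JITTER_SECONDS):
    """Guess whether a run of plays of one song came from auto repeat.

    `timestamps` are the start times (in seconds) of consecutive plays of
    the same song by the same user. At least three plays are needed to
    compare two gaps; shorter runs are treated as manual.
    """
    if len(timestamps) < 3:
        return False
    ordered = sorted(timestamps)
    deltas = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return pstdev(deltas) <= jitter

# Made-up runs for illustration only.
auto_run = [0, 300, 600, 901, 1200]     # gaps of almost exactly 300 s
manual_run = [0, 312, 655, 1190, 1502]  # gaps vary by minutes
```

With these inputs `looks_like_auto_repeat(auto_run)` is true and `looks_like_auto_repeat(manual_run)` is false. A real implementation would probably also compare the gap to the track length, which this sketch ignores.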
Here are the top manually replayed songs:
- Body Party by Ciara
- Still Into You by Paramore
- Tapout featuring Lil Wayne, Birdman, Mack Maine, Nicki Minaj, Future by Rich Gang
- Part II (On The Run) featuring Beyoncé by Jay-Z
- Feds Watching featuring Pharrell by 2 Chainz
- Royals by Lorde
- V.S.O.P. by K. Michelle
- Just Give Me A Reason by Pink
- Don’t Think They Know by Chris Brown
- Wake Me Up by Avicii
There’s an Rdio playlist of these songs: Most Manually Replayed
Why do we care which songs are most replayed? It’s part of our never ending goal to try to better understand how people interact with music. For instance, recognizing when music is being used in a context like helping the baby go to sleep is important – without taking this context into account, the thousands of plays of Ocean Waves and Warm Rain would dominate the taste profile that we build for that new mom and dad. We want to make sure that when that mom and dad are ready to listen to music, we can recommend something besides white noise.
Looking at replays can help us identify new artists for certain audiences. For instance, parents looking for an alternative to Miley Cyrus for their pre-teen playlists after Miley’s recent VMA performance may look to an artist like Fifth Harmony. Their song Miss Movin’ On has similar replay statistics to the classic Miley songs:
Finally, looking at replays is another tool to help us understand the music that people really like. If the neighbors play Rock Lobster 20 times in a row, you can be sure that they really, really like that song. (And despite, or perhaps because of, that night 30 years ago, I like the song too). You should give it a listen, or two…
Back in February I wrote a post about the KDD Cup (an annual Data Mining and Knowledge Discovery competition), asking whether this year’s cup was really music recommendation since all the data identifying the music had been anonymized. The post received a number of really interesting comments about the nature of recommendation and whether or not context and content were really necessary for music recommendation, or whether user behavior was all you really needed. A few commenters suggested that it might be possible to de-anonymize the data using a constraint propagation technique.
Many voiced an opinion that such de-anonymizing of the data to expose user listening habits would indeed be unethical. Malcolm Slaney, the researcher at Yahoo! who prepared the dataset offered the plea:
As far as I know, no one has de-anonymized the KDD Cup dataset; however, researcher Matthew J. H. Rattigan of The University of Massachusetts at Amherst has done the next best thing. He has published a paper called Reidentification of artists and genres in the KDD cup that shows that by analyzing the relational structures within the dataset it is possible to identify the artists, albums, tracks and genres that are used in the anonymized dataset. Here’s an excerpt from the paper that gives an intuitive description of the approach:
For example, consider Artist 197656 from the Track 1 data. This artist has eight albums described by different combinations of ten genres. Each album is associated with several tracks, with track counts ranging from 1 to 69. We make the assumption that these albums and tracks were sampled without replacement from the discography of some real artist on the Yahoo! Music website. Furthermore, we assume that the connections between genres and albums are not sampled; that is, if an album in the KDD Cup dataset is attached to three genres, its real-world counterpart has exactly three genres (or “Categories”, as they are known on the Yahoo! Music site).
Under the above assumptions, we can compare the unlabeled KDD Cup artist with real-world Yahoo! Music artists in order to find a suitable match. The band Fischer Z, for example, is an unsuitable match, as their online discography only contains seven albums. An artist such as Meatloaf certainly has enough albums (56) to be a match, but none of those albums contain more than 31 tracks. The entry for Elvis Presley contains 109 albums, 17 of which boast 69 or more tracks; however, there is no consistent assignment of genres that satisfies our assumptions. The band Tool, however, is compatible with Artist 197656. The Tool discography contains 19 albums containing between 0 and 69 tracks. These albums are described by exactly 10 genres, which can be assigned to the unlabeled KDD Cup genres in a consistent manner. Furthermore, the match is unique: of the 134k artists in our labeled dataset, Tool is the only suitable match for Artist 197656.
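To make the flavor of that matching concrete, here is a much simplified sketch of a compatibility test built on the excerpt's assumptions. It is not the paper's algorithm: it only checks necessary conditions (enough genres, and enough albums with the right number of genres and enough tracks) and skips the harder step of finding a consistent genre assignment. The function names and the sample discographies are made up for illustration.

```python
from collections import defaultdict

def by_genre_count(albums):
    """Group album track counts by how many genres each album has."""
    groups = defaultdict(list)
    for track_count, genres in albums:
        groups[len(genres)].append(track_count)
    return groups

def could_match(anonymous_albums, real_albums):
    """Necessary (not sufficient) test that a real artist fits an anonymous one.

    Each album is a (track_count, set_of_genre_ids) pair. Following the
    excerpt's assumptions, anonymous albums and tracks are a sample of the
    real discography, while each album keeps its full set of genres.
    """
    anon_genres = set().union(*(genres for _, genres in anonymous_albums))
    real_genres = set().union(*(genres for _, genres in real_albums))
    if len(anon_genres) > len(real_genres):
        return False

    real_groups = by_genre_count(real_albums)
    for genre_count, anon_tracks in by_genre_count(anonymous_albums).items():
        real_tracks = sorted(real_groups.get(genre_count, []), reverse=True)
        anon_tracks = sorted(anon_tracks, reverse=True)
        if len(real_tracks) < len(anon_tracks):
            return False
        # Pair the largest with the largest: if this greedy pairing fails,
        # no one-to-one pairing of albums can succeed.
        if any(a > r for a, r in zip(anon_tracks, real_tracks)):
            return False
    return True

# Made-up discographies for illustration only.
anonymous = [(69, {1, 2, 3}), (12, {1, 2}), (1, {4})]
too_few_albums = [(12, {"rock", "pop"}), (10, {"rock"})]
short_albums = [(31, {"rock", "pop", "metal"}), (20, {"rock", "pop"}), (5, {"live"})]
compatible = [(69, {"metal", "prog", "rock"}), (14, {"metal", "prog"}),
              (9, {"live"}), (0, {"prog"})]
```

Only `compatible` passes `could_match(anonymous, ...)`. The first candidate lacks a three-genre album and the second has no album with 69 tracks, which mirrors the kind of elimination described in the excerpt.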
Of course it is impossible for Matthew to evaluate his results directly, but he did create a number of synthetic, anonymized datasets drawn from Yahoo! and was able to demonstrate very high accuracy for the top artists and a 62% overall accuracy.
The motivation for this type of work is not to turn the KDD cup dataset into something that music recommendation researchers could use, but instead is to get a better understanding of data privacy issues. By understanding how large datasets can be de-anonymized, it will be easier for researchers in the future to create datasets that won’t easily yield their hidden secrets. The paper is an interesting read – so since you are done doing all of your reviews for RecSys and ISMIR, go ahead and give it a read: https://www.cs.umass.edu/publication/docs/2011/UM-CS-2011-021.pdf. Thanks to @ocelma for the tip.
Last night I was watching the pilot for Glee (a snarky TV version of High School Musical) with my 3 teenage daughters. I was surprised to hear the soundtrack filled with songs by the band Journey, songs that brought me back to my own high school years. The thing that I like the most about Journey is that many of their songs have this slow and gradual build up over the course of the whole song, as in this song, Lovin’, Touchin’, Squeezin’:
A number of my favorite songs have this slow build up. The canonical example is Zep’s ‘Stairway to Heaven’ – it starts with a slow acoustic guitar and over the course of 8 minutes builds to metal frenzy. I thought it would be fun to see if I could write a bit of software that could find the songs that have the same arc as ‘Stairway to Heaven’ or ‘Lovin, Touchin Squeezin’ – songs that have this slow build. With this ‘stairway detector’ I could build playlists filled with the songs that fire me up.
The obvious place to start is to look at how the loudness of a song changes over time. To do this I used the Echo Nest developer API to extract the loudness as a function of time for Journey’s Lovin’, Touchin’, Squeezin’:
In this plot the light green curve is the loudness, while the blue line is a windowed average of the loudness. This plot shows a nice rise in the volume over the course of the song. Compare that to a song like the Beatles’ ‘Ticket to Ride’, which doesn’t have this upward slope:
From these two examples, it is pretty clear that we can build our stairway-detector just by looking at the average slope of the volume. The higher the slope, the bigger the build. Now, I suspect that there are lots of ways to find the average slope of a bumpy line – but I like to always try the simplest thing that could possibly work first – and for me the simplest thing was to just compare the average loudness of the two halves of the song. So for example, with the Journey song the average loudness of the second half of the song is -15.86 dB and the average of the first half of the song is -24.37 dB. Because these values are negative decibels, dividing the first-half average by the second-half average (-24.37 / -15.86) gives a ratio of 1.54, so a song that gets louder scores above 1, while ‘Ticket to Ride’ gets a ratio of 1.06. Here’s the Journey song with averages shown:
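Here is a minimal sketch of that metric, assuming the loudness curve is already available as a plain list of dB values (fetching it from the Echo Nest API is not shown, and the function names are made up). Note that with negative dB values, the quoted 1.54 corresponds to the first-half average divided by the second-half average, so that is the direction used here.

```python
from statistics import mean

def windowed_average(loudness, window=5):
    """Smooth a loudness curve with a trailing moving average."""
    smoothed = []
    for i in range(len(loudness)):
        start = max(0, i - window + 1)
        smoothed.append(mean(loudness[start:i + 1]))
    return smoothed

def stairway_score(loudness):
    """Ratio of first-half to second-half average loudness.

    Loudness is assumed to be in negative dB (0 dB is the loudest), so a
    quiet first half has a larger magnitude and a score above 1 means the
    song builds.
    """
    middle = len(loudness) // 2
    return mean(loudness[:middle]) / mean(loudness[middle:])

# Two flat halves at the averages quoted for the Journey song.
journey_like = [-24.37] * 4 + [-15.86] * 4
score = stairway_score(journey_like)  # -24.37 / -15.86, about 1.54
```

A ratio of dB values is a crude measure, since decibels are already logarithmic; a difference of the two averages, or a fitted slope, would be natural alternatives to try next.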
With this new found metric I analyzed a few thousand of the tracks in my personal collection to find the songs with the biggest crescendos. The biggest of all was this song by Muse with a whopping score of 3.07:
The metric isn’t perfect. For instance, I would have expected The Postal Service’s ‘Natural Anthem’ to have a high score because it has such a great build up, but it only gets a score of 1.19. Looking at the plot we can see why:
Of course, we can use this ratio to find tracks that go the other way, to find songs that gradually wind down. These seem to occur less frequently than the songs that build up. One example is Neutral Milk Hotel’s Two Headed Boy:
Despite the fact that I’m using a very naive metric to find the loudness slope, this stairway detector is pretty effective in finding songs that have that slow build. It’s another tool that I can use for helping to build interesting playlists. This is one of the really cool things about how the Echo Nest approaches music playlisting. By having an understanding of what the music actually sounds like, we can build much more interesting playlists than you get from genius-style playlists that only take into account artist co-occurrence.
The Echo Nest developer web services offer a number of interesting pieces of data about an artist, including similar artists, artist familiarity and artist hotness. Familiarity is an indication of how well known the artist is, while hotness (which we spell as the zoolanderish ‘hotttnesss’) is an indication of how much buzz the artist is getting right now. Top familiar artists are bands like Led Zeppelin, Coldplay, and The Beatles, while top ‘hottt’ artists are artists like Katy Perry, The Boy Least Likely To, and Mastodon.
I was interested in understanding how familiarity, hotness and similarity interact with each other, so I spent my Memorial Day morning creating a couple of plots to help me explore this. First, I was interested in learning how the familiarity of an artist relates to the familiarity of that artist’s similar artists. When you get the similar artists for an artist, is there any relationship between the familiarity of these similar artists and the seed artist? Since ‘similar artists’ are often used for music discovery, it seems to me that on average, the similar artists should be less familiar than the seed artist. If you like the very familiar Beatles, I may recommend that you listen to ‘Bon Iver’, but if you like the less familiar ‘Bon Iver’ I wouldn’t recommend ‘The Beatles’. I assume that you already know about them. To look at this, I plotted the average familiarity for the top 15 most similar artists for each artist along with the seed artist’s familiarity. Here’s the plot:
In this plot, I’ve taken the top 18,000 most familiar artists and ordered them by familiarity. The red line is the familiarity of the seed artist, and the green cloud shows the average familiarity of the similar artists. In the plot we can see that there’s a correlation between artist familiarity and the average familiarity of similar artists. We can also see that similar artists tend to be less familiar than the seed artist. This is exactly the behavior I was hoping to see. Our similar artist function yields similar artists that, in general, have an average familiarity that is less than that of the seed artist.
This plot can help us QA our artist similarity function. If we see that the average familiarity for similar artists deviates from the standard curve, there may be a problem with that particular artist. For instance, T-Pain has a familiarity of 0.869, while the average familiarity of T-Pain’s similar artists is 0.340. This is quite a bit lower than we’d expect – so there may be something wrong with our data for T-Pain. We can look at the similars for T-Pain and fix the problem.
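That kind of check could be automated along these lines. This is only a sketch with plain dictionaries standing in for the Echo Nest data: the function names, the fixed `max_drop` cutoff and every number except the two T-Pain figures quoted above are made up for illustration.

```python
from statistics import mean

def similar_familiarity_gaps(familiarity, similars, top_n=15):
    """Map each seed artist to (own familiarity, mean familiarity of similars)."""
    gaps = {}
    for seed, similar_artists in similars.items():
        known = [familiarity[a] for a in similar_artists[:top_n] if a in familiarity]
        if known:
            gaps[seed] = (familiarity[seed], mean(known))
    return gaps

def flag_suspicious(gaps, max_drop=0.4):
    """Return seeds whose similars are far less familiar than the seed itself.

    `max_drop` is a hypothetical cutoff; the post compares against the
    overall curve rather than a fixed number.
    """
    return sorted(seed for seed, (own, avg) in gaps.items() if own - avg > max_drop)

# Placeholder similar artists chosen so T-Pain's average comes out at 0.340.
familiarity = {
    "T-Pain": 0.869, "sim_a": 0.30, "sim_b": 0.38,
    "Seed B": 0.70, "sim_c": 0.60, "sim_d": 0.64,
}
similars = {"T-Pain": ["sim_a", "sim_b"], "Seed B": ["sim_c", "sim_d"]}
gaps = similar_familiarity_gaps(familiarity, similars)
flagged = flag_suspicious(gaps)
```

Here only T-Pain is flagged. A fixed cutoff is the bluntest possible rule; comparing each artist against the typical gap for artists of similar familiarity would track the curve in the plot more closely.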
For hotness, the desired behavior is less clear. If a listener starting from a medium hot artist is looking for new music, it is unclear whether they’d like a hotter or colder artist. To see what we actually do, I looked at how the average hotness for similar artists compares to the hotness of the seed artist. Here’s the plot:
In this plot, the red curve shows the hotness of the top 18,000 most familiar artists. It is interesting to see the shape of the curve: there are very few ultra-hot artists (artists with a hotness above 0.8) and very few familiar, ice cold artists (with a hotness of less than 0.2). The average hotness of the similar artists seems to be somewhat correlated with the hotness of the seed artist, but markedly less so than with the familiarity curve. For hotness, if your seed artist is hot, you are likely to get less hot similar artists, while if the seed artist is not hot, you are likely to get hotter artists. That seems like reasonable behavior to me.
Well, there you have it. Some Monday morning explorations of familiarity, similarity and hotness. Why should you care? If you are building a music recommender, familiarity and hotness are really interesting pieces of data to have access to. There’s a subtle game a recommender has to play: it has to give a certain amount of familiar recommendations to gain trust, while also giving a certain number of novel recommendations in order to enable music discovery.