Posts Tagged data

How Students Listen

The Spotify Insights team took a deep dive into some of the listening data of college students to see if there were any differences in how students at different schools listen. We looked at a wide range of data, including what artists were played, what songs were played and when, what playlists were played, what genres were played and so on. We focused mostly on looking for distinctive listening patterns and behaviors at the different schools. The results were a set of infographic-style visualizations that summarize the distinctive listening patterns for each school.

It was a fun study to do and really shows how much we can learn about listening behavior from music streaming behavior. Read about the study on the Spotify Insights Blog: Top 40 Musical Universities in America: How Students Listen


Minimizing my Karaoke pain

Rumor has it from some of the Echo Nest gang that went to Stockholm last week for new employee orientation that there is some sort of mandatory Karaoke requirement. Now for some, I’m sure this is great fun, but for others, like myself, not so much. I thought it would be best to prepare for my own mandatory Karaoke by finding some very short songs in order to minimize my time on stage. To do this I went through a database of the top Billboard songs of the last 60 years to find the shortest songs. Here are some of the shortest popular songs of the last 60 years:

Length (seconds) | Artist | Title | Date
76 | Anna Kendrick | Cups | 2013-01-14
78 | Zac Efron | What I’ve Been Looking For (Reprise) | 2006-02-13
83 | Buchanan & Goodman | Santa And The Satellite (Part I) | 1957-12-25
92 | Audrey | Dear Elvis (Page 1) | 1956-09-24
96 | Fats Domino | Whole Lotta Loving | 1958-11-19
98 | Glee Cast | Isn’t She Lovely | 2011-05-30
99 | Maurice Williams & The Zodiacs | Stay | 1960-10-05
101 | Swinging Blue Jeans, The | Hippy Hippy Shake | 1964-03-09
103 | Peter, Paul & Mary | Settle Down (Goin’ Down That Highway) | 1963-01-21
105 | Four Tops | Ain’t That Love | 1965-08-02
105 | Fats Domino | Shu Rah | 1961-03-22
105 | Chuck Berry | Let It Rock | 1960-02-03
107 | Lucas Gabreel & Ashley Tisdale | Bop To The Top | 2006-02-13
107 | Beach Boys, The | Little Deuce Coupe | 1963-08-19
107 | Clyde McPhatter | Lover Please | 1962-03-05
108 | Ventures, The | Hawaii Five-O | 1969-03-10
110 | Glee Cast | Sing! | 2010-11-01
110 | Glee Cast | It’s My Life / Confessions Part II | 2009-10-26
110 | Ricky Nelson | If You Can’t Rock Me | 1963-04-22

So it looks like my minimum possible karaoke pain will be 76 seconds if I go with Anna Kendrick’s Cups. Certainly better than Guns N’ Roses’ November Rain at 8:57 or Don McLean’s American Pie at 6:49. But better yet, I can go with Hawaii Five-O. That song is not only short, but has no vocals. With that song I’m sure to be pitch perfect!


The Most Replayed Songs

I still remember the evening well. It was midnight during the summer of 1982. I was living in a thin-walled apartment, trying unsuccessfully to go to sleep while the people who lived upstairs were music bingeing on The B-52’s Rock Lobster. They listened to the song continuously on repeat for hours, giving me the chance to ponder the rich world of undersea life, filled with manta rays, narwhals and dogfish.

We tend to binge on things we like – potato chips, Ben & Jerry’s, and Battlestar Galactica. Music is no exception. Sometimes we like a song so much that as soon as it’s over, we want to hear it again. But not all songs are equally replayable. There are some songs that have some secret, mysterious ingredients that make us want to listen to the song over and over again. What are these most replayed songs? Let’s look at some data to find out.

The Data – For this experiment I used a week’s worth of song play data from the summer of 2013 that consists of user / song / play-timestamp triples. This data set has on the order of 100 million of these triples for about a half million unique users and 5 million unique songs. To find replays I looked for consecutive plays of the same song by a user within a time window (to ensure that the replays are in the same listening session). Songs with low numbers of plays or fans were filtered out.
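The replay search described above can be sketched roughly as follows. This is only an illustration, not the code used for the experiment: the function name, the 30-minute window and the tuple layout are all assumptions, since the post doesn't give the actual window size or data format.

```python
from collections import Counter

# Hypothetical session window; the post does not say how long the window was.
REPLAY_WINDOW_SECONDS = 30 * 60


def count_replays(plays, window=REPLAY_WINDOW_SECONDS):
    """Count plays and replays per song.

    `plays` is an iterable of (user, song, timestamp) triples with
    timestamps in seconds. A play counts as a replay when the same user's
    previous play was the same song and started within `window` seconds.
    """
    play_counts = Counter()
    replay_counts = Counter()
    last_play = {}  # user -> (song, timestamp) of that user's previous play

    # Sort by user, then time, so each user's plays are seen in order.
    for user, song, ts in sorted(plays, key=lambda p: (p[0], p[2])):
        play_counts[song] += 1
        prev = last_play.get(user)
        if prev is not None and prev[0] == song and ts - prev[1] <= window:
            replay_counts[song] += 1
        last_play[user] = (song, ts)
    return play_counts, replay_counts
```

Sorting everything in memory is fine for a toy example; at 100 million triples you'd want to group by user in a streaming or batch job instead.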

For starters, I simply counted up the most replayed songs. As expected, this yields very boring results – the list of the top most replayed songs is exactly the same as the most played songs.  No surprise here.  The most played songs are also the most replayed songs.

Top Most Replayed Songs  – (A boring result)

  1. Robin Thicke — Blurred Lines featuring T.I., Pharrell
  2. Jay-Z — Holy Grail featuring Justin Timberlake
  3. Miley Cyrus — We Can’t Stop
  4. Imagine Dragons — Radioactive
  5. Macklemore — Can’t Hold Us (feat. Ray Dalton)

To make this more interesting, instead of looking at the absolute number of replays, I adjusted for popularity by looking at the ratio of replays to the total number of plays for each song. This replay ratio tells us what percentage of plays of a song are replays. If we plot the replay ratio vs. the number of fans a song has, the outliers become quite clear. Some songs are replayed at a higher rate than others.

I made an interactive version of this graph. You can mouse over the songs to see what they are and click on the songs to listen to them.
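For the curious, the replay ratio is just replays divided by total plays per song. Here's a minimal sketch, assuming we already have per-song play and replay counts in plain dictionaries. The `min_plays` cutoff is a made-up stand-in for the "low numbers of plays" filter, not the value actually used.

```python
def replay_ratios(play_counts, replay_counts, min_plays=50):
    """Return (song, ratio) pairs sorted by replay ratio, highest first.

    `min_plays` is a hypothetical cutoff standing in for the filter that
    drops songs with too few plays to give a meaningful ratio.
    """
    ratios = []
    for song, plays in play_counts.items():
        if plays < min_plays:
            continue  # too few plays to trust the ratio
        ratios.append((song, replay_counts.get(song, 0) / plays))
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)
```

Without some kind of minimum-plays filter, a song played three times in a row by a single listener would top the list with a ratio near 100%.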

Sorting the results by the replay ratio yields a much more interesting result. It surfaces a few classes of frequently replayed songs: background noise, children’s music, soft and smooth pop and Friday night party music. Here’s the color-coded list of the top 20:

Top Replayed songs by percentage

  1. 91% replays   White Noise For Baby Sleep — Ocean Waves
  2. 86% replays   Eric West — Reckless (From Playing for Keeps)
  3. 86% replays   Soundtracks For The Masters — Les Contes D’hoffmann: Barcarole
  4. 83% replays   White Noise For Baby Sleep — Warm Rain
  5. 83% replays   Rain Sounds — Relax Ocean Waves
  6. 82% replays   Dennis Wilson — Friday Night
  7. 81% replays   Sleep — Ocean Waves for Sleep – White Noise
  8. 74% replays   White Noise Sleep Relaxation White Noise Relaxation: Ocean Waves 7hz
  9. 74% replays   Ween — Ocean Man
  10. 73% replays   Children’s Songs Music — Whole World In His Hands
  11. 71% replays   Glee Cast — Friday (Glee Cast Version)
  12. 63% replays   Rain Sounds — Rain On the Window
  13. 63% replays   Rihanna — Cheers (Drink To That)
  14. 60% replays   Group 1 Crew — He Said (feat. Chris August)
  15. 59% replays   Karsten Glück Simone Sommerland — Schlaflied für Anne
  16. 56% replays   Monica — With You
  17. 54% replays   Jessie Ware — Wildest Moments
  18. 53% replays   Tim McGraw — I Like It, I Love It
  19. 53% replays   Rain Sounds — Morning Rain In Sedona
  20. 52% replays   Rain Sounds — Rain Sounds

It is no surprise that the list is dominated by background noise. There’s nothing like ambient ocean waves or rain sounds to help baby go to sleep in the noisy city. A five-minute track of ambient white noise may be played dozens of times during every nap. It is not uncommon to find 8-hour-long stretches of the same five-minute white noise audio track played on auto repeat.

The most replayed song that isn’t background noise is Reckless by Eric West, from the ‘shamelessly sentimental’ 2012 movie Playing for Keeps (4% rotten). 86% of the time this song is played it is a replay. This is the song that you can’t listen to just once. It is the Lay’s potato chip of music. Beware, if you listen to it, you may be caught in its web and you’ll never be able to escape. Listen at your own risk:

Luckily, most people don’t listen to this song even once. It is only part of the regular listening rotation of a couple hundred listeners. Still, it points to a pattern that we’ll see more of – overly sentimental music has high replay value.

Top Replayed Popular Songs
Perhaps even more interesting is to look at the top most replayed popular songs.  We can do this by restricting the songs in the results to those that are by artists that have a significant fan base:

  1. 31% replays   Miley Cyrus — The Climb
  2. 16% replays   August Alsina — I Luv This sh*t featuring Trinidad James
  3. 15% replays   Brad Paisley — Whiskey Lullaby
  4. 14% replays   Tamar Braxton — The One
  5. 14% replays   Chris Brown — Love More
  6. 14% replays   Anna Kendrick — Cups (Pitch Perfect’s “When I’m Gone”)
  7. 13% replays   Avenged Sevenfold — Hail to the King
  8. 13% replays   Jay-Z — Big Pimpin’
  9. 13% replays   Labrinth — Beneath Your Beautiful
  10. 13% replays   Karmin — Acapella
  11. 12% replays   Lana Del Rey — Summertime Sadness [Lana Del Rey vs. Cedric Gervais]
  12. 12% replays   MGMT — Electric Feel
  13. 12% replays   One Direction — Best Song Ever
  14. 12% replays   Big Sean — Beware featuring Lil Wayne, Jhené Aiko
  15. 12% replays   Chris Brown — Don’t Think They Know
  16. 11% replays   Justin Bieber — Boyfriend
  17. 11% replays   Avicii — Wake Me Up
  18. 11% replays   2 Chainz — Feds Watching featuring Pharrell
  19. 10% replays   Paramore — Still Into You
  20. 10% replays   Alicia Keys — Fire We Make
  21. 10% replays   Lorde — Royals
  22. 10% replays   Miley Cyrus — We Can’t Stop
  23. 10% replays   Ciara — Body Party
  24.   9% replays   Marc Anthony — Vivir Mi Vida
  25.   9% replays   Ellie Goulding — Burn
  26.   9% replays   Fantasia — Without Me
  27.   9% replays   Rich Homie Quan — Type of Way
  28.   9% replays   The Weeknd — Wicked Games (Explicit)
  29.   9% replays   A$AP Ferg — Work REMIX
  30.   9% replays   Jay-Z  — Part II (On The Run) featuring Beyoncé

It is hard to believe, but the data doesn’t lie – more than 30% of the time after someone listens to Miley Cyrus’s The Climb they listen to it again right away – proving that there is indeed always going to be another mountain that you are going to need to climb. Miley Cyrus is well represented – her aptly named song We Can’t Stop is the most replayed song of the top ten most popular songs.

Here are the top 30 most replayed popular songs in Spotify and Rdio playlists for you to enjoy, but I’m sure you’ll never get to the end of the playlist, you’ll just get stuck repeating The Best Song Ever or Boyfriend forever.

Here’s the Rdio version of the Top 30 Most Replayed popular songs:

Most Manually Replayed
More than once I’ve come back from lunch to find that I left my music player on auto repeat and it has played the last song 20 times while I was away. The song was playing, but no one was listening. It is more interesting to find song replays in which the replay is manually initiated. These are the songs that grabbed the attention of the listener enough to make them interact with their player and actually queue the song up again. We can find manually replayed songs by looking at replay timestamps. Replays generated by autorepeat will have a very regular timestamp delta, while manual replay timestamps will have a more random delta between timestamps.
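One simple way to turn the "regular vs. random delta" idea into code is to look at the spread of the gaps between consecutive plays. This is a sketch under assumptions: the two-second tolerance and the function name are invented for illustration, and the real analysis may well have used a different test.

```python
from statistics import pstdev


def looks_auto_repeated(timestamps, tolerance_seconds=2.0):
    """Guess whether a run of replays came from auto-repeat.

    `timestamps` are the start times (in seconds) of consecutive plays of
    one song by one user. `tolerance_seconds` is a hypothetical threshold
    on how much the gaps between plays may vary.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(deltas) < 2:
        return False  # too few gaps to judge regularity
    return pstdev(deltas) <= tolerance_seconds
```

A run with fewer than three plays has only one gap, so there's nothing to measure regularity against. This sketch treats those runs as manual, which is a judgment call.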

Here are the top manually replayed songs:  

  1. Body Party by Ciara
  2. Still Into You by Paramore
  3. Tapout featuring Lil Wayne, Birdman, Mack Maine, Nicki Minaj, Future by Rich Gang
  4. Part II (On The Run) featuring Beyoncé by Jay-Z
  5. Feds Watching featuring Pharrell by 2 Chainz
  6. Royals by Lorde
  7. V.S.O.P. by K. Michelle
  8. Just Give Me A Reason by Pink
  9. Don’t Think They Know by Chris Brown
  10. Wake Me Up by Avicii

There’s an Rdio playlist of these songs: Most Manually Replayed

So what?
Why do we care which songs are most replayed? It’s part of our never ending goal to try to better understand how people interact with music. For instance, recognizing when music is being used in a context like helping the baby go to sleep is important – without taking this context into account, the thousands of plays of Ocean Waves and Warm Rain would dominate the taste profile that we build for that new mom and dad. We want to make sure that when that mom and dad are ready to listen to music, we can recommend something besides white noise.

Looking at replays can help us identify new artists for certain audiences. For instance, parents looking for an alternative to Miley Cyrus for their pre-teen playlists after Miley’s recent VMA performance may look to an artist like Fifth Harmony. Their song Miss Movin’ On has similar replay statistics to the classic Miley songs.

Finally, looking at replays is another tool to help us understand the music that people really like. If the neighbors play Rock Lobster 20 times in a row, you can be sure that they really, really like that song.   (And despite, or perhaps because of, that night 30 years ago, I like the song too). You should give it a listen, or two…


Reidentification of artists and genres in the KDD cup data

Back in February I wrote a post about the KDD Cup (an annual Data Mining and Knowledge Discovery competition), asking whether this year’s cup was really music recommendation since all the data identifying the music had been anonymized. The post received a number of really interesting comments about the nature of recommendation and whether or not context and content were really necessary for music recommendation, or whether user behavior was all you really needed. A few commenters suggested that it might be possible to de-anonymize the data using a constraint propagation technique.

Many voiced an opinion that such de-anonymizing of the data to expose user listening habits would indeed be unethical. Malcolm Slaney, the researcher at Yahoo! who prepared the dataset, offered the plea:

If you do de-anonymize the data please don’t tell anybody. We’ll NEVER be able to release data again.

As far as I know, no one has de-anonymized the KDD Cup dataset. However, researcher Matthew J. H. Rattigan of the University of Massachusetts at Amherst has done the next best thing. He has published a paper called Reidentification of artists and genres in the KDD cup data that shows that by analyzing the relational structures within the dataset it is possible to identify the artists, albums, tracks and genres that are used in the anonymized dataset. Here’s an excerpt from the paper that gives an intuitive description of the approach:

For example, consider Artist 197656 from the Track 1 data. This artist has eight albums described by different combinations of ten genres. Each album is associated with several tracks, with track counts ranging from 1 to 69. We make the assumption that these albums and tracks were sampled without replacement from the discography of some real artist on the Yahoo! Music website. Furthermore, we assume that the connections between genres and albums are not sampled; that is, if an album in the KDD Cup dataset is attached to three genres, its real-world counterpart has exactly three genres (or “Categories”, as they are known on the Yahoo! Music site).

Under the above assumptions, we can compare the unlabeled KDD Cup artist with real-world Yahoo! Music artists in order to find a suitable match. The band Fischer Z, for example, is an unsuitable match, as their online discography only contains seven albums. An artist such as Meatloaf certainly has enough albums (56) to be a match, but none of those albums contain more than 31 tracks. The entry for Elvis Presley contains 109 albums, 17 of which boast 69 or more tracks; however, there is no consistent assignment of genres that satisfies our assumptions. The band Tool, however, is compatible with Artist 197656. The Tool discography contains 19 albums containing between 0 and 69 tracks. These albums are described by exactly 10 genres, which can be assigned to the unlabeled KDD Cup genres in a consistent manner. Furthermore, the match is unique: of the 134k artists in our labeled dataset, Tool is the only suitable match for Artist 197656.
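To make the matching idea in the excerpt a bit more concrete, here is a toy sketch of the album-level compatibility check. It is not the paper's algorithm. It assumes each album is reduced to a hypothetical (track count, genre count) pair, and it deliberately skips the harder step of finding a consistent assignment of genre labels.

```python
def can_match(anon_albums, real_albums):
    """Check whether every anonymized album can pair with a distinct real one.

    Each album is a (track_count, genre_count) tuple. An anonymized album
    may map to a real album with at least as many tracks (tracks were
    sampled) and exactly the same number of genres (genres were not).
    """
    def compatible(anon, real):
        return real[0] >= anon[0] and real[1] == anon[1]

    match_of_real = {}  # real album index -> anonymized album index

    def try_assign(a, seen):
        # Standard augmenting-path step for bipartite matching.
        for r, real in enumerate(real_albums):
            if r in seen or not compatible(anon_albums[a], real):
                continue
            seen.add(r)
            if r not in match_of_real or try_assign(match_of_real[r], seen):
                match_of_real[r] = a
                return True
        return False

    if len(anon_albums) > len(real_albums):
        return False  # too few albums, as in the Fischer Z example
    return all(try_assign(a, set()) for a in range(len(anon_albums)))
```

A candidate artist that passes this check would still need the consistent genre-assignment test described in the excerpt before it could count as a match.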

Of course it is impossible for Matthew to evaluate his results directly, but he did create a number of synthetic, anonymized datasets drawn from Yahoo and was able to demonstrate very high accuracy for the top artists and a 62% overall accuracy.

The motivation for this type of work is not to turn the KDD cup dataset into something that music recommendation researchers could use, but instead to get a better understanding of data privacy issues. By understanding how large datasets can be de-anonymized, it will be easier for researchers in the future to create datasets that won’t easily yield their hidden secrets. The paper is an interesting read – so since you are done doing all of your reviews for RecSys and ISMIR, go ahead and give it a read: https://www.cs.umass.edu/publication/docs/2011/UM-CS-2011-021.pdf. Thanks to @ocelma for the tip.


The Stairway Detector

Last night I was watching the pilot for Glee (a snarky TV version of High School Musical) with my 3 teenage daughters. I was surprised to hear the soundtrack filled with songs by the band Journey, songs that brought me back to my own high school years. The thing that I like the most about Journey is that many of their songs have this slow and gradual build up over the course of the whole song, as in this song Lovin, Touchin Squeezin:

A number of my favorite songs have this slow build up. The canonical example is Zep’s ‘Stairway to Heaven’ – it starts with a slow acoustic guitar and over the course of 8 minutes builds to metal frenzy.    I thought it would be fun to see if I could write a bit of software that could find the songs that have the same arc as ‘Stairway to Heaven’ or ‘Lovin, Touchin Squeezin’  – songs that have this slow build. With this ‘stairway detector’  I could build playlists filled with the songs that fire me up.

The obvious place to start is to look at how the loudness of a song changes over time. To do this I used the Echo Nest developer API to extract the loudness as a function of time for Journey’s Lovin, Touchin Squeezin:

In this plot the light green curve is the loudness, while the blue line is a windowed average of the loudness. This plot shows a nice rise in the volume over the course of the song. Compare that to a song like the Beatles’ ‘Ticket to Ride’, which doesn’t have this upward slope:

From these two examples, it is pretty clear that we can build our stairway-detector just by looking at the average slope of the volume. The higher the slope, the bigger the build. Now, I suspect that there are lots of ways to find the average slope of a bumpy line – but I like to always try the simplest thing that could possibly work first – and for me the simplest thing was to just take the ratio of the average loudness of the two halves of the song. So for example, with the Journey song the average loudness of the second half of the song is -15.86 dB and the average of the first half of the song is -24.37 dB. Since both values are negative decibels, dividing the first-half average by the second-half average (-24.37 / -15.86) gives us a ratio of 1.54, while ‘Ticket to Ride’ gets a ratio of 1.06. Here’s the Journey song with averages shown:

Here are a few more songs that fit the ‘slow build’ profile:

‘Stairway to Heaven’ has a score of 1.6 so it has a bigger build than Journey’s Lovin’.

Simon and Garfunkel’s ‘Bridge Over Troubled Water’ has an even bigger build with a score of 1.7.

Also sprach Zarathustra has a more modest score of 1.56.
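Here's a small sketch of the scoring idea, assuming the loudness curve is just a list of dB values sampled evenly over the song. The function names and window size are made up for illustration, and no Echo Nest API calls are shown. One detail worth spelling out: with the dB values quoted for the Journey song (-24.37 for the first half, -15.86 for the second), the score of 1.54 comes out when the first-half average is divided by the second-half average, because both numbers are negative.

```python
def windowed_average(values, window=5):
    """Smooth a loudness curve with a simple centered moving average."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def build_score(loudness_db):
    """Ratio of first-half to second-half average loudness.

    Loudness values are negative decibels, so a quiet first half and a
    louder second half give a score above 1. Needs at least two samples
    and a non-zero second-half average.
    """
    mid = len(loudness_db) // 2
    first = sum(loudness_db[:mid]) / mid
    second = sum(loudness_db[mid:]) / (len(loudness_db) - mid)
    return first / second
```

For example, `build_score([-24.37, -24.37, -15.86, -15.86])` reproduces the Journey ratio of about 1.54. A score below 1 would indicate a song that winds down rather than builds.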

With this newfound metric I analyzed a few thousand of the tracks in my personal collection to find the songs with the biggest crescendos. The biggest of all was Muse’s ‘Take a Bow’ with a whopping score of 3.07.

Another find is Arcade Fire’s “My Body is a Cage” with a score of 2.32.

The metric isn’t perfect. For instance, I would have expected The Postal Service’s ‘Natural Anthem’ to have a high score because it has such a great build up, but it only gets a score of 1.19. Looking at the plot we can see why:

After the initial build up, there’s a drop in energy for that last quarter of the song, so even though the song has a sustained crescendo for 3 minutes it doesn’t get a high score due to this drop.

Of course, we can use this ratio to find tracks that go the other way, to find songs that gradually wind down. These seem to occur less frequently than the songs that build up.  One example is Neutral Milk Hotel’s Two Headed Boy:

Despite the fact that I’m using a very naive metric to find the loudness slope, this stairway detector is pretty effective in finding songs that have that slow build. It’s another tool that I can use for helping to build interesting playlists. This is one of the really cool things about how the Echo Nest approaches music playlisting. By having an understanding of what the music actually sounds like, we can build much more interesting playlists than you get from genius-style playlists that only take into account artist co-occurrence.


Artist similarity, familiarity and hotness

The Echo Nest developer web services offer a number of interesting pieces of data about an artist, including similar artists, artist familiarity and artist hotness. Familiarity is an indication of how well known the artist is, while hotness (which we spell as the zoolanderish ‘hotttnesss’) is an indication of how much buzz the artist is getting right now. Top familiar artists are bands like Led Zeppelin, Coldplay, and The Beatles, while top ‘hottt’ artists are artists like Katy Perry, The Boy Least Likely To, and Mastodon.

I was interested in understanding how familiarity, hotness and similarity interact with each other, so I spent my Memorial Day morning creating a couple of plots to help me explore this. First, I was interested in learning how the familiarity of an artist relates to the familiarity of that artist’s similar artists. When you get the similar artists for an artist, is there any relationship between the familiarity of these similar artists and the seed artist? Since ‘similar artists’ are often used for music discovery, it seems to me that on average, the similar artists should be less familiar than the seed artist. If you like the very familiar Beatles, I may recommend that you listen to ‘Bon Iver’, but if you like the less familiar ‘Bon Iver’ I wouldn’t recommend ‘The Beatles’. I assume that you already know about them. To look at this, I plotted the average familiarity for the top 15 most similar artists for each artist along with the seed artist’s familiarity. Here’s the plot:

In this plot, I’ve taken the top 18,000 most familiar artists and ordered them by familiarity. The red line is the familiarity of the seed artist, and the green cloud shows the average familiarity of the similar artists. In the plot we can see that there’s a correlation between artist familiarity and the average familiarity of similar artists. We can also see that similar artists tend to be less familiar than the seed artist. This is exactly the behavior I was hoping to see. Our similar artist function yields similar artists that, in general, have an average familiarity that is less than the seed artist’s.

This plot can help us QA our artist similarity function. If we see the average familiarity for similar artists deviates from the standard curve, there may be a problem with that particular artist. For instance, T-Pain has a familiarity of 0.869, while the average familiarity of T-Pain’s similar artists is 0.340. This is quite a bit lower than we’d expect – so there may be something wrong with our data for T-Pain. We can look at the similars for T-Pain and fix the problem.
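A QA check along these lines might look like the sketch below. It is a simplification: instead of measuring deviation from the curve in the plot, it flags any artist whose similar artists are less familiar by more than a fixed gap. The `max_gap` value, the function name and the dictionary layout are all assumptions for illustration, not part of the Echo Nest API.

```python
def flag_suspect_artists(familiarity, similars, max_gap=0.4, top_n=15):
    """Return artists whose similar artists are much less familiar on average.

    `familiarity` maps artist name -> familiarity score in [0, 1].
    `similars` maps artist name -> list of similar artist names, best first.
    `max_gap` is a hypothetical threshold, not a value from the post.
    """
    suspects = []
    for artist, sims in similars.items():
        if artist not in familiarity:
            continue
        scores = [familiarity[s] for s in sims[:top_n] if s in familiarity]
        if not scores:
            continue
        avg = sum(scores) / len(scores)
        if familiarity[artist] - avg > max_gap:
            suspects.append((artist, round(avg, 3)))
    return suspects
```

With the T-Pain numbers above (0.869 vs. an average of 0.340), the gap is about 0.53, so T-Pain would be flagged under this made-up threshold.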

For hotness, the desired behavior is less clear. If a listener starting from a medium hot artist is looking for new music, it is unclear whether they’d like a hotter or colder artist. To see what we actually do, I looked at how the average hotness for similar artists compares to the hotness of the seed artist. Here’s the plot:

In this plot, the red curve is showing the hotness of the top 18,000 most familiar artists. It is interesting to see the shape of the curve: there are very few ultra-hot artists (artists with a hotness above 0.8) and very few familiar, ice cold artists (with a hotness of less than 0.2). The average hotness of the similar artists seems to be somewhat correlated with the hotness of the seed artist, but markedly less so than with the familiarity curve. For hotness, if your seed artist is hot, you are likely to get less hot similar artists, while if the seed artist is not hot, you are likely to get hotter artists. That seems like reasonable behavior to me.

Well, there you have it. Some Monday morning explorations of familiarity, similarity and hotness. Why should you care? If you are building a music recommender, familiarity and hotness are really interesting pieces of data to have access to. There’s a subtle game a recommender has to play: it has to give a certain amount of familiar recommendations to gain trust, while also giving a certain number of novel recommendations in order to enable music discovery.
