Posts Tagged data

50 Years Ago in Music

Posted by Paul in code, Spotify, twitter on April 12, 2015

There’s a strong connection between music and memory. Whenever I hear the song Lovin’ You by Minnie Riperton, I’m instantly transported back to 1975 when I spent the summer apprenticed to Tom, my future brother-in-law, fixing electronic organs. I was 15, Tom was 22 and super cool. He had a business (New Hampshire Organ Service) and he had a van with an 8-track player and an FM radio (a rarity in 1975). As we drove between repairs across rural New Hampshire we’d pass the time by listening to the radio. Now, when I hear those radio songs from 1975 it is like I’m sitting in that van again.

Music can be like a time machine, transporting us to different times in our lives. I was interested in exploring this a bit more. Inspired by @realtimewwii, which gives a day-by-day account of World War II, I created a set of dynamically updating Spotify playlists that follow the charts week-by-week.

For example, there’s the 50 Years Ago in Music playlist that contains the top 100 or so songs that were on the chart 50 years ago. As I write this on April 12, 2015, this playlist is showing the top songs for the week of April 12, 1965.
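The date arithmetic behind a playlist like this is small. Here is a minimal sketch, assuming the playlist simply looks up the chart for the same calendar date N years earlier. The `years_ago` helper is a hypothetical name for illustration, and the actual playlist code may pick its chart week differently.

```python
from datetime import date


def years_ago(today: date, years: int) -> date:
    """Return the same calendar date `years` earlier (hypothetical helper)."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year; fall back to Feb 28.
        return today.replace(year=today.year - years, day=28)


# The example from the post: April 12, 2015 maps to April 12, 1965.
print(years_ago(date(2015, 4, 12), 50))
```

The same function covers the other playlists by passing 40, 30, 20, 10 or 5 for `years`.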

The music on this playlist sends me back to when I was 5 years old listening to music on our AM radio in the kitchen in the morning while eating breakfast.

If you follow this playlist you’ll be able to re-create what it was like to listen to music 50 years ago. If the mid-sixties doesn’t speak to you musically, there are some other playlists that you can try.

There’s 40 Years Ago in Music that brings me back to 1975 on the road with Tom.

There’s 30 Years Ago in Music which is currently playing music from the mid-80s like Madonna and Phil Collins.

There’s 20 Years Ago in Music currently playing music from the mid-90s:

10 Years Ago in Music plays the music that was on the radio when Spotify was just a gleam in Daniel’s eye.

5 Years Ago in Music – the playlist of @echonest in its heyday.

All of the playlists update weekly on Monday. If you’d like a reminder about when they are updated you can follow @50yearmusic. And of course, the code is on GitHub.

charts, code, data, fun, Music


How Students Listen

Posted by Paul in analytics, data, echonest, Spotify on September 18, 2014

The Spotify Insights team took a deep dive into some of the listening data of college students to see if there were any differences in how students at different schools listen. We looked at a wide range of data including what artists were played, what songs were played and when, what playlists were played, what genres were played and so on. We focused mostly on looking for distinctive listening patterns and behaviors at the different schools. The results were a set of infographic style visualizations that summarize the distinctive listening patterns for each school.

It was a fun study to do and really shows how much we can learn about listening behavior from music streaming data. Read about the study on the Spotify Insights Blog: Top 40 Musical Universities in America: How Students Listen

data, insights, spotify

Minimizing my Karaoke pain

Posted by Paul in code, data on May 10, 2014

Rumor has it from some of the Echo Nest gang that went to Stockholm last week for new employee orientation that there is some sort of mandatory Karaoke requirement. Now for some, I’m sure this is great fun, but for others, like myself, not so much. I thought it would be best to prepare for my own mandatory Karaoke by finding some very short songs in order to minimize my time on stage. To do this I went through a database of the top Billboard songs of the last 60 years to find the shortest songs. Here are some of the top shortest popular songs of the last 60 years:

Length(Seconds)	Artist/Title	Date
76	Anna Kendrick Cups	2013-01-14
78	Zac Efron What I’ve Been Looking For (Reprise)	2006-02-13
83	Buchanan & Goodman Santa And The Satellite (Part I)	1957-12-25
92	Audrey Dear Elvis (Page 1)	1956-09-24
96	Fats Domino Whole Lotta Loving	1958-11-19
98	Glee Cast Isn’t She Lovely	2011-05-30
99	Maurice Williams & The Zodiacs Stay	1960-10-05
101	Swinging Blue Jeans, The Hippy Hippy Shake	1964-03-09
103	Peter, Paul & Mary Settle Down (Goin’ Down That Highway)	1963-01-21
105	Four Tops Ain’t That Love	1965-08-02
105	Fats Domino Shu Rah	1961-03-22
105	Chuck Berry Let It Rock	1960-02-03
107	Lucas Gabreel & Ashley Tisdale Bop To The Top	2006-02-13
107	Beach Boys, The Little Deuce Coupe	1963-08-19
107	Clyde McPhatter Lover Please	1962-03-05
108	Ventures, The Hawaii Five-O	1969-03-10
110	Glee Cast Sing!	2010-11-01
110	Glee Cast It’s My Life / Confessions Part II	2009-10-26
110	Ricky Nelson If You Can’t Rock Me	1963-04-22
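Pulling the shortest songs out of such a database takes only a few lines. Here is a minimal sketch, assuming each row is a (length in seconds, artist, title) tuple. The `shortest` helper is hypothetical, and the sample rows are copied from the table above rather than from the full database.

```python
import heapq

# (length in seconds, artist, title): a handful of rows from the table above
songs = [
    (105, "Four Tops", "Ain't That Love"),
    (76, "Anna Kendrick", "Cups"),
    (108, "Ventures, The", "Hawaii Five-O"),
    (96, "Fats Domino", "Whole Lotta Loving"),
    (83, "Buchanan & Goodman", "Santa And The Satellite (Part I)"),
]


def shortest(rows, n):
    """Return the n rows with the smallest length, shortest first."""
    return heapq.nsmallest(n, rows, key=lambda row: row[0])


for length, artist, title in shortest(songs, 3):
    print(length, artist, title)
```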

So it looks like my minimum possible karaoke pain will be 76 seconds if I go with Anna Kendrick’s Cups. Certainly better than Guns N’ Roses’ November Rain at 8:57 or Don McLean’s American Pie at 6:49. But better yet, I can go with Hawaii Five-O. That song is not only short, but has no vocals. With that song I’m sure to be pitch perfect!

data, fun


The Most Replayed Songs

Posted by Paul in data, Music, The Echo Nest on August 27, 2013

I still remember the evening well. It was midnight during the summer of 1982. I was living in a thin-walled apartment, trying unsuccessfully to go to sleep while the people who lived upstairs were music bingeing on The B-52’s Rock Lobster. They listened to the song continuously on repeat for hours, giving me the chance to ponder the rich world of undersea life, filled with manta rays, narwhals and dogfish.

We tend to binge on things we like – potato chips, Ben & Jerry’s, and Battlestar Galactica. Music is no exception. Sometimes we like a song so much that as soon as it’s over, we want to hear it again. But not all songs are equally replayable. There are some songs that have some secret mysterious ingredients that make us want to listen to the song over and over again. What are these most replayed songs? Let’s look at some data to find out.

The Data – For this experiment I used a week’s worth of song play data from the summer of 2013 that consists of user / song / play-timestamp triples. This data set has on the order of 100 million of these triples for about a half million unique users and 5 million unique songs. To find replays I looked for consecutive plays of a song by the same user within a time window (to ensure that the replays are in the same listening session). Songs with low numbers of plays or fans were filtered out.
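The replay detection described above can be sketched in a few lines. This is only an illustration under stated assumptions: the post does not give the size of the time window, so the 30 minute default here is a guess, and `count_replays` is a hypothetical name.

```python
from collections import Counter


def count_replays(plays, window_secs=1800):
    """Count plays and replays per song.

    plays: iterable of (user, song, timestamp_in_seconds) triples.
    A play counts as a replay when the same user's previous play was the
    same song and started no more than window_secs earlier (assumed window).
    """
    total = Counter()
    replays = Counter()
    previous = {}  # user -> (song, timestamp) of that user's last play
    for user, song, ts in sorted(plays, key=lambda p: (p[0], p[2])):
        total[song] += 1
        last = previous.get(user)
        if last is not None and last[0] == song and ts - last[1] <= window_secs:
            replays[song] += 1
        previous[user] = (song, ts)
    return total, replays


plays = [
    ("u1", "A", 0), ("u1", "A", 200), ("u1", "B", 400), ("u1", "A", 600),
    ("u2", "A", 0), ("u2", "A", 5000),  # gap too long: not the same session
]
total, replays = count_replays(plays)
print(total["A"], replays["A"])
```

Sorting by user and timestamp keeps the sketch simple. At 100 million triples you would more likely stream data that is already grouped by user.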

For starters, I simply counted up the most replayed songs. As expected, this yields very boring results – the list of the top most replayed songs is exactly the same as the most played songs. No surprise here. The most played songs are also the most replayed songs.

Top Most Replayed Songs – (A boring result)

Robin Thicke — Blurred Lines featuring T.I., Pharrell
Jay-Z — Holy Grail featuring Justin Timberlake
Miley Cyrus — We Can’t Stop
Imagine Dragons — Radioactive
Macklemore — Can’t Hold Us (feat. Ray Dalton)

To make this more interesting, instead of looking at the absolute number of replays, I adjusted for popularity by looking at the ratio of replays to the total number of plays for each song. This replay ratio tells us what percentage of plays of a song are replays. If we plot the replay ratio vs. the number of fans a song has, the outliers become quite clear. Some songs are replayed at a higher rate than others.
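As a rough illustration of the replay ratio, here is a minimal sketch that takes per-song play and replay counts and ranks songs by replays divided by plays. The `replay_ratios` name, the `min_plays` cutoff and the sample counts are all assumptions for the example. The post only says that songs with few plays or fans were filtered out.

```python
def replay_ratios(total_plays, replays, min_plays=100):
    """Rank songs by replay ratio = replays / total plays.

    total_plays, replays: dicts mapping song -> count.
    Songs with fewer than min_plays plays are dropped (assumed cutoff).
    """
    ratios = {
        song: replays.get(song, 0) / count
        for song, count in total_plays.items()
        if count >= min_plays
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Made-up counts: a rarely played noise track vs. a very popular hit.
ranked = replay_ratios(
    {"Ocean Waves": 200, "Big Hit": 1000, "Obscure": 5},
    {"Ocean Waves": 182, "Big Hit": 100, "Obscure": 5},
)
for song, ratio in ranked:
    print(f"{ratio:.0%} replays {song}")
```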

I made an interactive version of this graph: you can mouse over the songs to see what they are and click on the songs to listen to them.

Sorting the results by the replay ratio yields a much more interesting result. It surfaces a few classes of frequently replayed songs: background noise, children’s music, soft and smooth pop and Friday night party music. Here’s the color coded list of the top 20:

Top Replayed songs by percentage

91% replays White Noise For Baby Sleep — Ocean Waves
86% replays Eric West — Reckless (From Playing for Keeps)
86% replays Soundtracks For The Masters — Les Contes D’hoffmann: Barcarole
83% replays White Noise For Baby Sleep — Warm Rain
83% replays Rain Sounds — Relax Ocean Waves
82% replays Dennis Wilson — Friday Night
81% replays Sleep — Ocean Waves for Sleep – White Noise
74% replays White Noise Sleep Relaxation White Noise Relaxation: Ocean Waves 7hz
74% replays Ween — Ocean Man
73% replays Children’s Songs Music — Whole World In His Hands
71% replays Glee Cast — Friday (Glee Cast Version)
63% replays Rain Sounds — Rain On the Window
63% replays Rihanna — Cheers (Drink To That)
60% replays Group 1 Crew — He Said (feat. Chris August)
59% replays Karsten Glück Simone Sommerland — Schlaflied für Anne
56% replays Monica — With You
54% replays Jessie Ware — Wildest Moments
53% replays Tim McGraw — I Like It, I Love It
53% replays Rain Sounds — Morning Rain In Sedona
52% replays Rain Sounds — Rain Sounds

It is no surprise that the list is dominated by background noise. There’s nothing like ambient ocean waves or rain sounds to help baby go to sleep in the noisy city. A five minute track of ambient white noise may be played dozens of times during every nap. It is not uncommon to find 8 hour long stretches of the same five minute white noise audio track played on auto repeat.

Setting aside the white noise, the top most replayed song is Reckless by Eric West from the ‘shamelessly sentimental’ 2012 movie Playing for Keeps (4% rotten). 86% of the time this song is played it is a replay. This is the song that you can’t listen to just once. It is the Lay’s potato chip of music. Beware, if you listen to it, you may be caught in its web and you’ll never be able to escape. Listen at your own risk:

Luckily, most people don’t listen to this song even once. It is only part of the regular listening rotation of a couple hundred listeners. Still, it points to a pattern that we’ll see more of – overly sentimental music has high replay value.

Top Replayed Popular Songs
Perhaps even more interesting is to look at the top most replayed popular songs. We can do this by restricting the songs in the results to those that are by artists that have a significant fan base:

31% replays Miley Cyrus — The Climb
16% replays August Alsina — I Luv This sh*t featuring Trinidad James
15% replays Brad Paisley — Whiskey Lullaby
14% replays Tamar Braxton — The One
14% replays Chris Brown — Love More
14% replays Anna Kendrick — Cups (Pitch Perfect’s “When I’m Gone”)
13% replays Avenged Sevenfold — Hail to the King
13% replays Jay-Z — Big Pimpin’
13% replays Labrinth — Beneath Your Beautiful
13% replays Karmin — Acapella
12% replays Lana Del Rey — Summertime Sadness [Lana Del Rey vs. Cedric Gervais]
12% replays MGMT — Electric Feel
12% replays One Direction — Best Song Ever
12% replays Big Sean — Beware featuring Lil Wayne, Jhené Aiko
12% replays Chris Brown — Don’t Think They Know
11% replays Justin Bieber — Boyfriend
11% replays Avicii — Wake Me Up
11% replays 2 Chainz — Feds Watching featuring Pharrell
10% replays Paramore — Still Into You
10% replays Alicia Keys — Fire We Make
10% replays Lorde — Royals
10% replays Miley Cyrus — We Can’t Stop
10% replays Ciara — Body Party
9% replays Marc Anthony — Vivir Mi Vida
9% replays Ellie Goulding — Burn
9% replays Fantasia — Without Me
9% replays Rich Homie Quan — Type of Way
9% replays The Weeknd — Wicked Games (Explicit)
9% replays A$AP Ferg — Work REMIX
9% replays Jay-Z — Part II (On The Run) featuring Beyoncé

It is hard to believe, but the data doesn’t lie – more than 30% of the time after someone listens to Miley Cyrus’s The Climb they listen to it again right away – proving that there is indeed always going to be another mountain that you are going to need to climb. Miley Cyrus is well represented – her aptly named song We Can’t Stop is the most replayed song of the top ten most popular songs.

Here are the top 30 most replayed popular songs in Spotify and Rdio playlists for you to enjoy, but I’m sure you’ll never get to the end of the playlist, you’ll just get stuck repeating The Best Song Ever or Boyfriend forever.

Here’s the Rdio version of the Top 30 Most Replayed popular songs:

http://www.rdio.com/people/plamere/playlists/5733386/Most_replayed/

Most Manually Replayed
More than once I’ve come back from lunch to find that I left my music player on auto repeat and it has played the last song 20 times while I was away. The song was playing, but no one was listening. It is more interesting to find song replays in which the replay is manually initiated. These are the songs that grabbed the attention of the listener enough to make them interact with their player and actually queue the song up again. We can find manually replayed songs by looking at replay timestamps. Replays generated by autorepeat will have a very regular timestamp delta, while manual replays will have more irregular deltas between timestamps.
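One way to turn that timestamp observation into code is to measure how much the gaps between consecutive replays vary. This is a minimal sketch, not the method actually used for the lists here: the `looks_automatic` name and the 2 second tolerance are assumptions chosen for illustration.

```python
from statistics import pstdev


def looks_automatic(timestamps, tolerance_secs=2.0):
    """Guess whether a run of replays came from auto repeat.

    timestamps: sorted play start times (seconds) for one user replaying
    one song. Auto repeat yields near-identical gaps, so a small standard
    deviation of the gaps suggests nobody touched the player.
    """
    deltas = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if len(deltas) < 2:
        return False  # too few replays to judge regularity
    return pstdev(deltas) <= tolerance_secs


print(looks_automatic([0, 300, 600, 900, 1200]))   # evenly spaced replays
print(looks_automatic([0, 300, 750, 1050, 1900]))  # irregular gaps
```

A single replay cannot be classified this way, so the sketch treats it as manual. A real pipeline might instead compare the gap to the track length.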

Here are the top manually replayed songs:

Body Party by Ciara
Still Into You by Paramore
Tapout featuring Lil Wayne, Birdman, Mack Maine, Nicki Minaj, Future by Rich Gang
Part II (On The Run) featuring Beyoncé by Jay-Z
Feds Watching featuring Pharrell by 2 Chainz
Royals by Lorde
V.S.O.P. by K. Michelle
Just Give Me A Reason by Pink
Don’t Think They Know by Chris Brown
Wake Me Up by Avicii

There’s an Rdio playlist of these songs: Most Manually Replayed

So what?
Why do we care which songs are most replayed? It’s part of our never ending goal to try to better understand how people interact with music. For instance, recognizing when music is being used in a context like helping the baby go to sleep is important – without taking this context into account, the thousands of plays of Ocean Waves and Warm Rain would dominate the taste profile that we build for that new mom and dad. We want to make sure that when that mom and dad are ready to listen to music, we can recommend something besides white noise.

Looking at replays can help us identify new artists for certain audiences. For instance, parents looking for an alternative to Miley Cyrus for their pre-teen playlists after Miley’s recent VMA performance may look to an artist like Fifth Harmony. Their song Miss Movin’ On has similar replay statistics to the classic Miley songs:

http://www.rdio.com/artist/Fifth_Harmony/album/Miss_Movin%27_On/track/Miss_Movin%27_On/

Finally, looking at replays is another tool to help us understand the music that people really like. If the neighbors play Rock Lobster 20 times in a row, you can be sure that they really, really like that song. (And despite, or perhaps because of, that night 30 years ago, I like the song too). You should give it a listen, or two…

http://www.rdio.com/artist/The_B-52%27s/album/Rock_Lobster_/_6060-842_(Digital_45)/track/Rock_Lobster/

data, Music, passion, replays

Reidentification of artists and genres in the KDD cup data

Posted by Paul in data, recommendation, research on June 21, 2011

Back in February I wrote a post about the KDD Cup (an annual Data Mining and Knowledge Discovery competition), asking whether this year’s cup was really music recommendation since all the data identifying the music had been anonymized. The post received a number of really interesting comments about the nature of recommendation and whether or not context and content were really necessary for music recommendation, or whether user behavior was all you really needed. A few commenters suggested that it might be possible to de-anonymize the data using a constraint propagation technique.

Many voiced an opinion that such de-anonymizing of the data to expose user listening habits would indeed be unethical. Malcolm Slaney, the researcher at Yahoo! who prepared the dataset offered the plea:

If you do de-anonymize the data please don’t tell anybody. We’ll NEVER be able to release data again.

As far as I know, no one has de-anonymized the KDD Cup dataset. However, researcher Matthew J. H. Rattigan of The University of Massachusetts at Amherst has done the next best thing. He has published a paper called Reidentification of artists and genres in the KDD cup data that shows that by analyzing the relational structures within the dataset it is possible to identify the artists, albums, tracks and genres that are used in the anonymized dataset. Here’s an excerpt from the paper that gives an intuitive description of the approach:

For example, consider Artist 197656 from the Track 1 data. This artist has eight albums described by different combinations of ten genres. Each album is associated with several tracks, with track counts ranging from 1 to 69. We make the assumption that these albums and tracks were sampled without replacement from the discography of some real artist on the Yahoo! Music website. Furthermore, we assume that the connections between genres and albums are not sampled; that is, if an album in the KDD Cup dataset is attached to three genres, its real-world counterpart has exactly three genres (or “Categories”, as they are known on the Yahoo! Music site).

Under the above assumptions, we can compare the unlabeled KDD Cup artist with real-world Yahoo! Music artists in order to find a suitable match. The band Fischer Z, for example, is an unsuitable match, as their online discography only contains seven albums. An artist such as Meatloaf certainly has enough albums (56) to be a match, but none of those albums contain more than 31 tracks. The entry for Elvis Presley contains 109 albums, 17 of which boast 69 or more tracks; however, there is no consistent assignment of genres that satisfies our assumptions. The band Tool, however, is compatible with Artist 197656. The Tool discography contains 19 albums containing between 0 and 69 tracks. These albums are described by exactly 10 genres, which can be assigned to the unlabeled KDD Cup genres in a consistent manner. Furthermore, the match is unique: of the 134k artists in our labeled dataset, Tool is the only suitable match for Artist 197656.
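To make the excerpt’s reasoning concrete, here is a heavily simplified sketch of the compatibility test. It checks only two necessary conditions taken from the excerpt: every anonymized album must pair with a distinct real album that has the same number of genres and at least as many tracks. It does not check that genre labels can be assigned consistently across albums, which the paper also requires, so it is not the paper’s algorithm. The `is_candidate` name and all album data below are made up for illustration.

```python
from collections import defaultdict


def is_candidate(anon_albums, real_albums):
    """Necessary-condition check for an artist match (simplified sketch).

    Each album is a (track_count, genre_count) tuple. Returns True when
    every anonymized album can be paired with a distinct real album that
    has the same genre count and at least as many tracks.
    """
    anon_by_genres = defaultdict(list)
    real_by_genres = defaultdict(list)
    for tracks, genres in anon_albums:
        anon_by_genres[genres].append(tracks)
    for tracks, genres in real_albums:
        real_by_genres[genres].append(tracks)

    for genres, anon_tracks in anon_by_genres.items():
        real_tracks = sorted(real_by_genres.get(genres, []), reverse=True)
        anon_tracks = sorted(anon_tracks, reverse=True)
        if len(real_tracks) < len(anon_tracks):
            return False  # not enough real albums with this genre count
        # Pair largest with largest; if that fails, no pairing can succeed.
        if any(a > r for a, r in zip(anon_tracks, real_tracks)):
            return False
    return True


anon = [(69, 3), (10, 2), (1, 2)]                 # hypothetical anonymized artist
big_catalog = [(69, 3), (12, 2), (5, 2), (30, 1)]
short_albums = [(31, 3), (12, 2), (5, 2)]         # no album reaches 69 tracks
print(is_candidate(anon, big_catalog), is_candidate(anon, short_albums))
```

Running a check like this against every labeled artist narrows the candidates. The consistent genre assignment described in the excerpt would then rule out survivors such as the Elvis Presley entry.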

Of course it is impossible for Matthew to evaluate his results directly, but he did create a number of synthetic, anonymized datasets drawn from Yahoo! and was able to demonstrate very high accuracy for the top artists and a 62% overall accuracy.

The motivation for this type of work is not to turn the KDD cup dataset into something that music recommendation researchers could use, but instead to get a better understanding of data privacy issues. By understanding how large datasets can be de-anonymized, it will be easier for researchers in the future to create datasets that won’t easily yield their hidden secrets. The paper is an interesting read – so since you are done doing all of your reviews for RecSys and ISMIR, go ahead and give it a read: https://www.cs.umass.edu/publication/docs/2011/UM-CS-2011-021.pdf. Thanks to @ocelma for the tip.

data, kdd cup, privacy


The Stairway Detector

Posted by Paul in data, fun, Music, playlist, The Echo Nest, web services on August 17, 2009

Last night I was watching the pilot for Glee (a snarky TV version of High School Musical) with my 3 teenage daughters. I was surprised to hear the soundtrack filled with songs by the band Journey, songs that brought me back to my own high school years. The thing that I like the most about Journey is that many of their songs have this slow and gradual build up over the course of the whole song, as in this song Lovin, Touchin, Squeezin:

A number of my favorite songs have this slow build up. The canonical example is Zep’s ‘Stairway to Heaven’ – it starts with a slow acoustic guitar and over the course of 8 minutes builds to metal frenzy. I thought it would be fun to see if I could write a bit of software that could find the songs that have the same arc as ‘Stairway to Heaven’ or ‘Lovin, Touchin Squeezin’ – songs that have this slow build. With this ‘stairway detector’ I could build playlists filled with the songs that fire me up.

The obvious place to start is to look at how the loudness of a song changes over time. To do this I used the Echo Nest developer API to extract the loudness as a function of time for Journey’s Lovin, Touchin, Squeezin:

In this plot the light green curve is the loudness, while the blue line is a windowed average of the loudness. This plot shows a nice rise in the volume over the course of the song. Compared to a song like the Beatles ‘Ticket to Ride’ that doesn’t have this upward slope:

From these two examples, it is pretty clear that we can build our stairway-detector just by looking at the average slope of the volume. The higher the slope, the bigger the build. Now, I suspect that there are lots of ways to find the average slope of a bumpy line – but I like to always try the simplest thing that could possibly work first – and for me the simplest thing was to just compare the average loudness of the two halves of the song as a ratio. So for example, with the Journey song the average loudness of the second half of the song is -15.86 dB and the average of the first half of the song is -24.37 dB. Because these decibel values are negative, the score works out as the first-half average divided by the second-half average: -24.37 / -15.86 gives a ratio of 1.54, while ‘Ticket to Ride’ gets a ratio of 1.06. Here’s the Journey song with averages shown:
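The half-versus-half score is easy to sketch. The loudness values are negative decibels, so the 1.54 quoted for the Journey song corresponds to the first-half average divided by the second-half average (-24.37 / -15.86), and the sketch below divides in that order. The `build_score` name and the four-sample input are illustrative assumptions. Real input would be the full loudness curve from the analysis.

```python
from statistics import mean


def build_score(loudness_db):
    """Score how much a song builds, from loudness samples in time order.

    loudness_db: loudness values in dB (negative numbers), at least two.
    Returns first-half average / second-half average. With negative dB
    values, a quiet start and a loud finish give a score above 1.
    """
    half = len(loudness_db) // 2
    first = mean(loudness_db[:half])
    second = mean(loudness_db[half:])
    return first / second


# Stand-in samples that reproduce the two half averages quoted above.
print(round(build_score([-24.37, -24.37, -15.86, -15.86]), 2))  # 1.54
```

A score below 1 marks a song that winds down instead of building. The sketch assumes the second-half average is never exactly 0 dB.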

Here are a few more songs that fit the ‘slow build’ profile:

‘Stairway to Heaven’ has a score of 1.6 so it has a bigger build than Journey’s Lovin’.

Simon and Garfunkel’s ‘Bridge Over Troubled Water’ has an even bigger build with a score of 1.7.

‘Also sprach Zarathustra’ has a more modest score of 1.56.

With this newfound metric I analyzed a few thousand of the tracks in my personal collection to find the songs with the biggest crescendos. The biggest of all was this song by Muse with a whopping score of 3.07:

Another find is Arcade Fire’s “My Body is a Cage” with a score of 2.32.

The metric isn’t perfect. For instance, I would have expected The Postal Service’s ‘Natural Anthem’ to have a high score because it has such a great build up, but it only gets a score of 1.19. Looking at the plot we can see why:

After the initial build up, there’s a drop in energy for the last quarter of the song, so even though the song has a sustained crescendo for 3 minutes it doesn’t get a high score due to this drop.

Of course, we can use this ratio to find tracks that go the other way, to find songs that gradually wind down. These seem to occur less frequently than the songs that build up. One example is Neutral Milk Hotel’s Two Headed Boy:

Despite the fact that I’m using a very naive metric to find the loudness slope, this stairway detector is pretty effective in finding songs that have that slow build. It’s another tool that I can use for helping to build interesting playlists. This is one of the really cool things about how the Echo Nest approaches music playlisting. By having an understanding of what the music actually sounds like, we can build much more interesting playlists than you get from genius-style playlists that only take into account artist co-occurrence.

data, Music, playlists, The Echo Nest


Artist similarity, familiarity and hotness

Posted by Paul in Music, recommendation, The Echo Nest, visualization, web services on May 25, 2009

The Echo Nest developer web services offer a number of interesting pieces of data about an artist, including similar artists, artist familiarity and artist hotness. Familiarity is an indication of how well known the artist is, while hotness (which we spell as the zoolanderish ‘hotttnesss’) is an indication of how much buzz the artist is getting right now. Top familiar artists are bands like Led Zeppelin, Coldplay, and The Beatles, while top ‘hottt’ artists are artists like Katy Perry, The Boy Least Likely To, and Mastodon.

I was interested in understanding how familiarity, hotness and similarity interact with each other, so I spent my Memorial Day morning creating a couple of plots to help me explore this. First, I was interested in learning how the familiarity of an artist relates to the familiarity of that artist’s similar artists. When you get the similar artists for an artist, is there any relationship between the familiarity of these similar artists and the seed artist? Since ‘similar artists’ are often used for music discovery, it seems to me that on average, the similar artists should be less familiar than the seed artist. If you like the very familiar ‘The Beatles’, I may recommend that you listen to ‘Bon Iver’, but if you like the less familiar ‘Bon Iver’ I wouldn’t recommend ‘The Beatles’. I assume that you already know about them. To look at this, I plotted the average familiarity for the top 15 most similar artists for each artist along with the seed artist’s familiarity. Here’s the plot:

In this plot, I’ve taken the top 18,000 most familiar artists and ordered them by familiarity. The red line is the familiarity of the seed artist, and the green cloud shows the average familiarity of the similar artists. In the plot we can see that there’s a correlation between artist familiarity and the average familiarity of similar artists. We can also see that similar artists tend to be less familiar than the seed artist. This is exactly the behavior I was hoping to see. Our similar artist function yields similar artists that, in general, have an average familiarity that is less than the seed artist’s.

This plot can help us QA our artist similarity function. If we see the average familiarity for similar artists deviate from the standard curve, there may be a problem with that particular artist. For instance, T-Pain has a familiarity of 0.869, while the average familiarity of T-Pain’s similar artists is 0.340. This is quite a bit lower than we’d expect – so there may be something wrong with our data for T-Pain. We can look at the similars for T-Pain and fix the problem.
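A QA check along these lines could be sketched as follows. This is a simplification: instead of comparing against the curve in the plot, it flags any artist whose familiarity exceeds the average familiarity of its similar artists by more than a fixed gap. The `flag_suspects` name, the 0.4 threshold, the individual similar-artist values and the second artist are assumptions made up for the example. Only T-Pain’s 0.869 and 0.340 come from the text.

```python
from statistics import mean


def familiarity_gap(seed_familiarity, similar_familiarities):
    """Seed familiarity minus the average familiarity of its similar artists."""
    return seed_familiarity - mean(similar_familiarities)


def flag_suspects(artists, max_gap=0.4):
    """Return names whose gap exceeds max_gap (assumed fixed threshold).

    artists: dict of name -> (seed_familiarity, [similar familiarities]).
    """
    return [
        name
        for name, (seed, similars) in artists.items()
        if similars and familiarity_gap(seed, similars) > max_gap
    ]


artists = {
    "T-Pain": (0.869, [0.34, 0.34, 0.34]),          # averages to 0.340
    "Hypothetical Band": (0.70, [0.60, 0.55, 0.65]),
}
print(flag_suspects(artists))
```

Flagged artists would still need a manual look at their similar-artist lists, as described for T-Pain.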

For hotness, the desired behavior is less clear. If a listener starting from a medium hot artist is looking for new music, it is unclear whether or not they’d like a hotter or colder artist. To see what we actually do, I looked at how the average hotness for similar artists compare to the hotness of the seed artist. Here’s the plot:

In this plot, the red curve shows the hotness of the top 18,000 most familiar artists. The shape of the curve is interesting: there are very few ultra-hot artists (artists with a hotness above 0.8) and very few familiar, ice cold artists (with a hotness of less than 0.2). The average hotness of the similar artists seems to be somewhat correlated with the hotness of the seed artist, but markedly less so than with the familiarity curve. For hotness, if your seed artist is hot, you are likely to get less hot similar artists, while if the seed artist is not hot, you are likely to get hotter artists. That seems like reasonable behavior to me.

Well, there you have it. Some Monday morning explorations of familiarity, similarity and hotness. Why should you care? If you are building a music recommender, familiarity and hotness are really interesting pieces of data to have access to. There’s a subtle game a recommender has to play: it has to give a certain number of familiar recommendations to gain trust, while also giving a certain number of novel recommendations in order to enable music discovery.

data, Music, The Echo Nest


Music Machinery
