Speechiness – is it banjo or banter?

There’s no bigger buzz kill when listening to a playlist of songs by your favorite artists than to find that you are no longer listening to music, but instead to some radio interview the drummer of the band gave to some local radio station in 1963.  If you’ve listened to much Internet radio this has probably happened to you.  An algorithmic playlisting engine may know that it is time to play a track by The Beatles, but it probably doesn’t know which tracks in the Beatles discography are music and which ones are interviews, and so sooner or later you’ll find yourself listening to Ringo talking about his new haircut instead of listening to While My Guitar Gently Weeps.

To help deal with this type of problem, The Echo Nest has just pushed out a new analysis attribute called Speechiness. Speechiness is a number between zero and one that indicates how likely it is that a particular audio file is speech. Whenever you analyze a track with the Echo Nest analyzer, the track will be assigned a speechiness score. If the track has a high speechiness score, it is probably mostly speech; if it has a low score, it is mostly non-speech. This speechiness parameter is a pretty good way to distinguish between music tracks and non-music tracks.

As an example, let’s look at tracks by comedian and banjo player Steve Martin. Steve has a large collection of comedy tracks, but he’s also an accomplished bluegrass banjo player (What’s the difference between a chain saw and a banjo? You can turn a chain saw off.). We took 35 Steve Martin tracks, calculated the speechiness of each, and ordered them by increasing speechiness. Here’s a plot of the speechiness for these tracks:

You can see there’s a nice stable flat zone of low-speechiness tracks (the banjo and bluegrass ones) and a stable flat zone of high-speechiness tracks (the standup comedy). In between are some hybrid tracks, like Ramblin man, a comedy routine with a banjo accompaniment. I created a web page where you can audition the tracks to see how well the speechiness attribute has separated the banjo from the banter.

I think it is quite cool how the speechiness attribute was able to separate the music from the spoken word.
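
The ordering used for the plot can be sketched in a few lines of Python. This is only an illustration: the two cut-off values, the function names, and the sample titles and scores are all made up, and The Echo Nest does not publish thresholds like these.

```python
# Hypothetical cut-offs chosen for illustration; these are not Echo Nest values.
MUSIC_MAX = 0.33
SPEECH_MIN = 0.66


def label(speechiness):
    """Map a speechiness score in [0, 1] to a coarse label."""
    if not 0.0 <= speechiness <= 1.0:
        raise ValueError("speechiness must be between 0 and 1")
    if speechiness <= MUSIC_MAX:
        return "music"
    if speechiness >= SPEECH_MIN:
        return "speech"
    return "hybrid"


def rank_by_speechiness(tracks):
    """Return (title, score, label) tuples ordered by increasing speechiness."""
    ordered = sorted(tracks, key=lambda pair: pair[1])
    return [(title, score, label(score)) for title, score in ordered]


# Made-up titles and scores, only to show the shape of the data.
sample = [("standup bit", 0.91), ("banjo tune", 0.05), ("banjo with jokes", 0.48)]

for title, score, kind in rank_by_speechiness(sample):
    print(f"{score:.2f}  {kind:<6}  {title}")
```

With real scores, the flat zones at either end of the plot would show up as long runs of "music" and "speech" labels, with the hybrid tracks in between.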

Trying this yourself

Brian put together a quick demo that lets you calculate the speechiness of any track that’s on SoundCloud. Brian put the demo together in an hour, so he says it is ‘totally buggy and hacky’, but so far it has worked great for me. Just enter the URL of any SoundCloud track, wait half a minute, and see the speechiness score. A result in the green is probably music (or some other non-speech audio), while a result in the red is speech.

The demo is pretty cool. Go to speechiness.echonest.com to try it out. If you don’t have any SoundCloud tracks handy, here are some tracks to try (expect these direct links to take 30 seconds to load, since the speechiness web app triggers a full song analysis on page load):

API Access to Speechiness
You can programmatically calculate the speechiness attribute using the Echo Nest track API.   With this API you can upload and analyze a track. The audio summary returned via the track/profile method will include the new speechiness attribute.
For example, here’s the call to get back the speechiness for an excerpt of Dizzy Miss Lizzie and here’s the response:
{
    "response": {
        "status": {
            "code": 0,
            "message": "Success",
            "version": "4.2"
        },
        "track": {
            "analyzer_version": "3.08d",
            "artist": "The Beatles",
            "artist_id": "AR6XZ861187FB4CECD",
            "audio_summary": {
                "analysis_url": "https://echonest-analysis.s3.amazonaws.com/TR/TRAVQYP13369CD8BDC/3/full.json?Signature=EEiMYDzPquMmlW7fJlLvdWKI6PI%3D&Expires=1320409614&AWSAccessKeyId=AKIAJRDFEY23UEVW42BQ",
                "danceability": 0.37855052706867015,
                "duration": 54.999549999999999,
                "energy": 0.85756107654449365,
                "key": 9,
                "loudness": -10.613,
                "mode": 1,
                "speechiness": 0.1824877387165752,
                "tempo": 91.356999999999999,
                "time_signature": 5
            },
            "bitrate": 2425500,
            "id": "TRAVQYP13369CD8BDC",
            "md5": "ec2d40704439f5650b67884e00242d99",
            "release": "Help!",
            "samplerate": 44100,
            "song_id": "SOINKRY12B20E5E547",
            "status": "complete",
            "title": "Dizzy Miss Lizzie"
        }
    }
}
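
Here is a rough Python sketch of how you might build a track/profile request and pull the speechiness out of a response like the one above. The endpoint URL, the parameter names (`api_key`, `format`, `id`, `bucket`), and the helper names are assumptions for illustration, so check them against the Echo Nest API documentation before relying on them. The sketch makes no network call; it parses a trimmed copy of the response shown above.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; verify against the Echo Nest API documentation.
PROFILE_URL = "http://developer.echonest.com/api/v4/track/profile"


def profile_url(api_key, track_id):
    """Build a track/profile request URL (parameter names are assumptions)."""
    params = {
        "api_key": api_key,
        "format": "json",
        "id": track_id,
        "bucket": "audio_summary",
    }
    return PROFILE_URL + "?" + urlencode(params)


def speechiness_from_response(body):
    """Return the speechiness from a track/profile JSON body, or None."""
    response = json.loads(body)["response"]
    status = response["status"]
    if status["code"] != 0:
        raise RuntimeError(status["message"])
    track = response["track"]
    if track.get("status") != "complete":
        return None  # analysis has not finished yet
    return track.get("audio_summary", {}).get("speechiness")


# Trimmed copy of the response shown above.
body = """
{"response": {
    "status": {"code": 0, "message": "Success", "version": "4.2"},
    "track": {
        "audio_summary": {"speechiness": 0.1824877387165752},
        "id": "TRAVQYP13369CD8BDC",
        "status": "complete",
        "title": "Dizzy Miss Lizzie"
    }
}}
"""

print(profile_url("YOUR_API_KEY", "TRAVQYP13369CD8BDC"))
print(speechiness_from_response(body))
```

Returning None when the speechiness is absent, rather than raising, reflects that the attribute may not be present on every track yet.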

Wrapping up

The speechiness attribute is an alpha release. There may still be some tweaks to the algorithm in the near future. We’ve currently applied the attribute to the top 100,000 or so most popular tracks in The Echo Nest. Once we are totally satisfied with the algorithm, we will apply it to all of our many millions of tracks and incorporate it into our search and playlisting APIs, allowing you to filter and sort results based upon speechiness. In the future you’ll be able to make that Beatles playlist and limit the results to tracks that have a low speechiness, eliminating the haircut interviews entirely from your listening rotation (or, conversely and perversely, you’ll be able to create a playlist with just the haircut interviews). Congrats to The Echo Nest Audio team for rolling out this really useful feature.
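
Until the playlisting APIs support this directly, a client-side filter is easy to sketch. The function name, the 0.33 cut-off, the dictionary layout, and the interview's score below are hypothetical; the `keep_unscored` flag is there because not every track has a speechiness score yet.

```python
def low_speechiness(tracks, max_speechiness=0.33, keep_unscored=False):
    """Keep tracks whose speechiness is at or below max_speechiness.

    Each track is a dict that may carry a "speechiness" key. Tracks with
    no score are dropped unless keep_unscored is True.
    """
    kept = []
    for track in tracks:
        score = track.get("speechiness")
        if score is None:
            if keep_unscored:
                kept.append(track)
            continue
        if score <= max_speechiness:
            kept.append(track)
    return kept


# The first score comes from the API response above; the second is made up.
playlist = [
    {"title": "Dizzy Miss Lizzie", "speechiness": 0.1824877387165752},
    {"title": "haircut interview", "speechiness": 0.95},
    {"title": "not yet analyzed"},
]

for track in low_speechiness(playlist):
    print(track["title"])
```

Flipping the comparison to keep only high scores would give the perverse interviews-only playlist.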

  1. #1 by bobo on November 4, 2011 - 12:34 pm

    Very nice idea the speechiness concept !!!

    Sad, it doesn’t seems to work out with environmental sounds :-(

    http://speechiness.echonest.com/speechy?url=http%3A%2F%2Fsoundcloud.com%2Faudiosense%2F5_minutes_of_lightning_strikes

    Boris

    • #2 by Paul on November 4, 2011 - 1:04 pm

      Hi Boris – in what way don’t you think it works? We here at the Echo Nest are obviously mostly interested in music vs. speech. This particular track gives an ambiguous result, because it is not music and it is not speech. That is a good answer for us.

  2. #3 by bobo on November 7, 2011 - 3:47 am

    Yeah Paul, I know The EN doesn’t deal with environmental sounds. Just wanted to tickle ;-)

    I suggest it should answer : “it is not music and non speech, you tried to mislead me!”

    Congrats for all your work an openness
