Let’s say you have a block of text – perhaps a tweet or a web page from a music review site. If you want to find out whether the text mentions a particular artist such as Weezer, the task is pretty straightforward: just search through the text for the artist name and all of that artist’s variants and aliases.
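Here is a minimal sketch of that single-artist search in Python. The function name `mentions_artist` and the alias list are made up for illustration; a real alias list would come from an artist database.

```python
import re


def mentions_artist(text, names):
    """Return True if any name in `names` appears in `text` as a whole word or phrase."""
    for name in names:
        # The lookarounds keep 'Weezer' from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False


# Hypothetical alias list, for illustration only.
weezer_names = ["Weezer", "=w="]
print(mentions_artist("Just saw weezer live, amazing show", weezer_names))
```

Matching case-insensitively is fine here because we already know which artist we are looking for. That assumption stops holding once any artist could be the target.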
What is harder is trying to figure out if any artists are mentioned in a block of text, and if so, which ones. Since there are millions of artists, each with their own set of aliases and variants, the simple search that we use to find ‘Weezer’ in a tweet doesn’t work so well. The fact that many artist names are also common words adds to the difficulty.
Luckily I work with a bunch of really smart folks at The Echo Nest who’ve already had to solve this problem in order to make The Echo Nest work. Over on the Echo Nest blog, there’s a nifty description of the problem of artist name identification and extraction and an announcement of the release of a new (and very much beta) API called artist/extract that will expose some of this functionality to application developers that use our APIs.
This morning I spent a few minutes and created a little web app that lets you play with the artist/extract API. Here’s a screenshot:
In this example I’ve typed in the text:
I like Deerhoof, and Emerson, Lake and Palmer. I don’t like Coldplay, or Justin Bieber. GNR is OK. Go try it yourself!
You can see that it found Deerhoof and Coldplay (easy enough), and a spelling variant of Emerson, Lake & Palmer. It also recognized GNR as two bands: GNR (a Portuguese rock band) and Guns N’ Roses, for which GNR is a nickname. Also notice that it didn’t get confused by the ‘OK. Go’ that is embedded in there. The extractor is not always perfect, though. Since just about every English word is a band name, it tries hard to avoid confusing artists with regular English words, relying on letter case and other hints to separate real artist mentions from accidental ones.
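To make the letter-case idea concrete, here is a toy sketch in Python. This is not how the Echo Nest extractor works internally (the post doesn’t say); it is just a simple dictionary lookup that assumes a capitalized first word signals a real mention and that a name never spans a sentence break. The catalog, the made-up band ‘Like’, and the function name `extract_artists` are all hypothetical.

```python
import re

# Hypothetical, tiny catalog: normalized name -> artists it may refer to.
CATALOG = {
    "deerhoof": ["Deerhoof"],
    "coldplay": ["Coldplay"],
    "emerson lake and palmer": ["Emerson, Lake & Palmer"],
    "gnr": ["GNR", "Guns N' Roses"],
    "ok go": ["OK Go"],
    "like": ["Like"],  # made-up band to illustrate the common-word problem
}
MAX_WORDS = max(len(key.split()) for key in CATALOG)


def extract_artists(text):
    """Return catalog artists mentioned in text, using a capitalization hint."""
    found = []
    # Split on sentence-ending punctuation so a match cannot span two sentences.
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[\w']+", sentence)
        i = 0
        while i < len(tokens):
            step = 1
            # Try the longest candidate phrase first.
            for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
                key = " ".join(tokens[i:i + n]).lower()
                # Case hint: only accept a match whose first word is capitalized.
                if key in CATALOG and tokens[i][0].isupper():
                    found.extend(CATALOG[key])
                    step = n
                    break
            i += step
    return found


text = ("I like Deerhoof, and Emerson, Lake and Palmer. I don't like Coldplay, "
        "or Justin Bieber. GNR is OK. Go try it yourself!")
print(extract_artists(text))
```

With this toy catalog, the lowercase ‘like’ is skipped, ‘GNR’ maps to both bands, and ‘OK. Go’ is not matched because the period ends the sentence. The shortcuts are easy to break: any catalog word that happens to start a sentence would be accepted, and names containing periods would be split apart, which is why a real extractor needs more hints than letter case alone.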
The artist extractor is very much a beta API, so it may be a bit unsteady on its feet and may sometimes not work as you’d expect it to. Nevertheless, it is a nifty bit of music data infrastructure that will help us better understand who is talking about which artists.