If you write software for music applications, then you understand the difficulties of matching artist names. There are lots of issues: spelling errors, stop words (‘the beatles’ vs. ‘beatles, the’ vs. ‘beatles’), punctuation (is it “Emerson, Lake and Palmer” or “Emerson, Lake & Palmer”?), and common aliases (ELP, GNR, CSNY, Zep), to name just a few. One common problem is dealing with international characters. Most Americans don’t know how to type accented characters on their keyboards, so when they are looking for Beyoncé they will type ‘beyonce’. If you want your application to find the proper artist for these queries, you are going to have to deal with these missing accents in the query. One way to do this is to extend the artist name matching to include a check against a version of the artist name where all of the accents have been removed. However, this is not so easy to do. You could certainly build a mapping table of all the possible accented characters, but that is prone to failure: you may neglect some obscure character mapping (like that funny ř in Antonín Dvořák).
Luckily, in Java 1.6 there’s a pretty reliable way to do this. Java 1.6 added a Normalizer class to the java.text package. The Normalizer class allows you to apply Unicode normalization to strings. In particular, you can apply canonical decomposition (form NFD), which replaces each precomposed character with a base character followed by its combining accents. Once you do this, it’s a simple string replace to get rid of the accents. Here’s a bit of code to remove accents:
import java.text.Normalizer;

// Decompose to NFD, then strip the combining marks left behind.
public static String removeAccents(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
This is nice, straightforward code, and it leaves plain ASCII strings that have no accents unchanged.
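To make the "check against an accent-free version of the name" idea concrete, here is a minimal sketch of how removeAccents could be wired into a lookup. The ArtistIndex class and its matchKey, add and find methods are names I made up for illustration, and lower-casing the key is my own assumption rather than anything the Normalizer requires. The accented characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: maps accent-free, lower-cased keys to canonical names.
class ArtistIndex {
    // Compiled once; matches the combining marks that NFD splits off.
    private static final Pattern COMBINING_MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    private final Map<String, String> byKey = new HashMap<>();

    static String removeAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return COMBINING_MARKS.matcher(decomposed).replaceAll("");
    }

    // Assumed key format: accents removed, lower-cased, trimmed.
    static String matchKey(String name) {
        return removeAccents(name).toLowerCase(Locale.ROOT).trim();
    }

    void add(String canonicalName) {
        byKey.put(matchKey(canonicalName), canonicalName);
    }

    // Returns the canonical name, or null when nothing matches.
    String find(String query) {
        return byKey.get(matchKey(query));
    }

    public static void main(String[] args) {
        ArtistIndex index = new ArtistIndex();
        String[] names = {
            "Beyonc\u00e9",               // e with acute
            "Bj\u00f6rk",                 // o with diaeresis
            "Anton\u00edn Dvo\u0159\u00e1k" // i acute, r caron, a acute
        };
        for (String name : names) {
            index.add(name);
        }
        System.out.println(index.find("beyonce"));
        System.out.println(index.find("antonin dvorak"));
    }
}
```

Because the query and the stored name both go through the same matchKey, a user who types the accents and one who doesn’t end up at the same entry. Two different artists whose names differ only by accents would collide in this sketch, so a real index would probably want to keep a list per key.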
Of course ‘removeAccents’ doesn’t solve all of the problems. It certainly won’t help you deal with artist names like ‘KoЯn’, nor will it deal with the wide range of artist name misspellings. If you are trying to normalize artist names, you should read how Columbia researcher Dan Ellis has approached the problem. I suspect that someday (soon, I hope) there will be a magic music web service that will solve this problem once and for all, and you’ll never again have to scratch your head at why you are listening to a song by Peter Bjorn and John instead of a song by Björk.
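To show where the decomposition trick stops working, here is a small sketch. The Я in ‘KoЯn’ is a Cyrillic letter with no canonical decomposition, so NFD leaves it alone, and the same goes for letters like ø and ł whose stroke is part of the letter rather than a combining mark. One stopgap is a small hand-built table applied after removeAccents. The LookalikeFolding class, the fold method and the three table entries are all hypothetical, picked by me for illustration, and a table like this has exactly the completeness problem that mapping tables always have.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical second pass for characters that NFD does not decompose.
class LookalikeFolding {
    // Hand-picked, illustrative entries only; not a complete table.
    private static final Map<Character, Character> LOOKALIKES = new HashMap<>();
    static {
        LOOKALIKES.put('\u042F', 'R'); // Cyrillic capital Ya, used as a backwards R
        LOOKALIKES.put('\u00F8', 'o'); // Latin small o with stroke
        LOOKALIKES.put('\u0142', 'l'); // Latin small l with stroke
    }

    static String removeAccents(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Strip combining accents first, then apply the look-alike table.
    static String fold(String text) {
        String stripped = removeAccents(text);
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            char mapped = LOOKALIKES.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String korn = "Ko\u042Fn";
        // removeAccents alone returns the input unchanged here.
        System.out.println(removeAccents(korn).equals(korn));
        System.out.println(fold(korn));
    }
}
```

This only papers over a few known cases. Misspellings and aliases still need something fuzzier than character folding.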
#1 by Daniel Lemire on April 10, 2009 - 4:31 pm
Another example… Celine Dion’s name is actually Céline Dion.
#2 by brian on April 10, 2009 - 5:58 pm
lucene/solr does this very well with the ISOLatin1AccentFilter http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/analysis/ISOLatin1AccentFilter.html
#3 by Norman Casagrande on April 11, 2009 - 6:50 am
At last.fm we have a pretty advanced auto-correcting system in place. Just try and look for http://www.last.fm/music/Korn