sched.org support added to SXSW Artist Catalog
Posted by Paul in search, The Echo Nest on March 1, 2009
I’ve just pushed out a new version of my SXSW Artist Catalog that lets you add any artist to your SXSW schedule (via sched.org). Each artist now has a ‘schedule at sched.org’ link which brings you directly to the sched.org page for the artist where you can select the artist event that you are interested in and then add it to your schedule. It is pretty handy.
By the way, the integration with sched.org could not have been easier. Taylor McKnight added a search url of the form:
http://sxsw2009.sched.org/?searchword=DEVO
that brings you to the DEVO page at sched.org. Very nice.
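For anyone wiring up a similar link, here is a minimal sketch of how such a URL could be built. It is not the catalog's actual code: the `SchedLinks` class and `searchUrl` helper are hypothetical names, and only the base URL comes from the example above. It assumes sched.org accepts standard form-style query encoding (spaces as `+`), which the post does not confirm, and it uses the `URLEncoder.encode(String, Charset)` overload from Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SchedLinks {
    // Base of the search URL quoted in the post.
    private static final String SEARCH_BASE = "http://sxsw2009.sched.org/?searchword=";

    /** Builds a sched.org search link for an artist name (hypothetical helper). */
    static String searchUrl(String artistName) {
        // Encode the name so spaces and punctuation stay valid inside a query string.
        // Assumption: sched.org decodes form-style encoding ("+" for a space).
        return SEARCH_BASE + URLEncoder.encode(artistName, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("DEVO"));
        // A made-up name, just to show what happens to spaces and ampersands.
        System.out.println(searchUrl("Some Band & Friends"));
    }
}
```

Encoding matters mostly for artist names containing spaces, ampersands or accented characters; a plain name like DEVO passes through unchanged.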
While adding the sched support, I also did a recrawl of all the artist info, so the data should be pretty fresh.
Thanks to Steve for fixing things for me after I had botched things up on the deploy, and thanks in general to Sun for continuing to host the catalog.
By the way, doing this update was a bit of a nightmare. The key data for the guide is the artist list that is crawled from the SXSW site – but the SXSW folks have recently changed the format of the artist list (spreading it out over multiple pages, adding more context, etc.). I didn’t want to have to rewrite the parsing code (when working on a spare-time project, just the thought of working with regular expressions makes me close the IDE and fire up Team Fortress 2). Luckily, I had anticipated this event – my SXSW crawler had diligently been creating archives of every SXSW crawl, so if they did change formats, I could fall back on a previous crawl without needing to work on the parser. I’m so smart. Except that I had a bug. Here’s the archive code:
    public void createArchive(URL url) throws IOException {
        createArchiveDir();
        File file = new File(getArchiveName());
        if (!file.exists()) {
            URLConnection connection = url.openConnection();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(connection.getInputStream()));
            PrintWriter out = new PrintWriter(getArchiveName());
            String line = null;
            try {
                while ((line = in.readLine()) != null) {
                    out.println(line);
                }
            } finally {
                in.close();
            }
        }
    }
See the bug? Yep, I forgot to close the output file – which means that all of my many archive files were missing the last block of data, making them useless. My penance for this code-and-test sin was that I had to go and rewrite the SXSW parser to support the new format. But this turned out to be a good thing, since SXSW has been adding more artists. So this push has a new fresh crawl, with the absolute latest artists, fresh data from all of the sites like YouTube, Flickr, Last.fm and The Echo Nest. My bug makes more work for me, but a better catalog for you.
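One way to rule out this kind of bug is to let the language close both streams. The following is a minimal sketch, not the catalog's actual code: it uses try-with-resources (Java 7 or later), and the `CrawlArchiver` class name and the explicit `archiveFile` parameter are assumptions that stand in for the `createArchiveDir()` and `getArchiveName()` helpers, which the post does not show.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class CrawlArchiver {
    /** Saves the page at url to archiveFile unless an archive already exists there. */
    static void createArchive(URL url, Path archiveFile) throws IOException {
        // Stand-in for createArchiveDir(): make sure the target directory exists.
        if (archiveFile.getParent() != null) {
            Files.createDirectories(archiveFile.getParent());
        }
        if (Files.exists(archiveFile)) {
            return; // keep the earlier crawl untouched
        }
        // Both resources are closed when the block exits, even on an exception.
        // Closing the writer flushes its buffer, so the last block of data is written.
        try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(
                     Files.newBufferedWriter(archiveFile, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line);
            }
            // PrintWriter swallows write errors; checkError() flushes and reports them.
            if (out.checkError()) {
                throw new IOException("Failed writing archive " + archiveFile);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java CrawlArchiver <page-url> <archive-file>
        createArchive(URI.create(args[0]).toURL(), Path.of(args[1]));
    }
}
```

The sketch assumes the page is UTF-8 and, like the original, rewrites line endings through `println`. If the copy fails partway through, a partial file is still left behind, so a more careful version might write to a temporary file and rename it once the copy succeeds.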
One Blog, Two Blog, Old Blog, New Blog
Here are a couple of blogs to add to your blogroll. First, Stephen Green (aka SearchGuy) has started posting to his blog again. Steve writes in-depth articles about the innards of a search engine – and why that inverted text file that you created for your CS 301 homework is not going to put Google out of business anytime soon. It’s a good blog: SearchGuy.
Second, Jeremy seems to be blogging now – this makes me quite sad, because Jeremy has regularly emailed me blog fodder – so now that he has his own blog, I suspect that source will dry up. But it is all for the greater good. Jeremy is writing interesting articles about search from a higher vantage point than Steve. Jeremy says: “My idea was to have a place where interested researchers and search observers can gather, survey, and discuss information retrieval from a useful vantage point: somewhere tall where you can get a good overview of what is happening.” Jeremy is blogging at Information Retrieval Gupf.