Archive for category music information retrieval

Predicting High-level Music Semantics Using Social Tags via Ontology-based Reasoning

Predicting High-level Music Semantics Using Social Tags via Ontology-based Reasoning
Jun Wang, Xiaoou Chen, Yajie Hu and Tao Feng

ABSTRACT – High-level semantics such as “mood” and “usage” are very useful in music retrieval and recommendation but they are normally hard to acquire. Can we predict them from a cloud of social tags? We propose a semantic iden- tification and reasoning method: Given a music taxonomy system, we map it to an ontology’s terminology, map its finite set of terms to the ontology’s assertional axioms, and then map tags to the closest conceptual level of the referenced terms in WordNet to enrich the knowledge base, then we predict richer high-level semantic informa- tion with a set of reasoning rules. We find this method predicts mood annotations for music with higher accuracy, as well as giving richer semantic association information, than alternative SVM-based methods do.

In this paper, the authors use word-net to map social tags to a professional taxonomy and then use these for traditional tagging tasks such as classification and mood identification.

Leave a comment

Learning Tags that Vary Within a Song

Learning Tags that Vary Within a Song
Michael I. Mandel, Douglas Eck and Yoshua Bengio

Abstract:This paper examines the relationship between human generated tags describing different parts of the same song. These tags were collected using Amazon’s Mechanical Turk service. We find that the agreement between different people’s tags decreases as the distance between the parts of a song that they heard increases. To model these tags and these relationships, we describe a conditional restricted Boltzmann machine. Using this model to fill in tags that should probably be present given a context of other tags, we train automatic tag classifiers (autotaggers) that outperform those trained on the original data.

Michael enlisted the Amazon Turk to tag music. They paid for about a penny per tag.  About 11% of the tags were spam.  He then looked at co-occurence data that showed interesting patterns, especially as related to the distance between clips in a single track.

Results shows that smoothing of the data with a boltzmann machine tends to give better accuracy when tag data is sparse.

Leave a comment

Sparse Multi-label Linear Embedding Within Nonnegative Tensor Factorization Applied to Music Tagging

Sparse Multi-label Linear Embedding Within Nonnegative Tensor Factorization Applied to Music Tagging Yannis Panagakis, Constantine Kotropoulos and Gonzalo R. Arce

Abstract: A novel framework for music tagging is proposed. First, each music recording is represented by bio-inspired auditory temporal modulations. Then, a multilinear subspace learning algorithm based on sparse label coding is developed to effectively harness the multi-label information for dimensionality reduction. The proposed algorithm is referred to as Sparse Multi-label Linear Embedding Non- negative Tensor Factorization, whose convergence to a stationary point is guaranteed. Finally, a recently proposed method is employed to propagate the multiple labels of training auditory temporal modulations to auditory temporal modulations extracted from a test music recording by means of the sparse l1 reconstruction coefficients. The overall framework, that is described here, outperforms both humans and state-of-the-art computer audition systems in the music tagging task, when applied to the CAL500 dataset.

This paper gets the ‘Title that rolls off the tongue best’ award.   I don’t understand all of the math for this one, but some notes – the wavelet-based features used, seem to be good at discriminating at the genre level.  He compares the system to Doug Turnbull’s MixHier and to the system that we built at Sun labs with Thierry, Doug, Francois and myself (Autotagger: A model for predicting social tags from acoustic features on Large Music Databases)


1 Comment

MIREX 2010

Dr. Downie gave a summary of MIREX 2010 – the evaluation track for MIR (it is like TREC for MIR).  Results are here:  MIREX-2010 results

Matthias Mauch makes a case for improving chord recognition by making the MIREX tasks harder. He’d like to gently increase the difficulty.

Leave a comment

Improving the Generation of Ground Truths Based on Partially Ordered Lists

Improving the Generation of Ground Truths Based on Partially Ordered Lists
Julián Urbano, Mónica Marrero, Diego Martín and Juan Lloréns

abstract: Ground truths based on partially ordered lists have been used for some years now to evaluate the effectiveness of Music Information Retrieval systems, especially in tasks related to symbolic melodic similarity. However, there has been practically no meta-evaluation to measure or improve the correctness of these evaluations. In this paper we revise the methodology used to generate these ground truths and disclose some issues that need to be addressed. In particular, we focus on the arrangement and aggrega- tion of the relevant results, and show that it is not possi- ble to ensure lists completely consistent. We develop a measure of consistency based on Average Dynamic Re- call and propose several alternatives to arrange the lists, all of which prove to be more consistent than the original method. The results of the MIREX 2005 evaluation are revisited using these alternative ground truths.

Current approach of Partially Ordered Lists for evaluating tasks like melody similarity may not be the best way to evaluate.

  • They are expensive
  • They have some odd results
  • They are hard to replicate
  • leave out relevant results
  • Inconsistencies among the expert evaluations are not treated properly

Authors propose an alternate aggregation:

  • All: a new group is started if the pivot incipit is significantly different from every incipit in the current group. This should lead to larger groups.
  • Any: a new group is started if the pivot incipit is significantly different from any incipit in the current group. This should lead to smaller groups.
  • Prev: a new group is started if the pivot incipit is significantly different from the previous one.

Authors applied the new ranking system to MIREX 2005 – resulting in lowered performance and modifying ranking of several systems.

Leave a comment

A Cartesian Ensemble of Feature Subspace Classifiers for Music Categorization

A Cartesian Ensemble of Feature Subspace Classifiers for Music Categorization (pdf)

Thomas Lidy, Rudolf Mayer, Andreas Rauber, Pedro J. Ponce de León, Antonio Pertusa, and Jose Manuel Iñesta

Abstract: We present a cartesian ensemble classification system that is based on the principle of late fusion and feature sub- spaces. These feature subspaces describe different aspects of the same data set. The framework is built on the Weka machine learning toolkit and able to combine arbitrary fea- ture sets and learning schemes. In our scenario, we use it for the ensemble classification of multiple feature sets from the audio and symbolic domains. We present an extensive set of experiments in the context of music genre classifi- cation, based on numerous Music IR benchmark datasets, and evaluate a set of combination/voting rules. The results show that the approach is superior to the best choice of a single algorithm on a single feature set. Moreover, it also releases the user from making this choice explicitly.

An ensemble classification system built on top of Weka:

Results, using different datasets, classifiers and feature sets:

Execution times were about 10 seconds per song, so rather slow for large collections.

The ensemble approach delivered superior results through adding a reasonable amount of feature sets and classifiers.  However, they did not discover a combination rule that always outperforms all the others.

Leave a comment


Noam Koenigstein, Yuval Shavitt, Ela Weinsberg, and Udi Weinsberg

abstract:Peer-to-Peer (p2p) networks are being increasingly adopted as an invaluable resource for various music information re- trieval (MIR) tasks, including music similarity, recommen- dation and trend prediction. However, these networks are usually extremely large and noisy, which raises doubts re- garding the ability to actually extract sufficiently accurate information.

This paper evaluates the applicability of using data orig- inating from p2p networks for MIR research, focusing on partial crawling, inherent noise and localization of songs and search queries. These aspects are quantified using songs collected from the Gnutella p2p network. We show that the power-law nature of the network makes it relatively easy to capture an accurate view of the main-streams using relatively little effort. However, some applications, like trend prediction, mandate collection of the data from the “long tail”, hence a much more exhaustive crawl is needed. Furthermore, we present techniques for overcoming noise originating from user generated content and for filtering non informative data, while minimizing information loss

Observation – CF systems tend to outperform content-based systems until you get in the long tail – so to improved CF systems, you need more long tail data.  This work explores how to get more long tail data by mining p2p networks.

P2P systems have some problems – privacy concerns, data collection is hard. High user churn, very noisy data, some users delete content from shared folders right away, sparsity

P2P mining Shared folders are useful for similarity, search queries are useful for trends.

Lots of p2p challenges and steps – getting IP addresses for p2p nodes, filtering out non-musical content, geo-identification, anonymization.

Dealing with sparsity:  1.2 million users, but average of 1 artist/song data point for each artist/song relation.  These graphs show song popularity in shared folders. They use this data to help filter out non-typical users.

Identifying songs: Use the hash file – but of course many songs have many different digital copies – so they also look at the (noisy) metadata.

Songs Discovery Rate

Once you reach about 1/3 of the network you’ve found most of the tracks if you use metadata for resolving.  If you use the hashes, you need to crawl 70% of the network.

Using shared folders for similarity

There’s a preferential attachment model for popular  songs

Conclusion: P2P data is good source of long tail data, but dealing with the noisy data is hard.  The p2p data is especially good for building similarity models localized to countries. A good talk with from someone with lots of experience with p2p stuff.

Leave a comment


MUSIC EMOTION RECOGNITION: A STATE OF THE ART REVIEWYoungmoo E. Kim, Erik M. Schmidt, Raymond Migneco, Brandon G. Morton Patrick Richardson, Jeffrey Scott, Jacquelin A. Speck, and Douglas Turnbull (pdf)

Youngmoo presents on mood

From the paper: Recognizing musical mood remains a challenging problem primarily due to the inherent ambiguities of human emotions. Though research on this topic is not as mature as some other Music-IR tasks, it is clear that rapid progress is being made. In the past 5 years, the performance of automated systems for music emotion recognition using a wide range of annotated and content-based features (and multi-modal feature combinations) have advanced significantly. As with many Music-IR tasks open problems remain at all levels, from emotional representations and annotation methods to feature selection and machine learning.

While significant advances have been made, the most accurate systems thus far achieve predictions through large-scale machine learning algorithms operating on vast feature sets, sometimes spanning multiple domains, applied to relatively short musical selections. Oftentimes, this approach reveals little in terms of the underlying forces driving the perception of musical emotion (e.g., varying contributions of features) and, in particular, how emotions in music change over time. In the future, we anticipate further collaborations between Music-IR researchers, psychologists, and neuroscientists, which may lead to a greater understanding of not only mood within music, but human emotions in general. Furthermore, it is clear that individu- als perceive emotions within music differently. Given the multiple existing approaches for modeling the ambiguities of musical mood, a truly personalized system would likely need to incorporate some level of individual profiling to adjust its predictions.

This paper has provided a broad survey of the state of the art, highlighting many promising directions for further research. As attention to this problem increases, it is our hope that the progress of this research will continue to accelerate in the near future.

My notes:

Mirex performance on mood classification has held steady for the last few years.  Most mood classification systems in Mirex are just adapted genre classifiers.

Categorical vs. dimensional

Categorical: Mirex classifies mood into 5 clusters:

Dimensional:  – the ever popular Valence-Arousal space – sometimes called the Thayer Mood model:

Typical emotion classification system

Ground Truth

A big challenge is to come up with groundtruth for training a recognition system. tags,  GWAP,  AMG labels, web documents are common sources.

Lyrics – using lyrics alone has not been too successful for  mood classification.

Content-based methods – typical features for mood:

Youngmoo’s latest work (with Eric Schmidt) is showing the distribution and change of emotion over time.

Hybrid systems

  • Audio + Lyrics – some to high improvement
  • Audio + Tags = good improvement
  • Audio + Images = using album art to derive associations to mood

Conclusions – Mood recognition hasn’t improved much in recent years – probably because most systems are not really designed specifically for mood.

This was a great overview of the state-of-the-art. I’d be interested in hearing a much longer version of this talk.  The paper and the references will be a great resource for anyone who’s interested in pursuing mood classification.

Leave a comment

Solving Misheard Lyric Search Queries ….

Solving Misheard Lyric Search Queries  using a Probabilistic Model of Speech Sounds

Hussein Hirjee and Daniel G. Brown

People often use lyrics to find songs – and they get them wrong. Some examples  ‘Nirvana’ = “Don’t walk on guns, burn your friends”.  Approach: look at using phonetic similarity.  They adapt ‘blast’ from DNA sequence matching to the problem.  Lyrics are represented as a sequence of matrix.

For training, they get data from ‘misheard lyrics’ sites like  They align the misheard with the real lyrics – to build a model of frequently  misheard phonemes.  They tested with misheard lyrics.  Scored with 5 different models.

Evaluation: Mean Reciprocal Rank and Hit Rank by Rank.  The approach compared well with previous techniques.  Still, 17% of lyrics are still not identified – some are just bad queries, but dealing with short queries is a source of errors.  They also looked at phoneme confusion, in particular confusions caused by singing.

Future work: look at phoneme trigrams, and build a web site. Quesitioner suggests that they create a mondegreen generator

Good presentation, interesting, fun  problem area.

Leave a comment



This is a new chroma extraction method using a non-negative least squares (NNLS) algorithm for prior approximate note transcription. Twelve different chroma methods were tested for chord transcription accuracy on popular music, using an existing high- level probabilistic model. The NNLS chroma features achieved top results of 80% accuracy that significantly exceed the state of the art by a large margin.

We have shown that the positive influence of the approximate transcription is particularly strong on chords whose harmonic structure causes ambiguities, and whose identification is therefore difficult in approaches without prior approximate transcription. The identification of these difficult chord types was substantially increased by up to twelve percentage points in the methods using NNLS transcription.

Matthias is an enthusiastic presenter who did not hesitate to jump onto the piano to demonstrate ‘difficult chords’.  Very nice presentation.

Leave a comment