Improving the Generation of Ground Truths Based on Partially Ordered Lists
Julián Urbano, Mónica Marrero, Diego Martín and Juan Lloréns
abstract: Ground truths based on partially ordered lists have been used for some years now to evaluate the effectiveness of Music Information Retrieval systems, especially in tasks related to symbolic melodic similarity. However, there has been practically no meta-evaluation to measure or improve the correctness of these evaluations. In this paper we revise the methodology used to generate these ground truths and disclose some issues that need to be addressed. In particular, we focus on the arrangement and aggregation of the relevant results, and show that it is not possible to ensure lists completely consistent. We develop a measure of consistency based on Average Dynamic Recall and propose several alternatives to arrange the lists, all of which prove to be more consistent than the original method. The results of the MIREX 2005 evaluation are revisited using these alternative ground truths.
The current approach of building ground truths from partially ordered lists may not be the best way to evaluate tasks such as melodic similarity:
- They are expensive
- They produce some odd results
- They are hard to replicate
- They leave out relevant results
- Inconsistencies among the expert evaluations are not treated properly
The authors propose three alternative aggregation rules for deciding when to start a new group:
- All: a new group is started if the pivot incipit is significantly different from every incipit in the current group. This should lead to larger groups.
- Any: a new group is started if the pivot incipit is significantly different from any incipit in the current group. This should lead to smaller groups.
- Prev: a new group is started if the pivot incipit is significantly different from the previous one.
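The three rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `group_incipits`, the `rule` argument, and the `differs` predicate are hypothetical, the incipits are assumed to arrive already sorted, and the toy threshold on integer scores stands in for whatever significance test is actually applied.

```python
from typing import Callable, List, Sequence


def group_incipits(
    ordered: Sequence[str],
    differs: Callable[[str, str], bool],
    rule: str = "all",
) -> List[List[str]]:
    """Split an already-ordered list of incipits into groups.

    `differs(a, b)` is a hypothetical predicate that returns True when
    two incipits are judged significantly different.
    """
    groups: List[List[str]] = []
    for pivot in ordered:
        if not groups:
            # The first incipit always opens the first group.
            groups.append([pivot])
            continue
        current = groups[-1]
        if rule == "all":
            # New group only if the pivot differs from every member.
            start_new = all(differs(pivot, other) for other in current)
        elif rule == "any":
            # New group as soon as the pivot differs from one member.
            start_new = any(differs(pivot, other) for other in current)
        elif rule == "prev":
            # New group if the pivot differs from the previous incipit.
            start_new = differs(pivot, current[-1])
        else:
            raise ValueError(f"unknown rule: {rule}")
        if start_new:
            groups.append([pivot])
        else:
            current.append(pivot)
    return groups


# Toy data (assumption): one integer score per incipit, and two incipits
# count as "significantly different" when their scores differ by more than 5.
scores = {"a": 10, "b": 14, "c": 18, "d": 30}


def toy_differs(x: str, y: str) -> bool:
    return abs(scores[x] - scores[y]) > 5


ordered = ["a", "b", "c", "d"]
for rule in ("all", "any", "prev"):
    print(rule, group_incipits(ordered, toy_differs, rule))
```

On this toy input, All and Prev both keep `a`, `b` and `c` together, while Any splits `c` off because it differs from `a`, which matches the expectation that Any yields smaller groups than All.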
The authors applied the new ground truths to the MIREX 2005 evaluation, which lowered the measured performance and changed the ranking of several systems.