
Million Song Dataset

It is our pleasure to announce the release of the Million Song Dataset, a new resource to support music information research.

The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks.

Its purposes are:
   * To encourage research on algorithms that scale to commercial sizes
   * To provide a reference dataset for evaluating research
   * To offer a shortcut alternative to creating a large dataset with The Echo Nest's API
   * To help new researchers get started in the MIR field

The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital, using code we provide.

The Million Song Dataset is a collaborative project between The Echo Nest and LabROSA. It is hosted by Infochimps and supported in part by the NSF.

Aside from instructions on how to get the dataset, the website contains:
   * code and tutorials to get you started
   * benchmark results for some example tasks (automatic tagging, artist recognition, ...)
   * artist-level mappings to link to the Yahoo Ratings Dataset (91% of the artist ratings covered)
   * demos including how to fetch audio snippets, mapping artists on a world map, ...
   * forum, FAQ, blog, etc.

To better understand where this dataset comes from and what it aims to achieve, you can read Dan Ellis' blog post: http://bit.ly/hF8ozR

We are keen to receive questions, comments and suggestions, and we look forward to your new number-crunching MIR algorithms!

Thierry Bertin-Mahieux and Dan Ellis, for the Million Song Dataset team