[AUDITORY] Releasing FSD50K: an open dataset of human-labeled sound events with over 100h of audio (Eduardo Fonseca )


Subject: [AUDITORY] Releasing FSD50K: an open dataset of human-labeled sound events with over 100h of audio
From:    Eduardo Fonseca  <eduardo.fonseca@xxxxxxxx>
Date:    Fri, 2 Oct 2020 20:06:34 +0200

=== Apologies for cross-posting ===

Dear list,

We're glad to announce the release of FSD50K, the new open dataset of human-labeled sound events. FSD50K contains over 51k Freesound audio clips, totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. To our knowledge, this is the largest fully open dataset of human-labeled sound events, and, modestly, the second largest overall after AudioSet.

FSD50K's most important characteristics:

- FSD50K contains 51,197 audio clips from Freesound <https://freesound.org/>, totalling 108.3 hours of multi-labeled audio
- The dataset encompasses 200 sound classes hierarchically organized with a subset of the AudioSet Ontology <https://research.google.com/audioset/ontology/index.html>, allowing development and evaluation of large-vocabulary machine listening methods
- The audio content is composed mainly of sound events produced by physical sound sources, including human sounds, sounds of things, animals, natural sounds, musical instruments, and more
- The acoustic material has been manually labeled using the Freesound Annotator <https://annotator.freesound.org/> platform
- Clips are of variable length (0.3 to 30 s), and ground-truth labels are provided at the clip level (i.e., weak labels)
- All clips are provided as uncompressed PCM 16-bit 44.1 kHz mono audio files
- The dataset is split into a development set (41k clips / 80h, in turn split into train and validation) and an evaluation set (10k clips / 28h)
- In addition to audio clips and ground truth, additional metadata is made available (including raw annotations, sound predominance ratings, Freesound metadata, and more), enabling a variety of sound event research tasks
- All these resources are licensed under Creative Commons licenses, which allow sharing and reuse

FSD50K dataset: http://doi.org/10.5281/zenodo.4060432

Paper documenting dataset creation, characterization, and experiments:
Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events" <https://arxiv.org/pdf/2010.00475.pdf>, arXiv:2010.00475, 2020

Companion site (where you can explore the audio content of the dataset): https://annotator.freesound.org/fsd/release/FSD50K/

Code for baseline experiments (to be released soon): https://github.com/edufonseca/FSD50K_baseline

We will also soon publish a blog post. Stay up to date about FSD50K by subscribing to the freesound-annotator Google Group <https://groups.google.com/g/freesound-annotator>.

We hope all these resources are useful for the community!

FSD50K has been created at the Music Technology Group <https://www.upf.edu/web/mtg/> of Universitat Pompeu Fabra, Barcelona. This effort was kindly sponsored by two Google Faculty Research Awards: 2017 <https://ai.googleblog.com/2018/03/google-faculty-research-awards-2017.html> and 2018 <https://ai.googleblog.com/2019/03/google-faculty-research-awards-2018.html>.

Cheers,
Eduardo, on behalf of the Freesound Datasets team

--
Eduardo Fonseca
Music Technology Group
Universitat Pompeu Fabra
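[Editor's note: since labels are provided at the clip level (weak labels) in a per-split CSV, loading the ground truth reduces to parsing one comma-separated label list per clip. The sketch below illustrates this with a hypothetical two-row excerpt; the exact column names (fname, labels, mids, split), file names, and class identifiers are assumptions for illustration, not taken from this announcement -- consult the dataset's own documentation on Zenodo for the real layout.]

```python
import csv
import io

# Hypothetical excerpt of a FSD50K-style ground-truth CSV (column names and
# values are illustrative assumptions, not real dataset rows). Each clip gets
# one comma-separated list of classes: clip-level "weak" labels, no timestamps.
SAMPLE = '''fname,labels,mids,split
64760,"Electric_guitar,Guitar,Music","/m/02sgy,/m/0342h,/m/04rlf",train
16399,"Bark,Dog,Animal","/m/05tny_,/m/0bt9lr,/m/0jbk",val
'''

def load_ground_truth(fileobj):
    """Parse rows into {fname: set of labels} and {fname: train/val split}."""
    labels, split = {}, {}
    for row in csv.DictReader(fileobj):
        labels[row["fname"]] = set(row["labels"].split(","))
        split[row["fname"]] = row["split"]
    return labels, split

labels, split = load_ground_truth(io.StringIO(SAMPLE))
print(sorted(labels["64760"]))  # ['Electric_guitar', 'Guitar', 'Music']
print(split["16399"])           # 'val'
```

Because the labels are multi-label sets rather than single classes, a typical next step is mapping each set onto a 200-dimensional multi-hot target vector for training.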


This message came from the mail archive
src/postings/2020/
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University