
[AUDITORY] ISCA SIGML seminar: Audio Spectrogram Transformer for Audio Scene Analysis



Dear colleagues,

We are hosting a talk that might be of interest to the people on this list.

Next Wednesday (16 Jun) at 5pm (UTC+0), Yuan Gong from MIT will talk
about audio scene analysis with transformers. The details of the talk
can be found at the end of this email and on the seminar webpage
https://homepages.inf.ed.ac.uk/htang2/sigml/seminar/.

The link to the talk will be distributed through our mailing list
https://groups.google.com/g/isca-sigml. If you are interested, please
subscribe and stay tuned!

Best,
Hao

---

Title: Audio Spectrogram Transformer for Audio Scene Analysis

Abstract: Audio scene analysis is an active research area and has a wide
range of applications. Since the release of AudioSet, great progress has
been made in advancing model performance, which mostly comes from the
development of novel model architectures and attention modules. However,
we find that appropriate training techniques are equally important for
building audio tagging models, but have not received the attention they
deserve. In the first part of the talk, I will present PSLA, a
collection of training techniques that can noticeably boost the model
accuracy.
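
[Editor's note: to make "training techniques" concrete, here is a hedged
sketch of one generic trick commonly used for audio tagging (mixup-style
blending of spectrograms and their multi-label targets). It is purely an
illustration and not necessarily part of the PSLA recipe presented in
the talk; the function name and parameters are assumptions.]

import torch

def mixup(spec, labels, alpha=0.5):
    # Blend random pairs of spectrograms and their multi-hot label vectors
    # with a Beta-distributed mixing coefficient (illustrative only).
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(spec.size(0))
    mixed_spec = lam * spec + (1 - lam) * spec[perm]
    mixed_labels = lam * labels + (1 - lam) * labels[perm]
    return mixed_spec, mixed_labels

# Usage: spec is (batch, n_mels, n_frames); labels is (batch, n_classes) multi-hot.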

In the past decade, convolutional neural networks
(CNNs) have been widely adopted as the main building block for
end-to-end audio classification models, which aim to learn a direct
mapping from audio spectrograms to corresponding labels. To better
capture long-range global context, a recent trend is to add a
self-attention mechanism on top of the CNN, forming a CNN-attention
hybrid model. However, it is unclear whether the reliance on a CNN is
necessary, and whether neural networks purely based on attention are
sufficient to obtain good performance in audio classification. In the
second part of the talk, I will answer this question by introducing the
Audio Spectrogram Transformer (AST), the first convolution-free, purely
attention-based model for audio classification.
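
[Editor's note: as a rough illustration of what "convolution-free,
purely attention-based" means in practice, here is a minimal PyTorch
sketch that splits a spectrogram into patches, embeds them with a linear
projection, and classifies with a standard Transformer encoder. The
class name, patch size, width, depth, and head count are illustrative
assumptions, not the actual AST configuration from the talk.]

import torch
import torch.nn as nn

class SpectrogramTransformer(nn.Module):
    # Patchify a spectrogram, embed patches linearly, and classify with a
    # plain Transformer encoder: no convolutions anywhere.
    def __init__(self, n_mels=128, n_frames=1024, patch=16, dim=192,
                 depth=4, heads=4, n_classes=527):
        super().__init__()
        n_patches = (n_mels // patch) * (n_frames // patch)
        self.patchify = nn.Unfold(kernel_size=patch, stride=patch)
        self.embed = nn.Linear(patch * patch, dim)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, spec):                  # spec: (batch, n_mels, n_frames)
        x = self.patchify(spec.unsqueeze(1))  # (batch, patch*patch, n_patches)
        x = self.embed(x.transpose(1, 2))     # (batch, n_patches, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos
        x = self.encoder(x)
        return self.head(x[:, 0])             # classify from the class token

logits = SpectrogramTransformer()(torch.randn(2, 128, 1024))  # -> (2, 527)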

Bio: Yuan Gong is a postdoctoral associate at the MIT Computer Science
and Artificial Intelligence Laboratory (CSAIL). He received his Ph.D.
degree in Computer Science from the University of Notre Dame, and his
B.S. degree in Biomedical Engineering from Fudan University. He won the
2017 AVEC depression detection challenge, and one of his papers was
nominated for the best student paper award at Interspeech 2019.
His current research interests include audio scene analysis,
speech-based health systems, and voice anti-spoofing.