Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap

Detailed Description

Recent advances in transformer-based architectures have shown promise in several machine learning tasks. In the audio domain, such architectures have been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have shown limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT that we fine-tuned on the dimensions arousal, dominance, and valence of MSP-Podcast, while additionally using IEMOCAP and MOSI to test cross-corpus generalisation. To the best of our knowledge, we obtain the top performance for valence prediction without use of explicit linguistic information, with a concordance correlation coefficient (CCC) of .638 on MSP-Podcast. Our investigations reveal that transformer-based architectures are more robust than a CNN-based baseline and fair with respect to gender groups, but not towards individual speakers. Finally, we show that their success on valence is based on implicit linguistic information, which explains why they perform on par with recent multimodal approaches that explicitly utilise textual information. To make our findings reproducible, we release the best performing model to the community.
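The valence result above is reported as a concordance correlation coefficient (CCC) of .638. For readers unfamiliar with the metric, here is a minimal NumPy sketch of Lin's CCC, which rewards correlation as well as agreement in mean and scale; the function name and the choice of population variance (ddof=0) are ours, not taken from the paper.

```python
import numpy as np

def concordance_cc(pred: np.ndarray, gold: np.ndarray) -> float:
    """Lin's concordance correlation coefficient between two 1-D arrays.

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    """
    mean_p, mean_g = pred.mean(), gold.mean()
    cov = np.mean((pred - mean_p) * (gold - mean_g))
    return 2 * cov / (pred.var() + gold.var() + (mean_p - mean_g) ** 2)

# Toy usage: perfectly correlated but offset predictions score below 1.
print(concordance_cc(np.array([0.1, 0.4, 0.8]), np.array([0.2, 0.5, 0.9])))
```

Unlike Pearson correlation, CCC penalises systematic offsets between predictions and gold annotations, which is why it is commonly used for dimensional SER on corpora such as MSP-Podcast.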

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 45(2023), 9, 29 Sept., pages 10745-10759
First author: Wagner, Johannes (Author)
Other authors: Triantafyllopoulos, Andreas; Wierstorf, Hagen; Schmitt, Maximilian; Burkhardt, Felix; Eyben, Florian; Schuller, Bjorn W
Format: Online article
Language: English
Published: 2023
Access to the parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article; Research Support, Non-U.S. Gov't
LEADER 01000naa a22002652 4500
001 NLM355198754
003 DE-627
005 20231226063835.0
007 cr uuu---uuuuu
008 231226s2023 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2023.3263585  |2 doi 
028 5 2 |a pubmed24n1183.xml 
035 |a (DE-627)NLM355198754 
035 |a (NLM)37015129 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Wagner, Johannes  |e verfasserin  |4 aut 
245 1 0 |a Dawn of the Transformer Era in Speech Emotion Recognition  |b Closing the Valence Gap 
264 1 |c 2023 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 08.08.2023 
500 |a Date Revised 10.08.2023 
500 |a published: Print-Electronic 
500 |a Citation Status MEDLINE 
520 |a Recent advances in transformer-based architectures have shown promise in several machine learning tasks. In the audio domain, such architectures have been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have shown limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT that we fine-tuned on the dimensions arousal, dominance, and valence of MSP-Podcast, while additionally using IEMOCAP and MOSI to test cross-corpus generalisation. To the best of our knowledge, we obtain the top performance for valence prediction without use of explicit linguistic information, with a concordance correlation coefficient (CCC) of .638 on MSP-Podcast. Our investigations reveal that transformer-based architectures are more robust than a CNN-based baseline and fair with respect to gender groups, but not towards individual speakers. Finally, we show that their success on valence is based on implicit linguistic information, which explains why they perform on par with recent multimodal approaches that explicitly utilise textual information. To make our findings reproducible, we release the best performing model to the community.
650 4 |a Journal Article 
650 4 |a Research Support, Non-U.S. Gov't 
700 1 |a Triantafyllopoulos, Andreas  |e verfasserin  |4 aut 
700 1 |a Wierstorf, Hagen  |e verfasserin  |4 aut 
700 1 |a Schmitt, Maximilian  |e verfasserin  |4 aut 
700 1 |a Burkhardt, Felix  |e verfasserin  |4 aut 
700 1 |a Eyben, Florian  |e verfasserin  |4 aut 
700 1 |a Schuller, Bjorn W  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 45(2023), 9 vom: 29. Sept., Seite 10745-10759  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnns 
773 1 8 |g volume:45  |g year:2023  |g number:9  |g day:29  |g month:09  |g pages:10745-10759 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2023.3263585  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 45  |j 2023  |e 9  |b 29  |c 09  |h 10745-10759
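Since the raw record above is a pipe-delimited textual MARC dump rather than binary MARC 21 (for which a library such as pymarc would be the idiomatic choice), a few lines of standard-library Python suffice to pull out the machine-actionable identifiers. This is a sketch under that assumption; extract_links is a hypothetical helper, while the field tags 024 (standard identifier, here the DOI) and 856 (electronic location) are standard MARC.

```python
import re

def extract_links(record: str) -> dict:
    # Field 024 carries the DOI in subfield $a; field 856 carries the
    # full-text URL in subfield $u. Subfields are rendered here as "|a ...".
    doi = re.search(r"^024 .*?\|a\s*(\S+)", record, re.MULTILINE)
    url = re.search(r"^856 .*?\|u\s*(\S+)", record, re.MULTILINE)
    return {
        "doi": doi.group(1) if doi else None,
        "url": url.group(1) if url else None,
    }

# Applied to the record above, this yields:
# {'doi': '10.1109/TPAMI.2023.3263585',
#  'url': 'http://dx.doi.org/10.1109/TPAMI.2023.3263585'}
```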