Gradient and Structure Consistency in Multimodal Emotion Recognition

Multimodal emotion recognition is a task that integrates textual, visual, and audio data to holistically infer an individual's emotional state. Existing research predominantly focuses on exploiting modality-specific cues for joint learning, often ignoring the differences between multiple modali...

Bibliographic Details
Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 34(2025) of: 01., pages 6180-6191
Main Author: Shi, QingHongYa (Author)
Other Authors: Ye, Mang, Huang, Wenke, Du, Bo, Zong, Xiaofen
Format: Online Article
Language: English
Published: 2025
Access to the parent work: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
LEADER 01000caa a22002652c 4500
001 NLM392755874
003 DE-627
005 20251001232128.0
007 cr uuu---uuuuu
008 250920s2025 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2025.3608664  |2 doi 
028 5 2 |a pubmed25n1586.xml 
035 |a (DE-627)NLM392755874 
035 |a (NLM)40966155 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Shi, QingHongYa  |e verfasserin  |4 aut 
245 1 0 |a Gradient and Structure Consistency in Multimodal Emotion Recognition 
264 1 |c 2025 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 29.09.2025 
500 |a Date Revised 30.09.2025 
500 |a published: Print 
500 |a Citation Status MEDLINE 
520 |a Multimodal emotion recognition is a task that integrates textual, visual, and audio data to holistically infer an individual's emotional state. Existing research predominantly focuses on exploiting modality-specific cues for joint learning, often ignoring the differences between multiple modalities in common goal learning. Due to multimodal heterogeneity, common goal learning inadvertently introduces optimization biases and interaction noise. To address the above challenges, we propose a novel approach named Gradient and Structure Consistency (GSCon). Our strategy operates at both overall and individual levels to consider balanced optimization and effective interaction, respectively. At the overall level, to avoid the optimization suppression of one modality on others, we construct a balanced gradient direction that aligns each modality's optimization direction, ensuring unbiased convergence. Simultaneously, at the individual level, to avoid the interaction noise caused by multimodal alignment, we align the spatial structure of samples in different modalities. The spatial structure of the samples will not differ due to modal heterogeneity, achieving effective inter-modal interaction. Extensive experiments on multimodal emotion recognition and multimodal intention understanding datasets demonstrate the effectiveness of the proposed method. Code is available at https://github.com/ShiQingHongYa/GSCon 
650 4 |a Journal Article 
700 1 |a Ye, Mang  |e verfasserin  |4 aut 
700 1 |a Huang, Wenke  |e verfasserin  |4 aut 
700 1 |a Du, Bo  |e verfasserin  |4 aut 
700 1 |a Zong, Xiaofen  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g 34(2025) vom: 01., Seite 6180-6191  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnas 
773 1 8 |g volume:34  |g year:2025  |g day:01  |g pages:6180-6191 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2025.3608664  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 34  |j 2025  |b 01  |h 6180-6191