Gradient and Structure Consistency in Multimodal Emotion Recognition

Multimodal emotion recognition is a task that integrates textual, visual, and audio data to holistically infer an individual's emotional state. Existing research predominantly focuses on exploiting modality-specific cues for joint learning, often ignoring the differences between multiple modali...


Bibliographic Details
Published in:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 34(2025), day: 01, pages 6180-6191
Main author:	Shi, QingHongYa (Author)
Other authors:	Ye, Mang, Huang, Wenke, Du, Bo, Zong, Xiaofen
Format:	Online article
Language:	English
Published:	2025
Access to the parent work:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects:	Journal Article


LEADER	01000caa a22002652c 4500
001	NLM392755874
003	DE-627
005	20251001232128.0
007	cr uuu---uuuuu
008	250920s2025 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.1109/TIP.2025.3608664 \|2 doi
028	5	2	\|a pubmed25n1586.xml
035			\|a (DE-627)NLM392755874
035			\|a (NLM)40966155
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Shi, QingHongYa \|e verfasserin \|4 aut
245	1	0	\|a Gradient and Structure Consistency in Multimodal Emotion Recognition
264		1	\|c 2025
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
500			\|a Date Completed 29.09.2025
500			\|a Date Revised 30.09.2025
500			\|a published: Print
500			\|a Citation Status MEDLINE
520			\|a Multimodal emotion recognition is a task that integrates textual, visual, and audio data to holistically infer an individual's emotional state. Existing research predominantly focuses on exploiting modality-specific cues for joint learning, often ignoring the differences between multiple modalities in common goal learning. Due to multimodal heterogeneity, common goal learning inadvertently introduces optimization biases and interaction noise. To address the above challenges, we propose a novel approach named Gradient and Structure Consistency (GSCon). Our strategy operates at both overall and individual levels to consider balanced optimization and effective interaction, respectively. At the overall level, to avoid the optimization suppression of one modality on others, we construct a balanced gradient direction that aligns each modality's optimization direction, ensuring unbiased convergence. Simultaneously, at the individual level, to avoid the interaction noise caused by multimodal alignment, we align the spatial structure of samples in different modalities. The spatial structure of the samples will not differ due to modal heterogeneity, achieving effective inter-modal interaction. Extensive experiments on multimodal emotion recognition and multimodal intention understanding datasets demonstrate the effectiveness of the proposed method. Code is available at https://github.com/ShiQingHongYa/GSCon
650		4	\|a Journal Article
700	1		\|a Ye, Mang \|e verfasserin \|4 aut
700	1		\|a Huang, Wenke \|e verfasserin \|4 aut
700	1		\|a Du, Bo \|e verfasserin \|4 aut
700	1		\|a Zong, Xiaofen \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society \|d 1992 \|g 34(2025) vom: 01., Seite 6180-6191 \|w (DE-627)NLM09821456X \|x 1941-0042 \|7 nnas
773	1	8	\|g volume:34 \|g year:2025 \|g day:01 \|g pages:6180-6191
856	4	0	\|u http://dx.doi.org/10.1109/TIP.2025.3608664 \|3 Volltext
912			\|a GBV_USEFLAG_A
912			\|a SYSFLAG_A
912			\|a GBV_NLM
912			\|a GBV_ILN_350
951			\|a AR
952			\|d 34 \|j 2025 \|b 01 \|h 6180-6191