CAVER: Cross-Modal View-Mixed Transformer for Bi-Modal Salient Object Detection

Most of the existing bi-modal (RGB-D and RGB-T) salient object detection methods utilize the convolution operation and construct complex interweave fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation constrains the performanc...

Bibliographic details
Published in:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - PP(2023), 11 Jan.
Main author:	Pang, Youwei (Author)
Other authors:	Zhao, Xiaoqi; Zhang, Lihe; Lu, Huchuan
Format:	Online article
Language:	English
Published:	2023
Access to collection:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects:	Journal Article


LEADER	01000caa a22002652c 4500
001	NLM355234076
003	DE-627
005	20250304151152.0
007	cr uuu---uuuuu
008	231226s2023 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.1109/TIP.2023.3234702 \|2 doi
028	5	2	\|a pubmed25n1183.xml
035			\|a (DE-627)NLM355234076
035			\|a (NLM)37018701
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Pang, Youwei \|e verfasserin \|4 aut
245	1	0	\|a CAVER \|b Cross-Modal View-Mixed Transformer for Bi-Modal Salient Object Detection
264		1	\|c 2023
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
500			\|a Date Revised 05.04.2023
500			\|a published: Print-Electronic
500			\|a Citation Status Publisher
520			\|a Most of the existing bi-modal (RGB-D and RGB-T) salient object detection methods utilize the convolution operation and construct complex interweave fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation constrains the performance of the convolution-based methods to a ceiling. In this work, we rethink these tasks from the perspective of global information alignment and transformation. Specifically, the proposed cross-modal view-mixed transformer (CAVER) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path. CAVER treats the multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on a novel view-mixed attention mechanism. Besides, considering the quadratic complexity w.r.t. the number of input tokens, we design a parameter-free patch-wise token re-embedding strategy to simplify operations. Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that such a simple two-stream encoder-decoder framework can surpass recent state-of-the-art methods when it is equipped with the proposed components
650		4	\|a Journal Article
700	1		\|a Zhao, Xiaoqi \|e verfasserin \|4 aut
700	1		\|a Zhang, Lihe \|e verfasserin \|4 aut
700	1		\|a Lu, Huchuan \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society \|d 1992 \|g PP(2023) vom: 11. Jan. \|w (DE-627)NLM09821456X \|x 1941-0042 \|7 nnas
773	1	8	\|g volume:PP \|g year:2023 \|g day:11 \|g month:01
856	4	0	\|u http://dx.doi.org/10.1109/TIP.2023.3234702 \|3 Volltext
912			\|a GBV_USEFLAG_A
912			\|a SYSFLAG_A
912			\|a GBV_NLM
912			\|a GBV_ILN_350
951			\|a AR
952			\|d PP \|j 2023 \|b 11 \|c 01