CAVER : Cross-Modal View-Mixed Transformer for Bi-Modal Salient Object Detection

Most of the existing bi-modal (RGB-D and RGB-T) salient object detection methods utilize the convolution operation and construct complex interweave fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation constrains the performanc...

Ausführliche Beschreibung

Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - PP(2023) vom: 11. Jan.
1. Verfasser: Pang, Youwei (VerfasserIn)
Weitere Verfasser: Zhao, Xiaoqi, Zhang, Lihe, Lu, Huchuan
Format: Online-Aufsatz
Sprache:English
Veröffentlicht: 2023
Zugriff auf das übergeordnete Werk:IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Schlagworte:Journal Article
LEADER 01000naa a22002652 4500
001 NLM355234076
003 DE-627
005 20231226063922.0
007 cr uuu---uuuuu
008 231226s2023 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2023.3234702  |2 doi 
028 5 2 |a pubmed24n1184.xml 
035 |a (DE-627)NLM355234076 
035 |a (NLM)37018701 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Pang, Youwei  |e verfasserin  |4 aut 
245 1 0 |a CAVER  |b Cross-Modal View-Mixed Transformer for Bi-Modal Salient Object Detection 
264 1 |c 2023 
336 |a Text  |b txt  |2 rdacontent 
337 |a ƒaComputermedien  |b c  |2 rdamedia 
338 |a ƒa Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 05.04.2023 
500 |a published: Print-Electronic 
500 |a Citation Status Publisher 
520 |a Most of the existing bi-modal (RGB-D and RGB-T) salient object detection methods utilize the convolution operation and construct complex interweave fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation constrains the performance of the convolution-based methods to a ceiling. In this work, we rethink these tasks from the perspective of global information alignment and transformation. Specifically, the proposed cross-modal view-mixed transformer (CAVER) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path. CAVER treats the multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on a novel view-mixed attention mechanism. Besides, considering the quadratic complexity w.r.t. the number of input tokens, we design a parameter-free patch-wise token re-embedding strategy to simplify operations. Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that such a simple two-stream encoder-decoder framework can surpass recent state-of-the-art methods when it is equipped with the proposed components 
650 4 |a Journal Article 
700 1 |a Zhao, Xiaoqi  |e verfasserin  |4 aut 
700 1 |a Zhang, Lihe  |e verfasserin  |4 aut 
700 1 |a Lu, Huchuan  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g PP(2023) vom: 11. Jan.  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnns 
773 1 8 |g volume:PP  |g year:2023  |g day:11  |g month:01 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2023.3234702  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d PP  |j 2023  |b 11  |c 01