Vicinity Vision Transformer

Vision transformers have shown great success on numerous computer vision tasks. However, their central component, softmax attention, prohibits vision transformers from scaling up to high-resolution images, due to both the computational complexity and memory footprint being quadratic. Linear attentio...
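
The abstract (given in full in field 520 of the record below) says that linear attention reorders the self-attention computation to avoid the quadratic cost in sequence length. The following is a minimal sketch of that reordering, not the authors' implementation: it assumes the softmax is replaced by a kernel similarity, and the names `feature_map`, `quadratic_attention`, `linear_attention`, and `random_matrix` are hypothetical. Both functions return the same output, but the second never forms the N x N score matrix; its cost grows with the square of the feature dimension instead, which is the bottleneck the abstract mentions.

```python
import random


def feature_map(x):
    # Hypothetical kernel feature map (ReLU plus a small offset) that keeps
    # every similarity score positive; chosen only for illustration.
    return [max(v, 0.0) + 1e-3 for v in x]


def quadratic_attention(Q, K, V):
    """Compute all N x N pairwise scores: O(N^2 * d) time."""
    Qf = [feature_map(q) for q in Q]
    Kf = [feature_map(k) for k in K]
    dv = len(V[0])
    out = []
    for q in Qf:
        scores = [sum(a * b for a, b in zip(q, k)) for k in Kf]
        z = sum(scores)  # row normaliser
        out.append([sum(s * v[c] for s, v in zip(scores, V)) / z
                    for c in range(dv)])
    return out


def linear_attention(Q, K, V):
    """Accumulate K^T V once, then apply each query: O(N * d^2) time."""
    Qf = [feature_map(q) for q in Q]
    Kf = [feature_map(k) for k in K]
    d, dv = len(Kf[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]  # d x dv summary of keys and values
    k_sum = [0.0] * d                    # running sum of mapped keys
    for k, v in zip(Kf, V):
        for a in range(d):
            k_sum[a] += k[a]
            for c in range(dv):
                kv[a][c] += k[a] * v[c]
    out = []
    for q in Qf:
        z = sum(q[a] * k_sum[a] for a in range(d))  # same row normaliser
        out.append([sum(q[a] * kv[a][c] for a in range(d)) / z
                    for c in range(dv)])
    return out


def random_matrix(rows, cols, rng):
    # Helper for trying the sketch on random inputs.
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]
```

Calling both functions on the same random `Q`, `K`, and `V` (for example six tokens with four features) should give outputs that agree up to floating-point rounding.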

Bibliographic Details
Published in:	IEEE transactions on pattern analysis and machine intelligence. - 1979. - 45(2023), 10, dated 13 Oct., pages 12635-12649
Main Author:	Sun, Weixuan (Author)
Other Authors:	Qin, Zhen, Deng, Hui, Wang, Jianyuan, Zhang, Yi, Zhang, Kaihao, Barnes, Nick, Birchfield, Stan, Kong, Lingpeng, Zhong, Yiran
Format:	Online Article
Language:	English
Published:	2023
Access to parent work:	IEEE transactions on pattern analysis and machine intelligence
Subjects:	Journal Article


LEADER	01000naa a22002652 4500
001	NLM358125197
003	DE-627
005	20231226074059.0
007	cr uuu---uuuuu
008	231226s2023 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.1109/TPAMI.2023.3285569 \|2 doi
028	5	2	\|a pubmed24n1193.xml
035			\|a (DE-627)NLM358125197
035			\|a (NLM)37310842
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Sun, Weixuan \|e verfasserin \|4 aut
245	1	0	\|a Vicinity Vision Transformer
264		1	\|c 2023
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
500			\|a Date Revised 06.09.2023
500			\|a published: Print-Electronic
500			\|a Citation Status PubMed-not-MEDLINE
520			\|a Vision transformers have shown great success on numerous computer vision tasks. However, their central component, softmax attention, prohibits vision transformers from scaling up to high-resolution images, due to both the computational complexity and memory footprint being quadratic. Linear attention was introduced in natural language processing (NLP) which reorders the self-attention mechanism to mitigate a similar issue, but directly applying existing linear attention to vision may not lead to satisfactory results. We investigate this problem and point out that existing linear attention methods ignore an inductive bias in vision tasks, i.e., 2D locality. In this article, we propose Vicinity Attention, which is a type of linear attention that integrates 2D locality. Specifically, for each image patch, we adjust its attention weight based on its 2D Manhattan distance from its neighbouring patches. In this case, we achieve 2D locality in a linear complexity where the neighbouring image patches receive stronger attention than far away patches. In addition, we propose a novel Vicinity Attention Block that is comprised of Feature Reduction Attention (FRA) and Feature Preserving Connection (FPC) in order to address the computational bottleneck of linear attention approaches, including our Vicinity Attention, whose complexity grows quadratically with respect to the feature dimension. The Vicinity Attention Block computes attention in a compressed feature space with an extra skip connection to retrieve the original feature distribution. We experimentally validate that the block further reduces computation without degenerating the accuracy. Finally, to validate the proposed methods, we build a linear vision transformer backbone named Vicinity Vision Transformer (VVT). Targeting general vision tasks, we build VVT in a pyramid structure with progressively reduced sequence length. We perform extensive experiments on CIFAR-100, ImageNet-1k, and ADE20K datasets to validate the effectiveness of our method. Our method has a slower growth rate in terms of computational overhead than previous transformer-based and convolution-based networks when the input resolution increases. In particular, our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous approaches.
650		4	\|a Journal Article
700	1		\|a Qin, Zhen \|e verfasserin \|4 aut
700	1		\|a Deng, Hui \|e verfasserin \|4 aut
700	1		\|a Wang, Jianyuan \|e verfasserin \|4 aut
700	1		\|a Zhang, Yi \|e verfasserin \|4 aut
700	1		\|a Zhang, Kaihao \|e verfasserin \|4 aut
700	1		\|a Barnes, Nick \|e verfasserin \|4 aut
700	1		\|a Birchfield, Stan \|e verfasserin \|4 aut
700	1		\|a Kong, Lingpeng \|e verfasserin \|4 aut
700	1		\|a Zhong, Yiran \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t IEEE transactions on pattern analysis and machine intelligence \|d 1979 \|g 45(2023), 10 vom: 13. Okt., Seite 12635-12649 \|w (DE-627)NLM098212257 \|x 1939-3539 \|7 nnns
773	1	8	\|g volume:45 \|g year:2023 \|g number:10 \|g day:13 \|g month:10 \|g pages:12635-12649
856	4	0	\|u http://dx.doi.org/10.1109/TPAMI.2023.3285569 \|3 Volltext
912			\|a GBV_USEFLAG_A
912			\|a SYSFLAG_A
912			\|a GBV_NLM
912			\|a GBV_ILN_350
951			\|a AR
952			\|d 45 \|j 2023 \|e 10 \|b 13 \|c 10 \|h 12635-12649
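
The abstract in field 520 describes adjusting each patch's attention weight according to its 2D Manhattan distance from other patches, so that nearby patches receive stronger attention. The sketch below only illustrates that distance-based weighting and is not the method from the article: it builds a dense N x N weight table, so it does not reproduce the linear-complexity formulation, and the linear decay as well as the names `manhattan_distance`, `locality_weights`, and `reweight_row` are assumptions made for this example.

```python
def manhattan_distance(i, j, width):
    """2D Manhattan distance between patches i and j on a row-major grid."""
    row_i, col_i = divmod(i, width)
    row_j, col_j = divmod(j, width)
    return abs(row_i - row_j) + abs(col_i - col_j)


def locality_weights(height, width):
    """Dense N x N table of weights that shrink with Manhattan distance.

    The linear decay is a hypothetical choice for illustration only.
    """
    n = height * width
    max_dist = (height - 1) + (width - 1)
    return [[1.0 - manhattan_distance(i, j, width) / (max_dist + 1)
             for j in range(n)] for i in range(n)]


def reweight_row(scores, weights_row):
    """Scale one patch's non-negative scores and renormalise them to sum to 1."""
    scaled = [s * w for s, w in zip(scores, weights_row)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Usage: on a 3 x 3 grid, start from uniform scores for the centre patch
# (index 4) and apply the locality weights.
weights = locality_weights(3, 3)
centre_attention = reweight_row([1.0] * 9, weights[4])
```

With uniform input scores, the centre patch ends up attending most to itself, less to its four direct neighbours, and least to the corners.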