What Makes for Hierarchical Vision Transformer?


Detailed Description

Bibliographic Details
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence. - 1979. - Vol. 45 (2023), No. 10, 05 Oct., pages 12714-12720
First author: Fang, Yuxin (author)
Other authors: Wang, Xinggang, Wu, Rui, Liu, Wenyu
Format: Online article
Language: English
Published: 2023
Access to parent work: IEEE Transactions on Pattern Analysis and Machine Intelligence
Keywords: Journal Article
Description
Abstract: Recent studies indicate that hierarchical Vision Transformers (ViTs) with a macro architecture of interleaved non-overlapped window-based self-attention and shifted-window operations can achieve state-of-the-art performance on various visual recognition tasks, and challenge the ubiquitous convolutional neural networks (CNNs) that use densely slid kernels. In most recently proposed hierarchical ViTs, self-attention is the de facto standard for spatial information aggregation. In this paper, we question whether self-attention is the only choice for hierarchical ViTs to attain strong performance, and study the effects of different kinds of cross-window communication methods. To this end, we replace self-attention layers with embarrassingly simple linear mapping layers; the resulting proof-of-concept architecture, termed TransLinear, achieves very strong performance on ImageNet-[Formula: see text] image recognition. Moreover, we find that TransLinear is able to leverage ImageNet pre-trained weights and demonstrates competitive transfer-learning properties on downstream dense prediction tasks such as object detection and instance segmentation. We also experiment with other alternatives to self-attention for content aggregation inside each non-overlapped window under different cross-window communication approaches. Our results reveal that the macro architecture, rather than specific aggregation layers or cross-window communication mechanisms, is chiefly responsible for hierarchical ViTs' strong performance, and is the real challenger to the ubiquitous CNN's dense sliding-window paradigm.
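The core idea of the abstract — replacing window self-attention with a plain linear mapping over the token positions inside each non-overlapped window — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the class and function names (`LinearTokenMixer`, `window_partition`, `window_reverse`) and the untrained random weight are illustrative assumptions, and normalization, channel MLPs, and shifted-window communication are omitted.

```python
import numpy as np

def window_partition(x, w):
    # (H, W, C) feature map -> (num_windows, w*w, C) token groups
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def window_reverse(windows, w, H, W):
    # (num_windows, w*w, C) token groups -> (H, W, C) feature map
    C = windows.shape[-1]
    x = windows.reshape(H // w, W // w, w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

class LinearTokenMixer:
    """Stand-in for window self-attention: a single learned (here random,
    untrained) linear map that mixes the w*w token positions inside each
    non-overlapped window; channels are not mixed."""

    def __init__(self, w, rng=None):
        rng = rng or np.random.default_rng(0)
        n = w * w
        self.w = w
        self.weight = rng.standard_normal((n, n)) / np.sqrt(n)

    def __call__(self, x):
        H, W, _ = x.shape
        wins = window_partition(x, self.w)           # (nW, n, C)
        # mixed[b, m, c] = sum_n weight[m, n] * wins[b, n, c]
        mixed = np.einsum('mn,bnc->bmc', self.weight, wins)
        return window_reverse(mixed, self.w, H, W)
```

Like window self-attention, the map is strictly local: perturbing a token changes outputs only within its own window, so cross-window communication must come from the surrounding macro architecture (e.g. the shifted-window operation).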
Description: Date Revised 31.10.2023
published: Print
Citation Status PubMed-not-MEDLINE
ISSN:1939-3539
DOI:10.1109/TPAMI.2023.3282019