What Makes for Hierarchical Vision Transformer?
Recent studies indicate that hierarchical Vision Transformer (ViT) with a macro architecture of interleaved non-overlapped window-based self-attention & shifted-window operation can achieve state-of-the-art performance in various visual recognition tasks, and challenges the ubiquitous convolutio...
Veröffentlicht in: | IEEE transactions on pattern analysis and machine intelligence. - 1979. - 45(2023), 10 vom: 05. Okt., Seite 12714-12720 |
---|---|
1. Verfasser: | |
Weitere Verfasser: | , , |
Format: | Online-Aufsatz |
Sprache: | English |
Veröffentlicht: |
2023
|
Zugriff auf das übergeordnete Werk: | IEEE transactions on pattern analysis and machine intelligence |
Schlagworte: | Journal Article |
Zusammenfassung: | Recent studies indicate that hierarchical Vision Transformer (ViT) with a macro architecture of interleaved non-overlapped window-based self-attention & shifted-window operation can achieve state-of-the-art performance in various visual recognition tasks, and challenges the ubiquitous convolutional neural networks (CNNs) using densely slid kernels. In most recently proposed hierarchical ViTs, self-attention is the de-facto standard for spatial information aggregation. In this paper, we question whether self-attention is the only choice for hierarchical ViT to attain strong performance, and study the effects of different kinds of cross-window communication methods. To this end, we replace self-attention layers with embarrassingly simple linear mapping layers, and the resulting proof-of-concept architecture termed TransLinear can achieve very strong performance in ImageNet-[Formula: see text] image recognition. Moreover, we find that TransLinear is able to leverage the ImageNet pre-trained weights and demonstrates competitive transfer learning properties on downstream dense prediction tasks such as object detection and instance segmentation. We also experiment with other alternatives to self-attention for content aggregation inside each non-overlapped window under different cross-window communication approaches. Our results reveal that the macro architecture, other than specific aggregation layers or cross-window communication mechanisms, is more responsible for hierarchical ViT's strong performance and is the real challenger to the ubiquitous CNN's dense sliding window paradigm |
---|---|
Beschreibung: | Date Revised 31.10.2023 published: Print Citation Status PubMed-not-MEDLINE |
ISSN: | 1939-3539 |
DOI: | 10.1109/TPAMI.2023.3282019 |