P2T : Pyramid Pooling Transformer for Scene Understanding

Recently, the vision transformer has achieved great success by pushing the state-of-the-art of various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular sol...

Ausführliche Beschreibung

Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on pattern analysis and machine intelligence. - 1979. - 45(2023), 11 vom: 01. Nov., Seite 12760-12771
1. Verfasser: Wu, Yu-Huan (VerfasserIn)
Weitere Verfasser: Liu, Yun, Zhan, Xin, Cheng, Ming-Ming
Format: Online-Aufsatz
Sprache:English
Veröffentlicht: 2023
Zugriff auf das übergeordnete Werk:IEEE transactions on pattern analysis and machine intelligence
Schlagworte:Journal Article
Beschreibung
Zusammenfassung:Recently, the vision transformer has achieved great success by pushing the state-of-the-art of various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, where the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Plugged with our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when applied P2T as the backbone network, it shows substantial superiority in various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation, compared to previous CNN- and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/P2T
Beschreibung:Date Revised 20.10.2023
published: Print-Electronic
Citation Status PubMed-not-MEDLINE
ISSN:1939-3539
DOI:10.1109/TPAMI.2022.3202765