Vision Transformer With Quadrangle Attention

Window-based attention has become a popular choice in vision transformers due to its superior performance, lower computational complexity, and smaller memory footprint. However, the design of hand-crafted windows, which is data-agnostic, constrains the flexibility of transformers to adapt to objects of varying sizes, shapes, and orientations. To address this issue, we propose a novel quadrangle attention (QA) method that extends window-based attention to a general quadrangle formulation. Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles for token sampling and attention calculation, enabling the network to model various targets with different shapes and orientations and capture rich context information. We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and negligible extra computational cost. Extensive experiments on public benchmarks demonstrate that QFormer outperforms existing representative vision transformers on various vision tasks, including classification, object detection, semantic segmentation, and pose estimation. The code will be made publicly available at QFormer.
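The abstract describes the core mechanism: a per-window regression module predicts a projective transformation, the default window is warped into a quadrangle, key/value tokens are sampled inside that quadrangle, and attention is then computed as in ordinary window attention. The following is a minimal PyTorch sketch of that idea, not the authors' released QFormer code; the module names, the 8-parameter projective parameterization, the zero-initialization, and all shapes are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuadrangleAttention(nn.Module):
    """Window attention whose key/value windows are learned quadrangles
    (illustrative sketch of the idea in the abstract, not the paper's code)."""

    def __init__(self, dim, window=7, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.w, self.heads = window, heads
        self.scale = (dim // heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # Regress one projective transform (8 parameters) per window;
        # zero-init so training starts from plain window attention.
        self.offset = nn.Sequential(nn.AvgPool2d(window), nn.Conv2d(dim, 8, 1))
        nn.init.zeros_(self.offset[1].weight)
        nn.init.zeros_(self.offset[1].bias)
        # Default window: a regular grid of sampling points in [-1, 1]^2.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, window),
                                torch.linspace(-1, 1, window), indexing="ij")
        self.register_buffer("base", torch.stack([xs, ys], -1).view(-1, 2))

    def forward(self, x):  # x: (B, C, H, W), H and W divisible by the window size
        B, C, H, W = x.shape
        w, nh, nw = self.w, H // self.w, W // self.w
        n = nh * nw  # windows per image
        # 1) Predict a 3x3 projective matrix for every window.
        t = self.offset(x).permute(0, 2, 3, 1).reshape(B * n, 8)
        A = torch.eye(3, device=x.device).expand(B * n, 3, 3).clone()
        A[:, :2, :] = A[:, :2, :] + t[:, :6].view(-1, 2, 3)  # scale/rotate/shear/shift
        A[:, 2, :2] = t[:, 6:]                               # projective terms
        # 2) Map the default grid through the transform (homogeneous coordinates).
        pts = F.pad(self.base, (0, 1), value=1.0)             # (w*w, 3)
        pts = torch.einsum("bij,nj->bni", A, pts)
        local = pts[..., :2] / pts[..., 2:].clamp(min=1e-6)   # sign not handled; sketch
        # 3) Place each transformed grid at its window centre in global [-1, 1] coords.
        cx = (torch.arange(nw, device=x.device) + 0.5) / nw * 2 - 1
        cy = (torch.arange(nh, device=x.device) + 0.5) / nh * 2 - 1
        ctr = torch.stack(torch.meshgrid(cy, cx, indexing="ij"), -1).flip(-1)
        ctr = ctr.reshape(n, 1, 2).repeat(B, 1, 1)            # (B*n, 1, 2) as (x, y)
        scale = torch.tensor([nw, nh], device=x.device, dtype=x.dtype)
        grid = (ctr + local / scale).view(B, n * w * w, 1, 2)
        # 4) Sample key/value tokens inside each quadrangle.
        kv = F.grid_sample(x, grid, align_corners=False)      # (B, C, n*w*w, 1)
        kv = kv.squeeze(-1).transpose(1, 2).reshape(B * n, w * w, C)
        # 5) Queries come from the ordinary axis-aligned window partition.
        qf = x.view(B, C, nh, w, nw, w).permute(0, 2, 4, 3, 5, 1)
        q = self.q(qf.reshape(B * n, w * w, C))
        q = q.view(B * n, w * w, self.heads, -1).transpose(1, 2)
        k, v = self.kv(kv).chunk(2, dim=-1)
        k = k.view(B * n, w * w, self.heads, -1).transpose(1, 2)
        v = v.view(B * n, w * w, self.heads, -1).transpose(1, 2)
        out = (q @ k.transpose(-2, -1) * self.scale).softmax(-1) @ v
        out = self.proj(out.transpose(1, 2).reshape(B * n, w * w, C))
        # Merge windows back into the feature map.
        out = out.view(B, nh, nw, w, w, C).permute(0, 5, 1, 3, 2, 4)
        return out.reshape(B, C, H, W)
```

For example, `QuadrangleAttention(dim=64, window=7)` applied to a `(2, 64, 28, 28)` tensor returns a tensor of the same shape; a QFormer-style network would interleave such attention blocks with MLPs in a standard transformer stack.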


Bibliographic details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 46(2024), 5, 8 May, pages 3608-3624
Main author: Zhang, Qiming (Author)
Other authors: Zhang, Jing, Xu, Yufei, Tao, Dacheng
Format: Online article
Language: English
Published: 2024
Collection: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article
LEADER 01000caa a22002652c 4500
001 NLM366812947
003 DE-627
005 20250305155526.0
007 cr uuu---uuuuu
008 240114s2024 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2023.3347693  |2 doi 
028 5 2 |a pubmed25n1222.xml 
035 |a (DE-627)NLM366812947 
035 |a (NLM)38190690 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Zhang, Qiming  |e verfasserin  |4 aut 
245 1 0 |a Vision Transformer With Quadrangle Attention 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 03.04.2024 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a Window-based attention has become a popular choice in vision transformers due to its superior performance, lower computational complexity, and smaller memory footprint. However, the design of hand-crafted windows, which is data-agnostic, constrains the flexibility of transformers to adapt to objects of varying sizes, shapes, and orientations. To address this issue, we propose a novel quadrangle attention (QA) method that extends window-based attention to a general quadrangle formulation. Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles for token sampling and attention calculation, enabling the network to model various targets with different shapes and orientations and capture rich context information. We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and negligible extra computational cost. Extensive experiments on public benchmarks demonstrate that QFormer outperforms existing representative vision transformers on various vision tasks, including classification, object detection, semantic segmentation, and pose estimation. The code will be made publicly available at QFormer 
650 4 |a Journal Article 
700 1 |a Zhang, Jing  |e verfasserin  |4 aut 
700 1 |a Xu, Yufei  |e verfasserin  |4 aut 
700 1 |a Tao, Dacheng  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 46(2024), 5 vom: 08. Mai, Seite 3608-3624  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnas 
773 1 8 |g volume:46  |g year:2024  |g number:5  |g day:08  |g month:05  |g pages:3608-3624 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2023.3347693  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 46  |j 2024  |e 5  |b 08  |c 05  |h 3608-3624