Towards Lightweight Transformer Via Group-Wise Transformation for Vision-and-Language Tasks

Despite the exciting performance, Transformer is criticized for its excessive parameters and computation cost. However, compressing Transformer remains as an open problem due to its internal complexity of the layer designs, i.e., Multi-Head Attention (MHA) and Feed-Forward Network (FFN). To address...

Ausführliche Beschreibung

Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 31(2022) vom: 11., Seite 3386-3398
1. Verfasser: Luo, Gen (VerfasserIn)
Weitere Verfasser: Zhou, Yiyi, Sun, Xiaoshuai, Wang, Yan, Cao, Liujuan, Wu, Yongjian, Huang, Feiyue, Ji, Rongrong
Format: Online-Aufsatz
Sprache:English
Veröffentlicht: 2022
Zugriff auf das übergeordnete Werk:IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Schlagworte:Journal Article
Beschreibung
Zusammenfassung:Despite the exciting performance, Transformer is criticized for its excessive parameters and computation cost. However, compressing Transformer remains as an open problem due to its internal complexity of the layer designs, i.e., Multi-Head Attention (MHA) and Feed-Forward Network (FFN). To address this issue, we introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed as LW-Transformer. LW-Transformer applies Group-wise Transformation to reduce both the parameters and computations of Transformer, while also preserving its two main properties, i.e., the efficient attention modeling on diverse subspaces of MHA, and the expanding-scaling feature transformation of FFN. We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets. Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks. To examine the generalization ability, we apply LW-Transformer to the task of image classification, and build its network based on a recently proposed image Transformer called Swin-Transformer, where the effectiveness can be also confirmed
Beschreibung:Date Revised 16.05.2022
published: Print-Electronic
Citation Status PubMed-not-MEDLINE
ISSN:1941-0042
DOI:10.1109/TIP.2021.3139234