MoIL: Momentum Imitation Learning for Efficient Vision-Language Adaptation

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - PP(2024), 30 July
Main author: Luo, Gen (Author)
Other authors: Zhou, Yiyi, Huang, Minglang, Ren, Tianhe, Sun, Xiaoshuai, Ji, Rongrong
Format: Online article
Language: English
Published: 2024
Access to the parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article
Description
Summary: Pre-training and fine-tuning have been the de facto paradigm in vision-language domains. Along with the rapid growth of model sizes, fully fine-tuning these large-scale vision-language pre-training (VLP) models incurs prohibitively expensive storage costs. To address this issue, recent advances in NLP offer a promising and efficient adaptation approach called LoRA, which aims to approximate the fine-tuning of large pre-trained models by updating low-rank parameters. Despite its effectiveness, we identify that LoRA suffers from a large approximation error on VLP models and that its optimization is also inefficient, which greatly limits its performance upper bound. In this paper, we mathematically prove that the approximation error of low-rank adaptation can be optimized by a new optimization objective, i.e., the weight distance between LoRA and fine-tuning. Based on this finding, we propose a novel parameter-efficient transfer learning (PETL) method for VLP models, namely momentum imitation learning (MoIL). Specifically, MoIL formulates PETL as a weight imitation learning process and directly optimizes the approximation error bound of the low-rank adaptation. Based on this training scheme, we also explore a new hybrid approximation function to reduce the learning difficulty of low-rank adaptations. With these two novel designs, MoIL can greatly improve the optimization efficiency of the low-rank parameters on VLP models. We validate MoIL on three VLP models ranging from end-to-end networks to two-stage networks, and conduct extensive experiments on four VL tasks. Experimental results demonstrate that MoIL achieves better performance and optimization efficiency than existing PETL methods. For instance, by updating only 6.23% of the parameters, MoIL can even outperform full tuning by +2.3% on the image-text matching task. Meanwhile, its inference efficiency and generalization ability are also validated on multiple VLP models, e.g., VLMO and VinVL.
Description: Date Revised 30.07.2024
published: Print-Electronic
Citation Status Publisher
ISSN: 1939-3539
DOI: 10.1109/TPAMI.2024.3435790
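
The summary describes LoRA as approximating fine-tuning through low-rank parameter updates, and names the weight distance between the LoRA weights and the fine-tuned weights as the quantity to minimize. The following is a minimal NumPy sketch of those two ideas only; it is not the authors' MoIL training procedure, which this record does not specify. The function names, the matrix shapes, the use of the Frobenius norm as the distance, and the truncated-SVD fit (which assumes the fine-tuned weights are already known) are all illustrative assumptions.

```python
import numpy as np


def lora_weight(w0, a, b):
    """Adapted weight W0 + B @ A; the update B @ A has rank at most r."""
    return w0 + b @ a


def weight_distance(w_lora, w_ft):
    """Frobenius-norm distance between adapted and fine-tuned weights (assumed metric)."""
    return float(np.linalg.norm(w_lora - w_ft, ord="fro"))


def best_rank_r_update(delta, r):
    """Best rank-r approximation of delta via truncated SVD, returned as factors (A, B)."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    b = u[:, :r] * s[:r]  # shape (d_out, r), columns scaled by singular values
    a = vt[:r, :]         # shape (r, d_in)
    return a, b


# Hypothetical toy setup: random "pre-trained" weights and a stand-in for fine-tuned weights.
rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2
w0 = rng.normal(size=(d_out, d_in))
w_ft = w0 + 0.1 * rng.normal(size=(d_out, d_in))

# Distance with a zero low-rank update, i.e. the frozen pre-trained weights.
a_zero = np.zeros((r, d_in))
b_zero = np.zeros((d_out, r))
dist_before = weight_distance(lora_weight(w0, a_zero, b_zero), w_ft)

# Distance after fitting the rank-r update that minimizes the weight distance.
a, b = best_rank_r_update(w_ft - w0, r)
dist_after = weight_distance(lora_weight(w0, a, b), w_ft)

print(f"weight distance before: {dist_before:.4f}, after rank-{r} fit: {dist_after:.4f}")
```

In this toy setting the remaining distance is determined by the singular values of the weight difference beyond rank r, which is one way to see why a low-rank update generally leaves a nonzero approximation error relative to full fine-tuning.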