Knowing What to Learn: A Metric-Oriented Focal Mechanism for Image Captioning
Published in: IEEE Transactions on Image Processing : a publication of the IEEE Signal Processing Society, Vol. 31 (2022), pp. 4321-4335
Author:
Further authors:
Format: Online article
Language: English
Published: 2022
Parent work: IEEE Transactions on Image Processing : a publication of the IEEE Signal Processing Society
Keywords: Journal Article
Abstract: Despite considerable progress, image captioning still suffers from a large gap in quality between easy and hard examples, which existing methods leave unexploited. To address this issue, we explore hard example mining in image captioning and propose a simple yet effective mechanism that instructs the model to pay more attention to hard examples, thereby improving performance in both general and complex scenarios. We first propose a novel learning strategy, termed Metric-oriented Focal Mechanism (MFM), for hard example mining in image captioning. Differing from existing strategies designed for classification tasks, MFM adopts the generative metrics of image captioning to measure the difficulty of examples and then up-weights the rewards of hard examples during training. To make MFM applicable to different datasets without tedious parameter tuning, we further introduce an adaptive reward metric called Effective CIDEr (ECIDEr), which considers the data distribution of easy and hard examples during reward estimation. Extensive experiments are conducted on the MS COCO benchmark, and the results show that while maintaining performance on simple examples, MFM significantly improves the quality of captions for hard examples. The ECIDEr-based MFM is applied to the current SOTA method, DLCT (Luo et al., 2021), which then outperforms all existing methods and achieves new state-of-the-art performance on both the off-line and on-line testing, i.e., 134.3 CIDEr for the off-line testing and 136.1 for the on-line testing of MS COCO. To validate the generalization ability of ECIDEr-based MFM, we also apply it to another dataset, namely Flickr30k, where superior performance gains are also obtained.
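For orientation, below is a minimal sketch of the reward up-weighting idea described in the abstract, assuming an SCST-style (self-critical) training loop with CIDEr rewards. The weighting formula, the `gamma` parameter, the `max_score` normalization, and the function names are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def focal_weight(cider_scores, gamma=2.0, max_score=10.0):
    """Illustrative difficulty weight: examples with low CIDEr (hard examples)
    receive larger weights, analogous to focal loss.
    The exact weighting used by MFM is an assumption here, not taken from the paper."""
    difficulty = 1.0 - np.clip(cider_scores / max_score, 0.0, 1.0)
    return (1.0 + difficulty) ** gamma

def mfm_weighted_advantage(sample_cider, greedy_cider, gamma=2.0):
    """Sketch of an SCST-style advantage with MFM-style up-weighting:
    the baseline-subtracted reward of each example is scaled by its difficulty weight."""
    advantage = sample_cider - greedy_cider           # standard self-critical advantage
    weights = focal_weight(greedy_cider, gamma=gamma) # harder example -> larger weight
    return weights * advantage

if __name__ == "__main__":
    sample_cider = np.array([1.2, 0.4, 2.1])  # CIDEr of sampled captions (per image)
    greedy_cider = np.array([1.0, 0.2, 2.3])  # CIDEr of greedy baseline captions
    print(mfm_weighted_advantage(sample_cider, greedy_cider))
```

In this sketch, the weighted advantage would replace the plain advantage in the policy-gradient loss, so gradients from hard examples (low baseline CIDEr) are amplified while easy examples are left largely unchanged.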
Description: Date Revised: 01.07.2022; Published: Print-Electronic; Citation Status: PubMed-not-MEDLINE
ISSN: 1941-0042
DOI: 10.1109/TIP.2022.3183434