Discriminative Style Learning for Cross-Domain Image Captioning



Bibliographic Details
Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 31 (2022), dated: 27., pages 1723-1736
First author: Yuan, Jin (Author)
Other authors: Zhu, Shuai, Huang, Shuyin, Zhang, Hanwang, Xiao, Yaoqiang, Li, Zhiyong, Wang, Meng
Format: Online article
Language: English
Published: 2022
Access to parent work: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Keywords: Journal Article
Description
Abstract: Cross-domain image captioning, in which a model is trained on a source domain and generalized to other domains, usually faces a large domain-shift problem. Although prior work has attempted to leverage both paired source and unpaired target data to minimize this shift, performance remains unsatisfactory. One main reason lies in the large discrepancy in language expression between the two domains: diverse language styles are adopted to describe an image from different views, resulting in different semantic descriptions of the same image. To tackle this problem, this paper proposes a Style-based Cross-domain Image Captioner (SCIC), which incorporates discriminative style information into the encoder-decoder framework and interprets an image as a special sentence according to external style instructions. Technically, we design a novel "Instruction-based LSTM" (I-LSTM), which adds an instruct gate to collect a style instruction and then generates output in the format specified by that instruction. Two objectives are designed to train I-LSTM: 1) generating correct image descriptions and 2) generating correct styles, so the model is expected both to accurately capture the semantic meaning of an image through the style-specific caption and to understand the syntactic structure of the caption. We use MS-COCO as the source domain, and Oxford-102, CUB-200, and Flickr30k as the target domains. Experimental results demonstrate that our model consistently outperforms previous methods, and that the style information incorporated via I-LSTM significantly improves performance, with at least 5% CIDEr improvement on all datasets.
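The record does not give the exact gating equations of I-LSTM, so the following is only a minimal PyTorch sketch of the idea the abstract describes: a standard LSTM cell extended with an instruct gate that reads a style-instruction embedding and decides how much of it to inject into the cell state. All names here (InstructionLSTMCell, instruct_gate, style_proj) and the particular gating form are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class InstructionLSTMCell(nn.Module):
    """Illustrative sketch of an LSTM cell with an extra 'instruct' gate.

    Assumption: the instruct gate is conditioned on the input, the hidden
    state, and a style-instruction embedding, and modulates how much style
    information enters the cell state. The paper's actual equations may differ.
    """

    def __init__(self, input_size, hidden_size, style_size):
        super().__init__()
        # Standard input/forget/output/candidate gates, computed jointly.
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        # Extra instruct gate, additionally conditioned on the style instruction.
        self.instruct_gate = nn.Linear(
            input_size + hidden_size + style_size, hidden_size
        )
        # Projects the style embedding into the cell-state space.
        self.style_proj = nn.Linear(style_size, hidden_size)

    def forward(self, x, style, state):
        h, c = state
        z = torch.cat([x, h], dim=-1)
        i, f, o, g = self.gates(z).chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        # Instruct gate: how strongly the style instruction steers this step.
        s = torch.sigmoid(self.instruct_gate(torch.cat([x, h, style], dim=-1)))
        # Usual LSTM cell update, plus gated injection of style information.
        c = f * c + i * g + s * torch.tanh(self.style_proj(style))
        h = o * torch.tanh(c)
        return h, c

if __name__ == "__main__":
    cell = InstructionLSTMCell(input_size=512, hidden_size=512, style_size=64)
    x = torch.randn(8, 512)       # per-step input features (batch of 8)
    style = torch.randn(8, 64)    # style-instruction embedding
    h = c = torch.zeros(8, 512)
    h, c = cell(x, style, (h, c)) # one decoding step
```

The two training objectives named in the abstract would then plausibly combine into a single loss, e.g. a weighted sum of caption cross-entropy and style-classification cross-entropy over the generated sequence; the exact weighting is not specified in this record.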
Description: Date Revised 09.02.2022
Published: Print-Electronic
Citation Status PubMed-not-MEDLINE
ISSN:1941-0042
DOI:10.1109/TIP.2022.3145158