Effective Multimodal Encoding for Image Paragraph Captioning

In this paper, we present a regularization-based image paragraph generation method. We propose a novel multimodal encoding generator (MEG) to generate effective multimodal encoding that captures not only an individual sentence but also visual and paragraph-sequential information. By utilizing the en...

Ausführliche Beschreibung

Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 31(2022) vom: 10., Seite 6381-6395
1. Verfasser: Nguyen, Thanh-Son (VerfasserIn)
Weitere Verfasser: Fernando, Basura
Format: Online-Aufsatz
Sprache:English
Veröffentlicht: 2022
Zugriff auf das übergeordnete Werk:IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Schlagworte:Journal Article
LEADER 01000naa a22002652 4500
001 NLM347306128
003 DE-627
005 20231226033522.0
007 cr uuu---uuuuu
008 231226s2022 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2022.3211467  |2 doi 
028 5 2 |a pubmed24n1157.xml 
035 |a (DE-627)NLM347306128 
035 |a (NLM)36215365 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Nguyen, Thanh-Son  |e verfasserin  |4 aut 
245 1 0 |a Effective Multimodal Encoding for Image Paragraph Captioning 
264 1 |c 2022 
336 |a Text  |b txt  |2 rdacontent 
337 |a ƒaComputermedien  |b c  |2 rdamedia 
338 |a ƒa Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 19.10.2022 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a In this paper, we present a regularization-based image paragraph generation method. We propose a novel multimodal encoding generator (MEG) to generate effective multimodal encoding that captures not only an individual sentence but also visual and paragraph-sequential information. By utilizing the encoding generated by MEG, we regularize a paragraph generation model that allows us to improve the results of the captioning model in all the evaluation metrics. With the support of the proposed MEG model for regularization, our paragraph generation model obtains state-of-the-art results on the Stanford paragraph dataset once further optimized with reinforcement learning. Moreover, we perform extensive empirical analysis on the capabilities of MEG encoding. A qualitative visualization based on t-distributed stochastic neighbor embedding (t-SNE) illustrates that sentence encoding generated by MEG captures some level of semantic information. We also demonstrate that the MEG encoding captures meaningful textual and visual information by performing multimodal sentence retrieval tasks and image instance retrieval given a paragraph query 
650 4 |a Journal Article 
700 1 |a Fernando, Basura  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g 31(2022) vom: 10., Seite 6381-6395  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnns 
773 1 8 |g volume:31  |g year:2022  |g day:10  |g pages:6381-6395 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2022.3211467  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 31  |j 2022  |b 10  |h 6381-6395