Actor and Action Modular Network for Text-Based Video Segmentation

Text-based video segmentation aims to segment an actor in video sequences by specifying the actor and the action it performs with a textual query. Previous methods fail to explicitly align the video content with the textual query in a fine-grained manner according to the actor and its action, due to the problem of semantic asymmetry: during multi-modal fusion, the two modalities contain different amounts of semantic information. To alleviate this problem, we propose a novel actor and action modular network that localizes the actor and its action individually in two separate modules. Specifically, we first learn the actor- and action-related content from the video and the textual query, and then match them in a symmetrical manner to localize the target tube. The target tube, which contains the desired actor and action, is then fed into a fully convolutional network to predict segmentation masks of the actor. Our method also establishes the association of objects across multiple frames with the proposed temporal proposal aggregation mechanism, which enables it to segment the video effectively and keep its predictions temporally consistent. The whole model allows joint learning of actor-action matching and segmentation, and achieves state-of-the-art performance for both single-frame segmentation and full-video segmentation on the A2D Sentences and J-HMDB Sentences datasets.
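
For readers who want a concrete picture of the abstract's two key ideas, below is a minimal sketch in Python/PyTorch. It shows (1) a symmetric actor/action matching head with separate modules for actor- and action-related content, and (2) a greedy IoU-based linker as a stand-in for the temporal proposal aggregation. All class names, dimensions, and the cosine-similarity scoring are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only; names, dimensions, and fusion are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorActionMatcher(nn.Module):
    """Scores candidate tubes against actor- and action-related text."""
    def __init__(self, vis_dim=256, txt_dim=300, emb_dim=128):
        super().__init__()
        # Separate projections so actor and action content are localized
        # and matched independently (the "symmetrical" fusion).
        self.vis_actor = nn.Linear(vis_dim, emb_dim)
        self.vis_action = nn.Linear(vis_dim, emb_dim)
        self.txt_actor = nn.Linear(txt_dim, emb_dim)
        self.txt_action = nn.Linear(txt_dim, emb_dim)

    def forward(self, tube_feats, actor_emb, action_emb):
        # tube_feats: (N, vis_dim); actor_emb/action_emb: (txt_dim,)
        s_actor = F.cosine_similarity(self.vis_actor(tube_feats),
                                      self.txt_actor(actor_emb)[None], dim=-1)
        s_action = F.cosine_similarity(self.vis_action(tube_feats),
                                       self.txt_action(action_emb)[None], dim=-1)
        return s_actor + s_action  # (N,) per-tube matching scores

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-6)

def link_proposals(frames, thr=0.5):
    """Greedily links per-frame box proposals into tubes by IoU overlap,
    a simplified stand-in for temporal proposal aggregation."""
    tubes = [[box] for box in frames[0]]
    for boxes in frames[1:]:
        for tube in tubes:
            best = max(boxes, key=lambda b: iou(tube[-1], b), default=None)
            if best is not None and iou(tube[-1], best) >= thr:
                tube.append(best)
    return tubes

# Usage: score 5 random candidate tubes against a parsed query; the
# best-scoring tube would be fed to the mask head (an FCN in the paper).
matcher = ActorActionMatcher()
scores = matcher(torch.randn(5, 256), torch.randn(300), torch.randn(300))
print("best tube:", scores.argmax().item())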

Full Description

Bibliographic Details
Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 31(2022) from: 28., pages 4474-4489
Main Author: Yang, Jianhua (Author)
Other Authors: Huang, Yan, Niu, Kai, Huang, Linjiang, Ma, Zhanyu, Wang, Liang
Format: Online Article
Language: English
Published: 2022
Access to the parent work: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
LEADER 01000naa a22002652 4500
001 NLM34284427X
003 DE-627
005 20231226015043.0
007 cr uuu---uuuuu
008 231226s2022 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2022.3185487  |2 doi 
028 5 2 |a pubmed24n1142.xml 
035 |a (DE-627)NLM34284427X 
035 |a (NLM)35763476 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Yang, Jianhua  |e verfasserin  |4 aut 
245 1 0 |a Actor and Action Modular Network for Text-Based Video Segmentation 
264 1 |c 2022 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 06.07.2022 
500 |a Date Revised 06.07.2022 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a Text-based video segmentation aims to segment an actor in video sequences by specifying the actor and the action it performs with a textual query. Previous methods fail to explicitly align the video content with the textual query in a fine-grained manner according to the actor and its action, due to the problem of semantic asymmetry: during multi-modal fusion, the two modalities contain different amounts of semantic information. To alleviate this problem, we propose a novel actor and action modular network that localizes the actor and its action individually in two separate modules. Specifically, we first learn the actor- and action-related content from the video and the textual query, and then match them in a symmetrical manner to localize the target tube. The target tube, which contains the desired actor and action, is then fed into a fully convolutional network to predict segmentation masks of the actor. Our method also establishes the association of objects across multiple frames with the proposed temporal proposal aggregation mechanism, which enables it to segment the video effectively and keep its predictions temporally consistent. The whole model allows joint learning of actor-action matching and segmentation, and achieves state-of-the-art performance for both single-frame segmentation and full-video segmentation on the A2D Sentences and J-HMDB Sentences datasets.
650 4 |a Journal Article 
700 1 |a Huang, Yan  |e verfasserin  |4 aut 
700 1 |a Niu, Kai  |e verfasserin  |4 aut 
700 1 |a Huang, Linjiang  |e verfasserin  |4 aut 
700 1 |a Ma, Zhanyu  |e verfasserin  |4 aut 
700 1 |a Wang, Liang  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g 31(2022) vom: 28., Seite 4474-4489  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnns 
773 1 8 |g volume:31  |g year:2022  |g day:28  |g pages:4474-4489 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2022.3185487  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 31  |j 2022  |b 28  |h 4474-4489