Saying the Unseen: Video Descriptions via Dialog Agents

Current vision and language tasks usually take complete visual data (e.g., raw images or videos) as input; however, practical scenarios often include situations where part of the visual information becomes inaccessible for various reasons, e.g., a restricted view with a fixed camera or intenti...


Bibliographic details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 44(2022), 10, dated Oct. 29, pp. 7190-7204
First author: Zhu, Ye (Author)
Other authors: Wu, Yu; Yang, Yi; Yan, Yan
Format: Online article
Language: English
Published: 2022
Parent work: IEEE transactions on pattern analysis and machine intelligence
Keywords: Journal Article; Video-Audio Media; Research Support, U.S. Gov't, Non-P.H.S.