Fine-Grained Visual Text Prompting
Vision-Language Models (VLMs), such as CLIP, excel at zero-shot image-level visual understanding but struggle with object-based tasks that require precise localization and recognition. Visual prompts, such as colorful boxes or circles, have been proposed to enhance local perception. However, these methods ofte...
| Published in: | IEEE transactions on pattern analysis and machine intelligence. - 1979. - PP(2024), 21 Nov. |
|---|---|
| Main author: | |
| Other authors: | |
| Format: | Online article |
| Language: | English |
| Published: | 2024 |
| Parent work: | IEEE transactions on pattern analysis and machine intelligence |
| Subjects: | Journal Article |
| Available online: | Full text |