Descriptive Attributes for Language-Based Object Keypoint Detection

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Multimodal vision and language (VL) models have recently shown strong performance in phrase grounding and object detection for both zero-shot and finetuned cases. We adapt a VL model (GLIP) for keypoint detection and evaluate on NABirds keypoints. Our language-based keypoints-as-objects detector GLIP-KP outperforms baseline top-down keypoint detection models based on heatmaps and allows for zero- and few-shot evaluation. When fully trained, enhancing the keypoint names with descriptive attributes gives a significant performance boost, raising AP by as much as 6.0, compared to models without attribute information. Our model exceeds heatmap-based HRNet’s AP by 4.4 overall and 8.4 on keypoints with attributes. With limited data, attributes raise zero-/one-/few-shot test AP by 1.0/3.4/1.6, respectively, on keypoints with attributes.

Original language	English
Title of host publication	Computer Vision Systems - 14th International Conference, ICVS 2023, Proceedings
Editors	Henrik I. Christensen, Peter Corke, Renaud Detry, Jean-Baptiste Weibel, Markus Vincze
Publisher	Springer
Publication date	2023
Pages	444-458
ISBN (Print)	9783031441363
DOIs	https://doi.org/10.1007/978-3-031-44137-0_37
Publication status	Published - 2023
Event	14th International Conference on Computer Vision Systems, ICVS 2023 - Vienna, Austria. Duration: 27 Sep 2023 → 29 Sep 2023

Conference

Conference	14th International Conference on Computer Vision Systems, ICVS 2023
Country	Austria
City	Vienna
Period	27/09/2023 → 29/09/2023

Series	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	14253 LNCS
ISSN	0302-9743

Research areas

Attributes, Keypoint detection, Vision & language models

ID: 372615567

Department of Computer Science