Descriptive Attributes for Language-Based Object Keypoint Detection

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Multimodal vision and language (VL) models have recently shown strong performance in phrase grounding and object detection for both zero-shot and finetuned cases. We adapt a VL model (GLIP) for keypoint detection and evaluate on NABirds keypoints. Our language-based keypoints-as-objects detector GLIP-KP outperforms baseline top-down keypoint detection models based on heatmaps and allows for zero- and few-shot evaluation. When fully trained, enhancing the keypoint names with descriptive attributes gives a significant performance boost, raising AP by as much as 6.0, compared to models without attribute information. Our model exceeds heatmap-based HRNet’s AP by 4.4 overall and 8.4 on keypoints with attributes. With limited data, attributes raise zero-/one-/few-shot test AP by 1.0/3.4/1.6, respectively, on keypoints with attributes.
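As a minimal sketch of the two ideas the abstract names (keypoints treated as objects, and keypoint names enhanced with descriptive attributes), the following shows how a text prompt and a box target might be prepared. It is illustrative only and is not taken from the paper: the helper names `build_prompt` and `keypoint_to_box`, the period-separated prompt layout, the fixed box size, and the example attribute words are all assumptions.

```python
from typing import Dict, List, Optional, Tuple


def build_prompt(keypoint_names: List[str],
                 attributes: Optional[Dict[str, str]] = None) -> str:
    """Join keypoint phrases into one period-separated text prompt.

    Assumption: each keypoint becomes one phrase, optionally prefixed
    with a descriptive attribute (e.g. "red crown").
    """
    attributes = attributes or {}
    phrases = []
    for name in keypoint_names:
        attr = attributes.get(name)
        phrases.append(f"{attr} {name}" if attr else name)
    return ". ".join(phrases) + "."


def keypoint_to_box(x: float, y: float, size: float,
                    width: int, height: int) -> Tuple[float, float, float, float]:
    """Turn a keypoint into a small square box (x1, y1, x2, y2).

    Assumption: a fixed-size box centred on the keypoint and clipped
    to the image bounds stands in for the "object" to detect.
    """
    half = size / 2.0
    return (max(0.0, x - half), max(0.0, y - half),
            min(float(width), x + half), min(float(height), y + half))


# Hypothetical keypoint names and attribute words, for illustration only.
names = ["bill", "crown", "belly"]
attrs = {"crown": "red", "belly": "white"}

plain_prompt = build_prompt(names)
attr_prompt = build_prompt(names, attrs)
box = keypoint_to_box(10, 50, size=40, width=200, height=100)

print(plain_prompt)
print(attr_prompt)
print(box)
```

Comparing `plain_prompt` with `attr_prompt` mirrors the comparison in the abstract between models trained with and without attribute information; how the authors actually construct prompts and box targets may differ.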

Original language: English
Title of host publication: Computer Vision Systems - 14th International Conference, ICVS 2023, Proceedings
Editors: Henrik I. Christensen, Peter Corke, Renaud Detry, Jean-Baptiste Weibel, Markus Vincze
Publisher: Springer
Publication date: 2023
Pages: 444-458
ISBN (Print): 9783031441363
Publication status: Published - 2023
Event: 14th International Conference on Computer Vision Systems, ICVS 2023 - Vienna, Austria
Duration: 27 Sep 2023 → 29 Sep 2023

Conference

Conference: 14th International Conference on Computer Vision Systems, ICVS 2023
Country: Austria
City: Vienna
Period: 27/09/2023 → 29/09/2023
Series: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14253 LNCS
ISSN: 0302-9743

Bibliographical note

Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023.

Research areas

  • Attributes, Keypoint detection, Vision & language models
