Cleaner Categories Improve Object Detection and Visual-Textual Grounding

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

Object detectors are core components of multimodal models, enabling them to locate regions of interest in images, which are then used to solve many multimodal tasks. Among the many extant object detectors, the Bottom-Up Faster R-CNN [39] (BUA) object detector is the most commonly used by the multimodal language-and-vision community, usually as a black-box visual feature generator for solving downstream multimodal tasks. It is trained on the Visual Genome dataset [25] to detect 1600 different objects. However, those object categories are defined using automatically processed image region descriptions from the Visual Genome dataset. The automatic process introduces some unexpected near-duplicate categories (e.g. “watch” and “wristwatch”, “tree” and “trees”, and “motorcycle” and “motorbike”) that may result in a sub-optimal representational space and likely impair the ability of the model to classify objects correctly. In this paper, we manually merge near-duplicate labels to create a cleaner label set, which is used to retrain the object detector. We investigate the effect of using the cleaner label set in terms of: (i) performance on the original object detection task, (ii) the properties of the embedding space learned by the detector, and (iii) the utility of the features in a visual grounding task on the Flickr30K Entities dataset. We find that the BUA model trained with the cleaner categories learns a better-clustered embedding space than the model trained with the noisy categories. The new embedding space improves object detection performance and also yields better bounding-box feature representations, which help to solve the visual grounding task.
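The label-merging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure (their merging was done manually over the full 1600-category set); the `MERGE_MAP` dictionary, the choice of which label in each pair is canonical, the helper names, and the annotation format are all assumptions made for illustration, using only the near-duplicate pairs named above.

```python
from collections import Counter

# Hypothetical merge map built from the near-duplicate pairs named in the abstract.
# Which label of each pair is treated as canonical is an assumption.
MERGE_MAP = {
    "wristwatch": "watch",
    "trees": "tree",
    "motorbike": "motorcycle",
}


def clean_label(label: str) -> str:
    """Return the canonical category for a raw label (unchanged if not merged)."""
    return MERGE_MAP.get(label, label)


def clean_label_set(labels):
    """Collapse raw categories into a sorted, de-duplicated label set."""
    return sorted({clean_label(label) for label in labels})


def remap_annotations(annotations):
    """Rewrite the 'label' field of each annotation with its canonical category."""
    return [{**ann, "label": clean_label(ann["label"])} for ann in annotations]


raw_labels = ["watch", "wristwatch", "tree", "trees", "motorcycle", "motorbike", "dog"]
cleaned = clean_label_set(raw_labels)

# Toy annotations in an assumed format: a bounding box plus a raw label.
annotations = [
    {"box": (10, 20, 50, 60), "label": "wristwatch"},
    {"box": (0, 0, 100, 200), "label": "trees"},
    {"box": (5, 5, 30, 30), "label": "dog"},
]
remapped = remap_annotations(annotations)
label_counts = Counter(ann["label"] for ann in remapped)
```

In this toy setting, seven raw categories collapse to four, and the remapped annotations could then serve as training targets when retraining a detector on the cleaner label set.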

Original language	English
Title	Image Analysis - 23rd Scandinavian Conference, SCIA 2023, Proceedings
Editors	Rikke Gade, Michael Felsberg, Joni-Kristian Kämäräinen
Publisher	Springer
Publication date	2023
Pages	412-442
ISBN (Print)	9783031314346
DOI	https://doi.org/10.1007/978-3-031-31435-3_28
Status	Published - 2023
Event	23rd Scandinavian Conference on Image Analysis, SCIA 2023 - Lapland, Finland Duration: 18 Apr 2023 → 21 Apr 2023

Conference

Conference	23rd Scandinavian Conference on Image Analysis, SCIA 2023
Country	Finland
City	Lapland
Period	18/04/2023 → 21/04/2023

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	13885 LNCS
ISSN	0302-9743

Bibliographical note

ID: 357283955

Department of Computer Science