Retrieval-augmented Image Captioning

Publication: Contribution to book/anthology/report › Article in proceedings › Research › peer-reviewed

Documents

Full text
Accepted manuscript, 2.44 MB, PDF document

Rita Ramos
Desmond Elliott
Bruno Martins

Inspired by retrieval-augmented language generation and pretrained Vision and Language (V&L) encoders, we present a new approach to image captioning that generates sentences given the input image and a set of captions retrieved from a datastore, as opposed to the image alone. The encoder in our model jointly processes the image and retrieved captions using a pretrained V&L BERT, while the decoder attends to the multimodal encoder representations, benefiting from the extra textual evidence from the retrieved captions. Experimental results on the COCO dataset show that image captioning can be effectively formulated from this new perspective. Our model, named EXTRA, benefits from using captions retrieved from the training dataset, and it can also benefit from using an external dataset without the need for retraining. Ablation studies show that retrieving a sufficient number of captions (e.g., k=5) can improve captioning quality. Our work contributes towards using pretrained V&L encoders for generative tasks, instead of standard classification tasks.
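The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that the image and the datastore captions are already embedded in a shared vector space and that nearest neighbours are ranked by cosine similarity. The record does not state how retrieval is actually performed, so the similarity measure, the toy vectors, and the helper names `cosine`, `retrieve_captions`, and `build_encoder_text` are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_captions(image_vec, datastore, k=5):
    """Return the k captions whose embeddings are most similar to image_vec.

    datastore is a list of (embedding, caption) pairs. This brute-force scan
    is an assumption made for illustration; it is not the paper's retrieval system.
    """
    ranked = sorted(datastore, key=lambda item: cosine(image_vec, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]


def build_encoder_text(captions, sep=" [SEP] "):
    """Join retrieved captions into one string to be encoded together with the image.

    The separator is a hypothetical choice, not a detail given in the abstract.
    """
    return sep.join(captions)


# Toy two-dimensional embeddings, invented for this example.
toy_datastore = [
    ([1.0, 0.0], "a dog on a beach"),
    ([0.0, 1.0], "a plate of food"),
    ([0.9, 0.1], "a dog running on sand"),
]
toy_image_vec = [1.0, 0.0]

retrieved = retrieve_captions(toy_image_vec, toy_datastore, k=2)
encoder_text = build_encoder_text(retrieved)
print(encoder_text)
```

In this sketch, swapping `toy_datastore` for a different list of captions changes what is retrieved without touching any model parameters, which mirrors the abstract's point that an external dataset can be used without retraining.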

Original language	English
Title	EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
Publisher	Association for Computational Linguistics (ACL)
Publication date	2023
Pages	3648-3663
ISBN (Electronic)	9781959429449
Status	Published - 2023
Event	17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Dubrovnik, Croatia. Duration: 2 May 2023 → 6 May 2023

Conference

Conference	17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023
Country	Croatia
City	Dubrovnik
Period	02/05/2023 → 06/05/2023
Sponsor	Adobe, Babelscape, Bloomberg Engineering, Duolingo, Liveperson

Bibliographical note

Funding Information:
This research was supported by the Portuguese Recovery and Resilience Plan (RRP) through project C645008882-00000055 (Responsible.AI), and also through Fundação para a Ciência e Tecnologia (FCT), namely through the Ph.D. scholarship with reference 2020.06106.BD, as well as through the INESC-ID multi-annual funding from the PIDDAC programme with reference UIDB/50021/2020.

Publisher Copyright:
© 2023 Association for Computational Linguistics.

ID: 356886206

Datalogisk Institut