Retrieval-augmented Image Captioning

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

Documents

  • Full text

    Accepted manuscript, 2.44 MB, PDF document

Inspired by retrieval-augmented language generation and pretrained Vision and Language (V&L) encoders, we present a new approach to image captioning that generates sentences given the input image and a set of captions retrieved from a datastore, as opposed to the image alone. The encoder in our model jointly processes the image and retrieved captions using a pretrained V&L BERT, while the decoder attends to the multimodal encoder representations, benefiting from the extra textual evidence from the retrieved captions. Experimental results on the COCO dataset show that image captioning can be effectively formulated from this new perspective. Our model, named EXTRA, benefits from using captions retrieved from the training dataset, and it can also benefit from using an external dataset without the need for retraining. Ablation studies show that retrieving a sufficient number of captions (e.g., k=5) can improve captioning quality. Our work contributes towards using pretrained V&L encoders for generative tasks, instead of standard classification tasks.
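The retrieval step described in the abstract can be illustrated with a minimal, hypothetical sketch: given a query, rank the captions in a datastore by similarity and keep the top k (the paper reports k=5 works well). This toy version uses bag-of-words cosine similarity purely for illustration; the actual EXTRA model retrieves with a pretrained vision-and-language encoder, and all names below are assumptions, not the paper's code.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words embedding; a hypothetical stand-in for the
    # pretrained V&L encoder representations used in the paper.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_captions(query, datastore, k=5):
    # Rank datastore captions by similarity to the query and keep the
    # top k; these captions would then be fed to the encoder alongside
    # the image as extra textual evidence.
    q = embed(query)
    ranked = sorted(datastore, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

datastore = [
    "a dog runs across a grassy field",
    "a cat sleeps on a sofa",
    "two dogs play with a ball in the park",
    "a plate of pasta on a table",
    "a dog catches a frisbee in the air",
    "a man rides a bicycle down the street",
]
retrieved = retrieve_captions("a dog playing in the grass", datastore, k=3)
```

Because retrieval only reads from the datastore, swapping in an external caption collection (as the abstract notes) requires no retraining of the captioning model itself.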

Original language: English
Title: EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Publication date: 2023
Pages: 3648-3663
ISBN (electronic): 9781959429449
Status: Published - 2023
Event: 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Dubrovnik, Croatia
Duration: 2 May 2023 - 6 May 2023

Conference

Conference: 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023
Country: Croatia
City: Dubrovnik
Period: 02/05/2023 - 06/05/2023
Sponsors: Adobe, Babelscape, Bloomberg Engineering, Duolingo, Liveperson

Bibliographical note

Funding Information:
This research was supported by the Portuguese Recovery and Resilience Plan (RRP) through project C645008882-00000055 (Responsible.AI), and also through Fundação para a Ciência e Tecnologia (FCT), namely through the Ph.D. scholarship with reference 2020.06106.BD, as well as through the INESC-ID multi-annual funding from the PIDDAC programme with reference UIDB/50021/2020.

Publisher Copyright:
© 2023 Association for Computational Linguistics.

ID: 356886206