Retrieval-augmented Image Captioning
Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed
Documents
- Fulltext
Accepted manuscript, 2.44 MB, PDF document
Inspired by retrieval-augmented language generation and pretrained Vision and Language (V&L) encoders, we present a new approach to image captioning that generates sentences given the input image and a set of captions retrieved from a datastore, as opposed to the image alone. The encoder in our model jointly processes the image and retrieved captions using a pretrained V&L BERT, while the decoder attends to the multimodal encoder representations, benefiting from the extra textual evidence from the retrieved captions. Experimental results on the COCO dataset show that image captioning can be effectively formulated from this new perspective. Our model, named EXTRA, benefits from using captions retrieved from the training dataset, and it can also benefit from using an external dataset without the need for retraining. Ablation studies show that retrieving a sufficient number of captions (e.g., k=5) can improve captioning quality. Our work contributes towards using pretrained V&L encoders for generative tasks, instead of standard classification tasks.
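The retrieval step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how captions are retrieved, so this example assumes that images and captions are embedded in a shared vector space and ranked by cosine similarity. The function names, the toy datastore, and the three-dimensional vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_captions(query_vec, datastore, k=5):
    """Return up to k captions whose vectors are most similar to query_vec.

    datastore is a list of (caption, vector) pairs. Assumption: the vectors
    come from some encoder that maps images and text into the same space.
    """
    ranked = sorted(
        datastore,
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [caption for caption, _ in ranked[:k]]


# Toy datastore with made-up embeddings, for illustration only.
DATASTORE = [
    ("a dog runs on the beach", [0.9, 0.1, 0.0]),
    ("a puppy plays in the sand", [0.8, 0.2, 0.1]),
    ("a plate of pasta on a table", [0.0, 0.1, 0.9]),
    ("a man rides a bicycle", [0.1, 0.9, 0.1]),
]

# Hypothetical embedding of the input image.
image_vec = [1.0, 0.1, 0.0]
retrieved = retrieve_captions(image_vec, DATASTORE, k=2)
print(retrieved)
```

In a setup like the one the abstract describes, the retrieved captions would then be passed to the encoder together with the image. Swapping `DATASTORE` for a different collection changes the textual evidence without touching the retrieval function, which is the property the abstract refers to when it mentions using an external dataset without retraining.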
Original language | English |
---|---|
Title | EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference |
Publisher | Association for Computational Linguistics (ACL) |
Publication date | 2023 |
Pages | 3648-3663 |
ISBN (Electronic) | 9781959429449 |
Status | Published - 2023 |
Event | 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Dubrovnik, Croatia. Duration: 2 May 2023 → 6 May 2023 |
Conference
Conference | 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 |
---|---|
Country | Croatia |
City | Dubrovnik |
Period | 02/05/2023 → 06/05/2023 |
Sponsor | Adobe, Babelscape, Bloomberg Engineering, Duolingo, Liveperson |
Bibliographical note
Funding Information:
This research was supported by the Portuguese Recovery and Resilience Plan (RRP) through project C645008882-00000055 (Responsible.AI), and also through Fundação para a Ciência e Tecnologia (FCT), namely through the Ph.D. scholarship with reference 2020.06106.BD, as well as through the INESC-ID multi-annual funding from the PIDDAC programme with reference UIDB/50021/2020.
Publisher Copyright:
© 2023 Association for Computational Linguistics.
ID: 356886206