Retrieval-augmented Image Captioning

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Retrieval-augmented Image Captioning. / Ramos, Rita; Elliott, Desmond; Martins, Bruno.

EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference. Association for Computational Linguistics (ACL), 2023. p. 3648-3663.

Harvard

Ramos, R, Elliott, D & Martins, B 2023, Retrieval-augmented Image Captioning. in EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference. Association for Computational Linguistics (ACL), pp. 3648-3663, 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, 02/05/2023.

APA

Ramos, R., Elliott, D., & Martins, B. (2023). Retrieval-augmented Image Captioning. In EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference (pp. 3648-3663). Association for Computational Linguistics (ACL).

Vancouver

Ramos R, Elliott D, Martins B. Retrieval-augmented Image Captioning. In EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference. Association for Computational Linguistics (ACL). 2023. p. 3648-3663.

Author

Ramos, Rita ; Elliott, Desmond ; Martins, Bruno. / Retrieval-augmented Image Captioning. EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference. Association for Computational Linguistics (ACL), 2023. pp. 3648-3663

Bibtex

@inproceedings{b26460c62c954486a748490bccb471f3,
title = "Retrieval-augmented Image Captioning",
abstract = "Inspired by retrieval-augmented language generation and pretrained Vision and Language (V&L) encoders, we present a new approach to image captioning that generates sentences given the input image and a set of captions retrieved from a datastore, as opposed to the image alone. The encoder in our model jointly processes the image and retrieved captions using a pretrained V&L BERT, while the decoder attends to the multimodal encoder representations, benefiting from the extra textual evidence from the retrieved captions. Experimental results on the COCO dataset show that image captioning can be effectively formulated from this new perspective. Our model, named EXTRA, benefits from using captions retrieved from the training dataset, and it can also benefit from using an external dataset without the need for retraining. Ablation studies show that retrieving a sufficient number of captions (e.g., k=5) can improve captioning quality. Our work contributes towards using pretrained V&L encoders for generative tasks, instead of standard classification tasks.",
author = "Rita Ramos and Desmond Elliott and Bruno Martins",
note = "Publisher Copyright: {\textcopyright} 2023 Association for Computational Linguistics.; 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 ; Conference date: 02-05-2023 Through 06-05-2023",
year = "2023",
language = "English",
pages = "3648--3663",
booktitle = "EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference",
publisher = "Association for Computational Linguistics (ACL)",
address = "United States",
}

RIS

TY - GEN

T1 - Retrieval-augmented Image Captioning

AU - Ramos, Rita

AU - Elliott, Desmond

AU - Martins, Bruno

N1 - Publisher Copyright: © 2023 Association for Computational Linguistics.

PY - 2023

Y1 - 2023

AB - Inspired by retrieval-augmented language generation and pretrained Vision and Language (V&L) encoders, we present a new approach to image captioning that generates sentences given the input image and a set of captions retrieved from a datastore, as opposed to the image alone. The encoder in our model jointly processes the image and retrieved captions using a pretrained V&L BERT, while the decoder attends to the multimodal encoder representations, benefiting from the extra textual evidence from the retrieved captions. Experimental results on the COCO dataset show that image captioning can be effectively formulated from this new perspective. Our model, named EXTRA, benefits from using captions retrieved from the training dataset, and it can also benefit from using an external dataset without the need for retraining. Ablation studies show that retrieving a sufficient number of captions (e.g., k=5) can improve captioning quality. Our work contributes towards using pretrained V&L encoders for generative tasks, instead of standard classification tasks.

UR - http://www.scopus.com/inward/record.url?scp=85153075616&partnerID=8YFLogxK

M3 - Article in proceedings

AN - SCOPUS:85153075616

SP - 3648

EP - 3663

BT - EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference

PB - Association for Computational Linguistics (ACL)

T2 - 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023

Y2 - 2 May 2023 through 6 May 2023

ER -

ID: 356886206