Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language BERTs

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language BERTs. / Bugliarello, Emanuele; Cotterell, Ryan; Okazaki, Naoaki; Elliott, Desmond.

In: Transactions of the Association for Computational Linguistics, Vol. 9, 2021, p. 978-994.

Research output: Contribution to journal › Journal article › Research › peer-review

Harvard

Bugliarello, E, Cotterell, R, Okazaki, N & Elliott, D 2021, 'Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language BERTs', Transactions of the Association for Computational Linguistics, vol. 9, pp. 978-994. https://doi.org/10.1162/tacl_a_00408

APA

Bugliarello, E., Cotterell, R., Okazaki, N., & Elliott, D. (2021). Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language BERTs. Transactions of the Association for Computational Linguistics, 9, 978-994. https://doi.org/10.1162/tacl_a_00408

Vancouver

Bugliarello E, Cotterell R, Okazaki N, Elliott D. Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language BERTs. Transactions of the Association for Computational Linguistics. 2021;9:978-994. https://doi.org/10.1162/tacl_a_00408

Author

Bugliarello, Emanuele; Cotterell, Ryan; Okazaki, Naoaki; Elliott, Desmond. / Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language BERTs. In: Transactions of the Association for Computational Linguistics. 2021; Vol. 9, pp. 978-994.

BibTeX

@article{a9616534ac214e5db15494f90e521dfb,
title = "Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language {BERTs}",
abstract = "Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorized into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five vision and language BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.",
author = "Emanuele Bugliarello and Ryan Cotterell and Naoaki Okazaki and Desmond Elliott",
note = "Publisher Copyright: {\textcopyright} 2021, MIT Press Journals. All rights reserved.",
year = "2021",
doi = "10.1162/tacl_a_00408",
language = "English",
volume = "9",
pages = "978--994",
journal = "Transactions of the Association for Computational Linguistics",
issn = "2307-387X",
publisher = "MIT Press",
}

RIS

TY - JOUR

T1 - Multimodal pretraining unmasked

T2 - A meta-analysis and a unified framework of vision-and-language BERTs

AU - Bugliarello, Emanuele

AU - Cotterell, Ryan

AU - Okazaki, Naoaki

AU - Elliott, Desmond

N1 - Publisher Copyright: © 2021, MIT Press Journals. All rights reserved.

PY - 2021

Y1 - 2021

N2 - Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorized into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five vision and language BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.

AB - Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorized into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five vision and language BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.

U2 - 10.1162/tacl_a_00408

DO - 10.1162/tacl_a_00408

M3 - Journal article

AN - SCOPUS:85119145818

VL - 9

SP - 978

EP - 994

JO - Transactions of the Association for Computational Linguistics

JF - Transactions of the Association for Computational Linguistics

SN - 2307-387X

ER -
