Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

Publication: Contribution to journal › Journal article › Research › peer-reviewed

Documents

  • Full text

    Publisher's published version, 1.22 MB, PDF document

Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorized into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five vision and language BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.
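For readers unfamiliar with the single-stream versus dual-stream distinction discussed in the abstract, the sketch below illustrates the two encoder layouts in their simplest form: a single-stream layer applies self-attention over the concatenated text and image tokens, while a dual-stream layer keeps a separate stream per modality with cross-modal attention between them. This is a minimal, hypothetical PyTorch illustration; the module names, dimensions, and layer ordering are assumptions for exposition and do not reproduce the paper's models or its unified framework.

```python
# Illustrative sketch only (not the paper's code). Assumes PyTorch;
# all tensors have shape (batch, seq_len, hidden).
import torch
import torch.nn as nn


class SingleStreamLayer(nn.Module):
    """Self-attention over the concatenation of text and image tokens."""

    def __init__(self, hidden: int = 768, heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        x = torch.cat([text, image], dim=1)                 # one joint sequence
        x = x + self.attn(x, x, x, need_weights=False)[0]   # joint self-attention
        return x + self.ffn(x)


class DualStreamLayer(nn.Module):
    """Separate text and image streams with cross-modal attention between them."""

    def __init__(self, hidden: int = 768, heads: int = 12):
        super().__init__()
        self.text_self = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.img_self = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.text_cross = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.img_cross = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, text: torch.Tensor, image: torch.Tensor):
        # Cross-modal attention: queries from one stream, keys/values from the other.
        t = text + self.text_cross(text, image, image, need_weights=False)[0]
        v = image + self.img_cross(image, text, text, need_weights=False)[0]
        # Intra-modal self-attention within each stream.
        t = t + self.text_self(t, t, t, need_weights=False)[0]
        v = v + self.img_self(v, v, v, need_weights=False)[0]
        return t, v


if __name__ == "__main__":
    text = torch.randn(2, 16, 768)    # e.g. 16 subword tokens
    image = torch.randn(2, 36, 768)   # e.g. 36 image-region features
    print(SingleStreamLayer()(text, image).shape)   # torch.Size([2, 52, 768])
    t, v = DualStreamLayer()(text, image)
    print(t.shape, v.shape)                         # ([2, 16, 768]) ([2, 36, 768])
```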

Original language: English
Journal: Transactions of the Association for Computational Linguistics
Volume: 9
Pages (from-to): 978-994
Number of pages: 17
ISSN: 2307-387X
DOI
Status: Published - 2021


ID: 298034552