Implications of the Convergence of Language and Vision Model Geometries

Publication: Working paper › Preprint › Research

Documents

Full text
Final published version, 4.97 MB, PDF document

Large-scale pretrained language models (LMs) are said to "lack the ability to connect [their] utterances to the world" (Bender and Koller, 2020). If so, we would expect LM representations to be unrelated to representations in computer vision models. To investigate this, we present an empirical evaluation across three different LMs (BERT, GPT2, and OPT) and three computer vision models (VMs, including ResNet, SegFormer, and MAE). Our experiments show that LMs converge towards representations that are partially isomorphic to those of VMs, with dispersion and polysemy both factoring into the alignability of vision and language spaces. We discuss the implications of this finding.

Original language	English
Publisher	arXiv.org
Number of pages	19
Status	Published - 2023

Department of Computer Science (Datalogisk Institut)
