Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

Documents

  • Full text

    Publisher's published version, 828 KB, PDF document

Multilingual language models exhibit better performance for some languages than for others (Singh et al., 2019), and many languages do not seem to benefit from multilingual sharing at all, presumably as a result of poor multilingual segmentation (Pyysalo et al., 2020). This work explores the idea of learning multilingual language models based on clustering of monolingual segments. We show significant improvements over standard multilingual segmentation and training across nine languages on a question answering task, both in a small model regime and for a model the size of BERT-base.
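The abstract describes building a shared multilingual vocabulary by clustering segments from monolingual vocabularies. The sketch below is a minimal illustration of that general idea, not the authors' actual pipeline: it pools placeholder subword embeddings from several monolingual vocabularies and clusters them with k-means, so that subwords falling into the same cluster share one entry in the joint vocabulary. The toy vocabularies, random embeddings, and cluster count are all hypothetical.

```python
# Illustrative sketch: merge monolingual subword vocabularies by clustering
# their embeddings; subwords in the same cluster share a joint-vocabulary ID.
# Vocabularies and embeddings here are placeholders, not the paper's data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Placeholder monolingual vocabularies (in practice, the output of
# per-language subword tokenizer training).
vocabs = {
    "en": ["_the", "ing", "tion"],
    "de": ["_der", "ung", "tion"],
    "fi": ["_ja", "inen", "ssa"],
}

# Placeholder embeddings (in practice, learned subword representations).
dim = 16
embeddings = {
    (lang, piece): rng.normal(size=dim)
    for lang, pieces in vocabs.items()
    for piece in pieces
}

# Stack all monolingual subword vectors and cluster them jointly.
keys = list(embeddings)
X = np.stack([embeddings[k] for k in keys])
n_clusters = 5  # illustrative; the shared vocabulary size is a hyperparameter
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)

# Each cluster becomes one shared vocabulary entry, so subwords from
# different languages can be mapped to the same ID.
shared_id = {key: int(c) for key, c in zip(keys, kmeans.labels_)}
for (lang, piece), cid in sorted(shared_id.items(), key=lambda kv: kv[1]):
    print(f"cluster {cid}: {lang}:{piece}")
```

With real (rather than random) embeddings, semantically or distributionally similar subwords across languages would tend to land in the same cluster, which is what allows the clustered vocabulary to improve cross-lingual sharing.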
Original language: English
Title: Proceedings of the 1st Workshop on Multilingual Representation Learning
Publisher: Association for Computational Linguistics
Publication date: 2021
Pages: 32–40
DOI
Status: Published - 2021
Event: 1st Workshop on Multilingual Representation Learning - Online
Duration: 11 Nov 2021 → 11 Nov 2021



