mDAPT: Multilingual Domain Adaptive Pretraining in a Single Model
Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed
Domain adaptive pretraining, i.e. the continued unsupervised pretraining of a language model on domain-specific text, improves the modelling of text for downstream tasks within the domain. Numerous real-world applications are based on domain-specific text, e.g. working with financial or biomedical documents, and these applications often need to support multiple languages. However, large-scale domain-specific multilingual pretraining data for such scenarios can be difficult to obtain, due to regulations, legislation, or simply a lack of language- and domain-specific text. One solution is to train a single multilingual model, taking advantage of the data available in as many languages as possible. In this work, we explore the benefits of domain adaptive pretraining with a focus on adapting to multiple languages within a specific domain. We propose different techniques to compose pretraining corpora that enable a language model to become both domain-specific and multilingual. Evaluation on nine domain-specific datasets (for biomedical named entity recognition and financial sentence classification) covering seven different languages shows that a single multilingual domain-specific model can outperform the general multilingual model, and performs close to its monolingual counterpart. This finding holds across two different pretraining methods: adapter-based pretraining and full model pretraining.
Original language | English |
---|---|
Title | Findings of the Association for Computational Linguistics: EMNLP 2021 |
Publisher | Association for Computational Linguistics |
Publication date | 2021 |
Pages | 3404-3418 |
DOI | |
Status | Published - 2021 |
Event | Findings of the Association for Computational Linguistics: EMNLP 2021 - Punta Cana, Dominican Republic. Duration: 1 Nov 2021 → 1 Nov 2021 |
Conference
Conference | Findings of the Association for Computational Linguistics: EMNLP 2021 |
---|---|
Country | Dominican Republic |
City | Punta Cana |
Period | 01/11/2021 → 01/11/2021 |
ID: 299036345