mDAPT: Multilingual Domain Adaptive Pretraining in a Single Model
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
mDAPT: Multilingual Domain Adaptive Pretraining in a Single Model. / Kær Jørgensen, Rasmus; Hartmann, Mareike; Dai, Xiang; Elliott, Desmond.
Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, 2021. p. 3404-3418.
RIS
TY - GEN
T1 - mDAPT: Multilingual Domain Adaptive Pretraining in a Single Model
T2 - Findings of the Association for Computational Linguistics: EMNLP 2021
AU - Kær Jørgensen, Rasmus
AU - Hartmann, Mareike
AU - Dai, Xiang
AU - Elliott, Desmond
PY - 2021
Y1 - 2021
N2 - Domain adaptive pretraining, i.e. the continued unsupervised pretraining of a language model on domain-specific text, improves the modelling of text for downstream tasks within the domain. Numerous real-world applications are based on domain-specific text, e.g. working with financial or biomedical documents, and these applications often need to support multiple languages. However, large-scale domain-specific multilingual pretraining data for such scenarios can be difficult to obtain, due to regulations, legislation, or simply a lack of language- and domain-specific text. One solution is to train a single multilingual model, taking advantage of the data available in as many languages as possible. In this work, we explore the benefits of domain adaptive pretraining with a focus on adapting to multiple languages within a specific domain. We propose different techniques to compose pretraining corpora that enable a language model to both become domain-specific and multilingual. Evaluation on nine domain-specific datasets (for biomedical named entity recognition and financial sentence classification) covering seven different languages shows that a single multilingual domain-specific model can outperform the general multilingual model, and performs close to its monolingual counterpart. This finding holds across two different pretraining methods, adapter-based pretraining and full model pretraining.
AB - Domain adaptive pretraining, i.e. the continued unsupervised pretraining of a language model on domain-specific text, improves the modelling of text for downstream tasks within the domain. Numerous real-world applications are based on domain-specific text, e.g. working with financial or biomedical documents, and these applications often need to support multiple languages. However, large-scale domain-specific multilingual pretraining data for such scenarios can be difficult to obtain, due to regulations, legislation, or simply a lack of language- and domain-specific text. One solution is to train a single multilingual model, taking advantage of the data available in as many languages as possible. In this work, we explore the benefits of domain adaptive pretraining with a focus on adapting to multiple languages within a specific domain. We propose different techniques to compose pretraining corpora that enable a language model to both become domain-specific and multilingual. Evaluation on nine domain-specific datasets (for biomedical named entity recognition and financial sentence classification) covering seven different languages shows that a single multilingual domain-specific model can outperform the general multilingual model, and performs close to its monolingual counterpart. This finding holds across two different pretraining methods, adapter-based pretraining and full model pretraining.
U2 - 10.18653/v1/2021.findings-emnlp.290
DO - 10.18653/v1/2021.findings-emnlp.290
M3 - Article in proceedings
SP - 3404
EP - 3418
BT - Findings of the Association for Computational Linguistics: EMNLP 2021
PB - Association for Computational Linguistics
Y2 - 1 November 2021 through 1 November 2021
ER -
ID: 299036345