mDAPT: Multilingual Domain Adaptive Pretraining in a Single Model
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
mDAPT: Multilingual Domain Adaptive Pretraining in a Single Model. / Kær Jørgensen, Rasmus; Hartmann, Mareike; Dai, Xiang; Elliott, Desmond.
Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, 2021. p. 3404-3418.
RIS
TY - GEN
T1 - mDAPT: Multilingual Domain Adaptive Pretraining in a Single Model
T2 - Findings of the Association for Computational Linguistics: EMNLP 2021
AU - Kær Jørgensen, Rasmus
AU - Hartmann, Mareike
AU - Dai, Xiang
AU - Elliott, Desmond
PY - 2021
Y1 - 2021
N2 - Domain adaptive pretraining, i.e. the continued unsupervised pretraining of a language model on domain-specific text, improves the modelling of text for downstream tasks within the domain. Numerous real-world applications are based on domain-specific text, e.g. working with financial or biomedical documents, and these applications often need to support multiple languages. However, large-scale domain-specific multilingual pretraining data for such scenarios can be difficult to obtain, due to regulations, legislation, or simply a lack of language- and domain-specific text. One solution is to train a single multilingual model, taking advantage of the data available in as many languages as possible. In this work, we explore the benefits of domain adaptive pretraining with a focus on adapting to multiple languages within a specific domain. We propose different techniques to compose pretraining corpora that enable a language model to both become domain-specific and multilingual. Evaluation on nine domain-specific datasets (for biomedical named entity recognition and financial sentence classification) covering seven different languages shows that a single multilingual domain-specific model can outperform the general multilingual model, and performs close to its monolingual counterpart. This finding holds across two different pretraining methods, adapter-based pretraining and full model pretraining.
AB - Domain adaptive pretraining, i.e. the continued unsupervised pretraining of a language model on domain-specific text, improves the modelling of text for downstream tasks within the domain. Numerous real-world applications are based on domain-specific text, e.g. working with financial or biomedical documents, and these applications often need to support multiple languages. However, large-scale domain-specific multilingual pretraining data for such scenarios can be difficult to obtain, due to regulations, legislation, or simply a lack of language- and domain-specific text. One solution is to train a single multilingual model, taking advantage of the data available in as many languages as possible. In this work, we explore the benefits of domain adaptive pretraining with a focus on adapting to multiple languages within a specific domain. We propose different techniques to compose pretraining corpora that enable a language model to both become domain-specific and multilingual. Evaluation on nine domain-specific datasets (for biomedical named entity recognition and financial sentence classification) covering seven different languages shows that a single multilingual domain-specific model can outperform the general multilingual model, and performs close to its monolingual counterpart. This finding holds across two different pretraining methods, adapter-based pretraining and full model pretraining.
U2 - 10.18653/v1/2021.findings-emnlp.290
DO - 10.18653/v1/2021.findings-emnlp.290
M3 - Article in proceedings
SP - 3404
EP - 3418
BT - Findings of the Association for Computational Linguistics: EMNLP 2021
PB - Association for Computational Linguistics
Y2 - 1 November 2021 through 1 November 2021
ER -
ID: 299036345