Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese. / Snæbjarnarson, Vésteinn; Simonsen, Annika; Glavaš, Goran; Vulić, Ivan.

Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa). Association for Computational Linguistics (ACL), 2023. p. 728–737.

Harvard

Snæbjarnarson, V, Simonsen, A, Glavaš, G & Vulić, I 2023, Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese. in Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa). Association for Computational Linguistics (ACL), pp. 728–737, NoDaLiDa 2023, Tórshavn, Faroe Islands, 22/05/2023. <https://aclanthology.org/2023.nodalida-1.74/>

APA

Snæbjarnarson, V., Simonsen, A., Glavaš, G., & Vulić, I. (2023). Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa) (pp. 728–737). Association for Computational Linguistics (ACL). https://aclanthology.org/2023.nodalida-1.74/

Vancouver

Snæbjarnarson V, Simonsen A, Glavaš G, Vulić I. Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa). Association for Computational Linguistics (ACL). 2023. p. 728–737.

Author

Snæbjarnarson, Vésteinn ; Simonsen, Annika ; Glavaš, Goran ; Vulić, Ivan. / Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese. Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa). Association for Computational Linguistics (ACL), 2023. pp. 728–737.

Bibtex

@inproceedings{8c6782ee3cde4d0d871cef6f6362023b,

title = "Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese",

abstract = "Multilingual language models have pushed state-of-the-art in cross-lingual NLP transfer. The majority of zero-shot cross-lingual transfer, however, use one and the same massively multilingual transformer (e.g., mBERT or XLM-R) to transfer to all target languages, irrespective of their typological, etymological, and phylogenetic relations to other languages. In particular, readily available data and models of resource-rich sibling languages are often ignored. In this work, we empirically show, in a case study for Faroese – a low-resource language from a high-resource language family – that by leveraging the phylogenetic information and departing from the {\textquoteleft}one-size-fits-all{\textquoteright} paradigm, one can improve cross-lingual transfer to low-resource languages. In particular, we leverage abundant resources of other Scandinavian languages (i.e., Danish, Norwegian, Swedish, and Icelandic) for the benefit of Faroese. Our evaluation results show that we can substantially improve the transfer performance to Faroese by exploiting data and models of closely-related high-resource languages. Further, we release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS), and new language models trained on all Scandinavian languages.",

author = "V{\'e}steinn Sn{\ae}bjarnarson and Annika Simonsen and Goran Glava{\v s} and Ivan Vuli{\'c}",

year = "2023",

language = "English",

pages = "728--737",

booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",

publisher = "Association for Computational Linguistics (ACL)",

address = "United States",

note = "NoDaLiDa 2023 : The 24th Nordic Conference on Computational Linguistics, NoDaLiDa ; Conference date: 22-05-2023 Through 24-05-2023",

url = "https://www.nodalida2023.fo/",

}

RIS

TY - GEN

T1 - Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese

AU - Snæbjarnarson, Vésteinn

AU - Simonsen, Annika

AU - Glavaš, Goran

AU - Vulić, Ivan

N1 - Conference code: 24

PY - 2023

Y1 - 2023

N2 - Multilingual language models have pushed state-of-the-art in cross-lingual NLP transfer. The majority of zero-shot cross-lingual transfer, however, use one and the same massively multilingual transformer (e.g., mBERT or XLM-R) to transfer to all target languages, irrespective of their typological, etymological, and phylogenetic relations to other languages. In particular, readily available data and models of resource-rich sibling languages are often ignored. In this work, we empirically show, in a case study for Faroese – a low-resource language from a high-resource language family – that by leveraging the phylogenetic information and departing from the ‘one-size-fits-all’ paradigm, one can improve cross-lingual transfer to low-resource languages. In particular, we leverage abundant resources of other Scandinavian languages (i.e., Danish, Norwegian, Swedish, and Icelandic) for the benefit of Faroese. Our evaluation results show that we can substantially improve the transfer performance to Faroese by exploiting data and models of closely-related high-resource languages. Further, we release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS), and new language models trained on all Scandinavian languages.

M3 - Article in proceedings

SP - 728

EP - 737

BT - Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

PB - Association for Computational Linguistics (ACL)

T2 - NoDaLiDa 2023

Y2 - 22 May 2023 through 24 May 2023

ER -

ID: 383785033

Datalogisk Institut