Development and Evaluation of Pre-trained Language Models for Historical Danish and Norwegian Literary Texts

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

We develop and evaluate the first pre-trained language models specifically tailored for historical Danish and Norwegian texts. Three models are trained on a corpus of 19th-century Danish and Norwegian literature: two directly on the corpus with no prior pre-training, and one with continued pre-training. To evaluate the models, we utilize an existing sentiment classification dataset, and additionally introduce a new annotated word sense disambiguation dataset focusing on the concept of fate. Our assessment reveals that the model employing continued pre-training outperforms the others in two downstream NLP tasks on historical texts. Specifically, we observe substantial improvement in sentiment classification and word sense disambiguation compared to models trained on contemporary texts. These results highlight the effectiveness of continued pre-training for enhancing performance across various NLP tasks in historical text analysis.
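Both downstream tasks above are evaluated as classification problems. A minimal sketch of the kind of metric computation such an evaluation involves (macro-averaged F1 over gold and predicted labels); the sense labels for "fate" shown here are purely illustrative and are not taken from the paper's dataset:

```python
def macro_f1(gold, pred):
    """Macro-averaged F1 over the label set present in the gold annotations."""
    labels = set(gold)
    per_label_f1 = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_label_f1.append(f1)
    return sum(per_label_f1) / len(per_label_f1)

# Hypothetical sense labels for occurrences of "fate" (illustrative only).
gold = ["destiny", "destiny", "death", "death", "event"]
pred = ["destiny", "death", "death", "death", "event"]
print(round(macro_f1(gold, pred), 3))
```

Macro averaging weights each sense equally, which matters when one sense dominates the corpus, as is common in historical word sense disambiguation data.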

Original language: English
Title of host publication: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Publisher: European Language Resources Association (ELRA)
Publication date: 2024
Pages: 4811-4819
ISBN (Electronic): 9782493814104
Publication status: Published - 2024
Event: Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 - Hybrid, Torino, Italy
Duration: 20 May 2024 to 25 May 2024

Conference

Conference: Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024
Country: Italy
City: Hybrid, Torino
Period: 20/05/2024 to 25/05/2024
Sponsors: Aequa-Tech, Baidu, Bloomberg, Dataforce (Transperfect), Intesa San Paolo Bank, et al.

Bibliographical note

Publisher Copyright:
© 2024 ELRA Language Resource Association: CC BY-NC 4.0.

    Research areas

  • Digital Humanities, Pre-trained Language Models, Sentiment Analysis, Word Sense Disambiguation

ID: 396718582