Experimenting with different machine translation models in medium-resource settings

Research output: Chapter in Book/Report/Conference proceeding; Article in proceedings; Research; peer-review

State-of-the-art machine translation (MT) systems rely on the availability of large parallel corpora containing millions of sentence pairs. For Icelandic, the parallel corpus ParIce exists, consisting of about 3.6 million English-Icelandic sentence pairs. Given that parallel corpora for low-resource languages typically contain sentence pairs in the tens or hundreds of thousands, we classify Icelandic as a medium-resource language for MT purposes. In this paper, we present ongoing experiments with different MT models, both statistical and neural, for translating English to Icelandic based on ParIce. We describe the corpus and the filtering process used for removing noisy segments, the different models used for training, and the preliminary automatic and human evaluation. We find that, when an aggressive filtering approach is used, the most recent neural MT system (Transformer) performs best, obtaining the highest BLEU score and the highest fluency and adequacy scores from human evaluation for in-domain translation. Our work could be beneficial to other languages for which a similar amount of parallel data is available.

Original language: English
Title of host publication: Text, Speech, and Dialogue - 23rd International Conference, TSD 2020, Proceedings
Editors: Petr Sojka, Ivan Kopecek, Karel Pala, Aleš Horák
Number of pages: 9
Publisher: Springer
Publication date: 2020
Pages: 95-103
ISBN (Print): 9783030583224
Publication status: Published - 2020
Externally published: Yes
Event: 23rd International Conference on Text, Speech, and Dialogue, TSD 2020 - Brno, Czech Republic
Duration: 8 Sep 2020 – 11 Sep 2020

Conference

Conference: 23rd International Conference on Text, Speech, and Dialogue, TSD 2020
Country: Czech Republic
City: Brno
Period: 08/09/2020 – 11/09/2020
Series: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12284 LNAI
ISSN: 0302-9743

Bibliographical note

Funding Information:
Acknowledgments. This project was funded by the Language Technology Programme for Icelandic 2019–2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.

Publisher Copyright:
© Springer Nature Switzerland AG 2020.

Research areas

Evaluation, Machine translation, Parallel data

ID: 371185063