Experimenting with different machine translation models in medium-resource settings

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Experimenting with different machine translation models in medium-resource settings. / Jónsson, Haukur Páll; Símonarson, Haukur Barri; Snæbjarnarson, Vésteinn; Steingrímsson, Steinþór; Loftsson, Hrafn.

Text, Speech, and Dialogue - 23rd International Conference, TSD 2020, Proceedings. ed. / Petr Sojka; Ivan Kopecek; Karel Pala; Aleš Horák. Springer, 2020. p. 95-103 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 12284 LNAI).

Harvard

Jónsson, HP, Símonarson, HB, Snæbjarnarson, V, Steingrímsson, S & Loftsson, H 2020, Experimenting with different machine translation models in medium-resource settings. in P Sojka, I Kopecek, K Pala & A Horák (eds), Text, Speech, and Dialogue - 23rd International Conference, TSD 2020, Proceedings. Springer, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12284 LNAI, pp. 95-103, 23rd International Conference on Text, Speech, and Dialogue, TSD 2020, Brno, Czech Republic, 08/09/2020. https://doi.org/10.1007/978-3-030-58323-1_10

APA

Jónsson, H. P., Símonarson, H. B., Snæbjarnarson, V., Steingrímsson, S., & Loftsson, H. (2020). Experimenting with different machine translation models in medium-resource settings. In P. Sojka, I. Kopecek, K. Pala, & A. Horák (Eds.), Text, Speech, and Dialogue - 23rd International Conference, TSD 2020, Proceedings (pp. 95-103). Springer. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) Vol. 12284 LNAI https://doi.org/10.1007/978-3-030-58323-1_10

Vancouver

Jónsson HP, Símonarson HB, Snæbjarnarson V, Steingrímsson S, Loftsson H. Experimenting with different machine translation models in medium-resource settings. In Sojka P, Kopecek I, Pala K, Horák A, editors, Text, Speech, and Dialogue - 23rd International Conference, TSD 2020, Proceedings. Springer. 2020. p. 95-103. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 12284 LNAI). https://doi.org/10.1007/978-3-030-58323-1_10

Author

Jónsson, Haukur Páll ; Símonarson, Haukur Barri ; Snæbjarnarson, Vésteinn ; Steingrímsson, Steinþór ; Loftsson, Hrafn. / Experimenting with different machine translation models in medium-resource settings. Text, Speech, and Dialogue - 23rd International Conference, TSD 2020, Proceedings. editor / Petr Sojka ; Ivan Kopecek ; Karel Pala ; Aleš Horák. Springer, 2020. pp. 95-103 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 12284 LNAI).

Bibtex

@inproceedings{793c4f07868f4e48974fd86bc6c3a573,

title = "Experimenting with different machine translation models in medium-resource settings",

abstract = "State-of-the-art machine translation (MT) systems rely on the availability of large parallel corpora, containing millions of sentence pairs. For the Icelandic language, the parallel corpus ParIce exists, consisting of about 3.6 million English-Icelandic sentence pairs. Given that parallel corpora for low-resource languages typically contain sentence pairs in the tens or hundreds of thousands, we classify Icelandic as a medium-resource language for MT purposes. In this paper, we present on-going experiments with different MT models, both statistical and neural, for translating English to Icelandic based on ParIce. We describe the corpus and the filtering process used for removing noisy segments, the different models used for training, and the preliminary automatic and human evaluation. We find that, while using an aggressive filtering approach, the most recent neural MT system (Transformer) performs best, obtaining the highest BLEU score and the highest fluency and adequacy scores from human evaluation for in-domain translation. Our work could be beneficial to other languages for which a similar amount of parallel data is available.",

keywords = "Evaluation, Machine translation, Parallel data",

author = "J{\'o}nsson, {Haukur P{\'a}ll} and S{\'i}monarson, {Haukur Barri} and Sn{\ae}bjarnarson, V{\'e}steinn and Steingr{\'i}msson, Stein{\th}{\'o}r and Loftsson, Hrafn",

note = "Funding Information: Acknowledgments. This project was funded by the Language Technology Programme for Icelandic 2019--2023. The programme, which is managed and coordinated by Almannar{\'o}mur, is funded by the Icelandic Ministry of Education, Science and Culture. Publisher Copyright: {\textcopyright} Springer Nature Switzerland AG 2020. 23rd International Conference on Text, Speech, and Dialogue, TSD 2020; Conference date: 08-09-2020 Through 11-09-2020",

year = "2020",

doi = "10.1007/978-3-030-58323-1_10",

language = "English",

isbn = "9783030583224",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

volume = "12284 LNAI",

publisher = "Springer",

pages = "95--103",

editor = "Petr Sojka and Ivan Kopecek and Karel Pala and Ale{\v s} Hor{\'a}k",

booktitle = "Text, Speech, and Dialogue - 23rd International Conference, TSD 2020, Proceedings",

address = "Switzerland",

}

RIS

TY - CONF

T1 - Experimenting with different machine translation models in medium-resource settings

AU - Jónsson, Haukur Páll

AU - Símonarson, Haukur Barri

AU - Snæbjarnarson, Vésteinn

AU - Steingrímsson, Steinþór

AU - Loftsson, Hrafn

N1 - Funding Information: Acknowledgments. This project was funded by the Language Technology Programme for Icelandic 2019–2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture. Publisher Copyright: © Springer Nature Switzerland AG 2020.

PY - 2020

Y1 - 2020

N2 - State-of-the-art machine translation (MT) systems rely on the availability of large parallel corpora, containing millions of sentence pairs. For the Icelandic language, the parallel corpus ParIce exists, consisting of about 3.6 million English-Icelandic sentence pairs. Given that parallel corpora for low-resource languages typically contain sentence pairs in the tens or hundreds of thousands, we classify Icelandic as a medium-resource language for MT purposes. In this paper, we present on-going experiments with different MT models, both statistical and neural, for translating English to Icelandic based on ParIce. We describe the corpus and the filtering process used for removing noisy segments, the different models used for training, and the preliminary automatic and human evaluation. We find that, while using an aggressive filtering approach, the most recent neural MT system (Transformer) performs best, obtaining the highest BLEU score and the highest fluency and adequacy scores from human evaluation for in-domain translation. Our work could be beneficial to other languages for which a similar amount of parallel data is available.

AB - State-of-the-art machine translation (MT) systems rely on the availability of large parallel corpora, containing millions of sentence pairs. For the Icelandic language, the parallel corpus ParIce exists, consisting of about 3.6 million English-Icelandic sentence pairs. Given that parallel corpora for low-resource languages typically contain sentence pairs in the tens or hundreds of thousands, we classify Icelandic as a medium-resource language for MT purposes. In this paper, we present on-going experiments with different MT models, both statistical and neural, for translating English to Icelandic based on ParIce. We describe the corpus and the filtering process used for removing noisy segments, the different models used for training, and the preliminary automatic and human evaluation. We find that, while using an aggressive filtering approach, the most recent neural MT system (Transformer) performs best, obtaining the highest BLEU score and the highest fluency and adequacy scores from human evaluation for in-domain translation. Our work could be beneficial to other languages for which a similar amount of parallel data is available.

KW - Evaluation

KW - Machine translation

KW - Parallel data

UR - http://www.scopus.com/inward/record.url?scp=85091177513&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-58323-1_10

DO - 10.1007/978-3-030-58323-1_10

M3 - Article in proceedings

AN - SCOPUS:85091177513

SN - 9783030583224

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

VL - 12284 LNAI

SP - 95

EP - 103

BT - Text, Speech, and Dialogue - 23rd International Conference, TSD 2020, Proceedings

A2 - Sojka, Petr

A2 - Kopecek, Ivan

A2 - Pala, Karel

A2 - Horák, Aleš

PB - Springer

T2 - 23rd International Conference on Text, Speech, and Dialogue, TSD 2020

Y2 - 8 September 2020 through 11 September 2020

ER -

Department of Computer Science