Visual Prediction Improves Zero-Shot Cross-Modal Machine Translation
Publication: Contribution to book/anthology/report › Article in proceedings › Research › peer-reviewed
Standard
Visual Prediction Improves Zero-Shot Cross-Modal Machine Translation. / Hirasawa, Tosho; Bugliarello, Emanuele; Elliott, Desmond; Komachi, Mamoru.
Proceedings of the 8th Conference on Machine Translation, WMT 2023. Association for Computational Linguistics (ACL), 2023. pp. 520-533.
RIS
TY - GEN
T1 - Visual Prediction Improves Zero-Shot Cross-Modal Machine Translation
AU - Hirasawa, Tosho
AU - Bugliarello, Emanuele
AU - Elliott, Desmond
AU - Komachi, Mamoru
N1 - Publisher Copyright: © 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - Multimodal machine translation (MMT) systems have been successfully developed in recent years for a few language pairs. However, training such models usually requires tuples of a source language text, target language text, and images. Obtaining these data involves expensive human annotations, making it difficult to develop models for unseen text-only language pairs. In this work, we propose the task of zero-shot cross-modal machine translation, aiming to transfer multimodal knowledge from an existing multimodal parallel corpus into a new translation direction. We also introduce a novel MMT model with a visual prediction network to learn visual features grounded on multimodal parallel data and provide pseudo-features for text-only language pairs. With this training paradigm, our MMT model outperforms its text-only counterpart. In our extensive analyses, we show that (i) the selection of visual features is important, and (ii) training on image-aware translations and being grounded on a similar language pair are mandatory. Our code is available at https://github.com/toshohirasawa/zeroshot-crossmodal-mt.
AB - Multimodal machine translation (MMT) systems have been successfully developed in recent years for a few language pairs. However, training such models usually requires tuples of a source language text, target language text, and images. Obtaining these data involves expensive human annotations, making it difficult to develop models for unseen text-only language pairs. In this work, we propose the task of zero-shot cross-modal machine translation, aiming to transfer multimodal knowledge from an existing multimodal parallel corpus into a new translation direction. We also introduce a novel MMT model with a visual prediction network to learn visual features grounded on multimodal parallel data and provide pseudo-features for text-only language pairs. With this training paradigm, our MMT model outperforms its text-only counterpart. In our extensive analyses, we show that (i) the selection of visual features is important, and (ii) training on image-aware translations and being grounded on a similar language pair are mandatory. Our code is available at https://github.com/toshohirasawa/zeroshot-crossmodal-mt.
U2 - 10.18653/v1/2023.wmt-1.47
DO - 10.18653/v1/2023.wmt-1.47
M3 - Article in proceedings
AN - SCOPUS:85179138715
SP - 520
EP - 533
BT - Proceedings of the 8th Conference on Machine Translation, WMT 2023
PB - Association for Computational Linguistics (ACL)
T2 - 8th Conference on Machine Translation, WMT 2023
Y2 - 6 December 2023 through 7 December 2023
ER -
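
The abstract above describes a visual prediction network that learns image features from multimodal parallel data and supplies pseudo-features when translating text-only language pairs. As a rough illustration of that general idea only (not the authors' implementation; the module name VisualPredictionNet, the feature dimensions, the mean-pooling, and the MSE grounding loss below are all assumptions), a minimal PyTorch sketch:

import torch
import torch.nn as nn

class VisualPredictionNet(nn.Module):
    """Hypothetical predictor: pooled source-text states -> global image feature."""
    def __init__(self, d_text: int = 512, d_image: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_text, d_text),
            nn.ReLU(),
            nn.Linear(d_text, d_image),
        )

    def forward(self, text_states: torch.Tensor) -> torch.Tensor:
        # text_states: (batch, seq_len, d_text); mean-pool over tokens,
        # then map into the image-feature space.
        pooled = text_states.mean(dim=1)
        return self.mlp(pooled)  # (batch, d_image) pseudo visual feature

predictor = VisualPredictionNet()

# Training on the multimodal pair (with images): regress the predicted
# feature onto the real image feature alongside the usual MT loss.
text_states = torch.randn(4, 20, 512)  # dummy encoder outputs
image_feats = torch.randn(4, 2048)     # dummy precomputed image features
grounding_loss = nn.functional.mse_loss(predictor(text_states), image_feats)

# Zero-shot direction (text-only, no images): feed the *predicted*
# feature to the multimodal decoder in place of a real image feature.
with torch.no_grad():
    pseudo_visual = predictor(text_states)

The point of the sketch is the substitution step at the end: the predictor trained on the image-annotated pair stands in for missing images in the new translation direction, which is what makes the cross-modal transfer zero-shot.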