Grounded Sequence to Sequence Transduction
Research output: Contribution to journal › Journal article › Research › peer-review
Standard
Grounded Sequence to Sequence Transduction. / Specia, Lucia; Barrault, Loic; Caglayan, Ozan; Duarte, Amanda; Elliott, Desmond; Gella, Spandana; Holzenberger, Nils; Lala, Chiraag; Lee, Sun Jae; Libovicky, Jindrich; Madhyastha, Pranava; Metze, Florian; Mulligan, Karl; Ostapenko, Alissa; Palaskar, Shruti; Sanabria, Ramon; Wang, Josiah; Arora, Raman.
In: IEEE Journal on Selected Topics in Signal Processing, Vol. 14, No. 3, 9103248, 2020, p. 577-591.
Bibtex
@article{specia2020grounded,
  title   = {Grounded Sequence to Sequence Transduction},
  author  = {Specia, Lucia and Barrault, Loic and Caglayan, Ozan and Duarte, Amanda and Elliott, Desmond and Gella, Spandana and Holzenberger, Nils and Lala, Chiraag and Lee, Sun Jae and Libovicky, Jindrich and Madhyastha, Pranava and Metze, Florian and Mulligan, Karl and Ostapenko, Alissa and Palaskar, Shruti and Sanabria, Ramon and Wang, Josiah and Arora, Raman},
  journal = {IEEE Journal on Selected Topics in Signal Processing},
  volume  = {14},
  number  = {3},
  pages   = {577--591},
  year    = {2020},
  doi     = {10.1109/JSTSP.2020.2998415},
  issn    = {1932-4553}
}
RIS
TY - JOUR
T1 - Grounded Sequence to Sequence Transduction
AU - Specia, Lucia
AU - Barrault, Loic
AU - Caglayan, Ozan
AU - Duarte, Amanda
AU - Elliott, Desmond
AU - Gella, Spandana
AU - Holzenberger, Nils
AU - Lala, Chiraag
AU - Lee, Sun Jae
AU - Libovicky, Jindrich
AU - Madhyastha, Pranava
AU - Metze, Florian
AU - Mulligan, Karl
AU - Ostapenko, Alissa
AU - Palaskar, Shruti
AU - Sanabria, Ramon
AU - Wang, Josiah
AU - Arora, Raman
PY - 2020
Y1 - 2020
N2 - Speech recognition and machine translation have made major progress over the past decades, providing practical systems to map one language sequence to another. Although multiple modalities such as sound and video are becoming increasingly available, state-of-the-art systems are inherently unimodal, in the sense that they take a single modality - either speech or text - as input. Evidence from human learning suggests that additional modalities can provide disambiguating signals crucial for many language tasks. In this article, we describe the How2 dataset, a large, open-domain collection of videos with transcriptions and their translations. We then show how this single dataset can be used to develop systems for a variety of language tasks and present a number of models meant as starting points. Across tasks, we find that building multimodal architectures that perform better than their unimodal counterparts remains a challenge. This leaves plenty of room for the exploration of more advanced solutions that fully exploit the multimodal nature of the How2 dataset, and of multimodal learning with other datasets more generally.
AB - Speech recognition and machine translation have made major progress over the past decades, providing practical systems to map one language sequence to another. Although multiple modalities such as sound and video are becoming increasingly available, state-of-the-art systems are inherently unimodal, in the sense that they take a single modality - either speech or text - as input. Evidence from human learning suggests that additional modalities can provide disambiguating signals crucial for many language tasks. In this article, we describe the How2 dataset, a large, open-domain collection of videos with transcriptions and their translations. We then show how this single dataset can be used to develop systems for a variety of language tasks and present a number of models meant as starting points. Across tasks, we find that building multimodal architectures that perform better than their unimodal counterparts remains a challenge. This leaves plenty of room for the exploration of more advanced solutions that fully exploit the multimodal nature of the How2 dataset, and of multimodal learning with other datasets more generally.
KW - Grounding
KW - machine translation
KW - multimodal machine learning
KW - representation learning
KW - speech recognition
KW - summarization
UR - http://www.scopus.com/inward/record.url?scp=85087505272&partnerID=8YFLogxK
U2 - 10.1109/JSTSP.2020.2998415
DO - 10.1109/JSTSP.2020.2998415
M3 - Journal article
AN - SCOPUS:85087505272
VL - 14
SP - 577
EP - 591
JO - IEEE Journal on Selected Topics in Signal Processing
JF - IEEE Journal on Selected Topics in Signal Processing
SN - 1932-4553
IS - 3
M1 - 9103248
ER -
ID: 250484073