Grounded Sequence to Sequence Transduction
Research output: Contribution to journal › Journal article › Research › peer-review
Standard
Grounded Sequence to Sequence Transduction. / Specia, Lucia; Barrault, Loic; Caglayan, Ozan; Duarte, Amanda; Elliott, Desmond; Gella, Spandana; Holzenberger, Nils; Lala, Chiraag; Lee, Sun Jae; Libovicky, Jindrich; Madhyastha, Pranava; Metze, Florian; Mulligan, Karl; Ostapenko, Alissa; Palaskar, Shruti; Sanabria, Ramon; Wang, Josiah; Arora, Raman.
In: IEEE Journal on Selected Topics in Signal Processing, Vol. 14, No. 3, 9103248, 2020, p. 577-591.
Bibtex
@article{specia2020grounded,
  title   = {Grounded Sequence to Sequence Transduction},
  author  = {Specia, Lucia and Barrault, Loic and Caglayan, Ozan and Duarte, Amanda and Elliott, Desmond and Gella, Spandana and Holzenberger, Nils and Lala, Chiraag and Lee, Sun Jae and Libovicky, Jindrich and Madhyastha, Pranava and Metze, Florian and Mulligan, Karl and Ostapenko, Alissa and Palaskar, Shruti and Sanabria, Ramon and Wang, Josiah and Arora, Raman},
  journal = {IEEE Journal on Selected Topics in Signal Processing},
  volume  = {14},
  number  = {3},
  pages   = {577--591},
  year    = {2020},
  doi     = {10.1109/JSTSP.2020.2998415},
  issn    = {1932-4553}
}
RIS
TY - JOUR
T1 - Grounded Sequence to Sequence Transduction
AU - Specia, Lucia
AU - Barrault, Loic
AU - Caglayan, Ozan
AU - Duarte, Amanda
AU - Elliott, Desmond
AU - Gella, Spandana
AU - Holzenberger, Nils
AU - Lala, Chiraag
AU - Lee, Sun Jae
AU - Libovicky, Jindrich
AU - Madhyastha, Pranava
AU - Metze, Florian
AU - Mulligan, Karl
AU - Ostapenko, Alissa
AU - Palaskar, Shruti
AU - Sanabria, Ramon
AU - Wang, Josiah
AU - Arora, Raman
PY - 2020
Y1 - 2020
N2 - Speech recognition and machine translation have made major progress over the past decades, providing practical systems to map one language sequence to another. Although multiple modalities such as sound and video are becoming increasingly available, state-of-the-art systems are inherently unimodal, in the sense that they take a single modality - either speech or text - as input. Evidence from human learning suggests that additional modalities can provide disambiguating signals crucial for many language tasks. In this article, we describe the How2 dataset, a large, open-domain collection of videos with transcriptions and their translations. We then show how this single dataset can be used to develop systems for a variety of language tasks and present a number of models meant as starting points. Across tasks, we find that building multimodal architectures that perform better than their unimodal counterparts remains a challenge. This leaves plenty of room for the exploration of more advanced solutions that fully exploit the multimodal nature of the How2 dataset, and of multimodal learning with other datasets more generally.
AB - Speech recognition and machine translation have made major progress over the past decades, providing practical systems to map one language sequence to another. Although multiple modalities such as sound and video are becoming increasingly available, state-of-the-art systems are inherently unimodal, in the sense that they take a single modality - either speech or text - as input. Evidence from human learning suggests that additional modalities can provide disambiguating signals crucial for many language tasks. In this article, we describe the How2 dataset, a large, open-domain collection of videos with transcriptions and their translations. We then show how this single dataset can be used to develop systems for a variety of language tasks and present a number of models meant as starting points. Across tasks, we find that building multimodal architectures that perform better than their unimodal counterparts remains a challenge. This leaves plenty of room for the exploration of more advanced solutions that fully exploit the multimodal nature of the How2 dataset, and of multimodal learning with other datasets more generally.
KW - Grounding
KW - machine translation
KW - multimodal machine learning
KW - representation learning
KW - speech recognition
KW - summarization
UR - http://www.scopus.com/inward/record.url?scp=85087505272&partnerID=8YFLogxK
U2 - 10.1109/JSTSP.2020.2998415
DO - 10.1109/JSTSP.2020.2998415
M3 - Journal article
AN - SCOPUS:85087505272
VL - 14
SP - 577
EP - 591
JO - IEEE Journal on Selected Topics in Signal Processing
JF - IEEE Journal on Selected Topics in Signal Processing
SN - 1932-4553
IS - 3
M1 - 9103248
ER -
ID: 250484073