How2: A large-scale dataset for multimodal language understanding

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

How2: A large-scale dataset for multimodal language understanding. / Sanabria, Ramon; Caglayan, Ozan; Palaskar, Shruti; Elliott, Desmond; Barrault, Loic; Specia, Lucia; Metze, Florian.

Visually Grounded Interaction and Language (ViGIL), Montreal, Canada, December 2018. Neural Information Processing Systems (NeurIPS). 2018. (arXiv).


Harvard

Sanabria, R, Caglayan, O, Palaskar, S, Elliott, D, Barrault, L, Specia, L & Metze, F 2018, How2: A large-scale dataset for multimodal language understanding. in Visually Grounded Interaction and Language (ViGIL), Montreal, Canada, December 2018. Neural Information Processing Systems (NeurIPS). arXiv, 32nd Annual Conference on Neural Information Processing Systems, Montreal, Canada, 02/12/2018. <https://nips2018vigil.github.io/static/papers/accepted/26.pdf>

APA

Sanabria, R., Caglayan, O., Palaskar, S., Elliott, D., Barrault, L., Specia, L., & Metze, F. (2018). How2: A large-scale dataset for multimodal language understanding. In Visually Grounded Interaction and Language (ViGIL), Montreal, Canada, December 2018. Neural Information Processing Systems (NeurIPS). arXiv. https://nips2018vigil.github.io/static/papers/accepted/26.pdf

Vancouver

Sanabria R, Caglayan O, Palaskar S, Elliott D, Barrault L, Specia L et al. How2: A large-scale dataset for multimodal language understanding. In Visually Grounded Interaction and Language (ViGIL), Montreal, Canada, December 2018. Neural Information Processing Systems (NeurIPS). 2018. (arXiv).

Author

Sanabria, Ramon ; Caglayan, Ozan ; Palaskar, Shruti ; Elliott, Desmond ; Barrault, Loic ; Specia, Lucia ; Metze, Florian. / How2: A large-scale dataset for multimodal language understanding. Visually Grounded Interaction and Language (ViGIL), Montreal, Canada, December 2018. Neural Information Processing Systems (NeurIPS). 2018. (arXiv).

Bibtex

@inproceedings{14c47cd8617c46378ad0b25da2cd4755,
title = "How2: A large-scale dataset for multimodal language understanding",
abstract = "Human information processing is inherently multimodal, and language is best understood in a situated context. In order to achieve human-like language processingcapabilities, machines should be able to jointly process multimodal data, and not just text, images, or speech in isolation. Nevertheless, there are very few multimodal datasets to support such research, resulting in a limited interaction among different research communities. In this paper, we introduce How2, a large-scale dataset of instructional videos covering a wide variety of topics across 80,000 clips (about 2,000 hours), with word-level time alignments to the ground-truth English subtitles. In addition to being multimodal, How2 is multilingual: we crowdsourced Portuguese translations of the subtitles. We present results for monomodal and multimodal baselines on several language processing tasks with interesting insights on the utility of different modalities. We hope that by making the How2 dataset and baselines available we will encourage collaboration across language, speech and vision communities",
author = "Ramon Sanabria and Ozan Caglayan and Shruti Palaskar and Desmond Elliott and Loic Barrault and Lucia Specia and Florian Metze",
year = "2018",
language = "English",
series = "arXiv",
publisher = "arxiv.org",
booktitle = "Visually Grounded Interaction and Language (ViGIL), Montreal; Canada, December 2018. Neural Information Processing Society (NeurIPS).",
note = "32nd Annual Conference on Neural Information Processing Systems, NeurIPS ; Conference date: 02-12-2018 Through 08-12-2018",
url = "https://nips.cc/Conferences/2018",

}

RIS

TY - GEN

T1 - How2: A large-scale dataset for multimodal language understanding

T2 - 32nd Annual Conference on Neural Information Processing Systems

AU - Sanabria, Ramon

AU - Caglayan, Ozan

AU - Palaskar, Shruti

AU - Elliott, Desmond

AU - Barrault, Loic

AU - Specia, Lucia

AU - Metze, Florian

N1 - Conference code: 32

PY - 2018

Y1 - 2018

N2 - Human information processing is inherently multimodal, and language is best understood in a situated context. In order to achieve human-like language processing capabilities, machines should be able to jointly process multimodal data, and not just text, images, or speech in isolation. Nevertheless, there are very few multimodal datasets to support such research, resulting in limited interaction among different research communities. In this paper, we introduce How2, a large-scale dataset of instructional videos covering a wide variety of topics across 80,000 clips (about 2,000 hours), with word-level time alignments to the ground-truth English subtitles. In addition to being multimodal, How2 is multilingual: we crowdsourced Portuguese translations of the subtitles. We present results for monomodal and multimodal baselines on several language processing tasks with interesting insights on the utility of different modalities. We hope that by making the How2 dataset and baselines available we will encourage collaboration across language, speech and vision communities.

AB - Human information processing is inherently multimodal, and language is best understood in a situated context. In order to achieve human-like language processing capabilities, machines should be able to jointly process multimodal data, and not just text, images, or speech in isolation. Nevertheless, there are very few multimodal datasets to support such research, resulting in limited interaction among different research communities. In this paper, we introduce How2, a large-scale dataset of instructional videos covering a wide variety of topics across 80,000 clips (about 2,000 hours), with word-level time alignments to the ground-truth English subtitles. In addition to being multimodal, How2 is multilingual: we crowdsourced Portuguese translations of the subtitles. We present results for monomodal and multimodal baselines on several language processing tasks with interesting insights on the utility of different modalities. We hope that by making the How2 dataset and baselines available we will encourage collaboration across language, speech and vision communities.

M3 - Article in proceedings

T3 - arXiv

BT - Visually Grounded Interaction and Language (ViGIL), Montreal, Canada, December 2018. Neural Information Processing Systems (NeurIPS).

Y2 - 2 December 2018 through 8 December 2018

ER -
