Fine-Grained Grounding for Multimodal Speech Recognition

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review


Multimodal automatic speech recognition systems integrate information from images to improve speech recognition quality by grounding the speech in the visual context. While visual signals have been shown to be useful for recovering entities that have been masked in the audio, these models should be capable of recovering a broader range of word types. Existing systems rely on global visual features that represent the entire image, but localizing the relevant regions of the image makes it possible to recover a larger set of words, such as adjectives and verbs. In this paper, we propose a model that uses finer-grained visual information from different parts of the image, using automatic object proposals. In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features, that the proposals enable the model to recover entities and other related words, such as adjectives, and that improvements are due to the model's ability to localize the correct proposals.
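The abstract describes attending over automatically detected object proposals rather than a single global image vector. As a rough illustration only, the sketch below shows one way such fine-grained grounding could be wired into an ASR decoder: per-region features are scored against the current decoder state and combined into a grounded context vector. The class name, dimensions, and scaled dot-product formulation are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ProposalAttention(nn.Module):
    """Attend over per-region visual features (e.g., from automatic
    object proposals) conditioned on the ASR decoder state.

    Minimal sketch under assumed names/dimensions; not the paper's
    published architecture.
    """
    def __init__(self, decoder_dim: int, visual_dim: int, attn_dim: int = 256):
        super().__init__()
        self.query = nn.Linear(decoder_dim, attn_dim)
        self.key = nn.Linear(visual_dim, attn_dim)
        self.value = nn.Linear(visual_dim, decoder_dim)

    def forward(self, dec_state: torch.Tensor, proposals: torch.Tensor):
        # dec_state: (batch, decoder_dim) current decoder hidden state
        # proposals: (batch, num_regions, visual_dim) region features
        q = self.query(dec_state).unsqueeze(1)            # (B, 1, A)
        k = self.key(proposals)                           # (B, N, A)
        scores = torch.bmm(q, k.transpose(1, 2))          # (B, 1, N)
        weights = torch.softmax(scores / k.size(-1) ** 0.5, dim=-1)
        # Grounded visual context, plus attention weights that indicate
        # which proposal the model localized for the current word.
        context = torch.bmm(weights, self.value(proposals)).squeeze(1)
        return context, weights.squeeze(1)
```

In a setup like this, the returned attention weights give a direct readout of which image region the recognizer relied on at each decoding step, which is the kind of localization the abstract credits for the improvements.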
Original language: English
Title of host publication: Findings of the Association for Computational Linguistics: EMNLP 2020
Publisher: Association for Computational Linguistics
Publication date: 2020
Pages: 2667–2677
DOIs
Publication status: Published - 2020
Event: Findings of the Association for Computational Linguistics: EMNLP 2020
Duration: 16 Nov 2020 – 20 Nov 2020

Conference

Conference: Findings of the Association for Computational Linguistics
Period: 16/11/2020 – 20/11/2020
