Do end-to-end speech recognition models care about context?

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Lasse Borgholt
Jakob D. Havtorn
Željko Agić
Anders Søgaard
Lars Maaløe
Christian Igel

The two most common paradigms for end-to-end speech recognition are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. It has been argued that the latter is better suited for learning an implicit language model. We test this hypothesis by measuring temporal context sensitivity and evaluate how the models perform when we constrain the amount of contextual information in the audio input. We find that the AED model is indeed more context sensitive, but that the gap can be closed by adding self-attention to the CTC model. Furthermore, the two models perform similarly when contextual information is constrained. Finally, in contrast to previous research, our results show that the CTC model is highly competitive on WSJ and LibriSpeech without the help of an external language model.

Original language	English
Title of host publication	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2020-October
Publisher	International Speech Communication Association (ISCA)
Publication date	2020
Pages	4352-4356
DOIs	https://doi.org/10.21437/Interspeech.2020-1750
Publication status	Published - 2020
Event	21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, Shanghai, China. Duration: 25 Oct 2020 → 29 Oct 2020

Conference

Conference	21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Country	China
City	Shanghai
Period	25/10/2020 → 29/10/2020
Sponsor	Alibaba Group, Amazon Alexa, Apple, Intel, Magic Data, et al.

Research areas

Attention-based encoder-decoder, Automatic speech recognition, Connectionist temporal classification, End-to-end speech recognition

ID: 254726027

Department of Computer Science