Do end-to-end speech recognition models care about context?

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

Standard

Do end-to-end speech recognition models care about context? / Borgholt, Lasse; Havtorn, Jakob D.; Agic, Željko; Søgaard, Anders; Maaløe, Lars; Igel, Christian.

Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Vol. 2020-October. International Speech Communication Association (ISCA), 2020. pp. 4352-4356.


Harvard

Borgholt, L, Havtorn, JD, Agic, Ž, Søgaard, A, Maaløe, L & Igel, C 2020, Do end-to-end speech recognition models care about context? in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. vol. 2020-October, International Speech Communication Association (ISCA), pp. 4352-4356, 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, Shanghai, China, 25/10/2020. https://doi.org/10.21437/Interspeech.2020-1750

APA

Borgholt, L., Havtorn, J. D., Agic, Ž., Søgaard, A., Maaløe, L., & Igel, C. (2020). Do end-to-end speech recognition models care about context? In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (Vol. 2020-October, pp. 4352-4356). International Speech Communication Association (ISCA). https://doi.org/10.21437/Interspeech.2020-1750

Vancouver

Borgholt L, Havtorn JD, Agic Ž, Søgaard A, Maaløe L, Igel C. Do end-to-end speech recognition models care about context? In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Vol. 2020-October. International Speech Communication Association (ISCA). 2020. p. 4352-4356. https://doi.org/10.21437/Interspeech.2020-1750

Author

Borgholt, Lasse ; Havtorn, Jakob D. ; Agic, Željko ; Søgaard, Anders ; Maaløe, Lars ; Igel, Christian. / Do end-to-end speech recognition models care about context? Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Vol. 2020-October. International Speech Communication Association (ISCA), 2020. pp. 4352-4356

Bibtex

@inproceedings{9cd3f7a63cca49108ae28929d525649e,
title = "Do end-to-end speech recognition models care about context?",
abstract = "The two most common paradigms for end-to-end speech recognition are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. It has been argued that the latter is better suited for learning an implicit language model. We test this hypothesis by measuring temporal context sensitivity and evaluate how the models perform when we constrain the amount of contextual information in the audio input. We find that the AED model is indeed more context sensitive, but that the gap can be closed by adding self-attention to the CTC model. Furthermore, the two models perform similarly when contextual information is constrained. Finally, in contrast to previous research, our results show that the CTC model is highly competitive on WSJ and LibriSpeech without the help of an external language model.",
keywords = "Attention-based encoder-decoder, Automatic speech recognition, Connectionist temporal classification, End-to-end speech recognition",
author = "Lasse Borgholt and Havtorn, {Jakob D.} and {\v Z}eljko Agic and Anders S{\o}gaard and Lars Maal{\o}e and Christian Igel",
year = "2020",
doi = "10.21437/Interspeech.2020-1750",
language = "English",
volume = "2020-October",
pages = "4352--4356",
booktitle = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
publisher = "International Speech Communication Association (ISCA)",
note = "21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 ; Conference date: 25-10-2020 Through 29-10-2020",
}

RIS

TY - GEN

T1 - Do end-to-end speech recognition models care about context?

AU - Borgholt, Lasse

AU - Havtorn, Jakob D.

AU - Agic, Željko

AU - Søgaard, Anders

AU - Maaløe, Lars

AU - Igel, Christian

PY - 2020

Y1 - 2020

N2 - The two most common paradigms for end-to-end speech recognition are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. It has been argued that the latter is better suited for learning an implicit language model. We test this hypothesis by measuring temporal context sensitivity and evaluate how the models perform when we constrain the amount of contextual information in the audio input. We find that the AED model is indeed more context sensitive, but that the gap can be closed by adding self-attention to the CTC model. Furthermore, the two models perform similarly when contextual information is constrained. Finally, in contrast to previous research, our results show that the CTC model is highly competitive on WSJ and LibriSpeech without the help of an external language model.

AB - The two most common paradigms for end-to-end speech recognition are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. It has been argued that the latter is better suited for learning an implicit language model. We test this hypothesis by measuring temporal context sensitivity and evaluate how the models perform when we constrain the amount of contextual information in the audio input. We find that the AED model is indeed more context sensitive, but that the gap can be closed by adding self-attention to the CTC model. Furthermore, the two models perform similarly when contextual information is constrained. Finally, in contrast to previous research, our results show that the CTC model is highly competitive on WSJ and LibriSpeech without the help of an external language model.

KW - Attention-based encoder-decoder

KW - Automatic speech recognition

KW - Connectionist temporal classification

KW - End-to-end speech recognition

UR - http://www.scopus.com/inward/record.url?scp=85098151098&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2020-1750

DO - 10.21437/Interspeech.2020-1750

M3 - Article in proceedings

AN - SCOPUS:85098151098

VL - 2020-October

SP - 4352

EP - 4356

BT - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

PB - International Speech Communication Association (ISCA)

T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020

Y2 - 25 October 2020 through 29 October 2020

ER -
