Textual Supervision for Visually Grounded Spoken Language Understanding
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
Textual Supervision for Visually Grounded Spoken Language Understanding. / Higy, Bertrand; Elliott, Desmond; Chrupała, Grzegorz.
Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, 2020. p. 2698–2709.
RIS
TY - GEN
T1 - Textual Supervision for Visually Grounded Spoken Language Understanding
AU - Higy, Bertrand
AU - Elliott, Desmond
AU - Chrupała, Grzegorz
PY - 2020
Y1 - 2020
N2 - Visually grounded models of spoken language understanding extract semantic information directly from speech, without relying on transcriptions. This is useful for low-resource languages, where transcriptions can be expensive or impossible to obtain. Recent work showed that these models can be improved if transcriptions are available at training time. However, it is not clear how an end-to-end approach compares to a traditional pipeline-based approach when one has access to transcriptions. Comparing different strategies, we find that the pipeline approach works better when enough text is available. With low-resource languages in mind, we also show that translations can be effectively used in place of transcriptions, but more data is needed to obtain similar results.
AB - Visually grounded models of spoken language understanding extract semantic information directly from speech, without relying on transcriptions. This is useful for low-resource languages, where transcriptions can be expensive or impossible to obtain. Recent work showed that these models can be improved if transcriptions are available at training time. However, it is not clear how an end-to-end approach compares to a traditional pipeline-based approach when one has access to transcriptions. Comparing different strategies, we find that the pipeline approach works better when enough text is available. With low-resource languages in mind, we also show that translations can be effectively used in place of transcriptions, but more data is needed to obtain similar results.
KW - cs.CL
KW - cs.LG
KW - cs.SD
KW - eess.AS
U2 - 10.18653/v1/2020.findings-emnlp.244
DO - 10.18653/v1/2020.findings-emnlp.244
M3 - Article in proceedings
SP - 2698
EP - 2709
BT - Findings of the Association for Computational Linguistics: EMNLP 2020
PB - Association for Computational Linguistics
T2 - Findings of the Association for Computational Linguistics
Y2 - 16 November 2020 through 20 November 2020
ER -