Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing. / Barrett, Maria Jung; Gonzalez, Ana Valeria; Frermann, Lea; Søgaard, Anders.

Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies (Long Papers). ed. / Silvio Ricardo Cordeiro; Shereen Oraby; Umashanthi Pavalanathan; Kyeongmin Rim. Vol. 1, Association for Computational Linguistics, 2018. p. 2028-2038.

Harvard

Barrett, MJ, Gonzalez, AV, Frermann, L & Søgaard, A 2018, Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing. in SR Cordeiro, S Oraby, U Pavalanathan & K Rim (eds), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies (Long Papers). vol. 1, Association for Computational Linguistics, pp. 2028-2038, 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, United States, 01/06/2018. https://doi.org/10.18653/v1/N18-1184

APA

Barrett, M. J., Gonzalez, A. V., Frermann, L., & Søgaard, A. (2018). Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing. In S. R. Cordeiro, S. Oraby, U. Pavalanathan, & K. Rim (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies (Long Papers) (Vol. 1, pp. 2028-2038). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1184

Vancouver

Barrett MJ, Gonzalez AV, Frermann L, Søgaard A. Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing. In Cordeiro SR, Oraby S, Pavalanathan U, Rim K, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies (Long Papers). Vol. 1. Association for Computational Linguistics. 2018. p. 2028-2038. https://doi.org/10.18653/v1/N18-1184

Author

Barrett, Maria Jung ; Gonzalez, Ana Valeria ; Frermann, Lea ; Søgaard, Anders. / Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies (Long Papers). editor / Silvio Ricardo Cordeiro ; Shereen Oraby ; Umashanthi Pavalanathan ; Kyeongmin Rim. Vol. 1, Association for Computational Linguistics, 2018. pp. 2028-2038

Bibtex

@inproceedings{2f629969d6ca4f4dada21de93ed6b658,
title = "Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing",
abstract = "When learning POS taggers and syntactic chunkers for low-resource languages, different resources may be available, and often all we have is a small tag dictionary, motivating type-constrained unsupervised induction. Even small dictionaries can improve the performance of unsupervised induction algorithms. This paper shows that performance can be further improved by including data that is readily available or can be easily obtained for most languages, i.e., eye-tracking, speech, or keystroke logs (or any combination thereof). We project information from all these data sources into shared spaces, in which the union of words is represented. For English unsupervised POS induction, the additional information, which is not required at test time, leads to an average error reduction on Ontonotes domains of 1.5% over systems augmented with state-of-the-art word embeddings. On Penn Treebank the best model achieves 5.4% error reduction over a word embeddings baseline. We also achieve significant improvements for syntactic chunk induction. Our analysis shows that improvements are even bigger when the available tag dictionaries are smaller. ",
author = "Barrett, {Maria Jung} and Gonzalez, {Ana Valeria} and Lea Frermann and Anders S{\o}gaard",
year = "2018",
doi = "10.18653/v1/N18-1184",
language = "English",
isbn = "978-1-948087-27-8",
volume = "1",
pages = "2028--2038",
editor = "Cordeiro, {Silvio Ricardo} and Oraby, Shereen and Pavalanathan, Umashanthi and Rim, Kyeongmin",
booktitle = "Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies (Long Papers)",
publisher = "Association for Computational Linguistics",
note = "16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2018 ; Conference date: 01-06-2018 Through 06-06-2018",

}

RIS

TY - GEN

T1 - Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing

AU - Barrett, Maria Jung

AU - Gonzalez, Ana Valeria

AU - Frermann, Lea

AU - Søgaard, Anders

PY - 2018

Y1 - 2018

N2 - When learning POS taggers and syntactic chunkers for low-resource languages, different resources may be available, and often all we have is a small tag dictionary, motivating type-constrained unsupervised induction. Even small dictionaries can improve the performance of unsupervised induction algorithms. This paper shows that performance can be further improved by including data that is readily available or can be easily obtained for most languages, i.e., eye-tracking, speech, or keystroke logs (or any combination thereof). We project information from all these data sources into shared spaces, in which the union of words is represented. For English unsupervised POS induction, the additional information, which is not required at test time, leads to an average error reduction on Ontonotes domains of 1.5% over systems augmented with state-of-the-art word embeddings. On Penn Treebank the best model achieves 5.4% error reduction over a word embeddings baseline. We also achieve significant improvements for syntactic chunk induction. Our analysis shows that improvements are even bigger when the available tag dictionaries are smaller.

AB - When learning POS taggers and syntactic chunkers for low-resource languages, different resources may be available, and often all we have is a small tag dictionary, motivating type-constrained unsupervised induction. Even small dictionaries can improve the performance of unsupervised induction algorithms. This paper shows that performance can be further improved by including data that is readily available or can be easily obtained for most languages, i.e., eye-tracking, speech, or keystroke logs (or any combination thereof). We project information from all these data sources into shared spaces, in which the union of words is represented. For English unsupervised POS induction, the additional information, which is not required at test time, leads to an average error reduction on Ontonotes domains of 1.5% over systems augmented with state-of-the-art word embeddings. On Penn Treebank the best model achieves 5.4% error reduction over a word embeddings baseline. We also achieve significant improvements for syntactic chunk induction. Our analysis shows that improvements are even bigger when the available tag dictionaries are smaller.

U2 - 10.18653/v1/N18-1184

DO - 10.18653/v1/N18-1184

M3 - Article in proceedings

SN - 978-1-948087-27-8

VL - 1

SP - 2028

EP - 2038

BT - Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies (Long Papers)

A2 - Cordeiro, Silvio Ricardo

A2 - Oraby, Shereen

A2 - Pavalanathan, Umashanthi

A2 - Rim, Kyeongmin

PB - Association for Computational Linguistics

T2 - 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Y2 - 1 June 2018 through 6 June 2018

ER -