Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing
Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed
Standard
Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing. / Barrett, Maria Jung; Gonzalez, Ana Valeria; Frermann, Lea; Søgaard, Anders.
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies, Volume 1 (Long Papers). Ed. / Silvio Ricardo Cordeiro; Shereen Oraby; Umashanthi Pavalanathan; Kyeongmin Rim. Association for Computational Linguistics, 2018. pp. 2028-2038.
RIS
TY - GEN
T1 - Unsupervised Induction of Linguistic Categories with Records of Reading, Speaking, and Writing
AU - Barrett, Maria Jung
AU - Gonzalez, Ana Valeria
AU - Frermann, Lea
AU - Søgaard, Anders
PY - 2018
Y1 - 2018
N2 - When learning POS taggers and syntactic chunkers for low-resource languages, different resources may be available, and often all we have is a small tag dictionary, motivating type-constrained unsupervised induction. Even small dictionaries can improve the performance of unsupervised induction algorithms. This paper shows that performance can be further improved by including data that is readily available or can be easily obtained for most languages, i.e., eye-tracking, speech, or keystroke logs (or any combination thereof). We project information from all these data sources into shared spaces, in which the union of words is represented. For English unsupervised POS induction, the additional information, which is not required at test time, leads to an average error reduction on Ontonotes domains of 1.5% over systems augmented with state-of-the-art word embeddings. On Penn Treebank the best model achieves 5.4% error reduction over a word embeddings baseline. We also achieve significant improvements for syntactic chunk induction. Our analysis shows that improvements are even bigger when the available tag dictionaries are smaller.
AB - When learning POS taggers and syntactic chunkers for low-resource languages, different resources may be available, and often all we have is a small tag dictionary, motivating type-constrained unsupervised induction. Even small dictionaries can improve the performance of unsupervised induction algorithms. This paper shows that performance can be further improved by including data that is readily available or can be easily obtained for most languages, i.e., eye-tracking, speech, or keystroke logs (or any combination thereof). We project information from all these data sources into shared spaces, in which the union of words is represented. For English unsupervised POS induction, the additional information, which is not required at test time, leads to an average error reduction on Ontonotes domains of 1.5% over systems augmented with state-of-the-art word embeddings. On Penn Treebank the best model achieves 5.4% error reduction over a word embeddings baseline. We also achieve significant improvements for syntactic chunk induction. Our analysis shows that improvements are even bigger when the available tag dictionaries are smaller.
U2 - 10.18653/v1/N18-1184
DO - 10.18653/v1/N18-1184
M3 - Article in proceedings
SN - 978-1-948087-27-8
VL - 1
SP - 2028
EP - 2038
BT - Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)
A2 - Cordeiro, Silvio Ricardo
A2 - Oraby, Shereen
A2 - Pavalanathan, Umashanthi
A2 - Rim, Kyeongmin
PB - Association for Computational Linguistics
T2 - 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Y2 - 1 June 2018 through 6 June 2018
ER -
ID: 202768266