A cascaded classification approach to semantic head recognition

Publication: Contribution to book/anthology/report; Contribution to book/anthology; Research; peer-reviewed

Most NLP systems use tokenization as part of preprocessing. Generally, tokenizers are based on simple heuristics and do not recognize multi-word units (MWUs) like "hot dog" or "black hole" unless a precompiled list of MWUs is available. In this paper, we propose a new cascaded model for detecting MWUs of arbitrary length for tokenization, focusing on noun phrases in the physics domain. We adopt a classification approach because, unlike other work on MWUs, tokenization requires a completely automatic approach. We achieve an accuracy of 68% for recognizing non-compositional MWUs and show that our MWU recognizer improves retrieval performance when used as part of an information retrieval system.
Original language: English
Title: EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Number of pages: 11
Publication date: 1 Jan 2011
Pages: 793-803
ISBN (Print): 9781937284114
Status: Published - 1 Jan 2011
Externally published: Yes

ID: 49502244