A cascaded classification approach to semantic head recognition

Research output: Chapter in Book/Report/Conference proceeding › Book chapter › Research › peer-review

Most NLP systems use tokenization as part of preprocessing. Generally, tokenizers are based on simple heuristics and do not recognize multi-word units (MWUs) like "hot dog" or "black hole" unless a precompiled list of MWUs is available. In this paper, we propose a new cascaded model for detecting MWUs of arbitrary length for tokenization, focusing on noun phrases in the physics domain. We adopt a classification approach because, unlike other work on MWUs, tokenization requires a completely automatic approach. We achieve an accuracy of 68% for recognizing non-compositional MWUs and show that our MWU recognizer improves retrieval performance when used as part of an information retrieval system.
Original language: English
Title of host publication: EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Number of pages: 11
Publication date: 1 Jan 2011
Pages: 793-803
ISBN (Print): 9781937284114
Publication status: Published - 1 Jan 2011
Externally published: Yes

ID: 49502244