Part of Speech Based Term Weighting for Information Retrieval
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Automatic language processing tools typically assign to terms so-called 'weights' corresponding to the contribution of terms to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informative a term is in general, based on the 'POS contexts' in which it generally occurs in language. We suggest five different computations of POS-based term weights by extending existing statistical approximations of term information measures. We apply these POS-based term weights to information retrieval by integrating them into the model that matches documents to queries. Experiments with two TREC collections and 300 queries, using TF-IDF and BM25 as baselines, show that integrating our POS-based term weights into retrieval always leads to gains (up to +33.7% over the baseline). Additional experiments with a different retrieval model as baseline (a language model with Dirichlet prior smoothing) and our best-performing POS-based term weight show consistent retrieval gains across the whole smoothing range of the baseline.
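The abstract does not spell out the five computations, so the following is only a minimal sketch of the general idea, under stated assumptions. It scores each POS n-gram with an IDF-style statistic, takes a term's weight to be the mean score of the POS n-grams it occurs in, and multiplies that weight into a plain TF-IDF match. The function names (`pos_ngrams`, `pos_term_weights`, `weighted_tf_idf`), the trigram window, the `log((N + 1) / df)` score, and the multiplicative integration are all illustrative choices, not the paper's formulation.

```python
import math
from collections import Counter, defaultdict


def pos_ngrams(tagged_sentence, n=3):
    """Yield (pos_ngram, terms) for every window of n consecutive tokens."""
    for i in range(len(tagged_sentence) - n + 1):
        window = tagged_sentence[i:i + n]
        yield tuple(tag for _, tag in window), [term for term, _ in window]


def pos_term_weights(tagged_docs, n=3):
    """Map each term to the mean IDF-style score of the POS n-grams containing it.

    tagged_docs: list of documents; each document is a list of sentences;
    each sentence is a list of (term, pos_tag) pairs.
    Assumption: the n-gram score is log((N + 1) / df), which is always positive.
    """
    num_docs = len(tagged_docs)
    doc_freq = Counter()
    for doc in tagged_docs:
        # Count each POS n-gram once per document.
        doc_freq.update({g for sent in doc for g, _ in pos_ngrams(sent, n)})
    ngram_score = {g: math.log((num_docs + 1) / df) for g, df in doc_freq.items()}

    sums = defaultdict(float)
    counts = Counter()
    for doc in tagged_docs:
        for sent in doc:
            for ngram, terms in pos_ngrams(sent, n):
                for term in terms:
                    sums[term] += ngram_score[ngram]
                    counts[term] += 1
    return {term: sums[term] / counts[term] for term in sums}


def weighted_tf_idf(query_terms, doc_term_counts, idf, pos_weight):
    """Score one document: sum of tf * idf * POS-based weight over query terms.

    Terms without a POS-based weight fall back to 1.0 (plain TF-IDF).
    """
    return sum(
        doc_term_counts.get(t, 0) * idf.get(t, 0.0) * pos_weight.get(t, 1.0)
        for t in query_terms
    )


# Toy corpus: the POS tags are hand-assigned for illustration only.
docs = [
    [[("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]],
    [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]],
    [[("stocks", "NNS"), ("fell", "VBD"), ("sharply", "RB")]],
]
weights = pos_term_weights(docs)
score = weighted_tf_idf(["cat"], {"cat": 2}, {"cat": 1.5}, weights)
```

In this toy corpus, terms that occur only in the rarer POS pattern receive a higher weight than terms that occur in the more common one. In practice the tags would come from a POS tagger, and the weight could be folded into BM25 or a language model instead of TF-IDF.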
Original language | English |
---|---|
Title of host publication | ECIR '09 Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval |
Publication date | 2009 |
Pages | 412-423 |
Publication status | Published - 2009 |
Externally published | Yes |
Bibliographical note
Published in: ECIR '09 Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval
Pages 412-423
Springer-Verlag Berlin, Heidelberg © 2009
ISBN: 978-3-642-00957-0
DOI: 10.1007/978-3-642-00958-7_37
ID: 38252017