Light syntactically-based index pruning for information retrieval

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Most index pruning techniques eliminate terms from an index on the basis of the contribution of those terms to the content of the documents. We present a novel syntactically-based index pruning technique, which uses exclusively shallow syntactic evidence to decide which terms to prune. This type of evidence is document-independent, and is based on the assumption that, in a general collection of documents, there exists an approximately proportional relation between the frequency and content of 'blocks of parts of speech' (POS blocks) [5]. POS blocks are fixed-length sequences of nouns, verbs, and other parts of speech, extracted from a corpus. We remove from the index terms that correspond to low-frequency POS blocks, using two different strategies: (i) considering that low-frequency POS blocks correspond to sequences of content-poor words, and (ii) considering that low-frequency POS blocks that also contain 'non content-bearing parts of speech', such as prepositions, correspond to sequences of content-poor words. We experiment with two TREC test collections and two statistically different weighting models. Using full indices as our baseline, we show that syntactically-based index pruning overall enhances retrieval performance, in terms of both average and early precision, for light pruning levels, while also reducing the size of the index. Our novel low-cost technique performs at least similarly to other related work, even though it does not consider document-specific information, and as such it is more general.
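The two pruning strategies described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sliding-window extraction of POS blocks, the block length `n`, the `min_freq` threshold, the coarse tag names, the `NON_CONTENT_TAGS` set, and the rule that a term occurrence is kept when at least one block covering it is retained are all assumptions made for illustration, and the toy `docs` data is invented.

```python
from collections import Counter

# Hypothetical coarse tags treated as non content-bearing (an assumption).
NON_CONTENT_TAGS = {"PREP", "DET", "CONJ"}


def pos_blocks(tags, n):
    """Return every window of n consecutive POS tags as a tuple."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def block_frequencies(tagged_docs, n):
    """Count how often each POS block occurs across the whole collection."""
    counts = Counter()
    for doc in tagged_docs:
        counts.update(pos_blocks([tag for _, tag in doc], n))
    return counts


def prune(tagged_docs, n=3, min_freq=2, require_non_content=False):
    """Drop term occurrences that are covered only by prunable POS blocks.

    require_non_content=False mirrors strategy (i): every low-frequency
    block is prunable. require_non_content=True mirrors strategy (ii): a
    low-frequency block is prunable only if it also contains a tag from
    NON_CONTENT_TAGS.
    """
    counts = block_frequencies(tagged_docs, n)
    pruned_docs = []
    for doc in tagged_docs:
        if len(doc) < n:
            # Too short to form a block: keep unchanged (an assumption).
            pruned_docs.append([term for term, _ in doc])
            continue
        tags = [tag for _, tag in doc]
        keep = [False] * len(doc)
        for i, block in enumerate(pos_blocks(tags, n)):
            prunable = counts[block] < min_freq
            if require_non_content:
                prunable = prunable and any(t in NON_CONTENT_TAGS for t in block)
            if not prunable:
                for j in range(i, i + n):
                    keep[j] = True
        pruned_docs.append([term for (term, _), k in zip(doc, keep) if k])
    return pruned_docs


# Invented toy collection of (term, POS tag) pairs.
docs = [
    [("index", "NOUN"), ("pruning", "NOUN"), ("helps", "VERB")],
    [("query", "NOUN"), ("expansion", "NOUN"), ("works", "VERB")],
    [("of", "PREP"), ("the", "DET"), ("at", "PREP")],
    [("quickly", "ADV"), ("very", "ADV"), ("red", "ADJ")],
]

strategy_i = prune(docs)
strategy_ii = prune(docs, require_non_content=True)
```

In this toy collection only the NOUN-NOUN-VERB block reaches the frequency threshold, so strategy (i) empties the last two documents, while strategy (ii) empties only the third, whose rare block contains prepositions and a determiner. Note that the block counts come from the collection as a whole and no per-document term statistics are used, which is the sense in which the evidence is document-independent.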
Original language: English
Title of host publication: ECIR'07 Proceedings of the 29th European conference on IR research
Publication date: 2007
Pages: 88-100
Publication status: Published - 2007
Externally published: Yes

Bibliographical note

Published in: ECIR'07 Proceedings of the 29th European conference on IR research, pages 88-100
Springer-Verlag Berlin, Heidelberg ©2007
ISBN: 978-3-540-71494-1

ID: 38251980