Light syntactically-based index pruning for information retrieval

Publikation: Bidrag til bog/antologi/rapport › Konferencebidrag i proceedings › Forskning › fagfællebedømt

Lioma, Christina
Iadh Ounis

Most index pruning techniques eliminate terms from an index on the basis of the contribution of those terms to the content of the documents. We present a novel syntactically-based index pruning technique, which uses exclusively shallow syntactic evidence to decide upon which terms to prune. This type of evidence is document-independent, and is based on the assumption that, in a general collection of documents, there exists an approximately proportional relation between the frequency and content of 'blocks of parts of speech' (POS blocks) [5]. POS blocks are fixed-length sequences of nouns, verbs, and other parts of speech, extracted from a corpus. We remove from the index, terms that correspond to low-frequency POS blocks, using two different strategies: (i) considering that low-frequency POS blocks correspond to sequences of content-poor words, and (ii) considering that low-frequency POS blocks, which also contain 'non content-bearing parts of speech', such as prepositions for example, correspond to sequences of contentpoor words. We experiment with two TREC test collections and two statistically different weighting models. Using full indices as our baseline, we show that syntactically-based index pruning overall enhances retrieval performance, in terms of both average and early precision, for light pruning levels, while also reducing the size of the index. Our novel low-cost technique performs at least similarly to other related work, even though it does not consider document-specific information, and as such it is more general.

Originalsprog	Engelsk
Titel	ECIR'07 Proceedings of the 29th European conference on IR research
Publikationsdato	2007
Sider	88-100
Status	Udgivet - 2007
Eksternt udgivet	Ja

Datalogisk Institut

Light syntactically-based index pruning for information retrieval

Links