Light syntactically-based index pruning for information retrieval

Datalogisk Institut

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Light syntactically-based index pruning for information retrieval. / Lioma, Christina; Ounis, Iadh.

ECIR'07 Proceedings of the 29th European conference on IR research. 2007. p. 88-100.

Harvard

Lioma, C & Ounis, I 2007, Light syntactically-based index pruning for information retrieval. in ECIR'07 Proceedings of the 29th European conference on IR research. pp. 88-100. <http://64.238.147.53/citation.cfm?id=1763653.1763667&coll=DL&dl=GUIDE&CFID=87655016&CFTOKEN=30826131>

APA

Lioma, C., & Ounis, I. (2007). Light syntactically-based index pruning for information retrieval. In ECIR'07 Proceedings of the 29th European conference on IR research (pp. 88-100) http://64.238.147.53/citation.cfm?id=1763653.1763667&coll=DL&dl=GUIDE&CFID=87655016&CFTOKEN=30826131

Vancouver

Lioma C, Ounis I. Light syntactically-based index pruning for information retrieval. In ECIR'07 Proceedings of the 29th European conference on IR research. 2007. p. 88-100.

Author

Lioma, Christina ; Ounis, Iadh. / Light syntactically-based index pruning for information retrieval. ECIR'07 Proceedings of the 29th European conference on IR research. 2007. pp. 88-100

Bibtex

@inproceedings{ccaf4645daff400bb34c4d5eda18cc33,

title = "Light syntactically-based index pruning for information retrieval",

abstract = "Most index pruning techniques eliminate terms from an index on the basis of the contribution of those terms to the content of the documents. We present a novel syntactically-based index pruning technique, which uses exclusively shallow syntactic evidence to decide which terms to prune. This type of evidence is document-independent, and is based on the assumption that, in a general collection of documents, there exists an approximately proportional relation between the frequency and content of 'blocks of parts of speech' (POS blocks) [5]. POS blocks are fixed-length sequences of nouns, verbs, and other parts of speech, extracted from a corpus. We remove from the index terms that correspond to low-frequency POS blocks, using two different strategies: (i) considering that low-frequency POS blocks correspond to sequences of content-poor words, and (ii) considering that low-frequency POS blocks which also contain 'non content-bearing parts of speech', such as prepositions, correspond to sequences of content-poor words. We experiment with two TREC test collections and two statistically different weighting models. Using full indices as our baseline, we show that syntactically-based index pruning overall enhances retrieval performance, in terms of both average and early precision, for light pruning levels, while also reducing the size of the index. Our novel low-cost technique performs at least similarly to other related work, even though it does not consider document-specific information, and as such it is more general.",

author = "Christina Lioma and Iadh Ounis",

note = "Published in: ECIR'07 Proceedings of the 29th European conference on IR research, pages 88-100. Springer-Verlag, Berlin, Heidelberg, {\textcopyright} 2007. ISBN: 978-3-540-71494-1",

year = "2007",

language = "English",

pages = "88--100",

booktitle = "ECIR'07 Proceedings of the 29th European conference on IR research",

}

RIS

TY - GEN

T1 - Light syntactically-based index pruning for information retrieval

AU - Lioma, Christina

AU - Ounis, Iadh

PY - 2007

Y1 - 2007

N2 - Most index pruning techniques eliminate terms from an index on the basis of the contribution of those terms to the content of the documents. We present a novel syntactically-based index pruning technique, which uses exclusively shallow syntactic evidence to decide which terms to prune. This type of evidence is document-independent, and is based on the assumption that, in a general collection of documents, there exists an approximately proportional relation between the frequency and content of 'blocks of parts of speech' (POS blocks) [5]. POS blocks are fixed-length sequences of nouns, verbs, and other parts of speech, extracted from a corpus. We remove from the index terms that correspond to low-frequency POS blocks, using two different strategies: (i) considering that low-frequency POS blocks correspond to sequences of content-poor words, and (ii) considering that low-frequency POS blocks which also contain 'non content-bearing parts of speech', such as prepositions, correspond to sequences of content-poor words. We experiment with two TREC test collections and two statistically different weighting models. Using full indices as our baseline, we show that syntactically-based index pruning overall enhances retrieval performance, in terms of both average and early precision, for light pruning levels, while also reducing the size of the index. Our novel low-cost technique performs at least similarly to other related work, even though it does not consider document-specific information, and as such it is more general.

M3 - Article in proceedings

SP - 88

EP - 100

BT - ECIR'07 Proceedings of the 29th European conference on IR research

ER -

ID: 38251980