On the Estimation and Use of Statistical Modelling in Information Retrieval

Research output: Book/Report › Ph.D. thesis › Research

Casper Petersen

Automatic text processing often relies on assumptions about the distribution of some property (such as term frequency) in the data being processed. In information retrieval (IR), such assumptions may be attributed to (i) the absence of principled approaches for determining the correct statistical distribution and (ii) the fact that making such assumptions does not seem to impact IR effectiveness. However, if such assumptions are not validated, any subsequent calculations, deductions or modelling become less accurate for the task at hand. To remove the need for such assumptions, this thesis first introduces a statistically principled method for selecting the best-fitting distribution. The thesis then demonstrates that integrating knowledge about the best-fitting distribution into IR leads to superior results compared to existing strong baselines on multiple datasets. Overall, this thesis concludes that assumptions regarding the distribution of dataset properties can be replaced with an effective, efficient and principled method for determining the best-fitting distribution, and that using this distribution can lead to improved retrieval performance.

Original language	English

Publisher	Department of Computer Science, Faculty of Science, University of Copenhagen
Publication status	Published - 2016

Department of Computer Science
