PhD Defence by Brian Brost

Online Evaluation of Rankers Using Multileaving


This thesis addresses two central challenges in the online evaluation of rankers for information retrieval: (i) the design of multileaving algorithms, and (ii) the management of the exploration-exploitation tradeoff associated with online evaluation using multileaving.

Multileaving is an online evaluation approach in which the ranked lists produced by a set of rankers are combined into a single ranked list that is presented to users of a system. The quality of the rankers is then inferred from implicit user feedback, i.e. which items the user clicks on. Multileaving generalizes interleaving, which combines only pairs of rankers at a time. Multileaving has been shown to require less user feedback than interleaving to evaluate a set of rankers.
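As an illustration, one common scheme from this literature is a team-draft-style multileave: rankers take turns contributing their highest-ranked unpicked document, and each click is credited to the ranker that contributed that slot. The sketch below is illustrative only (the function names are hypothetical, and the thesis's own algorithms differ):

```python
import random

def team_draft_multileave(rankings, length):
    """Combine several ranked lists into one multileaved list.

    Team-draft style: in each round the rankers take turns in random
    order, and each contributes its highest-ranked document not yet
    selected. credit[i] records which ranker filled slot i.
    """
    multileaved = []  # combined list shown to the user
    credit = []       # credit[i] = index of ranker that contributed slot i
    while len(multileaved) < length:
        turn_order = list(range(len(rankings)))
        random.shuffle(turn_order)  # randomize who picks first this round
        for r in turn_order:
            if len(multileaved) >= length:
                break
            for doc in rankings[r]:
                if doc not in multileaved:
                    multileaved.append(doc)
                    credit.append(r)
                    break
    return multileaved, credit

def score_clicks(credit, clicked_slots, n_rankers):
    """Credit each clicked slot to the ranker that contributed it."""
    scores = [0] * n_rankers
    for slot in clicked_slots:
        scores[credit[slot]] += 1
    return scores
```

The key property this sketch shows is that inference couples tightly to list construction: a ranker's score depends on which slots it was allowed to fill, which is exactly the kind of interaction the thesis examines.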

We show that prior multileaving methods can be much less accurate than previously believed. In particular, prior multileaving methods fail to account for the interaction between how they create the multileaved lists presented to users and how they use implicit feedback to infer the relative quality of rankers. As a result, their quality estimates can depend on artefacts of the multileaving process rather than on the quality of the rankers being evaluated.

We introduce two new multileaving algorithms. Sample-Only Scored Multileaving (SOSM) is the first multileaving algorithm to scale well with the number of rankers being compared, without introducing substantial errors. Multileaving using Importance Sampling (MIS) is the first multileaving algorithm for which we can provide provable guarantees regarding the accuracy of evaluation.

Multileaving evaluates a chosen set of rankers; however, it does not address how to choose this set. This is a classic exploration-versus-exploitation problem. On the one hand, we would like the multileaved list to contain relevant documents, i.e. we should exploit the rankers we believe are good. On the other hand, other rankers may be better, i.e. we should explore rankers we are uncertain about.

This problem has been previously framed as a dueling bandit problem when evaluating rankers using interleaving.

We extend the dueling bandit framework to manage the exploration-exploitation tradeoff associated with most existing multileaving algorithms. This framework is designed for algorithms whose comparisons produce binary outcomes, reflecting whether one ranker was better than another. For this setting we introduce multi-dueling bandits, and show that regret can be reduced by orders of magnitude relative to what is attainable for dueling bandits.

For managing the exploration-exploitation tradeoff associated with multileaving algorithms such as MIS, which output absolute scores for rankers, we introduce a new variant of the bandits-with-multiple-plays setting. It differs from previous multiple-play settings in that the number of arms played at each iteration is not fixed; it is therefore possible to converge on exclusively selecting the best arm.
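One simple way to realize a non-fixed play set is a confidence-interval rule: play every arm whose upper confidence bound still overlaps the best arm's lower confidence bound, so the played set shrinks as estimates tighten. This is a generic illustrative rule under standard Hoeffding-style bounds, not the algorithm proposed in the thesis:

```python
import math

def select_arms(means, counts, t, delta=0.05):
    """Return the indices of arms to play at round t.

    Illustrative rule: keep any arm whose upper confidence bound is at
    least the largest lower confidence bound. Unplayed arms (count 0)
    get an infinite radius and are therefore always selected.
    """
    def radius(n):
        if n == 0:
            return math.inf
        return math.sqrt(math.log(2 * t / delta) / (2 * n))

    lower = [m - radius(n) for m, n in zip(means, counts)]
    best_lower = max(lower)
    return [i for i, (m, n) in enumerate(zip(means, counts))
            if m + radius(n) >= best_lower]
```

Early on, wide intervals keep many arms in play (exploration); once one arm's lower bound exceeds every other arm's upper bound, only that arm is selected (pure exploitation), matching the convergence behaviour described above.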

Assessment Committee:

Chairman: Associate Professor Christian Wulff-Nilsen, Department of Computer Science, University of Copenhagen

Associate Professor Emine Yilmaz, Department of Computer Science, University College London

Professor Lars Kai Hansen, Section for Cognitive Systems, Technical University of Denmark

Academic supervisors:

Professor Ingemar Johansson Cox, Department of Computer Science, University of Copenhagen

Associate Professor Christina Lioma, Department of Computer Science, University of Copenhagen

For an electronic copy of the thesis, please contact