An in-depth investigation on the behavior of measures to quantify reproducibility

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

An in-depth investigation on the behavior of measures to quantify reproducibility. / Maistro, Maria; Breuer, Timo; Schaer, Philipp; Ferro, Nicola.

In: Information Processing and Management, Vol. 60, No. 3, 103332, 2023.

Research output: Contribution to journal › Journal article › Research › peer-review

Harvard

Maistro, M, Breuer, T, Schaer, P & Ferro, N 2023, 'An in-depth investigation on the behavior of measures to quantify reproducibility', Information Processing and Management, vol. 60, no. 3, 103332. https://doi.org/10.1016/j.ipm.2023.103332

APA

Maistro, M., Breuer, T., Schaer, P., & Ferro, N. (2023). An in-depth investigation on the behavior of measures to quantify reproducibility. Information Processing and Management, 60(3), [103332]. https://doi.org/10.1016/j.ipm.2023.103332

Vancouver

Maistro M, Breuer T, Schaer P, Ferro N. An in-depth investigation on the behavior of measures to quantify reproducibility. Information Processing and Management. 2023;60(3):103332. https://doi.org/10.1016/j.ipm.2023.103332

Author

Maistro, Maria ; Breuer, Timo ; Schaer, Philipp ; Ferro, Nicola. / An in-depth investigation on the behavior of measures to quantify reproducibility. In: Information Processing and Management. 2023 ; Vol. 60, No. 3.

BibTeX

@article{fba81796e17d4ac3970220629675b666,
title = "An in-depth investigation on the behavior of measures to quantify reproducibility",
abstract = "Science is facing a so-called reproducibility crisis, where researchers struggle to repeat experiments and to get the same or comparable results. This represents a fundamental problem in any scientific discipline because reproducibility lies at the very basis of the scientific method. A central methodological question is how to measure reproducibility and interpret different measures. In Information Retrieval (IR), current practices to measure reproducibility rely mainly on comparing averaged scores. If the reproduced score is close enough to the original one, the reproducibility experiment is deemed successful, although the identical scores can still rely on entirely different result lists. Therefore, this paper focuses on measures to quantify reproducibility in IR and their behavior. We present a critical analysis of IR reproducibility measures by synthetically generating runs in a controlled experimental setting, which allows us to control the amount of reproducibility error. These synthetic runs are generated by a deterioration algorithm based on swaps and replacements of documents in ranked lists. We investigate the behavior of different reproducibility measures with these synthetic runs in three different scenarios. Moreover, we propose a normalized version of Root Mean Square Error (RMSE) to quantify reproducibility better. Experimental results show that a single score is not enough to decide whether an experiment is successfully reproduced because such a score depends on the type of effectiveness measure and the performance of the original run. This study highlights how challenging it can be to reproduce experimental results and quantify the amount of reproducibility.",
keywords = "Evaluation, Information retrieval, Reproducibility",
author = "Maria Maistro and Timo Breuer and Philipp Schaer and Nicola Ferro",
note = "Publisher Copyright: {\textcopyright} 2023 The Author(s)",
year = "2023",
doi = "10.1016/j.ipm.2023.103332",
language = "English",
volume = "60",
journal = "Information Processing & Management",
issn = "0306-4573",
publisher = "Elsevier",
number = "3",

}
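
Note: the abstract above mentions a deterioration algorithm based on swaps and replacements of documents in ranked lists. The sketch below is only a minimal illustration of that general idea; the function name, parameters, and random-choice policy are assumptions made for illustration and are not the authors' implementation.

import random

def deteriorate_run(ranked_docs, n_swaps, n_replacements, collection, seed=42):
    # ranked_docs: document ids of the original run for one topic, best first.
    # n_swaps: number of random adjacent swaps applied to the ranking.
    # n_replacements: number of ranked documents replaced by unranked ones.
    # collection: pool of document ids from which replacements are drawn.
    rng = random.Random(seed)
    run = list(ranked_docs)

    # Swap randomly chosen adjacent documents to perturb the ordering.
    for _ in range(n_swaps):
        i = rng.randrange(len(run) - 1)
        run[i], run[i + 1] = run[i + 1], run[i]

    # Replace randomly chosen ranked documents with documents
    # that do not appear in the original ranking.
    ranked = set(run)
    candidates = [d for d in collection if d not in ranked]
    for _ in range(min(n_replacements, len(candidates))):
        pos = rng.randrange(len(run))
        run[pos] = candidates.pop(rng.randrange(len(candidates)))

    return run

Increasing n_swaps and n_replacements produces synthetic runs that drift further from the original ranking, which is the kind of controlled reproducibility error the abstract refers to.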

RIS

TY - JOUR

T1 - An in-depth investigation on the behavior of measures to quantify reproducibility

AU - Maistro, Maria

AU - Breuer, Timo

AU - Schaer, Philipp

AU - Ferro, Nicola

N1 - Publisher Copyright: © 2023 The Author(s)

PY - 2023

Y1 - 2023

N2 - Science is facing a so-called reproducibility crisis, where researchers struggle to repeat experiments and to get the same or comparable results. This represents a fundamental problem in any scientific discipline because reproducibility lies at the very basis of the scientific method. A central methodological question is how to measure reproducibility and interpret different measures. In Information Retrieval (IR), current practices to measure reproducibility rely mainly on comparing averaged scores. If the reproduced score is close enough to the original one, the reproducibility experiment is deemed successful, although the identical scores can still rely on entirely different result lists. Therefore, this paper focuses on measures to quantify reproducibility in IR and their behavior. We present a critical analysis of IR reproducibility measures by synthetically generating runs in a controlled experimental setting, which allows us to control the amount of reproducibility error. These synthetic runs are generated by a deterioration algorithm based on swaps and replacements of documents in ranked lists. We investigate the behavior of different reproducibility measures with these synthetic runs in three different scenarios. Moreover, we propose a normalized version of Root Mean Square Error (RMSE) to quantify reproducibility better. Experimental results show that a single score is not enough to decide whether an experiment is successfully reproduced because such a score depends on the type of effectiveness measure and the performance of the original run. This study highlights how challenging it can be to reproduce experimental results and quantify the amount of reproducibility.

AB - Science is facing a so-called reproducibility crisis, where researchers struggle to repeat experiments and to get the same or comparable results. This represents a fundamental problem in any scientific discipline because reproducibility lies at the very basis of the scientific method. A central methodological question is how to measure reproducibility and interpret different measures. In Information Retrieval (IR), current practices to measure reproducibility rely mainly on comparing averaged scores. If the reproduced score is close enough to the original one, the reproducibility experiment is deemed successful, although the identical scores can still rely on entirely different result lists. Therefore, this paper focuses on measures to quantify reproducibility in IR and their behavior. We present a critical analysis of IR reproducibility measures by synthetically generating runs in a controlled experimental setting, which allows us to control the amount of reproducibility error. These synthetic runs are generated by a deterioration algorithm based on swaps and replacements of documents in ranked lists. We investigate the behavior of different reproducibility measures with these synthetic runs in three different scenarios. Moreover, we propose a normalized version of Root Mean Square Error (RMSE) to quantify reproducibility better. Experimental results show that a single score is not enough to decide whether an experiment is successfully reproduced because such a score depends on the type of effectiveness measure and the performance of the original run. This study highlights how challenging it can be to reproduce experimental results and quantify the amount of reproducibility.

KW - Evaluation

KW - Information retrieval

KW - Reproducibility

UR - http://www.scopus.com/inward/record.url?scp=85150195597&partnerID=8YFLogxK

U2 - 10.1016/j.ipm.2023.103332

DO - 10.1016/j.ipm.2023.103332

M3 - Journal article

AN - SCOPUS:85150195597

VL - 60

JO - Information Processing & Management

JF - Information Processing & Management

SN - 0306-4573

IS - 3

M1 - 103332

ER -
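
Note: the abstract also proposes a normalized version of Root Mean Square Error (RMSE) to compare per-topic effectiveness scores of an original and a reproduced run. The sketch below shows plain RMSE together with one possible normalization (dividing by the largest RMSE attainable against the original scores when the effectiveness measure is bounded in [0, 1]); this normalization is an assumption for illustration and may differ from the definition used in the paper.

import math

def rmse(original_scores, reproduced_scores):
    # Root Mean Square Error between per-topic effectiveness scores.
    n = len(original_scores)
    return math.sqrt(
        sum((o - r) ** 2 for o, r in zip(original_scores, reproduced_scores)) / n
    )

def normalized_rmse(original_scores, reproduced_scores):
    # Assumed normalization for illustration: divide by the largest RMSE
    # attainable against the original scores when the effectiveness measure
    # is bounded in [0, 1]. This may differ from the paper's definition.
    n = len(original_scores)
    worst = math.sqrt(sum(max(o, 1.0 - o) ** 2 for o in original_scores) / n)
    return rmse(original_scores, reproduced_scores) / worst if worst > 0 else 0.0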
