Inter- and Intra-Observer Agreement When Using a Diagnostic Labeling Scheme for Annotating Findings on Chest X-rays—An Early Step in the Development of a Deep Learning-Based Decision Support System

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

Inter- and Intra-Observer Agreement When Using a Diagnostic Labeling Scheme for Annotating Findings on Chest X-rays—An Early Step in the Development of a Deep Learning-Based Decision Support System. / Li, Dana; Pehrson, Lea Marie; Tøttrup, Lea; Fraccaro, Marco; Bonnevie, Rasmus; Thrane, Jakob; Sørensen, Peter Jagd; Rykkje, Alexander; Andersen, Tobias Thostrup; Steglich-Arnholm, Henrik; Stærk, Dorte Marianne Rohde; Borgwardt, Lotte; Hansen, Kristoffer Lindskov; Darkner, Sune; Carlsen, Jonathan Frederik; Nielsen, Michael Bachmann.

In: Diagnostics, Vol. 12, No. 12, 3112, 2022.

Research output: Contribution to journal › Journal article › Research › peer-review

Harvard

Li, D, Pehrson, LM, Tøttrup, L, Fraccaro, M, Bonnevie, R, Thrane, J, Sørensen, PJ, Rykkje, A, Andersen, TT, Steglich-Arnholm, H, Stærk, DMR, Borgwardt, L, Hansen, KL, Darkner, S, Carlsen, JF & Nielsen, MB 2022, 'Inter- and Intra-Observer Agreement When Using a Diagnostic Labeling Scheme for Annotating Findings on Chest X-rays—An Early Step in the Development of a Deep Learning-Based Decision Support System', Diagnostics, vol. 12, no. 12, 3112. https://doi.org/10.3390/diagnostics12123112

APA

Li, D., Pehrson, L. M., Tøttrup, L., Fraccaro, M., Bonnevie, R., Thrane, J., Sørensen, P. J., Rykkje, A., Andersen, T. T., Steglich-Arnholm, H., Stærk, D. M. R., Borgwardt, L., Hansen, K. L., Darkner, S., Carlsen, J. F., & Nielsen, M. B. (2022). Inter- and Intra-Observer Agreement When Using a Diagnostic Labeling Scheme for Annotating Findings on Chest X-rays—An Early Step in the Development of a Deep Learning-Based Decision Support System. Diagnostics, 12(12), Article 3112. https://doi.org/10.3390/diagnostics12123112

Vancouver

Li D, Pehrson LM, Tøttrup L, Fraccaro M, Bonnevie R, Thrane J et al. Inter- and Intra-Observer Agreement When Using a Diagnostic Labeling Scheme for Annotating Findings on Chest X-rays—An Early Step in the Development of a Deep Learning-Based Decision Support System. Diagnostics. 2022;12(12):3112. https://doi.org/10.3390/diagnostics12123112

Author

Li, Dana ; Pehrson, Lea Marie ; Tøttrup, Lea ; Fraccaro, Marco ; Bonnevie, Rasmus ; Thrane, Jakob ; Sørensen, Peter Jagd ; Rykkje, Alexander ; Andersen, Tobias Thostrup ; Steglich-Arnholm, Henrik ; Stærk, Dorte Marianne Rohde ; Borgwardt, Lotte ; Hansen, Kristoffer Lindskov ; Darkner, Sune ; Carlsen, Jonathan Frederik ; Nielsen, Michael Bachmann. / Inter- and Intra-Observer Agreement When Using a Diagnostic Labeling Scheme for Annotating Findings on Chest X-rays—An Early Step in the Development of a Deep Learning-Based Decision Support System. In: Diagnostics. 2022 ; Vol. 12, No. 12.

Bibtex

@article{083a2a7c43b8410085c4529625883129,
title = "Inter- and Intra-Observer Agreement When Using a Diagnostic Labeling Scheme for Annotating Findings on Chest X-rays—An Early Step in the Development of a Deep Learning-Based Decision Support System",
abstract = "Consistent annotation of data is a prerequisite for the successful training and testing of artificial intelligence-based decision support systems in radiology. This can be obtained by standardizing terminology when annotating diagnostic images. The purpose of this study was to evaluate the annotation consistency among radiologists when using a novel diagnostic labeling scheme for chest X-rays. Six radiologists with experience ranging from one to sixteen years, annotated a set of 100 fully anonymized chest X-rays. The blinded radiologists annotated on two separate occasions. Statistical analyses were done using Randolph{\textquoteright}s kappa and PABAK, and the proportions of specific agreements were calculated. Fair-to-excellent agreement was found for all labels among the annotators (Randolph{\textquoteright}s Kappa, 0.40–0.99). The PABAK ranged from 0.12 to 1 for the two-reader inter-rater agreement and 0.26 to 1 for the intra-rater agreement. Descriptive and broad labels achieved the highest proportion of positive agreement in both the inter- and intra-reader analyses. Annotating findings with specific, interpretive labels were found to be difficult for less experienced radiologists. Annotating images with descriptive labels may increase agreement between radiologists with different experience levels compared to annotation with interpretive labels.",
keywords = "artificial intelligence, chest X-ray, diagnostic scheme, image annotation, inter-rater, intra-rater, ontology, radiologists",
author = "Dana Li and Pehrson, {Lea Marie} and Lea T{\o}ttrup and Marco Fraccaro and Rasmus Bonnevie and Jakob Thrane and S{\o}rensen, {Peter Jagd} and Alexander Rykkje and Andersen, {Tobias Thostrup} and Henrik Steglich-Arnholm and St{\ae}rk, {Dorte Marianne Rohde} and Lotte Borgwardt and Hansen, {Kristoffer Lindskov} and Sune Darkner and Carlsen, {Jonathan Frederik} and Nielsen, {Michael Bachmann}",
note = "Publisher Copyright: {\textcopyright} 2022 by the authors.",
year = "2022",
doi = "10.3390/diagnostics12123112",
language = "English",
volume = "12",
journal = "Diagnostics",
issn = "2075-4418",
publisher = "MDPI AG",
number = "12",

}
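
The abstract above quantifies agreement with Randolph's free-marginal kappa and PABAK (prevalence-adjusted bias-adjusted kappa). As a minimal, illustrative sketch of how these two statistics are commonly computed (standard textbook formulas are assumed; the toy data and function names below are hypothetical and are not the authors' analysis code):

def pabak(rater_a, rater_b):
    """Prevalence-adjusted bias-adjusted kappa for two raters, binary labels.

    PABAK = 2 * p_o - 1, where p_o is the observed proportion agreement.
    """
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
    return 2 * p_o - 1

def randolphs_kappa(counts):
    """Randolph's free-marginal multirater kappa.

    counts[i][j] is the number of raters who assigned subject i to
    category j; every row must sum to the same number of raters n.
    Chance agreement is fixed at 1/k for k categories.
    """
    n = sum(counts[0])          # raters per subject
    k = len(counts[0])          # number of label categories
    p_o = sum((sum(c * c for c in row) - n) / (n * (n - 1))
              for row in counts) / len(counts)
    p_e = 1.0 / k
    return (p_o - p_e) / (1 - p_e)

# Hypothetical toy example: 4 images, 3 raters, binary label (finding yes/no).
print(randolphs_kappa([[3, 0], [2, 1], [0, 3], [3, 0]]))   # ~0.67
print(pabak([1, 0, 1, 1], [1, 0, 0, 1]))                   # 0.5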

RIS

TY - JOUR

T1 - Inter- and Intra-Observer Agreement When Using a Diagnostic Labeling Scheme for Annotating Findings on Chest X-rays—An Early Step in the Development of a Deep Learning-Based Decision Support System

AU - Li, Dana

AU - Pehrson, Lea Marie

AU - Tøttrup, Lea

AU - Fraccaro, Marco

AU - Bonnevie, Rasmus

AU - Thrane, Jakob

AU - Sørensen, Peter Jagd

AU - Rykkje, Alexander

AU - Andersen, Tobias Thostrup

AU - Steglich-Arnholm, Henrik

AU - Stærk, Dorte Marianne Rohde

AU - Borgwardt, Lotte

AU - Hansen, Kristoffer Lindskov

AU - Darkner, Sune

AU - Carlsen, Jonathan Frederik

AU - Nielsen, Michael Bachmann

N1 - Publisher Copyright: © 2022 by the authors.

PY - 2022

Y1 - 2022

N2 - Consistent annotation of data is a prerequisite for the successful training and testing of artificial intelligence-based decision support systems in radiology. This can be achieved by standardizing terminology when annotating diagnostic images. The purpose of this study was to evaluate the annotation consistency among radiologists when using a novel diagnostic labeling scheme for chest X-rays. Six radiologists with experience ranging from one to sixteen years annotated a set of 100 fully anonymized chest X-rays. The blinded radiologists annotated on two separate occasions. Statistical analyses were performed using Randolph’s kappa and PABAK, and the proportions of specific agreement were calculated. Fair-to-excellent agreement was found for all labels among the annotators (Randolph’s kappa, 0.40–0.99). The PABAK ranged from 0.12 to 1 for the two-reader inter-rater agreement and 0.26 to 1 for the intra-rater agreement. Descriptive and broad labels achieved the highest proportion of positive agreement in both the inter- and intra-reader analyses. Annotating findings with specific, interpretive labels was found to be difficult for less experienced radiologists. Annotating images with descriptive labels may increase agreement between radiologists with different experience levels compared to annotation with interpretive labels.

AB - Consistent annotation of data is a prerequisite for the successful training and testing of artificial intelligence-based decision support systems in radiology. This can be achieved by standardizing terminology when annotating diagnostic images. The purpose of this study was to evaluate the annotation consistency among radiologists when using a novel diagnostic labeling scheme for chest X-rays. Six radiologists with experience ranging from one to sixteen years annotated a set of 100 fully anonymized chest X-rays. The blinded radiologists annotated on two separate occasions. Statistical analyses were performed using Randolph’s kappa and PABAK, and the proportions of specific agreement were calculated. Fair-to-excellent agreement was found for all labels among the annotators (Randolph’s kappa, 0.40–0.99). The PABAK ranged from 0.12 to 1 for the two-reader inter-rater agreement and 0.26 to 1 for the intra-rater agreement. Descriptive and broad labels achieved the highest proportion of positive agreement in both the inter- and intra-reader analyses. Annotating findings with specific, interpretive labels was found to be difficult for less experienced radiologists. Annotating images with descriptive labels may increase agreement between radiologists with different experience levels compared to annotation with interpretive labels.

KW - artificial intelligence

KW - chest X-ray

KW - diagnostic scheme

KW - image annotation

KW - inter-rater

KW - intra-rater

KW - ontology

KW - radiologists

UR - http://www.scopus.com/inward/record.url?scp=85144620440&partnerID=8YFLogxK

U2 - 10.3390/diagnostics12123112

DO - 10.3390/diagnostics12123112

M3 - Journal article

C2 - 36553118

AN - SCOPUS:85144620440

VL - 12

JO - Diagnostics

JF - Diagnostics

SN - 2075-4418

IS - 12

M1 - 3112

ER -