Unsupervised Evaluation for Question Answering with Transformers

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed


It is challenging to automatically evaluate the answer of a QA model at inference time. Although many models provide confidence scores, and simple heuristics can go a long way towards indicating answer correctness, such measures are heavily dataset-dependent and are unlikely to generalise. In this work, we begin by investigating the hidden representations of questions, answers, and contexts in transformer-based QA architectures. We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer span is correct. Our method does not require any labelled data and outperforms strong heuristic baselines across 2 datasets and 7 domains. We are able to predict whether or not a model's answer is correct with 91.37% accuracy on SQuAD and 80.7% accuracy on SubjQA. We expect that this method will have broad applications, e.g., in semi-automatic development of QA datasets.
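The abstract describes the approach only at a high level. As a rough illustration of what inspecting answer-span hidden representations can look like in practice, the sketch below extracts final-layer hidden states for a predicted span from an off-the-shelf HuggingFace extractive QA model. The model checkpoint is an arbitrary public SQuAD model, and the cosine-similarity score at the end is a hypothetical stand-in, not the paper's actual unsupervised criterion.

```python
# Illustrative sketch only: the paper inspects hidden representations of
# predicted answer spans, but the exact scoring rule is not given in the
# abstract, so the similarity check below is a stand-in.
import torch
import torch.nn.functional as F
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = "distilbert-base-cased-distilled-squad"  # hypothetical choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name).eval()

question = "Where was EMNLP 2020 held?"
context = "Because of the pandemic, EMNLP 2020 took place online in November 2020."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Predicted answer span from the start/end logits.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0, start : end + 1])

# Final-layer hidden states for the single example: (seq_len, hidden_dim).
hidden = outputs.hidden_states[-1][0]

# Context-token indices (sequence_ids: None=special, 0=question, 1=context).
ctx_idx = [i for i, s in enumerate(inputs.sequence_ids(0)) if s == 1]

# Compare the mean answer-span representation to the mean context
# representation; a low similarity would flag a likely wrong answer.
span_vec = hidden[start : end + 1].mean(dim=0)
ctx_vec = hidden[ctx_idx].mean(dim=0)
score = F.cosine_similarity(span_vec, ctx_vec, dim=0).item()
print(f"answer={answer!r}  span/context similarity={score:.3f}")
```

In an unsupervised setting, a score like this would be thresholded to accept or reject a predicted span; the paper instead derives its decision from patterns it observes in the answer representations themselves.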
Original language: English
Title: Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
Publisher: Association for Computational Linguistics
Publication date: 2020
Pages: 83-90
DOI
Status: Published - 2020
Event: The 2020 Conference on Empirical Methods in Natural Language Processing - online
Duration: 16 Nov 2020 - 20 Nov 2020
http://2020.emnlp.org

Conference

Conference: The 2020 Conference on Empirical Methods in Natural Language Processing
Location: online
Period: 16/11/2020 - 20/11/2020
Internet address: http://2020.emnlp.org


ID: 254996871