Faithfulness Tests for Natural Language Explanations
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
Faithfulness Tests for Natural Language Explanations. / Atanasova, Pepa; Camburu, Oana Maria; Lioma, Christina; Lukasiewicz, Thomas; Simonsen, Jakob Grue; Augenstein, Isabelle.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics (ACL), 2023. p. 283-294.
RIS
TY - GEN
T1 - Faithfulness Tests for Natural Language Explanations
AU - Atanasova, Pepa
AU - Camburu, Oana Maria
AU - Lioma, Christina
AU - Lukasiewicz, Thomas
AU - Simonsen, Jakob Grue
AU - Augenstein, Isabelle
N1 - Publisher Copyright: © 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - Explanations of neural models aim to reveal a model’s decision-making process for its predictions. However, recent work shows that current methods giving explanations such as saliency maps or counterfactuals can be misleading, as they are prone to present reasons that are unfaithful to the model’s inner workings. This work explores the challenging question of evaluating the faithfulness of natural language explanations (NLEs). To this end, we present two tests. First, we propose a counterfactual input editor for inserting reasons that lead to counterfactual predictions but are not reflected by the NLEs. Second, we reconstruct inputs from the reasons stated in the generated NLEs and check how often they lead to the same predictions. Our tests can evaluate emerging NLE models, proving a fundamental tool in the development of faithful NLEs.
AB - Explanations of neural models aim to reveal a model’s decision-making process for its predictions. However, recent work shows that current methods giving explanations such as saliency maps or counterfactuals can be misleading, as they are prone to present reasons that are unfaithful to the model’s inner workings. This work explores the challenging question of evaluating the faithfulness of natural language explanations (NLEs). To this end, we present two tests. First, we propose a counterfactual input editor for inserting reasons that lead to counterfactual predictions but are not reflected by the NLEs. Second, we reconstruct inputs from the reasons stated in the generated NLEs and check how often they lead to the same predictions. Our tests can evaluate emerging NLE models, proving a fundamental tool in the development of faithful NLEs.
UR - http://www.scopus.com/inward/record.url?scp=85164122520&partnerID=8YFLogxK
U2 - 10.18653/v1/2023.acl-short.25
DO - 10.18653/v1/2023.acl-short.25
M3 - Article in proceedings
AN - SCOPUS:85164122520
SP - 283
EP - 294
BT - Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
PB - Association for Computational Linguistics (ACL)
T2 - 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023
Y2 - 9 July 2023 through 14 July 2023
ER -
ID: 369552736
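The abstract above describes the two faithfulness tests only at a high level. Below is a minimal, hypothetical Python sketch of the logic it outlines, not the authors' implementation: the first function inserts a candidate word that flips the model's prediction and checks whether the new NLE mentions that word (the paper proposes a trained counterfactual input editor for this; plain word insertion is used here purely for illustration), and the second rebuilds an input from the reasons stated in the NLE and checks whether the prediction stays the same. All component names (`predict_with_nle`, `reconstruct`, the toy model) are placeholders invented for this sketch.

```python
"""Illustrative sketch of the two faithfulness tests described in the abstract.
All model, editor, and reconstruction components are hypothetical placeholders."""
from typing import Callable, Iterable, List, Tuple

# A model maps an input text to (predicted label, natural language explanation).
PredictWithNLE = Callable[[str], Tuple[str, str]]


def counterfactual_test(
    inputs: Iterable[str],
    predict_with_nle: PredictWithNLE,
    candidate_insertions: List[str],
) -> float:
    """Test 1 (sketch): insert a word that flips the prediction and check whether
    the NLE for the flipped prediction mentions it. Returns the fraction of found
    counterfactual edits not reflected in the NLE (higher = less faithful)."""
    unreflected, flipped = 0, 0
    for text in inputs:
        label, _ = predict_with_nle(text)
        for word in candidate_insertions:
            edited = f"{text} {word}"  # naive insertion; the paper uses a learned editor
            new_label, new_nle = predict_with_nle(edited)
            if new_label != label:  # found an edit that changes the prediction
                flipped += 1
                if word.lower() not in new_nle.lower():  # reason missing from the NLE
                    unreflected += 1
                break
    return unreflected / max(flipped, 1)


def reconstruction_test(
    inputs: Iterable[str],
    predict_with_nle: PredictWithNLE,
    reconstruct: Callable[[str], str],
) -> float:
    """Test 2 (sketch): rebuild an input from the reasons stated in the NLE,
    re-predict, and report how often the prediction is unchanged."""
    agree, total = 0, 0
    for text in inputs:
        label, nle = predict_with_nle(text)
        rebuilt_label, _ = predict_with_nle(reconstruct(nle))
        agree += int(rebuilt_label == label)
        total += 1
    return agree / max(total, 1)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; real use would plug in an NLE model.
    def toy_model(text: str) -> Tuple[str, str]:
        positive = "good" in text
        label = "positive" if positive else "negative"
        nle = "The review calls the film good." if positive else "The review never says it is good."
        return label, nle

    def toy_reconstruct(nle: str) -> str:
        return "a good film" if "never" not in nle else "a film"

    data = ["a good film", "a dull film"]
    print("counterfactual unfaithfulness:", counterfactual_test(data, toy_model, ["good", "terrible"]))
    print("reconstruction consistency:", reconstruction_test(data, toy_model, toy_reconstruct))
```

These scores only illustrate the shape of the tests: the counterfactual test reports how often an inserted reason that changes the prediction goes unmentioned in the explanation, and the reconstruction test reports how often the stated reasons alone lead to the same prediction. For the actual editor, reconstruction procedure, and evaluated NLE models, see the paper itself (DOI 10.18653/v1/2023.acl-short.25).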