Generating Label Cohesive and Well-Formed Adversarial Claims

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Generating Label Cohesive and Well-Formed Adversarial Claims. / Atanasova, Pepa; Wright, Dustin; Augenstein, Isabelle.

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2020. p. 3168-3177.

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Harvard

Atanasova, P, Wright, D & Augenstein, I 2020, Generating Label Cohesive and Well-Formed Adversarial Claims. in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pp. 3168-3177, The 2020 Conference on Empirical Methods in Natural Language Processing, 16/11/2020. https://doi.org/10.18653/v1/2020.emnlp-main.256

APA

Atanasova, P., Wright, D., & Augenstein, I. (2020). Generating Label Cohesive and Well-Formed Adversarial Claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 3168-3177). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.256

Vancouver

Atanasova P, Wright D, Augenstein I. Generating Label Cohesive and Well-Formed Adversarial Claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. 2020. p. 3168-3177 https://doi.org/10.18653/v1/2020.emnlp-main.256

Author

Atanasova, Pepa ; Wright, Dustin ; Augenstein, Isabelle. / Generating Label Cohesive and Well-Formed Adversarial Claims. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2020. pp. 3168-3177

Bibtex

@inproceedings{0940221cef68423fb7d9ae05a3f87d83,
title = "Generating Label Cohesive and Well-Formed Adversarial Claims",
abstract = "Adversarial attacks reveal important vulnerabilities and flaws of trained models. One potent type of attack is universal adversarial triggers, which are individual n-grams that, when appended to instances of a class under attack, can trick a model into predicting a target class. However, for inference tasks such as fact checking, these triggers often inadvertently invert the meaning of instances they are inserted in. In addition, such attacks produce semantically nonsensical inputs, as they simply concatenate triggers to existing samples. Here, we investigate how to generate adversarial attacks against fact checking systems that preserve the ground truth meaning and are semantically valid. We extend the HotFlip attack algorithm used for universal trigger generation by jointly minimizing the target class loss of a fact checking model and the entailment class loss of an auxiliary natural language inference model. We then train a conditional language model to generate semantically valid statements, which include the found universal triggers. We find that the generated attacks maintain the directionality and semantic validity of the claim better than previous work.",
author = "Pepa Atanasova and Dustin Wright and Isabelle Augenstein",
year = "2020",
doi = "10.18653/v1/2020.emnlp-main.256",
language = "English",
pages = "3168--3177",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
publisher = "Association for Computational Linguistics",
note = "The 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020 ; Conference date: 16-11-2020 Through 20-11-2020",
url = "http://2020.emnlp.org",
}
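
The BibTeX abstract above summarizes the attack in two steps. The first extends HotFlip's universal trigger search: candidate trigger tokens are scored against a joint objective that adds the entailment-class loss of an auxiliary NLI model to the target-class loss of the fact-checking model, so a trigger flips the predicted label without flipping the claim's meaning. Below is a minimal PyTorch sketch of that idea; the function names, the weighting factor lam, and the assumption that both models expose plain logits are illustrative choices, not the authors' released code.

import torch
import torch.nn.functional as F

def joint_trigger_loss(fc_logits, nli_logits, target_class, entail_class, lam=1.0):
    # Push the fact-checking (FC) model toward the attacker's target class
    # while keeping the NLI model's entailment prediction intact.
    fc_loss = F.cross_entropy(fc_logits, target_class)
    nli_loss = F.cross_entropy(nli_logits, entail_class)
    return fc_loss + lam * nli_loss

def hotflip_candidates(trigger_grad, embedding_matrix, k=10):
    # First-order HotFlip approximation: rank every vocabulary token by the
    # dot product of its embedding with the negative loss gradient at the
    # current trigger position, and keep the top-k replacement candidates.
    scores = embedding_matrix @ (-trigger_grad)  # shape: (vocab_size,)
    return torch.topk(scores, k).indices

Since the gradient is taken under the joint loss, a replacement token only scores well if it serves both objectives at once: fooling the fact checker and preserving entailment.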

RIS

TY - GEN

T1 - Generating Label Cohesive and Well-Formed Adversarial Claims

AU - Atanasova, Pepa

AU - Wright, Dustin

AU - Augenstein, Isabelle

PY - 2020

Y1 - 2020

N2 - Adversarial attacks reveal important vulnerabilities and flaws of trained models. One potent type of attack is universal adversarial triggers, which are individual n-grams that, when appended to instances of a class under attack, can trick a model into predicting a target class. However, for inference tasks such as fact checking, these triggers often inadvertently invert the meaning of instances they are inserted in. In addition, such attacks produce semantically nonsensical inputs, as they simply concatenate triggers to existing samples. Here, we investigate how to generate adversarial attacks against fact checking systems that preserve the ground truth meaning and are semantically valid. We extend the HotFlip attack algorithm used for universal trigger generation by jointly minimizing the target class loss of a fact checking model and the entailment class loss of an auxiliary natural language inference model. We then train a conditional language model to generate semantically valid statements, which include the found universal triggers. We find that the generated attacks maintain the directionality and semantic validity of the claim better than previous work.

AB - Adversarial attacks reveal important vulnerabilities and flaws of trained models. One potent type of attack is universal adversarial triggers, which are individual n-grams that, when appended to instances of a class under attack, can trick a model into predicting a target class. However, for inference tasks such as fact checking, these triggers often inadvertently invert the meaning of instances they are inserted in. In addition, such attacks produce semantically nonsensical inputs, as they simply concatenate triggers to existing samples. Here, we investigate how to generate adversarial attacks against fact checking systems that preserve the ground truth meaning and are semantically valid. We extend the HotFlip attack algorithm used for universal trigger generation by jointly minimizing the target class loss of a fact checking model and the entailment class loss of an auxiliary natural language inference model. We then train a conditional language model to generate semantically valid statements, which include the found universal triggers. We find that the generated attacks maintain the directionality and semantic validity of the claim better than previous work.

U2 - 10.18653/v1/2020.emnlp-main.256

DO - 10.18653/v1/2020.emnlp-main.256

M3 - Article in proceedings

SP - 3168

EP - 3177

BT - Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

PB - Association for Computational Linguistics

T2 - The 2020 Conference on Empirical Methods in Natural Language Processing

Y2 - 16 November 2020 through 20 November 2020

ER -
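
The second step in the abstract, training a conditional language model so the trigger appears inside a fluent, semantically valid claim rather than being concatenated to one, could be prototyped along these lines with Hugging Face transformers. The base gpt2 checkpoint and the "Evidence: ... Claim: ..." prompt format are stand-ins assumed for illustration, not the paper's exact fine-tuned model or conditioning scheme.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Stand-in for a checkpoint fine-tuned on evidence/claim pairs.
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical conditioning format: evidence text plus a found trigger token,
# from which the model continues with a claim that embeds the trigger.
prompt = "Evidence: The Eiffel Tower is located in Paris. Claim: never"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Fine-tuning on evidence–claim pairs, as the abstract describes, is what would make such continuations both fluent and label-consistent; the base checkpoint here only illustrates the generation interface.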

ID: 254988517