Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › Peer-reviewed

  • Simon Hellemann Flachs
  • Ophélie Lacroix
  • Helen Yannakoudakis
  • Marek Rei
  • Anders Søgaard
Evaluation of grammatical error correction (GEC) systems has primarily focused on essays written by non-native learners of English, which, however, covers only part of the full spectrum of GEC applications. We aim to broaden the target domain of GEC and release CWEB, a new benchmark for GEC consisting of website text generated by English speakers of varying levels of proficiency. Website data is a common and important domain that contains far fewer grammatical errors than learner essays, which we show presents a challenge to state-of-the-art GEC systems. We demonstrate that a factor behind this is the inability of systems to rely on a strong internal language model in low error density domains. We hope this work will facilitate the development of open-domain GEC models that generalize to different topics and genres.
Original language: English
Title: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Publisher: Association for Computational Linguistics
Publication date: 2020
Pages: 8467–8478
Status: Published - 2020
Event: The 2020 Conference on Empirical Methods in Natural Language Processing - online
Duration: 16 Nov 2020 to 20 Nov 2020
http://2020.emnlp.org

Conference

Conference: The 2020 Conference on Empirical Methods in Natural Language Processing
Location: online
Period: 16/11/2020 to 20/11/2020

ID: 258376622