Computational Grammatical Error Correction: Bridging the Gap from Academia to Industry

Research output: Book/Report > Ph.D. thesis > Research

Standard

Computational Grammatical Error Correction: Bridging the Gap from Academia to Industry. / Flachs, Simon Hellemann.

Department of Computer Science, Faculty of Science, University of Copenhagen, 2021. 111 p.

Harvard

Flachs, SH 2021, Computational Grammatical Error Correction: Bridging the Gap from Academia to Industry. Department of Computer Science, Faculty of Science, University of Copenhagen.

APA

Flachs, S. H. (2021). Computational Grammatical Error Correction: Bridging the Gap from Academia to Industry. Department of Computer Science, Faculty of Science, University of Copenhagen.

Vancouver

Flachs SH. Computational Grammatical Error Correction: Bridging the Gap from Academia to Industry. Department of Computer Science, Faculty of Science, University of Copenhagen, 2021. 111 p.

Author

Flachs, Simon Hellemann. / Computational Grammatical Error Correction: Bridging the Gap from Academia to Industry. Department of Computer Science, Faculty of Science, University of Copenhagen, 2021. 111 p.

Bibtex

@phdthesis{39f34356af1049419799f3f204e320d0,
title = "Computational Grammatical Error Correction: Bridging the Gap from Academia to Industry",
abstract = "Grammatical Error Correction (GEC) is the research field concerned with computational methods for correcting grammatical errors in text. With the vast amounts of content currently being produced online, these methods hold the promise of improving human communication by enabling clear and error-free prose. While GEC is a thoroughly studied field in academia, industrial adoption has been limited. Three specific obstacles are particularly holding back wide-spread industrial adoption: current academic GEC systems 1) depend on a lot of expensive data for training the systems; 2) are mostly evaluated on text written by English language learners, leaving the systems{\textquoteright} performance beyond this domain unclear; and 3) are mainly developed for the English language. This thesis presents research into tackling these obstacles, in order to bridge the gap between academic research and industrial use. In the first part of the thesis, we investigate two avenues for building low-resource GEC systems. Firstly, we show that leveraging artificially generated training data improves systems{\textquoteright} ability to detect subject-verb-agreement errors, particularly improving robustness to challenging linguistic phenomena. Secondly, we show that language models trained by self-supervision can be used for creating viable GEC systems that do not rely on annotated training data. In the second part of the thesis, we look into GEC systems{\textquoteright} ability to generalize beyond the English language learner domain – we release a new GEC benchmark, CWEB, consisting of website text annotated for correctness, and show that current GEC systems do not generalize well to this domain. In the final part, we focus on GEC for non-English languages and investigate strategies for leveraging available sources of noisy data. We show that GEC systems pre-trained on noisy data can be fine-tuned effectively on only small amounts of expert-annotated data, which opens up for creating inexpensive GEC systems in new languages.",
author = "Flachs, {Simon Hellemann}",
year = "2021",
language = "English",
publisher = "Department of Computer Science, Faculty of Science, University of Copenhagen",

}

RIS

TY - BOOK

T1 - Computational Grammatical Error Correction

T2 - Bridging the Gap from Academia to Industry

AU - Flachs, Simon Hellemann

PY - 2021

Y1 - 2021

N2 - Grammatical Error Correction (GEC) is the research field concerned with computational methods for correcting grammatical errors in text. With the vast amounts of content currently being produced online, these methods hold the promise of improving human communication by enabling clear and error-free prose. While GEC is a thoroughly studied field in academia, industrial adoption has been limited. Three specific obstacles are particularly holding back wide-spread industrial adoption: current academic GEC systems 1) depend on a lot of expensive data for training the systems; 2) are mostly evaluated on text written by English language learners, leaving the systems’ performance beyond this domain unclear; and 3) are mainly developed for the English language. This thesis presents research into tackling these obstacles, in order to bridge the gap between academic research and industrial use. In the first part of the thesis, we investigate two avenues for building low-resource GEC systems. Firstly, we show that leveraging artificially generated training data improves systems’ ability to detect subject-verb-agreement errors, particularly improving robustness to challenging linguistic phenomena. Secondly, we show that language models trained by self-supervision can be used for creating viable GEC systems that do not rely on annotated training data. In the second part of the thesis, we look into GEC systems’ ability to generalize beyond the English language learner domain – we release a new GEC benchmark, CWEB, consisting of website text annotated for correctness, and show that current GEC systems do not generalize well to this domain. In the final part, we focus on GEC for non-English languages and investigate strategies for leveraging available sources of noisy data. We show that GEC systems pre-trained on noisy data can be fine-tuned effectively on only small amounts of expert-annotated data, which opens up for creating inexpensive GEC systems in new languages.

AB - Grammatical Error Correction (GEC) is the research field concerned with computational methods for correcting grammatical errors in text. With the vast amounts of content currently being produced online, these methods hold the promise of improving human communication by enabling clear and error-free prose. While GEC is a thoroughly studied field in academia, industrial adoption has been limited. Three specific obstacles are particularly holding back wide-spread industrial adoption: current academic GEC systems 1) depend on a lot of expensive data for training the systems; 2) are mostly evaluated on text written by English language learners, leaving the systems’ performance beyond this domain unclear; and 3) are mainly developed for the English language. This thesis presents research into tackling these obstacles, in order to bridge the gap between academic research and industrial use. In the first part of the thesis, we investigate two avenues for building low-resource GEC systems. Firstly, we show that leveraging artificially generated training data improves systems’ ability to detect subject-verb-agreement errors, particularly improving robustness to challenging linguistic phenomena. Secondly, we show that language models trained by self-supervision can be used for creating viable GEC systems that do not rely on annotated training data. In the second part of the thesis, we look into GEC systems’ ability to generalize beyond the English language learner domain – we release a new GEC benchmark, CWEB, consisting of website text annotated for correctness, and show that current GEC systems do not generalize well to this domain. In the final part, we focus on GEC for non-English languages and investigate strategies for leveraging available sources of noisy data. We show that GEC systems pre-trained on noisy data can be fine-tuned effectively on only small amounts of expert-annotated data, which opens up for creating inexpensive GEC systems in new languages.

M3 - Ph.D. thesis

BT - Computational Grammatical Error Correction

PB - Department of Computer Science, Faculty of Science, University of Copenhagen

ER -

ID: 273016748