2kenize: Tying Subword Sequences for Chinese Script Conversion

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › Peer-reviewed

Documents

  • 2kenize

    Publisher's published version, 633 KB, PDF document

Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches perform poorly because they do not take into account that a single simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert a pretraining dataset for topic classification. An error analysis reveals that our method’s particular strengths are in dealing with code mixing and named entities.
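The one-to-many ambiguity described in the abstract can be illustrated with a toy example. The sketch below is not the 2kenize model: it skips subword segmentation and replaces the two language models with a hand-written bigram table. The names CANDIDATES, BIGRAM_SCORE, and convert are hypothetical and exist only for this illustration. It shows how the simplified character 发 has to be resolved to 發 or 髮 depending on its neighbours.

```python
from itertools import product

# Toy one-to-many table (simplified -> traditional candidates).
# Illustrative only; a real table covers thousands of characters.
CANDIDATES = {
    "发": ["發", "髮"],  # 發 (to emit, to develop) vs. 髮 (hair)
    "头": ["頭"],
    "展": ["展"],
}

# Hypothetical bigram scores standing in for a traditional-Chinese
# language model; these numbers are made up for the example.
BIGRAM_SCORE = {
    ("頭", "髮"): 2.0,
    ("發", "展"): 2.0,
}


def convert(simplified: str) -> str:
    """Return the candidate sequence with the highest toy bigram score."""
    # Characters without an entry are passed through unchanged.
    options = [CANDIDATES.get(ch, [ch]) for ch in simplified]

    def score(seq):
        # Sum the scores of all adjacent character pairs in the sequence.
        return sum(BIGRAM_SCORE.get(pair, 0.0) for pair in zip(seq, seq[1:]))

    # Exhaustive search over all combinations; fine for short inputs only.
    best = max(product(*options), key=score)
    return "".join(best)


if __name__ == "__main__":
    print(convert("头发"))  # hair
    print(convert("发展"))  # development
```

Exhaustive enumeration grows exponentially with the number of ambiguous characters, so a practical system would need a more efficient search and a properly trained language model rather than a lookup table.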
Original language: English
Title: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Publisher: Association for Computational Linguistics
Publication date: 2020
Pages: 7257-7272
DOI
Status: Published - 2020
Event: 58th Annual Meeting of the Association for Computational Linguistics - Online
Duration: 5 Jul 2020 to 10 Jul 2020

Conference

Conference: 58th Annual Meeting of the Association for Computational Linguistics
City: Online
Period: 05/07/2020 to 10/07/2020


ID: 255044965