
2kenize: Tying Subword Sequences for Chinese Script Conversion

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Documents

2kenize
Final published version, 633 KB, PDF document

Pranav A
Augenstein, Isabelle

Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches perform poorly because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert the pretraining dataset for topic classification. An error analysis reveals that our method’s particular strengths are in dealing with code mixing and named entities.
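The one-to-many problem described in the abstract can be illustrated with a small toy sketch. This is not the 2kenize model: it works on single characters rather than subwords, and it replaces the two language models with a hand-written bigram count table. The mapping table `S2T`, the counts in `BIGRAM_COUNTS`, and the function names are all assumptions made up for illustration; only the character correspondences themselves (for example 发 mapping to either 發 or 髮) are real.

```python
from itertools import product

# Toy one-to-many table (illustrative only, not the paper's resource).
# The simplified character 发 corresponds to both 發 ("emit") and 髮 ("hair").
S2T = {
    "发": ["發", "髮"],
    "头": ["頭"],
    "现": ["現"],
    "理": ["理"],
}

# Hypothetical bigram counts standing in for a traditional-script
# language model; the numbers are invented for this example.
BIGRAM_COUNTS = {
    ("頭", "髮"): 50,
    ("發", "現"): 80,
    ("理", "髮"): 30,
}


def candidates(text):
    """Enumerate every traditional rendering allowed by the table."""
    options = [S2T.get(ch, [ch]) for ch in text]
    return ["".join(combo) for combo in product(*options)]


def score(text):
    """Sum the toy bigram counts over adjacent character pairs."""
    return sum(BIGRAM_COUNTS.get(pair, 0) for pair in zip(text, text[1:]))


def convert(text):
    """Pick the highest-scoring candidate (first one wins on ties)."""
    return max(candidates(text), key=score)


if __name__ == "__main__":
    for word in ["头发", "发现", "理发"]:
        print(word, "->", convert(word))
```

A plain character-to-character table would have to pick one fixed target for 发 and would get either 头发 or 发现 wrong; scoring whole candidates in context is what lets the ambiguous character resolve differently in each word. Exhaustive enumeration grows exponentially with the number of ambiguous characters, so it is only suitable for short toy inputs like these.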

Original language	English
Title of host publication	Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Publisher	Association for Computational Linguistics
Publication date	2020
Pages	7257-7272
DOIs	https://doi.org/10.18653/v1/2020.acl-main.648
Publication status	Published - 2020
Event	58th Annual Meeting of the Association for Computational Linguistics, Online, 5 Jul 2020 → 10 Jul 2020

Conference

Conference	58th Annual Meeting of the Association for Computational Linguistics
City	Online
Period	05/07/2020 → 10/07/2020


ID: 255044965

Department of Computer Science
