2kenize: Tying Subword Sequences for Chinese Script Conversion

Research output: Chapter in Book/Report/Conference proceeding; Article in proceedings; Research; peer-review

Standard

2kenize: Tying Subword Sequences for Chinese Script Conversion. / A, Pranav; Augenstein, Isabelle.

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2020. p. 7257-7272.

Harvard

A, P & Augenstein, I 2020, 2kenize: Tying Subword Sequences for Chinese Script Conversion. in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 7257-7272, 58th Annual Meeting of the Association for Computational Linguistics, Online, 05/07/2020. https://doi.org/10.18653/v1/2020.acl-main.648

APA

A, P., & Augenstein, I. (2020). 2kenize: Tying Subword Sequences for Chinese Script Conversion. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7257-7272). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.648

Vancouver

A P, Augenstein I. 2kenize: Tying Subword Sequences for Chinese Script Conversion. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. 2020. p. 7257-7272. https://doi.org/10.18653/v1/2020.acl-main.648

Author

A, Pranav; Augenstein, Isabelle. / 2kenize: Tying Subword Sequences for Chinese Script Conversion. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2020. pp. 7257-7272

Bibtex

@inproceedings{13820c87a0094059ad82af34e4681468,
title = "2kenize: Tying Subword Sequences for Chinese Script Conversion",
abstract = "Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches have insufficient performance because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, as well as a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert pretraining dataset for topic classification. An error analysis reveals that our method{\textquoteright}s particular strengths are in dealing with code mixing and named entities.",
author = "Pranav A and Isabelle Augenstein",
year = "2020",
doi = "10.18653/v1/2020.acl-main.648",
language = "English",
pages = "7257--7272",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
publisher = "Association for Computational Linguistics",
note = "58th Annual Meeting of the Association for Computational Linguistics ; Conference date: 05-07-2020 Through 10-07-2020",

}

RIS

TY - GEN

T1 - 2kenize: Tying Subword Sequences for Chinese Script Conversion

T2 - 58th Annual Meeting of the Association for Computational Linguistics

AU - A, Pranav

AU - Augenstein, Isabelle

PY - 2020

Y1 - 2020

N2 - Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches have insufficient performance because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, as well as a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert pretraining dataset for topic classification. An error analysis reveals that our method’s particular strengths are in dealing with code mixing and named entities.

AB - Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches have insufficient performance because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, as well as a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert pretraining dataset for topic classification. An error analysis reveals that our method’s particular strengths are in dealing with code mixing and named entities.

U2 - 10.18653/v1/2020.acl-main.648

DO - 10.18653/v1/2020.acl-main.648

M3 - Article in proceedings

SP - 7257

EP - 7272

BT - Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

PB - Association for Computational Linguistics

Y2 - 5 July 2020 through 10 July 2020

ER -
