MULTIFIN: A Dataset for Multilingual Financial NLP
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
MULTIFIN : A Dataset for Multilingual Financial NLP. / Jørgensen, Rasmus Kær; Brandt, Oliver; Hartmann, Mareike; Dai, Xiang; Igel, Christian; Elliott, Desmond.
EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2023. Association for Computational Linguistics (ACL), 2023. p. 864-879.Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Harvard
APA
Vancouver
Author
Bibtex
}
RIS
TY - GEN
T1 - MULTIFIN
T2 - 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Findings of EACL 2023
AU - Jørgensen, Rasmus Kær
AU - Brandt, Oliver
AU - Hartmann, Mareike
AU - Dai, Xiang
AU - Igel, Christian
AU - Elliott, Desmond
N1 - Publisher Copyright: © 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - Financial information is generated and distributed across the world, resulting in a vast amount of domain-specific multilingual data. Multilingual models adapted to the financial domain would ease deployment when an organization needs to work with multiple languages on a regular basis. For the development and evaluation of such models, there is a need for multilingual financial language processing datasets. We describe MULTIFIN– a publicly available financial dataset consisting of real-world article headlines covering 15 languages across different writing systems and language families. The dataset consists of hierarchical label structure providing two classification tasks: multi-label and multi-class. We develop our annotation schema based on a real-world application and annotate our dataset using both ‘label by native-speaker’ and ‘translate-then-label’ approaches. The evaluation of several popular multilingual models, e.g., mBERT, XLM-R, and mT5, show that although decent accuracy can be achieved in high-resource languages, there is substantial room for improvement in low-resource languages.
AB - Financial information is generated and distributed across the world, resulting in a vast amount of domain-specific multilingual data. Multilingual models adapted to the financial domain would ease deployment when an organization needs to work with multiple languages on a regular basis. For the development and evaluation of such models, there is a need for multilingual financial language processing datasets. We describe MULTIFIN– a publicly available financial dataset consisting of real-world article headlines covering 15 languages across different writing systems and language families. The dataset consists of hierarchical label structure providing two classification tasks: multi-label and multi-class. We develop our annotation schema based on a real-world application and annotate our dataset using both ‘label by native-speaker’ and ‘translate-then-label’ approaches. The evaluation of several popular multilingual models, e.g., mBERT, XLM-R, and mT5, show that although decent accuracy can be achieved in high-resource languages, there is substantial room for improvement in low-resource languages.
UR - http://www.scopus.com/inward/record.url?scp=85159859314&partnerID=8YFLogxK
M3 - Article in proceedings
AN - SCOPUS:85159859314
SP - 864
EP - 879
BT - EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2023
PB - Association for Computational Linguistics (ACL)
Y2 - 2 May 2023 through 6 May 2023
ER -
ID: 355143987