MULTIFIN: A Dataset for Multilingual Financial NLP

Research output: Chapter in Book/Report/Conference proceeding; Article in proceedings; Research; peer-review

Documents

  • Fulltext

    Final published version, 395 KB, PDF document

Financial information is generated and distributed across the world, resulting in a vast amount of domain-specific multilingual data. Multilingual models adapted to the financial domain would ease deployment when an organization needs to work with multiple languages on a regular basis. For the development and evaluation of such models, there is a need for multilingual financial language processing datasets. We describe MULTIFIN, a publicly available financial dataset consisting of real-world article headlines covering 15 languages across different writing systems and language families. The dataset has a hierarchical label structure that provides two classification tasks: multi-label and multi-class. We develop our annotation schema based on a real-world application and annotate our dataset using both ‘label by native-speaker’ and ‘translate-then-label’ approaches. The evaluation of several popular multilingual models, e.g., mBERT, XLM-R, and mT5, shows that although decent accuracy can be achieved in high-resource languages, there is substantial room for improvement in low-resource languages.

Original language: English
Title of host publication: EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2023
Publisher: Association for Computational Linguistics (ACL)
Publication date: 2023
Pages: 864-879
ISBN (Electronic): 9781959429470
Publication status: Published - 2023
Event: 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Findings of EACL 2023 - Dubrovnik, Croatia
Duration: 2 May 2023 to 6 May 2023

Conference

Conference: 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Findings of EACL 2023
Country: Croatia
City: Dubrovnik
Period: 02/05/2023 to 06/05/2023
Sponsor: Adobe, Babelscape, Bloomberg Engineering, Duolingo, Liveperson

Bibliographical note

Publisher Copyright:
© 2023 Association for Computational Linguistics.


ID: 355143987