MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer

Publication: Contribution to book/anthology/report › Article in proceedings › Research › peer-reviewed

Documents

Full text
Publisher's published version, 1.81 MB, PDF document

Ilias Chalkidis
Manos Fergadiotis
Ion Androutsopoulos

We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated into 23 languages, annotated with multiple labels from the EUROVOC taxonomy. We highlight the effect of temporal concept drift and the importance of chronological, instead of random, splits. We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in one language (source) to classify documents in another language (target). We find that fine-tuning a multilingually pretrained model (XLM-ROBERTA, MT5) in a single source language leads to catastrophic forgetting of multilingual knowledge and, consequently, poor zero-shot transfer to other languages. Adaptation strategies, namely partial fine-tuning, adapters, BITFIT, and LNFIT, originally proposed to accelerate fine-tuning for new end-tasks, help retain multilingual knowledge from pretraining, substantially improving zero-shot cross-lingual transfer, but their impact also depends on the pretrained model used and the size of the label set.
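
The point about chronological rather than random splits can be illustrated with a small sketch. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `date` and `labels` fields, the toy documents, and the 80/10/10 fractions are all hypothetical and chosen only for the example.

```python
from datetime import date


def chronological_split(docs, train_frac=0.8, dev_frac=0.1):
    """Split documents by publication date.

    The oldest documents go to training and the newest to testing, so the
    test set reflects label usage that the model has not seen (temporal
    concept drift). The fractions are illustrative assumptions.
    """
    ordered = sorted(docs, key=lambda d: d["date"])
    n_train = int(len(ordered) * train_frac)
    n_dev = int(len(ordered) * dev_frac)
    train = ordered[:n_train]
    dev = ordered[n_train:n_train + n_dev]
    test = ordered[n_train + n_dev:]
    return train, dev, test


# Hypothetical toy corpus: one document per year, each with a made-up label.
docs = [
    {"id": i, "date": date(2000 + i, 1, 1), "labels": [f"concept_{i % 3}"]}
    for i in range(10)
]

train, dev, test = chronological_split(docs)
print(len(train), len(dev), len(test))
```

A random split would instead shuffle `docs` before slicing, which mixes old and new documents across the three sets and hides the effect of drift.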

Original language	English
Title	Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Publisher	Association for Computational Linguistics
Publication date	2021
Pages	6974-6996
DOI	https://doi.org/10.18653/v1/2021.emnlp-main.559
Status	Published - 2021
Event	Conference on Empirical Methods in Natural Language Processing (EMNLP) - Punta Cana Duration: 7 Nov 2021 → 11 Nov 2021

Conference

Conference	Conference on Empirical Methods in Natural Language Processing (EMNLP)
City	Punta Cana
Period	07/11/2021 → 11/11/2021

ID: 326679675

Department of Computer Science