MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer

Publication: Contribution to book/anthology/report › Article in proceedings › Research › peer-reviewed

Standard

MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. / Chalkidis, Ilias; Fergadiotis, Manos; Androutsopoulos, Ion.

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2021. pp. 6974-6996.


Harvard

Chalkidis, I, Fergadiotis, M & Androutsopoulos, I 2021, MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 6974-6996, Conference on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, 07/11/2021. https://doi.org/10.18653/v1/2021.emnlp-main.559

APA

Chalkidis, I., Fergadiotis, M., & Androutsopoulos, I. (2021). MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 6974-6996). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.559

Vancouver

Chalkidis I, Fergadiotis M, Androutsopoulos I. MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. 2021. p. 6974-6996. https://doi.org/10.18653/v1/2021.emnlp-main.559

Author

Chalkidis, Ilias ; Fergadiotis, Manos ; Androutsopoulos, Ion. / MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2021. pp. 6974-6996

Bibtex

@inproceedings{447210f4ba0a4033afc0e0d9e4a429c0,
title = "MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer",
abstract = "We introduce MultiEURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy. We highlight the effect of temporal concept drift and the importance of chronological, instead of random, splits. We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in one language (source) to classify documents in another language (target). We find that fine-tuning a multilingually pretrained model (XLM-RoBERTa, mT5) in a single source language leads to catastrophic forgetting of multilingual knowledge and, consequently, poor zero-shot transfer to other languages. Adaptation strategies, namely partial fine-tuning, adapters, BitFit, LNFit, originally proposed to accelerate fine-tuning for new end-tasks, help retain multilingual knowledge from pretraining, substantially improving zero-shot cross-lingual transfer, but their impact also depends on the pretrained model used and the size of the label set.",
author = "Ilias Chalkidis and Manos Fergadiotis and Ion Androutsopoulos",
year = "2021",
doi = "10.18653/v1/2021.emnlp-main.559",
language = "English",
pages = "6974--6996",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
publisher = "Association for Computational Linguistics",
note = "Conference on Empirical Methods in Natural Language Processing (EMNLP) ; Conference date: 07-11-2021 Through 11-11-2021",
}

RIS

TY - GEN

T1 - MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer

AU - Chalkidis, Ilias

AU - Fergadiotis, Manos

AU - Androutsopoulos, Ion

PY - 2021

Y1 - 2021

N2 - We introduce MultiEURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy. We highlight the effect of temporal concept drift and the importance of chronological, instead of random, splits. We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in one language (source) to classify documents in another language (target). We find that fine-tuning a multilingually pretrained model (XLM-RoBERTa, mT5) in a single source language leads to catastrophic forgetting of multilingual knowledge and, consequently, poor zero-shot transfer to other languages. Adaptation strategies, namely partial fine-tuning, adapters, BitFit, LNFit, originally proposed to accelerate fine-tuning for new end-tasks, help retain multilingual knowledge from pretraining, substantially improving zero-shot cross-lingual transfer, but their impact also depends on the pretrained model used and the size of the label set.

AB - We introduce MultiEURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy. We highlight the effect of temporal concept drift and the importance of chronological, instead of random, splits. We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in one language (source) to classify documents in another language (target). We find that fine-tuning a multilingually pretrained model (XLM-RoBERTa, mT5) in a single source language leads to catastrophic forgetting of multilingual knowledge and, consequently, poor zero-shot transfer to other languages. Adaptation strategies, namely partial fine-tuning, adapters, BitFit, LNFit, originally proposed to accelerate fine-tuning for new end-tasks, help retain multilingual knowledge from pretraining, substantially improving zero-shot cross-lingual transfer, but their impact also depends on the pretrained model used and the size of the label set.

U2 - 10.18653/v1/2021.emnlp-main.559

DO - 10.18653/v1/2021.emnlp-main.559

M3 - Article in proceedings

SP - 6974

EP - 6996

BT - Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

PB - Association for Computational Linguistics

T2 - Conference on Empirical Methods in Natural Language Processing (EMNLP)

Y2 - 7 November 2021 through 11 November 2021

ER -
