Legal-Tech Open Diaries: Lesson learned on how to develop and deploy light-weight models in the era of humongous Language Models

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

Documents

  • Full text

    Publisher's published version, 2.49 MB, PDF document

In the era of billion-parameter Language Models (LMs), start-ups have to follow trends and adapt their technology accordingly. Nonetheless, open challenges remain, since the development and deployment of large models demands high computational resources and has economic consequences. In this work, we follow the steps of the R&D group of a modern legal-tech start-up and present important insights on model development and deployment. We start from ground zero by pre-training multiple domain-specific multilingual LMs, which are a better fit for contractual and regulatory text than the available alternatives (XLM-R). We present benchmark results for these models on a half-public, half-private legal benchmark comprising 5 downstream tasks, showing the impact of increased model size. Lastly, we examine the impact of a full-scale model compression pipeline that includes (a) parameter pruning, (b) knowledge distillation, and (c) quantization: the resulting models are much more efficient, largely without sacrificing performance.
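
The abstract names the three compression stages without implementation detail. The following is a minimal sketch of how such a pipeline is commonly wired up, assuming PyTorch; the function names, pruning ratio, distillation temperature, and loss weighting are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch of the three compression stages named in the abstract.
import torch
import torch.nn.functional as F
import torch.nn.utils.prune as prune


def prune_linear_layers(model: torch.nn.Module, amount: float = 0.3) -> torch.nn.Module:
    """(a) Parameter pruning: zero the smallest-magnitude weights in every
    Linear layer, then bake the pruning mask into the weights."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # make the pruning permanent
    return model


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """(b) Knowledge distillation: blend cross-entropy on gold labels with a
    KL term pulling the student toward the teacher's softened distribution."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale by T^2 so gradient magnitudes stay comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


def quantize_linear_layers(model: torch.nn.Module) -> torch.nn.Module:
    """(c) Quantization: post-training dynamic int8 quantization of the
    Linear layers, which hold most of a Transformer's parameters."""
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
```

In a typical setup of this kind, the pruned model serves as the student during distillation, and quantization is applied last, after fine-tuning, so each stage compounds the efficiency gains of the previous one.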

Original language: English
Title: NLLP 2022 - Natural Legal Language Processing Workshop 2022, Proceedings of the Workshop
Number of pages: 23
Publisher: Association for Computational Linguistics (ACL)
Publication date: 2022
Pages: 88-110
ISBN (electronic): 9781959429180
Status: Published - 2022
Event: 4th Natural Legal Language Processing Workshop, NLLP 2022, co-located with the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 - Abu Dhabi, United Arab Emirates
Duration: 8 Dec 2022 → …

Conference

Conference: 4th Natural Legal Language Processing Workshop, NLLP 2022, co-located with the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
Country: United Arab Emirates
City: Abu Dhabi
Period: 08/12/2022 → …
Sponsors: Bloomberg, European Research Council (ERC), LBox

Bibliographic note

Funding Information:
This research has been co-funded by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH – CREATE – INNOVATE (Τ2ΕΔΚ-03849). This work is also partly funded by the Innovation Fund Denmark (IFD) under File No. 0175-00011A.

Funding Information:
This project was also supported by the TensorFlow Research Cloud (TFRC) program, which provided free instances of Google Cloud TPU v3-8 that were used to pre-train all C-XLM language models. Cognitiv+ provided the compute (16× Quadro RTX 6000 24GB) to fine-tune all models.

Publisher Copyright:
© 2022 Association for Computational Linguistics.

ID: 358726422