E-NER - An Annotated Named Entity Recognition Corpus of Legal Text

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

Documents

Fulltext
Publisher's published version, 1.36 MB, PDF document

Ting Wai Terence Au
Vasileios Lampos
Ingemar Johansson Cox

Identifying named entities, such as persons, locations or organizations, in documents can highlight key information to readers. Training Named Entity Recognition (NER) models requires an annotated data set, and producing one can be a time-consuming, labour-intensive task. Nevertheless, there are publicly available NER data sets for general English. Recently there has been interest in developing NER for legal text. However, prior work and the experimental results reported here indicate that performance degrades significantly when NER methods trained on a general English data set are applied to legal text. We describe a publicly available legal NER data set, called E-NER, based on legal company filings available from the US Securities and Exchange Commission's EDGAR data set. Training a number of different NER algorithms on the general English CoNLL-2003 corpus but testing on our test collection confirmed significant degradations in accuracy, as measured by the F1-score, of between 29.4% and 60.4%, compared to training and testing on the E-NER collection.
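
For readers unfamiliar with the metric, the following is a minimal sketch of how an entity-level F1-score and the drop between two training conditions might be computed. It is not the authors' evaluation code. The span representation, the label strings, and the function names `entity_f1` and `f1_drop` are assumptions made for illustration. The abstract does not say whether the reported degradations are absolute or relative, so the sketch returns both.

```python
from typing import Set, Tuple

# Assumed representation: (start_token, end_token, label). Labels are illustrative only.
Entity = Tuple[int, int, str]


def entity_f1(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """Entity-level F1: a prediction counts only if span and label both match exactly."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def f1_drop(in_domain_f1: float, cross_domain_f1: float) -> Tuple[float, float]:
    """Return (absolute drop, relative drop) between two F1-scores."""
    absolute = in_domain_f1 - cross_domain_f1
    relative = absolute / in_domain_f1 if in_domain_f1 > 0 else 0.0
    return absolute, relative


# Hypothetical toy annotations, not taken from E-NER.
gold = {(0, 1, "PERSON"), (5, 6, "ORGANIZATION"), (9, 9, "LOCATION")}
predicted = {(0, 1, "PERSON"), (5, 6, "LOCATION")}
score = entity_f1(gold, predicted)
```

Here only one of the two predictions matches a gold entity in both span and label, so precision is 1/2 and recall is 1/3.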

Original language	English
Title	NLLP 2022 - Natural Legal Language Processing Workshop 2022, Proceedings of the Workshop
Number of pages	10
Publisher	Association for Computational Linguistics (ACL)
Publication date	2022
Pages	246-255
ISBN (Electronic)	9781959429180
Status	Published - 2022
Event	4th Natural Legal Language Processing Workshop, NLLP 2022, co-located with the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 - Abu Dhabi, United Arab Emirates Duration: 8 Dec 2022 → …

Conference

Conference	4th Natural Legal Language Processing Workshop, NLLP 2022, co-located with the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
Country	United Arab Emirates
City	Abu Dhabi
Period	8 Dec 2022 → …
Sponsor	Bloomberg, European Research Council (ERC), LBox

Bibliographical note

Funding Information:
T.W.T.A. and I.J.C. would like to thank Clifford Chance LLP for its financial support and for providing guidance on the requirements of the legal community.

Publisher Copyright:
© 2022 Association for Computational Linguistics.

Department of Computer Science