PHD: Pixel-Based Language Modeling of Historical Documents

Research output: Chapter in Book/Report/Conference proceeding; Article in proceedings; Research; peer-review

Standard

PHD: Pixel-Based Language Modeling of Historical Documents. / Borenstein, Nadav; Rust, Phillip; Elliott, Desmond; Augenstein, Isabelle.

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (ACL), 2023. p. 87–107.

Harvard

Borenstein, N, Rust, P, Elliott, D & Augenstein, I 2023, PHD: Pixel-Based Language Modeling of Historical Documents. in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (ACL), pp. 87–107, 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 06/12/2023. https://doi.org/10.18653/v1/2023.emnlp-main.7

APA

Borenstein, N., Rust, P., Elliott, D., & Augenstein, I. (2023). PHD: Pixel-Based Language Modeling of Historical Documents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 87–107). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.emnlp-main.7

Vancouver

Borenstein N, Rust P, Elliott D, Augenstein I. PHD: Pixel-Based Language Modeling of Historical Documents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (ACL). 2023. p. 87–107. https://doi.org/10.18653/v1/2023.emnlp-main.7

Author

Borenstein, Nadav; Rust, Phillip; Elliott, Desmond; Augenstein, Isabelle. / PHD: Pixel-Based Language Modeling of Historical Documents. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (ACL), 2023. pp. 87–107.

Bibtex

@inproceedings{54ba7e486a174b1ca0863f94df4a4f4d,
title = "PHD: Pixel-Based Language Modeling of Historical Documents",
abstract = " The digitisation of historical documents has provided historians with unprecedented research opportunities. Yet, the conventional approach to analysing historical documents involves converting them from images to text using OCR, a process that overlooks the potential benefits of treating them as images and introduces high levels of noise. To bridge this gap, we take advantage of recent advancements in pixel-based language models trained to reconstruct masked patches of pixels instead of predicting token distributions. Due to the scarcity of real historical scans, we propose a novel method for generating synthetic scans to resemble real historical documents. We then pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period. Through our experiments, we demonstrate that PHD exhibits high proficiency in reconstructing masked image patches and provide evidence of our model's noteworthy language understanding capabilities. Notably, we successfully apply our model to a historical QA task, highlighting its usefulness in this domain. ",
keywords = "cs.CL",
author = "Nadav Borenstein and Phillip Rust and Desmond Elliott and Isabelle Augenstein",
year = "2023",
doi = "10.18653/v1/2023.emnlp-main.7",
language = "English",
pages = "87--107",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
publisher = "Association for Computational Linguistics (ACL)",
address = "United States",
note = "2023 Conference on Empirical Methods in Natural Language Processing ; Conference date: 06-12-2023 Through 10-12-2023",

}

RIS

TY - GEN

T1 - PHD: Pixel-Based Language Modeling of Historical Documents

T2 - 2023 Conference on Empirical Methods in Natural Language Processing

AU - Borenstein, Nadav

AU - Rust, Phillip

AU - Elliott, Desmond

AU - Augenstein, Isabelle

PY - 2023

Y1 - 2023

N2 - The digitisation of historical documents has provided historians with unprecedented research opportunities. Yet, the conventional approach to analysing historical documents involves converting them from images to text using OCR, a process that overlooks the potential benefits of treating them as images and introduces high levels of noise. To bridge this gap, we take advantage of recent advancements in pixel-based language models trained to reconstruct masked patches of pixels instead of predicting token distributions. Due to the scarcity of real historical scans, we propose a novel method for generating synthetic scans to resemble real historical documents. We then pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period. Through our experiments, we demonstrate that PHD exhibits high proficiency in reconstructing masked image patches and provide evidence of our model's noteworthy language understanding capabilities. Notably, we successfully apply our model to a historical QA task, highlighting its usefulness in this domain.

KW - cs.CL

U2 - 10.18653/v1/2023.emnlp-main.7

DO - 10.18653/v1/2023.emnlp-main.7

M3 - Article in proceedings

SP - 87

EP - 107

BT - Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

PB - Association for Computational Linguistics (ACL)

Y2 - 6 December 2023 through 10 December 2023

ER -

ID: 379722635