Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings. / Ostendorff, Malte; Rethmeier, Nils; Augenstein, Isabelle; Gipp, Bela; Rehm, Georg.

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (ACL), 2022. p. 11670–11688.

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Harvard

Ostendorff, M, Rethmeier, N, Augenstein, I, Gipp, B & Rehm, G 2022, Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings. in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (ACL), pp. 11670–11688, 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, 07/12/2022. <https://aclanthology.org/2022.emnlp-main.802/>

APA

Ostendorff, M., Rethmeier, N., Augenstein, I., Gipp, B., & Rehm, G. (2022). Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (pp. 11670–11688). Association for Computational Linguistics (ACL). https://aclanthology.org/2022.emnlp-main.802/

Vancouver

Ostendorff M, Rethmeier N, Augenstein I, Gipp B, Rehm G. Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (ACL). 2022. p. 11670–11688

Author

Ostendorff, Malte ; Rethmeier, Nils ; Augenstein, Isabelle ; Gipp, Bela ; Rehm, Georg. / Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (ACL), 2022. pp. 11670–11688

Bibtex

@inproceedings{4032c0e23b6d4933a9efb2419b88fd58,
title = "Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings",
abstract = "Learning scientific document representations can be substantially improved through contrastive learning objectives, where the challenge lies in creating positive and negative training samples that encode the desired similarity semantics. Prior work relies on discrete citation relations to generate contrast samples. However, discrete citations enforce a hard cut-off to similarity. This is counter-intuitive to similarity-based learning and ignores that scientific papers can be very similar despite lacking a direct citation - a core problem of finding related research. Instead, we use controlled nearest neighbor sampling over citation graph embeddings for contrastive learning. This control allows us to learn continuous similarity, to sample hard-to-learn negatives and positives, and also to avoid collisions between negative and positive samples by controlling the sampling margin between them. The resulting method SciNCL outperforms the state-of-the-art on the SciDocs benchmark. Furthermore, we demonstrate that it can train (or tune) language models sample-efficiently and that it can be combined with recent training-efficient methods. Perhaps surprisingly, even training a general-domain language model this way outperforms baselines pretrained in-domain.",
author = "Malte Ostendorff and Nils Rethmeier and Isabelle Augenstein and Bela Gipp and Georg Rehm",
year = "2022",
language = "English",
pages = "11670–11688",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
publisher = "Association for Computational Linguistics (ACL)",
address = "United States",
note = "2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 ; Conference date: 07-12-2022 Through 11-12-2022",

}

RIS

TY - GEN

T1 - Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings

AU - Ostendorff, Malte

AU - Rethmeier, Nils

AU - Augenstein, Isabelle

AU - Gipp, Bela

AU - Rehm, Georg

PY - 2022

Y1 - 2022

N2 - Learning scientific document representations can be substantially improved through contrastive learning objectives, where the challenge lies in creating positive and negative training samples that encode the desired similarity semantics. Prior work relies on discrete citation relations to generate contrast samples. However, discrete citations enforce a hard cut-off to similarity. This is counter-intuitive to similarity-based learning and ignores that scientific papers can be very similar despite lacking a direct citation - a core problem of finding related research. Instead, we use controlled nearest neighbor sampling over citation graph embeddings for contrastive learning. This control allows us to learn continuous similarity, to sample hard-to-learn negatives and positives, and also to avoid collisions between negative and positive samples by controlling the sampling margin between them. The resulting method SciNCL outperforms the state-of-the-art on the SciDocs benchmark. Furthermore, we demonstrate that it can train (or tune) language models sample-efficiently and that it can be combined with recent training-efficient methods. Perhaps surprisingly, even training a general-domain language model this way outperforms baselines pretrained in-domain.

AB - Learning scientific document representations can be substantially improved through contrastive learning objectives, where the challenge lies in creating positive and negative training samples that encode the desired similarity semantics. Prior work relies on discrete citation relations to generate contrast samples. However, discrete citations enforce a hard cut-off to similarity. This is counter-intuitive to similarity-based learning and ignores that scientific papers can be very similar despite lacking a direct citation - a core problem of finding related research. Instead, we use controlled nearest neighbor sampling over citation graph embeddings for contrastive learning. This control allows us to learn continuous similarity, to sample hard-to-learn negatives and positives, and also to avoid collisions between negative and positive samples by controlling the sampling margin between them. The resulting method SciNCL outperforms the state-of-the-art on the SciDocs benchmark. Furthermore, we demonstrate that it can train (or tune) language models sample-efficiently and that it can be combined with recent training-efficient methods. Perhaps surprisingly, even training a general-domain language model this way outperforms baselines pretrained in-domain.

M3 - Article in proceedings

SP - 11670

EP - 11688

BT - Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

PB - Association for Computational Linguistics (ACL)

T2 - 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022

Y2 - 7 December 2022 through 11 December 2022

ER -

ID: 341060672
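
The abstract describes the key idea: draw positives from the near neighborhood of a paper in citation-graph embedding space, draw hard negatives from a farther band, and keep a sampling margin between the two bands so positives and negatives cannot collide. The following Python sketch is an illustration of that sampling scheme only, not the authors' SciNCL implementation; the random embeddings, band ranks, and function names are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for pretrained citation-graph embeddings
# (one row per paper), L2-normalized so dot product = cosine similarity.
embeddings = rng.normal(size=(1000, 64))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def sample_triplet(query_idx, pos_band=(1, 5), neg_band=(20, 25)):
    """Return (query, positive, negative) paper indices.

    pos_band and neg_band are k-NN rank intervals over the citation
    embeddings; the gap between them (ranks 5..20 here) plays the role
    of the sampling margin mentioned in the abstract. The concrete band
    boundaries are illustrative, not the paper's tuned values.
    """
    sims = embeddings @ embeddings[query_idx]          # cosine similarities
    ranked = np.argsort(-sims)                         # nearest first; rank 0 is the query itself
    pos = rng.choice(ranked[pos_band[0]:pos_band[1]])  # positive from the near neighborhood
    neg = rng.choice(ranked[neg_band[0]:neg_band[1]])  # hard negative beyond the margin
    return query_idx, pos, neg

q, p, n = sample_triplet(42)
print(q, p, n)
```

Triplets sampled this way would then feed a standard contrastive (e.g., triplet-loss) objective on a language model encoder; the continuous embedding distances, rather than discrete citation links, decide what counts as similar.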