Unsupervised multi-index semantic hashing

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Unsupervised multi-index semantic hashing. / Hansen, Christian; Hansen, Casper; Simonsen, Jakob Grue; Alstrup, Stephen; Lioma, Christina.

The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021. Association for Computing Machinery, Inc, 2021. p. 2879-2889 (The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021).

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Harvard

Hansen, C, Hansen, C, Simonsen, JG, Alstrup, S & Lioma, C 2021, Unsupervised multi-index semantic hashing. in The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021. Association for Computing Machinery, Inc, The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021, pp. 2879-2889, 2021 World Wide Web Conference, WWW 2021, Ljubljana, Slovenia, 19/04/2021. https://doi.org/10.1145/3442381.3450014

APA

Hansen, C., Hansen, C., Simonsen, J. G., Alstrup, S., & Lioma, C. (2021). Unsupervised multi-index semantic hashing. In The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021 (pp. 2879-2889). Association for Computing Machinery, Inc. The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021 https://doi.org/10.1145/3442381.3450014

Vancouver

Hansen C, Hansen C, Simonsen JG, Alstrup S, Lioma C. Unsupervised multi-index semantic hashing. In The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021. Association for Computing Machinery, Inc. 2021. p. 2879-2889. (The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021). https://doi.org/10.1145/3442381.3450014

Author

Hansen, Christian ; Hansen, Casper ; Simonsen, Jakob Grue ; Alstrup, Stephen ; Lioma, Christina. / Unsupervised multi-index semantic hashing. The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021. Association for Computing Machinery, Inc, 2021. pp. 2879-2889 (The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021).

Bibtex

@inproceedings{b629579b4c4943669ec78ea390f34b71,

title = "Unsupervised multi-index semantic hashing",

abstract = "Semantic hashing represents documents as compact binary vectors (hash codes) and allows both efficient and effective similarity search in large-scale information retrieval. The state of the art has primarily focused on learning hash codes that improve similarity search effectiveness, while assuming a brute-force linear scan strategy for searching over all the hash codes, even though much faster alternatives exist. One such alternative is multi-index hashing, an approach that constructs a smaller candidate set to search over, which depending on the distribution of the hash codes can lead to sub-linear search time. In this work, we propose Multi-Index Semantic Hashing (MISH), an unsupervised hashing model that learns hash codes that are both effective and highly efficient by being optimized for multi-index hashing. We derive novel training objectives, which enable to learn hash codes that reduce the candidate sets produced by multi-index hashing, while being end-to-end trainable. In fact, our proposed training objectives are model agnostic, i.e., not tied to how the hash codes are generated specifically in MISH, and are straight-forward to include in existing and future semantic hashing models. We experimentally compare MISH to state-of-the-art semantic hashing baselines in the task of document similarity search. We find that even though multi-index hashing also improves the efficiency of the baselines compared to a linear scan, they are still upwards of 33% slower than MISH, while MISH is still able to obtain state-of-the-art effectiveness. ",

keywords = "Multi-index hashing, Semantic hashing, Similarity search",

author = "Christian Hansen and Casper Hansen and Simonsen, {Jakob Grue} and Stephen Alstrup and Christina Lioma",

note = "Publisher Copyright: {\^A}{\textcopyright} 2021 ACM.; 2021 World Wide Web Conference, WWW 2021 ; Conference date: 19-04-2021 Through 23-04-2021",

year = "2021",

doi = "10.1145/3442381.3450014",

language = "English",

series = "The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021",

pages = "2879--2889",

booktitle = "The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021",

publisher = "Association for Computing Machinery, Inc",

}

RIS

TY - GEN

T1 - Unsupervised multi-index semantic hashing

AU - Hansen, Christian

AU - Hansen, Casper

AU - Simonsen, Jakob Grue

AU - Alstrup, Stephen

AU - Lioma, Christina

PY - 2021

Y1 - 2021

N2 - Semantic hashing represents documents as compact binary vectors (hash codes) and allows both efficient and effective similarity search in large-scale information retrieval. The state of the art has primarily focused on learning hash codes that improve similarity search effectiveness, while assuming a brute-force linear scan strategy for searching over all the hash codes, even though much faster alternatives exist. One such alternative is multi-index hashing, an approach that constructs a smaller candidate set to search over, which depending on the distribution of the hash codes can lead to sub-linear search time. In this work, we propose Multi-Index Semantic Hashing (MISH), an unsupervised hashing model that learns hash codes that are both effective and highly efficient by being optimized for multi-index hashing. We derive novel training objectives, which enable to learn hash codes that reduce the candidate sets produced by multi-index hashing, while being end-to-end trainable. In fact, our proposed training objectives are model agnostic, i.e., not tied to how the hash codes are generated specifically in MISH, and are straight-forward to include in existing and future semantic hashing models. We experimentally compare MISH to state-of-the-art semantic hashing baselines in the task of document similarity search. We find that even though multi-index hashing also improves the efficiency of the baselines compared to a linear scan, they are still upwards of 33% slower than MISH, while MISH is still able to obtain state-of-the-art effectiveness.

AB - Semantic hashing represents documents as compact binary vectors (hash codes) and allows both efficient and effective similarity search in large-scale information retrieval. The state of the art has primarily focused on learning hash codes that improve similarity search effectiveness, while assuming a brute-force linear scan strategy for searching over all the hash codes, even though much faster alternatives exist. One such alternative is multi-index hashing, an approach that constructs a smaller candidate set to search over, which depending on the distribution of the hash codes can lead to sub-linear search time. In this work, we propose Multi-Index Semantic Hashing (MISH), an unsupervised hashing model that learns hash codes that are both effective and highly efficient by being optimized for multi-index hashing. We derive novel training objectives, which enable to learn hash codes that reduce the candidate sets produced by multi-index hashing, while being end-to-end trainable. In fact, our proposed training objectives are model agnostic, i.e., not tied to how the hash codes are generated specifically in MISH, and are straight-forward to include in existing and future semantic hashing models. We experimentally compare MISH to state-of-the-art semantic hashing baselines in the task of document similarity search. We find that even though multi-index hashing also improves the efficiency of the baselines compared to a linear scan, they are still upwards of 33% slower than MISH, while MISH is still able to obtain state-of-the-art effectiveness.

KW - Multi-index hashing

KW - Semantic hashing

KW - Similarity search

UR - http://www.scopus.com/inward/record.url?scp=85104220631&partnerID=8YFLogxK

U2 - 10.1145/3442381.3450014

DO - 10.1145/3442381.3450014

M3 - Article in proceedings

AN - SCOPUS:85104220631

T3 - The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021

SP - 2879

EP - 2889

BT - The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021

PB - Association for Computing Machinery, Inc

T2 - 2021 World Wide Web Conference, WWW 2021

Y2 - 19 April 2021 through 23 April 2021

ER -

ID: 300921204

Datalogisk Institut