Unsupervised multi-index semantic hashing
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
Unsupervised multi-index semantic hashing. / Hansen, Christian; Hansen, Casper; Simonsen, Jakob Grue; Alstrup, Stephen; Lioma, Christina.
The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021. Association for Computing Machinery, Inc, 2021. p. 2879-2889 (The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021).Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Harvard
APA
Vancouver
Author
Bibtex
}
RIS
TY - GEN
T1 - Unsupervised multi-index semantic hashing
AU - Hansen, Christian
AU - Hansen, Casper
AU - Simonsen, Jakob Grue
AU - Alstrup, Stephen
AU - Lioma, Christina
N1 - Publisher Copyright: © 2021 ACM.
PY - 2021
Y1 - 2021
N2 - Semantic hashing represents documents as compact binary vectors (hash codes) and allows both efficient and effective similarity search in large-scale information retrieval. The state of the art has primarily focused on learning hash codes that improve similarity search effectiveness, while assuming a brute-force linear scan strategy for searching over all the hash codes, even though much faster alternatives exist. One such alternative is multi-index hashing, an approach that constructs a smaller candidate set to search over, which depending on the distribution of the hash codes can lead to sub-linear search time. In this work, we propose Multi-Index Semantic Hashing (MISH), an unsupervised hashing model that learns hash codes that are both effective and highly efficient by being optimized for multi-index hashing. We derive novel training objectives, which enable to learn hash codes that reduce the candidate sets produced by multi-index hashing, while being end-to-end trainable. In fact, our proposed training objectives are model agnostic, i.e., not tied to how the hash codes are generated specifically in MISH, and are straight-forward to include in existing and future semantic hashing models. We experimentally compare MISH to state-of-the-art semantic hashing baselines in the task of document similarity search. We find that even though multi-index hashing also improves the efficiency of the baselines compared to a linear scan, they are still upwards of 33% slower than MISH, while MISH is still able to obtain state-of-the-art effectiveness.
AB - Semantic hashing represents documents as compact binary vectors (hash codes) and allows both efficient and effective similarity search in large-scale information retrieval. The state of the art has primarily focused on learning hash codes that improve similarity search effectiveness, while assuming a brute-force linear scan strategy for searching over all the hash codes, even though much faster alternatives exist. One such alternative is multi-index hashing, an approach that constructs a smaller candidate set to search over, which depending on the distribution of the hash codes can lead to sub-linear search time. In this work, we propose Multi-Index Semantic Hashing (MISH), an unsupervised hashing model that learns hash codes that are both effective and highly efficient by being optimized for multi-index hashing. We derive novel training objectives, which enable to learn hash codes that reduce the candidate sets produced by multi-index hashing, while being end-to-end trainable. In fact, our proposed training objectives are model agnostic, i.e., not tied to how the hash codes are generated specifically in MISH, and are straight-forward to include in existing and future semantic hashing models. We experimentally compare MISH to state-of-the-art semantic hashing baselines in the task of document similarity search. We find that even though multi-index hashing also improves the efficiency of the baselines compared to a linear scan, they are still upwards of 33% slower than MISH, while MISH is still able to obtain state-of-the-art effectiveness.
KW - Multi-index hashing
KW - Semantic hashing
KW - Similarity search
UR - http://www.scopus.com/inward/record.url?scp=85104220631&partnerID=8YFLogxK
U2 - 10.1145/3442381.3450014
DO - 10.1145/3442381.3450014
M3 - Article in proceedings
AN - SCOPUS:85104220631
T3 - The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
SP - 2879
EP - 2889
BT - The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
PB - Association for Computing Machinery, Inc
T2 - 2021 World Wide Web Conference, WWW 2021
Y2 - 19 April 2021 through 23 April 2021
ER -
ID: 300921204