Subsets and supermajorities: Optimal hashing-based set similarity search

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Subsets and supermajorities: Optimal hashing-based set similarity search. / Ahle, Thomas D.; Knudsen, Jakob B.T.

Proceedings - 2020 IEEE 61st Annual Symposium on Foundations of Computer Science, FOCS 2020. IEEE, 2020. p. 728-739, 9317929 (Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS, Vol. 2020-November).

Harvard

Ahle, TD & Knudsen, JBT 2020, Subsets and supermajorities: Optimal hashing-based set similarity search. in Proceedings - 2020 IEEE 61st Annual Symposium on Foundations of Computer Science, FOCS 2020, 9317929, IEEE, Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS, vol. 2020-November, pp. 728-739, 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Virtual, Durham, United States, 16/11/2020. https://doi.org/10.1109/FOCS46700.2020.00073

APA

Ahle, T. D., & Knudsen, J. B. T. (2020). Subsets and supermajorities: Optimal hashing-based set similarity search. In Proceedings - 2020 IEEE 61st Annual Symposium on Foundations of Computer Science, FOCS 2020 (pp. 728-739). [9317929] IEEE. Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS Vol. 2020-November https://doi.org/10.1109/FOCS46700.2020.00073

Vancouver

Ahle TD, Knudsen JBT. Subsets and supermajorities: Optimal hashing-based set similarity search. In Proceedings - 2020 IEEE 61st Annual Symposium on Foundations of Computer Science, FOCS 2020. IEEE. 2020. p. 728-739. 9317929. (Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS, Vol. 2020-November). https://doi.org/10.1109/FOCS46700.2020.00073

Author

Ahle, Thomas D.; Knudsen, Jakob B.T. / Subsets and supermajorities: Optimal hashing-based set similarity search. Proceedings - 2020 IEEE 61st Annual Symposium on Foundations of Computer Science, FOCS 2020. IEEE, 2020. pp. 728-739 (Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS, Vol. 2020-November).

Bibtex

@inproceedings{1dd2ab7755f34efdae618655907ab7ae,
title = "Subsets and supermajorities: Optimal hashing-based set similarity search",
abstract = "We formulate and optimally solve a new generalized Set Similarity Search problem, which assumes the sizes of the database and query sets are known in advance. By creating polylog copies of our data structure, we optimally solve any symmetric Approximate Set Similarity Search problem, including approximate versions of Subset Search, Maximum Inner Product Search (MIPS), Jaccard Similarity Search, and Partial Match. Our algorithm can be seen as a natural generalization of previous work on Set as well as Euclidean Similarity Search, but conceptually it differs by optimally exploiting the information present in the sets as well as their complements, and doing so asymmetrically between queries and stored sets. Doing so, we improve upon the best previous work: MinHash [J. Discrete Algorithms 1998], SimHash [STOC 2002], Spherical LSF [SODA 2016, 2017], and Chosen Path [STOC 2017] by as much as a factor $n^{0.14}$ in both time and space; or, in the near-constant time regime, in space, by an arbitrarily large polynomial factor. Turning the geometric concept, based on Boolean supermajority functions, into a practical algorithm requires ideas from branching random walks on $\mathbb{Z}^{2}$, for which we give the first non-asymptotic near-tight analysis. Our lower bounds follow from new hypercontractive arguments, which can be seen as characterizing the exact family of similarity search problems for which supermajorities are optimal. The optimality holds among all hashing-based data structures in the random setting, and by reductions, for 1-cell and 2-cell probe data structures.",
keywords = "n/a",
author = "Ahle, {Thomas D.} and Knudsen, {Jakob B.T.}",
year = "2020",
doi = "10.1109/FOCS46700.2020.00073",
language = "English",
series = "Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS",
pages = "728--739",
booktitle = "Proceedings - 2020 IEEE 61st Annual Symposium on Foundations of Computer Science, FOCS 2020",
publisher = "IEEE",
note = "61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020 ; Conference date: 16-11-2020 Through 19-11-2020",

}

RIS

TY - CONF

T1 - Subsets and supermajorities: Optimal hashing-based set similarity search

T2 - 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020

AU - Ahle, Thomas D.

AU - Knudsen, Jakob B.T.

PY - 2020

Y1 - 2020

N2 - We formulate and optimally solve a new generalized Set Similarity Search problem, which assumes the sizes of the database and query sets are known in advance. By creating polylog copies of our data structure, we optimally solve any symmetric Approximate Set Similarity Search problem, including approximate versions of Subset Search, Maximum Inner Product Search (MIPS), Jaccard Similarity Search, and Partial Match. Our algorithm can be seen as a natural generalization of previous work on Set as well as Euclidean Similarity Search, but conceptually it differs by optimally exploiting the information present in the sets as well as their complements, and doing so asymmetrically between queries and stored sets. Doing so, we improve upon the best previous work: MinHash [J. Discrete Algorithms 1998], SimHash [STOC 2002], Spherical LSF [SODA 2016, 2017], and Chosen Path [STOC 2017] by as much as a factor $n^{0.14}$ in both time and space; or, in the near-constant time regime, in space, by an arbitrarily large polynomial factor. Turning the geometric concept, based on Boolean supermajority functions, into a practical algorithm requires ideas from branching random walks on $\mathbb{Z}^{2}$, for which we give the first non-asymptotic near-tight analysis. Our lower bounds follow from new hypercontractive arguments, which can be seen as characterizing the exact family of similarity search problems for which supermajorities are optimal. The optimality holds among all hashing-based data structures in the random setting, and by reductions, for 1-cell and 2-cell probe data structures.

AB - We formulate and optimally solve a new generalized Set Similarity Search problem, which assumes the sizes of the database and query sets are known in advance. By creating polylog copies of our data structure, we optimally solve any symmetric Approximate Set Similarity Search problem, including approximate versions of Subset Search, Maximum Inner Product Search (MIPS), Jaccard Similarity Search, and Partial Match. Our algorithm can be seen as a natural generalization of previous work on Set as well as Euclidean Similarity Search, but conceptually it differs by optimally exploiting the information present in the sets as well as their complements, and doing so asymmetrically between queries and stored sets. Doing so, we improve upon the best previous work: MinHash [J. Discrete Algorithms 1998], SimHash [STOC 2002], Spherical LSF [SODA 2016, 2017], and Chosen Path [STOC 2017] by as much as a factor $n^{0.14}$ in both time and space; or, in the near-constant time regime, in space, by an arbitrarily large polynomial factor. Turning the geometric concept, based on Boolean supermajority functions, into a practical algorithm requires ideas from branching random walks on $\mathbb{Z}^{2}$, for which we give the first non-asymptotic near-tight analysis. Our lower bounds follow from new hypercontractive arguments, which can be seen as characterizing the exact family of similarity search problems for which supermajorities are optimal. The optimality holds among all hashing-based data structures in the random setting, and by reductions, for 1-cell and 2-cell probe data structures.

KW - n/a

U2 - 10.1109/FOCS46700.2020.00073

DO - 10.1109/FOCS46700.2020.00073

M3 - Article in proceedings

AN - SCOPUS:85100351894

T3 - Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS

SP - 728

EP - 739

BT - Proceedings - 2020 IEEE 61st Annual Symposium on Foundations of Computer Science, FOCS 2020

PB - IEEE

Y2 - 16 November 2020 through 19 November 2020

ER -