Fast hashing with strong concentration bounds

Research output: Chapter in Book/Report/Conference proceedingArticle in proceedingsResearchpeer-review

Standard

Fast hashing with strong concentration bounds. / Aamand, Anders; Knudsen, Jakob Bæk Tejs; Knudsen, Mathias Bæk Tejs; Rasmussen, Peter Michael Reichstein; Thorup, Mikkel.

STOC 2020 - Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing. ed. / Konstantin Makarychev; Yury Makarychev; Madhur Tulsiani; Gautam Kamath; Julia Chuzhoy. Association for Computing Machinery, 2020. p. 1265-1278 (Proceedings of the Annual ACM Symposium on Theory of Computing).

Research output: Chapter in Book/Report/Conference proceedingArticle in proceedingsResearchpeer-review

Harvard

Aamand, A, Knudsen, JBT, Knudsen, MBT, Rasmussen, PMR & Thorup, M 2020, Fast hashing with strong concentration bounds. in K Makarychev, Y Makarychev, M Tulsiani, G Kamath & J Chuzhoy (eds), STOC 2020 - Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing. Association for Computing Machinery, Proceedings of the Annual ACM Symposium on Theory of Computing, pp. 1265-1278, 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, Chicago, United States, 22/06/2020. https://doi.org/10.1145/3357713.3384259

APA

Aamand, A., Knudsen, J. B. T., Knudsen, M. B. T., Rasmussen, P. M. R., & Thorup, M. (2020). Fast hashing with strong concentration bounds. In K. Makarychev, Y. Makarychev, M. Tulsiani, G. Kamath, & J. Chuzhoy (Eds.), STOC 2020 - Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (pp. 1265-1278). Association for Computing Machinery. Proceedings of the Annual ACM Symposium on Theory of Computing https://doi.org/10.1145/3357713.3384259

Vancouver

Aamand A, Knudsen JBT, Knudsen MBT, Rasmussen PMR, Thorup M. Fast hashing with strong concentration bounds. In Makarychev K, Makarychev Y, Tulsiani M, Kamath G, Chuzhoy J, editors, STOC 2020 - Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing. Association for Computing Machinery. 2020. p. 1265-1278. (Proceedings of the Annual ACM Symposium on Theory of Computing). https://doi.org/10.1145/3357713.3384259

Author

Aamand, Anders ; Knudsen, Jakob Bæk Tejs ; Knudsen, Mathias Bæk Tejs ; Rasmussen, Peter Michael Reichstein ; Thorup, Mikkel. / Fast hashing with strong concentration bounds. STOC 2020 - Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing. editor / Konstantin Makarychev ; Yury Makarychev ; Madhur Tulsiani ; Gautam Kamath ; Julia Chuzhoy. Association for Computing Machinery, 2020. pp. 1265-1278 (Proceedings of the Annual ACM Symposium on Theory of Computing).

Bibtex

@inproceedings{6fb53196f1d04037bad029d099578c83,
title = "Fast hashing with strong concentration bounds",
abstract = "Previous work on tabulation hashing by P{\c C}tra{\c s}cu and Thorup from STOC'11 on simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style concentration bounds on hash based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of c=O(1) characters, e.g., a 64-bit key as c=8 characters of 8-bits. The character domain ς should be small enough that character tables of size |ς| fit in fast cache. The schemes then use O(1) tables of this size, so the space of tabulation hashing is O(|ς|). However, the concentration bounds by P{\c C}tra{\c s}cu and Thorup only apply if the expected sums are g‰ |ς|. To see the problem, consider the very simple case where we use tabulation hashing to throw n balls into m bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if n=m, for then the expected value is 1. However, if m=2, as when tossing n unbiased coins, the expected value n/2 is ≫ |ς| for large data sets, e.g., data sets that do not fit in fast cache. To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call tabulation-permutation hashing which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds.",
keywords = "Chernoff bounds, Concentration bounds, Hashing, Sampling, Streaming algorithms",
author = "Anders Aamand and Knudsen, {Jakob B{\ae}k Tejs} and Knudsen, {Mathias B{\ae}k Tejs} and Rasmussen, {Peter Michael Reichstein} and Mikkel Thorup",
year = "2020",
doi = "10.1145/3357713.3384259",
language = "English",
series = "Proceedings of the Annual ACM Symposium on Theory of Computing",
publisher = "Association for Computing Machinery",
pages = "1265--1278",
editor = "Konstantin Makarychev and Yury Makarychev and Madhur Tulsiani and Gautam Kamath and Julia Chuzhoy",
booktitle = "STOC 2020 - Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing",
note = "52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020 ; Conference date: 22-06-2020 Through 26-06-2020",

}

RIS

TY - GEN

T1 - Fast hashing with strong concentration bounds

AU - Aamand, Anders

AU - Knudsen, Jakob Bæk Tejs

AU - Knudsen, Mathias Bæk Tejs

AU - Rasmussen, Peter Michael Reichstein

AU - Thorup, Mikkel

PY - 2020

Y1 - 2020

N2 - Previous work on tabulation hashing by PÇtraşcu and Thorup from STOC'11 on simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style concentration bounds on hash based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of c=O(1) characters, e.g., a 64-bit key as c=8 characters of 8-bits. The character domain ς should be small enough that character tables of size |ς| fit in fast cache. The schemes then use O(1) tables of this size, so the space of tabulation hashing is O(|ς|). However, the concentration bounds by PÇtraşcu and Thorup only apply if the expected sums are g‰ |ς|. To see the problem, consider the very simple case where we use tabulation hashing to throw n balls into m bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if n=m, for then the expected value is 1. However, if m=2, as when tossing n unbiased coins, the expected value n/2 is ≫ |ς| for large data sets, e.g., data sets that do not fit in fast cache. To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call tabulation-permutation hashing which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds.

AB - Previous work on tabulation hashing by PÇtraşcu and Thorup from STOC'11 on simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style concentration bounds on hash based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of c=O(1) characters, e.g., a 64-bit key as c=8 characters of 8-bits. The character domain ς should be small enough that character tables of size |ς| fit in fast cache. The schemes then use O(1) tables of this size, so the space of tabulation hashing is O(|ς|). However, the concentration bounds by PÇtraşcu and Thorup only apply if the expected sums are g‰ |ς|. To see the problem, consider the very simple case where we use tabulation hashing to throw n balls into m bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if n=m, for then the expected value is 1. However, if m=2, as when tossing n unbiased coins, the expected value n/2 is ≫ |ς| for large data sets, e.g., data sets that do not fit in fast cache. To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call tabulation-permutation hashing which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds.

KW - Chernoff bounds

KW - Concentration bounds

KW - Hashing

KW - Sampling

KW - Streaming algorithms

UR - http://www.scopus.com/inward/record.url?scp=85086766216&partnerID=8YFLogxK

U2 - 10.1145/3357713.3384259

DO - 10.1145/3357713.3384259

M3 - Article in proceedings

AN - SCOPUS:85086766216

T3 - Proceedings of the Annual ACM Symposium on Theory of Computing

SP - 1265

EP - 1278

BT - STOC 2020 - Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing

A2 - Makarychev, Konstantin

A2 - Makarychev, Yury

A2 - Tulsiani, Madhur

A2 - Kamath, Gautam

A2 - Chuzhoy, Julia

PB - Association for Computing Machinery

T2 - 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020

Y2 - 22 June 2020 through 26 June 2020

ER -

ID: 258498058