No Repetition: Fast and Reliable Sampling with Highly Concentrated Hashing

Research output: Contribution to journalConference articleResearchpeer-review

Standard

No Repetition : Fast and Reliable Sampling with Highly Concentrated Hashing. / Aamand, Anders; Das, Debarati; Kipouridis, Evangelos; Knudsen, Jakob B.T.; Rasmussen, Peter M.R.; Thorup, Mikkel.

In: Proceedings of the VLDB Endowment, Vol. 15, No. 13, 2022, p. 3989-4001.

Research output: Contribution to journalConference articleResearchpeer-review

Harvard

Aamand, A, Das, D, Kipouridis, E, Knudsen, JBT, Rasmussen, PMR & Thorup, M 2022, 'No Repetition: Fast and Reliable Sampling with Highly Concentrated Hashing', Proceedings of the VLDB Endowment, vol. 15, no. 13, pp. 3989-4001. https://doi.org/10.14778/3565838.3565851

APA

Aamand, A., Das, D., Kipouridis, E., Knudsen, J. B. T., Rasmussen, P. M. R., & Thorup, M. (2022). No Repetition: Fast and Reliable Sampling with Highly Concentrated Hashing. Proceedings of the VLDB Endowment, 15(13), 3989-4001. https://doi.org/10.14778/3565838.3565851

Vancouver

Aamand A, Das D, Kipouridis E, Knudsen JBT, Rasmussen PMR, Thorup M. No Repetition: Fast and Reliable Sampling with Highly Concentrated Hashing. Proceedings of the VLDB Endowment. 2022;15(13):3989-4001. https://doi.org/10.14778/3565838.3565851

Author

Aamand, Anders ; Das, Debarati ; Kipouridis, Evangelos ; Knudsen, Jakob B.T. ; Rasmussen, Peter M.R. ; Thorup, Mikkel. / No Repetition : Fast and Reliable Sampling with Highly Concentrated Hashing. In: Proceedings of the VLDB Endowment. 2022 ; Vol. 15, No. 13. pp. 3989-4001.

Bibtex

@inproceedings{ed8ffab8abb748b4b8495f5cde23830d,
title = "No Repetition: Fast and Reliable Sampling with Highly Concentrated Hashing",
abstract = "Stochastic sample-based estimators are among the most fundamental and universally applied tools in statistics. Such estimators are particularly important when processing huge amounts of data, where we need to be able to answer a wide range of statistical queries reliably, yet cannot afford to store the data in its full length. In many applications we need the sampling to be coordinated which is typically attained using hashing. In previous work, a common strategy to obtain reliable sample-based estimators that work within certain error bounds with high probability has been to design one that works with constant probability, and then boost the probability by taking the median over r independent repetitions. Aamand et al. (STOC{\textquoteright}20) recently proposed a fast and practical hashing scheme with strong concentration bounds, Tabulation-1Permutation, the first of its kind. In this paper, we demonstrate that using such a hash family for the sampling, we achieve the same high probability bounds without any need for repetitions. Using the same space, this saves a factor r in time, and simplifies the overall algorithms. We validate our approach experimentally on both real and synthetic data. We compare Tabulation-1Permutation with other hash functions such as strongly universal hash functions and various other hash functions such as MurmurHash3 and BLAKE3, both with and without resorting to repetitions. We see that if we want reliability in terms of small error probabilities, then Tabulation-1Permutation is significantly faster.",
author = "Anders Aamand and Debarati Das and Evangelos Kipouridis and Knudsen, {Jakob B.T.} and Rasmussen, {Peter M.R.} and Mikkel Thorup",
note = "Publisher Copyright: {\textcopyright} 2022, VLDB Endowment. All rights reserved.; 48th International Conference on Very Large Data Bases, VLDB 2022 ; Conference date: 05-09-2022 Through 09-09-2022",
year = "2022",
doi = "10.14778/3565838.3565851",
language = "English",
volume = "15",
pages = "3989--4001",
journal = "Proceedings of the VLDB Endowment",
issn = "2150-8097",
publisher = "VLDB Endowment",
number = "13",

}

RIS

TY - GEN

T1 - No Repetition

T2 - 48th International Conference on Very Large Data Bases, VLDB 2022

AU - Aamand, Anders

AU - Das, Debarati

AU - Kipouridis, Evangelos

AU - Knudsen, Jakob B.T.

AU - Rasmussen, Peter M.R.

AU - Thorup, Mikkel

N1 - Publisher Copyright: © 2022, VLDB Endowment. All rights reserved.

PY - 2022

Y1 - 2022

N2 - Stochastic sample-based estimators are among the most fundamental and universally applied tools in statistics. Such estimators are particularly important when processing huge amounts of data, where we need to be able to answer a wide range of statistical queries reliably, yet cannot afford to store the data in its full length. In many applications we need the sampling to be coordinated which is typically attained using hashing. In previous work, a common strategy to obtain reliable sample-based estimators that work within certain error bounds with high probability has been to design one that works with constant probability, and then boost the probability by taking the median over r independent repetitions. Aamand et al. (STOC’20) recently proposed a fast and practical hashing scheme with strong concentration bounds, Tabulation-1Permutation, the first of its kind. In this paper, we demonstrate that using such a hash family for the sampling, we achieve the same high probability bounds without any need for repetitions. Using the same space, this saves a factor r in time, and simplifies the overall algorithms. We validate our approach experimentally on both real and synthetic data. We compare Tabulation-1Permutation with other hash functions such as strongly universal hash functions and various other hash functions such as MurmurHash3 and BLAKE3, both with and without resorting to repetitions. We see that if we want reliability in terms of small error probabilities, then Tabulation-1Permutation is significantly faster.

AB - Stochastic sample-based estimators are among the most fundamental and universally applied tools in statistics. Such estimators are particularly important when processing huge amounts of data, where we need to be able to answer a wide range of statistical queries reliably, yet cannot afford to store the data in its full length. In many applications we need the sampling to be coordinated which is typically attained using hashing. In previous work, a common strategy to obtain reliable sample-based estimators that work within certain error bounds with high probability has been to design one that works with constant probability, and then boost the probability by taking the median over r independent repetitions. Aamand et al. (STOC’20) recently proposed a fast and practical hashing scheme with strong concentration bounds, Tabulation-1Permutation, the first of its kind. In this paper, we demonstrate that using such a hash family for the sampling, we achieve the same high probability bounds without any need for repetitions. Using the same space, this saves a factor r in time, and simplifies the overall algorithms. We validate our approach experimentally on both real and synthetic data. We compare Tabulation-1Permutation with other hash functions such as strongly universal hash functions and various other hash functions such as MurmurHash3 and BLAKE3, both with and without resorting to repetitions. We see that if we want reliability in terms of small error probabilities, then Tabulation-1Permutation is significantly faster.

UR - http://www.scopus.com/inward/record.url?scp=85147795979&partnerID=8YFLogxK

U2 - 10.14778/3565838.3565851

DO - 10.14778/3565838.3565851

M3 - Conference article

AN - SCOPUS:85147795979

VL - 15

SP - 3989

EP - 4001

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

SN - 2150-8097

IS - 13

Y2 - 5 September 2022 through 9 September 2022

ER -

ID: 340710363