No Repetition: Fast Streaming with Highly Concentrated Hashing

Research output: Working paperResearch

Standard

No Repetition : Fast Streaming with Highly Concentrated Hashing. / Aamand, Anders; Das, Debarati; Kipouridis, Evangelos; Knudsen, Jakob Bæk Tejs; Rasmussen, Peter Michael Reichstein; Thorup, Mikkel.

2021.

Research output: Working paperResearch

Harvard

Aamand, A, Das, D, Kipouridis, E, Knudsen, JBT, Rasmussen, PMR & Thorup, M 2021 'No Repetition: Fast Streaming with Highly Concentrated Hashing'. <https://arxiv.org/abs/2004.01156>

APA

Aamand, A., Das, D., Kipouridis, E., Knudsen, J. B. T., Rasmussen, P. M. R., & Thorup, M. (2021). No Repetition: Fast Streaming with Highly Concentrated Hashing. https://arxiv.org/abs/2004.01156

Vancouver

Aamand A, Das D, Kipouridis E, Knudsen JBT, Rasmussen PMR, Thorup M. No Repetition: Fast Streaming with Highly Concentrated Hashing. 2021.

Author

Aamand, Anders ; Das, Debarati ; Kipouridis, Evangelos ; Knudsen, Jakob Bæk Tejs ; Rasmussen, Peter Michael Reichstein ; Thorup, Mikkel. / No Repetition : Fast Streaming with Highly Concentrated Hashing. 2021.

Bibtex

@techreport{e0f2fb83e01c40e0a6324fc6371c4709,
title = "No Repetition: Fast Streaming with Highly Concentrated Hashing",
abstract = "To get estimators that work within a certain error bound with high probability, a common strategy is to design one that works with constant probability, and then boost the probability using independent repetitions. Important examples of this approach are small space algorithms for estimating the number of distinct elements in a stream, or estimating the set similarity between large sets. Using standard strongly universal hashing to process each element, we get a sketch based estimator where the probability of a too large error is, say, 1/4. By performing $r$ independent repetitions and taking the median of the estimators, the error probability falls exponentially in $r$. However, running $r$independent experiments increases the processing time by a factor $r$.Here we make the point that if we have a hash function with strong concentration bounds, then we get the same high probability bounds without any need for repetitions. Instead of $r$ independent sketches, we have a single sketch that is $r$ times bigger, so the total space is the same. However, we only apply a single hash function, so we save a factor $r$ in time, and the overall algorithms just get simpler.Fast practical hash functions with strong concentration bounds were recently proposed by Aamand {\em et al.} (to appear in {\em STOC 2020}). Using their hashing schemes, the algorithms thus become very fast and practical, suitable for online processing of high volume data streams.",
author = "Anders Aamand and Debarati Das and Evangelos Kipouridis and Knudsen, {Jakob B{\ae}k Tejs} and Rasmussen, {Peter Michael Reichstein} and Mikkel Thorup",
year = "2021",
language = "English",
type = "WorkingPaper",

}

RIS

TY - UNPB

T1 - No Repetition

T2 - Fast Streaming with Highly Concentrated Hashing

AU - Aamand, Anders

AU - Das, Debarati

AU - Kipouridis, Evangelos

AU - Knudsen, Jakob Bæk Tejs

AU - Rasmussen, Peter Michael Reichstein

AU - Thorup, Mikkel

PY - 2021

Y1 - 2021

N2 - To get estimators that work within a certain error bound with high probability, a common strategy is to design one that works with constant probability, and then boost the probability using independent repetitions. Important examples of this approach are small space algorithms for estimating the number of distinct elements in a stream, or estimating the set similarity between large sets. Using standard strongly universal hashing to process each element, we get a sketch based estimator where the probability of a too large error is, say, 1/4. By performing $r$ independent repetitions and taking the median of the estimators, the error probability falls exponentially in $r$. However, running $r$independent experiments increases the processing time by a factor $r$.Here we make the point that if we have a hash function with strong concentration bounds, then we get the same high probability bounds without any need for repetitions. Instead of $r$ independent sketches, we have a single sketch that is $r$ times bigger, so the total space is the same. However, we only apply a single hash function, so we save a factor $r$ in time, and the overall algorithms just get simpler.Fast practical hash functions with strong concentration bounds were recently proposed by Aamand {\em et al.} (to appear in {\em STOC 2020}). Using their hashing schemes, the algorithms thus become very fast and practical, suitable for online processing of high volume data streams.

AB - To get estimators that work within a certain error bound with high probability, a common strategy is to design one that works with constant probability, and then boost the probability using independent repetitions. Important examples of this approach are small space algorithms for estimating the number of distinct elements in a stream, or estimating the set similarity between large sets. Using standard strongly universal hashing to process each element, we get a sketch based estimator where the probability of a too large error is, say, 1/4. By performing $r$ independent repetitions and taking the median of the estimators, the error probability falls exponentially in $r$. However, running $r$independent experiments increases the processing time by a factor $r$.Here we make the point that if we have a hash function with strong concentration bounds, then we get the same high probability bounds without any need for repetitions. Instead of $r$ independent sketches, we have a single sketch that is $r$ times bigger, so the total space is the same. However, we only apply a single hash function, so we save a factor $r$ in time, and the overall algorithms just get simpler.Fast practical hash functions with strong concentration bounds were recently proposed by Aamand {\em et al.} (to appear in {\em STOC 2020}). Using their hashing schemes, the algorithms thus become very fast and practical, suitable for online processing of high volume data streams.

M3 - Working paper

BT - No Repetition

ER -

ID: 257869626