Simple Hashing-Based Algorithms with Strong Theoretical Guarantees

Department of Computer Science

Research output: Book/Report › Ph.D. thesis › Research

Standard

Simple Hashing-Based Algorithms with Strong Theoretical Guarantees. / Aamand, Anders.

Department of Computer Science, Faculty of Science, University of Copenhagen, 2020. 189 p.

Research output: Book/Report › Ph.D. thesis › Research

Harvard

Aamand, A 2020, Simple Hashing-Based Algorithms with Strong Theoretical Guarantees. Department of Computer Science, Faculty of Science, University of Copenhagen.

APA

Aamand, A. (2020). Simple Hashing-Based Algorithms with Strong Theoretical Guarantees. Department of Computer Science, Faculty of Science, University of Copenhagen.

Vancouver

Aamand A. Simple Hashing-Based Algorithms with Strong Theoretical Guarantees. Department of Computer Science, Faculty of Science, University of Copenhagen, 2020. 189 p.

Author

Aamand, Anders. / Simple Hashing-Based Algorithms with Strong Theoretical Guarantees. Department of Computer Science, Faculty of Science, University of Copenhagen, 2020. 189 p.

Bibtex

@phdthesis{0e61bba0f440454eae1077f21eca9314,

title = "Simple Hashing-Based Algorithms with Strong Theoretical Guarantees",

abstract = "In recent years, demand has immensely increased for algorithms for handling and analyzing large-scale data. The data often arrives in a stream, and if it is not processed in time, the information is lost. Indeed, the high volume of the data makes storing a complete copy and performing exact computations an impossible task. These challenges motivate the following general question. {"}Can we design efficient algorithms and data structures that provide reliable statistical analyses in large-scale data scenarios?{"}. A powerful tool in designing such algorithms is randomization, and in particular, randomization through the use of hash functions. Conceptually, a hash function can be thought of as distributing a set of balls into a set of bins. Ideally it should do this in a fully random way, but unfortunately such fully random hashing is impossible to implement in practice. This thesis seeks to investigate practical, pseudorandom hash functions that still have enough randomness in them to give the same good theoretical guarantees as fully random hashing. First, we introduce a new family of hash functions, tabulation-permutation, which is simple to implement and demonstrated to be extremely fast in practice. We prove that this family of hash function provides reliable statistics on data, essentially showing that certain hash based sums nicely follow a normal distribution. We further show how access to such a hashing scheme leads to speedups of various streaming algorithms. We also study the distribution of non-empty bins when using simple tabulation hashing, another fast and practical hashing scheme. In several ways, this distribution is shown to resemble that from the fully random scenario.Second, we study algorithms for estimating the frequencies of elements in a large stream of data. We show how such algorithms can profit from having access to a machine learning oracle that can predict whether an item is a heavy hitter or not.Finally, we present a basic graph algorithm which can be understood as solving the following problem. For a city, can we make all the streets of the city one-way such that it is still possible to get from any place to any other place? The algorithm decides in linear time whether this can be done, also finding such a one-way orientation of the streets if it exists. ",

author = "Anders Aamand",

year = "2020",

language = "English",

publisher = "Department of Computer Science, Faculty of Science, University of Copenhagen",

}

RIS

TY - BOOK

T1 - Simple Hashing-Based Algorithms with Strong Theoretical Guarantees

AU - Aamand, Anders

PY - 2020

Y1 - 2020

N2 - In recent years, demand has immensely increased for algorithms for handling and analyzing large-scale data. The data often arrives in a stream, and if it is not processed in time, the information is lost. Indeed, the high volume of the data makes storing a complete copy and performing exact computations an impossible task. These challenges motivate the following general question. "Can we design efficient algorithms and data structures that provide reliable statistical analyses in large-scale data scenarios?". A powerful tool in designing such algorithms is randomization, and in particular, randomization through the use of hash functions. Conceptually, a hash function can be thought of as distributing a set of balls into a set of bins. Ideally it should do this in a fully random way, but unfortunately such fully random hashing is impossible to implement in practice. This thesis seeks to investigate practical, pseudorandom hash functions that still have enough randomness in them to give the same good theoretical guarantees as fully random hashing. First, we introduce a new family of hash functions, tabulation-permutation, which is simple to implement and demonstrated to be extremely fast in practice. We prove that this family of hash function provides reliable statistics on data, essentially showing that certain hash based sums nicely follow a normal distribution. We further show how access to such a hashing scheme leads to speedups of various streaming algorithms. We also study the distribution of non-empty bins when using simple tabulation hashing, another fast and practical hashing scheme. In several ways, this distribution is shown to resemble that from the fully random scenario.Second, we study algorithms for estimating the frequencies of elements in a large stream of data. We show how such algorithms can profit from having access to a machine learning oracle that can predict whether an item is a heavy hitter or not.Finally, we present a basic graph algorithm which can be understood as solving the following problem. For a city, can we make all the streets of the city one-way such that it is still possible to get from any place to any other place? The algorithm decides in linear time whether this can be done, also finding such a one-way orientation of the streets if it exists.

AB - In recent years, demand has immensely increased for algorithms for handling and analyzing large-scale data. The data often arrives in a stream, and if it is not processed in time, the information is lost. Indeed, the high volume of the data makes storing a complete copy and performing exact computations an impossible task. These challenges motivate the following general question. "Can we design efficient algorithms and data structures that provide reliable statistical analyses in large-scale data scenarios?". A powerful tool in designing such algorithms is randomization, and in particular, randomization through the use of hash functions. Conceptually, a hash function can be thought of as distributing a set of balls into a set of bins. Ideally it should do this in a fully random way, but unfortunately such fully random hashing is impossible to implement in practice. This thesis seeks to investigate practical, pseudorandom hash functions that still have enough randomness in them to give the same good theoretical guarantees as fully random hashing. First, we introduce a new family of hash functions, tabulation-permutation, which is simple to implement and demonstrated to be extremely fast in practice. We prove that this family of hash function provides reliable statistics on data, essentially showing that certain hash based sums nicely follow a normal distribution. We further show how access to such a hashing scheme leads to speedups of various streaming algorithms. We also study the distribution of non-empty bins when using simple tabulation hashing, another fast and practical hashing scheme. In several ways, this distribution is shown to resemble that from the fully random scenario.Second, we study algorithms for estimating the frequencies of elements in a large stream of data. We show how such algorithms can profit from having access to a machine learning oracle that can predict whether an item is a heavy hitter or not.Finally, we present a basic graph algorithm which can be understood as solving the following problem. For a city, can we make all the streets of the city one-way such that it is still possible to get from any place to any other place? The algorithm decides in linear time whether this can be done, also finding such a one-way orientation of the streets if it exists.

M3 - Ph.D. thesis

BT - Simple Hashing-Based Algorithms with Strong Theoretical Guarantees

PB - Department of Computer Science, Faculty of Science, University of Copenhagen

ER -

ID: 250973835