PhD defence by Anders Aamand
Simple Hashing-Based Algorithms with Strong Theoretical Guarantees
In recent years, demand has immensely increased for algorithms for handling and analyzing large-scale data. The data often arrives in a stream, and if it is not processed in time, the information is lost. Indeed, the high volume of the data makes storing a complete copy and performing exact computations an impossible task. These challenges motivate the following general question. "Can we design efficient algorithms and data structures that provide reliable statistical analyses in large-scale data scenarios?". A powerful tool in designing such algorithms is randomization, and in particular, randomization through the use of hash functions. Conceptually, a hash function can be thought of as distributing a set of balls into a set of bins. Ideally it should do this in a fully random way, but unfortunately such fully random hashing is impossible to implement in practice. This thesis seeks to investigate practical, pseudorandom hash functions that still have enough randomness in them to give the same good theoretical guarantees as fully random hashing. First, we introduce a new family of hash functions, tabulation-permutation, which is simple to implement and demonstrated to be extremely fast in practice. We prove that this family of hash function provides reliable statistics on data, essentially showing that certain hash based sums nicely follow a normal distribution. We further show how access to such a hashing scheme leads to speedups of various streaming algorithms. We also study the distribution of non-empty bins when using simple tabulation hashing, another fast and practical hashing scheme. In several ways, this distribution is shown to resemble that from the fully random scenario.
Second, we study algorithms for estimating the frequencies of elements in a large stream of data. We show how such algorithms can profit from having access to a machine learning oracle that can predict whether an item is a heavy hitter or not.
Finally, we present a basic graph algorithm which can be understood as solving the following problem. For a city, can we make all the streets of the city one-way such that it is still possible to get from any place to any other place? The algorithm decides in linear time whether this can be done, also finding such a one-way orientation of the streets if it exists.
- Christian Wulff-Nilsen, Associate Professor, DIKU
- Rasmus Pagh, Professor, IT-University of Copenhagen
- Martin Dietzfelbinger, Professor, Dr., Technische Universität Ilmenau, Germany
Academic Main Supervisor
- Mikkel Thorup, Professor, DIKU
- Mikkel Abrahamsen, Assistant Professor, DIKU
- Jacob Holm, Assistant Professor, DIKU
For an electronic copy of the thesis, please contact firstname.lastname@example.org