Tabulation Hashing for Large-Scale Data Processing - PhD Defense by Søren Dahlgaard


Søren Dahlgaard

The past decade has brought with it an immense amount of data from large volumes of text to network traffic data. Working with such large-scale data has become an increasingly important topic, giving rise to many important problems and influential solutions. One common denominator between many popular algorithms and data structures for tackling these problems is randomization implemented with hash functions.

A common practice in the analysis of such randomized algorithms, is to work under the abstract assumption that truly random unit cost hash functions are freely available without concern for which concrete hash function to employ. However, the choice of hash function is of huge importance, as the theoretical guarantees of a randomized algorithm rely crucially on this choice, and the analysis breaks down completely when too simple hash functions are used.

Furthermore, hashing is often employed as an “inner-loop” operation and evaluation time is thus of utmost importance. This thesis seeks to bridge this gap in the theory by providing efficient families of hash functions with strong theoretical guarantees for several influential problems in the large-scale data regime. This is done by studying families of tabulation-based hashing – a method for constructing very efficient hashing schemes based on table lookups and word parallelism.

We provide a new fundamental understanding of the dependencies between hash values of keys when using tabulation-based hashing, leading to the first “practical” hash functions with strong theoretical guarantees for many popular algorithms and techniques. This includes statistics based on k-partitioning which is employed in the popular HyperLogLog counters as well as the one permutation hashing sketch for similarity estimation. Furthermore, our techniques lead to the most efficient known hashing schemes for the power of two choices, approximately minwise independence, constant moment bounds, and more.

From an algorithmic point of view, we present a new similarity sketch with properties similar to the seminal MinHash sketch, but with much faster running time. This problem has previously been considered from a practical perspective, but the proposed solutions fail to give strong concentration bounds.

We complement our theoretical result with experiments demonstrating that tabulation hashing systematically outperforms simpler hashing schemes for similarity estimation and feature hashing on both synthetic and real-world data.

Assessment Committee:

Chairman: Associate Professor Christina Lioma, Department of Computer Science, University of Copenhagen

Associate Professor, Alexandr Andoni, Columbia University

Assistant Professor Kasper Green Larsen, Aarhus University

Academic supervisor & Co-Supervisor:

Professor Mikkel Thorup, Department of Computer Science, University of Copenhagen

Co-Supervisor Professor Stephen Alstrup, Department of Computer Science, University of Copenhagen

For an electronic copy of the thesis, please contact