PhD defence by Jakob Bæk Tejs Houen
Title
New Results on Hash Functions and Hashing-Based Algorithms
Abstract
With the increasing size of datasets in the era of machine learning and AI, exact computations on these datasets have become increasingly infeasible. Nevertheless, in many applications, approximate answers are sufficient. This motivates the question of whether efficient algorithms and data structures can be designed to provide reliable approximate answers on huge datasets.
Randomization, particularly through hash functions, is a powerful tool for simplifying these algorithms. However, existing analyses of such algorithms regularly assume fully random or highly independent hash functions, which ignores the issue of efficiency in practice. While there exist efficient theoretical constructions of highly independent hash functions, none of them are efficient in practice. This thesis addresses these issues by providing new analyses of practical tabulation-based hashing schemes. This analysis allows us to show that mixed tabulation hashing has strong concentration guarantees for hash-based sums that closely match those of fully random hashing. Furthermore, we demonstrate how these insights can be used to implement a sparse Johnson-Lindenstrauss transform, a widely used technique in high-dimensional data analysis.
Additionally, we study the problem of approximate set similarity search, where the goal is to build a data structure that can identify similar sets in a database of known sets or report that the database contains no similar set. We show that our algorithm is optimal for all hashing-based data structures for random sets, providing a comprehensive solution to this problem.
Finally, we consider the problem of dynamic load balancing in an environment where both balls and bins can be added and removed. We assume that each bin has a capacity of C, that is, each bin can contain at most C balls. We construct a data structure whose performance is comparable to the simpler problem of throwing balls into capacitated bins, that is, place the balls into bins with capacity C, one ball at the time, where each ball picks a uniformly random non-full bin.
Supervisors
Principal Supervisor Mikkel Thorup
Assessment Committee
Professor Jakob Nordstrom, Department of Computer Science
Associate Professor Eva Rotenberg, DTU
Associate Professor Alexander Andoni, Columbia University, USA
Leader of defence: Joel Daniel Andersson
For an electronic copy of the thesis, please visit the PhD Programme page.