New Results on Hash Functions and Hashing-Based Algorithms

Publikation: Bog/antologi/afhandling/rapportPh.d.-afhandlingForskning

  • Jakob Bæk Tejs Houen
With the increasing size of datasets in the era of machine learning and AI, exact computations on these datasets have become increasingly infeasible. Nevertheless, in many applications, approximate answers are sufficient. This motivates the question of whether efficient algorithms and data structures can be designed to provide reliable approximate answers on huge datasets. Randomization, particularly through hash functions, is a powerful tool for simplifying these algorithms. However, existing analyses of such algorithms regularly assume fully random or highly independent hash functions, which ignores the issue of efficiency in practice. While there exist efficient theoretical constructions of highly independent hash functions none of them are efficient in practice. This thesis addresses these issues by providing new analyses of practical tabulation-based hashing schemes. First in [HT22], we obtain a near-optimal understanding of concentration guarantees for simple tabulation hashing of hash-based sums by bounding the moments. This analysis allows us to show that mixed tabulation hashing has strong concentration guarantees for hash-based sums that closely match those of fully random hashing. Furthermore in [HT23], we demonstrate how these insights can be used to implement a sparse Johnson-Lindenstrauss transform, a widely used technique in high-dimensional data analysis. Additionally in [AK20], we study the problem of approximate set similarity search, where the goal is to build a data structure that can identify similar sets in a database of known sets or report that the database contains no similar set. We show that our algorithm is optimal for all hashing-based data structures for random sets, providing a comprehensive solution to this problem. Finally in [AKT21], we consider the problem of dynamic load balancing in an environment where both balls and bins can be added and removed. We assume that each bin has a capacity of C, that is, each bin can contain at most C balls. We construct a data structure that in expectation moves O(1/f ) balls when inserting or deleting a ball, and O(C/f ) balls when inserting or deleting a bin. Where f is the fraction of non-full bins in the following simpler probabilistic problem: Place the balls into bins with capacity C, one ball at the time, where each ball picks a uniformly random non-full bin. We also solve this simpler problem and provide a tight bound for f . In order to prove these results, we needed a new result in probability theory which was proven in [AAHT22].
OriginalsprogEngelsk
ForlagDepartment of Computer Science, Faculty of Science, University of Copenhagen
Antal sider375
StatusUdgivet - 2023

ID: 379984076