EADS Talk by Rasmus Pagh: Set Similarity Search Beyond MinHash




We present an improved data structure for approximate similarity (or "nearest neighbor") search under Braun-Blanquet similarity. For x,y⊆{1,…,d}, Braun-Blanquet similarity is defined as B(x,y)=|x∩y|/max(|x|,|y|). Given a collection P of sets, (b1,b2)-approximate Braun-Blanquet similarity search asks us to preprocess P such that, given a query set q for which there exists x∈P with B(q,x)≥b1, we can efficiently return x′∈P with B(q,x′)>b2.

Historically a different measure, the Jaccard similarity J(x,y)=|x∩y|/|x∪y|, has been more widely studied. We argue that it suffices to study regular data structures, in which the size of the query set q and the sizes of the sets in P are fixed. In this setting there is a one-to-one mapping between Jaccard similarity and Braun-Blanquet similarity, so the two problems are equivalent.

It follows from the seminal paper of Indyk and Motwani (STOC 1998) that the so-called MinHash locality-sensitive hash function can be used to construct a data structure for (j1,j2)-approximate Jaccard similarity with space O(n^(1+ρ)+dn) and query time O(dn^ρ), where n=|P| and ρ=log(1/j1)/log(1/j2)<1. In the regular setting our data structure improves the exponent to ρ′=log((1+j1)/(2j1))/log((1+j2)/(2j2)), which is always strictly smaller than ρ. In terms of Braun-Blanquet similarity the exponent is ρ′=log(1/b1)/log(1/b2).
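As a quick sanity check on the definitions above, the following Python sketch (our own illustration, not from the talk) computes both similarity measures, verifies the equal-size relation B = 2J/(1+J), and compares the MinHash exponent ρ with the improved exponent ρ′; the threshold values j1, j2 are arbitrary examples.

```python
import math

def jaccard(x, y):
    """Jaccard similarity J(x,y) = |x ∩ y| / |x ∪ y|."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def braun_blanquet(x, y):
    """Braun-Blanquet similarity B(x,y) = |x ∩ y| / max(|x|, |y|)."""
    x, y = set(x), set(y)
    return len(x & y) / max(len(x), len(y))

def rho_minhash(j1, j2):
    """MinHash LSH exponent for (j1,j2)-approximate Jaccard search."""
    return math.log(1 / j1) / math.log(1 / j2)

def rho_new(j1, j2):
    """Improved exponent in the regular setting: convert Jaccard
    thresholds to Braun-Blanquet via b = 2j/(1+j), then apply
    rho' = log(1/b1)/log(1/b2)."""
    b1, b2 = 2 * j1 / (1 + j1), 2 * j2 / (1 + j2)
    return math.log(1 / b1) / math.log(1 / b2)

x, y = {1, 2, 3, 4}, {3, 4, 5, 6}   # equal-size sets
j, b = jaccard(x, y), braun_blanquet(x, y)
assert abs(b - 2 * j / (1 + j)) < 1e-12   # B = 2J/(1+J) when |x| = |y|

j1, j2 = 0.5, 0.1   # example thresholds
print(rho_minhash(j1, j2), rho_new(j1, j2))  # rho_new is strictly smaller
```

For j1 = 0.5, j2 = 0.1 the MinHash exponent is about 0.30 while the improved exponent is about 0.24, illustrating the strict improvement claimed in the abstract.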

Our main technical idea is to create a locality-sensitive mapping of a set x to a set of memory locations by simulating a certain Galton-Watson branching process with pairwise independent choices. The algorithm is simple to describe and implement, and its analysis uses only elementary tools. Surprisingly, even though the locality-sensitive mapping is data-independent, for a large part of the parameter space {(b1,b2) | 0 < b2 < b1 < 1} our data structure outperforms the best currently known data-dependent method, due to Andoni and Razenshteyn (STOC 2015).

Joint work with Tobias Christiani, to appear at STOC 2017.


Rasmus Pagh is a professor in the Algorithms Group of the computer science department at the IT University of Copenhagen, Denmark. He is part of the Copenhagen algorithms community. He received his PhD in 2002 from BRICS at the Department of Computer Science, Aarhus University, supervised by Peter Bro Miltersen. His scientific interests are within algorithms and data structures, with an emphasis on big data. He has worked extensively on basic questions in information retrieval and the role of randomness in computing, on problems with applications in databases and knowledge discovery, and on the exploitation of parallelism in modern computer architectures. He is currently running an ERC-funded project on Scalable Similarity Search.