Join Algorithms for Distributed Databases – University of Copenhagen

Join Algorithms for Distributed Databases

Master Thesis Defence by Martin Ljørring

Abstract:

We investigate why NoSQL databases do not provide a join operation.
Document stores offer embedding as a join alternative, but this does not suffice to join many-to-many relational data. We bring a survey of relevant join litterature in search for a way to introduce a parallel join operation.
To this extent we provide a solution, which includes a parallel join algorithm working on a cluster of MongoDB nodes. Tests indicate, that this solution outperforms the naive join, when the cluster and data size grows sufficiently large.

Furthermore we present two distinct ways of incorporating a join operation in a map-reduce framework. Tests made of the first approach yields scalable results, that slightly outperforms results from tests made on Hive, which is an existing application in commercial use enabling equi-joins in a Hadoop map-reduce framework. The second approach is inspired by 1-Bucket-Theta, which provides better defenses against data skew and can process any join predicate. Tests show however, that this solution performs worse than the first and Hive, and a possible reason for this is proposed.

The defence will be in English.

External Examiner:

Philippe Bonnet, ITU

Supervisor:

Ken Friis Larsen, DIKU