Join Algorithms for Distributed Databases
Master Thesis Defence by Martin Ljørring
We investigate why NoSQL databases do not provide a join operation.
Document stores offer embedding as a join alternative, but this does not suffice to join many-to-many relational data. We bring a survey of relevant join litterature in search for a way to introduce a parallel join operation.
To this extent we provide a solution, which includes a parallel join algorithm working on a cluster of MongoDB nodes. Tests indicate, that this solution outperforms the naive join, when the cluster and data size grows sufficiently large.
Furthermore we present two distinct ways of incorporating a join operation in a map-reduce framework. Tests made of the first approach yields scalable results, that slightly outperforms results from tests made on Hive, which is an existing application in commercial use enabling equi-joins in a Hadoop map-reduce framework. The second approach is inspired by 1-Bucket-Theta, which provides better defenses against data skew and can process any join predicate. Tests show however, that this solution performs worse than the first and Hive, and a possible reason for this is proposed.
The defence will be in English.
Philippe Bonnet, ITU
Ken Friis Larsen, DIKU