A distributed design and implementation of the DataCleaner data cleaning tool

- Master's thesis defense by Tomasz Guzialek

Abstract

The thesis presents the design and implementation of a distributed version of DataCleaner, an open source data cleaning tool. DataCleaner executes analysis jobs: workloads built over a language of operators. The centralized database approach of the existing distributed version, called clustered DataCleaner, is believed to impose a scalability bottleneck as the cluster grows. Several approaches to distributing the execution of analysis jobs have been researched: a distributed database backend for clustered DataCleaner, as well as integrated processing framework and storage solutions in two variants, an interpreter approach and a compiler approach. Based on this research, Apache Hadoop with HBase has been chosen as an alternative to clustered DataCleaner and named Hadoop DataCleaner. Scale-up and speed-up experiments have been run for two types of workloads: a long-running operation workload (smaller datasets, but demanding processing per element) and a data-intensive one (bigger datasets, less demanding processing per element). These experiments show that clustered DataCleaner outperforms Hadoop DataCleaner on workloads with long-running transformations, but does not scale well with the number of nodes on data-centric workloads.
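
To give a flavour of how an analysis job can be mapped onto Hadoop with HBase, the following is a minimal sketch, not the thesis implementation: a simple value-distribution-style analyzer over one column, expressed as a MapReduce job that scans an HBase table. The table name "customers", the column family "d" and qualifier "city", and the trim/lower-case operator are illustrative assumptions; only the Hadoop and HBase APIs used (TableMapper, TableMapReduceUtil) are real.

    // Hypothetical sketch: a value-distribution analyzer as a MapReduce job
    // over an HBase table. Table, column and operator are assumptions.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ValueDistributionJob {

      // Map phase: run the row-level transformation operators of the
      // analysis job on each HBase row, emit one (cleanedValue, 1) pair.
      public static class OperatorMapper extends TableMapper<Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
            throws IOException, InterruptedException {
          byte[] raw = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("city"));
          if (raw == null) {
            return; // skip rows without the analyzed column
          }
          // Trivial stand-in for a transformation operator: normalize the value.
          String cleaned = Bytes.toString(raw).trim().toLowerCase();
          context.write(new Text(cleaned), ONE);
        }
      }

      // Reduce phase: aggregate per-row results into the analyzer output,
      // here a count per distinct (cleaned) value.
      public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text value, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) {
            sum += c.get();
          }
          context.write(value, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "value-distribution");
        job.setJarByClass(ValueDistributionJob.class);

        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("city"));
        scan.setCaching(500);        // larger scanner batches for a full-table scan
        scan.setCacheBlocks(false);  // keep the region server block cache clean

        TableMapReduceUtil.initTableMapperJob(
            "customers", scan, OperatorMapper.class, Text.class, IntWritable.class, job);
        job.setReducerClass(CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The split also hints at the trade-off measured in the experiments: per-record transformations in the map phase parallelize across the cluster and sit close to the data, while the fixed job-startup and shuffle overhead of MapReduce weighs relatively heavily on smaller datasets with expensive per-element processing.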

Academic supervisor: Marcos Vaz Salles, DIKU

Company co-supervisor: Kasper Sørensen, Human Inference

External examiner (censor): Lars Frank, CBS