Talk by Thorsten Papenbrock


Data Profiling - Efficient Discovery of Dependencies


Data profiling is the computer science discipline of analyzing a given dataset for its metadata. The types of metadata range from basic statistics, such as tuple counts, column aggregations, and value distributions, to much more complex structures, in particular inclusion dependencies (INDs), unique column combinations (UCCs), and functional dependencies (FDs). When available, these statistics and structures help to store, query, change, and understand the data efficiently. Most datasets, however, do not provide their metadata explicitly, so data scientists need to profile them.

While basic statistics are relatively easy to calculate, more complex structures present difficult, mostly NP-complete discovery tasks; even with good domain knowledge, it is hardly possible to detect them manually. Therefore, various profiling algorithms have been developed to automate the discovery. None of them, however, can process datasets of typical real-world size, because their resource consumption and/or execution times exceed practical limits.
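To make the difficulty concrete, here is a minimal sketch of what checking a single functional dependency or unique column combination involves, and how quickly the number of candidates grows. It is not the algorithm presented in the talk; the function names and the toy table are assumptions made up for illustration only.

```python
from itertools import combinations

def holds_fd(rows, lhs, rhs):
    """Return True if the columns in lhs functionally determine column rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # Two rows that agree on lhs but differ on rhs violate the FD.
        if key in seen and seen[key] != row[rhs]:
            return False
        seen.setdefault(key, row[rhs])
    return True

def is_unique(rows, columns):
    """Return True if no two rows share the same values in the given columns."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(set(keys)) == len(keys)

def naive_fd_candidates(columns):
    """Yield every (lhs, rhs) candidate with a non-empty lhs that excludes rhs."""
    for rhs in columns:
        others = [c for c in columns if c != rhs]
        for size in range(1, len(others) + 1):
            for lhs in combinations(others, size):
                yield lhs, rhs

# Hypothetical toy table, invented for this example.
rows = [
    {"zip": "14482", "city": "Potsdam", "name": "Ada"},
    {"zip": "14482", "city": "Potsdam", "name": "Bob"},
    {"zip": "10115", "city": "Berlin", "name": "Ada"},
]

zip_determines_city = holds_fd(rows, ("zip",), "city")
name_determines_city = holds_fd(rows, ("name",), "city")
zip_name_unique = is_unique(rows, ("zip", "name"))
candidate_count = sum(1 for _ in naive_fd_candidates(["zip", "city", "name"]))
```

A single check is cheap, but a table with n columns has n * (2^(n-1) - 1) such FD candidates, so testing each one in turn becomes infeasible well before tables reach typical real-world widths.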

In my talk, I will discuss data profiling and one novel profiling algorithm from my PhD thesis "Data Profiling - Efficient Discovery of Dependencies" in more detail. More specifically, we will investigate the discovery of functional dependencies and how techniques such as hybrid search, progressivity, memory sensitivity, parallelization, and additional pruning help to greatly improve upon current limitations. I will also briefly introduce the tool Metanome, which we built to make our profiling algorithms accessible to all data scientists and IT professionals.

Contact person: Yongluan Zhou,