Automatic IO filtering for optimizing cloud analytics

Joint HIPERFIT and COPLAS talk by Dimitrios Vytiniotis, Microsoft Research, Cambridge, England


Hadoop is a popular framework for processing large datasets. Many Hadoop jobs are very selective and operate on just a fraction of their input data, which is often unstructured (for instance, text files). In such scenarios it is impossible to apply out-of-the-box database optimizations. In this project at MSRC we have used static analysis techniques to examine the (executable bytecode of the) map phase of a job and automatically extract a filter that identifies the interesting "rows" and "columns" of the input data. Instead of sending all data from the storage cluster to the compute cluster, we automatically identify and send only the subset of interest. Our automatically generated filters are purely an optimization: they soundly approximate the set of interesting data, they are side-effect free (whereas mappers need not be), and they can be killed or restarted on demand. Using our filters on example jobs, we have reduced network overheads by a factor of 5, and job completion times by a factor of 3 to 4 for certain jobs. In this talk I will emphasize the static analysis part and show why Hadoop map jobs are a great fit for a static analysis that is very simple to implement, cheap to run, and effective at improving job completion times.
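To illustrate what "soundly approximate" means here, the following is a minimal Python sketch, not the tool presented in the talk. The mapper, the log format, and the hand-written `row_filter` are all hypothetical; in the actual system the filter is derived automatically from the mapper's bytecode. The point is only that a sound filter may keep rows the mapper later ignores, but must never drop a row the mapper would have used.

```python
from typing import Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Hypothetical selective mapper: emits (host, 1) for ERROR records only."""
    fields = line.split("\t")
    if len(fields) >= 3 and fields[1] == "ERROR":
        yield fields[0], 1


def row_filter(line: str) -> bool:
    """Hand-written stand-in for an extracted filter (an assumption).

    It over-approximates the mapper's condition: every line the mapper
    emits output for contains "ERROR", so no interesting row is dropped.
    """
    return "ERROR" in line


def run_map(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Apply the mapper to every line and collect the emitted pairs."""
    return [pair for line in lines for pair in mapper(line)]


records = [
    "host1\tERROR\tdisk full",
    "host2\tINFO\tstarted",
    "host3\tINFO\tERROR recovered",  # kept by the filter, ignored by the mapper
    "host1\tERROR\ttimeout",
]

# Only rows that pass the filter would be sent to the compute cluster.
shipped = [r for r in records if row_filter(r)]

# Soundness check: the job output is unchanged by filtering.
assert run_map(shipped) == run_map(records)
print(len(shipped), "of", len(records), "rows shipped")
```

In this toy input the filter ships three of the four rows: it is imprecise on the third record, yet the mapper's output is identical with and without filtering, which is what makes the filter purely an optimization.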


I am a researcher at the MSRC PPT group. My interests span programming languages theory and implementation, type systems, theorem proving, semantics, functional programming, and -- of course -- Haskell! I am involved in the design and implementation of the constraint solver underlying GHC's type inference engine. I am also fascinated by using PL techniques such as program analyses or domain-specific languages to optimize systems.

Before joining MSR, I completed my PhD in programming languages at the University of Pennsylvania. Before that, I was at the NTUA ECE department in Athens. More info:


Scientific host: Fritz Henglein. Administrative host: Jette Møller. All are welcome.

The Copenhagen Programming Language Seminar (COPLAS) is a collaboration between DIKU, DTU, ITU, and RUC.

COPLAS is part of the FIRST Research School.

To receive information about COPLAS talks by email, send a message to with the word 'subscribe' as subject or in the body.

For more information about COPLAS, see