MSc Thesis Defense by Michael Bang: Improving the performance of global selections


Improving the performance of global selections - An experimental evaluation of system level techniques


As more and more people gain access to smartphones and other location-aware devices, spatial data is becoming increasingly widespread. Organizations have found that it can be very profitable to make sense of and use spatial data in novel ways, so methods that facilitate this process are in high demand. One such method is data visualization, which is particularly useful when working with spatial datasets. As spatial datasets grow, however, visualizing them becomes less practical. A dataset that contains more tuples than there are pixels on today’s largest displays must be reduced in size before it can be visualized. One way to reduce a dataset is to select the tuples that are important or representative according to the informational goal of the visualization. Because today’s spatial datasets are so large, and because spatial computations are generally expensive, selecting tuples for visualization efficiently and easily has become a problem in its own right.

In this thesis, we take a pragmatic approach to improving the performance of global selections, an existing method for declaratively describing spatial selections using constraints for the purposes of visualization. We use system level techniques implemented in existing, widely deployed DBMSs to compute data selections efficiently within the DBMS, where the data already resides, so that the least effort is required from the user. We explore how three techniques, partitioning, indexing, and parallelism, affect the performance of spatial global selections. We begin by analyzing how these techniques impact the performance of global selections, and then implement global selections on top of multiple existing DBMSs with different configurations of these techniques. We finish by experimentally evaluating performance along each of these dimensions. Our experiments show that major performance gains are achievable and that spatial partitioning and parallelism in particular are very important for a system to compute global selections efficiently over large datasets.

Supervisor: Marcos Vaz Salles, DIKU

External Examiner: Czeslaw Kazimierczak, CSC Danmark A/S