Data Management Systems Lab
The Data Management Systems group conducts computer systems research in emerging areas of data management. Projects include the design of spatial databases, scalable data streaming, actor database systems and in-memory databases, graph analysis systems, and cloud computing deployments. The group is keen on validating its work experimentally -- we love writing code, which is not to say that our love for the blackboard is in any way diminished. :-)
When conducting our work we usually resort to one or more of the following:
- Abstractions & Languages
- Combinatorial Optimization
- Indexing & Data Structures
- System Implementation & Design
- Statistics & Prediction
- Parallelism & Distribution
You can learn about our work in detail through our publications.
The group is leading the organization of the EDBT/ICDT 2020 Joint Conference in Copenhagen, Denmark. Please do not miss out on the opportunity to join us for this exciting event!
Actor Database Systems
With changing architectural and application trends, we are revisiting the design of online transaction processing databases by pursuing an integration of the actor model with relational database systems. We are studying the programming model of relational actors (or reactors) to achieve in-memory transaction processing that allows for flexible programming, high-level reasoning about transaction latencies, high resource utilization, and flexibility in database architecture. To demonstrate these principles, we are building an in-memory database system called ReactDB. Recently, we have investigated together with collaborators from PUC-Rio, Brazil, how reactors can be derived from monolithic OLTP applications implemented following a three-tier architecture.
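To give a flavor of the reactor idea, the sketch below models each actor as owning an in-memory relation and processing requests one at a time, so access to per-actor state is serialized without locks. This is a minimal sketch under our own naming assumptions (`Reactor`, `call`); it does not reflect ReactDB's actual API.

```python
# Illustrative sketch of a relational actor: the actor encapsulates a
# relation and serializes all access to it through a mailbox, so no
# explicit locking of the rows is needed. Names are hypothetical.
import queue
import threading

class Reactor:
    def __init__(self, name):
        self.name = name
        self.rows = {}             # in-memory "relation": key -> row dict
        self.inbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn, args, done = self.inbox.get()
            done.put(fn(self.rows, *args))  # one request at a time

    def call(self, fn, *args):
        """Send a function to run against this actor's state; block for the result."""
        done = queue.Queue()
        self.inbox.put((fn, args, done))
        return done.get()

accounts = Reactor("accounts")
accounts.call(lambda rows, k, v: rows.update({k: v}), "alice", {"balance": 100})
balance = accounts.call(lambda rows, k: rows[k]["balance"], "alice")
```

The key point of the sketch is that the mailbox serializes state access, which is one way to reason about transaction behavior at a high level.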
In addition, we are exploring the integration of industrial-strength actor runtimes, e.g., Orleans and Akka, with database concepts. One first result in this line of research was a proposal and discussion of guidelines for modeling and building IoT applications with actor-oriented databases, developed with collaborators from the University of Campinas, Brazil.
The profound digital transformation of society is driving a variety of disruptive trends, such as the Internet of Things (IoT), Smart Cities, and Big Data. These trends raise the need for novel approaches to software architecture in order to handle the ever-growing dynamicity of business and to make decisions in real time based on events. Event-Driven Architecture (EDA) is a key technology to achieve this goal. EDA is a loosely coupled software architecture in which each component executes upon receiving an event. EDA is therefore key to enabling a real-time event flow across components in a digital ecosystem, and hence a quick and agile response to changes in business.
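The loose coupling described above can be illustrated with a toy event bus: components register handlers for event types and execute only when a matching event is published. All names here are illustrative, not part of any CEEDA framework.

```python
# Toy event-driven architecture: publishers and subscribers are coupled
# only through event types, never through direct references to each other.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.handlers = defaultdict(list)   # event type -> handler list

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Each subscribed component executes upon receiving the event.
        for handler in self.handlers[event_type]:
            handler(payload)

bus = EventBus()
log = []
bus.subscribe("sensor.reading", lambda e: log.append(("store", e)))
bus.subscribe("sensor.reading",
              lambda e: log.append(("alert", e)) if e["value"] > 30 else None)
bus.publish("sensor.reading", {"value": 42})
```

Consistency and efficiency are exactly where such a naive bus falls short (no delivery guarantees, no transactional coupling between handlers), which is the gap CEEDA targets.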
The project CEEDA (Consistent and Efficient Event-Driven Architecture) envisions the next generation of actor-based programming framework with high data consistency and high system efficiency to simplify the development of complex EDA systems. Applications within IoT and Logistics will be used as case studies and evaluation scenarios.
In the project Enorm, we focus especially on scalability problems in stream processing. We have developed novel algorithms for memory management, stream dissemination, query optimization, load balancing, dynamic and elastic scaling, and fault-tolerance in distributed stream processing systems, and have implemented these algorithms and techniques in a prototype system.
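As one concrete example of the kind of operator such a system must scale, here is a minimal tumbling-window aggregation; the function name and interface are our own illustrative choices, not Enorm's API.

```python
# Tumbling-window aggregation: each event falls into exactly one
# fixed-size, non-overlapping window keyed by its start timestamp.
def tumbling_window_sums(events, window_size):
    """events: iterable of (timestamp, value) pairs.
    Returns {window_start: sum of values in [start, start + window_size)}."""
    sums = {}
    for ts, value in events:
        start = ts - ts % window_size   # align timestamp to window start
        sums[start] = sums.get(start, 0) + value
    return sums

# Events at t=1 and t=4 share window [0, 10); t=12 falls in [10, 20).
result = tumbling_window_sums([(1, 10), (4, 5), (12, 7)], 10)
```

Scaling this operator is where the real problems begin: window state must fit in memory, keys must be balanced across nodes, and state must survive failures, which is the territory the Enorm algorithms address.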
Data Platforms for Geospatial Data
Our group is participating in the Future Cropping project, a large collaboration in Denmark in the domain of precision agriculture. Future Cropping aims at developing a new generation of tools for improving farming practices by leveraging data from new sensor platforms, such as farm machinery, drones, and satellites, as well as open geospatial data. Our contribution to the project is focused on scalability for the project's data platform, which aims at serving geospatial data to a set of value-added analytical services in agriculture. In addition to Future Cropping, we have been collaborating with the Machine Learning (ML) group on mining large collections of satellite data. We are also collaborating with the ML group on the GANDALF project on spatial prediction of urban contamination.
Graph Analysis Systems
There is an increasing amount of data that takes the form of complex graphs in applications such as social networks, linked data, telecommunication networks, chemistry, and the life sciences. The analysis of such data should focus not only on the attributes attached to nodes or edges, but also on how the nodes are interconnected. In this project, we focus on the scalability issues of graph processing and develop algorithms and techniques to enhance the scalability of graph querying and analysis systems. In particular, we have developed a prototype system, called SemStore, to manage and query large-scale RDF data over a cluster of computers. SemStore adopts a path-based RDF data partitioning method and a highly efficient query optimizer, which enhance the system's scalability and query efficiency by minimizing the use of distributed joins and maximizing the parallelism of the query processor. Our experiments show that SemStore outperforms state-of-the-art techniques by orders of magnitude for complex graph queries.
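The intuition behind path-based partitioning can be sketched as follows: if all triples reachable from a root subject land in the same partition, then a path query rooted at that subject can be evaluated locally, without distributed joins. The code below is a drastic simplification of SemStore's actual method, with illustrative names and data.

```python
# Toy path-based partitioning of RDF triples: follow outgoing edges from
# each root and keep the whole reachable path set in one partition.
from collections import defaultdict

triples = [
    ("alice", "knows", "bob"),
    ("bob", "worksAt", "ITU"),
    ("carol", "knows", "dan"),
]

def partition_by_path(triples, roots, num_parts):
    """Assign each root's reachable triples to a single partition."""
    out = defaultdict(list)                 # subject -> outgoing triples
    for s, p, o in triples:
        out[s].append((s, p, o))
    parts = defaultdict(list)
    for i, root in enumerate(roots):
        stack, seen = [root], set()
        while stack:                        # depth-first reachability
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            for t in out[node]:
                parts[i % num_parts].append(t)
                stack.append(t[2])          # follow the object onward
    return dict(parts)

parts = partition_by_path(triples, ["alice", "carol"], 2)
```

With this layout, the two-hop query "whom does alice know, and where do they work" touches only one partition; a hash-by-triple layout would have split it across nodes.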
Open Geodata Serving
In a collaboration with the Danish Geodata Agency, we have explored new approaches to cook and serve geodata to the public on the Web. A main challenge in cartography is that producing high-quality maps over complex shapes requires the craft of human expertise. However, given the explosion in geospatial data, the pressure for high-productivity cartography tools is increasing at a fast pace. Our work has explored how to create a new class of declarative cartography tools. Our language CVL, the Cartographic Visualization Language, can be processed entirely within a spatial DBMS, opening up exciting opportunities for automatic optimization and scalability. Additionally, we have investigated how declarative cartography can be achieved efficiently inside the DBMS in the presence of fine-grained access control. In a separate line of work, we have also analyzed production logs for map-serving web services. These logs reveal strong spatial and temporal concentration patterns that can be exploited for more efficient caching.
Behavioral Simulations and Computer Games
In collaboration with the Cornell Database Group, we have worked on a new scripting platform for games and agent-based simulations. Our recent work in this project has been around iterated spatial join techniques optimized for main memory, as well as communication, especially mean latency and jitter optimizations, for cloud environments. We have also explored techniques for automatic parallelization of large-scale behavioral simulations, as well as efficient checkpoint-recovery techniques for Massively Multiplayer Online Games (MMOs).
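A basic building block of such simulations is finding, on every tick, all pairs of nearby agents. The grid-based spatial join below sketches this in its simplest in-memory form; it is an illustrative sketch, not the project's actual join technique.

```python
# Grid-based spatial join: bucket points into cells of side `radius`,
# so any pair within `radius` must lie in the same or adjacent cells.
from collections import defaultdict

def spatial_join(points, radius):
    """points: {id: (x, y)}. Return the set of id pairs within `radius`."""
    grid = defaultdict(list)
    for pid, (x, y) in points.items():
        grid[(int(x // radius), int(y // radius))].append(pid)
    pairs = set()
    for pid, (x, y) in points.items():
        cx, cy = int(x // radius), int(y // radius)
        for dx in (-1, 0, 1):               # check the 3x3 cell neighborhood
            for dy in (-1, 0, 1):
                for qid in grid[(cx + dx, cy + dy)]:
                    if qid <= pid:          # emit each pair only once
                        continue
                    qx, qy = points[qid]
                    if (x - qx) ** 2 + (y - qy) ** 2 <= radius ** 2:
                        pairs.add((pid, qid))
    return pairs

pts = {"a": (0.0, 0.0), "b": (0.5, 0.0), "c": (5.0, 5.0)}
pairs = spatial_join(pts, 1.0)
```

In an iterated setting the same join runs every tick over slowly moving points, which is what makes main-memory-specific optimizations (reusing the grid, exploiting small deltas) pay off.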
Multidimensional Indexing and Large Main Memories
We have also studied index structures for either read-intensive or write-intensive workloads. For the first class of workloads, we have studied experimentally, together with collaborators from Saarland University and ETH Zurich, the performance of one specific index structure, the Dwarf index. For the second class of workloads, we have studied how to answer queries over collections of moving objects, e.g., for vehicle tracking or spatial agent-based simulations. The problem is challenging because these applications have very high update rates that result from continuous movement. Our technique, MOVIES, is based on frequently rebuilding index snapshots in main memory. Using data partitioning over multiple nodes in a small cluster, we have scaled MOVIES up to 100 million moving objects over the road network of Germany, while keeping snapshot latencies below a few seconds.
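The snapshot idea behind MOVIES can be sketched in one dimension: updates go into a cheap buffer, a fresh read-only index is rebuilt periodically, and queries run against the latest snapshot, trading a bounded staleness for very high update throughput. The class below is an illustrative simplification under our own naming assumptions.

```python
# Sketch of snapshot-based indexing for moving objects: avoid in-place
# index maintenance under high update rates by rebuilding wholesale.
import bisect

class SnapshotIndex:
    def __init__(self):
        self.buffer = {}     # object id -> latest position (1-D here)
        self.snapshot = []   # sorted list of (position, id), read-only

    def update(self, obj_id, pos):
        self.buffer[obj_id] = pos           # cheap: no index maintenance

    def rebuild(self):
        # Periodic rebuild replaces the snapshot wholesale.
        self.snapshot = sorted((p, i) for i, p in self.buffer.items())

    def range_query(self, lo, hi):
        # Queries see the last snapshot, so results lag by at most
        # one rebuild interval (the snapshot latency).
        start = bisect.bisect_left(self.snapshot, (lo, ""))
        return [i for p, i in self.snapshot[start:] if p <= hi]

idx = SnapshotIndex()
for oid, pos in [("car1", 3.0), ("car2", 7.5), ("car3", 12.0)]:
    idx.update(oid, pos)
idx.rebuild()
hits = idx.range_query(0.0, 10.0)
```

The design choice is the same one MOVIES makes at scale: a bulk rebuild of a read-optimized structure is far cheaper per update than keeping a dynamic index consistent under millions of position changes per second.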
Dataspaces and Personal Information Management
In early work at the ETH Zurich Systems Group, we co-designed the iMeMex Dataspace Management System, a hybrid information integration architecture that allows users to transition from search to data integration in a pay-as-you-go fashion. Unlike a traditional relational DBMS, iMeMex does not take full control of the data, but offers services over one's complex personal dataspace. We have explored several interesting themes in the design of iMeMex, such as the definition of a unified data model for personal information, a novel technique based on mapping hints (called trails) to increase the level of integration of personal information over time, and search over graphs of user data created by view definitions.
Courses and Seminars
- Data Science (every Spring - Blocks 3+4; started Spring 2019)
- Development of Information Systems (every Spring - Blocks 3+4; started Spring 2017)
- Reactive and Event-Based Systems (every Fall - Block 2; starts Fall 2019)
- Databases and Web Programming and Databases and Data Mining (Spring 2014 - Spring 2015 - Block 3)
- Computer Networks (Datanet) (Spring 2012 - Spring 2013 - Block 4)
- Advanced Computer Systems (every Fall - Block 2; started Fall 2011, formerly Principles of Computer Systems Design)
- Big Data Systems (every Fall - Block 2; started Fall 2018)