There is an increasing amount of data that takes the form of complex graphs in various applications, such as social network, linked data, telecommunication networks, chemistry, life science, etc. The analysis of such data should not only focus on the attributes attached to the nodes or edges, but also on the way how the nodes are interconnected.
To address the challenges of the analysis of big graph data and develop supporting system technologies for data scientists, it requires joint efforts from various research areas, including but not limited to graph data management, linked data management, datalog, high-performance computing, distributed/parallel systems, etc.
The purpose of this workshop is to get together experts from several relevant communities that are actively addressing the problems related to big graph data analysis to identify commonalities and synergies and to stimulate cross-collaboration that can be expected but yet to be explored.
The workshop will host invited talks and will feature plenty of free time for discussions.
- Yongluan Zhou, University of Copenhagen
- Amol Deshpande, University of Maryland at College Park
- Marcos António Vaz Salles, University of Copenhagen
|09.00 - 09.15||
Welcome with a short introduction to the university and the department by the head of the department Mads Nielsen
|09.15 - 10.00||Yannis Papakonstantinou, University of California San Diego
The GQ-Fast accelerator for queries over typed graphs
The GQ-Fast system processes (potentially extended) SQL queries over relational data. It is specialized for the case where the relational data capture an encoding of an annotated graph and the queries correspond to graph analytics. In the current version, these queries involve aggregation, join, semijoin, intersection and selection thus presenting a wide superset of fixed-length graph reachability queries and of tree pattern queries. Extensions to variable-length reachability are in process.
We present challenging real-world interactive (OLAP) graph analytics scenarios, and show that row stores, column stores and graph databases are unacceptably slow in such OLAP scenarios. GQ-Fast's internal structure corresponds to efficient encoding of annotated adjacency lists that combines salient features of column-based organization, indexing and compression. GQ-Fast uses a bottom-up fully pipelined query execution model, which enables (a) aggressive compression (e.g., compressed bitmaps and Huffman) and (b) avoids intermediate results that consist of row IDs (which are typical in column databases). GQ-Fast compiles query plans into executable C++ source code. Besides achieving runtime efficiency, GQ-Fast also reduces main memory requirements because, unlike column databases, GQ-Fast selectively allows dense forms of compression including heavy-weight compressions, which do not support random access.
Finally, we present an under-development extension where the GQ-Fast memory-resident database/processor is in charge of the computationally demanding part of processing graph analytics, while a polystore accesses multiple sources, mirrors in GQ-Fast the graph structure and decomposes graph analytics queries into parts executed at GQ-Fast and parts executed at the original sources.
Yannis Papakonstantinou is a Professor of Computer Science and Engineering at the University of California, San Diego. His research is in the intersection of data management technologies and the web, where he has published over one hundred research articles that have received more than 14,000 citations, according to Google Scholar. A common theme of his research is the extension of database platforms and query processors beyond centralized relational databases and into semistructured & graph databases, integrated views of distributed databases and web services, textual data and queries involving keyword search, and most recently spatiotemporal sensor data. He has given multiple tutorials and invited talks, has served on journal editorial boards and has chaired and participated in program committees for many international conferences and workshops. He is a co-director and teaches for UCSD's Master of Advanced Studies in Data Science.
Yannis enjoys to commercialize his research and to inform his research accordingly. He was the CEO and Chief Scientist of Enosys Software, which built and commercialized an early Enterprise Information Integration platform for structured and semistructured data. The Enosys Software was OEM'd and sold under the BEA Liquid Data and BEA Aqualogic brand names, eventually acquired in 2003 by BEA Systems. He has also consulted for Amazon Web Services and multiple startups.
Yannis holds a Diploma of Electrical Engineering from the National Technical University of Athens, MS and Ph.D. in Computer Science from Stanford University (1997) and an NSF CAREER award for his work on data integration.
|10.00 - 10.45||Amol Deshpande, University of Maryland
GraphGen: Adaptive Graph Extraction and Analytics over Relational Databases
Graph querying and analytics are becoming an increasingly important component of the arsenal of tools for extracting different kinds of insights from data. However, graphs are not the primary representation choice for most data today, and users who want to employ graph analytics are forced to extract data from their data stores, construct the requisite graphs, and then use a specialized engine to write and execute their graph analysis tasks. This cumbersome and costly process not only raises barriers in using graph analytics, but also makes it hard to explore and identify hidden or implicit graphs in the data.
In this talk, I will present our ongoing work on an end-to-end graph analysis framework, called GraphGen, that sits atop an RDBMS and enables users to declaratively specify graph extraction tasks, visually explore the extracted graphs, and write and execute graph algorithms over them, either directly or using existing graph libraries like the widely used NetworkX Python library. GraphGen has a fundamentally different goal from recent work on using relational databases to store graph data through "shredding". Instead, GraphGen is intended to analyze graphs that are present in existing relational databases. GraphGen attempts to utilize the underlying relational database to the full extent possible by pushing down computation, uses a novel condensed representation to handle graphs that may be too large to extract in their entirety, allows writing programs using a general subgraph-centric API, and features several optimizations for efficient extraction and querying of large graphs.
Amol Deshpande is a Professor in the Department of Computer Science at the University of Maryland with a joint appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). He received his Ph.D. from University of California at Berkeley in 2004. His research interests include uncertain data management, adaptive query processing, data streams, graph analytics, and sensor networks. He is a recipient of an NSF Career award, and has received best paper awards at the VLDB 2004, EWSN 2008, and VLDB 2009 conferences.
|10.45 - 11.15||Coffee break|
|11.15 - 12.00||Dan Olteanu, University of Oxford
In-Database Factorized Learning
In this talk I will overview work on compilation of join queries over relational databases into compressed lossless data graphs called factorized representations. The primary motivation for this compilation is to avoid redundancy in the representation of query results in relational databases and in subsequent computation. The relationship between the standard listing representation of a relation and an equivalent factorized representation is on a par with the relationship between propositional formulas in disjunctive normal form and equivalent circuits. For conjunctive queries, their factorized results can be arbitrarily more succinct than their listing representations. I will also discuss an application of this work to learning regression models. On-going joint work with LogicBlox collaborators shows that in-database factorized learning can speed up real analytical workloads in the retail domain by several orders of magnitude over state-of-the-art systems. This work is based on long-standing collaboration with Maximilian Schleich (Oxford) and Jakub Zavodny (Oxford) and more recent collaboration with Mahmoud Abo-Khamis (LogicBlox), Hung Ngo (LogicBlox), and XuanLong Nguyen (Michigan).
Dan Olteanu is a Computer Science professor at Oxford and a computer scientist at LogicBlox. He has also taught at the universities of California Berkeley, Munich, Saarland, and Heidelberg. He received his PhD in Computer Science from University of Munich in 2005. His research interests are in databases and adjacent areas. Dan contributed to XML query processing, incomplete information and probabilistic databases, and more recently to factorized databases, in-database analytics, and the LogicBlox commercial system. He co-authored the book "Probabilistic Databases" (2011). He has served as associate editor for PVLDB'13 and IEEE TKDE, track chair for ICDE'15, group leader for SIGMOD'15, and vice chair for SIGMOD'17. His current research is supported by an ERC consolidator grant and awards from Google, LogicBlox, and Ordnance Survey.
|12.00 - 12.15||Pimin Konstantin Kefaloukos (Local Industrial Talk)
Reverse engineering a social network from web logs
Social networks contain a lot of information that is valuable for advertisers, such as the products purchased, interests, family relations and socio-demographic brackets of the network users. But what do you do if you don't own a social network? Every day advertising networks send billions of web requests to AudienceProject servers, where an individual log lines may contains information about device identifiers, URLs, IP addresses and more. This talks shows how AudienceProject has used novel and existing graph algorithms to reverse engineered a vast social and knowledge graph from web logs and used it to aid adversiters and media agencies. The focus of the talk will be on graph data in the advertisement world as well as some of the graph algorithms innovated or utilised by AudienceProject in this domain.
Pimin Konstantin Kefaloukos works in the ad-tech industry for AudienceProject in Copenhagen. He has a Ph.D. in Computer Science from the University of Copenhagen.
|12.15 - 13.30||Lunch|
|13.30 - 14.15||Hassan Chafi, Oracle Labs
Large Scale Graph Processing Opportunities and Challenges, an Industry Perspective
This talk will present Oracle Labs’ ongoing work in the area of large scale graph processing. We will present our approach to the problem both from a research and product perspective. In particular, we will highlight some of the unique aspects of our approach which relies on the adoption of high level Domain-Specific Languages (DSLs) to express both algorithms and queries against graph datasets. Next, we will discuss our approach to distributed graph computation which leverages and adapts existing state of the art techniques as well as some of our own novel ideas. We will highlight in particular, recent work to support distributed graph querying using an asynchronous graph pattern matching engine.The final part of the talk will present some our current streams of research and our perspective on some of the more practical challenges of large scale graph processing.
Hassan Chafi is Sr. Director of Research and Advanced Development at Oracle Labs where he currently leads various projects. His research investigates high-performance, parallel, in-memory Graph Analytics and using domain specific languages (DSLs) to simplify parallel programming. A second major project focuses on enabling extensibility within Oracle’s various data management solutions, including the Oracle Database. Dr. Chafi received his PhD from Stanford University. His thesis work at Stanford focused on building a Domain Specific Language Infrastructure, Delite. He was advised by Dr. Kunle Olukotun. Prior to that, Hassan worked in the area of hardware transactional memory as part of the Transactional Coherence and Consistency (TCC) project at Stanford where he developed a scalable extension to the original TCC protocol.
|14.15 - 15.00||George Fletcher, Eindhoven University of Technology
Schema-driven generation of synthetic graphs and queries with gMark
Massive graph data sets are pervasive in contemporary application domains. Hence, graph database systems are becoming increasingly important. In the experimental study of these systems, it is vital that the research community has shared solutions for the generation of database instances and query workloads having predictable and controllable properties. In this talk, we present the design and engineering principles of gMark, a domain- and query language-independent graph instance and query workload generator. A core contribution of gMark is its ability to target and control the diversity of properties of both the generated instances and the generated workloads coupled to these instances. Further novelties include support for regular path queries, a fundamental graph query paradigm, and schema-driven selectivity estimation of queries, a key feature in controlling workload chokepoints. We illustrate the flexibility and practical usability of gMark by showcasing the framework's capabilities in generating high quality graphs and workloads, and its ability to encode user-defined schemas across a variety of application domains.
This is joint work with colleagues at CNRS Lyon (France), INRIA Lille (France), Université Clermont Auvergne (France), and TU Eindhoven (Netherlands).
George Fletcher (PhD, Indiana University Bloomington) is an associate professor of computer science at Eindhoven University of Technology. His research interests span query language design and engineering, foundations of databases, and data integration. His current focus is on management of massive graphs such as social networks and linked open data. He is a member of the LDBC Graph Query Language Standardization Task Force.
|15.00 - 15.30||Coffee break|
|15.30 - 16.15||Jun Huan, University of Kansas
Investigating feed-forward Deep Neural Networks for Modeling Graph-based High Throughput Chemical Bioactivity Data
In recent years, research in Artificial Neural Networks (ANNs) has resurged, now under the Deep-Learning umbrella, and grown extremely popular due to major breakthroughs in methodological and computing capabilities. Deep-Learning methods are part of representation-learning algorithms that attempt to extract and organize discriminative information from the data. Recently reported success of DL techniques in crowd-sourced quantitative structure activity relationship modeling and predictive toxicology competitions has showcased these methods as powerful tools for cheminformatics. Nevertheless, reported applications of Deep Learning techniques for modeling complex bioactivity data for small molecules remain limited.
In this talk I will present our recent work on optimizing feed-forward Deep Neural Nets (DNNs) hyper-parameters and performance evaluation of these methods as compared to shallow methods. In our study 48 DNNs, 24 Random Forest, 20 SVM and 6 Naïve Bayes arbitrary but reasonably selected configurations were compared employing 7 diverse bioactivity datasets assembled from ChEMBL repository combined with circular fingerprints as molecular descriptors. The non-parametric Wilcoxon paired singed-rank test was employed to compare the performance of DNN to RF, SVM and NB. Overall it was found that DNNs with 2 hidden layers, 2,000 neurons per each hidden layer, ReLU activation function and Dropout regularization technique achieved strong classification performance across all tested datasets. Our results demonstrate that DNNs are powerful modeling techniques for modeling complex bioactivity data.
Dr. Jun (Luke) Huan is a professor in the Department of Electrical Engineering and Computer Science at the University of Kansas. He directs the Data Science and Computational Life Sciences Laboratory at KU. Dr. Huan holds courtesy appointments at the KU Bioinformatics Center and the KU Bioengineering Program.
Dr. Huan is an internationally recognized investigator in data science. He has published more than 120 peer-reviewed papers in leading conferences and journals and has graduated more than 10 graduate students including seven Ph.D.s. He was a recipient of the National Science Foundation Faculty Early Career Development Award in 2009. His group won the Best Student Paper Award at the IEEE International Conference on Data Mining in 2011 and the Best Paper Award (runner-up) at the ACM International Conference on Information and Knowledge Management in 2009. His work appeared at mass media including Science Daily, R&D magazine, and EurekAlert. Dr. Huan’s editorial memberships have included Springer Journal of Big Data, Elsevier Journal of Big Data Research, and the International Journal of Data Mining and Bioinformatics among others. He regularly serves the program committee of top-tier international conferences on Machine Learning, Data Mining, Big Data, and Bioinformatics.
Since 2015 Dr. Huan has served as a Program Director in the Information and Intelligent Systems division at the US National Science Foundation. At NSF he manages programs such as IIS core, Big Data, and Partnerships for International Research and Education.
|16.15 - 17.00||Boris Düdder, University of Copenhagen
Proof Enumeration as Big Graph Analysis Problem
Enumerating proofs for a theorem is an inherently complex problem by means of space and time complexity. Enumerating proofs in a (constructive) logic corresponds to enumerating or synthesizing typed programs in a corresponding type system. We present in this talk an example, how the algorithmic complexity of the proof enumeration problem is mapped to a big graph analysis problem to be solved. We show empirical results on a tool for program synthesis.
Boris Düdder is Assistant Professor in Software Engineering. He was leading a research team in software engineering at Technical University Dortmund, Germany, after finishing his doctorate in Dortmund in 2014. His research topic is automatic software generation technology. He was researcher at Fraunhofer Institute for Software- and System Engineering and worked as IT consultant and enterprise architect in banking and finance and Microsoft Corp., Redmond. Funded a software company and was its CEO. Worked at CERN, Geneva, after studying physics and mathematics at Ruhr University Bochum. Received various industry and academic awards.
|09.00 - 09.45||Semih Salihoglu, University of Waterloo
Surprises in a Survey of the Graphs, Computations, and Software Used in Practice
This talk will be about an online user survey we conducted in April 2017 among industry and research users of a wide-spectrum of graph data processing technologies, including graph databases, RDF engines, distributed graph processing systems, libraries for graph processing, graph visualization software, and the Gremlin query language. The survey was intended to understand three high-level questions: (1) What kinds of graph data do participants have? (2) What kind of computations do participants do on their graphs? and (3) What kind of software do participants use to perform these computations. We also asked the participants about their major challenges when processing their graphs. An analysis of the 89 participants revealed very surprising facts about these questions, which we will share during the talk.
Semih Salihoglu is an assistant professor at University of Waterloo's Cheriton School of Computer Science. He is member of the Data Systems Research Group. Previously, he was a PhD student at Stanford University, advised by Jennifer Widom. His current research focuses graph databases and distributed graph processing engines.
|09.45 - 10.30||Milos Nikolic, University of Oxford
A Unified Approach to Incremental View Maintenance of In-Database Analytics
In this talk, we study the problem of incremental computation of analytical tasks over joins. We introduce a principled incremental view maintenance (IVM) mechanism that reduces the maintenance of the given task to the maintenance of a hierarchy of increasingly simpler views. The views are functions mapping keys, which are tuples of input data values, to payloads, which are elements from a task-specific ring. The computation over the keys is the same for all tasks, while the computation over the payloads depends on the task. Our IVM approach achieves efficiency by factorizing the computation of the keys, payloads, and updates.
We demonstrate our technique on a variety of analytical tasks, such as gradient computation for learning regression models, linear algebra operations, and factorized evaluation of conjunctive queries, and experimentally show that it can achieve orders of magnitude better performance and lower memory consumption than existing state-of-the-art IVM techniques.
Milos Nikolic is a departmental lecturer in the Department of Computer Science at the University of Oxford. His research focuses on the design and implementation of data-intensive systems. His work studies the incremental computation of complex analytical queries, such as database queries and machine learning models, in local and distributed streaming environments using novel approaches to query optimization and compilation. He received a Ph.D. in Computer Science from EPFL.
|10.30 - 10.45||Tim D. Ward (Local Industrial Talk)
How to build a decision engine with Neo4j and Machine Learning Techniques
There is no doubt that machine learning and graphs go hand in hand — and Neo4j makes it possible to take machine learning to the next level. In this presentation, we'll share how CluedIn built Neo4j into a decision engine that turns company data into actionable insights. By using machine learning techniques and the graph as a decision tree, we were able to achieve amazing precision in merging and identifying insights in the enterprise. With practical demos and techniques, viewers will leave this presentation with new and effective ways of working with Neo4j.
Tim is an Engineer at CluedIn where they specialize in combining Neo4j and other database technologies with Machine Learning. He has been using the Graph for over 5 years and the team at CluedIn are currently running a Neo4j Graph Database with over 180 Million nodes in production.
|10.45 - 11.15||Coffee break|
|11.15 - 12.00||Ioana Manolescu, INRIA Saclay
Towards Analytics for RDF Graphs
RDF is the W3C's recommended format for representing graph-structured data. RDF graphs can be large and complex, lacking a predefined structure; this makes their exploration difficult and raises special difficulties for query answering, as it has to account for explicit and for implicit data. The talk will cover recent works on efficiently querying, summarizing, and discovering insights in heterogeneous, semantic-rich RDF graphs.
Ioana Manolescu is a senior researcher at Inria Saclay and a part-time professor at Ecole Polytechnique, France. She is the lead of the CEDAR INRIA team focusing on rich data analytics at cloud scale. She is a member of the PVLDB Endowment Board of Trustees, of the ACM SIGMOD Jim Gray PhD dissertation committee, and an associate editor for PVLDB. Recently, she has been the program chair of the Scientific and Statistical Data Management Conference 2016, and she is a co-chair of the upcoming IEEE ICDE conference in 2018. She has co-authored more than 130 articles in international journals and conferences, and contributed recently to a book on "Web Data Management" by S. Abiteboul, I. Manolescu, P. Rigaux, M.-C. Rousset and P. Senellart. Her main research interests algebraic and storage optimizations for semistructured data and in particular data models for the Semantic Web, novel data models and languages for complex data management, data models and algorithms for fact-checking, and distributed architectures for complex large data.
|12.00 - 12.45||Yongluan Zhou, University of Copenhagen
Graph Processing On GPUs: Challenges and Opportunites
In this talk, I will present the key issues of graph processing on GPUs, including data layout, memory access pattern, workload mapping and specific GPU programming. I will summarize a few representattive state-of-the-art researches on GPU-based graph processing, and explore the research opportunities in future.
Yongluan is an Associate Professor in the Department of Computer Science at the University of Copenhagen. Before joining KU, he has been an Associate Professor at the University of Southern Denmark and a postdoc at Ecole Polytechnique Fédérale de Lausanne (EPFL). He obtained his PhD in Computer Science at the National University of Singapore. He had also been visitors at several international institutes, including the University of Maryland at College Park, the Advanced Digital Sciences Center at the University of Illinois at Urbana-Champaign and INRIA Rennes. His research interests span database systems and distributed systems. His recent research focuses on scalable streaming data processing and graph data processing. He has published over 50 research articles in international journals and conference proceedings.
|12.45 - 14.00||Lunch|
|14.00 - 17.00||Discussions and walking tour of the city|
Room 4-0-02, Biocenter
University of Copenhagen
Ole Maaløes Vej 5
2200 Copenhagen N
The workshop is only for invitees.
Please contact associate professor Yongluan Zhou (firstname.lastname@example.org) if you have any questions or wish to join the workshop.