Foundations and Trends in Databases Series

Query Processing on Probabilistic Data

by Guy Van den Broeck and Dan Suciu

Published 8 August 2017

Probabilistic data is motivated by the need to model uncertainty in large databases. Over the last twenty years or so, both the Database community and the AI community have studied various aspects of probabilistic relational data. Query Processing on Probabilistic Data: A Survey presents the main approaches developed in the literature, reconciling concepts developed in parallel by the two research communities. It starts with an extensive discussion of the main probabilistic data models and their relationships, followed by a brief overview of model counting and its relationship to probabilistic data. The monograph proceeds to discuss lifted probabilistic inference, a suite of techniques developed in parallel by the Database and AI communities for probabilistic query evaluation. It then provides a summary of query compilation, presenting some theoretical results highlighting limitations of various query evaluation techniques on probabilistic data. It ends with a brief discussion of some popular probabilistic data sets, systems, and applications that build on this technology.

Algorithmic Aspects of Parallel Data Processing

by Paraschos Koutris, Semih Salihoglu, and Dan Suciu

Published 22 February 2018

The last decade has seen a huge and growing interest in processing large data sets on large distributed clusters. This trend began with the MapReduce framework and has been widely adopted by several other systems, including PigLatin, Hive, Scope, Dremel, Spark, and Myria, to name a few. While the applications of such systems are diverse (for example, machine learning and data analytics), most involve relatively standard data processing tasks like identifying relevant data, cleaning, filtering, joining, grouping, transforming, extracting features, and evaluating results. This has generated great interest in the study of algorithms for data processing on large distributed clusters. Algorithmic Aspects of Parallel Data Processing discusses recent algorithmic developments for distributed data processing. It uses a theoretical model of parallel processing called the Massively Parallel Computation (MPC) model, which is a simplification of the BSP model where the only cost is given by the amount of communication and the number of communication rounds. The survey studies several algorithms for multi-join queries, sorting, and matrix multiplication. It discusses their relationships and common techniques applied across the different data processing tasks.

Modern Datalog Engines

by Bas Ketsman and Paraschos Koutris

Published 29 June 2022

Recent years have seen a resurgence of interest in Datalog from both industry and the research community. Datalog is a declarative query language that extends relational algebra with recursion. It is used to express a wide spectrum of modern data management tasks such as data integration, declarative networking, graph analysis, business analytics, and program analysis. The result of this long line of research is a plethora of Datalog engines that support different variants of Datalog and have different technical specifications and capabilities. In this monograph, the authors provide an overview of the architecture and technical characteristics of the various Datalog engines. They identify common architectural decisions and evaluation methods as well as data structures and layouts used to speed up query execution. They also discuss the ways in which Datalog engines differ when they specialize for workloads with different characteristics. A particular focus of this monograph is how modern Datalog engines scale to massively parallel environments, which is necessary to support the processing of very large datasets. The authors conclude with opportunities for future research and possible new applications for Datalog engines.