Presenter: Damianos Chatziantoniou Date: 20 February 2020
In this talk we introduce the concept of Data Virtual Machines (DVM), a graph-based conceptual model of the data infrastructure of an organization, much like the traditional Entity-Relationship Model (ER). However, while ER uses a top-down approach, in which real-world entities and their relationships are depicted and utilized in the production of a relational representation, DVMs are based on a bottom up approach, mapping the data infrastructure of an organization to a graph-based model. With the term ``data infrastructure'' we refer to not only data persistently stored in data management systems adhering to some data model, but also of generic data processing tasks that produce an output useful in decision making. For example, a python program that “does something” and computes for each customer her probability to churn is an essential component of the organization’s data landscape and has to be made available to the user, e.g. a data scientist, in an easy to understand and intuitive to use manner, the same way the age or gender of a customer are made. We define formally a DVM, queries over DVMs and an algebraic framework for query evaluation. We also claim that a conceptual layer, such as DVM, is a prerequisite for end-to-end processing. In addition, we present a prototype tool based on DVMs, called DataMingler, which enables end-to-end processing in analytics environments by data stakeholders. Specifically, DataMingler: