Data Virtual Machines: Data-Driven Conceptual Modeling of Big Data Infrastructures

Presenter: Damianos Chatziantoniou
Date: 20 February 2020

Abstract

In this talk we introduce the concept of Data Virtual Machines (DVMs), a graph-based conceptual model of an organization’s data infrastructure, much like the traditional Entity-Relationship (ER) model. However, while ER uses a top-down approach, in which real-world entities and their relationships are depicted and then used to produce a relational representation, DVMs are based on a bottom-up approach, mapping the data infrastructure of an organization to a graph-based model. By “data infrastructure” we refer not only to data persistently stored in data management systems adhering to some data model, but also to generic data processing tasks that produce output useful in decision making. For example, a Python program that “does something” and computes, for each customer, the probability of churn is an essential component of the organization’s data landscape. Its output has to be made available to the user, e.g. a data scientist, in an easy-to-understand and intuitive manner, the same way the age or gender of a customer is. We formally define DVMs, queries over DVMs, and an algebraic framework for query evaluation. We also claim that a conceptual layer such as a DVM is a prerequisite for end-to-end processing. In addition, we present a prototype tool based on DVMs, called DataMingler, which enables end-to-end processing in analytics environments by data stakeholders. Specifically, DataMingler:

  • allows data engineers to easily and quickly map actual data and the output of processes onto a DVM, regardless of the underlying systems and data models
  • makes it easy for data scientists to understand entities and attributes, transform data as they wish (possibly using different languages), and define inputs for ML algorithms and ad hoc reports
  • enables data officers to govern data, see data provenance and comply with data regulations
  • gives data contributors (e.g. customers, suppliers) a view of their own data, so that they can understand, retrieve, and manage their personal data
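
As a rough illustration of the bottom-up idea in the abstract, the sketch below treats attributes as graph nodes and loads each edge from (key, value) pairs, whether those pairs come from stored data or from the output of a program. This is not the formal DVM definition presented in the talk, nor DataMingler's API: the class ToyDVM, its methods, the attribute names, and the churn_model function with its made-up scores are all hypothetical and exist only for this example.

```python
from collections import defaultdict


class ToyDVM:
    """Toy graph (an assumption, not the talk's formal model):
    nodes are attribute names, and each directed edge stores a
    mapping from a source value to a list of target values."""

    def __init__(self):
        # edges[(src, dst)] maps a src value to its dst values
        self.edges = defaultdict(dict)

    def add_edge(self, src, dst, pairs):
        """Load (src_value, dst_value) pairs, whatever produced them."""
        for s, d in pairs:
            self.edges[(src, dst)].setdefault(s, []).append(d)

    def attributes_of(self, node):
        """Names of all nodes reachable from `node` by one edge."""
        return sorted(dst for (src, dst) in self.edges if src == node)

    def lookup(self, src, dst, key):
        """Values of attribute `dst` for the given `src` value."""
        return self.edges.get((src, dst), {}).get(key, [])


def churn_model(customer_ids):
    """Stand-in for a program that "does something"; scores are made up."""
    return [(cid, round(0.1 * (cid % 10), 1)) for cid in customer_ids]


dvm = ToyDVM()
# Stored data, e.g. rows exported from a database table.
dvm.add_edge("custID", "age", [(1, 34), (2, 51)])
dvm.add_edge("custID", "gender", [(1, "F"), (2, "M")])
# Program output is attached in exactly the same way.
dvm.add_edge("custID", "churn_prob", churn_model([1, 2]))

print(dvm.attributes_of("custID"))
print(dvm.lookup("custID", "churn_prob", 1))
```

In this sketch a data scientist browsing custID sees churn_prob next to age and gender, without needing to know that one was computed by a program and the others were read from storage.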