What is a Data Virtualization Platform?

DataVera EKG Provider is a data virtualization platform for processing large data sets that are gathered in an Enterprise Knowledge Graph and structured according to an ontology. See the User Guide for more details.

EKG stands for "Enterprise Knowledge Graph". The term describes the overall process of managing and governing an organization's data and knowledge assets. An EKG platform serves as a centralized system for managing and governing data across an enterprise, ensuring data quality, completeness, consistency, and accuracy. It also provides data lineage control, a data dictionary, data profiling, data validation, and other data governance features.

DataVera EKG Provider uses ontologies to represent the data model. Ontologies make it possible to process data with a complex structure, containing thousands of entity types and properties, effectively. A distinct class of graph databases, RDF triple stores, is intended for processing ontologies. Engines for processing logical rules (such as SHACL) can be plugged in to execute rules on their content. Such databases perform well with data of complex structure, but not with big data.

This is why data virtualization platforms are needed. Data virtualization makes it possible to store data physically in a common relational or document-oriented database, but process it as if it were situated in a graph. An industrial ontology-based data processing framework has to implement the following functions:

  • Support SPARQL queries and/or other API types for data reading and manipulation
  • Support logical rules execution for data quality management and data transformation
  • Offer an editor to design the model (TBox) and manage data (ABox)
  • Provide a data search tool
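
The idea of data virtualization described above can be illustrated with a minimal sketch. It is not how DataVera EKG Provider is implemented. It only shows, under stated assumptions, how rows kept in a relational table can be exposed as graph triples on the fly. The table layout, the `ex:` prefixed names, the `MAPPING` dictionary and the helper functions are all hypothetical and chosen for illustration; an in-memory SQLite database stands in for the relational storage.

```python
import sqlite3

# Hypothetical relational storage: one table per entity type (ABox data).
def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (id TEXT PRIMARY KEY, name TEXT, employer TEXT)")
    conn.executemany(
        "INSERT INTO person VALUES (?, ?, ?)",
        [("p1", "Alice", "org1"), ("p2", "Bob", "org1")],
    )
    return conn

# Hypothetical mapping from table columns to ontology terms (TBox).
MAPPING = {
    "person": {
        "class": "ex:Person",
        "columns": {"name": "ex:name", "employer": "ex:worksFor"},
    },
}

def virtual_triples(conn, table):
    """Yield (subject, predicate, object) triples computed from table rows."""
    spec = MAPPING[table]
    columns = list(spec["columns"])
    query = f"SELECT id, {', '.join(columns)} FROM {table} ORDER BY id"
    for row in conn.execute(query):
        subject = f"ex:{row[0]}"
        yield (subject, "rdf:type", spec["class"])
        for column, value in zip(columns, row[1:]):
            if value is not None:
                yield (subject, spec["columns"][column], value)

def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

conn = build_demo_db()
triples = list(virtual_triples(conn, "person"))
# Graph-style question answered from relational rows: who works for org1?
colleagues = match(triples, p="ex:worksFor", o="org1")
```

The data never leaves the relational table; the triples exist only while a query is being answered. A production platform would translate graph patterns into SQL instead of materializing every triple, which this sketch does not attempt.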

DataVera EKG Platform Architecture

The DataVera EKG Platform implements all these requirements. The platform architecture is shown in the following diagram:

DataVera EKG Platform Architecture

The following diagram shows the architecture of the platform's cornerstone, DataVera EKG Provider:

DataVera EKG Provider Architecture

The DataVera EKG Platform architecture includes the following components:

  • DataVera EKG Explorer is a data editing interface for exploring and editing EKG contents
  • DataVera EKG Provider is a data virtualization platform which provides an API to access all the EKG contents regardless of where they are physically stored. It implements temporal (historical) data management, access rights control and rules execution.
  • Kafka or RabbitMQ - provides asynchronous exchange with data providers and consumers
  • Apache Fuseki RDF triple store - stores ontology TBox (structure) and rules
  • Protege Model (TBox) editor - one of the components which can be used to manage ontology TBox and rules
  • Postgres or Greenplum - ABox (data objects) storage

DataVera EKG Explorer exchanges data with DataVera EKG Provider via a REST API. DataVera EKG Provider exchanges data with the Apache Fuseki RDF triple store via the SPARQL protocol.

DataVera EKG Provider and DataVera EKG Explorer can be run on Kubernetes and scaled as required.

DataVera EKG Provider offers the following integration interfaces:

  • REST
  • JSON exchange through Kafka topics or RabbitMQ queues
  • SPARQL queries (with limitations)

The REST API provides the following core functions:

  • Get an entity by identifier
  • Get a group of objects matching a set of filters
  • Get data model (TBox)
  • Create/edit/delete objects
  • Bulk data load
  • Validate objects and run inference rules
  • Obtain data quality metrics for a data set and query the objects that violate data quality rules
  • Subscribe to updates of objects of any class via Kafka or RabbitMQ, and change subscription settings
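
The validation and data quality functions in the list above can be pictured with a small sketch. It does not use the DataVera REST API or SHACL; plain Python predicates stand in for data quality rules, and the rule names, the object fields and the `quality_report` function are hypothetical.

```python
# Hypothetical rules: each maps a rule name to a check over one object.
RULES = {
    "name_required": lambda obj: bool(obj.get("name")),
    "age_non_negative": lambda obj: obj.get("age") is None or obj["age"] >= 0,
}

def validate(obj):
    """Return the names of all rules the object violates."""
    return [name for name, check in RULES.items() if not check(obj)]

def quality_report(objects):
    """Compute a simple quality metric and list the violating objects."""
    violations = {obj["id"]: validate(obj) for obj in objects}
    violating = {oid: names for oid, names in violations.items() if names}
    valid_share = 1 - len(violating) / len(objects) if objects else 1.0
    return {"valid_share": valid_share, "violating": violating}

objects = [
    {"id": "p1", "name": "Alice", "age": 30},
    {"id": "p2", "name": "", "age": -1},
    {"id": "p3", "name": "Carol"},
]
report = quality_report(objects)
```

Here the metric is the share of objects that pass every rule, and `report["violating"]` lists each failing object together with the rules it breaks.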

One of the core features of DataVera EKG Provider is support for the temporal aspect of data. Temporal data functions allow working with any past or future state of the data set. The following functions are available for temporal data manipulation:

  • Get the state of an object at any moment in time
  • Get the full change history of an object
  • Get a set of objects using filters at any moment in time
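
The three temporal functions listed above can be sketched as follows. This is a minimal illustration, assuming each object is stored as a list of versions with a "valid from" timestamp; it is not the storage model of DataVera EKG Provider, and the `TemporalStore` class and its method names are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime

class TemporalStore:
    """Keeps every version of every object with the time it became valid."""

    def __init__(self):
        # object id -> list of (valid_from, state), kept sorted by time
        self._versions = {}

    def put(self, obj_id, valid_from, state):
        versions = self._versions.setdefault(obj_id, [])
        versions.append((valid_from, dict(state)))
        versions.sort(key=lambda v: v[0])

    def state_at(self, obj_id, moment):
        """Return the object state valid at the given moment, or None."""
        versions = self._versions.get(obj_id, [])
        times = [v[0] for v in versions]
        index = bisect_right(times, moment)
        return versions[index - 1][1] if index else None

    def history(self, obj_id):
        """Return all versions of the object in chronological order."""
        return list(self._versions.get(obj_id, []))

    def find_at(self, moment, **filters):
        """Return objects whose state at the given moment matches the filters."""
        result = {}
        for obj_id in self._versions:
            state = self.state_at(obj_id, moment)
            if state is not None and all(state.get(k) == v for k, v in filters.items()):
                result[obj_id] = state
        return result

store = TemporalStore()
store.put("p1", datetime(2023, 1, 1), {"name": "Alice", "employer": "org1"})
store.put("p1", datetime(2024, 1, 1), {"name": "Alice", "employer": "org2"})
store.put("p2", datetime(2023, 6, 1), {"name": "Bob", "employer": "org1"})

mid_2023 = store.find_at(datetime(2023, 7, 1), employer="org1")
mid_2024 = store.find_at(datetime(2024, 6, 1), employer="org1")
```

Because a version is selected purely by comparing timestamps, a version whose "valid from" time lies in the future is handled the same way as a past one, which matches the idea of querying future states.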

DataVera EKG Provider as a data virtualization platform provides:

  • All the core advantages of ontologies (faceted classification, multi-valued properties, inheritance from several superclasses, etc.)
  • Multilingual literal values (in any language)
  • Access rights control at the class level
  • Constraints and inference rules execution

DataVera EKG Provider has the following deployment infrastructure integration features:

  • Kubernetes deployment and scaling
  • Internal multithreading
  • ELK logging
  • Reporting metrics to Prometheus
  • API Swagger description