Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

WWW.LEDS-PROJEKT.DE ECCENCA CORPORATE MEMORY SEMANTICALLY INTEGRATED ENTERPRISE DATA LAKES 7/5/22 1

Transcript of Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

Page 1: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

WWW.LEDS-PROJEKT.DE

ECCENCA CORPORATE MEMORY

SEMANTICALLY INTEGRATED ENTERPRISE DATA LAKES

May 2, 2023

Page 2: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


MOTIVATION

Enterprise Data Management Objective: “Ensure all data is aligned to a common meaning in order to achieve automation in performing complex analytics and generating trusted reports.”

Source: 2015 Data Management Industry Benchmark - EDM Council

In 2015, only 7% of respondents claimed to already be using shared and unambiguous definitions of data across the firm and to have them accessible as operational metadata.

Page 3: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


ARCHITECTURE

[Architecture diagram]

Business areas shown: Management Accounting, Risk Management, Regulatory Reporting, Treasury, Marketing, Accounting

Corporate Memory

Inbound
• Data Sources
• Inbound Raw Data Store

Knowledge Graph for Meta Data, KPI Definition and Data Models

Outbound and Consumption
• Frontend to Access Relationship and KPI Definition / Documentation
• Frontend to Access (ad hoc) Reports
• Outbound Data Delivery to Target Systems

Big Data DWH-Infrastructure

Page 4: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

ARCHITECTURE

Data Ingestion
• Files in the data lake (CSV, XML, Excel)
• (Relational) databases

Page 5: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

ARCHITECTURE

Data Lake
• Emerging approach to handle large amounts of data
• Cost-effective storage
• Data is held in its native format

Good: Does not force an up-front integration of the ingested data sets

Bad: Retaining an overview of disparate data silos in the lake without a coherent shared view is challenging

Page 6: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

ARCHITECTURE

Data Warehouses
• Existing infrastructure
• Typically relational databases

Page 7: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

ARCHITECTURE

Metadata Layer
• Dataset Metadata
• Ontologies
• Integration Rules

Page 8: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

ARCHITECTURE


Graphical User Interface

Customer Applications

Page 9: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


INTEGRATION PROCESS

Dataset Management
• Catalog Datasets
• Catalog Ontologies
• Manage Metadata

Dataset Discovery
• Data Profiling
• Dataset Exploration

Dataset Integration
• Dataset Lifting
• Dataset Linking
• Data Quality Validation

Data Access
• Domain Specific Consolidated Views
• Execution on Hadoop

Page 10: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


DATASET MANAGEMENT


Page 11: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


DATASET CATALOG

• Enables the user to explore and manage datasets in the data lake
• Files in the data lake (CSV, XML, Excel)
• Databases (Apache Hive or external databases)

Page 12: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


MANAGING METADATA

• Exploring and editing dataset metadata

• Semantic content information, like textual descriptions, tags and related persons

• Technical information and parameters, like formats, data model and encoding

• Access information, like access path or URL, source system or API call

• Organizational provenance, like organizational units owning or maintaining the dataset
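
The four metadata groups above could be captured in a record like the following. This is only a sketch: the class name, field names and sample values are assumptions made for illustration and are not taken from eccenca Corporate Memory.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record with the four groups named on the slide."""
    # Semantic content information
    description: str = ""
    tags: List[str] = field(default_factory=list)
    related_persons: List[str] = field(default_factory=list)
    # Technical information and parameters
    data_format: str = ""
    data_model: str = ""
    encoding: str = ""
    # Access information
    access_path: str = ""
    source_system: str = ""
    # Organizational provenance
    owning_unit: str = ""
    maintaining_unit: str = ""


# Illustrative values only.
bonds = DatasetMetadata(
    description="Corporate bonds master data",
    tags=["bonds", "securities"],
    data_format="CSV",
    encoding="UTF-8",
    access_path="/lake/raw/bonds.csv",
    owning_unit="Treasury",
)
print(bonds)
```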

Page 13: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

DATASET DISCOVERY


Page 14: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


DATASET DISCOVERY

• Goal: Augment a dataset with data from related datasets
• Automatic discovery of datasets with overlapping information
• Explorative interface
• Discovery is based on two parts of the data
  • Business metadata
  • Profiling summary

Page 15: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


DISCOVERY VIEW

• Datasets are matched based on their metadata (profiling + business data)
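
One way to picture such metadata-based matching is a score that combines the overlap of business metadata with the overlap of profiling results. The sketch below is an assumption, not the product's matching algorithm: it uses Jaccard overlap of tags and of detected column types, and the weighting parameter and sample datasets are made up.

```python
def jaccard(a, b):
    """Share of common elements between two collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_score(ds1, ds2, business_weight=0.5):
    """Weighted mix of business-metadata overlap and profiling overlap."""
    business = jaccard(ds1["tags"], ds2["tags"])
    profiling = jaccard(ds1["column_types"], ds2["column_types"])
    return business_weight * business + (1 - business_weight) * profiling


# Illustrative datasets; tags stand in for business metadata,
# column_types for the profiling summary.
bonds = {"tags": ["bonds", "securities"], "column_types": ["isin", "country", "industry"]}
ratings = {"tags": ["ratings", "securities"], "column_types": ["isin", "rating"]}
staff = {"tags": ["employees"], "column_types": ["date", "name"]}

for name, other in [("ratings", ratings), ("staff", staff)]:
    print(name, round(match_score(bonds, other), 3))
```

In this toy setup the ratings dataset scores higher than the staff dataset because it shares a tag and an ISIN column with the bonds dataset.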

Page 16: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


DATASET PROFILING

• Datasets often contain implicit and explicit schema information
  • Column names, data formats, enumerated values etc.
  • Example: column contains formatted dates
• Idea: Extract a dataset summary
• For each column / property the summary contains:
  1. Data type (e.g., number, date, industry classification)
  2. Data format (e.g., date format)
  3. Data statistics (e.g., range, distribution, most frequent values)
• Materialized as RDF with UI view
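
A minimal sketch of such a per-column summary follows, assuming non-empty columns of string values. The function names, the summary keys and the `ex:` vocabulary are hypothetical, and the output lines are only triple-like text, not the RDF that the product actually materializes.

```python
from collections import Counter


def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def profile_column(values):
    """Summarize one non-empty column: coarse data type plus simple statistics."""
    counts = Counter(values)
    summary = {
        "dataType": "number" if all(is_number(v) for v in values) else "string",
        "distinctValues": len(counts),
        "mostFrequentValue": counts.most_common(1)[0][0],
    }
    if summary["dataType"] == "number":
        numbers = [float(v) for v in values]
        summary["minValue"] = min(numbers)
        summary["maxValue"] = max(numbers)
    return summary


def to_triples(dataset, column, summary):
    """Render a summary as triple-like lines using a made-up ex: vocabulary."""
    subject = f"ex:{dataset}/{column}"
    return [f'{subject} ex:{key} "{value}" .' for key, value in summary.items()]


rows = [
    {"Country": "Canada", "Coupon": "5.2"},
    {"Country": "Germany", "Coupon": "1.5"},
    {"Country": "Canada", "Coupon": "6.5"},
]
for column in rows[0]:
    summary = profile_column([row[column] for row in rows])
    for line in to_triples("bonds", column, summary):
        print(line)
```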

Page 17: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


DETECTING DATA TYPES

• Detecting common datatypes as well as user-defined types
• Common datatypes
  • Numbers
  • Dates / Times
  • Geographic locations (geo-coordinates, states, countries)
• User-defined data types can be integrated by adding an ontology / taxonomy
  • Usually a SKOS taxonomy
  • Managed as another dataset in the dataset management
  • Example: Industry taxonomy
    • Standard taxonomy (NACE, SIC, NAICS) or company specific
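
The idea of combining built-in detectors with a taxonomy-backed, user-defined type can be sketched as follows. Everything here is an assumption for illustration: the tiny industry label set stands in for a SKOS taxonomy, and the coverage threshold and detector names are invented.

```python
from datetime import datetime

# Hypothetical stand-in for the labels of an industry taxonomy.
INDUSTRY_TAXONOMY = {"banking", "utilities", "electrical equipment"}


def _parses(value, parser):
    try:
        parser(value)
        return True
    except ValueError:
        return False


# Each detector decides whether a single value belongs to its type.
DETECTORS = {
    "number": lambda v: _parses(v, float),
    "date": lambda v: _parses(v, lambda s: datetime.strptime(s, "%d-%m-%Y")),
    "industry": lambda v: v.strip().lower() in INDUSTRY_TAXONOMY,
}


def detect_type(values, min_coverage=0.9):
    """Return the type whose detector accepts the largest share of values."""
    best_type, best_coverage = "string", 0.0
    for name, matches in DETECTORS.items():
        coverage = sum(matches(v) for v in values) / len(values)
        if coverage >= min_coverage and coverage > best_coverage:
            best_type, best_coverage = name, coverage
    return best_type


print(detect_type(["Banking", "Utilities", "Electrical Equipment"]))
print(detect_type(["26-01-2019", "02-05-2023"]))
```

Adding a further user-defined type in this sketch amounts to registering one more label set and detector, which mirrors managing the taxonomy as just another dataset.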

Page 18: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


FORMATS AND STATISTICS

• For some types, the data format is detected
  • Example: Dates are formatted in DD-MM-YYYY
• Two functions are generated:
  1. Parser that is able to read the detected representation
  2. Normalizer that converts the parsed values into a configurable, organization-wide target representation
• Statistics summarize the values:
  • Value range and distribution
  • Most frequent values
  • Data selectivity
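
The parser / normalizer pair can be illustrated for dates with the standard library. This is a sketch under assumptions: the candidate format list and the ISO-style target representation are chosen for the example, and the function names are hypothetical.

```python
from datetime import datetime

# Assumed candidate formats; DD-MM-YYYY corresponds to "%d-%m-%Y".
CANDIDATE_FORMATS = ["%d-%m-%Y", "%Y-%m-%d", "%m/%d/%Y"]


def detect_date_format(values):
    """Return the first candidate format that parses every value, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            for value in values:
                datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    return None


def make_parser(fmt):
    """Parser for the detected representation."""
    return lambda text: datetime.strptime(text, fmt)


def make_normalizer(target_fmt="%Y-%m-%d"):
    """Normalizer into a configurable target representation."""
    return lambda value: value.strftime(target_fmt)


values = ["26-01-2019", "02-05-2023"]
fmt = detect_date_format(values)
parse = make_parser(fmt)
normalize = make_normalizer()
print(fmt, [normalize(parse(v)) for v in values])
```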

Page 19: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

DISCOVERY VIEW

• Datasets are matched based on their metadata (profiling + business data)


Page 20: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


INTEGRATION PROCESS


Page 21: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


DATA INTEGRATION

• The integration process is driven by a set of rules
  • Lifting Rules map the source datasets to an ontology
  • Linking Rules connect different datasets to a knowledge graph
• Rules are operator trees, consisting of four types of operators
  • Data Access Operators
  • Transformation Operators
  • Similarity Operators
  • Aggregation Operators
• Rules can be learned using genetic programming algorithms
• Rules are human-understandable and can be edited
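
To make the operator-tree idea concrete, here is a small sketch with one class per operator type. The class names, the chosen similarity measures (exact equality and `difflib` string ratio) and the sample entities are assumptions; the actual operators in the product are not described on the slide.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Union


@dataclass
class DataAccess:
    """Data access operator: reads one property value from an entity."""
    prop: str

    def __call__(self, entity: dict) -> str:
        return entity[self.prop]


@dataclass
class Transformation:
    """Transformation operator: applies a function to its child's value."""
    func: Callable[[str], str]
    child: Union[DataAccess, "Transformation"]

    def __call__(self, entity: dict) -> str:
        return self.func(self.child(entity))


@dataclass
class Similarity:
    """Similarity operator: compares a source value with a target value."""
    measure: Callable[[str, str], float]
    source: Union[DataAccess, Transformation]
    target: Union[DataAccess, Transformation]

    def __call__(self, a: dict, b: dict) -> float:
        return self.measure(self.source(a), self.target(b))


@dataclass
class Aggregation:
    """Aggregation operator: combines the scores of its children."""
    combine: Callable[[List[float]], float]
    children: list

    def __call__(self, a: dict, b: dict) -> float:
        return self.combine([child(a, b) for child in self.children])


def equality(x, y):
    return 1.0 if x == y else 0.0


def string_ratio(x, y):
    return SequenceMatcher(None, x, y).ratio()


# Rule: both the ISIN and the issuer name must agree (minimum of the two scores).
rule = Aggregation(min, [
    Similarity(equality,
               Transformation(str.upper, DataAccess("isin")),
               Transformation(str.upper, DataAccess("ISIN"))),
    Similarity(string_ratio,
               Transformation(str.lower, DataAccess("name")),
               Transformation(str.lower, DataAccess("issuer"))),
])

bond = {"isin": "de000a1g85b4", "name": "Siemens"}
rating = {"ISIN": "DE000A1G85B4", "issuer": "SIEMENS"}
other = {"ISIN": "CA639832AA25", "issuer": "NEDWBK"}
print(rule(bond, rating), rule(bond, other))
```

Because the rule is a plain tree of small operators, it stays readable and editable, and a learning algorithm can vary it by swapping subtrees.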

Page 22: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


DATASET LIFTING

• Objective: Map the datasets in the data lake to a consistent vocabulary.
• A lifting rule consists of a number of mappings
  • Each mapping assigns a term in the original dataset (such as a column for tabular data) to a term in the target ontology (such as a property provided by an ontology).
• Multiple mappings for each dataset can be managed to allow different views on the same data.
• Initial mappings are generated automatically based on the profiling results; the user can then build on them.

Page 23: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


LIFTING EXAMPLE

Bond                                         ISIN          Country  Industry
NEDWBK CAD 5,2%25                            CA639832AA25  Canada   Banking
SIEMENSF1.50%03/20                           DE000A1G85B4  Germany  Electrical Equipment
Electricite de France (EDF), 6,5% 26jan2019  USF2893TAB29  France   Utilities

[Graph view: the bond NEDWBK CAD 5,2%25 with the properties fibo:hasSecurityIdentifier (“CA639832AA25”), fibo:legallyRecordedIn (pointing into the Country Ontology; nodes shown: France, Germany, EMEA) and fibo:industrySector (pointing into the Industry Ontology; nodes shown: Utilities, Banking)]
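
Applied to one row of the example table, a lifting rule can be sketched as a column-to-property dictionary. The mapping below is read off the slide's graph; the `lift` function, its parameters and the choice of the Bond column as subject are assumptions. Unlike the slide's graph, the sketch keeps country and industry as plain values instead of resolving them to ontology resources.

```python
# Column -> ontology property, as suggested by the example graph.
MAPPINGS = {
    "ISIN": "fibo:hasSecurityIdentifier",
    "Country": "fibo:legallyRecordedIn",
    "Industry": "fibo:industrySector",
}


def lift(row, key_column="Bond", mappings=MAPPINGS):
    """Turn one tabular row into (subject, property, value) triples."""
    subject = row[key_column]
    return [(subject, prop, row[col]) for col, prop in mappings.items() if row.get(col)]


row = {
    "Bond": "NEDWBK CAD 5,2%25",
    "ISIN": "CA639832AA25",
    "Country": "Canada",
    "Industry": "Banking",
}
for triple in lift(row):
    print(triple)
```

Keeping several such dictionaries per dataset corresponds to managing multiple mappings for different views on the same data.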

Page 24: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


LINKING

• Goal: Connect individual datasets to a knowledge graph
• Identify related entities in different datasets and link them
  • Either entities describing the same real-world object or another relation

[Graph view: the bond NEDWBK CAD 5,2%25 is connected via hasRating to the entity Rating CAD 5,2%25, which has the ratingScore “AAA”; fibo:legallyRecordedIn edges point into the Country Ontology (node shown: EMEA) and fibo:industrySector edges point into the Industry Ontology]

Page 25: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


LINKAGE RULES

• Linking is based on domain-specific rules
• Specify the conditions that must hold true for two entities to be linked

Page 26: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


LEARNING LINKAGE RULES

Problem: Manually writing rules is time-consuming and requires expertise

Approach: Interactive machine learning algorithm for generating rules
• Generates a rule based on a number of user-confirmed link candidates.
• Link candidates are actively selected by the learning algorithm to include link candidates that yield a high information gain.
• The user does not need any knowledge of the characteristics of the dataset or any particular similarity computation techniques.
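
The selection of informative link candidates can be illustrated with a simple query-by-committee scheme. This is a stand-in technique chosen for the sketch, not the algorithm used by the product: a committee of candidate rules (here just similarity thresholds) votes on each candidate, the candidate with the most evenly split vote is shown to the user, and the answer prunes the committee.

```python
from difflib import SequenceMatcher


def similarity(pair):
    a, b = pair
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def disagreement(pair, thresholds):
    """1.0 if the committee is split evenly, 0.0 if it is unanimous."""
    share = sum(similarity(pair) >= t for t in thresholds) / len(thresholds)
    return 1.0 - abs(2 * share - 1.0)


def next_query(candidates, thresholds):
    """Pick the link candidate the committee disagrees on most."""
    return max(candidates, key=lambda pair: disagreement(pair, thresholds))


def refine(thresholds, pair, is_link):
    """Keep only the candidate rules that agree with the user's answer."""
    score = similarity(pair)
    return [t for t in thresholds if (score >= t) == is_link]


thresholds = [0.3, 0.5, 0.7, 0.9]          # hypothetical committee of rules
candidates = [
    ("Siemens", "Siemens"),
    ("Siemens AG", "Siemens"),
    ("Siemens", "EDF"),
]
query = next_query(candidates, thresholds)
print("ask user about:", query)
print("remaining rules:", refine(thresholds, query, is_link=True))
```

The user only answers yes or no for the proposed pair; no knowledge of the similarity measure or threshold values is needed.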

Page 27: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

INTEGRATION PROCESS


Page 28: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


VIEW GENERATION

• The user selects a set of lifted and linked datasets

Page 29: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

DATA ACCESS

• Generate data flows based on Apache Spark
• The data flows utilize Resilient Distributed Datasets (RDDs)
• RDDs derive new data sets from existing data sets by applying a chain of transformations
• A derived data set can either
  • be recomputed on-the-fly
  • be persisted on stable storage
• Data flows can be executed efficiently on Hadoop clusters.

[Data flow diagram, Hadoop Data Lake: Corporate Bonds → Data Lifting 1 (Apache Spark RDD); Internal Ratings → Data Lifting 2 (Apache Spark RDD); External Ratings → Data Lifting 3 (Apache Spark RDD); the lifted data sets feed Data Linking (Apache Spark RDD) in eccenca Corporate Memory, which serves the Data Consumer via SQL, CSV, Excel and the Spark API]
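
The difference between recomputing a derived data set and persisting it can be shown without a cluster. The class below is a plain-Python stand-in written for this transcript, not the Spark API: `LazyDataset` and its methods are hypothetical, and only mimic the idea of a chain of lazy transformations.

```python
class LazyDataset:
    """Plain-Python stand-in for a lazily derived, RDD-like data set."""

    def __init__(self, compute):
        self._compute = compute      # function returning an iterable of records
        self._persisted = False
        self._cache = None

    def map(self, func):
        return LazyDataset(lambda: (func(x) for x in self.collect()))

    def filter(self, pred):
        return LazyDataset(lambda: (x for x in self.collect() if pred(x)))

    def persist(self):
        """Mark the data set so that its first result is kept for reuse."""
        self._persisted = True
        return self

    def collect(self):
        if self._cache is not None:
            return self._cache
        result = list(self._compute())   # runs the whole chain of transformations
        if self._persisted:
            self._cache = result
        return result


reads = {"count": 0}


def load_bonds():
    reads["count"] += 1                  # count how often the source is read
    return [
        {"Bond": "NEDWBK CAD 5,2%25", "Country": "Canada"},
        {"Bond": "SIEMENSF1.50%03/20", "Country": "Germany"},
    ]


lifted = LazyDataset(load_bonds).map(
    lambda r: (r["Bond"], "fibo:legallyRecordedIn", r["Country"])
)
lifted.collect()
lifted.collect()                         # recomputed on the fly: source read again
recomputed_reads = reads["count"]

lifted.persist()
lifted.collect()                         # computed once more, then kept
lifted.collect()                         # served from the persisted result
print(recomputed_reads, reads["count"])
```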

Page 30: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

DEMO

Page 31: Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

Contact
Dr. Robert Isele
Tel: +49 151 17238616
email: [email protected]

eccenca
Command your Data!