LECTURE @DHBW: DATA WAREHOUSE PART XI: DATA VAULT …buckenhofer/20182DWH/Buckenhofer-… · Hadoop...
A company of Daimler AG
LECTURE @DHBW: DATA WAREHOUSE
PART XI: DATA VAULT MODELING
ANDREAS BUCKENHOFER, DAIMLER TSS
ABOUT ME
https://de.linkedin.com/in/buckenhofer
https://twitter.com/ABuckenhofer
https://www.doag.org/de/themen/datenbank/in-memory/
http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/
https://www.xing.com/profile/Andreas_Buckenhofer2
Andreas Buckenhofer
Senior DB [email protected]
Since 2009 at Daimler TSS
Department: Big Data
Business Unit: Analytics
ANDREAS BUCKENHOFER, DAIMLER TSS GMBH
Data Warehouse / DHBWDaimler TSS 3
“Forming good abstractions and avoiding complexity is an essential part of a successful data architecture”
Data has always been my main focus during my long career in data integration. I work for Daimler TSS as a Database Professional and Data Architect with over 20 years of experience in Data Warehouse projects. I have been working with Hadoop and NoSQL since 2013. I keep my knowledge up to date, and I learn new things, experiment, and program every day.
I share my knowledge in internal presentations and as a speaker at international conferences. I regularly give a full lecture on Data Warehousing and a seminar on modern data architectures at Baden-Wuerttemberg Cooperative State University (DHBW). I also gained international experience through a two-year project in Greater London and several business trips to Asia.
I’m responsible for In-Memory DB Computing at the independent German Oracle User Group (DOAG) and was honored by Oracle as ACE Associate. I hold current certifications such as “Certified Data Vault 2.0 Practitioner (CDVP2)”, “Big Data Architect”, “Oracle Database 12c Administrator Certified Professional”, and “IBM InfoSphere Change Data Capture Technical Professional”.
Contact/Connect
As a 100% Daimler subsidiary, we give 100 percent, always and never less.
We love IT and pull out all the stops to aid Daimler's development with our expertise on its journey into the future.
Our objective: We make Daimler the most innovative and digital mobility company.
NOT JUST AVERAGE: OUTSTANDING.
Daimler TSS
INTERNAL IT PARTNER FOR DAIMLER
+ Holistic solutions according to the Daimler guidelines
+ IT strategy
+ Security
+ Architecture
+ Developing and securing know-how
+ TSS is a partner who can be trusted with sensitive data
As a subsidiary: maximum added value for Daimler
+ Market closeness
+ Independence
+ Flexibility (short decision-making process, ability to react quickly)
Daimler TSS
LOCATIONS
Daimler TSS China
Hub Beijing
10 employees
Daimler TSS Malaysia
Hub Kuala Lumpur
42 employees
Daimler TSS India
Hub Bangalore
22 employees
Daimler TSS Germany
7 locations
1000 employees*
Ulm (Headquarters)
Stuttgart
Berlin
Karlsruhe
* as of August 2017
WHAT YOU WILL LEARN TODAY
After the end of this lecture you will be able to:
Understand differences in data modeling between OLTP and OLAP
Understand why data modeling is important
Understand data modeling in the Core Warehouse Layer and Data Mart Layer
• Data Vault
• Dimensional Model / Star schema
Understand dimensions and facts
Understand ROLAP & MOLAP
LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE
[Architecture diagram]
Data sources: internal data sources (OLTP), external data sources
Data Warehouse (backend and frontend) layers:
▪ Staging Layer (Input Layer)
▪ Integration Layer (Cleansing Layer)
▪ Core Warehouse Layer (Storage Layer)
▪ Aggregation Layer
▪ Mart Layer (Output Layer / Reporting Layer)
Cross-cutting components: Metadata Management, Security, DWH Manager incl. Monitor
DATA MODELS IN THE DWH
Staging Layer
Characteristics:
▪ Temporary storage
▪ Ingest of source data
Data model:
▪ Normally a 1:1 copy of the source table structure, usually without constraints and indexes
Core Warehouse Layer
Characteristics:
▪ Historization / bitemporal data
▪ Integration
▪ Tool-independent
▪ Non-redundant data storage
Data model:
▪ 3NF with historization
▪ Head and Version modelling
▪ Data Vault
▪ Anchor modeling
▪ Dimensional model with historization (possible)
Data Mart Layer
Characteristics:
▪ Performance for end user queries required
▪ Tool-dependent
▪ Lots of joins necessary to answer complex questions
Data model:
▪ Flat structures, esp. Dimensional model (ROLAP / MOLAP / HOLAP)
DATA MODELING: 3NF, STAR SCHEMA, DATA VAULT
[Diagram: vehicle data mapped onto 3NF, star schema, and Data Vault structures]
Concepts: Business Key, Relationships, Context and history
Modeling structures: 3NF; Star: Dimensions; Star: Facts; Data Vault: Hub; Data Vault: Link; Data Vault: Sat
Vehicle data: Vehicle identifier, Manufacturer, Model, Type, Plant, Delivery Date, Production Date, Price, Source system, Load date/time, Buyer, Salesperson, Engine
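A minimal sketch of how one flat vehicle record could be decomposed along the three concepts (business key, relationships, context and history). The table names H_PLANT and L_VEHICLE_PLANT, the field names, and the sample values are assumptions made for illustration; they are not taken from the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hub:
    """Holds a business key only."""
    name: str
    business_key: str


@dataclass(frozen=True)
class Link:
    """Holds a relationship between two or more business keys."""
    name: str
    business_keys: tuple


@dataclass(frozen=True)
class Sat:
    """Holds context attributes for one hub, stamped with a load time."""
    parent: str
    load_ts: str
    attributes: tuple


def decompose(record: dict, load_ts: str):
    """Split one flat record into hub, link and sat parts."""
    hubs = [
        Hub("H_VEHICLE", record["vehicle_id"]),
        Hub("H_PLANT", record["plant"]),  # hypothetical hub
    ]
    links = [Link("L_VEHICLE_PLANT", (record["vehicle_id"], record["plant"]))]
    # Everything that is not a business key is treated as context.
    context = {k: v for k, v in record.items() if k not in ("vehicle_id", "plant")}
    sats = [Sat("H_VEHICLE", load_ts, tuple(sorted(context.items())))]
    return hubs, links, sats


# Sample values are invented for the example.
record = {"vehicle_id": "V1", "plant": "P1", "model": "SUV", "price": 50000}
hubs, links, sats = decompose(record, "2015-01-15T00:00:00")
```

The same record thus ends up in three kinds of structures, which is why one source table is read several times during loading.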
DATA VAULT - ARCHITECTURE, METHODOLOGY, MODEL
Lecture part 1: DWH Architectures
Lecture part 2: DWH Data Modeling
Architecture
• Multi-Tier
• Scalable
• Supports NoSQL
Methodology
• Repeatable
• Measurable
• Agile
Model
• Flexible
• Hash based
• Hub & Spoke
Implementation: Automation, Pattern based, High speed
DATA VAULT 2.0, DEFINITION BY DAN LINSTEDT
“Data Vault 2.0” is a system of business intelligence which includes: Modeling, Methodology, Architecture, and Implementation best practices. The components, also known as pillars of Data Vault 2.0, are identified as follows:
• Data Vault 2.0 Modeling - Focused on Process and Data Models
• Data Vault 2.0 Methodology – Following SCRUM and agile ways of working
• Data Vault 2.0 Architecture – Includes NoSQL and big-data systems
• Data Vault 2.0 Implementation – Pattern based automation and generation
The term “Data Vault” is merely a marketing term chosen in 2001 to represent the system to the market. The true name for the Data Vault System of BI is: common foundational warehouse modeling, methodology, architecture, and implementation.
Source: https://www.linkedin.com/pulse/defining-data-vault-10-20-business-dan-linstedt/
HUB
Unique identification by natural keys (Business Keys)
STRUCTURE HUB TABLES
HUB TABLES: TYPICAL CHARACTERISTICS
Business Keys should be natural keys used by the business (e.g. Vehicle Identifier, Serial number)
Business Keys should stand alone and have meaning to the business
Business Keys should never change, have the same semantic meaning and the same granularity
Focusing on Business Keys (instead of on source system surrogates) ensures that the result serves the needs of the business
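A minimal sketch of a hub load, assuming a hash key derived from the business key as is common in Data Vault 2.0 ("hash based" model). MD5, the trim/upper-case normalization, and the column names are assumptions chosen for illustration; a real project would fix these in its own standards.

```python
import hashlib
from datetime import datetime, timezone


def hash_key(business_key: str) -> str:
    """Derive a deterministic hash key from a normalized business key."""
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def load_hub(hub: dict, business_keys, record_source: str) -> int:
    """Insert business keys not yet in the hub; return the number of new rows."""
    inserted = 0
    load_ts = datetime.now(timezone.utc).isoformat()
    for bk in business_keys:
        hk = hash_key(bk)
        if hk not in hub:  # existing rows are never updated or overwritten
            hub[hk] = {
                "business_key": bk,
                "load_ts": load_ts,
                "record_source": record_source,
            }
            inserted += 1
    return inserted


h_vehicle = {}
first = load_hub(h_vehicle, ["V1", "V2", "V1", "V3"], "stg_vehicle")  # 3 new rows
second = load_hub(h_vehicle, ["V1", "V4"], "stg_vehicle")             # 1 new row
```

Because the hub only stores keys it has not seen before, repeated loads of the same source data leave it unchanged.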
TYING BUSINESS PROCESSES TO BUSINESS KEYS
[Diagram: business keys traced across business processes over time]
Business processes shown: Customer Contact, Sales, Procurement, Contracts, Planning, Manufacturing, Delivery, Finance, Revenue ($$)
Business keys shown: SLS123, *P123MFG
Annotations: Excel Spreadsheet, Manual Process, NO VISIBILITY!
© Copyright 1990-2017, Dan Linstedt, all rights reserved
LINK
Unique relationships between Business Keys (HUBs)
STRUCTURE LINK TABLES
LINK TABLES: TYPICAL CHARACTERISTICS
A LINK models a relationship between 2 or more HUBs
The relationship is always n:m
The composed key must be unique. One of the foreign keys is the driving key
Link-to-Link relationships are allowed but should be avoided in a physical implementation because they create load dependencies
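A minimal sketch of a link load, assuming the link key is a hash over the combined business keys of the participating hubs. The delimiter, MD5, and the table name L_VEHICLE_ENGINE are assumptions; for brevity the sketch stores business keys directly, where a physical link would typically store the hub keys as foreign keys.

```python
import hashlib


def link_hash_key(*business_keys: str, delimiter: str = "|") -> str:
    """Hash the combined business keys of all hubs taking part in the link."""
    combined = delimiter.join(bk.strip().upper() for bk in business_keys)
    return hashlib.md5(combined.encode("utf-8")).hexdigest()


def load_link(link: dict, pairs) -> int:
    """Insert each distinct key combination once; return the number of new rows."""
    inserted = 0
    for vehicle_id, engine_id in pairs:
        lhk = link_hash_key(vehicle_id, engine_id)
        if lhk not in link:  # the composed key must stay unique
            link[lhk] = {"vehicle_id": vehicle_id, "engine_id": engine_id}
            inserted += 1
    return inserted


l_vehicle_engine = {}  # hypothetical link table
new_rows = load_link(
    l_vehicle_engine,
    [("V1", "E1"), ("V2", "E2"), ("V1", "E1"), ("V1", "E4")],
)
```

V1 appears with both E1 and E4, which the n:m structure of the link accepts without any schema change.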
CANDIDATES FOR LINKS
• Relationships / Associations
• Foreign Keys in OLTP systems
• Hierarchies and Redefinitions
• Hierarchical relationships are modeled by one link and two connections to HUBs: HAL (parent-child LINK) and SAL (same-as LINK)
• Transactions and events are often modeled as a Link (could also be a Hub)
• E.g. sales order or sensor data
• Intensive discussions about modeling as Hub or Link at conferences and on social media (the modeling solution depends on requirements, context, etc.)
SAT
Descriptive, detailed, current and historized data
STRUCTURE SAT TABLES
SAT TABLES: TYPICAL CHARACTERISTICS
Contains all non-key attributes
Is connected to exactly one Hub or Link
HUB or LINK tables can (should) have several SAT tables, e.g. by source system
In the extreme case, a SAT can contain only one column (or any number of columns)
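A minimal sketch of an insert-only satellite load. It assumes change detection via a hash over all descriptive attributes (often called a hash diff in Data Vault 2.0) and that rows arrive in chronological order; the column names and sample values are invented for the example.

```python
import hashlib


def hash_diff(attributes: dict) -> str:
    """Hash all descriptive attributes in a fixed (sorted) column order."""
    payload = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def load_sat(sat: list, parent_key: str, attributes: dict, load_ts: str) -> bool:
    """Append a new version only if the attributes differ from the latest one."""
    versions = [row for row in sat if row["parent_key"] == parent_key]
    new_diff = hash_diff(attributes)
    if versions and versions[-1]["hash_diff"] == new_diff:
        return False  # unchanged: nothing is stored
    sat.append({"parent_key": parent_key, "load_ts": load_ts,
                "hash_diff": new_diff, **attributes})
    return True


s_vehicle = []
load_sat(s_vehicle, "V1", {"model": "SUV", "color": "red"}, "2015-01-15")
load_sat(s_vehicle, "V1", {"model": "SUV", "color": "red"}, "2015-01-16")   # no change
load_sat(s_vehicle, "V1", {"model": "SUV", "color": "blue"}, "2015-01-17")  # new version
```

Existing rows are never updated; history is kept by appending a new version whenever the attribute values change.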
SAT TABLE DESIGN
Different criteria to design SAT tables (separate data into different SAT tables)
• Source system
• Rate of change
• Data types (e.g. separate CLOBs or other lengthy textual fields)
Rate of change: separate attributes in order to avoid redundant storage of data
[Diagram: attributes split into "data that change often" and "data that do not change"; attributes shown: Delivery date, SW-Version Control unit, Theft message, Color, Comments, Interior]
HOW MANY ROWS ARE STORED IN THE HUB AND LINK TABLES?
Staging data in table stg_vehicle from 15.01.2015
vehicleid  model   productiondate  engine  color
V1         SUV     15.01.13        E1      red
V2         Cabrio  16.01.13        E2      blue
V1         SUV     15.01.13        E1      red
V3         Cabrio  17.01.13        E3      red
Staging data in table stg_vehicle from 16.01.2015
vehicleid  model   productiondate  engine  color
V1         SUV     16.01.13        E4      red
V4         Cabrio  17.01.13        E5      blue
Staging data in table stg_vehicle from 17.01.2015
vehicleid  model   productiondate  engine  color
V1         SUV     16.01.13        E1      red
• H_VEHICLE: 4 rows (V1, V2, V3, V4)
• H_ENGINE: 5 rows (E1, E2, E3, E4, E5)
• L_PLUGGED_IN_EFFECTIVITY: 5 rows (V1-E1, V2-E2, V3-E3, V1-E4, V4-E5)
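The row counts can be checked with a short sketch that treats each hub as the set of distinct business keys and the link as the set of distinct key pairs across all three staging loads. The variable names are chosen for illustration only.

```python
# Rows of stg_vehicle per load day: (vehicleid, model, productiondate, engine, color)
staging_loads = [
    [("V1", "SUV", "15.01.13", "E1", "red"),
     ("V2", "Cabrio", "16.01.13", "E2", "blue"),
     ("V1", "SUV", "15.01.13", "E1", "red"),
     ("V3", "Cabrio", "17.01.13", "E3", "red")],
    [("V1", "SUV", "16.01.13", "E4", "red"),
     ("V4", "Cabrio", "17.01.13", "E5", "blue")],
    [("V1", "SUV", "16.01.13", "E1", "red")],
]

h_vehicle, h_engine, l_plugged_in = set(), set(), set()
for load in staging_loads:
    for vehicleid, _model, _productiondate, engine, _color in load:
        h_vehicle.add(vehicleid)               # one hub row per distinct vehicle key
        h_engine.add(engine)                   # one hub row per distinct engine key
        l_plugged_in.add((vehicleid, engine))  # one link row per distinct pair

print(len(h_vehicle), len(h_engine), len(l_plugged_in))  # 4 5 5
```

The third load adds nothing to the link because the pair V1-E1 was already stored on the first day.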
HOW MANY ROWS ARE STORED IN THE FIRST 3 SAT TABLES?
Staging data in table stg_vehicle from 15.01.2015
vehicleid  model   productiondate  engine  color
V1         SUV     15.01.13        E1      red
V2         Cabrio  16.01.13        E2      blue
V1         SUV     15.01.13        E1      red
V3         Cabrio  17.01.13        E3      red
Staging data in table stg_vehicle from 16.01.2015
vehicleid  model   productiondate  engine  color
V1         SUV     16.01.13        E4      red
V4         Cabrio  17.01.13        E5      blue
Staging data in table stg_vehicle from 17.01.2015
vehicleid  model   productiondate  engine  color
V1         SUV     16.01.13        E1      red
[Solution diagram: row counts shown on the slide: 5, 4, 6, 4, 5, 5]
ENSEMBLE MODELING
Hans Hultgren: “An ensemble is a representation of a Core Business Concept including all of its parts – the business key, with context and relationships”
[Diagram labels: Source (e.g. vehicle), Entity, Decomposition, Ensemble]
ENSEMBLE MODELING – NOT JUST DATA VAULT 2.0
EXERCISE DATA VAULT
The following data model shows vehicle sales with entities
• Person (sales_person and owner)
• Vehicle
• Production_plant
Architect a Data Vault model for the Core Warehouse Layer
SAMPLE SOLUTION DATA VAULT
DATA VAULT - ADVANTAGES
• Flexible / agile approach
• Highly parallel data loads, scalable
• Automatable
• Systematic approach that covers historization and integration
• Full auditability
• No updates or deletes on business data
• Horizontal and vertical partitioning
• Supports / combines RDBMS and Hadoop/NoSQL technologies
• Separates soft and hard rules into different parts of the data integration
DATA VAULT - DISADVANTAGES
• More tables
• More joins
• Performance to load Data Marts can be a challenge
• Logic to load Data Marts can be rather complex if many tables are involved
• All relationships are modeled n:m (documentation necessary!). Data Vault assumes the worst-case scenario for relationships
• The same source table is used several times while loading HUBs, SATs, LINKs
• Data Vault is an additional layer compared to a Kimball DWH, bringing in some additional overhead
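The extra joins needed to load a Data Mart can be illustrated with a small sketch using Python's built-in sqlite3 module. The table layout (h_vehicle, s_vehicle), the column names, and the sample rows are assumptions for illustration; the query builds a flat dimension from a hub and the latest version of each satellite row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE h_vehicle (vehicle_hk TEXT PRIMARY KEY, vehicleid TEXT);
CREATE TABLE s_vehicle (vehicle_hk TEXT, load_ts TEXT, model TEXT, color TEXT);
INSERT INTO h_vehicle VALUES ('hk1', 'V1'), ('hk2', 'V2');
INSERT INTO s_vehicle VALUES
  ('hk1', '2015-01-15', 'SUV', 'red'),
  ('hk1', '2015-01-16', 'SUV', 'blue'),
  ('hk2', '2015-01-15', 'Cabrio', 'blue');
""")

# Join the hub to its satellite and keep only the most recent version per key.
dim_vehicle = conn.execute("""
SELECT h.vehicleid, s.model, s.color
FROM h_vehicle h
JOIN s_vehicle s ON s.vehicle_hk = h.vehicle_hk
WHERE s.load_ts = (SELECT MAX(s2.load_ts)
                   FROM s_vehicle s2
                   WHERE s2.vehicle_hk = s.vehicle_hk)
ORDER BY h.vehicleid
""").fetchall()
```

Even this single-hub, single-satellite case needs a join plus a subquery; with several satellites and links per hub the mart-loading logic grows accordingly.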
DATA VAULT - QUOTES
Bill Inmon: “Over multiple years, Dan improved the Data Vault and evolved it into Data Vault 2.0. Today this System Of Business Intelligence includes not only a more sophisticated model, but an agile methodology, a reference architecture for enterprise data warehouse systems, and best practices for implementation. The Data Vault 2.0 System Of Business Intelligence is ground-breaking, again. It incorporates concepts from massively parallel architectures, Big Data, real-time and unstructured data.”
Source: Linstedt / Olschimke: Building a Scalable Data Warehouse with Data Vault 2.0
Barry Devlin: “The Data Vault approach, since the early 2000s, promises a much-improved balance, with a hybrid of the normalized and star schema forms above. Version 2.0, introduced in 2013, consisting of a data model, methodology, and systems architecture, provides a design basis for data warehouses that emphasizes core data quality, consistency, and agility.”
Source: https://www.wherescape.com/media/3476/data-vault-thoughpoint-april-2017.pdf
DATA VAULT 2.0 BENEFITS ACCORDING TO DAN LINSTEDT
The issues Data Vault 2.0 is built to solve include:
• Global distributed Teams
• Global distributed physical data warehouse components
• “Lazy” joining during query time across multi-country servers
• Ingestion and query parsing of images, video, audio, documents (unstructured data)
• Ingestion of real-time streaming (IoT) data
• Cloud and On-premise seamless integration
• Agile Team Delivery
• Incorporation of Data Virtualization, and NoSQL platforms
• Extremely large data sets (into the Petabyte ranges and beyond)
• Automation and Generation of 80% of the work products
Source: https://www.linkedin.com/pulse/defining-data-vault-10-20-business-dan-linstedt/
Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com / Intranet-Portal-Code: @TSS
Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle
THANK YOU
5 LEVELS OF CMMI
1. You can't understand what you don't research
2. You can't define what you don't understand (standards, context, concepts)
3. You can't identify what you don't define (KPAs and structure)
4. You can't measure what you don't identify (KPAs and KPIs)
5. You can't optimize what you can't measure (KPIs and retrospective adaptation)
Source: https://www.linkedin.com/pulse/data-vault-20-beyond-model-dan-linstedt/
FORMAL DEFINITION DATA VAULT (1.0) BY DAN LINSTEDT
“The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses.”
Source: http://www.vertabelo.com/blog/technical-articles/data-vault-series-agile-modeling-not-an-option-anymore