Post on 22-May-2020
A company of Daimler AG
LECTURE @DHBW: DATA WAREHOUSE
03 COMMON DWH ARCHITECTURES
ANDREAS BUCKENHOFER, DAIMLER TSS
ABOUT ME
https://de.linkedin.com/in/buckenhofer
https://twitter.com/ABuckenhofer
https://www.doag.org/de/themen/datenbank/in-memory/
http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/
https://www.xing.com/profile/Andreas_Buckenhofer2
Andreas Buckenhofer
Senior DB Professional
andreas.buckenhofer@daimler.com
Since 2009 at Daimler TSS
Department: Big Data
Business Unit: Analytics
ANDREAS BUCKENHOFER, DAIMLER TSS GMBH
Data Warehouse / DHBWDaimler TSS 3
“Forming good abstractions and avoiding complexity is an essential part of a successful data architecture”
Data has always been my main focus during my long career in the area of data integration. I work for Daimler TSS as a Database Professional and Data Architect with over 20 years of experience in Data Warehouse projects. I have been working with Hadoop and NoSQL since 2013. I keep my knowledge up to date, and I learn new things, experiment, and program every day.
I share my knowledge in internal presentations and as a speaker at international conferences. I regularly give a full lecture on Data Warehousing and a seminar on modern data architectures at Baden-Wuerttemberg Cooperative State University (DHBW). I also gained international experience through a two-year project in Greater London and several business trips to Asia.
I’m responsible for In-Memory DB Computing at the independent German Oracle User Group (DOAG) and was honored by Oracle as an ACE Associate. I hold current certifications such as “Certified Data Vault 2.0 Practitioner (CDVP2)”, “Big Data Architect”, “Oracle Database 12c Administrator Certified Professional”, “IBM InfoSphere Change Data Capture Technical Professional”, etc.
Contact/Connect
As a 100% Daimler subsidiary, we give
100 percent, always and never less.
We love IT and pull out all the stops to
aid Daimler's development with our
expertise on its journey into the future.
Our objective: We make Daimler the
most innovative and digital mobility
company.
NOT JUST AVERAGE: OUTSTANDING.
Daimler TSS
INTERNAL IT PARTNER FOR DAIMLER
+ Holistic solutions according to the Daimler guidelines
+ IT strategy
+ Security
+ Architecture
+ Developing and securing know-how
+ TSS is a partner who can be trusted with sensitive data
As subsidiary: maximum added value for Daimler
+ Market closeness
+ Independence
+ Flexibility (short decision-making process, ability to react quickly)
Daimler TSS
LOCATIONS
Daimler TSS China
Hub Beijing
10 employees
Daimler TSS Malaysia
Hub Kuala Lumpur
42 employees
Daimler TSS India
Hub Bangalore
22 employees
Daimler TSS Germany
7 locations
1000 employees*
Ulm (Headquarters)
Stuttgart
Berlin
Karlsruhe
* as of August 2017
• Describe different DWH architectures
• Explain DWH data modeling methods and design logical models
• Name DB techniques that are well-suited for DWHs
• Explain ETL processes
• Specify reporting, project management, and metadata requirements
• Name current DWH trends
DWH LECTURE - LEARNING TARGETS
The article
http://www.kimballgroup.com/2004/03/differences-of-opinion/
compares THE two classic DWH architectures.
Read the paper and complete the table / questions on the next slide.
(Caution: The paper is biased / favors one approach; you may want to read other/more papers for a neutral view.)
EXERCISE: CLASSICAL DWH ARCHITECTURES
What are the approaches called?
Who “invented” the approach?
How many layers are used and what are the layers called?
Which data modeling approaches are used in which layer?
In which layer are atomic detail data stored?
In which layer are aggregated / summary data stored?
List at least 2 advantages
List at least 2 disadvantages
EXERCISE: CLASSICAL DWH ARCHITECTURES
What are the approaches called?
• Kimball Bus Architecture
• Corporate Information Factory (CIF)
Who “invented” the approach?
• Kimball Bus Architecture: Ralph Kimball
• Corporate Information Factory: Bill Inmon
How many layers are used and what are the layers called?
• Kimball Bus Architecture (2 layers): Data Staging, Dimensional Data Warehouse
• Corporate Information Factory (3 layers): Data Acquisition, Normalized Data Warehouse, Data Delivery / Dimensional Mart
Which data modeling approaches are used in which layer?
• Kimball Bus Architecture
  • Data Staging: variable, corresponds to source system
  • Dimensional Data Warehouse: Dimensional Model
• Corporate Information Factory
  • Data Acquisition: variable, corresponds to source system
  • Normalized Data Warehouse: 3NF
  • Data Delivery: Dimensional Model
In which layer are atomic detail data stored?
• Kimball Bus Architecture: Dimensional Data Warehouse
• Corporate Information Factory: Normalized Data Warehouse
In which layer are aggregated / summary data stored?
• Kimball Bus Architecture: Dimensional Data Warehouse
• Corporate Information Factory: Data Delivery / Dimensional Mart
EXERCISE: CLASSICAL DWH ARCHITECTURES
Advantages
• Kimball Bus Architecture
  • Two layers only mean faster development and less work
  • Rather simple approach to make data fast and easily accessible
  • Lower startup costs (but higher subsequent development costs)
• Corporate Information Factory
  • Separation of concerns: long-term enterprise data storage separated from data presentation
  • Changes in requirements and scope are easier to manage
  • Lower subsequent development costs (but higher startup costs)
Disadvantages
• Kimball Bus Architecture
  • If table structures change (unstable source systems), high effort to implement the changes and reload data, especially for conformed dimensions (“Dimensionitis” disease)
  • Non-metric data not optimal for dimensional model
  • Dimensional model (esp. Star Schema) contains data redundancy
• Corporate Information Factory
  • Data model transformations from 3NF to Dimensional model required
  • More complex as two different data models are required
  • Larger team(s) of specialists required
• Kimball Bus Architecture (Central data warehouse based on data marts)
• Inmon Corporate Information Factory
• Data Vault 2.0 Architecture (Dan Linstedt)
• DW 2.0: The Architecture for the Next Generation of Data Warehousing
• Virtual Data Warehouse
• Operational Data Store (ODS)
OTHER ARCHITECTURES
KIMBALL BUS ARCHITECTURE (CENTRAL DATA WAREHOUSE BASED ON DATA MARTS)
Source: http://www.kimballgroup.com/2004/03/differences-of-opinion/
KIMBALL BUS ARCHITECTURE (CENTRAL DATA WAREHOUSE BASED ON DATA MARTS)
Figure: Data Warehouse (Backend and Frontend) in the Kimball Bus Architecture
• Sources: internal data sources (OLTP), external data sources
• Layers: Staging Layer (Input Layer); Core Warehouse Layer = Mart Layer (Data Mart 1, Data Mart 2, Data Mart 3)
• Cross-layer components: Metadata Management, Security, DWH Manager incl. Monitor
• Characteristics: more business-process oriented than subject-oriented, integrated, time-variant, non-volatile
• Bottom-up approach
• Dimensional model with denormalized data
• Sum of the data marts constitute the Enterprise DWH
• Enterprise Data Warehouse Bus / conformed dimensions for integration purposes
• (not to be confused with an Enterprise Service Bus (ESB), the middleware / communication system between applications)
• Kimball describes that agreeing on conformed dimensions is a hard job and it’s expected that the team will get stuck from time to time trying to align the incompatible original vocabularies of different groups
• Data marts need to be redesigned if incompatibilities exist
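The role of conformed dimensions can be illustrated with a minimal sketch. It assumes two hypothetical data marts (sales and inventory) that share one dimension table; all table and column names are invented for illustration and are not taken from the lecture. Each fact table is aggregated separately and the results are combined through the shared dimension (a "drill-across" query).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Conformed dimension shared by both data marts (hypothetical)
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
    -- Fact table of the hypothetical sales mart
    CREATE TABLE fact_sales (product_key INTEGER, quantity_sold INTEGER);
    -- Fact table of the hypothetical inventory mart
    CREATE TABLE fact_inventory (product_key INTEGER, quantity_on_hand INTEGER);

    INSERT INTO dim_product VALUES (1, 'Brake pad'), (2, 'Wiper blade');
    INSERT INTO fact_sales VALUES (1, 10), (1, 5), (2, 7);
    INSERT INTO fact_inventory VALUES (1, 40), (2, 3);
""")


def drill_across(con):
    """Aggregate each fact table on its own, then combine via the conformed dimension."""
    return con.execute("""
        SELECT d.product_name, s.sold, i.on_hand
        FROM dim_product d
        JOIN (SELECT product_key, SUM(quantity_sold) AS sold
              FROM fact_sales GROUP BY product_key) s
          ON s.product_key = d.product_key
        JOIN (SELECT product_key, SUM(quantity_on_hand) AS on_hand
              FROM fact_inventory GROUP BY product_key) i
          ON i.product_key = d.product_key
        ORDER BY d.product_name
    """).fetchall()


print(drill_across(con))
```

The combination only works because both marts agree on the keys and meaning of `dim_product`; if each mart had its own incompatible product dimension, the marts would have to be redesigned first.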
KIMBALL BUS ARCHITECTURE (CENTRAL DATA WAREHOUSE BASED ON DATA MARTS)
Figure label: Core Warehouse Layer
DATA INTEGRATION WITH AND WITHOUT CORE WAREHOUSE LAYER
INMON CORPORATE INFORMATION FACTORY
Source: http://www.kimballgroup.com/2004/03/differences-of-opinion/
INMON CORPORATE INFORMATION FACTORY
Figure: Data Warehouse (Backend and Frontend) in the Inmon Corporate Information Factory
• Sources: internal data sources (OLTP), external data sources
• Layers: Staging Layer (Input Layer); Core Warehouse Layer (Storage Layer); Mart Layer (Output Layer, Reporting Layer)
• Cross-layer components: Metadata Management, Security, DWH Manager incl. Monitor
• Characteristics: subject-oriented, integrated, time-variant, non-volatile
• Top-down approach
• (Normalized) Core Warehouse is essential for subject-oriented, integrated, time-variant and nonvolatile data storage
• Create (departmental) Data Marts as subsets of Core Enterprise DWH as needed
• Data Marts can be designed with Dimensional model
• The logical standard architecture is more general compared to CIF, but was mainly influenced by CIF
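As a rough illustration of the top-down idea, the following sketch derives a departmental data mart from a normalized core. The core tables, the mart tables, and the sample rows are hypothetical and serve only to show the 3NF-to-dimensional transformation step.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Normalized core warehouse (hypothetical 3NF tables)
    CREATE TABLE core_region (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE core_customer (customer_id INTEGER PRIMARY KEY, name TEXT, region_id INTEGER);
    CREATE TABLE core_order (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);

    INSERT INTO core_region VALUES (1, 'South'), (2, 'North');
    INSERT INTO core_customer VALUES (10, 'Alice', 1), (11, 'Bob', 2);
    INSERT INTO core_order VALUES (100, 10, 50.0), (101, 10, 25.0), (102, 11, 10.0);

    -- Departmental data mart derived from the core: denormalized dimension plus fact table
    CREATE TABLE mart_dim_customer AS
        SELECT c.customer_id AS customer_key, c.name, r.region_name
        FROM core_customer c JOIN core_region r ON r.region_id = c.region_id;
    CREATE TABLE mart_fact_order AS
        SELECT order_id, customer_id AS customer_key, amount FROM core_order;
""")


def revenue_by_region(con):
    """Report against the mart only; the core stays the long-term enterprise store."""
    return con.execute("""
        SELECT d.region_name, SUM(f.amount)
        FROM mart_fact_order f
        JOIN mart_dim_customer d ON d.customer_key = f.customer_key
        GROUP BY d.region_name
        ORDER BY d.region_name
    """).fetchall()


print(revenue_by_region(con))
```

Because the mart is rebuilt from the core, a changed reporting requirement only touches the mart definition, which is the separation of concerns the CIF approach aims for.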
INMON CORPORATE INFORMATION FACTORY
DATA VAULT 2.0 ARCHITECTURE – TODAY’S WORLD (DAN LINSTEDT)
DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)
Michael Olschimke, Dan Linstedt: Building a Scalable Data Warehouse with Data Vault 2.0, Morgan Kaufmann, 2015, Chapter 2.2
DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)
Figure: Data Warehouse (Backend and Frontend) in the Data Vault 2.0 Architecture
• Sources: internal data sources (OLTP), external data sources
• Layers: Staging Layer (Input Layer); Raw Data Vault; Business Data Vault; Mart Layer (Output Layer, Reporting Layer)
• Cross-layer components: Metadata Management, Security, DWH Manager incl. Monitor
• Annotations: Hard Rules only; Soft Rules
• Core Warehouse Layer is modeled with Data Vault and integrates data by BK (business key) “only” (Data Vault modeling is a separate lecture)
• Business rules (Soft Rules) are applied from Raw Data Vault Layer to Mart Layer and not earlier
• Alternatively from Raw Data Vault to additional layer called Business Data Vault
• Hard Rules don’t change data
• Data is fully auditable
• Real-time capable architecture
• The architecture has recently become very popular; it is also applicable to Big Data and NoSQL
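The split between hard rules and soft rules can be sketched as follows. This is a simplified illustration, not the Data Vault 2.0 reference implementation: the in-memory structures stand in for hub and satellite tables, and the record layout, source names, and the country mapping are assumptions made for the example.

```python
# Hypothetical in-memory stand-ins for Raw Data Vault tables
hub_customer = {}   # business key -> metadata of first arrival
sat_customer = []   # one row per delivered record, never updated or deleted


def load_raw_vault(record, source, load_ts):
    """Hard rules only: keep every delivered value unchanged, integrated by business key."""
    bk = record["customer_no"]
    hub_customer.setdefault(bk, {"record_source": source, "load_ts": load_ts})
    sat_customer.append({"customer_no": bk, "record_source": source,
                         "load_ts": load_ts, "payload": dict(record)})


def build_customer_mart():
    """Soft rules (assumed examples): latest record wins, country values are harmonized."""
    country_map = {"DE": "Germany", "GER": "Germany", "Germany": "Germany"}
    mart = {}
    for row in sorted(sat_customer, key=lambda r: r["load_ts"]):
        raw_country = row["payload"].get("country")
        mart[row["customer_no"]] = {
            "name": row["payload"].get("name"),
            "country": country_map.get(raw_country, "Unknown"),
        }
    return mart


load_raw_vault({"customer_no": "C1", "name": "Alice", "country": "GER"}, "crm", 1)
load_raw_vault({"customer_no": "C1", "name": "Alice M.", "country": "DE"}, "shop", 2)
load_raw_vault({"customer_no": "C2", "name": "Bob", "country": "??"}, "shop", 3)

print(build_customer_mart())
```

The raw layer still holds all three deliveries, including the questionable country value, so a changed business rule can be re-applied later without reloading from the sources.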
DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)
• In classical DWHs, the Core Warehouse Layer is regarded as the “single version of the truth”
• Integrates + cleanses data from different sources and eliminates contradiction
• Produces consistent results/reports across Data Marts
• But: cleansing is (still) subjective, enterprises change regularly, and the paradigm does not scale as more and more systems exist
• Data in Raw Data Vault Layer is regarded as “Single version of the facts”
• 100% of data is loaded 100% of time
• Data is not cleansed and bad data is not removed in the Core Layer (Raw Vault)
DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)
• Data Vault is optimized for the following requirements:
• Flexibility
• Agility
• Data historization
• Data integration
• Auditability
• Bill Inmon wrote in 2008: “Data Vault is the optimal approach for modeling the EDW in the DW2.0 framework.” (DW2.0)
DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)
DW 2.0: THE ARCHITECTURE FOR THE NEXT GENERATION OF DATA WAREHOUSING
Source: W.H. Inmon, Dan Linstedt: Data Architecture: A Primer for the Data Scientist, Morgan Kaufmann, 2014, chapter 3.1
Figure: Data Lifecycle with the data model per stage: operational application data model, integrated corporate data model (shown twice), archival data model
Main characteristics:
• Structured and “unstructured” data, not just metrics
• Life Cycle of data with different storage areas
• Hot data: High speed, expensive storage (RAM, SSDs) for most recent data
• …
• Cold data: Low speed, inexpensive storage (e.g. hard disks) for old data; archival data model with high compression
• Metadata is an integral part of the DWH and not an afterthought
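The lifecycle idea of placing data on different storage areas by age can be sketched in a few lines. The age thresholds and the name of the intermediate tier are assumptions for illustration only; the slide does not define concrete limits or name the tiers between hot and cold.

```python
from datetime import date

# Assumed thresholds in days; not taken from the lecture
HOT_MAX_AGE_DAYS = 30
COLD_MIN_AGE_DAYS = 365


def storage_tier(record_date, today):
    """Classify a record by age into a storage tier."""
    age_days = (today - record_date).days
    if age_days <= HOT_MAX_AGE_DAYS:
        return "hot"            # most recent data: fast, expensive storage
    if age_days >= COLD_MIN_AGE_DAYS:
        return "cold"           # old data: slow, inexpensive storage, archival model
    return "intermediate"       # placeholder for the tiers between hot and cold


today = date(2020, 5, 22)
for d in (date(2020, 5, 1), date(2020, 1, 1), date(2018, 1, 1)):
    print(d, storage_tier(d, today))
```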
DW 2.0: THE ARCHITECTURE FOR THE NEXT GENERATION OF DATA WAREHOUSING
VIRTUAL DATA WAREHOUSE
Figure: Data Warehouse (Backend and Frontend) as Virtual Data Warehouse
• Sources: internal data sources (OLTP), external data sources
• Component: Query Management
• Characteristics: weakly + partly subject-oriented, weakly + partly integrated, not time-variant, not non-volatile
• Data not extracted from operational systems and stored separately
• Standardized interface for all operational data sources
• One "GUI" for all existing data
• Generates combined queries
• Query Processor joins query result data from different sources
• Can also access data in Hadoop (PolyBase, Big SQL, Big Data SQL, etc.)
VIRTUAL DATA WAREHOUSE
• Query Management manages metadata about all operational systems
• (physical) location of data and algorithms for extracting data from OLTP system
• Implementation easier
• Low cost: can use existing hardware infrastructure
• Queries cause significant performance problems in operational systems
• Known problems when analyzing operational data directly
• Same query is processed multiple times (if queried multiple times)
• Same query delivers different results when processed at different times
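A minimal sketch of the query-processor idea, assuming two hypothetical operational systems (a CRM and a billing database) represented by separate in-memory SQLite databases. Nothing is copied into a separate store; each request queries the sources directly and joins the partial results in memory.

```python
import sqlite3

# Two independent "operational systems" with hypothetical schemas
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoice (customer_id INTEGER, amount REAL)")
billing.executemany("INSERT INTO invoice VALUES (?, ?)",
                    [(1, 20.0), (1, 5.0), (2, 7.5)])


def revenue_per_customer():
    """Query both sources at request time and join the partial results in memory."""
    names = dict(crm.execute("SELECT customer_id, name FROM customer"))
    totals = billing.execute(
        "SELECT customer_id, SUM(amount) FROM invoice GROUP BY customer_id")
    return {names[customer_id]: total for customer_id, total in totals}


first = revenue_per_customer()
# The operational system keeps changing between two runs of the same query
billing.execute("INSERT INTO invoice VALUES (2, 10.0)")
second = revenue_per_customer()
print(first, second)
```

The two runs return different totals for the same query, and every run puts load on both operational databases, which mirrors the drawbacks listed above.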
VIRTUAL DATA WAREHOUSE
OPERATIONAL DATA STORE (ODS)
Figure: Data Warehouse (Backend and Frontend) with Operational Data Store
• Sources: internal data sources (OLTP), external data sources
• Layers: Operational Data Store; Staging Layer (Input Layer); Core Warehouse Layer (Storage Layer); Mart Layer (Output Layer, Reporting Layer)
• Cross-layer components: Metadata Management, Security, DWH Manager incl. Monitor
• Characteristics: subject-oriented, integrated, time-variant, non-volatile
• ODS: Real-time/Right-time layer
• Replication techniques used to transport data from source database to ODS layer with minimal impact on source system
• Data in the ODS has no history and is stored without any cleansing and without any integration (1:1 copy from single source)
• DWH performance not optimal as data model is suited for OLTP and not for reporting requirements
• The ODS normally exists in addition to the Staging / Core DWH / Mart Layers, but it can also exist on its own without the other layers
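How a replicated ODS table behaves can be sketched as follows. The change-event format is a hypothetical simplification and does not reflect the API of any particular replication product; the point is that the mirror holds only the current state, without history, cleansing, or integration.

```python
ods_customer = {}  # primary key -> current row; 1:1 mirror of a single source table


def apply_change(event):
    """Apply one replicated change; earlier versions are overwritten, not kept."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        ods_customer[key] = dict(event["row"])   # stored as delivered, no cleansing
    elif op == "delete":
        ods_customer.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


changes = [
    {"op": "insert", "key": 1, "row": {"name": "alice ", "city": "Ulm"}},
    {"op": "insert", "key": 2, "row": {"name": "Bob", "city": "Berlin"}},
    {"op": "update", "key": 1, "row": {"name": "alice ", "city": "Stuttgart"}},
    {"op": "delete", "key": 2},
]
for event in changes:
    apply_change(event)

print(ods_customer)
```

After the four changes only the latest version of customer 1 remains, including its uncleansed name; the earlier city and the deleted customer 2 are gone, which is why history has to be built in the DWH layers if it is needed.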
OPERATIONAL DATA STORE (ODS)
EXAMPLE DWH FOR STATE OF CONSTRUCTION DOCU
ARCHITECTURE FROM AN ACTUAL PROJECT IN THE AUTOMOTIVE INDUSTRY
Figure: Architecture components
• OLTP DB with IIDR Repl Engine Source and Datastore Source
• Mirror DB (Operational Data Store) with IIDR Repl Engine Mirror and Datastore Mirror
• DWH DB (Backend) with IIDR Repl Engine DWH and Datastore DWH; layers: Staging Layer, Raw + Business Data Vault, Mart Layer
• ETL Engine
• Frontend: Standard Reports, AdHoc Reports
• Logs, TSM
END USER SAMPLE QUESTIONS
Which vehicles or aggregates are documented incompletely? (Data quality)
Which vehicles / which control units require SW updates?
Which interiors are most common by region?
Supply data for external simulations, customs clearance, spare part planning, etc.
Review the presented data warehouse architectures.
Which architecture would you recommend for
• A holding of 3 telecommunication companies
• An online store with real/right-time data integration needs
• Marketing department of a bank
List advantages and drawbacks of your proposal.
EXERCISE: RECOMMEND AN ARCHITECTURE
A holding of 3 telecommunication companies
• Architecture: Virtual Data Warehouse
• + Companies may not want to provide their data to a new storage
• + Can easily be extended if new companies join the holding or reduced if a company leaves the holding
• - Bad performance
• - Not really data integration achieved, low Data Quality
• - Firewalls have to be opened
EXERCISE: RECOMMEND AN ARCHITECTURE
An online store with real-time/right-time data integration needs
• Architecture: Data Vault 2.0
• + Integration of many internal and external source systems (e.g. integrate social media data about the online store)
• + Fast data delivery in Raw Vault Layer (Real-time/Right-time data integration). Complex data cleansing / transformation / soft rules are delayed downstream towards Mart Layer
• - Transformation overhead (Source system data model to Data Vault data model to Dimensional data model)
EXERCISE: RECOMMEND AN ARCHITECTURE
Marketing department of a bank
• Architecture: Kimball Bus architecture
• + Start small for a department. If other departments are interested, new data and new Marts can be added on demand
• - High risk of losing the enterprise view, with several separate DWHs being built
That’s still quite a common scenario nowadays. A single Enterprise DWH is often not achieved (e.g. Mergers & Acquisitions, inflexibility due to a single centralized DWH, rapidly changing conditions, etc.) and therefore very often several DWHs with different architectures exist in parallel within a company.
EXERCISE: RECOMMEND AN ARCHITECTURE
• Now imagine that you are preparing an exam.
• Identify 1-3 questions about DWH architecture (and/or DWH introduction) that you would ask in an exam.
• Write down the questions on sticky notes.
EXERCISE - INTRODUCTION AND DWH ARCHITECTURE
GROUP TASK
Which layers does the logical standard architecture have?
• Staging (Input), Integration (Cleansing), Core Warehouse (Storage), Aggregation, Mart (Reporting, Output) and additionally Metadata, Security, DWH Manager, Monitor
Which other architectures exist?
• Kimball Bus Architecture (Central data warehouse based on data marts)
• Inmon Corporate Information Factory
• Data Vault 2.0 Architecture (Dan Linstedt)
• DW 2.0: The Architecture for the Next Generation of Data Warehousing
• Virtual Data Warehouse
• Operational Data Store (ODS)
SUMMARY
Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99
tss@daimler.com / Internet: www.daimler-tss.com / Intranet-Portal-Code: @TSS
Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle
THANK YOU