PDI data vault framework #pcmams 2012
Date posted: 11-Sep-2014
![Page 2: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/2.jpg)
Data Vault Definition
Source: Dan Linstedt, http://www.tdan.com/view-articles/5054/
The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of enterprise data warehouses.
![Page 3: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/3.jpg)
Data Vault Building Blocks
Source: Dan Linstedt, http://www.slideshare.net/dlinstedt/introduction-to-data-vault-dama-oregon-2012
different sources/rate of change
![Page 4: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/4.jpg)
Data Vault Fundamentals: Hub
Source: data-vault-modeling-guide, Genesee Academy, LLC, Hans Hultgren
![Page 5: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/5.jpg)
Data Vault Fundamentals: Link
Source: data-vault-modeling-guide, Genesee Academy, LLC, Hans Hultgren
![Page 6: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/6.jpg)
Data Vault Fundamentals: Satellite
Source: data-vault-modeling-guide, Genesee Academy, LLC, Hans Hultgren
![Page 7: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/7.jpg)
Data Vault Fundamentals: Model
Source: data-vault-modeling-guide, Genesee Academy, LLC, Hans Hultgren
![Page 8: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/8.jpg)
Data Vault ETL
Many objects to load, standardized procedures
This screams for a generic solution!
I don't want to:
throw ETL tool away and code it all myself
manage too many ETL objects
connect similar columns in mappings by hand
I do want to:
generate ETL (Kettle) objects? No.
Take it one step further: there is only one parameterised hub load object, so there is no need to know the XML structure of PDI objects.
![Page 9: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/9.jpg)
Tools
Version Control
Database
Virtualization
Data Integration
Operating System
'Productivity'
SQL Development
![Page 10: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/10.jpg)
Place of framework in architecture
[Architecture diagram. Layers: Sources, ETL Process, Data Warehouse, EUL. Labels: Staging Area, CSV Files, ETL, ERP, DBMS, MySQL, Files, ETL: Kettle Data Vault Framework, Central DWH & Data Marts, MySQL Data Vault, ETL.]
![Page 11: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/11.jpg)
What has to be taken care of?
Data Vault designed and implemented in database
Staging tables and loading procedures in place (can also be generic, we use PDI Metadata Injection step for loading files)
Mapping from source to Data Vault specified (now in an Excel sheet)
![Page 12: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/12.jpg)
Framework components
PDI repository (file based), jobs and transformations
Configuration files: kettle.properties, shared.xml, repositories.xml
Excel sheet that contains the specifications
MySQL database for metadata
Virtual machine with Ubuntu 12.04 Server
![Page 13: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/13.jpg)
Design decisions
Updateable views with generic column names
(MySQL more lenient than PostgreSQL)
Compare satellite attributes via string comparison (concatenate all columns, with | (pipe) as delimiter)
'inject' the metadata using Kettle parameters
Generate and use an error table for each Data Vault table
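The string comparison of satellite attributes can be sketched roughly as follows. This is a minimal Python illustration, not the framework's PDI implementation; the function names `row_signature` and `has_changed` and the sample column names are assumptions made for this example.

```python
from typing import Mapping, Optional, Sequence

DELIMITER = "|"  # pipe delimiter, as named on the slide


def row_signature(row: Mapping[str, object], columns: Sequence[str]) -> str:
    """Concatenate the given columns into one pipe-delimited string."""
    # Assumption: NULL (None) is rendered as an empty string.
    return DELIMITER.join(
        "" if row.get(c) is None else str(row[c]) for c in columns
    )


def has_changed(
    source_row: Mapping[str, object],
    current_sat_row: Optional[Mapping[str, object]],
    columns: Sequence[str],
) -> bool:
    """True when a new satellite row should be written."""
    if current_sat_row is None:  # no satellite row yet for this key
        return True
    return row_signature(source_row, columns) != row_signature(
        current_sat_row, columns
    )


# Hypothetical example rows
cols = ["name", "city"]
old = {"name": "Acme", "city": "Amsterdam"}
new = {"name": "Acme", "city": "Utrecht"}
print(has_changed(new, old, cols))
```

A known limitation of this kind of comparison: a value that itself contains the delimiter, or a NULL versus an empty string, can produce identical signatures for different rows.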
![Page 14: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/14.jpg)
Metadata tables
All have history tables
![Page 15: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/15.jpg)
Metadata in Excel
Data Vault
connections
source systems
source tables
![Page 16: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/16.jpg)
Metadata in Excel (hub + sat)
x 200 (max)
![Page 17: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/17.jpg)
Metadata in Excel (link)
link attributes
x 10
![Page 18: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/18.jpg)
Metadata in Excel (link satellite)
x 10
x 5
x 200 (max)
![Page 19: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/19.jpg)
Last seen date
applicable for hubs and links
existing hubs and links: update 'last_seen_dts'!
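A rough sketch of the last seen logic for a hub, assuming that new business keys are inserted and existing ones only get their `last_seen_dts` refreshed. SQLite stands in for MySQL here so the example is self-contained; the table `hub_customer` and its columns other than `last_seen_dts` are hypothetical.

```python
import sqlite3


def load_hub(conn, business_keys, load_dts):
    """Insert unknown business keys; refresh last_seen_dts for known ones."""
    cur = conn.cursor()
    for bk in business_keys:
        cur.execute(
            "UPDATE hub_customer SET last_seen_dts = ? WHERE customer_bk = ?",
            (load_dts, bk),
        )
        if cur.rowcount == 0:  # key not present yet: create the hub row
            cur.execute(
                "INSERT INTO hub_customer (customer_bk, load_dts, last_seen_dts) "
                "VALUES (?, ?, ?)",
                (bk, load_dts, load_dts),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hub_customer ("
    "customer_id INTEGER PRIMARY KEY, customer_bk TEXT UNIQUE, "
    "load_dts TEXT, last_seen_dts TEXT)"
)
load_hub(conn, ["C1", "C2"], "2012-01-01")  # first run
load_hub(conn, ["C2", "C3"], "2012-01-02")  # second run: C1 not delivered
rows = conn.execute(
    "SELECT customer_bk, load_dts, last_seen_dts "
    "FROM hub_customer ORDER BY customer_bk"
).fetchall()
print(rows)
```

After the second run, the key that was not delivered keeps its older `last_seen_dts`, which is what makes the column usable for spotting keys that have disappeared from the source.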
![Page 20: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/20.jpg)
Link validity satellite
Link has 'business key': not all hub IDs
![Page 21: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/21.jpg)
Loading the metadata
![Page 22: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/22.jpg)
'design errors'
Checks to avoid debugging: (compares design metadata with Data Vault DB information_schema)
hubs, links, satellites that don't exist in the DV
key columns that do not exist in the DV
missing connection data (source db)
missing attribute columns
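The first two checks above might look like this in outline. It is a sketch only: `design` and `catalog` are hypothetical in-memory structures, where `catalog` is assumed to have been filled from the table and column names in `information_schema.columns`.

```python
def find_design_errors(design, catalog):
    """Compare designed tables/columns with what exists in the Data Vault.

    design:  {table_name: [column_name, ...]} from the design metadata
    catalog: {table_name: {column_name, ...}} from the database catalog
    """
    errors = []
    for table, columns in design.items():
        if table not in catalog:
            errors.append(f"table missing in DV: {table}")
            continue  # no point checking columns of a missing table
        for col in columns:
            if col not in catalog[table]:
                errors.append(f"column missing in DV: {table}.{col}")
    return errors


# Hypothetical design metadata and catalog contents
design = {
    "hub_customer": ["customer_id", "customer_bk"],
    "sat_customer": ["customer_id", "name"],
    "lnk_order": ["order_id"],
}
catalog = {
    "hub_customer": {"customer_id", "customer_bk", "load_dts"},
    "sat_customer": {"customer_id", "load_dts"},
}
errors = find_design_errors(design, catalog)
for e in errors:
    print(e)
```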
![Page 23: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/23.jpg)
A complete run
![Page 24: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/24.jpg)
Metadata needed for a hub
name
key column
business key column
source table
source table business key column (can be expression, e.g. concatenate for composite key)
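As an illustration of how this hub metadata could be handed to a single parameterised job, the sketch below builds a Kitchen command line with one named parameter per metadata item. The parameter names (`HUB_NAME` and so on), the job file name and the sample values are assumptions for this example; the `-file` and `-param:NAME=value` options should be checked against the Kitchen documentation for the PDI version in use.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HubSpec:
    name: str
    key_column: str
    business_key_column: str
    source_table: str
    source_business_key: str  # may be an SQL expression for composite keys


def kitchen_args(spec: HubSpec, job_file: str) -> List[str]:
    """Build an argument list that passes the hub metadata as parameters."""
    params = {  # hypothetical parameter names
        "HUB_NAME": spec.name,
        "HUB_KEY_COLUMN": spec.key_column,
        "HUB_BK_COLUMN": spec.business_key_column,
        "SOURCE_TABLE": spec.source_table,
        "SOURCE_BK_EXPRESSION": spec.source_business_key,
    }
    return ["kitchen.sh", f"-file={job_file}"] + [
        f"-param:{name}={value}" for name, value in params.items()
    ]


spec = HubSpec(
    name="hub_customer",
    key_column="customer_id",
    business_key_column="customer_bk",
    source_table="stg_customer",
    source_business_key="CONCAT(country_code, '-', customer_no)",
)
args = kitchen_args(spec, "load_hub.kjb")
print(args)
```

Passing the list to `subprocess.run(args)` rather than joining it into one shell string avoids quoting problems with expressions that contain spaces or quotes.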
![Page 25: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/25.jpg)
Job for hub
![Page 26: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/26.jpg)
Transformation for hub
![Page 27: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/27.jpg)
Metadata needed for a link
name
key column
for each hub (maximum 10, can be a ref-table)
hub name
column name for the hub key in the link (roles!)
column in the source table → business key of hub
link 'attributes' (part of key, no hub, maximum 5)
link validity satellite needed?
last seen date needed?
source table
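The link metadata listed above could be captured in a small structure like the one below, including the limits of 10 hubs and 5 link attributes. The class and field names are assumptions for illustration; the framework itself keeps this metadata in Excel and MySQL.

```python
from dataclasses import dataclass, field
from typing import List

MAX_HUBS = 10            # limit named on the slide
MAX_LINK_ATTRIBUTES = 5  # limit named on the slide


@dataclass
class HubRef:
    hub_name: str
    link_column: str    # column for the hub key in the link (the role)
    source_column: str  # source column holding the hub's business key


@dataclass
class LinkSpec:
    name: str
    key_column: str
    source_table: str
    hubs: List[HubRef]
    attributes: List[str] = field(default_factory=list)
    validity_satellite: bool = False
    last_seen: bool = False

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the spec is usable."""
        problems = []
        if not self.hubs:
            problems.append("a link needs at least one hub")
        if len(self.hubs) > MAX_HUBS:
            problems.append(f"more than {MAX_HUBS} hubs")
        if len(self.attributes) > MAX_LINK_ATTRIBUTES:
            problems.append(f"more than {MAX_LINK_ATTRIBUTES} link attributes")
        columns = [h.link_column for h in self.hubs]
        if len(set(columns)) != len(columns):
            problems.append("duplicate link column; roles need distinct names")
        return problems


# Hypothetical link that references the same hub in two roles
good = LinkSpec(
    name="lnk_reports_to",
    key_column="reports_to_id",
    source_table="stg_employee",
    hubs=[
        HubRef("hub_employee", "employee_id", "emp_no"),
        HubRef("hub_employee", "manager_id", "mgr_no"),
    ],
    last_seen=True,
)
print(good.validate())
```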
![Page 28: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/28.jpg)
Job for link
![Page 29: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/29.jpg)
Transformation for link
Run table needed for validity satellite?
Lookup hubs
Remove columns not in link
Last seen?
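The "Lookup hubs" and "Remove columns not in link" steps can be pictured as follows. This is a plain Python sketch on dictionaries, not the PDI transformation itself; the function name, the lookup structure and the sample data are assumptions.

```python
def build_link_row(source_row, hub_lookups, hub_refs, attribute_columns):
    """Turn one source row into one link row.

    hub_lookups: {hub_name: {business_key: hub_id}}
    hub_refs:    [(hub_name, link_column, source_column), ...]
    """
    link_row = {}
    for hub_name, link_column, source_column in hub_refs:
        business_key = source_row[source_column]
        # Lookup hubs: replace the business key by the hub's surrogate key.
        link_row[link_column] = hub_lookups[hub_name][business_key]
    for col in attribute_columns:
        link_row[col] = source_row[col]
    # Only hub keys and link attributes were copied, so every other
    # source column is dropped from the result.
    return link_row


# Hypothetical source row: an employee and the manager they report to
source_row = {
    "emp_no": "E7",
    "mgr_no": "E1",
    "dept_note": "not part of the link",
    "start_date": "2012-01-01",
}
hub_lookups = {"hub_employee": {"E7": 107, "E1": 101}}
hub_refs = [
    ("hub_employee", "employee_id", "emp_no"),
    ("hub_employee", "manager_id", "mgr_no"),
]
link_row = build_link_row(source_row, hub_lookups, hub_refs, ["start_date"])
print(link_row)
```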
![Page 30: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/30.jpg)
Metadata needed for a hub satellite
name
key column
hub name
column in the source table → business key of hub
for each attribute (maximum 200)
source column → target column
source table
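To make the source column → target column mapping concrete, here is a small sketch that turns such a mapping into a SELECT statement with aliases. It is a simplified illustration under assumed names and is not how the framework does it, which relies on updateable views with generic column names.

```python
MAX_ATTRIBUTES = 200  # limit named on the slide


def select_for_satellite(source_table, business_key_column, mapping):
    """Build a SELECT that renames source columns to satellite columns.

    mapping: {source_column: target_column}
    """
    if len(mapping) > MAX_ATTRIBUTES:
        raise ValueError(f"more than {MAX_ATTRIBUTES} attributes")
    columns = [business_key_column] + [
        f"{source} AS {target}" for source, target in mapping.items()
    ]
    return f"SELECT {', '.join(columns)} FROM {source_table}"


# Hypothetical staging table and column names
sql = select_for_satellite(
    "stg_customer",
    "customer_no",
    {"cust_name": "name", "cust_city": "city"},
)
print(sql)
```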
![Page 31: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/31.jpg)
Job for hub satellite
![Page 32: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/32.jpg)
Transformation for hub satellite
![Page 33: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/33.jpg)
Metadata needed for a link satellite
name
key column
link name
for each hub of the link:
column in the source table → business key of hub
for each key attribute: source column
for each attribute: source column → target column
source table
![Page 34: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/34.jpg)
Job for link satellite
![Page 35: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/35.jpg)
Transformation for link satellite
![Page 36: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/36.jpg)
Executing in a loop ..
![Page 37: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/37.jpg)
.. and parallel
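Looping over the metadata and running loads in parallel can be sketched like this. The example assumes that objects within one phase are independent of each other and that hubs are loaded before the objects that look up hub keys; `load_one` is a placeholder for starting the actual PDI job, and the object names are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(names, load_one, max_workers=4):
    """Run load_one for every name in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_one, names))


def run_phases(phases, load_one, max_workers=4):
    """Run the phases one after another, each phase in parallel."""
    results = []
    for names in phases:
        results.extend(run_all(names, load_one, max_workers))
    return results


def load_one(name):
    # Placeholder: a real version would start the parameterised job here.
    return f"loaded {name}"


phases = [
    ["hub_customer", "hub_product"],  # assumed first: other objects need hub keys
    ["lnk_order", "sat_customer"],
    ["lsat_order"],
]
results = run_phases(phases, load_one)
print(results)
```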
![Page 38: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/38.jpg)
Logging
Configuring log tables for concurrent access
PDI logging
Custom logging
![Page 39: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/39.jpg)
Version Control: PDI objects
![Page 40: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/40.jpg)
Version Control: database objects
![Page 41: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/41.jpg)
Some points of interest
Easy to make a mistake in the design sheet
Generic → a bit harder to maintain and debug
Application/tool to maintain metadata?
Data Vault generators (e.g. Quipu)?
Spinoff using Informatica and Oracle: Sander Robijns
Thanks to: Jos van Dongen, Kasper de Graaf
![Page 42: PDI data vault framework #pcmams 2012](https://reader034.fdocuments.us/reader034/viewer/2022051411/541231738d7f72d0738b477b/html5/thumbnails/42.jpg)
Sourceforge!