Data Vault 2.0: Using MD5 Hashes for Change Data Capture
Kent Graziano
Data Warrior LLC
Twitter @KentGraziano
Data Vault Definition
The Data Vault is a detail oriented, historical tracking
and uniquely linked set of normalized tables that
support one or more functional areas of business.
It is a hybrid approach encompassing the best of
breed between 3rd normal form (3NF) and star
schema. The design is flexible, scalable, consistent
and adaptable to the needs of the enterprise.
Dan Linstedt: Defining the Data Vault TDAN.com Article
Architected specifically to meet the needs
of today’s enterprise data warehouses
Data Vault Time Line
(Timeline axis: 1960 to 2000)
● E.F. Codd invented relational modeling
● Chris Date and Hugh Darwen maintained and refined modeling
● Mid 60’s: Dimension & Fact modeling presented by General Mills and Dartmouth University
● Early 70’s: Bill Inmon began discussing data warehousing
● Mid 70’s: AC Nielsen popularized Dimension & Fact terms
● 1976: Dr Peter Chen created E-R diagramming
● Mid 80’s: Bill Inmon popularizes data warehousing
● Mid – Late 80’s: Dr Kimball popularizes star schema
● Late 80’s: Barry Devlin and Dr Kimball release “Business Data Warehouse”
● 1990: Dan Linstedt begins R&D on Data Vault modeling
● 2000: Dan Linstedt releases first 5 articles on Data Vault modeling
© LearnDataVault.com
2014 - Next Evolution
What’s New in DV2.0?
Modeling Structure Includes…
● NoSQL, and Non-Relational DB systems, Hybrid Systems
● Minor Structure Changes to support NoSQL
New ETL Implementation Standards
● For true real-time support
● For NoSQL support
New Architecture Standards
● To include support for NoSQL data management systems
New Methodology Components
● Including CMMI, Six Sigma, and TQM
● Including Project Planning, Tracking, and Oversight
● Agile Delivery Mechanisms
● Standards, and templates for Projects
Use of MD5 Hash in DV2.0
MD5 PK – replaces surrogate keys
MD5DIFF – used for change detection
This model is fully compliant with Hadoop; it needs NO changes to work properly.
The Hash Keys can be used to join to Hadoop data sets.
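As a rough illustration of the MD5 PK idea, a hub's key can be computed from the business key itself rather than drawn from a sequence. The sketch below is Python rather than Oracle; the helper name `hub_hash_key` and the trim/upper-case normalization are assumptions made for illustration, not something this slide specifies.

```python
import hashlib

def hub_hash_key(business_key: str) -> str:
    """Return a 32-character hex MD5 of a normalized business key (hypothetical helper)."""
    # Assumption: normalize case and surrounding whitespace so the same
    # business key always yields the same hash, whichever system computes it.
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest().upper()

# Two loads that see the same (hypothetical) customer number compute the same
# key independently, so neither needs to look up a sequence-generated value.
key_a = hub_hash_key("cust-1001")
key_b = hub_hash_key("  CUST-1001 ")
```

Because the key is a pure function of the business key, a data set held outside the database can compute it the same way and join on it.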
© LearnDataVault.com
![Page 7: Data Vault 2.0: Using MD5 Hashes for Change Data Capture](https://reader031.fdocuments.us/reader031/viewer/2022021507/58f9a985760da3da068b6f72/html5/thumbnails/7.jpg)
MD5-based Change Detection
Think Type 2 SCD
Old Way:
● Compare column by column
● Source value != Current value in DW table
● 20 columns, then 20 compares
New Way:
● Concatenate all columns to one string
● Convert to one char(32) string with hash function
● Compare to hashed value (MD5DIFF) in target table
● Does not matter how many columns
© Data Warrior LLC
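The "new way" can be sketched in a few lines. This is a minimal Python illustration, not the Oracle implementation from the talk: `md5_diff` and `has_changed` are hypothetical names, the row values are made up, and the '^' separator anticipates the one motivated on the following slide.

```python
import hashlib

def md5_diff(values):
    """Join column values with a '^' separator and return the MD5 as 32 hex characters."""
    joined = "^".join("" if v is None else str(v) for v in values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest().upper()

def has_changed(source_row, stored_md5diff):
    """A single comparison, however many columns the row has."""
    return md5_diff(source_row) != stored_md5diff

# Hypothetical current row in the target table, with its stored MD5DIFF.
current_row = ["ASPIRIN", 325, "MG", "TABLET"]
stored = md5_diff(current_row)

unchanged = has_changed(["ASPIRIN", 325, "MG", "TABLET"], stored)
changed = has_changed(["ASPIRIN", 500, "MG", "TABLET"], stored)
```

Here `unchanged` is `False` and `changed` is `True`; a new row version would be inserted only in the second case, as with a Type 2 SCD.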
What does it look like?
Encode using standard MD5 hash function
● rawtohex(sys.utl_raw.cast_to_raw(dbms_obfuscation_toolkit.md5(input_string => ...)))
Need to minimize chance of duplicates
● Without a separator, 12||3||45 and 1||2||345 both concatenate to 12345 and hash to the same value
● Need a separator between each column
● The separator also handles null values: a null column still leaves its position marked
● Example: Col1||'^'||Col2||'^'||Col3
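The collision described above is easy to reproduce. The following Python sketch (helper names `md5_hex` and `concat` are illustrative assumptions) hashes the two example rows with and without the '^' separator, and shows how the separator keeps a null in the first column distinct from a null in the second.

```python
import hashlib

def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def concat(cols, sep=""):
    # Treat None the way Oracle treats NULL in a || concatenation: as an empty string.
    return sep.join("" if c is None else str(c) for c in cols)

row_a = ["12", "3", "45"]
row_b = ["1", "2", "345"]

# Without a separator both rows become "12345", so the hashes collide.
same_without_sep = md5_hex(concat(row_a)) == md5_hex(concat(row_b))
# With a separator the strings are "12^3^45" and "1^2^345", so they differ.
same_with_sep = md5_hex(concat(row_a, "^")) == md5_hex(concat(row_b, "^"))

# Null handling: the separator records which position was empty.
null_first = concat([None, "X"], "^")
null_second = concat(["X", None], "^")
```

A separator character that can also occur inside the data would reintroduce the ambiguity, so it is worth choosing one that the source columns are unlikely to contain.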
Other considerations
To generate most consistent string: standardize!
Convert data types
● If 'NUMBER', 'NVARCHAR2', 'NVARCHAR', 'NCHAR' THEN 'TO_CHAR(' || column_name || ')'
● If 'RAW' THEN 'ENC_BASE64(' || column_name || ')'
● If 'DATE' THEN 'TO_CHAR(' || column_name || ', ''YYYY-MM-DD'')'
● If LIKE 'TIME%' THEN 'TO_CHAR(' || column_name || ', ''YYYY-MM-DD HH24:MI:SS'')'
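The same standardization rules can be sketched outside the database. This Python version is only an analogue of the conversions listed above: the function name `standardize` is hypothetical, Python types stand in for the Oracle data types, and the standard `base64` module stands in for the slide's `ENC_BASE64`.

```python
import base64
from datetime import date, datetime

def standardize(value):
    """Render a column value as a consistent string before hashing (sketch)."""
    if value is None:
        return ""
    # datetime must be tested before date, because datetime is a subclass of date.
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d %H:%M:%S")   # like 'YYYY-MM-DD HH24:MI:SS'
    if isinstance(value, date):
        return value.strftime("%Y-%m-%d")            # like 'YYYY-MM-DD'
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    return str(value)                                # numbers and character data

examples = [
    standardize(date(2014, 6, 1)),
    standardize(datetime(2014, 6, 1, 13, 5, 9)),
    standardize(b"\x01\x02"),
    standardize(325),
]
```

Fixing the format explicitly matters because a default date or number format can differ between sessions and platforms, which would change the hash without any change in the data.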
Final Input String
(UPPER(TRIM(T1.GENERICNAME))
||'^'||
UPPER(TRIM(TO_CHAR(T1.MED_STRNG_AMT)))
||'^'||
UPPER(TRIM(T1.UOM_CD))
||'^'||
UPPER(TRIM(T1.MED_FORM_NM))
||'^')
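To see what the expression above produces, here is a small Python sketch that mirrors it. The function name and the sample values are hypothetical; Python's `strip()` removes all surrounding whitespace, whereas Oracle's `TRIM` removes spaces by default, so this is an approximation.

```python
import hashlib

def final_input_string(genericname, med_strng_amt, uom_cd, med_form_nm):
    """Mirror the slide's expression: UPPER(TRIM(col)) for each column, each followed by '^'."""
    cols = [genericname, med_strng_amt, uom_cd, med_form_nm]
    # str() plays the role of TO_CHAR for the numeric column; None stands in for NULL.
    parts = ["" if c is None else str(c).strip().upper() for c in cols]
    # The slide's expression ends with a trailing '^', so the last column gets one too.
    return "".join(p + "^" for p in parts)

# Hypothetical rows that differ only in case and padding.
s1 = final_input_string("Aspirin ", 325, "mg", " Tablet")
s2 = final_input_string("ASPIRIN", 325, "MG", "TABLET")

md5diff_1 = hashlib.md5(s1.encode("utf-8")).hexdigest().upper()
md5diff_2 = hashlib.md5(s2.encode("utf-8")).hexdigest().upper()
```

Both calls produce the string `ASPIRIN^325^MG^TABLET^`, so the two rows hash to the same MD5DIFF and would not be flagged as a change.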
So what?
MD5 hash is consistent cross-platform
Changes multi-column compares to a single-column compare
All compares take the same time during load process
Can use with any DW architecture that requires change detection
Virtually no limit
● Think Big Data/Hadoop/NoSQL
Can generate the input string automatically
● But that is another talk!
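A quick way to check the cross-platform claim on any system is to hash a published test vector. The Python sketch below uses the "abc" vector from RFC 1321; the variable names are illustrative, and the comparison assumes the input is encoded to the same bytes on every platform.

```python
import hashlib

# RFC 1321 test vector: MD5("abc") is defined to be this value on every platform.
RFC1321_ABC = "900150983cd24fb0d6963f7d28e17f72"

digest = hashlib.md5(b"abc").hexdigest()
matches_reference = digest == RFC1321_ABC

# The digest is always 32 hex characters, whether the input is 3 bytes or 10,000,
# so the stored value and the comparison stay the same size as columns are added.
long_digest_length = len(hashlib.md5(b"x" * 10_000).hexdigest())
```

The hash is computed over bytes, not characters, so two platforms only agree if they standardize the character encoding as well as the formatting of the input string.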
Learn more about Data Vault
www.LearnDataVault.com
www.danlinstedt.com
On YouTube:
www.youtube.com/LearnDataVault
On Facebook:
www.facebook.com/learndatavault
Super Charge Your Data Warehouse
Available on Amazon.com
Soft Cover or Kindle Format
Now also available in PDF at
LearnDataVault.com
Contact Information
Kent Graziano
The Oracle Data Warrior
Data Warrior LLC
On Twitter @KentGraziano
Visit my blog at
http://kentgraziano.com