Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions


Page 1: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Extreme BI: Creating Virtualized Hybrid Type1+2 Dimensions

Kent Graziano, Data Warrior LLC
Keith Hoyle, McKesson Specialty Health

Page 2: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Agenda

Bios
Quick Survey
Why Virtualize?
Virtualizing with DV 2.0
Virtualizing with our hybrid architecture
The Secret Transform Table
Does it work?

Copyright 2015 Data Warrior LLC

Page 3: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Bio (Kent)

Data Vault Master, Certified DVDM (1.0), CDVP2
Authorized Data Vault 2.0 Bootcamp Instructor
Oracle ACE Director (BI/DW)
Blogger: The Data Warrior
Data Architecture and Data Warehouse Specialist
● 30+ years in IT
● 20+ years of data warehousing experience

Member: Boulder BI Brain Trust (BBBT)
Author, Co-Author
Past-President of ODTUG and RMOUG


Page 4: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Bio (Keith)

Sr. Manager, Enterprise Data Architecture (McKesson Specialty Health)

25+ years in IT
8+ years in Genetic Engineering / Biochemistry in Pharmaceutical industry
Completed multiple successful EDW efforts with large companies (Dell, HP, AMD, Aflac, Amgen, Glaxo-SmithKline, etc.)
Consulted through large firms catering to big pharma / biotech / medical industry


Page 5: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Audience Survey

How long in Data Warehousing or BI?
Have you heard of Data Vault?
● DV 1.0?
● DV 2.0?
● Ever built anything using DV model?

Page 6: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Why Virtualize?

Support Agile project approach
● Shorter iterations
● Faster time to market

Eliminates ETL bottleneck
● Specs
● Coding
● Testing (QA)

Replace with simple database views

Page 7: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Basic Data Vault Example

Page 8: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Where does Data Vault fit?

Data Vault goes here

Page 9: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Virtualizing with pure Data Vault

Type 1 SCD – simple
● Join Hub & Sats
● Use a PIT table to avoid the Max(LOAD_DTS) subqueries

Type 2 SCD – a little harder
● See my post: How to Build a Virtual Type 2 Slowly Changing Dimension
● Need a historicized PIT table with surrogate key

Type 2 SCD with DV 2.0
● Same but use MD5 Key on PIT table
● Build with Hub BK + Sat1 LOAD_DTS + Sat2 LOAD_DTS + …
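
The Type 1 case above can be sketched as a view that joins a Hub to its Sat and keeps only the latest Sat row. This is a minimal sketch, run in SQLite through Python only so that it is self-contained. The table and column names (hub_customer, sat_customer, customer_bk, city) are hypothetical and do not come from the slides. It uses the Max(LOAD_DTS) subquery that a PIT table would let you avoid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical Hub and Sat; names are illustrative only.
CREATE TABLE hub_customer (hub_customer_key TEXT PRIMARY KEY, customer_bk TEXT);
CREATE TABLE sat_customer (hub_customer_key TEXT, load_dts TEXT, city TEXT);

INSERT INTO hub_customer VALUES ('k1', 'C-100'), ('k2', 'C-200');
INSERT INTO sat_customer VALUES
  ('k1', '2015-01-01', 'Denver'),
  ('k1', '2015-03-01', 'Houston'),
  ('k2', '2015-02-01', 'Austin');

-- Virtual Type 1 dimension: one row per Hub key, current Sat values only.
CREATE VIEW dim1_customer AS
SELECT h.hub_customer_key AS customer_key, h.customer_bk, s.city
FROM hub_customer h
JOIN sat_customer s ON s.hub_customer_key = h.hub_customer_key
WHERE s.load_dts = (SELECT MAX(s2.load_dts)
                    FROM sat_customer s2
                    WHERE s2.hub_customer_key = s.hub_customer_key);
""")

type1_rows = conn.execute(
    "SELECT customer_bk, city FROM dim1_customer ORDER BY customer_bk"
).fetchall()
```

With several Sats per Hub, each Sat needs its own correlated subquery, which is the cost a PIT table is meant to remove.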

Page 10: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Virtualizing with Gepetto

Almost the same as DV
● A Data Vault hybrid
● Added join to the KM (key map) tables

Gepetto does not split stage tables into multiple Sats
● No PIT table needed

Views do a UNION ALL to include multiple sources
● Each source is a different stage table tied to the same KM
● KM table serves as PIT table to align them
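
A minimal sketch of the UNION ALL pattern described above, run in SQLite through Python so it is self-contained. G2_PRACTICE and LYNX_PRCTC appear as stage tables elsewhere in the deck, but the key map name km_prctc, the column names, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical key map: one enterprise key can point at rows in several sources.
CREATE TABLE km_prctc (d_prctc_key TEXT, cdc_key TEXT, rec_src TEXT);
CREATE TABLE g2_practice (cdc_key TEXT, load_dts TEXT, prctc_name TEXT);
CREATE TABLE lynx_prctc  (cdc_key TEXT, load_dts TEXT, prctc_name TEXT);

INSERT INTO km_prctc VALUES ('D1', 'G-1', 'G2'), ('D1', 'L-9', 'LYNX'), ('D2', 'G-2', 'G2');
INSERT INTO g2_practice VALUES
  ('G-1', '2015-01-01', 'North Clinic'),
  ('G-2', '2015-01-05', 'South Clinic');
INSERT INTO lynx_prctc VALUES ('L-9', '2015-02-01', 'North Clinic (Lynx)');

-- One branch per source stage table; the shared key map aligns them.
CREATE VIEW dim_prctc AS
SELECT km.d_prctc_key, km.rec_src, s.load_dts, s.prctc_name
FROM km_prctc km
JOIN g2_practice s ON s.cdc_key = km.cdc_key AND km.rec_src = 'G2'
UNION ALL
SELECT km.d_prctc_key, km.rec_src, s.load_dts, s.prctc_name
FROM km_prctc km
JOIN lynx_prctc s ON s.cdc_key = km.cdc_key AND km.rec_src = 'LYNX';
""")

rows = conn.execute(
    "SELECT d_prctc_key, rec_src FROM dim_prctc ORDER BY d_prctc_key, rec_src"
).fetchall()
```

Adding a source in this sketch means adding one more UNION ALL branch and the matching key map rows; the existing branches are untouched.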

Page 11: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Gepetto Schema Architecture

[Diagram. Recoverable labels:]

Source(s) of Record
Stage schemas: HI Stage, COMN Stage, FIN Stage
COMN Stage: <Full copies of source data structures with additional plumbing fields to facilitate capturing subsequent data changes over time>
COMN Validation (DQ)
COMN Integration: <Enterprise business key model with key mapping pointers to COMN_STG data>
Presentation schemas: FIN Presentation, HI Presentation, COMN Presentation
BOBJ / BI / Reporting
EDW V2
Other labels: FIN, HI, CLIN, G2, MU, HI, KDW, CI SAS Routines, EDW V1, FDW / PMS, KDW Lite, Lynx, SFDC, MKTG
Load annotations: Δ CDC, Insert only, 1X, Σ

Page 12: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Gepetto Virtualization Architecture

[Diagram with three layers: Stage, Integrate, Presentation. Recoverable labels:]

Stage
● KDW_ORG: …, PRIM_KEY, CDC_KEY
● G2_PRACTICE: …, PRIM_KEY, CDC_KEY
● KDW_PAT_VISIT: <Patient Record ID fields>, <Provider ID fields>, <Practice ID fields>, <Location ID fields>, <Visit Date fields>, …, PRIM_KEY, CDC_KEY
● LYNX_PRCTC: PM_PRCTC_KEY, …, PRIM_KEY, CDC_KEY

Integrate
● DATA_XFRM: <SRC: System, Table, Field, Value fields>, <TGT: System, Table, Field, Value fields>
● R_VSTR_VST_KEY, D_PAT_REC_KEY, D_PRVDR_KEY, D_LOC_GRP_KEY, D_LOC_KEY, D_CLNDR_KEY
● D_PAT_REC: D_PAT_REC_KEY, …
● D_LOC: D_LOC_KEY, …
● D_PRVDR: D_PRVDR_KEY
● D_LOC_GRP: D_LOC_GRP_KEY, …
● KM_LOC_GRP: D_LOC_GRP_KEY, CDC_KEY
● KM_VSTR_VST_KEY, CDC_KEY

Presentation
● DIM_PAT_REC: SCD2_PAT_REC_KEY, SCD1_PAT_REC_KEY, D_PRSN_KEY, …
● DIM_PRVDR: SCD2_PRVDR_KEY, SCD1_PRVDR_KEY
● DIM_PRCTC_HIER: SCD2_PRCTC_HIER_KEY, SCD1_PRCTC_HIER_KEY, D_LOC_KEY, …
● FACT_VST: SCD2_VST_KEY, SCD1_VST_KEY, D_PAT_REC_KEY, D_PRVDR_KEY, D_PRCTC_KEY, D_LOC_KEY, D_CLNDR_KEY

The CDC_KEY field in STG also goes into the CDC_KEY in INTG. Joins to other STG table(s) complete the R_x_KEY and D_x_KEY fields in INTG.

1) Logical views can be used to initially vet reports, aggregations, etc. where possible (i.e. most dimensions, primitive facts, some aggregate facts, etc.)
2) Materialized views can be used to vet the scaling of the solution
3) ETL processes will be used to productionize the vetted solution
4) STG data is transformed using joins to the DATA_XFRM table in INTG
5) Data is scrubbed with standard SQL functions (e.g. initcap, trim, remove special characters, etc.)

Page 13: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

MD5 Keys

Concatenate source data fields and hash to create MD5 keys

MD5 Key Types
● PRIM_KEY (STG):
  ● All source fields (in table order) + LOAD_DTS
  ● Uniquely identifies all records within the DW
  ● Can serve as an SCD-2 key in virtual Dims / Facts
● CDC_KEY (STG / INTG):
  ● Source field(s) (in table order) used by the SOR to ID data rows uniquely for change data capture purposes
● CDC_ATTR (STG):
  ● All non-CDC_KEY source fields (in table order), used to track changes for change data capture purposes
● NAT_KEY (STG):
  ● Source field(s) (in table order) from a single SOR table used to logically ID data rows uniquely
● [D_XXX_KEY / R_XXX_KEY] BUS_KEY (INTG):
  ● Source field(s) (in table order) used to logically ID data rows uniquely (joins may be required)
  ● Can serve as an SCD-1 key in virtual Dims / Facts
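
The key types above can be illustrated with a short sketch. The slide says only that source fields are concatenated in table order and hashed, so the delimiter, the trimming, the NULL handling, and the upper-case hex output below are assumptions. The sample row and its field names are hypothetical.

```python
import hashlib

def md5_key(values, delimiter="|"):
    """Concatenate field values in the given order and return the MD5 hex digest.

    The delimiter, trimming, and NULL-to-empty-string rules are assumptions.
    """
    text = delimiter.join("" if v is None else str(v).strip() for v in values)
    return hashlib.md5(text.encode("utf-8")).hexdigest().upper()

# Hypothetical stage row; dict order stands in for table column order.
row = {"org_id": 42, "org_name": "Acme Oncology", "currentsts": "ACTIVE"}
load_dts = "2015-06-01 00:00:00"
cdc_fields = ["org_id"]  # field(s) the SOR uses to identify the row

# CDC_KEY: identifying fields only.
cdc_key = md5_key(row[f] for f in cdc_fields)
# CDC_ATTR: every field that is not part of the CDC_KEY.
cdc_attr = md5_key(v for f, v in row.items() if f not in cdc_fields)
# PRIM_KEY: all source fields plus LOAD_DTS, so each loaded version is unique.
prim_key = md5_key(list(row.values()) + [load_dts])

# A later version of the same source row: same CDC_KEY, different CDC_ATTR.
changed = dict(row, currentsts="INACTIVE")
changed_key = md5_key(changed[f] for f in cdc_fields)
changed_attr = md5_key(v for f, v in changed.items() if f not in cdc_fields)
```

Comparing CDC_ATTR values for rows that share a CDC_KEY is one way to detect a change without comparing every column individually.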

Page 14: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Presentation Layer – (Stage / Integration Joins)

COMN_INTG contains Business Keys in Domains and linkages between Domains in Relationships

D_xxx_KEY and R_xxx_KEY fields in COMN_INTG are populated with hashed business keys also contained in KM_xxx tables in COMN_INTG

Domains and Relationships are joined to KeyMaps and COMN_STG tables to create different COMN_PRSNTN elements (3-NF or Star Schema style) and optimized as needed:
● Small/Simple: Logical views (faster time to market, less performance)
● Medium: Materialized views
● Large/Complex: ETL loaded/tuned tables (slower time to market, more performance)

Page 15: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Hybrid Type 1-2 Dims

Need to support both SCD1 and SCD2 queries
Could build two sets of views
We built 1 view that has two keys
● SCD1_<Hub>_KEY
● SCD2_<Hub>_KEY

SCD1 Key = Hub/Domain PK (MD5)
SCD2 Key = PRIM_KEY from Stage (MD5)
● Gepetto stage table is a Type 2 table already
● Includes all columns + LOAD_DTS

Page 16: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Example View Mapping: DIM2_MED

Page 17: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Example Join Code: DIM2_MED

FROM COMN_INTG.D_MED D
  INNER JOIN COMN_INTG.KM_MED KM
    ON D.D_MED_KEY = KM.D_MED_KEY
    AND KM.EXPR_DTS IS NULL
  INNER JOIN COMN_STG.G2_MEDICATION STG
    ON KM.CDC_KEY = STG.CDC_KEY
    AND KM.REC_SRC = STG.REC_SRC
    AND KM.REC_SRC_TBL = STG.REC_SRC_TBL
    AND KM.LOAD_DTS <= STG.LOAD_DTS
    AND (KM.EXPR_DTS IS NULL OR KM.EXPR_DTS >= STG.EXPR_DTS)

Page 18: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Type 1 Rows – Current Values

Use the SCD1_KEY columns
Use virtual CURR_FLG or EXPR_DTS

CASE
  WHEN LEAD(stg.LOAD_DTS) OVER (PARTITION BY stg.CDC_KEY ORDER BY stg.LOAD_DTS) IS NULL
  THEN 'Y'
  ELSE 'N'
END CURR_FLG,
LEAD(stg.LOAD_DTS) OVER (PARTITION BY stg.CDC_KEY ORDER BY stg.LOAD_DTS) EXPR_DTS
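
The CURR_FLG and EXPR_DTS expressions above, combined with the two-key idea, can be sketched as one view that serves both Type 1 and Type 2 queries. This is a minimal sketch run in SQLite through Python (window functions need SQLite 3.25 or later). The table names d_med, km_med, and stg_med and the sample rows are hypothetical, and the join is simplified: it omits the REC_SRC, REC_SRC_TBL, and date-range conditions shown in the DIM2_MED join code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical domain, key map, and stage tables.
CREATE TABLE d_med   (d_med_key TEXT PRIMARY KEY);
CREATE TABLE km_med  (d_med_key TEXT, cdc_key TEXT);
CREATE TABLE stg_med (prim_key TEXT, cdc_key TEXT, load_dts TEXT, med_name TEXT);

INSERT INTO d_med  VALUES ('D1');
INSERT INTO km_med VALUES ('D1', 'C1');
INSERT INTO stg_med VALUES
  ('P1', 'C1', '2015-01-01', 'Aspirin 81mg'),
  ('P2', 'C1', '2015-04-01', 'Aspirin 100mg');

-- One view, two keys: SCD1 key from the domain, SCD2 key from stage PRIM_KEY.
CREATE VIEW dim2_med AS
SELECT d.d_med_key  AS scd1_med_key,
       stg.prim_key AS scd2_med_key,
       stg.med_name,
       stg.load_dts,
       LEAD(stg.load_dts) OVER (PARTITION BY stg.cdc_key
                                ORDER BY stg.load_dts) AS expr_dts,
       CASE WHEN LEAD(stg.load_dts) OVER (PARTITION BY stg.cdc_key
                                          ORDER BY stg.load_dts) IS NULL
            THEN 'Y' ELSE 'N' END AS curr_flg
FROM d_med d
JOIN km_med km   ON km.d_med_key = d.d_med_key
JOIN stg_med stg ON stg.cdc_key = km.cdc_key;
""")

# Type 2 query: full history, one row per stage version.
history = conn.execute(
    "SELECT scd2_med_key, expr_dts, curr_flg FROM dim2_med ORDER BY load_dts"
).fetchall()

# Type 1 query: current values only, keyed by the SCD1 key.
current = conn.execute(
    "SELECT scd1_med_key, med_name FROM dim2_med WHERE curr_flg = 'Y'"
).fetchall()
```

Because the stage table is insert-only, the expiry date and current flag are derived at query time rather than stored, so no update pass is needed to close out old rows.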

Page 19: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

The Secret Transform Table

DATA_XFRM
● In Integration layer
● Data driven translation table
● Allows “light” transformations via joins/views
● Embedded in Virtual Dimension code

Page 20: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Transform Table Design

Page 21: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Example Xfrm Data

Page 22: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Example Translation Join

LEFT OUTER JOIN comn_intg.data_xfrm ct
    ON ct.SRC_SCHEMA_NM = 'COMN_STG'
    AND ct.SRC_TABLE_NM = 'KDW_ORG'
    AND ct.SRC_FIELD_NM = 'currentsts'
    AND ct.TGT_SCHEMA_NM = 'COMN_PRSNTN'
    AND UPPER (ct.TGT_TABLE_NM) IN ('DIM2_PRCTC_HIER')
    AND ct.TGT_FIELD_NM = 'cntrct_typ_cd'
    AND ct.SRC_VALUE_NM = UPPER (od.currentsts);
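
A minimal, self-contained sketch of the translation join above, run in SQLite through Python. The DATA_XFRM column names follow the join code, except tgt_value_nm, which is an assumed name for the target value field. The status values and the codes 'A' and 'T' are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Data-driven translation table; tgt_value_nm is an assumed column name.
CREATE TABLE data_xfrm (
  src_schema_nm TEXT, src_table_nm TEXT, src_field_nm TEXT, src_value_nm TEXT,
  tgt_schema_nm TEXT, tgt_table_nm TEXT, tgt_field_nm TEXT, tgt_value_nm TEXT);
CREATE TABLE kdw_org (org_id INTEGER, currentsts TEXT);

INSERT INTO data_xfrm VALUES
  ('COMN_STG', 'KDW_ORG', 'currentsts', 'ACTIVE',
   'COMN_PRSNTN', 'DIM2_PRCTC_HIER', 'cntrct_typ_cd', 'A'),
  ('COMN_STG', 'KDW_ORG', 'currentsts', 'TERMINATED',
   'COMN_PRSNTN', 'DIM2_PRCTC_HIER', 'cntrct_typ_cd', 'T');
INSERT INTO kdw_org VALUES (1, 'Active'), (2, 'terminated'), (3, 'Pending');
""")

rows = conn.execute("""
SELECT od.org_id, ct.tgt_value_nm AS cntrct_typ_cd
FROM kdw_org od
LEFT OUTER JOIN data_xfrm ct
    ON ct.src_schema_nm = 'COMN_STG'
    AND ct.src_table_nm = 'KDW_ORG'
    AND ct.src_field_nm = 'currentsts'
    AND ct.tgt_schema_nm = 'COMN_PRSNTN'
    AND UPPER(ct.tgt_table_nm) IN ('DIM2_PRCTC_HIER')
    AND ct.tgt_field_nm = 'cntrct_typ_cd'
    AND ct.src_value_nm = UPPER(od.currentsts)
ORDER BY od.org_id
""").fetchall()
```

The LEFT OUTER JOIN keeps source rows that have no mapping (here org 3), so an untranslated value shows up as NULL rather than dropping the row. New translations are added as rows in the table, not as changes to the view.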

Page 23: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Does it work?

Yes!
Some views work great, others are slow
● Usually with huge volumes

Mitigation –
● Materialized views
● Increased parallelism on base tables

Best option – Oracle Exadata
● Implementing 11g SuperCluster
● Initial results – 10x performance improvement

Page 24: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Conclusion: Benefits of Virtualization

We can now rapidly demonstrate the contents of a Type 2 dim prior to ETL programming

By using PIT tables we don’t need the Load End DTS on the Sats, so the Sats become insert-only as well (simpler loads, no update pass required)

Another by-product is that the Sat is now also Hadoop compliant (again, insert-only)

Since the nullable Load End DTS is not needed, you can now more easily partition the Sat table by Hub Id and Load DTS.

Page 25: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Cowpath Highway

Old Way vs New Way

Which way will you follow?

Sign up for WWDVC 2016 at wwdvc.com

Page 27: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Super Charge Your Data Warehouse

Available on Amazon.com
Soft Cover or Kindle Format

Now also available in PDF at LearnDataVault.com

Page 29: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Contact Information

Kent Graziano
Data Warrior [email protected]
Twitter @KentGraziano
Visit my blog at http://kentgraziano.com

Page 30: Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions

Contact Information

Keith Hoyle
Sr. Mgr., Enterprise Data Architecture
McKesson Specialty [email protected]
Visit my blog at http://khoyle001.wordpress.com