ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source...
-
Upload
evelyn-dennis -
Category
Documents
-
view
223 -
download
0
Transcript of ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source...
![Page 1: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/1.jpg)
ETL
By Dr. Gabriel
![Page 2: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/2.jpg)
ETL Process
• 4 major components:– Extracting
• Gathering raw data from source systems and storing it in ETL staging environment
– Cleaning and conforming• Processing data to improve its quality, format it, merge from
multiple sources, enforce conformed dimensions
– Delivering• Loading data into data warehouse tables
– Managing• Management of ETL environment
![Page 3: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/3.jpg)
ETL: Extracting
• Data profiling
• Identifying data that changed since last load
• extraction
![Page 4: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/4.jpg)
ETL: Cleaning and Conforming
• Data cleansing
• Recording error events
• Audit dimensions
• Deduping
• Creating and maintaining conformed dimensions and facts
![Page 5: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/5.jpg)
ETL: Delivering
• Implementation of SCD logic• Surrogate key generation• Managing hierarchies in dimensions• Managing special dimensions such as date and
time, junk, mini, shrunken, small static, and user-maintained dimensions– Mini dimensions
• used to track changes of dimension attribute when type 2 technique is infeasible.
• Similar to junk dimensions Typically is used for large dimensions
• Combinations can be built in advance or on the fly• Built from dimension table input
![Page 6: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/6.jpg)
ETL: Delivering (Cont)
– Small static dimensions• Dimensions created by the ETL system without
real source• Lookup dimensions for translations of codes, etc.
– User maintained dimensions• Master dimensions without real source system• Descriptions, groupings, hierarchies created for
reporting and analysis purposes.
![Page 7: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/7.jpg)
ETL: Delivering (Cont)
• Fact table loading• Building and maintaining bridge dimension
tables• Handling late arriving data• Management of conformed dimensions• Administration of fact tables• Building aggregations• Building OLAP cubes• Transferring DW data to other environment for
specific purposes
![Page 8: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/8.jpg)
ETL: Managing
• Management of ETL environment– Goals
• Reliability• Availability• Manageability
– Job scheduler– backup system– Recovery and restart system– Version control system
![Page 9: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/9.jpg)
ETL: Managing (Cont.)
• Version migration system• Workflow monitor• Sorting system• Analyzing dependencies and lineage• Problem escalation system• Parallelization• Security system• Compliance manager• Metadata repository manager
![Page 10: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/10.jpg)
ETL Process
• Planning– High level source to target data flow diagram– Selection and implementation of ETL tool– Development of default strategies for
dimension management, error handling, and other processes
– Development data transformations diagrams by target table
– Development of job sequencing
![Page 11: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/11.jpg)
ETL Process
• Developing one-time historic load– Build and test the historic dimension and fact
tables load
• Developing incremental load process– Build and test dimension and fact tables
incremental load processes– Build and test aggregate table loads and/or
OLAP processing– Design, build, and test the ETL system
automation
![Page 12: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/12.jpg)
ETL Tools: Build vs Buy
• Many off-the-shelf tools exist
• Benefits are not seen right away– Setup– Learning curve
• High-end tools may not justify value for smaller warehouses
![Page 13: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/13.jpg)
Off-the-shelf ETL ToolsTool Vendor
Oracle Warehouse Builder (OWB) Oracle
Data Integrator (BODI) Business Objects
IBM Information Server (Ascential) IBM
SAS Data Integration Studio SAS Institute
PowerCenter Informatica
Oracle Data Integrator (Sunopsis) Oracle
Data Migrator Information Builders
Integration Services Microsoft
Talend Open Studio Talend
DataFlow Group 1 Software (Sagent)
Data Integrator Pervasive
Transformation Server DataMirror
Transformation Manager ETL Solutions Ltd.
Data Manager Cognos
DT/Studio Embarcadero Technologies
ETL4ALL IKAN
DB2 Warehouse Edition IBM
Jitterbit Jitterbit
Pentaho Data Integration Pentaho
![Page 14: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/14.jpg)
ETL Specification Document
• Can be as large as 100 pages per business process; In reality, the work starts after the high level design is documented in a few pages.
• Source-to-target mappings• Data profiling reports• Physical design decisions• Default strategy for extracting from each major
source system• Archival strategy• Data quality tracking and metadata• Default strategy for managing changes to
dimension attributes
![Page 15: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/15.jpg)
ETL Specification Document (Cont)
• System availability requirements and strategy• Design of data auditing mechanism• Location of staging areas• Historic and incremental load strategies for each
table– Detailed table design– Historic data load parameters (# of months) and
volumes (# of rows)– Incremental data volumes
![Page 16: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/16.jpg)
ETL Specification Document (Cont)
– Handling of late arriving data– Load frequency– Handling of changes in each dimension attribute
(types 1,2,3)– Table partitioning– Overview of data sources; discussion of source-
specific characteristics– Extract strategy for the source data– Change data capture logic for each source table– Dependencies– Transformation logic (diagram or pseudo code)
![Page 17: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/17.jpg)
ETL Specification Document (Cont)
– Preconditions to avoid error conditions– Recovery and restart assumptions for each
major step of the ETL pipeline– Archiving assumptions for each table– Cleanup steps– Estimated effort
• Overall workflow• Job sequencing• Logical dependencies
![Page 18: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/18.jpg)
Loading Pointers
• One time historic load– Disable RI constraints (FKs) and re-enable
them after the load is complete– Drop indexes and re-create them after the
load is complete– Use bulk loading techniques– Not always the case
![Page 19: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/19.jpg)
Loading Pointers (Cont)
• Incremental load
![Page 20: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/20.jpg)
Loading Pointers (Cont)
– Sometimes historic and incremental load logic is the same; many times- is similar.
– Updating aggregations, if necessary– Error handling
![Page 21: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/21.jpg)
Sample: Generation of Surrogate Keys on SQL Server
As simple as:DECLARE @i INTEGER SELECT @i = MAX(ID) + 1 FROM TableNameBut may not work with concurrent processes
ORCreate PROCEDURE pGetNextID (@SeedName VARCHAR(32), @SeedValue BIGINT OUTPUT) AS
UPDATE Lookup_Seed SET @SeedValue = SeedValue = SeedValue + 1
WHERE SeedID = @SeedName
Lookup_Seed table:SeedID varchar (32)SeedValue bigint
![Page 22: ETL By Dr. Gabriel. ETL Process 4 major components: –Extracting Gathering raw data from source systems and storing it in ETL staging environment –Cleaning.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649dc45503460f94ab7159/html5/thumbnails/22.jpg)
Questions ?