7 Strategies for Extracting, Transforming, and Loading.
![Page 1: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/1.jpg)
7 Strategies for Extracting, Transforming, and Loading
![Page 2: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/2.jpg)
Extraction, Transformation, and Loading Processes (ETL)
• Extract source data
• Transform and cleanse data
• Index and summarize
• Load data into warehouse
• Detect changes
• Refresh data
[Diagram: Operational systems, ETL (programs, tools, gateways), Warehouse]
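The first steps listed above can be sketched as three small functions. This is only an illustrative outline, not a prescribed implementation: the field names (`id`, `name`, `amount`, `internal_flag`) and the in-memory `warehouse` dictionary are assumptions made for the example.

```python
def extract(source_rows):
    """Select only the fields the warehouse needs from each source row."""
    return [{"id": r["id"], "name": r["name"], "amount": r["amount"]} for r in source_rows]

def transform(rows):
    """Cleanse: trim and title-case names, cast amounts to float."""
    return [
        {"id": r["id"], "name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, warehouse):
    """Load rows into an in-memory stand-in for a warehouse table, keyed by id."""
    for r in rows:
        warehouse[r["id"]] = r

# Hypothetical operational rows; "internal_flag" is not wanted in the warehouse.
source = [
    {"id": 1, "name": "  alice smith ", "amount": "10.50", "internal_flag": "x"},
    {"id": 2, "name": "BOB JONES", "amount": "3", "internal_flag": "y"},
]
warehouse = {}
load(transform(extract(source)), warehouse)
```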
![Page 3: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/3.jpg)
Data Staging Area
• The construction site for the warehouse
• Required by most implementations
• Composed of ODS, flat files, or relational server tables
• Frequently configured as multitier staging
[Diagram: Extract, Transform, Transport, Transform, Transport (Load) across the operational, staging, and warehouse environments]
![Page 4: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/4.jpg)
Preferred Traditional Staging Model
Remote staging: Data staging area in its own environment, avoiding negative impact on the warehouse environment
![Page 5: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/5.jpg)
Extracting Data
• Routines developed to select fields from source
• Various data formats
• Rules, audit trails, error correction facilities
• Various techniques
![Page 6: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/6.jpg)
Examining Source Systems
• Production
– Legacy systems
– Database systems
– Vertical applications
• Archive
– Historical (for initial load)
– Used for query analysis
– May require transformations
![Page 7: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/7.jpg)
Mapping
• Defines which operational attributes to use
• Defines how to transform the attributes for the warehouse
• Defines where the attributes exist in the warehouse
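One way to picture a mapping is as a small table of entries, each naming a source attribute, a transformation, and a target column. The sketch below assumes hypothetical attribute names (`CUST_NM`, `ORD_AMT`, `ORD_DT`) and target columns; real mappings are usually held as metadata in an ETL tool or repository.

```python
# Each entry answers the three mapping questions:
# which operational attribute, how to transform it, where it lands in the warehouse.
MAPPING = [
    {"source": "CUST_NM", "transform": str.strip, "target": "customer_name"},
    {"source": "ORD_AMT", "transform": float, "target": "order_amount"},
    {"source": "ORD_DT", "transform": lambda s: s.replace("/", "-"), "target": "order_date"},
]

def apply_mapping(source_row, mapping=MAPPING):
    """Build a warehouse row from an operational row using the mapping."""
    return {m["target"]: m["transform"](source_row[m["source"]]) for m in mapping}

# Attributes not named in the mapping (here "UNUSED") are simply not carried over.
row = apply_mapping(
    {"CUST_NM": " Acme Ltd ", "ORD_AMT": "125.00", "ORD_DT": "2001/03/15", "UNUSED": 1}
)
```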
![Page 8: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/8.jpg)
Designing Extraction Processes
• Analysis
– Sources, technologies
– Data types, quality, owners
• Design options
– Manual, custom, gateway, third-party
– Replication, full, or delta refresh
• Design issues
– Batch window, volumes, data currency
– Automation, skills needed, resources
• Maintenance of metadata trail
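As an illustration of the delta refresh option, the sketch below selects only rows changed since the previous extract. It assumes each source row carries a reliable `updated_at` timestamp, which is not a given in many operational systems; the function and field names are hypothetical.

```python
from datetime import datetime

def extract_delta(rows, last_extract):
    """Return rows modified after the previous extract, plus the new watermark."""
    changed = [r for r in rows if r["updated_at"] > last_extract]
    # If nothing changed, keep the old watermark.
    new_watermark = max((r["updated_at"] for r in changed), default=last_extract)
    return changed, new_watermark

orders = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, 9, 0)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, 9, 0)},
]
delta, watermark = extract_delta(orders, datetime(2024, 1, 1, 12, 0))
```

The returned watermark would be stored as part of the metadata trail so that the next run knows where to resume.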
![Page 9: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/9.jpg)
Importance of Data Quality
• Business user confidence
• Query and reporting accuracy
• Standardization
• Data integration
![Page 10: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/10.jpg)
Benefits of Data Quality
Cleansed data is critical for:
• Standardization within the warehouse
• High quality matching on names and addresses
• Creation of accurate rules and constraints
• Prediction and analysis
• Creation of a solid infrastructure to support customer-centric business intelligence
• Reduction of project risk
• Reduction of long term costs
![Page 11: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/11.jpg)
Guidelines for Data Quality
• Operational data should not be used directly in the warehouse.
• Operational data must be cleaned for each increment.
• Operational data is not simply fixed by modifying applications.
![Page 12: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/12.jpg)
Transformation
Transformation eliminates operational data anomalies:
• Cleans
• Standardizes
• Presents subject-oriented data
[Diagram: Extract, Transform, Transport, Transform, Transport (Load); transformation steps: Restructure, Consolidate, Cleanse]
![Page 13: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/13.jpg)
Transformation Routines
• Cleansing data
• Eliminating inconsistencies
• Adding elements
• Merging data
• Integrating data
• Transforming data before load
![Page 14: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/14.jpg)
Why Transform?
• In-house system development
• Multipart keys
• Multiple encoding
• Multiple local standards
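Multiple encoding typically means that different source systems code the same value differently. A minimal sketch, assuming a hypothetical gender attribute and made-up source codes, maps every variant to one warehouse standard:

```python
# Hypothetical per-system codes mapped to a single warehouse standard.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}

def standardize_gender(value):
    """Map a source-specific gender code to 'M', 'F', or 'U' (unknown)."""
    return GENDER_CODES.get(str(value).strip().lower(), "U")
```

For example, `standardize_gender("Male")` and `standardize_gender(1)` both yield `"M"`, while an unrecognized code yields `"U"` rather than failing.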
![Page 15: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/15.jpg)
Why Transform?
• Multiple files
• Missing values
• Duplicate values
• Element names
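Missing and duplicate values are often handled in one pass. The sketch below is one possible policy, not the only one: it fills `None` with a supplied default and keeps the first row seen per key. The key name `id` and the `region` field are assumptions.

```python
def clean_rows(rows, key="id", defaults=None):
    """Fill missing values with defaults and keep the first row seen per key."""
    defaults = defaults or {}
    seen, cleaned = set(), []
    for row in rows:
        if row[key] in seen:
            continue  # duplicate key: keep the first occurrence only
        seen.add(row[key])
        # Replace None with the default for that field, if one was supplied.
        filled = {k: (defaults.get(k) if v is None else v) for k, v in row.items()}
        cleaned.append(filled)
    return cleaned

rows = [
    {"id": 1, "region": "EU"},
    {"id": 2, "region": None},
    {"id": 1, "region": "EU"},
]
result = clean_rows(rows, defaults={"region": "UNKNOWN"})
```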
![Page 16: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/16.jpg)
Why Transform?
• Element meaning
• Input format
• Referential integrity
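A referential integrity check can be as simple as finding rows whose foreign key has no parent. The sketch assumes a hypothetical `customer_id` foreign key on sales rows and a list of known customer keys.

```python
def find_orphans(fact_rows, dimension_keys, fk="customer_id"):
    """Return fact rows whose foreign key has no matching dimension row."""
    valid = set(dimension_keys)
    return [r for r in fact_rows if r[fk] not in valid]

sales = [
    {"sale_id": 10, "customer_id": 1},
    {"sale_id": 11, "customer_id": 7},  # no customer 7 exists
]
orphans = find_orphans(sales, dimension_keys=[1, 2, 3])
```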
![Page 17: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/17.jpg)
Why Transform?
Name and address:
• No unique key
• Missing data values (NULLs)
• Personal and commercial names mixed
• Different addresses for the same member
• Different names and spelling for the same member
• Many names on one line
• One name on two lines
• The data may be in a single field of no fixed format
• Each component of an address is in a specific field
![Page 18: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/18.jpg)
Integration (Match and Merge)
[Diagram: source schema, match and merge, target schema]
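A deliberately crude sketch of match and merge follows. It assumes records match when their normalized name and postcode agree, and that the first non-empty value for each field wins; production matching on names and addresses is usually far more sophisticated. All field names are hypothetical.

```python
import re

def match_key(record):
    """Build a crude match key: letters/digits of the name plus the postcode."""
    name = re.sub(r"[^a-z0-9]", "", record["name"].lower())
    postcode = re.sub(r"\s+", "", record["postcode"].upper())
    return (name, postcode)

def match_and_merge(records):
    """Group records by match key and merge each group; first non-empty value wins."""
    merged = {}
    for rec in records:
        target = merged.setdefault(match_key(rec), {})
        for field, value in rec.items():
            if value and not target.get(field):
                target[field] = value
    return list(merged.values())

customers = [
    {"name": "J. Smith", "postcode": "ab1 2cd", "phone": ""},
    {"name": "j smith", "postcode": "AB1 2CD", "phone": "555-0100"},
    {"name": "Ann Lee", "postcode": "XY9 8ZZ", "phone": ""},
]
result = match_and_merge(customers)
```

Here the two "Smith" records collapse into one, picking up the phone number from the second source.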
![Page 19: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/19.jpg)
Transformation Techniques
• Merging data
– Operational transactions do not usually map one-to-one with warehouse data.
– Data for the warehouse is merged to provide information for analysis.
• Adding keys to data
![Page 20: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/20.jpg)
Transformation Techniques
Time
![Page 21: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/21.jpg)
Transformation Techniques
Adding a date stamp:
• Fact table
– Add triggers
– Recode applications
– Compare tables
• Dimension table
• Time representation
– Point in time
– Time span
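The "compare tables" option can be sketched as a diff between two snapshots, with each new or changed row given a point-in-time date stamp. The snapshot layout (dictionaries keyed by id) and the `effective_date` column name are assumptions for the example.

```python
from datetime import date

def stamp_changes(previous, current, load_date):
    """Compare two snapshots keyed by id; return new or changed rows, date-stamped."""
    stamped = []
    for key, row in current.items():
        if previous.get(key) != row:  # new key, or values differ from last snapshot
            stamped.append({**row, "id": key, "effective_date": load_date})
    return stamped

previous = {1: {"price": 10}, 2: {"price": 20}}
current = {1: {"price": 10}, 2: {"price": 25}, 3: {"price": 5}}
changes = stamp_changes(previous, current, date(2024, 3, 1))
```

A time-span representation would instead carry a start and end date per row; that variant is not shown here.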
![Page 22: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/22.jpg)
Transformation Techniques
Creating summary data:
• During extraction on staging area
• After loading onto the warehouse server
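As a rough illustration of building summary data in the staging area, the sketch below rolls daily sales up to monthly totals per product. The row layout and ISO date strings are assumptions.

```python
from collections import defaultdict

def summarize_by_month(sales):
    """Aggregate detail rows into (month, product) totals."""
    totals = defaultdict(float)
    for s in sales:
        month = s["date"][:7]  # 'YYYY-MM' prefix of an ISO date string
        totals[(month, s["product"])] += s["amount"]
    return dict(totals)

sales = [
    {"date": "2024-01-05", "product": "widget", "amount": 10.0},
    {"date": "2024-01-20", "product": "widget", "amount": 5.0},
    {"date": "2024-02-02", "product": "widget", "amount": 7.5},
]
monthly = summarize_by_month(sales)
```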
![Page 23: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/23.jpg)
Transformation Techniques
Creating artificial keys:
• Use generalized or derived keys
• Maintain the uniqueness of a row
• Use an administrative process to assign the key
• Concatenate operational key with number
– Easy to maintain
– Cumbersome keys
– No clean value for retrieval
[Diagram: operational key 109908 becomes artificial key 109908 01]
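The concatenation approach can be sketched in a few lines. The two-digit width of the appended number is an assumption taken from the 109908 / 01 example; the function names are hypothetical.

```python
def make_artificial_key(operational_key, version):
    """Concatenate the operational key with a two-digit version number."""
    return f"{operational_key}{version:02d}"

def split_artificial_key(artificial_key):
    """Recover (operational_key, version) by peeling off the last two digits."""
    return artificial_key[:-2], int(artificial_key[-2:])

key = make_artificial_key(109908, 1)
```

The need for a helper such as `split_artificial_key` illustrates the "no clean value for retrieval" drawback: the operational key is no longer directly usable in a query.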
![Page 24: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/24.jpg)
Where to Transform?
Choose wisely where the transformation takes place:
• Operational platform
• Staging area
• Warehouse server
![Page 25: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/25.jpg)
When to Transform?
Choose the transformation point wisely:
• Workload
• Environment impact
• CPU use
• Disk space
• Network bandwidth
• Parallel execution
• Load window time
• User information needs
![Page 26: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/26.jpg)
Designing Transformation Processes
• Analysis
– Sources and target mappings, business rules
– Key users, metadata, grain, verify integrity of data
• Design options
– Programming, tools
• Design issues
– Performance
– Size of the staging area
– Exception handling, integrity maintenance
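Exception handling in a transformation step often means routing bad rows aside rather than aborting the run. A minimal sketch, assuming a hypothetical `amount` field that must convert to a number:

```python
def transform_with_exceptions(rows):
    """Convert amounts to float; route rows that fail to an exception list."""
    good, rejected = [], []
    for row in rows:
        try:
            good.append({**row, "amount": float(row["amount"])})
        except (KeyError, TypeError, ValueError) as exc:
            # Keep the original row and the reason so it can be corrected and rerun.
            rejected.append({"row": row, "reason": repr(exc)})
    return good, rejected

rows = [
    {"id": 1, "amount": "12.5"},
    {"id": 2, "amount": "n/a"},  # not a number
    {"id": 3},                   # amount missing entirely
]
good, rejected = transform_with_exceptions(rows)
```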
![Page 27: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/27.jpg)
Loading Data into the Warehouse
• Loading moves the data into the warehouse.
• Subsequent refresh moves smaller volumes.
• Business determines the cycle.
[Diagram: Extract, Transform, Transport, Transform, Transport (Load) across the operational, staging, and warehouse environments]
![Page 28: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/28.jpg)
Extract versus Warehouse Processing Environment
• Extract processing builds a new database after each time interval.
• Warehouse processing adds changes to the database after each time interval.
[Diagram: operational databases at intervals T1, T2, and T3, shown once for extract processing and once for warehouse processing]
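The contrast between the two processing styles can be sketched with in-memory stand-ins. The snapshot dictionaries and interval labels below are assumptions; the point is only that one approach rebuilds and the other accumulates.

```python
def extract_processing(snapshot):
    """Build a brand-new database from the current snapshot; earlier state is discarded."""
    return dict(snapshot)

def warehouse_processing(warehouse, snapshot, interval):
    """Add rows that changed since the last interval, keeping earlier versions."""
    for key, value in snapshot.items():
        history = warehouse.setdefault(key, [])
        if not history or history[-1][1] != value:
            history.append((interval, value))
    return warehouse

snapshots = [("T1", {"a": 1}), ("T2", {"a": 2}), ("T3", {"a": 2, "b": 7})]

rebuilt = None
warehouse = {}
for interval, snap in snapshots:
    rebuilt = extract_processing(snap)            # replaced at every interval
    warehouse_processing(warehouse, snap, interval)  # grows over time
```

After T3, `rebuilt` holds only the latest state, while `warehouse` still records that `a` was 1 at T1.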
![Page 29: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/29.jpg)
First-Time Load
• Single event that populates the database with historical data
• Involves a large volume of data
• Uses distinct ETL tasks
• Involves large amounts of processing after load
![Page 30: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/30.jpg)
Refresh
• Performed according to a business cycle
• Simpler task
• Less data to load than first-time load
• Less complex ETL
• Smaller amounts of postload processing
![Page 31: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/31.jpg)
Building the Transportation Process
Specification:
• Techniques and tools
• File transfer methods
• The load window
• Time window for other tasks
• First-time and refresh volumes
• Frequency of the refresh cycle
• Connectivity bandwidth
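Several items in this specification (volumes, bandwidth, load window) combine into a simple feasibility estimate. The sketch below uses decimal units and an assumed 70% effective link utilization; both the figure and the function names are illustrative assumptions, and real transfers should be measured.

```python
def transfer_hours(volume_gb, bandwidth_mbps, efficiency=0.7):
    """Estimate hours to move volume_gb over a link of bandwidth_mbps (megabits/s)."""
    megabits = volume_gb * 8 * 1000  # 1 GB = 8000 megabits in decimal units
    seconds = megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600

def fits_load_window(volume_gb, bandwidth_mbps, window_hours, efficiency=0.7):
    """True if the estimated transfer time fits inside the load window."""
    return transfer_hours(volume_gb, bandwidth_mbps, efficiency) <= window_hours

first_load_hours = transfer_hours(500, 1000)  # hypothetical 500 GB first-time load
```

With these assumed numbers, a 500 GB load over a 1000 Mbps link takes roughly 1.6 hours, whereas the same load over a 100 Mbps link would not fit a 4-hour window.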
![Page 32: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/32.jpg)
Building the Transportation Process
• Test the proposed technique
• Document proposed load
• Gain agreement on the process
• Monitor
• Review
• Revise
![Page 33: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/33.jpg)
Granularity
• Important design and operational issue
• Low-level grain: Expensive, high level of processing, more disk, detail
• High-level grain: Cheaper, less processing, less disk, little detail
• Space requirements
– Storage
– Backup
– Recovery
– Partitioning
– Load
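The cost difference between grains can be estimated with simple arithmetic. The sketch assumes a worst case in which every product sells in every store in every period, and a hypothetical 100 bytes per fact row.

```python
def fact_rows(products, stores, periods):
    """Upper bound on fact rows: one row per product, store, and period."""
    return products * stores * periods

def storage_gb(rows, bytes_per_row):
    """Raw storage in decimal gigabytes, before indexes, backup, or summaries."""
    return rows * bytes_per_row / 1e9

daily_rows = fact_rows(1000, 50, 365)   # low-level (daily) grain for one year
monthly_rows = fact_rows(1000, 50, 12)  # high-level (monthly) grain for one year
daily_gb = storage_gb(daily_rows, 100)
monthly_gb = storage_gb(monthly_rows, 100)
```

Under these assumptions the daily grain needs about 30 times the rows of the monthly grain, and the same multiplier carries through to backup, recovery, and load.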
![Page 34: 7 Strategies for Extracting, Transforming, and Loading.](https://reader035.fdocuments.us/reader035/viewer/2022062500/5697bfe01a28abf838cb2fe8/html5/thumbnails/34.jpg)
Post-Processing of Loaded Data
[Diagram: Extract, Transform, Transport, followed by post-load Summarize and Index]
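A minimal sketch of post-load processing, using Python's built-in `sqlite3` module as a stand-in for the warehouse server. The table and column names are assumptions; the point is that indexing and summarizing happen after the bulk load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")

# Bulk load first, with no indexes on the table.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-01-05", "widget", 10.0),
        ("2024-01-20", "widget", 5.0),
        ("2024-02-02", "gadget", 7.5),
    ],
)

# Index: built once after the load rather than maintained row by row during it.
conn.execute("CREATE INDEX idx_sales_product ON sales (product)")

# Summarize: a monthly summary table derived from the loaded detail rows.
conn.execute(
    "CREATE TABLE sales_monthly AS "
    "SELECT substr(sale_date, 1, 7) AS month, product, SUM(amount) AS total "
    "FROM sales GROUP BY substr(sale_date, 1, 7), product"
)

summary = conn.execute(
    "SELECT month, product, total FROM sales_monthly ORDER BY month, product"
).fetchall()
```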