
ETL

Mission of ETL
The back room must support 4 key steps:
  – Extracting data from the original sources
  – Assuring data quality and cleaning the data
  – Conforming the labels and measures in the data to achieve consistency across the original sources
  – Delivering the data in a physical format that can be used by query tools and report writers

Data Quality
• Problems arise when data doesn’t mean what you think it does
  – garbage in, garbage out
• Or when you don’t understand the data
  – complexity, lack of metadata
• Data quality problems are expensive and pervasive
  – they cost hundreds of billions of dollars each year
  – resolving data quality problems is often the biggest effort in business intelligence

Examples
• Can we interpret this data?
  – What do the fields mean? What is the key? The measures?
• Data problems
  – typos, multiple formats, missing / default values
• Metadata and domain expertise
  – field three is Revenue. In dollars or euro?

  T.Das|97336o8327|24.95|Y|-|0.0|1000
  Ted J.|973-360-8779|2000|N|M|NY|1000
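To make these problems concrete, here is a minimal validation sketch in Python over the two sample records. The field layout assumed below (name, phone, revenue, flag, gender, region, size) is an assumption; the slide deliberately leaves it ambiguous, which is exactly the metadata problem.

```python
import re

# Two raw records from the slide; the field layout is an assumption.
RECORDS = [
    "T.Das|97336o8327|24.95|Y|-|0.0|1000",
    "Ted J.|973-360-8779|2000|N|M|NY|1000",
]

PHONE_RE = re.compile(r"^\d{3}-?\d{3}-?\d{4}$")

def check(record: str) -> list[str]:
    """Return suspected quality problems for one pipe-delimited record."""
    _name, phone, revenue, _flag, gender, _region, _size = record.split("|")
    problems = []
    if not PHONE_RE.match(phone):
        problems.append(f"phone looks invalid: {phone!r}")   # letter 'o' instead of zero
    try:
        float(revenue)   # numeric, but dollars or euro? only metadata can tell
    except ValueError:
        problems.append(f"revenue not numeric: {revenue!r}")
    if gender not in {"M", "F"}:
        problems.append(f"gender missing or cryptic: {gender!r}")
    return problems

for r in RECORDS:
    print(r, "->", check(r) or "no obvious problems")
```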

Examples
• Dummy values
  – a clerk enters 999-99-9999 as an SSN rather than asking the customer; customers 235 years old
• Absence of data
  – NULL for customer age
• Cryptic data
  – 0 for male, 1 for female
• Contradicting data
  – customers with different addresses
• Violation of business rules
  – NULL for customer name on an invoice
• Reused primary keys, non-unique identifiers
  – a branch bank is closed; several years later a new branch is opened and the old identifier is used again
• Data integration problems
  – multipurpose fields, synonyms

Data Quality Factors
• Accuracy
  – data was recorded correctly, without errors
  – example: “Jhn” vs. “John”
• Completeness
  – all relevant data was recorded; no partial knowledge of the records in a table or of the attributes in a record
• Uniqueness
  – entities are recorded once
  – example: the same customer entered twice
• Timeliness/Currency
  – data is kept up to date
  – example: permanent residence address, outdated vs. up to date
• Consistency
  – data agrees with itself, no discrepancies in the data
  – example: ZIP code and city are consistent
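As a minimal sketch, three of these factors (completeness, uniqueness, consistency) can be turned into measurable checks. The customer rows and the ZIP-to-city reference table below are made up for illustration.

```python
# Hypothetical customer rows with deliberately planted problems.
customers = [
    {"id": 1, "name": "John", "zip": "10001", "city": "New York"},
    {"id": 2, "name": "Jhn",  "zip": "10001", "city": "Boston"},    # accuracy + consistency issues
    {"id": 2, "name": "John", "zip": None,    "city": "New York"},  # duplicate id, missing zip
]

# Completeness: share of records with no missing attribute values.
complete = sum(all(v is not None for v in c.values()) for c in customers) / len(customers)

# Uniqueness: every entity (id) is recorded only once.
unique = len({c["id"] for c in customers}) == len(customers)

# Consistency: ZIP code and city must agree (assumed reference mapping).
ZIP_TO_CITY = {"10001": "New York"}
inconsistent = [c for c in customers
                if c["zip"] in ZIP_TO_CITY and ZIP_TO_CITY[c["zip"]] != c["city"]]

print(f"completeness={complete:.0%}, unique ids={unique}, inconsistent rows={len(inconsistent)}")
```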

Causes of Poor Data Quality
  – Business rules do not exist, or there are no standards for data capture
  – Standards may exist but are not enforced at the point of data capture
  – Inconsistent data entry occurs (incorrect spelling; use of nicknames, middle names, or aliases)
  – Data entry mistakes happen (character transposition, misspellings, and so on)
  – Data is integrated from systems with different data standards
  – Data quality issues are perceived as time-consuming and expensive to fix

Causes of Poor Data Quality
[Chart omitted] Source: The Data Warehousing Institute, Data Quality and the Bottom Line, 2002


“A properly designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format so that application developers can build applications and end users can make decisions… ETL makes or breaks the data warehouse…” – Ralph Kimball

ETL

MS BI Architecture
[Diagram omitted] Source: The Data Warehouse Backroom

Definition of ETL
• Extraction, Transformation, Loading – ETL
• Get data out of the sources and load it into the DW
  – data is extracted from the OLTP database, transformed to match the DW schema, and loaded into the data warehouse database
  – may also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets

Complexity of ETL
• ETL is a complex combination of process and technology
  – the most underestimated process in DW development
  – the most time-consuming process in DW development
  – often, 80% of development time is spent on ETL

Independent data marts
[Diagram: separate E-T-L pipelines, one per data mart]
• Data marts: mini-warehouses, limited in scope
• Separate ETL for each independent data mart
• Data access complexity due to multiple data marts
• Periodic extraction: data is not completely fresh in the warehouse

Dependent data marts
[Diagram: one E-T-L pipeline feeding the EDW, which in turn loads the data marts]
• Single ETL for the enterprise data warehouse (EDW)
• Simpler data access
• ODS provides an option for obtaining current data
• Dependent data marts are loaded from the EDW

Data Staging Area (DSA)
• Transit storage for data underway in the ETL process
  – creates a logical and physical separation between the source systems and the data warehouse
  – transformations/cleansing are done here
• No user queries (though some shops allow them)
• Few, sequential operations on large data volumes
  – performed by central ETL logic
  – easily restarted
  – RDBMS or flat files? (DBMSs have become better)
• Allows centralized backup/recovery
  – often too time-consuming to load all data marts from scratch after a failure
  – backup/recovery facilities are needed

Data Staging Area (DSA)
• To stage or not to stage?
• A conflict between
  – getting the data from the operational systems as fast as possible
  – having the ability to restart without repeating the process from the beginning
• Reasons for staging
  – Recoverability: stage the data as soon as it has been extracted from the source systems and immediately after major processing (cleaning, transformation, etc.)
  – Backup: the data warehouse can be reloaded from the staging tables without going back to the sources
  – Auditing: lineage between the source data and the underlying transformations before the load to the data warehouse

Extract
• Capture/Extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse (both styles are sketched below)
  – Static extract: capturing a snapshot of the source data at a point in time
  – Incremental extract: capturing the changes that have occurred since the last static extract
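A minimal sketch of the two extract styles in Python. The source is modelled as a list of dicts and the `updated_at` change-tracking column is an assumption; real sources may instead expose a change log or CDC stream.

```python
from datetime import datetime

def static_extract(source_rows: list[dict]) -> list[dict]:
    """Static extract: take a full snapshot of the source as it exists right now."""
    return list(source_rows)

def incremental_extract(source_rows: list[dict], last_extract_time: datetime) -> list[dict]:
    """Incremental extract: capture only rows changed since the previous extract."""
    return [row for row in source_rows if row["updated_at"] > last_extract_time]
```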

Extraction
• The ETL process needs to effectively integrate systems that have different
  – DBMSs
  – operating systems
  – hardware
  – communication protocols
• A logical data map is needed before the physical data can be transformed
• The logical data map describes the relationship between the starting points and the ending points of your ETL system
  – usually presented in a table or spreadsheet

• The content of the logical data mapping document has proven to be the critical element required to efficiently plan ETL processes
• The table type gives the cue for the ordinal position of the data load processes: first dimensions, then facts
• The primary purpose of this document is to provide the ETL developer with a clear-cut blueprint of exactly what is expected from the ETL process. This table must depict, without question, the course of action involved in the transformation process
• The transformation column can contain anything from the absolute solution to nothing at all

Logical data map – column layout (one row per target column):
  Target:  Table Name | Column Name | Data Type
  Source:  Table Name | Column Name | Data Type
  Transformation
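As a sketch, the same logical data map can be kept as a machine-readable structure that the ETL code iterates over. All table and column names below are illustrative, not taken from the slides.

```python
# One entry per target column; all names are illustrative.
logical_data_map = [
    {
        "target_table": "dim_customer", "target_column": "customer_name", "target_type": "VARCHAR(100)",
        "source_table": "CUST",         "source_column": "CST_NM",        "source_type": "CHAR(40)",
        # the transformation can contain anything from the absolute solution to nothing at all
        "transformation": "trim and proper-case the raw name",
    },
    {
        "target_table": "fact_sales",   "target_column": "amount",        "target_type": "DECIMAL(12,2)",
        "source_table": "ORDERS",       "source_column": "AMT",           "source_type": "NUMBER",
        "transformation": "",           # straight copy, nothing at all
    },
]
```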

Transform
• The main step where the ETL adds value
• Changes data to be inserted into the DW
• 3 major steps
  – data cleansing
  – data integration
  – other transformations (replacement of codes, derived values, calculating aggregates)

Cleanse
• Scrub/Cleanse: uses pattern recognition and AI techniques to upgrade data quality
• Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
• Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

Typical Cleansing Tasks
• Standardization
  – changing data elements to a common format defined by the data steward
• Gender analysis
  – returns a person’s gender based on the elements of their name (returns M(ale), F(emale) or U(nknown))
• Entity analysis
  – returns an entity type based on the elements in a name (e.g. customer names); returns O(rganization), I(ndividual) or U(nknown)
• De-duplication
  – uses match coding to discard records that represent the same person, company or entity (sketched below)
• Matching and merging
  – matches records based on a given set of criteria and at a given sensitivity level
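A minimal sketch of match-code de-duplication in Python. The normalisation rules (uppercase, punctuation to spaces, drop legal suffixes) are assumptions standing in for a real cleansing tool's match codes; misspellings such as “Brigs” vs. “Briggs” additionally need fuzzy matching (see the supplier example further below).

```python
import re

LEGAL_NOISE = {"INC", "CORP", "CO", "THE", "LTD"}

def match_code(name: str) -> str:
    """Crude match code: uppercase, punctuation to spaces, drop legal suffixes/articles."""
    words = re.sub(r"[^A-Z0-9]+", " ", name.upper()).split()
    return " ".join(w for w in words if w not in LEGAL_NOISE)

def deduplicate(names: list[str]) -> list[str]:
    """Keep the first record per match code and discard the rest."""
    seen, kept = set(), []
    for n in names:
        key = match_code(n)
        if key not in seen:
            seen.add(key)
            kept.append(n)
    return kept

print(deduplicate(["Casper Corp.", "The Casper Corp", "Solomon Industries"]))
# -> ['Casper Corp.', 'Solomon Industries']
```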

Typical Cleansing Tasks
• Address verification
  – verifies that addresses are valid
• Householding/Consolidation
  – used to identify links between records, such as customers living in the same household or companies belonging to the same parent company
• Parsing
  – identifies individual data elements and separates them into unique fields
• Casing
  – provides the proper capitalization to data elements that have been entered in all lowercase or all uppercase
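A minimal Python sketch of the parsing and casing tasks. The “street, city, state zip” address format is an assumed convention for the example.

```python
def parse_address(raw: str) -> dict:
    """Parsing: split a free-form 'street, city, state zip' string into separate fields."""
    street, city, state_zip = [part.strip() for part in raw.split(",")]
    state, zip_code = state_zip.split()
    return {"street": street, "city": city, "state": state, "zip": zip_code}

def proper_case(value: str) -> str:
    """Casing: re-capitalize data entered in all upper or all lower case."""
    return " ".join(word.capitalize() for word in value.split())

print(parse_address("100 SAS CAMPUS DRIVE, CARY, NC 27513"))
print(proper_case("MARK CRAVER"))
```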

Cleansing Example
Who is the biggest supplier?
  Anderson Construction   $ 2,333.50
  Briggs,Inc              $ 8,200.10
  Brigs Inc.              $12,900.79
  Casper Corp.            $27,191.05
  Caspar Corp             $ 6,000.00
  Solomon Industries      $43,150.00
  The Casper Corp         $11,500.00
  ...

Cleansing Example (after cleansing)
[Chart: total $ spent per consolidated supplier – Casper Corp. is the biggest supplier, ahead of Solomon Industries, Briggs Inc. and Anderson Construction]
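A minimal sketch of how the supplier spend above could be consolidated with fuzzy matching. The similarity measure (difflib's SequenceMatcher) and the 0.8 threshold are assumptions; commercial cleansing tools use tuned match rules instead.

```python
from collections import defaultdict
from difflib import SequenceMatcher

raw_spend = [
    ("Anderson Construction", 2333.50), ("Briggs,Inc", 8200.10),
    ("Brigs Inc.", 12900.79), ("Casper Corp.", 27191.05),
    ("Caspar Corp", 6000.00), ("Solomon Industries", 43150.00),
    ("The Casper Corp", 11500.00),
]

def similar(a: str, b: str) -> float:
    """String similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

totals = defaultdict(float)
for name, amount in raw_spend:
    # Fold the spend into an already-seen supplier if the names are close enough.
    match = next((k for k in totals if similar(name, k) >= 0.8), None)
    totals[match or name] += amount

for supplier, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{supplier:<22} ${total:>10,.2f}")
# Casper Corp. and Solomon Industries come out on top once the name variants are merged.
```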

Data Matching Example (without data quality processing)
Operational system of records → Data warehouse: the three source records are loaded as three separate customers
  01  Mark Carver, SAS, SAS Campus Drive, Cary, N.C.
  02  Mark W. Craver, Mark.Craver@sas.com
  03  Mark Craver, Systems Engineer, SAS

Data Matching Example (with data quality processing)
Operational system of records → DQ → Data warehouse: the three source records are matched and merged into one
  01  Mark Craver, Systems Engineer, SAS, SAS Campus Drive, Cary, N.C. 27513, Mark.Craver@sas.com
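A minimal Python sketch of the merge step illustrated above: once the three operational records are matched as the same person, one consolidated record is built. The survivorship rule here ("first non-empty value per field") is a deliberate simplification of what real DQ tools do.

```python
# Three matched operational records for the same person (fields from the slide).
records = [
    {"name": "Mark Carver",    "title": "",                 "company": "SAS",
     "address": "SAS Campus Drive, Cary, N.C.", "email": ""},
    {"name": "Mark W. Craver", "title": "",                 "company": "",
     "address": "",                             "email": "Mark.Craver@sas.com"},
    {"name": "Mark Craver",    "title": "Systems Engineer", "company": "SAS",
     "address": "",                             "email": ""},
]

def merge(matched: list[dict]) -> dict:
    """Survivorship: keep the first non-empty value per field across the matched records."""
    return {field: next((r[field] for r in matched if r[field]), "") for field in matched[0]}

# Note: this naive rule keeps the misspelled 'Carver'; real DQ tools also score
# which value is the most trustworthy before building the consolidated record.
print(merge(records))
```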

Transform
• Transform: convert data from the format of the operational system to the format of the data warehouse
• Record-level transformations:
  – selection: data partitioning
  – joining: data combining
  – aggregation: data summarization
• Field-level transformations:
  – single-field: from one field to one field
  – multi-field: from many fields to one, or from one field to many

Single-field transformation
• In general, some transformation function translates data from the old form to the new form
• Algorithmic transformation uses a formula or logical expression
• Table lookup is another approach: it uses a separate table keyed by the source record code

Multifield transformation (both variants are sketched below)
• M:1 – from many source fields to one target field
• 1:M – from one source field to many target fields
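A minimal sketch of these transformation types in Python. The gender codes, name parts, and date layout are illustrative assumptions.

```python
# Single-field, algorithmic: a formula or logical expression.
def celsius_to_fahrenheit(c: float) -> float:
    return c * 9 / 5 + 32

# Single-field, table lookup: a separate table keyed by the source record code.
GENDER_LOOKUP = {"0": "Male", "1": "Female"}
def decode_gender(code: str) -> str:
    return GENDER_LOOKUP.get(code, "Unknown")

# Multifield M:1 - many source fields become one target field.
def full_name(first: str, middle: str, last: str) -> str:
    return " ".join(part for part in (first, middle, last) if part)

# Multifield 1:M - one source field becomes many target fields.
def split_date(yyyymmdd: str) -> tuple[int, int, int]:
    return int(yyyymmdd[:4]), int(yyyymmdd[4:6]), int(yyyymmdd[6:])
```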

Loading
• Load/Index: place the transformed data into the warehouse and create indexes
• Refresh mode: bulk rewriting of the target data at periodic intervals
• Update mode: only changes in the source data are written to the data warehouse (both modes are sketched below)
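A minimal sketch of the two load modes, with the warehouse table modelled as a Python dict keyed by a natural key (a real load would of course target a database table).

```python
def refresh_load(warehouse: dict, source_rows: list[dict]) -> None:
    """Refresh mode: bulk rewrite of the target data at periodic intervals."""
    warehouse.clear()
    warehouse.update({row["key"]: row for row in source_rows})

def update_load(warehouse: dict, changed_rows: list[dict]) -> None:
    """Update mode: only changes in the source data are written to the warehouse."""
    for row in changed_rows:
        warehouse[row["key"]] = row
```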

Loading tools
• Backup and disaster recovery
  – restart logic, error detection
• Continuous loads during a fixed batch window (usually overnight)
  – maximize system resources to load data efficiently in the allotted time
• Parallelization
  – several tens of GB per hour

Loading and SCD
• The data loading module consists of all the steps required to administer slowly changing dimensions (SCD) and write the dimension to disk as a physical table in the proper dimensional format, with correct primary keys, correct natural keys, and final descriptive attributes
• Creating and assigning the surrogate keys occurs in this module
• When the DW receives notification that an existing row in a dimension has changed, there are 3 types of responses (Type 1 and Type 2 are sketched below):
  – Type 1
  – Type 2
  – Type 3 (not in MS SQL Server 2008)
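A minimal sketch of the Type 1 and Type 2 responses for a changed dimension attribute (here, a customer's city). The column names (customer_key, natural_key, valid_from, valid_to, is_current) are assumptions about the dimension layout.

```python
from datetime import date

def apply_type1(dim_rows: list[dict], natural_key: str, new_city: str) -> None:
    """Type 1: overwrite the changed attribute in place; history is lost."""
    for row in dim_rows:
        if row["natural_key"] == natural_key:
            row["city"] = new_city

def apply_type2(dim_rows: list[dict], natural_key: str, new_city: str) -> None:
    """Type 2: expire the current row and insert a new row with a new surrogate key."""
    today = date.today()
    current = next(r for r in dim_rows if r["natural_key"] == natural_key and r["is_current"])
    current.update(is_current=False, valid_to=today)
    dim_rows.append({
        "customer_key": max(r["customer_key"] for r in dim_rows) + 1,  # new surrogate key
        "natural_key": natural_key, "city": new_city,
        "valid_from": today, "valid_to": None, "is_current": True,
    })
```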

ETL Tools
• ETL tools from the big vendors
  – Oracle Warehouse Builder
  – IBM DB2 Warehouse Manager
  – Microsoft Integration Services
• Hundreds of others:
  – http://en.wikipedia.org/wiki/Extract,_transform,_load#Tools

SQL Server Integration Services


SSIS: Data Extraction
• In SSIS, extractions are managed via source adapters, called Data Flow Sources; source adapters
  – read metadata from connection managers,
  – connect to the source,
  – read raw data from the source, and
  – forward it to the pipeline

SSIS: Data Extraction – Connection Manager
• A connection manager is a logical representation of a physical source
• Different types of sources need different kinds of information
  – e.g., to connect to an Excel document, the file name and path need to be stored, along with which data sheets are being accessed and whether or not the document has column headers

SSIS: Data Cleansing
• Establish baseline indicators that assess data quality on data sets, and decide what to do when these indicators are violated (3 options):
  – fix the data
  – discard the record
  – stop the ETL process
• In the SSIS environment, baseline indicators and violation rules can be set up by branching data of different quality in a data flow pipeline (the idea is sketched below):
  – using data flow transformations such as conditional splits and by applying the SSIS Expression Language (a language similar to C#)
  – using control flow tasks like Send Mail to alert the ETL staff
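The branching idea can be sketched outside SSIS as plain Python (this illustrates the concept of a conditional split, not the SSIS Expression Language itself); the required fields and the 5% failure threshold are assumptions.

```python
def split_by_quality(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Branch rows by quality: pass good rows, repair fixable ones, discard the rest."""
    good, fixable, bad = [], [], []
    for row in rows:
        if row.get("customer_id") and row.get("amount") is not None:
            good.append(row)                      # meets the baseline indicators
        elif row.get("customer_id"):
            row["amount"] = 0.0                   # fix the data (default the missing amount)
            fixable.append(row)
        else:
            bad.append(row)                       # discard the record (no key at all)
    if len(bad) / max(len(rows), 1) > 0.05:
        raise RuntimeError("baseline data quality indicator violated - stopping the ETL process")
    return good + fixable, bad
```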

SSIS: Data Flow Transformations
• Data flow transformations for matching, such as:
  – Fuzzy Lookup, which finds close or exact matches by comparing different columns
  – Fuzzy Grouping, which can add values for data matching similarities
• Other data flow transformations, such as:
  – Merge Join, which combines sorted data flows
  – Character Map, which converts character mappings
  – Data Conversion, which applies transformation functions

SSIS: Data Loading
• Data Flow Destinations load data into the tables in a star join schema
• Load operations, e.g.:
  – Data Flow Task: extract data, apply transformations, and load data
  – For Loop Container
  – Bulk Insert Task