03-ETL
7/29/2019 03-ETL
Robert Wrembel
Poznań University of Technology
Institute of Computing Science
Poznań, Poland
[email protected]@cs.put.poznan.pl
www.cs.put.poznan.pl/rwrembel
On Building Integrated and Distributed Database Systems
Data Integration for Warehousing - ETL
Outline
ETL in a Data Warehouse architecture
ETL characteristics
Extraction
Transformation
Loading
Requirements for ETL
ETL metadata
Prototype systems
ETL in DW architecture
[Figure: DW architecture. Data sources -> intermediate layer with ETL software (extraction, transformation, loading, optional aggregation) -> data warehouse and data marts -> BI applications (reports, financial and statistical analysis)]
ETL characteristics
Developing ETL processes
critical for DW operation
data quality
data "freshness" (up to date)
DW is refreshed in a finite time window (any delay in DW refreshing makes it outdated, inconsistent, or unavailable for use)
costly and time consuming
up to 70% of project resources
people
hardware
software
ETL characteristics
Gartner report on DW projects in Fortune 500 financial institutions
100 persons involved in a DW project
55 ETL
17 systems' administrators (DB, hardware, software)
4 system architects
9 consultants for the end user on the BI technology
5 software developers
9 managers
hardware
multiprocessor servers, TB disks (5 million USD)
ETL software (1 million USD)
typical number of data sources being integrated: 10-50
Technological challenges
Processing large data volumes in a limited time window
Delivering reliable (valid, true, consistent) data => data quality
Processing an ETL flow efficiently
Managing the evolution of data sources
Data quality - case study
Integration of dean's office databases
9 databases
in total over 70 000 students
[Figure: the nine dean's office databases: WIiZ, WArch, WBiI, WBMiZ, WEiT, WE, WFT, WMRiT, WTC]
Data quality - case study
Student number (SN)
SN is unique within one dean's office database
globally SN is not unique => problem of uniquely identifying a student
found 49 pairs of students having the same SN; students in a pair are physically different persons
format: 6 digits + an optional letter {a, d, s, i}
2.75 incorrect
"SSN"
format: 11 digits
22% incorrect
incorrect length, characters instead of digits, wrong checksum, wrong gender
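The two format rules above (SN: 6 digits plus an optional letter from {a, d, s, i}; "SSN": 11 digits) can be checked mechanically. The sketch below is only an illustration: the function names are hypothetical, and it covers the format rules alone, not the checksum or gender checks that the case study also mentions.

```python
import re

# Student number: 6 digits plus an optional letter from {a, d, s, i}.
SN_PATTERN = re.compile(r"[0-9]{6}[adsi]?")
# "SSN": exactly 11 digits (checksum and gender checks are NOT covered here).
SSN_PATTERN = re.compile(r"[0-9]{11}")


def is_valid_sn(value: str) -> bool:
    """True if the whole string matches the student number format."""
    return SN_PATTERN.fullmatch(value) is not None


def is_valid_ssn_format(value: str) -> bool:
    """True if the whole string consists of exactly 11 digits."""
    return SSN_PATTERN.fullmatch(value) is not None


def error_rate(values, check) -> float:
    """Fraction of values rejected by the given check (0.0 for empty input)."""
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if not check(v)) / len(values)


print(error_rate(["123456", "123456a", "12345", "123456x"], is_valid_sn))
```

Running `error_rate` over a whole column gives the kind of "percent incorrect" figure reported in the case study.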
Data quality - case study
First names
dictionary of first names was applied
0.8% not in the dictionary
0.9% incorrect
illegal characters, illegal values
Last names
dictionary of last names was applied
20% not in the dictionary
0.04% wrong characters
Predefined dictionaries
31 values, of which 2 correct
757 values, of which 299 correct
74 values, of which 3 correct
Data quality - case study
The dictionary of cities
4.3% of wrong characters
81% with mixed lower-uppercase character strings
ETL architecture
[Figure: ETL architecture. Data sources (databases, files, ODBC/JDBC sources) -> ETL with a staging area / operational data store (ODS): extraction, transformation, cleaning, integration, loading -> data warehouse]
Data sources
Each data source supplying data to a DW has to be identified
A data source description includes, among others:
domain of activity (HR, Payroll, Marketing, ...)
type of applications used for data processing
data importance for a BI user
who is a business user of source data
who is a user of a technical architecture
DBMS used to manage data
hardware and operating systems
the number of users per day
data volume sizes
DB schema
the number of transactions per day
Data access technologies
Gateway
ODBC/JDBC
OLE DB (Object Linking and Embedding DataBase)
Drivers to various types of files (flat text, XML, ...)
Detecting changes in DSs
Requirements
minimal interference with processing in data sources
minimal (typically no) changes in data sources (structure, applications)
Solutions
audit columns
in a monitored table: date and time of operation, operation type (I, U, D)
providing values by means of: triggers, applications
snapshot log: a system-maintained log of changes
redo log: a system DB log (transaction rollback, transaction recovery, DB recovery)
periodical analysis (log scraping) / on-line analysis (log sniffing)
comparison of 2 consecutive snapshots
low efficiency
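The last solution, comparing two consecutive snapshots, can be sketched as follows. This is a minimal illustration that assumes each snapshot fits in memory as a dictionary keyed by primary key; the sample rows and the function name are made up.

```python
def diff_snapshots(old, new):
    """Compare two snapshots given as {primary_key: row} dictionaries.

    Returns a list of (operation, key, row) tuples, where operation is
    'I' (insert), 'U' (update) or 'D' (delete).
    """
    changes = []
    for key, row in new.items():
        if key not in old:
            changes.append(("I", key, row))
        elif old[key] != row:
            changes.append(("U", key, row))
    for key, row in old.items():
        if key not in new:
            changes.append(("D", key, row))
    return changes


# Made-up sample data: two consecutive snapshots of one source table.
yesterday = {1: ("Smith", "Poznan"), 2: ("Jones", "Warsaw"), 3: ("Kim", "Gdansk")}
today = {1: ("Smith", "Poznan"), 2: ("Jones", "Krakow"), 4: ("Lee", "Lodz")}

for change in diff_snapshots(yesterday, today):
    print(change)
```

Both snapshots have to be read in full even when only a few rows changed, which is one reason this approach is inefficient.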
Analyzing data sources
Analytical methods (statistical, data mining) for estimating characteristics of data (data profiling)
Analytical methods
data quality
identifying NULL/NOT NULL columns
for each attribute, count the number of rows with NULL values and/or default values (a default value may denote that no value was provided during row insert)
identifying columns with unique values
maximum length of values
allowed ranges/sets of values
MIN, MAX, AVG, Variance, STDEV
identifying not allowed values
the number of rows with not allowed values
attribute cardinality
distribution of values for each attribute (histograms)
data formats (e.g., dates, money, telephone numbers)
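Several of the profiling measures listed above can be computed per column in one pass. The sketch below is an assumption-laden illustration: a column is a plain Python list, NULL is `None`, and the `default` parameter is a hypothetical way of marking the column's default value.

```python
from collections import Counter
from statistics import mean, pstdev


def profile_column(values, default=None):
    """Compute simple profiling statistics for one column (a list of values)."""
    non_null = [v for v in values if v is not None]
    histogram = Counter(non_null)  # distribution of values
    profile = {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "defaults": histogram[default] if default is not None else 0,
        "cardinality": len(histogram),
        "unique": len(histogram) == len(non_null),
        "max_length": max((len(str(v)) for v in non_null), default=0),
        "histogram": dict(histogram),
    }
    # MIN, MAX, AVG and STDEV only make sense for numeric columns.
    numeric = [v for v in non_null
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if numeric:
        profile.update(min=min(numeric), max=max(numeric),
                       avg=mean(numeric), stdev=pstdev(numeric))
    return profile


print(profile_column([10, 20, None, 20]))
```

Allowed ranges and value formats would need rules supplied by the designer, so they are left out here.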
Analyzing data sources
Analytical methods
the structure and content of data sources
daily growth of data volume
MigrationArchitect (Evoke Software), Integrity (Vality)
Analyzing data sources
Data mining methods: association rules + domain knowledge
Sapia C., Höfling G., et al.: On Supporting the Data Warehouse Design by Data Mining Techniques
discovering attribute meaning
(country='GB' => sw=2), support 95%: sw = steering wheel; 2 = right side
completing missing values based on rules with a high support
replacing wrong values with correct ones
discovering functional dependencies between attributes
discovering potential keys
discovering business rules implicitly encoded in applications
WizRule (WizSoft), DataMiningSuite (Information Discovery)
Transformation
Requirements
Interactive and iterative process
define rules -> start the transformation -> verify results -> modify rules
Easily extendible
Optimizable
The more tasks executed automatically the better
The less data for manual transformation the better
Transformation
Transformation to a common data model
{object, O-R, semistructured, ...} -> relational
Transformation to a common representation
Employee {SSN, FName, LName, Street, No, PostalCode, City}
Removing useless columns
User verification/correction is often required
Cleaning
Extracting atomic values from strings
Piotrowo 2, 60-965, Poznań
ordering the values
Removing Null values
Replacing wrong values with correct ones
spelling dictionaries
name dictionaries (countries, cities, address codes)
Standardizing values
formatting values (e.g., dates, money)
converting currencies
lower-upper case conversion
consistent abbreviations
synonym dictionaries (WordNet)
abbreviation dictionaries
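Extracting atomic values from a string and standardizing them can be sketched with a regular expression and a small dictionary. The input layout "street number, postal code, city" is an assumption taken from the single example above, and `CITY_NAMES` is a hypothetical name dictionary used only for illustration.

```python
import re

# Assumed input layout: "<street> <number>, <postal code>, <city>".
ADDRESS = re.compile(
    r"^\s*(?P<street>.+?)\s+(?P<number>\d+\w*)\s*,\s*"
    r"(?P<postal_code>\d{2}-\d{3})\s*,\s*(?P<city>.+?)\s*$"
)

# Hypothetical name dictionary mapping known spellings to the standard form.
CITY_NAMES = {"poznan": "Poznań", "poznań": "Poznań"}


def parse_address(text):
    """Split an address string into atomic values, or return None if it does not fit."""
    match = ADDRESS.match(text)
    if match is None:
        return None  # leave the row for manual correction
    parts = match.groupdict()
    city = parts["city"]
    # Standardize: dictionary lookup first, consistent capitalization otherwise.
    parts["city"] = CITY_NAMES.get(city.lower(), city.title())
    return parts


print(parse_address("Piotrowo 2, 60-965, Poznań"))
```

Rows that do not match the pattern are returned as `None` rather than guessed at, which fits the remark that user verification is often required.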
Cleaning
Merging semantically identical records
Generating artificial identifiers
IdCentric (FirstLogic), Trillium (Trillium Software)
Integration - duplicate elimination
Records have to be cleaned before they are compared
remove punctuation, white spaces, and special characters
no abbreviations
Records differ slightly
{Wrembel, Robert, ul. Wyspiańskiego, Poznań}
{Wrbel, Robert, ul. Wyspiańskiego, Poznań}
Use natural identifiers (e.g., SSN, passport No, engine No, e-mail)
No natural identifiers
sort + compare n neighbor records (window of size n)
similarity function (e.g., if first and last names are identical then the records are identical)
similarity weights for attributes
approximate join
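The "sort + compare n neighbor records" idea can be sketched as follows. The similarity function here is an assumption: it averages `difflib.SequenceMatcher` ratios over all attributes with equal weights, which is just one possible choice, and the threshold value is arbitrary.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Average string similarity over the attributes of two records (equal weights)."""
    ratios = [SequenceMatcher(None, x.lower(), y.lower()).ratio()
              for x, y in zip(a, b)]
    return sum(ratios) / len(ratios)


def sorted_neighborhood(records, key, window=3, threshold=0.9):
    """Sort records by key, then compare each record with its window-1 successors."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            if similarity(rec, other) >= threshold:
                pairs.append((rec, other))
    return pairs


records = [
    ("Wrembel", "Robert", "ul. Wyspiańskiego", "Poznań"),
    ("Kowalski", "Jan", "ul. Długa", "Warszawa"),  # made-up record
    ("Wrbel", "Robert", "ul. Wyspiańskiego", "Poznań"),
]

for pair in sorted_neighborhood(records, key=lambda r: r[0].lower()):
    print(pair)
```

Sorting keeps the number of comparisons proportional to the window size instead of comparing every pair of records, at the cost of missing duplicates whose sort keys are far apart.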
Duplicate elimination
Simple similarity measure
the number of matching atomic strings / total number of unique atomic strings
Universidad de Costa Rica, Faculdad de Ingeniería
Univ. de C. Rica Faculd. de Ingen.
similarity measure = 5/5
Universidad de Costa Rica, Faculdad de Ingeniería, Escuela de Ciencias de la Computación e Informática
Univ. de C. Rica Faculd. de Ingen.
similarity measure = 5/9
Duplicate elimination
Soundex
grouping entities having the same pronunciation
entities pronounced identically have the same value of SOUNDEX (even if they are written differently)
soundex('Smith')=soundex('Smit')=S530
Levenshtein/edit distance
similarity measure of two character strings: source L1, destination L2
the distance is the minimal number of inserts and deletes (sometimes updates) of characters needed to obtain L2 from L1
L1 and L2 are identical: distance=0
ABC -> ABCDEF: distance=3
DEFCAB -> ABC: distance=5
Merge (Sagent), DataCleanser (EDD)
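Both measures can be sketched in a few lines. The edit distance below counts only inserts and deletes, matching the description above (it is not the variant with substitutions). The Soundex function follows the common American Soundex rules and is an assumption about which variant the slide refers to; a given DBMS may differ in details.

```python
def indel_distance(source, target):
    """Minimal number of single-character inserts and deletes turning source into target."""
    # dp[j] = distance between the processed prefix of source and target[:j]
    dp = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        prev_diag, dp[0] = dp[0], i
        for j, t_char in enumerate(target, start=1):
            prev_row = dp[j]
            if s_char == t_char:
                dp[j] = prev_diag                  # characters match: no operation
            else:
                dp[j] = 1 + min(dp[j], dp[j - 1])  # delete from source or insert
            prev_diag = prev_row
    return dp[-1]


SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """American Soundex code (first letter + 3 digits) for an ASCII name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != previous:
            code += digit
        if c not in "HW":  # H and W do not separate letters with the same code
            previous = digit
    return (code + "000")[:4]


print(indel_distance("ABC", "ABCDEF"), indel_distance("DEFCAB", "ABC"))
print(soundex("Smith"), soundex("Smit"))
```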
Refreshing/loading a DW
When to refresh a DW?
synchronously (after committing a transaction in a data source): RTDWs
asynchronously: traditional DWs
automatically in a given interval
on demand
What to send?
data (Oracle)
transactions (Sybase, SQL Server)
How to refresh?
incrementally
fully
How frequently to refresh?
in a batch mode
in a stream mode (RTDWs)
Refreshing efficiency
In a given finite time window
Read only necessary data
Avoid
DISTINCT, set operators,
NOT and non-equal joins (usually require full scans)
function calls in the WHERE clause
GROUP BY in queries reading source data
sorting in a data source may be ineffective
sorting may interfere with original processing in a data source
triggers in a DW
Refreshing efficiency
Separate UPDATEs and INSERTs
UPDATEs are not executed in a direct load path
replacing UPDATE by DELETE and INSERT
the number of UPDATEs > INSERTs => TRUNCATE TABLE + INSERTs
Indexes
drop + re-create
maintain on-line
indexes and UPDATEs
remove indexes not used by UPDATEs
execute UPDATEs
remove remaining indexes
execute INSERTs
re-create indexes
Integrity constraints
turn off before loading
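The index-handling sequence above (remove indexes not used by UPDATEs, execute UPDATEs, remove remaining indexes, execute INSERTs, re-create indexes) can be sketched with Python's built-in `sqlite3` module. The table, index names, and data are hypothetical, and SQLite is only a stand-in for a DW engine; the performance effect in a real DW depends on the DBMS.

```python
import sqlite3

# Hypothetical index definitions for a hypothetical table sales(order_no, amount).
INDEXES = {
    "sales_order_idx": "CREATE INDEX sales_order_idx ON sales (order_no)",
    "sales_amount_idx": "CREATE INDEX sales_amount_idx ON sales (amount)",
}


def refresh(conn, rows):
    """Load (order_no, amount) rows, separating UPDATEs from INSERTs."""
    existing = {r[0] for r in conn.execute("SELECT order_no FROM sales")}
    updates = [(amount, no) for no, amount in rows if no in existing]
    inserts = [(no, amount) for no, amount in rows if no not in existing]
    with conn:  # commits on success, rolls back pending changes on error
        conn.execute("DROP INDEX IF EXISTS sales_amount_idx")  # not used by UPDATEs
        conn.executemany("UPDATE sales SET amount = ? WHERE order_no = ?", updates)
        conn.execute("DROP INDEX IF EXISTS sales_order_idx")   # remaining index
        conn.executemany("INSERT INTO sales (order_no, amount) VALUES (?, ?)", inserts)
        for ddl in INDEXES.values():                           # re-create indexes
            conn.execute(ddl)
    return len(updates), len(inserts)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_no TEXT, amount REAL)")
for ddl in INDEXES.values():
    conn.execute(ddl)
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("A1", 10.0), ("A2", 20.0)])
conn.commit()

result = refresh(conn, [("A2", 25.0), ("A3", 30.0)])
print(result)
```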
Refreshing efficiency
Redo log
turn off redo log writing
ETL software may roll back failed transactions
data loaded in a batch mode => failed transactions may be easily re-executed
turn off redo log writing for a particular table
Use direct load path
Filter data stored in files by means of an OS utility (awk command)
Sort data stored in files by means of an OS utility (sort)
Sort and compute aggregates in the ETL engine (not in a DW)
Refreshing efficiency
Transformation of data
in a DW (ELT)
in an ETL workflow
Parallel loading (partitioned and non-partitioned tables)
Use native drivers for accessing data sources (avoid ODBC/JDBC)
Gather DW statistics after refreshing
Defragment DW
Purpose of ODS
Separating ETL processing from original processing in data sources
Re-executing failed transactions
ODS content
Original source data
Partially processed data
Storing ETL metadata
Mapping tables (EDS -> DW): lineage, data provenance
DW rows and their origins in data sources + a chain of transformations
ODS is implemented as a database or a set of files
Designing ETL
Data profiling
Defining ETL workflows
Testing on a sample, verifying data quality
Executing ETL
Modifying EDS, improving data quality
repository
Jarke M., et al.: Improving OLTP Data Quality Using Data Warehouse Mechanisms. SIGMOD Record, (28):2, 1999
Implementing ETL
ETL: a workflow of transformations
Transformations
aggregation
filtering
joining
normalizing values
lookup
generating IDs
sorting
EDS connector (DB, file, ...)
...
user-defined
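An ETL workflow as a chain of transformations can be sketched by treating each transformation as a function from rows to rows. Everything below is illustrative: the step names, the sample rows, and the lookup dictionary are made up, and only three of the listed transformation types (filtering, lookup, generating IDs) are shown.

```python
from itertools import count


def filter_rows(predicate):
    """Filtering: keep only rows satisfying the predicate."""
    return lambda rows: (r for r in rows if predicate(r))


def lookup(column, table, target):
    """Lookup: add `target` by looking up row[column] in a dictionary."""
    return lambda rows: ({**r, target: table.get(r[column])} for r in rows)


def generate_ids(column, start=1):
    """Generating IDs: add a sequential artificial identifier."""
    def step(rows):
        ids = count(start)
        return ({**r, column: next(ids)} for r in rows)
    return step


def run_workflow(rows, steps):
    """Apply the steps in order; each step consumes the output of the previous one."""
    for step in steps:
        rows = step(rows)
    return list(rows)


source = [
    {"city": "poznan", "amount": 120},
    {"city": "berlin", "amount": None},
    {"city": "warsaw", "amount": 80},
]
countries = {"poznan": "PL", "warsaw": "PL", "berlin": "DE"}

workflow = [
    filter_rows(lambda r: r["amount"] is not None),
    lookup("city", countries, "country"),
    generate_ids("row_id"),
]
result = run_workflow(source, workflow)
print(result)
```

Because every step has the same shape, a user-defined transformation is just another function added to the list.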
ETL metadata
Business
dictionary of business terms
mapping of business terms into DW objects
business rules
data quality
Managing ETL execution schedules
scripts
execution logs
monitoring
ETL metadata
Technical
source description (localization, structure, content)
source type (relational db, object db, xml, html, spreadsheet,...)
structure/schema
access methods
users and their access rights
data profiling results
daily increase in data volume
total data volume
data statistics (for access optimization)
DW description
logical schema
physical data structures
various DW statistics (for query optimization)
physical disk organization
ETL metadata
Technical
ETL descriptions
implementations of algorithms (transforming, cleaning, integrating)
scripts and tasks definitions
execution schedules
various dictionaries (countries, cities, ...)
DW refreshing statistics (#rows loaded, #rows rejected, ...)
refreshing logs
workflow structure
DS - DW mappings (schema and data)
Requirements for ETL
Efficiency
finishing in a time window
parallel executions
Reliability
restart after erroneous execution
recovery after crash
Manageability
parameterized refreshing frequency
automatic start
time-based
token-based (data source informs ETL that data can be fetched)
suspend and resume a task
Ensuring data quality
Security (access rights control)
Requirements for ETL
Data safety after system's crash
Predefined tasks
Automatic generation of executable code
Easy to modify
Extending with user-defined components
Batch execution
Monitoring execution
processor time
RAM
throughput
disk access competition
Automatic reporting about finishing, errors, ...
Metadata management
Approaches
Off the shelf
quicker deployment
data repositories and metadata management
built-in drivers to all (multiple) systems
dependency management between components
incremental refreshing
parallel processing
expensive
User-defined
longer development
applicable to a particular solution
cheaper
Commercial systems
Prototype systems
AJAX - Inria
Galhardas H., Florescu D., Shasha D., Simon E.: An Extensible Framework for Data Cleaning. ICDE, 2000
Galhardas H., Florescu D., Shasha D., Simon E.: AJAX: An Extensible Data Cleaning Tool. SIGMOD, 2000
Potter's Wheel - Berkeley
Raman V., Hellerstein J.M.: Potter's Wheel: An Interactive Data Cleaning System. VLDB, 2001
Arktos II - National Univ. of Athens, Univ. of Ioannina
Vassiliadis P., Simitsis A., Georgantas P., Terrovitis M.: A Framework for the Design of ETL Scenarios. CAiSE, 2003
Simitsis A., Vassiliadis P., Skiadopoulos S., Sellis T.: Data Warehouse Refreshment. In Data Warehouses and OLAP: Concepts, Architectures and Solutions. IGI, 2007
Simitsis A., Vassiliadis P., Sellis T.: Optimizing ETL processes in data warehouses. ICDE, 2005
Simitsis A., Vassiliadis P., Sellis T.: State-Space Optimization of ETL Workflows. IEEE TKDE (17):10, 2006
Tziovara V., Vassiliadis P., Simitsis A.: Deciding the physical implementation of ETL workflows. DOLAP, 2007
AJAX
Input: a set of tables with inconsistent and duplicated rows
Output: a set of tables with consistent rows and no duplicates
Assumption
tables have defined primary keys
AJAX - components
Data transformation service
standardizing values
transformation
MAPPING macro-operator
CREATE MAPPING MG1
SELECT c.clID, c.FName, c.LName, c.Street, c.City, c.Code,
c.TelNo, c.Education
INTO Clients_Clean
FROM Clients1 c
LET LName=INITCAP(c.LName)
[Street, City, Code]=ExtractAdr(c.Address)
Education=IF(c.Education is not null)
THEN RETURN c.Education
ELSE RETURN 'unknown'
AJAX
Record matching service - duplicate elimination
similarity measure
MATCH macro-operator
CREATE MATCH MH1
FROM Clients1 c1, Clients1 c2
LET sim1=LNameSimF(c1.LName, c2.LName)
sim2=AddressSimF(c1.Address, c2.Address)
SIMILARITY=IF (sim1>0.9 and sim2>0.8) THEN RETURN MIN(sim1,sim2)
ELSE IF (sim1 between 0.6 and 0.89 and
sim2 between 0.7 and 0.8) THEN RETURN sim1
ELSE RETURN 0
THRESHOLD SIMILARITY>=0.7
Result stored in a temporary table - matching table
M {ID_Client1, ID_Client2, similarity}
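The logic of the MATCH operator above can be mimicked in Python to show how the matching table is built. This is not AJAX itself: `string_sim` is a stand-in for `LNameSimF` and `AddressSimF` (whose definitions are not given), based on `difflib.SequenceMatcher`, and the client rows are made up.

```python
from difflib import SequenceMatcher
from itertools import combinations


def string_sim(a, b):
    """Stand-in for LNameSimF / AddressSimF: a ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity(c1, c2):
    """Mirror the IF / ELSE IF / ELSE rule of the MATCH definition."""
    sim1 = string_sim(c1["LName"], c2["LName"])
    sim2 = string_sim(c1["Address"], c2["Address"])
    if sim1 > 0.9 and sim2 > 0.8:
        return min(sim1, sim2)
    if 0.6 <= sim1 <= 0.89 and 0.7 <= sim2 <= 0.8:
        return sim1
    return 0.0


def match(clients, threshold=0.7):
    """Build the matching table M {ID_Client1, ID_Client2, similarity}."""
    table = []
    for c1, c2 in combinations(clients, 2):
        sim = similarity(c1, c2)
        if sim >= threshold:  # THRESHOLD SIMILARITY>=0.7
            table.append((c1["clID"], c2["clID"], sim))
    return table


clients = [
    {"clID": 1, "LName": "Wrembel", "Address": "ul. Wyspianskiego 10, Poznan"},
    {"clID": 2, "LName": "Wrembel", "Address": "ul. Wyspianskiego 10 Poznan"},
    {"clID": 3, "LName": "Kowalski", "Address": "ul. Dluga 5, Warszawa"},
]
matching_table = match(clients)
print(matching_table)
```

One deliberate difference: the AJAX statement joins `Clients1` with itself, while `combinations` skips self-pairs and mirrored pairs, which keeps the sketch's output free of trivial matches.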
AJAX
Duplicate elimination
manual
semi-automatic
automatic THRESHOLD > x
CREATE MAPPING MG2
SELECT id, LName, Address, ... INTO Clients
FROM MH1
LET id=IDGen(M.ID_Client1, M.ID_Client2)
sim1=LNameSimF(M.ID_Client1.LName, M.ID_Client2.LName)
sim2=StreetSimF(M.ID_Client1.Address, M.ID_Client2.Address)
SIMILARITY
LName=IF (sim1>0.9) THEN RETURN M.ID_Client1.LName
Street=IF (sim2>0.9) THEN RETURN M.ID_Client1.Street
.....
Address=CONCAT(Street, City, Code)
THRESHOLD SIMILARITY>=0.89
Potter's Wheel
Interactive and iterative process of data transformation and cleaning
a set of predefined transformations
transformations are applied to a small subset of data
transformations are visible to a user in real time
spreadsheet interface
Arktos II
Conceptual model transformed into implementation model
Unique features
evolution of a workflow
optimization of a workflow
ETL unsolved problems
Structural changes in data sources
Wikipedia schema changed every 9-10 days on average during the last 4 years
Telecommunication data sources changed their schemas every 7-13 days on average
Banking data sources changed their schemas every 2-4 weeks on average
The most frequent changes concerned increasing the length of a column, changing the data type of a column, and adding a new column
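Detecting the kinds of structural change listed above is straightforward once two versions of a schema are available; reacting to them in the ETL workflow is the hard, unsolved part. The sketch below only covers detection and assumes a schema is represented as a dictionary of column name to (type, length), which is a made-up representation.

```python
def diff_schemas(old, new):
    """Compare two schemas given as {column: (type, length)} dictionaries."""
    changes = []
    for column, (new_type, new_len) in new.items():
        if column not in old:
            changes.append(("added", column))
            continue
        old_type, old_len = old[column]
        if new_type != old_type:
            changes.append(("type changed", column))
        elif new_len is not None and old_len is not None and new_len > old_len:
            changes.append(("length increased", column))
    for column in old:
        if column not in new:
            changes.append(("dropped", column))
    return changes


# Made-up example: two versions of one source table.
old_schema = {"id": ("INTEGER", None), "name": ("VARCHAR", 30), "price": ("INTEGER", None)}
new_schema = {"id": ("INTEGER", None), "name": ("VARCHAR", 60),
              "price": ("DECIMAL", None), "currency": ("CHAR", 3)}

print(diff_schemas(old_schema, new_schema))
```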
ETL unsolved problems
ETL optimization
Workflow transformation
reordering tasks
parallelizing tasks
merging/splitting tasks
Figuring out the set of correct transformations
Defining cost model of executions
Example
[Figure: ETL workflow with numbered tasks 1-8. Operators shown: NotNull(total_price), EUR2PLN, ConvertDate, SUM(cost,month), Select(total_price>9000)]
Sales1 {..., total_price, s_date, ...}
total_price [PLN]
s_date [yyyy-mm-dd]
monthly sales
Sales2 {..., cost, sales_date, ...}
cost [EUR]
sales_date [dd/mm/yy]
daily sales
Example
Minimize the amount of processed data
[Figure: the rewritten workflow with tasks 1-8; Select(total_price>9000) now appears twice, alongside NotNull(total_price), EUR2PLN, ConvertDate, and SUM(cost,month)]
Sales1 {..., total_price, s_date, ...}
Sales2 {..., cost, sales_date, ...}
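One way to read this example is that the selection is moved from the end of the workflow into both input branches, so fewer rows reach the merge step. The sketch below illustrates that reading only; the workflow layout is an assumption (the figure itself is not recoverable), and the exchange rate and sample rows are made up.

```python
from collections import defaultdict
from datetime import datetime

EUR_TO_PLN = 4.0  # hypothetical fixed exchange rate


def branch_sales1(rows):
    """NotNull(total_price) on Sales1 (monthly sales, PLN)."""
    return [r for r in rows if r["total_price"] is not None]


def branch_sales2(rows):
    """ConvertDate, EUR2PLN and SUM(cost, month) on Sales2 (daily sales, EUR)."""
    monthly = defaultdict(float)
    for r in rows:
        day = datetime.strptime(r["sales_date"], "%d/%m/%y")
        monthly[day.strftime("%Y-%m-01")] += r["cost"] * EUR_TO_PLN
    return [{"total_price": total, "s_date": month}
            for month, total in sorted(monthly.items())]


def select(rows):
    """Select(total_price>9000)."""
    return [r for r in rows if r["total_price"] > 9000]


def plan_late_selection(sales1, sales2):
    merged = branch_sales1(sales1) + branch_sales2(sales2)
    return select(merged), len(merged)  # result, rows reaching the merge


def plan_early_selection(sales1, sales2):
    merged = select(branch_sales1(sales1)) + select(branch_sales2(sales2))
    return merged, len(merged)


sales1 = [{"total_price": 12000.0, "s_date": "2019-01-01"},
          {"total_price": None, "s_date": "2019-02-01"},
          {"total_price": 5000.0, "s_date": "2019-03-01"}]
sales2 = [{"cost": 1500.0, "sales_date": "05/01/19"},
          {"cost": 1000.0, "sales_date": "20/01/19"},
          {"cost": 500.0, "sales_date": "03/02/19"}]

late, n_late = plan_late_selection(sales1, sales2)
early, n_early = plan_early_selection(sales1, sales2)
print(late == early, n_late, n_early)
```

In the Sales2 branch the selection can only be applied after SUM, because the condition refers to the monthly total rather than to a single daily row.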
Problems
Tasks are often expressed as programs in procedural languages
constructing cost model
programs may have input parameters and conditional constructs
how to interpret and optimize code?
Commercial systems ???