03-ETL


Robert Wrembel

Poznań University of Technology

Institute of Computing Science

Poznań, Poland

[email protected]

www.cs.put.poznan.pl/rwrembel

On Building Integrated and Distributed Database Systems

    Data Integration for Warehousing - ETL


Outline

    ETL in a Data Warehouse architecture

    ETL characteristics

    Extraction

    Transformation

    Loading

    Requirements for ETL

    ETL metadata

    Prototype systems


ETL in DW architecture

[Figure: ETL in the DW architecture. Data sources feed, through an intermediate layer, the ETL software (extraction, transformation, loading, aggregation), which populates the data warehouse and data marts queried by BI applications (reports, financial and statistical analysis).]


ETL characteristics

    Developing ETL processes

    critical for DW operation

    data quality

    data "freshness" (up to date)

DW is refreshed in a finite time window (any delay in DW refreshing makes it outdated, inconsistent, or unavailable for use)

costly and time consuming

up to 70% of project resources

    people

    hardware

    software


ETL characteristics

Gartner report on DW projects in Fortune 500 financial institutions

    100 persons involved in a DW project

    55 ETL

    17 systems' administrators (DB, hardware, software)

    4 system architects

    9 consultants for the end user on the BI technology

    5 software developers

    9 managers

    hardware

multiprocessor servers, TB disks (5 mln USD)

ETL software (1 mln USD)

typical number of data sources being integrated: 10-50


Technological challenges

    Processing large data volumes in a limited time window

Delivering reliable (valid, true, consistent) data: data quality

    Processing an ETL flow efficiently

    Managing the evolution of data sources


Data quality - case study

    Integration of dean's office databases

    9 databases

    in total over 70 000 students

[Figure: the 9 dean's office databases: WIiZ, WArch, WBiI, WBMiZ, WEiT, WE, WFT, WMRiT, WTC]


    Student number (SN)

    SN is unique within one dean's office database

globally SN is not unique: a problem of uniquely identifying a student

    found 49 pairs of students having the same SN,

students in a pair are physically different persons

format: 6 digits + an optional letter {a, d, s, i}

2.75% incorrect

    "SSN"

    format: 11 digits

    22% incorrect

incorrect length, characters instead of digits, wrong checksum, wrong gender (a minimal checksum sketch follows below)
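A minimal Python sketch of such checks, assuming the 11-digit identifier is the Polish PESEL number (check digit computed with weights 1, 3, 7, 9 repeated; gender encoded in the 10th digit); the sample value is hypothetical:

# Validation sketch, assuming the 11-digit identifier is a Polish PESEL number.
PESEL_WEIGHTS = [1, 3, 7, 9, 1, 3, 7, 9, 1, 3]

def pesel_errors(value, declared_gender):
    """Return detected problems: length, non-digits, checksum, gender mismatch."""
    errors = []
    if len(value) != 11:
        errors.append("incorrect length")
    if not value.isdigit():
        errors.append("characters instead of digits")
        return errors                      # remaining checks need digits only
    if len(value) == 11:
        digits = [int(c) for c in value]
        check = (10 - sum(w * d for w, d in zip(PESEL_WEIGHTS, digits)) % 10) % 10
        if check != digits[10]:
            errors.append("wrong checksum")
        # the 10th digit is odd for males and even for females
        if declared_gender in ("M", "F") and (declared_gender == "M") != (digits[9] % 2 == 1):
            errors.append("wrong gender")
    return errors

print(pesel_errors("1234567890", "F"))     # -> ['incorrect length']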



    First names

    dictionary of first names was applied

    0.8% not in the dictionary

    0.9% incorrect

    illegal characters, illegal values

    Last names

    dictionary of last names was applied

    20% not in the dictionary

    0.04% wrong characters

Predefined dictionaries:

31 values, of which 2 correct

757 values, of which 299 correct

74 values, of which 3 correct



Data quality - case study

    The dictionary of cities

    4.3% of wrong characters

    81% with mixed lower-uppercase character strings


ETL architecture

[Figure: ETL architecture. Between the data sources (databases, files, ODBC/JDBC sources) and the data warehouse sits the ETL layer with a staging area / operational data store (ODS), performing extraction, transformation, cleaning, integration, and loading.]


Data sources

Each of the data sources supplying data into a DW has to be identified

    Data source description includes among others:

    domain of activity (HR, Payroll, Marketing, ...)

    type of applications used for data processing

    data importance for a BI user

    who is a business user of source data

    who is a user of a technical architecture

    DBMS used to manage data

    hardware and operating systems

    the number of users per day

    data volume sizes

    DB schema

    the number of transactions per day


Data access technologies

    Gateway

    ODBC/JDBC

    OLE DB (Object Linking and Embedding DataBase)

    Drivers to various types of files (flat text, XML, ...)


Detecting changes in DSs

    Requirements

    minimal interference with processing in data sources

minimal (typically no) changes in data sources (structure, applications)

    Solutions

audit columns

in a monitored table: date and time of an operation, operation type (I, U, D)

values provided by means of triggers or applications

snapshot log: a system-maintained log of changes

redo log: a system DB log (transaction rollback, transaction recovery, DB recovery)

periodical analysis (log scraping) / on-line analysis (log sniffing)

comparison of 2 consecutive snapshots (a minimal sketch follows below)

low efficiency
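A minimal Python sketch of the snapshot-comparison approach (table content and key values are hypothetical); rows are keyed by their primary key and classified as inserted (I), updated (U), or deleted (D):

# Detect changes by comparing two consecutive snapshots of a source table.
old_snapshot = {1: ("Smith", "Poznan"), 2: ("Kowalski", "Warszawa"), 3: ("Nowak", "Krakow")}
new_snapshot = {1: ("Smith", "Poznan"), 2: ("Kowalski", "Gdansk"), 4: ("Wisniewski", "Lodz")}

inserted = {k: v for k, v in new_snapshot.items() if k not in old_snapshot}
deleted = {k: v for k, v in old_snapshot.items() if k not in new_snapshot}
updated = {k: new_snapshot[k] for k in new_snapshot.keys() & old_snapshot.keys()
           if new_snapshot[k] != old_snapshot[k]}

print("I:", inserted)   # {4: ('Wisniewski', 'Lodz')}
print("U:", updated)    # {2: ('Kowalski', 'Gdansk')}
print("D:", deleted)    # {3: ('Nowak', 'Krakow')}

Every refresh has to scan both snapshots in full, which is the source of the low efficiency noted above.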


Analyzing data sources

Analytical methods (statistical, data mining) for estimating characteristics of data (data profiling); a minimal profiling sketch follows after this list

    Analytical methods

    data quality

    identifying NULL/NOT NULL columns

for each attribute count the number of rows with NULL values or/and default values (a default value may denote that no value was provided during row insert)

    identifying columns with unique values

    maximum length of values

    allowed ranges/sets of values

    MIN, MAX, AVG, Variance, STDEV

    identifying not allowed values

    the number of rows with not allowed values

    attribute cardinality

    distribution of values for each attribute (histograms)

data formats (e.g., dates, money, telephone numbers)
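A minimal data-profiling sketch in Python over an in-memory sample; the column values and the assumed default value are hypothetical:

import statistics

# Hypothetical sample of one source column (None stands for NULL).
rows = ["60-965", "61-001", None, "60-965", "unknown", "00000", None]
DEFAULT_VALUE = "unknown"    # a default may mean no value was provided at insert time

non_null = [v for v in rows if v is not None]
profile = {
    "rows": len(rows),
    "nulls": rows.count(None),
    "defaults": rows.count(DEFAULT_VALUE),
    "distinct": len(set(non_null)),
    "unique": len(set(non_null)) == len(non_null),    # candidate key?
    "max_length": max(len(v) for v in non_null),
}

# Basic statistics apply to numeric columns (another hypothetical sample).
prices = [10.0, 12.5, 9.99, 11.0]
profile.update({"min": min(prices), "max": max(prices),
                "avg": statistics.mean(prices), "stdev": statistics.stdev(prices)})
print(profile)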


Analyzing data sources

    Analytical methods

    the structure and content of data sources

daily growth of data volume

Migration Architect (Evoke Software), Integrity (Vality)


Analyzing data sources

Data mining methods: association rules + domain knowledge

Sapia C., Höfling G., et al.: On Supporting the Data Warehouse Design by Data Mining Techniques

discovering attribute meaning

(country='GB' => sw=2), support 95%: sw=steering wheel; 2=right side

completing missing values based on rules with a high support

replacing wrong values with correct ones

discovering functional dependencies between attributes

discovering potential keys

discovering business rules implicitly encoded in applications

WizRule (WizSoft), DataMiningSuite (Information Discovery)


Transformation

    Requirements

    Interactive and iterative process

define rules → start the transformation → verify results → modify rules

    Easily extendible

    Optimizable

The more tasks executed automatically the better

The less data for manual transformation the better


Transformation

    Transformation to a common data model

{object, O-R, semistructured, ...} → relational

Transformation to a common representation

Employee {SSN, FName, LName, Street, No, PostalCode, City}

    Removing useless columns

    User verification/correction is often required


Cleaning

Extracting atomic values from strings (a minimal sketch follows after this list)

Piotrowo 2, 60-965, Poznań

    ordering the values

    Removing Null values

Replacing wrong values with correct ones

    spelling dictionaries

    name dictionaries (countries, cities, address codes)

    Standardizing values

    formatting values (e.g., dates, money)

    converting currencies

    lower-upper case conversion

    consistent abbreviations

synonym dictionaries (WordNet)

    abbreviation dictionaries
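A minimal cleaning sketch in Python; the assumed address format ("street number, postal code, city") and the dictionaries are illustrative assumptions:

import re

# Hypothetical dictionaries used for standardization.
ABBREVIATIONS = {"ul.": "ulica"}
CITY_DICTIONARY = {"poznan": "Poznań", "warszawa": "Warszawa"}

def clean_address(raw):
    """Extract atomic values from 'street number, postal code, city' strings."""
    match = re.match(r"\s*(.+?)\s+(\d+\w?),\s*(\d{2}-\d{3}),\s*(.+)", raw)
    if not match:
        return {"raw": raw, "clean": False}          # leave for manual transformation
    street, number, postal_code, city = match.groups()
    street = " ".join(ABBREVIATIONS.get(w, w) for w in street.split())
    city = CITY_DICTIONARY.get(city.strip().lower(), city.strip().title())
    return {"street": street.title(), "no": number, "postal_code": postal_code,
            "city": city, "clean": True}

print(clean_address("ul. Piotrowo 2, 60-965, poznan"))
# {'street': 'Ulica Piotrowo', 'no': '2', 'postal_code': '60-965', 'city': 'Poznań', 'clean': True}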


Cleaning

    Merging semantically identical records

    Generating artificial identifiers

IdCentric (FirstLogic), Trillium (Trillium Software)


Integration - duplicate elimination

Compared records have to be cleaned beforehand

    remove punctuation, white spaces, and special characters

    no abbreviations

    Records differ slightly

{Wrembel, Robert, ul. Wyspiańskiego, Poznań} vs. {Wróbel, Robert, ul. Wyspiańskiego, Poznań}

Use natural identifiers (e.g., SSN, passport No, engine No, e-mail)

    No natural identifiers

sort + compare n neighboring records (window of size n); a minimal sketch follows after this list

similarity function (e.g., if first and last names are identical then the records are identical)

    similarity weights for attributes

    approximate join
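A minimal Python sketch of the sort-and-compare-neighbors approach; the records, the sorting key, the attribute weights, and the 0.5 threshold are illustrative assumptions:

# Duplicate detection without natural identifiers: sort the records,
# then compare each record only with the next WINDOW-1 records.
records = [
    {"last": "Wrembel", "first": "Robert", "city": "Poznan"},
    {"last": "Wrobel",  "first": "Robert", "city": "Poznan"},
    {"last": "Nowak",   "first": "Anna",   "city": "Gdansk"},
    {"last": "Nowak",   "first": "Ania",   "city": "Gdansk"},
]
WINDOW = 3    # window size n

def similarity(a, b):
    """Weighted attribute similarity; the weights are illustrative."""
    weights = {"last": 0.5, "first": 0.3, "city": 0.2}
    return sum(w for attr, w in weights.items() if a[attr] == b[attr])

# Sort by a key that tends to bring duplicates close together.
records.sort(key=lambda r: (r["city"], r["last"], r["first"]))

for i, rec in enumerate(records):
    for other in records[i + 1 : i + WINDOW]:
        if similarity(rec, other) >= 0.5:
            print("possible duplicates:", rec, other)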


Duplicate elimination

    Simple similarity measure

the number of matching atomic strings / the total number of unique atomic strings (a minimal sketch follows after the examples below)

Universidad de Costa Rica, Faculdad de Ingeniería

    Univ. de C. Rica Faculd. de Ingen.

    similarity measure = 5/5

Universidad de Costa Rica, Faculdad de Ingeniería, Escuela de Ciencias de la Computación e Informática

    Univ. de C. Rica Faculd. de Ingen.

    similarity measure = 5/9
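A minimal Python sketch of this measure; treating "de", "la", "e" as stop words and matching an abbreviation to its expansion by prefix are assumptions made to reproduce the 5/5 and 5/9 results above:

STOP_WORDS = {"de", "la", "e"}

def tokens(text):
    parts = [t.strip(".,").lower() for t in text.split()]
    return [t for t in parts if t and t not in STOP_WORDS]

def matches(a, b):
    # an abbreviation matches its expansion when one token is a prefix of the other
    return a.startswith(b) or b.startswith(a)

def simple_similarity(s1, s2):
    t1, t2 = tokens(s1), tokens(s2)
    unmatched = list(t2)
    matched = 0
    for a in t1:
        for b in unmatched:
            if matches(a, b):
                unmatched.remove(b)       # each atomic string matches at most once
                matched += 1
                break
    # matching atomic strings / total number of unique atomic strings
    return matched / (len(t1) + len(unmatched))

print(simple_similarity("Universidad de Costa Rica, Faculdad de Ingeniería",
                        "Univ. de C. Rica Faculd. de Ingen."))            # 5/5 = 1.0
print(simple_similarity("Universidad de Costa Rica, Faculdad de Ingeniería, "
                        "Escuela de Ciencias de la Computación e Informática",
                        "Univ. de C. Rica Faculd. de Ingen."))            # 5/9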


Duplicate elimination

    Soundex

    grouping entities having the same pronunciation

entities pronounced identically have the same value of SOUNDEX (even if they are written differently)

    soundex('Smith')=soundex('Smit')=S530

Levenshtein/edit distance

similarity measure of two character strings: source L1, destination L2

the distance is the minimal number of inserts and deletes (sometimes updates) of characters in a string needed to obtain L2 from L1

L1 and L2 identical: distance=0; ABC => ABCDEF: distance=3; DEFCAB => ABC: distance=5 (a minimal sketch follows below)
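A minimal Python sketch of the edit distance restricted to inserts and deletes, which reproduces the distances above:

def edit_distance(source, destination):
    """Minimal number of character inserts and deletes turning source into destination."""
    n, m = len(source), len(destination)
    # dist[i][j] = distance between source[:i] and destination[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                     # delete the remaining source characters
    for j in range(m + 1):
        dist[0][j] = j                     # insert the remaining destination characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if source[i - 1] == destination[j - 1]:
                dist[i][j] = dist[i - 1][j - 1]
            else:
                dist[i][j] = 1 + min(dist[i - 1][j],      # delete from source
                                     dist[i][j - 1])      # insert into source
    return dist[n][m]

print(edit_distance("ABC", "ABC"))       # 0
print(edit_distance("ABC", "ABCDEF"))    # 3
print(edit_distance("DEFCAB", "ABC"))    # 5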

    Merge (Sagent), DataCleanser (EDD)


Refreshing/loading a DW

When to refresh a DW?

synchronously (after committing a transaction in a data source): real-time DWs (RTDWs)

asynchronously: traditional DWs

automatically in a given interval

on demand

What to send?

data (Oracle)

transactions (Sybase, SQL Server)

How to refresh?

incrementally

fully

How frequently to refresh?

in a batch mode

in a stream mode (RTDWs)


Refreshing efficiency

    In a given finite time window

    Read only necessary data

    Avoid

    DISTINCT, set operators,

NOT and non-equality joins (usually require full scans)

    function calls in the WHERE clause

    GROUP BY in queries reading source data

sorting in a data source may be ineffective

    sorting may interact with original processing in a data source

    triggers in a DW


Refreshing efficiency

Separate UPDATEs and INSERTs

UPDATEs are not executed in a direct load path

replacing UPDATE by DELETE and INSERT

the number of UPDATEs > INSERTs => TRUNCATE TABLE + INSERTs

Indexes: drop + re-create vs. maintain on-line

Indexes and UPDATEs:

remove indexes not used by UPDATEs

execute UPDATEs

remove remaining indexes

execute INSERTs

re-create indexes

Integrity constraints: turn off before loading (a minimal loading sketch follows below)
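A minimal sketch of this loading pattern in Python; an in-memory SQLite database stands in for the DW, and the table, column, and index names are hypothetical:

import sqlite3

# Separate UPDATEs from INSERTs and drop/re-create indexes around the bulk INSERT.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (sale_id INTEGER, total REAL)")          # setup
cur.execute("CREATE INDEX idx_sales_id ON sales (sale_id)")
cur.execute("CREATE INDEX idx_sales_total ON sales (total)")
cur.execute("INSERT INTO sales VALUES (42, 100.0)")                      # existing data

updates = [(1500.0, 42)]                   # (new_total, sale_id)
inserts = [(43, 900.0), (44, 120.5)]       # (sale_id, total)

cur.execute("DROP INDEX idx_sales_total")  # 1. remove indexes not used by the UPDATEs
cur.executemany("UPDATE sales SET total = ? WHERE sale_id = ?", updates)  # 2. UPDATEs
cur.execute("DROP INDEX idx_sales_id")     # 3. remove the remaining indexes
cur.executemany("INSERT INTO sales VALUES (?, ?)", inserts)               # 4. bulk INSERTs
cur.execute("CREATE INDEX idx_sales_id ON sales (sale_id)")               # 5. re-create
cur.execute("CREATE INDEX idx_sales_total ON sales (total)")
conn.commit()
print(cur.execute("SELECT * FROM sales ORDER BY sale_id").fetchall())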


Refreshing efficiency

    Redo log

turn off redo log writing

ETL software may roll back failed transactions

data loaded in a batch mode: failed transactions may be easily re-executed

turn off redo log writing for a particular table

Use direct load path

Filter data stored in files by means of an OS utility (awk command)

Sort data stored in files by means of an OS utility (sort)

Sort and compute aggregates in the ETL engine (not in a DW)


Refreshing efficiency

    Transformation of data

    in a DW (ELT)

    in an ETL workflow

    Parallel loading (partitioned and non-partitioned tables)

Use native drivers for accessing data sources (avoid ODBC/JDBC)

    Gather DW statistics after refreshing

    Defragment DW


Purpose of ODS

Separating ETL processing from original processing in data sources

    Re-executing failed transactions


ODS content

    Original source data

    Partially processed data

    Storing ETL metadata

Mapping tables (EDS → DW): lineage, data provenance

DW rows and their origins in data sources + a chain of transformations

    ODS is implemented as a database or a set of files


Designing ETL

Data profiling

Defining ETL workflows

Testing on a sample, verifying data quality

Executing ETL

Modifying EDS, improving data quality

(all steps supported by a repository)

Jarke M., et al.: Improving OLTP Data Quality Using Data Warehouse Mechanisms. SIGMOD Record, (28):2, 1999


Implementing ETL

ETL workflow of transformations (a minimal workflow sketch follows after the list)

Transformations:

    aggregation

    filtering

    joining

    normalizing values

    lookup

    generating IDs

    sorting

    EDS connector (DB, file, ...) ...

    user-defined
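A minimal Python sketch of such a workflow, composing transformations over a list of rows; the source rows, the individual functions, and the generated key range are hypothetical:

from functools import reduce

# Hypothetical source rows fetched by an EDS connector.
rows = [
    {"id": None, "name": "smith",    "amount": "120.5"},
    {"id": 2,    "name": "KOWALSKI", "amount": "99.0"},
    {"id": 3,    "name": "nowak",    "amount": None},
]

# Each transformation maps a list of rows to a list of rows,
# so an ETL workflow is simply their composition.
def filter_not_null(column):
    return lambda rs: [r for r in rs if r[column] is not None]

def normalize_names(rs):
    return [{**r, "name": r["name"].title()} for r in rs]

def generate_ids(rs):
    counter = iter(range(1000, 10000))     # surrogate key generator
    return [{**r, "id": r["id"] if r["id"] is not None else next(counter)} for r in rs]

def to_float(column):
    return lambda rs: [{**r, column: float(r[column])} for r in rs]

workflow = [filter_not_null("amount"), normalize_names, generate_ids, to_float("amount")]
print(reduce(lambda data, task: task(data), workflow, rows))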


ETL metadata

    Business

    dictionary of business terms

    mapping of business terms into DW objects

    business rules

    data quality

Managing ETL execution:

schedules

    scripts

    execution logs

    monitoring


ETL metadata

    Technical

source description (location, structure, content)

source type (relational db, object db, XML, HTML, spreadsheet, ...)

    structure/schema

    access methods

    users and their access rights

    data profiling results

    daily increase in data volume

    total data volume

    data statistics (for access optimization)

DW description:

logical schema

    physical data structures

    various DW statistics (for query optimization)

    physical disk organization


ETL metadata

    Technical

    ETL descriptions

    implementations of algorithms (transforming, cleaning,integrating)

    scripts and tasks definitions

    execution schedules

various dictionaries (countries, cities, ...)

DW refreshing statistics (#rows loaded, #rows rejected, ...)

    refreshing logs

    workflow structure

    DS - DW mappings (schema and data)


Requirements for ETL

Efficiency: finishing in a time window

    parallel executions

Reliability: restart after erroneous execution

    recovery after crash

Manageability: parameterized refreshing frequency

automatic start: time-based

token-based (data source informs ETL that data can be fetched)

suspend and resume a task

    Ensuring data quality

    Security (access rights control)


Requirements for ETL

    Data safety after system's crash

    Predefined tasks

    Automatic generation of executable code

    Easy to modify

    Extending with user-defined components

Batch execution

Monitoring execution:

    processor time

    RAM

    throughput

    disk access competition

    Automatic reporting about finishing, errors, ...

    Metadata management


Approaches

Off the shelf

quicker deployment

data repositories and metadata management

built-in drivers to all (multiple) systems

dependency management between components

    incremental refreshing

    parallel processing

    expensive

    User-defined

    longer development

applicable to a particular solution

    cheaper


Commercial systems


Prototype systems

AJAX - Inria

Galhardas H., Florescu D., Shasha D., Simon E.: An Extensible Framework for Data Cleaning. ICDE, 2000

Galhardas H., Florescu D., Shasha D., Simon E.: AJAX: An Extensible Data Cleaning Tool. SIGMOD, 2000

Potter's Wheel - Berkeley

Raman V., Hellerstein J.M.: Potter's Wheel: An Interactive Data Cleaning System. VLDB, 2001

Arktos II - National Univ. of Athens, Univ. of Ioannina

Vassiliadis P., Simitsis A., Georgantas P., Terrovitis M.: A Framework for the Design of ETL Scenarios. CAiSE, 2003

Simitsis A., Vassiliadis P., Skiadopoulos S., Sellis T.: Data Warehouse Refreshment. In Data Warehouses and OLAP: Concepts, Architectures and Solutions. IGI, 2007

Simitsis A., Vassiliadis P., Sellis T.: Optimizing ETL processes in data warehouses. ICDE, 2005

Simitsis A., Vassiliadis P., Sellis T.: State-Space Optimization of ETL Workflows. IEEE TKDE (17):10, 2006

Tziovara V., Vassiliadis P., Simitsis A.: Deciding the physical implementation of ETL workflows. DOLAP, 2007


AJAX

Input: a set of tables with inconsistent and duplicated rows

Output: a set of tables with consistent, duplicate-free rows

    Assumption

    tables have defined primary keys


AJAX - components

    Data transformation service

    standardizing values

    transformation

    MAPPING macro-operator

CREATE MAPPING MG1
SELECT c.clID, c.FName, c.LName, c.Street, c.City, c.Code,
       c.TelNo, c.Education
INTO Clients_Clean
FROM Clients1 c
LET LName=INITCAP(c.LName)
    [Street, City, Code]=ExtractAdr(c.Address)
    Education=IF (c.Education is not null)
              THEN RETURN c.Education
              ELSE RETURN 'unknown'



AJAX

    Record matching service - duplicate elimination

    similarity measure

    MATCH macro-operator

CREATE MATCH MH1
FROM Clients1 c1, Clients1 c2
LET sim1=LNameSimF(c1.LName, c2.LName)
    sim2=AddressSimF(c1.Address, c2.Address)
    SIMILARITY=IF (sim1>0.9 and sim2>0.8) THEN RETURN MIN(sim1, sim2)
               ELSE IF (sim1 between 0.6 and 0.89 and
                        sim2 between 0.7 and 0.8) THEN RETURN sim1
               ELSE RETURN 0
THRESHOLD SIMILARITY>=0.7


    Result stored in a temporary table - matching table

    M {ID_Client1, ID_Client2, similarity}


AJAX

    Duplicate elimination

    manual

    semi-automatic

automatic: THRESHOLD > x

CREATE MAPPING MG2
SELECT ID, LName, Address, ... INTO Clients
FROM MH1
LET id=IDGen(M.ID_Client1, M.ID_Client2)
    sim1=LNameSimF(M.ID_Client1.LName, M.ID_Client2.LName)
    sim2=StreetSimF(M.ID_Client1.Address, M.ID_Client2.Address)
    SIMILARITY
    LName=IF (sim1>0.9) THEN RETURN M.ID_Client1.LName
    Street=IF (sim2>0.9) THEN RETURN M.ID_Client1.Street
    .....
    Address=CONCAT(Street, City, Code)
THRESHOLD SIMILARITY>=0.89



Potter's Wheel

Interactive and iterative process of data transformation and cleaning

    a set of predefined transformations

    transformations are applied to a small subset of data

    transformations are visible to a user in real time

    spreadsheet interface


Arktos II

Conceptual model transformed into an implementation model

    Unique features

    evolution of a workflow

    optimization of a workflow


ETL unsolved problems

    Structural changes in data sources

Wikipedia schema changed every 9-10 days on the average during the last 4 years

Telecommunication data sources changed their schemas every 7-13 days, on the average

Banking data sources changed their schemas every 2-4 weeks, on the average

The most frequent changes concerned increasing the length of a column, changing the data type of a column, and adding a new column



ETL unsolved problems

    ETL optimization

    Workflow transformation

    reordering tasks

    parallelizing tasks

merging / splitting tasks

    Figuring out the set of correct transformations

    Defining cost model of executions


Example

[Figure: an example ETL workflow with numbered tasks.
Sales1 {..., total_price, s_date, ...}: total_price in PLN, s_date in yyyy-mm-dd format, monthly sales.
Sales2 {..., cost, sales_date, ...}: cost in EUR, sales_date in dd/mm/yy format, daily sales.
Tasks: NotNull(total_price), EUR2PLN, ConvertDate, SUM(cost, month), and a final Select(total_price>9000).]


Example

    Minimize the amount of processed data

[Figure: the same workflow after optimization: the Select(total_price>9000) task is pushed down towards both sources, so the remaining tasks process fewer rows; a minimal sketch of this effect follows below.]
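A minimal Python sketch of this effect (the rows, the exchange rate, and the date conversion are hypothetical): applying the selection first, rewritten on the source units, gives the same result while the expensive conversions run only on the surviving rows:

EUR_TO_PLN = 4.3    # hypothetical exchange rate

sales2 = [{"cost": 2500.0, "sales_date": "05/03/19"},
          {"cost": 150.0,  "sales_date": "06/03/19"},
          {"cost": 80.0,   "sales_date": "07/03/19"}]

def eur_to_pln(row):
    return {**row, "cost": row["cost"] * EUR_TO_PLN}

def convert_date(row):                     # dd/mm/yy -> yyyy-mm-dd
    d, m, y = row["sales_date"].split("/")
    return {**row, "sales_date": "20" + y + "-" + m + "-" + d}

# Original order: convert every row, then select.
late = [r for r in (convert_date(eur_to_pln(r)) for r in sales2) if r["cost"] > 9000]

# Optimized order: select first (predicate rewritten to the source unit),
# then convert only the surviving rows.
early = [convert_date(eur_to_pln(r)) for r in sales2 if r["cost"] * EUR_TO_PLN > 9000]

print(late == early, len(early))           # True 1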


Problems

Tasks are often expressed as programs in procedural languages

constructing a cost model

programs may have input parameters and conditional constructs

    how to interpret and optimize code?

    Commercial systems ???