Data Warehousing

21
Ahsan Abdullah Ahsan Abdullah 1 Data Warehousing Data Warehousing Lecture-16 Lecture-16 Extract Transform Load (ETL) Extract Transform Load (ETL) Virtual University of Virtual University of Pakistan Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics Research www.nu.edu.pk/cairindex.asp National University of Computers & Emerging Sciences, Islamabad Email: [email protected]

description

Virtual University of Pakistan. Data Warehousing . Lecture-16 Extract Transform Load (ETL). Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics Research www.nu.edu.pk/cairindex.asp National University of Computers & Emerging Sciences, Islamabad Email: [email protected]. - PowerPoint PPT Presentation

Transcript of Data Warehousing

Page 1: Data Warehousing

Ahsan AbdullahAhsan Abdullah

11

Data Warehousing Data Warehousing Lecture-16Lecture-16

Extract Transform Load (ETL)Extract Transform Load (ETL)

Virtual University of PakistanVirtual University of Pakistan

Ahsan AbdullahAssoc. Prof. & Head

Center for Agro-Informatics Researchwww.nu.edu.pk/cairindex.asp

National University of Computers & Emerging Sciences, IslamabadEmail: [email protected]

Page 2: Data Warehousing

Ahsan AbdullahAhsan Abdullah

22

Extract Transform Load (ETL)Extract Transform Load (ETL)

Page 3: Data Warehousing

Ahsan AbdullahAhsan Abdullah

33

Data Warehouse Server(Tier 1)

OLAP Servers(Tier 2)

Clients(Tier 3)

DataWarehouse

OperationalData Bases

SemistructuredSources MOLAP

ROLAP

Query/Reporting

Data Marts Tools

MetaData

Data sources

Data(Tier 0)

IT

Users

BusinessUsers

Business Users

Data Mining

Archiveddata

Analysis

EExtractxtractTTransformransformLLoad oad (ETL)(ETL)

www data

Putting the pieces togetherPutting the pieces together

{Comment: All except ETL washed out look}

Page 4: Data Warehousing

Ahsan AbdullahAhsan Abdullah

44

The ETL CycleThe ETL CycleEEXTRACTXTRACT

The process of reading data from different sources.

TTRANSFORMRANSFORMThe process of transforming the extracted data from its original state into a consistent state so that it can be placed into another database.

LLOADOADThe process of writing the data into the target source.

TRANSFORM CLEANSE

LOAD

Data Warehouse

OLAPTemporary Temporary

Data storageData storage

EXTRACT

MIS Systems(Acct, HR)

LegacySystems

Other indigenous applications(COBOL, VB, C++, Java)

Archived data

www data

Page 5: Data Warehousing

Ahsan AbdullahAhsan Abdullah

55

ETL ProcessingETL Processing

Extracts Extracts from from

source source systemssystems

DataData MovementMovement

Data Data Transfor-Transfor-

mationmation Data Data

LoadingLoading IndexIndex

Mainte-Mainte- nancenance

StatisticsStatistics CollectionCollection

Data Data CleansingCleansing

ETL is independent yet interrelated steps.

It is important to look at the big picture.

Data acquisition time may include…

BackupBack-up is a major task, its a DWH not a cubeBack-up is a major task, its a DWH not a cube

Note: Backup will come as other elements after “Statistical collection”

Page 6: Data Warehousing

Ahsan AbdullahAhsan Abdullah

66

Overview of Data ExtractionOverview of Data ExtractionFirst step of ETL, followed by many.

Source system for extraction are typically OLTP systems.

A very complex task due to number of reasons: Very complex and poorly documented source system. Data has to be extracted not once, but number of times.

The process design is dependent on: Which extraction method to choose? How to make available extracted data for further processing?

Page 7: Data Warehousing

Ahsan AbdullahAhsan Abdullah

77

Types of Data ExtractionTypes of Data Extraction Logical Extraction

Full Extraction Incremental Extraction

Physical Extraction Online Extraction Offline Extraction Legacy vs. OLTP

Page 8: Data Warehousing

Ahsan AbdullahAhsan Abdullah

88

Logical Data ExtractionLogical Data Extraction Full Extraction

The data extracted completely from the source system. No need to keep track of changes.

Source data made available as-is with any additional information.

Incremental Extraction Data extracted after a well defined point/event in time.

Mechanism used to reflect/record the temporal changes in data (column or table).

Sometimes entire tables off-loaded from source system into the DWH.

Can have significant performance impacts on the data warehouse server.

Page 9: Data Warehousing

Ahsan AbdullahAhsan Abdullah

99

Physical Data Extraction…Physical Data Extraction… Online Extraction Data extracted directly from the source system. May access source tables through an intermediate system. Intermediate system usually similar to the source system.

Offline Extraction Data NOT extracted directly from the source system, instead staged explicitly outside the original source system.

Data is either already structured or was created by an extraction routine.

Some of the prevalent structures are: Flat files Dump files Redo and archive logs Transportable table-spaces

Page 10: Data Warehousing

Ahsan AbdullahAhsan Abdullah

1010

Physical Data ExtractionPhysical Data Extraction

Legacy vs. OLTP

Data moved from the source system

Copy made of the source system data

Staging area used for performance reasons

Page 11: Data Warehousing

Ahsan AbdullahAhsan Abdullah

1111

Data TransformationData Transformation

Basic tasks1. Selection

2. Splitting/Joining

3. Conversion

4. Summarization

5. Enrichment

Page 12: Data Warehousing

Ahsan AbdullahAhsan Abdullah

1212

Data Transformation Basic TasksData Transformation Basic Tasks

Selection

Page 13: Data Warehousing

Ahsan AbdullahAhsan Abdullah

1313

Data Transformation Basic TasksData Transformation Basic Tasks

Splitting/joining

Page 14: Data Warehousing

Ahsan AbdullahAhsan Abdullah

1414

Data Transformation Basic TasksData Transformation Basic Tasks

Conversion

Page 15: Data Warehousing

Ahsan AbdullahAhsan Abdullah

1515

Data Transformation Basic Tasks: Conversion Data Transformation Basic Tasks: Conversion Example-1Example-1

Convert common data elements into a consistent form i.e. name and address.

Translation of dissimilar codes into a standard code.

Field formatField format Field dataField dataFirst-Family-title Muhammad Ibrahim ContractorFamily-title-comma-first Ibrahim Contractor, MuhammadFamily-comma-first-title Ibrahim, Muhammad Contractor

Natl. ID NIDNational ID NID

F/NO-2F-2FL.NO.2FL.2FL/NO.2FL-2FLAT-2FLAT#FLAT,2FLAT-NO-2FL-NO.2

FLAT No. 2

Page 16: Data Warehousing

Ahsan AbdullahAhsan Abdullah

1616

Data representation change EBCIDIC to ASCII

Operating System Change Mainframe (MVS) to UNIX UNIX to NT or XP

Data type change Program (Excel to Access), database format (FoxPro to Access). Character, numeric and date type. Fixed and variable length.

Data Transformation Basic Tasks: Conversion Data Transformation Basic Tasks: Conversion Example-2Example-2

Page 17: Data Warehousing

Ahsan AbdullahAhsan Abdullah

1717

Data Transformation Basic TasksData Transformation Basic Tasks

Summarization

Page 18: Data Warehousing

Ahsan AbdullahAhsan Abdullah

1818

Data Transformation Basic TasksData Transformation Basic Tasks

Enrichment

Page 19: Data Warehousing

Ahsan AbdullahAhsan Abdullah

1919

Data Transformation Basic Tasks: Enrichment Data Transformation Basic Tasks: Enrichment ExampleExample

Data elements are mapped from source tables and files to destination fact and dimension tables.

Default values are used in the absence of source data.

Fields are added for unique keys and time elements.

Input DataInput DataHAJI MUHAMMAD IBRAHIM, GOVT. CONT.K. S. ABDULLAH & BROTHERS, MAMOOJI ROAD, ABDULLAH MANZILRAWALPINDI, Ph 67855

Parsed DataParsed DataFirst Name: HAJI MUHAMMAD Family Name: IBRAHIMTitle: GOVT. CONT.Firm: K. S. ABDULLAH & BROTHERSFirm Location: ABDULLAH MANZILRoad: MAMOOJI ROADPhone: 051-67855City: RAWALPINDICode: 46200

Page 20: Data Warehousing

Ahsan AbdullahAhsan Abdullah

2020

Aspects of Data Loading StrategiesAspects of Data Loading Strategies Need to look at:

Data freshness System performance Data volatility

Data Freshness Very fresh low update efficiency Historical data, high update efficiency Always trade-offs in the light of goals

System performance Availability of staging table space Impact on query workload

Data Volatility Ratio of new to historical data High percentages of data change (batch update)

Page 21: Data Warehousing

Ahsan AbdullahAhsan Abdullah

2121

Three Loading StrategiesThree Loading Strategies

Once we have transformed data, there are three Once we have transformed data, there are three primary loading strategies:primary loading strategies:

Full data refreshFull data refresh with BLOCK INSERT or ‘block with BLOCK INSERT or ‘block slamming’ into empty table.slamming’ into empty table.

Incremental data refreshIncremental data refresh with BLOCK INSERT or with BLOCK INSERT or ‘block slamming’ into existing (populated) tables.‘block slamming’ into existing (populated) tables.

Trickle/continuous feedTrickle/continuous feed with constant data with constant data collection and loading using row level insert and collection and loading using row level insert and update operations.update operations.