DATA WAREHOUSE AND DATA MINING IS403 INTRODUCTION TO DATA WAREHOUSE (un.uobasrah.edu.iq/lectures/15102.pdf)

DATA WAREHOUSE AND DATA MINING IS403 Alaa Khalaf Hamoud 7-10-2019 INTRODUCTION TO DATA WAREHOUSE

DATA WAREHOUSE AND DATA MINING

IS403

Alaa Khalaf Hamoud

7-10-2019

INTRODUCTION TO DATA WAREHOUSE

Contents

1. Introduction

2. Data Warehouse

3. Data Warehousing

4. Operational DB VS DW

5. DW Architecture

6. ETL

7. Data Cube: A Multidimensional Data Model

1. Introduction

• Data warehouses generalize and consolidate data in multidimensional space.

– Constructing a data warehouse involves data cleaning, data integration, and data transformation.

• Data warehouses support online analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities.

• Data mining functions, such as association rules, classification, prediction, and clustering, can be integrated with OLAP operations.

• The data warehouse has become an increasingly important platform for data analysis and OLAP, and will provide an effective platform for data mining.

2. Data Warehouse (DW)

• A data warehouse (DW) is a collection of data marts representing historical data from different operations in the company.

– Flat files, text, numbers, etc.

– Typically 5 to 10 years of data, amounting to a huge volume.

• As a DW, this data is stored in a structure optimized for querying and data analysis.

• But what is a data warehouse?

• A DW refers to a data repository that is maintained separately from an organization’s operational databases.

• According to Inmon, a DW is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.

2. DW-Subject Oriented

• A DW is organized around major subjects such as customer, supplier, product, and sales.

• It does not concentrate on the day-to-day operations and transaction processing of an organization.

• It excludes data that are not useful in the decision support process.

2. DW-Integrated

• A DW is constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records.

– Data cleaning and data integration are applied to ensure consistency in naming conventions, data structures, and so on.

2. DW-Time Variant

• Data are stored to provide information from a historic perspective (e.g., the past 5–10 years).

• The time element is very important in the data sources.

2. DW-Nonvolatile

• A DW is always a physically separate store of data transformed from the application data found in the operational environment.

• Because of this separation, a DW does not require transaction processing, recovery, and concurrency control mechanisms.

– It requires only two operations: initial loading of data and access of data.

2. DW-Enterprise Warehouse

• It collects all of the information about subjects spanning the entire organization.

• It provides corporate-wide data integration.

• It contains detailed data as well as summarized data.

• Its size ranges from a few gigabytes to terabytes or beyond.

• It may be implemented on mainframes, computer superservers, or parallel architecture platforms.

2. DW-Data Mart

• It contains a subset of corporate-wide data that is of value to a specific group of users.

• Its scope is confined to specific selected subjects.

• Data marts are usually implemented on low-cost departmental servers.

• The data contained in data marts tend to be summarized.

• The implementation cycle of a data mart is more likely to be measured in weeks rather than months or years.

• Depending on the source of data, data marts can be categorized as independent or dependent.

– Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area.

– Dependent data marts are sourced directly from enterprise data warehouses.

2. DW-Virtual Warehouse

• A virtual warehouse is a set of views over operational databases.

• For efficient query processing, only some of the possible summary views may be materialized.

• A virtual warehouse is easy to build but requires excess capacity on operational database servers.
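
The idea of a virtual warehouse can be sketched with a summary view defined directly over an operational table. This is only an illustration using Python's built-in sqlite3 module; the table name orders, the view name sales_by_region, and the sample rows are assumptions made for the example, not part of the lecture.

```python
import sqlite3

# Hypothetical operational table; all names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("north", 100.0), ("north", 50.0), ("south", 70.0)],
)

# The "virtual warehouse": a summary view over the operational table.
conn.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
)

# Each query against the view is recomputed from the operational data,
# which is why this design needs spare capacity on the operational server.
summary = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(summary)
```

SQLite views are never materialized, so materializing a summary here would mean copying the view's result into an ordinary table and refreshing it when the source changes.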

3. Data Warehousing

• Data warehousing can be viewed as an architecture constructed by integrating data from multiple heterogeneous sources to support structured and/or ad hoc queries, analytical reporting, and decision making.

• It also refers to the process of constructing and using data warehouses.

• It requires data cleaning, data integration, and data consolidation.

• The term warehouse DBMS is used to refer to the management and utilization of data warehouses.

• In the traditional query-driven approach, when a query is posed to a client site, a metadata dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved.

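• The query-driven approach is inefficient and potentially expensive for frequent queries, especially queries requiring aggregations.

• Data warehousing employs an update-driven approach rather than a query-driven approach: information from multiple heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis.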
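
The difference between the two approaches can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the two sources, their field names, and the helper functions are all hypothetical and only show where the integration work happens.

```python
# Two hypothetical heterogeneous sources with different field names and units.
source_a = [{"cust": "ann", "sales_usd": 120.0}, {"cust": "bob", "sales_usd": 80.0}]
source_b = [{"customer_name": "ann", "sales_cents": 3000}]


def integrate():
    """Translate both sources into one common format: (customer, sales in USD)."""
    rows = [(r["cust"], r["sales_usd"]) for r in source_a]
    rows += [(r["customer_name"], r["sales_cents"] / 100) for r in source_b]
    return rows


def total_sales(rows, customer):
    """Aggregate the integrated rows for one customer."""
    return sum(amount for name, amount in rows if name == customer)


def query_driven(customer):
    # Query-driven: translation and integration are repeated for every query.
    return total_sales(integrate(), customer)


# Update-driven: integrate once in advance and store the result.
warehouse = integrate()


def update_driven(customer):
    # Queries run directly against the stored, already integrated copy.
    return total_sales(warehouse, customer)


print(query_driven("ann"), update_driven("ann"))
```

Both functions return the same answer; the update-driven version simply pays the integration cost once, at load time, instead of on every query.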

4. Operational DB VS DW

• The major task of online operational database systems is to perform online transaction and query processing.

• These systems are called online transaction processing (OLTP) systems.

– They support the day-to-day operations of an organization, such as purchasing, banking, payroll, and registration.

• DW systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making.

• These systems are known as online analytical processing (OLAP) systems.

• Users and system orientation

– An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals.

– An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.

• Data contents

– An OLTP system manages current data that, typically, are too detailed to be easily used for decision making.

– An OLAP system manages large amounts of historic data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity.

• Database design

– An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design.

– An OLAP system typically adopts either a star or a snowflake model and a subject-oriented database design.
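
A star model can be sketched as one central fact table whose rows point to surrounding dimension tables. The example below uses Python's built-in sqlite3 module; the table names, columns, and rows are assumptions chosen for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the perspectives (hypothetical names).
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
-- Fact table: one foreign key per dimension plus a numeric measure.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    time_key INTEGER REFERENCES dim_time(time_key),
    units_sold INTEGER
);
INSERT INTO dim_product VALUES (1, 'laptop', 'computers'), (2, 'phone', 'mobile');
INSERT INTO dim_time VALUES (1, 2019, 'Q1'), (2, 2019, 'Q2');
INSERT INTO fact_sales VALUES (1, 1, 10), (1, 2, 15), (2, 1, 7);
""")

# A typical analytical query joins the fact table to a dimension and aggregates.
units_by_category = conn.execute("""
    SELECT p.category, SUM(f.units_sold)
    FROM fact_sales AS f
    JOIN dim_product AS p ON f.product_key = p.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(units_by_category)
```

In a snowflake model, a dimension such as dim_product would itself be normalized further, for example by moving category into its own table.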

• View

– An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historic data or data in different organizations.

– An OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. Because of their huge volume, OLAP data are stored on multiple storage media.

• Access patterns

– The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms.

– Accesses to OLAP systems are mostly read-only operations (because most data warehouses store historic rather than up-to-date information), although many could be complex queries.
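
The two access patterns can be contrasted with a small sketch. It uses Python's built-in sqlite3 module on a single in-memory table purely for illustration; in practice the two workloads would run on separate systems, and the accounts table and its rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, branch TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [(1, "basrah", 500.0), (2, "basrah", 200.0), (3, "baghdad", 900.0)],
)
conn.commit()

# OLTP-style access: a short, atomic transaction that touches a few rows.
# Used as a context manager, the connection commits on success and
# rolls back if an exception is raised inside the block.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")

# OLAP-style access: a read-only query that scans and aggregates many rows.
totals = conn.execute(
    "SELECT branch, SUM(balance) FROM accounts GROUP BY branch ORDER BY branch"
).fetchall()
print(totals)
```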

5. DW Architecture

5. DW Architecture-bottom tier

• The bottom tier is a warehouse database server that is almost always a relational database system.

• Back-end tools and utilities are used to feed data into the bottom tier from different data sources.

• These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse.

• The data are extracted using application program interfaces known as gateways.

• Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).

• This tier also contains a metadata repository, which stores information about the data warehouse and its contents.

• The bottom tier contains data marts and a metadata repository.

• Metadata is data about data.

• There are two categories of metadata: technical and business metadata.

– Technical metadata contains information about the warehouse data that is used by warehouse designers and administrators to carry out development and management tasks.

– Examples of technical metadata: table creation time, data types, file size, and compression scheme.

– Business metadata gives users information related to the business data stored in the data warehouse.

– Examples of business metadata: privacy level, security level, and business rules.

5. DW Architecture-Technical Metadata

• Information about data stores.

• Transformation descriptions.

• Warehouse object and data structure definitions for target data.

• The rules used to perform cleanup and data enhancement.

• Data mapping operations.

• Access authorization, backup history, archive history, information delivery history, data acquisition history, data access, etc.

5. DW Architecture-Business Metadata

• Subject areas and information object types, including queries, reports, images, video, audio clips, etc.

• Internet home pages.

• Information related to the information delivery system.

• Data warehouse operational information such as ownerships, audit trails, etc.

5. DW Architecture-Metadata Characteristics

• It is the gateway to the data warehouse environment.

• It supports easy distribution and replication of content for high performance and availability.

• It should be searchable by business-oriented keywords.

• It should act as a launch platform for the end user to access data and analysis tools.

• It should support:

– Sharing of information.

– Scheduling options for requests.

– Providing an interface to other applications.

– End-user monitoring of the status of the data warehouse environment.

5. DW Architecture-middle tier

• The middle tier is an OLAP server that is typically implemented using either:

– (1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations),

– or (2) a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly implements multidimensional data and operations).
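
To make the ROLAP idea concrete, the sketch below maps one multidimensional operation, a roll-up along a location hierarchy, onto an ordinary SQL GROUP BY. It is a simplified illustration with Python's built-in sqlite3 module, not how any particular OLAP server is implemented; the sales table, its columns, and the roll_up helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, country TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Basrah", "Iraq", "Q1", 10.0),
        ("Baghdad", "Iraq", "Q1", 20.0),
        ("Amman", "Jordan", "Q1", 5.0),
        ("Basrah", "Iraq", "Q2", 8.0),
    ],
)


def roll_up(level):
    """Aggregate the amount measure at a location level ('city' or 'country')."""
    if level not in ("city", "country"):
        # Whitelist check: the column name is spliced into the SQL text.
        raise ValueError(f"unknown level: {level}")
    sql = (
        f"SELECT {level}, quarter, SUM(amount) FROM sales "
        f"GROUP BY {level}, quarter ORDER BY {level}, quarter"
    )
    return conn.execute(sql).fetchall()


by_country = roll_up("country")
print(by_country)
```

Calling roll_up("city") returns the finer-grained view, so moving between the two calls corresponds to drilling down and rolling up.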

5. DW Architecture-top tier

• The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

6. Extract, Transform, and Load (ETL)

• The ETL system consists of a work area, instantiated data structures, and a set of processes.

• It is everything between the operational source systems and the DW/BI presentation area.

• Extraction is the first step in the process of getting data into the data warehouse environment.

• Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.

• Extracting means reading and understanding the source data and copying the data needed into the ETL system for further manipulation.

• At this point, the data belongs to the data warehouse.

• After the data is extracted to the ETL system, there are numerous potential transformations, such as cleansing the data (correcting misspellings, resolving domain conflicts, dealing with missing elements, or parsing into standard formats), combining data from multiple sources, and de-duplicating data.
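
A few of these transformations can be sketched in plain Python. The rows, field names, misspelling table, and accepted date formats below are all assumptions made for the example; a real ETL system would drive them from its own rules and metadata.

```python
from datetime import datetime

# Hypothetical extracted rows; every value here is illustrative.
raw_rows = [
    {"name": " Ali ", "city": "Bagdad", "joined": "07/10/2019"},
    {"name": "ali", "city": "Baghdad", "joined": "2019-10-07"},
    {"name": "Sara", "city": None, "joined": "2019-10-08"},
]
CITY_FIXES = {"bagdad": "Baghdad"}  # assumed table of known misspellings


def parse_date(text):
    """Parse into a standard ISO date; try day/month/year first, then ISO."""
    for fmt in ("%d/%m/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def clean(row):
    name = row["name"].strip().title()           # normalize spacing and case
    city = row["city"] or "Unknown"              # deal with missing elements
    city = CITY_FIXES.get(city.lower(), city)    # correct known misspellings
    return (name, city, parse_date(row["joined"]))


# De-duplicate after cleaning; dict.fromkeys keeps the first occurrence in order.
cleaned = list(dict.fromkeys(clean(r) for r in raw_rows))
print(cleaned)
```

De-duplication is done after cleansing on purpose: the first two rows only become identical once names, cities, and dates are in a standard format.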

• Data cleaning, which detects errors in the data and rectifies them when possible.

• Data transformation, which converts data from legacy or host format to warehouse format.

• The ETL system adds value to the data with these cleansing and conforming tasks by changing the data and enhancing it.

• These activities can be architected to create diagnostic metadata to improve data quality in the source systems over time.

• The final step of ETL is the physical structuring and loading of data into the presentation area’s target dimensional models.

• Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.

• Refresh, which propagates the updates from the data sources to the warehouse.

• These subsystems are critical because the primary mission of the ETL system is to hand off the dimension and fact tables.

• Many of these subsystems focus on dimension table processing:

– Surrogate key assignments, code lookups to provide appropriate descriptions, splitting or combining columns to present the appropriate data values, or joining underlying third normal form table structures into flattened denormalized dimensions.

• In many cases, the ETL system is not based on relational technology but instead on a system of flat files.
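
Two of the dimension processing tasks named above, surrogate key assignment and code lookups, can be sketched as follows. The natural keys, the status codes, and the names key_map and dim_customer are hypothetical and only illustrate the general pattern.

```python
from itertools import count

# Hypothetical source rows: (operational natural key, status code).
source_customers = [("C-17", "A"), ("C-42", "I"), ("C-17", "A")]
STATUS_DESCRIPTIONS = {"A": "Active", "I": "Inactive"}  # assumed code lookup table

next_key = count(1)   # surrogate keys: simple integers generated by the ETL system
key_map = {}          # natural key -> surrogate key
dim_customer = []     # flattened dimension rows handed to the presentation area

for natural_key, status_code in source_customers:
    if natural_key in key_map:
        continue      # this customer already has a dimension row
    key_map[natural_key] = next(next_key)
    dim_customer.append(
        {
            "customer_key": key_map[natural_key],   # surrogate key
            "customer_id": natural_key,             # natural key kept as an attribute
            "status": STATUS_DESCRIPTIONS.get(status_code, "Unknown"),
        }
    )

print(dim_customer)
```

Fact rows arriving later would use key_map to swap each natural key for its surrogate key before loading.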

• After the data has been validated for conformance with the 1-1 and M-M business rules, is it still necessary to build a 3NF physical database?

• When the data arrives at the doorstep of the ETL system in a 3NF relational format, the ETL system developers may be more comfortable performing the cleansing and transformation tasks using normalized structures.

• The creation of both normalized structures for the ETL and dimensional structures for presentation means that the data is potentially extracted, transformed, and loaded twice: once into the normalized database and then again when you load the dimensional model.

• It is a mistake to focus on constructing normalized structures at the expense of developing the dimensional presentation.

7. Data Cube: A Multidimensional Data Model

• A data cube allows data to be modeled and viewed in multiple dimensions.

• It is defined by dimensions and facts.

• Dimensions are the perspectives or entities with respect to which an organization wants to keep records.

• A multidimensional data model is typically organized around a central theme, such as sales.

• This theme is represented by a fact table.

• Facts are numeric measures.
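
As a small illustration of dimensions and facts, the sketch below builds a 2-D view of a sales theme from a handful of fact rows. The dimension names (quarter, item), the measure (units sold), and the rows themselves are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (quarter, item, units_sold). The first two fields are
# dimension values; the last one is the numeric measure (the fact).
fact_sales = [
    ("Q1", "phone", 5),
    ("Q1", "laptop", 3),
    ("Q2", "phone", 4),
    ("Q1", "phone", 2),
]

# A 2-D "cube": one cell per (quarter, item) combination holding the summed measure.
cube = defaultdict(int)
for quarter, item, units in fact_sales:
    cube[(quarter, item)] += units

for cell, units in sorted(cube.items()):
    print(cell, units)
```

Adding a third dimension such as location would only extend the cell key to (quarter, item, location).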

• 2-D Cube

• 4-D Cube

• The cuboid that holds the lowest level of summarization is called the base cuboid.

• The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.
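
The base and apex cuboids can be seen by computing one cuboid for every subset of the dimensions. The sketch below assumes three hypothetical dimensions without concept hierarchies, so there are 2**3 = 8 cuboids; the fact rows and the names cuboid and lattice are illustrative only.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location")  # hypothetical dimension names
facts = [
    {"time": "Q1", "item": "phone", "location": "Basrah", "sales": 5},
    {"time": "Q1", "item": "laptop", "location": "Basrah", "sales": 3},
    {"time": "Q2", "item": "phone", "location": "Baghdad", "sales": 4},
]


def cuboid(group_by):
    """Sum the sales measure for every combination of the given dimensions."""
    cells = {}
    for row in facts:
        key = tuple(row[d] for d in group_by)
        cells[key] = cells.get(key, 0) + row["sales"]
    return cells


# One cuboid per subset of the dimensions, from 0-D up to 3-D.
lattice = {
    dims: cuboid(dims)
    for k in range(len(DIMENSIONS) + 1)
    for dims in combinations(DIMENSIONS, k)
}

apex = lattice[()]           # 0-D cuboid: highest level of summarization
base = lattice[DIMENSIONS]   # 3-D cuboid: lowest level of summarization
print(len(lattice), apex, len(base))
```

The apex cuboid holds a single cell with the grand total, while the base cuboid keeps one cell per distinct combination of all three dimension values.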

End of Introduction to DW