Chap 11 Corrected

download Chap 11 Corrected

of 34

Transcript of Chap 11 Corrected

  • 7/29/2019 Chap 11 Corrected

    1/34

    Chapter 11

    Chapter 11:

    Data Warehousing

    Modern Database Management

    8th Edition

    Jeffrey A. Hoffer, Mary B. Prescott,

    Fred R. McFadden2

    007byPrenticeHall

    FAROOQ

    1

  • 7/29/2019 Chap 11 Corrected

    2/34

    Chapter 11

    Objectives

    Definition of terms

    Reasons for information gap between informationneeds and availability

    Reasons for need of data warehousing

    Describe three levels of data warehousearchitectures

    List four steps of data reconciliation

    Describe two components of star schema

    Estimate fact table size

    Design a data mart2

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    3/34

    Chapter 11

    Definition Data Warehouse:

    A subject-oriented, integrated, time-variant, non-updatable

    collection of data used in support of management decision-making processes

    Subject-oriented: e.g. customers, patients, students,products

    Integrated: Consistent naming conventions, formats,

    encoding structures; from multiple data sources Time-variant: Can study trends and changes

    Nonupdatable: Read-only, periodically refreshed

    Data Mart: A data warehouse that is limited in scope

    3

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    4/34

    Chapter 11

    Need for Data Warehousing Integrated, company-wide view of high-quality

    information (from disparate databases)

    Separation ofoperationaland informationalsystems and

    data (for improved performance)

    4

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    5/34

    Chapter 11

    5Source: adapted from Strange (1997).

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    6/34

    Chapter 11

    Data Warehouse Architectures Generic Two-Level Architecture

    Independent Data Mart

    Dependent Data Mart and Operational Data Store

    Logical Data Mart and Real-Time Data Warehouse

    Three-Layer architecture

    6

    All involve some form ofextraction,transformation and loadi

    ng (ETL)

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    7/34

    Chapter 11

    7

    Figure 11-2: Generic two-level data warehousing architecture

    E

    T

    L

    One,company-

    wide

    warehouse

    Periodic extraction data is not completely current in warehouse

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    8/34

    Chapter 11

    8

    Figure 11-3 Independent data mart

    data warehousing architectureData marts:Mini-warehouses, limited in scope

    E

    T

    L

    Separate ETL for each

    independent data mart

    Data access complexity

    due tomultiple data marts

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    9/34

    Chapter 11

    9

    Figure 11-4 Dependent data mart with

    operational data store:a three-level architecture

    E

    T

    L

    Single ETL for

    enterprise data warehouse

    (EDW)

    Simpler data access

    ODS provides option for

    obtainingcurrent data

    Dependent data marts

    loaded from EDW

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    10/34

    Chapter 11

    10

    E

    T

    L

    Near real-time ETL for

    Data Warehouse

    ODS and data warehouse

    are one and the same

    Data marts are NOT separate databases,

    but logical views of the data warehouse

    Easier to create new data marts

    Figure 11-5 Logical data mart and real

    time warehouse architecture

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    11/34

    Chapter 11

    11

    Figure 11-6 Three-layer data architecture for a data warehouse

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    12/34

    Chapter 11

    Data CharacteristicsStatus vs. Event Data

    12

    Status

    Status

    Event = a database action(create/update/delete) thatresults from a transaction

    Figure 11-7

    Example of DBMS

    log entry

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    13/34

    Chapter 11 13

    Data CharacteristicsTransient vs. Periodic Data

    Withtransientdata,changesto existingrecords

    arewrittenoverpreviousrecords,

    thusdestroyingthepreviousdata

    content

    Figure 11-8

    Transient

    operational data

  • 7/29/2019 Chap 11 Corrected

    14/34

    Chapter 11 14

    Periodicdata are

    neverphysicallyaltered or

    deletedonce they

    havebeen

    added tothe store

    Data CharacteristicsTransient vs. Periodic Data

    Figure 11-9:

    Periodic

    warehouse data

  • 7/29/2019 Chap 11 Corrected

    15/34

    Chapter 11

    Other Data Warehouse

    Changes New descriptive attributes

    New business activity attributes

    New classes of descriptive attributes

    Descriptive attributes become more refined

    Descriptive data are related to one another

    New source of data

    15

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    16/34

    Chapter 11

    The Reconciled Data Layer

    Typical operational data is: Transient

    not historical

    Not normalized (perhaps due to denormalization for performance)

    Restricted in scopenot comprehensive

    Sometimes poor qualityinconsistencies and errors

    After ETL, data should be: Detailednot summarized yet

    Historicalperiodic

    Normalized3rd normal form or higher

    Comprehensiveenterprise-wide perspective

    Timely

    data should be current enough to assist decision-making

    Quality controlledaccurate with full integrity

    16

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    17/34

    Chapter 11

    The ETL Process

    Capture/Extract

    Scrub or data cleansingTransform

    Load and Index

    17

    ETL = Extract, transform, and load

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    18/34

    Chapter 11

    18

    Static extract = capturinga snapshot of the sourcedata at a point in time

    Incremental extract =capturing changes thathave occurred since the laststatic extract

    Capture/Extractobtaining a snapshot of a chosen subsetof the source data for loading into the data warehouse

    Figure 11-10:

    Steps in data

    reconciliation

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    19/34

    Chapter 11

    19

    Scrub/Cleanseuses pattern recognition and AItechniques to upgrade data quality

    Fixing errors: misspellings,erroneous dates, incorrect fieldusage, mismatched addresses,missing data, duplicate data,inconsistencies

    Also: decoding, reformatting,time stamping, conversion, keygeneration, merging, errordetection/logging, locatingmissing data

    Figure 11-10:

    Steps in data

    reconciliation(cont.)

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    20/34

    Chapter 11

    20

    Transform = convert data from format of operationalsystem to format of data warehouse

    Record-level:Selectiondata partitioningJoiningdata combiningAggregationdata summarization

    Field-level:single-fieldfrom one field to one fieldmulti-fieldfrom many fields to one, orone field to many

    Figure 11-10:

    Steps in data

    reconciliation(cont.)

    2

    007byPrenticeHall

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    21/34

    Chapter 11

    21

    Load/Index= place transformed datainto the warehouse and create indexes

    Refresh mode: bulk rewritingof target data at periodic intervals

    Update mode: only changesin source data are written to datawarehouse

    Figure 11-10:

    Steps in data

    reconciliation(cont.)

    2

    007byPrenticeHa

    ll

    FAROOQ

    Fi 11 11 Si l fi ld t f ti

  • 7/29/2019 Chap 11 Corrected

    22/34

    Chapter 11

    22

    Figure 11-11: Single-field transformation

    In generalsome transformationfunction translates data from old

    form to new form

    Algorithmictransformation usesa formula or logical expression

    Tablelookupanotherapproach, uses a separatetable keyed by sourcerecord code

    2

    007byPrenticeHa

    ll

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    23/34

    Chapter 11

    23

    Figure 11-12: Multifield transformation

    M:1from many sourcefields to one target field

    1:Mfrom onesource field tomany target fields

    2

    007byPrenticeHa

    ll

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    24/34

    Chapter 11

    Derived Data Objectives

    Ease of use for decision support applications

    Fast response to predefined user queries

    Customized data for particular target audiences

    Ad-hoc query support

    Data mining capabilities

    Characteristics

    Detailed (mostly periodic) data

    Aggregate (for summary)

    Distributed (to departmental servers)

    24Most common data model = star schema

    (also called dimensional model)

    2

    007byPrenticeHa

    ll

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    25/34

    Chapter 11

    25

    Figure 11-13 Components of a star schema

    Fact tablescontain factualor quantitative data

    Dimension tablescontain descriptionsabout the subjects of the business

    1:N relationship between

    dimension tables and fact tables

    Excellent for ad-hoc queries, but bad for online transaction processing

    Dimension tables are denormalized to

    maximize performance

    2

    007byPrenticeHa

    ll

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    26/34

    Chapter 11

    26

    Figure 11-14 Star schema example

    Fact tableprovides statistics for salesbroken down by product, period andstore dimensions

    2

    007byPrenticeHa

    ll

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    27/34

    Chapter 11

    27

    Figure 11-15 Star schema with sample data

    2

    007byPrenticeHa

    ll

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    28/34

    Chapter 11

    Issues Regarding Star Schema Dimension table keys must be surrogate (non-intelligent

    and non-business related), because: Keys may change over time

    Length/format consistency

    Granularity of Fact Tablewhat level of detail do youwant? Transactional grain

    finest level

    Aggregated grainmore summarized

    Finer grains better market basket analysis capability

    Finer grainmore dimension tables, more rows in fact table

    Duration of the databasehow much history should bekept? Natural duration13 months or 5 quarters

    Financial institutions may need longer duration

    Older data is more difficult to source and cleanse28

    2

    007byPrenticeHa

    ll

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    29/34

    Chapter 11

    29

    Figure 11-16: Modeling dates

    Fact tables contain time-period data Date dimensions are important

    2

    007byPrenticeHa

    ll

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    30/34

    Chapter 11

    The User Interface

    Metadata (data catalog)

    Identify subjects of the data mart

    Identify dimensions and facts

    Indicate how data is derived from enterprise datawarehouses, including derivation rules

    Indicate how data is derived from operational data store,including derivation rules

    Identify available reports and predefined queries Identify data analysis techniques (e.g. drill-down)

    Identify responsible people30

    2

    007byPrenticeHa

    ll

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    31/34

    Chapter 11

    On-Line Analytical Processing (OLAP) Tools

    The use of a set of graphical tools that provides

    users with multidimensional views of their dataand allows them to analyze the data using simplewindowing techniques

    Relational OLAP (ROLAP)

    Traditional relational representation Multidimensional OLAP (MOLAP)

    Cube structure

    OLAP Operations

    Cube slicing

    come up with 2-D view of data

    Drill-downgoing from summary to more detailed views

    31

    2

    007byPrenticeHa

    ll

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    32/34

    Chapter 11

    32

    Figure 11-23 Slicing a data cube

    2

    007byPrenticeHa

    ll

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    33/34

    Chapter 11

    33

    Figure 11-24

    Example of drill-down

    Summary report

    Drill-down withcolor added

    Starting with summarydata, users can obtaindetails for particular

    cells

    2

    007byPrenticeHa

    ll

    FAROOQ

  • 7/29/2019 Chap 11 Corrected

    34/34

    Chapter 11

    Data Mining and Visualization Knowledge discovery using a blend of statistical, AI, and computer

    graphics techniques

    Goals: Explain observed events or conditions

    Confirm hypotheses

    Explore data for new or unexpected relationships

    Techniques

    Statistical regression Decision tree induction

    Clustering and signal processing

    Affinity

    Sequence association

    Case-based reasoning

    Rule discovery Neural nets

    Fractals

    Data visualizationrepresenting data in graphical/multimedia formatsfor analysis

    34

    2

    007byPrenticeHa

    ll

    FAROOQ