2datawarehouse2.ppt

download 2datawarehouse2.ppt

of 57

Transcript of 2datawarehouse2.ppt

  • 8/10/2019 2datawarehouse2.ppt

    1/57

    October 14, 2014 1

    Data Mining:Concepts and Techniques

  • 8/10/2019 2datawarehouse2.ppt

    2/57

    October 14, 2014 2

    Chapter 2: Data Warehousing andOLAP Technology for Data Mining

    What is a data warehouse?

    A multi-dimensional data model

    Data warehouse architecture

    Data warehouse implementation

    Further development of data cube technology

    From data warehousing to data mining

  • 8/10/2019 2datawarehouse2.ppt

    3/57

    October 14, 2014 3

    What is Data Warehouse?

    Defined in many different ways, but not rigorously.

    A decision support database that is maintainedseparately from the organizations operationaldatabase

    Support information processingby providing a solidplatform of consolidated, historical data for analysis.

    A data warehouse is asubject-oriented,integrated,time-variant, and nonvolatilecollection of data in support

    of managements decision-making process.W. H.Inmon

    Data warehousing:

    The process of constructing and using data

    warehouses

  • 8/10/2019 2datawarehouse2.ppt

    4/57

    October 14, 2014 4

    Data WarehouseSubject-Oriented

    Organized around major subjects, such as customer,

    product, sales.

    Focusing on the modeling and analysis of data for

    decision makers, not on daily operations or transaction

    processing.

    Provide a simple and conciseview around particularsubject issues by excluding data that are not useful in

    the decision support process.

  • 8/10/2019 2datawarehouse2.ppt

    5/57

    October 14, 2014 5

    Data WarehouseIntegrated

    Constructed by integrating multiple, heterogeneousdata sources

    relational databases, flat files, on-line transactionrecords

    Data cleaning and data integration techniques areapplied.

    Ensure consistency in naming conventions, encodingstructures, attribute measures, etc. among different

    data sources E.g., Hotel price: currency, tax, breakfast covered, etc.

    When data is moved to the warehouse, it isconverted.

  • 8/10/2019 2datawarehouse2.ppt

    6/57

    October 14, 2014 6

    Data WarehouseTime Variant

    The time horizon for the data warehouse is significantly

    longer than that of operational systems.

    Operational database: current value data. Data warehouse data: provide information from a

    historical perspective (e.g., past 5-10 years)

    Every key structure in the data warehouse

    Contains an element of time, explicitly or implicitly

    But the key of operational data may or may not

    contain time element.

  • 8/10/2019 2datawarehouse2.ppt

    7/57October 14, 2014 7

    Data WarehouseNon-Volatile

    A physically separate storeof data transformed from the

    operational environment.

    Operational update of data does not occurin the datawarehouse environment.

    Does not require transaction processing, recovery,

    and concurrency control mechanisms

    Requires only two operations in data accessing:

    initial loading of dataand access of data.

  • 8/10/2019 2datawarehouse2.ppt

    8/57October 14, 2014 8

    Data Warehouse vs. Heterogeneous DBMS

    Traditional heterogeneous DB integration:

    Build wrappers/mediatorson top of heterogeneous databases

    Query drivenapproach

    When a query is posed to a client site, a meta-dictionary isused to translate the query into queries appropriate for

    individual heterogeneous sites involved, and the results are

    integrated into a global answer set

    Complex information filtering, compete for resources

    Data warehouse: update-driven, high performance

    Information from heterogeneous sources is integrated in advance

    and stored in warehouses for direct query and analysis

  • 8/10/2019 2datawarehouse2.ppt

    9/57October 14, 2014 9

    Data Warehouse vs. Operational DBMS

    OLTP (on-line transaction processing)

    Major task of traditional relational DBMS

    Day-to-day operations: purchasing, inventory, banking,

    manufacturing, payroll, registration, accounting, etc.

    OLAP (on-line analytical processing)

    Major task of data warehouse system

    Data analysis and decision making

    Distinct features (OLTP vs. OLAP):

    User and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidated

    Database design: ER + application vs. star + subject

    View: current, local vs. evolutionary, integrated

    Access patterns: update vs. read-only but complex queries

  • 8/10/2019 2datawarehouse2.ppt

    10/57

  • 8/10/2019 2datawarehouse2.ppt

    11/57October 14, 2014 11

    Why Separate Data Warehouse?

    High performance for both systems

    DBMStuned for OLTP: access methods, indexing,concurrency control, recovery

    Warehousetuned for OLAP: complex OLAP queries,

    multidimensional view, consolidation. Different functions and different data:

    missing data: Decision support requires historical datawhich operational DBs do not typically maintain

    data consolidation: DS requires consolidation(aggregation, summarization) of data fromheterogeneous sources

    data quality: different sources typically useinconsistent data representations, codes and formats

    which have to be reconciled

  • 8/10/2019 2datawarehouse2.ppt

    12/57October 14, 2014 12

    From Tables and Spreadsheetsto Data Cubes

    A data warehouse is based on a multidimensional data modelwhich

    views data in the form of a data cube

    A data cube, such as sales, allows data to be modeled and viewed

    in multiple dimensions

    Dimension tables, such as item (item_name, brand, type), or

    time(day, week, month, quarter, year)

    Fact table contains measures (such as dollars_sold) and keys to

    each of the related dimension tables In data warehousing literature, an n-D base cube is called a base

    cuboid. The top most 0-D cuboid, which holds the highest-level of

    summarization, is called the apex cuboid. The lattice of cuboids

    forms a data cube.

  • 8/10/2019 2datawarehouse2.ppt

    13/57October 14, 2014 13

    Cube: A Lattice of Cuboids

    all

    time item location supplier

    time,item time,location

    time,supplier

    item,location

    item,supplier

    location,supplier

    time,item,location

    time,item,supplier

    time,location,supplier

    item,location,supplier

    time, item, location, supplier

    0-D(apex) cuboid

    1-D cuboids

    2-D cuboids

    3-D cuboids

    4-D(base) cuboid

  • 8/10/2019 2datawarehouse2.ppt

    14/57October 14, 2014 14

    Conceptual Modeling ofData Warehouses

    Modeling data warehouses: dimensions & measures

    Star schema:A fact table in the middle connected to a

    set of dimension tables

    Snowflake schema: A refinement of star schema

    where some dimensional hierarchy is normalizedinto a

    set of smaller dimension tables, forming a shape

    similar to snowflake Fact constellations: Multiple fact tables share

    dimension tables, viewed as a collection of stars,

    therefore called galaxy schemaor fact constellation

  • 8/10/2019 2datawarehouse2.ppt

    15/57October 14, 2014 15

    Example of Star Schema

    time_key

    day

    day_of_the_week

    month

    quarter

    year

    time

    location_key

    streetcity

    province_or_street

    country

    location

    Sales Fact Table

    time_key

    item_key

    branch_key

    location_key

    units_sold

    dollars_sold

    avg_sales

    Measures

    item_key

    item_name

    brand

    type

    supplier_type

    item

    branch_key

    branch_namebranch_type

    branch

  • 8/10/2019 2datawarehouse2.ppt

    16/57October 14, 2014 16

    Example of Snowflake Schema

    time_key

    day

    day_of_the_week

    month

    quarter

    year

    time

    location_key

    street

    city_key

    location

    Sales Fact Table

    time_key

    item_key

    branch_key

    location_key

    units_solddollars_sold

    avg_sales

    Measures

    item_key

    item_name

    brand

    type

    supplier_key

    item

    branch_key

    branch_namebranch_type

    branch

    supplier_key

    supplier_type

    supplier

    city_key

    city

    province_or_stree

    country

    city

  • 8/10/2019 2datawarehouse2.ppt

    17/57October 14, 2014 17

    Example of Fact Constellation

    time_key

    day

    day_of_the_week

    month

    quarter

    year

    time

    location_key

    streetcity

    province_or_street

    country

    location

    Sales Fact Table

    time_key

    item_key

    branch_key

    location_key

    units_sold

    dollars_sold

    avg_sales

    Measures

    item_key

    item_name

    brand

    type

    supplier_type

    item

    branch_key

    branch_namebranch_type

    branch

    Shipping Fact Table

    time_key

    item_key

    shipper_key

    from_location

    to_location

    dollars_cost

    units_shipped

    shipper_key

    shipper_name

    location_keyshipper_type

    shipper

  • 8/10/2019 2datawarehouse2.ppt

    18/57October 14, 2014 18

    A Data Mining Query Language,DMQL: Language Primitives

    Cube Definition (Fact Table)

    define cube []:

    Dimension Definition ( Dimension Table )

    define dimension as()

    Special Case (Shared Dimension Tables)

    First time as cube definition define dimension as

    in cube

  • 8/10/2019 2datawarehouse2.ppt

    19/57October 14, 2014 19

    Defining a Star Schema in DMQL

    define cubesales_star [time, item, branch, location]:

    dollars_sold = sum(sales_in_dollars), avg_sales =avg(sales_in_dollars), units_sold = count(*)

    define dimensiontime as (time_key, day, day_of_week,month, quarter, year)

    define dimension item as (item_key, item_name, brand,type, supplier_type)

    define dimension branch as(branch_key, branch_name,branch_type)

    define dimensionlocation as(location_key, street, city,province_or_state, country)

  • 8/10/2019 2datawarehouse2.ppt

    20/57October 14, 2014 20

    Defining a Snowflake Schema in DMQL

    define cubesales_snowflake [time, item, branch, location]:

    dollars_sold = sum(sales_in_dollars), avg_sales =

    avg(sales_in_dollars), units_sold = count(*)

    define dimensiontime as (time_key, day, day_of_week,month, quarter, year)

    define dimension item as (item_key, item_name, brand, type,

    supplier(supplier_key, supplier_type))

    define dimension branch as(branch_key, branch_name,

    branch_type)

    define dimensionlocation as(location_key, street,

    city(city_key, province_or_state, country))

  • 8/10/2019 2datawarehouse2.ppt

    21/57October 14, 2014 21

    Defining a Fact Constellation in DMQL

    define cubesales [time, item, branch, location]:

    dollars_sold = sum(sales_in_dollars), avg_sales =avg(sales_in_dollars), units_sold = count(*)

    define dimensiontime as (time_key, day, day_of_week, month, quarter, year)

    define dimension item as (item_key, item_name, brand, type, supplier_type)

    define dimension branch as(branch_key, branch_name, branch_type)define dimensionlocation as(location_key, street, city, province_or_state,

    country)

    define cubeshipping [time, item, shipper, from_location, to_location]:

    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)

    define dimensiontime as time in cubesalesdefine dimension item as item in cubesales

    define dimension shipper as(shipper_key, shipper_name, locationaslocationin cubesales, shipper_type)

    define dimensionfrom_location aslocation in cubesales

    define dimensionto_location aslocation in cubesales

  • 8/10/2019 2datawarehouse2.ppt

    22/57October 14, 2014 22

    Measures: Three Categories

    distributive: if the result derived by applying the function

    to n aggregate values is the same as that derived by

    applying the function on all the data without partitioning.

    E.g., count(), sum(), min(), max().

    algebraic:if it can be computed by an algebraic function

    with Marguments (whereMis a bounded integer), each

    of which is obtained by applying a distributive aggregate

    function.

    E.g., avg(), min_N(), standard_deviation().

    holistic: if there is no constant bound on the storage size

    needed to describe a subaggregate.

    E.g., median(), mode(), rank().

  • 8/10/2019 2datawarehouse2.ppt

    23/57

    October 14, 2014 23

    A Concept Hierarchy: Dimension (location)

    all

    Europe North_America

    MexicoCanadaSpainGermany

    Vancouver

    M. WindL. Chan

    ...

    ......

    ... ...

    ...

    all

    region

    office

    country

    TorontoFrankfurtcity

  • 8/10/2019 2datawarehouse2.ppt

    24/57

  • 8/10/2019 2datawarehouse2.ppt

    25/57

    October 14, 2014 25

    Multidimensional Data

    Sales volume as a function of product, month,and region

    Pro

    duct

    Month

    Dimensions: Product, Location, Time

    Hierarchical summarization paths

    Industry Region Year

    Category Country Quarter

    Product City Month Week

    Office Day

  • 8/10/2019 2datawarehouse2.ppt

    26/57

    October 14, 2014 26

    A Sample Data Cube

    Total annual salesof TV in U.S.A.

    Date

    Countr

    ysum

    sumTV

    VCRPC

    1Qtr 2Qtr 3Qtr 4Qtr

    U.S.A

    Canada

    Mexico

    sum

  • 8/10/2019 2datawarehouse2.ppt

    27/57

    October 14, 2014 27

    Cuboids Corresponding to the Cube

    all

    product date country

    product,date product,country date, country

    product, date, country

    0-D(apex) cuboid

    1-D cuboids

    2-D cuboids

    3-D(base) cuboid

  • 8/10/2019 2datawarehouse2.ppt

    28/57

    October 14, 2014 28

    Browsing a Data Cube

    Visualization

    OLAP capabilities

    Interactive manipulation

  • 8/10/2019 2datawarehouse2.ppt

    29/57

    October 14, 2014 29

    Typical OLAP Operations

    Roll up (drill-up):summarize data

    by climbing up hierarchy or by dimension reduction

    Drill down (roll down):reverse of roll-up

    from higher level summary to lower level summary or detailed

    data, or introducing new dimensions

    Slice and dice:

    project and select

    Pivot (rotate):

    reorient the cube, visualization, 3D to series of 2D planes. Other operations

    drill across:involving (across) more than one fact table

    drill through:through the bottom level of the cube to its back-

    end relational tables (using SQL)

  • 8/10/2019 2datawarehouse2.ppt

    30/57

    October 14, 2014 30

    A Star-Net Query Model

    Shipping Method

    AIR-EXPRESS

    TRUCKORDER

    Customer Orders

    CONTRACTS

    Customer

    Product

    PRODUCT GROUP

    PRODUCT LINE

    PRODUCT ITEM

    SALES PERSON

    DISTRICT

    DIVISION

    OrganizationPromotion

    CITY

    COUNTRY

    REGION

    Location

    DAILYQTRLYANNUALYTime

    Each circle is

    called a footprint

  • 8/10/2019 2datawarehouse2.ppt

    31/57

    October 14, 2014 31

    Chapter 2: Data Warehousing andOLAP Technology for Data Mining

    What is a data warehouse?

    A multi-dimensional data model

    Data warehouse architecture

    Data warehouse implementation

    Further development of data cube technology

    From data warehousing to data mining

  • 8/10/2019 2datawarehouse2.ppt

    32/57

    October 14, 2014 32

    Design of a Data Warehouse: ABusiness Analysis Framework

    Four views regarding the design of a data warehouse

    Top-down view

    allows selection of the relevant information necessary for the

    data warehouse

    Data source view

    exposes the information being captured, stored, and

    managed by operational systems

    Data warehouse view consists of fact tables and dimension tables

    Business query view

    sees the perspectives of data in the warehouse from the view

    of end-user

  • 8/10/2019 2datawarehouse2.ppt

    33/57

    October 14, 2014 33

    Data Warehouse Design Process

    Top-down, bottom-up approaches or a combination of both

    Top-down: Starts with overall design and planning (mature)

    Bottom-up: Starts with experiments and prototypes (rapid)

    From software engineering point of view

    Waterfall: structured and systematic analysis at each step beforeproceeding to the next

    Spiral: rapid generation of increasingly functional systems, shortturn around time, quick turn around

    Typical data warehouse design process

    Choose a business processto model, e.g., orders, invoices, etc.

    Choose the grain(atomic level of data)of the business process

    Choose the dimensionsthat will apply to each fact table record

    Choose the measurethat will populate each fact table record

  • 8/10/2019 2datawarehouse2.ppt

    34/57

    October 14, 2014 34

    Multi-Tiered Architecture

    Data

    Warehouse

    Extract

    Transform

    Load

    Refresh

    OLAP Engine

    Analysis

    QueryReports

    Data mining

    Monitor

    &

    Integrator

    Metadata

    Data Sources Front-End Tools

    Serve

    Data Marts

    OperationalDBs

    other

    sources

    Data Storage

    OLAP Server

  • 8/10/2019 2datawarehouse2.ppt

    35/57

    October 14, 2014 35

    Three Data Warehouse Models

    Enterprise warehouse collects all of the information about subjects spanning

    the entire organization

    Data Mart

    a subset of corporate-wide data that is of value to a

    specific groups of users. Its scope is confined to

    specific, selected groups, such as marketing data mart

    Independent vs. dependent (directly from warehouse) data mart

    Virtual warehouse

    A set of views over operational databases

    Only some of the possible summary views may be

    materialized

    D t W h D l t

  • 8/10/2019 2datawarehouse2.ppt

    36/57

    October 14, 2014 36

    Data Warehouse Development:A Recommended Approach

    Define a high-level corporate data model

    Data

    Mart

    Data

    Mart

    Distributed

    Data Marts

    Multi-Tier Data

    Warehouse

    Enterprise

    Data

    Warehouse

    Model refinementModel refinement

  • 8/10/2019 2datawarehouse2.ppt

    37/57

    October 14, 2014 37

    OLAP Server Architectures

    Relational OLAP (ROLAP)

    Use relational or extended-relational DBMS to store and managewarehouse data and OLAP middle ware to support missing pieces

    Include optimization of DBMS backend, implementation of

    aggregation navigation logic, and additional tools and services greater scalability

    Multidimensional OLAP (MOLAP)

    Array-based multidimensional storage engine (sparse matrixtechniques)

    fast indexing to pre-computed summarized data Hybrid OLAP (HOLAP)

    User flexibility, e.g., low level: relational, high-level: array

    Specialized SQL servers

    specialized support for SQL queries over star/snowflake schemas

  • 8/10/2019 2datawarehouse2.ppt

    38/57

    October 14, 2014 38

    Efficient Data Cube Computation

    Data cube can be viewed as a lattice of cuboids

    The bottom-most cuboid is the base cuboid

    The top-most cuboid (apex) contains only one cell

    How many cuboids in an n-dimensional cube with Llevels?

    Materialization of data cube

    Materialize every (cuboid) (full materialization), none

    (no materialization), or some (partial materialization)

    Selection of which cuboids to materialize

    Based on size, sharing, access frequency, etc.

    )11(

    n

    i iLT

  • 8/10/2019 2datawarehouse2.ppt

    39/57

    October 14, 2014 39

    Cube Operation

    Cube definition and computation in DMQL

    define cube sales[item, city, year]: sum(sales_in_dollars)

    compute cubesales

    Transform it into a SQL-like language (with a new operator cube by,introduced by Gray et al.96)

    SELECT item, city, year, SUM (amount)

    FROM SALES

    CUBE BYitem, city, year

    Need compute the following Group-Bys(date, product, customer),

    (date,product),(date, customer), (product, customer),

    (date), (product), (customer)

    ()

    (item)(city)

    ()

    (year)

    (city, item) (city, year) (item, year)

    (city, item, year)

  • 8/10/2019 2datawarehouse2.ppt

    40/57

    October 14, 2014 40

    Cube Computation: ROLAP-Based Method

    Efficient cube computation methods ROLAP-based cubing algorithms (Agarwal et al96)

    Array-based cubing algorithm (Zhao et al97)

    Bottom-up computation method (Bayer & Ramarkrishnan99)

    ROLAP-based cubing algorithms

    Sorting, hashing, and grouping operations are applied to the

    dimension attributes in order to reorder and cluster related

    tuples

    Grouping is performed on some subaggregates as a partial

    grouping step

    Aggregates may be computed from previously computed

    aggregates, rather than from the base fact table

    M lti A A ti f C b

  • 8/10/2019 2datawarehouse2.ppt

    41/57

    October 14, 2014 42

    Multi-way Array Aggregation for CubeComputation

    Partition arrays into chunks (a small subcube which fits in memory).

    Compressed sparse array addressing: (chunk_id, offset)

    Compute aggregates in multiway by visiting cube cells in the order

    which minimizes the # of times to visit each cell, and reduces

    memory access and storage cost.

    What is the best

    traversing order

    to do multi-wayaggregation?

    A

    B

    29 30 31 32

    1 2 3 4

    5

    9

    13 14 15 16

    6463626148474645

    a1a0

    c3c2

    c1c 0

    b3

    b2

    b1

    b0

    a2 a3

    C

    B

    44

    28 5640

    24 5236

    20

    60

  • 8/10/2019 2datawarehouse2.ppt

    42/57

    October 14, 2014 43

    Multi-way Array Aggregation forCube Computation

    A

    B

    29 30 31 32

    1 2 3 4

    5

    9

    13 14 15 16

    6463626148474645

    a1a0

    c3c2

    c1c 0

    b3

    b2

    b1

    b0a2 a3

    C

    4428

    5640

    2452

    3620

    60

    B

    M lti A A ti f

  • 8/10/2019 2datawarehouse2.ppt

    43/57

    October 14, 2014 44

    Multi-way Array Aggregation forCube Computation

    A

    B

    29 30 31 32

    1 2 3 4

    5

    9

    13 14 15 16

    6463626148474645

    a1a0

    c3c2

    c1c 0

    b3

    b2

    b1

    b0a2 a3

    C

    4428

    5640

    2452

    3620

    60

    B

  • 8/10/2019 2datawarehouse2.ppt

    44/57

    October 14, 2014 45

    Multi-Way Array Aggregation forCube Computation (Cont.)

    Method: the planes should be sorted and computedaccording to their size in ascending order.

    See the details of Example 2.12 (pp. 75-78)

    Idea: keep the smallest plane in the main memory,fetch and compute only one chunk at a time for thelargest plane

    Limitation of the method: computing well only for a

    small number of dimensions If there are a large number of dimensions, bottom-

    up computation and iceberg cube computationmethods can be explored

  • 8/10/2019 2datawarehouse2.ppt

    45/57

    October 14, 2014 46

    Indexing OLAP Data: Bitmap Index

    Index on a particular column

    Each value in the column has a bit vector: bit-op is fast

    The length of the bit vector: # of records in the base table

    The i-th bit is set if the i-th row of the base table has the value

    for the indexed column not suitable for high cardinality domains

    Cust Region Type

    C1 Asia RetailC2 Europe Dealer

    C3 Asia Dealer

    C4 America Retail

    C5 Europe Dealer

    RecID Retail Dealer

    1 1 0

    2 0 1

    3 0 1

    4 1 0

    5 0 1

    ecI Asia Europe America

    1 1 0 0

    2 0 1 0

    3 1 0 0

    4 0 0 1

    5 0 1 0

    Base table Index on Region Index on Type

  • 8/10/2019 2datawarehouse2.ppt

    46/57

    October 14, 2014 47

    Indexing OLAP Data: Join Indices

    Join index: JI(R-id, S-id) where R (R-id, ) S(S-id, )

    Traditional indices map the values to a list ofrecord ids

    It materializes relational join in JI file and

    speeds up relational joina rather costlyoperation

    In data warehouses, join index relates the valuesof the dimensionsof a start schema to rowsinthe fact table.

    E.g. fact table: Sales and two dimensions cityand product

    A join index on citymaintains for eachdistinct city a list of R-IDs of the tuplesrecording the Sales in the city

    Join indices can span multiple dimensions

  • 8/10/2019 2datawarehouse2.ppt

    47/57

    October 14, 2014 48

    Efficient Processing OLAP Queries

    Determine which operations should be performed on the

    available cuboids:

    transform drill, roll, etc. into corresponding SQL and/or

    OLAP operations, e.g, dice = selection + projection

    Determine to which materialized cuboid(s) the relevant

    operations should be applied.

    Exploring indexing structures and compressed vs. dense

    array structures in MOLAP

  • 8/10/2019 2datawarehouse2.ppt

    48/57

    October 14, 2014 49

    Metadata Repository

    Meta data is the data defining warehouse objects. It has the followingkinds

    Description of the structure of the warehouse

    schema, view, dimensions, hierarchies, derived data defn, data martlocations and contents

    Operational meta-data data lineage (history of migrated data and transformation path),

    currency of data (active, archived, or purged), monitoring information(warehouse usage statistics, error reports, audit trails)

    The algorithms used for summarization

    The mapping from operational environment to the data warehouse

    Data related to system performance warehouse schema, view and derived data definitions

    Business data

    business terms and definitions, ownership of data, charging policies

    D t W h B k E d T l d

  • 8/10/2019 2datawarehouse2.ppt

    49/57

    October 14, 2014 50

    Data Warehouse Back-End Tools andUtilities

    Data extraction: get data from multiple, heterogeneous, and external

    sources Data cleaning:

    detect errors in the data and rectify them when

    possible Data transformation:

    convert data from legacy or host format to warehouseformat

    Load:

    sort, summarize, consolidate, compute views, checkintegrity, and build indicies and partitions

    Refresh

    propagate the updates from the data sources to thewarehouse

    Ch t 2 D t W h i d

  • 8/10/2019 2datawarehouse2.ppt

    50/57

    October 14, 2014 51

    Chapter 2: Data Warehousing andOLAP Technology for Data Mining

    What is a data warehouse?

    A multi-dimensional data model

    Data warehouse architecture

    Data warehouse implementation

    Further development of data cube technology

    From data warehousing to data mining

    Discovery Driven Exploration of Data

  • 8/10/2019 2datawarehouse2.ppt

    51/57

    October 14, 2014 52

    Discovery-Driven Exploration of DataCubes

    Hypothesis-driven: exploration by user, huge search space

    Discovery-driven (Sarawagi et al.98)

    pre-compute measures indicating exceptions, guide user in the

    data analysis, at all levels of aggregation Exception: significantly different from the value anticipated,

    based on a statistical model

    Visual cues such as background color are used to reflect the

    degree of exception of each cell

    Computation of exception indicator (modeling fitting and

    computing SelfExp, InExp, and PathExp values) can be

    overlapped with cube construction

  • 8/10/2019 2datawarehouse2.ppt

    52/57

    October 14, 2014 53

    Examples: Discovery-Driven Data Cubes

    Complex Aggregation at Multiple

  • 8/10/2019 2datawarehouse2.ppt

    53/57

    October 14, 2014 54

    Complex Aggregation at MultipleGranularities: Multi-Feature Cubes

    Multi-feature cubes (Ross, et al. 1998): Compute complex queries

    involving multiple dependent aggregates at multiple granularities

    Ex. Grouping by all subsets of {item, region, month}, find the

    maximum price in 1997 for each group, and the total sales among all

    maximum price tuplesselectitem, region, month, max(price), sum(R.sales)

    from purchases

    whereyear = 1997

    cube byitem, region, month: Rsuch thatR.price = max(price)

    Continuing the last example, among the max price tuples, find the

    min and max shelf life, and find the fraction of the total sales due to

    tuple that have min shelf life within the set of all max price tuples

  • 8/10/2019 2datawarehouse2.ppt

    54/57

    October 14, 2014 55

    Data Warehouse Usage

    Three kinds of data warehouse applications

    Information processing

    supports querying, basic statistical analysis, and reporting

    using crosstabs, tables, charts and graphs

    Analytical processing

    multidimensional analysis of data warehouse data

    supports basic OLAP operations, slice-dice, drilling, pivoting

    Data mining

    knowledge discovery from hidden patterns

    supports associations, constructing analytical models,

    performing classification and prediction, and presenting the

    mining results using visualization tools.

    Differences among the three tasks

    From On Line Analytical Processing

  • 8/10/2019 2datawarehouse2.ppt

    55/57

    October 14, 2014 56

    From On-Line Analytical Processingto On Line Analytical Mining (OLAM)

    Why online analytical mining?

    High quality of data in data warehouses

    DW contains integrated, consistent, cleaned data

    Available information processing structure surrounding data

    warehouses ODBC, OLEDB, Web accessing, service facilities, reporting

    and OLAP tools

    OLAP-based exploratory data analysis

    mining with drilling, dicing, pivoting, etc.

    On-line selection of data mining functions

    integration and swapping of multiple mining functions,algorithms, and tasks.

    Architecture of OLAM

    An OLAM Architecture

  • 8/10/2019 2datawarehouse2.ppt

    56/57

    October 14, 2014 57

    An OLAM Architecture

    Data

    Warehouse

    Meta Data

    MDDB

    OLAM

    Engine

    OLAP

    Engine

    User GUI API

    Data Cube API

    Database API

    Data cleaning

    Data integration

    Layer3

    OLAP/OLAM

    Layer2

    MDDB

    Layer1

    Data

    Repository

    Layer4

    User Interface

    Filtering&Integration Filtering

    Databases

    Mining query Mining result

  • 8/10/2019 2datawarehouse2.ppt

    57/57

    Summary

    Data warehouse

    A subject-oriented, integrated, time-variant, and nonvolatile collection of

    data in support of managements decision-making process

    A multi-dimensional modelof a data warehouse

    Star schema, snowflake schema, fact constellations

    A data cube consists of dimensions & measures

    OLAPoperations: drilling, rolling, slicing, dicing and pivoting

    OLAP servers: ROLAP, MOLAP, HOLAP

    Efficient computation of data cubes

    Partial vs. full vs. no materialization

    Multiway array aggregation

    Bitmap index and join index implementations

    Further development of data cube technology

    Discovery-drive and multi-feature cubes

    From OLAP to OLAM (on-line analytical mining)