DataWarehouse-OLAP-1

download DataWarehouse-OLAP-1

of 91

Transcript of DataWarehouse-OLAP-1

  • 7/29/2019 DataWarehouse-OLAP-1

    1/91

    1

    OLAP and Data Mining

    Transparencies

  • 7/29/2019 DataWarehouse-OLAP-1

    2/91

    2

    Objectives

    Purpose of online analytical processing (OLAP)and how OLAP differs from data warehousing.

    Key features of OLAP applications.

    Potential benefits associated with successfulOLAP applications.

    Rules for OLAP tools and main types of toolsincluding: multi-dimensional OLAP (MOLAP),relational OLAP (ROLAP), and managed queryenvironment (MQE).

  • 7/29/2019 DataWarehouse-OLAP-1

    3/91

    3

    Objectives

    OLAP extensions to SQL.

    Concepts associated with data mining.

    Main data mining operations includingpredictive modeling, database segmentation,link analysis, and deviation detection.

    Relationship between data mining and datawarehousing.

  • 7/29/2019 DataWarehouse-OLAP-1

    4/91

    4

    Data warehousing and end-user access tools

    Accompanying growth in data warehouses isincreasing demands for more powerful accesstools providing advanced analytical

    capabilities.

    Key developments include:

    Online analytical processing (OLAP)

    SQL extensions for complex data analysis Data mining tools.

  • 7/29/2019 DataWarehouse-OLAP-1

    5/91

    5

    Introducing OLAP

    The dynamic synthesis, analysis, and

    consolidation of large volumes of multi-

    dimensional data, Codd (1993).

    Describes a technology that uses a multi-

    dimensional view of aggregate data to provide

    quick access to strategic information for

    purposes of advanced analysis.

  • 7/29/2019 DataWarehouse-OLAP-1

    6/91

    6

    Introducing OLAP

    Enables users to gain a deeper understanding

    and knowledge about various aspects of their

    corporate data through fast, consistent,

    interactive access to a wide variety of possibleviews of the data.

    Allows users to view corporate data in such a

    way that it is a better model of the true

    dimensionality of the enterprise.

  • 7/29/2019 DataWarehouse-OLAP-1

    7/91

    7

    Introducing OLAP

    Can easily answer who? and what?

    questions, however, ability to answer what if?

    and why? type questions distinguishes OLAP

    from general-purpose query tools.

    Types of analysis ranges from basic navigation

    and browsing (slicing and dicing) tocalculations, to more complex analyses such as

    time series and complex modeling.

  • 7/29/2019 DataWarehouse-OLAP-1

    8/91

    8

    OLAP Benchmarks

    OLAP Council published an analytical

    processing benchmark referred to as the APB-

    1 (OLAP Council, 1998).

    Aim is to measure a servers overall OLAP

    performance rather than the performance of

    individual tasks.

  • 7/29/2019 DataWarehouse-OLAP-1

    9/91

    9

    OLAP Benchmarks

    APB-1 assesses most common business operationsincluding:

    bulk loading of data from internal/external data sources

    incremental loading of data from operational systems;

    aggregation of input level data along hierarchies;

    calculation of new data based on business models;

    time series analysis;

    queries with a high degree of complexity;

    drill-down through hierarchies;

    ad hoc queries;

    multiple online sessions.

  • 7/29/2019 DataWarehouse-OLAP-1

    10/91

    10

    OLAP Benchmarks

    OLAP applications are judged on their abilityto provide just-in-time (JIT) information, acore requirement of supporting effectivedecision-making.

    Assessing a servers ability to satisfy thisrequirement is more than measuring

    processing performance but includes itsabilities to model complex businessrelationships and to respond to changingbusiness requirements.

  • 7/29/2019 DataWarehouse-OLAP-1

    11/91

    11

    OLAP Benchmarks

    APB-1 uses a standard benchmark metric

    called AQM (Analytical Queries per Minute).

    AQM represents number of analytical queries

    processed per minute including data loading

    and computation time. Thus, AQM

    incorporates data loading performance,calculation performance, and query

    performance into a singe metric.

  • 7/29/2019 DataWarehouse-OLAP-1

    12/91

    12

    OLAP Benchmarks

    Publication of APB-1 benchmark results must

    include both the database schema and all code

    required for executing the benchmark.

    An essential requirement of all OLAP

    applications is ability to provide users with JIT

    information, to make effective decisions aboutan organization's strategic directions.

  • 7/29/2019 DataWarehouse-OLAP-1

    13/91

    13

    OLAP Applications

    JIT information is computed data that usually

    reflects complex relationships and is often

    calculated on the fly.

    Also as data relationships may not be known inadvance, the data model must be flexible.

  • 7/29/2019 DataWarehouse-OLAP-1

    14/91

    14

    Examples of OLAP applications in various

    functional areas

  • 7/29/2019 DataWarehouse-OLAP-1

    15/91

    15

    OLAP Applications

    Although OLAP applications are found in

    widely divergent functional areas, all have

    following key features:

    multi-dimensional views of data

    support for complex calculations

    time intelligence.

  • 7/29/2019 DataWarehouse-OLAP-1

    16/91

    16

    OLAP Applications - multi-dimensional

    views of data

    Core requirement of building a realisticbusiness model.

    Provides basis for analytical processingthrough flexible access to corporate data.

    The underlying database design that provides

    the multi-dimensional view of data should treatall dimensions equally.

  • 7/29/2019 DataWarehouse-OLAP-1

    17/91

    17

    OLAP Applications - support for complex

    calculations

    Must provide a range of powerful

    computational methods such as that required

    by sales forecasting, which uses trend

    algorithms such as moving averages andpercentage growth.

    Mechanisms for implementing computationalmethods should be clear and non-procedural.

  • 7/29/2019 DataWarehouse-OLAP-1

    18/91

    18

    OLAP Applicationstime intelligence

    Key feature of almost any analyticalapplication as performance is almost always

    judged over time.

    Time hierarchy is not always used in samemanner as other hierarchies.

    Concepts such as year-to-date and period-over-period comparisons should be easily defined.

  • 7/29/2019 DataWarehouse-OLAP-1

    19/91

    19

    OLAP Benefits

    Increased productivity of end-users.

    Reduced backlog of applications development

    for IT staff.

    Retention of organizational control over the

    integrity of corporate data.

    Reduced query drag and network traffic on

    OLTP systems or on the data warehouse. Improved potential revenue and profitability.

  • 7/29/2019 DataWarehouse-OLAP-1

    20/91

    20

    Representing Multi-dimensional Data

    Example of two-dimensional query.

    What is the total revenue generated by property

    sales in each city, in each quarter of 1997?

    Choice of representation is based on types of

    queries end-user may ask.

    Compare representation - three-field relationaltable versus two-dimensional matrix.

  • 7/29/2019 DataWarehouse-OLAP-1

    21/91

    21

    Multi-dimensional Data as Three-field table

    versus Two-dimensional Matrix

  • 7/29/2019 DataWarehouse-OLAP-1

    22/91

    22

    Representing Multi-dimensional Data

    Example of three-dimensional query.

    What is the total revenue generated by property

    sales for each type of property (Flat or House) in

    each city, in each quarter of 1997?

    Compare representation - four-field relational

    table versus three-dimensional cube.

  • 7/29/2019 DataWarehouse-OLAP-1

    23/91

    23

    Multi-dimensional Data as Four-field Table

    versus Three-dimensional Cube

  • 7/29/2019 DataWarehouse-OLAP-1

    24/91

    24

    Representing Multi-dimensional Data

    Cube represents data as cells in an array.

    Relational table only represents multi-

    dimensional data in two dimensions.

  • 7/29/2019 DataWarehouse-OLAP-1

    25/91

    25

    Multi-dimensional OLAP Servers

    Use multi-dimensional structures to store data

    and relationships between data.

    Multi-dimensional structures are bestvisualized as cubes of data, and cubes within

    cubes of data. Each side of cube is a dimension.

    A cube can be expanded to include otherdimensions.

  • 7/29/2019 DataWarehouse-OLAP-1

    26/91

    26

    Multi-dimensional OLAP Servers

    A cube supports matrix arithmetic.

    Multi-dimensional query response time

    depends on how many cells have to be addedon the fly.

    As number of dimensions increases, number of

    the cubes cells increases exponentially.

  • 7/29/2019 DataWarehouse-OLAP-1

    27/91

    27

    Multi-dimensional OLAP Servers

    However, majority of multi-dimensionalqueries use summarized, high-level data.

    Solution is to pre-aggregate (consolidate) alllogical subtotals and totals along alldimensions.

    Pre-aggregation is valuable, as typical

    dimensions are hierarchical in nature. (e.g. Time dimension hierarchy - years, quarters,

    months, weeks, and days)

  • 7/29/2019 DataWarehouse-OLAP-1

    28/91

    28

    Multi-dimensional OLAP Servers

    Predefined hierarchy allows logical pre-

    aggregation and, conversely, allows for a

    logical drill-down.

    Supports common analytical operations

    Consolidation

    Drill-down

    Slicing and dicing.

  • 7/29/2019 DataWarehouse-OLAP-1

    29/91

    29

    Multi-dimensional OLAP Servers

    Consolidation - aggregation of data such assimple roll-ups or complex expressionsinvolving inter-related data.

    Drill-Down - is reverse of consolidation andinvolves displaying the detailed data thatcomprises the consolidated data.

    Slicing and Dicing - (also called pivoting) refersto the ability to look at the data from differentviewpoints.

  • 7/29/2019 DataWarehouse-OLAP-1

    30/91

    30

    Multi-dimensional OLAP servers

    Can store data in a compressed form by

    dynamically selecting physical storage

    organizations and compression techniques that

    maximize space utilization.

    Dense data (ie., data that exists for high

    percentage of cells) can be stored separately

    from sparse data (ie., significant percentage of

    cells are empty).

  • 7/29/2019 DataWarehouse-OLAP-1

    31/91

    31

    Multi-dimensional OLAP Servers

    Ability to omit empty or repetitive cells can

    greatly reduce the size of the cube and the

    amount of processing.

    Allows analysis of exceptionally large amounts

    of data.

  • 7/29/2019 DataWarehouse-OLAP-1

    32/91

    32

    Multi-dimensional OLAP Servers

    In summary, pre-aggregation, dimensional

    hierarchy, and sparse data management can

    significantly reduce the size of the cube and the

    need to calculate values on-the-fly.

    Removes need for multi-table joins and

    provides quick and direct access to arrays of

    data, thus significantly speeding up execution

    of multi-dimensional queries.

  • 7/29/2019 DataWarehouse-OLAP-1

    33/91

    33

    Codds Rules for OLAP Systems

    In 1993, E.F. Codd formulated twelve rules as

    the basis for selecting OLAP tools.

    Multi-dimensional conceptual view

    Transparency

    Accessibility

    Consistent reporting performance

    Client-server architectureGeneric dimensionality

  • 7/29/2019 DataWarehouse-OLAP-1

    34/91

    34

    Codds rules for OLAP

    Dynamic sparse matrix handling

    Multi-user support

    Unrestricted cross-dimensional operations

    Intuitive data manipulation

    Flexible reporting

    Unlimited dimensions and aggregation levels.

  • 7/29/2019 DataWarehouse-OLAP-1

    35/91

    35

    Codds Rules for OLAP Systems

    There are proposals to re-defined or extended

    the rules. For example, to also include:

    Comprehensive database management tools

    Ability to drill down to detail (source record) level Incremental database refresh

    SQL interface to the existing enterprise

    environment

  • 7/29/2019 DataWarehouse-OLAP-1

    36/91

    36

    Categories of OLAP Tools

    OLAP tools are categorized according to the

    architecture of the underlying database.

    Three main categories of OLAP tools include Multi-dimensional OLAP (MOLAP or MD-OLAP)

    Relational OLAP (ROLAP), also called multi-

    relational OLAP

    Managed query environment (MQE)

  • 7/29/2019 DataWarehouse-OLAP-1

    37/91

    37

    Multi-dimensional OLAP (MOLAP)

    Use specialized data structures and multi-

    dimensional Database Management Systems

    (MDDBMSs) to organize, navigate, and

    analyze data.

    Data is typically aggregated and stored

    according to predicted usage to enhance query

    performance.

  • 7/29/2019 DataWarehouse-OLAP-1

    38/91

    38

    Multi-dimensional OLAP (MOLAP)

    Use array technology and efficient storage

    techniques that minimize the disk space

    requirements through sparse data

    management.

    Provides excellent performance when data is

    used as designed, and the focus is on data for a

    specific decision-support application.

  • 7/29/2019 DataWarehouse-OLAP-1

    39/91

    39

    Multi-dimensional OLAP (MOLAP)

    Traditionally, require a tight coupling with the

    application layer and presentation layer.

    Recent trends segregate the OLAP from the

    data structures through the use of published

    application programming interfaces (APIs).

  • 7/29/2019 DataWarehouse-OLAP-1

    40/91

    40

    Typical Architecture for MOLAP Tools

  • 7/29/2019 DataWarehouse-OLAP-1

    41/91

    41

    MOLAP Tools - Development Issues

    Underlying data structures are limited in their

    ability to support multiple subject areas and to

    provide access to detailed data.

    Navigation and analysis of data is limited

    because the data is designed according to

    previously determined requirements.

  • 7/29/2019 DataWarehouse-OLAP-1

    42/91

    42

    MOLAP Tools - Development Issues

    MOLAP products require a different set of

    skills and tools to build and maintain the

    database, thus increasing the cost and

    complexity of support.

  • 7/29/2019 DataWarehouse-OLAP-1

    43/91

    43

    Relational OLAP (ROLAP)

    Fastest-growing style of OLAP technology.

    Supports RDBMS products using a metadata

    layer - avoids need to create a static multi-

    dimensional data structure - facilitates the

    creation of multiple multi-dimensional views of

    the two-dimensional relation.

  • 7/29/2019 DataWarehouse-OLAP-1

    44/91

    44

    Relational OLAP (ROLAP)

    To improve performance, some products use

    SQL engines to support complexity of multi-

    dimensional analysis, while others recommend,

    or require, the use of highly denormalizeddatabase designs such as the star schema.

  • 7/29/2019 DataWarehouse-OLAP-1

    45/91

    45

    Typical Architecture for ROLAP Tools

  • 7/29/2019 DataWarehouse-OLAP-1

    46/91

    46

    ROLAP Tools - Development Issues

    Middleware to facilitate the development of

    multi-dimensional applications. (Software that

    converts the two-dimensional relation into a

    multi-dimensional structure).

    Development of an option to create persistent,

    multi-dimensional structures with facilities to

    assist in the administration of these structures.

  • 7/29/2019 DataWarehouse-OLAP-1

    47/91

    47

    Managed Query Environment (MQE)

    Relatively new development.

    Provide limited analysis capability, either

    directly against RDBMS products, or by usingan intermediate MOLAP server.

  • 7/29/2019 DataWarehouse-OLAP-1

    48/91

    48

    Managed Query Environment (MQE)

    Deliver selected data directly from DBMS or

    via a MOLAP server to desktop (or local

    server) in form of a datacube, where it is

    stored, analyzed, and maintained locally.

    Promoted as being relatively simple to install

    and administer with reduced cost and

    maintenance.

  • 7/29/2019 DataWarehouse-OLAP-1

    49/91

    49

    Typical Architecture for MQE Tools

  • 7/29/2019 DataWarehouse-OLAP-1

    50/91

    50

    MQE Tools - Development Issues

    Architecture results in significant dataredundancy and may cause problems fornetworks that support many users.

    Ability of each user to build a custom datacubemay cause a lack of data consistency amongusers.

    Only a limited amount of data can beefficiently maintained.

  • 7/29/2019 DataWarehouse-OLAP-1

    51/91

    51

    OLAP Extensions to SQL

    SQL promoted as easy-to-learn, nonprocedural,free-format, DBMS-independent, andinternational standard.

    However, major disadvantage has been inabilityto represent many of the questions mostcommonly asked by business analysts.

    IBM and Oracle jointly proposed OLAPextensions to SQL early in 1999, adopted as anamendment to SQL.

  • 7/29/2019 DataWarehouse-OLAP-1

    52/91

    52

    OLAP Extensions to SQL

    Many database vendors including IBM,

    Oracle, Informix, and Red Brick Systems have

    already implemented portions of specifications

    in their DBMSs.

    Red Brick Systems was first to implement

    many essential OLAP functions (as Red Brick

    Intelligent SQL (RISQL)), albeit in advance of

    the standard.

  • 7/29/2019 DataWarehouse-OLAP-1

    53/91

    53

    OLAP Extensions to SQL - RISQL

    Designed for business analysts.

    Set of extensions that augments SQL with a

    variety of powerful operations appropriate todata analysis and decision-support applications

    such as ranking, moving averages,

    comparisons, market share, this year versus

    last year.

  • 7/29/2019 DataWarehouse-OLAP-1

    54/91

    54

    Use of the RISQL CUME Function

    Show the quarterly sales for branch office B003,along with the monthly year-to-date figures.

    SELECT quarter, quarterlySales, CUME(quarterlySales)

    AS Year-to-DateFROM BranchSales

    WHERE branchNo = 'B003';

  • 7/29/2019 DataWarehouse-OLAP-1

    55/91

    55

    Use of the RISQL MOVINGAVG / MOVINGSUM

    Function

    Show the first six monthly sales for branch office

    B003 without the effect of seasonality.

    SELECT month, monthlySales,

    MOVINGAVG(monthlySales) AS 3-MonthMovingAvg,MOVINGSUM(monthlySales) AS 3-MonthMovingSum

    FROM BranchSales

    WHERE branchNo = 'B003';

  • 7/29/2019 DataWarehouse-OLAP-1

    56/91

    56

    Data Mining

    The process of extracting valid, previously

    unknown, comprehensible, and actionable

    information from large databases and using it

    to make crucial business decisions,(Simoudis,1996).

    Involves analysis of data and use of software

    techniques for finding hidden and unexpected

    patterns and relationships in sets of data.

  • 7/29/2019 DataWarehouse-OLAP-1

    57/91

    57

    Data Mining

    Reveals information that is hidden andunexpected, as little value in finding patternsand relationships that are already intuitive.

    Patterns and relationships are identified byexamining the underlying rules and features inthe data.

    Tends to work from the data up and most

    accurate results normally require largevolumes of data to deliver reliable conclusions.

  • 7/29/2019 DataWarehouse-OLAP-1

    58/91

    58

    Data Mining

    Starts by developing an optimal representationof structure of sample data, during which timeknowledge is acquired and extended to largersets of data.

    Data mining can provide huge paybacks forcompanies who have made a significantinvestment in data warehousing.

    Relatively new technology, however alreadyused in a number of industries.

  • 7/29/2019 DataWarehouse-OLAP-1

    59/91

    59

    Examples of Applications of Data Mining

    Retail / Marketing

    Identifying buying patterns of customers

    Finding associations among customer

    demographic characteristics

    Predicting response to mailing campaigns

    Market basket analysis

  • 7/29/2019 DataWarehouse-OLAP-1

    60/91

    60

    Examples of Applications of Data Mining

    Banking

    Detecting patterns of fraudulent credit card use

    Identifying loyal customers

    Predicting customers likely to change their creditcard affiliation

    Determining credit card spending by customer

    groups

  • 7/29/2019 DataWarehouse-OLAP-1

    61/91

    61

    Examples of Applications of Data Mining

    Insurance

    Claims analysis

    Predicting which customers will buy new policies.

    Medicine

    Characterizing patient behavior to predict surgery

    visits

    Identifying successful medical therapies fordifferent illnesses.

  • 7/29/2019 DataWarehouse-OLAP-1

    62/91

    62

    Data Mining Operations

    Four main operations include: Predictive modeling

    Database segmentation

    Link analysis

    Deviation detection.

    There are recognized associations between theapplications and the corresponding operations.

    e.g. Direct marketing strategies use databasesegmentation.

  • 7/29/2019 DataWarehouse-OLAP-1

    63/91

    63

    Data Mining Techniques

    Techniques are specific implementations of the

    data mining operations.

    Each operation has its own strengths and

    weaknesses.

    Data mining tools sometimes offer a choice of

    operations to implement a technique.

  • 7/29/2019 DataWarehouse-OLAP-1

    64/91

    64

    Data Mining Techniques

    Criteria for selection of tool includes

    Suitability for certain input data types

    Transparency of the mining output

    Tolerance of missing variable values Level of accuracy possible

    Ability to handle large volumes of data.

    Data Mining Operations and Associated

  • 7/29/2019 DataWarehouse-OLAP-1

    65/91

    65

    Data Mining Operations and Associated

    Techniques

  • 7/29/2019 DataWarehouse-OLAP-1

    66/91

    66

    Predictive Modeling

    Similar to the human learning experience

    uses observations to form a model of the important

    characteristics of some phenomenon.

    Uses generalizations of real world and abilityto fit new data into a general framework.

    Can analyze a database to determine essential

    characteristics (model) about the data set.

  • 7/29/2019 DataWarehouse-OLAP-1

    67/91

    67

    Predictive Modeling

    Model is developed using a supervised learning

    approach, which has two phases: training and

    testing.

    Training builds a model using a large sample ofhistorical data called a training set.

    Testing involves trying out the model on new,

    previously unseen data to determine its accuracy

    and physical performance characteristics.

  • 7/29/2019 DataWarehouse-OLAP-1

    68/91

    68

    Predictive Modeling

    Applications of predictive modeling include

    customer retention management, credit

    approval, cross selling, and direct marketing.

    Two techniques associated with predictive

    modeling: classification and value prediction,

    distinguished by nature of the variable being

    predicted.

  • 7/29/2019 DataWarehouse-OLAP-1

    69/91

    69

    Predictive Modeling - Classification

    Used to establish a specific predetermined class

    for each record in a database from a finite set

    of possible, class values.

    Two specializations of classification: tree

    induction and neural induction.

  • 7/29/2019 DataWarehouse-OLAP-1

    70/91

    70

    Example of Classification using Tree Induction

    Example of Classification using Neural

  • 7/29/2019 DataWarehouse-OLAP-1

    71/91

    71

    Example of Classification using Neural

    Induction

  • 7/29/2019 DataWarehouse-OLAP-1

    72/91

    72

    Predictive Modeling - Value Prediction

    Used to estimate a continuous numeric value

    that is associated with a database record.

    Uses the traditional statistical techniques oflinear regression and nonlinear regression.

    Relatively easy-to-use and understand.

  • 7/29/2019 DataWarehouse-OLAP-1

    73/91

    73

    Predictive Modeling - Value Prediction

    Linear regression attempts to fit a straight line

    through a plot of the data, such that the line is

    the best representation of the average of all

    observations at that point in the plot.

    Problem is that the technique only works well

    with linear data and is sensitive to the presence

    of outliers (ie., data values, which do notconform to the expected norm).

  • 7/29/2019 DataWarehouse-OLAP-1

    74/91

    74

    Predictive Modeling - Value Prediction

    Although nonlinear regression avoids the main

    problems of linear regression, still not flexible

    enough to handle all possible shapes of the data

    plot.

    Statistical measurements are fine for building

    linear models that describe predictable data

    points, however, most data is not linear in

    nature.

  • 7/29/2019 DataWarehouse-OLAP-1

    75/91

    75

    Predictive Modeling - Value Prediction

    Data mining requires statistical methods that

    can accommodate non-linearity, outliers, and

    non-numeric data.

    Applications of value prediction include credit

    card fraud detection or target mailing list

    identification.

  • 7/29/2019 DataWarehouse-OLAP-1

    76/91

    76

    Database Segmentation

    Aim is to partition a database into an unknown

    number of segments, or clusters, of similar

    records.

    Uses unsupervised learning to discover

    homogeneous sub-populations in a database to

    improve the accuracy of the profiles.

  • 7/29/2019 DataWarehouse-OLAP-1

    77/91

    77

    Database Segmentation

    Less precise than other operations thus lesssensitive to redundant and irrelevant features.

    Sensitivity can be reduced by ignoring a subset

    of the attributes that describe each instance orby assigning a weighting factor to eachvariable.

    Applications of database segmentation includecustomer profiling, direct marketing, and crossselling.

    Example of Database Segmentation using a

  • 7/29/2019 DataWarehouse-OLAP-1

    78/91

    78

    Example of Database Segmentation using a

    Scatterplot

  • 7/29/2019 DataWarehouse-OLAP-1

    79/91

    79

    Database Segmentation

    Associated with demographic or neural

    clustering techniques, distinguished by:

    Allowable data inputs

    Methods used to calculate the distancebetween records

    Presentation of the resulting segments for

    analysis.

  • 7/29/2019 DataWarehouse-OLAP-1

    80/91

    80

    Link Analysis

    Aims to establish links (associations) betweenrecords, or sets of records, in a database.

    There are three specializations

    Associations discoverySequential pattern discovery

    Similar time sequence discovery

    Applications include product affinity analysis,direct marketing, and stock price movement.

  • 7/29/2019 DataWarehouse-OLAP-1

    81/91

    81

    Link Analysis - Associations Discovery

    Finds items that imply the presence of other

    items in the same event.

    Affinities between items are represented by

    association rules.

    e.g. When customer rents property for more than 2

    years and is more than 25 years old, in 40% of cases,

    customer will buy a property. Association happens in

    35% of all customers who rent properties.

  • 7/29/2019 DataWarehouse-OLAP-1

    82/91

    82

    Link Analysis - Sequential Pattern Discovery

    Finds patterns between events such that the

    presence of one set of items is followed by

    another set of items in a database of events

    over a period of time. e.g. Used to understand long term customer buying

    behavior.

  • 7/29/2019 DataWarehouse-OLAP-1

    83/91

    83

    Link Analysis - Similar Time Sequence Discovery

    Finds links between two sets of data that are

    time-dependent, and is based on the degree of

    similarity between the patterns that both time

    series demonstrate.

    e.g. Within three months of buying property, new

    home owners will purchase goods such as cookers,

    freezers, and washing machines.

  • 7/29/2019 DataWarehouse-OLAP-1

    84/91

    84

    Deviation Detection

    Relatively new operation in terms of

    commercially available data mining tools.

    Often a source of true discovery because itidentifies outliers, which express deviation

    from some previously known expectation and

    norm.

  • 7/29/2019 DataWarehouse-OLAP-1

    85/91

    85

    Deviation Detection

    Can be performed using statistics and

    visualization techniques or as a by-product of

    data mining.

    Applications include fraud detection in the use

    of credit cards and insurance claims, quality

    control, and defects tracing.

    Example of Database Segmentation using a

  • 7/29/2019 DataWarehouse-OLAP-1

    86/91

    86

    Example of Database Segmentation using a

    Visualization

  • 7/29/2019 DataWarehouse-OLAP-1

    87/91

    87

    Data Mining Tools

    There are a growing number of commercialdata mining tools on the marketplace.

    Important characteristics of data mining toolsinclude:

    Data preparation facilities

    Selection of data mining operations

    Product scalability and performance Facilities for visualization of results.

  • 7/29/2019 DataWarehouse-OLAP-1

    88/91

    88

    Data Mining and Data Warehousing

    Major challenge to exploit data mining is

    identifying suitable data to mine.

    Data mining requires single, separate, clean,integrated, and self-consistent source of data.

  • 7/29/2019 DataWarehouse-OLAP-1

    89/91

    89

    Data Mining and Data Warehousing

    A data warehouse is well equipped for

    providing data for mining.

    Data quality and consistency is a pre-requisitefor mining to ensure the accuracy of the

    predictive models. Data warehouses are

    populated with clean, consistent data.

  • 7/29/2019 DataWarehouse-OLAP-1

    90/91

    90

    Data Mining and Data Warehousing

    Advantageous to mine data from multiplesources to discover as many interrelationships

    as possible. Data warehouses contain data from

    a number of sources.

    Selecting relevant subsets of records and fields

    for data mining requires query capabilities of

    the data warehouse.

  • 7/29/2019 DataWarehouse-OLAP-1

    91/91

    Data Mining and Data Warehousing

    Results of a data mining study are useful if

    there is some way to further investigate the

    uncovered patterns. Data warehouses provide

    capability to go back to the data source.