Data Mining Basics Instructor: Paul Chen Chapter 9 Data Warehouse Fundamentals.


Data Mining Basics

Instructor: Paul Chen

Chapter 9

Data Warehouse Fundamentals

Topics

1. How Data Mining Evolved?
2. Decision Processing Overview and Tasks
3. Data Mining, What’s it?
4. Data Mining vs. Data Warehousing
5. How Data Mining Works? And Its Applications
6. Data Mining Operations and Associated Techniques
7. The Data Mining Process
8. Data Mining Tools
9. Data Mining Techniques: A Summary

Topic 1: How Data Mining Evolved?

Many businesses have invested heavily in information technology to help them manage their businesses more effectively and gain a competitive edge. Increasingly large amounts of critical business data are being stored electronically, and this volume is expected to continue to grow. Data mining technology helps companies leverage their existing data more effectively and obtain insightful information that gives them a competitive edge.

How Data Mining Evolved?

Time Line

1960s: Data Collection
1970s-80s: RDBMS
1990s: OLAP and DW
Late 1990s to Now: Data Mining

Topic 2: Decision Processing Overview

Decision processing systems, and their underlying analytical applications, provide business users with the information they need to track and analyze business trends, and to explore new business opportunities. As businesses become increasingly competitive and complex, effective decision processing systems are essential for success.

The Next Generation of Business Intelligence

A decision processing system analyzes business information captured from operational systems (back-office, front-office, and e-business applications).

Distribution of business information to business users is via corporate intranets and extranets.

The flow of data can be thought of as an information supply chain whose objective is to convert operational data into useful business information.

The Decision Processing Information Supply Chain

[Figure. Components shown: Operational Systems (E-Business Applications, Back-Office Transaction Applications, Front-Office Applications); External Data; Information Staging Area; DW; Analytic Applications; Business Intelligence Tools; Collaborative & Office Systems; Business Metrics; Business Decisions.]

Decision Processing: Four Tasks***

Extracting and transforming information

This involves capturing data from operational systems, transforming it into business information, and loading it into a data warehouse information store.

Current extract templates on the market are primarily aimed at capturing data from ERP (Enterprise Resource Planning) transaction processing systems (for example, SAP Business Information Warehouse and PeopleSoft BPM data warehouse).

*** Mentioned in chapter 2
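
The capture, transform, and load steps of this first task can be sketched in a few lines. This is only an illustration using Python's standard sqlite3 module: the sample rows, the `sales_fact` table, and the cleaning rules are all hypothetical, and real ERP extract templates are far more involved.

```python
import sqlite3

# Hypothetical operational records: (order_id, customer, amount_cents, status)
operational_rows = [
    (1, " alice ", 12050, "SHIPPED"),
    (2, "BOB", 8000, "CANCELLED"),
    (3, "Carol", 4999, "SHIPPED"),
]

def extract(rows):
    """Capture raw records from the (simulated) operational system."""
    return list(rows)

def transform(rows):
    """Clean names, convert cents to dollars, and drop cancelled orders."""
    cleaned = []
    for order_id, customer, amount_cents, status in rows:
        if status != "SHIPPED":
            continue
        cleaned.append((order_id, customer.strip().title(), amount_cents / 100))
    return cleaned

def load(rows, connection):
    """Load the transformed rows into a warehouse-style table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount_dollars REAL)"
    )
    connection.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    connection.commit()

warehouse = sqlite3.connect(":memory:")  # stands in for the information store
load(transform(extract(operational_rows)), warehouse)
loaded = warehouse.execute(
    "SELECT customer, amount_dollars FROM sales_fact ORDER BY order_id"
).fetchall()
```

Keeping the three steps as separate functions mirrors the way the task is described: each stage can be replaced without touching the others.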

Decision Processing: Four Tasks (Cont’d)

Managing information

This task encompasses the maintenance of business information in information stores, and how these information stores are processed by business intelligence tools and analytic applications.

The cornerstone of decision processing is data warehousing, and warehouse information stores should be organized and modeled into relational and multidimensional database products.

Decision Processing: Four Tasks (Cont’d)

Analyzing and modeling information

The traditional approach to decision processing is to build a data warehouse and supply business users with a set of business intelligence tools (query, reporting, OLAP and data mining, for example) to process information in data warehouse information stores.

A better approach is to employ turn-key, web-based analytic application packages that are designed to provide comprehensive analyses for the business area being researched. Key business metrics (e.g., revenue dollars per sales rep per day) are useful.
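
As a rough illustration of such a metric, the sketch below computes revenue dollars per sales rep per day. The sales records and the function name are hypothetical, and it assumes the metric is averaged over the days on which each rep recorded sales.

```python
from collections import defaultdict

# Hypothetical sales records: (sales_rep, day, revenue_dollars)
sales = [
    ("Ann", "2024-01-02", 1200.0),
    ("Ann", "2024-01-02", 300.0),
    ("Ann", "2024-01-03", 500.0),
    ("Raj", "2024-01-02", 900.0),
]

def revenue_per_rep_per_day(records):
    """Average daily revenue for each rep over the days that rep made sales."""
    totals = defaultdict(float)
    days = defaultdict(set)
    for rep, day, revenue in records:
        totals[rep] += revenue
        days[rep].add(day)
    return {rep: totals[rep] / len(days[rep]) for rep in totals}

metrics = revenue_per_rep_per_day(sales)
```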

Decision Processing: Four Tasks (Cont’d)

Distributing information

Business intelligence tools and analytic applications distribute information and the results of analysis operations to business users via standard graphical and Web interfaces.

To help users uncover and organize this range of business information, an enterprise information portal (EIP) is required. An EIP provides a single point of entry to any piece of business information, no matter where it resides.

The main components of an EIP are an information assistant (Web browser interface), an information directory, and a subscription facility.

Decision Making Under Risk

Decisions are made under three sets of conditions:

Certainty: The decision makers know everything in advance of making the decision.
Uncertainty: The decision makers know nothing about the probabilities or the consequences of decisions.
Risk

Decision-Making Style

Decision-making styles of users are categorized as either Analytic or Heuristic

Analytic and Heuristic Decision Making

Analytical Decision Maker
Learns by analyzing
Uses step-by-step procedure
Values quantitative information and models
Builds mathematical models and algorithms
Seeks optimal solution

Heuristic Decision Maker
Learns by acting
Uses trial and error
Values experiences
Relies on common sense
Seeks completely satisfying solution

Topic 3: Data Mining, What’s it?

Data Mining has been defined as “a decision support process in which a search is made for patterns of information in data”. To detect patterns in data, Data Mining uses sophisticated statistical analysis and modeling technologies to uncover useful relationships hidden in databases. It predicts future trends and behaviors, allowing businesses to make predictive, knowledge-driven decisions.

Data Mining, What’s it?

The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions (Simoudis, 1996).

Involves analysis of data and use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.

Data Mining, What’s it?

Reveals information that is hidden and unexpected, as there is little value in finding patterns and relationships that are already intuitive.

Patterns and relationships are identified by examining the underlying rules and features in the data.

Tends to work from the data up, and the most accurate results normally require large volumes of data to deliver reliable conclusions.

Data Mining, What’s it?

Starts by developing an optimal representation of the structure of sample data, during which time knowledge is acquired and extended to larger sets of data.

Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing.

A relatively new technology, but already used in a number of industries.

Topic 4: Data Mining vs. Data Warehousing

Data Mining does not require that a Data Warehouse be built. Often, data can be downloaded from the operational files to flat files that contain the data ready for the data mining analysis.

Data Mining can be implemented rapidly on existing software and hardware platforms. Data Mining tools can analyze massive databases to deliver answers to questions such as, “Which customers are most likely to respond to my next promotional mailing, and why?”

Data Mining vs. Data Warehousing

A major challenge in exploiting data mining is identifying suitable data to mine.

Data mining requires a single, separate, clean, integrated, and self-consistent source of data.

A data warehouse is well equipped for providing data for mining.

Data quality and consistency are prerequisites for mining to ensure the accuracy of the predictive models. Data warehouses are populated with clean, consistent data.

Data Mining vs. Data Warehousing

Advantageous to mine data from multiple sources to discover as many interrelationships as possible. Data warehouses contain data from a number of sources.

Selecting relevant subsets of records and fields for data mining requires query capabilities of the data warehouse.

Results of a data mining study are useful if there is some way to further investigate the uncovered patterns. Data warehouses provide capability to go back to the data source.

Topic 5: How Data Mining Works?

How exactly is Data Mining able to tell you important things that you didn’t know or what is going to happen next? The technique used in Data Mining is called Predictive Modeling, which in a broad sense is a knowledge discovery process based on relationships and patterns.

Modeling is the act of building a model in one situation where you know the answer and then applying it to another situation where you don’t.

Examples of Applications of Data Mining via relationships and patterns

Retail / Marketing
Identifying buying patterns of customers
Finding associations among customer demographic characteristics
Predicting response to mailing campaigns
Market basket analysis

Examples of Applications of Data Mining via relationships and patterns

Banking
Detecting patterns of fraudulent credit card use
Identifying loyal customers
Predicting customers likely to change their credit card affiliation
Determining credit card spending by customer groups

Examples of Applications of Data Mining via relationships and patterns

Insurance
Claims analysis
Predicting which customers will buy new policies

Medicine
Characterizing patient behaviour to predict surgery visits
Identifying successful medical therapies for different illnesses

Examples of Applications of Data Mining via relationships and patterns

Customer profiling: characteristics of good customers are identified with the goals of predicting who will become one and helping marketers target new prospects.

Targeting specific marketing promotions to existing and potential customers offers similar benefits.

Market-basket analysis: With Data Mining, companies can determine which products to stock in which stores, and even how to place them within a store.

Examples of Applications of Data Mining via relationships and patterns

Customer Relationship Management: By determining the characteristics of customers who are likely to leave for a competitor, a company can take action to retain those customers, because doing so is usually far less expensive than acquiring a new customer.

Fraud detection: With Data Mining, companies can identify potentially fraudulent transactions before they happen.

Topic 6: Data Mining Operations and Associated Techniques

As noted in previous foils, predictive modeling in essence includes the other operations shown in the above table.

Descriptive: The dealer sold 200 cars last month.

Explanatory: For every 1% increase in the interest rate, auto sales decrease by 5%.

Predictive: Predictions about future buyer behavior.

[Figure labels: Operational (OLTP); Traditional DW; OLAP; Data Mining]

Level of Modeling vs. Level of Analytical Processing

Descriptive: Simple queries & reports. Normalized tables.

Explanatory: “What if” processing. Analyze what has previously occurred to bring about the current state of the data. Denormalized tables. Roll-up; drill down.

Predictive: Determine if any patterns exist by reviewing data relationships. Statistical analysis/artificial intelligence. Classification & value prediction.

Predictive Modelling

Similar to the human learning experience in that it uses observations to form a model of the important characteristics of some phenomenon.

Uses generalizations of the ‘real world’ and the ability to fit new data into a general framework.

Can analyze a database to determine essential characteristics (model) about the data set.

Predictive Modelling

Model is developed using a supervised learning approach, which has two phases: training and testing.

Training builds a model using a large sample of historical data called a training set.

Testing involves trying out the model on new, previously unseen data to determine its accuracy and physical performance characteristics.
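
The two phases can be sketched with a deliberately tiny model. This is an assumption-laden illustration, not a method from the slides: the records are hypothetical, and the "model" is just a single threshold on years of renting, chosen because it makes the training and testing steps easy to follow.

```python
# Hypothetical historical records: (years_renting, bought_property)
training_set = [
    (0.5, False), (1.0, False), (1.5, False),
    (2.5, True), (3.0, True), (4.0, True),
]
# Previously unseen records used only for testing
test_set = [(0.8, False), (2.2, True), (3.5, True), (2.8, True)]

def train(records):
    """Training phase: pick the threshold that best separates the two classes."""
    best_threshold, best_correct = None, -1
    for threshold, _ in records:
        correct = sum((years >= threshold) == bought for years, bought in records)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def predict(threshold, years):
    """Apply the trained model to one record."""
    return years >= threshold

def accuracy(threshold, records):
    """Testing phase: share of unseen records the model classifies correctly."""
    hits = sum(predict(threshold, years) == bought for years, bought in records)
    return hits / len(records)

model = train(training_set)
test_accuracy = accuracy(model, test_set)
```

The model fits the training set perfectly but misclassifies one of the four unseen records, which is why accuracy is measured on data the model has not seen.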

Predictive Modelling

Applications of predictive modelling include customer retention management, credit approval, cross selling, and direct marketing.

Two techniques are associated with predictive modelling, distinguished by the nature of the variable being predicted:
A. classification
B. value prediction

Statistical Analysis of Actual Sales (dollars and quantities) relative to these Signage Variables: a predictive modeling example

Variables: Content, Frequency, Depth, Focus, Depth, Scale, Length, Location

Statistical Analysis: Correlation, Regression, Experiment Design, Optimization. Now it goes into real time analysis.

[Figure: Signage examples (two image slides)]


PREDICTIVE MODELING

There are two techniques associated with predictive modeling: classification and value prediction, which are distinguished by the nature of the variable being predicted.

Predictive Modelling - Classification

Used to establish a specific predetermined class for each record in a database from a finite set of possible class values.

Two specializations of classification: tree induction and neural induction.

Example of Classification using Tree Induction

Customer renting property > 2 years?
    No: Rent property
    Yes: Customer age > 45?
        No: Rent property
        Yes: Buy property
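
A tree like this one translates directly into nested conditions. The sketch below assumes the branch layout read from the flattened figure (renting for more than 2 years, then age over 45, leads to "Buy property"; every other path leads to "Rent property"). The function and parameter names are hypothetical.

```python
def classify_customer(years_renting, age):
    """Walk the tree from the root and return the predicted class."""
    if years_renting > 2:          # root node: renting property > 2 years?
        if age > 45:               # second node: customer age > 45?
            return "Buy property"
        return "Rent property"
    return "Rent property"

predictions = [
    classify_customer(3, 50),
    classify_customer(3, 30),
    classify_customer(1, 60),
]
```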

Example of Classification using Neural Induction

Each processing unit (circle) in one layer is connected to each processing unit in the next layer by a weighted value, expressing the strength of the relationship. The network attempts to mirror the way the human brain works in recognizing patterns by arithmetically combining all the variables with a given data point.

In this way, it is possible to develop nonlinear predictive models that ‘learn’ by studying combinations of variables and how different combinations of variables affect different data sets.
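
To make the weighted connections concrete, here is a minimal sketch of one forward pass through a small network. The layer sizes, weights, and biases are hypothetical and untrained, and the sigmoid function is an assumed choice of nonlinearity; it only shows how each unit combines its weighted inputs.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(inputs, weights, biases):
    """Each unit sums its weighted inputs, adds a bias, and applies the nonlinearity."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output unit
hidden_weights = [[0.5, -0.4], [0.3, 0.8]]
hidden_biases = [0.0, -0.1]
output_weights = [[1.2, -0.7]]
output_biases = [0.05]

def forward(inputs):
    """Pass one record through the hidden layer and then the output unit."""
    hidden = layer_output(inputs, hidden_weights, hidden_biases)
    return layer_output(hidden, output_weights, output_biases)[0]

score = forward([1.0, 2.0])
```

Training would adjust the weights so that the output score matches known classes; that step is omitted here.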

Predictive Modelling - Value Prediction

Used to estimate a continuous numeric value that is associated with a database record.

Uses the traditional statistical techniques of linear regression and non-linear regression.

Relatively easy to use and understand.

Predictive Modelling - Value Prediction

Linear regression attempts to fit a straight line through a plot of the data, such that the line is the best representation of the average of all observations at that point in the plot.

The problem is that the technique only works well with linear data and is sensitive to the presence of outliers (i.e., data values that do not conform to the expected norm).
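
The sensitivity to outliers is easy to see with a small least-squares fit. The data points below are hypothetical; the sketch fits a line to perfectly linear data and then to the same data with one outlier added.

```python
def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

linear_data = [(1, 2), (2, 4), (3, 6), (4, 8)]   # exactly y = 2x
with_outlier = linear_data + [(5, 40)]           # one value far from the trend

clean_slope, clean_intercept = fit_line(linear_data)
outlier_slope, outlier_intercept = fit_line(with_outlier)
```

A single outlying point moves the fitted slope from 2 to 8, so the line no longer represents the other four observations well.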

Predictive Modelling - Value Prediction

Although non-linear regression avoids the main problems of linear regression, it is still not flexible enough to handle all possible shapes of the data plot.

Statistical measurements are fine for building linear models that describe predictable data points; however, most data is not linear in nature.

Predictive Modelling - Value Prediction

Data mining requires statistical methods that can accommodate non-linearity, outliers, and non-numeric data.

Applications of value prediction include credit card fraud detection or target mailing list identification.

Database Segmentation

Aim is to partition a database into an unknown number of segments, or clusters, of similar records.

Uses unsupervised learning to discover homogeneous sub-populations in a database to improve the accuracy of the profiles.

Database Segmentation

Less precise than other operations and thus less sensitive to redundant and irrelevant features.

Sensitivity can be reduced by ignoring a subset of the attributes that describe each instance or by assigning a weighting factor to each variable.

Applications of database segmentation include customer profiling, direct marketing, and cross selling.
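
As a rough sketch of segmentation with per-variable weighting factors, the example below uses a simple "leader" clustering scheme, which is an assumed technique chosen because it does not need the number of segments in advance. The customer records, weights, and radius are hypothetical.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance with a weighting factor per variable."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def segment(records, weights, radius):
    """Assign each record to the first segment whose leader lies within radius;
    otherwise start a new segment. The segment count is not fixed in advance."""
    leaders, segments = [], []
    for record in records:
        for index, leader in enumerate(leaders):
            if weighted_distance(record, leader, weights) <= radius:
                segments[index].append(record)
                break
        else:
            leaders.append(record)
            segments.append([record])
    return segments

# Hypothetical customers: (age, income in thousands)
customers = [(25, 30), (27, 32), (52, 90), (55, 95), (40, 60)]
segments = segment(customers, weights=(1.0, 1.0), radius=10.0)
```

Setting a weight to 0.0 ignores that attribute entirely, which corresponds to dropping a subset of the attributes.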

[Figure: Example of Database Segmentation using a Scatter plot]

Database Segmentation

Associated with demographic or neural clustering techniques, distinguished by:
Allowable data inputs
Methods used to calculate the distance between records
Presentation of the resulting segments for analysis

[Figure: Example of Database Segmentation using a Visualization]

Link Analysis

Aims to establish links (associations) between records, or sets of records, in a database.

There are three specializations:
Associations discovery
Sequential pattern discovery
Similar time sequence discovery

Applications include product affinity analysis, direct marketing, and stock price movement.

Link Analysis - Associations Discovery

Finds items that imply the presence of other items in the same event.

Affinities between items are represented by association rules.

e.g. ‘When customer rents property for more than 2 years and is more than 25 years old, in 40% of cases, customer will buy a property. Association happens in 35% of all customers who rent properties’.
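
The two percentages attached to an association rule are commonly called confidence and support. The sketch below computes both for the rental rule on a handful of hypothetical records; it assumes the common definition in which support is the share of all records where both sides of the rule hold.

```python
def rule_metrics(records, antecedent, consequent):
    """Support: share of all records where both sides hold.
    Confidence: share of antecedent records where the consequent also holds."""
    matching = [r for r in records if antecedent(r)]
    both = [r for r in matching if consequent(r)]
    support = len(both) / len(records)
    confidence = len(both) / len(matching) if matching else 0.0
    return support, confidence

# Hypothetical renter records
renters = [
    {"years_renting": 3, "age": 30, "bought": True},
    {"years_renting": 4, "age": 41, "bought": False},
    {"years_renting": 1, "age": 28, "bought": False},
    {"years_renting": 5, "age": 35, "bought": True},
    {"years_renting": 3, "age": 22, "bought": False},
]

support, confidence = rule_metrics(
    renters,
    antecedent=lambda r: r["years_renting"] > 2 and r["age"] > 25,
    consequent=lambda r: r["bought"],
)
```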

Link Analysis - Sequential Pattern Discovery

Finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time.

e.g. Used to understand long term customer buying behaviour.

Link Analysis - Similar Time Sequence Discovery

Finds links between two sets of data that are time-dependent, and is based on the degree of similarity between the patterns that both time series demonstrate.

e.g. Within three months of buying property, new home owners will purchase goods such as cookers, freezers, and washing machines.

Deviation Detection

Relatively new operation in terms of commercially available data mining tools.

Often a source of true discovery because it identifies outliers, which express deviation from some previously known expectation and norm.

Deviation Detection

Can be performed using statistics and visualization techniques or as a by-product of data mining.

Applications include fraud detection in the use of credit cards and insurance claims, quality control, and defects tracing.
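
One simple statistical way to flag deviations is to measure how many standard deviations each value lies from the mean. The sketch below assumes that approach, with a hypothetical threshold of 2.0 and made-up transaction amounts; commercial tools use more robust methods.

```python
import statistics

def find_deviations(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical card transaction amounts in dollars
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
suspicious = find_deviations(amounts)
```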

A Summary: Data-Driven Techniques*

Data Visualization

Decision Trees

Clustering

Factor Analysis

Neural Network

Association Rules

Rule Induction

* Based on Sakhr Youness’s book “Professional Data Warehousing with SQL Server 7.0 and OLAP Services”

Data Visualization

A pie chart showing the sales of a product by region is sometimes much more effective than presenting the same data in a text or tabular form.

[Figure: pie chart of sales by region. Regions shown: Northeast, East, West, South, North. Slice values shown: 39%, 9%, 11%, 20%, 21%.]

[Figure: Decision Tree]

Cluster Analysis

[Figure. Attributes shown: Have Children; Married; Last car is a used one; Own car.]

First segment (high income > 8,000)
Second segment (8,000 > middle income > 3,000)
Third segment (low income < 3,000)

Factor Analysis

Unlike cluster analysis, factor analysis builds a model from data.

The technique finds underlying factors, also called “latent variables”, and provides models for these factors based on variables in the data. For example, a software company is considering a survey to find out the nine most perceived attributes of one of their products. They might group these attributes into categories such as service for technical support, availability for training, and a help system.

Factor analysis is used for grouping together products based on a similarity of buying patterns so that vendors may bundle several products as one and sell them together at a lower price than the sum of their individual prices.

[Figure: Neural Networks]

Association Rules

Association models are models that examine the extent to which values of one field depend on, or are produced by, values of another field. These models are often referred to as Market Basket Analysis when they are applied to retail industries to study the buying patterns of these customers, especially in grocery and retail stores that issue their own credit cards. Charging against these cards gives the store the chance to associate the purchases of customers with their identities, which allows them to study associations among other things.

Rules Induction

This is a powerful technique that involves a large number of rules using a set of “if...then” statements in the pursuit of all possible patterns in the dataset. For example, if the customer is a male, then if he is between 30 and 40 years of age, and his income is less than $50,000 and more than $20,000, he is likely to be driving a car that was bought as new.

A Summary: Theory-Driven Techniques

Correlations

T-Tests

Analysis of Variance

Linear Regression

Logistic Regression

Discriminant Analysis

Forecasting Methods

Topic 7: The Data Mining Process

1. Define the problem.
2. Select the data.
3. Prepare the data.
4. Mine the data.
5. Deploy the model.
6. Take business action.

Are you ready for Data Mining?

Define the problem

A successful data mining initiative always starts with a well-defined project. To ensure that the project produces incremental value, include an assessment of the status quo solution and a review of technology, organization, and business processes.

Select the data

This step involves defining your data source (not every data source and record is required). The data is usually extracted from the source system to a separate server.

Prepare the data

This step represents up to 80 percent of the total project effort. For data mining, the data must reside in one flat table (each record has many columns). In addition to being the most time consuming, the step is also the most critical. The resulting models are only as good as the data used to create them.
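
A small part of this preparation is flattening related tables into one row per record. The sketch below assumes two hypothetical source tables, customers and orders, and summarizes the orders into extra columns on each customer row.

```python
# Hypothetical source tables keyed by customer_id
customers = {1: {"name": "Ann", "age": 34}, 2: {"name": "Raj", "age": 51}}
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 45.0},
]

def build_flat_table(customers, orders):
    """Return one row per customer, with order data summarized into columns."""
    flat = {
        cid: {"customer_id": cid, **info, "order_count": 0, "total_spent": 0.0}
        for cid, info in customers.items()
    }
    for order in orders:
        row = flat[order["customer_id"]]
        row["order_count"] += 1
        row["total_spent"] += order["amount"]
    return list(flat.values())

flat_table = build_flat_table(customers, orders)
```

Each resulting row carries every column the mining step needs, so no joins are required later.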

Mine the data

Typically the easiest and shortest phase, this step involves applying statistical and AI tools to create mathematical models. Data mining typically occurs on a server separate from the data warehouse and other corporate systems.

Deploy the Model

Model deployment is the process of implementing the mathematical models into operational systems to improve business results.

Take Business Action

Use the deployed model to achieve improved results for the business problem identified at the beginning of the process.

Steps to Implement Data Mining

[Figure. Components shown: Discovery (patterns, relations, associations, etc.); Prior Knowledge; Information Model; Deployment; Validation.]

ARE YOU READY FOR DATA MINING?

Just because you have a data warehouse doesn’t mean you’re necessarily ready for data mining. Much of the work our company does in the data mining arena has more to do with data mining readiness assessment than with actually performing data mining.

Metrics you can use to gauge your data mining readiness

Do you have a staff of experienced knowledge workers?
Do you have the data?
Do you have marketing processes in place that can use this data?
Do you have a business champion who can embrace the process and results?
Do you have the technology infrastructure to support advanced analysis?

Topic 8: Data Mining Tools

Data mining tools are typically classified by the type of algorithm they use to identify hidden patterns. There are many different algorithms in use, but the four most popular are association, sequence, clustering (or segmentation), and predictive modeling.


Data Mining Tools

There is a growing number of commercial data mining tools in the marketplace.

Important characteristics of data mining tools include:

Data preparation facilities

Selection of data mining operations

Product scalability and performance

Facilities for visualization of results


Data Mining vs. OLAP

They are two separate breeds of analysis with entirely different objectives, not to mention tools, skill sets, and implementation methods.


Data Mining

With canned reports, ad hoc querying, and OLAP, the end user defines a hypothesis and determines which data to examine. With data mining, the tool identifies the hypothesis, and it actually tells the user where in the data to start the exploration process.


Data Mining

Rather than using SQL to filter out values and methodically reduce the data into a concise answer set, data mining uses algorithms that exhaustively review the relationships among data elements to determine if any patterns exist. The whole purpose of data mining is to yield new business information that a business person can act on.


OLAP vs. Data Mining Tools

OLAP Tools

Are ad hoc, shrink-wrapped tools that provide an interface to data

Are used when you have specific known questions

Look and feel like a spreadsheet that allows rotation, slicing, and graphics

Can be deployed to a large number of users

Data Mining Tools

Are methods for analyzing multiple data types

-- Regression trees -- Neural networks -- Genetic algorithms

Are used when you don’t know what the questions are

Are usually textual in nature

Are usually deployed to a small number of analysts


Data Mining Tools

ASSOCIATION

Association, also frequently referred to as "affinity analysis," reviews numerous sets of items and looks for common groupings. An example of association is market basket analysis, which involves reviewing the products that consumers purchase in a single trip to the grocery store.


ASSOCIATION

Finds items that imply the presence of other items in the same event.

Affinities between items are represented by association rules.

e.g. ‘When a customer rents property for more than 2 years and is more than 25 years old, in 40% of cases, the customer will buy a property. This association happens in 35% of all customers who rent properties’.
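The two percentages in the rule quoted above are usually called confidence (40%) and support (35%). The following is a minimal sketch of how such figures could be computed, reading support as the share of all renters for whom both the conditions and the purchase hold. The records, field names, and values are invented for illustration and do not come from the slides.

```python
# Hypothetical renter records; field names and values are invented.
renters = [
    {"years_renting": 3, "age": 30, "bought": True},
    {"years_renting": 4, "age": 41, "bought": False},
    {"years_renting": 1, "age": 28, "bought": False},
    {"years_renting": 5, "age": 35, "bought": True},
    {"years_renting": 2, "age": 22, "bought": False},
]


def antecedent(record):
    """Left-hand side of the rule: rents for more than 2 years and is older than 25."""
    return record["years_renting"] > 2 and record["age"] > 25


def rule_metrics(records):
    """Return (support, confidence) for the rule: antecedent -> bought."""
    matching = [r for r in records if antecedent(r)]
    both = [r for r in matching if r["bought"]]
    # Support: share of ALL records where antecedent and consequent both hold.
    support = len(both) / len(records)
    # Confidence: share of antecedent-matching records that also bought.
    confidence = len(both) / len(matching) if matching else 0.0
    return support, confidence


support, confidence = rule_metrics(renters)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A commercial association tool would search over many candidate rules and keep those whose support and confidence exceed chosen thresholds; this sketch only scores one rule that is already known.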


Data Mining Tools

SEQUENCE

Sequential analysis helps data miners identify a set of order-specific items or events. Association identifies the existence of patterns or groups of items; sequential analysis identifies the order of those patterns or groups of items.


SEQUENCE

Finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time.

e.g. Used to understand long term customer buying behavior.
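As a rough illustration of "one set of items is followed by another", the sketch below counts how many customers had one event followed by another within a time window. The event log, the item names, and the function `followed_by` are all hypothetical; real sequence-discovery tools search for such patterns rather than testing one that is supplied by hand.

```python
from collections import defaultdict

# Hypothetical purchase log: (customer_id, day, item). Values are invented.
events = [
    ("c1", 1, "house"), ("c1", 40, "freezer"),
    ("c2", 5, "house"), ("c2", 200, "cooker"),
    ("c3", 3, "freezer"), ("c3", 10, "house"),
    ("c4", 7, "house"), ("c4", 60, "freezer"),
]


def followed_by(log, first, second, within_days):
    """Count customers whose earliest `first` event is followed by `second`
    within `within_days`. Returns (matched, customers_with_first)."""
    history = defaultdict(list)
    for customer, day, item in log:
        history[customer].append((day, item))

    had_first, matched = 0, 0
    for items in history.values():
        items.sort()  # order each customer's events by day
        first_days = [day for day, item in items if item == first]
        if not first_days:
            continue
        had_first += 1
        start = first_days[0]
        # The second event must come strictly after the first one.
        if any(item == second and start < day <= start + within_days
               for day, item in items):
            matched += 1
    return matched, had_first


matched, had_first = followed_by(events, "house", "freezer", within_days=90)
print(f"{matched} of {had_first} house buyers bought a freezer within 90 days")
```

Customer c3 bought a freezer before the house, so the pair is not counted; that ordering check is what separates sequence analysis from plain association.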


Link Analysis - Similar Time Sequence Discovery

Finds links between two sets of data that are time-dependent, and is based on the degree of similarity between the patterns that both time series demonstrate.

e.g. Within three months of buying property, new home owners will purchase goods such as cookers, freezers, and washing machines.


Data Mining Tools

CLUSTERING

Cluster analysis lets the data miner assemble data into unforeseen groups containing similar characteristics. Also known as "segmentation," this type of data mining is probably the most widely used.


CLUSTERING

Aim is to partition a database into an unknown number of segments, or clusters, of similar records.

Uses unsupervised learning to discover homogeneous sub-populations in a database to improve the accuracy of the profiles.
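One simple way to partition records without fixing the number of segments in advance is a threshold-based "leader" clustering pass, sketched below. This is a swapped-in illustrative technique, not necessarily what any particular data mining tool uses; the `radius` parameter and the (age, income) sample points are assumptions made for the example.

```python
import math


def leader_clustering(points, radius):
    """Assign each point to the nearest existing cluster centre within `radius`,
    or start a new cluster. The number of clusters is not fixed in advance."""
    centres, members = [], []
    for p in points:
        best, best_dist = None, radius
        for idx, centre in enumerate(centres):
            d = math.dist(p, centre)
            if d <= best_dist:
                best, best_dist = idx, d
        if best is None:
            # No centre is close enough: this point founds a new cluster.
            centres.append(p)
            members.append([p])
        else:
            members[best].append(p)
            n = len(members[best])
            # Move the centre to the mean of the cluster's members.
            centres[best] = tuple(
                sum(q[k] for q in members[best]) / n for k in range(len(p))
            )
    return centres, members


# Hypothetical (age, income in thousands) records; values are invented.
points = [(25, 30), (27, 32), (60, 80), (62, 79), (26, 31)]
centres, members = leader_clustering(points, radius=10)
print(len(centres), [len(m) for m in members])
```

No labels are supplied, which is what makes the procedure unsupervised; the result does depend on the order of the records and on the chosen radius.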


Data Mining Tools

PREDICTIVE MODELING

As the name implies, predictive modeling involves developing a model from historical data for predicting a future event. The power of predictive modeling engines is that they can use a broad range of data attributes to identify future behavior. Both cluster analysis and predictive modeling tools identify distinct groups of items with common attributes; the difference is that predictive modeling focuses on the likelihood of a particular outcome for a particular group.
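To illustrate "the likelihood of a particular outcome for a particular group", here is a deliberately small sketch that estimates an outcome rate per group from historical records and uses it to predict. The segment names, the outcome, and the 0.5 threshold are hypothetical; real predictive modeling engines use many attributes and far richer models.

```python
from collections import Counter

# Hypothetical history: (segment, responded_to_offer). Values are invented.
history = [
    ("young_renter", True), ("young_renter", False),
    ("young_renter", True), ("young_renter", True),
    ("retired_owner", False), ("retired_owner", False),
    ("retired_owner", True), ("retired_owner", False),
]


def fit_likelihoods(records):
    """Estimate, per segment, the observed rate of a positive outcome."""
    totals, positives = Counter(), Counter()
    for segment, outcome in records:
        totals[segment] += 1
        positives[segment] += outcome  # True counts as 1, False as 0
    return {segment: positives[segment] / totals[segment] for segment in totals}


model = fit_likelihoods(history)


def predict(segment, threshold=0.5):
    """Predict a positive outcome when the historical rate reaches the threshold."""
    return model[segment] >= threshold


print(model, predict("young_renter"), predict("retired_owner"))
```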


Topic 9: Data Mining Techniques - A Summary

Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

Decision Trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a database.

Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.

Rule induction: The extraction of useful if-then rules from data based on statistical significance.
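To make the decision tree entry above more concrete, the sketch below builds the smallest possible tree, a single split often called a decision stump, and prints it as an if-then rule. The data, the attribute name `years_renting`, and the function `best_stump` are hypothetical; full decision tree algorithms choose among many attributes and split recursively.

```python
def best_stump(rows):
    """rows: list of (value, label) pairs with boolean labels.
    Returns (threshold, errors) for the rule 'value > threshold -> True'
    that misclassifies the fewest rows."""
    values = sorted({value for value, _ in rows})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        errors = sum((value > t) != label for value, label in rows)
        if best is None or errors < best[1]:
            best = (t, errors)
    return best


# Hypothetical data: (years renting, bought a property). Values are invented.
rows = [(1, False), (2, False), (3, True), (4, True), (5, True), (1.5, False)]
threshold, errors = best_stump(rows)
print(f"IF years_renting > {threshold} THEN buys (training errors: {errors})")
```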


Data Mining Techniques - A Summary

Operation: Associated techniques

Predictive modeling: Classification, Value prediction

Database segmentation: Demographic clustering, Neural clustering

Link analysis: Association discovery, Sequential pattern discovery, Similar time sequence discovery

Deviation detection: Statistics, Visualization


Two Types of Data Mining Modeling - Verification and Discovery

The verification model utilizes a process that looks in a database to detect trends and patterns in data that will help answer some specific questions about the business.

In this model, the user generates a hypothesis about the data, issues a query against the data, and examines the results. The results either verify the hypothesis or lead the user to decide that it is not valid.


Verification Model

In this model, very little new information is created by the extraction process: either the hypothesis is verified or it is not.

Common tools used in this model are queries, multidimensional analysis, and visualization. What they all have in common is that the user is essentially ‘guiding’ the exploration of the data being inspected.
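The verification loop (state a hypothesis, issue a query, examine the result) can be sketched with Python's built-in `sqlite3` module. The table `renters`, its columns, the sample rows, and the hypothesis itself are assumptions made up for this example.

```python
import sqlite3

# In-memory database with a hypothetical table; all values are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE renters (age INTEGER, years_renting INTEGER, bought INTEGER)"
)
conn.executemany(
    "INSERT INTO renters VALUES (?, ?, ?)",
    [(30, 3, 1), (41, 4, 0), (28, 1, 0), (35, 5, 1), (22, 2, 0), (50, 1, 0)],
)

# Hypothesis stated by the user: long-term renters buy more often than others.
query = """
    SELECT years_renting > 2 AS long_term, AVG(bought) AS buy_rate
    FROM renters
    GROUP BY years_renting > 2
    ORDER BY long_term
"""
rates = dict(conn.execute(query).fetchall())  # {0: rate, 1: rate}

# The only new information produced: the hypothesis holds or it does not.
hypothesis_holds = rates[1] > rates[0]
print(rates, hypothesis_holds)
conn.close()
```

The user chose both the question and the columns to inspect; the query merely confirms or rejects that one idea, which is why this model yields little new information.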


Discovery Model

A more popular model is the Discovery Model, which utilizes a process that looks in a database to discover patterns and/or predict future ones. The discovery model is divided into two modes: “Descriptive” and “Predictive”.


Discovery Model - Descriptive Mode

The Descriptive mode finds hidden patterns without a predetermined idea or hypothesis about what the patterns may be. In other words, the data mining software takes the initiative in finding the interesting patterns, without the user thinking of the relevant questions first. In this mode, information is created about the data with very little or no guidance from the user. The exploration is done in such a way as to yield as many useful facts about the data as possible in the shortest amount of time.


Discovery Model - Predictive Mode

In the Predictive mode, patterns discovered from the database are used to predict future patterns or trends. Predictive modeling allows the user to submit records with some unknown field values, and the system will guess the unknown values based on previous patterns discovered from the database.
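Guessing an unknown field value from earlier records can be sketched with a small nearest-neighbour vote. This is one illustrative technique chosen for brevity, not a statement of how any given tool works; the records, the fields (age, years renting, bought), and `k=3` are all assumptions.

```python
import math

# Hypothetical complete records: ((age, years_renting), bought). Values are invented.
known = [
    ((30, 3), True), ((41, 4), False), ((28, 1), False),
    ((35, 5), True), ((22, 2), False),
]


def guess_unknown(features, k=3):
    """Guess the missing 'bought' field by majority vote among the
    k known records closest to `features`."""
    nearest = sorted(known, key=lambda rec: math.dist(rec[0], features))[:k]
    votes = sum(label for _, label in nearest)  # True counts as 1
    return votes * 2 > k


# A submitted record whose 'bought' field is unknown.
guess = guess_unknown((33, 4))
print(guess)
```

In practice the attributes would be scaled before distances are computed, since otherwise the field with the largest numeric range dominates the vote.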

Comparing the two models, “Verification” can be very inefficient, time-consuming, and costly, whereas “Discovery” modeling can be very efficient and cost-effective, is less dependent on user input, and increases modeling accuracy.