KDD: A Definition
• KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
10^6 to 10^12 bytes: we never see the whole data set, so we put it in the memory of computers
What is the knowledge? How do we represent and use it?
Then run data mining algorithms
Why do we need KDD?
Data Overload
• Science
• Marketing
• Finance
• Healthcare
• Retail
Some Data Overload Examples:
• Wal-Mart records 20 million transactions per day
• Health care transactions: multi-gigabyte databases
• Mobil Oil: geological data of over 100 terabytes
Data is the most important tool for gaining a competitive edge by providing improved, customized services.
Knowledge Discovery Process
[Figure: the knowledge discovery pipeline. Stages: Raw Data, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Knowledge. Operations linking the stages: Integration, Selection & Cleaning, Transformation, Data Mining, Interpretation & Evaluation, Understanding.]
Knowledge Discovery in Databases
• Knowledge discovery in databases (KDD) is the non-trivial process of identifying valid, potentially useful, and ultimately understandable patterns in data.
[Figure: Operational Databases --> Clean, Collect, Summarize --> Data Warehouse --> Data Preparation --> Training Data --> Data Mining --> Model, Patterns --> Verification, Evaluation]
Knowledge Discovery Process
Goals
Data Selection, Acquisition & Integration
Data Cleaning
Data Reduction & Projection
Matching the Goals
Exploratory Data Analysis
Data Mining
Interpretation and Testing
Consolidation & Use
Knowledge Discovery Process
STEP – 1: IDENTIFYING THE GOAL
• The first step is to develop an understanding of the application domain and the relevant prior knowledge, and to identify the goal of the KDD process from the customer’s viewpoint.
Knowledge Discovery Process
STEP – 2: CREATING A TARGET DATA SET
• Selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
Knowledge Discovery Process
STEP – 3: DATA CLEANING AND PREPROCESSING
• Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.
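One of the cleaning decisions named in this step, a strategy for missing data fields, can be sketched as follows. This is only an illustration: the records, the field names and the choice of median imputation are assumptions, not something the slides prescribe.

```python
from statistics import median

# Hypothetical raw records; None marks a missing field.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": None},
    {"age": 41, "income": 61000},
]

def impute_median(rows, field):
    """Return copies of the rows with missing `field` values replaced
    by the median of the observed values for that field."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

# Apply the same strategy to both fields.
cleaned = impute_median(impute_median(records, "age"), "income")
```

Other strategies, such as dropping incomplete records or modelling the missing value, fit the same step; which one is appropriate depends on the goal identified in step 1.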
Knowledge Discovery Process
STEP – 4: DATA REDUCTION AND PROJECTION
• Finding useful features to represent the data depending on the goal of the task.
• With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.
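As a minimal sketch of reducing the number of variables, the example below drops columns that carry (almost) no variation. The data and the `drop_low_variance` helper are hypothetical, and this is only one very simple reduction technique; projection methods such as principal component analysis are not shown.

```python
from statistics import pvariance

def drop_low_variance(rows, threshold=1e-9):
    """Keep only the columns whose population variance exceeds `threshold`.

    Returns the reduced rows and the indices of the columns that were kept.
    """
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[i] for i in keep] for row in rows]
    return reduced, keep

# Hypothetical data: the middle column is constant and therefore uninformative.
data = [
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.4],
    [3.0, 5.0, 0.1],
]
reduced, kept = drop_low_variance(data)
```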
Knowledge Discovery Process
STEP – 5: MATCHING THE GOALS
• Matching the goals of the KDD process to a particular data-mining method such as summarization, classification, regression, clustering, etc.
Knowledge Discovery Process
STEP – 6: EXPLORATORY ANALYSIS AND MODEL & HYPOTHESIS SELECTION
• Choosing the data mining algorithms and selecting methods to be used for searching for data patterns.
• This process includes deciding which models and parameters might be appropriate and matching a particular data-mining method with the overall criteria of the KDD process.
Knowledge Discovery Process
STEP – 7: DATA MINING
• Searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering.
• The user can significantly aid the data-mining method by correctly performing the preceding steps.
Knowledge Discovery Process
STEP – 8: INTERPRETATION & TESTING
• Interpreting mined patterns, possibly returning to any of steps 1 through 7 for further iteration.
• This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models.
Knowledge Discovery Process
STEP – 9: KNOWLEDGE PRESENTATION
• Using the knowledge directly, incorporating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties.
• This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.
Data Warehousing
• A platform for online analytical processing (OLAP)
• Warehouses collect transactional data from several transactional databases and organize them in a fashion amenable to analysis
• Smaller, subject-specific warehouses are called “data marts”
• A critical component of the decision support system (DSS) of enterprises
• Some typical DW queries:
– Which item sells best in each region that has retail outlets?
– Which advertising strategy is best for Dubai Markets?
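The first query above can be sketched against a toy fact table. The `sales` table, its columns and its rows are invented for illustration, and an in-memory SQLite database merely stands in for a real warehouse.

```python
import sqlite3

# In-memory stand-in for a warehouse fact table (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, item TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", "milk", 120), ("North", "diapers", 90), ("North", "milk", 30),
        ("South", "milk", 40), ("South", "diapers", 75),
    ],
)

# Aggregate units per (region, item), best sellers first within each region.
rows = conn.execute(
    "SELECT region, item, SUM(units) AS total "
    "FROM sales GROUP BY region, item "
    "ORDER BY region, total DESC"
).fetchall()

# The first row seen for each region is that region's best-selling item.
best_sellers = {}
for region, item, total in rows:
    best_sellers.setdefault(region, (item, total))

conn.close()
```

The aggregation (group by region and item, then rank) is the typical shape of an OLAP query: it summarizes many transactions rather than looking up a single one.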
Data Warehousing
[Figure: OLTP source systems (Order Processing, Inventory, Sales) feed through Data Cleaning into the Data Warehouse (OLAP)]
Data Cleaning
• Performs logical transformation of transactional data to suit the data warehouse
• Model of operations --> model of enterprise
• Usually a semi-automatic process
Orders: Order_id, Price, Cust_id
Inventory: Prod_id, Price, Price_change
Sales: Cust_id, Cust_profit, Total_sales
Data Warehouse: Customers, Products, Orders, Inventory, Price, Time
Primary Tasks of Data Mining
• Classification: finding the description of several predefined classes and classifying a data item into one of them.
• Regression: mapping a data item to a real-valued prediction variable.
• Clustering: identifying a finite set of categories or clusters to describe the data.
• Summarization: finding a compact description for a subset of data.
• Dependency Modeling: finding a model which describes significant dependencies between variables.
• Deviation and change detection: discovering the most significant changes in the data.
Data Mining Algorithm Components
• Model representation
– descriptions of discovered patterns
– overly limited representation: unable to capture data patterns
– too powerful: potential for overfitting
– examples: decision trees, rules, linear/non-linear regression & classification, nearest neighbor and case-based reasoning methods, graphical dependency models
• Model evaluation criteria
– how well a pattern (model) meets goals (fit function)
– e.g., accuracy, novelty, etc.
Data Mining Algorithm Components
• Search method
– parameter search: optimization of parameters for a given model representation
– model search: considers a family of models
Different methods suit different problems. Proper problem formulation is crucial.
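The three components can be seen together in a small sketch. Everything here is an assumption made for illustration: the data, the model representation (a one-feature threshold rule), accuracy as the evaluation criterion, and exhaustive search as the search method.

```python
# Hypothetical labelled data: each row is (feature_0, feature_1, label).
data = [
    (1.0, 7.0, 0), (2.0, 3.0, 0), (3.0, 8.0, 0),
    (6.0, 2.0, 1), (7.0, 9.0, 1), (8.0, 4.0, 1),
]

def accuracy(feature, threshold):
    """Evaluation criterion: fraction of rows where the rule
    'predict 1 if x > threshold' agrees with the label."""
    hits = sum((row[feature] > threshold) == bool(row[2]) for row in data)
    return hits / len(data)

def parameter_search(feature):
    """Parameter search: best threshold for one fixed model
    (a threshold rule on the given feature)."""
    candidates = sorted({row[feature] for row in data})
    return max(candidates, key=lambda t: accuracy(feature, t))

def model_search(features=(0, 1)):
    """Model search: tune the rule for every feature, then keep the best one."""
    tuned = [(f, parameter_search(f)) for f in features]
    return max(tuned, key=lambda ft: accuracy(*ft))

best_feature, best_threshold = model_search()
```

Note that accuracy is measured on the same data used for the search, so a more powerful representation would score at least as well here while possibly overfitting; a held-out set would be needed to detect that.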
Data Mining Techniques
Descriptive
• Clustering
• Association
• Sequential Analysis
Predictive
• Classification
– Decision Tree
– Rule Induction
– Neural Networks
– Nearest Neighbor Classification
• Regression
Association Rule: Application
• Supermarket shelf management
• Goal: to identify items which are bought together (by sufficiently many customers)
• Approach: process point-of-sale data (collected with barcode scanners) to find dependencies among items.
• Consider the discovered rule: {Diapers, Milk … } --> {Baby food}
• Example:
– If a customer buys diapers and milk, then they are very likely to buy baby food.
– So stack baby food next to diapers?
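"Sufficiently many customers" and "very likely" are usually quantified as the support and confidence of a rule. A minimal sketch, using invented transactions:

```python
# Hypothetical point-of-sale baskets, one set of items per transaction.
transactions = [
    {"diapers", "milk", "baby food"},
    {"diapers", "milk", "baby food", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"diapers", "baby food"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent,
    the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule under test: {diapers, milk} --> {baby food}
rule_support = support({"diapers", "milk", "baby food"})
rule_confidence = confidence({"diapers", "milk"}, {"baby food"})
```

A rule is typically reported only when both values exceed thresholds chosen by the analyst; the thresholds themselves are a modelling decision and are not given in the slides.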
Sequential Pattern Discovery: Application
• Sequences in which customers purchase goods/services
• Understanding long-term customer behavior enables timely promotions.
• In point-of-sale transaction sequences:
– Computer bookstore: (Intro to Visual C++) (Java & J2EE) --> (Perl for Dummies, PHP in 24 Hrs)
– Athletic apparel store: (Shoes) (Racket, Racquetball) --> (Sports Jacket)
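Whether a sequential pattern such as the athletic apparel one holds for a customer can be checked with an ordered subsequence test. The purchase histories below are made up, and the helper name `contains_sequence` is hypothetical.

```python
def contains_sequence(history, pattern):
    """True if the itemsets in `pattern` occur in `history` in the same
    order; other purchases may occur in between."""
    position = 0
    for basket in history:
        if position < len(pattern) and pattern[position] <= basket:
            position += 1
    return position == len(pattern)

# Hypothetical per-customer purchase histories (one set of items per visit).
histories = [
    [{"shoes"}, {"racket", "racquetball"}, {"sports jacket"}],
    [{"shoes"}, {"socks"}, {"racket", "racquetball"}, {"sports jacket", "cap"}],
    [{"racket", "racquetball"}, {"shoes"}],
    [{"shoes"}, {"sports jacket"}],
]
pattern = [{"shoes"}, {"racket", "racquetball"}, {"sports jacket"}]

# Support: fraction of customers whose history contains the pattern.
sequence_support = sum(contains_sequence(h, pattern) for h in histories) / len(histories)
```

Unlike association rules, the order of the visits matters here: the third customer bought the same items as the pattern's first two steps, but in the reverse order, and so does not count.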
Clustering (K-Means and Hierarchical): Application
[Figure: four scatter-plot panels, both axes running from 0 to 10, illustrating K-means with K=2]
• Arbitrarily choose K objects as the initial cluster centers
• Assign each object to the most similar center
• Update the cluster means
• Reassign the objects and update the cluster means again, repeating until the assignments stop changing
Hierarchical clustering: Clusters are formed at different levels by merging clusters at a lower level
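The K-means steps listed above can be sketched in a few lines. The points are invented, and taking the first K points as the initial centers is just one arbitrary choice made to keep the example deterministic.

```python
from math import dist

def k_means(points, k, max_iterations=100):
    """Plain K-means: assign each point to its nearest center, move each
    center to the mean of its members, repeat until assignments are stable."""
    centers = [points[i] for i in range(k)]  # arbitrary initial centers
    assignment = None
    for _ in range(max_iterations):
        # Assign each point to the index of its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Update every center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, assignment

# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1.5, 2), (2, 1), (8, 8), (9, 9), (8.5, 9.5)]
centers, labels = k_means(points, k=2)
```

K-means produces a single flat partition for a fixed K. Hierarchical clustering, as described above, instead builds clusters at several levels by merging lower-level clusters, and is not shown in this sketch.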
Decision Tree Identification: Application
Outlook    Temp       Play?
Sunny      Warm       Yes
Overcast   Chilly     No
Sunny      Chilly     Yes
Cloudy     Pleasant   Yes
Overcast   Pleasant   Yes
Overcast   Chilly     No
Cloudy     Chilly     No
Cloudy     Warm       Yes
Decision Tree Identification Example
[Figure: first split on Outlook. Sunny --> Yes; Cloudy --> Yes/No (mixed); Overcast --> Yes/No (mixed)]
Decision Tree Identification: Application
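[Figure: completed tree; the mixed Outlook branches are split again on Temp]
Outlook = Sunny --> Yes
Outlook = Cloudy
– Temp = Warm --> Yes
– Temp = Chilly --> No
– Temp = Pleasant --> Yes
Outlook = Overcast
– Temp = Pleasant --> Yes
– Temp = Chilly --> No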
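The tree can be rebuilt from the weather table with a short recursive sketch. It splits on the attributes in a fixed order (Outlook first, then Temp), which is an assumption made here to mirror the slides; an attribute-selection criterion such as information gain may choose a different order. The function names are hypothetical.

```python
# The weather table: (Outlook, Temp, Play?).
rows = [
    ("Sunny", "Warm", "Yes"), ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"), ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"), ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"), ("Cloudy", "Warm", "Yes"),
]

def build_tree(subset, attributes):
    """Split on `attributes` (name, column index) in the given order
    until every leaf holds a single class label."""
    labels = {r[-1] for r in subset}
    if len(labels) == 1:
        return labels.pop()          # pure leaf
    if not attributes:
        return "Yes/No"              # still mixed, nothing left to split on
    (name, column), *rest = attributes
    groups = {}
    for r in subset:
        groups.setdefault(r[column], []).append(r)
    return {name: {value: build_tree(group, rest) for value, group in groups.items()}}

def classify(node, example):
    """Follow the branches matching `example` until a leaf label is reached."""
    while isinstance(node, dict):
        name, branches = next(iter(node.items()))
        node = branches[example[name]]
    return node

tree = build_tree(rows, [("Outlook", 0), ("Temp", 1)])
prediction = classify(tree, {"Outlook": "Cloudy", "Temp": "Chilly"})
```

Stopping after the first split reproduces the intermediate tree with its mixed "Yes/No" leaves: `build_tree(rows, [("Outlook", 0)])`.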
Major Application Areas for Data Mining (Classification)
• Advertising
• Bioinformatics
• Customer Relationship Management (CRM)
• Database Marketing
• Fraud Detection
• E-commerce
• Health Care
• Investment/Securities
• Manufacturing, Process Control
• Sports and Entertainment
• Telecommunications
• Web
Major Application Areas for Data Mining: Marketing
• Direct marketing: Most major direct marketing companies are using modeling and data mining.
• Customer segmentation: All industries can take advantage of DM to discover discrete segments in their customer bases by considering additional variables beyond traditional analysis.
• CRM: Find other people in similar life stages and determine which customers are following similar behavior patterns
– Up-sell
– Cross-sell
– Keeping customers for a longer period of time
For example, Verizon Wireless reduced its churn rate from 2% to 1.5%.
Major Application Areas for Data Mining: Fraud Detection
• Credit card fraud detection
• Money laundering
– FAIS (US Treasury)
• Securities fraud
– NASDAQ Sonar system
• Phone fraud
– AT&T, Bell Atlantic, British Telecom/MCI
• Bio-terrorism detection at Salt Lake Olympics 2002
Major Application Areas for Data Mining: Retail
• Sales forecasting: Examining time-based patterns helps retailers make stocking decisions.
• Database retailing: Retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer-label clothing or those who attend sales.
• Merchandise planning and allocation: When retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic characteristics.
Major Application Areas for Data Mining: Banking
• Credit card marketing: By identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs.
• Cardholder pricing and profitability: Card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers.
Major Application Areas for Data Mining: Telecommunication
• Call detail record analysis: Telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions.
• Customer loyalty: Some customers repeatedly switch providers, or “churn”, to take advantage of attractive incentives offered by competing companies. The companies can use DM to identify the characteristics of customers who are likely to remain loyal once they switch, thus enabling the companies to target their spending on customers who will produce the most profit.
Major Application Areas for Data Mining: Manufacturing
• Manufacturing: Through choice boards, manufacturers are beginning to customize products for customers; therefore they must be able to predict which features should be bundled to meet customer demand.
• Warranties: Manufacturers need to predict the number of customers who will submit warranty claims and the average cost of those claims.
Issues and Challenges
• Large data
– Number of variables (features), number of cases (examples)
– Multi-gigabyte, terabyte databases
– Efficient algorithms, parallel processing
• High dimensionality
– Large number of features: exponential increase in search space
– Potential for spurious patterns
– Dimensionality reduction
• Overfitting
– Models noise in training data, rather than just the general patterns
• Changing data, missing and noisy data
• Use of domain knowledge
– Utilizing knowledge on complex data relationships, known facts
• Understandability of patterns
Success Stories
• Network intrusion detection using a combination of sequential rule discovery and classification trees on 4 GB of DARPA data
– Won over the (manual) knowledge engineering approach
– http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides a good, detailed description of the entire process
• Major US bank: customer attrition prediction
– First segment customers based on financial behavior: found 3 segments
– Build attrition models for each of the 3 segments
– 40-50% of attritions were predicted (a factor of 18 increase)
• Targeted credit marketing: major US banks
– Find customer segments based on 13 months of credit balances
– Build another response model based on surveys
– Increased response 4 times -- 2%
Amitava Manna (11DCP007)
Amritanshu Mehra (11DCP008)
Animesh Ranjan (11DCP009)
Ankit Sharma (11DCP010)
Ankita Verma (11DCP011)
Anuj Chabra (11DCP012)