An Introduction to Mining (2 3)avellido/teaching/11-12/IntroDM... · 2011-10-18 · Clustering, NNs...

DM2 2011/2012. Alfredo Vellido. An Introduction to Mining (2-3)


Page 1:

DM2 2011/2012. Alfredo Vellido

An Introduction to Mining (2-3)

Page 2:

RECAP: What’s DATA MINING? A procedural viewpoint

Page 3:

RECAP: What’s DATA MINING? A historicist viewpoint

[Diagram: DM and the fields around it: STATISTICS, PATTERN RECOGNITION, EXPERT SYSTEMS, KDD, ARTIFICIAL INTELLIGENCE, MACHINE LEARNING, DB MANAGEMENT]

Page 4:

RECAP: CRISP-DM: Methodology loop

Page 5:

CRISP: Phases: Problem understanding

Phases: PROBLEM UNDERSTANDING → DATA UNDERSTANDING → DATA PREPARATION → MODELLING → EVALUATION → IMPLEMENTATION

• DETERMINE PROBLEM GOALS
– BACKGROUND
– PROBLEM GOALS
– SUCCESS CRITERIA

• ASSESS SITUATION
– INVENTORY OF RESOURCES
– REQUIREMENTS, ASSUMPTIONS, LIMITATIONS
– RISKS, CONTINGENCIES
– TERMINOLOGY
– COSTS & BENEFITS

• DETERMINE DM GOALS
– DM GOALS
– DM SUCCESS CRITERIA

• PRODUCE PROJECT PLAN
– PROJECT PLAN
– INITIAL SELECTION OF TOOLS

Page 6:

CRISP: Phases: Data understanding

Phases: PROBLEM UNDERSTANDING → DATA UNDERSTANDING → DATA PREPARATION → MODELLING → EVALUATION → IMPLEMENTATION

• OBTAIN INITIAL DATA
– INITIAL DATA REPORT

• DATA DESCRIPTION
– DATA DESCRIPTIVE REPORT

• DATA EXPLORATION
– DATA EXPLORATION REPORT

• DATA QUALITY VERIFICATION
– DATA QUALITY REPORT

Page 7:

METROFANG: a real story about data understanding (1)

Page 8:

METROFANG: a real story about data understanding (2)

[Figure: two time-series plots over samples 1 to 17671. Panels: "caudal entrada" (inlet flow) and "Par motor Secador A" (motor torque, Dryer A). Annotations: Missing data, Stationality, Outliers, Time Series, Weekend?, FORUM???]
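The issues flagged on these plots (missing data, outliers) can be screened for before any modelling. A minimal sketch, assuming the readings arrive as a list in which a missing sample is encoded as `None`; the function name `screen_series`, the toy values, and the median/MAD rule with a 3.5 threshold are illustrative assumptions, not taken from the METROFANG project:

```python
from statistics import median

def screen_series(values, threshold=3.5):
    """Return (missing_idx, outlier_idx) for a list of readings.

    Missing samples are assumed to be encoded as None (an assumption).
    Outliers are flagged with the modified z-score, which uses the median
    and the median absolute deviation (MAD) so that the outliers
    themselves do not distort the scale estimate.
    """
    missing_idx = [i for i, v in enumerate(values) if v is None]
    present = [(i, v) for i, v in enumerate(values) if v is not None]
    if not present:
        return missing_idx, []
    med = median(v for _, v in present)
    mad = median(abs(v - med) for _, v in present)
    if mad == 0:
        # At least half of the readings equal the median: no usable scale.
        return missing_idx, []
    outlier_idx = [i for i, v in present
                   if 0.6745 * abs(v - med) / mad > threshold]
    return missing_idx, outlier_idx

# Hypothetical inlet-flow readings, not real plant data.
readings = [210.0, 212.5, None, 208.0, 211.0, 0.0, 209.5, 350.0, 210.5]
missing, outliers = screen_series(readings)
print("missing at:", missing)
print("outliers at:", outliers)
```

A flagged index is only a candidate: a drop to zero over a weekend, for instance, may be a plant shutdown rather than a sensor error, which is exactly the kind of question the annotations on the plots raise.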

Page 9:

Storing data (’07)

Poll: What did you use for data storage for significant data mining projects in the past year? [142 voters, 284 votes]

• Text files (e.g. tab or comma delimited) (75) 52.8%
• Data mining system format (SAS, SPSS, arff) (57) 40.1%
• Excel (28) 19.7%
• Oracle (25) 17.6%
• SQL Server (15) 10.6%
• mySQL (12) 8.5%
• other format (10) 7.0%
• other commercial DBMS (7) 4.9%
• other free DBMS (4) 2.8%

Page 10:

CRISP: Phases: Data preparation

Phases: PROBLEM UNDERSTANDING → DATA UNDERSTANDING → DATA PREPARATION → MODELLING → EVALUATION → IMPLEMENTATION

• DATA SELECTION
– ARGUMENTS FOR SELECTION

• DATA CLEANING
– DATA CLEANING REPORT

• RECONSTRUCT DATA
– DERIVED VARIABLES
– GENERATED OBSERVATIONS

• INTEGRATE DATA
– INTEGRATED DATA

• DATA FORMATTING
– DATA WITH NEW FORMAT

Page 11:

Is data preparation that important?

What % of time in your data mining project(s) is spent on data cleaning and preparation? [187 votes total]

• over 80% (46) 25%
• 61 to 80% (73) 39%
• 41 to 60% (46) 25%
• 21 to 40% (7) 4%
• 20% or less (15) 8%

Page 12:

Data manipulation tools … (’07)

Page 13:

How large is it? … (’06 → ’09)

Largest database or dataset you data-mined was [181 votes total]

• less than 1 MB (5) 3%
• 1.1 to 10 MB (11) 6%
• 11 to 100 MB (27) 15%
• 101 MB to 1 GB (22) 12%
• 1.1 to 10 GB (45) 25%
• 11 to 100 GB (22) 12%
• 101 GB to 1 Terabyte (28) 15%
• over 1 Terabyte (21) 12%

Page 14:

CRISP: Phases: Modelling

Phases: PROBLEM UNDERSTANDING → DATA UNDERSTANDING → DATA PREPARATION → MODELLING → EVALUATION → IMPLEMENTATION

• SELECT MODELING TECHNIQUE
– SELECTED TECHNIQUE

• CREATE TEST DESIGN
– TEST DESIGN

• BUILD MODEL
– PARAMETER SELECTION
– MODEL
– MODEL DESCRIPTION

• VALIDATE MODEL
– MODEL VALIDATION
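In practice, CREATE TEST DESIGN means deciding, before the model is built, which data will be held out to validate it. A minimal sketch of one common design, a random holdout split; the 70/30 ratio, the function names, the toy records and the majority-class "model" are illustrative assumptions, not something CRISP-DM prescribes:

```python
import random
from collections import Counter

def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def majority_class(train):
    """Build the simplest possible model: always predict the most common label."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def accuracy(predicted_label, test):
    """Fraction of held-out records whose label matches the constant prediction."""
    return sum(label == predicted_label for _, label in test) / len(test)

# Hypothetical (age, loyalty) records.
data = [(age, "loyal" if age > 45 else "not_loyal") for age in range(20, 70)]
train, test = holdout_split(data)
model = majority_class(train)
print(len(train), len(test), model, accuracy(model, test))
```

The point of the design is that `test` plays no part in building `model`; any real technique selected in the first task should at least beat this constant baseline on the held-out records.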

Page 15:

CRISP: A typology of DM problems

• DATA SUMMARY and DESCRIPTION
– Description: Compact and aggregated data description. Exploratory Analysis
– Examples: Almost any problem includes some elements of data description
– Techniques: ERPs, stats., OLAP, EIS, control dashboards

• SEGMENTATION
– Description: Finding data groups (unsupervised); segm / clust / classif
– Examples: Market Segmentation, Shopping Basket analysis
– Techniques: Clustering, NNs (SOM, GTM), visualization

• CONCEPTUAL DESCRIPTION
– Description: Accessible and useful description of concepts / classes / groups. Knowledge comes first, then precision. Linked to classif / segmentation
– Examples: Description of customer groups according to loyalty. Rule segment profiling: if SEX=male and age>45 then CUST=loyal
– Techniques: Rule Induction, Conceptual Clustering

• CLASSIFICATION
– Description: Assumed that different items can be assigned to a given closed category (supervised)
– Examples: Bankruptcy prediction, Credit Scoring
– Techniques: Discriminant Analysis, Rule Induction, Decision Trees, NNs, C-B Reasoning, GAs

• PREDICTION (REGRESSION, FORECASTING)
– Description: Continuous dependent variable. Given values of the predictive variables, predict (supervised)
– Examples: Markets, company benefit pred., Market share forec.
– Techniques: Regression Analysis, Regression Trees, NNs, Box-Jenkins, GAs

• DEPENDENCY ANALYSIS
– Description: Looking for dependencies between variables (superv. or unsuperv.). Often with segmentation
– Examples: Basket Analysis. Ex.: 30% of those who bought peanuts also bought beer …
– Techniques: Correlation Analysis, Association Rules, Bayesian Networks, Inductive Logic Prog.
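The two rule examples in this typology, the basket rule (30% of those who bought peanuts also bought beer) and the profiling rule (if SEX=male and age>45 then CUST=loyal), both have the form antecedent → consequent, and such rules are commonly scored by support and confidence. A minimal sketch; the helper name `rule_stats` and the toy baskets and customer records are made up for illustration:

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = fraction of all records that satisfy both sides
    confidence = fraction of antecedent-matching records that also
                 satisfy the consequent (None if no record matches)
    """
    matched = [r for r in records if antecedent(r)]
    both = [r for r in matched if consequent(r)]
    support = len(both) / len(records) if records else 0.0
    confidence = len(both) / len(matched) if matched else None
    return support, confidence

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"peanuts", "beer"},
    {"peanuts", "bread"},
    {"beer", "crisps"},
    {"peanuts", "beer", "bread"},
    {"milk"},
]
support, confidence = rule_stats(
    baskets,
    antecedent=lambda b: "peanuts" in b,
    consequent=lambda b: "beer" in b,
)
print(f"peanuts -> beer: support={support:.2f}, confidence={confidence:.2f}")

# Hypothetical customer records for the profiling rule.
customers = [
    {"sex": "male", "age": 52, "cust": "loyal"},
    {"sex": "male", "age": 61, "cust": "not_loyal"},
    {"sex": "female", "age": 48, "cust": "loyal"},
    {"sex": "male", "age": 30, "cust": "not_loyal"},
]
profile = rule_stats(
    customers,
    antecedent=lambda c: c["sex"] == "male" and c["age"] > 45,
    consequent=lambda c: c["cust"] == "loyal",
)
print("male and age>45 -> loyal:", profile)
```

In these terms, the 30% in the peanuts and beer example is the confidence of the rule; support additionally tells how many transactions the rule covers at all.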

Page 16:

CRISP: Selection of modeling techniques

• UNIVERSE OF TECHNIQUES (defined by tools)
• TECHNIQUES SUITED TO A PROBLEM
• POLITICAL REQUIREMENTS (business, executive)
• LIMITATIONS: data types, knowledge; money, time, human resources
• SELECTED TOOL(S)

Page 17:

Commonly used models/techniques (‘05) …

Data mining/analytic techniques you use frequently: [784 votes total]

• Decision Trees/Rules (107) 14%
• Clustering (101) 13%
• Regression (90) 11%
• Statistics (80) 10%
• Visualization (63) 8%
• Neural Nets (61) 8%
• Association rules (54) 7%
• Nearest Neighbor (34) 4%
• SVM (Support vector machine) (31) 4%
• Bayesian (30) 4%
• Sequence/Time series analysis (26) 3%
• Boosting (25) 3%
• Hybrid methods (23) 3%
• Bagging (20) 3%
• Genetic algorithms (19) 2%
• Other (20) 3%

Page 18:

Commonly used models/techniques (‘07) …

Page 19:

CRISP: Phases: Evaluation

Phases: PROBLEM UNDERSTANDING → DATA UNDERSTANDING → DATA PREPARATION → MODELLING → EVALUATION → IMPLEMENTATION

• EVALUATE RESULTS
– EVALUATION OF DM RESULTS
– APPROVED MODELS

• REVISE PROCESSES
– REVISION OF THE PROCESS

• DETERMINE NEXT STEPS
– LIST OF POSSIBLE ACTIONS
– DECISIONS

Page 20:

CRISP: Phases: Deployment

Phases: PROBLEM UNDERSTANDING → DATA UNDERSTANDING → DATA PREPARATION → MODELLING → EVALUATION → IMPLEMENTATION

• PLAN IMPLEMENTATION
– IMPLEMENTATION PLAN

• PLAN MONITORING & MAINTENANCE
– MONITORING & MAINTENANCE PLAN

• PRODUCE FINAL REPORT
– FINAL REPORT
– FINAL PRESENTATION

• REVIEW PROJECT
– DOCUMENTATION OF EXPERIENCE

Page 21:

How do you deploy it? (’06 → ’09)

How do you usually deploy data mining results? (Choose all that apply): [95 voters]

• Publish research papers (37) 38.9%
• Use findings to change business rules (42) 44.2%

• Deploy in production and ... (46) 48.4%
• Use data mining tool for scoring (47) 49.5%
• Convert model to SQL (20) 21.1%
• Convert model to another language (16) 16.8%
• Convert model to C or Java (16) 16.8%
• Convert model to PMML (4) 4.2%

• Deploy in batch mode (48) 50.5%
• Deploy in real-time mode (21) 22.1%

Page 22:

Software popularity (‘07)

Free vs. commercial: debate

Page 23:

Software popularity (→‘09)

Page 24:

Ouch! (→‘10)

Page 25:

A note on CRISP-DM 2.0

CRISP-2.0: Updating the Methodology

Why?

Many changes have occurred in the business application of data mining since CRISP-DM 1.0 was published. Emerging issues and requirements include:

• The availability of new types of data (text, Web, and attitudinal data, for example) along with new techniques for pre-processing, analyzing, and combining them with related case data
• Integration and deployment of results with operational systems such as call centers and Web sites
• Far more demanding requirements for scalability and for deployment into real-time environments
• The need to package analytical tasks for non-analytical end users and integrate these tasks in business workflows
• The need to seamlessly integrate the deployment of results and closed-loop feedback with existing business processes
• The need to mine large-scale databases in situ, rather than exporting an analytical dataset
• Organizations’ increasing reliance on teams, making it important to educate greater numbers of people on the processes and best practices associated with data mining and predictive analytics

In July 2006 the consortium announced that it was going to start the process of working towards a second version of CRISP-DM. On 26 September 2006, the CRISP-DM SIG met to discuss potential enhancements for CRISP-DM 2.0 and the subsequent roadmap. However, these efforts appear to be stalled. The SIG has not met, updated the CRISP website, or communicated anything to members since early 2007.

Page 26:

Resources

• Webs:
– www.kdnuggets.com
– http://the-data-mine.com
– www.sigkdd.org

• Free software:
– www.keel.es
– http://www.cs.waikato.ac.nz/ml/weka/
– http://rapid-i.com

Page 27:

Some bibliography available at books.google.com:

Data mining: practical machine learning tools and techniques

I.H. Witten, E. Frank (2005)

Data mining: concepts and techniques

J. Han, M. Kamber (2006)

Principles of data mining

D. J. Hand, H. Mannila, P. Smyth (2001)
