Data Mining IIavellido/teaching/13-14/... · Market Segmentation, Shopping Basket analysis...

Post on 22-Sep-2020


Lluis Belanche + Alfredo Vellido

Intelligent Data Analysis and Data Mining
or …
Data Analysis and Knowledge Discovery
a.k.a. Data Mining II

DATA MINING as a methodology (from previous session …) 

CRISP: a DM methodology
• CRoss-Industry Standard Process for Data Mining: neutral methodology from the point of view of industry, tool and application (free & non-proprietary)
• Pete Chapman, Randy Kerber (NCR); Julian Clinton, Thomas Khabaza, Colin Shearer (SPSS); Thomas Reinartz, Rüdiger Wirth (DaimlerChrysler)
• CRISP-DM was conceived in 1996
• DaimlerChrysler: leaders in industrial application; SPSS: leaders in product development (Clementine, 1994); NCR: owners of large (huge!) databases (Teradata)
• Financed by the EU. Version 1.0 released officially in 1999

IDADM

CRISP: Hierarchic structure of the methodology


CRISP: The virtuous loop of methodology phases


CRISP: Phases: Problem understanding

Phases: PROBLEM UNDERSTANDING ► DATA UNDERSTANDING ► DATA PREPARATION ► MODELLING ► EVALUATION ► IMPLEMENTATION

Tasks and outputs:
• DETERMINE PROBLEM GOAL: BACKGROUND; PROBLEM GOALS; SUCCESS CRITERIA
• ASSESS SITUATION: INVENTORY OF RESOURCES; REQUIREMENTS, ASSUMPTIONS, LIMITATIONS; RISKS, CONTINGENCIES; TERMINOLOGY; COSTS & BENEFITS
• DETERMINE DM GOALS: DM GOALS; DM SUCCESS CRITERIA
• PRODUCE PROJECT PLAN: PROJECT PLAN; INITIAL SELECTION OF TOOLS

DM application areas (’10 -> ’11)

end of last session wrap‐up

CRISP: Phases: Data understanding

Tasks and outputs:
• OBTAIN INITIAL DATA: INITIAL DATA REPORT
• DATA DESCRIPTION: DATA DESCRIPTIVE REPORT
• DATA EXPLORATION: DATA EXPLORATION REPORT
• DATA QUALITY VERIFICATION: DATA QUALITY REPORT

METROFANG: a real story about data understanding (1)
http://www.secadolodos.com/73027_es/METROFANG‐(Barcelona‐Espa%25C3%25B1a)/


METROFANG: a real story about data understanding (2)

[Figure: two time-series plots, sample index 1 to 17671 on the x-axis. Top: inlet flow ("caudal entrada"), y-axis 0 to 350. Bottom: motor torque, Dryer A ("Par motor Secador A"), y-axis 0 to 140.]

Issues annotated on the plots:
• Missing data
• Seasonality
• Outliers
• Time series
• Weekend?
• FORUM???
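Two of the issues listed above, missing data and outliers, can be checked mechanically before any modelling. The following is a minimal sketch only: the readings are invented (they are not the METROFANG data), `quality_report` is a hypothetical helper, and the median/MAD rule with `k=5.0` is just one common choice of outlier criterion, not the one used in the original study.

```python
from statistics import median


def quality_report(series, k=5.0):
    """Count missing samples and flag outliers in a list of readings.

    `series` holds floats, with None marking a missing sample.
    A value is flagged when it lies more than `k` scaled MADs
    (median absolute deviations) away from the median.
    """
    present = [x for x in series if x is not None]
    missing = len(series) - len(present)
    med = median(present)
    mad = median(abs(x - med) for x in present)
    # 1.4826 rescales the MAD so that it is comparable to a standard
    # deviation when the data are roughly normally distributed.
    scale = 1.4826 * mad
    outliers = [
        i for i, x in enumerate(series)
        if x is not None and scale > 0 and abs(x - med) > k * scale
    ]
    return {"missing": missing, "outliers": outliers}


# Hypothetical inlet-flow readings (assumption: None encodes a gap).
flow = [120.0, 118.5, None, 121.2, 119.8, 340.0, 120.4, None, 117.9]
report = quality_report(flow)
print(report)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme reading would otherwise inflate the very threshold meant to detect it.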


Storing data (’07)


CRISP: Phases: Data preparation

Tasks and outputs:
• DATA SELECTION: ARGUMENTS FOR SELECTION
• DATA CLEANING: DATA CLEANING REPORT
• CONSTRUCT DATA: DERIVED VARIABLES; GENERATED OBSERVATIONS
• INTEGRATE DATA: INTEGRATED DATA
• DATA FORMATTING: DATA WITH NEW FORMAT
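The five preparation tasks can be illustrated on a toy example. This is a sketch under stated assumptions: both tables, every field name, and the `prepare` function are invented for illustration, and dropping records with a missing age is only one of several possible cleaning policies.

```python
# Hypothetical raw tables; all field names are assumptions.
customers = [
    {"id": 3, "age": "51", "city": "Barcelona", "notes": ""},
    {"id": 1, "age": "34", "city": "Barcelona", "notes": "n/a"},
    {"id": 2, "age": "", "city": "Girona", "notes": "vip"},
]
purchases = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.5},
    {"customer_id": 3, "amount": 12.0},
]


def prepare(customers, purchases):
    # Integrate (step 1): aggregate the purchases table per customer.
    totals = {}
    for p in purchases:
        cid = p["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + p["amount"]

    rows = []
    for c in customers:
        # Clean: discard records whose age is missing.
        if not c["age"]:
            continue
        age = int(c["age"])
        rows.append({
            # Select: keep id and age, drop city and free-text notes.
            "id": c["id"],
            "age": age,
            # Construct: derived variable computed from age.
            "senior": age > 45,
            # Integrate (step 2): attach the aggregated spend.
            "total_spent": totals.get(c["id"], 0.0),
        })
    # Format: fix the record order expected by the modelling tool.
    rows.sort(key=lambda r: r["id"])
    return rows


prepared = prepare(customers, purchases)
for row in prepared:
    print(row)
```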


Is data preparation that important?


Common data types analyzed …(’07)

Compared to the 2005 KDnuggets poll on “Types of data you analyzed/mined in last 12 months”, the biggest increase was in anonymized data (perhaps an indicator of the increasing importance of privacy issues).


Common data types analyzed …(’09)

Compared to the 2005 KDnuggets poll on “Types of data you analyzed/mined in last 12 months”, the biggest increase was in anonymized data (perhaps an indicator of the increasing importance of privacy issues).

Comparing with 2008, the top 5 categories are unchanged.


Common data types analyzed …(-> ’12)

How large is it? … (’06 -> ’09)

How large is it? … (’09 -> ’13)

The “Big Data” Challenge

How large is it? … (’09 -> ’13)

Some fun facts:
• Google processes over 20 PB worth of data every day.
• Back in December 2007, YouTube generated 27 PB of traffic.
• The CERN Large Hadron Collider (LHC) generates about 20 PB of usable data per year.
• The volume of global annual data traffic is expected to exceed 60,000 PB in 2016, up from 8,000 PB in 2011.
• In the next decade, astronomers expect to be processing 10 PB of data every hour from the Square Kilometre Array (SKA) telescope ► one exabyte every four days.
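The SKA figure can be sanity-checked with a few lines of arithmetic. This sketch assumes decimal (SI) units, 1 EB = 1000 PB; the variable names are illustrative only.

```python
PB_PER_EB = 1000  # assumption: decimal (SI) units, 1 EB = 1000 PB

# SKA: 10 PB per hour, accumulated over four days.
rate_pb_per_hour = 10
pb_in_four_days = rate_pb_per_hour * 24 * 4
eb_in_four_days = pb_in_four_days / PB_PER_EB
print(pb_in_four_days, "PB, i.e.", eb_in_four_days, "EB")

# Global annual traffic: 8,000 PB (2011) to 60,000 PB (2016).
growth_factor = 60_000 / 8_000
print("growth factor 2011 -> 2016:", growth_factor)
```

So "one exabyte every four days" is a rounding of 960 PB, and the traffic projection corresponds to a 7.5-fold increase in five years.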

Data manipulation tools …(’08)

Data “manipulation” tools …(-> ’12)

Data “manipulation” tools …(-> ’13)

CRISP: Phases: Modelling

Tasks and outputs:
• SELECT MODELLING TECHNIQUE: SELECTED TECHNIQUE
• CREATE TEST DESIGN: TEST DESIGN
• BUILD MODEL: PARAMETER SELECTION; MODEL; MODEL DESCRIPTION
• VALIDATE MODEL: MODEL VALIDATION
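The modelling tasks can be walked through on a deliberately trivial problem. This is a minimal sketch, not a recommended technique: the data are synthetic, the "model" is a single threshold on one variable, the test design is a plain 70/30 holdout, and all function names are hypothetical.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=0):
    """Test design: shuffle with a fixed seed and hold out a test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def accuracy(cut, rows):
    """Fraction of rows where the rule 'predict 1 if x > cut' is right."""
    return sum((x > cut) == bool(y) for x, y in rows) / len(rows)


def fit_threshold(train):
    """Build model: choose the cut with the best training accuracy.

    Candidate cuts are the midpoints between consecutive sorted x values;
    the chosen cut is the model's only parameter.
    """
    xs = sorted(x for x, _ in train)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda cut: accuracy(cut, train))


# Synthetic, perfectly separable data: label 0 for x in 1..10,
# label 1 for x in 21..30.
data = [(x, 0) for x in range(1, 11)] + [(x, 1) for x in range(21, 31)]

train, test = holdout_split(data)
cut = fit_threshold(train)
# Validate model: measure accuracy on data not used for fitting.
test_accuracy = accuracy(cut, test)
print(len(train), len(test), test_accuracy)
```

The point of the split is that `test_accuracy` is computed on rows the model never saw while its parameter was being selected; on real, noisy data it would normally be lower than the training accuracy.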


CRISP: A typology of DM problems

DATA SUMMARY and DESCRIPTION
• Description: Compact and aggregated data description. Exploratory analysis
• Examples: Almost any problem includes some elements of data description
• Techniques: ERPs, stats., OLAP, EIS, control dashboards

SEGMENTATION
• Description: Finding data groups (unsupervised); segm / clust / classif
• Examples: Market Segmentation, Shopping Basket analysis
• Techniques: Clustering, NNs (SOM, GTM), visualization

CONCEPTUAL DESCRIPTION
• Description: Accessible and useful description of concepts / classes / groups. Knowledge comes first, then precision. Linked to classif / segmentation
• Examples: Description of customer groups according to loyalty. Rule-based segment profiling: if SEX=male and age>45 then CUST=loyal
• Techniques: Rule Induction, Conceptual Clustering

CLASSIFICATION
• Description: Assumes that different items can be assigned to one of a closed set of categories (supervised)
• Examples: Bankruptcy prediction, Credit Scoring
• Techniques: Discriminant Analysis, Rule Induction, Decision Trees, NNs, Case-Based Reasoning, GAs

PREDICTION (REGRESSION, FORECASTING)
• Description: Continuous dependent variable. Given the values of the predictive variables, predict its value (supervised)
• Examples: Markets, company profit prediction, market share forecasting
• Techniques: Regression Analysis, Regression Trees, NNs, Box-Jenkins, GAs

DEPENDENCY ANALYSIS
• Description: Looking for dependencies between variables (supervised or unsupervised). Often combined with segmentation
• Examples: Basket Analysis, e.g. 30% of those who bought peanuts also bought beer …
• Techniques: Correlation Analysis, Association Rules, Bayesian Networks, Inductive Logic Programming
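The basket-analysis statement "30% of those who bought peanuts also bought beer" is the confidence of the association rule peanuts ► beer. The sketch below computes support and confidence on invented baskets chosen so that the confidence comes out at 30%; the helper names are hypothetical and no rule-mining library is used.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Among baskets with the antecedent, fraction that also have the consequent."""
    both = support(baskets, set(antecedent) | set(consequent))
    return both / support(baskets, antecedent)


# Invented transactions: 10 baskets contain peanuts, 3 of those contain beer.
baskets = (
    [{"peanuts", "beer"}] * 3
    + [{"peanuts", "bread"}] * 7
    + [{"bread", "milk"}] * 10
)

rule_support = support(baskets, {"peanuts", "beer"})
rule_confidence = confidence(baskets, {"peanuts"}, {"beer"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Support says how often the whole rule applies (here 3 of 20 baskets), while confidence conditions on the antecedent (3 of the 10 peanut baskets), which is why the two numbers differ.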


CRISP: Selection of techniques

From the universe of techniques to the selected tool(s), narrowing at each step:
• UNIVERSE OF TECHNIQUES (defined by tools)
• TECHNIQUES SUITED TO A PROBLEM
• POLITICAL REQUIREMENTS (business, executive)
• LIMITATIONS (data types, knowledge; money, time, human resources)
• SELECTED TOOL(S)