CS626 Data Analysis and Simulation


Page 1: CS626 Data Analysis and Simulation


CS626 Data Analysis and Simulation

Today: Data Analysis

Reference: Berthold, Borgelt, Hoeppner, Klawonn: Guide to Intelligent Data Analysis, Chapters 1 and 3

Instructor: Peter Kemper R 104A, phone 221-3462, email:[email protected]

Page 2: CS626 Data Analysis and Simulation

Data vs Knowledge

Data
- refer to single instances
- describe individual properties
- are often available in large quantities
- are often easy to collect or obtain
- do not allow us to make predictions or forecasts

Knowledge
- refers to classes of instances (sets of objects, people, ...)
- describes general patterns, structures, laws, principles
- consists of as few statements as possible (explicit goal!)
- is often difficult and time-consuming to find or obtain
- allows us to make predictions or forecasts

Goal: given the data, find knowledge

Page 3: CS626 Data Analysis and Simulation

Data Analysis Process

CRISP-DM: CRoss Industry Standard Process for Data Mining

Compendium slides for “Guide to Intelligent Data Analysis”, Springer 2011.

© Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn and Iris Adä


Model, model architecture: a mathematical representation of the data, used to represent knowledge. Examples:
- interpretable: a linear regression
- black box: an artificial neural network
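To illustrate why a linear regression counts as an interpretable model, here is a minimal sketch that fits a line by ordinary least squares. The data values and the names `fit_line` and `predict` are made up for illustration; the point is that the fitted model reduces to two numbers that can be read directly.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for the model y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up data: each (x, y) pair is a single observed instance.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

# The "knowledge": two parameters that summarize all instances.
intercept, slope = fit_line(xs, ys)

def predict(x):
    """Use the fitted model to forecast y for an unseen x."""
    return intercept + slope * x

print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

The slope can be read as "y changes by this much per unit of x", which is the kind of statement a black-box model does not offer directly.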

Data validity: data is correct, accurate, representative

Page 4: CS626 Data Analysis and Simulation

Scheme of a simulation study (Law/Kelton, p. 84)
- Conceptual model, programmed model: refers to a simulation model for a discrete-event dynamic system
- Model validity: the model matches the real system with respect to the measure of interest
- Simulation model as data generator
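The idea of a simulation model as a data generator can be sketched with a single-server FIFO queue. This is only an illustration under stated assumptions: exponential interarrival and service times, the made-up rates 0.8 and 1.0, and a simple recursion over customers instead of a full event list. The function name `simulate_mm1` is hypothetical.

```python
import random

def simulate_mm1(arrival_rate, service_rate, n_customers, seed=0):
    """Simulate a single-server FIFO queue; return each customer's waiting time."""
    rng = random.Random(seed)   # fixed seed makes the generated data reproducible
    arrival = 0.0               # arrival time of the current customer
    server_free_at = 0.0        # time at which the server finishes earlier work
    waits = []
    for _ in range(n_customers):
        arrival += rng.expovariate(arrival_rate)
        start = max(arrival, server_free_at)   # wait if the server is busy
        waits.append(start - arrival)
        server_free_at = start + rng.expovariate(service_rate)
    return waits

# The generated waiting times are data that can be analyzed like any other data set.
waits = simulate_mm1(arrival_rate=0.8, service_rate=1.0, n_customers=10_000)
mean_wait = sum(waits) / len(waits)   # the measure of interest in this sketch
print(f"{len(waits)} observations, mean waiting time {mean_wait:.2f}")
```

Model validity would then mean that the mean waiting time produced here matches the one observed in the real system closely enough for the question at hand.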


Page 5: CS626 Data Analysis and Simulation

How does a simulation study relate to CRISP-DM?

Page 6: CS626 Data Analysis and Simulation


Overview: GIDA Book follows CRISP-DM

Project Understanding, Ch 3

Data Understanding, Ch 4

Data Preparation, Ch 6

Modeling, Ch 7-9

Deployment, Ch 10


Page 7: CS626 Data Analysis and Simulation

Categories for Data Analysis Problems

Classification
- Predict the outcome of an experiment with a finite number of possible results
- Example: observing spontaneous calcium activity -> gene present?

Regression
- Like classification, but the predicted value is numeric
- Example: How much will college tuition cost next year?

Clustering, Segmentation
- Summarize the data by forming groups of similar cases
- Example: Do cells divide into different groups according to Ca activity?

Association Analysis
- Find correlations/associations to better understand/describe the interdependencies of all attributes. Focus: relationships
- Example: Which optional equipment of a car often goes together?

Deviation Analysis
- Knowing the major trends/structures, find any exceptional subgroup that behaves differently
- Example: Under which circumstances does the system fail?
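As one concrete instance of these categories, the car-equipment question from association analysis can be sketched by counting co-occurrences. The orders below are invented, and `support` and `confidence` are the usual textbook measures rather than anything specific to this lecture.

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: the set of optional equipment chosen for each car.
orders = [
    {"navigation", "leather seats", "sunroof"},
    {"navigation", "leather seats"},
    {"tow bar"},
    {"navigation", "sunroof"},
    {"leather seats", "navigation"},
    {"tow bar", "roof rack"},
]

# How often each item and each (sorted) pair of items occurs across orders.
item_counts = Counter(item for order in orders for item in order)
pair_counts = Counter(
    pair for order in orders for pair in combinations(sorted(order), 2)
)

def support(item_a, item_b):
    """Fraction of all orders that contain both items."""
    return pair_counts[tuple(sorted((item_a, item_b)))] / len(orders)

def confidence(antecedent, consequent):
    """Estimated share of orders with the antecedent that also contain the consequent."""
    both = pair_counts[tuple(sorted((antecedent, consequent)))]
    return both / item_counts[antecedent]

print(support("navigation", "leather seats"))
print(confidence("leather seats", "navigation"))
print(confidence("navigation", "leather seats"))
```

Note that confidence is not symmetric: in this toy data every car with leather seats has navigation, but not the other way round.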

Page 8: CS626 Data Analysis and Simulation

Catalog of Methods

Finding Patterns
- For unknown data, explore it to learn new, previously unknown patterns. Not focused on a particular target attribute.
- May apply techniques from segmentation, clustering, association analysis, or deviation analysis

Finding Explanations
- Special interest in a target variable: figure out why and how it varies from case to case.
- May apply techniques from classification, regression, association analysis, or deviation analysis

Finding Predictors
- Special interest in the prediction of a target variable, but not so much in understanding why and how it varies
- May apply techniques from classification and regression

Page 9: CS626 Data Analysis and Simulation

Project Understanding

General
- Problem formulation
- Mapping the problem formulation to a data analysis task
- Understanding the situation (available data, suitability of the data, ...)

The 80-20 Rule!
- Average time spent on project and data understanding within the CRISP-DM model: 20%
- Importance for success: 80%

Why is that?
- Trivial: success means solving the right problem and solving it right ...
- But: the selection of the model may affect the preparation work on the data and the outcome of the evaluation phase (insight vs. black-box model)
- => better to use the understanding phase to plan the overall project and its outcome


Page 10: CS626 Data Analysis and Simulation

Project Understanding

Margaret Saha, Biology Dept.

Big question: How do neurons acquire their specific identities?

More detailed questions for the given data set:
- Do specific patterns of spontaneous calcium activity (calcium entering the cell) contribute to the acquisition of phenotype (i.e. identity)?
- Are there developmental trends in the patterns of calcium activity?

Given data set:
- individual experiments per (duration: 2h, 12h) x (stages: 14, 18, 22) x (genes: a1A, a1B, a1C, a1H)
- for sets of cells with positive, negative, or unknown gene expression
- for each cell, a time series of 890 values (after applying a moving average filter)
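The structure of this data set can be sketched as follows. The sketch assumes that every combination of duration, stage, and gene was measured, and it uses a simple unweighted moving average with a made-up window of 3; the slides state neither the window size nor the exact filter variant.

```python
from itertools import product

durations = ["2h", "12h"]
stages = [14, 18, 22]
genes = ["a1A", "a1B", "a1C", "a1H"]

# One experiment per combination of duration, stage, and gene (assumed complete).
experiments = list(product(durations, stages, genes))
print(len(experiments), "experiments, e.g.", experiments[0])

def moving_average(series, window):
    """Unweighted moving average; returns len(series) - window + 1 values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    total = sum(series[:window])
    smoothed = [total / window]
    for i in range(window, len(series)):
        total += series[i] - series[i - window]   # slide the window by one step
        smoothed.append(total / window)
    return smoothed

# Toy calcium trace; a real trace in this data set has 890 values after filtering.
raw = [1.0, 2.0, 6.0, 2.0, 1.0, 3.0]
smoothed = moving_average(raw, window=3)
print(smoothed)
```

Because the filter shortens and smooths each series, it matters for project understanding to know which filter was applied before the data were handed over.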


Project understanding?

Page 11: CS626 Data Analysis and Simulation

Project Understanding
- Determine project objective(s)
- Determine criteria for success: how to decide if the project goals are reached?
- Common obstacle: language barrier
- Common obstacle: objective too general

Page 12: CS626 Data Analysis and Simulation

Determine Project Objective


Table: Problems faced in data analysis projects.

| Problem source | Project owner perspective | Analyst perspective |
|---|---|---|
| communication | project owner does not understand the technical terms of the analyst | analyst does not understand the terms of the domain of the project owner |
| lack of understanding | project owner was not sure what the analyst could do or achieve; models of analyst were different from what the project owner envisioned | analyst found it hard to understand how to help the project owner |
| organization | requirements had to be adapted in later stages as problems with the data became evident | project owner was an unpredictable group (not so concerned with the project) |

Problems experienced in data analysis projects

Page 13: CS626 Data Analysis and Simulation

Project Understanding
- Determine project objective(s)
- Determine criteria for success: how to decide if the project goals are reached?

Common obstacle: Language barrier
- Create a glossary of terms, definitions, acronyms, and abbreviations
- Rephrase the expert’s statements in your own words
- Use explorative tools

Example: mind maps / cognitive maps to express influencing factors
- Nodes are properties of interest
- The variable of interest goes into the center
- Only direct influences are represented by arcs
- Properties should be chosen such that relationships can be understood as positively or negatively correlated (the higher X, the higher Y, etc.)

Common obstacle: Objective too general


Page 14: CS626 Data Analysis and Simulation

Example: Cognitive Map of Customer Modeling


Figure: Cognitive map. Nodes: product demand, gender, mobility, shopping opportunities, competition (stores), price competitiveness, brand loyalty, product affinity, product quality, frequency of product in shopping basket, size of household, age, income, affordability, educational background, range of offered substitutes. Arc legend: positively correlated, negatively correlated, relationship depends on product.
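A cognitive map of this kind can be written down as a signed directed graph. The sketch below uses node names from the figure, but the arcs and their signs are hypothetical placeholders: the extracted text does not say which nodes are connected. The helper names `direct_influences` and `path_sign` are likewise made up.

```python
# Arcs point from an influencing property to the property it directly influences.
# Sign +1: positively correlated, -1: negatively correlated,
# 0: relationship depends on the product.
# All arcs below are hypothetical; they are NOT read off the original figure.
arcs = {
    ("income", "affordability"): +1,
    ("affordability", "product demand"): +1,
    ("competition (stores)", "product demand"): -1,
    ("product quality", "brand loyalty"): +1,
    ("brand loyalty", "product demand"): +1,
    ("gender", "product affinity"): 0,
    ("product affinity", "product demand"): +1,
}

def direct_influences(target):
    """Return {source: sign} for all arcs that end in the target node."""
    return {src: sign for (src, dst), sign in arcs.items() if dst == target}

def path_sign(path):
    """Net sign along a chain of arcs: the product of the individual arc signs.

    A result of 0 means the net effect cannot be stated in general.
    """
    sign = 1
    for src, dst in zip(path, path[1:]):
        sign *= arcs[(src, dst)]
    return sign

print(direct_influences("product demand"))
print(path_sign(["income", "affordability", "product demand"]))
```

Storing only direct influences keeps the map small; indirect effects, such as income on product demand, follow from chaining arcs.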


Page 15: CS626 Data Analysis and Simulation

Project Understanding
- Determine project objective(s)
- Determine criteria for success: how to decide if the project goals are reached?

Common obstacle: Language barrier
- Create a glossary of terms, definitions, acronyms, and abbreviations
- Rephrase the expert’s statements in your own words
- Use explorative tools
- Example: mind maps / cognitive maps to express influencing factors

Common obstacle: Objective too general
- Sketch the target use to make clearer what a useful result has to look like (technical report with insights <-> software tool for analysis)


Page 16: CS626 Data Analysis and Simulation

Determine Project Objective


The aim of the project should be clearly defined.

Criteria to measure the success of the project should be defined.

Example
- objective: increase revenues (per campaign and/or per customer) in direct mailing campaigns by personalized offers and individual customer selection
- deliverable: software that automatically selects a specified number of customers from the database to whom the mailing shall be sent; runtime max. half a day for a database of the current size
- success criteria: improve order rate by 5% or total revenues by 5%, measured within 4 weeks after the mailing was sent, compared to the rate of the last three mailings
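A success criterion stated this precisely can be checked mechanically. The sketch below makes two assumptions that the slide leaves open: "5%" is read as a relative improvement, and the baseline is the mean over the last three mailings. All figures and function names are invented for illustration.

```python
from statistics import mean

def relative_improvement(new_value, baseline_values):
    """Relative change of new_value against the mean of the baseline values."""
    baseline = mean(baseline_values)
    return (new_value - baseline) / baseline

def campaign_succeeded(order_rate, past_order_rates, revenue, past_revenues,
                       threshold=0.05):
    """True if order rate OR total revenue improved by at least the threshold."""
    return (relative_improvement(order_rate, past_order_rates) >= threshold
            or relative_improvement(revenue, past_revenues) >= threshold)

# Made-up results of the last three mailings (the baseline).
past_order_rates = [0.020, 0.022, 0.018]
past_revenues = [100_000.0, 110_000.0, 90_000.0]

# Made-up result of the new mailing, measured 4 weeks after it was sent.
ok = campaign_succeeded(0.0215, past_order_rates, 101_000.0, past_revenues)
print("success criteria met:", ok)
```

Writing the check down early exposes ambiguities, such as relative versus absolute improvement, that are better settled with the project owner before the analysis starts.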


Page 17: CS626 Data Analysis and Simulation

Assess the Situation to Estimate Chances for Success


Requirements and constraints
- model requirements, e.g. the model has to be explanatory
- ethical, political, legal issues, e.g. variables such as gender, age, race must not be used
- technical constraints, e.g. time limits

Assumptions
- representativeness: the sample represents the whole.
- informativeness: the model includes all important information.
- good data quality
- presence of external factors: the external world is not changing.
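A constraint such as "gender, age, race must not be used" can be enforced before any modeling starts. This is a minimal sketch with made-up customer records and a hypothetical helper `remove_prohibited`; it only drops the named attributes and does nothing else.

```python
# Attributes that must not be used, taken from the example constraint above.
PROHIBITED = {"gender", "age", "race"}

def remove_prohibited(records, prohibited=PROHIBITED):
    """Return copies of the records without the prohibited attributes."""
    return [
        {key: value for key, value in record.items() if key not in prohibited}
        for record in records
    ]

# Made-up customer records.
customers = [
    {"id": 1, "gender": "f", "age": 34, "orders_last_year": 5},
    {"id": 2, "gender": "m", "age": 51, "orders_last_year": 2},
]

cleaned = remove_prohibited(customers)
print(cleaned)
```

The original records are left unchanged, so the restricted view can be handed to the modeling step while the full data stay available for other permitted uses.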


Page 18: CS626 Data Analysis and Simulation

Determine Analysis Goals


Determine data mining tasks (classification, regression, cluster analysis, finding associations, deviation analysis, ...).

Specify the requirements for the models:
- Interpretability
- Reproducibility/stability
- Model flexibility/adequacy
- Runtime
- Interestingness and use of expert knowledge


Page 19: CS626 Data Analysis and Simulation

Summary

Today: Data Analysis
- Overview, Ch 1
- Project Understanding, Ch 3

Future
- Data Understanding, Ch 4
- Data Preparation, Ch 6
- Modeling, Ch 7-9
- Deployment, Ch 10
