CS626 Data Analysis and Simulation
Today: Data Analysis
Reference: Berthold, Borgelt, Hoeppner, Klawonn: Guide to Intelligent Data Analysis, Chapters 1, 3
Instructor: Peter Kemper R 104A, phone 221-3462, email:[email protected]
Data vs Knowledge
Data
- refer to single instances
- describe individual properties
- are often available in large quantities
- are often easy to collect/to obtain
- do not allow us to make predictions/forecasts

Knowledge
- refers to classes of instances (sets of objects, people, ...)
- describes general patterns, structures, laws, principles
- consists of as few statements as possible (explicit goal!)
- is often difficult and time-consuming to find or to obtain
- allows us to make predictions/forecasts

Goal: given the data, find knowledge
Data Analysis Process
CRISP-DM: CRoss Industry Standard Process for Data Mining
Data Understanding
Compendium slides for “Guide to Intelligent Data Analysis”, Springer 2011.
© Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn and Iris Ada
Model, model architecture
- refers to a mathematical representation of data to represent knowledge, for example
  - interpretable: a linear regression
  - blackbox: an artificial neural network
Data validity: data is correct, accurate, representative
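To make the "interpretable" case concrete, here is a minimal sketch of a linear regression fitted by ordinary least squares. The function name `fit_line` and the four data points are made up for illustration; they do not come from the course material. The point is only that the fitted knowledge consists of two readable numbers.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x/y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

The whole model is the pair (intercept, slope), which a domain expert can read and question directly. A neural network fitted to the same data would predict equally well here but would not expose such a statement.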
Scheme of a simulation study (Law/Kelton, p 84)
Conceptual model, programmed model
- refers to a simulation model for a discrete event dynamic system
Model validity: model matches the real system with respect to the measures of interest
Simulation model as data generator
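The idea of a simulation model as a data generator can be sketched with a single-server FIFO queue. This is a simplified illustration, not the scheme from Law/Kelton: it advances customer by customer instead of maintaining an event list, and the function name, the rates, and the seed are all assumptions chosen for the example.

```python
import random


def simulate_queue(num_customers, arrival_rate, service_rate, seed=0):
    """Simulate a single-server FIFO queue; return one record per customer."""
    rng = random.Random(seed)      # fixed seed makes the generated data reproducible
    records = []
    clock = 0.0                    # arrival time of the current customer
    server_free_at = 0.0           # time at which the server finishes its current job
    for customer in range(num_customers):
        clock += rng.expovariate(arrival_rate)   # exponential interarrival time
        start = max(clock, server_free_at)       # wait if the server is still busy
        service = rng.expovariate(service_rate)  # exponential service time
        server_free_at = start + service
        records.append({
            "customer": customer,
            "arrival": clock,
            "wait": start - clock,
            "service": service,
        })
    return records


data = simulate_queue(num_customers=1000, arrival_rate=0.8, service_rate=1.0)
mean_wait = sum(r["wait"] for r in data) / len(data)
print(f"{len(data)} records, mean wait {mean_wait:.2f}")
```

The list `data` plays the same role as a collected data set: it can be handed to the data analysis steps that follow, and the waiting time would be one possible measure of interest for judging model validity.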
How does a simulation study relate to CRISP-DM?
Overview: GIDA Book follows CRISP-DM
Project Understanding, Ch 3
Data Understanding, Ch 4
Data Preparation, Ch 6
Modeling, Ch 7-9
Deployment, Ch 10
Categories for Data Analysis Problems
Classification
- Predict the outcome of an experiment with a finite number of results
- Example: observing spontaneous calcium activity -> gene present?
Regression
- Like classification, but predicted value is numeric
- Example: How much will college tuition cost next year?
Clustering, Segmentation
- Summarize the data by forming groups of similar cases
- Example: Do cells divide into different groups according to Ca activity?
Association Analysis
- Find correlations/associations to better understand/describe interdependencies of all attributes. Focus: relationships
- Example: Which optional equipment of a car often goes together?
Deviation Analysis
- Knowing major trends/structures, find any exceptional subgroup that behaves differently
- Example: Under which circumstances does the system fail?
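As one illustration of the categories above, clustering can be sketched as a one-dimensional two-means procedure: assign each value to the nearer of two centers, then move each center to the mean of its group. The function name and the six activity values are invented for this sketch; real calcium data would be time series, not single numbers.

```python
def two_means_1d(values, iterations=20):
    """Split 1-D values into two groups by alternating assignment and mean update."""
    centers = [min(values), max(values)]   # start from the two extremes
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to the nearer center (ties go to the first).
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Made-up activity summaries for six cells.
activity = [0.9, 1.1, 1.0, 4.8, 5.2, 5.0]
centers, groups = two_means_1d(activity)
print(centers, groups)
```

No target attribute is involved: the procedure only summarizes the data by similarity, which is what distinguishes clustering from classification and regression.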
Catalog of Methods
Finding Patterns
- Explore unknown data to learn new, previously unknown patterns.
- Not focused on a particular target attribute.
- May apply techniques from segmentation, clustering, association analysis or deviation analysis
Finding Explanations
- Special interest in a target variable, figure out why and how it varies from case to case.
- May apply techniques from classification, regression, association analysis or deviation analysis
Finding Predictors
- Special interest in the prediction of a target variable, but not so much in understanding why and how it varies
- May apply techniques from classification and regression
Project Understanding
General
- Problem formulation
- Mapping the problem formulation to a data analysis task
- Understanding the situation (available data, suitability of the data, ...)
The 80-20 Rule!
- Average time spent for project and data understanding within the CRISP-DM model: 20%
- Importance for success: 80%
Why is that?
- Trivial: Success is to solve the right problem and to solve it right ...
- But: selection of model may impact prep work on data and outcome of evaluation phase (insight vs black-box model) => better use understanding phase to plan overall project and outcome
Project Understanding
Margaret Saha, Biology Dept.:
Big question: How do neurons acquire their specific identities?
More detailed questions for given data set:
- Do specific patterns of spontaneous calcium activity (calcium entering the cell) contribute to the acquisition of phenotype (i.e. identity)?
- Are there developmental trends in the patterns of calcium activity?
Given data set:
- per (duration: 2h, 12h) x (stages: 14, 18, 22) x (genes: a1A, a1B, a1C, a1H) individual experiments
- for sets of cells with positive, negative or unknown gene expression
- for each cell a time series of 890 values (after applying moving average filter)
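The structure of this data set can be sketched in a few lines. The sketch assumes that every combination of duration, stage and gene was actually run, which the slide does not confirm, and the window length of the moving average filter is not given in the source, so the `window` argument below is a free parameter of the illustration.

```python
from itertools import product

durations = ["2h", "12h"]
stages = [14, 18, 22]
genes = ["a1A", "a1B", "a1C", "a1H"]

# One experiment per combination, assuming a fully crossed design.
conditions = list(product(durations, stages, genes))
print(len(conditions))  # 2 * 3 * 4 = 24


def moving_average(series, window):
    """Return the means of all full windows of the given length."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    total = sum(series[:window])
    smoothed = [total / window]
    for i in range(window, len(series)):
        # Slide the window: add the entering value, drop the leaving one.
        total += series[i] - series[i - window]
        smoothed.append(total / window)
    return smoothed


print(moving_average([1, 2, 3, 4, 5], window=3))  # [2.0, 3.0, 4.0]
```

A filter of this kind shortens a series by `window - 1` values, so the raw recordings behind the 890 values per cell were presumably somewhat longer.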
Project understanding?
Project Understanding
- Determine Project Objective(s)
- Determine criteria for success: How to decide if project goals are reached?
Common obstacle: Language barrier
Common obstacle: Objective too general
Determine Project Objective
Table: Problems faced in data analysis projects.

Problem source: communication
  Project owner perspective: project owner does not understand the technical terms of the analyst
  Analyst perspective: analyst does not understand the terms of the domain of the project owner

Problem source: lack of understanding
  Project owner perspective: project owner was not sure what the analyst could do or achieve; models of analyst were different from what the project owner envisioned
  Analyst perspective: analyst found it hard to understand how to help the project owner

Problem source: organization
  Project owner perspective: requirements had to be adapted in later stages as problems with the data became evident
  Analyst perspective: project owner was an unpredictable group (not so concerned with the project)
Problems experienced in data analysis projects
Project Understanding
- Determine Project Objective(s)
- Determine criteria for success: How to decide if project goals are reached?
Common obstacle: Language barrier
- Create glossary of terms, definitions, acronyms, and abbreviations
- Rephrase expert's statements in own words
- Use explorative tools
- Example: mind maps / cognitive maps to express influencing factors
  - Nodes are properties of interest
  - Variable of interest goes into the center
  - Only direct influences are represented by arcs
  - Properties should be chosen such that relationships can be understood as positively or negatively correlated (the higher X the higher Y etc)
Common obstacle: Objective too general
Example: Cognitive Map of Customer Modeling
[Figure: Cognitive Map. Arcs are marked as positively correlated, negatively correlated, or "relationship depends on product". Nodes: product demand, gender, mobility, shopping opportunities, competition (stores), price competitiveness, brand loyalty, product affinity, product quality, frequency of product in shopping basket, size of household, age, income, affordability, educational background, range of offered substitutes.]
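A cognitive map of this kind can be stored as a signed, directed graph. The sketch below is only an illustration of the data structure: the node names are taken from the figure, but the three arcs and their signs are guesses made for the example, since the arcs cannot be recovered from the transcript. The helper names are likewise hypothetical.

```python
# Signed, directed influence graph: (source, target) -> +1 or -1.
# The three arcs below are illustrative guesses, not read off the slide.
influences = {
    ("income", "affordability"): +1,
    ("affordability", "product demand"): +1,
    ("competition (stores)", "product demand"): -1,
}


def direct_influences(graph, target):
    """List (source, sign) pairs for arcs pointing directly at target."""
    return sorted((src, sign) for (src, dst), sign in graph.items() if dst == target)


def path_sign(graph, path):
    """Sign of an indirect influence: product of the arc signs along the path."""
    sign = 1
    for src, dst in zip(path, path[1:]):
        sign *= graph[(src, dst)]
    return sign


print(direct_influences(influences, "product demand"))
print(path_sign(influences, ["income", "affordability", "product demand"]))
```

Storing only direct influences keeps the map small. Indirect effects follow from multiplying signs along a path, which works only because each relationship was phrased as positively or negatively correlated.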
Project Understanding
- Determine Project Objective(s)
- Determine criteria for success: How to decide if project goals are reached?
Common obstacle: Language barrier
- Create glossary of terms, definitions, acronyms, and abbreviations
- Rephrase expert's statements in own words
- Use explorative tools
- Example: mind maps / cognitive maps to express influencing factors
Common obstacle: Objective too general
- Sketch target use to make more clear what a useful result has to look like (technical report with insights <-> software tool for analysis)
Determine Project Objective
- The aim of the project should be clearly defined.
- Criteria to measure the success of the project should be defined.
Example
- objective: increase revenues (per campaign and/or per customer) in direct mailing campaigns by personalized offer and individual customer selection
- deliverable: software that automatically selects a specified number of customers from the database to whom the mailing shall be sent, runtime max. half-day for database of current size
- success criteria: improve order rate by 5% or total revenues by 5%, measured within 4 weeks after mailing was sent, compared to rate of last three mailings
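A success criterion like the one in this example is only useful if it can be checked mechanically. The sketch below shows one possible reading, under two assumptions that the slide leaves open: "5%" is taken as a relative improvement, and the baseline is the mean over the last three mailings. The function names and all numbers are invented.

```python
def relative_improvement(new_value, previous_values):
    """Relative change of new_value against the mean of previous_values."""
    baseline = sum(previous_values) / len(previous_values)
    return (new_value - baseline) / baseline


def campaign_succeeded(new_order_rate, last_order_rates,
                       new_revenue, last_revenues, threshold=0.05):
    """True if order rate OR total revenue improved by at least the threshold."""
    return (relative_improvement(new_order_rate, last_order_rates) >= threshold
            or relative_improvement(new_revenue, last_revenues) >= threshold)


# Made-up figures: order rate rises from 2.0% to 2.2%, revenue by only 1%.
ok = campaign_succeeded(
    new_order_rate=0.022, last_order_rates=[0.020, 0.020, 0.020],
    new_revenue=101_000, last_revenues=[100_000, 100_000, 100_000],
)
print(ok)
```

Writing the check down forces the open questions into the discussion with the project owner: relative or absolute improvement, which baseline, and which measurement window.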
Assess the Situation to Estimate Chances for Success
Assess the situation
Requirements and constraints
- model requirements, e.g. model has to be explanatory
- ethical, political, legal issues, e.g. variables such as gender, age, race must not be used
- technical constraints, e.g. time limits
Assumptions
- representativeness: The sample represents the whole.
- informativeness: The model includes all important information.
- good data quality
- presence of external factors: The external world is not changing.
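A constraint such as "gender, age, race must not be used" can be enforced before any modeling starts. The following is a minimal sketch assuming records are stored as dictionaries; the attribute names in `FORBIDDEN` mirror the slide, while the sample record and the function name are invented.

```python
# Attributes excluded for ethical, political or legal reasons (from the slide).
FORBIDDEN = {"gender", "age", "race"}


def strip_forbidden(record, forbidden=FORBIDDEN):
    """Return a copy of the record without the forbidden attributes."""
    return {key: value for key, value in record.items() if key not in forbidden}


# Made-up customer record.
customer = {"id": 17, "age": 42, "gender": "f", "income": 51000, "orders": 3}
print(strip_forbidden(customer))
```

Applying the filter once, at the point where data enters the project, is easier to audit than relying on every later modeling step to skip these attributes.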
Determine Analysis Goals
- Determine data mining tasks (classification, regression, cluster analysis, finding associations, deviation analysis, ...).
- Specify the requirements for the models:
  - Interpretability
  - Reproducibility/stability
  - Model flexibility/adequacy
  - Runtime
  - Interestingness and use of expert knowledge
Summary
Today: Data Analysis
- Overview, Ch 1
- Project Understanding, Ch 3
Future
- Data Understanding, Ch 4
- Data Preparation, Ch 6
- Modeling, Ch 7-9
- Deployment, Ch 10