Support Vector Machines and
Predictive Data Modeling
Vladimir Cherkassky
Electrical and Computer Engineering, University of Minnesota
cherk001@umn.edu
Presented at Tech Tune Ups, ECE Dept, June 1, 2011
Acknowledgements
Research on Predictive Learning supported by
• NSF grant ECCS-0802056
• The A. Richard Newton Breakthrough Research Award from Microsoft Research
Joint work with grad students F. Cai & S. Dhar
Parts of this presentation are from the books:
• Introduction to Predictive Learning, by Cherkassky and Ma, Springer 2011
• Learning from Data, by Cherkassky and Mulier, Wiley 2007
OUTLINE
Introduction + Motivation
4 parts of this course:
• Philosophy, induction and predictive data modeling
• Support vector machines (SVM)
• SVM practical issues and applications
• Advanced SVM-based learning technologies
Motivation 1
Two critical points:
(1) Humans cannot reason about uncertainty in a rational way (examples)
(2) Humans and animals have excellent biological capabilities to cope with uncertainty and risk (examples)
Motivation 2
• Growth of data in the digital age
• Is it possible to extract knowledge from this data? – philosophical and cultural implications
• How to extract knowledge from data? – business and technological aspects
• Is this a natural domain of statistics?
Motivation 3: biological learning
• Rosenblatt’s Perceptron (early 1960’s)
- an early attempt to simulate biological learning (simple learning algorithm for a linear classifier)
• Young scientists in Moscow tried to understand generalization properties of such ‘machines’ and developed new statistical learning theory
Motivation 4: why SVM?
• Support Vector Machines
  - developed in the USSR in mid-1960’s
  - later introduced in the West in mid-1990’s
  - currently the most widely used method for modeling high-dimensional data
  - based on a new mathematical theory different from classical statistics
• VC-theory also provides philosophical framework for ‘learning from data’
• This new predictive modeling methodology is still poorly understood
PART 1: Philosophy, induction and predictive data modeling
• Understanding uncertainty and risk
• Induction and knowledge discovery
• Philosophy and statistical learning
• Predictive learning approach
• Introduction to VC-theory
Understanding Uncertainty
• Humans tend to avoid uncertainty, and try to explain unpredictable events
  Aristotle: All men by nature desire knowledge
• Learning ~ discovering regularities from data
• Ancient cultures, e.g. the Ancient Greeks, had no formal concepts related to randomness: unpredictable events (wars, natural disasters etc.) were thought to be controlled by Gods or Fate.
• In modern society, religion has been replaced by science and pseudo-science
Gods, Prophets and Shamans
Science and Uncertainty
• Math, Logic and Science are about certainty ~ deterministic rules
• Probability and empirical data: involve uncertainty ~ inferior knowledge
This view dominates modern science, i.e.:
• True scientific knowledge consists of deterministic Laws of Nature
• There is a (true, causal) model explaining a given natural phenomenon (e.g., a disease)
Causal Determinism in Science
• Popular view of science
- deterministic rules (laws of Nature)
- reflects objective reality (single truth)
- knowledge inferred from (observed) data
• Digital technology enables growth of data
Can expect rapid growth of knowledge by applying (statistical, data mining etc.) algorithms to this data
• Reality is more sobering (as usual)
Popular Hype: the data deluge makes scientific method obsolete
• Wired Magazine, 16/07: We can stop looking for (scientific) models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
• Early Detection of Cancer (or other diseases):Massive data analysis of cancer samples in order to identify unique proteins for tens of thousands of types of cancer. The goal is that (in the future) we can all be screened for these proteins as early warning signals for cancer.
REALITY
• Many studies have questionable value
  - statistical correlation vs causation
• Some border on stupidity / pseudoscience
  - US scientists at SUNY discovered an Adultery Gene !!! (based on a sample of 181 volunteers interviewed about their sexual life)
• Usual conclusion: more research is needed …
Some Views on Science
• Karl Popper: Science starts from problems, and not from observations
• Werner Heisenberg: What we observe is not nature itself, but nature exposed to our method of questioning
• Albert Einstein: Reality is merely an illusion, albeit a very persistent one.
Scientific Discovery
• Always involves ideas (models) and facts (data)
• Classical first-principle knowledge:
  hypothesis → data → scientific theory
  Note: deterministic, simple models
• Modern data-driven discovery:
  computer program + DATA → knowledge
  Note: statistical, complex systems
• Two philosophies, poorly understood
COMPLEX SYSTEMS
• A. Einstein: When the number of factors coming into play in a phenomenological complex is too large, scientific method in most cases fails us.
Example: weather prediction
• Does digital technology make Einstein’s claim obsolete?
Examples of Complex Systems
• Life Sciences
• Healthcare
• Climate modeling
• Social Systems (e.g. financial markets)
Attempts to understand and model such systems using a deterministic approach usually fail
Problem of Induction in Philosophy
• Francis Bacon: advocated empirical (inductive) knowledge vs scholastic knowledge
• David Hume: What right do we have to assume that the future will be like the past?
• Philosophy of Science tries to resolve this dilemma/contradiction between deterministic logic and the uncertain nature of empirical data.
• Digital Age: with the growth of empirical data, this dilemma becomes important in practice.
What is ‘a good model’?
• All models are mental constructs that (hopefully) relate to the real world
• Two goals of data-driven modeling:
  - explain available data
  - predict future data
• All good (scientific) models make non-trivial predictions
→ Good data-driven models can predict well, so the goal is to estimate predictive models
Three Types of Knowledge
• Growing role of empirical knowledge
• Classical philosophy of science differentiates only between (first-principle) science and beliefs (demarcation problem)
• Importance of demarcation between empirical knowledge and beliefs in applications
Examples of Nonscientific Beliefs
• Aristotle’s science
  - everything is a mix of 4 basic elements: earth, water, air and fire
• Geocentric system of the world
• Origin of life (spontaneous generation)
- disproved by L. Pasteur in 19th century
• Modern belief: every medical condition can be traced to genetic variations
- is it a popular belief or scientific theory ?
Popper’s Demarcation Principle
Karl Popper: Every true (inductive) theory prohibits certain events or occurrences, i.e. it should be falsifiable
• First-principle scientific theories vs beliefs or metaphysical theories
• Risky prediction, testability, falsifiability
Popper’s conditions for scientific hypothesis
- Should be testable
- Should be falsifiable
Example 1: Efficient Market Hypothesis (EMH): The prices of securities reflect all known information that impacts their value
Example 2: We do not see our noses, because they all live on the Moon
Predictive Learning: Formalization
Given: data samples ~ training data (x, y)
Estimate: a model, or function, f(x) that
- explains this data and
- can predict future data
Classification problem:
Learning ~ function estimation
Application Example: predicting gender of face images
• Training data: labeled face images
  [Example images labeled Male (etc.) and Female (etc.)]
Predicting Gender of Face Images
• Input ~ 32x32 pixel image
• Model ~ indicator function f(x) separating the 1024-dimensional pixel space into two halves
• Model should predict well new images
• Difficult machine learning problem, but easy for human recognition
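A minimal sketch of how such a classifier could be set up, assuming images are already available as 32×32 grayscale arrays; the synthetic data, the scikit-learn logistic-regression model, and all variable names are illustrative stand-ins, not the method used in the original example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 200 "face images" as 32x32 grayscale arrays with
# binary gender labels (+1 / -1). Real labeled images would replace this.
rng = np.random.default_rng(0)
images = rng.random((200, 32, 32))
labels = rng.choice([-1, 1], size=200)

# Each image becomes a 1024-dimensional input vector x
X = images.reshape(len(images), -1)                 # shape (200, 1024)

# Indicator-style model: sign of a linear function in pixel space
clf = LogisticRegression(max_iter=1000).fit(X, labels)
predicted = np.sign(clf.decision_function(X[:5]))   # predictions for "new" images
print(predicted)
```

Any linear (or nonlinear) classifier could stand in for the indicator function f(x) here; the point is only that each image is flattened into a 1024-dimensional input vector.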
Learning ~ Reliable Induction
Induction ~ function estimation from data:
Deduction ~ prediction for new (test) inputs:
Common Learning Problems
Classification, Regression
Note: explanation does not ensure prediction
Common Learning Problems
Unsupervised learning (e.g., clustering)
Note: many other types of problems exist.
All such problems ~ inductive learning setting
Generalization and Complexity Control
Consider regression estimation
• Ten training samples
• Fitting linear and 2nd-order polynomials to data generated as $y = x^2 + \xi$, where $\xi \sim N(0, \sigma^2)$, $\sigma = 0.25$
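A small sketch of this experiment in Python, assuming the reconstructed data model y = x² + noise with σ = 0.25; the sample values and random seed are illustrative.

```python
import numpy as np

# Ten training samples from the assumed target y = x^2 + noise, noise ~ N(0, 0.25^2)
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=10)
y = x**2 + rng.normal(0, 0.25, size=10)

# Fit first- and second-order polynomials by least squares and compare the fit
for degree in (1, 2):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    print(f"degree {degree}: training MSE = {np.mean(residuals**2):.4f}")
```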
Complexity Control (cont’d)
The same data set:
• Using k-nn regression with k=1 and k=4
Generalization depends on model complexity
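A brief k-nearest-neighbor regression sketch contrasting k=1 (the most complex model, which interpolates the training points) with k=4 (a smoother model); the data-generating model is again the assumed y = x² + noise from the previous slide.

```python
import numpy as np

def knn_regression(x_train, y_train, x_query, k):
    """Predict each query point as the average of its k nearest training responses."""
    preds = []
    for xq in np.atleast_1d(x_query):
        nearest = np.argsort(np.abs(x_train - xq))[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=10)                     # same kind of ten-sample data set
y = x**2 + rng.normal(0, 0.25, size=10)

x_new = np.linspace(0, 1, 5)
print("k=1:", knn_regression(x, y, x_new, k=1))    # interpolates: most complex model
print("k=4:", knn_regression(x, y, x_new, k=4))    # averages 4 neighbors: smoother model
```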
Complexity Control: issues
• Theoretical + conceptual
  - how to define model complexity
• Practical 1 - high-dimensional data
• Practical 2 - true model is not known → resampling for choosing optimal complexity
Model selection ~ choosing the optimal model complexity
Resampling
• Split available data into 2 sets: Training + Validation
  (1) Use the training set for model estimation (via data fitting)
  (2) Use the validation data to estimate the ‘prediction’ error of the model
• Change the model complexity index and repeat (1) and (2)
• Select the final model providing the lowest (estimated) prediction error
BUT results are sensitive to data splitting
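A minimal sketch of this holdout procedure, using polynomial degree as the complexity index; the split ratio, sample size, and the data model (y = x² + noise, as assumed above) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=30)
y = x**2 + rng.normal(0, 0.25, size=30)            # assumed data model, as before

# Split available data into Training + Validation sets
idx = rng.permutation(len(x))
train, val = idx[:20], idx[20:]

best_degree, best_err = None, np.inf
for degree in range(1, 8):                         # change complexity index and repeat
    coeffs = np.polyfit(x[train], y[train], degree)                 # (1) fit on training set
    val_err = np.mean((y[val] - np.polyval(coeffs, x[val])) ** 2)   # (2) validation error
    if val_err < best_err:
        best_degree, best_err = degree, val_err

print(f"selected degree: {best_degree}, estimated prediction error: {best_err:.4f}")
# Caveat from the slide: the result depends on how the data happened to be split.
```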
K-fold cross-validation
1. Divide the training data Z into k (randomly selected) disjoint subsets {Z1, Z2, …, Zk} of size n/k
2. For each ‘left-out’ validation set Zi:
   - use the remaining data to estimate the model $\hat{y} = f_i(\mathbf{x})$
   - estimate the prediction error on Zi:  $r_i = \frac{k}{n} \sum_{(\mathbf{x}_j, y_j) \in Z_i} \left( y_j - f_i(\mathbf{x}_j) \right)^2$
3. Estimate the average prediction risk as  $R_{cv} = \frac{1}{k} \sum_{i=1}^{k} r_i$
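A direct sketch of this procedure for squared-error loss, following the formulas as reconstructed above; the polynomial model used in the example call is just a placeholder for any fitting routine.

```python
import numpy as np

def kfold_cv_risk(x, y, k, fit, predict, seed=0):
    """Estimate R_cv = (1/k) * sum_i r_i via k-fold cross-validation (squared loss)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)   # k disjoint subsets Z_1..Z_k
    risks = []
    for i in range(k):
        val = folds[i]                                   # left-out validation set Z_i
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train], y[train])                  # estimate model on the rest
        r_i = np.mean((y[val] - predict(model, x[val])) ** 2)   # error on Z_i
        risks.append(r_i)
    return np.mean(risks)

# Example call: a 3rd-degree polynomial as the (placeholder) model
x = np.linspace(0, 1, 25)
y = np.sin(2 * np.pi * x) ** 2 + np.random.default_rng(1).normal(0, 0.3, 25)
R_cv = kfold_cv_risk(x, y, k=5,
                     fit=lambda xs, ys: np.polyfit(xs, ys, 3),
                     predict=np.polyval)
print("estimated prediction risk:", R_cv)
```

Here r_i is computed as the mean error over Z_i, which matches the (k/n)-weighted sum above when the folds have equal size n/k.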
Example of model selection
• 25 samples are generated as $y = \sin^2(2\pi x) + \xi$, with x uniformly sampled in [0,1], and noise $\xi \sim N(0,1)$
• Regression estimated using polynomials of degree m = 1, 2, …, 10
• Polynomial degree m = 5 is chosen via 5-fold cross-validation. The curve (figure omitted) shows the polynomial model, along with training (*) and validation (*) data points, for one partitioning.

m     Estimated R via cross-validation
1     0.1340
2     0.1356
3     0.1452
4     0.1286
5     0.0699
6     0.1130
7     0.1892
8     0.3528
9     0.3596
10    0.4006
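A sketch reproducing this experiment with scikit-learn (assumed available); the noise level and random seed are illustrative choices, so the selected degree and the estimated risk values will not match the table exactly.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(25, 1))                       # 25 samples, x uniform in [0,1]
y = np.sin(2 * np.pi * x[:, 0]) ** 2 + rng.normal(0, 0.3, 25)   # assumed noise level

estimated_R = {}
for m in range(1, 11):                                    # polynomial degrees 1..10
    model = make_pipeline(PolynomialFeatures(degree=m), LinearRegression())
    mse = -cross_val_score(model, x, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    estimated_R[m] = mse

best_m = min(estimated_R, key=estimated_R.get)            # degree with lowest estimated R
print("estimated R via 5-fold CV:", {m: round(r, 4) for m, r in estimated_R.items()})
print("selected polynomial degree:", best_m)
```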
Statistical vs Predictive Approach
• Binary classification problem: estimate the decision boundary from training data $(\mathbf{x}_i, y_i)$, where y ~ binary class label (-1/+1)
• Assuming the distribution P(x,y) is known:
[Figure: two-class training data plotted in the (x1, x2) input space]
Classical Statistical Approach
(1) The parametric form of the unknown distribution P(x,y) is known
(2) Estimate the parameters of P(x,y) from the training data
(3) Construct the decision boundary using the estimated distribution and the given misclassification costs
Modeling assumption: the distribution P(x,y) can be accurately estimated from the available data
[Figure: training data in the (x1, x2) space with the estimated boundary]
Predictive Approach
(1) The parametric form of the decision boundary f(x,w) is given
(2) Explain the available data by fitting f(x,w), i.e. minimizing some loss function (e.g., squared error)
(3) The function f(x,w*) providing the smallest fitting error is then used for prediction
Modeling assumptions:
- Need to specify f(x,w) and the loss function a priori
- No need to estimate P(x,y)
[Figure: training data in the (x1, x2) space with the estimated boundary]
Two Different Methodologies
• System Identification (~ classical statistics)
  - estimate a probabilistic model (class densities) from available data
  - use this model to make predictions
• System Imitation (~ biological learning)
  - need only predict well, i.e. imitate a specific aspect of the unknown system
  - multiplicity of good models
  - can they be interpreted and/or trusted?
• Which approach works for high-dimensional data?
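A toy sketch contrasting the two methodologies on the same data, using a density-based classifier for system identification and a directly fitted linear boundary for system imitation; the scikit-learn estimators and the synthetic data are illustrative choices, not the specific methods referenced in the slides.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import LinearSVC

# Toy two-class data (illustrative only)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# System identification: estimate class densities P(x|y), derive the boundary from them
identification = QuadraticDiscriminantAnalysis().fit(X, y)

# System imitation: fit the decision boundary f(x, w) directly, no density estimate
imitation = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

print("density-based (identification) accuracy:", identification.score(X, y))
print("direct-boundary (imitation) accuracy:  ", imitation.score(X, y))
```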
Classification with High-Dimensional Data
• Digit recognition 5 vs 8: each example ~ 32 x 32 pixel image → 1,024-dimensional vector x
• Medical analogy
  - Each pixel ~ genetic marker
  - Each patient (sample) described by 1024 genetic markers
  - Two classes ~ presence/absence of a disease
• Estimation of P(x,y) with finite data is not possible
• Accurate estimation of a decision boundary in 1024-dimensional space is possible, using just a few hundred samples
Statistical vs Predictive: Discussion
• Classical statistics has modeling goals:
  - interpretable model explaining the data
  - few important input variables (risk factors)
  - prediction performance is not verified but (usually) assumed – why?
• Predictive modeling has different goals:
  - prediction (generalization) is the main goal
  - prediction accuracy is measured/reported
  - model interpretation is not important, as it cannot be objectively evaluated
PART 1: Philosophy, induction and predictive data modeling
• Understanding uncertainty and risk
• Induction and knowledge discovery
• Philosophy and statistical learning
• Predictive learning approach
• Introduction to VC-theory
Empirical Risk Minimization
• ERM principle for learning
  – Model parameterization: f(x, w)
  – Loss function: L(f(x, w), y)
  – Estimate risk from data:  $R_{emp}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} L(f(\mathbf{x}_i, \mathbf{w}), y_i)$
  – Choose w* that minimizes R_emp
  → model f(x, w*) explains past data
• ERM principle ~ biological approach
• Statistical Learning Theory (aka VC-theory): under what conditions will ERM-style models generalize (predict) well?
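A minimal ERM sketch with a linear parameterization f(x, w) = w·x and squared loss; the toy data and the plain gradient-descent optimizer are illustrative assumptions, not part of the slides.

```python
import numpy as np

def empirical_risk(w, X, y):
    """R_emp(w) = (1/n) * sum_i L(f(x_i, w), y_i), squared loss, linear f(x, w) = w.x"""
    return np.mean((X @ w - y) ** 2)

# Toy regression data (illustrative)
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=100)

# Choose w* that minimizes R_emp (plain gradient descent on the empirical risk)
w = np.zeros(3)
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= 0.1 * grad

print("w* =", np.round(w, 3), " R_emp(w*) =", round(empirical_risk(w, X, y), 4))
```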
Inductive Learning Setting
• The learning machine observes samples (x, y) and returns an estimated response $\hat{y} = f(\mathbf{x}, \mathbf{w})$
• Recall ‘first-principles’ vs ‘empirical’ knowledge
  Two types of inference: identification vs imitation
• Risk:  $R(\mathbf{w}) = \int Loss(y, f(\mathbf{x}, \mathbf{w})) \, dP(\mathbf{x}, y) \rightarrow \min$
[Diagram: a Generator of samples produces inputs x, the System responds with outputs y, and the Learning Machine observes (x, y) and produces the estimate ŷ]
VC-theory basics - 1
Goals of Predictive Learning
- explain (or fit) available training data
- predict well future (yet unobserved) data
Similar to biological learning
Example: given 1, 3, 7, …, predict the rest of the sequence.
Rule 1: $x_{k+1} = 2 x_k + 1$
Rule 2: randomly chosen odd numbers
Rule 3: $x_k = k^2 - k + 1$
BUT for the sequence 1, 3, 7, 15, 31, 63, …, Rule 1 seems very reliable (why?)
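A tiny check of the two explicit rules on the observed sequence (the rule formulas above are reconstructed from the garbled slide, so treat them as assumptions): both fit 1, 3, 7, but only Rule 1 continues to match 15, 31, 63.

```python
def rule1(first, length):
    """Rule 1: x_{k+1} = 2*x_k + 1, starting from the first observed value."""
    seq = [first]
    while len(seq) < length:
        seq.append(2 * seq[-1] + 1)
    return seq

def rule3(length):
    """Rule 3: x_k = k^2 - k + 1."""
    return [k * k - k + 1 for k in range(1, length + 1)]

observed = [1, 3, 7, 15, 31, 63]
print(rule1(1, 6))   # [1, 3, 7, 15, 31, 63] -> consistent with all observations
print(rule3(6))      # [1, 3, 7, 13, 21, 31] -> fits the first three values only
```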
VC-theory basics - 2
Main practical result of VC-theory:
If a model explains past data well AND is simple, then it can predict well
• This explains why Rule 1 is a good model for the sequence 1, 3, 7, 15, 31, 63, …
• Measure of model complexity ~ VC-dimension: the ability to explain the past data 1, 3, 7, 15, 31, 63 BUT not all other possible sequences → low VC-dimension (~ large falsifiability)
• For linear models, VC-dimension = DoF (as in statistics)
• But for nonlinear models they are different
VC-theory basics - 3
Strategy for modeling high-dimensional data:
Find a model f(x) that explains past data AND has low VC-dimension, even when the dimensionality is large
→ SVM approach:
Large margin = low VC-dimension ~ easy to falsify
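A short illustration of the large-margin idea with a linear SVM on separable toy data (scikit-learn assumed); for a fitted weight vector w, the margin width is 2/||w||, so a larger margin corresponds to a simpler, more falsifiable model.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes (illustrative toy data)
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-2, 0.5, size=(30, 2)), rng.normal(2, 0.5, size=(30, 2))])
y = np.array([-1] * 30 + [1] * 30)

svm = SVC(kernel="linear", C=1e3).fit(X, y)      # large C ~ (nearly) hard margin

w = svm.coef_[0]
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
print("number of support vectors:", len(svm.support_))
# Larger margin ~ lower VC-dimension ~ a simpler, easier-to-falsify model
```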
SUMMARY & DISCUSSION
• Predictive data modeling:
  - training data similar to future (test) data
  - performance index / loss function
  - predictive methodology is different from classical statistics
  - there may not be a single true model
  - ‘conventional’ model interpretation is hard
• Understanding of uncertainty and risk:
  - changing due to technological advances
  - cultural and ethical issues