Introduction to Statistics and Machine Learning
Helge Voss, GSI Power Week, Dec 5-9 2011

How do we
• understand
• interpret
our measurements? And how do we get the data for our measurements?
Outline

• Multivariate classification/regression algorithms (MVA)
  • motivation
  • another introduction / repeating the ideas of hypothesis tests in this context
• Multidimensional Likelihood (kNN: k-Nearest Neighbour)
• Projective Likelihood (naïve Bayes)
• What to do with correlated input variables?
  • Decorrelation strategies
• MVA literature / software packages … a biased selection
Software Packages for Multivariate Data Analysis / Classification

Individual classifier software:
• e.g. “JETNET”: C. Peterson, T. Rognvaldsson, L. Loennblad — and many other packages

Attempts to provide “all inclusive” packages:
• StatPatternRecognition: I. Narsky, arXiv: physics/0507143, http://www.hep.caltech.edu/~narsky/spr.html
• TMVA: Höcker, Speckmayer, Stelzer, Therhaag, von Toerne, Voss, arXiv: physics/0703039, http://tmva.sf.net or every ROOT distribution (development moved from SourceForge to the ROOT repository)
• WEKA: http://www.cs.waikato.ac.nz/ml/weka/
• “R”, a huge data analysis library: http://www.r-project.org/

Literature:
• T. Hastie, R. Tibshirani, J. Friedman, “The Elements of Statistical Learning”, Springer 2001
• C. M. Bishop, “Pattern Recognition and Machine Learning”, Springer 2006

Conferences: PHYSTAT, ACAT, …
Event Classification
A linear boundary? A nonlinear one? Rectangular cuts?

[Three scatter plots of Signal (S) and Background (B) events in the (x1, x2) plane, showing rectangular-cut, linear, and nonlinear decision boundaries.]

Suppose we have a data sample with two types of events, carrying class labels Signal and Background. (We restrict ourselves here to the two-class case; many classifiers can in principle be extended to several classes, otherwise analyses can be staged.) We have discriminating variables x1, x2, … — how do we set the decision boundary to select events of type S?

How can we decide what to use? Once we have decided on a class of boundaries, how do we find the “optimal” one?

Low-variance (stable), high-bias methods ↔ high-variance, small-bias methods.
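To make the bias/variance contrast concrete, here is a minimal toy sketch (my own illustration, not from the lecture): rectangular cuts versus a linear (Fisher) discriminant on two correlated Gaussian classes. All sample parameters and cut values are invented for the example.

```python
# Toy comparison: rectangular cuts vs. a linear (Fisher) discriminant.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
signal = rng.multivariate_normal([1.0, 1.0], [[1.0, 0.5], [0.5, 1.0]], n)
backgr = rng.multivariate_normal([-1.0, -1.0], [[1.0, 0.5], [0.5, 1.0]], n)

def cuts(x):                      # rectangular cuts: accept if x1 > 0 and x2 > 0
    return (x[:, 0] > 0) & (x[:, 1] > 0)

# Fisher discriminant: w = W^-1 (mu_S - mu_B), W = within-class covariance
W = 0.5 * (np.cov(signal.T) + np.cov(backgr.T))
w = np.linalg.solve(W, signal.mean(0) - backgr.mean(0))
thr = 0.5 * w @ (signal.mean(0) + backgr.mean(0))   # midpoint threshold

def fisher(x):
    return x @ w > thr

for name, sel in [("cuts", cuts), ("fisher", fisher)]:
    eff = sel(signal).mean()              # signal efficiency
    rej = 1.0 - sel(backgr).mean()        # background rejection
    print(f"{name:6s}  signal eff = {eff:.3f}   backgr rejection = {rej:.3f}")
```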
Regression
[Three panels of data points with fitted curves f(x) versus x: constant? linear? non-linear?]

How do we estimate a “functional behaviour” from a given set of “known measurements”? Assume for example “D” variables that somehow characterize the shower in your calorimeter, and we want the energy as a function of the calorimeter shower parameters.

[Plot: Energy versus Cluster Size.]

Seems trivial? The human brain has very good pattern-recognition capabilities! But what if you have many input variables?

If we had an analytic model (i.e. we knew the function is an nth-order polynomial), then we would know how to fit it (e.g. a maximum likelihood fit). But what if we just want to “draw any kind of curve” and parameterize it?
Regression: model functional behaviour

Assume for example “D” variables that somehow characterize the shower in your calorimeter. From Monte Carlo or a testbeam we get a data sample with measured cluster observables + known particle energy; the result is a calibration function (the energy == a surface in D+1-dimensional space).

[1-D example: f(x) versus x. 2-D example: a surface f(x,y) over the (x,y) plane, with events generated according to the underlying distribution.]

The better-known case is (linear) regression: fit a known analytic function, e.g. for the above 2-D example a reasonable function would be f(x,y) = ax² + by² + c.

What if we don’t have a reasonable “model”? Then we need something more general, e.g. piecewise-defined splines, kernel estimators, or decision trees to approximate f(x).

Note: the goal is NOT to “fit a parameter” but to provide a prediction of the function value f(x) for new measurements x (where f(x) is not known).
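As a small illustration of the two regimes (assumed toy data, not the lecture’s example): fitting an assumed analytic model versus “drawing a curve” with a smoothing spline when no model is assumed.

```python
# (a) assumed analytic model (polynomial; least squares = ML fit for Gaussian
# noise) vs. (b) a model-free smoothing spline.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 4.0, 200))
f_true = np.sin(2.0 * x) * np.exp(-0.3 * x)       # "unknown" target function
y = f_true + rng.normal(0.0, 0.1, x.size)         # noisy measurements

coeffs = np.polyfit(x, y, deg=5)                  # (a) 5th-order polynomial
poly = np.polyval(coeffs, x)

spline = UnivariateSpline(x, y, s=200 * 0.1**2)   # (b) smoothing spline
print("poly RMS:  ", np.sqrt(np.mean((poly - f_true)**2)))
print("spline RMS:", np.sqrt(np.mean((spline(x) - f_true)**2)))
```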
Event Classification
Each event, whether Signal or Background, has “D” measured variables.

Find a mapping from the D-dimensional input-observable (“feature”) space to a one-dimensional output class label. In its most general form: y = y(x); x = {x1, …, xD}: input variables; y(x): R^D → R.

Plotting (histogramming) the resulting y(x) values gives their distribution.

Who sees what this would look like for the regression problem?
Event Classification
Each event, whether Signal or Background, has “D” measured variables.

Find a mapping from the D-dimensional input/observable/“feature” space to a one-dimensional output, the class label: y(x): R^D → R, a “test statistic” in the D-dimensional space of input variables, with y(B) → 0 and y(S) → 1. The surface y(x) = const defines the decision boundary:

• y(x) > cut: signal
• y(x) = cut: decision boundary
• y(x) < cut: background

The distributions of y(x), PDF_S(y) and PDF_B(y), and in particular their overlap, determine the separation power and the purity; they are used to set the selection cut (efficiency and purity).
Classification ↔ Regression
Classification: Each event, whether Signal or Background, has “D” measured variables. y(x): R^D → R is a “test statistic” in the D-dimensional feature space, and y(x) = const is the surface defining the decision boundary.

Regression: Each event has “D” measured variables + one function value (e.g. cluster shape variables in the ECAL + the particle’s energy). Find y(x): R^D → R such that y(x) = const gives the hyperplanes where the target function is constant.

Now y(x) needs to be built such that it best approximates the target, not such that it best separates signal from background.

[Plot: surface f(x1, x2) over the (x1, x2) plane.]
Event Classification
PDF_S(y), PDF_B(y): the normalised distributions of y = y(x) for signal and background events (i.e. the “functions” that describe the shapes of the distributions). With y = y(x) one can also write PDF_S(y(x)), PDF_B(y(x)): the probability densities for signal and background. Here y(x): R^D → R is the mapping from the “feature space” (observables) to the one output variable.

P(Class = C | x) (or simply P(C | x)): the probability that the event class is C, given the measured observables x = {x1, …, xD} → y(x). Bayes’ theorem gives the posterior probability:

$$P(\text{Class}=C \mid y) = \frac{P(y \mid C)\,P(C)}{P(y)}$$

• P(y | C): the probability density distribution according to the measurements x and the given mapping function
• P(C): the prior probability to observe an event of “class C”, i.e. the relative abundance of “signal” versus “background”
• P(y) = Σ_Classes P(y | Class)·P(Class): the overall probability density to observe the actual measurement y(x)

Let f_S and f_B be the fractions of signal and background events in the sample. Then the probability that an event with measured x = {x1, …, xD}, which gives y(x), is of type signal is

$$P(C=S \mid y(\mathbf{x})) = \frac{f_S\,\mathrm{PDF}_S(y(\mathbf{x}))}{f_S\,\mathrm{PDF}_S(y(\mathbf{x})) + f_B\,\mathrm{PDF}_B(y(\mathbf{x}))}$$

Now let’s assume we have an unknown event from the example above for which y(x) = 0.2, with PDF_B(y(x)) = 1.5 and PDF_S(y(x)) = 0.45.
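A quick numeric check of this example. The slide does not give the abundances f_S and f_B, so equal fractions are assumed here purely for illustration.

```python
# Posterior for the slide's example: y(x) = 0.2, PDF_B = 1.5, PDF_S = 0.45.
f_S, f_B = 0.5, 0.5          # ASSUMED priors (not given on the slide)
pdf_S, pdf_B = 0.45, 1.5     # values read off the slide at y(x) = 0.2

posterior_S = f_S * pdf_S / (f_S * pdf_S + f_B * pdf_B)
print(f"P(C = S | y(x) = 0.2) = {posterior_S:.3f}")   # ~0.231 for equal priors
```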
Any Decision Involves Risk!
Decide to treat an event as “Signal” or “Background”.

Trying to select signal events (i.e. trying to disprove the null hypothesis stating it were “only” a background event):

Type-1 error (false positive): classify an event as Class C even though it is not (accept a hypothesis although it is not true / reject the null hypothesis although it would have been the correct one) → loss of purity (e.g. accepting wrong events).

Type-2 error (false negative): fail to identify an event from Class C as such (reject a hypothesis although it would have been correct / fail to reject the null hypothesis although it is false) → loss of efficiency (e.g. missing true (signal) events).

truly is:        accepted as Signal    accepted as Background
Signal           correct               Type-2 error
Background       Type-1 error          correct

“A”: the region of the outcome of the test where you accept the event as Signal.
Significance α = Type-1 error rate = background selection “efficiency” (should be small)
Size β = Type-2 error rate (how often you miss the signal; should be small)
Power = 1 − β = signal selection efficiency

Most of the rest of the lecture will be about methods that try to make as few mistakes as possible.
Neyman-Pearson Lemma
Neyman-Pearson: the likelihood ratio used as “selection criterion” y(x) gives for each selection efficiency the best possible background rejection, i.e. it maximises the area under the “Receiver Operating Characteristic” (ROC) curve:

$$\text{Likelihood ratio:}\qquad y(\mathbf{x}) = \frac{P(\mathbf{x}\mid S)}{P(\mathbf{x}\mid B)}$$

[ROC plot: 1 − ε_backgr versus ε_signal, both running from 0 to 1. The diagonal corresponds to random guessing; curves bending away from it (classifiers y'(x), y''(x)) mark good and better classification; the “limit” in the ROC curve is given by the likelihood ratio.]

Varying the cut in y(x) > “cut” moves the working point (efficiency and purity) along the ROC curve. How to choose the “cut”? One needs to know the prior probabilities (S, B abundances):

• measurement of a signal cross section: maximum of S/√(S+B), or equivalently √(ε·p)
• discovery of a signal (typically S << B): maximum of S/√B
• precision measurement: high purity (p) → large background rejection
• trigger selection: high efficiency (ε)
MVA and Machine Learning
The previous slide was basically the idea of “Multivariate Analysis” (MVA). (Remark: what about “standard cuts”, i.e. event rejection in each variable separately with fixed conditions, e.g. if x1 > 0 or x2 < 3 then background?)

Finding y(x): R^D → R for a given model class y(x) in an automatic way, using “known” or “previously solved” events — i.e. learning from known “patterns” such that the resulting y(x) has good generalization properties when applied to “unknown” events (for regression: fits the target function well “in between” the known training events) — that is what the “machine” is supposed to be doing: supervised machine learning.

Of course there is no magic; we still need to:
• choose the discriminating variables
• choose the class of models (linear, non-linear, flexible or less flexible)
• tune the “learning parameters” (bias vs. variance trade-off)
• check the generalization properties
• consider the trade-off between statistical and systematic uncertainties
Event Classification
Unfortunately, the true probability density functions are typically unknown, so the Neyman-Pearson lemma doesn’t really help us directly. What we have instead is Monte Carlo simulation or, in general, a set of known (already classified) “events”.

Use these “training” events in one of two different ways:

• estimate the functional form of p(x|C) (e.g. the differential cross section folded with the detector influences), from which the likelihood ratio can be obtained — e.g. D-dimensional histogram, kernel density estimators, …
• find a “discrimination function” y(x) and corresponding decision boundary (i.e. a hyperplane* in the “feature space”: y(x) = const) that optimally separates signal from background — e.g. linear discriminator, neural networks, …

* A hyperplane in the strict sense goes through the origin; here “affine set” would be the precise term.
Unsupervised Learning

Just a short remark, as we talked about “supervised” learning before:

• supervised: training with “events” for which we know the outcome (i.e. Signal or Background)
• unsupervised: no prior knowledge about what is “Signal” or “Background”; we don’t even know if there are different “event classes”. Then you can for example do:
  • cluster analysis: if different “groups” are found → class labels
  • principal component analysis: find a basis in observable space with the biggest hierarchical differences in the variance → infer something about the underlying substructure

Examples:
• think about “success” or “not success” rather than “signal” and “background” (i.e. a robot achieves its goal or does not / falls or does not fall / …)
• market survey: if asked many different questions, maybe you can find “clusters” of people, group them together, and test whether there are correlations between these groups and their tendency to buy a certain product → address them specially
• medical survey: group people together and perhaps find common causes for certain diseases
Nearest Neighbour and Kernel Density Estimator
Estimate the probability density P(x) in D-dimensional space; the only thing at our disposal is our “training data”, i.e. “events” distributed according to P(x).

[Scatter plot of events in the (x1, x2) plane, with a rectangular window of edge length h around the point “x”.]

Say we want to know P(x) at “this” point “x”. In a volume V around the point “x” one expects to find N·∫_V P(x)dx events from a dataset with N events. Counting the K events inside V (from the “training data”) gives an estimate of the average P(x) in the volume V: ∫_V P(x)dx ≈ K/N.

K events:

$$K = \sum_{n=1}^{N} k\!\left(\frac{\mathbf{x}-\mathbf{x}_n}{h}\right),\qquad k(\mathbf{u}) = \begin{cases}1, & |u_i| \le \tfrac{1}{2},\ i = 1\ldots D\\[2pt] 0, & \text{otherwise}\end{cases}$$

k(u) is called a kernel function; here we chose a rectangular volume of edge length h.

Kernel density estimator of the probability density:

$$\hat{P}(\mathbf{x}) = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{h^{D}}\,k\!\left(\frac{\mathbf{x}-\mathbf{x}_n}{h}\right)$$

Classification: determine PDF_S(x) and PDF_B(x) → likelihood ratio as classifier!
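A minimal sketch of this rectangular Parzen-window estimator (my own toy data; the 2-D Gaussian sample and the value of h are assumptions):

```python
# Box-kernel density estimate: P_hat(x) = K / (N * h^D).
import numpy as np

def parzen_box_pdf(x, training_data, h):
    """P_hat at point x from training events (N x D array), box kernel."""
    u = (training_data - x) / h                  # scaled distances
    inside = np.all(np.abs(u) <= 0.5, axis=1)    # k(u) = 1 inside the box
    N, D = training_data.shape
    return inside.sum() / (N * h**D)             # K / (N h^D)

rng = np.random.default_rng(2)
train = rng.normal(0.0, 1.0, size=(50_000, 2))      # events ~ P(x): 2-D Gaussian
print(parzen_box_pdf(np.zeros(2), train, h=0.2))    # ~1/(2*pi) ≈ 0.159 at origin
```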
Nearest Neighbour and Kernel Density Estimator
Regression: if each event with (x1, x2) carries a “function value” f(x1, x2) (e.g. the energy of the incident particle), we can estimate

$$\frac{1}{N}\sum_{i=1}^{N} k\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h}\right) f(\mathbf{x}_i) \;\approx\; \int_V f(\mathbf{x})\,P(\mathbf{x})\,d\mathbf{x}$$

Dividing by K/N ≈ ∫_V P(x)dx gives f̂(x) = (1/K) Σᵢ k((x−xᵢ)/h) f(xᵢ), i.e. the average function value of the training events in the volume V.

[Scatter plot of events in the (x1, x2) plane, distributed according to P(x), with a rectangular window of edge length h around the point “x”.]

As before, k(u) is the kernel function, here the rectangular Parzen window

$$k(\mathbf{u}) = \begin{cases}1, & |u_i| \le \tfrac{1}{2},\ i = 1\ldots D\\[2pt] 0, & \text{otherwise}\end{cases}$$

and the only thing at our disposal is the “training data”: in a volume V around the point “x” one expects N·∫_V P(x)dx events from a dataset with N events, and the K events found there estimate the average P(x) in V: ∫_V P(x)dx ≈ K/N.
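The same window used for regression, as a short sketch (toy data and window size are my assumptions): average the function values of the training events that fall inside the window.

```python
# Box-kernel regression: f_hat(x) = (1/K) * sum_i k((x - x_i)/h) * f(x_i).
import numpy as np

def parzen_box_regression(x, train_x, train_f, h):
    inside = np.all(np.abs((train_x - x) / h) <= 0.5, axis=1)
    if not inside.any():
        return np.nan                # empty window: no estimate possible
    return train_f[inside].mean()    # average function value in the volume V

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(20_000, 2))
f = X[:, 0]**2 + X[:, 1]**2 + rng.normal(0, 0.1, X.shape[0])     # noisy "energy"
print(parzen_box_regression(np.array([1.0, 1.0]), X, f, h=0.2))  # ~2.0
```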
Nearest Neighbour and Kernel Density Estimator
Now determine K from the “training data”, with signal and background mixed together. kNN: k-Nearest Neighbours — classify by the relative number of events of the various classes amongst the k nearest neighbours:

$$y(\mathbf{x}) = \frac{n_S}{K}$$

[Scatter plots of mixed signal and background events in the (x1, x2) plane, with the counting window of size h around the point “x”.]

As before, we estimate the probability density P(x) in D-dimensional space from the “training data” only (“events” distributed according to P(x)): in a volume V around the point “x” one expects N·∫_V P(x)dx events from a dataset with N events, and the K events found there estimate the average P(x) in V.
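A minimal sketch of this kNN score (toy data; K = 20 is an arbitrary choice):

```python
# kNN classifier: y(x) = n_S / K, the signal fraction among the K nearest events.
import numpy as np

def knn_score(x, train_x, labels, K=20):
    """labels: 1 for signal, 0 for background; returns n_S / K."""
    d2 = np.sum((train_x - x)**2, axis=1)     # squared Euclidean distances
    nearest = np.argpartition(d2, K)[:K]      # indices of the K nearest events
    return labels[nearest].mean()             # n_S / K

rng = np.random.default_rng(4)
sig = rng.normal(+1, 1, size=(5_000, 2))
bkg = rng.normal(-1, 1, size=(5_000, 2))
X = np.vstack([sig, bkg])
y = np.concatenate([np.ones(5_000), np.zeros(5_000)])
print(knn_score(np.array([1.0, 1.0]), X, y))  # close to 1 in the signal region
```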
Kernel Density Estimator
Parzen window: the “rectangular kernel” → discontinuities at the window edges. A smoother model for P(x) is obtained by using smooth kernel functions, e.g. a Gaussian:

$$\hat{P}(\mathbf{x}) = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{(2\pi)^{D/2}\,h^{D}}\exp\!\left(-\frac{(\mathbf{x}-\mathbf{x}_n)^2}{2h^2}\right)$$

Place a “Gaussian” around each “training data point” and sum up their contributions at arbitrary points “x” → P̂(x).

h: “size” of the kernel → “smoothing parameter”. There is a large variety of possible kernel functions.

[Plot: individual kernels versus averaged kernels — the sum of per-event Gaussians ↔ the probability density estimator.]
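A sketch of the Gaussian-kernel estimator, both by hand following the formula above and via scipy’s gaussian_kde (which picks h automatically unless told otherwise); the sample and h = 0.3 are assumptions.

```python
# Gaussian kernel density estimate, by hand and via scipy.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(5)
train = rng.normal(0.0, 1.0, 2_000)            # 1-D training sample

def gauss_kde_by_hand(x, data, h):
    # (1/N) sum_n (2*pi)^(-1/2) h^(-1) exp(-(x - x_n)^2 / (2 h^2)),  D = 1
    return np.mean(np.exp(-(x - data)**2 / (2 * h**2))) / (np.sqrt(2 * np.pi) * h)

print(gauss_kde_by_hand(0.0, train, h=0.3))    # ~0.399 = N(0,1) density at 0
print(gaussian_kde(train)(0.0))                # library version, automatic h
```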
Kernel Density Estimator
h: “size” of the kernel → “smoothing parameter”. The chosen size of the “smoothing parameter” is more important than the kernel function itself (Christopher M. Bishop):

• h too small: overtraining
• h too large: not sensitive to features in P(x)

$$\hat{P}(\mathbf{x}) = \frac{1}{N}\sum_{n=1}^{N} K_h(\mathbf{x}-\mathbf{x}_n)$$

is a general probability density estimator using kernel K.

A drawback of kernel density estimators: evaluation for any test event involves ALL TRAINING DATA → typically very time consuming. Binary search trees (e.g. kd-trees) are typically used in kNN methods to speed up the search.

Which metric for the kernel (window)? Normalise all variables to the same range; include correlations? Mahalanobis metric: x·x → xᵀV⁻¹x.
“Curse of Dimensionality”
Bellman, R. (1961), Adaptive Control Processes: A Guided Tour, Princeton University Press.

Shortcoming of nearest-neighbour strategies: in higher-dimensional classification/regression problems, the idea of looking at “training events” in a reasonably small “vicinity” of the space point to be classified becomes difficult. Consider a total phase-space volume V = 1^D; for a cube containing a given fraction of the volume:

$$\text{edge length} = (\text{fraction of volume})^{1/D}$$

In 10 dimensions, in order to capture 1% of the phase space, 0.01^{1/10} ≈ 63% of the range in each variable is necessary → that’s not “local” anymore.

We all know: filling a D-dimensional histogram to get a mapping of the PDF is typically unfeasible due to the lack of Monte Carlo events. Therefore we still need to develop all the alternative classification/regression techniques.
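The slide’s arithmetic, spelled out:

```python
# Edge length of a hypercube holding a given volume fraction: edge = f**(1/D).
for D in (1, 2, 3, 10, 100):
    edge = 0.01 ** (1.0 / D)     # cube capturing 1% of a unit phase space
    print(f"D = {D:3d}: edge length = {edge:.2f} of each variable's range")
# D = 10 gives 0.63: you need 63% of every variable's range -- not "local".
```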
Naïve Bayesian Classifier “often called: (projective) Likelihood”
Multivariate likelihood (k-Nearest Neighbour): estimates the full D-dimensional joint probability density.

If the correlations between the variables are weak:

$$P(\mathbf{x}) \simeq \prod_{i=1}^{D} P_i(x_i)$$

i.e. a product of marginal PDFs (1-dim “histograms”). The likelihood ratio for event i_event is then

$$y_{\mathrm{PDE}}(i_{\mathrm{event}}) = \frac{\prod_{k}^{\mathrm{variables}} P^{\mathrm{signal}}_{k}\big(x_k(i_{\mathrm{event}})\big)}{\displaystyle\sum_{\mathrm{classes}\ C}\ \prod_{k}^{\mathrm{variables}} P^{C}_{k}\big(x_k(i_{\mathrm{event}})\big)}$$

with x_k the discriminating variables, the classes being the signal and background types, and the P_k the PDFs.

This is one of the first and still very popular MVA algorithms in HEP. There are no hard cuts on individual variables; it allows for some “fuzziness”: one very signal-like variable may counterweigh another, less signal-like variable (PDE introduces fuzzy logic). It is the optimal method if the correlations are == 0 (Neyman-Pearson lemma).
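A minimal sketch of this projective likelihood (naïve Bayes): model each variable with a 1-dim PDF per class (here simple histograms, my choice) and combine them as products. The toy sample, the binning, and the small eps regularisation are my assumptions.

```python
# Projective likelihood: y = prod_k P_S,k / (prod_k P_S,k + prod_k P_B,k).
import numpy as np

rng = np.random.default_rng(6)
D, n = 3, 20_000
sig = rng.normal(+0.5, 1, size=(n, D))
bkg = rng.normal(-0.5, 1, size=(n, D))
edges = [np.linspace(-5, 5, 51)] * D

def marginal_pdfs(sample):
    # one normalised 1-dim histogram ("PDF") per variable
    return [np.histogram(sample[:, k], bins=edges[k], density=True)[0]
            for k in range(D)]

pdf_S, pdf_B = marginal_pdfs(sig), marginal_pdfs(bkg)

def y_pde(x, eps=1e-12):
    pS = pB = 1.0
    for k in range(D):
        b = np.clip(np.searchsorted(edges[k], x[k]) - 1, 0, len(edges[k]) - 2)
        pS *= pdf_S[k][b] + eps      # product of signal marginal PDFs
        pB *= pdf_B[k][b] + eps      # product of background marginal PDFs
    return pS / (pS + pB)            # sum over the two classes in the denominator

print(y_pde(np.array([1.0, 1.0, 1.0])))   # signal-like point -> close to 1
```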
Naïve Bayesian Classifier “often called: (projective) Likelihood”
How to parameterise the 1-dim PDFs?

Example: the original (underlying) distribution is Gaussian. Options:

• event counting (histogramming): automatic, unbiased, but suboptimal
• parametric (function) fitting: difficult to automate for arbitrary PDFs
• nonparametric fitting (i.e. splines, kernel): easy to automate, can create artefacts / suppress information

If the correlations between the variables are really negligible, this classifier is “perfect” (simple, robust, performing). If not, you seriously lose performance. How can we “fix” this?
What if there are correlations?
Typically correlations are present: C_ij = cov[x_i, x_j] = E[x_i x_j] − E[x_i]E[x_j] ≠ 0 (i ≠ j).

Pre-processing: choose a set of linearly transformed input variables for which C_ij = 0 (i ≠ j).
Decorrelation
Find the variable transformation that diagonalises the covariance matrix C:

• determine the square root C′ of the covariance matrix C, i.e. C = C′C′
• compute C′ by diagonalising C:  $D = S^{T} C\, S \;\Rightarrow\; C' = S\,\sqrt{D}\,S^{T}$
• transform from the original (x) into the decorrelated variable space (x′) by: x′ = C′⁻¹ x

Attention: this eliminates only linear correlations!!
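A sketch of this square-root decorrelation (the toy covariance matrix is assumed):

```python
# Square-root decorrelation: C' = S sqrt(D) S^T, then x' = C'^-1 x per event.
import numpy as np

rng = np.random.default_rng(7)
C_true = np.array([[1.0, 0.8], [0.8, 1.0]])
x = rng.multivariate_normal([0, 0], C_true, 100_000)   # correlated sample

C = np.cov(x.T)                        # sample covariance matrix
evals, S = np.linalg.eigh(C)           # C = S diag(evals) S^T
C_sqrt = S @ np.diag(np.sqrt(evals)) @ S.T    # C' with C = C' C'
x_prime = x @ np.linalg.inv(C_sqrt)           # x' = C'^-1 x (C' is symmetric)

print(np.round(np.cov(x_prime.T), 3))         # ~ identity: correlations gone
```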
Decorrelation: Principal Component Analysis
$$x^{\mathrm{PC}}_{k}(i_{\mathrm{event}}) = \sum_{v}^{\mathrm{variables}} \big(x_{v}(i_{\mathrm{event}}) - \bar{x}_{v}\big)\, v^{(k)}_{v}$$

Principal component (PC) k of an event: the sum runs over the variables, with the sample means x̄_v subtracted and v^{(k)} the k-th eigenvector. The matrix of eigenvectors V obeys the relation C·V = V·D, with D the diagonalised covariance matrix → PCA eliminates correlations!

PCA (an unsupervised learning algorithm): reduce the dimensionality of a problem, find the most dominant features in a distribution.

• The eigenvectors of the covariance matrix are the “axes” in the transformed variable space; a large eigenvalue means a large variance along that axis (principal component).
• Sort the eigenvectors according to their eigenvalues and transform the dataset accordingly → diagonalised covariance matrix, with the first “variable” being the variable with the largest variance.
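A PCA sketch following exactly these steps (toy sample assumed):

```python
# PCA: subtract sample means, diagonalise the covariance matrix, sort, project.
import numpy as np

rng = np.random.default_rng(8)
x = rng.multivariate_normal([1.0, -2.0], [[3.0, 1.2], [1.2, 1.0]], 50_000)

xc = x - x.mean(axis=0)                      # subtract sample means x_bar_v
evals, V = np.linalg.eigh(np.cov(xc.T))      # C V = V D
order = np.argsort(evals)[::-1]              # sort by decreasing variance
V, evals = V[:, order], evals[order]

x_pc = xc @ V                                # x_PC_k = sum_v (x_v - xbar_v) v_k
print("variances along PCs:", np.round(x_pc.var(axis=0), 2))  # ~ eigenvalues
print("eigenvalues:        ", np.round(evals, 2))
```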
How to Apply the Pre-Processing Transformation?
• Correlations (decorrelation): different for signal and background variables.
• We don’t know beforehand if an event is signal or background. What do we do?

For the likelihood ratio, decorrelate signal and background independently:

$$y_{L}(i_{\mathrm{event}}) = \frac{\prod_{k}^{\mathrm{variables}} p^{S}_{k}\big(x_k(i_{\mathrm{event}})\big)}{\prod_{k}^{\mathrm{variables}} p^{S}_{k}\big(x_k(i_{\mathrm{event}})\big) + \prod_{k}^{\mathrm{variables}} p^{B}_{k}\big(x_k(i_{\mathrm{event}})\big)} \;\longrightarrow\; \hat{y}_{L}(i_{\mathrm{event}}) = \frac{\prod_{k}^{\mathrm{variables}} \hat{p}^{S}_{k}\big(\hat{T}_{S}\,\mathbf{x}(i_{\mathrm{event}})\big)}{\prod_{k}^{\mathrm{variables}} \hat{p}^{S}_{k}\big(\hat{T}_{S}\,\mathbf{x}(i_{\mathrm{event}})\big) + \prod_{k}^{\mathrm{variables}} \hat{p}^{B}_{k}\big(\hat{T}_{B}\,\mathbf{x}(i_{\mathrm{event}})\big)}$$

with T̂_S the signal transformation and T̂_B the background transformation.

For other estimators, one needs to decide on one of the two (or decorrelate on a mixture of signal and background events).
Decorrelation at Work
Example: linearly correlated Gaussians → decorrelation works to 100%.

A 1-D likelihood on the decorrelated sample gives the best possible performance; compare also the effect on the MVA output variable!

[Plots: the correlated variables, the same variables after decorrelation, and the MVA output before/after (note the different scale on the y-axis).]
Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or nonlinear correlations, decorrelation needs to be treated with care.

How does linear decorrelation affect cases where the correlations between signal and background differ?

[Signal and Background scatter plots: original correlations.]
[Signal and Background scatter plots: after SQRT decorrelation.]

How does linear decorrelation affect strongly nonlinear cases?

[Signal and Background scatter plots: original correlations.]
[Signal and Background scatter plots: after SQRT decorrelation.]

Watch out before you use decorrelation “blindly”!! Perhaps “decorrelate” only a subspace!
“Gaussian-isation”

Improve the decorrelation by pre-Gaussianisation of the variables.

First: a transformation to achieve a uniform (flat) distribution:

$$x^{\mathrm{flat}}_{k}(i_{\mathrm{event}}) = \int_{-\infty}^{x_{k}(i_{\mathrm{event}})} p_{k}(x_k)\,dx_k,\qquad k \in \{\mathrm{variables}\}$$

This is the Rarity transform of variable k: x_k(i_event) is the measured value and p_k is the PDF of variable k. The integral can be solved in an unbinned way by event counting, or by creating non-parametric PDFs (see the likelihood section later).

Second: make it Gaussian via the inverse error function:

$$x^{\mathrm{Gauss}}_{k}(i_{\mathrm{event}}) = \sqrt{2}\,\operatorname{erf}^{-1}\!\big(2\,x^{\mathrm{flat}}_{k}(i_{\mathrm{event}}) - 1\big),\qquad \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}}\int_{0}^{x} e^{-t^{2}}\,dt$$

Third: decorrelate (and “iterate” this procedure).
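A sketch of this Gaussianisation chain, using the empirical CDF as the unbinned “flat” transform (the rank/(N+1) convention and the toy sample are my choices):

```python
# Gaussianise one variable: empirical CDF -> sqrt(2) * erfinv(2*x_flat - 1).
import numpy as np
from scipy.special import erfinv
from scipy.stats import rankdata

def gaussianise(column):
    # empirical CDF in (0, 1); rank/(N+1) avoids the endpoints 0 and 1
    x_flat = rankdata(column) / (len(column) + 1.0)
    return np.sqrt(2.0) * erfinv(2.0 * x_flat - 1.0)

rng = np.random.default_rng(9)
sample = rng.exponential(2.0, 100_000)     # very non-Gaussian input
g = gaussianise(sample)
print(f"mean = {g.mean():.3f}, std = {g.std():.3f}")   # ~0 and ~1 afterwards
```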
[Plots: original versus Gaussianised distributions, for signal and for background.]

We cannot simultaneously “Gaussianise” both signal and background!
Summary

Hope you are all convinced that multivariate algorithms are nice and powerful classification techniques:

• they do not use hard selection criteria (cuts) on each individual observable
• they look at all observables “together”, e.g. by combining them into one variable

Multidimensional likelihood → PDF in D dimensions. Projective likelihood (naïve Bayes) → PDFs in D times 1 dimension. And: how to “avoid” correlations.