Introduction to Statistics and Machine Learning
Helge Voss, GSI Power Week, Dec 5-9 2011

How do we
• understand
• interpret
our measurements? And how do we get the data for our measurements?
Outline

• Multivariate classification/regression algorithms (MVA)
  • motivation
  • another introduction / repeating the ideas of hypothesis tests in this context
• Multidimensional Likelihood (kNN: k-Nearest Neighbour)
• Projective Likelihood (naïve Bayes)
• What to do with correlated input variables?
  • Decorrelation strategies
• MVA literature / software packages … a biased selection
Software Packages for Multivariate Data Analysis / Classification

Individual classifier software:
• e.g. “JETNET”: C. Peterson, T. Rognvaldsson, L. Loennblad — and many other packages

Attempts to provide “all inclusive” packages:
• StatPatternRecognition: I. Narsky, arXiv: physics/0507143, http://www.hep.caltech.edu/~narsky/spr.html
• TMVA: Höcker, Speckmayer, Stelzer, Therhaag, von Toerne, Voss, arXiv: physics/0703039, http://tmva.sf.net or every ROOT distribution (development moved from SourceForge to the ROOT repository)
• WEKA: http://www.cs.waikato.ac.nz/ml/weka/
• “R”, a huge data analysis library: http://www.r-project.org/

Literature:
• T. Hastie, R. Tibshirani, J. Friedman, “The Elements of Statistical Learning”, Springer 2001
• C. M. Bishop, “Pattern Recognition and Machine Learning”, Springer 2006

Conferences: PHYSTAT, ACAT, …
Event Classification
A linear boundary? A nonlinear one? Rectangular cuts?

[Three scatter plots of Signal (S) and Background (B) events in the (x1, x2) plane, showing rectangular-cut, linear, and nonlinear decision boundaries.]

Suppose we have a data sample with two types of events, carrying class labels Signal and Background. (We restrict ourselves here to the two-class case; many classifiers can in principle be extended to several classes, otherwise analyses can be staged.) We have discriminating variables x1, x2, … — how do we set the decision boundary to select events of type S?

How can we decide what to use? Once we have decided on a class of boundaries, how do we find the “optimal” one?

Low-variance (stable), high-bias methods ↔ high-variance, small-bias methods.
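To make the bias/variance contrast concrete, here is a minimal toy sketch (my own illustration, not from the lecture): rectangular cuts versus a linear (Fisher) discriminant on two correlated Gaussian classes. All sample parameters and cut values are invented for the example.

```python
# Toy comparison: rectangular cuts vs. a linear (Fisher) discriminant.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
signal = rng.multivariate_normal([1.0, 1.0], [[1.0, 0.5], [0.5, 1.0]], n)
backgr = rng.multivariate_normal([-1.0, -1.0], [[1.0, 0.5], [0.5, 1.0]], n)

def cuts(x):                      # rectangular cuts: accept if x1 > 0 and x2 > 0
    return (x[:, 0] > 0) & (x[:, 1] > 0)

# Fisher discriminant: w = W^-1 (mu_S - mu_B), W = within-class covariance
W = 0.5 * (np.cov(signal.T) + np.cov(backgr.T))
w = np.linalg.solve(W, signal.mean(0) - backgr.mean(0))
thr = 0.5 * w @ (signal.mean(0) + backgr.mean(0))   # midpoint threshold

def fisher(x):
    return x @ w > thr

for name, sel in [("cuts", cuts), ("fisher", fisher)]:
    eff = sel(signal).mean()              # signal efficiency
    rej = 1.0 - sel(backgr).mean()        # background rejection
    print(f"{name:6s}  signal eff = {eff:.3f}   backgr rejection = {rej:.3f}")
```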
Regression
[Three panels of data points with fitted curves f(x) versus x: constant? linear? non-linear?]

How do we estimate a “functional behaviour” from a given set of “known measurements”? Assume for example “D” variables that somehow characterize the shower in your calorimeter, and we want the energy as a function of the calorimeter shower parameters.

[Plot: Energy versus Cluster Size.]

Seems trivial? The human brain has very good pattern-recognition capabilities! But what if you have many input variables?

If we had an analytic model (i.e. we knew the function is an nth-order polynomial), then we would know how to fit it (e.g. a maximum likelihood fit). But what if we just want to “draw any kind of curve” and parameterize it?
Regression: model functional behaviour

Assume for example “D” variables that somehow characterize the shower in your calorimeter. From Monte Carlo or a testbeam we get a data sample with measured cluster observables + known particle energy; the result is a calibration function (the energy == a surface in D+1-dimensional space).

[1-D example: f(x) versus x. 2-D example: a surface f(x,y) over the (x,y) plane, with events generated according to the underlying distribution.]

The better-known case is (linear) regression: fit a known analytic function, e.g. for the above 2-D example a reasonable function would be f(x,y) = ax² + by² + c.

What if we don’t have a reasonable “model”? Then we need something more general, e.g. piecewise-defined splines, kernel estimators, or decision trees to approximate f(x).

Note: the goal is NOT to “fit a parameter” but to provide a prediction of the function value f(x) for new measurements x (where f(x) is not known).
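As a small illustration of the two regimes (assumed toy data, not the lecture’s example): fitting an assumed analytic model versus “drawing a curve” with a smoothing spline when no model is assumed.

```python
# (a) assumed analytic model (polynomial; least squares = ML fit for Gaussian
# noise) vs. (b) a model-free smoothing spline.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 4.0, 200))
f_true = np.sin(2.0 * x) * np.exp(-0.3 * x)       # "unknown" target function
y = f_true + rng.normal(0.0, 0.1, x.size)         # noisy measurements

coeffs = np.polyfit(x, y, deg=5)                  # (a) 5th-order polynomial
poly = np.polyval(coeffs, x)

spline = UnivariateSpline(x, y, s=200 * 0.1**2)   # (b) smoothing spline
print("poly RMS:  ", np.sqrt(np.mean((poly - f_true)**2)))
print("spline RMS:", np.sqrt(np.mean((spline(x) - f_true)**2)))
```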
Event Classification
Each event, whether Signal or Background, has “D” measured variables.

Find a mapping from the D-dimensional input-observable (“feature”) space to a one-dimensional output class label. In its most general form: y = y(x); x = {x1, …, xD}: input variables; y(x): R^D → R.

Plotting (histogramming) the resulting y(x) values gives their distribution.

Who sees what this would look like for the regression problem?
Event Classification
Each event, whether Signal or Background, has “D” measured variables.

Find a mapping from the D-dimensional input/observable/“feature” space to a one-dimensional output, the class label: y(x): R^D → R, a “test statistic” in the D-dimensional space of input variables, with y(B) → 0 and y(S) → 1. The surface y(x) = const defines the decision boundary:

• y(x) > cut: signal
• y(x) = cut: decision boundary
• y(x) < cut: background

The distributions of y(x), PDF_S(y) and PDF_B(y), and in particular their overlap, determine the separation power and the purity; they are used to set the selection cut (efficiency and purity).
Classification ↔ Regression
Classification: Each event, whether Signal or Background, has “D” measured variables. y(x): R^D → R is a “test statistic” in the D-dimensional feature space, and y(x) = const is the surface defining the decision boundary.

Regression: Each event has “D” measured variables + one function value (e.g. cluster shape variables in the ECAL + the particle’s energy). Find y(x): R^D → R such that y(x) = const gives the hyperplanes where the target function is constant.

Now y(x) needs to be built such that it best approximates the target, not such that it best separates signal from background.

[Plot: surface f(x1, x2) over the (x1, x2) plane.]
Event Classification
PDF_S(y), PDF_B(y): the normalised distributions of y = y(x) for signal and background events (i.e. the “functions” that describe the shapes of the distributions). With y = y(x) one can also write PDF_S(y(x)), PDF_B(y(x)): the probability densities for signal and background. Here y(x): R^D → R is the mapping from the “feature space” (observables) to the one output variable.

P(Class = C | x) (or simply P(C | x)): the probability that the event class is C, given the measured observables x = {x1, …, xD} → y(x). Bayes’ theorem gives the posterior probability:

$$P(\text{Class}=C \mid y) = \frac{P(y \mid C)\,P(C)}{P(y)}$$

• P(y | C): the probability density distribution according to the measurements x and the given mapping function
• P(C): the prior probability to observe an event of “class C”, i.e. the relative abundance of “signal” versus “background”
• P(y) = Σ_Classes P(y | Class)·P(Class): the overall probability density to observe the actual measurement y(x)

Let f_S and f_B be the fractions of signal and background events in the sample. Then the probability that an event with measured x = {x1, …, xD}, which gives y(x), is of type signal is

$$P(C=S \mid y(\mathbf{x})) = \frac{f_S\,\mathrm{PDF}_S(y(\mathbf{x}))}{f_S\,\mathrm{PDF}_S(y(\mathbf{x})) + f_B\,\mathrm{PDF}_B(y(\mathbf{x}))}$$

Now let’s assume we have an unknown event from the example above for which y(x) = 0.2, with PDF_B(y(x)) = 1.5 and PDF_S(y(x)) = 0.45.
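A quick numeric check of this example. The slide does not give the abundances f_S and f_B, so equal fractions are assumed here purely for illustration.

```python
# Posterior for the slide's example: y(x) = 0.2, PDF_B = 1.5, PDF_S = 0.45.
f_S, f_B = 0.5, 0.5          # ASSUMED priors (not given on the slide)
pdf_S, pdf_B = 0.45, 1.5     # values read off the slide at y(x) = 0.2

posterior_S = f_S * pdf_S / (f_S * pdf_S + f_B * pdf_B)
print(f"P(C = S | y(x) = 0.2) = {posterior_S:.3f}")   # ~0.231 for equal priors
```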
Any Decision Involves Risk!
Decide to treat an event as “Signal” or “Background”.

Trying to select signal events (i.e. trying to disprove the null hypothesis stating it were “only” a background event):

Type-1 error (false positive): classify an event as Class C even though it is not (accept a hypothesis although it is not true / reject the null hypothesis although it would have been the correct one) → loss of purity (e.g. accepting wrong events).

Type-2 error (false negative): fail to identify an event from Class C as such (reject a hypothesis although it would have been correct / fail to reject the null hypothesis although it is false) → loss of efficiency (e.g. missing true (signal) events).

truly is:        accepted as Signal    accepted as Background
Signal           correct               Type-2 error
Background       Type-1 error          correct

“A”: the region of the outcome of the test where you accept the event as Signal.
Significance α = Type-1 error rate = background selection “efficiency” (should be small)
Size β = Type-2 error rate (how often you miss the signal; should be small)
Power = 1 − β = signal selection efficiency

Most of the rest of the lecture will be about methods that try to make as few mistakes as possible.
Neyman-Pearson Lemma
Neyman-Pearson: the likelihood ratio used as “selection criterion” y(x) gives for each selection efficiency the best possible background rejection, i.e. it maximises the area under the “Receiver Operating Characteristic” (ROC) curve:

$$\text{Likelihood ratio:}\qquad y(\mathbf{x}) = \frac{P(\mathbf{x}\mid S)}{P(\mathbf{x}\mid B)}$$

[ROC plot: 1 − ε_backgr versus ε_signal, both running from 0 to 1. The diagonal corresponds to random guessing; curves bending away from it (classifiers y'(x), y''(x)) mark good and better classification; the “limit” in the ROC curve is given by the likelihood ratio.]

Varying the cut in y(x) > “cut” moves the working point (efficiency and purity) along the ROC curve. How to choose the “cut”? One needs to know the prior probabilities (S, B abundances):

• measurement of a signal cross section: maximum of S/√(S+B), or equivalently √(ε·p)
• discovery of a signal (typically S << B): maximum of S/√B
• precision measurement: high purity (p) → large background rejection
• trigger selection: high efficiency (ε)
MVA and Machine Learning
The previous slide was basically the idea of “Multivariate Analysis” (MVA). (Remark: what about “standard cuts”, i.e. event rejection in each variable separately with fixed conditions, e.g. if x1 > 0 or x2 < 3 then background?)

Finding y(x): R^D → R for a given model class y(x) in an automatic way, using “known” or “previously solved” events — i.e. learning from known “patterns” such that the resulting y(x) has good generalization properties when applied to “unknown” events (for regression: fits the target function well “in between” the known training events) — that is what the “machine” is supposed to be doing: supervised machine learning.

Of course there is no magic; we still need to:
• choose the discriminating variables
• choose the class of models (linear, non-linear, flexible or less flexible)
• tune the “learning parameters” (bias vs. variance trade-off)
• check the generalization properties
• consider the trade-off between statistical and systematic uncertainties
Event Classification
Unfortunately, the true probability density functions are typically unknown, so the Neyman-Pearson lemma doesn’t really help us directly. What we have instead is Monte Carlo simulation or, in general, a set of known (already classified) “events”.

Use these “training” events in one of two different ways:

• estimate the functional form of p(x|C) (e.g. the differential cross section folded with the detector influences), from which the likelihood ratio can be obtained — e.g. D-dimensional histogram, kernel density estimators, …
• find a “discrimination function” y(x) and corresponding decision boundary (i.e. a hyperplane* in the “feature space”: y(x) = const) that optimally separates signal from background — e.g. linear discriminator, neural networks, …

* A hyperplane in the strict sense goes through the origin; here “affine set” would be the precise term.
Unsupervised Learning

Just a short remark, as we talked about “supervised” learning before:

• supervised: training with “events” for which we know the outcome (i.e. Signal or Background)
• unsupervised: no prior knowledge about what is “Signal” or “Background”; we don’t even know if there are different “event classes”. Then you can for example do:
  • cluster analysis: if different “groups” are found → class labels
  • principal component analysis: find a basis in observable space with the biggest hierarchical differences in the variance → infer something about the underlying substructure

Examples:
• think about “success” or “not success” rather than “signal” and “background” (i.e. a robot achieves its goal or does not / falls or does not fall / …)
• market survey: if asked many different questions, maybe you can find “clusters” of people, group them together, and test whether there are correlations between these groups and their tendency to buy a certain product → address them specially
• medical survey: group people together and perhaps find common causes for certain diseases
Nearest Neighbour and Kernel Density Estimator
Estimate the probability density P(x) in D-dimensional space; the only thing at our disposal is our “training data”, i.e. “events” distributed according to P(x).

[Scatter plot of events in the (x1, x2) plane, with a rectangular window of edge length h around the point “x”.]

Say we want to know P(x) at “this” point “x”. In a volume V around the point “x” one expects to find N·∫_V P(x)dx events from a dataset with N events. Counting the K events inside V (from the “training data”) gives an estimate of the average P(x) in the volume V: ∫_V P(x)dx ≈ K/N.

K events:

$$K = \sum_{n=1}^{N} k\!\left(\frac{\mathbf{x}-\mathbf{x}_n}{h}\right),\qquad k(\mathbf{u}) = \begin{cases}1, & |u_i| \le \tfrac{1}{2},\ i = 1\ldots D\\[2pt] 0, & \text{otherwise}\end{cases}$$

k(u) is called a kernel function; here we chose a rectangular volume of edge length h.

Kernel density estimator of the probability density:

$$\hat{P}(\mathbf{x}) = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{h^{D}}\,k\!\left(\frac{\mathbf{x}-\mathbf{x}_n}{h}\right)$$

Classification: determine PDF_S(x) and PDF_B(x) → likelihood ratio as classifier!
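A minimal sketch of this rectangular Parzen-window estimator (my own toy data; the 2-D Gaussian sample and the value of h are assumptions):

```python
# Box-kernel density estimate: P_hat(x) = K / (N * h^D).
import numpy as np

def parzen_box_pdf(x, training_data, h):
    """P_hat at point x from training events (N x D array), box kernel."""
    u = (training_data - x) / h                  # scaled distances
    inside = np.all(np.abs(u) <= 0.5, axis=1)    # k(u) = 1 inside the box
    N, D = training_data.shape
    return inside.sum() / (N * h**D)             # K / (N h^D)

rng = np.random.default_rng(2)
train = rng.normal(0.0, 1.0, size=(50_000, 2))      # events ~ P(x): 2-D Gaussian
print(parzen_box_pdf(np.zeros(2), train, h=0.2))    # ~1/(2*pi) ≈ 0.159 at origin
```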
Nearest Neighbour and Kernel Density Estimator
Regression: if each event with (x1, x2) carries a “function value” f(x1, x2) (e.g. the energy of the incident particle), we can estimate

$$\frac{1}{N}\sum_{i=1}^{N} k\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h}\right) f(\mathbf{x}_i) \;\approx\; \int_V f(\mathbf{x})\,P(\mathbf{x})\,d\mathbf{x}$$

Dividing by K/N ≈ ∫_V P(x)dx gives f̂(x) = (1/K) Σᵢ k((x−xᵢ)/h) f(xᵢ), i.e. the average function value of the training events in the volume V.

[Scatter plot of events in the (x1, x2) plane, distributed according to P(x), with a rectangular window of edge length h around the point “x”.]

As before, k(u) is the kernel function, here the rectangular Parzen window

$$k(\mathbf{u}) = \begin{cases}1, & |u_i| \le \tfrac{1}{2},\ i = 1\ldots D\\[2pt] 0, & \text{otherwise}\end{cases}$$

and the only thing at our disposal is the “training data”: in a volume V around the point “x” one expects N·∫_V P(x)dx events from a dataset with N events, and the K events found there estimate the average P(x) in V: ∫_V P(x)dx ≈ K/N.
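The same window used for regression, as a short sketch (toy data and window size are my assumptions): average the function values of the training events that fall inside the window.

```python
# Box-kernel regression: f_hat(x) = (1/K) * sum_i k((x - x_i)/h) * f(x_i).
import numpy as np

def parzen_box_regression(x, train_x, train_f, h):
    inside = np.all(np.abs((train_x - x) / h) <= 0.5, axis=1)
    if not inside.any():
        return np.nan                # empty window: no estimate possible
    return train_f[inside].mean()    # average function value in the volume V

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(20_000, 2))
f = X[:, 0]**2 + X[:, 1]**2 + rng.normal(0, 0.1, X.shape[0])     # noisy "energy"
print(parzen_box_regression(np.array([1.0, 1.0]), X, f, h=0.2))  # ~2.0
```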
Nearest Neighbour and Kernel Density Estimator
Now determine K from the “training data”, with signal and background mixed together. kNN: k-Nearest Neighbours — classify by the relative number of events of the various classes amongst the k nearest neighbours:

$$y(\mathbf{x}) = \frac{n_S}{K}$$

[Scatter plots of mixed signal and background events in the (x1, x2) plane, with the counting window of size h around the point “x”.]

As before, we estimate the probability density P(x) in D-dimensional space from the “training data” only (“events” distributed according to P(x)): in a volume V around the point “x” one expects N·∫_V P(x)dx events from a dataset with N events, and the K events found there estimate the average P(x) in V.
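A minimal sketch of this kNN score (toy data; K = 20 is an arbitrary choice):

```python
# kNN classifier: y(x) = n_S / K, the signal fraction among the K nearest events.
import numpy as np

def knn_score(x, train_x, labels, K=20):
    """labels: 1 for signal, 0 for background; returns n_S / K."""
    d2 = np.sum((train_x - x)**2, axis=1)     # squared Euclidean distances
    nearest = np.argpartition(d2, K)[:K]      # indices of the K nearest events
    return labels[nearest].mean()             # n_S / K

rng = np.random.default_rng(4)
sig = rng.normal(+1, 1, size=(5_000, 2))
bkg = rng.normal(-1, 1, size=(5_000, 2))
X = np.vstack([sig, bkg])
y = np.concatenate([np.ones(5_000), np.zeros(5_000)])
print(knn_score(np.array([1.0, 1.0]), X, y))  # close to 1 in the signal region
```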
Kernel Density Estimator
Parzen window: the “rectangular kernel” → discontinuities at the window edges. A smoother model for P(x) is obtained by using smooth kernel functions, e.g. a Gaussian:

$$\hat{P}(\mathbf{x}) = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{(2\pi)^{D/2}\,h^{D}}\exp\!\left(-\frac{(\mathbf{x}-\mathbf{x}_n)^2}{2h^2}\right)$$

Place a “Gaussian” around each “training data point” and sum up their contributions at arbitrary points “x” → P̂(x).

h: “size” of the kernel → “smoothing parameter”. There is a large variety of possible kernel functions.

[Plot: individual kernels versus averaged kernels — the sum of per-event Gaussians ↔ the probability density estimator.]
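A sketch of the Gaussian-kernel estimator, both by hand following the formula above and via scipy’s gaussian_kde (which picks h automatically unless told otherwise); the sample and h = 0.3 are assumptions.

```python
# Gaussian kernel density estimate, by hand and via scipy.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(5)
train = rng.normal(0.0, 1.0, 2_000)            # 1-D training sample

def gauss_kde_by_hand(x, data, h):
    # (1/N) sum_n (2*pi)^(-1/2) h^(-1) exp(-(x - x_n)^2 / (2 h^2)),  D = 1
    return np.mean(np.exp(-(x - data)**2 / (2 * h**2))) / (np.sqrt(2 * np.pi) * h)

print(gauss_kde_by_hand(0.0, train, h=0.3))    # ~0.399 = N(0,1) density at 0
print(gaussian_kde(train)(0.0))                # library version, automatic h
```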
Kernel Density Estimator
h: “size” of the kernel → “smoothing parameter”. The chosen size of the “smoothing parameter” is more important than the kernel function itself (Christopher M. Bishop):

• h too small: overtraining
• h too large: not sensitive to features in P(x)

$$\hat{P}(\mathbf{x}) = \frac{1}{N}\sum_{n=1}^{N} K_h(\mathbf{x}-\mathbf{x}_n)$$

is a general probability density estimator using kernel K.

A drawback of kernel density estimators: evaluation for any test event involves ALL TRAINING DATA → typically very time consuming. Binary search trees (e.g. kd-trees) are typically used in kNN methods to speed up the search.

Which metric for the kernel (window)? Normalise all variables to the same range; include correlations? Mahalanobis metric: x·x → xᵀV⁻¹x.
“Curse of Dimensionality”
Bellman, R. (1961), Adaptive Control Processes: A Guided Tour, Princeton University Press.

Shortcoming of nearest-neighbour strategies: in higher-dimensional classification/regression problems, the idea of looking at “training events” in a reasonably small “vicinity” of the space point to be classified becomes difficult. Consider a total phase-space volume V = 1^D; for a cube containing a given fraction of the volume:

$$\text{edge length} = (\text{fraction of volume})^{1/D}$$

In 10 dimensions, in order to capture 1% of the phase space, 0.01^{1/10} ≈ 63% of the range in each variable is necessary → that’s not “local” anymore.

We all know: filling a D-dimensional histogram to get a mapping of the PDF is typically unfeasible due to the lack of Monte Carlo events. Therefore we still need to develop all the alternative classification/regression techniques.
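The slide’s arithmetic, spelled out:

```python
# Edge length of a hypercube holding a given volume fraction: edge = f**(1/D).
for D in (1, 2, 3, 10, 100):
    edge = 0.01 ** (1.0 / D)     # cube capturing 1% of a unit phase space
    print(f"D = {D:3d}: edge length = {edge:.2f} of each variable's range")
# D = 10 gives 0.63: you need 63% of every variable's range -- not "local".
```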
Naïve Bayesian Classifier “often called: (projective) Likelihood”
Multivariate likelihood (k-Nearest Neighbour): estimates the full D-dimensional joint probability density.

If the correlations between the variables are weak:

$$P(\mathbf{x}) \simeq \prod_{i=1}^{D} P_i(x_i)$$

i.e. a product of marginal PDFs (1-dim “histograms”). The likelihood ratio for event i_event is then

$$y_{\mathrm{PDE}}(i_{\mathrm{event}}) = \frac{\prod_{k}^{\mathrm{variables}} P^{\mathrm{signal}}_{k}\big(x_k(i_{\mathrm{event}})\big)}{\displaystyle\sum_{\mathrm{classes}\ C}\ \prod_{k}^{\mathrm{variables}} P^{C}_{k}\big(x_k(i_{\mathrm{event}})\big)}$$

with x_k the discriminating variables, the classes being the signal and background types, and the P_k the PDFs.

This is one of the first and still very popular MVA algorithms in HEP. There are no hard cuts on individual variables; it allows for some “fuzziness”: one very signal-like variable may counterweigh another, less signal-like variable (PDE introduces fuzzy logic). It is the optimal method if the correlations are == 0 (Neyman-Pearson lemma).
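A minimal sketch of this projective likelihood (naïve Bayes): model each variable with a 1-dim PDF per class (here simple histograms, my choice) and combine them as products. The toy sample, the binning, and the small eps regularisation are my assumptions.

```python
# Projective likelihood: y = prod_k P_S,k / (prod_k P_S,k + prod_k P_B,k).
import numpy as np

rng = np.random.default_rng(6)
D, n = 3, 20_000
sig = rng.normal(+0.5, 1, size=(n, D))
bkg = rng.normal(-0.5, 1, size=(n, D))
edges = [np.linspace(-5, 5, 51)] * D

def marginal_pdfs(sample):
    # one normalised 1-dim histogram ("PDF") per variable
    return [np.histogram(sample[:, k], bins=edges[k], density=True)[0]
            for k in range(D)]

pdf_S, pdf_B = marginal_pdfs(sig), marginal_pdfs(bkg)

def y_pde(x, eps=1e-12):
    pS = pB = 1.0
    for k in range(D):
        b = np.clip(np.searchsorted(edges[k], x[k]) - 1, 0, len(edges[k]) - 2)
        pS *= pdf_S[k][b] + eps      # product of signal marginal PDFs
        pB *= pdf_B[k][b] + eps      # product of background marginal PDFs
    return pS / (pS + pB)            # sum over the two classes in the denominator

print(y_pde(np.array([1.0, 1.0, 1.0])))   # signal-like point -> close to 1
```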
Naïve Bayesian Classifier “often called: (projective) Likelihood”
How to parameterise the 1-dim PDFs?

Example: the original (underlying) distribution is Gaussian. Options:

• event counting (histogramming): automatic, unbiased, but suboptimal
• parametric (function) fitting: difficult to automate for arbitrary PDFs
• nonparametric fitting (i.e. splines, kernel): easy to automate, can create artefacts / suppress information

If the correlations between the variables are really negligible, this classifier is “perfect” (simple, robust, performing). If not, you seriously lose performance. How can we “fix” this?
What if there are correlations?
Typically correlations are present: C_ij = cov[x_i, x_j] = E[x_i x_j] − E[x_i]E[x_j] ≠ 0 (i ≠ j).

Pre-processing: choose a set of linearly transformed input variables for which C_ij = 0 (i ≠ j).
Decorrelation
Find the variable transformation that diagonalises the covariance matrix C:

• determine the square root C′ of the covariance matrix C, i.e. C = C′C′
• compute C′ by diagonalising C:  $D = S^{T} C\, S \;\Rightarrow\; C' = S\,\sqrt{D}\,S^{T}$
• transform from the original (x) into the decorrelated variable space (x′) by: x′ = C′⁻¹ x

Attention: this eliminates only linear correlations!!
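A sketch of this square-root decorrelation (the toy covariance matrix is assumed):

```python
# Square-root decorrelation: C' = S sqrt(D) S^T, then x' = C'^-1 x per event.
import numpy as np

rng = np.random.default_rng(7)
C_true = np.array([[1.0, 0.8], [0.8, 1.0]])
x = rng.multivariate_normal([0, 0], C_true, 100_000)   # correlated sample

C = np.cov(x.T)                        # sample covariance matrix
evals, S = np.linalg.eigh(C)           # C = S diag(evals) S^T
C_sqrt = S @ np.diag(np.sqrt(evals)) @ S.T    # C' with C = C' C'
x_prime = x @ np.linalg.inv(C_sqrt)           # x' = C'^-1 x (C' is symmetric)

print(np.round(np.cov(x_prime.T), 3))         # ~ identity: correlations gone
```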
Decorrelation: Principal Component Analysis
$$x^{\mathrm{PC}}_{k}(i_{\mathrm{event}}) = \sum_{v}^{\mathrm{variables}} \big(x_{v}(i_{\mathrm{event}}) - \bar{x}_{v}\big)\, v^{(k)}_{v}$$

Principal component (PC) k of an event: the sum runs over the variables, with the sample means x̄_v subtracted and v^{(k)} the k-th eigenvector. The matrix of eigenvectors V obeys the relation C·V = V·D, with D the diagonalised covariance matrix → PCA eliminates correlations!

PCA (an unsupervised learning algorithm): reduce the dimensionality of a problem, find the most dominant features in a distribution.

• The eigenvectors of the covariance matrix are the “axes” in the transformed variable space; a large eigenvalue means a large variance along that axis (principal component).
• Sort the eigenvectors according to their eigenvalues and transform the dataset accordingly → diagonalised covariance matrix, with the first “variable” being the variable with the largest variance.
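A PCA sketch following exactly these steps (toy sample assumed):

```python
# PCA: subtract sample means, diagonalise the covariance matrix, sort, project.
import numpy as np

rng = np.random.default_rng(8)
x = rng.multivariate_normal([1.0, -2.0], [[3.0, 1.2], [1.2, 1.0]], 50_000)

xc = x - x.mean(axis=0)                      # subtract sample means x_bar_v
evals, V = np.linalg.eigh(np.cov(xc.T))      # C V = V D
order = np.argsort(evals)[::-1]              # sort by decreasing variance
V, evals = V[:, order], evals[order]

x_pc = xc @ V                                # x_PC_k = sum_v (x_v - xbar_v) v_k
print("variances along PCs:", np.round(x_pc.var(axis=0), 2))  # ~ eigenvalues
print("eigenvalues:        ", np.round(evals, 2))
```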
How to Apply the Pre-Processing Transformation?
• Correlations (decorrelation): different for signal and background variables.
• We don’t know beforehand if an event is signal or background. What do we do?

For the likelihood ratio, decorrelate signal and background independently:

$$y_{L}(i_{\mathrm{event}}) = \frac{\prod_{k}^{\mathrm{variables}} p^{S}_{k}\big(x_k(i_{\mathrm{event}})\big)}{\prod_{k}^{\mathrm{variables}} p^{S}_{k}\big(x_k(i_{\mathrm{event}})\big) + \prod_{k}^{\mathrm{variables}} p^{B}_{k}\big(x_k(i_{\mathrm{event}})\big)} \;\longrightarrow\; \hat{y}_{L}(i_{\mathrm{event}}) = \frac{\prod_{k}^{\mathrm{variables}} \hat{p}^{S}_{k}\big(\hat{T}_{S}\,\mathbf{x}(i_{\mathrm{event}})\big)}{\prod_{k}^{\mathrm{variables}} \hat{p}^{S}_{k}\big(\hat{T}_{S}\,\mathbf{x}(i_{\mathrm{event}})\big) + \prod_{k}^{\mathrm{variables}} \hat{p}^{B}_{k}\big(\hat{T}_{B}\,\mathbf{x}(i_{\mathrm{event}})\big)}$$

with T̂_S the signal transformation and T̂_B the background transformation.

For other estimators, one needs to decide on one of the two (or decorrelate on a mixture of signal and background events).
Decorrelation at Work
Example: linearly correlated Gaussians → decorrelation works to 100%.

A 1-D likelihood on the decorrelated sample gives the best possible performance; compare also the effect on the MVA output variable!

[Plots: the correlated variables, the same variables after decorrelation, and the MVA output before/after (note the different scale on the y-axis).]
Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or nonlinear correlations, decorrelation needs to be treated with care.

How does linear decorrelation affect cases where the correlations between signal and background differ?

[Signal and Background scatter plots: original correlations.]
[Signal and Background scatter plots: after SQRT decorrelation.]

How does linear decorrelation affect strongly nonlinear cases?

[Signal and Background scatter plots: original correlations.]
[Signal and Background scatter plots: after SQRT decorrelation.]

Watch out before you use decorrelation “blindly”!! Perhaps “decorrelate” only a subspace!
“Gaussian-isation”

Improve the decorrelation by pre-Gaussianisation of the variables.

First: a transformation to achieve a uniform (flat) distribution:

$$x^{\mathrm{flat}}_{k}(i_{\mathrm{event}}) = \int_{-\infty}^{x_{k}(i_{\mathrm{event}})} p_{k}(x_k)\,dx_k,\qquad k \in \{\mathrm{variables}\}$$

This is the Rarity transform of variable k: x_k(i_event) is the measured value and p_k is the PDF of variable k. The integral can be solved in an unbinned way by event counting, or by creating non-parametric PDFs (see the likelihood section later).

Second: make it Gaussian via the inverse error function:

$$x^{\mathrm{Gauss}}_{k}(i_{\mathrm{event}}) = \sqrt{2}\,\operatorname{erf}^{-1}\!\big(2\,x^{\mathrm{flat}}_{k}(i_{\mathrm{event}}) - 1\big),\qquad \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}}\int_{0}^{x} e^{-t^{2}}\,dt$$

Third: decorrelate (and “iterate” this procedure).
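A sketch of this Gaussianisation chain, using the empirical CDF as the unbinned “flat” transform (the rank/(N+1) convention and the toy sample are my choices):

```python
# Gaussianise one variable: empirical CDF -> sqrt(2) * erfinv(2*x_flat - 1).
import numpy as np
from scipy.special import erfinv
from scipy.stats import rankdata

def gaussianise(column):
    # empirical CDF in (0, 1); rank/(N+1) avoids the endpoints 0 and 1
    x_flat = rankdata(column) / (len(column) + 1.0)
    return np.sqrt(2.0) * erfinv(2.0 * x_flat - 1.0)

rng = np.random.default_rng(9)
sample = rng.exponential(2.0, 100_000)     # very non-Gaussian input
g = gaussianise(sample)
print(f"mean = {g.mean():.3f}, std = {g.std():.3f}")   # ~0 and ~1 afterwards
```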
[Plots: original versus Gaussianised distributions, for signal and for background.]

We cannot simultaneously “Gaussianise” both signal and background!
Summary

Hope you are all convinced that multivariate algorithms are nice and powerful classification techniques:

• they do not use hard selection criteria (cuts) on each individual observable
• they look at all observables “together”, e.g. by combining them into one variable

Multidimensional likelihood → PDF in D dimensions. Projective likelihood (naïve Bayes) → PDFs in D times 1 dimension. And: how to “avoid” correlations.