Statistical Learning Methods in HEAP


Jens Zimmermann, Christian Kiesling

Max-Planck-Institut für Physik, München

MPI für extraterrestrische Physik, München

Forschungszentrum Jülich GmbH

Statistical Learning: Introduction with a simple example

Occam's Razor

Decision Trees

Local Density Estimators

Methods Based on Linear Separation

Examples: Triggers in HEP and Astrophysics

Conclusion



Statistical Learning

• Does not use prior knowledge: "no theory required"

• Learns only from examples: "trial and error", "learning by reinforcement"

• Two classes of statistical learning: discrete output 0/1 ("classification") and continuous output ("regression")

• Applications in high-energy and astrophysics: background suppression and purification of events; estimation of parameters not directly measured


A simple Example: Preparing a Talk

[Scatter plot: # slides vs. # formulas, both axes 0-60; Experimentalists vs. Theorists]

# formulas   # slides
    42          21
    28           8
    71          19
    64          31
    29          36
    15          34
    48          44
    56          51
    25          55
    12          16

Database established by Jens during the Young Scientists Meeting at MPI


Discriminating Theorists from Experimentalists: A First Analysis

[Projections: # formulas (0-60) and # slides (0-60), Experimentalists vs. Theorists]

[Scatter plots: # slides vs. # formulas for the first talks handed in, and for the talks a week before the meeting]


First Problems

[Scatter plot: a complicated boundary] Completely separable, but only via a complicated boundary.

[Scatter plot: a straight-line boundary] Simple "model", but no complete separation.

New talk by Ludger: 28 formulas on 31 slides. At this point we cannot know which feature is "real"! Use train/test or cross-validation!


See Overtraining - Want Generalization - Need Regularization

Want to tune the parameters of the learning algorithm depending on the overtraining seen!

[Scatter plots: the sample split into a training set and a test set]

E = \frac{1}{N} \sum_{i=1}^{N} \left( t_i - \mathrm{out}(x_i) \right)^2

[Plot: error E vs. training epochs; the training-set error keeps falling while the test-set error rises again: overtraining]


See Overtraining - Want Generalization Need Regularization

[Scatter plots: training set and test set, as before]

E(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \left( t_i - \mathrm{out}(x_i, \mathbf{w}) \right)^2

[Plot: error E vs. training epochs for the training and test sets]

Regularization will ensure adequate generalization performance (e.g. via VC dimensions): limit the complexity of the model.

"Factor 10" rule ("Uncle Bernie's Rule #2"): use roughly ten times more training examples than free parameters of the model.
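As an illustration of the train/test monitoring described above, here is a minimal sketch. It assumes scikit-learn and synthetic stand-in data (not the talk's formulas-vs-slides sample): the error E is tracked per epoch on both sets, and training would be stopped once the test error turns up.

```python
# Minimal sketch of overtraining detection with a train/test split.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # simple underlying rule ...
y = y ^ (rng.random(200) < 0.1).astype(int)      # ... plus 10% label noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(20,), learning_rate_init=0.05)
for epoch in range(200):
    net.partial_fit(X_tr, y_tr, classes=[0, 1])  # one pass over the training set
    # E = (1/N) sum_i (t_i - out(x_i))^2, evaluated on both sets
    E_tr = np.mean((y_tr - net.predict_proba(X_tr)[:, 1]) ** 2)
    E_te = np.mean((y_te - net.predict_proba(X_te)[:, 1]) ** 2)
    # once E_te starts rising while E_tr keeps falling, overtraining has set in
```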


Philosophy: Occam's Razor

"Pluralitas non est ponenda sine necessitate." (William of Ockham, 14th century)

• Do not make assumptions unless they are really necessary.

• From theories which describe the same phenomenon equally well, choose the one which contains the least number of assumptions.

First razor: Given two models with the same generalization error, the simpler one should be preferred because simplicity is desirable in itself. (Yes! But not of much use.)

Second razor: Given two models with the same training-set error, the simpler one should be preferred because it is likely to have lower generalization error. (No! "No free lunch" theorem, Wolpert 1996.)


Decision Trees

[1-D projections of # formulas and # slides, with the cut positions marked]

All events are split by successive cuts on single variables:
- #formulas < 20 → exp
- #formulas > 60 → th
- rest (20 < #formulas < 60): split this subset on # slides:
  - #slides > 40 → exp
  - #slides < 40 → th

Classify Ringaile: 31 formulas on 32 slides → th

Regularization: pruning (a code sketch of this tree follows below)
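The tree above can be written out directly as nested cuts; a minimal sketch in plain Python, with scikit-learn's cost-complexity pruning noted as a standard way to regularize a learned tree:

```python
# The slide's decision tree as nested cuts on single variables.
def classify(n_formulas: int, n_slides: int) -> str:
    if n_formulas < 20:
        return "exp"                      # few formulas -> experimentalist
    if n_formulas > 60:
        return "th"                       # many formulas -> theorist
    # rest: 20 < #formulas < 60, decide on the number of slides
    return "exp" if n_slides > 40 else "th"

print(classify(31, 32))                   # Ringaile -> 'th'
# A learned tree would be regularized by pruning, e.g. in scikit-learn via
# DecisionTreeClassifier(ccp_alpha=...) (cost-complexity pruning).
```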


Local Density Estimators

Search for similar events already classified within specified region,count the members of the two classes in that region.

[Scatter plots: a region around the evaluation point; the members of the two classes are counted inside it]


Maximum Likelihood

[Projections of # formulas and # slides for the two classes, with Ringaile's talk at (31, 32)]

p_{Th} = \frac{2}{5} \cdot \frac{3}{5} = 0.24 \qquad p_{Exp} = \frac{1}{5} \cdot \frac{1}{5} = 0.04 \qquad \mathrm{out} = \ldots

Correlation gets lost completely by projection! Regularization: binning.
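A sketch of this projected-likelihood estimate in numpy. The bin edges, the normalization out = p_Th/(p_Exp + p_Th), and the class encoding are my assumptions, not the slide's exact choices:

```python
# Projected (naive) likelihood: histogram each variable per class and
# multiply the per-bin fractions. Binning = regularization. Any correlation
# between the variables is lost by the projections.
import numpy as np

def projected_likelihood(train_X, train_y, x, bins=5, lo=0.0, hi=60.0):
    edges = np.linspace(lo, hi, bins + 1)
    p = {}
    for cls in (0, 1):                        # 0 = Exp, 1 = Th (assumed encoding)
        sel = train_X[train_y == cls]
        prob = 1.0
        for d in range(train_X.shape[1]):     # one 1-D projection per variable
            hist, _ = np.histogram(sel[:, d], bins=edges)
            b = np.clip(np.searchsorted(edges, x[d]) - 1, 0, bins - 1)
            prob *= hist[b] / max(len(sel), 1)
        p[cls] = prob
    return p[1] / (p[0] + p[1] + 1e-12)       # assumed normalization for 'out'
```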


k-Nearest-Neighbour

[Scatter plot: the evaluation point and its nearest training points for k = 1 ... 5, each giving an output out = ...]

For every evaluation position the distances to each training position need to be determined!

Regularization: parameter k
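A minimal brute-force k-NN matching the description above (numpy; Euclidean distance is assumed):

```python
# k-nearest-neighbour output: fraction of class-1 points among the k nearest
# training points. Every training distance is computed for each evaluation.
import numpy as np

def knn_out(train_X, train_y, x, k):
    d = np.linalg.norm(train_X - x, axis=1)   # distances to all training points
    nearest = np.argsort(d)[:k]               # indices of the k closest
    return train_y[nearest].mean()            # out in [0, 1]; k regularizes
```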


Range Search

[Scatter plot: evaluation point with a small and a large box; k-d tree over the training points 1-10, splitting alternately on x and y]

The tree needs to be traversed only partially if the box size is small enough!

Small box: checked 1, 2, 4, 9; out = ...
Large box: checked all; out = ...

Regularization: box size
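A sketch of the range search with a k-d tree using scipy's cKDTree; the data and the box half-width are illustrative. With the infinity norm, query_ball_point returns exactly the points inside an axis-aligned box, and a small box touches only part of the tree:

```python
# Range search: count class members inside a box using a k-d tree.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
train_X = rng.uniform(0, 60, size=(100, 2))        # illustrative training points
train_y = (train_X[:, 0] > 30).astype(int)         # illustrative labels

tree = cKDTree(train_X)
idx = tree.query_ball_point([31, 32], r=10, p=np.inf)  # box of half-width 10
out = train_y[idx].mean() if idx else 0.5          # class fraction in the box
# an empty box gives no estimate; 0.5 is an assumed fallback here
```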


Methods Based on Linear Separation

Divide the input space into regions separated by one or more hyperplanes.

Extrapolation is done!

[Scatter plots: linear separation of the two classes by a single hyperplane]

LDA (Fisher discriminant)
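A minimal Fisher discriminant in numpy (the within-class covariances are pooled by simple addition, an assumed equal-weight pooling):

```python
# Fisher's linear discriminant: w = Sw^{-1} (m1 - m0), a single hyperplane.
import numpy as np

def fisher_w(X0, X1):
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)  # pooled scatter
    return np.linalg.solve(Sw, m1 - m0)       # normal vector of the hyperplane

# classify by cutting on the projection w @ x; note that the plane extends
# into regions with no training data, i.e. extrapolation is done
```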


Neural Networks

\sigma(a) = \frac{1}{1 + e^{-a}}

[Network diagram: inputs # formulas and # slides feeding two hidden neurons and one output; example weights as drawn on the slide]

Each neuron computes y = \sigma\!\left( \sum_i w_i x_i - s \right) with weights w_i and threshold s.

[Decision surface of the network over the (# formulas, # slides) plane, output between 0 and 1]

Network with two hidden neurons (the architecture generalizes to arbitrary inputs and hidden neurons), trained by gradient descent on

E = \frac{1}{N} \sum_{i=1}^{N} \left( t_i - \mathrm{out}(x_i) \right)^2

Regularization: # hidden neurons, weight decay
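A from-scratch sketch of such a network (2 inputs, 2 hidden sigmoid neurons, sigmoid output) trained by gradient descent on E. The data, initial weights, and learning rate are illustrative, and inputs are assumed scaled to [0, 1]:

```python
# 2-2-1 network trained by gradient descent on E = (1/N) sum (t_i - out_i)^2.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 2))          # inputs scaled to [0, 1]
t = (X[:, 0] > X[:, 1]).astype(float)         # illustrative targets

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2) # input -> 2 hidden neurons
W2, b2 = rng.normal(size=2), 0.0              # hidden -> 1 output neuron

eta = 0.5                                     # learning rate
for epoch in range(2000):
    h = sigma(X @ W1 + b1)                    # hidden activations
    out = sigma(h @ W2 + b2)                  # network output in (0, 1)
    g = (out - t) * out * (1 - out) / len(X)  # dE/d(pre-activation); factor 2 absorbed in eta
    gh = np.outer(g, W2) * h * (1 - h)        # backpropagate to hidden layer
    W2 -= eta * h.T @ g                       # gradient steps on all weights
    b2 -= eta * g.sum()
    W1 -= eta * X.T @ gh
    b1 -= eta * gh.sum(axis=0)
# weight decay would add a term ~ lambda * w to each gradient (regularization)
```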


Support Vector Machines

Separating hyperplane with maximum distance to each data point: maximum margin classifier.

Found by setting up the condition for correct classification, y_i (w \cdot x_i + b) \geq 1, and minimizing \frac{1}{2} \lVert w \rVert^2, which leads to the Lagrangian

L = \frac{1}{2} \lVert w \rVert^2 - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]

A necessary condition for a minimum is w = \sum_i \alpha_i y_i x_i, and the output becomes

\mathrm{out} = \operatorname{sgn}\left( \sum_i \alpha_i y_i (x_i \cdot x) + b \right)

Only linear separation? No! Replace the dot products x \cdot y by \Phi(x) \cdot \Phi(y). The mapping to feature space \Phi : \mathbb{R}^d \to F is hidden in a kernel K(x, y) = \Phi(x) \cdot \Phi(y).

Non-separable case: \frac{1}{2} \lVert w \rVert^2 \to \frac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i
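A kernel-SVM sketch using scikit-learn; the RBF kernel and the values of C and gamma are illustrative choices, as is the ring-shaped toy dataset:

```python
# Soft-margin SVM with an RBF kernel: the kernel replaces the dot product,
# and C weighs the slack terms xi_i of the non-separable case.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(0, 60, size=(200, 2))
y = ((X[:, 0] - 30) ** 2 + (X[:, 1] - 30) ** 2 < 400).astype(int)  # not linearly separable

clf = SVC(kernel="rbf", C=10.0, gamma=0.01).fit(X, y)
# decision_function(x) = sum_i alpha_i y_i K(x_i, x) + b over the support vectors
print(clf.n_support_, clf.decision_function([[31, 32]]))
```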


Physics Applications: Neural Network Trigger at HERA

Keep physics, reject background (H1).


Trigger for J/ψ Events

Efficiency at 95% background rejection (H1):
NN 99.6%, SVM 98.3%, k-NN 97.7%, RS 97.5%, C4.5 97.5%, ML 91.2%, LDA 82%


Triggering Charged Current Events

[Feynman diagram: charged current e p → ν X via W exchange; signal vs. background event displays]

Efficiency at 80% background rejection:
NN 74%, SVM 73%, C4.5 72%, RS 72%, k-NN 71%, LDA 68%, ML 65%


Astrophysics: MAGIC - Gamma/Hadron Separation

Training with data and MC, evaluation with data.

[Shower images: photon vs. hadron]

Signal (photon) enhancement factor: Random Forest 93.3, Neural Net 96.5


Future Experiment XEUS: Position of X-ray Photons

[Detector schematic: ~300 µm and ~10 µm structures; electron potential, transfer direction]

σ of the position reconstruction in µm:
NN 3.6, SVM 3.6, k-NN 3.7, RS 3.7, ETA 3.9, CCOM 4.0

x_{COM} = \frac{\sum_i c_i x_i}{\sum_i c_i} \qquad x_{CCOM} = x_{COM} + \Delta(x_{COM})

with Δ a position-dependent correction.

(Application of Stat. Learning in Regression Problems)
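The centre-of-mass estimate is a one-liner; a numpy sketch, with illustrative pixel positions and charges. The CCOM correction Δ is detector-specific and not reproduced here:

```python
# Charge-weighted centre of mass of the measured pixel signals.
import numpy as np

def x_com(x, c):
    return np.sum(c * x) / np.sum(c)          # x_COM = sum(c_i x_i) / sum(c_i)

pos = np.array([0.0, 75.0, 150.0])            # pixel centres (illustrative units)
charge = np.array([0.2, 1.0, 0.4])            # measured charges c_i
print(x_com(pos, charge))                     # CCOM would add a correction D(x_COM)
```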


Conclusion

• Statistical learning theory is full of subtle details (models, statistics)

• Neural networks were found superior in the HEP and astrophysics applications (classification, regression) studied so far

• Widely used statistical learning methods were studied:
  • Decision trees
  • LDE: ML, k-NN, RS
  • Linear separation: LDA, neural nets, SVMs

• Further applications (trigger, offline analyses) under study


From Classification to Regression

k-NN: the output becomes the average of the target values of the k nearest training points.
[Figure: training points with target values 3, 4, 5, 3, 2, 2 around the evaluation point; out = ...]

RS: the output becomes the average of the target values inside the box.
[Figure: the same points with a search box; out = ...]

NN: minimize

E = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \mathrm{out}(x_i) \right)^2

Fit of a Gauss curve with two hidden neurons:

a = \sigma(-2.1x - 1) \qquad b = \sigma(+2.1x - 1) \qquad \mathrm{out} = -12.7a - 12.7b + 9.4
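Evaluating the quoted network directly; the sigmoidal hidden units are an assumption carried over from the Neural Networks slide, and the output unit is taken as linear, as is usual for regression:

```python
# The 1-2-1 regression network with the slide's weights: a bump that
# approximates a Gauss curve.
import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))

def net(x):
    a = sigma(-2.1 * x - 1.0)                 # hidden neuron a
    b = sigma(+2.1 * x - 1.0)                 # hidden neuron b
    return -12.7 * a - 12.7 * b + 9.4         # linear output unit

xs = np.linspace(-3.0, 3.0, 7)
print(net(xs))                                # peaks near x = 0, falls off symmetrically
```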