
Transcript of « Linear Discriminant Analysis (LDA) for selection cuts »

Page 1:

Linear Discriminant Analysis (LDA)

for selection cuts :

• Motivations
• Why LDA ?
• How does it work ?
• Concrete examples
• Conclusions and S.Antonio’s present

Julien Faivre Alice week Utrecht – 14 June 2005

Page 2:

Initial motivations :

• Some particles are critical at all p_T and in all collision systems

• Where statistics is needed, per observable :
  • Production yields : p-p collisions, low p_T
  • Spectra slope : all p_T
  • <p_T>, azimuthal anisotropy (v2) : all p_T
  • Scaled spectra (R_CP, R_AA), v2 : peripheral, p-p, high p_T

• Need more statistics
• Need fast and easy selection optimization
→ Apply a pattern-classification method

• Examples of initial S/N ratios : @RHIC = 10^-10, @RHIC = 10^-11, D0@LHC = 10^-8

Page 3:

• Want to extract the signal out of the background

• « Classical cuts » : example with n = 2 variables (actual analyses : 5 to 30+)

• For a good efficiency on signal (recognition), the pollution by background is high (false alarms)

• A compromise has to be found between good efficiency and a high S/N ratio

• Tuning the cuts is long and difficult

[Figure : signal and background in the (Variable 1, Variable 2) plane, with two settings a and b of rectangular cuts]

Basic strategy : the « classical cuts »

Page 4:

• Pattern-classification methods available :
  • Bayesian decision theory
  • Markov fields, hidden Markov models
  • Nearest neighbours
  • Parzen windows
  • Linear Discriminant Analysis
  • Neural networks
  • Unsupervised learning methods

Linear Discriminant Analysis (LDA) :
  • Linear
  • Simple training
  • Simple tuning → fast tuning
  • Linear shape, but multi-cut OK
  • Connected shape

Neural networks :
  • Non-linear
  • Complex training → overtraining
  • Choose layers & neurons → long tuning
  • Non-linear shape
  • Non-connected shape

• The shape flexibility is the only advantage of neural nets → choose LDA

Not an absolute answer ; just tried it, and it turns out it works fine

Which pattern classification method ?

Page 5:

[Figure : signal and background in the (Variable 1, Variable 2) plane ; the « best axis » is the LDA direction u]

LDA mechanism :

• Simplest idea : cut along a linear combination of the n observables, $u^t x = u_1 x_1 + u_2 x_2 + \dots + u_n x_n$, where $u = (u_1, \dots, u_n)$ is the LDA axis

→ Cut on the scalar product $u^t x$ (a minimal sketch follows)
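A minimal sketch of this mechanism (stand-alone illustrative C++, not the actual ALICE class) :

    #include <vector>
    #include <numeric>

    // u1*x1 + u2*x2 + ... + un*xn : projection of candidate x on the LDA axis u.
    double Project(const std::vector<double>& u, const std::vector<double>& x)
    {
        return std::inner_product(u.begin(), u.end(), x.begin(), 0.0);
    }

    // The n-dimensional selection reduces to a single threshold on a scalar :
    bool PassLdaCut(const std::vector<double>& u, double threshold,
                    const std::vector<double>& x)
    {
        return Project(u, x) > threshold;
    }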

Page 6:

• Need a criterion to find the LDA direction

• The direction found will depend on the criterion chosen

• Fisher criterion (widely used) :

• The projection of the points on a direction u gives the distributions of classes 1 and 2 along this direction

• μ_i = mean of distribution i
• σ_i = width of distribution i

• μ_1 and μ_2 have to be as far as possible from each other ; σ_1 and σ_2 have to be as small as possible

[Figure : distributions of classes 1 and 2 projected on the LDA axis, with means μ_1, μ_2, widths σ_1, σ_2 and distance μ_2 - μ_1]

LDA criterion : Fisher :

Page 7:

• Fisher-LDA doesn’t work for us :
  • too much background, too little signal ;
  • the background covers all the area where the signal lies

• Fisher-LDA « considers » the distributions as Gaussian (mean and width) → insensitive to local parts of the distributions

• Solutions :
  • Apply several successive LDA cuts
  • Change the criterion : Fisher « optimized »

[Figure : a case where Fisher is good (not us) and a case where Fisher is not good (us) ; log scale]

Improvements needed :

Page 8:

• More cuts = better description of the « signal/background boundary »
• BUT : with many cuts, it tends to describe the boundary too locally

[Figure : 1st and 2nd best axes in the (Variable 1, Variable 2) plane]

• Fisher is global → irrelevant for multi-cut LDA

• Have to find a criterion that depends locally on the distributions, not globally

• Criterion « optimized I » : given an efficiency of the k-th LDA cut on the signal, maximisation of the number of background candidates cut (a counting sketch follows below)
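A rough sketch of that performance figure, assuming candidates are represented by their projections on the trial direction u (illustrative code) :

    #include <vector>
    #include <algorithm>

    // Number of background candidates cut when the threshold is set so that
    // a fraction effS of the signal survives the k-th LDA cut.
    int BackgroundCut(std::vector<double> sigProj,        // copied : gets sorted
                      const std::vector<double>& bkgProj, double effS)
    {
        std::sort(sigProj.begin(), sigProj.end());
        // Threshold = quantile of the signal projections keeping effS of them :
        const std::size_t idx =
            static_cast<std::size_t>((1.0 - effS) * (sigProj.size() - 1));
        const double thr = sigProj[idx];

        int nCut = 0;
        for (double p : bkgProj)
            if (p <= thr) ++nCut;      // rejected by this cut
        return nCut;                   // the figure to maximise over directions u
    }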

Multi-cut LDA & optimized criterion :

Page 9:

• Boundary described by a straight line : mmmh… By a curve : still not satisfied. Almost candidate-per-candidate : happy ?

• Over the training sample, the quasi candidate-per-candidate boundary looks very good ; over the test sample it is very bad, while the simpler boundaries are not so bad

→ Too local a description → bad performance

Caution with the description of the boundary :

• Case of LDA : the more cuts, the better the limit is known (determined from the number of background candidates cut) → everything under control !


Non-linear approaches :

Page 10:

[Figure : LDA tightening vs classical cuts ; successive LDA directions (28th to 31st) ; the gain and the best LDA cut value are read at the minimal relative uncertainty obtained with LDA]

LDA cut-tuning :

Page 11:

• Jeff Speltz’s 62 GeV K (topological) analysis (SQM 2004) :


LDA for STAR’s hyperons :

[Figure : classical vs LDA ; LDA : + 63 % signal]

Page 12:

• Ludovic Gaudichet : strange particles (topologically) : K0s, Λ, then Ξ and Ω
  - Neural nets don’t even reach the optimized classical cuts
  - Cascaded neural nets do, but don’t do better
  - LDA seems to do better (ongoing study)

• J.F. : charmed meson D0 in Kπ (topologically)
  - Very preliminary results on the p_T-integrated raw yield (Pb-Pb central) ; « current classical cuts » : Andrea Dainese’s thesis, PPR
  - Statistical relative uncertainty (ΔS/S) on PID-filtered candidates : current classical = 4.4 %, LDA = 2.1 % → 2.1 times better
  - Statistical relative uncertainty on « unfiltered » candidates (just (,)’s out) : current classical = 4.3 %, LDA = 1.6 % → 2.7 times better
  - Looking at the LDA distributions → new classical set found : does 1.6 times better than the current classical


LDA in ALICE :

Page 13:

LDA in ALICE (comparison) :

[Figure : optimized classical vs LDA ; VERY PRELIMINARY !!]

Page 14:

LDA in ALICE (performance) :

Purity-efficiency plot : current classical cuts, new classical cuts, LDA cuts

Significance vs signal : optimal LDA cut (tuned wrt the relative uncertainty)

PID-filtered D0’s, with quite tight classical pre-cuts applied

Page 15:

LDA in ALICE (tuning) :

[Figure : relative uncertainty vs efficiency (with zoom) ; current classical, new classical, LDA, and the optimal LDA cut]

• Tuning = search for the minimum of a valley-shaped 1-dimensional function (a minimal scan sketch follows)
• 2 hypotheses for the background estimation
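A minimal sketch of such a scan ; the yield models below are toy assumptions, not ALICE numbers :

    #include <cmath>
    #include <cstdio>

    // Toy yield models vs the LDA cut value (illustrative only) :
    static double SignalYield(double cut)     { return 1000.0 * (1.0 - cut); }
    static double BackgroundYield(double cut) { return 50000.0 * std::exp(-8.0 * cut); }

    int main()
    {
        double bestCut = 0.0, bestRelErr = 1e30;
        for (double cut = 0.0; cut <= 0.99; cut += 0.01) {
            const double s = SignalYield(cut);
            const double b = BackgroundYield(cut);
            const double relErr = std::sqrt(s + b) / s;  // valley-shaped in cut
            if (relErr < bestRelErr) { bestRelErr = relErr; bestCut = cut; }
        }
        std::printf("optimal LDA cut = %.2f (sigma(S)/S = %.3f)\n", bestCut, bestRelErr);
        return 0;
    }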

Page 16:

The method we have now :

• Linear• Easy implementation (as classical cuts) and class ready ! (See next slide)

• Better usage of the N-dim information• Multicut not as limited as Fisher• Provides transformation from Rn to R trivial optimization of the cuts• Know when limit (too local) is reached

• Performance : better than classical cuts

• Cut-tuning : obvious (classical cuts : nightmare) cool for other centrality classes, collision energies, colliding systems, p ranges


Conclusion :

Also provides systematics :
 - LDA vs classical,
 - changing the LDA cut value,
 - LDA set 1 vs LDA set 2

Cherry on the cake : optimal usage of the ITS for particles with long cτ’s (Λ, K0s, Ξ, Ω) :

• 6 layers & 3 daughter tracks
• 343 (= 7^3) hit combinations / sets of classical cuts !!

→ Add 3 variables to the LDA (number of ITS hits of each daughter) → automatic ITS cut-tuning

Strategy could be :
 1- tune LDA
 2- derive classical cuts from LDA

Page 17:

• A C++ class which performs LDA is available
• Calculates the LDA cuts with the chosen method, parameters and variable rescaling
• Has a function Pass to check whether a candidate passes the calculated cuts (toy interface sketch below)
• Plug-and-play : whatever the analysis, no change in the code is required
• « Universal » input format (tables)
• Ready-to-use : options have default values → no need to worry for a first look
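A toy mock of that interface ; only the function name Pass comes from the talk, the struct and member names are invented for illustration. A candidate is kept if it passes every successive LDA cut (multi-cut LDA) :

    #include <vector>
    #include <numeric>

    struct LdaCutSet {
        std::vector<std::vector<double>> axes;  // one LDA direction per cut
        std::vector<double> thresholds;         // best cut value for each

        bool Pass(const std::vector<double>& x) const {
            for (std::size_t k = 0; k < axes.size(); ++k) {
                const double proj = std::inner_product(axes[k].begin(), axes[k].end(),
                                                       x.begin(), 0.0);
                if (proj <= thresholds[k]) return false;  // fails the k-th cut
            }
            return true;  // candidate passes all the calculated cuts
        }
    };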

• The code is documented for how to use it (examples included)

• Full documentation about LDA and the optimization is available
• An example of filtering code which makes plots like in the previous slide is available

• Not yet on the web, send e-mail ([email protected])

• Statistics needed for training : with the optimized criterion, it looks like 2000 S and N after cuts are enough


S. Antonio’s present : available tool :

Page 18:

BACKUP

Page 19:

Rotating :

[Figure : effect of rotating on the invariant-mass distribution ; a real Xi is destroyed, one fake Xi is destroyed, another fake Xi is created, some combinations give nothing]

• Destroys the signal

• Keeps the background

• Destroys some correlations → has to be studied


Page 20:

Pattern classification :


• p classes of objects of the same type → here, 2 classes : signal (real Xis) and background (combinatorial) ; 1 type : the Xi vertex

• n observables, defined for all the classes → here : DCA’s, decay length, number of hits, etc…

• p samples of N_k objects for each class k → here : background sample from real data, signal sample from simulation (embedding)

• Learning : done on these p samples

• Usage / goal : classify a new object into one of the defined classes → an observed Xi vertex = signal or background

Page 21:

Fisher criterion :

• Fisher criterion : maximisation of $\dfrac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$

• No need for a maximisation algorithm

• The LDA direction u is directly given by : $u = S_w^{-1}\,(m_1 - m_2)$
  (m_1, m_2 : mean-vectors ; S_w : within-class scatter matrix)

• All done with simple matrix operations

• Calculating the axis is way faster than reading the data

Page 22:

Mathematically speaking (I) :

• Fisher criterion : maximisation of $\dfrac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$

• Let’s call u the vector of the LDA axis, and x_k the vector of the k-th candidate of the training (learning) sample

• Mean of class i (vector) : $m_i = \frac{1}{N_i} \sum_{k \in i} x_k$

• Mean of the projection on u for class i : $\mu_i = \frac{1}{N_i} \sum_{k \in i} u^t x_k = u^t m_i$

• So : $\mu_1 - \mu_2 = u^t\,(m_1 - m_2)$

Page 23:

Mathematically speaking (II) :

• Now : $\sigma_i^2 = \sum_{k \in i} (u^t x_k - \mu_i)^2$

• Let’s define $S_i = \sum_{k \in i} (x_k - m_i)(x_k - m_i)^t$ and $S_w = S_1 + S_2$

• So : $\sigma_1^2 + \sigma_2^2 = u^t S_w u$

• In-one-shot booking of the matrix : $S_i = \sum_{k \in i} x_k x_k^t - N_i\, m_i m_i^t$
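These formulas translate directly into ROOT’s matrix classes ; a minimal sketch (function name and sample layout are illustrative) :

    #include "TMatrixD.h"
    #include "TVectorD.h"
    #include <vector>

    // u = Sw^-1 (m1 - m2) for two samples of n-dimensional candidates.
    TVectorD FisherDirection(const std::vector<TVectorD>& sig,
                             const std::vector<TVectorD>& bkg)
    {
        const Int_t n = sig[0].GetNrows();
        TVectorD m1(n), m2(n);
        for (const TVectorD& x : sig) m1 += x;
        for (const TVectorD& x : bkg) m2 += x;
        m1 *= 1.0 / sig.size();
        m2 *= 1.0 / bkg.size();

        // Sw = S1 + S2, with Si = sum_k (xk - mi)(xk - mi)^t :
        TMatrixD sw(n, n);
        for (const TVectorD& x : sig) sw.Rank1Update(x - m1, 1.0);
        for (const TVectorD& x : bkg) sw.Rank1Update(x - m2, 1.0);

        sw.Invert();              // simple matrix operations only
        return sw * (m1 - m2);    // the LDA axis
    }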

Page 24:

• First find the Fisher LDA direction, as a starting point

• Define a « performance function » : vector u → performance figure

• Maximise the « performance figure » by varying the direction of u

• Several methods for the maximisation :
  • Easy and fast : one coordinate at a time
  • Fancy and powerful : genetic algorithm

Algorithm for the optimized criterion :

Page 25:

One coordinate at a time :


• Change the direction of u by steps of a constant angle Δθ : Δθ = 8° to start, then Δθ = 4°, 2°, 1°, eventually 0.5°

• Change the 1st coordinate of u until the performance reaches a maximum
• Change all the other coordinates like this, one by one
• Then try again with the 1st coordinate, and with the other ones

• When there is no improvement anymore : divide Δθ by 2 and do the whole thing again
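An illustrative sketch of this loop ; the performance function is passed in, and tilting one coordinate of a unit-norm u by tan(Δθ) approximates the constant-angle step :

    #include <vector>
    #include <cmath>
    #include <functional>

    void OptimiseDirection(std::vector<double>& u,
                           const std::function<double(const std::vector<double>&)>& perf)
    {
        const double kPi = 3.14159265358979323846;
        double best = perf(u);
        for (double stepDeg = 8.0; stepDeg >= 0.5; stepDeg /= 2.0) {
            const double tilt = std::tan(stepDeg * kPi / 180.0);
            bool improved = true;
            while (improved) {                    // sweep the coordinates repeatedly
                improved = false;
                for (std::size_t i = 0; i < u.size(); ++i)
                    for (double sign : {+1.0, -1.0}) {
                        std::vector<double> trial = u;
                        trial[i] += sign * tilt;  // rotate u towards/away from axis i
                        const double p = perf(trial);
                        if (p > best) { best = p; u = trial; improved = true; }
                    }
            }
        }
    }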

Page 26:

Genetic algorithm (I) :


• Problem with the « one-coordinate-at-a-time » algorithm : likely to fall into a local maximum different from the absolute maximum

• So : use genetic algorithm !

• Like genetic evolution :

• Pool of chromosomes

• Generations : evolution, reproduction

• Darwinist selection

• Mutations

Page 27:

Genetic algorithm (II) :


• Start with p chromosomes (p vectors uk) made randomly from Fisher

• Calculate performance figure of each uk

• Order the p vectors by decreasing value of the performance figure

• Keep only the m first vectors (Darwinist selection)

• Have them make children : build a new set of p chromosomes, with the m selected ones and combinations of them

• In the children chromosomes, introduce some mutations (modify randomly a coordinate)

• The new pool is ready : go back to the performance-figure calculation and iterate
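An illustrative sketch of this loop ; pool size, mutation size and the simple averaging combination are made-up parameters :

    #include <vector>
    #include <algorithm>
    #include <functional>
    #include <random>

    using Chromo = std::vector<double>;

    // Pool of p chromosomes (vectors u), Darwinist selection of the m best,
    // children built from combinations of the survivors, random mutations.
    Chromo Evolve(std::vector<Chromo> pool, std::size_t m, int generations,
                  const std::function<double(const Chromo&)>& perf)
    {
        std::mt19937 rng(42);
        std::uniform_real_distribution<double> mutation(-0.1, 0.1);
        const std::size_t p = pool.size();
        auto byPerf = [&](const Chromo& a, const Chromo& b) { return perf(a) > perf(b); };

        for (int gen = 0; gen < generations; ++gen) {
            std::sort(pool.begin(), pool.end(), byPerf);  // decreasing performance
            pool.resize(m);                               // Darwinist selection

            std::uniform_int_distribution<std::size_t> parent(0, m - 1);
            while (pool.size() < p) {                     // rebuild a pool of p
                const Chromo& pa = pool[parent(rng)];
                const Chromo& pb = pool[parent(rng)];
                Chromo child(pa.size());
                for (std::size_t i = 0; i < child.size(); ++i)
                    child[i] = 0.5 * (pa[i] + pb[i]);     // simple combination
                std::uniform_int_distribution<std::size_t> coord(0, child.size() - 1);
                child[coord(rng)] += mutation(rng);       // random mutation
                pool.push_back(child);
            }
        }
        std::sort(pool.begin(), pool.end(), byPerf);
        return pool.front();                              // best direction found
    }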

Page 28:

Statistics needed :


• Fisher-LDA : the samples need to have more than 10000 candidates each

• Doesn’t seem to depend on the number of observables (?) (tested with n = 10 and n = 22)

• Optimized criteria : need much more

• Guess : at minimum 50000 candidates per sample, maybe up to 500000 ?

• Depends on the number of observables

Page 29:

• Optimised criterion : can’t look at the oscillations to know whether there is enough statistics !

Statistics needed (II) :


[Figure : the LDA cut from the optimised criterion, step 1 and step 2, in the (Variable 1, Variable 2) plane]

Page 30:

Statistics needed (III) :


• Solutions :

  • Try all the combinations of k out of n observables (never used)
    • Problem : the number is huge (2^n - 1) : n = 5 → 31 combinations, n = 10 → 1023 combinations, n = 20 → 1048575 combinations !

  • Use underoptimal LDA (widely used) → see next slide

  • Use PCA : Principal Components Analysis (widely used) → see the slide after next

Page 31:

Part V – Various things :


• The projection of the LDA direction from the n-dimension space to a k-dimension sub-space is not the LDA direction of the projection of the samples from the n-dimension space to the k-dimension sub-space

• The more observables, the better :
  • Mathematically : adding an observable can’t lower the discriminancy
  • Practically : it can, because of the limited statistics available for training

• LDA (multi-cut) can’t do worse than cutting on each observable :
  • Because cutting on each observable is a particular case of multi-cut LDA !
  • If it does worse : the criterion isn’t good, or the efficiencies of the cuts are not well chosen

Page 32:

Underoptimal LDA :


• Calculate the discriminancy of each of the n observables

• Choose the observable with the highest discriminancy

• Calculate the discriminancy of each pair of observables containing the previously found one

• Choose the most discriminating pair

• Etc… with triplets, up to the desired number of directions

• Problem : the most discriminating pair containing the most discriminating single direction is not necessarily the actual most discriminating pair

[Figure : most discriminating direction, most discriminating pair containing it, and the actual most discriminating pair]
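An illustrative sketch of this greedy selection ; Discriminancy() is a stand-in for whatever figure is used to rank the subsets :

    #include <vector>
    #include <functional>

    // Grow the set of observables one at a time, always keeping the subset
    // whose LDA has the highest discriminancy.
    std::vector<int> GreedySelect(int n, int kMax,
        const std::function<double(const std::vector<int>&)>& discriminancy)
    {
        std::vector<int> chosen;
        std::vector<bool> used(n, false);
        for (int k = 0; k < kMax; ++k) {
            int bestObs = -1;
            double best = -1e30;
            for (int i = 0; i < n; ++i) {
                if (used[i]) continue;
                std::vector<int> trial = chosen;
                trial.push_back(i);                  // pair, triplet, etc.
                const double d = discriminancy(trial);
                if (d > best) { best = d; bestObs = i; }
            }
            chosen.push_back(bestObs);
            used[bestObs] = true;
        }
        return chosen;   // may miss the actual best pair (the problem above)
    }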

Page 33:

PCA – Principal Components Analysis (I) :


• Tool used in data reduction (e.g. image compression)

• Read the Root class description of TPrincipal

• Finds along which directions (linear combinations of the observables) most of the information lies

[Figure : primary and secondary component axes in the (Variable 1, Variable 2) plane ; the main information of a point is x1, dropping x2 isn’t important]

Page 34:

PCA – Principal Components Analysis (II) :


• All is matrix-based : easy

• The « informativeness » of a direction is given by the normalised eigenvalues

• Use with LDA : prior to finding the axis :
  • Observables = base B1 of the n-dimension space
  • Apply PCA over the signal+bkgnd samples (together) : get base B2 of the n-dimension space
  • Choose the k most informative directions : C2, subset of B2
  • Calculate the LDA axis in the space defined by C2

• If several LDA directions ? No problem : apply PCA but keep all the information of the candidates, just don’t use all of it for the LDA → PCA will give a different sub-space for each step
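A minimal sketch of this step using Root’s TPrincipal (the sample handling and the fixed n = 5 are illustrative) :

    #include "TPrincipal.h"
    #include "TVectorD.h"
    #include <array>
    #include <vector>

    void PcaExample(const std::vector<std::array<Double_t, 5>>& candidates)
    {
        const Int_t n = 5;                        // number of observables
        TPrincipal pca(n, "ND");                  // "N" = normalise, "D" = store data
        for (const auto& row : candidates)
            pca.AddRow(row.data());               // signal+bkgnd samples together
        pca.MakePrincipals();                     // diagonalise the covariance matrix

        // Normalised eigenvalues = « informativeness » of each direction :
        const TVectorD* eig = pca.GetEigenValues();
        eig->Print();

        // Project a candidate onto the principal-component base B2 :
        Double_t x[5] = {/* observables */}, p[5];
        pca.X2P(x, p);   // keep only the k most informative coordinates for LDA
    }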

Page 35:

PCA – Principal Components Analysis (III) :


• Problem of using PCA prior to LDA :
  • Whether to use it or not is purely empirical
  • The percentage of the eigenvalues to keep is also purely empirical

[Figure : in the (Variable 1, Variable 2) plane, the PCA 1st direction differs from the best discriminating axis]

Page 36:

PCA – Principal Components Analysis (IV) :


• Difference between PCA and LDA : example with the letters O and Q :

  • PCA finds where most of the information is : the most important part of O and Q is the big round shape → applying PCA means that both O and Q become O

  • LDA finds where most of the difference is : the difference between O and Q is the little line at the bottom-right → applying LDA means finding this little line

Page 37:

Influence of an LDA cut :


• Useful to know whether LDA cuts steeply or uniformly along each direction

• f_k = distribution of a sample along the direction of observable k
• g_k = the same, after the LDA cut
• F = the normalised integral of f
• $h(x) = (g/f)(F^{-1}(x))$

• $Q = \tfrac{1}{2} \int_0^1 |h(x) - 1|\, dx$

• Q = 0 → cut uniform ; Q = 1 → cut steep

[Figure : g/f plotted versus F, from 0 to 1]
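Assuming the definition reconstructed above, a sketch of the computation from two unit-normalised ROOT histograms f (before the cut) and g (after) :

    #include "TH1D.h"
    #include <cmath>

    // Q = 1/2 * integral of |h(x) - 1| dx with x = F(t) ; substituting
    // dx = f(t) dt turns it into an f-weighted integral over the observable.
    Double_t Steepness(const TH1D& f, const TH1D& g)
    {
        Double_t q = 0.0;
        for (Int_t b = 1; b <= f.GetNbinsX(); ++b) {   // identical binning assumed
            const Double_t fc = f.GetBinContent(b);
            if (fc <= 0) continue;                     // h undefined where f = 0
            const Double_t h = g.GetBinContent(b) / fc;
            q += 0.5 * std::fabs(h - 1.0) * fc * f.GetBinWidth(b);
        }
        return q;                                      // 0 = uniform, 1 = steep
    }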

Page 38:

V0 decay topology :
