
Transcript of « Linear Discriminant Analysis (LDA) for selection cuts »

Page 1:

Linear Discriminant Analysis (LDA)

for selection cuts :

• Motivations
• Why LDA ?
• How does it work ?
• Concrete examples
• Conclusions and S.Antonio’s present

Julien Faivre Alice week Utrecht – 14 June 2005

Page 2:

Initial motivations :

• Some particles are critical at all p_T and in all collision systems

• Where statistics is needed, per observable :
  • Production yields : p-p collisions, low p_T
  • Spectra slope : all p_T
  • <p_T>, azimuthal anisotropy (v2) : all p_T
  • Scaled spectra (R_CP, R_AA), v2 : peripheral, p-p, high p_T

• Need more statistics
• Need fast and easy selection optimization
→ Apply a pattern-classification method

• Examples of initial S/N ratios : @RHIC = 10^-10, @RHIC = 10^-11, D0@LHC = 10^-8

Page 3:

• Want to extract the signal out of the background

• « Classical cuts » : example with n = 2 variables (actual analyses : 5 to 30+)

• For a good efficiency on signal (recognition), the pollution by background is high (false alarms)

• A compromise has to be found between good efficiency and a high S/N ratio

• Tuning the cuts is long and difficult

[Figure : signal and background in the (Variable 1, Variable 2) plane, with two settings a and b of rectangular cuts]

Basic strategy : the « classical cuts »

Page 4:

• Pattern-classification methods available :
  • Bayesian decision theory
  • Markov fields, hidden Markov models
  • Nearest neighbours
  • Parzen windows
  • Linear Discriminant Analysis
  • Neural networks
  • Unsupervised learning methods

Linear Discriminant Analysis (LDA) :
  • Linear
  • Simple training
  • Simple tuning → fast tuning
  • Linear shape, but multi-cut OK
  • Connected shape

Neural networks :
  • Non-linear
  • Complex training → overtraining
  • Choose layers & neurons → long tuning
  • Non-linear shape
  • Non-connected shape

• The shape flexibility is the only advantage of neural nets → choose LDA

Not an absolute answer ; just tried it, and it turns out it works fine

Which pattern classification method ?

Page 5:

[Figure : signal and background in the (Variable 1, Variable 2) plane ; the « best axis » is the LDA direction u]

LDA mechanism :

• Simplest idea : cut along a linear combination of the n observables, $u^t x = u_1 x_1 + u_2 x_2 + \dots + u_n x_n$, where $u = (u_1, \dots, u_n)$ is the LDA axis

→ Cut on the scalar product $u^t x$ (a minimal sketch follows)
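A minimal sketch of this mechanism (stand-alone illustrative C++, not the actual ALICE class) :

    #include <vector>
    #include <numeric>

    // u1*x1 + u2*x2 + ... + un*xn : projection of candidate x on the LDA axis u.
    double Project(const std::vector<double>& u, const std::vector<double>& x)
    {
        return std::inner_product(u.begin(), u.end(), x.begin(), 0.0);
    }

    // The n-dimensional selection reduces to a single threshold on a scalar :
    bool PassLdaCut(const std::vector<double>& u, double threshold,
                    const std::vector<double>& x)
    {
        return Project(u, x) > threshold;
    }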

Page 6:

• Need a criterion to find the LDA direction

• The direction found will depend on the criterion chosen

• Fisher criterion (widely used) :

• The projection of the points on a direction u gives the distributions of classes 1 and 2 along this direction

• μ_i = mean of distribution i
• σ_i = width of distribution i

• μ_1 and μ_2 have to be as far as possible from each other ; σ_1 and σ_2 have to be as small as possible

[Figure : distributions of classes 1 and 2 projected on the LDA axis, with means μ_1, μ_2, widths σ_1, σ_2 and distance μ_2 - μ_1]

LDA criterion : Fisher :

Page 7:

• Fisher-LDA doesn’t work for us :
  • too much background, too little signal ;
  • the background covers all the area where the signal lies

• Fisher-LDA « considers » the distributions as Gaussian (mean and width) → insensitive to local parts of the distributions

• Solutions :
  • Apply several successive LDA cuts
  • Change the criterion : Fisher « optimized »

[Figure : a case where Fisher is good (not us) and a case where Fisher is not good (us) ; log scale]

Improvements needed :

Page 8:

• More cuts = better description of the « signal/background boundary »
• BUT : with many cuts, it tends to describe the boundary too locally

[Figure : 1st and 2nd best axes in the (Variable 1, Variable 2) plane]

• Fisher is global → irrelevant for multi-cut LDA

• Have to find a criterion that depends locally on the distributions, not globally

• Criterion « optimized I » : given an efficiency of the k-th LDA cut on the signal, maximisation of the number of background candidates cut (a counting sketch follows below)
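A rough sketch of that performance figure, assuming candidates are represented by their projections on the trial direction u (illustrative code) :

    #include <vector>
    #include <algorithm>

    // Number of background candidates cut when the threshold is set so that
    // a fraction effS of the signal survives the k-th LDA cut.
    int BackgroundCut(std::vector<double> sigProj,        // copied : gets sorted
                      const std::vector<double>& bkgProj, double effS)
    {
        std::sort(sigProj.begin(), sigProj.end());
        // Threshold = quantile of the signal projections keeping effS of them :
        const std::size_t idx =
            static_cast<std::size_t>((1.0 - effS) * (sigProj.size() - 1));
        const double thr = sigProj[idx];

        int nCut = 0;
        for (double p : bkgProj)
            if (p <= thr) ++nCut;      // rejected by this cut
        return nCut;                   // the figure to maximise over directions u
    }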

Multi-cut LDA & optimized criterion :

Page 9:

• Boundary described by a straight line : mmmh… By a curve : still not satisfied. Almost candidate-per-candidate : happy ?

• Over the training sample, the quasi candidate-per-candidate boundary looks very good ; over the test sample it is very bad, while the simpler boundaries are not so bad

→ Too local a description → bad performance

Caution with the description of the boundary :

• Case of LDA : the more cuts, the better the limit is known (determined from the number of background candidates cut) → everything under control !


Non-linear approaches :

Page 10:

[Figure : LDA tightening vs classical cuts ; successive LDA directions (28th to 31st) ; the gain and the best LDA cut value are read at the minimal relative uncertainty obtained with LDA]

LDA cut-tuning :

Page 11:

• Jeff Speltz’s 62 GeV K (topological) analysis (SQM 2004) :


LDA for STAR’s hyperons :

[Figure : classical vs LDA ; LDA : + 63 % signal]

Page 12:

• Ludovic Gaudichet : strange particles (topologically) : K0s, Λ, then Ξ and Ω
  - Neural nets don’t even reach the optimized classical cuts
  - Cascaded neural nets do, but don’t do better
  - LDA seems to do better (ongoing study)

• J.F. : charmed meson D0 in Kπ (topologically)
  - Very preliminary results on the p_T-integrated raw yield (Pb-Pb central) ; « current classical cuts » : Andrea Dainese’s thesis, PPR
  - Statistical relative uncertainty (ΔS/S) on PID-filtered candidates : current classical = 4.4 %, LDA = 2.1 % → 2.1 times better
  - Statistical relative uncertainty on « unfiltered » candidates (just (,)’s out) : current classical = 4.3 %, LDA = 1.6 % → 2.7 times better
  - Looking at the LDA distributions → new classical set found : does 1.6 times better than the current classical


LDA in ALICE :

Page 13:

LDA in ALICE (comparison) :

[Figure : optimized classical vs LDA ; VERY PRELIMINARY !!]

Page 14:

LDA in ALICE (performance) :

Purity-efficiency plot : current classical cuts, new classical cuts, LDA cuts

Significance vs signal : optimal LDA cut (tuned wrt the relative uncertainty)

PID-filtered D0’s, with quite tight classical pre-cuts applied

Page 15:

LDA in ALICE (tuning) :

[Figure : relative uncertainty vs efficiency (with zoom) ; current classical, new classical, LDA, and the optimal LDA cut]

• Tuning = search for the minimum of a valley-shaped 1-dimensional function (a minimal scan sketch follows)
• 2 hypotheses for the background estimation
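A minimal sketch of such a scan ; the yield models below are toy assumptions, not ALICE numbers :

    #include <cmath>
    #include <cstdio>

    // Toy yield models vs the LDA cut value (illustrative only) :
    static double SignalYield(double cut)     { return 1000.0 * (1.0 - cut); }
    static double BackgroundYield(double cut) { return 50000.0 * std::exp(-8.0 * cut); }

    int main()
    {
        double bestCut = 0.0, bestRelErr = 1e30;
        for (double cut = 0.0; cut <= 0.99; cut += 0.01) {
            const double s = SignalYield(cut);
            const double b = BackgroundYield(cut);
            const double relErr = std::sqrt(s + b) / s;  // valley-shaped in cut
            if (relErr < bestRelErr) { bestRelErr = relErr; bestCut = cut; }
        }
        std::printf("optimal LDA cut = %.2f (sigma(S)/S = %.3f)\n", bestCut, bestRelErr);
        return 0;
    }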

Page 16:

The method we have now :

• Linear• Easy implementation (as classical cuts) and class ready ! (See next slide)

• Better usage of the N-dim information• Multicut not as limited as Fisher• Provides transformation from Rn to R trivial optimization of the cuts• Know when limit (too local) is reached

• Performance : better than classical cuts

• Cut-tuning : obvious (classical cuts : nightmare) cool for other centrality classes, collision energies, colliding systems, p ranges


Conclusion :

Also provides systematics :
 - LDA vs classical,
 - changing the LDA cut value,
 - LDA set 1 vs LDA set 2

Cherry on the cake : optimal usage of the ITS for particles with long cτ’s (Λ, K0s, Ξ, Ω) :

• 6 layers & 3 daughter tracks
• 343 (= 7^3) hit combinations / sets of classical cuts !!

→ Add 3 variables to the LDA (number of ITS hits of each daughter) → automatic ITS cut-tuning

Strategy could be :
 1- tune LDA
 2- derive classical cuts from LDA

Page 17:

• A C++ class which performs LDA is available
• Calculates the LDA cuts with the chosen method, parameters and variable rescaling
• Has a function Pass to check whether a candidate passes the calculated cuts (toy interface sketch below)
• Plug-and-play : whatever the analysis, no change in the code is required
• « Universal » input format (tables)
• Ready-to-use : options have default values → no need to worry for a first look
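A toy mock of that interface ; only the function name Pass comes from the talk, the struct and member names are invented for illustration. A candidate is kept if it passes every successive LDA cut (multi-cut LDA) :

    #include <vector>
    #include <numeric>

    struct LdaCutSet {
        std::vector<std::vector<double>> axes;  // one LDA direction per cut
        std::vector<double> thresholds;         // best cut value for each

        bool Pass(const std::vector<double>& x) const {
            for (std::size_t k = 0; k < axes.size(); ++k) {
                const double proj = std::inner_product(axes[k].begin(), axes[k].end(),
                                                       x.begin(), 0.0);
                if (proj <= thresholds[k]) return false;  // fails the k-th cut
            }
            return true;  // candidate passes all the calculated cuts
        }
    };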

• The code is documented for how to use it (examples included)

• Full documentation about LDA and the optimization is available
• An example of filtering code which makes plots like in the previous slide is available

• Not yet on the web, send e-mail ([email protected])

• Statistics needed for training : with the optimized criterion, it looks like 2000 S and N after cuts are enough


S. Antonio’s present : available tool :

Page 18:

BACKUP

Page 19:

Rotating :

[Figure : effect of rotating on the invariant-mass distribution ; a real Xi is destroyed, one fake Xi is destroyed, another fake Xi is created, some combinations give nothing]

• Destroys the signal

• Keeps the background

• Destroys some correlations → has to be studied


Page 20:

Pattern classification :


• p classes of objects of the same type → here, 2 classes : signal (real Xis) and background (combinatorial) ; 1 type : the Xi vertex

• n observables, defined for all the classes → here : DCA’s, decay length, number of hits, etc…

• p samples of N_k objects for each class k → here : background sample from real data, signal sample from simulation (embedding)

• Learning : done on these p samples

• Usage / goal : classify a new object into one of the defined classes → an observed Xi vertex = signal or background

Page 21:

Fisher criterion :

• Fisher criterion : maximisation of $\dfrac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$

• No need for a maximisation algorithm

• The LDA direction u is directly given by : $u = S_w^{-1}\,(m_1 - m_2)$
  (m_1, m_2 : mean-vectors ; S_w : within-class scatter matrix)

• All done with simple matrix operations

• Calculating the axis is way faster than reading the data

Page 22:

Mathematically speaking (I) :

• Fisher criterion : maximisation of $\dfrac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$

• Let’s call u the vector of the LDA axis, and x_k the vector of the k-th candidate of the training (learning) sample

• Mean of class i (vector) : $m_i = \frac{1}{N_i} \sum_{k \in i} x_k$

• Mean of the projection on u for class i : $\mu_i = \frac{1}{N_i} \sum_{k \in i} u^t x_k = u^t m_i$

• So : $\mu_1 - \mu_2 = u^t\,(m_1 - m_2)$

Page 23:

Mathematically speaking (II) :

• Now : $\sigma_i^2 = \sum_{k \in i} (u^t x_k - \mu_i)^2$

• Let’s define $S_i = \sum_{k \in i} (x_k - m_i)(x_k - m_i)^t$ and $S_w = S_1 + S_2$

• So : $\sigma_1^2 + \sigma_2^2 = u^t S_w u$

• In-one-shot booking of the matrix : $S_i = \sum_{k \in i} x_k x_k^t - N_i\, m_i m_i^t$
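These formulas translate directly into ROOT’s matrix classes ; a minimal sketch (function name and sample layout are illustrative) :

    #include "TMatrixD.h"
    #include "TVectorD.h"
    #include <vector>

    // u = Sw^-1 (m1 - m2) for two samples of n-dimensional candidates.
    TVectorD FisherDirection(const std::vector<TVectorD>& sig,
                             const std::vector<TVectorD>& bkg)
    {
        const Int_t n = sig[0].GetNrows();
        TVectorD m1(n), m2(n);
        for (const TVectorD& x : sig) m1 += x;
        for (const TVectorD& x : bkg) m2 += x;
        m1 *= 1.0 / sig.size();
        m2 *= 1.0 / bkg.size();

        // Sw = S1 + S2, with Si = sum_k (xk - mi)(xk - mi)^t :
        TMatrixD sw(n, n);
        for (const TVectorD& x : sig) sw.Rank1Update(x - m1, 1.0);
        for (const TVectorD& x : bkg) sw.Rank1Update(x - m2, 1.0);

        sw.Invert();              // simple matrix operations only
        return sw * (m1 - m2);    // the LDA axis
    }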

Page 24:

• First find the Fisher LDA direction, as a starting point

• Define a « performance function » : vector u → performance figure

• Maximise the « performance figure » by varying the direction of u

• Several methods for the maximisation :
  • Easy and fast : one coordinate at a time
  • Fancy and powerful : genetic algorithm

Algorithm for the optimized criterion :

Page 25:

One coordinate at a time :


• Change the direction of u by steps of a constant angle Δθ : Δθ = 8° to start, then Δθ = 4°, 2°, 1°, eventually 0.5°

• Change the 1st coordinate of u until the performance reaches a maximum
• Change all the other coordinates like this, one by one
• Then try again with the 1st coordinate, and with the other ones

• When there is no improvement anymore : divide Δθ by 2 and do the whole thing again
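An illustrative sketch of this loop ; the performance function is passed in, and tilting one coordinate of a unit-norm u by tan(Δθ) approximates the constant-angle step :

    #include <vector>
    #include <cmath>
    #include <functional>

    void OptimiseDirection(std::vector<double>& u,
                           const std::function<double(const std::vector<double>&)>& perf)
    {
        const double kPi = 3.14159265358979323846;
        double best = perf(u);
        for (double stepDeg = 8.0; stepDeg >= 0.5; stepDeg /= 2.0) {
            const double tilt = std::tan(stepDeg * kPi / 180.0);
            bool improved = true;
            while (improved) {                    // sweep the coordinates repeatedly
                improved = false;
                for (std::size_t i = 0; i < u.size(); ++i)
                    for (double sign : {+1.0, -1.0}) {
                        std::vector<double> trial = u;
                        trial[i] += sign * tilt;  // rotate u towards/away from axis i
                        const double p = perf(trial);
                        if (p > best) { best = p; u = trial; improved = true; }
                    }
            }
        }
    }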

Page 26:

Genetic algorithm (I) :


• Problem with the « one-coordinate-at-a-time » algorithm : likely to fall into a local maximum different from the absolute maximum

• So : use genetic algorithm !

• Like genetic evolution :

• Pool of chromosomes

• Generations : evolution, reproduction

• Darwinist selection

• Mutations

Page 27:

Genetic algorithm (II) :


• Start with p chromosomes (p vectors uk) made randomly from Fisher

• Calculate performance figure of each uk

• Order the p vectors by decreasing value of the performance figure

• Keep only the m first vectors (Darwinist selection)

• Have them make children : build a new set of p chromosomes, with the m selected ones and combinations of them

• In the children chromosomes, introduce some mutations (modify randomly a coordinate)

• The new pool is ready : go back to the performance-figure calculation and iterate
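An illustrative sketch of this loop ; pool size, mutation size and the simple averaging combination are made-up parameters :

    #include <vector>
    #include <algorithm>
    #include <functional>
    #include <random>

    using Chromo = std::vector<double>;

    // Pool of p chromosomes (vectors u), Darwinist selection of the m best,
    // children built from combinations of the survivors, random mutations.
    Chromo Evolve(std::vector<Chromo> pool, std::size_t m, int generations,
                  const std::function<double(const Chromo&)>& perf)
    {
        std::mt19937 rng(42);
        std::uniform_real_distribution<double> mutation(-0.1, 0.1);
        const std::size_t p = pool.size();
        auto byPerf = [&](const Chromo& a, const Chromo& b) { return perf(a) > perf(b); };

        for (int gen = 0; gen < generations; ++gen) {
            std::sort(pool.begin(), pool.end(), byPerf);  // decreasing performance
            pool.resize(m);                               // Darwinist selection

            std::uniform_int_distribution<std::size_t> parent(0, m - 1);
            while (pool.size() < p) {                     // rebuild a pool of p
                const Chromo& pa = pool[parent(rng)];
                const Chromo& pb = pool[parent(rng)];
                Chromo child(pa.size());
                for (std::size_t i = 0; i < child.size(); ++i)
                    child[i] = 0.5 * (pa[i] + pb[i]);     // simple combination
                std::uniform_int_distribution<std::size_t> coord(0, child.size() - 1);
                child[coord(rng)] += mutation(rng);       // random mutation
                pool.push_back(child);
            }
        }
        std::sort(pool.begin(), pool.end(), byPerf);
        return pool.front();                              // best direction found
    }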

Page 28:

Statistics needed :


• Fisher-LDA : the samples need to have more than 10000 candidates each

• Doesn’t seem to depend on the number of observables (?) (tested with n = 10 and n = 22)

• Optimized criteria : need much more

• Guess : at minimum 50000 candidates per sample, maybe up to 500000 ?

• Depends on the number of observables

Page 29:

• Optimised criterion : can’t look at the oscillations to know whether there is enough statistics !

Statistics needed (II) :


[Figure : the LDA cut from the optimised criterion, step 1 and step 2, in the (Variable 1, Variable 2) plane]

Page 30:

Statistics needed (III) :


• Solutions :

  • Try all the combinations of k out of n observables (never used)
    • Problem : the number is huge (2^n - 1) : n = 5 → 31 combinations, n = 10 → 1023 combinations, n = 20 → 1048575 combinations !

  • Use underoptimal LDA (widely used) → see next slide

  • Use PCA : Principal Components Analysis (widely used) → see the slide after next

Page 31:

Part V – Various things :


• The projection of the LDA direction from the n-dimension space to a k-dimension sub-space is not the LDA direction of the projection of the samples from the n-dimension space to the k-dimension sub-space

• The more observables, the better :
  • Mathematically : adding an observable can’t lower the discriminancy
  • Practically : it can, because of the limited statistics available for training

• LDA (multi-cut) can’t do worse than cutting on each observable :
  • Because cutting on each observable is a particular case of multi-cut LDA !
  • If it does worse : the criterion isn’t good, or the efficiencies of the cuts are not well chosen

Page 32:

Underoptimal LDA :


• Calculate the discriminancy of each of the n observables

• Choose the observable with the highest discriminancy

• Calculate the discriminancy of each pair of observables containing the previously found one

• Choose the most discriminating pair

• Etc… with triplets, up to the desired number of directions

• Problem : the most discriminating pair containing the most discriminating single direction is not necessarily the actual most discriminating pair

[Figure : most discriminating direction, most discriminating pair containing it, and the actual most discriminating pair]
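An illustrative sketch of this greedy selection ; Discriminancy() is a stand-in for whatever figure is used to rank the subsets :

    #include <vector>
    #include <functional>

    // Grow the set of observables one at a time, always keeping the subset
    // whose LDA has the highest discriminancy.
    std::vector<int> GreedySelect(int n, int kMax,
        const std::function<double(const std::vector<int>&)>& discriminancy)
    {
        std::vector<int> chosen;
        std::vector<bool> used(n, false);
        for (int k = 0; k < kMax; ++k) {
            int bestObs = -1;
            double best = -1e30;
            for (int i = 0; i < n; ++i) {
                if (used[i]) continue;
                std::vector<int> trial = chosen;
                trial.push_back(i);                  // pair, triplet, etc.
                const double d = discriminancy(trial);
                if (d > best) { best = d; bestObs = i; }
            }
            chosen.push_back(bestObs);
            used[bestObs] = true;
        }
        return chosen;   // may miss the actual best pair (the problem above)
    }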

Page 33:

PCA – Principal Components Analysis (I) :


• Tool used in data reduction (e.g. image compression)

• Read the Root class description of TPrincipal

• Finds along which directions (linear combinations of the observables) most of the information lies

[Figure : primary and secondary component axes in the (Variable 1, Variable 2) plane ; the main information of a point is x1, dropping x2 isn’t important]

Page 34:

PCA – Principal Components Analysis (II) :


• All is matrix-based : easy

• The « informativeness » of a direction is given by the normalised eigenvalues

• Use with LDA : prior to finding the axis :
  • Observables = base B1 of the n-dimension space
  • Apply PCA over the signal+bkgnd samples (together) : get base B2 of the n-dimension space
  • Choose the k most informative directions : C2, subset of B2
  • Calculate the LDA axis in the space defined by C2

• If several LDA directions ? No problem : apply PCA but keep all the information of the candidates, just don’t use all of it for the LDA → PCA will give a different sub-space for each step
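A minimal sketch of this step using Root’s TPrincipal (the sample handling and the fixed n = 5 are illustrative) :

    #include "TPrincipal.h"
    #include "TVectorD.h"
    #include <array>
    #include <vector>

    void PcaExample(const std::vector<std::array<Double_t, 5>>& candidates)
    {
        const Int_t n = 5;                        // number of observables
        TPrincipal pca(n, "ND");                  // "N" = normalise, "D" = store data
        for (const auto& row : candidates)
            pca.AddRow(row.data());               // signal+bkgnd samples together
        pca.MakePrincipals();                     // diagonalise the covariance matrix

        // Normalised eigenvalues = « informativeness » of each direction :
        const TVectorD* eig = pca.GetEigenValues();
        eig->Print();

        // Project a candidate onto the principal-component base B2 :
        Double_t x[5] = {/* observables */}, p[5];
        pca.X2P(x, p);   // keep only the k most informative coordinates for LDA
    }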

Page 35:

PCA – Principal Components Analysis (III) :


• Problem of using PCA prior to LDA :
  • Whether to use it or not is purely empirical
  • The percentage of the eigenvalues to keep is also purely empirical

[Figure : in the (Variable 1, Variable 2) plane, the PCA 1st direction differs from the best discriminating axis]

Page 36:

PCA – Principal Components Analysis (IV) :


• Difference between PCA and LDA : example with the letters O and Q :

  • PCA finds where most of the information is : the most important part of O and Q is the big round shape → applying PCA means that both O and Q become O

  • LDA finds where most of the difference is : the difference between O and Q is the little line at the bottom-right → applying LDA means finding this little line

Page 37:

Influence of an LDA cut :


• Useful to know whether LDA cuts steeply or uniformly along each direction

• f_k = distribution of a sample along the direction of observable k
• g_k = the same, after the LDA cut
• F = the normalised integral of f
• $h(x) = (g/f)(F^{-1}(x))$

• $Q = \tfrac{1}{2} \int_0^1 |h(x) - 1|\, dx$

• Q = 0 → cut uniform ; Q = 1 → cut steep

[Figure : g/f plotted versus F, from 0 to 1]
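Assuming the definition reconstructed above, a sketch of the computation from two unit-normalised ROOT histograms f (before the cut) and g (after) :

    #include "TH1D.h"
    #include <cmath>

    // Q = 1/2 * integral of |h(x) - 1| dx with x = F(t) ; substituting
    // dx = f(t) dt turns it into an f-weighted integral over the observable.
    Double_t Steepness(const TH1D& f, const TH1D& g)
    {
        Double_t q = 0.0;
        for (Int_t b = 1; b <= f.GetNbinsX(); ++b) {   // identical binning assumed
            const Double_t fc = f.GetBinContent(b);
            if (fc <= 0) continue;                     // h undefined where f = 0
            const Double_t h = g.GetBinContent(b) / fc;
            q += 0.5 * std::fabs(h - 1.0) * fc * f.GetBinWidth(b);
        }
        return q;                                      // 0 = uniform, 1 = steep
    }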

Page 38:

V0 decay topology :
