TerraFerMA A Suite of Multivariate Analysis Tools
Sherry Towers, SUNY-SB
TerraFerMA now depends only on ROOT (i.e., it is CLHEP-free)
www-d0.fnal.gov/~smjt/multiv.html
TerraFerMA = Fermilab Multivariate Analysis (aka “FerMA”)
TerraFerMA is, foremost, a convenient interface to various disparate multivariate analysis packages (e.g., MLPfit, Jetnet, PDE/GEM, Fisher discriminant, binned likelihood, etc.)
The user first fills signal and background (and data) “Samples”, which are then used as input to TerraFerMA methods. A Sample consists of variables filled for many different events.
Using a multivariate package chosen by the user (e.g., NN’s, PDE’s, Fisher discriminants, binned likelihood, etc.), TerraFerMA methods yield the relative probability that a data event is signal or background.
TerraFerMA also includes useful statistical tools (means, RMS’s, and correlations between the variables in a Sample), and a method to detect outliers.
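As a rough illustration of the statistical tools mentioned above, the sketch below computes per-variable means, RMS spreads, and the correlation matrix for a small sample. The function name `sample_statistics` and the data layout (one row of variable values per event) are assumptions made for this example and are not the TerraFerMA interface; "RMS" is taken here to mean the RMS deviation about the mean.

```python
import math

def sample_statistics(events):
    """Return per-variable means, RMS spreads, and the correlation matrix.

    `events` is a list of equal-length rows, one row of variable values per event.
    This layout is an assumption for the sketch, not the TerraFerMA Sample format.
    """
    n_events = len(events)
    n_vars = len(events[0])
    means = [sum(row[i] for row in events) / n_events for i in range(n_vars)]
    # Covariance matrix (population form, dividing by the number of events).
    cov = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in events) / n_events
            for j in range(n_vars)] for i in range(n_vars)]
    # RMS deviation about the mean for each variable.
    rms = [math.sqrt(cov[i][i]) for i in range(n_vars)]
    corr = [[cov[i][j] / (rms[i] * rms[j]) for j in range(n_vars)] for i in range(n_vars)]
    return means, rms, corr

# Two strongly correlated variables, four events.
sample = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
means, rms, corr = sample_statistics(sample)
print("means:", means)
print("rms:", rms)
print("correlation(0,1):", corr[0][1])
```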
TerraFerMA makes it trivial to compare the performance of different multivariate techniques (i.e., it is simple to switch between using a NN and a PDE, for instance, because in TerraFerMA both use the same interface)
TerraFerMA makes it easy to reduce the number of discriminators used in an analysis (optional TerraFerMA methods sort variables to determine which have the best signal/background discrimination power)
TerraFerMA web page includes full documentation/descriptions
The TerraFerMA TFermaFactory class takes as input a signal and a background TFermaSample, then calculates discriminators based on these samples using the multivariate method of the user’s choice.
TFermaFactory includes a method called GetTestEfficiencies() that returns signal efficiency vs background efficiency for various operating points, along with the cut in the discriminator for each operating point.
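To make the idea of operating points concrete, here is a minimal sketch of how signal and background efficiencies can be tabulated against a cut on a discriminator. It is not the GetTestEfficiencies() implementation: the function name `efficiency_curve`, the convention that an event passes when its discriminator is at or above the cut, and the example values are all assumptions for illustration.

```python
def efficiency_curve(signal_scores, background_scores, cuts):
    """For each cut, return (cut, signal efficiency, background efficiency).

    Assumed convention: an event passes if its discriminator value is >= the cut.
    """
    results = []
    for cut in cuts:
        sig_eff = sum(s >= cut for s in signal_scores) / len(signal_scores)
        bkg_eff = sum(b >= cut for b in background_scores) / len(background_scores)
        results.append((cut, sig_eff, bkg_eff))
    return results

# Made-up discriminator values for a handful of test events.
signal_scores = [0.9, 0.8, 0.7, 0.4]
background_scores = [0.6, 0.3, 0.2, 0.1]
for cut, sig_eff, bkg_eff in efficiency_curve(signal_scores, background_scores, [0.25, 0.5, 0.75]):
    print(f"cut={cut:.2f}  signal eff={sig_eff:.2f}  background eff={bkg_eff:.2f}")
```

Each returned tuple is one operating point; scanning the cut traces out the signal-efficiency vs background-efficiency curve.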
Where to find TerraFerMA
TerraFerMA documentation: www-d0.fnal.gov/~smjt/ferma.ps
TerraFerMA users’ guide: www-d0.fnal.gov/~smjt/guide.ps
TerraFerMA package: …/ferma.tar.gz
(includes example programs)
Future Plans (maybe)…
Add capability to handle “Ensembles” (i.e., if an analysis has more than one source of background, it would be useful to treat the various background Samples together as an Ensemble).
Users really want the convenience of this.
But it is by no means trivial.
Future Plans (maybe)…
Add Support Vector Machine interface.
Need to find a half-decent SVM package (preferably in C++). It also needs to be general enough that it can be used “out of the box” without fine-tuning.
Most powerful...
Analytic/binned likelihood
Neural Networks
Support Vector Machines
Kernel Estimation
Future Plans (maybe)…
Add in genetic algorithm for sorting discriminators.
Future Plans (maybe)…
Add in distribution comparison tests such as Kolmogorov-Smirnov, Anderson-Darling, etc.
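For readers unfamiliar with such tests, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between the empirical cumulative distributions of two samples. It is a generic illustration, not planned TerraFerMA code; the function name `ks_statistic` is hypothetical, and the conversion of the statistic to a probability is deliberately left out.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs.

    The empirical CDFs are step functions that only change at data points,
    so it is enough to compare them at every observed value.
    """
    max_gap = 0.0
    for x in list(sample_a) + list(sample_b):
        cdf_a = sum(v <= x for v in sample_a) / len(sample_a)
        cdf_b = sum(v <= x for v in sample_b) / len(sample_b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

print(ks_statistic([1, 2, 3], [1, 2, 3]))  # identical samples
print(ks_statistic([1, 2, 3], [4, 5, 6]))  # fully separated samples
```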
Making the most of your data:
tools, techniques, and strategies
(and potential pitfalls!)
Sherry Towers, State University of New York at Stony Brook
Data analysis for the modern age:
Cost and complexity of HEP data behooves us to milk the data for all it’s worth!
But how can we get the most out of our data?
• Use of more sophisticated data analysis techniques may help (multivariate methods)
• Strive to achieve excellent understanding of the data, and our modelling of it
• Make innovative use of already familiar tools and methods
• Reduction of number of variables can help too!
Tools and Techniques
Simple techniques
Ignore all correlations between discriminators…
Examples: simple techniques based on square cuts, or likelihood techniques that obtain the multi-D likelihood from the product of 1-D likelihoods.
Advantage: fast, easy to understand. Easy to tell if modelling of data is sound.
Disadvantage: useful discriminating info may be lost if correlations are ignored
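The product-of-1-D-likelihoods approach described above can be sketched as follows. Each variable gets its own binned likelihood for signal and for background, and the multi-D likelihood is simply their product. The helper names, the add-one smoothing of empty bins, the final ratio L_sig / (L_sig + L_bkg), and the toy events are assumptions chosen for this example.

```python
def bin_index(x, low, high, n_bins):
    """Index of the bin containing x, clamped to the histogram range."""
    index = int((x - low) / (high - low) * n_bins)
    return min(max(index, 0), n_bins - 1)

def make_binned_pdf(values, low, high, n_bins):
    """Return a function giving the (smoothed) bin probability at x for one variable."""
    counts = [0] * n_bins
    for v in values:
        counts[bin_index(v, low, high, n_bins)] += 1
    # Add one count per bin so that an empty bin never gives a zero likelihood.
    total = len(values) + n_bins
    probs = [(c + 1) / total for c in counts]
    return lambda x: probs[bin_index(x, low, high, n_bins)]

def product_likelihood_discriminant(event, signal_pdfs, background_pdfs):
    """Multiply the 1-D likelihoods variable by variable, ignoring all correlations."""
    l_sig = 1.0
    l_bkg = 1.0
    for x, sig_pdf, bkg_pdf in zip(event, signal_pdfs, background_pdfs):
        l_sig *= sig_pdf(x)
        l_bkg *= bkg_pdf(x)
    return l_sig / (l_sig + l_bkg)

# Toy training events with two variables each, both in the range [0, 1].
signal_events = [[0.8, 0.7], [0.9, 0.6], [0.7, 0.9]]
background_events = [[0.1, 0.2], [0.2, 0.3], [0.3, 0.1]]
signal_pdfs = [make_binned_pdf([e[i] for e in signal_events], 0.0, 1.0, 2) for i in range(2)]
background_pdfs = [make_binned_pdf([e[i] for e in background_events], 0.0, 1.0, 2) for i in range(2)]

d_high = product_likelihood_discriminant([0.85, 0.8], signal_pdfs, background_pdfs)
d_low = product_likelihood_discriminant([0.1, 0.1], signal_pdfs, background_pdfs)
print(d_high, d_low)
```

Because signal and background share the same binning, using bin probabilities instead of densities does not change the ratio.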
More powerful...
More complicated techniques take into account simple (linear) correlations between discriminants:
Fisher discriminant
H-Matrix
Principal component analysis
Independent component analysis
and many, many more!
Advantage: fast, more powerful
Disadvantage: can be a bit harder to understand, systematics can be harder to assess. Harder to tell if modelling of data is sound.
Fisher discriminant: (2D examples)
Finds an axis in parameter space such that the projections of signal and background onto it have maximally separated means (relative to the spread of the projections).
Can often work well …
But sometimes fails …
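A minimal 2D sketch of the Fisher discriminant follows. It uses the textbook form, in which the projection axis is the inverse of the summed signal and background covariance matrices applied to the difference of the means. The function names and the toy points are assumptions for this example, and the slides' own 2D examples are not reproduced.

```python
def mean_and_covariance(events):
    """Mean vector and covariance (xx, xy, yy) of a list of 2-D points."""
    n = len(events)
    mx = sum(e[0] for e in events) / n
    my = sum(e[1] for e in events) / n
    cxx = sum((e[0] - mx) ** 2 for e in events) / n
    cyy = sum((e[1] - my) ** 2 for e in events) / n
    cxy = sum((e[0] - mx) * (e[1] - my) for e in events) / n
    return (mx, my), (cxx, cxy, cyy)

def fisher_direction(signal, background):
    """Fisher axis w = W^-1 (mean_signal - mean_background) for 2-D samples."""
    (sx, sy), (s_xx, s_xy, s_yy) = mean_and_covariance(signal)
    (bx, by), (b_xx, b_xy, b_yy) = mean_and_covariance(background)
    # Within-class scatter W: sum of the two covariance matrices.
    w_xx, w_xy, w_yy = s_xx + b_xx, s_xy + b_xy, s_yy + b_yy
    det = w_xx * w_yy - w_xy * w_xy
    dx, dy = sx - bx, sy - by
    # Apply the inverse of the symmetric 2x2 matrix W to the difference of means.
    return ((w_yy * dx - w_xy * dy) / det, (-w_xy * dx + w_xx * dy) / det)

signal = [(2.0, 0.0), (3.0, 1.0), (2.0, 1.0), (3.0, 0.0)]
background = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
w = fisher_direction(signal, background)
print("Fisher axis:", w)
# The discriminant for an event is its projection onto the axis.
print([w[0] * x + w[1] * y for x, y in signal])
print([w[0] * x + w[1] * y for x, y in background])
```

One situation where this construction cannot help is when signal and background have the same means: the difference of means vanishes, so the axis carries no discrimination even if the shapes of the two distributions differ.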
Analytic/binned likelihood…
Advantages:
Easy to understand
Can take into account all correlations
Disadvantages:
Don’t always have analytic PDF!
Determination of binning for large number of dimensions for binned likelihood a pain (but possible)
Neural Networks
Most commonly used in HEP:
Jetnet (since early 1990’s)
MLPfit (since late 1990’s)
Stuttgart (very recent)
Advantages:
Fast, (relatively) easy to use
Can take into account complex non-linear correlations
(get to disadvantages in a minute)
How does a NN work?
NN formed from inter-connected neurons.
Each neuron effectively capable of making a cut in parameter space
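The statement that a neuron effectively makes a cut can be illustrated with a single sigmoid neuron: its output switches from near 0 to near 1 across the hyperplane where the weighted sum of the inputs plus the bias is zero. The sigmoid activation and the weights below are assumptions for this sketch, not settings taken from any particular NN package.

```python
import math

def neuron(inputs, weights, bias):
    """Sigmoid neuron: a smooth cut along the hyperplane weights . inputs + bias = 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-activation))

# With these (made-up) weights the neuron cuts on the first variable at x1 = 0.5
# and ignores the second variable entirely.
weights = [10.0, 0.0]
bias = -5.0
print(neuron([0.0, 3.0], weights, bias))  # well below the cut: output near 0
print(neuron([0.5, 3.0], weights, bias))  # on the cut: output 0.5
print(neuron([1.0, 3.0], weights, bias))  # well above the cut: output near 1
```

Larger weights make the transition sharper, so the neuron approaches a hard square cut; combining several such neurons lets a network carve out more complicated regions.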
Possible NN pitfalls…
An easy-to-use black box!
Architecture of NN is arbitrary (if you make a mistake, you may end up being susceptible to statistical fluctuations in training data)
Difficult to determine if modelling of data is sound
Very easy to use many, many dimensions:
(www-d0.fnal.gov/~smjt/durham/reduc.ps)
The curse of too many variables: a simple example
Signal: 5D Gaussian, μ = (1,0,0,0,0), σ = (1,1,1,1,1)
Bkgnd: 5D Gaussian, μ = (0,0,0,0,0), σ = (1,1,1,1,1)
Only difference between signal and background is in first dimension. Other four dimensions are ‘useless’ discriminators
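The toy samples in this example can be generated with a few lines of code. The sketch assumes that the two vectors quoted for each sample are the Gaussian means and widths, and that the five dimensions are uncorrelated; the sample size, the random seed, and the function names are arbitrary choices made here. It then checks, variable by variable, how far apart the signal and background means are.

```python
import random

def generate_sample(mean, n_events, rng):
    """Draw events from an uncorrelated Gaussian with the given means and unit widths."""
    return [[rng.gauss(m, 1.0) for m in mean] for _ in range(n_events)]

def mean_separation(signal, background):
    """Per-variable |mean(signal) - mean(background)|; all widths are 1 in this toy."""
    separations = []
    for i in range(len(signal[0])):
        mean_s = sum(e[i] for e in signal) / len(signal)
        mean_b = sum(e[i] for e in background) / len(background)
        separations.append(abs(mean_s - mean_b))
    return separations

rng = random.Random(12345)  # fixed seed so the example is reproducible
signal = generate_sample([1, 0, 0, 0, 0], 2000, rng)
background = generate_sample([0, 0, 0, 0, 0], 2000, rng)
separations = mean_separation(signal, background)
for i, sep in enumerate(separations):
    print(f"variable {i}: separation of means = {sep:.3f}")
```

Only the first variable shows a separation close to 1; the other four differ between signal and background only through statistical fluctuations.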
Statistical fluctuations in the useless dimensions tend to wash out discrimination in the useful dimension
Advantages of variable reduction: a “real-world” example…
A Tevatron Run I analysis used a 7-variable NN to discriminate between signal and background.
Were all 7 needed?
Ran the signal and background n-tuples through the TerraFerMA interface to the sorting method…
Support Vector Machines
The new kid on the block
Similar to NN’s in many respects. But, first map parameter space onto a higher-dimensional space, then set up neuron architecture to optimally carve up the new parameter space.
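The mapping to a higher-dimensional space can be illustrated with a standard textbook case, chosen here as an assumption rather than taken from the talk: a quadratic map from 2D to 3D. Points inside and outside a circle cannot be separated by a straight line in the original space, but in the mapped space a plane does the job. The sketch also checks the "kernel trick", i.e. that the dot product in the mapped space can be computed without building the mapped points.

```python
import math

def feature_map(point):
    """Map a 2-D point (x1, x2) to the 3-D space (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = point
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def polynomial_kernel(a, b):
    """Equals the dot product of the mapped points, computed in the original space."""
    return dot(a, b) ** 2

a, b = (1.0, 2.0), (3.0, -1.0)
print(dot(feature_map(a), feature_map(b)), polynomial_kernel(a, b))

# In the mapped space z1 + z3 equals the squared radius, so the plane z1 + z3 = 1
# separates points inside the unit circle from points outside it.
inside = feature_map((0.3, 0.4))
outside = feature_map((1.5, -2.0))
print(inside[0] + inside[2], outside[0] + outside[2])
```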
Kernel Estimators
So far, under-appreciated and under-used in HEP! (but widely used elsewhere)
Gaussian Expansion Method (GEM)
Probability Density Estimation (PDE) (www-d0.fnal.gov/~smjt/durham/pde.ps)
Advantages:
Can take into account complex non-linear correlations
Relatively easy to understand (not a black box)
Completely different from Neural Networks (make an excellent alternative)
(get to disadvantages in a minute)
To estimate a PDF, PDE’s use the concept that any n-dimensional continuous function can be modelled by a sum of n-D “kernel” functions
Gaussian kernels are a good choice for particle physics
So, a PDF can be estimated by a sum of multi-dimensional Gaussians centered about MC generated points
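The kernel estimate described above can be sketched in a few lines. The version below uses a single fixed kernel width for all dimensions and forms the discriminant as P_sig / (P_sig + P_bkg); both choices, along with the function names and the toy MC points, are simplifying assumptions for this example and not the PDE or GEM algorithms as implemented in TerraFerMA.

```python
import math

def gaussian_kernel_pdf(point, mc_points, width):
    """Average of n-D Gaussian kernels of the given width centred on each MC point."""
    n_dim = len(point)
    norm = (2.0 * math.pi * width ** 2) ** (-n_dim / 2.0)
    total = 0.0
    for centre in mc_points:
        dist2 = sum((x - c) ** 2 for x, c in zip(point, centre))
        total += math.exp(-0.5 * dist2 / width ** 2)
    return norm * total / len(mc_points)

def pde_discriminant(point, signal_mc, background_mc, width):
    """Relative probability that the point is signal, from the two kernel estimates."""
    p_sig = gaussian_kernel_pdf(point, signal_mc, width)
    p_bkg = gaussian_kernel_pdf(point, background_mc, width)
    return p_sig / (p_sig + p_bkg)

# One MC point per class keeps the arithmetic easy to follow.
signal_mc = [[1.0, 0.0]]
background_mc = [[0.0, 0.0]]
print(pde_discriminant([1.0, 0.0], signal_mc, background_mc, 1.0))  # closer to signal
print(pde_discriminant([0.5, 0.0], signal_mc, background_mc, 1.0))  # equidistant
```

Note that every evaluation loops over all of the MC points, which is why the cost per event grows with the size of the MC samples.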
Primary disadvantage:
Unlike NN’s, PDE algorithms do not save weights. PDE methods are thus inherently slower to use than NN’s
TerraFerMA: a universal interface to many multivariate methods
TerraFerMA, released in 2002, interfaces to MLPfit, Jetnet, kernel methods, Fisher Discriminant, etc., and also includes a variable sorting method. User can quickly and easily sort potential discriminators.
http://www-d0.fnal.gov/~smjt/multiv.html
Case Studies
Case #1: MAGIC (Major Atmospheric Gamma Imaging Cherenkov) telescope data
Due to begin data taking, Canary Islands, Aug 2003
Cherenkov light from gamma ray (or hadronic shower background)
Telescope
10 discriminators in total: based on shape, size, brightness, and orientation of ellipse
Methods for multidimensional event classification: a case study
R. Bock, A. Chilingarian, M. Gaug, F. Hakl, T. Hengstebeck, M. Jirina, J. Klaschka, E. Kotrc, P. Savicky, S. Towers, A. Vaiciulis, W. Wittek
(submitted to NIM, Feb 2003)
Examined which multivariate methods appeared to afford the best discrimination between signal and background:
Neural Networks
Kernel PDE
Support Vector Machine
Fisher Discriminant
simultaneous 1D binned likelihood fit
(and others!)
Results:
Conclusion: (carefully chosen) sophisticated multivariate methods will likely help
Case #2: Standard Model Higgs searches at the Tevatron
Direct searches (combined LEP data): MH > 114.4 GeV (95% CL)
Fits to precise EW data: MH < 193 GeV (95% CL)
Discovery Thresholds:
(based on MC studies performed in 1999 at SUSY/Higgs Workshop (SHW))
Feb 2003: Tevatron “Higgs sensitivity” working group formed to revisit SHW analyses…can Tevatron Higgs sensitivity be significantly improved?
Sherry’s physics interests at Dzero:
Co-convener, Dzero ZH working group
Convener, Dzero Higgs b-id working group (NB: H(bb) dominant for MH < 130 GeV)
Working on search strategies in ZH(mumubb) mode
Can 1999 SHW b-tagging performance be easily improved upon?
Can 1999 SHW analysis methods be easily improved upon?
heavy-quark jet tagging:
b-hadrons are long lived. Several b-tagging strategies make use of presence of displaced tracks or secondary vertices in b-jet:
[Figure: sketch of a b-jet with displaced tracks and a secondary vertex; labels x, y, z, Lxy, bdo]
The use of two b-tags in any H(bb) analysis implies that “swimming” along the mis-tag/tag efficiency curve, to a point where more light-jet background is allowed in, may dramatically improve the significance of the analysis.
Improvement is equivalent to almost doubling the data set!
[Figure: significance of ZH(llbb) search (per fb^-1) vs light-quark mis-tag rate]
How about multivariate methods?
SHW analyses looked at use of NN’s (based on kinematical and topological info in event) as a Higgs search strategy:
But, shape analysis of NN discriminants instead of making just square cuts in discriminants yields an additional factor of 30%!
NN yields improvement equivalent to factor of two more data relative to “traditional analysis”
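Statements of the form "equivalent to a factor of two more data" can be related to gains in significance with a one-line calculation, under the common assumption (not stated in the slides) that significance scales like S/sqrt(B) and therefore grows as the square root of the integrated luminosity. The function name below is hypothetical.

```python
def equivalent_data_factor(significance_gain):
    """Factor by which the data set would have to grow to give the same gain in
    significance, assuming significance scales as sqrt(integrated luminosity)."""
    return significance_gain ** 2

# Under this assumption, a gain of sqrt(2) (about 41%) in significance is worth
# as much as doubling the data set.
print(equivalent_data_factor(2 ** 0.5))
```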
Summary
Use of multivariate techniques is now widely seen in HEP
We have a lot of different tools to choose from (from simple to complex)
Complex tools are occasionally necessary, but they have their pitfalls:
Assessment of data modelling harder
Assessment of systematics harder
Architecture sometimes arbitrary
Time needed to get to publication
Always better to start with a simple tool, and work your way up to more complex tools, after showing they are actually needed!
Careful examination of discriminators used in a multivariate analysis is always a good idea
Reduction of number of variables can simplify analysis considerably, and can even increase discrimination power
And, exploring simple changes, or using familiar techniques in more clever ways can sometimes dramatically improve analyses!