Wouter Verkerke, NIKHEF
RooFit: A tool kit for data modeling in ROOT
Wouter Verkerke (NIKHEF) David Kirkby (UC Irvine)
Focus: coding a probability density function
• Focus on one practical aspect of many data analyses in HEP: how do you formulate your p.d.f. in ROOT?
– For ‘simple’ problems (Gaussian, polynomial), ROOT's built-in models are well sufficient
– But if you want to do unbinned ML fits, use non-trivial functions, or work with multidimensional functions, you quickly run into trouble
A recent example
• BaBar experiment at SLAC: extract sin(2β) from time-dependent CP violation in B decays: e+e- → Υ(4S) → B B̄
– Reconstruct both Bs, measure the decay time difference
– The physics of interest is in the decay-time-dependent oscillation
• Many issues arise
– Standard ROOT function framework clearly insufficient to handle such complicated functions → must develop new framework
– Normalization of p.d.f. not always trivial to calculate → may need numeric integration techniques
– Unbinned fit, >2 dimensions, many events → computation performance important → must try to optimize code for acceptable performance
– Simultaneous fit to control samples to account for detector performance
$$ f_{sig} \cdot \mathrm{SigSel}(m; p_{sig}) \cdot \left[ \mathrm{SigDecay}(t; q_{sig}, \sin(2\beta)) \otimes \mathrm{SigResol}(t \,|\, dt; r_{sig}) \right] + (1 - f_{sig}) \cdot \mathrm{BkgSel}(m; p_{bkg}) \cdot \left[ \mathrm{BkgDecay}(t; q_{bkg}) \otimes \mathrm{BkgResol}(t \,|\, dt; r_{bkg}) \right] $$
A recent example
• Initial approach at BaBar: write it from scratch (in FORTRAN!)
– Does its designated job quite well, but took a long time to develop
– Possible because the sin(2β) effort was supported by O(50) people
– Optimization of ML calculations hand-coded → error prone and not easy to maintain
– Difficult to transfer knowledge/code from one analysis to another
• A better solution: a modeling language in C++ that integrates seamlessly into ROOT
– Recycle code and knowledge
• Development of the RooFit package
– Started 5 years ago
– Very successful: virtually everybody in BaBar uses it; now in the standard ROOT distribution
RooFit – a data modeling language for ROOT
• Extension to ROOT – (almost) no overlap with existing functionality
[Diagram: ROOT provides the C++ command line interface & macros, data management & histogramming, graphics interface, I/O support, and MINUIT; RooFit adds data modeling, data/model fitting, toy MC data generation, and model visualization]
Data modeling – Desired functionality
Building/adjusting models
• Easy to write basic PDFs (normalization)
• Easy to compose complex models (modular design)
• Reuse of existing functions
• Flexibility – no arbitrary implementation-related restrictions
Using models
• Fitting: binned/unbinned (extended) MLL fits, Chi2 fits
• Toy MC generation: generate MC datasets from any model
• Visualization: slice/project model & data in any possible way
• Speed – should be as fast as or faster than a hand-coded model
[Side label spanning both groups: Analysis work cycle]
Data modeling – OO representation
• Idea: represent math symbols as C++ objects
– Result: 1 line of code per symbol in a function (the C++ constructor) rather than 1 line of code per function
Mathematical concept     Symbol                                  RooFit class
variable                 $x, p$                                  RooRealVar
function                 $f(x)$                                  RooAbsReal
PDF                      $F(x; p, q)$                            RooAbsPdf
space point              $\vec{x}$                               RooArgSet
list of space points     $\vec{x}_k$                             RooAbsData
integral                 $\int_{x_{min}}^{x_{max}} f(x)\,dx$     RooRealIntegral
Data modeling – Constructing composite objects
• Straightforward correlation between mathematical representation of formula and RooFit code
Math: $\mathrm{gauss}(x, m, \sqrt{s})$
RooFit diagram: RooGaussian g takes RooRealVar x, RooRealVar m and RooFormulaVar sqrts as inputs; RooFormulaVar sqrts takes RooRealVar s
RooFit code:
  RooRealVar x("x","x",-10,10) ;
  RooRealVar m("m","mean",0) ;
  RooRealVar s("s","sigma",2,0,10) ;
  RooFormulaVar sqrts("sqrts","sqrt(s)",s) ;
  RooGaussian g("g","gauss",x,m,sqrts) ;
Model building – (Re)using standard components
• RooFit provides a collection of compiled standard PDF classes
– Basic: Gaussian, Exponential, Polynomial, Chebychev polynomial, …
– Physics inspired: ARGUS, Crystal Ball, Breit-Wigner, Voigtian, B/D-Decay, …
– Non-parametric: Histogram, KEYS
• Easy to extend the library: each p.d.f. is a separate C++ class
[Example plots: RooArgusBG, RooPolynomial, RooBMixDecay, RooHistPdf, RooGaussian]
Importance of a good and extendable model library
• If ‘new’ p.d.f.s are made available in an easy-to-use form, more people will use them
• Example 1: non-parametric kernel estimation technique
– See article by Kyle Cranmer
– People like it, but are put off by the prospect of thoroughly reading the entire article, coding the concept from scratch, and then using it
– In RooFit, switching from a histogram-based shape to a KEYS shape is a one-line code change (RooHistPdf → RooKeysPdf)
• Example 2: Chebychev polynomials
– It's easy to convince people to try them if all they need to do is a one-line code change (RooPolynomial → RooChebychev)
Model building – (Re)using standard components
• Library p.d.f.s can be adjusted on the fly
– Just plug in any function expression you like as input variable
– Works universally, even for classes you write yourself
• Maximum flexibility of library shapes keeps library small
$g(x; m, s)$ with $m \to m(y; a_0, a_1)$ gives $g(x, y; a_0, a_1, s)$
  RooPolyVar m("m","m",y,RooArgList(a0,a1)) ;
  RooGaussian g("g","gauss",x,m,s) ;
Handling of p.d.f. normalization
• Normalization of (component) p.d.f.s to unity is often a good part of the work of writing a p.d.f.
• RooFit handles most normalization issues transparently to the user
– A p.d.f. can advertise (multiple) analytical expressions for integrals
– When no analytical expression is provided, RooFit automatically performs numeric integration to obtain the normalization
– More complicated than it seems: even if gauss(x,m,s) can be integrated analytically over x, gauss(f(x),m,s) cannot. Such use cases are automatically recognized
– Multi-dimensional integrals can be a combination of numeric and p.d.f.-provided analytical partial integrals
• A variety of numeric integration techniques is interfaced
– Adaptive trapezoid, Gauss-Kronrod, VEGAS MC, …
– User can override parameters globally or per p.d.f. as necessary
Model building – (Re)using standard components
• Most physics models can be composed from ‘basic’ shapes
[Diagram: basic shapes (RooBMixDecay, RooPolynomial, RooHistPdf, RooArgusBG, RooGaussian) combined by addition (+) with RooAddPdf]
Model building – (Re)using standard components
• Most physics models can be composed from ‘basic’ shapes
[Diagram: basic shapes (RooBMixDecay, RooPolynomial, RooHistPdf, RooArgusBG, RooGaussian) combined by multiplication (*) with RooProdPdf]
$h(x, y) = f(x) \cdot g(y)$
  RooProdPdf h("h","h",RooArgSet(f,g)) ;
$k(x, y) = f(x|y) \cdot g(y)$
  RooProdPdf k("k","k",g,Conditional(f,x)) ;
Using models – Overview
• All RooFit models provide universal and complete fitting and toy Monte Carlo generating functionality
– Model complexity only limited by available memory and CPU power
– Fitting/plotting a 5-D model is as easy as using a 1-D model
– Most operations are one-liners
[Diagram: a RooAbsPdf and a RooDataSet (a RooAbsData) are linked in both directions]
Fitting: gauss.fitTo(data)
Generating: data = gauss.generate(x,1000)
Fitting functionality
• Binned or unbinned ML fit?
– For RooFit this is essentially the same!
– Nice example of C++ abstraction through inheritance
• Fitting interface takes abstract RooAbsData object
– Supply unbinned data object (created from TTree) → unbinned fit
– Supply histogram object (created from THx) → binned fit
[Diagram: unbinned data (a table of x, y, z values per event) is held in a RooDataSet and binned data in a RooDataHist; both inherit from RooAbsData and are passed to pdf.fitTo(data)]
Automated vs. hand-coded optimization of p.d.f.
• Automatic pre-fit PDF optimization
– Prior to each fit, the PDF is analyzed for possible optimizations
– Optimization algorithms:
   • Detection and precalculation of constant terms in any PDF expression
   • Function caching and lazy evaluation
   • Factorization of multi-dimensional problems wherever possible
– Optimizations are always tailored to the specific use in each fit
– Possible because the OO structure of the p.d.f. allows automated analysis of its structure
• No need for users to hard-code optimizations
– Keeps your code understandable, maintainable and flexible without sacrificing performance
– Optimization concepts implemented by RooFit are applied consistently and completely to all PDFs
– Speedup of a factor 3-10 reported in realistic complex fits
• Fit parallelization on multi-CPU hosts
– Option for automatic parallelization of the fit function on multi-CPU hosts (no explicit or implicit support from user PDFs needed)
MC Event generation
• Sample “toy Monte Carlo” samples from any PDF
– Accept/reject sampling method used by default
• MC generation method automatically optimized
– A PDF can advertise an internal generator if a more efficient method exists (e.g. Gaussian)
– Each generator request will use the most efficient accept-reject / internal generator combination available
– Operator PDFs (sum, product, …) distribute generation over components whenever possible
– Generation order for products with conditional PDFs is sorted out automatically
– Possible because the OO structure of the p.d.f. allows automated analysis of its structure
  pdf.generate(vars,nevt)
• Can also specify template data as input to generator
  pdf.generate(vars,data)
Using models – Plotting
• Model visualization geared towards ‘publication plots’, not interactive browsing → emphasis on 1-dimensional plots
• Simplest case: plotting a 1-D model over data
– Modular structure of composite p.d.f.s allows easy access to components for plotting
– Can show Poisson confidence intervals instead of sqrt(N) errors
– Adaptive spacing of curve points to achieve 1‰ precision
– Can store plot with data and all curves as single object
  RooPlot* frame = mes.frame() ;
  data->plotOn(frame) ;
  pdf->plotOn(frame) ;
  pdf->plotOn(frame,Components("bkg")) ;
  frame->Draw() ;
Visualizing multi-dimensional models
• IMHO: visualization of multi-dim. models is much more complicated than visualization of multi-dim. data
• Simplest scenario: projection plots
– Simple example: M(m,t) = f·[S(m)·T(t)] + (1-f)·[B(m)·U(t)]
– Provide intuitive behavior: the projection of the p.d.f. automatically follows the projection of the data. The 1-D plot frame remembers ‘hidden’ variables in the dataset; subsequent p.d.f.s in the same frame are automatically projected
– Works identically for non-factorizable p.d.f.s! Necessary integrals are calculated automatically, either analytically or numerically
  RooPlot* frame = dt.frame() ;
  data.plotOn(frame) ;
  model.plotOn(frame) ;
  model.plotOn(frame,Components("bkg")) ;
[Plots: mB distribution, $c(m) = \int M(m,t)\,dt$; t distribution, $c(t) = \int M(m,t)\,dm$]
Using models – plotting multi-dimensional models
• A projection plot usually doesn’t capture the full information content of your p.d.f. → it is equally easy to plot a ‘slice’
Projection:
  RooPlot* frame = dt.frame() ;
  data.plotOn(frame) ;
  model.plotOn(frame) ;
  model.plotOn(frame,Components("bkg")) ;
Slice, $c(t) = \int_{5.27}^{5.30} M(m,t)\,dm$:
  RooPlot* frame = dt.frame() ;
  // The range is set on the mass observable, consistent with the integral over m above.
  // The variable name mB is an assumption taken from the plot labels.
  mB.setRange("sel",5.27,5.30) ;
  data.plotOn(frame,CutRange("sel")) ;
  model.plotOn(frame,ProjectionRange("sel")) ;
  model.plotOn(frame,ProjectionRange("sel"),Components("bkg")) ;
[Plots: mB distribution, t distribution]
• More complicated ‘slice’ views such as a likelihood ratio plot only take O(3) lines of extra code
Using models – plotting
• Can also use template data to show ‘average’ curve
– Convenient for visualization of conditional p.d.f.s
– Example: plot the t distribution of the following model over data, averaging over the dt values of the data
$M(m, t \,|\, dt)$
$\mathrm{Curve}(t) = \frac{1}{N_D} \sum_{i=1}^{N_D} \int M(m, t \,|\, dt_i)\, dm$
  model.plotOn(frame,ProjWData(dtData)) ;
  model.plotOn(frame,Components("bkg"),ProjWData(dtData),LineStyle(kDashed)) ;
Advanced features – Task automation
• Support for routine task automation, e.g. useful in finding confidence intervals
[Diagram: input model → generate toy MC → fit model, repeated N times → accumulate fit statistics → distributions of parameter values, parameter errors, and parameter pulls]
  // Instantiate MC study manager
  RooMCStudy mgr(inputModel) ;
  // Generate and fit 100 samples of 1000 events
  mgr.generateAndFit(100,1000) ;
  // Plot distribution of sigma parameter
  mgr.plotParam(sigma)->Draw() ;
RooFit at SourceForge – roofit.sourceforge.net
• RooFit moved to SourceForge to facilitate access and communication with non-BaBar users
• Code access
– CVS repository via pserver
– File distribution sets for production versions
RooFit at SourceForge – Documentation
• Comprehensive set of tutorials (PPT slide show + example macros)
– Five separate tutorials
– More than 250 slides and 20 macros in total
• Class reference in THtml style
• Also available by end of 2005: pedagogical User Guide + Reference Guide (~100-page document)
Selection of BaBar publications in 2004 using RooFit
• Improved Measurement of the CKM angle alpha using B0+-
• Measurement of branching fractions and charge asymmetries in B+ decays to +, K+, + and '+, and search for B0 decays to K0 and
• Branching Fraction and CP Asymmetries in B0 → KS KS KS
• Measurement of CP Asymmetries in B0K0 and B0K+K-K0S
• Measurement of Branching Fractions and Time-Dependent CP-Violating Asymmetries in B' K Decays
• Improved Measurement of Time-Dependent CP Violation in B0 to (ccbar)K0 Decays (‘sin2β’)
• Measurements of the Branching Fraction and CP-Violating Asymmetries in B0 → f0(980)KS Decays
• Measurement of Time-dependent CP-Violating Asymmetries in B0K*, K*KS0 Decays
• Study of the decay B0+- and constraints on the CKM angle alpha.
• Measurement of CP-violating Asymmetries in B0K0S0 Decays
• Measurement of Time-Dependent CP Asymmetries in B0K0
• Search for B+/- [K-/+ +/-]D K+/- and upper limit on the bu amplitude in B+/- DK+/-
• Limits on the Decay-Rate Difference of Neutral B Mesons and on CP, T, and CPT Violation in B0B0bar Oscillations
Summary
• RooFit adds a powerful data modeling language to ROOT
– Mathematical objects (variables, functions, …) represented as C++ objects
– Complex models are easily composed from a library of standard components
   • New fundamental components can be written easily
– Powerful tools for fitting, toy MC generation and visualization are easy to use
   • Compact syntax
   • Universal functionality (almost no arbitrary / implementation-related restrictions)
– Automated function optimization analysis and implementation delivers industrial-strength performance
• RooFit makes the road from a ‘simple’ to a really elaborate/complex p.d.f. less steep
• RooFit is distributed with ROOT starting with ROOT version v5.02-00