Machine learning in ALICE
Activities in ALICE and heavy-ion physics
Rüdiger Haake (CERN), 05.04.2018
Outline
This lecture consists of two parts:
1) Overview of machine learning activities in ALICE
● Jets
● Particle identification
● Charmed baryons
● DQM
● Fast simulations
● ...
2) Hands-on session: classification and regression example on SWAN
IML workshop
Don’t miss the 2nd IML Machine Learning Workshop, 9-12 April 2018, CERN (Vidyo available)
● Tutorials on methods & concepts
● Industry talks by Google, IBM, Yandex, ...
● Modern ML applications
● ...
https://indico.cern.ch/event/668017/
Motivation
See SpotMini opening a door: https://www.youtube.com/watch?v=aFuA50H9uek
Motivation
Amazing progress in the last years! Partly reached ‘superhuman’ abilities
Interest rising in physics (and industry)
Motivation
Important: machine learning is not the solution to all our problems
● Usually, one cannot just throw in data and expect the algorithm to perform better than a human, even in a well-defined task
● ML is no replacement for domain knowledge
● “Garbage in → garbage out”: algorithms, classifiers, and training data still need to be selected carefully
● Also interesting: deep learning is not necessarily better than classic ML methods
(Cartoon: xkcd.com)
Heavy-ion collisions on one slide
● ALICE is the dedicated heavy-ion experiment at the LHC
● Main objective in heavy-ion physics: the Quark-Gluon Plasma (QGP)
● hot & dense medium of deconfined quarks & gluons
● strongly interacting
● collective expansion
● created in high-energy heavy-ion collisions
With it, we hope to better understand the strong interaction & cosmological questions
Particle jets
Jets: Physics
● Conceptually, a jet is the final state of collimated hadrons fragmenting from a hard-scattered parton
● Jets can be used to shed light on the very early stage of a hadron collision
● There is no unambiguous jet definition: the reconstructed jet observable is defined by the jet-finding algorithm used to clusterize tracks and calorimeter clusters into jets
Jets: Physics
● pp collisions: test pQCD, hadronization models, etc.; use as reference for heavy-ion collisions
● p-Pb & Pb-Pb collisions: use jets as calibrated probes for modification in the medium; vacuum behavior known from pp, pQCD
● In ALICE, we reconstruct
● charged jets (charged tracks)
● full jets (charged tracks + calorimeter clusters)
● Jets are embedded in background from the heavy-ion collision
● mean background corrected for event-by-event
● in-event fluctuations treated statistically
Jets: Reconstruction
● In ALICE, usually FastJet is used for jet finding
● The jet reconstruction algorithm clusters tracks / calorimeter hits into jets
● The anti-kT algorithm yields rather conical jets at higher transverse momenta
● Further cuts (area, pT) applied
● The jet finder is itself an example of machine learning
● This is unsupervised learning: particles are classified as belonging to certain classes (i.e. jets), though no real “learning” happens here
Jets: Machine learning applications
● Depending on the problem, jets can serve as input to
● shallow learning algorithms (e.g. BDTs)
● deep learning (neural networks)
High-level parameters:
● Global event properties: multiplicity, mean background, centrality, vertex z, ...
● Per-jet properties: jet mass, N-subjettiness, radial moment, other shapes, ...
Low-level parameters:
● Low-level per-jet properties: constituent momenta, η, φ, ...
● Other low-level properties: reconstructed secondary vertices, ...
● Features need to have discrimination power for the problem
● In our case: need a good MC description of the features
Jets: Machine learning applications
Typical applications for jets
Jet tagging/classification
● q/g-jet tagging
● b/c-jet tagging
● W-jets vs. QCD jets
● Multiclass jet classification
Regression of jet parameters
● Background in heavy-ion jets
Jets: Jet images
● Motivation: huge progress with convolutional neural networks in image recognition/classification
● Classify jets according to the pattern they leave in the detector
... in calorimeter cells
... as charged-particle tracks
● In arXiv:1407.5675, jet images are used for W-jet tagging
(Figures: average jet image before and after preprocessing; discrimination between the two populations)
● Several approaches on jet images: CNNs, locally-connected networks, ...
● Works, but keep in mind: “Jets are no cats”
● Might also work for QCD jet classification
Jets: Recurrent/recursive approaches
● Jet images exploit the analogy to image classification
● An analogy to speech recognition is also possible; exploit:
sentence = sequence of words, jet = sequence of constituents
● Good analogies are useful: we can build on progress in computer science
● There is a lot of research on text classification/understanding
● As for text classification, recurrent networks are promising
● Interesting in this context: recursive networks whose topology changes event-by-event depending on the jet finder's combination history (arXiv:1702.00748)
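The “jet = sequence of constituents” analogy needs fixed-shape inputs for most recurrent-network frameworks. A minimal preprocessing sketch in NumPy (not ALICE analysis code; the (pT, η, φ) layout and `max_len` are illustrative assumptions):

```python
import numpy as np

def jet_to_sequence(constituents, max_len=30):
    """Turn a jet's constituents into a fixed-length sequence for a
    recurrent network: sort by pT (hardest first), truncate, zero-pad.
    `constituents` is an (N, 3) array of (pT, eta, phi) per particle."""
    constituents = np.asarray(constituents, dtype=float)
    order = np.argsort(-constituents[:, 0])   # descending in pT
    seq = constituents[order][:max_len]
    pad = np.zeros((max_len - len(seq), 3))   # zero-padding at the end
    return np.vstack([seq, pad])

# Toy jet with 4 constituents, each (pT, eta, phi)
jet = [(1.2, 0.1, 2.0), (5.0, 0.0, 2.1), (0.4, -0.2, 1.9), (2.2, 0.05, 2.05)]
seq = jet_to_sequence(jet, max_len=6)   # shape (6, 3), hardest row first
```

Padded sequences like this can then be fed to an LSTM, optionally with masking so the network ignores the zero rows.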
Jets: Heavy-flavor jet tagging
● Jets from b-/c-quarks are interesting probes in heavy-ion collisions
● In-medium modification of b-jets differs from that of udsg-jets
● larger energy loss for gluons than quarks (color charge)
● “dead cone effect”: for massive quarks, gluon bremsstrahlung is suppressed at smaller angles w.r.t. the parton direction
● Approach in ALICE: a deep-learning tagger
● Exploit that B-hadrons decay in the (sub-)millimeter range → displaced from the primary vertex → reconstruct secondary vertices
● “Conventional” approach: rectangular cuts on the properties of the most displaced vertices
● Ansatz here: apply a neural network to several low-level input parameters
(Illustration: http://bartosik.pp.ua/hep_sketches/btagging)
Jets: Heavy-flavor jet tagging
● The tagger uses two subnets which are merged
● secondary vertices
● track impact parameters
● 1D convolutional networks exploiting vertex or constituent relations
● Each subnet optimized via grid search, separately: “clever brute force”
● Powerful concept:
● find suitable designs for the available discriminators
● optimize separately
● merge with a neural net
Jets: Background approximation
● Example of regression for jets: background approximation
● In heavy-ion collisions, the background strongly affects the measurement
● General ansatz: subtract the mean background per event, correct for fluctuations statistically
● In-event fluctuations are probed by random cones
● The distribution of fluctuations δpT is the background-subtracted transverse momentum in these cones
● For jet spectra measurements, the fluctuations can be unfolded
Idea: use a neural network to approximate the background under each jet
Jets: Background approximation
● Use a neural network to approximate the background under each jet
● In contrast to the jet constituents, the background is uncorrelated with the jet → a neural network might be able to estimate the background
● Possible input parameters: jet constituent η, φ, pT
● Again: the major concern is the need for good training data; here, this means jets & background must be realistic
● Training data: toy model
● Monte Carlo jets from PYTHIA (real physics)
● random background including flow (toy), parameters taken from real data
● First results promising with simple neural networks
● Might eventually allow measurement of jets at lower transverse momentum
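As a toy illustration of such a regression, a small network can learn a background-like quantity from per-jet summary features. Everything here is invented for the sketch (three made-up features, a linear truth model); it is not the ALICE network or training setup.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)

# Toy setup: per "jet", three features (e.g. summed constituent pT in
# three annuli); the true background is a noisy linear combination.
n = 2000
X = rng.uniform(0, 10, size=(n, 3))
y = 1.5 * X[:, 0] + 0.8 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 0.1, n)

# A simple feed-forward regression network
net = MLPRegressor(hidden_layer_sizes=(32, 32), activation="relu",
                   solver="adam", max_iter=500, random_state=0)
net.fit(X[:1500], y[:1500])
score = net.score(X[1500:], y[1500:])   # R^2 on held-out "jets"
```

For the real problem, the inputs would be constituent-level (η, φ, pT) and the target the true background under the jet, known from the toy-model generation.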
Particle identification
PID: Motivation
● In tracking, usually the transverse momenta are measured; particle identification needs further measurements
● Information on the particle species is often important for the physics analysis:
● production of pions, kaons, protons and their modification in heavy-ion collisions
● heavy-flavor physics (D-, B-mesons, b-jets)
● neutral pion production
● photon production
● particle composition in jets
● ...
ALICE uses a variety of different detectors to gain complementary information on particles
PID: Subdetectors in ALICE
● ITS: Inner Tracking System (silicon detectors) → energy loss dE/dx
● TPC: Time Projection Chamber (gas detector) → energy loss dE/dx
● TOF: Time-of-Flight detector → particle velocity
● TRD: Transition Radiation Detector → transition radiation (electron ID)
● HMPID: High-momentum particle identification → Cherenkov radiation
● Calorimeters: EMCal and PHOS (Pb-scintillator and PbWO4 calorimeters) → total energy
PID: Electron identification
● Electron identification using several subdetectors in ALICE; measurement of nσ values: how many standard deviations away from the mean expected value
● MVA approach: use a Boosted Decision Tree (BDT) on nσ values and track properties
● Performance evaluated in pp; soon: Pb-Pb
(Figure: TPC nσ distribution for electrons with no cut, PID cut, and MVA cut)
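A BDT on nσ-like inputs can be sketched with scikit-learn. The numbers below are a toy: two made-up detector nσ features with shifted means for electrons vs. hadrons, not real ALICE calibrations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(7)

# Toy nσ values per track: electrons centered at 0 in both detectors,
# hadrons shifted. The (nσ_TPC, nσ_TOF) layout is illustrative only.
n = 2000
electrons = rng.normal([0.0, 0.0], [1.0, 1.0], size=(n, 2))
hadrons = rng.normal([-3.0, 2.0], [1.2, 1.2], size=(n, 2))
X = np.vstack([electrons, hadrons])
y = np.r_[np.ones(n), np.zeros(n)]   # 1 = electron, 0 = hadron

# Shuffle, train on 3/4, evaluate on the held-out quarter
idx = rng.permutation(len(y))
X, y = X[idx], y[idx]
bdt = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                 random_state=0)
bdt.fit(X[:3000], y[:3000])
acc = bdt.score(X[3000:], y[3000:])
```

In practice, track properties would be added as further input columns, and the working point would be chosen from the score distribution rather than from plain accuracy.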
PID: General identification task
● Ultimate goal: general particle identification which exploits all available information
● Stage 1: classifier works on cleaned, calibrated distributions, e.g. on nσ values
● Stage 2: classifier works on raw PID detector distributions
● Both cases would be very helpful to raise efficiency and purity
● Crucial point: Monte Carlo productions
● training data needs to be as precise as possible
● Monte Carlos often show poor PID agreement, at least for some particle species
Charmed baryons
Charmed baryons
● Like b- or c-jets, charmed baryons are interesting probes of the QGP in heavy-ion collisions
● Λc baryons: reconstruction is challenging
● short lifetime → decay only slightly displaced from the collision vertex
● Reconstruction in ALICE possible via hadronic decays: Λc+ → p K− π+ (other decay modes are measured as well)
Charmed baryons
General approach:
1) Identify decay products with ALICE's PID capabilities
2) Exploit topological constraints (displaced production vertex)
3) Select candidates and extract the signal via an invariant mass fit
Charmed baryons
● Instead of rectangular cuts on PID, topological cuts, etc., a BDT is trained to select candidates
● Input variables include the kinematics of the decay products, topological properties, and PID (all verified to be well described in MC)
(Figure: invariant mass distributions for default cuts vs. MVA selection)
● The BDT response shows a clear separation of signal & background
● The invariant mass distribution shows reduced background and enhanced signal!
● Systematic uncertainties: the result is shown to be robust for several BDT variations
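The final step, the invariant mass fit, can be sketched on toy data: a Gaussian signal peak on a smooth background. All numbers (peak at 2.29 GeV/c², widths, yields, fit range) are illustrative, not taken from the analysis.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)

# Toy invariant-mass spectrum: Gaussian signal + flat-ish background.
# 2.29 GeV/c^2 is near the Lambda_c mass; values are illustrative.
signal = rng.normal(2.29, 0.01, size=500)
background = rng.uniform(2.1, 2.5, size=5000)
counts, edges = np.histogram(np.r_[signal, background],
                             bins=80, range=(2.1, 2.5))
centers = 0.5 * (edges[:-1] + edges[1:])

def model(m, N, mu, sigma, a, b):
    """Gaussian signal on a linear background."""
    return N * np.exp(-0.5 * ((m - mu) / sigma) ** 2) + a + b * m

p0 = [100, 2.29, 0.01, 60, 0]   # rough starting values for the fit
popt, _ = curve_fit(model, centers, counts, p0=p0)
fitted_mass = popt[1]            # fitted peak position
```

The signal yield would then be taken from the fitted Gaussian (or from bin counting after background subtraction), separately in each pT interval.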
Fast simulation
Fast simulation: GANs
● Simulation of the expected physics in the detector is crucial for the interpretation
● Usually: Monte Carlo simulation
1) event generation on particle level (e.g. PYTHIA)
2) reconstruction on detector level (e.g. GEANT)
● Both steps can be computationally expensive, especially for heavy-ion collisions
● Currently, a huge share of computing resources (~50%) is used for simulations
● Computational costs for LHC Run 3 and the HL-LHC will be much worse: higher statistics in data → need higher MC statistics!
Fast simulation: GANs
● General approach: use fast simulation; example: mix fully-reconstructed with only simply-reconstructed signals
● But to prepare for the HL-LHC (~2023), we need to save more (×100)
● Promising ansatz: generative models, realized as DNNs
● Variational Autoencoders (VAEs)
● Generative Adversarial Networks (GANs)
● ...
● Generation of realistic samples according to the training samples
● A lot of research going on here; we benefit from industry progress
● Advantages over classic MC:
● neural network inference is much faster than reconstruction (×10^5)
● parallel computing (GPU), not so much CPU-bound
● can use commercial infrastructure: GPU clusters, cloud computing
Fast simulation: GANs
Generative Adversarial Network: two networks are trained simultaneously, a generator and a discriminator
● In competition & cooperation, the generator learns to create more and more realistic samples
● Several studies show that deep GANs are able to reproduce a very large feature space
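The alternating generator/discriminator updates can be illustrated with the smallest possible GAN, here in plain NumPy on a 1D toy distribution. This is nothing like a detector-scale model; all parameters and learning rates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

# "Real" samples: a 1D Gaussian the generator should learn to imitate
real = rng.normal(3.0, 1.0, size=1000)

# Generator x = w*z + b, discriminator D(x) = sigmoid(a*x + c):
# a minimal GAN, just to show the two alternating update steps.
w, b = 1.0, 0.0      # generator parameters
a, c = 0.1, 0.0      # discriminator parameters
lr = 0.05

for step in range(3000):
    z = rng.normal(size=64)
    fake = w * z + b
    xr = rng.choice(real, size=64)
    # Discriminator step: ascend log D(real) + log(1 - D(fake))
    dr, df = sigmoid(a * xr + c), sigmoid(a * fake + c)
    a += lr * (np.mean((1 - dr) * xr) - np.mean(df * fake))
    c += lr * (np.mean(1 - dr) - np.mean(df))
    # Generator step: ascend log D(fake), i.e. fool the discriminator
    df = sigmoid(a * (w * z + b) + c)
    w += lr * np.mean((1 - df) * a * z)
    b += lr * np.mean((1 - df) * a)

generated = w * rng.normal(size=1000) + b   # samples from the generator
```

Real GANs replace the two one-parameter functions with deep networks and use stochastic-gradient optimizers, but the training loop has exactly this structure.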
Fast simulation: CaloGAN
In ALICE, proof-of-concept work is ongoing to use GANs
● to simulate fully-reconstructed tracks with the TPC
● to perform detector reconstruction of particle-level data
Outlook of what is possible: CaloGAN (a GAN for the ATLAS LAr calorimeter)
(Figures: deposited energy; energy fraction in the i-th layer)
● Works for a complex, segmented calorimeter
● Up to five orders of magnitude faster
● More work to be done until it is ready for production
DQM/QA
DQM/QA
● Data Quality Management (DQM) and Quality Assurance (QA) still require a lot of work from experts
● DQM takes place during data taking, QA shortly after
● Needs a lot of resources; even worse for LHC Run 3, when much more data will be recorded
● Usual approach: experts assign flags to recorded runs using DQM/QA histograms
● Machine learning can help with several aspects, e.g. anomaly detection & automatic data classification
DQM/QA
● A current research approach uses more than 200 physics parameters from the available QA parameters
● First tests show that automatic/assisted classification is possible
● Other approaches are tested as well: GANs for anomaly detection, LSTM autoencoders for time-series prediction (inspired by ECG anomaly detection)
(Figure: example runs classified as good vs. bad)
Some pragmatic hints & hands-on
Neural networks
Neural networks can be configured very differently
● Some settings can be chosen according to experience, e.g.:
● Adam optimizer for deep networks
● ReLU activation function
● There is a lot of knowledge available online for problems similar to those we face in physics
● Good settings are adapted to the problem
Settings to choose: architecture (CNNs, LSTMs, dense layers), activation function, loss function, optimizer, learning rate, regularization
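In Keras, these choices appear explicitly when the model is defined and compiled. A minimal sketch with the experience-based defaults mentioned above (ReLU, Adam); layer sizes, dropout rate, and learning rate are illustrative choices, not recommendations.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Architecture: a small dense network for binary classification
model = keras.Sequential([
    keras.Input(shape=(10,)),               # 10 input features
    layers.Dense(64, activation="relu"),    # activation function
    layers.Dropout(0.2),                    # regularization
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # sigmoid for binary output
])

# Optimizer, learning rate, and loss function are set at compile time
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy",   # matches the sigmoid output
              metrics=["accuracy"])
```

For regression one would instead end in a linear output neuron with a mean-squared-error loss; for multi-class classification, in a softmax layer with categorical cross-entropy.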
How to design a good model
Define your problem
● Is it a regression or classification task?
The optimizer, loss function, activation function, etc. depend on this choice
● In a classification task, do you need multiple classes or does binary classification suffice?
Binary classification might be easier for the network to learn than multi-class classification
● Will the problem rely only on high-level parameters?
If yes, a different technique (like BDTs) can also be considered
High-level parameters are e.g. jet mass, jet shapes; low-level parameters e.g. constituents
Define your dataset
● In case of classification, clearly define signal and background
● In case of regression, be sure your regression parameter is well defined and represents what you want
This is a crucial step; better to put more effort here than less
● Which input features could potentially have discrimination power for your problem? Implement them
How to design a good model
Define your model(s)
● Get inspired by similar problems, experiment with different designs
● Once you have found a suitable design → perform a grid search
Clever “brute force” trial of possible hyperparameters:
● number of layers, neurons per layer, ...
● activation function, loss function, ...
If suitable, combine several models on features
● Useful if the models work on distinct input features
● Example: a combination of PID classifier models on TPC and TOF might be useful
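The grid search itself is just a loop over all hyperparameter combinations, keeping the best validation score. A toy sketch (the grid values and the two-Gaussian dataset are invented for illustration):

```python
import itertools
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)

# Toy dataset: two Gaussian classes in 4 features
X = np.vstack([rng.normal(0, 1, (400, 4)), rng.normal(1.5, 1, (400, 4))])
y = np.r_[np.zeros(400), np.ones(400)]
idx = rng.permutation(800)
X, y = X[idx], y[idx]
X_tr, y_tr, X_val, y_val = X[:600], y[:600], X[600:], y[600:]

# "Clever brute force": try every combination of a small grid and
# keep the best validation score. Grid values are illustrative.
grid = {"layers": [1, 2], "neurons": [8, 32], "activation": ["relu", "tanh"]}
best = (None, -1.0)
for n_layers, n_neurons, act in itertools.product(
        grid["layers"], grid["neurons"], grid["activation"]):
    net = MLPClassifier(hidden_layer_sizes=(n_neurons,) * n_layers,
                        activation=act, solver="adam",
                        max_iter=300, random_state=0)
    net.fit(X_tr, y_tr)
    score = net.score(X_val, y_val)
    if score > best[1]:
        best = ((n_layers, n_neurons, act), score)
```

For larger grids, cross-validation (e.g. scikit-learn's `GridSearchCV`) gives a less noisy comparison than a single validation split.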
Control parameters
● To monitor the training progress, several control parameters exist
● Loss: directly minimized in the training
● Only for classification:
● scores: the score distribution hints at the model performance
● ROC curve: plots the true positive rate vs. the false positive rate
● AUC (Area Under the ROC Curve): a single performance number
Keras output during training looks like this:
Train on 38000 samples, validate on 38000 samples
Epoch 1/1
38000/38000 [==============================] - 107s 3ms/step - loss: 1.0340 - acc: 0.6808 - val_loss: 0.7285 - val_acc: 0.7342
● Loss on training data: a good check that something is learned; should get lower
● Loss on validation data: a check that the learning generalizes to unseen data
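The AUC has a handy interpretation: it is the probability that a randomly chosen signal candidate scores higher than a randomly chosen background candidate. A small sketch computing it from scores via ranks (the Mann-Whitney statistic); the score values below are toy numbers:

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic:
    the probability that a random positive scores above a random
    negative. Tied scores get averaged ranks."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):          # average ranks for ties
        tie = scores == s
        ranks[tie] = ranks[tie].mean()
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# A perfectly separating classifier reaches AUC = 1
auc = roc_auc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])   # = 1.0
```

AUC = 0.5 corresponds to random guessing (the diagonal of the ROC curve); in practice one would use a library routine such as scikit-learn's `roc_auc_score`.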
Control parameters
Example: b-jet tagging
Typical control plot: loss vs. training epoch
● Gets lower during the whole training
● Seems to reach a plateau
● This means: more training does not help anymore
Control parameters
Example: b-jet tagging
Typical control plot: ROC curve
● Optimally, the curve would always be at 1 → full efficiency with the lowest misclassification rates
● The orange line indicates how random guessing would perform
Control parameters
Example: b-jet tagging
Typical control plot: AUC = integrated area under the ROC curve
● The model shows good behavior → reaches a plateau
Control parameters
Example: b-jet tagging
Typical control plot: score distribution
● A good indicator of how clearly the two classes can be distinguished
Hands-on sessions
● The tutorial will be done on SWAN in your browser
● Prerequisites:
● a web browser
● a CERN account
● a CERNBox space (if never used before, log in once at https://cernbox.cern.ch/)
● When you click on the link below, the notebooks are automatically copied to your CERNBox
https://swan004.cern.ch/hub/spawn?projurl=https://gitlab.cern.ch/rhaake/MLTutorial.git