Page 1:

Overview of ML and Big data tools at HEP experiments

LHCP2019, Puebla, Mexico

Alfredo Castaneda, University of Sonora


With results from ATLAS, CMS, LHCb and ALICE

Page 2:

How are Machine Learning and Big data connected?

• Machine learning (ML) is a branch of Artificial Intelligence (AI)

• ML uses algorithms and statistical models to train a computer to perform a task without being explicitly programmed to do so

• ML needs large datasets (big data) in order to achieve good accuracy in its predictions

• Big data experiments are an excellent laboratory to test ML algorithms in order to solve complex problems that could not be solved by traditional methods

• To manipulate big data, several tools are needed to process the information (most of them developed by data scientists)

Page 3:

The LHC: A big data experiment

• The Large Hadron Collider (LHC) is the world's largest and most powerful particle accelerator

• Two high-energy particle beams travel at close to the speed of light

• The beams are made to collide at four locations around the accelerator ring, corresponding to the positions of the four particle detectors (ATLAS, CMS, ALICE, LHCb)

• The volume of data collected at the LHC represents a considerable challenge for processing and storage

• The dataflow from all experiments during Run 2 was about 25 GB/s

In June 2017, CERN passed the milestone of 200 petabytes of data permanently archived in its tape libraries
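
As a rough sense of scale, the following back-of-the-envelope estimate (purely illustrative, assuming the 25 GB/s dataflow were sustained continuously) relates the Run 2 rate to the 200 PB archive milestone:

```python
# Rough scale estimate relating a 25 GB/s dataflow to the archived data volume.
# Illustrative only: real data-taking is not continuous all year round.
rate_gb_per_s = 25
seconds_per_year = 365 * 24 * 3600

petabytes_per_year = rate_gb_per_s * seconds_per_year / 1e6   # 1 PB = 1e6 GB
print(f"~{petabytes_per_year:.0f} PB per year if the rate were sustained")

days_to_200_pb = 200e6 / rate_gb_per_s / 86400                 # 200 PB = 200e6 GB
print(f"~{days_to_200_pb:.0f} days of continuous running to reach 200 PB")
```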

Page 4:

Challenges for the HL-LHC

• A factor of 10 increase in the particle rate is expected during the High-Luminosity LHC (HL-LHC) era

• The vast amount of data will be used to perform precision Higgs boson measurements and searches for new physics phenomena

The 10x increase in particle rate will demand new tools for data processing, particle identification, offline analysis and many other tasks. Machine learning and Big data tools will be crucial during this period.

Representation of a pp collision during the HL-LHC era

Page 5:

ML and Big data methods currently used in HEP


Machine Learning

Artificial Neural Networks (ANNs)

Boosted Decision Trees (BDTs)

Convolutional Neural Networks (CNNs)

Recursive Neural Networks (RNN)

Generative Adversarial Networks (GANs)

Big data

WLCG (Worldwide LHC Computing Grid)

Parallel computing

Databases (Oracle)

GPU resources

Hadoop services at CERN

For several years now, a number of ML and Big data methods/tools have been implemented in HEP experiments, in particular at the LHC. In the case of ML, the most important are related to artificial neural networks, where several architectures have been used. In the case of Big data, the parallelization of processes, distributed analysis and the recent introduction of GPU resources are becoming crucial for day-to-day activities.

Page 6:

ML and big data areas of development in HEP

PID, reconstruction and triggering

Simulation

DQM

Hardware triggering

Computing

Page 7:

PID, reconstruction, Triggering


Page 8:

Jet classification

• Jet reconstruction/classification is one of the most important tasks in modern HEP experiments

• Jets can reveal important information about the nature of the process they originated from, whether it was:

• Quark-initiated jet
  • Heavy quarks (b, c)
  • Light quarks (u, d, s)

• Gluon-initiated jet

• Hadronic decay of vector bosons (W, Z), top quarks and Higgs bosons

• Methods based on multivariate analysis and deep learning have shown good discrimination power between the different types of jets

Page 9:

• Differentiating a quark-initiated jet from a gluon-initiated one has broad applications in SM measurements and searches beyond the SM

• The approach is based on a state-of-the-art image classification technique (CNNs), which is compared against classical jet tagging schemes

Jet image examples: quark-initiated jet and gluon-initiated jet

Jet substructure using CNNs

ATL-PHYS-PUB-2017-017
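
As an illustration of the jet-image approach described above, here is a minimal Keras sketch of a CNN classifier acting on calorimeter jet images. The input size, layer choices and dummy data are assumptions for illustration only; they do not reproduce the configuration of ATL-PHYS-PUB-2017-017.

```python
# Minimal sketch of a CNN quark/gluon jet-image classifier (illustrative only).
# The input shape, layer sizes and dummy data are assumed; this does not
# reproduce the ATLAS configuration.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(32, 32, 1)),            # jet image: eta-phi grid of calorimeter energy
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),      # P(quark-initiated)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy data standing in for labelled quark/gluon jet images
images = np.random.rand(256, 32, 32, 1).astype("float32")
labels = np.random.randint(0, 2, size=(256, 1))
model.fit(images, labels, epochs=1, batch_size=32)
```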

Page 10:

Heavy flavor jet tagging (b-tagging)

• The task of separating jets that contain b-hadrons from those initiated by lighter quarks is called b-tagging

• The significance of b-tagging is due to the prominence of b-hadron jets in the final states of interesting physics processes

• The typical heavy-hadron topology features one or more vertices that are displaced from the hard-scatter interaction point

• The longer b-hadron lifetime corresponds to a decay that will be associated with a secondary vertex

Page 11:

b-tagging using deep learning in ATLAS

• RNNIP is a recurrent neural network based b-tagging classifier

• It is based on the traditional IP3D tagger, which uses the impact parameter as the principal discriminator

• RNNIP represents jets as a sequence of tracks ordered by absolute transverse impact parameter significance

• Each track is described by a vector of features: transverse and longitudinal impact parameters, impact parameter significances, jet pT fraction, etc.

• This four-class tagger predicts a probability for each flavor (b, c, tau and light)
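
A minimal sketch of an RNNIP-style architecture (not the actual ATLAS implementation): a recurrent network over a padded sequence of tracks, each described by a small feature vector, ending in a four-class softmax. The sequence length, feature count and layer sizes are assumptions.

```python
# Illustrative sketch of an RNNIP-style tagger: a recurrent network over a
# sequence of tracks, each described by a feature vector, ending in a
# four-class softmax. Sequence length, feature count and layer sizes are assumed.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

max_tracks = 15     # tracks per jet, ordered by |impact parameter significance|, padded/truncated
n_features = 6      # e.g. d0, z0, their significances, jet pT fraction, ... (assumed)

model = keras.Sequential([
    layers.Input(shape=(max_tracks, n_features)),
    layers.Masking(mask_value=0.0),             # ignore zero-padded tracks
    layers.LSTM(50),                            # summarize the track sequence
    layers.Dense(4, activation="softmax"),      # b / c / tau / light probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data standing in for per-jet track sequences and flavour labels
x = np.random.rand(128, max_tracks, n_features).astype("float32")
y = np.random.randint(0, 4, size=(128,))
model.fit(x, y, epochs=1, batch_size=32)
```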

Diagram of the RNNIP tagger

Michela Paganini and ATLAS Collaboration 2018 J. Phys.: Conf. Ser. 1085 042031

Light-jet rejection: RNNIP vs IP3D

c-jet rejection: RNNIP vs IP3D

Page 12:

• DeepCSV is a common algorithm to perform b-tagging

• It uses a deep neural network

• The input is the same set of observables used by the CSVv2 b-tagger: secondary-vertex and track-based lifetime information

• It uses four hidden layers (each with a width of 100 nodes)
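
A minimal Keras sketch of the network described above, i.e. a dense classifier with four hidden layers of 100 nodes each; the number of input observables, the output classes and the dummy training data are assumptions, not the actual DeepCSV configuration.

```python
# Sketch of a dense b-tagging network with four hidden layers of 100 nodes each,
# as described above. The input dimension and output classes are assumptions,
# not the actual DeepCSV setup.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_inputs = 60                                   # assumed number of track / secondary-vertex observables

model = keras.Sequential([layers.Input(shape=(n_inputs,))])
for _ in range(4):                              # four hidden layers, 100 nodes each
    model.add(layers.Dense(100, activation="relu"))
model.add(layers.Dense(5, activation="softmax"))      # assumed flavour categories
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

x = np.random.rand(256, n_inputs).astype("float32")   # dummy stand-in for jet observables
y = np.random.randint(0, 5, size=(256,))
model.fit(x, y, epochs=1, batch_size=64)
```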

CMS PAS BTV-15-001

b-tagging using deep learning in CMS

Page 13:

ML for object reconstruction in LHCb

• LHCb is a single-arm forward spectrometer

• The main goal is to search for indirect evidence of new physics in CP violation and rare decays of beauty and charm baryons

• Since 2015 (Run 2) the experiment has employed a real-time analysis strategy for a large fraction of its physics programme

• Full physics analyses are performed directly on the objects reconstructed in the final stage of the software trigger, reducing the output event size

Page 14:

Deep learning in LHCb

Fake Track Rejection with Multilayer Perceptron

LHCb-PUB-2017-011

• Neural-network-based algorithm to identify fake tracks in the LHCb pattern recognition

• This algorithm (ghost probability) retains more than 99% of well-reconstructed tracks while reducing the number of fake tracks by 60%
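
A minimal sketch of a fake-track classifier of this kind, here using scikit-learn's MLPClassifier; the track features, network size and threshold are assumptions for illustration, not the LHCb configuration.

```python
# Illustrative fake-track classifier: a small MLP that outputs a "ghost
# probability" per reconstructed track. Features, network size and threshold
# are assumptions for illustration.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_features = 10                                   # e.g. fit chi2, hit counts, kinematics (assumed)
x = rng.random((1000, n_features))                # dummy track features
y = rng.integers(0, 2, size=1000)                 # 1 = fake (ghost) track, 0 = genuine track

clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=300, random_state=0)
clf.fit(x, y)

ghost_prob = clf.predict_proba(x)[:, 1]           # per-track ghost probability
keep = ghost_prob < 0.5                           # reject tracks above a chosen threshold
print(f"kept {keep.mean():.1%} of tracks at this working point")
```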

Jet tagging

• Using Boosted Decision Trees (BDTs) to identify b, c and light jets

• Using kinematic observables of the secondary vertex as input

• The efficiency for identifying a b (c) jet is about 65% (25%), with a probability of misidentifying a light-parton jet of 0.3%

Charged particle ID

• ProbNN, a neural network with one hidden layer

JINST 10 (2015) P06013

LHCb detector performance: https://doi.org/10.1142/S0217751X15300227

Page 15:

Charmed Baryon production in ALICE

• ALICE is the dedicated heavy-ion experiment at the LHC

• Study of the Quark-Gluon Plasma (QGP)

• Due to the short lifetime of the Λc baryon and statistical limitations, the reconstruction was particularly challenging

• BDTs were used to separate background from signal

JHEP 1804 (2018) 108
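
As an illustration of the BDT-based signal/background separation mentioned above (not the actual ALICE analysis), a sketch with scikit-learn's gradient-boosted trees on assumed decay-topology features:

```python
# Illustrative BDT for signal/background separation, in the spirit of the
# Lambda_c analysis above (not the actual ALICE analysis). Features and
# settings are assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_features = 8                                  # e.g. decay length, impact parameters, PID (assumed)
x = rng.random((2000, n_features))
y = rng.integers(0, 2, size=2000)               # 1 = signal candidate, 0 = background

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
bdt = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
bdt.fit(x_train, y_train)

scores = bdt.predict_proba(x_test)[:, 1]        # BDT response; candidates are selected by cutting on it
print(f"mean BDT score on the test sample: {scores.mean():.3f}")
```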

Page 16:

Top tagging with deep learning

• Tagging of highly energetic jets from top quark decay based on image techniques

• Using high-level jet substructure variables

• The jet classification method achieves a background rejection of 45 at a 50% efficiency operating point for reconstruction-level jets with pT in the range of 600 to 2500 GeV
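
For reference, background rejection is conventionally the inverse of the background (mistag) efficiency at the quoted signal-efficiency working point:

```latex
% Background rejection at a fixed signal-efficiency working point
R_{\mathrm{bkg}} = \frac{1}{\varepsilon_{\mathrm{bkg}}}, \qquad
R_{\mathrm{bkg}} = 45 \ \text{at} \ \varepsilon_{\mathrm{sig}} = 50\%
\;\Rightarrow\; \varepsilon_{\mathrm{bkg}} = \tfrac{1}{45} \approx 2.2\%
```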

Example jet images: signal and background

https://arxiv.org/abs/1704.02124v2

Page 17:

Simulation


Page 18:

Simulation production in HEP

• Production of Monte Carlo simulation samples is a crucial task in HEP experiments

• Detailed simulation is needed to optimize the event selection for SM searches and new physics searches

• To measure the performance of new detector technologies

• Usually samples with large statistics are needed in order to reduce the analysis uncertainties

• A lot of resources (computational, human) are needed to fulfill the requirements of the experimental collaborations

For the past few years Monte Carlo simulations have represented more than 50% of the WLCG (LHC computing grid) workload

Page 19:

GANs for fast simulation

• Generative Adversarial Networks (GANs) are a type of deep neural network whose architecture comprises two networks, pitted one against the other (“adversarial”)

• One neural network, the “generator”, generates new data instances, while the other, the “discriminator”, evaluates their authenticity by comparing them with the training data

• The goal is to train the generator so that the discriminator cannot tell whether the information it receives is real or generated
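
A minimal sketch of the adversarial training described above, with assumed dimensionalities and dummy data standing in for the real training samples:

```python
# Minimal GAN sketch: a generator producing fake samples and a discriminator
# judging real vs. fake, trained against each other. Dimensions and dummy
# "real" data are assumptions for illustration.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

latent_dim, data_dim = 16, 64                     # assumed noise and sample dimensions

generator = keras.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(data_dim),                       # fake sample (e.g. a flattened shower image)
])
discriminator = keras.Sequential([
    layers.Input(shape=(data_dim,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),                              # logit for "sample is real"
])

bce = keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = keras.optimizers.Adam(1e-3)
d_opt = keras.optimizers.Adam(1e-3)

real_data = tf.random.normal((1000, data_dim))    # stand-in for e.g. Geant4 showers
batch = 32

for step in range(200):
    # Discriminator step: learn to separate real samples from generated ones
    real = tf.gather(real_data, np.random.randint(0, 1000, batch))
    fake = generator(tf.random.normal((batch, latent_dim)), training=False)
    with tf.GradientTape() as tape:
        d_loss = bce(tf.ones((batch, 1)), discriminator(real, training=True)) + \
                 bce(tf.zeros((batch, 1)), discriminator(fake, training=True))
    d_opt.apply_gradients(zip(tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))

    # Generator step: learn to produce samples the discriminator labels as real
    noise = tf.random.normal((batch, latent_dim))
    with tf.GradientTape() as tape:
        g_loss = bce(tf.ones((batch, 1)), discriminator(generator(noise, training=True),
                                                        training=False))
    g_opt.apply_gradients(zip(tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
```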

Introduced by Ian Goodfellow in 2014; referred to as “the most interesting idea in the last 10 years of ML”

Page 20:

CaloGAN: a calorimetry fast simulation

Average gamma shower from Geant4 (top) and CaloGAN (bottom)

Shower shape variables for e+, gamma and pi+: comparison of Geant4 and CaloGAN

• A simulation of the calorimeter using GANs (CaloGAN)

• From a series of simulated showers, the CaloGAN is tasked with learning the simulated data distribution for gammas, e+ and pi+

• The training dataset is represented in image format by three arrays of dimensions 3x96, 12x12 and 12x6, each representing the shower energy depositions

Phys. Rev. D 97, 014021 (2018)

Page 21:

Calorimeter FastSim in LHCb

Showers generated with Geant4 (top) and showers simulated with GANs (bottom)

• New deep learning framework based on GANs

• Faster than traditional simulation methods by 5 orders of magnitude

• Reasonable accuracy

• This approach could allow physicists to produce the amount of simulated data needed during the HL-LHC

https://arxiv.org/abs/1812.01319

Page 22:

DQM


Page 23:

Data Quality monitoring

• Data quality monitoring is a crucial task for every large-scale High-Energy Physics experiment

• The system still relies on the evaluation of experts

• An automated system can save resources and improve the quality of the collected data

• A detector measures physical properties of proton-collision products

• When a subdetector exhibits abnormal behavior, it is reflected in the measured or reconstructed properties

Page 24:

doi:10.1088/1742-6596/898/9/092041

Data Quality monitoring

• The primary goal of the system is to assist the Data Quality managers by filtering the most obvious cases, both positive and negative

• The system objective is the minimization of the fraction of data samples passed on for human evaluation (i.e. the rejection rate)

• The evaluation of the algorithm was done using CMS experimental data published in the CERN Open Data portal

Page 25:

Hardware triggering


Page 26:

Data processing with FPGAs

• The L1 trigger decision (hardware, FPGA-based) must be made within 4 microseconds

• Could an ML model be fast enough to make a decision in such a short time window?

Field Programmable Gate Arrays (FPGAs)

• Reprogrammable fabric of logic cells embedded with DSPs, BRAMs and high-speed I/O

• Low power consumption compared to CPU/GPU

• Massively parallel

Page 27:

• Regular neural network model trained in CPU/GPU

• Model loaded in firmware via the hls4ml interface

• Decision made in the FPGA

https://arxiv.org/abs/1804.06913

ML interface for FPGAs
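
The workflow sketched on this slide (train a model in a standard framework, then translate it into FPGA firmware with hls4ml) looks roughly as follows; the model is a stand-in, and the hls4ml calls follow its public tutorials and may differ between package versions:

```python
# Sketch of the hls4ml flow: train a small Keras model on CPU/GPU, convert it
# into an HLS project for an FPGA, and (optionally) run the synthesis.
# The model is a stand-in; API usage follows the hls4ml tutorials and may
# differ between package versions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
import hls4ml

# Stand-in network trained as usual in software
model = keras.Sequential([
    layers.Input(shape=(16,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(np.random.rand(256, 16), np.random.randint(0, 5, 256), epochs=1)

# Translate the trained model into an HLS project
config = hls4ml.utils.config_from_keras_model(model, granularity="model")
hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, output_dir="hls4ml_prj"
)
hls_model.compile()                               # build the C simulation
y_hls = hls_model.predict(np.random.rand(10, 16).astype("float32"))
# hls_model.build()                               # full HLS synthesis (slow, needs the vendor tools)
```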

Page 28:

FPGAs for Phase-2 ATLAS Muon barrel

ATLAS-TDR-026

• The ATLAS Level-0 muon trigger will undergo a complete upgrade for the HL-LHC

• New trigger processor

• FPGA-based system

• New trigger station

• New RPC layer

• Improved trigger algorithms

• Need to be very fast and flexible

Neural networks are a good candidate

From strip maps to images: CNNs

Page 29:

GRID, Parallel computing, GPUs, Hadoop


Page 30:

The Worldwide LHC Computing Grid (WLCG)

• The WLCG is the world’s largest computing grid

• Supported by many associated national and international grids across the world and coordinated by CERN

• 42 countries

• 170 computing centers

• Over 2 million tasks run every day

• 1 million computer cores

• 1 exabyte of storage

Page 31:

Hadoop and Spark services at CERN

• Hadoop is a collection of open-source programs that carry out tasks essential for a computing system designed for big-data analytics

• Hadoop services at CERN started in 2013

• The systems have evolved as requirements on the data-store backend have increased:

• More data generated (rates > 100 GB/day)

• Analytics and machine learning (the initial design is not optimal for these)
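
As an example of the kind of analytics these services support, a minimal PySpark sketch; the input file and its columns are hypothetical:

```python
# Minimal PySpark sketch of the kind of analytics run on the Hadoop/Spark
# services described above. The input file and its columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("monitoring-analytics").getOrCreate()

# Hypothetical monitoring data: one row per job, with its site and CPU time
df = spark.read.parquet("hdfs:///project/monitoring/jobs.parquet")

summary = (df.groupBy("site")
             .agg(F.count("*").alias("n_jobs"),
                  F.avg("cpu_seconds").alias("avg_cpu_seconds"))
             .orderBy(F.desc("n_jobs")))
summary.show(10)
```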

Page 32:

Big data projects with Hadoop and Spark

New CERN IT monitoring infrastructure

Next-generation archiver for accelerator logs

The ATLAS event index

SWAN services

• Fully integrated with Spark and Hadoop at CERN

• Modern, powerful and scalable platform for data analysis

Page 33:

GPU resources at CERN

• Graphics Processing Units (GPUs) are fundamental for the training of machine learning models

• With GPUs the training time can be reduced by several orders of magnitude with respect to a normal CPU

• HEP workloads can benefit from massive parallelism

• Current challenges such as TrackML motivate CERN to consider GPU provisioning in the OpenStack cloud

• GPUs can be accessible via virtual machines

• The user can run standard CUDA-enabled applications such as TensorFlow

• The resources are so far quite limited but are intended to grow in the coming years
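
For example, once a GPU-backed virtual machine is provisioned, a framework such as TensorFlow picks the device up transparently; a quick check (illustrative):

```python
# Quick check that a provisioned GPU is visible to TensorFlow; any CUDA-enabled
# framework would be used the same way from inside the virtual machine.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print(f"GPUs visible to TensorFlow: {len(gpus)}")

# Training code is unchanged: supported ops are placed on the GPU automatically
with tf.device("/GPU:0" if gpus else "/CPU:0"):
    x = tf.random.normal((1024, 1024))
    y = tf.matmul(x, x)                           # runs on the GPU when one is available
print(y.shape)
```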

Page 34:

Conclusions and Perspectives

• ML and Big data tools will be crucial to solve many of the challenges of the HL-LHC

• Expert supervision will not be removed but rather complemented with a new set of tools

• Current tools are used for PID, triggering, reconstruction and many other data-processing tasks

• MC simulation production could benefit from the use of GANs, which can speed up the simulation, allowing a much faster and more optimal delivery of results

• The integration of ML algorithms at the hardware level (FPGAs) will potentially allow better decisions at the L1 trigger

• CERN is investing in new devices (hardware, software) to integrate all these tools

• To accomplish the different tasks, a joint effort from physicists, engineers and IT people will be needed

Page 35:

Backup

Page 36:

Multilayer neural network

Page 37:

Deep tagging for double quark jet

• The DeepDoubleBvL tagger is a tool used to distinguish jets that originate from the merged decay products of resonances from jets that originate from single partons

• Simulated samples of H->bb, H->cc and QCD multijets were used for the performance studies

DP-2018/033

Page 38:

CaloGAN neural network architectures

Page 39:

GANs for data-simulation corrections

CMS DP-2019/003

• The scale factor (SF) for the b-tag discriminant (DeepCSV) shape is derived from the data/simulation ratio

• The “deep scale factor” (DeepSF) approach uses a neural network

• The discriminator network models the residual data/simulation ratio

• The DeepSF network trains through adversarial feedback

• The final scale factor is built from an ensemble (unweighted mean) of 25 separate trainings (different seeds)

• Ensemble fluctuations (standard error of the mean) indicate the stability of the procedure
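
The ensembling step described above amounts to an unweighted mean over trainings, with the standard error of the mean as the stability estimate; a sketch with dummy numbers standing in for the per-training scale factors:

```python
# Ensemble of independently trained networks: the final scale factor is the
# unweighted mean over trainings, and the standard error of the mean quantifies
# the stability of the procedure. Dummy numbers stand in for real outputs.
import numpy as np

n_trainings = 25
rng = np.random.default_rng(42)
sf_per_training = rng.normal(loc=0.98, scale=0.02, size=n_trainings)   # hypothetical per-training SFs

sf_final = sf_per_training.mean()                                      # unweighted ensemble mean
sf_sem = sf_per_training.std(ddof=1) / np.sqrt(n_trainings)            # standard error of the mean
print(f"scale factor = {sf_final:.3f} +/- {sf_sem:.3f}")
```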

Page 40:

Cherenkov Detectors FastSim with GANs

• A typical Cherenkov detector simulation is quite slow: propagating the Cherenkov photons through the media requires expensive calculations

• This work concentrates on simulating the Detector of Internally Reflected Cherenkov light (DIRC) of BaBar and SuperB

Cherenkov Detectors Fast Simulations Using Neural Networks, 10th International Workshop on Ring Imaging Cherenkov Detectors

Page 41:

CERN openlab initiative

• Most of the new developments in big data and ML tools are now carried out by the CERN openlab initiative

• CERN openlab is a public-private partnership that works to accelerate the development of cutting-edge ICT solutions for the worldwide LHC community

• Some of the ongoing projects are:

• Oracle cloud

• REST services, JavaScript, and JVM performance

• Quantum computing for high-energy physics

• High-throughput computing collaboration

• Code modernization: fast simulation

• Intel big-data analytics

• Oracle big-data analytics

• Yandex data popularity and anomaly detection