
MASTER'S THESIS

ENERGY REGRESSION WITH DEEP LEARNING IN PARTICLE COLLIDER PHYSICS

Simon Schnake
[email protected]
Degree programme: Physics, matriculation no. 6506821, 6th semester

First examiner: Prof. Dr. Peter Schleper
Second examiner: Prof. Dr. Gregor Kasieczka

Submission: 06.2019


Boundary stones stood there too, something superfluous, it seemed to him. – Hermann Lenz


Abstract

One of the important questions of machine learning in particle physics is whether we can improve the resolution of our measuring instruments. Is it possible to gain deeper insights from the structure of the data produced in particle physics? These questions are examined in this thesis using the example of calorimeter data and jet calibration. Different types of networks are investigated, and the technique of loss function design is explored in order to reduce systematic effects in the data structure. Overall, it is shown that deep learning can achieve better performance in the application cases examined.



Zusammenfassung

One of the fundamental questions of machine learning in particle physics is whether we can improve the resolution of our measuring instruments. Is it possible to gain deeper insights from the structure of the data produced in particle physics? These questions are considered in this thesis using the example of calorimeter data and jet calibration. Different network types are tested, and the technique of loss function construction is explored, which makes it possible to reduce systematic effects in the data structure. Overall, it is shown that deep learning achieves a better result in the application cases examined.



Contents

1 Introduction
2 Particle Physics
   2.1 The Standard Model of Particle Physics
   2.2 Fundamental Building Blocks
   2.3 Fundamental Interactions
   2.4 Jets
3 Calorimetry
   3.1 Energy Loss due to Ionisation
   3.2 Interactions of Electrons
   3.3 Interactions of Photons with Matter
   3.4 Hadronic Interactions with Matter
   3.5 Calorimeter
      3.5.1 Electromagnetic Calorimeter
      3.5.2 Hadronic Calorimeter
4 Geant4
   4.1 The Structure of a Simulation
   4.2 Integration of Physical Interactions
   4.3 Tracking
   4.4 Geometry
   4.5 Materials
5 Experimental Setup
   5.1 The Large Hadron Collider
   5.2 The CMS Experiment
      5.2.1 Coordinate System and Conventions
      5.2.2 Tracking Systems
      5.2.3 Electromagnetic Calorimeter
      5.2.4 Hadron Calorimeter
      5.2.5 Solenoid
      5.2.6 The Muon System
      5.2.7 The Trigger System
   5.3 Particle Flow
   5.4 Jet Clustering
   5.5 Jet Energy Corrections
6 Deep Learning
   6.1 Multilayer Perceptron
   6.2 Backpropagation
   6.3 Convolutional Neural Networks
   6.4 Adversarial Networks
   6.5 Particle Flow Networks
   6.6 Deep Learning Tools
7 Calorimetry Energy Analysis
   7.1 Calorimeter Simulation
   7.2 Energy Regression by Linear Fitting
   7.3 Fully Connected Setup
   7.4 Fully Connected Regression
   7.5 Data Augmentation
   7.6 Convolutional Setup
   7.7 Adversarial Training
   7.8 Likelihood Solution
   7.9 Comparison of Network Architectures
8 Jet Calibration Analysis
   8.1 Deep Learning Setup
   8.2 Jetnet Analysis
   8.3 Particle Flow Network
   8.4 Comparison
9 Conclusion
Eidesstattliche Versicherung


Chapter 1: Introduction

Particle collisions such as those at the Large Hadron Collider (LHC) [1] produce gigantic amounts of data. With the High-Luminosity Upgrade [2], this amount will increase considerably, requiring new analysis techniques to gain new insights from the data. The progress in the area of machine learning in recent years offers a toolbox of new analysis techniques that can be applied to particle physics. In this thesis some possibilities of this application are presented.

This work is not alone in this pursuit; in recent years considerable progress has been made in several areas of particle physics using deep learning. One aspect is that existing searches can be supported with the help of deep learning [3, 4, 5]. Furthermore, it is possible to use deep learning for model-independent searches [6, 7, 8]. Pile-up can be detected [9]. In addition, machine learning can be used to identify jets [10, 11, 12, 13, 14, 15, 16, 17]. These are merely a few examples from the broad field of research that has been established in recent years.

This work deals with low-level data, especially calorimeter cell data, as well as with high-level variables such as the clustered particles of individual jets. The aim is to apply different techniques in order to improve the energy resolution of modern particle detectors.

The following chapter gives a brief overview of particle physics. The third chapter follows with a discussion of the interactions of particles with matter and of calorimetry. Afterwards, Geant4, one of the most commonly used detector simulation toolkits, is presented. In chapter 5 the Large Hadron Collider, the CMS detector, the particle flow algorithm, as well as jet clustering and energy correction are described. Some aspects of deep learning that play an important role in this work are discussed afterwards.

In the seventh chapter the first part of the analysis is presented. A simplified design of the CMS HCAL is developed using Geant4 and used to simulate electromagnetic showers. In this chapter, methods of deep learning are studied, which are then applied to the complete CMS detector simulation in the following chapter. There, hadron jets are considered and deep learning is applied in a realistic scenario of jet calibration. Chapter 9 concludes the work with a summary and an outlook.


Chapter 2: Particle Physics

This chapter briefly outlines the core areas of particle physics relevant for this work. The Standard Model of particle physics and further topics are considered.

2.1 The Standard Model of Particle Physics

The Standard Model of elementary particle physics summarizes the theoretical foundations of modern particle physics and describes the elementary particles and the fundamental interactions of physics. It contains a quantum field theoretical formulation of the electromagnetic, the weak and the strong interaction as well as the Higgs mechanism. The comparatively weak gravitation is not included.

2.2 Fundamental Building Blocks

The fundamental building blocks of matter are the elementary particles, which are assumed to be point-like. Elementary particles can be divided into bosons and fermions according to their intrinsic angular momentum, the spin. Bosons carry integer spin, whereas fermions carry half-integer spin.

There exist twelve fermions in the framework of the Standard Model, all with spin s = 1/2. These fermions form the fundamental components of matter. As presented in Figure 2.1, they are categorized into three generations of leptons and quarks. A lepton generation consists of a particle with a charge of −e and a corresponding neutrino. The quark generations each contain a positive quark with charge 2/3 e and a negative quark with charge −1/3 e. The individual generations are sorted according to their masses. Besides the charge and the spin, the fermions carry further quantum numbers, such as the weak isospin. The quarks have an additional color charge, labeled red, green and blue. The color charge governs the coupling of the quarks to the gluons of the strong interaction. Similarly, the coupling to the weak interaction is associated with the weak isospin. The weak isospin is related to the chirality, which characterizes the handedness of a fermion in terms of a left-handed and a right-handed projection. In addition to the quantum numbers introduced above,


Figure 2.1: A diagram of the Standard Model of particle physics, showing the twelve fermions (six quarks and six leptons, each with an antiparticle) in three generations with their masses, charges, colors and spins, the gauge bosons of the strong, electromagnetic and weak interactions, and the Higgs boson.

each of the twelve fermions has a corresponding antiparticle, which results from the conjugation of charge, parity and time.

2.3 Fundamental Interactions

The electromagnetic, weak and strong interactions described by the SM can be represented mathematically by a U(1)_Y, an SU(2)_L and an SU(3)_C symmetry, respectively. These are called gauge groups or gauge symmetries. The gauge bosons resulting from the linear combinations of the respective generators of the gauge groups are the photon, the W and Z bosons and eight gluons. The photon is the exchange quantum mediating the electromagnetic interaction and couples to the electric charge q. Electromagnetism is described by quantum electrodynamics (QED).

The W and Z bosons are the interaction particles of the weak interaction and couple to the weak isospin. Within the framework of the weak interaction, coupling to the W boson allows transitions between quark generations and also between lepton generations. The weak and electromagnetic interactions can be combined via U(1)_Y ⊗ SU(2)_L into the electroweak interaction. The gluons couple to the color charge C and are the interaction particles of the strong interaction. The explicit couplings and processes are described by quantum chromodynamics (QCD). It follows from the formulation of the interactions by gauge groups that the gauge bosons must be massless particles in order to guarantee the required local gauge invariance of the Lagrangian density of the SM.

In order to introduce mass terms and thus achieve a correct description of nature, the Higgs mechanism is invoked. Within the Higgs mechanism, a complex scalar SU(2) doublet φ with spin s = 0 is postulated. The Lagrangian density acquires the new potential

\[ V(\phi) = \mu^2 \phi^\dagger \phi + \lambda (\phi^\dagger \phi)^2. \quad (2.1) \]

Here µ² and λ are the parameters of the potential. The symmetry of the ground state is broken by the explicit selection of one particular ground state; this process is known as spontaneous symmetry breaking. Excitations of the vacuum state are described by the Higgs field, and the quantized Higgs field is the Higgs boson. The interaction with the Higgs field gives the particles their masses.

2.4 Jets

The high-energy phenomena of quantum chromodynamics are described by the color-charged particles, the quarks and the gluons. These particles cannot be observed in isolation. Rather, the process of hadronization causes directional bundles of color-neutral hadrons, known as jets, to form. At the LHC, protons are brought into collision, but in the theoretical description of perturbation theory this is treated as the individual partons of the protons colliding with each other. Depending on the momentum of the protons, the probability of their individual partons interacting varies. To some extent, the effects involved can be described by the parton distribution function, which indicates the probability of finding a parton of type i with a certain momentum fraction x in a hadron. Using the proton as an example, it becomes clear that for higher x it is more likely to find quarks than gluons.

The partons created during the collision typically radiate further partons, resulting in a so-called parton shower. During the hadronization process, partons bind together to form hadrons, which are observable. Many hadrons have a short lifetime and decay after a short period of time. For this reason, a jet in the detector consists of relatively few particle types, which together allow conclusions to be drawn about the original parton.


Figure 2.2: Distribution of the momentum fraction x of a parton multiplied by its parton distribution function f(x). The two graphs show the distribution at different four-momentum transfers µ [18].


Chapter 3: Calorimetry

This chapter discusses the fundamental interactions between particles and matter. Building on this, the principles of electromagnetic and hadronic calorimeters are explained.

3.1 Energy Loss due to Ionisation

Charged particles moving through a medium lose energy through individual, statistically occurring collisions with the atoms of the material. The collisions cause ionization, excitation of the atom, or collective excitation of the medium. The energy loss in a single collision is usually low; only in rare cases does the particle lose a larger fraction of its energy.

The Bethe-Bloch formula gives the average energy loss of heavy charged particles:

\[ -\left\langle \frac{dE}{dx} \right\rangle = K z^2 \frac{Z}{A} \frac{1}{\beta^2} \left[ \frac{1}{2} \ln \frac{2 m_e c^2 \beta^2 \gamma^2 T_{\mathrm{max}}}{I^2} - \beta^2 - \frac{\delta(\beta\gamma)}{2} \right]. \quad (3.1) \]

Here z is the charge of the incident particle, Z, A and I are the atomic number, atomic mass and mean excitation energy of the medium, T_max is the maximum energy transfer in a single collision, and δ(βγ) is the density-effect correction.

For small particle energies, the 1/β² term in the Bethe formula dominates. As a result, particles that deposit their energy only through ionization processes in the material have a well-defined range, and their energy deposition is largest when this range is reached. This characteristic peak in the energy deposition distribution is called the Bragg peak.
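To make equation (3.1) concrete, the following minimal Python sketch evaluates the mean ionization loss of a singly charged heavy particle. The choice of copper as the medium, the coefficient K = 0.307 MeV cm²/mol and the mean excitation energy I(Cu) ≈ 322 eV are illustrative assumptions taken from standard tables; the density correction δ(βγ) is neglected, and T_max is approximated by 2 m_e c² β²γ², which holds for particles much heavier than the electron.

    import math

    K = 0.307075   # MeV * cm^2 / mol, coefficient of the Bethe-Bloch formula
    ME_C2 = 0.511  # electron rest energy in MeV

    def bethe_bloch(beta_gamma, z=1, Z=29, A=63.5, I_eV=322.0):
        """Mean ionization loss -<dE/dx> in MeV cm^2/g following eq. (3.1),
        with the density correction delta(beta*gamma) neglected."""
        beta2 = beta_gamma**2 / (1.0 + beta_gamma**2)
        I = I_eV * 1e-6                       # mean excitation energy in MeV
        t_max = 2.0 * ME_C2 * beta_gamma**2   # heavy-particle approximation
        log_term = 0.5 * math.log(2.0 * ME_C2 * beta_gamma**2 * t_max / I**2)
        return K * z**2 * (Z / A) / beta2 * (log_term - beta2)

    # A minimum-ionizing muon (beta*gamma of about 3) in copper loses
    # roughly 1.4 to 1.5 MeV cm^2/g:
    print(bethe_bloch(3.0))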


3.2 Interactions of Electrons

When passing through a material, electrons can deposit their energy in two different ways [19]: through the ionization of the medium and through the generation of bremsstrahlung. The energy loss of electrons through ionization differs slightly from the ionization loss of heavy charged particles. The reasons for this deviation are the kinematics, the spin, the charge, and the fact that the collisions underlying ionization are scatterings of two identical particles [18]. Bremsstrahlung is the energy loss of charged particles in the Coulomb field of an atomic nucleus through the radiation of a photon; it thus proceeds analogously to Rutherford scattering, accompanied by the radiation of a photon. The mean energy loss due to bremsstrahlung can be approximately expressed by

\[ \left\langle \frac{dE}{dx} \right\rangle \simeq -\frac{E}{X_0}, \quad (3.2) \]

where X_0 is the radiation length of the material.

Since the energy loss due to ionization grows only logarithmically with the energy, while the bremsstrahlung loss increases linearly with the energy [18], bremsstrahlung is the dominant process at high energies. With decreasing electron energy, the losses due to ionization increasingly dominate. This is shown in Figure 3.1.

Figure 3.1: Illustration of the different fractions of energy loss of electrons and positrons when passing through lead [18].
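Integrating equation (3.2) yields the characteristic exponential decay of the mean electron energy with depth, E(x) = E0 exp(-x/X0). A short numerical sketch (the starting energy of 100 GeV is an illustrative assumption, and the relation only holds in the regime where bremsstrahlung dominates):

    import math

    def mean_radiative_energy(e0, x_over_x0):
        """Mean electron energy after x/X0 radiation lengths,
        from integrating eq. (3.2): E(x) = E0 * exp(-x/X0)."""
        return e0 * math.exp(-x_over_x0)

    # Starting from 100 GeV: about 37% of the energy remains after one
    # radiation length, and less than 0.005% after ten.
    for t in (1, 5, 10):
        print(t, mean_radiative_energy(100.0, t))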


3.3 Interactions of Photons with Matter

Photons interact with matter mainly in four different ways. The possible processes are the photoelectric effect, Rayleigh scattering, the Compton effect and pair production [20].

In the photoelectric effect, an electron is ejected from the shell of an atom. The vacancy in the shell is filled, accompanied either by the emission of an Auger electron or by the radiation of characteristic X-rays. Since the cross section for the photoelectric effect is proportional to E⁻³ [20], the photoelectric effect dominates at low photon energies.

Rayleigh scattering is the coherent elastic scattering of photons off the shell electrons of an atom. In this process there is no energy deposition in the material; the photon is merely deflected.

The Compton effect describes the scattering of the incident photon off a free or quasi-free electron from an outer shell of an atom [20]. The term quasi-free means that the energy of the photon is significantly greater than the binding energy of the electron [19]. The Compton effect dominates over the other processes in an energy range from a few hundred keV to about 5 MeV. The cross section for the Compton effect is proportional to 1/E [20].

If the energy of the photon is greater than twice the electron mass, the photon can produce an electron-positron pair in the Coulomb field of a nucleus. Pair production is not possible in a vacuum, since a partner is needed to absorb the recoil in order to conserve momentum [20].


3.4 Hadronic Interactions with Matter

The development of hadronic showers is much more complicated than that of electromagnetic showers [20], because only a few processes play a role in electromagnetic showers, whereas the more diverse strong interaction gives rise to a larger variety of processes in the development of hadronic showers. Another aspect that contributes to the complexity of hadronic showers is that a struck nucleus undergoes nuclear interactions; in electromagnetic showers only the small binding energy of the electrons to the nuclei is relevant [20].

Charged hadrons deposit part of their energy via the ionization of the medium until they produce high-energy secondary particles in an inelastic process. For neutral hadrons, only the second process is relevant [20, 21]. The mean free path between two hadronic interactions is given by the hadronic absorption length [21]. The hadrons produced in the inelastic process propagate further through the detector until they are absorbed themselves.

The production of secondary hadrons in the nucleus takes place via the process of spallation. Spallation is divided into two phases [20]: the intranuclear cascade and evaporation. In the intranuclear cascade, the incident hadron scatters off quasi-free nucleons in the nucleus. These nucleons propagate further through the nucleus and scatter off other nucleons, forming a cascade of particles inside the nucleus. During the formation of the intranuclear cascade, pions and other unstable hadrons are created. Some of the generated particles escape from the nucleus and propagate further through the medium, thus contributing to the development of the hadronic shower. The energies of these particles are in the GeV range [21]. Particles that do not escape leave the nucleus in an excited state. By emitting free nucleons, α particles or heavier fragments, the nucleus loses this excitation energy again; the remaining excitation energy is radiated off via photons. The energy radiated from the nucleus in these two processes is of the order of some MeV [21].

The particles that drive the development of the hadronic cascade are protons, neutrons, and charged and neutral mesons [21], most of them pions. One third of all pions produced are neutral pions, which decay electromagnetically into two photons. This decay occurs before the neutral pions can interact hadronically and results in a fraction of the energy of the hadronic shower being converted into an electromagnetic sub-shower [21]. As this energy fraction is no longer available for hadronic interactions, the proportion of the electromagnetic sub-shower increases with the energy of the incoming hadron.

The electromagnetic part of a single hadronic shower fluctuates strongly, since the electromagnetic fraction depends on the processes that take place at the beginning of the shower [20]. In contrast to electromagnetic showers, the energy of a hadronic shower is not completely detectable [20]. The reason is that delayed photons, soft neutrons, and the binding energy of hadrons and nucleons are invisible to the energy measurement [21]. Due to differences in the cross sections of the electromagnetic and the strong interaction, hadronic showers have a significantly larger spatial extension than electromagnetic showers in materials with large nuclear charge [20].


3.5 Calorimeter

Calorimeters are used for the destructive energy measurement of incident particles via the showers they produce. Depending on the type of particle measured, they are subdivided into electromagnetic and hadronic calorimeters. Calorimeters are further divided into homogeneous and sampling calorimeters. Homogeneous calorimeters are made of a material that both acts as an absorber for the particles and simultaneously generates the measurable signal. They most often consist of inorganic, heavy scintillation crystals or non-scintillating Cherenkov radiators [18]. Sampling calorimeters consist of a sequence of active and passive layers. In the passive layers the shower develops, and in the active layers the signal is generated by ionization or scintillation. Materials used in passive layers are typically lead, iron, copper and uranium. Liquid noble gases and organic or inorganic scintillators are used in active layers [18]. The following two subsections deal with the properties of electromagnetic and hadronic calorimeters.

3.5.1 Electromagnetic Calorimeter

The relative energy resolution of electromagnetic calorimeters is given by [21, 18]

\[ \frac{\sigma}{E} = \frac{a}{\sqrt{E}} \otimes b \otimes \frac{c}{E}. \quad (3.3) \]

The symbol ⊗ stands for the quadratic sum of the individual terms, i.e. addition in quadrature. The first term is the stochastic term, the second the constant term and the third the noise term. The stochastic term is caused by fluctuations of the number of charged tracks in the active medium. In sampling calorimeters the stochastic term is proportional to

\[ \frac{\sigma}{E} \propto \sqrt{\frac{t}{E}}. \quad (3.4) \]

Here t describes the thickness of the absorber in units of the radiation length X_0, and E is the energy of the incident particle. To obtain this proportionality, it is necessary to assume that the numbers of charged tracks in the individual layers are independently and Gaussian distributed [22]. The noise term is caused by electrical noise in the signal processing. The constant term is due to energy-independent effects, such as an inhomogeneous structure of the detector, inaccuracies in fabrication, temperature gradients or radiation damage [21].
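The quadrature sum of equation (3.3) is straightforward to evaluate numerically. A minimal sketch; the coefficients below (a stochastic term of 3%, a constant term of 0.5% and a noise term of 0.2 GeV) are assumptions for illustration, not values of any real detector:

    import math

    def relative_resolution(E, a, b, c):
        """sigma/E of eq. (3.3): stochastic term a/sqrt(E), constant term b
        and noise term c/E, added in quadrature (the meaning of ⊗)."""
        return math.sqrt((a / math.sqrt(E))**2 + b**2 + (c / E)**2)

    # The stochastic and noise terms shrink with energy, so the constant
    # term eventually limits the resolution:
    for E in (1.0, 10.0, 100.0, 1000.0):
        print(E, relative_resolution(E, a=0.03, b=0.005, c=0.2))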


3.5.2 Hadronic Calorimeter

Since part of the energy deposited in a hadronic shower is not detectable, a calorimeter generally provides a smaller signal for hadrons than for electrons [21]. A quantitative description is given by the ratio e/h, which is therefore generally greater than one [20]. A calorimeter that delivers the same signals for a hadron and an electron has a ratio of e/h = 1 and is called a compensating calorimeter [20]. Compensation is an intrinsic property of a calorimeter [19] and cannot be measured directly [20]. The e/h ratio is instead determined by measuring the e/π ratio [20], which indicates the ratio of the signals of an electron and a pion and is given by [20]

\[ \frac{e}{\pi} = \frac{e/h}{1 - f_{\mathrm{em}}\,(1 - e/h)}. \quad (3.5) \]

The e/π ratio therefore depends on the electromagnetic shower fraction f_em and thereby on the energy of the incident pion.

Compensation improves the linearity and resolution of a hadronic calorimeter [20, 19, 21]. The response of non-compensating calorimeters is not linear, since the electromagnetic part of the shower increases with the energy of the incident particle. Since the electromagnetic component produces a stronger signal, the response of a non-compensating calorimeter is larger for particles of higher energy. The resolution also improves if compensation is present: the electromagnetic shower fraction fluctuates strongly, so in a non-compensating calorimeter events of the same energy generate signals of different magnitude, and the resolution deteriorates.

Compensation is therefore a design criterion for hadronic calorimeters. In general, e/h > 1 applies, so either a reduction of the electromagnetic signal or an increase of the hadronic signal leads to compensation. By using absorber materials with high nuclear charge, the electromagnetic signal can be reduced: a large part of the energy deposition of electromagnetic showers takes place through the absorption of low-energy photons in the absorber, and the electrons released in these processes cannot reach the active medium in absorbers with high nuclear mass, so they no longer contribute to the signal. The enhancement of the hadronic signal is achieved by improved detection of evaporation neutrons. The energy transfer of neutrons is inversely proportional to the nuclear mass A of the material. Neutrons therefore cross a passive medium with little energy loss and transfer their energy efficiently to an active medium with small A or with a hydrogen component, where they produce recoil protons. These protons have a short range and therefore do not reach the passive medium, so they deposit their energy in the active layers. The increase of the signal from the nuclear component of the shower can thus be achieved by varying the layer thicknesses of the active and passive medium relative to each other or by enriching the active medium with hydrogen.
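A small numerical sketch of equation (3.5) illustrates this energy dependence. The intrinsic ratio e/h = 1.3 and the f_em values are illustrative assumptions; since f_em grows with the pion energy, e/π approaches one from above for a non-compensating calorimeter:

    def e_over_pi(e_over_h, f_em):
        """Signal ratio e/pi of eq. (3.5) for a given intrinsic e/h ratio
        and electromagnetic shower fraction f_em."""
        return e_over_h / (1.0 - f_em * (1.0 - e_over_h))

    for f_em in (0.2, 0.4, 0.6, 0.8):
        print(f_em, e_over_pi(1.3, f_em))   # non-compensating: e/pi > 1
    print(e_over_pi(1.0, 0.5))              # compensating (e/h = 1): exactly 1.0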


Chapter 4: Geant4

The basics of simulating a detector with Geant4 [23] are discussed in this chapter. The first section deals with the structure and sequence of a Geant4 application. The following two sections cover the integration of physical interactions into the simulation and the tracking of particles through the detector. The last two sections discuss the definition of a detector geometry and of materials.

4.1 The Structure of a Simulation

Each Geant4 application passes through different states during a simulation: the preInit state, a state during initialization, an idle state from which a run can be started, a state in which the application is during the run, and a state that is passed through while quitting the application.

The first step in the simulation process is to create an instance of the RunManager class, which controls the entire process [24]. Creating the RunManager instance sets Geant4 to the preInit state. The classes which describe the components of the simulation are handed to the RunManager from this state [24]. There are three required and five optional classes. The required classes are the G4VUserDetectorConstruction class, a physics list, and a G4VUserPrimaryGeneratorAction class, which is used to generate primary particles and vertices. The detector geometry used is defined by the G4VUserDetectorConstruction class. The G4VUserPrimaryGeneratorAction class generates the initial state of the simulation. The initial state can be made available by the interface to a framework [23]; alternatively, the G4ParticleGun class provides the possibility to generate primary particles and vertices. It allows the selection of a primary particle and the setting of dynamic properties such as momentum, kinetic energy, location and flight direction. Furthermore, there is the option to generate several particles at once or to assign a polarization direction to the particle [24].

After handing the classes to the RunManager, the initialization of the kernel takes place. It starts by calling the Initialize method of the RunManager. During initialization the application is in the initialization state, and it changes to the idle state after successful execution [24]. From this state a simulation run is started by calling the BeamOn method of the RunManager class, which expects the number of events to be simulated as its argument.

The simulation is divided into different, hierarchically structured units, which represent smaller and smaller building blocks of the simulation. The largest simulation unit is a run. A run consists of several events and is started by calling the BeamOn method [24, 25]. An event consists of the decay or interaction of the primary particle or particles [25]. At the beginning of the simulation the event contains information about the primary particle and the primary vertex. These are processed during the simulation, and afterwards the event contains information about the trajectories of the particles through the detector as well as about hits registered in the detector [23]. The next smaller simulation unit is a track. A track represents a particle moving through the detector [24] and consists of several steps. A track contains static information about the transported particle, such as its charge or mass, as well as dynamic properties that change during the simulation, such as momentum, kinetic energy and the location of the particle. The track exists until the particle comes to rest or decays [24]. A step contains information about its start and end point [24]. The length of a step is limited by the distance to the next volume, by the energy loss through continuous processes, or by the G4UserLimits class [23].

The kernel can be initialized with five additional classes, which allow the user to intervene in the simulation at the transitions between the simulation units. There is a class for manipulating each simulation unit, as well as a class with which the priority of tracking a particular track can be changed: the G4UserStackingAction class. The two classes G4UserRunAction and G4UserEventAction can be used to intervene in the simulation at the beginning and end of a run or event; these classes are usually used for the analysis of a run or event [24]. The G4UserTrackingAction class is used to intervene at the beginning and end of a track, and the G4UserSteppingAction class handles the sequence of a step.
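The control flow described above can be summarized in a schematic mock. The real Geant4 API is C++; the Python class below is a hypothetical stand-in that only mirrors the state machine and the registration of the three mandatory user classes, not the actual toolkit interface:

    class MockRunManager:
        """Hypothetical stand-in for G4RunManager, mimicking the states
        described in the text: preInit -> init -> idle -> running -> idle."""

        MANDATORY = {"detector_construction", "physics_list", "generator_action"}

        def __init__(self):
            self.state = "preInit"
            self.user_classes = {}

        def set_user_initialization(self, role, obj):
            self.user_classes[role] = obj          # handed over in preInit state

        def initialize(self):
            assert self.MANDATORY <= self.user_classes.keys(), \
                "the three mandatory classes must be registered first"
            self.state = "init"
            # ... kernel initialization would happen here ...
            self.state = "idle"

        def beam_on(self, n_events):
            assert self.state == "idle"
            self.state = "running"
            for _ in range(n_events):
                pass  # one event: primary vertices, tracks, steps, hits ...
            self.state = "idle"

    rm = MockRunManager()
    rm.set_user_initialization("detector_construction", object())  # geometry
    rm.set_user_initialization("physics_list", object())           # processes
    rm.set_user_initialization("generator_action", object())       # primaries
    rm.initialize()
    rm.beam_on(100)   # a run of 100 events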


smaller and smaller building blocks of the simulation. The largest simulation unit isa run. A run consists of several events and is started by calling the BeamOn method[24, 25]. An event consists of the decay or interaction of the primary particle or parti-cles [25]. At the beginning of the simulation the event contains information about theprimary particle and the primary vertex. These are converted during the simulation,and after the simulation, the event contains information about the trajectories of theparticles by the detector as well as about hits registered in the detector [23]. The nextsmaller simulation unit is a track. A track represents a particle moving through thedetector [24]. It consists of several steps. A track contains static information aboutthe transported particle, such as the charge or mass of the particle, as well as dynamicproperties that change during simulation. Dynamic properties include momentum,kinetic energy, and the location of the particle. The trace of the particle exists until theparticle comes to rest or decays [24]. A step contains information about the beginningand the end point [24]. The length of a step is limited by the distance to the next vol-ume, the energy loss by continuous processes or by its limitation in the G4UserLimitclass [23]. The kernel can be initialized with five additional classes, which enablethe detector to interfere with the tracking of the particle at the transition between thesimulation units. There is a class for manipulating each simulation unit, as well asa class with which the priority of tracking a particular track by the detector can bechanged. This is the G4UserStackingAction class. The two classes G4UserRunActionand G4UserEventAction can be used to intervene in the simulation at the beginningand end of a run or event. These classes are usually used for the analysis of a runor event [24]. The class G4UserTrackingAction is used to manipulate the trackingof the particle at the beginning and end of a track. The G4UserSteppingAction classhandles the sequence of a step.


4.2 Integration of Physical Interactions

The integration of physical interactions into a simulation is done via the physics list. The physics list determines which particles occur in the simulation and which interactions they undergo. It can be defined completely by the user; in addition, it is possible to use and extend a predefined reference physics list [25]. The definition of the physics list corresponds to the assignment of all processes that a particle can undergo [23]. Physical interactions are represented by the Geant4 class G4VProcess [23]; the term process stands for a physical interaction during the simulation. The interfaces of all processes are identical, which enables a uniform handling of all processes by the tracking. This abstraction makes it simple to add new processes to a simulation or to extend existing ones in order to improve the accuracy of the simulation [24]. Processes are divided into seven categories: electromagnetic, hadronic, optical, decay, photolepton-hadron, transportation and parametrization processes. A further subdivision of the processes takes place according to the type of interaction: a distinction is made between processes for particles at rest, processes that take place along the entire step, and processes that occur locally at the end of a step [23].


4.3 Tracking

The abstraction of physical interactions into processes with identical interfaces makes it possible to describe the transport of any particle through the detector with a single algorithm. The tracking of a particle through the detector can thus be carried out regardless of the particle type and the physical interactions involved. In Geant4, the transport of the particles through the detector takes place step by step [23]. At the beginning of a simulation step, each process from the list of the tracked particle proposes a step length via its GetPhysicalInteractionLength method [23]. If the particle is at rest, only decay processes are considered. Several mechanisms additionally limit the step length of a particle in order to improve the accuracy of the simulation. On the one hand, processes that describe a continuous energy loss propose a step length. Furthermore, the shortest distance from the present location to the next volume boundary limits the step length; this ensures that the particle does not cross into a different volume during the step [25]. The smallest proposed step length determines the process that is performed. Processes associated with a loss of energy or a change of direction of the particle can force their execution and take place even if their proposed step length is not the shortest [23].
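The step-length competition described above fits in a few lines. The sketch below is a schematic illustration with hypothetical names, not Geant4 code, and forced processes are omitted:

    def choose_step(process_proposals, distance_to_boundary):
        """Pick the process for the next step: each process proposes a step
        length, the geometry limits it to the distance of the next volume
        boundary, and the smallest value wins."""
        candidates = dict(process_proposals)
        candidates["transportation"] = distance_to_boundary
        winner = min(candidates, key=candidates.get)
        return winner, candidates[winner]

    # Hypothetical proposals in mm; here the boundary is closest, so the
    # particle is simply transported to the volume boundary:
    print(choose_step({"ionisation": 2.4, "bremsstrahlung": 7.1}, 1.3))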


4.4 Geometry

The requirements for the definition of geometry are manifold. They range from basic analyses of calorimeters up to complex detector assemblies at large-scale experiments such as the Large Hadron Collider [26]. The definition of the geometric objects of a detector is done in Geant4 in three stages. The first stage is the definition of a solid, which is defined by its shape and dimensions. The construction of the solid is done by selecting the appropriate shape from the available Constructed Solid Geometries (CSG) [26]. The second stage of the geometry definition adds physical properties to the already defined solids. The resulting object is called a logical volume and is represented by the class G4LogicalVolume. The logical volume contains the physical properties of the material it is made of; the definition of electromagnetic fields and user-defined limits also belongs to a logical volume [24]. The third and last stage of the definition of a detector is the positioning of the logical volumes in space. A placed volume is called a physical volume. In order to describe the detector completely, volumes are nested into each other. The world volume is the largest volume in the definition of a detector and contains all other volumes that describe the detector. The placed volumes are called daughter volumes and are surrounded by their mother volume. The position of a daughter volume is given relative to the center of its mother volume [24].


4.5 Materials

The structure of materials in Geant4 replicates the structure of materials in nature. Materials are composed of molecules or elements, which in turn consist of isotopes [23]. The defining properties of an isotope are its name, nuclear charge number, nucleon number and molar mass. An element has the properties name, nuclear charge, effective nucleon number, effective molar mass and cross section per atom [24]. An element is accessed via its symbol in the periodic table. An element is defined either by the composition of its isotopes or directly by defining the effective quantities. The effective cross section per atom is calculated from the nuclear charge, the nucleon number and the molar mass [24]. The definition of a material proceeds analogously to the definition of an element from isotopes: either a new element with the effective values is generated, or different elements are combined into one material. A material is defined by properties such as density, state of aggregation, temperature and pressure. Geant4 calculates the mean free path length, the radiation length, the hadronic interaction length and the mean energy loss per path length, which is given by the Bethe-Bloch equation [24]. The values of the physical quantities must be defined in the program code. Furthermore, there is the possibility to define materials from the internal database, which simplifies the definition since all physical quantities of a material (isotopes, elements or materials) are provided.


Chapter 5: Experimental Setup

5.1 The Large Hadron Collider

The Large Hadron Collider (LHC) [1] is the most powerful particle accelerator in the world in terms of centre-of-mass energy and collision rate. It is located at the European Organization for Nuclear Research (Conseil européen pour la recherche nucléaire, CERN) near Geneva in Switzerland. The storage ring was built in the tunnel of the former Large Electron Positron Collider (LEP). The tunnel has a circumference of 26.7 km and lies between 45 m and 175 m underground. The objectives of the LHC are the investigation of physics beyond the Standard Model as well as precision measurements of the electroweak, Higgs and strong sectors of the Standard Model. One of the most important achievements of the LHC was the discovery of the Higgs boson in 2012 [27, 28]. For this purpose it was designed for a centre-of-mass energy of √s = 14 TeV and a luminosity of L = 10³⁴ cm⁻² s⁻¹.

Figure 5.1: The four main experiments (ALICE, ATLAS, CMS and LHCb) at the LHC [29].

Luminosity L and cross section σ determine the number of particle reactions per unit time; they are related via

\[ \frac{dN}{dt} = L\sigma. \quad (5.1) \]

Here dN/dt is the number of reactions per unit time. The luminosity is used especially for the characterization of accelerators and provides information about the expected particle rate. For a collision experiment it can be calculated as

\[ L = \frac{f N_a N_b}{4\pi \sigma_x \sigma_y}. \quad (5.2) \]

Here it is assumed that the particle bunches have a Gaussian density profile with widths σ_x,y perpendicular to their flight directions. N_a and N_b are the numbers of particles in the two colliding bunches, which repeatedly collide at the frequency f in the experiment. In the storage ring, protons are accelerated in two adjacent vacuum tubes and brought to collision at the centres of four experiments. Figure 5.1 shows the LHC with its four experiments: ALICE (A Large Ion Collider Experiment) [30], ATLAS (A Toroidal LHC ApparatuS) [31], CMS (Compact Muon Solenoid) [32] and LHCb (LHC beauty) [33].
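Equation (5.2) can be checked against the design luminosity quoted above. The numbers below (2808 bunches, a revolution frequency of 11245 Hz, 1.15e11 protons per bunch and a transverse beam size of 16.7 µm at the interaction point) are illustrative design-like values, not measured ones:

    import math

    def luminosity(f, n_a, n_b, sigma_x, sigma_y):
        """Instantaneous luminosity of eq. (5.2) for Gaussian bunches;
        the result is in cm^-2 s^-1 if sigma_x and sigma_y are in cm."""
        return f * n_a * n_b / (4.0 * math.pi * sigma_x * sigma_y)

    f_coll = 2808 * 11245   # effective bunch-crossing frequency in Hz
    print(luminosity(f_coll, 1.15e11, 1.15e11, 16.7e-4, 16.7e-4))
    # about 1.2e34 cm^-2 s^-1, of the order of the design luminosity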


5.2 The CMS Experiment

The Compact Muon Solenoid detector was specially developed to characterize proton-proton collisions at a center-of-mass energy of 14 TeV. The CMS detector is arranged cylindrically around the beam axis with a diameter of 15 m and a length of 21.6 m. The basic layout, including the subcomponents of the CMS detector, is shown in Figure 5.2 in the transverse plane. From the inside out, the detector consists of a tracking detector, an electromagnetic calorimeter (ECAL), a hadronic calorimeter (HCAL) and a muon system. Inside the muon system there is a superconducting solenoid magnet with a diameter of about 6 m and a field strength of up to 3.8 T, which encloses the calorimeters and the tracking detectors.

Figure 5.2: Illustration of a transverse slice of the CMS detector, including the characteristic interactions of different particle types [34].

5.2.1 Coordinate System and Conventions

For a precise description of the functionality and construction of the subcomponents, the coordinate system used in the CMS experiment is introduced below, together with further physical conventions.

The CMS experiment uses a right-handed Cartesian coordinate system whose origin is at the collision point of the two proton beams. The z-axis points along the beam direction, the y-axis points upwards, and the x-axis points towards the center of the accelerator ring. In addition to the Cartesian coordinate system, polar coordinates are used for a simpler representation: the azimuthal angle φ denotes the angle in the x-y plane, and the polar angle θ denotes the angle relative to the z-axis. Combining both coordinate systems, the momentum in the transverse plane of the detector, p_T, is defined as

\[ p_T = \sqrt{p_x^2 + p_y^2} = p \cdot \sin\theta. \quad (5.3) \]

The transverse momentum is invariant under Lorentz transformations along the z-axis. This motivates the pseudorapidity η, since differences in η are also invariant under such transformations:

\[ \eta = -\ln\left(\tan(\theta/2)\right). \quad (5.4) \]

Assuming a mass that is negligible compared to the energy of the physical object under consideration, the pseudorapidity coincides with the rapidity y,

\[ y = \frac{1}{2} \ln\left(\frac{E + p_z}{E - p_z}\right). \quad (5.5) \]

Based on the pseudorapidity η and the azimuthal angle φ, a spatial angular distance ∆R can be formulated, which again is invariant under Lorentz transformations along the z-axis:

\[ \Delta R = \sqrt{(\Delta\eta)^2 + (\Delta\phi)^2}. \quad (5.6) \]

In combination with the energy E, the quantities φ, η and p_T describe all components of the four-vector p^µ of a particle. The invariant mass of the corresponding particle is calculated from the four-vector as

\[ m^2 = p_\mu p^\mu. \quad (5.7) \]
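Equations (5.3) to (5.7) translate directly into code. A minimal sketch in natural units (c = 1); note that the φ difference is wrapped into (-π, π], which equation (5.6) leaves implicit:

    import math

    def pseudorapidity(theta):
        """eta = -ln(tan(theta/2)), eq. (5.4)."""
        return -math.log(math.tan(theta / 2.0))

    def delta_r(eta1, phi1, eta2, phi2):
        """Angular distance of eq. (5.6), with the phi difference wrapped."""
        dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
        return math.hypot(eta1 - eta2, dphi)

    def invariant_mass(E, px, py, pz):
        """m = sqrt(p_mu p^mu), eq. (5.7)."""
        return math.sqrt(max(E**2 - px**2 - py**2 - pz**2, 0.0))

    # theta = 90 degrees corresponds to eta = 0; smaller angles give larger eta:
    print(pseudorapidity(math.pi / 2.0), pseudorapidity(math.radians(10.0)))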

5.2.2 Tracking Systems

The inner tracking detector is dedicated to the identification of charged particles and the reconstruction of the associated trajectories.


It consists of 1440 pixel and 15148 silicon strip detector modules and covers a solid angle range of up to |η| = 2.5. The individual pixel detectors have dimensions of 150 µm × 100 µm, while the strip detectors measure 80 µm × 10 cm and 180 µm × 25 cm, respectively. This enables a spatial resolution of 10 µm for the pixel detectors and 23 µm for the strip detectors in the x-y plane, and of 20 µm and 230 µm, respectively, along the beam axis.

5.2.3 Electromagnetic Calorimeter

The electromagnetic calorimeter (ECAL) consists of 75848 homogeneous PbWO4 crystals and has a granularity of 0.0174 in η × 0.0174 in φ, providing a homogeneous resolution. The ECAL covers a solid angle range of up to |η| = 3. There is, however, no instrumentation in the range 1.479 < |η| < 1.653, which makes this region unsuitable for the reconstruction of electrons and photons. If the trajectory of an electron or photon passes through the ECAL, the particle deposits its energy in the form of photons from bremsstrahlung and the production of e± pairs. The resulting scintillation light is measured by photodiodes, with a relative energy resolution σ/E (E in GeV) of

\[ \frac{\sigma}{E} = \frac{2.8\%}{\sqrt{E/\mathrm{GeV}}} \otimes \frac{0.12}{E/\mathrm{GeV}} \otimes 0.3\%. \quad (5.8) \]

5.2.4 Hadron Calorimeter

In contrast to the ECAL, the hadronic calorimeter (HCAL) primarily detects hadrons, which interact with the detector material via the strong interaction in inelastic reactions. The deposited energy is measured by scintillators. Due to the hadronic interaction length, the HCAL is more extended than the ECAL and further away from the beam axis. It is divided into a barrel region (HB), an outer barrel region (HO), an endcap region (HE) and a forward region (HF). HB, HO and HE have a granularity of 0.087 in η × 0.087 in φ, whereas the HF has a coarser granularity of 0.175 in η × 0.175 in φ.

Compared to the ECAL, the HCAL has a significantly inferior energy resolution (E in GeV):

\[ \frac{\sigma}{E} = \frac{115.3\%}{\sqrt{E/\mathrm{GeV}}} \otimes 5.5\%. \quad (5.9) \]


5.2.5 Solenoid

The CMS detector has a superconducting solenoid magnet consisting of a cylindrical coil with a diameter of 6 m and a length of 12.5 m. The magnet is designed to generate a magnetic field of up to 4 T inside the coil. The tracks of charged particles are strongly curved in the transverse plane, enabling the detector to measure their momenta.

5.2.6 The Muon System

Most of the observed muons originate from the decay of heavier particles and therefore indicate interesting physical processes. At 105.7 MeV they have a comparably low mass and hardly interact with the calorimeters. Therefore, nearly unscattered, the muons pass through the inner detector components into the muon system, which forms the outermost detector layer. Most other particles decay or are absorbed beforehand, so that almost every particle observed in this detector system is a muon. The muon system serves both the identification and the momentum measurement of muons and consists of several subsystems. It functions in interaction with the magnet: the strong magnetic field bends the trajectories of the muons in the transverse plane. The momentum of the muons is one of the best measured quantities of the entire CMS detector, since the particle is measured both in the inner tracking detector and in the muon chambers. The blue curve in Figure 5.2 shows a possible trajectory of a muon, which is first bent in a 4 T magnetic field in the inner tracking detector and then deflected in the opposite direction in a 2 T magnetic field. The muon system detects muons in the range |η| < 2.4. In addition, once the transverse momenta of all directly detectable particles have been determined in this last detector system, neutrinos can be detected indirectly via the missing transverse momentum, exploiting the conservation of the total transverse momentum.

5.2.7 The Trigger System

The proton bunches collide at the LHC at a rate of about 40 MHz, with around 25 proton pairs interacting simultaneously. Since the resulting amounts of data are too large for currently available storage systems, a pre-selection is made. This process is carried out by the trigger system. It should be noted that it is basically not necessary to evaluate all events, because many of them are so-called soft events. These events feature a small transverse momentum transfer and have been investigated in other experiments in the past; for them, it is sufficient to only record every Nth event. A trigger system can select events by identifying particle signatures. As an example, the information that muons have been identified in the muon system can be used as a trigger criterion. CMS uses a two-stage trigger system: first the fast Level-1 trigger, implemented in programmable hardware processors with an output rate of up to O(100 kHz), and then the high-level trigger. The Level-1 trigger compares the recorded data with the desired detection characteristics and forwards the data to the high-level trigger if the characteristics are successfully recognized. The high-level trigger performs an almost complete reconstruction with the information from all detector components; the reconstruction algorithm is similar to the algorithm used for the later data analysis. Only events that meet the requirements of this selection level are written to storage media for later analysis. Overall, the rate at which CMS triggers is between 200 Hz and 1 kHz.


5.3 Particle Flow

The particle flow (PF) reconstruction algorithm is used in the CMS experiment. The identification and reconstruction of individual particles from the proton-proton collisions at the LHC is achieved by combining the information from the different detector systems. By combining the energy depositions in the calorimeters with the data measured by the tracking detector and the muon system, good resolutions for the measured particle four-vectors are achieved. The combination of the information is carried out with the objective of an optimal determination of the direction and energy of the particles. Due to the different interactions in the detector, the observed particle types can be determined with high probability. The CMS detector is ideally suited for the use of this algorithm as it has a precise tracker. As shown in Figure 5.2, the muons traverse all detector components and leave signals in the inner tracking detector and in the muon chambers. Photons deposit most of their energy in the ECAL, whereas the charged leptons leave additional tracks in the tracking detector. The momenta of charged hadrons are recorded at all positions up to the HCAL, while neutral hadrons can only be measured in the HCAL (and partially in the ECAL).

In the first step, the PF algorithm reconstructs the detected muons and electrons and subtracts them from the measured signals for further processing, in order to separate them from the possible candidates for charged hadrons. The algorithm then matches the remaining tracks with the energy depositions from the calorimeters. If the measured energy in the calorimeter is compatible with the associated reconstructed momenta of the tracks, the associated signals are used to determine the four-momentum of the hadron. However, if the energy deposited in the ECAL or HCAL is significantly higher than the corresponding values of the track, an additional overlapping photon in the ECAL or a neutral hadron in the HCAL is reconstructed along the track.


5.4 Jet Clustering

A jet algorithm defines the rule for clustering individual particles into jets. Jet algorithms usually have a resolution parameter that determines how far apart two particles may be and still be assigned to the same jet.

A large group of clustering algorithms can be defined by the generalized distance metric

\[ d_{ij} = \min\left(p_{T,i}^{2k},\, p_{T,j}^{2k}\right) \cdot \frac{\Delta R_{ij}^2}{R^2}. \quad (5.10) \]

Here ∆R_ij describes the distance between particles i and j in η-φ space via

\[ \Delta R_{ij}^2 = (\eta_i - \eta_j)^2 + (\phi_i - \phi_j)^2, \quad (5.11) \]

and R specifies the maximum radius if the shape of the jet is assumed to be a cone in η-φ space. The parameter k determines the behavior of the algorithm: for k = 1, equation (5.10) describes the so-called k_t algorithm; for k = -1, the anti-k_t algorithm; and for k = 0, the Cambridge/Aachen algorithm.

These algorithms fulfill two essential properties: collinear and infrared safety. A jet algorithm is referred to as infrared safe if it is stable against additional soft radiation within the jet. If the jet changes neither its direction nor its reconstructed energy when one of its particles is split collinearly, the algorithm is collinear safe. CMS usually uses the anti-k_t algorithm.
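The distance of equation (5.10) and a minimal sequential-recombination loop can be sketched as follows. The particle-beam distance d_iB = p_T^(2k), which decides when a (pseudo-)particle is promoted to a final jet, is part of the standard algorithm even though it does not appear in eq. (5.10); the p_T-weighted η-φ recombination used here is a simplification of the usual four-momentum sum, and production code would rely on a dedicated, optimized library:

    import math

    def d_ij(p_i, p_j, k, R):
        """Pairwise distance of eq. (5.10) for p = (pt, eta, phi);
        k = 1: kt, k = -1: anti-kt, k = 0: Cambridge/Aachen."""
        dphi = (p_i[2] - p_j[2] + math.pi) % (2.0 * math.pi) - math.pi
        dr2 = (p_i[1] - p_j[1])**2 + dphi**2
        return min(p_i[0]**(2 * k), p_j[0]**(2 * k)) * dr2 / R**2

    def cluster(particles, k=-1, R=0.4):
        """Minimal sequential recombination: repeatedly take the smallest
        distance and either merge a pair or promote a particle to a jet."""
        parts, jets = list(particles), []
        while parts:
            # particle-beam distances d_iB = pt^(2k)
            d_min, action = min(
                (p[0]**(2 * k), ("beam", i, i)) for i, p in enumerate(parts))
            for i in range(len(parts)):
                for j in range(i + 1, len(parts)):
                    d = d_ij(parts[i], parts[j], k, R)
                    if d < d_min:
                        d_min, action = d, ("pair", i, j)
            kind, i, j = action
            if kind == "beam":
                jets.append(parts.pop(i))
            else:
                a, b = parts[i], parts[j]
                parts.pop(j)                 # j > i, so remove j first
                parts.pop(i)
                pt = a[0] + b[0]             # simplified recombination
                parts.append((pt,
                              (a[0] * a[1] + b[0] * b[1]) / pt,
                              (a[0] * a[2] + b[0] * b[2]) / pt))
        return jets

    # Two nearby particles and one distant particle yield two anti-kt jets:
    print(cluster([(50.0, 0.0, 0.0), (20.0, 0.1, 0.1), (30.0, 2.0, 1.5)]))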

5.5 Jet Energy Corrections

Due to detector inaccuracies, the energy of the reconstructed jets does not match the true energy of the jets. The true energy is defined, in a perturbative sense, as the energy of the original parton; in practice it is given by the energy of a jet found on particle level in the event generator. Therefore, it is necessary to align the energies of the reconstructed jets with the true jet energies using jet energy corrections. To assign the corrected energies to the reconstructed jets, the differences between the reconstructed and true jet energies are determined. In this way, detector-specific effects, such as interactions in the material, are reduced. The jet corrections in CMS follow a fixed procedure.

The Level 1 (L1) correction reduces the energy shift caused by pile-up. The term pile-up describes the effects of additional proton-proton interactions, whose additional energy depositions in the detector lead to a measured energy different from the energy of the jets from the process of interest. These corrections are determined by comparing identical events from Monte Carlo simulations with and without pile-up events. The resulting correction factor depends on the transverse momentum of the jet p_T, the pseudorapidity of the jet η, the jet area A and the mean transverse momentum density ρ, which are calculated using the k_t algorithm with R = 0.6.

The L2L3 correction adjusts the energy of the reconstructed jets so that it corresponds on average to the energy of the generated jets. This is achieved by forming the ratio of the reconstructed transverse momentum p_T^reco to the generated transverse momentum p_T^gen, referred to as the detector response:

\[ R = \frac{p_T^{\mathrm{reco}}}{p_T^{\mathrm{gen}}}. \quad (5.12) \]

Momenta p_T^reco and p_T^gen that belong together are combined into responses in narrow bins of the generated transverse momentum p_T^gen or of the pseudorapidity of the generated jets η^gen. In order to apply the correction factor to the data, the inverse of the mean response is expressed as a function of p_T^reco. The L1 and L2L3 corrections are applied both to the data and to the simulated events. The L2L3res corrections are subsequently applied to the data to handle residual differences between data and simulation. The correction factor on the jet energy scale is determined from events with a jet and a photon or a Z boson. The transverse momenta of the Z bosons, p_T^Z, and of the photons, p_T^γ, are measured in the well-understood detector range |η| < 1.3 and have much lower uncertainties than the transverse momenta of the jets, p_T^jet. Thus the momentum of the jet can be balanced against the momentum of the photon or the Z boson. In this case the response is

\[ R_{\mathrm{Balance}} = \frac{p_T^{\mathrm{jet}}}{p_T^{\gamma,Z}}. \quad (5.13) \]


Chapter 6

Deep Learning

In many areas of machine learning, the individual features had to be designed by hand. Therefore, expertise in the domain was necessary and the procedure was individual for each case of application. Deep learning, on the other hand, is a type of representation learning where the raw data is presented to the machine, which automatically discovers the representation needed. In the case of deep learning with neural networks this representation is obtained by composing non-linear layers which transform the representation into increasingly higher levels of abstraction. With this composition, very manifold functions can be learned, which map the raw input data to the desired solution. The core aspect of deep learning is that the features of the layers are not designed by humans but are learned by the machine. Deep learning has led to major breakthroughs in a wide variety of fields. The most prominent examples are the recognition of images [35, 36, 37, 38] or speech-to-text synthesis [39, 40, 41]. However, impressive success has also been achieved in completely different areas such as generating human faces [42] or predicting new drugs [43].

6.1 Multilayer Perceptron

In general, machine learning constructs a predictor F of an output Y given an input X. This machine resembles an input-output mapping

F : X \mapsto Y. \qquad (6.1)

There are many ways to construct such a predictor. In deep learning this multivariate function, here denoted as the deep predictor Ŷ(X), is constructed by blocks of hidden layers. Let σ^{[1]}, ..., σ^{[L]} be vectors of univariate non-linear layer-wise activation functions. A semi-affine activation rule for each layer l is given by

\sigma^{[l]}_{W,b}(z) := \sigma^{[l]}\left(W^{[l]} z + b^{[l]}\right). \qquad (6.2)


Here W^{[l]} and b^{[l]} are the weight matrix and the bias or threshold of the l-th layer. This defines a deep predictor as a composite map

\hat{Y}(X) := \left(\sigma^{[L]}_{W,b} \circ \dots \circ \sigma^{[1]}_{W,b}\right)(X). \qquad (6.3)

In summary, with a deep predictor a high-dimensional mapping F is modeled via the composition of non-linear univariate semi-affine functions. This is analogous to a classical basis decomposition.

The deep predictor can also be defined as a computation graph, where the i-th node in the l-th layer is given by

a^{[0]} := X, \qquad (6.4)

z^{[l]}_i := \sum_{j=1}^{N^{[l]}} W^{[l]}_{ij} a^{[l-1]}_j + b^{[l]}_i, \qquad (6.5)

a^{[l]}_i := \left(\sigma^{[l]}_{W,b}(a^{[l-1]})\right)_i = \sigma^{[l]}_i(z^{[l]}_i), \qquad (6.6)

\hat{Y}(X) := a^{[L]}. \qquad (6.7)
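A minimal numerical sketch of this forward pass, assuming ReLU hidden activations and freely chosen layer sizes, could look as follows:

    import numpy as np

    def forward(X, Ws, bs, sigmas):
        # eqs. (6.4)-(6.7): start from the input, then
        # alternate affine map and activation layer by layer
        a = X                          # a^[0] := X, eq. (6.4)
        for W, b, sigma in zip(Ws, bs, sigmas):
            z = W @ a + b              # eq. (6.5)
            a = sigma(z)               # eq. (6.6)
        return a                       # Y_hat(X) := a^[L], eq. (6.7)

    relu = lambda z: np.maximum(z, 0.0)
    rng = np.random.default_rng(0)
    Ws = [rng.normal(size=(2, 3)), rng.normal(size=(1, 2))]  # a 3-2-1 network
    bs = [np.zeros(2), np.zeros(1)]
    print(forward(np.array([1.0, 2.0, 3.0]), Ws, bs, [relu, lambda z: z]))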

This method to make machines learn was first developed by Frank Rosenblatt [44]. He built his work on the model for neurons proposed by Warren McCulloch and Walter Pitts [45], who showed that a neuron can be modeled as the summation of binary inputs and outputs, a one or zero depending on an internal threshold. Rosenblatt's perceptron contained one input layer, one hidden layer and one output layer. He contributed to the idea of McCulloch and Pitts by describing a learning mechanism for the computational neuron. This algorithm starts with randomly initialized weights and a training set. The output of the perceptron for the training set is computed. If the output is below the label, the weights are increased; if the output is above the label, the weights are decreased. This is iterated until output and labels are equal. The abstraction from the model of McCulloch and Pitts gives the predictor the name neural net.

The limitation of this approach was shown by Marvin Minsky and Seymour Papert [46]. They proved that it is impossible for a perceptron to learn the XOR function, since it is not linearly separable. The learning algorithm proposed by Rosenblatt was not extendable to multiple hidden layers, a multilayer perceptron, which are necessary for learning non-linearly separable functions. It was even proven that a multilayer perceptron is a universal approximator, which means that it is able to approximate any continuous function between finite-dimensional spaces [47]. To compensate for the inability to train the multilayer perceptron, the backpropagation algorithm was developed.


6.2 Backpropagation

Let the function L be a metric

\mathcal{L} : \left(\hat{Y}(X), Y\right) \mapsto [0, \infty), \qquad (6.8)

which returns the distance between the output of the predictor and the labels. This objective function is referred to as the loss function in optimization theory, because a loss is associated with the event X, which should be minimized. The loss function can be seen as a landscape in a hyper-dimensional space spanned by the parameters of the predictor. To optimize the neural net, the minimum of the loss function has to be found.

If p is the set of parameters of the neural net, then the first-order Taylor series expansion of the loss function is given by

\mathcal{L}(p + \Delta p) \approx \mathcal{L}(p) + \frac{\partial \mathcal{L}(p)}{\partial p} \Delta p. \qquad (6.9)

To minimize L the first order term has to be as negative as possible.

\left|\frac{\partial \mathcal{L}(p)}{\partial p} \Delta p\right| \le \left|\frac{\partial \mathcal{L}(p)}{\partial p}\right| |\Delta p| \qquad \text{(Cauchy-Schwarz)} \qquad (6.10)

\Rightarrow \Delta p = -\eta \frac{\partial \mathcal{L}(p)}{\partial p} \qquad \text{(most negative for } \Delta p \text{ antiparallel to the gradient)} \qquad (6.11)

\Leftrightarrow p \to p - \eta \frac{\partial \mathcal{L}(p)}{\partial p} \qquad (0 \le \eta \ll 1). \qquad (6.12)

Here η is known as the learning rate, a hyper-parameter whose value is not a priori determinable. The parameters p are updated until a minimization criterion is reached. The presented minimization technique is known as the steepest descent or gradient descent method [48].
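As a one-dimensional illustration of the update rule (6.12), consider the toy loss L(p) = (p − 3)^2, whose gradient can be written down by hand:

    eta, p = 0.1, 0.0            # learning rate and initial parameter
    for _ in range(100):
        grad = 2 * (p - 3)       # dL/dp for L(p) = (p - 3)^2
        p -= eta * grad          # update rule of eq. (6.12)
    print(p)                     # approaches the minimum at p = 3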

For computing this gradient, the error in the j-th neuron at layer l is introduced,

\delta^{[l]}_j := \frac{\partial \mathcal{L}}{\partial z^{[l]}_j}. \qquad (6.13)

It is then straightforward to compute the derivatives of the loss function with respect to the parameters,


\frac{\partial \mathcal{L}}{\partial W^{[l]}_{jk}} = \delta^{[l]}_j a^{[l-1]}_k, \qquad (6.14)

\frac{\partial \mathcal{L}}{\partial b^{[l]}_j} = \delta^{[l]}_j, \qquad (6.15)

\delta^{[l]}_j = \frac{\partial \mathcal{L}}{\partial z^{[l]}_j} = \sum_{k=1}^{N^{[l+1]}} \frac{\partial \mathcal{L}}{\partial z^{[l+1]}_k} \frac{\partial z^{[l+1]}_k}{\partial z^{[l]}_j} = \sum_{k=1}^{N^{[l+1]}} \delta^{[l+1]}_k \frac{\partial z^{[l+1]}_k}{\partial z^{[l]}_j}. \qquad (6.16)

With (6.5) the connection between z^{[l+1]}_k and z^{[l]}_j is

z^{[l+1]}_k = \sum_{s=1}^{N^{[l]}} W^{[l+1]}_{ks} \sigma(z^{[l]}_s) + b^{[l+1]}_k, \qquad (6.17)

\Rightarrow \frac{\partial z^{[l+1]}_k}{\partial z^{[l]}_j} = W^{[l+1]}_{kj} \, \sigma^{[l]\prime}(z^{[l]}_j). \qquad (6.18)

In (6.16) this gives

\delta^{[l]}_j = \sum_{k=1}^{N^{[l+1]}} \delta^{[l+1]}_k W^{[l+1]}_{kj} \, \sigma'(z^{[l]}_j). \qquad (6.19)

To conclude this discussion: as defined in (6.6), for computing a^{[l]} the value of a^{[l-1]} is needed, so the whole computation of the predictor can be done in a forward pass through the network. In contrast, to compute the gradients for layer l, the gradients of layer l + 1 are needed, so the computation of the gradients is a backward pass through the network. This algorithm for computing the gradients is known as backpropagation.
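The following sketch combines the forward and backward pass for a small fully connected network with ReLU hidden layers, a linear output and a squared-error loss; it is a direct transcription of eqs. (6.14)-(6.19), not the exact code used later in this work.

    import numpy as np

    def backprop(X, y, Ws, bs):
        # forward pass, caching pre-activations z^[l] and activations a^[l]
        a, zs, acts = X, [], [X]
        for l, (W, b) in enumerate(zip(Ws, bs)):
            z = W @ a + b
            a = z if l == len(Ws) - 1 else np.maximum(z, 0.0)  # linear output layer
            zs.append(z)
            acts.append(a)
        delta = 2 * (a - y)            # output error dL/dz^[L] for L = (y - a)^2
        grads_W, grads_b = [], []
        for l in reversed(range(len(Ws))):
            grads_W.insert(0, np.outer(delta, acts[l]))        # eq. (6.14)
            grads_b.insert(0, delta)                           # eq. (6.15)
            if l > 0:                                          # eq. (6.19), ReLU': z > 0
                delta = (Ws[l].T @ delta) * (zs[l - 1] > 0)
        return grads_W, grads_b

A gradient descent step then updates every W^{[l]} and b^{[l]} with the learning rate η, as in (6.12).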

Since the beginning of the 1960s, error minimization through gradient descent in systems related to deep learning was discussed [49, 50, 51, 52, 53, 54, 55, 56]. These algorithms were already efficient, as their derivative calculation was not more expensive than the forward computation of the system's evolution [57]. The first description of efficient error backpropagation in possibly arbitrary networks was presented by Seppo Linnainmaa [58, 59]. The first application of the backpropagation algorithm to neural networks, though, was performed by Werbos in 1981 [60].


6.3 Convolutional Neural Networks

In this section a special form of a neural network, the convolutional neural network (ConvNet), is described. The basic idea behind this architecture is that for data types like pictures, features next to each other are more important than features far away from each other.

A convolution is a mathematical operation on two functions

(x * w)(t) = \int x(a)\, w(t - a)\, da. \qquad (6.20)

So the convolution for a given t is the average of x weighted by w around t. The input space for a neural net consists of the nodes of the last layer, which are discrete by construction. The discretization of an integration is a summation,

(x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a). \qquad (6.21)

For picture-like data types the input is, moreover, multidimensional:

(K * I)(i, j) = \sum_{m} \sum_{n} I(i - m, j - n)\, K(m, n). \qquad (6.22)

The discrete convolution operation can be viewed as a matrix multiplication with asparse matrix.
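A direct, loop-based sketch of eq. (6.22), evaluated only where the kernel fully overlaps the image and with the index shift needed to keep array indices valid, reads:

    import numpy as np

    def conv2d(I, K):
        # discrete 2D convolution of eq. (6.22), "valid" region only
        m, n = K.shape
        H, W = I.shape
        out = np.zeros((H - m + 1, W - n + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = sum(I[i + m - 1 - u, j + n - 1 - v] * K[u, v]
                                for u in range(m) for v in range(n))
        return out

Deep learning libraries implement this operation far more efficiently, but the explicit loop makes the weight sharing visible: the same small kernel K is reused at every position (i, j).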

Traditional neural networks treat every input of the previous layer a priori the same, while a convolutional neural network has sparse connections, i.e. multiplications with a smaller kernel. This also leads to smaller memory requirements for additional layers, due to weight sharing. Another key feature is that parameter sharing introduces a property called equivariance: objects inside the data are processed in a translation-equivariant way, so a shift of the input leads to an equally shifted output.

In a typical application a convolutional filter is composed of three components. First, multiple convolutions are applied to the previous layer. Their output is then fed to an activation function, analogous to a traditional neural net. In the last stage the output is modified by a pooling layer. A pooling function returns a statistic of a local area; a typical variant is the maximum of a few adjacent data points. This modification is applied to reduce the dependence on small statistical fluctuations.


The idea of convolutional filters is grounded in the paper of Hubel and Wiesel published in 1959 [61]. They showed that the visual cortex of cats contains neurons that respond to small regions of the visual space. They proposed a cascading model between this type of cells and more complex cells for pattern recognition. The first convolutional neural net ever implemented was based on this work and introduced by Fukushima in 1980 [62]. Fukushima's Neocognitron implemented all fundamental ideas behind ConvNets. The first ConvNet trained by the backpropagation algorithm was the time delay neural net by Waibel et al. [63, 64], trained for speech recognition. The work by LeCun et al. also has to be mentioned: they demonstrated the application of a backpropagating ConvNet to the recognition of handwritten zip code digits [65].

6.4 Adversarial Networks

For some deep learning applications it is necessary that the results are independent of nuisance parameters. An example of this in particle physics would be a jet classification that is independent of the jet mass. In such a case, correlations are to be eliminated that could otherwise lead to systematic errors that are difficult to characterize and control. One possibility to make a network independent of nuisance parameters is adversarial training.

A distribution of data points p(X, Y, Z) is given, in which X are the data, Y are the labels and Z are nuisance parameters. The goal is that the result of the network Ŷ is invariant under variation of the nuisance parameters Z,

p(\hat{Y}(X)\,|\,z) = p(\hat{Y}(X)\,|\,z') \quad \forall z, z' \in Z. \qquad (6.23)

This is equivalent to demanding that Ŷ and Z are independent random variables.

In order to gain independence of Z, an adversary network is attached to the predictor network. The task of the second network Ẑ is to determine Z from the output of Ŷ. This is illustrated in figure 6.1.

To achieve this, the weights of the predictor Ŷ are locked during training and only the weights of the adversary Ẑ are trained, with the loss function

\mathcal{L}_{\hat{Z}}(\hat{Z}(\hat{Y}(X)), Z). \qquad (6.24)

In a further training step, the predictor is trained by freeing its weights, while the weights of the adversary Ẑ are fixed.


Figure 6.1: Visualization of the adversarial training setup. The input X is fed to the predictor Ŷ, whose output Ŷ(X) enters both the predictor loss L_Ŷ(Ŷ(X), Y) and the adversary Ẑ with its loss L_Ẑ(Ẑ(Ŷ(X)), Z).

The predictor is trained with the loss function

\mathcal{L} = \mathcal{L}_{\hat{Y}}(\hat{Y}(X), Y) - \lambda\, \mathcal{L}_{\hat{Z}}(\hat{Z}(\hat{Y}(X)), Z), \qquad (6.25)

where λ is a free hyper-parameter. The second term is a penalty term: it causes the predictor Ŷ to attempt to deteriorate the result of the adversary Ẑ. This eliminates the dependency of Ŷ on the distribution of Z.

The two training steps are iterated until a balance between the two nets is achieved.

This learning method was developed by G. Louppe et al. in the article Learning toPivot with Adversarial Networks [66].
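A minimal sketch of the alternating scheme in TensorFlow/Keras (the tools introduced in section 6.6) could look as follows; the layer sizes, optimizers and the regression form of both losses are assumptions of this sketch, not taken from [66]:

    import tensorflow as tf
    from tensorflow import keras

    predictor = keras.Sequential([keras.layers.Dense(32, activation="relu", input_shape=(8,)),
                                  keras.layers.Dense(1)])
    adversary = keras.Sequential([keras.layers.Dense(32, activation="relu", input_shape=(1,)),
                                  keras.layers.Dense(1)])
    opt_p, opt_a = keras.optimizers.Adam(), keras.optimizers.Adam()
    mse = keras.losses.MeanSquaredError()
    lam = 1.0   # the hyper-parameter lambda of eq. (6.25)

    def train_step(x, y, z):
        # step 1: train the adversary with the predictor frozen, eq. (6.24)
        with tf.GradientTape() as tape:
            loss_a = mse(z, adversary(predictor(x, training=False)))
        grads = tape.gradient(loss_a, adversary.trainable_variables)
        opt_a.apply_gradients(zip(grads, adversary.trainable_variables))
        # step 2: train the predictor with the adversary frozen, eq. (6.25)
        with tf.GradientTape() as tape:
            y_hat = predictor(x)
            loss_p = mse(y, y_hat) - lam * mse(z, adversary(y_hat, training=False))
        grads = tape.gradient(loss_p, predictor.trainable_variables)
        opt_p.apply_gradients(zip(grads, predictor.trainable_variables))

Since only the variables of one network enter each gradient call, the freezing of the respective other network is implicit in this formulation.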

6.5 Particle Flow Networks

A set is a collection of objects that remains invariant under permutation. Each function that maps from a set X = {x_1, ..., x_M} to a space Y must therefore be invariant under all permutations π of its arguments,

f(\{x_1, \dots, x_M\}) = f(\{x_{\pi(1)}, \dots, x_{\pi(M)}\}). \qquad (6.26)

Such a function is permutation invariant if it can be decomposed into two functions ρ and φ in the form

\rho\left(\sum_{x \in X} \phi(x)\right). \qquad (6.27)


Moreover, for a sufficiently large l, with continuous functions F : R^l → Y and Φ : X → R^l,

f(\{x_1, \dots, x_M\}) = F\left(\sum_{i=1}^{M} \Phi(x_i)\right) \qquad (6.28)

is an arbitrarily good approximation. A proof of this statement can be found in [67].

Applied to jet physics, a jet can be understood as a set of particles with arbitrary descriptive variables x. Φ is then a per-particle mapping into an l-dimensional hyperspace. Komiske et al. call this hyperspace the latent space. From this latent space, F maps to the target space R.

A Particle Flow Network [10] consists of a Φ network that maps from the particle four-momenta into the latent space. For each jet, the resulting latent space vectors are summed and mapped to the target space with a second network.

This partitioning of the net to satisfy the permutation invariance of sets was first discussed by Zaheer et al. [67]. The application of deep sets to particle jet physics and particle flow networks was first implemented by Komiske et al. [10].
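A minimal Keras sketch of this Φ/F decomposition (layer sizes chosen freely for illustration; the actual analysis in chapter 8 uses the EnergyFlow package) is:

    import tensorflow as tf
    from tensorflow import keras

    n_particles, n_features, latent_dim = 200, 4, 128

    inputs = keras.Input(shape=(n_particles, n_features))
    # Phi: the same per-particle network is applied to every particle
    phi = keras.layers.Dense(100, activation="relu")(inputs)
    phi = keras.layers.Dense(latent_dim, activation="relu")(phi)
    # permutation-invariant sum over the particle axis, as in eq. (6.28)
    summed = keras.layers.Lambda(lambda t: tf.reduce_sum(t, axis=1))(phi)
    # F: maps from the latent space to the target space
    f = keras.layers.Dense(100, activation="relu")(summed)
    output = keras.layers.Dense(1)(f)
    pfn = keras.Model(inputs, output)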

6.6 Deep Learning Tools

The networks in this work are implemented in Keras [68]. Keras is a high-level neural network API which handles underlying frameworks. Here, TensorFlow [69] is used as the framework. In TensorFlow, mathematical operations are represented in the form of a graph, which describes the sequence of all operations to be performed by TensorFlow.


Chapter 7

Calorimetry Energy Analysis

Figure 7.1: The graph displays the structure of the analysis performed in this chapter: calorimeter simulation, linear fit, FCN, data augmentation, ConvNet, adversarial training, likelihood loss, achievements.

In this chapter, the possibility to improve the resolution of the energy measurement of a calorimeter by using deep learning is presented. For this purpose, two different networks are trained with simulation data. In addition, existing distribution displacements are overcome.

Instead of hadrons, electrons are simulated here, since their energy response is expected to be linear and the basic properties of the DNN approach can be studied more easily. The reconstruction of pions and jets is the task of further studies.

7.1 Calorimeter Simulation

Geant4 [23] is used to simulate a calorimeter with a layer structure similar to the structure of the CMS hadron calorimeter. The calorimeter has a depth of 931.5 mm and a lateral dimension of 300 mm. The layer structure is listed in table 7.1. The first layer consists of a 9 mm thick scintillator layer and a 40 mm thick stainless steel layer. The steel layers in the CMS HCAL are the carriers holding the calorimeter. The first layer is followed by 8 layers, each consisting of a 3.7 mm thick scintillator and a 50.5 mm thick brass absorber. This is followed by 6 layers of 3.7 mm thick scintillators with 56.5 mm thick brass plates. The last two layers consist of a 3.7 mm scintillator, followed by a 75 mm steel holder, completed with a 9 mm scintillator. Each scintillator layer consists of 64 equal-sized scintillator tiles, each with a height and width of 75 mm.

layer   scint in mm   abs in mm   abs material
0       9             40          steel
1-8     3.7           50.5        brass
9-14    3.7           56.5        brass
15      3.7           75          steel
16      9             -           -

Table 7.1: The structure of the layers in the simulation.

In the simulation, the paths of incoming particles and their interaction products aresimulated. The incoming particles are electrons.

The momentum of the incoming particles is randomly initialized between 0 and 10 GeV following a flat distribution, which is shown in figure 7.2. The point of arrival of the particles is always the exact center of the first detector layer.

After the trajectories of the particles have been simulated, the number of traces in each scintillator cell is counted for each event. These 1088 values are then stored as data points for further analysis.

Figure 7.2: The energy distribution of the incoming particles.

In the following figures the complete spectrum is shown on the left, while on the righttwo samples from the distribution are shown. In this case only values in a range of2± 0.1 and 8± 0.1 GeV are shown.


Figure 7.3: Distribution of summed traces in the calorimeter cells; (a) complete spectrum, (b) spectrum at 2 and 8 GeV.

The figures 7.3 show the distribution of the total summed traces in the calorimeter cells. Figure 7.3a shows that the distribution follows the flat energy spectrum; only at the end of the distribution are the values no longer uniformly distributed. In figure 7.3b it can be seen that, as expected, higher energies lead to more traces in the detector, and that their distribution gets wider.

Figure 7.4: Distribution of summed traces per layer in the direction of the incoming particle; (a) complete spectrum, (b) spectrum at 2 and 8 GeV.

Figure 7.4 shows the distribution of the summed traces in the individual layers. Figure 7.4a shows that electrons in the selected energy range with the selected calorimeter architecture almost exclusively interact with the first layers. In figure 7.4b the variation between 2 and 8 GeV electrons is clearly visible; apart from the normalization, there are no major differences in the longitudinal shower profile.

Figure 7.5 shows the distribution of the summed traces in the individual cells perpendicular to the beam direction. It can be seen that there is basically no energy information perpendicular to the beam direction, as expected from the transverse cell size.


Figure 7.5: Distribution of summed traces perpendicular to the direction of the incoming particle; (a) complete spectrum, (b) spectrum at 2 and 8 GeV.

7.2 Energy Regression By Linear Fitting

The traditional method of determining the energy of a particle shower is to sum up the energies of the individual scintillator cells. This summation can be refined by weighting to compensate for detector effects. The simulation does not determine the energy in the individual cells but the number of charged traces, which should be proportional to the deposited energy. Detector effects such as inhomogeneities can also be neglected in a simulation.

To calibrate the calorimeter, the straight line with the smallest mean square deviation from the data points is determined. Here, the weights c_0, c_1 of the function

N(E) = c_0 \cdot E + c_1 \qquad (7.1)

are determined and the straight line is then inverted. The inversion is necessary to prevent distortions due to the restricted distribution of the output energy spectrum. The result is shown in figure 7.6.
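As a sketch, the fit and its inversion amount to two lines of NumPy; the toy data below merely imitates the simulation, with numbers chosen close to the fitted values of figure 7.6:

    import numpy as np

    rng = np.random.default_rng(1)
    E_true = rng.uniform(0, 10, size=1000)                    # flat toy spectrum
    N = 24.8 * E_true + 2.1 + rng.normal(0, 5, size=1000)     # toy trace counts

    c0, c1 = np.polyfit(E_true, N, deg=1)   # least-squares line of eq. (7.1)
    E_pred = (N - c1) / c0                  # inverted straight line as calibration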

7.3 Fully Connected Setup

The training data set consists of 253,064 simulated events and the validation set of28,119 simulated events.

The data structure is visualized in figure 7.7: a three-dimensional pixel structure, where each pixel contains the number of traces. First, a simple fully connected network (FCN) is applied to the data. The data tensor is flattened into a 1088-dimensional vector to fit the input space of the FCN.


Figure 7.6: The graph shows the relation between the energy of the incoming particle Etrue in GeV and the absolute number of charged particles in all scintillator cells. 10000 particles from the data are plotted here. The black straight line is the result of the fit described above, with c0 = 24.80 and c1 = 2.12.

Type   # Nodes   Activation   # Params
FC     500       ReLU         544500
FC     128       ReLU         64128
FC     128       ReLU         16512
FC     128       ReLU         16512
FC     128       ReLU         16512
FC     10        ReLU         1290
FC     1         Linear       11

Table 7.2: The structure of the fully connected network is shown. The first column shows the different types of layers of the network, which here are only fully connected layers denoted with FC. The second column shows the number of nodes in each layer. The third column shows all activation functions and the last column shows the number of free parameters or weights of each layer.


Figure 7.7: The three-dimensional structure of a data sample is visualized by an example event of an incoming electron with 9.14 GeV.

The layer structure of the first network is shown in table 7.2. Different structures were tried out; the presented version is the one with the best convergence. It has 659,465 free parameters and each layer is fully connected to the previous layer. With the exception of the output node, all nodes have ReLU as their activation function. The linear output node is chosen to serve the regression task.

7.4 Fully Connected Regression

The network is trained with the RMSprop optimizer, and the loss function is the mean squared error between the true energy values and the predicted energy values. The fully connected network is trained for 150 epochs with a batch size of 128, as sketched below.
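A sketch of this setup in Keras, reproducing the layer structure of table 7.2 and the training configuration stated above (the arrays x_train, y_train are assumed to hold the flattened events and true energies), is:

    from tensorflow import keras

    model = keras.Sequential(
        [keras.layers.Dense(500, activation="relu", input_shape=(1088,))]
        + [keras.layers.Dense(128, activation="relu") for _ in range(4)]
        + [keras.layers.Dense(10, activation="relu"),
           keras.layers.Dense(1, activation="linear")]
    )
    model.compile(optimizer="rmsprop", loss="mse")
    # model.fit(x_train, y_train, epochs=150, batch_size=128,
    #           validation_data=(x_val, y_val))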

In figure 7.8 the loss function is shown. While the loss function for the training set decreases over time, the loss function for the validation set increases. This implies that the model is overfitting: it uses its large number of parameters to memorize each point of the data set rather than the structure of the data.

In the scatter plot in figure 7.9 the results of the network and the linear fit are compared with each other. Globally, the distributions of the points hardly deviate visibly from each other. Only the decrease of the network output towards the end of the distribution is visible: the network shows hardly any values for Epred above 10 GeV.


Figure 7.8: The graph shows the evolution of the loss function for the training set and the validation set. The trained network is the fully connected network described above.

Figure 7.9: 10000 data points are plotted in a scatter plot between Epred and Etrue for the network and the fit. The blue background shows the results of the linear fit. In the foreground, the results of the fully connected neural net are shown in black.


Figure 7.10: The results of the neural network and the linear fit are compared with each other in the two graphs. The total data points are divided into 20 bins and then the mean µ and standard deviation σ of these bins are calculated. The upper graph shows the deviation of the mean values of Epred versus the mean values of Etrue. In the lower plot the standard deviation of Epred is plotted for each bin, divided by the root of the mean energy. The graph shows the path from the middle of the first to the middle of the last bin.


The results of the neural network and the linear fit are compared with each other in figure 7.10. In the upper plot the decrease of the mean values of the energies at the edge of the spectrum is visible. In addition, it is noticeable that the values are slightly shifted upwards globally.

The lower graph shows the dependence of the mean values of the standard deviations on the root of the energies. This quantity is a measure for the resolution of the calorimeter, as described in section 3.5.1. It is noticeable that the global performance of the neural network is weaker than that of the linear fit. Only at the end of the distribution is it significantly better than the resolution of the linear fit.

It can be noted that the FCN optimizes its resolution by exploiting the fact that the distribution is limited to between 0 and 10 GeV. The result is a drop in the value spectrum that is not physically motivated, but arises solely from the restricted output statistics. The slight global shift to higher values is the result of the mean squared error as a loss function: the residuals of values with higher energies are greater than those of small energies, so the network optimizes itself towards these higher values. Since the network is overfitted, this is reflected in the poor resolution.

The next section is mainly concerned with solving the problems previously described.

7.5 Data Augmentation

In order to solve the overfitting problem, the technique of data augmentation is used. In this application, data augmentation means that all valid symmetry transformations are used to increase the number of data points. Each data point is randomly rotated up to 3 times by 90 degrees and randomly mirrored horizontally. This increases the possible data points eight-fold and should enable the network to learn the symmetries in the system. The data points are additionally shifted in every direction in discrete cell-size steps, to make the data independent of the arrival point. At this point it should be noted that these transformations are only valid because the simulation is absolutely homogeneous and symmetrical; this is not necessarily the case with real detector data.
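A sketch of these transformations for one event of shape (layers, y, z), with the shift implemented as a periodic roll for simplicity (a stand-in of this sketch, the actual shifts need not wrap around), is:

    import numpy as np

    def augment(event, rng):
        # random rotation by 0-3 quarter turns in the transverse plane
        event = np.rot90(event, k=rng.integers(4), axes=(1, 2))
        # random horizontal mirroring
        if rng.integers(2):
            event = np.flip(event, axis=1)
        # discrete cell-size shift; a periodic roll is used as a toy stand-in
        event = np.roll(event, shift=rng.integers(-2, 3), axis=1)
        return event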

As shown in figure 7.11, the overfitting for the dataset can be completely removed with this method. The fluctuations in the values of the loss function for the validation sample result from the higher statistical fluctuation of a smaller amount of data. With data augmentation, the global trend of the validation loss asymptotically follows the loss for the training set. It no longer shows the increase of the validation loss seen without data augmentation. This indicates that the network is no longer overfitting.


Figure 7.11: The value of the loss function is shown after each epoch for the fully connected network. Val and train stand for the value of the loss function on the validation set and the training set. Neural net denotes the network without data augmentation, while data augmentation stands for the network with data augmentation.


7.6 Convolutional Setup

To improve the performance of the regression, a convolutional network is used.

The idea is that since our data has the same structure as images, techniques used forimage recognition should provide improved results.

The network consists of a convolutional part to which a fully connected network is linked. The layer structure is shown in table 7.3.

Type            # Filter   Activation   # Params
Conv3D(3,3,3)   32         ReLU         896
Conv3D(3,3,3)   10         ReLU         8650
Conv3D(5,5,5)   5          ReLU         6255

Type   # Nodes   Activation   # Params
FC     128       ReLU         28288
FC     128       ReLU         16512
FC     128       ReLU         16512
FC     10        ReLU         1290
FC     1         Linear       11

Table 7.3: The structure of the 3D convolutional network is presented. The first column shows the different types of layers of the network. The 3D tuples behind Conv3D are the kernel dimensions. The second column shows the number of nodes or filters in each layer. The third column shows all activation functions and the last column shows the number of free parameters or weights of each layer.

The convolutional network has 78,414 free parameters, which is only 12 percent of the parameters of the fully connected network previously used.

The structure was changed iteratively until the performance of the network converged.


7.7 Adversarial Training

In order to reduce the unphysical deviations at the end of the distribution, an adversarial network is used.

Here the trained network consists of the convolutional network described above as the predictor, to which a fully connected network is connected as an adversary. The structure of the adversary network is shown in table 7.4.

Type    # Nodes   Activation   # Params
Dense   50        ReLU         100
Dense   50        ReLU         2550
Dense   50        ReLU         2550
Dense   50        ReLU         2550
Dense   128       ReLU         6528
Dense   20        Sigmoid      2580

Table 7.4: The structure of the adversarial network is presented.

The convolutional network is first pretrained with the mean squared error as a loss function.

Out of the result of the predictor, the quantity

(E_{pred} - E_{true})/\sqrt{E_{true}}

is calculated, which is an approximation of the resolution of a calorimeter, see section 3.5.1. This quantity is the input for the adversary, which tries to reconstruct the energy from it. The adversary network is trained with the categorical cross entropy as loss function,

\mathcal{L}_r = -\sum C(Y_{true}) \log\left(r\left(\frac{f(X) - Y_{true}}{\sqrt{Y_{true}}}\right)\right). \qquad (7.2)

Here f and r are the functions of the network and its adversary. C is the function that divides all values from 0 to 10 GeV into 20 equally spaced categories and then maps each category to a 20-dimensional basis vector.

After training the adversary the predictor is trained with the loss function

\mathcal{L}_f = \sum (Y_{true} - f(X))^2 - \lambda \mathcal{L}_r. \qquad (7.3)

λ is an arbitrary factor that has been set to 2.5 here. It is chosen so that the gradients of the two parts of the loss function are nearly equal. By integrating the adversary negatively, the predictor tries to make the performance of the adversary as poor as possible. When training the adversary, the weights of the predictor are locked, and vice versa.

The training course consists of alternating training of both networks for 5 epochs each. This process is repeated nine times, and after each process the results of the neural network for the test sample are recorded.

Figure 7.12: The results of the neural network after each of the 9 subsequent steps are presented, see numbers. To improve clarity, no axes or labels are shown; they are the same as in figure 7.9.

The results are shown in figure 7.12. It is evident that the results fluctuate and do not converge. With the exception of anomalies that occur in between, the final result hardly changes noticeably.


7.8 Likelihood Solution

As the adversarial training did not lead to the desired results, in the following part the compensation of the deviations in the distribution by changing the loss function is examined.

In contrast to the previous parts, a modified loss function is used. It is outlined how an adequate loss function can be determined with respect to the underlying distribution of labels.

By transforming the mean squared error and adding terms that do not change the location of the optimum, a maximum likelihood formulation can be derived:

\min \sum (y_{true} - y_{pred})^2 \qquad (7.4)

= \max \sum -\frac{(y_{true} - y_{pred})^2}{2\sigma^2} - \ln\left(\sqrt{2\pi\sigma^2}\right) \qquad (7.5)

= \max \sum \ln\left(\exp\left(-\frac{(y_{true} - y_{pred})^2}{2\sigma^2}\right)\right) - \ln\left(\sqrt{2\pi\sigma^2}\right) \qquad (7.6)

= \max \sum \ln\left(\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_{true} - y_{pred})^2}{2\sigma^2}}\right) \qquad (7.7)

= \max \ln \prod \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_{true} - y_{pred})^2}{2\sigma^2}} \qquad (7.8)

= \max \prod \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_{true} - y_{pred})^2}{2\sigma^2}}. \qquad (7.9)

The assumption that the mean squared error is the valid minimization function of the neural network is therefore equivalent to the assumption that the values are Gaussian distributed and have a constant standard deviation.

However, no constant standard deviation is given for a calorimeter. In addition, the assumption that the values follow a full Gaussian distribution is not applicable at the edges. Here the absence of entries for higher energies must be compensated with a normalization factor. To illustrate this, figure 7.13 shows random values normally distributed with a mean of 9.5 and a standard deviation of √9.5. All values of the random sample above 10 are dropped. To these data points two maximum likelihood fits are applied: one for a Gaussian distribution and one for a Gaussian distribution divided by the cumulative Gaussian distribution at 10,

\frac{\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\int_{-\infty}^{10} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx}. \qquad (7.10)


Figure 7.13: The data shown are random values normally distributed with a mean of 9.5 and a standard deviation of √9.5. All values above 10 are excluded. The two lines are Gaussians fitted with a maximum likelihood fit to the data. The black line is a standard Gaussian and the blue line is a Gaussian divided by a normalization compensating for the missing values above 10.


It is visible that without this normalization the distribution is shifted and does not resemble the data. In order to consider truncated distributions, the probability density function can be calculated as

\frac{\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\int_a^b \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx} = \frac{\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\frac{1}{2}\left(\operatorname{erf}\left(\frac{\mu-a}{\sqrt{2}\sigma}\right) - \operatorname{erf}\left(\frac{\mu-b}{\sqrt{2}\sigma}\right)\right)}, \qquad (7.11)

where a and b in the simulation are 0.0 and 10.0 GeV. A loss function which has to be minimized for the neural net can then be derived as

\max \prod \frac{\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\frac{1}{2}\left(\operatorname{erf}\left(\frac{\mu-a}{\sqrt{2}\sigma}\right) - \operatorname{erf}\left(\frac{\mu-b}{\sqrt{2}\sigma}\right)\right)} \qquad (7.12)

= \min -\ln \prod \frac{\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\frac{1}{2}\left(\operatorname{erf}\left(\frac{\mu-a}{\sqrt{2}\sigma}\right) - \operatorname{erf}\left(\frac{\mu-b}{\sqrt{2}\sigma}\right)\right)} \qquad (7.13)

= \min\left[-\sum \ln\left(\frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\sqrt{\frac{\pi\sigma^2}{2}}\left(\operatorname{erf}\left(\frac{\mu-a}{\sqrt{2}\sigma}\right) - \operatorname{erf}\left(\frac{\mu-b}{\sqrt{2}\sigma}\right)\right)}\right)\right] \qquad (7.14)

= \min\left[-\sum -\frac{(x-\mu)^2}{2\sigma^2} - \ln\left(\sqrt{\frac{\pi\sigma^2}{2}}\left(\operatorname{erf}\left(\frac{\mu-a}{\sqrt{2}\sigma}\right) - \operatorname{erf}\left(\frac{\mu-b}{\sqrt{2}\sigma}\right)\right)\right)\right] \qquad (7.15)

= \min \sum \frac{(x-\mu)^2}{\sigma^2} + \ln\left(\frac{\pi\sigma^2}{2}\left(\operatorname{erf}\left(\frac{\mu-a}{\sqrt{2}\sigma}\right) - \operatorname{erf}\left(\frac{\mu-b}{\sqrt{2}\sigma}\right)\right)^2\right). \qquad (7.16)

The resulting loss function consists of a weighted squared-error term and a boundary correction term. The weighting prevents the global shift of the mean µ towards higher values, because values with high energies contribute the same amount to the total loss function as values with low energies. The second term is a correction factor that takes the ends of the distribution into account and reduces their effects.
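A sketch of eq. (7.16) as a Keras-compatible loss for the spectrum truncated at a = 0 and b = 10 GeV is given below; the σ ∝ √E parametrization inside the loss is an assumption of this sketch, not a statement of the exact form used in the analysis:

    import math
    import tensorflow as tf

    a, b = 0.0, 10.0   # truncation boundaries of the energy spectrum in GeV

    def truncated_gauss_loss(y_true, y_pred):
        sigma = tf.sqrt(tf.maximum(y_true, 1e-3))            # assumed sigma ~ sqrt(E)
        sq = tf.square(y_true - y_pred) / tf.square(sigma)   # weighted squared term
        norm = (tf.math.erf((y_pred - a) / (tf.sqrt(2.0) * sigma))
                - tf.math.erf((y_pred - b) / (tf.sqrt(2.0) * sigma)))
        boundary = tf.math.log(math.pi * tf.square(sigma) / 2.0 * tf.square(norm))
        return tf.reduce_mean(sq + boundary)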

Convergence problems occurred with the resulting loss function when the selected initial parameters were far from the optimum. Therefore the network is pretrained with the MSE until it converges. The pretrained ConvNet is then further trained with (7.16) as loss function for 50 epochs. The result is shown in figure 7.14. The decrease of the mean values towards high energy values is completely removed. Globally, the network shows no tendency to higher values.


Figure 7.14: The results of the neural network and the linear fit are compared in the two graphs. The total data points are divided into 20 bins and then the mean and standard deviation of these bins are calculated. The upper graph shows the deviation of the mean values of Epred versus the mean values of Etrue. In the lower plot the standard deviation of Epred is plotted for each bin, divided by the root of the mean energy.


7.9 Comparison of Network Architectures

Finally, the performance of the two network architectures, the ConvNet and the FCN, is compared. Both networks are treated equally: both are trained for 150 epochs with mean squared error and data augmentation, and afterwards both networks are trained for 50 epochs with the likelihood loss to solve the boundary problems.

Figure 7.15: The results of the linear fit, the FCN and the ConvNet are compared. The standard deviation of Epred is plotted for each bin, divided by the root of the mean energy.

Figure 7.15 shows the different resolutions of the linear fit, the ConvNet and the FCN. The linear fit achieved a mean resolution of 0.362 √E, the FCN of 0.327 √E and the ConvNet of 0.319 √E. With the ConvNet the resolution can thus be improved by almost 12 percent.

It should be noted that only the basic possibility of improvement is presented here. It is beyond the focus of this work to show how much the resolution can be improved.

Altogether it could be shown that a better resolution for calorimeters can be achieved with deep learning, even if the information density in the structure of the data is not high for electromagnetic showers. The effectiveness of data augmentation in eliminating overfitting could be demonstrated. Adversarial training was in this case not successful. A method is described how a corresponding loss function can be derived from the distribution of the values. It could be shown how otherwise occurring systematic errors are eliminated by utilizing this loss function. The comparison of ConvNet and FCN shows that a network architecture suited to the data leads to better results.


Chapter 8

Jet Calibration Analysis

In this chapter, the jet transverse momentum calibration using deep learning is discussed. The application of a suited loss function for the reduction of boundary effects on jet data is presented. In addition to a simple fully connected network architecture, a particle flow network is used for regression.

In this analysis, data generated by the fully detailed Geant simulation of the CMS detector are used. The data sample is a QCD data set belonging to the 2018 detector simulation. The data set has a PT spectrum from 15 to 3000 GeV. The events in the data set are generated without pile-up. For comparison, the PT results of the anti-kT algorithm are used and denoted as PT,Reco.

Individual jets are extracted from the data set for analysis. Only jets in the energy range between 30 and 150 GeV are selected. The distribution of the jets is shown in figure 8.1. The distribution is not flat, but shows the characteristic falling shape resulting from the convolution of parton density functions and matrix elements. For this reason the shape of this distribution must be taken into account, in contrast to the last chapter.

In total the data set consists of 14,786,472 jets, of which 14,490,070 jets form the training set. The validation set consists of 148,318 jets and the test sample of 148,084 jets.

The four-vectors of the constituent particle flow objects are stored for each jet. The transverse momenta of the jets are recorded at generator level and after simulation. For each jet, only a maximum of 200 four-vectors sorted by transverse momentum are stored, in order to obtain a constant input size. When a jet consists of fewer than 200 particles, the free entries are filled with zero vectors.


Figure 8.1: The distribution of the jets versus PT at generator level is plotted for the validation set.


8.1 Deep Learning Setup

The regression process in this chapter is iterative. First the network is trained with the MSE as loss function; then the trained network is used to calculate the standard deviation σ, as it is needed for the loss function described below. With this loss function the network is trained again, after which σ is determined once more and the network is trained a further time. This training scheme is shown in figure 8.2.

Figure 8.2: The schematic training sequence of the three training steps: training with the MSE L(Ŷ, Y) = (Ŷ − Y)², a first training with LJ(Ŷ, Y, σ(Y, c0,1,2)), and a second one, with σ re-fitted from the predictions before each of the two LJ steps.

The results will be labeled as MSE, 1st and 2nd. Here MSE means that the results are obtained after training with the mean squared error as loss function; 1st and 2nd denote the two subsequent training results with the loss function described below. To deduce the momenta, the four-vectors of the particles of the jets are used as input for the networks. First, the dimensions of the (200, 4) input space are flattened into an (800, 1) space. In the training process, the data is processed in batch sizes of 1024 samples.

The possible network architectures are kept general; they will be described later for the two analyzed cases, the JetNet and the Particle Flow Network. The basic requirement for the network is that the input layer consists of 800 inputs and that the output layer is a single linear output node.

Since the standard deviation σ is not a priori apparent, it is approximated beforehand. For this purpose the results of the network are calculated and divided into 20 bins according to PT,Gen. The standard deviation is then determined for each bin and the analytical function

\sigma(P_{T,Gen}) = c_0 \sqrt{P_{T,Gen}} + c_1 P_{T,Gen} + c_2 \qquad (8.1)

is fitted to these σ values. This function is used as the σ term in the loss function. At this point it should be noted that a differentiable function for σ is indispensable, since the gradient of the loss function is required for the backpropagation algorithm. This σ function is then inserted into the loss function. An example of this fitting is visualized in figure 8.3; a sketch of the fit is given below.
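A sketch of this fit using scipy; the bin centers and per-bin standard deviations are assumed to have been computed from the network output beforehand, and the toy values below are purely illustrative:

    import numpy as np
    from scipy.optimize import curve_fit

    def sigma_model(pt, c0, c1, c2):
        # analytical resolution parametrization of eq. (8.1)
        return c0 * np.sqrt(pt) + c1 * pt + c2

    pt_centers = np.linspace(33, 147, 20)                          # toy bin centers
    sigmas = 0.8 * np.sqrt(pt_centers) + 0.01 * pt_centers + 0.5   # toy sigma values

    (c0, c1, c2), _ = curve_fit(sigma_model, pt_centers, sigmas)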

Figure 8.3: Example of the σ fit. The σ values are calculated for the MSE training step. The blue line is the analytical function (8.1) fitted to the data.

Since the data spectrum covers only a limited range of PT,Gen, it is to be expected that the systematic deviations described in the previous chapter occur. Moreover, the information from the calorimeters is incorporated into the reconstructed particles and is therefore correlated with their error structure.

Another relevant factor is that the value spectrum is not flat. This is taken into account by binning the loss function. The resulting loss function is

\mathcal{L}_J(\hat{Y}, Y, \sigma) = \frac{1}{n} \sum_i \frac{1}{n_{b_i}} \sum_{b_i < Y \le b_{i+1}} \left[\frac{(\hat{Y} - Y)^2}{\sigma^2} + \ln\left(\frac{\pi\sigma^2}{2}\left(\operatorname{erf}\left(\frac{\hat{Y}-a}{\sqrt{2}\sigma}\right) - \operatorname{erf}\left(\frac{\hat{Y}-b}{\sqrt{2}\sigma}\right)\right)^2\right)\right], \qquad (8.2)

where n is the number of bins and n_{b_i} is the number of entries per bin. The values are sorted by PT,Gen and divided into 20 bins, each covering a 5.5 GeV wide spectrum. Per bin the mean value is determined, and the total loss is the mean of the mean values per bin. As a result, individual value ranges in the spectrum no longer contribute more to the total value just because their population is higher. The bias towards values with a high population is reduced in this way.
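A sketch of this binning in TensorFlow, assuming the per-event bracket of eq. (8.2) has already been evaluated into a tensor terms and that every bin is populated in each batch, is:

    import tensorflow as tf

    def binned_mean(terms, y_true, edges, n_bins):
        # which bin each event falls into; edges are the n_bins - 1 inner boundaries
        bin_ids = tf.searchsorted(edges, y_true)
        # average within each bin, then average the bin means, as in eq. (8.2)
        per_bin = tf.math.unsorted_segment_mean(terms, bin_ids, n_bins)
        return tf.reduce_mean(per_bin)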

The response R is used as a measure for the calibration,

R = \frac{P_{T,X}}{P_{T,Gen}}, \qquad (8.3)

where X stands for the chosen calibration algorithm.


8.2 Jetnet Analysis

The network used in this section is a very simple fully connected network. Table 8.1 shows the structure. The number of nodes is reduced for each layer. The first layer has 800 nodes, which corresponds to the number of input features of one jet. The hidden layers have the ReLU activation function. The last layer has a linear node, to obtain a one-dimensional output space. In total the network has 2,043,201 free parameters. This network is referred to as JetNet in the following.

Type    # Nodes   Activation   # Params
Dense   800       ReLU         640800
Dense   700       ReLU         560700
Dense   600       ReLU         420600
Dense   400       ReLU         240400
Dense   300       ReLU         120300
Dense   200       ReLU         60200
Dense   1         Linear       201

Table 8.1: The structure of the fully connected network is shown. The first column shows the different types of layers of the network. The second column shows the number of nodes in each layer. The third column shows all activation functions and the last column the number of free parameters or weights of each layer.

The network is trained with the training procedure described in the previous section.

Figure 8.4 shows the results of the Reco algorithm and the neural network in a scatter plot; MSE is used as loss function for the neural network. Systematic deviations similar to the previous chapter can be seen at low and high values of PT,Gen.

Figure 8.5 shows the mean resulting R values in bins of PT,Gen for the JetNets and the Reco algorithm. It can be seen that the first iteration corrects the systematic deviations at the boundaries, but yields too high results. This shift to higher values is not found after the second iteration. The boundary effects could not be removed completely and persist even after multiple iterations.

In figure 8.6 the resolution is shown. The JetNet results are always better than those of the Reco algorithm. At the edges the resolution increases again artificially; this systematic deviation could be reduced by the alternative loss function.


Figure 8.4: The resulting response R of the Reco algorithm and the JetNet as a function of PT,Gen is displayed. The results are taken from the MSE training step.

Figure 8.5: The mean values of the R distribution are represented as a function of PT,Gen.


Figure 8.6: The standard deviation σ divided by the root of PT,Gen versus PT,Gen is presented.


8.3 Particle Flow Network

Subsequently, a Particle Flow Network (PFNet) is trained for comparison. To create the network the EnergyFlow package [10] is used, see section 6.5. The training procedure is the one described above. The structure of the PFNet consists of a filter module of three fully connected layers with 100, 100 and 128 nodes. The latent space in the center of the network is 128-dimensional. The second network consists of three layers of 100 fully connected nodes each, followed by a linear output node.

Figure 8.7: The mean values of the R distribution versus PT,Gen.

Figure 8.7 shows the mean values of the R distribution. The results are very similar to those of the JetNet. It can be seen that the alternative loss function helps to eliminate the kink at the beginning. Overall the entire distribution has a systematic skewness, but is quite linear.

In figure 8.8 the distribution of the resolutions is shown. All PFNet results have approximately the same progression and lie clearly below the Reco results. The best performance is achieved after the second iteration.


Figure 8.8: The standard deviation σ divided by the root of PT,Gen versus PT,Gen is presented.


8.4 Comparison

The last section compares the two networks. In figure 8.9 it can be seen that the R distributions of the two network types are almost identical.

Figure 8.9: The mean values of the R distribution are represented in dependence on PT,Gen.

Figure 8.10 shows that this also applies to the resolution. The PFNet has fewer deviations at the edges; in the middle the lines are comparable.

Overall, the performance of both networks is on average about 15 percent better than that of the clustering algorithm. It is therefore possible to improve the momentum resolution of the jets with the help of deep learning techniques. For this analysis, neither the hyper-parameters were tuned nor the input variables changed. Therefore the analysis can only show that the resolution can be improved, but not how far this is possible. That the two architectures perform roughly equally well is either an indication that the PFNet structure does not provide a better result for momentum calibrations, or that the unprocessed input variables are not suitable for this type of network. Another possibility is that the overall performance maximum is nearly reached by the analysis. Further analysis is needed to investigate this. Likewise, the alternative loss function could reduce the boundary effect, but not eliminate it altogether.


Figure 8.10: The standard deviation σ divided by the root of PT,Gen versus PT,Gen is presented.


Chapter 9

Conclusion

The aim of this analysis was to study energy measurements of calorimeters and particle flow jets and possible improvements in calibration and resolution via deep neural networks.

Therefore a simplified hadronic calorimeter with electromagnetic showers was simulated in this work. This calorimeter was used to train various networks in energy determination. Using data augmentation, the overfitting of the networks could be reduced. The application of adversarial training to reduce systematic errors was unsuccessful. From the distribution of the values, a loss function could be determined which greatly reduces the systematic deviations due to boundary effects. The principle behind this design is universal. It could be shown that a convolutional neural net is better suited for the regression of calorimeter data than a fully connected neural net. The performance of the networks is significantly higher than that of a linear fit.

In the second part of the analysis, networks for jet energy calibration were trained. The approach of loss function design developed in the previous part could be extended and reduced the systematic deviations caused by the shape of the input distribution. In addition to a fully connected neural net, a Particle Flow Network was also trained. The overall performance of the two architecture types is approximately identical. Both network types performed about 15% better than the default particle flow jet algorithm.

All in all it could be shown that regression in particle physics is a good field to apply deep learning. This allows better performances than traditional algorithms to be achieved. However, machine learning algorithms still lack robustness against systematic deviations due to the distribution shape. Further research is needed to overcome this problem. Neither the architectural design, the hyper-parameter tuning nor the input variable design is really treated in this study. This has to be investigated in further studies. Stay tuned.


Bibliography

[1] Lyndon Evans and Philip Bryant. LHC machine. Journal of Instrumentation, 3(08):S08001, 2008.

[2] Oliver Brüning and Lucio Rossi. The high-luminosity Large Hadron Collider. Nature Reviews Physics, 1(4):241–243, 2019.

[3] P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5(1):4308, 2014.

[4] P. Baldi, P. Sadowski, and D. Whiteson. Enhanced Higgs boson to τ+τ− search with deep learning. Physical Review Letters, 114(11):111801, 2015.

[5] R. Santos, M. Nguyen, J. Webster, S. Ryu, J. Adelman, S. Chekanov, and J. Zhou. Machine learning techniques in searches for tt̄H in the H → bb̄ decay channel. Journal of Instrumentation, 12(04):P04014, 2017.

[6] Raffaele Tito D'Agnolo and Andrea Wulzer. Learning new physics from a machine. Physical Review D, 99(1):015014, 2019.

[7] Andrea De Simone and Thomas Jacques. Guiding new physics searches with unsupervised learning. The European Physical Journal C, 79(4):289, 2019.

[8] Theo Heimel, Gregor Kasieczka, Tilman Plehn, and Jennifer Thompson. QCD or what? SciPost Physics, 6(3):030, 2019.

[9] Patrick T. Komiske, Eric M. Metodiev, Benjamin Nachman, and Matthew D. Schwartz. Pileup mitigation with machine learning (PUMML). Journal of High Energy Physics, 2017(12):51, 2017.

[10] Patrick T. Komiske, Eric M. Metodiev, and Jesse Thaler. Energy flow networks: Deep sets for particle jets. Journal of High Energy Physics, 2019(1):121, 2019.

[11] Josh Cogan, Michael Kagan, Emanuel Strauss, and Ariel Schwarztman. Jet-images: Computer vision inspired techniques for jet tagging. Journal of High Energy Physics, 2015(2):118, 2015.

[12] Luke de Oliveira, Michael Kagan, Lester Mackey, Benjamin Nachman, and Ariel Schwartzman. Jet-images - deep learning edition. Journal of High Energy Physics, 2016(7):69, 2016.

[13] Pierre Baldi, Kevin Bauer, Clara Eng, Peter Sadowski, and Daniel Whiteson. Jet substructure classification in high-energy physics with deep neural networks. Physical Review D, 93(9):094034, 2016.

[14] James Barnard, Edmund Noel Dawe, Matthew J. Dolan, and Nina Rajcic. Parton shower uncertainties in jet substructure analyses with deep neural networks. Physical Review D, 95(1):014018, 2017.

[15] Gregor Kasieczka, Tilman Plehn, Michael Russell, and Torben Schell. Deep-learning top taggers or the end of QCD? Journal of High Energy Physics, 2017(5):6, 2017.

[16] Kaustuv Datta and Andrew Larkoski. How much information is in a jet? Journal of High Energy Physics, 2017(6):73, 2017.

[17] Patrick T. Komiske, Eric M. Metodiev, and Matthew D. Schwartz. Deep learning in color: Towards automated quark/gluon jet discrimination. Journal of High Energy Physics, 2017(1):110, 2017.

[18] M. Tanabashi et al. Review of particle physics. Physical Review D, 98:030001, 2018.

[19] Hermann Kolanoski and Norbert Wermes. 15 Kalorimeter. Teilchendetektoren. Springer Berlin Heidelberg, 2016.

[20] Richard Wigmans. Calorimetry - energy measurements in particle physics, 2nd edition. Contemporary Physics, 59(2):226–227, 2018.

[21] Christian W. Fabjan and Fabiola Gianotti. Calorimetry for particle physics. Reviews of Modern Physics, 75(4):1243–1286, 2003.

[22] Ugo Amaldi. Fluctuations in calorimetry measurements. Physica Scripta, 23(4A):409–424, 1981.

[23] Geant4 Collaboration. Geant4 - a simulation toolkit. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 506(3):250–303, 2003.

[24] Geant4 Collaboration. Geant4 user documentation. 2018.

[25] Geant4 Collaboration. Recent developments in Geant4. 2014.

[26] Geant4 Collaboration. Geometry and physics of the Geant4 toolkit for high and medium energy applications. Radiation Physics and Chemistry, 78(10):859–873, 2009.

[27] The CMS Collaboration. Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Physics Letters B, 716(1):30–61, 2012.

[28] The ATLAS Collaboration. Observation of a new particle in the search for thestandard model higgs boson with the atlas detector at the lhc. Physics Letters B,716(1):1–29, 2012.

[29] Richard Mitnick. CERN/LHC map. 2015.

[30] K. Aamodt et al. The ALICE experiment at the CERN LHC. JINST, 3:S08002, 2008.

[31] G. Aad et al. The ATLAS Experiment at the CERN Large Hadron Collider. JINST, 3:S08003, 2008.

[32] S. Chatrchyan et al. The CMS Experiment at the CERN LHC. JINST, 3:S08004, 2008.

[33] A. Augusto Alves, Jr. et al. The LHCb Detector at the LHC. JINST, 3:S08005, 2008.

[34] The CMS Collaboration. Particle-flow reconstruction and global event description with the CMS detector. Journal of Instrumentation, 12(10):P10003, 2017.

[35] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[36] Jonathan J. Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. pages 1799–1807, 2014.

[37] Clement Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915–1929, 2013.

[38] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.

[39] Tomas Mikolov, Anoop Deoras, Daniel Povey, Lukas Burget, and Jan Cernocky. Strategies for training large scale neural network language models. In 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, December 2011.

[40] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition. Signal Processing Magazine, 2012.

[41] Tara N. Sainath, Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran. Deep convolutional neural networks for LVCSR. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013.

[42] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. CoRR, 2017.

[43] Junshui Ma, Robert P. Sheridan, Andy Liaw, George E. Dahl, and Vladimir Svetnik. Deep neural nets as a method for quantitative structure-activity relationships. Journal of Chemical Information and Modeling, 55(2):263–274, 2015.

[44] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.

[45] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.

[46] Marvin Minsky and Seymour Papert. Perceptrons: An introduction to computational geometry. Science, 165(3895):780–782, 1969.

[47] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[48] Augustin-Louis Cauchy. Analyse mathématique. Méthode générale pour la résolution des systèmes d'équations simultanées. Oeuvres complètes. Cambridge University Press, 1847.

[49] H. J. Kelley. Gradient theory of optimal flight paths. ARS Journal, 30(10):947–954, 1960.

[50] A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. 1961.

[51] Arthur E. Bryson, Jr. and W. F. Denham. A steepest-ascent method for solving optimum programming problems. (BR-1303), 1961.

[52] Lev Semenovich Pontryagin, V. G. Boltyanskii, R. V. Gamkrelidze, and E. F. Mishchenko. The Mathematical Theory of Optimal Processes. 1961.

[53] S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1):30–45, 1962.

[54] J. H. Wilkinson, editor. The Algebraic Eigenvalue Problem. Oxford University Press, Inc., New York, NY, USA, 1965.

[55] S. Amari. A theory of adaptive pattern classifiers. IEEE Trans. EC, 16(3):299–307, 1967.

[56] A. E. Bryson and Y. C. Ho. Applied optimal control: optimization, estimation, and control. Blaisdell Pub. Co., 1969.

[57] Jürgen Schmidhuber. Deep learning in neural networks: an overview. Neural Networks, 61:85–117, 2015.

[58] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's thesis, Univ. Helsinki, 1970.

[59] Seppo Linnainmaa. Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2):146–160, 1976.

[60] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, 31.8 - 4.9, NYC, pages 762–770, 1981.

[61] D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat's striate cortex. The Journal of Physiology, 148(3):574–591, 1959.

[62] Kunihiko Fukushima. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.

[63] Hampshire and Waibel. A novel objective function for improved phoneme recognition using time delay neural networks. In International Joint Conference on Neural Networks, 1989.

[64] Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J. Lang. Phoneme recognition using time-delay neural networks. In Readings in Speech Recognition, pages 393–404. Elsevier, 1990.

[65] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

[66] Gilles Louppe, Michael Kagan, and Kyle Cranmer. Learning to pivot with adversarial networks. CoRR, 2016.

[67] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola. Deep sets. CoRR, 2017.

[68] François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.

[69] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

Statutory Declaration (Eidesstattliche Versicherung)

I affirm that I have prepared the attached written master's thesis independently and have used no aids other than those stated. All passages taken verbatim or in substance from other works have been clearly marked as borrowings in each individual case, with the source indicated precisely. This also applies to all information taken from the Internet or other electronic data collections. I further declare that the master's thesis I have prepared has not previously, in the same or a similar version, been part of any coursework or examination within my studies. The written version I have submitted is identical to the version on the electronic storage medium.

I consent to the publication of this master's thesis.

Hamburg, (date) Signature: