CTD 2019, Valencia, 04-04-2019
A GPU High-Level Trigger 1 for the upgraded LHCb detector
Brij Kishor Jashal ([email protected]), Instituto de Física Corpuscular, UV – Valencia
on behalf of the LHCb Collaboration and the Allen team
Outline:
• LHCb upgrade
• LHCb Tracking system
• Introduction to Allen project
• Framework and algorithm designs
• Summary and future work
The LHCb upgrade during LS2 (ongoing)
• The present LHCb detector will be mostly dismantled and a new detector will be installed.
• More than 90% of the active detector channels will be replaced.
• Trigger-less readout, full software trigger.
Compared with the current detector:
• 5x higher instantaneous luminosity: 2×10³³ cm⁻²s⁻¹
• 2x increased efficiency for hadronic signals
• 10x signal yield per unit time
• 6x number of primary interactions
• Triggering is expensive: Bandwidth [GB/s] ∝ Accept rate [kHz] × Event size [kB] (e.g. a 1 MHz accept rate at 100 kB per event already means 100 GB/s).
• We can't do much to reduce the raw event size; it's all or nothing! Or can we?
=> If the event size is reduced, there's room for more physics!
Full stream:
• Zero-suppressed raw detector data.
Turbo stream:
• Only a set of fully reconstructed higher-level physics objects, along with monitoring information.
TurCal stream:
• Calibration information of selected candidates together with raw detector data.
Output bandwidth scenarios will be optimised:
• On the basis of physics
• Based on the fraction of events in FULL and Turbo
• On operational and resource-cost considerations
It is assumed that all flavour-physics channels not involving beauty will be migrated to the Turbo stream in the upgrade.
LHCb Upgrade software trigger
Bandwidth [GB/s] ∝ Accept rate [kHz] × Event size [kB]
LHCb: tracking system and track types
LHCb HLT1
The entire HLT1 involves the decoding of all tracking sub-detectors, clustering, track reconstruction, a Kalman filter, primary-vertex finding and trigger-decision algorithms.
Computing challenge:
• A full software-only trigger must reduce 40 TB/s to a manageable 10 GB/s of output bandwidth.
• It must fit in the time budget.
• Type of workload: small event size, O(150) KiB.
• Events must be processed in parallel to fully utilize the underlying hardware.
• The system must be integrated strategically into the DAQ system (accelerators?).
• In any case, the solution should be cost effective.
Allen:
A standalone R&D project:
• Run the full first stage of the HLT (HLT1) on GPUs
• Process events in parallel
• Exploit data parallelism within events
Requirements:
• A C++14-compliant compiler
• CUDA v10.0
• A graphics card with CUDA support
• CMake 3.12
Allen GitLab repository: https://gitlab.cern.ch/lhcb-parallelization/Allen
A scalable and modular framework, built for running the LHCb HLT1 on GPUs.
Allen Framework: salient features #1
SIMD implementation: parallelism at two levels: block and thread.
• Support for a custom binary input format
• Built-in physics validation
• Pipelined stream sequence
• One event per block, with sub-event parallelism
The computing problem is granular, and this granularity is reflected in the algorithm design.
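The two levels of parallelism can be pictured with plain loops (names hypothetical; in the real framework these are CUDA kernels, not host loops): the outer loop plays the role of the grid, with one block per event, and the inner loop plays the role of a block's threads working on that event's hits.

```cpp
#include <vector>

// Hypothetical event type: just a list of per-hit values.
struct Event { std::vector<float> hits; };

void process_events(std::vector<Event>& events) {
    for (auto& ev : events) {       // "grid" level: one block per event
        for (auto& h : ev.hits)     // "block" level: threads over the hits
            h *= 2.0f;              // stand-in for the per-hit work
    }
}
```

On a GPU the two loops vanish: `blockIdx` selects the event and `threadIdx` the hit, which is what "one event per block along with sub-event parallelism" means.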
Allen Framework: salient features #2
Scalability and modularity:
• The underlying framework is designed to scale with a growing code base
• Easy to add new algorithms, with compile-time sequence configuration
• Compilation with the nvcc / LLVM backend
• Optional compilation with ROOT for the generation of graphs
Memory management
• Memory allocation is done at application start-up
• Custom memory manager for GPU memory
• Developers do not have to invoke memory-allocation routines
• No dependence on dynamic libraries for memory allocation
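A minimal sketch of this idea (illustrative only, not Allen's actual manager): one pool is allocated once at start-up, and algorithms obtain aligned slices from a bump allocator, so no allocation calls happen while events are being processed.

```cpp
#include <cstddef>
#include <new>
#include <vector>

class MemoryManager {
    std::vector<char> pool_;    // single allocation at start-up
    std::size_t offset_ = 0;
public:
    explicit MemoryManager(std::size_t bytes) : pool_(bytes) {}

    // Hand out the next aligned slice of the pool ("bump" allocation).
    void* reserve(std::size_t bytes, std::size_t align = 64) {
        std::size_t start = (offset_ + align - 1) / align * align;
        if (start + bytes > pool_.size()) throw std::bad_alloc();
        offset_ = start + bytes;
        return pool_.data() + start;
    }

    void free_all() { offset_ = 0; }              // reset between batches
    std::size_t used() const { return offset_; }
};
```

Freeing is a single pointer reset between event batches, which is what makes per-event processing allocation-free.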
Allen Framework: salient features #3
Data types and access
• Memory coalescing for optimal use of the global-memory bandwidth
• Data types are consolidated into Structures of Arrays (SoA)
• The SoA layout allows contiguous data-access patterns, optimizing the use of cache memory
Spatial reduction:
• k-dimensional (k-d) tree structure for spatial search-window optimization
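The SoA point can be made concrete with a short sketch (hypothetical types): with an Array-of-Structures the fields of a hit are interleaved, so a kernel reading only x strides through memory; with a Structure-of-Arrays each field is contiguous, so neighbouring threads read neighbouring addresses.

```cpp
#include <vector>

struct HitAoS { float x, y, z; };   // interleaved: x, y, z, x, y, z, ...

struct HitsSoA {                    // one contiguous array per field
    std::vector<float> x, y, z;
    void push(float xi, float yi, float zi) {
        x.push_back(xi); y.push_back(yi); z.push_back(zi);
    }
};

// Touches only the x array: a dense, cache- and coalescing-friendly stream.
float sum_x(const HitsSoA& h) {
    float s = 0.0f;
    for (float v : h.x) s += v;
    return s;
}
```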
Allen Framework: salient features #4
Platform support and portability:
• The algorithm design is not specific to GPUs, but benefits any SIMD processor
• The Intel® SPMD Program Compiler employs an SPMD model in a similar manner to the SIMT model on GPUs
• Compatibility demonstrated by translating algorithms to x86-64 processors
• Porting to AMD's Radeon Open Compute (ROCm) is ongoing
SIMT design:
• Even workload across groups of cores (warps)
• Masking is used to disable and enable the various threads as appropriate
A codebase under development.
Algorithm designs #1
Velo clustering
• Find cluster seeds (using an 8-bit mask)
• Load only the neighbouring pixels of a seed
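To illustrate the bitmask idea, here is a CPU sketch under an assumed layout (names and seed rule hypothetical; the real CUDA kernel differs): each column of an 8-row pixel patch is one 8-bit mask, and a pixel is taken as a seed when it is hit but its left and lower neighbours are not, so each simple cluster contributes exactly one seed.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns (column, row) seed candidates from 8-bit column masks.
std::vector<std::pair<int, int>> find_seeds(const std::vector<uint8_t>& cols) {
    std::vector<std::pair<int, int>> seeds;
    for (std::size_t c = 0; c < cols.size(); ++c) {
        uint8_t hits  = cols[c];
        uint8_t below = hits << 1;                 // bit r set if row r-1 is hit
        uint8_t left  = (c > 0) ? cols[c - 1] : 0; // left-neighbour column
        uint8_t seed  = hits & ~below & ~left;     // hit, no claimed neighbour
        for (int r = 0; r < 8; ++r)
            if (seed & (1u << r)) seeds.push_back({(int)c, r});
    }
    return seeds;
}
```

The point of the mask form is that one bitwise expression tests all eight rows of a column at once, which maps well onto GPU threads.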
Velo tracking: search by triplets
• Implemented using an SoA data structure for the VELO reconstruction, which allows efficient use of cache memory
• Modules in the Velo sub-detector are visited only once, reducing iterations
Velo: pixel detector, 52 planes
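A toy version of the triplet search (names and cuts illustrative): for each pair of hits on the outer modules, predict where the straight line through them crosses the middle module and keep any middle hit within a tolerance of that prediction; the Velo sits in a field-free region, so tracks there are straight.

```cpp
#include <array>
#include <cmath>
#include <vector>

struct Hit { float x, z; };   // simplified 1-D hit on a Velo module

std::vector<std::array<int, 3>>
find_triplets(const std::vector<Hit>& m0, const std::vector<Hit>& m1,
              const std::vector<Hit>& m2, float tol) {
    std::vector<std::array<int, 3>> triplets;
    for (int i0 = 0; i0 < (int)m0.size(); ++i0)
        for (int i2 = 0; i2 < (int)m2.size(); ++i2)
            for (int i1 = 0; i1 < (int)m1.size(); ++i1) {
                // Straight line through the outer hits, evaluated at middle z.
                float t = (m1[i1].z - m0[i0].z) / (m2[i2].z - m0[i0].z);
                float x_pred = m0[i0].x + t * (m2[i2].x - m0[i0].x);
                if (std::fabs(m1[i1].x - x_pred) < tol)
                    triplets.push_back({i0, i1, i2});
            }
    return triplets;
}
```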
Algorithm designs #2
Primary vertex finding
• Extrapolate Velo tracks to the beamline and find the z position of each track
• Fill a histogram with the z positions
• Find peaks in the histogram => PV seeds
• Vertex fit using weights
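The histogram steps can be sketched as follows (bin width, threshold and names are illustrative; the weighted vertex fit on each seed is omitted):

```cpp
#include <vector>

// Histogram the track z positions at the beamline and return the bin
// centres of local maxima above `min_count` as primary-vertex seeds.
std::vector<float> pv_seeds(const std::vector<float>& z_at_beam,
                            float zmin, float zmax, int nbins, int min_count) {
    std::vector<int> h(nbins, 0);
    float w = (zmax - zmin) / nbins;
    for (float z : z_at_beam) {
        int b = (int)((z - zmin) / w);
        if (b >= 0 && b < nbins) ++h[b];
    }
    std::vector<float> seeds;
    for (int b = 0; b < nbins; ++b) {
        bool peak = h[b] >= min_count &&
                    (b == 0 || h[b] >= h[b - 1]) &&
                    (b == nbins - 1 || h[b] > h[b + 1]);
        if (peak) seeds.push_back(zmin + (b + 0.5f) * w);
    }
    return seeds;
}
```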
Algorithm designs #3
UT: Strip detector, 4 planes
Pattern recognition
• Extrapolate VELO tracks to the UT planes and define search windows
• Extend Velo tracks with 3/4 UT hits
• Obtain a momentum estimate from a χ2 fit
• UT thresholds: p > 5 GeV, pT > 300 MeV
• Implemented using a k-dimensional tree data structure
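A minimal k-d tree for rectangular search windows might look like this (2-D, median layout, purely illustrative; the in-kernel version is organised differently): build reorders the hits in place, and the range query prunes half-spaces that cannot overlap the window.

```cpp
#include <algorithm>
#include <vector>

struct Point { float c[2]; };   // e.g. (x, y) of a UT hit

// Build: split on alternating coordinates; node i of a subrange is its median.
void build(std::vector<Point>& pts, int lo, int hi, int axis) {
    if (hi - lo <= 1) return;
    int mid = (lo + hi) / 2;
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [axis](const Point& a, const Point& b) {
                         return a.c[axis] < b.c[axis];
                     });
    build(pts, lo, mid, 1 - axis);
    build(pts, mid + 1, hi, 1 - axis);
}

// Collect indices of points inside [mn[0],mx[0]] x [mn[1],mx[1]].
void query(const std::vector<Point>& pts, int lo, int hi, int axis,
           const float mn[2], const float mx[2], std::vector<int>& out) {
    if (hi <= lo) return;
    int mid = (lo + hi) / 2;
    const Point& p = pts[mid];
    if (p.c[0] >= mn[0] && p.c[0] <= mx[0] &&
        p.c[1] >= mn[1] && p.c[1] <= mx[1])
        out.push_back(mid);
    if (mn[axis] <= p.c[axis]) query(pts, lo, mid, 1 - axis, mn, mx, out);
    if (mx[axis] >= p.c[axis]) query(pts, mid + 1, hi, 1 - axis, mn, mx, out);
}
```

The pruning is what turns the search-window lookup from a scan over all hits into a logarithmic-depth walk.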
Algorithm designs #4
SciFi: Scintillating Fibre detector (12 planes of 2 x 2.5 m) (work in progress)
• Determine a search window by extrapolating Velo-UT tracks to the SciFi planes, with a parametrization for the magnetic-field deflection.
• Within this search window, create triplets in the SciFi stations.
• Select the best triplets and extend them to create a track.
Algorithm designs #5
Kalman filter
• Implemented using single precision
• Parametrized transport in the magnetic field
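In one dimension the filter reduces to the following single-precision sketch (a toy straight-line transport stands in for the parametrized field transport; names are illustrative):

```cpp
// State: track x position, slope tx, and position variance P.
struct KF1D { float x, tx, P; };

// Predict: transport the state over dz and inflate the variance by the
// process noise Q (e.g. multiple scattering).
KF1D predict(KF1D s, float dz, float Q = 0.0f) {
    s.x += s.tx * dz;
    s.P += Q;
    return s;
}

// Update: blend the prediction with a measurement of variance R.
KF1D update(KF1D s, float meas, float R) {
    float K = s.P / (s.P + R);   // Kalman gain
    s.x += K * (meas - s.x);
    s.P *= (1.0f - K);           // uncertainty shrinks after the update
    return s;
}
```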
Muon: 4 multi-wire proportional chambers (work in progress)
Muon ID
• Extrapolate SciFi tracks to the muon stations and find the closest hits
• Decide whether the track originates from a muon
Algorithm designs #6
Downstream tracking (work in progress)
• Targets decays outside the VELO (long-lived particles)
• Hard to reconstruct due to the missing VELO input track
• Start with the leftover hits in the UT (after VELO-UT reconstruction)
• Create a seed in the 4 UT planes using an optimized search window
• Extend the seed to the SciFi and create a downstream track
Performance and results #1
Methodology
For all results shown, Allen was configured with the following options and benchmarked on the GPUs listed in the table:
• Sequence: Velo, PV, UT and SciFi decoding
• Number of events: 4000 minimum bias
• Number of streams: 8
• Number of repetitions: 200
Feature          | GeForce GTX 1060 | GeForce GTX 1080 Ti | GeForce RTX 2080 Ti | Tesla T4  | Tesla V100
CUDA cores       | 1280             | 3584                | 4352                | 2560      | 5120
Max freq. (GHz)  | 1.81             | 1.67                | 1.545               | 1.59      | 1.37
L2 cache (MiB)   | 1.5              | 2.75                | 6                   | 6         | 6
DRAM (GiB)       | 5.94 GDDR5       | 10.92 GDDR5         | 10.92 GDDR6         | 16 GDDR6  | 32 HBM2
CUDA capability  | 6.1              | 6.1                 | 7.5                 | 7.5       | 7.0
TDP (W)          | 120              | 250                 | 250                 | 70        | 250
Performance and results #2
Performance benchmarks (event throughput of the configured sequence):
• GeForce GTX 1060: 24 kHz
• GeForce GTX 1080 Ti: 56 kHz
• GeForce RTX 2080 Ti: 88 kHz
• Tesla T4: 51 kHz
• Tesla V100: 112 kHz
Performance and results #3: breakdown of performance
Fraction of the sequence time spent per component: VELO 46%, PV 8%, UT 20%, SciFi 11%, common 15%.
Summary
• The Allen project has been presented in the context of the LHCb detector upgrade.
• We are close to completing the full LHCb HLT1 sequence on GPUs.
• All decoding, clustering and tracking algorithms are implemented in CUDA.
• The framework scales well to new generations of the underlying hardware.
• Physics performance is good and in line with our requirements for the upgrade.
Ongoing and future work
Under Development:
• SciFi tracking
• Muon ID
• Downstream tracking
• Full-sequence execution on AMD ROCm
• Stress testing and integration with the online system
References:
• Allen project: https://gitlab.cern.ch/lhcb-parallelization/Allen
• Talk by Daniel Campora, ACAT 2019: https://indico.cern.ch/event/708041/contributions/3276185/
• Poster by Dorothea Vom Bruch, ACAT 2019: https://indico.cern.ch/event/708041/contributions/3308650/
• LHCb upgrade TDRs: LHCB-TDR-015, LHCB-TDR-016, LHCB-TDR-017 and LHCB-TDR-018
• Talk by Alex Pearce: HOW workshop, real-time analysis
Thanks to all people involved in the development of Allen!
The LHCb upgrade: trigger system
• It is useless to produce more interesting events if we are unable to study them.
• LHCb used to rely on a hardware + software trigger.
• At such high luminosities the hardware trigger cannot cope and starts rejecting random events.
• A full software trigger is needed:
• This means not only reconstruction at 25 ns, but also alignment and calibration.
• Even more pressure on sub-detectors to deliver data fast → need to develop more efficient algorithms.
Backup