CTD 2019, Valencia, 04-04-2019
A GPU High-Level Trigger 1 for the upgraded LHCb detector
Brij Kishor Jashal ([email protected]), Instituto de Física Corpuscular, UV – Valencia
on behalf of the LHCb Collaboration and the Allen team
Outline:
• LHCb upgrade
• LHCb Tracking system
• Introduction to Allen project
• Framework and algorithm designs
• Summary and future work
The LHCb upgrade during LS2 (ongoing)
• The present LHCb detector will be mostly dismantled and a new detector will be installed.
• More than 90% of the active detector channels will be replaced.
• Trigger-less readout, full software trigger.
Compared with the current detector:
• 5x higher instantaneous luminosity: 2×10³³ cm⁻²s⁻¹
• 2x increased efficiency for hadronic signals
• 10x signal yield per unit time
• 6x number of primary interactions
• Triggering is expensive: Bandwidth [GB/s] ∝ Accept rate [kHz] × Event size [kB] (e.g. a 1 MHz accept rate at 100 kB per event already means 100 GB/s).
• We can't do much to reduce the raw event size; it's all or nothing! Or can we?
=> If the event size is reduced, there's room for more physics!
Full stream:
• Zero-suppressed raw detector data.
Turbo stream:
• Only a set of fully reconstructed higher-level physics objects, along with monitoring information.
TurCal stream:
• Calibration information of selected candidates together with raw detector data.
Output bandwidth scenarios will be optimised:
• On the basis of physics
• Based on the fraction of events in FULL and Turbo
• On operational and resource-cost considerations
It is assumed that all flavour-physics channels not involving beauty will be migrated to the Turbo stream in the upgrade.
LHCb Upgrade software trigger
Bandwidth [GB/s] ∝ Accept rate [kHz] × Event size [kB]
LHCb: tracking system and track types
LHCb HLT1
The entire HLT1 involves the decoding of all tracking sub-detectors, clustering, track reconstruction, a Kalman filter, primary-vertex finding and trigger-decision algorithms.
Computing challenge:
• A full software-only trigger must reduce 40 TB/s to a manageable 10 GB/s of output bandwidth.
• It must fit in the time budget.
• Type of workload: small event size, O(150) KiB.
• Events must be processed in parallel to fully utilize the underlying hardware.
• The system must be integrated strategically into the DAQ system (accelerators?).
• In any case, the solution should be cost effective.
Allen:
A standalone R&D project:
• Run the full first stage of the HLT (HLT1) on GPUs
• Process events in parallel
• Exploit data parallelism within events
Requirements:
• A C++14-compliant compiler
• CUDA v10.0
• A graphics card with CUDA support
• CMake 3.12
Allen GitLab repository: https://gitlab.cern.ch/lhcb-parallelization/Allen
A scalable and modular framework, built for running the LHCb HLT1 on GPUs.
Allen Framework: salient features #1
SIMD implementation: parallelism at two levels: block and thread.
• Support for a custom binary input format
• Built-in physics validation
• Pipelined stream sequence
• One event per block, with sub-event parallelism
The computing problem is granular, and this granularity is reflected in the algorithm design.
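The two levels of parallelism can be pictured with plain loops (names hypothetical; in the real framework these are CUDA kernels, not host loops): the outer loop plays the role of the grid, with one block per event, and the inner loop plays the role of a block's threads working on that event's hits.

```cpp
#include <vector>

// Hypothetical event type: just a list of per-hit values.
struct Event { std::vector<float> hits; };

void process_events(std::vector<Event>& events) {
    for (auto& ev : events) {       // "grid" level: one block per event
        for (auto& h : ev.hits)     // "block" level: threads over the hits
            h *= 2.0f;              // stand-in for the per-hit work
    }
}
```

On a GPU the two loops vanish: `blockIdx` selects the event and `threadIdx` the hit, which is what "one event per block along with sub-event parallelism" means.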
Allen Framework: salient features #2
Scalability and modularity:
• The underlying framework is designed to scale with a growing code base
• Easy to add new algorithms, with compile-time sequence configuration
• Compilation with the nvcc / LLVM backend
• Optional compilation with ROOT for the generation of graphs
Memory management
• Memory allocation is done at application start-up
• Custom memory manager for GPU memory
• Developers do not have to invoke memory-allocation routines
• No dependence on dynamic libraries for memory allocation
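A minimal sketch of this idea (illustrative only, not Allen's actual manager): one pool is allocated once at start-up, and algorithms obtain aligned slices from a bump allocator, so no allocation calls happen while events are being processed.

```cpp
#include <cstddef>
#include <new>
#include <vector>

class MemoryManager {
    std::vector<char> pool_;    // single allocation at start-up
    std::size_t offset_ = 0;
public:
    explicit MemoryManager(std::size_t bytes) : pool_(bytes) {}

    // Hand out the next aligned slice of the pool ("bump" allocation).
    void* reserve(std::size_t bytes, std::size_t align = 64) {
        std::size_t start = (offset_ + align - 1) / align * align;
        if (start + bytes > pool_.size()) throw std::bad_alloc();
        offset_ = start + bytes;
        return pool_.data() + start;
    }

    void free_all() { offset_ = 0; }              // reset between batches
    std::size_t used() const { return offset_; }
};
```

Freeing is a single pointer reset between event batches, which is what makes per-event processing allocation-free.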
Allen Framework: salient features #3
Data types and access
• Memory coalescing for optimal use of the global-memory bandwidth
• Data types are consolidated into Structures of Arrays (SoA)
• The SoA layout allows contiguous data-access patterns, optimizing the use of cache memory
Spatial reduction:
• k-dimensional (k-d) tree structure for spatial search-window optimization
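The SoA point can be made concrete with a short sketch (hypothetical types): with an Array-of-Structures the fields of a hit are interleaved, so a kernel reading only x strides through memory; with a Structure-of-Arrays each field is contiguous, so neighbouring threads read neighbouring addresses.

```cpp
#include <vector>

struct HitAoS { float x, y, z; };   // interleaved: x, y, z, x, y, z, ...

struct HitsSoA {                    // one contiguous array per field
    std::vector<float> x, y, z;
    void push(float xi, float yi, float zi) {
        x.push_back(xi); y.push_back(yi); z.push_back(zi);
    }
};

// Touches only the x array: a dense, cache- and coalescing-friendly stream.
float sum_x(const HitsSoA& h) {
    float s = 0.0f;
    for (float v : h.x) s += v;
    return s;
}
```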
Allen Framework: salient features #4
Platform support and portability:
• The algorithm design is not specific to GPUs, but benefits any SIMD processor
• The Intel® SPMD Program Compiler employs an SPMD model in a similar manner to the SIMT model on GPUs
• Compatibility demonstrated by translating algorithms to x86-64 processors
• Porting to AMD's Radeon Open Compute (ROCm) is ongoing
SIMT design:
• Even workload across groups of cores (warps)
• Masking is used to disable and enable the various threads as appropriate
A codebase under development.
Algorithm designs #1
Velo clustering
• Find cluster seeds (using an 8-bit mask)
• Load only the neighbouring pixels of a seed
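To illustrate the bitmask idea, here is a CPU sketch under an assumed layout (names and seed rule hypothetical; the real CUDA kernel differs): each column of an 8-row pixel patch is one 8-bit mask, and a pixel is taken as a seed when it is hit but its left and lower neighbours are not, so each simple cluster contributes exactly one seed.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns (column, row) seed candidates from 8-bit column masks.
std::vector<std::pair<int, int>> find_seeds(const std::vector<uint8_t>& cols) {
    std::vector<std::pair<int, int>> seeds;
    for (std::size_t c = 0; c < cols.size(); ++c) {
        uint8_t hits  = cols[c];
        uint8_t below = hits << 1;                 // bit r set if row r-1 is hit
        uint8_t left  = (c > 0) ? cols[c - 1] : 0; // left-neighbour column
        uint8_t seed  = hits & ~below & ~left;     // hit, no claimed neighbour
        for (int r = 0; r < 8; ++r)
            if (seed & (1u << r)) seeds.push_back({(int)c, r});
    }
    return seeds;
}
```

The point of the mask form is that one bitwise expression tests all eight rows of a column at once, which maps well onto GPU threads.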
Velo tracking: search by triplets
• Implemented using an SoA data structure for the VELO reconstruction, which allows efficient use of cache memory
• Modules in the Velo sub-detector are visited only once, reducing iterations
Velo: pixel detector, 52 planes
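A toy version of the triplet search (names and cuts illustrative): for each pair of hits on the outer modules, predict where the straight line through them crosses the middle module and keep any middle hit within a tolerance of that prediction; the Velo sits in a field-free region, so tracks there are straight.

```cpp
#include <array>
#include <cmath>
#include <vector>

struct Hit { float x, z; };   // simplified 1-D hit on a Velo module

std::vector<std::array<int, 3>>
find_triplets(const std::vector<Hit>& m0, const std::vector<Hit>& m1,
              const std::vector<Hit>& m2, float tol) {
    std::vector<std::array<int, 3>> triplets;
    for (int i0 = 0; i0 < (int)m0.size(); ++i0)
        for (int i2 = 0; i2 < (int)m2.size(); ++i2)
            for (int i1 = 0; i1 < (int)m1.size(); ++i1) {
                // Straight line through the outer hits, evaluated at middle z.
                float t = (m1[i1].z - m0[i0].z) / (m2[i2].z - m0[i0].z);
                float x_pred = m0[i0].x + t * (m2[i2].x - m0[i0].x);
                if (std::fabs(m1[i1].x - x_pred) < tol)
                    triplets.push_back({i0, i1, i2});
            }
    return triplets;
}
```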
Algorithm designs #2
Primary vertex finding
• Extrapolate Velo tracks to the beamline and find the z position of each track
• Fill a histogram with the z positions
• Find peaks in the histogram => PV seeds
• Vertex fit using weights
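The histogram steps can be sketched as follows (bin width, threshold and names are illustrative; the weighted vertex fit on each seed is omitted):

```cpp
#include <vector>

// Histogram the track z positions at the beamline and return the bin
// centres of local maxima above `min_count` as primary-vertex seeds.
std::vector<float> pv_seeds(const std::vector<float>& z_at_beam,
                            float zmin, float zmax, int nbins, int min_count) {
    std::vector<int> h(nbins, 0);
    float w = (zmax - zmin) / nbins;
    for (float z : z_at_beam) {
        int b = (int)((z - zmin) / w);
        if (b >= 0 && b < nbins) ++h[b];
    }
    std::vector<float> seeds;
    for (int b = 0; b < nbins; ++b) {
        bool peak = h[b] >= min_count &&
                    (b == 0 || h[b] >= h[b - 1]) &&
                    (b == nbins - 1 || h[b] > h[b + 1]);
        if (peak) seeds.push_back(zmin + (b + 0.5f) * w);
    }
    return seeds;
}
```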
Algorithm designs #3
UT: Strip detector, 4 planes
Pattern recognition
• Extrapolate VELO tracks to the UT planes and define search windows
• Extend Velo tracks with 3/4 UT hits
• Obtain a momentum estimate from a χ2 fit
• UT thresholds: p > 5 GeV, pT > 300 MeV
• Implemented using a k-dimensional tree data structure
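A minimal k-d tree for rectangular search windows might look like this (2-D, median layout, purely illustrative; the in-kernel version is organised differently): build reorders the hits in place, and the range query prunes half-spaces that cannot overlap the window.

```cpp
#include <algorithm>
#include <vector>

struct Point { float c[2]; };   // e.g. (x, y) of a UT hit

// Build: split on alternating coordinates; node i of a subrange is its median.
void build(std::vector<Point>& pts, int lo, int hi, int axis) {
    if (hi - lo <= 1) return;
    int mid = (lo + hi) / 2;
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [axis](const Point& a, const Point& b) {
                         return a.c[axis] < b.c[axis];
                     });
    build(pts, lo, mid, 1 - axis);
    build(pts, mid + 1, hi, 1 - axis);
}

// Collect indices of points inside [mn[0],mx[0]] x [mn[1],mx[1]].
void query(const std::vector<Point>& pts, int lo, int hi, int axis,
           const float mn[2], const float mx[2], std::vector<int>& out) {
    if (hi <= lo) return;
    int mid = (lo + hi) / 2;
    const Point& p = pts[mid];
    if (p.c[0] >= mn[0] && p.c[0] <= mx[0] &&
        p.c[1] >= mn[1] && p.c[1] <= mx[1])
        out.push_back(mid);
    if (mn[axis] <= p.c[axis]) query(pts, lo, mid, 1 - axis, mn, mx, out);
    if (mx[axis] >= p.c[axis]) query(pts, mid + 1, hi, 1 - axis, mn, mx, out);
}
```

The pruning is what turns the search-window lookup from a scan over all hits into a logarithmic-depth walk.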
Algorithm designs #4
SciFi: Scintillating Fibre detector (12 planes of 2 x 2.5 m) (work in progress)
• Determine a search window by extrapolating Velo-UT tracks to the SciFi planes, with a parametrization for the magnetic-field deflection.
• Within this search window, create triplets in the SciFi stations.
• Select the best triplets and extend them to create a track.
Algorithm designs #5
Kalman filter
• Implemented using single precision
• Parametrized transport in the magnetic field
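In one dimension the filter reduces to the following single-precision sketch (a toy straight-line transport stands in for the parametrized field transport; names are illustrative):

```cpp
// State: track x position, slope tx, and position variance P.
struct KF1D { float x, tx, P; };

// Predict: transport the state over dz and inflate the variance by the
// process noise Q (e.g. multiple scattering).
KF1D predict(KF1D s, float dz, float Q = 0.0f) {
    s.x += s.tx * dz;
    s.P += Q;
    return s;
}

// Update: blend the prediction with a measurement of variance R.
KF1D update(KF1D s, float meas, float R) {
    float K = s.P / (s.P + R);   // Kalman gain
    s.x += K * (meas - s.x);
    s.P *= (1.0f - K);           // uncertainty shrinks after the update
    return s;
}
```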
Muon: 4 multi-wire proportional chambers (work in progress)
Muon ID
• Extrapolate SciFi tracks to the muon stations and find the closest hits
• Decide whether the track originates from a muon
Algorithm designs #6
Downstream tracking (work in progress)
• Targets decays outside the VELO (long-lived particles)
• Hard to reconstruct due to the missing VELO input track
• Start with the leftover hits in the UT (after VELO-UT reconstruction)
• Create a seed in the 4 UT planes using an optimized search window
• Extend the seed to the SciFi and create a downstream track
Performance and results #1
Methodology
For all results shown, Allen was configured with the following options and benchmarked on the GPUs listed in the table:
• Sequence: Velo, PV, UT and SciFi decoding
• Number of events: 4000 minimum bias
• Number of streams: 8
• Number of repetitions: 200
Feature          | GeForce GTX 1060 | GeForce GTX 1080 Ti | GeForce RTX 2080 Ti | Tesla T4  | Tesla V100
CUDA cores       | 1280             | 3584                | 4352                | 2560      | 5120
Max freq. (GHz)  | 1.81             | 1.67                | 1.545               | 1.59      | 1.37
L2 cache (MiB)   | 1.5              | 2.75                | 6                   | 6         | 6
DRAM (GiB)       | 5.94 GDDR5       | 10.92 GDDR5         | 10.92 GDDR6         | 16 GDDR6  | 32 HBM2
CUDA capability  | 6.1              | 6.1                 | 7.5                 | 7.5       | 7.0
TDP (W)          | 120              | 250                 | 250                 | 70        | 250
Performance and results #2
Performance benchmarks (event throughput of the configured sequence):
• GeForce GTX 1060: 24 kHz
• GeForce GTX 1080 Ti: 56 kHz
• GeForce RTX 2080 Ti: 88 kHz
• Tesla T4: 51 kHz
• Tesla V100: 112 kHz
Performance and results #3: breakdown of performance
Fraction of the sequence time spent per component: VELO 46%, PV 8%, UT 20%, SciFi 11%, common 15%.
Summary
• The Allen project has been presented in the context of the LHCb detector upgrade.
• We are close to completing the full LHCb HLT1 sequence on GPUs.
• All decoding, clustering and tracking algorithms are implemented in CUDA.
• The framework scales well to new generations of the underlying hardware.
• Physics performance is good and in line with our requirements for the upgrade.
Ongoing and future work
Under Development:
• SciFi tracking
• Muon ID
• Downstream tracking
• Full-sequence execution on AMD ROCm
• Stress testing and integration with the online system
References:
• Allen project: https://gitlab.cern.ch/lhcb-parallelization/Allen
• Talk by Daniel Campora, ACAT 2019: https://indico.cern.ch/event/708041/contributions/3276185/
• Poster by Dorothea Vom Bruch, ACAT 2019: https://indico.cern.ch/event/708041/contributions/3308650/
• LHCb upgrade TDRs: LHCB-TDR-015, LHCB-TDR-016, LHCB-TDR-017 and LHCB-TDR-018
• Talk by Alex Pearce: HOW workshop, real-time analysis
Thanks to all people involved in the development of Allen!
The LHCb upgrade: trigger system
• It is useless to produce more interesting events if we are unable to study them.
• LHCb used to rely on a hardware + software trigger.
• At such high luminosities the hardware trigger cannot cope and starts rejecting random events.
• A full software trigger is needed:
• This means not only reconstruction at 25 ns, but also alignment and calibration.
• Even more pressure on sub-detectors to deliver data fast → need to develop more efficient algorithms.
Backup