Niko Neufeld "A 32 Tbit/s Data Acquisition System"
Transcript of Niko Neufeld "A 32 Tbit/s Data Acquisition System"
LHCb Trigger & DAQ: An Introductory Overview
Niko Neufeld, CERN/PH Department
Yandex, Moscow, July 3rd
The Large Hadron Collider
Physics, Detectors, Trigger & DAQ
[Diagram: rare physics → need many collisions → high-rate collider → fast electronics. Signals flow from the detector into the Data Acquisition; the Trigger sends decisions to the read-out; the Event Filter passes selected data on to Mass Storage.]
The Data Acquisition Challenge at LHC
• 15 million detector channels
• @ 40 MHz
• = ~15 × 1,000,000 × 40 × 1,000,000 bytes/s
• = ~600 TB/s
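As a sanity check, here is the arithmetic behind this estimate (the one byte per channel per crossing is the slide's implicit assumption):

```python
# Back-of-the-envelope estimate of the raw LHC detector data rate,
# assuming ~1 byte per channel per bunch crossing (as on the slide).
channels = 15e6          # detector channels
crossing_rate = 40e6     # bunch crossings per second (40 MHz)
bytes_per_channel = 1    # assumed payload per channel per crossing

rate_bytes = channels * crossing_rate * bytes_per_channel
print(f"{rate_bytes / 1e12:.0f} TB/s")   # -> 600 TB/s
```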
Should we read everything?
• A typical collision is “boring”
– Although we also need some of these “boring” data as a cross-check, a calibration tool, and for some important “low-energy” physics
• “Interesting” physics (EWK & top) is about 6–8 orders of magnitude rarer
• “Exciting” physics involving new particles/discoveries is 9 orders of magnitude below the total rate:
– 100 GeV Higgs: 0.1 Hz*
– 600 GeV Higgs: 0.01 Hz
• We just need to efficiently identify these rare processes against the overwhelming background before reading out & storing the whole event
[Figure: event-rate scale, from ~10⁹ Hz total, via ~5×10⁶ Hz, EWK at 20–100 Hz, down to ~10 Hz]
*Note: this is just the production rate; properly finding it is much rarer!
Know Your Enemy: pp Collisions at 14 TeV at 10³⁴ cm⁻²s⁻¹
• σ(pp) = 70 mb → more than 7 × 10⁸ interactions/s (!)
• In ATLAS and CMS*, 20–30 minimum-bias events overlap
• H → ZZ → 4 muons: the cleanest (“golden”) signature
Reconstructed tracks with pT > 25 GeV
And this (not the H, though…) repeats every 25 ns…
*) LHCb @ 4×10³³ cm⁻²s⁻¹ isn’t much nicer, and ALICE (Pb-Pb) is even busier.
Trivial DAQ with a real trigger (2)
[Diagram: Sensor → Delay → ADC → Processing → storage. A Discriminator on the sensor signal forms the Trigger, which Starts the ADC; Busy Logic (a Set/Clear flip-flop with an “and not” gate) blocks new triggers until Processing signals Ready via an Interrupt.]
Deadtime (%) is the ratio between the time the DAQ is busy and the total time.
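The slide only defines dead time; as an aside, here is a minimal sketch of the standard non-paralyzable dead-time model (textbook material, not from the slides), showing how the accepted rate and the busy fraction follow from the trigger rate and the per-event busy time:

```python
# Non-paralyzable dead-time model (standard formula, not from the slides):
# each accepted trigger makes the DAQ busy for `tau` seconds.
def accepted_rate(input_rate_hz: float, tau_s: float) -> float:
    """Accepted trigger rate for a non-paralyzable dead time tau."""
    return input_rate_hz / (1.0 + input_rate_hz * tau_s)

def dead_time_fraction(input_rate_hz: float, tau_s: float) -> float:
    """Fraction of the total time the DAQ is busy."""
    return accepted_rate(input_rate_hz, tau_s) * tau_s

# Example: 100 kHz of triggers, 2 us of busy time per accepted trigger
f, tau = 100e3, 2e-6
print(f"accepted: {accepted_rate(f, tau):.0f} Hz, "
      f"dead time: {dead_time_fraction(f, tau):.1%}")
```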
A “simple” 40 MHz track trigger – the LHCb PileUp system
Finding vertices in FPGAs
• Use the r-coordinates of hits in the Si-detector discs (the detector geometry is made for this task!)
• Find coincidences between hits on two discs
• Count & histogram (see the sketch below)
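A hedged software sketch of the idea (the FPGA firmware on the slide is of course organized differently): two hits with radii r1, r2 on discs at z1, z2 extrapolate to a beam-axis origin zv = (r2·z1 - r1·z2)/(r2 - r1), so histogramming zv over all hit pairs makes each vertex appear as a peak.

```python
import numpy as np

def find_vertex(hits1, hits2, z1, z2, nbins=100, zrange=(-200.0, 200.0)):
    """Histogram the beam-axis intercepts of all hit pairs.

    hits1, hits2: arrays of hit radii r on two silicon discs located at
    z1 and z2 along the beam axis. A straight track from a vertex (0, zv)
    satisfies zv = (r2*z1 - r1*z2) / (r2 - r1). The histogram peak is
    the vertex candidate.
    """
    r1 = np.asarray(hits1)[:, None]   # all pairings via broadcasting
    r2 = np.asarray(hits2)[None, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        zv = (r2 * z1 - r1 * z2) / (r2 - r1)
    zv = zv[np.isfinite(zv)]
    counts, edges = np.histogram(zv, bins=nbins, range=zrange)
    peak = counts.argmax()
    return 0.5 * (edges[peak] + edges[peak + 1])

# Toy example: one vertex at z = -30 mm, discs at z = 100 and 200 mm
zv_true, slopes = -30.0, np.linspace(0.05, 0.3, 20)
r_disc1 = slopes * (100.0 - zv_true)
r_disc2 = slopes * (200.0 - zv_true)
print(f"estimated vertex z = {find_vertex(r_disc1, r_disc2, 100.0, 200.0):.1f} mm")
```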
LHCb PileUp: finding multiple vertices and quality
Comparing with the “offline” truth (full tracking, calibration, alignment)
LHCb Pileup Algorithm
• Time budget for this algorithm: about 2 µs
• Runs in conventional FPGAs in a radiation-safe area
• Limited to low pile-up (OK for LHCb)
After the Trigger: Detector Read-out and DAQ
DAQ design guidelines
• Scalability: cope with changes in event size and luminosity (pile-up!)
• Robustness (very little dead time, high efficiency, non-expert operators) → intelligent control systems
• Use industry-standard, commercial technologies (long-term maintenance) → PCs, Ethernet
• Low cost → PCs, standard LANs
• High bandwidth (many Gigabytes/s) → use local area networks (LANs)
• “Creative” & “Flexible” (open for new things) → use software and reconfigurable logic (FPGAs)
One network to rule them all
• Ethernet, IEEE 802.3xx, has almost become synonymous with Local Area Networking
• Ethernet has many nice features: cheap, simple, cheap, etc.
• Ethernet does not:
– guarantee delivery of messages
– allow multiple network paths
– provide quality of service or bandwidth assignment (although, to a varying degree, many switches do provide this)
• Because of this, raw Ethernet is rarely used; usually it serves as a transport medium for IP, UDP, TCP, etc.
Ethernet
• Flow control in standard Ethernet is only defined between immediate neighbours
• A sending station is free to throw away Xoff’ed frames (and often does)
[Diagram: Xoff vs. data between neighbouring stations]
Generic DAQ implemented on a LAN
Typical number of pieces:
• Detector: 1
• Custom links from the detector: ~1000
• “Readout Units” for protocol adaptation: 100 to 1000
• Powerful core routers: 2 to 8
• Edge switches: 50 to 100
• Servers for event filtering: > 1000
Congestion
• “Bang” translates into random, uncontrolled packet loss
• In Ethernet this is perfectly valid behaviour, and it is implemented by many low-latency devices
• The problem comes from synchronized sources sending to the same destination at the same time
• Either a higher-level “event-building” protocol avoids this congestion, or the switches must avoid packet loss with deep buffer memories
[Diagram: two synchronized flows of bandwidth 2 converge on a single output of bandwidth 2: “Bang”]
Push-Based Event Building with store-and-forward switching and load balancing
[Diagram: three Event Builders (EB1–EB3) behind the Data Acquisition Switch, with the Event Manager keeping a per-builder capacity count; the builders announce “Send me an event!”, and the manager replies “Send next event to EB1/EB2/EB3”.]
1. Event Builders notify the Event Manager of available capacity
2. The Event Manager ensures that data are sent only to nodes with available capacity
3. The readout system relies on feedback from the Event Builders
Sources do not buffer, so the switch must buffer to avoid packet loss due to overcommitment.
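A minimal sketch of this push scheme (class and method names are hypothetical; the real components are distributed processes exchanging network messages, not function calls):

```python
from collections import deque

class EventManager:
    """Assigns each new event to an Event Builder that reported capacity."""
    def __init__(self):
        self.credits = deque()           # builders that said "Send me an event!"

    def report_capacity(self, builder):  # step 1: EB notifies the manager
        self.credits.append(builder)

    def assign(self):                    # step 2: only nodes with capacity
        if not self.credits:
            raise RuntimeError("no Event Builder ready; sources would overrun")
        return self.credits.popleft()    # round-robin load balancing

class EventBuilder:
    def __init__(self, name, manager):
        self.name, self.manager = name, manager
        manager.report_capacity(self)

    def receive(self, fragments):
        event = b"".join(fragments)          # build the complete event
        self.manager.report_capacity(self)   # step 3: feedback to the manager
        return event

# Sources push fragments; the switch must buffer the simultaneous sends.
manager = EventManager()
builders = [EventBuilder(f"EB{i}", manager) for i in (1, 2, 3)]
for event_id in range(5):
    target = manager.assign()                  # "Send next event to EB1" ...
    fragments = [bytes(8) for _ in range(4)]   # one fragment per source
    target.receive(fragments)
    print(f"event {event_id} built by {target.name}")
```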
LHCb DAQ
[Diagram: LHCb DAQ architecture. The front-end electronics of the sub-detectors (VELO, ST, OT, RICH, ECal, HCal, Muon, L0 trigger) feed the Readout Boards; the L0 trigger and the LHC clock drive the TFC system, which distributes timing and fast control signals and the MEP requests; the Readout Boards send over the READOUT NETWORK (Ethernet switches) into the HLT farm (CPU nodes behind edge switches) and a MON farm, with event building in the farm nodes; everything is supervised by the Experiment Control System (ECS). Legend: event data; timing and fast control signals; control and monitoring data.]
55 GB/s into the readout network; 200–300 MB/s to storage.
Average event size: 55 kB. Average rate into the farm: 1 MHz. Average rate to tape: 4–5 kHz.
LHCb DAQ
• Events are very small: about 55 kB in total; each read-out board contributes only about 200 bytes(!)
– A UDP message on Ethernet costs 8 + 14 + 20 + 8 + 4 = 54 bytes → ~25% overhead(!)
• LHCb therefore coalesces messages, packing about 10 to 15 events into one message (called an MEP, sketched below) → the message rate is ~80 kHz (cf. CMS, ATLAS)
• The protocol is a simple, single-stage push; every farm node builds complete events; the TTC system is used to assign IP addresses coherently to the read-out boards
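A hedged sketch of the coalescing idea (the header layout below is invented for illustration and is not the real MEP format):

```python
import struct

# Illustrative MEP-style coalescing: pack ~10-15 event fragments into one
# datagram to amortize the ~54-byte UDP/IP/Ethernet overhead. The header
# layout (first event number + fragment count + per-fragment lengths) is
# invented for this sketch, not the real LHCb MEP format.
def pack_mep(first_event_no: int, fragments: list[bytes]) -> bytes:
    header = struct.pack("!IH", first_event_no, len(fragments))
    lengths = struct.pack(f"!{len(fragments)}H", *map(len, fragments))
    return header + lengths + b"".join(fragments)

fragments = [b"\x00" * 200 for _ in range(12)]   # ~200 bytes per board/event
mep = pack_mep(1000, fragments)
overhead = 54 / len(mep)    # one UDP/IP/Ethernet header for 12 events
print(f"MEP size {len(mep)} B, per-datagram overhead {overhead:.1%}")
```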
DAQ network parameters (link load, technology, protocol, event building)
• ALICE: 30% link load; Ethernet, TCP/IP, pull; InfiniBand (HLT), pull (RDMA)
• ATLAS: 20% link load (L2), 50% (event collection); 10 Gbit/s Ethernet, TCP/IP, pull
• CMS: 65% link load, Myrinet, push (with credits); 40–80% link load, Ethernet, TCP/IP, pull
• LHCb: 40–80% link load; Ethernet, UDP, push
LHC Trigger/DAQ parameters (as seen 2011/12)
(# trigger levels; Level-0/1/2 trigger rate (Hz); event size (B); network bandwidth (GB/s); storage (MB/s, events/s))
• ALICE: 4 levels; Pb-Pb 500 Hz, 5×10⁷ B, 25 GB/s, 4000 MB/s (10² events/s); p-p 10³ Hz, 2×10⁶ B, 200 MB/s (10² events/s)
• ATLAS: 3 levels; LV-1 10⁵ Hz, LV-2 3×10³ Hz; 1.5×10⁶ B; 6.5 GB/s; 700 MB/s (6×10² events/s)
• CMS: 2 levels; LV-1 10⁵ Hz; 10⁶ B; 100 GB/s; ~1000 MB/s (10² events/s)
• LHCb: 2 levels; LV-0 10⁶ Hz; 5.5×10⁴ B; 55 GB/s; 250 MB/s (4.5×10³ events/s)
High Level Trigger Farms
And that, in simple terms, is what we do in the High Level Trigger
Online Trigger Farms 2012 (ALICE / ATLAS / CMS / LHCb)
• # cores (+ hyperthreading): 2700 / 17000 / 13200 / 15500
• # servers (mainboards): – / ~2000 / ~1300 / 1574
• Total available cooling power (kW): ~500 / ~820 / 800 / 525
• Total available rack space (Us): ~2000 / 2400 / ~3600 / 2200
• CPU type(s): AMD Opteron, Intel 54xx, Intel 56xx / Intel 54xx, Intel 56xx / Intel 54xx, Intel 56xx, Intel E5-2670 / Intel 5450, Intel 5650, AMD 6220
And counting…
LHC planning (not yet approved!)
• Long Shutdown 1 (LS1): CMS: Myrinet → InfiniBand / Ethernet; ATLAS: merge L2 and Event Collection infrastructures
• Long Shutdown 2 (LS2): ALICE continuous read-out; LHCb 40 MHz read-out
• Long Shutdown 3 (LS3): CMS track-trigger
Motivation
• The LHC (Large Hadron Collider) collides protons every 25 ns (40 MHz)
• Each collision produces about 100 kB of data in the detector
• Currently a pre-selection in custom electronics rejects 97.5% of these events; unfortunately many of them contain interesting physics
• In 2017 the detector will be changed so that all events can be read out into a standard compute platform for detailed inspection
LHCb after LS2
• Ready for an all-software trigger (resources permitting)
• Zero-suppression on the front-end electronics is mandatory!
• Event size about 100 kB, readout rate up to 40 MHz
• Will need a network scalable up to 32 Tbit/s: InfiniBand, 10/40/100 Gigabit Ethernet?
Key figures
• Minimum required bandwidth: > 32 Tbit/s
• # of 100 Gigabit/s links: > 320
• # of compute units: > 1500
• An event (“snapshot of a collision”) is about 100 kB of data
• # of events processed every second: 10 to 40 million
• # of events retained after filtering: 20,000 to 30,000 (a data reduction of at least a factor 1000)
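These figures are mutually consistent, as a quick check with the slide's own numbers shows:

```python
# Consistency check of the key figures on this slide.
event_size_bits = 100e3 * 8          # ~100 kB per event
event_rate = 40e6                    # up to 40 million events/s
bandwidth = event_size_bits * event_rate
print(f"bandwidth:  {bandwidth / 1e12:.0f} Tbit/s")   # -> 32 Tbit/s
print(f"100G links: {bandwidth / 100e9:.0f}")         # -> 320
print(f"reduction:  {event_rate / 30000:.0f}x")       # -> well above 1000
```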
LHCb DAQ as of 2018
[Diagram: Detector → (100 m of rock) → Readout Units → DAQ network → Compute Units]
• GBT: custom radiation-hard link over MMF, 3.2 Gbit/s (about 10,000 links)
• Input into the DAQ network (10/40 Gigabit Ethernet or FDR IB): 1000 to 4000 links
• Output from the DAQ network into the compute-unit clusters (100 Gbit Ethernet / EDR IB): 200 to 400 links
• Compute units could be servers with GPUs or other coprocessors
Readout Unit
• The Readout Unit needs to collect the custom links
• Some pre-processing
• Buffering
• Coalescing of data fragments → reduces the message rate / transport overheads
• Needs an FPGA
• Sends data using a standard network protocol (IB, Ethernet)
• Sending of data can be done directly from the FPGA or via standard network silicon
• Works together with the Compute Units to build events
Compute Unit
• A compute unit is a destination for the event-data fragments from the readout units
• It assembles the fragments into a complete “event” and runs various selection algorithms on this event (see the sketch below)
• About 0.1% of events are retained
• A compute unit will be a high-density server platform (a mainboard with standard CPUs), probably augmented with a co-processor card (like an Intel MIC or a GPU)
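A minimal sketch of the assembly step (the names and the completeness criterion, one fragment per readout unit, are assumptions of this sketch):

```python
# Sketch of fragment assembly in a compute unit: fragments from the
# readout units arrive independently and possibly out of order; an
# event is complete once every readout unit has contributed.
N_READOUT_UNITS = 4          # assumed number of sources for this sketch

class EventAssembler:
    def __init__(self):
        self.partial = {}    # event_id -> {readout_unit_id: fragment}

    def add_fragment(self, event_id, ru_id, payload):
        frags = self.partial.setdefault(event_id, {})
        frags[ru_id] = payload
        if len(frags) == N_READOUT_UNITS:          # event complete
            del self.partial[event_id]
            return b"".join(frags[i] for i in sorted(frags))
        return None

def select(event: bytes) -> bool:
    """Stand-in for the selection algorithms; keeps ~0.1% of events."""
    return hash(event) % 1000 == 0

assembler = EventAssembler()
for event_id in range(3):
    for ru in range(N_READOUT_UNITS):       # fragments trickle in
        event = assembler.add_fragment(event_id, ru, bytes([ru] * 8))
        if event is not None:
            print(f"event {event_id} assembled ({len(event)} B), "
                  f"retained: {select(event)}")
```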
Future DAQ systems: trends
• Certainly LAN-based
– InfiniBand deserves a serious evaluation for high bandwidth (> 100 GB/s)
– With Ethernet, if DCB works, we might be able to build networks from smaller units; otherwise we will stay with large store-and-forward boxes
• The trend towards “trigger-free” (do everything in software → bigger DAQ) will continue
– Physics data-handling in commodity CPUs
• Will there be a place for many-core / coprocessor cards (Intel MIC / CUDA)?
– IMHO this will depend on whether we can establish a development framework which allows long-term maintenance of the software by non-“geek” users, much more than on the actual technology
Fat-Tree Topology for One Slice
• 48-port 10 GbE switches
• Mix readout boards (ROBs) and filter-farm servers in one switch:
– 15 × readout boards
– 18 × servers
– 15 × uplinks
• Non-blocking switching → use 65% of the installed bandwidth (classical DAQ: only 50%)
• Each slice accommodates:
– 690 × inputs (ROBs)
– 828 × outputs (servers)
• The ratio (servers/ROBs) is adjustable (see the arithmetic check below)
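The slice numbers follow directly from the per-switch split; a quick check (the 46-switch count per slice is inferred from 690/15):

```python
# Check of the slice arithmetic: each 48-port switch carries
# 15 ROBs + 18 servers + 15 uplinks.
ports, robs, servers, uplinks = 48, 15, 18, 15
assert robs + servers + uplinks == ports

switches = 690 // robs                                    # 690 ROB inputs per slice
print(f"edge switches per slice:  {switches}")            # -> 46
print(f"server outputs per slice: {switches * servers}")  # -> 828
```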
Pull-Based Event Building
[Diagram: three Event Builders (EB1–EB3) behind the Data Acquisition Switch, with the Event Manager keeping per-builder capacity counts; the builders announce “Send me an event!”, the manager replies “EB1, get next event”, and the elected builder tells every source “Send event to EB1!”.]
1. Event Builders notify the Event Manager of available capacity
2. The Event Manager elects an event-builder node
3. The readout traffic is driven by the Event Builders
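For contrast with the push sketch earlier, a minimal sketch of the pull variant (again with hypothetical names, and with message passing reduced to function calls):

```python
from collections import deque

# Pull-based event building: the elected Event Builder itself asks every
# source for its fragment, so the readout traffic is paced by the
# builders rather than by the sources.
class Source:
    def __init__(self, source_id):
        self.source_id = source_id

    def get_fragment(self, event_id):    # "Send event to EB n!"
        return bytes([self.source_id]) * 8

def run_pull(n_events, sources, builders):
    ready = deque(builders)              # step 1: builders report capacity
    for event_id in range(n_events):
        builder = ready.popleft()        # step 2: the manager elects a builder
        # step 3: the elected builder drives the readout traffic
        event = b"".join(s.get_fragment(event_id) for s in sources)
        print(f"event {event_id} pulled by {builder} ({len(event)} B)")
        ready.append(builder)            # capacity available again

run_pull(4, [Source(i) for i in range(3)], ["EB1", "EB2", "EB3"])
```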
Summary
• Large modern DAQ systems are based entirely (or mostly) on Ethernet and big PC-server farms
• Bursty, uni-directional traffic is a challenge for the network and the receivers, and requires substantial buffering in the switches
• The future:
– It seems that buffering in switches is being reduced (latency vs. buffering)
– Advanced flow control is coming, but it remains to be tested whether it is sufficient for DAQ
– Ethernet is still strongest, but InfiniBand looks like a very interesting alternative
– Integrated protocols (RDMA) can offload servers, but will be more complex
– The integration of GPUs, non-Intel processors and other many-cores will need to be studied
• For the DAQ and triggering the question is not whether we can do it, but how we can do it so that we can afford it!
More Stuff
Cut-through switching: Head-of-Line blocking
[Diagram: the packet at the head of an input FIFO is destined for a busy port, so the packet behind it, addressed to node 4, must wait even though the port to node 4 is free]
• The reason for this is the First-In-First-Out (FIFO) structure of the input buffer
• Queuing theory tells us* that for random traffic (and infinitely many switch ports) the throughput of the switch will go down to 58.6%; that means on a 100 Mbit/s network the nodes will “see” effectively only ~58 Mbit/s
*) “Input Versus Output Queueing on a Space-Division Packet Switch”; Karol, M. et al.; IEEE Trans. Comm., 35/12
Event-building
[Diagram: Detector → (100 m of rock) → Readout Units → DAQ network → Compute Units. GBT: custom radiation-hard links over MMF, 3.2 Gbit/s (about 10,000); input into the DAQ network (10/40 Gigabit Ethernet or FDR IB): 1000 to 4000 links; output into the compute-unit clusters (100 Gbit Ethernet / EDR IB): 200 to 400 links.]
Readout Units send to Compute Units; Compute Units receive passively: a “push architecture”.
Runcontrol
[Cartoon: © Warner Bros.]
Runcontrol challenges
• Start, configure and control O(10000) processes on farms of several thousand nodes
• Configure and monitor O(10000) front-end elements
• Fast database access, caching, pre-loading, parallelization; and all of this 100% reliable!
Runcontrol technologies
• Communication: CORBA (ATLAS); HTTP/SOAP (CMS); DIM (LHCb, ALICE)
• Behaviour & automation: SMI++ (ALICE); CLIPS (ATLAS); RCMS (CMS); SMI++ (in PVSS, also used in the DCS)
• Job/process control: based on XDAQ, CORBA, …; FMC/PVSS (LHCb, also does fabric monitoring)
• Logging: log4c, log4j, syslog, FMC (again), …