Future experiment specific needs for LHCb
OpenFabrics/Infiniband Workshop at CERN
Monday, June 26
Sai Suman Cherukuwada and Niko Neufeld, CERN/PH
LHCb Trigger-DAQ system: Today
• LHC crossing rate: 40 MHz
• Visible events: 10 MHz
• Two-stage trigger system
  – Level-0: synchronous, in hardware; 40 MHz → 1 MHz
  – High Level Trigger (HLT): software on a CPU farm; 1 MHz → 2 kHz
• Front-end Electronics (FE): interface to the Readout Network
• Readout network
  – Gigabit Ethernet LAN
  – Full readout at 1 MHz
• Event filter farm
  – ~1800 to 2200 1U servers
[Diagram: the L0 trigger and the Timing and Fast Control system steer the Front-end Electronics (FE); the FE boards feed the Readout Network, which distributes events to the CPUs of the filter farm and on to permanent storage.]
LHCb DAQ system: features
• On average, every 1 µs new data become available at each of ~300 sources (= custom electronics boards, "TELL1")
• Data from several 1 µs cycles (= "triggers") are concatenated into one IP packet → this reduces the message/packet rate
• IP packets are pushed over 1000BaseT links; the short distances allow using 1000BaseT throughout
• The destination IP address is assigned synchronously to all TELL1s via a custom optical network (TTC)
• For each trigger a PC server must receive the IP packets from all TELL1 boards ("event-building"; sketched below)
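As an illustration of what event-building means in practice, here is a minimal Python sketch (not LHCb's actual online software; the class, field names and the trigger/source identifiers are invented for this example): fragments are collected per trigger number until one has arrived from every source.

```python
# Minimal event-building sketch (illustration only, not LHCb's online software).
# Assumption: every incoming packet carries a trigger (event) ID and a source ID,
# and an event is complete once a fragment from each of n_sources boards arrived.
from collections import defaultdict

class EventBuilder:
    def __init__(self, n_sources):
        self.n_sources = n_sources
        self.pending = defaultdict(dict)          # event_id -> {source_id: payload}

    def add_fragment(self, event_id, source_id, payload):
        """Store one fragment; return the complete event when the last one arrives."""
        fragments = self.pending[event_id]
        fragments[source_id] = payload
        if len(fragments) == self.n_sources:
            return self.pending.pop(event_id)     # hand the full event to the trigger process
        return None

# Usage with 3 sources (the real system has ~300 TELL1 boards):
eb = EventBuilder(n_sources=3)
eb.add_fragment(42, 0, b"velo-data")
eb.add_fragment(42, 1, b"rich-data")
event = eb.add_fragment(42, 2, b"calo-data")
assert event is not None and len(event) == 3
```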
Terminology
• channel: elementary sensitive element = 1 ADC = 8 to 10 bits. The entire detector comprises millions of channels
• event: all data fragments (comprising several channels) created at the same discrete time together form an event. It is an electronic snapshot of the detector response to the original physics reaction
• zero-suppression: send only the channel numbers of channels with a non-zero value (applying a suitable threshold; see the sketch after this list)
• packing-factor: number of event fragments ("triggers") packed into a single packet/message
  – reduces the message rate
  – optimises bandwidth usage
  – is limited by the number of CPU cores in the receiving CPU (to guarantee prompt processing and thus limit latency)
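A minimal sketch of zero-suppression as defined above; the threshold and the ADC values are made up for illustration.

```python
# Zero-suppression sketch (illustrative only; threshold and data are made up).
def zero_suppress(adc_values, threshold=5):
    """Keep only (channel number, ADC value) pairs above threshold."""
    return [(ch, v) for ch, v in enumerate(adc_values) if v > threshold]

raw = [0, 0, 7, 0, 0, 0, 123, 0, 0, 2]   # ADC counts for ten consecutive channels
print(zero_suppress(raw))                # [(2, 7), (6, 123)] -- most channels are dropped
```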
Following the data-flow
[Diagram (animated): event fragments flow from the Front-end Electronics (TELL1 and UKL1 boards) over ~400 links at ~35 GB/s through the Readout Network into the Event Filter Farm (50 subfarms of PCs behind switches) and on to the Storage System, under control of the TFC system. For each L0-accepted event (e.g. a B → ΦKs candidate) the TFC broadcasts the MEP destination (here PC #876) to all boards; each subdetector (VELO, TT, IT, OT, CALO, RICH, MUON, L0) packs its fragments (VELO#1, VELO#2, RICH#1, RICH#2, ...) into Multi-Event Packets (MEPs) and sends them to that PC, where the HLT process assembles the event and issues MEP requests.]
Data pre-processing: The LHCb common Readout Board TELL1
[Block diagram: FE links enter via A-RxCard / O-RxCard receiver cards into four PP-FPGAs, each with an L1B buffer; a SyncLink-FPGA merges their output and drives the RO-Tx over 4 x 1000BaseT; TTCrx (TTC), ECS and a throttle line complete the board.]
Receiver cards get the data from the detector via optical fibres
FPGAs do the pre-processing, zero-suppression and data formatting (into IP packets; a packing sketch follows below)
The FPGA is attached to an Ethernet quad MAC on an SPI-3 bus (a simple FIFO protocol)
IP packets are pushed out to the Data Acquisition on a private LAN over 4 x 1000BaseT links
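To make the packing of several triggers into one IP packet concrete, here is a hedged Python sketch; the header layout (first event ID, packing factor, length-prefixed fragments) is invented for this example and is not the actual TELL1 output format.

```python
# Sketch of concatenating several triggers into one message, as the board does when
# it formats data into IP packets. The header layout is invented for illustration.
import struct

def pack_triggers(first_event_id, fragments):
    """Concatenate the fragments of len(fragments) consecutive triggers into one payload."""
    header = struct.pack("!IH", first_event_id, len(fragments))        # first event id + packing factor
    body = b"".join(struct.pack("!H", len(f)) + f for f in fragments)  # length-prefixed fragments
    return header + body

# Eight consecutive trigger fragments from one board go out as a single packet:
payload = pack_triggers(1000, [b"\x01\x02\x03"] * 8)
print(len(payload), "bytes in one packet instead of 8 separate packets")
```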
Improving the LHCb trigger
• Triggering is filtering. The quality of the trigger is determined (using simulated data) by measuring how many of the possible good events are actually selected:
  efficiency ε = N_good-selected / N_good-all
  (a small worked example follows below)
• Each stage has its own efficiency. LHCb loses mostly in the "L0" step: 40 MHz → 1 MHz
• Reason: only coarse information ("high pT") is used
• Solution: reconstruct secondary vertices at the collision rate of 40 MHz!
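A small worked example of the efficiency definition, with purely hypothetical numbers (not LHCb's measured efficiencies); it also shows that the overall efficiency is the product of the per-stage efficiencies.

```python
# Hypothetical numbers, for illustration only (not LHCb's actual trigger efficiencies):
# a sample of simulated "good" signal events is passed through both trigger stages.
n_good_all  = 100_000      # good events in the simulated sample
n_after_l0  = 50_000       # good events surviving Level-0
n_after_hlt = 45_000       # good events surviving the High Level Trigger

eff_l0    = n_after_l0 / n_good_all     # 0.50
eff_hlt   = n_after_hlt / n_after_l0    # 0.90
eff_total = n_after_hlt / n_good_all    # 0.45 = eff_l0 * eff_hlt

assert abs(eff_total - eff_l0 * eff_hlt) < 1e-12
print(f"L0: {eff_l0:.2f}, HLT: {eff_hlt:.2f}, total: {eff_total:.2f}")
```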
Upgrade
We want to have a DAQ and Event filter which:
• allows for vertex triggering at the collision rate (40 MHz)
• fits within the existing infrastructure:
  – 1 MW of power and cooling
  – 50 racks with a total space of 2200 Us
• preserves the main good features of the current LHCb DAQ
  – simple, scalable, industry-standard technologies, as much as possible commodity items
• costs < 10^7 units of a reasonable currency
Two Options
• Two-stage readout:
  – Read out ~10 kB @ 40 MHz; the data are buffered in the FL1 for a suitable amount of time: 40 ms (?)
  – An algorithm on the event-filter farm selects 1 MHz of "good" events and informs (how?) the FL1 boards of its decision (yes/continue – no/discard)
  – In case of "yes" the entire detector is read out: 35 kB @ 1 MHz
• Always read out the entire detector: 35 kB @ 40 MHz ("brute force")
Full read-out at 40 MHz
• At a collision rate of 40 MHz the data rate for a full readout is ~1400 GB/s, or ~12 Tb/s → a network with ~2 x 1200 10-Gigabit ports (see the check below)
• Several switches are needed as building blocks → an optimised (non-Banyan) topology is highly desirable
• Advantages:
  – No latency constraints
  – Lower memory requirements on the FL1
• Disadvantages:
  – Huge, expensive
  – Almost all of the data shipped will never be looked at (the physics algorithms do not change much)
  – Requires zero-suppression and FPGA pre-processing for all detector data at 40 MHz (not obvious)
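A quick back-of-the-envelope check of the numbers above (Python sketch; it assumes 1 kB = 1000 bytes and ignores protocol overhead):

```python
# Back-of-the-envelope check (1 kB = 1000 bytes; protocol overhead ignored).
event_size = 35e3            # bytes per full event
rate       = 40e6            # collisions per second
link_speed = 10e9            # bits per second per 10-Gigabit port

data_rate_bytes = event_size * rate            # 1.4e12 B/s  ~ 1400 GB/s
data_rate_bits  = data_rate_bytes * 8          # 1.12e13 b/s ~ 12 Tb/s
ports_per_side  = data_rate_bits / link_speed  # ~1120 -> ~1200 ports in, ~1200 out
print(f"{data_rate_bytes/1e9:.0f} GB/s, {data_rate_bits/1e12:.1f} Tb/s, "
      f"~{ports_per_side:.0f} ports each way")
```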
Parameters / Assumptions
• Vertex reconstruction requires only a subset of the total event, roughly 10 kB @ 40 MHz (essentially the Vertex Locator of the future + some successor of the TT)
• FE with full 40 MHz readout capability
• We have at our disposal a successor of the TELL1, the FL1*, which has several 10-Gigabit output links and can do the pre-processing / zero-suppression at the required rate
• Several triggers are packed into one MTP. This reduces the message rate from each board. In this presentation we assume 8 triggers per message, i.e. a Tx message rate of 5 MHz per FL1 (see the check below)
(*) FL1 for Future L1, or Fast L1, or FormuLa 1
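The 5 MHz message rate follows directly from the assumed packing factor; a trivial check:

```python
# 40 MHz of triggers, 8 triggers per message (MTP) from each FL1 board.
trigger_rate   = 40e6
packing_factor = 8
print(f"{trigger_rate / packing_factor / 1e6:.0f} MHz of messages per FL1")   # 5 MHz
```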
Data pre-processing: A new readout-board: FL1
[Block diagram: as on the TELL1, FE links enter via O-RxCard receivers into four PP-FPGAs with L1B buffers and a SyncLink-FPGA; the output stage (RO-Tx with sync info) now includes a host processor and ECS and drives (up to) 4 x CX4 links; TTC, ECS and throttle connections remain.]
Receiver cards get the data from the detector via optical fibres
FPGAs do the pre-processing, zero-suppression and data formatting
The FPGA is attached to an HCA on a ??? bus (are there alternatives to PCIe?)
Output goes to the Data Acquisition private LAN on (up to) 4 x CX4 cables
A host processor is needed (??) to handle the complex protocol stack
Event filter farm for upgraded LHCb
• We need an event filter which can absorb 4 x 10^7/s x 10 kB + 10^6/s x 35 kB ≈ 435 GB/s!
• Assume 2000 servers:
  – A server is something which takes one U of space and has two processor sockets
  – Each socket holds a chip which comprises several CPU cores
• Each server must accept ~210 MB/s as 500 kHz of messages of ~400 bytes (see the check below)
• Options for attaching servers to the network:
  – 3 Gigabit links as a trunk: not very practical, because one would have to bring > 130 links into one rack!
  – Use an (underused) 10 Gigabit link
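A rough consistency check of these figures (Python sketch; 1 kB = 1000 bytes, no protocol overhead); it reproduces the quoted ~435 GB/s and ~210 MB/s, and roughly the 500 kHz message rate:

```python
# Consistency check (1 kB = 1000 bytes; protocol overhead ignored).
vertex_stream = 40e6 * 10e3        # 10 kB of vertex data at 40 MHz -> 400 GB/s
full_stream   = 1e6 * 35e3         # 35 kB full events at 1 MHz     ->  35 GB/s
total         = vertex_stream + full_stream         # ~435 GB/s
per_server    = total / 2000                         # ~218 MB/s (quoted as ~210 MB/s)
msg_rate      = per_server / 400                     # ~540 kHz of ~400-byte messages (~500 kHz)
print(f"{total/1e9:.0f} GB/s total, {per_server/1e6:.0f} MB/s "
      f"and {msg_rate/1e3:.0f} kHz per server")
```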
Server Horoscopes
• Quad-core processors from Intel and AMD will most likely be available in 2007
• Could we have "octo-cores" by the end of 2008?
• We can thus assume 8 cores running at 2 to 2.4 GHz (probably not more!) in one U
• Commitment by Intel and AMD: power consumption per processor < 100 W
• Reasonable rumours:
  – 2007 will see the first mainboards with a 10 Gigabit interface on board: most likely CX4, for either 10 Gigabit Ethernet or Infiniband (?)
CPU power for triggering / latency / buffering
• Assuming 2000 servers / 16000 cores and 40 MHz of events, each core has on average 2.5 ms to reach a decision when processing the ~10 kB of vertex-detector data → we should have at least 40 ms of buffering in the FL1s to cope with fluctuations in the processing time (the processing-time distribution is known to have long tails)
• Assuming 400 FL1s, this means they have to have 12.5 GB of buffer memory
[Diagram: upgraded data flow from the LHCb detector through 400 Front-End L1 (FL1) boards, each with 4 x 10 Gbps links, into high-density fabric switches (400 ports "in" per switch) and on to 50 farm racks (~65 Gbps per rack, each with 1 x 32-port or 2 x 16-port switch); indicated cable distances: 30 m+, 60 m, 20 m+?. Steps: 1. Read out 10 kB events @ 40 MHz and buffer on the FL1. 2. Send to the farm for the trigger decision. 3. Send the trigger decision to the FL1. 4. Receive the trigger decision. 5. If the decision is positive, read out 35 kB @ 1 MHz.]
Power Consumption
• Probably need 512 MB per core (trigger process)
• x 8 cores ==> 4 GB per server
• 4 GB of high-speed memory + an onboard 10 Gigabit interface also need power (assume, conservatively, 50 W)
• The 1U box should stay below 300 W
• Total power for the CPUs < 600 kW (see the check below)
• The 10 Gigabit distribution switches also need power (count at least 250 W each)
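A minimal check of the farm-level totals, assuming the full 300 W budget is drawn by each of the 2000 servers:

```python
# Farm-level memory and power totals from the per-server assumptions above.
mem_per_core     = 0.512            # GB per core (512 MB)
cores_per_server = 8
servers          = 2000
power_per_server = 300              # W budget per 1U box (CPUs + memory + 10 Gb interface)

mem_per_server = mem_per_core * cores_per_server     # ~4 GB
farm_power     = servers * power_per_server           # 600 kW
print(f"{mem_per_server:.0f} GB per server, {farm_power/1e3:.0f} kW for the whole farm")
```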
Open questions
• Can an FPGA drive the HCA or do we need an embedded host-processor with an OS?
• It would be nice to centrally assign the next destination (server) to all FL1 boards. This means determining the Queue Pair number and DLID/DGID to send a message to. Can we use the Infiniband network for this as well?
• Almost the entire traffic is unidirectional (from the FL1s to the servers). Can we take advantage of this fact?