A TCP/IP transport layer for the DAQ of the CMS Experiment
Miklos Kozlovszky
for the CMS TriDAS collaboration
CERN, European Organization for Nuclear Research
ACAT03 - December 2003
CMS & Data Acquisition
Collision rate: 40 MHz
Level-1 maximum trigger rate: 100 kHz
Average event size: 1 Mbyte
No. of In-Out units: 1000
Readout network bandwidth: 1 Terabit/s
Event filter computing power: 5 × 10^6 MIPS
Data production: Tbyte/day
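The trigger rate and event size above fix the load the event builder has to carry. The short calculation below is only a worked check of that arithmetic; it assumes 1 Mbyte means 10^6 bytes and ignores any protocol overhead, neither of which is stated on the slide.

```python
# Worked arithmetic using only the figures quoted on this slide.
LEVEL1_RATE_HZ = 100e3        # Level-1 maximum trigger rate: 100 kHz
EVENT_SIZE_BYTES = 1e6        # average event size: 1 Mbyte (assumed 10^6 bytes)
NETWORK_BANDWIDTH_BPS = 1e12  # readout network bandwidth: 1 Terabit/s


def event_builder_load_bps(rate_hz, event_size_bytes):
    """Aggregate data rate in bit/s: events per second times bits per event."""
    return rate_hz * event_size_bytes * 8


load = event_builder_load_bps(LEVEL1_RATE_HZ, EVENT_SIZE_BYTES)
print(f"Aggregate load: {load / 1e9:.0f} Gbit/s")
print(f"Fraction of the 1 Terabit/s network: {load / NETWORK_BANDWIDTH_BPS:.0%}")
```

Under these assumptions the load is 100 GB/s, i.e. 800 Gbit/s, which is consistent with a readout network sized at about 1 Terabit/s.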
[Figure: CMS DAQ architecture. Blocks: Detector Frontend, Level 1 Trigger, Readout Systems, Event Manager, Builder Networks, Filter Systems, Computing Services, Run Control.]
Event builder: a physical system interconnecting data sources with data destinations. It has to move all data fragments of each event to the same destination.
Event fragments: event data fragments are stored in separate physical memory systems.
Full events: full event data are stored in one physical memory system associated with a processing unit.
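The definitions above can be illustrated with a toy model. The sketch below is not CMS or XDAQ code: the function names, the tuple layout `(event_id, source_id, payload)` and the modulo rule for picking a destination are all assumptions made for illustration. It only shows the core idea that every fragment of a given event must end up in the same destination memory.

```python
from collections import defaultdict


def destination_for(event_id, n_destinations):
    """Hypothetical assignment rule: spread events round-robin by event number."""
    return event_id % n_destinations


def route_fragments(fragments, n_destinations):
    """Deliver each (event_id, source_id, payload) fragment to its destination.

    Returns one memory per destination: {event_id: {source_id: payload}}.
    """
    memories = [defaultdict(dict) for _ in range(n_destinations)]
    for event_id, source_id, payload in fragments:
        dest = destination_for(event_id, n_destinations)
        memories[dest][event_id][source_id] = payload
    return memories


def full_events(memory, n_sources):
    """Concatenate fragments of every event that has arrived from all sources."""
    return {
        event_id: b"".join(parts[s] for s in range(n_sources))
        for event_id, parts in memory.items()
        if len(parts) == n_sources
    }


# 3 data sources, 2 destinations, 4 events; fragments arrive source by source.
fragments = [(ev, src, f"e{ev}s{src};".encode())
             for src in range(3) for ev in range(4)]
memories = route_fragments(fragments, n_destinations=2)
for dest, memory in enumerate(memories):
    print(dest, full_events(memory, n_sources=3))
```

In this toy run, destination 0 ends up holding events 0 and 2 and destination 1 holds events 1 and 3, each assembled from the fragments of all three sources.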
[Figure: NxM event builder (EVB) connecting 512 data sources for 1 MByte events to ~1000s of HLT processing nodes.]
Building the events
• Distributed DAQ framework developed within CMS.
• Construct homogeneous applications for heterogeneous processing clusters.
• Multi-threaded (important to take advantage of SMP efficiently).
• Zero copy message passing for the event data.
• Peer to peer communication between the applications.
• I2O for data transport, and SOAP for configuration and control.
• Hardware and transport independence.
[Figure: XDAQ software stack. Labels: Processing, Sensor readout, Util/DDM, XDAQ, HTTP, TCP, Ethernet, Myrinet, PCI, OS and Device Drivers. Callout: "Subject of presentation".]
XDAQ Framework
• Reuse old, “cheap” Ethernet for DAQ
• Transport layer requirements
– Reliable communication
– Hide the complexity of TCP
– Efficient implementation
– Simplex communication via sockets
– Configurable
• Support of blocking and non-blocking I/O
TCP/IP Peer Transport Requirements
• Pending Queues (PQ)
– Thread-safe PQ management
– One PQ for each destination
– Independent sending through sockets
• A single “select” call is used both to receive packets and to send the blocked data.
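The pending-queue scheme described above can be sketched in a few lines. This is a minimal illustration and not the ptATCP implementation: the class and method names are hypothetical, a `socket.socketpair()` stands in for a real TCP connection, and the sketch assumes that only one thread calls `poll()` while any thread may call `frame_send()`.

```python
import select
import socket
import threading
from collections import deque


class PendingQueueTransport:
    """Hypothetical sketch: one pending queue per destination socket."""

    def __init__(self):
        self._lock = threading.Lock()  # guards the pending queues
        self._queues = {}              # socket -> deque of bytes still to send
        self.received = {}             # socket -> bytearray of received data

    def add_peer(self, sock):
        sock.setblocking(False)
        with self._lock:
            self._queues[sock] = deque()
            self.received[sock] = bytearray()

    def frame_send(self, sock, data):
        """Queue data for one destination; the caller never blocks on I/O."""
        with self._lock:
            self._queues[sock].append(bytes(data))

    def poll(self, timeout=0.1):
        """One select() call serves both receiving and sending pending data."""
        with self._lock:
            readers = list(self._queues)
            writers = [s for s, q in self._queues.items() if q]
        readable, writable, _ = select.select(readers, writers, [], timeout)
        for sock in readable:
            try:
                self.received[sock].extend(sock.recv(65536))
            except BlockingIOError:
                pass
        for sock in writable:
            with self._lock:
                data = self._queues[sock].popleft()
            try:
                sent = sock.send(data)
            except BlockingIOError:
                sent = 0
            if sent < len(data):
                # Keep the unsent tail at the front so byte order is preserved.
                with self._lock:
                    self._queues[sock].appendleft(data[sent:])


# Demo: two endpoints joined by a local socket pair.
a, b = socket.socketpair()
left, right = PendingQueueTransport(), PendingQueueTransport()
left.add_peer(a)
right.add_peer(b)
left.frame_send(a, b"fragment-1")
left.frame_send(a, b"fragment-2")
for _ in range(10):
    left.poll(0.01)
    right.poll(0.01)
print(bytes(right.received[b]))
a.close()
b.close()
```

Because each destination has its own queue, a slow peer only delays its own queue; sockets that are writable keep draining independently within the same `select` loop.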
Implementation of the non-blocking mode
[Figure: pending queues. Labels: XDAQ Application, Framesend, Pending Queues (#2 ... #n, each holding frames 1 2 3 4 5 ... n), Select.]
[Figure: communication via the transport layer. Labels: Applications (XDAQ), XDAQ Framework, XDAQ Executive, Peer Transport Layer, ptATCP, ptATCP Port(s), Sender Object(s), Receiver Object(s), Input SAP(s), Output SAP(s), OS, Driver(s), NIC (FE), NIC (GE), NIC (10GE). Legend: creation of object, sending, receiving, other communication.]
Communication via the transport layer
Throughput optimisation
[Figure: single-rail vs. multi-rail connection between App 1 and App 2.]
• Operating System tuning (kernel options+buffers)
• Jumbo Frames
• Transport protocol options
• Communication techniques
– Blocking vs. Non-Blocking I/O
– Single/Multi-rail
– Single/Multi-thread
– TCP options (e.g. Nagle algorithm)
– …
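As an illustration of the TCP-level options in the list above, the sketch below disables the Nagle algorithm and requests larger socket buffers using standard socket options. The helper name `tune_socket` and the 256 KB buffer size are assumptions chosen for the example, not values from the presentation; the operating system may clamp or adjust the requested buffer sizes.

```python
import socket


def tune_socket(sock, disable_nagle=True, buffer_bytes=256 * 1024):
    """Apply example TCP options and return the values the OS reports back.

    buffer_bytes is a hypothetical value; kernels may round or cap it.
    """
    if disable_nagle:
        # TCP_NODELAY = 1 turns the Nagle algorithm off for this socket.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    return {
        "nodelay": bool(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)),
        "sndbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        "rcvbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    }


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
settings = tune_socket(sock)
sock.close()
print(settings)
```

Whether disabling Nagle helps depends on the traffic pattern, which is why it is listed as something to measure rather than a fixed setting.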
Test network
Cluster size: 8x8
CPU: 2x Intel Xeon (2.4 GHz), 512 KB cache
I/O system: PCI-X: 4 buses (max 6)
Memory: two-way interleaved DDR: 3.2 GB/s (512 MB)
NICs: 1 Intel 82540EM GE
      1 Broadcom NetXtreme BCM 5703x GE
      1 Intel Pro 2546EB GE (2 port)
OS: Linux RedHat 2.4.18-27.7 (SMP)
Switches: 1 BATM T6 Multi Layer Gigabit Switch (medium range)
          2 Dell PowerConnect 5224 (medium range)
[Figure: throughput per node (MB/s, axis 0 to 140) vs. fragment size (Byte, ticks at 100, 1000, 10000, 100000). Curves: link BW (1 Gbps); 8x8 EVB [P4 e1000 Powerconnect 5224]; 32x32 EVB [P3 AceNIC FastIron8000].]
Conditions:
• XDAQ + Event Builder
– No Readout Unit inputs
– No Builder Unit outputs
– No Event Manager
• PC: dual P4 Xeon
• Linux 2.4.19
• NIC: e-1000
• Switch: Powerconnect 5224
• Standard MTU (1500 Bytes)
• Each BU builds 128 events
• Fixed fragment sizes
Result:
For fragment size > 4 kB:
• Throughput/node ~100 MB/s, i.e. 80% utilisation
Working point
Event Building on the cluster
Two Rail Event Builder measurements
Test case:
Bare Event Builder (2x2)
• No RU inputs
• No BU outputs
• No Event Manager
Options:
• Non-blocking TCP
• Jumbo frames (MTU 8000)
• Two rail
• One thread
RU working point (16 kB)
Throughput/node = 240 MB/s, i.e. 95% bandwidth
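The utilisation figures quoted for the one-rail and two-rail measurements can be checked with simple arithmetic. The sketch below assumes 1 MB = 10^6 bytes and takes the raw Gigabit Ethernet line rate (1 Gbps = 125 MB/s per rail) as the reference, ignoring protocol overhead; the function name is hypothetical.

```python
LINK_GBPS = 1.0  # Gigabit Ethernet line rate per rail


def utilisation(throughput_mb_s, rails=1, link_gbps=LINK_GBPS):
    """Fraction of the raw link capacity used by the measured throughput."""
    capacity_mb_s = rails * link_gbps * 1000 / 8  # 1 Gbps = 125 MB/s
    return throughput_mb_s / capacity_mb_s


single = utilisation(100, rails=1)  # 8x8 event builder, one rail
double = utilisation(240, rails=2)  # 2x2 event builder, two rails
print(f"one rail: {single:.0%}, two rails: {double:.0%}")
```

Under these assumptions, 100 MB/s on one rail is 80% of the link, and 240 MB/s on two rails is 96%, in line with the roughly 95% quoted on the slide.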
• Achieved 100 MB/s per node in the 8x8 configuration (1 rail).
• Improvements were seen with two rails, non-blocking I/O and Jumbo frames: over 230 MB/s was obtained in the 2x2 configuration.
• High CPU load.
• We are also studying other networking and traffic shaping options.
Conclusions