High Throughput Data Acquisition at the CMS Experiment at CERN
André G. Holzner
University of California San Diego
on behalf of the CMS Data Acquisition Group
HPC Advisory Council Switzerland Conference
Lugano, Switzerland, 23rd-25th March 2015
Outline
  The Compact Muon Solenoid experiment at the Large Hadron Collider
    from collisions to observations
  Event building in a nutshell
  Event builder upgrade
    technology choices
    first layer
    PC performance tuning
    second layer
    performance measurements
  High Level Trigger
  Summary
What are we doing?
Image credit: Particle Data Group at Lawrence Berkeley National Lab.
understanding the fundamental laws of nature at the smallest scales
reproduce conditions similar to those shortly after the big bang in the laboratory
higher energy ⟷ closer in time to the big bang (the regime this presentation is about)
Reproducing the early universe in the lab
Large Hadron Collider Aerial View
[Photo: landmarks visible: Lac Léman, GVA airport, Mt. Blanc]
Image credit: Maximilien Brice (CERN), CC BY-SA 3.0
Detecting early universe interactions
CMS detector
In real life
https://goo.gl/maps/prT4N
Collisions to look for
https://cds.cern.ch/record/1406325
From collisions to observations
[Diagram: the CMS data acquisition chain]
  Level 1 trigger (electronics, FPGAs): trigger data at 40 MHz, readout requests
  576 x 10 GBit/s Ethernet, 2 Mbyte x 100 kHz, custom protocols
  1st stage data aggregation PCs
  Infiniband FDR, 6 spine / 12 leaf folded Clos
  2nd stage data aggregation PCs
  High level trigger (software), reduction ~ 1:100
  Storage: 2 Mbyte x 1 kHz
  40 GBit/s Ethernet, fully assembled collision data
  offline reconstruction, data analysis (PCs, Grid)
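Putting the quoted rates together (a quick back-of-the-envelope check, not from the slides themselves), the aggregate throughput into and out of the event builder is:

```latex
% assumes \usepackage{amsmath}
% into the event builder: Level-1 accept rate times event size
\[ 100\,\mathrm{kHz} \times 2\,\mathrm{MB} = 200\,\mathrm{GB/s} \approx 1.6\,\mathrm{Tbit/s} \]
% to storage, after the ~1:100 High Level Trigger reduction
\[ 1\,\mathrm{kHz} \times 2\,\mathrm{MB} = 2\,\mathrm{GB/s} \approx 16\,\mathrm{Gbit/s} \]
```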
LHC offline computing grid
http://wlcg-public.web.cern.ch/
[Map: WLCG sites, e.g. T0_CH_CERN and T2_CH_CSCS]
Event building in a nutshell
[Diagram: custom electronics sources each hold fragments 1 2 3 4 of successive events;
a full switching matrix connects them to the event building PCs; all fragments with the
same event number are routed to a single PC, which assembles the complete event]
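To make the routing idea concrete, here is a minimal sketch of the assembly step on a receiving PC (hypothetical types and names, not the actual CMS event builder code): fragments arriving from each source are keyed by event number, and an event is complete once every source has contributed.

```cpp
// Minimal event-assembly sketch (illustration only, hypothetical interfaces).
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

struct Fragment {
    uint64_t eventId;            // event number assigned by the trigger
    uint32_t sourceId;           // which readout source produced it
    std::vector<char> payload;   // the detector data itself
};

class EventAssembler {
public:
    explicit EventAssembler(std::size_t nSources) : nSources_(nSources) {}

    // Returns true once all fragments of this fragment's event have arrived.
    bool addFragment(Fragment frag) {
        auto& fragments = pending_[frag.eventId];
        fragments.push_back(std::move(frag));
        return fragments.size() == nSources_;
    }

    // Hand the complete event over (e.g. to the High Level Trigger) and forget it.
    std::vector<Fragment> takeEvent(uint64_t eventId) {
        auto node = pending_.extract(eventId);
        return node ? std::move(node.mapped()) : std::vector<Fragment>{};
    }

private:
    std::size_t nSources_;
    std::map<uint64_t, std::vector<Fragment>> pending_;  // eventId -> fragments so far
};
```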
Why upgrade?
Many pieces of equipment have reached their end of life → need to replace hardware
more detector channels added in the next years
Increase sensitivity to new physics phenomena by increasing beam energy, intensity and focusing
  → more collisions per beam crossing
  → more parts of the detector traversed by particles
  → higher data volume per collision
Infiniband Pros:
Designed as a High Performance Computing interconnect over short distances (within datacenters)
Protocol is implemented in the network card silicon → low CPU load
56 GBit/s per link (copper or optical), now 100 GBit/s available
Native support for Remote Direct Memory Access (RDMA)
No copying of bulk data between user space and kernel (‘true zero-copy’)
affordable
Cons:
Less widely known, API significantly differs from BSD sockets for TCP/IP
more difficult to implement in an FPGA
Fewer vendors than Ethernet
Niche market
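To illustrate the RDMA / zero-copy point above, a minimal sketch of posting an RDMA write with libibverbs (an illustration under assumptions, not the CMS event builder code: it presumes a reliable-connected queue pair already brought up, and that the remote buffer address and rkey were exchanged out of band):

```cpp
#include <infiniband/verbs.h>
#include <cstdint>

// Posts one zero-copy RDMA write: the HCA reads the local buffer and writes it
// directly into the remote PC's memory, without any intermediate kernel copy.
// `mr` is a memory region covering local_buf, registered once at startup with
// ibv_reg_mr(); the remote buffer must have been registered with
// IBV_ACCESS_REMOTE_WRITE on the receiving side.
int rdma_write_fragment(ibv_qp* qp, ibv_mr* mr,
                        void* local_buf, uint32_t len,
                        uint64_t remote_addr, uint32_t rkey)
{
    ibv_sge sge{};
    sge.addr   = reinterpret_cast<uint64_t>(local_buf);
    sge.length = len;
    sge.lkey   = mr->lkey;

    ibv_send_wr wr{};
    wr.opcode              = IBV_WR_RDMA_WRITE;  // data lands directly in remote memory
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  // request a completion entry
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    ibv_send_wr* bad = nullptr;
    return ibv_post_send(qp, &wr, &bad);  // completion is later reaped from the send CQ
}
```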
[Chart: Top500.org share by interconnect family, from the DAQ1 TDR era (2002) to 2013;
categories shown: Infiniband, Myrinet, 1 Gbit/s Ethernet, 10 Gbit/s Ethernet]
Run II event builder overview
[Diagram: overview of the Run II event builder; arrow indicates the direction of data flow]
From custom to standard protocols
subdetectors use custom electronics modules (VME, uTCA) to communicate with on-detector ASICs
many different designs due to many different requirements
want to use commercial off-the-shelf equipment as 'early' as possible
first common element: 'SLINK' (copper, 64 bit x 50 MHz = 3.2 GBit/s), future: optical 6 / 10 GBit/s custom protocol
output: 10 GBit/s Ethernet
TCP sender implemented in a mid-range FPGA using a reduced TCP state machine
[Diagram: FEROL* Compact PCI module; inputs: 2 x SLINK, 2 x 6 GBit/s, 1 x 10 GBit/s; output: 10 GBit/s Ethernet]
*Front-End Readout Optical Link
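As a rough illustration of why a sender-only TCP implementation fits a mid-range FPGA, here is a conceptual sketch of a reduced sender state machine (purely an illustration of the idea in C++, not a description of the FEROL firmware): only the connect / stream / acknowledge path exists, with none of the receive-side machinery of a full stack.

```cpp
#include <cstdint>

// Conceptual, sender-only TCP state machine: the endpoint opens a connection,
// streams data within the receiver's advertised window, and reacts only to ACKs.
enum class TcpSenderState { Closed, SynSent, Established, FinSent };

struct ReducedTcpSender {
    TcpSenderState state = TcpSenderState::Closed;
    uint32_t snd_una = 0;   // oldest unacknowledged sequence number
    uint32_t snd_nxt = 0;   // next sequence number to send
    uint32_t snd_wnd = 0;   // window advertised by the receiver

    void connect()                { state = TcpSenderState::SynSent; /* emit SYN */ }
    void on_syn_ack(uint32_t wnd) { state = TcpSenderState::Established; snd_wnd = wnd; /* emit ACK */ }

    // Only the sending direction exists: data flows out, ACKs flow in.
    bool can_send(uint32_t len) const {
        return state == TcpSenderState::Established && snd_nxt - snd_una + len <= snd_wnd;
    }
    void on_data_sent(uint32_t len)        { snd_nxt += len; }
    void on_ack(uint32_t ack, uint32_t wnd) { snd_una = ack; snd_wnd = wnd; }
    void close()                            { state = TcpSenderState::FinSent; /* emit FIN */ }
};
```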
FEROLs in real life
First layer switching network
16 switches with 48 x 10 GBit/s ports to FEROLs and 12 x 40 GBit/s ports to first layer event data aggregation PCs
aggregation layer, full connectivity not needed in principle
future: adding 40 GBit/s switches to connect the 10/40 GBit/s switches for faster failover
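A quick check of the port counts quoted above (my arithmetic, not from the slide): per switch, downlink and uplink capacities match, so the aggregation is non-blocking in principle.

```latex
% per first-layer switch, downlink vs. uplink capacity
\[ 48 \times 10\,\mathrm{GBit/s} = 480\,\mathrm{GBit/s} \quad \text{(towards the FEROLs)} \]
\[ 12 \times 40\,\mathrm{GBit/s} = 480\,\mathrm{GBit/s} \quad \text{(towards the aggregation PCs)} \]
```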
PC performance tuning
optimization on event building PCs of:
  assignment of Ethernet receive queue interrupts to CPU cores
  TCP kernel settings
  assignments of software threads to CPU cores
  per-thread local memory allocation using libnuma
graphical monitoring of IRQ activity
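A minimal sketch of the last two tuning items, thread pinning and NUMA-local allocation (assumed example code, not the CMS DAQ software; the core index and buffer size are hypothetical). Build with -pthread -lnuma.

```cpp
#include <pthread.h>
#include <sched.h>
#include <numa.h>
#include <cstdio>
#include <cstdlib>

// Pin the calling thread to one CPU core so it stays close to the NIC RX queue IRQ.
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main()
{
    if (numa_available() < 0) {
        std::fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }

    const int core = 2;                       // hypothetical core handling one RX queue
    pin_to_core(core);
    const int node = numa_node_of_cpu(core);  // NUMA node that this core belongs to

    // Allocate the receive buffer on the same node as the thread that fills it.
    const size_t bufSize = 64 * 1024 * 1024;
    void* buf = numa_alloc_onnode(bufSize, node);
    if (!buf) return 1;

    // ... receive and process event fragments into buf ...

    numa_free(buf, bufSize);
    return 0;
}
```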
Infiniband Clos network
12 leaf, 6 spine switches
36 FDR (56 GBit/s) ports per switch
3 links between each leaf/spine pair
18 x 12 = 216 external ports
~ 6 Tbit/s bisection bandwidth
Subnet manager running on a switch
full connectivity needed here: all sources send to all destinations
switching to Ethernet an option
[Diagram: 12 leaf switches, 6 spine switches]
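The bisection bandwidth quoted above follows from the port counts (a back-of-the-envelope check at the nominal 56 GBit/s FDR link rate):

```latex
% per leaf: 36 ports = 18 uplinks (3 links to each of 6 spines) + 18 host-facing ports
\[ 12 \times 18 = 216 \quad \text{external ports} \]
% cut the fabric into two halves of 6 leaves each; all cross traffic uses the uplinks
\[ 6 \times 18 \times 56\,\mathrm{GBit/s} \approx 6\,\mathrm{Tbit/s} \]
```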
Infiniband throughput test
running a large scale test: 84 sending PCs, 50 receiving PCs, 1 LID per host
using off-the-shelf software: qperf rc_rdma_write_bw
84 x 50 = 4200 connections
obtain an upper limit on what we can get with our (in-house) event builder software and protocol
achieved ~ 37 GBit/s per receiver (~ 70% of linespeed after encoding)
[Plot: link occupancy during the test, spines and leaves; data flow from 84 senders to 50 receivers]
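The "~ 70% of linespeed after encoding" figure can be cross-checked (my numbers: an FDR link runs 4 lanes at 14.0625 GBit/s with 64b/66b encoding):

```latex
% usable data rate of one FDR link
\[ 4 \times 14.0625\,\mathrm{GBit/s} \times \tfrac{64}{66} \approx 54.5\,\mathrm{GBit/s} \]
% measured per-receiver throughput relative to that
\[ 37 / 54.5 \approx 0.68 \]
```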
Test with Event Building
receiving PCs must always wait for the slowest sender
overhead due to handshaking
~ 32 GBit/s per receiving PC for 72 senders x 54 receivers (86% of qperf throughput)
High Level Trigger
assembly of full collision data on second layer of PCs
multiple 36 x 40 GBit/s switches, few-to-many distribution
must reduce the rate of selected collisions from 100 kHz to ~ 1 kHz
~ 15'000 cores → 150 ms decision time on average (100'000 events/s spread over ~ 15'000 cores leaves about 150 ms per event per core)
software of 3.8M lines of C++ and 1.2M lines of Python source code
partial reconstruction of collision data:
  finding clusters of high energy deposit
  3D track fitting (Kalman filtering) from 3D and 2D points
  matching of tracks to clusters
Data exchange between 2nd stage event building PCs (BUs) and event filtering PCs (FUs) via files (NFS); see the sketch below
  allows decoupling of event building and filtering software
  needs careful tuning of NFS
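A minimal sketch of the file-based handoff idea (an illustration under assumed file names and paths, not the actual CMS DAQ code): the builder unit writes each assembled event block to a temporary file in the NFS-exported directory, then renames it so the filter units only ever see complete files.

```cpp
#include <cstdio>
#include <string>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

bool write_event_file(const std::string& dir, unsigned long eventId,
                      const std::vector<char>& data)
{
    const std::string tmpName   = dir + "/event_" + std::to_string(eventId) + ".tmp";
    const std::string finalName = dir + "/event_" + std::to_string(eventId) + ".raw";

    int fd = ::open(tmpName.c_str(), O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) return false;

    // Write the fully assembled event data.
    ssize_t n = ::write(fd, data.data(), data.size());
    ::fsync(fd);   // make sure the data reaches the server before publishing the file
    ::close(fd);
    if (n != static_cast<ssize_t>(data.size())) { ::unlink(tmpName.c_str()); return false; }

    // rename() is atomic within one directory: a reader sees either nothing
    // or the complete file, never a partially written one.
    return std::rename(tmpName.c_str(), finalName.c_str()) == 0;
}
```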
Summary / Conclusions
We presented the new Data Acquisition network (event builder) for the CMS detector at CERN for LHC Run II
Multiple networking technologies are used:
  10 / 40 GBit/s in the first aggregation layer
  56 GBit/s FDR Infiniband in the second, fully connected layer
  40 / 10 / 1 GBit/s in the output (filtering) layer
Ready for LHC Run II and looking forward to exploring new energies!
Thank you for your attention !
BACKUP
Accelerator complex
Particle identification
Image credit: CERN