IEEE Real Time 2007, Fermilab, 29 April – 4 May. R. Hughes-Jones, Manchester
Using FPGAs to Generate Gigabit Ethernet Data Transfers
& The Network Performance of DAQ Protocols
Dave Bailey, Richard Hughes-Jones, Marc Kelly The University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks”
Collecting Data over the Network
Detector elements (e.g. calorimeter planks) → custom links → concentrators → Ethernet switch → processing nodes, 1 burst / node
The output link is the bottleneck; a queue builds in the switch

Aim for a general-purpose DAQ solution for CALICE (CAlorimeter for the LInear Collider Experiment)
Take the ECAL as an example: at the end of the beam spill the planks send all the data to the concentrators
The concentrators pack the data & send it to one processing node
A classic bottleneck problem for the switch
XpressFX Virtex4 Network Test Board

XpressFX Development Card from PLDApplications:
8-lane PCI-e card, Xilinx Virtex4 FX60 FPGA, DDR2 memory, 2 SFP cages (1 GigE), 2 HSSDC connectors
Overview of the Firmware Design

The Virtex4 FX60 has 16 RocketIO multi-gigabit transceivers, large internal memory, and 2 PPC CPUs
Ethernet interface: embedded MAC + RocketIO
Packet buffers & logic: allow routing of input and prioritising of output
Packet state machines: packet generator state machines
VHDL model of an HC11 CPU (Green Mountain Computer Systems): control of the MAC and the state machines
The PPCs are reserved for data processing
The State Machine Blocks

Packet Generator: CSRs (set by the HC11) for packet length, packet count, inter-packet delay, destination address
Request-Response RX State Machine: decode the request packet, checksum (RFC 768), action memory writes, queue other requests to a FIFO
TX State Machine: process the request, construct the reply, fragment if needed, checksum
Packet Analyser State Machine
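The RX state machine validates an RFC 768 style checksum on each request. As a reference for the algorithm the firmware implements in VHDL, a minimal software sketch of the 16-bit one's-complement checksum (plain Python, not the actual firmware code):

```python
def ones_complement_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the RFC 768 (UDP) style."""
    if len(data) % 2:                  # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold any carry back in
    return (~total) & 0xFFFF
```

A receiver can verify a packet by summing the data words together with the transmitted checksum; a valid packet yields zero.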
The Receive State Machine

States: Idle, ReadHeader, ReadCmd, CheckCmd, DoCmd, WriteMem, FillFifo, EmptyPacket
Transitions:
  Idle → ReadHeader: packet in queue
  ReadHeader → ReadCmd: correct packet type
  ReadHeader → EmptyPacket: wrong packet type
  ReadCmd → CheckCmd: all bytes received
  CheckCmd → DoCmd: good cmd
  CheckCmd → EmptyPacket: bad cmd
  DoCmd → WriteMem: is a memory write
  DoCmd → FillFifo: not a memory write
  WriteMem → FillFifo: write finished
  FillFifo → EmptyPacket: FIFO written
  EmptyPacket → Idle: end of packet
FIFO holds: address, cmd
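The receive logic above can be sketched as a table-driven state machine. This is an illustrative Python model, not the VHDL; the event names are my own labels for the slide's transition conditions, and the wiring follows my reading of the diagram:

```python
from enum import Enum, auto

class RxState(Enum):
    IDLE = auto()
    READ_HEADER = auto()
    READ_CMD = auto()
    CHECK_CMD = auto()
    DO_CMD = auto()
    WRITE_MEM = auto()
    FILL_FIFO = auto()
    EMPTY_PACKET = auto()

# (state, event) -> next state, following the arrows read off the slide.
TRANSITIONS = {
    (RxState.IDLE, "packet_in_queue"): RxState.READ_HEADER,
    (RxState.READ_HEADER, "correct_packet_type"): RxState.READ_CMD,
    (RxState.READ_HEADER, "wrong_packet_type"): RxState.EMPTY_PACKET,
    (RxState.READ_CMD, "all_bytes_received"): RxState.CHECK_CMD,
    (RxState.CHECK_CMD, "good_cmd"): RxState.DO_CMD,
    (RxState.CHECK_CMD, "bad_cmd"): RxState.EMPTY_PACKET,
    (RxState.DO_CMD, "is_memory_write"): RxState.WRITE_MEM,
    (RxState.DO_CMD, "not_memory_write"): RxState.FILL_FIFO,
    (RxState.WRITE_MEM, "write_finished"): RxState.FILL_FIFO,
    (RxState.FILL_FIFO, "fifo_written"): RxState.EMPTY_PACKET,
    (RxState.EMPTY_PACKET, "end_of_packet"): RxState.IDLE,
}

def rx_step(state: RxState, event: str) -> RxState:
    """Advance the receive state machine by one event."""
    return TRANSITIONS[(state, event)]
```

Walking the memory-write path (packet in queue → correct type → all bytes → good cmd → memory write → write finished → FIFO written → end of packet) returns the machine to Idle.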
The Transmit State Machine

States: Idle, SendHeader&Cmd, CheckCmd, SendMemory, UpdateCounter, AllSent?, SendXsum, EndPkt
Transition conditions: cmd in FIFO; header & cmd sent; cmd needs no data; cmd requires data; max packet size or byte count done; more data to send; all bytes have been sent; Xsum sent; end of packet
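The TX state machine fragments a reply when it exceeds the maximum frame payload ("fragment if needed"). A hypothetical one-line sketch of that splitting step, assuming a 1472-byte payload limit as used in the tests later in the talk:

```python
def fragment(reply: bytes, max_payload: int = 1472) -> list:
    """Split a reply into frame payloads no larger than max_payload bytes."""
    return [reply[i:i + max_payload] for i in range(0, len(reply), max_payload)]
```

A 3000-byte reply thus becomes three frames of 1472, 1472, and 56 bytes.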
The Test Network
Requesting node, responding nodes, and the FPGA concentrator, connected through a Cisco 7609 with 1 GE and 10 GE blades
Use for testing raw Ethernet frame generation by the FPGA
Test data collection with request-response protocols
Request-Response Latency, 1 GE

Request sent from a PC: Linux kernel 2.6.20-web100_pktd-plus, Intel e1000 NIC, interrupt coalescence OFF, MTU 1500 bytes
Response frames generated by the FPGA code
Latency 19.7 µs, well behaved; latency slope 0.018 µs/byte
Back-to-back expect 0.0182 µs/byte: mem 0.0004 + PCI-e 0.0018 + 1 GigE 0.008 + FPGA 0.008
Smooth to 35,000 bytes
[Plots (man2-fpga): latency (µs) vs message length (bytes). Fit to 1400 bytes: y = 0.0183x + 19.719; fit to 35,000 bytes: y = 0.0085x + 30.578]
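The fitted slope can be cross-checked by summing the per-byte serialisation costs of each stage of the path, as listed on the slide:

```python
# Per-byte serialisation costs in µs/byte for each element of the path,
# taken from the slide: memory, PCI-e, 1 GigE wire, and the FPGA logic.
costs = {"memory": 0.0004, "PCI-e": 0.0018, "1GigE": 0.008, "FPGA": 0.008}

expected_slope = sum(costs.values())   # µs per byte of message
print(f"expected slope = {expected_slope:.4f} us/byte")  # 0.0182, vs fitted 0.0183
```

Note the 1 GigE term alone is 8 bits / 1 Gbit/s = 0.008 µs/byte, so the wire and the FPGA dominate the slope.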
FPGA → PC ethCal_recv: Frame Jitter

25 µs frame spacing, and 12 µs frame spacing (line speed)
Peak separation 4-5 µs with no interrupt coalescence
[Histograms (fpga-man2_21Apr07): frame-spacing distributions N(t) at 12 µs and 25 µs spacing, on linear and log scales, showing packet loss; packets-lost distributions N(n) at 12 µs and 25 µs spacing]
Test the Frame Spacing from the FPGA

Frames generated by the FPGA code; interrupt coalescence OFF on the PC; frame size 1472 bytes; 1M packets sent
Plot the mean observed frame spacing vs the requested spacing
Appears to have an offset of about −1 µs; slope close to 1, as expected
Packet loss decreases as the packet rate falls; packets are lost in the receiving host
A larger effect than for UDP/IP packets; the UDP/IP losses are linked to scheduling
[Plots (fpga-man2_21Apr07): mean frame spacing vs requested frame spacing (µs), fit y = 0.993x − 1.0726; number of frames lost vs requested frame spacing, fit y = −93.531x + 2563.7]
The Test Network
Requesting node, responding nodes, and the FPGA concentrator, connected through a Cisco 7609 with 1 GE and 10 GE blades
Use for testing raw Ethernet frame generation by the FPGA
Test data collection with request-response protocols
This time use 10 GE hosts. But does 10 GE work on a PC?
10 GigE Back-to-Back: UDP Throughput

Motherboard: Supermicro X7DBE; kernel: 2.6.20-web100_pktd-plus; NIC: Myricom 10G-PCIE-8A-R, fibre
rx-usecs=25, coalescence ON; MTU 9000 bytes
Max throughput 9.4 Gbit/s; note the rate for 8972-byte packets
~0.002% packet loss in 10M packets, in the receiving host
Sending host: 3 CPUs idle; for packet spacing < 8 µs, 1 CPU is >90% in kernel mode, incl. ~10% soft interrupts
Receiving host: 3 CPUs idle; for packet spacing < 8 µs, 1 CPU is 70-80% in kernel mode, incl. ~15% soft interrupts
[Plots (gig6-5_myri10GE), for packet sizes 1000-8972 bytes vs spacing between frames (µs): received wire rate (Mbit/s), sender % CPU-1 kernel, receiver % CPU-1 kernel]
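The received wire-rate curves follow a simple model: the frame's on-wire size divided by the inter-frame spacing, capped at line rate. A sketch, where the 66-byte per-frame overhead accounting (UDP 8 + IP 20 + Ethernet header 14 + FCS 4 + preamble 8 + interframe gap 12) is my assumption, not from the slide:

```python
LINE_RATE_MBIT = 10_000.0          # 10 GigE line rate

def wire_rate_mbit(user_bytes: int, spacing_us: float) -> float:
    """Wire rate when one UDP frame is sent every spacing_us microseconds.
    Assumes 66 bytes of per-frame protocol overhead (see lead-in)."""
    wire_bits = (user_bytes + 66) * 8
    return min(wire_bits / spacing_us, LINE_RATE_MBIT)
```

For 1472-byte frames at 12.3 µs spacing this gives ~1000 Mbit/s, i.e. 1 GigE line rate, consistent with the jitter measurements earlier in the talk.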
Scaling of Request-Response Messages
Requests from the 10 GE system; interrupt coalescence OFF on the PC; frame size 1472 bytes; 1M packets sent
Request 10,000 bytes of data; the host does the fragment collection, like the IP layer
Sequential requests: the time to receive all responses scales with the round-trip time, as expected for sequential requests
Grouped requests: the collection time increases by 24.6 µs per node; from the network alone expect 1 + 12.3 = 13.3 µs
[Plot (gig5-fpga_29Apr07): request-response time (µs) vs number of nodes. Sequential fit: y = 130.41x + 31.035; grouped fit: y = 24.615x + 126.58. Timing diagrams compare sequential and grouped requests]
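The two fitted lines correspond to a simple timing model: sequential requests pay a full round trip per node, while grouped requests overlap the responses and pay only a per-node increment on top of a fixed setup cost. A sketch, with the constants taken from the slide's fits:

```python
def sequential_time_us(nodes: int) -> float:
    """Sequential: send a request, wait for the full response, repeat.
    Constants from the slide's fit y = 130.41x + 31.035."""
    return 31.0 + nodes * 130.4

def grouped_time_us(nodes: int) -> float:
    """Grouped: all requests sent at once; responses overlap on the wire.
    Constants from the slide's fit y = 24.615x + 126.58."""
    return 126.6 + nodes * 24.6
```

Even for a single node the grouped form is competitive, and it wins rapidly as nodes are added, though the measured 24.6 µs/node is still well above the 13.3 µs the network alone would require.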
Sequential Request-Response

Interrupt coalescence OFF on the PCs; MTU 1500 bytes; 10,000 packets sent
Histograms similar: a strong first peak, a second peak 5 µs later, and a small group ~25 µs later
Ethernet occupancy for 1500 bytes: 1 Gig 12.3 µs, 10 Gig 1.2 µs
[Histograms (10GE→fpga): latency distributions N(t) for h2 n1 (250-390 µs), h3 n1 (350-490 µs), and h4 n1 (500-640 µs)]
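The Ethernet occupancy figures quoted above can be reproduced from the frame size plus the standard 38 bytes of Ethernet overhead (14-byte header + 4-byte FCS + 8-byte preamble + 12-byte interframe gap):

```python
def occupancy_us(payload_bytes: int, line_rate_gbit: float) -> float:
    """Time a frame occupies the wire, in µs, including 38 bytes of
    Ethernet overhead (header 14 + FCS 4 + preamble 8 + interframe gap 12)."""
    return (payload_bytes + 38) * 8 / (line_rate_gbit * 1000.0)
```

For a 1500-byte payload this gives 12.3 µs at 1 Gbit/s and 1.2 µs at 10 Gbit/s, matching the slide.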
Grouped Request-Response

Interrupt coalescence OFF on the PCs; MTU 1500 bytes; 10,000 packets sent
Histograms multi-modal: a second peak ~7 µs later, and a small group ~25 µs later
[Histograms (10GE→fpga): latency distributions N(t) for h4 n4, h2 n2, and h3 n3, each over 150-290 µs]
Conclusions

Implemented the MAC and PHY layers inside a Xilinx Virtex4 FPGA
The learning curve was steep; issues had to be overcome with the Xilinx “CoreGen” design and with clock generation & stability on the PCB
The FPGA easily drives 1 Gigabit Ethernet at line rate; packet dynamics on the wire are as expected
The loss of raw Ethernet frames in the end host is being investigated
Request-response style data collection is promising
Developing a simple network test system; an upgrade to operate at 10 Gbit/s is planned
Work performed in collaboration with the ESLEA UK e-Science and EU EXPReS projects
10 GigE UDP Throughput vs Packet Size

Motherboard: Supermicro X7DBE; Linux kernel 2.6.20-web100_pktd-plus; Myricom NIC 10G-PCIE-8A-R, fibre; myri10ge v1.2.0 + firmware v1.4.10
rx-usecs=0, coalescence ON, MSI=1, checksums ON, tx_boundary=4096
Steps at 4060 and 8160 bytes, within 36 bytes of 2^n boundaries
Model the data-transfer time as t = C + m*Bytes, where C includes the time to set up the transfers
Fit is reasonable: C = 1.67 µs, m = 5.4e-4 µs/byte; the steps are consistent with C increasing by 0.6 µs
The Myricom driver segments the transfers, limiting the DMA to 4096 bytes. PCI-e chipset dependent!
[Plot (gig6-5_myri_udpscan): received wire rate (Mbit/s) vs size of user data in packet (bytes), 0-10,000 bytes, showing the steps near 4096 and 8192 bytes]
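The t = C + m*Bytes model with the observed steps can be sketched as follows. The fitted values (C = 1.67 µs, m = 5.4e-4 µs/byte, 0.6 µs step) are from the slide; treating the step as an extra setup cost per additional 4096-byte DMA segment is my interpretation of the driver behaviour described:

```python
def transfer_time_us(nbytes: int, C: float = 1.67, m: float = 5.4e-4,
                     step: float = 0.6, seg: int = 4096) -> float:
    """t = C + m*Bytes, with C growing by `step` each time the driver must
    start another DMA segment of `seg` bytes (step interpretation is mine)."""
    extra_segments = max(0, (nbytes - 1) // seg)
    return C + m * nbytes + step * extra_segments
```

Crossing a 4096-byte boundary adds ~0.6 µs to the transfer time, producing the steps seen in the wire-rate scan.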