network performance lessons from the coal face - networkshop44
Networkshop March 2016
Network performance: lessons from the coal face
Chris Walker
20/03/16
Achieving high performance
● What can be achieved?
● How we achieved it
– Architecture
– Network tuning
● TCP tuning
● Parallel transfers
● Multiple streams
● Monitoring
● Bottlenecks
– Found and fixed
● Conclusions
Motivation (LHC @CERN)
● Collisions every 25 ns
– 100 PB/year
● QMUL
– A small fraction
Network
● 1 Terabyte can be transferred in:
– 100 Mbps network: 30 hrs
– 1 Gbps network: 3 hrs
– 10 Gbps network: 20 minutes
● Takes work to achieve this in practice
– TCP tuning
– Find and eliminate bottlenecks
– Reduce packet loss
● Fasterdata.es.net
– Excellent source of information
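As a rough check on these figures: at full line rate, 1 TB takes about 22 hours at 100 Mbps, 2.2 hours at 1 Gbps and 13 minutes at 10 Gbps, so the slide's rounder numbers sit somewhat above the ideal, presumably allowing for overhead and less-than-perfect utilisation. A minimal sketch of the arithmetic, assuming decimal units (1 TB = 10^12 bytes):

```python
# Best-case time to move 1 TB at full line rate, ignoring protocol
# overhead, retransmissions and disk speed (all assumed negligible here).
TERABYTE_BITS = 8 * 10**12  # 1 TB taken as 10^12 bytes (decimal units)


def ideal_transfer_hours(rate_bps: float, size_bits: float = TERABYTE_BITS) -> float:
    """Return the transfer time in hours if the link runs flat out."""
    return size_bits / rate_bps / 3600


for label, rate in [("100 Mbps", 100e6), ("1 Gbps", 1e9), ("10 Gbps", 10e9)]:
    hours = ideal_transfer_hours(rate)
    print(f"{label:>8}: {hours:6.2f} h ({hours * 60:7.1f} min)")
```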
LAN topology
● 2 × data transfer nodes (SE) connected at 10 Gbit/s
– Optimised for WAN transfers
– Fast (Lustre) filesystem
● Network:
– 2 × 20 Gbit/s WAN links; HEP uses one
– Previously HEP had 1 Gbit/s dedicated + 80% of the resilient link
WAN performance
● April 2012: 1 Gbit/s dedicated link (saturated)
– Source-based routing (+ 80% of resilient link)
● Sept 2013: 2 × 10 Gbit/s links
– 1 × 10 Gbit/s used by High Energy Physics (HEP)
[Graphs: April 2012; Feb 2013]
WLCG World sites
Data transferred
● QMUL (2012)
– 2.6 PB downloaded (3.9 million files)
– 1.4 PB uploaded
– 870 MB/s peak rate
– 380 MB/s average on busy days
● ATLAS
– 1 PB in 1 week worldwide (October 2012?)
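The headline volumes convert into sustained rates as follows (decimal units assumed): 1 PB in a week needs about 1.65 GB/s, or roughly 13 Gbit/s, around the clock, and 380 MB/s held for a full day moves about 33 TB. A short sketch of the conversion:

```python
# Turn "volume per period" figures into sustained average rates.
# Decimal units assumed: 1 PB = 10^15 bytes, 1 TB = 10^12, 1 MB = 10^6.
SECONDS_PER_DAY = 86_400


def average_rate(volume_bytes: float, seconds: float) -> float:
    """Average rate in bytes per second needed to move a volume in a period."""
    return volume_bytes / seconds


# ATLAS: 1 PB in one week, worldwide
atlas_rate = average_rate(1e15, 7 * SECONDS_PER_DAY)
print(f"ATLAS: {atlas_rate / 1e9:.2f} GB/s (~{atlas_rate * 8 / 1e9:.1f} Gbit/s)")

# QMUL: 380 MB/s held for a whole busy day
qmul_tb_per_day = 380e6 * SECONDS_PER_DAY / 1e12
print(f"QMUL busy day: ~{qmul_tb_per_day:.1f} TB")

# QMUL 2012: mean size of a downloaded file
print(f"Mean file size: ~{2.6e15 / 3.9e6 / 1e6:.0f} MB")
```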
How TCP works: a very short overview
● Congestion window (CWND) = the number of packets the sender is allowed to have in flight (sent but not yet acknowledged)
– The larger the window size, the higher the throughput
– Throughput = Window size / Round-trip time
● TCP slow start
– Exponentially increase the congestion window size until a packet is lost
– This gives a rough estimate of the optimal congestion window size
[Figure: CWND against time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; timeout → retransmit: slow start again]
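The phases in the figure can be reproduced with a toy model that updates the window once per round trip. This is a simplified, Reno-style sketch (doubling in slow start, +1 segment per RTT in congestion avoidance, reset to 1 on timeout with the threshold set to half the failed window); the function name and parameters are illustrative, and Linux's default CUBIC algorithm grows the window differently.

```python
# Toy per-RTT model of the congestion window (in segments) showing the phases
# in the figure. Simplified Reno-style rules; real stacks (e.g. Linux's default
# CUBIC) grow the window differently.


def cwnd_trace(rtts: int, ssthresh: int, timeouts: set[int]) -> list[int]:
    """Return the congestion window at the start of each round trip."""
    cwnd = 1
    trace = []
    for rtt in range(rtts):
        trace.append(cwnd)
        if rtt in timeouts:
            ssthresh = max(cwnd // 2, 2)    # remember half the window that failed
            cwnd = 1                        # timeout: retransmit, slow start again
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start: double each RTT
        else:
            cwnd += 1                       # congestion avoidance: +1 per RTT
    return trace


print(cwnd_trace(rtts=16, ssthresh=16, timeouts={9}))
# [1, 2, 4, 8, 16, 17, 18, 19, 20, 21, 1, 2, 4, 8, 10, 11]
```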
TCP Tuning
● Latency: time to send 1 packet from the source to the destination
● RTT: round-trip time
● Bandwidth × Delay (RTT) = Bandwidth Delay Product (BDP)
– The number of bytes in flight needed to fill the entire path
– Example: 10 Gbps path; ping shows a 90 ms RTT (QMUL -> BNL)
● BDP = 10 × 0.090 = 0.9 Gbits (112 MBytes)
– QMUL -> Taiwan: 273 ms RTT (on a 10 Gbps path)
● BDP = 10 × 0.273 = 2.73 Gbits (340 MBytes)
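The BDP arithmetic above, and the formula Throughput = Window size / RTT, can be checked with a few lines of Python. The 4 MB window in the last line is a hypothetical value chosen to show how an undersized window caps throughput on a long path.

```python
# Bandwidth-delay product: bytes that must be "in flight" to keep a path full.
# The TCP window (and socket buffers) must be at least this large.


def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes in flight needed to fill a path of the given rate and RTT."""
    return bandwidth_bps * rtt_s / 8


def max_throughput_bps(window_bytes: float, rtt_s: float) -> float:
    """Throughput = window size / RTT, in bits per second."""
    return window_bytes * 8 / rtt_s


for path, rtt in [("QMUL -> BNL", 0.090), ("QMUL -> Taiwan", 0.273)]:
    print(f"{path}: BDP = {bdp_bytes(10e9, rtt) / 1e6:.0f} MB")

# With a (hypothetical) 4 MB window, the 90 ms path is capped far below 10 Gbps
print(f"4 MB window at 90 ms: {max_throughput_bps(4e6, 0.090) / 1e6:.0f} Mbit/s")
```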
Effect of packet loss with distance
● Graph from fasterdata.es.net
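The graph makes the point that a given packet-loss rate costs far more throughput on a long path than on a short one. A common way to reason about this is the Mathis et al. approximation for a single loss-limited TCP stream, sketched below; it is not necessarily how the plotted figures were produced, and the 0.01 % loss rate and the 1 ms and 10 ms RTTs are illustrative assumptions.

```python
import math

# Mathis et al. approximation for one loss-limited, Reno-style TCP stream:
#   throughput <= (MSS / RTT) * (C / sqrt(p)),  with C = sqrt(3/2)
# Used here only to show why the same loss rate hurts more on long paths.


def mathis_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Upper bound on single-stream throughput in bits per second."""
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss)


MSS = 1460   # bytes: typical for a 1500-byte Ethernet MTU
LOSS = 1e-4  # 0.01 % packet loss (an assumed, illustrative value)
for rtt_ms in (1, 10, 90, 273):  # 1 and 10 ms illustrative; 90 and 273 ms from the slides
    rate = mathis_bps(MSS, rtt_ms / 1000, LOSS)
    print(f"RTT {rtt_ms:>3} ms: {rate / 1e6:8.1f} Mbit/s")
```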
Multiple streams
● Parallel streams can help
– Potentially unfair on other users
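A rough model of why parallel streams help, and why they can be unfair: if each stream is independently loss-limited, N streams deliver about N times one stream's rate until the link is full, and at a congested bottleneck that splits capacity per flow, a user with N streams takes N shares. Both assumptions are simplifications, and the helper names are hypothetical.

```python
import math

# Sketch of why parallel streams help, and why they can be unfair.
# Assumes each stream independently obeys the Mathis bound and that a
# shared bottleneck splits capacity roughly equally per TCP flow.


def single_stream_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Mathis et al. bound for one loss-limited stream, in bits per second."""
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss)


def parallel_bps(streams: int, mss_bytes: int, rtt_s: float, loss: float,
                 link_bps: float) -> float:
    """N streams add up until the link itself becomes the limit."""
    return min(streams * single_stream_bps(mss_bytes, rtt_s, loss), link_bps)


def bottleneck_share(my_streams: int, other_flows: int) -> float:
    """Fraction of a congested link taken by a user opening my_streams flows."""
    return my_streams / (my_streams + other_flows)


for n in (1, 4, 16):
    rate = parallel_bps(n, 1460, 0.090, 1e-4, link_bps=10e9)
    print(f"{n:>2} streams: {rate / 1e6:7.1f} Mbit/s; "
          f"share against 4 single-stream users: {bottleneck_share(n, 4):.0%}")
```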
TCP lessons
● Increase TCP buffers for distant transfers
– Fasterdata.es.net has good recommendations
● Packet loss needs eliminating
● Application
– Large buffers (not scp)
– Multiple streams
– GridFTP has these
● Aspera uses UDP (and GridFTP can)
● Fasterdata.es.net has excellent recommendations
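On Linux, "increase TCP buffers" usually means raising the kernel's buffer limits so that autotuning can grow the window to at least the bandwidth-delay product. The sketch below derives such limits for a given path and prints them in sysctl form; the 2x headroom and the min/default values are assumptions for illustration rather than fasterdata.es.net's exact recommendations, so check the site before applying anything.

```python
# Sketch: derive Linux TCP buffer limits from the bandwidth-delay product.
# The 2x headroom and the min/default values (4096, 87380, 65536) are
# assumptions for illustration; check fasterdata.es.net for current advice.


def tcp_buffer_sysctls(bandwidth_bps: float, rtt_s: float,
                       headroom: float = 2.0) -> list[str]:
    """Return sysctl lines whose maximum buffer covers headroom x BDP."""
    max_buf = round(bandwidth_bps * rtt_s / 8 * headroom)
    return [
        f"net.core.rmem_max = {max_buf}",
        f"net.core.wmem_max = {max_buf}",
        # min, default, max: the kernel autotunes between these per connection
        f"net.ipv4.tcp_rmem = 4096 87380 {max_buf}",
        f"net.ipv4.tcp_wmem = 4096 65536 {max_buf}",
    ]


# 10 Gbit/s path with a 90 ms RTT (QMUL -> BNL example)
print("\n".join(tcp_buffer_sysctls(10e9, 0.090)))
```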
Bottlenecks found
● Gbit links connected at 100 Mbit
– GridFTP node
– Department
– College
– Found by iperf tests with another university
● 1 min CPU limit
● 2 × 1 Gbit link aggregation hashing
– Can also cause packet loss
[Graph: proportion of traffic per network card (1–4), Layer 3+4 hashing vs Layer 2 hashing (HP switch)]
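The graph contrasts how a Layer 2 hash policy and a Layer 3+4 policy spread traffic over aggregated links. The sketch below is a guess at the mechanism, under stated assumptions: if the hash uses only source and destination MAC addresses, all WAN traffic arrives from one router MAC to one server MAC, so every flow lands on the same link; hashing on IP addresses and ports spreads flows out. SHA-256 stands in for the switch's real hardware hash, and all addresses and ports are made up.

```python
import hashlib
import random

# Sketch: how the link-aggregation hash policy spreads flows over bonded links.
# SHA-256 stands in for the switch's hardware hash (an assumption);
# MAC/IP/port values are made up for illustration.
LINKS = 2  # 2 x 1 Gbit bonded links, as on the slide


def pick_link(*fields: object) -> int:
    """Map the chosen header fields to one of the bonded links."""
    key = "|".join(map(str, fields)).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % LINKS


def proportions(links: list[int]) -> list[float]:
    """Fraction of flows carried by each link."""
    return [links.count(i) / len(links) for i in range(LINKS)]


rng = random.Random(1)
ROUTER_MAC, SERVER_MAC = "00:00:5e:00:53:01", "00:00:5e:00:53:02"
# WAN flows: many remote hosts and ports, but every frame arrives via one router
flows = [(f"198.51.100.{rng.randint(1, 254)}", "192.0.2.10",
          rng.randint(1024, 65535), 2811) for _ in range(1000)]

layer2 = proportions([pick_link(ROUTER_MAC, SERVER_MAC) for _ in flows])
layer34 = proportions([pick_link(*flow) for flow in flows])
print("Layer 2:  ", layer2)   # every flow lands on the same link
print("Layer 3+4:", layer34)  # flows spread roughly evenly
```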
Routing
● Linux Layer 2
– 1 Gig (not 10 Gig)
● 12 × 1G → 1G
● Router flapping
– Route 1 Gbit or advertise routes
● To CERN via the US
Routing problems with 10Gbit/s upgrade
● 4th September: 10 Gbit/s WAN upgrade
– UK sites: increased rates
– ASGC (Taiwan): decreased rates, as the route was not advertised via GEANT
Firewalls
● ICMP
– Often blocked
– Timeout rather than failure
● IPv6
– tracepath6 blocked (ICMP blocking)
● Barclays bank blocked
– Deep packet inspection and rewriting of HTTP packets (but not HTTPS)
● scp failing halfway through a transfer
● GridFTP slow performance
– 1 MB/s through the firewall, 50 MB/s avoiding the firewall
● GridFTP control connection forgotten
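One plausible reading of "control connection forgotten" is that the GridFTP control channel sits idle while data moves on separate connections, so a stateful firewall times out its entry for the control channel. A common mitigation is to enable TCP keepalives on the idle socket, sketched below in Python; this is a generic illustration, not GridFTP's own configuration. Linux's default keepalive idle time is two hours (net.ipv4.tcp_keepalive_time = 7200), so the per-socket timers here, which are hypothetical values, need to be shorter than the firewall's idle timeout.

```python
import socket

# Sketch: keep an otherwise idle control connection visible to a stateful
# firewall by enabling TCP keepalives. Timer values are hypothetical; they must
# be shorter than the firewall's idle timeout, which varies by site.


def enable_keepalive(sock: socket.socket, idle_s: int = 300,
                     interval_s: int = 60, probes: int = 5) -> None:
    """Turn on keepalives and, where supported, set per-socket timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket timers: available on Linux, not on every platform
    names = ("TCP_KEEPIDLE", "TCP_KEEPINTVL", "TCP_KEEPCNT")
    if all(hasattr(socket, n) for n in names):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    enable_keepalive(s)
    print("keepalive on:", bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
```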
IPv6
● Routes
– May be different to IPv4
– Geneva -> QMUL via New York (fixed)
● Some routers forward IPv6 in software but IPv4 in hardware (ASIC)
– Older routers may give poor performance (see perfSONAR talk)
● Preferred over IPv4
– If an IPv6 address (AAAA record) is in DNS, it will be used by machines that think they are IPv6 connected
● Blocked differently by firewalls
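Whether IPv6 is "preferred" comes down to the order in which the resolver returns addresses: on a host that believes it has IPv6 connectivity, the default address-selection rules (RFC 6724) normally put AAAA results first, and most clients simply try addresses in order. A minimal sketch that shows the order for a given name; the function name is illustrative.

```python
import socket

# Sketch: list the address families a name resolves to, in the order the
# system resolver returns them. Clients that try addresses in this order will
# use IPv6 first whenever an AAAA result is returned ahead of the A result.


def resolve_order(host: str, port: int = 443) -> list[str]:
    """Return 'IPv6'/'IPv4' labels in resolver order for host."""
    infos = socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    return ["IPv6" if family == socket.AF_INET6 else "IPv4"
            for family, *_rest in infos]


print(resolve_order("127.0.0.1"))  # a literal address needs no DNS lookup
# resolve_order("www.example.org") on a dual-stack host typically lists IPv6 first
```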
Jumbo Frames
● Ethernet
– MTU = 1500: normal
– MTU = 9000: jumbo (by convention)
● Janet network supposed to allow this
– Only for the brave at present
– Encapsulation: if a site uses MTU = 9000 jumbo frames, they are fragmented over Janet
● Path MTU discovery
– Sometimes blocked by firewalls
– More likely to be dropped (misconfigured switches etc.)
– Linux sysctl: net.ipv4.tcp_mtu_probing=1
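The trade-off behind jumbo frames is easy to quantify: a larger MTU means less header overhead per byte and far fewer packets per second for hosts and routers to handle. The sketch assumes IPv4 and TCP headers without options; in the loss-limited regime, single-stream TCP throughput also scales with segment size, which is part of the appeal of MTU = 9000.

```python
# Sketch: wire efficiency and packet rate for standard vs jumbo frames.
# Assumes IPv4 + TCP with no options, and counts Ethernet header, FCS,
# preamble/SFD and the inter-frame gap as per-frame overhead.
ETH_OVERHEAD = 14 + 4 + 8 + 12  # bytes on the wire per frame beyond the MTU
IP_TCP_HEADERS = 20 + 20        # IPv4 header + TCP header


def payload_fraction(mtu: int) -> float:
    """Share of wire time carrying TCP payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


def packets_per_second(link_bps: float, mtu: int) -> float:
    """Full-size frames per second needed to fill the link."""
    return link_bps / ((mtu + ETH_OVERHEAD) * 8)


for mtu in (1500, 9000):
    print(f"MTU {mtu}: {payload_fraction(mtu):.1%} payload, "
          f"{packets_per_second(10e9, mtu):,.0f} packets/s at 10 Gbit/s")
```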
Network monitoring (perfSONAR + RIPE Atlas)
● Cacti
– Monitor packet loss
– 64-bit counters
● perfSONAR
– Bandwidth
– Latency
– Reverse traceroute
● RIPE Atlas probe
– atlas.ripe.net
– Latency
– Traceroute
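The reason 64-bit counters matter for monitoring at these speeds: a 32-bit byte counter wraps in a few seconds at 10 Gbit/s, so a poller sampling every few minutes cannot tell how many times it has rolled over. The 5-minute polling interval below is an assumption.

```python
# Sketch: how quickly an interface byte counter wraps at full line rate.
POLL_S = 300  # 5-minute polling interval (assumed; a common Cacti setting)


def wrap_seconds(counter_bits: int, link_bps: float) -> float:
    """Seconds until a byte counter of the given width rolls over."""
    return 2**counter_bits / (link_bps / 8)


for bits in (32, 64):
    t = wrap_seconds(bits, 10e9)
    print(f"{bits}-bit counter at 10 Gbit/s: wraps every {t:,.1f} s "
          f"({t / POLL_S:.3g} polling intervals)")
```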
Bufferbloat (www.bufferbloat.net)
● Chaotic and laggy network performance
● Buffers too big for the bandwidth
● Affects home users on low-bandwidth links with big buffers
– Packet loss signals the bandwidth limit too late
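The bufferbloat problem can be put in numbers: a full buffer adds delay equal to its size divided by the link rate, and because TCP only backs off when a packet is dropped, a large buffer lets that queue build up before the sender gets the signal. The uplink speed and buffer sizes below are hypothetical.

```python
# Sketch: queueing delay added by a full buffer in front of a slow link,
#   delay = buffer size / link rate


def queue_delay_s(buffer_bytes: float, link_bps: float) -> float:
    """Seconds a packet waits behind a full buffer of the given size."""
    return buffer_bytes * 8 / link_bps


# Hypothetical home uplink and buffer sizes, for illustration only
for buf_kb in (64, 256, 1024):
    d = queue_delay_s(buf_kb * 1024, 1e6)
    print(f"{buf_kb:>5} KB buffer on a 1 Mbit/s uplink: {d:.2f} s of extra delay")
```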
Conclusions
● Large transfers routine
– But take work (GridPP sites have this experience)
– Needs management layer
● Monitoring vital
– Transfers
– Network
● Network
– Low packet loss
– Good relationship with network team useful
● Information
– Fasterdata.es.net
Acknowledgements
● Fasterdata.es.net (Brian Tierney)
– Many thanks for the TCP tuning slides
● Duncan Rand
● Brian Davies
● Dan Traynor
● Terry Froy
jisc.ac.uk
Christopher Walker