Recent Linux TCP Updates, and how to tune your 100G host
Nate Hanford, Brian Tierney, [email protected]
http://fasterdata.es.net
Internet2 Technology Exchange, Sept 27, 2016, Miami, FL
Observation #1
• TCP is more stable in CentOS 7 vs CentOS 6
  – Throughput ramps up much quicker
    • More aggressive slow start
  – Less variability over life of the flow
9/30/162
Berkeley to Amsterdam
New York to Texas
Observation #2
• Turning on FQ helps throughput even more
  – TCP is even more stable
  – Works better with small buffer devices
• Pacing to match bottleneck link works better yet
TCP option: Fair Queuing Scheduler (FQ)
Available in Linux kernel 3.11 (released late 2013) or higher
  – Available in Fedora 20, Debian 8, and Ubuntu 13.10
  – Backported to the 3.10.0-327 kernel in CentOS/RHEL v7.2 (Dec 2015)
To enable Fair Queuing (which is off by default), do:
  – tc qdisc add dev $ETH root fq
Or add this to /etc/sysctl.conf:
  net.core.default_qdisc = fq
To both pace and shape the traffic:
  – tc qdisc add dev $ETH root fq maxrate Ngbit
    • Can reliably pace up to a maxrate of 32 Gbps on fast processors
Can also do application pacing using a ‘setsockopt(SO_MAX_PACING_RATE)’ system call
  – iperf3 supports this via the ‘--bandwidth’ flag
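A minimal sketch of application-level pacing with setsockopt(SO_MAX_PACING_RATE), assuming a Linux host. The helper names are illustrative, and the numeric value 47 is the option number used on x86 Linux, included only as a fallback in case Python's socket module does not export the constant. The rate is passed in bytes per second as a 32-bit value, so this sketch cannot express rates above roughly 34 Gbps.

```python
import socket
import struct

# Option number on x86 Linux (asm-generic/socket.h). Assumption: other
# architectures may use a different value.
SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)


def gbps_to_bytes_per_sec(gbps):
    """Convert a rate in Gbit/s to bytes per second, the unit the option takes."""
    return int(gbps * 1e9 / 8)


def set_pacing_rate(sock, gbps):
    """Ask the kernel to pace this socket's output at about `gbps` Gbit/s."""
    rate = gbps_to_bytes_per_sec(gbps)
    # Pack as an unsigned 32-bit value; a plain Python int would be passed as a
    # signed C int and overflow above about 17 Gbps.
    sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, struct.pack("I", rate))
    return rate


if __name__ == "__main__":
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            print("pacing at", set_pacing_rate(s, 10), "bytes/sec")
    except OSError as err:
        print("socket pacing not available here:", err)
```

Per-socket pacing like this limits one flow, while the tc maxrate form applies to every flow leaving the interface.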
FQ Background
• Lots of discussion around ‘bufferbloat’ starting in 2011
  – https://www.bufferbloat.net/
• Google wanted to be able to get higher utilization on their network
  – Paper: “B4: Experience with a Globally-Deployed Software Defined WAN”, SIGCOMM 2013
• Google hired some very smart TCP people
  • Van Jacobson, Matt Mathis, Eric Dumazet, and others
• Result: Lots of improvements to the TCP stack in 2013-14, including most notably the ‘fair queuing’ pacer
New York to Texas: With Pacing
100G Host Tuning
Test Environment
• Hosts:
  – Supermicro X10DRi DTNs
  – Intel Xeon E5-2643 v3, 2 sockets, 6 cores each
  – CentOS 7.2 running Kernel 3.10.0-327.el7.x86_64
  – Mellanox ConnectX-4 EN/VPI 100G NICs with ports in EN mode
  – Mellanox OFED Driver 3.3-1.0.4 (03 Jul 2016), Firmware 12.16.1020
• Topology
  – Both systems connected to Dell Z9100 100Gbps ON Top-of-Rack Switch
  – Uplink to nersc-tb1 ALU SR7750 Router running 100G loop to Starlight and back
    • 92ms RTT
  – Using Tagged 802.1q to switch between Loop and Local VLANs
  – LAN had 54 usec RTT
• Configuration:
  – MTU was 9000B
  – irqbalance, tuned, and numad were off
  – core affinity was set to cores 7 and 8 (on the NUMA node closest to the NIC)
  – All tests are IPV4 unless otherwise stated
Testbed Topology
[Diagram: hosts nersc-tbn-4 and nersc-tbn-5, a Dell Z9100 switch, and the nersc-tb1 Alcatel 7750 Router in Oakland, CA; star-tb1 Alcatel 7750 Router at StarLight (Chicago); 100G loop with RTT = 92ms; link labels 100G and 40G. Each host has a Mellanox ConnectX-4 (100G) and a Mellanox ConnectX-3 (40G).]
Our Current Best Single Flow Results
• TCP
  – LAN: 79 Gbps
  – WAN (RTT = 92ms): 36.5 Gbps, 49 Gbps using ‘sendfile’ API (‘zero-copy’)
  – Test commands:
    • LAN: nuttcp -i1 -xc 7/7 -w1m -T30 hostname
    • WAN: nuttcp -i1 -xc 7/7 -w900M -T30 hostname
• UDP:
  – LAN and WAN: 33 Gbps
  – Test command: nuttcp -l8972 -T30 -u -w4m -Ru -i1 -xc 7/7 hostname
Others have reported up to 85 Gbps LAN performance with similar hardware
CPU governor
Linux CPU governor (P-States) setting makes a big difference:
  RHEL: cpupower frequency-set -g performance
  Debian: cpufreq-set -r -g performance
57 Gbps default settings (powersave) vs. 79 Gbps ‘performance’ mode on the LAN
To watch the CPU governor in action:
  watch -n 1 grep MHz /proc/cpuinfo
  cpu MHz : 1281.109
  cpu MHz : 1199.960
  cpu MHz : 1299.968
  cpu MHz : 1199.960
  cpu MHz : 1291.601
  cpu MHz : 3700.000
  cpu MHz : 2295.796
  cpu MHz : 1381.250
  cpu MHz : 1778.492
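The same check can be scripted. The sketch below, with illustrative function names, pulls the per-core ‘cpu MHz’ values out of /proc/cpuinfo text and summarizes them, which makes it easy to spot cores that are still clocked down. It assumes an x86 Linux host where /proc/cpuinfo reports a ‘cpu MHz’ field.

```python
import re


def core_frequencies_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value for every core listed in /proc/cpuinfo text."""
    pattern = r"^cpu MHz\s*:\s*([\d.]+)"
    return [float(v) for v in re.findall(pattern, cpuinfo_text, re.MULTILINE)]


def summarize(freqs):
    """Return (min, max, mean) of a non-empty list of frequencies."""
    return min(freqs), max(freqs), sum(freqs) / len(freqs)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            freqs = core_frequencies_mhz(f.read())
    except OSError:
        freqs = []
    if freqs:
        lo, hi, mean = summarize(freqs)
        print(f"{len(freqs)} cores: min {lo:.0f} MHz, max {hi:.0f} MHz, mean {mean:.0f} MHz")
    else:
        print("no 'cpu MHz' lines found")
```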
CPU frequency
• Driver: Kernel module or code that makes CPU frequency calls to hardware
• Governor: Driver setting that determines how the frequency will be set
• Performance Governor: Bias towards higher frequencies
• Userspace Governor: Allow user to specify exact core and package frequencies
• Only the Intel P-States Driver can make use of Turbo Boost
• Check current settings: cpupower frequency-info
       P-States       ACPI-CPUfreq   ACPI-CPUfreq
       Performance    Performance    Userspace
LAN    79G            72G            67G
WAN    36G            36G            27G
TCP Buffers
# add to /etc/sysctl.conf
# allow testing with 2GB buffers
net.core.rmem_max = 2147483647
net.core.wmem_max = 2147483647
# allow auto-tuning up to 2GB buffers
net.ipv4.tcp_rmem = 4096 87380 2147483647
net.ipv4.tcp_wmem = 4096 65536 2147483647
2GB is the max allowable under Linux
WAN BDP = 12.5 GB/s * 92 ms = 1150 MB (autotuning set this to 1136 MB)
LAN BDP = 12.5 GB/s * 54 us = 675 KB (autotuning set this to 2-9 MB)
Manual buffer tuning made a big difference on the LAN:
  – 50-60 Gbps vs 79 Gbps
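The BDP figures above are the link rate multiplied by the round-trip time. A short sketch of that arithmetic, using the 100 Gbps rate and the two RTTs from the test environment (the function name is illustrative):

```python
LINUX_MAX_BUFFER = 2147483647  # the 2GB limit used in the sysctl settings


def bdp_bytes(rate_gbps, rtt_seconds):
    """Bandwidth-delay product in bytes for a rate in Gbit/s and an RTT in seconds."""
    return rate_gbps * 1e9 / 8 * rtt_seconds


if __name__ == "__main__":
    wan = bdp_bytes(100, 92e-3)  # 92 ms WAN loop
    lan = bdp_bytes(100, 54e-6)  # 54 usec LAN
    print(f"WAN BDP: {wan / 1e6:.0f} MB")
    print(f"LAN BDP: {lan / 1e3:.0f} KB")
    print("WAN BDP fits in a 2GB buffer:", wan < LINUX_MAX_BUFFER)
```

A buffer smaller than the BDP caps a single flow below line rate, which is why the WAN test command requests a 900M window.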
zerocopy (sendfile) results
• iperf3 -Z option
• No significant difference on the LAN
• Significant improvement on the WAN
  – 36.5 Gbps vs 49 Gbps
IPv4 vs IPv6 results
• IPV6 is slightly faster on the WAN, slightly slower on the LAN
• LAN:
  – IPV4: 79 Gbps
  – IPV6: 77.2 Gbps
• WAN
  – IPV4: 36.5 Gbps
  – IPV6: 37.3 Gbps
Don’t Forget about NUMA Issues
• Up to 2x performance difference if you use the wrong core.
• If you have a 2 CPU socket NUMA host, be sure to:
  – Turn off irqbalance
  – Figure out what socket your NIC is connected to: cat /sys/class/net/ethN/device/numa_node
  – Run Mellanox IRQ script: /usr/sbin/set_irq_affinity_bynode.sh 1 ethN
  – Bind your program to the same CPU socket as the NIC: numactl -N 1 program_name
• Which cores belong to a NUMA socket?
  – cat /sys/devices/system/node/node0/cpulist
  – (note: on some Dell servers, that might be: 0,2,4,6,...)
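The two sysfs lookups can be combined to list the cores that sit on the same NUMA node as the NIC. This is a sketch with illustrative function names; ‘eth0’ is a placeholder interface name, and the kernel may report -1 in numa_node when it does not know the NIC's node.

```python
def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-5,12-17' or '0,2,4,6'."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def nic_numa_node(ifname):
    """Read the NUMA node the NIC is attached to (-1 means unknown)."""
    with open(f"/sys/class/net/{ifname}/device/numa_node") as f:
        return int(f.read())


def node_cpus(node):
    """Return the list of core numbers that belong to a NUMA node."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())


if __name__ == "__main__":
    try:
        node = nic_numa_node("eth0")  # placeholder interface name
        print("NIC is on node", node, "with cores", node_cpus(max(node, 0)))
    except (OSError, ValueError) as err:
        print("could not read NUMA information:", err)
```

The resulting core list is what you would hand to numactl or to the IRQ affinity scripts.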
Settings to leave alone in CentOS 7
Recommend leaving these at the default settings, and none of these seem to impact performance much
• Interrupt Coalescence
• Ring Buffer size
• LRO (off) and GRO (on)
• net.core.netdev_max_backlog
• txqueuelen
• tcp_timestamps
Tool Selection
• nuttcp and iperf3 have different strengths.
• nuttcp is about 10% faster on LAN tests
• iperf3 JSON output option is great for producing plots
• Use both! Both are part of the ‘perfsonar-tools’ package
  – Installation instructions at: http://fasterdata.es.net/performance-testing/network-troubleshooting-tools/
OS Comparisons
• CentOS 7 (3.10 kernel) vs. Ubuntu 14.04 (4.2 kernel) vs. Ubuntu 16.04 (4.4 kernel)
  – Note: the 4.2 kernel is about 5-10% slower (sender and receiver)
• Sample Results:
  • CentOS 7 to CentOS 7: 79 Gbps
  • CentOS 7 to Ubuntu 14.04 (4.2.0 kernel): 69 Gbps
  • Ubuntu 14.04 (4.2) to CentOS 7: 71 Gbps
  • CentOS 7 to Ubuntu 16.04 (4.4 kernel): 73 Gbps
  • Ubuntu 16.04 (4.4 kernel) to CentOS 7: 75 Gbps
  • CentOS 7 to Debian 8.4 with 4.4.6 kernel: 73.6G
  • Debian 8.4 with 4.4.6 Kernel to CentOS 7: 76G
BIOS Settings
• DCA/IOAT/DDIO: ON
  – Allows the NIC to directly address the cache in DMA transfers
• PCIe Max Read Request: Turn it up to 4096, but our results suggest it doesn’t seem to hurt or help
• Turbo boost: ON
• Hyperthreading: OFF
  – Added excessive variability in LAN performance (51G to 77G)
• node/memory interleaving: ??
PCI Bus Commands
Make sure you’re installing the NIC in the right slot. Useful commands include:
Find your PCI slot:
  lspci | grep Ethernet
  81:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
Confirm that this slot is PCIe Gen3 x16:
  lspci -s 81:00.0 -vvv | grep PCIeGen
  [V0] Vendor specific: PCIeGen3 x16
Confirm that PCI MaxReadReq is 4096B:
  lspci -s 81:00.0 -vvv | grep MaxReadReq
  MaxPayload 256 bytes, MaxReadReq 4096 bytes
If not, you can increase it using ‘setpci’
• For more details, see: https://community.mellanox.com/docs/DOC-2496
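When checking many hosts it can help to script the MaxReadReq test. The sketch below only parses text in the format shown in the lspci output above; the function names are illustrative, and running lspci itself (typically as root, so the capability list is visible) is left to the caller.

```python
import re


def parse_max_read_req(lspci_text):
    """Return MaxReadReq in bytes from 'lspci -vvv' text, or None if absent."""
    match = re.search(r"MaxReadReq (\d+) bytes", lspci_text)
    return int(match.group(1)) if match else None


def needs_tuning(lspci_text, wanted=4096):
    """True when MaxReadReq was found and is below the wanted size."""
    value = parse_max_read_req(lspci_text)
    return value is not None and value < wanted


if __name__ == "__main__":
    sample = "MaxPayload 256 bytes, MaxReadReq 4096 bytes"
    print("MaxReadReq:", parse_max_read_req(sample))
    print("needs tuning:", needs_tuning(sample))
```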
Benchmarking vs. Production Host Settings
There are some settings that will give you more consistent results for benchmarking, but that you may not want to run on a production DTN
Benchmarking:
• Use a specific core for IRQs: /usr/sbin/set_irq_affinity_cpulist.sh 8 ethN
• Use a fixed clock speed (set to the max for your processor):
  – /bin/cpupower -c all frequency-set -f 3.4GHz
Production DTN:
  /usr/sbin/set_irq_affinity_bynode.sh 1 ethN
  /bin/cpupower frequency-set -g performance
FQ on 100G Hosts
100G Host, Parallel Streams: no pacing vs 20G pacing
We also see consistent loss on the LAN with 4 streams, no pacing
Packet loss due to small buffers in Dell Z9100 switch?
100G Host to 10G Host
Fast Host to Slow host
Throttled the receive host using ‘cpupower’ command: /bin/cpupower -c all frequency-set -f 1.2GHz
Summary of our 100G results
• New enhancements to the Linux kernel make tuning easier in general.
• A few of the standard 10G tuning knobs no longer apply
• TCP buffer autotuning does not work well on a 100G LAN
• Use the ‘performance’ CPU governor
• Use FQ pacing to match receive host speed if possible
• Important to be using the latest driver from Mellanox
  – version: 3.3-1.0.4 (03 Jul 2016), firmware-version: 12.16.1020
What’s next in the TCP world?
• TCP BBR (Bottleneck Bandwidth and RTT) from Google
  – https://patchwork.ozlabs.org/patch/671069/
  – Google Group: https://groups.google.com/forum/#!topic/bbr-dev
• A detailed description of BBR will be published in ACM Queue, Vol. 14 No. 5, September-October 2016:
  – "BBR: Congestion-Based Congestion Control".
• Google reports 2-4 orders of magnitude performance improvement on a path with 1% loss and 100ms RTT.
  – Sample result: cubic: 3.3 Mbps, BBR: 9150 Mbps!!
  – Early testing on ESnet less conclusive, but seems to help on some paths
Initial BBR TCP results (bwctl, 3 streams, 40 sec test)
Remote Host                 Throughput                Retransmits
perfsonar.nssl.noaa.gov     htcp: 183   bbr: 803      htcp: 1070   bbr: 240340
kstar-ps.nfri.re.kr         htcp: 4301  bbr: 4430     htcp: 1641   bbr: 98329
ps1.jpl.net                 htcp: 940   bbr: 935      htcp: 1247   bbr: 399110
uhmanoa-tp.ps.uhnet.net     htcp: 5051  bbr: 3095     htcp: 5364   bbr: 412348
Varies between 4x better and 30% worse, all with WAY more retransmits.
More Information
• http://fasterdata.es.net/host-tuning/packet-pacing/
• http://fasterdata.es.net/host-tuning/100g-tuning/
• Talk on Switch Buffer size experiments:
  – http://meetings.internet2.edu/2015-technology-exchange/detail/10003941/
• Mellanox Tuning Guide:
  – https://community.mellanox.com/docs/DOC-1523
• Email: [email protected]
Extra Slides
mlnx_tune command
• See: https://community.mellanox.com/docs/DOC-1523
Coalescing Parameters
• Varies by manufacturer
• usecs: Wait this amount of microseconds after 1st packet is received/transmitted
• frames: Interrupt after this many frames are received or transmitted
• tx-usecs and tx-frames aren’t as important as rx-usecs
• Due to the higher line rate, lower is better, until interrupts get in the way (at 100G, we are sending almost 14 frames/usec)
• Default settings seem best for most cases
A small amount of packet loss makes a huge difference in TCP performance
[Figure: series are Measured (TCP Reno), Measured (HTCP), Theoretical (TCP Reno), and Measured (no loss); distance ranges are Local (LAN), Metro Area, Regional, Continental, and International]
With loss, high performance beyond metro distances is essentially impossible
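The slide does not say how the theoretical TCP Reno curve was computed. A commonly used model for loss-limited Reno throughput is the Mathis et al. bound, rate ≤ (MSS / RTT) × (C / √p). The sketch below assumes that model with C = 1, so treat the numbers as rough illustrations only; the function name and the example parameters are illustrative.

```python
from math import sqrt


def mathis_throughput_bps(mss_bytes, rtt_seconds, loss_rate, c=1.0):
    """Upper bound on Reno throughput in bits/s under the Mathis model."""
    return c * mss_bytes * 8 / (rtt_seconds * sqrt(loss_rate))


if __name__ == "__main__":
    # Assumed example: 1460-byte MSS, 1% loss, RTTs from LAN scale to long haul.
    for rtt_ms in (1, 10, 50, 100):
        bps = mathis_throughput_bps(1460, rtt_ms / 1000, 0.01)
        print(f"RTT {rtt_ms:>3} ms: at most {bps / 1e6:.2f} Mbps")
```

Because the bound scales with 1/RTT, the same loss rate that is tolerable on a LAN becomes crippling on long paths, which is the point the figure makes.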
TCP’s Congestion Control
© 2015 Internet2
[Figure: 50ms simulated RTT; congestion w/ 2Gbps UDP traffic; HTCP / Linux 2.6.32]
Slide from Michael Smitasin, LBLnet
Fair Queuing and Small Switch Buffers
TCP Throughput on Small Buffer Switch (Congestion w/ 2Gbps UDP background traffic)
Requires CentOS 7.2 or higher
Enable Fair Queuing: tc qdisc add dev EthN root fq
Pacing side effect of Fair Queuing yields ~1.25 Gbps increase in throughput @ 10Gbps on our hosts
TSO differences still negligible on our hosts w/ Intel X520
Slide from Michael Smitasin, LBL
More examples of pacing helping
Parallel Stream Test 1
Left side: sum of 4 streams
Right side: tput of each stream
Streams appear to be much better balanced with FQ, pacing to 2.4 performed best
Parallel Stream Test 2
Left side: sum of 4 streams
Right side: tput of each stream
Streams appear to be much better balanced with FQ
FQ Packets are much more evenly spaced
tcptrace/xplot output: FQ on left, Standard TCP on right
Run your own tests
• Find a remote perfSONAR host on a path of interest
  – Most of the 2000+ worldwide perfSONAR hosts will accept tests
  • See: http://stats.es.net/ServicesDirectory/
• Run some tests
  – bwctl -c hostname -t60 --parsable > results.json
• Convert JSON to gnuplot format:
  – https://github.com/esnet/iperf/tree/master/contrib
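For a quick look at the results without the contrib script, the per-interval throughput can be pulled out directly. This sketch assumes results.json holds iperf3-style JSON with an "intervals" list whose "sum" entries carry "end" and "bits_per_second" fields. It is not the contrib converter, and the function names are illustrative.

```python
import json


def interval_gbps(result):
    """Return (interval_end_seconds, Gbit/s) pairs from an iperf3-style result dict."""
    rows = []
    for interval in result.get("intervals", []):
        total = interval["sum"]
        rows.append((total["end"], total["bits_per_second"] / 1e9))
    return rows


def to_gnuplot(rows):
    """Format rows as 'time throughput' lines that gnuplot can plot directly."""
    return "\n".join(f"{t:.1f} {gbps:.3f}" for t, gbps in rows)


if __name__ == "__main__":
    # Tiny made-up sample in the assumed format; in practice, use
    # json.load(open("results.json")) instead.
    sample = json.loads(
        '{"intervals": [{"sum": {"end": 1.0, "bits_per_second": 36.5e9}},'
        ' {"sum": {"end": 2.0, "bits_per_second": 49e9}}]}'
    )
    print(to_gnuplot(interval_gbps(sample)))
```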