Time Capsule: Tracing Packet Latency across Different Layers in Virtualized Systems
Kun Suo*, Jia Rao*, Luwei Cheng†, Francis C. M. Lau‡
*UT Arlington, †Facebook, ‡The University of Hong Kong
Virtualization and Multi-Tenancy
• Mainstream in datacenters
  ✓ Fault isolation and enhanced security
  ✓ Improved hardware utilization
• Challenging to guarantee QoS
  ✓ Complex virtualization stacks
  ✓ Performance interference
Latency-sensitive network apps suffer poor and unpredictable performance
Para-virtualized Network I/O
[Diagram: a physical machine hosting a hypervisor, a privileged domain (native driver, backend driver, virtual switch), and a guest (frontend driver, app). The backend and frontend drivers communicate through shared memory; the hardware NIC raises physical interrupts and the hypervisor delivers virtual interrupts; each protection domain is isolated, and cross-machine communication goes through the NIC.]
Possible Causes of Long Latency
• Additional layers of software stack
  ✓ Asynchronous notifications
  ✓ Data copy
• Resource contention
  ✓ Time-sharing CPU
  ✓ Contentions on data structures, e.g., locks and queues
• Packet transmission in DC network
End-to-end latency monitoring and analysis is key to identifying the causes
Challenges in Monitoring Packet Latency in Virtualized Systems
• Across the boundaries of protection domains
  ✓ Machines, privileged domains, guests, and hypervisor
• Correlating events in various components
  ✓ Asynchronous packet processing
• Fine-grained tracing with low overhead
  ✓ Troubleshooting at packet level
• Application transparency
  ✓ A wide spectrum of network apps, no access to code
RelatedWork
• Tracingtoolsü App:gperfü OS:SystemTap,Dtrace,Perf,andbccü Hypervisor:Xentrace
• Distributedtracingü Causaltracing:Pip[NSDI’06],X-Trace[NSDI’07],Dapper[Google],Fay[SOSP’11],Magpie[HotOS]
ü Logmining:Draco[DSN’12],[NagarajNSDI’12],[XuSOSP’09]
Related Work (cont'd)
• Tracing metadata propagation
  ✓ Pivot Tracing [SOSP'15], X-Trace [NSDI'07], Dapper [Google]
  ✓ Propagating task IDs and timestamps using task containers or embedding the info into protocol headers
Tracing virtualized network I/O:
  ✓ Fine-grained tracing at packet level
  ✓ Packet processing at multiple hosts and different layers. Trace metadata propagation is difficult
Time Capsule
Timestamp packet processing at each tracepoint and append the tracing info to the packet payload
Advantages:
  ✓ Traces embedded in packets go across the boundaries of protection domains
  ✓ Timestamps taken at different points have happened-before relations
Time Capsule in Action
[Diagram: the same para-virtualized I/O stack (hardware NIC, hypervisor, privileged host with native/backend drivers and virtual switch, guest with frontend driver and app), annotated with tracepoints on the send and receive paths; traces are dumped and the packet payload is restored before delivery.]
Preserve the last timestamp at a sender to capture the causality relationship across machines
Timestamping at Tracepoints
• Challenges
  ✓ Low overhead
  ✓ Available in separate protection domains
  ✓ Dealing with time drift on multiple hosts
• Solution
  ✓ Para-virtualized clocksource xen or kvm-clock
  ✓ Function native_read_tscp to read TSC values, nanosecond granularity, ~20 ns overhead for each reading
  ✓ Set constant_tsc to ensure consistent TSC readings across different cores
Timestamping across Machines
• Network transmission time: naively, tsc_b − tsc_a
• Transmission time can be negative or inaccurate
  ✓ TSC ticks at different rates on machines a and b
  ✓ TSC resets at different times on machines a and b
TSC Calibration
[Figure 1 shows the packet send and receive paths through the NIC, local driver, backend driver (Dom0), and frontend driver (DomU), with enabled tracepoints along the way; original packets become TC packets carrying tracing data, and a collect-data path dumps the traces.]
Figure 1. The architecture of Time Capsule.
no changes to the traced applications. We demonstrate that fine-grained tracing and latency decomposition enabled by TC shed light on the root causes of long tail network latency and help identify real performance bugs in Xen.
The rest of the paper is organized as follows. Sections 2 and 3 describe the design, implementation, and optimizations of Time Capsule, respectively. Section 4 presents evaluation results. Sections 5 and 6 discuss future and related work, respectively. Section 7 concludes this paper.
2. Design and Implementation
The challenges outlined in Section 1 motivate the following design goals of Time Capsule: (1) cross-boundary tracing; (2) fine-grained tracing; (3) low overhead; (4) application transparency. Figure 1 presents a high-level overview of how TC enables tracing for network send (Tx) and receive (Rx). TC places tracepoints throughout the virtualized network stack and timestamps packets at enabled tracepoints. The timing information is appended to the payload of the packet. For network receive, before the traced packet is copied to user space, TC restores the packet payload to its original size and dumps the tracing data to a kernel buffer, from where the tracing data can be copied to user space for offline analysis. For network send, trace dump happens before the packet is transmitted by the physical NIC. Compared to packet receive, we preserve the timestamp of the last tracepoint in the payload of a network send packet to support tracing across physical machines. Since tracepoints are placed in either the hypervisor, the guest kernel, or the host kernel, TC is transparent to user applications. Next, we elaborate on the design and implementation of TC in a Xen environment.
Clocksource To accurately attribute network latency to various processing stages across different protection domains, e.g., Dom0, DomU, and Xen, a reliable and cross-domain clocksource with high resolution is needed. The para-virtualized clocksource xen meets the requirements. In a Xen environment, the hypervisor, Dom0, and the guest OS all use the same clocksource xen for time measurement. Therefore, packet timestamping using the xen clocksource avoids time drift across different domains. Moreover, the clocksource xen is based on the Time Stamp Counter (TSC) on the processor and has nanosecond resolution. It is adequate for latency measurement at the microsecond granularity.
Figure 2. The relationship of TSCs on different physical machines. The constant_tsc flag allows TSC to tick at the processor's maximum rate regardless of the actual CPU speed. Line slope represents the maximum CPU frequency.
A similar clocksource, kvm-clock, is also available in KVM. Furthermore, we enabled constant_tsc and nonstop_tsc of Intel processors, which guarantee that the TSC rate is not only synchronized across all sockets and cores, but also not affected by power management on individual processors. As such, TSC ticks at the maximum CPU clock rate regardless of the actual CPU running speed. For cross-machine tracing, the clocks on physical nodes may inevitably tick at different rates due to different CPU speeds. Therefore, the relative difference between timestamps recorded on separate machines does not reflect the actual time passage. Figure 2 shows the relationship between the TSC readings on two machines with different CPU speeds. The slopes of the two lines represent the maximum CPU frequency on the respective machines. There exist two challenges in correlating the timestamps on separate machines. First, TSC readings are incremented at different rates (i.e., different slopes). Second, TSC registers are reset at boot time or when resuming from hibernation. The relative difference between TSC readings on two machines includes the absolute distance of these machines since the last TSC reset. For example, as shown in Figure 2, the distance between TSC resets on two machines is denoted by α = |t_a^reset − t_b^reset|, where t_a^reset and t_b^reset are the last TSC reset times of machines a and b, respectively.
Tracepoints are placed on the critical path of packet processing in the virtualized network stack. When a target packet passes through a predefined tracepoint, a timestamp based on the local clocksource is appended to the packet payload. The time difference between two tracepoints measures how much time is spent in a particular processing stage. For example, two tracepoints can be placed at the backend in Dom0 and the frontend in DomU to measure packet processing time in the hypervisor. As timestamps are taken sequentially at various tracepoints throughout the virtualized network stack, TC does not need to infer the causal relationship of the tracepoints (as Pivot Tracing does in [27]) and the timestamps in the packet payload have strict happened-before relations.
Cross-machine tracing requires that the varying TSC rates and reset times on different machines be taken into account for accurate latency attribution.
Objective: Estimate t2 − t1 using readings of tsc_b and tsc_a
Notation:
  α: time difference when the two machines' TSCs were reset
  β: absolute TSC difference
  cpufreq: CPU frequency of a machine
Steps:
1) Estimate machine a's TSC reading tsc'_a at t1 as if both machines reset TSC at the same time
2) Convert tsc'_a to tsc'_b, the equivalent TSC reading on machine b
3) Calculate packet transmission time as (tsc_b − tsc'_b) / cpufreq_b
More details in the paper
Specifically, timestamps recorded on separate machines should be calibrated to determine the latency due to network transmission between machines. We illustrate the TSC calibration process in Figure 2. Assume that a packet is sent from machine a (denoted by the dotted blue line) at time t1, which has a faster CPU and its TSC starts to tick earlier, and received at time t2 on machine b (denoted by the solid red line) with a slower CPU. Without TSC calibration, the difference tsc_b − tsc_a can show negative transmission time. There are two ways to measure packet transmission time in the network based on the two timestamps tsc_a and tsc_b taken at the sender and receiver machines. First, the difference of TSC reset time α can be estimated as |tsc_a^sync/cpufreq_a − tsc_b^sync/cpufreq_b|, where tsc_a^sync and tsc_b^sync are the instantaneous TSC readings on the two machines at exactly the same time. This can be achieved through distributed clock synchronization algorithms, which estimate the packet propagation time in a congestion-free network and adjust the two TSC readings. Once α is obtained, the absolute TSC difference β is calculated as β = α × cpufreq_a. Then, the first calibration step is to derive tsc'_a = tsc_a − β to remove the absolute TSC difference. As shown in Figure 2, tsc'_a is the TSC reading of packet transmission at the sender if machine a resets TSC at the same time as machine b. Further, the equivalent TSC reading at the receiver machine b when the packet starts transmission is tsc'_b = tsc'_a × cpufreq_b/cpufreq_a. Finally, the packet transmission time is the difference between the timestamps of packet send and receive on the receiver machine b: t2 − t1 = (tsc_b − tsc'_b)/cpufreq_b.
The first calibration method only requires the examination of one packet to measure packet transmission time but relies on an accurate estimation of α. Since α is constant for all packet transmissions between two particular machines, an alternative is to estimate network condition based on the comparisons of multiple packet transmissions. Similar to [25], which compares packet transmission time with a reference value in a congestion-free environment to estimate network congestion, we can roughly measure packet transmission time as tsc_a/cpufreq_a − tsc_b/cpufreq_b and use cross-packet comparison to identify abnormally long transmission times. However, this method only identifies relative transmission delays with respect to a reference transmission time, which is difficult to obtain in a production datacenter network and may be variable due to packets being transmitted through different routes.
Tracing payload To enable tracing latency across physical or virtual boundaries, TC adds an extra payload to a packet to store the timestamps of tracepoints. Upon receiving a packet at the physical NIC or copying a packet from user space to kernel space for sending, TC uses skb_put(skb, SIZE) to allocate additional space in the original packet. The tracing information is removed from the packet payload and dumped to a kernel buffer before a packet is copied to the application buffer in user space or sent out by the physical NIC.
[Figure 3 shows the packet layout: MAC | IP | TCP/UDP | Data in the original packet, followed in a Time Capsule packet by the tracing raw data (sender-side and receiver-side timestamp entries) and the tracing metadata (a sampling decision bit and per-tracepoint annotation bits).]
Figure 3. The structure of a Time Capsule packet.
Figure 3 shows the structure of a TC-enabled packet. The tracing payload contains two types of data: the tracing raw data and the tracing metadata. The tracing raw data consists of 8-byte entries, each of which stores the timestamp of a tracepoint. Users can place plenty of tracepoints in the virtualized network stack based on their needs and choose which tracepoints to enable for a particular workload. The tracing metadata uses the annotation bits to indicate whether the corresponding tracepoint is enabled or not (1 as enabled). Users define an event mask to specify the enabled tracepoints and initialize the tracing metadata. The SIZE of the tracing payload depends on the number of enabled tracepoints. For latency tracing across VMs on the same physical machine, packets are transferred between the hypervisor and domains through shared memory. The packet size is not limited by the maximum transmission units (MTUs). Thus, TC is able to allocate sufficient space in packet payloads for tracing without affecting the number of packets communicated by the application workloads. For tracing across different physical machines, we dump all the timestamps before packets are sent out by the NIC but preserve the last timestamp recorded at the sender side (the sender-side raw data in Figure 3) in the tracing payload. When the packet arrives at the receiver machine, new tracing data will be added after the sender side's last timestamp. As such, the tracing data pertaining to the same packet stored on multiple machines can be stitched together by the shared timestamp.
3. Overhead and Optimizations
Despite the benefits, tracing at packet level can introduce considerable overhead to network applications. For highly optimized services in the cloud, such tracing overhead can significantly hurt performance. In this section, we discuss the sources of overhead and the corresponding optimizations in Time Capsule.
Time measurement It usually incurs various levels of cost to obtain timestamps in the kernel space. If it is not properly designed, fine-grained tracing can significantly degrade network performance, especially increasing packet tail latency.
Optimization We compare the cost of different clocksource read functions available in various domains using a simple clock test tool [2] and adopt native_read_tscp to read from the xen clocksource at each tracepoint. Compared to
Tracing Payload
• Use __skb_put to append trace data to the original payload
• Each timestamp is 8 bytes
• Sampling decision bit determines the sampling rate
• Annotation bit decides which tracepoint(s) to enable
Trace Collection
• Ring buffers in physical NIC driver (Tx) and guest OS network stack (Rx)
• Tracing data is removed from the packet payload and copied to the ring buffers
  ✓ Before packet is copied to user space (Rx)
  ✓ Before packet is transmitted by NIC driver (Tx)
• mmap the ring buffers to /proc filesystems in user space
• Periodically dump trace to storage for latency analysis
Evaluation
Hardware
• Two PowerEdge T420 servers
• Two 6-core 1.90 GHz Intel Xeon E5-2420 CPUs
• 32 GB memory
• Gigabit Ethernet
Software
• Hypervisor: Xen 4.5
• Dom0 and DomU kernel: Linux 3.18.21
• VM: 1 vCPU + 4 GB memory
Time Capsule Overhead
[Charts: Sockperf UDP latency (average, 99th, and 99.9th percentile, roughly 100 to 140 μs) versus number of tracepoints (0 to 10) and versus sampling rate (1/16 to 1).]
Time Capsule incurs no more than 2% latency increase
Per-Packet Latency
[Charts: sockperf latency (0 to 2000 μs) over a 12-second run, shown as average latency per second and as latency per packet.]
Foreground: sockperf UDP requests. Background: netperf interference arrives at the 1.5th second and leaves at the 6.5th second
Packet-level latency monitoring:
1. Captures latency fluctuation
2. Is more responsive to traffic change
3. Captures transient spikes in user-perceived latency
Diagnosis with Latency Breakdown
[Charts: latency breakdown across Dom0, Xen, and DomU for sockperf running alone in a VM (around 100 μs) versus sockperf co-located with HPCBench in the same VM (up to 3000 μs).]
Colocation of latency-sensitive and throughput-intensive workloads in the same VM incurs long and unpredictable latency
Latency breakdown reveals that packet batching at the backend NIC driver was the culprit
Taming Long Tail Latency in Xen
[Diagrams: Case 1 runs sockperf in VM-1 on Xen; Case 2 runs sockperf in VM-1 alongside a CPU-bound loop in VM-2. Chart: latency up to 18000 μs for Case 1 versus Case 2.]
Bugs in Xen's credit scheduler. Time Capsule helps find them!
BUG-1
[Chart: per-packet latency (0 to 40 ms) broken down across Dom0, Xen, and DomU versus packet index.]
Case 2 observations:
1. Predictable latency spike every 250 packets
2. Tail latency close to 30 ms
3. Spikes always starting with delay in Xen
BUG-1: Xen mistakenly boosts the priority of CPU-intensive VMs, which causes long scheduling delays of the I/O-bound VM in Xen
http://lists.xenproject.org/archives/html/xen-devel/2015-10/msg02853.html
After BUG-1 is Fixed
[Chart: latency (0 to 20000 μs) at percentiles from 25% to 99.99% for Case 1, Case 2, and Case 2 after BUG-1 is fixed.]
Tail latency is still not bounded
Addi@onalXenBugs
0
20
40
60
Dom0 Xen DomU
Latency(m
s)
Packetindex
Observa@ons:1. Theoccurrenceand
magnitudeofthelatencyspikeareunpredictable
2. SpikealwaysstarEngwith
delayinXen
BUG-2:Xendoesnot@melyac@vateI/OVMsthataredeac@vatedduetolongidling
hmps://lists.xenproject.org/archives/html/xen-devel/2016-05/msg01362.html
BUG-3:I/OVMs’BOOSTprioritycanbeprematurelydemoted
hmps://lists.xenproject.org/archives/html/xen-devel/2016-05/msg01362.html
Future Work
• Dynamic instrumentation
  ✓ Extend Time Capsule to dynamically add tracepoints using BPF
• Automated analysis
  ✓ Use machine learning to extract better information from packet traces
• Disk I/O
  ✓ Extend TC to disk I/O requests. The challenge is the lack of a commonly shared data structure, such as skb in networking, across layers in virtualized block I/O stacks
Conclusions
• Motivation: Tracing latency in virtualized systems is challenging due to the isolation of protection domains and requirements for low overhead and application transparency
• Time Capsule: An in-band profiler to trace network latency at packet level in virtualized environments
• Evaluation: Time Capsule incurs low overhead and enables fine-grained packet-level tracing and latency breakdown, which helps to detect bugs that cause long tail latency in Xen
Thank you! Questions?
FAQ
• Will Time Capsule affect the user-perceived packet size?
No and yes. When packets are received, MTU is no longer a limit on packet size: Time Capsule is able to append as much data as needed to the payload, and the tracing payload is removed before the packet arrives at user space. Thus, users are unaware of the tracing activities. However, Time Capsule needs to append 8 bytes to the packet to store the last timestamp at the sender side. In rare cases, this will affect the number of packets transmitted. For example, if the original packet size is 2999 bytes (a little less than two MTUs), adding 8 bytes to the original payload will require one more packet to be transmitted.