Time Capsule: Tracing Packet Latency across Different Layers in Virtualized Systems
Kun Suo*, Jia Rao*, Luwei Cheng†, Francis C. M. Lau‡
*UT Arlington, †Facebook, ‡The University of Hong Kong
Virtualization and Multi-Tenancy
• Mainstream in datacenters
  ✓ Fault isolation and enhanced security
  ✓ Improved hardware utilization
• Challenging to guarantee QoS
  ✓ Complex virtualization stacks
  ✓ Performance interference
Latency-sensitive network apps suffer poor and unpredictable performance
Para-virtualized Network I/O
[Diagram: a physical machine hosting a hypervisor, a privileged domain (native driver, backend driver, virtual switch), and a guest (frontend driver, app). The backend and frontend drivers communicate through shared memory; the hardware NIC raises physical interrupts and the hypervisor delivers virtual interrupts; each protection domain is isolated, and cross-machine communication goes through the NIC.]
Possible Causes of Long Latency
• Additional layers of software stack
  ✓ Asynchronous notifications
  ✓ Data copy
• Resource contention
  ✓ Time-sharing CPU
  ✓ Contentions on data structures, e.g., locks and queues
• Packet transmission in DC network
End-to-end latency monitoring and analysis is key to identifying the causes
Challenges in Monitoring Packet Latency in Virtualized Systems
• Across the boundaries of protection domains
  ✓ Machines, privileged domains, guests, and hypervisor
• Correlating events in various components
  ✓ Asynchronous packet processing
• Fine-grained tracing with low overhead
  ✓ Troubleshooting at packet level
• Application transparency
  ✓ A wide spectrum of network apps, no access to code
RelatedWork
• Tracingtoolsü App:gperfü OS:SystemTap,Dtrace,Perf,andbccü Hypervisor:Xentrace
• Distributedtracingü Causaltracing:Pip[NSDI’06],X-Trace[NSDI’07],Dapper[Google],Fay[SOSP’11],Magpie[HotOS]
ü Logmining:Draco[DSN’12],[NagarajNSDI’12],[XuSOSP’09]
Related Work (cont'd)
• Tracing metadata propagation
  ✓ Pivot Tracing [SOSP'15], X-Trace [NSDI'07], Dapper [Google]
  ✓ Propagating task IDs and timestamps using task containers or embedding the info into protocol headers
Tracing virtualized network I/O:
  ✓ Fine-grained tracing at packet level
  ✓ Packet processing at multiple hosts and different layers. Trace metadata propagation is difficult
Time Capsule
Timestamp packet processing at each tracepoint and append the tracing info to the packet payload
Advantages:
  ✓ Traces embedded in packets go across the boundaries of protection domains
  ✓ Timestamps taken at different points have happened-before relations
Time Capsule in Action
[Diagram: the same para-virtualized I/O stack (hardware NIC, hypervisor, privileged host with native/backend drivers and virtual switch, guest with frontend driver and app), annotated with tracepoints on the send and receive paths; traces are dumped and the packet payload is restored before delivery.]
Preserve the last timestamp at a sender to capture the causality relationship across machines
Timestamping at Tracepoints
• Challenges
  ✓ Low overhead
  ✓ Available in separate protection domains
  ✓ Dealing with time drift on multiple hosts
• Solution
  ✓ Para-virtualized clocksource xen or kvm-clock
  ✓ Function native_read_tscp to read TSC values, nanosecond granularity, ~20 ns overhead for each reading
  ✓ Set constant_tsc to ensure consistent TSC readings across different cores
Timestamping across Machines
• Network transmission time: naively, tsc_b − tsc_a
• Transmission time can be negative or inaccurate
  ✓ TSC ticks at different rates on machines a and b
  ✓ TSC resets at different times on machines a and b
TSC Calibration
[Figure 1 shows the packet send and receive paths through the NIC, local driver, backend driver (Dom0), and frontend driver (DomU), with enabled tracepoints along the way; original packets become TC packets carrying tracing data, and a collect-data path dumps the traces.]
Figure 1. The architecture of Time Capsule.
no changes to the traced applications. We demonstrate that fine-grained tracing and latency decomposition enabled by TC shed light on the root causes of long tail network latency and help identify real performance bugs in Xen.
The rest of the paper is organized as follows. Sections 2 and 3 describe the design, implementation, and optimizations of Time Capsule, respectively. Section 4 presents evaluation results. Sections 5 and 6 discuss future and related work, respectively. Section 7 concludes this paper.
2. Design and Implementation
The challenges outlined in Section 1 motivate the following design goals of Time Capsule: (1) cross-boundary tracing; (2) fine-grained tracing; (3) low overhead; (4) application transparency. Figure 1 presents a high-level overview of how TC enables tracing for network send (Tx) and receive (Rx). TC places tracepoints throughout the virtualized network stack and timestamps packets at enabled tracepoints. The timing information is appended to the payload of the packet. For network receive, before the traced packet is copied to user space, TC restores the packet payload to its original size and dumps the tracing data to a kernel buffer, from where the tracing data can be copied to user space for offline analysis. For network send, trace dump happens before the packet is transmitted by the physical NIC. Compared to packet receive, we preserve the timestamp of the last tracepoint in the payload of a network send packet to support tracing across physical machines. Since tracepoints are placed in either the hypervisor, the guest kernel, or the host kernel, TC is transparent to user applications. Next, we elaborate on the design and implementation of TC in a Xen environment.
Clocksource To accurately attribute network latency to various processing stages across different protection domains, e.g., Dom0, DomU, and Xen, a reliable and cross-domain clocksource with high resolution is needed. The para-virtualized clocksource xen meets the requirements. In a Xen environment, the hypervisor, Dom0, and the guest OS all use the same clocksource xen for time measurement. Therefore, packet timestamping using the xen clocksource avoids time drift across different domains. Moreover, the clocksource xen is based on the Time Stamp Counter (TSC) on the processor and has nanosecond resolution. It is adequate for latency measurement at the microsecond granularity.
Figure 2. The relationship of TSCs on different physical machines. The constant_tsc flag allows TSC to tick at the processor's maximum rate regardless of the actual CPU speed. Line slope represents the maximum CPU frequency.
A similar clocksource, kvm-clock, is also available in KVM. Furthermore, we enabled constant_tsc and nonstop_tsc of Intel processors, which guarantee that the TSC rate is not only synchronized across all sockets and cores, but also not affected by power management on individual processors. As such, TSC ticks at the maximum CPU clock rate regardless of the actual CPU running speed. For cross-machine tracing, the clocks on physical nodes may inevitably tick at different rates due to different CPU speeds. Therefore, the relative difference between timestamps recorded on separate machines does not reflect the actual time passage. Figure 2 shows the relationship between the TSC readings on two machines with different CPU speeds. The slopes of the two lines represent the maximum CPU frequency on the respective machines. There exist two challenges in correlating the timestamps on separate machines. First, TSC readings are incremented at different rates (i.e., different slopes). Second, TSC registers are reset at boot time or when resuming from hibernation. The relative difference between TSC readings on two machines includes the absolute distance of these machines since the last TSC reset. For example, as shown in Figure 2, the distance between TSC resets on two machines is denoted by α = |t_a^reset − t_b^reset|, where t_a^reset and t_b^reset are the last TSC reset times of machines a and b, respectively.
Tracepoints are placed on the critical path of packet processing in the virtualized network stack. When a target packet passes through a predefined tracepoint, a timestamp based on the local clocksource is appended to the packet payload. The time difference between two tracepoints measures how much time is spent in a particular processing stage. For example, two tracepoints can be placed at the backend in Dom0 and the frontend in DomU to measure packet processing time in the hypervisor. As timestamps are taken sequentially at various tracepoints throughout the virtualized network stack, TC does not need to infer the causal relationship of the tracepoints (as Pivot Tracing does in [27]) and the timestamps in the packet payload have strict happened-before relations.
Cross-machine tracing requires that the varying TSC rates and reset times on different machines be taken into account for accurate latency attribution.
Objective: Estimate t2 − t1 using readings of tsc_b and tsc_a
Notation:
  α: time difference when the two machines' TSCs were reset
  β: absolute TSC difference
  cpufreq: CPU frequency of a machine
Steps:
1) Estimate machine a's TSC reading tsc'_a at t1 as if both machines reset TSC at the same time
2) Convert tsc'_a to tsc'_b, the equivalent TSC reading on machine b
3) Calculate packet transmission time as (tsc_b − tsc'_b) / cpufreq_b
More details in the paper
Specifically, timestamps recorded on separate machines should be calibrated to determine the latency due to network transmission between machines. We illustrate the TSC calibration process in Figure 2. Assume that a packet is sent from machine a (denoted by the dotted blue line) at time t1, which has a faster CPU and its TSC starts to tick earlier, and received at time t2 on machine b (denoted by the solid red line) with a slower CPU. Without TSC calibration, the difference tsc_b − tsc_a can show negative transmission time. There are two ways to measure packet transmission time in the network based on the two timestamps tsc_a and tsc_b taken at the sender and receiver machines. First, the difference of TSC reset time α can be estimated as |tsc_a^sync/cpufreq_a − tsc_b^sync/cpufreq_b|, where tsc_a^sync and tsc_b^sync are the instantaneous TSC readings on the two machines at exactly the same time. This can be achieved through distributed clock synchronization algorithms, which estimate the packet propagation time in a congestion-free network and adjust the two TSC readings. Once α is obtained, the absolute TSC difference β is calculated as β = α × cpufreq_a. Then, the first calibration step is to derive tsc'_a = tsc_a − β to remove the absolute TSC difference. As shown in Figure 2, tsc'_a is the TSC reading of packet transmission at the sender if machine a resets TSC at the same time as machine b. Further, the equivalent TSC reading at the receiver machine b when the packet starts transmission is tsc'_b = tsc'_a × cpufreq_b/cpufreq_a. Finally, the packet transmission time is the difference between the timestamps of packet send and receive on the receiver machine b: t2 − t1 = (tsc_b − tsc'_b)/cpufreq_b.
The first calibration method only requires the examination of one packet to measure packet transmission time but relies on an accurate estimation of α. Since α is constant for all packet transmissions between two particular machines, an alternative is to estimate network condition based on the comparisons of multiple packet transmissions. Similar to [25], which compares packet transmission time with a reference value in a congestion-free environment to estimate network congestion, we can roughly measure packet transmission time as tsc_a/cpufreq_a − tsc_b/cpufreq_b and use cross-packet comparison to identify abnormally long transmission times. However, this method only identifies relative transmission delays with respect to a reference transmission time, which is difficult to obtain in a production datacenter network and may be variable due to packets being transmitted through different routes.
Tracing payload To enable tracing latency across physical or virtual boundaries, TC adds an extra payload to a packet to store the timestamps of tracepoints. Upon receiving a packet at the physical NIC or copying a packet from user space to kernel space for sending, TC uses skb_put(skb, SIZE) to allocate additional space in the original packet. The tracing information is removed from the packet payload and dumped to a kernel buffer before a packet is copied to the application buffer in user space or sent out by the physical NIC.
[Figure 3 shows the packet layout: MAC | IP | TCP/UDP | Data in the original packet, followed in a Time Capsule packet by the tracing raw data (sender-side and receiver-side timestamp entries) and the tracing metadata (a sampling decision bit and per-tracepoint annotation bits).]
Figure 3. The structure of a Time Capsule packet.
Figure 3 shows the structure of a TC-enabled packet. The tracing payload contains two types of data: the tracing raw data and the tracing metadata. The tracing raw data consists of 8-byte entries, each of which stores the timestamp of a tracepoint. Users can place plenty of tracepoints in the virtualized network stack based on their needs and choose which tracepoints to enable for a particular workload. The tracing metadata uses the annotation bits to indicate whether the corresponding tracepoint is enabled or not (1 as enabled). Users define an event mask to specify the enabled tracepoints and initialize the tracing metadata. The SIZE of the tracing payload depends on the number of enabled tracepoints. For latency tracing across VMs on the same physical machine, packets are transferred between the hypervisor and domains through shared memory. The packet size is not limited by the maximum transmission units (MTUs). Thus, TC is able to allocate sufficient space in packet payloads for tracing without affecting the number of packets communicated by the application workloads. For tracing across different physical machines, we dump all the timestamps before packets are sent out by the NIC but preserve the last timestamp recorded at the sender side (the sender-side raw data in Figure 3) in the tracing payload. When the packet arrives at the receiver machine, new tracing data will be added after the sender side's last timestamp. As such, the tracing data pertaining to the same packet stored on multiple machines can be stitched together by the shared timestamp.
3. Overhead and Optimizations
Despite the benefits, tracing at packet level can introduce considerable overhead to network applications. For highly optimized services in the cloud, such tracing overhead can significantly hurt performance. In this section, we discuss the sources of overhead and the corresponding optimizations in Time Capsule.
Time measurement It usually incurs various levels of cost to obtain timestamps in the kernel space. If it is not properly designed, fine-grained tracing can significantly degrade network performance, especially increasing packet tail latency.
Optimization We compare the cost of different clocksource read functions available in various domains using a simple clock test tool [2] and adopt native_read_tscp to read from the xen clocksource at each tracepoint. Compared to
Tracing Payload
• Use __skb_put to append trace data to the original payload
• Each timestamp is 8 bytes
• Sampling decision bit determines the sampling rate
• Annotation bit decides which tracepoint(s) to enable
Trace Collection
• Ring buffers in physical NIC driver (Tx) and guest OS network stack (Rx)
• Tracing data is removed from the packet payload and copied to the ring buffers
  ✓ Before packet is copied to user space (Rx)
  ✓ Before packet is transmitted by NIC driver (Tx)
• mmap the ring buffers to /proc filesystems in user space
• Periodically dump trace to storage for latency analysis
Evaluation
Hardware
• Two PowerEdge T420 servers
• Two 6-core 1.90 GHz Intel Xeon E5-2420 CPUs
• 32 GB memory
• Gigabit Ethernet
Software
• Hypervisor: Xen 4.5
• Dom0 and DomU kernel: Linux 3.18.21
• VM: 1 vCPU + 4 GB memory
Time Capsule Overhead
[Charts: Sockperf UDP latency (average, 99th, and 99.9th percentile, roughly 100 to 140 μs) versus number of tracepoints (0 to 10) and versus sampling rate (1/16 to 1).]
Time Capsule incurs no more than 2% latency increase
Per-Packet Latency
[Charts: sockperf latency (0 to 2000 μs) over a 12-second run, shown as average latency per second and as latency per packet.]
Foreground: sockperf UDP requests. Background: netperf interference arrives at the 1.5th second and leaves at the 6.5th second
Packet-level latency monitoring:
1. Captures latency fluctuation
2. Is more responsive to traffic change
3. Captures transient spikes in user-perceived latency
Diagnosis with Latency Breakdown
[Charts: latency breakdown across Dom0, Xen, and DomU for sockperf running alone in a VM (around 100 μs) versus sockperf co-located with HPCBench in the same VM (up to 3000 μs).]
Colocation of latency-sensitive and throughput-intensive workloads in the same VM incurs long and unpredictable latency
Latency breakdown reveals that packet batching at the backend NIC driver was the culprit
Taming Long Tail Latency in Xen
[Diagrams: Case 1 runs sockperf in VM-1 on Xen; Case 2 runs sockperf in VM-1 alongside a CPU-bound loop in VM-2. Chart: latency up to 18000 μs for Case 1 versus Case 2.]
Bugs in Xen's credit scheduler. Time Capsule helps find them!
BUG-1
[Chart: per-packet latency (0 to 40 ms) broken down across Dom0, Xen, and DomU versus packet index.]
Case 2 observations:
1. Predictable latency spike every 250 packets
2. Tail latency close to 30 ms
3. Spikes always starting with delay in Xen
BUG-1: Xen mistakenly boosts the priority of CPU-intensive VMs, which causes long scheduling delays of the I/O-bound VM in Xen
http://lists.xenproject.org/archives/html/xen-devel/2015-10/msg02853.html
After BUG-1 is Fixed
[Chart: latency (0 to 20000 μs) at percentiles from 25% to 99.99% for Case 1, Case 2, and Case 2 after BUG-1 is fixed.]
Tail latency is still not bounded
Addi@onalXenBugs
0
20
40
60
Dom0 Xen DomU
Latency(m
s)
Packetindex
Observa@ons:1. Theoccurrenceand
magnitudeofthelatencyspikeareunpredictable
2. SpikealwaysstarEngwith
delayinXen
BUG-2:Xendoesnot@melyac@vateI/OVMsthataredeac@vatedduetolongidling
hmps://lists.xenproject.org/archives/html/xen-devel/2016-05/msg01362.html
BUG-3:I/OVMs’BOOSTprioritycanbeprematurelydemoted
hmps://lists.xenproject.org/archives/html/xen-devel/2016-05/msg01362.html
Future Work
• Dynamic instrumentation
  ✓ Extend Time Capsule to dynamically add tracepoints using BPF
• Automated analysis
  ✓ Use machine learning to extract better information from packet traces
• Disk I/O
  ✓ Extend TC to disk I/O requests. The challenge is the lack of a commonly shared data structure, such as skb in networking, across layers in virtualized block I/O stacks
Conclusions
• Motivation: Tracing latency in virtualized systems is challenging due to the isolation of protection domains and requirements for low overhead and application transparency
• Time Capsule: An in-band profiler to trace network latency at packet level in virtualized environments
• Evaluation: Time Capsule incurs low overhead and enables fine-grained packet-level tracing and latency breakdown, which helps to detect bugs that cause long tail latency in Xen
Thank you! Questions?
FAQ
• Will Time Capsule affect the user-perceived packet size?
No and yes. When packets are received, MTU is no longer a limit on packet size: Time Capsule is able to append as much data as needed to the payload, and the tracing payload is removed before the packet arrives at user space. Thus, users are unaware of the tracing activities. However, Time Capsule needs to append 8 bytes to the packet to store the last timestamp at the sender side. In rare cases, this will affect the number of packets transmitted. For example, if the original packet size is 2999 bytes (a little less than two MTUs), adding 8 bytes to the original payload will require one more packet to be transmitted.