![Page 1: Extensible Distributed Tracing from Kernels to Clusters Úlfar Erlingsson, Google Inc. Marcus Peinado, Microsoft Research Simon Peter, Systems Group, ETH](https://reader036.vdocuments.net/reader036/viewer/2022070323/56649d9c5503460f94a8593e/html5/thumbnails/1.jpg)
Extensible Distributed Tracing from Kernels to Clusters
Úlfar Erlingsson, Google Inc.
Marcus Peinado, Microsoft Research
Simon Peter, Systems Group, ETH Zurich
Mihai Budiu, Microsoft Research
Fay
Wouldn’t it be nice if…
• We could know what our clusters were doing?
• We could ask any question… easily, using one simple-to-use system.
• We could collect answers extremely efficiently… so cheaply we may even ask continuously.
Let’s imagine...
• Applying data-mining to cluster tracing
• Bag of words technique
  – Compare documents w/o structural knowledge
  – N-dimensional feature vectors
  – K-means clustering
• Can apply to clusters, too!
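The bag-of-words idea can be illustrated with a small Python sketch. It is not Fay code: the function name `feature_vectors` and the sample traces are hypothetical, and it only shows how unordered event counts become N-dimensional vectors that a clustering step could consume.

```python
from collections import Counter

def feature_vectors(traces):
    """Turn each trace (a list of event names) into a count vector.

    Every distinct event name becomes one dimension. Order within a
    trace is ignored, as in bag-of-words text comparison.
    """
    vocabulary = sorted({name for trace in traces for name in trace})
    vectors = []
    for trace in traces:
        counts = Counter(trace)
        vectors.append([counts[name] for name in vocabulary])
    return vocabulary, vectors

# Hypothetical per-machine system call traces (made-up data).
traces = [
    ["read", "read", "write", "open"],
    ["read", "open", "open", "close"],
]
vocabulary, vectors = feature_vectors(traces)
```

Each vector has one entry per distinct event name, so traces of different lengths become directly comparable.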
Cluster-mining with Fay
• Automatically categorize cluster behavior, based on system call activity
  – Without measurable overhead on the execution
  – Without any special Fay data-mining support
Fay K-Means Behavior-Analysis Code
Vector Nearest(Vector pt, Vectors centers) {
    var near = centers.First();
    foreach (var c in centers)
        if (Norm(pt - c) < Norm(pt - near))
            near = c;
    return near;
}
var kernelFunctionFrequencyVectors =
    cluster.Function(kernel, "syscalls!*")
    .Where(evt => evt.time < Now.AddMinutes(3))
    .Select(evt => new { Machine = fay.MachineID(),
                         Interval = evt.Cycles / CPS,
                         Function = evt.CallerAddr })
    .GroupBy(evt => evt,
             (k,g) => new { key = k, count = g.Count() });
Vectors OneKMeansStep(Vectors vs, Vectors cs) {
    return vs.GroupBy(v => Nearest(v, cs))
             .Select(g => g.Aggregate((x,y) => x+y)/g.Count());
}
Vectors KMeans(Vectors vs, Vectors cs, int K) {
    for (int i=0; i < K; ++i)
        cs = OneKMeansStep(vs, cs);
    return cs;
}
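For readers who want to run the clustering logic outside Fay, here is a minimal Python sketch of the same three functions. It is an illustration under assumptions, not the authors' implementation: vectors are plain tuples, the helpers `norm` and `sub` are hypothetical, and the sample points are made up. As in the slide code, a center that attracts no points simply drops out of the next step.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def sub(a, b):
    """Component-wise difference a - b."""
    return tuple(x - y for x, y in zip(a, b))

def nearest(pt, centers):
    """Return the center closest to pt (first one wins on ties)."""
    near = centers[0]
    for c in centers:
        if norm(sub(pt, c)) < norm(sub(pt, near)):
            near = c
    return near

def one_kmeans_step(vs, cs):
    """Group vectors by nearest center, then average each group."""
    groups = {}
    for v in vs:
        groups.setdefault(nearest(v, cs), []).append(v)
    return [tuple(sum(col) / len(g) for col in zip(*g))
            for g in groups.values()]

def kmeans(vs, cs, steps):
    """Run a fixed number of refinement steps, as the slide's loop does."""
    for _ in range(steps):
        cs = one_kmeans_step(vs, cs)
    return cs

# Made-up 2-D points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = kmeans(points, [(0.0, 0.0), (10.0, 10.0)], 3)
```

The grouping and averaging mirror the `GroupBy` and `Aggregate` calls in the slide, which is what lets Fay treat the analysis as part of the same data-parallel query.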
Fay vs. Specialized Tracing
• Could’ve built a specialized tool for this
  – Automatic categorization of behavior (Fmeter)
• Fay is general, but can efficiently do
  – Tracing across abstractions, systems (Magpie)
  – Predicated and windowed tracing (Streams)
  – Probabilistic tracing (Chopstix)
  – Flight recorders, performance counters, …
Key Takeaways
Fay: Flexible monitoring of distributed executions
  – Can be applied to existing, live Windows servers
1. Single query specifies both tracing & analysis
  – Easy to write & enables automatic optimizations
2. Pervasively data-parallel, scalable processing
  – Same model within machines & across clusters
3. Inline, safe machine-code at tracepoints
  – Allows us to do computation right at data source
K-Means: Single, Unified Fay Query
var kernelFunctionFrequencyVectors =
    cluster.Function(kernel, "*")
    .Where(evt => evt.time < Now.AddMinutes(3))
    .Select(evt => new { Machine = fay.MachineID(),
                         Interval = evt.Cycles / CPS,
                         Function = evt.CallerAddr })
    .GroupBy(evt => evt,
             (k,g) => new { key = k, count = g.Count() });
Vector Nearest(Vector pt, Vectors centers) {
    var near = centers.First();
    foreach (var c in centers)
        if (Norm(pt - c) < Norm(pt - near))
            near = c;
    return near;
}
Vectors OneKMeansStep(Vectors vs, Vectors cs) {
    return vs.GroupBy(v => Nearest(v, cs))
             .Select(g => g.Aggregate((x,y) => x+y)/g.Count());
}
Vectors KMeans(Vectors vs, Vectors cs, int K) {
    for (int i=0; i < K; ++i)
        cs = OneKMeansStep(vs, cs);
    return cs;
}
Fay is Data-Parallel on Cluster
• View trace query as distributed computation
• Use cluster for analysis
System call trace events
• Fay does early aggregation & data reduction
• Fay knows what’s needed for later analysis
K-Means analysis
• Fay builds an efficient processing plan from query
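The benefit of early aggregation can be sketched in Python. This is an illustration only, not Fay's processing plan: the function names and the per-machine event lists are hypothetical. Each machine reduces its raw events to counts before anything is shipped, and the cluster-level step merges those small partial results.

```python
from collections import Counter

def aggregate_on_machine(events):
    """Early aggregation: collapse raw events into per-key counts locally."""
    return Counter(events)

def merge_across_cluster(partials):
    """Combine the per-machine counts into cluster-wide totals."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical raw events, one list per machine (made-up data).
machine_events = {
    "m1": ["read"] * 1000 + ["write"] * 200,
    "m2": ["read"] * 500 + ["open"] * 50,
}
partials = {m: aggregate_on_machine(e) for m, e in machine_events.items()}
totals = merge_across_cluster(partials.values())

# Records that would cross the network with and without early aggregation.
raw_records = sum(len(e) for e in machine_events.values())
shipped_records = sum(len(p) for p in partials.values())
```

In this toy data, 1750 raw events shrink to 4 shipped records, and because counting is associative the merged totals are the same as if all raw events had been collected centrally.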
Fay is Data-Parallel within Machines
• Early aggregation
• Inline, in OS kernel
• Reduce dataflow & kernel/user transitions
• Data-parallel on each core/thread
Processing w/o Fay Optimizations
• Collect data first (on disk)
• Reduce later
• Inefficient, can suffer data overload
[Figure: stages labeled "K-Means: System calls" and "K-Means: Clustering"]
Traditional Trace Processing
• First log all data (a deluge)
• Process later (centrally)
• Compose tools via scripting
[Figure: stages labeled "K-Means: System calls" and "K-Means: Clustering"]
Takeaways so far
Fay: Flexible monitoring of distributed executions
1. Single query specifies both tracing & analysis
2. Pervasively data-parallel, scalable processing
Safety of Fay Tracing Probes
• A variant of XFI used for safety [OSDI’06]
  – Works well in the kernel or any address space
  – Can safely use existing stacks, etc.
  – Instead of language interpreter (DTrace)
  – Arbitrary, efficient, stateful computation
• Probes can access thread-local/global state
• Probes can try to read any address
  – I/O registers are protected
Key Takeaways, Again
Fay: Flexible monitoring of distributed executions
1. Single query specifies both tracing & analysis
2. Pervasively data-parallel, scalable processing
3. Inline, safe machine-code at tracepoints
Installing and Executing Fay Tracing
• Fay runtime on each machine
• Fay module in each traced address space
• Tracepoints at hotpatched function boundary
[Figure: architecture diagram. Labels: Target, Tracing Runtime, Fay, User-Space, Kernel, Probe, XFI, Create probe, Hotpatching, query, ETW, 200 cycles]
Low-level Code Instrumentation
• Replace 1st opcode of functions
• Fay dispatcher called via trampoline
• Fay calls the function, and entry & exit probes

Module with a traced function Foo:
Caller:
    ...
    e8ab62ffff      call Foo
    ...

    ff1508e70600    call [Dispatcher]
Foo:
    ebf8            jmp Foo-6
    cccccc
Foo2:
    57              push rdi
    ...
    c3              ret

Fay platform module:
Dispatcher:
    t = lookup(return_addr)
    ...
    call t.entry_probes
    ...
    call t.Foo2_trampoline
    ...
    call t.return_probes
    ...
    return /* to after call Foo */

[Figure labels: Fay probes PF3, PF4, PF5, each marked XFI]
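The dispatcher's control flow can be mimicked at a much higher level in Python. This is only an analogy under assumptions: real Fay patches machine code and runs XFI-verified probes, whereas the hypothetical `hotpatch` helper below just wraps a function so that entry probes run, then the original body, then return probes.

```python
def hotpatch(func, entry_probes, return_probes):
    """Return a dispatcher that stands in for `func`."""
    def dispatcher(*args, **kwargs):
        for probe in entry_probes:
            probe(func.__name__, args)
        # Calling the saved original plays the role of the Foo2 trampoline.
        result = func(*args, **kwargs)
        for probe in return_probes:
            probe(func.__name__, result)
        return result
    return dispatcher

log = []

def foo(x):
    return x * 2

# Rebinding the name is the Python stand-in for patching Foo's first opcode.
foo = hotpatch(
    foo,
    entry_probes=[lambda name, args: log.append(("enter", name, args))],
    return_probes=[lambda name, result: log.append(("exit", name, result))],
)
value = foo(21)
```

Callers keep calling `foo` unchanged, which matches the slide's point that only the start of the traced function is modified.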
What’s Fay’s Performance & Scalability?
• Fay adds 220 to 430 cycles per traced function
• Fay adds 180% CPU to trace all kernel functions
• Both approx. 10x faster than DTrace, SystemTap
[Figure: chart of null-probe overhead in cycles (axis 0 to 10000) for Fay, Solaris DTrace, OS X DTrace, and Stap Linux]
[Figure: chart of slowdown (x) (axis 0 to 30) for Fay, Solaris DTrace, OS X DTrace, and Stap Linux; labels in order: 2.8, 17.2, 26.7, Crash]
Fay Scalability on a Cluster
• Fay tracing memory allocations, in a loop:
  – Ran workload on a 128-node, 1024-core cluster
  – Spread work over 128 to 1,280,000 threads
  – 100% CPU utilization
• Fay overhead was 1% to 11% (mean 7.8%)
More Fay Implementation Details
• Details of query-plan optimizations
• Case studies of different tracing strategies
• Examples of using Fay for performance analysis
• Fay is based on LINQ and Windows specifics
  – Could build on Linux using Ftrace, Hadoop, etc.
• Some restrictions apply currently
  – E.g., skew towards batch processing due to Dryad
Conclusion
• Fay: Flexible tracing of distributed executions
• Both expressive and efficient
  – Unified trace queries
  – Pervasive data-parallelism
  – Safe machine-code probe processing
• Often as efficient as purpose-built tools
Backup
A Fay Trace Query
from io in cluster.Function("iolib!Read")
where io.time < Now.AddMinutes(5)
let size = io.Arg(2) // request size in bytes
group io by size/1024 into g
select new { sizeInKilobytes = g.Key,
             countOfReadIOs = g.Count() };
• Aggregates read activity in iolib module
• Across cluster, both user-mode & kernel
• Over 5 minutes
• Specifies what to trace
  – 2nd argument of read function in iolib
• And how to aggregate
  – Group into KB-size buckets and count
[Figure: chart with axis ticks 1024, 2048, 4096, 8192 and 0, 2000, 4000, 6000]
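The aggregation half of this query is easy to reproduce in plain Python. The sketch below is not Fay code: `read_size_histogram` and the sample request sizes are hypothetical, and it assumes integer division for `size/1024`, so each read lands in the bucket for its whole number of kilobytes.

```python
from collections import Counter

def read_size_histogram(request_sizes):
    """Count reads per kilobyte bucket (integer division by 1024)."""
    return Counter(size // 1024 for size in request_sizes)

# Made-up request sizes in bytes, standing in for io.Arg(2) values.
sizes = [512, 1024, 1500, 4096, 4100, 8192]
histogram = read_size_histogram(sizes)
```

In Fay the tracing half (which function to instrument and for how long) is stated in the same query, so the counting can run at the tracepoint rather than over a collected log.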