Profiling Android Applications with Nanoscope
Lun Liu, University of California, Los Angeles, USA
Leland Takamine, Uber Technologies, San Francisco, USA
Adam Welc, Uber Technologies, San Francisco, USA

VMIL’18, November 04, 2018, Boston, MA, USA
Abstract
User-level tooling support for profiling Java applications executing on modern JVMs for desktop and server is quite mature – from OpenJDK’s Java Flight Recorder [5] enabling low-overhead CPU and heap profiling, through third-party async profilers (e.g., async-profiler [7], honest-profiler [11]), to OpenJDK’s support for low-overhead tracking of allocation call sites [1].

On the other hand, despite architectural similarities between language execution environments, tooling support for analyzing the performance of Java code on Android lags significantly behind. Arguably the most popular, but also virtually the only, tool in this category is Android Profiler [3] – it provides process-level information about CPU, Java heap, and network activities. It also provides support for method tracing and sampling. However, the former exhibits high overhead and the latter, while providing better performance, sacrifices precision. Information provided by Android Profiler can be augmented with data collected by other tools, particularly systrace [4], which collects data at the operating system level. Unsurprisingly, third-party tools have also started emerging recently, such as Facebook’s Profilo [6] framework. However, using these additional tools requires working with multiple tool-chains, may require a certain amount of manual effort (e.g., building the application in a special way, inserting profiling framework API calls into the application), and it is often non-trivial to infer interesting information from the data being collected (e.g., because only part of the tooling framework is publicly available).

In this paper we describe our work on Nanoscope, a single open source, low-overhead, and extensible tool that not only works on unmodified Android applications and provides precise method traces with low overhead, but also aims to present programmers with selected relevant information within the same framework to help them develop an intuition behind a concrete non-performant application behavior. The tradeoff here is that Nanoscope requires a custom Android build, as it relies on additional support from ART. In this paper, we describe Nanoscope’s design and implementation, present examples of performance-related information that can be obtained from the tool, and discuss Nanoscope-related overheads.
1 Introduction
Compared to “regular” Java programs (i.e., executing on a desktop or a server), tooling for Android programs (i.e., executing on mobile devices) lags significantly behind. Arguably, the most popular analysis tool for Android to locally identify performance problems comes from Google itself in the form of Android Profiler [3], providing information about an executing program and its threads, such as per-process heap memory and CPU usage, network traffic, and per-thread execution data in the form of method traces. In our experience, for reasonably tuned applications, the most interesting piece of information provided by Android Profiler is its method traces, as such applications do not exhibit sudden or large spikes in object allocation rates or per-process CPU usage. Android Profiler collects these traces either by instrumenting all application Java methods or by sampling all application threads – the former sacrifices performance (as not only are all methods of all threads traced, but tracing requires inter-thread coordination) and the latter sacrifices precision (as not every method execution is recorded). Other tool-chains (e.g., systrace [4]) exist to augment the information provided by Android Profiler, but they either cannot provide VM-level information (e.g., lock contention) or they cannot operate on unmodified production application files (e.g., they require a special compilation process or in-application API calls), and in general they cannot provide a uniform user experience within a single framework. One exception to this rule is the Profilo [6] framework developed by Facebook, but this toolkit has only been partially open-sourced (e.g., without visualization tools) and is also geared towards aggregating performance-related information from applications running in production rather than providing the ability to interactively deep dive into specific performance problems of a single application execution.

In order to overcome the limitations of the existing Android performance-focused tools, we started developing our own framework called Nanoscope. In designing and implementing Nanoscope we were guided by the following principles:

• as the starting point, we would like to get similar types of information to that provided by Android Profiler with precise execution traces, but we are
not willing to pay the price of using Android Profiler tracing, as it can slow down execution of our applications by an order of magnitude
• we would like to overlay additional important metrics at the OS level (e.g., CPU utilization) and at the VM level (e.g., lock contention) directly on the method execution traces
• we would like to be able to analyze the performance of unmodified production applications
• when analyzing the performance of our mobile applications we are mostly interested in the behavior of the main thread

In the following section we describe how these principles have been reflected in the design and implementation of Nanoscope. Please note that this is very much work in progress!
2 Design and implementation
The focus of the first open source release of “basic” Nanoscope [9, 10] was to provide precise method tracing similar to that offered by Android Profiler but at a much lower cost. The goal of the second “extended” version of Nanoscope¹ is to overlay additional information on method execution traces. Currently, Nanoscope is only supported on Android 7.1.2, as it requires modifications to the Android Runtime [2] (ART) which are specific to a given OS version.
2.1 “Basic” Nanoscope
Currently Nanoscope focuses on analyzing a single thread (e.g., the main thread of the application) – it collects method tracing data into a thread-local data structure, where it records method entry and exit events along with their timestamps. It is therefore straightforward to extend it to analyze multiple threads – traces for multiple threads can be aligned in the UI using their timestamps in post-processing. We report overheads of collecting traces for the main thread (which in our application is the most active thread in terms of Java code execution) in Section 3 – evaluation of method trace collection for multiple threads is left for future work.

In order to implement method tracing in Nanoscope, we needed to modify both the interpreter and the optimizing compiler so that code recording a timestamp and a method name in an in-memory buffer is injected into each method prologue and epilogue – the buffer is flushed to persistent storage once tracing is done. Tracing can be enabled/disabled via system properties (no application modification required) or via in-application API calls (for increased flexibility in controlling a tracing span). Method traces for a given thread are visualized as a flame graph in a custom browser-based UI written in JavaScript, which needed to be implemented to efficiently handle large volumes of tracing data. In addition to “regular” Java methods, “basic” Nanoscope also traces class loading and initialization time, which are represented as “virtual”² methods. A more detailed description of “basic” Nanoscope can be found in the Nanoscope blog [8].

¹ Also open-sourced recently into the same Nanoscope repositories.
² In the sense of methods that do not exist in reality, rather than in the sense of virtual method dispatch.
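To make the tracing mechanism concrete, the following is a minimal sketch of the kind of per-thread, in-memory buffer that the injected prologue/epilogue code could append to. The names (NanoscopeTraceBuffer, PushEntry, PushExit, Flush) and the output format are illustrative stand-ins rather than the actual ART modifications.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative thread-local trace buffer: prologue/epilogue code injected by
// the interpreter and the optimizing compiler appends one record per event.
class NanoscopeTraceBuffer {
 public:
  void PushEntry(uint64_t timestamp_ns, const char* method_name) {
    records_.push_back({timestamp_ns, method_name, /*is_entry=*/true});
  }
  void PushExit(uint64_t timestamp_ns, const char* method_name) {
    records_.push_back({timestamp_ns, method_name, /*is_entry=*/false});
  }

  // Flush the in-memory buffer to persistent storage once tracing is done.
  void Flush(const char* path) {
    FILE* out = fopen(path, "w");
    if (out == nullptr) return;
    for (const Record& r : records_) {
      fprintf(out, "%s,%llu,%s\n", r.is_entry ? "ENTER" : "EXIT",
              static_cast<unsigned long long>(r.timestamp_ns), r.method_name);
    }
    fclose(out);
    records_.clear();
  }

 private:
  struct Record {
    uint64_t timestamp_ns;
    const char* method_name;
    bool is_entry;
  };
  std::vector<Record> records_;
};

// One buffer per traced thread (here: only the single thread being analyzed).
thread_local NanoscopeTraceBuffer nanoscope_trace;
```

Keeping the buffer thread-local is what avoids the inter-thread coordination that makes Android Profiler's instrumenting tracer so expensive.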
2.2 “Extended” Nanoscope
The idea for extending Nanoscope stemmed from the observation that analyzing flame graphs by themselves is not the most intuitive method of gaining insight into a given thread’s execution anomalies. Even if certain methods take a long time to execute, it is not always immediately clear why that is, particularly if the methods do not contain obvious “hot spots” (e.g., hot loops). Consequently, we decided to extend Nanoscope to overlay additional metrics on the flame graph to help analyze different aspects of a thread’s execution at the same time. In particular, “extended” Nanoscope supports a sampling framework, described in Section 2.2.1, where various interesting per-thread events (e.g., CPU utilization) can be periodically recorded for further analysis – the framework has been implemented in such a way that it is relatively easy both to add additional metrics to be tracked (as all of them are collected within the same function called at the time of collecting a sample) and to visualize them in our UI (as each one is visualized on a separate canvas whose implementation can be easily derived from the existing ones). We have also made certain additional “surgical” modifications to ART to track other interesting events in the thread’s lifetime, such as thread state transitions that describe when a thread is blocked waiting for a lock, sleeping, performing GC-related activities, etc.
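As an illustration of the kind of events such modifications could emit, the sketch below records thread state transitions as timestamped entries in a per-thread structure. The enum values and type names are hypothetical and deliberately simplified compared to ART’s actual thread states; this is not the code used in Nanoscope.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified set of thread states of interest.
enum class ThreadStateEvent : uint8_t {
  kRunnable,
  kBlockedOnLock,
  kWaitingOnCondition,
  kSleeping,
  kInGc,
  kInNativeCall,
};

// Hypothetical per-thread sink, analogous to the method-trace buffer.
struct ThreadStateTrace {
  struct Record { uint64_t timestamp_ns; ThreadStateEvent state; };
  std::vector<Record> records;

  // Appending to a thread-local vector keeps the hot path lock-free.
  void RecordTransition(uint64_t now_ns, ThreadStateEvent state) {
    records.push_back({now_ns, state});
  }
};

// Called from (hypothetical) hooks placed at ART's state-transition points,
// e.g. when a thread starts waiting on a monitor or enters a native call.
inline void OnThreadStateChange(ThreadStateTrace& trace, uint64_t now_ns,
                                ThreadStateEvent new_state) {
  trace.RecordTransition(now_ns, new_state);
}
```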
2.2.1 Sampling framework
The design and implementation of Nanoscope’s sampling framework has been inspired by the profilers implemented for “regular” (server or desktop) Java applications, such as async-profiler [7] or honest-profiler [11]. The main idea is to use a system call to schedule periodic signal generation by the OS, with signals delivered to a given process or, in our case, a given thread. In terms of implementation, this has been achieved by modifying the relevant parts of ART.

There are multiple ways to have the OS periodically send a signal to a given thread. One method is to use the perf_event_open system call – an apparent risk of using this method is that a thread receiving a signal can be interrupted during another system call, which could result in an application crash, but we have never experienced this situation in practice. Another method is to use the timer_settime system call with a clock measuring a given thread’s CPU time, which is supposed to only interrupt a thread in user-level code, but in our experiments we could not achieve the desired fidelity of signal delivery (at least one signal every 1ms) using this method. Consequently, we decided to support both methods in Nanoscope – the method is selected when tracing is initiated, defaulting to the perf_event_open system call.
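As a rough sketch (not Nanoscope’s actual code), the perf_event_open-based variant could be set up along the following lines, assuming a SIGPROF handler that records the sample; error handling is omitted, the helper names are illustrative, and the exact flags may need adjustment for a given kernel or libc.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // F_SETSIG / F_SETOWN_EX are Linux-specific fcntl extensions
#endif
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <linux/perf_event.h>
#include <signal.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical handler: the real handler would record CPU time, perf counters,
// and allocation stats into a thread-local buffer.
static void SampleHandler(int, siginfo_t*, void*) { /* collect one sample */ }

// Arrange for SIGPROF to be delivered to thread `tid` roughly every
// `period_ns` nanoseconds of that thread's own CPU time.
static int StartSampling(pid_t tid, uint64_t period_ns) {
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = SampleHandler;
  sa.sa_flags = SA_SIGINFO | SA_RESTART;
  sigaction(SIGPROF, &sa, nullptr);

  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_SOFTWARE;
  attr.config = PERF_COUNT_SW_TASK_CLOCK;  // counts the thread's CPU time in ns
  attr.sample_period = period_ns;          // overflow (and thus a signal) per period
  attr.wakeup_events = 1;

  // Monitor only the given thread, on any CPU.
  int fd = syscall(__NR_perf_event_open, &attr, tid, /*cpu=*/-1,
                   /*group_fd=*/-1, /*flags=*/0);
  if (fd < 0) return -1;

  // Route overflow notifications as SIGPROF directly to that thread.
  struct f_owner_ex owner = {F_OWNER_TID, tid};
  fcntl(fd, F_SETOWN_EX, &owner);
  fcntl(fd, F_SETSIG, SIGPROF);
  fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC);

  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
  return fd;  // keep the fd open while sampling; close(fd) stops it
}
```

Reading the counters configured on this file descriptor from within the signal handler is also what makes the page-fault and context-switch metrics mentioned below cheap to collect.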
Figure 1. Nanoscope UI window
Currently, the most important (in terms of the insights it provided) metric collected in the sampling framework is CPU utilization – it is calculated by obtaining wall clock time (using the same mechanism as for the timestamps already collected by Nanoscope for method tracing) and CPU time (via clock_gettime), and computing the ratio of the two. However, we also experimented with other metrics, such as the number of page faults (major and minor) and the number of context switches per sampling period (both collected by reading counters set up by the perf_event_open system call and subsequently available in a signal handler), as well as allocation rates per sampling period (collected by enabling ART RuntimeStats tracking and reading from it at every sample). We demonstrate how these metrics have been visualized in Section 2.3.
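The ratio described above can be computed directly from two clocks. The sketch below shows the arithmetic under the assumption that it runs on the sampled thread itself and that the previous sample’s values are kept between invocations; the type and function names are hypothetical.

```cpp
#include <cstdint>
#include <ctime>

static uint64_t NowNs(clockid_t clock_id) {
  struct timespec ts;
  clock_gettime(clock_id, &ts);
  return static_cast<uint64_t>(ts.tv_sec) * 1000000000ull + ts.tv_nsec;
}

struct CpuUtilSample {
  uint64_t wall_ns;    // wall clock timestamp of this sample
  uint64_t cpu_ns;     // cumulative CPU time of this thread
  double utilization;  // fraction of the last sampling period spent on-CPU
};

// Called from the sampling signal handler with the previous sample's values.
static CpuUtilSample TakeCpuUtilSample(uint64_t last_wall_ns,
                                       uint64_t last_cpu_ns) {
  uint64_t wall = NowNs(CLOCK_MONOTONIC);
  uint64_t cpu = NowNs(CLOCK_THREAD_CPUTIME_ID);  // CPU time of calling thread
  double wall_delta = static_cast<double>(wall - last_wall_ns);
  double cpu_delta = static_cast<double>(cpu - last_cpu_ns);
  double util = wall_delta > 0.0 ? cpu_delta / wall_delta : 0.0;
  return {wall, cpu, util};
}
```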
2.3 Nanoscope in action
Currently, Nanoscope focuses on analyzing the performance of a single (in most cases the main) thread, and our visualization tool reflects that. In Figure 1 we present a view of the entire Nanoscope UI window showing the execution trace of the main thread in one of our applications. This particular execution represents a startup sequence of the application – we started the application and let it execute for ∼10s (on a Nexus 6p running a Nanoscope-enabled ROM based on Android 7.1.2). At the bottom of the figure is the flame graph³ representing execution of the main thread, and at the top there are charts for various additional execution metrics (from top to bottom):
• CPU utilization
• number of context switches
• number of page faults (minor in green, major in red)
• bytes allocated by the whole process (minus freed ones)
• objects allocated by the whole process (minus freed ones)
• total bytes allocated by the thread
• total bytes freed by the thread (none in this case)
• current thread state (e.g., executing a native method, blocked on a lock, waiting on a condition variable, sleeping, etc.)
³ We omitted names of methods in the flame graph for confidentiality reasons.
Figure 2. Renderer thread’s nFence method
We currently display all metrics that are being collected, but in the near future we plan to make Nanoscope more configurable – have the user choose which metrics are of interest and visualize only those metrics. In the bottom left corner there is also a window containing detailed information about the method currently selected in the flame graph (in this case the main method of this thread, represented by the top-most frame in the flame graph), such as the duration of the selected method call, the average duration of this method’s execution across all its invocations, the number of times this method has been called, etc.
At this stage of the project we have not yet taken any actions on the data generated by Nanoscope, but it is evident that certain, potentially useful, observations would be difficult to make without the kind of information it provides.
Figure 3. Nanoscope overheads
For example, focusing on the part of the execution trace representing the end of the startup sequence, when the application is largely idle waiting for user input (right-hand side of Figure 1), reveals that while the main thread no longer allocates any memory (as expected), the overall number of objects in the application process is still growing. This does not necessarily indicate the existence of a real problem, but considering that the main thread is the one expected to execute a large portion of the application’s Java code, it may be worth investigating the increase in object allocations on background threads.

Another example is an unusually long (almost 100ms) period of low CPU utilization while executing a native nFence method that belongs to Android’s renderer thread (it can be observed as a light blue rectangle in the chart describing the current thread state metric⁴) – a zoomed-in view of the relevant portion of the execution trace is presented in Figure 2. Again, this is not necessarily an indication of an actual (and fixable) problem, but it may be worth a second look.
3 Overheads
In its current form, as compared to Android Profiler, Nanoscope strikes a certain tradeoff – the analysis is focused on a single thread, but at least some execution metrics go beyond what Android Profiler can produce, and the overhead of data collection is expected to be much lower. While we have already made an argument for the usefulness of the execution metrics provided by Nanoscope, we have not yet discussed its overheads.
In order to measure Nanoscope-related overheads, we executed the startup sequence of the same application we used for demonstrating Nanoscope features in Section 2.3 100 times, and measured the time from application start to the point when it finished one of the internally defined execution spans indicating the end of the startup sequence. We measured execution times for four different configurations:
⁴ All represented thread states are color-coded, and the light blue color indicates a native method call.
• baseline ROM – a “clean” build of unmodified Android OS 7.1.2 that is the basis for the Nanoscope implementation
• “basic” Nanoscope ROM – a build of Android OS 7.1.2 with “basic” Nanoscope extensions but with all data collection disabled
• “basic” Nanoscope – a build of Android OS 7.1.2 with “basic” Nanoscope extensions and with method tracing collection enabled
• “extended” Nanoscope – a build of Android OS 7.1.2 with “extended” Nanoscope extensions and with both method tracing and additional metrics collection enabled
In Figure 3 we plot average execution times (along with 95% confidence intervals) for the Nanoscope-enabled configurations, normalized with respect to the execution time for unmodified Android 7.1.2 (“baseline”). As we can see, the overhead of “basic” Nanoscope is very reasonable and does not exceed 10%, which compares very favorably with Android Profiler, which can slow down application execution by an order of magnitude when collecting precise traces. It is somewhat surprising that even with all tracing disabled, the overhead of the Nanoscope-enabled build is noticeable – we attribute that to the code added to both the interpreter and the optimizing compiler whose fast path is always executed, but we defer a more thorough investigation of this issue to future work. More importantly, however, the overhead of the sampling framework and the collection of several additional metrics is limited to a total of 18%.
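For reference, the normalized means and 95% confidence intervals plotted in Figure 3 can be computed as in the sketch below, assuming the 100 per-run startup times for each configuration are available as plain numbers; this is a generic post-processing sketch, not code from the Nanoscope tool-chain.

```cpp
#include <cmath>
#include <vector>

// Mean and 95% confidence-interval half-width for a set of run times;
// with n = 100 runs the normal-approximation z-value of 1.96 is reasonable.
struct Summary { double mean; double ci95; };

Summary Summarize(const std::vector<double>& times) {
  const double n = static_cast<double>(times.size());
  double sum = 0.0;
  for (double t : times) sum += t;
  const double mean = sum / n;
  double sq = 0.0;
  for (double t : times) sq += (t - mean) * (t - mean);
  const double stddev = std::sqrt(sq / (n - 1.0));  // sample standard deviation
  return {mean, 1.96 * stddev / std::sqrt(n)};
}

// Value plotted in Figure 3: a configuration's mean divided by the baseline
// ROM's mean, so the baseline itself normalizes to 1.0.
double Normalize(const Summary& config, const Summary& baseline) {
  return config.mean / baseline.mean;
}
```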
4 Future directions
Our work on Nanoscope has only just begun, and while we feel that the tool can already be useful in locally diagnosing various performance problems, we plan to continue refining and extending Nanoscope in the foreseeable future, with the help of both internal customers and the larger open source community. In particular, we would like to extend Nanoscope to collect data for multiple threads. In addition to more work at the VM level that would likely be required to accomplish this, it would also require different ways of visualizing the execution of, potentially, a large number of threads – flame graphs are not very practical in this kind of setting. We may also consider further refining the information provided by the tool – for example, to not only indicate points of lock contention or waiting on a condition variable, but also to highlight exactly which threads are involved in these kinds of events. Finally, we are thinking of providing additional information about object allocations, for example via low-overhead heap sampling similar to the one proposed for the HotSpot JVM [1].
References
[1] Jean Christophe Beyler. 2018. JEP 331: Low-Overhead Heap Profiling. http://openjdk.java.net/jeps/331
[2] Google. 2018. ART and Dalvik. https://source.android.com/devices/tech/dalvik/
[3] Google. 2018. Measure app performance with Android Profiler. https://developer.android.com/studio/profile/android-profiler
[4] Google. 2018. systrace. https://developer.android.com/studio/command-line/systrace
[5] Markus Gronlund and Erik Gahlin. 2018. JEP 328: Flight Recorder. http://openjdk.java.net/jeps/328
[6] Delyan Kratunov. 2018. Profilo: Understanding app performance in the wild. https://code.fb.com/android/profilo-understanding-app-performance-in-the-wild/
[7] Andrei Pangin and Vadim Tsesko. 2018. async-profiler. https://github.com/jvm-profiling-tools/async-profiler
[8] Leland Takamine and Brian Attwell. 2018. Introducing Nanoscope: An Extremely Accurate Method Tracing Tool for Android. https://source.android.com/devices/tech/dalvik/
[9] Uber. 2018. Nanoscope. https://github.com/uber/nanoscope
[10] Uber. 2018. Nanoscope ART. https://github.com/uber/nanoscope-art
[11] Richard Warburton. 2018. honest-profiler. https://github.com/jvm-profiling-tools/honest-profiler