Profiling Android Applications with Nanoscope

Lun Liu, University of California, Los Angeles, USA

Leland Takamine, Uber Technologies, San Francisco, USA

Adam Welc, Uber Technologies, San Francisco, USA

Abstract

User-level tooling support for profiling Java applications executing on modern JVMs for desktop and server is quite mature – from OpenJDK’s Java Flight Recorder [5] enabling low-overhead CPU and heap profiling, through third-party async profilers (e.g. async-profiler [7], honest-profiler [11]), to OpenJDK’s support for low-overhead tracking of allocation call sites [1].

On the other hand, despite architectural similarities

between language execution environments, tooling support for analyzing performance of Java code on Android lags significantly behind. Arguably the most popular, but also virtually the only tool in this category, is Android Profiler [3] – it provides process-level information about CPU, Java heap, and network activities. It also provides support for method tracing and sampling. However, the former exhibits high overhead and the latter, while providing better performance, sacrifices precision. Information provided by Android Profiler can be augmented with data collected by other tools, particularly systrace [4], which collects data at the operating system level. Unsurprisingly, third-party tools have also started emerging recently, such as Facebook’s Profilo [6] framework. However, using these additional tools requires working with multiple tool-chains, may require a certain amount of manual effort (e.g., building the application in a special way, inserting profiling framework API calls into the application), and it is often non-trivial to infer interesting information from the data being collected (e.g. because only part of the tooling framework is publicly available).

In this paper we describe our work on Nanoscope, a

single open source, low-overhead, and extensible tool that not only works on unmodified Android applications and provides precise method traces with low overhead, but also aims to present programmers with selected relevant information within the same framework to help them develop an intuition behind a concrete non-performant application behavior. The tradeoff here is that Nanoscope requires a custom Android build as it relies on additional support from ART. In this paper, we will describe Nanoscope’s design and implementation, present examples of performance-related information that can be obtained from the tool, and discuss Nanoscope-related overheads.

VMIL’18, November 04, 2018, Boston, MA, USA. 2018.

1 Introduction

Compared to “regular” Java programs (i.e., executing on a desktop or a server), tooling for Android programs (i.e., executing on mobile devices) lags significantly behind. Arguably, the most popular analysis tool for Android to locally identify performance problems comes from Google itself in the form of Android Profiler [3], providing information about an executing program and its threads, such as per-process heap memory and CPU usage, network traffic, and per-thread execution data in the form of method traces. In our experience, for a reasonably tuned application, the most interesting piece of information provided by Android Profiler is method traces, as such applications do not exhibit sudden or large spikes in object allocation rates or per-process CPU usage. Android Profiler collects these traces either by instrumenting all application Java methods or by sampling all application threads – the former sacrifices performance (as not only are all methods of all threads traced, but tracing requires inter-thread coordination) and the latter sacrifices precision (as not every method execution is recorded). Other tool-chains (e.g., systrace [4]) exist to augment information provided by Android Profiler, but they either cannot provide VM-level information (e.g., lock contention) or they cannot operate on unmodified production application files (e.g. they require a special compilation process or in-application API calls), and in general cannot provide a uniform user experience within a single framework. One exception to this rule is the Profilo [6] framework developed by Facebook, but this toolkit has been only partially open-sourced (e.g., without visualization tools) and is also geared towards aggregating performance-related information from applications running in production rather than providing the ability to interactively deep dive into specific performance problems of a given single application execution.

In order to overcome limitations of the existing Android performance-focused tools, we started developing our own framework called Nanoscope. In designing and implementing Nanoscope we were guided by the following principles:

• as the starting point, we would like to get similar types of information to that provided by Android Profiler with precise execution traces, but we are


not willing to pay the price of using Android Profiler tracing, as it can slow down execution of our applications by an order of magnitude

• we would like to overlay additional important metrics at the OS level (e.g., CPU utilization) and at the VM level (e.g. lock contention) directly on the method execution traces

• we would like to be able to analyze performance of unmodified production applications

• when analyzing performance of our mobile applications we are mostly interested in the behavior of the main thread

In the following section we describe how these principles have been reflected in the design and implementation of Nanoscope. Please note that this is very much work in progress!

2 Design and implementation

The focus of the first open source release of “basic” Nanoscope [9, 10] was to provide precise method tracing similar to that offered by Android Profiler but at a much lower cost. The goal of the second “extended” version of Nanoscope¹ is to overlay additional information on method execution traces. Currently, Nanoscope is only supported on Android 7.1.2, as it requires modifications to the Android Runtime [2] (ART) which are specific to a given OS version.

2.1 “Basic” Nanoscope

Currently Nanoscope focuses on analyzing a single thread (e.g., the main thread of the application) – it collects method tracing data into a thread-local data structure, where it records method entry and exit events along with their timestamps. It is therefore straightforward to extend it to analyze multiple threads – traces for multiple threads can be aligned in the UI using their timestamps in post-processing. We report overheads of collecting traces for the main thread (which in our application is the most active thread in terms of Java code execution) in Section 3 – evaluation of method trace collection for multiple threads is left for future work.

In order to implement method tracing in Nanoscope

we needed to modify both the interpreter and the optimizing compiler so that the code to record a timestamp and a method name in an in-memory buffer is injected for each method prologue and epilogue – the buffer is flushed to persistent storage once tracing is done. Tracing can be enabled/disabled via system properties (no application modification required) or via in-application API calls (for increased flexibility in controlling a tracing span). Method traces for a given thread are visualized as a flame graph in a custom browser-based UI written in

¹Also open-sourced recently into the same Nanoscope repositories.

JavaScript that needed to be implemented to efficiently handle large volumes of tracing data. In addition to “regular” Java methods, “basic” Nanoscope also traces class loading and initialization time, which are represented as “virtual”² methods. A more detailed description of “basic” Nanoscope can be found in the Nanoscope blog [8].

2.2 “Extended” Nanoscope

The idea for extending Nanoscope stemmed from the observation that analyzing flame graphs themselves is not the most intuitive method of gaining insight into a given thread’s execution anomalies. Even if certain methods take a long time to execute, it is not always immediately clear why that is, particularly if the methods do not contain obvious “hot spots” (e.g., hot loops). Consequently, we decided to extend Nanoscope to overlay additional metrics on the flame graph to help analyze different aspects of a thread’s execution at the same time. In particular, “extended” Nanoscope supports a sampling framework, described in Section 2.2.1, where various interesting per-thread events (e.g. CPU utilization) can be periodically recorded for further analysis – the framework has been implemented in such a way that it is relatively easy both to add additional metrics to be tracked (as all of them are collected within the same function called at the time of collecting a sample) and to visualize them in our UI (as each one is visualized on a separate canvas whose implementation can be easily derived from the existing ones). We have also made certain additional “surgical” modifications to ART to track other interesting events in the thread’s lifetime, such as thread state transitions that describe when a thread is blocked waiting for a lock, sleeping, performing GC-related activities, etc.

2.2.1 Sampling framework

Design and implementation of Nanoscope’s sampling framework has been inspired by the profilers implemented for “regular” (server or desktop) Java applications, such as async-profiler [7] or honest-profiler [11]. The main idea is to use a system call to schedule periodic signal generation by the OS, with signals delivered to a given process or, in our case, a given thread. In terms of implementation, this has been achieved by modifying relevant parts of ART.

There are multiple ways to have the OS periodically

send a signal to a given thread. One method is to use the perf_event_open system call – an apparent risk of using this method is that a thread receiving a signal can be interrupted during another system call, which could

²In the sense of methods that do not exist in reality, rather than in the sense of virtual method dispatch.


Figure 1. Nanoscope UI window

result in an application crash, but we have never experienced this situation in practice. Another method is to use the timer_settime system call with a clock measuring a given thread’s CPU time, which is supposed to only interrupt a thread in user-level code, but in our experiments we could not achieve the desired fidelity of signal delivery (at least every 1ms) using this method. Consequently, we decided to support both methods in Nanoscope – a method is selected when tracing is initiated, defaulting to using the perf_event_open system call.

Currently, the most important (in terms of insights it provided) metric collected in the sampling framework is CPU utilization – it is calculated by obtaining wall clock time (same mechanism as used for timestamps already collected by Nanoscope for method tracing) and CPU time (via clock_gettime), and calculating the ratio of the two. However, we also experimented with other metrics, such as the number of page faults (major and minor) and the number of context switches per sampling period (both collected by reading counters set up by the perf_event_open system call and subsequently available in a signal handler), as well as allocation rates per sampling period (collected by enabling ART RuntimeStats tracking and reading from it at every sample). We demonstrate how these metrics have been visualized in Section 2.3.
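For concreteness, the timer_settime variant of this scheme, combined with the wall-clock/CPU-time ratio described above, might look roughly as follows (a minimal Linux sketch under our own names, not Nanoscope’s actual ART changes; error handling is elided):

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#ifndef sigev_notify_thread_id          /* glibc hides this field in a union */
#define sigev_notify_thread_id _sigev_un._tid
#endif

/* Arrange for SIGPROF to be delivered to the calling thread roughly every
 * period_ns nanoseconds of CPU time it consumes (the timer_settime method). */
static int start_cpu_sampling(timer_t *out, long period_ns) {
    struct sigevent sev;
    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_THREAD_ID;  /* deliver to this thread only */
    sev.sigev_signo = SIGPROF;
    sev.sigev_notify_thread_id = (pid_t)syscall(SYS_gettid);
    if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, out) != 0)
        return -1;
    struct itimerspec its = {
        .it_value    = { 0, period_ns },
        .it_interval = { 0, period_ns },
    };
    return timer_settime(*out, 0, &its, NULL);
}

/* One sample: wall-clock and thread CPU time taken together. */
typedef struct { struct timespec wall, cpu; } sample_t;

static double ts_diff(const struct timespec *a, const struct timespec *b) {
    return (double)(a->tv_sec - b->tv_sec) + (a->tv_nsec - b->tv_nsec) / 1e9;
}

static void take_sample(sample_t *s) {
    clock_gettime(CLOCK_MONOTONIC, &s->wall);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &s->cpu);
}

/* CPU utilization over a period: CPU time used / wall time elapsed. */
static double utilization(const sample_t *prev, const sample_t *now) {
    double wall = ts_diff(&now->wall, &prev->wall);
    return wall > 0.0 ? ts_diff(&now->cpu, &prev->cpu) / wall : 0.0;
}
```

In a profiler of this style the sample would be taken inside the SIGPROF handler, and the per-period ratio plotted as the CPU utilization chart.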

2.3 Nanoscope in action

Currently, Nanoscope focuses on analyzing performance of a single (in most cases main) thread, and our visualization tool reflects that. In Figure 1 we present a view of the entire Nanoscope UI window showing an execution trace of the main thread in one of our applications. This particular execution represents a startup sequence of the application – we started the application and let it execute for ∼10s (on a Nexus 6P running a Nanoscope-enabled ROM based on Android 7.1.2). At the bottom of the figure is the flame graph³ representing execution of the main thread, and at the top there are charts for various additional execution metrics (from top to bottom):

• CPU utilization

• number of context switches

• number of page faults (minor in green, major in red)

• bytes allocated by the whole process (minus freed ones)

• objects allocated by the whole process (minus freed ones)

• total bytes allocated by the thread

• total bytes freed by the thread (none in this case)

• current thread state (e.g. executing native method, blocked on a lock, waiting on a condition variable, sleeping, etc.)

³We omitted names of methods in the flame graph for confidentiality reasons.


Figure 2. Renderer thread’s nFence method

We currently display all metrics that are being collected, but in the near future we plan to make Nanoscope more configurable – have the user choose which metrics are of interest and visualize only those metrics. At the bottom left corner there is also a window containing detailed information about the method currently selected in the flame graph (in this case the main method of this thread, represented by the top-most frame in the flame graph), such as the duration of the selected method call, the average duration of this method’s execution across all its invocations, the number of times this method has been called, etc.
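The per-method figures in this detail window can be derived directly from matched entry/exit pairs in the trace; a simplified sketch, using an illustrative flattened call record rather than Nanoscope’s actual trace format:

```c
#include <stdint.h>
#include <string.h>

/* A completed call: one matched prologue/epilogue pair from the trace. */
typedef struct { const char *method; uint64_t enter_ns, exit_ns; } call_t;

typedef struct { uint64_t total_ns; int count; } method_stat;

/* Aggregate call count and total duration for one method; the average
 * duration shown in the UI is then total_ns / count. */
static method_stat stats_for(const call_t *calls, int n, const char *method) {
    method_stat s = { 0, 0 };
    for (int i = 0; i < n; i++)
        if (strcmp(calls[i].method, method) == 0) {
            s.total_ns += calls[i].exit_ns - calls[i].enter_ns;
            s.count++;
        }
    return s;
}
```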

At this stage of the project we have not yet taken any actions on the data generated by Nanoscope, but it is evident that certain, potentially useful, observations would be difficult to make without the kind of information it provides.

For example, focusing on the part of the execution trace representing the end of the startup sequence, when the application is largely idle waiting for user input (right-hand side of Figure 1), reveals that while the main thread

Figure 3. Nanoscope overheads (execution time normalized to the baseline ROM; bars for the baseline ROM, “basic” Nanoscope ROM, “basic” Nanoscope, and “extended” Nanoscope configurations)

no longer allocates any memory (as expected), the overall number of objects in the application process is still growing. This does not necessarily indicate the existence of a real problem, but considering that the main thread is the one expected to execute a large portion of the application’s Java code, it may be worth investigating the increase in object allocations on background threads.

Another example is an unusually long (almost 100ms) period of low CPU utilization while executing a native nFence method that belongs to Android’s renderer thread (it can be observed as a light blue rectangle in the chart describing the current thread state metric⁴) – a zoomed-in relevant portion of the execution trace is presented in Figure 2. Again, it is not necessarily an indication of an actual (and fixable) problem, but may be worth a second look.

3 Overheads

In its current form, as compared to Android Profiler, Nanoscope strikes a certain tradeoff – the analysis is focused on a single thread, but at least some execution metrics go beyond what Android Profiler can produce, and the overhead of data collection is expected to be much lower. While we have already made an argument for the usefulness of the execution metrics provided by Nanoscope, we have not yet discussed its overheads.

In order to measure Nanoscope-related overheads, we executed the startup sequence of the same application we used for demonstrating Nanoscope features in Section 2.3 100 times and measured the time from application start to the point when it finished one of the internally defined execution spans indicating the end of the startup sequence. We measured execution times for four different configurations:

⁴All represented thread states are color-coded and the light blue color indicates a native method call.


• baseline ROM – “clean” build of unmodified Android OS 7.1.2 that is the basis for the Nanoscope implementation

• “basic” Nanoscope ROM – a build of Android OS 7.1.2 with “basic” Nanoscope extensions but with all data collection disabled

• “basic” Nanoscope – a build of Android OS 7.1.2 with “basic” Nanoscope extensions and with method tracing collection enabled

• “extended” Nanoscope – a build of Android OS 7.1.2 with “extended” Nanoscope extensions and with both method tracing and additional metrics collection enabled

In Figure 3 we plot average execution times (along with 95% confidence intervals) for the Nanoscope-enabled configurations normalized with respect to the execution time for unmodified Android 7.1.2 (“baseline”). As we can see, the overhead of “basic” Nanoscope is very reasonable, and does not exceed 10%, which compares very favorably with Android Profiler, which can slow down application execution by an order of magnitude when collecting precise traces. It is somewhat surprising that even with all tracing disabled, the overhead of the Nanoscope-enabled build is noticeable – we attribute that to the code added to both the interpreter and the optimizing compiler whose fast path is always executed, but we defer more thorough investigation of this issue to future work. More importantly, however, the overhead of the sampling framework and the collection of several additional metrics is limited to a total of 18%.

4 Future directions

Our work on Nanoscope has only just begun, and while we feel that the tool can already be useful in locally diagnosing various performance problems, we plan to continue refining and extending Nanoscope in the foreseeable future, with the help of both internal customers and the larger open source community. In particular, we would like to extend Nanoscope to collect data for multiple threads. In addition to more work at the VM level that would likely be required to accomplish this, it would also require different ways of visualizing the execution of, potentially, a large number of threads – flame graphs are not very practical in this kind of setting. We may also consider further refining the information provided by the tool – for example, to not only indicate points of lock contention or waiting on a condition variable, but also to highlight which threads exactly are involved in these kinds of events. Finally, we are thinking of providing additional information about object allocations, for example via low-overhead heap sampling similar to the one proposed for the HotSpot JVM [1].

References

[1] Jean Christophe Beyler. 2018. JEP 331: Low-Overhead Heap Profiling. http://openjdk.java.net/jeps/331
[2] Google. 2018. ART and Dalvik. https://source.android.com/devices/tech/dalvik/
[3] Google. 2018. Measure app performance with Android Profiler. https://developer.android.com/studio/profile/android-profiler
[4] Google. 2018. systrace. https://developer.android.com/studio/command-line/systrace
[5] Markus Gronlund and Erik Gahlin. 2018. JEP 328: Flight Recorder. http://openjdk.java.net/jeps/328
[6] Delyan Kratunov. 2018. Profilo: Understanding app performance in the wild. https://code.fb.com/android/profilo-understanding-app-performance-in-the-wild/
[7] Andrei Pangin and Vadim Tsesko. 2018. async-profiler. https://github.com/jvm-profiling-tools/async-profiler
[8] Leland Takamine and Brian Attwell. 2018. Introducing Nanoscope: An Extremely Accurate Method Tracing Tool for Android. https://source.android.com/devices/tech/dalvik/
[9] Uber. 2018. Nanoscope. https://github.com/uber/nanoscope
[10] Uber. 2018. Nanoscope ART. https://github.com/uber/nanoscope-art
[11] Richard Warburton. 2018. honest-profiler. https://github.com/jvm-profiling-tools/honest-profiler
