Stanford University, JVM '02, August 2, 2002
Targeting Dynamic Compilation for Embedded Systems
Michael Chen, Kunle Olukotun
Computer Systems Laboratory, Stanford University


TRANSCRIPT

Page 1: Targeting Dynamic Compilation for Embedded Systems

Michael Chen, Kunle Olukotun
Computer Systems Laboratory, Stanford University

Page 2: Outline

  Motivating Problem
  Compiler Design
  Performance Results
  Conclusions

Page 3: Challenges of Running Java on Embedded Devices

J2ME (Micro Edition) on CDC (Connected Device Configuration):
  PDAs, thin clients, and high-end cellphones
  Highly resource constrained:
    30MHz - 200MHz embedded processors
    2MB - 32MB RAM
    < 4MB ROM

Differences from running Java on desktop machines:
  Satisfying performance requirements is difficult with slower processors
  Virtual machine footprint matters
  Limited dynamic memory available for the runtime system

[Figure: Java platform spectrum from embedded to server: J2ME/CLDC and J2ME/CDC (embedded), J2SE (desktop), J2EE (server)]

Page 4: Java Execution Models

Interpretation:
  Decode and execute bytecodes in software
  Incurs a high performance penalty

Fast code generators:
  Dynamic compilation without aggressive optimization
  Sacrifices code quality for compilation speed

Lazy compilation:
  Interpret bytecodes, then translate frequently executed methods with an optimizing compiler
  Adds complexity, and the total ROM footprint of interpreter + compiler is large

Alternative approach?

Page 5: microJIT: An Efficient Optimizing Compiler

Minimize major compiler passes while optimizing aggressively:
  Perform several optimizations concurrently
  Pipeline information from one pass to drive optimizations in subsequent passes

Budget overheads for dataflow analysis:
  Efficient implementations of straightforward optimizations
  Good heuristics for difficult optimizations

Manage compiler dynamic memory requirements:
  Efficient dataflow representation

Page 6: Using microJIT in Embedded Systems Configuration

Compile everything to native code. Potential advantages over other execution models:
  Lower total system cost: multiple execution engines require more ROM
  Reduced complexity: only one compiler to maintain
  Doesn't sacrifice long- or short-running performance: generates fast code while minimizing overheads

Page 7: Outline

  Motivating Problem
  Compiler Design
  Performance Results
  Conclusions

Page 8: microJIT Compiler Overview

Three passes, each feeding information forward:
  1. CFG Construction -> locals & field accesses, loop identification
  2. DFG Generation -> IR expression optimizations, IR expression use counts, dataflow information
  3. Native Code Generation -> register allocator, machine idioms, instruction scheduler

ISA-dependent inputs to code generation: register reservations, assembler macros, instruction delays

Page 9: Pass 1: CFG Construction

Quickly scan bytecodes in one pass:
  Partially decode bytecodes to extract desired information
  Decompose the method into extended basic blocks (EBBs)
  Build blocks and arcs as branches and targets are encountered

Compute block-level dataflow information:
  Identify loops
  Record local and field accesses for blocks and loops
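The block-boundary part of such a one-pass scan can be sketched as follows. This is a hedged illustration, not microJIT's code: the opcodes and one-byte absolute-target encoding are simplified stand-ins for real JVM bytecodes.

```java
import java.util.*;

// Sketch of a one-pass "leader" scan: every branch target and every
// fall-through point after a branch starts a new basic block.
// Opcodes here are hypothetical, not real JVM bytecodes.
public class LeaderScan {
    static final int OP_NOP = 0;     // 1 byte
    static final int OP_GOTO = 1;    // opcode + 1-byte absolute target
    static final int OP_IFZERO = 2;  // opcode + 1-byte absolute target

    // Returns the sorted set of offsets that begin a basic block.
    public static SortedSet<Integer> findLeaders(int[] code) {
        SortedSet<Integer> leaders = new TreeSet<>();
        leaders.add(0);                       // method entry is always a leader
        int pc = 0;
        while (pc < code.length) {
            switch (code[pc]) {
                case OP_GOTO:
                case OP_IFZERO:
                    leaders.add(code[pc + 1]);   // branch target starts a block
                    if (pc + 2 < code.length)
                        leaders.add(pc + 2);     // fall-through starts a block
                    pc += 2;
                    break;
                default:
                    pc += 1;
            }
        }
        return leaders;
    }
}
```

A real pass 1 would also record arcs between the blocks and the local/field access statistics as it goes, in the same single sweep.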

Page 10: Pass 2: DFG Generation

Intermediate representation (IR):
  Closer to machine instructions than bytecodes (LIR)
  Triples representation: unnamed destination
  Source arguments are pointers to other IR expression nodes
  Complex bytecodes decompose into several IR expressions

Example (computes -(L0 + 1)):
  [L0]
  [1] const 1
  [2] add [1] [L0]
  [3] neg [2]
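A triples node as described above can be sketched in a few lines. This is an illustrative data structure, not microJIT's implementation; the operator names and the `eval` helper (used only to check the example) are assumptions.

```java
// Minimal triples-style IR node: an operator plus pointers to source
// nodes; the destination is the node itself (unnamed).
public class Triple {
    final String op;      // "local", "const", "add", "neg"
    final int value;      // constant value or local index; 0 otherwise
    final Triple[] args;  // pointers to source IR expression nodes

    Triple(String op, int value, Triple... args) {
        this.op = op; this.value = value; this.args = args;
    }

    // Evaluate the expression tree (for illustration only; a compiler
    // would emit machine code instead of interpreting the DAG).
    public int eval(int[] locals) {
        switch (op) {
            case "local": return locals[value];
            case "const": return value;
            case "add":   return args[0].eval(locals) + args[1].eval(locals);
            case "neg":   return -args[0].eval(locals);
            default:      throw new IllegalStateException(op);
        }
    }

    // The four-node example from the slide: [3] neg ([2] add ([1] const 1, [L0]))
    public static Triple exampleFromSlide() {
        Triple l0  = new Triple("local", 0);         // [L0]
        Triple one = new Triple("const", 1);         // [1] const 1
        Triple add = new Triple("add", 0, one, l0);  // [2] add [1] [L0]
        return new Triple("neg", 0, add);            // [3] neg [2]
    }
}
```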

Page 11: Pass 2: DFG Generation -- Block-local Optimizations

Maintain a mimic stack when translating into IR expressions:
  Manipulate pointers in place of locals and stack accesses, which do not generate IR expressions
  Immediately eliminates copy expressions

Optimizations are immediately applied to newly created IR expressions:
  Check source arguments for constant propagation and algebraic simplifications
  Search backwards in the EBB for an available matching expression (CSE)

Example: L0.count++;

  bytecode (bpc): 0 aload_0, 1 dup, 2 getfield count, 4 iconst_1, 5 iadd, 6 putfield count

  IR expressions:
    [L0]
    [1] load @ [L0]+16
    [2] const 1
    [3] add [1] [2]
    [4] store [3] @ [L0]+16
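The mimic-stack idea can be sketched as below: stack slots hold pointers to IR nodes, `dup` is pure pointer manipulation with no IR emitted, and folding is applied the moment a new expression would be created. All names are illustrative assumptions, not microJIT's own.

```java
import java.util.*;

// Sketch of a mimic (abstract operand) stack with immediate constant
// folding, in the spirit of the block-local optimizations above.
public class MimicStack {
    public record Node(String op, int value, Node left, Node right) {}

    private final Deque<Node> stack = new ArrayDeque<>();

    public void pushConst(int c) { stack.push(new Node("const", c, null, null)); }

    // dup just duplicates the pointer -- no copy expression is generated.
    public void dup() { stack.push(stack.peek()); }

    // iadd: fold immediately if both operands are constants; otherwise
    // create an "add" IR node referencing its sources.
    public void iadd() {
        Node b = stack.pop(), a = stack.pop();
        if (a.op().equals("const") && b.op().equals("const"))
            stack.push(new Node("const", a.value() + b.value(), null, null));
        else
            stack.push(new Node("add", 0, a, b));
    }

    public Node top() { return stack.peek(); }
}
```

A full translator would also consult the EBB's existing expressions here for CSE before creating a new node.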

Page 12: Pass 2: DFG Generation -- Global Optimizations

Global optimizations are also immediately applied to newly created IR expressions:
  Global forward-flow information is available for every new IR expression
  Blocks are processed in reverse post-order (predecessors first)
  Loop field and locals access statistics from the previous pass are used to calculate a fixed-point solution at the loop header
  Restricted to dataflow optimizations that rely primarily on forward-flow information
  Global constant propagation, copy propagation, and CSE

[Figure: example CFG (blocks B1-B7) with a loop locals access table:
  local  LD  ST
  L0     T   F
  L1     T   T]
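The reverse post-order traversal mentioned above (predecessors before successors, for reducible flow graphs) is standard and can be sketched directly. The adjacency-list CFG representation is a stand-in for microJIT's own structures.

```java
import java.util.*;

// Sketch of reverse post-order block numbering: depth-first search,
// record blocks in post-order, then reverse the list.
public class RpoOrder {
    public static List<Integer> reversePostOrder(List<List<Integer>> succ, int entry) {
        List<Integer> post = new ArrayList<>();
        boolean[] seen = new boolean[succ.size()];
        dfs(succ, entry, seen, post);
        Collections.reverse(post);   // reversed post-order = RPO
        return post;
    }

    private static void dfs(List<List<Integer>> succ, int b, boolean[] seen, List<Integer> post) {
        if (seen[b]) return;
        seen[b] = true;
        for (int s : succ.get(b)) dfs(succ, s, seen, post);
        post.add(b);   // emitted after all successors
    }
}
```

Visiting blocks in this order guarantees that, outside of loop back-edges, every predecessor's forward-flow facts are already available when a block is translated.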

Page 13: Pass 2: DFG Generation -- Loop Invariant Code Motion

Check loop statistics to make sure source arguments are not redefined in the loop
Can perform code motion on dependent instructions without iterating
Hoisted IR expressions are immediately communicated to successive instructions and blocks in the loop

[Figure: loop with preheader (PH), header (H), and exit (E). The loop locals access table shows L0 and L1 are loaded (LD = T) but never stored (ST = F), so the expressions
  [1] add [L0] [L1]
  [2] const 1
  [3] sub [1] [2]
can be hoisted to the preheader, with [1] and [3] becoming globals [G0] and [G1]]
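The hoisting test described above reduces to a lookup in the per-loop access table from pass 1: an expression over locals is invariant if none of the locals it reads is stored anywhere in the loop. A hedged sketch, with illustrative names:

```java
// Sketch of the loop-invariance test: consult the loop's locals access
// table (computed in the CFG pass) instead of iterating over the loop body.
public class HoistCheck {
    // storedInLoop[i] is true if local i is written somewhere in the loop
    // (the ST column of the slide's access table).
    public static boolean hoistable(int[] localsRead, boolean[] storedInLoop) {
        for (int l : localsRead)
            if (storedInLoop[l]) return false;   // redefined in loop: keep it
        return true;
    }
}
```

Because the answer comes from precomputed statistics, dependent expressions (like [3] sub [1] [2] above, which reads an already-hoisted [1]) can be hoisted in the same sweep without a fixed-point iteration.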

Page 14: Pass 2: DFG Generation -- Inlining

Optimized for small methods
Handles nested inlining: important for object initializers with deep sub-classing
Can inline non-final public virtual and interface methods when only one target is found at runtime, protected with a class check
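The class-check guard can be rendered at the Java source level as below. This is a conceptual illustration of the generated-code shape, not microJIT's output; `Shape`, `Circle`, and `areaGuarded` are made-up names for the sketch.

```java
// Illustrative shape of guarded inlining: a virtual call whose only
// loaded target is Circle.area() is inlined behind a cheap class check,
// falling back to normal virtual dispatch if the check fails.
public class GuardedInline {
    interface Shape { double area(); }

    static final class Circle implements Shape {
        final double r;
        Circle(double r) { this.r = r; }
        public double area() { return Math.PI * r * r; }
    }

    // What compiled code for "s.area()" conceptually becomes when Circle
    // is the only implementor found at compile time:
    public static double areaGuarded(Shape s) {
        if (s.getClass() == Circle.class) {
            // inlined body of Circle.area() -- no virtual dispatch
            double r = ((Circle) s).r;
            return Math.PI * r * r;
        }
        return s.area();   // slow path: ordinary interface call
    }
}
```

The guard is what makes the inlining safe without the decompile-and-fix machinery that the HotSpot compilers use (page 26).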

Page 15: Pass 3: Code Generation

Registers allocated dynamically as code is generated
Instruction scheduling within a basic block:
  Uses standard list scheduling techniques
  Fills load and branch delay slots
Successfully ported to three different ISAs:
  MIPS, SPARC, StrongARM
  Ports took only a few weeks to implement
  Plans to port to x86

Page 16: Pass 3: Code Generation -- Fast Optimization of Machine Idioms

Traditionally done using a peephole optimizer, which requires an additional pass over generated code

Compiler features allow optimization of machine idioms without an additional pass:
  Machine-specific code can be invoked in the two passes
  Configurable IR expressions
  Deferred code generation of IR expressions

Optimized machine idioms:
  Register calling conventions
  Mapping branch implementations
  Immediate operands
  Different addressing modes

Page 17: Pass 3: Code Generation -- Code Generation Example

[Figure: worked example tracking IR expressions through DFG generation and code generation. Each IR expression carries {block, global} use counts and last-use information; counts are decremented as expressions are consumed, and a register is freed (F) at an operand's last use, otherwise stays live (N).]

  id  IR expression                 generated SPARC code
  [L0]                              (in %l0)
  [1]  load @ [L0]+16               ldw [%l0+16],%o1
  [2]  const 5                      mov 5,%o0
  [3]  const &newarray
  [4]  call [3] ([2] [1]) [L1]      mov %o1,%l1 ; call newarray
  [5]  const 1
  [6]  add [1] [5]                  add %l1,1,%g1
  [7]  store [6] @ [L0]+16          stw %g1,[%l0+16]

  Register conventions: %ln -- call-preserved; %on -- argument; %gn -- temporary

Page 18: Pass 3: Code Generation -- Global Register Allocation

[Figure: CFG with blocks B0-B5 and join points J0-J2 (e.g. J0: out of B0, into B1 and B2; J1: out of B1 and B3, into B3 and B4; J2: out of B2 and B4, into B5). At each join, the outgoing registers of predecessor blocks are reserved as incoming registers for the successors, so the one-pass allocator can keep values in registers across block boundaries.]

Page 19: Outline

  Motivating Problem
  Compiler Design
  Performance Results
  Conclusions

Page 20: Experiment Setup

SPARC VMs chosen for comparison:
  Large number of VMs with source code available (required for timing and memory-use instrumentation)
  Neutral RISC ISA
  No embedded JITs available for comparison

Variety of benchmarks chosen:
  Benchmark suites -- SPECjvm98, Java Grande, jBYTEmark
  Other significant applications -- MipsSimulator, h263 Decoder, jLex, jpeg2000

Page 21: Comparisons to Other Dynamic Compilers

  JIT                        Sun-Client      Sun-Server       SNU LaTTe        microJIT
  Intermediate repr.         Simple          SSA dataflow     Dataflow         Dataflow
  Major compiler passes      4               Iterative        7                3-4
  Register allocation        1-pass dynamic  Graph coloring   2-pass dynamic   1-pass dynamic
  Virtual machine            HotSpot         HotSpot          Kaffe            Kaffe
  Compiler size (stripped)   700KB           1.5MB            325KB            200KB
  Interpreter size           220KB           220KB            65KB             None

  Optimizations:
    Sun-Client: block merging/elimination, simple constant propagation, inlining & specialization
    Sun-Server: loop invariant code motion, global value numbering, conditional constant propagation, inlining & specialization, instruction scheduling
    SNU LaTTe: EBB value numbering, EBB constant propagation, loop invariant code motion, dead code elimination, inlining & specialization, instruction scheduling
    microJIT: CSE, copy propagation, constant propagation, loop invariant code motion, dead code elimination, inlining & specialization, instruction scheduling

Page 22: Compilation Speed

Measured on an UltraSPARC-II @ 200MHz, Sun Solaris 8:
  30% faster than Sun-client
  2.5x faster than the nearest dataflow compiler (LaTTe)

[Figure: bar chart of compilation speed (bytecodes per 1k cycles, 0.00-0.30) vs. method bytecode size (<50B, 50B-250B, 250B-1KB, 1KB-5KB, >5KB, average) for Sun-server, LaTTe, Sun-client, and microJIT]

Page 23: Time Spent in Each Compiler Pass

CFG construction is consistently < 10% of compile time
DFG generation grows in proportion for large methods (CSE time grows with increasing code size)
Code generation time for large methods can be improved by limiting optimizations whose costs grow with method size

[Figure: stacked bar chart of compilation time (0-100%) split among CFG generation, DFG generation, and code generation, vs. method bytecode size (<50B, 50B-250B, 250B-1KB, 1KB-5KB, >5KB, average)]

Page 24: Performance on Long Running Benchmarks

Compilation time is proportionally smaller relative to execution time
Collected times also include the Sun interpreter
Good performance on numerical programs; performance suffers on object-oriented code

[Figure: speedup normalized to microJIT (0.0-2.0) for Sun-server, LaTTe, Sun-client, microJIT, and Sun-intrp on compress, db, jess, mp3, mtrt, jbyte int, jbyte fp, jpeg, euler, moldyn, search, and scimark2]

Page 25: Performance on Short Running Benchmarks

Compilation time is proportionally larger relative to execution time
A fast optimizing compiler can compete against lazy compilation on total run time

[Figure: total run time normalized to microJIT (0.0-2.0), broken into native, interpret, and compile components, for Sun-server, LaTTe, Sun-client, microJIT, and Sun-intrp on compress, db, jess, mp3, mtrt, jlex, richards, deltablue, java_cup, and mips_sim]

Page 26: Factors Limiting microJIT Performance

Sun-client and Sun-server support speculative inlining:
  Inline non-final public virtual and interface calls that have only one target
  Decompile and fix up if class loading adds new targets

Garbage collection overheads are higher in our system, which impacted object-oriented programs

Page 27: Dynamic Memory Usage

The microJIT compiler requires 2x the memory of Sun-client, but less than ¼ that of the dataflow compilers
250KB is sufficient to compile a 1KB method
Memory requirements for large methods can be reduced by building the DFG and generating code for only subsections of the CFG per pass
A 300KB native code buffer is sufficient for the largest benchmark applications (pizza compiler and jpeg2000)

Page 28: Outline

  Motivating Problem
  Compiler Design
  Performance Results
  Conclusions

Page 29: Conclusions

Proposed a Java dynamic compilation scheme for embedded devices:
  Compile all code
  A fast compiler that performs aggressive optimizations

Results show the potential of this approach:
  Small dynamic and static memory footprint
  Good compilation speed and generated-code performance

Possible improvements:
  Memory usage and compilation performance on large methods
  Additional optimizations, e.g. aggressive removal of array bounds checks from loops
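The bounds-check removal mentioned as future work can be illustrated at the Java source level. This is a hedged sketch of the transformation a compiler would perform internally, not a claim about microJIT; method names are made up.

```java
// Source-level view of loop bounds-check removal: one explicit range
// test up front lets the compiler prove every access in the loop body
// is in range, so the per-iteration checks can be elided.
public class BoundsHoist {
    // Naive form: the JVM conceptually checks 0 <= i < a.length on
    // every a[i] access inside the loop.
    public static int sumChecked(int[] a, int n) {
        int s = 0;
        for (int i = 0; i < n; i++) s += a[i];   // implicit check each iteration
        return s;
    }

    // Transformed form: a single guard before the loop establishes
    // n <= a.length, making every a[i] provably in bounds.
    public static int sumHoisted(int[] a, int n) {
        if (n < 0 || n > a.length) throw new ArrayIndexOutOfBoundsException(n);
        int s = 0;
        for (int i = 0; i < n; i++) s += a[i];   // no per-iteration check needed
        return s;
    }
}
```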