Stanford University, JVM '02, August 2, 2002
Targeting Dynamic Compilation for Embedded Systems
Michael Chen, Kunle Olukotun
Computer Systems Laboratory, Stanford University


TRANSCRIPT

Page 1: Targeting Dynamic Compilation for Embedded Systems

Michael Chen, Kunle Olukotun
Computer Systems Laboratory, Stanford University

Page 2: Outline

  Motivating Problem
  Compiler Design
  Performance Results
  Conclusions

Page 3: Challenges of Running Java on Embedded Devices

J2ME (Micro Edition) on CDC (Connected Device Configuration):
  PDAs, thin clients, and high-end cellphones
  Highly resource constrained:
    30MHz - 200MHz embedded processors
    2MB - 32MB RAM
    < 4MB ROM

Differences from running Java on desktop machines:
  Satisfying performance requirements is difficult with slower processors
  Virtual machine footprint matters
  Limited dynamic memory available for the runtime system

[Figure: Java platform spectrum from embedded to server: J2ME/CLDC and J2ME/CDC (embedded), J2SE (desktop), J2EE (server)]

Page 4: Java Execution Models

Interpretation:
  Decode and execute bytecodes in software
  Incurs a high performance penalty

Fast code generators:
  Dynamic compilation without aggressive optimization
  Sacrifices code quality for compilation speed

Lazy compilation:
  Interpret bytecodes, then translate frequently executed methods with an optimizing compiler
  Adds complexity, and the total ROM footprint of interpreter + compiler is large

Alternative approach?

Page 5: microJIT: An Efficient Optimizing Compiler

Minimize major compiler passes while optimizing aggressively:
  Perform several optimizations concurrently
  Pipeline information from one pass to drive optimizations in subsequent passes

Budget overheads for dataflow analysis:
  Efficient implementations of straightforward optimizations
  Good heuristics for difficult optimizations

Manage compiler dynamic memory requirements:
  Efficient dataflow representation

Page 6: Using microJIT in Embedded Systems Configuration

Compile everything to native code. Potential advantages over other execution models:
  Lower total system cost: multiple execution engines require more ROM
  Reduced complexity: only one compiler to maintain
  Doesn't sacrifice long- or short-running performance: generates fast code while minimizing overheads

Page 7: Outline

  Motivating Problem
  Compiler Design
  Performance Results
  Conclusions

Page 8: microJIT Compiler Overview

Three passes, each feeding information forward:
  1. CFG Construction -> locals & field accesses, loop identification
  2. DFG Generation -> IR expression optimizations, IR expression use counts, dataflow information
  3. Native Code Generation -> register allocator, machine idioms, instruction scheduler

ISA-dependent inputs to code generation: register reservations, assembler macros, instruction delays

Page 9: Pass 1: CFG Construction

Quickly scan bytecodes in one pass:
  Partially decode bytecodes to extract desired information
  Decompose the method into extended basic blocks (EBBs)
  Build blocks and arcs as branches and targets are encountered

Compute block-level dataflow information:
  Identify loops
  Record local and field accesses for blocks and loops
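The block-boundary part of such a one-pass scan can be sketched as follows. This is a hedged illustration, not microJIT's code: the opcodes and one-byte absolute-target encoding are simplified stand-ins for real JVM bytecodes.

```java
import java.util.*;

// Sketch of a one-pass "leader" scan: every branch target and every
// fall-through point after a branch starts a new basic block.
// Opcodes here are hypothetical, not real JVM bytecodes.
public class LeaderScan {
    static final int OP_NOP = 0;     // 1 byte
    static final int OP_GOTO = 1;    // opcode + 1-byte absolute target
    static final int OP_IFZERO = 2;  // opcode + 1-byte absolute target

    // Returns the sorted set of offsets that begin a basic block.
    public static SortedSet<Integer> findLeaders(int[] code) {
        SortedSet<Integer> leaders = new TreeSet<>();
        leaders.add(0);                       // method entry is always a leader
        int pc = 0;
        while (pc < code.length) {
            switch (code[pc]) {
                case OP_GOTO:
                case OP_IFZERO:
                    leaders.add(code[pc + 1]);   // branch target starts a block
                    if (pc + 2 < code.length)
                        leaders.add(pc + 2);     // fall-through starts a block
                    pc += 2;
                    break;
                default:
                    pc += 1;
            }
        }
        return leaders;
    }
}
```

A real pass 1 would also record arcs between the blocks and the local/field access statistics as it goes, in the same single sweep.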

Page 10: Pass 2: DFG Generation

Intermediate representation (IR):
  Closer to machine instructions than bytecodes (LIR)
  Triples representation: unnamed destination
  Source arguments are pointers to other IR expression nodes
  Complex bytecodes decompose into several IR expressions

Example (computes -(L0 + 1)):
  [L0]
  [1] const 1
  [2] add [1] [L0]
  [3] neg [2]
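A triples node as described above can be sketched in a few lines. This is an illustrative data structure, not microJIT's implementation; the operator names and the `eval` helper (used only to check the example) are assumptions.

```java
// Minimal triples-style IR node: an operator plus pointers to source
// nodes; the destination is the node itself (unnamed).
public class Triple {
    final String op;      // "local", "const", "add", "neg"
    final int value;      // constant value or local index; 0 otherwise
    final Triple[] args;  // pointers to source IR expression nodes

    Triple(String op, int value, Triple... args) {
        this.op = op; this.value = value; this.args = args;
    }

    // Evaluate the expression tree (for illustration only; a compiler
    // would emit machine code instead of interpreting the DAG).
    public int eval(int[] locals) {
        switch (op) {
            case "local": return locals[value];
            case "const": return value;
            case "add":   return args[0].eval(locals) + args[1].eval(locals);
            case "neg":   return -args[0].eval(locals);
            default:      throw new IllegalStateException(op);
        }
    }

    // The four-node example from the slide: [3] neg ([2] add ([1] const 1, [L0]))
    public static Triple exampleFromSlide() {
        Triple l0  = new Triple("local", 0);         // [L0]
        Triple one = new Triple("const", 1);         // [1] const 1
        Triple add = new Triple("add", 0, one, l0);  // [2] add [1] [L0]
        return new Triple("neg", 0, add);            // [3] neg [2]
    }
}
```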

Page 11: Pass 2: DFG Generation -- Block-local Optimizations

Maintain a mimic stack when translating into IR expressions:
  Manipulate pointers in place of locals and stack accesses, which do not generate IR expressions
  Immediately eliminates copy expressions

Optimizations are immediately applied to newly created IR expressions:
  Check source arguments for constant propagation and algebraic simplifications
  Search backwards in the EBB for an available matching expression (CSE)

Example: L0.count++;

  bytecode (bpc): 0 aload_0, 1 dup, 2 getfield count, 4 iconst_1, 5 iadd, 6 putfield count

  IR expressions:
    [L0]
    [1] load @ [L0]+16
    [2] const 1
    [3] add [1] [2]
    [4] store [3] @ [L0]+16
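The mimic-stack idea can be sketched as below: stack slots hold pointers to IR nodes, `dup` is pure pointer manipulation with no IR emitted, and folding is applied the moment a new expression would be created. All names are illustrative assumptions, not microJIT's own.

```java
import java.util.*;

// Sketch of a mimic (abstract operand) stack with immediate constant
// folding, in the spirit of the block-local optimizations above.
public class MimicStack {
    public record Node(String op, int value, Node left, Node right) {}

    private final Deque<Node> stack = new ArrayDeque<>();

    public void pushConst(int c) { stack.push(new Node("const", c, null, null)); }

    // dup just duplicates the pointer -- no copy expression is generated.
    public void dup() { stack.push(stack.peek()); }

    // iadd: fold immediately if both operands are constants; otherwise
    // create an "add" IR node referencing its sources.
    public void iadd() {
        Node b = stack.pop(), a = stack.pop();
        if (a.op().equals("const") && b.op().equals("const"))
            stack.push(new Node("const", a.value() + b.value(), null, null));
        else
            stack.push(new Node("add", 0, a, b));
    }

    public Node top() { return stack.peek(); }
}
```

A full translator would also consult the EBB's existing expressions here for CSE before creating a new node.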

Page 12: Pass 2: DFG Generation -- Global Optimizations

Global optimizations are also immediately applied to newly created IR expressions:
  Global forward-flow information is available for every new IR expression
  Blocks are processed in reverse post-order (predecessors first)
  Loop field and locals access statistics from the previous pass are used to calculate a fixed-point solution at the loop header
  Restricted to dataflow optimizations that rely primarily on forward-flow information
  Global constant propagation, copy propagation, and CSE

[Figure: example CFG (blocks B1-B7) with a loop locals access table:
  local  LD  ST
  L0     T   F
  L1     T   T]
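The reverse post-order traversal mentioned above (predecessors before successors, for reducible flow graphs) is standard and can be sketched directly. The adjacency-list CFG representation is a stand-in for microJIT's own structures.

```java
import java.util.*;

// Sketch of reverse post-order block numbering: depth-first search,
// record blocks in post-order, then reverse the list.
public class RpoOrder {
    public static List<Integer> reversePostOrder(List<List<Integer>> succ, int entry) {
        List<Integer> post = new ArrayList<>();
        boolean[] seen = new boolean[succ.size()];
        dfs(succ, entry, seen, post);
        Collections.reverse(post);   // reversed post-order = RPO
        return post;
    }

    private static void dfs(List<List<Integer>> succ, int b, boolean[] seen, List<Integer> post) {
        if (seen[b]) return;
        seen[b] = true;
        for (int s : succ.get(b)) dfs(succ, s, seen, post);
        post.add(b);   // emitted after all successors
    }
}
```

Visiting blocks in this order guarantees that, outside of loop back-edges, every predecessor's forward-flow facts are already available when a block is translated.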

Page 13: Pass 2: DFG Generation -- Loop Invariant Code Motion

Check loop statistics to make sure source arguments are not redefined in the loop
Can perform code motion on dependent instructions without iterating
Hoisted IR expressions are immediately communicated to successive instructions and blocks in the loop

[Figure: loop with preheader (PH), header (H), and exit (E). The loop locals access table shows L0 and L1 are loaded (LD = T) but never stored (ST = F), so the expressions
  [1] add [L0] [L1]
  [2] const 1
  [3] sub [1] [2]
can be hoisted to the preheader, with [1] and [3] becoming globals [G0] and [G1]]
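The hoisting test described above reduces to a lookup in the per-loop access table from pass 1: an expression over locals is invariant if none of the locals it reads is stored anywhere in the loop. A hedged sketch, with illustrative names:

```java
// Sketch of the loop-invariance test: consult the loop's locals access
// table (computed in the CFG pass) instead of iterating over the loop body.
public class HoistCheck {
    // storedInLoop[i] is true if local i is written somewhere in the loop
    // (the ST column of the slide's access table).
    public static boolean hoistable(int[] localsRead, boolean[] storedInLoop) {
        for (int l : localsRead)
            if (storedInLoop[l]) return false;   // redefined in loop: keep it
        return true;
    }
}
```

Because the answer comes from precomputed statistics, dependent expressions (like [3] sub [1] [2] above, which reads an already-hoisted [1]) can be hoisted in the same sweep without a fixed-point iteration.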

Page 14: Pass 2: DFG Generation -- Inlining

Optimized for small methods
Handles nested inlining: important for object initializers with deep sub-classing
Can inline non-final public virtual and interface methods when only one target is found at runtime, protected with a class check
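The class-check guard can be rendered at the Java source level as below. This is a conceptual illustration of the generated-code shape, not microJIT's output; `Shape`, `Circle`, and `areaGuarded` are made-up names for the sketch.

```java
// Illustrative shape of guarded inlining: a virtual call whose only
// loaded target is Circle.area() is inlined behind a cheap class check,
// falling back to normal virtual dispatch if the check fails.
public class GuardedInline {
    interface Shape { double area(); }

    static final class Circle implements Shape {
        final double r;
        Circle(double r) { this.r = r; }
        public double area() { return Math.PI * r * r; }
    }

    // What compiled code for "s.area()" conceptually becomes when Circle
    // is the only implementor found at compile time:
    public static double areaGuarded(Shape s) {
        if (s.getClass() == Circle.class) {
            // inlined body of Circle.area() -- no virtual dispatch
            double r = ((Circle) s).r;
            return Math.PI * r * r;
        }
        return s.area();   // slow path: ordinary interface call
    }
}
```

The guard is what makes the inlining safe without the decompile-and-fix machinery that the HotSpot compilers use (page 26).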

Page 15: Pass 3: Code Generation

Registers allocated dynamically as code is generated
Instruction scheduling within a basic block:
  Uses standard list scheduling techniques
  Fills load and branch delay slots
Successfully ported to three different ISAs:
  MIPS, SPARC, StrongARM
  Ports took only a few weeks to implement
  Plans to port to x86

Page 16: Pass 3: Code Generation -- Fast Optimization of Machine Idioms

Traditionally done using a peephole optimizer, which requires an additional pass over generated code

Compiler features allow optimization of machine idioms without an additional pass:
  Machine-specific code can be invoked in the two passes
  Configurable IR expressions
  Deferred code generation of IR expressions

Optimized machine idioms:
  Register calling conventions
  Mapping branch implementations
  Immediate operands
  Different addressing modes

Page 17: Pass 3: Code Generation -- Code Generation Example

[Figure: worked example tracking IR expressions through DFG generation and code generation. Each IR expression carries {block, global} use counts and last-use information; counts are decremented as expressions are consumed, and a register is freed (F) at an operand's last use, otherwise stays live (N).]

  id  IR expression                 generated SPARC code
  [L0]                              (in %l0)
  [1]  load @ [L0]+16               ldw [%l0+16],%o1
  [2]  const 5                      mov 5,%o0
  [3]  const &newarray
  [4]  call [3] ([2] [1]) [L1]      mov %o1,%l1 ; call newarray
  [5]  const 1
  [6]  add [1] [5]                  add %l1,1,%g1
  [7]  store [6] @ [L0]+16          stw %g1,[%l0+16]

  Register conventions: %ln -- call-preserved; %on -- argument; %gn -- temporary

Page 18: Pass 3: Code Generation -- Global Register Allocation

[Figure: CFG with blocks B0-B5 and join points J0-J2 (e.g. J0: out of B0, into B1 and B2; J1: out of B1 and B3, into B3 and B4; J2: out of B2 and B4, into B5). At each join, the outgoing registers of predecessor blocks are reserved as incoming registers for the successors, so the one-pass allocator can keep values in registers across block boundaries.]

Page 19: Outline

  Motivating Problem
  Compiler Design
  Performance Results
  Conclusions

Page 20: Experiment Setup

SPARC VMs chosen for comparison:
  Large number of VMs with source code available (required for timing and memory-use instrumentation)
  Neutral RISC ISA
  No embedded JITs available for comparison

Variety of benchmarks chosen:
  Benchmark suites -- SPECjvm98, Java Grande, jBYTEmark
  Other significant applications -- MipsSimulator, h263 Decoder, jLex, jpeg2000

Page 21: Comparisons to Other Dynamic Compilers

  JIT                        Sun-Client      Sun-Server       SNU LaTTe        microJIT
  Intermediate repr.         Simple          SSA dataflow     Dataflow         Dataflow
  Major compiler passes      4               Iterative        7                3-4
  Register allocation        1-pass dynamic  Graph coloring   2-pass dynamic   1-pass dynamic
  Virtual machine            HotSpot         HotSpot          Kaffe            Kaffe
  Compiler size (stripped)   700KB           1.5MB            325KB            200KB
  Interpreter size           220KB           220KB            65KB             None

  Optimizations:
    Sun-Client: block merging/elimination, simple constant propagation, inlining & specialization
    Sun-Server: loop invariant code motion, global value numbering, conditional constant propagation, inlining & specialization, instruction scheduling
    SNU LaTTe: EBB value numbering, EBB constant propagation, loop invariant code motion, dead code elimination, inlining & specialization, instruction scheduling
    microJIT: CSE, copy propagation, constant propagation, loop invariant code motion, dead code elimination, inlining & specialization, instruction scheduling

Page 22: Compilation Speed

Measured on an UltraSPARC-II @ 200MHz, Sun Solaris 8:
  30% faster than Sun-client
  2.5x faster than the nearest dataflow compiler (LaTTe)

[Figure: bar chart of compilation speed (bytecodes per 1k cycles, 0.00-0.30) vs. method bytecode size (<50B, 50B-250B, 250B-1KB, 1KB-5KB, >5KB, average) for Sun-server, LaTTe, Sun-client, and microJIT]

Page 23: Time Spent in Each Compiler Pass

CFG construction is consistently < 10% of compile time
DFG generation grows in proportion for large methods (CSE time grows with increasing code size)
Code generation time for large methods can be improved by limiting optimizations whose costs grow with method size

[Figure: stacked bar chart of compilation time (0-100%) split among CFG generation, DFG generation, and code generation, vs. method bytecode size (<50B, 50B-250B, 250B-1KB, 1KB-5KB, >5KB, average)]

Page 24: Performance on Long Running Benchmarks

Compilation time is proportionally smaller relative to execution time
Collected times also include the Sun interpreter
Good performance on numerical programs; performance suffers on object-oriented code

[Figure: speedup normalized to microJIT (0.0-2.0) for Sun-server, LaTTe, Sun-client, microJIT, and Sun-intrp on compress, db, jess, mp3, mtrt, jbyte int, jbyte fp, jpeg, euler, moldyn, search, and scimark2]

Page 25: Performance on Short Running Benchmarks

Compilation time is proportionally larger relative to execution time
A fast optimizing compiler can compete against lazy compilation on total run time

[Figure: total run time normalized to microJIT (0.0-2.0), broken into native, interpret, and compile components, for Sun-server, LaTTe, Sun-client, microJIT, and Sun-intrp on compress, db, jess, mp3, mtrt, jlex, richards, deltablue, java_cup, and mips_sim]

Page 26: Factors Limiting microJIT Performance

Sun-client and Sun-server support speculative inlining:
  Inline non-final public virtual and interface calls that have only one target
  Decompile and fix up if class loading adds new targets

Garbage collection overheads are higher in our system, which impacted object-oriented programs

Page 27: Dynamic Memory Usage

The microJIT compiler requires 2x the memory of Sun-client, but less than ¼ that of the dataflow compilers
250KB is sufficient to compile a 1KB method
Memory requirements for large methods can be reduced by building the DFG and generating code for only subsections of the CFG per pass
A 300KB native code buffer is sufficient for the largest benchmark applications (pizza compiler and jpeg2000)

Page 28: Outline

  Motivating Problem
  Compiler Design
  Performance Results
  Conclusions

Page 29: Conclusions

Proposed a Java dynamic compilation scheme for embedded devices:
  Compile all code
  A fast compiler that performs aggressive optimizations

Results show the potential of this approach:
  Small dynamic and static memory footprint
  Good compilation speed and generated-code performance

Possible improvements:
  Memory usage and compilation performance on large methods
  Additional optimizations, e.g. aggressive removal of array bounds checks from loops
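The bounds-check removal mentioned as future work can be illustrated at the Java source level. This is a hedged sketch of the transformation a compiler would perform internally, not a claim about microJIT; method names are made up.

```java
// Source-level view of loop bounds-check removal: one explicit range
// test up front lets the compiler prove every access in the loop body
// is in range, so the per-iteration checks can be elided.
public class BoundsHoist {
    // Naive form: the JVM conceptually checks 0 <= i < a.length on
    // every a[i] access inside the loop.
    public static int sumChecked(int[] a, int n) {
        int s = 0;
        for (int i = 0; i < n; i++) s += a[i];   // implicit check each iteration
        return s;
    }

    // Transformed form: a single guard before the loop establishes
    // n <= a.length, making every a[i] provably in bounds.
    public static int sumHoisted(int[] a, int n) {
        if (n < 0 || n > a.length) throw new ArrayIndexOutOfBoundsException(n);
        int s = 0;
        for (int i = 0; i < n; i++) s += a[i];   // no per-iteration check needed
        return s;
    }
}
```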