Targeting Dynamic Compilation for Embedded Systems

Michael Chen, Kunle Olukotun
Computer Systems Laboratory, Stanford University
JVM '02, August 2, 2002
Outline
- Motivating Problem
- Compiler Design
- Performance Results
- Conclusions
Challenges of Running Java on Embedded Devices
- J2ME (Micro Edition) on CDC (Connected Device Configuration)
  - PDAs, thin clients, and high-end cellphones
  - Highly resource constrained: 30MHz-200MHz embedded processors, 2MB-32MB RAM, < 4MB ROM
- Differences from running Java on desktop machines
  - Satisfying performance requirements is difficult with slower processors
  - Virtual machine footprint matters
  - Limited dynamic memory available for the runtime system

[Figure: Java platform spectrum, from J2ME/CLDC and J2ME/CDC on embedded devices to J2SE on the desktop and J2EE on servers]
Java Execution Models
- Interpretation
  - Decode and execute bytecodes in software
  - Incurs a high performance penalty
- Fast code generators
  - Dynamic compilation without aggressive optimization
  - Sacrifices code quality for compilation speed
- Lazy compilation
  - Interpret bytecodes, and translate frequently executed methods with an optimizing compiler
  - Adds complexity, and the total ROM footprint of interpreter + compiler is large
- Alternative approach?
microJIT: An Efficient Optimizing Compiler
- Minimize major compiler passes while optimizing aggressively
  - Perform several optimizations concurrently
  - Pipeline information from one pass to drive optimizations in subsequent passes
- Budget overheads for dataflow analysis
  - Efficient implementations of straightforward optimizations
  - Use good heuristics for difficult optimizations
- Manage compiler dynamic memory requirements
  - Efficient dataflow representation
Using microJIT in Embedded Systems
- Configuration: compile everything to native code
- Potential advantages over other execution models
  - Lower total system cost: multiple execution engines require more ROM
  - Reduced complexity: only need to maintain one compiler
  - Doesn't sacrifice long- or short-running performance: generates fast code while minimizing overheads
microJIT Compiler Overview

[Figure: three-pass pipeline, CFG Construction to DFG Generation to Native Code Generation. Dataflow information is pipelined forward: locals & field accesses and loop identification from CFG construction feed the IR expression optimizations in DFG generation; IR expression use counts feed the register allocator, machine idioms, and instruction scheduler in code generation. ISA-dependent inputs: register reservations, assembler macros, instruction delays.]
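A minimal sketch of how this three-pass structure might be organized; all class and method names below are illustrative assumptions, not microJIT's actual interfaces.

    // Hypothetical sketch of the three-pass pipeline; names are illustrative.
    public class PipelineSketch {
        static class Cfg { /* blocks, arcs, loop info, access statistics */ }
        static class Dfg { /* per-block IR expression lists, use counts */ }

        // Pass 1: one scan over the bytecodes builds blocks and arcs and
        // records per-block locals/field accesses and loop membership.
        static Cfg buildCfg(byte[] bytecode) { return new Cfg(); }

        // Pass 2: translate bytecodes to IR, optimizing each new expression
        // immediately using the summaries pipelined forward from pass 1.
        static Dfg buildDfg(Cfg cfg) { return new Dfg(); }

        // Pass 3: allocate registers, schedule, and emit native code in one
        // pass, driven by the IR use counts pipelined forward from pass 2.
        static byte[] emitNative(Dfg dfg) { return new byte[0]; }

        public static byte[] compile(byte[] bytecode) {
            return emitNative(buildDfg(buildCfg(bytecode)));
        }
    }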
Pass 1: CFG Construction
- Quickly scan bytecodes in one pass
  - Partially decode bytecodes to extract desired information
- Decompose the method into extended basic blocks (EBBs)
  - Build blocks and arcs as branches and targets are encountered
- Compute block-level dataflow information
  - Identify loops
  - Record local and field accesses for blocks and loops
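As a rough illustration of the one-pass scan, the sketch below marks block-leader bytecode indices at branch targets and fall-throughs. The opcode tests and instruction lengths are simplified stand-ins for a full decode table, not the paper's implementation.

    import java.util.BitSet;

    // Hypothetical sketch of a single-pass block-boundary scan.
    public class BlockBoundaryScan {
        // Marks bytecode indices that start a new (extended) basic block:
        // branch targets and the instructions following branches.
        static BitSet findLeaders(byte[] code) {
            BitSet leaders = new BitSet(code.length);
            leaders.set(0);
            int pc = 0;
            while (pc < code.length) {
                int opcode = code[pc] & 0xff;
                int len = instructionLength(opcode);  // partial decode only
                if (isBranch(opcode)) {
                    leaders.set(pc + readBranchOffset(code, pc)); // target
                    leaders.set(pc + len);            // fall-through successor
                }
                pc += len;
            }
            return leaders;
        }

        // Simplified stand-ins; a real scanner tables these per opcode.
        static boolean isBranch(int op) { return op >= 0x99 && op <= 0xa7; } // if*/goto
        static int instructionLength(int op) { return isBranch(op) ? 3 : 1; }
        static int readBranchOffset(byte[] c, int pc) {
            return (short) (((c[pc + 1] & 0xff) << 8) | (c[pc + 2] & 0xff));
        }
    }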
Pass 2: DFG Generation
- Intermediate representation (IR)
  - Closer to machine instructions than to bytecodes (a low-level IR)
  - Triples representation: unnamed destination
  - Source arguments are pointers to other IR expression nodes
  - Complex bytecodes decompose into several IR expressions
- Example IR for -(1 + L0):

    [L0]
    [1] const 1
    [2] add [1] [L0]
    [3] neg [2]
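A triples node might be represented as below; the field names are assumptions for illustration, and the example from this slide is built at the bottom.

    // Hypothetical sketch of a triples-style IR node: the destination is
    // unnamed (the node itself), and sources point at other nodes.
    public class IrExpr {
        final String op;           // e.g. "const", "add", "neg", "load"
        final IrExpr[] args;       // pointers to source expressions
        final int value;           // payload for constants/locals/offsets
        int blockUses, globalUses; // use counts consumed during codegen

        IrExpr(String op, int value, IrExpr... args) {
            this.op = op;
            this.value = value;
            this.args = args;
        }

        public static void main(String[] unused) {
            // Builds the slide's example: -(1 + L0)
            IrExpr l0  = new IrExpr("local", 0);
            IrExpr one = new IrExpr("const", 1);
            IrExpr add = new IrExpr("add", 0, one, l0);
            IrExpr neg = new IrExpr("neg", 0, add);
        }
    }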
Block-local Optimizations
- Maintain a mimic stack when translating into IR expressions
  - Manipulate pointers in place of locals and stack accesses, which do not generate IR expressions
  - Immediately eliminates copy expressions
- Optimizations immediately applied to newly created IR expressions
  - Check source arguments for constant propagation and algebraic simplifications
  - Search backwards in the EBB for an available matching expression (CSE)
Pass 2: DFG Generation example. Java source: L0.count++;

    bpc  bytecode          id    IR expression
    0    aload_0           [L0]
    1    dup               [1]   load @ [L0]+16
    2    getfield count    [2]   const 1
    4    iconst_1          [3]   add [1] [2]
    5    iadd              [4]   store [3] @ [L0]+16
    6    putfield count
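A sketch of how the mimic stack might drive this translation, reusing the IrExpr node sketched earlier. The constant-folding check illustrates optimizations being applied to each new expression as it is created; method names and field layout are illustrative.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical mimic-stack translator for the L0.count++ example.
    public class MimicStackSketch {
        final Deque<IrExpr> stack = new ArrayDeque<>();
        final IrExpr[] locals = new IrExpr[8];

        void aload(int n)  { stack.push(locals[n]); }    // no IR emitted
        void dup()         { stack.push(stack.peek()); } // copy eliminated:
                                                         // just a pointer
        void iconst(int v) { stack.push(new IrExpr("const", v)); }
        void getfield(int off) {
            stack.push(new IrExpr("load", off, stack.pop()));
        }
        void iadd() {
            IrExpr b = stack.pop(), a = stack.pop();
            // Constant folding applied immediately to the new expression.
            if (a.op.equals("const") && b.op.equals("const"))
                stack.push(new IrExpr("const", a.value + b.value));
            else
                stack.push(new IrExpr("add", 0, a, b));
        }
        void putfield(int off) {
            IrExpr value = stack.pop(), obj = stack.pop();
            emit(new IrExpr("store", off, obj, value));
        }
        void emit(IrExpr e) { /* append to current block's expression list */ }
    }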
Global Optimizations (Pass 2: DFG Generation)
- Global optimizations are also immediately applied to newly created IR expressions
- Global forward-flow information is available for every new IR expression
  - Blocks processed in reverse post-order (predecessors first), as sketched below
  - Use loop field and locals access statistics from the previous pass to calculate a fixed-point solution at the loop header
- Restricted to dataflow optimizations that rely primarily on forward-flow information
  - Global constant propagation, copy propagation, and CSE

[Figure: example CFG (blocks B1-B7) containing a loop, annotated with its loop locals access table]

    loop locals access table
    local  LD  ST
    L0     T   F
    L1     T   T
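The reverse post-order walk itself is standard; a minimal sketch follows, with Block as an illustrative stand-in for the compiler's block type.

    import java.util.*;

    // Hypothetical sketch of the reverse post-order block walk.
    public class RpoWalk {
        static class Block { List<Block> succs = new ArrayList<>(); }

        // Visiting predecessors before successors means forward dataflow
        // facts (constants, copies, available expressions) are ready when a
        // block is translated; at loop headers, facts for locals the loop's
        // access table marks as stored are killed instead of iterating.
        static List<Block> reversePostOrder(Block entry) {
            List<Block> post = new ArrayList<>();
            dfs(entry, new HashSet<>(), post);
            Collections.reverse(post);  // postorder reversed = RPO
            return post;
        }

        static void dfs(Block b, Set<Block> seen, List<Block> post) {
            if (!seen.add(b)) return;
            for (Block s : b.succs) dfs(s, seen, post);
            post.add(b);
        }
    }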
Loop Invariant Code Motion (Pass 2: DFG Generation)

[Figure: loop with preheader (PH), header (H), and exit (E) blocks; invariant expressions are hoisted into the preheader]

    loop locals access table
    local  LD  ST
    L0     T   F
    L1     T   F

    hoisted IR:  [1] add [L0] [L1]   [2] const 1   [3] sub [1] [2]
    (hoisted results kept in globals: [1] in [G0], [3] in [G1])

- Check loop statistics to make sure source arguments are not redefined in the loop (see the sketch below)
- Can perform code motion on dependent instructions without iterating
- Hoisted IR expressions are immediately communicated to successive instructions and blocks in the loop
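The invariance test might look like the sketch below: an expression hoists if every leaf it depends on is a constant or a local whose ST column is F. IrExpr is the node sketched earlier; the load handling is deliberately conservative in this illustration.

    // Hypothetical sketch of the loop-invariance check.
    public class LicmSketch {
        static boolean isInvariant(IrExpr e, boolean[] loopStoresLocal) {
            switch (e.op) {
                case "const":
                    return true;
                case "local":
                    return !loopStoresLocal[e.value];  // ST column is F
                case "load":
                    // a real check must also consult the loop's field-access
                    // statistics; this sketch refuses to hoist loads
                    return false;
                default:
                    // dependent expressions hoist without iterating: if all
                    // arguments are invariant (or already hoisted), so is e
                    for (IrExpr arg : e.args)
                        if (!isInvariant(arg, loopStoresLocal)) return false;
                    return true;
            }
        }
    }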
Inlining (Pass 2: DFG Generation)
- Optimized for small methods
- Handles nested inlining
  - Important for object initializers with deep sub-classing
- Can inline non-final public virtual and interface methods when only one target is found at runtime
  - Protected with a class check (illustrated below)
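At the source level, the class-check guard amounts to roughly the following transformation. This is a hypothetical illustration (the real guard is emitted in IR, and the inlined body replaces the cast-and-call shown here).

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical source-level view of guarded inlining of a virtual
    // call whose only receiver class loaded so far is ArrayList.
    public class GuardedInlineSketch {
        static int guardedSize(List<?> obj) {
            if (obj.getClass() == ArrayList.class) {
                // inlined body of ArrayList.size() would appear here;
                // the cast-and-call stands in for it in this sketch
                return ((ArrayList<?>) obj).size();
            }
            return obj.size(); // unexpected receiver: virtual dispatch
        }
    }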
Pass 3: Code Generation
- Registers allocated dynamically as code is generated
- Instruction scheduling within a basic block
  - Uses standard list scheduling techniques (sketched below)
  - Fills load and branch delay slots
- Successfully ported to three different ISAs
  - MIPS, SPARC, StrongARM
  - Ports took only a few weeks to implement
  - Plans to port to x86
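List scheduling within a block could look roughly like the sketch below: a greedy scheduler issues ready instructions, preferring longer-latency ones, and stalls through delay cycles. The Insn type and the priority heuristic are illustrative assumptions, and the block's dependence graph is assumed acyclic.

    import java.util.*;

    // Hypothetical sketch of list scheduling within a basic block.
    public class ListSchedulerSketch {
        static class Insn {
            final String text;
            final int latency;  // cycles before dependents may issue
            final List<Insn> deps = new ArrayList<>();
            int readyAt = 0;
            Insn(String text, int latency) { this.text = text; this.latency = latency; }
        }

        static List<Insn> schedule(List<Insn> block) {
            List<Insn> out = new ArrayList<>();
            Set<Insn> done = new HashSet<>();
            int cycle = 0;
            while (out.size() < block.size()) {
                Insn pick = null;
                for (Insn i : block) {
                    if (done.contains(i) || !done.containsAll(i.deps)) continue;
                    if (i.readyAt > cycle) continue;  // still in a delay slot
                    if (pick == null || i.latency > pick.latency) pick = i;
                }
                if (pick == null) { cycle++; continue; } // stall: nothing ready
                out.add(pick);
                done.add(pick);
                cycle++;
                // dependents wait until pick's latency has elapsed
                for (Insn i : block)
                    if (i.deps.contains(pick))
                        i.readyAt = Math.max(i.readyAt, cycle + pick.latency - 1);
            }
            return out;
        }
    }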
Fast Optimization of Machine Idioms (Pass 3: Code Generation)
- Traditionally done using a peephole optimizer
  - Requires an additional pass over the generated code
- Compiler features allow optimization of machine idioms without an additional pass
  - Machine-specific code can be invoked in two passes
  - Configurable IR expressions
  - Deferred code generation of IR expressions
- Optimized machine idioms
  - Register calling conventions
  - Mapping branch implementations
  - Immediate operands
  - Different addressing modes
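Deferred code generation for immediate operands might work along these lines: a "const" IR expression emits nothing by itself, and its consumer decides whether to fold it or materialize it. The emitter below is an illustrative sketch using IrExpr from earlier, not microJIT's code.

    // Hypothetical sketch of immediate-operand folding via deferral.
    public class ImmediateFoldSketch {
        static final int SIMM13_MAX = 4095;  // SPARC 13-bit signed immediate

        static String emitAdd(IrExpr b, String ra, String dest) {
            if (b.op.equals("const") && Math.abs(b.value) <= SIMM13_MAX)
                return "add " + ra + "," + b.value + "," + dest; // folded
            return "add " + ra + "," + materialize(b) + "," + dest;
        }

        // only a non-foldable operand forces the const into a register
        static String materialize(IrExpr e) { /* allocate reg, emit mov */ return "%g1"; }
    }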
Code Generation Example (Pass 3: Code Generation)

DFG generation produces the IR below with {blk,glb} use counts; code generation consumes a use at each reference and frees an expression's register once its counts reach zero.

    id    IR expression              lastuse  {blk,glb}uses  flags
    [L0]                             [7]      {2,0}
    [1]   load @ [L0]+16             [6]      {2,0}          %o1
    [2]   const 5                    [4]      {1,0}          %o0
    [3]   const &newarray            [4]      {1,0}
    [4]   call [3] ([2] [1]) [L1]             {0,1}          %o0
    [5]   const 1                    [6]      {1,0}          imm
    [6]   add [1] [5]                [7]      {1,0}
    [7]   store [6] @ [L0]+16                 {0,0}

regalloc (N = allocate, F = free) and generated code:

    N %l0
    N %o1    ldw [%l0+16],%o1
    N %o0    mov 5,%o0
    N %l1    mov %o1,%l1        (save [1] across the call)
    F %o0    call newarray
    F %o1
    N %g1    add %l1,1,%g1
    F %l1
    F %g1    stw %g1,[%l0+16]
    F %l0

Register conventions: %ln = call-preserved reg, %on = argument reg, %gn = temp reg
Global Register Allocation (Pass 3: Code Generation)

[Figure: CFG with blocks B0-B5 and join points between them: J0 (out of B0; into B1, B2), J1 (out of B1, B3; into B3, B4), J2 (out of B2, B4; into B5). Outgoing registers are reserved at each join point.]
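The figure's "reserve outgoing registers" step might work along the lines of this sketch: the first predecessor to finish records where its live values sit at the join, and later predecessors and the successors conform to that reservation. This is an assumption-laden illustration, not the paper's algorithm.

    import java.util.*;

    // Hypothetical sketch of register reservation at join points.
    public class JoinReserveSketch {
        static class Join {
            // local index -> register reserved for it across this join
            final Map<Integer, String> reserved = new HashMap<>();
        }

        // Called at the exit of each predecessor block (the "Out" side).
        static void reserveOutgoing(Join j, Map<Integer, String> liveAssignments) {
            for (Map.Entry<Integer, String> e : liveAssignments.entrySet())
                j.reserved.putIfAbsent(e.getKey(), e.getValue());
            // later predecessors must move values into the reserved registers
        }

        // Called at the entry of each successor block (the "In" side).
        static Map<Integer, String> incomingAssignments(Join j) {
            return new HashMap<>(j.reserved);
        }
    }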
Experiment Setup
- SPARC VMs chosen for comparison
  - Large number of VMs with source code available, required for timing and memory-use instrumentation
  - Neutral RISC ISA
  - No embedded JITs available for comparison
- Variety of benchmarks chosen
  - Benchmark suites: SPECjvm98, Java Grande, jBYTEmark
  - Other significant applications: MipsSimulator, h263 Decoder, jLex, jpeg2000
Comparisons to Other Dynamic Compilers

                           Sun-client      Sun-server      SNU LaTTe       microJIT
    Intermediate repr.     Simple          SSA dataflow    Dataflow        Dataflow
    Major compiler passes  4               Iterative       7               3-4
    Register allocation    1-pass dynamic  Graph coloring  2-pass dynamic  1-pass dynamic
    Virtual machine        HotSpot         HotSpot         Kaffe           Kaffe
    Compiler size          700KB           1.5MB           325KB           200KB
      (stripped object)
    Interpreter size       220KB           220KB           65KB            None
      (stripped object)

Optimizations:
- Sun-client: block merging/elimination, simple constant propagation, inlining & specialization, loop invariant code motion
- Sun-server: global value numbering, conditional constant propagation, inlining & specialization, instruction scheduling
- SNU LaTTe: EBB value numbering, EBB constant propagation, loop invariant code motion, dead code elimination, inlining & specialization, instruction scheduling
- microJIT: CSE, copy propagation, constant propagation, loop invariant code motion, dead code elimination, inlining & specialization, instruction scheduling
Compilation Speed
- Measured on an UltraSPARC-II @ 200MHz, Sun Solaris 8
- 30% faster than Sun-client
- 2.5x faster than the nearest dataflow compiler (LaTTe)

[Chart: compilation speed (bytecodes per 1k cycles) by method bytecode size (<50B, 50B-250B, 250B-1KB, 1KB-5KB, >5KB, average) for Sun-server, LaTTe, Sun-client, and microJIT]
Time Spent in Each Compiler Pass

[Chart: fraction of compilation time in CFG generation, DFG generation, and code generation, by method bytecode size (<50B, 50B-250B, 250B-1KB, 1KB-5KB, >5KB, average)]

- CFG construction is consistently < 10% of compile time
- DFG generation grows in proportion for large methods
  - CSE time grows with increasing code size
- Code generation time for large methods can be improved by limiting optimizations whose costs grow with method size
Performance on Long Running Benchmarks
- Compilation time is proportionally smaller relative to execution time
- Collected times also include the Sun interpreter
- Good performance on numerical programs
- Performance suffers on object-oriented code

[Chart: speedup normalized to microJIT for compress, db, jess, mp3, mtrt, jbyte int, jbyte fp, jpeg, euler, moldyn, search, and scimark2, comparing Sun-server, LaTTe, Sun-client, microJIT, and the Sun interpreter]
Performance on Short Running Benchmarks
- Compilation time is proportionally larger relative to execution time
- A fast optimizing compiler can compete against lazy compilation on total run time

[Chart: speedup normalized to microJIT, with time split into native execution, interpretation, and compilation, for compress, db, jess, mp3, mtrt, jlex, richards, deltablue, java_cup, and mips_sim, comparing Sun-server, LaTTe, Sun-client, microJIT, and the Sun interpreter]
Factors Limiting microJIT Performance
- Sun-client and Sun-server support speculative inlining
  - Inline non-final public virtual and interface calls that have only one target
  - Decompile and fix up if class loading adds new targets
- Garbage collection overheads are higher for our system
  - Impacts object-oriented programs
Dynamic Memory Usage
- The microJIT compiler requires 2x the memory of Sun-client, but less than 1/4 that of the dataflow compilers
- 250KB is sufficient to compile a 1KB method
- Memory requirements for compiling large methods can be reduced by building the DFG and generating code for only subsections of the CFG per pass
- A 300KB native code buffer is sufficient for the largest benchmark applications (pizza compiler and jpeg2000)
Conclusions
- Proposed a Java dynamic compilation scheme for embedded devices
  - Compile all code
  - Fast compiler that performs aggressive optimizations
- Results show the potential of this approach
  - Small dynamic and static memory footprint
  - Good compilation speed and generated code performance
- Possible improvements
  - Memory usage and compilation performance on large methods
  - Implement additional optimizations, such as aggressive removal of array bounds checks from loops