
Research supported by IBM CAS, NSERC, CITO

Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters

Marc Berndl

Benjamin Vitale

Mathew Zaleski

Angela Demke Brown


Interpreter performance

•Why not just JIT?

•High performance JVMs still interpret

•People use interpreted languages that don’t yet have JITs

•They still want performance!

•30-40% of execution time is due to branch misprediction

•Our technique eliminates 95% of branch mispredictions


Overview

Motivation

•Background: The Context Problem

•Existing Solutions

•Our Approach

•Inlining

•Results


A Tale of Two Machines

[Diagram: two machines side by side. Virtual Machine Interpreter: virtual program, load, loaded program, bytecode bodies, execution cycle. Real Machine CPU: pipeline, execution cycle, predictors for return address, wayness (conditional), and target address (indirect).]


Interpreter

[Diagram: the interpreter's execution cycle (fetch, dispatch, load parms, execute) runs over the loaded program, an internal representation, and the bytecode bodies.]


Running Java Example

Java source:

    void foo(){
        int i=1;
        do{
            i+=i;
        } while(i<64);
    }

Java bytecode (produced by the javac compiler):

     0: iconst_1
     1: istore_1
     2: iload_1
     3: iload_1
     4: iadd
     5: istore_1
     6: iload_1
     7: bipush 64
     9: if_icmplt 2
    12: return


Switched Interpreter

    while(1){
        opcode = *vPC++;
        switch(opcode){
        case iload_1:
            ..
            break;
        case iadd:
            ..
            break;
        //and many more..
        }
    };

Slow: burdened by switch and loop overhead.
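The switched loop above can be made concrete with a minimal C sketch that runs the foo() example, starting i at 1 as in the Java source. The opcode numbering, the one-byte absolute branch target, and the names `run_switched` and `foo_code` are assumptions made for illustration; none of them is the real JVM or SableVM encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode numbering for this sketch; not the real JVM encoding. */
enum { OP_ICONST_1, OP_ISTORE_1, OP_ILOAD_1, OP_IADD,
       OP_BIPUSH, OP_IF_ICMPLT, OP_RETURN };

/* foo() from the running example. The operand of OP_IF_ICMPLT is an
   absolute index into this array (an assumption; the JVM uses a
   two-byte relative offset). */
static const int8_t foo_code[] = {
    OP_ICONST_1, OP_ISTORE_1,                        /* i = 1          */
    OP_ILOAD_1, OP_ILOAD_1, OP_IADD, OP_ISTORE_1,    /* i += i         */
    OP_ILOAD_1, OP_BIPUSH, 64, OP_IF_ICMPLT, 2,      /* while (i < 64) */
    OP_RETURN
};

enum { STACK_MAX = 8 };

/* Interprets code and returns the final value of local variable 1. */
static int run_switched(const int8_t *code) {
    int stack[STACK_MAX];
    int sp = 0;                 /* index of the next free stack slot */
    int local1 = 0;
    const int8_t *vPC = code;

    while (1) {
        int opcode = *vPC++;    /* fetch */
        switch (opcode) {       /* dispatch: one shared branch for all opcodes */
        case OP_ICONST_1:
            assert(sp < STACK_MAX);
            stack[sp++] = 1;
            break;
        case OP_ISTORE_1:
            assert(sp >= 1);
            local1 = stack[--sp];
            break;
        case OP_ILOAD_1:
            assert(sp < STACK_MAX);
            stack[sp++] = local1;
            break;
        case OP_IADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_BIPUSH:
            assert(sp < STACK_MAX);
            stack[sp++] = *vPC++;   /* operand follows the opcode */
            break;
        case OP_IF_ICMPLT: {
            int target = *vPC++;
            assert(sp >= 2);
            int rhs = stack[--sp];
            int lhs = stack[--sp];
            if (lhs < rhs)
                vPC = code + target;
            break;
        }
        case OP_RETURN:
            return local1;
        default:
            assert(!"unknown opcode");
            return -1;
        }
    }
}
```

Calling `run_switched(foo_code)` doubles i until it reaches 64 and returns that value. Every virtual instruction pays for the loop back-edge and the switch before its body runs.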


“Threading” Dispatch

‣ No switch overhead. Data driven indirect branch.

Execution of the virtual program “threads” through the bodies (as in needle & thread).

    iload_1: .. goto *vPC++;
    iadd:    .. goto *vPC++;
    istore:  .. goto *vPC++;


Context Problem

‣ Data driven indirect branches are hard to predict

    iload_1: .. goto *vPC++;
    iadd:    .. goto *vPC++;
    istore:  .. goto *vPC++;

[Diagram: the dispatch goto at the end of each body is an indirect branch handled by the indirect branch predictor (micro-arch).]
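To see why these branches are hard to predict, the following C sketch replays the dispatch sequence of the foo() example against a very simple predictor. The predictor model is an assumption made for illustration: one last-target entry per dispatch branch, empty at the start. Real indirect branch predictors differ between CPUs, and the names `simulate_btb` and `last_target` are hypothetical.

```c
#include <assert.h>

/* One id per bytecode body; each body ends in its own dispatch branch. */
enum { ICONST_1, ISTORE_1, ILOAD_1, IADD, BIPUSH, IF_ICMPLT, RETURN_,
       NUM_BODIES };

enum { TRACE_MAX = 64 };

/* Replays the bodies executed by foo() under direct threading and counts
   how often a last-target predictor guesses the next body wrongly.
   Returns the number of dispatches; stores the misprediction count
   in *missed. */
static int simulate_btb(int *missed) {
    static const int loop_body[7] = {
        ILOAD_1, ILOAD_1, IADD, ISTORE_1, ILOAD_1, BIPUSH, IF_ICMPLT
    };
    int trace[TRACE_MAX];
    int n = 0;

    /* Build the sequence of executed bodies: i = 1; do { i += i; } while (i < 64); */
    trace[n++] = ICONST_1;
    trace[n++] = ISTORE_1;
    int i = 1;
    do {
        assert(n + 7 < TRACE_MAX);
        for (int k = 0; k < 7; k++)
            trace[n++] = loop_body[k];
        i += i;
    } while (i < 64);
    trace[n++] = RETURN_;

    /* Assumed predictor: remembers only the last target of each branch. */
    int last_target[NUM_BODIES];
    for (int b = 0; b < NUM_BODIES; b++)
        last_target[b] = -1;            /* -1 means no prediction yet */

    int misses = 0;
    for (int t = 0; t + 1 < n; t++) {
        int branch = trace[t];          /* body whose dispatch branch executes */
        int target = trace[t + 1];      /* body it actually jumps to */
        if (last_target[branch] != target)
            misses++;
        last_target[branch] = target;
    }
    *missed = misses;
    return n - 1;
}
```

Under this model 24 of the 44 dispatches are mispredicted. The dispatch branch inside iload_1 misses every time, because its successor cycles through iload_1, iadd, and bipush while the predictor only remembers the previous one.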


Direct Threaded Interpreter

Virtual program:

    … iload_1 iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2 …

DTT - Direct Threading Table (vPC points into it):

    … &&iload_1 &&iload_1 &&iadd &&istore_1 &&iload_1 &&bipush 64 &&if_icmplt -7 …

C implementation of each body:

    iload_1: .. goto *vPC++;
    iadd:    .. goto *vPC++;
    istore:  .. goto *vPC++;

Target of computed goto is data-driven
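A minimal sketch of a direct-threaded interpreter for the running example follows. It relies on the GNU C "labels as values" extension (`&&label` and computed `goto`), so it assumes GCC or Clang. The function name `run_direct_threaded`, the stack size, and the decision to start i at 1 (as in the Java source) are choices made for this sketch. Because `vPC` is declared here as a pointer to table slots, the dispatch is spelled `goto **vPC++`.

```c
#include <assert.h>
#include <stdint.h>

enum { STACK_MAX = 8 };

/* Runs foo() with direct-threaded dispatch and returns local variable 1.
   Requires the GNU C labels-as-values extension. */
static int run_direct_threaded(void) {
    /* DTT: each slot holds a body address or an inline operand. */
    void *dtt[] = {
        &&iconst_1, &&istore_1,
        &&iload_1, &&iload_1, &&iadd, &&istore_1,    /* slot 2: loop head */
        &&iload_1, &&bipush, (void *)(intptr_t)64,
        &&if_icmplt, (void *)(intptr_t)(-7),         /* slot 9, offset -7 */
        &&do_return
    };
    void **vPC = dtt;
    int stack[STACK_MAX];
    int sp = 0;
    int local1 = 0;

    goto **vPC++;                       /* dispatch the first instruction */

iconst_1:
    assert(sp < STACK_MAX);
    stack[sp++] = 1;
    goto **vPC++;
istore_1:
    assert(sp >= 1);
    local1 = stack[--sp];
    goto **vPC++;
iload_1:
    assert(sp < STACK_MAX);
    stack[sp++] = local1;
    goto **vPC++;
iadd:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] += stack[sp];
    goto **vPC++;
bipush:
    assert(sp < STACK_MAX);
    stack[sp++] = (int)(intptr_t)*vPC++;    /* operand is stored inline */
    goto **vPC++;
if_icmplt: {
        /* vPC points at the operand slot; the offset is relative to
           the slot that holds &&if_icmplt. */
        intptr_t offset = (intptr_t)*vPC;
        assert(sp >= 2);
        int rhs = stack[--sp];
        int lhs = stack[--sp];
        if (lhs < rhs)
            vPC = (vPC - 1) + offset;
        else
            vPC++;
        goto **vPC++;
    }
do_return:
    return local1;
}
```

`run_direct_threaded()` returns 64. Each body ends in its own indirect branch, and the target of that branch comes from the table, which is what makes it data-driven.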


Existing Solutions

Piumarta & Riccardi: bodies replicated

Ertl & Gregg: bodies and dispatch replicated

[Diagram labels: a run of bodies (Body Body Body Body Body) ending in "GOTO *PC" with an unknown target ("????"); Super Instruction; Replicate; two copies of iload_1, numbered 1 and 2, each followed by its own "goto *pc".]

Limited to relocatable virtual instructions


Overview

Motivation

Background: The Context Problem

Existing Solutions

• Our Approach

• Inlining

• Results


Key Observation

•Virtual and native control flow similar

•Linear or straight-line code

•Conditional branches

•Calls and Returns

•Indirect branches

•Hardware has predictors for each type

•Direct threading uses an indirect branch for everything!

‣ Solution: Leverage hardware predictors


Essence of our Solution

Package bodies as subroutines and call them

Virtual program:

    … iload_1 iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2 …

CTT - Context Threading Table (generated code):

    call iload_1
    call iload_1
    call iadd
    call istore_1
    call iload_1
    ..

Bytecode bodies (ret terminated):

    iload_1: .. ret;
    iadd:    .. ret;

[Diagram label: Return Branch Predictor Stack]


Subroutine Threading

Virtual program:

    … iload_1 iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2 …

CTT (code generated at load time):

    call iload_1
    call iload_1
    call iadd
    call istore_1
    call iload_1
    call bipush
    call if_icmplt

Bytecode bodies (ret terminated):

    iload_1: … ret;
    iadd:    … ret;

Virtual branch instructions as before:

    if_icmplt: … goto *vPC++;

DTT contains addresses in the CTT, alongside the operands 64 and -7; vPC points into the DTT.


The Context Threading Table

•A sequence of generated call instructions

•Good alignment of virtual and hardware control flow for straight-line code.

‣ Can virtual branches go into the CTT?


Specialized Branch Inlining

Inlining conditional branches provides context

Branch inlined into the CTT:

    target:
        …
        call …
        call iload_1
        if(icmplt) goto target

Conditional branch predictor now mobilized

[Diagram labels: DTT (entry shown: 5), vPC, target: …]
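The shape of the generated code can be approximated in portable C. In the sketch below the bodies are ordinary functions, and `ctt_foo` is a hand-written stand-in for the CTT of the running example: a straight sequence of calls with the conditional branch inlined as a native `if`. This is only an illustration under stated assumptions. The technique described here generates machine code at load time, whereas this sketch is fixed at compile time, passes the bipush operand as an argument instead of reading it through vPC, and uses hypothetical names (`op_*`, `cond_icmplt`, `ctt_foo`).

```c
#include <assert.h>

enum { STACK_MAX = 8 };

/* Interpreter state shared by all bodies (global only to keep the sketch short). */
static int stack[STACK_MAX];
static int sp;
static int local1;

/* Bytecode bodies packaged as subroutines; each one ends by returning. */
static void op_iconst_1(void) { assert(sp < STACK_MAX); stack[sp++] = 1; }
static void op_istore_1(void) { assert(sp >= 1); local1 = stack[--sp]; }
static void op_iload_1(void)  { assert(sp < STACK_MAX); stack[sp++] = local1; }
static void op_iadd(void)     { assert(sp >= 2); sp--; stack[sp - 1] += stack[sp]; }
static void op_bipush(int v)  { assert(sp < STACK_MAX); stack[sp++] = v; }

/* Pops two ints and reports whether if_icmplt would take its branch. */
static int cond_icmplt(void) {
    assert(sp >= 2);
    int rhs = stack[--sp];
    int lhs = stack[--sp];
    return lhs < rhs;
}

/* Hand-written stand-in for the CTT of foo(): one call per virtual
   instruction, with the virtual conditional branch inlined as a native
   conditional branch. Returns local variable 1. */
static int ctt_foo(void) {
    sp = 0;
    local1 = 0;
    op_iconst_1();
    op_istore_1();
target:
    op_iload_1();
    op_iload_1();
    op_iadd();
    op_istore_1();
    op_iload_1();
    op_bipush(64);
    if (cond_icmplt())
        goto target;        /* inlined branch: one native branch per virtual branch */
    return local1;
}
```

`ctt_foo()` returns 64. Each call has a single fixed target and each return goes back to its call site, so control flow in the sketch follows the virtual program instruction by instruction. An optimizing C compiler may inline these small functions, so the sketch shows the structure of the code, not its performance.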


Tiny Inlining

•Context Threading is a dispatch technique

•But, we inline branches

•Some non-branching bodies are very small

•Why not inline those?

►Inline all tiny linear bodies into the CTT


Overview

Motivation

Background: The Context Problem

Existing Solutions

Our Approach

Inlining

• Results


Experimental Setup

•Two Virtual Machines on two hardware architectures.

•VM: Java/SableVM, OCaml interpreter

•Compare against direct threaded SableVM

•SableVM distro uses selective inlining

•Arch: P4, PPC

•Branch Misprediction

•Execution Time

►Is our technique effective and general?


Mispredicted Taken Branches

[Chart: mispredicted taken branches, normalized to direct threading. SableVM/Java, Pentium 4.]

95% of mispredictions eliminated on average


Execution time

[Chart: execution time, normalized to direct threading. Pentium 4.]

27% average reduction in execution time


Execution Time (geomean)

[Chart: geometric mean of execution time, normalized to direct threading.]

Our technique is effective and general


Conclusions

•Context Problem: branch mispredictions due to mismatch between native and virtual control flow

•Solution: Generate control flow code into the Context Threading Table

•Results

•Eliminate 95% of branch mispredictions

•Reduce execution time by 30-40%

‣Recent work, done after CGO 2005, follows


What about Scripting Languages?

• Recently ported context threading to TCL.

• 10x cycles executed per bytecode dispatched.

• Much lower dispatch overhead.

• Speedup due to subroutine threading, approx. 5%.

• TCL conference 2005

[Chart: cycles per virtual instruction.]