1 j. bradley chen and bradley d. d. leupen division of engineering and applied sciences harvard...

23
1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In- Time Code Layout

Upload: reynold-mcgee

Post on 19-Jan-2016

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

1

J. Bradley Chen and Bradley D. D. Leupen

Division of Engineering and Applied Sciences

Harvard University

Improving Instruction Locality with Just-In-Time Code Layout

Page 2: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

2

Goals

• Improve instruction reference locality–big problem for commodity applications

• Eliminate need for profile information–required by current compiler-based solutions

Page 3: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

3

How?

Implement layout dynamically using Activation Order:• A new heuristic for code layout.• Locate procedures in order of use.

Page 4: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

4

Requirements

• No special hardware support.

• Minimal changes to the operating system.

• Minimal system overhead.

Page 5: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

5

Optimizing Procedure Layout

Bad Layout Better Layout

Page 6: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

6

Current Practice: Pettis and Hansen

• Nodes are procedures.

• Edges are caller/callee pairs.

• Weights are call frequency.

WinMain()

Initialize()

EventLoop()

GetEvent() React(

)

HandleRareCase()

HandleInputError()

CheckForInputError()

HandleCommonCase()

1

1

129394

68754

128404

1

10

68753

Page 7: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

7

Pettis and Hansen Layout

EventLoop()

GetEvent()React()

CheckForInputError()

HandleCommonCase()

129394 68754

128404

68753

EventLoop()

React()

HandleCommonCase()

129394 68754

68753

Node-1

layout: [] layout: [GetEvent,CheckForInputErrors]

Node-2

React()

HandleCommonCase()

68754

68753

layout: [EventLoop, GetEvent,CheckForInputErrors]

Node-3

HandleCommonCase()

68753

layout: [React, EventLoop,GetEvent,CheckForInputErrors]

Node-4

layout: [HandleCommonCase, React,EventLoop, GetEvent,CheckForInputErrors]

Page 8: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

8

A New Heuristic

Code P&H AO

void main(){ Initialize(); Do some stuff; for (10000 interations) { Do some stuff; foo(); bar(); }}

foo()main()bar()Initialize()

main()Initialize()foo()bar()

Activation Order: Co-locate procedures that are activated sequentially.

Example:

Page 9: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

9

Implementing JITCL __start: perform initializations call thunk_main

thunk_main:. . .

thunk_foo:. . .

__InstructionMemory:

Thunk routines implement code layout on-the-fly.

Page 10: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

10

Thunk routines

// Global variables:// ProcPointers[] - one element per procedure// INDEX_proc and LENGTH_proc for each procedurethunk_main: if (InCodeSegment(ProcPointers[INDEX_main])) ProcPointers[INDEX_main] = CopyToTextSegment(ProcPointer[INDEX_main],

LENGTH_main); PatchCallSite(ProcPointer[INDEX_main], ComputeCallSiteFromReturnAddress(RA)); jmp ProcPointer[INDEX_main];

The thunk routines copy procedures into the textsegment and update call sites at run-time.

Page 11: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

11

Simulation Methodology

8K 8KCache Size

Direct-Mapped 2-WayAssociativity

ATOM EtchSimulation

UNIX/RISC Win32/x86

Page 12: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

12

Workloads

Benchmark Description Text Size

UNIX

compress file compression 112

Gcc The GNU C compiler 1552

m88ksim Simulation of Motorola 88K 160

Perl The perl scripting language 376

Raytrace Image rendering 192

Xanim MPEG player 2024

Win32

Mazelord Maze game 1445

Perfmon Windows NT system utility 2805

Wordpro 96 Lotus document preparation 5148

Word 7 Microsoft document preparation 7694

IE302 Microsoft web browser product 4990

Page 13: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

13

Results

• The AO heuristic is effective.• The overhead of JITCL is negligible.• JITCL improves procedure layout without

requiring profile information.• JITCL reduces program memory requirements.

Page 14: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

14

Results: The AO Heuristic

0

0.01

0.02

0.03

0.04

co

mp

res

s

gc

c

m8

8k

sim

pe

rl

ray

tra

ce

xa

nim

Pettis & Hansen

Activation Order

Improvement in I-CacheMiss Rate

Conclusion: Effectiveness of heuristic is comparable to P&H.

Page 15: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

15

Overhead of JITCL

• Copy overhead– instruction overhead– cache overhead

• Cache consistency

• Disk overhead - comparable to demand loaded text; not evaluated.

Page 16: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

16

Results: Overhead

0

0.025

0.05

0.075

0.1

com

pre

ss

gcc

m88

ksim

per

l

rayt

race

xan

im

OverheadInstructions(%)

Conclusion: JITCL Overhead is less than 0.1% in all cases.

Page 17: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

17

Results: Performance

0

0.2

0.4

0.6

0.8

1

1.2

co

mp

res

s

gc

c

m8

8k

sim

pe

rl

ray

tra

ce

xa

nim

Pettis & Hansen

JITCL

SavedCycles perInstruction

Conclusion: Overall performance is comparable to P&H.

Page 18: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

18

JITCL for Win32 Applications

• Windows applications are composed of multiple executable modules.

• When transitions between modules are frequent, intra-module code layout is less effective.

• With JITCL, inter-module code layout is possible and beneficial.

Page 19: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

19

Win32 Cache Miss Rates

0

0.01

0.02

0.03

0.04

0.05

0.06

mazelord perfmon wordpro96

word 7 ie302

default

P&H

JITCL

L1 CacheMiss Rate

Conclusion: Careful layout did not help Win32 applications.

Page 20: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

20

Text Segment Size

0

500

1000

1500

2000

2500

3000

3500

mazelord perfmon wordpro96

word 7 ie302

default

JITCL

Text size inmegabytes

Conclusion: JITCL typically reduces text size by 50%.

Page 21: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

21

JITCL vs. PBO

• JITCL provides an alternative to feedback-based procedure layout.

• Many important optimizations still require profile information.– instruction scheduling– register allocation– other intra-procedural optimizations

• Don’t expect profile-based optimization to go away!

Page 22: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

22

Conclusions

Just-In-Time code layout achieves comparable benefit to profile-based code layout without the need for profiles.• The AO heuristic is effective.• The overhead of procedure copying is low.• Benefit in I-Cache is comparable to Pettis and

Hansen layout.• JITCL can reduce working set size.

Page 23: 1 J. Bradley Chen and Bradley D. D. Leupen Division of Engineering and Applied Sciences Harvard University Improving Instruction Locality with Just-In-Time

23

The Morph Project

oM phr

For more information:http://www.eecs.harvard.edu/morph/