1 j. bradley chen and bradley d. d. leupen division of engineering and applied sciences harvard...
TRANSCRIPT
1
J. Bradley Chen and Bradley D. D. Leupen
Division of Engineering and Applied Sciences
Harvard University
Improving Instruction Locality with Just-In-Time Code Layout
2
Goals
• Improve instruction reference locality–big problem for commodity applications
• Eliminate need for profile information–required by current compiler-based solutions
3
How?
Implement layout dynamically using Activation Order:• A new heuristic for code layout.• Locate procedures in order of use.
4
Requirements
• No special hardware support.
• Minimal changes to the operating system.
• Minimal system overhead.
5
Optimizing Procedure Layout
Bad Layout Better Layout
6
Current Practice: Pettis and Hansen
• Nodes are procedures.
• Edges are caller/callee pairs.
• Weights are call frequency.
WinMain()
Initialize()
EventLoop()
GetEvent() React(
)
HandleRareCase()
HandleInputError()
CheckForInputError()
HandleCommonCase()
1
1
129394
68754
128404
1
10
68753
7
Pettis and Hansen Layout
EventLoop()
GetEvent()React()
CheckForInputError()
HandleCommonCase()
129394 68754
128404
68753
EventLoop()
React()
HandleCommonCase()
129394 68754
68753
Node-1
layout: [] layout: [GetEvent,CheckForInputErrors]
Node-2
React()
HandleCommonCase()
68754
68753
layout: [EventLoop, GetEvent,CheckForInputErrors]
Node-3
HandleCommonCase()
68753
layout: [React, EventLoop,GetEvent,CheckForInputErrors]
Node-4
layout: [HandleCommonCase, React,EventLoop, GetEvent,CheckForInputErrors]
8
A New Heuristic
Code P&H AO
void main(){ Initialize(); Do some stuff; for (10000 interations) { Do some stuff; foo(); bar(); }}
foo()main()bar()Initialize()
main()Initialize()foo()bar()
Activation Order: Co-locate procedures that are activated sequentially.
Example:
9
Implementing JITCL __start: perform initializations call thunk_main
thunk_main:. . .
thunk_foo:. . .
__InstructionMemory:
Thunk routines implement code layout on-the-fly.
10
Thunk routines
// Global variables:// ProcPointers[] - one element per procedure// INDEX_proc and LENGTH_proc for each procedurethunk_main: if (InCodeSegment(ProcPointers[INDEX_main])) ProcPointers[INDEX_main] = CopyToTextSegment(ProcPointer[INDEX_main],
LENGTH_main); PatchCallSite(ProcPointer[INDEX_main], ComputeCallSiteFromReturnAddress(RA)); jmp ProcPointer[INDEX_main];
The thunk routines copy procedures into the textsegment and update call sites at run-time.
11
Simulation Methodology
8K 8KCache Size
Direct-Mapped 2-WayAssociativity
ATOM EtchSimulation
UNIX/RISC Win32/x86
12
Workloads
Benchmark Description Text Size
UNIX
compress file compression 112
Gcc The GNU C compiler 1552
m88ksim Simulation of Motorola 88K 160
Perl The perl scripting language 376
Raytrace Image rendering 192
Xanim MPEG player 2024
Win32
Mazelord Maze game 1445
Perfmon Windows NT system utility 2805
Wordpro 96 Lotus document preparation 5148
Word 7 Microsoft document preparation 7694
IE302 Microsoft web browser product 4990
13
Results
• The AO heuristic is effective.• The overhead of JITCL is negligible.• JITCL improves procedure layout without
requiring profile information.• JITCL reduces program memory requirements.
14
Results: The AO Heuristic
0
0.01
0.02
0.03
0.04
co
mp
res
s
gc
c
m8
8k
sim
pe
rl
ray
tra
ce
xa
nim
Pettis & Hansen
Activation Order
Improvement in I-CacheMiss Rate
Conclusion: Effectiveness of heuristic is comparable to P&H.
15
Overhead of JITCL
• Copy overhead– instruction overhead– cache overhead
• Cache consistency
• Disk overhead - comparable to demand loaded text; not evaluated.
16
Results: Overhead
0
0.025
0.05
0.075
0.1
com
pre
ss
gcc
m88
ksim
per
l
rayt
race
xan
im
OverheadInstructions(%)
Conclusion: JITCL Overhead is less than 0.1% in all cases.
17
Results: Performance
0
0.2
0.4
0.6
0.8
1
1.2
co
mp
res
s
gc
c
m8
8k
sim
pe
rl
ray
tra
ce
xa
nim
Pettis & Hansen
JITCL
SavedCycles perInstruction
Conclusion: Overall performance is comparable to P&H.
18
JITCL for Win32 Applications
• Windows applications are composed of multiple executable modules.
• When transitions between modules are frequent, intra-module code layout is less effective.
• With JITCL, inter-module code layout is possible and beneficial.
19
Win32 Cache Miss Rates
0
0.01
0.02
0.03
0.04
0.05
0.06
mazelord perfmon wordpro96
word 7 ie302
default
P&H
JITCL
L1 CacheMiss Rate
Conclusion: Careful layout did not help Win32 applications.
20
Text Segment Size
0
500
1000
1500
2000
2500
3000
3500
mazelord perfmon wordpro96
word 7 ie302
default
JITCL
Text size inmegabytes
Conclusion: JITCL typically reduces text size by 50%.
21
JITCL vs. PBO
• JITCL provides an alternative to feedback-based procedure layout.
• Many important optimizations still require profile information.– instruction scheduling– register allocation– other intra-procedural optimizations
• Don’t expect profile-based optimization to go away!
22
Conclusions
Just-In-Time code layout achieves comparable benefit to profile-based code layout without the need for profiles.• The AO heuristic is effective.• The overhead of procedure copying is low.• Benefit in I-Cache is comparable to Pettis and
Hansen layout.• JITCL can reduce working set size.
23
The Morph Project
oM phr
For more information:http://www.eecs.harvard.edu/morph/