Runtime Optimization with Specialization
Johnathon Jamison
CS265
Susan Graham
4-30-2003
What is Runtime Code Generation (RTCG)?
• Dynamic addition of code to the instruction stream
• Restricted to instructions executed directly by hardware
Problems with RTCG
• Reentrant code
• Portability (HL languages vs. assembly)
• Data vs. Code issues
– Caches and memory
– Standard compilation schema
• Maintainability and understandability
Benefits of RTCG
• Adaptation to the architecture (cache sizes, various latencies, etc.)
• JIT compilation
• No profiling (actual data is available)
• Values known at runtime act as literals, enabling optimizations that are unknown or impossible at compile time
• Potentially more compact (for caches)
Dynamic Compilation Trade-offs
• Execution time is linear in run count
• Choice between lower startup cost, lower incremental cost
[Figure: execution time (y-axis) versus input (x-axis) for unoptimized static code, optimized static code, low optimized dynamic code, and high optimized dynamic code]
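The trade-off on this slide can be made concrete with a linear cost model. The sketch below is illustrative only: the function name `break_even_runs`, its parameters, and any cycle counts fed to it are assumptions, not figures from the slides. It returns the number of runs after which dynamic code, with a one-time startup overhead but a lower per-run cost, becomes cheaper than static code.

```c
#include <assert.h>

/* Linear model (an assumption of this sketch):
 *   static cost  = static_per_run * n
 *   dynamic cost = overhead + dynamic_per_run * n
 * Returns the smallest run count n for which the dynamic cost is strictly
 * lower, or -1 if the dynamic code never catches up. */
static long break_even_runs(double overhead,
                            double static_per_run,
                            double dynamic_per_run) {
    assert(overhead >= 0.0);
    double saving = static_per_run - dynamic_per_run;  /* cycles saved per run */
    if (saving <= 0.0) {
        return -1;  /* no per-run benefit, so the overhead is never repaid */
    }
    return (long)(overhead / saving) + 1;
}
```

With made-up numbers, an overhead of 5000 cycles, 100 cycles per static run, and 40 cycles per dynamic run, `break_even_runs` returns 84.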
Observation
• Programmers write the common case
– blit routines
– image display
• Applications have repetitious data
– simulators
– Regexp matching
• Optimizations
– sparse matrices
One Tack: Specialization
• Take a piece of code and replace variables with constants
• Enables various optimizations
– Strength reduction
– Constant propagation
– etc.
• Generate explicitly or implicitly
• Possibly reuse
Example Of Specialization
int dot_product(int size, int u[], int v[]) {
    int i;
    int res = 0;
    for (i = 0; i < size; i++) {
        res = res + u[i] * v[i];
    }
    return res;
}
• Suppose size == 5, u == {14,0,38,12,1}
Example Of Specialization
int dot_product_1(int v[]) {
    int i;
    int res = 0;
    for (i = 0; i < 5; i++) {
        res = res + (const int[]){14,0,38,12,1}[i] * v[i];
    }
    return res;
}
• Substitute in the values
Example Of Specialization
int dot_product_1(int v[]) {
    int res = 0;
    res = res + 14 * v[0];
    res = res + 0 * v[1];
    res = res + 38 * v[2];
    res = res + 12 * v[3];
    res = res + 1 * v[4];
    return res;
}
• Unroll the loop
Example Of Specialization
int dot_product_1(int v[]) {
    int res;
    res = 14 * v[0];
    res = res + 38 * v[2];
    res = res + 12 * v[3];
    res = res + v[4];
    return res;
}
• Eliminate unneeded code
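As a sanity check on the specialization steps, the sketch below places the general routine next to the fully specialized one so the two can be compared on any `v`. The function names follow the slides; the `const` qualifiers, the locally declared loop index, and the precondition check are additions made for this sketch.

```c
#include <assert.h>

/* General routine from the slides. */
static int dot_product(int size, const int u[], const int v[]) {
    assert(size >= 0);
    int res = 0;
    for (int i = 0; i < size; i++) {
        res = res + u[i] * v[i];
    }
    return res;
}

/* Specialized for size == 5 and u == {14, 0, 38, 12, 1}: the loop is
 * unrolled, the 0 * v[1] term is dropped, and 1 * v[4] becomes v[4]. */
static int dot_product_1(const int v[]) {
    int res = 14 * v[0];
    res = res + 38 * v[2];
    res = res + 12 * v[3];
    res = res + v[4];
    return res;
}
```

For `v == {1, 2, 3, 4, 5}` both routines return 181.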
DyC
• make_static annotation indicates which variables to specialize with respect to
• @ annotation indicates static loads (a reload is not needed)
int dot_product(int size, int u[], int v[]) {
    make_static(size, u);
    int i;
    int res = 0;
    for (i = 0; i < size; i++) {
        res = res + u@[i] * v[i];
    }
    return res;
}
DyC Specializer
• Each region has a runtime specializer
• Setup computations are run
• The values are plugged into holes in code templates
• The resultant code is optimized
• The result is translated to machine code and run
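A toy model of these steps, not DyC's actual mechanism: the "template" is a multiply-accumulate operation with two holes (a coefficient and an index), the runtime values of `u` fill the holes, zero terms are dropped while filling, and a small interpreter stands in for emitting and running machine code. All type and function names here (`mac_op`, `specialize`, `run_specialized`) are hypothetical.

```c
#include <assert.h>

/* One filled template: res += coeff * v[index]; coeff and index are the holes. */
typedef struct {
    int coeff;
    int index;
} mac_op;

/* "Specializer": fill one template per element of the static vector u,
 * skipping terms whose coefficient is zero. Returns the number of
 * operations written to out (at most size). */
static int specialize(int size, const int u[], mac_op out[]) {
    assert(size >= 0);
    int n = 0;
    for (int i = 0; i < size; i++) {
        if (u[i] == 0) {
            continue;  /* a zero coefficient contributes nothing */
        }
        out[n].coeff = u[i];
        out[n].index = i;
        n++;
    }
    return n;
}

/* Stand-in for running generated code: interpret the filled templates. */
static int run_specialized(int n, const mac_op ops[], const int v[]) {
    int res = 0;
    for (int k = 0; k < n; k++) {
        res += ops[k].coeff * v[ops[k].index];
    }
    return res;
}
```

The specializer runs once per distinct `u`; the filled templates can then be reused for every `v`, which is where the startup cost is repaid.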
DyC Optimizations
• Polyvariant specialization
• Internal dynamic-to-static promotions
• Unchecked dispatching
• Complete loop unrolling
• Polyvariant division and conditional specialization
• Static loads and calls
• Strength reduction, zero and copy propagation, and dead-assignment elimination (precomputed!)
DyC Annotations
• Runtime constants and constant functions
• Specialization/division should be mono-/poly-variant
• Disable/enable internal promotions
• Compile eagerly/lazily downstream of branches
• Code caching style at merges/promotions
• Interprocedural specialization
Manual Annotation
• Profile to find potential gains
• Concentrate on areas with high execution times
• If the runtime constants are not obvious, log the values of parameters to find them
• Trial and error loop unrolling
Applications
Table 1: Application Characteristics
| Program | Description | Annotated Static Variables | Values of Static Variables | Total Size (Lines) | Dynamically Compiled Functions (#) | Dynamically Compiled Lines | Dynamically Compiled Instructions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Applications | | | | | | | |
| dinero | cache simulator | cache configuration parameters | 8kB I/D, direct-mapped, 32B blocks | 3,317 | 8 | 389 | 1624 |
| m88ksim | Motorola 88000 simulator | an array of breakpoints | no breakpoints | 12,531 | 1 | 14 | 145 |
| mipsi | MIPS R3000 simulator | its input program | bubblesort | 3,417 | 1 | 400 | 2884 |
| pnmconvol | image convolution | convolution matrix | 11x11 with 9% ones, 83% zeroes | 1,054 | 1 | 76 | 1226 |
| viewperf | renderer | 3D projection matrix, lighting vars | perspective matrix, one light source | 15,006 | 2 | 168 | 1155 |
| Kernels | | | | | | | |
| binary | binary search over an array | the input array and its contents | 16 integers | 147 | 1 | 19 | 134 |
| chebyshev | polynomial function approximation | the degree of the polynomial | 10 | 145 | 1 | 19 | 146 |
| dotproduct | dot-product of two vectors | the contents of one of the vectors | a 100-integer array with 90% zeroes | 134 | 1 | 11 | 84 |
| query | tests database entry for match | a query | 7 comparisons | 149 | 1 | 24 | 272 |
| romberg | function integration by iteration | the iteration bound | 6 | 158 | 1 | 24 | 301 |
Optimizations used
Table 2: Optimizations Used by Each Program
Optimization columns, in order: Complete Loop Unrolling (SW = single-way, MW = multi-way), Static Loads, Unchecked Dispatching, Dynamic Dead-Assignment Elimination, Dynamic Zero & Copy Propagation, Static Calls, Dynamic Strength Reduction, Internal Dynamic-to-Static Promotions, Poly-variant Division
| Dynamic Region | Complete Loop Unrolling | Check marks in the remaining columns |
| --- | --- | --- |
| dinero:mainloop | | ✓✓ ✓ |
| m88ksim:ckbrkpts | SW | ✓✓ |
| mipsi:run | MW | ✓✓ ✓ ✓ |
| pnmconvol:do_convol | SW | ✓✓ ✓ ✓ |
| viewperf:project&clip | | ✓✓ ✓ ✓ |
| viewperf:shader | SW | ✓✓ ✓ ✓ ✓ ✓ |
| binary | MW | ✓✓ |
| chebyshev | SW | ✓✓ |
| dotproduct | SW | ✓✓ ✓ ✓ |
| query | SW | ✓✓ |
| romberg | SW | ✓ |
Break Even Points
Table 3: Dynamic Region Performance with All Optimizations
| Dynamic Region | Asymptotic Speedup | Break-Even Point | DC Overhead (cycles/instruction generated) | Number of Instructions Generated |
| --- | --- | --- | --- | --- |
| dinero: mainloop | 1.7 | 1 invocation (3524 memory references) | 334 | 634 |
| m88ksim: ckbrkpts | 3.7 | 28 breakpoint checks | 365 | 6 |
| mipsi: run | 5.0 | 1 invocation (484634 instructions) | 207 | 36614 |
| pnmconvol: do_convol | 3.1 | 1 invocation (59 pixels) | 110 | 2394 |
| viewperf: project&clip | 1.3 | 16 invocations | 823 | 122 |
| viewperf: shade | 1.2 | 16 invocations | 524 | 618 |
| binary | 1.8 | 836 searches | 72 | 304 |
| chebyshev | 6.3 | 2 interpolations | 31 | 807 |
| dotproduct | 5.7 | 6 dotproducts | 85 | 50 |
| query | 1.4 | 259 database entry comparisons | 53 | 71 |
| romberg | 1.3 | 16 integrations | 13 | 1206 |
Performance
Table 4: Whole-Program Performance with All Optimizations
| Application | Execution Time, Statically Compiled (sec.) | Execution Time, Dynamically Compiled (sec.) | Execution Time in the Dynamic Regions (% of total static execution) | Average Whole-Program Speedup |
| --- | --- | --- | --- | --- |
| dinero | 1.3 | 0.9 | 49.9 | 1.5 |
| m88ksim | 81.0 | 76.8 | 9.8 | 1.05 |
| mipsi | 20.8 | 4.5 | ~100 | 4.6 |
| pnmconvol | 10.8 | 3.6 | 83.8 | 3.0 |
| viewperf | 1.7 | 1.6 | 41.4 | 1.02 |
Speedup without a given feature
Table 5: Dynamic Region Asymptotic Speedups without a Particular Feature
Feature columns, in order: Complete Loop Unrolling, Static Loads, Unchecked Dispatching, Static Calls, Dynamic Zero & Copy Propagation, Dynamic Dead-Assignment Elimination, Dynamic Strength Reduction, Internal Dynamic-to-Static Promotions, Poly-variant Division
| Dynamic Region | With All Opts | Without each feature the region uses (in column order) |
| --- | --- | --- |
| dinero:mainloop | 1.7 | 0.9, 1.6, 1.03 |
| m88ksim:ckbrkpts | 3.7 | 0.4, 0.6, 1.6 |
| mipsi:run | 5.0 | 0.9, 0.9, 5.0, 0.9, 0.9 |
| pnmconvol:do_convol | 3.1 | 0.8, 0.8, 3.1, 2.1, 0.9 |
| viewperf:project&clip | 1.3 | 1.1, 1.3, 1.1, 1.3 |
| viewperf:shader | 1.2 | 1.0, 1.1, 1.2, 1.02, 1.1, 1.2, 1.1 |
| binary | 1.8 | 0.6, 1.3, 0.6 |
| chebyshev | 6.3 | 0.9, 6.0, 1.2 |
| dotproduct | 5.7 | 0.3, 0.9, 3.4, 0.7, 0.7 |
| query | 1.4 | 0.5, 0.5, 0.6 |
| romberg | 1.3 | 0.8, 1.2 |
Calpa
• A system that automatically generates DyC annotations
• Profiles code, collecting statistics
• Analyses results
• Annotates code
• Basically, automates what previously was done manually
Calpa, Step 1
• Instrumentation tool instruments the original binary
• Executed on representative input
• Generates summarized value and frequency data
• Fed into next step
The Instrumenter
• Three types of information collected
– Basic block execution frequencies
– Variable definitions
– Variable uses
• Points-to information is also collected, since invalidating constants is necessary for safety
• Uses are stored as value/occurrence pairs, with procedure invocation noted, for groups of related values in a procedure
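One way to picture value/occurrence pairs is a small table keyed by observed value. The sketch below is an illustration under stated assumptions, not Calpa's data structure: the fixed capacity, the linear search, and the names `value_profile`, `profile_record`, and `profile_single_value` are all invented here, and it ignores the grouping of related values and the per-invocation bookkeeping mentioned above.

```c
#include <assert.h>

#define PROFILE_CAPACITY 16  /* arbitrary limit chosen for this sketch */

typedef struct {
    long value;           /* an observed value of the variable */
    unsigned long count;  /* how many times it was observed */
} value_count;

typedef struct {
    value_count pairs[PROFILE_CAPACITY];
    int used;
    unsigned long overflow;  /* observations dropped once the table is full */
} value_profile;

/* Record one observed use of a variable. */
static void profile_record(value_profile *p, long value) {
    assert(p != 0);
    for (int i = 0; i < p->used; i++) {
        if (p->pairs[i].value == value) {
            p->pairs[i].count++;
            return;
        }
    }
    if (p->used < PROFILE_CAPACITY) {
        p->pairs[p->used].value = value;
        p->pairs[p->used].count = 1;
        p->used++;
    } else {
        p->overflow++;
    }
}

/* Returns 1 and stores the value in *out if exactly one distinct value was
 * ever seen (a candidate runtime constant); otherwise returns 0. */
static int profile_single_value(const value_profile *p, long *out) {
    if (p->used == 1 && p->overflow == 0) {
        *out = p->pairs[0].value;
        return 1;
    }
    return 0;
}
```

A variable whose profile holds a single pair with a high count is the kind of runtime constant that manual annotation looked for by logging parameter values.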
Profiling Data
Table 2. Profiling Results. This table shows the effects of instrumentation on application code size and execution time, and the resulting profile log size.
| Program | Instrumentation Time | Original Binary Size | Instrumented Binary Size (a) | Binary Expansion Factor | Original Run Time (seconds) | Instrumented Run Time | Profile Log File Size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| binary | 0.2 seconds | 25 KB | 224 KB | 9.0 | < 0.1 | 1.9 seconds | 275 KB |
| dotproduct | 0.1 seconds | 25 KB | 224 KB | 9.0 | < 0.1 | 0.3 seconds | 92 KB |
| query | 0.4 seconds | 27 KB | 224 KB | 8.3 | < 0.1 | 7.8 seconds | 939 KB |
| romberg | 0.3 seconds | 26 KB | 224 KB | 8.6 | < 0.1 | 0.4 seconds | 102 KB |
| dinero | 4.6 seconds | 57 KB | 448 KB | 7.9 | 1.3 | 13.8 minutes | 6.8 MB |
| pnmconvol | 1.2 seconds | 66 KB | 288 KB | 4.4 | 3.0 | 17.1 minutes | 266 KB |
| m88ksim | 10.7 minutes | 213 KB | 2.6 MB | 12.6 | 180.1 | 3.5 hours (b) | 7.5 MB |
a. Instrumented binary sizes include required portions of a 409KB instrumentation library.
b. Using a limited amount of binary patching improved m88ksim's instrumented run time from 18 to 3.5 hours.
Profiling
• Seconds to hours
• Naive profiling was sufficient for their purposes, and so left unoptimized
• Another paper describes more efficient profile gathering
Calpa, Step 2
• Annotation tool searches possible space of annotations
• Selects annotations and creates annotated program
• Passed to DyC, which compiles the program
• Calpa == policy, DyC == mechanism
Candidate Static Variable (CSV) Sets
• A CSV set is the set of CSVs that make an instruction static
• Propagate if exactly one definition exists
CSV Sets Example
i = 0                        {}
L1: if i >= size goto L2     {i, size}
uelem = u[i]                 {i, u[]}
velem = v[i]                 {i, v[]}
t = uelem * velem            {i, u[], v[]}
sum = sum + t                {i, sum, u[], v[]}
i = i + 1                    {i}
goto L1                      {}
L2:
Candidate Division (CD) Sets
• A CD is a set of CSVs
• Set of static instructions in a CD are those instructions whose CSV sets are subsets of the CD
• The CD Set is all CDs produced from some combination of the CSV sets
• No need to consider other CDs (21 of the 32 possible variable subsets in the dot-product example)
CD Sets Example
CSV sets                 CDs
{}                       {}
{i}                      {i}
{i, size}                {i, size}
{i, u[]}                 {i, u[]}
{i, v[]}                 {i, v[]}
{i, u[], v[]}            {i, u[], v[]}
{i, sum, u[], v[]}       {i, sum, u[], v[]}
                         {i, size, u[]}
                         {i, size, v[]}
                         {i, size, u[], v[]}
                         {i, size, sum, u[], v[]}
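The CD set for this example can be reproduced mechanically by closing the CSV sets under union. The sketch below encodes each variable as one bit; the bit assignments and the function names `enumerate_cds` and `is_static_under` are choices made for this illustration. For the seven distinct CSV sets of the dot-product loop it yields 11 CDs, which leaves 21 of the 32 possible subsets of the five variables unvisited.

```c
#include <assert.h>

enum {
    VAR_I = 1, VAR_SIZE = 2, VAR_SUM = 4, VAR_U = 8, VAR_V = 16,
    NUM_SUBSETS = 32  /* 2^5 subsets of the five variables */
};

/* Marks in is_cd[] every subset obtainable as a union of some of the CSV
 * sets (each given as a bitmask). Returns the number of CDs found. */
static int enumerate_cds(const unsigned csv[], int n,
                         unsigned char is_cd[NUM_SUBSETS]) {
    for (int s = 0; s < NUM_SUBSETS; s++) is_cd[s] = 0;
    for (int k = 0; k < n; k++) {
        assert(csv[k] < NUM_SUBSETS);
        is_cd[csv[k]] = 1;
    }
    int changed = 1;
    while (changed) {  /* repeat until no new union appears */
        changed = 0;
        for (unsigned a = 0; a < NUM_SUBSETS; a++) {
            if (!is_cd[a]) continue;
            for (unsigned b = 0; b < NUM_SUBSETS; b++) {
                if (is_cd[b] && !is_cd[a | b]) {
                    is_cd[a | b] = 1;
                    changed = 1;
                }
            }
        }
    }
    int count = 0;
    for (int s = 0; s < NUM_SUBSETS; s++) count += is_cd[s];
    return count;
}

/* An instruction is static under a CD when its CSV set is a subset of the CD. */
static int is_static_under(unsigned csv_set, unsigned cd) {
    return (csv_set & ~cd) == 0;
}
```

The subset test in `is_static_under` is the rule from the previous slide: the static instructions of a CD are those whose CSV sets it contains.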
Search of CD Space
• The CDs are enumerated, starting with the variables whose values vary the least
• As additional CDs are enumerated, the "best" one is kept
• The search terminates if
– All CDs are enumerated
– A time quota expires
– The improvement over the "best" so far drops below a threshold
Cost Model
• Specialization cost
– Basic block size * # of values
– Loop size * # of values of the induction variable (scale for multiway loops)
– Total instructions * instruction generation cost
• Cache cost
– Lookup cost
– Hash key construction (# of vars * cost per var)
– Except if unchecked policy is used
• Invalidation cost
– Sum of execution frequency * invalidate cost for all invalidation points
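The cost terms above combine into a single estimate. The sketch below is one possible reading, with every field name and unit invented for illustration; it is not Calpa's implementation. It takes the total generated instruction count as given rather than deriving it from block and loop sizes, it omits the scaling for multiway loops, and it assumes the cache cost is paid on every entry to the region.

```c
#include <assert.h>

/* Inputs to the estimate; all fields are hypothetical names for the slide's terms. */
typedef struct {
    double generated_instructions;  /* total instructions the specializer emits */
    double cycles_per_instruction;  /* instruction generation cost */
    double lookup_cycles;           /* cost of one code-cache lookup */
    int    static_vars;             /* variables hashed into the cache key */
    double cycles_per_var;          /* hash-key construction cost per variable */
    int    unchecked;               /* nonzero: unchecked policy, no cache cost */
    double dispatches;              /* assumed number of entries into the region */
} region_cost;

typedef struct {
    double frequency;          /* execution frequency of the invalidation point */
    double invalidate_cycles;  /* cost of one invalidation */
} invalidation_point;

static double specialization_cost(const region_cost *r) {
    return r->generated_instructions * r->cycles_per_instruction;
}

static double cache_cost(const region_cost *r) {
    if (r->unchecked) {
        return 0.0;  /* unchecked dispatching skips lookup and hashing */
    }
    return r->dispatches * (r->lookup_cycles + r->static_vars * r->cycles_per_var);
}

static double invalidation_cost(const invalidation_point pts[], int n) {
    assert(n >= 0);
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += pts[i].frequency * pts[i].invalidate_cycles;
    }
    return sum;
}

static double estimated_cost(const region_cost *r,
                             const invalidation_point pts[], int n) {
    return specialization_cost(r) + cache_cost(r) + invalidation_cost(pts, n);
}
```

Keeping the three terms as separate functions makes it easy to see how choosing the unchecked policy removes the cache term entirely.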
Benefit Model
• Runs a simplified DyC analysis
• Assumes whole procedure specialization (overestimating costs)
• Count number of saved cycles assuming the given CD
• Only looks at the critical path (a simplifying assumption)
• A win if saved cycles > cycle cost
Calpa is safe
• Static, unchecked, and eager annotations are selected when profile information hints at this
• However, these are unsafe
• Calpa does invalidations at all points where safety could be upset
• Also makes pessimistic assumptions about external routines
• It is always safe to avoid these annotations
Testing
• Tested on previously annotated programs
• The annotation process was much quicker
• Found all manual annotations
• Plus more annotations
Annotations found
• All the manual ones, plus two more
– Search key in search program
– Vector v in dotproduct
• The unvarying nature of these variables was an artifact of atypical use
• But getting good profiling input is someone else's research
Related Work
• Value Profiling
– Work that builds an efficient system to profile programs, with the aim of using the collected information to drive specialization. It does not collect value sequence information, which Calpa needs. It also does binary instrumentation.
• Fabius
– This system takes curried functions and generates code for partially evaluated functions. Thus, the idiom of currying is leveraged to optimize code at runtime.
• Tick C
– `C extends C with a few additional constructs that allow explicit, manual runtime code compilation. You specify exactly which code you wish to have compiled, in C-like fragments. In the spirit of runtime trade-offs, code generation can take one of two forms: one quick to create, and one more efficient.
• Tempo
– Tempo can act as either a source-to-source compiler or a source-to-runtime code generator. It is much more limited in scope than DyC/Calpa. However, it does have an automatic side-effect/alias analysis.