Lizy Kurian John, LCA, UT Austin
1
The University of Texas at Austin
What Programming Language/Compiler Researchers should Know about Computer Architecture
Lizy Kurian John
Department of Electrical and Computer Engineering
The University of Texas at Austin
Somebody once said
“Computers are dumb actors and compilers/programmers are the master playwrights.”
Computer Architecture Basics
• ISAs
• RISC vs CISC
• Assembly language coding
• Datapath (ALU) and controller
• Pipelining
• Caches
• Out-of-order execution
Hennessy and Patterson architecture books
Basics
• ILP
• DLP
• TLP
• Massive parallelism
• SIMD/MIMD
• VLIW
• Performance and power metrics
Hennessy and Patterson architecture books
ASPLOS, ISCA, Micro, HPCA
The Bottomline
Programming language choice affects performance and power, e.g., Java
Compilers affect Performance and Power
A Java Hardware Interpreter
Radhakrishnan, Ph.D. 2000 (ISCA 2000, ICS 2001)
This technique is used by Nazomi Communications and Parthus (Chicory Systems)
[Figure: processor pipeline (fetch, decode, execute) with a hardware bytecode translator after fetch. Inputs: Java class file (bytecodes) and native executable (native machine instructions).]
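The translation step in the figure can be pictured with a toy software model. This is only a sketch under assumptions: the opcode names are real JVM bytecodes, but the register-style output format and the function name `translate` are hypothetical and do not describe the actual hardware design.

```python
# Toy illustration only: maps a few stack-based JVM bytecodes to
# register-style instructions. The output syntax is hypothetical.

def translate(bytecodes):
    """Translate a short bytecode sequence into register-style instructions.

    The operand stack is modelled at translation time: each pushed value
    lives in a temporary register, so pushes and pops become register names.
    """
    native = []
    stack = []      # registers currently holding operand-stack slots
    next_tmp = 0

    for op in bytecodes:
        if op.startswith("iload_"):
            # Push local variable n: copy it into a fresh temporary.
            reg = f"t{next_tmp}"
            next_tmp += 1
            native.append(f"mov {reg}, local{op[-1]}")
            stack.append(reg)
        elif op == "iadd":
            # Pop two operands and push their sum (reusing the left register).
            rhs = stack.pop()
            lhs = stack.pop()
            native.append(f"add {lhs}, {lhs}, {rhs}")
            stack.append(lhs)
        elif op.startswith("istore_"):
            # Pop the top of the stack into local variable n.
            native.append(f"mov local{op[-1]}, {stack.pop()}")
        else:
            raise ValueError(f"unsupported bytecode: {op}")
    return native


if __name__ == "__main__":
    for line in translate(["iload_1", "iload_2", "iadd", "istore_3"]):
        print(line)
```

The point of the sketch is that the stack traffic disappears once stack slots are named as registers, which is the kind of work a translator placed between fetch and decode could do.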
HardInt Performance
4-way performance
[Figure: bar chart of execution cycles (millions) for five benchmarks under five configurations. Bar labels:]
Configuration           db     javac   jess    mpeg    mtrt
JDK 1.1.6 Interpreter   44.8   109.3   149.7   934.1   911.7
JDK 1.1.6 JIT           60.4   135.9   85.2    127.7   492.2
JDK 1.2 Interpreter     71.0   133.7   221.5   989.4   867.8
JDK 1.2 JIT             59.8   108.8   146.2   146.1   321.9
Hard-Int                16.0   27.7    28.8    250.2   120.0
• Hard-Int performs consistently better than the interpreter
• In JIT mode, significant performance boost in 4 of 5 applications.
Compiler and Power
[Figure: a data dependence graph (DDG) over instructions A to F, and two schedules of it across cycles 1 to 4. One schedule: Peak Power = 3, Energy = 6. The other: Peak Power = 2, Energy = 6.]
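The two schedules in the figure have the same energy but different peak power. A minimal sketch of that bookkeeping, assuming each instruction costs one unit of power in the cycle it issues; the cycle assignments below are hypothetical, since the DDG edges are not recoverable from the slide.

```python
# Assumption: every instruction draws one unit of power in its issue cycle,
# so peak power is the widest cycle and energy is the total instruction count.

def peak_power(schedule):
    """Largest number of instructions issued in any single cycle."""
    return max(len(cycle) for cycle in schedule)


def energy(schedule):
    """Total number of instructions issued over the whole schedule."""
    return sum(len(cycle) for cycle in schedule)


# Hypothetical placements of instructions A..F into four cycles.
packed = [["A", "B", "C"], ["D", "E"], ["F"], []]
spread = [["A", "B"], ["C", "D"], ["E"], ["F"]]

if __name__ == "__main__":
    for name, sched in (("packed", packed), ("spread", spread)):
        print(name, "peak power =", peak_power(sched), "energy =", energy(sched))
```

Under this model a scheduler can lower peak power by spreading work across cycles without changing the energy, which is the contrast the slide draws.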
Valluri et al 2001 HPCA workshop
Quantitative Study
• Influence of state-of-the-art optimizations on energy and power of the processor examined
Optimizations studied
• Standard -O1 to -O4 of DEC Alpha’s cc compiler
• Four individual optimizations – simple basic-block instruction scheduling, loop unrolling, function inlining, and aggressive global scheduling
Standard Optimizations on Power
Benchmark  Opt level  Energy  Exec Time  Insts   Avg Power  IPC
compress   O0         100     100        100     100        100
compress   O1         74.48   81.55      81.52   91.33      99.96
compress   O2         75.13   81.44      82.04   92.25      100.73
compress   O3         75.13   81.44      82.04   92.25      100.73
compress   O4         79.01   82.77      86.11   95.45      104.03
go         O0         100     100        100     100        100
go         O1         66.2    64.13      68.94   103.23     107.5
go         O2         62.62   61.31      63.01   102.14     102.78
go         O3         62.62   61.31      63.01   102.14     102.78
go         O4         63.67   62.19      63.75   102.38     102.51
li         O0         100     100        100     100        100
li         O1         81.32   83.66      83.18   97.2       99.42
li         O2         79.6    75.97      82.97   104.78     109.21
li         O3         79.6    75.97      82.97   104.78     109.21
li         O4         85.71   77.89      90.96   110.05     116.78
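The columns of the table above are related: with every value normalized to O0 = 100, average power is energy divided by execution time, and IPC is instruction count divided by execution time. A short sketch that checks this against a few rows copied from the table; the function names are illustrative assumptions.

```python
# All inputs are percentages relative to the O0 run (O0 = 100).

def avg_power(energy, exec_time):
    """Normalized average power = normalized energy / normalized time."""
    return 100.0 * energy / exec_time


def ipc(insts, exec_time):
    """Normalized IPC = normalized instruction count / normalized time."""
    return 100.0 * insts / exec_time


# (energy, exec time, insts) for three rows of the table.
rows = {
    ("compress", "O1"): (74.48, 81.55, 81.52),
    ("go", "O1"): (66.2, 64.13, 68.94),
    ("li", "O2"): (79.6, 75.97, 82.97),
}

if __name__ == "__main__":
    for (bench, level), (e, t, n) in rows.items():
        print(f"{bench} {level}: avg power {avg_power(e, t):.2f}, IPC {ipc(n, t):.2f}")
```

This makes the trade-off explicit: an optimization that shortens execution time more than it cuts energy raises average power, as in the go and li rows.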
Somebody once said
“Computers are dumb actors and compilers/programmers are the master playwrights.”
A large part of modern out-of-order processors
is hardware that could have been eliminated if a good compiler existed.
Let me get more arrogant
A large part of modern out-of-order processors was designed because
computer architects thought compiler writers could not do a good job.
Value Prediction
Is a slap on your face
Shen and Lipasti
Value Locality
Likelihood that an instruction’s computed result or a similar predictable result will occur soon
Observation – a limited set of unique values constitutes the majority of values produced and consumed during execution
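One way to make the definition concrete is to measure how often an instruction produces a value it produced recently. A minimal sketch, assuming a trace of (instruction address, result) pairs; the trace format, the `depth` parameter, and the sample data are hypothetical.

```python
def value_locality(trace, depth=1):
    """Fraction of results that match one of the last `depth` distinct
    values produced by the same instruction address."""
    history = {}   # pc -> most recent distinct values, oldest first
    hits = 0
    for pc, value in trace:
        recent = history.setdefault(pc, [])
        if value in recent:
            hits += 1
            recent.remove(value)   # re-append below to mark it most recent
        recent.append(value)
        if len(recent) > depth:
            recent.pop(0)          # forget the oldest value
    return hits / len(trace) if trace else 0.0


# Hypothetical trace: one load that keeps returning 0, one that alternates.
trace = [(0x40, 0), (0x40, 0), (0x40, 0), (0x44, 7), (0x44, 8), (0x44, 7)]

if __name__ == "__main__":
    print("depth 1:", value_locality(trace, depth=1))
    print("depth 2:", value_locality(trace, depth=2))
```

A deeper history catches values that alternate, at the cost of more state per instruction.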
Load Value Locality
Causes of value locality
• Data redundancy – many 0s, sparse matrices, white space in files, empty cells in spreadsheets
• Program constants
• Computed branches – base address for jump tables is a run-time constant
• Virtual function calls – involve code to load a function pointer, which can be constant
Causes of value locality
Memory alias resolution – compiler conservatively generates code – may contain stores that alias with loads
Register spill code – stores and subsequent loads
Convergent algorithms – convergence in parts of algorithms before global convergence
Polling algorithms
2 Extremist Views
Anything that can be done in hardware should be done in hardware.
Anything that can be done in software should be done in software.
What do we need?
The dumb actor
Or
The defiant actor – who pays very little attention to the script
Challenging all compiler writers
The last 15 years were the defiant actor’s era
What about the next 15? TLP, Multithreading, Parallelizing compilers – It’s time for a lot more dumb acting from the architect’s side.
And it’s time for some good scriptwriting from the compiler writer’s side.
The University of Texas at Austin
BACKUP
Compiler Optimizations
cc – Native C compiler on DEC Alpha 21064 running the OSF/1 operating system
gcc – Used to study the effect of individual optimizations
Std Optimization Levels on cc
• -O0 – No optimizations performed
• -O1 – Local optimizations such as CSE, copy propagation, IVE, etc.
• -O2 – Inline expansion of static procedures and global optimizations such as loop unrolling and instruction scheduling
• -O3 – Inline expansion of global procedures
• -O4 – Software pipelining, loop vectorization, etc.
Std Optimization Levels on gcc
• -O0 – No optimizations performed
• -O1 – Local optimizations such as CSE, copy propagation, dead-code elimination, etc.
• -O2 – Aggressive instruction scheduling
• -O3 – Inlining of procedures
NOTE:
• Almost the same optimizations in each level of cc and gcc
• In cc and gcc, optimizations that increase ILP are in levels -O2, -O3, and -O4
• cc used wherever possible; gcc used where specific hooks are required
Individual Optimizations
Four gcc optimizations, all applied on top of -O1
-fschedule-insns – local register allocation followed by basic-block list scheduling
-fschedule-insns2 – Postpass scheduling done
-finline-functions – Integrated all simple functions into their callers
-funroll-loops – Perform the optimization of loop unrolling
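As an illustration of what loop unrolling does to a loop like saxpy, here is a hand-written sketch of the transformation. It is not gcc output: the unroll factor of 4 and the function names are assumptions, and a compiler picks its own factor and handles the remainder in its own way.

```python
def saxpy_rolled(a, x, y):
    """Reference loop: one loop-control step per element."""
    out = list(y)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out


def saxpy_unrolled4(a, x, y):
    """Same computation with the body replicated four times, so the
    loop-control overhead is paid once per four elements."""
    n = len(x)
    out = list(y)
    limit = n - n % 4          # largest multiple of 4 not exceeding n
    i = 0
    while i < limit:
        out[i] = a * x[i] + y[i]
        out[i + 1] = a * x[i + 1] + y[i + 1]
        out[i + 2] = a * x[i + 2] + y[i + 2]
        out[i + 3] = a * x[i + 3] + y[i + 3]
        i += 4
    while i < n:               # remainder loop for the leftover elements
        out[i] = a * x[i] + y[i]
        i += 1
    return out


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    y = [10.0] * 6
    print(saxpy_unrolled4(2.0, x, y))
```

Fewer loop-control instructions run per element, and the four independent statements give the scheduler more work to overlap in each iteration.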
Some observations
Energy consumption reduces when # of instructions is reduced, i.e., when the total work done is less, energy is less
Power dissipation is directly proportional to IPC
Observations (contd.)
Function inlining was found to be good for both power and energy
Unrolling was found to be good for energy consumption but bad for power dissipation
MMX/SIMD
Automatic usage of SIMD ISA still difficult 10+ years after introduction of MMX.
Standard Optimizations on Power (Contd)
Benchmark  Opt level  Energy  Exec Time  Insts   Avg Power  IPC
su2cor     O0         100     100        100     100        100
su2cor     O1         97.38   100.24     92.49   97.15      92.27
su2cor     O2         97.69   99.38      92.49   98.3       93.07
su2cor     O3         97.69   99.38      92.49   98.3       93.07
su2cor     O4         98.31   99.27      92.84   99.02      93.51
swim       O0         100     100        100     100        100
swim       O1         42.09   51.04      33.21   82.46      65.06
swim       O2         40.99   47.52      33.1    86.28      69.67
swim       O3         40.99   46.37      33.1    87.65      71.38
saxpy      O0         100     100        100     100        100
saxpy      O1         30.1    36.64      20.01   82.15      54.63
saxpy      O2         28.93   34.01      19.05   85.06      56.01
saxpy      O3         28.93   34.01      19.05   85.06      56.01