CS 612: Software Design for High-Performance Architectures
TRANSCRIPT
Administration

• Instructor: Keshav Pingali
  – 457 Rhodes Hall
  – [email protected]
• TA: Milind Kulkarni
  – 490 Rhodes Hall
  – [email protected]
Course content

• Understand high-end programming paradigms, compilers, and runtime systems
  – Application requirements
  – Shared-memory programming
  – Optimistic and pessimistic parallelization
  – Transactional memory
  – Memory hierarchy optimization
  – Self-optimizing systems
• Focus on the software problem for multicore processors
Problem

• Silicon designers can choose from a variety of methods to increase processor performance
• Commercial end-customers are demanding:
  – More capable systems with more capable processors
  – That new systems stay within their existing power/thermal infrastructure
• Processor frequency and power consumption seem to be scaling in lockstep
• How can the industry-standard PC and server industries stay on the historic performance curve without burning a hole in the motherboard?
What is a processor?

• A single chip package that fits in a socket
• ≥1 core (not much point in <1 core…)
  – Cores can have functional units, cache, etc. associated with them, just as today
  – Cores can be fast or slow, just as today
• Shared resources
  – More cache
  – Other integration: memory controllers, high-speed serial links, etc.
• One system interface no matter how many cores
  – Number of signal pins doesn’t scale with the number of cores
ILP Problem

• Functional units
  – Superscalar is known territory
  – Diminishing returns for adding more functional blocks
  – Alternatives like VLIW have been considered and rejected by the market
  – Single-threaded architectural performance is pegged
• Data paths
  – Increasing bandwidth between functional units in a core makes a difference
    • Such as comprehensive 64-bit design, but then where to?
ILP Problem (contd.)

• Pipeline
  – A deeper pipeline buys frequency at the expense of increased cache-miss penalty and lower instructions per clock
  – A shallow pipeline gives better instructions per clock at the expense of frequency scaling
  – Max frequency per core requires deeper pipelines
  – Industry converging on a middle ground: 9 to 11 stages
    • Successful RISC CPUs are in the same range
• Cache
  – Cache size buys performance at the expense of die size
  – Deep-pipeline cache-miss penalties are reduced by larger caches
Power problem

• Moore’s Law isn’t dead: more transistors for everyone!
  – But it doesn’t really mention scaling transistor power
• Chemistry and physics at the nano-scale
  – Stretching materials science
  – Transistor leakage current is increasing
• As manufacturing economies and frequency increase, power consumption is increasing disproportionately
• There are no process or architectural quick fixes
Static Current vs. Frequency

[Figure: static current vs. frequency (normalized frequency roughly 1.0–1.5). Static current grows non-linearly as processors approach maximum frequency: embedded parts sit at the low end; “fast, low power” and “fast, high power” parts in the middle; very high leakage and power at the top.]
Power vs. Frequency

• In AMD’s process, with 200MHz frequency steps, stepping back two steps from maximum frequency cuts power consumption by ~40%
• Substantially lower power with lower frequency
• Result: a dual-core part running at n-2 steps fits in the same thermal envelope as a single-core part running at top speed
AMD Multi-Core Processor

• Dual-core AMD Opteron™ processor is 199mm² in 90nm
• Single-core AMD Opteron processor is 193mm² in 130nm
Multi-Core Processor Architecture
Multi-Core Software

• More aggregate performance for:
  – Multi-threaded apps
  – Transactions: many instances of the same app
  – Multi-tasking
• Problem
  – Most apps are not multithreaded
  – Writing multithreaded code increases software costs dramatically (a factor of 3 for some game engines)
First problem: Parallelization

“We are at the cusp of a transition to multicore, multithreaded architectures, and we still have not demonstrated the ease of programming the move will require… I have talked with a few people at Microsoft Research who say this is also at or near the top of their list [of critical CS research problems].”

Justin Rattner, Senior Fellow, Intel
Second problem: memory hierarchy

“…The CPU chip industry has now reached the point that instructions can be executed more quickly than the chips can be fed with code and data. Future chip design is memory design. Future software design is also memory design. … Controlling memory access patterns will drive hardware and software designs for the foreseeable future.”

Richard Sites, DEC
Memory Hierarchy of SGI Octane

• R10K processor:
  – 4-way superscalar, 2 floating-point ops/cycle, 195MHz
  – Peak performance: 390 Mflops
• Experience: sustained performance is less than 10% of peak
  – Processor often stalls waiting for the memory system to load data

| Level    | Size                | Access time (cycles) |
|----------|---------------------|----------------------|
| Regs     | 64 registers        | n/a                  |
| L1 cache | 32KB (I) + 32KB (D) | 2                    |
| L2 cache | 1MB                 | 10                   |
| Memory   | 128MB               | 70                   |
Memory-wall solutions

• Latency avoidance:
  – multi-level memory hierarchies (caches)
• Latency tolerance:
  – pre-fetching
  – multi-threading
• Techniques are not mutually exclusive:
  – Most microprocessors have caches and pre-fetching
  – Modest multi-threading is coming into vogue
  – Our focus: memory hierarchies
Hiding latency in numerical codes

• Most numerical kernels: O(n³) work, O(n²) data
  – all factorization codes
    • Cholesky factorization: A = LLᵀ (A is symmetric positive-definite)
    • LU factorization: A = LU
    • LU factorization with pivoting: PA = LU
    • QR factorization: A = QR (Q is orthogonal)
  – BLAS-3: matrix multiplication
  – use latency-avoidance techniques
• Matrix-vector product: O(n²) work, O(n²) data
  – use latency-tolerance techniques such as pre-fetching
  – particularly important for iterative solution of large sparse systems
Software problem

• Caches are useful only if programs have locality of reference
  – temporal locality: references to a given memory address are clustered together in time
  – spatial locality: references that are clustered in address space are clustered in time
• Problem:
  – Programs obtained by expressing most algorithms in the straightforward way do not have much locality of reference
  – Worrying about locality when coding algorithms complicates the software process enormously
Example: matrix multiplication

• Great algorithmic data reuse: each array element is touched O(N) times!
• All six loop permutations are computationally equivalent (even modulo round-off error).
• However, execution times of the six versions can be very different if the machine has a cache.

    DO I = 1, N   // assume arrays stored in row-major order
      DO J = 1, N
        DO K = 1, N
          C(I,J) = C(I,J) + A(I,K)*B(K,J)
IJK version (large cache)

    DO I = 1, N
      DO J = 1, N
        DO K = 1, N
          C(I,J) = C(I,J) + A(I,K)*B(K,J)

• Large cache scenario:
  – Matrices are small enough to fit into cache
  – Only cold misses, no capacity misses
  – Miss ratio:
    • Data size = 3N²
    • Each miss brings in b floating-point numbers
    • Total memory accesses = 4N³ (read C, A, B; write C)
    • Miss ratio = 3N² / (b·4N³) = 0.75/(bN) = 0.019 (b = 4, N = 10)
IJK version (small cache)

    DO I = 1, N
      DO J = 1, N
        DO K = 1, N
          C(I,J) = C(I,J) + A(I,K)*B(K,J)

• Small cache scenario:
  – Matrices are large compared to the cache; row-major storage
  – Cold and capacity misses
  – Miss ratio:
    • C: N²/b misses (good temporal locality)
    • A: N³/b misses (good spatial locality)
    • B: N³ misses (poor temporal and spatial locality)
    • Miss ratio ≈ (N³/b + N³) / 4N³ = 0.25(b+1)/b = 0.3125 (for b = 4; C’s N²/b misses are asymptotically negligible)
MMM Experiments

• Simulated L1 cache miss ratio for the Intel Pentium III
  – MMM with N = 1…1300
  – 16KB cache, 32B/block, 4-way associative, 8-byte elements
Quantifying performance differences

    DO I = 1, N   // assume arrays stored in row-major order
      DO J = 1, N
        DO K = 1, N
          C(I,J) = C(I,J) + A(I,K)*B(K,J)

• Octane
  – L2 cache hit: 10 cycles; cache miss: 70 cycles
• Time to execute IKJ version: 2N³ + 70·0.13·4N³ + 10·0.87·4N³ = 73.2N³
• Time to execute JKI version: 2N³ + 70·0.5·4N³ + 10·0.5·4N³ = 162N³
• Speed-up = 2.2
• Key transformation: loop permutation
Even better…

• Break MMM into a bunch of smaller MMMs so that the large-cache model is true for each small MMM
• Then the large-cache model is valid for the entire computation
• Miss ratio will be 0.75/(bt) for the entire computation, where t is the tile size
Loop tiling

• Break the big MMM into a sequence of smaller MMMs, where each smaller MMM multiplies sub-matrices of size t×t.
• Parameter t (tile size) must be chosen carefully
  – as large as possible
  – working set of the small matrix multiplication must fit in cache

    DO It = 1, N, t
     DO Jt = 1, N, t
      DO Kt = 1, N, t
       DO I = It, It+t-1
        DO J = Jt, Jt+t-1
         DO K = Kt, Kt+t-1
          C(I,J) = C(I,J) + A(I,K)*B(K,J)

[Figure: t×t tiles of C, A, and B, with tile C(It,Jt) computed from tiles A(It,Kt) and B(Kt,Jt).]
Speed-up from tiling

• Miss ratio for the blocked computation
  = miss ratio under the large-cache model
  = 0.75/(bt)
  ≈ 0.001 (b = 4, t = 200) for the Octane
• Time to execute the tiled version: 2N³ + 70·0.001·4N³ + 10·0.999·4N³ ≈ 42.3N³
• Speed-up over the JKI version ≈ 4
Observations

• Locality-optimized code is more complex than the high-level algorithm.
• Loop orders and tile size must be chosen carefully
  – cache size is the key parameter
  – associativity matters
• Actual code is even more complex: must optimize for processor resources
  – registers: register tiling
  – pipeline: loop unrolling
  – Optimized MMM code can be ~1000 lines of C code
One solution to both problems: restructuring compilers (1985-)

• Programmer writes high-level, architecture-independent code
• Restructuring compiler optimizes the program for:
  – Number of cores
  – Number of registers
  – Cache organization
  – Instruction set: multiply-add? vector extensions? …
Two key issues

[Figure: a program P mapped to a set of equivalent programs P1, P2, P3, …]

1. Program restructuring: given a program P, determine the set of equivalent programs P1, P2, P3, …
2. Program selection: determine which program performs best on the target architecture
Automatic parallelization

• Pessimistic parallelization:
  – Compiler determines a partial order on program operations by determining dependences
  – At run-time, execute operations in parallel, respecting dependences
  – Works reasonably well for array programs but not for irregular data structures like trees and graphs
• Optimistic parallelization:
  – Execute operations speculatively in parallel, assuming that dependences do not exist
  – Check at runtime whether dependences are violated
  – If so, roll back execution to a “safe” point and re-execute sequentially
  – Works only if the optimism is warranted
  – Lots of interest in “transactional memory,” which is one model of optimistic parallelization
Automatic locality enhancement

• Some methodology exists for array programs, but little is known for irregular programs
• Many compilers can perform tiling and permutation automatically (e.g., gcc)
• Choosing parameter values (tile sizes, etc.):
  – Compiler can use architectural models
  – Self-optimizing systems: the system determines the best values using some kind of heuristic search (ATLAS, FFTW)
Course outline

• Application requirements
  – Scientific and engineering applications
  – Commercial workloads
• Shared-memory programming
  – Memory consistency models
  – OpenMP
• Optimistic and pessimistic parallelization
  – Dependence analysis techniques for array and irregular programs
  – Transactional memory models and implementations
• Automatic locality enhancement
• Self-optimizing systems
Course work

• Small number of programming assignments
• Paper presentations and class participation
  – We will have papers online by next Monday
  – Sign up for a presentation by next Thursday
• Substantial course project
  – independent reading
  – implementation work
  – presentation