System-level ISA (SISA) Graphs and Scratchpad Allocation Problem
Wenhao Jia
Research Seminar
5-20-2010
1
Hardware Diversity in Multi-core Era
Compile for every possible architecture and ship all of them? What about unknown architectures?
Programs
[Diagram: programs mapped onto diverse hardware: 2 cores; 6 cores; 4 heterogeneous cores; 4 cores with S/W-controlled local storage]
2
This Problem Looks Familiar…
Thanks to ISA, we’ve been able to achieve performance portability w/o recompilation
Programs
[Diagram: programs run unchanged as hardware evolves from a simple CPU to designs with on-chip caches, pipelining, and superscalar execution]
3
Performance Portability in Multi-core Systems
We need to create a new layer between programs and available hardware resources (cores, on-chip storage, etc.), which exposes resource requirements and data communications to allow dynamic parallelism management
4
System-level ISA (SISA)
Related Work – Performance Portability

NVIDIA CUDA
+ Programmers only know the total thread block count
+ Hardware manages concurrent execution of blocks
- Only works on NVIDIA GPUs, as a supplement to CPUs

OpenMP
+ Directives divide code into independent slave threads
- Usually exploits only coarse-grained parallelism
- Works best when the processor count is known in advance

Intel Thread Building Blocks
+ Runtime manager uses task stealing to balance workloads
- This is only a reactive model
- Programs' high-level information is not used

All three methods lack a way to efficiently express and manage data communication
5
My Work
Built a prototype SISA graph generator based on LLVM
Studied scratchpad allocation problem and built two allocators
Wrote a runtime simulator to evaluate SISA graph execution performance
6
Talk Outline
Introduction
SISA Graph Definition
Scratchpad Allocation
  Problem Description
  Approach
  Evaluation
Future Work & Conclusions
7
What Is a SISA Graph?
Chunk – the scheduling unit in SISA
  Single entry, single exit
  Explicitly marked data flow among chunks
  No external side effects

Read memory – Execute – Write memory
A chunk doesn't modify main memory until it finishes
8
[Diagram: a chunk reads i and data at entry and writes n and sum at exit]
Current LLVM-based Prototype
Only deals with DOALL loop programs for now:

float data[], sqr[];

void main() {
    int i;
    init(data);
    for (i = 0; i < MAX; i++)
        sqr[i] = data[i] * data[i];
    output(sqr);
    return;
}
[Diagram: chunk graph for the loop, entry → hdr → body → exit, with data-flow edges for data[], sqr[], and i and profiled edge counts]

9
Talk Outline
Introduction
SISA Graph Definition
Scratchpad Allocation
  Problem Description
  Approach
  Evaluation
Future Work & Conclusions
10
Scratchpad Allocation Problem
Assigning variables to finite-size, core-private local storage before the program runs
[Diagram: Cores 0 to 3, each with a private scratchpad; allocation can be static or dynamic]
11
Existing Work
Parallel programming models with support for explicitly hierarchical memory
  Sequoia [Fatahalian, SC 06]
  X10 [Charles, OOPSLA 05]
  Focus on how to express data flow more than on allocation strategies
Embedded systems
  Compile time: stack & global [Dominguez, JEC 05]
  Just before run: code & stack [Nguyen, CASES 05]
  Assume a known scratchpad size in particular applications
12
My Work
Does not assume the scratchpad size is known at compile time
Covers global, stack, and some heap objects
Static method -> dynamic method
For now, scratchpad contents don't change at runtime
13
Overall Approach
Identify which variables can be put in scratchpad
Allocate variables under the total scratchpad size constraint
  Baseline allocator
  Critical-path allocator
14
Variable Type Identification
Variables that won't be written by two worker threads can be put in scratchpad:
  WTs' stack variables
  WTs' heap variables
  Read-only global variables
  MT's heap variables declared for and solely used by WTs
Better pointer analysis will give more accurate data-dependency analysis
[Diagram: a DOALL program; a master thread (MT) spawns worker threads (WTs)]
15
Baseline Allocator – Algorithm

PV = getPrivateVariables()
for each V in PV
    V.use_per_byte = V.nLoads / V.size
sort_by_use_per_byte(PV)

for each V in PV in order
    if (V.size < scratchpad's free space)
        put V in scratchpad
        update scratchpad's free space

Greedy algorithm; reduces total memory loads
16
Baseline Allocator – Example

17

[Diagram: three chunks loading the variables: 8 x load D + 4 x load B; 8 x load D + 12 x load C; 40 x load A + 36 x load C]

Variables   Use per byte
A [10B]     40 / 10 = 4
C [16B]     48 / 16 = 3
D [8B]      16 / 8 = 2
B [4B]      4 / 4 = 1

32B scratchpad: A and C fit (26B used); D (8B) does not; B (4B) does
Process ends when the scratchpad is full or all variables are allocated
Critical-path-based Allocator

18

[Diagram: the same chunks; the critical path runs through the chunk with 40 x load A and 36 x load C, so A and C are considered first]

Variables   Use per byte
C [16B]     48 / 16 = 3
D [8B]      16 / 8 = 2
B [4B]      4 / 4 = 1

32B scratchpad
Critical-path-based Allocator – Algorithm

do
    CP = findCriticalPath()
    PV = getPrivateVariables(CP)
    use baseline allocator on PV
while (scratchpad content has changed)

if (scratchpad is not full)
    fill it up with the remaining variables

Reduces memory operations on the critical path
19
Simulation Set-up

No cache at each core
No memory bandwidth limit
Wrote a recursion-based runtime predictor (< 5.7% error for blackscholes, FFT, LU, RADIX & OCEAN)

[Diagram: Core 0 … Core N, each with registers and a scratchpad (R: 6 cycles, W: 4 cycles), all sharing main memory (R/W: 160 cycles)]

20
Variable Types in Various Programs

[Chart: distribution of load instruction types (global, stack, heap, unidentified) for blackscholes, FFT, and LU]

21

Program                           blackscholes   FFT    LU
Private loads / all loads         99%            100%   99%
Required scratchpad size (bytes)  428K           2.4K   66K
Result – Baseline Allocator

[Chart: main-memory loads vs. scratchpad size for blackscholes (428K required), FFT (2.4K), and LU (66K); loads shift from main memory to scratchpad as the scratchpad approaches each program's required size]

22
Result – CP-based Allocator

Gives the same results as the baseline on the previous programs; working to refine this

[Diagram: switch (input) branches to path A, which loads key1, and path B, which loads key2]

23

2 threads, each executing 1 branch; the scratchpad holds only one key

                 No Allocator   Baseline   CP-based
Path optimized   -              B          A
In scratchpad    -              key2       key1
Path A runtime   985            985        831
Path B runtime   701            547        701
Finish time      985            985        831
Talk Outline
Introduction
SISA Graph Definition
Scratchpad Allocation
  Problem Description
  Approach
  Evaluation
Future Work & Conclusions
24
Future Work
Scratchpad allocation
  Explore dynamic allocation
  Incorporate a better pointer analyzer
  Per-thread profiling for criticality
In simulation, make each loop iteration independent
SISA graph generator
  Go beyond DOALL programs
25
Conclusions
A prototype 3-phase SISA system was built
  Static: chunk generator; block/loop/malloc profiler
  Pre-run: scratchpad allocator
  Dynamic: runtime-predictor-based simulator
A baseline allocator that reduces overall memory loads and a critical-path-based allocator that reduces memory loads on the critical path were built
The simulator verifies allocation results with less than 5.7% error.
26
Thank You!
27
References

Performance Portability
  NVIDIA CUDA Programming Guide, http://www.nvidia.com/cuda/
  OpenMP: An Industry Standard API for Shared-Memory Programming, L. Dagum et al., IEEE Computational Science & Engineering, 1998
  Intel TBB Reference Manual, http://www.threadingbuildingblocks.org/

Scratchpad – Parallel Programming Models
  Sequoia: Programming the Memory Hierarchy, K. Fatahalian et al., Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC), 2006
  X10: An Object-Oriented Approach to Non-Uniform Cluster Computing, K. Ebcioglu et al., Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2005

Scratchpad – Embedded Systems
  Heap Data Allocation to Scratch-Pad Memory in Embedded Systems, A. Dominguez et al., Journal of Embedded Computing, Vol. 1, No. 4, pp. 521-540, 2005
  Memory Allocation for Embedded Systems with a Compile-Time-Unknown Scratch-Pad Size, N. Nguyen et al., Proceedings of the ACM International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), 2005

28
Related Work – CUDA

Threads can only interact within a block
Programmers only know the total block count
Hardware manages block concurrency

[Diagram: the same program's thread blocks scheduled onto a slower GPU and a faster GPU]

29

It works only on NVIDIA GPUs, as a supplement to CPUs
Related Work – OpenMP
Use programming directives to divide code into independent slave threads
30
It exploits only coarse-grained parallelism
It works best when the core count is known
Related Work – Intel Thread Building Blocks
Runtime manager uses task stealing to ensure workload balance
[Diagram: at t = 0, Core 0's task queue holds tasks 1, 2, 3 while Core 1's holds task 6; at t = t0, Core 1 has stolen task 3 from Core 0]

31

It uses a reactive method to manage parallelism
High-level information is lost
3 Phases of System-level ISA (SISA)

Static Phase
  Programs are converted to SISA graphs
  SISA graphs are bundled along with partially compiled executables

Pre-run Mapping
  Available system resources are known (core count, on-chip storage size, etc.)
  Executable binaries are generated accordingly

Dynamic Phase
  Runtime system manages program execution with the help of SISA graphs (task mapping, migration, prefetching, etc.)

32
SISA Graphs

[Diagram: static phase: LLVM bitcode feeds the chunk generator, which produces SISA graphs; pre-run mapping: the scratchpad allocator; dynamic phase: the simulator, with a task scheduler and per-core runtime predictors, outputs the predicted runtime]

33
What We Use a Runtime Predictor for
Simulate how SISA programs execute on various hardware
  Vary core count and evaluate speed-up
Use it in various SISA components
  Guide scratchpad allocation
  Guide task scheduling
34
A Recursive Runtime Predictor

The chunk generator has annotated edges with profiling data

[Diagram: a profiled chunk graph being recursively collapsed node by node]

Recursively reduce the graph
Predicted runtime: 82
Backtracking gives the critical path

A branch node reduces to its expected time, e.g. (2 / (2 + 3)) × 6 + (3 / (2 + 3)) × 3 = 4.2

35
[Diagram: further reduction steps, including dividing a loop's work across n cores (n: core count)]
Single-threaded Accuracy
Major error source: loop bodies with vastly varying dynamic length (LU)
Program            blackscholes                            FFT     LU     RADIX   OCEAN
                   CNDF()   bsthread()   main()   Total
Avg inst executed  43       25M          0.31M    61M    0.52M   10M    43M     630M
Error              1.9%     1.6%         0.0%     1.8%   2.2%    5.7%   0.0%    0.4%
36
Multi-threaded Predictions
[Chart: normalized instruction count per core vs. core count (1 to 128) for blackscholes, FFT, LU, and RADIX, compared against ideal speed-up]
37
Identify Variable Types

                      Thread-shared               Thread-private
Master thread stack   X
Master thread heap    If used by master           If used by workers*
Worker thread stack                               X
Worker thread heap                                X
Global                If modified by workers**    If read-only by workers

Private variables: variables that different threads won't have contention on

* A simplistic assumption for DOALL programs
** A conservative (context-insensitive) assumption

[Diagram: a DOALL program; a master thread spawns worker threads]

38

Only private variables have the potential to be put in scratchpad
Critical-path-based Allocator – Example

Optimizing memory loads not on the critical path may not reduce overall program runtime

[Diagram: switch (input) branches to one chunk that loads key1 and calls process(key1, input) and another that loads key2 and calls process(key2, input); the scratchpad holds only one key]

1. Find the critical path
2. Allocate the variables used on the critical path
3. Loop over

39