System-level ISA (SISA) Graphs and Scratchpad Allocation Problem
Wenhao Jia
Research Seminar
5-20-2010
1
Hardware Diversity in Multi-core Era
Compile for every possible architecture and ship all of them? What about unknown architectures?
Programs
[Diagram: programs mapped onto diverse hardware: 2 cores; 6 cores; 4 heterogeneous cores; 4 cores with S/W-controlled local storage]
2
This Problem Looks Familiar…
Thanks to ISA, we’ve been able to achieve performance portability w/o recompilation
Programs
[Diagram: programs run unchanged as hardware evolves from a simple CPU to designs with on-chip caches, pipelining, and superscalar execution]
3
Performance Portability in Multi-core Systems
We need to create a new layer between programs and available hardware resources (cores, on-chip storage, etc.), which exposes resource requirements and data communications to allow dynamic parallelism management
4
System-level ISA (SISA)
Related Work – Performance Portability

NVIDIA CUDA
+ Programmers only know the total thread block count
+ Hardware manages concurrent execution of blocks
- Only works on NVIDIA GPUs, as a supplement to CPUs

OpenMP
+ Directives divide code into independent slave threads
- Usually exploits only coarse-grained parallelism
- Works best when the processor count is known in advance

Intel Thread Building Blocks
+ Runtime manager uses task stealing to balance workloads
- This is only a reactive model
- Programs' high-level information is not used

All three methods lack a way to efficiently express and manage data communication
5
My Work
Built a prototype SISA graph generator based on LLVM
Studied scratchpad allocation problem and built two allocators
Wrote a runtime simulator to evaluate SISA graph execution performance
6
Talk Outline
Introduction
SISA Graph Definition
Scratchpad Allocation
  Problem Description
  Approach
  Evaluation
Future Work & Conclusions
7
What Is a SISA Graph?
Chunk – the scheduling unit in SISA
  Single entry, single exit
  Explicitly marked data flow among chunks
  No external side effects

Read memory – Execute – Write memory
A chunk doesn't modify main memory until it finishes
8
[Diagram: a chunk reads i and data at entry and writes n and sum at exit]
Current LLVM-based Prototype
Only deals with DOALL loop programs for now:

float data[], sqr[];

void main() {
    int i;
    init(data);
    for (i = 0; i < MAX; i++)
        sqr[i] = data[i] * data[i];
    output(sqr);
    return;
}
[Diagram: chunk graph for the loop, entry → hdr → body → exit, with data-flow edges for data[], sqr[], and i and profiled edge counts]

9
Talk Outline
Introduction
SISA Graph Definition
Scratchpad Allocation
  Problem Description
  Approach
  Evaluation
Future Work & Conclusions
10
Scratchpad Allocation Problem
Assigning variables to finite-size, core-private local storage before the program runs
[Diagram: Cores 0 to 3, each with a private scratchpad; allocation can be static or dynamic]
11
Existing Work
Parallel programming models with support for explicitly hierarchical memory
  Sequoia [Fatahalian, SC 06]
  X10 [Charles, OOPSLA 05]
  Focus on how to express data flow more than on allocation strategies
Embedded systems
  Compile time: stack & global [Dominguez, JEC 05]
  Just before run: code & stack [Nguyen, CASES 05]
  Assume a known scratchpad size in particular applications
12
My Work
Does not assume the scratchpad size is known at compile time
Covers global, stack, and some heap objects
Static method -> dynamic method
For now, scratchpad contents don't change at runtime
13
Overall Approach
Identify which variables can be put in scratchpad
Allocate variables under the total scratchpad size constraint
  Baseline allocator
  Critical-path allocator
14
Variable Type Identification
Variables that won't be written by two worker threads can be put in scratchpad:
  WTs' stack variables
  WTs' heap variables
  Read-only global variables
  MT's heap variables declared for and solely used by WTs
Better pointer analysis will give more accurate data-dependency analysis
[Diagram: a DOALL program; a master thread (MT) spawns worker threads (WTs)]
15
Baseline Allocator – Algorithm

PV = getPrivateVariables()
for each V in PV
    V.use_per_byte = V.nLoads / V.size
sort_by_use_per_byte(PV)

for each V in PV in order
    if (V.size < scratchpad's free space)
        put V in scratchpad
        update scratchpad's free space

Greedy algorithm; reduces total memory loads
16
Baseline Allocator – Example

17

[Diagram: three chunks loading the variables: 8 x load D + 4 x load B; 8 x load D + 12 x load C; 40 x load A + 36 x load C]

Variables   Use per byte
A [10B]     40 / 10 = 4
C [16B]     48 / 16 = 3
D [8B]      16 / 8 = 2
B [4B]      4 / 4 = 1

32B scratchpad: A and C fit (26B used); D (8B) does not; B (4B) does
Process ends when the scratchpad is full or all variables are allocated
Critical-path-based Allocator

18

[Diagram: the same chunks; the critical path runs through the chunk with 40 x load A and 36 x load C, so A and C are considered first]

Variables   Use per byte
C [16B]     48 / 16 = 3
D [8B]      16 / 8 = 2
B [4B]      4 / 4 = 1

32B scratchpad
Critical-path-based Allocator – Algorithm

do
    CP = findCriticalPath()
    PV = getPrivateVariables(CP)
    use baseline allocator on PV
while (scratchpad content has changed)

if (scratchpad is not full)
    fill it up with the remaining variables

Reduces memory operations on the critical path
19
Simulation Set-up

No cache at each core
No memory bandwidth limit
Wrote a recursion-based runtime predictor (< 5.7% error for blackscholes, FFT, LU, RADIX & OCEAN)

[Diagram: Core 0 … Core N, each with registers and a scratchpad (R: 6 cycles, W: 4 cycles), all sharing main memory (R/W: 160 cycles)]

20
Variable Types in Various Programs

[Chart: distribution of load instruction types (global, stack, heap, unidentified) for blackscholes, FFT, and LU]

21

Program                           blackscholes   FFT    LU
Private loads / all loads         99%            100%   99%
Required scratchpad size (bytes)  428K           2.4K   66K
Result – Baseline Allocator

[Chart: main-memory loads vs. scratchpad size for blackscholes (428K required), FFT (2.4K), and LU (66K); loads shift from main memory to scratchpad as the scratchpad approaches each program's required size]

22
Result – CP-based Allocator

Gives the same results as the baseline on the previous programs; working to refine this

[Diagram: switch (input) branches to path A, which loads key1, and path B, which loads key2]

23

2 threads, each executing 1 branch; the scratchpad holds only one key

                 No Allocator   Baseline   CP-based
Path optimized   -              B          A
In scratchpad    -              key2       key1
Path A runtime   985            985        831
Path B runtime   701            547        701
Finish time      985            985        831
Talk Outline
Introduction
SISA Graph Definition
Scratchpad Allocation
  Problem Description
  Approach
  Evaluation
Future Work & Conclusions
24
Future Work
Scratchpad allocation
  Explore dynamic allocation
  Incorporate a better pointer analyzer
  Per-thread profiling for criticality
In simulation, make each loop iteration independent
SISA graph generator
  Go beyond DOALL programs
25
Conclusions
A prototype 3-phase SISA system was built
  Static: chunk generator; block/loop/malloc profiler
  Pre-run: scratchpad allocator
  Dynamic: runtime-predictor-based simulator
A baseline allocator that reduces overall memory loads and a critical-path-based allocator that reduces memory loads on the critical path were built
The simulator verifies allocation results with less than 5.7% error.
26
Thank You!
27
References

Performance Portability
  NVIDIA CUDA Programming Guide, http://www.nvidia.com/cuda/
  OpenMP: An Industry Standard API for Shared-Memory Programming, L. Dagum et al., IEEE Computational Science & Engineering, 1998
  Intel TBB Reference Manual, http://www.threadingbuildingblocks.org/

Scratchpad – Parallel Programming Models
  Sequoia: Programming the Memory Hierarchy, K. Fatahalian et al., Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC), 2006
  X10: An Object-Oriented Approach to Non-Uniform Cluster Computing, K. Ebcioglu et al., Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2005

Scratchpad – Embedded Systems
  Heap Data Allocation to Scratch-Pad Memory in Embedded Systems, A. Dominguez et al., Journal of Embedded Computing, Vol. 1, No. 4, pp. 521-540, 2005
  Memory Allocation for Embedded Systems with a Compile-Time-Unknown Scratch-Pad Size, N. Nguyen et al., Proceedings of the ACM International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), 2005

28
Related Work – CUDA

Threads can only interact within a block
Programmers only know the total block count
Hardware manages block concurrency

[Diagram: the same program's thread blocks scheduled onto a slower GPU and a faster GPU]

29

It works only on NVIDIA GPUs, as a supplement to CPUs
Related Work – OpenMP
Use programming directives to divide code into independent slave threads
30
It exploits only coarse-grained parallelism
It works best when the core count is known
Related Work – Intel Thread Building Blocks
Runtime manager uses task stealing to ensure workload balance
[Diagram: at t = 0, Core 0's task queue holds tasks 1, 2, 3 while Core 1's holds task 6; at t = t0, Core 1 has stolen task 3 from Core 0]

31

It uses a reactive method to manage parallelism
High-level information is lost
3 Phases of System-level ISA (SISA)

Static Phase
  Programs are converted to SISA graphs
  SISA graphs are bundled along with partially compiled executables

Pre-run Mapping
  Available system resources are known (core count, on-chip storage size, etc.)
  Executable binaries are generated accordingly

Dynamic Phase
  Runtime system manages program execution with the help of SISA graphs (task mapping, migration, prefetching, etc.)

32
SISA Graphs

[Diagram: static phase: LLVM bitcode feeds the chunk generator, which produces SISA graphs; pre-run mapping: the scratchpad allocator; dynamic phase: the simulator, with a task scheduler and per-core runtime predictors, outputs the predicted runtime]

33
What We Use a Runtime Predictor for
Simulate how SISA programs execute on various hardware
  Vary core count and evaluate speed-up
Use it in various SISA components
  Guide scratchpad allocation
  Guide task scheduling
34
A Recursive Runtime Predictor

The chunk generator has annotated edges with profiling data

[Diagram: a profiled chunk graph being recursively collapsed node by node]

Recursively reduce the graph
Predicted runtime: 82
Backtracking gives the critical path

A branch node reduces to its expected time, e.g. (2 / (2 + 3)) × 6 + (3 / (2 + 3)) × 3 = 4.2

35
[Diagram: further reduction steps, including dividing a loop's work across n cores (n: core count)]
Single-threaded Accuracy
Major error source: loop bodies with vastly varying dynamic length (LU)
Program            blackscholes                            FFT     LU     RADIX   OCEAN
                   CNDF()   bsthread()   main()   Total
Avg inst executed  43       25M          0.31M    61M    0.52M   10M    43M     630M
Error              1.9%     1.6%         0.0%     1.8%   2.2%    5.7%   0.0%    0.4%
36
Multi-threaded Predictions
[Chart: normalized instruction count per core vs. core count (1 to 128) for blackscholes, FFT, LU, and RADIX, compared against ideal speed-up]
37
Identify Variable Types

                      Thread-shared               Thread-private
Master thread stack   X
Master thread heap    If used by master           If used by workers*
Worker thread stack                               X
Worker thread heap                                X
Global                If modified by workers**    If read-only by workers

Private variables: variables that different threads won't have contention on

* A simplistic assumption for DOALL programs
** A conservative (context-insensitive) assumption

[Diagram: a DOALL program; a master thread spawns worker threads]

38

Only private variables have the potential to be put in scratchpad
Critical-path-based Allocator – Example

Optimizing memory loads not on the critical path may not reduce overall program runtime

[Diagram: switch (input) branches to one chunk that loads key1 and calls process(key1, input) and another that loads key2 and calls process(key2, input); the scratchpad holds only one key]

1. Find the critical path
2. Allocate the variables used on the critical path
3. Loop over

39