
Streaming-Oriented Parallelization of

Domain-Independent Irregular Kernels

UnConventional High Performance Computing 2010 (UCHPC)

Computer Architecture Group, University of A Coruña, Spain

UNIVERSITY OF A CORUÑA

Jacobo Lobeiras Blanco

Margarita Amor López

Manuel Carlos Arenaz Silva

Basilio B. Fraguela Rodríguez


Presentation Overview

1. Motivation

• GPU programming

• Computational Kernel Analysis

• Brook+ language

2. Computational kernel parallelization

• Assignment

• Reduction

3. Performance analysis

4. Conclusions


GPU programming

• GPU performance is evolving quickly. However, compared to CPUs, GPUs still have many architectural restrictions and a more complex programming model that requires special languages and tools.

1. HLSL

2. Cg

3. CUDA

4. Parallel Nsight

5. BrookGPU

6. Brook+

7. AMD Stream Profiler

8. OpenCL


GPU programming

CPU

• General purpose processors. Quad-core models are widespread, and their typical computing power is about 50 GFLOPS.

• Easily programmable using standard languages like C++ or Java, with parallel standards like OpenMP and advanced debugging tools.

GPU

• Graphics-oriented processor capable of running thousands of simultaneous threads, with TFLOPS-level computing power.

• Complex, hardware-dependent low-level programming.

• Proprietary high-level languages like CUDA or Brook+; directive-based proposals are still experimental.

• Recent efforts have led to the creation of OpenCL, a standard for heterogeneous computing.


GPU programming

CPU

• High clock speed and out-of-order execution, optimized for sequential performance.

• Memory architecture designed for low latency access.

• Complex processing cores and large chip area devoted to cache.

GPU

• High bandwidth and high throughput memory architecture.

• Small cache thanks to memory latency hiding techniques like multithreading.

• Most of the chip area devoted to hundreds of small processing units.

[Figure: die comparison. Penryn core (45 nm Intel Core 2), 107 mm2; RV770 core (55 nm ATI Radeon 4850), 260 mm2.]


GPU programming

Radeon 4850 architecture diagram

- 10 SIMD modules

- 16 SPs per SIMD module, as well as a texture unit and 16 KB of shared memory

- Each SP has 5 processing elements; however, only the T unit supports FP64

- Four 64-bit memory channels


GPU programming

• Parallelizing applications on GPUs is a complex task that usually requires specially designed algorithms to exploit their advantage in computational power. A straightforward port tends to provide little performance benefit. Common obstacles include:

- Limited shared memory

- Coalescence issues

- PCI-E connection bandwidth and latency

- Low sequential and divergent code performance


Motivation

• There are several well-known general parallelization strategies for CPUs that can be applied to many problems. Studying how they can be adapted to GPUs is of great interest, because it can reduce development time and effort.

[Figure: four general parallel patterns: parallel for (0 1 2 3 → 0 1 4 9), parallel reduction (0 1 2 3 → 1 5 → 6), parallel recursion (36 → 9 4 → 3 3 2 2) and parallel scan.]


Computational Kernel Analysis

• In this work, a computational kernel is a regular code pattern frequently used in programs. We use domain-independent, concept-level computational kernels, which enable the recognition of code patterns independently of the programming language.

• Computational kernel analysis is a tool that enables the identification of potential code parallelism without in-depth knowledge of the underlying algorithms.

• Computational kernels can be classified into several families, depending on their characteristics and memory access patterns.

a) Assignments b) Inductions c) Reductions

d) Masks e) Recurrences f) Reinitializations


Brook+ language

Brook+ is based on C, but follows a SIMD streaming paradigm.

• Data resides in array-like structures called streams.

• In each invocation, a kernel is executed over the whole domain in parallel, creating a thread for each element of the output.

• By default, Brook+ uses cached memory reads, so coalescence is not an issue. However, each thread can only write to one specific location of the output stream.

• The language also supports reductions to perform collective operations, like finding the maximum element of a set.
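The execution model described on this slide can be imitated on the CPU with a short Python sketch. This is not Brook+ code: `run_kernel`, `reduce_stream` and `saxpy` are hypothetical helpers that only mimic the stated rules, namely one logical thread per output element, unrestricted reads, and a single write location per thread.

```python
from functools import reduce


def run_kernel(kernel, out_size, *inputs):
    """Emulate a streaming kernel launch: one logical thread per output element.

    Each call to `kernel` may read any element of the input streams, but its
    return value is stored only at its own output index (no scattered writes).
    """
    return [kernel(i, *inputs) for i in range(out_size)]


def reduce_stream(op, stream):
    """Emulate a reduction: collapse a whole stream into a single value."""
    return reduce(op, stream)


def saxpy(i, a, x, y):
    # Thread i reads x[i] and y[i] and produces output element i.
    return a * x[i] + y[i]


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]

# Launch the kernel over the whole output domain, then reduce the result.
out = run_kernel(lambda i, xs, ys: saxpy(i, 2.0, xs, ys), len(x), x, y)
largest = reduce_stream(max, out)
```

The list comprehension runs sequentially here; on a GPU each index would be handled by its own thread, which is why a kernel cannot choose where its result is written.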


Assignment

Assignment is the simplest kernel: it stores a value in the specified memory address.

1. If the value to write is a scalar, like the evaluation of an expression or the solution of an equation, it is called scalar assignment.

2. If the memory is accessed through an indexed variable, like an array, and the access pattern can be expressed as a linear, polynomial or geometric function, it is a regular assignment.

3. If the memory is accessed through an indexed variable, but the access pattern is irregular or unknown at compile time, it is called irregular assignment.

[Figure panels: SCALAR, REGULAR, IRREGULAR]
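The three assignment families can be illustrated with a few sequential loops. This is a minimal sketch in Python; the array names and the values of the indirection array `f` are chosen for illustration and are 0-based.

```python
n = 6
src = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Scalar assignment: the result of an expression is stored in a scalar.
s = 3.0 * 2.0 + 1.0

# Regular assignment: the write index is a linear function of i (stride 2),
# so the access pattern is known before the loop runs.
a = [0.0] * n
for i in range(n // 2):
    a[2 * i] = s

# Irregular assignment: the write index comes from an indirection array f
# whose contents are only known at run time. Index 2 is written twice and
# index 3 is never written.
f = [0, 1, 2, 2, 5, 4]
b = [0.0] * n
for i in range(n):
    b[f[i]] = src[i]
```

In the irregular case the final value of `b[2]` depends on which iteration runs last, which is why a naive parallel loop may not reproduce the sequential result.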


Irregular assignment

[Figure: inspector indirection analysis (the indirection array and the inspector array derived from it), followed by executor parallel processing.]

The proposed solution, which keeps the parallel result equivalent to the sequential one, is based on an inspector-executor strategy:

1. An inspector function analyzes the access pattern of the indirections. This analysis is normally computed by a single processor.

2. An executor function uses the vector generated by the inspector analysis to distribute the iterations among the processors, ensuring that there will be no write conflicts.

Another advantage of this technique is that it enables dynamic dead code elimination at run time.
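The two steps above can be sketched in Python as follows. This is an illustrative reading of the strategy, not the authors' implementation: the function names, the 0-based indices and the use of -1 for "never written" are assumptions.

```python
def inspector(f, n_targets):
    """Sequential pass: for each target index, record the last iteration
    that writes to it (-1 if no iteration does)."""
    last_writer = [-1] * n_targets
    for i, target in enumerate(f):
        last_writer[target] = i  # a later iteration overrides an earlier one
    return last_writer


def executor(last_writer, src, dst):
    """Each target element is handled independently, so on a GPU this loop
    could run with one thread per output element and no write conflicts."""
    for j, i in enumerate(last_writer):
        if i >= 0:
            dst[j] = src[i]
    return dst


f = [0, 1, 2, 2, 5, 4]                 # illustrative indirection array
src = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Reference: the original sequential irregular assignment.
sequential = [0.0] * 6
for i, target in enumerate(f):
    sequential[target] = src[i]

insp = inspector(f, 6)
parallel = executor(insp, src, [0.0] * 6)
```

Iteration 2 never appears in `insp` because iteration 3 overwrites the same target, so the executor skips it entirely. This is the run-time dead code elimination mentioned above.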


Reduction

The reduction kernel combines two or more elements to obtain a single value.

1. The scalar reduction reduces a complete array of elements to a single value, for example the sum or the maximum of a vector.

2. The regular reduction computes several scalar reductions over the elements in a set or array using a regular write pattern.

3. The irregular reduction uses an irregular or unknown access pattern to compute the reduction of an array. The operation is performed according to an indirection array or function.

[Figure panels: SCALAR, REGULAR, IRREGULAR]
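A minimal Python sketch of the three reduction families follows. The data, the 2x3 row layout and the 0-based indirection array `f` are assumptions chosen for illustration.

```python
v = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Scalar reduction: the whole array collapses to a single value.
total = sum(v)
largest = max(v)

# Regular reduction: several scalar reductions with a regular write
# pattern, here one sum per row of a 2x3 matrix stored row by row.
rows, cols = 2, 3
row_sums = [0.0] * rows
for i in range(rows):
    for j in range(cols):
        row_sums[i] += v[i * cols + j]

# Irregular reduction: the accumulator that receives each element is
# selected by an indirection array known only at run time.
f = [0, 1, 2, 2, 5, 4]
acc = [0.0] * 6
for i in range(len(v)):
    acc[f[i]] += v[i]
```

Iterations 2 and 3 both update `acc[2]`, so running the last loop in parallel without coordination would produce a write conflict.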


Irregular reduction

[Figure: inspector indirection analysis (the indirection array and a two-column inspector table), followed by executor parallel processing.]

The proposed solution to avoid the write conflict is based on an inspector-executor strategy:

1. An inspector function sequentially analyzes the write pattern of the indirections, grouping together iterations that write to the same memory address.

2. The executor function distributes among the processors the iterations of the vector generated by the inspector so that there are no write conflicts.

In this case the inspector array is a rectangular table, with as many columns as the maximum number of reductions to the same address.
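The rectangular inspector table can be sketched in Python as shown below. This is an illustration under stated assumptions rather than the authors' code: indices are 0-based, unused table cells hold -1, and the function names are hypothetical.

```python
def inspector(f, n_targets):
    """Sequential pass: group the iterations that write to each target.

    Returns a rectangular table with one row per target and as many columns
    as the largest number of writes to a single target; unused cells are -1.
    """
    groups = [[] for _ in range(n_targets)]
    for i, target in enumerate(f):
        groups[target].append(i)
    width = max((len(g) for g in groups), default=0)
    return [g + [-1] * (width - len(g)) for g in groups]


def executor(table, src, dst):
    """Each row belongs to exactly one target, so rows can be processed in
    parallel (one thread per output element) without write conflicts."""
    for j, row in enumerate(table):
        for i in row:
            if i >= 0:
                dst[j] += src[i]
    return dst


f = [0, 1, 2, 2, 5, 4]                 # illustrative indirection array
src = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

table = inspector(f, 6)
result = executor(table, src, [0.0] * 6)
```

Only the row for target 2 uses both columns, so most of the second column is padding. This shows why the table can waste space when the writes are unevenly distributed.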


Irregular reduction

[Figure: the same indirection array analyzed by a 1-indirection-level inspector (rectangular table) and by a 2-indirection-level inspector (a contiguous inspector array plus a positions array).]

Although many problems have a well-balanced data distribution, when the distribution is uneven and a few elements receive most of the writes, the rectangular table wastes space.

A more space-efficient alternative uses two arrays: one stores all the inspector data contiguously, and the other points to the address where the data of each iteration begins.

The drawback of this second approach is that it requires an additional indirection level, and according to our tests this results in worse GPU performance.
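The two-array alternative resembles a compressed row layout. The following Python sketch shows one possible construction; the names `data` and `positions`, the 0-based indices and the extra end marker in `positions` are assumptions.

```python
def inspector_two_level(f, n_targets):
    """Build a contiguous inspector array plus a positions array.

    The iterations that write to target j are stored in
    data[positions[j]:positions[j + 1]].
    """
    counts = [0] * n_targets
    for target in f:
        counts[target] += 1

    # A prefix sum over the counts gives where each target's segment starts.
    positions = [0] * (n_targets + 1)
    for j in range(n_targets):
        positions[j + 1] = positions[j] + counts[j]

    data = [0] * len(f)
    cursor = positions[:-1]  # copy: next free slot in each segment
    for i, target in enumerate(f):
        data[cursor[target]] = i
        cursor[target] += 1
    return data, positions


def executor_two_level(data, positions, src, dst):
    """One independent reduction per target. Reading src[data[k]] is the
    additional indirection level compared with the rectangular table."""
    for j in range(len(dst)):
        for k in range(positions[j], positions[j + 1]):
            dst[j] += src[data[k]]
    return dst


f = [0, 1, 2, 2, 5, 4]                 # illustrative indirection array
src = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

data, positions = inspector_two_level(f, 6)
result = executor_two_level(data, positions, src, [0.0] * 6)
```

The storage no longer depends on the most heavily written target, but every access to `src` now goes through `positions` and then `data`.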


Performance analysis

Test platform:

- Phenom II X4 940 at 3.0 GHz, 4 GB DDR2 800 CL5, 790X chipset

- Radeon 4850 using Catalyst driver 9.12

- Windows XP x64, MS Visual C++ 2005 (x64), Brook+ 1.4

• The test input sizes were 512x512, 1024x1024 and 2048x2048 elements, with three levels of inspector reusability: R0, R10 and R100.

• To simulate a light computational load, we perform about 100 floating point operations per element.

• The GPU uses the CPU inspector without any modification. However, the analysis must be transferred to GPU memory before launching the executor.


Performance analysis

Benchmark  Reuse  CPU 1P (Original)  CPU 2P (OpenMP)  CPU 4P (OpenMP)  GPU (Brook+)
Asig_Irr   R0     71.44  ----        37.20  1.9x      22.94  3.1x      10.59   6.7x
Asig_Irr   R10    71.61  ----        26.92  2.7x      13.64  5.2x       1.64  43.7x
Asig_Irr   R100   71.38  ----        25.83  2.8x      12.70  5.6x       0.91  78.4x
Red_Irr    R0     75.09  ----        65.69  1.1x      44.81  1.7x      29.98   2.5x
Red_Irr    R10    75.27  ----        43.42  1.7x      22.19  3.4x       4.01  18.8x
Red_Irr    R100   75.12  ----        41.09  1.8x      19.97  3.8x       1.28  58.7x

Execution time in seconds, with speedup over CPU 1P, for the 2048x2048 problem size and several reusability levels.
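The speedup figures in the table are the single-processor time divided by the time of each parallel version. A small Python sketch, using only the times reported above, reproduces that arithmetic; the dictionary layout and the `speedups` helper are hypothetical.

```python
# Execution times in seconds, copied from the table:
# (CPU 1P, CPU 2P OpenMP, CPU 4P OpenMP, GPU Brook+)
times = {
    ("Asig_Irr", "R0"): (71.44, 37.20, 22.94, 10.59),
    ("Asig_Irr", "R10"): (71.61, 26.92, 13.64, 1.64),
    ("Asig_Irr", "R100"): (71.38, 25.83, 12.70, 0.91),
    ("Red_Irr", "R0"): (75.09, 65.69, 44.81, 29.98),
    ("Red_Irr", "R10"): (75.27, 43.42, 22.19, 4.01),
    ("Red_Irr", "R100"): (75.12, 41.09, 19.97, 1.28),
}


def speedups(row):
    """Speedup of each parallel version relative to the CPU 1P time."""
    baseline = row[0]
    return [round(baseline / t, 1) for t in row[1:]]


for (benchmark, reuse), row in times.items():
    cpu2, cpu4, gpu = speedups(row)
    print(f"{benchmark} {reuse}: 2P {cpu2}x, 4P {cpu4}x, GPU {gpu}x")
```

Comparing the R0 and R100 rows shows how strongly the GPU result depends on reusing the inspector analysis across executor runs.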


Performance analysis

The irregular assignment kernel performs well on both architectures. For the best results, the GPU requires some reusability to compensate for the inspector memory transfer time, but even without reusability it surpasses the CPU.

[Figure: Irregular Assignment]


Performance analysis

The irregular reduction kernel is somewhat more complex, and both architectures show lower gains. With just a small amount of reusability the GPU achieves a 16x speedup, and even without any reusability it overtakes the CPU under the same test conditions.

[Figure: Irregular Reduction]


Conclusions

• This work studied the Brook+ language for GPU programming, which proved to be a good choice for applications that require high computational power and can be adapted to the streaming paradigm.

• A general parallelization technique adapted to GPUs has been proposed for two common irregular computational kernels.

• The performance of the proposed solutions has been evaluated on a current CPU using OpenMP and a mid-range GPU using Brook+, showing that even with general parallelization strategies and irregular memory access patterns it is possible to attain a good speedup on the GPU.


Conclusions

• As future work, we plan to study the parallelization of other, less common computational kernels using general strategies.

• The performance of the proposed solutions will also be studied on other platforms, like OpenCL or CUDA, to compare their efficiency.

• Building on related work, we plan a tool capable of analyzing programs to assist the programmer in the design and adaptation of parallel code.


Questions?
