
Page 1:

©2006 Craig Zilles

(Easily) Exposing Thread-level Parallelism

Previously, we introduced multi-core processors and the (atomic) instructions necessary to handle data races.

Today, a simple approach to exploiting one kind of parallelism:
- Data vs. Control Parallelism
- Fork/Join Model of Parallelization
- What performance can we expect?
- Introduction to OpenMP

[Image: AMD dual-core Opteron]

Page 2:

What happens if we run a program on a multi-core?

    void array_add(int A[], int B[], int C[], int length) {
        int i;
        for (i = 0; i < length; ++i) {
            C[i] = A[i] + B[i];
        }
    }

As programmers, do we care?

[Diagram: processor cores #1 and #2]

Page 3:

What if we want a program to run on both processors?

We have to explicitly tell the machine to do this
- And how to do it
- This is called parallel programming or concurrent programming

There are many parallel/concurrent programming models
- We will look at a relatively simple one: fork/join parallelism
- In CS241, you learn about POSIX threads and explicit synchronization

Page 4:

1. Fork N-1 threads
2. Break work into N pieces (and do it)
3. Join (N-1) threads

    void array_add(int A[], int B[], int C[], int length) {
        cpu_num = fork(N-1);   /* pseudocode: each of the N threads gets a distinct id 0..N-1 */
        int i;
        for (i = cpu_num; i < length; i += N) {   /* interleaved: thread k does elements k, k+N, ... */
            C[i] = A[i] + B[i];
        }
        join();   /* wait for the other N-1 threads to finish */
    }

Fork/Join Logical Example

Page 5:

1. Fork N-1 threads
2. Break work into N pieces (and do it)
3. Join (N-1) threads

    void array_add(int A[], int B[], int C[], int length) {
        cpu_num = fork(N-1);
        int i;
        for (i = cpu_num; i < length; i += N) {
            C[i] = A[i] + B[i];
        }
        join();
    }

Note: in general, we would break the work into contiguous chunks of data because of caches, e.g.,

    for (i = cpu_num * (length/N); i < min(length, (cpu_num+1) * (length/N)); ++i)

Fork/Join Logical Example
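For concreteness, here is a minimal sketch of how the fork()/join() pseudocode above might be realized with the POSIX threads mentioned on page 3. The fixed thread count N, the args struct, and the worker function are illustrative assumptions, not part of the original slides:

    /* build line assumed: gcc -pthread ... */
    #include <pthread.h>
    #include <stddef.h>

    #define N 4   /* assumed fixed thread count, for simplicity */

    struct args { int *A, *B, *C; int length; int cpu_num; };

    static void *worker(void *p) {
        struct args *a = p;
        /* interleaved decomposition: thread k handles elements k, k+N, k+2N, ... */
        for (int i = a->cpu_num; i < a->length; i += N)
            a->C[i] = a->A[i] + a->B[i];
        return NULL;
    }

    void array_add(int A[], int B[], int C[], int length) {
        pthread_t tid[N - 1];
        struct args a[N];
        /* "fork": create N-1 extra threads; the calling thread acts as thread 0 */
        for (int t = 1; t < N; ++t) {
            a[t] = (struct args){ A, B, C, length, t };
            pthread_create(&tid[t - 1], NULL, worker, &a[t]);
        }
        a[0] = (struct args){ A, B, C, length, 0 };
        worker(&a[0]);
        /* "join": wait for the N-1 extra threads */
        for (int t = 1; t < N; ++t)
            pthread_join(tid[t - 1], NULL);
    }

Note how much bookkeeping this takes compared to the three-line recipe above; that gap is exactly what OpenMP, later in the lecture, automates.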

Page 6:

How does this help performance?

Parallel speedup measures improvement from parallelization:

    speedup(p) = (time for best serial version) / (time for version with p processors)

What can we realistically expect?

Page 7:

In general, the whole computation is not (easily) parallelizable

Reason #1: Amdahl’s Law

[Diagram: execution timeline with serial regions between the parallelizable regions]

Page 8:

Suppose a program takes 1 unit of time to execute serially. A fraction of the program, s, is inherently serial (not parallelized).

For example, consider a program that, when executing on one processor, spends 10% of its time in a non-parallelizable region. How much faster will this program run on a 3-processor system?

What is the maximum speedup from parallelization?

Reason #1: Amdahl’s Law

    New execution time = (1 - s)/P + s

For the 10%-serial example on P = 3 processors (serial time T):

    New execution time = 0.9T/3 + 0.1T = 0.4T
    Speedup = T / 0.4T = 2.5
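Plugging in the numbers with a tiny helper (a sketch; the function name amdahl_speedup is mine, not from the slides):

    #include <stdio.h>

    /* Amdahl's Law: normalized execution time on P processors is s + (1 - s)/P */
    double amdahl_speedup(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
        printf("s = 0.10, P = 3: speedup = %.2f\n", amdahl_speedup(0.10, 3)); /* 2.50 */
        printf("s = 0.10, P -> infinity: bound = %.2f\n", 1.0 / 0.10);        /* 10.00 */
        return 0;
    }

As P grows, the (1 - s)/P term vanishes, so the speedup can never exceed 1/s: with a 10% serial region, no number of processors gets past 10x.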

Page 9:

    void array_add(int A[], int B[], int C[], int length) {
        cpu_num = fork(N-1);
        int i;
        for (i = cpu_num; i < length; i += N) {
            C[i] = A[i] + B[i];
        }
        join();
    }

Forking and joining are not instantaneous:
1. Involves communicating between processors
2. May involve calls into the operating system
The cost depends on the implementation.

Reason #2: Overhead

    New execution time = (1 - s)/P + s + overhead(P)
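One rough way to see the overhead(P) term empirically is to time an empty parallel region, so that everything measured is fork/join cost rather than useful work. A sketch, assuming the OpenMP directives introduced on the next slides:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        const int reps = 1000;
        double t0 = omp_get_wtime();
        for (int r = 0; r < reps; ++r) {
            #pragma omp parallel
            { /* no work: elapsed time is fork/join and scheduling overhead */ }
        }
        double t1 = omp_get_wtime();
        printf("average fork/join overhead: %g seconds\n", (t1 - t0) / reps);
        return 0;
    }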

Page 10:

Programming Explicit Thread-level Parallelism

As noted previously, the programmer must specify how to parallelize. But we want the path of least effort.

Division of labor between the human and the compiler:
- Humans: good at expressing parallelism, bad at bookkeeping
- Compilers: bad at finding parallelism, good at bookkeeping

Want a way to take serial code and say "Do this in parallel!" without:
- Having to manage the synchronization between processors
- Having to know a priori how many processors the system has
- Having to decide exactly which processor does what
- Having to replicate the private state of each thread

OpenMP: an industry-standard set of compiler extensions
- Works very well for programs with structured parallelism

Page 11:

    void array_add(int A[], int B[], int C[], int length) {
        int i;
        for (i = 0; i < length; i += 1) {   /* without OpenMP */
            C[i] = A[i] + B[i];
        }
    }

    void array_add(int A[], int B[], int C[], int length) {
        int i;
        #pragma omp parallel for            /* with OpenMP */
        for (i = 0; i < length; i += 1) {
            C[i] = A[i] + B[i];
        }
    }

OpenMP figures out how many threads are available, forks (if necessary), divides the work among them, and then joins after the loop.

OpenMP
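To try the OpenMP version, the pragma must be enabled at compile time; with GCC the flag is -fopenmp (the file name below is illustrative):

    gcc -fopenmp array_add.c

Without the flag, the pragma is ignored and the loop runs serially, which is a handy property: the same source works as both a serial and a parallel program.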

Page 12:

OpenMP Notes

Feature: works for (almost) arbitrarily complex code
- i.e., can include control flow, function calls, etc.

Caveat: OpenMP trusts you (it does not check the independence of code)

    void array_add(int A[], int B[], int C[], int length) {
        int i;
        #pragma omp parallel for
        for (i = 1; i < length; i += 1) {
            B[i] = A[i] + B[i-1];   /* likely to cause problems */
        }
    }

Must first restructure the code to be independent (one common restructuring is sketched below)
- Much like we did for SIMD execution
  - But without worrying about a SIMD register size
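For the common special case where the loop-carried dependence is an accumulation, OpenMP itself provides the restructuring via a reduction clause: each thread accumulates into a private copy, and the copies are combined at the join. A sketch, not from the slides:

    /* Parallel sum: the reduction clause gives each thread a private
       partial sum and combines them with + after the loop. */
    int array_sum(int A[], int length) {
        int sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < length; ++i)
            sum += A[i];
        return sum;
    }

The slide's B[i] = A[i] + B[i-1] example is a prefix sum, which is harder: it needs a genuinely different algorithm, not just a clause.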


Page 13:

OpenMP “hello world” Example

    #include <omp.h>
    #include <stdio.h>   /* needed for printf */

    int main() {
        int nthreads, tid;

        /* Fork a team of threads, giving them their own copies of variables */
        #pragma omp parallel private(tid)
        {
            /* Obtain and print thread id */
            tid = omp_get_thread_num();
            printf("Hello World from thread = %d\n", tid);

            /* Only master thread does this */
            if (tid == 0) {
                nthreads = omp_get_num_threads();
                printf("Number of threads = %d\n", nthreads);
            }
        }   /* All threads join master thread and terminate */
        return 0;
    }
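With four threads, the output might look like the following; the ordering of the lines is nondeterministic, since the threads run concurrently:

    Hello World from thread = 1
    Hello World from thread = 3
    Hello World from thread = 0
    Number of threads = 4
    Hello World from thread = 2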

Page 14:

Privatizing Variables
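Privatizing a variable (as in the private(tid) clause above) gives each thread its own copy, so threads do not race on a single shared location. A minimal illustration, with assumed variable names, not from the slides:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int tmp;   /* one variable in the source... */
        /* ...but private(tmp) gives each thread its own copy, so the
           threads do not race when they all write tmp concurrently. */
        #pragma omp parallel private(tmp)
        {
            tmp = omp_get_thread_num();
            printf("my private tmp = %d\n", tmp);
        }
        return 0;
    }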

Page 15:

Multi-core is having more than one processor on the same chip
- Soon most PCs/servers and game consoles will be multi-core
- Results from Moore's law and power constraints

Exploiting multi-core requires parallel programming
- Automatically extracting parallelism is too hard for the compiler, in general
- But we can have the compiler do much of the bookkeeping for us: OpenMP

Fork/join model of parallelism
- At a parallel region, fork a bunch of threads, do the work in parallel, and then join, continuing with just one thread
- Expect a speedup of less than P on P processors
  - Amdahl's Law: speedup limited by the serial portion of the program
  - Overhead: forking and joining are not free

Summary