Computer Organization David Monismith CS345 Notes to help with the in class assignment.


Page 1: Computer Organization David Monismith CS345 Notes to help with the in class assignment

Computer Organization

David Monismith
CS345

Notes to help with the in class assignment.

Page 2:

Flynn’s Taxonomy

• SISD = Single Instruction, Single Data = serial programming
• SIMD = Single Instruction, Multiple Data = implicit parallelism (instruction/architecture level)
• MISD = Multiple Instruction, Single Data (rarely implemented)
• MIMD = Multiple Instruction, Multiple Data = multiprocessor

                       Single Data   Multiple Data
Single Instruction     SISD          SIMD
Multiple Instruction   MISD          MIMD

Page 3:

Flynn’s Taxonomy

• SIMD instructions and architectures allow for implicit parallelism when writing programs

• To provide a sense of how these work, examples are shown in the following slides.

• Our focus on MIMD is through the use of processes and threads, and examples are shown in later slides.

Page 4:

Understanding SIMD Instructions

• Implicit parallelism occurs via AVX (Advanced Vector Extensions) or SSE (Streaming SIMD Extensions) instructions

• Example: without SIMD the following loop might be executed with four add instructions:

//Serial loop
for(int i = 0; i < n; i += 4) {
    c[i]   = a[i]   + b[i];   //add c[i], a[i], b[i]
    c[i+1] = a[i+1] + b[i+1]; //add c[i+1], a[i+1], b[i+1]
    c[i+2] = a[i+2] + b[i+2]; //add c[i+2], a[i+2], b[i+2]
    c[i+3] = a[i+3] + b[i+3]; //add c[i+3], a[i+3], b[i+3]
}

Page 5:

Understanding SIMD Instructions

• With SIMD the following loop might be executed with one add instruction:

//SIMD loop
for(int i = 0; i < n; i += 4) {
    c[i]   = a[i]   + b[i];   //one vector add: c[i to i+3] = a[i to i+3] + b[i to i+3]
    c[i+1] = a[i+1] + b[i+1];
    c[i+2] = a[i+2] + b[i+2];
    c[i+3] = a[i+3] + b[i+3];
}

Page 6:

Understanding SIMD Instructions

• Note that the add instructions above are pseudo-assembly instructions
• The serial loop is implemented as follows:

+------+   +------+    +------+
| a[i] | + | b[i] | -> | c[i] |
+------+   +------+    +------+

+------+   +------+    +------+
|a[i+1]| + |b[i+1]| -> |c[i+1]|
+------+   +------+    +------+

+------+   +------+    +------+
|a[i+2]| + |b[i+2]| -> |c[i+2]|
+------+   +------+    +------+

+------+   +------+    +------+
|a[i+3]| + |b[i+3]| -> |c[i+3]|
+------+   +------+    +------+

Page 7:

Understanding SIMD Instructions

• Versus SIMD:

+------+   +------+    +------+
| a[i] |   | b[i] |    | c[i] |
|      |   |      |    |      |
|a[i+1]|   |b[i+1]|    |c[i+1]|
|      | + |      | -> |      |
|a[i+2]|   |b[i+2]|    |c[i+2]|
|      |   |      |    |      |
|a[i+3]|   |b[i+3]|    |c[i+3]|
+------+   +------+    +------+

Page 8:

Understanding SIMD Instructions

• In the previous example, a 4x speedup was achieved by using SIMD instructions

• Note that SIMD registers are often 128, 256, or 512 bits wide, allowing for addition, subtraction, multiplication, etc., of 2, 4, or 8 double-precision variables at once.

• Performance of SSE and AVX Instruction Sets, Hwancheol Jeong, Weonjong Lee, Sunghoon Kim, and Seok-Ho Myung, Proceedings of Science, 2012, http://arxiv.org/pdf/1211.0820.pdf

Page 9:

Processes and Threads

• These exist only at execution time

• They change state quickly, e.g., between running in memory and waiting

• A process
  – is a fundamental unit of computation
  – can have one or more threads
  – is handled by the process management module
  – requires system resources

Page 10:

Process

• Process (job) - program in execution, ready to execute, or waiting for execution

• A program is static whereas a process (running program) is dynamic.

• In Operating Systems (CS550) we will implement processes using an API called the Message Passing Interface (MPI).

• MPI will provide us with an abstract layer that will allow us to create and identify processes without worrying about the creation of data structures for sockets or shared memory.

Page 11:

Threads

• Threads - lightweight processes
  – Dynamic component of processes
  – Often, many threads are part of a process

• Current OSes and hardware support multithreading
  – Multiple threads (tasks) per process
  – One or more threads per CPU core

• Execution of threads is handled more efficiently than that of full-weight processes (although there are other costs).

• At process creation, one thread is created, the "main" thread.

• Other threads are created from the "main" thread.

Page 12:

Embarrassingly Parallel (Map)

• Processes and threads are MIMD.
• Performing array (or matrix) addition is a straightforward example that is easily parallelized.
• The serial example of this follows:

for(i = 0; i < N; i++)
    C[i] = A[i] + B[i];

• OpenMP allows you to write a #pragma to parallelize code that you write in a serial (normal) fashion.

• Three OpenMP parallel versions follow on the next slides

Page 13:

OpenMP First Try

• We could parallelize the loop on the last slide directly as follows:

#pragma omp parallel private(i) shared(A,B,C)
{
    int start = omp_get_thread_num() * (N / omp_get_num_threads());
    int end = start + (N / omp_get_num_threads());
    for(i = start; i < end; i++)
        C[i] = A[i] + B[i];
}

• Notice that i is declared private because it is not shared between threads – each thread gets its own copy of i.
• Arrays A, B, and C are declared shared because they are shared between threads.
• Note that this simple partitioning drops leftover iterations when N is not evenly divisible by the number of threads.

Page 14:

OpenMP for clause

• It is preferred to allow OpenMP to directly parallelize loops using the for clause as follows

#pragma omp parallel private(i) shared(A,B,C)
{
    #pragma omp for
    for(i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

• Notice that the loop can be written in a serial fashion, and its iterations will be automatically partitioned among the threads

Page 15:

Shortened OpenMP for

• When using a single for loop, the parallel and for clauses may be combined

#pragma omp parallel for private(i) \
        shared(A,B,C)
for(i = 0; i < N; i++)
    C[i] = A[i] + B[i];