Lecture 2: The Art of Concurrency
张奇, 复旦大学 (Fudan University)
COMP630030 Data Intensive Computing

Upload: jesse-bradford

Post on 16-Dec-2015


Page 1

Lecture 2 The Art of Concurrency

张奇, 复旦大学 (Fudan University)

COMP630030 Data Intensive Computing

Page 2

Parallelism makes programs run faster

Page 3

Why Do I Need to Know This? What’s in It for Me?

• There’s no way to avoid this topic
  – Multicore processors are here now and here to stay
  – “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software” (Dr. Dobb’s Journal, March 2005)

Page 4

Isn’t Concurrent Programming Hard?

• Concurrent programming is no walk in the park.

• With a serial program, execution of your code takes a predictable path through the application.

• Concurrent algorithms require you to think about multiple execution streams running at the same time.

Page 5

PRIMER ON CONCURRENT PROGRAMMING

• Concurrent programming is all about independent computations that the machine can execute in any order.

• Not everything within an application will be independent, so you will still need to deal with serial execution amongst the concurrency.

Page 6

Four Steps of a Threading Methodology

• Step 1. Analysis: Identify Possible Concurrency
  – Find the parts of the application that contain independent computations.
  – Identify hotspots that might yield independent computations.

Page 7

Four Steps of a Threading Methodology

• Step 2. Design and Implementation: Threading the Algorithm

• This step is what this book is all about

Page 8

Four Steps of a Threading Methodology

• Step 3. Test for Correctness: Detecting and Fixing Threading Errors

• Step 4. Tune for Performance: Removing Performance Bottlenecks

Page 9

Page 10

Design Models for Concurrent Algorithms

• The way you approach your serial code will influence how you reorganize the computations into a concurrent equivalent.
  – Task decomposition: identify independent tasks that threads can execute concurrently.
  – Data decomposition: compute every element of the data independently.

Page 11

Task Decomposition

• The first step of any concurrent transformation process is to identify computations that are completely independent.

• Satisfy or remove dependencies

Page 12

Example: numerical integration

• What are the independent tasks in this simple application?

• Are there any dependencies between these tasks and, if so, how can we satisfy them?

• How should you assign tasks to threads?
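The classic instance of this example approximates π by integrating 4/(1+x²) over [0,1] with the midpoint rule (whether the lecture used this exact integrand is an assumption). Each rectangle is an independent task, and the one dependency, the shared running sum, is satisfied here with an OpenMP reduction; compiled without OpenMP support the pragma is simply ignored and the loop runs serially with the same result:

```c
/* Midpoint-rule integration of 4/(1+x^2) on [0,1], which equals pi.
 * Each rectangle is an independent task; the shared accumulator is the
 * only dependency, handled by the reduction clause. */
double approximate_pi(int n) {
    double h = 1.0 / n;                  /* width of each rectangle */
    double sum = 0.0;

#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        double x = (i + 0.5) * h;        /* midpoint of rectangle i */
        sum += 4.0 / (1.0 + x * x);      /* independent computation */
    }
    return sum * h;
}
```

Calling `approximate_pi(1000000)` returns a value very close to 3.14159265.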

Page 13

Three key elements for any task decomposition design

• What are the tasks and how are they defined?
• What are the dependencies between tasks and how can they be satisfied?
• How are the tasks assigned to threads?

Page 14

Two criteria for the actual decomposition into tasks

• There should be at least as many tasks as there will be threads (or cores).

• The amount of computation within each task (granularity) must be large enough to offset the overhead that will be needed to manage the tasks and the threads.

Page 15

Page 16

What are the dependencies between tasks and how can they be satisfied?

• Order dependency
  – Some task relies on the completed results of the computations from another task.
  – Remedies: schedule tasks that have an order dependency onto the same thread, or insert some form of synchronization to ensure correct execution order.

Page 17

What are the dependencies between tasks and how can they be satisfied?

• Data dependency
  – Assignment of values to the same variable that might be done concurrently.
  – Updates to a variable that could be read concurrently.
  – Remedies: create variables that are accessible only to a given thread, or use atomic operations.

Page 18

How are the tasks assigned to threads?

• Tasks must be assigned to threads for execution.

• The amount of computation done by threads should be roughly equivalent.

• We can allocate tasks to threads in two different ways: static scheduling or dynamic scheduling.

Page 19

How are the tasks assigned to threads?

• In static scheduling, the division of labor is known at the outset of the computation and doesn’t change during the computation.

• Static scheduling is best used in those cases where the amount of computation within each task is the same or can be predicted at the outset.

Page 20

How are the tasks assigned to threads?

• Under a dynamic schedule, you assign tasks to threads as the computation proceeds.

• The driving force behind the use of a dynamic schedule is to try to balance the load as evenly as possible between threads.
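Both strategies can be written down directly with OpenMP's schedule clause. A minimal sketch (the work() function is a stand-in for a real task; without OpenMP the pragmas are ignored and the loops run serially):

```c
#define N 16

/* Simulated task whose cost grows with its index (uneven workload). */
static int work(int i) { return i * i; }

/* Static schedule: the division of iterations among threads is fixed
 * at the start of the loop and never changes. */
void run_static(int results[N]) {
#pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        results[i] = work(i);
}

/* Dynamic schedule: threads grab the next chunk of 2 iterations as
 * they finish, rebalancing the load at runtime. */
void run_dynamic(int results[N]) {
#pragma omp parallel for schedule(dynamic, 2)
    for (int i = 0; i < N; i++)
        results[i] = work(i);
}
```

Static scheduling has the least overhead when every task costs about the same; dynamic scheduling pays a small bookkeeping cost to even out unpredictable workloads.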

Page 21

Example: numerical integration

• What are the independent tasks in this simple application?

• Are there any dependencies between these tasks and, if so, how can we satisfy them?

• How should you assign tasks to threads?

Page 22

Data Decomposition

• Execution is dominated by a sequence of update operations on all elements of one or more large data structures.

• These update computations are independent of each other

• Dividing up the data structure(s) and assigning those portions to threads, along with the corresponding update computations (tasks)

Page 23

Three key elements for data decomposition design

• How should you divide the data into chunks?
• How can you ensure that the tasks for each chunk have access to all data required for updates?
• How are the data chunks assigned to threads?

Page 24

How should you divide the data into chunks?

Page 25

How should you divide the data into chunks?

• Granularity of the chunk
• Shape of the chunk
  – determines what the neighboring chunks are and how any exchange of data with them must be handled
• Be more vigilant with chunks of irregular shapes

Page 26

How can you ensure that the tasks for each chunk have access to all data required for updates?

Page 27

Example: Game of Life on a finite grid

Page 28

Example: Game of Life on a finite grid

Page 29

Example: Game of Life on a finite grid

• What is the large data structure in this application and how can you divide it into chunks?

• What is the best way to perform the division?
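One natural division is by contiguous blocks of rows: each thread updates its block, reading (but never writing) the previous generation, so neighboring chunks only share read access to their boundary rows. A serial sketch of the per-chunk update (the names update_chunk and the 5×5 grid are illustrative, not from the slides):

```c
#define ROWS 5
#define COLS 5

/* Count live neighbors of cell (r, c) on a finite (non-wrapping) grid. */
static int neighbors(int g[ROWS][COLS], int r, int c) {
    int n = 0;
    for (int dr = -1; dr <= 1; dr++)
        for (int dc = -1; dc <= 1; dc++) {
            int rr = r + dr, cc = c + dc;
            if ((dr || dc) && rr >= 0 && rr < ROWS && cc >= 0 && cc < COLS)
                n += g[rr][cc];
        }
    return n;
}

/* Update rows [lo, hi): one chunk of the data decomposition.  Only the
 * previous generation `cur` is read, so chunks never conflict. */
void update_chunk(int cur[ROWS][COLS], int nxt[ROWS][COLS], int lo, int hi) {
    for (int r = lo; r < hi; r++)
        for (int c = 0; c < COLS; c++) {
            int n = neighbors(cur, r, c);
            nxt[r][c] = (n == 3) || (cur[r][c] && n == 2);
        }
}
```

Because every chunk reads only the old grid and writes only its own rows of the new grid, the threads need just one synchronization point per generation: a barrier before swapping the grids.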

Page 30

What’s Not Parallel

• Algorithms with State
  – Something kept around from one execution to the next.
  – For example, the seed to a random number generator or the file pointer for I/O would be considered state.
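For the random-number case, the usual remedy is to make the hidden state explicit. rand() keeps its seed in shared internal state; the POSIX variant rand_r() takes the seed as an argument, so each thread can hold a private seed and the calls become independent (the roll_die helper is a hypothetical illustration):

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>

/* Each caller owns its seed: no shared state, no data dependency
 * between threads that each carry a private seed. */
int roll_die(unsigned int *seed) {
    return (int)(rand_r(seed) % 6) + 1;
}
```

Two threads seeded identically will reproduce the same sequence, which also makes threaded runs easier to debug.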

Page 31

What’s Not Parallel

• Recurrences

Page 32

What’s Not Parallel

• Induction Variables

Page 33

What’s Not Parallel

• Reduction
  – Reductions take a collection (typically an array) of data and reduce it to a single scalar value through some combining operation.

Page 34

Loop-Carried Dependence
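The slide's own example is not preserved in this transcript; a minimal illustration of a loop-carried dependence, where iteration i reads the value written by iteration i-1 and the iterations therefore cannot run in an arbitrary order:

```c
#define N 8

/* Running sum: a[i] depends on a[i-1], the result of the previous
 * iteration, so a naive parallel loop would read stale values. */
void running_sum(const int b[N], int a[N]) {
    a[0] = b[0];
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];   /* reads the prior iteration's write */
}
```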

Page 35

Page 36

Rule 1: Identify Truly Independent Computations

• It’s the crux of the whole matter!

Page 37

Rule 2: Implement Concurrency at the Highest Level Possible

• Two directions: bottom-up and top-down
• Bottom-up
  – Consider threading the hotspots directly.
  – If this is not possible, search up the call stack.
• Top-down
  – First consider the whole application and what the computation is coded to accomplish.
  – Where there is no obvious concurrency at that level, distill the computation into successively smaller parts until independent pieces emerge.

• Video encoding application: concurrency exists at the level of individual pixels, frames, and whole videos.

Page 38

Rule 3: Plan Early for Scalability to Take Advantage of Increasing Numbers of Cores

• Quad-core processors are becoming the default multicore chip.

• Flexible code that can take advantage of different numbers of cores.

• C. Northcote Parkinson, “Data expands to fill the processing power available.”

Page 39

Rule 4: Make Use of Thread-Safe Libraries Wherever Possible

• Intel Math Kernel Library (MKL) • Intel Integrated Performance Primitives (IPP)

Page 40

Rule 5: Use the Right Threading Model

• Don’t use explicit threads if an implicit threading model (e.g., OpenMP or Intel Threading Building Blocks) has all the functionality you need.

Page 41

Rule 6: Never Assume a Particular Order of Execution

Page 42

Rule 7: Use Thread-Local Storage Whenever Possible or Associate Locks to Specific Data

• Synchronization is overhead that does not contribute to the furtherance of the computation.

• You should actively seek to keep the amount of synchronization to a minimum.

• Do so by using storage that is local to threads or by using exclusive memory locations.

Page 43

Rule 8: Dare to Change the Algorithm for a Better Chance of Concurrency

• When choosing between two or more algorithms, programmers may rely on the asymptotic order of execution.

• An O(n log n) algorithm will run faster than an O(n²) algorithm.

• If you cannot easily turn a hotspot into threaded code, consider a suboptimal serial algorithm that is easier to transform, rather than the algorithm currently in the code.

Page 44

Page 45

Parallel Sum

Page 46

PRAM Algorithm

Page 47

PRAM Algorithm

Can we use the PRAM algorithm for parallel sum in a threaded code?

Page 48

A More Practical Algorithm

• Divide the data array into a number of chunks equal to the number of threads to be used.

• Assign each thread a unique chunk and sum the values within the assigned subarray into a private variable.

• Add these local partial sums to compute the total sum of the array elements.
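These three steps can be sketched with Pthreads (names such as sum_chunk and the fixed thread count are illustrative). Each thread sums its chunk into a private field, and because the main thread combines the partial sums only after joining every worker, no lock is needed:

```c
#include <pthread.h>

#define NTHREADS 4
#define N 1000

int data[N];                           /* the array to be summed */

struct chunk { int lo, hi; long partial; };

/* Step 2: each thread sums its assigned subarray into a private field. */
static void *sum_chunk(void *arg) {
    struct chunk *c = arg;
    c->partial = 0;
    for (int i = c->lo; i < c->hi; i++)
        c->partial += data[i];         /* private accumulator */
    return NULL;
}

long parallel_sum(void) {
    pthread_t tid[NTHREADS];
    struct chunk ch[NTHREADS];
    int per = N / NTHREADS;

    /* Step 1: divide the array into one chunk per thread. */
    for (int t = 0; t < NTHREADS; t++) {
        ch[t].lo = t * per;
        ch[t].hi = (t == NTHREADS - 1) ? N : ch[t].lo + per;
        pthread_create(&tid[t], NULL, sum_chunk, &ch[t]);
    }
    /* Step 3: join, then combine the local partial sums. */
    long total = 0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += ch[t].partial;
    }
    return total;
}
```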

Page 49

Prefix Scan

Page 50

Prefix Scan

• PRAM computation for prefix scan

Page 51

Prefix Scan

• A More Practical Algorithm
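A common practical scheme, and presumably the one intended here, runs in three phases: each thread scans its chunk locally, a short serial scan over the chunk totals gives each chunk its offset, and each thread then adds its offset to its chunk. Phases 1 and 3 are fully parallel; the sketch below shows the phases serially:

```c
#define N 8
#define CHUNKS 2

/* Inclusive prefix scan of `in` into `out`, chunk by chunk. */
void chunked_scan(const int in[N], int out[N]) {
    int per = N / CHUNKS, offset[CHUNKS];

    /* Phase 1: independent local scans, one per chunk (parallel). */
    for (int c = 0; c < CHUNKS; c++) {
        int lo = c * per;
        out[lo] = in[lo];
        for (int i = lo + 1; i < lo + per; i++)
            out[i] = out[i - 1] + in[i];
    }
    /* Phase 2: serial scan of the chunk totals -> per-chunk offsets. */
    offset[0] = 0;
    for (int c = 1; c < CHUNKS; c++)
        offset[c] = offset[c - 1] + out[c * per - 1];

    /* Phase 3: independent offset adds, one per chunk (parallel). */
    for (int c = 1; c < CHUNKS; c++)
        for (int i = c * per; i < (c + 1) * per; i++)
            out[i] += offset[c];
}
```

Only phase 2 is serial, and its length equals the number of chunks, so its cost is negligible when the chunks are large.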

Page 52

Page 53

Implicit Threading

• Implicit threading libraries take care of much of the minutiae needed to create, manage, and (to some extent) synchronize threads.

• All the little niggly details are hidden from programmers to make concurrent programming easier to implement and understand.

• OpenMP implements concurrency through special pragmas and directives inserted into your source code to indicate segments that are to be executed concurrently. These pragmas are recognized and processed by the compiler.

• Intel TBB uses defined parallel algorithms to execute methods within user-written classes that encapsulate the concurrent operations.

Page 54

OpenMP

• OpenMP is a set of compiler directives, library routines, and environment variables that specify shared-memory concurrency in FORTRAN, C, and C++ programs.

• All major compilers support the OpenMP language:
  – Microsoft Visual C/C++ .NET for Windows
  – GNU GCC compiler for Linux
  – Intel C/C++ compilers, for both Windows and Linux

Page 55

OpenMP

• OpenMP directives demarcate code that can be executed in parallel (called parallel regions) and control how code is assigned to threads.

• For C and C++: #pragma omp parallel

• OpenMP also has an atomic construct to ensure that statements will be executed in an atomic manner.

• OpenMP provides a reduction clause to handle the details of a concurrent reduction.

Page 56

OpenMP

Page 57

Intel Threading Building Blocks

• Intel TBB is a C++ template-based library for loop-level parallelism that concentrates on defining tasks rather than explicit threads.

• Programmers using TBB can parallelize the execution of loop iterations by treating chunks of iterations as tasks and allowing the TBB task scheduler to determine:
  – the task sizes
  – the number of threads to use
  – the assignment of tasks to those threads
  – how those threads are scheduled for execution

Page 58

Intel Threading Building Blocks

Page 59

Explicit Threading

• Explicit threading libraries require the programmer to control all aspects of threads, including:
  – creating threads
  – associating threads to functions
  – synchronizing
  – controlling the interactions between threads and shared resources

Page 60

Pthreads

• Pthreads has a thread container data type of pthread_t.

• To create a thread and associate it with a function for execution, use the pthread_create() function

• When one thread needs to be sure that some other thread has terminated before proceeding with execution, it calls pthread_join().
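Put together, a minimal sketch of these three calls (doubler and run_in_thread are illustrative names):

```c
#include <pthread.h>

/* Work done by the spawned thread: double the integer it is handed. */
static void *doubler(void *arg) {
    int *n = arg;
    *n *= 2;
    return NULL;
}

/* Spawn a thread to double n, wait for it, then return the result. */
int run_in_thread(int n) {
    pthread_t tid;                            /* thread container type */
    pthread_create(&tid, NULL, doubler, &n);  /* thread runs doubler(&n) */
    pthread_join(tid, NULL);                  /* wait for termination   */
    return n;                                 /* safe only after join   */
}
```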

Page 61

Pthreads

• Threads request the privilege of holding a mutex by calling pthread_mutex_lock().

• Other threads attempting to gain control of the mutex will be blocked until the thread that is holding the lock calls pthread_mutex_unlock().
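A sketch of the usual counter-protection pattern (note the full function names, pthread_mutex_lock and pthread_mutex_unlock; without the mutex the two threads' increments would race):

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

static void *bump(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* blocks if another thread holds it */
        counter++;                    /* critical section */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

long count_with_two_threads(void) {
    pthread_t a, b;
    counter = 0;
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;                   /* exactly 200000 with the mutex */
}
```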

Page 62

Pthreads

• Threads block and wait on a condition variable to be signaled when calling pthread_cond_wait() on a given condition variable.

• An executing thread calls pthread_cond_signal() on a condition variable to wake up a thread that has been blocked on it.

• The pthread_cond_broadcast() function will wake all threads that are waiting on the condition variable.
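A minimal sketch of the wait/signal pattern: one thread waits for a flag, another sets it and signals. The predicate is re-checked in a loop because condition-variable waits can wake spuriously:

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready = PTHREAD_COND_INITIALIZER;
static int done = 0;

/* Sets the flag under the mutex, then wakes one waiting thread. */
static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    done = 1;
    pthread_cond_signal(&ready);
    pthread_mutex_unlock(&m);
    return NULL;
}

int wait_for_producer(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_mutex_lock(&m);
    while (!done)                     /* predicate loop: handles spurious */
        pthread_cond_wait(&ready, &m);/* wakeups and a producer that ran  */
    pthread_mutex_unlock(&m);         /* before we started waiting        */
    pthread_join(t, NULL);
    return done;
}
```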

Page 63

Pthreads

Page 64

Questions?