Lecture 2: The Art of Concurrency
张奇, 复旦大学 (Fudan University)
COMP630030 Data Intensive Computing

Upload: jesse-bradford

Post on 16-Dec-2015


Page 1

Lecture 2 The Art of Concurrency

张奇, 复旦大学 (Fudan University)

COMP630030 Data Intensive Computing

Page 2

Parallelism makes programs run faster

Page 3

Why Do I Need to Know This? What’s in It for Me?

• There’s no way to avoid this topic
  – Multicore processors are here now and here to stay
  – “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software” (Dr. Dobb’s Journal, March 2005)

Page 4

Isn’t Concurrent Programming Hard?

• Concurrent programming is no walk in the park.

• With a serial program, execution of your code takes a predictable path through the application.

• Concurrent algorithms require you to think about multiple execution streams running at the same time.

Page 5

PRIMER ON CONCURRENT PROGRAMMING

• Concurrent programming is all about independent computations that the machine can execute in any order.

• Not everything within an application will be independent, so you will still need to deal with serial execution amongst the concurrency.

Page 6

Four Steps of a Threading Methodology

• Step 1. Analysis: Identify Possible Concurrency
  – Find the parts of the application that contain independent computations.
  – Identify hotspots that might yield independent computations.

Page 7

Four Steps of a Threading Methodology

• Step 2. Design and Implementation: Threading the Algorithm

• This step is what this book is all about

Page 8

Four Steps of a Threading Methodology

• Step 3. Test for Correctness: Detecting and Fixing Threading Errors

• Step 4. Tune for Performance: Removing Performance Bottlenecks

Page 9

Page 10

Design Models for Concurrent Algorithms

• The way you approach your serial code will influence how you reorganize the computations into a concurrent equivalent.
  – Task decomposition: identify independent tasks that threads can execute concurrently.
  – Data decomposition: compute every element of the data independently.

Page 11

Task Decomposition

• The first step of any concurrent transformation process is to identify computations that are completely independent.

• Satisfy or remove dependencies

Page 12

Example: numerical integration

• What are the independent tasks in this simple application?

• Are there any dependencies between these tasks and, if so, how can we satisfy them?

• How should you assign tasks to threads?
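The classic instance of this example approximates π by integrating 4/(1+x²) over [0,1] with the midpoint rule (whether the lecture used this exact integrand is an assumption). Each rectangle is an independent task, and the one dependency, the shared running sum, is satisfied here with an OpenMP reduction; compiled without OpenMP support the pragma is simply ignored and the loop runs serially with the same result:

```c
/* Midpoint-rule integration of 4/(1+x^2) on [0,1], which equals pi.
 * Each rectangle is an independent task; the shared accumulator is the
 * only dependency, handled by the reduction clause. */
double approximate_pi(int n) {
    double h = 1.0 / n;                  /* width of each rectangle */
    double sum = 0.0;

#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        double x = (i + 0.5) * h;        /* midpoint of rectangle i */
        sum += 4.0 / (1.0 + x * x);      /* independent computation */
    }
    return sum * h;
}
```

Calling `approximate_pi(1000000)` returns a value very close to 3.14159265.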

Page 13

Three key elements for any task decomposition design

• What are the tasks and how are they defined?
• What are the dependencies between tasks and how can they be satisfied?
• How are the tasks assigned to threads?

Page 14

Two criteria for the actual decomposition into tasks

• There should be at least as many tasks as there will be threads (or cores).

• The amount of computation within each task (granularity) must be large enough to offset the overhead that will be needed to manage the tasks and the threads.

Page 15

Page 16

What are the dependencies between tasks and how can they be satisfied?

• Order dependency
  – Some task relies on the completed results of the computations from another task.
  – Remedies: schedule tasks that have an order dependency onto the same thread, or insert some form of synchronization to ensure correct execution order.

Page 17

What are the dependencies between tasks and how can they be satisfied?

• Data dependency
  – Assignment of values to the same variable that might be done concurrently.
  – Updates to a variable that could be read concurrently.
  – Remedies: create variables that are accessible only to a given thread, or use atomic operations.

Page 18

How are the tasks assigned to threads?

• Tasks must be assigned to threads for execution.

• The amount of computation done by threads should be roughly equivalent.

• We can allocate tasks to threads in two different ways: static scheduling or dynamic scheduling.

Page 19

How are the tasks assigned to threads?

• In static scheduling, the division of labor is known at the outset of the computation and doesn’t change during the computation.

• Static scheduling is best used in those cases where the amount of computation within each task is the same or can be predicted at the outset.

Page 20

How are the tasks assigned to threads?

• Under a dynamic schedule, you assign tasks to threads as the computation proceeds.

• The driving force behind the use of a dynamic schedule is to try to balance the load as evenly as possible between threads.
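Both strategies can be written down directly with OpenMP's schedule clause. A minimal sketch (the work() function is a stand-in for a real task; without OpenMP the pragmas are ignored and the loops run serially):

```c
#define N 16

/* Simulated task whose cost grows with its index (uneven workload). */
static int work(int i) { return i * i; }

/* Static schedule: the division of iterations among threads is fixed
 * at the start of the loop and never changes. */
void run_static(int results[N]) {
#pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        results[i] = work(i);
}

/* Dynamic schedule: threads grab the next chunk of 2 iterations as
 * they finish, rebalancing the load at runtime. */
void run_dynamic(int results[N]) {
#pragma omp parallel for schedule(dynamic, 2)
    for (int i = 0; i < N; i++)
        results[i] = work(i);
}
```

Static scheduling has the least overhead when every task costs about the same; dynamic scheduling pays a small bookkeeping cost to even out unpredictable workloads.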

Page 21

Example: numerical integration

• What are the independent tasks in this simple application?

• Are there any dependencies between these tasks and, if so, how can we satisfy them?

• How should you assign tasks to threads?

Page 22

Data Decomposition

• Execution is dominated by a sequence of update operations on all elements of one or more large data structures.

• These update computations are independent of each other

• Dividing up the data structure(s) and assigning those portions to threads, along with the corresponding update computations (tasks)

Page 23

Three key elements for data decomposition design

• How should you divide the data into chunks?
• How can you ensure that the tasks for each chunk have access to all data required for updates?
• How are the data chunks assigned to threads?

Page 24

How should you divide the data into chunks?

Page 25

How should you divide the data into chunks?

• Granularity of the chunk
• Shape of the chunk
  – determines what the neighboring chunks are and how any exchange of data with them must be handled
• Be more vigilant with chunks of irregular shapes

Page 26

How can you ensure that the tasks for each chunk have access to all data required for updates?

Page 27

Example: Game of Life on a finite grid

Page 28

Example: Game of Life on a finite grid

Page 29

Example: Game of Life on a finite grid

• What is the large data structure in this application and how can you divide it into chunks?

• What is the best way to perform the division?
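One natural division is by contiguous blocks of rows: each thread updates its block, reading (but never writing) the previous generation, so neighboring chunks only share read access to their boundary rows. A serial sketch of the per-chunk update (the names update_chunk and the 5×5 grid are illustrative, not from the slides):

```c
#define ROWS 5
#define COLS 5

/* Count live neighbors of cell (r, c) on a finite (non-wrapping) grid. */
static int neighbors(int g[ROWS][COLS], int r, int c) {
    int n = 0;
    for (int dr = -1; dr <= 1; dr++)
        for (int dc = -1; dc <= 1; dc++) {
            int rr = r + dr, cc = c + dc;
            if ((dr || dc) && rr >= 0 && rr < ROWS && cc >= 0 && cc < COLS)
                n += g[rr][cc];
        }
    return n;
}

/* Update rows [lo, hi): one chunk of the data decomposition.  Only the
 * previous generation `cur` is read, so chunks never conflict. */
void update_chunk(int cur[ROWS][COLS], int nxt[ROWS][COLS], int lo, int hi) {
    for (int r = lo; r < hi; r++)
        for (int c = 0; c < COLS; c++) {
            int n = neighbors(cur, r, c);
            nxt[r][c] = (n == 3) || (cur[r][c] && n == 2);
        }
}
```

Because every chunk reads only the old grid and writes only its own rows of the new grid, the threads need just one synchronization point per generation: a barrier before swapping the grids.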

Page 30

What’s Not Parallel

• Algorithms with State
  – Something kept around from one execution to the next.
  – For example, the seed to a random number generator or the file pointer for I/O would be considered state.
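For the random-number case, the usual remedy is to make the hidden state explicit. rand() keeps its seed in shared internal state; the POSIX variant rand_r() takes the seed as an argument, so each thread can hold a private seed and the calls become independent (the roll_die helper is a hypothetical illustration):

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>

/* Each caller owns its seed: no shared state, no data dependency
 * between threads that each carry a private seed. */
int roll_die(unsigned int *seed) {
    return (int)(rand_r(seed) % 6) + 1;
}
```

Two threads seeded identically will reproduce the same sequence, which also makes threaded runs easier to debug.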

Page 31

What’s Not Parallel

• Recurrences

Page 32

What’s Not Parallel

• Induction Variables

Page 33

What’s Not Parallel

• Reduction
  – Reductions take a collection (typically an array) of data and reduce it to a single scalar value through some combining operation.

Page 34

Loop-Carried Dependence
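The slide's own example is not preserved in this transcript; a minimal illustration of a loop-carried dependence, where iteration i reads the value written by iteration i-1 and the iterations therefore cannot run in an arbitrary order:

```c
#define N 8

/* Running sum: a[i] depends on a[i-1], the result of the previous
 * iteration, so a naive parallel loop would read stale values. */
void running_sum(const int b[N], int a[N]) {
    a[0] = b[0];
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];   /* reads the prior iteration's write */
}
```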

Page 35

Page 36

Rule 1: Identify Truly Independent Computations

• It’s the crux of the whole matter!

Page 37

Rule 2: Implement Concurrency at the Highest Level Possible

• Two directions: bottom-up and top-down
• Bottom-up
  – Consider threading the hotspots directly.
  – If this is not possible, search up the call stack.
• Top-down
  – First consider the whole application and what the computation is coded to accomplish.
  – Where there is no obvious concurrency at that level, distill the computation into successively smaller parts until independent pieces emerge.

• Video encoding application: concurrency exists at the level of individual pixels, frames, and whole videos.

Page 38

Rule 3: Plan Early for Scalability to Take Advantage of Increasing Numbers of Cores

• Quad-core processors are becoming the default multicore chip.

• Flexible code that can take advantage of different numbers of cores.

• C. Northcote Parkinson, “Data expands to fill the processing power available.”

Page 39

Rule 4: Make Use of Thread-Safe Libraries Wherever Possible

• Intel Math Kernel Library (MKL) • Intel Integrated Performance Primitives (IPP)

Page 40

Rule 5: Use the Right Threading Model

• Don’t use explicit threads if an implicit threading model (e.g., OpenMP or Intel Threading Building Blocks) has all the functionality you need.

Page 41

Rule 6: Never Assume a Particular Order of Execution

Page 42

Rule 7: Use Thread-Local Storage Whenever Possible or Associate Locks to Specific Data

• Synchronization is overhead that does not contribute to the furtherance of the computation.

• You should actively seek to keep the amount of synchronization to a minimum.

• Do so by using storage that is local to threads or by using exclusive memory locations.

Page 43

Rule 8: Dare to Change the Algorithm for a Better Chance of Concurrency

• When choosing between two or more algorithms, programmers may rely on the asymptotic order of execution.

• An O(n log n) algorithm will run faster than an O(n²) algorithm.

• If you cannot easily turn a hotspot into threaded code, consider a suboptimal serial algorithm that is easier to transform, rather than the algorithm currently in the code.

Page 44

Page 45

Parallel Sum

Page 46

PRAM Algorithm

Page 47

PRAM Algorithm

Can we use the PRAM algorithm for parallel sum in a threaded code?

Page 48

A More Practical Algorithm

• Divide the data array into a number of chunks equal to the number of threads to be used.

• Assign each thread a unique chunk and sum the values within the assigned subarray into a private variable.

• Add these local partial sums to compute the total sum of the array elements.
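These three steps can be sketched with Pthreads (names such as sum_chunk and the fixed thread count are illustrative). Each thread sums its chunk into a private field, and because the main thread combines the partial sums only after joining every worker, no lock is needed:

```c
#include <pthread.h>

#define NTHREADS 4
#define N 1000

int data[N];                           /* the array to be summed */

struct chunk { int lo, hi; long partial; };

/* Step 2: each thread sums its assigned subarray into a private field. */
static void *sum_chunk(void *arg) {
    struct chunk *c = arg;
    c->partial = 0;
    for (int i = c->lo; i < c->hi; i++)
        c->partial += data[i];         /* private accumulator */
    return NULL;
}

long parallel_sum(void) {
    pthread_t tid[NTHREADS];
    struct chunk ch[NTHREADS];
    int per = N / NTHREADS;

    /* Step 1: divide the array into one chunk per thread. */
    for (int t = 0; t < NTHREADS; t++) {
        ch[t].lo = t * per;
        ch[t].hi = (t == NTHREADS - 1) ? N : ch[t].lo + per;
        pthread_create(&tid[t], NULL, sum_chunk, &ch[t]);
    }
    /* Step 3: join, then combine the local partial sums. */
    long total = 0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += ch[t].partial;
    }
    return total;
}
```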

Page 49

Prefix Scan

Page 50

Prefix Scan

• PRAM computation for prefix scan

Page 51

Prefix Scan

• A More Practical Algorithm
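A common practical scheme, and presumably the one intended here, runs in three phases: each thread scans its chunk locally, a short serial scan over the chunk totals gives each chunk its offset, and each thread then adds its offset to its chunk. Phases 1 and 3 are fully parallel; the sketch below shows the phases serially:

```c
#define N 8
#define CHUNKS 2

/* Inclusive prefix scan of `in` into `out`, chunk by chunk. */
void chunked_scan(const int in[N], int out[N]) {
    int per = N / CHUNKS, offset[CHUNKS];

    /* Phase 1: independent local scans, one per chunk (parallel). */
    for (int c = 0; c < CHUNKS; c++) {
        int lo = c * per;
        out[lo] = in[lo];
        for (int i = lo + 1; i < lo + per; i++)
            out[i] = out[i - 1] + in[i];
    }
    /* Phase 2: serial scan of the chunk totals -> per-chunk offsets. */
    offset[0] = 0;
    for (int c = 1; c < CHUNKS; c++)
        offset[c] = offset[c - 1] + out[c * per - 1];

    /* Phase 3: independent offset adds, one per chunk (parallel). */
    for (int c = 1; c < CHUNKS; c++)
        for (int i = c * per; i < (c + 1) * per; i++)
            out[i] += offset[c];
}
```

Only phase 2 is serial, and its length equals the number of chunks, so its cost is negligible when the chunks are large.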

Page 52

Page 53

Implicit Threading

• Implicit threading libraries take care of much of the minutiae needed to create, manage, and (to some extent) synchronize threads.

• All the little niggly details are hidden from programmers to make concurrent programming easier to implement and understand.

• OpenMP implements concurrency through special pragmas and directives inserted into your source code to indicate segments that are to be executed concurrently. These pragmas are recognized and processed by the compiler.

• Intel TBB uses defined parallel algorithms to execute methods within user-written classes that encapsulate the concurrent operations.

Page 54

OpenMP

• OpenMP is a set of compiler directives, library routines, and environment variables that specify shared-memory concurrency in FORTRAN, C, and C++ programs.

• All major compilers support the OpenMP language:
  – Microsoft Visual C/C++ .NET for Windows
  – GNU GCC compiler for Linux
  – Intel C/C++ compilers, for both Windows and Linux

Page 55

OpenMP

• OpenMP directives demarcate code that can be executed in parallel (called parallel regions) and control how code is assigned to threads.

• For C and C++: #pragma omp parallel

• OpenMP also has an atomic construct to ensure that statements will be executed in an atomic manner.

• OpenMP provides a reduction clause to handle the details of a concurrent reduction.

Page 56

OpenMP

Page 57

Intel Threading Building Blocks

• Intel TBB is a C++ template-based library for loop-level parallelism that concentrates on defining tasks rather than explicit threads.

• Programmers using TBB can parallelize the execution of loop iterations by treating chunks of iterations as tasks and allowing the TBB task scheduler to determine:
  – the task sizes
  – the number of threads to use
  – the assignment of tasks to those threads
  – how those threads are scheduled for execution

Page 58

Intel Threading Building Blocks

Page 59

Explicit Threading

• Explicit threading libraries require the programmer to control all aspects of threads, including:
  – creating threads
  – associating threads to functions
  – synchronizing
  – controlling the interactions between threads and shared resources

Page 60

Pthreads

• Pthreads has a thread container data type of pthread_t.

• To create a thread and associate it with a function for execution, use the pthread_create() function

• When one thread needs to be sure that some other thread has terminated before proceeding with execution, it calls pthread_join().
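Put together, a minimal sketch of these three calls (doubler and run_in_thread are illustrative names):

```c
#include <pthread.h>

/* Work done by the spawned thread: double the integer it is handed. */
static void *doubler(void *arg) {
    int *n = arg;
    *n *= 2;
    return NULL;
}

/* Spawn a thread to double n, wait for it, then return the result. */
int run_in_thread(int n) {
    pthread_t tid;                            /* thread container type */
    pthread_create(&tid, NULL, doubler, &n);  /* thread runs doubler(&n) */
    pthread_join(tid, NULL);                  /* wait for termination   */
    return n;                                 /* safe only after join   */
}
```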

Page 61

Pthreads

• Threads request the privilege of holding a mutex by calling pthread_mutex_lock().

• Other threads attempting to gain control of the mutex will be blocked until the thread that is holding the lock calls pthread_mutex_unlock().
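A sketch of the usual counter-protection pattern (note the full function names, pthread_mutex_lock and pthread_mutex_unlock; without the mutex the two threads' increments would race):

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

static void *bump(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* blocks if another thread holds it */
        counter++;                    /* critical section */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

long count_with_two_threads(void) {
    pthread_t a, b;
    counter = 0;
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;                   /* exactly 200000 with the mutex */
}
```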

Page 62

Pthreads

• Threads block and wait on a condition variable to be signaled when calling pthread_cond_wait() on a given condition variable.

• An executing thread calls pthread_cond_signal() on a condition variable to wake up a thread that has been blocked on it.

• The pthread_cond_broadcast() function will wake all threads that are waiting on the condition variable.
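A minimal sketch of the wait/signal pattern: one thread waits for a flag, another sets it and signals. The predicate is re-checked in a loop because condition-variable waits can wake spuriously:

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready = PTHREAD_COND_INITIALIZER;
static int done = 0;

/* Sets the flag under the mutex, then wakes one waiting thread. */
static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    done = 1;
    pthread_cond_signal(&ready);
    pthread_mutex_unlock(&m);
    return NULL;
}

int wait_for_producer(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_mutex_lock(&m);
    while (!done)                     /* predicate loop: handles spurious */
        pthread_cond_wait(&ready, &m);/* wakeups and a producer that ran  */
    pthread_mutex_unlock(&m);         /* before we started waiting        */
    pthread_join(t, NULL);
    return done;
}
```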

Page 63

Pthreads

Page 64

Questions?