Parallel Computing Concepts


  • 1/63

    Parallel Programming Concepts

    OpenHPI Course

    Week 5: Distributed Memory Parallelism

    Unit 5.1: Hardware

    Dr. Peter Tröger + Teaching Team

  • 2/63

    Summary: Week 4

    - Accelerators enable major speedup for data parallelism
      - SIMD execution model (no branching)
      - Memory latency managed with many lightweight threads
    - Tackle diversity with OpenCL
      - Loop parallelism with index ranges
      - Kernels in C, compiled at runtime
      - Complex memory hierarchy supported
    - Getting fast is easy, getting faster is hard
      - Best practices for accelerators
      - Hardware knowledge needed

    What if my computational problem still demands more power?

  • 3/63

    Parallelism for

    - Speed: compute faster
    - Throughput: compute more in the same time
    - Scalability: compute faster / more with additional resources
      - Huge scalability only with shared-nothing systems
      - Still also depends on application characteristics

    (Diagram: scaling up adds processing elements A1..A3 / B1..B3 to one machine's main memory; scaling out adds further machines, each with its own main memory)

  • 4/63

    Parallel Hardware

    - Shared memory system
      - Typically a single machine, common address space for tasks
      - Hardware scaling is limited (power / memory wall)
    - Shared nothing (distributed memory) system
      - Tasks on multiple machines, can only access local memory
      - Global task coordination by explicit messaging
      - Easy scale-out by adding machines to the network

    (Diagram: tasks on processing elements sharing one memory through caches, versus tasks on processing elements with local memory that coordinate by exchanging messages)

  • 5/63

    Parallel Hardware

    - Shared memory system → collection of processors
      - Integrated machine for capacity computing
      - Prepared for a large variety of problems
    - Shared-nothing system → collection of computers
      - Clusters and supercomputers for capability computing
      - Installation to solve few problems in the best way
      - Parallel software must be able to leverage multiple machines at the same time
      - Difference to distributed systems (Internet, Cloud):
        - Single organizational domain, managed as a whole
        - Single parallel application at a time, no separation of client and server application
        - Hybrids are possible (e.g. HPC in the Amazon AWS cloud)

  • 6/63

    Shared Nothing: Clusters

    - Collection of stand-alone machines connected by a local network
      - Cost-effective technique for a large-scale parallel computer
      - Users are builders, have control over their system
      - Synchronization much slower than in shared memory
      - Task granularity becomes an issue

    (Diagram: tasks on processing elements with local memory, exchanging messages over the network)

  • 7/63

    Shared Nothing: Supercomputers

    - Supercomputers / Massively Parallel Processing (MPP) systems
      - (Hierarchical) cluster with a lot of processors
      - Still standard hardware, but specialized setup
      - High-performance interconnection network
      - For massive data-parallel applications, mostly simulations (weapons, climate, earthquakes, airplanes, car crashes, ...)
    - Examples (Nov 2013)
      - BlueGene/Q, 1.5 million cores, 1.5 PB memory, 17.1 PFlops
      - Tianhe-2, 3.1 million cores, 1 PB memory, 17,808 kW power, 33.86 PFlops (quadrillions of calculations per second)
    - Annual ranking with the TOP500 list (www.top500.org)

  • 8/63

    Example

    Blue Gene/Q (2011 IBM Corporation)

    1. Chip: 16+2 µP cores
    2. Single Chip Module
    3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
    4. Node card: 32 compute cards, optical modules, link chips; 5D torus
    5a. Midplane: 16 node cards
    5b. I/O drawer: 8 I/O cards w/ 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
    6. Rack: 2 midplanes
    7. System: 96 racks, 20 PF/s

    Sustained single-node performance: 10x P, 20x L; MF/Watt: (6x) P, (10x) L (~2 GF/W, Green 500 criteria). Software and hardware support for programming models for exploitation of node hardware concurrency.

  • 9/63

    Interconnection Networks

    - Bus systems
      - Static approach, low costs
      - Shared communication path, broadcasting of information
      - Scalability issues with shared bus
    - Completely connected networks
      - Static approach, high costs
      - Only direct links, optimal performance
    - Star-connected networks
      - Static approach with central switch
      - Fewer links, still very good performance
      - Scalability depends on central switch

    (Diagrams: processing elements on a shared bus, a completely connected network, and a star network around a central switch)

  • 10/63

    Interconnection Networks

    - Crossbar switch
      - Dynamic switch-based network
      - Supports multiple parallel direct connections without collisions
      - Fewer edges than a completely connected network, but still scalability issues
    - Fat tree
      - Use wider links in higher parts of the interconnect tree
      - Combines tree design advantages with a solution for root node scalability
      - Communication distance between any two nodes is no more than 2 log(#PEs)

    (Diagrams: a crossbar switch connecting PE1..PEn, and a fat tree of switches above the processing elements)

  • 11/63

    Interconnection Networks

    - Linear array
    - Ring
      - Linear array with connected endings
    - N-way D-dimensional mesh
      - Matrix of processing elements
      - Not more than N neighbor links
      - Structured in D dimensions
    - N-way D-dimensional torus
      - Mesh with wrap-around connections

    (Diagrams: 4-way 2D mesh, 8-way 2D mesh, 4-way 2D torus)

  • 12/63

    Example: Blue Gene/Q 5D Torus

    - 5D torus interconnect in the Blue Gene/Q supercomputer
      - 2 GB/s on all 10 links, 80 ns latency to direct neighbors
      - Additional link for communication with I/O nodes

    [IBM]

  • 13/63

    Parallel Programming Concepts

    OpenHPI Course

    Week 5: Distributed Memory Parallelism

    Unit 5.2: Granularity and Task Mapping

    Dr. Peter Tröger + Teaching Team

  • 14/63

    Workload

    - Last week showed that task granularity may be flexible
      - Example: OpenCL work group size
    - But: communication overhead becomes significant now
      - What is the right level of task granularity?

  • 15/63

    Surface-To-Volume Effect

    - Envision the work to be done (in parallel) as a sliced 3D cube
      - Not a demand on the application data, just a representation
      - Slicing represents splitting into tasks
    - Computational work of a task
      - Proportional to the volume of the cube slice
      - Represents the granularity of decomposition
    - Communication requirements of the task
      - Proportional to the surface of the cube slice
    - Communication-to-computation ratio (see the worked example below)
      - Fine granularity: communication high, computation low
      - Coarse granularity: communication low, computation high
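    A small worked illustration of this ratio (not on the slide), assuming the work is a cube cut into cubic slices of edge length s:

    \[
      \text{computation per task} \propto s^3, \qquad
      \text{communication per task} \propto 6\,s^2, \qquad
      \frac{\text{communication}}{\text{computation}} \propto \frac{6}{s}
    \]

    So doubling the slice edge length (coarser granularity) halves the communication-to-computation ratio, at the price of fewer tasks that can run in parallel.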

  • 16/63

    Surface-To-Volume Effect

    [nicerweb.com]

    - Fine-grained decomposition for using all processing elements?
    - Coarse-grained decomposition to reduce communication overhead?
    - A tradeoff question!

  • 17/63

    Surface-To-Volume Effect

    - Heatmap example with 64 data cells
    - Version (a): 64 tasks
      - 64 x 4 = 256 messages, 256 data values
      - 64 processing elements used in parallel
    - Version (b): 4 tasks
      - 16 messages, 64 data values
      - 4 processing elements used in parallel

    [Foster]

  • 18/63

    Surface-To-Volume Effect

    - Rule of thumb
      - Agglomerate tasks to avoid communication
      - Stop when parallelism is no longer exploited well enough
      - Agglomerate in all dimensions at the same time
    - Influencing factors
      - Communication technology + topology
      - Serial performance per processing element
      - Degree of application parallelism
    - Task communication vs. network topology
      - Resulting task graph must be mapped to the network topology
      - Task-to-task communication may need multiple hops

    [Foster]

  • 19/63

    The Task Mapping Problem

    - Given
      - a number of homogeneous processing elements with performance characteristics,
      - some interconnection topology of the processing elements with performance characteristics,
      - an application divisible into parallel tasks.
    - Questions:
      - What is the optimal task granularity?
      - How should the tasks be placed on processing elements?
      - Do we still get speedup / scale-up by this parallelization?
    - Task mapping is still research, mostly manual tuning today
    - More options with configurable networks / dynamic routing
      - Reconfiguration of hardware communication paths

  • 20/63

    Parallel Programming Concepts

    OpenHPI Course

    Week 5: Distributed Memory Parallelism

    Unit 5.3: Programming with MPI

    Dr. Peter Tröger + Teaching Team

  • 21/63

    Message Passing

    - Parallel programming paradigm for shared-nothing environments
      - Implementations for shared memory available, but typically not the best approach
    - Users submit their message passing program & data as a job
    - Cluster management system creates program instances

    (Diagram: an application is submitted as a job from the submission host; the cluster management software starts instances 0..3 on the execution hosts)

  • 22/63

    Single Program Multiple Data (SPMD)

    (Diagram: the same SPMD program and the input data are distributed to Instance 0 .. Instance 4; every instance executes the identical code shown below)

    // (determine rank and comm_size)
    int token;
    if (rank != 0) {
        // Receive from your left neighbor if you are not rank 0
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n",
               rank, token, rank - 1);
    } else {
        // Set the token's value if you are rank 0
        token = -1;
    }
    // Send your local token value to your right neighbor
    MPI_Send(&token, 1, MPI_INT, (rank + 1) % comm_size,
             0, MPI_COMM_WORLD);
    // Now rank 0 can receive from the last rank.
    if (rank == 0) {
        MPI_Recv(&token, 1, MPI_INT, comm_size - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n",
               rank, token, comm_size - 1);
    }

  • 23/63

    Message Passing Interface (MPI)

    - Many optimized messaging libraries for shared-nothing environments, developed by networking hardware vendors
    - Need for a standardized API solution: Message Passing Interface
      - Definition of API syntax and semantics
      - Enables source code portability, not interoperability
      - Software independent from hardware concepts
    - Fixed number of process instances, defined on startup
      - Point-to-point and collective communication
    - Focus on efficiency of communication and memory usage
    - MPI Forum standard
      - Consortium of industry and academia
      - MPI 1.0 (1994), 2.0 (1997), 3.0 (2012)

  • 24/63

    MPI Communicators

    - Each application instance (process) has a rank, starting at zero
    - Communicator: handle for a group of processes
      - Unique rank numbers inside the communicator group
      - Instance can determine communicator size and own rank (see the sketch below)
      - Default communicator MPI_COMM_WORLD
      - Instance may be in multiple communicator groups

    (Diagram: a communicator with four instances, ranks 0..3, each seeing size 4)
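    The step that the later listings abbreviate as "// (determine rank and comm_size)" could look like this; a minimal sketch, not from the slides:

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[]) {
        int rank, comm_size;
        MPI_Init(&argc, &argv);                     /* start the MPI runtime      */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* own rank, 0 .. size-1      */
        MPI_Comm_size(MPI_COMM_WORLD, &comm_size);  /* number of instances        */
        printf("Instance %d of %d\n", rank, comm_size);
        MPI_Finalize();
        return 0;
    }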

  • 25/63

    Communication

    - Point-to-point communication between instances

    int MPI_Send(void* buf, int count, MPI_Datatype type,
                 int destRank, int tag, MPI_Comm com);
    int MPI_Recv(void* buf, int count, MPI_Datatype type,
                 int sourceRank, int tag, MPI_Comm com);

    - Parameters
      - Send / receive buffer + size + data type
      - Sender provides receiver rank, receiver provides sender rank
      - Arbitrary message tag
    - Source / destination identified by [tag, rank, communicator] tuple
    - Default send / receive will block until the match occurs
    - Useful constants: MPI_ANY_TAG, MPI_ANY_SOURCE (see the sketch below)
    - Variations in the API for different buffering behavior
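    A minimal sketch (not from the slides) of the wildcard constants: rank 0 accepts a message from any sender with any tag and reads the actual sender and tag from the MPI_Status object:

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[]) {
        int rank, comm_size, value;
        MPI_Status status;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
        if (rank == 0) {
            /* one message expected from every other rank, in any order */
            for (int i = 1; i < comm_size; i++) {
                MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &status);
                printf("Got %d from rank %d (tag %d)\n",
                       value, status.MPI_SOURCE, status.MPI_TAG);
            }
        } else {
            value = rank * 10;
            MPI_Send(&value, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }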

  • 26/63

    Example: Ring communication

    // (determine rank and comm_size)
    int token;
    if (rank != 0) {
        // Receive from your left neighbor if you are not rank 0
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n",
               rank, token, rank - 1);
    } else {
        // Set the token's value if you are rank 0
        token = -1;
    }
    // Send your local token value to your right neighbor
    MPI_Send(&token, 1, MPI_INT, (rank + 1) % comm_size,
             0, MPI_COMM_WORLD);
    // Now rank 0 can receive from the last rank.
    if (rank == 0) {
        MPI_Recv(&token, 1, MPI_INT, comm_size - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n",
               rank, token, comm_size - 1);
    }

    [mpitutorial.com]
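    For example, when started with four instances (e.g. mpirun -np 4), ranks 1, 2 and 3 each print that they received token -1 from their left neighbor, and rank 0 finally receives it back from rank 3.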

  • 27/63

    Deadlocks

    Consider:

    int a[10], b[10], myrank;
    MPI_Status status;
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
        MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
    }
    else if (myrank == 1) {
        MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
        MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    }
    ...

    If MPI_Send is blocking, there is a deadlock (a possible fix is sketched below).

    int MPI_Send(void* buf, int count, MPI_Datatype type,
                 int destRank, int tag, MPI_Comm com);
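    A minimal sketch (not from the slides) of one way to avoid this deadlock: rank 1 receives in the same order in which rank 0 sends, so the first blocking send can be matched before the second one is posted. Run with at least two ranks (e.g. mpirun -np 2):

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[]) {
        int a[10] = {0}, b[10] = {0}, myrank;
        MPI_Status status;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        if (myrank == 0) {
            MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
            MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
        } else if (myrank == 1) {
            /* receive order now matches the send order */
            MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
            MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
        }
        MPI_Finalize();
        return 0;
    }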

  • 28/63

    Collective Communication

    - Point-to-point communication vs. collective communication
    - Use cases: synchronization, data distribution & gathering
    - All processes in a (communicator) group communicate together
      - One sender with multiple receivers (one-to-all)
      - Multiple senders with one receiver (all-to-one)
      - Multiple senders and multiple receivers (all-to-all)
    - Typical pattern in supercomputer applications
    - Participants continue once the group communication is done
      - Always a blocking operation
      - Must be executed by all processes in the group
      - No assumptions on the state of other participants on return

  • 29/63

    Barrier

    - Communicator members block until everybody reaches the barrier (see the sketch below)

    (Diagram: three processes each call MPI_Barrier(comm) and continue only after all of them have reached it)

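    A minimal sketch (not from the slides) of a typical barrier use: making sure all ranks have finished a setup phase before a timed phase starts:

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[]) {
        int rank;
        double t0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("Rank %d: setup done\n", rank);
        /* nobody passes this point before all ranks have reached it */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();   /* all ranks start timing together */
        printf("Rank %d: timed phase starts at %f\n", rank, t0);
        MPI_Finalize();
        return 0;
    }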

  • 30/63

    Broadcast

    int MPI_Bcast(void *buffer, int count,
                  MPI_Datatype datatype, int rootRank, MPI_Comm comm)

    - rootRank is the rank of the chosen root process
    - Root process broadcasts the data in buffer to all other processes, itself included
    - On return, all processes have the same data in their buffer (see the sketch below)

    (Diagram: before the broadcast only the root holds D0; afterwards every process holds D0)
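    A minimal sketch (not from the slides): rank 0 determines a value and broadcasts it, so that afterwards every rank holds the same value:

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[]) {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;                /* only the root knows the value yet */
        }
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("Rank %d now has value %d\n", rank, value);   /* 42 everywhere */
        MPI_Finalize();
        return 0;
    }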

  • 31/63

    Scatter

    int MPI_Scatter(void *sendbuf, int sendcnt,
                    MPI_Datatype sendtype, void *recvbuf, int recvcnt,
                    MPI_Datatype recvtype, int rootRank, MPI_Comm comm)

    - The sendbuf buffer on the root process is divided, and the parts are sent to all processes, including the root
    - MPI_SCATTERV allows a varying count of data per rank

    (Diagram: scatter distributes D0..D5 from the root to the processes, one element each; gather collects them back)

  • 32/63

    Gather

    int MPI_Gather(void *sendbuf, int sendcnt,
                   MPI_Datatype sendtype, void *recvbuf, int recvcnt,
                   MPI_Datatype recvtype, int rootRank, MPI_Comm comm)

    - Each process (including the root process) sends the data in its sendbuf buffer to the root process
    - Incoming data in recvbuf is stored in rank order
    - The recvbuf parameter is ignored for all non-root processes

    (Diagram: gather collects D0..D5 from the processes into the root's receive buffer, in rank order; see the sketch below)
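    A minimal sketch (not from the slides): every rank contributes one value and the root receives them in rank order; the fixed receive buffer size is an assumption of this sketch:

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[]) {
        int rank, comm_size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
        int mine = rank * rank;      /* each rank's contribution           */
        int gathered[256];           /* assumes comm_size <= 256           */
        MPI_Gather(&mine, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (rank == 0) {             /* only the root sees the full result */
            for (int i = 0; i < comm_size; i++)
                printf("From rank %d: %d\n", i, gathered[i]);
        }
        MPI_Finalize();
        return 0;
    }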

  • 33/63

    Reduction

    int MPI_Reduce(void *sendbuf, void *recvbuf,
                   int count, MPI_Datatype datatype, MPI_Op op,
                   int rootRank, MPI_Comm comm)

    - Similar to MPI_Gather
    - Additional reduction operation op to aggregate the received data: maximum, minimum, sum, product, boolean operators, max-location, min-location
    - The MPI implementation can overlap communication and reduction calculation for faster results

    (Diagram: Reduce with + combines D0A, D0B and D0C into D0A + D0B + D0C at the root)

  • 34/63

    Example: MPI_Scatter + MPI_Reduce

    /* -- E. van den Berg 07/10/2001 -- */
    #include <stdio.h>
    #include "mpi.h"

    int main (int argc, char *argv[]) {
        int data[] = {1, 2, 3, 4, 5, 6, 7}; // Size must be >= #processors
        int rank, i = -1, j = -1;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        MPI_Scatter ((void *)data, 1, MPI_INT, (void *)&i,
                     1, MPI_INT, 0, MPI_COMM_WORLD);
        printf ("[%d] Received i = %d\n", rank, i);
        MPI_Reduce ((void *)&i, (void *)&j, 1, MPI_INT, MPI_PROD,
                    0, MPI_COMM_WORLD);
        printf ("[%d] j = %d\n", rank, j);
        MPI_Finalize();
        return 0;
    }
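    For example, when started with four instances (e.g. mpirun -np 4), each rank receives one element of data (1, 2, 3, 4), and rank 0 prints j = 24, the product computed by MPI_PROD; on the other ranks j keeps its initial value -1.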

  • 35/63

    What Else

    - Variations: MPI_Isend, MPI_Sendrecv, MPI_Allgather, MPI_Alltoall, ... (see the sketch below)
    - Definition of virtual topologies for better task mapping
    - Complex data types
    - Packing / unpacking (sprintf / sscanf)
    - Group / communicator management
    - Error handling
    - Profiling interface
    - Several implementations available
      - MPICH - Argonne National Laboratory
      - OpenMPI - consortium of universities and industry
      - ...
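    A minimal sketch (not from the slides) of the non-blocking variants from the list above: MPI_Isend / MPI_Irecv return immediately and are completed later with MPI_Wait, which also avoids the ordering deadlock from the earlier unit. The partner logic below assumes exactly two ranks:

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[]) {
        int rank, sendval, recvval;
        MPI_Request sreq, rreq;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        sendval = rank;
        int partner = (rank == 0) ? 1 : 0;   /* assumes exactly two ranks */
        /* post both operations first, then wait - no ordering deadlock */
        MPI_Irecv(&recvval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &rreq);
        MPI_Isend(&sendval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &sreq);
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
        MPI_Wait(&sreq, MPI_STATUS_IGNORE);
        printf("Rank %d received %d\n", rank, recvval);
        MPI_Finalize();
        return 0;
    }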

  • 36/63

    Parallel Programming Concepts

    OpenHPI Course

    Week 5: Distributed Memory Parallelism

    Unit 5.4: Programming with Channels

    Dr. Peter Tröger + Teaching Team

  • 37/63

    Communicating Sequential Processes

    - Formal process algebra to describe concurrent systems
      - Developed by Tony Hoare at the University of Oxford (1977)
        - Also inventor of QuickSort and Hoare logic
      - Computer systems act and interact with the environment
      - Decomposition into subsystems (processes) that operate concurrently inside the system
      - Processes interact with other processes, or the environment
    - Book: Tony Hoare, Communicating Sequential Processes, 1985
    - A mathematical theory, described with algebraic laws
    - CSP channel concept available in many programming languages for shared-nothing systems
    - Complete approach implemented in the Occam language

  • 38/63

    CSP: Processes

    - Behavior of real-world objects can be described through their interaction with other objects
      - Leave out internal implementation details
      - Interface of a process is described as a set of atomic events
    - Example: ATM and user, both modeled as processes
      - card - event: insertion of a credit card in the ATM card slot
      - money - event: extraction of money from the ATM dispenser
    - Alphabet - set of relevant events for an object description
      - Events may never happen, interaction is restricted to these events
      - αATM = αUser = {card, money}
    - A CSP process is the behavior of an object, described with its alphabet

  • 39/63

    Communication in CSP

    - Special class of event: communication
      - Modeled as a unidirectional channel between processes
      - Channel name is a member of the alphabets of both processes
      - Send activity described by multiple c.v events
    - Channel approach assumes rendezvous behavior
      - Sender and receiver block on the channel operation until the message is transmitted
      - Implicit barrier based on communication
    - With a formal foundation, mathematical proofs are possible
      - When two concurrent processes communicate with each other only over a single channel, they cannot deadlock.
      - A network of non-stopping processes which is free of cycles cannot deadlock.

  • 40/63

    What's the Deal?

    - Any possible system can be modeled through event chains
      - Enables mathematical proofs for deadlock freedom, based on the basic assumptions of the formalism (e.g. single channel assumption)
    - Some tools available (check readings page)
    - CSP was the formal base for the Occam language
      - Language constructs follow the formalism
      - Mathematical reasoning about the behavior of written code
    - Still active research (Welsh University), channel concept frequently adopted
      - CSP channel implementations for Java, MPI, Go, C, Python
      - Other formalisms based on CSP, e.g. the Task/Channel model

  • 41/63

    Channels in Scala

    Scope-based channel sharing:

    actor {
      var out: OutputChannel[String] = null
      val child = actor {
        react {
          case "go" => out ! "hello"
        }
      }
      val channel = new Channel[String]
      out = channel
      child ! "go"
      channel.receive {
        case msg => println(msg.length)
      }
    }

    Sending channels in messages:

    case class ReplyTo(out: OutputChannel[String])

    val child = actor {
      react {
        case ReplyTo(out) => out ! "hello"
      }
    }

    actor {
      val channel = new Channel[String]
      child ! ReplyTo(channel)
      channel.receive {
        case msg => println(msg.length)
      }
    }

  • 42/63

    Channels in Go

    package main

    import "fmt"

    func sayHello(ch1 chan string) {
        // (the original listing is cut off at this point in the transcript;
        //  the remainder below is an assumed minimal completion)
        ch1 <- "Hello World"
    }

    func main() {
        ch1 := make(chan string)
        go sayHello(ch1)
        fmt.Println(<-ch1)
    }

  • 43/63

    Channels in Go

    - The select concept allows switching between available channels
      - All channels are evaluated
      - If multiple can proceed, one is chosen randomly
      - Default clause if no channel is available
    - Channels are typically first-class language constructs
      - Example: a client provides a response channel in the request
    - Popular solution to get deterministic behavior

    select {
    case v := <-ch1:
        // (only the first line of this listing survives in the transcript;
        //  the remaining cases are an assumed minimal completion)
        fmt.Println("received", v)
    default:
        fmt.Println("no channel ready")
    }

  • 44/63

    Task/Channel Model

    - Computational model for multi-computers by Ian Foster
    - Similar concepts to CSP
    - Parallel computation consists of one or more tasks
      - Tasks execute concurrently
      - Number of tasks can vary during execution
      - Task: serial program with local memory
      - A task has in-ports and out-ports as interface to the environment
      - Basic actions: read / write local memory, send message on out-port, receive message on in-port, create new task, terminate

  • 45/63

    Task/Channel Model

    - Out-port / in-port pairs are connected by channels
      - Channels can be created and deleted
      - Channels can be referenced as ports, which can be part of a message
      - Send operation is non-blocking
      - Receive operation is blocking
      - Messages in a channel stay in order
    - Tasks are mapped to physical processors by the execution environment
      - Multiple tasks can be mapped to one processor
    - Data locality is an explicit part of the model
    - Channels can model control and data dependencies

  • 46/63

    Programming With Channels

    - Channel-only parallel programs have advantages
      - Performance optimization does not influence semantics
        - Example: shared-memory channels for some parts
      - Task mapping does not influence semantics
        - Align the number of tasks with the problem, not with the execution environment
        - Improves scalability of the implementation
      - Modular design with well-defined interfaces
    - Communication should be balanced between tasks
    - Each task should only communicate with a small group of neighbors

  • 47/63

    Parallel Programming Concepts

    OpenHPI Course

    Week 5: Distributed Memory Parallelism

    Unit 5.5: Programming with Actors

    Dr. Peter Tröger + Teaching Team

  • 48/63

    Actor Model

    - Carl Hewitt, Peter Bishop and Richard Steiger. A Universal Modular Actor Formalism for Artificial Intelligence. IJCAI 1973.
      - Mathematical model for concurrent computation
      - Actor as computational primitive
        - Makes local decisions, concurrently sends / receives messages
        - Has a mailbox for incoming messages
        - Concurrently creates more actors
      - Asynchronous one-way message sending
      - Changing topology allowed, typically no order guarantees
        - Recipient is identified by a mailing address
        - Actors can send their own identity to other actors
    - Available as a programming language extension or library in many environments

  • 49/63

    Erlang - Ericsson Language

    - Functional language with actor support
    - Designed for large-scale concurrency
      - First version in 1986 by Joe Armstrong, Ericsson Labs
      - Available as open source since 1998
    - Language goals driven by Ericsson product development
      - Scalable distributed execution of phone call handling software with a large number of concurrent activities
      - Fault-tolerant operation under timing constraints
      - Online software update
    - Users
      - Amazon EC2 SimpleDB, Delicious, Facebook chat, T-Mobile SMS and authentication, Motorola call processing, Ericsson GPRS and 3G mobile network products, CouchDB, EJabberD, ...

  • 50/63

    Concurrency in Erlang

    - Concurrency Oriented Programming
      - Actor processes are completely independent (shared nothing)
      - Synchronization and data exchange with message passing
      - Each actor process has an unforgeable name
      - If you know the name, you can send a message
      - Default approach is fire-and-forget
      - You can monitor remote actor processes
    - Using this gives you
      - Opportunity for massive parallelism
      - No additional penalty for distribution, despite latency issues
      - Easier fault tolerance capabilities
      - Concurrency by default

  • 51/63

    Actors in Erlang

    - Communication via message passing is part of the language
    - Send never fails, works asynchronously (PID ! Message)
    - Actors have mailbox functionality
      - Queue of received messages, selective fetching
      - Only messages from the same source arrive in-order
      - receive statement with a set of clauses, pattern matching
      - Process is suspended in the receive operation until a match

    receive
        Pattern1 when Guard1 -> expr1, expr2, ..., expr_n;
        Pattern2 when Guard2 -> expr1, expr2, ..., expr_n;
        Other -> expr1, expr2, ..., expr_n
    end

  • 52/63

    Erlang Example: Ping Pong Actors

    (Annotations on the slide: functions exported + #args; Ping actor sending a message to Pong; blocking recursive receive scanning the mailbox; Pong actor sending a message to Ping; test/0 starts the Ping and Pong actors)

    -module(tut15).
    -export([test/0, ping/2, pong/0]).

    ping(0, Pong_PID) ->
        Pong_PID ! finished,
        io:format("Ping finished~n", []);
    ping(N, Pong_PID) ->
        Pong_PID ! {ping, self()},
        receive
            pong ->
                io:format("Ping received pong~n", [])
        end,
        ping(N - 1, Pong_PID).

    pong() ->
        receive
            finished ->
                io:format("Pong finished~n", []);
            {ping, Ping_PID} ->
                io:format("Pong received ping~n", []),
                Ping_PID ! pong,
                pong()
        end.

    test() ->
        Pong_PID = spawn(tut15, pong, []),
        spawn(tut15, ping, [3, Pong_PID]).

    [erlang.org]

  • 53/63

    Actors in Scala

    - Actor-based concurrency in Scala, similar to Erlang
    - Concurrency abstraction on top of threads or processes
    - Communication by a non-blocking send operation and a blocking receive operation with matching functionality

    actor {
      var sum = 0
      loop {
        receive {
          case Data(bytes)       => sum += hash(bytes)
          case GetSum(requester) => requester ! sum
        }
      }
    }

    - All constructs are library functions (actor, loop, receive, !)
    - Alternative self.receiveWithin() call with timeout
    - Case classes act as message type representation

  • 54/63

    Scala Example: Counter Actor

    (Annotations on the slide: case classes acting as message types; starting the counter actor)

    import scala.actors.Actor
    import scala.actors.Actor._

    case class Inc(amount: Int)
    case class Value

    class Counter extends Actor {
      var counter: Int = 0
      def act() = {
        while (true) {
          receive {
            case Inc(amount) =>
              counter += amount
            case Value =>
              println("Value is " + counter)
              exit()
          }
        }
      }
    }

    object ActorTest extends Application {
      val counter = new Counter
      counter.start()
      for (i <- 0 until 100000)   // (loop body is cut off in the transcript;
        counter ! Inc(1)          //  bound and messages are an assumed completion)
      counter ! Value
    }

  • 55/63

    Actor Deadlocks

    - Synchronous send operator !? available in Scala
      - Sends a message and blocks in receive afterwards
      - Intended for the request-response pattern
    - Original asynchronous send makes deadlocks less probable

    // actorA
    actorB !? Msg1(value) match {
      case Response1(r) => // ...
    }
    receive {
      case Msg2(value) => reply(Response2(value))
    }

    // actorB
    actorA !? Msg2(value) match {
      case Response2(r) => // ...
    }
    receive {
      case Msg1(value) => reply(Response1(value))
    }

    // actorA
    actorB ! Msg1(value)
    while (true) {
      receive {
        case Msg2(value) =>
          reply(Response2(value))
        case Response1(r) =>
          // ...
      }
    }

    // actorB
    actorA ! Msg2(value)
    while (true) {
      receive {
        case Msg1(value) =>
          reply(Response1(value))
        case Response2(r) =>
          // ...
      }
    }

    [http://savanne.be/articles/concurrency-in-erlang-scala/]

  • 56/63

    Parallel Programming Concepts

    OpenHPI Course

    Week 5: Distributed Memory Parallelism

    Unit 5.6: Programming with MapReduce

    Dr. Peter Tröger + Teaching Team

  • 57/63

    MapReduce

    - Programming model for parallel processing of large data sets
      - Inspired by map() and reduce() in functional programming
      - Intended for best scalability in data parallelism
    - Huge interest started with the Google Research publication
      - Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters.
      - Google products rely on an internal implementation
    - Apache Hadoop: widely known open source implementation
      - Scales to thousands of nodes
      - Has been shown to process petabytes of data
      - Cluster infrastructure with a custom file system (HDFS)
    - Parallel programming on a very high abstraction level

  • 58/63

    MapReduce Concept

    - Map step
      - Convert input tuples [key, value] with a map() function into one / multiple intermediate tuples [key2, value2] per input
    - Shuffle step: collect all intermediate tuples with the same key
    - Reduce step
      - Combine all intermediate tuples with the same key by some reduce() function to one result per key
    - Developer just defines stateless map() and reduce() functions
    - Framework automatically ensures parallelization
    - Persistence layer needed for input and output only

    [developers.google.com]

  • 59/63

    Example: Character Counting

  • 60/63

    Java Example: Hadoop Word Count

    public class WordCount {

      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }

      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
      ...
    }

    [hadoop.apache.org]

  • 61/63

    MapReduce Data Flow

    [developer.yahoo.com]

  • 62/63

    Advantages

    - Developer never implements communication or synchronization; it is done implicitly by the framework
      - Allows transparent fault tolerance and optimization
    - Running map and reduce tasks are stateless
      - Only rely on their input, produce their own output
      - Repeated execution in case of failing nodes
      - Redundant execution to compensate for nodes with different performance characteristics
    - Scale-out only limited by
      - Distributed file system performance (input / output data)
      - Shuffle step communication performance
    - Chaining of map/reduce tasks is very common in practice
    - But: demands an embarrassingly parallel problem

  • 63/63

    Summary: Week 5

    - Shared nothing systems provide very good scalability
      - Adding new processing elements is not limited by walls
      - Different options for interconnect technology
    - Task granularity is essential
      - Surface-to-volume effect
      - Task mapping problem
    - De-facto standard is MPI programming
    - High-level abstractions with
      - Channels
      - Actors
      - MapReduce

    What steps / strategy would you apply to parallelize a given compute-intensive program?