
Post on 11-Jan-2016


Page 1: Parallel Programming On the IUCAA Clusters

Parallel Programming On the IUCAA Clusters

Sunu Engineer

Page 2: Parallel Programming On the IUCAA Clusters

IUCAA Clusters

The Cluster – Cluster of Intel machines on Linux
Hercules – Cluster of HP ES45 quad-processor nodes

References: http://www.iucaa.ernet.in/

Page 3: Parallel Programming On the IUCAA Clusters

The Cluster

Four Single Processor Nodes with 100 Mbps Ethernet interconnect.

1.4 GHz Intel Pentium 4
512 MB RAM
Linux 2.4 kernel (Red Hat 7.2 distribution)
MPI – LAM 6.5.9
PVM – 3.4.3

Page 4: Parallel Programming On the IUCAA Clusters

Hercules

Four quad processor nodes with Memory Channel interconnect

1.25 GHz Alpha 21264D RISC processor
4 GB RAM
Tru64 5.1A with TruCluster software
Native MPI
LAM 7.0
PVM 3.4.3

Page 5: Parallel Programming On the IUCAA Clusters

Expected Computational Performance

Intel Cluster
Processor ~ 512/590
System GFLOPS ~ 2
Algorithm/Benchmark used – SPECint/SPECfp/HPL

ES45 Cluster
Processor ~ 679/960
System GFLOPS ~ 30
Algorithm/Benchmark used – SPECint/SPECfp/HPL

Page 6: Parallel Programming On the IUCAA Clusters

Parallel Programs

Move towards large-scale distributed programs
Larger class of problems with higher resolution
Enhanced levels of detail to be explored
…

Page 7: Parallel Programming On the IUCAA Clusters

The Starting Point

Model → Single Processor Program → Multi Processor Program

Model → Multiprocessor Program

Page 8: Parallel Programming On the IUCAA Clusters

Decomposition of a Single Processor Program

Temporal: Initialization, Control, Termination

Spatial: Functional, Modular, Object based

Page 9: Parallel Programming On the IUCAA Clusters

Multi Processor Programs

Spatial delocalization – Dissolving the boundary
Single spatial coordinate – Invalid
Single time coordinate – Invalid

Temporal multiplicity
Multiple streams at different rates w.r.t. an external clock.

Page 10: Parallel Programming On the IUCAA Clusters

In comparison

Multiple points of initialization
Distributed control
Multiple points and times of termination
Distribution of the activity in space and time

Page 11: Parallel Programming On the IUCAA Clusters

Breaking up a problem

Page 12: Parallel Programming On the IUCAA Clusters

Yet Another way

Page 13: Parallel Programming On the IUCAA Clusters

And another

Page 14: Parallel Programming On the IUCAA Clusters

Amdahl’s Law
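Only the slide title survives here. In its usual form, Amdahl's law bounds the speedup on N processors as S(N) = 1 / ((1 - p) + p/N), where p is the fraction of the run time that can be parallelized. The sketch below evaluates that formula in C; the function name `amdahl_speedup` and the 90% figure in the note are illustrative assumptions, not values from the talk.

```c
/* Amdahl's law: speedup on `processors` CPUs when a fraction
 * `parallel_fraction` (0..1) of the serial run time parallelizes
 * perfectly and the remainder stays serial.
 * Hypothetical helper, not part of any library. */
double amdahl_speedup(double parallel_fraction, int processors)
{
    double serial_part = 1.0 - parallel_fraction;
    double parallel_part = parallel_fraction / processors;
    return 1.0 / (serial_part + parallel_part);
}
```

For a program that is 90% parallel, `amdahl_speedup(0.9, 4)` is about 3.08 on the four Intel nodes and `amdahl_speedup(0.9, 16)` is 6.4 on the sixteen Hercules processors; no processor count can push it past 10.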

Page 15: Parallel Programming On the IUCAA Clusters

Degrees of refinement

Fine parallelism: Instruction level, Program statement level, Loop level

Coarse parallelism: Process level, Task level, Region level

Page 16: Parallel Programming On the IUCAA Clusters

Patterns and Frameworks

Patterns - Documented solutions to recurring design problems.

Frameworks – Software and hardware structures implementing the infrastructure

Page 17: Parallel Programming On the IUCAA Clusters

Processes and Threads

From heavy multitasking to lightweight multitasking on a single processor

Isolated memory spaces to shared memory space

Page 18: Parallel Programming On the IUCAA Clusters

Posix Threads in Brief

int pthread_create(pthread_t *id, const pthread_attr_t *attributes, void *(*thread_function)(void *), void *arguments)

pthread_exit
pthread_join
pthread_self
pthread_mutex_init
pthread_mutex_lock/unlock
Link with -lpthread
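The calls listed above fit together as in the following minimal sketch, which sums an array with a fixed number of threads and a mutex-protected total. The names `threaded_sum`, `sum_slice`, `struct slice` and `NUM_THREADS` are hypothetical, chosen only for illustration.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4            /* assumed thread count for this sketch */

struct slice {                   /* hypothetical per-thread work descriptor */
    const double *data;
    size_t begin, end;           /* half-open index range [begin, end) */
    double *total;               /* shared accumulator */
    pthread_mutex_t *lock;       /* protects *total */
};

static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    double partial = 0.0;
    for (size_t i = s->begin; i < s->end; i++)
        partial += s->data[i];
    pthread_mutex_lock(s->lock);     /* one short critical section per thread */
    *s->total += partial;
    pthread_mutex_unlock(s->lock);
    return NULL;
}

double threaded_sum(const double *data, size_t n)
{
    pthread_t ids[NUM_THREADS];
    int started[NUM_THREADS];
    struct slice slices[NUM_THREADS];
    pthread_mutex_t lock;
    double total = 0.0;

    pthread_mutex_init(&lock, NULL);
    for (int t = 0; t < NUM_THREADS; t++) {
        slices[t].data = data;
        slices[t].begin = n * t / NUM_THREADS;
        slices[t].end = n * (t + 1) / NUM_THREADS;
        slices[t].total = &total;
        slices[t].lock = &lock;
        started[t] = (pthread_create(&ids[t], NULL, sum_slice, &slices[t]) == 0);
        if (!started[t])
            sum_slice(&slices[t]);   /* creation failed: do the slice inline */
    }
    for (int t = 0; t < NUM_THREADS; t++)
        if (started[t])
            pthread_join(ids[t], NULL);
    pthread_mutex_destroy(&lock);
    return total;
}
```

Each thread accumulates privately and takes the lock once, so contention on the mutex stays small compared with the work in the loop.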

Page 19: Parallel Programming On the IUCAA Clusters

Multiprocessing architectures

Symmetric Multiprocessing
Shared memory
Space unified
Different temporal streams

OpenMP standard

Page 20: Parallel Programming On the IUCAA Clusters

OpenMP Programming

Set of directives to the compiler to express shared memory parallelism
Small library of functions
Environment variables
Standard language bindings defined for FORTRAN, C and C++

Page 21: Parallel Programming On the IUCAA Clusters

OpenMP example

#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv)
{
#pragma omp parallel
    {
        printf("Hello World from %d\n", omp_get_thread_num());
    }
    return 0;
}

C An OpenMP program
      program openmp
      integer omp_get_thread_num
!$OMP PARALLEL
      print *, "Hello world from", omp_get_thread_num()
!$OMP END PARALLEL
      stop
      end

Page 22: Parallel Programming On the IUCAA Clusters

OpenMP directives: Parallel and Work sharing

OMP parallel [clauses]
OMP do [clauses]
OMP sections [clauses]
OMP section
OMP single
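As an illustration of the sections directive, the sketch below runs two independent computations over the same array, one per section. The function name `sum_and_max` is hypothetical. If the compiler is not given its OpenMP flag, the pragmas are ignored and the code runs serially with the same result.

```c
/* Computes the sum and the maximum of v[0..n-1] (n >= 1).
 * The two sections are independent, so a team of threads may run
 * them concurrently. Hypothetical helper for illustration only. */
void sum_and_max(const double *v, int n, double *sum, double *max)
{
    double s = 0.0;
    double m = v[0];

#pragma omp parallel
    {
#pragma omp sections
        {
#pragma omp section
            {
                /* Only this section writes s. */
                for (int i = 0; i < n; i++)
                    s += v[i];
            }
#pragma omp section
            {
                /* Only this section writes m. */
                for (int i = 1; i < n; i++)
                    if (v[i] > m)
                        m = v[i];
            }
        }
    }
    *sum = s;
    *max = m;
}
```

Sections suit a small, fixed number of unrelated tasks; a loop over many similar iterations is better served by the do/for directive.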

Page 23: Parallel Programming On the IUCAA Clusters

Combined work sharing

OMP parallel do
OMP parallel sections

Synchronization

OMP master
OMP critical
OMP barrier
OMP atomic
OMP flush
OMP ordered
OMP threadprivate

Page 24: Parallel Programming On the IUCAA Clusters

OpenMP Directive clauses

shared(list)
private(list)/threadprivate
firstprivate/lastprivate(list)
default(private|shared|none) (Fortran)
default(shared|none) (C/C++)
reduction(operator|intrinsic : list)
copyin(list)
if (expr)
schedule(type[,chunk])
ordered/nowait

Page 25: Parallel Programming On the IUCAA Clusters

OpenMP library functions

omp_get/set_num_threads()
omp_get_max_threads()
omp_get_thread_num()
omp_get_num_procs()
omp_in_parallel()
omp_get/set_(dynamic/nested)()
omp_init/destroy/test_lock()
omp_set/unset_lock()
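A small sketch of how two of these calls are typically combined: request a team size with `omp_set_num_threads()` and read back the size actually granted with `omp_get_num_threads()` inside the parallel region. The wrapper name `team_size` is hypothetical, and the `_OPENMP` guard is an added convenience so the same file still builds when OpenMP is disabled.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the number of threads that executed the parallel region.
 * Falls back to 1 when the compiler was not run in OpenMP mode.
 * Hypothetical helper for illustration only. */
int team_size(int requested)
{
    int team = 1;                        /* serial fallback */
#ifdef _OPENMP
    omp_set_num_threads(requested);      /* a request, not a guarantee */
#pragma omp parallel
    {
#pragma omp single
        team = omp_get_num_threads();    /* written by exactly one thread */
    }
#else
    (void)requested;
#endif
    return team;
}
```

The runtime may grant fewer threads than requested, so code should rely on the value read inside the region, not on the request.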

Page 26: Parallel Programming On the IUCAA Clusters

OpenMP environment variables

OMP_SCHEDULE
OMP_NUM_THREADS
OMP_DYNAMIC
OMP_NESTED

Page 27: Parallel Programming On the IUCAA Clusters

OpenMP Reduction and Atomic Operators

Reduction : +, -, *, &, |, &&, ||
Atomic : ++, --, +, *, -, /, &, >>, <<, |
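A minimal sketch of both mechanisms in C: a `+` reduction for a dot product, and an atomic `++` for a shared counter. The function names `dot` and `count_positive` are hypothetical.

```c
/* Dot product with a + reduction: each thread accumulates into its own
 * private copy of sum, and the copies are combined when the loop ends. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Counts positive entries; the shared counter is updated atomically so
 * that concurrent increments are not lost. */
int count_positive(const double *x, int n)
{
    int count = 0;
#pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (x[i] > 0.0) {
#pragma omp atomic
            count++;
        }
    }
    return count;
}
```

The counter could equally be written as `reduction(+:count)`, which usually scales better; the atomic form is shown only to illustrate the directive.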

Page 28: Parallel Programming On the IUCAA Clusters

Simple loops

do I=1,N
   z(I) = a * x(I) + y
end do

!$OMP parallel do
do I=1,N
   z(I) = a * x(I) + y
end do
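The same loop in C, as a hedged sketch: `y` is taken to be a scalar, as written on the slide, and the function name `scale_add` is hypothetical.

```c
/* z[i] = a * x[i] + y for i = 0..n-1.
 * Iterations are independent, so the loop can be divided among threads. */
void scale_add(int n, double a, const double *x, double y, double *z)
{
#pragma omp parallel for
    for (int i = 0; i < n; i++)
        z[i] = a * x[i] + y;
}
```

The loop index `i` is private to each thread automatically; `a`, `x`, `y` and `z` are shared, which is safe here because each iteration writes a distinct element of `z`.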

Page 29: Parallel Programming On the IUCAA Clusters

Data Scoping

Loop index private by default
Declare as shared, private or reduction

Page 30: Parallel Programming On the IUCAA Clusters

Private variables

!$OMP parallel do private(a,b,c)
do I=1,m
   do j=1,n
      b=f(I)
      c=k(j)
      call abc(a,b,c)
   end do
end do

#pragma omp parallel for private(a,b,c)
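A C sketch of the same nested-loop pattern. One difference from Fortran is worth noting: in C only the index of the parallelized loop is private automatically, so an inner index declared outside the loop must be listed in `private` explicitly. The function `fill_products` and its multiply step are hypothetical stand-ins for the slide's `f`, `k` and `abc`.

```c
/* out[i*n + j] = f[i] * k[j] for an m-by-n grid.
 * b and c are scratch variables; without private() every thread would
 * share one copy of j, b and c and overwrite each other's values. */
void fill_products(int m, int n, const double *f, const double *k, double *out)
{
    int i, j;
    double b, c;

#pragma omp parallel for private(j, b, c)
    for (i = 0; i < m; i++) {
        for (j = 0; j < n; j++) {
            b = f[i];
            c = k[j];
            out[i * n + j] = b * c;
        }
    }
}
```

Declaring `j`, `b` and `c` inside the loop body would make them private without any clause, which is often the simpler choice in C.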

Page 31: Parallel Programming On the IUCAA Clusters

Dependencies

Data dependencies (Lexical/dynamic extent)
Flow dependencies
Classifying and removing the dependencies
Non removable dependencies

Examples

Do I=2,n
   a(I) = a(I)+a(I-1)
end do

Do I=2,N,2
   a(I) = a(I)+a(I-1)
End do
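The two example loops differ in an important way, sketched here in C with 0-based indices. In the first, each iteration reads the value the previous iteration wrote, so the loop as written must run in order. In the second, the stride of 2 means the elements that are read are never written, so the iterations are independent. The function names are hypothetical.

```c
/* Loop-carried flow dependence: a[i] needs the updated a[i-1].
 * Left serial on purpose. */
void accumulate_all(double *a, int n)
{
    for (int i = 1; i < n; i++)
        a[i] += a[i - 1];
}

/* Stride 2: writes touch only odd indices and reads touch only even
 * indices, so no iteration depends on another and the loop can be
 * shared among threads. */
void accumulate_pairs(double *a, int n)
{
#pragma omp parallel for
    for (int i = 1; i < n; i += 2)
        a[i] += a[i - 1];
}
```

Adding a parallel directive to the first loop would compile but could produce wrong results, since OpenMP does not check for dependencies.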

Page 32: Parallel Programming On the IUCAA Clusters

Making sure everyone has enough work

Parallel overhead – Creation of threads, synchronization vs. work done in the loop

!$OMP parallel do schedule(dynamic,3)
schedule type – static, dynamic, guided, runtime
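A sketch of where a dynamic schedule helps: a triangular loop in which later iterations do more work than earlier ones. With a static split the thread holding the last rows would finish last; `schedule(dynamic, 3)` hands out chunks of three iterations as threads become free. The function name `row_sums` and the workload are illustrative assumptions.

```c
/* out[i] = 1 + 2 + ... + (i + 1). Row i costs i + 1 inner iterations,
 * so the work per outer iteration grows steadily. */
void row_sums(int n, double *out)
{
#pragma omp parallel for schedule(dynamic, 3)
    for (int i = 0; i < n; i++) {
        double s = 0.0;                 /* private: declared inside the loop */
        for (int j = 0; j <= i; j++)
            s += (double)(j + 1);
        out[i] = s;
    }
}
```

Dynamic scheduling costs more per chunk than static, so it pays off only when iteration costs are uneven enough to outweigh that overhead.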

Page 33: Parallel Programming On the IUCAA Clusters

Parallel regions – from fine to coarse parallelism

!$OMP parallel
threadprivate and copyin
Work sharing constructs: do, sections, section, single
Synchronization: critical, atomic, barrier, ordered, master
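A sketch of a coarser parallel region in C that combines a work-sharing loop with a critical section: each thread finds the maximum of its share of the array, then the per-thread results are merged one thread at a time. The function name `parallel_max` is hypothetical.

```c
/* Returns the largest element of v[0..n-1] (n >= 1). */
double parallel_max(const double *v, int n)
{
    double global_max = v[0];

#pragma omp parallel
    {
        double local_max = v[0];        /* private: declared inside the region */

        /* nowait: a thread may proceed to the merge as soon as its own
         * share of the loop is done. */
#pragma omp for nowait
        for (int i = 1; i < n; i++)
            if (v[i] > local_max)
                local_max = v[i];

        /* Only one thread at a time may update the shared result. */
#pragma omp critical
        {
            if (local_max > global_max)
                global_max = local_max;
        }
    }
    return global_max;
}
```

Keeping the loop and the merge inside one region means threads are created once for both steps, which is the point of moving from loop-level to region-level parallelism.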

Page 34: Parallel Programming On the IUCAA Clusters

To distributed memory systems

MPI, PVM, BSP …

Page 35: Parallel Programming On the IUCAA Clusters

Some Parallel Libraries

Existing parallel libraries and toolkits include:
PUL, the Parallel Utilities Library from EPCC.
The Multicomputer Toolbox from Tony Skjellum and colleagues at LLNL and MSU.
The Portable, Extensible Toolkit for Scientific Computation (PETSc) from ANL.
ScaLAPACK from ORNL and UTK.
ESSL, PESSL on AIX
PBLAS, PLAPACK, ARPACK
