
MOSSNOHO-CM Programme, 7-8 February 2007, Madrid

Introduction to Parallel Programming

Enrique Lomba
Instituto de Química Física Rocasolano, CSIC

Outline
⇨ Parallel computational models
⇨ The MPI standard
⇨ Basic Linear Algebra Communication Subprograms (BLACS)
⇨ Scalable Linear Algebra PACKage (ScaLAPACK)

Models of computation
⇨ Data parallelism (CRAY 1), SIMD architecture (Single Instruction, Multiple Data). Also found in newer processors (Pentium 4, Alpha)

⇨ Shared memory, on machines with several processors (Symmetric MultiProcessors = SMP)

[Figure: Proc 1, Proc 2 and Proc 3 all attached to a single RAM]

● Programming with threads
● High Performance Fortran (HPF)
● OpenMP (Chandra et al., Parallel Programming in OpenMP, AP)

o All memory is shared (variables can be common to all threads)

o A program can have several threads. Each thread has its own:
   Program counter (the instruction to execute)
   Stack (intermediate storage)

⇨ Message passing
● Message Passing Interface (MPI)
● Parallel Virtual Machine (PVM)

[Figure: processes P1, P2 and P3, each with its own RAM]

● Each process (or processor) has its own memory; the processes communicate with one another explicitly.

✔ Every process involved takes an active part in the communication
   Data are sent (by the sending process)
   Data are received (by the receiving process)

⇨ Remote memory operations
● Access to the memory of another process without its participation
✔ Typical of the CRAY T3E
✔ New MPI-2 standard
✔ New architecture standard: Virtual Interface Architecture (VIA)

⇨ Dynamic process creation
● PVM, POSIX threads, OpenMP, MPI-2
● Absent from MPI-1

Hidra and Ladon

⇨ Computational models are tailored to specific hardware

⇨ Any computational model can run on any hardware through software emulation (at the cost of efficiency)

[Figure labels: Myrinet 2 Gb/s, Ethernet 1 Gb/s]

Hardware in LADON

⇨ LADON is a distributed-memory cluster (DMP)

⇨ The most suitable parallel model is message passing (MPI-1), in due course with remote memory operations (MPI-2)

[Figure: communication stack, from the MPI application down through the logical device (shm, Ch/p4, Mx/Gm) and the communication protocol (Threads, TCP/IP, MX/GM) to the communication device (network card, shared memory)]

Software levels in LADON

⇨ MX library: specific to Myrinet (high performance). Message passing with remote memory operations. Called from C programs.
⇨ MPI-1: built on top of MX (C, C++ or Fortran)
⇨ BLACS: high-level communication subroutines in Fortran. Built on top of MPI.
⇨ ScaLAPACK: Fortran linear algebra library, built on top of BLACS.

Message Passing Interface

⇨ It is a specification (a set of subroutines). Version 1.2.

⇨ Several implementations
✔ LAM (Notre Dame University)
✔ MPICH (Argonne National Lab.)

⇨ Each process has its own assigned memory space

⇨ Communication = copying a portion of memory from one process to another: a cooperative operation

[Figure: Process 1 on Node 1 and Process 2 on Node 2, each with a RAM buffer and a communication device with its own device buffer. Two paths are shown: RAM-buffered communication and non-RAM-buffered communication]

⇨ MPI implements communication with three levels of buffering
● Explicit RAM buffer: MPI_BSEND
● Device buffer (implicit): MPI_SEND
● No buffer: MPI_SSEND

⇨ Minimal arguments of a subroutine
● Sender
✔ Data (first memory location + length)
✔ Destination process
✔ Data identifier (tag)
● Receiver
✔ First memory location + length where the data will be stored
✔ Variable that stores the identifier of the process sending the data
● send(address,length,destination,tag)
● receive(address,length,source,tag,actual_length)

⇨ MPI data types
● MPI_INTEGER
● MPI_REAL
● MPI_DOUBLE_PRECISION
● MPI_COMPLEX
● MPI_LOGICAL
● MPI_CHARACTER

⇨ Describing the data in a send
● (address,count,datatype), e.g. (A,3,MPI_REAL)

⇨ MPI data types are generic
⇨ The tag identifier allows messages to be received in order
⇨ Processes are identified by rank, from 0 to N-1 (N processes)

[Figure: array elements A(1), A(4), A(7)]

⇨ MPI also defines
● context: the set of processes and variables involved in a calculation
● group: the processes involved in one part of the calculation

⇨ context + group = communicator: identified by an integer variable. This avoids relying on tags alone (and possible interference with subroutines from internal libraries)

⇨ Initialization and information
● MPI_Init(ierr)
✔ Integer: ierr
● MPI_Comm_rank(com,my_rank,ierr): Who am I?
✔ Integer: com, my_rank, ierr
➢ com = MPI_COMM_WORLD
● MPI_Comm_size(com,np,ierr): How many processes are there?
✔ Integer: np

⇨ Point-to-point communication
● MPI_Send(mess,cont,datatyp,dest,tag,com,ierror)
✔ Integer cont,datatyp,dest,tag,com,ierror
● MPI_Recv(mess,cont,datatyp,source,tag,com,status,ierror)
✔ Integer cont,datatyp,source,tag,com,ierror
✔ Integer status(MPI_STATUS_SIZE)
● Blocking communication
✔ Partial in MPI_Send: until a buffer is available
✔ Total in MPI_Recv: until the data have been received

⇨ Collective communication
● MPI_Bcast(mess,count,datatyp,root,comm,ierror)
✔ Integer root (identifies the process that sends the data)
● MPI_Barrier(comm,ierror)
✔ Synchronizes the processes
● MPI_Gather (MPI_Allgather)
✔ All processes send data to root, which collects them in rank order
● MPI_Scatter
✔ Root sends data (a vector) to all processes. Each process receives a portion of the vector, in rank order.
● Fully blocking
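The rank ordering of MPI_Scatter and MPI_Gather can be illustrated without an MPI runtime. What follows is a serial Python sketch, not MPI code: the helper names `scatter` and `gather` are hypothetical, and list index r stands in for the buffer of the process with rank r.

```python
# Serial emulation of the data movement performed by MPI_Scatter and MPI_Gather.
# Hypothetical helpers: no MPI library is used; list index r plays the role of rank r.

def scatter(vector, nprocs):
    """Split root's vector into nprocs equal chunks, one per rank, in rank order."""
    if len(vector) % nprocs != 0:
        raise ValueError("vector length must be a multiple of nprocs")
    chunk = len(vector) // nprocs
    return [vector[r * chunk:(r + 1) * chunk] for r in range(nprocs)]

def gather(chunks):
    """Concatenate the per-rank chunks at root, ordered by rank."""
    result = []
    for part in chunks:
        result.extend(part)
    return result

data = list(range(1, 13))                       # vector held by root
parts = scatter(data, 4)                        # what each of 4 ranks receives
squared = [[x * x for x in p] for p in parts]   # local work on each rank
collected = gather(squared)                     # root collects results in rank order
print(parts[1], collected[:4])
```

Because both operations follow rank order, a scatter followed by a gather returns the elements to root in their original positions.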

[Figure: diagram of collective communication]

⇨ Collective computation
● MPI_Reduce(operand, result, count, datatyp, op, root, comm, ierror)
✔ <type> operand(*), result(*)
✔ Integer op =
   ● MPI_MAX, MPI_MIN
   ● MPI_SUM, MPI_PROD
   ● MPI_MAXLOC, MPI_MINLOC
   ● User-defined operations (advanced)
✔ Collective operation over the operand values of all processes, storing the result in result
● Fully blocking
● MPI_Allreduce
✔ Returns the result to all processes in the communicator
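The semantics of the reduction operations can be sketched serially. The Python helpers below (`reduce_sum`, `reduce_maxloc`, `allreduce_sum`) are hypothetical names chosen for illustration; they only mimic what MPI_Reduce and MPI_Allreduce deliver, with list index r standing for the operand held by rank r.

```python
# Serial emulation of MPI_Reduce / MPI_Allreduce semantics.
# Hypothetical helpers: list index r stands for the operand held by rank r.

def reduce_sum(operands):
    """MPI_SUM: element-wise sum over ranks (count elements per rank)."""
    return [sum(column) for column in zip(*operands)]

def reduce_maxloc(values):
    """MPI_MAXLOC on one scalar per rank: (maximum, lowest rank holding it)."""
    best_rank = 0
    for rank, value in enumerate(values):
        if value > values[best_rank]:
            best_rank = rank
    return values[best_rank], best_rank

def allreduce_sum(operands):
    """MPI_Allreduce: every rank ends up with a copy of the reduced result."""
    result = reduce_sum(operands)
    return [list(result) for _ in operands]

operands = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # count = 2 on each of 3 ranks
total = reduce_sum(operands)
peak = reduce_maxloc([0.5, 2.5, 2.5])
everywhere = allreduce_sum(operands)
print(total, peak)
```

With MPI_Reduce only root holds `total`; with MPI_Allreduce each rank holds its own copy, which is what `everywhere` represents.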

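• Example program: calculating π

$$\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; h \sum_{i=0}^{N-1} f\!\left(x_i + \frac{h}{2}\right), \qquad f(x) = \frac{4}{1+x^2},\quad x_i = i\,h,\quad h = \frac{1}{N}$$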

      program main
      include "mpif.h"
      double precision PI25DT
      parameter (PI25DT = 3.141592653589793238462643d0)
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

 10   if ( myid .eq. 0 ) then
         print *, 'Enter the number of intervals: (0 quits) '
         read(*,*) n
      endif
c     broadcast n
      call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
c     check for quit signal
      if ( n .le. 0 ) goto 30
c     calculate the interval size
      h = 1.0d0/n
      sum = 0.0d0
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   continue
      mypi = h * sum
c     collect all the partial sums
      call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,
     &                MPI_COMM_WORLD,ierr)
c     node 0 prints the answer.
      if (myid .eq. 0) then
         print *, 'pi is ', pi, ' Error is', abs(pi - PI25DT)
      endif
      goto 10
 30   call MPI_FINALIZE(ierr)
      stop
      end
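The work distribution in the program above (rank myid takes intervals myid+1, myid+1+numprocs, ...) can be checked serially. This Python sketch emulates the ranks with a loop; `partial_pi` and `parallel_pi` are hypothetical names, and the final `sum` stands in for MPI_REDUCE with MPI_SUM.

```python
import math

def f(a):
    """Integrand: 4 / (1 + a^2), whose integral over [0, 1] is pi."""
    return 4.0 / (1.0 + a * a)

def partial_pi(myid, numprocs, n):
    """Contribution of rank myid: intervals myid+1, myid+1+numprocs, ... (1-based)."""
    h = 1.0 / n
    s = 0.0
    for i in range(myid + 1, n + 1, numprocs):
        x = h * (i - 0.5)          # midpoint of interval i
        s += f(x)
    return h * s

def parallel_pi(n, numprocs):
    """Sum of all partial results, as MPI_REDUCE with MPI_SUM would deliver to rank 0."""
    return sum(partial_pi(rank, numprocs, n) for rank in range(numprocs))

pi_est = parallel_pi(1000, 4)
print(pi_est, abs(pi_est - math.pi))
```

The cyclic assignment of intervals balances the load for any n, since no rank receives more than one interval beyond any other.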

⇨ Prototype parallel algorithm: master-slave (self-scheduling)
● The master distributes tasks among the slaves and collects the results

⇨ Example: matrix × vector

$$c = A\,b, \qquad c_i = \sum_{j=1}^{n} A_{ij}\, b_j$$

      numsent = 0
c     send b to each slave process
      call MPI_BCAST(b, cols, MPI_DOUBLE_PRECISION, master,
     &               MPI_COMM_WORLD, ierr)
c     send a row to each slave process; tag with row number
      do 40 i = 1,min(numprocs-1,rows)
         do 30 j = 1,cols
            buffer(j) = a(i,j)
 30      continue
         call MPI_SEND(buffer, cols, MPI_DOUBLE_PRECISION, i,
     &                 i, MPI_COMM_WORLD, ierr)
         numsent = numsent+1
 40   continue
      do 70 i = 1,rows
         call MPI_RECV(ans, 1, MPI_DOUBLE_PRECISION,
     &                 MPI_ANY_SOURCE, MPI_ANY_TAG,
     &                 MPI_COMM_WORLD, status, ierr)
         sender  = status(MPI_SOURCE)
         anstype = status(MPI_TAG)       ! row is tag value
         c(anstype) = ans
         if (numsent .lt. rows) then     ! send another row
            do 50 j = 1,cols
               buffer(j) = a(numsent+1,j)
 50         continue
            call MPI_SEND(buffer, cols, MPI_DOUBLE_PRECISION,
     &                    sender, numsent+1, MPI_COMM_WORLD, ierr)
            numsent = numsent+1
         else                            ! tell sender that there is no more work
            call MPI_SEND(MPI_BOTTOM, 0, MPI_DOUBLE_PRECISION,
     &                    sender, 0, MPI_COMM_WORLD, ierr)
         endif
 70   continue

Master code (above)

Slave code (below)

c     slaves receive b, then compute dot products until
c     done message received
      call MPI_BCAST(b, cols, MPI_DOUBLE_PRECISION, master,
     &               MPI_COMM_WORLD, ierr)
c     skip if more processes than work
      if (rank .gt. rows)
     &   goto 200
 90   call MPI_RECV(buffer, cols, MPI_DOUBLE_PRECISION, master,
     &              MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
      if (status(MPI_TAG) .eq. 0) then
         go to 200
      else
         row = status(MPI_TAG)
         ans = 0.0
         do 100 i = 1,cols
            ans = ans+buffer(i)*b(i)
 100     continue
         call MPI_SEND(ans, 1, MPI_DOUBLE_PRECISION, master,
     &                 row, MPI_COMM_WORLD, ierr)
         go to 90
      endif
 200  continue
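The self-scheduling logic of the master and slave listings can be traced in a serial emulation. In this Python sketch the function `master_slave_matvec` and the queue `pending` are hypothetical: the queue stands in for the messages that MPI_RECV with MPI_ANY_SOURCE would deliver, and rows are numbered from 0 rather than 1.

```python
from collections import deque

def master_slave_matvec(a, b, numprocs):
    """Emulate self-scheduling: rank 0 is the master, ranks 1..numprocs-1 are slaves."""
    rows = len(a)
    c = [0.0] * rows
    pending = deque()              # answers in flight to the master: (sender, row, ans)
    numsent = 0
    rows_done_by = {rank: 0 for rank in range(1, numprocs)}

    def slave(rank, row):
        """Slave work: dot product of the received row with b; reply tagged with the row."""
        ans = sum(x * y for x, y in zip(a[row], b))
        rows_done_by[rank] += 1
        pending.append((rank, row, ans))

    # Initial distribution: one row per slave (fewer if there are more slaves than rows).
    for rank in range(1, min(numprocs - 1, rows) + 1):
        slave(rank, numsent)
        numsent += 1

    # Master loop: receive any answer, then hand the sender another row if one is left.
    for _ in range(rows):
        sender, row, ans = pending.popleft()
        c[row] = ans
        if numsent < rows:
            slave(sender, numsent)
            numsent += 1
    return c, rows_done_by

a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
b = [1.0, -1.0]
c, work = master_slave_matvec(a, b, numprocs=3)
print(c, work)
```

Since a new row is handed out only when an answer comes back, faster slaves automatically receive more rows; in this deterministic emulation the split is simply as even as possible.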

Scaling

• Number of operations: n(n+n-1)
• Number of transmissions: n(n+1)

$$\text{Efficiency} = \lim_{n\to\infty} \frac{(n^2+n)\,T_{com}}{(2n^2-n)\,T_{op}} = \frac{T_{com}}{2\,T_{op}}$$
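The limit in the scaling estimate can be checked numerically. The Python sketch below evaluates the ratio of communication time to computation time (the quantity the slide labels efficiency) for growing n; the values of `t_com` and `t_op` are arbitrary assumptions chosen only for illustration.

```python
def comm_to_comp_ratio(n, t_com, t_op):
    """Communication time over computation time for an n x n matrix-vector product."""
    transmissions = n * (n + 1)
    operations = n * (n + n - 1)
    return (transmissions * t_com) / (operations * t_op)

t_com, t_op = 10.0, 1.0        # assumed, illustrative costs in arbitrary units
for n in (10, 100, 1000, 10000):
    print(n, comm_to_comp_ratio(n, t_com, t_op))
limit = t_com / (2.0 * t_op)
print("limit:", limit)
```

The ratio does not shrink as n grows, which indicates that for this algorithm communication cost remains comparable to computation at any problem size.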

Using communicators
⇨ They restrict communication to a group of processes (with given characteristics)
⇨ MPI_COMM_WORLD (integer) = all processes
⇨ Creating communicators

● MPI_COMM_SPLIT(MPI_COMM_WORLD, color, key, newcom, ierr)

Integer color, key, newcom, ierr (the processes that pass the same value of color form a new communicator, in which they are ranked according to the value of key)

Call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
Call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
key = 0
!
! Separate processes with even numbers from those with odd numbers
! in communicators newcom and newcom1
!
color = Mod(myid,2)
color1 = Mod(myid+1,2)
If (myid.Eq.1) itest = -50
! create communicator newcom (even processes)
Call MPI_COMM_SPLIT(MPI_COMM_WORLD,color,key,newcom,ierr)
! create communicator newcom1 (odd processes)
Call MPI_COMM_SPLIT(MPI_COMM_WORLD,color1,key,newcom1,ierr)
! broadcast to odd processes
Call MPI_BCAST(itest,1,MPI_INTEGER,0,newcom1,ierr)
If (myid.Eq.0) Then
   Print *, ' Enter itest ?'
   Read(*,*) itest
   ! broadcast to even processes
   Call MPI_BCAST(itest,1,MPI_INTEGER,0,newcom,ierr)
   Print *, 'Comunicador nuevo: itest =',itest,' in process ', myid
Else
   Call MPI_BCAST(itest,1,MPI_INTEGER,0,newcom,ierr)
   Print *, 'Comunicador nuevo: itest =',itest,' in process ', myid
End If
If (myid.Eq.0) Then
   Print *, ' Enter itest again ?'
   Read(*,*) itest
Endif
Call MPI_BCAST(itest,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
Print *, ' itest =',itest,' in process ', myid
Call MPI_FINALIZE(ierr)
End Program comunicador

Example of communicator usage
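How MPI_COMM_SPLIT groups and ranks processes can be emulated serially. In this Python sketch the function `comm_split` is a hypothetical stand-in: it takes the color and key that each world rank would pass and returns, for every rank, the members of its new communicator and its rank within it.

```python
def comm_split(colors, keys):
    """Emulate MPI_Comm_split.

    colors[r] and keys[r] are the arguments passed by world rank r.
    Returns, for each world rank, (members, new_rank), where members lists the
    world ranks of its new communicator in new-rank order.
    """
    groups = {}
    for world_rank, color in enumerate(colors):
        groups.setdefault(color, []).append(world_rank)
    result = [None] * len(colors)
    for members in groups.values():
        # Order by key; ties are broken by rank in the old communicator.
        members.sort(key=lambda r: (keys[r], r))
        for new_rank, world_rank in enumerate(members):
            result[world_rank] = (members, new_rank)
    return result

numprocs = 6
colors = [myid % 2 for myid in range(numprocs)]   # color = Mod(myid,2)
keys = [0] * numprocs                             # key = 0 everywhere
split = comm_split(colors, keys)
for myid, (members, new_rank) in enumerate(split):
    print(myid, members, new_rank)
```

Every process obtains a communicator from the call, one per distinct color. With key = 0 everywhere, world rank 1 becomes rank 0 among the odd processes, which is why a broadcast with root 0 in that communicator distributes the value of itest set on process 1.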

Calculating π by Monte Carlo

• Algorithm:

$$\pi \approx 4\,\frac{N_{in}}{N_{total}}$$

⇨ Server: generates random numbers and dispatches them to the clients

⇨ Clients: compute partial approximations to π and add them up

⇨ There could be more than one server:

[Figure: an intracommunicator for the servers and an intracommunicator for the clients, linked by an intercommunicator]

⇨ Server code

request = 1
Do While (request > 0)
   !
   ! receive request from workers
   !
   Call MPI_RECV(request,1,MPI_INTEGER,MPI_ANY_SOURCE,MPI_ANY_TAG&
        &,MPI_COMM_WORLD,status,ierr)
   If (request > 0) Then
      !
      ! Send random numbers
      !
      Call Random_number(rand(1:size))
      Call MPI_SEND(rand,size,MPI_DOUBLE_PRECISION&
           &,status(MPI_SOURCE),reply_tag,MPI_COMM_WORLD,ierr)
   End If
End Do

⇨ Client code

Do While (.Not. done)
   !
   ! send request for work
   !
   Call MPI_SEND(request,1,MPI_INTEGER,server,request_tag&
        &,MPI_COMM_WORLD,ierr)
   !
   ! receive random numbers
   !
   Call MPI_RECV(rand,size,MPI_DOUBLE_PRECISION,server,reply_tag&
        &,MPI_COMM_WORLD,status,ierr)
   Do i=1,size-1,2
      x = 2*rand(i)-1
      y = 2*rand(i+1)-1
      If (x*x+y*y > 1.0d0) Then
         out = out+1
      Else
         in = in+1
      End If
   End Do
   !
   ! get results from all workers and add them
   !
   Call MPI_ALLREDUCE(in, totalin, 1, MPI_INTEGER, MPI_SUM,&
        & workers,ierr)
   Call MPI_ALLREDUCE(out, totalout, 1, MPI_INTEGER, MPI_SUM,&
        & workers,ierr)
   Pi= (4.0d0*Dble(totalin))/Dble(totalin + totalout)
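The server/client split can be emulated serially to check the estimator. In the Python sketch below, `count_hits` mirrors the inner loop of the client code; `monte_carlo_pi`, the seed and the batch sizes are assumptions made for illustration, with a single seeded generator standing in for the server and plain addition standing in for MPI_ALLREDUCE.

```python
import math
import random

def count_hits(rand):
    """Client step: consume random numbers in pairs, count points inside the unit circle."""
    n_in = n_out = 0
    for i in range(0, len(rand) - 1, 2):
        x = 2.0 * rand[i] - 1.0
        y = 2.0 * rand[i + 1] - 1.0
        if x * x + y * y > 1.0:
            n_out += 1
        else:
            n_in += 1
    return n_in, n_out

def monte_carlo_pi(n_clients, batches, size, seed=12345):
    """Server hands out batches of random numbers; client counts are summed."""
    rng = random.Random(seed)      # stands in for the server's generator
    total_in = total_out = 0
    for _ in range(batches):
        for _client in range(n_clients):
            rand = [rng.random() for _ in range(size)]
            n_in, n_out = count_hits(rand)
            total_in += n_in
            total_out += n_out
    return 4.0 * total_in / (total_in + total_out)

pi_est = monte_carlo_pi(n_clients=4, batches=50, size=2000)
print(pi_est, abs(pi_est - math.pi))
```

Keeping the generator on a single server avoids correlated random streams on the clients, at the price of one message exchange per batch.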

Derived data types in MPI
⇨ They allow efficient transmission of data that are not contiguous in memory, and of structured data
● Typemap = {(type_0,disp_0),…,(type_n-1,disp_n-1)}

⇨ The data must be aligned on the appropriate boundaries (e.g. real*8 values at absolute displacements that are multiples of 8 bytes)

[Figure: memory layout of two Real*8 values and one Character*1, with byte offsets 0, 8, 16 and 17]

Defining new types

⇨ MPI_type_vector(count,blocklength,stride,oldtype,newtype,ierror)

⇨ MPI_type_contiguous(count, oldtype, newtype, ierror)

⇨ MPI_type_struct(count, blocklengths, offsets, oldtypes, newtype, ierror)
● Integer count, blocklength, blocklengths(count), offsets(count), stride, oldtype, oldtypes(count), newtype, ierror

Utilities

⇨ MPI_type_extent(datatype, extent, ierror): length of a type

⇨ MPI_type_commit(datatype, ierror): commits (materializes) the type

⇨ MPI_type_free(datatype, ierror)

o Integer datatype, extent, ierror

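• Nrows × Ncols matrices in memory (Fortran only)

$$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}$$

Elements of a column (a11, a21, a31) are contiguous in memory.

Elements of a row (a11, a12, a13) are separated by Nrows.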
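The column-major layout can be made concrete with a little index arithmetic. This Python sketch uses a hypothetical helper `offset` that returns the 0-based position in memory of the Fortran element a(i,j).

```python
def offset(i, j, nrows):
    """Linear offset (0-based) of element a(i,j), 1-based indices, column-major storage."""
    return (i - 1) + (j - 1) * nrows

nrows, ncols = 3, 3
column_2 = [offset(i, 2, nrows) for i in range(1, nrows + 1)]   # a(1,2), a(2,2), a(3,2)
row_2 = [offset(2, j, nrows) for j in range(1, ncols + 1)]      # a(2,1), a(2,2), a(2,3)
print(column_2, row_2)
```

A column occupies consecutive offsets, so it can be sent as count contiguous values, while a row needs a strided type with stride Nrows.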

Sending columns and rows of matrices

!
! Broadcast of columns (Fortran coding only)
!
Call MPI_BCAST(Forces(1,5),Nmol,MPI_DOUBLE_PRECISION,0, &
     &MPI_COMM_WORLD,ierr)
If (myid.Ne.0) Then
   Write(*,'(/"* Column broadcast: forces matrix in proc. ",i2/40("-"))')myid
   Do i=1, Nmol
      Write(*,'(10f18.4)')(Forces(i,j),j=1,Nmol)
   End Do
   forces(:,:) = 0
End If
!
! Define a new strided type. One element contains Nmol dp
! values separated by a stride of Nmol in the original matrix
!
Call MPI_TYPE_VECTOR(Nmol,1,Nmol,MPI_DOUBLE_PRECISION,rowtype,ierr)
Call MPI_TYPE_COMMIT(rowtype,ierr)
! Broadcast one element of rowtype (Nmol values)
Call MPI_BCAST(Forces(2,1),1,rowtype,0,MPI_COMM_WORLD,ierr)

Sending structured data

Type Molec
   Sequence
   Real(kind=8) :: r(ndim)
   Real(kind=8) :: v(ndim)
   Real(kind=8) :: a(ndim)
   Real(kind=8) :: mass
   Integer :: count
   Character, Dimension(4) :: nombre*1
End Type Molec
Type (Molec), Dimension(:), Allocatable :: Moleculas

……..
!
! determine the extent of various types involved in the calculations
!
Call MPI_TYPE_EXTENT(MPI_DOUBLE_PRECISION, extent, ierr)
Call MPI_TYPE_EXTENT(MPI_INTEGER, ext_int, ierr)
Call MPI_TYPE_EXTENT(MPI_CHARACTER, ext_char, ierr)
!
! define blockcounts and displacements
!
blockcount(0) = ndim*3+1
blockcount(1) = 1
blockcount(2) = 4
displ(0) = 0
! the integer follows the ndim*3+1 double precision values (10*extent for ndim = 3)
displ(1) = blockcount(0)*extent
displ(2) = displ(1)+ext_int
!
! define old types inside the structure
!
types(0) = MPI_DOUBLE_PRECISION
types(1) = MPI_INTEGER
types(2) = MPI_CHARACTER
!
! commit the structure
!
Call MPI_TYPE_STRUCT(3,blockcount,displ,types,particletype,ierr)
Call MPI_TYPE_COMMIT(particletype,ierr)
! Broadcast array of struct
Call MPI_BCAST(Moleculas,Nmol,particletype,0,MPI_COMM_WORLD,ierr)
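The displacement arithmetic of the structure above can be cross-checked against a compiler-style layout. The Python sketch below mirrors the Molec type with `ctypes`; this is only an analogy, since a Fortran sequence type is not guaranteed to share the C layout, and `NDIM = 3` is an assumption consistent with the 10 double precision values counted in the listing.

```python
import ctypes

NDIM = 3   # assumed value of ndim (3*NDIM + 1 = 10 double precision values)

class Molec(ctypes.Structure):
    """ctypes mirror of the Fortran sequence type Molec (illustrative only)."""
    _fields_ = [
        ("r", ctypes.c_double * NDIM),
        ("v", ctypes.c_double * NDIM),
        ("a", ctypes.c_double * NDIM),
        ("mass", ctypes.c_double),
        ("count", ctypes.c_int),
        ("nombre", ctypes.c_char * 4),
    ]

extent = ctypes.sizeof(ctypes.c_double)    # plays the role of MPI_TYPE_EXTENT
ext_int = ctypes.sizeof(ctypes.c_int)

# Same bookkeeping as blockcount/displ in the listing.
blockcount = [NDIM * 3 + 1, 1, 4]
displ = [0, blockcount[0] * extent, blockcount[0] * extent + ext_int]

# Offsets reported by ctypes for the first field of each block.
layout = [Molec.r.offset, Molec.count.offset, Molec.nombre.offset]
print(displ, layout, ctypes.sizeof(Molec))
```

On common platforms the hand-computed displacements match the actual field offsets, and the total size is a multiple of 8 bytes, so consecutive array elements stay aligned.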

⇨ Virtual topologies
● Virtual arrangement of the processes on a grid
✔ One-dimensional (0...Np)
✔ Two-dimensional: Cartesian (0,0),(0,1)..(0,mp)..(np,0)...(np,mp)
✔ Three-dimensional

● The system (the MPI implementation, when reordering is allowed) manages the optimal placement according to the hardware

● Grids can be periodic: e.g. process 0 is a neighbor of process 1 and of process Np

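[Figure: periodic Cartesian grid in 2D. A 4 x 4 grid of processes numbered 1 to 16, with corner coordinates labeled (0,0), (0,4) and (4,4); rows and columns wrap around, so each edge process neighbors the process on the opposite edge]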

Basic functions

⇨ MPI_CART_CREATE(oldcomm, Ndim, dims, isperiodic, reorder, commNd, ierr)
● Integer oldcomm, Ndim, dims(Ndim), commNd, ierr
● Logical reorder, isperiodic(Ndim)

⇨ MPI_DIMS_CREATE(numprocs, Ndim, dims, ierr)
● Integer numprocs

⇨ MPI_CART_GET(commNd, Ndim, dims, isperiodic, coords, ierr)
● Integer coords(Ndim)

⇨ MPI_CART_COORDS(commNd, rank, Ndim, coords, ierr)
● Integer rank

How do we identify the neighbors?

P1(0,0)   P0(0,1)   P3(0,2)   P5(0,3)
P7(1,0)   P6(1,1)   P4(1,2)   P8(1,3)
P2(2,0)   P9(2,1)   P11(2,2)  P12(2,3)
P15(3,0)  P13(3,1)  P15(3,2)  P14(3,3)

Non-periodic Cartesian grid

⇨ MPI_CART_SHIFT(comm, direction, shift, source, dest, ierr)
● Integer comm, direction=0,..Ndim-1, shift, source, dest, ierr

⇨ MPI_PROC_NULL. Absence of neighbors: communication with the process MPI_PROC_NULL is not carried out (i.e. it consumes no resources)

⇨ Useful in algorithms where a process acts as a data pipeline
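The bookkeeping behind MPI_CART_COORDS and MPI_CART_SHIFT can be sketched serially. The Python functions below are hypothetical stand-ins that assume the row-major rank numbering of the MPI standard with no reordering of processes; `PROC_NULL = -1` is a placeholder, since the actual value of MPI_PROC_NULL depends on the implementation.

```python
PROC_NULL = -1   # stand-in for MPI_PROC_NULL (the real constant is implementation-defined)

def cart_coords(rank, dims):
    """Rank -> coordinates, row-major (last coordinate varies fastest)."""
    coords = []
    for d in reversed(dims):
        coords.append(rank % d)
        rank //= d
    return list(reversed(coords))

def cart_rank(coords, dims):
    """Coordinates -> rank, inverse of cart_coords."""
    rank = 0
    for c, d in zip(coords, dims):
        rank = rank * d + c
    return rank

def cart_shift(rank, dims, periodic, direction, shift):
    """Return (source, dest) for a shift along one direction."""
    def neighbor(step):
        coords = cart_coords(rank, dims)
        c = coords[direction] + step
        if periodic[direction]:
            c %= dims[direction]              # wrap around on a periodic grid
        elif c < 0 or c >= dims[direction]:
            return PROC_NULL                  # no neighbor beyond the edge
        coords[direction] = c
        return cart_rank(coords, dims)
    return neighbor(-shift), neighbor(shift)

dims = [4, 4]
print(cart_coords(6, dims))
print(cart_shift(3, dims, [False, False], 1, 1))
print(cart_shift(3, dims, [True, True], 1, 1))
```

Process 3 sits at the end of the first row: on the non-periodic grid its destination is PROC_NULL, while on the periodic grid the shift wraps around to process 0.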

Poisson equation: Jacobi method

$$\nabla^2 u(x,y) = f(x,y)$$

$$u(x,y) = g(x,y) \quad \text{for } x = 1 \text{ or } y = 1$$

Discretization:

$$\frac{u_{i-1,j} + u_{i,j+1} + u_{i,j-1} + u_{i+1,j} - 4\,u_{i,j}}{h^2} = f_{i,j}$$

$$x_i = \frac{i}{n+1},\quad i = 0,\dots,n+1; \qquad y_j = \frac{j}{n+1},\quad j = 0,\dots,n+1$$

Jacobi iterative algorithm:

$$u^{k+1}_{i,j} = \frac{1}{4}\left(u^{k}_{i-1,j} + u^{k}_{i,j+1} + u^{k}_{i,j-1} + u^{k}_{i+1,j} - h^2 f_{i,j}\right)$$
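A serial version of the Jacobi update helps to see what each process later computes on its own portion of the grid. The Python sketch below is a minimal illustration: the function `jacobi_poisson`, the tolerance and the test problem u = x·y (so f = 0 and g = x·y on the whole boundary) are assumptions, and h = 1/(n+1) follows from the definition of x_i.

```python
def jacobi_poisson(n, f, g, tol=1e-10, max_iter=20000):
    """Solve the discrete Poisson problem on the unit square by Jacobi iteration.

    n is the number of interior points per direction; f(x, y) is the right-hand
    side and g(x, y) gives the boundary values.
    """
    h = 1.0 / (n + 1)
    xs = [i * h for i in range(n + 2)]
    # Start from zero in the interior and g on the boundary.
    u = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i in range(n + 2):
        for j in range(n + 2):
            if i in (0, n + 1) or j in (0, n + 1):
                u[i][j] = g(xs[i], xs[j])
    for iteration in range(1, max_iter + 1):
        new = [row[:] for row in u]
        diff = 0.0
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                new[i][j] = 0.25 * (u[i - 1][j] + u[i][j + 1] + u[i][j - 1]
                                    + u[i + 1][j] - h * h * f(xs[i], xs[j]))
                diff = max(diff, abs(new[i][j] - u[i][j]))
        u = new
        if diff < tol:             # convergence: largest change in one sweep
            break
    return u, xs, iteration

# Test problem (an assumption for illustration): u = x*y, so f = 0 and g = x*y.
u, xs, iters = jacobi_poisson(8, lambda x, y: 0.0, lambda x, y: x * y)
err = max(abs(u[i][j] - xs[i] * xs[j]) for i in range(10) for j in range(10))
print(iters, err)
```

Each new value depends only on the four neighbors from the previous sweep, which is what makes the method easy to distribute: a process needs nothing beyond one layer of values from each neighboring process.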

Data exchange between processes (matrix u_ij distributed)

[Figure: the distributed grid, showing the internal zone of process P0, the ghost points for P0 and the ghost points for processes other than P0]

Iteration scheme

1. Build a 2D Cartesian grid of processes (non-periodic)
2. Distribute the grid of NxN points (x,y) among the processes of the process grid
3. Determine the neighboring processes
4. Exchange ghost points between neighboring processes
5. Iterate the Jacobi equations (local)
6. Check convergence (global)
7. Return to step 4 if not converged
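The exchange-then-iterate loop of the scheme above can be emulated serially on a simplified problem. The Python sketch below is a one-dimensional analogue (u'' = 0 with u(0) = 0 and u(1) = 1), not the 2D code: the function `decomposed_jacobi_1d` is hypothetical, list copies stand in for the ghost point messages, and `max` over the blocks stands in for a global reduction. It assumes at least one point per block.

```python
def decomposed_jacobi_1d(n, nprocs, tol=1e-12, max_iter=100000):
    """Jacobi for u'' = 0, u(0)=0, u(1)=1, with n interior points split over nprocs blocks.

    Each block stores its own points plus one ghost point on each side.
    """
    h = 1.0 / (n + 1)
    # Distribute points: block p owns global indices starts[p] .. starts[p+1]-1.
    starts = [1 + (n * p) // nprocs for p in range(nprocs + 1)]
    blocks = [[0.0] * (starts[p + 1] - starts[p] + 2) for p in range(nprocs)]
    for iteration in range(1, max_iter + 1):
        # Exchange ghost points with neighbors (physical boundary values at the ends).
        for p, blk in enumerate(blocks):
            blk[0] = blocks[p - 1][-2] if p > 0 else 0.0
            blk[-1] = blocks[p + 1][1] if p < nprocs - 1 else 1.0
        # Local Jacobi sweep on each block.
        local_diffs = []
        for p, blk in enumerate(blocks):
            new = blk[:]
            for k in range(1, len(blk) - 1):
                new[k] = 0.5 * (blk[k - 1] + blk[k + 1])
            local_diffs.append(max(abs(a - b) for a, b in zip(new, blk)))
            blocks[p] = new
        # Global convergence check (a reduction with MPI_MAX in a real code).
        if max(local_diffs) < tol:
            break
    solution = [v for blk in blocks for v in blk[1:-1]]
    return solution, h, iteration

sol3, h, it3 = decomposed_jacobi_1d(12, 3)
sol1, _, it1 = decomposed_jacobi_1d(12, 1)
print(it1, it3, max(abs(a - b) for a, b in zip(sol1, sol3)))
```

The decomposed run reproduces the single-block run because the ghost points carry exactly the neighbor values that the undivided sweep would have read.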

Determining the neighbors

call MPI_CART_SHIFT( comm2d, 0, 1, nbrleft, nbrright, ierr )

call MPI_CART_SHIFT( comm2d, 1, 1, nbrbottom, nbrtop, ierr )

Creating a data type for transmitting rows

!
! Create a new, "strided" datatype for the exchange in the "non-contiguous"
! direction (up/down)
!
call MPI_TYPE_VECTOR( ey-sy+1, 1, ex-sx+3, MPI_DOUBLE_PRECISION, stridetype, ierr )
call MPI_TYPE_COMMIT( stridetype, ierr )

Data exchange

      nx = ex - sx + 1
c     These are just like the 1-d versions, except for less data
      call MPI_SENDRECV( a(sx,ey),   nx, MPI_DOUBLE_PRECISION,
     &                   nbrright, 0,
     &                   a(sx,sy-1), nx, MPI_DOUBLE_PRECISION,
     &                   nbrleft, 0, comm2d, status, ierr )
      call MPI_SENDRECV( a(sx,sy),   nx, MPI_DOUBLE_PRECISION,
     &                   nbrleft, 1,
     &                   a(sx,ey+1), nx, MPI_DOUBLE_PRECISION,
     &                   nbrright, 1, comm2d, status, ierr )
c
c     This uses the vector datatype stridetype
      call MPI_SENDRECV( a(ex,sy),   1, stridetype, nbrtop, 0,
     &                   a(sx-1,sy), 1, stridetype, nbrbottom, 0,
     &                   comm2d, status, ierr )
      call MPI_SENDRECV( a(sx,sy),   1, stridetype, nbrbottom, 1,
     &                   a(ex+1,sy), 1, stridetype, nbrtop, 1,
     &                   comm2d, status, ierr )

MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status, ierr)

• integer sendcount, sendtype, dest, sendtag, recvcount, recvtype, source, recvtag, comm, status(MPI_STATUS_SIZE), ierr
• <type> sendbuf(*), recvbuf(*)

Overlapping communication and computation: non-blocking (asynchronous) communication

⇨ MPI_ISEND(buf, count, datatype, dest, tag, comm, request, ierr)
● <type> buf(*)
● Integer count, datatype, dest, tag, comm, request, ierr

⇨ MPI_IRECV(buf, count, datatype, src, tag, comm, request, ierr)
● Integer src

⇨ MPI_TEST(request, flag, status, ierr)
● Integer status(MPI_STATUS_SIZE)
● Logical flag

⇨ MPI_WAITALL(count, array_of_requests, array_of_statuses, ierr)
● Integer array_of_requests(count), array_of_statuses(MPI_STATUS_SIZE, count)

⇨ MPI_WAIT, MPI_WAITANY

⇨ MPI_ISEND, MPI_IRECV: send and receive the ghost points (non-blocking)
⇨ Iteration over the internal zone
⇨ MPI_WAITALL (blocks until the communication has completed)
⇨ Iteration over the ghost points

⇨ Asynchronous communication is usually the most efficient, since data transfer overlaps with computation
⇨ Not all hardware supports it

• Bibliography
• Using MPI, Gropp, Lusk & Skjellum (MIT Press, 1999)
• Parallel Programming with MPI, Pacheco (MK, 1997)
• On-line tutorial: http://www.msi.umn.edu/tutorial/scicomp/general/index.html

Basic Linear Algebra Communication Subprograms (BLACS)

⇨ Defines a grid of processes
⇨ Components
● Initialization and information subroutines
✔ blacs_gridinit(icntxt,order,nprow,npcol) (order='r','c')
✔ blacs_pinfo(mypnum,nprocs)
✔ blacs_gridinfo(icntxt,nprow,npcol,myprow,mypcol)
✔ Integer function blacs_pnum(icntxt,prow,pcol)
✔ blacs_pcoord(icntxt,pnum,prow,pcol)
✔ blacs_abort(icntxt,errornum)
✔ blacs_gridexit(icntxt)
✔ blacs_exit(icontinue)

⇨ Matrix-based communication
● Point-to-point communication (send/receive)
● Broadcast
✔ Call vXXYY2D(...)
   v = (I,S,D,C,Z): (integer, single, double, complex, double complex)
   XX = (GE, general rectangular matrix), (TR, trapezoidal matrix)
   YY = (SD, send), (RV, receive), (BS, broadcast send), (BR, broadcast receive)
● Collective operations
✔ blacs_barrier(icntxt,scope) (scope='R','C','A') (synchronization)
✔ Call vGZZZ2D(...)
➢ ZZZ = (AMX, maximum in absolute value), (AMN, minimum in absolute value), (SUM, sum)

ScaLAPACK

⇨ High-level subroutines for linear algebra
⇨ Communication based on BLACS
⇨ Essential: the matrices must be distributed among the processors efficiently

● Non-cyclic distribution (inefficient)

P1           P2           P3           P4
1 2 3 4      5 6 7 8      9 10 11 12   13 14 15 16

● Block-cyclic distribution (e.g. block size nb=2)

P1           P2           P3           P4
1 2 9 10     3 4 11 12    5 6 13 14    7 8 15 16
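The two distributions shown above follow from simple index formulas. In this Python sketch the helper names are hypothetical, global indices run from 1 to n as on the slide, and process index 0 corresponds to P1.

```python
def owner_block(g, n, nprocs):
    """Non-cyclic (block) distribution: owner of global index g (1-based).

    Assumes n is divisible by nprocs.
    """
    return (g - 1) // (n // nprocs)

def owner_block_cyclic(g, nb, nprocs):
    """Block-cyclic distribution with block size nb: owner of global index g (1-based)."""
    return ((g - 1) // nb) % nprocs

def layout(owner, n, nprocs):
    """List the global indices held by each process."""
    held = [[] for _ in range(nprocs)]
    for g in range(1, n + 1):
        held[owner(g)].append(g)
    return held

n, nprocs, nb = 16, 4, 2
block = layout(lambda g: owner_block(g, n, nprocs), n, nprocs)
cyclic = layout(lambda g: owner_block_cyclic(g, nb, nprocs), n, nprocs)
print(block)
print(cyclic)
```

With the block-cyclic mapping every process keeps part of the trailing submatrix as a factorization proceeds, which is why it balances the load better than the plain block distribution.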

⇨ LAPACK vs ScaLAPACK: A(LDA,n), B(LDB,nrhs)

● Call dgesv(n,nrhs,A(i,j),LDA,IPIV,B(i,1),LDB,info) (LAPACK) (normally i=j=1)

● Call pdgesv(n,nrhs,A,i,j,desca,IPIV,B,i,1,descb,info) (ScaLAPACK)

⇨ Descriptors of distributed matrices: desca, descb
● Descriptor initialization (integer descx(9)) (on all processes)
● Dimension A(m,n): (m,n) are the global dimensions
● Call descinit(desca,m,n,mb,nb,irsrc,icsrc,icntxt,llda,info)
✔ llda = local leading dimension of A on each process
✔ (irsrc,icsrc): coordinates of the process from which the distribution starts, normally (0,0)

$$\tilde{b}_l = A\,\tilde{x}_l$$

⇨ Computing llda for A(M,N) on a process grid of size mp x np, with a distribution in blocks of mb x nb starting from process (isr,isc)

⇨ no_cols = numroc(N,nb,mycol,isc,np)
⇨ no_rows = numroc(M,mb,myrow,isr,mp) = llda
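What numroc returns can be reproduced with a few lines of integer arithmetic. The Python function below is a sketch written to follow the logic of the ScaLAPACK tool routine NUMROC, not a binding to it; `brute_force` is a hypothetical cross-check that counts owned indices directly from the block-cyclic mapping.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Number of rows or columns of a block-cyclically distributed matrix owned by iproc."""
    mydist = (nprocs + iproc - isrcproc) % nprocs   # distance from the source process
    nblocks = n // nb                                # number of complete blocks
    count = (nblocks // nprocs) * nb                 # complete rounds of blocks
    extrablks = nblocks % nprocs
    if mydist < extrablks:
        count += nb                                  # one extra full block
    elif mydist == extrablks:
        count += n % nb                              # the trailing partial block
    return count

def brute_force(n, nb, iproc, isrcproc, nprocs):
    """Count owned indices directly from the block-cyclic mapping."""
    return sum(1 for g in range(n)
               if ((g // nb) + isrcproc) % nprocs == iproc)

M, mb, mp = 10, 2, 4
local_rows = [numroc(M, mb, myrow, 0, mp) for myrow in range(mp)]
print(local_rows)
```

For M = 10 rows in blocks of 2 over 4 process rows, the first process row holds two blocks and the others one each, so a local leading dimension of at least the largest of these values is needed on the corresponding process.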

Compilation instructions

⇨ mpif90 -o prueba_mpi.exe prueba_mpi.f90 -lmkl_scalapack -lmkl_blacsF77init -lmkl_blacs -lmkl_blacsCinit -lmkl_lapack -lmkl_em64t

⇨ The cluster MathKernel 7.2 libraries are used together with the serial MathKernel 8.1 libraries provided by Intel.

Example

• Diagonalization of a symmetric matrix distributed by columns, using ScaLAPACK
