Unified Parallel C at LBNL/UCB UPC: A Portable High Performance Dialect of C Kathy Yelick Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Wei Tu, Mike Welcome


Page 1: Unified Parallel C at LBNL/UCB UPC: A Portable High Performance Dialect of C Kathy Yelick Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove,

Unified Parallel C at LBNL/UCB

UPC: A Portable High Performance Dialect of C

Kathy Yelick

Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Wei Tu, Mike Welcome

Page 2: Parallelism on the Rise

• 1.8x annual aggregate performance increase
 - 1.4x from improved technology and on-chip parallelism
 - 1.3x from growth in processor count for larger machines

[Chart: aggregate systems performance (FLOPS, 10 MFLOPS to 1 PFLOPS, log scale) vs. year (1980-2010), with single-CPU performance and CPU frequencies (10 MHz-10 GHz) for comparison; machines range from the X-MP and VP-200 through the CM-5, Paragon, T3D/T3E, ASCI Red/Blue/White/Q, to the Earth Simulator. Label: "Increasing Parallelism"]

Page 3: Parallel Programming Models

• Parallel software is still an unsolved problem!
• Most parallel programs are written using either:
 - Message passing with a SPMD model
   - for scientific applications; scales easily
 - Shared memory with threads in OpenMP, Threads, or Java
   - for non-scientific applications; easier to program
• Partitioned Global Address Space (PGAS) Languages
 - global address space like threads (programmability)
 - SPMD parallelism like MPI (performance)
 - local/global distinction, i.e., layout matters (performance)

Page 4: Partitioned Global Address Space Languages

• Explicitly-parallel programming model with SPMD parallelism
 - Fixed at program start-up, typically 1 thread per processor
• Global address space model of memory
 - Allows programmer to directly represent distributed data structures
• Address space is logically partitioned
 - Local vs. remote memory (two-level hierarchy)
• Programmer control over performance-critical decisions
 - Data layout and communication
• Performance transparency and tunability are goals
 - Initial implementation can use fine-grained shared memory
• Base languages differ: UPC (C), CAF (Fortran), Titanium (Java)

Page 5: UPC Design Philosophy

• Unified Parallel C (UPC) is:
 - An explicit parallel extension of ISO C
 - A partitioned global address space language
 - Sometimes called a GAS language
• Similar to the C language philosophy
 - Concise and familiar syntax
 - Orthogonal extensions of semantics
 - Assume programmers are clever and careful
   - Give them control, possibly close to hardware
   - Even though they may get into trouble
• Based on ideas in Split-C, AC, and PCP

Page 6: A Quick UPC Tutorial

Page 7: Virtual Machine Model

• Global address space abstraction
 - Shared memory is partitioned over threads
 - Shared vs. private memory partition within each thread
 - Remote memory may stay remote: no automatic caching implied
 - One-sided communication through reads/writes of shared variables
• Build data structures using
 - Distributed arrays
 - Two kinds of pointers: local vs. global pointers ("pointers to shared")

[Diagram: the global address space spans Thread0 ... Threadn. Each thread contributes a shared partition (together holding X[0], X[1], ..., X[P]) and keeps a private partition (each holding a local ptr variable)]

Page 8: UPC Execution Model

• Threads work independently in a SPMD fashion
 - Number of threads given by THREADS, set as a compile-time or runtime flag
 - MYTHREAD specifies thread index (0..THREADS-1)
 - upc_barrier is a global synchronization: all wait
• Any legal C program is also a legal UPC program

#include <upc.h>    /* needed for UPC extensions */
#include <stdio.h>

int main() {
    printf("Thread %d of %d: hello UPC world\n",
           MYTHREAD, THREADS);
    return 0;
}

Page 9: Private vs. Shared Variables in UPC

• C variables and objects are allocated in the private memory space
• Shared variables are allocated only once, in thread 0's space:
    shared int ours;
    int mine;
• Shared arrays are spread across the threads:
    shared int x[2*THREADS];      /* cyclic, 1 element each, wrapped */
    shared [2] int y[2*THREADS];  /* blocked, with block size 2 */
• Shared variables may not occur in a function definition unless static

[Diagram: ours (on thread 0) and the arrays x and y live in the shared space; each thread's private space holds its own mine. x is distributed cyclically (thread 0 gets x[0], x[n+1]; thread 1 gets x[1], x[n+2]; ...), y in blocks of two (thread 0 gets y[0,1]; ...; thread n gets y[2n-1,2n])]
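The layout rules above can be captured in two small formulas: element i of an array declared with block size blk lives on thread (i/blk) mod THREADS, at position i mod blk within its block. A plain-C sketch (the function names are illustrative, not part of the UPC API):

```c
/* Owning thread for element i of: shared [blk] T a[...]
 * distributed across `threads` UPC threads.
 * blk == 1 gives the default cyclic layout. */
int owner_thread(int i, int blk, int threads) {
    return (i / blk) % threads;
}

/* Phase: position of element i within its block. */
int phase_of(int i, int blk) {
    return i % blk;
}
```

With 4 threads, x[5] under the cyclic layout lands on thread 1 (5 % 4), while y[5] with block size 2 lands on thread 2 ((5/2) % 4), at phase 1.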

Page 10:

Work Sharing with upc_forall()

• This owner-computes idiom is common, so UPC has
    upc_forall(init; test; loop; affinity)
        statement;
• Programmer indicates the iterations are independent
 - Undefined if there are dependencies across threads
• Affinity expression indicates which iterations to run:
 - Integer: affinity%THREADS is MYTHREAD
 - Pointer: upc_threadof(affinity) is MYTHREAD

Owner computes by hand, with a cyclic layout:

shared int v1[N], v2[N], sum[N];
void main() {
    int i;
    for (i = 0; i < N; i++)
        if (MYTHREAD == i % THREADS)
            sum[i] = v1[i] + v2[i];
}

The same loop with upc_forall (using &v1[i] as the affinity expression; plain i would also work):

shared int v1[N], v2[N], sum[N];
void main() {
    int i;
    upc_forall(i = 0; i < N; i++; &v1[i])
        sum[i] = v1[i] + v2[i];
}

Page 11: Memory Consistency in UPC

• Shared accesses are strict or relaxed, designated by:
 - A pragma affects all otherwise unqualified accesses
   - #pragma upc relaxed
   - #pragma upc strict
   - Usually done by including standard .h files with these
 - A type qualifier in a declaration affects all accesses
   - int strict shared flag;
 - A strict or relaxed cast can be used to override the current pragma or declared qualifier
• Informal semantics
 - Relaxed accesses must obey dependencies, but non-dependent accesses may appear reordered to other threads
 - Strict accesses appear in order: sequentially consistent
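As an analogy only (C11 atomics, not UPC): the strict flavor corresponds to sequentially consistent operations, which is exactly what the classic flag/data handshake needs. A minimal sketch under that assumption:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int data_v = 0;
static atomic_int flag_v = 0;

static void *producer(void *arg) {
    (void)arg;
    atomic_store(&data_v, 42);  /* the data write could be relaxed in UPC terms... */
    atomic_store(&flag_v, 1);   /* ...but the flag must be strict: seq_cst here
                                   keeps it ordered after the data write */
    return NULL;
}

/* Consumer side of the handshake: strict spin on flag, then read data. */
int run_handshake(void) {
    pthread_t t;
    atomic_store(&data_v, 0);
    atomic_store(&flag_v, 0);
    pthread_create(&t, NULL, producer, NULL);
    while (atomic_load(&flag_v) != 1)  /* like a strict read of the flag */
        ;
    int v = atomic_load(&data_v);      /* guaranteed to observe 42 */
    pthread_join(t, NULL);
    return v;
}
```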

Page 12: Other Features of UPC

• Synchronization constructs
 - Global barriers
   - Variant with labels to document matching of barriers
   - Split-phase variant (upc_notify and upc_wait)
 - Locks: upc_lock, upc_lock_attempt, upc_unlock
• Collective communication library
 - Allows for asynchronous entry/exit

    shared [] int A[10];
    shared [10] int B[10*THREADS];
    // Initialize A.
    upc_all_broadcast(B, A, sizeof(int)*NELEMS,
                      UPC_IN_MYSYNC | UPC_OUT_ALLSYNC);

• Parallel I/O library

Page 13: The Berkeley UPC Compiler

Page 14: Goals of the Berkeley UPC Project

• Make UPC ubiquitous on
 - Parallel machines
 - Workstations and PCs for development
 - A portable compiler: for future machines too
• Components of the research agenda:
 1. Runtime work for Partitioned Global Address Space (PGAS) languages in general
 2. Compiler optimizations for parallel languages
 3. Application demonstrations of UPC

Page 15: Berkeley UPC Compiler

[Diagram: UPC source enters the translator; optimizing transformations lower Higher WHIRL to Lower WHIRL; output is C + Runtime, or Assembly (IA64, MIPS, ...) + Runtime]

• Compiler based on Open64
 - Multiple front-ends, including gcc
 - Intermediate form called WHIRL
 - Current focus on C backend; IA64 possible in future
• UPC Runtime
 - Pointer representation
 - Shared/distributed memory
• Communication in GASNet
 - Portable
 - Language-independent

Page 16: Optimizations

• In the Berkeley UPC compiler
 - Pointer representation
 - Generating optimizable single-processor code
 - Message coalescing (aka vectorization)
• Opportunities
 - forall loop optimizations (unnecessary iterations)
 - Irregular data set communication (Titanium)
 - Sharing inference
 - Automatic relaxation analysis and optimizations

Page 17: Pointer-to-Shared Representation

• UPC has three different kinds of pointers:
 - Block-cyclic, cyclic, and indefinite (always local)
• A pointer needs a "phase" to keep track of where it is in a block
 - Source of overhead for updating and de-referencing
 - Consumes space in the pointer
• Our runtime has special cases for:
 - Phaseless (cyclic and indefinite): skip phase update
 - Indefinite: skip thread id update
 - Some machine-specific special cases for some memory layouts
• Pointer size/representation easily reconfigured
 - 64 bits on small machines, 128 on large; word or struct

[Pointer-to-shared layout: Address | Thread | Phase]
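The (address, thread, phase) triple makes pointer arithmetic visibly more expensive than plain C: each increment may carry the phase across a block boundary and the thread index around the wrap, which is precisely what the phaseless special cases skip. An illustrative plain-C model (not the Berkeley runtime's actual struct):

```c
#include <stddef.h>

typedef struct {
    size_t addr;   /* local address on the owning thread */
    int thread;    /* owning thread index */
    int phase;     /* position within the current block */
} sptr;

/* Advance p by one element of size esz in a shared [blk] array
 * distributed over `threads` threads (generic block-cyclic case). */
sptr sptr_inc(sptr p, int blk, int threads, size_t esz) {
    p.phase++;
    p.addr += esz;
    if (p.phase == blk) {                 /* crossed a block boundary */
        p.phase = 0;
        p.thread = (p.thread + 1) % threads;
        if (p.thread != 0)                /* same block row, next thread */
            p.addr -= (size_t)blk * esz;
        /* on wrap to thread 0, addr stays advanced: next block row */
    }
    return p;
}

/* Advance by n elements (helper for exercising the rule). */
sptr sptr_add(sptr p, int n, int blk, int threads, size_t esz) {
    while (n-- > 0)
        p = sptr_inc(p, blk, threads, esz);
    return p;
}
```

For blk == 1 (cyclic) the phase bookkeeping is dead code, and for the indefinite case the thread update is too; those are the phaseless fast paths.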

Page 18: Performance of Pointers to Shared

• Phaseless pointers are an important optimization
 - Indefinite pointers are almost as fast as regular C pointers
 - The general block-cyclic pointer is 7x slower for addition
• Competitive with the HP compiler, which generates native code
 - Both compilers have improved since these were measured

[Chart: pointer-to-shared operations; cycles (1.5 ns/cycle) for ptr + int and ptr == ptr, HP vs. Berkeley, for generic, cyclic, and indefinite pointers; roughly 5-50 cycles]

[Chart: cost of shared remote access; cycles (1.5 ns/cycle) for HP read, Berkeley read, HP write, Berkeley write; roughly 1000-6000 cycles]

Page 19: Generating Optimizable (Vectorizable) Code

• Translator-generated C code can be as efficient as the original C code
• Source-to-source translation is a good strategy for portable PGAS language implementations

[Chart: Livermore Loops; C time / UPC time per kernel, ratios between roughly 0.85 and 1.15]

Page 20: NAS CG: OpenMP style vs. MPI style

• GAS language outperforms MPI+Fortran (flat is good!)
• Fine-grained (OpenMP style) version is still slower
 - Shared memory programming style leads to more overhead (redundant boundary computation)
• GAS languages can support both programming styles

[Chart: NAS CG performance; MFLOPS per thread for 2, 4, 8, and 12 threads (SSP mode, two nodes): UPC (OpenMP style), UPC (MPI style), MPI Fortran]

Page 21: Message Coalescing

• Implemented in a number of parallel Fortran compilers (e.g., HPF)
• Idea: replace individual puts/gets with bulk calls
• Targets bulk calls and index/strided calls in the UPC runtime (new)
• Goal: ease programming by speeding up the shared memory style

Unoptimized loop:

shared [0] int *r;
for (i = L; i < U; i++)
    exp1 = exp2 + r[i];

Optimized loop:

int lr[U-L];
upcr_memget(lr, &r[L], (U-L)*sizeof(int));
for (i = L; i < U; i++)
    exp1 = exp2 + lr[i-L];
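The payoff of this transformation can be emulated in ordinary C by counting simulated round-trips: one per element for the fine-grained loop versus one for the coalesced strip. The get functions below are stand-ins for remote access, not the UPC runtime API:

```c
#include <string.h>

enum { RN = 1000 };
int remote_arr[RN];   /* stands in for the shared array r */
long trips;           /* counts simulated network round-trips */

void init_remote(void) {
    for (int i = 0; i < RN; i++) remote_arr[i] = i;
    trips = 0;
}

int get_one(int i) { trips++; return remote_arr[i]; }

void get_bulk(int *dst, int lo, int n) {
    trips++;                              /* one round-trip for the whole strip */
    memcpy(dst, remote_arr + lo, (size_t)n * sizeof(int));
}

long sum_fine(int L, int U) {             /* element-by-element accesses */
    long s = 0;
    for (int i = L; i < U; i++) s += get_one(i);
    return s;
}

long sum_coalesced(int L, int U) {        /* one bulk get, then a local loop */
    int lr[RN];
    get_bulk(lr, L, U - L);
    long s = 0;
    for (int i = L; i < U; i++) s += lr[i - L];
    return s;
}
```

Both versions produce the same sum; the coalesced one touches the "network" once instead of U-L times.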

Page 22: Message Coalescing vs. Fine-grained

[Chart: speedup of coalesced over fine-grained code vs. threads (up to 8), for elan comp-indefinite and elan comp-cyclic; speedups up to ~250x]

• One thread per node
• Vector is 100K elements; number of rows is 100*threads
• Message-coalesced code is more than 100x faster
• Fine-grained code also does not scale well
 - Network overhead

Page 23: Message Coalescing vs. Bulk

[Chart: matvec multiply (elan); time in seconds vs. threads (2-8) for comp-indefinite, man-indefinite, comp-cyclic, man-cyclic]

[Chart: matvec multiply (lapi); time in seconds vs. threads (2-8) for the same four variants]

• Message coalescing and bulk-style code have comparable performance
 - For the indefinite array the generated code is identical
 - For the cyclic array, coalescing is faster than manual bulk code on elan
   - memgets to each thread are overlapped
 - Points to the need for a language extension

Page 24: Automatic Relaxation

• Goal: simplify programming by giving programmers the illusion that the compiler and hardware are not reordering
• When compiling sequential programs, reordering

    x = expr1;             y = expr2;
    y = expr2;     into    x = expr1;

  is valid if y is not in expr1 and x is not in expr2 (roughly)
• When compiling parallel code, that test is not sufficient:

    Initially flag = data = 0

    Proc A            Proc B
    data = 1;         while (flag != 1);
    flag = 1;         ... = ...data...;

Page 25: Cycle Detection: Dependence Analog

• Processors define a "program order" on accesses from the same thread
 - P is the union of these total orders
• The memory system defines an "access order" on accesses to the same variable
 - A is the access order (read/write and write/write pairs)
• A violation of sequential consistency is a cycle in P U A
• Intuition: time cannot flow backwards

[Diagram: program-order edges write data -> write flag (Proc A) and read flag -> read data (Proc B); access-order edges write flag -- read flag and write data -- read data close the cycle]

Page 26: Cycle Detection

• Generalizes to arbitrary numbers of variables and processors
• Cycles may be arbitrarily long, but it is sufficient to consider only cycles with 1 or 2 consecutive stops per processor

[Example: write x, write y, read y on one processor; read y, write x on another]

Page 27: Static Analysis for Cycle Detection

• Approximate P by the control flow graph
• Approximate A by undirected "dependence" edges
• Let the "delay set" D be all edges from P that are part of a minimal cycle
• The execution order of D edges must be preserved; other P edges may be reordered (modulo the usual rules about serial code)
• Conclusions:
 - Cycle detection is possible for a small language
 - Synchronization analysis is critical
 - Open: is pointer/array analysis accurate enough for this to be practical?

[Example: accesses write z, read x, read y on one processor and write z, write y, read x on another]
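The delay-set test can be made concrete on the flag/data example: encode the program-order edges (P) and undirected conflict edges (A), and flag a P edge as a delay edge when a path through the mixed graph leads from its head back to its tail, closing a cycle. A small illustrative C implementation (not from the paper):

```c
#include <string.h>

/* Nodes of the flag/data example:
 * 0 = write data, 1 = write flag   (Proc A, program order 0 -> 1)
 * 2 = read flag,  3 = read data    (Proc B, program order 2 -> 3) */
enum { NN = 4 };

int prog[NN][NN];   /* directed program-order edges (P) */
int conf[NN][NN];   /* undirected conflict edges (A) */

static int reach(int u, int target, int visited[]) {
    if (u == target) return 1;
    visited[u] = 1;
    for (int v = 0; v < NN; v++)
        if ((prog[u][v] || conf[u][v] || conf[v][u]) &&
            !visited[v] && reach(v, target, visited))
            return 1;
    return 0;
}

/* A P edge u -> v is in the delay set if some P/A path leads from v
 * back to u, so that u -> v closes a cycle in P U A. */
int in_delay_set(int u, int v) {
    int visited[NN];
    if (!prog[u][v]) return 0;
    memset(visited, 0, sizeof visited);
    return reach(v, u, visited);
}

void build_flag_data_example(void) {
    memset(prog, 0, sizeof prog);
    memset(conf, 0, sizeof conf);
    prog[0][1] = 1;   /* Proc A: data = 1; then flag = 1;      */
    prog[2][3] = 1;   /* Proc B: spin on flag; then read data  */
    conf[1][2] = 1;   /* write flag / read flag: same variable */
    conf[0][3] = 1;   /* write data / read data: same variable */
}
```

On this example both P edges land in the delay set, so neither thread's accesses may be reordered; removing either conflict edge breaks the cycle and frees the corresponding P edge.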

Page 28: GASNet: Communication Layer for PGAS Languages

Page 29: GASNet Design Overview - Goals

• Language-independence: support multiple PGAS languages/compilers
 - UPC, Titanium, Co-array Fortran, possibly others
 - Hide UPC- or compiler-specific details such as pointer-to-shared representation
• Hardware-independence: variety of parallel architectures, OS's & networks
 - SMPs, clusters of uniprocessors or SMPs
 - Current networks:
   - Native network conduits: Myrinet GM, Quadrics Elan, Infiniband VAPI, IBM LAPI
   - Portable network conduits: MPI 1.1, Ethernet UDP
   - Under development: Cray X-1, SGI/Cray Shmem, Dolphin SCI
 - Current platforms:
   - CPU: x86, Itanium, Opteron, Alpha, Power3/4, SPARC, PA-RISC, MIPS
   - OS: Linux, Solaris, AIX, Tru64, Unicos, FreeBSD, IRIX, HPUX, Cygwin, MacOS
• Ease of implementation on new hardware
 - Allow quick implementations
 - Allow implementations to leverage performance characteristics of hardware
 - Allow flexibility in message servicing paradigm (polling, interrupts, hybrids, etc.)
• Want both portability & performance

Page 30: GASNet Design Overview - System Architecture

• 2-level architecture to ease implementation:

[Layer diagram, top to bottom: Compiler-generated code; Compiler-specific runtime system; GASNet Extended API; GASNet Core API; Network Hardware]

• Core API
 - Most basic required primitives, as narrow and general as possible
 - Implemented directly on each network
 - Based heavily on the active messages paradigm
• Extended API
 - Wider interface that includes more complicated operations
 - We provide a reference implementation of the extended API in terms of the core API
 - Implementors can choose to directly implement any subset for performance: leverage hardware support for higher-level operations
 - Currently includes:
   - blocking and non-blocking puts/gets (all contiguous), flexible synchronization mechanisms, barriers
   - Just recently added non-contiguous extensions (coming up later)
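The reference-implementation idea can be sketched in one process: a blocking put from the extended API is built from a core-API active message whose handler performs the write and signals completion. Everything here is an emulation for illustration; the real GASNet calls and network are replaced by a direct function call:

```c
#include <string.h>

/* --- core API sketch: an active-message request --- */
typedef void (*am_handler)(void *dst, const void *src, int nbytes, int *done);

void am_put_handler(void *dst, const void *src, int nbytes, int *done) {
    memcpy(dst, src, (size_t)nbytes);  /* runs "on the remote node" */
    *done = 1;                         /* reply: completion signal */
}

/* "Network" stand-in: in this one-process emulation, sending an AM
 * just invokes the handler directly. */
void am_request(am_handler h, void *dst, const void *src,
                int nbytes, int *done) {
    h(dst, src, nbytes, done);
}

/* --- extended API sketch: blocking put layered on the core --- */
void put_blocking(void *dst, const void *src, int nbytes) {
    int done = 0;
    am_request(am_put_handler, dst, src, nbytes, &done);
    while (!done)   /* poll until the completion reply arrives */
        ;
}
```

A native conduit could instead implement put_blocking directly on RDMA hardware; the layering is what keeps new ports quick.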

Page 31: GASNet Performance Summary

GASNet Put/Get Roundtrip Latency (min over msg sz)

[Chart: roundtrip latency in microseconds (roughly 5-65 us) for put_nb and get_nb, comparing each native conduit with MPI on: quadrics opus (IA64), quadrics lemieux (Alpha), myrinet alvarez (x86), Colony/GX seaborg (PowerPC, lapi and lapi-poll), infiniband pcp (x86 PCI-X, vapi), myrinet citris (IA64), myrinet pcp (x86 PCI-X)]

Page 32: GASNet Performance Summary

GASNet Put/Get Bulk Flood Bandwidth (max over msg sz)

[Chart: bandwidth in MB/sec (roughly 50-750) for put_nb_bulk and get_nb_bulk, comparing each native conduit with MPI on the same platforms: quadrics opus (IA64), quadrics lemieux (Alpha), myrinet alvarez (x86), Colony/GX seaborg (PowerPC), infiniband pcp (x86 PCI-X), myrinet citris (IA64), myrinet pcp (x86 PCI-X)]

Page 33: GASNet vs. MPI on Infiniband

Roundtrip Latency of GASNet vapi-conduit and MVAPICH 0.9.1 MPI

[Chart: roundtrip latency (us, 0-30) vs. message size (1 byte-2 KB): MPI_Send/MPI_Recv ping-pong vs. gasnet_put_nb + sync]

• OSU MVAPICH is widely regarded as the "best" MPI implementation on Infiniband
• The MVAPICH code is based on the FTG project MVICH (MPI over VIA)
• GASNet wins because it is fully one-sided: no tag matching or two-sided synchronization overheads
• MPI semantics provide two-sided synchronization, whether you want it or not

Page 34: GASNet vs. MPI on Infiniband

Bandwidth of GASNet vapi-conduit and MVAPICH 0.9.1 MPI

[Chart: bandwidth (MB/sec, 0-800) vs. message size (1 byte-1 MB): MVAPICH MPI, gasnet_put_nb_bulk (source pre-pinned), gasnet_put_nb_bulk (source not pinned)]

• GASNet significantly outperforms MPI at mid-range sizes: the cost of MPI tag matching
• The not-pinned line shows the cost of naive bounce-buffer pipelining when the local side is not pre-pinned: memory registration is an important issue

Page 35: Applications in PGAS Languages

Page 36: PGAS Languages Scale

• Use of the memory model (relaxed/strict) for synchronization
• Medium-sized messages done through array copies

[Chart: NAS MG in UPC (Berkeley version); MFlops vs. number of processors (128, 256, 512, 1024), scaling to roughly 250,000 MFlops]

Page 37: Performance Results: Berkeley UPC FT vs MPI Fortran FT

NAS FT 2.3 Class A - NERSC Alvarez Cluster

[Chart: MFLOPS vs. threads (4, 8, 16, 32; 1 per node) for UPC (blocking), UPC (non-blocking), and MPI Fortran, up to ~2500 MFLOPS. 80 dual PIII-866MHz nodes running Berkeley UPC (gm-conduit / Myrinet 2K, 33 MHz 64-bit bus)]

Page 38: Challenging Applications

• Focus on the problems that are hard for MPI
 - Naturally fine-grained
 - Patterns of sharing/communication unknown until runtime
• Two examples
 - Adaptive Mesh Refinement (AMR)
   - Poisson problem in Titanium (low flops to memory/comm)
   - Hyperbolic problems in UPC (higher ratio, not adaptive so far)
   - Task parallel view (first)
 - Immersed boundary method simulation
   - Used for simulating the heart, cochlea, bacteria, insect flight, ...
   - Titanium version is a general framework
     - Specializations for the heart and cochlea
   - Particle methods with two structures: regular fluid mesh + list of materials

Page 39: Ghost Region Exchange in AMR

• Ghost regions exist even in the serial code
 - Algorithm decomposed as operations on grid patches
 - Nearest neighbors (7, 9, 27-point stencils, etc.)
• Adaptive mesh organized by levels
 - Nasty meta-data problem to find neighbors
 - A neighbor may exist only at a different level

[Diagram: four grid patches (Grid 1-Grid 4) with ghost cells A, B, C along their shared boundaries]
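The exchange itself is simple once neighbors are known; the hard part the slide points at is the metadata. A plain-C emulation of a 1D periodic patch decomposition with one ghost cell per side (illustrative of the pattern only; no AMR levels):

```c
enum { NPATCH = 4, INT_CELLS = 8, PW = INT_CELLS + 2 };  /* 1 ghost per side */

/* patch[p][0] and patch[p][PW-1] are ghost cells; the rest are interior. */
double patch[NPATCH][PW];

void init_global_ramp(void) {            /* store each cell's global index */
    for (int p = 0; p < NPATCH; p++)
        for (int i = 1; i <= INT_CELLS; i++)
            patch[p][i] = p * INT_CELLS + (i - 1);
}

/* Copy each neighbor's boundary interior cell into this patch's ghost
 * cell (periodic in this toy; AMR additionally needs level lookup). */
void exchange_ghosts(void) {
    for (int p = 0; p < NPATCH; p++) {
        int left  = (p + NPATCH - 1) % NPATCH;
        int right = (p + 1) % NPATCH;
        patch[p][0]      = patch[left][PW - 2];  /* neighbor's last interior */
        patch[p][PW - 1] = patch[right][1];      /* neighbor's first interior */
    }
}
```

After the exchange, a stencil over interior cells can read patch[p][i-1] and patch[p][i+1] without reaching into a neighbor patch.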

Page 40: Distributed Data Structures for AMR

• This shows just one level of the grid hierarchy
• Not a distributed array in any of the languages that support them
• Note: Titanium uses this structure even for regular arrays

[Diagram: Processor 1 holds grid patches G1 and G2; Processor 2 holds G3 and G4; each processor keeps a directory (P1, P2) of pointers to the patches on both processors]

Page 41: Programmability Comparison in Titanium

• Ghost region exchange in AMR
 - 37 lines of Titanium
 - 327 lines of C++/MPI, of which 318 are MPI-related
• Speed (single processor, full solve)
 - The same algorithm, Poisson AMR on the same mesh
 - C++/Fortran Chombo: 366 seconds
 - Titanium by a Chombo programmer: 208 seconds
 - Titanium after expert optimizations: 155 seconds
   - The biggest optimization was avoiding copies of single-element arrays, which required a domain/performance expert to find and fix
 - Titanium is faster on this platform!

Page 42: Heart Simulation in Titanium

• Programming experience
 - Code existed in Fortran for vector machines
 - Complete rewrite in Titanium
   - Except fast FFTs
 - 3 GSR years + 1.5 postdoc years
• Numerical errors found along the way
 - About 1 every 2 weeks
 - Mostly due to missing code
• Number of race conditions: 1

Page 43: Scalability

• A 512³ problem in under 1 second per timestep is not yet possible
• A 10x increase in bisection bandwidth would fix this

[Chart: actual and modeled performance of the immersed boundary method; time per timestep (0.1-1000 seconds, log scale) vs. number of processors, for the 256 and 512 problem sizes (model and actual)]

Page 44: Those Finicky Users

• How do we get people to use new languages?
 - Needs to be incremental
 - Start at the bottom, not at the top, of the software stack
• Need to demonstrate advantages
 - Performance is the easiest: comes from the ability to use great hardware
 - Productivity is harder:
   - Managers may be convinced by data
   - Programmers will vote by experience
   - Wait for programmer turnover
• Key: the language must run well everywhere
 - As well as the hardware allows

Page 45: PGAS Languages are Not the End of the Story

• Flat parallelism model
 - Machines are not flat: vectors, streams, SIMD, VLIW, FPGAs, PIMs, SMP nodes, ...
• No support for dynamic load balancing
 - Virtualized memory structure: moving load is easier
 - No virtualization of processor space: task-queue library
• No fault tolerance
 - The SPMD model is not a good fit if nodes fail frequently
• Little understanding of scientific problems
 - CAF and Titanium have multi-D arrays
 - A matrix and a grid are both arrays, but they're different
 - Next-level example: immersed boundary method language

Page 46: To Virtualize or Not to Virtualize

• Why virtualize
 - Portability
 - Fault tolerance
 - Load imbalance
• Why not virtualize
 - Deep memory hierarchies
 - Expensive system overhead
 - Some problems match the hardware; don't want to pay overhead
   - If we design for these problems, we need to beat MPI, and HPC will always be a niche
• PGAS languages virtualize memory structure but not processor number
• Can we provide a virtualized machine, but still allow for control in mapping (separate code)?