Unified Parallel C at LBNL/UCB The Berkeley UPC Compiler: Implementation and Performance Wei Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, Kathy Yelick the LBNL/Berkeley UPC Group http://upc.lbl.gov


Page 1:

The Berkeley UPC Compiler: Implementation and Performance

Wei Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, Kathy Yelick

The LBNL/Berkeley UPC Group
http://upc.lbl.gov

Page 2:

Outline

• An Overview of UPC

• Design and Implementation of the Berkeley UPC Compiler

• Preliminary Performance Results

• Communication Optimizations

Page 3:

Unified Parallel C (UPC)

• UPC is an explicitly parallel global address space language with SPMD parallelism
- An extension of C
- Shared memory is partitioned by threads
- One-sided (bulk and fine-grained) communication through reads/writes of shared variables

• A collective effort by industry, academia, and government
- http://upc.gwu.edu

[Figure: the global address space, split into a shared region holding X[0], X[1], …, X[P], and a private region per thread]

Page 4:

UPC Programming Model Features

• Block-cyclically distributed arrays
• Shared and private pointers
• Global synchronization: barriers
• Pair-wise synchronization: locks
• Parallel loops
• Dynamic shared memory allocation
• Bulk shared memory accesses
• Strict vs. relaxed memory consistency models

Page 5:

Design and Implementation of the Berkeley UPC Compiler

Page 6:

Overview of the Berkeley UPC Compiler

Compiler stack:

UPC Code
→ Translator (platform-independent)
→ Translator-Generated C Code (network-independent)
→ Berkeley UPC Runtime System (compiler-independent)
→ GASNet Communication System (language-independent)
→ Network Hardware

Two goals: portability and high performance
- Translator: lowers UPC code into ANSI-C code
- Runtime: shared memory management and pointer operations
- GASNet: uniform get/put interface for the underlying networks

Page 7:

A Layered Design

• Portable:
- C is our intermediate language
- Can run on top of MPI (with a performance penalty)
- GASNet has a layered design with a small core

• High-performance:
- Native C compiler optimizes the serial code
- Translator can perform high-level communication optimizations
- GASNet can access network hardware directly

Page 8:

Implementing the UPC to C Translator

Translation pipeline:

Preprocessed File → C front end → WHIRL w/ shared types → Backend lowering → WHIRL w/ runtime calls → Whirl2c → ANSI-compliant C Code

• Based on the Open64 compiler
• Source-to-source transformation
• Converts shared memory operations into runtime library calls
• Designed to incorporate the existing optimization framework in Open64
• Communicates with the runtime via a standard API

Page 9:

Shared Arrays and Pointers in UPC

• Cyclic: shared int A[n];
• Block-cyclic: shared [2] int B[n];
• Indefinite: shared [0] int *C = (shared [0] int *) upc_alloc(n);

• Use global pointers (pointer-to-shared) to access shared (possibly remote) data
- Block size is part of the pointer type
- A generic pointer-to-shared contains: address, thread id, phase

Layout with two threads:
T0: A[0] A[2] A[4] …  B[0] B[1] B[4] B[5] …  C[0] C[1] C[2] …
T1: A[1] A[3] A[5] …  B[2] B[3] B[6] B[7] …
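The (address, thread, phase) triple above can be modeled in plain C. The sketch below is a hypothetical illustration, not the Berkeley UPC runtime's actual representation: `pts_t` and `pts_inc` are invented names, and the arithmetic shows how advancing a block-cyclic pointer by one element updates all three fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plain-C model of a generic pointer-to-shared; the
 * field and function names are illustrative only. */
typedef struct {
    uintptr_t addr;  /* address within the owning thread's partition */
    int thread;      /* owning thread id */
    int phase;       /* position within the current block */
} pts_t;

/* Advance a block-cyclic pointer by one element of size elem,
 * with block size B distributed over nthreads threads. */
static pts_t pts_inc(pts_t p, int B, int nthreads, size_t elem) {
    p.phase++;
    p.addr += elem;
    if (p.phase == B) {                  /* crossed a block boundary */
        p.phase = 0;
        p.thread++;
        if (p.thread == nthreads)
            p.thread = 0;                /* wrapped: addr already points
                                            at the next row of blocks */
        else
            p.addr -= (size_t)B * elem;  /* same block row, next thread */
    }
    return p;
}
```

Walking a `shared [2] int B[n]` array on two threads with this sketch reproduces the layout in the table above: B[0], B[1] on T0, then B[2], B[3] at the same offset on T1, then B[4] back on T0 one block further along.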

Page 10:

Accessing Shared Memory in UPC

[Figure: a pointer-to-shared with fields (address, thread, phase) selecting an element of shared memory distributed across Thread 0 … Thread N-1; the phase is the offset from the start of the current block, which lies a block size past the start of the array object]

Page 11:

Phaseless Pointer-to-Shared

• A pointer needs a “phase” to keep track of where it is in a block
- A source of overhead for pointer arithmetic

• Special case of “phaseless” pointers: cyclic + indefinite
- Cyclic pointers always have phase 0
- Indefinite pointers only have one block
- No need to keep the phase in pointer operations for cyclic and indefinite pointers
- No need to update the thread id for indefinite pointer arithmetic
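To make the savings concrete, here is a hedged plain-C sketch (hypothetical names, same illustrative `pts_t` model as a struct of address/thread/phase) of increments for the two phaseless cases: cyclic arithmetic never touches the phase, and indefinite arithmetic touches neither phase nor thread id.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative pointer-to-shared model (not the actual runtime's). */
typedef struct { uintptr_t addr; int thread; int phase; } pts_t;

/* Cyclic (block size 1): phase is always 0, so an increment only
 * rotates the thread id and bumps the address on wrap-around. */
static pts_t cyclic_inc(pts_t p, int nthreads, size_t elem) {
    p.thread++;
    if (p.thread == nthreads) { p.thread = 0; p.addr += elem; }
    return p;
}

/* Indefinite (block size 0): a single block on a single thread,
 * so neither the phase nor the thread id ever changes. */
static pts_t indef_inc(pts_t p, size_t elem) {
    p.addr += elem;
    return p;
}
```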

Page 12:

Pointer-to-Shared Representation

• Pointer size
- Want to allow pointers to reside in a register
- But very large machines may require a longer representation

• Datatype
- Use of a scalar type (long) rather than a struct may improve backend code quality
- Faster pointer manipulation (e.g., ptr + int) as well as dereferencing

• Portability and performance balance in the UPC compiler
- 8-byte scalar vs. struct format (a configuration-time option)
- The pointer representation is hidden in the runtime layer
- Modular design makes it easy to add new representations
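A packed scalar format can be sketched with ordinary bit operations. The field widths below (48-bit address, 10-bit thread, 6-bit phase) are hypothetical choices for illustration; as the slide notes, the real split is a configuration-time decision.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 8-byte packing of (address, thread, phase). */
typedef uint64_t packed_pts_t;

#define ADDR_BITS 48
#define ADDR_MASK ((1ULL << ADDR_BITS) - 1)

static packed_pts_t pts_pack(uint64_t addr, unsigned thread, unsigned phase) {
    return (addr & ADDR_MASK)
         | ((uint64_t)(thread & 0x3FFu) << 48)   /* 10 bits of thread id */
         | ((uint64_t)(phase  & 0x3Fu)  << 58);  /*  6 bits of phase     */
}
static uint64_t pts_addr(packed_pts_t p)   { return p & ADDR_MASK; }
static unsigned pts_thread(packed_pts_t p) { return (unsigned)((p >> 48) & 0x3FFu); }
static unsigned pts_phase(packed_pts_t p)  { return (unsigned)((p >> 58) & 0x3Fu); }
```

Because the whole pointer fits in one 64-bit register, manipulation and comparison compile down to plain integer instructions, which is the code-quality argument made above for the scalar format.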

Page 13:

Performance Results

Page 14:

Performance Evaluation

• Testbed
- HP AlphaServer (1 GHz processor) with Quadrics interconnect
- Compaq C compiler for the translated C code
- Compared against HP UPC 2.1

• Cost of language features
- Shared pointer arithmetic, shared memory accesses, parallel loops, etc.

• Application benchmarks
- EP: no communication
- IS: large bulk memory operations
- MG: bulk memget
- CG: fine-grained vs. bulk memput

• Potential of optimizations
- Measure the effectiveness of various communication optimizations

Page 15:

Performance of Shared Pointer Arithmetic

• The phaseless pointer is an important optimization
• The packed representation also helps

(1 cycle = 1 ns; the struct format is 16 bytes)

[Chart: cost of shared address arithmetic in cycles for ptr+int and ptr==ptr, comparing the Berkeley struct, Berkeley packed, and HP representations on generic, cyclic, and indefinite pointers]

Page 16:

Cost of Shared Memory Access

• Local shared accesses are somewhat slower than private accesses
• The layered design does not add additional overhead
• Remote accesses are a few orders of magnitude worse than local ones

[Charts: cost of shared local access (on the order of ten cycles) and shared remote access (on the order of thousands of cycles), for HP and Berkeley reads and writes]

Page 17:

Parallel Loops in UPC

• UPC has a “forall” construct for distributing computation:

shared int v1[N], v2[N], v3[N];
upc_forall (i = 0; i < N; i++; &v3[i])
    v3[i] = v2[i] + v1[i];

• The affinity test is performed on every iteration to decide whether it should execute
• Two kinds of affinity expressions:
- Integer (compared with the thread id)
- Shared address (check the affinity of the address)

Affinity expression:   none   integer   shared address
Cycles per iteration:   6       17          10
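The per-iteration cost above comes from the affinity test the naive lowering must run every time. The hedged plain-C sketch below (MYTHREAD/THREADS are UPC built-ins, modeled here as ordinary values; the function names are invented) contrasts the naive lowering with the strided for loop that a forall-to-for transformation produces, which drops the test entirely.

```c
#include <assert.h>

/* Simulated SPMD constants standing in for UPC's THREADS. */
enum { N = 16, THREADS = 4 };

/* Naive lowering: the integer affinity test runs on every iteration. */
static int iters_naive(int mythread) {
    int work = 0;
    for (int i = 0; i < N; i++)
        if (i % THREADS == mythread)   /* affinity test */
            work++;
    return work;
}

/* Optimized lowering: the forall becomes a strided for loop, so each
 * thread visits only its own iterations and no test is needed. */
static int iters_opt(int mythread) {
    int work = 0;
    for (int i = mythread; i < N; i += THREADS)
        work++;
    return work;
}
```

Both versions assign each thread the same iterations; the optimized one simply never evaluates the N - N/THREADS failing tests.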

Page 18:

Application Performance

Page 19:

NAS Benchmarks (EP and IS)

[Charts: EP class A and IS class A performance in Mops/second vs. number of threads, Berkeley vs. HP]

• EP shows the backend C compiler can still successfully optimize translated C code
• IS shows the Berkeley UPC compiler is effective for communication operations

Page 20:

NAS Benchmarks (CG and MG)

[Charts: MG and NPB CG performance in MFLOPS vs. number of threads, Berkeley vs. HP]

• The Berkeley UPC compiler scales well

Page 21:

Performance of Fine-grained Applications

• Does not scale well, due to the nature of the benchmark (lots of small reads)
• HP UPC’s software caching helps performance

[Chart: fine-grained CG running time in seconds vs. number of threads, for Berkeley, HP, and HP with caching]

Page 22:

Observations on the Results

• Acceptable worst-case overhead for shared memory access latencies
- < 10 cycles of overhead for shared local accesses
- ~1.5 µs of overhead relative to the end-to-end network latency

• Optimizations on the pointer-to-shared representation are effective
- Both phaseless pointers and the packed 8-byte format

• Good performance compared to HP UPC 2.1

Page 23:

Communication Optimizations for UPC

Page 24:

Communication Optimizations

• Hiding communication latencies
- Use of non-blocking operations
- Possible placement analysis, separating get()/put() as far as possible from sync()
- Message pipelining to overlap communication with more communication

• Optimizing shared memory accesses
- Eliminating the locality test for local shared pointers: flow- and context-sensitive analysis
- Transforming forall loops into equivalent for loops
- Eliminating redundant pointer arithmetic for pointers with the same thread and phase

Page 25:

More Optimizations

• Message vectorization and aggregation
- Scatter/gather techniques
- Packing generally pays off for small (< 500 byte) messages

• Software caching and prefetching
- A prototype implementation:
- Local knowledge only: no coherence messages
- Caches remote reads and buffers outgoing writes
- Based on the weak-ordering consistency model
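The aggregation idea above can be sketched in a few lines of plain C. Everything here is hypothetical illustration (the buffer type, function names, and the fact that the "bulk transfer" is just a direct write into the destination array, since no real network is involved): small puts are queued locally and applied in one flush instead of one message each.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical aggregation buffer for small remote puts. */
#define MAX_PUTS 64
typedef struct { size_t off; int val; } small_put_t;
typedef struct { small_put_t q[MAX_PUTS]; int n; } agg_buf_t;

static void agg_put(agg_buf_t *a, size_t off, int val) {
    a->q[a->n++] = (small_put_t){ off, val };  /* queued, no message yet */
}

/* One bulk transfer applies every queued put; returns how many small
 * messages were coalesced into the single flush. */
static int agg_flush(agg_buf_t *a, int *dest) {
    for (int i = 0; i < a->n; i++)
        dest[a->q[i].off] = a->q[i].val;
    int coalesced = a->n;
    a->n = 0;
    return coalesced;
}
```

The payoff threshold quoted above (packing helps for messages under roughly 500 bytes) reflects that per-message overhead, not bandwidth, dominates at small sizes.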

Page 26:

Example: Optimizing Loops

(base)
for (…) {
    init 1; sync 1;
    compute 1;
    write 1;
    init 2; sync 2;
    compute 2;
    write 2;
    …
}

(read pipelining)
for (…) {
    init 1;
    init 2;
    init 3;
    …
    sync_all;
    compute all;
}

(communication/computation overlap)
for (…) {
    init 1;
    init 2;
    sync 1;
    compute 1;
    …
}
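The overlap variant can be made concrete with a hedged plain-C sketch. The non-blocking get here is a stub with invented names (a real runtime such as GASNet would return a handle before the data arrives; this stub copies eagerly), so it illustrates the structure of the transformation, not its timing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stubbed non-blocking get and its completion wait. */
typedef struct { int done; } get_handle_t;

static get_handle_t nb_get(int *dst, const int *src, int n) {
    memcpy(dst, src, (size_t)n * sizeof *dst);  /* eager stand-in copy */
    return (get_handle_t){ 1 };
}
static void wait_sync(get_handle_t h) { (void)h; /* wait for completion */ }

/* init 2 is issued before sync 1, so compute 1 can overlap the
 * second transfer while it is in flight. */
static int overlapped_sum(const int *remote1, const int *remote2, int n) {
    int buf1[64], buf2[64], sum = 0;
    get_handle_t h1 = nb_get(buf1, remote1, n);  /* init 1 */
    get_handle_t h2 = nb_get(buf2, remote2, n);  /* init 2 */
    wait_sync(h1);                               /* sync 1 */
    for (int i = 0; i < n; i++) sum += buf1[i];  /* compute 1 */
    wait_sync(h2);                               /* sync 2 */
    for (int i = 0; i < n; i++) sum += buf2[i];  /* compute 2 */
    return sum;
}
```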

Page 27:

Experimental Results

[Charts: effects of software pipelining and of message pipelining — running time in seconds vs. percentage of remote accesses, comparing naïve, software-pipelined, read-pipelined, and write-pipelined versions]

• Computation/communication overlap is better than communication/communication overlap on Quadrics
• Results are likely to differ on other networks

Page 28:

Example: Optimizing Local Shared Accesses

shared [B] int a1[N], a2[N], a3[N];

upc_forall (i = 0; i < N; i++; &a1[i]) {
    a1[i] = a2[i] + a3[i];
}

(base)

int *l1, *l2, *l3;
for (i = MYTHREAD; i < N/B; i += THREADS) {
    l1 = (int *)&a1[i*B]; l2 = (int *)&a2[i*B]; l3 = (int *)&a3[i*B];
    for (j = 0; j < B; j++)
        l1[j] = l2[j] + l3[j];
}

(use of private pointers)

Page 29:

Experimental Results

• Neither compiler performs well on the naïve version
• Culprit: pointer-to-shared operations + affinity tests
• Privatizing local shared accesses improves performance by an order of magnitude

[Charts: optimizations on vector add — Mops/second (log scale) vs. number of threads for naïve, opt1, and opt2 versions; and vector addition comparing Berkeley cyclic, Berkeley blocked, and HP]

Page 30:

Compiler Status

• A fully UPC 1.1 compliant public release in April
• Supported platforms:
- HP AlphaServer, IBM SP, Linux x86/Itanium, SGI Origin 2000, Solaris SPARC/x86, Mac OS X PowerPC
• Supported networks:
- Quadrics/Elan, Myrinet/GM, IBM/LAPI, and MPI
• A release this summer will include:
- Pthreads/System V shared memory support
- GASNet Infiniband support

Page 31:

Conclusion

• The Berkeley UPC Compiler achieves both portability and good performance
- Layered, modular design
- Effective pointer-to-shared optimizations
- Good performance compared to a commercially available UPC compiler
- Still lots of opportunities for communication optimizations
- Available for download at: http://upc.lbl.gov