Unified Parallel C at LBNL/UCB The Berkeley UPC Compiler: Implementation and Performance Wei Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, Kathy Yelick the LBNL/Berkeley UPC Group http://upc.lbl.gov


Page 1:

The Berkeley UPC Compiler: Implementation and Performance

Wei Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, Kathy Yelick

The LBNL/Berkeley UPC Group
http://upc.lbl.gov

Page 2:

Outline

• An Overview of UPC

• Design and Implementation of the Berkeley UPC Compiler

• Preliminary Performance Results

• Communication Optimizations

Page 3:

Unified Parallel C (UPC)

• UPC is an explicitly parallel global address space language with SPMD parallelism
- An extension of C
- Shared memory is partitioned by threads
- One-sided (bulk and fine-grained) communication through reads/writes of shared variables

• A collective effort by industry, academia, and government
- http://upc.gwu.edu

[Figure: the global address space, split into a shared region holding X[0], X[1], …, X[P], and a private region per thread]

Page 4:

UPC Programming Model Features

• Block-cyclically distributed arrays
• Shared and private pointers
• Global synchronization: barriers
• Pair-wise synchronization: locks
• Parallel loops
• Dynamic shared memory allocation
• Bulk shared memory accesses
• Strict vs. relaxed memory consistency models

Page 5:

Design and Implementation of the Berkeley UPC Compiler

Page 6:

Overview of the Berkeley UPC Compiler

Compiler stack:

UPC Code
→ Translator (platform-independent)
→ Translator-Generated C Code (network-independent)
→ Berkeley UPC Runtime System (compiler-independent)
→ GASNet Communication System (language-independent)
→ Network Hardware

Two goals: portability and high performance
- Translator: lowers UPC code into ANSI-C code
- Runtime: shared memory management and pointer operations
- GASNet: uniform get/put interface for the underlying networks

Page 7:

A Layered Design

• Portable:
- C is our intermediate language
- Can run on top of MPI (with a performance penalty)
- GASNet has a layered design with a small core

• High-performance:
- Native C compiler optimizes the serial code
- Translator can perform high-level communication optimizations
- GASNet can access network hardware directly

Page 8:

Implementing the UPC to C Translator

Translation pipeline:

Preprocessed File → C front end → WHIRL w/ shared types → Backend lowering → WHIRL w/ runtime calls → Whirl2c → ANSI-compliant C Code

• Based on the Open64 compiler
• Source-to-source transformation
• Converts shared memory operations into runtime library calls
• Designed to incorporate the existing optimization framework in Open64
• Communicates with the runtime via a standard API

Page 9:

Shared Arrays and Pointers in UPC

• Cyclic: shared int A[n];
• Block-cyclic: shared [2] int B[n];
• Indefinite: shared [0] int *C = (shared [0] int *) upc_alloc(n);

• Use global pointers (pointer-to-shared) to access shared (possibly remote) data
- Block size is part of the pointer type
- A generic pointer-to-shared contains: address, thread id, phase

Layout with two threads:
T0: A[0] A[2] A[4] …  B[0] B[1] B[4] B[5] …  C[0] C[1] C[2] …
T1: A[1] A[3] A[5] …  B[2] B[3] B[6] B[7] …
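The (address, thread, phase) triple above can be modeled in plain C. The sketch below is a hypothetical illustration, not the Berkeley UPC runtime's actual representation: `pts_t` and `pts_inc` are invented names, and the arithmetic shows how advancing a block-cyclic pointer by one element updates all three fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plain-C model of a generic pointer-to-shared; the
 * field and function names are illustrative only. */
typedef struct {
    uintptr_t addr;  /* address within the owning thread's partition */
    int thread;      /* owning thread id */
    int phase;       /* position within the current block */
} pts_t;

/* Advance a block-cyclic pointer by one element of size elem,
 * with block size B distributed over nthreads threads. */
static pts_t pts_inc(pts_t p, int B, int nthreads, size_t elem) {
    p.phase++;
    p.addr += elem;
    if (p.phase == B) {                  /* crossed a block boundary */
        p.phase = 0;
        p.thread++;
        if (p.thread == nthreads)
            p.thread = 0;                /* wrapped: addr already points
                                            at the next row of blocks */
        else
            p.addr -= (size_t)B * elem;  /* same block row, next thread */
    }
    return p;
}
```

Walking a `shared [2] int B[n]` array on two threads with this sketch reproduces the layout in the table above: B[0], B[1] on T0, then B[2], B[3] at the same offset on T1, then B[4] back on T0 one block further along.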

Page 10:

Accessing Shared Memory in UPC

[Figure: a pointer-to-shared with fields (address, thread, phase) selecting an element of shared memory distributed across Thread 0 … Thread N-1; the phase is the offset from the start of the current block, which lies a block size past the start of the array object]

Page 11:

Phaseless Pointer-to-Shared

• A pointer needs a “phase” to keep track of where it is in a block
- A source of overhead for pointer arithmetic

• Special case of “phaseless” pointers: cyclic + indefinite
- Cyclic pointers always have phase 0
- Indefinite pointers only have one block
- No need to keep the phase in pointer operations for cyclic and indefinite pointers
- No need to update the thread id for indefinite pointer arithmetic
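To make the savings concrete, here is a hedged plain-C sketch (hypothetical names, same illustrative `pts_t` model as a struct of address/thread/phase) of increments for the two phaseless cases: cyclic arithmetic never touches the phase, and indefinite arithmetic touches neither phase nor thread id.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative pointer-to-shared model (not the actual runtime's). */
typedef struct { uintptr_t addr; int thread; int phase; } pts_t;

/* Cyclic (block size 1): phase is always 0, so an increment only
 * rotates the thread id and bumps the address on wrap-around. */
static pts_t cyclic_inc(pts_t p, int nthreads, size_t elem) {
    p.thread++;
    if (p.thread == nthreads) { p.thread = 0; p.addr += elem; }
    return p;
}

/* Indefinite (block size 0): a single block on a single thread,
 * so neither the phase nor the thread id ever changes. */
static pts_t indef_inc(pts_t p, size_t elem) {
    p.addr += elem;
    return p;
}
```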

Page 12:

Pointer-to-Shared Representation

• Pointer size
- Want to allow pointers to reside in a register
- But very large machines may require a longer representation

• Datatype
- Use of a scalar type (long) rather than a struct may improve backend code quality
- Faster pointer manipulation (e.g., ptr + int) as well as dereferencing

• Portability and performance balance in the UPC compiler
- 8-byte scalar vs. struct format (a configuration-time option)
- The pointer representation is hidden in the runtime layer
- Modular design makes it easy to add new representations
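A packed scalar format can be sketched with ordinary bit operations. The field widths below (48-bit address, 10-bit thread, 6-bit phase) are hypothetical choices for illustration; as the slide notes, the real split is a configuration-time decision.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 8-byte packing of (address, thread, phase). */
typedef uint64_t packed_pts_t;

#define ADDR_BITS 48
#define ADDR_MASK ((1ULL << ADDR_BITS) - 1)

static packed_pts_t pts_pack(uint64_t addr, unsigned thread, unsigned phase) {
    return (addr & ADDR_MASK)
         | ((uint64_t)(thread & 0x3FFu) << 48)   /* 10 bits of thread id */
         | ((uint64_t)(phase  & 0x3Fu)  << 58);  /*  6 bits of phase     */
}
static uint64_t pts_addr(packed_pts_t p)   { return p & ADDR_MASK; }
static unsigned pts_thread(packed_pts_t p) { return (unsigned)((p >> 48) & 0x3FFu); }
static unsigned pts_phase(packed_pts_t p)  { return (unsigned)((p >> 58) & 0x3Fu); }
```

Because the whole pointer fits in one 64-bit register, manipulation and comparison compile down to plain integer instructions, which is the code-quality argument made above for the scalar format.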

Page 13:

Performance Results

Page 14:

Performance Evaluation

• Testbed
- HP AlphaServer (1 GHz processor) with Quadrics interconnect
- Compaq C compiler for the translated C code
- Compared against HP UPC 2.1

• Cost of language features
- Shared pointer arithmetic, shared memory accesses, parallel loops, etc.

• Application benchmarks
- EP: no communication
- IS: large bulk memory operations
- MG: bulk memget
- CG: fine-grained vs. bulk memput

• Potential of optimizations
- Measure the effectiveness of various communication optimizations

Page 15:

Performance of Shared Pointer Arithmetic

• The phaseless pointer is an important optimization
• The packed representation also helps

(1 cycle = 1 ns; the struct format is 16 bytes)

[Chart: cost of shared address arithmetic in cycles for ptr+int and ptr==ptr, comparing the Berkeley struct, Berkeley packed, and HP representations on generic, cyclic, and indefinite pointers]

Page 16:

Cost of Shared Memory Access

• Local shared accesses are somewhat slower than private accesses
• The layered design does not add additional overhead
• Remote accesses are a few orders of magnitude worse than local ones

[Charts: cost of shared local access (on the order of ten cycles) and shared remote access (on the order of thousands of cycles), for HP and Berkeley reads and writes]

Page 17:

Parallel Loops in UPC

• UPC has a “forall” construct for distributing computation:

shared int v1[N], v2[N], v3[N];
upc_forall (i = 0; i < N; i++; &v3[i])
    v3[i] = v2[i] + v1[i];

• The affinity test is performed on every iteration to decide whether it should execute
• Two kinds of affinity expressions:
- Integer (compared with the thread id)
- Shared address (check the affinity of the address)

Affinity expression:   none   integer   shared address
Cycles per iteration:   6       17          10
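The per-iteration cost above comes from the affinity test the naive lowering must run every time. The hedged plain-C sketch below (MYTHREAD/THREADS are UPC built-ins, modeled here as ordinary values; the function names are invented) contrasts the naive lowering with the strided for loop that a forall-to-for transformation produces, which drops the test entirely.

```c
#include <assert.h>

/* Simulated SPMD constants standing in for UPC's THREADS. */
enum { N = 16, THREADS = 4 };

/* Naive lowering: the integer affinity test runs on every iteration. */
static int iters_naive(int mythread) {
    int work = 0;
    for (int i = 0; i < N; i++)
        if (i % THREADS == mythread)   /* affinity test */
            work++;
    return work;
}

/* Optimized lowering: the forall becomes a strided for loop, so each
 * thread visits only its own iterations and no test is needed. */
static int iters_opt(int mythread) {
    int work = 0;
    for (int i = mythread; i < N; i += THREADS)
        work++;
    return work;
}
```

Both versions assign each thread the same iterations; the optimized one simply never evaluates the N - N/THREADS failing tests.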

Page 18:

Application Performance

Page 19:

NAS Benchmarks (EP and IS)

[Charts: EP class A and IS class A performance in Mops/second vs. number of threads, Berkeley vs. HP]

• EP shows the backend C compiler can still successfully optimize translated C code
• IS shows the Berkeley UPC compiler is effective for communication operations

Page 20:

NAS Benchmarks (CG and MG)

[Charts: MG and NPB CG performance in MFLOPS vs. number of threads, Berkeley vs. HP]

• The Berkeley UPC compiler scales well

Page 21:

Performance of Fine-grained Applications

• Does not scale well, due to the nature of the benchmark (lots of small reads)
• HP UPC’s software caching helps performance

[Chart: fine-grained CG running time in seconds vs. number of threads, for Berkeley, HP, and HP with caching]

Page 22:

Observations on the Results

• Acceptable worst-case overhead for shared memory access latencies
- < 10 cycles of overhead for shared local accesses
- ~1.5 µs of overhead relative to the end-to-end network latency

• Optimizations on the pointer-to-shared representation are effective
- Both phaseless pointers and the packed 8-byte format

• Good performance compared to HP UPC 2.1

Page 23:

Communication Optimizations for UPC

Page 24:

Communication Optimizations

• Hiding communication latencies
- Use of non-blocking operations
- Possible placement analysis, separating get()/put() as far as possible from sync()
- Message pipelining to overlap communication with more communication

• Optimizing shared memory accesses
- Eliminating the locality test for local shared pointers: flow- and context-sensitive analysis
- Transforming forall loops into equivalent for loops
- Eliminating redundant pointer arithmetic for pointers with the same thread and phase

Page 25:

More Optimizations

• Message vectorization and aggregation
- Scatter/gather techniques
- Packing generally pays off for small (< 500 byte) messages

• Software caching and prefetching
- A prototype implementation:
- Local knowledge only: no coherence messages
- Caches remote reads and buffers outgoing writes
- Based on the weak-ordering consistency model
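The aggregation idea above can be sketched in a few lines of plain C. Everything here is hypothetical illustration (the buffer type, function names, and the fact that the "bulk transfer" is just a direct write into the destination array, since no real network is involved): small puts are queued locally and applied in one flush instead of one message each.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical aggregation buffer for small remote puts. */
#define MAX_PUTS 64
typedef struct { size_t off; int val; } small_put_t;
typedef struct { small_put_t q[MAX_PUTS]; int n; } agg_buf_t;

static void agg_put(agg_buf_t *a, size_t off, int val) {
    a->q[a->n++] = (small_put_t){ off, val };  /* queued, no message yet */
}

/* One bulk transfer applies every queued put; returns how many small
 * messages were coalesced into the single flush. */
static int agg_flush(agg_buf_t *a, int *dest) {
    for (int i = 0; i < a->n; i++)
        dest[a->q[i].off] = a->q[i].val;
    int coalesced = a->n;
    a->n = 0;
    return coalesced;
}
```

The payoff threshold quoted above (packing helps for messages under roughly 500 bytes) reflects that per-message overhead, not bandwidth, dominates at small sizes.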

Page 26:

Example: Optimizing Loops

(base)
for (…) {
    init 1; sync 1;
    compute 1;
    write 1;
    init 2; sync 2;
    compute 2;
    write 2;
    …
}

(read pipelining)
for (…) {
    init 1;
    init 2;
    init 3;
    …
    sync_all;
    compute all;
}

(communication/computation overlap)
for (…) {
    init 1;
    init 2;
    sync 1;
    compute 1;
    …
}
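The overlap variant can be made concrete with a hedged plain-C sketch. The non-blocking get here is a stub with invented names (a real runtime such as GASNet would return a handle before the data arrives; this stub copies eagerly), so it illustrates the structure of the transformation, not its timing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stubbed non-blocking get and its completion wait. */
typedef struct { int done; } get_handle_t;

static get_handle_t nb_get(int *dst, const int *src, int n) {
    memcpy(dst, src, (size_t)n * sizeof *dst);  /* eager stand-in copy */
    return (get_handle_t){ 1 };
}
static void wait_sync(get_handle_t h) { (void)h; /* wait for completion */ }

/* init 2 is issued before sync 1, so compute 1 can overlap the
 * second transfer while it is in flight. */
static int overlapped_sum(const int *remote1, const int *remote2, int n) {
    int buf1[64], buf2[64], sum = 0;
    get_handle_t h1 = nb_get(buf1, remote1, n);  /* init 1 */
    get_handle_t h2 = nb_get(buf2, remote2, n);  /* init 2 */
    wait_sync(h1);                               /* sync 1 */
    for (int i = 0; i < n; i++) sum += buf1[i];  /* compute 1 */
    wait_sync(h2);                               /* sync 2 */
    for (int i = 0; i < n; i++) sum += buf2[i];  /* compute 2 */
    return sum;
}
```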

Page 27:

Experimental Results

[Charts: effects of software pipelining and of message pipelining — running time in seconds vs. percentage of remote accesses, comparing naïve, software-pipelined, read-pipelined, and write-pipelined versions]

• Computation/communication overlap is better than communication/communication overlap on Quadrics
• Results are likely to differ on other networks

Page 28:

Example: Optimizing Local Shared Accesses

shared [B] int a1[N], a2[N], a3[N];

upc_forall (i = 0; i < N; i++; &a1[i]) {
    a1[i] = a2[i] + a3[i];
}

(base)

int *l1, *l2, *l3;
for (i = MYTHREAD; i < N/B; i += THREADS) {
    l1 = (int *)&a1[i*B]; l2 = (int *)&a2[i*B]; l3 = (int *)&a3[i*B];
    for (j = 0; j < B; j++)
        l1[j] = l2[j] + l3[j];
}

(use of private pointers)

Page 29:

Experimental Results

• Neither compiler performs well on the naïve version
• Culprit: pointer-to-shared operations + affinity tests
• Privatizing local shared accesses improves performance by an order of magnitude

[Charts: optimizations on vector add — Mops/second (log scale) vs. number of threads for naïve, opt1, and opt2 versions; and vector addition comparing Berkeley cyclic, Berkeley blocked, and HP]

Page 30:

Compiler Status

• A fully UPC 1.1 compliant public release in April
• Supported platforms:
- HP AlphaServer, IBM SP, Linux x86/Itanium, SGI Origin 2000, Solaris SPARC/x86, Mac OS X PowerPC
• Supported networks:
- Quadrics/Elan, Myrinet/GM, IBM/LAPI, and MPI
• A release this summer will include:
- Pthreads/System V shared memory support
- GASNet Infiniband support

Page 31:

Conclusion

• The Berkeley UPC Compiler achieves both portability and good performance
- Layered, modular design
- Effective pointer-to-shared optimizations
- Good performance compared to a commercially available UPC compiler
- Still lots of opportunities for communication optimizations
- Available for download at: http://upc.lbl.gov