Unified Parallel C at LBNL/UCB
The Berkeley UPC Compiler: Implementation and Performance
Wei Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, Kathy Yelick
the LBNL/Berkeley UPC Group
http://upc.lbl.gov
Outline
• An Overview of UPC
• Design and Implementation of the Berkeley UPC Compiler
• Preliminary Performance Results
• Communication Optimizations
Unified Parallel C (UPC)
• UPC is an explicitly parallel global address space language with SPMD parallelism
- An extension of C
- Shared memory is partitioned by threads
- One-sided (bulk and fine-grained) communication through reads/writes of shared variables
• A collective effort by industry, academia, and government
- http://upc.gwu.edu
[Figure: the global address space. A shared region holds X[0], X[1], …, X[P], partitioned across the threads; each thread also has its own private region.]
UPC Programming Model Features
• Block-cyclically distributed arrays
• Shared and private pointers
• Global synchronization: barriers
• Pair-wise synchronization: locks
• Parallel loops
• Dynamic shared memory allocation
• Bulk shared memory accesses
• Strict vs. relaxed memory consistency models
Design and Implementation of the Berkeley UPC Compiler
Overview of the Berkeley UPC Compiler
[Figure: the compilation stack]
UPC Code
↓
Translator: lowers UPC code into ANSI-C code (platform-independent)
↓
Translator-Generated C Code
↓
Berkeley UPC Runtime System: shared memory management and pointer operations (network-independent)
↓
GASNet Communication System: uniform get/put interface for the underlying networks (compiler-independent)
↓
Network Hardware (language-independent)

Two goals: portability and high performance.
A Layered Design
• Portable:
- C is our intermediate language
- Can run on top of MPI (with a performance penalty)
- GASNet has a layered design with a small core
• High-performance:
- Native C compiler optimizes the serial code
- Translator can perform high-level communication optimizations
- GASNet can access network hardware directly
Implementing the UPC to C Translator
Preprocessed File → C front end → WHIRL with shared types → Backend lowering → WHIRL with runtime calls → whirl2c → ANSI-compliant C Code

• Based on the Open64 compiler
• Source-to-source transformation
• Converts shared memory operations into runtime library calls
• Designed to incorporate the existing optimization framework in Open64
• Communicates with the runtime via a standard API
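To make the lowering concrete, here is a hedged sketch of what the translated C might look like for a shared read such as `x = A[i]`. All names (`upcrt_pshared`, `upcrt_get_int`, `upcrt_index_cyclic`) and the in-process "shared heap" are invented stand-ins for the example; the real runtime API differs.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in pointer-to-shared (the real runtime hides the layout). */
typedef struct { int thread; size_t offset; } upcrt_pshared;

/* Simulated partitioned shared heap: one region per "thread". */
enum { FAKE_THREADS = 2, REGION = 64 };
static int shared_heap[FAKE_THREADS][REGION];

/* Stand-in for the runtime's get: fetch one int through a
 * pointer-to-shared, whether local or remote. */
static int upcrt_get_int(upcrt_pshared p)
{
    return shared_heap[p.thread][p.offset];
}

/* Stand-in index operation for a cyclic "shared int A[N]":
 * element i lives on thread i % FAKE_THREADS. */
static upcrt_pshared upcrt_index_cyclic(int i)
{
    upcrt_pshared p = { i % FAKE_THREADS, (size_t)(i / FAKE_THREADS) };
    return p;
}

/* "x = A[i];" in UPC source lowers (roughly) to: */
static int read_A(int i)
{
    return upcrt_get_int(upcrt_index_cyclic(i));
}
```

The point is that every shared access in the translated code is an ordinary C call, so the translator stays portable and the runtime is free to choose the pointer representation behind the API.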
Shared Arrays and Pointers in UPC
• Cyclic: shared int A[n];
• Block-cyclic: shared [2] int B[n];
• Indefinite: shared [0] int *C = (shared [0] int *) upc_alloc(n);
• Use global pointers (pointer-to-shared) to access shared (possibly remote) data
- Block size is part of the pointer type
- A generic pointer-to-shared contains: address, thread id, phase

Layout for THREADS = 2:
T0: A[0] A[2] A[4] …  B[0] B[1] B[4] B[5] …  C[0] C[1] C[2] …
T1: A[1] A[3] A[5] …  B[2] B[3] B[6] B[7] …
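The layouts above follow from the block-cyclic indexing rule: element i of a `shared [B]` array sits in block i/B, which lives on thread (i/B) % THREADS, at phase i % B. A small C sketch of a struct-based pointer-to-shared and this addressing math (illustrative only; the runtime's actual representation differs):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative struct form of a generic pointer-to-shared. */
struct pshared {
    size_t addr;   /* local address (byte offset) on the owning thread */
    int    thread; /* owning thread id */
    int    phase;  /* position within the current block */
};

/* For element i of "shared [B] int X[N]" with T threads:
 * block i/B lives on thread (i/B) % T; the phase is i % B. */
static struct pshared locate(size_t base, int i, int B, int T)
{
    struct pshared p;
    int block = i / B;
    p.thread  = block % T;
    p.phase   = i % B;
    /* block / T earlier blocks of B ints precede this one on the owner */
    p.addr    = base + ((size_t)(block / T) * B + p.phase) * sizeof(int);
    return p;
}
```

With T = 2 and B = 2, element B[5] falls in block 2, so it lands on thread 0 with phase 1, matching the layout shown above.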
Accessing Shared Memory in UPC

[Figure: a pointer-to-shared holds (address, thread, phase). Blocks of size `block size` are distributed round-robin across Thread 0 … Thread N-1, starting at the start of the array object; the phase is the offset from the start of the current block.]
Phaseless Pointer-to-Shared
• A pointer needs a “phase” to keep track of where it is in a block
- A source of overhead for pointer arithmetic
• Special case for “phaseless” pointers: cyclic + indefinite
- Cyclic pointers always have phase 0
- Indefinite pointers have only one block
- No need to keep the phase in pointer operations for cyclic and indefinite pointers
- No need to update the thread id for indefinite pointer arithmetic
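The saving shows up directly in the increment code the translator must emit. A hedged C sketch (illustrative representation, not the runtime's code) contrasting the generic increment with the two phaseless cases; here `addr` is the local address of the start of the current block, so the referenced element is `addr + phase * sizeof(int)`:

```c
#include <assert.h>
#include <stddef.h>

struct pshared { size_t addr; int thread; int phase; };

/* Generic increment for "shared [B] int *": track the phase and, on
 * block overflow, move to the next thread; after the last thread,
 * advance the local address to the next block row. */
static void inc_generic(struct pshared *p, int B, int T)
{
    if (++p->phase == B) {        /* crossed a block boundary */
        p->phase = 0;
        if (++p->thread == T) {   /* wrapped past the last thread */
            p->thread = 0;
            p->addr += (size_t)B * sizeof(int);
        }
    }
}

/* Cyclic (block size 1): the phase is always 0, so only the thread
 * id changes, plus an address bump on wrap. */
static void inc_cyclic(struct pshared *p, int T)
{
    if (++p->thread == T) {
        p->thread = 0;
        p->addr += sizeof(int);
    }
}

/* Indefinite (block size 0): one block on one thread; the increment
 * is a plain address bump with no thread or phase update at all. */
static void inc_indefinite(struct pshared *p)
{
    p->addr += sizeof(int);
}
```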
Pointer-to-Shared Representation
• Pointer size
- Want to allow pointers to reside in a register
- But very large machines may require a longer representation
• Datatype
- Use of scalar types (long) rather than a struct may improve backend code quality
- Faster pointer manipulation, e.g., ptr + int, as well as dereferencing
• Portability/performance balance in the UPC compiler
- 8-byte scalar vs. struct format (a configuration-time option)
- The pointer representation is hidden in the runtime layer
- The modular design makes it easy to add new representations
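As an illustration of the packed scalar form, the three fields can share one 64-bit word, which turns pointer manipulation into plain integer arithmetic. The field widths below are invented for the example; the real widths are a configuration-time choice of the runtime:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split of the 64 bits: 38-bit address, 16-bit thread,
 * 10-bit phase (illustrative only). */
#define ADDR_BITS   38
#define THREAD_BITS 16

typedef uint64_t packed_pshared;

static packed_pshared pack(uint64_t addr, uint64_t thread, uint64_t phase)
{
    return (phase << (ADDR_BITS + THREAD_BITS))
         | (thread << ADDR_BITS)
         | (addr & ((UINT64_C(1) << ADDR_BITS) - 1));
}

static uint64_t get_addr(packed_pshared p)
{
    return p & ((UINT64_C(1) << ADDR_BITS) - 1);
}
static uint64_t get_thread(packed_pshared p)
{
    return (p >> ADDR_BITS) & ((UINT64_C(1) << THREAD_BITS) - 1);
}
static uint64_t get_phase(packed_pshared p)
{
    return p >> (ADDR_BITS + THREAD_BITS);
}

/* With the address in the low bits, indefinite-pointer arithmetic is
 * a single integer add (assuming no overflow of the address field). */
static packed_pshared indefinite_add(packed_pshared p, uint64_t nbytes)
{
    return p + nbytes;
}
```

This is why the scalar format can beat the struct format: the backend C compiler keeps the whole pointer in one register and emits ordinary integer instructions for it.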
Performance Results
Performance Evaluation
• Testbed
- HP AlphaServer (1 GHz processors) with Quadrics interconnect
- Compaq C compiler for the translated C code
- Compared with HP UPC 2.1
• Cost of language features
- Shared pointer arithmetic, shared memory accesses, parallel loops, etc.
• Application benchmarks
- EP: no communication
- IS: large bulk memory operations
- MG: bulk memget
- CG: fine-grained vs. bulk memput
• Potential of optimizations
- Measure the effectiveness of various communication optimizations
Performance of Shared Pointer Arithmetic
• Phaseless pointers are an important optimization
• The packed representation also helps

(1 cycle = 1 ns; struct representation: 16 bytes)

[Chart: Cost of Shared Address Arithmetic. Number of cycles (0–90) for ptr + int and ptr == ptr, comparing Berkeley struct, Berkeley packed, and HP, each for generic, cyclic, and indefinite pointers.]
Cost of Shared Memory Access
• Local shared accesses are somewhat slower than private accesses
• The layered design does not add additional overhead
• Remote accesses are a few orders of magnitude slower than local ones

[Chart: Cost of Shared Local Access. Number of cycles (0–14) for HP read, Berkeley read, HP write, Berkeley write.]
[Chart: Cost of Shared Remote Access. Number of cycles (0–6000) for HP read, Berkeley read, HP write, Berkeley write.]
Parallel Loops in UPC
• UPC has a “forall” construct for distributing computation:

shared int v1[N], v2[N], v3[N];
upc_forall(i = 0; i < N; i++; &v3[i])
  v3[i] = v2[i] + v1[i];

• An affinity test is performed on every iteration to decide whether it should execute
• Two kinds of affinity expressions:
- Integer (compared with the thread id)
- Shared address (check the affinity of the address)

Affinity expression: none | integer | shared address
# cycles:              6  |   17    |      10
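The cycle counts reflect the branch the translator must emit on every iteration. A sketch (not the translator's actual output; MYTHREAD and THREADS appear here as ordinary variables) of how the loop above lowers for a cyclic array:

```c
#include <assert.h>

enum { N = 16, THREADS = 4 };

/* Integer affinity: iteration i runs on thread (i % THREADS). */
static int runs_here_int(int i, int mythread)
{
    return i % THREADS == mythread;
}

/* Shared-address affinity &v3[i]: for a cyclic array, element i lives
 * on thread i % THREADS, so the test compares the owner with MYTHREAD. */
static int owner_of_cyclic(int i) { return i % THREADS; }

/* upc_forall(i = 0; i < N; i++; &v3[i]) v3[i] = v2[i] + v1[i];
 * lowers (roughly) to a for loop with a per-iteration affinity test: */
static void vadd(int mythread, int v1[N], int v2[N], int v3[N])
{
    for (int i = 0; i < N; i++)
        if (owner_of_cyclic(i) == mythread)   /* the affinity test */
            v3[i] = v2[i] + v1[i];
}
```

Running `vadd` once per thread id covers every element exactly once, which is the semantics upc_forall guarantees.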
Application Performance
NAS Benchmarks (EP and IS)
[Chart: EP, class A. Mops/second (0–250) vs. number of threads (0–40), Berkeley vs. HP.]
[Chart: IS, class A. Mops/second (0–350) vs. number of threads (0–40), Berkeley vs. HP.]

• EP shows that the backend C compiler can still successfully optimize translated C code
• IS shows that the Berkeley UPC compiler is effective for bulk communication operations
NAS Benchmarks (CG and MG)
[Chart: MG performance. MFLOPS (0–14000) vs. number of threads (0–35), Berkeley vs. HP.]
[Chart: NPB CG performance. MFLOPS (0–1400) vs. number of threads (0–40), Berkeley vs. HP.]

• The Berkeley UPC compiler scales well
Performance of Fine-grained Applications
• Doesn’t scale well due to the nature of the benchmark (lots of small reads)
• HP UPC’s software caching helps performance

[Chart: CG. Time in seconds (0–20) vs. number of threads (0–20), for Berkeley, HP, and HP with caching.]
Observations on the Results
• Acceptable worst-case overhead for shared memory access latencies
- < 10 cycles of overhead for shared local accesses
- ~1.5 µs of overhead compared to the end-to-end network latency
• Optimizations on the pointer-to-shared representation are effective
- Both phaseless pointers and the packed 8-byte format
• Good performance compared to HP UPC 2.1
Communication Optimizations for UPC
Communication Optimizations
• Hiding communication latencies
- Use of non-blocking operations
- Possible placement analysis, separating get()/put() as far as possible from sync()
- Message pipelining to overlap communication with more communication
• Optimizing shared memory accesses
- Eliminating the locality test for local shared pointers: flow- and context-sensitive analysis
- Transforming forall loops into equivalent for loops
- Eliminating redundant pointer arithmetic for pointers with the same thread and phase
More Optimizations
• Message vectorization and aggregation
- Scatter/gather techniques
- Packing generally pays off for small (< 500-byte) messages
• Software caching and prefetching
- A prototype implementation:
- Local knowledge only: no coherence messages
- Caches remote reads and buffers outgoing writes
- Based on the weak ordering consistency model
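A prototype of that flavor can be sketched in C as a direct-mapped read cache plus an outgoing write buffer, flushed at synchronization points as weak ordering permits. Everything below is illustrative only, and it omits details a real cache needs (e.g., servicing reads from the write buffer, and bounds-checking the buffer):

```c
#include <assert.h>
#include <string.h>

enum { CACHE_LINES = 64, WBUF_MAX = 32 };

/* "Remote memory" stands in for data owned by another thread. */
static int remote_mem[1024];
static int remote_reads;   /* counts actual remote fetches (misses) */

static struct { int tag; int val; int valid; } cache[CACHE_LINES];
static struct { int addr; int val; } wbuf[WBUF_MAX];
static int wbuf_n;

/* Cached remote read: local knowledge only, no coherence messages. */
static int cached_get(int addr)
{
    int line = addr % CACHE_LINES;
    if (!cache[line].valid || cache[line].tag != addr) {
        remote_reads++;                    /* miss: fetch remotely */
        cache[line].tag   = addr;
        cache[line].val   = remote_mem[addr];
        cache[line].valid = 1;
    }
    return cache[line].val;
}

/* Buffered remote write: defer the outgoing put. */
static void buffered_put(int addr, int val)
{
    wbuf[wbuf_n].addr = addr;
    wbuf[wbuf_n].val  = val;
    wbuf_n++;
}

/* At a fence/barrier, weak ordering lets us flush the buffered
 * writes in bulk and invalidate the read cache. */
static void sync_point(void)
{
    for (int i = 0; i < wbuf_n; i++)
        remote_mem[wbuf[i].addr] = wbuf[i].val;
    wbuf_n = 0;
    memset(cache, 0, sizeof cache);
}
```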
Example: Optimizing Loops
for (…) {
  init1; sync1;
  compute1;
  write1;
  init2; sync2;
  compute2;
  write2;
  …
}
(base)

for (…) {
  init1;
  init2;
  init3;
  …
  sync_all;
  compute_all;
}
(read pipelining)

for (…) {
  init1;
  init2;
  sync1;
  compute1;
  …
}
(communication/computation overlap)
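The transformations only reorder the same operations. A toy C simulation, in which init/sync/compute are stubs that merely log their order (nothing here performs real communication), makes the base and read-pipelined schedules concrete:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static char log_buf[256];
static void op(const char *name) { strcat(log_buf, name); strcat(log_buf, " "); }

enum { STEPS = 3 };

/* Base schedule: each get is initiated and immediately waited on,
 * so the loop stalls once per message. */
static const char *base_schedule(void)
{
    log_buf[0] = '\0';
    for (int s = 0; s < STEPS; s++) {
        char buf[16];
        sprintf(buf, "init%d", s); op(buf);
        sprintf(buf, "sync%d", s); op(buf);   /* per-message stall */
        sprintf(buf, "comp%d", s); op(buf);
    }
    return log_buf;
}

/* Read pipelining: initiate all gets first so their latencies
 * overlap in the network, wait once, then compute. */
static const char *pipelined_schedule(void)
{
    char buf[16];
    log_buf[0] = '\0';
    for (int s = 0; s < STEPS; s++) { sprintf(buf, "init%d", s); op(buf); }
    op("sync_all");
    for (int s = 0; s < STEPS; s++) { sprintf(buf, "comp%d", s); op(buf); }
    return log_buf;
}
```

In a real compiler the inits become non-blocking gets and the syncs become handle waits; the win comes from the overlapped message latencies between the first init and the single sync.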
Experimental Results
[Chart: Effects of software pipelining. Time in seconds (4–16) vs. percentage of remote accesses (0–100), naïve vs. sw pipeline.]
[Chart: Effects of message pipelining. Time in seconds (0–6) vs. percentage of remote accesses (0–100), naïve vs. read pipeline vs. write pipeline.]

• Computation/communication overlap works better than communication/communication overlap on Quadrics
• Results are likely different on other networks
Example: Optimizing Local Shared Accesses
shared [B] int a1[N], a2[N], a3[N];

upc_forall (i = 0; i < N; i++; &a1[i]) {
  a1[i] = a2[i] + a3[i];
}
(base)

int *l1, *l2, *l3;
for (i = MYTHREAD; i < N/B; i += THREADS) {
  l1 = (int *) &a1[i*B]; l2 = (int *) &a2[i*B]; l3 = (int *) &a3[i*B];
  for (j = 0; j < B; j++)
    l1[j] = l2[j] + l3[j];
}
(use of private pointers)
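To check that the privatized loop covers exactly the right elements, here is a runnable C version of its index math (T, B, N are illustrative constants; the affinity rule for `shared [B]` is that block i/B lives on thread (i/B) % THREADS):

```c
#include <assert.h>
#include <string.h>

enum { T = 4, B = 8, N = 64 };   /* illustrative sizes; N divisible by T*B */

static int touched[N];

/* The privatized loop, run for one thread: iterate over the block
 * indices owned by `mythread`, then walk each block with a plain
 * local index, with no per-element affinity test. */
static void privatized_pass(int mythread)
{
    for (int i = mythread; i < N / B; i += T)   /* blocks owned here */
        for (int j = 0; j < B; j++)
            touched[i * B + j] = mythread;
}
```

Running the pass once per thread id marks every element with its owner, confirming the loop transformation preserves upc_forall's affinity semantics while replacing pointer-to-shared arithmetic with ordinary local indexing.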
Experimental Results
• Neither compiler performs well on the naïve version
• Culprit: pointer-to-shared operations + affinity tests
• Privatizing local shared accesses improves performance by an order of magnitude

[Chart: Optimizations on Vector Add. Mops/second (log scale, 1–10000) vs. number of threads (0–20), naïve vs. opt1 vs. opt2.]
[Chart: Vector Addition. Mops/second (0–25) vs. number of threads (0–20), Berkeley cyclic vs. Berkeley blocked vs. HP.]
Compiler Status
• A fully UPC 1.1 compliant public release in April
• Supported platforms:
- HP AlphaServer, IBM SP, Linux x86/Itanium, SGI Origin 2000, Solaris SPARC/x86, Mac OS X PowerPC
• Supported networks:
- Quadrics/Elan, Myrinet/GM, IBM/LAPI, and MPI
• A release this summer will include:
- Pthreads / System V shared memory support
- GASNet Infiniband support
Conclusion
• The Berkeley UPC Compiler achieves both portability and good performance
- Layered, modular design
- Effective pointer-to-shared optimizations
- Good performance compared to a commercially available UPC compiler
- Still many opportunities for communication optimizations
• Available for download at: http://upc.lbl.gov