
Page 1:

Some Challenges in Parallel and Distributed Hardware Design and Programming

Ganesh Gopalakrishnan*
School of Computing, University of Utah, Salt Lake City, UT

* Past work supported in part by SRC Contract 1031.001, NSF Award 0219805 and an equipment grant from Intel Corporation

Page 2:

Background: Shared Memory and Distributed Processors

(Chart: CPU performance vs. memory performance over time. Photo courtesy LLNL / IBM.)

http://www.theinquirer.net/?article=12145

By Nebojsa Novakovic: Thursday 16 October 2003, 06:49

NOVA HAS been to the Microprocessor Forum and captured this picture of POWER5 chief scientist Balaram Sinharoy holding this eight way POWER5 MCM with a staggering 144MB of cache. Sheesh Kebab!

8 × 2 CPUs × 2-way SMT = “32 shared-memory CPUs” on the palm

Released in 2000. Peak performance: 12.3 teraflops. Processors used: 8,192 IBM RS/6000 SP POWER3s at 375 MHz. Total RAM: 6 TB. Two hundred cabinets, covering the area of two basketball courts.

Page 3:

Background: Motivation for (weak) Shared Memory Consistency models:

A Hardware Perspective:
• Cannot afford to do eager updates across large SMP systems

• Delayed updates allow considerable latitude in memory consistency protocol design

The trade-off: fewer bugs in protocols versus more complex shared memory consistency models (a litmus-test sketch follows the diagram note below).

(Diagram: a hierarchy of chip-level, intra-cluster, and inter-cluster protocols connecting directories (dir) and memories (mem).)
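To make the motivation concrete, here is a minimal store-buffering litmus test in C with POSIX threads (an illustrative sketch, not from the slides): under sequential consistency the outcome r0 == 0 && r1 == 0 is impossible, but hardware that buffers and delays updates can produce it, which is exactly the latitude a weak consistency model legitimizes.

/* Store-buffering litmus test (illustrative sketch, not from the slides).
 * Under sequential consistency, (r0 == 0 && r1 == 0) is impossible;
 * hardware with store buffers / delayed updates may produce it. */
#include <pthread.h>
#include <stdio.h>

volatile int x = 0, y = 0;
volatile int r0, r1;

void *t0(void *arg) { x = 1; r0 = y; return NULL; }
void *t1(void *arg) { y = 1; r1 = x; return NULL; }

int main(void) {
    for (int i = 0; i < 100000; ++i) {
        x = y = 0;
        pthread_t a, b;
        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r0 == 0 && r1 == 0)
            printf("weak outcome seen at iteration %d\n", i);
    }
    return 0;
}

Compile with -pthread; whether the weak outcome actually shows up depends on the machine's memory model and timing.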

Page 4:

Background: Programming Models for “Supercomputers”

A likely programming model for ASCI White is four MPI tasks per node, with four threads per MPI task. This model exploits both the number of CPUs per node and each node's switch adapter bandwidth.

Job limits are 4,096 MPI tasks for US (high speed) protocol and 8,192 MPI tasks for IP (lower speed).

(Diagram courtesy LLNL / IBM)
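A minimal sketch of that hybrid style, assuming MPI_THREAD_FUNNELED support and four worker threads per MPI task (the thread count and the final reduction are illustrative, not part of the slide):

/* Hybrid MPI + pthreads sketch (illustrative; counts are assumptions).
 * Mirrors the "several threads per MPI task" style: each task spawns
 * worker threads for compute, while only the main thread performs MPI
 * communication (MPI_THREAD_FUNNELED). */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define THREADS_PER_TASK 4

void *worker(void *arg) {
    long id = (long)arg;
    /* per-thread compute would go here */
    return (void *)id;
}

int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    pthread_t tid[THREADS_PER_TASK];
    for (long t = 0; t < THREADS_PER_TASK; ++t)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (int t = 0; t < THREADS_PER_TASK; ++t)
        pthread_join(tid[t], NULL);

    /* only the main thread communicates */
    int local = rank, sum = 0;
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}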

Page 5:

Some Challenges in Shared Memory Processor Design and SMP / Distributed Programming

• Model Checking Cache Coherency / Shared Memory Consistency protocols -- ongoing work in our group

• Model Checking Distributed Memory programs used for Scientific Simulations (“MPI programs”) – incipient in our group

• Runtime Checking under Limited Observability – spent some time during sabbatical on it

Page 6:

Solved Problems in FV for Shared Memory Consistency

• Modeling and Verification of Directory Protocols for small configurations for Cache Coherency

Unsolved:
• Scaling industrial coherence protocol verification beyond 4 nodes
  - State explosion (a toy reachability sketch follows the tutorial pointer below)
• Parameterized verification with reasonable automation
  - Invariant discovery
• Many decidability results are unknown
  - Inadequate general interest in the community
• Small-configuration verification of Shared Memory Consistency, even for midscale benchmarks
  - Added complexity of the property being verified

See the tutorial on Shared Memory Consistency Models and Protocols by Chou, German, and Gopalakrishnan, available from http://www.cs.utah.edu/~ganesh/presentations/fmcad04_tutorial2
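As a toy illustration of small-configuration model checking, and of why state explosion bites, the sketch below enumerates the reachable states of a 2-cache MSI-like abstraction and checks the usual coherence invariant; the protocol, its transition rules, and the invariant are simplifications assumed here, not an industrial directory protocol.

/* Explicit-state reachability sketch for a toy 2-cache MSI-like protocol
 * (small-configuration model checking in miniature; not a real directory
 * protocol).  States per cache: I(0), S(1), M(2).
 * Events: a cache issues a read (it goes to S, any M elsewhere downgrades
 * to S) or a write (it goes to M, all others invalidate).  The safety
 * property is the usual coherence invariant: at most one M, and M excludes S. */
#include <stdio.h>
#include <string.h>

#define N 2                 /* number of caches: state space is 3^N */
enum { I, S, M };

static int seen[3][3];      /* visited states, indexed by (c0, c1) */
static int queue[64][N], head, tail;

static void visit(int c[N]) {
    if (seen[c[0]][c[1]]) return;
    seen[c[0]][c[1]] = 1;
    if ((c[0] == M && c[1] != I) || (c[1] == M && c[0] != I))
        printf("coherence violation: %d %d\n", c[0], c[1]);
    memcpy(queue[tail++], c, sizeof(int) * N);
}

int main(void) {
    int init[N] = { I, I };
    visit(init);
    while (head < tail) {
        int cur[N];
        memcpy(cur, queue[head++], sizeof cur);
        for (int i = 0; i < N; ++i) {
            int next[N];
            /* read by cache i: i -> S, any M elsewhere -> S */
            memcpy(next, cur, sizeof cur);
            next[i] = S;
            for (int j = 0; j < N; ++j) if (j != i && next[j] == M) next[j] = S;
            visit(next);
            /* write by cache i: i -> M, everyone else -> I */
            memcpy(next, cur, sizeof cur);
            next[i] = M;
            for (int j = 0; j < N; ++j) if (j != i) next[j] = I;
            visit(next);
        }
    }
    printf("explored %d states\n", tail);
    return 0;
}

Even in this toy, the state space grows as 3^N in the number of caches; real protocols add directory, network, and transient states per node, which is where explicit-state checking beyond a handful of nodes runs out of steam.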

Page 7:

Challenges in producing Dependable and Fast MPI / Threads programs

• Threads style: deal with Locks, Condition Variables, Re-entrancy, Thread Cancellation, …
• MPI: deal with the complexity of
  * Single-Program Multiple-Data (SPMD) programming
  * Performance optimizations to reduce communication costs
  * MPI itself (MPI-1 has 130 calls; MPI-2 has 180; various flavors of sends / receives) – a classic pitfall is sketched after this list
• Threads and MPI are often used together
• MPI libraries are themselves threaded
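As one concrete example of that complexity (an assumed illustration, not from the slides), the head-to-head blocking-send pattern below deadlocks once messages exceed the MPI library's eager threshold, yet passes most small tests:

/* A classic MPI pitfall (illustrative sketch, not from the slides):
 * both ranks issue a blocking MPI_Send before their MPI_Recv.  For
 * large messages that exceed the library's eager threshold, MPI_Send
 * blocks until the receiver posts a receive, and the program deadlocks
 * even though it "usually works" on small messages. */
#include <mpi.h>
#include <stdlib.h>

#define COUNT (1 << 20)   /* large enough to force rendezvous on many MPIs */

int main(int argc, char **argv) {
    int rank, peer;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;                      /* run with exactly 2 ranks */

    double *buf = malloc(COUNT * sizeof *buf);
    MPI_Send(buf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);   /* may block forever */
    MPI_Recv(buf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* a safe alternative is MPI_Sendrecv, or pairing MPI_Isend with MPI_Recv */
    free(buf);
    MPI_Finalize();
    return 0;
}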

Page 8:

Solved and Unsolved Problems in MPI/Thread programming

• Solved Problems (by Avrunin and Siegel for MPI, as well as our group):

- Modeling MPI library in Promela

- Model-checking simple MPI programs

• Unsolved Problems: a rather long list, with some being:

- Model-extraction

- Handling Mixed-paradigm programs

- Formal Methods to find / justify optimizations

- Verifying Reactive aspects / Computational aspects

Page 9:

Needs of an HPC programmer (learned by working with a domain expert – Prof. Kirby)

• A typical HPC program development cycle consists of:
  * Understand what is being simulated (the physics, biology, etc.)
  * Develop a mathematical model of the relevant “features” of interest
  * Generate a numerical model that …
  * Solve the numerical model
    • Usually begins as serial code
    • Later the numerical model – not the serial code – is parallelized
  * Often best to develop a numerical model that is amenable to parallelization
  * At every step, check consistency (e.g. conservation of energy) – a minimal sketch follows this list
  * Tune for load-balancing; make the code adaptive; …
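A minimal sketch of the "check consistency at every step" idea, using a harmonic oscillator and an energy tolerance (the model, the integrator, and the tolerance are assumptions for illustration):

/* Sketch of a per-step consistency check (illustrative, not from the
 * slides): integrate a harmonic oscillator and verify that total energy
 * stays within a tolerance of its initial value. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double k = 1.0, m = 1.0, dt = 1e-3, tol = 1e-3;
    double x = 1.0, v = 0.0;
    const double e0 = 0.5 * k * x * x + 0.5 * m * v * v;

    for (int step = 0; step < 10000; ++step) {
        /* semi-implicit (symplectic) Euler step */
        v -= (k / m) * x * dt;
        x += v * dt;

        double e = 0.5 * k * x * x + 0.5 * m * v * v;
        if (fabs(e - e0) > tol) {
            fprintf(stderr, "energy drift %.3e at step %d\n", e - e0, step);
            return 1;
        }
    }
    printf("energy conserved to within %.0e\n", tol);
    return 0;
}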

Page 10:

Another Domain Expert (Berzins): Adaptive Mesh-refinement Code is Hard!

(Photo courtesy NHTSA)

Page 11:

(Diagram: the framework under construction at Utah; students: Palmer, Yang, Barrus. Components shown: an MPI program, compiled by CIL / MPICC into an MPI binary and run through a Model Extractor to obtain a Program Model; an Environment Model and an MPI Library Model; the Zing model checker, organized as an MC Server coordinating many MC Clients; a Result Analyzer that either reports OK or feeds Error Visualization & Simulation; and an Abstraction Refinement loop back to the models.)

Example MPI program from the slide:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    int myid;
    int numprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) {
        int i;
        for (i = 1; i < numprocs; ++i) {
            MPI_Send(&i, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
        }
        printf("%d Value: %d\n", myid, myid);
    } else {
        int val;
        MPI_Status s;
        MPI_Recv(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &s);
        printf("%d Value: %d\n", myid, val);
    }

    MPI_Finalize();
    return 0;
}

Program Model sketch (Promela) from the slide:

int y;
active proctype T1() {
    int x;
    x = 1;
    if
    :: x = 0;
    :: x = 2;
    fi;
    y = x;
}
active proctype T2() {
    int x;
    x = 2;
    if
    :: y = x + 1;
    :: y = 0;
    fi;
    assert( y == 0 );
}

MPI Library Model sketch (Promela) from the slide:

proctype MPI_Send(chan out, int c)  { out!c; }
proctype MPI_Bsend(chan out, int c) { out!c; }
proctype MPI_Isend(chan out, int c) { out!c; }
typedef MPI_Status {
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
}
…

Page 12:

Where Post-Si Verification fits in the Hardware Verification Flow

(Diagram: the verification flow from Spec to product. Pre-manufacture: Specification Validation and Design Verification. Post-manufacture: Testing for Fabrication Faults and Post-Silicon Verification, which asks “Does functionality match designed behavior?”)

Page 13:

Post-Si Verification for Cache Protocol Execution

• In the future:
  - CANNOT assume there is a “front-side bus”
  - CANNOT record all link traffic
  - CAN ONLY generate sets of possible cache states
  - HOW BEST can one match against designed behavior? (a matching sketch follows the diagram note)

(Diagram: several CPUs exchanging “miss” traffic, only some of which is visible to the observer.)
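One plausible shape for the matching step, assuming a MESI-like design and snapshots of per-CPU line states (the states and the invariant below are illustrative assumptions, not the actual designed behavior being checked):

/* Sketch of "generate sets of possible cache states and match against
 * designed behavior" (illustrative assumptions, not the actual flow):
 * an observed snapshot of per-CPU line states is checked against the
 * combinations a MESI-like protocol is designed to allow. */
#include <stdio.h>

enum state { INV, SHD, EXC, MOD };
#define NCPU 4

/* designed-behavior invariant: at most one E/M copy, and E/M excludes S */
static int allowed(const enum state snap[NCPU]) {
    int owners = 0, sharers = 0;
    for (int i = 0; i < NCPU; ++i) {
        if (snap[i] == EXC || snap[i] == MOD) ++owners;
        if (snap[i] == SHD) ++sharers;
    }
    return owners <= 1 && !(owners == 1 && sharers > 0);
}

int main(void) {
    enum state ok[NCPU]  = { SHD, SHD, INV, INV };
    enum state bad[NCPU] = { MOD, SHD, INV, INV };
    printf("snapshot 1: %s\n", allowed(ok)  ? "consistent" : "VIOLATION");
    printf("snapshot 2: %s\n", allowed(bad) ? "consistent" : "VIOLATION");
    return 0;
}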

Page 14:

Back to our specific problem domain...

Verify the operation of systems at runtime when we can’t see all transactions.

Could also be offline analysis of a partial log of activities

(Diagram: events a, b, x, y, c, d occurring across the system; the observed partial log is the sequence a x c d y b … – a log-checking sketch follows.)
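A minimal sketch of checking such a partial log offline: each rule says that if both events were observed, one must precede the other, and unobserved events are simply skipped, modeling limited observability. The rules below are assumptions for illustration; only the observed sequence comes from the slide.

/* Offline check of a partial log against ordering constraints
 * (illustrative sketch; the rules are assumptions). */
#include <stdio.h>
#include <string.h>

static const char *log_events = "axcdyb";          /* observed trace from the slide */
static const char rules[][2] = { {'a','b'}, {'c','d'}, {'x','y'} };  /* assumed spec */

int main(void) {
    int violations = 0;
    for (size_t r = 0; r < sizeof rules / sizeof rules[0]; ++r) {
        const char *first  = strchr(log_events, rules[r][0]);
        const char *second = strchr(log_events, rules[r][1]);
        /* only flag a violation when both events were actually observed */
        if (first && second && first > second) {
            printf("violation: '%c' observed after '%c'\n", rules[r][0], rules[r][1]);
            ++violations;
        }
    }
    printf("%d violation(s) in partial log \"%s\"\n", violations, log_events);
    return 0;
}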

Page 15:

Required Constraint-Solving Approaches

Constraint Solving in the context of Coupled Reactive Processes

(Diagram: four coupled reactive processes, each with events a, b, c, d, e; the legend marks an observed event and its likely cause.)

Page 16:

Contributions that we can make:

• Create benchmark problems

• Can define tangible measures of success in each domain

• Can work with the industry

• Contribute tools and work with other expert groups

Page 17:

Formal Methods

• Principal faculty:
  - Konrad Slind (does deductive verification)
  - Ganesh Gopalakrishnan (does algorithmic verification)

Page 18: (repeats Page 2: Background: Shared Memory and Distributed Processors)

Page 19:

2.3: How MPEC Works

• Itanium ordering rules are formalized in HOL.
• A checker program is derived from them (by hand, presently).
• The checker program turns the MP execution to be verified into a satisfiability problem whose clauses carry annotations.
• A SAT solver decides it:
  - Sat → an explanation in the form of one possible interleaving
  - Unsat → unsat core extraction using Zcore: find the offending clauses, trace their annotations, and determine the “ordering cycle”

MP execution to be verified (an ordering-cycle sketch follows):

  P:  st [x] = 1
      mf
      ld r1 = [y]   <0>

  R:  ld.acq r2 = [y]   <1>
      ld r3 = [x]   <0>

  Q:  st.rel [y] = 1
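A sketch of the final "determine the ordering cycle" step: ordering requirements recovered from the unsat core form a directed graph over the events of the MP execution, and a cycle shows that no interleaving can satisfy them. The specific edges below are hand-picked assumptions for illustration; whether this outcome is actually forbidden is decided by the full Itanium rules formalized in HOL.

/* Ordering-cycle detection sketch (the edge set below is an illustrative
 * assumption, not the actual clause annotations from MPEC): a cycle in
 * the ordering graph means the observed execution cannot be linearized
 * under the ordering rules. */
#include <stdio.h>

#define NEV 5
static const char *ev[NEV] = {
    "P: st [x] = 1", "P: ld r1 = [y] <0>",
    "Q: st.rel [y] = 1",
    "R: ld.acq r2 = [y] <1>", "R: ld r3 = [x] <0>"
};
/* edge[i][j] = 1 means event i must precede event j (assumed edges) */
static int edge[NEV][NEV];
static int mark[NEV];            /* 0 = unvisited, 1 = on stack, 2 = done */

static int dfs(int v) {          /* returns 1 if a cycle is reachable from v */
    mark[v] = 1;
    for (int w = 0; w < NEV; ++w)
        if (edge[v][w]) {
            if (mark[w] == 1) {
                printf("ordering cycle closed: %s -> %s\n", ev[v], ev[w]);
                return 1;
            }
            if (mark[w] == 0 && dfs(w)) return 1;
        }
    mark[v] = 2;
    return 0;
}

int main(void) {
    edge[0][1] = 1;   /* fence (mf) order on P                         (assumed) */
    edge[1][2] = 1;   /* ld r1 = [y] reads 0, so it precedes the st.rel (assumed) */
    edge[2][3] = 1;   /* ld.acq reads the value written by the st.rel   (assumed) */
    edge[3][4] = 1;   /* acquire ordering on R                          (assumed) */
    edge[4][0] = 1;   /* ld r3 = [x] reads 0, so it precedes st [x] = 1 (assumed) */
    for (int v = 0; v < NEV; ++v)
        if (mark[v] == 0 && dfs(v)) return 1;
    printf("no ordering cycle: an interleaving exists\n");
    return 0;
}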

Page 20: (repeats Page 10: Another Domain Expert (Berzins): Adaptive Mesh-refinement Code is Hard!)

Page 21: (repeats Page 11: the MPI model-checking framework under construction at Utah – students: Palmer, Yang, Barrus)