Algorithms-based extension of serial computing education to parallelism

Uzi Vishkin

- Using Simple Abstraction to Reinvent Computing for Parallelism, CACM, January 2011, pp. 75-85
- http://www.umiacs.umd.edu/users/vishkin/XMT/


TRANSCRIPT

Page 1: Algorithms-based extension of serial computing education to parallelism

Algorithms-based extension of serial computing education to parallelism

Uzi Vishkin

- Using Simple Abstraction to Reinvent Computing for Parallelism, CACM, January 2011, pp. 75-85
- http://www.umiacs.umd.edu/users/vishkin/XMT/

Page 2

Commodity computer systems

If you want your program to run significantly faster … you’re going to have to parallelize it

Parallelism: only game in town

But, what about the programmer? “The Trouble with Multicore: Chipmakers are busy designing microprocessors that most programmers can't handle”—D. Patterson, IEEE Spectrum 7/2010

Only heroic programmers can exploit the vast parallelism in current machines – Report by CSTB, U.S. National Academies 12/2010

A San Antonio spin: Where would Mr. Maverick be on this issue? Conform with things that do not really work?!

Page 3

Parallel Random-Access Machine/Model

PRAM: n synchronous processors, all having unit-time access to a shared memory. Each processor also has a local memory. At each time unit, a processor can:

1. write into the shared memory (i.e., copy one of its local memory registers into a shared memory cell),
2. read from the shared memory (i.e., copy a shared memory cell into one of its local memory registers), or
3. do some computation with respect to its local memory.
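The lockstep read/compute/write discipline of one PRAM time unit can be simulated serially: all processors' reads complete before any write lands. A minimal C sketch (the function name and the shift example are illustrative assumptions, not from the talk):

```c
#include <stdlib.h>

/* Serial simulation of ONE synchronous PRAM time unit, for an
 * illustrative step: each of n "processors" i reads shared[i+1] into a
 * local register, then all processors write back. Separating the read
 * phase from the write phase models the lockstep semantics: no
 * processor sees another processor's write from the same time unit. */
void pram_step_shift(int *shared, int n) {
    int *local = malloc(n * sizeof *local);
    for (int i = 0; i < n; i++)                  /* phase 1: all reads */
        local[i] = (i + 1 < n) ? shared[i + 1] : shared[i];
    for (int i = 0; i < n; i++)                  /* phase 2: all writes */
        shared[i] = local[i];
    free(local);
}
```

Had the reads and writes been interleaved, later "processors" would observe earlier writes from the same round, which the PRAM model forbids.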

Page 4

So, an algorithm in the PRAM model is presented in terms of a sequence of parallel time units (or “rounds”, or “pulses”); we allow p instructions to be performed at each time unit, one per processor; this means that a time unit consists of a sequence of exactly p instructions to be performed concurrently.

SV-MaxFlow-82: way too difficult. Contrast, e.g., TCPP 12/2010: “simplest parallel model”.

Two drawbacks to the PRAM mode:
(i) It does not reveal how the algorithm will run on PRAMs with a different number of processors; e.g., to what extent will more processors speed the computation, or fewer processors slow it?
(ii) Fully specifying the allocation of instructions to processors requires a level of detail which might be unnecessary (e.g., a compiler may be able to extract it from lesser detail).

1st round of discounts ..

Page 5

Work-Depth presentation of algorithms

Work-Depth algorithms are also presented as a sequence of parallel time units (or “rounds”, or “pulses”); however, each time unit consists of a sequence of instructions to be performed concurrently, and that sequence may include any number of instructions.
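For instance, summing n numbers in the Work-Depth style: round t performs roughly n/2^t concurrent additions, for O(n) total work and O(log n) depth. A hedged serial C sketch (names are illustrative; each inner-loop pass stands for one round's concurrent instructions):

```c
#include <stdlib.h>
#include <string.h>

/* Work-Depth style parallel sum, simulated serially. The outer loop
 * iterates over rounds; every iteration of the inner loop belongs to
 * the SAME round and could run concurrently, since no two iterations
 * touch the same cell. Round with stride s does about n/(2s) additions. */
int wd_sum(const int *a, int n) {
    int *b = malloc(n * sizeof *b);
    memcpy(b, a, n * sizeof *b);
    for (int stride = 1; stride < n; stride *= 2)        /* one round per stride */
        for (int i = 0; i + stride < n; i += 2 * stride) /* concurrent additions */
            b[i] += b[i + stride];
    int s = b[0];
    free(b);
    return s;
}
```

Note that the number of instructions per round shrinks as the rounds proceed, which is exactly what the Work-Depth presentation permits and the strict p-per-round PRAM presentation does not.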

Why is this enough? See J-92, KKT01, or my classnotes

SV-MaxFlow-82: still way too difficult

Drawback to the WD mode: fully specifying the serial number of each instruction requires a level of detail that may be added later.

2nd round of discounts ..

Page 6

Informal Work-Depth (IWD) description

Similar to Work-Depth, the algorithm is presented in terms of a sequence of parallel time units (or “rounds”); however, at each time unit there is a set containing a number of instructions to be performed concurrently (‘ICE’).

Descriptions of the set of concurrent instructions can come in many flavors, even implicit ones, where the number of instructions is not obvious. The main methodical issue addressed here is how to train CS&E professionals “to think in parallel”. Here is the informal answer: train yourself to provide IWD descriptions of parallel algorithms. The rest is detail (although important) that can be acquired as a skill, by training (perhaps with tools).

Why is this enough for PRAM? See J-92, KKT01, or my classnotes

Page 7

Input: (i) All world airports. (ii) For each, all its non-stop flights. Find: the smallest number of flights from DCA to every other airport.

Basic (actually parallel) algorithm
Step i:
  For all airports requiring i-1 flights
    For all their outgoing flights
      Mark (concurrently!) all “yet unvisited” airports as requiring i flights
(note the nesting)

Serial: forces queue; O(T) time; T – total # of flights

Parallel: parallel data-structures. Inherent serialization: S.

Gain relative to serial: (first cut) ~T/S! Decisive also relative to coarse-grained parallelism.

Note: (i) “Concurrently”, as in natural BFS, is the only change to the serial algorithm. (ii) No “decomposition”/“partition”.
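The level-synchronous scheme above can be sketched in serial C (the CSR graph layout and all names are assumptions for illustration; the two inner loops are the nested “for all” that would run concurrently):

```c
/* Level-synchronous BFS as in the flights example: step i marks,
 * "concurrently", every yet-unvisited node one flight away from the
 * nodes that need i-1 flights. The graph is in CSR form: the edges of
 * node u are adj[adj_start[u] .. adj_start[u+1]-1].
 * level[v] == -1 means "yet unvisited". */
void bfs_levels(int n, const int *adj_start, const int *adj,
                int src, int *level) {
    for (int v = 0; v < n; v++) level[v] = -1;
    level[src] = 0;
    for (int i = 1, changed = 1; changed; i++) {
        changed = 0;
        for (int u = 0; u < n; u++)        /* for all nodes needing i-1 flights */
            if (level[u] == i - 1)
                for (int e = adj_start[u]; e < adj_start[u + 1]; e++)
                    if (level[adj[e]] == -1) {  /* mark: concurrent in PRAM */
                        level[adj[e]] = i;
                        changed = 1;
                    }
    }
}
```

No queue and no decomposition appear anywhere; the serial version would need the queue, while here each step is simply a set of concurrent marking instructions.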

Mental effort of PRAM-like programming:
1. sometimes easier than serial;
2. considerably easier than for any parallel computer currently sold. Understanding falls within the common denominator of other approaches.

Example of Parallel ‘PRAM-like’ Algorithm

Pages 8-12: [no transcript text]

Elements in my education platform

• Identify ‘thinking in parallel’ with the basic abstraction behind the SV82b work-depth framework. Note: adopted as the presentation framework in PRAM algorithms texts: J92, KKT01.

• Teach as many PRAM algorithms as the timing and developmental stage of the students permit; extensive ‘dry’ theory homework is required from graduate students, but little from high-school students.

• Students self-study programming in XMTC (standard C plus 2 commands, spawn and prefix-sum) and do demanding programming assignments

• Provide a programmer’s workflow that links the simple PRAM abstraction with XMTC (even tuned) programming. The synchronous PRAM provides ease of algorithm design and of reasoning about correctness and complexity. Multi-threaded programming relaxes this synchrony for implementation. Since reasoning directly about soundness and performance of multi-threaded code is known to be error-prone, the workflow only tasks the programmer with establishing that the code behavior matches the PRAM-like algorithm.

• Unlike PRAM, XMTC is far from ignoring locality. Unlike most approaches, XMTC preempts harm of locality on programmer’s productivity.

• If the XMT architecture is presented at all, it is only at the end of the course; parallel programming should not require architecture knowledge any more than serial programming does.
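The two XMTC additions mentioned above, spawn and prefix-sum (ps), are enough for the classic array-compaction idiom. A serial C sketch (XMTC itself is not standard C; here the loop stands in for spawn, the add-and-return-old-value stands in for ps, and all names are illustrative):

```c
/* Array compaction with the spawn + ps idiom, simulated serially.
 * In XMTC: spawn(0, n-1) launches n virtual threads, one per element;
 * ps(local, base) atomically adds local to base and returns base's OLD
 * value, so each thread claims a unique output slot. */
int compact_nonzero(const int *in, int n, int *out) {
    int base = 0;
    for (int i = 0; i < n; i++) {   /* stands in for: spawn(0, n-1) */
        if (in[i] != 0) {
            int idx = base;         /* stands in for: idx = ps(1, base) */
            base += 1;
            out[idx] = in[i];
        }
    }
    return base;                    /* number of elements kept */
}
```

In a genuinely parallel run the threads would claim slots in arbitrary order, so the output order is not guaranteed; this serial simulation happens to preserve it.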

Page 13

Where to find a machine that supports effectively such parallel algorithms?

• Parallel algorithms researchers realized decades ago that the main reason that parallel machines are difficult to program has been that bandwidth between processors/memories is limited. Lower bounds [VW85,MNV94].

• [BMM94]: 1. HW vendors see the cost benefit of lowering performance of interconnects, but grossly underestimate the programming difficulties and the high software development costs implied. 2. Their exclusive focus on runtime benchmarks misses critical costs, including: (i) the time to write the code, and (ii) the time to port the code to different distribution of data or to different machines that require different distribution of data.

G. Blelloch, B. Maggs & G. Miller. The hidden cost of low bandwidth communication. In Developing a CS Agenda for HPC (Ed. U. Vishkin). ACM Press, 1994

• Patterson, CACM04: Latency Lags Bandwidth. HP12: as latency improved by 30-80X, bandwidth improved by 10-25KX

Isn’t this great news: the cost benefit of low bandwidth is drastically decreasing.
• Not so fast. Senior HW engineer, 1/2011: Okay, you do have a ‘convenient’ way to do parallel programming; so what’s the big deal?!
• Commodity HW: the decomposition-first programming doctrine; heroic programmers; sigh …
Has the ‘bandwidth ease-of-programming opportunity’ got lost? Do we sugarcoat a salty cake instead of ‘returning it to the baker/store’?

Page 14

Suggested answers in this talk (soft, more like BMM)

1. Fault line. One side: commodity HW. Other side: this ‘convenient way’.

2. ‘Life’ across the fault line: so, what’s the point of heroic programmers?!

3. ‘Every CS major could program’: ‘no way’ vs promising evidence

4. Sooner or later, system vendors will see the connection to their bottom line and abandon directions perceived today as hedging one’s bets

Page 15

The fault line: Is PRAM too easy or too difficult?

BFS example. BFS is in the TCPP curriculum, 12/2010. But:
1. XMT/GPU speed-ups: same silicon area, highly parallel input: 5.4X! Small HW configuration, 20-way parallel input: 109X wrt the same GPU. Note: BFS on GPUs appears in research papers, but the PRAM version is too easy for a paper. Makes one wonder: why work so hard on a GPU?
2. BFS using OpenMP. Good news: easy coding (since no meaningful decomposition). Bad news: none of the 42 students in the joint F2010 UIUC/UMD course got any speedups (over serial) on an 8-processor SMP machine.

So, not only was PRAM too easy: there were no speedups. Also BFS… Speedups on a 64-processor XMT, using <= 1/4 of the silicon area of the SMP machine, ranged between 7x and 25x. The ‘PRAM is too difficult’ approach worked.

Makes one wonder: BFS is unavoidable. Can we (professionals/instructors) really defend teaching/using OpenMP for it? Any other commercial approach?

Page 16

Chronology around the fault line

Too easy:
• ‘Paracomputer’ Schwartz80
• BSP Valiant90
• LOGP UC-Berkeley93
• Map-Reduce. Success; not manycore
• CLRS-09, 3rd edition
• TCPP curriculum 2010
• Nearly all parallel machines to date
• “.. machines that most programmers cannot handle”
• “Only heroic programmers”

Too difficult:
• SV-82 and V-Thesis81
• PRAM theory (in effect)
• CLR-90, 1st edition
• J-92
• NESL
• KKT-01
• XMT97+ (supports the rich PRAM algorithms literature)
• V-11

Just right: PRAM model FW77

Nested parallelism: an issue for both sides; e.g., Cilk.
Current interest in new “computing stacks”: programmer’s model, programming languages, compilers, architectures, etc.
Merit of the fault-line image: two pillars holding a building (the stack) must be on the same side of a fault line. Chipmakers cannot expect a wealth of algorithms and high programmer’s productivity with architectures for which PRAM is too easy (e.g., ones that force programming for decomposition).

Page 17

Telling a fault line from the surface

Surface: “PRAM too difficult” (ICE, WD, PRAM) vs. “PRAM too easy” (PRAM “simplest model”*, BSP/Cilk*). [*per TCPP]
Fault line: effective bandwidth vs. in(e/su)fficient bandwidth.

Old soft claim, e.g., [BMM94]: the hidden cost of low bandwidth.
New soft claim: the surface (PRAM easy/difficult) reveals the side with respect to the bandwidth fault line.

Page 18

Ease of teaching/learning
• Benchmark: Can any CS major program your manycore? Cannot really avoid it!

Teachability demonstrated so far for XMT [SIGCSE’10]: to a freshman class with 11 non-CS students. Some programming assignments: merge-sort*, integer-sort* & sample-sort.

Other teachers:
- Magnet HS teacher. Downloaded the simulator, assignments, and class notes from the XMT page. Self-taught. Recommends: teach XMT first. Easiest to set up (simulator), program, and analyze: ability to anticipate performance (as in serial). Can do not just embarrassingly parallel problems. Also teaches OpenMP, MPI, CUDA. See also the keynote at CS4HS’09@CMU and an interview with the teacher.
- High-school & middle-school students (some 10-year-olds) from underrepresented groups, taught by an HS math teacher.

*Also in Nvidia’s Satish, Harris & Garland IPDPS09

Page 19

Middle School Summer Camp class picture, July ’09 (20 of 22 students)

Page 20

From the UIUC/UMD questionnaire

Split between UIUC and UMD students on: did PRAM algorithms help for XMT programming? UMD students: strong yes. Majority of Illinois students: no. Exposure of UIUC students to PRAM algorithms and XMT programming was more limited. This may demonstrate that students must be exposed to a minimal amount of parallel algorithms and their programming in order to internalize their merit. If this conclusion is valid, it creates tension with:

1. The pressure on instructors of parallel computing courses to cover several programming paradigms along with their required architecture background;

2. The tendency to teach “parallel computing” as a hodgepodge of topics, jumping from one to the other without teaching anything in depth, contrary to many other CS courses.

Page 21

Not just talking

Algorithms

PRAM parallel algorithmic theory. “Natural selection”. Latent, though not widespread, knowledgebase

“Work-depth”. SV82 conjectured: The rest (full PRAM algorithm) just a matter of skill.

Lots of evidence that “work-depth” works. Used as framework in main PRAM algorithms texts: JaJa92, KKT01

Later: programming & workflow

PRAM-On-Chip HW prototypes
• 64-core, 75MHz FPGA prototype of the XMT (Explicit Multi-Threaded) architecture [SPAA98..CF08]
• 128-core interconnection network, IBM 90nm: 9mm x 5mm, 400 MHz [HotI07]; fundamental work on asynchrony [NOCS’10]
• FPGA design to ASIC, IBM 90nm: 10mm x 10mm, 150 MHz
• Rudimentary yet stable compiler. Architecture scales to 1000+ cores on-chip.

Page 22

But, what is the performance penalty for easy programming? Surprise benefit! vs. GPU [HotPar10]

1024-TCU XMT simulations vs. code by others for GTX280. < 1 is slowdown. Sought: similar silicon area & same clock.

Postscript regarding BFS:
- 59X if average parallelism is 20
- 111X if XMT is … downscaled to 64 TCUs

Page 23

Problem acronyms
BFS: breadth-first search on graphs
Bprop: back-propagation machine learning algorithm
Conv: image convolution kernel with separable filter
Msort: merge-sort algorithm
NW: Needleman-Wunsch sequence alignment
Reduct: parallel reduction (sum)
Spmv: sparse matrix-vector multiplication

Page 24

New work: biconnectivity
• Not aware of GPU work. 12-processor SMP: < 4X speedups. Turning the TarjanV log-time PRAM algorithm into a practical version required significant modification. Their 1st try: 12 processors below serial.
• XMT: >9X to <42X speedups, with the TarjanV practical version. More robust for all inputs than BFS, DFS, etc.

Significance: 1. Log-time PRAM graph algorithms are ahead on speedups. 2. The paper makes a similar case for Shiloach-Vishkin log-time connectivity, which also beats GPUs on both speed-up and ease (a GPU research paper versus a grad-course programming assignment; even a couple of 10th graders implemented SV).

Even newer result: PRAM max-flow (a hybrid of ShiloachV & GoldbergTarjan) provides unprecedented speedups.