[email protected] 31-Oct-2008
Matt Frank, University of Illinois
Simplifying Parallel Programming with Compiler Transformations
What I’m ranting about
• Transformations that alleviate tedium
• Analogous to: code generation, register allocation, and instruction scheduling
  – (not really “optimizations”)
• Mainly: loop distribution, reassociation, “scalar” expansion, inspector-executor, hashing
• These cover much more than you might think
• Parallel-language expressivity
Assumptions
• Cache-coherent shared-memory many-cores
  – (I’m not addressing distributed-memory issues)
• Synchronization is somewhat expensive
  – Don’t use barriers gratuitously (but don’t avoid them at all costs)
• Analysis is not my problem
  – The programmer annotates
• Non-determinism is outside the realm of this talk
  – No race detection in this talk either
Compiler Flow
• Front-end: type systems and whole-program analysis
• Program-dependence-graph (PDG) based compiler
• Runtime/execution platform, with feedback to the compiler
• New information:
  1. Type systems (e.g. DPJ)
  2. Domain-specific objects
  3. Run-time feedback
• Program analysis (information about high-level program invariants) enables more efficient coherence, checkpointing, and q.o.s.
• New capabilities: checkpointing, q.o.s. guarantees
I’m leaving out locality
• Front-end: type systems and whole-program analysis
• Parallelism-exposing transformations
• Runtime/execution platform
• Tiling, etc.
What’s enabled?
• Loops that contain arbitrary control flow
  – Including early exits, arbitrary function calls, etc.
• Arbitrary iterators (even sequential ones)
  – They can’t depend on the main body of the computation, though
• Arbitrary combinations of data-parallel work, scans, and reductions
• Can use “partial sums” inside the loop
• Buffered printf
The transformations
• Scalar expansion
  – Eliminates anti- and output dependences
  – Can be applied to properly scoped aggregates
• Reassociation
  – Integer reassociation is extraordinarily useful
  – Can use partial sums later in the loop!
• Loop distribution
  – Think of it as scheduling
• Inspector-executor
  – Works as long as the data access pattern is invariant in the loop
You’ve heard of map-reduce

  doall i (1..n)
    private j = f(X[i])
    total = total + j

Scalar expansion of j separates the parallel map from the sequential reduction:

  shared j[n]
  doall i (1..n)
    j[i] = f(X[i])
  do i (1..n)
    total = total + j[i]
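The slide’s transformation can be sketched in Python (an illustrative sketch, not the talk’s implementation; the names `f`, `X`, and `map_reduce` are invented for the example). The doall becomes a thread-pool map in which iteration i writes only its own slot `j[i]`, so there are no anti- or output dependences; the reduction stays sequential.

```python
from concurrent.futures import ThreadPoolExecutor

def f(x):
    # stand-in for the per-element work of the slide's f(X[i])
    return x * x

def map_reduce(X):
    n = len(X)
    j = [0] * n                      # "scalar expansion": one slot per iteration
    with ThreadPoolExecutor() as pool:
        # doall: each iteration writes only j[i], so iterations are independent;
        # the with-block joins all workers before the reduction starts
        list(pool.map(lambda i: j.__setitem__(i, f(X[i])), range(n)))
    total = 0
    for i in range(n):               # sequential reduction (could be a parallel scan)
        total += j[i]
    return total
```

The sequential reduction could itself be replaced by a tree reduction; the point of the transformation is only that the map and the reduce no longer share a scalar.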
How ’bout scan-map?

  struct { data; *next; } *p;
  doall p != NULL
    modify(p->data)
    p = p->next

The pointer chase becomes a sequential scan that collects the nodes, followed by a doall over the collected array:

  n = 0
  do while p != NULL
    a[n++] = p
    p = p->next
  doall i (0..n)
    modify(a[i]->data)
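A minimal Python sketch of this scan-then-map transformation (the `Node` class and `parallel_modify` helper are invented for the example): the sequential scan materializes the list into an indexable array, after which the node updates are independent and can run in a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

class Node:
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

def parallel_modify(head, modify):
    # sequential scan: the pointer chase collects the nodes into an array
    a = []
    p = head
    while p is not None:
        a.append(p)
        p = p.next
    # doall: each node's update touches only that node, so the
    # iterations can run in parallel over the collected array
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda node: setattr(node, "data", modify(node.data)), a))
```

The scan is inherently sequential, but it is cheap (pointer chasing only); the expensive `modify` calls are what get parallelized.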
Sparse matrix construction
  scan int ptr = 0
  shared float data[m]
  shared int rows[n]
  doall row (1..n)
    private j
    rows[row] = ptr
    for j in non_zeros(row)
      data[ptr] = foo(row, j)
      ptr++

[diagram: each rows[row] entry points at that row’s slice of the data array; ptr is the running fill position]
Partial Sum Expansion

  scan int ptr = 0
  shared float data[m]
  shared int rows[n]
  doall row (1..n)
    private j
    rows[row] = ptr
    for j in non_zeros(row)
      data[ptr] = foo(row, j)
      ptr++

Expanding the partial sum (scalar-expanding ptr) gives:

  scan int ptr[n]          # scalar-expand ptr
  shared float data[m]
  shared int rows[n]
  doall row (1..n)
    private j
    ptr[row] = 0
    rows[row] = rows[row-1] + ptr[row-1]
    for j in non_zeros(row)
      data[rows[row] + ptr[row]] = foo(row, j)
      ptr[row]++
Scalar Expansion

  scan int ptr[n]
  shared float data[m]
  shared int rows[n]
  doall row (1..n)
    private j
    ptr[row] = 0
    rows[row] = rows[row-1] + ptr[row-1]
    for j in non_zeros(row)
      data[rows[row] + ptr[row]] = foo(row, j)
      ptr[row]++

Expanding data into a private buffer, and fissioning the inner loop, gives:

  scan int ptr[n]
  shared float data[m]
  shared int rows[n]
  doall row (1..n)
    private j
    private vector mydata
    ptr[row] = 0
    rows[row] = rows[row-1] + ptr[row-1]
    for j in non_zeros(row)
      mydata.pushback(foo(row, j))
      ptr[row]++
    for j (rows[row], rows[row]+ptr[row])
      data[j] = mydata.popfront()
Outer Loop Fission

  scan int ptr[n]
  shared float data[m]
  shared int rows[n]
  doall row (1..n)
    private j
    private vector mydata
    ptr[row] = 0
    rows[row] = rows[row-1] + ptr[row-1]
    for j in non_zeros(row)
      mydata.pushback(foo(row, j))
      ptr[row]++
    for j (rows[row], rows[row]+ptr[row])
      data[j] = mydata.popfront()

Fissioning the outer loop isolates the sequential prefix-sum scan between two doalls:

  scan int ptr[n]
  shared float data[m]
  shared int rows[n]
  doall row (1..n)
    private j
    private vector mydata
    ptr[row] = 0
    for j in non_zeros(row)
      mydata.pushback(foo(row, j))
      ptr[row]++
  do row (1..n)
    rows[row] = rows[row-1] + ptr[row-1]
  doall row (1..n)
    for j (rows[row], rows[row]+ptr[row])
      data[j] = mydata.popfront()
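The fully fissioned sparse-matrix construction can be sketched in Python (a sketch of the resulting loop structure, not the talk’s compiler output; `build_csr`, `non_zeros`, and `foo` are the slide’s names used as parameters). The two doall phases are written as plain loops here so the sketch is deterministic; each is parallelizable because its iterations touch disjoint data.

```python
from itertools import accumulate

def build_csr(n, non_zeros, foo):
    """Three-phase CSR construction after outer loop fission.

    Phase 1 (doall): each row fills a private buffer and counts entries.
    Phase 2 (sequential scan): prefix-sum the counts into row offsets.
    Phase 3 (doall): each row copies its buffer into its disjoint
    slice of the shared data array.
    """
    counts = [0] * n
    buffers = [[] for _ in range(n)]
    for row in range(n):                      # doall: rows are independent
        for j in non_zeros(row):
            buffers[row].append(foo(row, j))
            counts[row] += 1
    rows = [0] + list(accumulate(counts))     # the only sequential step
    data = [None] * rows[-1]
    for row in range(n):                      # doall: disjoint slices of data
        data[rows[row]:rows[row + 1]] = buffers[row]
    return rows, data
```

Only the O(n) prefix sum remains sequential; all the `foo` evaluations and all the data movement parallelize.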
printf() is the same pattern

  doall i (1..n)
    private mystring = s(i)
    printf(mystring)

[diagram: each iteration fills a private string; the private strings are concatenated into the stdout buffer in iteration order]
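A minimal Python sketch of buffered printf (the `buffered_printf` helper and `s` parameter are the slide’s names, used illustratively): each iteration writes its output line into a private slot, and a sequential flush emits them in iteration order, so the parallel loop never races on the output stream.

```python
from concurrent.futures import ThreadPoolExecutor

def buffered_printf(n, s):
    # "scalar expansion" of the output stream: each iteration writes
    # its line into a private slot instead of racing on stdout
    lines = [None] * n
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda i: lines.__setitem__(i, s(i)), range(n)))
    # sequential flush preserves the sequential iteration order
    return "".join(lines)
```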
Sparse array updates

  doall i (1..n)
    private j
    for j in neighbors_of(i)
      private temp = foo(i, j)
      x[i] += temp
      x[j] += temp
Becomes

  doall i (1..n)
    private j
    for j in neighbors_of(i)
      private temp = foo(i, j)
      continue[hash(i)][myproc].push(i, temp)
      continue[hash(j)][myproc].push(j, temp)
  doall p (1..P)
    for t (1..P)
      for (ptr, val) in continue[p][t]
        x[ptr] += val

[diagram: the P×P continuation matrix, rows indexed by destination hash bucket 1..4, columns by producing processor 1..4]
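A Python sketch of the two-phase continuation-matrix transformation (illustrative assumptions: `hash(index)` is approximated by `index % P`, and `myproc` is simulated by `i % P` since the sketch runs single-threaded; `sparse_update` is an invented name). Phase 1 defers the racy `x[·] += temp` updates into per-(bucket, producer) queues; phase 2 drains each bucket, and no two buckets touch the same element of x.

```python
def sparse_update(n, P, neighbors_of, foo):
    """Two-phase hashed scatter-update.

    Phase 1 (doall over i): instead of updating x[i] and x[j] in place
    (a race between iterations), push (index, value) pairs into a
    P x P matrix of queues, bucketed by hash(index).
    Phase 2 (doall over buckets p): drain continue_[p][*]; distinct
    buckets own disjoint sets of indices, so no synchronization needed.
    """
    x = [0.0] * (n + 1)                      # 1-indexed like the slide
    continue_ = [[[] for _ in range(P)] for _ in range(P)]
    for i in range(1, n + 1):                # doall (phase 1)
        myproc = i % P                       # stand-in for the running thread id
        for j in neighbors_of(i):
            temp = foo(i, j)
            continue_[i % P][myproc].append((i, temp))   # hash(i) ~ i % P
            continue_[j % P][myproc].append((j, temp))   # hash(j) ~ j % P
    for p in range(P):                       # doall (phase 2)
        for t in range(P):
            for ptr, val in continue_[p][t]:
                x[ptr] += val
    return x
```

The design trades one pass of extra memory traffic (the queues) for the removal of all atomics/locks on x.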
Graph updates

  doall i (1..n)
    newvalue = value[i]
    for pred in predecessors[i]
      newvalue = f(newvalue, value[pred])
    value[i] = newvalue
Inspector-Executor

Inspector:

  int wavefront[n] = {0}
  do i (1..n)
    wavefront[i] = max(wavefront[i’s predecessors]) + 1

Executor:

  do w (1..maxdepth)
    doall i suchthat wavefront[i] == w
      newvalue = value[i]
      for pred in predecessors[i]
        newvalue = f(newvalue, value[pred])
      value[i] = newvalue

Polychronopoulos ’88
Saltz ’91
Leung/Zahorjan ’93
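The inspector-executor pattern above can be sketched in Python (a sketch under the assumption that a node’s predecessors have smaller indices, so one forward pass computes depths; `inspector_executor` is an invented name). The executor’s inner loop over a wavefront is the doall: nodes at equal depth have no dependences among themselves.

```python
def inspector_executor(n, predecessors, value, f):
    """Wavefront scheduling of a dependence-carrying graph update.

    Inspector: depth of node i = 1 + max depth over its predecessors.
    Executor: process wavefronts in order; within a wavefront every
    node is independent, so that inner loop could run as a doall.
    """
    wavefront = [0] * n
    for i in range(n):                        # inspector (assumes preds < i)
        if predecessors[i]:
            wavefront[i] = 1 + max(wavefront[p] for p in predecessors[i])
    maxdepth = max(wavefront, default=0)
    for w in range(maxdepth + 1):             # sequential over wavefronts
        for i in range(n):                    # doall over this wavefront
            # (a real executor would bucket nodes by depth instead of
            # rescanning; kept linear here for clarity)
            if wavefront[i] == w:
                newvalue = value[i]
                for pred in predecessors[i]:
                    newvalue = f(newvalue, value[pred])
                value[i] = newvalue
    return value
```

The inspector pays off when the access pattern is loop-invariant: its cost is amortized over many executor passes.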
What I’ve shown you
• Scalar expansion
  – Eliminates anti- and output dependences
  – Can be applied to properly scoped aggregates
• Reassociation
  – Integer reassociation is extraordinarily useful
  – Can use partial sums later in the loop!
• Loop distribution
  – Think of it as scheduling
• Inspector-executor
  – Works as long as the data access pattern is invariant in the loop
Where next?
• Relieve tedium
  – (build the compiler, or frameworks, or …)
• Find new patterns
  – Delaunay triangulation
  – Pick an example application: there will be something new you wish could be transformed automatically
• Parallel languages beyond “doall” and “reduce”