november 2, 2006 an hardware inspired model for parallel programming arvind computer science &...

32
November 2, 2006 http:// csg.csail.mit.edu/6.827/ L15-1 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology

Upload: terence-carr

Post on 18-Dec-2015

216 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 http://csg.csail.mit.edu/6.827/ L15-1

An hardware inspired model for parallel programming

Arvind Computer Science & Artificial Intelligence LabMassachusetts Institute of Technology

Page 2: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-2http://csg.csail.mit.edu/6.827/

This subject is about

The foundations of functional languages: the -calculus, types, monads, confluence,

operational semantics, TRS...

General purpose implicit parallel programming in Haskell & pHParallel programming based on atomic actions or transactions in BluespecDataflow model of computation

and understanding connections ...

Bluespec and pH borrow heavily from functional languages but their execution models differ completely from each other

What we said in the first lecture

Page 3: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-3http://csg.csail.mit.edu/6.827/

Multithreaded C

pH: Implicit Parallel Programming

Multithreaded Intermediate Language

pH: parallel Haskell(Types, Higher-order functions,

I-structures, M-structures)

front-end compilation Dataflow and multithreaded

compilation model

code generation

R.S.Nikhil, Arvind &many brilliant students@ MIT mid 80’s to 90’s

SMP’s Clusters

We didn’t discuss compilation much !

Page 4: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-4http://csg.csail.mit.edu/6.827/

Fully Parallel, Multithreaded Model Global Heap of

Shared ObjectsTree ofActivationFrames

h:g:

f:

loop

activethreads

asynchronousat all levels

Efficient mappings on architectures has proved difficult

Synchronization?

Page 5: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-5http://csg.csail.mit.edu/6.827/

Instead of focusing on compilation, we will study

A hardware inspired methodology for “synthesizing” parallel programs Rule-based specification of behavior

(Guarded Atomic Actions) Lets you think one rule at a time

Composition of modules with guarded interfaces

Example: 802.11a transmitterBluespec

Unity – late 80sChandy & Misra

Warning: The ideas are untested in the software domain; you are the trailblazers.

Page 6: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-6http://csg.csail.mit.edu/6.827/

Bluespec: State and Rules organized into modules

All state (e.g., Registers, FIFOs, RAMs, ...) is explicit.Behavior is expressed in terms of atomic actions on the state:

Rule: condition action Rules can manipulate state in other modules only via their interfaces.

interface

module

Page 7: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-7http://csg.csail.mit.edu/6.827/

TermsGCD(x,y), integers

Rewrite rules GCD(x, y) GCD(y, x) if x>y, y0 (R1)

GCD(x, y) GCD(x, y-x) if x y, y0 (R2)

Initial termGCD(initX,initY)

Execution

Programming with rules: Example Euclid’s GCD

GCD(6, 15) GCD(6, 9) GCD(6, 3)

GCD(3, 6) GCD(3, 3) GCD(3, 0)

R2 R2 R1

R2 R2

Page 8: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-8http://csg.csail.mit.edu/6.827/

module mkGCD (I_GCD); Reg#(int) x <- mkRegU; Reg#(int) y <- mkReg(0);

rule swap when ((x>y)&&(y!=0)) ==> x <= y; y <= x; endrule rule subtract when ((x<=y)&&(y!=0))==> y <= y – x; endrule

method Action start(int a, int b) when (y==0) ==>x <= a; y <= b;

endmethod method int result() when (y==0); return x; endmethodendmodule

Internalbehavior

GCD in Bluespec

Externalinterface

State typedef int Int#(32)

Assumes x /= 0 and y /= 0

x y

swap sub

Page 9: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-9http://csg.csail.mit.edu/6.827/

rdyenab

int

intrdy

star

tre

sult

GC

Dm

odule

int

y == 0

y == 0

implicit conditions

interface I_GCD; method Action start (int a, int b); method int result();endinterface

GCD Hardware Modulet

#(type t)

t

t

t tt

In a GCD call t could beInt#(32),UInt#(16),Int#(13), ...

The module can easily be made polymorphic

Many different implementations can provide the same interface: module mkGCD (I_GCD)

Page 10: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-10http://csg.csail.mit.edu/6.827/

Object code(Verilog/C)

Bluespec: Two-Level Compilation

Rules and Actions(Term Rewriting System)

• Rule conflict analysis• Rule scheduling

James Hoe & Arvind@MIT 1997-2000

Bluespec(Objects, Types,

Higher-order functions)

Level 1 compilation• Type checking• Massive partial evaluation and static elaboration

Level 2 synthesis

Lennart Augustsson@Sandburst 2000-2002

Page 11: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-11http://csg.csail.mit.edu/6.827/

Static Elaboration

.exe

compile

design2 design3design1

elaborate w/params

run1run1run2.1…

run1run1run3.1…

run1run1run1.1…

run w/params

run w/params

run2run1 run1run1…

run3

Inline function calls and datatypesInstantiate modules with specific parametersResolve polymorphism/overloading

sourceSoftwareToolflow: source

HardwareToolflow:

Page 12: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 http://csg.csail.mit.edu/6.827/ L15-12

Expressing designs for 802.11a transmitter in Bluespec (BSV)

Page 13: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-13http://csg.csail.mit.edu/6.827/

802.11a Transmitter Overview

Controller Scrambler Encoder

Interleaver Mapper

IFFTCyclicExtend

headers

data

IFFT Transforms 64 (frequency domain) complex numbers into 64 (time domain)

complex numbers accounts for > 95% area

24 Uncoded

bits

One OFDM symbol (64 Complex Numbers)

Must produce one OFDM symbol every 4 sec

Depending upon the transmission rate, consumes 1, 2 or 4 tokens to produce one OFDM symbol

Page 14: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-14http://csg.csail.mit.edu/6.827/

Preliminary results

Design Lines of RelativeBlock Code (BSV) Area

Controller 49 0%Scrambler 40 0%Conv. Encoder 113 0%Interleaver 76 1%Mapper 112 11%IFFT 95 85%Cyc. Extender 23 3%

Complex arithmetic libraries constitute another 200 lines of code

Page 15: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-15http://csg.csail.mit.edu/6.827/

Combinational IFFT

in0

in1

in2

in63

in3

in4

Radix 4

Radix 4

Radix 4

x16

Radix 4

Radix 4

Radix 4

Radix 4

Radix 4

Radix 4

out0

out1

out2

out63

out3

out4

Perm

ute

_1

Perm

ute

_2

Perm

ute

_3

All numbers are complex and represented as two sixteen bit quantities. Fixed-point arithmetic is used to reduce area, power, ...

*

*

*

*

+

-

-

+

+

-

-

+

*jt2

t0

t3

t1

Page 16: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-16http://csg.csail.mit.edu/6.827/

Design Alternative

f g

Reuse a block over multiple cycles

we expect:

Throughput to reduce – less parallelism

Energy/unit work to increase - due to extra HW

Area to decrease – reusing a block

ff g

Page 17: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-17http://csg.csail.mit.edu/6.827/

Combinational IFFTOpportunity for reuse

in0

in1

in2

in63

in3

in4

Radix 4

Radix 4

Radix 4

x16

Radix 4

Radix 4

Radix 4

Radix 4

Radix 4

Radix 4

out0

out1

out2

out63

out3

out4

Perm

ute

_1

Perm

ute

_2

Perm

ute

_3

Reuse the same circuit three times

Page 18: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-18http://csg.csail.mit.edu/6.827/

Circular pipeline: Reusing the Pipeline Stagein0

in1

in2

in63

in3

in4

out0

out1

out2

out63

out3

out4

Radix 4

Radix 4

Perm

ute

_1Perm

ute

_2Perm

ute

_3

Stage Counter16 Radix 4s can be

shared but not the three permutations. Hence the need for muxes

64

, 4-w

ay

Muxes

Page 19: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-19http://csg.csail.mit.edu/6.827/

Superfolded circular pipeline: Just one Radix-4 node!in0

in1

in2

in63

in3

in4

out0

out1

out2

out63

out3

out4

Radix 4

Perm

ute

_1Perm

ute

_2Perm

ute

_3

Stage Counter 0 to 2

Index Counter 0 to 15

64

, 4-w

ay

Muxes

4, 1

6-w

ay

Muxes

4, 1

6-w

ay

DeM

uxes

Page 20: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-20http://csg.csail.mit.edu/6.827/

Which design consumes the least energy to transmit a symbol?

Can we quickly code up all the alternatives? single source with parameters?

Not practical in traditional hardware description languages like Verilog/VHDL

Page 21: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-21http://csg.csail.mit.edu/6.827/

Bluespec code: Radix-4 Nodefunction Vector#(4,Complex) radix4(Vector#(4,Complex) t, Vector#(4,Complex) k);

Vector#(4,Complex) m = newVector(), y = newVector(), z = newVector();

m[0] = k[0] * t[0]; m[1] = k[1] * t[1]; m[2] = k[2] * t[2]; m[3] = k[3] * t[3];

y[0] = m[0] + m[2]; y[1] = m[0] – m[2]; y[2] = m[1] + m[3]; y[3] = i*(m[1] – m[3]);

z[0] = y[0] + y[2]; z[1] = y[1] + y[3]; z[2] = y[0] – y[2]; z[3] = y[1] – y[3];

return(z);endfunction

Polymorphic code: works on any type of numbers for which *, + and - have been defined

*

*

*

*

+

-

-

+

+

-

-

+

*j

Page 22: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-22http://csg.csail.mit.edu/6.827/

Combinational IFFTCan be used as a reference

in0

in1

in2

in63

in3

in4

Radix 4

Radix 4

Radix 4

x16

Radix 4

Radix 4

Radix 4

Radix 4

Radix 4

Radix 4

out0

out1

out2

out63

out3

out4

Perm

ute

_1

Perm

ute

_2

Perm

ute

_3

stage_f function

repeat it three times

Page 23: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-23http://csg.csail.mit.edu/6.827/

Bluespec Code for Combinational IFFTfunction SVector#(64, Complex) ifft

(SVector#(64, Complex) in_data);//Declare vectors SVector#(4,SVector#(64, Complex)) stage_data =

replicate(newSVector); stage_data[0] = in_data; for (Integer stage = 0; stage < 3; stage = stage + 1) stage_data[i+1] = stage_f(stage, stage_data[i]);return(stage_data[3]);

The code is unfolded to generate a combinational circuit

Page 24: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-24http://csg.csail.mit.edu/6.827/

Bluespec Code for stage_ffunction SVector#(64, Complex) stage_f (Bit#(2) stage, SVector#(64, Complex) stage_in); begin for (Integer i = 0; i < 16; i = i + 1) begin Integer idx = i * 4; let twid = getTwiddle(stage, fromInteger(i)); let y = radix4(twid, stage_in[idx:idx+3]); stage_temp[idx] = y[0]; stage_temp[idx+1] = y[1]; stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3]; end //Permutation for (Integer i = 0; i < 64; i = i + 1) stage_out[i] = stage_temp[permute[i]]; endreturn(stage_out);

Stage function

Page 25: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-25http://csg.csail.mit.edu/6.827/

Synchronous pipeline

rule sync-pipeline (True); inQ.deq(); sReg1 <= f1(inQ.first()); sReg2 <= f2(sReg1); outQ.enq(f3(sReg2));endrule

xsReg1inQ

f1 f2 f3

sReg2 outQ

This is real IFFT code; just replace f1, f2 and f3 with stage_f code

Page 26: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-26http://csg.csail.mit.edu/6.827/

What about pipeline bubbles?rule sync-pipeline (True);

Maybe#(data_T) sx, ox;for (Integer i = 1; i < n; i = i + 1) begin //Get stage input if (i == 0)

if (inQ.notEmpty) begin sx = inQ.first();inQ.deq(); end

else sx = Invalid; else sx = sRegs[i-1]; case(sx) matches //Calculate value

tagged Valid .x: ox = f(fromInteger(i),x);tagged Invalid: ox = Invalid;

endcase if (i == n-1) outQ.enq(ox); //Write Outputs else sRegs[i] <= ox; endendrule

typedef union tagged {void Invalid;data_T Valid;

} Maybe#(type data_T);

Page 27: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-27http://csg.csail.mit.edu/6.827/

Folded pipeline

x

sReginQ

rule folded-pipeline (True); if (stage==1) begin inQ.deq(); sxIn= inQ.first(); end else sxIn= sReg; sxOut = f(stage,sxIn); if (stage==3) outQ.enq(sxOut); else sReg <= sxOut; stage <= (stage==3)? 1 : stage+1;endrule

f

outQstage

f1

f2

f3

function f (stage,sx); case (stage) 1: return f1(sx); 2: return f2(sx); 3: return f3(sx); endcaseendfunction

This is real IFFT code too ...

Page 28: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-28http://csg.csail.mit.edu/6.827/

Expressing these designs in Bluespec is easy

All these designs were done in less than one day!Area and power estimates?

Combinational

Pipelined

Folded (16 Radices)

Super-Folded (8 Radices)

Super-Folded (4 Radices)

Super-Folded (2 Radices)

Super-Folded (1 Radix)

Page 29: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-29http://csg.csail.mit.edu/6.827/

802.11a Transmitter Synthesis results

IFFT Design Area (mm2)

Symbol

Latency

(CLKs)

ThroughputLatency

(CLKs/sym)

Min. Freq Required

Average Power(mW)

Pipelined 5.25 12 04 1.0 MHz 4.92

Combinational 4.91 10 04 1.0 MHz 3.99

Folded(16 Radices)

3.97 12 04 1.0 MHz 7.27

Super-Folded(8 Radices)

3.69 15 06 1.5 MHz 10.9

SF(4 Radices) 2.45 21 12 3.0 MHz 14.4

SF(2 Radices) 1.84 33 24 6.0 MHz 21.1

SF (1 Radix) 1.52 57 48 12 MHZ 34.6

Page 30: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-30http://csg.csail.mit.edu/6.827/

Why are the areas so similiar

Folding should have given a 3x improvement in IFFT area

BUT a constant twiddle allows low-level optimization on a radix4 block a 2.5x area reduction!

Page 31: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-31http://csg.csail.mit.edu/6.827/

802.11a Observation

Dataflow network aka Kahn networks

How should this level of concurrency be expressed in a reference code (say in C or systemC?Can we write Specs which work for both hardware and software

Page 32: November 2, 2006 An hardware inspired model for parallel programming Arvind Computer Science & Artificial Intelligence

November 2, 2006 L15-32http://csg.csail.mit.edu/6.827/

Bluespec Tool flowBluespec SystemVerilog source

Verilog 95 RTL

Verilog sim

VCD output

DebussyVisualization

Bluespec Compiler

RTL synthesis

gates

C

Bluesim CycleAccurate

FPGAPower estimatio

n tool

Power estimatio

n tool

Sequence Design PowerTheater