A Case for Teaching Parallel Programming to Freshmen
Arvind, Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology. Workshop on Directions in Multicore Programming Education, Washington D.C., March 8, 2009.
1
A Case for Teaching Parallel Programming to Freshmen
Arvind
Computer Science & Artificial Intelligence Lab.
Massachusetts Institute of Technology
Workshop on Directions in Multicore Programming Education, Washington D.C. March 8, 2009
2
One view of parallel programming
Multicores are coming (have come)
 - Performance gains no longer automatic and transparent
Most programmers have never written a parallel program
Different models for exploiting parallelism, depending upon the application
 - Data parallel, Threads, TM, Map-Reduce, …
How to migrate my software
How to get performance
How to educate my programmers
It is all about performance
3
Another view of parallel programming
Every gadget is concurrent and reactive
 - Many weakly interrelated tasks happening concurrently: cell phones playing music, receiving calls, web browsing
 - Hitherto independent programs are required to interact: what should the music player do when you are browsing the web?
 - Ambiguous specs: not clear a priori what a user wants
Infrastructure is a parallel database for processing queries and commands
 - Scalable infrastructure to deal with ever-increasing queries
 - The database is more than just records: many streams of data constantly being fed in
 - Each interaction requires many queries and transactions
Parallelism is obvious, but interactions between modules can be complex even when infrequent
Even though the substrate is multicore, performance is a secondary issue
4
My take
Modeling, simulating, and programming parallel and concurrent systems is a more fundamental problem than how to make use of multicores efficiently
Freshman teaching should focus on composing parallel programs; sequential programming should be taught (perhaps) as a way of writing the modules to be composed
Within a few years multicores will be viewed as a transparent way of simplifying and speeding up parallel programs (not very different from the way we used to view computers with faster clocks)
5
The remainder of the talk
Parallel programming can be simpler than sequential programming for inherently parallel computations
Some untested ideas on what we should teach freshmen
6
Parallel programming can be easier than sequential programming
7
H.264 Video Decoder
May be implemented in hardware or software depending upon ...
[Figure: block diagram of the decoder, taking compressed bits in and producing frames. Blocks: NAL unwrap, Parse + CAVLC, Inverse Quant Transformation, Intra Prediction, Inter Prediction, Deblock Filter, Ref Frames.]
Different requirements for different environments
 - QVGA 320x240p (30 fps)
 - DVD 720x480p
 - HD DVD 1280x720p (60-75 fps)
8
Sequential code from ffmpeg
    void h264decode() {
        int stage = S_NAL;
        while (!eof()) {
            createdOutput = 0;
            stallFromInterPred = 0;
            /* Run one stage, then choose the next stage from what it produced. */
            switch (stage) {
            case S_NAL:
                try_NAL();
                stage = (createdOutput) ? S_Parse : S_NAL;
                break;
            case S_Parse:
                try_Parse();
                stage = (createdOutput) ? S_IQIT : S_NAL;
                break;
            case S_IQIT:
                try_IQIT();
                stage = (createdOutput) ? S_Parse : S_Inter;
                break;
            case S_Inter:
                try_Inter();
                stage = (createdOutput) ? S_IQIT : S_Intra;
                /* As written on the slide, this assignment overrides the previous one. */
                stage = (stallFromInterPred) ? S_Deblock : S_Intra;
                break;
            case S_Intra:
                try_Intra();
                stage = (createdOutput) ? S_Inter : S_Deblock;
                break;
            case S_Deblock:
                try_deblock();
                stage = S_Intra;
                break;
            }
        }
    }
[Figure: the stages NAL, Parse, IQ/IT, Inter-Predict, Intra-Predict and Deblocking.]
20K lines of C out of 200K
The programmer is forced to choose a sequential order of evaluation and write the code accordingly (non-trivial)
9
Price of obscuring the parallelism
Program structure is difficult to understand
Packets are kept and modified in a global heap (nothing to do with the logical structure)
Unscrambling the over-specified control structure for parallelization is beyond the capability of current compiler techniques
Thread-level data parallelism?
10
Pthreads
A (p)thread for each block
But there is no control over mapping

    int main() {
        /* Slide shorthand: the real call is pthread_create(&tid, NULL, fn, arg). */
        pthread_create(NAL);
        pthread_create(Parse);
        pthread_create(IQIT);
        pthread_create(Interpred);
        pthread_create(Intrapred);
        pthread_create(Deblock);
    }

[Figure: processors and the NAL, Parse, IQ/IT, Interpredict, Intrapr and DeBlk threads, with sleeping threads marked.]
This is an implementation model
11
StreamIt (Amarasinghe & Thies)
A more natural expression using filters

    bit -> frame pipeline H264Decode {
        add NAL();
        add Parse();
        add IQIT();
        add feedbackloop {
            join roundrobin;
            body pipeline {
                add InterPredict();
                add IntraPredict();
                add Deblock();
            }
            split roundrobin;
        }
    }

[Figure: the stages NAL, Parse, IQ/IT, Inter-Predict, Intra-Predict and Deblocking.]
Given the required rates, the StreamIt compiler can do a great job of generating efficient code
Feedback is Problematic!
12
Functional languages (pH)
Natural expression of parallelism but too general

    do_H264 :: Stream Chunk -> Stream Frame
    do_H264 inputStream = let
        fMem :: IStructFrameMem MacroBlock
        fMem = makeIStructureMemory
        nalStream = nal inputStream
        parseStream = parse nalStream
        iqitStream = iqit parseStream
        interStream = inter iqitStream fMem
        intraStream = intra interStream
        deblockStream = deblock intraStream fMem
      in deblockStream

The language does not provide any hints about the level of granularity at which the parallelism should be considered by either the programmer or the compiler
FLs provide a solid base for building domain-specific parallel languages
13
An Idea we are testing: Hardware-design inspired parallel programming
14
Hardware-design inspiration
Hardware is all about parallelism, but there is no virtualization of resources
 - If one asks for two adders then one gets two adders; if one needs to do more than two additions at a time, the adders are time-multiplexed explicitly
Two-level compilation model
 - One can do a design with n adders, but at some stage of compilation n must be specified (instantiated) to generate hardware. Each instantiation of n results in a different design
 - Analogy: in software one may want to instantiate different code for different problem sizes or different machine configurations.
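The adder point can be made concrete with a small sketch. This is an illustration in Python, not something from the talk: `reduce_with_adders` is a hypothetical helper that sums a list with a fixed number of adders, time-multiplexing them explicitly and counting the cycles used. The argument `n_adders` plays the role of the parameter n that has to be fixed at instantiation.

```python
def reduce_with_adders(values, n_adders):
    """Sum `values` using at most `n_adders` additions per cycle.

    Returns (total, cycles). Hypothetical helper for illustration only.
    """
    vals = list(values)
    if n_adders < 1 or not vals:
        raise ValueError("need at least one adder and one value")
    cycles = 0
    while len(vals) > 1:
        # Each adder consumes two operands; no adder is shared within a cycle.
        k = min(n_adders, len(vals) // 2)
        sums = [vals[2 * i] + vals[2 * i + 1] for i in range(k)]
        # Operands that found no free adder wait for the next cycle.
        vals = sums + vals[2 * k:]
        cycles += 1
    return vals[0], cycles


for n in (1, 2, 4):
    print(n, reduce_with_adders(range(1, 9), n))
```

With eight inputs, one adder needs 7 cycles, two adders need 4, and four adders need 3. The source is the same in all three cases, but each choice of n yields a different schedule, which is the software analogue of a different design.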
15
H.264 in Bluespec

    module mkH264 (IH264);
        // Instantiate the modules
        Nal nal <- mkNalUnwrap();
        ...
        DeblockFilter deblock <- mkDeblockFilter();
        FrameMemory frameB <- mkFrameMemoryBuffer();

        // Connect the modules
        mkConnection(nal.out, parse.in);
        mkConnection(parse.out, iqit.in);
        …
        mkConnection(deblock.mem_client, frameB.mem_writer);
        mkConnection(inter_pred.mem_client, frameB.mem_reader);

        interface in = nal.in;        // Input goes straight to NAL
        interface out = deblock.out;  // Output from deblock
    endmodule

Modularity and dataflow are obvious
No sharing of resources
No time multiplexing issue if each module is mapped on a separate core
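The "instantiate the modules, then connect them" structure can be imitated in software. The following is a rough Python sketch under stated assumptions, not a model of Bluespec semantics: `Fifo`, `Stage`, `mk_connection` and `run` are hypothetical names, each stage owns its own FIFOs (no shared resources), and a simple loop fires every rule that is able to fire until nothing can move.

```python
from collections import deque


class Fifo:
    """Bounded FIFO; the depth is fixed when the FIFO is instantiated."""

    def __init__(self, depth=2):
        self.items = deque()
        self.depth = depth

    def can_enq(self):
        return len(self.items) < self.depth

    def can_deq(self):
        return len(self.items) > 0

    def enq(self, x):
        self.items.append(x)

    def deq(self):
        return self.items.popleft()


class Stage:
    """A module with private input and output FIFOs and one rule."""

    def __init__(self, fn):
        self.inp = Fifo()
        self.out = Fifo()
        self.fn = fn

    def rule(self):
        # Fire only when there is input to consume and room for the result.
        if self.inp.can_deq() and self.out.can_enq():
            self.out.enq(self.fn(self.inp.deq()))
            return True
        return False


def mk_connection(src, dst):
    """Return a rule that moves one item from src to dst when possible."""
    def rule():
        if src.can_deq() and dst.can_enq():
            dst.enq(src.deq())
            return True
        return False
    return rule


def run(stages, inputs):
    """Connect neighbouring stages, then fire rules until nothing can move."""
    rules = [s.rule for s in stages]
    rules += [mk_connection(a.out, b.inp) for a, b in zip(stages, stages[1:])]
    pending, results = deque(inputs), []
    progress = True
    while progress:
        progress = False
        if pending and stages[0].inp.can_enq():
            stages[0].inp.enq(pending.popleft())
            progress = True
        for rule in rules:
            progress = rule() or progress
        while stages[-1].out.can_deq():
            results.append(stages[-1].out.deq())
            progress = True
    return results


# Three stand-in stages; the real decoder blocks are not modeled here.
pipeline = [Stage(str.strip), Stage(int), Stage(lambda v: v * 2)]
print(run(pipeline, [" 1 ", " 2", "3 "]))  # prints [2, 4, 6]
```

Because each stage touches only its own FIFOs, the rules could in principle run on separate cores without changing the result; the single-threaded loop here is only a convenient way to show the composition.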
16
H.264 Decoder in Bluespec
Elliott Fleming, Chun Chieh Lin
[Figure: block diagram of the decoder, taking compressed bits in and producing frames. Blocks: NAL unwrap, Parse + CAVLC, Inverse Quant Transformation, Intra Prediction, Inter Prediction, Deblock Filter, Ref Frames.]
Behaviors of modules are composable
Each module can be refined separately
Any module can be compiled in SW
Are there ideas worth carrying over to Parallel SW?
8K lines of Bluespec
Decodes 1080p@70fps
Area 4.4 mm sq (180nm)
17
What should we teach freshmen?
18
General guidelines
Make it easy to express the parallelism present in the application
 - no unnecessary sequentialization
 - no forced grouping of logically separate memories
Separate and deemphasize the issue of restructuring code for better sequential performance
19
Topics
Finite state machines
 - choose problems that have a natural solution as an FSM
 - show composition and interaction of parallel FSMs
Dataflow networks with unbounded and bounded edges
 - show programming of nodes in a sequential language with blocking sends and receives
Types, modularity, data structures, etc. are important topics but orthogonal to parallelism; these topics should be taught all the time
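As a rough illustration of the first two topics, here is a minimal Python sketch of a dataflow network with bounded edges whose nodes are ordinary sequential code using blocking sends and receives. Everything in it is an assumption made for the example: the node names, the `DONE` end-of-stream marker, and the choice of `queue.Queue` for edges. The middle node is a two-state FSM that pairs up tokens and emits their sum.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (an assumption of this sketch)


def source(out, items):
    for x in items:
        out.put(x)       # blocks while the bounded edge is full
    out.put(DONE)


def pair_sum(inp, out):
    """Two-state FSM: EMPTY holds nothing, HOLDING holds one token."""
    state, held = "EMPTY", None
    while True:
        x = inp.get()    # blocks while the edge is empty
        if x is DONE:
            out.put(DONE)  # an unpaired trailing token is dropped
            return
        if state == "EMPTY":
            state, held = "HOLDING", x
        else:
            out.put(held + x)
            state, held = "EMPTY", None


def sink(inp, results):
    while True:
        x = inp.get()
        if x is DONE:
            return
        results.append(x)


def run_network(items, capacity=2):
    a = queue.Queue(maxsize=capacity)  # bounded edge: source -> pair_sum
    b = queue.Queue(maxsize=capacity)  # bounded edge: pair_sum -> sink
    results = []
    nodes = [
        threading.Thread(target=source, args=(a, items)),
        threading.Thread(target=pair_sum, args=(a, b)),
        threading.Thread(target=sink, args=(b, results)),
    ]
    for t in nodes:
        t.start()
    for t in nodes:
        t.join()
    return results


print(run_network([1, 2, 3, 4, 5, 6]))  # prints [3, 7, 11]
```

Each node reads from one edge and writes to one edge, so the output does not depend on how the threads are scheduled or on the edge capacity; only the amount of buffering changes.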
20
Some challenges
No appropriate language or tools
Need to think up new illustrative problems from the ground up
 - Fibonacci, “Hello world”, matrix multiply won’t do
21
Takeaway
Parallel programming is not a special topic in programming
 - Parallel programming is programming
Sequential and parallel programming can be introduced together
 - Parallel thinking is as natural as sequential thinking
Thanks
22
Zero cost parameterization
Example: OFDM based protocols
[Figure: OFDM transceiver block diagram with blocks marked as standard specific or potential reuse. Blocks: MAC, TX Controller, Scrambler, FEC Encoder, Interleaver, Mapper, Pilot & Guard Insertion, IFFT, CP Insertion, D/A, A/D, S/P, Synchronizer, FFT, Channel Estimator, De-Mapper, De-Interleaver, FEC Decoder, De-Scrambler, RX Controller.]
Different algorithms
Different throughput requirements
Reusable algorithm with different parameter settings
WiFi: 64pt @ 0.25MHz
WiMAX: 256pt @ 0.03MHz
WUSB: 128pt @ 8MHz
85% reusable code between WiFi and WiMAX
From WiFi to WiMAX in 4 weeks
(Alfred) Man Chuek Ng, …
WiFi: x^7 + x^4 + 1
WiMAX: x^15 + x^14 + 1
WUSB: x^15 + x^14 + 1
Convolutional
Reed-Solomon
Turbo
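The idea of one reusable algorithm with different parameter settings can be sketched with the scrambler polynomials listed on this slide. The Python below is an illustration only, not the implementation the slide refers to: it assumes an additive scrambler (data XORed with an LFSR keystream), and `lfsr_keystream`, `scramble` and the all-ones seeds are hypothetical choices made for the example.

```python
def lfsr_keystream(taps, seed, n):
    """Generate n bits from an LFSR whose taps are the nonzero exponents
    of the polynomial, e.g. (7, 4) for x^7 + x^4 + 1."""
    state = list(seed)  # state[k - 1] is the bit delayed by k steps
    if len(state) != max(taps) or not any(state):
        raise ValueError("seed must have max(taps) bits and not be all zero")
    out = []
    for _ in range(n):
        fb = 0
        for t in taps:
            fb ^= state[t - 1]
        out.append(fb)
        state = [fb] + state[:-1]  # shift the new bit in
    return out


def scramble(bits, taps, seed):
    """XOR the data with the keystream; applying it twice restores the data."""
    keystream = lfsr_keystream(taps, seed, len(bits))
    return [b ^ k for b, k in zip(bits, keystream)]


# Polynomials as listed on the slide; the seeds are arbitrary placeholders.
WIFI_TAPS = (7, 4)     # x^7 + x^4 + 1
WIMAX_TAPS = (15, 14)  # x^15 + x^14 + 1

data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
for taps in (WIFI_TAPS, WIMAX_TAPS):
    seed = [1] * max(taps)
    assert scramble(scramble(data, taps, seed), taps, seed) == data
```

Only the tap tuple and the seed length differ between the two settings; the scrambling code itself is shared, which is the kind of reuse the slide describes.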