EECS 583 – Class 16 Research Topic 1 Automatic Parallelization University of Michigan November 7, 2012


  • EECS 583 Class 16: Research Topic 1, Automatic Parallelization

    University of Michigan

    November 7, 2012


    Announcements + Reading Material
    Midterm exam: Mon Nov 19 in class (next next Monday)
      I will post 2 practice exams
      We'll talk more about the exam next class
    Today's class reading:
      "Revisiting the Sequential Programming Model for Multi-Core," M. J. Bridges, N. Vachharajani, Y. Zhang, T. Jablin, and D. I. August, Proc. 40th IEEE/ACM International Symposium on Microarchitecture, December 2007.
    Next class reading:
      "Automatic Thread Extraction with Decoupled Software Pipelining," G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proc. 38th IEEE/ACM International Symposium on Microarchitecture, November 2005.


    Research Topics
    1. Automatic Parallelization
    2. Optimizing Streaming Applications for Multicore/GPUs
    3. Automatic SIMDization
    4. Security


    Paper Reviews
    1 per class for the rest of the semester
    Paper review purpose:
      Read the paper before class (up to this point reading has not been a point of emphasis; now it is!)
      Put together a list of non-trivial observations (think about it!)
      Have something interesting to say in class
    Review content, 2 parts:
      1. 3-4 line summary of the paper
      2. Your thoughts/observations; it can be any of the following:
         An interesting observation on some aspect of the paper
         Raise one or more questions that a cursory reader would not think of
         Identify a weakness of the approach and explain why
         Propose an extension to the idea that fills some hole in the work
         Pick a difficult part of the paper, decipher it, and explain it in your own words


    Paper Reviews (continued)
    Review format:
      Plain text only (.txt file); a page is sufficient
      Reviews are due by the start of each lecture
      Copy the file to andrew.eecs.umich.edu:/y/submit
      Name it uniquename_classXX.txt (XX = class number)
    First reading due Monday Nov 12 (pdf on the website):
      "Automatic Thread Extraction with Decoupled Software Pipelining," G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proc. 38th IEEE/ACM International Symposium on Microarchitecture, Nov. 2005.


    Class Problem from Last Time: Answer
    Code (def = definition, use = right-hand side):
      1: y =
      2: x = y
      3: = x
      4: y =
      5: = y
      6: y =
      7: z =
      8: x =
      9: = y
      10: = z
    Tasks: compute the cost matrix, draw the interference graph, then do a 2-coloring of the graph.
    Live ranges:
      LR1(x) = {2,3,4,5,6,7,8,9}
      LR2(y) = {1,2}
      LR3(y) = {4,5,6,7,8,9}
      LR4(z) = {3,4,5,6,7,8,9,10}
    Interference graph edges: 1-2, 1-3, 1-4, 3-4
    Cost matrix:
      node    1     2     3     4
      cost    20    12    21    9
      nbors   3     1     2     2
      c/n     6.7   12    10.5  4.5


    Class Problem Answer (continued)
    1. Remove all nodes with degree < 2: remove node 2.  Stack: 2
    2. Cannot remove any more nodes, so choose node 4 (lowest cost/neighbors) to spill.  Stack: 4 (spill), 2
    3. Remove all nodes with degree < 2: remove nodes 1 and 3.  Stack: 1, 3, 4 (spill), 2
    4. Assign colors: 1 = red, 3 = blue, 4 = spill, 2 = blue
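The simplify/spill procedure walked through above can be sketched in code. The graph edges, spill costs, and k = 2 come from the class problem; the function name and structure are a hypothetical illustration, not the course's reference implementation.

```python
# Chaitin-style simplify/spill graph coloring, sketched for the class problem.
def color_graph(edges, cost, k):
    """Return (coloring, spilled) using the simplify/spill heuristic."""
    adj = {n: set() for e in edges for n in e}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    work = {n: set(s) for n, s in adj.items()}  # mutable copy of the graph
    stack, spilled = [], set()
    while work:
        # Simplify: push any node with degree < k (it is trivially colorable).
        low = [n for n in work if len(work[n]) < k]
        if low:
            n = min(low)
        else:
            # Blocked: pick the cheapest spill candidate by cost / degree.
            n = min(work, key=lambda m: cost[m] / len(work[m]))
            spilled.add(n)
        stack.append(n)
        for m in work.pop(n):
            work[m].discard(n)
    # Select: pop nodes and give each the lowest color unused by its neighbors.
    coloring = {}
    while stack:
        n = stack.pop()
        if n in spilled:
            continue
        used = {coloring[m] for m in adj[n] if m in coloring}
        coloring[n] = min(c for c in range(k) if c not in used)
    return coloring, spilled

edges = [(1, 2), (1, 3), (1, 4), (3, 4)]
cost = {1: 20, 2: 12, 3: 21, 4: 9}
colors, spills = color_graph(edges, cost, 2)
print(colors, spills)  # node 4 spilled; 1 gets one color, 2 and 3 the other
```

This reproduces the steps above: node 2 is simplified first, node 4 is chosen as the spill when no degree < 2 node remains, and nodes 1 and 3 end up in different colors.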


    Moore's Law

    Source: Intel/Wikipedia


    Single-Threaded Performance Not Improving


    What about Parallel Programming? -or- What is Good About the Sequential Model?
    Sequential is easier:
      People think about programs sequentially
      Simpler to write a sequential program
    Deterministic execution:
      Reproducing errors for debugging
      Testing for correctness
    No concurrency bugs:
      Deadlock, livelock, atomicity violations
      Locks are not composable
    Performance extraction:
      Sequential programs are portable (are parallel programs? ask GPU developers)
      Performance debugging of sequential programs is straightforward
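To make the "atomicity violations" point concrete, here is a minimal Python sketch; the `Counter` class and method names are hypothetical, chosen only for illustration. An update like `value += 1` is a read-modify-write, so two interleaving threads can lose updates; a lock restores atomicity.

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def bump_unsafe(self):
        self.value += 1      # read-modify-write: not atomic, updates can be lost

    def bump_locked(self):
        with self.lock:      # critical section makes the update atomic
            self.value += 1

def run(n_threads, n_iters, bump):
    """Run `bump` n_iters times from each of n_threads threads."""
    threads = [threading.Thread(target=lambda: [bump() for _ in range(n_iters)])
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

c = Counter()
run(4, 10000, c.bump_locked)
print(c.value)   # always 40000: the lock serializes the increments
```

The unsafe variant may or may not lose updates on a given run, which is exactly the non-determinism and debugging difficulty the slide is pointing at.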


    Compilers are the Answer? Proebsting's Law:
    "Compiler Advances Double Computing Power Every 18 Years."
    Run your favorite set of benchmarks with your favorite state-of-the-art optimizing compiler. Run the benchmarks both with and without optimizations enabled. The ratio of those numbers represents the entirety of the contribution of compiler optimizations to speeding up those benchmarks. Let's assume that this ratio is about 4x for typical real-world applications, and let's further assume that compiler optimization work has been going on for about 36 years. Therefore, compiler optimization advances double computing power every 18 years. QED.

    Conclusion: compilers are not about performance!

    [Chart: Speedup vs. Years, 1971-2006. One curve grows about 4% per year (compiler optimization, reaching roughly 4x by 2006); the other grows about 60% per year (hardware, reaching roughly 10^7x by 2006).]
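The arithmetic behind Proebsting's Law can be checked directly; the 4x total ratio and the 36-year span are the assumptions stated in the slide above.

```python
import math

total_speedup = 4.0      # optimizations on vs. off (the assumed typical ratio)
years = 36               # assumed span of compiler-optimization work

annual = total_speedup ** (1 / years)            # compound annual improvement
doubling_time = math.log(2) / math.log(annual)   # years to double performance

print(round(annual, 3), round(doubling_time, 1))  # 1.039 18.0
```

Since 4 = 2^2, the doubling time is exactly 36/2 = 18 years: about 4% per year from compilers, versus the roughly 60% per year the hardware curve delivered over the same period.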

  • A Step Back in Time: Old Skool Parallelization
    Are We Doomed?


    Parallelizing Loops In Scientific Applications

    for(i=1; i

  • Back to the Present: Parallelizing C and C++ Programs


    Loop-Level Parallelization
    Loop chunking: the loop's iterations i = 0-39 are split into chunks (i = 0-9, 10-19, 20-29, 30-39) distributed across Thread 0 and Thread 1
    Bad news: limited number of parallel loops in general-purpose applications
    1.3x speedup for SPECint2000 on 4 cores
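The chunking scheme above (40 iterations, chunks of 10, threads taking alternating chunks) can be sketched as follows. The function names are hypothetical, and Python threads illustrate only the division of work, not real parallel speedup.

```python
import threading

def chunked_indices(n_iters, chunk):
    """Split range(n_iters) into contiguous chunks of size `chunk`."""
    return [range(s, min(s + chunk, n_iters)) for s in range(0, n_iters, chunk)]

def parallel_sum_of_squares(n_iters, n_threads, chunk=10):
    chunks = chunked_indices(n_iters, chunk)
    partial = [0] * len(chunks)   # one slot per chunk: no shared-write conflicts

    def work(t):
        # Round-robin assignment: thread t takes chunks t, t + n_threads, ...
        for c in range(t, len(chunks), n_threads):
            partial[c] = sum(i * i for i in chunks[c])

    threads = [threading.Thread(target=work, args=(t,)) for t in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return sum(partial)

print(parallel_sum_of_squares(40, 2))  # 20540, same as sum(i*i for i in range(40))
```

With 40 iterations, chunks of 10, and 2 threads, the round-robin assignment matches the slide: Thread 0 runs i = 0-9 and 20-29, Thread 1 runs i = 10-19 and 30-39. This is only legal when the chunks are independent, which is exactly the DOALL condition the next slide measures.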


    DOALL Loop Coverage

    [Chart: fraction of sequential execution spent in DOALL loops, per benchmark. Benchmarks: SPEC FP (052.alvinn, 056.ear, 171.swim, 172.mgrid, 177.mesa, 179.art, 183.equake, 188.ammp), SPEC INT (008.espresso, 023.eqntott, 026.compress, 072.sc, 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 164.gzip, 175.vpr, 181.mcf, 197.parser, 254.gap, 255.vortex, 256.bzip2, 300.twolf), MediaBench (cjpeg, djpeg, epic, unepic, g721decode, g721encode, gsmdecode, gsmencode, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawcaudio, rawdaudio), and utilities (fir, grep, lex, wc, yacc). Bars are broken down by dependence category: provable, no dependence, no memory dependence, few memory dependences. Axes: benchmarks vs. fraction of sequential execution.]