-
EECS 583 Class 16
Research Topic 1: Automatic Parallelization
University of Michigan
November 7, 2012
- * -
Announcements + Reading Material
Midterm exam: Mon Nov 19 in class (next next Monday)
  I will post 2 practice exams
  We'll talk more about the exam next class
Today's class reading:
  "Revisiting the Sequential Programming Model for Multi-Core," M. J. Bridges, N. Vachharajani, Y. Zhang, T. Jablin, and D. I. August, Proc. 40th IEEE/ACM International Symposium on Microarchitecture, December 2007.
Next class reading:
  "Automatic Thread Extraction with Decoupled Software Pipelining," G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proc. 38th IEEE/ACM International Symposium on Microarchitecture, November 2005.
- * -
Research Topics
1. Automatic Parallelization
2. Optimizing Streaming Applications for Multicore/GPUs
3. Automatic SIMDization
4. Security
- * -
Paper Reviews
1 per class for the rest of the semester
Paper review purpose:
  Read the paper before class (up to this point reading has not been a point of emphasis; now it is!)
  Put together a list of non-trivial observations (think about it!)
  Have something interesting to say in class
Review content (2 parts):
  1. 3-4 line summary of the paper
  2. Your thoughts/observations; any of the following:
     An interesting observation on some aspect of the paper
     One or more questions that a cursory reader would not think of
     A weakness of the approach, and why it is a weakness
     An extension to the idea that fills some hole in the work
     A difficult part of the paper, deciphered and explained in your own words
- * -
Paper Reviews (continued)
Review format:
  Plain text only (.txt file); a page is sufficient
  Reviews are due by the start of each lecture
  Copy the file to andrew.eecs.umich.edu:/y/submit
  Name it uniquename_classXX.txt (XX = class number)
First reading due Monday Nov 12 (pdf on the website):
  "Automatic Thread Extraction with Decoupled Software Pipelining," G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proc. 38th IEEE/ACM International Symposium on Microarchitecture, November 2005.
- * -
Class Problem from Last Time: Answer
[Slide figure: a 10-statement code example with profile weights, plus the tasks: compute the cost matrix, draw the interference graph, and do a 2-coloring of the graph. The cost matrix values did not survive the transcript; the cost/neighbors ratios drive the spill choice below.]
Live ranges:
  LR1(x) = {2,3,4,5,6,7,8,9}
  LR2(y) = {1,2}
  LR3(y) = {4,5,6,7,8,9}
  LR4(z) = {3,4,5,6,7,8,9,10}
- * -
Class Problem Answer (continued)
1. Remove all nodes with degree < 2: remove node 2 (push it on the stack).
2. No remaining node has degree < 2, so choose node 4 to spill (push it on the stack).
3. Remove all nodes with degree < 2: remove nodes 1 and 3. Stack, from the top: 1, 3, 4 (spill), 2.
4. Pop the stack and assign colors: 1 = red, 3 = blue, 4 = spill, 2 = blue.
- * -
Moore's Law
Source: Intel/Wikipedia
- * -
Single-Threaded Performance Not Improving
- * -
What about Parallel Programming? Or: What is Good About the Sequential Model?
Sequential is easier
  People think about programs sequentially
  Simpler to write a sequential program
Deterministic execution
  Reproducing errors for debugging
  Testing for correctness
No concurrency bugs
  No deadlock, livelock, or atomicity violations
  Locks are not composable
Performance extraction
  Sequential programs are portable
    Are parallel programs? Ask GPU developers
  Performance debugging of sequential programs is straightforward
- * -
Compilers are the Answer? Proebsting's Law
"Compiler Advances Double Computing Power Every 18 Years"
Run your favorite set of benchmarks with your favorite state-of-the-art optimizing compiler. Run the benchmarks both with and without optimizations enabled. The ratio of those numbers represents the entirety of the contribution of compiler optimizations to speeding up those benchmarks. Let's assume that this ratio is about 4X for typical real-world applications, and let's further assume that compiler optimization work has been going on for about 36 years. Therefore, compiler optimization advances double computing power every 18 years. QED.
Conclusion: compilers are not about performance!
[Chart: projected speedup vs. years, 1971-2006. Compiler optimization compounding at 4% per year reaches about 3.9x after 35 years, while hardware performance compounding at 60% per year reaches about 14,000,000x over the same span. Axes: Years (x) vs. Speedup (y).]
-
A Step Back in Time: Old Skool Parallelization
Are We Doomed?
- * -
Parallelizing Loops In Scientific Applications
for (i = 1; i < ...)  /* remainder of the loop lost in the transcript */
-
Back to the Present: Parallelizing C and C++ Programs
- * -
Loop Level Parallelization
[Slide figure: a loop over i = 0-39 is split into chunks of 10 iterations; Thread 0 executes i = 0-9 and i = 20-29, Thread 1 executes i = 10-19 and i = 30-39.]
Bad news: limited number of parallel loops in general-purpose applications
  1.3x speedup for SpecINT2000 on 4 cores
- * -
DOALL Loop Coverage
[Chart: for each benchmark, the fraction of sequential execution spent in provable DOALL loops (no memory dependences). Benchmarks span SPEC FP (052.alvinn, 056.ear, 171.swim, 172.mgrid, 177.mesa, 179.art, 183.equake, 188.ammp), SPEC INT (008.espresso, 023.eqntott, 026.compress, 072.sc, 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 164.gzip, 175.vpr, 181.mcf, 197.parser, 254.gap, 255.vortex, 256.bzip2, 300.twolf), MediaBench (cjpeg, djpeg, epic, g721decode, g721encode, gsmdecode, gsmencode, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawcaudio, rawdaudio, unepic), and utilities (fir, grep, lex, wc, yacc). Coverage ranges from 0% up to roughly 73% but is near zero for most general-purpose codes. Legend: no mem dep / few mem dep (remainder of legend lost in the transcript).]