EECS 583 – Class 16 Research Topic 1 Automatic Parallelization University of Michigan November 7, 2012


  • EECS 583 Class 16: Research Topic 1, Automatic Parallelization

    University of Michigan

    November 7, 2012


    Announcements + Reading Material
    Midterm exam: Mon Nov 19 in class (next next Monday)
      I will post 2 practice exams
      We'll talk more about the exam next class
    Today's class reading:
      "Revisiting the Sequential Programming Model for Multi-Core," M. J. Bridges, N. Vachharajani, Y. Zhang, T. Jablin, and D. I. August, Proc. 40th IEEE/ACM International Symposium on Microarchitecture, December 2007.
    Next class reading:
      "Automatic Thread Extraction with Decoupled Software Pipelining," G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proc. 38th IEEE/ACM International Symposium on Microarchitecture, November 2005.


    Research Topics
    1. Automatic Parallelization
    2. Optimizing Streaming Applications for Multicore/GPUs
    3. Automatic SIMDization
    4. Security


    Paper Reviews
    1 per class for the rest of the semester
    Paper review purpose:
      Read the paper before class (up to this point reading has not been a point of emphasis; now it is!)
      Put together a list of non-trivial observations (think about it!)
      Have something interesting to say in class
    Review content, 2 parts:
      1. 3-4 line summary of the paper
      2. Your thoughts/observations; it can be any of the following:
         An interesting observation on some aspect of the paper
         Raise one or more questions that a cursory reader would not think of
         Identify a weakness of the approach and explain why
         Propose an extension to the idea that fills some hole in the work
         Pick a difficult part of the paper, decipher it, and explain it in your own words


    Paper Reviews (continued)
    Review format:
      Plain text only (.txt file); a page is sufficient
      Reviews are due by the start of each lecture
      Copy the file to andrew.eecs.umich.edu:/y/submit
      Name it uniquename_classXX.txt (XX = class number)
    First reading due Monday Nov 12 (pdf on the website):
      "Automatic Thread Extraction with Decoupled Software Pipelining," G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proc. 38th IEEE/ACM International Symposium on Microarchitecture, Nov. 2005.


    Class Problem from Last Time: Answer
    Code (def = definition, use = right-hand side):
      1: y =
      2: x = y
      3: = x
      4: y =
      5: = y
      6: y =
      7: z =
      8: x =
      9: = y
      10: = z
    Tasks: compute the cost matrix, draw the interference graph, then do a 2-coloring of the graph.
    Live ranges:
      LR1(x) = {2,3,4,5,6,7,8,9}
      LR2(y) = {1,2}
      LR3(y) = {4,5,6,7,8,9}
      LR4(z) = {3,4,5,6,7,8,9,10}
    Interference graph edges: 1-2, 1-3, 1-4, 3-4
    Cost matrix:
      node    1     2     3     4
      cost    20    12    21    9
      nbors   3     1     2     2
      c/n     6.7   12    10.5  4.5


    Class Problem Answer (continued)
    1. Remove all nodes with degree < 2: remove node 2.  Stack: 2
    2. Cannot remove any more nodes, so choose node 4 (lowest cost/neighbors) to spill.  Stack: 4 (spill), 2
    3. Remove all nodes with degree < 2: remove nodes 1 and 3.  Stack: 1, 3, 4 (spill), 2
    4. Assign colors: 1 = red, 3 = blue, 4 = spill, 2 = blue
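The simplify/spill procedure walked through above can be sketched in code. The graph edges, spill costs, and k = 2 come from the class problem; the function name and structure are a hypothetical illustration, not the course's reference implementation.

```python
# Chaitin-style simplify/spill graph coloring, sketched for the class problem.
def color_graph(edges, cost, k):
    """Return (coloring, spilled) using the simplify/spill heuristic."""
    adj = {n: set() for e in edges for n in e}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    work = {n: set(s) for n, s in adj.items()}  # mutable copy of the graph
    stack, spilled = [], set()
    while work:
        # Simplify: push any node with degree < k (it is trivially colorable).
        low = [n for n in work if len(work[n]) < k]
        if low:
            n = min(low)
        else:
            # Blocked: pick the cheapest spill candidate by cost / degree.
            n = min(work, key=lambda m: cost[m] / len(work[m]))
            spilled.add(n)
        stack.append(n)
        for m in work.pop(n):
            work[m].discard(n)
    # Select: pop nodes and give each the lowest color unused by its neighbors.
    coloring = {}
    while stack:
        n = stack.pop()
        if n in spilled:
            continue
        used = {coloring[m] for m in adj[n] if m in coloring}
        coloring[n] = min(c for c in range(k) if c not in used)
    return coloring, spilled

edges = [(1, 2), (1, 3), (1, 4), (3, 4)]
cost = {1: 20, 2: 12, 3: 21, 4: 9}
colors, spills = color_graph(edges, cost, 2)
print(colors, spills)  # node 4 spilled; 1 gets one color, 2 and 3 the other
```

This reproduces the steps above: node 2 is simplified first, node 4 is chosen as the spill when no degree < 2 node remains, and nodes 1 and 3 end up in different colors.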


    Moore's Law

    Source: Intel/Wikipedia


    Single-Threaded Performance Not Improving


    What about Parallel Programming? -or- What is Good About the Sequential Model?
    Sequential is easier:
      People think about programs sequentially
      Simpler to write a sequential program
    Deterministic execution:
      Reproducing errors for debugging
      Testing for correctness
    No concurrency bugs:
      Deadlock, livelock, atomicity violations
      Locks are not composable
    Performance extraction:
      Sequential programs are portable (are parallel programs? ask GPU developers)
      Performance debugging of sequential programs is straightforward
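To make the "atomicity violations" point concrete, here is a minimal Python sketch; the `Counter` class and method names are hypothetical, chosen only for illustration. An update like `value += 1` is a read-modify-write, so two interleaving threads can lose updates; a lock restores atomicity.

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def bump_unsafe(self):
        self.value += 1      # read-modify-write: not atomic, updates can be lost

    def bump_locked(self):
        with self.lock:      # critical section makes the update atomic
            self.value += 1

def run(n_threads, n_iters, bump):
    """Run `bump` n_iters times from each of n_threads threads."""
    threads = [threading.Thread(target=lambda: [bump() for _ in range(n_iters)])
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

c = Counter()
run(4, 10000, c.bump_locked)
print(c.value)   # always 40000: the lock serializes the increments
```

The unsafe variant may or may not lose updates on a given run, which is exactly the non-determinism and debugging difficulty the slide is pointing at.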


    Compilers are the Answer? Proebsting's Law:
    "Compiler Advances Double Computing Power Every 18 Years."
    Run your favorite set of benchmarks with your favorite state-of-the-art optimizing compiler. Run the benchmarks both with and without optimizations enabled. The ratio of those numbers represents the entirety of the contribution of compiler optimizations to speeding up those benchmarks. Let's assume that this ratio is about 4x for typical real-world applications, and let's further assume that compiler optimization work has been going on for about 36 years. Therefore, compiler optimization advances double computing power every 18 years. QED.

    Conclusion: compilers are not about performance!

    [Chart: Speedup vs. Years, 1971-2006. One curve grows about 4% per year (compiler optimization, reaching roughly 4x by 2006); the other grows about 60% per year (hardware, reaching roughly 10^7x by 2006).]
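The arithmetic behind Proebsting's Law can be checked directly; the 4x total ratio and the 36-year span are the assumptions stated in the slide above.

```python
import math

total_speedup = 4.0      # optimizations on vs. off (the assumed typical ratio)
years = 36               # assumed span of compiler-optimization work

annual = total_speedup ** (1 / years)            # compound annual improvement
doubling_time = math.log(2) / math.log(annual)   # years to double performance

print(round(annual, 3), round(doubling_time, 1))  # 1.039 18.0
```

Since 4 = 2^2, the doubling time is exactly 36/2 = 18 years: about 4% per year from compilers, versus the roughly 60% per year the hardware curve delivered over the same period.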

  • A Step Back in Time: Old Skool Parallelization
    Are We Doomed?


    Parallelizing Loops In Scientific Applications

    for(i=1; i

  • Back to the Present: Parallelizing C and C++ Programs


    Loop-Level Parallelization
    Loop chunking: the loop's iterations i = 0-39 are split into chunks (i = 0-9, 10-19, 20-29, 30-39) distributed across Thread 0 and Thread 1
    Bad news: limited number of parallel loops in general-purpose applications
    1.3x speedup for SPECint2000 on 4 cores
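The chunking scheme above (40 iterations, chunks of 10, threads taking alternating chunks) can be sketched as follows. The function names are hypothetical, and Python threads illustrate only the division of work, not real parallel speedup.

```python
import threading

def chunked_indices(n_iters, chunk):
    """Split range(n_iters) into contiguous chunks of size `chunk`."""
    return [range(s, min(s + chunk, n_iters)) for s in range(0, n_iters, chunk)]

def parallel_sum_of_squares(n_iters, n_threads, chunk=10):
    chunks = chunked_indices(n_iters, chunk)
    partial = [0] * len(chunks)   # one slot per chunk: no shared-write conflicts

    def work(t):
        # Round-robin assignment: thread t takes chunks t, t + n_threads, ...
        for c in range(t, len(chunks), n_threads):
            partial[c] = sum(i * i for i in chunks[c])

    threads = [threading.Thread(target=work, args=(t,)) for t in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return sum(partial)

print(parallel_sum_of_squares(40, 2))  # 20540, same as sum(i*i for i in range(40))
```

With 40 iterations, chunks of 10, and 2 threads, the round-robin assignment matches the slide: Thread 0 runs i = 0-9 and 20-29, Thread 1 runs i = 10-19 and 30-39. This is only legal when the chunks are independent, which is exactly the DOALL condition the next slide measures.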


    DOALL Loop Coverage

    [Chart: fraction of sequential execution spent in DOALL loops, per benchmark. Benchmarks: SPEC FP (052.alvinn, 056.ear, 171.swim, 172.mgrid, 177.mesa, 179.art, 183.equake, 188.ammp), SPEC INT (008.espresso, 023.eqntott, 026.compress, 072.sc, 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 164.gzip, 175.vpr, 181.mcf, 197.parser, 254.gap, 255.vortex, 256.bzip2, 300.twolf), MediaBench (cjpeg, djpeg, epic, unepic, g721decode, g721encode, gsmdecode, gsmencode, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawcaudio, rawdaudio), and utilities (fir, grep, lex, wc, yacc). Bars are broken down by dependence category: provable, no dependence, no memory dependence, few memory dependences. Axes: benchmarks vs. fraction of sequential execution.]