greenla:green linear algebra software for gpu-accelerated...

GreenLA: GreenLinearAlgebraSoftwareforGPU-AcceleratedHeterogeneousComputing

JieyangChen*§,LiTan*†§,PanruoWu*,Dingwen Tao*,HongboLi*,Xin

Liang*,Sihuan Li*,Rong Ge‡,Laxmi Bhuyan*,andZizhong Chen*

*UniversityofCalifornia, Riverside‡Clemson University

†UltraScaleSystemsResearchCenter,LosAlamosNationalLaboratory§Authorscontributedequally

Outline

• Introductionandourresearchmotivation• Ournewalgorithmicslackprediction• GreenLA Library• ExperimentalEvaluation• Conclusion

Introduction• Large-scaleHPCsystem

• Highpowerconsumption

Energy Consumption of Top5 Supercomputers(June, 2016 - TOP500 List)

SummaryofOurContributions• Mostoftheexistingenergysavingapproachesfocusonthesystem/OSlevelorarchitecturelevel,withoutconsideringapplicationcharacteristic.

• Inthiswork,weproposedanenergysavingtechniquethatcombinebotharchitecture andspecificapplication knowledgetoachievebetterenergyefficiency.

• WedesignedGreenLA:GreenLinearAlgebraSoftwareforGPU-AcceleratedHeterogeneousComputing.

DVFSforenergysaving• DynamicVoltageandFrequencyScaling(DVFS)

• Byadjustingtheprocesser’svoltageandfrequencydynamically,energyconsumptioncanbereduced.

• Theoretically:𝑃𝑜𝑤𝑒𝑟 ∝ 𝑓(.* and𝐸𝑥𝑒𝑐𝑇𝑖𝑚𝑒 ∝ 𝑓.• E.g.:↓ 𝑓𝑟𝑒𝑞𝑏𝑦50%à ↓ 𝑝𝑜𝑤𝑒𝑟𝑏𝑦81%à𝑆𝑎𝑣𝑒62%𝐸𝑛𝑒𝑟𝑔𝑦.• ItisprovedinmanypreviousworksthatapplyingDVFStoslacksinapplicationcansaveconsiderableenergywithoutsignificantlyimpactperformance.

Slacks• Slacksexistinmanyapplicationsinparallelcomputingsystems.• Slackscanbecausedby:dependency,communications,I/O,etc.

Task1

Task2

Task3

Sync.

Slack1

Task4 Slack2

Sync.

CPU:

GPU:

Task1

Task2

Task3

Task4

CPU:

GPU:

Task5

Task6

Task5

Task6

Slack3

Sync.

Task dependency

Task execution

MotivationofSlack-basedEnergySaving

• Ifwekeepthesamefrequencyandvoltage,theenergyinshadowareaswouldbewasted.

• Manyenergysavingtechniques havebeendevelopedutilizingtheslacks• Race-to-Halt(R2H);• SlackReclamation(SR).

CPU Power:

GPU Power: High High

High High

Task1

Task2

Task3

Sync.

Slack1

Task4 Slack2

Sync.

CPU:

GPU:

Task5

Task6

Slack3

Sync.Task execution

High

High

Power status

ExistingApproach:Race-to-Halt

• Race:processtasksattheprocessorshighestpowerstate.• Halt:idlingatthe processorstolowestpowerstate.• Simple,noapplicationknowledgerequired.• Butitisproved*,runningtasksathighestpowerstateisnot veryenergyefficient.*Rountree,Barry,etal."Adagio:makingDVSpracticalforcomplexHPCapplications." Proceedingsofthe23rdinternationalconferenceonSupercomputing.ACM,2009.

CPU Power:


LowestHigh High

Task1

Task2

Task3

Sync.

Slack1

Task4 Slack2

Sync.

CPU:

GPU:

Task5

Task6

Slack3

Sync.Task execution

Lowest High

High Lowest

Power status

ExistingApproach:SlackReclamation

• Identifycriticalpathofapplication• Adjustfrequencywhenrunning tasksonnon-criticalpathà reclaimslacks• MoreenergyefficientthanR2H,requirespredictiononslacktoadjustfreq.inadvance.

CPU Power:


High High

Task1

Task2

Task3

Sync.

Slack1

Task4 Slack2

Sync.

CPU:

GPU:

Task5

Task6

Slack3

Sync.Task execution

High

High

Power status

Adjusted

Adjusted

Adjusted

Task1

Task4

Task5

ExistingApproach:SlackReclamation

• Assumefreq.islinearlyproportionaltoexecutiontime

• 𝑓UVWXYZ[\] =_̀ abcdefg ×ij

ik

Task1

Task2

SlackCPU:

GPU:

Sync.𝑇l

𝑇(

𝑓mVWXnopq

𝑓UVWXnopqCPU Power:

GPU Power:

Task1

Task2

CPU:

GPU:

Sync.𝑇(

𝑇(

𝑓mVWXnopq

𝑓UVWXYZ[\]CPU Power:

GPU Power:

Adjustfreq.beforeexecution

Requiespredictiononexecutiontime(i.e.:𝑇l,𝑇()

ChallengesofSlackReclamation• Itrequiresaccurate predictiononexecutiontime.• Inaccuratepredictedexecutiontimeà inaccuratetargetfrequency

• Targetfrequency>idealfrequencyà Stillwasteenergy

• Targetfrequency<idealfrequencyà Impactperformance

Task1

Task2

CPU:

GPU:

Sync.𝑇′(

𝑇(

𝑓mVWXnopq

𝑓UVWXtuuXqopqCPU Power:

GPU Power:

Slack

Task1

Task2

CPU:

GPU:

Sync.𝑇′(

𝑇(

𝑓mVWXnopq

𝑓UVWXtuuX]uvCPU Power:

GPU Power:Slack

𝐸𝑥𝑡𝑟𝑎 𝐸𝑥𝑒𝑐. 𝑇𝑖𝑚𝑒

ExistingSlackPrediction• Itrequiresaccuratepredictiononexecutiontime.• Existing approach:statistic learningbasedslackprediction

• Itpredictsfutureexecutiontimebasedonhistory.• Itassumesconstantslacklengthorbasedonhistorydata.

• However,itspredictionability islimited.• Itcanonlycapturelimited or inaccurateapplicationcharacteristicfromhistorydata.• Whenthetargetapplicationiscomplicatedorirregularpatterned,thehistorydatacanbelessuseful,thepredictionresultscanbeinaccurate.

• Inaccurateexecutiontimepredictionà inaccuratetargetfrequencyà lessenergysavingorevenmoreenergycostandworseperformance.

• Itrequirestousefirst10%-20%iterationastrainingsetà extraenergycostandwastesvaluableenergysavingopportunities.

• So,inordertoachievebetterpredictionresult,weneedtoknowmoreaboutthetargetapplication.

Outline


LinearAlgebraOperations• Toachievehighaccurateslackprediction,weneedtoknow:

• Algorithmicknowledge:theapplicationcharacteristicinalgorithmlevel.• Architecturecharacteristic:theapplicationcharacteristicrunningongivenhardwareplatforms.

• Inmostlinearalgebraoperations,thealgorithmsinsidethemusuallyfollowsiterativefashion,whichcanhelpuscapturingthoseinformation.

• Most linearalgebralibrariesareopen-sourced,weareabletocapturethealgorithmicknowledgeinsideeachlinearalgebraoperation.

• Theiterativeexecutionfashionmakesthecomputationworkofeachtaskondifferentiterationssimilar.Withminimumprofiling,wecaneasilyapproximatethearchitecturecharacteristic.

• So,weproposedAlgorithmicSlackPredictionmodelforlinearalgebraoperations.

OurPropose:AlgorithmicSlackPrediction• Leveragebothalgorithmic knowledgeandarchitecture characteristictoachievemoreaccurateslackprediction.

• Algorithmic knowledge• Weexamthetargetapplicationtoidentify:(1)potentialslacks (2)overalltasksexecutioncharacteristic(Weassumesourcecodeisassessable.)

• Architecturecharacteristic• Wedoprofilingonthefirstslack-relatedtasksontargethardwareplatformtocapturethecomputingefficiency.(Weassumethisefficiencyforeachtasksineachiterationstaysconstant.)

• Benefitsoverstatisticlearningbasedprediction• Ourmodelisbuiltdirectlyfromapplicationà moreaccurateresult.• One-timeprofiling isgoodenoughtocapturethearchitecturecharacteristicàmuchlowerpredictionmodeltrainingcost.

AlgorithmicSlackPredictionModel• Offlineapplicationinspectionà capturealgorithmic knowledge

• Findpotentialslacks:Tasksscheduledconcurrentlywithsamesynchronizationpoint(e.g.:Slacksmaycauseby𝑇𝑎𝑠𝑘}and𝑇𝑎𝑠𝑘�)

• Derivetaskexecutioncharacteristic:Identifytherelationshipofexecutiontimebetweentwoneighboriterations.(e.g.:𝑹𝑨 𝑖 = i�

(e)

i�(ecj),𝑹𝑩 𝑖 = i�

(e)

i�(ecj))

• Onlineprediction1. T�

(�), T�(�) ← profilingàcapture architecture characteristic

2. Iteratefrom1à n3. T�

(�)← T�(�Xl) ∗ 𝐑𝐀 i à utilizingalgorithmic knowledge

4. T�(�)← T�

(�Xl) ∗ 𝐑𝐁 i à utilizingalgorithmic knowledge5. Idealfreq.ßCalculateIdealFrequency(T�

(�),T�(�))

6. SlackReclamation(Idealfreq.)7. … processtasks…

Outline


GreenLA Library• BuildbasedonMAGMA

• State-of-the-arthighlyoptimizedlinearalgebralibraryonheterogeneoussystemwithGPUs.

• ItassigntasksstaticallytoCPUsandGPUtoachievehighperformancethantraditionallinearalgebralibrary.

• WeidentifiedthatpotentialslacksexistinMAGMA’s implementation.So,weimplementedouralgorithmicslackspredictionenergysavingtoMAGMAlibrary.

• Astheinitialstageofthisproject,westartwiththreecorelinearalgebrafunctions:LU,Cholesky andQR factorization.

ApplicationonLUfactorization

CapturealgorithmicknowledgeIdentifypotentialslacks

One iteration of LU factorization

PotentialSlacks

Example:SlacksinLUfactorization

Slack time of the first 100 iteration of LU factorization in MAGMASlack time = Trailing Matrix Update(on GPU) - Panel Factorization(on CPU)


• 𝑇�]\��(o) ≈ 𝑇V�

o + 𝑇�i.o − 𝑇i�W

o

• 𝑆𝑖𝑔𝑛 𝑇�]\��o àWhichsideistheslack

• 𝑇�]\��o àLengthoftheslack

PanelFactorization(PF)

DataTransf.(DT)

DataTransf.(DT)

OtherComp.

CPU

GPU Trailing Matrix Update(TMU)

Sync. PointSync. Point

Potential Slack

Potential Slack

ApplicationonLUfactorization• Capturealgorithmicknowledge

• Derivetaskexecutioncharacteristic• Wedeterminetheexecutionrelationbetweenneighboriterationsasfollow:

•ia�(e)

ia�(ecj) =

�]u�a�(e)

�]u�a�(ecj) = 1 − �

¡Xo∗� = 𝑅V�(𝑖)

•i£¤b(e)

i£¤b(ecj) =

�]u�£¤b(e)

�]u�£¤b(ecj) = 1 − �

¡Xo∗�

(= 𝑅i�W(𝑖)

•i¥£(e)

i¥£(ecj) =

�o¦[¥£(e)

�o¦[¥£(ecj) = 1 − l

¡Xo= 𝑅�i(𝑖)

• 𝑇V�� ,𝑇�i.

� ,𝑇i�W� canbedeterminedbyofflineprofiling

Panel Factorization

Trailing MatrixUpdate


• Onlineprediction1. 𝑇V�

� ,𝑇�i.� ,𝑇i�W

� ← profilingà architecture characteristic2. Iteratefrom1à n3. T§¨

(�)← T§¨(�Xl) ∗ 𝐑𝐏𝐅 i

4. T«¬(�)← T«¬

(�Xl) ∗ 𝐑𝐃𝐓 i5. T¬¯°

(�) ← T¬¯°(�Xl) ∗ 𝐑𝐓𝐌𝐔 i

6. Idealfreq.ßCalculateIdealFrequency(T§¨(�),T«¬

(�),T¬¯°(�) )

7. ifSlackonCPUà SetGPUFreq.(Idealfreq.)8. ifSlackonGPUà SetCPUFreq.(Idealfreq.)9. Processtask:PF,DT,andTMU(Slack-relatedtasks)10. Restorepowerstateafterslack-relatedcomputations.

Optimization1

• Iftargetfreq.noavailable:• WeuseFrequencySplit• Combinetwonearbyfrequenciestoapproximateidealfrequency.• Determinefreq.splitratior:• 𝑇( = 𝑟𝑇³uv + (𝑟 − 1)𝑇nopq forr• 𝑇³uv and 𝑇nopq canbedeterminedbyourslackpredictionalgorithm

Task1

Task2

CPU:

GPU:

Sync.𝑇(

𝑇(

𝑓mVWXnopq

𝑓UVWXnopqCPU Power:

GPU Power:

𝑓UVWX³uv

𝑟𝑇³uv (𝑟 − 1)𝑇nopq

Optimization2

• ReduceDVFSOverhead• Frequentpowerstateadjustmentmaybringsmuchoverhead• WeuseRelaxFactortoadjustnecessityofpowerstateadjustment.• AlgorithmwithRelaxedSlackReclamation:1. If𝑓oZ[\] notavailable2. 𝑟ß determinefreq.splitratio3. if𝑟 < 𝑅𝑙𝑥𝐹𝑐𝑡𝑟4. Avoidfreq.split,useoriginalfreq.5. else6. Performfreq.split

Outline


ExperimentalEnvironment• HeterogeneousSystem:IVY

Component CPU GPU

Processor 2*10-coreIntel XeonE5-2670 2496-core NVIDIAKepler K20c

PeakPerf. 0.4TFLOPS 1.17TFLOPS

FrequencyScaling Capability

Core Freq.(GHz) CoreFreq.(MHz)

1.2-2.5(↑by0.1) 324, 614,640,666,705,758

Memory 64 GBRAM 5GBRAM

PowerMeter PowerPack Nvidia-smi tool

Freq. AdjustmentMethod CPUFreq.RegisterFiles NVIDIA GPUFreq.SettingAPI

ExperimentSettings• Wetestedandcompared:

• OriginalMAGMAimplementation• OSLevelRace-to-Halt• LibraryLevelRace-to-Halt• OSLevelStatisticLearningBasedDVFS• OSLevelStatisticLearningBasedDVFS(Relaxed)• AlgorithmicPredictionBasedDVFS• AlgorithmicPredictionBasedDVFS(Relaxed)

• AllversionsarebuildonMAGMA1.6.1• Cholesky,LU,QRfactorizationon4sizes:

• 5120,10240,15360,𝑎𝑛𝑑20480

Experiment

• SlackPredictionErrorRate:

• ErrorRate=Average(|Vº[Z.�]\��io»[Xi¼º[�]\��io»[|iº¼[�]\��io»[

)

• OurAlgorithmicpredictionhasmuchlowerpredictionerrorrate.

Benchmarks StatisticLearning Prediction AlgorithmicPredictionBaseIter.(First10%) BaseIter.(First 20%)

Cholesky 10.51% 6.62% 0.96%

LU 9.95% 5.45% 0.16%

QR 11.29% 5.77% 0.52%

Slack prediction accuracy comparison on input matrix size 20480*20480

CPU&GPUEnergySaving

-15.00%

-10.00%

-5.00%

0.00%

5.00%

10.00%

15.00%

20.00%

Cholesky LU QR

CPUEnergySavingComparison

OS_r2h OS_cpsr_str OS_cpsr_rlx lib_r2h lib_cpsr_str lib_cpsr_rlx

CPU energy saving on input matrix size: 20480*20480

-10.00%

-5.00%

0.00%

5.00%

10.00%

15.00%

20.00%

Cholesky LU QR

GPUEnergySavingComparison


GPU energy saving on input matrix size: 20480*20480

OverallEnergySaving&Performance

-10.00%

-8.00%

-6.00%

-4.00%

-2.00%

0.00%

2.00%

4.00%

6.00%

8.00%

10.00%

Cholesky LU QR

TotalEnergySavingComaprison


Total energy saving on input matrix size: 20480*20480

0

5

10

15

20

25

Cholesky LU QR

SecExecutionTimeComparison

Original OS_cpsr_str OS_cpsr_rlx OS_r2h lib_r2h lib_cpsr_str lib_cpsr_rlx

Overall execution time with input matrix size: 20480*20480

Outline


Conclusion

• Slack-basedenergysavingapproachrequiresaccurateslackpredictiontoachievehighenergysaving.

• Existingapproachescannotaccuratelypredictsslacks.• So,weproposedanewalgorithmicslackpredictionapproachconsideringbothalgorithmicknowledgeandarchitecturecharacteristic.

• Ourexperimentsshowthatourapproachcansave2x-3xmoreenergythanexistingapproaches.

• Thankseveryoneforattending!

greenla:green linear algebra software for gpu-accelerated...

Documents