greenla:green linear algebra software for gpu-accelerated...
TRANSCRIPT
GreenLA: GreenLinearAlgebraSoftwareforGPU-AcceleratedHeterogeneousComputing
JieyangChen*§,LiTan*†§,PanruoWu*,Dingwen Tao*,HongboLi*,Xin
Liang*,Sihuan Li*,Rong Ge‡,Laxmi Bhuyan*,andZizhong Chen*
*UniversityofCalifornia, Riverside‡Clemson University
†UltraScaleSystemsResearchCenter,LosAlamosNationalLaboratory§Authorscontributedequally
Outline
• Introductionandourresearchmotivation• Ournewalgorithmicslackprediction• GreenLA Library• ExperimentalEvaluation• Conclusion
Outline
• Introductionandourresearchmotivation• Ournewalgorithmicslackprediction• GreenLA Library• ExperimentalEvaluation• Conclusion
Introduction• Large-scaleHPCsystem
• Highpowerconsumption
Energy Consumption of Top5 Supercomputers(June, 2016 - TOP500 List)
SummaryofOurContributions• Mostoftheexistingenergysavingapproachesfocusonthesystem/OSlevelorarchitecturelevel,withoutconsideringapplicationcharacteristic.
• Inthiswork,weproposedanenergysavingtechniquethatcombinebotharchitecture andspecificapplication knowledgetoachievebetterenergyefficiency.
• WedesignedGreenLA:GreenLinearAlgebraSoftwareforGPU-AcceleratedHeterogeneousComputing.
DVFSforenergysaving• DynamicVoltageandFrequencyScaling(DVFS)
• Byadjustingtheprocesser’svoltageandfrequencydynamically,energyconsumptioncanbereduced.
• Theoretically:𝑃𝑜𝑤𝑒𝑟 ∝ 𝑓(.* and𝐸𝑥𝑒𝑐𝑇𝑖𝑚𝑒 ∝ 𝑓.• E.g.:↓ 𝑓𝑟𝑒𝑞𝑏𝑦50%à ↓ 𝑝𝑜𝑤𝑒𝑟𝑏𝑦81%à𝑆𝑎𝑣𝑒62%𝐸𝑛𝑒𝑟𝑔𝑦.• ItisprovedinmanypreviousworksthatapplyingDVFStoslacksinapplicationcansaveconsiderableenergywithoutsignificantlyimpactperformance.
Slacks• Slacksexistinmanyapplicationsinparallelcomputingsystems.• Slackscanbecausedby:dependency,communications,I/O,etc.
Task1
Task2
Task3
Sync.
Slack1
Task4 Slack2
Sync.
CPU:
GPU:
Task1
Task2
Task3
Task4
CPU:
GPU:
Task5
Task6
Task5
Task6
Slack3
Sync.
Task dependency
Task execution
MotivationofSlack-basedEnergySaving
• Ifwekeepthesamefrequencyandvoltage,theenergyinshadowareaswouldbewasted.
• Manyenergysavingtechniques havebeendevelopedutilizingtheslacks• Race-to-Halt(R2H);• SlackReclamation(SR).
CPU Power:
GPU Power: High High
High High
Task1
Task2
Task3
Sync.
Slack1
Task4 Slack2
Sync.
CPU:
GPU:
Task5
Task6
Slack3
Sync.Task execution
High
High
Power status
ExistingApproach:Race-to-Halt
• Race:processtasksattheprocessorshighestpowerstate.• Halt:idlingatthe processorstolowestpowerstate.• Simple,noapplicationknowledgerequired.• Butitisproved*,runningtasksathighestpowerstateisnot veryenergyefficient.*Rountree,Barry,etal."Adagio:makingDVSpracticalforcomplexHPCapplications." Proceedingsofthe23rdinternationalconferenceonSupercomputing.ACM,2009.
CPU Power:
GPU Power: High High
LowestHigh High
Task1
Task2
Task3
Sync.
Slack1
Task4 Slack2
Sync.
CPU:
GPU:
Task5
Task6
Slack3
Sync.Task execution
Lowest High
High Lowest
Power status
ExistingApproach:SlackReclamation
• Identifycriticalpathofapplication• Adjustfrequencywhenrunning tasksonnon-criticalpathà reclaimslacks• MoreenergyefficientthanR2H,requirespredictiononslacktoadjustfreq.inadvance.
CPU Power:
GPU Power: High High
High High
Task1
Task2
Task3
Sync.
Slack1
Task4 Slack2
Sync.
CPU:
GPU:
Task5
Task6
Slack3
Sync.Task execution
High
High
Power status
Adjusted
Adjusted
Adjusted
Task1
Task4
Task5
ExistingApproach:SlackReclamation
• Assumefreq.islinearlyproportionaltoexecutiontime
• 𝑓UVWXYZ[\] =_̀ abcdefg ×ij
ik
Task1
Task2
SlackCPU:
GPU:
Sync.𝑇l
𝑇(
𝑓mVWXnopq
𝑓UVWXnopqCPU Power:
GPU Power:
Task1
Task2
CPU:
GPU:
Sync.𝑇(
𝑇(
𝑓mVWXnopq
𝑓UVWXYZ[\]CPU Power:
GPU Power:
Adjustfreq.beforeexecution
Requiespredictiononexecutiontime(i.e.:𝑇l,𝑇()
ChallengesofSlackReclamation• Itrequiresaccurate predictiononexecutiontime.• Inaccuratepredictedexecutiontimeà inaccuratetargetfrequency
• Targetfrequency>idealfrequencyà Stillwasteenergy
• Targetfrequency<idealfrequencyà Impactperformance
Task1
Task2
CPU:
GPU:
Sync.𝑇′(
𝑇(
𝑓mVWXnopq
𝑓UVWXtuuXqopqCPU Power:
GPU Power:
Slack
Task1
Task2
CPU:
GPU:
Sync.𝑇′(
𝑇(
𝑓mVWXnopq
𝑓UVWXtuuX]uvCPU Power:
GPU Power:Slack
𝐸𝑥𝑡𝑟𝑎 𝐸𝑥𝑒𝑐. 𝑇𝑖𝑚𝑒
ExistingSlackPrediction• Itrequiresaccuratepredictiononexecutiontime.• Existing approach:statistic learningbasedslackprediction
• Itpredictsfutureexecutiontimebasedonhistory.• Itassumesconstantslacklengthorbasedonhistorydata.
• However,itspredictionability islimited.• Itcanonlycapturelimited or inaccurateapplicationcharacteristicfromhistorydata.• Whenthetargetapplicationiscomplicatedorirregularpatterned,thehistorydatacanbelessuseful,thepredictionresultscanbeinaccurate.
• Inaccurateexecutiontimepredictionà inaccuratetargetfrequencyà lessenergysavingorevenmoreenergycostandworseperformance.
• Itrequirestousefirst10%-20%iterationastrainingsetà extraenergycostandwastesvaluableenergysavingopportunities.
• So,inordertoachievebetterpredictionresult,weneedtoknowmoreaboutthetargetapplication.
Outline
• Introductionandourresearchmotivation• Ournewalgorithmicslackprediction• GreenLA Library• ExperimentalEvaluation• Conclusion
LinearAlgebraOperations• Toachievehighaccurateslackprediction,weneedtoknow:
• Algorithmicknowledge:theapplicationcharacteristicinalgorithmlevel.• Architecturecharacteristic:theapplicationcharacteristicrunningongivenhardwareplatforms.
• Inmostlinearalgebraoperations,thealgorithmsinsidethemusuallyfollowsiterativefashion,whichcanhelpuscapturingthoseinformation.
• Most linearalgebralibrariesareopen-sourced,weareabletocapturethealgorithmicknowledgeinsideeachlinearalgebraoperation.
• Theiterativeexecutionfashionmakesthecomputationworkofeachtaskondifferentiterationssimilar.Withminimumprofiling,wecaneasilyapproximatethearchitecturecharacteristic.
• So,weproposedAlgorithmicSlackPredictionmodelforlinearalgebraoperations.
OurPropose:AlgorithmicSlackPrediction• Leveragebothalgorithmic knowledgeandarchitecture characteristictoachievemoreaccurateslackprediction.
• Algorithmic knowledge• Weexamthetargetapplicationtoidentify:(1)potentialslacks (2)overalltasksexecutioncharacteristic(Weassumesourcecodeisassessable.)
• Architecturecharacteristic• Wedoprofilingonthefirstslack-relatedtasksontargethardwareplatformtocapturethecomputingefficiency.(Weassumethisefficiencyforeachtasksineachiterationstaysconstant.)
• Benefitsoverstatisticlearningbasedprediction• Ourmodelisbuiltdirectlyfromapplicationà moreaccurateresult.• One-timeprofiling isgoodenoughtocapturethearchitecturecharacteristicàmuchlowerpredictionmodeltrainingcost.
AlgorithmicSlackPredictionModel• Offlineapplicationinspectionà capturealgorithmic knowledge
• Findpotentialslacks:Tasksscheduledconcurrentlywithsamesynchronizationpoint(e.g.:Slacksmaycauseby𝑇𝑎𝑠𝑘}and𝑇𝑎𝑠𝑘�)
• Derivetaskexecutioncharacteristic:Identifytherelationshipofexecutiontimebetweentwoneighboriterations.(e.g.:𝑹𝑨 𝑖 = i�
(e)
i�(ecj),𝑹𝑩 𝑖 = i�
(e)
i�(ecj))
• Onlineprediction1. T�
(�), T�(�) ← profilingàcapture architecture characteristic
2. Iteratefrom1à n3. T�
(�)← T�(�Xl) ∗ 𝐑𝐀 i à utilizingalgorithmic knowledge
4. T�(�)← T�
(�Xl) ∗ 𝐑𝐁 i à utilizingalgorithmic knowledge5. Idealfreq.ßCalculateIdealFrequency(T�
(�),T�(�))
6. SlackReclamation(Idealfreq.)7. … processtasks…
Outline
• Introductionandourresearchmotivation• Ournewalgorithmicslackprediction• GreenLA Library• ExperimentalEvaluation• Conclusion
GreenLA Library• BuildbasedonMAGMA
• State-of-the-arthighlyoptimizedlinearalgebralibraryonheterogeneoussystemwithGPUs.
• ItassigntasksstaticallytoCPUsandGPUtoachievehighperformancethantraditionallinearalgebralibrary.
• WeidentifiedthatpotentialslacksexistinMAGMA’s implementation.So,weimplementedouralgorithmicslackspredictionenergysavingtoMAGMAlibrary.
• Astheinitialstageofthisproject,westartwiththreecorelinearalgebrafunctions:LU,Cholesky andQR factorization.
ApplicationonLUfactorization
CapturealgorithmicknowledgeIdentifypotentialslacks
One iteration of LU factorization
PotentialSlacks
Example:SlacksinLUfactorization
Slack time of the first 100 iteration of LU factorization in MAGMASlack time = Trailing Matrix Update(on GPU) - Panel Factorization(on CPU)
ApplicationonLUfactorization
• 𝑇�]\��(o) ≈ 𝑇V�
o + 𝑇�i.o − 𝑇i�W
o
• 𝑆𝑖𝑔𝑛 𝑇�]\��o àWhichsideistheslack
• 𝑇�]\��o àLengthoftheslack
PanelFactorization(PF)
DataTransf.(DT)
DataTransf.(DT)
OtherComp.
CPU
GPU Trailing Matrix Update(TMU)
Sync. PointSync. Point
Potential Slack
Potential Slack
ApplicationonLUfactorization• Capturealgorithmicknowledge
• Derivetaskexecutioncharacteristic• Wedeterminetheexecutionrelationbetweenneighboriterationsasfollow:
•ia�(e)
ia�(ecj) =
�]u�a�(e)
�]u�a�(ecj) = 1 − �
¡Xo∗� = 𝑅V�(𝑖)
•i£¤b(e)
i£¤b(ecj) =
�]u�£¤b(e)
�]u�£¤b(ecj) = 1 − �
¡Xo∗�
(= 𝑅i�W(𝑖)
•i¥£(e)
i¥£(ecj) =
�o¦[¥£(e)
�o¦[¥£(ecj) = 1 − l
¡Xo= 𝑅�i(𝑖)
• 𝑇V�� ,𝑇�i.
� ,𝑇i�W� canbedeterminedbyofflineprofiling
Panel Factorization
Trailing MatrixUpdate
ApplicationonLUfactorization
• Onlineprediction1. 𝑇V�
� ,𝑇�i.� ,𝑇i�W
� ← profilingà architecture characteristic2. Iteratefrom1à n3. T§¨
(�)← T§¨(�Xl) ∗ 𝐑𝐏𝐅 i
4. T«¬(�)← T«¬
(�Xl) ∗ 𝐑𝐃𝐓 i5. T¬¯°
(�) ← T¬¯°(�Xl) ∗ 𝐑𝐓𝐌𝐔 i
6. Idealfreq.ßCalculateIdealFrequency(T§¨(�),T«¬
(�),T¬¯°(�) )
7. ifSlackonCPUà SetGPUFreq.(Idealfreq.)8. ifSlackonGPUà SetCPUFreq.(Idealfreq.)9. Processtask:PF,DT,andTMU(Slack-relatedtasks)10. Restorepowerstateafterslack-relatedcomputations.
Optimization1
• Iftargetfreq.noavailable:• WeuseFrequencySplit• Combinetwonearbyfrequenciestoapproximateidealfrequency.• Determinefreq.splitratior:• 𝑇( = 𝑟𝑇³uv + (𝑟 − 1)𝑇nopq forr• 𝑇³uv and 𝑇nopq canbedeterminedbyourslackpredictionalgorithm
Task1
Task2
CPU:
GPU:
Sync.𝑇(
𝑇(
𝑓mVWXnopq
𝑓UVWXnopqCPU Power:
GPU Power:
𝑓UVWX³uv
𝑟𝑇³uv (𝑟 − 1)𝑇nopq
Optimization2
• ReduceDVFSOverhead• Frequentpowerstateadjustmentmaybringsmuchoverhead• WeuseRelaxFactortoadjustnecessityofpowerstateadjustment.• AlgorithmwithRelaxedSlackReclamation:1. If𝑓oZ[\] notavailable2. 𝑟ß determinefreq.splitratio3. if𝑟 < 𝑅𝑙𝑥𝐹𝑐𝑡𝑟4. Avoidfreq.split,useoriginalfreq.5. else6. Performfreq.split
Outline
• Introductionandourresearchmotivation• Ournewalgorithmicslackprediction• GreenLA Library• ExperimentalEvaluation• Conclusion
ExperimentalEnvironment• HeterogeneousSystem:IVY
Component CPU GPU
Processor 2*10-coreIntel XeonE5-2670 2496-core NVIDIAKepler K20c
PeakPerf. 0.4TFLOPS 1.17TFLOPS
FrequencyScaling Capability
Core Freq.(GHz) CoreFreq.(MHz)
1.2-2.5(↑by0.1) 324, 614,640,666,705,758
Memory 64 GBRAM 5GBRAM
PowerMeter PowerPack Nvidia-smi tool
Freq. AdjustmentMethod CPUFreq.RegisterFiles NVIDIA GPUFreq.SettingAPI
ExperimentSettings• Wetestedandcompared:
• OriginalMAGMAimplementation• OSLevelRace-to-Halt• LibraryLevelRace-to-Halt• OSLevelStatisticLearningBasedDVFS• OSLevelStatisticLearningBasedDVFS(Relaxed)• AlgorithmicPredictionBasedDVFS• AlgorithmicPredictionBasedDVFS(Relaxed)
• AllversionsarebuildonMAGMA1.6.1• Cholesky,LU,QRfactorizationon4sizes:
• 5120,10240,15360,𝑎𝑛𝑑20480
Experiment
• SlackPredictionErrorRate:
• ErrorRate=Average(|Vº[Z.�]\��io»[Xi¼º[�]\��io»[|iº¼[�]\��io»[
)
• OurAlgorithmicpredictionhasmuchlowerpredictionerrorrate.
Benchmarks StatisticLearning Prediction AlgorithmicPredictionBaseIter.(First10%) BaseIter.(First 20%)
Cholesky 10.51% 6.62% 0.96%
LU 9.95% 5.45% 0.16%
QR 11.29% 5.77% 0.52%
Slack prediction accuracy comparison on input matrix size 20480*20480
CPU&GPUEnergySaving
-15.00%
-10.00%
-5.00%
0.00%
5.00%
10.00%
15.00%
20.00%
Cholesky LU QR
CPUEnergySavingComparison
OS_r2h OS_cpsr_str OS_cpsr_rlx lib_r2h lib_cpsr_str lib_cpsr_rlx
CPU energy saving on input matrix size: 20480*20480
-10.00%
-5.00%
0.00%
5.00%
10.00%
15.00%
20.00%
Cholesky LU QR
GPUEnergySavingComparison
OS_r2h OS_cpsr_str OS_cpsr_rlx lib_r2h lib_cpsr_str lib_cpsr_rlx
GPU energy saving on input matrix size: 20480*20480
OverallEnergySaving&Performance
-10.00%
-8.00%
-6.00%
-4.00%
-2.00%
0.00%
2.00%
4.00%
6.00%
8.00%
10.00%
Cholesky LU QR
TotalEnergySavingComaprison
OS_r2h OS_cpsr_str OS_cpsr_rlx lib_r2h lib_cpsr_str lib_cpsr_rlx
Total energy saving on input matrix size: 20480*20480
0
5
10
15
20
25
Cholesky LU QR
SecExecutionTimeComparison
Original OS_cpsr_str OS_cpsr_rlx OS_r2h lib_r2h lib_cpsr_str lib_cpsr_rlx
Overall execution time with input matrix size: 20480*20480
Outline
• Introductionandourresearchmotivation• Ournewalgorithmicslackprediction• GreenLA Library• ExperimentalEvaluation• Conclusion
Conclusion
• Slack-basedenergysavingapproachrequiresaccurateslackpredictiontoachievehighenergysaving.
• Existingapproachescannotaccuratelypredictsslacks.• So,weproposedanewalgorithmicslackpredictionapproachconsideringbothalgorithmicknowledgeandarchitecturecharacteristic.
• Ourexperimentsshowthatourapproachcansave2x-3xmoreenergythanexistingapproaches.
• Thankseveryoneforattending!