XPDL: Extensible Platform Description Language to Support Energy Modeling and Optimization
Published in ICPP-EMS’15
Christoph Kessler, Lu Li, Ming-jie Yang, Aras Atalar and Alin
Motivation
Optimization:
  Platform-independent: algorithmic improvement
  Platform-dependent: SIMD, GPU, etc.
Platform-dependent optimizations such as SIMD can yield significant performance and energy improvements.
Platform-dependent optimizations are usually tuned manually or only partly automated.
  This requires understanding of the machine-specific features.
Automation of systematic platform-dependent optimizations is both interesting and challenging.
  Adaptivity
  Retargetability
Prerequisite: a formal platform description language that models optimization-relevant platform properties.
Motivation
Platform = Hardware + System Software
Previous work: PDL (Platform Description Language)
  Limitations: flexibility and scalability.
XPDL is designed to overcome those limitations.
Furthermore, new features are added to better support energy optimization, such as power island modeling.
A Hello World XPDL Example
Examples only illustrate XPDL language constructs; they are not meant to be complete.
(a) A Typical CPU Structure
[Figure: four cores, each with its own L1 cache; two L2 caches, each shared by a pair of cores; one L3 cache shared by all cores.]
<cpu name="Intel_Xeon_E5_2630L">
  <group prefix="core_group" quantity="2">
    <group prefix="core" quantity="2">
      <!-- Embedded definition -->
      <core frequency="2" frequency_unit="GHz">
        <cache name="L1" size="32" unit="KiB" />
      </core>
    </group>
    <cache name="L2" size="256" unit="KiB" />
  </group>
  <cache name="L3" size="15" unit="MiB" />
  <power_model type="power_model_E5_2630L" />
</cpu>
(b) The corresponding XPDL code
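The nesting of `group` elements with a `quantity` attribute can be read as replication: everything inside a group exists `quantity` times. The following is a minimal sketch of that reading, assuming a hand-built in-memory tree. The types `Cache` and `Group` and the function names are hypothetical and are not part of the XPDL tool chain.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory mirror of the XPDL elements used in the listing
// (illustrative only, not the data structures of the XPDL tool chain).
struct Cache {
    std::string name;
    std::uint64_t size_kib;  // size normalized to KiB
};

struct Group {
    unsigned quantity;            // value of the quantity attribute
    std::vector<Cache> caches;    // caches declared directly in this group
    std::vector<Group> children;  // nested groups
};

// Total cache capacity in KiB: the content of a group is
// replicated `quantity` times.
std::uint64_t total_cache_kib(const Group& g) {
    std::uint64_t sum = 0;
    for (const Cache& c : g.caches) sum += c.size_kib;
    for (const Group& child : g.children) sum += total_cache_kib(child);
    return g.quantity * sum;
}

// Builds the Intel_Xeon_E5_2630L example by hand.
Group example_cpu() {
    Group cores;                       // <group prefix="core" quantity="2">
    cores.quantity = 2;
    cores.caches.push_back(Cache{"L1", 32});

    Group core_groups;                 // <group prefix="core_group" quantity="2">
    core_groups.quantity = 2;
    core_groups.caches.push_back(Cache{"L2", 256});
    core_groups.children.push_back(cores);

    Group cpu;                         // the <cpu> element itself
    cpu.quantity = 1;
    cpu.caches.push_back(Cache{"L3", 15 * 1024});
    cpu.children.push_back(core_groups);
    return cpu;
}
```

Under this reading the example describes 4 x 32 KiB of L1, 2 x 256 KiB of L2 and 15 MiB of L3.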
Modular and extensible
name: defines a meta-model, stores type information
id: defines a model, stores object information
Figure 1: Type reference and inheritance
Control Relation Decoupled From Hardware
<Master id="0" quantity="1">
  <peppher:PEDescriptor>
    <peppher:Property fixed="true">
      <name>ARCHITECTURE</name>
      <value>x86</value> ...
  <peppher:Worker quantity="1" id="1">
    <peppher:PEDescriptor>
      <peppher:Property fixed="true">
        <name>ARCHITECTURE</name>
        <value>gpu</value> ...
  </peppher:Worker>
</Master>
Listing 1: PDL example description for x86-core (Master) and gpu (Worker)
<system id="liu-gpu-server">
  <socket>
    <cpu id="gpu-host" type="Intel-Xeon-E5-2630L" />
  </socket>
  <device id="gpu1" type="Nvidia-K20c" />
  <interconnects>
    <interconnect id="connection1" type="pcie3" head="gpu-host" tail="gpu1" />
  </interconnects>
</system>
Listing 2: XPDL example description for such a GPU server
System Software Modeling
<system id="XScluster">
  <cluster>
    <group prefix="n" quantity="4">
      <node>
        <group id="cpu1">
          <socket> <cpu id="PE0" type="Intel-Xeon-..." /> </socket>
        </group>
        <group prefix="main-mem" quantity="4"> <memory type="DDR3-4G" /> </group>
        <device id="gpu1" type="Nvidia-K20c" />
        <interconnects>
          <interconnect id="conn1" type="pcie3" head="cpu1" tail="gpu1" />
        </interconnects>
      </node>
    </group>
    <interconnects>
      <interconnect id="conn3" type="infiniband1" head="n1" tail="n2" />
    </interconnects>
  </cluster>
  <software>
    <hostOS id="linux1" type="Linux-..." />
    <installed type="CUDA-6.0" path="/ext/local/cuda6.0/" />
    <installed type="CUBLAS-..." path="..." />
    <installed type="StarPU-1.0" path="..." />
  </software>
  <properties>
    <property name="ExternalPowerMeter" type="..." command="myscript.sh" />
  </properties>
</system>
Listing 3: Example of a concrete cluster machine
Power Modeling
A power model in XPDL consists of
  power domains and their power state machines
  microbenchmarks with deployment information
Modeling Power Domains
Power domains: hardware components that must change state together.
E.g., in Movidius Myriad2, each SHAVE core forms a separate power island.
Modeling Power Domains
<power_domains name="Myriad1-power-domains">
  <!-- this island is the main island -->
  <!-- and cannot be turned off -->
  <power_domain name="main-pd" enableSwitchOff="false">
    <core type="Leon" />
  </power_domain>
  <group name="Shave-pds" quantity="8">
    <power_domain name="Shave-pd">
      <core type="Myriad1-Shave" />
    </power_domain>
  </group>
  <!-- this island can only be turned off -->
  <!-- if all the Shave cores are switched off -->
  <power_domain name="CMX-pd"
                switchoffCondition="Shave-pds off">
    <memory type="CMX" />
  </power_domain>
</power_domains>
Listing 4: Example meta-model for power domains of Movidius Myriad1
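The `enableSwitchOff` and `switchoffCondition` attributes express constraints that a runtime or optimizer would have to check before powering down an island. The sketch below shows one way such a check could look. The struct and function names are hypothetical, and treating a missing `enableSwitchOff` as "may be switched off" is an assumption of this sketch, not something the listing states.

```cpp
#include <string>
#include <vector>

// Hypothetical runtime view of a power domain from the listing
// (names mirror the XML; the type itself is illustrative only).
struct PowerDomain {
    std::string name;
    bool enable_switch_off;  // enableSwitchOff attribute (assumed true if absent)
    bool is_on;              // current state
};

// True if every domain in the group is currently switched off.
bool all_off(const std::vector<PowerDomain>& domains) {
    for (const PowerDomain& pd : domains)
        if (pd.is_on) return false;
    return true;
}

// CMX-pd carries switchoffCondition="Shave-pds off": it may be
// switched off only if all Shave power domains are off.
bool can_switch_off_cmx(const std::vector<PowerDomain>& shave_pds) {
    return all_off(shave_pds);
}

// main-pd declares enableSwitchOff="false" and is therefore never eligible.
bool can_switch_off(const PowerDomain& pd) {
    return pd.enable_switch_off;
}
```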
Modeling Power State Machine
Figure 2: Intel Xscale processor (2000)
<power_state_machine name="power-state-machine1"
                     power_domain="Intel-Xscale-pd">
  <power_states>
    <power_state name="P1" frequency="150" frequency_unit="MHz"
                 power="60" power_unit="mW" voltage="0.75" voltage_unit="V" />
    <power_state name="P2" frequency="600" frequency_unit="MHz"
                 power="450" power_unit="mW" voltage="1.3" voltage_unit="V" />
    ...
  </power_states>
  <transitions>
    <transition head="P1" tail="P2" time="160" time_unit="us" />
    <transition head="P2" tail="P1" time="160" time_unit="us" />
    ...
  </transitions>
</power_state_machine>
Listing 5: Example meta-model for a power state machine of Intel Xscale processor (2000)
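A power state machine like the one in Listing 5 gives a client enough data to estimate the energy spent in a state and the time needed to change states. The following sketch uses hypothetical types and function names (not the XPDL tool chain's representation) and only the two states and two transitions visible in the listing.

```cpp
#include <map>
#include <string>
#include <utility>

// Illustrative types only.
struct PowerState {
    double frequency_mhz;
    double power_mw;
    double voltage_v;
};

struct PowerStateMachine {
    std::map<std::string, PowerState> states;
    // (head, tail) -> transition time in microseconds
    std::map<std::pair<std::string, std::string>, double> transition_us;

    // Energy in microjoules for staying duration_us in a state:
    // mW * us = nJ, so divide by 1000 to obtain uJ.
    double residency_energy_uj(const std::string& state, double duration_us) const {
        return states.at(state).power_mw * duration_us / 1000.0;
    }

    // Modeled transition time for a (head, tail) pair;
    // throws std::out_of_range if the pair is not in the model.
    double transition_time_us(const std::string& head, const std::string& tail) const {
        return transition_us.at(std::make_pair(head, tail));
    }
};

// The two states and two transitions shown in the listing.
PowerStateMachine xscale_example() {
    PowerStateMachine m;
    m.states["P1"] = PowerState{150.0, 60.0, 0.75};
    m.states["P2"] = PowerState{600.0, 450.0, 1.3};
    m.transition_us[std::make_pair(std::string("P1"), std::string("P2"))] = 160.0;
    m.transition_us[std::make_pair(std::string("P2"), std::string("P1"))] = 160.0;
    return m;
}
```

For example, one millisecond in P1 at 60 mW corresponds to 60 uJ.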
Modeling Microbenchmarks With Deployment Information
<instructions name="x86-base-isa"
              mb="mb-x86-base-1">
  <inst name="fmul"
        energy="?" energy_unit="pJ" mb="fm1" />
  <inst name="fadd"
        energy="?" energy_unit="pJ" mb="fa1" />
  <inst name="divsd">
    <data frequency="2.8"
          energy="18.625" energy_unit="nJ" />
    <data frequency="2.9"
          energy="19.573" energy_unit="nJ" />
    <data frequency="3.4"
          energy="21.023" energy_unit="nJ" />
  </inst>
</instructions>
Listing 6: Example meta-model for instruction energy cost
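The `divsd` entry in Listing 6 stores energy per instruction at a few frequencies. A client of such a table needs some rule for frequencies that were not measured. The sketch below assumes piecewise-linear interpolation with clamping at the ends. That rule, the type `EnergySample` and the function names are assumptions made for illustration, and the listing does not state the frequency unit.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// One <data> entry of an <inst> element: energy measured at one frequency.
struct EnergySample {
    double frequency;  // unit as in the listing (not stated there)
    double energy_nj;
};

// Piecewise-linear interpolation over samples sorted by frequency.
// Values outside the measured range are clamped to the nearest sample.
// Interpolating between measurements is an assumption of this sketch.
double energy_at(const std::vector<EnergySample>& samples, double frequency) {
    if (samples.empty()) throw std::invalid_argument("no samples");
    if (frequency <= samples.front().frequency) return samples.front().energy_nj;
    if (frequency >= samples.back().frequency) return samples.back().energy_nj;
    for (std::size_t i = 1; i < samples.size(); ++i) {
        const EnergySample& lo = samples[i - 1];
        const EnergySample& hi = samples[i];
        if (frequency <= hi.frequency) {
            double t = (frequency - lo.frequency) / (hi.frequency - lo.frequency);
            return lo.energy_nj + t * (hi.energy_nj - lo.energy_nj);
        }
    }
    return samples.back().energy_nj;  // not reached for sorted input
}

// The divsd data points from the listing.
std::vector<EnergySample> divsd_samples() {
    std::vector<EnergySample> s;
    s.push_back(EnergySample{2.8, 18.625});
    s.push_back(EnergySample{2.9, 19.573});
    s.push_back(EnergySample{3.4, 21.023});
    return s;
}
```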
<microbenchmarks id="mb-x86-base-1"
                 instruction_set="x86-base-isa"
                 path="/usr/local/micr/src"
                 command="mbscript.sh">
  <microbenchmark id="fa1" type="fadd" file="fadd.c"
                  cflags="-O0" lflags="..." />
  <microbenchmark id="mo1" type="mov" file="mov.c"
                  cflags="-O0" lflags="..." />
</microbenchmarks>
Listing 7: An example model for instruction energy cost
XPDL Toolchain and Microbenchmark Generation
[Diagram components: XPDL Schema, XPDL XML Files, XPDL Parser, Intermediate Representation, XPDL Query API, Microbenchmark generator, Microbenchmark execution, Code generator, C++ library, C++ API, Applications and tool chain.]
Figure 3: XPDL Tool Chain Diagram
XPDL Runtime Query API
Initialization of the XPDL run-time query library
Functions for browsing the model tree
Functions for looking up attributes of model elements
Model analysis functions for derived attributes
  Static inference, e.g. PCIe bandwidth
  Aggregate numbers, e.g. get the total static power or the total number of GPUs
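Aggregate queries such as "total static power" or "number of GPUs" amount to a traversal of the model tree. The sketch below illustrates the idea on a hypothetical node type. `ModelNode`, `total_static_power` and `count_by_type_prefix` are invented for this example and are not the actual XPDL query API, whose signatures are not shown in these slides.

```cpp
#include <string>
#include <vector>

// Hypothetical model-tree node; the real XPDL query API may look different.
struct ModelNode {
    std::string kind;       // element name, e.g. "cpu", "device", "node"
    std::string type;       // value of the type attribute, may be empty
    double static_power_w;  // static power of this element alone, 0 if unknown
    std::vector<ModelNode> children;
};

// Aggregate query: sum static power over the whole subtree.
double total_static_power(const ModelNode& n) {
    double sum = n.static_power_w;
    for (const ModelNode& c : n.children) sum += total_static_power(c);
    return sum;
}

// Aggregate query: count elements of a given kind whose type
// attribute starts with the given prefix (e.g. "device" / "Nvidia").
int count_by_type_prefix(const ModelNode& n, const std::string& kind,
                         const std::string& prefix) {
    int count = 0;
    if (n.kind == kind && n.type.compare(0, prefix.size(), prefix) == 0) count = 1;
    for (const ModelNode& c : n.children)
        count += count_by_type_prefix(c, kind, prefix);
    return count;
}
```

A call such as `count_by_type_prefix(root, "device", "Nvidia")` would then count the Nvidia devices below `root`.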
Summary of Contributions
We propose XPDL, a portable and extensible platform description language
  Modular and extensible
  Software roles decoupled from hardware structure
  System software modeled
  Microbenchmark generation
  Open source tool chain
http://www.ida.liu.se/labs/pelab/xpdl/
MeterPU:
A Generic Measurement Abstraction API
– Enabling Energy-tuned Skeleton Backend Selection
Published in Journal of Supercomputing, 2016
Lu Li, Christoph Kessler
Motivation
Energy Measurement
  Necessary for energy modeling and optimization.
  Difficult, especially for GPUs
Power visualization (Li et al., 2015 [2]) for a CUDA program, illustrating the "capacitor effect". Data obtained by Nvidia NVML.
[Plot: power (mW), temperature and fan speed against time (s).]
Requires numerical postprocessing (Burtscher et al., 2014 [1]) to calculate the real energy cost
Goal: make energy measurement as simple as measuring time in the good old days.
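To give a feel for what such postprocessing involves at its simplest, the sketch below turns a series of sampled power readings into energy by trapezoidal integration. It is plain numerical integration over hypothetical `PowerSample` records and does not reproduce the sensor correction of Burtscher et al. [1].

```cpp
#include <cstddef>
#include <vector>

// One sensor reading: timestamp in seconds, power in milliwatts.
struct PowerSample {
    double time_s;
    double power_mw;
};

// Energy in millijoules (mW * s) by trapezoidal integration of the
// sampled power curve. Samples are assumed to be sorted by time.
double energy_mj(const std::vector<PowerSample>& samples) {
    double energy = 0.0;
    for (std::size_t i = 1; i < samples.size(); ++i) {
        double dt = samples[i].time_s - samples[i - 1].time_s;
        energy += 0.5 * (samples[i].power_mw + samples[i - 1].power_mw) * dt;
    }
    return energy;
}
```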
MeterPU
A software multimeter
A generic measurement abstraction, hiding platform-specific measurement details.
Four simple functions to unify the measurement interface for various metrics on different hardware components.
  Time, Energy on CPU, GPU
  Easy to extend: FLOPS, cache misses etc.
On top of native measurement libraries (plug-ins).
  CPU time: clock_gettime()
  GPU time: cudaEventRecord()
  CPU and DRAM energy: Intel PCM library.
  GPU energy: Nvidia NVML library.
  Wattsup_Energy: hardware measurement
  ...
MeterPU API
template<class Type>  // analogous to the switch
                      // on a real multimeter
class Meter
{
public:
    void start();  // start a measurement
    void stop();   // stop a measurement
    void calc();   // calculate a metric value
                   // of a code region
    typename Meter_Traits<Type>::ResultType const &
    get_value() const;
                   // get the calculated metric value
private:
    // Platform specific measurement details hidden ...
};
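The traits-based interface above can be illustrated with a small stand-alone imitation. The sketch below lives in its own namespace `sketch` and measures wall-clock time with `std::chrono`. The tag `Wall_Time` and the `read`/`diff` members of the traits class are inventions of this sketch and say nothing about how MeterPU is implemented internally.

```cpp
#include <chrono>

// Everything in namespace sketch is hypothetical, not MeterPU itself.
namespace sketch {

struct Wall_Time {};  // tag type selecting the metric

template <class Type> struct Meter_Traits;

// Traits for the wall-clock metric: result type plus the
// platform-specific way of taking and subtracting readings.
template <> struct Meter_Traits<Wall_Time> {
    typedef double ResultType;  // seconds
    typedef std::chrono::steady_clock::time_point Reading;
    static Reading read() { return std::chrono::steady_clock::now(); }
    static ResultType diff(const Reading& begin, const Reading& end) {
        return std::chrono::duration<double>(end - begin).count();
    }
};

// Same four-function shape as the Meter class on the slide.
template <class Type>
class Meter {
public:
    Meter() : begin_(), end_(), value_() {}
    void start() { begin_ = Meter_Traits<Type>::read(); }
    void stop()  { end_ = Meter_Traits<Type>::read(); }
    void calc()  { value_ = Meter_Traits<Type>::diff(begin_, end_); }
    typename Meter_Traits<Type>::ResultType const& get_value() const { return value_; }
private:
    typename Meter_Traits<Type>::Reading begin_, end_;
    typename Meter_Traits<Type>::ResultType value_;
};

}  // namespace sketch
```

Supporting another metric in this sketch would mean adding a new tag type and a matching `Meter_Traits` specialization, while client code keeps calling the same four functions.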
A MeterPU Application: Measure CPU Time
#include <MeterPU.h>

int main()
{
    using namespace MeterPU;
    Meter<CPU_Time> meter;

    meter.start();  // Measurement Start!

    cpu_func();     // Do something here

    meter.stop();   // Measurement Stop!

    meter.calc();
    BUILD_CPU_TIME_MODEL(meter.get_value());
}
A MeterPU Application: Measure GPU Energy
#include <MeterPU.h>

int main()
{
    using namespace MeterPU;
    // Only one line differs!
    Meter<NVML_Energy<> > meter;

    meter.start();  // Measurement Start!

    cuda_func<<<..., ...>>>(...);
    cudaDeviceSynchronize();

    meter.stop();   // Measurement Stop!

    meter.calc();
    BUILD_GPU_ENERGY_MODEL(meter.get_value());
}
Measure Combinations of CPUs and GPUs
#include <MeterPU.h>

int main()
{
    using namespace MeterPU;
    // Only one line differs!
    Meter<System_Energy<0> > meter;

    meter.start();  // Measurement Start!

    async_cpu_func();
    cuda_func<<<..., ...>>>(...);
    wait_for_cpu_func_to_finish();
    cudaDeviceSynchronize();

    meter.stop();   // Measurement Stop!

    meter.calc();
    BUILD_SYSTEM_ENERGY_MODEL(meter.get_value());
}
Unification → Reuse Legacy Autotuning Framework
Figure 1: Unification allows an empirical autotuning framework to switch to multiple meter types on different hardware components.
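The benefit of a unified interface is that tuning code can be written once against the four functions and instantiated with any meter type. The sketch below shows this with a template parameter. `profile_variants` is a hypothetical helper, and `WorkUnitMeter` is a deterministic stand-in that counts abstract work units so the example stays self-contained; neither is part of MeterPU or SkePU.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in meter for this sketch only: instead of reading a real sensor
// it reports how many work units the measured code region recorded.
class WorkUnitMeter {
public:
    WorkUnitMeter() : begin_(0), end_(0), value_(0.0) {}
    void start() { begin_ = counter(); }
    void stop()  { end_ = counter(); }
    void calc()  { value_ = static_cast<double>(end_ - begin_); }
    const double& get_value() const { return value_; }
    // Global work counter that measured code increments.
    static long& counter() { static long c = 0; return c; }
private:
    long begin_, end_;
    double value_;
};

// Generic profiling step of an empirical autotuner. The only requirement
// on MeterT is the start/stop/calc/get_value interface, so switching the
// metric means switching the template argument.
template <class MeterT>
std::vector<double> profile_variants(
        const std::vector<std::function<void()> >& variants) {
    std::vector<double> costs;
    for (std::size_t i = 0; i < variants.size(); ++i) {
        MeterT meter;
        meter.start();
        variants[i]();
        meter.stop();
        meter.calc();
        costs.push_back(static_cast<double>(meter.get_value()));
    }
    return costs;
}
```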
SkePU Reduce Skeleton (A Single Skeleton Call)
(a) Time tuning for Reduce. [Plot: time (us), average of 100 runs, against problem size for reduce; series: CPU, OMP, CUDA, Selection.]
(b) Energy tuning for Reduce. [Plot: energy (mJ), average of 1000 runs, against problem size for reduce; series: CPU, OMP, CUDA, Selection.]
For further results, see the paper.
The empirical autotuning framework built for time optimization is reused for energy optimization.
LU decomposition (Multiple Skeleton Calls)
(a) Time tuning for LU decomposition. [Plot: time (us), average of 3000 runs, against problem size for LU decomposition; series: CPU, OMP, CUDA, Selection.]
(b) Energy tuning for LU decomposition. [Plot: energy (mJ), average of 1000 runs, against problem size for LU decomposition; series: CPU, OMP, CUDA, Selection.]
Easy switching for optimization goals.
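Once costs have been measured per backend, the selection step itself can be very small. The following is a minimal sketch, assuming the tuner simply picks the backend with the lowest measured cost for a given problem size. `BackendCost` and `select_backend` are hypothetical names, and the actual selection logic in SkePU may be more elaborate.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Measured cost of one backend for one problem size. Whether the number
// is a time or an energy depends on which meter produced it.
struct BackendCost {
    std::string backend;  // e.g. "CPU", "OMP", "CUDA"
    double cost;
};

// Returns the backend with the lowest measured cost
// (the first one wins on ties).
std::string select_backend(const std::vector<BackendCost>& costs) {
    if (costs.empty()) throw std::invalid_argument("no measurements");
    std::size_t best = 0;
    for (std::size_t i = 1; i < costs.size(); ++i)
        if (costs[i].cost < costs[best].cost) best = i;
    return costs[best].backend;
}
```

Because the function only compares numbers, the same code serves time tuning and energy tuning; only the source of the measurements changes.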
Summary of Contributions
MeterPU hides the complexity of platform-specific energy measurement, especially for GPUs.
MeterPU enables reuse of legacy empirical autotuning frameworks, such as the one in SkePU.
With MeterPU, SkePU offers the first energy-tuned skeletons, as far as we know.
Switching the optimization goal is easy, which facilitates building time and energy models for multi-objective optimization.
Open source, download at: http://www.ida.liu.se/labs/pelab/meterpu
Bibliography
[1] Martin Burtscher, Ivan Zecena, and Ziliang Zong. Measuring GPU power with the K20 built-in sensor. In Proc. Workshop on General Purpose Processing Using GPUs (GPGPU-7). ACM, March 2014.
[2] Lu Li and Christoph Kessler. Validating Energy Compositionality of GPU Computations. In Proc. HIPEAC Workshop on Energy Efficiency with Heterogeneous Computing (EEHCO-2015), January 2015.
VectorPU: A Generic and Efficient Data-Container and Component Model for Transparent Data Transfer on GPU-based Heterogeneous Systems
Lu Li, Christoph Kessler
Benchmark Results
Figure 1: Conjugate Gradient Benchmark. A Comparison with Nvidia’s UM.
[Plot: speedup of VectorPU relative to unified memory (y-axis, 0 to 4); x-axis labels: Laptop A AGC Triolith.]