Efficient Performance Scaling of Future CGRAs for Mobile Applications
Yongjun Park, Jason Jong Kyu Park, and Scott Mahlke. December 11, 2012. University of Michigan, Ann Arbor.
![Page 1: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/1.jpg)
University of Michigan, Electrical Engineering and Computer Science
Efficient Performance Scaling of Future CGRAs for Mobile Applications
Yongjun Park, Jason Jong Kyu Park, and Scott Mahlke
December 11, 2012
University of Michigan, Ann Arbor
![Page 2: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/2.jpg)
Convergence of Functionalities
The convergence of functionalities demands a flexible solution because of design cost and programmability requirements
[Figure: anatomy of an iPhone 4, with functional blocks for 4G wireless, navigation, audio, video, and 3D]
Flexible Accelerator!
![Page 3: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/3.jpg)
CGRA: Attractive Alternative to ASICs
• Array of PEs (processing elements) connected by a mesh-like interconnect
• High throughput from a large number of resources
• Distributed hardware offers low cost and low power consumption
• High flexibility through dynamic reconfiguration
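The PE array described above can be pictured with a small model. This is a minimal sketch that assumes a plain 4-neighbor 2D mesh; the slides only say "mesh-like", so the exact connectivity, the grid size, and the names `mesh_neighbors` and `hop_distance` are illustrative assumptions.

```python
def mesh_neighbors(rows, cols):
    """Map each PE (row, col) to the PEs it links to directly.

    Assumes a plain 4-neighbor mesh with no wrap-around or diagonal links.
    """
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # Keep only the candidates that fall inside the array.
            neighbors[(r, c)] = [
                (nr, nc)
                for nr, nc in candidates
                if 0 <= nr < rows and 0 <= nc < cols
            ]
    return neighbors


def hop_distance(a, b):
    """Fewest links a value crosses between PEs a and b in such a mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


# Hypothetical 4x4 array: corner PEs have 2 links, interior PEs have 4.
mesh = mesh_neighbors(4, 4)
corner_to_corner = hop_distance((0, 0), (3, 3))
```

In this model the hop count between distant PEs grows with the array dimensions, which is one way to see why routing becomes a concern as the array gets larger.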
![Page 4: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/4.jpg)
Bridging the Gap Between Market Demand and Computation Power
[Figure: computational requirement (0 to 2000) by year, 2009 to 2015, for CPU, audio, and video]
How can performance be scaled while retaining energy efficiency?
[Canali, Internet Computing Magazine, IEEE, 2009]
![Page 5: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/5.jpg)
Agenda: Scaling the Energy Efficiency of CGRAs
• Investigate the key factors and their feasibility in terms of performance and power efficiency
  – Hardware scalability vs. hardware flexibility
    • Interconnection topology
    • Complex PE vs. simple PE
    • Vector memory operation support
    • Homogeneity vs. heterogeneity
![Page 6: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/6.jpg)
Experimental Setup
• Target applications
  – Media benchmarks: AAC decoder, H.264 decoder, and 3D rendering
  – Game physics benchmarks: line of sight, convolution, and conjugate
• Target architecture: various types of CGRAs
  – 16 to 64 heterogeneous/homogeneous resources
• IMPACT frontend compiler + edge-centric modulo scheduler
• Power measurement
  – IBM 65nm technology @ 200MHz/1V
![Page 7: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/7.jpg)
Q1: Interconnection Topology
• Overview
  – Routing overhead limits performance as the size of the CGRA increases
  – Common solution: clustering
  – What is the optimal interconnection topology?
• Methodology
  – Compare the performance of three different clustering schemes
    • Baseline
    • Fixed partition: the CGRA is physically split into multiple partitions
    • Flexible partition: the number of partitions can be changed dynamically from 1 to 8
  – Total number of PEs: 4 to 128
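The partitioning schemes above can be sketched as a small helper that lists which partition counts an array supports. This is only an illustration under stated assumptions: partition counts are taken to be powers of two, each partition is assumed to hold at least 4 PEs, and the selection policy in `choose_partitions` is hypothetical. None of the function names or defaults come from the slides.

```python
def partition_options(total_pes, max_partitions=8, min_pes_per_partition=4):
    """List the partition counts available for an array of total_pes PEs.

    Assumptions (not from the slides): counts are powers of two and
    every partition keeps at least min_pes_per_partition PEs.
    """
    options = []
    count = 1
    while count <= max_partitions and total_pes // count >= min_pes_per_partition:
        options.append(count)
        count *= 2
    return options


def choose_partitions(total_pes, has_dlp):
    """Hypothetical policy for a flexibly partitioned array.

    Loops with data-level parallelism use the largest partition count so
    independent iterations run side by side; other loops keep the whole
    array as a single partition.
    """
    options = partition_options(total_pes)
    return options[-1] if has_dlp else options[0]


# A fixed-partition design would commit to one entry of this list at
# design time; a flexible design can switch between entries per loop.
counts_for_16 = partition_options(16)
counts_for_128 = partition_options(128)
```

A real mapper would more likely pick the count per loop from scheduling results rather than from a single flag; the flag keeps the sketch short.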
![Page 8: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/8.jpg)
Q1: Interconnection Topology
[Figure: an application's DLP loops and no-DLP loops under the baseline, fixed partition, and flexible mapping]
![Page 9: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/9.jpg)
Performance Comparison (Base, Fixed, Flex)
[Figure: relative performance of the base, fixed-partition (2, 4, 8 partitions), and flex configurations on architectures of 4, 8, 16, 32, 64, and 128 PEs; two panels: Media and Game]
• Fixed partitioning doesn’t always show better performance.
• Flexible architectures show the best performance and retain scalability.
![Page 10: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/10.jpg)
Q2: Complex PEs vs. Simple PEs
• Overview
  – CGRAs with complex PEs have been introduced
    • Two-level interconnect
    • The number of RFs can decrease
    • Multiple instructions can be chained
  – Challenge: resource utilization
  – Goal: determine the feasibility of complex PEs in terms of energy consumption
• Methodology
  – Compare the energy consumption of different PE styles
    • Number of FUs inside a PE: 1 to 6
    • Uniform vs. optimized
![Page 11: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/11.jpg)
PE Designs
[Figure: PE designs built from a register file, simple integer ALUs, and simple integer + complex ALUs]
![Page 12: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/12.jpg)
Energy Consumption
• Energy consumption does not increase dramatically as the number of PEs increases
• Within a 1.5x energy budget, complex PEs with 2 to 3 FUs can also be suitable solutions
[Figure: relative energy consumption (0.5 to 4) vs. number of FUs per PE (1 to 6) for Media uniform, Game uniform, Media optimized, and Game optimized; the 1.5x energy level is marked]
![Page 13: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/13.jpg)
Q3: SIMD Memory Support
• Overview
  – SIMD memory support provides lower power and fewer instructions
  – Challenge: degree of DLP
  – Goal: determine the feasibility of SIMD memory access in terms of energy consumption
• Methodology
  – Compare the energy consumption of different SIMD widths: 1 to 16
![Page 14: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/14.jpg)
Relative Energy Consumption
[Figure: relative power per access, relative number of accesses, and relative total energy vs. vector width (1, 2, 4, 8, 16)]
• Total energy consumption at wider vector widths can be at a level similar to that of a scalar memory unit
  – A high degree of spatial locality can compensate for the power overhead
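The trade-off in this bullet can be written out as a short calculation. The sketch below assumes that relative total energy is the product of relative power per access and relative number of accesses, which holds if every access takes the same time. The function names and all input values are illustrative placeholders, not measurements from the slides.

```python
def accesses_after_vectorizing(scalar_accesses, width, vectorizable_fraction):
    """Accesses left when part of the scalar accesses is packed width-wide.

    vectorizable_fraction stands in for the degree of spatial locality:
    1.0 means every access can be merged into a full-width vector access.
    """
    packed = scalar_accesses * vectorizable_fraction / width
    unpacked = scalar_accesses * (1.0 - vectorizable_fraction)
    return packed + unpacked


def relative_total_energy(rel_power_per_access, rel_num_accesses):
    """Energy relative to the scalar unit, assuming equal time per access."""
    return rel_power_per_access * rel_num_accesses


# Placeholder inputs: a 4-wide unit that costs 3x the power per access.
full_locality = relative_total_energy(
    3.0, accesses_after_vectorizing(1.0, width=4, vectorizable_fraction=1.0)
)
half_locality = relative_total_energy(
    3.0, accesses_after_vectorizing(1.0, width=4, vectorizable_fraction=0.5)
)
```

With these placeholder inputs the wide unit costs less energy than the scalar unit only when most accesses can be packed; with half of them left scalar, the extra power per access dominates.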
![Page 15: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/15.jpg)
Conclusion
• Flexible partitioning should be supported to further improve performance.
• Complex PEs can be more energy efficient even at low resource utilization.
• Wide SIMD memory support can be realistic given the characteristics of mobile applications.
![Page 16: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/16.jpg)
Questions?
For more information: http://cccp.eecs.umich.edu
![Page 17: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/17.jpg)
Q1: Homogeneity vs. Heterogeneity
• Overview
  – Heterogeneous CGRAs are common
  – No experiments have examined the effect of heterogeneity relative to homogeneity
• Methodology
  – Start from a 16-PE homogeneous CGRA (integer ALU, complex ALU, memory unit)
  – Decrease the number of PEs supporting the complex ALU and memory unit
  – Performance goal: 80% of the homogeneous CGRA's performance
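The methodology above amounts to a search for the fewest specialized units that still meet the performance goal. The sketch below is one way to express that search. The helper name, the assumption that performance does not improve as units are removed, and the toy performance table are all hypothetical; the table values are made up for illustration and are not results from the slides.

```python
def fewest_units_meeting_goal(unit_counts, relative_performance, goal=0.8):
    """Return the smallest unit count whose performance still meets goal.

    relative_performance is any callable that maps a unit count to
    performance relative to the homogeneous array (for example, a lookup
    into scheduler results). The search stops at the first miss, which
    assumes performance never improves when units are removed.
    """
    fewest = None
    for count in sorted(unit_counts, reverse=True):
        if relative_performance(count) < goal:
            break
        fewest = count
    return fewest


# Made-up numbers for a 16-PE array, used only to exercise the helper.
toy_performance = {16: 1.00, 8: 0.95, 4: 0.85, 2: 0.60, 1: 0.40}
fewest = fewest_units_meeting_goal(toy_performance, toy_performance.get)
reduction = 1 - fewest / 16
```

In practice the callable would wrap a compile-and-schedule run per configuration, so each lookup is expensive and stopping at the first miss saves work.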
How about performance?
![Page 18: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/18.jpg)
Performance Degradation
[Figure: relative performance (0 to 1) of the base configuration and of mul_8 to mul_1, mem_8 to mem_1, and exp_8 to exp_1; two panels: Media and Game]
• The performance degradation is not substantial
  – Performance is normally not constrained by complex instructions
• Performance degradation depends much more on memory operations
• For 80% of the baseline performance, the number of both complex and memory units can be decreased by up to 75%
![Page 19: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/19.jpg)
Conclusion
• Heterogeneous FU organization is highly effective.
• Flexible partitioning should be supported to further improve performance.
• Complex PEs can be more energy efficient even at low resource utilization.
• Wide SIMD memory support can be realistic given the characteristics of mobile applications.
![Page 20: Efficient Performance Scaling of Future CGRAs for Mobile Applications](https://reader036.vdocuments.net/reader036/viewer/2022070420/56815e38550346895dcc9d02/html5/thumbnails/20.jpg)
CGRA: Attractive Alternative to ASICs
• Suitable for running multimedia applications for future embedded systems
• High throughput, low power consumption, high flexibility
• Viterbi at 80 Mbps, H.264 at 30 fps, 50-60 MOps/mW
• Morphosys: 8x8 array with RISC processor
• SiliconHive: hierarchical systolic array
• ADRES: 4x4 array with tightly coupled VLIW