A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures
Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, David Brooks
Harvard University

Beyond Homogeneous Parallelism
[Figure: General-Purpose Cores (CPU), Programmable Accelerators (DSP, GPU), and Application-Specific Accelerators (ASIP, ASIC), compared on Flexibility/Programmability, Energy Efficiency, and Design Cost.]

Today’s SoC: OMAP 4 SoC
[Figure: OMAP 4 block diagram. Labels: ARM Cores, GPU, DSP, DSP, System Bus, Secondary Bus (x2), Tertiary Bus, DMA (x2), SD, USB, Audio, Video, Face, Imaging.]

Today’s SoC
CPU + L2$ + GPU: 39%
Other Blocks: 61%
[Figure labels: Apple A7; Harvard VLSI-ARCH Group SoC Tapeout]

Today’s SoC
[Figure: SoC block diagram. Labels: CPU (x2), GPU/DSP, Buses, Mem Interface, and nine Acc (accelerator) blocks.]

Future Accelerator-Centric Architectures
Flexibility, Design Cost, Programmability
• How to decompose an application into accelerators?
• How to rapidly design lots of accelerators?
• How to design and manage the shared resources?
[Figure: Big Cores, Small Cores, GPU/DSP, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface.]

Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
[Figure: Unmodified C-Code and Accelerator Design Parameters (e.g., # FU, mem. BW) feed Aladdin (Accelerator-Specific Datapath, Private L1/Scratchpad) together with Shared Memory/Interconnect Models; the outputs are Power/Area and Performance.]
Design Cost, Flexibility, Programmability
• “Design Assistant”: Understand Algorithmic-HW Design Space before RTL
• “Accelerator Simulator”: Design Accelerator-Rich SoC Fabrics and Memory Systems

Future Accelerator-Centric Architecture
Aladdin can rapidly evaluate a large design space of accelerator-centric architectures.

Aladdin Overview
Inputs: C Code; Acc Design Parameters
Optimization Phase: C Code -> Optimistic IR -> Initial DDDG -> Idealistic DDDG
Realization Phase: Program Constrained DDDG -> Resource Constrained DDDG -> (Activity) -> Power/Area Models
Outputs: Power/Area; Performance
DDDG: Dynamic Data Dependence Graph

From C to Design Space: IR Dynamic Trace
C Code:
for (i = 0; i < N; ++i)
    c[i] = a[i] + b[i];
IR Dynamic Trace:
0. r0 = 0                 // i = 0
1. r4 = load(r0 + r1)     // load a[i]
2. r5 = load(r0 + r2)     // load b[i]
3. r6 = r4 + r5
4. store(r0 + r3, r6)     // store c[i]
5. r0 = r0 + 1            // ++i
6. r4 = load(r0 + r1)     // load a[i]
7. r5 = load(r0 + r2)     // load b[i]
8. r6 = r4 + r5
9. store(r0 + r3, r6)     // store c[i]
10. r0 = r0 + 1           // ++i
…
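
A dynamic data dependence graph can be derived from a trace like this one by linking each instruction to the most recent writer of every register it reads. The sketch below is a minimal illustration of that idea, not Aladdin's implementation; the tuple encoding of the trace and the name `build_dddg` are assumptions made for this example.

```python
# Each entry: (node_id, op, destination register or None, source registers).
# r1, r2, r3 hold the base addresses of a, b, c and are never written here.
TRACE = [
    (0, "mov", "r0", []),
    (1, "load", "r4", ["r0", "r1"]),
    (2, "load", "r5", ["r0", "r2"]),
    (3, "add", "r6", ["r4", "r5"]),
    (4, "store", None, ["r0", "r3", "r6"]),
    (5, "add", "r0", ["r0"]),
    (6, "load", "r4", ["r0", "r1"]),
    (7, "load", "r5", ["r0", "r2"]),
    (8, "add", "r6", ["r4", "r5"]),
    (9, "store", None, ["r0", "r3", "r6"]),
    (10, "add", "r0", ["r0"]),
]


def build_dddg(trace):
    """Return the set of read-after-write edges (producer, consumer)."""
    last_writer = {}  # register name -> node that most recently wrote it
    edges = set()
    for node, _op, dest, sources in trace:
        for reg in sources:
            if reg in last_writer:
                edges.add((last_writer[reg], node))
        if dest is not None:
            last_writer[dest] = node
    return edges


edges = build_dddg(TRACE)
for producer, consumer in sorted(edges):
    print(f"{producer} -> {consumer}")
```

Only true dependences are recorded; register reuse (r4, r5, r6 are overwritten every iteration) does not create edges, which is why the graph exposes more parallelism than the trace order suggests.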

From C to Design Space: Initial DDDG
[Figure: Initial DDDG for three iterations of the loop. Nodes, numbered as in the IR trace: 0. i=0; 1. ld a, 2. ld b, 3. +, 4. st c; 5. i++; 6. ld a, 7. ld b, 8. +, 9. st c; 10. i++; 11. ld a, 12. ld b, 13. +, 14. st c.]
![Page 16: Yakun Sophia Shao, Brandon Reagen , Gu-Yeon Wei, David Brooks Harvard University](https://reader035.vdocuments.net/reader035/viewer/2022062218/56816724550346895ddbae9b/html5/thumbnails/16.jpg)
0. i=05. i+
+
10. i++
11. ld a 12. ld b
13. +
14. st c
6. ld a 7. ld b
8. +
9. st c
1. ld a 2. ld b
3. +
4. st c
C Code:for(i=0; i<N; ++i) c[i] = a[i] + b[i];
IR Trace:0. r0=0 //i = 01. r4=load (r0 + r1) //load a[i]2. r5=load (r0 + r2) //load b[i]3. r6=r4 + r54. store(r0 + r3, r6) //store c[i]5. r0=r0 + 1 //++i6. r4=load(r0 + r1) //load a[i]7. r5=load(r0 + r2) //load b[i]8. r6=r4 + r59. store(r0 + r3, r6) //store c[i]10.r0 = r0 + 1 //++i…
0. i=0
5. i++ 10. i++
11. ld a 12. ld b
13. +
14. st c
6. ld a 7. ld b
8. +
9. st c
1. ld a 2. ld b
3. +
4. st c
16
From C to Design SpaceIdealistic DDDG
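
The effect of going from the initial to the idealistic DDDG can be seen by comparing the longest dependence chain in each graph. The sketch below assumes unit latency per node and hand-derives the edges from the IR trace; the helper names `iteration_edges` and `critical_path` are hypothetical, and modeling the idealistic graph by dropping the index-to-index edges is an assumption for illustration.

```python
from collections import defaultdict


def iteration_edges(index_node, first):
    """Edges inside one iteration: index -> loads/store, loads -> add -> store."""
    ld_a, ld_b, add, st = first, first + 1, first + 2, first + 3
    return [(index_node, ld_a), (index_node, ld_b), (index_node, st),
            (ld_a, add), (ld_b, add), (add, st)]


def critical_path(edges):
    """Longest dependence chain, counted in nodes, with unit latency per node."""
    preds = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        preds[dst].append(src)
        nodes.update((src, dst))
    depth = {}
    # Every edge goes from a lower to a higher node id, so id order is topological.
    for node in sorted(nodes):
        depth[node] = 1 + max((depth[p] for p in preds[node]), default=0)
    return max(depth.values())


body = iteration_edges(0, 1) + iteration_edges(5, 6) + iteration_edges(10, 11)
index_chain = [(0, 5), (5, 10)]  # each i++ reads the previous value of i

initial = body + index_chain
idealistic = body  # dependences between loop index updates removed

print(critical_path(initial), critical_path(idealistic))
```

For three iterations the chain shrinks from 6 nodes to 4, and in the idealistic graph it no longer grows with the iteration count.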

From C to Design Space: Optimization Phase: C->IR->DDDG
• Include application-specific customization strategies.
• Node-Level:
  – Bit-width Analysis
  – Strength Reduction
  – Tree-height Reduction
• Loop-Level:
  – Remove dependences between loop index variables
• Memory Optimization:
  – Memory-to-Register Conversion
  – Store-Load Forwarding
  – Store Buffer
• Extensible
  – e.g. Model CAM accelerator by matching nodes in DDDG
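
As one concrete node-level example, tree-height reduction rebalances a chain of associative operations so that fewer of them sit on the critical path. The sketch below is a generic illustration of the technique on a sum of named operands, not code from Aladdin; `chain_depth` and `balanced_tree` are hypothetical names.

```python
def chain_depth(num_operands):
    """Adder levels when summing left to right: ((a + b) + c) + d ..."""
    return max(num_operands - 1, 0)


def balanced_tree(operands):
    """Rebuild the sum as a balanced tree; return (expression, adder levels)."""
    level = [(name, 0) for name in operands]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            (lhs, dl), (rhs, dr) = level[i], level[i + 1]
            nxt.append((f"({lhs} + {rhs})", max(dl, dr) + 1))
        if len(level) % 2:  # an odd operand carries over to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]


expr, depth = balanced_tree(list("abcdefgh"))
print(expr)
print("balanced levels:", depth, "chain levels:", chain_depth(8))
```

With eight operands the dependence depth drops from 7 adder levels to 3. The rewrite relies on associativity, so it is exact for integer addition but can change rounding for floating-point sums.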

From C to Design Space: One Design
Acc Design Parameters: Memory BW <= 2, 1 Adder
[Figure: Idealistic DDDG for four iterations (nodes 0-19) next to its schedule under these parameters, with a Resource Activity chart of MEM and + operations per Cycle.]

From C to Design Space: Another Design
Acc Design Parameters: Memory BW <= 4, 2 Adders
[Figure: Idealistic DDDG for four iterations (nodes 0-19) next to its schedule under these parameters, with a Resource Activity chart of MEM and + operations per Cycle.]
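
The two designs above differ only in their resource limits, which can be modeled as a resource-constrained schedule of the same idealistic DDDG. The sketch below uses greedy list scheduling and makes several assumptions that are not from the slides: every node has unit latency, index updates consume no limited resource, and ready nodes are picked in trace order. The function names are hypothetical.

```python
def idealistic_dddg(iterations):
    """Node kinds and predecessors for c[i] = a[i] + b[i], numbered as in the trace."""
    kind, preds = {}, {}
    for it in range(iterations):
        idx = 5 * it  # node 0 is i=0; nodes 5, 10, 15 are i++
        ld_a, ld_b, add, st = idx + 1, idx + 2, idx + 3, idx + 4
        kind.update({idx: "idx", ld_a: "mem", ld_b: "mem", add: "add", st: "mem"})
        preds.update({idx: [], ld_a: [idx], ld_b: [idx],
                      add: [ld_a, ld_b], st: [idx, add]})
    return kind, preds


def schedule(kind, preds, mem_bw, adders):
    """Greedy list scheduling with unit latency; returns {node: cycle}."""
    limit = {"mem": mem_bw, "add": adders, "idx": len(kind)}  # idx is unconstrained
    done = {}
    cycle = 0
    while len(done) < len(kind):
        used = {"mem": 0, "add": 0, "idx": 0}
        for node in sorted(kind):
            if node in done:
                continue
            # A node is ready once all predecessors finished in an earlier cycle.
            ready = all(p in done and done[p] < cycle for p in preds[node])
            if ready and used[kind[node]] < limit[kind[node]]:
                done[node] = cycle
                used[kind[node]] += 1
        cycle += 1
    return done


kind, preds = idealistic_dddg(4)
for mem_bw, adders in [(2, 1), (4, 2)]:
    done = schedule(kind, preds, mem_bw, adders)
    print(f"Memory BW <= {mem_bw}, adders = {adders}: {max(done.values()) + 1} cycles")
```

Under these assumptions the four iterations take 8 cycles with the first parameter set and 5 cycles with the second; the absolute numbers depend on the latency and tie-breaking choices and are not taken from the slides.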

From C to Design Space: Realization Phase: DDDG->Estimates
• Constrain the DDDG with program and user-defined resource constraints
• Program Constraints
  – Control Dependence
  – Memory Ambiguation
• Resource Constraints
  – Loop-level Parallelism
  – Loop Pipelining
  – Memory Ports
  – # of FUs (e.g., adders, multipliers)
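
A program constraint such as memory ambiguation can be thought of as adding extra edges to the DDDG. The sketch below shows one conservative policy, assumed here for illustration and not taken from Aladdin's source: if any dynamic instance of a static store/load pair touches the same address, every later instance of that load is ordered after every earlier instance of that store. The histogram-style trace, the static ids `ld_h`/`st_h`, and the function name are all hypothetical.

```python
def ambiguation_edges(mem_ops):
    """mem_ops: (node, static_id, kind, address) tuples in trace order.

    Returns extra (store_node, load_node) ordering edges.
    """
    # Pass 1: find static store/load pairs that alias at least once.
    aliasing_pairs = set()
    for i, (_, s_id, s_kind, s_addr) in enumerate(mem_ops):
        if s_kind != "store":
            continue
        for _, l_id, l_kind, l_addr in mem_ops[i + 1:]:
            if l_kind == "load" and l_addr == s_addr:
                aliasing_pairs.add((s_id, l_id))

    # Pass 2: order every later instance of such a load after the store.
    edges = set()
    for i, (s_node, s_id, s_kind, _) in enumerate(mem_ops):
        if s_kind != "store":
            continue
        for l_node, l_id, l_kind, _ in mem_ops[i + 1:]:
            if l_kind == "load" and (s_id, l_id) in aliasing_pairs:
                edges.add((s_node, l_node))
    return edges


# Hypothetical trace of hist[k] += 1 for k = 0x10, 0x20, 0x10.
histogram = [
    (1, "ld_h", "load", 0x10), (2, "st_h", "store", 0x10),
    (3, "ld_h", "load", 0x20), (4, "st_h", "store", 0x20),
    (5, "ld_h", "load", 0x10), (6, "st_h", "store", 0x10),
]
# Vector add from the slides: stores to c never alias loads from a or b.
vector_add = [
    (1, "ld_a", "load", 0x100), (2, "ld_b", "load", 0x200), (4, "st_c", "store", 0x300),
    (6, "ld_a", "load", 0x104), (7, "ld_b", "load", 0x204), (9, "st_c", "store", 0x304),
]

hist_edges = ambiguation_edges(histogram)
vadd_edges = ambiguation_edges(vector_add)
print(sorted(hist_edges), sorted(vadd_edges))
```

The vector-add loop gains no edges, so its iterations stay independent, while the histogram-style trace is serialized even across iterations that happen to touch different addresses.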

From C to Design Space: Power-Performance per Design
[Figure: Power versus Cycle plot with one point per design: Memory BW <= 4, 2 Adders; and Memory BW <= 2, 1 Adder.]

From C to Design Space: Design Space of an Algorithm
[Figure: Power versus Cycle plot of the design points for one algorithm.]
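
Sweeping the design parameters yields one (cycles, power) point per design, and the interesting designs are those on the Pareto frontier. The sketch below illustrates the sweep and the frontier extraction only: the cycle estimate is a simple throughput bound for the vector-add loop, and `power_proxy` is a placeholder cost in arbitrary units, not a calibrated power model and not data from the slides.

```python
from itertools import product
from math import ceil

N = 64  # loop trip count, chosen arbitrarily for the example


def cycles(mem_bw, adders):
    """Throughput-bound estimate: 3 memory ops and 1 add per iteration."""
    return max(ceil(3 * N / mem_bw), ceil(N / adders))


def power_proxy(mem_bw, adders):
    """Placeholder cost in arbitrary units (NOT a real power model)."""
    return 2 * mem_bw + adders


def pareto(points):
    """Keep designs that no other design beats on both cycles and power."""
    front = []
    for p in points:
        dominated = any(
            q["cycles"] <= p["cycles"] and q["power"] <= p["power"]
            and (q["cycles"] < p["cycles"] or q["power"] < p["power"])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda d: d["cycles"])


designs = [
    {"mem_bw": bw, "adders": ad, "cycles": cycles(bw, ad), "power": power_proxy(bw, ad)}
    for bw, ad in product([1, 2, 4, 8], [1, 2, 4])
]
front = pareto(designs)
for d in front:
    print(d)
```

Designs that add resources without removing the bottleneck, such as extra adders behind a narrow memory interface, fall off the frontier because they cost more without running faster.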

Aladdin Validation
[Figure: validation flow. Aladdin path: C Code -> Aladdin -> Power/Area, Performance. Reference path: C Code -> RTL Designer / HLS C Tuning -> Vivado HLS -> Verilog -> ModelSim -> Activity -> Design Compiler -> Power/Area, Performance.]
[Figures: Aladdin validation results (two slides of plots).]

Aladdin enables rapid design space exploration for accelerators.
[Figure: the validation flow annotated with run times. Aladdin path (C Code -> Aladdin): 7 mins. RTL path (RTL Designer / HLS C Tuning -> Vivado HLS -> Verilog -> ModelSim -> Design Compiler): 52 hours.]
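
Taking the two times on this slide at face value, and assuming the 7 mins label belongs to the Aladdin path and the 52 hours label to the RTL path, the ratio works out as follows.

```python
aladdin_minutes = 7
rtl_flow_minutes = 52 * 60  # 52 hours expressed in minutes

speedup = rtl_flow_minutes / aladdin_minutes
print(f"RTL flow / Aladdin = {speedup:.0f}x")
```

That is a difference of roughly 450x in turnaround time for the same exploration.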

Aladdin enables pre-RTL simulation of accelerators with the rest of the SoC.
[Figure: the accelerator-centric architecture (GPU, Big Cores, Small Cores, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface) annotated with existing simulators: GPGPU-Sim, MARSSx86..., XIOSim…, Cacti/Orion2, DRAMSim2.]

Modeling Accelerators in a SoC-like Environment
[Figure: system diagrams. Labels: Acc, Core, Cache, Memory.]

Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
• Architectures with 1000s of accelerators will be radically different; new design tools are needed.
• Aladdin enables rapid design space exploration of future accelerator-centric platforms.
• You can find Aladdin at http://vlsiarch.eecs.harvard.edu/aladdin