MEMOCode 2007 Design Contest – MIT Submission
N. Dave, K. Fleming, M. King, M. Pellauer, M. Vijayaraghavan
Resources
• Five “insufficiently busy” grad students
• Three weeks
  – Nine man-weeks used
• Bluespec expertise
  – Easy parameterization / fast concurrency
• The promise of food
Basic Facts
• Matrix multiply is embarrassingly parallel
  – More multipliers and adders should help
• Matrices are too large to be stored in FPGA memory
• Time was short; the design needed to be partitioned to make use of all designers
  – Latency-insensitive methodology
Outline
• The Problem
• Partitioning the Computation
• Architectural Overview
• Implementation
• Results
• Things We Wish We Could Do
The Standard N³ Algorithm

```c
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      c[i][j] += a[i][k] * b[k][j];
```
…and blocking is well understood. Split each loop into a block loop and an offset loop:

```c
for (int ib = 0; ib < N; ib += K)          // split
  for (int io = 0; io < K; io++)
    for (int jb = 0; jb < N; jb += K)      // split
      for (int jo = 0; jo < K; jo++)
        for (int kb = 0; kb < N; kb += K)
          for (int k = 0; k < K; k++)
            c[ib + io][jb + jo] += a[ib + io][kb + k] * b[kb + k][jb + jo];
```

Swapping the offset loops inward moves all block loops outermost, which reduces memory traffic and exposes a K×K×K kernel:

```c
for (int ib = 0; ib < N; ib += K)
  for (int jb = 0; jb < N; jb += K)
    for (int kb = 0; kb < N; kb += K)
      // kernel
      for (int io = 0; io < K; io++)
        for (int jo = 0; jo < K; jo++)
          for (int k = 0; k < K; k++)
            c[ib + io][jb + jo] += a[ib + io][kb + k] * b[kb + k][jb + jo];
```
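A runnable sketch of the blocked version (sizes N = 8 and K = 4 are our own choices, picked so K divides N) can be checked against the naive triple loop:

```c
#define N 8   /* full matrix size (small, for the sketch) */
#define K 4   /* block size */

/* Blocked multiply with all block loops hoisted outermost; the three
   innermost loops are the K x K x K kernel. */
void matmul_blocked(const int a[N][N], const int b[N][N], int c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0;
    for (int ib = 0; ib < N; ib += K)
        for (int jb = 0; jb < N; jb += K)
            for (int kb = 0; kb < N; kb += K)
                for (int io = 0; io < K; io++)        /* kernel */
                    for (int jo = 0; jo < K; jo++)
                        for (int k = 0; k < K; k++)
                            c[ib + io][jb + jo] +=
                                a[ib + io][kb + k] * b[kb + k][jb + jo];
}
```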
Outline
• The Problem
• Partitioning the Computation
• Architectural Overview
• Implementation
• Results
• Things We Wish We Could Do
Hardware Facts
• If we accelerate the computation, DRAM access becomes the bottleneck
• The CPU has slow access to DRAM
  – HW can directly access DRAM via the PLB (Processor Local Bus)
Hardware Facts
• CPU-to-HW memory bandwidth is bounded at 150 MB/s
  – With software overhead in data orchestration, probably only 50% of this bandwidth can be used
• The memory bus supports 800 MB/s
  – A direct interface can provide up to a 5x improvement over software transfer
• The special hardware need not be complicated, because the memory access patterns are simple
High-Level Architecture

[Block diagram: the CPU and several functional units connect through interconnection logic to the PLB, which connects to DRAM]
Architecture

[Block diagram: the CPU drives a controller/feeder over the PLB; a switch connects the feeder to the functional units, and a PLB master connects the switch to DRAM]
Software Example (C = A × B)

[Diagram: the CPU sends the instruction sequence “Ld A 0, Ld B 0, St C 0, MAC 0” to the feeder; blocks of A and B flow from DRAM through the PLB master and switch into a functional unit, and the resulting C block flows back]

In reality, the execution of several blocks will be overlapped.
Outline
• The Problem
• Partitioning the Computation
• Architectural Overview
• Implementation
• Results
• Things We Wish We Could Do
Functional Unit - Design
• Instructions:
  – Load operand (memory)
  – Store operand (memory)
  – Zero (C = 0)
  – Multiply-Add-Accumulate (C += A*B)
• Two FSMs (Read/Write and Compute)
  – Allows overlapping of instructions
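A hypothetical C encoding of this four-instruction set (the type and field names are ours, not the design's):

```c
/* Hypothetical encoding of the FU's four instructions (names ours) */
typedef enum { FU_LOAD, FU_STORE, FU_ZERO, FU_MAC } fu_op_t;

typedef struct {
    fu_op_t  op;
    unsigned addr;  /* DRAM block address; used by LOAD and STORE only */
} fu_instr_t;

/* LOAD/STORE are handled by the read/write FSM, ZERO/MAC by the
   compute FSM; splitting them this way is what lets the two
   instruction classes overlap. */
int goes_to_memory_fsm(fu_op_t op) {
    return op == FU_LOAD || op == FU_STORE;
}
```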
Functional Unit – Algorithm
• Take the algorithm and unroll P loop iterations
• Adder tree of P inputs
  – Critical path grows logarithmically
• Can pipeline
  – Complicated because of parameterization

```c
for (int i = 0; i < K; i++)
  for (int j = 0; j < K; j++)
    for (int k = 0; k < K; k++)
      c[i][j] += a[i][k] * b[k][j];
```
[Diagram: eight multipliers compute A[i][k+p] * B[k+p][j] for p = 0..7; a three-level adder tree sums the products into C[i][j]]
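The dataflow in the adder-tree diagram can be modeled in C (in hardware the tree is combinational; here the levels are just written out explicitly):

```c
#define P 8  /* unrolling factor from the slide */

/* One unrolled kernel step: P multiplies in parallel, then a
   log2(P)-level adder tree producing the partial sum for c[i][j]. */
int dot8(const int a[P], const int b[P]) {
    int m[P], s4[4], s2[2];
    for (int p = 0; p < P; p++) m[p] = a[p] * b[p];            /* multipliers  */
    for (int p = 0; p < 4; p++) s4[p] = m[2*p] + m[2*p + 1];   /* tree level 1 */
    for (int p = 0; p < 2; p++) s2[p] = s4[2*p] + s4[2*p + 1]; /* tree level 2 */
    return s2[0] + s2[1];                                      /* tree level 3 */
}
```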
Functional Unit – Algorithm
• Different algorithm: reorder the multiplies
  – Writes each element of c multiple times
• Unroll by P
  – Same number of adders and multipliers
  – Shorter critical path
• Pipelining is easy
  – 2 stages

```c
for (int i = 0; i < K; i++)
  for (int j = 0; j < K; j++)
    for (int k = 0; k < K; k++)
      c[j][k] += a[i][k] * b[j][i];
```
[Diagram: six independent multiply-add lanes; lane p computes C[j][k+p] += A[i][k+p] * B[j][i], with B[j][i] broadcast to all lanes]
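The reordered kernel step can likewise be modeled in C: one B element is broadcast to P independent multiply-accumulate lanes, so no adder crosses lanes and the critical path is a single multiply plus add regardless of P.

```c
#define P 6  /* number of lanes shown on the slide */

/* One step of the reordered kernel: b_ji is broadcast to P lanes,
   each accumulating into its own c[j][k+p]. */
void mac_lanes(int c_row[P], const int a_row[P], int b_ji) {
    for (int p = 0; p < P; p++)
        c_row[p] += a_row[p] * b_ji;
}
```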
FU Microarchitecture
[Diagram: BRAMs A, B, and C; the LOAD/STORE FSM moves data between the bus and the BRAMs, while the COMPUTE FSM drives the multiplier and adder]
Memory Bus Master (PLB)

• 32-bit bus interface
• 16-word burst transfers
  – Amortize bus setup costs
• DRAM may refresh during a transfer
  – Added burst buffers for rapid recovery

[Diagram: load and store FSMs issue PLB commands through a bus-control FSM; input and output burst buffers sit between the PLB bus and the load/store data paths]
Memory Bus Master (PLB)

• Half of the critical path is through the bus arbiter
  – Beyond our control
• Substantial retiming needed
  – Register pushing
  – State decoupling
• Need fine-grained control over scheduling
Outline
• The Problem
• Partitioning the Computation
• Architectural Overview
• Implementation
• Results
• Things We Wish We Could Do
Design Parameters
• Architecture: number of functional units
• Functional unit: degree of parallelism, matrix size
• Memory bus (PLB) master: matrix memory layout, matrix size
• Switch: number of functional units
• Algorithm generator: block size
Final Results
• 100 MHz
• 1 functional unit
  – 64² subblocks
  – 8 complex multiplies
• Lines of code: 10K total
  – Unit-testing framework: 1.5K
  – C code: 2K
  – BSV: 5.5K
  – Multiple FU implementations: 1K
  – Additional unused hardware: 1K
• More than 3 GOps/s
Performance
| Size  | Time (µs) |
|-------|-----------|
| 64²   | 799       |
| 128²  | 5,120     |
| 256²  | 45,300    |
| 512²  | 332,000   |
| 1024² | 2,710,000 |

125x
Things we would have done with more time
• We believe we could have obtained 10 billion ops per second
• 32-bit PLB → 64-bit PLB
  – Double memory bandwidth
  – Fairly simple improvement
• Multiple clock domains
  – Implemented, but had trouble synthesizing in EDK
• Play with the number of FUs / registers per FU
  – HW is parameterized for this
• Explore alternative machine organizations
• Algorithmic exploration
Fin