MEMOCODE 2007 HW/SW Co-design Contest
Documentation of the submission by
Eric Simpson
Pengyuan Yu
Sumit Ahuja
Sandeep Shukla
Patrick Schaumont
Electrical and Computer Engineering Department
Virginia Tech
Table of Contents
Section 1 Performance Evaluation and Analysis
Section 2 Matrix Multiplication Algorithm Optimization
Section 3 HW/SW System Implementation
Section 4 Co-design Flow and Methodology
Section 5 Conclusion
Section 1 Performance Evaluation and Analysis
Performance Results

| Matrix Size | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|
| Run time (sec), our design (average) | 0.0052 | 0.0322 | 0.2170 | 1.5176 | 11.882 |
| Run time (sec), reference | 0.0346 | 0.6697 | 5.3133 | 42.302 | 338.72 |
| Speedup | 6.65 | 20.8 | 24.5 | 26.9 | 28.5 |

Device Utilization
BRAM: 80 (64 Coprocessor + 16 On-Chip-Memory)
Mult: 128
Performance Calculation
F_CPU-Speed = 1, we used a 300 MHz PPC
F_FPGA-Capacity = 1, we used XUP's XC2VP30
F_FPGA-Speed = 1, we used a 100 MHz clock for bus and coprocessor
Time_Effective = (T_meas,N=1024 + T_meas,N=256 * 64) * F_CPU-Speed * F_FPGA-Capacity * F_FPGA-Speed
= (11.882 + 64 * 0.217) * 1 * 1 * 1
= 25.77 seconds
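The effective-time calculation above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic on the slide; the function and variable names are illustrative and do not come from the contest material.

```python
# Measured run times from the results table, in seconds.
t_meas = {1024: 11.882, 256: 0.217}

# Platform normalization factors; all are 1 for this submission.
f_cpu_speed = 1.0
f_fpga_capacity = 1.0
f_fpga_speed = 1.0


def effective_time(t_1024, t_256, f_cpu, f_cap, f_speed):
    """Add the N=1024 time to 64 times the N=256 time, then apply the factors."""
    return (t_1024 + 64 * t_256) * f_cpu * f_cap * f_speed


t_eff = effective_time(t_meas[1024], t_meas[256],
                       f_cpu_speed, f_fpga_capacity, f_fpga_speed)
print(f"{t_eff:.2f} seconds")  # 25.77 seconds
```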
Performance Results
[Figure: bar chart of speed-up factor against the reference design (blocked by 16) versus test-case matrix size. Values: 6.66 (64), 20.79 (128), 24.49 (256), 26.92 (512), 28.51 (1,024).]
Section 2 Matrix Multiplication Algorithm Optimization
Algorithm Optimization
The algorithm is optimized for the target platform (Virtex-II Pro VP30).
Optimization goal: make the best use of the slow DDR memory interface
• Optimally 128-bit/cycle transfers => 4 complex numbers
• Linear accesses result in better throughput
Utilize as many fast discrete FPGA resources as possible
• 136 18x18 hardware multipliers
• 136 18-kbit Block RAMs
Optimized Algorithm
[Figure: matrices A, B and C, with the following legend]
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Bring in 4 complex numbers from “A”
• Bring in four numbers from “B” and perform the following calculations:
C[0][0] = C[0][0] + A[0][0]*B[0][0]
C[0][1] = C[0][1] + A[0][0]*B[0][1]
C[0][2] = C[0][2] + A[0][0]*B[0][2]
C[0][3] = C[0][3] + A[0][0]*B[0][3]
…
C[7][0] = C[7][0] + A[7][0]*B[0][0]
C[7][1] = C[7][1] + A[7][0]*B[0][1]
C[7][2] = C[7][2] + A[7][0]*B[0][2]
C[7][3] = C[7][3] + A[7][0]*B[0][3]
• Where “A*B” is a complex multiplication.
• 32 complex multiplications in parallel = 128 multiplies, 64 additions/subtractions and 64 accumulates per cycle
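A minimal software sketch of one such step is shown below: eight values from one column of the A block are each multiplied by four B values and accumulated into an 8x4 region of C. The function name `mac_step` and the test data are hypothetical; the hardware performs all 32 complex multiplications in a single cycle, whereas this sketch simply loops and counts the real multiplications.

```python
def mac_step(c_block, a_col, b_vals):
    """Accumulate the outer product of a_col and b_vals into c_block in place.

    Returns the number of real multiplications performed.
    """
    real_mults = 0
    for i, a in enumerate(a_col):
        for j, b in enumerate(b_vals):
            # One complex multiply = 4 real multiplies, 1 subtraction, 1 addition.
            re = a.real * b.real - a.imag * b.imag
            im = a.real * b.imag + a.imag * b.real
            # Accumulating the real and imaginary parts = 2 accumulates.
            c_block[i][j] += complex(re, im)
            real_mults += 4
    return real_mults


# Illustrative data: 8 values of A, 4 values of B, and an 8x4 region of C.
a_col = [complex(i + 1, -i) for i in range(8)]
b_vals = [complex(1, j) for j in range(4)]
c_block = [[0j] * 4 for _ in range(8)]

n_mults = mac_step(c_block, a_col, b_vals)
print(n_mults)  # 128
```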
At this point we have finished calculating the first 8xN slice of C in our coprocessor, and we write the results back to RAM.
Next, we repeat the previous algorithm to calculate the next 8xN slice of C.
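The whole slice-by-slice procedure can be sketched in software as follows. This is a behavioral sketch only, not the contest code: the loop order is inferred from the slides, the names `blocked_matmul` and `naive_matmul` are hypothetical, and the matrix size is assumed to be a multiple of the block dimensions.

```python
import random


def blocked_matmul(a, b, n, rows=8, depth=4, width=4):
    """Compute C = A * B for n x n complex matrices, one rows x n slice of C at a time."""
    c = [[0j] * n for _ in range(n)]
    for i0 in range(0, n, rows):
        # Accumulators for the current slice of C (held in BRAM in the hardware).
        c_slice = [[0j] * n for _ in range(rows)]
        for k0 in range(0, n, depth):
            # Load a rows x depth block of A.
            a_block = [a[i0 + i][k0:k0 + depth] for i in range(rows)]
            for k in range(depth):
                # Rows of B are visited in order, i.e. B is scanned linearly.
                b_row = b[k0 + k]
                for j0 in range(0, n, width):
                    # One step: rows x width complex multiply-accumulates.
                    for i in range(rows):
                        for j in range(j0, j0 + width):
                            c_slice[i][j] += a_block[i][k] * b_row[j]
        # Slice finished: write it back to main memory.
        for i in range(rows):
            c[i0 + i] = c_slice[i]
    return c


def naive_matmul(a, b, n):
    """Reference triple loop used only to check the blocked version."""
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


random.seed(0)
n = 16
a = [[complex(random.randint(-3, 3), random.randint(-3, 3)) for _ in range(n)]
     for _ in range(n)]
b = [[complex(random.randint(-3, 3), random.randint(-3, 3)) for _ in range(n)]
     for _ in range(n)]

c_blocked = blocked_matmul(a, b, n)
c_ref = naive_matmul(a, b, n)
print(c_blocked == c_ref)  # True
```

Small integer-valued entries are used so that the comparison with the reference is exact in floating point.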
Optimized Algorithm
Performs 128 multiplications (32 complex MACs) per cycle, utilizing 128 out of 136 hard multipliers
Linear scan through B matrix (optimizing interface to DDR storage)
Section 3 HW/SW System Implementation
System Architecture
[Figure: system architecture block diagram, including the Processor Local Bus]
Coprocessor Architecture vs. Optimized Algorithm
Minor deviation from the proposed algorithm
• I/O size for coprocessor: B elements are loaded 2 at a time instead of 4
• PLB DMA failed to function, resulting in a much slower {DDR->PPC->Coprocessor FIFO} datapath
• FIFO width of 64 bits => 2-number sends from PPC to Coprocessor FIFO
To maintain the SAME calculation capacity:
• A-Block dimension doubled from 8x4 to 16x4
• C-Slice doubled from 8xN to 16xN
• Still utilizes 128 hardware multipliers
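The bookkeeping behind this trade-off can be checked with a short calculation. The sketch below assumes only what the slides state (a 128-bit transfer carries 4 complex numbers, and one complex multiplication uses 4 real multipliers); the helper name `multipliers_needed` is hypothetical.

```python
REAL_MULTS_PER_COMPLEX_MULT = 4


def multipliers_needed(a_rows, b_values_per_step):
    """Real multipliers busy when every A row value meets every incoming B value."""
    return a_rows * b_values_per_step * REAL_MULTS_PER_COMPLEX_MULT


# A 128-bit transfer carries 4 complex numbers, so one complex number is 32 bits.
bits_per_complex = 128 // 4
# A 64-bit FIFO word therefore carries 2 complex numbers.
b_per_fifo_word = 64 // bits_per_complex

proposed = multipliers_needed(a_rows=8, b_values_per_step=4)
implemented = multipliers_needed(a_rows=16, b_values_per_step=b_per_fifo_word)
print(bits_per_complex, b_per_fifo_word, proposed, implemented)  # 32 2 128 128
```

Halving the number of B values per step while doubling the number of A rows leaves the multiplier count unchanged.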
Coprocessor Architecture
Coprocessor is scalable! Reduce the depth of the A-matrix subblock to reduce the number of MAC units needed.
MAC Unit Architecture
[Figure: array of complex multiply-accumulate units, Row[0] through Row[15]. Labels: input “B” values B(0,0) and B(0,1); “A” values A(0,0) through A(L,0); Block RAM storage for the current “C” values C(0,0), C(0,1) through C(L,0), C(L,1).]
Section 4 Co-design Flow and Methodology
Design Flow
[Figure: design flow diagram. Artifacts: Reference C Algorithm, Optimized C Algorithm, Driver C Algorithm, GEZEL Coprocessor, VHDL, PPC Binary, XUP Board. Steps: Rectangular-Block Transformation, Manual Partitioning, Cosimulation, Synthesis, Performance Analysis.]
Simulation
[Figure: the design flow artifacts (Reference C Algorithm, Optimized C Algorithm, Driver C Algorithm, GEZEL Coprocessor, VHDL, PPC Binary, XUP Board) annotated with the simulation levels: workstation, cycle-based instruction-set cosimulator, FPGA.]
Simulation
Simulation-based verification on three levels
• workstation (behavioral)
• cycle-based ISS (functional model of coprocessor)
• FPGA board (skipping VHDL simulation since synthesis is swift and easy)
Drawback: simulations capture only behavior, but not the architecture.
• Example: Hard to estimate post-synthesis timing
• Example: Hard to reflect memory-bus behavior (DMA, DDR, ...) in a C simulation model
Cycle-based Instruction-set Simulation
Uses GEZEL Cosimulation Tool
http://rijndael.ece.vt.edu/gezel2
[Figure: cosimulation setup. Labels: Application SW (C Code), Executable, Instruction Set simulator (uP, DDR), Cosimulation Interfaces (“N” Reg, FIFO IN, FIFO OUT), Coprocessor Hardware (Coprocessor).]
Cycle-based Instruction-set Simulation
Need cycle-based cosimulation of software and hardware before synthesis
Coprocessor mapped in FSMD semantics
• Modular bottom-up hardware description
Cosimulation Interfaces captured with GEZEL simulation primitives
• Memory-mapped register
• FIFO based (with request/acknowledge handshake)
HW-SW Interface Example
GEZEL Code:
    ipblock fsl1(out data   : ns(32);
                 out exists : ns(1);
                 in  read   : ns(1)) {
      iptype "armfslslave";
      ipparm "core=ppc";
      ipparm "write=0x80000000";
      ipparm "status=0x80000004";
    }
Hardware:
[Figure: block fsl1, connected to the ISS, with ports data, exists and read going to the coprocessor.]
• PPC SW can write to address 0x80000000; this will drive the data output and perform the handshake
• PPC SW can check status with a read from 0x80000004
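To make the software and hardware views of this interface concrete, here is a toy Python model of a one-word memory-mapped channel. It is a sketch under stated assumptions, not the behavior of GEZEL's `armfslslave` block: the class `FslSlaveModel`, its methods, and in particular the meaning of the status word (assumed here to mirror the `exists` signal) are hypothetical. Only the two addresses and the port names come from the ipblock above.

```python
WRITE_ADDR = 0x80000000   # "write" parameter of the ipblock
STATUS_ADDR = 0x80000004  # "status" parameter of the ipblock


class FslSlaveModel:
    """Toy one-word channel from software (ISS side) to the coprocessor."""

    def __init__(self):
        self.data = 0    # models the 32-bit 'data' output
        self.exists = 0  # models the 1-bit 'exists' output: a word is waiting

    def sw_write(self, addr, value):
        """Software store: latch the word and signal that it exists."""
        if addr != WRITE_ADDR:
            raise ValueError("unmapped write address")
        self.data = value & 0xFFFFFFFF
        self.exists = 1

    def sw_read(self, addr):
        """Software load from the status address.

        ASSUMPTION: the status word mirrors 'exists' (1 = word not yet consumed).
        """
        if addr != STATUS_ADDR:
            raise ValueError("unmapped read address")
        return self.exists

    def hw_read(self):
        """Coprocessor asserts 'read': take the word if one exists, else None."""
        if not self.exists:
            return None
        self.exists = 0
        return self.data


fsl1 = FslSlaveModel()
fsl1.sw_write(WRITE_ADDR, 0x12345678)
status_before = fsl1.sw_read(STATUS_ADDR)  # word is pending
word = fsl1.hw_read()                      # coprocessor consumes it
status_after = fsl1.sw_read(STATUS_ADDR)   # channel is free again
print(status_before, hex(word), status_after)  # 1 0x12345678 0
```

In a driver loop, software would poll the status address until the previous word has been consumed before writing the next one.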
Synthesis
[Figure: system diagram. Labels: Application SW (C Code), Instruction Set simulator (uP, DDR), Cosimulation Interfaces (“N” Reg, FIFO IN, FIFO OUT), Coprocessor Hardware (Coprocessor).]
Automatic conversion to hierarchical RTL-VHDL, with black boxes for cosimulation interfaces
Xilinx EDK + ISE
Conclusions
Matrix multiplication can be sped up by 25 times over the standard reference C implementation
• Rectangular blocking
• Dedicated coprocessor hardware, highly scalable
• Integrated design flow
Remaining Challenges
• Memory bottleneck (hardware/software codesign yields ~7 % computation time and 93 % memory access time)
• Further optimization possible using DMA and data caching schemes
Challenge to the MEMOCODE community
• Accurate system-level modeling of platform artifacts to support the designer