VEGAS: Soft Vector Processor with Scratchpad Memory
Christopher Han-Yu Chou
Aaron Severance, Alex D. Brant, Zhiduo Liu, Saurabh Sant, Guy Lemieux
University of British Columbia
Motivation
  Embedded processing on FPGAs
    High performance, computationally intensive
    Soft processors, e.g. Nios/MicroBlaze, too slow
  How to deliver high performance?
    Multiprocessor on FPGA
    Custom hardware accelerators (Verilog RTL)
    Synthesized accelerators (C to FPGA)
Motivation
  Soft vector processor to the rescue
    Previous works have demonstrated the soft vector processor as a viable option that provides:
      Scalable performance and area
      A purely software-based approach
      Decoupled hardware/software development
  Key performance bottlenecks
    Memory access latency
    On-chip data storage efficiency
Contribution
  VEGAS Architecture key features
    Cacheless Scratchpad Memory
    Fracturable ALUs
    Concurrent memory access via DMA
  Advantages
    Eliminates on-chip data replication
      Also: huge # of vectors, long vector lengths
    More parallel ALUs
    Fewer memory loads/stores
VEGAS Architecture
  Scalar Core: NiosII/f @ 200MHz
  DMA Engine & External DDR2
  Vector Core: VEGAS @ 120MHz
  Concurrent Execution
    FIFO synchronized
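One way to picture "Concurrent Execution, FIFO synchronized" is a producer/consumer pair: the scalar core enqueues vector instructions and keeps running, while the vector core drains the queue in issue order. The Python sketch below is only an analogy built on threads. The instruction tuples, the queue depth and the function names are hypothetical and are not taken from the VEGAS design.

```python
import queue
import threading

def scalar_core(fifo, n_instr):
    """Producer: issue vector instructions, blocking only when the FIFO is full."""
    for i in range(n_instr):
        fifo.put(("vadd", i))   # hypothetical instruction encoding
    fifo.put(None)              # sentinel: no more instructions

def vector_core(fifo, completed):
    """Consumer: execute instructions strictly in issue order."""
    while True:
        instr = fifo.get()
        if instr is None:
            break
        completed.append(instr)

def run(n_instr=8, depth=4):
    """Run both 'cores' concurrently and return the completed instructions."""
    fifo = queue.Queue(maxsize=depth)   # bounded FIFO between the two cores
    completed = []
    consumer = threading.Thread(target=vector_core, args=(fifo, completed))
    consumer.start()
    scalar_core(fifo, n_instr)
    consumer.join()
    return completed
```

Because the queue is bounded, the producer stalls only when it gets `depth` instructions ahead of the consumer, which is the kind of decoupling a hardware FIFO would give.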
Scratchpad Memory in Action
  [Figure: Vector Scratchpad Memory and Vector Lanes 0-3, with srcA, srcB and Dest vectors]
Scratchpad Memory in Action
  [Figure: same diagram, showing srcA and Dest]
Scratchpad Advantage
  Performance
    Huge working set (256kB++)
    Explicitly managed by software
    Async load/store via concurrent DMA
  Efficient data storage
    Double-clocked memory (Trad. RF: 2x copies)
    8b data stays as 8b (Trad. RF: 4x copies)
    No cache (Trad. RF: +1 copy)
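The storage claims above can be turned into rough arithmetic. The sketch below follows one reading of the slide, which is an assumption: a traditional register file keeps 2 copies of each element, each 8b element is widened to a 32b word (4x), and the data cache holds one further copy at its native width. The function and parameter names are hypothetical.

```python
def traditional_bytes_per_element(elem_bytes=1, word_bytes=4,
                                  rf_copies=2, cache_copies=1):
    """On-chip bytes used per useful element under the assumed reading."""
    register_file = rf_copies * word_bytes   # duplicated, widened to a full word
    cache = cache_copies * elem_bytes        # extra copy held in the data cache
    return register_file + cache

def scratchpad_bytes_per_element(elem_bytes=1):
    """The scratchpad stores each element once, at its native width."""
    return elem_bytes
```

Under these assumptions an 8b element costs 9 bytes of on-chip storage in the traditional organization and 1 byte in the scratchpad. The ratio shrinks for 16b and 32b data, because the widening factor no longer applies in full.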
Scratchpad Advantage
  Accessed by address register
    Huge # of vectors in scratchpad
      VEGAS uses only 8 vector addr. reg. (V0..V7)
      Modify content to access different vectors
      Auto-increment lessens need to change V0..V7
    Long vector lengths
      Fill entire scratchpad
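The address-register idea can be illustrated with a toy model: the scratchpad is a flat byte array, and V0..V7 hold start addresses rather than data. This is a minimal sketch, not the VEGAS programming interface. The class, its methods, and the choice to auto-increment by exactly one vector length are all assumptions made for illustration.

```python
class Scratchpad:
    """Toy byte-addressed scratchpad with vector address registers V0..V7."""

    def __init__(self, size_bytes, num_addr_regs=8):
        self.mem = bytearray(size_bytes)
        self.v = [0] * num_addr_regs     # each register holds a byte address

    def store(self, reg, data, auto_inc=False):
        """Write a vector at the address held in V[reg]."""
        start = self.v[reg]
        if start + len(data) > len(self.mem):
            raise IndexError("vector does not fit in scratchpad")
        self.mem[start:start + len(data)] = data
        if auto_inc:                     # assumed: step past the vector just written
            self.v[reg] = start + len(data)

    def load(self, reg, vl, auto_inc=False):
        """Read vl bytes starting at the address held in V[reg]."""
        start = self.v[reg]
        if start + vl > len(self.mem):
            raise IndexError("vector runs past end of scratchpad")
        data = bytes(self.mem[start:start + vl])
        if auto_inc:
            self.v[reg] = start + vl
        return data

def demo():
    sp = Scratchpad(64)
    # Three rows are stored back to back without touching V0 by hand.
    for row in (b"\x01\x02\x03\x04", b"\x05\x06\x07\x08", b"\x09\x0a\x0b\x0c"):
        sp.store(0, row, auto_inc=True)
    sp.v[1] = 4                          # point V1 at the second row
    return sp.v[0], sp.load(1, 4)
```

In this model the number of distinct vectors is limited by the scratchpad size, not by the eight registers, and a single long vector may span the whole array.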
Scratchpad Advantage: Median Filter
  Vector address registers easier than unrolling
  Traditional Vector Median Filter
    For J = 0..12
      For I = J .. 24
        V1 = vector[I]              vector load
        V2 = vector[J]              vector load
        CompareAndSwap( V1, V2 )
        vector[J] = V2              vector store
        vector[I] = V1              vector store
  Optimize away 1 vector load + 1 vector store using temp
  Total of 222 loads and 222 stores
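The pseudocode above is a partial bubble sort over 25 vectors: after pass J, vector[J] holds the J-th smallest value at every element position, so after pass 12 vector[12] holds the median. The Python sketch below shows that idea on plain lists. It assumes CompareAndSwap leaves the maximum in V1 and the minimum in V2, and it starts the inner loop at J+1, since comparing a vector with itself changes nothing. It does not attempt to reproduce the load/store counts quoted on the slide.

```python
import random
import statistics

def compare_and_swap(v1, v2):
    """Elementwise compare: return (maxima, minima) of the two vectors."""
    hi = [max(a, b) for a, b in zip(v1, v2)]
    lo = [min(a, b) for a, b in zip(v1, v2)]
    return hi, lo

def vector_median(vectors):
    """Elementwise median of an odd number of equal-length vectors."""
    n = len(vectors)
    mid = n // 2
    vecs = [list(v) for v in vectors]          # work on a copy
    for j in range(mid + 1):                   # J = 0..12 when n == 25
        for i in range(j + 1, n):
            # Minimum of the pair settles in slot j, maximum stays in slot i.
            vecs[i], vecs[j] = compare_and_swap(vecs[i], vecs[j])
    return vecs[mid]

def demo():
    random.seed(0)
    window = [[random.randrange(256) for _ in range(6)] for _ in range(25)]
    return vector_median(window), [statistics.median(c) for c in zip(*window)]
```

Each inner-loop step operates on whole vectors, which is what lets a vector processor filter many pixels per instruction.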
Scratchpad Advantage: Median Filter
  VIPERS ISA
    L14:  vld.b  v2, vbase2, vinc0
          vmax   v31, v2, v4
          vmin   v4, v2, v4
          vst.b  v31, vbase2, vinc1
          addi   r2, r2, 1
          bge    r6, r2, .L14
Fracturable ALUs
  Multiplier: uses 4 x 16b multipliers
  Multiplier also does shifts + rotate
  Adder: uses 4 x 8b adders
Fracturable ALUs Advantage
  Increased processing power
    4-Lane VEGAS
      4 x 32b operations / cycle
      8 x 16b operations / cycle
      16 x 8b operations / cycle
  Median filter example
    32b data: 184 cycles / pixel
    16b data: 93 cycles / pixel
    8b data: 47 cycles / pixel
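A fracturable adder can be modeled in software as a 32b addition whose carry chain is cut at lane boundaries. The sketch below is a behavioral model only, with hypothetical function names. It says nothing about how VEGAS builds the adder from its 8b pieces.

```python
def fractured_add(a, b, width):
    """Add two 32-bit words as independent lanes of `width` bits (8, 16 or 32)."""
    if width not in (8, 16, 32):
        raise ValueError("width must be 8, 16 or 32")
    lane_mask = (1 << width) - 1
    result = 0
    for shift in range(0, 32, width):
        lane_a = (a >> shift) & lane_mask
        lane_b = (b >> shift) & lane_mask
        # The carry out of each lane is dropped, as if the carry chain were cut here.
        result |= ((lane_a + lane_b) & lane_mask) << shift
    return result

def ops_per_cycle(lanes, width):
    """Element operations per cycle for `lanes` 32-bit vector lanes."""
    return lanes * (32 // width)
```

For example, `fractured_add(0x0000FFFF, 0x00000001, 8)` gives `0x0000FF00`, because the carry out of the low byte is discarded, while the same inputs with `width=32` give `0x00010000`. `ops_per_cycle(4, 8)` returns 16, matching the 4-lane figure above.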
Area and Frequency
  VEGAS resource usage by number of lanes
    Num. Lanes   ALM     DSP   M9K   Fmax (MHz)
    1            3831    8     40    131
    2            4881    12    40    131
    4            6976    20    40    130
    8            11824   36    40    125
    16           19843   68    40    122
    32           36611   132   40    116
ALM Usage
Performance
    Benchmark   NiosII/f   VEGAS V1   VEGAS V32   Speedup (NiosII/f vs. V32)
    fir         509919     85549      4693        108x
    motest      1668869    82515      24717       67x
    median      1388       185        7           208x
    autocor     124338     45027      2822        44x
    conven      48988      3462       1897        25x
    imgblend    1231172    175890     35485       34x
    filt3x3     6556592    813471     75349       87x
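The speedup column can be checked directly from the other columns as NiosII/f divided by V32. The short Python sketch below does that with the figures from the table. The dictionary layout and function name are illustrative only.

```python
# (NiosII/f, VEGAS V1, VEGAS V32) figures copied from the table above.
RESULTS = {
    "fir":      (509919, 85549, 4693),
    "motest":   (1668869, 82515, 24717),
    "median":   (1388, 185, 7),
    "autocor":  (124338, 45027, 2822),
    "conven":   (48988, 3462, 1897),
    "imgblend": (1231172, 175890, 35485),
    "filt3x3":  (6556592, 813471, 75349),
}

def speedups(results):
    """Speedup of VEGAS V32 over NiosII/f for each benchmark."""
    return {name: nios / v32 for name, (nios, _v1, v32) in results.items()}

if __name__ == "__main__":
    for name, s in speedups(RESULTS).items():
        print(f"{name:9s} {s:7.1f}x")
```

Truncating the ratios reproduces the reported speedups for every row except median, where the rounded entries give about 198x against the 208x on the slide. That difference is presumably due to the small V32 figure having been rounded to 7.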
Area-Delay Product
  Area*Delay measures “throughput per mm^2”
  Compared to earlier vector processors, VEGAS offers 2-3x better throughput per unit area
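As a worked illustration of the metric: a lower area-delay product means more throughput per unit area. The sketch below compares two VEGAS configurations using numbers that appear elsewhere in these slides (ALM counts for 1 and 32 lanes, and the fir benchmark figures for V1 and V32). Using ALMs alone as "area" and the raw benchmark figure as "delay" are simplifying assumptions. This is not the comparison against VIPERS/VESPA that the slide refers to.

```python
def area_delay(area, delay):
    """Area-delay product; lower is better."""
    return area * delay

def throughput_per_area_gain(area_a, delay_a, area_b, delay_b):
    """How many times better design B is than design A in throughput per area."""
    return area_delay(area_a, delay_a) / area_delay(area_b, delay_b)

# Assumed inputs: ALM count as area, fir benchmark figure as delay.
V1 = (3831, 85549)
V32 = (36611, 4693)
GAIN_V32_OVER_V1 = throughput_per_area_gain(*V1, *V32)
```

Under these assumptions the 32-lane configuration comes out roughly 1.9x better than the 1-lane one on fir: the added area is more than repaid by the reduction in delay.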
Integer Matrix Multiply
  4096 x 4096 integers (64MB data set)
  Intel Core 2 (65nm), 2.5GHz, 16GB DDR2
    Vanilla IJK: 474 s
    Vanilla KIJ: 134 s
    Tiled IJK: 93 s
    Tiled KIJ: 68 s
  VEGAS (65nm Altera Stratix3)
    Vector: 44 s (Nios only: 5407 s)
    256kB Scratchpad, 32 Lanes (about 50% of chip)
    200MHz NIOS, 100MHz Vector, 1GB DDR2 SODIMM
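The loop orderings named above differ only in how they walk memory. The Python sketch below shows IJK, KIJ and a tiled KIJ variant on small square matrices so that the three can be compared for correctness. It is not the benchmark code, and the tile size and function names are hypothetical. `dataset_bytes` assumes 4-byte integers and counts a single matrix.

```python
def matmul_ijk(a, b):
    """Vanilla IJK: the inner loop walks a column of b (strided access)."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0
            for k in range(n):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c

def matmul_kij(a, b):
    """Vanilla KIJ: the inner loop walks rows of b and c (unit stride)."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c

def matmul_tiled_kij(a, b, tile=2):
    """Tiled KIJ: work on tile x tile blocks so operands stay in fast memory."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for kk in range(0, n, tile):
        for ii in range(0, n, tile):
            for jj in range(0, n, tile):
                for k in range(kk, min(kk + tile, n)):
                    for i in range(ii, min(ii + tile, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c

def dataset_bytes(n=4096, elem_bytes=4):
    """Size of one n x n matrix, assuming 4-byte integers."""
    return n * n * elem_bytes
```

With the defaults, `dataset_bytes()` returns 67108864 bytes, i.e. 64MB for one 4096 x 4096 matrix. On a scratchpad machine, tiling plays the role that cache blocking plays on the Core 2: a block is brought on chip once and reused many times.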
Conclusions
  Key features
    Scratchpad Memory
      Enhances performance with fewer loads/stores
      No on-chip data replication; efficient storage
      Double-clocked to hide memory latency
    Fracturable ALUs
      Operate on 8b, 16b, 32b data efficiently
      Single vector core accelerates many applications
  Result
    2-3x better Area-Delay product than VIPERS/VESPA
    Outperforms Intel Core 2 at Integer Matrix Multiply
Issues / Future Work
  No floating-point yet
    Adding “complex function” support, to include floating-point or similar operations
  Algorithms with only short vectors
    Split vector processor into 2, 4, 8 pieces
    Run multiple instances of algorithm
  Multiple vector processors
    Connecting them to work cooperatively
    Goals: increase throughput, exploit task-level parallelism (i.e., chaining or pipelining)