VEGAS: Soft Vector Processor with Scratchpad Memory

Christopher Han-Yu Chou

Aaron Severance, Alex D. Brant, Zhiduo Liu, Saurabh Sant, Guy Lemieux

University of British Columbia

Motivation

Embedded processing on FPGAs
  High performance, computationally intensive
  Soft processors, e.g. Nios/MicroBlaze, too slow

How to deliver high performance?
  Multiprocessor on FPGA
  Custom hardware accelerators (Verilog RTL)
  Synthesized accelerators (C to FPGA)

Motivation

Soft vector processor to the rescue
  Previous work has demonstrated the soft vector processor as a viable option to provide:
    Scalable performance and area
    Purely software-based
    Decouples hardware/software development

Key performance bottlenecks
  Memory access latency
  On-chip data storage efficiency

Contribution

VEGAS Architecture key features
  Cacheless Scratchpad Memory
  Fracturable ALUs
  Concurrent memory access via DMA

Advantages
  Eliminates on-chip data replication
    Also: huge # of vectors, long vector lengths
  More parallel ALUs
  Fewer memory loads/stores

VEGAS Architecture

Scalar Core: NiosII/f @ 200MHz

DMA Engine & External DDR2

Vector Core: VEGAS @ 120MHz

Concurrent Execution
  FIFO synchronized

Scratchpad Memory in Action

[Figure: Vector Scratchpad Memory connected to Vector Lanes 0-3; labels srcA, srcB, Dest]

Scratchpad Memory in Action

[Figure: Vector Scratchpad Memory; labels srcA, Dest]

Scratchpad Advantage

Performance
  Huge working set (256kB++)
  Explicitly managed by software
  Async load/store via concurrent DMA

Efficient data storage
  Double-clocked memory (Trad. RF: 2x copies)
  8b data stays as 8b (Trad. RF: 4x copies)
  No cache (Trad. RF: +1 copy)

Scratchpad Advantage

Accessed by address register
  Huge # of vectors in scratchpad
    VEGAS uses only 8 vector addr. reg. (V0..V7)
    Modify content to access different vectors
    Auto-increment lessens need to change V0..V7
  Long vector lengths
    Fill entire scratchpad
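
The address-register idea above can be sketched as a toy software model. This is only an illustration: the class name `Scratchpad`, the `inc` array, and the `read`/`write` methods are hypothetical and are not VEGAS instructions or a VEGAS API. The sketch assumes a flat, byte-addressed scratchpad and a per-register increment applied after each access.

```python
class Scratchpad:
    """Toy model: flat byte-addressed scratchpad plus 8 vector address registers."""

    def __init__(self, size_bytes=256 * 1024, num_addr_regs=8):
        self.mem = bytearray(size_bytes)
        self.v = [0] * num_addr_regs    # V0..V7 hold scratchpad byte addresses
        self.inc = [0] * num_addr_regs  # hypothetical auto-increment per register

    def read(self, reg, vl):
        """Return vl bytes starting at the address in V[reg], then auto-increment."""
        start = self.v[reg]
        data = bytes(self.mem[start:start + vl])
        self.v[reg] += self.inc[reg]
        return data

    def write(self, reg, data):
        """Store data at the address in V[reg], then auto-increment."""
        start = self.v[reg]
        self.mem[start:start + len(data)] = data
        self.v[reg] += self.inc[reg]


sp = Scratchpad()
vl = 16
# Place three vectors back to back in the scratchpad.
for k in range(3):
    sp.mem[k * vl:(k + 1) * vl] = bytes([k + 1] * vl)

# One address register walks all three vectors without being rewritten.
sp.v[0], sp.inc[0] = 0, vl
rows = [sp.read(0, vl) for _ in range(3)]
```

In this model the number of vectors is bounded only by the scratchpad size, not by the number of address registers, which is the point the slide makes.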

Scratchpad Advantage: Median Filter

Vector address registers easier than unrolling

Traditional Vector Median Filter

  For J = 0..12
    For I = J..24
      V1 = vector[I]           vector load
      V2 = vector[J]           vector load
      CompareAndSwap( V1, V2 )
      vector[J] = V2           vector store
      vector[I] = V1           vector store

Optimize away 1 vector load + 1 vector store using temp
Total of 222 loads and 222 stores
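
The loop nest above can be sketched in plain Python as an element-wise partial sort over 25 vectors, one vector per position of a 5x5 window. This is a minimal sketch under assumptions, not the authors' code: the helper names are made up for illustration, the inner loop starts at J+1 because comparing a vector with itself changes nothing, and V2 is held in a temporary across the inner loop in the spirit of the "using temp" optimization.

```python
import random


def compare_and_swap(a, b):
    """Element-wise: return (minimum vector, maximum vector)."""
    lo = [min(x, y) for x, y in zip(a, b)]
    hi = [max(x, y) for x, y in zip(a, b)]
    return lo, hi


def vector_median(vectors):
    """Element-wise median of an odd number of equal-length vectors."""
    vecs = [list(v) for v in vectors]
    n = len(vecs)
    mid = n // 2
    for j in range(mid + 1):          # J = 0..12 for 25 vectors
        v2 = vecs[j]                  # temp: loaded once per outer iteration
        for i in range(j + 1, n):
            v2, vecs[i] = compare_and_swap(v2, vecs[i])
        vecs[j] = v2                  # stored once per outer iteration
    return vecs[mid]


random.seed(0)
window = [[random.randrange(256) for _ in range(8)] for _ in range(25)]
result = vector_median(window)
expected = [sorted(column)[12] for column in zip(*window)]
```

After outer iteration J, `vecs[J]` holds the (J+1)-th smallest value at every element position, so stopping at J = 12 leaves the median of 25 in `vecs[12]`.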

Scratchpad Advantage: Median Filter

.L14: vld.b  v2, vbase2, vinc0
      vmax   v31, v2, v4
      vmin   v4, v2, v4
      vst.b  v31, vbase2, vinc1
      addi   r2, r2, 1
      bge    r6, r2, .L14

VIPERS ISA

Fracturable ALUs

Multiplier: uses 4 x 16b multipliers

Multiplier also does shifts + rotate

Adder: uses 4 x 8b adders

Fracturable ALUs Advantage

Increased processing power
  4-Lane VEGAS
    4 x 32b operations / cycle
    8 x 16b operations / cycle
    16 x 8b operations / cycle

Median filter example
  32b data: 184 cycles / pixel
  16b data: 93 cycles / pixel
  8b data: 47 cycles / pixel
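
As a behavioural illustration of fracturing, the sketch below adds two 32-bit words as independent 8-bit, 16-bit, or 32-bit sub-words by discarding the carry at each sub-word boundary. It models the arithmetic only, not the VEGAS hardware; the function names `fractured_add` and `ops_per_cycle` are hypothetical.

```python
def fractured_add(a, b, width):
    """Add two 32-bit words as independent sub-words of `width` bits."""
    if width not in (8, 16, 32):
        raise ValueError("width must be 8, 16, or 32")
    mask = (1 << width) - 1
    out = 0
    for shift in range(0, 32, width):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        out |= (lane_sum & mask) << shift  # carry out of each sub-word is dropped
    return out


def ops_per_cycle(lanes, width):
    """Each 32-bit lane performs 32 // width sub-word operations per cycle."""
    return lanes * (32 // width)


a, b = 0x00FFFFFF, 0x00000001
r8 = fractured_add(a, b, 8)    # 0x00FFFF00: only the low byte wraps
r16 = fractured_add(a, b, 16)  # 0x00FF0000: the low halfword wraps
r32 = fractured_add(a, b, 32)  # 0x01000000: ordinary 32-bit add
```

With 4 lanes, `ops_per_cycle` gives 4, 8, and 16 for 32b, 16b, and 8b data, matching the figures on the slide.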

Area and Frequency

VEGAS

Num. Lanes     ALM    DSP    M9K    Fmax (MHz)
         1    3831      8     40           131
         2    4881     12     40           131
         4    6976     20     40           130
         8   11824     36     40           125
        16   19843     68     40           122
        32   36611    132     40           116

ALM Usage

Performance

Benchmark    NiosII/f    VEGAS V1    VEGAS V32    Speedup (NiosII/f vs. V32)
fir            509919       85549         4693    108x
motest        1668869       82515        24717    67x
median           1388         185            7    208x
autocor        124338       45027         2822    44x
conven          48988        3462         1897    25x
imgblend      1231172      175890        35485    34x
filt3x3       6556592      813471        75349    87x

Area-Delay Product

Area*Delay measures "throughput per mm²"
Compared to earlier vector processors, VEGAS offers 2-3x better throughput per unit area

Integer Matrix Multiply

4096 x 4096 integers (64MB data set)

Intel Core 2 (65nm), 2.5GHz, 16GB DDR2
  Vanilla IJK: 474 s
  Vanilla KIJ: 134 s
  Tiled IJK: 93 s
  Tiled KIJ: 68 s

VEGAS (65nm Altera Stratix3)
  Vector: 44 s (Nios only: 5407 s)
  256kB Scratchpad, 32 Lanes (about 50% of chip)
  200MHz NIOS, 100MHz Vector, 1GB DDR2 SODIMM
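
The loop orders named above can be sketched as follows. This is a generic illustration of IJK, KIJ, and tiled KIJ ordering on small matrices, not the benchmark code behind the reported timings; the function names and the tile size of 4 are assumptions. Pure Python will not reproduce the cache effects that separate the Intel timings; the sketch only shows how the loops are arranged.

```python
def matmul_ijk(a, b, n):
    """Vanilla IJK: dot product of a row of a with a column of b."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0
            for k in range(n):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_kij(a, b, n):
    """KIJ: the inner loop scales a row of b and accumulates into a row of c."""
    c = [[0] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            aik = a[i][k]
            row_c, row_b = c[i], b[k]
            for j in range(n):
                row_c[j] += aik * row_b[j]
    return c


def matmul_tiled_kij(a, b, n, tile=4):
    """Tiled KIJ: the same arithmetic, visited one tile x tile block at a time."""
    c = [[0] * n for _ in range(n)]
    for kk in range(0, n, tile):
        for ii in range(0, n, tile):
            for jj in range(0, n, tile):
                for k in range(kk, min(kk + tile, n)):
                    for i in range(ii, min(ii + tile, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c


n = 6
a = [[(i * 7 + j * 3) % 11 for j in range(n)] for i in range(n)]
b = [[(i * 5 + j * 2) % 13 for j in range(n)] for i in range(n)]
ref = matmul_ijk(a, b, n)
```

In the KIJ order the innermost loop walks rows of `b` and `c` contiguously, which is the access pattern that row-major storage and vector-style execution tend to favor.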

Conclusions

Key features
  Scratchpad Memory
    Enhances performance with fewer loads/stores
    No on-chip data replication; efficient storage
    Double-clocked to hide memory latency
  Fracturable ALUs
    Operate on 8b, 16b, 32b data efficiently
    Single vector core accelerates many applications

Result
  2-3x better Area-Delay product than VIPERS/VESPA
  Outperforms Intel Core 2 at Integer Matrix Multiply

Issues / Future Work

No floating-point yet
  Adding "complex function" support, to include floating-point or similar operations

Algorithms with only short vectors
  Split vector processor into 2, 4, 8 pieces
  Run multiple instances of algorithm

Multiple vector processors
  Connecting them to work cooperatively
  Goals: increase throughput, exploit task-level parallelism (i.e., chaining or pipelining)
