
ECE 5775 Student-Led Discussions (10/16)

• Talks: 18-min talk + 2-min Q&A
  – Sparse Matrix Vector Multiplication: Adam Macioszek, Julia Currie, Nick Sarkis
  – Matrix Multiplication: Nick Comly, Felipe Fortuna, Mark Li, Serena Krech
  – Video Systems: Aaron Wisner, Drew Dunne, Alex Katz, Jacob Glueck

• Q&A = Questions, Quizzes & Answers

• Vote for your favorite presentation after class by Friday 10/19 – https://goo.gl/forms/JqdV9JMvzzu0PvI42

• Winners receive a bonus point
  – Write down specific positive comments (minimum 80 characters)

• Each vote is worth 0.4 pt (out of 8 points allocated for the student-led presentation)
  – Presenters must endorse another talk

Announcements

• Lab 4 will be released tomorrow

• Instructor OH rescheduled to Wed (10/17) 5:15-6:15pm
  – One-time change

• Midterm exam on Thursday 10/18
  – Open book, open notes, closed Internet

Sparse Matrix Multiplication
Adam, Julia, Nick

Content based on Parallel Programming for FPGAs by Ryan Kastner, Janarbek Matai, and Stephen Neuendorffer, Chapter 6

Sparse Matrix

● Sparse matrix: a matrix predominantly consisting of zero values
● We can leverage this property to efficiently encode the matrix
● Various documented encodings exist
  ○ CRS
  ○ DOK
  ○ LIL
  ○ COO
  ○ CSC / CCS

CRS Encoding

● Split the matrix into three arrays
  ○ Values
    ■ Stores the nonzero entries of the matrix
  ○ Column Index
    ■ 1-to-1 relationship with the values array
    ■ Stores the column that each value is in
  ○ Row Pointer
    ■ For each index k, stores the number of nonzero elements before row k

CRS Encoding Example

[Figure: an example matrix encoded step by step into its Values, Columns, and Row Pointer arrays]
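The concrete numbers from the example slide are not preserved in this transcript. As a stand-in, here is a minimal sketch (not from the slides) of how a small dense matrix could be converted into the three CRS arrays; N and MAX_NNZ are illustrative sizes.

// Minimal sketch: build the three CRS arrays from a small dense matrix.
const int N = 4;
const int MAX_NNZ = N * N;

void dense_to_crs(const int M[N][N], int values[MAX_NNZ],
                  int col_index[MAX_NNZ], int row_ptr[N + 1]) {
  int nnz = 0;
  for (int i = 0; i < N; i++) {
    row_ptr[i] = nnz;               // nonzeros seen before row i
    for (int j = 0; j < N; j++) {
      if (M[i][j] != 0) {
        values[nnz] = M[i][j];      // store the nonzero entry
        col_index[nnz] = j;         // and the column it came from
        nnz++;
      }
    }
  }
  row_ptr[N] = nnz;                 // total number of nonzeros
}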

Benefits of CRS

● Can dramatically reduce the overall storage size of sparse matrices
● Can be leveraged to drastically increase multiplication speed
● Space-savings example
  ○ 1,000 x 1,000 matrix: 1,000,000 entries
  ○ Assume 10% nonzero values
  ○ Values array size: 100,000
  ○ Columns array size: 100,000
  ○ Row pointer array size: 1,001
  ○ Total CRS size: 201,001 entries (only ~20% of the original size!)
● Space utilization: 0.2N^2 + N + 1
  ○ For an NxN matrix with 10% nonzero values
  ○ (As opposed to N^2 entries for the non-CRS representation)

Space Benefits of CRS

[Figure: storage size of a dense ("Normal") matrix versus CRS encodings at 10%, 30%, 55%, and 70% nonzero density]

Sparse Matrix Multiplication

SPMV Visual

[Figure: step-by-step walk-through of sparse matrix-vector multiplication using the Values, Columns, and Row Pointer arrays of matrix M and the dense vector X; for example, with i = 0 in outer loop L1, Y[0] = 3*1 + 4*2]
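The walk-through maps directly onto code. Below is a sketch of a CRS-based SpMV kernel in the spirit of the spmv() function discussed on the next slide; the loop labels L1/L2 follow the slide, while DTYPE, the sizes (NUM_ROWS, NNZ, SIZE), and the tripcount numbers are illustrative placeholders.

// Sketch of an SpMV kernel over the CRS arrays (sizes are illustrative).
typedef float DTYPE;
const int NUM_ROWS = 4;
const int NNZ = 8;
const int SIZE = 4;

void spmv(int row_ptr[NUM_ROWS + 1], int col_index[NNZ], DTYPE values[NNZ],
          DTYPE y[NUM_ROWS], DTYPE x[SIZE]) {
L1:
  for (int i = 0; i < NUM_ROWS; i++) {
    DTYPE sum = 0;
L2:
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
#pragma HLS loop_tripcount min=1 max=4 avg=2
#pragma HLS pipeline II=1
      sum += values[k] * x[col_index[k]];   // one multiply-accumulate per nonzero
    }
    y[i] = sum;                             // y = M * x, one row at a time
  }
}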

Loop Trip Count

● The number of loop iterations depends on the input data of the spmv() function
  ○ Vivado HLS won't be able to analyze the number of clock cycles
● Solution: provide information about loop bounds
  ○ loop_tripcount directive
  ○ Allows HLS to estimate the number of clock cycles
● #pragma HLS loop_tripcount min=X, max=Y, avg=Z

C/RTL Cosimulation

● It can be hard to provide tight bounds on the loop_tripcount parameters
● Input matrix is converted to cycle-by-cycle input vectors
● Provides minimum, maximum, and average latency and interval of the synthesized function after simulation
● The estimate is only as good as the testbench
  ○ Directly dependent on the input data from the testbench

Test Bench

● Use a "golden" reference implementation
● Compare results with the implementation you wish to synthesize
● Using two implementations provides more assurance
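A minimal sketch of such a test bench, reusing the spmv() sketch above: a plain dense matrix-vector product serves as the golden reference, and the CRS data here is made up purely for illustration.

// Test bench sketch: golden dense mat-vec vs. the spmv() sketch above.
#include <cstdio>
#include <cmath>

int main() {
  // Dense version of an example matrix and its CRS encoding (8 nonzeros).
  DTYPE M[NUM_ROWS][SIZE] = {{3, 4, 0, 0}, {0, 5, 9, 0}, {2, 0, 6, 8}, {0, 0, 0, 7}};
  int   row_ptr[NUM_ROWS + 1] = {0, 2, 4, 7, 8};
  int   col_index[NNZ]        = {0, 1, 1, 2, 0, 2, 3, 3};
  DTYPE values[NNZ]           = {3, 4, 5, 9, 2, 6, 8, 7};
  DTYPE x[SIZE] = {1, 2, 3, 4}, y[NUM_ROWS], y_gold[NUM_ROWS];

  // Golden reference implementation: plain dense matrix-vector product.
  for (int i = 0; i < NUM_ROWS; i++) {
    y_gold[i] = 0;
    for (int j = 0; j < SIZE; j++)
      y_gold[i] += M[i][j] * x[j];
  }

  spmv(row_ptr, col_index, values, y, x);   // implementation to synthesize

  // Compare the two implementations.
  int errors = 0;
  for (int i = 0; i < NUM_ROWS; i++)
    if (std::fabs(y[i] - y_gold[i]) > 1e-6) errors++;
  std::printf(errors == 0 ? "PASS\n" : "FAIL\n");
  return errors;
}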

Architecture with Inner Loop Pipelined

Analysis of Initial Design

● II is limited due to resource limitations of the adder
● The outer loop is not pipelined, meaning the inner loop must flush the pipeline before it ends
● Ideally the adder and multiplier would be used every cycle

Further Optimizations

1. Pipelining the outer loop
   a. Attempts to increase the parallelism of the task
   b. Requires the inner loop to be fully unrolled, which is not possible in this case because the loop bound is not constant
2. Partially unrolling the inner loop
   a. Allows more operations from the inner loop to be executed simultaneously
   b. Because our II is greater than one, we can reuse operators to perform several operations of the same type instead of having to instantiate new ones

Sparse Matrix Multiplication with Partial Unrolling
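The presenters' exact restructured code is not captured in this transcript; the sketch below shows one way to express the idea with directives, reusing the types from the spmv() sketch above. The unroll factor of 4 and the cyclic array partitioning are illustrative choices, not the presented design.

// Sketch only: partially unroll the inner accumulation loop of spmv().
void spmv_unrolled(int row_ptr[NUM_ROWS + 1], int col_index[NNZ], DTYPE values[NNZ],
                   DTYPE y[NUM_ROWS], DTYPE x[SIZE]) {
#pragma HLS array_partition variable=values cyclic factor=4
#pragma HLS array_partition variable=col_index cyclic factor=4
L1:
  for (int i = 0; i < NUM_ROWS; i++) {
    DTYPE sum = 0;
L2:
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
#pragma HLS loop_tripcount min=1 max=4 avg=2
#pragma HLS unroll factor=4
      // Several multiply-adds of the same type can now be scheduled together,
      // sharing operators across the unrolled copies when II > 1.
      sum += values[k] * x[col_index[k]];
    }
    y[i] = sum;
  }
}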

Potential Hardware Implementations of Unrolled Design

Analysis of Partially Unrolled Design

Matrix Multiplication (the dense one)

Parallel Programming for FPGAs: Ch 7

Nick Comly, Felipe Fortuna, Mark Li, Serena Krech

Review: Matrix Multiplication

● Not commutative: A x B ≠ B x A

● To multiply A x B, we need col(A) = row(B)

Image source: mathisfun.com/algebra/matrix-multiplying.html

Code Example
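The code shown on this slide is not reproduced in the transcript. As a stand-in, here is a minimal pipelined matrix-multiply sketch; MM_N, DTYPE, and the partitioning pragmas are illustrative rather than the slide's actual code.

// Minimal sketch of a dense, pipelined matrix multiply for HLS.
typedef float DTYPE;
const int MM_N = 8;

void matmul(DTYPE A[MM_N][MM_N], DTYPE B[MM_N][MM_N], DTYPE AB[MM_N][MM_N]) {
  // Partition so the unrolled inner loop can read a full row of A and a
  // full column of B each cycle.
#pragma HLS array_partition variable=A complete dim=2
#pragma HLS array_partition variable=B complete dim=1
  for (int i = 0; i < MM_N; i++) {
    for (int j = 0; j < MM_N; j++) {
#pragma HLS pipeline II=1
      DTYPE sum = 0;
      for (int k = 0; k < MM_N; k++)   // fully unrolled under the pipeline
        sum += A[i][k] * B[k][j];
      AB[i][j] = sum;
    }
  }
}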

Array Reshaping
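This slide is a figure not captured in the transcript. The sketch below (reusing DTYPE and MM_N from the matmul sketch) only illustrates the form of the array_reshape directive, which packs partitioned elements into wider words so several elements can be read per cycle without multiplying the number of memories; the reshape factor and the toy column-sum loop are illustrative.

// Sketch: reshape B so four consecutive row indices share one wide word.
void reshape_example(DTYPE B[MM_N][MM_N], DTYPE colsum[MM_N]) {
#pragma HLS array_reshape variable=B cyclic factor=4 dim=1
  for (int j = 0; j < MM_N; j++) {
#pragma HLS pipeline II=2
    DTYPE sum = 0;
    for (int k = 0; k < MM_N; k++)   // 4 column elements arrive per access
      sum += B[k][j];
    colsum[j] = sum;
  }
}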

Blocking/tiling

● Decompose a larger matrix into smaller submatrices
● Exploit natural structure
  ○ Submatrices may be 0
● Operate on smaller sets of data
  ○ On CPUs, increases data locality, use native vector types
  ○ On FPGAs, match on-chip blocks, budget resources
    ■ Exploit cyclic partitioning
● Assist performance optimizations
  ○ Easier to exploit performance optimizations like DATAFLOW

Blocking - Implementation
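The implementation slide itself is not captured in the transcript. Below is a generic sketch of blocking/tiling (reusing DTYPE and MM_N): the product is accumulated block by block over BLOCK x BLOCK submatrices. BLOCK and the loop nest are illustrative, not the presented design.

// Sketch of a blocked (tiled) matrix multiply.
const int BLOCK = 4;   // assumes MM_N is a multiple of BLOCK

void matmul_blocked(DTYPE A[MM_N][MM_N], DTYPE B[MM_N][MM_N], DTYPE AB[MM_N][MM_N]) {
  // Clear the accumulator matrix first.
  for (int i = 0; i < MM_N; i++)
    for (int j = 0; j < MM_N; j++)
      AB[i][j] = 0;

  // Loop over block coordinates, then over the elements inside each block.
  for (int bi = 0; bi < MM_N; bi += BLOCK)
    for (int bj = 0; bj < MM_N; bj += BLOCK)
      for (int bk = 0; bk < MM_N; bk += BLOCK)
        for (int i = bi; i < bi + BLOCK; i++)
          for (int j = bj; j < bj + BLOCK; j++) {
#pragma HLS pipeline II=1
            DTYPE sum = AB[i][j];
            for (int k = bk; k < bk + BLOCK; k++)
              sum += A[i][k] * B[k][j];   // partial product of one block pair
            AB[i][j] = sum;
          }
}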

Traditional Data Transfer

● Normally assume that all data is ready when a task begins
  ○ Places an unnecessary constraint on when a task can begin
● Complete computation of large matrices is often very cumbersome
  ○ If blocking is used, results come in batches
  ○ Inefficient to wait for the entire result
● Most accelerators cannot operate on an entire data set at a time anyway

Streaming

● Receive data right before it is needed
  ○ Transfer input data in portions instead of all at once
● Reduce memory usage of input and output data by partially processing, then overwriting
● Applicable in many applications:
  ○ ADC, GPIO, etc.

Streaming - Advantages

● Reduction in memory for I/O data, because the entire dataset is not needed at one time
  ○ Overwrite the previous data when the next arrives
  ○ Only valid for applications that allow for blocked computation:
    ■ Matrix multiplication, FFTs, etc.

Streaming - Implementation

FIFO

● Standard first-in-first-out queue (see the sketch below)

Pros:
+ Simple to implement
+ Little wasted memory

Cons:
- Potential read-write collisions
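A minimal sketch of a FIFO between two tasks using hls::stream, reusing DTYPE and MM_N from the matmul sketch; the producer/consumer bodies and the depth of 16 are illustrative.

// Sketch: FIFO channel between a producer and a consumer task.
#include "hls_stream.h"

void producer(DTYPE A[MM_N], hls::stream<DTYPE> &out) {
  for (int i = 0; i < MM_N; i++)
    out.write(A[i]);              // push one element per iteration
}

void consumer(hls::stream<DTYPE> &in, DTYPE B[MM_N]) {
  for (int i = 0; i < MM_N; i++)
    B[i] = in.read() * 2;         // pop and process in arrival order
}

void fifo_example(DTYPE A[MM_N], DTYPE B[MM_N]) {
#pragma HLS dataflow
  hls::stream<DTYPE> fifo("fifo");
#pragma HLS stream variable=fifo depth=16
  producer(A, fifo);
  consumer(fifo, B);
}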

Ping-Pong

● Two buffers: one being read, one being written
  ○ Both tasks related to the buffer can work simultaneously
● After the production / consumption of a block of data, the two tasks switch
  ○ New data is read and old is overwritten

Pros:
+ No read-write collisions

Cons:
- Extra memory required

Ping-Pong Buffer

[Figure: Task 1 writes one buffer while Task 2 reads the other; after each block, the buffers swap roles, so the buffer that was written becomes the buffer that is read and vice versa]

Dataflow - Overview

● Pipelines a function by making each stage a set of nested loops
● Affects all nested loops within a function
  ○ If more precision is needed, create another function
● Usually, nested loops are pipelined as well
  ○ The initiation interval of the dataflow optimization is limited by the II within the stages
    ■ II_dataflow >= max{II_nested loops}
    ■ Target the worst stage for performance improvements
● Starts operation as soon as data is ready
  ○ Uses streams to communicate between stages, instead of registers as in pipelining

Dataflow - Block Multiplication

[Figure: dataflow pipeline with stages loadA (II = 3) → partialsum (II = 1) → writeoutput (II = 5)]

II >= max{II_loadA, II_partialsum, II_writeoutput}
II >= 5
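A sketch of how the three stages might be wired together under DATAFLOW, using the stage names loadA, partialsum, and writeoutput from the diagram; the stage bodies, stream connections, and sizes (reusing DTYPE and MM_N) are placeholders rather than the presenters' code. As the formula above says, the overall II is bounded below by the slowest stage.

// Sketch: three dataflow stages connected by streams.
#include "hls_stream.h"

void loadA(DTYPE A[MM_N][MM_N], hls::stream<DTYPE> &a_out) {
  for (int i = 0; i < MM_N; i++)
    for (int j = 0; j < MM_N; j++)
      a_out.write(A[i][j]);            // stream the matrix in row-major order
}

void partialsum(hls::stream<DTYPE> &a_in, hls::stream<DTYPE> &s_out) {
  for (int i = 0; i < MM_N; i++) {
    DTYPE sum = 0;
    for (int j = 0; j < MM_N; j++)
      sum += a_in.read();              // accumulate one row
    s_out.write(sum);
  }
}

void writeoutput(hls::stream<DTYPE> &s_in, DTYPE out[MM_N]) {
  for (int i = 0; i < MM_N; i++)
    out[i] = s_in.read();              // drain results to memory
}

void block_pipeline(DTYPE A[MM_N][MM_N], DTYPE out[MM_N]) {
#pragma HLS dataflow
  // Streams let each stage start as soon as data arrives.
  hls::stream<DTYPE> a_fifo("a_fifo"), sum_fifo("sum_fifo");
  loadA(A, a_fifo);
  partialsum(a_fifo, sum_fifo);
  writeoutput(sum_fifo, out);
}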

Dataflow Benefits

● Improves throughput and reduces latency
  ○ Operations do not need all data to begin execution
● Maximize parallelism
  ○ Pipelined loops within functions
  ○ Functions pipelined
● Variable bounded loops
  ○ Cannot be unrolled → cannot be pipelined
  ○ Dataflow can pipeline the function the loop is within
● Reduce BRAM usage
  ○ FIFOs for streaming

Pragma Pipeline vs. Pragma Dataflow

Pipeline:
● Reduces initiation interval
● Pipelined at the cycle level
● Applied to individual loops
● Fine-grain operation-level parallelism

Dataflow:
● Reduces overall interval
● Pipelined architecture
● Applied to functions
● Coarse-grain task-level parallelism

Pipelining parallelizes operations within tasks, while dataflow parallelizes tasks

ECE 5775: Video Systems
Parallel Programming for FPGAs, Chapter 9

Ryan Kastner, Janarbek Matai, Stephen Neuendorffer

Aaron Wisner, Drew Dunne, Alex Katz, Jacob Glueck

Background

Representing Video

● Each pixel encodes a color (many possible encodings)
  ○ RGB (red, green, and blue)
  ○ YUV (Y: brightness, U/V: color)
● struct pixel { uint8_t red, green, blue; }; struct pixel frame[1080][1920];
● When sending over the wire, must serialize the data
  ○ How do you know the start and stop of a frame? Need a sync signal
  ○ Sync can be encoded with special pixel values or separate wires
● Typical synchronization scheme:

[Figure: typical video synchronization scheme]

Video Processing on an FPGA

● HD video is 1920 x 1080 at 30 FPS (over 60 million pixels per second)
  ○ Several frames of processing delay are usually acceptable (amenable to pipelining)
  ○ At II = 1, this needs roughly a 60 MHz FPGA clock speed
● Where to store video data? (1920 x 1080 x 24 bits ≈ 50 Mb per frame)
  ○ On-chip BRAM: Zynq-7000 has 1.8-26.5 Mb
  ○ Off-chip DDR DRAM (the same RAM as in your laptop)

Line Buffers & Frame Buffers

Windows

● Most video processing algorithms use a moving window to compute output pixels (like filters in CNNs)

● Example: 4x downsampling (sketched below)
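A trivial sketch of 4x downsampling (not the slide's code): one pixel is kept out of every 4x4 block, with no filtering shown; the sizes and pixel type are illustrative.

// Sketch: 4x downsampling by keeping one pixel per 4x4 block.
typedef unsigned char pixel_t;
const int IN_H = 1080, IN_W = 1920, DS = 4;

void downsample4x(pixel_t in[IN_H][IN_W], pixel_t out[IN_H / DS][IN_W / DS]) {
  for (int r = 0; r < IN_H / DS; r++)
    for (int c = 0; c < IN_W / DS; c++) {
#pragma HLS pipeline II=1
      out[r][c] = in[r * DS][c * DS];   // pick the top-left pixel of each block
    }
}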

Buffers

● Depending on the algorithm used, there will likely be high temporal and/or spatial locality

● Instead of reading each pixel multiple times, we can read them exactly once into a structure in local memory called a line buffer.

● If line buffers store lines, what do window buffers store?
● Line buffers are typically implemented in BRAM and window buffers are implemented using flip-flops (why?)

2D Video Processing


● Uses a line buffer and window buffer

● Every iteration (sketched after this list):

○ Window buffer drops a column

○ New column is read into window buffer from line buffer, which drops one pixel and reads another from the source
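A sketch of that per-iteration update for a 3x3 window over a WIDTH-wide image, using plain arrays for the line buffer and window buffer; filter3x3() is a placeholder box-average kernel, and all names and sizes are illustrative rather than the book's implementation.

// Sketch: one iteration of a line-buffer / window-buffer update.
typedef unsigned char pixel_t;
const int WIDTH = 1920;
const int K = 3;                          // window size

pixel_t filter3x3(pixel_t w[K][K]) {
  // Placeholder kernel: simple box average over the window.
  int sum = 0;
  for (int r = 0; r < K; r++)
    for (int c = 0; c < K; c++)
      sum += w[r][c];
  return (pixel_t)(sum / (K * K));
}

pixel_t window_step(pixel_t pixel_in, pixel_t line_buf[K - 1][WIDTH],
                    pixel_t window[K][K], int col) {
#pragma HLS array_partition variable=window complete dim=0
  // The window buffer drops its oldest column (shift left).
  for (int r = 0; r < K; r++)
    for (int c = 0; c < K - 1; c++)
      window[r][c] = window[r][c + 1];

  // The new column comes from the line buffer plus the freshly read pixel.
  for (int r = 0; r < K - 1; r++)
    window[r][K - 1] = line_buf[r][col];
  window[K - 1][K - 1] = pixel_in;

  // The line buffer drops one pixel at this column and reads the new one.
  for (int r = 0; r < K - 2; r++)
    line_buf[r][col] = line_buf[r + 1][col];
  line_buf[K - 2][col] = pixel_in;

  return filter3x3(window);               // one output pixel per input pixel
}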

Causal Filters

Using Line Buffers

● Line buffer brings in only one pixel from pixel_in each iteration.

● Has an adverse effect on our output: the line buffer is empty on the first iteration

● The first iteration's output is a filter applied to an empty buffer, delaying the real output to the next iteration.

● Pushes the image ‘down and to the right’.

Resolving With Causal Filter

● Simple fix similar to causal filters in signal processing.

● Increase the iteration count and delay the output one iteration.

● Line buffer is loaded during the first iteration.

● Output for row=0, col=0 is delayed and written in the next iteration

● Pushes image back to original ‘up and to the left’ position.

Boundary Conditions


● Filter windows extend beyond the edge of the input image

● Options:
  ○ Smaller output image
  ○ Constant fill
  ○ Boundary extension
  ○ General schemes to generate values for regions outside the image using internal values

Simple Implementation

● Window buffer stores only values in image

● Compute 2nd buffer with extended values

● A ton of special logic
● All unrolled (outer column loop is pipelined)
● Multiplexers for variable indexing

Better Implementation

● Preload the line buffer and window buffer with extended values
● Shift extra values directly in at edges

Conclusions

● Video processing works really well with HLS
  ○ Pipelining and HLS scheduling enable lots of parallelization
● Lots of well-exploited data locality
  ○ High-throughput streaming and line-buffering designs

● Can build video processing dataflow pipelines!
