ECE 5775 Student-Led Discussions (10/16) - Cornell …


ECE 5775 Student-Led Discussions (10/16)

● Talks: 18-min talk + 2-min Q&A
○ Adam Macioszek, Julia Currie, Nick Sarkis: Sparse Matrix Vector Multiplication
○ Nick Comly, Felipe Fortuna, Mark Li, Serena Krech: Matrix Multiplication
○ Aaron Wisner, Drew Dunne, Alex Katz, Jacob Glueck: Video Systems

● Q&A = Questions, Quizzes & Answers

● Vote for your favorite presentation after class by Friday 10/19
○ https://goo.gl/forms/JqdV9JMvzzu0PvI42
■ Winners receive a bonus point
○ Write down specific positive comments (minimum 80 characters)
■ Each vote is worth 0.4pt (out of 8 points allocated for student-led presentation)
○ Presenters must endorse another talk


Announcements

● Lab 4 will be released tomorrow
● Instructor OH rescheduled to Wed (10/17) 5:15-6:15pm
○ One-time change
● Midterm exam on Thursday 10/18
○ Open book, open notes, closed Internet


Sparse Matrix Multiplication
Adam, Julia, Nick

Content based on Parallel Programming for FPGAs by Ryan Kastner, Janarbek Matai, and Stephen Neuendorffer, Chapter 6


Sparse Matrix

● Sparse Matrix: A matrix predominantly consisting of zero values
● We can leverage this property to efficiently encode the matrix
● Various documented encodings exist
○ CRS (compressed row storage)
○ DOK (dictionary of keys)
○ LIL (list of lists)
○ COO (coordinate list)
○ CSC / CCS (compressed sparse column / compressed column storage)


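CRS Encoding

● Split the matrix into three arrays
○ Values
■ Stores the nonzero entries of the matrix
○ Column Index
■ One-to-one relationship with the values array
■ Stores the column that each value is in
○ Row Pointer
■ For each index k, stores the number of nonzero elements before row k
■ Row k therefore occupies the positions from entry k up to (but not including) entry k+1 in the values array; one extra final entry holds the total number of nonzeros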
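The three arrays can be built in a single pass over a dense matrix. The following is a minimal sketch, assuming a row-major dense input stored in a `std::vector<int>`; the names `CrsMatrix` and `to_crs` are hypothetical and not taken from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the three CRS arrays.
struct CrsMatrix {
    std::vector<int> values;        // nonzero entries, row by row
    std::vector<int> column_index;  // column of each entry in `values`
    std::vector<int> row_ptr;       // row_ptr[k] = number of nonzeros before row k
};

// Encode a dense row-major matrix (rows x cols) into CRS form.
CrsMatrix to_crs(const std::vector<int>& dense, std::size_t rows, std::size_t cols) {
    assert(dense.size() == rows * cols);
    CrsMatrix m;
    m.row_ptr.reserve(rows + 1);
    for (std::size_t r = 0; r < rows; ++r) {
        // Everything stored so far belongs to earlier rows.
        m.row_ptr.push_back(static_cast<int>(m.values.size()));
        for (std::size_t c = 0; c < cols; ++c) {
            const int v = dense[r * cols + c];
            if (v != 0) {
                m.values.push_back(v);
                m.column_index.push_back(static_cast<int>(c));
            }
        }
    }
    // Final entry: total number of nonzeros, so row k spans
    // [row_ptr[k], row_ptr[k + 1]).
    m.row_ptr.push_back(static_cast<int>(m.values.size()));
    return m;
}
```

For a 4 x 4 matrix with 9 nonzeros this yields 9 values, 9 column indices, and 5 row pointers.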

CRS Encoding Example

[Figure: an example matrix shown alongside its Values, Columns, and Row Pointer arrays]


Benefits of CRS

● Can dramatically reduce overall storage size of sparse matrices
● Can be leveraged to drastically increase multiplication speed
● Space-Savings Example
○ 1,000 x 1,000 matrix: 1,000,000 entries
○ If we assume 10% nonzero values
○ Value array size: 100,000
○ Column array size: 100,000
○ Row pointer array size: 1,001
○ Total CRS size: 201,001 entries (only ~20% of the original size!)
● Space utilization at 10% nonzero values: 0.2N^2 + N + 1 entries
○ For an NxN matrix
○ (As opposed to just N^2 for the non-CRS)
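The 0.2N^2 term is specific to 10% nonzero values. In general, an N x N matrix with nnz nonzeros needs 2·nnz + N + 1 entries in CRS form, so CRS only saves space while fewer than roughly half of the entries are nonzero. A small sketch of that count follows; the function names are hypothetical, and it counts entries rather than bytes.

```cpp
#include <cassert>
#include <cstdint>

// Entries needed to store an n x n matrix with `nnz` nonzeros in CRS form:
// nnz values + nnz column indices + (n + 1) row pointers.
std::uint64_t crs_entries(std::uint64_t n, std::uint64_t nnz) {
    assert(nnz <= n * n);
    return 2 * nnz + n + 1;
}

// Entries needed for the plain dense layout.
std::uint64_t dense_entries(std::uint64_t n) { return n * n; }
```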


Space Benefits of CRS

[Plot: storage of the normal (dense) layout compared with CRS at 10%, 30%, 55%, and 70% nonzero values]


Sparse Matrix Multiplication


SPMV Visual

[Figure: matrix M in CRS form (Values, Columns, and Row Pointer arrays) multiplied by X = (1, 2, 3, 4) to give Y = (11, 37, 15, 32)]

i = 0 (outer loop L1)

Y[0] = 3*1 + 4*2


Loop Trip Count

● The number of loop iterations depends on the input data for the spmv() function
○ Vivado HLS won’t be able to analyze the number of clock cycles
● Solution: provide information about loop bounds
○ loop_tripcount directive
○ Allows HLS to estimate the number of clock cycles (used for reporting only; it does not change the generated hardware)
● #pragma HLS loop_tripcount min=X max=Y avg=Z
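As a rough illustration, the directive sits inside the data-dependent inner loop of an spmv() kernel. This is a sketch under assumptions, not the presenters' code: the sizes, the parameter names, and the min/max/avg values are chosen for a 4 x 4 example with 9 nonzeros. A regular C++ compiler ignores the HLS pragmas.

```cpp
#include <cassert>

// Sizes are illustrative assumptions (a 4 x 4 matrix with 9 nonzeros).
const int NUM_ROWS = 4;
const int SIZE = 4;
const int NNZ = 9;

void spmv(const int row_ptr[NUM_ROWS + 1], const int column_index[NNZ],
          const int values[NNZ], int y[NUM_ROWS], const int x[SIZE]) {
L1:
    for (int i = 0; i < NUM_ROWS; i++) {
        // A row can hold at most SIZE nonzeros, which is the tripcount max.
        assert(row_ptr[i + 1] - row_ptr[i] <= SIZE);
        int y0 = 0;
    L2:
        // Trip count = number of nonzeros in row i, unknown at synthesis time.
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
#pragma HLS loop_tripcount min=0 max=4 avg=2
#pragma HLS pipeline
            y0 += values[k] * x[column_index[k]];
        }
        y[i] = y0;
    }
}
```

The outer loop L1 has a constant bound, so only L2 needs the hint.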


C/RTL Cosimulation

● It can be hard to provide tight bounds on the loop_tripcount parameters
● Input matrix is converted to cycle-by-cycle input vectors
● Provides minimum, maximum, and average latency and interval of the synthesized function after simulation
● The estimate is only as good as the testbench
○ Directly dependent on input data from the testbench


Test Bench

● Use a “golden” reference implementation
● Compare results with the implementation you wish to synthesize
● Using 2 implementations provides more assurance
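One way to set this up is to use a plain dense matrix-vector product as the golden reference and compare it against the CRS-based kernel. The sketch below assumes a 4 x 4 test matrix chosen for illustration; `matvec_golden` and `run_test` are hypothetical names.

```cpp
#include <cassert>
#include <cstdio>

const int N = 4;    // matrix dimension (assumed for this example)
const int NNZ = 9;  // number of nonzeros in the test matrix

// Golden reference: straightforward dense matrix-vector product.
void matvec_golden(const int m[N][N], const int x[N], int y[N]) {
    for (int i = 0; i < N; i++) {
        y[i] = 0;
        for (int j = 0; j < N; j++) y[i] += m[i][j] * x[j];
    }
}

// Design under test: CRS-based sparse matrix-vector product.
void spmv(const int row_ptr[N + 1], const int column_index[NNZ],
          const int values[NNZ], const int x[N], int y[N]) {
    for (int i = 0; i < N; i++) {
        int y0 = 0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            y0 += values[k] * x[column_index[k]];
        y[i] = y0;
    }
}

// Returns the number of mismatching outputs (0 means the test passed).
int run_test() {
    const int m[N][N] = {{3, 4, 0, 0}, {0, 5, 9, 0}, {2, 0, 3, 1}, {0, 4, 0, 6}};
    const int row_ptr[N + 1] = {0, 2, 4, 7, 9};
    const int column_index[NNZ] = {0, 1, 1, 2, 0, 2, 3, 1, 3};
    const int values[NNZ] = {3, 4, 5, 9, 2, 3, 1, 4, 6};
    const int x[N] = {1, 2, 3, 4};
    int y_ref[N], y_dut[N];

    matvec_golden(m, x, y_ref);
    spmv(row_ptr, column_index, values, x, y_dut);

    int errors = 0;
    for (int i = 0; i < N; i++) {
        if (y_ref[i] != y_dut[i]) {
            std::printf("mismatch at %d: expected %d, got %d\n", i, y_ref[i], y_dut[i]);
            errors++;
        }
    }
    return errors;
}
```

In an HLS flow the testbench's `main` would typically return this error count, since a nonzero return value signals failure.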


Architecture with Inner Loop Pipelined


Analysis of Initial Design

● II is limited due to resource limitations of the adder
● The outer loop is not pipelined, meaning the inner loop must flush the pipeline before it ends
● Ideally the adder and multiplier would be used every cycle


Further Optimizations

1. Pipelining the outer loop
a. Attempts to increase the parallelism of the task
b. Requires the inner loop to be fully unrolled, which is not possible in this case because the loop bound is not constant
2. Partially unrolling the inner loop
a. Allows more operations from the inner loop to be executed simultaneously
b. Because our II is greater than one, we can reuse operators to perform several operations of the same type instead of having to instantiate new ones
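Partial unrolling can also be written out by hand, which makes the structure explicit: the variable-bound loop advances S elements at a time, and a constant-bound inner loop handles those S elements. This is a sketch under assumptions (S = 2, the same hypothetical 4 x 4 sizes as a small test case), not the design shown in the talk.

```cpp
#include <cassert>

const int NUM_ROWS = 4;
const int SIZE = 4;
const int NNZ = 9;
const int S = 2;  // unroll factor (assumed)

void spmv_unrolled(const int row_ptr[NUM_ROWS + 1], const int column_index[NNZ],
                   const int values[NNZ], int y[NUM_ROWS], const int x[SIZE]) {
    for (int i = 0; i < NUM_ROWS; i++) {
        assert(row_ptr[i] <= row_ptr[i + 1]);  // CRS row pointers never decrease
        int y0 = 0;
        // Each trip of this loop handles up to S nonzeros of row i.
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k += S) {
#pragma HLS pipeline
            int partial = 0;
            // Constant bound, so this loop can be fully unrolled.
            for (int j = 0; j < S; j++) {
#pragma HLS unroll
                // Guard the tail when the row length is not a multiple of S.
                if (k + j < row_ptr[i + 1])
                    partial += values[k + j] * x[column_index[k + j]];
            }
            y0 += partial;
        }
        y[i] = y0;
    }
}
```

The result is identical to the plain loop; only the amount of work exposed per iteration changes.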


Sparse Matrix Multiplication with Partial Unrolling


Potential Hardware Implementations of Unrolled Design


Analysis of Partially Unrolled Design


Matrix Multiplication (the dense one)

Parallel Programming for FPGAs: Ch 7

Nick Comly, Felipe Fortuna, Mark Li, Serena Krech


Review: Matrix Multiplication

● Not commutative: in general, A x B ≠ B x A
● To multiply A x B, the number of columns of A must equal the number of rows of B: col(A) = row(B)

Image source: mathisfun.com/algebra/matrix-multiplying.html


Code Example
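A minimal sketch of a dense matrix multiplication kernel in HLS style is given here; it is not necessarily the code from the slide. The dimensions and names are assumptions. Pipelining the `col` loop asks the tool to unroll the `product` loop, and the `array_reshape` directives are one common way to supply a whole row of A and a whole column of B per cycle.

```cpp
#include <cassert>

// Dimensions are illustrative: A is N x M, B is M x P, AB is N x P.
const int N = 2;
const int M = 3;
const int P = 2;

void matrixmul(const int A[N][M], const int B[M][P], int AB[N][P]) {
#pragma HLS array_reshape variable=A complete dim=2
#pragma HLS array_reshape variable=B complete dim=1
row:
    for (int i = 0; i < N; i++) {
    col:
        for (int j = 0; j < P; j++) {
#pragma HLS pipeline II=1
            int sum = 0;
        product:
            // Dot product of row i of A with column j of B.
            for (int k = 0; k < M; k++) {
                sum += A[i][k] * B[k][j];
            }
            AB[i][j] = sum;
        }
    }
}
```

The shared dimension M is the col(A) = row(B) requirement expressed in the array types.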


Array Reshaping


Blocking/tiling

● Decompose larger matrix into smaller submatrices
● Exploit natural structure
○ Submatrices may be 0
● Operate on smaller sets of data
○ In CPUs, increases data locality, use native vector types
○ On FPGAs, match on-chip blocks, budget resources
■ Exploit cyclic partitioning
● Assist performance optimizations
○ Easier to exploit performance optimizations like DATAFLOW


Blocking - Implementation
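The following is a simplified sketch of blocking on plain arrays, assuming a square matrix whose dimension is a multiple of the tile size. It is not the streaming block multiplier from the talk; it only shows how the computation decomposes into tile products.

```cpp
#include <cassert>

const int N = 4;      // matrix dimension (assumed for this sketch)
const int BLOCK = 2;  // tile edge length; must divide N here

static_assert(N % BLOCK == 0, "BLOCK must divide N");

void blocked_matmul(const int A[N][N], const int B[N][N], int C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) C[i][j] = 0;

    // Walk over tiles of C; each (bi, bj) tile accumulates the products of
    // BLOCK x BLOCK tiles of A and B along the shared dimension bk.
    for (int bi = 0; bi < N; bi += BLOCK)
        for (int bj = 0; bj < N; bj += BLOCK)
            for (int bk = 0; bk < N; bk += BLOCK)
                // Multiply one tile of A by one tile of B.
                for (int i = 0; i < BLOCK; i++)
                    for (int j = 0; j < BLOCK; j++) {
                        int sum = 0;
                        for (int k = 0; k < BLOCK; k++)
                            sum += A[bi + i][bk + k] * B[bk + k][bj + j];
                        C[bi + i][bj + j] += sum;
                    }
}
```

Only one tile of A and one tile of B are touched by the three innermost loops, which is what lets the working set fit in small on-chip memories.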


Traditional Data Transfer

● Normally assume that all data is ready when a task begins
○ Places an unnecessary constraint on when a task can begin
● Complete computation of large matrices is often very cumbersome
○ If blocking is used, results come in batches
○ Inefficient to wait for the entire result
● Most accelerators cannot operate on an entire data set at a time anyway


Streaming

● Receive data right before it is needed
○ Transfer input data in portions instead of all at once
● Reduce memory usage of input and output data by partially processing then overwriting
● More applicable in many applications:
○ ADC, GPIO, etc.


Streaming - Advantages

● Reduction in memory for I/O data, because the entire dataset is not needed at one time
○ Overwrite the previous data when the next arrives
○ Only valid for applications that allow for blocked computation:
■ Matrix multiplication, FFTs, etc.


Streaming - Implementation

FIFO
● Standard first-in-first-out queue

Pros:
+ Simple to implement
+ Little wasted memory

Cons:
- Potential read-write collisions

Ping-Pong
● Two buffers: one being read, one being written
○ Both tasks related to the buffer can work simultaneously
● After the production / consumption of a block of data, the two tasks switch
○ New data is read and old is overwritten

Pros:
+ No read-write collisions

Cons:
- Extra memory required
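The ping-pong scheme can be modeled in software as two buffers whose roles alternate every step. This sketch runs the producer and consumer one after the other within a step, whereas in hardware they would run at the same time; the block size and the function name are assumptions.

```cpp
#include <cassert>
#include <vector>

const int BLOCK = 4;  // block size (assumed)

// Processes `input` block by block through a ping-pong buffer and returns
// the sum of every block as seen by the consumer.
std::vector<int> ping_pong_block_sums(const std::vector<int>& input) {
    assert(input.size() % BLOCK == 0);
    int buffer[2][BLOCK];
    std::vector<int> sums;
    const int num_blocks = static_cast<int>(input.size()) / BLOCK;

    // One extra step drains the last block.
    for (int step = 0; step <= num_blocks; step++) {
        const int write_sel = step % 2;      // buffer the producer fills
        const int read_sel = 1 - write_sel;  // buffer the consumer drains

        // Consumer: read the block written during the previous step.
        if (step > 0) {
            int sum = 0;
            for (int i = 0; i < BLOCK; i++) sum += buffer[read_sel][i];
            sums.push_back(sum);
        }
        // Producer: overwrite the other buffer with the next block.
        if (step < num_blocks) {
            for (int i = 0; i < BLOCK; i++)
                buffer[write_sel][i] = input[step * BLOCK + i];
        }
    }
    return sums;
}
```

Because the two tasks never touch the same buffer within a step, there is no read-write collision, at the cost of storing two blocks instead of one.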


Ping-Pong Buffer

[Figure: Task 1 writes one buffer while Task 2 reads the other; the buffers then swap roles]


Dataflow - Overview

● Pipelines a function by making each stage a set of nested loops
● Affects all nested loops within a function
○ If finer granularity is needed, split the code into another function
● Usually, nested loops are pipelined as well
○ The initiation interval of the dataflow optimization is limited by the II within the stages
■ II_dataflow >= max{II_nested loops}
■ Target the worst stage for performance improvements
● Starts operation as soon as data is ready
○ Uses streams to communicate between stages instead of registers like pipelining

Dataflow - Block Multiplication

[Figure: three dataflow stages in sequence: loadA (II = 3) → partialsum (II = 1) → writeoutput (II = 5)]

II >= max{II_loadA, II_partialsum, II_writeoutput}
II >= 5
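A rough sketch of how three stages might be chained under the dataflow directive follows. The stage names are taken from the slide, but their bodies are placeholders invented for illustration, and the intermediate arrays stand in for the channels the tool would generate. The helper `dataflow_ii_bound` is hypothetical and simply evaluates the bound above.

```cpp
#include <cassert>

const int BLOCK = 4;  // elements per block (assumed)

// Stage 1 (placeholder body): copy a block of A into a local buffer.
static void loadA(const int in[BLOCK], int a_local[BLOCK]) {
    for (int i = 0; i < BLOCK; i++) a_local[i] = in[i];
}

// Stage 2 (placeholder body): form element-wise products with b.
static void partialsum(const int a_local[BLOCK], const int b[BLOCK], int prod[BLOCK]) {
    for (int i = 0; i < BLOCK; i++) prod[i] = a_local[i] * b[i];
}

// Stage 3 (placeholder body): write the results out.
static void writeoutput(const int prod[BLOCK], int out[BLOCK]) {
    for (int i = 0; i < BLOCK; i++) out[i] = prod[i];
}

void block_stage_pipeline(const int in[BLOCK], const int b[BLOCK], int out[BLOCK]) {
#pragma HLS dataflow
    int a_local[BLOCK];  // channel between loadA and partialsum
    int prod[BLOCK];     // channel between partialsum and writeoutput
    loadA(in, a_local);
    partialsum(a_local, b, prod);
    writeoutput(prod, out);
}

// Lower bound on the dataflow II: the slowest stage sets the pace.
int dataflow_ii_bound(const int stage_ii[], int num_stages) {
    int bound = 0;
    for (int s = 0; s < num_stages; s++)
        if (stage_ii[s] > bound) bound = stage_ii[s];
    return bound;
}
```

With stage IIs of 3, 1, and 5, the bound is 5, so improving writeoutput is the only change that can lower the overall interval.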


Dataflow Benefits

● Improves throughput and reduces latency
○ Operations do not need all data to begin execution
● Maximize parallelism
○ Pipelined loops within functions
○ Functions pipelined
● Variable bounded loops
○ Cannot be fully unrolled → the enclosing loop or function cannot be pipelined
○ Dataflow can pipeline the function the loop is within
● Reduce BRAM usage
○ FIFOs for streaming


Pragma Pipeline vs. Pragma Dataflow

Pipeline
● Reduces initiation interval
● Pipelined at the cycle level
● Applied to individual loops
● Fine-grain operation-level parallelism

Dataflow
● Reduces overall interval
● Pipelined architecture
● Applied to functions
● Coarse-grain task-level parallelism

Pipelining parallelizes operations within tasks while dataflow parallelizes tasks


ECE 5775: Video Systems
Parallel Programming for FPGAs, Chapter 9

Ryan Kastner, Janarbek Matai, Stephen Neuendorffer

Aaron Wisner, Drew Dunne, Alex Katz, Jacob Glueck


Background


Representing Video

● Each pixel encodes a color (many possible encodings)
○ RGB (red, green, and blue)
○ YUV (Y-brightness, U/V-color)
● struct pixel { uint8_t red, green, blue; }; struct pixel frame[1080][1920];
● When sending over the wire, must serialize the data.
○ How do you know the start and stop of a frame? Need a sync signal
○ Sync can be encoded with special pixel values or separate wires
● Typical synchronization scheme:


Video Processing on an FPGA

● HD video is 1920 x 1080 at 30 FPS (over 60 million pixels per second)
○ Several frames of processing delay usually acceptable (amenable to pipelining)
○ If II = 1, needs an FPGA clock of roughly 60 MHz or more
● Each pixel encodes a color (many possible encodings)
○ RGB (red, green, and blue)
○ YUV (Y-brightness, U/V-color)
● Where to store video data (1920 x 1080 x 24 bits ≈ 50 Mb per frame)
○ On-chip BRAM - Zynq-7000 (1.8-26.5 Mb)
○ Off-chip DDR DRAM (same RAM as in your laptop)
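The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes 24 bits per pixel (8 bits per RGB channel) and ignores blanking intervals; the names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kWidth = 1920;
constexpr std::uint64_t kHeight = 1080;
constexpr std::uint64_t kFps = 30;
constexpr std::uint64_t kBitsPerPixel = 24;  // 8 bits each for R, G, B

// Pixels that must be processed every second.
constexpr std::uint64_t pixels_per_second() { return kWidth * kHeight * kFps; }

// Minimum clock frequency in Hz when one pixel is accepted every `ii` cycles.
constexpr std::uint64_t min_clock_hz(std::uint64_t ii) { return pixels_per_second() * ii; }

// Bits needed to hold one full frame.
constexpr std::uint64_t frame_bits() { return kWidth * kHeight * kBitsPerPixel; }

static_assert(pixels_per_second() == 62208000, "about 62 million pixels per second");
static_assert(frame_bits() == 49766400, "about 50 Mb per frame");
```

A single frame at about 50 Mb exceeds even the largest BRAM capacity quoted above (26.5 Mb), which is why full frames go to off-chip DRAM while only a few lines are kept on chip.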


Line Buffers & Frame Buffers


Windows

● Most video processing algorithms use a moving window to compute output pixels (like filters in CNNs)

● Example: 4x downsampling


Buffers

● Depending on the algorithm used, there will likely be high temporal and/or spatial locality

● Instead of reading each pixel multiple times, we can read them exactly once into a structure in local memory called a line buffer.

● If line buffers store lines, what do window buffers store?
● Line buffers are typically implemented in BRAM and window buffers are implemented using flip-flops. (why?)


2D Video Processing

● Uses a line buffer and window buffer
● Every iteration:
○ Window buffer drops a column
○ New column is read into window buffer from line buffer, which drops one pixel and reads another from the source
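The per-iteration update can be sketched for a 3 x 3 window as follows. The image size, the summing filter, and the function name are assumptions made for illustration. Outputs with row < 2 or col < 2 are computed from zero-initialized or stale window contents, so only the remaining positions hold a complete 3 x 3 sum, and each such sum covers the window that ends at (row, col) rather than one centered there.

```cpp
#include <cassert>

const int ROWS = 4;  // image height (assumed)
const int COLS = 5;  // image width (assumed)
const int K = 3;     // window edge length

void filter3x3_sum(const int in[ROWS][COLS], int out[ROWS][COLS]) {
    int line_buf[K - 1][COLS] = {};  // the two previous lines (BRAM in hardware)
    int window[K][K] = {};           // current 3 x 3 window (flip-flops in hardware)

    for (int row = 0; row < ROWS; row++) {
        for (int col = 0; col < COLS; col++) {
#pragma HLS pipeline II=1
            const int pixel = in[row][col];  // each input pixel is read exactly once

            // Window buffer drops its oldest (leftmost) column.
            for (int i = 0; i < K; i++)
                for (int j = 0; j < K - 1; j++) window[i][j] = window[i][j + 1];

            // New rightmost column: two older pixels from the line buffer
            // plus the pixel that just arrived.
            window[0][K - 1] = line_buf[0][col];
            window[1][K - 1] = line_buf[1][col];
            window[2][K - 1] = pixel;

            // Line buffer drops its oldest pixel in this column and stores the new one.
            line_buf[0][col] = line_buf[1][col];
            line_buf[1][col] = pixel;

            // For row >= 2 and col >= 2 the window holds in[row-2..row][col-2..col].
            int sum = 0;
            for (int i = 0; i < K; i++)
                for (int j = 0; j < K; j++) sum += window[i][j];
            out[row][col] = sum;
        }
    }
}
```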


Causal Filters


Using Line Buffers

● Line buffer brings in only one pixel from pixel_in each iteration.

● Has an adverse effect on our output: the line buffer is empty on the first iteration

● The first iteration's output is filtered from an empty buffer, delaying valid output to the next iteration.

● Pushes the image ‘down and to the right’.


Resolving With Causal Filter

● Simple fix similar to causal filters in signal processing.

● Increase the iteration count and delay the output one iteration.

● Line buffer is loaded during the first iteration.

● Output for row=0, col=0 is delayed and written the next iteration

● Pushes image back to original ‘up and to the left’ position.
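The same idea is easier to see in one dimension. The sketch below is a 1D simplification, not the 2D design from the talk: a centered 3-tap sum needs the sample after position i before it can produce output i, so the loop runs one extra iteration and writes each output one iteration late. The zero fill at both ends is an assumption.

```cpp
#include <cassert>

const int N = 6;  // number of samples (assumed)

// out[i] = in[i - 1] + in[i] + in[i + 1], with zeros outside the input.
void centered_sum3(const int in[N], int out[N]) {
    int window[3] = {0, 0, 0};
    // One extra iteration so the last output can be flushed.
    for (int i = 0; i < N + 1; i++) {
#pragma HLS pipeline II=1
        // Shift in the next sample; past the end, shift in a zero instead.
        window[0] = window[1];
        window[1] = window[2];
        window[2] = (i < N) ? in[i] : 0;

        // The window is now centered on sample i - 1, so that is the
        // output position that can be written in this iteration.
        if (i >= 1) out[i - 1] = window[0] + window[1] + window[2];
    }
}
```

In the 2D case the same delay is applied in both the row and column loops, which is what moves the image back up and to the left.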


Boundary Conditions

● Filter windows extend beyond the edge of the input image
● Options:
○ Smaller output image
○ Constant fill
○ Boundary extension
○ General schemes to generate values for regions outside the image using internal values
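Two of these options can be expressed as small pixel-access helpers. This is a sketch with hypothetical names and an assumed 2 x 3 image: constant fill returns a fixed value outside the image, while boundary extension reuses the nearest valid pixel.

```cpp
#include <cassert>

const int ROWS = 2;  // image height (assumed)
const int COLS = 3;  // image width (assumed)

// Clamp an index into [0, size - 1].
int clamp_index(int i, int size) {
    assert(size > 0);
    if (i < 0) return 0;
    if (i >= size) return size - 1;
    return i;
}

// Boundary extension: coordinates outside the image map to the nearest edge pixel.
int pixel_extended(const int img[ROWS][COLS], int r, int c) {
    return img[clamp_index(r, ROWS)][clamp_index(c, COLS)];
}

// Constant fill: coordinates outside the image return `fill`.
int pixel_constant(const int img[ROWS][COLS], int r, int c, int fill) {
    const bool inside = (r >= 0 && r < ROWS && c >= 0 && c < COLS);
    return inside ? img[r][c] : fill;
}
```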


Simple Implementation

● Window buffer stores only values in image
● Compute 2nd buffer with extended values
● A ton of special logic
● All unrolled (outer column loop is pipelined)
● Multiplexers for variable indexing


Better Implementation

● Preload the line buffer and window buffer with extended values
● Shift extra values directly in at edges


Conclusions

● Video processing works really well with HLS
○ Pipelining and HLS scheduling enable lots of parallelization
● Lots of well-exploited data locality
○ High-throughput streaming and line-buffering designs
● Can build video processing dataflow pipelines!