
LZ77 Compression Using Altera OpenCL

Mohamed Abdelfattah

LZ77 Compression in OpenCL

Goal: demonstrate that a high-performance compression algorithm (2 GB/s) can be implemented efficiently using the OpenCL compiler.

Outline:

1. OpenCL single-threaded flow

2. LZ77 overview

3. Implementation details

4. Optimizations & results

OpenCL Single-threaded Code

Basically C-code
- OpenCL compiler extracts parallelism automatically
- Pipeline parallelism

One or more custom kernels

Kernels can communicate directly through “channels”

    kernel void simple(global const int *input, int size, global int *output)
    {
        // Both buffers must be large enough for indices up to size + 1.
        for (int i = 1; i <= size; i++) {
            int x = input[i];
            int y = input[i + 1];
            int z = x + y;
            output[i] = z;
        }
    }

[Pipeline diagram: each iteration runs Load x and Load y, then Store z; successive iterations overlap in the pipeline.]

No loop-carried dependencies, so a new loop iteration can start every cycle: initiation interval II = 1.

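    kernel void complex(global const int *input, int size, global int *output)
    {
        int loop_carried = 1;  // initial value is not given on the slide; assumed here
        for (int i = 1; i <= size; i++) {
            int x = input[i];
            int y = input[i + 1];
            int z;
            if (loop_carried / 2 == 1)
                z = x + y;
            else
                z = x - y;
            loop_carried *= z;  // the next iteration's branch needs this value
            output[i] = z;
        }
    }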

Loop-carried computation: need data from iteration x for iteration x+1.

[Animation: the simple and complex pipelines side by side, each iteration doing Load x, Load y, then Store z. In the complex kernel the loop-carried computation takes 2 cycles to compute, so each iteration creates a pipeline bubble and stalls the iterations behind it.]

A new iteration of the loop starts every "II" cycles:
- Simple: II = 1
- Complex: II = 2

The simple kernel has double the throughput. To close the gap, optimize the loop-carried computation.
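A rough cycle-count model makes the effect of II concrete. This is only an illustrative sketch: the function name and the fixed pipeline depth are assumptions, and the model ignores memory stalls.

```c
#include <stdint.h>

/* Cycles needed to push n iterations through a pipeline that is `depth`
   stages deep and accepts a new iteration every `ii` cycles.
   Simplified model: the first iteration takes `depth` cycles, and each
   later iteration finishes `ii` cycles after the previous one. */
uint64_t pipeline_cycles(uint64_t depth, uint64_t ii, uint64_t n)
{
    if (n == 0)
        return 0;
    return depth + ii * (n - 1);
}
```

With an assumed depth of 3 stages, 1000 iterations take 1002 cycles at II = 1 and 2001 cycles at II = 2, which is the factor of two in throughput between the simple and complex kernels.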

Outline:

2. LZ77 overview

LZ77 Compression Example

This sentence is an easy sentence to compress.

1. Scan file byte by byte
2. Look for matches
   - Match length
   - Match offset
3. Replace with a reference to the previous occurrence
   - Marker, length, offset

The second "sentence" matches the first one:
- Match length = 8
- Match offset = 20 bytes

Before: This sentence is an easy sentence to compress.
After:  This sentence is an easy @(8,20) to compress.

Saved 5 bytes!
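To see why a (length, offset) pair is enough to restore the text, here is a minimal sketch of the copy a decompressor performs for one reference. The function name and signature are hypothetical and not taken from the slides.

```c
#include <stddef.h>

/* Append `length` bytes to `out`, copied from `offset` bytes before the
   current end of the output. Returns the new output length.
   Copying byte by byte keeps the result correct even when the source and
   destination ranges overlap. The caller must provide enough room. */
size_t lz77_copy_match(char *out, size_t out_len, size_t length, size_t offset)
{
    for (size_t i = 0; i < length; i++) {
        out[out_len] = out[out_len - offset];
        out_len++;
    }
    return out_len;
}
```

Applied to the 25 bytes "This sentence is an easy " with length 8 and offset 20, the copy appends "sentence" again, which is exactly what the @(8,20) reference stands for.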

Outline:

3. Implementation details
- Single-threaded OpenCL flow
- Single kernel: fully pipelined, II = 1

Throughput estimate = 16 bytes/cycle * 200 MHz = 3.2 * 10^9 bytes/s = 3051 MB/s (with 1 MB = 2^20 bytes)

Overview

1. Shift In New Data

2. Dictionary Lookup/Update

3. Match Search & Filtering

4. Write to output

Comparison against CPU/Verilog

- Best implementation of Gzip on CPU
- By Intel Corporation
- On Intel Core i5 (32nm) processor
- 2013
- Compression speed: 338 MB/s
- Compression ratio: 2.18X

- Best implementation on ASICs
- AHA Products Group
- Coming up Q2 2014
- Compression speed: 2.5 GB/s

- Best implementation on FPGAs
- Verilog
- IBM Corporation
- Nov. 2013 ICCAD
- Altera Stratix-V A7
- Compression speed: 3 GB/s

- OpenCL design example
- Altera Stratix-V A7
- Developed in 1 month
- Compression speed ?
- Compression ratio ?

[Bar chart of compression speed: OpenCL 2.7 GB/s, Verilog 3 GB/s, ASIC 2.5 GB/s, CPU 0.3 GB/s]

Comparison against CPU

Same compression ratio

12X better performance/Watt

Comparison against Verilog

12% more resources

Much lower design effort and design time

10% Slower

Implementation Overview

1. Shift In New Data

Input from DDR memory is shifted into the current window, VEC bytes per cycle (VEC = 4 in this example). We use text in our example, but the input can be anything.

[Animation: the current window holds "o l d _ t e x t" and the incoming stream is "sample_text". After one cycle the window holds "t e x t s a m p" and "le_text" is still to come. The cycle boundary sits in the middle of the window.]
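The shift step can be sketched in plain C as follows. The function name is hypothetical; the window size of 2 * VEC bytes follows the 8-byte window shown in the example with VEC = 4.

```c
#include <string.h>

#define VEC 4  /* bytes consumed per cycle, as in the example */

/* Drop the oldest VEC bytes of the 2*VEC-byte window and append VEC new
   bytes read from the input. */
void shift_in(unsigned char window[2 * VEC], const unsigned char new_bytes[VEC])
{
    memmove(window, window + VEC, VEC);
    memcpy(window + VEC, new_bytes, VEC);
}
```

Starting from the window "old_text" and shifting in "samp" leaves "textsamp", matching the example.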

2. Dictionary Lookup/Update

Current window: t e x t s a m p

The window yields VEC = 4 substrings, one starting at each byte position:
- t e x t
- e x t s
- x t s a
- t s a m

For each substring:
1. Compute hash
2. Look for match in 4 dictionaries
3. Update dictionaries

Dictionaries buffer the text that we have already processed. The hash of each substring addresses Dictionary0 to Dictionary3, which return the possible matches from history (one candidate from each of the 4 dictionaries), e.g.:
- t e x t: t a n _, t e x t, t e x l, t e e n
- e x t s: e a t e, e a r s, e e p s, e n t e
- x t s a: x a n t, x y l o, x e l y, x i r t
- t s a m: t e e n, t e a l, t a n _, t a m e

[Diagram: read ports on Dictionary0 to Dictionary3.]

Generate exactly the number of read/write ports that we need.
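The lookup and update can be sketched in C as below. Several details are assumptions made for illustration and are not stated on the slides: the dictionary depth of 256, the first-byte hash, and the rule that substring i is written to dictionary i. All names are hypothetical.

```c
#include <string.h>

#define VEC   4    /* substrings per cycle, as in the example */
#define DEPTH 256  /* assumed dictionary depth: one entry per hash value */

/* dict[d][h] holds the last VEC-byte substring stored in dictionary d at hash h. */
static unsigned char dict[VEC][DEPTH][VEC];

/* Assumed hash: the first byte of the substring. */
static unsigned hash_first_byte(const unsigned char *s)
{
    return s[0];
}

/* window: 2*VEC bytes. cand[i][d]: candidate for substring i read from dictionary d. */
void lookup_and_update(const unsigned char window[2 * VEC],
                       unsigned char cand[VEC][VEC][VEC])
{
    /* Look for matches: every substring reads one candidate from every dictionary. */
    for (int i = 0; i < VEC; i++) {
        unsigned h = hash_first_byte(&window[i]);
        for (int d = 0; d < VEC; d++)
            memcpy(cand[i][d], dict[d][h], VEC);
    }
    /* Update: substring i is stored in dictionary i (assumption). */
    for (int i = 0; i < VEC; i++) {
        unsigned h = hash_first_byte(&window[i]);
        memcpy(dict[i][h], &window[i], VEC);
    }
}
```

In this arrangement each dictionary sees one write and VEC reads per cycle, which is one way to read the remark about generating exactly the number of ports needed.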

3. Match Search & Filtering

Current windows (the substrings):
- t e x t
- e x t s
- x t s a
- t s a m

Comparison windows (a set of candidate matches for each incoming substring):
- t e x t: t a n _, t e x t, t e x l, t e e n
- e x t s: e a t e, e a r s, e e p s, e n t e
- x t s a: x a n t, x y l o, x e l y, x i r t
- t s a m: t e e n, t e a l, t a n _, t a m e

Compare each current window against each of its 4 compare windows, byte by byte.

Comparators, for the current window t e x t:
- t a n _: match length 1
- t e x t: match length 4
- t e x l: match length 3
- t e e n: match length 2

We have another 3 of those comparator sets, one for each remaining substring.

Match reduction: keep the best length among the 4 match lengths (here, 4).

- Typical C-code
- Fixed loop bounds: the compiler can unroll the loop
- One bestlength associated with each current_window
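The comparators and the reduction can be written as ordinary C loops with fixed bounds. This is a sketch in the spirit of the slides, not the author's code; the function names are hypothetical.

```c
#define VEC 4  /* bytes per window, as in the example */

/* Number of leading bytes on which a current window and one comparison
   window agree. The loop bound is fixed, so a compiler can unroll it. */
int match_length(const unsigned char *cur, const unsigned char *cmp)
{
    int length = 0;
    for (int j = 0; j < VEC; j++) {
        if (cur[j] == cmp[j] && length == j)
            length++;
    }
    return length;
}

/* Match reduction: the best length over the VEC comparison windows. */
int best_length(const unsigned char *cur, const unsigned char cmp[VEC][VEC])
{
    int best = 0;
    for (int d = 0; d < VEC; d++) {
        int len = match_length(cur, cmp[d]);
        if (len > best)
            best = len;
    }
    return best;
}
```

For the current window "text" and the comparison windows "tan_", "text", "texl" and "teen", the match lengths are 1, 4, 3 and 2, and the best length is 4.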

Window: t e x t | s a m p (| marks the cycle boundary; matches start at positions 0 1 2 3)

Select the best combination of matches from the set of candidate matches:
1. Remove matches that are longer when encoded than the original
2. Remove matches covered by the previous step
3. From the remaining set, select the best ones
   - last-fit (a heuristic for bin-packing)
4. Compute "first valid position" for the next step

[Animation: the best lengths at positions 0 to 3 are labelled "Too short", "Overlap" or "Last-fit" as they are filtered, and the first valid position for the next cycle is marked.]

Examples (best lengths at positions 0 1 2 3):
1. Remove matches that are longer when encoded than the original, e.g. starting from best lengths 3 1 3 4.
2. Remove matches covered by the previous step, e.g. best lengths -1 -1 -1 2 for the window s a m p: the positions before the first valid position are already covered.
3. From the remaining set, select the best ones with last-fit bin-packing, e.g. best lengths 3 0 3 4 become 3 -1 -1 4.
4. Compute "first valid position" for the next step, e.g. best lengths 3 -1 -1 4 give 3 3 3 7 (apparently the furthest byte reached so far at each position). Byte 7 is position 3 of the next cycle, so First_valid_pos = 3.
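The four filtering steps can be sketched in C as follows. This is one possible reading of the slides, not the author's code: the minimum useful match length of 3 bytes is an assumption, -1 marks a removed match as in the examples, and the function name is hypothetical. Best lengths are assumed to be at most VEC.

```c
#define VEC       4  /* match positions per cycle, as in the example */
#define MIN_MATCH 3  /* assumption: a reference costs at least 3 bytes */

/* Filters bestlength[] in place and returns the first valid position
   for the next cycle. */
int filter_matches(int bestlength[VEC], int first_valid_pos)
{
    /* Steps 1 and 2: drop matches that are too short or already covered. */
    for (int i = 0; i < VEC; i++) {
        if (bestlength[i] < MIN_MATCH || i < first_valid_pos)
            bestlength[i] = -1;
    }

    /* Step 3, last-fit: walk backwards and keep a match only if it ends
       at or before the start of the match kept after it. */
    int limit = 2 * VEC;  /* nothing kept yet, so the last match always fits */
    int last = -1;        /* position of the last kept match */
    for (int i = VEC - 1; i >= 0; i--) {
        if (bestlength[i] < 0)
            continue;
        if (i + bestlength[i] > limit) {
            bestlength[i] = -1;  /* overlaps a later kept match */
            continue;
        }
        if (last < 0)
            last = i;
        limit = i;
    }

    /* Step 4: the last kept match decides where the next cycle may start. */
    if (last < 0)
        return 0;
    int next = last + bestlength[last] - VEC;
    return next > 0 ? next : 0;
}
```

With best lengths 3 1 3 4 and a first valid position of 0, the sketch leaves 3 -1 -1 4 and returns 3, in line with the examples.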

4. Writing to Output

Each match is written as marker, length, offset:
- Length is limited by VEC (= 16 in our case): fits in 4 bits
- Offset is limited by 0x40000 (a larger offset doesn't make sense): fits in 21 bits

Use either 3 or 4 bytes for this:
- Offset < 2048: 3 bytes, with the fields MARKER (8 bits), LENGTH (4 bits), OFFSET (remaining bits)
- Offset = 2048 .. 262144: 4 bytes, with the same fields and a longer OFFSET

Outline:

4. Optimizations & results
- Area optimizations
- Compression ratio
- Results

Area Optimizations

By choosing the right (hardware) architecture, you are already most of the way there

The last ~5% (of area optimizations) requires some tinkering and advanced knowledge

Example:

Match Search & Filtering

Generates a long vine of logic:
- Compute length
- Condition

Causes longer latency in the pipeline, which increases area.

Balance the computation:
- A balanced tree has a shallower pipeline depth: less area
- Get rid of the dependency on "length"

Modified code:
- Instead of having a length variable (= 2, 3, 4), we have an array of bits (= 0011, 0111, 1111)
- OR operator is cheaper than adder
- OR operator creates a balanced tree (no condition)
- 4% smaller area
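One way to picture the bit-array idea is the sketch below. It is an interpretation rather than the author's modified code: each byte comparison sets its own bit with OR, independently of the others, and the length is recovered afterwards from the run of 1 bits starting at bit 0. The function names are hypothetical.

```c
#define VEC 4  /* bytes per window, as in the example */

/* One bit per byte position: bit j is set when byte j matches.
   No iteration depends on the result of an earlier one, so the ORs
   can be combined as a balanced tree. */
unsigned match_bits(const unsigned char *cur, const unsigned char *cmp)
{
    unsigned bits = 0;
    for (int j = 0; j < VEC; j++)
        bits |= (unsigned)(cur[j] == cmp[j]) << j;
    return bits;
}

/* Match length = number of consecutive 1 bits starting at bit 0,
   e.g. 0011 -> 2, 0111 -> 3, 1111 -> 4. */
int length_from_bits(unsigned bits)
{
    int length = 0;
    while (length < VEC && ((bits >> length) & 1u))
        length++;
    return length;
}
```

For "text" against "texl" the bits are 0111 (length 3). For "text" against "teat" the bits are 1011, and only the leading run counts, so the length is 2.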

Compression Ratio

Evaluate compression ratio on widely-used compression benchmarks:
- Calgary, Canterbury, Large and Silesia corpora
- Text, images, binary, databases: a mix of everything

Geomean results over all benchmarks:
- Initial results: 78.3% or 1.28X

Want to improve results!
1. Bin-packing heuristic
2. Hash function

1. Bin-packing heuristic

We use the "last-fit" heuristic.
- Reason: we have a loop-carried variable, "first_valid_position"

The filtering steps as first implemented:
1. Filter bestlength (length): remove matches that are longer when encoded than the original
2. Filter bestlength (covered): remove matches covered by the previous step
3. Filter bestlength (bin-pack): from the remaining set, select the best ones with a bin-packing heuristic
4. Compute first_valid_pos for the next step

The dependency on first_valid_pos causes a stall in the kernel pipeline: a new iteration cannot start each cycle, and II = 6.

Optimization Report in 14.0

Last-fit bin-packing doesn't affect "first_valid_position", because we always use the last match (which determines first_valid_position). So first_valid_pos can be computed before the bin-packing step:
1. Filter bestlength (length)
2. Filter bestlength (covered)
3. Compute first_valid_pos
4. Filter bestlength (bin-pack)

Tighter computation for the loop-carried variable: a new iteration starts each cycle, II = 1.

Constraint: the match selection heuristic cannot change "first_valid_position".

But: last-fit is very inefficient.

[Example on the window t e x t s a m p, match positions 0 1 2 3: last-fit keeps best lengths 0 0 2 -1, while keeping the longer match with the same reach gives 4 -1 -1 -1. Much better!]

Add a step to eliminate matches that have the same reach but smaller value. This doesn't affect first_valid_position.

8% better ratio
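The extra elimination step could look like the sketch below. It is illustrative only: "reach" is taken to mean start position plus length, -1 marks a removed match, and the function name and the input values used in the note are assumptions.

```c
#define VEC 4  /* match positions per cycle, as in the example */

/* For matches that end at the same byte (same reach), keep only the one
   that starts earliest, since it is the longest. bestlength[i] < 0 means
   there is no match at position i. */
void drop_same_reach(int bestlength[VEC])
{
    for (int i = 0; i < VEC; i++) {
        if (bestlength[i] < 0)
            continue;
        for (int j = i + 1; j < VEC; j++) {
            if (bestlength[j] >= 0 && j + bestlength[j] == i + bestlength[i])
                bestlength[j] = -1;  /* same reach, smaller value */
        }
    }
}
```

For best lengths 4 3 2 -1, all three matches reach byte 4, so only the first survives and the result is 4 -1 -1 -1. The furthest reach is unchanged, which is why first_valid_position is not affected.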

2. Hash Function

Original:
- Hash[i] = curr_window[i]
- E.g. Hash[text] = 't'

XOR2:
- Hash[i] = curr_window[i] xor curr_window[i+1]
- E.g. Hash[text] = 't' xor 'e'
- Aliasing: 't' xor 'e' = 'e' xor 't'
- Not utilizing depth efficiently (256 words, but BRAMs go up to 1024)

XOR3:
- Hash[i] = (curr_window[i] << 2) xor (curr_window[i+1] << 1) xor curr_window[i+2]
- A hash match carries information about the first 3 bytes plus a sense of their ordering
- More likely that our compare windows will have a match
- Hash (BRAM address) is 10 bits: utilizes BRAM depth = 1024

3.1% better ratio

7.1% better ratio
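The three hash variants are small enough to write out in C. The function names are hypothetical; the expressions follow the formulas given for the original, XOR2 and XOR3 hashes.

```c
#include <stdint.h>

/* Original: the first byte of the window. 8-bit result. */
uint16_t hash_original(const uint8_t *w)
{
    return w[0];
}

/* XOR2: symmetric in the first two bytes, so "te" and "et" alias.
   Still an 8-bit result. */
uint16_t hash_xor2(const uint8_t *w)
{
    return (uint16_t)(w[0] ^ w[1]);
}

/* XOR3: the shifts make the result depend on byte order.
   10-bit result, at most 1023. */
uint16_t hash_xor3(const uint8_t *w)
{
    return (uint16_t)((w[0] << 2) ^ (w[1] << 1) ^ w[2]);
}
```

For the window "text", XOR3 gives 354, while the reordered "etxt" gives 260. XOR2 returns the same value for both, which is the aliasing problem.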

Compared to Verilog, it is much easier to try and verify new algorithms. It is exactly like trying out new C-code.

Emulator in 13.1

Compression Ratio

Geomean results over all benchmarks (Calgary, Canterbury, Large and Silesia corpora):
- Initial results: 78.3% or 1.28X
- After optimizations: 60.2% or 1.67X
- With (simple) Huffman encoding (currently on the host, work in progress): 47.8% or 2.10X

Huffman portion of Gzip

16-way parallel variable-bit-width encoding/alignment

Huffman encoding

- Huffman symbols are defined at runtime
- Variable number of bits (≤16)
- Concatenate codes to form a contiguous output stream
- Separate offset computation from the actual assembly

3 compute phases:
- Compute code bit-offsets and start offset of next iteration
- Assembly of the codes in the current iteration
- Build fixed-length segments across multiple iterations

[Diagram: the code lengths are summed (∑ len_i) to give the bit offsets, and each code is shifted (<<) to its offset.]

Compute offsets

- Tight dependency on offset carried across iterations
- Careful about the order of the additions: the compiler does not consider dependencies when it redistributes associative operations
- Decision whether to write to memory is based on accumulating a full segment

[Diagram: bit positions pos[0], pos[1], ..., pos[n] computed from basepos and the running sum ∑ len_i.]

Bit-level shift

Each code shifts to an arbitrary bit-offset within the entire range

2 shift stages:
- 16-bit barrel shifters
- OR reduction tree for final assembly
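The first two compute phases (offsets, then assembly by shift and OR) can be sketched in software as below. The type and function names are hypothetical, and LSB-first bit order is an assumption. The third phase, building fixed-length segments, is not shown.

```c
#include <stdint.h>

#define VEC 16  /* codes per iteration */

/* Hypothetical representation of one Huffman code. */
typedef struct {
    uint16_t bits;  /* code bits, LSB first; bits above `len` must be 0 */
    uint8_t  len;   /* number of bits, 1..16 */
} huff_code;

/* Packs n codes (n <= VEC) into buf, starting at bit position basepos.
   buf must be zero-initialized and have 2 bytes of slack after the last
   byte written. Returns the start offset for the next iteration. */
uint32_t pack_codes(const huff_code *codes, int n, uint32_t basepos, uint8_t *buf)
{
    uint32_t pos[VEC + 1];

    /* Phase 1: bit offset of each code = basepos + sum of earlier lengths. */
    pos[0] = basepos;
    for (int i = 0; i < n; i++)
        pos[i + 1] = pos[i] + codes[i].len;

    /* Phase 2: shift each code to its offset and OR it into the output. */
    for (int i = 0; i < n; i++) {
        uint32_t shifted = (uint32_t)codes[i].bits << (pos[i] % 8);
        uint32_t byte = pos[i] / 8;
        buf[byte]     |= (uint8_t)(shifted & 0xFF);
        buf[byte + 1] |= (uint8_t)((shifted >> 8) & 0xFF);
        buf[byte + 2] |= (uint8_t)((shifted >> 16) & 0xFF);
    }
    return pos[n];
}
```

Because the offsets are known before any code is placed, the shift-and-OR steps are independent of each other, which is the point of separating offset computation from assembly.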

Thank You
