lz77 compression using altera opencl mohamed abdelfattah

Post on 14-Dec-2015

LZ77 Compression Using Altera OpenCL

Mohamed Abdelfattah

LZ77 Compression in OpenCL

Goal: Demonstrate that high-performance compression (2 GB/s) can be implemented efficiently using the OpenCL compiler

Outline:

1. OpenCL single-threaded flow

2. LZ77 overview

3. Implementation details

4. Optimizations & results

OpenCL Single-threaded Code

Basically C-code
- OpenCL compiler extracts parallelism automatically
- Pipeline parallelism

FPGA

One or more custom kernels

Kernels can communicate directly through “channels”

kernel void simple(global int *input, int size, global int *output) {
  for (int i = 1; i <= size; i++) {   // i = 1..size
    int x = input[i];
    int y = input[i+1];
    int z = x + y;
    output[i] = z;
  }
}

[Animation: on the FPGA the loop body becomes a pipeline with the stages Load x / Load y and Store z. Iterations 1, 2, 3, 4, 5 enter the pipeline one per cycle.]

Can start a new loop iteration every cycle! Initiation interval II = 1

No loop-carried dependencies

OpenCL Single-threaded Code

kernel void complex(global int *input, int size, global int *output) {
  int loop_carried = 1;   // initial value is not shown on the slide (assumed)
  for (int i = 1; i <= size; i++) {   // i = 1..size
    int x = input[i];
    int y = input[i+1];
    int z;
    if (loop_carried/2 == 1) z = x + y;
    else z = x - y;
    loop_carried *= z;
    output[i] = z;
  }
}

[Animation: the same Load x / Load y and Store z pipeline, now with a loop-carried computation between the loads and the store.]

Loop-carried computation: need data from iteration x for iteration x+1

OpenCL Single-threaded Code

[Animation: the Simple and Complex pipelines run side by side. Simple accepts a new iteration every cycle. In Complex the loop-carried computation takes 2 cycles to compute, so the pipeline stalls and a bubble is inserted every other cycle.]

Simple: II = 1
Complex: II = 2

A new iteration of the loop starts every "II" cycles

Optimize loop-carried computation: double the throughput

LZ77 Compression Example

This sentence is an easy sentence to compress.

1. Scan file byte by byte
2. Look for matches
   - Match length = 8
   - Match offset = 20 bytes
3. Replace with a reference to previous occurrence
   - Marker, length, offset

This sentence is an easy @(8,20) to compress.

Saved 5 bytes!

Single-threaded OpenCL flow
- Single kernel: fully pipelined, II = 1
- Throughput estimate = 16 bytes/cycle * 200 MHz = 3.2e9 bytes/s, or about 3051 MB/s (with 1 MB = 2^20 bytes)

Overview

1. Shift In New Data

2. Dictionary Lookup/Update

3. Match Search & Filtering

4. Write to output

Comparison against CPU/Verilog

Best implementation of Gzip on CPU
- By Intel Corporation
- On Intel Core i5 (32nm) processor
- 2013
- Compression speed: 338 MB/s
- Compression ratio: 2.18X

Best implementation on ASICs
- AHA Products Group
- Coming up Q2 2014
- Compression speed: 2.5 GB/s

Best implementation on FPGAs
- Verilog
- IBM Corporation
- Nov. 2013 ICCAD
- Altera Stratix-V A7
- Compression speed: 3 GB/s

OpenCL design example
- Altera Stratix-V A7
- Developed in 1 month
- Compression speed ?
- Compression ratio ?

[Chart: compression speed. CPU 0.3 GB/s, ASIC 2.5 GB/s, Verilog on FPGA 3 GB/s, OpenCL on FPGA 2.7 GB/s]

Comparison against CPU
- Same compression ratio
- 12X better performance/Watt

Comparison against Verilog
- 12% more resources
- 10% slower
- Much lower design effort and design time

1. Shift In New Data

Input from DDR memory is shifted into the current window, VEC bytes per cycle. The example uses VEC = 4 and text, but the data can be anything.

e.g. (| marks the cycle boundary)

Current Window: o l d _ | t e x t      Input still to come: sample_text
Current Window: t e x t | s a m p      Input still to come: le_text

2. Dictionary Lookup/Update

Current Window: t e x t s a m p

1. Compute hash
2. Look for match in 4 dictionaries
3. Update dictionaries

Dictionaries buffer the text that we have already processed.

Each of the 4 substrings of the current window is hashed, and the hash selects one candidate from each of the 4 dictionaries, e.g.:

Substring   Dictionary0   Dictionary1   Dictionary2   Dictionary3
t e x t     t a n _       t e x t       t e x l       t e e n
e x t s     e a t e       e a r s       e e p s       e n t e
x t s a     x a n t       x y l o       x e l y       x i r t
t s a m     t e e n       t e a l       t a n _       t a m e

These are the possible matches from history (dictionaries).

Each dictionary has one write port (W0 .. W3) and four read ports (RD00 .. RD03 for Dictionary0, and so on up to RD30 .. RD33 for Dictionary3): generate exactly the number of read/write ports that we need.

3. Match Search & Filtering

Current Windows (the substrings):

t e x t
e x t s
x t s a
t s a m

Comparison Windows (a set of candidate matches for each incoming substring):

t a n _   t e x t   t e x l   t e e n
e a t e   e a r s   e e p s   e n t e
x a n t   x y l o   x e l y   x i r t
t e e n   t e a l   t a n _   t a m e

Compare each current window against each of its 4 comparison windows.

Comparators: compare each byte

Current Window:       t e x t
Comparison Windows:   t a n _   t e x t   t e x l   t e e n
Match Length:         1         4         3         2

We have another 3 of those.

Match Reduction: Best Length = 4

[Code slides: the source code shown here is not captured in the transcript.]
- Typical C-code
- Fixed loop bounds: compiler can unroll loop

One bestlength associated with each current_window

[Figure: each of the four substrings of t e x t s a m p is annotated with its best length.]

3. Match Search & Filtering

Select the best combination of matches from the set of candidate matches:
1. Remove matches that are longer when encoded than original
2. Remove matches covered by previous step
3. From the remaining set, select the best ones
   - last-fit (heuristic for bin-packing)
4. Compute "first valid position" for next step

e.g. (VEC = 4, | marks the cycle boundary, removed matches are marked -1)

Window:         t e x t | s a m p
Position:       0 1 2 3
Best lengths:   3 1 3 4

Step 1: the match of length 1 at position 1 is too short.
        Best lengths: 3 -1 3 4

Step 2: matches that start before the first valid position are already covered.
        e.g. with first valid position = 3, best lengths 3 4 4 2 become -1 -1 -1 2

Step 3: last-fit. The last match (position 3) is always kept, the match at position 2 overlaps it and is removed, and the match at position 0 still fits.
        Best lengths: 3 -1 -1 4

Step 4: First_valid_pos = 3 3 3 7 after positions 0 1 2 3. Position 7 lies in the next window, so the first valid position next cycle is 7 - VEC = 3.

4. Writing to Output

Marker, length, offset
- Length is limited by VEC (= 16 in our case): fits in 4 bits
- Offset is limited by 0x40000 (doesn't make sense to be more): fits in 21 bits

Use either 3 or 4 bytes for this:
- Offset < 2048 (3 bytes): MARKER, LENGTH, OFFSET, OFFSET
- Offset = 2048 .. 262144 (4 bytes): MARKER, LENGTH, OFFSET, OFFSET, OFFSET

Outline:

1. OpenCL single-threaded flow

2. LZ77 overview

3. Implementation details

4. Optimizations & results
   - Area optimizations
   - Compression ratio
   - Results

Area Optimizations

By choosing the right (hardware) architecture, you are already most of the way there

The last ~5% (of area optimizations) requires some tinkering and advanced knowledge

Example: Match Search & Filtering

[Code slide: the source code shown here is not captured in the transcript.]

The conditional length computation generates a long vine of logic:

Compute length -> Compute length -> Compute length -> Compute length -> Compute length -> Compute length

- Each stage depends on a condition
- Causes longer latency in the pipeline, which increases area

Balance the computation:
- Get rid of the dependency on "length"
- Balanced tree has shallower pipeline depth: less area

Modified Code

Instead of having a length variable (= 2, 3, 4), we have an array of bits (= 0011, 0111, 1111)
- OR operator is cheaper than adder
- OR operator creates a balanced tree (no condition)
- 4% smaller area

Compression Ratio

Evaluate compression ratio on widely-used compression benchmarks:
- Calgary, Canterbury, Large, and Silesia corpora
- Text, images, binary, databases: a mix of everything
- Geomean results over all benchmarks
- Initial results: 78.3% or 1.28X

Want to improve results!

1. Bin-packing heuristic
2. Hash function

1. Bin-packing heuristic

We use the "last-fit" heuristic
- Reason: we have a loop-carried variable "first_valid_position"

Original order of the steps:
1. Filter bestlength (length)
2. Filter bestlength (covered)
3. Filter bestlength (bin-pack)
4. Compute first_valid_pos

The dependency causes a stall in the kernel pipeline: cannot start a new iteration each cycle, II = 6 (Optimization Report in 14.0)

Last-fit bin-packing doesn't affect "first_valid_position", because we always use the last match (which determines first_valid_position). The steps can therefore be reordered:
1. Filter bestlength (length)
2. Filter bestlength (covered)
3. Compute first_valid_pos
4. Filter bestlength (bin-pack)

Tighter computation for the loop-carried variable: start a new iteration each cycle, II = 1

Constraint: match selection heuristic cannot change "first_valid_position"

But: last-fit is very inefficient, e.g.

Window:         t e x t s a m p
Position:       0 1 2 3
Best lengths:   4 3 2 0
Last-fit:       0 0 2 -1
Much better:    4 -1 -1 -1

Add a step to eliminate matches that have the same reach but smaller value
- Doesn't affect first_valid_position
- 8% better ratio

2. Hash Function

Original:
- Hash[i] = curr_window[i]
- E.g. Hash[text] = 't'

XOR2: 3.1% better ratio
- Hash[i] = curr_window[i] xor curr_window[i+1]
- E.g. Hash[text] = 't' xor 'e'
- Aliasing: 't' xor 'e' = 'e' xor 't'
- Not utilizing depth efficiently (256 words, but BRAMs go up to 1024)

XOR3: 7.1% better ratio
- Hash[i] = (curr_window[i] << 2) xor (curr_window[i+1] << 1) xor curr_window[i+2]
- Match contains information about first 3 bytes + sense of their ordering
- More likely that our compare windows will have a match
- Hash (BRAM address) is 10 bits: utilizes BRAM depth = 1024

Compared to Verilog, it is much easier to try & verify new algorithms. It is exactly like trying out new C-code (Emulator in 13.1).

Compression Ratio

Evaluate compression ratio on widely-used compression benchmarks:
- Calgary, Canterbury, Large, and Silesia corpora
- Text, images, binary, databases: a mix of everything
- Geomean results over all benchmarks
- Initial results: 78.3% or 1.28X
- After optimizations: 60.2% or 1.67X
- With (simple) Huffman encoding (currently on the host, work in progress): 47.8% or 2.10X

Huffman portion of Gzip

16-way parallel variable-bit-width encoding/alignment

Huffman encoding

- Huffman symbols are defined at runtime
- Variable number of bits (≤16)
- Concatenate codes to form a contiguous output stream
- Separate offset computation from the actual assembly

3 compute phases
- Compute code bit-offsets and start offset of next iteration
- Assembly of the codes in the current iteration
- Build fixed-length segments across multiple iterations

[Diagram: sum of len_i, followed by shifters (<<) and a STORE]

Compute offsets

Tight dependency on offset carried across iterations
- Careful about the order of the additions: the compiler does not consider dependencies when it redistributes associative operations
- Decision whether to write to memory is based on accumulating a full segment

[Diagram: basepos plus the running sum of len_i gives pos[0], pos[1], ..., pos[n]]

Bit-level shift

Each code shifts to an arbitrary bit-offset within the entire range

2 shift stages
- 16 bit barrel shifters
- OR reduction tree for final assembly

Thank You