flexible partial reconfiguration based design architecture

58
Flexible Partial Reconfiguration based Design Architecture for Dataflow Computation Mihir Shah Advisor: Benjamin Carrion Schaefer 1

Upload: others

Post on 16-Oct-2021

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Flexible Partial Reconfiguration based Design Architecture

Flexible Partial Reconfiguration based Design Architecture for Dataflow Computation

Mihir Shah

Advisor: Benjamin Carrion Schaefer

1

Page 2: Flexible Partial Reconfiguration based Design Architecture

Thesis Index

2

SR. NO TITLE

1 Thesis Motivation

2 Thesis Contribution

5 Proposed Design Methodology

6 Design Implementations – Spatial & Partial Reconfiguration based

7 Comparative Study & Analysis

8 Conclusion & Future Works

Page 3: Flexible Partial Reconfiguration based Design Architecture

3

Thesis Motivation

DCT Quantz RLE Huffman

Inp

ut

Stre

am

Ou

tpu

t St

ream

Dataflow Computing (DC) specific model of computation

Target application described as Data Flow Graph (DFG)

Used Extensively in High Frequency Trading, Image, Signal processing based applications

JPEG Encoder as Dataflow Process

Node

Node: Processing Element (PE), Execution

kernel/accelerator/

Links : FIFO (first-in first-out) queue or buffer

Page 4: Flexible Partial Reconfiguration based Design Architecture

4

Thesis Motivation

4

Partial Reconfiguration

Allows modification of an operating FPGA design by loading a partial configuration file,

usually a partial BIT file

Time Multiplex several Processing Elements in a dataflow computation process

𝑷𝑬𝟎

𝑷𝑬𝟏

𝑷𝑬𝒌

Container

FPGA Fabric

Reconfigurable Partition Area……

Partial

Reconfiguration

Controllers

Config

Port

Matrix Multiplication

Page 5: Flexible Partial Reconfiguration based Design Architecture

55

Thesis Motivation

5

Partial Reconfiguration

Allows modification of an operating FPGA design by loading a partial configuration file,

usually a partial BIT file

Time Multiplex several Processing Elements in a dataflow computation process

𝑷𝑬𝟎

Container

FPGA Fabric

Reconfigurable Partition Area……

Partial

Reconfiguration

Controllers

Config

Port

Matrix Multiplication

𝑷𝑬𝟎

𝑷𝑬𝟏

𝑷𝑬𝒌

Page 6: Flexible Partial Reconfiguration based Design Architecture

666

Thesis Motivation

6

Partial Reconfiguration

Allows modification of an operating FPGA design by loading a partial configuration file,

usually a partial BIT file

Time Multiplex several Processing Elements in a dataflow computation process

Container

FPGA Fabric

𝑷𝑬𝟎……Partial

Reconfiguration

Controllers

Config

Port

Matrix Multiplication

𝑷𝑬𝟎

𝑷𝑬𝟏

𝑷𝑬𝒌

𝑷𝑬𝟏

Page 7: Flexible Partial Reconfiguration based Design Architecture

7777

Thesis Motivation

7

Partial Reconfiguration

Allows modification of an operating FPGA design by loading a partial configuration file,

usually a partial BIT file

Time Multiplex several Processing Elements in a dataflow computation process

Container

FPGA Fabric

𝑷𝑬𝟏……Partial

Reconfiguration

Controllers

Config

Port

Matrix Multiplication

𝑷𝑬𝟎

𝑷𝑬𝟏

𝑷𝑬𝒌

Page 8: Flexible Partial Reconfiguration based Design Architecture

88

Thesis Motivation

𝑷𝑬𝟎 𝑷𝑬𝟏 𝑷𝑬𝟐 𝑷𝑬𝒌……………. Spatially placed FPGA Design

Problem ? Area Utilization HighDataflow Process on FPGA Fabric

𝑷𝑬𝟎

FPGA Fabric

Reconfigurable Partition Area ARM

Processor

DDR3 Memory

𝑷𝑬𝟏

𝑷𝑬𝒌

……

Container

PR based design using External Off chip DDR

memory as FIFOSolves: Area Utilization Issue

Problem ? Runtime & Latency Hit !!

PRDDR design method

Static design method

Page 9: Flexible Partial Reconfiguration based Design Architecture

99

Thesis Motivation

𝑷𝑬𝟎

FPGA Fabric

Reconfigurable Partition Area

𝑷𝑬𝟏

𝑷𝑬𝒌

……BRAM Memory

THUS WE PROPOSE - PR based design using Internal On chip BRAM memory as FIFO for dataflow process

Improves Runtime & Latency compared to PRDDR

with reduced FPGA resource savings

PRBRAM design method

Container

Page 10: Flexible Partial Reconfiguration based Design Architecture

1. Semi-automatic design methodology for dataflow computation

Fixed overlay Static architecture - PRBRAM

Support for partial re-configurability

Input is behavior description language for HLS

2. Prototyped on Xilinx Zynq FPGA

JPEG Encoder given in SystemC

Three Testcase images

3. Extensive experimental results

Measure hardware running time vs. area characteristics of static and PR-based methods.

Comparative study between PRBRAM and PRDDR methods

Hardware Running Time vs. Varying Size of Pblock or Reconfigurable Partitions

10

Thesis Contribution

Page 11: Flexible Partial Reconfiguration based Design Architecture

11

JPEG Encoder: A Dataflow Process for Comparative Study

11

DPCM

RLC

Entropy Coding

HeaderTables

Data

CodingTables

Quant…Tables

DCTpixel(i, j)

8 x 8

DCT(i, j)

8 x 8

QuantizationDCTq(i, j)

Original_Image.bmp 512 X 512 pixels

Size: 258 KB

Compressed_Image.jpg 512 X 512 pixels

Size: 36 KBCompression Ratio : 7.17:1

Run-Length Zig-Zag Scan

DC

AC

JPEG File Format

PE1 : DCT

PE2: Quantization

PE3: RunLengthEncoding

PE4: Huffman Encoding

Page 12: Flexible Partial Reconfiguration based Design Architecture

12

Using Xilinx SDKApp.cpp +

𝑺𝒕𝒂𝒕𝒊𝒄𝒃𝒍𝒂𝒏𝒌.bit = BOOT. bin

Overview of the Proposed Design Methodology

12

PRBRAM Static Architecture

Stage 1: Behavioral Algorithm Description to RTL Generation

Page 13: Flexible Partial Reconfiguration based Design Architecture

1313

Using Xilinx SDKApp.cpp +

𝑺𝒕𝒂𝒕𝒊𝒄𝒃𝒍𝒂𝒏𝒌.bit = BOOT. bin

13

PRBRAM Static Architecture

Stage 1: Behavioral Algorithm Description to RTL Generation

Stage 2: Validation and Creation of Custom IPs

Overview of the Proposed Design Methodology

Page 14: Flexible Partial Reconfiguration based Design Architecture

141414

Using Xilinx SDKApp.cpp +

𝑺𝒕𝒂𝒕𝒊𝒄𝒃𝒍𝒂𝒏𝒌.bit = BOOT. bin

14

PRBRAM Static Architecture

Stage 1: Behavioral Algorithm Description to RTL Generation

Stage 2: Validation and Creation of Custom IPs

Stage 3: TCL Automated Floorplan for PR Designs

Overview of the Proposed Design Methodology

Page 15: Flexible Partial Reconfiguration based Design Architecture

15151515

Using Xilinx SDKApp.cpp +

𝑺𝒕𝒂𝒕𝒊𝒄𝒃𝒍𝒂𝒏𝒌.bit = BOOT. bin

15

PRBRAM Static Architecture

Stage 1: Behavioral Algorithm Description to RTL Generation

Stage 2: Validation and Creation of Custom IPs

Stage 3: TCL Automated Floorplan for PR Designs

Stage 4: Deploying the Binaries on Zynq-7000: PRBRAM Static

Architecture

Overview of the Proposed Design Methodology

Page 16: Flexible Partial Reconfiguration based Design Architecture

16

Stage 1: Behavioral Description of Dataflow to RTL Generation

Key-points when describing dataflow application using BDL:

Uniformity in the number, direction and data-widths of I/O all the PE’s

Control interface signals - done, reset and start : Close Loop Feedback when Context Switching

16

𝑷𝑬𝟎. 𝒄 𝑷𝑬𝟏. 𝒄 𝑷𝑬𝟐. 𝒄 𝑷𝑬𝒌. 𝒄…………….

Behavioral Description of Dataflow Algorithm using SystemC/C++/C

High Level Synthesis

RTL GENERATION

Raw Input Data

Processed Output Data

…………….

…………….𝑷𝑬𝟎. 𝒗, 𝑷𝑬𝟏. 𝒗, 𝑷𝑬𝟐. 𝒗, 𝑷𝑬𝟑. 𝒗

Vivado HLS (Xilinx),

Stratus (Cadence),

CyberWorkBench(NEC),

Catapult(Mentor)

Page 17: Flexible Partial Reconfiguration based Design Architecture

17

Stage 2: Validation and Creation of Custom IPs

17

Logical Synthesis & Simulation

Packaging Design using Vivado IP Packager

Creating System Level Design with Xilinx IP

Integrator

Validating the IP design using Xilinx SDK

Structural RTL to

Xilinx Primitives

Writing Test

Benches

Ensure Target

Platform Function

& Timing Violations

Step.1 Step.2 Step.3 Step.4

Design Re-usability

Xilinx Supports

AMBA AXI

I/O instance port-

map to internal

slave registers

Helps to Write

Software Code for

the IP

Compare results

with Simulation

Block Design with

ZYNQ7 PS

Create Top-Level

Wrapper

Export to SDK to

create BSP &

drivers

Page 18: Flexible Partial Reconfiguration based Design Architecture

18

Stage 2: Validation and Creation of Custom IPs

18

PE0.v, PE1.v . . . . PEk.v PE0 PE1 PEk

Post Logical Synthesis

Functional Simulation

Match

?

. . . .

Hardware Results

Integrating IP to ARM

Dedicated IPsRTL

Goal: Ensures correct memory maps in IPs

Validates Control Signals & Functionality of the Module before integrating into tighter PR Flow

Page 19: Flexible Partial Reconfiguration based Design Architecture

19

Stage 3: TCL Automated Floorplan for PR Designs

19

1. Synthesized Static & RMs separately

2. Create physical constraints to create

pblock RP

3. Set property HD.RECONFGIURABLE

on RP

4. Implement static + one RM – complete

design

5. Save the DCP after routing

6. Remove the RM from RP and save

static DCP

7. Lock static placement & routing

8. Add New RM to the static design &

Save the DCP

All RMs covered ?

No

9. Run pr_verifyYes

10. Create bitstreams for each configuration

1) fplan.xdc

2) PEk.v

3) Static.v

1) partial binaries (.bin)

2) blanking.bit

**Processing Element (PE) will be referred as, Reconfigurable Module(RM) in PR Designs

Page 20: Flexible Partial Reconfiguration based Design Architecture

20

Stage 4: Deploying the Binaries on Zynq-7000

20

BOOT.bin = first stage image for PL side + User application

software

After power-on reset, the Boot ROM determines the boot mode (SD flash memory) and the encryption status (non-secure)

Load First Stage Boot Loader (FSBL) into on-chip

RAM (OCM).

Releases CPU control to the FSBL which in turn

configures the PL with the full Staticblank.bit via

PCAP (Processor Control Access Port)

SD Card

Page 21: Flexible Partial Reconfiguration based Design Architecture

2121

Stage 4: Deploying the Binaries on Zynq-7000

21

Partial bitstreams are loaded into DDR memory from SD

card to maximize throughput during configuration

The Raw Image data which is the input to the dataflow

process in JPEG Encoder is also transferred to the DDR

Memory

After this step, the Reconfigurable Modules can

be loaded into the Reconfigurable Partition to start

computation.

Page 22: Flexible Partial Reconfiguration based Design Architecture

2222

Stage 4: Deploying the Binaries on Zynq-7000

Load the BRAM Memory with Raw Data from DDR

Mux Sel=0

Page 23: Flexible Partial Reconfiguration based Design Architecture

232323

Stage 4: Deploying the Binaries on Zynq-7000

PE0.bin

PCAP fetches partial bitfile (PE0.bin) from ddr3 memory to load into configuration port

Page 24: Flexible Partial Reconfiguration based Design Architecture

24242424

Stage 4: Deploying the Binaries on Zynq-7000

PE0.bin

Read the Raw Image data from BRAM Memory and Input it to PE0.bin

Start/Done BRAM Read Sel Mux=1

Page 25: Flexible Partial Reconfiguration based Design Architecture

2525252525

Stage 4: Deploying the Binaries on Zynq-7000

PE0.bin

Enable ‘start computation’ signal from ARM for PE0.bin

Page 26: Flexible Partial Reconfiguration based Design Architecture

262626262626

Stage 4: Deploying the Binaries on Zynq-7000

PE0.bin

Read ‘done computation’ signal to ARM for PE0.bin

Page 27: Flexible Partial Reconfiguration based Design Architecture

27272727272727

Stage 4: Deploying the Binaries on Zynq-7000

PE0.bin

Write the output generated by PE0.bin to BRAM memory

Start/Done BRAM Write

Page 28: Flexible Partial Reconfiguration based Design Architecture

282828

Stage 4: Deploying the Binaries on Zynq-7000

PEk.bin

The process to Load RM, Read BRAM, Compute & Write BRAM continues till the terminal PE … Finally, PCAP fetches partial bitfile (PEk.bin) from ddr3 memory to load into configuration port

Page 29: Flexible Partial Reconfiguration based Design Architecture

292929

Stage 4: Deploying the Binaries on Zynq-7000

Load the Results generated by PEk.bin to SD Card from BRAM Memory

Mux Sel=0

Page 30: Flexible Partial Reconfiguration based Design Architecture

30

JPEG Encoder PRBRAM Design Implementation

Utilizing 91.42 % of BRAM memory to store intermediate results

Total memory requirements (1048.576 Kbytes) > Available BRAM memory (140 blocks X 36Kb = 630 Kbytes)

Minimum number of partial reconfigurations = 8 (4 (Reconfigurable Modules) * 2(Dividing factor))

30

2048 blocks

4096 8 X 8 pixel

blocks

Divided the image dataset into 2 & Reload BRAM

2048 blocks

Page 31: Flexible Partial Reconfiguration based Design Architecture

3131

PRBRAM Design Implementation – Experimental Result

Page 32: Flexible Partial Reconfiguration based Design Architecture

32

𝑻𝒓𝒖𝒏𝒕𝒊𝒎𝒆 = 𝑻𝒋𝒑𝒆𝒈−𝒄𝒐𝒎𝒑𝒖𝒕𝒊𝒏𝒈 + 𝑻𝒐𝒗𝒆𝒓𝒉𝒆𝒂𝒅 + 𝑻𝒃𝒊𝒏 ∗ 𝑵𝒃𝒊𝒏

Tjpeg-computing: actual computing time it takes for processing all the inputs of each reconfigurable module

Tbin : time it takes to partially configure the bitstream

Nbin : number of times reconfiguration occurs

Toverhead : time it takes to load the partial binaries and raw image data from SD card to DDR memory

The values Tbin = 0.1975 s, Tjpeg-computing = 2.994 s and Toverhead = 1.675 s are obtained experimentally

PRBRAM Design Implementation – Experimental Results II

Cases of no. of reconfigurations (1) 32 (2) 64

(3) 128 (4) 256 (5) 512

32

Page 33: Flexible Partial Reconfiguration based Design Architecture

33

RTBRAM values for varying RPBitsize

PRBRAM Design Implementation – Experimental Results III

33

The size of the pblock or reconfigurable partition affects Tbin

There is a linear relationship

The Hardware running time for reconfigurable architectures is significantly impacted by this.

Page 34: Flexible Partial Reconfiguration based Design Architecture

34

PRBRAM Design Implementation – Experimental Results IV

[Case 1: RPBitsize = 1598.896 KB, Case 2: RPBitsize = 1306.272 KB, Case 3: RPBitsize = 786.664 KB]

34

4.57

3.86706

3.30284

2

2.5

3

3.5

4

4.5

5

0 1 2 3 4

FPG

A_R

UN

TIM

E (s

)

CASE

FPGA_RUNTIME vs RP_BITSIZE :No. of Reconfiguration = 8

9.305 7.71959

5.62322

4

5

6

7

8

9

10

0 1 2 3 4

FPG

A_R

UN

TIM

E (s

)

CASE

FPGA_RUNTIME vs RP_BITSIZE:No. of Reconfiguration = 32

28.167

23.13036

14.9039

0

5

10

15

20

25

30

0 1 2 3 4

FPG

A_R

UN

TIM

E (s

)

CASE

FPGA_RUNTIME vs RP_BITSIZE:No. of Reconfiguration = 128

103.619

83.11808

52.02699

0

20

40

60

80

100

120

0 1 2 3 4

FPG

A_R

UN

TIM

E (s

)

CASE

FPGA_RUNTIME vs RP_BITSIZE:No. of Reconfiguration = 512

Page 35: Flexible Partial Reconfiguration based Design Architecture

35

Objective:

Prove area utilization efficacy of PRBRAM

Objective:

Prove PRBRAM is runtime and latency efficient compared to PRDDR

JPEG Encoder PRDDR MethodJPEG Encoder Spatial Method

35

DCT

Quantization

IP containing

Reconfigurable

Partition

Page 36: Flexible Partial Reconfiguration based Design Architecture

36

Calculating Area Utilization

𝐴𝑡𝑜𝑡𝑎𝑙 = 𝐴𝑠𝑡𝑎𝑡𝑖𝑐 + 𝐴𝑑𝑦𝑛𝑎𝑚𝑖𝑐 𝑚𝑎𝑥 𝑃𝐸0, 𝑃𝐸1, 𝑃𝐸2, … . 𝑃𝐸𝑘

36

𝐴𝑡𝑜𝑡𝑎𝑙 = 𝑃𝐸0, 𝑃𝐸1, 𝑃𝐸2, … . 𝑃𝐸𝑘

PR Based Designs

Static Designs

*PE: Process Element

RunLength Encoding PE

Astatic of PRBRAM is high due to additional IPs in the overlay architecture

Max. BRAM Utilization

Page 37: Flexible Partial Reconfiguration based Design Architecture

24759 (Spatial)

18985 (PR_BRAM)

16442 (PR_DDR)

14000

15000

16000

17000

18000

19000

20000

21000

22000

23000

24000

25000

26000

1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3

AR

EA_F

F

FPGA_RUNTIME (s)

FPGA_RUNTIME vs AREA_FF

37

I. COMPARATIVE STUDY – AREA UTILIZATION vs FPGA RUNTIME

37

14997 (Spatial)

12373 (PR_BRAM)

10811 (PR_DDR)

8000

9000

10000

11000

12000

13000

14000

15000

16000

1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3 3.2 3.4 3.6 3.8 4 4.2 4.4 4.6 4.8 5 5.2 5.4 5.6

AR

EA_L

UT

FPGA_RUNTIME (s)

FPGA_RUNTIME vs AREA_LUT

PRBRAM design method requires slightly more area compared to PRDDR due additional IPs - ARM-FPGA

Control Bus, ARM-Side BRAM Control, MUX and Block RAM Memory modules.

Comparing spatial design implementation, the utilization of PRBRAM is significantly low, which is as expected.

Page 38: Flexible Partial Reconfiguration based Design Architecture

3838

1.53

1.163

0.947 0.88 0.832

2.0291.854

1.767 1.722 1.701

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

2.25

0 16 32 48 64 80 96 112 128 144 160 176 192 208 224 240 256 272

LATE

NC

Y (

s)

NUMBER_RECONFIGURATIONS

LATENCY COMPARISON

LATENCY_BRAM LATENCY_DDR

7.80710.96317.249

29.824

54.975

10.19416.915

30.344

57.211

110.937

05

101520253035404550556065707580859095

100105110115120

0 16 32 48 64 80 96 112 128 144 160 176 192 208 224 240 256 272

FPG

A_R

UN

TIM

E (s

)

NUMBER_RECONFIGURATIONS

FPGA_RUNTIME COMPARISON

RUNTIME_BRAM RUNTIME_DDR

Experiment with unequal RPBitsize

RPBitsize = 3416.088 KB for PRDDR

RPBitsize = 1598.896 KB for PRBRAM

Non-Linear Relationship in Runtime between the

graphs due to Tbin * Nbin not constant

II. COMPARATIVE STUDY – RUNNING TIME vs LATENCY

Page 39: Flexible Partial Reconfiguration based Design Architecture

39

PRBRAM & PRDDR Static.dcp Floorplan View – Equal Pblock Sizes

3939

Partition Pins

BRAM Blocks

Higher utilization of

logic elements due to

additional IPs.

Page 40: Flexible Partial Reconfiguration based Design Architecture

4040

II. COMPARATIVE STUDY – RUNNING TIME vs LATENCY

Experiment with equal RPBitsize = 1306.272 KB for both PR implementations.

Average improvement in runtime is 0.529s

Runtime varies linearly because Nbin * Tbin is constant

40

1.106429

0.874237

0.758173 0.700145

1.322754

0.982405

0.812247

0.727149

0.650.7

0.750.8

0.850.9

0.951

1.051.1

1.151.2

1.251.3

1.351.4

0 10 20 30 40 50 60 70

LATE

NC

Y (

s)

NUMBER_RECONFIGURATIONS

LATENCY COMPARISON

LATENCY_BRAM LATENCY_DDR

2.2128593.496951

6.065391

11.20233

2.685509

3.969622

6.537978

11.67439

22.5

33.5

44.5

55.5

66.5

77.5

88.5

99.510

10.511

11.512

0 8 16 24 32 40 48 56 64 72

FPG

A_R

UN

TIM

E (s

)

NUMBER_RECONFIGURATIONS

FPGA_RUNTIME COMPARISON

RUNTIME_BRAM RUNTIME_DDR

Page 41: Flexible Partial Reconfiguration based Design Architecture

41

CONCLUSION

41

Improvement in average hardware running (PRBRAM ) of 0.529s vs. PRDDR

Implementation with the proposed Architecture PRBRAM is area efficient compared to spatial

implementation with

LUT area savings up to 21.20 % & FF area savings up to 30.41 % for 1306.272 KB

These %’s are including the additional resources utilized by proposed static architecture

Implemented JPEG Encoder on Zynq -7000

For three testcase images-Lena, Peppers and Goldhill

Using Spatial, PRDDR & PRBRAM for Comparative Study & Analysis

Novel design methodology for dataflow computation with proposed PRBRAM overlay static architecture

Including TCL based automated floorplanning + User software application algorithms

Page 42: Flexible Partial Reconfiguration based Design Architecture

42

FUTURE WORKS

Sophisticated Partial Reconfiguration Controllers

Minimize the time required for reconfiguring

Enhanced parallelism of operations in hardware accelerators/Processing Elements due

to saved resources in reconfigurable architecture.

In extremely data-intensive applications, exploring performance impact on PRBRAM

BRAM + Distributed RAM to deal with limitations of on-chip memory

42

Page 43: Flexible Partial Reconfiguration based Design Architecture

43

THANK YOU

43

Page 44: Flexible Partial Reconfiguration based Design Architecture

44

APPENDIX

Page 45: Flexible Partial Reconfiguration based Design Architecture

45

Experimental Results – Spatial, PRBRAM & PRDDR

Table IV: JPEG Encoder Results for lena.bmp with PRDDR and PRBRAM Design Implementation

Table III: JPEG Encoder Results with Spatial Design Implementation

45

Page 46: Flexible Partial Reconfiguration based Design Architecture

4646

Partial Reconfiguration – Full vs Partial Bitstream

4646

Figure.2 Configuration Process & Contents of (a) Full and

(b) Partial bitstreams

(a) (b)

46

Page 47: Flexible Partial Reconfiguration based Design Architecture

4747

Partial Reconfiguration – Method to configure Partial Bitstreams

Internal Configuration Access Port (ICAP) :

User configuration solutions

Requires ICAP controller + Logic to drive the ICAP interface

JTAG Port :

Quick Testing or Debug

Driven using iMPACT or ChipScope Analyzer

Processor Configuration Port (PCAP) :

Configuration mechanism for all Zynq-7000 designs.

Figure.3 (a) ICAP (b) PCAP (c) JTAG

(a) (b) (c)

47

Page 48: Flexible Partial Reconfiguration based Design Architecture

48

JPEG Encoder Hardware Accelerators : DCT & Quantization

Discrete Cosine Transform (DCT):

Converts spatial domain to frequency domain

𝑫𝑪𝑻 𝒊, 𝒋 =𝟏

𝟒𝑪 𝒊 𝑪 𝒋

𝒙=𝟎

𝟕

𝒚=𝟎

𝟕

𝒑𝒊𝒙𝒆𝒍 𝒙, 𝒚 𝐜𝐨𝐬𝟐𝒙 + 𝟏 𝒊𝜫

𝟏𝟔𝐜𝐨𝐬

𝟐𝒚 + 𝟏 𝒋𝜫

𝟏𝟔𝟏 ,𝐖𝐡𝐞𝐫𝐞, 𝑪 𝒌 =

𝟏

𝟐𝒊𝒇 𝒌 = 𝟎 & 𝑪 𝒌 = 𝟏 𝒐𝒕𝒉𝒆𝒓𝒘𝒊𝒔𝒆

Quantization:

Dividing transformed image DCT matrix by quantization matrix used and rounding off

Aims at reducing most of the less important high frequency DCT coefficients to zero

𝑫𝑪𝑻𝑸 𝒊, 𝒋 = 𝑹𝒐𝒖𝒏𝒅𝑫𝑪𝑻(𝒊, 𝒋)

𝑸(𝒊, 𝒋)(𝟐)

DCT Quantz RLE Huffman

Inp

ut

Stre

am

Ou

tpu

t St

ream

48

Page 49: Flexible Partial Reconfiguration based Design Architecture

49

JPEG Encoder Hardware Accelerators : RunLength Encoding

DCT Quantz RLE Huffman

Inp

ut

Stre

am

Ou

tpu

t St

ream

Zig Zag Scan

Encoding the difference between current and the previous 8 X 8 block

Encodes a series of zeros as a (skip, value) pair

Differential Pulse Code Modulation (DPCM)

RunLength on AC Components (RLC)

49

Page 50: Flexible Partial Reconfiguration based Design Architecture

50

JPEG Encoder Hardware Accelerators : Entropy Coding

DCT Quantz RLE Huffman

Inp

ut

Stre

am

Ou

tpu

t St

ream

SIZE Value Code

0 0 ---

1 -1,1 0,1

2 -3, -2, 2,3 00,01,10,11

3 -7,…, -4, 4,…, 7 000,…, 011, 100,…111

4 -15,…, -8, 8,…, 15

0000,…, 0111, 1000,…, 1111

. .

. .

11 -2047,…, -1024, 1024,… 2047

DC Components

DC components are differentially coded as

(SIZE, Value)

Code for a Value is derived from the

Size_and_Value Table (Table.1)

Code for a SIZE is derived from Table.2

Example: If a DC component is 40 and the previous DC component is 48. The difference is -8. Huffman coded as: 1010111

0111: The value for representing –8

(Size_and_Value table)

101: The size from the same table reads

4, which corresponds to 101 from Table.2

Table.1 Size_and_Value

SIZE Code

Length

Huffman Code

0 2 00

1 3 010

2 3 011

3 3 100

4 3 101

5 3 110

6 4 1110

7 5 11110

8 6 111110

9 7 1111110

10 8 11111110

11 9 111111110

Table.2 Huffman Table for DC component SIZE field

50

Page 51: Flexible Partial Reconfiguration based Design Architecture

5151

JPEG Encoder Hardware Accelerators : Entropy Coding

DCT Quantz RLE Huffman

Inp

ut

Stre

am

Ou

tpu

t St

ream

AC Components: Coded as (S1,S2 pairs)

S1: (RunLength/SIZE) , where RunLength : length of the

consecutive zero values [0..15] & SIZE : No. of bits needed to code

the next nonzero AC component’s value

S2: (Value), where Value is the value of the AC component from

Table.1

Zig-Zag order -> 12,10, 1, -7 2 0s, -4, 56 zeros

12: read as zero 0s,12: (0/4)12 10111100

1011: The code for (0/4 from Table.3)

1100: The code for 12 from the Table.1

56 0s: (0,0) 1010 (Rest of the components are zeros therefore we simply put the EOB to signify this fact)

Run/

SIZE

Code

Length

Code

0/0 4 1010

0/1 2 00

0/2 2 01

0/3 3 100

0/4 4 1011

0/5 5 11010

0/6 7 1111000

0/7 8 11111000

0/8 10 1111110110

0/9 16 1111111110000010

0/A 16 1111111110000011

Run/

SIZE

Code

Length

Code

1/1 4 1100

1/2 5 11011

1/3 7 1111001

1/4 9 111110110

1/5 11 11111110110

1/6 16 1111111110000100

1/7 16 1111111110000101

1/8 16 1111111110000110

1/9 16 1111111110000111

1/A 16 1111111110001000

… 15/A More Such rows

Table.3 Huffman Table for AC component SIZE field

40 12 0 0 0 0 0 010 -7 -4 0 0 0 0 01 0 0 0 0 0 0 00 0 0 0 0 0 0 00 0 0 0 0 0 0 00 0 0 0 0 0 0 00 0 0 0 0 0 0 00 0 0 0 0 0 0 0

Figure.9 Example of 8 X 8 block after quantization

51

Page 52: Flexible Partial Reconfiguration based Design Architecture

52

PRDDR Design Implementation – Floorplan View Post P & R

Figure.24 Design Checkpoints after performing Place & Route (a) Staticddr.dcp (b) DCTddr.dcp (c) Quantizationddr.dcp (d) RLEddr.dcp (e) Huffmanddr.dcp for RPBitsize = 1306.272 KB case

52

Page 53: Flexible Partial Reconfiguration based Design Architecture

53

PRDDR Design Implementation – Experimental Results II

Table.13 JPEG Encoder Results for lena.bmp with PRDDR Design Implementation

Additional experiment results were tabulated testcase images of peppers.bmp & goldhill.bmp:

Table.14 peppers.bmp Table.15 goldhill.bmp

53

Page 54: Flexible Partial Reconfiguration based Design Architecture

54

SPATIAL Design Implementation – Experimental Results

Table.11 JPEG Encoder results with Spatial Design Implementation

Figure.22 SSIM Maps (a) Lena (b) Goldhill (c) Peppers

(a) (b) (c)

54

Page 55: Flexible Partial Reconfiguration based Design Architecture

55

SPATIAL Design Implementation - System Implementation and Setup

Using the Vivado IP Integrator, custom IPs of JPEG Encoder are connected

using AXI4 spatially.

Clock frequency : 50 MHz

Additional Hardware Resources :

SD Card and DDR3 connected to the external interfaces on the

Processing Side (PS) of the Zynq FPGA for storing data

Figure.21 Floorplan View

Table.10 Utilization Report

55

Page 56: Flexible Partial Reconfiguration based Design Architecture

56

PRDDR Design Implementation - System Implementation and Setup

2 Micron DDR3 128 Megabit x 16 memory components

creating a 32-bit interface, totaling 512 MB.

The DDR3 is connected to the hard memory controller in

the Processor Subsystem (PS).

DDR3 memory is referenced using pointers in user software

application

address mappings provided in the systems.hdf file

In JPEG Encoder Implemented,

Max. No of I/O: RLE RM block &

Max. data-widths of I/O : Huffman Encoding RM block

Figure.23 PRDDR design methodology Block Diagram for dataflow computation

56

Table.12 Utilization Report Post P & R

Page 57: Flexible Partial Reconfiguration based Design Architecture

5757

Partial Reconfiguration - Ideology & Benefits

Reduced Resource and Power Consumption:

Integrating the design into a lower FPGA IC count.

Power savings due to Reduction in off-chip communication.

Performance Improvements and Flexibility:

Computation capacity of the system adapted at run time

Additional resources for speeding up the operation of the kernel

More number of kernels to perform the operation in parallel.

Improved Fault Tolerance and Dependability:

Safety critical systems - aerospace & defense industries.

Self Adapting Hardware Designs:

Adapt to changing operating and environmental conditions based on AI &

learning.

Figure.1 Basic Ideology of PR

Partial Reconfiguration Definition–Allows the modification of an operating FPGA

by downloading Bitfile

Page 58: Flexible Partial Reconfiguration based Design Architecture

58

PRBRAM Design Implementation – Experimental Results I

Table.7 JPEG Encoder Results for lena.bmp with PRBRAM Design Implementation

Table.8 JPEG Encoder Results for goldhill.bmp and peppers.bmp with PRBRAM Design Implementation

Figure.16 PRBRAM results for lena.bmp testcase with RPBitsize = 1306.272 KB

58

Runtime and Latency have inverse relationship

Latency and Throughput have linear relationship