ADSD Fall 2011, Lecture 05: Architecting Speed (3 Nov 2011)


Dr. Rehan Hafiz <[email protected]>

Lecture # 05



Course Website for ADSD Fall 2011

http://lms.nust.edu.pk/


Lectures: Tuesday @ 5:30-6:20 pm, Friday @ 6:30-7:20 pm

Contact: By appointment/Email

Office: VISpro Lab, above SEECS Library

Acknowledgement: Material from the following sources has been consulted/used in these slides:

1. [CIL] Advanced Digital Design with the Verilog HDL, M. D. Ciletti

2. [SHO] Digital Design of Signal Processing Systems, Dr. Shoab A. Khan

3. [STV] Advanced FPGA Design, Steve Kilts

4. Some slides from [ECEN 248, Dr. Shi]

Material from these slides can be used with the following citation:

Dr. Rehan Hafiz: Advanced Digital System Design 2010

Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

http://creativecommons.org/licenses/by-nc-sa/3.0/














This Lecture


Understanding & Optimizing

Speed

Throughput

Timings

Reading Assignment

Chapter 1: Advanced FPGA Design, by Steve Kilts

Xilinx Application Note Uploaded on MOODLE + Practice in Xilinx ISE

Setup/Hold time violation



Speed


Throughput

Amount of data that is processed per clock cycle

Metric: bits/sec

Latency

Time between data input and processed data output

Metric: No. of cycles or time

Timing

Logic delays between sequential elements

Metric : Clock period or Frequency.



High Throughput Design

A high-throughput design is

More concerned with the steady-state data rate

Less concerned about the time any specific piece of data requires to propagate through the design (latency)

Techniques

Pipelining



Throughput

[Figure: top-level entity with three flip-flops (D, Q, clk) separated by two blocks of combinational logic]

Throughput = (bits per output sample) / (time between consecutive output samples)

Bits per output sample: In this example, 8 bits per output sample

Time between consecutive output samples: the number of clock cycles from output(n) to output(n+1). It can be measured in clock cycles and then translated to time.

In this example, time between consecutive output samples = 1 clock cycle = 10 ns

Throughput = (8 bits per output sample) / (10 ns) = 0.8 bits / ns = 800 Mbits/s

[Timing diagram: clk at 100 MHz; 8-bit input samples input(0), input(1), input(2); 8-bit output samples (unknown), output(0), output(1); 1 cycle between output samples]



An Example...

Software Code

Digital Implementation

XPower = 1;
for (i = 0; i < 3; i++)
    XPower = X * XPower;

Throughput: 8/3 ≈ 2.7 bits/cycle

Latency: 3 clock cycles

Timing: 1 multiplier delay

The same register and computational resources are reused

No new computation can begin until the previous computation has completed

[KIL]



Coding an iterative algorithm

<with dependency>

XPower = 1;

for (i=0;i < 3; i++)

XPower = X * XPower;

module power3(
    output reg [7:0] XPower,   // declared reg here, not redeclared below
    output finished,
    input [7:0] X,
    input clk, start);

    reg [7:0] ncount;          // multiplications still to be done

    assign finished = (ncount == 0);

    always @(posedge clk)
        if (start) begin
            XPower <= X;       // load X, two multiplications remain
            ncount <= 2;
        end
        else if (!finished) begin
            ncount <= ncount - 1;
            XPower <= XPower * X;   // same multiplier reused each cycle
        end
endmodule



Loop Unrolling


XPower = 1;

for (i=0;i < 3; i++)

XPower = X * XPower;

The final calculation of X^3 (XPower3 resources) and the first calculation for the next value of X (XPower2 resources) occur simultaneously

[Figure: pipelined datapath; the stages hold x[n], x[n-1] and x[n-1]^2, and x[n-2]^3]



Coding

module power3(
    output reg [7:0] XPower,
    input clk,
    input [7:0] X
    );

    reg [7:0] XPower1, XPower2;
    reg [7:0] X1, X2;

    always @(posedge clk) begin
        // Pipeline stage 1
        X1 <= X;
        XPower1 <= X;
        // Pipeline stage 2
        X2 <= X1;
        XPower2 <= XPower1 * X1;
        // Pipeline stage 3
        XPower <= XPower2 * X2;
    end
endmodule

[Figure: pipeline registers X1, X2 and XPower1, XPower2]



Pipelined (unrolled) implementation: throughput 8/1 = 8 bits/cycle

Iterative implementation: throughput 8/3 ≈ 2.7 bits/cycle





In general, if an algorithm requiring n iterative loops is “unrolled,” the pipelined implementation will exhibit a throughput performance increase of a factor of n.

The penalty for unrolling an iterative loop is a proportional increase in area.
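As a quick check of the factor-of-n claim against the X^3 example, assuming an output word of W = 8 bits and n = 3 loop iterations (both taken from that example; the symbol W is introduced here only for illustration):

```latex
\[
\text{Throughput}_{\text{iterative}} = \frac{W}{n} = \frac{8}{3} \approx 2.7\ \text{bits/cycle},
\qquad
\text{Throughput}_{\text{unrolled}} = \frac{W}{1} = 8\ \text{bits/cycle}
\]
\[
\frac{\text{Throughput}_{\text{unrolled}}}{\text{Throughput}_{\text{iterative}}} = n = 3
\]
```

The area cost scales the same way: the unrolled version uses two multipliers and extra pipeline registers where the iterative version reuses one multiplier.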



Decreasing Latency

A low-latency design is one that passes the data from

the input to the output as quickly as possible by

minimizing the intermediate processing delays.

Technique

Removal of pipelining, and logical shortcuts that may reduce the throughput or the maximum clock speed of a design

Parallelism



Latency

[Figure: top-level entity with three flip-flops (D, Q, clk) separated by two blocks of combinational logic]

Latency is the time between input(n) and output(n)

i.e. time it takes from first input to first output, second input to second output, etc.

Also called input-to-output latency

Count the number of rising edges after input

In this example there are 2 rising edges, so the latency is 2 cycles

Latency is measured in clock cycles (then translated to seconds)

In this example, say clock period is 10 ns, then latency is 20 ns

[Timing diagram: clk at 100 MHz; 8-bit input samples input(0), input(1), input(2); 8-bit output samples (unknown), output(0), output(1)]



Removal of pipelining

Throughput: 8/1 = 8 bits/cycle

Latency: less than a cycle

Timing: 2 multiplier delays



Penalty

Penalty in timing

Previous implementations could theoretically run the system clock period close to the delay of a single multiplier

For the low-latency implementation, the clock period must be at least two multiplier delays

module power3(
    output [7:0] XPower,
    input [7:0] X
    );

    reg [7:0] XPower1, XPower2;
    reg [7:0] X1, X2;

    assign XPower = XPower2 * X2;

    always @* begin
        X1 = X;
        XPower1 = X;
    end

    always @* begin
        X2 = X1;
        XPower2 = XPower1 * X1;
    end
endmodule



Understanding Timing




Timings

Combinational Logic & Routing

Flip Flops

Setup time

Hold time

Propagation delay tCLK2Q



Timing: Flip Flops (Sequential Logic)

[Timing diagram: clk, D, and Q waveforms for a D flip-flop. Input D must remain stable during the interval from tS before to tH after the rising clock edge and can change freely outside it; Q changes tCLK2Q after the edge.]

Setup time tS: minimum time the input has to be stable before the rising edge of the clock

Hold time tH: minimum time the input has to be stable after the rising edge of the clock

Propagation delay tCLK2Q: time to propagate the input to the output after the rising edge of the clock



Timing:

Path timing

[Figure: launch flip-flop, combinational logic, and capture flip-flop on a common clk; within one clock period T the path delay tpath is made up of tCLK2Q, tLOGIC, tROUTING, and tS]

tCLK2Q + tLOGIC + tROUTING < (T - tS) to avoid a setup time violation

Rewriting the equation: tCLK2Q + tLOGIC + tROUTING + tS < T

A path is defined as a route from the output of one flip-flop to the input of another flip-flop



Critical Path Delay

Path delay tpath = tCLK2Q + tLOGIC + tROUTE + tS

The largest of all the path delays in a circuit is called the critical path delay (tcritical_path)

The associated path is called the critical path

There can be millions of paths in a circuit; timing analysis CAD tools help to locate the critical path



Critical Path

[Figure: five flip-flops connected by four paths (PATH 1 to PATH 4); annotated delays include 1.1 ns, 0.5 ns, 0.8 ns, tCLK2Q = 0.4 ns, and tS = 0.2 ns]

Path delays: tpath1 = 2.2 ns, tpath2 = 1.1 ns, tpath3 = 3.0 ns, tpath4 = 1.4 ns

The critical path is path 3; the critical path delay is tcritical_path = tpath3 = 3.0 ns



Setup Time Violation (a.k.a Critical Path Violation)

[Figure: launch flip-flop, wire 1, combinational gate A, wire 2, combinational gate B, wire 3, capture flip-flop; the delays are laid out against the clock period T]

tCLK2Q = 0.4 ns, twire1 = 0.4 ns, tgateA = 2.0 ns, twire2 = 0.2 ns, tgateB = 1.2 ns, twire3 = 0.8 ns, tS = 0.2 ns

Critical path delay = t critical_path = 5.2 ns

The minimum period for this circuit to work is Tmin = 5.2 ns

Maximum clock frequency = 1/Tmin = 192 MHz

If the clock period is smaller than Tmin, you will get a timing violation and circuit will not operate correctly!!

This kind of timing violation is called a "setup time" violation (also known as critical path violation)
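The 5.2 ns figure is simply the sum of every delay along the path, using the values annotated in the figure (the subscripted symbol names are spelled out here for readability):

```latex
\[
t_{critical\_path} = t_{CLK2Q} + t_{wire1} + t_{gateA} + t_{wire2} + t_{gateB} + t_{wire3} + t_{S}
\]
\[
t_{critical\_path} = 0.4 + 0.4 + 2.0 + 0.2 + 1.2 + 0.8 + 0.2 = 5.2\ \text{ns}
\]
\[
f_{max} = \frac{1}{T_{min}} = \frac{1}{5.2\ \text{ns}} \approx 192\ \text{MHz}
\]
```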



Review: From Last Lecture

Throughput

Amount of data that is processed per clock cycle, OR the aggregate/average data processing rate

Ideally, the average data rate INTO your system should equal the average data rate OUT of your system; otherwise you will miss data!

Technique: Pipelining & Loop Unrolling

Streaming applications are more concerned with throughput

Metric: bits/sec

Latency

Time between data input and processed data output

Technique: parallelising the system

Response time is important for time-critical signals, e.g. an interrupt-triggered operation processing an external signal of an avionics system

Metric: No. of cycles or time

Normally a compromise!



Timing

Timing

Logic delays between sequential elements

Metric: Clock period or Frequency.

tCLK2Q + tLOGIC + tROUTING + tS < T

Clock skew: the rising edge of the clock does not arrive at the clock inputs of all flip-flops at the same time



Clock Skew

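The clock delay is often caused by wire routing delay

[Figure: two flip-flops (in, out) with a delay element on the clock line, so that the second flip-flop is clocked by clk' instead of clk. Waveforms show tskew between clk and clk' for the two cases: lag clock skew and lead clock skew.]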



Positive slack: the data arrives at the capture flip-flop before the capture clock edge less the setup time.

Negative slack: the data arrives after the capture clock edge less the setup time. Negative slack is an issue.
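In terms of the path-delay symbols used earlier, setup slack for a single-cycle path can be sketched as below. This assumes no clock skew, and the two clock periods in the numeric lines (6 ns and 5 ns) are hypothetical values applied to the 5.2 ns critical path of the setup-violation example (where the delays other than t_S sum to 5.0 ns and t_S = 0.2 ns):

```latex
\[
\text{slack} = \underbrace{(T - t_{S})}_{\text{required time}}
             - \underbrace{(t_{CLK2Q} + t_{LOGIC} + t_{ROUTING})}_{\text{arrival time}}
\]
\[
T = 6\ \text{ns}: \quad \text{slack} = (6 - 0.2) - 5.0 = +0.8\ \text{ns} \quad (\text{timing met})
\]
\[
T = 5\ \text{ns}: \quad \text{slack} = (5 - 0.2) - 5.0 = -0.2\ \text{ns} \quad (\text{setup violation})
\]
```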



Lead clock skew is bad because it may cause setup

time violations

[Figure: launch and capture flip-flops with combinational logic between them, shown without skew (both clocked by clk) and with skew (capture flip-flop clocked by clk', offset from clk by tskew); tCLK2Q, tLOGIC + tROUTE, and tS are marked against the clock period T]

WITHOUT SKEW:

tCLK2Q + tLOGIC + tROUTE + tS < T to avoid a setup time violation

WITH SKEW:

tCLK2Q + tLOGIC + tROUTE + tS < (T - tskew) to avoid a setup time violation

There is less time to perform logic than you normally would have

Solution: optimize the logic, pipeline, or use a faster speed grade



Lag clock skew is bad because it may cause hold

time violations

[Figure: launch flip-flop clocked by clk and capture flip-flop clocked by clk', with combinational logic between them; the waveforms mark tCLK2Q, tLOGIC + tROUTE, tskew, and tH]

tCLK2Q + tLOGIC + tROUTE > (tskew + tH) to avoid a hold time violation

If this is violated, you get data feedthrough (data gets fed into the next register one cycle too early)

There is no clock period (T) in the equation; changing the clock period cannot help with this problem!

Solution: add dummy logic, e.g. a buffer

On FPGAs, a hold time violation is usually a sign of clock skew



Maximum Achievable Frequency

Maximum-frequency equation (ignoring clock-to-clock jitter):

Tskew is the propagation delay of the clock between the launch flip-flop and the capture flip-flop

Its sign (-ve or +ve) depends on whether the skew is a lead or a lag
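The maximum-frequency relation can be written in a form consistent with the setup inequalities given earlier. The sign convention is an assumption made here: t_skew is taken as positive when the capture clock arrives later than the launch clock (lag) and negative when it arrives earlier (lead).

```latex
\[
F_{max} = \frac{1}{t_{CLK2Q} + t_{LOGIC} + t_{ROUTE} + t_{S} - t_{skew}}
\]
% Lead skew (t_skew < 0) lengthens the minimum period:
\[
T_{min} = t_{CLK2Q} + t_{LOGIC} + t_{ROUTE} + t_{S} + |t_{skew}|
\quad\Longleftrightarrow\quad
t_{CLK2Q} + t_{LOGIC} + t_{ROUTE} + t_{S} < T - |t_{skew}|
\]
```

Under this convention the second line reproduces the "with skew" setup condition for a lead clock skew, while a lag shortens the minimum period but eats into the hold margin.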



Reading Assignment




Some Examples

Example 1:

Analyzing Sequential Circuits

° What is the minimum time between rising clock edges?

• Tmin = TCLK-Q(FFA) + TLogic(G) + TRoute(G) + Ts(FFB)

[Figure: FFA (TClk-Q = 5 ns) drives combinational logic G (Tlogic+Route = 5 ns), which feeds FFB (TClk-Q = 5 ns, Ts = 2 ns); signals X, D, Y, Z, CLK]
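Plugging the figure's values into the Tmin expression gives the following worked result (the logic and routing delays of G are given only as a combined 5 ns):

```latex
\[
T_{min} = T_{CLK\text{-}Q}(\mathrm{FFA}) + T_{Logic+Route}(G) + T_{s}(\mathrm{FFB})
        = 5 + 5 + 2 = 12\ \text{ns}
\]
\[
f_{max} = \frac{1}{T_{min}} = \frac{1}{12\ \text{ns}} \approx 83.3\ \text{MHz}
\]
```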

Example 2: Hold Time Violation

° Will we get a hold time violation in this example?

° Make sure Y remains stable for the hold time (Th) after the rising clock edge

° Remember: the contamination delay ensures the signal doesn't change too early

• TCLK2Q(FFA) + Tcd(G) >= Th

• 1 ns + 2 ns > 2 ns, so there is no hold time violation

[Figure: FFA (Tclk2Q = 1 ns) drives combinational logic G (Tcd = 2 ns), which feeds FFB (Th = 2 ns); signals X, D, Y, Z, CLK]



Example-3

° What is the minimum clock period (Tmin) of this circuit?

° What if FFB has a clock skew: a lead of 1 ns?

[Figure: FFA (TClk-Q = 5 ns) drives combinational logic H (Tlogic+Route = 5 ns), which feeds FFB (TClk-Q = 4 ns, Ts = 2 ns); the output of FFB feeds back through combinational logic F (Tlogic+Route = 4 ns) into logic H; signals X, Y, Z, CLK]

Solution

° Path FFA to FFB

• TClk-Q(FFA) + Tpd(H) + Ts(FFB) = 5 ns + 5 ns + 2 ns = 12 ns

° Path FFB to FFB

• TClk-Q(FFB) + Tpd(F) + Tpd(H) + Ts(FFB) = 4 ns + 4 ns + 5 ns + 2 ns = 15 ns

° The longer path sets the clock period: Tmin = 15 ns



Solution (With Lead of 1 ns for FFB)

° Path FFA to FFB

• TClk-Q(FFA) + Tpd(H) + Ts(FFB) + Tskew = 5 ns + 5 ns + 2 ns + 1 ns = 13 ns

° Path FFB to FFB

• TClk-Q(FFB) + Tpd(F) + Tpd(H) + Ts(FFB) = 4 ns + 4 ns + 5 ns + 2 ns = 15 ns

° The FFB to FFB path is still the longest, so Tmin remains 15 ns



Example

Analyzing Sequential Circuits: Hold Time Violations

Path FFA to FFB

• TClk2q(FFA) + Tlogic+Route(H) > Th(FFB): 1 ns + 2 ns > 2 ns

Path FFB to FFB

• TClk2q(FFB) + TCD(F) + Tlogic+Route(H) > Th(FFB): 1 ns + 1 ns + 2 ns > 2 ns

[Figure: FFA (Tclk2Q = 1 ns) drives combinational logic H, which feeds FFB (Tclk2Q = 1 ns, Th = 2 ns); the output of FFB feeds back through combinational logic F (Tlogic+Route = 1 ns) into logic H; signals X, Y, CLK]

All paths must satisfy the requirements



Optimizing Timing: A Few Simple Design Considerations



Consider an FIR Filter

The equation for the computation of an L-tap FIR filter is:

$$ y[n] = \sum_{k=0}^{L-1} h_k \, x[n-k] $$

If L = 5:

y[0] = h0 x[0] + h1 x[-1] + h2 x[-2] + h3 x[-3] + h4 x[-4]

y[1] = h0 x[1] + h1 x[0] + h2 x[-1] + h3 x[-2] + h4 x[-3]

y[2] = h0 x[2] + h1 x[1] + h2 x[0] + h3 x[-1] + h4 x[-2]

y[3] = h0 x[3] + h1 x[2] + h2 x[1] + h3 x[0] + h4 x[-1]

y[4] = h0 x[4] + h1 x[3] + h2 x[2] + h3 x[1] + h4 x[0]

y[5] = h0 x[5] + h1 x[4] + h2 x[3] + h3 x[2] + h4 x[1]



Parallel FIR Implementation

Critical Path ??

module fir(
    output reg [7:0] Y,        // declared reg here, not redeclared below
    input [7:0] A, B, C, X,
    input clk,
    input validsample);

    reg [7:0] X1, X2;

    always @(posedge clk)
        if (validsample) begin
            X1 <= X;
            X2 <= X1;
            // multiply and both additions sit between the same registers
            Y <= A * X + B * X1 + C * X2;
        end
endmodule

Technique 1: Pipelining

<Reducing TLOGIC+PROPAGATION>



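Code

module fir(
    output reg [7:0] Y,
    input [7:0] A, B, C, X,
    input clk,
    input validsample);

    reg [7:0] X1, X2;
    reg [7:0] prod1, prod2, prod3;

    always @(posedge clk) begin
        if (validsample) begin
            X1 <= X;
            X2 <= X1;
            // stage 1: multipliers only
            prod1 <= A * X;
            prod2 <= B * X1;
            prod3 <= C * X2;
        end
        // stage 2: adders only
        Y <= prod1 + prod2 + prod3;
    end
endmodule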



Technique 2: Increasing Parallelism

<Speeding up the logic process>

Optimize the critical path such that logic structures can be implemented in parallel

Example:

For the X-cubed code, break the multipliers into independent operations and then recombine them.



Taking a square

8-bit binary multiplier

8 muxed shifts + 8 8-bit additions

8-bit multiplication:

                1 1 1 1 1 0 1 0
              x 1 1 1 1 1 0 1 0
                ---------------
                0 0 0 0 0 0 0 0
              1 1 1 1 1 0 1 0
            0 0 0 0 0 0 0 0
          1 1 1 1 1 0 1 0
        1 1 1 1 1 0 1 0
      1 1 1 1 1 0 1 0
    1 1 1 1 1 0 1 0
  1 1 1 1 1 0 1 0
-------------------------------
1 1 1 1 0 1 0 0 0 0 1 0 0 1 0 0



Optimizing Logic by Adding Parallelism

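Assume we are squaring an 8-bit number that can be represented by nibbles A and B:

X = {A, B} = a3 a2 a1 a0 b3 b2 b1 b0

Partial products of X * X:

          a3   a2   a1   a0   b3   b2   b1   b0
    x     a3   a2   a1   a0   b3   b2   b1   b0

    b0:   a3b0 a2b0 a1b0 a0b0 b3b0 b2b0 b1b0 b0b0
    (rows for b1, b2, and b3 follow the same pattern)
    a0:   a0a3 a0a2 a0a1 a0a0 a0b3 a0b2 a0b1 a0b0
    a1:   a1a3 a1a2 a1a1 a1a0 a1b3 a1b2 a1b1 a1b0
    a2:   a2a3 a2a2 a2a1 a2a0 a2b3 a2b2 a2b1 a2b0
    a3:   a3a3 a3a2 a3a1 a3a0 a3b3 a3b2 a3b1 a3b0

The partial products group into three terms: B*B, 2*A*B, and A*A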

Example: X = 1 1 1 1 1 0 1 0, so A = 1 1 1 1 and B = 1 0 1 0

B*B   =   0 1 1 0 0 1 0 0

2*A*B = 1 0 0 1 0 1 1 0 0

A*A   =   1 1 1 0 0 0 0 1

X*X   = 1 1 1 1 0 1 0 0 0 0 1 0 0 1 0 0
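The nibble decomposition corresponds to the identity X*X = (A*A << 8) + (2*A*B << 4) + B*B. A minimal Verilog sketch of that idea follows, assuming an unsigned 8-bit input; the module name square8_nibbles and the signal names aa, ab, bb, and sq are hypothetical and chosen only for illustration. The three 4x4 multipliers have no dependency on each other, so they can evaluate in parallel before the final additions.

```verilog
// Hypothetical sketch: square an unsigned 8-bit value from its two nibbles.
module square8_nibbles(
    output [15:0] sq,
    input  [7:0]  x);

    wire [3:0] a = x[7:4];   // upper nibble A
    wire [3:0] b = x[3:0];   // lower nibble B

    // Three independent 4x4 multiplications (evaluated at 8-bit width)
    wire [7:0] aa = a * a;
    wire [7:0] ab = a * b;
    wire [7:0] bb = b * b;

    // Recombine: A*A weighted by 2^8, 2*A*B weighted by 2^4 (ab shifted by 5)
    assign sq = {aa, 8'b0} + {ab, 5'b0} + bb;
endmodule
```

For x = 250 (A = 15, B = 10) this gives aa = 225, ab = 150, bb = 100, and 57600 + 4800 + 100 = 62500, matching the 16-bit product of the worked example.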



Technique 3: Register Balancing

<Distribute long logic paths evenly across register layers>

Keep a balance in the critical path

Redistribute logic evenly between registers to

minimize the worst-case delay between any two

registers
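A minimal sketch of register balancing, assuming a registered three-input adder as the design being balanced; the module names adder_unbalanced and adder_balanced and the register names are hypothetical. Both versions produce the same sum with the same two-clock latency, but the second spreads the two additions across the two register layers.

```verilog
// Before balancing: the first register layer has no logic in front of it,
// while both additions sit in the second stage (two adder delays).
module adder_unbalanced(
    output reg [7:0] Sum,
    input      [7:0] A, B, C,
    input            clk);

    reg [7:0] rA, rB, rC;

    always @(posedge clk) begin
        rA  <= A;
        rB  <= B;
        rC  <= C;
        Sum <= rA + rB + rC;
    end
endmodule

// After balancing: one addition is moved ahead of the first register layer,
// leaving a single adder delay in each stage.
module adder_balanced(
    output reg [7:0] Sum,
    input      [7:0] A, B, C,
    input            clk);

    reg [7:0] rABSum, rC;

    always @(posedge clk) begin
        rABSum <= A + B;
        rC     <= C;
        Sum    <= rABSum + rC;
    end
endmodule
```

The balanced version also needs one register fewer, since A and B are stored as a single partial sum.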



Technique 4: Flatten Logic Structures

<Removing redundant logic>

Break up logic structures that are coded in a serial fashion

Avoid priority structures if they are not required

Example: control signals coming from an address decoder are used to write four 1-bit registers

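module regwrite(
    output reg [3:0] rout,
    input clk, in,
    input [3:0] ctrl);

    // if / else-if chain: infers a priority structure across ctrl[3:0]
    always @(posedge clk)
        if (ctrl[0]) rout[0] <= in;
        else if (ctrl[1]) rout[1] <= in;
        else if (ctrl[2]) rout[2] <= in;
        else if (ctrl[3]) rout[3] <= in;
endmodule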



If the control lines are strobes from an address decoder in another module

Each strobe is mutually exclusive to the others, as they all represent a unique address.

Is there any need for a priority structure?



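module regwrite(
    output reg [3:0] rout,
    input clk, in,
    input [3:0] ctrl);

    // independent if statements: no priority between the strobes
    always @(posedge clk) begin
        if (ctrl[0]) rout[0] <= in;
        if (ctrl[1]) rout[1] <= in;
        if (ctrl[2]) rout[2] <= in;
        if (ctrl[3]) rout[3] <= in;
    end
endmodule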

Tip

Technique 5: Reordering Paths

<Shortening Critical Paths>

Mostly done by the synthesizer!

Reorder the paths in the dataflow to minimize the critical path

When to use: where multiple paths combine with the critical path

The combined path can be reordered such that the critical path is moved closer to the destination register

Technique 5: Reordering Paths

Events not mutually exclusive

module randomlogic(
    output reg [7:0] Out,
    input [7:0] A, B, C,
    input clk,
    input Cond1, Cond2);

    always @(posedge clk)
        if (Cond1)
            Out <= A;
        else if (Cond2 && (C < 8))
            Out <= B;
        else
            Out <= C;
endmodule

module randomlogic(
    output reg [7:0] Out,
    input [7:0] A, B, C,
    input clk,
    input Cond1, Cond2);

    wire CondB = (Cond2 & !Cond1);

    // the comparison on C is now tested first, closest to the register
    always @(posedge clk)
        if (CondB && (C < 8))
            Out <= B;
        else if (Cond1)
            Out <= A;
        else
            Out <= C;
endmodule

Summary: Architecting Speed

High Throughput: Pipelining

Low Latency: Parallelism, Pipeline Removal

Timing: Parallelism, Pipelining, Flattening Logic Structure, Register Balancing, Path Reordering

In your digital design, make your specification your goal and apply the techniques accordingly

Recap

A high-throughput architecture is one that maximizes the number of bits per second that can be processed by a design.

Unrolling an iterative loop increases throughput.

The penalty for unrolling an iterative loop is a proportional increase in area.

A low-latency architecture is one that minimizes the delay from the input of a module to the output.

Latency can be reduced by removing pipeline registers.

The penalty for removing pipeline registers is an increase in combinatorial delay between registers.

Timing refers to the clock speed of a design. A design meets timing when the maximum delay between any two sequential elements is smaller than the minimum clock period.

Adding register layers improves timing by dividing the critical path into two paths of smaller delay.

Separating a logic function into a number of smaller functions that can be evaluated in parallel reduces the path delay to the longest of the substructures.

By removing priority encodings where they are not needed, the logic structure is flattened, and the path delay is reduced.

Register balancing improves timing by moving combinatorial logic from the critical path to an adjacent path.

Timing can be improved by reordering paths that are combined with the critical path in such a way that some of the critical path logic is placed closer to the destination register.



Dr. Rehan Hafiz <[email protected]>

Reading

Parhi, VLSI Digital Signal Processing Systems, Chapter 3

Direct Form FIR Filters

[Figure: M-tap direct-form FIR filter; input x(n) passes through a chain of Z-1 delays, the taps are scaled by h0, h1, h2, ..., hM-1 and summed to give y(n)]

M-tap FIR filter in direct form

Critical path:

TA = delay through adder

TM = delay through multiplier

Critical path delay: 1 TM +(M-1) TA

Area: M-1 registers

M multipliers

M-1 adders

Arithmetic complexity of M-tap filter modeled as:

M multiplications/sample + M-1 adds/sample

$$ y(n) = \sum_{i=0}^{M-1} h(i)\, x(n-i) = h(n) * x(n) $$

Representations of DSP algorithms and architectures




Block Diagram

Block diagram of a 3-tap FIR filter





Signal Flow Graph – Representation !

Signal Flow Graph of a 3-tap FIR filter

Collection of nodes and directed edges

A directed edge (j,k) denotes a branch originating at node j and terminating at node k

Edge (j,k) denotes a linear transformation from the signal at node j to the signal at node k; it can specify a gain

Nodes represent computations or tasks, e.g. addition

Source node: no input edges; sink node: no originating edges

Signal Flow Graph

From Direct Form to Transpose Form

Reversing the direction of all edges of an SFG and interchanging the input and output ports preserves the functionality of the system.

Also called the data broadcast structure

[Figure: transposed-form FIR filter; x(n) is broadcast to multipliers hM-1, hM-2, hM-3, ..., h0, with Z-1 delays between the adders that produce y(n)]

Critical path delay: 1 TM + 1 TA

Area: M-1 registers + M multipliers + M-1 adders

Disadvantages

Larger register sizes, depending on the quantization scheme used, since registers are now placed after multiplication

Fanout of x(n) can become prohibitive
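A minimal Verilog sketch of the transposed (data-broadcast) structure for M = 3, assuming unsigned 8-bit samples and coefficients and 18-bit partial sums. The module name fir_transposed, the coefficient ports h0, h1, h2, and the register names s1, s2 are hypothetical choices for illustration.

```verilog
// Hypothetical 3-tap transposed-form (data-broadcast) FIR sketch.
// Assumptions: unsigned 8-bit x and coefficients, 18-bit sums (no overflow).
module fir_transposed(
    output [17:0] y,
    input  [7:0]  x,
    input  [7:0]  h0, h1, h2,
    input         clk);

    // M-1 = 2 registers, placed after the multipliers, hence wider than x
    reg [17:0] s1, s2;

    always @(posedge clk) begin
        s2 <= h2 * x;        // holds h2*x(n-1) after the clock edge
        s1 <= s2 + h1 * x;   // holds h2*x(n-2) + h1*x(n-1)
    end

    // x is broadcast to all three multipliers; every path to a register
    // or to y contains one multiplier and one adder (1 TM + 1 TA).
    assign y = s1 + h0 * x;  // h0*x(n) + h1*x(n-1) + h2*x(n-2)
endmodule
```

The sketch makes both listed disadvantages visible: s1 and s2 are 18 bits wide instead of 8, and x drives three multiplier inputs at once.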





Data Flow Graph – Representation !

Data Flow Graph of a 3-tap FIR filter

• Nodes represent computations/tasks, e.g. addition, multiplication

• The computation time of a node can be specified with the node

• Each edge has a non-negative number of delays associated with it

• A node computes only once all of its input data is ready

• Non-recursive systems have no loops in their DFG

Consider this example !

Technique: DFG based Pipelining

Data Flow Graphs (DFGs)

Some Terms

DFG based Pipelining: Example (1/4)






[Figure: M-tap direct-form FIR filter (x(n), Z-1 delays, coefficients h0, h1, h2, ..., hM-1) shown before and after pipelining, with extra Z-1 delays added. Put a delay on all cuts.]




Let Tm = 10 units, Ta = 2 units, desired clock = 6 units

Let the initial design be:

[Figure: data-broadcast FIR filter (x(n), coefficients hM-1, hM-2, hM-3, ..., h0, Z-1 delays, y(n)), shown as the initial design and again with an added Z-1 and the annotation "insert registers here"]

Fine Grained Pipelining
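A worked check of why fine-grained pipelining is needed for these numbers. It assumes the initial design is the data-broadcast form, whose critical path is one multiplier plus one adder, and it assumes (hypothetically) that the multiplier can be cut into two sub-blocks of 6 and 4 units with a register between them; the symbols T_{m1} and T_{m2} are introduced here only for illustration.

```latex
\[
T_{critical} = T_m + T_a = 10 + 2 = 12\ \text{units} > 6\ \text{units (desired clock)}
\]
% Split the multiplier: T_m = T_{m1} + T_{m2} = 6 + 4
\[
\text{Stage 1: } T_{m1} = 6\ \text{units}, \qquad
\text{Stage 2: } T_{m2} + T_a = 4 + 2 = 6\ \text{units}
\]
\[
T_{critical,\ pipelined} = \max(6,\ 6) = 6\ \text{units}
\]
```

Registers placed only between whole multipliers and adders cannot get below 10 units, which is why the cut has to go inside the multiplier.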

Pipelining using the Delay Transfer Theorem

Feedforward only (Example-1)

A convenient way to implement pipelining is to add the desired number of registers to all input edges and then, by repeated application of the node transfer theorem, systematically move the registers to break the delay of the critical path.

Functionality is not changed if a register is transferred from all incoming edges of a node (e.g. FA0) to all outgoing edges, and vice versa.

Article: 7.2.7 [SHO]




Feedforward – only (Example-2)

This scheme can also be applied for Register Balancing (as discussed earlier)



Technique: DFG based Parallel Processing

Data Flow Graphs (DFGs)

What if we can't optimize our system any further using pipelining?

Convert a SISO system to a MIMO system using parallel logic

The effective sampling speed is increased by the level of parallelism, L

Multiple outputs are computed in parallel in a clock period

A parallel processing system is also called a block processing system, and the number of inputs processed in a clock cycle is referred to as the block size, L
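As an illustration of SISO-to-MIMO conversion, here is a sketch for block size L = 2 applied to a 3-tap FIR filter. The coefficient names h_0, h_1, h_2 are an assumption carried over from the FIR figures, and k is the block (clock cycle) index.

```latex
% Sequential (SISO) filter:
\[
y(n) = h_0\, x(n) + h_1\, x(n-1) + h_2\, x(n-2)
\]
% 2-parallel (MIMO) form: inputs x(2k), x(2k+1) arrive in the same clock
% cycle and both outputs are computed in that cycle:
\[
y(2k)   = h_0\, x(2k)   + h_1\, x(2k-1) + h_2\, x(2k-2)
\]
\[
y(2k+1) = h_0\, x(2k+1) + h_1\, x(2k)   + h_2\, x(2k-1)
\]
% Effective sample period for block size L:
\[
T_{sample} = \frac{T_{clk}}{L} = \frac{T_{clk}}{2}
\]
```

The critical path of each output is unchanged; the gain comes from producing two samples per clock, at the cost of roughly twice the multipliers and adders.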



SISO to MIMO Conversion !



2-Parallel 3-Tap Filter !

Combining Parallelism & Pipelining

By combining parallel processing (block size: L) and pipelining (pipelining stages: M), the sample period can be reduced to:
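A form consistent with the two techniques as described earlier: M-stage pipelining shortens the clock period by up to a factor of M, and L-parallel processing delivers L samples per clock. Here T_clk is taken to be the clock period of the original sequential, unpipelined system, and the numbers in the second line are hypothetical.

```latex
\[
T_{iter} = T_{sample} = \frac{T_{clk}}{L \cdot M}
\]
% Hypothetical example: L = 2, M = 3, original T_clk = 12 units
\[
T_{sample} = \frac{12}{2 \cdot 3} = 2\ \text{units}
\]
```

The factor M is an upper bound in practice, since it is reached only when the pipeline registers split the critical path into M equal parts.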

Technique: Parallel Processing + Pipelining

Example: FIR Filtering !

Quiz ...

Time: 8 minutes



Q-1) What is the maximum sampling rate of this system without any optimization?

Q-2) Optimize this design such that the sampling rate of the optimized system is 1/T. (You must show the DFG for the optimized design.)

Assume that the computational time required for each node = T

Also assume that all nodes are atomic

Solution !

Q-1) What is the maximum sampling rate of this system without any optimization?

Sampling period = 4T, sampling rate = 1/(4T)

Assume that the computational time required for each node = T

Also assume that all nodes are atomic
