adsd fall2011 05 architect ing speed 2011nov03
TRANSCRIPT
8/3/2019 ADSD Fall2011 05 Architect Ing Speed 2011Nov03
http://slidepdf.com/reader/full/adsd-fall2011-05-architect-ing-speed-2011nov03 1/96
Dr. Rehan Hafiz <[email protected]> Lecture # 05
Course Website for ADSD Fall 2011
http://lms.nust.edu.pk/
Lectures: Tuesday @ 5:30-6:20 pm, Friday @ 6:30-7:20 pm
Contact: By appointment/Email Office: VISpro Lab above SEECS Library
Acknowledgement: Material from the following sources has been consulted/used in these slides:
1. [CIL] Advanced Digital Design with the Verilog HDL, M. D. Ciletti
2. [SHO] Digital Design of Signal Processing Systems by Dr Shoab A. Khan
3. [STV] Advanced FPGA Design, Steve Kilts
4. Some slides from: [ECEN 248 Dr Shi]
Material from these slides CAN be used with the following citing reference:
Dr. Rehan Hafiz: Advanced Digital System Design 2010
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
This Lecture
Understanding & Optimizing
Speed
Throughput
Timings
Reading Assignment
Chapter 1: Advanced FPGA Design, by Steve Kilts
Xilinx Application Note Uploaded on MOODLE + Practice in Xilinx ISE
Setup/Hold time violation
Speed
Throughput
Amount of data that is processed per clock cycle
Metric: bits/sec
Latency
Time between data input and processed data output
Metric: No. of cycles or time
Timing
Logic delays between sequential elements
Metric : Clock period or Frequency.
High Throughput Design
A high-throughput design
More concerned with the steady-state data rate
Less concerned about the time any specific piece of data
requires to propagate through the design (latency)
Techniques
Pipelining
Throughput
[Figure: top-level entity with three registers (D flip-flops clocked by clk) separated by two blocks of combinational logic]
Throughput = (bits per output sample) / (time between consecutive output samples)
Bits per output sample: In this example, 8 bits per output sample
Time between consecutive output samples: clock cycles between output(n) and output(n+1). It can be measured in clock cycles, then translated to time
In this example, time between consecutive output samples = 1 clock cycle = 10 ns
Throughput = (8 bits per output sample) / (10 ns) = 0.8 bits / ns = 800 Mbits/s
[Timing diagram: 100 MHz clk; 8-bit input carries input(0), input(1), input(2); 8-bit output is unknown at first, then carries output(0), output(1); 1 cycle between output samples]
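The throughput arithmetic on this slide can be checked with a short Python sketch. The function name and parameters are illustrative assumptions, not part of the lecture; the numbers are the slide's (8 bits per sample, one sample per cycle, 100 MHz clock).

```python
def throughput_bits_per_second(bits_per_sample, cycles_between_samples, clock_hz):
    """Throughput = bits per output sample / time between consecutive output samples."""
    # time between samples = cycles_between_samples / clock_hz, so divide by it
    return bits_per_sample * clock_hz / cycles_between_samples


rate = throughput_bits_per_second(bits_per_sample=8, cycles_between_samples=1, clock_hz=100e6)
print(f"{rate / 1e6:.0f} Mbit/s")
```

Passing `cycles_between_samples=3` models a design that produces one result every three cycles at the same clock.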
An Example...
Software Code
Digital Implementation
XPower = 1;
for (i = 0; i < 3; i++)
    XPower = X * XPower;
Throughput: 8/3 ≈ 2.7 bits/cycle
Latency: 3 clock cycles
Timing: 1 multiplier delay
The same register and computational resources are reused
No new computation can begin until the previous computation has completed
[KIL]
Coding an iterative algorithm
<with dependency>
XPower = 1;
for (i = 0; i < 3; i++)
    XPower = X * XPower;

module power3(
    output reg [7:0] XPower,
    output finished,
    input [7:0] X,
    input clk, start);
  reg [7:0] ncount;                  // multiplications still to do
  assign finished = (ncount == 0);
  always @(posedge clk)
    if (start) begin
      XPower <= X;                   // load X; two multiplications remain
      ncount <= 2;
    end
    else if (!finished) begin
      ncount <= ncount - 1;
      XPower <= XPower * X;          // the single multiplier is reused every cycle
    end
endmodule
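A minimal Python sketch of how the iterative module behaves cycle by cycle. It is a behavioural model written for illustration, not a replacement for simulation: the function name is hypothetical, `start` is assumed to be asserted for exactly one clock, and results are truncated to 8 bits as in the Verilog.

```python
def power3_iterative(x, width=8):
    """Model the iterative power3: returns (result, clock cycles used)."""
    mask = (1 << width) - 1
    # Clock 1: start is asserted, X is loaded and two multiplications remain.
    xpower, ncount = x & mask, 2
    cycles = 1
    # Following clocks: one multiplication per clock until ncount reaches 0.
    while ncount != 0:
        xpower = (xpower * x) & mask
        ncount -= 1
        cycles += 1
    return xpower, cycles


print(power3_iterative(3))  # (27, 3)
```

The model makes the throughput limit visible: a new `x` can only be loaded once the previous three-cycle computation has finished.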
Loop Unrolling
XPower = 1;
for (i=0;i < 3; i++)
XPower = X * XPower;
Both the final calculation of X^3 (XPower3 resources) and the first calculation of the next value of X (XPower2 resources) occur simultaneously
[Figure: unrolled pipeline; stage labels x[n], x[n-1], x[n-1]^2, x[n-2]^3]
Coding
module power3(
    output reg [7:0] XPower,
    input clk,
    input [7:0] X
);
  reg [7:0] XPower1, XPower2;
  reg [7:0] X1, X2;
  always @(posedge clk) begin
    // Pipeline stage 1
    X1 <= X;
    XPower1 <= X;
    // Pipeline stage 2
    X2 <= X1;
    XPower2 <= XPower1 * X1;
    // Pipeline stage 3
    XPower <= XPower2 * X2;
  end
endmodule
[Figure: pipeline registers X1, X2, XPower1 and XPower2]
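The pipelined module can be modelled in Python in the same way. This is a sketch under stated assumptions: the function name is hypothetical, all registers are assumed to start at zero, and each loop iteration represents one rising clock edge with nonblocking-assignment semantics (every next value is computed from the current register values).

```python
def power3_pipelined(samples, width=8):
    """Return the value of XPower after each clock edge for a stream of inputs."""
    mask = (1 << width) - 1
    x1 = x2 = xpower1 = xpower2 = xpower = 0
    outputs = []
    for x in samples:
        # Compute all next-state values from the current state first.
        next_x1, next_xpower1 = x & mask, x & mask      # stage 1
        next_x2, next_xpower2 = x1, (xpower1 * x1) & mask  # stage 2
        next_xpower = (xpower2 * x2) & mask             # stage 3
        x1, xpower1, x2, xpower2, xpower = (
            next_x1, next_xpower1, next_x2, next_xpower2, next_xpower)
        outputs.append(xpower)
    return outputs


print(power3_pipelined([2, 3, 4, 0, 0]))  # [0, 0, 8, 27, 64]
```

The first cube appears on the third clock edge (latency of three cycles), and after that a new result appears on every edge.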
Pipelined (unrolled) implementation:
Throughput: 8/1 = 8 bits/cycle
Latency: 3 clock cycles
Timing: 1 multiplier delay
Iterative implementation:
Throughput: 8/3 ≈ 2.7 bits/cycle
Latency: 3 clock cycles
Timing: 1 multiplier delay
In general, if an algorithm requiring n iterative loops is “unrolled,” the pipelined implementation will exhibit a throughput performance increase of a factor of n.
The penalty for unrolling an iterative loop is a proportional increase in area.
Decreasing Latency
A low-latency design is one that passes the data from
the input to the output as quickly as possible by
minimizing the intermediate processing delays.
Techniques
Removal of pipelining, and logical shortcuts, which may reduce the throughput or the maximum clock speed of a design
Parallelism
Latency
[Figure: top-level entity with three registers (D flip-flops clocked by clk) separated by two blocks of combinational logic]
Latency is the time between input(n) and output(n)
i.e. time it takes from first input to first output, second input to second output, etc.
Also called input-to-output latency
Count the number of rising edges after input
In this example there are 2 rising edges, so the latency is 2 cycles
Latency is measured in clock cycles (then translated to seconds)
In this example, say clock period is 10 ns, then latency is 20 ns
[Timing diagram: 100 MHz clk; 8-bit input carries input(0), input(1), input(2); 8-bit output is unknown at first, then carries output(0), output(1)]
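As a quick check of the latency figures, a small Python sketch (the function name is an assumption made for illustration) converts the count of rising edges into time:

```python
def latency_ns(rising_edges, clock_period_ns):
    """Input-to-output latency: cycles counted as rising edges, then scaled to time."""
    return rising_edges * clock_period_ns


print(latency_ns(rising_edges=2, clock_period_ns=10))  # 20
```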
Removal of pipelining
Throughput: 8/1 = 8 bits/cycle
Latency: less than a cycle
Timing: 2 multiplier delays
Penalty
Penalty in timing
Previous implementations could theoretically run the system clock period close to the delay of a single multiplier
For the low-latency implementation, the clock period must be at least two multiplier delays
module power3(
    output [7:0] XPower,
    input [7:0] X
);
  reg [7:0] XPower1, XPower2;
  reg [7:0] X1, X2;
  assign XPower = XPower2 * X2;
  // No clocked registers: every stage is combinational
  always @* begin
    X1 = X;
    XPower1 = X;
  end
  always @* begin
    X2 = X1;
    XPower2 = XPower1 * X1;
  end
endmodule
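The three x-cubed architectures can be compared side by side with a rough Python sketch. It rests on simplifying assumptions that are not from the slides: flip-flop and routing overheads are ignored, every design runs at its minimum clock period, and the low-latency design is taken to need about two multiplier delays end to end. The function and key names are illustrative.

```python
def compare_architectures(t_mult_ns, bits=8):
    """Return {name: (min clock period ns, throughput bits/ns, latency ns)}."""
    return {
        # one multiplier reused over three cycles
        "iterative": (t_mult_ns, bits / (3 * t_mult_ns), 3 * t_mult_ns),
        # one result per cycle, three pipeline stages
        "pipelined": (t_mult_ns, bits / t_mult_ns, 3 * t_mult_ns),
        # two multipliers back to back between input and output
        "low_latency": (2 * t_mult_ns, bits / (2 * t_mult_ns), 2 * t_mult_ns),
    }


for name, (period, throughput, latency) in compare_architectures(5.0).items():
    print(f"{name:12s} period={period} ns  throughput={throughput:.2f} bits/ns  latency={latency} ns")
```

With a hypothetical 5 ns multiplier, removing the pipeline registers shortens the latency but halves the achievable throughput relative to the pipelined design.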
Understanding Timing
Timings
Combinational Logic & Routing
Flip Flops
Setup time
Hold time
Propagation delay t_CLK2Q
Timing: Flip Flops (Sequential Logic)
[Timing diagram: clk, D and Q of a D flip-flop. Input D must remain stable during the interval from t_S before to t_H after the rising clock edge; input D can freely change outside this interval; Q changes t_CLK2Q after the edge]
Setup time t_S – minimum time the input has to be stable before the rising edge of the clock
Hold time t_H – minimum time the input has to be stable after the rising edge of the clock
Propagation delay t_CLK2Q – time to propagate input to output after the rising edge of the clock
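The setup and hold window can be expressed as a small Python check. This is a sketch with assumed names and made-up example numbers; it treats a change of D exactly at the clock edge as a setup violation, which is a modelling choice rather than something the slide specifies.

```python
def check_d_input(edge_ns, d_change_ns, t_setup_ns, t_hold_ns):
    """Classify a change on D relative to a rising clock edge."""
    if edge_ns - t_setup_ns < d_change_ns <= edge_ns:
        return "setup violation"   # D changed too late before the edge
    if edge_ns < d_change_ns < edge_ns + t_hold_ns:
        return "hold violation"    # D changed too soon after the edge
    return "ok"


# Hypothetical flip-flop: t_S = 0.2 ns, t_H = 0.1 ns, rising edge at 10 ns
for t in (9.5, 9.9, 10.05, 10.5):
    print(t, check_d_input(10.0, t, 0.2, 0.1))
```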
Timing: Path timing
[Figure: launch flip-flop, combinational logic and capture flip-flop on a common clk; one CLOCK PERIOD T spans t_CLK2Q, t_LOGIC, t_ROUTING and t_S]
A path is defined as a path from the output of one flip-flop to the input of another flip-flop
t_CLK2Q + t_LOGIC + t_ROUTING < (T - t_S) to avoid a setup time violation
Rewriting the equation: t_path = t_CLK2Q + t_LOGIC + t_ROUTING + t_S < T
Critical Path Delay
Path delay t_path = t_CLK2Q + t_LOGIC + t_ROUTE + t_S
The largest of all the path delays in a circuit is called the critical path delay (t_critical_path)
The associated path is called the critical path
There can be millions of paths in a circuit; timing
analysis CAD tools help to locate the critical path
Critical Path
[Figure: five flip-flops connected by PATH 1 to PATH 4; labelled delays include 1.1 ns, 0.5 ns and 0.8 ns, with t_CLK2Q = 0.4 ns and t_S = 0.2 ns on the flip-flops]
Path delays: t_path1 = 2.2 ns, t_path2 = 1.1 ns, t_path3 = 3.0 ns, t_path4 = 1.4 ns
The critical path is path 3; the critical path delay is t_critical_path = t_path3 = 3.0 ns
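Finding the critical path is a maximum over all path delays, which is what timing analysis tools do at scale. A minimal Python sketch using the four path delays from the slide (the dictionary and function names are illustrative):

```python
def critical_path(path_delays_ns):
    """Return (name, delay) of the path with the largest delay."""
    name = max(path_delays_ns, key=path_delays_ns.get)
    return name, path_delays_ns[name]


paths = {"path1": 2.2, "path2": 1.1, "path3": 3.0, "path4": 1.4}
print(critical_path(paths))  # ('path3', 3.0)
```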
Setup Time Violation (a.k.a Critical Path Violation)
[Figure: launch flip-flop (t_CLK2Q = 0.4 ns) driving wire 1, combinational gate A, wire 2, combinational gate B and wire 3 into the capture flip-flop (t_S = 0.2 ns); one CLOCK PERIOD T spans the whole path]
t_wire1 = 0.4 ns, t_gateA = 2.0 ns, t_wire2 = 0.2 ns, t_gateB = 1.2 ns, t_wire3 = 0.8 ns
Critical path delay = t_critical_path = 0.4 + 0.4 + 2.0 + 0.2 + 1.2 + 0.8 + 0.2 = 5.2 ns
The minimum period for this circuit to work is Tmin = 5.2 ns
Maximum clock frequency = 1/Tmin ≈ 192 MHz
If the clock period is smaller than Tmin, you will get a timing violation and circuit will not operate correctly!!
This kind of timing violation is called a "setup time" violation (also known as critical path violation)
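The numbers in this example can be reproduced with a short Python sketch. The delay values are the slide's; the variable and function names are assumptions made for illustration.

```python
# Delays along the path, in ns, in the order the signal meets them
path_ns = {
    "t_clk2q": 0.4, "t_wire1": 0.4, "t_gateA": 2.0, "t_wire2": 0.2,
    "t_gateB": 1.2, "t_wire3": 0.8, "t_setup": 0.2,
}

t_min_ns = sum(path_ns.values())   # minimum clock period
f_max_mhz = 1000.0 / t_min_ns      # 1 / ns = GHz, so scale by 1000 for MHz


def meets_setup(clock_period_ns):
    """True if the clock period is long enough for this path."""
    return clock_period_ns >= t_min_ns


print(f"Tmin = {t_min_ns:.1f} ns, Fmax = {f_max_mhz:.0f} MHz")
print(meets_setup(6.0), meets_setup(5.0))
```

A 6 ns clock meets timing on this path, while a 5 ns clock would cause a setup time violation.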
Review – From Last Lecture
Throughput
Amount of data that is processed per clock cycle OR the aggregate/average data processing rate
Ideally the average data rate IN to your system should be equal to the average data rate OUT of your system, OR you will miss data !
Technique : Pipelining & Loop Unrolling !
Streaming Applications are more concerned with throughput !
Metric: bits/sec
Latency
Time between data input and processed data output
Technique : Parallelising the system
Response Time is important for time-critical signals, e.g. an interrupt-triggered operation processing an external signal of an avionics system !
Metric: No. of cycles or time
Normally a compromise !
Timing
Logic delays between sequential elements
Metric : Clock period or Frequency.
t_CLK2Q + t_LOGIC + t_ROUTING + t_S < T
Clock Skew: the rising edge of the clock does not arrive at the clock inputs of all flip-flops at the same time
Clock Skew
Delay often caused by wire routing delay
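[Figure: two flip-flops between in and out; the second flip-flop is clocked by clk', a copy of clk offset by a delay. Waveforms of clk and clk' separated by t_skew illustrate the two cases: lag clock skew and lead clock skew]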
Positive slack: the data arrives at the capture flip-flop before the capture clock less the setup time.
Negative slack: the data arrives after the capture clock less the setup time. Negative slack is an issue.
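Slack is simply the margin between when data is required and when it actually arrives. A minimal Python sketch, with an assumed function name and made-up arrival times:

```python
def setup_slack_ns(data_arrival_ns, capture_edge_ns, t_setup_ns):
    """Slack = required time (capture clock less setup time) - data arrival time."""
    required_ns = capture_edge_ns - t_setup_ns
    return required_ns - data_arrival_ns


# Hypothetical capture edge at 10 ns with t_S = 0.2 ns
print(setup_slack_ns(5.0, 10.0, 0.2))   # positive: timing is met
print(setup_slack_ns(9.9, 10.0, 0.2))   # negative: setup time violation
```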
Lead clock skew is bad because it may cause setup
time violations
[Figure: the same register-to-register path drawn twice. Without skew both flip-flops are clocked by clk, and t_CLK2Q, t_LOGIC + t_ROUTE and t_S fit within one CLOCK PERIOD T. With skew the capture flip-flop is clocked by clk', and the time available is reduced by t_skew]
WITHOUT SKEW:
t_CLK2Q + t_LOGIC + t_ROUTE + t_S < T to avoid a setup time violation
WITH SKEW:
t_CLK2Q + t_LOGIC + t_ROUTE + t_S < (T – t_skew) to avoid a setup time violation
There is less time to perform logic than you normally would have
Solution: Optimize/Pipeline/Speedgrade !
Lag clock skew is bad because it may cause hold
time violations
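[Figure: launch flip-flop on clk, combinational logic, capture flip-flop on clk'. The timing diagram shows t_CLK2Q followed by t_LOGIC + t_ROUTE on the data path, and t_skew followed by t_H on clk']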
t CLK2Q + t LOGIC + t ROUTE > (t skew + t H ) to avoid hold time violation
If this is violated, get data feedthrough (data gets fed into the next register one cycle too early)
There is no clock period (T) in the equation; changing clock period cannot help this problem!
Solution : Add dummy logic, e.g. Buffer !
For FPGAs hold time violation predict clock skew
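The hold condition can be written as a small Python predicate. The function name and the example values are assumptions for illustration; note that, as the slide points out, the clock period does not appear anywhere in the check.

```python
def hold_ok(t_clk2q_ns, t_logic_route_ns, t_hold_ns, t_skew_ns=0.0):
    """True if t_CLK2Q + t_LOGIC + t_ROUTE > t_skew + t_H (no hold time violation)."""
    return t_clk2q_ns + t_logic_route_ns > t_skew_ns + t_hold_ns


print(hold_ok(1.0, 2.0, 2.0))                  # True: 3 ns > 2 ns
print(hold_ok(1.0, 2.0, 2.0, t_skew_ns=1.5))   # False: 3 ns is not > 3.5 ns
print(hold_ok(1.0, 3.0, 2.0, t_skew_ns=1.5))   # True: extra buffer delay fixes it
```

The last call mimics the suggested fix of adding dummy logic (a buffer) to lengthen the data path.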
Maximum Achievable Frequency
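Maximum-frequency equation (ignoring clock-to-clock jitter):
$F_{max} = \dfrac{1}{t_{CLK2Q} + t_{LOGIC} + t_{ROUTING} + t_S - t_{skew}}$
t_skew is the propagation delay of the clock between the launch flip-flop and the capture flip-flop
Its sign (-ve or +ve) depends on whether the skew is lead or lag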
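A hedged Python sketch of the maximum-frequency calculation with skew. The sign convention is an assumption chosen to agree with the setup inequality given earlier in the lecture: a negative `t_skew_ns` models lead skew (the capture clock arrives early, leaving less time), a positive value models lag skew. The function name and the example numbers are illustrative.

```python
def f_max_mhz(t_clk2q_ns, t_logic_ns, t_routing_ns, t_setup_ns, t_skew_ns=0.0):
    """Maximum clock frequency in MHz, ignoring clock-to-clock jitter."""
    t_min_ns = t_clk2q_ns + t_logic_ns + t_routing_ns + t_setup_ns - t_skew_ns
    return 1000.0 / t_min_ns


no_skew = f_max_mhz(0.4, 3.2, 1.4, 0.2)
lead_1ns = f_max_mhz(0.4, 3.2, 1.4, 0.2, t_skew_ns=-1.0)
print(f"{no_skew:.1f} MHz without skew, {lead_1ns:.1f} MHz with 1 ns lead skew")
```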
Reading Assignment
Some Examples
Example 1:
Analyzing Sequential Circuits
° What is the minimum time between rising clock edges?
• Tmin = TCLK-Q(FFA) + TLogic(G) + TRoute(G) + Ts(FFB) = 5 ns + 5 ns + 2 ns = 12 ns
[Figure: FFA drives combinational logic G, which drives FFB; both flip-flops share CLK. FFA: TClk-Q = 5 ns; G: Tlogic+Route = 5 ns; FFB: TClk-Q = 5 ns, Ts = 2 ns]
Example 2: Hold Time Violation
° Shall we get a hold time violation in this example?
° Make sure Y remains stable for the hold time (Th) after the rising clock edge
° Remember: the contamination delay ensures the signal doesn't change
• TCLK2Q(FFA) + Tcd(G) >= Th
• 1 ns + 2 ns = 3 ns > 2 ns, so there is no hold time violation
[Figure: FFA drives combinational logic G, which drives FFB; both flip-flops share CLK. FFA: Tclk2Q = 1 ns; G: Tcd = 2 ns; FFB: Th = 2 ns]
Example 3
° What is the minimum clock period (Tmin) of this circuit?
° What if FFB has a clock skew: a lead of 1 ns?
[Figure: FFA drives combinational logic H, which drives FFB; the output of FFB feeds back through combinational logic F into H. FFA: TClk-Q = 5 ns; H: Tlogic+Route = 5 ns; F: Tlogic+Route = 4 ns; FFB: TClk-Q = 4 ns, Ts = 2 ns]
Solution
° Path FFA to FFB
• TClk-Q(FFA) + Tpd(H) + Ts(FFB) = 5 ns + 5 ns + 2 ns = 12 ns
° Path FFB to FFB
• TClk-Q(FFB) + Tpd(F) + Tpd(H) + Ts(FFB) = 4 ns + 4 ns + 5 ns + 2 ns = 15 ns
° The FFB-to-FFB path is the critical path, so Tmin = 15 ns
Solution (with a lead of 1 ns for FFB)
° Path FFA to FFB
• TClk-Q(FFA) + Tpd(H) + Ts(FFB) + Tskew = 5 ns + 5 ns + 2 ns + 1 ns = 13 ns
° Path FFB to FFB
• TClk-Q(FFB) + Tpd(F) + Tpd(H) + Ts(FFB) = 4 ns + 4 ns + 5 ns + 2 ns = 15 ns
° The FFB-to-FFB path is launched and captured by the same clock, so the skew does not affect it; Tmin is still 15 ns
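The two solutions for Example 3 can be cross-checked with a short Python sketch. The delay values come from the slides; the helper name and its `lead_skew_ns` parameter are assumptions made for illustration.

```python
def path_delay_ns(t_clk2q, logic_delays, t_setup, lead_skew_ns=0):
    """Clock-to-Q + logic/route delays + setup time (+ lead skew on the capture clock)."""
    return t_clk2q + sum(logic_delays) + t_setup + lead_skew_ns


T_H, T_F = 5, 4  # Tlogic+Route of blocks H and F, in ns

no_skew = {
    "FFA->FFB": path_delay_ns(5, [T_H], 2),
    "FFB->FFB": path_delay_ns(4, [T_F, T_H], 2),
}
with_lead = {
    # only the path that crosses from clk to the early clk' sees the skew
    "FFA->FFB": path_delay_ns(5, [T_H], 2, lead_skew_ns=1),
    "FFB->FFB": path_delay_ns(4, [T_F, T_H], 2),
}

print(no_skew, "Tmin =", max(no_skew.values()))
print(with_lead, "Tmin =", max(with_lead.values()))
```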
Example
Analyzing Sequential Circuits: Hold Time Violations
Path FFA to FFB
• TClk2q(FFA) + Tlogic+Route(H) > Th(FFB): 1 ns + 2 ns = 3 ns > 2 ns
Path FFB to FFB
• TClk2q(FFB) + TCD(F) + Tlogic+Route(H) > Th(FFB): 1 ns + 1 ns + 2 ns = 4 ns > 2 ns
[Figure: FFA drives combinational logic H, which drives FFB; the output of FFB feeds back through combinational logic F into H. FFA and FFB: Tclk2Q = 1 ns; H: Tlogic+Route = 2 ns; F: Tlogic+Route = 1 ns; FFB: Th = 2 ns]
All paths must satisfy requirements
Optimizing Timing: Few Simple Design Considerations
Consider an FIR Filter
The equation for the computation of an L-tap FIR filter is:
$y[n] = \sum_{k=0}^{L-1} h_k \, x[n-k]$
If L = 5:
y[0] = h0 x[0] + h1 x[-1] + h2 x[-2] + h3 x[-3] + h4 x[-4]
y[1] = h0 x[1] + h1 x[0] + h2 x[-1] + h3 x[-2] + h4 x[-3]
y[2] = h0 x[2] + h1 x[1] + h2 x[0] + h3 x[-1] + h4 x[-2]
y[3] = h0 x[3] + h1 x[2] + h2 x[1] + h3 x[0] + h4 x[-1]
y[4] = h0 x[4] + h1 x[3] + h2 x[2] + h3 x[1] + h4 x[0]
y[5] = h0 x[5] + h1 x[4] + h2 x[3] + h3 x[2] + h4 x[1]
Parallel FIR Implementation
module fir(
    output reg [7:0] Y,
    input [7:0] A, B, C, X,
    input clk,
    input validsample);
  reg [7:0] X1, X2;
  always @(posedge clk)
    if (validsample) begin
      X1 <= X;
      X2 <= X1;
      // multiplier and two adders sit in one register-to-register path
      Y <= A*X + B*X1 + C*X2;
    end
endmodule
Critical Path ??
Technique 1: Pipelining
<Reducing TLOGIC+PROPAGATION>
Code
module fir(
    output reg [7:0] Y,
    input [7:0] A, B, C, X,
    input clk,
    input validsample);
  reg [7:0] X1, X2;
  reg [7:0] prod1, prod2, prod3;   // register layer between multipliers and adders
  always @(posedge clk) begin
    if (validsample) begin
      X1 <= X;
      X2 <= X1;
      prod1 <= A * X;
      prod2 <= B * X1;
      prod3 <= C * X2;
    end
    Y <= prod1 + prod2 + prod3;
  end
endmodule
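To see what the added register layer does to behaviour, the two FIR versions can be modelled in Python. This is a sketch under assumptions that are not in the slides: `validsample` is held high, all registers start at zero, and each loop iteration is one rising clock edge. The function names are illustrative.

```python
def fir_single_stage(samples, a, b, c, width=8):
    """Multiply and add in one register-to-register path."""
    mask = (1 << width) - 1
    x1 = x2 = 0
    out = []
    for x in samples:
        y = (a * x + b * x1 + c * x2) & mask  # value captured by Y at this edge
        x1, x2 = x, x1
        out.append(y)
    return out


def fir_pipelined(samples, a, b, c, width=8):
    """Products are registered first, then summed on the following edge."""
    mask = (1 << width) - 1
    x1 = x2 = prod1 = prod2 = prod3 = 0
    out = []
    for x in samples:
        y = (prod1 + prod2 + prod3) & mask    # uses products from the previous edge
        prod1, prod2, prod3 = (a * x) & mask, (b * x1) & mask, (c * x2) & mask
        x1, x2 = x, x1
        out.append(y)
    return out


impulse = [1, 0, 0, 0, 0]
print(fir_single_stage(impulse, 1, 2, 3))  # [1, 2, 3, 0, 0]
print(fir_pipelined(impulse, 1, 2, 3))     # [0, 1, 2, 3, 0]
```

The pipelined version produces the same sequence one clock later: the critical path is shortened at the cost of one extra cycle of latency.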
Technique 2: Increasing Parallelism
<Speeding up the logic process>
Optimize the critical path such that logic structures can be implemented in parallel
Example:
For the x-cube code, break the multipliers into independent operations and then recombine them.
Taking a square
8-bit binary multiplier
8 muxes/shifts + 8 8-bit additions
8 bit Multiplication
                1 1 1 1 1 0 1 0
              x 1 1 1 1 1 0 1 0
              -----------------
                0 0 0 0 0 0 0 0
              1 1 1 1 1 0 1 0
            0 0 0 0 0 0 0 0
          1 1 1 1 1 0 1 0
        1 1 1 1 1 0 1 0
      1 1 1 1 1 0 1 0
    1 1 1 1 1 0 1 0
  1 1 1 1 1 0 1 0
-------------------------------
1 1 1 1 0 1 0 0 0 0 1 0 0 1 0 0
Optimizing Logic by adding Parallelism
Assume we are squaring an 8-bit number, which can be represented by nibbles A and B:
a3 a2 a1 a0 b3 b2 b1 b0
Partial products, one row per multiplier bit (each row is shifted one bit further left than the row above):
        a3   a2   a1   a0   b3   b2   b1   b0
x b0:   a3b0 a2b0 a1b0 a0b0 b3b0 b2b0 b1b0 b0b0
x b1:   a3b1 a2b1 a1b1 a0b1 b3b1 b2b1 b1b1 b0b1
x b2:   a3b2 a2b2 a1b2 a0b2 b3b2 b2b2 b1b2 b0b2
x b3:   a3b3 a2b3 a1b3 a0b3 b3b3 b2b3 b1b3 b0b3
x a0:   a0a3 a0a2 a0a1 a0a0 a0b3 a0b2 a0b1 a0b0
x a1:   a1a3 a1a2 a1a1 a1a0 a1b3 a1b2 a1b1 a1b0
x a2:   a2a3 a2a2 a2a1 a2a0 a2b3 a2b2 a2b1 a2b0
x a3:   a3a3 a3a2 a3a1 a3a0 a3b3 a3b2 a3b1 a3b0
The partial products fall into three groups: B*B, 2*A*B and A*A
Example: 1 1 1 1 1 0 1 0 squared, with A = 1 1 1 1 and B = 1 0 1 0
B*B   = 0 1 1 0 0 1 0 0 (100)
2*A*B = 1 0 0 1 0 1 1 0 0 (300), shifted left by 4 bits
A*A   = 1 1 1 0 0 0 0 1 (225), shifted left by 8 bits
Sum   = 1 1 1 1 0 1 0 0 0 0 1 0 0 1 0 0 (62500)
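The nibble decomposition can be verified exhaustively in Python. This sketch only checks the arithmetic identity X*X = A*A*256 + 2*A*B*16 + B*B for X = 16*A + B; the function name is an assumption, and the three products are written as separate expressions to mirror the three multipliers that could run in parallel in hardware.

```python
def square_by_nibbles(x):
    """Square an 8-bit value from its upper nibble A and lower nibble B."""
    a, b = (x >> 4) & 0xF, x & 0xF
    bb = b * b            # B*B,   weight 2^0
    ab2 = 2 * a * b       # 2*A*B, weight 2^4
    aa = a * a            # A*A,   weight 2^8
    return (aa << 8) + (ab2 << 4) + bb


print(square_by_nibbles(0b11111010))  # 62500
```

Each of the three products is a 4-bit by 4-bit multiplication, so the longest path is shorter than that of one 8-bit by 8-bit multiplier followed by nothing else.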
Technique 3: Register Balancing
<Distribute long logic paths evenly across register layers>
Keep a balance in the critical path
Redistribute logic evenly between registers to
minimize the worst-case delay between any two
registers
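Register balancing can be pictured as choosing where a register boundary sits within a chain of logic blocks. The Python sketch below is an illustration with made-up block delays, not an example from the lecture; the function names are hypothetical.

```python
def worst_stage_delay(block_delays, split):
    """Worst register-to-register delay when the register sits before block `split`."""
    return max(sum(block_delays[:split]), sum(block_delays[split:]))


def best_split(block_delays):
    """Register position that minimizes the worst-case stage delay."""
    return min(range(1, len(block_delays)),
               key=lambda s: worst_stage_delay(block_delays, s))


delays = [1, 3, 2, 2]                 # hypothetical logic block delays in ns
print(worst_stage_delay(delays, 1))   # 7: unbalanced, second stage holds 3 + 2 + 2
s = best_split(delays)
print(s, worst_stage_delay(delays, s))  # 2 4: balanced, 1 + 3 and 2 + 2
```

The total amount of logic is unchanged; only its distribution across the register boundary differs.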
Technique 4: Flatten Logic Structures
<Removing redundant logic>
Break up logic structures that are coded in a serial fashion
Avoid priority structures if they are not required
Example: control signals coming from an address decode that are used to write four 1-bit registers
module regwrite(
output reg [3:0] rout,
input clk, in,
input [3:0] ctrl);
always @(posedge clk)
if(ctrl[0]) rout[0] <= in;
else if(ctrl[1]) rout[1] <= in;
else if(ctrl[2]) rout[2] <= in;
else if(ctrl[3]) rout[3] <= in;
endmodule
If the control lines are strobes from an address decoder in another module:
Each strobe is mutually exclusive with the others, as they all represent a unique address.
Is there any need for a priority structure?
module regwrite(
output reg [3:0] rout,
input clk, in,
input [3:0] ctrl);
always @(posedge clk) begin
if(ctrl[0]) rout[0] <= in;
if(ctrl[1]) rout[1] <= in;
if(ctrl[2]) rout[2] <= in;
if(ctrl[3]) rout[3] <= in;
end
endmodule
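A small Python model shows when the flattened version is a safe substitute for the priority version. Both functions are illustrative sketches of one clock edge of the two `regwrite` modules; the names are assumptions.

```python
def regwrite_priority(rout, ctrl, bit_in):
    """if / else if chain: only the lowest-numbered asserted strobe writes."""
    rout = list(rout)
    for i in range(4):
        if ctrl[i]:
            rout[i] = bit_in
            break
    return rout


def regwrite_flat(rout, ctrl, bit_in):
    """Independent ifs: every asserted strobe writes its own register."""
    return [bit_in if ctrl[i] else rout[i] for i in range(4)]


one_hot = [0, 0, 1, 0]
print(regwrite_priority([0, 0, 0, 0], one_hot, 1), regwrite_flat([0, 0, 0, 0], one_hot, 1))
overlapping = [1, 1, 0, 0]
print(regwrite_priority([0, 0, 0, 0], overlapping, 1), regwrite_flat([0, 0, 0, 0], overlapping, 1))
```

With mutually exclusive strobes the two versions agree; they differ only if more than one strobe is asserted, which is why the designer must know the strobes are one-hot before removing the priority.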
Technique 5: Reordering Paths
<Shortening Critical Paths>
Mostly done by synthesizer !!!
Reorder the paths in the dataflow to minimize
the critical path
When to use: Where multiple paths combine with the critical path
The combined path can be reordered such that the
critical path can be moved closer to the destination
register
Events not mutually exclusive
module randomlogic(
    output reg [7:0] Out,
    input [7:0] A, B, C,
    input clk,
    input Cond1, Cond2);
  always @(posedge clk)
    if (Cond1)
      Out <= A;
    else if (Cond2 && (C < 8))
      Out <= B;
    else
      Out <= C;
endmodule
module randomlogic(
    output reg [7:0] Out,
    input [7:0] A, B, C,
    input clk,
    input Cond1, Cond2);
  // CondB excludes Cond1, so the branches can be reordered without changing behavior
  wire CondB = (Cond2 & !Cond1);
  always @(posedge clk)
    if (CondB && (C < 8))
      Out <= B;
    else if (Cond1)
      Out <= A;
    else
      Out <= C;
endmodule
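Because the reordering changes the branch order, it is worth confirming that both versions select the same value for every input. The Python sketch below models the selection logic of the two `randomlogic` modules (function names are assumptions) and checks all combinations of Cond1, Cond2 and the 8-bit value C.

```python
from itertools import product


def original(cond1, cond2, a, b, c):
    if cond1:
        return a
    elif cond2 and c < 8:
        return b
    return c


def reordered(cond1, cond2, a, b, c):
    cond_b = cond2 and not cond1   # Cond1 folded into the second condition
    if cond_b and c < 8:
        return b
    elif cond1:
        return a
    return c


def equivalent(a=0xAA, b=0xBB):
    return all(
        original(c1, c2, a, b, c) == reordered(c1, c2, a, b, c)
        for c1, c2 in product([False, True], repeat=2)
        for c in range(256)
    )


print(equivalent())  # True
```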
Summary: Architecting Speed
High Throughput
  Pipelining
Low Latency
  Parallelism
  Pipeline Removal
Timing
  Parallelism
  Pipelining
  Flattening Logic Structure
  Register Balancing
  Path Reordering
In your digital design, make your specification your goal and apply the techniques
Recap
A high-throughput architecture is one that maximizes the number of bits per second that can be processed by a design.
Unrolling an iterative loop increases throughput.
The penalty for unrolling an iterative loop is a proportional increase in area.
A low-latency architecture is one that minimizes the delay from the input of a module to the output.
Latency can be reduced by removing pipeline registers.
The penalty for removing pipeline registers is an increase in combinatorial delay between registers.
Timing refers to the clock speed of a design. A design meets timing when the maximum delay between any two sequential elements is smaller than the minimum clock period.
Adding register layers improves timing by dividing the critical path into two paths of smaller delay.
Separating a logic function into a number of smaller functions that can be evaluated in parallel reduces the path delay to the longest of the substructures.
By removing priority encodings where they are not needed, the logic structure is flattened, and the path delay is reduced.
Register balancing improves timing by moving combinatorial logic from the critical path to an adjacent path.
Timing can be improved by reordering paths that are combined with the critical path in such a way that some of the critical path logic is placed closer to the destination register.
Dr. Rehan Hafiz <[email protected]>
Reading
Chapter 3 of Parhi, VLSI Digital Signal Processing Systems
Direct Form FIR Filters
[Figure: x(n) passes through a chain of Z-1 delays; the taps are multiplied by h0, h1, h2, ..., hM-1 and summed to form y(n)]
M-tap FIR filter in direct form:
$y(n) = \sum_{i=0}^{M-1} h(i)\, x(n-i) = h(n) * x(n)$
Critical path:
TA = delay through adder
TM = delay through multiplier
Critical path delay: 1 TM + (M-1) TA
Area: M-1 registers, M multipliers, M-1 adders
Arithmetic complexity of M-tap filter modeled as:
M multiplications/sample + M-1 adds/sample
Representations of DSP algorithms and architectures
Block Diagram
[Figure: Block diagram of a 3-tap FIR filter]
Signal Flow Graph – Representation !
[Figure: Signal Flow Graph of a 3-tap FIR filter]
Collection of Nodes & Directed Edges
A directed edge (j,k) denotes a branch originating at node j & terminating at node k
Edge (j,k) denotes a linear transformation from the signal at node j to the signal at node k – Can specify Gain
Nodes represent computations or tasks, e.g. Addition
Source Node : No input edges; Sink Node : No originating edges
Technique : Signal Flow Graph
From Direct Form to Transpose Form
Reversing the direction of all edges of an SFG and interchanging the input and output ports preserves the functionality of the system.
The result is also called the data broadcast structure
[Figure: transposed FIR filter; x(n) is broadcast to multipliers hM-1, hM-2, hM-3, ..., h0, whose products feed a chain of adders and Z-1 registers that produces y(n)]
Critical path delay: 1 TM + 1 TA
Area: M-1 registers + M multipliers + M-1 adders
Disadvantages
Larger register sizes, depending on the quantization scheme used, since registers are now placed after the multiplication !
Fanout of x(n) can become prohibitive
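The claim that transposition preserves functionality while shortening the critical path can be illustrated with a Python sketch. It models both structures with unbounded integers and zero-initialized registers (assumptions made for illustration); the function names are hypothetical, and the delay values Tm = 10 and Ta = 2 are the ones used in the pipelining example of this lecture.

```python
def fir_direct(h, samples):
    """Direct form: a delay line on x(n), then M products summed in one chain."""
    delay_line = [0] * len(h)
    out = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]
        out.append(sum(hk * xk for hk, xk in zip(h, delay_line)))
    return out


def fir_transposed(h, samples):
    """Transposed form: x(n) is broadcast, registers sit between the adders."""
    regs = [0] * (len(h) - 1)
    out = []
    for x in samples:
        products = [hk * x for hk in h]
        out.append(products[0] + (regs[0] if regs else 0))
        # each register captures its product plus the register further from the output
        regs = [products[i + 1] + (regs[i + 1] if i + 1 < len(regs) else 0)
                for i in range(len(regs))]
    return out


def direct_critical_path(m, t_m, t_a):
    return t_m + (m - 1) * t_a


def transposed_critical_path(t_m, t_a):
    return t_m + t_a


h, x = [2, 7, 1, 8], [3, 1, 4, 1, 5, 9, 2, 6]
print(fir_direct(h, x) == fir_transposed(h, x))
print(direct_critical_path(len(h), 10, 2), transposed_critical_path(10, 2))
```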
Data Flow Graph – Representation !
[Figure: Data Flow Graph of a 3-tap FIR filter]
• Nodes represent computations/tasks, e.g. Addition, Multiplication
• The computational time for a node can be specified with the node
• Each edge has a non-negative number of delays associated with it
• A node shall only compute once all of its input data is ready
• Non-recursive systems have no loops in their DFG
Consider this example !
Technique : DFG based Pipelining !
Data Flow Graphs (DFGs)
Some Terms
DFG based Pipelining – Example (1/4)
DFG based Pipelining – Example (2/4)
DFG based Pipelining – Example (3/4)
[Figure: direct-form FIR filter with taps h0, h1, h2, ..., hM-1, shown before and after pipelining; a Z-1 delay is put on all edges of the cut]
8/3/2019 ADSD Fall2011 05 Architect Ing Speed 2011Nov03
http://slidepdf.com/reader/full/adsd-fall2011-05-architect-ing-speed-2011nov03 79/96
Data Flow Graphs (DFGs)
DFG based Pipelining – Example (4/4)
Let Tm = 10 units, Ta = 2 units, desired clock period = 6 units !
Let the initial design be:
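[Figure: initial design, an M-tap FIR filter with input x(n), delay elements Z-1, coefficients hM-1, hM-2, hM-3, ..., h0 and output y(n); the same design is repeated with an added Z-1 and the annotation "insert registers here".]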
Fine Grained Pipelining
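A short numeric check of why fine-grained pipelining is needed for these values: the multiplier alone (10 units) exceeds the desired clock period (6 units), so a register has to be placed inside it. The sketch below assumes that the critical path is one multiplier followed by one adder, that the multiplier can be split into parts of 6 and 4 units, and that registers add no delay; the function name `min_stages` is hypothetical.

```python
import math

def min_stages(node_delay, clock_period):
    """Fewest pipeline stages so that each piece fits in one clock period,
    assuming the node can be split freely and registers add no delay."""
    return math.ceil(node_delay / clock_period)

Tm, Ta, Tclk = 10, 2, 6

# Assumed split of the multiplier: 6 units before the inserted register, 4 after.
m_first = 6
m_second = Tm - m_first
stage_delays = [m_first, m_second + Ta]  # second stage also contains the adder

print(min_stages(Tm, Tclk))  # 2
print(stage_delays)          # [6, 6]
```

With this split both stages take exactly 6 units, so the desired clock period is met.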
Pipelining using the Delay Transfer Theorem
Feedforward – only (Example-1)
A convenient way to implement pipelining is to add the desired number of registers to all input edges and then, by repeated application of the node transfer theorem, systematically move the registers to break the delay of the critical path.
Functionality is not changed if a register is transferred from all incoming edges of a node (e.g. FA0) to all outgoing edges, and vice versa.
Article : 7.2.7 [SHO]
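The register transfer described above can be sketched as follows. The graph is a made-up feed-forward example (only the name FA0 comes from the slide; x0, x1, x2, FA1 and y are hypothetical), with one register already added on every input edge. The check confirms that the number of registers on every input-to-output path is the same before and after the transfer.

```python
def transfer_forward(edges, node):
    """Move one register from every incoming edge of `node` to every outgoing edge."""
    incoming = [e for e in edges if e[1] == node]
    if not incoming or any(edges[e] < 1 for e in incoming):
        raise ValueError("every incoming edge must hold at least one register")
    moved = dict(edges)
    for src, dst in edges:
        if dst == node:
            moved[(src, dst)] -= 1
        elif src == node:
            moved[(src, dst)] += 1
    return moved

def path_delays(edges, src, dst):
    """Sorted register counts along every path from src to dst (acyclic graph)."""
    if src == dst:
        return [0]
    totals = []
    for (a, b), d in edges.items():
        if a == src:
            totals += [d + rest for rest in path_delays(edges, b, dst)]
    return sorted(totals)

# {(source, destination): registers}; one register was added on each input edge.
edges = {
    ("x0", "FA0"): 1, ("x1", "FA0"): 1,
    ("FA0", "FA1"): 0, ("x2", "FA1"): 1,
    ("FA1", "y"): 0,
}

after = transfer_forward(edges, "FA0")

# Latency from each input to the output is unchanged by the transfer.
for source in ("x0", "x1", "x2"):
    assert path_delays(edges, source, "y") == path_delays(after, source, "y")

print(after[("FA0", "FA1")])  # 1
```

After the transfer the register sits between FA0 and FA1, so the two nodes no longer form one combinational path.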
Feedforward – only (Example-2)
This scheme can also be applied for Register Balancing (as discussed earlier)
Technique : DFG based Parallel Processing
Data Flow Graphs (DFGs)
What if we cannot optimize our system any further using pipelining?
Convert the SISO (single-input, single-output) system to a MIMO (multiple-input, multiple-output) system using parallel logic!
The effective sampling speed is increased by the level of parallelism, L.
Multiple outputs are computed in parallel in one clock period.
A parallel processing system is also called a block processing system, and the number of inputs processed in a clock cycle is referred to as the block size, L.
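To make block processing concrete, here is a minimal behavioural sketch of a 3-tap FIR filter with block size L = 2, compared against the ordinary one-sample-per-iteration version. The function names and the test data are assumptions; each loop iteration of `fir_block2` stands for one clock cycle that takes two inputs and produces two outputs.

```python
def fir_sequential(x, h):
    """Reference SISO filter: one output sample per iteration."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def fir_block2(x, h):
    """2-parallel 3-tap filter: each iteration consumes x(2k), x(2k+1)
    and produces y(2k), y(2k+1). Assumes len(x) is even."""
    h0, h1, h2 = h
    y = []
    xm1 = xm2 = 0  # x(2k-1) and x(2k-2), held over from the previous block
    for k in range(len(x) // 2):
        x0, x1 = x[2 * k], x[2 * k + 1]
        y.append(h0 * x0 + h1 * xm1 + h2 * xm2)  # y(2k)
        y.append(h0 * x1 + h1 * x0 + h2 * xm1)   # y(2k+1)
        xm2, xm1 = x0, x1
    return y

x = [1, 2, 3, 4, 5, 6]
h = [1, 2, 3]
print(fir_block2(x, h))  # [1, 4, 10, 16, 22, 28]
```

Both versions give the same samples; the block version needs half as many iterations, at the cost of duplicated multipliers and adders.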
SISO to MIMO Conversion !
2 Parallel 3-Tap Filter !
Combining Parallelism & Pipelining
By combining parallel processing (block size: L) and pipelining (pipelining stages: M), the sample period can be reduced to:
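Assuming ideal M-stage pipelining (the critical path is divided evenly and register overhead is ignored) on top of L-parallel processing, the relation can be written as below. The symbol T_orig is introduced here for illustration and denotes the sample period of the original sequential, unpipelined design.

```latex
T_{\text{clk}} = \frac{T_{\text{orig}}}{M}, \qquad
T_{\text{sample}} = \frac{T_{\text{clk}}}{L} = \frac{T_{\text{orig}}}{L \cdot M}
```

For instance, with assumed values T_orig = 12 units, L = 2 and M = 2, the sample period would be 12 / (2 x 2) = 3 units.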
Technique : Parallel Processing + Pipelining
Example : FIR Filtering !
Quiz ...
Time – 8 Minutes !
Q-1) What is the maximum sampling rate of this system without any optimization?
Q-2) Optimize this design such that the sampling rate of the optimized system is 1/T.
(You must show the DFG for the optimized design) ---
Assume that the computational time required for each node is T.
Also assume that all nodes are atomic !!
Solution !
Q-1) What is the maximum sampling rate of this system without any optimization?
Sampling Period = 4T, Sampling Rate = 1/(4T)