TRANSCRIPT
19/04/16 1
VLSI Programming 2016: Lecture 1
Course: 2IMN35
Teachers: Kees van Berkel [email protected] Rudolf Mak [email protected]
Lab: Kees van Berkel, Rudolf Mak, Alok Lele
www: http://www.win.tue.nl/~wsinmak/Education/2IMN35/ Lecture 1: Introduction
Introduction to VLSI Programming: goals
• to acquire insight into the description, design, and optimization of fine-grained parallel computations;
• to acquire insight into the (future) capabilities of VLSI as an implementation medium for parallel computations;
• to acquire skills in the design of parallel computations and in their implementation on FPGAs.
Contents
Massive parallelism is needed to exploit the huge and still increasing computational capabilities of Very Large Scale Integrated (VLSI) circuits:
• we focus on fine-grained parallelism (not on networks of computers);
• we assume that parallelism is by design (not by compilation);
• we draw inspiration from consumer applications, such as digital TV, 3D TV, image processing, mobile phones, etc.;
• we will use Field Programmable Gate Arrays (FPGAs) as a fine-grained abstraction of VLSI for practical implementation.
FPGA IC on a Xilinx XUP board (Atlys)
Atlys board, based on a Xilinx Spartan 6 FPGA
[Figure: photo of the Atlys board with the Spartan 6 FPGA highlighted]
Lab work prerequisites
• Laptop, running Windows
• Exceed (can be obtained through the TU/e software distribution)
• Access to UNIX server Dept. W&I (can be obtained through BCF)
• Lab work is by teams of two students, with at least 1 Windows laptop.
• Have FPGA tools (SW) installed on your machine by Tuesday April 26
• check website 2IMN35
VLSI Programming (2IMN35): time table 2016
(Tue h5-h8, MF.07; Thu h1-h4, Gemini-Z3A-08/10/13; "in" = hand in, "out" = handed out)

19-Apr  introduction, DSP graphs, bounds, …
21-Apr  pipelining, retiming, transposition, J-slow, unfolding [out: T1+T2]
26-Apr  [in: tools installed] introduction to FPGA and Verilog; L1: audio filter simulation [out: L1, L2]
28-Apr  [in: T1+T2] unfolding, look-ahead, strength reduction; L1 cntd [out: T3+T4]
3-May   folding; L2: audio filter on XUP board
5-May   (no session)
10-May  [in: T3+T4] DSP processors; L2 cntd [out: L3]
12-May  L3: sequential FIR + strength-reduced FIR
17-May  L3 cntd
19-May  L3 cntd [out: L4]
24-May  systolic computation [out: T5]
26-May  L4
31-May  [in: T5] L4: audio sample rate converter
2-Jun   [in: L3] L4 cntd [out: L5]
7-Jun   L5: 1024x audio sample rate converter
9-Jun   [in: L4] L5 cntd
14-Jun  (no session)
16-Jun  [in: L5] deadline report L5
Course grading (provisional)
Your course grade is based on:
• the quality of your programs/designs [30%];
• your final report on the design and evaluation of these programs (guidelines will follow) [30%];
• a concluding discussion with you on the programs, the report and the lecture notes [20%];
• intermediate assignments [20%].
• Credits: 5 points, based on 140 hours of work on your side.
Note on course literature
Lectures VLSI programming are loosely based on:
• Keshab K. Parhi. VLSI Digital Signal Processing Systems, Design and Implementation. Wiley Inter-Science, 1999.
• This book is recommended, but not mandatory.
Accompanying slides can be found on:
• http://www.ece.umn.edu/users/parhi/slides.html
• http://www.win.tue.nl/~wsinmak/Education/2IMN35/
Mandatory reading:
• Keshab K. Parhi. High-Level Algorithm and Architecture Transformations for DSP Synthesis. Journal of VLSI Signal Processing, 9, 121-143 (1995), Kluwer Academic Publishers.
Introduction
• Some inspiration from the technology side: VLSI, FPGAs
• Some inspiration from the application side: machine intelligence; BEE, SKA, SETI; Digital Signal Processing (Software Defined Radio)
• Parhi, Chapters 1, 2: DSP representation methods; iteration bounds
Some inspiration
from the technology side
Vertical cut through VLSI circuit
Intel 4004 processor [1970]
• 1970
• 4-bit
• 2300 transistors

Apple A9 SoC (System on Chip)
• 2015
• Production: Samsung/TSMC
• 14/16 nm FinFET
• 96/104.5 mm²
• > 2B transistors
• Assuming $0.1/mm² production costs ⇒ 5 nano-$ per transistor
Flash memory
• 32 GB = 256Gb
• ≈100G transistors => << 1 n$ per transistor
Xilinx Kintex7 FPGA
• 2G transistors
• 165 mm²
• 1920 DSP slices

Stratix 10 FPGA from Altera (Intel)
• > 10,000 FLOPs per clock cycle
• @ nearly 1 GHz
Exa-scale computing: 10^18 FLOPs/sec
A scenario (year 2021):
• 10^18 FLOPs/sec = 10^9 arithmetic units running at 10^9 Hz
• 10^9 arithmetic units = 10^4.5 nodes × 10^4.5 arithmetic units
• 1 node = 32 TFLOPs/s "X" + 1 TB DRAM + "CPU" @ 10 MW
Today (2016: "petaflop" era):
• #1: Tianhe-2 (China): 34 × 10^15 FLOPs/sec, 10^4.5 nodes @ 24 MW
• GPU (Nvidia GM200): 6 TFLOPs/sec
• FPGA (Altera Stratix 10, GX2800): 9 TFLOPs/sec
A 2016 “node”
Source: Samsung
Source: NVidia
Moore’s Law: 50th anniversary in 2015!
[Figure: Cost per Transistor over Time for Intel MPUs, in US$; trend: × 0.5 every 2 years]
Rule of two [Hu, 1993]
• Every 2 generations of IC technology (6 years)
• device feature size 0.5 x
• chip size 2 x
• clock frequency 2 x (no longer true)
• number of i/o pins 2 x
• DRAM capacity 16 x
• logic-gate density 4 x
ITRS: INTERNATIONAL TECHNOLOGY ROADMAP FOR SEMICONDUCTORS
• The overall objective of the ITRS is to present industry-wide consensus on the “best current estimate” of the industry’s research and development needs out to a 15-year horizon.
• As such, it provides a guide to the efforts of companies, universities, governments, and other research providers/funders.
• The ITRS has improved the quality of R&D investment decisions made at all levels and has helped channel research efforts to areas that most need research breakthroughs.
• Involves over 1000 technical experts worldwide.
• A self-fulfilling prophecy? … or wishful thinking?
ITRS 2013
2013 ITRS MPU/ASIC Half Pitch and Gate Length Trends
Virtex 4 FPGA: 4VSX55 (FPGA = Field Programmable Gate Array)
• 500 MHz clock, flexible logic
• 6,144 CLBs
• multi-port RAM: 320 × 18 kbit
• 512 programmable DSP slices
• 450 MHz PowerPC™
• 1 Gbps differential I/O
• 0.6-11.1 Gbps serial transceivers
Some inspiration
from the application side
All things grand and small [Moravec ‘98]
Chess Machine Performance [Moravec ‘98]
brain power equivalent per $1000 of computer
Evolution computer power/cost [Moravec ‘98]
MPSoC -- 2010, June 30 32
The Square Kilometer Array (SKA)
... the ultimate exploration tool
... and the ultimate software defined radio
The Square Kilometer Array (SKA)
• antenna surface: 1 km2 (sensitivity 50×)
• large physical extent (3000+ km)
• wide frequency range: 50 MHz – 30 GHz
• full design by 2016; phase 1: 2021; phase 2: 2026
• phase 1: 250 dishes (12m) in the central 5 km
• + dense and/or sparse aperture arrays
• connected to a massive data processor by an optical fibre network
• Software Defined Radio Astronomy
• computational load ≈ 1 exa-FLOP/sec (10^18 FLOPs/s)
• power budget = 20 MW (≈ 20 pJ/FLOP "all-in")
References
• Chip photos: http://www-vlsi.stanford.edu/group/chips.html
• ITRS Roadmap: http://www.itrs.net/Links/2005ITRS/ExecSum2005.pdf
• When will computer hardware match the human brain? http://www.jetpress.org/volume1/moravec.htm
• BEE & Square Kilometer Array:
  • http://bwrc.eecs.berkeley.edu/Research/BEE/
  • http://seti.berkeley.edu/casper/papers/BEE2_ska2004_poster.pdf
  • http://www.skatelescope.org/
VLSI Digital Signal Processing Systems
Parhi, Chapters 1&2
DSP applications classes
[Figure: applications plotted by sample rate (Hz, from 1 to 10G, vertical axis) versus complexity (# operations/sample, log scale, horizontal axis): control, seismic modeling, speech, audio, modems, radio modems, video, HDTV, radar]
Typical DSP algorithms
• speech (de-)coding
• speech recognition
• speech synthesis
• speaker identification
• Hi-fi audio en/decoding
• noise cancellation
• audio equalization
• ambient acoustic emulation.
• sound synthesis
• echo cancellation
• modem: (de-)modulation
• vision
• image (de-)compression
• image composition
• beam cancellation
• spectral estimation
• etc.
Typical DSP kernels: FIR Filters
• Filters reduce signal noise and enhance image or signal quality by removing unwanted frequencies.
• Finite Impulse Response (FIR) filters compute y(n):

  y(n) = ∑_{k=0}^{N−1} h(k)·x(n−k) = (h ∗ x)(n)

• where
  • x is the input sequence
  • y is the output sequence
  • h is the impulse response (filter coefficients)
  • N is the number of taps (coefficients) in the filter
• The output sequence depends only on the input sequence and the impulse response.
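The FIR equation above translates directly into code. A minimal sketch of a direct-form FIR filter; the coefficients and input below are arbitrary example values, not from the slides:

```python
# Direct-form FIR filter: y(n) = sum_{k=0}^{N-1} h(k) * x(n-k).
def fir(h, x):
    N = len(h)
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(N):
            if n - k >= 0:          # x(n-k) is taken as 0 before the stream starts
                acc += h[k] * x[n - k]
        y.append(acc)
    return y

h = [0.25, 0.5, 0.25]               # 3-tap example filter
x = [1.0, 0.0, 0.0, 0.0]            # unit impulse
print(fir(h, x))                    # impulse response equals the coefficients
```

Feeding a unit impulse recovers the coefficients, which is exactly the "finite impulse response" property.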
Typical DSP kernels: IIR Filters
• Infinite Impulse Response (IIR) filters compute:

  y(i) = ∑_{k=1}^{M} a(k)·y(i−k) + ∑_{k=0}^{N−1} b(k)·x(i−k)

• The output sequence depends on the input sequence and the impulse response, as well as on previous outputs.
• Adaptive filters (FIR and IIR) update their coefficients to minimize the distance between the filter output and the desired signal.
Typical DSP kernels: DFT and FFT
The Discrete Fourier Transform (DFT) supports frequency-domain ("spectral") analysis:

  y(k) = ∑_{n=0}^{N−1} x(n)·W_N^{nk},  for k = 0, 1, …, N−1,  where W_N = e^{−j2π/N} and j = √−1

where
• x is the input sequence in the time domain (real or complex)
• y is an output sequence in the frequency domain (complex)

The Inverse Discrete Fourier Transform (IDFT) is computed as

  x(n) = (1/N) ∑_{k=0}^{N−1} y(k)·W_N^{−nk},  for n = 0, 1, …, N−1

The Fast Fourier Transform (FFT) and its inverse (IFFT) provide an efficient method for computing the DFT and IDFT.
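A direct transcription of the two sums above (the naive O(N²) algorithm, not the FFT) makes the round-trip property easy to check:

```python
import cmath

# Naive O(N^2) DFT: y(k) = sum_n x(n) * W_N^{nk}, with W_N = exp(-j*2*pi/N).
def dft(x):
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * W ** (n * k) for n in range(N)) for k in range(N)]

# IDFT: x(n) = (1/N) * sum_k y(k) * W_N^{-nk}
def idft(y):
    N = len(y)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(y[k] * W ** (-n * k) for k in range(N)) / N for n in range(N)]

x = [1.0, 2.0, 3.0, 4.0]
x2 = idft(dft(x))                 # round trip recovers x (up to rounding)
print([round(v.real, 6) for v in x2])
```

An FFT computes the same y(k) in O(N log N) by reusing partial sums; the naive form is shown only because it mirrors the equations.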
Typical DSP kernels: DCT
The Discrete Cosine Transform (DCT) and its inverse (IDCT) are frequently used in video (de-)compression (e.g., MPEG-2):

  y(k) = e(k) ∑_{n=0}^{N−1} x(n)·cos[ (2n+1)kπ / 2N ],  for k = 0, 1, …, N−1

  x(n) = (2/N) ∑_{k=0}^{N−1} e(k)·y(k)·cos[ (2n+1)kπ / 2N ],  for n = 0, 1, …, N−1

where e(k) = 1/√2 if k = 0; otherwise e(k) = 1.
An N-point 1D-DCT requires N² MAC operations.
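The DCT pair above, transcribed literally (N² multiply-accumulates per transform, as stated); the test vector is an arbitrary example:

```python
import math

# DCT as on the slide: y(k) = e(k) * sum_n x(n) * cos((2n+1)*k*pi / (2N)),
# with e(0) = 1/sqrt(2) and e(k) = 1 otherwise; the inverse is scaled by 2/N.
def e(k):
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

def dct(x):
    N = len(x)
    return [e(k) * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                       for n in range(N))
            for k in range(N)]

def idct(y):
    N = len(y)
    return [(2.0 / N) * sum(e(k) * y[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                            for k in range(N))
            for n in range(N)]

x = [1.0, 2.0, 3.0, 4.0]
print([round(v, 6) for v in idct(dct(x))])   # round trip recovers x
```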
Typical DSP kernels: distance calculation
• Distance calculations are typically used in pattern recognition, motion estimation, and coding.
• Problem: choose the vector r_k whose distance (see below) from the input vector x is minimum.

  Mean Absolute Difference (MAD):  d = (1/N) ∑_{i=0}^{N−1} | x(i) − r_k(i) |

  Mean Square Error (MSE):  d = (1/N) ∑_{i=0}^{N−1} [ x(i) − r_k(i) ]²
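A sketch of how these distances drive nearest-vector selection; the codebook values are invented for illustration:

```python
# MAD and MSE distances as defined above, plus nearest-codeword selection.
def mad(x, r):
    return sum(abs(xi - ri) for xi, ri in zip(x, r)) / len(x)

def mse(x, r):
    return sum((xi - ri) ** 2 for xi, ri in zip(x, r)) / len(x)

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]   # hypothetical reference vectors r_k
x = [0.9, 1.2]
best = min(range(len(codebook)), key=lambda k: mad(x, codebook[k]))
print(best)   # index of the codeword closest to x under MAD
```

In motion estimation the same loop runs over candidate blocks, with MAD usually preferred because it avoids the multiplication in MSE.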
Typical DSP kernels: matrix computations
Matrix computations are typically used to estimate parameters in DSP systems.
• Matrix vector multiplication
• Matrix-matrix multiplication
• Matrix inversion
• Matrix triangulization
Matrices may be dense/sparse/band-structured/….
Computation Rates
• To estimate the hardware resources required, we can use the equation:

  R_C = R_S · N_S

• where
  • R_C is the computation rate
  • R_S is the sampling rate
  • N_S is the (average) number of operations per sample
• For example, a 1-D FIR has N_S = 2N and a 2-D FIR has N_S = 2N².
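A one-line check of R_C = R_S · N_S, assuming one MAC counts as 2 operations (consistent with N_S = 2N above); the video-phone numbers match the 2-D FIR row of the table that follows:

```python
# Computation rate R_C = R_S * N_S (sampling rate times ops per sample).
def computation_rate(sample_rate_hz, ops_per_sample):
    return sample_rate_hz * ops_per_sample

# Video phone: 6.75 MHz sample rate, 9x9 2-D FIR, N_S = 2 * 81 ops/sample
rc = computation_rate(6.75e6, 2 * 81)
print(rc / 1e6, "MOPs")          # ≈ 1093.5 MOPs, i.e. the table's ~1,090 MOPs
```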
Computational Rates for FIR Filtering
Signal type Frequency # taps Performance
Speech 8 kHz N =128 20 MOPs
Music 48 kHz N =256 240 MOPs
Video phone 6.75 MHz N*N = 81 1,090 MOPs
TV 27 MHz N*N = 81 4,370 MOPs
HDTV 144 MHz N*N = 81 23,300 MOPs
DSP systems and programs
• infinite input stream (samples): x(0), x(1), x(2), …
• infinite output stream (samples): y(0), y(1), y(2), …
• (there may be multiple input and/or output streams)
• non-terminating program, e.g.:

  for n = 1 to ∞
      y(n) = a*x(n) + b*x(n-1) + c*x(n-2)
  end
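The non-terminating loop above can be sketched as a stream transformer; writing it as a generator lets it consume an unbounded input stream, and the two held samples play the role of the delay elements:

```python
import itertools

# y(n) = a*x(n) + b*x(n-1) + c*x(n-2), with x(-1) = x(-2) = 0.
def fir3(xs, a, b, c):
    x1 = x2 = 0.0                  # the two delay elements (registers)
    for x0 in xs:
        yield a * x0 + b * x1 + c * x2
        x1, x2 = x0, x1            # shift the delay line

ys = fir3(itertools.count(0.0), 1.0, 1.0, 1.0)   # x = 0, 1, 2, 3, ...
print(list(itertools.islice(ys, 5)))             # first 5 output samples
```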
DSP System: x(n) → y(n)

DSP SYSTEMS: GRAPHICAL REPRESENTATIONS
DSP systems: 3 graphical representations
• Block diagram:
  • general
  • loose semantics
• Data-flow graph:
  • used for signal processing
  • formal definition
  • powerful tools, lots of theory
• Signal-flow graph:
  • linear time-invariant systems
  • formal definition, still more theory
block diagram: general | data-flow graph: signal processing | signal-flow graph: LTI systems
DSP system: block diagram
• Consider FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)
• delay element = memory element = register (D)
• multiply with constant a
• adder: output value = sum of input values
[Figure: x(n) passes through two D elements yielding x(n-1) and x(n-2); the three taps are multiplied by a, b, c and summed into y(n)]
DSP system: data-flow graph (DFG)
• Consider FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)
• D is the (non-negative) number of delays on an edge
• multiplier: output value = (constant a) × input value
• adder: output value = sum of input values
[Figure: DFG of the FIR with multiplier nodes a, b, c, an adder chain, and edges carrying D delays]
Data-flow graph (DFG)
• Consider FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)
Each edge describes a precedence constraint between two nodes:
• D = 0: intra-iteration precedence constraint
• D > 0: inter-iteration precedence constraint
[Figure: the same FIR DFG with delay counts D annotated on its edges]
Data-flow graphs
Tokens can represent numbers, vectors (blocks), matrices, …
Nodes may be complex (coarse-grained) functions.
Single-rate data flow: each node:
• consumes one token from each input edge;
• performs its function (in T time units);
• produces one token onto each output edge.
Data-flow graphs
Multi-rate data flow: Each node:
• consumes a fixed number of tokens from each input edge;
• performs its function (in T time units);
• produces a fixed number of tokens onto each output edge.
Signal-flow graph (representation method 3)
• A join node denotes an adder.
• A label a next to an edge denotes multiplication by the constant a.
• z^-k denotes a delay of k units.
• Signal-flow graphs are used to represent Linear Time-Invariant (LTI) systems.
• A signal-flow graph corresponds to a Z-transform description (the discrete-time counterpart of the Laplace transform), which comes with a powerful LTI system theory (outside the scope of 2IMN35).
Linear Systems
input x, output y:
discrete system:
• x(n) results in y(n)
linear system:
• x1(n) + x2(n) results in y1(n) + y2(n)
• c1 x1(n) + c2 x2(n) results in c1 y1(n) + c2 y2(n), for arbitrary c1 and c2
Most of our examples will be linear systems.
Linear Time-Invariant Systems
input x, output y:
• x(n+k) = x(n) shifted by integer k sample periods
time-invariant system:
• x'(n) = x(n+k) results in y'(n) = y(n+k)
Most of our examples will be linear time-invariant systems, or LTI systems.
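Both properties can be checked numerically for the 3-tap FIR used throughout these slides. A small sketch (fir3 is a local helper written for this check; the coefficients and test vectors are arbitrary):

```python
# Check that a 3-tap FIR is linear and time-invariant.
def fir3(x, a=1.0, b=2.0, c=3.0):
    return [a * x[n]
            + b * (x[n - 1] if n >= 1 else 0.0)
            + c * (x[n - 2] if n >= 2 else 0.0)
            for n in range(len(x))]

x1 = [1.0, 0.0, 2.0, 1.0]
x2 = [0.0, 3.0, 1.0, 2.0]

# Linearity: F(c1*x1 + c2*x2) == c1*F(x1) + c2*F(x2)
lhs = fir3([2.0 * u + 5.0 * v for u, v in zip(x1, x2)])
rhs = [2.0 * u + 5.0 * v for u, v in zip(fir3(x1), fir3(x2))]
print(lhs == rhs)                     # True

# Time-invariance: delaying the input by one sample delays the output by one
shifted = fir3([0.0] + x1)
print(shifted[1:] == fir3(x1))        # True
```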
Commutativity of LTI systems
x(n) → [LTI System A] → f(n) → [LTI System B] → y(n)
is equivalent to
x(n) → [LTI System B] → g(n) → [LTI System A] → y(n)
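The equivalence is easy to confirm numerically for two FIR systems in cascade; the coefficient sets below are arbitrary illustrative values:

```python
# Cascading two LTI (FIR) systems in either order yields the same output.
def fir(h, x):
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

A = [1.0, 2.0]          # system A: f(n) = x(n) + 2*x(n-1)
B = [3.0, -1.0]         # system B: f(n) = 3*x(n) - x(n-1)
x = [1.0, 0.0, 2.0, 5.0]

print(fir(B, fir(A, x)) == fir(A, fir(B, x)))   # True
```

Both orders compute the convolution of x with the same combined impulse response, which is why the order of LTI systems in a chain does not matter.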
LOOP BOUNDS AND ITERATION BOUNDS
Iteration of a Synchronous Flow Graph
• An iteration: each actor fires the minimum number of times needed to return the graph to a particular state.
• Example of a multi-rate DFG: A → B → C, where A consumes 1 and produces 2 tokens per firing, B consumes 2 and produces 3, and C consumes 2 and produces 1.
• # firings for 1 iteration:  A: 2, B: 2, C: 3
• # tokens per edge for 1 iteration:  →A: 2, A→B: 4, B→C: 6, C→: 3
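The firing counts follow from the balance equations (production on each edge must equal consumption over one iteration). A sketch for chain-structured graphs like the example, assuming the per-edge rates read off the slide:

```python
from math import gcd
from fractions import Fraction

# Repetition vector (firings per iteration) of a chain-structured multi-rate
# SDF graph; edge i connects actor i to actor i+1 with rates (produce, consume).
def repetitions(edges):
    f = [Fraction(1)]                       # relative firing rate of actor 0
    for prod, cons in edges:
        f.append(f[-1] * prod / cons)       # balance: f[i]*prod == f[i+1]*cons
    lcm = 1
    for r in f:
        lcm = lcm * r.denominator // gcd(lcm, r.denominator)
    return [int(r * lcm) for r in f]

# Slide example: A produces 2, B consumes 2 / produces 3, C consumes 2
print(repetitions([(2, 2), (3, 2)]))        # [2, 2, 3]
```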
Iteration period
Iteration period = the time required for the execution of one iteration of the SFG.
Example (first-order recursion y(n) = a·y(n-1) + x(n)), let
• Tm = 10 = multiplication time
• Ta = 4 = addition time
Iteration period = Tm + Ta = 14 [e.g. nsec]
= minimum sample period Ts; that is: Ts ≥ Tm + Ta
Iteration rate = (iteration period)^-1 [e.g. GHz]
[Figure: loop of a multiplier (× a) and an adder with one delay D feeding back y(n-1), input x(n)]
Loop and loop bound
• A loop (cycle) in a DFG is a directed path that begins and ends at the same node.
• The loop bound of loop j is defined as Tj/Wj, where
  • Tj is the loop computation time (sum of all Ti of the loop's nodes i),
  • Wj is the number of delays (D elements) in the loop.
• Example (IIR filter):
  • Tloop = Tm + Ta = 14 ns
  • Wloop = 2
  • Loop bound = Tloop / Wloop = 14/2 = 7 nsec
[Figure: loop of a multiplier (× a) and an adder with a 2D delay feeding back y(n-2), input x(n)]
Critical loop and iteration bound
• The critical loop of a DFG is the loop with the maximum loop bound.
• The iteration bound T∞ of a DFG is the loop bound of the critical loop:

  T∞ = max_{j∈L} ( Tj / Wj )

  where
  • L is the set of loops of the DFG,
  • Tj is the computation time of loop j,
  • Wj is the weight of loop j, i.e. its number of delays D.
Iteration bound cntd
Example:
• TL1 = (10+2)/1 = 12
• TL2 = (2+3+5)/2 = 5
• TL3 = (10+2+3)/2 = 7.5
• Iteration bound = max (12, 5, 7.5) = 12
Notes:
• Delays are non-negative (negative delay would imply non-causality).
• If loop weight equals 0 (no delay elements in loop) then TL/0 = ∞ (deadlock).
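The example's numbers can be checked in a few lines (loop times and delay counts taken from the example above):

```python
# Iteration bound: T_inf = max over loops of (loop computation time / #delays).
def iteration_bound(loops):
    return max(t / w for t, w in loops)

# Slide example: L1 = (10+2)/1, L2 = (2+3+5)/2, L3 = (10+2+3)/2
loops = [(12, 1), (10, 2), (15, 2)]
print(iteration_bound(loops))    # max(12, 5, 7.5) = 12
```

A loop with w = 0 would raise ZeroDivisionError here, matching the note that a delay-free loop means deadlock (an infinite bound).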
4 types of delay paths; critical path
• Redraw the block diagram by partitioning nodes into D elements and combinational functions ("FSM view").
• Paths do not contain delay elements.
• The critical path is the path with the longest computation time; it is a lower bound for the clock period.
[Figure: FSM view: inputs and outputs connected through combinational functions, with delay elements holding the state; 4 path types: (1) inputs → state, (2) state → outputs, (3) inputs → outputs, (4) state → state]
Critical path cntd
Example (FIR filter): • Tm= 10 ns
• Ta= 4 ns
• No loops!
1. 1 path from input to state: 0 ns
2. 4 paths from state to outputs: 26, 22, 18, 14 ns
3. 1 path from input to output: 26 ns
4. 3 paths from state to state: 0, 0, 0 ns
The critical path is 26 ns (it can be reduced by pipelining and parallel processing).
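These path lengths can be reproduced for a direct-form FIR with a chain of adders; the 5-tap size is inferred here from the four state-to-output paths and is an assumption:

```python
# Path lengths for a direct-form N-tap FIR: each path runs through one
# multiplier (tm) and then a suffix of the adder chain (ta per adder).
def fir_paths(n_taps, tm, ta):
    adders = n_taps - 1
    state_to_out = [tm + k * ta for k in range(adders, 0, -1)]
    in_to_out = tm + adders * ta
    return state_to_out, in_to_out

s2o, i2o = fir_paths(5, 10, 4)               # Tm = 10 ns, Ta = 4 ns
print(s2o, i2o, max(s2o + [i2o]))            # [26, 22, 18, 14] 26 26
```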
DSP references
• Keshab K. Parhi. VLSI Digital Signal Processing Systems, Design and Implementation. Wiley Inter-Science 1999.
• Richard G. Lyons. Understanding Digital Signal Processing (2nd edition). Prentice Hall 2004.
• John G. Proakis and Dimitris K Manolakis. Digital Signal Processing (4th edition), Prentice Hall, 2006.
• Simon Haykin. Neural Networks, a Comprehensive Foundation (2nd edition). Prentice Hall 1999.
Computer Architecture and DSP references
• Hennessy and Patterson, Computer Architecture, a Quantitative Approach. 3rd edition. Morgan Kaufmann, 2002.
• Phil Lapsley, Jeff Bier, Amit Sholam, Edward Lee. DSP Processor Fundamentals, Berkeley Design Technology, Inc, 1994-199
• Jennifer Eyre, Jeff Bier, The Evolution of DSP Processors, IEEE Signal Processing Magazine, 2000.
• Kees van Berkel et al. Vector Processing as an Enabler for Software-Defined Radio in Handheld Devices, EURASIP Journal on Applied Signal Processing 2005:16, 2613-2625.
VLSI Programming:
Preparations for Lab work, before Tuesday April 26:
• team up (2 students/team), and
• install FPGA tools.
VLSI Programming: Thursday April 21
Transformations:
• Transposition
• Pipelining
• Retiming
• K-slow transformation
• Parallel processing
(Parhi, Chapters 2, 3)
THANK YOU