TRANSCRIPT
19/04/16 1
VLSI Programming 2016: Lecture 1
Course: 2IMN35
Teachers: Kees van Berkel [email protected] Rudolf Mak [email protected]
Lab: Kees van Berkel, Rudolf Mak, Alok Lele
www: http://www.win.tue.nl/~wsinmak/Education/2IMN35/ Lecture 1: Introduction
Introduction to VLSI Programming: goals
• to acquire insight into the description, design, and optimization of fine-grained parallel computations;
• to acquire insight into the (future) capabilities of VLSI as an implementation medium for parallel computations;
• to acquire skills in the design of parallel computations and in their implementation on FPGAs.
Contents
Massive parallelism is needed to exploit the huge and still increasing computational capabilities of Very Large Scale Integrated (VLSI) circuits:
• we focus on fine-grained parallelism (not on networks of computers);
• we assume that parallelism is by design (not by compilation);
• we draw inspiration from consumer applications, such as digital TV, 3D TV, image processing, mobile phones, etc.;
• we will use Field Programmable Gate Arrays (FPGAs) as a fine-grained abstraction of VLSI for practical implementation.
FPGA IC on a Xilinx XUP board (Atlys)
Atlys board, based on a Xilinx Spartan 6 FPGA
[Figure: photo of the Atlys board with the Spartan 6 FPGA highlighted]
Lab work prerequisites
• Laptop, running Windows
• Exceed (can be obtained through the TU/e software distribution)
• Access to UNIX server Dept. W&I (can be obtained through BCF)
• Lab work is by teams of two students, with at least 1 Windows laptop.
• Have FPGA tools (SW) installed on your machine by Tuesday April 26
• check website 2IMN35
VLSI Programming (2IMN35): time table 2016
(Tue h5-h8, MF.07; Thu h1-h4, Gemini-Z3A-08/10/13; "in" = hand in, "out" = handed out)

19-Apr  introduction, DSP graphs, bounds, …
21-Apr  pipelining, retiming, transposition, J-slow, unfolding [out: T1+T2]
26-Apr  [in: tools installed] introduction to FPGA and Verilog; L1: audio filter simulation [out: L1, L2]
28-Apr  [in: T1+T2] unfolding, look-ahead, strength reduction; L1 cntd [out: T3+T4]
3-May   folding; L2: audio filter on XUP board
5-May   (no session)
10-May  [in: T3+T4] DSP processors; L2 cntd [out: L3]
12-May  L3: sequential FIR + strength-reduced FIR
17-May  L3 cntd
19-May  L3 cntd [out: L4]
24-May  systolic computation [out: T5]
26-May  L4
31-May  [in: T5] L4: audio sample rate converter
2-Jun   [in: L3] L4 cntd [out: L5]
7-Jun   L5: 1024x audio sample rate converter
9-Jun   [in: L4] L5 cntd
14-Jun  (no session)
16-Jun  [in: L5] deadline report L5
Course grading (provisional)
Your course grade is based on:
• the quality of your programs/designs [30%];
• your final report on the design and evaluation of these programs (guidelines will follow) [30%];
• a concluding discussion with you on the programs, the report and the lecture notes [20%];
• intermediate assignments [20%].
• Credits: 5 points, based on 140 hours of work on your side.
Note on course literature
Lectures VLSI programming are loosely based on:
• Keshab K. Parhi. VLSI Digital Signal Processing Systems, Design and Implementation. Wiley Inter-Science, 1999.
• This book is recommended, but not mandatory.
Accompanying slides can be found on:
• http://www.ece.umn.edu/users/parhi/slides.html
• http://www.win.tue.nl/~wsinmak/Education/2IMN35/
Mandatory reading:
• Keshab K. Parhi. High-Level Algorithm and Architecture Transformations for DSP Synthesis. Journal of VLSI Signal Processing, 9, 121-143 (1995), Kluwer Academic Publishers.
Introduction
• Some inspiration from the technology side: VLSI, FPGAs
• Some inspiration from the application side: machine intelligence; BEE, SKA, SETI; Digital Signal Processing (Software Defined Radio)
• Parhi, Chapters 1, 2: DSP representation methods; iteration bounds
Some inspiration
from the technology side
Vertical cut through VLSI circuit
Intel 4004 processor [1970]
• 1970
• 4-bit
• 2300 transistors

Apple A9 SoC (System on Chip)
• 2015
• Production: Samsung/TSMC
• 14/16 nm FinFET
• 96/104.5 mm²
• > 2B transistors
• Assuming $0.1/mm² production costs ⇒ 5 nano-$ per transistor
Flash memory
• 32 GB = 256Gb
• ≈100G transistors => << 1 n$ per transistor
Xilinx Kintex7 FPGA
• 2G transistors
• 165 mm²
• 1920 DSP slices

Stratix 10 FPGA from Altera (Intel)
• > 10,000 FLOPs per clock cycle
• @ nearly 1 GHz
Exa-scale computing: 10^18 FLOPs/sec
A scenario (year 2021):
• 10^18 FLOPs/sec = 10^9 arithmetic units running at 10^9 Hz
• 10^9 arithmetic units = 10^4.5 nodes × 10^4.5 arithmetic units
• 1 node = 32 TFLOPs/s "X" + 1 TB DRAM + "CPU" @ 10 MW
Today (2016: "petaflop" era):
• #1: Tianhe-2 (China): 34 × 10^15 FLOPs/sec, 10^4.5 nodes @ 24 MW
• GPU (Nvidia GM200): 6 TFLOPs/sec
• FPGA (Altera Stratix 10, GX2800): 9 TFLOPs/sec
A 2016 “node”
Source: Samsung
Source: NVidia
Moore’s Law: 50th anniversary in 2015!
[Figure: Cost per Transistor over Time for Intel MPUs, in US$; trend: × 0.5 every 2 years]
Rule of two [Hu, 1993]
• Every 2 generations of IC technology (6 years)
• device feature size 0.5 x
• chip size 2 x
• clock frequency 2 x (no longer true)
• number of i/o pins 2 x
• DRAM capacity 16 x
• logic-gate density 4 x
ITRS: INTERNATIONAL TECHNOLOGY ROADMAP FOR SEMICONDUCTORS
• The overall objective of the ITRS is to present industry-wide consensus on the “best current estimate” of the industry’s research and development needs out to a 15-year horizon.
• As such, it provides a guide to the efforts of companies, universities, governments, and other research providers/funders.
• The ITRS has improved the quality of R&D investment decisions made at all levels and has helped channel research efforts to areas that most need research breakthroughs.
• Involves over 1000 technical experts worldwide.
• A self-fulfilling prophecy? … or wishful thinking?
ITRS 2013
2013 ITRS MPU/ASIC Half Pitch and Gate Length Trends
Virtex 4 FPGA: 4VSX55 (FPGA = Field Programmable Gate Array)
• 500 MHz clock, flexible logic
• 6,144 CLBs
• multi-port RAM: 320 × 18 kbit
• 512 programmable DSP slices
• 450 MHz PowerPC™
• 1 Gbps differential I/O
• 0.6-11.1 Gbps serial transceivers
Some inspiration
from the application side
All things grand and small [Moravec ‘98]
Chess Machine Performance [Moravec ‘98]
brain power equivalent per $1000 of computer
Evolution computer power/cost [Moravec ‘98]
MPSoC -- 2010, June 30 32
The Square Kilometer Array (SKA)
... the ultimate exploration tool
... and the ultimate software defined radio
The Square Kilometer Array (SKA)
• antenna surface: 1 km2 (sensitivity 50×)
• large physical extent (3000+ km)
• wide frequency range: 50 MHz – 30 GHz
• full design by 2016; phase 1: 2021; phase 2: 2026
• phase 1: 250 dishes (12m) in the central 5 km
• + dense and/or sparse aperture arrays
• connected to a massive data processor by an optical fibre network
• Software Defined Radio Astronomy
• computational load ≈ 1 exa-FLOP/sec (10^18 FLOPs/s)
• power budget = 20 MW (≈ 20 pJ/FLOP "all-in")
References
• Chip photos: http://www-vlsi.stanford.edu/group/chips.html
• ITRS Roadmap: http://www.itrs.net/Links/2005ITRS/ExecSum2005.pdf
• When will computer hardware match the human brain? http://www.jetpress.org/volume1/moravec.htm
• BEE & Square Kilometer Array:
  • http://bwrc.eecs.berkeley.edu/Research/BEE/
  • http://seti.berkeley.edu/casper/papers/BEE2_ska2004_poster.pdf
  • http://www.skatelescope.org/
VLSI Digital Signal Processing Systems
Parhi, Chapters 1&2
DSP applications classes
[Figure: applications plotted by sample rate (Hz, from 1 to 10G, vertical axis) versus complexity (# operations/sample, log scale, horizontal axis): control, seismic modeling, speech, audio, modems, radio modems, video, HDTV, radar]
Typical DSP algorithms
• speech (de-)coding
• speech recognition
• speech synthesis
• speaker identification
• Hi-fi audio en/decoding
• noise cancellation
• audio equalization
• ambient acoustic emulation.
• sound synthesis
• echo cancellation
• modem: (de-)modulation
• vision
• image (de-)compression
• image composition
• beam cancellation
• spectral estimation
• etc.
Typical DSP kernels: FIR Filters
• Filters reduce signal noise and enhance image or signal quality by removing unwanted frequencies.
• Finite Impulse Response (FIR) filters compute y(n):

  y(n) = ∑_{k=0}^{N−1} h(k)·x(n−k) = (h ∗ x)(n)

• where
  • x is the input sequence
  • y is the output sequence
  • h is the impulse response (filter coefficients)
  • N is the number of taps (coefficients) in the filter
• The output sequence depends only on the input sequence and the impulse response.
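The FIR equation above translates directly into code. A minimal sketch of a direct-form FIR filter; the coefficients and input below are arbitrary example values, not from the slides:

```python
# Direct-form FIR filter: y(n) = sum_{k=0}^{N-1} h(k) * x(n-k).
def fir(h, x):
    N = len(h)
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(N):
            if n - k >= 0:          # x(n-k) is taken as 0 before the stream starts
                acc += h[k] * x[n - k]
        y.append(acc)
    return y

h = [0.25, 0.5, 0.25]               # 3-tap example filter
x = [1.0, 0.0, 0.0, 0.0]            # unit impulse
print(fir(h, x))                    # impulse response equals the coefficients
```

Feeding a unit impulse recovers the coefficients, which is exactly the "finite impulse response" property.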
Typical DSP kernels: IIR Filters
• Infinite Impulse Response (IIR) filters compute:

  y(i) = ∑_{k=1}^{M} a(k)·y(i−k) + ∑_{k=0}^{N−1} b(k)·x(i−k)

• The output sequence depends on the input sequence and the impulse response, as well as on previous outputs.
• Adaptive filters (FIR and IIR) update their coefficients to minimize the distance between the filter output and the desired signal.
Typical DSP kernels: DFT and FFT
The Discrete Fourier Transform (DFT) supports frequency-domain ("spectral") analysis:

  y(k) = ∑_{n=0}^{N−1} x(n)·W_N^{nk},  for k = 0, 1, …, N−1,  where W_N = e^{−j2π/N} and j = √−1

where
• x is the input sequence in the time domain (real or complex)
• y is an output sequence in the frequency domain (complex)

The Inverse Discrete Fourier Transform (IDFT) is computed as

  x(n) = (1/N) ∑_{k=0}^{N−1} y(k)·W_N^{−nk},  for n = 0, 1, …, N−1

The Fast Fourier Transform (FFT) and its inverse (IFFT) provide an efficient method for computing the DFT and IDFT.
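A direct transcription of the two sums above (the naive O(N²) algorithm, not the FFT) makes the round-trip property easy to check:

```python
import cmath

# Naive O(N^2) DFT: y(k) = sum_n x(n) * W_N^{nk}, with W_N = exp(-j*2*pi/N).
def dft(x):
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * W ** (n * k) for n in range(N)) for k in range(N)]

# IDFT: x(n) = (1/N) * sum_k y(k) * W_N^{-nk}
def idft(y):
    N = len(y)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(y[k] * W ** (-n * k) for k in range(N)) / N for n in range(N)]

x = [1.0, 2.0, 3.0, 4.0]
x2 = idft(dft(x))                 # round trip recovers x (up to rounding)
print([round(v.real, 6) for v in x2])
```

An FFT computes the same y(k) in O(N log N) by reusing partial sums; the naive form is shown only because it mirrors the equations.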
Typical DSP kernels: DCT
The Discrete Cosine Transform (DCT) and its inverse (IDCT) are frequently used in video (de-)compression (e.g., MPEG-2):

  y(k) = e(k) ∑_{n=0}^{N−1} x(n)·cos[ (2n+1)kπ / 2N ],  for k = 0, 1, …, N−1

  x(n) = (2/N) ∑_{k=0}^{N−1} e(k)·y(k)·cos[ (2n+1)kπ / 2N ],  for n = 0, 1, …, N−1

where e(k) = 1/√2 if k = 0; otherwise e(k) = 1.
An N-point 1D-DCT requires N² MAC operations.
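The DCT pair above, transcribed literally (N² multiply-accumulates per transform, as stated); the test vector is an arbitrary example:

```python
import math

# DCT as on the slide: y(k) = e(k) * sum_n x(n) * cos((2n+1)*k*pi / (2N)),
# with e(0) = 1/sqrt(2) and e(k) = 1 otherwise; the inverse is scaled by 2/N.
def e(k):
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

def dct(x):
    N = len(x)
    return [e(k) * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                       for n in range(N))
            for k in range(N)]

def idct(y):
    N = len(y)
    return [(2.0 / N) * sum(e(k) * y[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                            for k in range(N))
            for n in range(N)]

x = [1.0, 2.0, 3.0, 4.0]
print([round(v, 6) for v in idct(dct(x))])   # round trip recovers x
```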
Typical DSP kernels: distance calculation
• Distance calculations are typically used in pattern recognition, motion estimation, and coding.
• Problem: choose the vector r_k whose distance (see below) from the input vector x is minimum.

  Mean Absolute Difference (MAD):  d = (1/N) ∑_{i=0}^{N−1} | x(i) − r_k(i) |

  Mean Square Error (MSE):  d = (1/N) ∑_{i=0}^{N−1} [ x(i) − r_k(i) ]²
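A sketch of how these distances drive nearest-vector selection; the codebook values are invented for illustration:

```python
# MAD and MSE distances as defined above, plus nearest-codeword selection.
def mad(x, r):
    return sum(abs(xi - ri) for xi, ri in zip(x, r)) / len(x)

def mse(x, r):
    return sum((xi - ri) ** 2 for xi, ri in zip(x, r)) / len(x)

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]   # hypothetical reference vectors r_k
x = [0.9, 1.2]
best = min(range(len(codebook)), key=lambda k: mad(x, codebook[k]))
print(best)   # index of the codeword closest to x under MAD
```

In motion estimation the same loop runs over candidate blocks, with MAD usually preferred because it avoids the multiplication in MSE.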
Typical DSP kernels: matrix computations
Matrix computations are typically used to estimate parameters in DSP systems.
• Matrix vector multiplication
• Matrix-matrix multiplication
• Matrix inversion
• Matrix triangulization
Matrices may be dense/sparse/band-structured/….
Computation Rates
• To estimate the hardware resources required, we can use the equation:

  R_C = R_S · N_S

• where
  • R_C is the computation rate
  • R_S is the sampling rate
  • N_S is the (average) number of operations per sample
• For example, a 1-D FIR has N_S = 2N and a 2-D FIR has N_S = 2N².
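A one-line check of R_C = R_S · N_S, assuming one MAC counts as 2 operations (consistent with N_S = 2N above); the video-phone numbers match the 2-D FIR row of the table that follows:

```python
# Computation rate R_C = R_S * N_S (sampling rate times ops per sample).
def computation_rate(sample_rate_hz, ops_per_sample):
    return sample_rate_hz * ops_per_sample

# Video phone: 6.75 MHz sample rate, 9x9 2-D FIR, N_S = 2 * 81 ops/sample
rc = computation_rate(6.75e6, 2 * 81)
print(rc / 1e6, "MOPs")          # ≈ 1093.5 MOPs, i.e. the table's ~1,090 MOPs
```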
Computational Rates for FIR Filtering
Signal type Frequency # taps Performance
Speech 8 kHz N =128 20 MOPs
Music 48 kHz N =256 240 MOPs
Video phone 6.75 MHz N*N = 81 1,090 MOPs
TV 27 MHz N*N = 81 4,370 MOPs
HDTV 144 MHz N*N = 81 23,300 MOPs
DSP systems and programs
• infinite input stream (samples): x(0), x(1), x(2), …
• infinite output stream (samples): y(0), y(1), y(2), …
• (there may be multiple input and/or output streams)
• non-terminating program, e.g.:

  for n = 1 to ∞
      y(n) = a*x(n) + b*x(n-1) + c*x(n-2)
  end
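The non-terminating loop above can be sketched as a stream transformer; writing it as a generator lets it consume an unbounded input stream, and the two held samples play the role of the delay elements:

```python
import itertools

# y(n) = a*x(n) + b*x(n-1) + c*x(n-2), with x(-1) = x(-2) = 0.
def fir3(xs, a, b, c):
    x1 = x2 = 0.0                  # the two delay elements (registers)
    for x0 in xs:
        yield a * x0 + b * x1 + c * x2
        x1, x2 = x0, x1            # shift the delay line

ys = fir3(itertools.count(0.0), 1.0, 1.0, 1.0)   # x = 0, 1, 2, 3, ...
print(list(itertools.islice(ys, 5)))             # first 5 output samples
```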
DSP System: x(n) → y(n)

DSP SYSTEMS: GRAPHICAL REPRESENTATIONS
DSP systems: 3 graphical representations
• Block diagram:
  • general
  • loose semantics
• Data-flow graph:
  • used for signal processing
  • formal definition
  • powerful tools, lots of theory
• Signal-flow graph:
  • linear time-invariant systems
  • formal definition, still more theory
block diagram: general | data-flow graph: signal processing | signal-flow graph: LTI systems
DSP system: block diagram
• Consider FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)
• delay element = memory element = register (D)
• multiply with constant a
• adder: output value = sum of input values
[Figure: x(n) passes through two D elements yielding x(n-1) and x(n-2); the three taps are multiplied by a, b, c and summed into y(n)]
DSP system: data-flow graph (DFG)
• Consider FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)
• D is the (non-negative) number of delays on an edge
• multiplier: output value = (constant a) × input value
• adder: output value = sum of input values
[Figure: DFG of the FIR with multiplier nodes a, b, c, an adder chain, and edges carrying D delays]
Data-flow graph (DFG)
• Consider FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)
Each edge describes a precedence constraint between two nodes:
• D = 0: intra-iteration precedence constraint
• D > 0: inter-iteration precedence constraint
[Figure: the same FIR DFG with delay counts D annotated on its edges]
Data-flow graphs
Tokens can represent numbers, vectors (blocks), matrices, …
Nodes may be complex (coarse-grained) functions.
Single-rate data flow: each node:
• consumes one token from each input edge;
• performs its function (in T time units);
• produces one token onto each output edge.
Data-flow graphs
Multi-rate data flow: Each node:
• consumes a fixed number of tokens from each input edge;
• performs its function (in T time units);
• produces a fixed number of tokens onto each output edge.
Signal-flow graph (representation method 3)
• A join node denotes an adder.
• A label a next to an edge denotes multiplication by the constant a.
• z^-k denotes a delay of k units.
• Signal-flow graphs are used to represent Linear Time-Invariant (LTI) systems.
• A signal-flow graph corresponds to a Z-transform description (the discrete-time counterpart of the Laplace transform), which comes with a powerful LTI system theory (outside the scope of 2IMN35).
Linear Systems
input x, output y:
discrete system:
• x(n) results in y(n)
linear system:
• x1(n) + x2(n) results in y1(n) + y2(n)
• c1 x1(n) + c2 x2(n) results in c1 y1(n) + c2 y2(n), for arbitrary c1 and c2
Most of our examples will be linear systems.
Linear Time-Invariant Systems
input x, output y:
• x(n+k) = x(n) shifted by integer k sample periods
time-invariant system:
• x'(n) = x(n+k) results in y'(n) = y(n+k)
Most of our examples will be linear time-invariant systems, or LTI systems.
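Both properties can be checked numerically for the 3-tap FIR used throughout these slides. A small sketch (fir3 is a local helper written for this check; the coefficients and test vectors are arbitrary):

```python
# Check that a 3-tap FIR is linear and time-invariant.
def fir3(x, a=1.0, b=2.0, c=3.0):
    return [a * x[n]
            + b * (x[n - 1] if n >= 1 else 0.0)
            + c * (x[n - 2] if n >= 2 else 0.0)
            for n in range(len(x))]

x1 = [1.0, 0.0, 2.0, 1.0]
x2 = [0.0, 3.0, 1.0, 2.0]

# Linearity: F(c1*x1 + c2*x2) == c1*F(x1) + c2*F(x2)
lhs = fir3([2.0 * u + 5.0 * v for u, v in zip(x1, x2)])
rhs = [2.0 * u + 5.0 * v for u, v in zip(fir3(x1), fir3(x2))]
print(lhs == rhs)                     # True

# Time-invariance: delaying the input by one sample delays the output by one
shifted = fir3([0.0] + x1)
print(shifted[1:] == fir3(x1))        # True
```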
Commutativity of LTI systems
x(n) → [LTI System A] → f(n) → [LTI System B] → y(n)
is equivalent to
x(n) → [LTI System B] → g(n) → [LTI System A] → y(n)
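The equivalence is easy to confirm numerically for two FIR systems in cascade; the coefficient sets below are arbitrary illustrative values:

```python
# Cascading two LTI (FIR) systems in either order yields the same output.
def fir(h, x):
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

A = [1.0, 2.0]          # system A: f(n) = x(n) + 2*x(n-1)
B = [3.0, -1.0]         # system B: f(n) = 3*x(n) - x(n-1)
x = [1.0, 0.0, 2.0, 5.0]

print(fir(B, fir(A, x)) == fir(A, fir(B, x)))   # True
```

Both orders compute the convolution of x with the same combined impulse response, which is why the order of LTI systems in a chain does not matter.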
LOOP BOUNDS AND ITERATION BOUNDS
Iteration of a Synchronous Flow Graph
• An iteration: each actor fires the minimum number of times needed to return the graph to a particular state.
• Example of a multi-rate DFG: A → B → C, where A consumes 1 and produces 2 tokens per firing, B consumes 2 and produces 3, and C consumes 2 and produces 1.
• # firings for 1 iteration:  A: 2, B: 2, C: 3
• # tokens per edge for 1 iteration:  →A: 2, A→B: 4, B→C: 6, C→: 3
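The firing counts follow from the balance equations (production on each edge must equal consumption over one iteration). A sketch for chain-structured graphs like the example, assuming the per-edge rates read off the slide:

```python
from math import gcd
from fractions import Fraction

# Repetition vector (firings per iteration) of a chain-structured multi-rate
# SDF graph; edge i connects actor i to actor i+1 with rates (produce, consume).
def repetitions(edges):
    f = [Fraction(1)]                       # relative firing rate of actor 0
    for prod, cons in edges:
        f.append(f[-1] * prod / cons)       # balance: f[i]*prod == f[i+1]*cons
    lcm = 1
    for r in f:
        lcm = lcm * r.denominator // gcd(lcm, r.denominator)
    return [int(r * lcm) for r in f]

# Slide example: A produces 2, B consumes 2 / produces 3, C consumes 2
print(repetitions([(2, 2), (3, 2)]))        # [2, 2, 3]
```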
Iteration period
Iteration period = the time required for the execution of one iteration of the SFG.
Example (first-order recursion y(n) = a·y(n-1) + x(n)), let
• Tm = 10 = multiplication time
• Ta = 4 = addition time
Iteration period = Tm + Ta = 14 [e.g. nsec]
= minimum sample period Ts; that is: Ts ≥ Tm + Ta
Iteration rate = (iteration period)^-1 [e.g. GHz]
[Figure: loop of a multiplier (× a) and an adder with one delay D feeding back y(n-1), input x(n)]
Loop and loop bound
• A loop (cycle) in a DFG is a directed path that begins and ends at the same node.
• The loop bound of loop j is defined as Tj/Wj, where
  • Tj is the loop computation time (sum of all Ti of the loop's nodes i),
  • Wj is the number of delays (D elements) in the loop.
• Example (IIR filter):
  • Tloop = Tm + Ta = 14 ns
  • Wloop = 2
  • Loop bound = Tloop / Wloop = 14/2 = 7 nsec
[Figure: loop of a multiplier (× a) and an adder with a 2D delay feeding back y(n-2), input x(n)]
Critical loop and iteration bound
• The critical loop of a DFG is the loop with the maximum loop bound.
• The iteration bound T∞ of a DFG is the loop bound of the critical loop:

  T∞ = max_{j∈L} ( Tj / Wj )

  where
  • L is the set of loops of the DFG,
  • Tj is the computation time of loop j,
  • Wj is the weight of loop j, i.e. its number of delays D.
Iteration bound cntd
Example:
• TL1 = (10+2)/1 = 12
• TL2 = (2+3+5)/2 = 5
• TL3 = (10+2+3)/2 = 7.5
• Iteration bound = max (12, 5, 7.5) = 12
Notes:
• Delays are non-negative (negative delay would imply non-causality).
• If loop weight equals 0 (no delay elements in loop) then TL/0 = ∞ (deadlock).
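The example's numbers can be checked in a few lines (loop times and delay counts taken from the example above):

```python
# Iteration bound: T_inf = max over loops of (loop computation time / #delays).
def iteration_bound(loops):
    return max(t / w for t, w in loops)

# Slide example: L1 = (10+2)/1, L2 = (2+3+5)/2, L3 = (10+2+3)/2
loops = [(12, 1), (10, 2), (15, 2)]
print(iteration_bound(loops))    # max(12, 5, 7.5) = 12
```

A loop with w = 0 would raise ZeroDivisionError here, matching the note that a delay-free loop means deadlock (an infinite bound).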
4 types of delay paths; critical path
• Redraw the block diagram by partitioning nodes into D elements and combinational functions ("FSM view").
• Paths do not contain delay elements.
• The critical path is the path with the longest computation time; it is a lower bound for the clock period.
[Figure: FSM view: inputs and outputs connected through combinational functions, with delay elements holding the state; 4 path types: (1) inputs → state, (2) state → outputs, (3) inputs → outputs, (4) state → state]
Critical path cntd
Example (FIR filter): • Tm= 10 ns
• Ta= 4 ns
• No loops!
1. 1 path from input to state: 0 ns
2. 4 paths from state to outputs: 26, 22, 18, 14 ns
3. 1 path from input to output: 26 ns
4. 3 paths from state to state: 0, 0, 0 ns
The critical path is 26 ns (it can be reduced by pipelining and parallel processing).
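These path lengths can be reproduced for a direct-form FIR with a chain of adders; the 5-tap size is inferred here from the four state-to-output paths and is an assumption:

```python
# Path lengths for a direct-form N-tap FIR: each path runs through one
# multiplier (tm) and then a suffix of the adder chain (ta per adder).
def fir_paths(n_taps, tm, ta):
    adders = n_taps - 1
    state_to_out = [tm + k * ta for k in range(adders, 0, -1)]
    in_to_out = tm + adders * ta
    return state_to_out, in_to_out

s2o, i2o = fir_paths(5, 10, 4)               # Tm = 10 ns, Ta = 4 ns
print(s2o, i2o, max(s2o + [i2o]))            # [26, 22, 18, 14] 26 26
```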
DSP references
• Keshab K. Parhi. VLSI Digital Signal Processing Systems, Design and Implementation. Wiley Inter-Science 1999.
• Richard G. Lyons. Understanding Digital Signal Processing (2nd edition). Prentice Hall 2004.
• John G. Proakis and Dimitris K Manolakis. Digital Signal Processing (4th edition), Prentice Hall, 2006.
• Simon Haykin. Neural Networks, a Comprehensive Foundation (2nd edition). Prentice Hall 1999.
Computer Architecture and DSP references
• Hennessy and Patterson, Computer Architecture, a Quantitative Approach. 3rd edition. Morgan Kaufmann, 2002.
• Phil Lapsley, Jeff Bier, Amit Sholam, Edward Lee. DSP Processor Fundamentals, Berkeley Design Technology, Inc, 1994-199
• Jennifer Eyre, Jeff Bier, The Evolution of DSP Processors, IEEE Signal Processing Magazine, 2000.
• Kees van Berkel et al. Vector Processing as an Enabler for Software-Defined Radio in Handheld Devices, EURASIP Journal on Applied Signal Processing 2005:16, 2613-2625.
VLSI Programming:
Preparations for Lab work, before Tuesday April 26:
• team up (2 students/team), and
• install FPGA tools.
VLSI Programming: Thursday April 21
Transformations:
• Transposition
• Pipelining
• Retiming
• K-slow transformation
• Parallel processing
(Parhi, Chapters 2, 3)
THANK YOU