
1

Analog Devices TigerSHARC® DSP Family

Presented By: Mike Lee and Mike Demcoe

Date: April 8th, 2002

2

TigerSHARC Architectural Overview

High-performance, 128-bit successor to the ADSP-2106x SHARC family
ADSP-TS101S, the newest TigerSHARC DSP, operates at 250 MHz!
Multiple computational units
  Two compute blocks, each containing a register file, ALU, multiplier, and shifter
  Two additional integer ALUs
Two hardware loop counter registers
Can execute up to four independent 32-bit instructions at a time
  Or eight 16-bit instructions
Very wide word widths for high-precision arithmetic
Designed to be used in a multiple-processor environment

3

TigerSHARC Architecture Overview (cont…)

BTB (Branch Target Buffer) as a means of alleviating issues with the deep pipeline
  32-instruction, 4-way set-associative cache
  User-controlled branch prediction
Three 128-bit blocks of memory, which provide access to a program and two data operands without causing instruction/data conflicts
Load-store, Harvard architecture, like SHARC
Native support for complex number instructions

4

The TS101S Architecture

5

Details of Multiple Compute Blocks

Two computational units, each containing:
  Register file: multi-ported to allow multiple accesses to registers in a single clock cycle
    General purpose registers!
    Contains 32 words, each word being 32 bits in length
  ALU: fixed-point and floating-point
  Multiplier: fixed-point and floating-point
    Also features MAC (multiply-and-accumulate) capabilities
  Shifter: standard logical and arithmetic shifts as well as bit manipulation

6

The TS101S Pipeline

[Figure: pipeline diagram. Fetch stages: Fetch 1, Fetch 2, Fetch 3. The IAB sits between the fetch and execution stages. Execution stages: Decode, Integer, Access, Execute 1, Execute 2.]

7

Pipelines and Instruction Related Information

ADSP-21061
  Three-stage pipeline
  20 ns instruction cycle
  SISD, but can put instructions in parallel
ADSP-TS101S
  Eight-stage pipeline with IAB
  4 ns instruction cycle
  MIMD, and can also put instructions in parallel

8

Loops, Branching and Timers

ADSP-21061
  Zero-overhead hardware loop support
  Delayed branching
  One timer
ADSP-TS101S
  Little support for zero-overhead hardware loops
  32-entry, 4-way set-associative BTB cache with branch prediction
  Two timers

9

Memory and Buses

ADSP-21061
  1 Mbit dual-ported SRAM
  Shared by three buses (PM, DM, I/O)
  PM and DM share a port while the I/O receives its own
ADSP-TS101S
  6 Mbit of SRAM (Quad Ported??)
  User-defined partitions
  Each block is accessed by one 128-bit bus

10

Multiplication and other Nifty Tricks

ADSP-21061
  MAC instructions (MRF and MRB)
  Various precision output (32, 40, or 80 bit)
ADSP-TS101S
  Each compute block has its own set of MAC registers
  Eight 16-bit MACs with 40-bit accumulation or two 32-bit MACs with 80-bit accumulation
  Complex number MAC instructions
  128-bit accelerator
    Trellis decoding (8 Trellis butterflies per cycle)
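To make the complex number MAC concrete, here is a minimal sketch in portable C of what one complex multiply-accumulate computes. It is an illustration only, not TigerSHARC code: the types `cplx16` and `cplx_acc` and the function `cmac` are hypothetical names, the inputs are assumed to be 16-bit fixed-point values, and a 64-bit accumulator stands in for the 40-bit hardware accumulation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types: a complex sample with 16-bit parts and a wide accumulator. */
typedef struct { int16_t re, im; } cplx16;
typedef struct { int64_t re, im; } cplx_acc;

/* Returns the sum of a[i] * b[i] for i in [0, n).
   Each complex MAC costs four real multiplies in portable C. */
static cplx_acc cmac(const cplx16 *a, const cplx16 *b, size_t n)
{
    cplx_acc acc = {0, 0};
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        /* (ar + j*ai) * (br + j*bi) = (ar*br - ai*bi) + j*(ar*bi + ai*br) */
        acc.re += (int64_t)a[i].re * b[i].re - (int64_t)a[i].im * b[i].im;
        acc.im += (int64_t)a[i].re * b[i].im + (int64_t)a[i].im * b[i].re;
    }
    return acc;
}
```

A processor with a native complex MAC can issue the whole loop body as a single instruction, which is why the feature matters for the filters and FFT discussed later.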

11

Data Address Generation

ADSP-21061
  2 data address generation units (DAGs)
  8 circular buffers per DAG
ADSP-TS101S
  2 data address generation units (IALUs)
  4 circular buffers per IALU
Both support modulo arithmetic, bit-reversal addressing, and post- and pre-modify instructions
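The modulo arithmetic behind a circular buffer can be sketched in plain C as follows. This is a software model under stated assumptions, not the hardware mechanism: the `circ_buf` struct and both function names are hypothetical, and the base/length/index fields only loosely mirror the registers an address generator would hold.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical circular buffer: base array, length, and current index. */
typedef struct {
    float *base;
    size_t length;
    size_t index;
} circ_buf;

/* Read the current element, then advance the index by `step` with
   wraparound (post-modify with modulo arithmetic). */
static float circ_read_postmod(circ_buf *cb, size_t step)
{
    assert(cb != NULL && cb->length > 0);
    float value = cb->base[cb->index];
    cb->index = (cb->index + step) % cb->length;
    return value;
}

/* Overwrite the element at the current index, then advance by one
   with wraparound. In a delay line this replaces the oldest sample. */
static void circ_write_postmod(circ_buf *cb, float sample)
{
    assert(cb != NULL && cb->length > 0);
    cb->base[cb->index] = sample;
    cb->index = (cb->index + 1) % cb->length;
}
```

In software the `%` costs an extra operation on every access; a hardware address generator performs the wraparound as part of the memory access itself.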

12

Ease of Use

ADSP-21061
  Easy to use
  Algebraic instruction set
  VisualDSP environment
ADSP-TS101S
  Similar to the 21061, but now you have to consider 2 compute blocks
  ADI suggests leaving parallelization to their optimizing compiler
  VisualDSP environment

13

Specific DSP Algorithms and the TigerSHARC

In ENEL515 (and/or related articles) we’ve studied the FIR, IIR, and FFT algorithms
TigerSHARC has a massively parallel architecture that is tailored to performing these algorithms

14

FIR Filter Characteristics

Think back (or forward, depending on how much you’ve procrastinated) to Lab #3.

FIR Characteristics
  Simple, long loop
  Repetitive calculations (multiply, then add!)
  Access to an array of coefficients, and an array of “delay-line” values
  Few data dependency issues during the calculation of a single output
  For a filter of length N, require N multiplications and N adds to obtain a single output value
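The characteristics above can be seen in a minimal C sketch of one FIR output sample. The function name `fir_output` and the array layout (`delay[k]` holds the input delayed by k samples) are assumptions made for this illustration, not code from the lab.

```c
#include <assert.h>
#include <stddef.h>

/* One output sample of an n-tap FIR filter: n multiplies and n adds.
   coeffs[k] is the k-th coefficient; delay[k] is the input delayed by
   k samples (assumed layout). */
static float fir_output(const float *coeffs, const float *delay, size_t n)
{
    assert(coeffs != NULL && delay != NULL);
    float acc = 0.0f;
    for (size_t k = 0; k < n; ++k) {
        acc += coeffs[k] * delay[k];   /* multiply, then add */
    }
    return acc;
}
```

Each iteration reads one coefficient and one delay-line value, and the only dependency carried from one iteration to the next is the accumulator.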

15

TigerSHARC and the FIR Filter

The general idea is: Divide and conquer!
  Take a filter of size N and split it into two groups of N/2
  Utilize the TigerSHARC’s multiple computational units and MAC instructions to perform the algorithm in ½ the time (plus some overhead)
  Two hardware loop counters to simultaneously control the two new “N/2” size FIR loops with no overhead!
Can do all of the following SIMULTANEOUSLY!
  Fetch two operands (one coefficient, one delay-line value) from two separate memory banks
  Fetch the next instruction
  Perform arithmetic operations on the PREVIOUS operands!
Unlike SHARC, instruction/data clashes are nonexistent due to the numerous bus paths linking computational units to memory space
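The divide-and-conquer idea can be sketched in C by keeping two independent accumulators, one per half of the filter. This only models the data flow: on the TigerSHARC the two partial sums would run in the two compute blocks at the same time, whereas plain C executes them in sequence. The function name `fir_output_split` is hypothetical, and the sketch assumes an even number of taps.

```c
#include <assert.h>
#include <stddef.h>

/* One FIR output computed as two independent n/2-tap partial sums.
   acc_x and acc_y stand in for the accumulators of the two compute
   blocks; the final add is the "some overhead" of recombining them. */
static float fir_output_split(const float *coeffs, const float *delay, size_t n)
{
    assert(coeffs != NULL && delay != NULL);
    assert(n % 2 == 0);                /* sketch assumes an even tap count */
    size_t half = n / 2;
    float acc_x = 0.0f;                /* first half of the taps  */
    float acc_y = 0.0f;                /* second half of the taps */
    for (size_t k = 0; k < half; ++k) {
        acc_x += coeffs[k] * delay[k];
        acc_y += coeffs[half + k] * delay[half + k];
    }
    return acc_x + acc_y;
}
```

The two accumulator updates never read each other's results, which is what allows them to be issued in parallel. With floating-point data the split changes the order of the additions, so the result may differ from a single-accumulator loop in the last bits.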

16

TigerSHARC and the FIR Filter (continued…)

8-cycle-deep pipeline
  Stalls are expensive
  Branch Target Buffer reduces the performance loss that results from branching in a deeply pipelined processor
The long loop characteristic of the FIR filter algorithm allows us to keep the 8-cycle-deep pipeline full
  Full pipeline means fast algorithm
FIR filter algorithms rely heavily on data sets that are aligned in memory
  Post-increment is your friend
  TigerSHARC Quad Data Accesses: supply four aligned words to one compute block or two aligned words to each compute block

17

Example Instructions

X/Y Conditional Compute
  if xALE; do, R0=R1+R2
  Condition codes: AEQ, ALT, ALE, ALU, MEQ, MLT, MLE, SEQ, SLT, SF0, SF1
  A = Adder, M = Multiplier, S = Shifter

Memory Addressing
  Indirect post-modify with update, register offset:
    YR20=[J1+=J2]
  Indirect post-modify with update, 8-bit immediate offset:
    Q[K1+=0xF8]=XYR3:0
  Indirect pre-modify no update, register offset:
    J3:2=L[K1+K2]
  Indirect pre-modify no update, immediate offset:
    YR3:2=L[K1+0x0003333]

Complex Quad 16-bit Fixed Point Multiplication Instructions
  {X|Y|XY} MRa += Rm ** Rn {({U}{I}{C|CR}{J})}
  {X|Y|XY} Rs|Rsd=MRa, MRa+= Rm ** Rn {({U}{I}{C}{J})}
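The difference between the two addressing families above can be modelled in plain C with an array and an index. This is a sketch of the semantics only: the function names are hypothetical, the index plays the role of a J or K register, and word widths and the quad/long access qualifiers are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Post-modify with update: use the current index for the access,
   then add the offset to the index (compare [J1 += J2]). */
static int load_postmodify(const int *mem, size_t *index, size_t offset)
{
    assert(mem != NULL && index != NULL);
    int value = mem[*index];
    *index += offset;
    return value;
}

/* Pre-modify without update: access at index + offset and leave the
   index unchanged (compare [K1 + K2]). */
static int load_premodify(const int *mem, size_t index, size_t offset)
{
    assert(mem != NULL);
    return mem[index + offset];
}
```

Post-modify suits walking through an array in a loop, since the pointer is ready for the next iteration; pre-modify suits picking a field at a fixed offset from a base that should stay put.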

18

FIR Code Example

19

TigerSHARC and the IIR Filter

Short, simple loop characteristic
  Means loop overhead is more of a concern
  Means keeping the pipeline full is tougher!
  Time to unroll the loop, although ADI says to let VisualDSP do it for you
Again, split up the calculations on an N-tap IIR filter into two N/2 sets operating simultaneously
  Idea: one computational block does feedforward calculations, one does feedback!
Complex numbers commonly required
  Hardware support for complex MAC in TigerSHARC
Again, Quad Data Access comes in handy for aligned data
Post-increment is still your friend
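The feedforward/feedback split can be sketched in C as two independent sums that are combined at the end. This assumes a direct-form difference equation with real coefficients, y[n] = sum of b[k]*x[n-k] minus sum of a[k]*y[n-1-k]; the function name `iir_output` and the history-array layout are hypothetical choices for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One IIR output sample as two independent sums.
   x_hist[k] is the input delayed by k samples (x_hist[0] is the newest).
   y_hist[k] is the output delayed by k + 1 samples (y_hist[0] is y[n-1]).
   Assumed sign convention: y[n] = feedforward - feedback. */
static float iir_output(const float *b, const float *x_hist, size_t nb,
                        const float *a, const float *y_hist, size_t na)
{
    assert(b != NULL && x_hist != NULL && a != NULL && y_hist != NULL);
    float feedforward = 0.0f;   /* could run in one compute block   */
    float feedback = 0.0f;      /* could run in the other, in parallel */
    for (size_t k = 0; k < nb; ++k) {
        feedforward += b[k] * x_hist[k];
    }
    for (size_t k = 0; k < na; ++k) {
        feedback += a[k] * y_hist[k];
    }
    return feedforward - feedback;
}
```

Within a single output sample the two sums share no data, so they can proceed simultaneously; the dependency only appears between samples, because the new output feeds the next feedback sum.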

20

TigerSHARC and the FFT

Does not use the same MAC modes that IIR and FIR filters do
Requires more complicated addressing modes
  Example: bit-reverse addressing
  Found on both SHARC and TigerSHARC
Difficult to split onto separate computational units and even more difficult to split amongst distributed processors
Requires large arrays of complex variables and fixed coefficients
  Hardware complex number MAC comes in handy again!
  Large arrays of aligned data: Quad Data Access again!
Requires HIGH-PRECISION arithmetic
  Luckily we have 64-bit fixed-point arithmetic and 40-bit extended floating-point arithmetic
  80-bit MAC precision
FFT requires many intermediate values
  32 GP registers in a single computational block
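Bit-reverse addressing is easiest to see in software. The sketch below, with hypothetical function names, computes a bit-reversed index and uses it to reorder an array of 2^bits values in place, as a radix-2 FFT needs before or after its butterfly stages. Real FFT data would be complex; plain floats are used here to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Reverse the lowest `bits` bits of `index` (for example 001 -> 100). */
static size_t bit_reverse(size_t index, unsigned bits)
{
    size_t reversed = 0;
    for (unsigned i = 0; i < bits; ++i) {
        reversed = (reversed << 1) | (index & 1u);
        index >>= 1;
    }
    return reversed;
}

/* Reorder 2^bits values in place so that element i ends up at the
   bit-reversed position of i. Each pair is swapped exactly once. */
static void bit_reverse_permute(float *data, unsigned bits)
{
    assert(data != NULL && bits < 8 * sizeof(size_t));
    size_t n = (size_t)1 << bits;
    for (size_t i = 0; i < n; ++i) {
        size_t j = bit_reverse(i, bits);
        if (i < j) {
            float tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}
```

Hardware bit-reverse addressing produces the reversed index inside the address generator, so the reordering pass and its inner loop disappear from the program.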

21

http://www.analog.com/technology/dsp/Sharc/benchmarks.html

22

http://www.analog.com/technology/dsp/TigerSHARC/benchmarks.html

23

Conclusion

TigerSHARC has a very SHARC-like architecture, except it’s MUCH more complex
  Highly optimized for parallelism
Major features: complex number support, multiple computational units, high instruction throughput, wider buses
Performs DSP algorithms including FIR, IIR, and FFT significantly faster than SHARC!

24

References

1. http://www.analog.com/productSelection/pdf/ADSP-21061_L_b.pdf
2. http://products.analog.com/products/info.asp?product=ADSP-TS101-S
3. http://www.analog.com/technology/dsp/TigerSHARC/backgrounder.html
4. http://www.analog.com/library/dspManuals/Tigersharc_hardware.html
5. http://www.analog.com/library/dspManuals/Tigersharc_instruction.html
6. http://www.btid.com/procsum/tsfloat.htm
7. http://www.analog.com/library/applicationNotes/dsp/tigerSharc/EE-147.pdf
8. http://www.analog.com/technology/dsp/TigerSHARC/architecture.html
9. http://www.analog.com/library/dspManuals/pdf/TSDSP_instruction/tsintr.pdf (2-182 - 2-188)
10. ADSP-2106x SHARC User’s Manual, Second Edition
11. http://www.analog.com/library/dspManuals/pdf/TSDSP_instruction/tsin_flw.pdf (3-9 - 3-16)

25

Note from Dr. Smith

Information on the Burg algorithm is outside ICT536. It is essentially an FIR filter used for prediction (i.e. what FIR coefficients are needed so that the filtered signal is "white noise").
