1
Analog Devices TigerSHARC® DSP Family
Presented By: Mike Lee and Mike Demcoe
Date: April 8th, 2002
2
TigerSHARC Architectural Overview
High-performance, 128-bit successor to the ADSP-2106x SHARC family
ADSP-TS101S, the newest TigerSHARC DSP, operates at 250 MHz!
Multiple computational units
  Two compute blocks, each containing a register file, ALU, multiplier, and shifter
  Two additional integer ALUs
  Two hardware loop counter registers
Can execute up to four independent 32-bit instructions at a time, or eight 16-bit instructions
Very wide word widths for high-precision arithmetic
Designed to be used in a multiple-processor environment
3
TigerSHARC Architecture Overview (cont…)
BTB (Branch Target Buffer) as a means of alleviating issues with the deep pipeline
  32-instruction, 4-way set-associative cache
  User-controlled branch prediction
Three 128-bit-wide blocks of memory, which provide access to a program and two data operands without causing instruction/data conflicts
Load-store, Harvard architecture, like SHARC
Native support for complex number instructions
5
Details of Multiple Compute Blocks
Two computational units, each containing:
  Register file – multi-ported to allow multiple accesses to registers in a single clock cycle
    General-purpose registers!
    Contains 32 words, each 32 bits long
  ALU – fixed-point and floating-point
  Multiplier – fixed-point and floating-point
    Also features MAC (multiply-and-accumulate) capabilities
  Shifter – standard logical and arithmetic shifts as well as bit manipulation
6
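The TS101S Pipeline
[Pipeline diagram]
Fetch stages: Fetch 1, Fetch 2, Fetch 3
IAB (instruction alignment buffer) sits between the fetch and execution stages
Execution stages: Decode, Integer, Access, Execute 1, Execute 2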
7
Pipelines and Instruction-Related Information
ADSP-21061
  Three-stage pipeline
  20 ns instruction cycle
  SISD, but can put instructions in parallel
ADSP-TS101S
  Eight-stage pipeline with IAB
  4 ns instruction cycle
  MIMD, and can also put instructions in parallel
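As a quick sanity check, the cycle times above can be turned into clock rates. The sketch below assumes one instruction cycle per clock and borrows the "up to four instructions at a time" figure from the overview slide. The function names are made up for illustration, and the peak figure assumes every issue slot is filled on every cycle, which real code rarely manages.

```python
def clock_mhz(cycle_ns):
    """Clock rate in MHz for an instruction cycle time given in nanoseconds."""
    return 1000.0 / cycle_ns


def peak_minstr_per_sec(cycle_ns, instr_per_cycle):
    """Peak rate in millions of instructions per second.

    Assumes every issue slot is used on every cycle (an upper bound only).
    """
    return clock_mhz(cycle_ns) * instr_per_cycle


sharc_mhz = clock_mhz(20)             # ADSP-21061: 20 ns cycle
ts_mhz = clock_mhz(4)                 # ADSP-TS101S: 4 ns cycle
ts_peak = peak_minstr_per_sec(4, 4)   # four instructions per cycle
```

The 4 ns cycle works out to the 250 MHz quoted earlier, five times the clock rate of the 21061.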
8
Loops, Branching and Timers
ADSP-21061
  Zero-overhead hardware loop support
  Delayed branching
  One timer
ADSP-TS101S
  Little support for zero-overhead hardware loops
  32-entry, 4-way set-associative BTB cache with branch prediction
  Two timers
9
Memory and Buses
ADSP-21061
  1 Mbit dual-ported SRAM
  Shared by three buses (PM, DM, I/O)
  PM and DM share a port while the I/O gets its own
ADSP-TS101S
  6 Mbit of SRAM (quad-ported??)
  User-defined partitions
  Each block is accessed by one 128-bit bus
10
Multiplication and other Nifty Tricks
ADSP-21061
  MAC instructions (MRF and MRB)
  Various precision output (32, 40, or 80 bit)
ADSP-TS101S
  Each compute block has its own set of MAC registers
  Eight 16-bit MACs with 40-bit accumulation or two 32-bit MACs with 80-bit accumulation
  Complex number MAC instructions
  128-bit accelerator
    Trellis decoding (8 trellis butterflies per cycle)
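To see why a hardware complex MAC is worth having: one complex multiply-accumulate expands into four real multiplies plus the adds that combine them. Below is a minimal sketch using plain Python integers. The function name is hypothetical, and it does not model the TigerSHARC's operand packing, accumulator widths, or saturation.

```python
def complex_mac(acc, a, b):
    """Return acc + a * b, where each argument is a (real, imag) pair."""
    acc_r, acc_i = acc
    a_r, a_i = a
    b_r, b_i = b
    # (a_r + j*a_i) * (b_r + j*b_i) needs four real multiplies.
    real = acc_r + a_r * b_r - a_i * b_i
    imag = acc_i + a_r * b_i + a_i * b_r
    return real, imag


acc = complex_mac((0, 0), (1, 2), (3, 4))   # (1+2j)*(3+4j) = -5+10j
acc = complex_mac(acc, (0, 1), (0, 1))      # add j*j = -1
```

Without hardware support, each of these steps costs several ordinary MAC instructions.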
11
Data Address Generation
ADSP-21061
  2 data address generation units (DAGs)
  8 circular buffers per DAG
ADSP-TS101S
  2 data address generation units (IALUs)
  4 circular buffers per IALU
Both support modulo arithmetic, bit-reversed addressing, and post- and pre-modify instructions
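The modulo arithmetic behind a circular buffer can be sketched in a few lines. This is a software model only: the function name and the base/length parameters are illustrative and do not correspond to the actual DAG or IALU register names.

```python
def post_modify_circular(index, modifier, base, length):
    """Model one post-modify access into a circular buffer.

    Returns (address_used, next_index). The access uses the current index,
    then the index is advanced and wrapped into [base, base + length).
    """
    address_used = index
    next_index = base + (index - base + modifier) % length
    return address_used, next_index


# Walk a 4-word buffer at address 100 with a modifier of 1.
visited = []
idx = 100
for _ in range(6):
    addr, idx = post_modify_circular(idx, 1, 100, 4)
    visited.append(addr)
```

The hardware does the wrap for free as part of the memory access, which is what keeps a delay line cheap to maintain.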
12
Ease of Use
ADSP-21061
  Easy to use
  Algebraic instruction set
  VisualDSP environment
ADSP-TS101S
  Similar to the 21061, but now you have to consider 2 compute blocks
  ADI suggests leaving parallelization to their optimizing compiler
  VisualDSP environment
13
Specific DSP Algorithms and the TigerSHARC
In ENEL515 (and/or related articles) we’ve studied the FIR, IIR, and FFT algorithms
TigerSHARC has a massively parallel architecture that is tailored to performing these algorithms
14
FIR Filter Characteristics
Think back (or forward, depending on how much you’ve procrastinated) to Lab #3.
FIR Characteristics
  Simple, long loop
  Repetitive calculations (multiply, then add!)
  Access to an array of coefficients, and an array of “delay-line” values
  Few data dependency issues during the calculation of a single output
  For a filter of length N, N multiplications and N adds are required to obtain a single output value
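The characteristics above show up directly in a plain software FIR. This is a minimal reference sketch, not the Lab #3 code. The function names are made up, and the delay line is kept newest-sample-first by shifting a list rather than by the circular buffer a DSP would use.

```python
def fir_output(coeffs, delay_line):
    """One FIR output: the sum of coeffs[k] * delay_line[k] over all N taps."""
    if len(coeffs) != len(delay_line):
        raise ValueError("coefficient and delay-line arrays must match in length")
    acc = 0
    for c, x in zip(coeffs, delay_line):
        acc += c * x  # one multiply and one add per tap
    return acc


def fir_filter(coeffs, samples):
    """Run the filter over a sample stream, starting from an all-zero delay line."""
    delay_line = [0] * len(coeffs)
    out = []
    for s in samples:
        delay_line = [s] + delay_line[:-1]  # newest sample first
        out.append(fir_output(coeffs, delay_line))
    return out
```

Feeding in an impulse, for example `fir_filter([1, 2, 3], [1, 0, 0, 0])`, reads the coefficients back out, which makes a handy first test.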
15
TigerSHARC and the FIR Filter
The general idea is: divide and conquer!
  Take a filter of size N and split it into two groups of N/2
  Utilize the TigerSHARC’s multiple computational units and MAC instructions to perform the algorithm in ½ the time (plus some overhead)
  Two hardware loop counters to simultaneously control the two new “N/2” size FIR loops with no overhead!
Can do all of the following SIMULTANEOUSLY!
  Fetch two operands (one coefficient, one delay-line value) from two separate memory banks
  Fetch the next instruction
  Perform arithmetic operations on the PREVIOUS operands!
Unlike SHARC, instruction/data clashes are nonexistent due to the numerous bus paths linking computational units to memory space
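The divide-and-conquer split can be sketched in software with two accumulators standing in for compute blocks X and Y. Python runs the two MACs in each iteration one after the other, so this shows only the data partitioning, not the speedup. The function name and the handling of an odd tap count are assumptions made for illustration.

```python
def fir_output_split(coeffs, delay_line):
    """One FIR output computed as two half-length partial sums."""
    n = len(coeffs)
    if n != len(delay_line):
        raise ValueError("coefficient and delay-line arrays must match in length")
    half = n // 2
    acc_x = 0  # stands in for compute block X: taps 0 .. half-1
    acc_y = 0  # stands in for compute block Y: taps half .. 2*half-1
    for k in range(half):
        # On the hardware these two MACs would issue in the same cycle.
        acc_x += coeffs[k] * delay_line[k]
        acc_y += coeffs[half + k] * delay_line[half + k]
    if n % 2:
        # Leftover tap when N is odd.
        acc_x += coeffs[n - 1] * delay_line[n - 1]
    # The "plus some overhead" part: combine the two partial sums.
    return acc_x + acc_y
```

The loop runs N/2 times instead of N, and the final add of the two accumulators is the extra cost of the split.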
16
TigerSHARC and the FIR Filter (continued…)
8-cycle-deep pipeline
  Stalls are expensive
  Branch Target Buffer reduces the performance loss that results from branching in a deeply pipelined processor
The long-loop characteristic of the FIR filter algorithm allows us to keep the 8-cycle-deep pipeline full
  Full pipeline means fast algorithm
FIR filter algorithms rely heavily on data sets that are aligned in memory
  Post-increment is your friend
  TigerSHARC quad data accesses – supply four aligned words to one compute block or two aligned words to each compute block
17
Example Instructions
X/Y Conditional Compute
  if xALE; do, R0=R1+R2
  Condition codes: AEQ, ALT, ALE, ALU, MEQ, MLT, MLE, SEQ, SLT, SF0, SF1
  A = Adder, M = Multiplier, S = Shifter
Memory Addressing
  Indirect post-modify with update, register offset:
    YR20=[J1+=J2]
  Indirect post-modify with update, 8-bit immediate offset:
    Q[K1+=0xF8]=XYR3:0
  Indirect pre-modify no update, register offset:
    J3:2=L[K1+K2]
  Indirect pre-modify no update, immediate offset:
    YR3:2=L[K1+0x0003333]
Complex Quad 16-bit Fixed Point Multiplication Instructions
  {X|Y|XY} MRa += Rm ** Rn {({U}{I}{C|CR}{J})}
  {X|Y|XY} Rs|Rsd=MRa, MRa+= Rm ** Rn {({U}{I}{C}{J})}
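The difference between the two addressing families above is easier to see in a small software model. The sketch below treats memory as a word-addressed Python list and the pointer registers as a dictionary. Both are simplifications, the helper names are hypothetical, and long/quad alignment is ignored.

```python
def load_post_modify(memory, regs, pointer, offset):
    """Model of Rs = [Jm += offset]: read at the pointer, then update the pointer."""
    value = memory[regs[pointer]]
    regs[pointer] += offset
    return value


def load_pre_modify(memory, regs, pointer, offset):
    """Model of Rs = [Jm + offset]: read at pointer + offset, pointer unchanged."""
    return memory[regs[pointer] + offset]


memory = [10, 20, 30, 40]
regs = {"J1": 1}
first = load_post_modify(memory, regs, "J1", 2)   # reads memory[1], J1 becomes 3
second = load_pre_modify(memory, regs, "J1", -3)  # reads memory[0], J1 stays 3
```

Post-modify is the one that walks through an array in a loop. Pre-modify is for picking out a value at a fixed distance from a base address.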
19
TigerSHARC and the IIR Filter
Short, simple loop characteristic
  Means loop overhead is more of a concern
  Means keeping the pipeline full is tougher!
  Time to unroll the loop, although ADI says to let VisualDSP do it for you
Again, split up the calculations on an N-tap IIR filter into two N/2 sets operating simultaneously
  Idea: one computational block does the feedforward calculations, one does the feedback!
Complex numbers commonly required
  Hardware support for complex MAC in TigerSHARC
Again, quad data access comes in handy for aligned data
Post-increment is still your friend
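The feedforward/feedback split can be sketched with a direct-form I IIR filter, where the two sums are independent within one output sample. The coefficient convention below (`a` holds a[1], a[2], ... with a[0] assumed to be 1, and the feedback sum is subtracted) is an assumption for this sketch, as is the function name.

```python
def iir_filter(b, a, samples):
    """Direct-form I IIR filter.

    b: feedforward coefficients b[0], b[1], ...
    a: feedback coefficients a[1], a[2], ... (a[0] is assumed to be 1)
    """
    x_hist = [0] * len(b)  # past inputs, newest first
    y_hist = [0] * len(a)  # past outputs, newest first
    out = []
    for s in samples:
        x_hist = [s] + x_hist[:-1]
        # These two sums share no data, so one compute block could take each.
        feedforward = sum(bk * xk for bk, xk in zip(b, x_hist))
        feedback = sum(ak * yk for ak, yk in zip(a, y_hist))
        y = feedforward - feedback
        y_hist = [y] + y_hist[:-1]
        out.append(y)
    return out
```

For example, `iir_filter([1], [-0.5], [1, 0, 0, 0])` gives a decaying impulse response of 1, 0.5, 0.25, 0.125. Note that the next output still has to wait for the current one, which is part of why the pipeline is harder to keep full than with an FIR.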
20
TigerSHARC and the FFT
Does not use the same MAC modes that IIR and FIR filters do
Requires more complicated addressing modes
  Example: bit-reversed addressing
  Found on both SHARC and TigerSHARC
Difficult to split onto separate computational units and even more difficult to split amongst distributed processors
Requires large arrays of complex variables and fixed coefficients
  Hardware complex number MAC comes in handy again!
  Large arrays of aligned data – quad data access again!
Requires HIGH-PRECISION arithmetic
  Luckily we have 64-bit fixed-point arithmetic and 40-bit extended floating-point arithmetic
  80-bit MAC precision
FFT requires many intermediate values
  32 GP registers in a single computational block
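Bit-reversed addressing, mentioned above, is the index shuffle a radix-2 FFT needs on its input or output. Here is a minimal software sketch of what the address generators do in hardware. The function names are illustrative, and the transform length is assumed to be a power of two.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Bit-reversed index order for an n-point radix-2 FFT."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]
```

For an 8-point FFT the order is 0, 4, 2, 6, 1, 5, 3, 7. Doing this shuffle in software costs a loop per index, which is why having it in the addressing hardware matters.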
23
Conclusion
TigerSHARC has a very SHARC-like architecture, except it’s MUCH more complex
  Highly optimized for parallelism
Major features: complex number support, multiple computational units, high instruction throughput, wider buses
Performs DSP algorithms including FIR, IIR, and FFT significantly faster than SHARC!