v-iram compiler benchmarks and applications
Slide 1
V-IRAM Compiler Benchmarks and Applications
Adam Janin, David Judd, Christoforos Kozyrakis, David Martin, Thinh Nguyen, Randi Thomas, David Patterson, Kathy Yelick
Slide 2
Overview of V-IRAM Benchmarks
• Hand-Coded Benchmarks
– Media Kernels
– FFT
– H.263 Video Encoder Application
• Compiled Benchmarks
– Matrix vector multiplication and VIRAM addressing
– Floating point benchmarks
– Integer benchmarks
• Underway
– Speech Application (Janin poster)
– Scientific (DOE) Applications (Li and Oliker poster)
– Data-intensive (DARPA DIS) Benchmarks (Gaeke)
Slide 3
Hand-Coded Applications Update
• Image processing kernels (old FPU model)
– Note BLAS-2 performance
Kernel                 Peak Perf.    Sustained Perf.   % of Peak
Image Composition      6.4 GOPS      6.40 GOPS         100.0%
iDCT                   6.4 GOPS      1.97 GOPS         30.7%
Color Conversion       3.2 GOPS      3.07 GOPS         96.0%
Image Convolution      3.2 GOPS      3.16 GOPS         98.7%
Integer MV Multiply    3.2 GOPS      2.77 GOPS         86.5%
Integer VM Multiply    3.2 GOPS      3.00 GOPS         93.7%
FP MV Multiply         3.2 GFLOPS    2.80 GFLOPS       87.5%
FP VM Multiply         3.2 GFLOPS    3.19 GFLOPS       99.6%
AVERAGE                                                86.6%
Slide 4
Media Kernel Comparisons
Kernel               VIRAM   MMX            VIS            TMS320C82
Image Composition    0.13    -              2.22 (17.0x)   -
iDCT                 1.18    3.75 (3.2x)    -              -
Color Conversion     0.78    8.00 (10.2x)   -              5.70 (7.6x)
Image Convolution    1.22    5.49 (4.5x)    6.19 (5.1x)    6.50 (5.3x)
• All numbers in cycles/pixel
• VIRAM uses 4 lanes, 1 sub-bank per bank
• MMX and VIS results assume all data in L1 cache
Slide 5
FFT: Uses In-Register Permutations
[Figure: FFT performance; x axis: FFT Points (256 to 2048), log scale from 100 to 10000; y axis: MOPS/MFLOPS (0 to 2500); series: 16b Fixed, 32b Float, 32b Float without in-register permutations]
Slide 6
Scalable Design
1 lane, 2 MB: 0.8 Gops
2 lanes, 4 MB: 1.6 Gops
4 lanes, 8 MB: 3.2 Gops (32-bit)
• Scaling number of lanes for performance, energy, area
• Number of DRAM banks may scale independently
– e.g., 16 banks rather than 8
• May also scale up to 8 lanes
• Single executable file used across lane scaling experiments
Slide 7
Vector Architectural State
[Figure: vector register file; data registers vr0, vr1, ..., vr31, each holding one element per virtual processor VP0, VP1, ..., VP(vl-1); each element is vpw bits wide]
• Number of VPs given by the Vector Length register vl
• Width of each VP given by the register vpw
– vpw is one of {8b,16b,32b,64b}
• Maximum vector length is given by a read-only register mvl
– mvl depends on implementation and vpw: {128,128,64,32} in VIRAM-1
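The relationship between vl, vpw and mvl can be sketched in plain C. This is an illustration only, not VIRAM code: `mvl_for_vpw` copies the VIRAM-1 values quoted above, and `strip_mined_add` is a hypothetical helper showing how a loop over n elements would be processed in strips of at most mvl virtual processors.

```c
#include <stddef.h>
#include <stdint.h>

/* Maximum vector length for a virtual processor width given in bits,
   using the VIRAM-1 values from the slide: {128,128,64,32} for
   vpw = {8,16,32,64}. Returns 0 for an unsupported width. */
static size_t mvl_for_vpw(unsigned vpw_bits)
{
    switch (vpw_bits) {
    case 8:  return 128;
    case 16: return 128;
    case 32: return 64;
    case 64: return 32;
    default: return 0;
    }
}

/* Hypothetical strip-mined loop: c[i] = a[i] + b[i] on 16-bit data.
   Each outer iteration models one vector instruction sequence with
   vl = min(remaining elements, mvl). Returns the number of strips. */
static size_t strip_mined_add(const int16_t *a, const int16_t *b,
                              int16_t *c, size_t n)
{
    const size_t mvl = mvl_for_vpw(16);
    size_t strips = 0;

    for (size_t i = 0; i < n; i += mvl) {
        size_t vl = (n - i < mvl) ? (n - i) : mvl;   /* value for vl */
        for (size_t vp = 0; vp < vl; vp++)           /* one VP per element */
            c[i + vp] = (int16_t)(a[i + vp] + b[i + vp]);
        strips++;
    }
    return strips;
}
```

With 300 16-bit elements the sketch runs three strips (128, 128 and 44 elements). Only the value returned by `mvl_for_vpw` depends on the implementation.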
Slide 8
VIRAM Compiler
• Based on Cray’s production compiler
• Challenges:
– narrow data types and scalar/vector memory consistency
• Advantages relative to media-extensions:
– powerful addressing modes and ISA independent of datapath width
[Figure: compiler structure; Frontends (C, C++, Fortran95) feed Cray’s PDGCS Optimizer, which feeds Code Generators (C90/T90/SV1, T3D/T3E, SV2/VIRAM)]
Slide 9
Compiler Challenges
• Can compiled code effectively use VIRAM design?
– Is on-chip DRAM bandwidth sufficient?
– How well do multimedia applications vectorize?
– Does VIRAM’s model of variable width data (VPW) fit into a compilation framework?
Slide 10
Matrix-Vector Multiplication
• Matrix vector multiply
– dot: 2 vloads, (both unit stride + a reduction)
– saxpy: 2 vloads, 1 vstore (2 strided + 1 unit)
• Vector matrix multiply (= mvm with column layout)
– saxpy: 2 vloads, 1 vstore (all unit stride)
• Sparse matrix-vector multiply
– dot: 3 vloads (1 indexed, 2 unit + reduction)
– saxpy: 3 vloads, 1 vstore (2 indexed, 2 unit) (column layout)
[Figure: Matrix times Source vector gives Destination vector]
Assume row layout
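The two dense formulations listed above differ only in loop order. A minimal C sketch, assuming a square n x n matrix stored in row layout; the function names are illustrative and not taken from the benchmark sources:

```c
#include <stddef.h>

/* Dot-product form: the inner loop walks one row of A and the vector x
   with unit stride and ends in a sum reduction. */
static void mvm_dot(const float *A, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float sum = 0.0f;
        for (size_t j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
}

/* Saxpy form: the inner loop updates all of y with one column of A
   scaled by x[j]. In row layout the column access has stride n, and
   there is no reduction. */
static void mvm_saxpy(const float *A, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = 0.0f;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            y[i] += A[i * n + j] * x[j];
}
```

If the matrix is stored in column layout instead, the inner loop of the saxpy form becomes unit stride, which is the vector-matrix case on the slide.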
Slide 11
Matrix Vector Multiplication
• Performance of various source optimizations
[Figure: bar chart, mvm 128x128, single; y axis: MFLOPS (0 to 1800); series: 1 lane, 2 lane, 4 lane]
Column layout mvm = row layout vmm
Column performance ~= peak
Slide 12
Comparison of MVM Performance
• Double precision floating point
– compiled for VIRAM (note: chip only does single)
– hand- or Atlas-optimized for other machines
[Figure: bar chart, 100x100 matrix; y axis: MFLOPS (0 to 600); bars: VIRAM 4 row, VIRAM 8 row, VIRAM 4 col, VIRAM 8 col, Sun Ultra I, Sun Ultra II, MIPS 12K, Alpha 21164, Alpha 21264, Alpha 21264 1K, Power PC 604e, Power 3 630]
As matrix size increases, performance:
– drops on cache-based designs
– increases on vector designs
– but 64x64 about 20% better on VIRAM
Slide 13
Sparse MVM Performance
• Performance is matrix-dependent: lp matrix
– compiled for VIRAM using “independent” pragma
» sparse column layout
– Sparsity-optimized for other machines
» sparse row (or blocked row) layout
[Figure: bar chart; y axis: MFLOPS (0 to 250); bars: VIRAM 4, VIRAM 8, Sun Ultra I, MIPS 10K, Alpha 21264, Power PC 604e; series: lp, dense]
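A sparse column layout leads to the saxpy form of sparse matrix-vector multiply, with an indexed load and an indexed store on the destination vector. The sketch below assumes a compressed sparse column structure; `struct csc` and `spmv_csc` are hypothetical names, and the exact syntax of the “independent” pragma is not given in the slides, so it appears only as a comment.

```c
#include <stddef.h>

/* Hypothetical compressed sparse column (CSC) matrix. */
struct csc {
    size_t ncols;
    const size_t *col_start;  /* ncols + 1 entries */
    const size_t *row_idx;    /* row index of each nonzero */
    const float  *val;        /* value of each nonzero */
};

/* y += A * x in saxpy form. The inner loop loads val and row_idx with
   unit stride, gathers y through row_idx, and scatters the result back. */
static void spmv_csc(const struct csc *A, const float *x, float *y)
{
    for (size_t j = 0; j < A->ncols; j++) {
        const float xj = x[j];
        /* Row indices within one column are distinct, so the updates in
           this loop never touch the same y element twice. That is the
           property an "independent" hint would assert to a vectorizer. */
        for (size_t k = A->col_start[j]; k < A->col_start[j + 1]; k++)
            y[A->row_idx[k]] += A->val[k] * xj;
    }
}
```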
Slide 14
Generating Code for Variable VPW
• Strategy: vectorizer determines minimum correct vpw for each loop nest
– Vectorizer assumes vpw=64 initially
– At end of vectorization, discard vectorized copy of loop if greatest width encountered is less than 64 and start vectorization over with new vpw
– Code gen checks vpw for each loop nest
• Limitation: a single loop nest will run at the speed of the widest type
– Reason: simplicity & performance of the common case
– No attempt to split/combine loops based on vpw
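The outcome of this strategy, one vpw per loop nest equal to the widest type it contains, can be sketched as follows. This is not the vcc implementation: the function computes the final width directly instead of vectorizing at 64 and restarting, and `loop_nest_vpw` and its arguments are hypothetical.

```c
#include <stddef.h>

/* Returns the vpw (in bits) for a loop nest, given the bit widths of all
   operand types that appear in it. Each width is rounded up to one of
   the supported values {8,16,32,64}; the widest one wins. */
static unsigned loop_nest_vpw(const unsigned *operand_bits, size_t count)
{
    unsigned vpw = 8;   /* narrowest supported width */

    for (size_t i = 0; i < count; i++) {
        unsigned w = operand_bits[i];
        unsigned rounded = (w <= 8) ? 8 : (w <= 16) ? 16 : (w <= 32) ? 32 : 64;
        if (rounded > vpw)
            vpw = rounded;
    }
    return vpw;
}
```

A loop nest that mixes 8-bit loads with 16-bit arithmetic would get vpw=16, so the 8-bit data runs at the 16-bit rate, which is the limitation noted above.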
Slide 15
Media Benchmarks
• Mostly from U Toronto’s benchmark suite
• 8-bit data, 16-bit operations
– Colorspace: strided loads/stores
– Composition: unit stride
– Convolve: strided
• Mixed 16 and 32-bit integer
– Detect
– Decrypt
• 32-bit Floating point
– FIR filter
– SAXPY 64: 64 element
– SAXPY 1K: 1024 element
– matmul: matrix multiplication
Slide 16
Integer Benchmarks
• Strided access important, e.g., RGB
– narrow types limited by address generation
• Outer loop vectorization and unrolling used
– helps avoid short vectors
– spilling can be a problem
• Tiling could probably help
[Figure: bar chart; y axis: 0 to 7000; series: 1 lane, 2 lane, 4 lane]
Slide 17
Floating Point DSP Benchmarks
• Performance is competitive with hand-coding
• Vector length is important (e.g., saxpy)
– but multiple vectors is fine (e.g., matmul)
[Figure: bar chart; y axis: 0 to 2000; groups: fir filter, saxpy 64, saxpy 1K, saxpy 4K, matmul, peak; series: 1 lane, 2 lane, 4 lane]
Slide 18
Conclusions
• VIRAM ISA shows high performance on compiled code
– competitive with modern processors
– limitations are address generation for strided and indexed memory operations
• Compiler effectively uses variable width data
– allows media applications to vectorize
– performance scales with inverse data width
• Future compiler work
– Fixed point support
– Better register allocation
Slide 19
Backup slides
Slide 20
Performance Summary
• Performance of compiled code is generally good
– matmul and saxpy meet or beat hand-coded
– 3 addressing modes very useful
• Limitations to performance
– Dependencies or inadequate compiler analysis
– Inadequate memory bandwidth
– Lack of address generators
– Short vectors
• Future compiler work
– Tiling
– Fixed point support
– Better register allocation
Slide 21
Scaling Media Benchmarks
[Figure: bar chart; y axis: MOPS (0 to 5000); groups: colorspace, composition, convolve, detect, decrypt, fir filter, saxpy 64, saxpy 1K, matmul; series: 1 lane, 2 lane, 4 lane, 8 lane]
Slide 22
Compiled matrix-vector multiplication: 2 Flops/element
• Easy compilation problem; stresses memory bandwidth
• Compare to 304 Mflops (64-bit) for Power3 (hand-coded)
[Figure: bar chart; y axis: MFLOPS (0 to 900); groups: mvm 32-bit, 8 banks; mvm 32-bit, 16 banks; mvm 64-bit, 8 banks; mvm 64-bit, 16 banks; series: 1 lane, 2 lane, 4 lane, 8 lane]
– Performance scales with number of lanes up to 4
– Need more memory banks than default DRAM macro for 8 lanes
Slide 23
Outline
• Why vectors for IRAM?
– Including media types
• The virtual lane model
• Virtual processor width
• Limitations to performance
– Dependencies or inadequate compiler analysis
– Inadequate memory bandwidth
– Lack of address generators
– Short vectors
• Comparisons to other architectures
• Conclusions
Slide 24
Matrix-Vector Multiply
• Scaling Matrix-Vector Multiplication
Slide 25
Performance on Media Benchmarks
• Using compiled code: 1, 2, 4, and 8 lanes
Slide 26
Compiled matrix-vector multiplication: 2 Flops/element
• Easy compilation problem; stresses memory bandwidth
• Compare to 304 Mflops (64-bit) for Power3 (hand-coded)
[Figure: bar chart; y axis: MFLOPS (0 to 600); groups: mvm 64-bit, 8 banks; mvm 64-bit, 16 banks; series: VIRAM; VIRAM, 16 banks]
– Performance scales with number of lanes up to 4
– Need more memory banks than default DRAM macro for 8 lanes
Slide 27
Compiling Media Kernels on IRAM
• The compiler generates code for narrow data widths, e.g., 16-bit integer
• Compilation model is simple, more scalable (across generations) than MMX, VIS, etc.
[Figure: bar chart; y axis labeled MFLOPS (0 to 3500); groups: colorspace, composite, FIR filter; series: 1 lane, 2 lane, 4 lane, 8 lane]
– Strided and indexed loads/stores simpler than pack/unpack
– Maximum vector length is longer than datapath width (256 bits); all lane scalings done with single executable
Slide 28
Vector Vs. SIMD: Example
• Simple image processing example:
– conversion from RGB to YUV
Y = [( 9798*R + 19235*G + 3736*B) / 32768]
U = [(-4784*R - 9437*G + 4221*B) / 32768] + 128
V = [(20218*R - 16941*G - 3277*B) / 32768] + 128
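As a scalar reference for the vector and SIMD listings that follow, the three formulas can be written directly in C. This is a sketch under stated assumptions: the coefficients are copied exactly as printed on the slide, the square brackets are read as integer division (truncating toward zero in C), and the clamp to 0..255 is an addition that the slide does not mention.

```c
#include <stdint.h>

/* Clamp an intermediate result to the 8-bit pixel range (assumption:
   the slide does not say how out-of-range values are handled). */
static uint8_t clamp_u8(int32_t v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Convert one RGB pixel to YUV with the fixed-point coefficients from
   the slide. All products fit comfortably in 32 bits. */
static void rgb_to_yuv(uint8_t r, uint8_t g, uint8_t b,
                       uint8_t *y, uint8_t *u, uint8_t *v)
{
    const int32_t R = r, G = g, B = b;

    *y = clamp_u8(( 9798 * R + 19235 * G + 3736 * B) / 32768);
    *u = clamp_u8((-4784 * R -  9437 * G + 4221 * B) / 32768 + 128);
    *v = clamp_u8((20218 * R - 16941 * G - 3277 * B) / 32768 + 128);
}
```

Each output channel needs three multiplies, two adds, a scale by 32768 and, for U and V, an offset.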
Slide 29
VIRAM Code (22 instructions)
RGBtoYUV:
    vlds.u.b    r_v, r_addr, stride3, addr_inc   # load R
    vlds.u.b    g_v, g_addr, stride3, addr_inc   # load G
    vlds.u.b    b_v, b_addr, stride3, addr_inc   # load B
    xlmul.u.sv  o1_v, t0_s, r_v                  # calculate Y
    xlmadd.u.sv o1_v, t1_s, g_v
    xlmadd.u.sv o1_v, t2_s, b_v
    vsra.vs     o1_v, o1_v, s_s
    xlmul.u.sv  o2_v, t3_s, r_v                  # calculate U
    xlmadd.u.sv o2_v, t4_s, g_v
    xlmadd.u.sv o2_v, t5_s, b_v
    vsra.vs     o2_v, o2_v, s_s
    vadd.sv     o2_v, a_s, o2_v
    xlmul.u.sv  o3_v, t6_s, r_v                  # calculate V
    xlmadd.u.sv o3_v, t7_s, g_v
    xlmadd.u.sv o3_v, t8_s, b_v
    vsra.vs     o3_v, o3_v, s_s
    vadd.sv     o3_v, a_s, o3_v
    vsts.b      o1_v, y_addr, stride3, addr_inc  # store Y
    vsts.b      o2_v, u_addr, stride3, addr_inc  # store U
    vsts.b      o3_v, v_addr, stride3, addr_inc  # store V
    subu        pix_s, pix_s, len_s
    bnez        pix_s, RGBtoYUV
Slide 30
MMX Code (part 1)
RGBtoYUV:
    movq      mm1, [eax]
    pxor      mm6, mm6
    movq      mm0, mm1
    psrlq     mm1, 16
    punpcklbw mm0, ZEROS
    movq      mm7, mm1
    punpcklbw mm1, ZEROS
    movq      mm2, mm0
    pmaddwd   mm0, YR0GR
    movq      mm3, mm1
    pmaddwd   mm1, YBG0B
    movq      mm4, mm2
    pmaddwd   mm2, UR0GR
    movq      mm5, mm3
    pmaddwd   mm3, UBG0B
    punpckhbw mm7, mm6;
    pmaddwd   mm4, VR0GR
    paddd     mm0, mm1
    pmaddwd   mm5, VBG0B
    movq      mm1, 8[eax]
    paddd     mm2, mm3
    movq      mm6, mm1
    paddd     mm4, mm5
    movq      mm5, mm1
    psllq     mm1, 32
    paddd     mm1, mm7
    punpckhbw mm6, ZEROS
    movq      mm3, mm1
    pmaddwd   mm1, YR0GR
    movq      mm7, mm5
    pmaddwd   mm5, YBG0B
    psrad     mm0, 15
    movq      TEMP0, mm6
    movq      mm6, mm3
    pmaddwd   mm6, UR0GR
    psrad     mm2, 15
    paddd     mm1, mm5
    movq      mm5, mm7
    pmaddwd   mm7, UBG0B
    psrad     mm1, 15
    pmaddwd   mm3, VR0GR
    packssdw  mm0, mm1
    pmaddwd   mm5, VBG0B
    psrad     mm4, 15
    movq      mm1, 16[eax]
Slide 31
MMX Code (part 2)
    paddd     mm6, mm7
    movq      mm7, mm1
    psrad     mm6, 15
    paddd     mm3, mm5
    psllq     mm7, 16
    movq      mm5, mm7
    psrad     mm3, 15
    movq      TEMPY, mm0
    packssdw  mm2, mm6
    movq      mm0, TEMP0
    punpcklbw mm7, ZEROS
    movq      mm6, mm0
    movq      TEMPU, mm2
    psrlq     mm0, 32
    paddw     mm7, mm0
    movq      mm2, mm6
    pmaddwd   mm2, YR0GR
    movq      mm0, mm7
    pmaddwd   mm7, YBG0B
    packssdw  mm4, mm3
    add       eax, 24
    add       edx, 8
    movq      TEMPV, mm4
    movq      mm4, mm6
    pmaddwd   mm6, UR0GR
    movq      mm3, mm0
    pmaddwd   mm0, UBG0B
    paddd     mm2, mm7
    pmaddwd   mm4,
    pxor      mm7, mm7
    pmaddwd   mm3, VBG0B
    punpckhbw mm1,
    paddd     mm0, mm6
    movq      mm6, mm1
    pmaddwd   mm6, YBG0B
    punpckhbw mm5,
    movq      mm7, mm5
    paddd     mm3, mm4
    pmaddwd   mm5, YR0GR
    movq      mm4, mm1
    pmaddwd   mm4, UBG0B
    psrad     mm0, 15
    paddd     mm0, OFFSETW
    psrad     mm2, 15
    paddd     mm6, mm5
    movq      mm5, mm7
Slide 32
MMX Code (pt. 3: 121 instructions)
    pmaddwd   mm7, UR0GR
    psrad     mm3, 15
    pmaddwd   mm1, VBG0B
    psrad     mm6, 15
    paddd     mm4, OFFSETD
    packssdw  mm2, mm6
    pmaddwd   mm5, VR0GR
    paddd     mm7, mm4
    psrad     mm7, 15
    movq      mm6, TEMPY
    packssdw  mm0, mm7
    movq      mm4, TEMPU
    packuswb  mm6, mm2
    movq      mm7, OFFSETB
    paddd     mm1, mm5
    paddw     mm4, mm7
    psrad     mm1, 15
    movq      [ebx], mm6
    packuswb  mm4,
    movq      mm5, TEMPV
    packssdw  mm3, mm4
    paddw     mm5, mm7
    paddw     mm3, mm7
    movq      [ecx], mm4
    packuswb  mm5, mm3
    add       ebx, 8
    add       ecx, 8
    movq      [edx], mm5
    dec       edi
    jnz       RGBtoYUV
Slide 33
IRAM Status
• Chip
– ISA has not changed significantly in over a year
– Verilog complete, except SRAM for scalar cache
– Testing framework in place
• Compiler
– Backend code generation complete
– Continued performance improvements, especially for narrow data widths
• Application & Benchmarks
– Handcoded kernels better than MMX, VIS, gp DSPs
» DCT, FFT, MVM, convolution, image composition,…
– Compiled kernels demonstrate ISA advantages
» MVM, sparse MVM, decrypt, image composition,…
– Full applications: H263 encoding (done), speech (underway)
Slide 34
Backup from Dave Judd’s Talk
Slide 35
VIRAM Tools
• vas: assembler
• vdis: disassembler
• vsim-isa: simulator
• vsim-db: debugger
• vsim-p: performance simulator
• vsim-sync: memory consistency simulator
Slide 36
Compiler Testing
• C regression test suite (commercial test suite)
– Scalar emphasis, C conformance
– All tests pass except:
» Small numerical differences due to lack of 128-bit f.p. support
• C++ test suite
– 1167 of 1183 tests execute correctly.
– 12 failures in compilation: “undefined variables”
– 4 failures in execution: bad answers
Slide 37
Compiler Testing
• Vector regression test suites (CRAY)
– Specifically tests for vectorization
– Compares vector and scalar results
– Easy to isolate problems
– “vector” status:
» 59 of 62 tests pass
» Some minor numerical differences
» 1 bad answer, 2 integer overflow
– “vector4” status
» 163 of 165 tests execute correctly
» 1 bad answer, 1 illegal use of vector inst.
Slide 38
Kernel Performance: mvm (matrix-vector multiplication)
Hand optimized assembly code: 579 mflops
vcc w/ restrict keywords added: 352 mflops
+ 1 element padding to avoid bank conflicts: 401 mflops
+ shortloop directive: 592 mflops
Loops interchanged & outer loop vectorized by vcc.
64x64, 32 bit floating pt.
Slide 39
Mods to mvm code
/* Original code mvm.c */
void mvm (float * A,
          float * X,
          float * Y,
          int n,
          int acol ) {
    int i,j;
    if ( n <= 64 ) {
        for (i = 0; i < n; i++) {
            for (j = 0; j < n; j++) {
                Y[j] += A[j*acol+i] * X[i];
            }
        }
    }
}

/* Modified code */
void mvm (float * restrict A,
          float * restrict X,
          float * restrict Y,
          int n,
          int acol ) {
    int i,j;
    float x_elem;
    if ( n <= 64 ) {
        for (i = 0; i < n; i++) {
            /* The assignment is not legible in the transcript;
               x_elem = X[i] is assumed from its use below. */
            x_elem = X[i];
#pragma shortloop
            for (j = 0; j < n; j++) {
                Y[j] += A[j*acol+i] * x_elem;
            }
        }
    }
}
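The "1 element padding" step from the performance slide corresponds to calling mvm with acol = n + 1, so that consecutive rows of A start one word further along. The sketch below repeats the loop with an explicit `acol` and adds a toy bank model to show why an odd stride helps. The word-interleaved model in `banks_touched` is an assumption for illustration and does not describe the actual VIRAM-1 memory system.

```c
#include <stddef.h>

/* Same loop structure as the slide's mvm: Y[j] += A[j*acol+i] * X[i],
   where acol is the allocated row length of A (acol >= n). */
static void mvm_ld(const float *restrict A, const float *restrict X,
                   float *restrict Y, int n, int acol)
{
    for (int i = 0; i < n; i++) {
        const float x_elem = X[i];
        for (int j = 0; j < n; j++)
            Y[j] += A[j * acol + i] * x_elem;
    }
}

/* Number of distinct banks touched by n accesses with a stride of
   `stride` words, in a simple word-interleaved model with `banks` banks.
   Returns -1 if banks is outside 1..64. */
static int banks_touched(int n, int stride, int banks)
{
    int seen[64] = {0};
    int count = 0;

    if (banks < 1 || banks > 64)
        return -1;
    for (int j = 0; j < n; j++) {
        int b = (j * stride) % banks;
        if (!seen[b]) {
            seen[b] = 1;
            count++;
        }
    }
    return count;
}
```

In this model a 64-element strided access with acol = 64 and 8 banks lands on a single bank, while acol = 65 spreads the same access over all 8.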
Slide 40
Kernel performance: mm_mul (matrix-matrix multiplication)
Hand coded assembly (mm-mul-small.s): 1.58 gigaflops
vcc w/ restrict and shortloop keywords: 0.852 gigaflops
+ inner two loops in separate function, allows outer loop vectorization: 1.51 gigaflops
64x64x64, 32 bit float, 1.6 gigaflop theoretical peak
Slide 41
Kernel performance: saxpy
• 32 bit floating point ops
                           N=64   256   1024   4096
Hand coded assembly        379    593   691    720
vcc w/restrict keywords    385    596   692    721
Slide 42
Kernel performance: motion_estimate
Hand optimized assembly: 1.181 gigaops
vcc w/restrict keywords: 170 mops
+ shortloop directives: 253 mops
+ outer loop unroll directive: 257 mops*
32 bit integer ops, finding the minimum sum of absolute differences for a reference block and a region in an image.
*No improvement because of spilling.
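The computation this kernel performs, the minimum sum of absolute differences (SAD) between a reference block and every candidate position in a region, can be sketched in scalar C. This is not the benchmark source: the function names, the exhaustive search order and the square block shape are assumptions.

```c
#include <stdint.h>
#include <stdlib.h>

/* SAD between a bs x bs reference block and the block whose top-left
   corner is at (x, y) in an image with row length `width`. */
static int32_t sad_block(const int32_t *ref, const int32_t *img,
                         int width, int x, int y, int bs)
{
    int32_t sum = 0;

    for (int r = 0; r < bs; r++)
        for (int c = 0; c < bs; c++)
            sum += abs(ref[r * bs + c] - img[(y + r) * width + (x + c)]);
    return sum;
}

/* Exhaustive search over all positions where the block fits inside a
   width x height image. Returns the minimum SAD and stores the position
   of the first block that achieves it. */
static int32_t min_sad_search(const int32_t *ref, const int32_t *img,
                              int width, int height, int bs,
                              int *best_x, int *best_y)
{
    int32_t best = INT32_MAX;

    for (int y = 0; y + bs <= height; y++) {
        for (int x = 0; x + bs <= width; x++) {
            int32_t s = sad_block(ref, img, width, x, y, bs);
            if (s < best) {
                best = s;
                *best_x = x;
                *best_y = y;
            }
        }
    }
    return best;
}
```

The inner loops over a block are short, which is consistent with the shortloop and outer-loop unrolling directives listed on the slide.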
Slide 43
Dongarra loops
• 100 loops to test compiler vectorization capability
• Rewritten in C by Cray (?)
• vcc vectorizes 74 loops
• vcc partially vectorizes 3 loops
• vcc conditionally vectorizes 3 loops
• 1 loop not vectorized because vector sin/cos not currently available on viram.
• 19 other loops not vectorized
• Data provided by Sam Williams
Slide 44
Features Remaining:
• Support version 3 isa and version 4 isa:
– Isa changes required by Mips Inc. scalar core
– Performance simulator only supports “old” isa
• Finish sync support
– take advantage of Cray implementation
• VIRAM machine “target”
– Allow easier maintenance of frontend and optimizer mods for viram
• User documentation
– Summary of differences w/Cray compiler
– Useful options, hints for vector code
Slide 45
Performance Features Remaining
• Additional tuning: instruction scheduler
• Support new SV2 inliner for C/C++
• Shortloop enhancements
• Reduce spilling
– Scheduler concern with registers
– Ordering of blocks for register assignment within “priority groups”
– Special vector registers carried across calls
• Loop unrolling for vector loops
• Tune for key benchmarks
Slide 46
Other Future Compiler Features ?
• Support for speculative execution
• Compiler extensions for fixed point hardware
• Support for vector functions; vector mlib
Slide 47
Summary
• vcc is a reasonably robust compiler for VIRAM
• Performance on kernels is good w/appropriate directives, some effort for optimum vectorization
• Need to prioritize remaining work
Slide 48
Codegen/optimizer issues for VIRAM
• Variable virtual processor width (VPW)
• Variable maximum vector register length (MVL)
• Vector flag registers treated as 1 bit wide vector register
• Multiple base, incr, stride regs. + autoincrement
• Fixed point arithmetic (saturating add, etc.)
• Memory consistency
• New vector instructions not available on SV2
Slide 49
Generating Code for Variable MVL
• Maximum vector length is not specified in IRAM ISA.
• However, compiler assumes mvl at compile time
– mvl based on vpw
– mvl assumption dependent on VIRAM-1 hardware implementation
– Recompiling required for future hardware versions if mvl changes
• MVL knowledge useful for code gen and vectorizer:
– register spilling
– short loop vectorization
– length-dependent vectorization (and may eliminate safe vector length computation at run time)
for (i = 0; i < n; i++)
    a[i] = a[i+32];
Slide 50
Why Vectors?
• Utilizes on-chip bandwidth of IRAM
– parallelism within instructions
• Efficient architecture for vectorizable code
– avoids area, power, and design of reorder logic
– low instruction decode overhead
• Multimedia algorithms are vectorizable
– e.g., vectorize across pixels in an image
• Scales easily across chip generations
– e.g., 32-way parallelism in instruction can be implemented by 1, 2, 4, 8-way
• Leverages well-known compiler technology
Slide 51
Architecture Details
• MIPS64™ 5Kc core (200 MHz)
– Single-issue scalar core with 8 Kbyte I&D caches
• Vector unit (200 MHz)
– 8 KByte register file (32 64b elements per register)
– 256b datapaths, can be subdivided into 16b, 32b, 64b:
» 2 arithmetic (1 FP, single), 2 flag processing
– Memory unit
» 4 address generators for strided/indexed accesses
• Main memory system
– 8 2-MByte DRAM macros
» 25ns random access time, 7.5ns page access time
– Crossbar interconnect
» 12.8 GBytes/s peak bandwidth per direction (load/store)
• Off-chip interface
– 2 channel DMA engine and 64b SysAD bus
Slide 52
Floorplan
• Technology: IBM SA-27E
– 0.18 µm CMOS, 6 metal layers
• 290 mm2 die area
– 225 mm2 for memory/logic
• Transistor count: ~150M
• Power supply
– 1.2V for logic, 1.8V for DRAM
• Typical power consumption: 2.0 W
– 0.5 W (scalar) + 1.0 W (vector) + 0.2 W (DRAM) + 0.3 W (misc)
• Peak vector performance
– 1.6/3.2/6.4 Gops wo. multiply-add (64b/32b/16b operations)
– 3.2/6.4/12.8 Gops w. madd
– 1.6 Gflops (single-precision)
• Tape-out planned for Spring ‘01
[Figure: die floorplan, 14.5 mm x 20.0 mm]
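The peak figures above are consistent with the parameters on the Architecture Details slide: a 256b datapath, 2 arithmetic units (1 of which does single-precision FP) and a 200 MHz clock. A small worked calculation, assuming one operation per element per unit per cycle; `peak_gops` is a hypothetical helper, not part of any VIRAM tool.

```c
/* Peak rate in Gops: elements processed per cycle per unit
   (datapath width / element width), times the number of units,
   times the clock rate in MHz, scaled to giga-operations. */
static double peak_gops(unsigned datapath_bits, unsigned elem_bits,
                        unsigned units, double clock_mhz)
{
    unsigned elems_per_cycle = datapath_bits / elem_bits;
    return elems_per_cycle * units * clock_mhz / 1000.0;
}
```

For example, `peak_gops(256, 64, 2, 200.0)` gives 1.6, `peak_gops(256, 32, 2, 200.0)` gives 3.2 and `peak_gops(256, 16, 2, 200.0)` gives 6.4, matching the rates without multiply-add. With the single FP unit at 32 bits, `peak_gops(256, 32, 1, 200.0)` gives the 1.6 Gflops figure. Counting a multiply-add as two operations doubles the integer numbers.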
Slide 53
Micro-kernel results: simulated systems
Parameter                                           1 Lane System   2 Lane System   4 Lane System   8 Lane System
# of 64-bit lanes                                   1               2               4               8
Addresses per cycle for strided-indexed accesses    1               2               4               8
Crossbar width                                      64b             128b            256b            512b
Width of DRAM bank interface                        64b             128b            256b            512b
DRAM banks                                          8               8               8               8
• Note: simulations performed with 2 load-store units and without decoupled stores or optimizations for strides 2, 3, and 4
Slide 54
Micro-kernels
Benchmark                        Operations Type   Data Width   Memory Accesses        Other Comments
Image Composition (Blending)     Integer           16b          Unit-stride
2D iDCT (8x8 image blocks)       Integer           16b          Unit-stride, Strided
Color Conversion (RGB to YUV)    Integer           32b          Unit-stride
Image Convolution                Integer           32b          Unit-stride
Matrix-vector Multiply (MV)      Integer, FP       32b          Unit-stride            Uses reductions
Vector-matrix Multiply (VM)      Integer, FP       32b          Unit-stride
• Vectorization and scheduling performed manually
Slide 55
Scaled system results
• Near linear speedup for all applications apart from iDCT
• iDCT bottlenecks
– large number of bank conflicts
– 4 addresses/cycle for strided accesses
[Figure: bar chart; y axis: Speedup (0 to 8); groups: Compositing, iDCT, Color Conversion, Convolution, MxV INT (32), VxM INT (32), MxV FP (32), VxM FP (32); series: 1 Lane, 2 Lanes, 4 Lanes, 8 Lanes]
Slide 56
FFT Floating Point Performance
• Note: Pathfinder is a board with multiple FPGAs implementing a kind of block floating point that gives floating point-accurate results.
32 bit Floating Point
Slide 57
FFT Fixed Point Performance
[Figure: 16 bit Fixed Point]
• VIRAM is competitive with high-end DSPs
• Specialized “FFT chips” can do better:
– CRI Scorpio 24 bit complex fixed point FFT DSP:
» 1024 pt = 7 microseconds
Slide 58
DCT Performance for Video Encoder
• 8x8 DCT performance in processor cycles
– Used in H.263 encoder
• Variations on VIRAM:
– 166 cycles without sub-banks
– 59 cycles for VIRAM using 16-bit data
Processor                     Cycles (relative to VIRAM)
VIRAM (4 sub-banks, LLM)      85
TriMedia TM-1000, Philips     160 (1.88x)
TI TMS320C62                  230 (2.71x)
PowerPC with Altivec          102 (1.20x)
HP PA-8000 with MAX2          147 (1.73x)
Intel Pentium II + MMX        500 (5.88x)
NEC V 830/A                   201 (2.36x)