
The Future of Vector Processors

M. Valero, R. Espasa and J. Corbal

UPC, Barcelona

Kyoto, May 28th, 1999


TOP-500 and Vector Processors

[Figure: bar chart, November 98, axis 0 to 350. Series: # Systems and % Peak Perf. Data labels: 310, 96, 4315, 65.]

Fujitsu: 27
NEC: 18
SGI: 15
Hitachi: 5


The Future of Vector ISA’s

• Cross-Pollination of Vector/Superscalar/VLIW
– MMX, Embedded...

• Very-high Performance Architectures
– ILP techniques, IRAM, SDRAM

• Vector Microprocessors
– Numerical Accelerators
– Multimedia Applications


Talk Outline

• The Past:
– Initial Motivation for Vector ISA
– Evolution of Vector Processors

• The Present:
– Recent Announcements
– The Decline of Vector Processors
– Cross-Pollination of Vector/Superscalars/VLIW

• The Future:
– Very-high Performance Architectures
– Vector Microprocessors (Numerical Accelerators, Multimedia Applications)

• Conclusions


Characteristics of Numerical Applications

• Examples: Weather prediction, mechanical engineering

• Data structures: Huge matrices (dense, sparse)

• Data types: 64 bits, floating point

• Highly repetitive loops

• Compute-intensive

• Data-Level Parallel


Initial Motivations for Vector Processors

      real*8 x(9992), y(9992), u(9984)

      subroutine loop
      integer I
      real*8 q
      do I=1,9984
         q = u(I) * y(I)
         y(I) = x(I) + q
         x(I) = q - u(I) * x(I)
      enddo
      end

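[Figure: Dependence Graph, for I = 1 to 9984. Inputs x(I), y(I) and u(I) feed two multiplies (one of them produces q); the products feed one add and one subtract.]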


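Execution of scalar code

Loop : ld   R1,0(R10)      ; R1 <- y(I)
       ld   R2,0(R11)      ; R2 <- u(I)
       ld   R3,0(R12)      ; R3 <- x(I)
       mulf R4,R1,R2       ; q <- u(I) * y(I)
       mulf R5,R2,R3       ; R5 <- u(I) * x(I)
       add  R11,R11,#8
       addf R6,R4,R3       ; R6 <- x(I) + q
       subf R7,R4,R5       ; R7 <- q - u(I) * x(I)
       st   0(R10),R6      ; y(I) <- R6
       add  R10,R10,#8
       st   0(R12),R7      ; x(I) <- R7
       sub  R13,R13,#1
       bne  Loop
       add  R12,R12,#8     ; executes in the branch delay slot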

[Figure: pipeline timing diagram (stages IF, D/L, ALU, M, W) for the 14 instructions of one loop iteration.]

14 cycles / Iteration

Perfect Memory !!!


Generation of Vector Code

        ld.w   #9984,s2
        ld.w   #0,a2
        ld.w   #8,vs

loop :  mov    s2, vl         ; vl <- min(s2,128)
        ld.l   -y(a2),v0      ; v0 <- y(I:I+127)
        ld.l   -u(a2),v1      ; v1 <- u(I:I+127)
        mul.d  v1,v0,v2       ; q(I:I+127) <- u(I:I+127)*y()
        ld.l   -x(a2),v3      ; v3 <- x(I:I+127)
        add.d  v3,v2,v0       ; v0 <- x(I:I+127) + q(I:I+127)
        st.l   v0,-y(a2)      ; y(I:I+127) <- x(I:I+127) + q( )
        mul.d  v1,v3,v1       ; v1 <- u(I:I+127) * x(I:I+127)
        sub.d  v2,v1,v0       ; v0 <- q( ) - u( ) * x( )
        st.l   v0,-x(a2)      ; x(I:I+127) <- q( ) - u( ) * x( )
        add.w  #1024,a2       ; increment index (128 * 8)
        add.w  #-128,s2       ; 128 iterations less to process
        lt.w   #0,s2
        jbrs.t loop

[Figure: a vector register drawn as elements 0, 1, 2, ..., 127.]

A vector iteration is equivalent to 128 scalar iterations

DLP !!!


Execution of vector code

loop :  mov    s2, vl
        ld.l   -y(a2),v0
        ld.l   -u(a2),v1
        mul.d  v1,v0,v2
        ld.l   -x(a2),v3
        add.d  v3,v2,v0
        st.l   v0,-y(a2)
        mul.d  v1,v3,v1
        sub.d  v2,v1,v0
        st.l   v0,-x(a2)
        add.w  #1024,a2
        add.w  #-128,s2
        lt.w   #0,s2
        jbrs.t loop

5.1 cycles / Iteration

Memory Latency = 24 cycles !!!

14 vector instructions = 1792 scalar instructions

One L/S Port

One Adder, One Multiplier

A vector iteration is equivalent to 128 scalar iterations
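
The strip-mined structure of the vector loop can be sketched in Python. This is only an illustration under stated assumptions: the function names and the list-based "vector registers" are hypothetical, and the sketch models the semantics of the loop, not its timing. With 9984 iterations and a maximum vector length of 128, the loop body runs 9984 / 128 = 78 times. At the quoted rates (14 versus 5.1 cycles per iteration) the vector code is roughly 2.7 times faster, even though the scalar figure assumes perfect memory.

```python
def scalar_loop(x, y, u, n):
    """Reference version: one element per iteration, as in the Fortran loop."""
    for i in range(n):
        q = u[i] * y[i]
        y[i] = x[i] + q
        x[i] = q - u[i] * x[i]


def strip_mined_loop(x, y, u, n, max_vl=128):
    """Process n iterations in strips of at most max_vl elements.

    Each pass of the while loop stands for one vector iteration.
    Returns the number of vector iterations executed.
    """
    strips = 0
    i = 0
    while i < n:
        vl = min(n - i, max_vl)                       # mov s2, vl
        ys = y[i:i + vl]                              # ld.l y
        us = u[i:i + vl]                              # ld.l u
        q = [a * b for a, b in zip(us, ys)]           # mul.d
        xs = x[i:i + vl]                              # ld.l x
        y[i:i + vl] = [a + b for a, b in zip(xs, q)]  # add.d, st.l
        ux = [a * b for a, b in zip(us, xs)]          # mul.d
        x[i:i + vl] = [a - b for a, b in zip(q, ux)]  # sub.d, st.l
        i += vl
        strips += 1
    return strips
```

When n is not a multiple of 128 the last strip is simply shorter; that is the case the `vl <- min(s2,128)` instruction covers.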


Vector Processor

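[Figure: block diagram. Main Memory holds instructions (scalar + vector) and data. The Control Unit receives the instructions and handles conditional branches. Scalar registers operate on scalar data (Ri := Rj op Rk); vector registers operate on vector data (VR[i] := VR[j] op VR[k]).]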


Why Vector ISA ?

• Natural way to express Data-Level Parallelism
– Fewer instructions ( 3 )

• Easy way to convey this information to the hardware

• Good hardware implementation
– Affordable/incremental parallelism ( 2 )
– Simple control/faster clock ( 1 )

• Mechanism to deal with memory latency

• Problem: Memory Bandwidth...


Vector versus Scalar Architectures

[Figure: bar chart of the number of instructions (in millions, axis 0 to 120), R10k versus Convex C3.]

Vector instruction semantics “encode” many different scalar instructions :

- Loop counters

- Branch computations

- Address generation

F. Quintana, R. Espasa and M. Valero “ A case for merging the ILP..” PDP-98

Rate from 140 to 2


Easy to convey information to the hardware

• Data path:
– No pressure at fetch, decode and issue
– Decentralized control
– Faster cycle times

• Vector memory instructions:
– Spatial locality can be made clearly visible to the hardware through “strides”
– No overhead and good prefetching
– Reduction of memory latency overhead
– Memory uses facts, not guesses


Key parameters for vector processors

• Cycle time

• Scalar processor:
– # of registers and FU’s
– Cache

• Vector processor:
– # of vector registers
– # of FU’s and # of pipes/FU

• Connection to memory:
– # of busses and width

• Number of processors


Cray Y-MP Architecture

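[Figure: Cray Y-MP block diagram. Processors P0 to P7 connect through blocks labeled 4*4 and 8*8 to memory modules numbered 0 to 255; a Synchronization block is shared by the processors.]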

tc = 6 ns.

333 Mflops / processor

256 modules. ta = 30 ns.


Vector Processors (1 of 2)

Year  Machine          Tc (ns)  #FPU’s  Flops/cycle  LD/ST path  Words/cycle  #regs  Elements/register
1972  TI-ASC           60       2       4            LS          4(32)        -      -
1973  STAR-100         40       2       2            L,L,S       3            -      -
1975  Cray-1           12.5     2       2            LS          1            8      64
1982  Fujitsu VP 2000  7        2       4            LS,LS       4            8-256  1024-32
1983  Cray-XMP         9.5      2       2            L,L,S       2+1          8      64
1983  Hitachi S810/20  19/14    6??     12??         L,L,L,LS    8 or 2       32     256
1984  NEC-SX2          6        4       16           L,LS        8 or 4       8+8k   256/64-256
1985  Cray-2           4.1      2       2            LS          1            8      64
1987  Hitachi S820/80  4        3       12           L,LS        8 or 4       32     512


Vector Processors (2 of 2 )

Year  Machine          Tc (ns)  #FPU’s  Flops/cycle  LD/ST path  Words/cycle  #regs    Elements/register
1987  Convex C2        40       2       2            LS          1            8        128
1988  Cray Y-MP        6.3      2       2            L,L,S       2+1          8        64
1989  Fujitsu VP 2600  3.2      4       16           LS,LS       8            2048-64  64-2048
1990  NEC SX-3         2.9      4       16           L,L,S       8+4          8+16k    256/64-256
1992  Cray C90         4        2       4            L,L,S       4+2          8        128
1993  Hitachi S-3800   2        2(?)    16(?)        L,L,L,LS    8 or 2       -        -
1994  Convex C4        7.4      2       2            LS          1            8        128
1996  NEC SX-4         8        2       16           LS,LS       16           8+16k    256/64-256
1998  NEC SX-5         4        2       32           LS,LS       32           8+16k    256/64-256


Evolution of Cray Machines

Machine    Year  Tc (MHz)  Mflops/CPU  # CPU's  Memory BW/CPU  Load latency (ns)
Cray-1     1976  80        160         1        640 MB/s       150
Cray-XMP   1982  105       210         2        2.5 GB/s       123
Cray-2     1982  243       486         4 or 8   1.9 GB/s       200
Cray-YMP   1989  167       334         8        4 GB/s         100
Cray-C90   1992  243       970         16       12 GB/s        95
Cray-J90   1995  100       200         32       1.6 GB/s       340
Cray-T90   1994  450       1800        32       21 GB/s        70/116
Cray-SV-1  1998

Courtesy of SGI/Cray

Tc: x6    ILP: x2    # of processors: x32    Total: x400
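
The growth factors quoted above can be approximately reproduced from the table. The sketch below assumes that the comparison is between the Cray-1 and Cray-T90 rows; that pairing is an assumption, as are the variable names.

```python
# Rows taken from the table above (assumed endpoints of the comparison).
cray_1 = {"mhz": 80, "mflops_per_cpu": 160, "cpus": 1}
cray_t90 = {"mhz": 450, "mflops_per_cpu": 1800, "cpus": 32}

# Clock frequency gain (the slide rounds this to x6).
clock_gain = cray_t90["mhz"] / cray_1["mhz"]

# Flops per cycle = Mflops per CPU divided by MHz; the gain is the "ILP" factor.
ilp_gain = (cray_t90["mflops_per_cpu"] / cray_t90["mhz"]) / (
    cray_1["mflops_per_cpu"] / cray_1["mhz"]
)

# Number of processors.
cpu_gain = cray_t90["cpus"] / cray_1["cpus"]

total_gain = clock_gain * ilp_gain * cpu_gain
print(clock_gain, ilp_gain, cpu_gain, total_gain)
```

With the rounded factors the product is 6 * 2 * 32 = 384, which the slide reports as roughly x400.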


Vector Innovations (1 of 2)

• Star-100/Cyber-200 had many of them:
– Gather/scatter
– Masked operations for conditionals

• Cray-1 introduced vector registers

• BSP had instructions for recurrences and multioperand

• Instructions to optimize masked vector operations

• Instructions to handle Index and Bit sequence on mask register

• Flexible addressing of subvector registers (C4)


Vector Innovations ( 2 of 2 )

• Multi-pipes (Star/Cyber)

• Vector with Virtual Memory

• Flexible chaining (multi-ported register-file)

• Multilevel register-file (NEC)

• Scalar units sharing vector FU’s (Fujitsu)

• Combined vector and scalar instructions (Titan)

• Short vectors (CS-2 and CM-5)

• Scalar processor: LIW( Fujitsu), SS(NEC)


Automatic vectorization

• Compiler technology for vectorization: over 25 years of development
– Dependence analysis
– Elimination of false dependences
– Strip mining
– Loop interchange
– Partial vectorization
– Idiom recognition
– IF conversion
– Vector parallelization
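
As an illustration of one item in this list, IF conversion replaces a branch inside the loop body with a mask that guards the operation, so the loop becomes a straight sequence of (masked) vector operations. The sketch below is a minimal Python model; the function names and the guarded division are hypothetical examples, not taken from the talk.

```python
def scalar_with_branch(a, b):
    """Original loop: the division is control-dependent on a test."""
    out = list(a)
    for i in range(len(a)):
        if b[i] != 0:
            out[i] = a[i] / b[i]
    return out


def if_converted(a, b):
    """IF-converted loop: a vector compare builds a mask, then a masked
    operation updates only the elements whose mask bit is set."""
    mask = [bi != 0 for bi in b]                      # vector compare
    return [ai / bi if m else ai                      # masked divide
            for ai, bi, m in zip(a, b, mask)]
```

The control dependence has become a data dependence on `mask`, which is what allows the loop to be vectorized.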


Vector Architectures : Present

• New announcements (NEC, Cray, Fujitsu)

• The decline of vector processors

• Cross-pollination of Vector/Superscalar/VLIW processors


NEC SX-5

• Announced on June 5th, 1998

• 8 Gflops, CMOS, tc = 4 ns

• Superscalar processor at 500 Mflops

• 32 results/cycle (2 FPU, 16-pipe)

• 32 data memory accesses/cycle (2 ports,16 data/port). Memory bandwidth of 64 GB/s

• System composed of 32 nodes of 128 Gflops each, providing 4 Tflop/s


Cray SV1

• Announced on June 16th, 1998

• CMOS, 250 Mhz and 4 Gigaflop/proc.

• Vector cache memory

• 2 FU’s of 8 operations/cycle

• “Multi-Streaming” Processor

• Scalable vector architecture (32 nodes of 32 processors…4 Teraflops)

• Future processor enhancements !!!


Fujitsu VP5000

• Announced on April 20th, 1999

• 9.2 Gflop/s, CMOS, 0.22 micron, 33 M transistors/chip

• Linpack 1000*1000 gives 8758 Mflop/s

• Crossbar provides 2*1.6 GB/s per processor

• System composed of 512 PE’s, or 4.9 Teraflops

• Maximum of 16 GB/PE or 8 TB/512 PE’s


The decline of vector processors

• Why have vector machines declined so fast in popularity?
– Cost (scalar parallel machines use commodity parts)
– Too restricted in applications (lack of vectorization in many programs)

• Massive use of computers to run so-called “Non-numerical Applications”


Characteristics of non-numerical Applications

• Examples: OLTP,DSS, simulators, games…

• General data structures: Lists, trees, tables…

• Data types: Scalar integers of 8 to 64 bits

• Frequent control flow change…Speculation

• Short distance data dependencies... Forwarding

• Instruction/data locality……Caches

• Fine-grain ILP……..Out-of-order


Micro Killers ???

Year  Machine      Tc (Mhz)  #op/cycle  Peak Perf. (Mflops)
1976  Cray-1       80        2          160
1978  I-8086       10        -          -
1992  Cray C-90    243       4          970
1992  Alpha 21064  150       1          150
1994  Pentium      100       1          100
1996  NEC SX-4     125       16         2000
1997  IBM P2SC     160       4*         640
1997  Alpha 21164  500       2          1000
1998  HP PA8200    240       4*         960
1998  NEC SX-5     250       32         8000
1998  Pentium      400       1          400

Peak performance = Tc * ILP
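
The formula can be checked against a few rows of the table. In this sketch Tc is read as the clock frequency in MHz and ILP as the number of operations per cycle, so the product is in Mflops; the dictionary and function names are illustrative.

```python
# (clock in MHz, operations per cycle), copied from the table above.
MACHINES = {
    "Cray-1": (80, 2),
    "Cray C-90": (243, 4),
    "NEC SX-4": (125, 16),
    "Alpha 21164": (500, 2),
    "NEC SX-5": (250, 32),
}


def peak_mflops(clock_mhz, ops_per_cycle):
    """Peak performance in Mflops = clock frequency (MHz) * operations/cycle."""
    return clock_mhz * ops_per_cycle


for name, (mhz, ops) in MACHINES.items():
    print(f"{name}: {peak_mflops(mhz, ops)} Mflops")
```

For the Cray C-90 the product is 972 Mflops, which the table rounds to 970.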


Bandwidth and Performance

Machine         Clock    Functional units  Peak
Alpha 21264     500 Mhz  2 FPU             1 Gflops
IBM Power chip  160 Mhz  2 (2 pipe)        640 Mflops
HP-8200         240 Mhz  2 (2 pipe)        960 Mflops
Cray T90        450 Mhz  2 (2 pipe)        1.8 Gflops
NEC SX-4        125 Mhz  2 (8 pipes)       2 Gflops

[Figure: memory hierarchy of each machine, with levels labeled Main memory, L2 cache size, L1 cache size, Register file size and Functional Units. Size and bandwidth labels, listed in extraction order: 2 GB/s, 2 GB/s, 24 GB/s, 16 GB/s; 16 MB; 5 Gb/s, 768 MB/s; 64 KB, 128 KB, 2 MB; 8 GB/s, 3.84 GB/s, 24 GB/s, 16 GB/s; 576 bytes, 704 bytes, 8 KB, 128 KB; 16 GB/s, 5.12 GB/s, 15.3 GB/s, 43.2 GB/s, 48 GB/s.]


Peak performance and Bandwidth

[Figure: efficiency (%) versus vector length (0 to 4000) for the IBM RS6000* and the VPP500, running the kernel below.]

Z(I)=C0+A(I)*(C1+B(I)*(C2+C(I)*(C3+D(I)*(C4+E(I)*(C5+F(I)*(C6+G(I)*(C7+H(I)*(C8+K(I)*(C9+L(I))))))))))

* Measurement condition: RS6000-590 (66.6 MHz), FORTRAN77 -O3 -qarch=pwr2 -qtune=pwr2

Courtesy of Fujitsu


Vector ideas used in SS’s/VLIW processors

• Address prediction and Prefetching
• Exploitation of data locality (the stride value is used for locality detection and exploitation)
• Predicate execution (VLIW)
• Multiply and add, chaining
• Multi-size operands
• Data reuse and vectorization
• Addressing modes (auto-increment)
• Multithreading (2 scalar processors in Fujitsu machines)
• Dynamic load/store elimination


Predictions for ALL instructions

[Figure: bar chart (axis 0 to 100) for four predictors: Last value, Stride, Context 1, Context 3.]

Y. Sazeides and J.E. Smith “The predictability of data values” MICRO-30, 1997
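
To make the first two predictor names concrete, here is a minimal Python sketch of a last-value predictor and a simple stride predictor applied to a sequence of values. It is a simplified illustration under assumed definitions (one sequence, no tables, no confidence counters), not the predictors evaluated in the cited paper.

```python
def last_value_hits(values):
    """Count correct predictions when predicting 'same as the previous value'."""
    hits = 0
    for prev, cur in zip(values, values[1:]):
        if prev == cur:
            hits += 1
    return hits


def stride_hits(values):
    """Count correct predictions when predicting 'previous value + last stride'.

    The stride is the difference between the two most recent values, so the
    first prediction is made for the third element.
    """
    hits = 0
    for k in range(2, len(values)):
        stride = values[k - 1] - values[k - 2]
        if values[k - 1] + stride == values[k]:
            hits += 1
    return hits
```

A sequence of addresses produced by a strided vector access, such as 0, 8, 16, 24, 32, defeats the last-value predictor but is predicted correctly by the stride predictor once the stride is known. This is the kind of regularity that vector instructions state explicitly.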


Characterization of Vector Programs

[Figure: bar chart (axis 0 to 100) showing % vector access, % vectorization and Avg. VL.]

R. Espasa “Advanced Vector Architectures”. PhD Thesis, Feb. 97


SS’s ideas usable in vector processors

• Decoupled Vector Architectures

• Multithreaded Vector Architectures

• Out-of-order Vector Architectures

• Simultaneous Multithreaded Vector Architecture

• Victim Register File

R. Espasa, M. Valero and J.E. Smith HPCA96, HPCA97, MICRO97, ICS97...


ILP+DLP: Out-of-order Vector

[Figure: out-of-order vector pipeline. Blocks: Fetch, Decode & Rename, S registers, A registers, V registers, LD/ST, Reorder Buffer, Memory.]

R. Espasa, M. Valero, J.E. Smith “Out-of-order Vector Architecture” MICRO30, 1997.


OOO Vector Performance

R. Espasa, M. Valero, J.E. Smith “Out-of-order Vector Architecture” MICRO30, 1997.


Vector Processors : The Future

• Very high-performance architectures

• Vector Microprocessors
– Numerical Accelerators
– Multimedia Applications


Architectures for a Billion Transistors

• Advanced/Superspeculative Architectures

• Trace Processors

• Simultaneous Multithreading

• Multiprocessor on a chip

• RAW processors

• IRAM

Billion -Transistor Architectures. IEEE Computer Sept. 1997


SMV

• Simultaneous Multithreaded Vector Arch.

• Mixes three paradigms
– DLP: vector unit
– ILP: O-o-O execution
– TLP: multithreaded fetch unit

• Requires a memory system with
– high performance at low cost
– low pin-count

R. Espasa and M. Valero “Exploiting Instruction and Data-Level Parallelism” IEEE MICRO, Sep. 1997


Billion Trans. Vector Architecture


[Figure: block diagram in three parts: Instruction Issue, Execution Pipeline and Memory. Instruction fetch from the I cache uses 8 program counters (one/thread) and a Thread ID; instruction decode uses 8 rename tables (one/thread). Instruction slots (I, F, V) feed a Float point queue (64), an Integer queue (64) and two Memory queues (64). Register files: FPRF (128 reg), IRF (128 reg) and Vector Register File (128 reg). Functional units: FPU 1, FPU 2, ALU 1, ALU 2, two @ gen units and VFU 1 to VFU 4. A Reorder Buffer and the connections to Memory (K data) complete the design.]


SMV Performance



V-IRAM1

[Figure: V-IRAM1 block diagram. A 2-way Superscalar processor with an 8K I cache and an 8K D cache feeds a Vector Instruction Queue. The vector unit contains Vector Registers, arithmetic units (+, x, ÷) and Load/Store units, each organized as 4 x 64, 8 x 32 or 16 x 16. A Memory Crossbar Switch connects the Load/Store units to the banks of memory modules (M). I/O and Serial I/O blocks are attached.]

D.A. Patterson “ New directions in Computer Architecture” Berkeley, June 1998

0.18 µm, 200 MHz, 1.6GFLOPS(64b)/6.4GOPS(16b)/32MB


Conflict-free access to vectors

[Figure: processors P1, P2, P3, ..., Pn connected through an Interconnection Network to Memory Modules grouped into Sections; the arrangement is drawn twice.]

Idea: Out-of-order access

M. Valero et al. ISCA 92, ISCA 95, IEEE-TC 95, ICS 92, ICS 94,...


Command Memory System

[Figure: processors P1, P2, P3, ..., Pn send Commands through an Interconnection Network to the Sections Controller, which drives the Memory Modules.]

Command = <@, Length, Stride, size>

Break commands into bursts at the section controller

J. Corbal, R. Espasa and M. Valero “ Command-Vector Memory System” PACT98
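
A command of the form <@, Length, Stride, size> describes a whole vector access in one request. The Python sketch below shows one way such a command could be expanded into per-section bursts. Everything beyond the four fields is an assumption made for illustration: the stride is taken to be in bytes, memory is taken to be interleaved across sections in fixed-size blocks, and the names `Command`, `element_addresses` and `bursts_by_section` are hypothetical. It is not the mechanism described in the cited paper.

```python
from typing import Dict, List, NamedTuple


class Command(NamedTuple):
    base: int     # "@": address of the first element, in bytes
    length: int   # number of elements
    stride: int   # distance between consecutive elements, in bytes (assumed)
    size: int     # element size in bytes


def element_addresses(cmd: Command) -> List[int]:
    """Expand a command into the address of each element."""
    return [cmd.base + i * cmd.stride for i in range(cmd.length)]


def bursts_by_section(cmd: Command, num_sections: int = 4,
                      block_bytes: int = 64) -> Dict[int, List[int]]:
    """Group the element addresses by the section that owns them.

    Assumes block interleaving: block k of the address space belongs to
    section k mod num_sections.
    """
    bursts: Dict[int, List[int]] = {}
    for addr in element_addresses(cmd):
        section = (addr // block_bytes) % num_sections
        bursts.setdefault(section, []).append(addr)
    return bursts
```

For example, `Command(base=0, length=16, stride=8, size=8)` touches 128 consecutive bytes, which under these assumptions become one 8-element burst in section 0 and one in section 1.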


System configuration in 2009

[Figure: projected system. Each node has 32 chips of 200 GF each (6.4 TFLOPS) connected through an X-Bar to 5 TB of memory. 32 SMP (cc-NUMA) nodes are connected by an X-Bar, giving 200 TFLOPS / 160 TB. Bandwidth labels: 100 GB/sec and 800 GB/sec.]

Sustained: Scalar 250 GFLOPS? Vector 1 TFLOPS?

T. Watanabe SC98, Orlando.


Vector Microprocessors

• Ways of reducing the design impact
– Short Vectors (64 x 16 words = 8 Kbytes)
– Vector Functional units shared with INT/FP units
– Vector Register renaming to allow precise exceptions

• Cache hierarchy tuned to vector execution
– Vector data locality allows large data transactions

• Very large bandwidth between cache and vector registers

• High performance for numerical and multimedia applications


General Architecture

[Figure: block diagram with I-Cache, Fetch, Decode, FP and INT units, a VRF, a Vector Cache and a Rambus Controller connected to four RDRAM devices. Width labels: 1024 and 8.]


Vector PC Vs SuperScalar

[Figure: bar chart (axis 0 to 25) for Hydro2D, Dyfesm, Swm256 and Tomcatv. Legend as extracted: OoO-SS 1x2VEC 16 1x2VEC 16 16x32.]


Cache Hierarchy

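• Where should the Vector Cache be placed?

[Figure: two alternative hierarchies between DIRECT RAMBUS memory and the CPU, with the vector cache (VC) placed at the L1 level or at the L2 level.]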


Performance of the cache hierarchies

[Figure: three panels (BDNA, FLO52, HYDRO2D) with x-axis values 2, 8, 16 and 32 and y-axes labeled EIPC and FLOPS/CYCLE. Series: VECTOR CACHE on L1, VECTOR CACHE on L2, PERFECT CACHE.]


Importance of media Applications

“In the next five years (1998-2002), we believe that media processing will become the dominant force in computer architecture” (K. Diefendorff and P. K. Dubey in IEEE Computer, Sep. 97, pp. 43-45)

“90% of Desktop Cycles will Be Spent on Media Applications by 2000” ( Scott Kirkpatrick of IBM )


Characteristics of media Applications

• Examples: Image/speech processing, communications, virtual reality, graphics…

• Data structures: matrices and vectors

• Data types: Integer(8 -32 bits), FP (32- 64)

• Demand for high memory bandwidth

• Low data locality and latency problem

• No critical data-dependences

• Real time necessity

• Fine/coarse grain parallelism


Multimedia Applications and Architectures

[Figure: Scientific Applications and Multimedia applications mapped onto three architectures.]

• Superscalar + MMX: Re-discover the parallelism at run-time using a lot of hardware

• VLIW: Simple hardware, but loss of parallelism. As many instructions as SS approach

• Vector Architectures: Natural way to express and execute DLP applications


MMX-like processors

• Multimedia extensions are designed to exploit the parallelism inherent in multimedia applications

• Targeted to leverage full compatibility with existing operating systems and applications, plus minimum chip area investment.

• The highlights of multimedia extensions are:

• Single Instruction, Multiple Data (SIMD) techniques

• New data types (Multimedia Vectors, 32/64 bits)

• Multimedia registers

• SIMD-like instructions, over small integer data types


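MMX instruction example

• PADDW: Parallel ADD of 4x16-bit data type with Wrap Around (No Saturation)

[Figure: two 64-bit registers (bits 0 to 63), each holding four 16-bit elements, are added lane by lane. Three lanes produce A1+B1, A2+B2 and A3+B3; in the remaining lane xFFFF + x0006 wraps around to x0005.]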
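
The wrap-around behavior of PADDW can be modeled in a few lines of Python. The sketch treats a 64-bit MMX register as a list of four 16-bit lanes; the function names mirror the instruction mnemonics, but the list representation is only an illustrative model. PADDUSW, the unsigned saturating variant, is included for contrast.

```python
MASK16 = 0xFFFF


def paddw(a, b):
    """Lane-wise 16-bit add with wrap-around (carry out of a lane is dropped)."""
    return [(x + y) & MASK16 for x, y in zip(a, b)]


def paddusw(a, b):
    """Lane-wise 16-bit unsigned add with saturation at 0xFFFF."""
    return [min(x + y, MASK16) for x, y in zip(a, b)]


a = [1, 2, 3, 0xFFFF]
b = [10, 20, 30, 0x0006]
print([hex(v) for v in paddw(a, b)])    # last lane wraps to 0x5
print([hex(v) for v in paddusw(a, b)])  # last lane saturates at 0xffff
```

Lanes never interact: the overflow in the last lane does not disturb its neighbours, which is what distinguishes a packed add from a single 64-bit add.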


Superscalar Multimedia Processors

                 PowerPC Altivec  Intel MMX  Sun VIS  MIPS V/MDMX  HP MAX2  Alpha MVI
Register File    32*128           8*64       32*64    32*64        32*64    32*64
Mapped Onto      Separate         FP         FP       FP           Integer  Integer
Integer Support  8/16/32          8/16/32    8/16/32  8/16 bit     16/32    8 bit
FP Support       Yes              MMX2       No       MIPS V/      No       No
Usual stuff+     Lots             Lots       Lots     Lots         Some     None
Multiply/MAC     Lots             Mult       Mult     Lots         Some     None
Min/Max/Avg      Yes              No         No       Min/Max      Avg      Min/Max
Pack/Unpack      Yes              Yes        Yes      Yes          Yes      Yes
Byte Reordering  All              Some       Some     Many         All      None
Unaligned Data   3 Inst.          No         2 Inst.  Yes          No       No
Announced        2Q98             2Q96       4Q94     4Q96         4Q95     4Q96

Microprocessor Report Vol 12, N 6, May 11, 1998





Multimedia Embedded Systems

• NEC V830R/AV includes MIX2, a multimedia instruction extension (SIMD, MMX-like approach)

• Hitachi SH4 includes FP 4-length vector instructions, targeted at geometry transformation in 3D rendering applications

• ARM10 Thumb Family processors will include a Vector FP unit capable of delivering 600 MFLOPS


Widen is better…(?)

• Most multimedia algorithms exhibit vectors no longer than 8/16 elements => widening the multimedia registers could provide diminishing returns.

[Figure: three register widths compared. A single 16-bit add (A1 + B1 = C1, bits 0 to 15); a 64-bit packed add of four 16-bit elements (A1..A4 + B1..B4 = C1..C4, bits 0 to 63); and a 128-bit packed add of eight 16-bit elements (A1..A8 + B1..B8 = C1..C8, bits 0 to 127).]


VLIW : Widening vs Replication

[Figure: four bus configurations between Memory and the Register File: one 1-word bus, two 1-word buses, one 2-word bus, and two 2-word buses.]

D. López et al. “Increasing Memory Bandwidth with Wide Busses” ICS-97


Widening and Replication Performance

[Figure: chart with y-axis from 1 to 8 and x-axis values 2, 4, 8 and 16. Series: Wide 1, Wide 2, Wide 4.]

D. López et al. “Widening versus replicating...” ICS98, MICRO98







Torrent T0 Microprocessor

• The first single-chip vector microprocessor.

• Can sustain over 24 operations per cycle while having an issue rate of only one 32-bit instruction per cycle

• Features:
– 16 vector registers (32 32-bit elements each)
– 2 Vector arithmetic units (8 pipes each)
– Reconfigurable composite operation pipelines
– 128-bit wide, external memory interface
– MIPS-II, 32-bit instruction set, scalar unit.

K. Asanovic et al. “The T0 vector microprocessor”. Hot Chips VII, 1995


Torrent T0 Microprocessor

K. Asanovic et al. “ The T0 vector microprocessor “. Hot Chips VII, 1995


Vector versus Superscalar Processors

• Comparison of Die Area

– Processor Die Area (in mm2 scaled to 0.25

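[Figure: stacked bar chart (axis 0 to 250) split into Control, Registers and Datapath, for Torrent-0, Alpha 21164, UltraSPARC II, MIPS R10000, HP PA-8000, Alpha 21264 and 6-way OoO, Rob128. Data labels, in extraction order: 14.73, 21.86, 37.77, 66.92, 67.77, 69.81, 250.0.]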

C. G. Lee and D. J. DeVries “ Initial Results on … “. MICRO-30, 1997.


Vector versus Superscalar Processors

• Component Percentages

[Figure: stacked bar chart (0 to 100 %) of the Datapath, Registers and Control shares for Torrent-0, Alpha 21164, UltraSPARC II, MIPS R10000, HP PA-8000, Alpha 21264 and 6-way OoO, Rob128.]

C. G. Lee and D. J. DeVries “Initial Results on …”. MICRO-30, 1997.


Imagine project

• Focused on developing a programmable architecture that achieves performance similar to special purpose hardware on graphics and image processing.

• Matches media applications demands to the current VLSI capabilities by using a stream-based programming model.

• Most multimedia kernels exhibit a streaming nature.

• Individual stream elements can be operated on in parallel, thus exploiting data parallelism.

Bill Dally “ Tomorrow Computing Engines”Keynote HPCA98


Imagine architecture

• Organized around a large stream register file (64Kb)
• Memory operations move entire streams of data
• Data streams pass through a set of arithmetic clusters (8)
• Each cluster unit operates on a single element under VLIW control

[Figure: four SDRAM devices feed a Streaming Memory System, which feeds the Stream Register File. The Stream Register File exchanges streams with CLUSTER 0, CLUSTER 1, ..., CLUSTER 7, all driven by a Controller.]

Bill Dally “Tomorrow Computing Engines” Keynote HPCA98


Matrix extensions for Multimedia

• By combining the conventional vector approach with SIMD MMX-like instructions, we can exploit additional levels of DLP with matrix-oriented multimedia extensions.

[Figure: three forms of addition. Scalar: A1 + B1 = C1. MMX-like: one 64-bit register of four 16-bit elements, A1..A4 + B1..B4 = C1..C4. Matrix: four such 64-bit registers at once, A1..A16 + B1..B16 = C1..C16.]
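
The difference between an MMX-like instruction and a matrix-oriented one can be sketched in Python. The model below is an assumption made for illustration: a packed register is a list of four 16-bit lanes, a matrix register is a list of such rows, and the names `packed_add` and `matrix_add` are hypothetical rather than real instruction mnemonics.

```python
MASK16 = 0xFFFF


def packed_add(a, b):
    """One MMX-like instruction: four 16-bit lanes added with wrap-around."""
    return [(x + y) & MASK16 for x, y in zip(a, b)]


def matrix_add(A, B, vl):
    """One matrix-style instruction: the packed add is applied to the first
    vl rows of A and B, as a vector instruction would apply it to vl elements."""
    return [packed_add(ra, rb) for ra, rb in zip(A[:vl], B[:vl])]


# A 4 x 4 block of 16-bit values, as in the figure (A1..A16, B1..B16).
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[100, 200, 300, 400]] * 4

# MMX-like code needs one packed instruction per row ...
rows = [packed_add(ra, rb) for ra, rb in zip(A, B)]
# ... while the matrix form expresses the whole block in a single operation.
block = matrix_add(A, B, vl=4)
print(block == rows)
```

The results are identical; what changes is the number of instructions needed to express them, which is the same argument made earlier for vector versus scalar code.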


Relative Performance

[Figure: relative performance of MMX, MDMX and MOM for way 1, way 2, way 4 and way 8 configurations, in three panels: INVERSE DCT TRANSFORM, MPEG-2 MOTION ESTIMATION and RGB-YCC Color CONVERSION.]


Applications and Architectures

[Figure: Numerical Applications on three architectures. Integer + Subroutines: Very Slow. Integer + FPU: Very Big Improvement !!! Integer + FPU + VFPU: Additional Speed.]


Future Applications

• Integer SPEC-like
• Commercial (OLTP, DSS)
• Numerical
• Multimedia

[Figure: diagram with the labels Integer, Commercial, Numerical and Multimedia.]


Acknowledgments

• Roger Espasa
• James E. Smith
• Luis A. Villa
• Francisca Quintana
• Jesús Corbal
• David López
• Josep Llosa
• Eduard Ayguade
• Krste Asanovic
• William Dally
• Christoforos E. Kozyrakis
• Corinna G. Lee
• David A. Patterson
• Steve Wallace


The End
