The Future of Vector Processors

M. Valero, R. Espasa and J. Corbal

UPC, Barcelona

Kyoto, May 28th, 1999

TOP-500 and Vector Processors

[Figure: bar chart, November 98; series: # Systems and % Peak Perf.; value labels: 310, 96, 4315, 65]

Fujitsu: 27
NEC: 18
SGI: 15
Hitachi: 5

The Future of Vector ISAs

• Cross-Pollination of Vector/Superscalar/VLIW
  – MMX, Embedded...
• Very-high Performance Architectures
  – ILP techniques, IRAM, SDRAM
• Vector Microprocessors
  – Numerical Accelerators
  – Multimedia Applications

Talk Outline

• The Past :
  – Initial Motivation for Vector ISA
  – Evolution of Vector Processors
• The Present :
  – Recent Announcements
  – The Decline of Vector Processors
  – Cross-Pollination of Vector/Superscalars/VLIW
• The Future :
  – Very-high Performance Architectures
  – Vector Microprocessors
    – Numerical Accelerators
    – Multimedia Applications
• Conclusions

Characteristics of Numerical Applications

• Examples: Weather prediction, mechanical engineering

• Data structures: Huge matrices (dense, sparse)

• Data types: 64 bits, floating point

• Highly repetitive loops

• Compute-intensive

• Data-Level Parallel

Initial Motivations for Vector Processors

      real*8 x(9992), y(9992), u(9984)
      subroutine loop
      integer I
      real*8 q
      do I=1,9984
        q = u(I) * y(I)
        y(I) = x(I) + q
        x(I) = q - u(I) * x(I)
      enddo
      end

[Figure: Dependence Graph of the loop body, for I=1 to 9984: inputs x(I), y(I), u(I); two multiplications, one addition and one subtraction; intermediate value q]

Execution of scalar code

Loop : ld   R1,0(R10)
       ld   R2,0(R11)
       ld   R3,0(R12)
       mulf R4,R1,R2
       mulf R5,R2,R3
       add  R11,R11,#8
       addf R6,R4,R3
       subf R7,R4,R5
       st   0(R12),R7
       add  R10,R10,#8
       st   0(R12),R7
       sub  R13,R13,#1
       bne  Loop
       add  R12,R12,#8

[Figure: pipeline diagram with stages IF, D/L, ALU, M, W; one row per instruction]

14 cycles / Iteration

Perfect Memory !!!

Generation of Vector Code

       ld.w   #9984,s2
       ld.w   #0,a2
       ld.w   #8,vs
Loop : mov    s2, vl       ; vl <- min(s2,128)
       ld.l   -y(a2),v0    ; v0 <- y(I:I+127)
       ld.l   -u(a2),v1    ; v1 <- u(I:I+127)
       mul.d  v1,v0,v2     ; q(I:I+127) <- u(I:I+127)*y()
       ld.l   -x(a2),v3    ; v3 <- x(I:I+127)
       add.d  v3,v2,v0     ; v0 <- x(I:I+127) + q(I:I+127)
       st.l   v0,-y(a2)    ; y(I:I+127) <- x(I:I+127) + q( )
       mul.d  v1,v3,v1     ; v1 <- u(I:I+127) * x(I:I+127)
       sub.d  v2,v1,v0     ; v0 <- q( ) - u( ) * x( )
       st.l   v0,-x(a2)    ; x(I:I+127) <- q( ) - u( ) * x( )
       add.w  #1024,a2     ; increment index (128 * 8)
       add.w  #-128,s2     ; 128 iterations less to process
       lt.w   #0,s2
       jbrs.t loop

[Figure: a vector register holding elements 0, 1, 2, ..., 127]

A vector iteration is equivalent to 128 scalar iterations

DLP !!!
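The strip-mined vector loop above can be mimicked in Python to check that it computes the same values as the scalar Fortran loop. This is only a sketch: the names `scalar_loop`, `strip_mined_loop` and `MAX_VL` are hypothetical, Python lists stand in for vector registers, and each list comprehension models one vector instruction working on up to 128 elements.

```python
MAX_VL = 128  # maximum vector length, as in "vl <- min(s2,128)"


def scalar_loop(x, y, u):
    """Reference semantics of the Fortran loop, one element per iteration."""
    x, y = list(x), list(y)
    for i in range(len(u)):
        q = u[i] * y[i]
        y[i] = x[i] + q
        x[i] = q - u[i] * x[i]
    return x, y


def strip_mined_loop(x, y, u):
    """Same computation, processed in strips of at most MAX_VL elements."""
    x, y = list(x), list(y)
    n = len(u)
    start = 0
    while start < n:
        vl = min(n - start, MAX_VL)                  # mov s2, vl
        s = slice(start, start + vl)
        v_y, v_u, v_x = y[s], u[s], x[s]             # three vector loads
        v_q = [a * b for a, b in zip(v_u, v_y)]      # mul.d: q = u * y
        y[s] = [a + b for a, b in zip(v_x, v_q)]     # add.d + st.l: y = x + q
        v_ux = [a * b for a, b in zip(v_u, v_x)]     # mul.d: u * x
        x[s] = [a - b for a, b in zip(v_q, v_ux)]    # sub.d + st.l: x = q - u * x
        start += vl                                  # advance index and count
    return x, y
```

With 9984 iterations and strips of 128 elements, the `while` loop runs exactly 78 times, with no partial strip at the end.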

Execution of vector code

Loop : mov    s2, vl
       ld.l   -y(a2),v0
       ld.l   -u(a2),v1
       mul.d  v1,v0,v2
       ld.l   -x(a2),v3
       add.d  v3,v2,v0
       st.l   v0,-y(a2)
       mul.d  v1,v3,v1
       sub.d  v2,v1,v0
       st.l   v0,-x(a2)
       add.w  #1024,a2
       add.w  #-128,s2
       lt.w   #0,s2
       jbrs.t loop

5.1 cycles / Iteration

Memory Latency = 24 cycles !!!

14 vector instructions = 1792 scalar instructions

One L/S Port

One Adder, One Multiplier

A vector iteration is equivalent to 128 scalar iterations
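The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes that "5.1 cycles / Iteration" is measured per element (per equivalent scalar iteration); the memory-bound estimate only counts the three vector loads and two vector stores of the loop against the single load/store port, and all variable names are hypothetical.

```python
VECTOR_LENGTH = 128                  # elements per vector instruction
INSTRUCTIONS_PER_ITERATION = 14      # instructions in the vector loop body
TRIP_COUNT = 9984                    # iterations of the original Fortran loop
SCALAR_CYCLES_PER_ITERATION = 14.0   # scalar pipeline, perfect memory
VECTOR_CYCLES_PER_ITERATION = 5.1    # vector machine, memory latency of 24 cycles
MEMORY_OPS_PER_ITERATION = 5         # 3 x ld.l + 2 x st.l in the loop body
LOAD_STORE_PORTS = 1                 # "One L/S Port"

# 14 vector-loop instructions cover the work of 14 * 128 scalar instructions.
equivalent_scalar_instructions = INSTRUCTIONS_PER_ITERATION * VECTOR_LENGTH

# Number of times the vector loop body executes.
vector_iterations = TRIP_COUNT // VECTOR_LENGTH

# Lower bound on cycles per element imposed by the single load/store port.
memory_bound_cycles = MEMORY_OPS_PER_ITERATION / LOAD_STORE_PORTS

# Cycle-count ratio between the two slides (the scalar figure assumes perfect memory).
speedup = SCALAR_CYCLES_PER_ITERATION / VECTOR_CYCLES_PER_ITERATION

print(equivalent_scalar_instructions)  # 1792
print(vector_iterations)               # 78
print(memory_bound_cycles)             # 5.0
print(round(speedup, 2))               # 2.75
```

Under these assumptions the measured 5.1 cycles sits just above the 5-cycle bound set by the load/store port, which is consistent with the slide's closing remark that memory bandwidth is the problem.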

Vector Processor

[Figure: block diagram. Main Memory holds instructions (scalar + vector) and data. The Control Unit receives the instructions and handles conditional branches. Scalar data goes to the scalar registers (Ri := Rj op Rk) and vector data to the vector registers (VR[i] := VR[j] op VR[k])]

Why Vector ISA ?

• Natural way to express Data-Level Parallelism
  – Fewer instructions ( 3 )
• Easy way to convey this information to the hardware
• Good hardware implementation
  – Affordable/ incremental parallelism ( 2 )
  – Simple control/ faster clock ( 1 )
• Mechanism to deal with memory latency
• Problem : Memory Bandwidth...

Vector versus Scalar Architectures

[Figure: bar chart of the number of instructions (in millions), R10k versus Convex C3; y axis from 0 to 120]

Vector instruction semantics “encode” many different scalar instructions :

- Loop counters

- Branch computations

- Address generation

F. Quintana, R. Espasa and M. Valero “ A case for merging the ILP..” PDP-98

Rate from 140 to 2

Easy to convey information to the hardware

• Data path :
  – No pressure at fetch, decode and issue
  – Decentralized control
  – Faster cycle times
• Vector memory instructions :
  – Spatial locality can be made clearly visible to the hardware through “strides”
  – No overhead and good prefetching
  – Reduction of memory latency overhead
  – Memory uses facts, not guesses
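To illustrate why a stride gives the memory system "facts, not guesses", the sketch below expands a single strided vector access into the full list of addresses it will touch. The function name `vector_addresses` and the 8-byte element size are assumptions made for the example.

```python
def vector_addresses(base, stride_bytes, length):
    """Addresses touched by one vector memory instruction <base, stride, length>."""
    return [base + i * stride_bytes for i in range(length)]


WORD = 8  # bytes per 64-bit element (assumed)

# Unit stride: four consecutive 64-bit words.
row = vector_addresses(base=0, stride_bytes=WORD, length=4)

# Column 1 of a 4 x 4 row-major matrix of 64-bit words: stride of one full row.
column = vector_addresses(base=1 * WORD, stride_bytes=4 * WORD, length=4)

print(row)     # [0, 8, 16, 24]
print(column)  # [8, 40, 72, 104]
```

Because the whole address sequence is known when the instruction issues, the memory system can start every access early instead of predicting the next address from past behaviour.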

Key parameters for vector processors

• Cycle time
• Scalar processor:
  – # of registers and FU’s
  – Cache
• Vector processor
  – # of vector registers
  – # of FU’s and # of pipes/ FU
• Connection to memory:
  – # of busses and width
• Number of processors

Cray Y-MP Architecture

[Figure: processors P0, P1, ..., P7 connected through 4*4 and 8*8 switches to memory modules numbered 0 to 255, with a shared synchronization block]

tc = 6 ns.

333 Mflops / processor

256 modules. ta = 30 ns.

Vector Processors (1 of 2)

Year  Machine          Tc (ns)  #FPU’s  Flops/cycle  LD/ST path  words/cycle  #regs   Elements/register
1972  TI-ASC           60       2       4            LS          4(32)        -       -
1973  STAR-100         40       2       2            L,L,S       3            -       -
1975  Cray-1           12.5     2       2            LS          1            8       64
1982  Fujitsu VP 2000  7        2       4            LS,LS       4            8-256   1024-32
1983  Cray-XMP         9.5      2       2            L,L,S       2+1          8       64
1983  Hitachi S810/20  19/14    6??     12??         L,L,L,LS    8 or 2       32      256
1984  NEC-SX2          6        4       16           L,LS        8 or 4       8+8k    256/64-256
1985  Cray-2           4.1      2       2            LS          1            8       64
1987  Hitachi S820/80  4        3       12           L,LS        8 or 4       32      512

Vector Processors (2 of 2)

Year  Machine          Tc (ns)  #FPU’s  Flops/cycle  LD/ST path  words/cycle  #regs    Elements/register
1987  Convex C2        40       2       2            LS          1            8        128
1988  Cray Y-MP        6.3      2       2            L,L,S       2+1          8        64
1989  Fujitsu VP 2600  3.2      4       16           LS,LS       8            2048-64  64-2048
1990  NEC SX-3         2.9      4       16           L,L,S       8+4          8+16k    256/64-256
1992  Cray C90         4        2       4            L,L,S       4+2          8        128
1993  Hitachi S-3800   2        2(?)    16(?)        L,L,L,LS    8 or 2       -        -
1994  Convex C4        7.4      2       2            LS          1            8        128
1996  NEC SX-4         8        2       16           LS,LS       16           8+16k    256/64-256
1998  NEC SX-5         4        2       32           LS,LS       32           8+16k    256/64-256

Evolution of Cray Machines

Machine    Year  Tc (MHz)  Mflops/CPU  # CPU's  Memory BW/CPU  Load latency (ns)
Cray-1     1976  80        160         1        640 MB/s       150
Cray-XMP   1982  105       210         2        2.5 GB/s       123
Cray-2     1982  243       486         4 or 8   1.9 GB/s       200
Cray-YMP   1989  167       334         8        4 GB/s         100
Cray-C90   1992  243       970         16       12 GB/s        95
Cray-J90   1995  100       200         32       1.6 GB/s       340
Cray-T90   1994  450       1800        32       21 GB/s        70/116
Cray-SV-1  1998

Courtesy of SGI/CRAY

Tc : x6    ILP : x2    # of proc. : x32    Total : x400
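The three improvement factors can be recomputed from the Cray-1 and Cray-T90 rows of the table. This is a rough check with hypothetical variable names; it treats Mflops/CPU divided by the clock rate as the per-cycle parallelism ("ILP").

```python
cray_1 = {"mhz": 80, "mflops_per_cpu": 160, "cpus": 1}
cray_t90 = {"mhz": 450, "mflops_per_cpu": 1800, "cpus": 32}

# Clock-rate improvement.
clock_gain = cray_t90["mhz"] / cray_1["mhz"]

# Floating-point operations per cycle on each machine, and their ratio.
flops_per_cycle_1 = cray_1["mflops_per_cpu"] / cray_1["mhz"]
flops_per_cycle_t90 = cray_t90["mflops_per_cpu"] / cray_t90["mhz"]
ilp_gain = flops_per_cycle_t90 / flops_per_cycle_1

# Growth in the number of processors.
cpu_gain = cray_t90["cpus"] / cray_1["cpus"]

total_gain = clock_gain * ilp_gain * cpu_gain

print(clock_gain)  # 5.625
print(ilp_gain)    # 2.0
print(cpu_gain)    # 32.0
print(total_gain)  # 360.0
```

The exact product is 360; with the clock factor rounded up to 6 it becomes 384, which the slide rounds to x400.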

Vector Innovations (1 of 2)

• Star-100/Cyber-200 had many of them:
  – Gather/scatter
  – Masked operations for conditionals
• Cray-1 introduced vector registers
• BSP had instructions for recurrences and multioperand
• Instructions to optimize masked vector operations
• Instructions to handle Index and Bit sequence on mask register
• Flexible addressing of subvector registers (C4)

Vector Innovations ( 2 of 2 )

• Multi-pipes (Star/Cyber)

• Vector with Virtual Memory

• Flexible chaining (multi-ported register-file)

• Multilevel register-file (NEC)

• Scalar units sharing vector FU’s (Fujitsu)

• Combined vector and scalar instructions (Titan)

• Short vectors (CS-2 and CM-5)

• Scalar processor: LIW( Fujitsu), SS(NEC)

Automatic vectorization

• Compiler technology for vectorization: over 25 years of development
  – Dependence analysis
  – Elimination of false dependences
  – Strip mining
  – Loop interchange
  – Partial vectorization
  – Idiom recognition
  – IF conversion
  – Vector parallelization
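As an illustration of one item in this list, IF conversion replaces a branch inside the loop body by a mask that selects which elements an operation updates. The sketch below uses hypothetical function names and plain Python lists; it shows the idea, not any particular compiler's output.

```python
def loop_with_branch(a, b):
    """Original loop: the division is guarded by a data-dependent branch."""
    out = list(a)
    for i in range(len(a)):
        if b[i] != 0.0:
            out[i] = a[i] / b[i]
    return out


def if_converted(a, b):
    """IF-converted form: one vector compare, then one masked vector divide."""
    mask = [bi != 0.0 for bi in b]   # vector compare fills the mask register
    # Masked divide: elements whose mask bit is off keep their old value.
    return [ai / bi if m else ai for ai, bi, m in zip(a, b, mask)]
```

Both functions return the same list. The second has no control flow that depends on the data, which is what lets the loop be expressed with vector instructions and a mask register.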

Vector Architectures : Present

• New announcements (NEC, Cray, Fujitsu)
• The decline of vector processors
• Cross-pollination of Vector/ Superscalar/ VLIW processors

NEC SX-5

• Announced on June 5th, 1998
• 8 Gflops, CMOS, tc = 4 ns
• Superscalar processor at 500 Mflops
• 32 results/cycle (2 FPU, 16-pipe)
• 32 data memory accesses/cycle (2 ports, 16 data/port). Memory bandwidth of 64 GB/s
• System composed of 32 nodes of 128 Gflops, providing 4 Tflop/s
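The SX-5 figures above are mutually consistent, as the following back-of-the-envelope check shows. It assumes 64-bit (8-byte) data words, matching the data types listed earlier for numerical applications; the variable names are hypothetical.

```python
CYCLE_NS = 4                 # tc = 4 ns
RESULTS_PER_CYCLE = 32       # 2 FPU x 16 pipes
ACCESSES_PER_CYCLE = 32      # 2 ports x 16 data per port
BYTES_PER_WORD = 8           # 64-bit data (assumed)
NODE_GFLOPS = 128
NODES = 32

clock_mhz = 1000 // CYCLE_NS                          # 250 MHz
peak_mflops = clock_mhz * RESULTS_PER_CYCLE           # 8000 Mflops = 8 Gflops
bandwidth_mb_s = clock_mhz * ACCESSES_PER_CYCLE * BYTES_PER_WORD  # 64000 MB/s
processors_per_node = NODE_GFLOPS * 1000 // peak_mflops           # 16
system_gflops = NODES * NODE_GFLOPS                   # 4096 Gflops, about 4 Tflop/s

print(clock_mhz, peak_mflops, bandwidth_mb_s, processors_per_node, system_gflops)
# 250 8000 64000 16 4096
```

The 16 processors per node are derived here from the 128 Gflops node figure; the slide does not state that number directly.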

Cray SV1

• Announced on June 16th, 1998

• CMOS, 250 MHz and 4 Gigaflop/proc.

• Vector cache memory

• 2 FU’s of 8 operations/cycle

• “Multi-Streaming” Processor

• Scalable vector architecture (32 nodes of 32 processors…4 Teraflops)

• Future processor enhancements !!!

Fujitsu VP5000

• Announced on April 20th, 1999
• 9.2 Gflop/s, CMOS, 0.22 micron, 33 M transistors/chip
• Linpack 1000*1000 gives 8758 Mflop/s
• Crossbar provides 2*1.6 GB/s per processor
• System composed of 512 PE’s, or 4.9 Teraflops
• Maximum of 16 GB/PE or 8 TB/512 PE’s

The decline of vector processors

• Why have vector machines declined so fast in popularity?
  – Cost (Scalar parallel machines use commodity parts)
  – Too restricted in applications (lack of vectorization in many programs)
• Massive use of computers to run so called “Non-numerical Applications”

Characteristics of non-numerical Applications

• Examples: OLTP,DSS, simulators, games…

• General data structures: Lists, trees, tables…

• Data types: Scalar integers of 8 to 64 bits

• Frequent control flow change…Speculation

• Short distance data dependencies... Forwarding

• Instruction/data locality……Caches

• Fine-grain ILP……..Out-of-order

Micro Killers ???

Year  Machine      Tc (MHz)  #op/cycle  Peak Perf. Mflops
1976  Cray-1       80        2          160
1978  I-8086       10        -          -
1992  Cray C-90    243       4          970
1992  Alpha 21064  150       1          150
1994  Pentium      100       1          100
1996  NEC SX-4     125       16         2000
1997  IBM P2SC     160       4*         640
1997  Alpha 21164  500       2          1000
1998  HP PA8200    240       4*         960
1998  NEC SX-5     250       32         8000
1998  Pentium      400       1          400

Peak performance = Tc * ILP
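The formula can be applied to each row of the table. In the sketch below, "Tc" is taken to be the clock rate in MHz, as the column header indicates, and "ILP" the operations per cycle; the I-8086 row is left out because the slide lists no values for it, and the function and list names are hypothetical.

```python
# (year, machine, clock in MHz, operations per cycle, peak Mflops listed on the slide)
MACHINES = [
    (1976, "Cray-1", 80, 2, 160),
    (1992, "Cray C-90", 243, 4, 970),
    (1992, "Alpha 21064", 150, 1, 150),
    (1994, "Pentium", 100, 1, 100),
    (1996, "NEC SX-4", 125, 16, 2000),
    (1997, "IBM P2SC", 160, 4, 640),
    (1997, "Alpha 21164", 500, 2, 1000),
    (1998, "HP PA8200", 240, 4, 960),
    (1998, "NEC SX-5", 250, 32, 8000),
    (1998, "Pentium", 400, 1, 400),
]


def peak_mflops(clock_mhz, ops_per_cycle):
    """Peak performance = clock rate * operations per cycle."""
    return clock_mhz * ops_per_cycle


for year, name, mhz, ops, listed in MACHINES:
    print(f"{year} {name:12s} {peak_mflops(mhz, ops):5d} Mflops (slide: {listed})")
```

Every row reproduces the listed peak exactly except the Cray C-90, where 243 * 4 = 972 against the rounded 970 on the slide.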

Bandwidth and Performance

[Figure: memory hierarchy of five processors, from main memory down to the functional units, annotated with the size of each level and the bandwidth between levels]

Processors: Alpha 21264 (500 MHz), IBM Power chip (160 MHz), HP-8200 (240 MHz), Cray T90 (450 MHz), NEC SX-4 (125 MHz)

Levels: Main memory, L2 cache size, L1 cache size, Register file size, Functional Units

Values as extracted, top to bottom:
  2 GB/s, 2 GB/s, 24 GB/s, 16 GB/s
  16 MB
  5 Gb/s, 768 MB/s
  64 KB, 128 KB, 2 MB
  8 GB/s, 3.84 GB/s, 24 GB/s, 16 GB/s
  576 bytes, 704 bytes, 8 KB, 128 KB
  16 GB/s, 5.12 GB/s, 15.3 GB/s, 43.2 GB/s, 48 GB/s

Functional units: 2 FPU, 1 Gflops; 2 (2 pipe), 640 Mflops; 2 (2 pipe), 960 Mflops; 2 (2 pipe), 1.8 Gflops; 2 (8 pipes), 2 Gflops

Peak performance and Bandwidth

[Figure: efficiency (%) versus vector length (0 to 4000) for IBM RS6000 * and VPP500]

Z(I)=C0+A(I)*(C1+B(I)*(C2+C(I)*(C3+D(I)*(C4+E(I)*(C5+F(I)*(C6+G(I)*(C7+H(I)*(C8+K(I)*(C9+L(I))))))))))

* Measurement condition : RS6000-590 (66.6MHz), FORTRAN77 -O3 -qarch=pwr2 -qtune=pwr2

Courtesy of Fujitsu

Vector ideas used in SS’s/VLIW processors

• Address prediction and Prefetching
• Exploitation of data locality (the stride value is used for locality detection and exploitation)
• Predicate execution (VLIW)
• Multiply and add, chaining
• Multi-size operands
• Data reuse and vectorization
• Addressing modes (auto-increment)
• Multithreading ( 2 scalar processors in Fujitsu machines)
• Dynamic load/store elimination

Predictions for ALL instructions

[Figure: bar chart (0 to 100) comparing the predictors Last value, Stride, Context 1 and Context 3]

Y. Sazeides and J.E. Smith “The predictability of data values”, MICRO-30, 1997

Characterization of Vector Programs

[Figure: bar chart (0 to 100) of % vector access, % vectorization and Avg. VL]

R. Espasa “Advanced Vector Architectures”. PhD Thesis, Feb.97

SS’s ideas usable in vector processors

• Decoupled Vector Architectures

• Multithreaded Vector Architectures

• Out-of-order Vector Architectures

• Simultaneous Multithreaded Vector Architecture

• Victim Register File

R. Espasa, M. Valero and J.E. Smith HPCA96, HPCA97, MICRO97, ICS97...

ILP+DLP: Out-of-order Vector

[Figure: pipeline with Fetch, Decode & Rename, S registers, A registers, V registers, LD/ST, Reorder Buffer and Memory]

R. Espasa, M. Valero, J.E. Smith “Out-of-order Vector Architecture” MICRO30, 1997.

OOO Vector Performance

R. Espasa, M. Valero, J.E. Smith “Out-of-order Vector Architecture” MICRO30, 1997.

Vector Processors : The Future

• Very high-performance architectures
• Vector Microprocessors
  – Numerical Accelerators
  – Multimedia Applications

Architectures for a Billion Transistors

• Advanced/Superspeculative Architectures

• Trace Processors

• Simultaneous Multithreading

• Multiprocessor on a chip

• RAW processors

• IRAM

Billion-Transistor Architectures. IEEE Computer, Sept. 1997

SMV

• Simultaneous Multithreaded Vector Arch.
• Mixes three paradigms
  – DLP: vector unit
  – ILP: O-o-O execution
  – TLP: multithreaded fetch unit
• Requires a memory system with
  – high performance at low cost
  – low pin-count

R. Espasa and M. Valero “Exploiting Instruction and Data-Level Parallelism”, IEEE MICRO Sep. 1997

Billion Trans. Vector Architecture

[Figure: block diagram. Instruction fetch (I cache, 8 program counters, one per thread) and instruction decode (8 rename tables, one per thread) feed a floating point queue (64), an integer queue (64) and two memory queues (64). Register files: FPRF (128 reg), IRF (128 reg) and Vector Register File (128 reg). Execution pipeline: FPU 1, FPU 2, ALU 1, ALU 2, two address generators and VFU 1 to VFU 4, connected to memory. Instruction slots in the Reorder Buffer carry the thread ID and PC]

R. Espasa and M. Valero “Exploiting Instruction and Data-Level Parallelism”, IEEE MICRO Sep. 1997

SMV Performance

R. Espasa and M. Valero “Exploiting Instruction and Data-Level Parallelism”, IEEE MICRO Sep. 1997

V-IRAM1

[Figure: V-IRAM1 floorplan. A 2-way superscalar processor with 8K I cache and 8K D cache and a vector instruction queue; vector registers feeding +, x, ÷ and load/store units (each 4 x 64, or 8 x 32, or 16 x 16); a memory crossbar switch connecting the memory banks (M); serial I/O and parallel I/O]

D.A. Patterson “New directions in Computer Architecture” Berkeley, June 1998

0.18 µm, 200 MHz, 1.6GFLOPS(64b)/6.4GOPS(16b)/32MB

Conflict-free access to vectors

[Figure: processors P1, P2, P3, ..., Pn connected through interconnection networks and sections to the memory modules]

Idea: Out-of-order access

M. Valero et al. ISCA 92, ISCA 95, IEEE-TC 95, ICS 92, ICS 94,...

Command Memory System

[Figure: processors P1, P2, P3, ..., Pn send commands through an interconnection network to section controllers, which access the memory modules]

Command = <@,Length,Stride,size>

Break commands into bursts at the section controller

J. Corbal, R. Espasa and M. Valero “ Command-Vector Memory System” PACT98
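The command format can be illustrated with a small sketch that expands one command into per-section bursts. Everything beyond the four command fields is an assumption made for the example and is not taken from the PACT98 paper: the number of sections, the 64-byte round-robin interleaving, the byte-valued stride and the names `Command`, `section_of` and `split_into_bursts` are all hypothetical, and elements are assumed never to straddle a section boundary.

```python
from collections import namedtuple

# One memory command: start address (@), element count, stride and element size in bytes.
Command = namedtuple("Command", "address length stride size")

NUM_SECTIONS = 4     # hypothetical number of memory sections
SECTION_BYTES = 64   # hypothetical interleaving granularity


def section_of(address):
    """Section that owns an address under round-robin block interleaving."""
    return (address // SECTION_BYTES) % NUM_SECTIONS


def split_into_bursts(cmd):
    """Return {section: [(start_address, byte_count), ...]} for one command."""
    bursts = {}
    for i in range(cmd.length):
        addr = cmd.address + i * cmd.stride
        runs = bursts.setdefault(section_of(addr), [])
        if runs and runs[-1][0] + runs[-1][1] == addr:
            # Element is contiguous with the previous one: extend the burst.
            runs[-1] = (runs[-1][0], runs[-1][1] + cmd.size)
        else:
            runs.append((addr, cmd.size))
    return bursts


print(split_into_bursts(Command(address=0, length=32, stride=8, size=8)))
# {0: [(0, 64)], 1: [(64, 64)], 2: [(128, 64)], 3: [(192, 64)]}
```

In this toy model a unit-stride command of 32 words turns into one 64-byte burst per section, while a command with a stride of 16 bytes degenerates into single-element bursts.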

System configuration in 2009

[Figure: 32 SMP (cc-NUMA) nodes, 200TFLOPS/160TB in total. Each node: 32 chips of 200GF (6.4TFLOPS) sharing a 5TB memory through an X-Bar. Bandwidth labels: 100GB/Sec and 800GB/Sec (X-Bar)]

Sustained Scalar 250GFLOPS? Vector 1TFLOPS?

T. Watanabe SC98, Orlando.

Vector Microprocessors

• Ways of reducing the design impact
  – Short Vectors (64 x 16 words = 8 Kbytes)
  – Vector Functional units shared with INT/FP units
  – Vector Register renaming to allow precise exceptions
• Cache hierarchy tuned to vector execution
  – Vector data locality allows large data transactions
• Very large bandwidth between cache and vector registers
• High performance for numerical and multimedia applications

General Architecture

[Figure: block diagram with I-Cache, Fetch and Decode stages, FP and INT units, a vector register file (VRF), a Vector Cache and a Rambus Controller attached to four RDRAM devices; width labels 1024 and 8]

Vector PC Vs SuperScalar

[Figure: bar chart (y axis 0 to 25) for Hydro2D, Dyfesm, Swm256 and Tomcatv; legend as extracted: OoO-SS 1x2, VEC 16 1x2, VEC 16 16x32]

Cache Hierarchy

• Where should the Vector Cache be placed?

[Figure: two alternative hierarchies between the CPU and DIRECT RAMBUS memory, with the vector cache (VC) placed at the L1 level or at the L2 level]

Performance of the cache hierarchies

[Figure: three line charts, one each for BDNA, FLO52 and HYDRO2D; x axis: 2, 8, 16, 32; axis labels: EIPC, FLOPS/CYCLE; series: VECTOR CACHE on L1, VECTOR CACHE on L2, PERFECT CACHE]

Importance of media Applications

“On the next five years, (1998-2002), we believe that media processing will become the dominant force in computer architecture” (K. Diefendorff and P. K. Dubey in IEEE Computer Journal, Sep.97, pp. 43-45)

“90% of Desktop Cycles will Be Spent on Media Applications by 2000” ( Scott Kirkpatrick of IBM )

Characteristics of media Applications

• Examples: Image/ speech processing, communications, virtual reality, graphics…

• Data structures: matrices and vectors

• Data types: Integer(8 -32 bits), FP (32- 64)

• Demand for high memory bandwidth

• Low data locality and latency problem

• No critical data-dependences

• Real time necessity

• Fine/coarse grain parallelism

Multimedia Applications and Architectures

[Figure: grids of data elements for Scientific Applications and Multimedia, mapped onto three architecture families]

• Superscalar + MMX: Re-discover the parallelism at run-time using a lot of hardware
• VLIW: Simple hardware, but loss of parallelism. As many instructions as SS approach
• Vector Architectures: Natural way to express and execute DLP applications

MMX-like processors

• Multimedia extensions are designed to exploit the parallelism inherent in multimedia applications

• Targeted to leverage full compatibility with existing operating systems and applications, plus minimum chip area investment.

• The highlights of multimedia extensions are:

• Single Instruction, Multiple Data (SIMD) techniques

• New data types (Multimedia Vectors, 32/64 bits)

• Multimedia registers

• SIMD-like instructions, over small integer data types

MMX instruction example

• PADDW: Parallel ADD of 4x16-bit data type with Wrap Around (No Saturation)

      A1       A2       xFFFF    A3
  +   B1       B2       x0006    B3
  =   A1+B1    A2+B2    x0005    A3+B3

[Figure: four 16-bit lanes of a 64-bit register, bit positions 0, 15, 31, 47, 63]
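The wrap-around behaviour in this example can be reproduced with a short emulation. The function `paddw` below is a hypothetical Python model that operates on 64-bit integers holding four 16-bit lanes; it is not the hardware instruction itself.

```python
LANE_BITS = 16
LANE_MASK = 0xFFFF
LANES = 4


def paddw(a, b):
    """Add four packed 16-bit lanes of two 64-bit values, wrapping on overflow."""
    result = 0
    for lane in range(LANES):
        shift = LANE_BITS * lane
        lane_sum = ((a >> shift) & LANE_MASK) + ((b >> shift) & LANE_MASK)
        # Keep only the low 16 bits, so the carry never reaches the next lane.
        result |= (lane_sum & LANE_MASK) << shift
    return result


print(hex(paddw(0xFFFF, 0x0006)))  # 0x5
```

The lane holding xFFFF + x0006 wraps to x0005, as on the slide, and the other three lanes are unaffected by the overflow.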

Superscalar Multimedia Processors

                 PowerPC   Intel     Sun       MIPS V/   HP        Alpha
                 Altivec   MMX       VIS       MDMX      MAX2      MVI
Register File    32*128    8*64      32*64     32*64     32*64     32*64
Mapped Onto      Separate  FP        FP        FP        Integer   Integer
Integer Support  8/16/32   8/16/32   8/16/32   8/16 bit  16/32     8 bit
FP Support       Yes       MMX2      No        MIPS V    No        No
Usual stuff+     Lots      Lots      Lots      Lots      Some      None
Multiply /MAC    Lots      Mult      Mult      Lots      Some      None
Min/Max/Avg      Yes       No        No        Min/Max   Avg       Min/Max
Pack/Unpack      Yes       Yes       Yes       Yes       Yes       Yes
Byte Reordering  All       Some      Some      Many      All       None
Unaligned Data   3 Inst.   No        2 Inst.   Yes       No        No
Announced        2Q98      2Q96      4Q94      4Q96      4Q95      4Q96

Microprocessor Report Vol 12, N 6, May 11, 1998

Multimedia Embedded Systems

• NEC V830R/AV includes MIX2, a multimedia instruction extension (SIMD, MMX-like approach)
• Hitachi SH4 includes FP 4-length vector instructions, targeted at geometry transformation in 3D rendering applications
• ARM10 Thumb Family processors will include a Vector FP unit capable of delivering 600 MFLOPS

Widen is better…(?)

• Most multimedia algorithms exhibit vectors no longer than 8/16 elements => widening the multimedia registers could provide diminishing returns.

[Figure: the packed addition C = A + B at three register widths: one 16-bit element (bits 0 to 15), four elements A1..A4 in 64 bits, and eight elements A1..A8 in 128 bits]

VLIW : Widening vs Replication

[Figure: bus configurations between Memory and Register File: one 1-word bus, two 1-word buses, one 2-word bus, two 2-word buses]

D. López et al. “Increasing Memory Bandwidth with Wide Busses”, ICS-97

Widening and Replication Performance

[Figure: chart with y axis from 1 to 8 and x axis values 2, 4, 8, 16; series: Wide 1, Wide 2, Wide 4]

D. López et al. “Widening versus replicating...”, ICS98, MICRO98

Torrent T0 Microprocessor

• The first single-chip vector microprocessor.
• Can sustain over 24 operations per cycle while having an issue rate of only one 32-bit instruction per cycle
• Features:
  – 16 vector registers (32 32-bit elements each)
  – 2 Vector arithmetic units (8 pipes each)
  – Reconfigurable composite operation pipelines
  – 128-bit wide, external memory interface
  – MIPS-II, 32-bit instruction set, scalar unit.

K. Asanovic et al. “The T0 vector microprocessor”. Hot Chips VII, 1995

Torrent T0 Microprocessor

K. Asanovic et al. “ The T0 vector microprocessor “. Hot Chips VII, 1995

Vector versus Superscalar Processors

• Comparison of Die Area
  – Processor Die Area (in mm2 scaled to 0.25

[Figure: stacked bar chart (y axis 0 to 250) of Control, Registers and Datapath area for Torrent-0, Alpha 21164, UltraSPARC II, MIPS R10000, HP PA-8000, Alpha 21264 and a 6-way OoO with Rob128; value labels: 14.73, 21.86, 37.77, 66.92, 67.77, 69.81, 250.0]

C. G. Lee and D. J. DeVries “Initial Results on …”. MICRO-30, 1997.

Vector versus Superscalar Processors

• Component Percentages

[Figure: stacked bars (0 to 100 %) splitting die area into Datapath, Registers and Control for Torrent-0, Alpha 21164, UltraSPARC II, MIPS R10000, HP PA-8000, Alpha 21264 and a 6-way OoO with Rob128]

C. G. Lee and D. J. DeVries “Initial Results on …”. MICRO-30, 1997.

Imagine project

• Focused on developing a programmable architecture that achieves performance similar to special purpose hardware on graphics and image processing.

• Matches media applications demands to the current VLSI capabilities by using a stream-based programming model.

• Most multimedia kernels exhibit a streaming nature.

• Individual stream elements can be operated on in parallel, thus exploiting data parallelism.

Bill Dally “Tomorrow Computing Engines”, Keynote HPCA98

Imagine architecture

• Organized around a large stream register file (64Kb)
• Memory operations move entire streams of data
• Data streams pass through a set of arithmetic clusters (8)
• Each cluster unit operates on a single element under VLIW control

[Figure: four SDRAM devices connected to a Streaming Memory System, which feeds the Stream Register File; a Controller drives CLUSTER 0, CLUSTER 1, ..., CLUSTER 7]

Bill Dally “Tomorrow Computing Engines”, Keynote HPCA98

Matrix extensions for Multimedia

• By combining the conventional vector approach with SIMD MMX-like instructions, we can exploit additional levels of DLP with matrix oriented multimedia extensions.

[Figure: the addition C = A + B at three levels: scalar (C1 = A1 + B1); MMX-like (four 16-bit elements A1..A4 in one 64-bit register, bits 0 to 63); matrix (sixteen elements A1..A16, B1..B16, C1..C16 held as four 64-bit rows)]
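The matrix operation in the figure can be modelled as a vector of packed registers, where a single instruction applies an MMX-like packed add to every row. The sketch below is a hypothetical Python model, not the actual MOM instruction set: `pack`, `unpack`, `packed_add16` and `matrix_packed_add` are illustrative names.

```python
LANE_BITS = 16
LANE_MASK = 0xFFFF
LANES = 4


def pack(lanes):
    """Pack four 16-bit values into one 64-bit word, lane 0 in bits 0 to 15."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & LANE_MASK) << (LANE_BITS * i)
    return word


def unpack(word):
    """Inverse of pack: split a 64-bit word into its four 16-bit lanes."""
    return [(word >> (LANE_BITS * i)) & LANE_MASK for i in range(LANES)]


def packed_add16(a, b):
    """MMX-like add: four independent 16-bit additions with wrap-around."""
    return pack([x + y for x, y in zip(unpack(a), unpack(b))])


def matrix_packed_add(rows_a, rows_b):
    """Matrix-level add: the packed add is repeated for every row of the operands."""
    return [packed_add16(a, b) for a, b in zip(rows_a, rows_b)]


# A1..A16 as four rows of four elements, B filled with the constant 100.
a_rows = [pack([1, 2, 3, 4]), pack([5, 6, 7, 8]),
          pack([9, 10, 11, 12]), pack([13, 14, 15, 16])]
b_rows = [pack([100] * LANES)] * 4
c_rows = matrix_packed_add(a_rows, b_rows)
print([unpack(row) for row in c_rows])
```

One matrix instruction here covers 16 element additions: 4 lanes per packed register times a vector length of 4 rows.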

Relative Performance

[Figure: three bar charts of relative performance for MMX, MDMX and MOM at way 1, way 2, way 4 and way 8; panels: INVERSE DCT TRANSFORM, MPEG-2 MOTION ESTIMATION, RGB-YCC Color CONVERSION]

Applications and Architectures

[Figure: Numerical Applications on three configurations: Integer + Subroutines (Very Slow); Integer + FPU (Very Big Improvement !!!); Integer + FPU + VFPU (Additional Speed)]

Future Applications

• Integer SPEC-like
• Commercial (OLTP,DSS)
• Numerical
• Multimedia

[Figure: chart with regions labelled Integer, Commercial, Numerical and Multimedia]

Acknowledgments

• Roger Espasa
• James E. Smith
• Luis A. Villa
• Francisca Quintana
• Jesús Corbal
• David López
• Josep Llosa
• Eduard Ayguade
• Krste Asanovic
• William Dally
• Christoforos E. Kozyrakis
• Corinna G. Lee
• David A. Patterson
• Steve Wallace

The End