
Lecture 12, Slide 1

Computer Architecture

Vector Computers


Lecture 12, Slide 2

Contents

1. Why Vector Processors?

2. Basic Vector Architecture

3. How Vector Processors Work

4. Vector Length and Stride

5. Effectiveness of Compiler Vectorization

6. Enhancing Vector Performance

7. Performance of Vector Processors


Lecture 12, Slide 3

Vector Processors

I’m certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor. Those three were all pioneering processors. . . . One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It’s always best to come second when you can look at the mistakes the pioneers made.

Seymour Cray, public lecture at Lawrence Livermore Laboratories on the introduction of the Cray-1 (1976)


Lecture 12, Slide 4

Supercomputers

Definition of a supercomputer:

• Fastest machine in world at given task

• A device to turn a compute-bound problem into an I/O bound problem

• Any machine costing $30M+

• Any machine designed by Seymour Cray

CDC6600 (Cray, 1964) regarded as first supercomputer


Lecture 12, Slide 5

Supercomputer Applications

Typical application areas:
• Military research (nuclear weapons, cryptography)
• Scientific research
• Weather forecasting
• Oil exploration
• Industrial design (car crash simulation)

All involve huge computations on large data sets

In the 70s-80s, supercomputer meant vector machine


Lecture 12, Slide 6

1. Why Vector Processors?

• A single vector instruction specifies a great deal of work—it is equivalent to executing an entire loop.

• The computation of each result in the vector is independent of the computation of other results in the same vector and so hardware does not have to check for data hazards within a vector instruction.

• Hardware need only check for data hazards between two vector instructions once per vector operand, not once for every element within the vectors.

• Vector instructions that access memory have a known access pattern.

• Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control hazards that would normally arise from the loop branch are nonexistent.


Lecture 12, Slide 7

2. Basic Vector Architecture

• There are two primary types of architectures for vector processors: vector-register processors and memory-memory vector processors.

– In a vector-register processor, all vector operations—except load and store—are among the vector registers.

– In a memory-memory vector processor, all vector operations are memory to memory.

Lecture 12, Slide 8

Vector Memory-Memory versus Vector Register Machines

• Vector memory-memory instructions hold all vector operands in main memory

• The first vector machines, CDC Star-100 (‘73) and TI ASC (‘71), were memory-memory machines

• Cray-1 (’76) was first vector register machine

Example Source Code:

for (i=0; i<N; i++) {
  C[i] = A[i] + B[i];
  D[i] = A[i] - B[i];
}

Vector Memory-Memory Code:

ADDV C, A, B
SUBV D, A, B

Vector Register Code:

LV   V1, A
LV   V2, B
ADDV V3, V1, V2
SV   V3, C
SUBV V4, V1, V2
SV   V4, D

Lecture 12, Slide 9

Vector Memory-Memory vs. Vector Register Machines

• Vector memory-memory architectures (VMMA) require greater main memory bandwidth, why?

– All operands must be read in and out of memory

• VMMAs make it difficult to overlap execution of multiple vector operations, why?

– Must check dependencies on memory addresses

• VMMAs incur greater startup latency
– Scalar code was faster on CDC Star-100 for vectors < 100 elements

– For Cray-1, vector/scalar breakeven point was around 2 elements

Apart from CDC follow-ons (Cyber-205, ETA-10) all major vector machines since Cray-1 have had vector register architectures

(we ignore vector memory-memory from now on)

Lecture 12, Slide 10

The Basic Structure of a Vector-Register Architecture: VMIPS

[Figure: block diagram of the VMIPS vector-register architecture]

Lecture 12, Slide 11

Primary Components of VMIPS

• Vector registers — VMIPS has eight vector registers, and each holds 64 elements. Each vector register must have at least two read ports and one write port.

• Vector functional units — Each unit is fully pipelined and can start a new operation on every clock cycle.

• Vector load-store unit —The VMIPS vector loads and stores are fully pipelined, so that words can be moved between the vector registers and memory with a bandwidth of 1 word per clock cycle, after an initial latency.

• A set of scalar registers —Scalar registers can also provide data as input to the vector functional units, as well as compute addresses to pass to the vector load-store unit.

Lecture 12, Slide 12

Vector Supercomputers

Epitomized by Cray-1, 1976:

Scalar Unit + Vector Extensions

• Load/Store Architecture

• Vector Registers

• Vector Instructions

• Hardwired Control

• Highly Pipelined Functional Units

• Interleaved Memory System

• No Data Caches

• No Virtual Memory

Lecture 12, Slide 13

Cray-1 (1976)

Lecture 12, Slide 14

Cray-1 (1976)

[Figure: Cray-1 datapath. Single-port memory: 16 banks of 64-bit words + 8-bit SECDED; 80 MW/sec data load/store; 320 MW/sec instruction-buffer refill; 4 instruction buffers of 16 × 64 bits. Registers: 8 address (A) registers backed by 64 B registers, 8 scalar (S) registers backed by 64 T registers, and 8 vector (V) registers of 64 elements each, plus vector mask and vector length registers. Functional units: FP add, FP multiply, FP reciprocal; integer add, logic, shift, population count; address add and address multiply. Memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz).]

Lecture 12, Slide 15

Vector Programming Model

[Figure: the vector programming model. Scalar registers r0-r7; vector registers v0-v7, each holding elements [0] through [VLRMAX-1]; and a vector length register VLR. A vector arithmetic instruction such as ADDV v3, v1, v2 adds v1 and v2 elementwise over elements [0] through [VLR-1]. A vector load or store such as LV v1, r1, r2 moves a vector between memory and a vector register, with the base address in r1 and the stride in r2.]

Lecture 12, Slide 16

• In VMIPS, vector operations use the same names as MIPS operations, but with the letter "V" appended.

Lecture 12, Slide 17

Vector Code Example

# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

# Scalar Code
      LI R4, 64
loop:
      L.D F0, 0(R1)
      L.D F2, 0(R2)
      ADD.D F4, F2, F0
      S.D F4, 0(R3)
      DADDIU R1, 8
      DADDIU R2, 8
      DADDIU R3, 8
      DSUBIU R4, 1
      BNEZ R4, loop

# Vector Code
LI VLR, 64
LV V1, R1
LV V2, R2
ADDV.D V3, V1, V2
SV V3, R3

Lecture 12, Slide 18

Vector Instruction Set Advantages

• Compact
– one short instruction encodes N operations

• Expressive, tells hardware that these N operations:
– are independent
– use the same functional unit
– access disjoint registers
– access registers in the same pattern as previous instructions
– access a contiguous block of memory (unit-stride load/store)
– access memory in a known pattern (strided load/store)

• Scalable
– can run same object code on more parallel pipelines or lanes

Lecture 12, Slide 19

3. How Vector Processors Work

3.1 An Example

• Let's take a typical vector problem: X and Y are vectors, a is a scalar.

• Y = a × X + Y

• This is the so-called SAXPY or DAXPY loop that forms the inner loop of the Linpack benchmark.

• Example: Show the code for MIPS and VMIPS for the DAXPY loop.

• Assume that the starting addresses of X and Y are in Rx and Ry, and that the number of elements, or length, of a vector register (64) matches the length of the vector operation.
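In C, the loop the slides are describing is simply (a minimal sketch; the slide gives only the formula):

/* DAXPY: Y = a*X + Y, the inner loop of the Linpack benchmark */
void daxpy(double a, const double *X, double *Y, int n) {
    for (int i = 0; i < n; i++)
        Y[i] = a * X[i] + Y[i];
}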

Lecture 12, Slide 20

• Here is the MIPS code.

      L.D    F0,a       ;load scalar a
      DADDIU R4,Rx,#512 ;last address to load
Loop: L.D    F2,0(Rx)   ;load X(i)
      MUL.D  F2,F2,F0   ;a × X(i)
      L.D    F4,0(Ry)   ;load Y(i)
      ADD.D  F4,F4,F2   ;a × X(i) + Y(i)
      S.D    0(Ry),F4   ;store into Y(i)
      DADDIU Rx,Rx,#8   ;increment index to X
      DADDIU Ry,Ry,#8   ;increment index to Y
      DSUBU  R20,R4,Rx  ;compute bound
      BNEZ   R20,Loop   ;check if done

Lecture 12, Slide 21

• Here is the VMIPS code for DAXPY.

L.D     F0,a     ;load scalar a
LV      V1,Rx    ;load vector X
MULVS.D V2,V1,F0 ;vector-scalar multiply
LV      V3,Ry    ;load vector Y
ADDV.D  V4,V2,V3 ;add
SV      Ry,V4    ;store the result

• The most dramatic comparison is that the vector processor greatly reduces the dynamic instruction bandwidth.

• Another important difference is the frequency of pipeline interlocks. (Pipeline stalls are required only once per vector operation, rather than once per vector element.)

Lecture 12, Slide 22

Vector Arithmetic Execution

• Use deep pipeline (=> fast clock) to execute element operations

• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)

[Figure: a six-stage multiply pipeline computing V3 <- V1 * V2; element pairs from V1 and V2 enter one per cycle and results stream into V3.]

Lecture 12, Slide 23

3.2 Vector Load-Store Units and Vector Memory Systems

To maintain an initiation rate of 1 word fetched or stored per clock, the memory system must be capable of producing or accepting this much data. This is usually done by spreading accesses across multiple independent memory banks.

Start-up penalties (in clock cycles) on VMIPS:

Operation           Start-up penalty
Vector add          6
Vector multiply     7
Vector divide       20
Vector load / store 12

Lecture 12, Slide 24

Vector Memory System

[Figure: a vector memory system. An address generator takes a base and a stride from the vector unit and spreads successive element addresses across 16 memory banks (0-F), which feed the vector registers.]

Cray-1: 16 banks, 4 cycle bank busy time, 12 cycle latency

• Bank busy time: cycles between accesses to the same bank


Lecture 12, Slide 25

• Example

Suppose we want to fetch a vector of 64 elements starting at byte address 136, and a memory access takes 6 clocks. How many memory banks must we have to support one fetch per clock cycle? With what addresses are the banks accessed? When will the various elements arrive at the CPU?

• Answer

Six clocks per access require at least six banks, but because we want the number of banks to be a power of two, we choose to have eight banks. Figure on next page shows the timing for the first few sets of accesses for an eight-bank system with a 6-clock-cycle access latency.
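A small C sketch of the interleaving arithmetic (my illustration; it assumes one new address issued per clock and eight word-interleaved 8-byte banks):

#include <stdio.h>

int main(void) {
    /* 64-element fetch starting at byte address 136, 6-clock access time */
    for (int i = 0; i < 8; i++) {     /* the first set of accesses */
        long addr  = 136 + 8L * i;    /* byte address of element i */
        int  bank  = (addr / 8) % 8;  /* word-interleaved bank number */
        int  ready = i + 6;           /* issued at clock i, data back at i+6 */
        printf("elem %d: addr %ld, bank %d, ready at clock %d\n",
               i, addr, bank, ready);
    }
    return 0;
}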


Lecture 12, Slide 26

The CPU cannot keep all eight banks busy all the time because it is limited to supplying one new address and receiving one data item each cycle.


Lecture 12, Slide 27

4. Two Real-World Issues: Vector Length and Stride

• What do you do when the vector length in a program is not exactly 64?

• How do you deal with nonadjacent elements in vectors that reside in memory?

4.1 Vector-Length Control

do 10 i = 1,n

10 Y(i) = a × X(i) + Y(i)

n may not even be known until run time


Lecture 12, Slide 28

• The solution is to create a vector-length register (VLR), which controls the length of any vector operation.

• The value in the VLR, however, cannot be greater than the length of the vector registers — maximum vector length (MVL).

• If the vector is longer than the maximum length, a technique called strip mining is used.

Lecture 12, Slide 29

Vector Stripmining

Problem: Vector registers have finite length

Solution: Break loops into pieces that fit into vector registers, "Stripmining"

for (i=0; i<N; i++)
  C[i] = A[i] + B[i];

      ANDI   R1, N, #63 ; N mod 64
      MTC1   VLR, R1    ; Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, #3 ; Multiply by 8
      DADDU  RA, RA, R2 ; Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1   ; Subtract elements
      LI     R1, #64
      MTC1   VLR, R1    ; Reset full length
      BGTZ   N, loop    ; Any more to do?

[Figure: vectors A, B, and C processed as a short remainder strip followed by full 64-element strips.]
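The same stripmining pattern in C (a sketch under the slide's assumptions: MVL = 64, remainder strip first, then full strips):

/* Strip-mined version of: for (i=0; i<N; i++) C[i] = A[i] + B[i]; */
void vadd_stripmined(double *C, const double *A, const double *B, int N) {
    int low = 0;
    int vl  = N % 64;                         /* length of first (remainder) strip */
    for (int strip = 0; strip <= N / 64; strip++) {
        for (int i = low; i < low + vl; i++)  /* one vector operation of length vl */
            C[i] = A[i] + B[i];
        low += vl;
        vl = 64;                              /* later strips use the full MVL */
    }
}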

Lecture 12, Slide 30

4.2 Vector Stride

do 10 i = 1,100
   do 10 j = 1,100
      A(i,j) = 0.0
      do 10 k = 1,100
10       A(i,j) = A(i,j) + B(i,k) * C(k,j)

At the statement labeled 10 we could vectorize the multiplication of each row of B with each column of C. When an array is allocated memory, it is linearized and must be laid out in either row-major or column-major order. This linearization means that either the elements in the row or the elements in the column are not adjacent in memory.


Lecture 12, Slide 31

Vector Stride

• The vector stride, like the vector starting address, can be put in a general-purpose register.

• Then the VMIPS instruction LVWS (load vector with stride) can be used to fetch the vector into a vector register.

• Likewise, when a nonunit stride vector is being stored, SVWS (store vector with stride) can be used.

This distance separating elements that are to be gathered into a single register is called the stride.
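For example, in C's row-major layout (an illustration; the slides' Fortran is column-major), walking down one column of a 100 × 100 matrix of doubles touches addresses 800 bytes apart, so fetching a column needs a stride of 100 elements:

/* Gather column k of a row-major matrix into a contiguous buffer:
   a scalar picture of what LVWS does with a stride of 100 elements. */
void load_column(double col[100], const double B[100][100], int k) {
    for (int i = 0; i < 100; i++)
        col[i] = B[i][k]; /* consecutive accesses are 100 * 8 = 800 bytes apart */
}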


Lecture 12, Slide 32

5. Effectiveness of Compiler Vectorization

• Two factors affect the success with which a program can be run in vector mode.

• The first factor is the structure of the program itself. This factor is influenced by the algorithms chosen and by how they are coded.

• The second factor is the capability of the compiler.

Lecture 12, Slide 33

Automatic Code Vectorization

for (i=0; i < N; i++)
  C[i] = A[i] + B[i];

[Figure: in the scalar sequential code, iteration 1 performs load, load, add, store, then iteration 2 repeats the same sequence, strictly one after another in time. In the vectorized code, the loads of all iterations issue together, then the adds, then the stores, each group as a single vector instruction.]

Vectorization is a massive compile-time reordering of operation sequencing; it requires extensive loop dependence analysis.

Lecture 12, Slide 34

6. Enhancing Vector Performance

In this section we present five techniques for improving the performance of a vector processor:

• Chaining
• Conditionally Executed Statements
• Sparse Matrices
• Multiple Lanes
• Pipelined Instruction Start-Up

Lecture 12, Slide 35

(1) Vector Chaining

• The concept of forwarding extended to vector registers

• Vector version of register bypassing
– introduced with Cray-1

LV   v1
MULV v3, v1, v2
ADDV v5, v3, v4

[Figure: the load unit fills V1 from memory; its results are chained into the multiply unit (V3 <- V1 * V2), whose results are in turn chained into the add unit (V5 <- V3 + V4).]

Lecture 12, Slide 36

Vector Chaining Advantage

• Without chaining, must wait for last element of result to be written before starting dependent instruction

• With chaining, can start dependent instruction as soon as first result appears

[Figure: timelines for Load, Mul, Add. Unchained, each instruction begins only after the previous one finishes; chained, the Mul starts as soon as the Load's first result appears and the Add follows likewise, so the three overlap almost completely.]

Lecture 12, Slide 37

Implementations of Chaining

• Early implementations worked like forwarding, but this restricted the timing of the source and destination instructions in the chain.

• Recent implementations use flexible chaining, which requires simultaneous access to the same vector register by different vector instructions. This can be implemented either by adding more read and write ports or by organizing the vector-register file storage into interleaved banks, in a similar way to the memory system.

Lecture 12, Slide 38

(2) Vector Conditional Execution

Problem: Want to vectorize loops with conditional code:

for (i=0; i<N; i++)
  if (A[i]>0) then A[i] = B[i];

Solution: Add vector mask (or flag) registers
– vector version of predicate registers, 1 bit per element

…and maskable vector instructions
– vector operation becomes NOP at elements where mask bit is clear

Code example:

CVM            ; Turn on all elements
LV VA, RA      ; Load entire A vector
L.D F0, #0     ; Load FP zero into F0
SGTVS.D VA, F0 ; Set bits in mask register where A>0
LV VA, RB      ; Load B vector into A under mask
SV VA, RA      ; Store A back to memory under mask
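In scalar C, the masked sequence above behaves like this (my sketch, not from the slides): every element position is processed, but the update takes effect only where the mask bit is set.

/* Mask semantics: mask[i] = (A[i] > 0); B replaces A only where mask = 1. */
void masked_copy(double *A, const double *B, int N) {
    for (int i = 0; i < N; i++) {
        int mask = (A[i] > 0.0); /* SGTVS.D: compare each element to scalar 0 */
        if (mask)
            A[i] = B[i];         /* masked LV/SV take effect only here */
    }
}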

Lecture 12, Slide 39

Masked Vector Instructions

Simple Implementation
– execute all N operations, turn off result writeback according to mask

Density-Time Implementation
– scan mask vector and only execute elements with non-zero masks

[Figure: with mask bits M[0]=0, M[1]=1, M[2]=0, M[3]=0, M[4]=1, M[5]=1, M[6]=0, M[7]=1, the simple implementation streams every pair A[i], B[i] through the pipeline and uses the mask as a write enable on the result port, while the density-time implementation skips straight to the next set mask bit, so only the enabled elements occupy the pipeline.]

Lecture 12, Slide 40

Compress/Expand Operations

• Compress packs non-masked elements from one vector register contiguously at start of destination vector register
– population count of mask vector gives packed vector length

• Expand performs inverse operation

[Figure: with mask bits set at positions 1, 4, 5, and 7, compress packs A[1], A[4], A[5], A[7] into the first four elements of the destination; expand scatters a packed vector back into those positions, leaving the unmasked elements (B[0], B[2], B[3], B[6]) unchanged.]

Used for density-time conditionals and also for general selection operations
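A scalar sketch of the two operations (illustrative helper names, not VMIPS instructions):

/* Compress: pack the elements of src whose mask bit is set contiguously
   into dst. Returns the packed length (the population count of the mask). */
int compress(double *dst, const double *src, const int *mask, int n) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) dst[k++] = src[i];
    return k;
}

/* Expand: the inverse; unpack consecutive src values back into the
   mask-selected positions of dst, leaving other positions untouched. */
void expand(double *dst, const double *src, const int *mask, int n) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) dst[i] = src[k++];
}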


Lecture 12, Slide 41

(3) Sparse Matrices

Lecture 12, Slide 42

Vector Scatter/Gather

Want to vectorize loops with indirect accesses (index vector D designates the nonzero elements of C):

for (i=0; i<N; i++)
  A[i] = B[i] + C[D[i]];

Indexed load instruction (Gather):

LV     VD, RD       ; Load indices in D vector
LVI    VC, (RC, VD) ; Load indirect from RC base
LV     VB, RB       ; Load B vector
ADDV.D VA, VB, VC   ; Do add
SV     VA, RA       ; Store result
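In scalar C, the gather step (LVI) is just an element-wise indexed load (my sketch); the scatter (SVI) is the mirror-image indexed store:

/* Gather: VC[i] = C[D[i]], one arbitrary memory address per element. */
void gather(double *VC, const double *C, const int *D, int n) {
    for (int i = 0; i < n; i++)
        VC[i] = C[D[i]];
}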

Lecture 12, Slide 43

Vector Scatter/Gather

Scatter example:

for (i=0; i<N; i++)
  A[B[i]]++;

Is the following a correct translation?

LV   VB, RB       ; Load indices in B vector
LVI  VA, (RA, VB) ; Gather initial A values
ADDV VA, VA, 1    ; Increment
SVI  VA, (RA, VB) ; Scatter incremented values

Lecture 12, Slide 44

(4) Multiple Lanes

Vector Instruction Execution: ADDV C, A, B

[Figure: with one pipelined functional unit, element pairs A[i], B[i] enter the pipeline one per cycle and results C[0], C[1], C[2], ... complete in sequence. With four pipelined functional units, the elements are interleaved across the pipelines (lane 0 handles elements 0, 4, 8, ...; lane 1 handles 1, 5, 9, ...; and so on), so four results complete per cycle.]

Lecture 12, Slide 45

Vector Unit Structure

[Figure: a four-lane vector unit. Each lane contains one slice of the vector register file (lane 0: elements 0, 4, 8, ...; lane 1: elements 1, 5, 9, ...; lane 2: elements 2, 6, 10, ...; lane 3: elements 3, 7, 11, ...), one pipeline of each functional unit, and a port into the memory subsystem.]

Lecture 12, Slide 46

T0 Vector Microprocessor (1995)

[Figure: the T0 with one lane highlighted. Vector register elements are striped over the eight lanes: lane 0 holds elements [0], [8], [16], [24]; lane 1 holds [1], [9], [17], [25]; and so on up to lane 7, which holds [7], [15], [23], [31].]

Lecture 12, Slide 47

Vector Instruction Parallelism

Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and 8 lanes

[Figure: issue timeline for the load unit, multiply unit, and add unit. While one load drains through the load unit, a multiply and an add from earlier instructions occupy the other units, and the next load, multiply, and add follow immediately behind.]

Complete 24 operations/cycle while issuing 1 short instruction/cycle

Lecture 12, Slide 48

(5) Pipelined Instruction Start-Up

• The simplest case to consider is when two vector instructions access a different set of vector registers. For example, consider the code sequence:

ADDV.D V1, V2, V3
ADDV.D V4, V5, V6

• It becomes critical to reduce start-up overhead by allowing the start of one vector instruction to be overlapped with the completion of preceding vector instructions.

• An implementation can allow the first element of the second vector instruction to immediately follow the last element of the first vector instruction down the FP adder pipeline.

Lecture 12, Slide 49

Vector Startup

Two components of vector startup penalty:
– functional unit latency (time through pipeline)
– dead time or recovery time (time before another vector instruction can start down pipeline)

[Figure: pipeline diagram with one R X X X W row per element. The first element's trip through the pipeline is the functional unit latency; after the last element of the first vector instruction enters the pipeline, several idle cycles of dead time pass before the first element of the second vector instruction may enter.]

Lecture 12, Slide 50

Dead Time and Short Vectors

• Cray C90, two lanes: 4 cycles of dead time between vector instructions. A 128-element vector keeps a functional unit busy for 64 cycles, so maximum efficiency is 64/68 = 94% even with maximum-length vectors.

• T0, eight lanes: no dead time, so 100% efficiency is achieved even with 8-element vectors.


Lecture 12, Slide 51

• Example The Cray C90 has two lanes but requires 4 clock cycles of dead time between any two vector instructions to the same functional unit. For the maximum vector length of 128 elements, what is the reduction in achievable peak performance caused by the dead time? What would be the reduction if the number of lanes were increased to 16?

• Answer A maximum length vector of 128 elements is divided over the two lanes and occupies a vector functional unit for 64 clock cycles. The dead time adds another 4 cycles of occupancy, reducing the peak performance to 64/(64 + 4) = 94.1% of the value without dead time.

• If the number of lanes is increased to 16, maximum length vector instructions will occupy a functional unit for only 128/16 = 8 cycles, and the dead time will reduce peak performance to 8/(8 + 4) = 66.6% of the value without dead time.
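The arithmetic generalizes to a one-line model (my sketch): a vector of length n on L lanes occupies a unit for n/L cycles, and each instruction adds D cycles of dead time.

/* Fraction of peak performance sustained with dead time. */
double efficiency(int n, int lanes, int dead) {
    double busy = (double)n / lanes; /* cycles of useful occupancy */
    return busy / (busy + dead);
}
/* efficiency(128, 2, 4)  = 64/68 = 0.941  (the Cray C90 case above)
   efficiency(128, 16, 4) = 8/12  = 0.667  (the 16-lane variant)   */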

Lecture 12, Slide 52

7. Performance of Vector Processors

Vector Execution Time

The execution time of a sequence of vector operations primarily depends on three factors:

• the length of the operand vectors

• structural hazards among the operations

• data dependences


Lecture 12, Slide 53

Convoy and Chime

• Convoy is the set of vector instructions that could potentially begin execution together in one clock period.

– The instructions in a convoy must not contain any structural or data hazards; if such hazards were present, the instructions in the potential convoy would need to be serialized and initiated in different convoys.

• A chime is the unit of time taken to execute one convoy.

– A chime is an approximate measure of execution time for a vector sequence; a chime measurement is independent of vector length.

– A vector sequence that consists of m convoys executes in m chimes, and for a vector length of n, this is approximately m × n clock cycles.

Lecture 12, Slide 54

• Example: Show how the following code sequence lays out in convoys, assuming a single copy of each vector functional unit:

LV      V1,Rx    ;load vector X
MULVS.D V2,V1,F0 ;vector-scalar multiply
LV      V3,Ry    ;load vector Y
ADDV.D  V4,V2,V3 ;add
SV      Ry,V4    ;store the result

• How many chimes will this vector sequence take?

• How many cycles per FLOP (floating-point operation) are needed, ignoring vector instruction issue overhead?

Lecture 12, Slide 55

• Answer: The first convoy is occupied by the first LV instruction. The MULVS.D is dependent on the first LV, so it cannot be in the same convoy. The second LV instruction can be in the same convoy as the MULVS.D. The ADDV.D is dependent on the second LV, so it must come in yet a third convoy, and finally the SV depends on the ADDV.D, so it must go in a following convoy.

1. LV
2. MULVS.D LV
3. ADDV.D
4. SV

• The sequence requires four convoys and hence takes four chimes. Since the sequence takes a total of four chimes and there are two floating-point operations per result, the number of cycles per FLOP is 2 (ignoring any vector instruction issue overhead).

Lecture 12, Slide 56

• The most important source of overhead ignored by the chime model is vector start-up time.

• The start-up time comes from the pipelining latency of the vector operation and is principally determined by how deep the pipeline is for the functional unit used.

Start-up overhead:

Unit                Start-up overhead (cycles)
Load and store unit 12
Multiply unit       7
Add unit            6

Lecture 12, Slide 57

• Example: Assume the start-up overhead for functional units is as shown in the table on the previous page. Show the time that each convoy can begin and the total number of cycles needed. How does the time compare to the chime approximation for a vector of length 64?

• Answer: The convoys execute back to back: each begins when the previous convoy's last result completes, contributing n element cycles plus its start-up latency. The start-up latencies are 12 for the first LV, 12 for the MULVS.D + LV convoy (limited by the load), 6 for the ADDV.D, and 12 for the SV, or 42 cycles in total, so the sequence takes 4n + 42 clock cycles.

• The time per result for a vector of length 64 is 4 + (42/64) = 4.65 clock cycles, while the chime approximation would be 4.

Lecture 12, Slide 58

Running Time of a Strip-mined Loop

There are two key factors that contribute to the running time of a strip-mined loop consisting of a sequence of convoys:

1. The number of convoys in the loop, which determines the number of chimes. We use the notation Tchime for the execution time in chimes.

2. The overhead for each strip-mined sequence of convoys. This overhead consists of the cost of executing the scalar code for strip-mining each block, Tloop, plus the vector start-up cost for each convoy, Tstart.

• The total running time for a vector sequence operating on a vector of length n is:

Tn = ⌈n / MVL⌉ × (Tloop + Tstart) + n × Tchime

Lecture 12, Slide 59

• Example: What is the execution time on VMIPS for the vector operation A = B × s, where s is a scalar and the length of the vectors A and B is 200?

• Answer:
– Assume the addresses of A and B are initially in Ra and Rb, s is in Fs, and recall that for MIPS (and VMIPS) R0 always holds 0.
– The first iteration of the strip-mined loop will execute for a vector length of (200 mod 64) = 8 elements, and the following iterations will execute for a vector length of 64 elements.
– Since the vector length is either 8 or 64, we increment the address registers by 8 × 8 = 64 bytes after the first segment and by 8 × 64 = 512 bytes for later segments. The total number of bytes in the vector is 8 × 200 = 1600, and we test for completion by comparing the address of the next vector segment to the initial address plus 1600.

Lecture 12, Slide 60

Here is the actual code:

      DADDUI  R2,R0,#1600 ;total # bytes in vector
      DADDU   R2,R2,Ra    ;address of the end of A vector
      DADDUI  R1,R0,#8    ;load length of 1st segment
      MTC1    VLR,R1      ;load vector length in VLR
      DADDUI  R1,R0,#64   ;length in bytes of 1st segment
      DADDUI  R3,R0,#64   ;vector length of other segments
Loop: LV      V1,Rb       ;load B
      MULVS.D V2,V1,Fs    ;vector * scalar
      SV      Ra,V2       ;store A
      DADDU   Ra,Ra,R1    ;address of next segment of A
      DADDU   Rb,Rb,R1    ;address of next segment of B
      DADDUI  R1,R0,#512  ;load byte offset of next segment
      MTC1    VLR,R3      ;set length to 64 elements
      DSUBU   R4,R2,Ra    ;at the end of A?
      BNEZ    R4,Loop     ;if not, go back

Lecture 12, Slide 61

• The three vector instructions in the loop are dependent and must go into three convoys, hence Tchime = 3.

• Use our basic formula with n = 200, MVL = 64, and the per-segment scalar overhead Tloop = 15:

T200 = ⌈200/64⌉ × (Tloop + Tstart) + 200 × Tchime

• The value of Tstart is the sum of the three convoys' start-up latencies: Tstart = 12 + 7 + 12 = 31 (load, multiply, store).

• So, the overall value becomes T200 = 4 × (15 + 31) + 200 × 3 = 184 + 600 = 784.

• The execution time per element with all start-up costs is then 784/200 = 3.9, compared with a chime approximation of three.
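The whole model fits in a few lines of C (a sketch using the constants above; Tloop = 15 is the scalar strip-mining overhead per segment):

/* Running-time model: Tn = ceil(n/MVL) * (Tloop + Tstart) + n * Tchime */
int vector_time(int n, int mvl, int t_loop, int t_start, int t_chime) {
    int strips = (n + mvl - 1) / mvl; /* ceil(n / MVL) strip-mined segments */
    return strips * (t_loop + t_start) + n * t_chime;
}
/* vector_time(200, 64, 15, 31, 3) = 4 * 46 + 600 = 784 cycles,
   i.e., 3.9 cycles per element versus the 3-chime approximation. */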