Advanced Computer Architecture: Data-Level Parallel Architectures


Advanced Computer Architecture

Data-Level Parallel Architectures
Course 5MD00

Henk Corporaal
December 2013
[email protected]

This lecture
Data-level parallel architectures:
- vector machine
- SIMD (sub-word parallelism support)
- GPU

Material: Hennessy & Patterson, Chapter 4 (sections 4.1-4.7); extra material: Appendix G

Data Parallelism
Vector operations: multiple data elements per operation, e.g.
  ADDV V1, V2, V3   // forall i: V1[i] = V2[i] + V3[i]
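For comparison, the scalar C loop that this single vector add replaces (a minimal sketch; the 64-element length and double type are assumptions matching the VMIPS registers introduced later):

    #define VL 64                                  /* assumed vector length */

    /* Scalar equivalent of ADDV V1, V2, V3:
       one add per element, versus one vector instruction in total. */
    void vadd(double V1[VL], const double V2[VL], const double V3[VL]) {
        for (int i = 0; i < VL; i++)
            V1[i] = V2[i] + V3[i];
    }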

Executed using either:
- a highly pipelined (fast-clocked) function unit (FU): vector architecture, or
- multiple FUs acting in parallel: SIMD architecture
[Figure: execution over time, SIMD architecture vs. vector architecture]

SIMD vs MIMD
SIMD architectures can exploit significant data-level parallelism for:
- matrix-oriented scientific computing
- media-oriented image and sound processors

SIMD is more energy-efficient than MIMD:
- only needs to fetch and decode one instruction per data operation (see the sketch after this slide)
- makes SIMD attractive for personal mobile devices

SIMD allows programmer to continue to think sequentially

MIMD is more generic: why?
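As a concrete example of the sub-word parallelism support mentioned in the lecture overview, here is a minimal sketch using x86 SSE intrinsics (the choice of SSE is an assumption; the slides do not mandate a particular ISA). One fetched and decoded add instruction covers four data operations:

    #include <xmmintrin.h>        /* SSE intrinsics; assumes an x86 target */

    /* dst[i] = a[i] + b[i] for n floats, four elements per add instruction. */
    void add_f32(float *dst, const float *a, const float *b, int n) {
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);             /* load 4 packed floats */
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));  /* one add, 4 lanes */
        }
        for (; i < n; i++)                               /* scalar tail */
            dst[i] = a[i] + b[i];
    }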

SIMD & MIMD speedup
[Figure: potential speedup of SIMD and MIMD over time]
Assumptions: +2 MIMD cores every 2 years; SIMD width doubling every 4 years

Vector Architectures
Basic idea:
- read sets of data elements into vector registers
- operate on those registers
- disperse the results back into memory

Registers are controlled by the compiler:
- used to hide memory latency, by loading data early (many cycles before their use)
- leverage memory bandwidth

Example architecture: VMIPS
Loosely based on the Cray-1.
- Vector registers: each register holds a 64-element vector, 64 bits/element; the register file has 16 read and 8 write ports
- Vector functional units: fully pipelined; data and control hazards are detected
- Vector load-store unit: fully pipelined; one word per clock cycle after the initial latency
- Scalar registers: 32 general-purpose registers and 32 floating-point registers

[Photo: Cray-1, 1976]

VMIPS Instructions
- ADDVV.D: add two vectors
- ADDVS.D: add vector to a scalar
- LV / SV: vector load and vector store from an address
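To make the semantics concrete, here is a sketch of these instructions written as C loops over 64-element registers (purely illustrative; the global register variables and function names are assumptions, not part of VMIPS or a real simulator):

    #define MVL 64                              /* VMIPS maximum vector length */

    double V1[MVL], V2[MVL], V3[MVL];           /* vector registers (illustrative) */
    double F0;                                  /* a scalar floating-point register */

    /* ADDVV.D V1, V2, V3 : element-wise vector + vector */
    void addvv_d(void) { for (int i = 0; i < MVL; i++) V1[i] = V2[i] + V3[i]; }

    /* ADDVS.D V1, V2, F0 : vector + scalar */
    void addvs_d(void) { for (int i = 0; i < MVL; i++) V1[i] = V2[i] + F0; }

    /* LV V1, Rx : load 64 consecutive doubles starting at address Rx */
    void lv(const double *Rx) { for (int i = 0; i < MVL; i++) V1[i] = Rx[i]; }

    /* SV Rx, V1 : store V1 to 64 consecutive doubles starting at address Rx */
    void sv(double *Rx) { for (int i = 0; i < MVL; i++) Rx[i] = V1[i]; }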

Example: DAXPY (double-precision a*X + Y), the inner loop of Linpack.
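In C, the loop being vectorized looks roughly like this (a sketch; 64-element arrays are assumed so that one pass fills the vector registers exactly):

    /* DAXPY: Y = a*X + Y in double precision */
    void daxpy(double a, const double X[64], double Y[64]) {
        for (int i = 0; i < 64; i++)
            Y[i] = a * X[i] + Y[i];
    }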

  L.D      F0,a        ; load scalar a
  LV       V1,Rx       ; load vector X
  MULVS.D  V2,V1,F0    ; vector-scalar multiply
  LV       V3,Ry       ; load vector Y
  ADDVV.D  V4,V2,V3    ; add
  SV       Ry,V4       ; store the result

Requires 6 instructions, vs. almost 600 for MIPS.

Vector Execution Time
Execution time depends on three factors:
- length of the operand vectors
- structural hazards
- data dependencies

VMIPS functional units consume one element per clock cycle, so execution time is approximately the vector length: T_exec ≈ VL.

Convoy: a set of vector instructions that could potentially execute together.

Chimes
Sequences with read-after-write dependency hazards can be in the same convoy via chaining.

Chaining: allows a vector operation to start as soon as the individual elements of its vector source operand become available.

Chime: unit of time to execute one convoy; m convoys execute in m chimes. For a vector length of n, this requires about m x n clock cycles.

Example
  LV       V1,Rx       ; load vector X
  MULVS.D  V2,V1,F0    ; vector-scalar multiply
  LV       V3,Ry       ; load vector Y
  ADDVV.D  V4,V2,V3    ; add two vectors
  SV       Ry,V4       ; store the sum

Convoys:
  1: LV, MULVS.D
  2: LV, ADDVV.D
  3: SV

3 chimes, 2 FP ops per result, so cycles per FLOP = 1.5. For 64-element vectors, this requires 64 x 3 = 192 clock cycles (a sketch of this timing model follows below).

Challenges
Start-up time: the latency of the vector functional unit. Assume the same latencies as the Cray-1:
- floating-point add: 6 clock cycles
- floating-point multiply: 7 clock cycles
- floating-point divide: 20 clock cycles
- vector load: 12 clock cycles
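A back-of-the-envelope version of that timing model in C (the chime count is from the slide; adding each convoy's start-up latency once is a rough correction assumed here, not a figure quoted above):

    #include <stddef.h>

    /* Chime model: m convoys over n-element vectors take about m*n cycles.
       If startup is non-NULL, each convoy's start-up latency is added once. */
    int vector_cycles(int n, int m, const int *startup) {
        int cycles = m * n;
        for (int c = 0; startup != NULL && c < m; c++)
            cycles += startup[c];
        return cycles;
    }

    /* DAXPY example above: vector_cycles(64, 3, NULL) == 192 clock cycles,
       matching the 64 x 3 figure on the slide. */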

Improvements:
- more than 1 element per clock cycle
- non-64-wide vectors
- IF statements in vector code
- memory system optimizations to support vector processors
- multi-dimensional matrices
- sparse matrices
- programming a vector computer

Multiple Lanes
Element n of vector register A is hardwired to element n of vector register B; this allows for multiple hardware lanes.
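Conceptually, with k lanes each lane owns every k-th element of every vector register, so an element-wise operation never communicates across lanes. A C sketch of that partitioning (illustrative only; the lane count and names are assumptions):

    #define MVL    64
    #define NLANES 4                     /* assumed number of hardware lanes */

    /* Lane `lane` handles elements lane, lane+NLANES, lane+2*NLANES, ...
       In hardware the lanes operate in parallel; the outer loop is sequential
       here only because this is plain C. */
    void addvv_lanes(double V1[MVL], const double V2[MVL], const double V3[MVL]) {
        for (int lane = 0; lane < NLANES; lane++)
            for (int i = lane; i < MVL; i += NLANES)
                V1[i] = V2[i] + V3[i];
    }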

Vector Length Register
Vector length not known at compile time? Use the Vector Length Register (VLR), and use strip mining for vectors longer than the maximum vector length (MVL):

  low = 0;
  VL = (n % MVL);                    /* find the odd-size piece using modulo */
  for (j = 0; j <= (n/MVL); j = j+1) {
      for (i = low; i < (low+VL); i = i+1)  /* runs for length VL */
          Y[i] = a * X[i] + Y[i];           /* the main operation */
      low = low + VL;                /* start of the next vector */
      VL = MVL;                      /* reset the length to the maximum */
  }

Loop-Level Parallelism
Example 1:
  for (i = 999; i >= 0; i = i-1)
      x[i] = x[i] + s;

This loop has no loop-carried dependence.

Loop-Level Parallelism
Example 2:
  for (i = 0; i < 100; i = i+1) {
      A[i+1] = A[i] + C[i];      /* S1 */
      B[i+1] = B[i] + A[i+1];    /* S2 */
  }