TRANSCRIPT
Computer Architectures ... High Performance Computing I
Fall 2001, MAE609/MTH667
Abani Patra
AP:Lec01 2
Microprocessor Basic Architecture
CISC vs. RISC
Superscalar
EPIC
AP:Lec01 3
Performance Measures
Floating Point Operations Per Second (FLOPS)
1 MFLOPS; workstations with 1 GFLOPS readily available
HPC: 1 TFLOPS is the best now!! 1 PFLOPS ... 2010??
AP:Lec01 4
Performance
Ttheor: theoretical peak performance; obtained by multiplying the clock rate by the no. of CPUs and the no. of FPUs/CPU (see the worked example below)
Treal: real performance on some specific operation, e.g. vector add and multiply
Tsustained: sustained performance on an application, e.g. CFD
Tsustained << Treal << Ttheor
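As a worked illustration of the peak-performance formula (the machine parameters below are assumed for illustration, not taken from the slides):

\[ T_{\mathrm{theor}} = f_{\mathrm{clock}} \times N_{\mathrm{CPU}} \times N_{\mathrm{FPU/CPU}} \times (\text{results per FPU per cycle}) \]

e.g. a hypothetical 4-CPU machine at 500 MHz with 2 FPUs per CPU, each delivering one result per cycle:

\[ T_{\mathrm{theor}} = 500 \times 10^{6} \times 4 \times 2 \times 1 = 4\ \mathrm{GFLOPS} \]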
AP:Lec01 5
Performance
Performance degrades if the CPU has to wait for data to operate on.
Fast CPU => need adequately fast memory.
Rule of thumb: memory in MB = Ttheor in MFLOPS.
AP:Lec01 6
Making a Supercomputer Faster
Reduce cycle time
Pipelining: instruction pipelines, vector pipelines
Internal parallelism: superscalar, EPIC
External parallelism
AP:Lec01 7
Making a Supercomputer Faster
Reduce cycle time: increase the clock rate. Limited by semiconductor manufacturing! Current generation 1-2 GHz (immediate future ~10 GHz).
Pipelining: fine subdivision of an operation into sub-operations, leading to a shorter cycle time but a larger start-up time.
AP:Lec01 8
Pipelining
Instruction Pipelining
• 4-stage instruction pipeline: Fetch Ins, Fetch Data, Execute, Store
• 3 instructions A, B, C
• 4 cycles needed by each instruction
• one result per cycle after the pipe is "full" -- start-up time

  stage \ cycle   1   2   3   4   5   6
  Fetch Ins       A   B   C
  Fetch Data          A   B   C
  Execute                 A   B   C
  Store                       A   B   C
AP:Lec01 9
Pipelining
Almost all current computers use some pipelining, e.g. the IBM RS6000.
The speedup of instruction pipelining cannot always be achieved!! The next instruction may not be known until execution (e.g. a branch), or the data for execution may not be available.
AP:Lec01 10
Vector Pipelines
Effective for operations like
      do 10 I=1,1000
   10 c(I)=a(I)*b(I)
i.e. the same instruction executed 1000 times with different data.
Using a "vector pipe" the whole loop becomes one vector instruction. Cray XMP, YMP, T90 ...
AP:Lec01 11
Vector Pipelining
For some operations like a(I) = b(I) + c(I)*d(I), the results of the multiply pipeline are chained into the addition pipeline.
Disadvantages: start-up time of the vector pipes; code has to be vectorized; loops have to be blocked into vector lengths.
AP:Lec01 12
Internal Parallelism
Use multiple functional units per processor:
  Cray T90 has 2-track vector units; NEC SX4, Fujitsu VPP300 -- 8-track vector units
  superscalar, e.g. the IBM RS6000 POWER2 uses 2 arithmetic units
  EPIC
Need to provide data to multiple functional units => fast memory access.
The limiting factor is memory-processor bandwidth.
AP:Lec01 13
External Parallelism
Use multiple processors.
Shared Memory (SMP: Symmetric Multi-Processors)
  many processors accessing the same memory
  limited by memory-processor bandwidth
  SUN Ultra2, SGI Octane, SGI Onyx, Compaq ...
[Figure: CPU 0, CPU 1, ... sharing access to memory banks]
AP:Lec01 14
External Parallelism
Distributed Memory
  many processors, each with local memory, and some type of high-speed interconnect
  e.g. IBM SPx, Cray T3E, networks of workstations, Beowulf clusters of Pentium PCs
[Figure: CPU 0, CPU 1, ... each with its own local memory, connected by an interconnection network]
AP:Lec01 15
External Parallelism
SMP Clusters
  nodes with multiple processors that share a local memory; nodes connected by an interconnect
  "best of both?"
AP:Lec01 16
Classification of Computers
Hardware
  SISD (Single Instruction Single Data)
  SIMD (Single Instruction Multiple Data)
  MIMD (Multiple Instruction Multiple Data)
Programming Model
  SPMD (Single Program Multiple Data)
  MPMD (Multiple Program Multiple Data)
AP:Lec01 17
Hardware Classification
SISD (Single Instruction Single Data)
  classical scalar/vector computer -- one instruction, one datum
  superscalar -- instructions may run in parallel
SIMD (Single Instruction Multiple Data)
  vector computers
  Data Parallel -- Connection Machine etc.; extinct now
AP:Lec01 18
Hardware Classification
MIMD (Multiple Instruction Multiple Data)
  the usual parallel computer
  each processor executes its own instructions on different data streams
  synchronization is needed to get meaningful results
AP:Lec01 19
Programming Model
SPMD (Single Program Multiple Data)
  a single program is run on all processors with different data
  each processor knows its ID -- thus constructs like
      if (procID .eq. N) then
         ....
      else
         ....
      endif
  can be used for program control (see the sketch below)
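A minimal SPMD sketch using MPI (introduced later in this lecture as the standard message-passing interface); this is illustrative only -- the master/worker roles and printed messages are assumptions, not taken from the slides:

      program spmd_sketch
c     The same program runs on every processor; each one learns its
c     own ID (rank) and branches on it, as in the construct above.
      include 'mpif.h'
      integer ierr, myid, nprocs
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      if (myid .eq. 0) then
         print *, 'process 0 of ', nprocs, ': doing master work'
      else
         print *, 'process ', myid, ': doing worker work'
      endif
      call MPI_FINALIZE(ierr)
      end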
AP:Lec01 20
Programming Model
MPMD (Multiple Program Multiple Data)
  different programs run on different processors
  usually a master-slave model is used
AP:Lec01 21
Topologies/Interconnects
  Hypercube
  Torus
Prototype Supercomputers and Bottlenecks
AP:Lec01 23
Types of Processors/Computers used in HPC
Prototype processors
  Vector Processors
  Superscalar Processors
Prototype Parallel Computers
  Shared Memory
    without cache
    with cache (SMP)
  Distributed Memory
AP:Lec01 24
Vector Processors
AP:Lec01 25
Vector Processors
Components
  Vector registers
  ADD/Logic pipelines and MULTIPLY pipelines
  Load/Store pipelines
  Scalar registers + pipelines
AP:Lec01 26
Vector Registers
  Finite length of the vector registers: 32/64/128 etc.
  Strip mining to operate on longer vectors (see the sketch below)
  Codes often manually restructured into vector-length loops
  Sawtooth performance curve -- maxima at multiples of the vector length
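A minimal strip-mining sketch (illustrative; the vector length of 64 and the array size are assumptions, not values given in the slides):

      program stripmine
c     Strip mining: a long loop is processed in chunks no longer than
c     the hardware vector length so each chunk fits the registers.
      integer n, nvl, i, j
      parameter (n = 1000, nvl = 64)
      real a(n), b(n), c(n)
      do i = 1, n
         a(i) = real(i)
         b(i) = 2.0
      end do
      do j = 1, n, nvl
c        this inner loop handles at most one vector length of data
         do i = j, min(j + nvl - 1, n)
            c(i) = a(i)*b(i)
         end do
      end do
      print *, c(n)
      end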
AP:Lec01 27
Vector Processors
  Memory-processor bandwidth: performance depends completely on keeping the vector registers supplied with operands from memory
  Size of main memory and extended memory: the bandwidth of main memory is much higher, but main memory is more expensive; the size determines the size of problem that can be run
  Scalar registers/scalar processors for scalar instructions
  I/O goes through a special processor -- the T90 can produce data at 14400 MB/s while a disk delivers 20 MB/s, so a single word of disk I/O can cost about 720 cycles on the Cray T90!!
AP:Lec01 28
Superscalar Processor
Used in workstations and in the nodes of parallel supercomputers
AP:Lec01 29
Superscalar Processor
Main components:
  multiple ALUs and FPUs
  data and instruction caches
Superscalar because the ALUs and FPUs can operate in parallel, producing more than one result per cycle
  e.g. the IBM POWER2 -- its 2 FPUs/ALUs can operate in parallel, producing up to 4 results per cycle if the operands are in registers
AP:Lec01 30
Superscalar Processor
RISC architecture operating at very high clock speeds (>1 GHz now -- more in a year)
The processor works only on data in registers, which come only from and go only to the data cache. If data is not in cache -- a "cache miss" -- the processor is idle while another cache line (4-16 words) is fetched from memory!!
AP:Lec01 31
Superscalar Processor
Large off-chip Level 2 caches help with data availability. L1 cache data is accessed in 1-2 cycles, L2 cache in 3-4 cycles, and memory can take 8 times that!
Efficiency is directly related to the reuse of data in cache
Remedies: blocked algorithms, contiguous storage, avoiding strides and random/non-deterministic access
AP:Lec01 32
Superscalar Processor
Remedies:
  Blocked algorithms, e.g. restructure
      do i = 1, 1000
         a(i) = ....
      end do
  into cache-sized blocks:
      do j = 1, 20
         do i = (j-1)*50 + 1, j*50
            a(i) = ....
         end do
      end do
  Contiguous storage; avoid strides and random/non-deterministic access such as
      a(ix(i)) = ...
AP:Lec01 33
Superscalar Processors
  Memory bandwidth is critical to performance
  Many engineering applications are difficult to optimize for cache efficiency
  Application efficiency => memory bandwidth
  The size of memory determines the size of problem that can be solved
  DMA (direct memory access) channels take memory-access duties for external requests (I/O, remote processor requests) away from the CPU
AP:Lec01 34
Shared Memory Parallel Computer
  Memory in banks is accessed equally, through a switch (crossbar), by the processors (usually vector processors)
  The processors run "p" independent tasks, possibly with shared data
  Usually compilers and preprocessors can extract the fine-grained parallelism available (see the sketch below)
[Figure: shared memory computer, e.g. the Cray T90 -- processors P1, P2, P3, ... connected through a switch to shared memory banks]
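The slides do not show how this parallelism is expressed; as one illustrative possibility (an assumption, not the course's prescribed method), an OpenMP-style directive on a Fortran loop lets the compiler spread the iterations over the processors of a shared-memory machine:

      program shared_loop
c     Illustrative only: the parallel-do directive marks a loop whose
c     iterations may run concurrently on a shared-memory machine.
      integer n, i
      parameter (n = 1000)
      real a(n), b(n), c(n)
      do i = 1, n
         a(i) = real(i)
         b(i) = 2.0
      end do
c$omp parallel do
      do i = 1, n
         c(i) = a(i)*b(i)
      end do
c$omp end parallel do
      print *, c(n)
      end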
AP:Lec01 35
Shared Memory Parallel ...
  Memory contention and bandwidth limit the number of processors that may be connected
  Memory contention can be reduced by increasing the number of banks and reducing the bank busy time (bbt)
  This type of parallel computer is closest in programming model to the general-purpose single-processor computer
AP:Lec01 36
Symmetric Multiprocessors (SMP)
  Processors are usually superscalar -- SUN Ultra, MIPS R10000 -- with large caches
  A bus/crossbar is used to connect to the memory modules
  With a bus, only 1 processor can access memory at a time
[Figure: SMP computer -- processors P1, P2, P3, ... each with a cache c1, c2, c3, ..., connected by a bus/crossbar to memory modules M1, M2, M3, ...]
Sun Ultra Enterprise 10000, SGI Power Challenge
AP:Lec01 37
Symmetric Multi-Processors
  If the interconnect is shared, there will be memory contention
  Data flows from memory to cache to processors
  Cache coherence: if a piece of data is changed in one cache, then all other caches that contain that data must update their value. Hardware and software must take care of this.
AP:Lec01 38
Symmetric Multi-Processors
  Performance depends dramatically on the reuse of data in cache; fetching data from the larger memory, with potential memory contention, can be expensive!
  Caches and cache lines are also bigger
  The large L2 cache really plays the role of local fast memory, with the memory banks acting more like extended memory accessed in blocks
AP:Lec01 39
Distributed Memory Parallel Computer
  Prototype DMP: the processors are superscalar RISC with only LOCAL memory
  Each processor can only work on data in its local memory
  Communication is required for access to remote memory
[Figure: processor-memory (P-M) pairs connected by a communication network]
IBM SP, Intel Paragon, SGI Origin2000
AP:Lec01 40
Distributed Memory Parallel Computer
  Problems need to be broken up into independent tasks with independent memory -- this naturally matches a data-based decomposition of the problem using an "owner computes" rule
  Parallelization is mostly at a high granularity level controlled by the user -- difficult for compilers/automatic parallelization tools
  These computers are scalable to very large numbers of processors
AP:Lec01 41
Distributed Memory Parallel Computer
Hybrid Parallel Computer
  NUMA: non-uniform memory access based classification
  Intel Paragon (1st teraflop machine, had 4 Pentiums per node with a bus)
  The HP Exemplar has a bus at the node
[Figure: nodes of processor-memory (P-M) pairs on a local bus, with the node buses connected by a communication network]
AP:Lec01 42
Distributed Memory Parallel Computer
Semi-autonomous memory
  A processor can access remote memory using memory control units (MCU)
  Cray T3E and SGI Origin 2000
[Figure: processor-memory pairs, each with an MCU, connected by a communication network]
AP:Lec01 43
Distributed Memory Parallel Computer
Fully autonomous memory
  Memories and processors are equally distributed over the network
  The Tera MTA is the only example
  Latency and data transfer from memory are at the speed of the network!
[Figure: memories and processors attached separately to the communication network]
AP:Lec01 44
Accessing Distributed Memory
Message Passing
  The user transfers all data using explicit send/receive instructions (see the sketch below)
  Synchronous message passing can be slow
  Programming with a NEW programming model! The user must optimize communication
  Asynchronous/one-sided get and put are faster but need more care in programming
  Codes used to be machine specific -- Intel NEXUS etc. -- until standardized as PVM (Parallel Virtual Machine) and subsequently MPI (Message Passing Interface)
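A minimal MPI point-to-point sketch in Fortran (illustrative; it assumes at least two processes, and the message value and tag are arbitrary):

      program ping
c     Rank 0 sends one double precision value to rank 1 using an
c     explicit, user-written send/receive pair.
      include 'mpif.h'
      integer ierr, myid, status(MPI_STATUS_SIZE)
      double precision x
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      if (myid .eq. 0) then
         x = 3.14d0
         call MPI_SEND(x, 1, MPI_DOUBLE_PRECISION, 1, 99,
     &                 MPI_COMM_WORLD, ierr)
      else if (myid .eq. 1) then
         call MPI_RECV(x, 1, MPI_DOUBLE_PRECISION, 0, 99,
     &                 MPI_COMM_WORLD, status, ierr)
         print *, 'rank 1 received ', x
      endif
      call MPI_FINALIZE(ierr)
      end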
AP:Lec01 45
Accessing Distributed Memory
Global distributed memory
  Physically distributed and globally addressable -- Cray T3E / SGI Origin 2000
  The user formally accesses remote memory as if it were local -- the operating system/compilers translate such accesses into fetches/stores over the communication network
  High Performance Fortran (HPF) -- a software realization of distributed memory -- arrays etc. can be distributed when declared, using compiler directives (see the sketch below). The compiler translates remote memory accesses into the appropriate calls (message passing / OS calls, as supported by the hardware).
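A small HPF sketch (illustrative; the array size and block distribution are assumptions): the directives are comments to an ordinary compiler, but an HPF compiler uses them to spread the arrays over the processors and to generate any remote accesses itself.

      program hpf_sketch
      integer n, i
      parameter (n = 1000)
      real a(n), b(n)
c     Ask the HPF compiler to distribute both arrays block-wise.
!HPF$ DISTRIBUTE a(BLOCK)
!HPF$ DISTRIBUTE b(BLOCK)
      do i = 1, n
         b(i) = 1.0
      end do
c     Each processor computes the part of a(.) it owns;
c     any remote element accesses are generated by the compiler.
      do i = 1, n
         a(i) = 2.0*b(i)
      end do
      end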
AP:Lec01 46
Processor interconnects/topologies
Buses
  lower cost -- but only one pair of devices (processors/memories etc.) can communicate at a time
  e.g. Ethernet used to link workstation networks
Switches
  like the telephone network -- can sustain many simultaneous communications; higher cost!
  the critical measure is bisection bandwidth -- how much data can be passed between the two halves of the machine (see the note below)
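As a rough note on bisection bandwidth (standard textbook figures, stated here as an illustration rather than taken from the slides): cut the machine into two equal halves and count the links crossing the cut,

\[ B_{\mathrm{bisection}} = (\text{number of links cut}) \times (\text{bandwidth per link}) \]

e.g. a ring of p processors is cut by only 2 links, while a hypercube of p processors is cut by p/2 links.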
AP:Lec01 47
Processor interconnects/topologies (figure)
AP:Lec01 48
Processor interconnects/topologies (figure)
AP:Lec01 49
Processor interconnects/topologies
Workstation network on Ethernet
  very high latency -- the processors must participate in the communication
AP:Lec01 50
Processor interconnects/topologies
1D and 2D meshes and rings/tori
AP:Lec01 51
Processor interconnects/topologies
3D meshes and tori
AP:Lec01 52
Processor interconnects/topologies
d-dimensional hypercubes
AP:Lec01 53
Processor Scheduling
Space sharing
  processor banks of 4/8/16 etc. assigned to users for specific times
Time sharing on processor partitions
Livermore Gang Scheduling
AP:Lec01 54
IBM RS/6000 SP
• Distributed Memory Parallel Computer
• An assembly of workstations linked by an HPS (a crossbar-type switch)
• Comes with a choice of processors -- POWER2 (variants), POWER3, and clusters of PowerPC (also used by the Apple G3, G4 etc.)
AP:Lec01 55
POWER2 Processor
Different versions -- with different frequencies, cache sizes and bandwidths
AP:Lec01 56
POWER2 ARCHITECTURE
AP:Lec01 57
POWER2
  Dual fixed-point and floating-point units -- a multiply/add in each
  Max. 4 floating point results per cycle
  The ICU (with a 32 KB instruction cache) can execute a branch and a condition per cycle
  Per cycle, 8 instructions may be issued and executed -- truly SUPERSCALAR!
AP:Lec01 58
Wide 77 Node Performance
  Theoretical peak performance (77 MHz clock): 2 * 77 = 154 MFLOPS for a dyad, 4 * 77 = 308 MFLOPS for a triad
  Cache effects dominate performance
  256 KB cache, with a 256-bit path to cache and from cache to memory -- 2 words (8 bytes each) may be fetched and 2 words stored per cycle
AP:Lec01 59
Expected Performance
  For a dyad a(i) = b(i)*c(i) or a(i) = b(i)+c(i) -- 2 loads and 1 store per result, i.e. 6 memory references per cycle to feed the 2 FPUs, but only 4 are available:
      (2*77)*(4/6) = 102.7 MFLOPS
  For a linked triad a(i) = b(i) + s*c(i) (2 loads, 1 store):
      (4*77)*(4/6) = 205.3 MFLOPS
  For a vector triad a(i) = b(i) + c(i)*d(i) (3 loads, 1 store):
      (4*77)*(4/8) = 154 MFLOPS
  (The pattern behind these estimates is summarized below.)
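Restating the arithmetic above as a single expression (this only summarizes the slide's estimates; no new data is introduced):

\[ T_{\mathrm{expected}} \approx T_{\mathrm{theor}} \times \frac{\text{memory references available per cycle}}{\text{memory references required per cycle}} \]

where 4 references per cycle are available here (2 loads + 2 stores over the 256-bit paths), and the required count is the loads and stores per result times the results produced per cycle.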
AP:Lec01 60
Cache Hit/Miss
  The performance numbers above assumed that the data was available in cache
  If data is not in cache, it must be fetched from memory in cache lines of 256 bytes each, at a much slower pace
AP:Lec01 61
AP:Lec01 62
TERM PAPER
Based on the analysis of the POWER2 processor and the IBM SP presented here, prepare a similar analysis (including estimates of performance) for the new POWER4 chip in the IBM SP, or for a cluster of Pentium 4s.