Nodes and Networks for HPC computing


Page 1: Nodes and Networks for HPC computing

21/12/05 r.innocente 1

Nodes and Networks hardware

Roberto Innocente, [email protected]

Kumasi, GHANA
Joint KNUST/ICTP school on HPC on Linux clusters

Page 2: Nodes and Networks for HPC computing

21/12/05 r.innocente 2

Overview

• Nodes:
– CPU

– Processor Bus

– I/O bus

• Networks:
– boards

– switches

Page 3: Nodes and Networks for HPC computing

21/12/05 r.innocente 3

Computer families/1

• CISC (Complex Instruction Set Computer)
– large set of complex instructions (VAX, x86)
– many instructions can have operands in memory
– variable-length instructions
– many instructions require many cycles to complete

• RISC (Reduced Instruction Set Computer)
– small set of simple instructions (MIPS, Alpha)
– also called a Load/Store architecture, because operations are done in registers and only load/store instructions can access memory
– instructions are typically hardwired and require few cycles
– instructions have fixed length, so they are easy to parse

• VLIW (Very Long Instruction Word computers)
– a fixed number of instructions (3-5) scheduled by the compiler to be executed together
– Itanium (IA64), Transmeta Crusoe

Page 4: Nodes and Networks for HPC computing

21/12/05 r.innocente 4

Computer families/2

It is very difficult to optimize computers with CISC instruction sets, because it is difficult to predict the effect of the instructions.

For this reason today's high-performance processors have a RISC core even if the external instruction set is CISC.

Starting with the Pentium Pro, Intel x86 processors in fact translate x86 instructions into 1 or more RISC micro-ops (uops).

Page 5: Nodes and Networks for HPC computing

21/12/05 r.innocente 5

Microarchitecture features for HPC

• Superscalar

• Pipelining: super/hyper pipelining

• OOO (Out Of Order) execution

• Branch prediction/speculative execution

• Hyperthreading

• EPIC

Page 6: Nodes and Networks for HPC computing

21/12/05 r.innocente 6

Superscalar

A superscalar CPU has multiple functional execution units and is able to dispatch multiple instructions per cycle (double issue, quad issue, ...).

The Pentium 4 has 7 distinct functional units: Load, Store, 2 double-speed ALUs, a normal-speed ALU, FP, FP Move. It can issue up to 6 uops per cycle (it has 4 dispatch ports, but the 2 simple ALUs are double speed).

The Athlon has 9 functional units. The AMD64/Opteron has a 3-issue core with 11 FUs: 3 integer, 3 address generation, 3 FP, 2 load/store. The IA64/Itanium2 can issue 6 instructions and has 11 FUs: 2 integer, 4 memory, 3 branch, 2 FP.

Page 7: Nodes and Networks for HPC computing

21/12/05 r.innocente 7

Pipelining/1

• The division of the work necessary to execute instructions into stages, to allow more instructions to be in execution at the same time (at different stages)

• Previous-generation architectures had 5/6 stages
• Now there are 10/20 stages; Intel called this superpipelining or hyperpipelining
• The Nocona (EM64T) Pentium 4 has 32 stages
• The Opteron has around 30 stages

Page 8: Nodes and Networks for HPC computing

21/12/05 r.innocente 8

Pipelining/2

Memory hierarchy (registers, caches, memory)

• Instruction Fetch (IF)
• Instruction Decode (ID)
• Execute (EX)
• Data Access (DA)
• Write Back results (WB)

Page 9: Nodes and Networks for HPC computing

21/12/05 r.innocente 9

Pipelining/3

Clock cycles 1-10:

Instr 1: IF ID EX DA WB
Instr 2: IF ID EX DA DA DA DA WB (stalled in DA)
Instr 3: IF ID EX EX DA WB
Instr 4: IF ID ID EX DA
Instr 5: IF IF ID EX DA

On the 5th cycle there are 5 instructions simultaneously executing; the 2nd instruction causes a stall of the pipe during DA.

Page 10: Nodes and Networks for HPC computing

21/12/05 r.innocente 10

P3/P4 superpipelining

picture from Intel

Page 11: Nodes and Networks for HPC computing

21/12/05 r.innocente 11

Opteron pipeline

Page 12: Nodes and Networks for HPC computing

21/12/05 r.innocente 12

Itanium2 pipeline

Page 13: Nodes and Networks for HPC computing

21/12/05 r.innocente 13

Out Of Order (OOO) execution/1

• In-order execution (as stated by the program) under-uses functional units

• Algorithms for out-of-order (OOO) execution of instructions were devised such that:
– data consistency is maintained
– high utilization of the FUs is achieved

• Instructions are executed when the inputs are ready, without regard to the original program order

Page 14: Nodes and Networks for HPC computing

21/12/05 r.innocente 14

OOO/2

• Hazards:
– RAW (Read After Write): result forwarding
– WAW/WAR (Write After Write, Write After Read): skip the previous WB, or use the ROB

• Tomasulo's scheduler (IBM 360/91, 1967)

• Reorder Buffer (ROB): to allow precise interrupts

Page 15: Nodes and Networks for HPC computing

21/12/05 r.innocente 15

OOO/3

Tomasulo additions for OOO execution with ROB

Page 16: Nodes and Networks for HPC computing

21/12/05 r.innocente 16

OOO/4

In the Tomasulo approach, a set of buffers called reservation stations is added to each resource (functional units, load and store units, register file). It is a distributed algorithm: each station is in charge of watching for its operands to become available, snooping on the result bus if necessary, and executing its operation as soon as possible.

• Issue :

• Execute

• WriteBack

Page 17: Nodes and Networks for HPC computing

21/12/05 r.innocente 17

Branch prediction / Speculative execution

To avoid pipeline starvation, the most probable direction of a conditional jump (for which the condition register is not yet ready) is guessed (branch prediction), the following instructions are executed (speculative execution), and their results are stored in hidden registers.

If the condition later turns out as predicted, the instructions are retired and the hidden registers are renamed to the real registers; otherwise we had a misprediction and the pipe following the branch is cleared.

Page 18: Nodes and Networks for HPC computing

21/12/05 r.innocente 18

Hyperthreading/1

• Introduced under this name by Intel on their Xeon Pentium 4 processors. Similar proposals were previously known in the literature as simultaneous multithreading (SMT)

• With about 5% more gates it is claimed to achieve a 20-30% performance increase

• The processor has two copies of the architectural state (registers, IP and so on) and so appears to the software as 2 logical processors

Page 19: Nodes and Networks for HPC computing

21/12/05 r.innocente 19

Hyperthreading/2

• Multithreaded applications gain the most

• The large cache footprint of typical scientific applications makes them benefit much less from this feature

(Diagram: the execution resources are shared between two copies of the architectural state, State1 and State2; logical processor numbering 0-5.)

Page 20: Nodes and Networks for HPC computing

21/12/05 r.innocente 20

Hyperthreading/3

• Different approaches could have been used: partition (fixed assignment to logical processors), full sharing, threshold (the resource is shared but there is an upper limit on its use by each logical processor)

• Intel for the most part used the full sharing policy

Page 21: Nodes and Networks for HPC computing

21/12/05 r.innocente 21

Hyperthreading/4

Pic from intel

Page 22: Nodes and Networks for HPC computing

21/12/05 r.innocente 22

Hyperthreading/5

Operating modes : ST0,ST1,MT

Pic from Intel

Page 23: Nodes and Networks for HPC computing

21/12/05 r.innocente 23

x86 uarchitectures

Intel:

• P6 uarchitecture

• NetBurst

• Nocona (EM64T)

AMD:

• Thunderbird

• Palomino

• AMD64/Hammer

• SledgeHammer

Page 24: Nodes and Networks for HPC computing

21/12/05 r.innocente 24

Intel x86 family

• Pentium III
– Katmai (0.25 µ)
– Coppermine (0.18 µ): 500 MHz - 1.13 GHz
– Tualatin (0.13 µ): 1.13 - 1.33 GHz

• Pentium 4
– Willamette (0.18 µ): 1.4 - 2.0 GHz
– Northwood (0.13 µ): 2.0 - 2.2 GHz
– Gallatin (0.13 µ): 3.4 GHz
– Prescott (90 nm): 2.8 GHz
– Nocona (EM64T, 90 nm)

Page 25: Nodes and Networks for HPC computing

21/12/05 r.innocente 25

AMD Athlon family

• K7 Athlon, 1999, 0.25 µ technology, w/ 3DNow!
• K75 Athlon, 0.18 µ technology
• Thunderbird (Athlon 3), 0.18 µ technology (on-die L2 cache)
• Palomino (Athlon 4), 0.18 µ, QuantiSpeed uarchitecture (SSE support), 15-stage pipeline
• Opteron/AMD64, 0.18 µ and then 0.13 µ

Page 26: Nodes and Networks for HPC computing

21/12/05 r.innocente 26

Pentium 4 0.13µ uarchitecture. Pic from Intel

Page 27: Nodes and Networks for HPC computing

21/12/05 r.innocente 27

Netburst 90nm uarch. Pic from Intel

Page 28: Nodes and Networks for HPC computing

21/12/05 r.innocente 28

Athlon uarchitecture

Page 29: Nodes and Networks for HPC computing

21/12/05 r.innocente 29

x86 Architecture extensions/1

x86 architecture was conceived in 1978. Since then it underwent many architecture extensions.

• Intel MMX (Matrix Math eXtensions):– introduced in 1997 supported by all current processors

• Intel SSE (Streaming SIMD Extensions):– introduced on the Pentium III in 1999

• Intel SSE2 (Streaming SIMD Extensions 2):– introduced on the Pentium 4 in Dec 2000

Page 30: Nodes and Networks for HPC computing

21/12/05 r.innocente 30

x86 Architecture extensions/2

• AMD 3DNow! :– introduced in 1998 (extends MMX)

• AMD 3DNow!+ (or 3DNow! Professional, or 3DNow! Athlon):– introduced with the Athlon (includes SSE)

Page 31: Nodes and Networks for HPC computing

21/12/05 r.innocente 31

x86 architecture extensions/3

The so-called feature set can be obtained using the assembly instruction

cpuid

which was introduced with the Pentium and returns information about the processor in the processor's registers: processor family, model, revision, features supported, size and structure of the internal caches.

On Linux the kernel uses this instruction at startup, and much of this information is available by typing:

cat /proc/cpuinfo
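As a minimal sketch (assuming GCC or clang on an x86 machine, and the compiler's <cpuid.h> helper), the same information can be queried directly from a program; the masks below are the standard leaf-1 EDX feature bits:

#include <stdio.h>
#include <string.h>
#include <cpuid.h>                 /* GCC/clang wrapper for the cpuid instruction */

#define EDX_MMX  (1u << 23)        /* leaf 1, EDX feature bits */
#define EDX_SSE  (1u << 25)
#define EDX_SSE2 (1u << 26)

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    char vendor[13] = {0};

    /* leaf 0: the vendor string comes back in EBX, EDX, ECX ("GenuineIntel", ...) */
    __get_cpuid(0, &eax, &ebx, &ecx, &edx);
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    printf("vendor_id : %s\n", vendor);

    /* leaf 1: family/model in EAX, feature flags in EDX/ECX */
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);
    printf("cpu family : %u, model : %u\n", (eax >> 8) & 0xf, (eax >> 4) & 0xf);
    printf("mmx:%d sse:%d sse2:%d\n",
           !!(edx & EDX_MMX), !!(edx & EDX_SSE), !!(edx & EDX_SSE2));
    return 0;
}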

Page 32: Nodes and Networks for HPC computing

21/12/05 r.innocente 32

x86 architecture extensions/4

Part of a typical output can be:

vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
cache size : 256 KB
flags : fpu pae tsc mtrr pse36 mmx sse

Page 33: Nodes and Networks for HPC computing

21/12/05 r.innocente 33

SIMD technology

• A way to increase processor performance is to group together equal instructions on different data (Data Level Parallelism: DLP)

• SIMD (Single Instruction Multiple Data) comes from Flynn's taxonomy of parallel machines (1966)

• Intel proposed its own flavour of it: SWAR (SIMD Within A Register)

Page 34: Nodes and Networks for HPC computing

21/12/05 r.innocente 34

typical SIMD operation

X4 | X3 | X2 | X1
      op (element-wise)
Y4 | Y3 | Y2 | Y1
      =
X4 op Y4 | X3 op Y3 | X2 op Y2 | X1 op Y1

Page 35: Nodes and Networks for HPC computing

21/12/05 r.innocente 35

MMX

• Adds 8 new 64-bit registers: MM0 - MM7

• MMX allows computations on packed byte, word, or doubleword integers contained in the MM registers (the MM registers overlap the FP registers!)

• Not useful for scientific computing

Page 36: Nodes and Networks for HPC computing

21/12/05 r.innocente 36

SSE

• Adds 8 128-bit registers: XMM0 - XMM7

• SSE allows computations on operands that contain 4 single-precision floating-point values, either in memory or in the XMM registers

• Very limited use for scientific computing, because of lack of precision

Page 37: Nodes and Networks for HPC computing

21/12/05 r.innocente 37

SSE2

• Works with operands either in memory or in the XMM registers

• Allows operations on packed double-precision floating-point values or 128-bit integers

• Using a linear algebra kernel written with SSE2 instructions, matrix multiplication can achieve 1.8 Gflops on a P4 at 1.4 GHz: http://hpc.sissa.it/p4/
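A minimal sketch of the kind of kernel meant here (not the code behind the quoted figure), assuming a compiler that provides the SSE2 intrinsics in <emmintrin.h>: an AXPY loop that processes two doubles per instruction.

#include <emmintrin.h>   /* SSE2 intrinsics */

/* y[i] += a * x[i], two packed doubles at a time */
void daxpy_sse2(int n, double a, const double *x, double *y)
{
    __m128d va = _mm_set1_pd(a);                 /* (a, a) */
    int i;
    for (i = 0; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);        /* load 2 doubles from x */
        __m128d vy = _mm_loadu_pd(y + i);
        vy = _mm_add_pd(vy, _mm_mul_pd(va, vx)); /* packed multiply-add   */
        _mm_storeu_pd(y + i, vy);
    }
    for (; i < n; i++)                           /* scalar tail */
        y[i] += a * x[i];
}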

Page 38: Nodes and Networks for HPC computing

21/12/05 r.innocente 38

Intel SSE3/1

• SSE3 introduces 13 new instructions:
– x87 conversion to integer (fisttp)
– complex arithmetic (addsubps, addsubpd, movsldup, movshdup, ...)

– Video encoding

– Thread synchronization

Page 39: Nodes and Networks for HPC computing

21/12/05 r.innocente 39

SSE3/2

• Conversion to integer (fisttp):
– this was a long-awaited feature. It is a conversion to integer that always performs truncation (as required by programming languages). Previously the fistp instruction converted according to the rounding mode specified in the FCW, and so usually required an expensive load/reload of the FCW.

Page 40: Nodes and Networks for HPC computing

21/12/05 r.innocente 40

Cache memory

Cache=a place where you can safely hide something

It is a high-speed memory that holds a small subset of main memory that is in frequent use.

CPU cache Memory

Page 41: Nodes and Networks for HPC computing

21/12/05 r.innocente 41

Cache memory/1

• Processor /memory gap is the motivation :– 1980 no cache at all

– 1995 2 level cache

– 2002 3 level cache

Page 42: Nodes and Networks for HPC computing

21/12/05 r.innocente 42

Cache memory/2

(Chart: CPU vs. memory performance, 1995-2000; the CPU/DRAM gap grows about 50% per year.)

1975: CPU cycle 1 µs, SRAM access 1 µs
2001: CPU cycle 0.5 ns, DRAM access 50 ns

Page 43: Nodes and Networks for HPC computing

21/12/05 r.innocente 43

Cache memory/3

As the cache contains only part of the main memory, we need to identify which portions of the main memory are cached. This is done by tagging the data that is stored in the cache.

The data in the cache is organized in lines that are the unit of transfer from/to the CPU and the memory.

Page 44: Nodes and Networks for HPC computing

21/12/05 r.innocente 44

Cache memory/4

(Diagram: the cache is organized as Tags + Data, one tag per cache line.)

Typical line sizes are 32 bytes (Pentium III), 64 bytes (Athlon), 128 bytes (Pentium 4).

Page 45: Nodes and Networks for HPC computing

21/12/05 r.innocente 45

Cache memory/5

Block placement algorithm:
• Direct mapped: a block can be placed in just one row of the cache
• Fully associative: a block can be placed in any row
• n-way set associative: a block can be placed in n places in a cache row

With direct mapping or n-way placement, the row is determined using a simple hash function, such as the least significant bits of the row address. E.g. if the cache line is 64 bytes (bits [5:0] of the address are used only to address the byte inside the row) and there are 1k rows, then bits [15:6] of the address are used to select the cache row. The tag is then the remaining most significant bits [:16].
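A small sketch of this address split in C, using exactly the slide's parameters (64-byte lines, 1024 rows; the address value is an arbitrary example):

#include <stdio.h>
#include <stdint.h>

#define LINE_BITS 6     /* 64-byte line -> offset bits [5:0]  */
#define ROW_BITS 10     /* 1024 rows    -> index bits [15:6]  */

int main(void)
{
    uint64_t addr = 0x12345678;      /* arbitrary example address */
    uint64_t offset = addr & ((1u << LINE_BITS) - 1);
    uint64_t row    = (addr >> LINE_BITS) & ((1u << ROW_BITS) - 1);
    uint64_t tag    = addr >> (LINE_BITS + ROW_BITS);

    printf("addr 0x%llx -> tag 0x%llx, row %llu, byte %llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)row, (unsigned long long)offset);
    return 0;
}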

Page 46: Nodes and Networks for HPC computing

21/12/05 r.innocente 46

Cache memory/6

(Diagram: two Tags + Data arrays side by side.)

2-way cache: a block can be placed in either of the two positions in a row.

Page 47: Nodes and Networks for HPC computing

21/12/05 r.innocente 47

Cache memory/7

• With fully associative or n-way caches an algorithm for block replacement needs to be implemented.

• Usually some approximation of the LRU (least recently used) algorithm is used to determine the victim line.

• With large associativity (16 or more ways), choosing the victim line at random is almost as efficient

Page 48: Nodes and Networks for HPC computing

21/12/05 r.innocente 48

Cache memory/8

• When during a memory access we find the data in the cache we say we had a hit, otherwise we say we had a miss.

• The hit ratio is a very important parameter for caches, in that it lets us predict the AMAT (Average Memory Access Time)

Page 49: Nodes and Networks for HPC computing

21/12/05 r.innocente 49

Cache memory/9

If

• tc is the time to access the cache

• tm the time to access the data from memory

• h the unitary hit ratio

then:

t = h * tc + (1 - h) * tm

Page 50: Nodes and Networks for HPC computing

21/12/05 r.innocente 50

Cache memory/10

In a typical case we could have :

• tc=2 ns

• tm=50 ns

• h=0.90

then the AMAT would be :

AMAT = 0.9 * 2 + 0.1 * 50 = 6.8 ns

Page 51: Nodes and Networks for HPC computing

21/12/05 r.innocente 51

Cache memory/11

Typical split L1/unified L2 cache organization:

(Diagram: the CPU is connected to a split L1, L1 code and L1 data, backed by a unified L2.)

Page 52: Nodes and Networks for HPC computing

21/12/05 r.innocente 52

Cache memory/12

• Classification of misses (3C):
– Compulsory: can't be avoided, the 1st time data is loaded (would happen even with an infinite cache)
– Capacity: the data had been loaded, but was evicted because the cache filled up (would happen even with a fully associative cache)
– Conflict: in an n-way cache, a miss that happens because data was evicted due to the low associativity

Page 53: Nodes and Networks for HPC computing

21/12/05 r.innocente 53

Cache memory/13

• Recent processors can prefetch cache lines under program control or under hardware control, snooping the access patterns of multiple flows. The Netburst architecture introduced hardware prefetch on the Pentium 4.

Page 54: Nodes and Networks for HPC computing

21/12/05 r.innocente 54

Cache memories/14

• Netburst hw prefetch is invoked after 2-3 cache misses with constant stride

• It prefetches 256 bytes ahead

• Up to 8 streams

• Doesn't prefetch past the 4KB page boundary

Page 55: Nodes and Networks for HPC computing

21/12/05 r.innocente 55

Real Caches/1

• Pentium III: line size 32 bytes
– split L1: 16KB inst / 16KB data, 4-way, write-through
– unified L2: 256 KB (Coppermine), 8-way, write-back

• Pentium 4:
– split L1:
  • instruction L1: 12K uops, 8-way (trace execution cache)
  • data L1: 8 KB, 4-way, 64-byte line size, write-through
– unified L2: 256 KB, 8-way, 128-byte line size, write-back (512 KB on Northwood)

• Athlon:
– split L1: 64 KB / 64 KB, 64-byte line
– unified L2: 256 KB, 16-way, 64-byte line size

Page 56: Nodes and Networks for HPC computing

21/12/05 r.innocente 56

Real Caches/2

• Opteron: exclusive organization (L1 not a subset of L2)
– L1: instruction 64 KB, 64 bytes per line; data 64 KB, 64 bytes per line
– L2: 1 MB, 16-way, 128 bytes per line

• Itanium 2, 0.13 µ, 1.5 GHz (Madison):
– L1I: 16 KB, 4-way, 64-byte line
– L1D: 16 KB, 4-way, 64-byte line
– unified L2: 256 KB, 8-way, 128-byte line
– unified L3: 9 MB, 24-way, 128-byte line

Page 57: Nodes and Networks for HPC computing

21/12/05 r.innocente 57

Real Caches/3

An intelligent RAM (IRAM)?

Itanium2 (Madison) die photo:
6MB L3 cache, 455 M transistors

Page 58: Nodes and Networks for HPC computing

21/12/05 r.innocente 58

Memory performance

• stream (McCalpin) : http://www.cs.virginia.edu/stream

• memperf (T.Stricker): http://www.cs.inf.ethz.ch/CoPs/ECT

Page 59: Nodes and Networks for HPC computing

21/12/05 r.innocente 59

MTRR/1

Memory Type and Range Registers were introduced with the P6 uarchitecture. They should be programmed to define how the processor should behave regarding the use of the cache in different memory areas. The following types of behaviour can be defined:

UC (Uncacheable), WC (Write Combining), WT (Write Through), WB (Write Back), WP (Write Protect)

Page 60: Nodes and Networks for HPC computing

21/12/05 r.innocente 60

MTRR/2

• UC = uncacheable:
– no cache lookups
– reads are not performed as line reads, but as-is
– writes are posted to the write buffer and performed in order

• WC = write combining:
– no cache lookups
– read requests performed as-is
– a 32-byte Write Combining Buffer is used to buffer modifications until a different line is addressed or a serializing instruction is executed

Page 61: Nodes and Networks for HPC computing

21/12/05 r.innocente 61

MTRR/3

• WT = write through:
– cache lookups are performed
– reads are performed as line reads
– writes are performed also in memory in any case (the L1 cache is updated and L2 is invalidated if necessary)

• WB = write back:
– cache lookups are performed
– reads are performed as line reads
– writes are performed on the cache line, possibly reading in a full line first. Only when the line has to be evicted from the cache, if modified, is it written to memory

• WP = write protect:
– writes are never performed in the cache line

Page 62: Nodes and Networks for HPC computing

21/12/05 r.innocente 62

MTRR/4

There are 8 MTRRs for variable size areas on the P6 uarchitecture.

On Linux you can display them with:

cat /proc/mtrr

Page 63: Nodes and Networks for HPC computing

21/12/05 r.innocente 63

MTRR/5

A typical output of the command would be:

reg00: base=0x00000000 (0MB), size=256MB: write-back

reg01: base=0xf0000000 (3840MB), size=32MB:write-combining

the first entry is for the DRAM, the other for the graphic board framebuffer

Page 64: Nodes and Networks for HPC computing

21/12/05 r.innocente 64

MTRR/6

It is possible to remove and add regions using these simple commands:

echo 'disable=1' >| /proc/mtrr

echo 'base=0xf0000000 size=0x4000000 type=write-combining' >| /proc/mtrr

( Suggestion: Try to use xengine while disabling and re-enabling the write-combining region of the framebuffer.)

Page 65: Nodes and Networks for HPC computing

21/12/05 r.innocente 65

Explicit cache control/1

Caches were introduced as an h/w-only optimization and as such were meant to be completely hidden from the programmer.

This is no longer true: all recent architectures have introduced some instructions for explicit cache control by the application programmer. On the x86 this was done together with the introduction of the SSE instructions.

Page 66: Nodes and Networks for HPC computing

21/12/05 r.innocente 66

Explicit cache control/2

On the Intel x86 architecture the following instructions provide explicit cache control :

• prefetch : these instructions load a cache line before the data is actually needed, to hide latency

• non temporal stores: to move data from registers to memory without writing it in the caches, when it is known data will not be re-used

• fence: to be sure order between prior/following memory operation is respected

• flush : to write back a cache line, when it is known it will not be re-used
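A minimal sketch of these four kinds of control through the SSE/SSE2 intrinsics (assuming <emmintrin.h> is available, dst 16-byte aligned and n even; the function name is only for illustration):

#include <emmintrin.h>

void copy_without_cache_pollution(double *dst, const double *src, int n)
{
    for (int i = 0; i + 2 <= n; i += 2) {
        /* prefetch: pull a line we will need soon, to hide latency */
        _mm_prefetch((const char *)(src + i + 16), _MM_HINT_T0);

        __m128d v = _mm_loadu_pd(src + i);

        /* non-temporal store: write around the caches (data not re-used) */
        _mm_stream_pd(dst + i, v);
    }
    /* fence: make sure the streaming stores are ordered and visible */
    _mm_sfence();

    /* flush: write back / evict a line we know we will not touch again */
    _mm_clflush(src);
}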

Page 67: Nodes and Networks for HPC computing

21/12/05 r.innocente 67

Performance and timestamp counters/1

These counters were initially introduced on the Pentium but documented only on the PPro. They are supported also on the Athlon.

Their aim was fine tuning and profiling of the applications. They were adopted after many other processors had shown their usefulness.

Page 68: Nodes and Networks for HPC computing

21/12/05 r.innocente 68

Performance and timestamp counters/2

Timestamp counter (TSC):
• it is a 64-bit counter
• it counts clock cycles (processor cycles: now up to 2 GHz = 0.5 ns) since reset or since the programmer zeroed it
• when it reaches a count of all 1s, it wraps around without generating any interrupt
• it is an x86 extension that is reported by the cpuid instruction, and as such can be found in /proc/cpuinfo
• Intel assured that even on future processors it will never wrap around in less than 10 years

Page 69: Nodes and Networks for HPC computing

21/12/05 r.innocente 69

Performance and timestamp counters/3

• the TSC can be read with the RDTSC (Read TSC) instruction or the RDMSR (Read Model Specific Register 0x10) instruction

• it is possible to zero the lower 32 bits of the TSC using the WRMSR (Write MSR 0x10) instruction (the upper 32 bits will be zeroed automatically by the h/w)

• RDTSC is not a serializing instruction, and as such doesn't prevent out-of-order execution; this can be obtained using the cpuid instruction
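A minimal sketch of that idea (GCC-style inline assembly, x86-64 assumed): cpuid is executed first as a serializing instruction, then rdtsc returns the counter in EDX:EAX.

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc_serialized(void)
{
    uint32_t lo, hi;
    /* cpuid serializes the instruction stream, then rdtsc reads the TSC */
    __asm__ volatile ("cpuid\n\t"
                      "rdtsc"
                      : "=a"(lo), "=d"(hi)
                      : "a"(0)
                      : "%rbx", "%rcx");
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t t0 = rdtsc_serialized();
    /* ... code to be timed ... */
    uint64_t t1 = rdtsc_serialized();
    printf("elapsed cycles: %llu\n", (unsigned long long)(t1 - t0));
    return 0;
}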

Page 70: Nodes and Networks for HPC computing

21/12/05 r.innocente 70

Performance and timestamp counters/4

The TSC is used by Linux during the startup phase to determine the clock frequency:

• the PIT (Programmable Interval Timer) is programmed to generate a 50 millisecond interval

• the elapsed clock ticks are computed reading the TSC before and after the interval

The TSC can be read by any user or only by the kernel according to a cpu flag in the CR4 register that can be set e.g. by a module.

Page 71: Nodes and Networks for HPC computing

21/12/05 r.innocente 71

Performance and timestamp counters/5

On the P6 uarchitecture there are 4 registers used for performance monitoring:

• 2 event select registers: PerfEvtSel0, PerfEvtSel1

• 2 performance counters 40 bits wide : PerfCtr0, PerfCtr1

You have to enable and specify the events to monitor setting the event select registers.

These registers can be accessed using the RDMSR and WRMSR instructions at kernel level (MSR 0x186,0x187, 0x193 0x194).

The performance counters can also be accessed using the RDPMC (Read Performance Monitoring Counters) at any privilege level if allowed by the CR4 flag.

Page 72: Nodes and Networks for HPC computing

21/12/05 r.innocente 72

Performance and timestamp counters/6

The Performance Monitoring mechanism has been vastly changed and expanded on P4 and Xeons. This new feature is called : PEBS (Precise Event-based Sampling).

There are 45 event selection control (ESCR) registers , 18 performance monitor counters and 18 counter configuration control (CCCR) MSR registers.

Page 73: Nodes and Networks for HPC computing

21/12/05 r.innocente 73

Performance and timestamp counters/7

A useful performance counters library to instrument the code and a tool called rabbit to monitor programs not instrumented with the library can be found at:

http://www.scl.ameslab.gov/Projects/Rabbit/

Page 74: Nodes and Networks for HPC computing

21/12/05 r.innocente 74

Intel Itanium/1

• EPIC (Explicitly Parallel Instruction Computing): a joint HP/Intel project started at the end of the '90s:

– its goal was to overcome the limitations of current architectures. Recognizing that it was hard to keep exploiting ILP (Instruction Level Parallelism) exclusively in hardware, it was thought that compilers could decide on dependencies and suggest parallelism much better.

• In-Order core !!!!!!!

Page 75: Nodes and Networks for HPC computing

21/12/05 r.innocente 75

Intel Itanium/2

• It's a 64-bit architecture: addresses are 64 bits

• It supports the following data types:

– Integers 1,2,4 and 8 bytes

– Floating point single, double and extended(80 bits)

– Pointers 64 bits

• With a few exceptions, the basic data type is 64 bits and operations are done on 64 bits. 1-, 2- and 4-byte operands are zero-extended when loaded.

Page 76: Nodes and Networks for HPC computing

21/12/05 r.innocente 76

Intel Itanium/3

• A typical Itanium instruction is a 3 operand instruction designated with this syntax :– [(qp)]mnemonic[.comp1][.comp2] dsts = srcs

• Examples :– Simple instruction : add r1 = r2,r3

– Predicated instruction:(p5)add r3 = r4,r5

– Completer: cmp.eq p4 = r2,r3

Page 77: Nodes and Networks for HPC computing

21/12/05 r.innocente 77

Intel Itanium/4

• (qp) : a qualifying predicate is a 1 bit predicate register p0-p63. The instruction is executed if the register is true (= 1). Otherwise a NOP is executed. If omitted p0 is assumed which is always true

• mnemonic: the name of the operation
• comp1, comp2: some instructions include 1 or more completers
• dsts = srcs: most instructions have 2 source operands and 1 destination operand

Page 78: Nodes and Networks for HPC computing

21/12/05 r.innocente 78

Intel Itanium/4bis

if (i == 3)
    x = 5
else
    x = 6

becomes, using predication:

• pT,pF = compare(i,3)
• (pT) x = 5
• (pF) x = 6

Page 79: Nodes and Networks for HPC computing

21/12/05 r.innocente 79

Intel Itanium/5

• Memory in Itanium :

– single, uniform, linear address space of 2^64 bytes

• single : data and code together, not like Harvard arch.

• uniform: no predefined zones, like interrupt vectors ..

• linear: not segmented like x86

• Code stored in little endian order, usually also data, but there is support for big endian OS too

• Load and store arch.

Page 80: Nodes and Networks for HPC computing

21/12/05 r.innocente 80

Intel Itanium/6

• Improves ILP (instruction level parallelism) by:
– enabling the code writer to explicitly indicate it
– providing a 3-instruction-wide word called a bundle
– providing a large number of registers to avoid contention

Page 81: Nodes and Networks for HPC computing

21/12/05 r.innocente 81

Intel Itanium/7

• Instructions are grouped into instruction groups. These are sets of instructions that don't have WAW or RAW dependencies and so can execute in parallel

• An instruction group must contain at least one instruction; there is no limit on the number of instructions in an instruction group. Groups are delimited in the code by cycle breaks (;;)

• Instruction groups are composed of instruction bundles. Each bundle contains 3 instructions and a template field

Page 82: Nodes and Networks for HPC computing

21/12/05 r.innocente 82

Intel Itanium/8

• Code generation assures there is no RAW or WAW race between the instructions in a bundle, and sets the template field that maps each instruction to an execution unit. In this way the instructions in the bundle can be dispatched in parallel

• Bundles are 16-byte aligned

• The template field can end the instruction group at the end of the bundle or in the middle

Page 83: Nodes and Networks for HPC computing

21/12/05 r.innocente 83

Intel Itanium/9

• Registers:

• 128 GPR

• 128 FP reg.

• 64 pred.reg

• 8 branch reg.

• 128 Applic. Reg

• IP reg

Page 84: Nodes and Networks for HPC computing

21/12/05 r.innocente 84

Intel Itanium/10

• Register validity:
– each GR has an associated NaT (Not a Thing) bit
– FP registers use a special pseudo-zero value called NaTVal

Page 85: Nodes and Networks for HPC computing

21/12/05 r.innocente 85

Intel Itanium/11

• Branches:
– relative direct: the instruction contains a 21-bit displacement that is added to the current IP of the bundle containing the branch. This is a 16-byte instruction bundle address (= 25 bits)
– indirect branches: using one of the 8 branch registers

• It allows multiple branches to be evaluated in parallel, taking the first whose predicate is true

Page 86: Nodes and Networks for HPC computing

21/12/05 r.innocente 86

Intel Itanium/12

• Memory latency hiding, thanks to the large register file used for:
– data speculation: execution of an op before its data dependency is resolved
– control speculation: execution of an op before its control dependency is resolved
– elimination of memory accesses through the use of registers

Page 87: Nodes and Networks for HPC computing

21/12/05 r.innocente 87

Intel Itanium/13

• Procedure calls and the register stack:
– usually procedure calls were implemented using a procedure stack in main memory
– Itanium uses the register stack for that

• Rotating registers:
– static: the first 32 registers are permanent registers visible to all procedures, in which global variables are placed
– stacked: the other 96 registers behave like a stack; a procedure can allocate up to 96 registers for a procedure frame

• Register Stack Engine (RSE):
– a hardware mechanism operating transparently in the background ensures that no overflow can happen, saving registers to memory

Page 88: Nodes and Networks for HPC computing

21/12/05 r.innocente 88

Intel Itanium/13. Pic from Intel

Page 89: Nodes and Networks for HPC computing

21/12/05 r.innocente 89

Intel Itanium/15

• FP registers are 82 bits:
– 80 bits like the extended format used since the x87 coprocessor
– 2 guard bits to allow computing valid 80-bit results for divide and square root, which are implemented in s/w

• FP status register (FPSR):
– provides 4 separate status fields sf0-sf3, enabling 4 different computational environments

Page 90: Nodes and Networks for HPC computing

21/12/05 r.innocente 90

Intel Itanium/16

• Itanium 2 (Madison):
– 1.6 GHz, 1.5 GHz, 1.4 GHz
– L3 cache: 9MB, 6MB, 4MB
– L2 cache: 256 KB
– L1 cache: 32 KB instr, 32 KB data
– 200/266 MHz bus, 400/533 MT/s @ 128-bit system bus (6.4 GB/s, 8.5 GB/s)

Page 91: Nodes and Networks for HPC computing

21/12/05 r.innocente 91

Intel Itanium/17

Pic from Intel

Page 92: Nodes and Networks for HPC computing

21/12/05 r.innocente 92

Intel Itanium/18 Pic from Intel

Page 93: Nodes and Networks for HPC computing

21/12/05 r.innocente 93

Intel Itanium/18

Page 94: Nodes and Networks for HPC computing

21/12/05 r.innocente 94

Opteron uarch

Pic from AMD

Page 95: Nodes and Networks for HPC computing

21/12/05 r.innocente 95

64 bit architectures

• AMD – Opteron

• Intel – Itanium

• Intel - EM64T

Page 96: Nodes and Networks for HPC computing

21/12/05 r.innocente 96

AMD-64/1

– AMD extensions to IA-32 to allow 64-bit addressing, also known as Direct Connect Architecture (DCA)

– Implementations:
• Opteron for servers
• Athlon64, Athlon64 FX for desktops

Page 97: Nodes and Networks for HPC computing

21/12/05 r.innocente 97

AMD Hammer/Opteron

• x86-64 architecture specified in :– http://www.x86-64.org

Page 98: Nodes and Networks for HPC computing

21/12/05 r.innocente 98

Intel EM64T/1

• Extended Memory 64 Technology: an extension to IA-32 compatible with the proposed AMD-64

• Introduced with the Nocona Xeons. 3 operating modes:
– 64-bit mode: OS, drivers, applications recompiled
– compatibility mode: OS & drivers 64-bit, applications unchanged
– legacy mode: runs 32-bit OS, drivers and applications

Page 99: Nodes and Networks for HPC computing

21/12/05 r.innocente 99

x86-64 or AMD64. Pic from AMD

Page 100: Nodes and Networks for HPC computing

21/12/05 r.innocente 100

Processor bus/1

• Intel (AGTL+):
– bus based (max 5 loads)
– explicit in-band arbitration
– short bursts (4 data transfers)
– 8 bytes wide (64 bits), up to 133 MHz

• Compaq Alpha (EV6) / Athlon:
– point to point
– DDR, double data rate (2 transfers per clock)
– licensed by AMD for the Athlon (133 MHz x 2)
– source synchronous (up to 400 MHz)
– 8 bytes wide (64 bits)

Page 101: Nodes and Networks for HPC computing

21/12/05 r.innocente 101

Processor bus/2

• Intel Netburst (Pentium 4):
– source synchronous (like EV6)
– 8 bytes wide
– 100 MHz clock
– quad data rate for data transfers (4 x 100 MHz x 8 bytes = 3.2 GB/s, up to 4 x 200 MHz x 8 = 6.4 GB/s)
– double data rate for address transfers
– max 128 bytes (a P4 cache line) in a transaction (4 data transfer cycles)

Page 102: Nodes and Networks for HPC computing

21/12/05 r.innocente 102

Processor bus/3

• Itanium2:
– 128 bits wide
– 200/266 MHz
– dual pumped
– 400/533 MT/s x 16 bytes = 6.4/8.5 GB/s

• Opteron (not a bus):
– 2, 4, 8, 16, 32 bits wide
– 800 MHz
– double pumped
– 800 MHz x 2 (DDR) x 2 bytes = 3.2 GB/s per direction

Page 103: Nodes and Networks for HPC computing

21/12/05 r.innocente 103

Intel IA32 node

(Diagram: two Pentium 4 processors on a shared FSB, 64 bits @ 100/133/200 MHz, connected to the North Bridge, which connects to memory.)

Page 104: Nodes and Networks for HPC computing

21/12/05 r.innocente 104

Intel PIII/P4 processor bus

• Bus phases:
– arbitration: 2 or 3 clks
– request phase: 2 clks, packet A, packet B (size)
– error phase: 2 clks, check parity on the packets, drive AERR
– snoop phase: variable length, 1 ...
– response phase: 2 clks
– data phase: up to 32 bytes (4 clks, 1 cache line), 128 bytes with quad data rate on the Pentium 4

• 13 clks to transfer 32 bytes, or 128 bytes on the P4

Page 105: Nodes and Networks for HPC computing

21/12/05 r.innocente 105

Alpha node

(Diagram: two Alpha 21264 CPUs, each on a 64-bit 4x83 MHz EV6 bus to the Tsunami crossbar switch, which connects to memory over a 256-bit 83 MHz bus; measured stream memory b/w > 1 GB/s.)

Page 106: Nodes and Networks for HPC computing

21/12/05 r.innocente 106

Alpha/Athlon EV6 interconnect

• 3 high-speed channels:
– unidirectional processor request channel
– unidirectional snoop channel
– 72-bit data channel (ECC)

• source synchronous

• up to 400 MHz (4 x 100 MHz quad pumped; Athlon: 133 MHz x 2 DDR)

Page 107: Nodes and Networks for HPC computing

21/12/05 r.innocente 107

Opteron integrated memory controller

• Opteron block diagram

• The integrated memory controller frees the interconnect from local memory accesses

• Remote memory accesses and I/O happen through HyperTransport

Page 108: Nodes and Networks for HPC computing

21/12/05 r.innocente 108

2P Hammer

Page 109: Nodes and Networks for HPC computing

21/12/05 r.innocente 109

4P Hammer

Page 110: Nodes and Networks for HPC computing

21/12/05 r.innocente 110

4P Opteron detailed view

Page 111: Nodes and Networks for HPC computing

21/12/05 r.innocente 111

4P Opteron

Page 112: Nodes and Networks for HPC computing

21/12/05 r.innocente 112

8P Hammer

Page 113: Nodes and Networks for HPC computing

21/12/05 r.innocente 113

Hyper Transport hops

Page 114: Nodes and Networks for HPC computing

21/12/05 r.innocente 114

HT local/xfire bandwidth

(Charts: local memory bandwidth and crossfire (xfire, all-to-all) memory bandwidth.)

Page 115: Nodes and Networks for HPC computing

21/12/05 r.innocente 115

Double/Quad core

Pics from Broadcom: dual-core and quad-core MIPS64 CPUs

Page 116: Nodes and Networks for HPC computing

21/12/05 r.innocente 116

SMPs

Symmetrical Multi Processing denotes a multiprocessing architecture in which there is no master CPU, but all CPUs co-operate.

• processor bus arbitration
• cache coherency/consistency
• atomic RMW operations
• MP interrupts

Page 117: Nodes and Networks for HPC computing

21/12/05 r.innocente 117

Intel MP Processor bus arbitration

• As we have seen there is a special phase for each bus transaction devoted to arbitration (it takes 2 or 3 cycles)

• At startup each processor is assigned a cluster number from 0 to 3 and a processor number from 0 to 3 (the Intel MP specification covers up to 16 processors)

• In the arbitration phase each processor knows who had the bus during last transaction (the rotating ID) and who is requesting it. Then each processor knows who will own the current transaction because they will gain the bus according to the fixed sequence 0-1-2-3.

Page 118: Nodes and Networks for HPC computing

21/12/05 r.innocente 118

Cache coherency

Coherence: all processors should see the same data. This is a problem because each processor can have its own copy of the data in its cache.

There are essentially 3 ways to assure cache coherency:
• directory based: there is a central directory that keeps track of the shared regions (ccNUMA: Sun Enterprise, SGI)
• snooping: all the caches monitor the bus to determine if they have a copy of the data requested (UMA: Intel MP)
• broadcast based

Page 119: Nodes and Networks for HPC computing

21/12/05 r.innocente 119

Cache consistency

• Consistency: maintain the order in which writes are executed

Writes are serialized by the processor bus.

Page 120: Nodes and Networks for HPC computing

21/12/05 r.innocente 120

Snooping

Two protocols can be used by caches to snoop the bus :

• write-invalidate: when a cache hears on the bus a write request for one of his lines from another bus agent then it invalidates its copy (Intel MP)

• write-update: when a cache hears on the bus a write request for one of his lines from another agent it reloads the line

Page 121: Nodes and Networks for HPC computing

21/12/05 r.innocente 121

Intel MP snooping

• If a memory transaction (memory read and invalidate/read code/read data/write cache line/write) was not cancelled by an error, then the error phase of the bus is followed by the snoop phase(2 or more cycles)

• In this phase the caches will check if they have a copy of the line

• if it is a read and a processor has a modified copy then it will supply its copy that is also written to memory

• if it is a write and a processor has a modified copy then the memory will store first the modified line and then will merge the write data

Page 122: Nodes and Networks for HPC computing

21/12/05 r.innocente 122

MESI protocol

Each cache line can be in 1 of 4 states:
• Invalid (I): the line is not valid and should not be used
• Exclusive (E): the line is valid, is the same as main memory, and no other processor has a copy of it; it can be read and written in cache without problems
• Shared (S): the line is valid, the same as memory, and one or more other processors have a copy of it; it can be read from memory, but it should be written through (even if declared write-back!)
• Modified (M): the line is valid, has been updated by the local processor, and no other cache has a copy; it can be read and written in cache

Page 123: Nodes and Networks for HPC computing

21/12/05 r.innocente 123

MESI states

(Diagram: MESI state transition graph between the I, E, S and M states. Legend: R = read access, W = write access, S = snoop.)

Page 124: Nodes and Networks for HPC computing

21/12/05 r.innocente 124

L2/L1 coherence

• An easy way to keep the 2 levels of cache coherent is to require inclusion (L1 a subset of L2)

• Otherwise each cache can perform its snooping

Page 125: Nodes and Networks for HPC computing

21/12/05 r.innocente 125

Atomic Read Modify Write

The x86 instruction set has the possibility to prefix some instructions with a LOCK prefix:
– bit test and modify
– exchange
– increment, decrement, not, add, sub, and, or

These will cause the processor to assert the LOCK# bus signal for the duration of the read and the write.

The processor automatically asserts the LOCK# signal during execution of an XCHG, during a task switch, and while reading a segment descriptor.
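A minimal sketch of how such a locked read-modify-write can be produced from C (GCC-style inline assembly and builtins assumed; the names are only illustrative):

#include <stdio.h>

static int shared_counter = 0;

static inline void atomic_inc(int *p)
{
    /* emits a LOCK-prefixed increment: the read-modify-write becomes
     * indivisible (LOCK# on the bus, or a cache-line lock on newer CPUs) */
    __asm__ volatile ("lock; incl %0" : "+m"(*p) : : "memory");
}

int main(void)
{
    atomic_inc(&shared_counter);

    /* equivalent portable form: on x86 the compiler also emits a
     * LOCK-prefixed add for this builtin */
    __sync_fetch_and_add(&shared_counter, 1);

    printf("%d\n", shared_counter);
    return 0;
}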

Page 126: Nodes and Networks for HPC computing

21/12/05 r.innocente 126

Intel MP interrupts

Intel has introduced an I/O APIC (Advanced Programmable Interrupt controller) which replaces the old 8259A.

• each processor of an SMP has its own integrated local APIC

• an Interrupt Controller Communication (ICC) bus connects an external I/O APIC (front-end) to these local APICs

• external IRQ lines are connected to the I/O APIC, which acts as a router

• the I/O APIC can dispatch interrupts to a fixed processor or to the one executing the lowest-priority activities (the priority table has to be updated by the kernel at each context switch)

Page 127: Nodes and Networks for HPC computing

21/12/05 r.innocente 127

Broadcast cache coherence

• In a 4-processor cluster, p3 wants to access m0:
– the read-cache-line request is sent on the link p3-p1
– the request is forwarded from p1 to p0
– p0 reads the line from memory, snooping itself, and sends out snoop requests to p1 and p2
– p0 sends out its snoop response to p3, and p2 forwards the snoop request from p0
– p0 gets the local memory copy and p2 sends its snoop response to p3
– the memory line is forwarded to p3

Page 128: Nodes and Networks for HPC computing

21/12/05 r.innocente 128

PCI Bus

• PCI32/33 4 bytes@33Mhz=132MBytes/s (on i440BX,...)

• PCI64/33 8 bytes@33Mhz=264Mbytes/s

• PCI64/66 8 bytes@66Mhz=528Mbytes/s (on i840)

• PCI-X 8 bytes@133Mhz=1056Mbytes/s

Page 129: Nodes and Networks for HPC computing

21/12/05 r.innocente 129

PCI-X 1.0

• 100/133 MHz, 32/64 bit: max 1064 MB/s

• Split transactions

• Backward compatible with PCI 2.1/2.2/2.3:
– PCI-X adapters work in existing PCI slots
– older PCI cards slow down the PCI-X bus

Page 130: Nodes and Networks for HPC computing

21/12/05 r.innocente 130

PCI-X 2.0

• 100% backward compatible

• PCI-X 266 and PCI-X 533:
– 2.13 GB/s and 4.26 GB/s
– DDR and QDR using the 133 MHz base frequency

Page 131: Nodes and Networks for HPC computing

21/12/05 r.innocente 131

PCI efficiency

• Multimaster bus, but arbitration is performed out of band

• Multiplexed, but in burst mode (implicit addressing) only the start address is transmitted

• Fairness guaranteed by the MLT (Maximum Latency Timer)

• 3/4 cycles of overhead on 64 data transfers: < 5%

• on Linux use: lspci -vvv to look at the setup of the boards

Page 132: Nodes and Networks for HPC computing

21/12/05 r.innocente 132

PCI 2.2/X timing diagram

(Timing diagram: a PCI 2.2 / PCI-X transaction. Over CLK, FRAME# opens the transaction; the AD and C/BE# lines carry the address phase (address + bus command), the attribute phase, the target response, and then the data phases (data + byte enables), paced by IRDY# and TRDY#.)

Page 133: Nodes and Networks for HPC computing

21/12/05 r.innocente 133

common chipsets PCI performance

Chipset                          Read MB/s   Write MB/s
Serverworks HE                      303          372
Intel 860                           227          315
AMD 760MPX                          315          328
Tsunami (Alpha)                     205          248
Serverworks Champion II HE          309          372
Serverset III HE                    315          372
Intel 460GX (Itanium)               372          399
Titan (Alpha)                       407          488
Serverworks Serverset III LE        455          512

Page 134: Nodes and Networks for HPC computing

21/12/05 r.innocente 134

chipsets PCI-X performance

Chipset / system                                   Read MB/s   Write MB/s
HP rx2000, dual 900 MHz Itanium2, HP zx1 chipset      784          1044
Supermicro XDL8-GG, dual 2.4 GHz Xeon                 932          1044
2x Opteron 244, AMD 8131                              946          1036

Page 135: Nodes and Networks for HPC computing

21/12/05 r.innocente 135

Memory buses

• SDRAM, 8 bytes wide (64 bits): these memories are pipelined DRAMs. Industry found it much more appealing to quote their clock frequency than their access time. The number of clock cycles to wait for an access is written in a small ROM on the module (SPD)
– PC-100, PC-133
– DDR PC-200, DDR-266; QDR on the horizon

• RDRAM, 2 bytes wide (16 bits): these memories use a completely new signaling technology. Their buses have to be terminated at both ends (a continuity module is required if a slot is not used!)
– RDRAM 600/800/1066: double data rate at 300, 400, 533 MHz

Page 136: Nodes and Networks for HPC computing

21/12/05 r.innocente 136

Interconnects /1

• The technology used to interconnect processors overlaps today on one end with IC and PCB (printed circuit board) technologies and on the other end with WAN (Wide Area Network) technologies.

• SAN (System Area Network) and LAN technologies are at a midpoint.

Page 137: Nodes and Networks for HPC computing

21/12/05 r.innocente 137

LogP metrics (Culler)

• L = Latency: time data is on flight between the 2 nodes

• o = overhead: time during which the processor is engaged in sending or receiving

• g = gap: minimum time interval between consecutive message transmissions (or receptions)

• P = # of Processors

This metric was introduced to characterize a distributed system by its most important parameters. It is a bit outdated, but still useful (e.g. it doesn't take pipelining into account).

Page 138: Nodes and Networks for HPC computing

21/12/05 r.innocente 138

LogP diagram

(Diagram: two processors, each behind its NI, connected through the network; the sender spends os, the message spends L in flight, the receiver spends or, and g is the gap between consecutive messages. Total time = os + L + or.)
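As a small worked example of the model (illustration values only, not measurements), the time for a single message is os + L + or, while a burst of k messages is paced by the gap g (assuming g >= os):

#include <stdio.h>

int main(void)
{
    /* made-up LogP parameters, in microseconds */
    double L = 5.0, o_s = 1.0, o_r = 1.0, g = 2.0;
    int k = 10;                     /* messages in the burst */

    double one_msg = o_s + L + o_r;           /* as in the diagram       */
    double burst   = (k - 1) * g + one_msg;   /* arrival of the last one */

    printf("1 message : %.1f us\n", one_msg);
    printf("%d messages: %.1f us\n", k, burst);
    return 0;
}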

Page 139: Nodes and Networks for HPC computing

21/12/05 r.innocente 139

Interconnects /2

• single mode fiber ( sonet 10 gb/s for some km)

• multimode fiber (1 gb 300 m: GigE)

• coaxial cable (800 mb/s 1 km: CATV)

• twisted pair (1 gb/s 100 m: GigE)

Page 140: Nodes and Networks for HPC computing

21/12/05 r.innocente 140

Interconnects /3

• shared / switched media:
– bus (shared communication):
  • coaxial
  • PCB trace
  • backplane
– switch (1-1 connection):
  • point to point

The required communication speed is forcing a shift towards the point-to-point paradigm.

Page 141: Nodes and Networks for HPC computing

21/12/05 r.innocente 141

Interconnects

• Full interconnection network:
– ideal, but unattainable and overkill
– D = 1
– links grow like n^2

Page 142: Nodes and Networks for HPC computing

21/12/05 r.innocente 142

Interconnects /4

Hypercube:

• A k-cube has 2**k nodes, each node is labelled by a k-dim binary coordinate

• 2 nodes differing in only 1 dim are connected

• there are at most k hops between 2 nodes, and 2**k * k wires

(Diagrams: a 1-cube with nodes 0 and 1; a 2-cube with nodes 00, 01, 10, 11; a 3-cube.)
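A tiny sketch of the labelling rule: the neighbours of a node in a k-cube are obtained by flipping one bit of its k-bit label (nodes differing in exactly one dimension are connected).

#include <stdio.h>

int main(void)
{
    int k = 3;                /* 3-cube: 2^3 = 8 nodes */
    unsigned node = 5;        /* binary 101            */

    for (int d = 0; d < k; d++)
        printf("neighbour across dimension %d: %u\n", d, node ^ (1u << d));
    return 0;
}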

Page 143: Nodes and Networks for HPC computing

21/12/05 r.innocente 143

Interconnects /4

Mesh:
• an extension of the hypercube
• nodes are given a k-dim coordinate (in the 0:N-1 range)
• nodes that differ by 1 in only 1 coordinate are connected
• a k-dim mesh has N**k nodes
• there are at most kN hops between nodes, and the wires are ~ kN**k

(Diagram: a 2-dim 3x3 mesh with nodes 00 ... 22.)

Page 144: Nodes and Networks for HPC computing

21/12/05 r.innocente 144

Interconnects /4

Crossbar (xbar):
• typical telephone network technology
• organized in rows and columns (NxN)
• requires N**2 switches
• any permutation w/o blocking
• in the picture the i-th row is the sender and the i-th column is the receiver of node i

(Diagram: a 5x5 xbar switch, rows 0-4 by columns 0-4.)

Page 145: Nodes and Networks for HPC computing

21/12/05 r.innocente 145

Interconnect/4bis

• 1D Torus(Ring)

• 2D Torus

Page 146: Nodes and Networks for HPC computing

21/12/05 r.innocente 146

Interconnects/5

• Diameter : maximum distance in hops between two processors

• Bisection width : number of links to cut to sever the network in two halves

• Bisection bandwidth : minimum sum of b/w of links to be cut in order to divide the net in 2 halves

• Node degree : number of links from a node

Page 147: Nodes and Networks for HPC computing

21/12/05 r.innocente 147

Bisection /1

Bisection width: given an N-node net, with the nodes divided into two sets of N/2 nodes, the number of links that go from one set to the other.

An important parameter is the minimum bisection width: the minimum number of links over all the cuts that bisect the nodes.

An upper bound on the minimum bisection is N/2, because whatever the topology you can always find a cut that goes across half of the node links. If a network has a minimum bisection of N/2 we say it has full bisection.

Page 148: Nodes and Networks for HPC computing

21/12/05 r.innocente 148

Bisection /2

(Diagrams: 2 different 8-node networks; one has a bisection cut of 1 link, the other of 4 links.)

Page 149: Nodes and Networks for HPC computing

21/12/05 r.innocente 149

Bisection /3

It can be difficult to find the min bisection.

For full bisection networks, it can be shown that if a network is re-arrangeable (it can route any permutation w/o blocking) then it has full bisection.

(Table: an example permutation, mapping each From node 0-6 to a distinct To node.)

Page 150: Nodes and Networks for HPC computing

21/12/05 r.innocente 150

Interconnects /4

• Topology:
– crossbar (xbar): any to any
– omega
– fat tree

• Bisection (values for a 64-node network):

Topology          bisection   ports/switch
bus                    1
ring                   2            3
grid/mesh              8            5
2-d torus             16            5
star                  32           64
hypercube             32            7
fully connected     1024           64

Page 151: Nodes and Networks for HPC computing

21/12/05 r.innocente 151

Interconnects /5

• Connection / connectionless:
– circuit switched
– packet switched

• Routing:
– source based routing (SBR)
– virtual circuit
– destination based

Page 152: Nodes and Networks for HPC computing

21/12/05 r.innocente 152

Interconnects /6

• switch mechanism (buffer management):
– store and forward:
  • each msg is received completely by a switch before being forwarded to the next
  • pros: easy to design, because there is no handling of partial msgs
  • cons: long msg latency, requires large buffers
– cut-through (virtual cut-through):
  • a msg is forwarded to the next node as soon as its header arrives
  • if the next node cannot receive the msg, then the full msg needs to be stored
– wormhole routing:
  • the same as cut-through, except that the msg can be buffered together by multiple successive switches
  • the msg is like a worm crawling through a worm-hole

Page 153: Nodes and Networks for HPC computing

21/12/05 r.innocente 153

Interconnects /7

• congestion control:
– packet discarding
– flow control:
  • window / credit based
  • start/stop

Page 154: Nodes and Networks for HPC computing

21/12/05 r.innocente 154

Gilder’s law

• proposed by G. Gilder (1997): the total bandwidth of communication systems triples every 12 months

• compare with Moore's law (1974): the processing power of a microchip doubles every 18 months

• compare with: memory performance increases 10% per year

(source: Cisco)

Page 155: Nodes and Networks for HPC computing

21/12/05 r.innocente 155

Optical networks

• Large fiber bundles multiply the fibers per route: 144, 432 and now 864 fibers per bundle

• DWDM (Dense Wavelength Division Multiplexing) multiplies the b/w per fiber:
– Nortel announced 80 x 80 Gb/s per fiber (6.4 Tb/s)
– Alcatel announced 128 x 40 Gb/s per fiber (5.1 Tb/s)

Page 156: Nodes and Networks for HPC computing

21/12/05 r.innocente 156

NIC interconnection point (from D. Culler)

(Table: where the NIC attaches (register, memory, graphics bus, or I/O bus) versus the kind of processing on the NIC (general uproc, special uproc, controller). Examples: TMC CM-5 (register), T3E annex (memory), HP Medusa (graphics bus), SP2 and Fore ATM cards, Myrinet, 3Com GbE and many Ethernet cards (I/O bus); also shown: Intel Paragon, Meiko CS-2.)

Page 157: Nodes and Networks for HPC computing

21/12/05 r.innocente 157

LVDS/1

• Low Voltage Differential Signaling (ANSI/TIA/EIA 644-1995)

• Defines only the electrical characteristics of drivers and receivers

• Transmission media can be copper cable or PCB traces

Page 158: Nodes and Networks for HPC computing

21/12/05 r.innocente 158

LVDS/2

• Differential:
– instead of measuring the voltage Vref+U between a signal line and a common GROUND, a pair is used and Vref+U and Vref-U are transmitted on the 2 wires
– in this way the transmission is immune to common-mode noise (the electrical noise induced in the same way on the 2 wires: EMI, ...)

Page 159: Nodes and Networks for HPC computing

21/12/05 r.innocente 159

LVDS/3

• Low Voltage:
– the voltage swing is just 300 mV, with a driver offset of +1.2 V
– receivers are able to detect signals as low as 20 mV in the 0 to 2.4 V range (supporting +/- 1 V of noise)

(Diagram: the two wires sit around the 1.2 V offset, swinging between 1.05 V and 1.35 V; the differential signal is +/- 300 mV.)

Page 160: Nodes and Networks for HPC computing

21/12/05 r.innocente 160

LVDS/4

• It consumes only 1 mW (330 mV swing, constant current): GTL would consume 40 mW (1 V swing) and TTL much more

• consumer chips for connecting displays (OpenLDI DS90C031) are already transmitting 6 Gb/s

• the low slew rate (300 mV in 333 ps is only 0.9 V/ns) minimizes crosstalk and distortion

Page 161: Nodes and Networks for HPC computing

21/12/05 r.innocente 161

VCSEL/1

VCSELs: Vertical Cavity Surface Emitting Lasers:
• the laser cavity is vertical to the semiconductor wafer
• the light comes out from the surface of the wafer

Distributed Bragg Reflectors (DBR):
• 20/30 pairs of semiconductor layers

(Diagram: the active region is sandwiched between a p-DBR and an n-DBR.)

Page 162: Nodes and Networks for HPC computing

21/12/05 r.innocente 162

VCSEL/2-EEL (Edge Emitting)

picture from Honeywell

Page 163: Nodes and Networks for HPC computing

21/12/05 r.innocente 163

VCSEL/3 - Surface Emission

picture from Honeywell

Page 164: Nodes and Networks for HPC computing

21/12/05 r.innocente 164

VCSEL/4

pict from Honeywell

Page 165: Nodes and Networks for HPC computing

21/12/05 r.innocente 165

VCSEL/5

At costs similar to LEDs, they have the characteristics of lasers.

EELs:
• it's difficult to produce a geometry w/o problems when cutting the chips
• it's not possible to know in advance if a chip is good or not

VCSELs:
• they can be produced and tested with standard I.C. procedures
• arrays of 10 or 1000 can be produced easily

Page 166: Nodes and Networks for HPC computing

21/12/05 r.innocente 166

Ethernet history

• 1976 Metcalfe invented a 1 Mb/s ether
• 1980 Ethernet DIX (Digital, Intel, Xerox) standard
• 1989 SynOptics invented twisted-pair Ethernet
• 1995 Fast Ethernet
• 1998 Gigabit Ethernet
• 2002 10 Gigabit Ethernet

Page 167: Nodes and Networks for HPC computing

21/12/05 r.innocente 167

Ethernet

• 10 Mb/s

• Fast Ethernet

• Gigabit Ethernet

• 10GB Ethernet

Page 168: Nodes and Networks for HPC computing

21/12/05 r.innocente 168

Ethernet Frames

• 6 bytes dst address
• 6 bytes src address: 3 bytes vendor code, 3 bytes serial #
• 2 bytes Ethernet type: IP, ...
• data from 46 up to 1500 bytes
• 4 bytes FCS (checksum)

Frame layout: dst address | src address | type | data | FCS
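A sketch of that layout as a packed C struct (GCC-style packed attribute assumed; the preamble and the trailing FCS are handled by the hardware, and the fixed-size payload array is only for illustration):

#include <stdint.h>

struct eth_frame {
    uint8_t  dst[6];       /* destination MAC address                        */
    uint8_t  src[6];       /* source MAC: 3-byte vendor code + 3-byte serial */
    uint16_t type;         /* EtherType, e.g. 0x0800 for IP (big endian)     */
    uint8_t  data[1500];   /* payload: 46..1500 bytes on the wire            */
    /* the 4-byte FCS (CRC) follows the payload on the wire */
} __attribute__((packed));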

Page 169: Nodes and Networks for HPC computing

21/12/05 r.innocente 169

Hubs

• Repeaters: layer 1
– like a bus, they just repeat all the data across all the connections (not very useful for clusters)

• Switches (bridges): layer 2
– they filter traffic and repeat only the necessary traffic on each connection (very cheap switches can easily switch many Fast Ethernet connections at full speed)

• Routers: layer 3

Page 170: Nodes and Networks for HPC computing

21/12/05 r.innocente 170

Ethernet flow/control

• With large switches it is necessary to be able to limit the flow of packets, to avoid packet discarding following an overflow

• Vendors had tried to implement non-standard mechanisms to overcome the problem

• One of these mechanisms, usable on half-duplex links, is sending carrier bursts to simulate a busy channel

• The pseudo-busy-channel mechanism doesn't apply to full-duplex links

• MAC control protocol: to support flow control on f/d links:
– a new ethernet type = 0x8808, for frames of 46 bytes (min length)
– the first 2 bytes are the opcode, the others are the parameters
– opcode = 0x0001 PAUSE: stop transmitting
– dst address = 01-80-c2-00-00-01 (an address not forwarded by bridges)
– the following 2 bytes give the number of 512-bit times to stop transmission

Page 171: Nodes and Networks for HPC computing

21/12/05 r.innocente 171

Auto Negotiation

• On twisted pairs each station transmits a series of pulses (quite different from data) to signal that the link is active. These pulses are called Normal Link Pulses (NLP)

• A set of Fast Link Pulses (FLP) is used to code the station capabilities

• There are some special repeaters that connect 10 Mb/s links together and 100 Mb/s links together, and then the two sets to the ports of a mixed-speed bridge

• Parallel detection: when autonegotiation is not implemented, this mechanism tries to discover the speed in use from the NLP pulses (it will not detect duplex capability).

Page 172: Nodes and Networks for HPC computing

21/12/05 r.innocente 172

GigE peculiarities:

• Flow control is necessary on full duplex

• 1000base-T uses all 4 pairs of cat 5 cable in both directions: 250 Mhz x 4, with echo cancellation (10 and 100base-T used only 2 pairs)

• 1000base-SX on multimode fiber uses a laser (usually a VCSEL). Some multimode fibers have a discontinuity in the refraction index for some microns just in the middle of the fiber. This is not important with a LED launch, where the light fills the core, but is essential with a laser launch. An offset patch cord (a small length of single-mode fiber that moves the spot away from the center) has been proposed as a solution.

Page 173: Nodes and Networks for HPC computing

21/12/05 r.innocente 173

10 GbE /1

• IEEE 802.3ae 10GbE ratified on June 17, 2002
• Optical media ONLY: MMF 50u/400MHz 66 m, new 50u/2000MHz 300 m, 62.5u/160MHz 26 m, WWDM 4x 62.5u/160MHz 300 m, SMF 10 km/40 km
• Full duplex only (for 1 GbE a CSMA/CD mode of operation was still specified, although no one implemented it)
• LAN/WAN different: 10.3 Gbaud - 9.95 Gbaud (SONET)
• 802.3 frame format (including min/max size)

Page 174: Nodes and Networks for HPC computing

21/12/05 r.innocente 174

10GbE /2

from Bottor (Nortel Networks) :

marriage of Ethernet and DWDM

Page 175: Nodes and Networks for HPC computing

21/12/05 r.innocente 175

VLAN/1

Virtual LAN:
– IEEE 802.1Q extension to the ethernet std
– frames can be tagged with 4 more bytes (so that an ethernet frame can now be up to 1522 bytes):
  • TPID, tagged protocol identifier: 2 bytes with value 0x8100
  • TCI, tag control information: specifies the priority and the Virtual LAN (12 bits) this frame belongs to
  • remaining bytes with standard content
(a tag-packing sketch follows)
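A small sketch of how those 4 extra bytes could be packed. The 12-bit VLAN id and the priority bits come straight from the description above; the single CFI/DEI bit between them is an assumption here and is simply left at zero.

    #include <stdint.h>

    /* Sketch: pack the 4 extra bytes of an 802.1Q tag.
       TPID = 0x8100; TCI = 3-bit priority, one CFI bit (assumed 0),
       12-bit VLAN id.                                              */
    static void pack_vlan_tag(uint8_t tag[4], uint8_t priority, uint16_t vlan_id)
    {
        uint16_t tci = (uint16_t)((priority & 0x7) << 13) | (vlan_id & 0x0fff);
        tag[0] = 0x81; tag[1] = 0x00;            /* TPID 0x8100      */
        tag[2] = tci >> 8; tag[3] = tci & 0xff;  /* TCI, big endian  */
    }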

Page 176: Nodes and Networks for HPC computing

21/12/05 r.innocente 176

VLAN/2

• VLAN tagged ethernet frame :

Page 177: Nodes and Networks for HPC computing

21/12/05 r.innocente 177

Infiniband/1

• It represents the convergence of 2 separate proposals:
  – NGIO (Next Generation I/O: Intel, Microsoft, Sun)
  – FutureIO (Compaq, IBM, HP)

• Infiniband: Compaq, Dell, HP, IBM, Intel, Microsoft, Sun

Page 178: Nodes and Networks for HPC computing

21/12/05 r.innocente 178

Infiniband/2

• Std channel at 2.5 Gb/s (copper LVDS or fiber):
  – 1x width: 2.5 Gb/s
  – 4x width: 10 Gb/s
  – 12x width: 30 Gb/s

Page 179: Nodes and Networks for HPC computing

21/12/05 r.innocente 179

Infiniband/3

• Highlights:
  – point-to-point switched interconnect
  – channel-based message passing
  – computer-room interconnect, diameter < 100-300 m
  – one connection for all I/O: IPC, storage I/O, network I/O
  – up to thousands of nodes

Page 180: Nodes and Networks for HPC computing

21/12/05 r.innocente 180

Infiniband/4

Page 181: Nodes and Networks for HPC computing

21/12/05 r.innocente 181

Infiniband/5

Page 182: Nodes and Networks for HPC computing

21/12/05 r.innocente 182

Myrinet/1

• comes from a USC/ISI research project : ATOMIC

• flow control and error control on all links

• cut-through xbar switches, intelligent boards

• Myrinet packet format:
  source route bytes | type | data (no length limit) | CRC
  – the type field allows multiple protocols to coexist

Page 183: Nodes and Networks for HPC computing

21/12/05 r.innocente 183

Myrinet/2

Page 184: Nodes and Networks for HPC computing

21/12/05 r.innocente 184

Myrinet/3

Every node can discover the topology with probe packets (mapper).

In this way it gets a routing table that can be used to compute headers for any destination.

There is 1 route byte for each hop to traverse; the first byte with routing information is removed by each switch on the path (a sketch follows).
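A schematic illustration of that idea with a hypothetical helper (not Myricom's actual GM API): the sender prepends one route byte per hop, computed from its topology map, in front of the payload, and each switch consumes the leading byte.

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical sketch: prepend one source-route byte per hop
       (e.g. the output port to take at each switch) to a payload.
       Each switch on the path strips the first remaining byte.    */
    static size_t build_source_routed_packet(uint8_t *pkt,
                                             const uint8_t *route, size_t hops,
                                             const uint8_t *payload, size_t len)
    {
        memcpy(pkt, route, hops);           /* route bytes, one per hop  */
        memcpy(pkt + hops, payload, len);   /* type + data + CRC follow  */
        return hops + len;
    }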

Page 185: Nodes and Networks for HPC computing

21/12/05 r.innocente 185

Myrinet/4

• Currently Myrinet uses a 2 Gb/s link

• There are dual-link cards for PCI-X, so that a theoretical b/w of 4 Gb/s per node can be achieved; in this way you have to use 2 ports on the switch too

• Their current TCP/IP implementation is very well tuned: it reaches about 80% of the b/w of the proprietary protocol, although with a much larger latency

Page 186: Nodes and Networks for HPC computing

21/12/05 r.innocente 186

Clos networks

Named after Charles Clos, who introduced them in 1953. It's a kind of fat-tree network.
These networks have full bisection bandwidth. Large Myrinet clusters are frequently arranged as a Clos network.
Myricom's base block is their 16-port xbar switch. From this brick it's possible to build 128-node / 3-level (16+8 switches) and 1024-node / 5-level networks (a sizing sketch follows).
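A small back-of-the-envelope calculation that reproduces these node counts, under the usual assumption that leaf switches give half of their ports to hosts and half upward, so every extra pair of switch stages multiplies the host count by p/2.

    #include <stdio.h>

    /* Hosts supported by a full-bisection Clos built from p-port
       crossbars, with `stages` switch stages on the longest path:
       leaves give half their ports to hosts, half upward.          */
    static long clos_hosts(int p, int stages)
    {
        long hosts = p;                      /* a single switch: p hosts  */
        for (int level = 3; level <= stages; level += 2)
            hosts *= p / 2;                  /* each extra pair of stages */
        return hosts;
    }

    int main(void)
    {
        printf("16-port, 3 stages: %ld hosts\n", clos_hosts(16, 3)); /* 128  */
        printf("16-port, 5 stages: %ld hosts\n", clos_hosts(16, 5)); /* 1024 */
        return 0;
    }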

Page 187: Nodes and Networks for HPC computing

21/12/05 r.innocente 187

Clos networks/2

from myri.com

Page 188: Nodes and Networks for HPC computing

21/12/05 r.innocente 188

SCI/1

• Acronym for Scalable Coherent Interface, ANSI/ISO/IEEE std 1596-1992
• Packet-based protocol for SAN (System Area Network)
• It uses pt-to-pt links of 5.3 Gb/s
• Topology: ring, 2D or 3D torus
• It uses a shared memory approach. SCI addresses are 64 bits:
  – the most significant 16 bits specify the node
  – the remaining 48 bits address memory on a node
• Up to 64K nodes / 256 TB of addressing space (a decomposition sketch follows)
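A trivial illustration of that 16/48-bit split of a 64-bit SCI address:

    #include <stdint.h>

    /* Split a 64-bit SCI address into its node id (top 16 bits)
       and the 48-bit offset into that node's exported memory.    */
    static inline uint16_t sci_node(uint64_t addr)   { return (uint16_t)(addr >> 48); }
    static inline uint64_t sci_offset(uint64_t addr) { return addr & ((1ULL << 48) - 1); }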

Page 189: Nodes and Networks for HPC computing

21/12/05 r.innocente 189

SCI/2

• Dolphin SISCI API is the std API used to control the interconnect using a shared memory approach. To enable communications :

– The receiving node sets aside a part of its memory to be used by the SCI interconnect

– The sender imports the memory area in its address space

– The sender can now read or write directly in that memory area

• Because it is difficult to streamline reads through the memory/PCI bus, writes are 10 times faster than reads

• In many implementations of upper-level protocols it is therefore quite convenient to use only writes (a sketch of this flow follows)
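An outline of that export/import/write flow. The function names here (shm_export_segment, shm_import_segment) are hypothetical placeholders standing in for the corresponding SISCI calls, and they just hand out a local buffer so the sketch runs; the point is only the ordering of the steps and that the sender uses plain stores.

    #include <stddef.h>
    #include <string.h>
    #include <stdio.h>

    /* Hypothetical placeholders for the SISCI calls; here they
       simply return a local buffer so the sketch is runnable.    */
    static char simulated_remote_memory[4096];

    static void *shm_export_segment(size_t size)            /* receiver */
    {
        (void)size;
        return simulated_remote_memory;  /* area served by the interconnect */
    }

    static void *shm_import_segment(int node, size_t size)  /* sender */
    {
        (void)node; (void)size;
        return simulated_remote_memory;  /* remote area mapped into our space */
    }

    int main(void)
    {
        /* Receiver: sets aside a part of its memory for the interconnect. */
        shm_export_segment(sizeof simulated_remote_memory);

        /* Sender: maps the remote area into its own address space and
           writes into it directly (only writes; remote reads are ~10x slower). */
        char *remote = shm_import_segment(/*node=*/1, 64);
        memcpy(remote, "hello over SCI", 15);

        printf("%s\n", simulated_remote_memory);
        return 0;
    }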

Page 190: Nodes and Networks for HPC computing

21/12/05 r.innocente 190

SCI/2

• The link b/w between nodes is effectively shared

• This puts an upper limit on the number of nodes you can put in a loop, currently 8-10 nodes: so a 2D torus for up to ~64 nodes, a 3D torus beyond 64 nodes

• Because there are no switches the cost scales linearly

• Cable length is a big limitation with a preferred length of 1 m or less

Page 191: Nodes and Networks for HPC computing

21/12/05 r.innocente 191

SCI/3

Page 192: Nodes and Networks for HPC computing

21/12/05 r.innocente 192

SCI/4 3D Torus

Page 193: Nodes and Networks for HPC computing

21/12/05 r.innocente 193

Interconnect/1GbE

• Very cheap almost commodity connection now

• Broadcom chipset based < USD 100 per card

• Switch for 64-128 configurations around USD 200-300 per port

Page 194: Nodes and Networks for HPC computing

21/12/05 r.innocente 194

Interconnect/SCI

Page 195: Nodes and Networks for HPC computing

21/12/05 r.innocente 195

Interconnect/Myrinet

• Currently 2Gb/s links, with the possibility of a dual link card

• SL card 800 USD, 2 links card 1600 USD

• 1 port on a switch

Page 196: Nodes and Networks for HPC computing

21/12/05 r.innocente 196

Interconnect/Infiniband

Page 197: Nodes and Networks for HPC computing

21/12/05 r.innocente 197

Interconnect/10GbE

• cost of optics alone is around USD 3000

• 1 port on a switch around 10,000 USD

• Switches available only up to 48 ports

Page 198: Nodes and Networks for HPC computing

21/12/05 r.innocente 198

Software

Despite great advances in network technology (2-3 orders of magnitude), much communication s/w remained almost unchanged for many years (e.g. BSD networking). There is a lot of ongoing research on this theme, and very different solutions have been proposed (zero-copy, page remapping, VIA, ...).

Page 199: Nodes and Networks for HPC computing

21/12/05 r.innocente 199

Software overhead

[figure: per-message time split into software overhead and transmission time, for 10 Mb/s Ethernet, 100 Mb/s Ethernet and 1 Gb/s Ethernet: the transmission time shrinks with faster links while the software overhead stays the same]

Being roughly constant, the software overhead is becoming more and more important !! (A toy model follows.)
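A toy model of this effect; the numbers are illustrative assumptions (a 1500-byte message and a fixed 50 us of per-message software overhead), not measurements. The wire time falls with the link rate while the overhead does not, so its share of the total grows.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative assumptions: 1500-byte message, 50 us of fixed
           per-message software overhead (protocol stack, interrupts). */
        const double bytes = 1500.0, sw_overhead_us = 50.0;
        const double rates_mbps[] = {10.0, 100.0, 1000.0};

        for (int i = 0; i < 3; i++) {
            double wire_us  = bytes * 8.0 / rates_mbps[i];  /* transmission time */
            double total_us = wire_us + sw_overhead_us;
            printf("%6.0f Mb/s: wire %7.1f us, total %7.1f us, overhead %4.1f%%\n",
                   rates_mbps[i], wire_us, total_us,
                   100.0 * sw_overhead_us / total_us);
        }
        return 0;
    }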

Page 200: Nodes and Networks for HPC computing

21/12/05 r.innocente 200

Standard data flow /1

[diagram: Processor(s) - processor bus - North Bridge - memory bus - Memory, with the NIC and its memory on the PCI I/O bus towards the network; the labels 1-3 mark the three copies listed on the next slide]

Page 201: Nodes and Networks for HPC computing

21/12/05 r.innocente 201

Standard data flow /2

• Copies:
  1. copy from user space to kernel buffer
  2. DMA copy from kernel buffer to NIC memory
  3. DMA copy from NIC memory to network
• Loads:
  – 2x on the processor bus
  – 3x on the memory bus
  – 2x on the NIC memory
(the annotated send() below retraces these steps)
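For concreteness, the three copies above annotated around an ordinary sockets send(); socket setup is omitted and the descriptor is assumed to be a connected TCP socket.

    #include <sys/socket.h>
    #include <string.h>

    /* Sketch: where the three copies of the standard data flow happen
       when an application sends a buffer over an ordinary socket.      */
    static void standard_send(int connected_fd)
    {
        char user_buf[1500];
        memset(user_buf, 'x', sizeof user_buf);

        /* (1) send() copies user_buf into a kernel socket buffer.
           (2) later, the driver DMAs the kernel buffer into NIC memory
               across the memory and I/O buses.
           (3) the NIC finally pushes the frame from its memory onto
               the wire.                                                 */
        (void)send(connected_fd, user_buf, sizeof user_buf, 0);
    }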

Page 202: Nodes and Networks for HPC computing

21/12/05 r.innocente 202

Zero-copy data flow

[diagram: same node layout as before (Processor(s), North Bridge, Memory, NIC on the PCI I/O bus), but only two transfers are marked: the DMA from host memory to NIC memory and the transfer from NIC memory onto the network]

Page 203: Nodes and Networks for HPC computing

21/12/05 r.innocente 203

Zero Copy Research

• Shared memory between user/kernel:
  – Fbufs (Druschel, 1993)
  – I/O-Lite (Druschel, 1999)
• Page remapping with copy on write (Chu, 1996)
• Blast: hardware splits headers from data (Carter, O'Malley, 1990)
• ULNI (User-Level Network Interface): implementation of the communication s/w inside libraries in user space

High-speed networks, I/O systems and memory have comparable bandwidths -> it is essential to avoid any unnecessary copy of data!

Page 204: Nodes and Networks for HPC computing

21/12/05 r.innocente 204

OS bypass – User level networking

• Active Messages (AM) – von Eicken, Culler (1992)

• U-Net – von Eicken, Basu, Vogels (1995)

• PM – Tezuka, Hori, Ishikawa, Sato (1997)

• Illinois FastMessages (FM) – Pakin, Karamcheti, Chien (1997)

• Virtual Interface Architecture (VIA) – Compaq, Intel, Microsoft (1998)

Page 205: Nodes and Networks for HPC computing

21/12/05 r.innocente 205

Active Messages (AM)

• 1-sided communication paradigm (no receive op)

• each message, as soon as it is received, triggers a receive handler that acts as a separate thread (in current implementations it is sender based); a sketch follows
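A schematic of the idea using hypothetical names (the handler table and am_deliver below are not any particular AM library): the sender names a handler index, and the receiving side dispatches that handler on arrival with no explicit receive call.

    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical sketch of the active-message idea: the message
       carries a handler index chosen by the sender; the receive path
       dispatches that handler directly, with no receive() posted.     */

    typedef void (*am_handler_t)(const void *payload, size_t len);

    static void on_update(const void *payload, size_t len)
    {
        (void)payload;
        printf("handler got %zu bytes\n", len);
    }

    static am_handler_t handler_table[] = { on_update };

    /* What the receiving side's dispatch loop conceptually does. */
    static void am_deliver(int handler_idx, const void *payload, size_t len)
    {
        handler_table[handler_idx](payload, len);   /* runs like a thread */
    }

    int main(void)
    {
        const char msg[] = "new value";
        /* "Sending" locally: in a real system the sender would ship the
           handler index and payload across the network to the receiver. */
        am_deliver(/*handler_idx=*/0, msg, sizeof msg);
        return 0;
    }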

Page 206: Nodes and Networks for HPC computing

21/12/05 r.innocente 206

FastMessages (FM)

• FM_send(dest, handler, buf, size) – sends a long message

• FM_send_4(dest, handler, i0, i1, i2, i3) – sends a 4-word msg (reg to reg)

• FM_extract() – processes a received msg

(a usage sketch follows)
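A usage sketch built only from the three calls listed above. The extern prototypes, argument types and the idea that handlers are identified by small integers are assumptions here (the slide does not give them); it will only link against the actual FM library headers.

    /* Prototypes as implied by the slide (types assumed). */
    extern void FM_send(int dest, int handler, void *buf, int size);
    extern void FM_send_4(int dest, int handler, int i0, int i1, int i2, int i3);
    extern void FM_extract(void);

    #define UPDATE_HANDLER 1   /* assumed handler id */

    void fm_example(int peer)
    {
        int counters[4] = {1, 2, 3, 4};
        static char bulk[4096];

        /* short, register-to-register message: 4 words + handler id */
        FM_send_4(peer, UPDATE_HANDLER,
                  counters[0], counters[1], counters[2], counters[3]);

        /* long message: handler id + buffer + size */
        FM_send(peer, UPDATE_HANDLER, bulk, sizeof bulk);

        /* receiving side: poll and run handlers for arrived messages
           (a real program would bound or interleave this loop)        */
        for (;;)
            FM_extract();
    }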

Page 207: Nodes and Networks for HPC computing

21/12/05 r.innocente 207

VIA/1

To achieve at the industrial level the advantages obtained from the various User Level Networking (ULN) initiatives, Compaq, Intel and Microsoft proposed an industry standard called VIA (Virtual Interface Architecture). This proposal specifies an API (Application Programming Interface).

In this proposal, network reads and writes are done bypassing the OS, while open/close/map operations are done with kernel intervention. It requires memory registration.

Page 208: Nodes and Networks for HPC computing

21/12/05 r.innocente 208

VIA/2

[diagram: VIA architecture: the VI Provider API exposes Open/Close/Map, which go through the VI kernel agent via the VI kernel interface, and Send/Receive/RDMA, which go directly to the VI hardware; each VI has a send queue, a receive queue and doorbells]

Page 209: Nodes and Networks for HPC computing

21/12/05 r.innocente 209

VIA/3

• VIA is better suited to network cards implementing advanced mechanisms like doorbells (a queue of transactions, kept in the card's memory-mapped address space, that the card remembers)

• Anyway it can also be implemented completely in s/w, although less efficiently (look at the M-VIA UCB project, and MVICH)

Page 210: Nodes and Networks for HPC computing

21/12/05 r.innocente 210

Fbufs (Druschel 1993)

• Shared buffers between user processes and the kernel

• It is a general scheme that can be used for either network or I/O operations

• Specific API without copy semantics

• A buffer pool specific to a process is created as part of process initialization

• This buffer pool can be pre-mapped in both the user and kernel address spaces

• API (a usage sketch follows):
  – uf_read(int fd, void **bufp, size_t nbytes)
  – uf_get(int fd, void **bufp)
  – uf_write(int fd, void *bufp, size_t nbytes)
  – uf_allocate(size_t size)
  – uf_deallocate(void *ptr, size_t size)
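A usage sketch around the API above. The return types and the exact buffer-ownership rules are assumptions here; the point is only that the fbuf itself travels between user and kernel address space, so no data copy is needed.

    #include <stddef.h>

    /* Prototypes as listed on the slide (return types assumed). */
    extern int   uf_read(int fd, void **bufp, size_t nbytes);
    extern int   uf_write(int fd, void *bufp, size_t nbytes);
    extern void *uf_allocate(size_t size);
    extern void  uf_deallocate(void *ptr, size_t size);

    /* Forward data from one descriptor to another without copying:
       the same pre-mapped fbuf is handed up by the kernel and then
       handed back down for the write.                               */
    void forward(int in_fd, int out_fd, size_t nbytes)
    {
        void *fbuf = NULL;

        uf_read(in_fd, &fbuf, nbytes);      /* kernel hands us an fbuf     */
        uf_write(out_fd, fbuf, nbytes);     /* same buffer goes back down  */
        uf_deallocate(fbuf, nbytes);        /* return it to the pool       */
    }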

Page 211: Nodes and Networks for HPC computing

21/12/05 r.innocente 211

Network technologies

• Extended frames:
  – Alteon Jumbograms (MTU up to 9000)

• Interrupt suppression (coalescing)

• Checksum offloading

(from Gallatin 1999; a jumbo-MTU example follows)
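As one concrete example of the first point, a jumbo MTU can be requested on a Linux interface with the SIOCSIFMTU ioctl. The interface name "eth0" and the assumption that the NIC, driver and switches actually accept 9000-byte frames are site-specific.

    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <unistd.h>

    /* Request a 9000-byte MTU on "eth0" (needs root and a NIC,
       driver and switch that actually support jumbo frames).    */
    int main(void)
    {
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        memset(&ifr, 0, sizeof ifr);
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        ifr.ifr_mtu = 9000;

        if (ioctl(fd, SIOCSIFMTU, &ifr) < 0)
            perror("SIOCSIFMTU");

        close(fd);
        return 0;
    }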

Page 212: Nodes and Networks for HPC computing

21/12/05 r.innocente 212

Trapeze driver for Myrinet2000 on freeBSD

• zero-copy TCP by page remapping and copy-on-write

• split header/payload

• gather/scatter DMA

• checksum offload to NIC

• adaptive message pipelining

• 220 MB/s (1760 Mb/s) on Myrinet2000 and Dell Pentium III dual processor

The application does’nt touch the data(from Chase, Gallatin, Yocum

2001)

Page 213: Nodes and Networks for HPC computing

21/12/05 r.innocente 213

Bibliography

• Patterson/Hennessy – Computer Architecture: A Quantitative Approach

• Hwang – Advanced Computer Architecture

• Schimmel – UNIX Systems for Modern Architectures