6/28/98 SPAA/PODC 1 High-Performance Clusters part 2: Generality David E. Culler Computer Science Division U.C. Berkeley PODC/SPAA Tutorial Sunday, June 28, 1998


Page 1:

High-Performance Clusters part 2: Generality

David E. Culler

Computer Science Division

U.C. Berkeley

PODC/SPAA Tutorial

Sunday, June 28, 1998

Page 2:

What’s Different about Clusters?

• Commodity parts?

• Communications Packaging?

• Incremental Scalability?

• Independent Failure?

• Intelligent Network Interfaces?

• Fast Scalable Communication?

=> Complete system on every node
– virtual memory

– scheduler

– file system

– ...

Page 3:

Topics: Part 2

• Virtual Networks
– communication meets virtual memory

• Scheduling

• Parallel I/O

• Clusters of SMPs

• VIA

Page 4:

General purpose requirements

• Many timeshared processes
– each with direct, protected access

• User and system

• Client/Server, parallel clients, parallel servers
– they grow, shrink, handle node failures

• Multiple packages in a process
– each may have its own internal communication layer

• Use communication as easily as memory

Page 5:

Virtual Networks

• Endpoint abstracts the notion of “attached to the network”

• Virtual network is a collection of endpoints that can name each other.

• Many processes on a node can each have many endpoints, each with own protection domain.
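The abstraction above can be made concrete with a minimal sketch (this is not the NOW or AM-II API; all names are illustrative): a virtual network is a collection of endpoints that name each other by a small index, and an endpoint in a different virtual network simply cannot be named, which is the protection boundary.

```python
# Illustrative sketch of virtual networks: endpoints name each other only
# within their own virtual network (= protection domain).

class Endpoint:
    def __init__(self, vnet, index):
        self.vnet = vnet        # owning virtual network (protection domain)
        self.index = index      # name valid within that virtual network only
        self.inbox = []

class VirtualNetwork:
    def __init__(self):
        self.endpoints = []

    def create_endpoint(self):
        ep = Endpoint(self, len(self.endpoints))
        self.endpoints.append(ep)
        return ep

    def send(self, src, dst_index, msg):
        # A source may only name destinations inside its own virtual network.
        if src.vnet is not self:
            raise PermissionError("endpoint is outside this virtual network")
        self.endpoints[dst_index].inbox.append((src.index, msg))

# A process may hold many endpoints, possibly in different virtual networks.
vn = VirtualNetwork()
a, b = vn.create_endpoint(), vn.create_endpoint()
vn.send(a, b.index, "hello")
```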

Page 6:

How are they managed?

• How do you get direct hardware access for performance with a large space of logical resources?

• Just like virtual memory
– active portion of the large logical space is bound to physical resources

[Diagram: processes 1..n on a node, each with endpoints in host memory; the network interface (processor + NIC memory) backs the active endpoints]

Page 7:

Endpoint Transition Diagram

• COLD: paged host memory
• WARM: paged host memory, R/O
• HOT: NIC memory, R/W

[Diagram: Read promotes COLD to WARM; Write or Msg Arrival promotes to HOT; Evict demotes HOT; Swap demotes WARM to COLD]
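The transition diagram above can be read as a small table-driven state machine; the event names below are my reading of the figure, not an interface from the talk.

```python
# Sketch of the endpoint transition diagram: COLD and WARM endpoints live in
# paged host memory (WARM is read-only), HOT endpoints occupy NIC memory R/W.

TRANSITIONS = {
    ("COLD", "read"):        "WARM",  # touching endpoint state maps it R/O
    ("WARM", "write"):       "HOT",   # a write needs an R/W NIC frame
    ("WARM", "msg_arrival"): "HOT",   # so does servicing arriving messages
    ("HOT",  "evict"):       "WARM",  # NIC frame reclaimed for another endpoint
    ("WARM", "swap"):        "COLD",  # paged out of host memory entirely
}

def next_state(state, event):
    # Events with no table entry leave the endpoint's binding unchanged.
    return TRANSITIONS.get((state, event), state)
```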

Page 8:

Network Interface Support

• NIC has endpoint frames

• Services active endpoints

• Signals misses to driver
– using a system endpoint

[Diagram: NIC holds endpoint frames 0–7 with transmit and receive queues; an endpoint miss is signaled to the driver]

Page 9:

Solaris System Abstractions

Segment Driver
• manages portions of an address space

Device Driver
• manages an I/O device

Virtual Network Driver

Page 10:

LogP Performance

• Competitive latency

• Increased NIC processing

• Difference mostly
– ack processing
– protection check
– data structures
– code quality

• Virtualization cheap

[Chart: LogP breakdown (g, L, Or, Os) in µs, scale 0–16, comparing GAM and AM]

Page 11:

Bursty Communication among many

[Diagram: several clients issuing bursts of messages to several servers, with work between bursts]

[Charts: msgs/sec vs. number of clients (up to 16, axis to 18,000) and burst bandwidth in msgs/sec vs. clients (up to 20, axis to 70,000)]

Page 12:

Multiple VNs, Single-threaded Server

[Chart: aggregate msgs/s (axis to 80,000) vs. number of virtual networks (1–28), for continuous traffic and bursts of 1024, 2048, 4096, 8192, and 16384 msgs]

Page 13:

Multiple VNs, Multithreaded Server

[Chart: aggregate msgs/s (axis to 80,000) vs. number of virtual networks (1–28), for continuous traffic and bursts of 1024, 2048, 4096, 8192, and 16384 msgs]

Page 14:

Perspective on Virtual Networks

• Networking abstractions are vertical stacks
– new function => new layer
– poke through for performance

• Virtual Networks provide a horizontal abstraction
– basis for building new, fast services

• Open questions
– What is the communication “working set”?
– What placement, replacement, … ?

Page 15:

Beyond the Personal Supercomputer

• Able to timeshare parallel programs
– with fast, protected communication

• Mix with sequential and interactive jobs

• Use fast communication in OS subsystems
– parallel file system, network virtual memory, …

• Nodes have powerful, local OS scheduler

• Problem: local schedulers do not know to run parallel jobs in parallel

Page 16:

Local Scheduling

• Schedulers act independently w/o global control

• Program waits while trying to communicate with peers that are not running

• 10 - 100x slowdowns for fine-grain programs!

=> need coordinated scheduling

[Diagram: jobs A, B, C timeshared across processors P1–P4; without coordination, a job's processes rarely run at the same time]

Page 17:

Explicit Coscheduling

• Global context switch according to precomputed schedule

• How do you build it? Does it work?

[Diagram: a master directs P1–P4 to context switch in lockstep, so job A runs on all processors at once, then jobs B and C]

Page 18:

Typical Cluster Subsystem Structures

[Diagrams: Master-Slave — applications (A) over local services (LS) on each node, coordinated by a central master over the communication layer; Peer-to-Peer — a global service (GS) component beside each local service, coordinating peer-to-peer over the communication layer]

Page 19:

Ideal Cluster Subsystem Structure

• Obtain coordination without explicit subsystem interaction, using only the events in the program

– very easy to build

– potentially very robust to component failures

– inherently “service on-demand”

– scalable

• Local service component can evolve.

[Diagram: each node pairs its local service (LS) with a global service (GS) component; coordination emerges from application events rather than a master]

Page 20:

Three approaches examined in NOW

• GLUnix explicit master-slave (user level)
– matrix algorithm to pick the PP
– uses stops & signals to try to force the desired PP to run

• Explicit peer-peer scheduling assist with VNs
– co-scheduling daemons decide on the PP and kick the Solaris scheduler

• Implicit
– modify the parallel run-time library to allow it to get itself co-scheduled with the standard scheduler

[Diagrams: the three structures — master-slave with a central master (M), peer-to-peer with explicit global schedulers (GS), and implicit with no dedicated coordinator]

Page 21:

Problems with explicit coscheduling

• Implementation complexity

• Need to identify parallel programs in advance

• Interacts poorly with interactive use and load imbalance

• Introduces new potential faults

• Scalability

Page 22:

Why implicit coscheduling might work

• Active message request-reply model

• Infer non-local state from local observations; react to maintain coordination

observation        implication              action
fast response      partner scheduled        spin
delayed response   partner not scheduled    block

[Diagram: WS 1 runs Job A and issues a request; WS 2–4, initially running Job B, converge on running Job A after the request/response exchange; spin and sleep mark the waiting behavior]
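The table above can be rendered directly as code (a toy sketch; names and the polling granularity are mine): infer the partner's scheduling state from how long a reply takes, and either keep spinning to stay coordinated or block to free the processor.

```python
# Implicit coscheduling reaction rule: spin while the reply may arrive
# quickly (partner scheduled), block once it is overdue (partner not
# scheduled). poll(t) returns the reply or None at spin time t (µs).

def wait_for_reply(poll, spin_threshold_us):
    for t in range(spin_threshold_us + 1):
        reply = poll(t)
        if reply is not None:
            return reply, "spun"    # fast response: partner was scheduled
    return None, "blocked"          # delayed response: block and sleep

fast = lambda t: "ack" if t >= 3 else None   # reply arrives at t = 3
slow = lambda t: None                        # partner never scheduled
```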

Page 23:

Obvious Questions

• Does it work?

• How long do you spin?

• What are the requirements on the local scheduler?

Page 24:

How Long to Spin?

• Answer: round trip time + context switch + msg processing

– round-trip to stay scheduled together

– plus wake-up to get scheduled together

– keep spinning if serving messages

» interval of 3 x wake-up

[Timeline: on WS 1, Job A spin-waits for a reply; on WS 2 the partner is either scheduled (reply within 2L+4o) or must first be woken (2L+4o+W); after that the waiter sleeps and Jobs B/C run]
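The rule of thumb above can be written in LogP terms (all times in µs; the parameter values in the test are illustrative). A request/reply round trip costs 2L + 4o, plus the wake-up cost W when the partner must also get rescheduled; my reading of the "3 x wake-up" note is that a node serving messages extends its spin by an interval of 3W.

```python
# Spin threshold for implicit coscheduling, in LogP terms:
#   baseline round trip = 2L + 4o  (stay scheduled together)
#   + W                            (get scheduled together after a wake-up)
#   + 3W while serving messages    (keep spinning while doing useful work)

def spin_threshold(L, o, W, serving_messages=False):
    threshold = 2 * L + 4 * o + W
    if serving_messages:
        threshold += 3 * W
    return threshold
```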

Page 25:

Does it work?

Page 26:

Synthetic Bulk-synchronous Apps

• Range of granularity and load imbalance
– spin-wait: 10x slowdown

Page 27:

With mixture of reads

• Block-immediate: 4x slowdown

Page 28:

Timesharing Split-C Programs

Page 29:

Many Questions

• What about
– mix of jobs?

– sequential jobs?

– unbalanced placement?

– Fairness?

– Scalability?

• How broadly can implicit coordination be applied in the design of cluster subsystems?

• Can resource management be completely decentralized?

– Computational economies, ecologies

Page 30:

A look at Serious File I/O

• Traditional I/O system

• NOW I/O system

• Benchmark Problem: sort a large number of 100-byte records with 10-byte keys

– start on disk, end on disk

– accessible as files (use the file system)

– Datamation sort: 1 million records

– Minute sort: quantity in a minute

[Diagram: traditional Proc-Mem node vs. a row of P-M nodes, each with local disks]

Page 31:

NOW-Sort Algorithm

• Read
– N/P records from disk -> memory

• Distribute
– scatter keys to processors holding result buckets
– gather keys from all processors

• Sort
– partial radix sort on each bucket

• Write
– write records to disk

(2 pass: gather data runs onto disk, then local, external merge sort)
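The one-pass phases above can be sketched as a toy, single-process simulation (illustrative only — real NOW-Sort runs one process per node, uses 10-byte keys, and a partial radix sort; plain `sorted` stands in for the local sort here). Each of P "processors" reads N/P records, scatters each record to the processor owning its key's bucket, sorts its bucket locally, and writes it out.

```python
# Toy rendering of the one-pass NOW-Sort phases on (key, payload) records.

def now_sort(records, P):
    n = len(records)
    # Read phase: N/P records per processor.
    local = [records[p * n // P:(p + 1) * n // P] for p in range(P)]
    # Distribute phase: scatter each record to the processor owning
    # its key's range bucket.
    lo = min(k for k, _ in records)
    hi = max(k for k, _ in records) + 1
    buckets = [[] for _ in range(P)]
    for chunk in local:
        for key, payload in chunk:
            owner = (key - lo) * P // (hi - lo)
            buckets[owner].append((key, payload))
    # Sort phase (stand-in for the partial radix sort), then write phase:
    # concatenating the per-processor buckets yields globally sorted output.
    out = []
    for b in buckets:
        out.extend(sorted(b))
    return out
```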

Page 32:

Key Implementation Techniques

• Performance Isolation: highly tuned local disk-to-disk sort

– manage local memory

– manage disk striping

– memory-mapped I/O with madvise, buffering

– manage overlap with threads

• Efficient Communication
– completely hidden under disk I/O
– competes for I/O bus bandwidth

• Self-tuning Software
– probe available memory, disk bandwidth, trade-offs

Page 33:

Minute Sort

[Chart: gigabytes sorted in a minute (axis to 9 GB) vs. processors (up to ~100), with the SGI Power Challenge and SGI Origin marked for comparison]

World-Record Disk-to-Disk Sort

• Sustain 500 MB/s disk bandwidth and 1,000 MB/s network bandwidth

• but only in the wee hours of the morning

Page 34:

Towards a Cluster File System

• Remote disk system built on a virtual network

[Charts: read/write rate (MB/s, ~5.0–6.0) and client/server CPU utilization (0–40%) for local vs. remote disk]

[Diagram: client application → RDlib → active msgs → RD server]

Page 35:

Streaming Transfer Experiment

[Diagrams: four data distributions across processors P0–P3 and disks 0–3 — Local, P3FS Local, P3FS Reverse, P3FS Remote]

Page 36:

Results

• Data distribution affects resource utilization

• Not the delivered bandwidth

[Charts: rate (MB/s, ~5.0–6.0) and client/server CPU utilization (0–40%) by access method — Local, P3FS Local, P3FS Reverse, P3FS Remote]

Page 37:

I/O Bus crossings

[Diagrams: memory (M), processor (P), and network interface (NI) per node, showing I/O-bus crossings for a parallel scan and a parallel sort with (a) local disk and (b) remote disk]

Page 38:

Opportunity: PDISK

• Producers dump data into an I/O river
• Consumers pull it out
• Hash data records across disks
• Match producers to consumers
• Integrated with work scheduling

[Diagram: producers and consumers (P) connected by fast communication (remote queues) over fast I/O (streaming disk queues)]
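The "I/O river" idea above can be sketched in a few lines (names are mine, not a PDISK API): producers hash records across per-disk queues, and consumers pull from whichever queue has data, so producers and consumers are matched dynamically rather than statically assigned.

```python
# Illustrative I/O-river sketch: hashed per-disk queues decouple producers
# from consumers.

from collections import deque

class River:
    def __init__(self, n_disks):
        self.queues = [deque() for _ in range(n_disks)]

    def put(self, record):
        # Producers hash records across disks to spread the write load.
        self.queues[hash(record) % len(self.queues)].append(record)

    def get(self):
        # Any consumer drains any non-empty disk queue.
        for q in self.queues:
            if q:
                return q.popleft()
        return None
```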

Page 39:

What will be the building block?

[Diagrams: SMPs — processors and memory on a memory interconnect, with network cards — attached to a network cloud; design space of number of nodes vs. processors per node spans NOWs, clusters of SMPs, and SMPs]

Page 40:

Multi-Protocol Communication

• Uniform Prog. Model is key

• Multiprotocol Messaging
– careful layout of msg queues
– concurrent objects
– polling network hurts memory

• Shared Virtual Memory
– relies on underlying msgs

• Pooling vs Contention

[Diagram: communication layer routing Send/Write and Rcv/Read through shared memory within a node or over the network between nodes]

Page 41:

LogP analysis of shared mem AM

[Chart: LogP costs in microseconds (0–8) — latency, send overhead, receive overhead, gap, and round-trip time — for shared-memory AM queues protected by test&set, test&test&set, ticket, Anderson, and POSIX locks, and a lock-free queue]

Page 42:

Virtual Interface Architecture

[Diagram: the application (and Sockets, MPI, legacy layers above it) uses the VI User Agent (“libvia”); Open, Connect, and Map Memory go through the VIA kernel driver (slow path), while descriptor reads/writes go directly to the VI-capable NIC via doorbells (fast user-level path); each VI has send (S), receive (R), and completion (C) queues for requests and completions]

Page 43:

VIA Implementation Overview

[Diagram: doorbell pages and descriptor queues in kernel memory mapped to the application, with Tx/Rx data buffers; send path: (1) write descriptor and ring doorbell, (2–3) NIC DMA-requests and DMA-reads the descriptor, (4–5) DMA-requests and DMA-reads the data buffer, block transfer, (7) DMA-write of the completion]
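The send path in the figure can be caricatured in a few lines (all names are illustrative; this is not the VIA/VIPL API): the application writes a descriptor into a queue in registered memory and rings a doorbell, and the NIC then reads the descriptor and DMAs the data buffer it points to.

```python
# Toy model of the VIA send path: descriptor queue + doorbell + NIC DMA.

class ToyNIC:
    def __init__(self):
        self.wire = []                      # stand-in for the network

    def doorbell(self, descriptor_queue):
        desc = descriptor_queue.pop(0)           # NIC DMA-reads the descriptor
        self.wire.append(bytes(desc["buffer"]))  # DMA-reads and sends the data
        desc["status"] = "complete"              # DMA-writes the completion

def post_send(nic, send_queue, buffer):
    desc = {"buffer": buffer, "status": "pending"}  # app writes the descriptor
    send_queue.append(desc)
    nic.doorbell(send_queue)                        # rings the mapped doorbell
    return desc
```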

Page 44:

Current VIA Performance

[Chart: round-trip time (µs, 0–600) vs. message size (4–4060 bytes) for VIA, U-Net, and AM2]

Page 45:

VIA ahead

• You will be able to buy decent clusters

• Virtualization in host memory is easy
– will it go beyond pinned regions?
– still need to manage active endpoints (doorbells)

• Complex descriptor queues will hinder low-latency short messages

– NICs will chew on them, but many instructions on host

• Need to re-examine where error handling, flow control, retry are performed

• Interactions with scheduling, I/O, locking etc. will dominate application speed-up

– will demand new development methodologies

Page 46:

Conclusions

• Complete system on every node makes clusters a very powerful architecture

– can finally get serious about I/O

• Extend the system globally
– virtual memory systems,
– schedulers,
– file systems, ...

• Efficient communication enables new solutions to classic systems challenges

• Opens a rich set of issues for parallel processing beyond the personal supercomputer or LAN

– where SPAA and PODC meet