
6/28/98 SPAA/PODC 1

High-Performance Clusters, part 2: Generality

David E. Culler

Computer Science Division

U.C. Berkeley

PODC/SPAA Tutorial

Sunday, June 28, 1998

6/28/98 SPAA/PODC 2

What’s Different about Clusters?

• Commodity parts?

• Communications Packaging?

• Incremental Scalability?

• Independent Failure?

• Intelligent Network Interfaces?

• Fast Scalable Communication?

=> Complete System on every node
– virtual memory

– scheduler

– file system

– ...

6/28/98 SPAA/PODC 3

Topics: Part 2

• Virtual Networks
– communication meets virtual memory

• Scheduling

• Parallel I/O

• Clusters of SMPs

• VIA

6/28/98 SPAA/PODC 4

General purpose requirements

• Many timeshared processes
– each with direct, protected access

• User and system

• Client/Server, parallel clients, parallel servers
– they grow, shrink, handle node failures

• Multiple packages in a process
– each may have its own internal communication layer

• Use communication as easily as memory

6/28/98 SPAA/PODC 5

Virtual Networks

• Endpoint abstracts the notion of “attached to the network”

• Virtual network is a collection of endpoints that can name each other.

• Many processes on a node can each have many endpoints, each with own protection domain.
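As a concrete view of the abstraction, here is a minimal C sketch of what an endpoint interface of this flavor could look like. All names (vn_endpoint, vn_create_endpoint, vn_map_peer, vn_send, vn_poll) are hypothetical, not the actual Berkeley API.

    #include <stddef.h>

    /* A hypothetical endpoint handle: one process may hold many of these,
       each belonging to a virtual network with its own protection domain. */
    typedef struct vn_endpoint vn_endpoint;

    /* Create an endpoint in the named virtual network; the OS/NIC driver
       decides whether it is backed by host memory or NIC memory. */
    vn_endpoint *vn_create_endpoint(const char *virtual_network_name);

    /* Name a peer endpoint within the same virtual network. */
    int vn_map_peer(vn_endpoint *ep, int peer_index);

    /* Send a short message to a peer; protection is checked per endpoint,
       so untrusted processes can safely share the physical network. */
    int vn_send(vn_endpoint *ep, int peer_index, const void *buf, size_t len);

    /* Poll for arrivals; a handler runs for each delivered message. */
    typedef void (*vn_handler)(vn_endpoint *ep, int src, void *buf, size_t len);
    int vn_poll(vn_endpoint *ep, vn_handler handler);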

6/28/98 SPAA/PODC 6


How are they managed?

• How do you get direct hardware access for performance with a large space of logical resources?

• Just like virtual memory
– the active portion of a large logical space is bound to physical resources

[Figure: processes 1..n on a host, each with many endpoints; host memory, processor, and NIC memory on the network interface — active endpoints are bound to NIC resources]

6/28/98 SPAA/PODC 7

Endpoint Transition Diagram

[Diagram: endpoint states — COLD (paged host memory), WARM (read-only, paged host memory), HOT (read/write, NIC memory) — with transitions on read, write / message arrival, evict, and swap]
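The same diagram written out as a small state machine; this is an illustrative sketch of one plausible reading of the transitions, with hypothetical names.

    /* Endpoint residence states from the diagram above (illustrative only). */
    typedef enum { EP_COLD, EP_WARM, EP_HOT } ep_state;

    typedef enum { EV_READ, EV_WRITE, EV_MSG_ARRIVAL, EV_EVICT, EV_SWAP } ep_event;

    /* COLD: paged host memory; WARM: read-only, paged host memory;
       HOT: read/write, resident in NIC memory. */
    static ep_state ep_transition(ep_state s, ep_event e)
    {
        switch (e) {
        case EV_READ:                        /* a read warms a cold endpoint   */
            return (s == EP_COLD) ? EP_WARM : s;
        case EV_WRITE:
        case EV_MSG_ARRIVAL:                 /* activity makes it hot (on NIC) */
            return EP_HOT;
        case EV_EVICT:                       /* NIC frame reclaimed            */
            return (s == EP_HOT) ? EP_WARM : s;
        case EV_SWAP:                        /* paged all the way out          */
            return EP_COLD;
        }
        return s;
    }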

6/28/98 SPAA/PODC 8

Network Interface Support

• NIC has endpoint frames

• Services active endpoints

• Signals misses to driver
– using a system endpoint

[Diagram: NIC endpoint frames 0..7 with transmit and receive queues, plus an endpoint-miss path to the host]
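A sketch of the NIC-side behavior this slide implies: the NIC holds a fixed number of endpoint frames and services only resident endpoints, reporting traffic for a non-resident endpoint as a miss to the host driver through a reserved system endpoint. All names are hypothetical.

    #define NUM_FRAMES 8

    struct frame { int valid; int endpoint_id; /* tx/rx queues, etc. */ };
    static struct frame frames[NUM_FRAMES];

    /* Placeholders for the fast-path delivery and the miss signal. */
    void deliver_to_frame(int frame, const void *msg, unsigned len);
    void signal_miss_to_driver(int endpoint_id);

    static int find_frame(int endpoint_id)
    {
        for (int i = 0; i < NUM_FRAMES; i++)
            if (frames[i].valid && frames[i].endpoint_id == endpoint_id)
                return i;
        return -1;
    }

    static void on_message(int dst_endpoint_id, const void *msg, unsigned len)
    {
        int f = find_frame(dst_endpoint_id);
        if (f >= 0)
            deliver_to_frame(f, msg, len);          /* endpoint is resident  */
        else
            signal_miss_to_driver(dst_endpoint_id); /* driver pages it in    */
    }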

6/28/98 SPAA/PODC 9

Solaris System Abstractions

Segment Driver
• manages portions of an address space

Device Driver
• manages an I/O device

Virtual Network Driver

6/28/98 SPAA/PODC 10

LogP Performance

• Competitive latency

• Increased NIC processing

• Difference mostly
– ack processing

– protection check

– data structures

– code quality

• Virtualization cheap

[Chart: LogP parameters — gap (g), latency (L), receive overhead (Or), send overhead (Os) — in µs, comparing gam and AM endpoints, roughly in the 0-16 µs range]
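For reference, the LogP quantities plotted here, restated in the standard form (the round-trip expression below is the usual no-contention approximation):

    % LogP parameters, measured in microseconds in the chart above:
    %   L        : network latency for a small message
    %   o_s, o_r : send and receive processing overhead on the processor
    %   g        : gap, the reciprocal of per-processor message bandwidth
    \[
    T_{\mathrm{rtt}} \;\approx\; 2L + 2o_s + 2o_r \;=\; 2L + 4o
    \quad\text{when } o_s \approx o_r \approx o .
    \]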

6/28/98 SPAA/PODC 11

Bursty Communication among many

[Figure: several clients bursting messages ("msg burst, work") at servers. Charts: sustained message rate (msgs/sec) vs. number of clients, and burst bandwidth (msgs/sec) vs. number of clients]

6/28/98 SPAA/PODC 12

Multiple VNs, Single-thread Server

[Chart: aggregate msgs/s vs. number of virtual networks (1-28), for continuous traffic and bursts of 1024, 2048, 4096, 8192, and 16384 msgs]

6/28/98 SPAA/PODC 13

Multiple VNs, Multithreaded Server

[Chart: aggregate msgs/s vs. number of virtual networks (1-28), for continuous traffic and bursts of 1024, 2048, 4096, 8192, and 16384 msgs, with a multithreaded server]

6/28/98 SPAA/PODC 14

Perspective on Virtual Networks

• Networking abstractions are vertical stacks
– new function => new layer

– poke through for performance

• Virtual Networks provide a horizontal abstraction
– basis for building new, fast services

• Open questions
– What is the communication “working set”?

– What placement, replacement, … ?

6/28/98 SPAA/PODC 15

Beyond the Personal Supercomputer

• Able to timeshare parallel programs
– with fast, protected communication

• Mix with sequential and interactive jobs

• Use fast communication in OS subsystems
– parallel file system, network virtual memory, …

• Nodes have powerful, local OS scheduler

• Problem: local schedulers do not know to run parallel jobs in parallel

6/28/98 SPAA/PODC 16

Local Scheduling

• Schedulers act independently w/o global control

• A program waits while trying to communicate with peers that are not running

• 10 - 100x slowdowns for fine-grain programs!

=> need coordinated scheduling

[Diagram: uncoordinated local schedules on P1-P4 over time — fragments of jobs A, B, and C run at different times on different nodes, so a parallel job rarely has all of its processes running at once]

6/28/98 SPAA/PODC 17

Explicit Coscheduling

• Global context switch according to precomputed schedule

• How do you build it? Does it work?

[Diagram: a master drives a precomputed global schedule, so P1-P4 context-switch together — all of job A runs at once, then jobs B and C]

6/28/98 SPAA/PODC 18

Typical Cluster Subsystem Structures

[Diagram: two common structures. Master-Slave: applications (A) and a local service (LS) on each node, coordinated by a central master over the communication layer. Peer-to-Peer: each node runs applications (A), a local service (LS), and a global service (GS) component, with the GS components coordinating among themselves over the communication layer]

6/28/98 SPAA/PODC 19

Ideal Cluster Subsystem Structure

• Obtain coordination without explicit subsystem interaction, only the events in the program

– very easy to build

– potentially very robust to component failures

– inherently “service on-demand”

– scalable

• Local service component can evolve.

[Diagram: each node runs applications (A), a local service (LS), and a global service (GS) component; coordination emerges from the application's own communication, with no explicit master]

6/28/98 SPAA/PODC 20

Three approaches examined in NOW

• GLUNIX explicit master-slave (user level)
– matrix algorithm to pick PP

– uses stops & signals to try to force desired PP to run

• Explicit peer-peer scheduling assist with VNs
– co-scheduling daemons decide on PP and kick the Solaris scheduler

• Implicit
– modify the parallel run-time library to allow it to get itself co-scheduled with the standard scheduler

[Diagram: the three structures side by side — GLUNIX master-slave (applications A and local services LS under a master M), explicit peer-to-peer (per-node global service daemons GS), and implicit (coordination folded into the applications and local services)]

6/28/98 SPAA/PODC 21

Problems with explicit coscheduling

• Implementation complexity

• Need to identify parallel programs in advance

• Interacts poorly with interactive use and load imbalance

• Introduces new potential faults

• Scalability

6/28/98 SPAA/PODC 22

Why implicit coscheduling might work

• Active message request-reply model

• Infer non-local state from local observations; react to maintain coordination

observation        implication              action
fast response      partner scheduled        spin
delayed response   partner not scheduled    block

[Diagram: workstations WS1-WS4 timesharing jobs A and B; after a request, the sender spins while responses come back quickly and sleeps when a response is delayed]
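A sketch of the decision rule in the table above: spin while replies come back quickly (the partner is evidently scheduled), block when a reply is slow (the partner evidently is not). The names and the threshold variable are illustrative only.

    #include <stdbool.h>

    /* Hypothetical primitives provided by the messaging layer / OS. */
    bool   reply_arrived(int request_id);        /* poll for the reply       */
    void   block_until_reply(int request_id);    /* sleep; woken on arrival  */
    double now_usec(void);

    /* Spin-then-block wait used after issuing a request to a partner node. */
    void wait_for_reply(int request_id, double spin_threshold_usec)
    {
        double start = now_usec();
        while (!reply_arrived(request_id)) {
            if (now_usec() - start > spin_threshold_usec) {
                /* Delayed response => partner probably descheduled: block so
                   the local scheduler can run another job; coordination is
                   maintained implicitly, with no global master. */
                block_until_reply(request_id);
                return;
            }
            /* Fast responses => partner scheduled: keep spinning to stay
               coscheduled (and keep serving incoming requests, not shown). */
        }
    }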

6/28/98 SPAA/PODC 23

Obvious Questions

• Does it work?

• How long do you spin?

• What are the requirements on the local scheduler?

6/28/98 SPAA/PODC 24

How Long to Spin?

• Answer: round-trip time + context switch + msg processing

– round-trip to stay scheduled together

– plus wake-up to get scheduled together

– keep spinning if serving messages

» interval of 3 x wake-up

[Diagram: WS1 and WS2 timesharing jobs A, B, and C; the requester spin-waits roughly 2L+4o when its partner is already scheduled, and 2L+4o+W when the partner must first be woken up, otherwise it sleeps]
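Written out, the spin thresholds sketched above are roughly:

    % Baseline: spin long enough to cover a round trip while both are scheduled.
    \[
    T_{\mathrm{spin}} \;\gtrsim\; 2L + 4o
    \]
    % If the partner must first be woken up (wake-up / context-switch cost W):
    \[
    T_{\mathrm{reply}} \;\approx\; 2L + 4o + W
    \]
    % and the rule of thumb above is to keep spinning while serving messages,
    % up to an interval on the order of 3 x the wake-up cost.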

6/28/98 SPAA/PODC 25

Does it work?

6/28/98 SPAA/PODC 26

Synthetic Bulk-synchronous Apps

• Range of granularity and load imbalance
– spin-wait: 10x slowdown

6/28/98 SPAA/PODC 27

With mixture of reads

• Block-immediate 4x slowdown

6/28/98 SPAA/PODC 28

Timesharing Split-C Programs

6/28/98 SPAA/PODC 29

Many Questions

• What about
– mix of jobs?

– sequential jobs?

– unbalanced placement?

– Fairness?

– Scalability?

• How broadly can implicit coordination be applied in the design of cluster subsystems?

• Can resource management be completely decentralized?

– Computational economies, ecologies

6/28/98 SPAA/PODC 30

A look at Serious File I/O

• Traditional I/O system

• NOW I/O system

• Benchmark problem: sort a large number of 100-byte records with 10-byte keys

– start on disk, end on disk

– accessible as files (use the file system)

– Datamation sort: 1 million records

– Minute sort: as much as possible in one minute

[Diagram: processor-memory (P-M) nodes in the traditional I/O system vs. the NOW I/O system]

6/28/98 SPAA/PODC 31

NOW-Sort Algorithm

• Read – N/P records from disk -> memory

• Distribute – scatter keys to processors holding result buckets

– gather keys from all processors

• Sort
– partial radix sort on each bucket

• Write
– write records to disk

(2-pass version: gather data runs onto disk, then a local, external merge sort)
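A schematic of the one-pass algorithm above in C-like form; everything here (record layout, function names, bucket ownership) is illustrative, not the actual NOW-Sort code.

    typedef struct { char key[10]; char data[90]; } record_t;  /* 100-byte record */
    typedef struct bucket bucket_t;

    /* Hypothetical helpers for I/O, communication, and the local sort. */
    record_t *read_local_records(long n);
    int       key_to_bucket(const char key[10], int P);
    void      send_record(int dest_node, const record_t *r);
    bucket_t *gather_records(void);
    void      partial_radix_sort(bucket_t *b);
    void      write_local_records(const bucket_t *b);

    /* One-pass NOW-Sort as run on each of the P nodes (sketch). */
    void now_sort_one_pass(int p, int P, long N)
    {
        long n_local = N / P;

        /* 1. Read: N/P records from local disk into memory. */
        record_t *recs = read_local_records(n_local);

        /* 2. Distribute: scatter each record to the node owning its key
              range, while gathering the records destined for this node. */
        for (long i = 0; i < n_local; i++) {
            int dest = key_to_bucket(recs[i].key, P);
            send_record(dest, &recs[i]);
        }
        bucket_t *mine = gather_records();

        /* 3. Sort: partial radix sort on the locally gathered bucket. */
        partial_radix_sort(mine);

        /* 4. Write: the sorted run back to local disk. */
        write_local_records(mine);
    }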

6/28/98 SPAA/PODC 32

Key Implementation Techniques

• Performance Isolation: highly tuned local disk-to-disk sort

– manage local memory

– manage disk striping

– memory-mapped I/O with madvise, buffering

– manage overlap with threads

• Efficient Communication
– completely hidden under disk I/O

– competes for I/O bus bandwidth

• Self-tuning Software
– probe available memory, disk bandwidth, trade-offs
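For the memory-mapped I/O point, a minimal example of the POSIX mechanism involved (mmap plus madvise to hint sequential access); this is a generic sketch, not the tuned NOW-Sort code itself.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map an input file and hint that it will be read sequentially, so the
       OS can prefetch aggressively and drop pages behind the read point. */
    int map_input(const char *path, void **base, size_t *len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return -1; }
        *len = (size_t)st.st_size;

        *base = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                        /* mapping stays valid after close */
        if (*base == MAP_FAILED) return -1;

        madvise(*base, *len, MADV_SEQUENTIAL);  /* the "madvise" hint above */
        return 0;
    }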

6/28/98 SPAA/PODC 33

Minute Sort

[Chart: gigabytes sorted in one minute vs. number of processors (0-100), comparing the NOW cluster with the SGI Power Challenge and SGI Origin]

World-Record Disk-to-Disk Sort

• Sustain 500 MB/s disk bandwidth and 1,000 MB/s network bandwidth

• but only in the wee hours of the morning

6/28/98 SPAA/PODC 34

Towards a Cluster File System

• Remote disk system built on a virtual network

[Charts: read/write rate (MB/s, roughly 5-6) and client/server CPU utilization (0-40%) for local vs. remote disk. Diagram: the client's RDlib talks to the RD server over active messages]
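A sketch of how a remote-disk read might be layered on active messages, as in the diagram: the client's RDlib sends a read request, and the RD server's handler reads the block and replies with the data. All names here are hypothetical.

    #include <stddef.h>

    #define BLOCK_SIZE 8192
    enum { RD_READ_HANDLER, RD_DATA_HANDLER };   /* handler ids (illustrative) */

    /* Stand-ins for the active message layer and local helpers. */
    void am_request(int ep, int handler, const void *arg, size_t len);
    void am_reply(int ep, int handler, const void *arg, size_t len);
    void am_poll(void);
    int  rd_reply_ready(int block);
    void rd_copy_reply(int block, void *buf);
    void local_disk_read(int block, void *buf);

    struct rd_read_req { int block; };           /* request carried in the msg */

    /* Client side (RDlib): issue a remote read and poll for the reply. */
    void rd_read(int server_endpoint, int block, void *buf)
    {
        struct rd_read_req req = { block };
        am_request(server_endpoint, RD_READ_HANDLER, &req, sizeof req);
        while (!rd_reply_ready(block))           /* reply handler fills a slot */
            am_poll();
        rd_copy_reply(block, buf);
    }

    /* Server side: handler runs on message arrival, reads the block from
       local disk, and sends the data back. */
    void rd_read_handler(int client_endpoint, struct rd_read_req *req)
    {
        char blockbuf[BLOCK_SIZE];
        local_disk_read(req->block, blockbuf);
        am_reply(client_endpoint, RD_DATA_HANDLER, blockbuf, sizeof blockbuf);
    }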

6/28/98 SPAA/PODC 35

Streaming Transfer Experiment

[Diagram: four streaming configurations over processors P0-P3 and disks 0-3 — Local, P3FS Local, P3FS Reverse, and P3FS Remote]

6/28/98 SPAA/PODC 36

Results

• Data distribution affects resource utilization

• But not delivered bandwidth

[Charts: delivered rate (MB/s, roughly 5-6) and client/server CPU utilization (0-40%) by access method — Local, P3FS Local, P3FS Reverse, P3FS Remote]

6/28/98 SPAA/PODC 37

I/O Bus crossings

[Diagram: memory (M), processor (P), and network interface (NI) on each node, showing the I/O bus crossings for parallel scan and parallel sort with (a) local disk and (b) remote disk]

6/28/98 SPAA/PODC 38

Opportunity: PDISK

• Producers dump data into the I/O river
• Consumers pull it out
• Hash data records across disks (see the sketch below)
• Match producers to consumers
• Integrated with work scheduling

[Diagram: producer and consumer processes connected by the river — fast communication via remote queues, fast I/O via streaming disk queues]
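A sketch of the hashing step referenced in the list above: a producer hashes each record's key to pick which disk queue (and hence which consumers) the record flows to. The names are illustrative, and FNV-1a is used here only as a stand-in hash.

    #include <stddef.h>
    #include <stdint.h>

    /* Stand-in for a streaming disk queue / remote queue in the "river". */
    void river_put(int queue, const void *record, size_t len);

    /* Simple key hash (FNV-1a), for illustration only. */
    static uint32_t hash_key(const char *key, size_t keylen)
    {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < keylen; i++) {
            h ^= (uint8_t)key[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Producer: dump a record into the river; hashing spreads records
       across the num_queues disk queues so consumers can pull in parallel. */
    void produce(const char *key, size_t keylen, const void *rec, size_t len,
                 int num_queues)
    {
        int q = (int)(hash_key(key, keylen) % (uint32_t)num_queues);
        river_put(q, rec, len);
    }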

6/28/98 SPAA/PODC 39

What will be the building block?

[Diagram: an SMP node — processors, memory, memory interconnect, network cards — attached to the network cloud; design space of nodes vs. processors per node, spanning NOWs, clusters of SMPs, and large SMPs]

6/28/98 SPAA/PODC 40

Multi-Protocol Communication

• Uniform Prog. Model is key

• Multiprotocol Messaging
– careful layout of msg queues

– concurrent objects

– polling the network hurts memory performance

• Shared Virtual Memory
– relies on underlying msgs

• Pooling vs Contention

[Diagram: one communication layer beneath Send/Write and Rcv/Read operations, using shared memory within an SMP and the network between SMPs]
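One way the uniform model in the diagram might be realized: the communication layer picks shared memory for a peer inside the same SMP and the network otherwise. Illustrative names only.

    #include <stddef.h>
    #include <stdbool.h>

    /* Stand-ins for the two underlying transports. */
    void shm_enqueue(int peer, const void *msg, size_t len);  /* within the SMP */
    void net_send(int peer, const void *msg, size_t len);     /* across nodes   */
    bool same_smp(int peer);                                  /* placement info */

    /* The application sees one send operation; the layer chooses a protocol.
       Queue layout matters here: shared-memory queues should be laid out so
       polling them does not thrash the memory system. */
    void mp_send(int peer, const void *msg, size_t len)
    {
        if (same_smp(peer))
            shm_enqueue(peer, msg, len);   /* shared-memory fast path   */
        else
            net_send(peer, msg, len);      /* network path via the NIC  */
    }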

6/28/98 SPAA/PODC 41

LogP analysis of shared mem AM

[Chart: latency, send overhead, receive overhead, gap, and round-trip time in microseconds for shared-memory active messages built with different synchronization: test & set, test & test & set, ticket, Anderson, Posix, and lock-free queues]
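For concreteness, the test & test & set variant from the legend, written with C11 atomics; this is a generic textbook sketch, not the measured implementation.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Test-and-test-and-set spinlock: spin on a plain read (cache-local) and
       only attempt the atomic exchange when the lock looks free, reducing
       the coherence traffic a naive test&set lock generates. */
    typedef struct { atomic_bool locked; } ttas_lock;

    static void ttas_acquire(ttas_lock *l)
    {
        for (;;) {
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;                                        /* test (read only)  */
            if (!atomic_exchange_explicit(&l->locked, true,
                                          memory_order_acquire))
                return;                                  /* test-and-set won  */
        }
    }

    static void ttas_release(ttas_lock *l)
    {
        atomic_store_explicit(&l->locked, false, memory_order_release);
    }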

6/28/98 SPAA/PODC 42

Virtual Interface Architecture

[Diagram: applications (Sockets, MPI, legacy, etc.) call the VI User Agent ("libvia"); slow-path operations — open, connect, map memory — go through the VIA kernel driver, while descriptor reads/writes go directly from user level to the VI-capable NIC via doorbells; each VI has send (S) and receive (R) queues plus a completion (COMP) queue]
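A sketch of the user-level send path this figure implies: build a descriptor in registered (pinned) memory, post it on a VI's send queue, and ring the doorbell so the NIC picks it up. The structures and names are illustrative, not the actual VIPL API.

    #include <stdint.h>

    #define SENDQ_DEPTH 64

    /* Illustrative shapes: a send descriptor in registered memory and a
       memory-mapped doorbell for one VI. */
    struct via_desc {
        uint64_t buf_addr;        /* virtual address of a registered buffer */
        uint32_t length;          /* bytes to send                          */
        uint32_t mem_handle;      /* registration handle for that buffer    */
        uint32_t status;          /* written by the NIC on completion       */
    };

    struct vi {
        struct via_desc   *send_queue;   /* descriptor ring (host memory)      */
        unsigned           send_tail;
        volatile uint32_t *doorbell;     /* mapped page; a store rings the NIC */
    };

    /* Post a send: fill the next descriptor, then ring the doorbell (step 1
       in the figure); the NIC then DMA-reads the descriptor and data buffer
       and later DMA-writes completion status. */
    void via_post_send(struct vi *vi, uint64_t buf, uint32_t len, uint32_t handle)
    {
        struct via_desc *d = &vi->send_queue[vi->send_tail % SENDQ_DEPTH];
        d->buf_addr   = buf;
        d->length     = len;
        d->mem_handle = handle;
        d->status     = 0;
        vi->send_tail++;
        *vi->doorbell = vi->send_tail;   /* user-level doorbell write */
    }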

6/28/98 SPAA/PODC 43

VIA Implementation Overview

[Diagram: doorbell pages, descriptor queues, and transmit/receive data buffers live in kernel memory mapped to the application. Send path for a block transfer: (1) host writes the doorbell; (2-3) NIC DMA-requests and reads the descriptor; (4-5) NIC DMA-requests and reads the data buffer; (7) NIC DMA-writes the result]

6/28/98 SPAA/PODC 44

Current VIA Performance

[Chart: round-trip time (µs, up to ~600) vs. message size (4 to 4060 bytes) for VIA, UNET, and AM2]

6/28/98 SPAA/PODC 45

VIA ahead

• You will be able to buy decent clusters

• Virtualization in host memory is easy
– will it go beyond pinned regions?

– still need to manage active endpoints (doorbells)

• Complex descriptor queues will hinder low latency short messages

– NICs will chew on them, but it takes many instructions on the host

• Need to re-examine where error handling, flow control, retry are performed

• Interactions with scheduling, I/O, locking etc. will dominate application speed-up

– will demand new development methodologies

6/28/98 SPAA/PODC 46

Conclusions

• Complete system on every node makes clusters a very powerful architecture

– can finally get serious about I/O

• Extend the system globally
– virtual memory systems,

– schedulers,

– file systems, ...

• Efficient communication enables new solutions to classic systems challenges

• Opens a rich set of issues for parallel processing beyond the personal supercomputer or LAN

– where SPAA and PODC meet
