6/28/98 SPAA/PODC 1
High-Performance Clusters, Part 2: Generality
David E. Culler
Computer Science Division
U.C. Berkeley
PODC/SPAA Tutorial
Sunday, June 28, 1998
What’s Different about Clusters?
• Commodity parts?
• Communications Packaging?
• Incremental Scalability?
• Independent Failure?
• Intelligent Network Interfaces?
• Fast Scalable Communication?
=> Complete System on every node
– virtual memory
– scheduler
– file system
– ...
Topics: Part 2
• Virtual Networks
– communication meets virtual memory
• Scheduling
• Parallel I/O
• Clusters of SMPs
• VIA
General purpose requirements
• Many timeshared processes
– each with direct, protected access
• User and system
• Client/Server, parallel clients, parallel servers
– they grow, shrink, handle node failures
• Multiple packages in a process
– each may have its own internal communication layer
• Use communication as easily as memory
Virtual Networks
• Endpoint abstracts the notion of “attached to the network”
• Virtual network is a collection of endpoints that can name each other.
• Many processes on a node can each have many endpoints, each with own protection domain.
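The endpoint and virtual-network abstractions above can be sketched in a few lines. This is an illustrative model, not the Berkeley implementation: the names `EndpointName`, `Endpoint`, and `VirtualNetwork` are invented here, and the protection check is reduced to simple membership in the virtual network.

```python
# Sketch (assumed structure, not the real system): endpoints are named
# communication slots; a virtual network is the set of endpoints that
# may name each other, each carrying its own protection domain tag.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EndpointName:
    node: int       # physical node hosting the endpoint
    index: int      # endpoint slot on that node

@dataclass
class Endpoint:
    name: EndpointName
    protection_key: int                 # per-endpoint protection domain
    inbox: list = field(default_factory=list)

class VirtualNetwork:
    """A collection of endpoints that can name each other."""
    def __init__(self):
        self.members: dict = {}

    def attach(self, ep: Endpoint):
        self.members[ep.name] = ep

    def send(self, src: EndpointName, dst: EndpointName, msg):
        # Only members of the same virtual network may communicate.
        if src not in self.members or dst not in self.members:
            raise PermissionError("endpoint not in this virtual network")
        self.members[dst].inbox.append((src, msg))

vn = VirtualNetwork()
a = Endpoint(EndpointName(0, 0), protection_key=7)
b = Endpoint(EndpointName(1, 0), protection_key=7)
vn.attach(a)
vn.attach(b)
vn.send(a.name, b.name, "hello")
print(b.inbox)  # [(EndpointName(node=0, index=0), 'hello')]
```

A process can hold many such endpoints, each attached to a different virtual network, which is what lets multiple packages in one process keep separate communication layers.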
How are they managed?
• How do you get direct hardware access for performance with a large space of logical resources?
• Just like virtual memory
– active portion of the large logical space is bound to physical resources
[Diagram: processes 1..n on a node, each with endpoints in paged host memory; active endpoints are bound to NIC memory on the network interface]
Endpoint Transition Diagram
• COLD: paged host memory
• WARM: paged host memory, R/O
• HOT: NIC memory, R/W
• Transitions: read, write, msg arrival, evict, swap
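The transition diagram can be written as a tiny table-driven state machine. The edge-to-state assignment below is an interpretation of the figure (the slide names the edges but the exact arrows are not fully recoverable): reads warm a cold endpoint, writes or message arrival make it hot, eviction demotes hot to warm, and swapping pages a warm endpoint back out.

```python
# Toy state machine for the endpoint residency states.
# Edge interpretation is an assumption, marked in the lead-in above.
TRANSITIONS = {
    ("COLD", "read"): "WARM",
    ("COLD", "write"): "HOT",
    ("WARM", "write"): "HOT",
    ("WARM", "msg_arrival"): "HOT",
    ("WARM", "swap"): "COLD",
    ("HOT", "evict"): "WARM",
}

def step(state: str, event: str) -> str:
    # Events with no listed edge leave residency unchanged.
    return TRANSITIONS.get((state, event), state)

s = "COLD"
for ev in ["read", "msg_arrival", "evict", "swap"]:
    s = step(s, ev)
print(s)  # COLD
```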
Network Interface Support
• NIC has endpoint frames
• Services active endpoints
• Signals misses to driver
– using a system endpoint
[Diagram: NIC with endpoint frames 0–7, transmit and receive queues, and an endpoint-miss path]
Solaris System Abstractions
Segment Driver
• manages portions of an address space
Device Driver
• manages an I/O device
Virtual Network Driver
LogP Performance
• Competitive latency
• Increased NIC processing
• Difference mostly
– ack processing
– protection check
– data structures
– code quality
• Virtualization cheap
[Chart: LogP parameters (g, L, Or, Os) in µs, 0 to 16 µs scale, for gam vs. AM]
Bursty Communication among Many
[Diagram: multiple clients sending message bursts to servers, with work between bursts]
[Charts: msgs/sec (0 to 18,000) vs. number of clients (0 to 16), and burst bandwidth in msgs/sec (0 to 70,000) vs. number of clients (0 to 20)]
Multiple VNs, Single-Thread Server
[Chart: aggregate msgs/s (0 to 80,000) vs. number of virtual networks (1 to 28), for continuous traffic and bursts of 1024, 2048, 4096, 8192, and 16384 msgs]
Multiple VNs, Multithreaded Server
[Chart: aggregate msgs/s (0 to 80,000) vs. number of virtual networks (1 to 28), for continuous traffic and bursts of 1024, 2048, 4096, 8192, and 16384 msgs]
Perspective on Virtual Networks
• Networking abstractions are vertical stacks
– new function => new layer
– poke through for performance
• Virtual networks provide a horizontal abstraction
– basis for building new, fast services
• Open questions
– What is the communication “working set”?
– What placement, replacement, … ?
Beyond the Personal Supercomputer
• Able to timeshare parallel programs
– with fast, protected communication
• Mix with sequential and interactive jobs
• Use fast communication in OS subsystems
– parallel file system, network virtual memory, …
• Nodes have a powerful, local OS scheduler
• Problem: local schedulers do not know to run parallel jobs in parallel
Local Scheduling
• Schedulers act independently w/o global control
• Program waits while trying to communicate with its peers that are not running
• 10 - 100x slowdowns for fine-grain programs!
=> need coordinated scheduling
[Diagram: four processors P1–P4 timesharing jobs A, B, C under uncoordinated local schedulers, so A's processes rarely run at the same time]
Explicit Coscheduling
• Global context switch according to precomputed schedule
• How do you build it? Does it work?
[Diagram: a master node driving a global context switch across P1–P4, so all of A's processes run together, then B's and C's]
Typical Cluster Subsystem Structures
[Diagrams: master-slave structure (applications over a local service (LS) on each node, coordinated by a single master over the communication layer) vs. peer-to-peer structure (a global service (GS) component paired with each local service, coordinating with its peers)]
Ideal Cluster Subsystem Structure
• Obtain coordination without explicit subsystem interaction, only the events in the program
– very easy to build
– potentially very robust to component failures
– inherently “service on-demand”
– scalable
• Local service component can evolve.
[Diagram: peer-to-peer structure with the global service component alongside each node's local service, coordinating only through program events]
Three approaches examined in NOW
• GLUNIX: explicit master-slave (user level)
– matrix algorithm to pick PP
– uses stops & signals to try to force the desired PP to run
• Explicit peer-peer scheduling assist with VNs
– co-scheduling daemons decide on PP and kick the Solaris scheduler
• Implicit
– modify the parallel run-time library to allow it to get itself co-scheduled with the standard scheduler
[Diagrams: the three structures, GLUNIX master-slave, peer-to-peer with co-scheduling daemons, and implicit coordination inside each node]
Problems with explicit coscheduling
• Implementation complexity
• Need to identify parallel programs in advance
• Interacts poorly with interactive use and load imbalance
• Introduces new potential faults
• Scalability
Why implicit coscheduling might work
• Active message request-reply model
• Infer non-local state from local observations; react to maintain coordination
observation implication action
fast response partner scheduled spin
delayed response partner not scheduled block
[Diagram: workstations WS 1–4; WS 1 runs Job A and issues a request; peers running Job B spin on the response or sleep until Job A is scheduled everywhere]
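The observation/implication/action rule above reduces to a one-line decision. The sketch below is illustrative: the function name and the specific threshold (baseline round trip plus one context switch) are assumptions chosen to mirror the table, not the paper's tuned policy.

```python
# Implicit-coscheduling inference: a fast reply implies the partner is
# scheduled (keep spinning); a delayed reply implies it is not (block
# so the local scheduler can run someone else).
def react(observed_rtt_us: float, baseline_rtt_us: float,
          context_switch_us: float) -> str:
    # Threshold is illustrative: roughly a round trip plus a context
    # switch before concluding the partner is descheduled.
    threshold = baseline_rtt_us + context_switch_us
    return "spin" if observed_rtt_us <= threshold else "block"

print(react(12.0, baseline_rtt_us=10.0, context_switch_us=50.0))   # spin
print(react(500.0, baseline_rtt_us=10.0, context_switch_us=50.0))  # block
```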
Obvious Questions
• Does it work?
• How long do you spin?
• What are the requirements on the local scheduler?
How Long to Spin?
• Answer: round-trip time + context switch + msg processing
– round trip to stay scheduled together
– plus wake-up to get scheduled together
– keep spinning if serving messages
» interval of 3 × wake-up
[Diagram: WS 1 and WS 2 timelines, spin-wait of 2L+4o to stay coscheduled, or 2L+4o+W when waiting for a wakeup]
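The slide's spin budget written out in LogP terms, as a check on the arithmetic. The parameter values below are illustrative, not measured.

```python
# Spin budget from the slide: 2L + 4o covers a request and reply
# (each message pays latency L plus send and receive overheads o);
# add the wake-up cost W when waiting for a partner to get scheduled.
def spin_budget_us(L: float, o: float, W: float = 0.0) -> float:
    return 2 * L + 4 * o + W

rtt = spin_budget_us(L=5.0, o=3.0)                   # 2*5 + 4*3
with_wakeup = spin_budget_us(L=5.0, o=3.0, W=50.0)   # plus wake-up
print(rtt, with_wakeup)  # 22.0 72.0
```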
Does it work?
Synthetic Bulk-synchronous Apps
• Range of granularity and load imbalance
– spin-wait: 10x slowdown
With mixture of reads
• Block-immediate: 4x slowdown
Timesharing Split-C Programs
Many Questions
• What about – mix of jobs?
– sequential jobs?
– unbalanced placement?
– Fairness?
– Scalability?
• How broadly can implicit coordination be applied in the design of cluster subsystems?
• Can resource management be completely decentralized?
– Computational economies, ecologies
A look at Serious File I/O
• Traditional I/O system
• NOW I/O system
• Benchmark problem: sort a large number of 100-byte records with 10-byte keys
– start on disk, end on disk
– accessible as files (use the file system)
– Datamation sort: 1 million records
– Minute sort: quantity sorted in a minute
[Diagram: traditional I/O system vs. NOW I/O system of processor-memory (P-M) nodes, each with local disks]
NOW-Sort Algorithm
• Read
– N/P records from disk -> memory
• Distribute
– scatter keys to processors holding result buckets
– gather keys from all processors
• Sort
– partial radix sort on each bucket
• Write
– write records to disk
(2-pass version: gather data runs onto disk, then local, external merge sort)
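The one-pass read/distribute/sort/write structure can be sketched as below. This is a simplification under stated assumptions: nodes are plain Python lists, distribution hashes on the first key byte rather than the real sampled partitioning, and Python's `sort` stands in for the tuned partial radix sort.

```python
import os

# One-pass NOW-Sort skeleton: each "node" reads its records, scatters
# every record to the node owning its key range, then sorts its bucket.
# Record format follows the benchmark: 10-byte key, 90-byte payload.
def now_sort(per_node_records, num_nodes):
    buckets = [[] for _ in range(num_nodes)]
    # Distribute: scatter records to the processor holding the bucket
    # for that key range (first key byte, a simplifying assumption).
    for records in per_node_records:
        for rec in records:
            buckets[rec[0] * num_nodes // 256].append(rec)
    # Sort: each node sorts its own bucket locally.
    for b in buckets:
        b.sort(key=lambda r: r[:10])
    return buckets

recs = [[os.urandom(10) + b"x" * 90 for _ in range(100)] for _ in range(4)]
out = now_sort(recs, num_nodes=4)
flat = [r for b in out for r in b]
print(flat == sorted(flat, key=lambda r: r[:10]))  # True
```

Because the buckets partition the key space in order, concatenating the locally sorted buckets yields a globally sorted result, which is why no merge pass is needed in the one-pass variant.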
Key Implementation Techniques
• Performance Isolation: highly tuned local disk-to-disk sort
– manage local memory
– manage disk striping
– memory-mapped I/O with madvise, buffering
– manage overlap with threads
• Efficient Communication
– completely hidden under disk I/O
– competes for I/O bus bandwidth
• Self-tuning Software
– probe available memory, disk bandwidth, trade-offs
Minute Sort
[Chart: gigabytes sorted in a minute (0 to 9) vs. number of processors (0 to 100), NOW-Sort vs. SGI Power Challenge and SGI Origin]
World-Record Disk-to-Disk Sort
• Sustain 500 MB/s disk bandwidth and 1,000 MB/s network bandwidth
• but only in the wee hours of the morning
Towards a Cluster File System
• Remote disk system built on a virtual network
[Charts: read/write rates of roughly 5–6 MB/s for local vs. remote disk, and client/server CPU utilization (0% to 40%)]
[Diagram: client linked with RDlib talking to an RD server over active msgs]
Streaming Transfer Experiment
[Diagrams: four streaming-transfer configurations, Local, P3FS Local, P3FS Reverse, and P3FS Remote, mapping processors P0–P3 to disks 0–3]
Results
• Data distribution affects resource utilization, not delivered bandwidth
[Chart: rate (roughly 5–6 MB/s) and client/server CPU utilization (0% to 40%) across the four access methods: Local, P3FS Local, P3FS Reverse, P3FS Remote]
I/O Bus crossings
[Diagrams: memory/processor/NIC datapaths showing I/O bus crossings for parallel scan and parallel sort, each with (a) local disk vs. (b) remote disk]
Opportunity: PDISK
• Producers dump data into an I/O river
• Consumers pull it out
• Hash data records across disks
• Match producers to consumers
• Integrated with work scheduling
[Diagram: producers and consumers joined by fast communication (remote queues) and fast I/O (streaming disk queues)]
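The "I/O river" idea above can be sketched as a set of per-disk queues that decouple producers from consumers. The class and method names are invented for the sketch, and in-memory deques stand in for streaming disk queues.

```python
from collections import deque

# Toy I/O river: producers hash records across per-disk queues;
# any consumer pulls from whichever queue has data, so producers
# and consumers are matched dynamically rather than statically.
class River:
    def __init__(self, num_disks: int):
        self.queues = [deque() for _ in range(num_disks)]

    def put(self, record: bytes):
        # Hash data records across disks.
        self.queues[hash(record) % len(self.queues)].append(record)

    def get(self):
        # Consumers pull from any non-empty queue.
        for q in self.queues:
            if q:
                return q.popleft()
        return None

river = River(num_disks=4)
for i in range(8):
    river.put(f"rec{i}".encode())
pulled = [river.get() for _ in range(8)]
print(sorted(pulled))
```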
What will be the building block?
[Diagram: SMPs with memory and network cards on a memory interconnect, joined through a network cloud]
[Chart: design space of number of nodes vs. processors per node, with SMPs, NOWs, and clusters of SMPs]
Multi-Protocol Communication
• Uniform Prog. Model is key
• Multiprotocol Messaging
– careful layout of msg queues
– concurrent objects
– polling the network hurts memory
• Shared Virtual Memory
– relies on underlying msgs
• Pooling vs. contention
[Diagram: send/write and receive/read paths through a common communication layer over both shared memory and the network]
LogP analysis of shared mem AM
[Chart: LogP parameters (latency, send overhead, receive overhead, gap, round-trip time) in microseconds for shared-memory AM under different locks: test&set, test&test&set, ticket, Anderson, Posix, lock-free]
Virtual Interface Architecture
[Diagram: applications (Sockets, MPI, legacy) over the VI User Agent (“libvia”); open, connect, and map-memory requests go through the VIA kernel driver (slow path), while descriptor reads/writes and doorbells go straight to the VI-capable NIC (fast path); each VI has send and receive queues plus a completion queue]
VIA Implementation Overview
[Diagram: host memory holds doorbell pages, descriptor queues, and Tx/Rx data buffers in kernel memory mapped to the application; send path: (1) write doorbell, (2) DMA request, (3) DMA read of descriptor, (4) DMA request, (5) DMA read of data, … (7) DMA write on completion]
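The descriptor-queue-plus-doorbell send path can be sketched as follows. All names are invented for the sketch, and a Python list stands in for the wire; the DMA steps collapse into a single queue pop on the simulated NIC side.

```python
from collections import deque

# Sketch of the VIA send path: the application posts a descriptor
# into a queue in host memory and rings a mapped doorbell; the NIC
# later notices the doorbell, fetches descriptor and data (the DMA
# steps in the figure), transmits, and posts a completion.
class VirtualInterface:
    def __init__(self):
        self.send_queue = deque()   # descriptor queue in host memory
        self.doorbell = 0           # mapped doorbell the app writes
        self.completions = deque()

    def post_send(self, buffer: bytes):
        # Application side: write descriptor, ring doorbell.
        self.send_queue.append({"buf": buffer, "len": len(buffer)})
        self.doorbell += 1

    def nic_service(self, wire: list):
        # NIC side: drain doorbells, "DMA" each descriptor + data,
        # transmit, and record a completion.
        while self.doorbell > 0:
            desc = self.send_queue.popleft()
            wire.append(desc["buf"])        # stand-in for DMA + Tx
            self.completions.append(desc)
            self.doorbell -= 1

vi = VirtualInterface()
wire = []
vi.post_send(b"hello")
vi.post_send(b"world")
vi.nic_service(wire)
print(wire, len(vi.completions))  # [b'hello', b'world'] 2
```

Even in this toy form, the per-message cost of building a descriptor and having the NIC parse it shows why the deck argues descriptor queues hinder low-latency short messages.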
Current VIA Performance
[Chart: round-trip time in µs (0 to 600) vs. message size (4 to 4060 bytes) for VIA, UNET, and AM2]
VIA ahead
• You will be able to buy decent clusters
• Virtualization in host memory is easy
– will it go beyond pinned regions?
– still need to manage active endpoints (doorbells)
• Complex descriptor queues will hinder low-latency short messages
– NICs will chew on them, but many instructions remain on the host
• Need to re-examine where error handling, flow control, retry are performed
• Interactions with scheduling, I/O, locking etc. will dominate application speed-up
– will demand new development methodologies
Conclusions
• Complete system on every node makes clusters a very powerful architecture
– can finally get serious about I/O
• Extend the system globally
– virtual memory systems,
– schedulers,
– file systems, ...
• Efficient communication enables new solutions to classic systems challenges
• Opens a rich set of issues for parallel processing beyond the personal supercomputer or LAN
– where SPAA and PODC meet