
6/28/98 SPAA/PODC 1

High-Performance Clusters, part 2: Generality

David E. Culler

Computer Science Division

U.C. Berkeley

PODC/SPAA Tutorial

Sunday, June 28, 1998

6/28/98 SPAA/PODC 2

What’s Different about Clusters?

• Commodity parts?

• Communications Packaging?

• Incremental Scalability?

• Independent Failure?

• Intelligent Network Interfaces?

• Fast Scalable Communication?

=> Complete System on every node
– virtual memory

– scheduler

– file system

– ...

6/28/98 SPAA/PODC 3

Topics: Part 2

• Virtual Networks
– communication meets virtual memory

• Scheduling

• Parallel I/O

• Clusters of SMPs

• VIA

6/28/98 SPAA/PODC 4

General purpose requirements

• Many timeshared processes
– each with direct, protected access

• User and system

• Client/Server, parallel clients, parallel servers
– they grow, shrink, handle node failures

• Multiple packages in a process
– each may have its own internal communication layer

• Use communication as easily as memory

6/28/98 SPAA/PODC 5

Virtual Networks

• Endpoint abstracts the notion of “attached to the network”

• Virtual network is a collection of endpoints that can name each other.

• Many processes on a node can each have many endpoints, each with own protection domain.
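As a concrete view of the abstraction, here is a minimal C sketch of what an endpoint interface of this flavor could look like. All names (vn_endpoint, vn_create_endpoint, vn_map_peer, vn_send, vn_poll) are hypothetical, not the actual Berkeley API.

    #include <stddef.h>

    /* A hypothetical endpoint handle: one process may hold many of these,
       each belonging to a virtual network with its own protection domain. */
    typedef struct vn_endpoint vn_endpoint;

    /* Create an endpoint in the named virtual network; the OS/NIC driver
       decides whether it is backed by host memory or NIC memory. */
    vn_endpoint *vn_create_endpoint(const char *virtual_network_name);

    /* Name a peer endpoint within the same virtual network. */
    int vn_map_peer(vn_endpoint *ep, int peer_index);

    /* Send a short message to a peer; protection is checked per endpoint,
       so untrusted processes can safely share the physical network. */
    int vn_send(vn_endpoint *ep, int peer_index, const void *buf, size_t len);

    /* Poll for arrivals; a handler runs for each delivered message. */
    typedef void (*vn_handler)(vn_endpoint *ep, int src, void *buf, size_t len);
    int vn_poll(vn_endpoint *ep, vn_handler handler);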

6/28/98 SPAA/PODC 6


How are they managed?

• How do you get direct hardware access for performance with a large space of logical resources?

• Just like virtual memory
– the active portion of a large logical space is bound to physical resources

[Figure: processes 1..n on a host, each with many endpoints; host memory, processor, and NIC memory on the network interface — active endpoints are bound to NIC resources]

6/28/98 SPAA/PODC 7

Endpoint Transition Diagram

[Diagram: endpoint states — COLD (paged host memory), WARM (read-only, paged host memory), HOT (read/write, NIC memory) — with transitions on read, write / message arrival, evict, and swap]
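The same diagram written out as a small state machine; this is an illustrative sketch of one plausible reading of the transitions, with hypothetical names.

    /* Endpoint residence states from the diagram above (illustrative only). */
    typedef enum { EP_COLD, EP_WARM, EP_HOT } ep_state;

    typedef enum { EV_READ, EV_WRITE, EV_MSG_ARRIVAL, EV_EVICT, EV_SWAP } ep_event;

    /* COLD: paged host memory; WARM: read-only, paged host memory;
       HOT: read/write, resident in NIC memory. */
    static ep_state ep_transition(ep_state s, ep_event e)
    {
        switch (e) {
        case EV_READ:                        /* a read warms a cold endpoint   */
            return (s == EP_COLD) ? EP_WARM : s;
        case EV_WRITE:
        case EV_MSG_ARRIVAL:                 /* activity makes it hot (on NIC) */
            return EP_HOT;
        case EV_EVICT:                       /* NIC frame reclaimed            */
            return (s == EP_HOT) ? EP_WARM : s;
        case EV_SWAP:                        /* paged all the way out          */
            return EP_COLD;
        }
        return s;
    }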

6/28/98 SPAA/PODC 8

Network Interface Support

• NIC has endpoint frames

• Services active endpoints

• Signals misses to driver
– using a system endpoint

[Diagram: NIC endpoint frames 0..7 with transmit and receive queues, plus an endpoint-miss path to the host]
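A sketch of the NIC-side behavior this slide implies: the NIC holds a fixed number of endpoint frames and services only resident endpoints, reporting traffic for a non-resident endpoint as a miss to the host driver through a reserved system endpoint. All names are hypothetical.

    #define NUM_FRAMES 8

    struct frame { int valid; int endpoint_id; /* tx/rx queues, etc. */ };
    static struct frame frames[NUM_FRAMES];

    /* Placeholders for the fast-path delivery and the miss signal. */
    void deliver_to_frame(int frame, const void *msg, unsigned len);
    void signal_miss_to_driver(int endpoint_id);

    static int find_frame(int endpoint_id)
    {
        for (int i = 0; i < NUM_FRAMES; i++)
            if (frames[i].valid && frames[i].endpoint_id == endpoint_id)
                return i;
        return -1;
    }

    static void on_message(int dst_endpoint_id, const void *msg, unsigned len)
    {
        int f = find_frame(dst_endpoint_id);
        if (f >= 0)
            deliver_to_frame(f, msg, len);          /* endpoint is resident  */
        else
            signal_miss_to_driver(dst_endpoint_id); /* driver pages it in    */
    }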

6/28/98 SPAA/PODC 9

Solaris System Abstractions

Segment Driver
• manages portions of an address space

Device Driver
• manages an I/O device

Virtual Network Driver

6/28/98 SPAA/PODC 10

LogP Performance

• Competitive latency

• Increased NIC processing

• Difference mostly
– ack processing

– protection check

– data structures

– code quality

• Virtualization cheap

[Chart: LogP parameters — gap (g), latency (L), receive overhead (Or), send overhead (Os) — in µs, comparing gam and AM endpoints, roughly in the 0-16 µs range]
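For reference, the LogP quantities plotted here, restated in the standard form (the round-trip expression below is the usual no-contention approximation):

    % LogP parameters, measured in microseconds in the chart above:
    %   L        : network latency for a small message
    %   o_s, o_r : send and receive processing overhead on the processor
    %   g        : gap, the reciprocal of per-processor message bandwidth
    \[
    T_{\mathrm{rtt}} \;\approx\; 2L + 2o_s + 2o_r \;=\; 2L + 4o
    \quad\text{when } o_s \approx o_r \approx o .
    \]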

6/28/98 SPAA/PODC 11

Bursty Communication among many

[Figure: several clients bursting messages ("msg burst, work") at servers. Charts: sustained message rate (msgs/sec) vs. number of clients, and burst bandwidth (msgs/sec) vs. number of clients]

6/28/98 SPAA/PODC 12

Multiple VNs, Single-thread Server

[Chart: aggregate msgs/s vs. number of virtual networks (1-28), for continuous traffic and bursts of 1024, 2048, 4096, 8192, and 16384 msgs]

6/28/98 SPAA/PODC 13

Multiple VNs, Multithreaded Server

[Chart: aggregate msgs/s vs. number of virtual networks (1-28), for continuous traffic and bursts of 1024, 2048, 4096, 8192, and 16384 msgs, with a multithreaded server]

6/28/98 SPAA/PODC 14

Perspective on Virtual Networks

• Networking abstractions are vertical stacks
– new function => new layer

– poke through for performance

• Virtual Networks provide a horizontal abstraction
– basis for building new, fast services

• Open questions
– What is the communication “working set”?

– What placement, replacement, … ?

6/28/98 SPAA/PODC 15

Beyond the Personal Supercomputer

• Able to timeshare parallel programs
– with fast, protected communication

• Mix with sequential and interactive jobs

• Use fast communication in OS subsystems
– parallel file system, network virtual memory, …

• Nodes have powerful, local OS scheduler

• Problem: local schedulers do not know to run parallel jobs in parallel

6/28/98 SPAA/PODC 16

Local Scheduling

• Schedulers act independently w/o global control

• A program waits while trying to communicate with peers that are not running

• 10 - 100x slowdowns for fine-grain programs!

=> need coordinated scheduling

[Diagram: uncoordinated local schedules on P1-P4 over time — fragments of jobs A, B, and C run at different times on different nodes, so a parallel job rarely has all of its processes running at once]

6/28/98 SPAA/PODC 17

Explicit Coscheduling

• Global context switch according to precomputed schedule

• How do you build it? Does it work?

[Diagram: a master drives a precomputed global schedule, so P1-P4 context-switch together — all of job A runs at once, then jobs B and C]

6/28/98 SPAA/PODC 18

Typical Cluster Subsystem Structures

[Diagram: two common structures. Master-Slave: applications (A) and a local service (LS) on each node, coordinated by a central master over the communication layer. Peer-to-Peer: each node runs applications (A), a local service (LS), and a global service (GS) component, with the GS components coordinating among themselves over the communication layer]

6/28/98 SPAA/PODC 19

Ideal Cluster Subsystem Structure

• Obtain coordination without explicit subsystem interaction, only the events in the program

– very easy to build

– potentially very robust to component failures

– inherently “service on-demand”

– scalable

• Local service component can evolve.

[Diagram: each node runs applications (A), a local service (LS), and a global service (GS) component; coordination emerges from the application's own communication, with no explicit master]

6/28/98 SPAA/PODC 20

Three approaches examined in NOW

• GLUNIX explicit master-slave (user level)
– matrix algorithm to pick PP

– uses stops & signals to try to force desired PP to run

• Explicit peer-peer scheduling assist with VNs
– co-scheduling daemons decide on PP and kick the Solaris scheduler

• Implicit
– modify the parallel run-time library to allow it to get itself co-scheduled with the standard scheduler

[Diagram: the three structures side by side — GLUNIX master-slave (applications A and local services LS under a master M), explicit peer-to-peer (per-node global service daemons GS), and implicit (coordination folded into the applications and local services)]

6/28/98 SPAA/PODC 21

Problems with explicit coscheduling

• Implementation complexity

• Need to identify parallel programs in advance

• Interacts poorly with interactive use and load imbalance

• Introduces new potential faults

• Scalability

6/28/98 SPAA/PODC 22

Why implicit coscheduling might work

• Active message request-reply model

• Infer non-local state from local observations; react to maintain coordination

observation        implication              action
fast response      partner scheduled        spin
delayed response   partner not scheduled    block

[Diagram: workstations WS1-WS4 timesharing jobs A and B; after a request, the sender spins while responses come back quickly and sleeps when a response is delayed]
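A sketch of the decision rule in the table above: spin while replies come back quickly (the partner is evidently scheduled), block when a reply is slow (the partner evidently is not). The names and the threshold variable are illustrative only.

    #include <stdbool.h>

    /* Hypothetical primitives provided by the messaging layer / OS. */
    bool   reply_arrived(int request_id);        /* poll for the reply       */
    void   block_until_reply(int request_id);    /* sleep; woken on arrival  */
    double now_usec(void);

    /* Spin-then-block wait used after issuing a request to a partner node. */
    void wait_for_reply(int request_id, double spin_threshold_usec)
    {
        double start = now_usec();
        while (!reply_arrived(request_id)) {
            if (now_usec() - start > spin_threshold_usec) {
                /* Delayed response => partner probably descheduled: block so
                   the local scheduler can run another job; coordination is
                   maintained implicitly, with no global master. */
                block_until_reply(request_id);
                return;
            }
            /* Fast responses => partner scheduled: keep spinning to stay
               coscheduled (and keep serving incoming requests, not shown). */
        }
    }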

6/28/98 SPAA/PODC 23

Obvious Questions

• Does it work?

• How long do you spin?

• What are the requirements on the local scheduler?

6/28/98 SPAA/PODC 24

How Long to Spin?

• Answer: round-trip time + context switch + msg processing

– round-trip to stay scheduled together

– plus wake-up to get scheduled together

– keep spinning if serving messages

» interval of 3 x wake-up

[Diagram: WS1 and WS2 timesharing jobs A, B, and C; the requester spin-waits roughly 2L+4o when its partner is already scheduled, and 2L+4o+W when the partner must first be woken up, otherwise it sleeps]
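Written out, the spin thresholds sketched above are roughly:

    % Baseline: spin long enough to cover a round trip while both are scheduled.
    \[
    T_{\mathrm{spin}} \;\gtrsim\; 2L + 4o
    \]
    % If the partner must first be woken up (wake-up / context-switch cost W):
    \[
    T_{\mathrm{reply}} \;\approx\; 2L + 4o + W
    \]
    % and the rule of thumb above is to keep spinning while serving messages,
    % up to an interval on the order of 3 x the wake-up cost.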

6/28/98 SPAA/PODC 25

Does it work?

6/28/98 SPAA/PODC 26

Synthetic Bulk-synchronous Apps

• Range of granularity and load imbalance
– spin-wait: 10x slowdown

6/28/98 SPAA/PODC 27

With mixture of reads

• Block-immediate 4x slowdown

6/28/98 SPAA/PODC 28

Timesharing Split-C Programs

6/28/98 SPAA/PODC 29

Many Questions

• What about
– mix of jobs?

– sequential jobs?

– unbalanced placement?

– Fairness?

– Scalability?

• How broadly can implicit coordination be applied in the design of cluster subsystems?

• Can resource management be completely decentralized?

– Computational economies, ecologies

6/28/98 SPAA/PODC 30

A look at Serious File I/O

• Traditional I/O system

• NOW I/O system

• Benchmark problem: sort a large number of 100-byte records with 10-byte keys

– start on disk, end on disk

– accessible as files (use the file system)

– Datamation sort: 1 million records

– Minute sort: as much as possible in one minute

[Diagram: processor-memory (P-M) nodes in the traditional I/O system vs. the NOW I/O system]

6/28/98 SPAA/PODC 31

NOW-Sort Algorithm

• Read – N/P records from disk -> memory

• Distribute – scatter keys to processors holding result buckets

– gather keys from all processors

• Sort
– partial radix sort on each bucket

• Write
– write records to disk

(2-pass version: gather data runs onto disk, then a local, external merge sort)
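A schematic of the one-pass algorithm above in C-like form; everything here (record layout, function names, bucket ownership) is illustrative, not the actual NOW-Sort code.

    typedef struct { char key[10]; char data[90]; } record_t;  /* 100-byte record */
    typedef struct bucket bucket_t;

    /* Hypothetical helpers for I/O, communication, and the local sort. */
    record_t *read_local_records(long n);
    int       key_to_bucket(const char key[10], int P);
    void      send_record(int dest_node, const record_t *r);
    bucket_t *gather_records(void);
    void      partial_radix_sort(bucket_t *b);
    void      write_local_records(const bucket_t *b);

    /* One-pass NOW-Sort as run on each of the P nodes (sketch). */
    void now_sort_one_pass(int p, int P, long N)
    {
        long n_local = N / P;

        /* 1. Read: N/P records from local disk into memory. */
        record_t *recs = read_local_records(n_local);

        /* 2. Distribute: scatter each record to the node owning its key
              range, while gathering the records destined for this node. */
        for (long i = 0; i < n_local; i++) {
            int dest = key_to_bucket(recs[i].key, P);
            send_record(dest, &recs[i]);
        }
        bucket_t *mine = gather_records();

        /* 3. Sort: partial radix sort on the locally gathered bucket. */
        partial_radix_sort(mine);

        /* 4. Write: the sorted run back to local disk. */
        write_local_records(mine);
    }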

6/28/98 SPAA/PODC 32

Key Implementation Techniques

• Performance Isolation: highly tuned local disk-to-disk sort

– manage local memory

– manage disk striping

– memory-mapped I/O with madvise, buffering

– manage overlap with threads

• Efficient Communication
– completely hidden under disk I/O

– competes for I/O bus bandwidth

• Self-tuning Software
– probe available memory, disk bandwidth, trade-offs
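For the memory-mapped I/O point, a minimal example of the POSIX mechanism involved (mmap plus madvise to hint sequential access); this is a generic sketch, not the tuned NOW-Sort code itself.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map an input file and hint that it will be read sequentially, so the
       OS can prefetch aggressively and drop pages behind the read point. */
    int map_input(const char *path, void **base, size_t *len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return -1; }
        *len = (size_t)st.st_size;

        *base = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                        /* mapping stays valid after close */
        if (*base == MAP_FAILED) return -1;

        madvise(*base, *len, MADV_SEQUENTIAL);  /* the "madvise" hint above */
        return 0;
    }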

6/28/98 SPAA/PODC 33

Minute Sort

[Chart: gigabytes sorted in one minute vs. number of processors (0-100), comparing the NOW cluster with the SGI Power Challenge and SGI Origin]

World-Record Disk-to-Disk Sort

• Sustain 500 MB/s disk bandwidth and 1,000 MB/s network bandwidth

• but only in the wee hours of the morning

6/28/98 SPAA/PODC 34

Towards a Cluster File System

• Remote disk system built on a virtual network

[Charts: read/write rate (MB/s, roughly 5-6) and client/server CPU utilization (0-40%) for local vs. remote disk. Diagram: the client's RDlib talks to the RD server over active messages]
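A sketch of how a remote-disk read might be layered on active messages, as in the diagram: the client's RDlib sends a read request, and the RD server's handler reads the block and replies with the data. All names here are hypothetical.

    #include <stddef.h>

    #define BLOCK_SIZE 8192
    enum { RD_READ_HANDLER, RD_DATA_HANDLER };   /* handler ids (illustrative) */

    /* Stand-ins for the active message layer and local helpers. */
    void am_request(int ep, int handler, const void *arg, size_t len);
    void am_reply(int ep, int handler, const void *arg, size_t len);
    void am_poll(void);
    int  rd_reply_ready(int block);
    void rd_copy_reply(int block, void *buf);
    void local_disk_read(int block, void *buf);

    struct rd_read_req { int block; };           /* request carried in the msg */

    /* Client side (RDlib): issue a remote read and poll for the reply. */
    void rd_read(int server_endpoint, int block, void *buf)
    {
        struct rd_read_req req = { block };
        am_request(server_endpoint, RD_READ_HANDLER, &req, sizeof req);
        while (!rd_reply_ready(block))           /* reply handler fills a slot */
            am_poll();
        rd_copy_reply(block, buf);
    }

    /* Server side: handler runs on message arrival, reads the block from
       local disk, and sends the data back. */
    void rd_read_handler(int client_endpoint, struct rd_read_req *req)
    {
        char blockbuf[BLOCK_SIZE];
        local_disk_read(req->block, blockbuf);
        am_reply(client_endpoint, RD_DATA_HANDLER, blockbuf, sizeof blockbuf);
    }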

6/28/98 SPAA/PODC 35

Streaming Transfer Experiment

[Diagram: four streaming configurations over processors P0-P3 and disks 0-3 — Local, P3FS Local, P3FS Reverse, and P3FS Remote]

6/28/98 SPAA/PODC 36

Results

• Data distribution affects resource utilization

• But not delivered bandwidth

[Charts: delivered rate (MB/s, roughly 5-6) and client/server CPU utilization (0-40%) by access method — Local, P3FS Local, P3FS Reverse, P3FS Remote]

6/28/98 SPAA/PODC 37

I/O Bus crossings

[Diagram: memory (M), processor (P), and network interface (NI) on each node, showing the I/O bus crossings for parallel scan and parallel sort with (a) local disk and (b) remote disk]

6/28/98 SPAA/PODC 38

Opportunity: PDISK

• Producers dump data into the I/O river
• Consumers pull it out
• Hash data records across disks (see the sketch below)
• Match producers to consumers
• Integrated with work scheduling

[Diagram: producer and consumer processes connected by the river — fast communication via remote queues, fast I/O via streaming disk queues]
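A sketch of the hashing step referenced in the list above: a producer hashes each record's key to pick which disk queue (and hence which consumers) the record flows to. The names are illustrative, and FNV-1a is used here only as a stand-in hash.

    #include <stddef.h>
    #include <stdint.h>

    /* Stand-in for a streaming disk queue / remote queue in the "river". */
    void river_put(int queue, const void *record, size_t len);

    /* Simple key hash (FNV-1a), for illustration only. */
    static uint32_t hash_key(const char *key, size_t keylen)
    {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < keylen; i++) {
            h ^= (uint8_t)key[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Producer: dump a record into the river; hashing spreads records
       across the num_queues disk queues so consumers can pull in parallel. */
    void produce(const char *key, size_t keylen, const void *rec, size_t len,
                 int num_queues)
    {
        int q = (int)(hash_key(key, keylen) % (uint32_t)num_queues);
        river_put(q, rec, len);
    }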

6/28/98 SPAA/PODC 39

What will be the building block?

[Diagram: an SMP node — processors, memory, memory interconnect, network cards — attached to the network cloud; design space of nodes vs. processors per node, spanning NOWs, clusters of SMPs, and large SMPs]

6/28/98 SPAA/PODC 40

Multi-Protocol Communication

• Uniform Prog. Model is key

• Multiprotocol Messaging
– careful layout of msg queues

– concurrent objects

– polling the network hurts memory performance

• Shared Virtual Memory
– relies on underlying msgs

• Pooling vs Contention

[Diagram: one communication layer beneath Send/Write and Rcv/Read operations, using shared memory within an SMP and the network between SMPs]
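One way the uniform model in the diagram might be realized: the communication layer picks shared memory for a peer inside the same SMP and the network otherwise. Illustrative names only.

    #include <stddef.h>
    #include <stdbool.h>

    /* Stand-ins for the two underlying transports. */
    void shm_enqueue(int peer, const void *msg, size_t len);  /* within the SMP */
    void net_send(int peer, const void *msg, size_t len);     /* across nodes   */
    bool same_smp(int peer);                                  /* placement info */

    /* The application sees one send operation; the layer chooses a protocol.
       Queue layout matters here: shared-memory queues should be laid out so
       polling them does not thrash the memory system. */
    void mp_send(int peer, const void *msg, size_t len)
    {
        if (same_smp(peer))
            shm_enqueue(peer, msg, len);   /* shared-memory fast path   */
        else
            net_send(peer, msg, len);      /* network path via the NIC  */
    }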

6/28/98 SPAA/PODC 41

LogP analysis of shared mem AM

[Chart: latency, send overhead, receive overhead, gap, and round-trip time in microseconds for shared-memory active messages built with different synchronization: test & set, test & test & set, ticket, Anderson, Posix, and lock-free queues]
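For concreteness, the test & test & set variant from the legend, written with C11 atomics; this is a generic textbook sketch, not the measured implementation.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Test-and-test-and-set spinlock: spin on a plain read (cache-local) and
       only attempt the atomic exchange when the lock looks free, reducing
       the coherence traffic a naive test&set lock generates. */
    typedef struct { atomic_bool locked; } ttas_lock;

    static void ttas_acquire(ttas_lock *l)
    {
        for (;;) {
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;                                        /* test (read only)  */
            if (!atomic_exchange_explicit(&l->locked, true,
                                          memory_order_acquire))
                return;                                  /* test-and-set won  */
        }
    }

    static void ttas_release(ttas_lock *l)
    {
        atomic_store_explicit(&l->locked, false, memory_order_release);
    }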

6/28/98 SPAA/PODC 42

Virtual Interface Architecture

[Diagram: applications (Sockets, MPI, legacy, etc.) call the VI User Agent ("libvia"); slow-path operations — open, connect, map memory — go through the VIA kernel driver, while descriptor reads/writes go directly from user level to the VI-capable NIC via doorbells; each VI has send (S) and receive (R) queues plus a completion (COMP) queue]
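A sketch of the user-level send path this figure implies: build a descriptor in registered (pinned) memory, post it on a VI's send queue, and ring the doorbell so the NIC picks it up. The structures and names are illustrative, not the actual VIPL API.

    #include <stdint.h>

    #define SENDQ_DEPTH 64

    /* Illustrative shapes: a send descriptor in registered memory and a
       memory-mapped doorbell for one VI. */
    struct via_desc {
        uint64_t buf_addr;        /* virtual address of a registered buffer */
        uint32_t length;          /* bytes to send                          */
        uint32_t mem_handle;      /* registration handle for that buffer    */
        uint32_t status;          /* written by the NIC on completion       */
    };

    struct vi {
        struct via_desc   *send_queue;   /* descriptor ring (host memory)      */
        unsigned           send_tail;
        volatile uint32_t *doorbell;     /* mapped page; a store rings the NIC */
    };

    /* Post a send: fill the next descriptor, then ring the doorbell (step 1
       in the figure); the NIC then DMA-reads the descriptor and data buffer
       and later DMA-writes completion status. */
    void via_post_send(struct vi *vi, uint64_t buf, uint32_t len, uint32_t handle)
    {
        struct via_desc *d = &vi->send_queue[vi->send_tail % SENDQ_DEPTH];
        d->buf_addr   = buf;
        d->length     = len;
        d->mem_handle = handle;
        d->status     = 0;
        vi->send_tail++;
        *vi->doorbell = vi->send_tail;   /* user-level doorbell write */
    }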

6/28/98 SPAA/PODC 43

VIA Implementation Overview

[Diagram: doorbell pages, descriptor queues, and transmit/receive data buffers live in kernel memory mapped to the application. Send path for a block transfer: (1) host writes the doorbell; (2-3) NIC DMA-requests and reads the descriptor; (4-5) NIC DMA-requests and reads the data buffer; (7) NIC DMA-writes the result]

6/28/98 SPAA/PODC 44

Current VIA Performance

[Chart: round-trip time (µs, up to ~600) vs. message size (4 to 4060 bytes) for VIA, UNET, and AM2]

6/28/98 SPAA/PODC 45

VIA ahead

• You will be able to buy decent clusters

• Virtualization in host memory is easy
– will it go beyond pinned regions?

– still need to manage active endpoints (doorbells)

• Complex descriptor queues will hinder low latency short messages

– NICs will chew on them, but it takes many instructions on the host

• Need to re-examine where error handling, flow control, retry are performed

• Interactions with scheduling, I/O, locking etc. will dominate application speed-up

– will demand new development methodologies

6/28/98 SPAA/PODC 46

Conclusions

• Complete system on every node makes clusters a very powerful architecture

– can finally get serious about I/O

• Extend the system globally
– virtual memory systems,

– schedulers,

– file systems, ...

• Efficient communication enables new solutions to classic systems challenges

• Opens a rich set of issues for parallel processing beyond the personal supercomputer or LAN

– where SPAA and PODC meet
