
Software for Distributed Systems

Distributed Systems – Case Studies

• NOW: a Network of Workstations

• Condor: High Throughput Computing

• MOSIX: A Distributed System for Clusters

Projects

• NOW – a Network Of Workstations
  – University of California, Berkeley
  – Terminated about 1997 after demonstrating the feasibility of their approach

• Condor – University of Wisconsin-Madison
  – Started about 1988
  – Is now an ongoing, worldwide system of shared computing clusters

• MOSIX – clustering software for Linux machines

NOW: a Network of Workstations
http://now.cs.berkeley.edu/

• Scale: A NOW system consists of a building-wide collection of machines providing memory, disks, and processors.

• Basic Ideas
  – Use idle CPU cycles for parallel processing on clusters of workstations
  – Use memories as disk cache to break the I/O bottleneck (slow disk access times)
  – Share the resources over fast LANs
  – Premise: access to another processor via the LAN would be faster than access to local disk.


NOW “Opportunities”: Memory

• Network RAM: fast, high-bandwidth networks make it reasonable to page across the network.
  – Instead of paging out to slow disks, send pages over the fast network to RAM in an idle machine

• Cooperative file caching: improve performance by using network RAM as a very large file cache
  – Shared files can be fetched from another client’s memory rather than the server’s disk
  – Active clients can extend their disk cache size by using the memory of idle clients.
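
The lookup order implied by cooperative file caching can be sketched as below. This is a minimal illustration only, not NOW or xFS code: the `Client` class, the `read_block` function, and the dictionary standing in for the server's disk are hypothetical names chosen for the example.

```python
from typing import Dict, List, Tuple


class Client:
    """A workstation with an in-memory file-block cache (hypothetical model)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cache: Dict[str, bytes] = {}


def read_block(
    block_id: str,
    local: Client,
    peers: List[Client],
    server_disk: Dict[str, bytes],
) -> Tuple[bytes, str]:
    """Return (data, source), trying the cheapest source first."""
    # 1. Local memory: no network traffic at all.
    if block_id in local.cache:
        return local.cache[block_id], "local"
    # 2. Another client's memory: one network hop, no disk access.
    for peer in peers:
        if block_id in peer.cache:
            data = peer.cache[block_id]
            local.cache[block_id] = data
            return data, f"peer:{peer.name}"
    # 3. Server disk: the slow path the design tries to avoid.
    data = server_disk[block_id]
    local.cache[block_id] = data
    return data, "server-disk"


server = {"a": b"alpha", "b": b"beta"}
c1, c2 = Client("c1"), Client("c2")
first = read_block("a", c1, [c2], server)   # c1 misses everywhere and goes to disk
second = read_block("a", c2, [c1], server)  # c2 finds the block in c1's memory
third = read_block("a", c2, [c1], server)   # c2 now holds its own copy
```

Only the first read touches the server's disk; later readers are served from some client's memory, which is the effect the slide describes.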

NOW Opportunities: “RAWD” (Redundant Arrays of Workstation Disks)

• RAID systems provide fast performance by connecting arrays of small disks. By reading and writing data in parallel, throughput is increased.

• Instead of a hardware RAID, build a software version by reading and writing data across the workstations in the network
  – Especially useful for parallel programs running on separate machines in the network
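
Striping data across several workstation disks can be sketched as a round-robin split. The sketch assumes fixed-size chunks and plain striping with no parity or mirroring, so it shows only the parallel-layout idea, not the redundancy; `stripe` and `unstripe` are hypothetical helper names, and Python lists stand in for disks.

```python
from typing import List


def stripe(data: bytes, n_disks: int, chunk: int) -> List[List[bytes]]:
    """Split data into fixed-size chunks and deal them round-robin to disks."""
    disks: List[List[bytes]] = [[] for _ in range(n_disks)]
    for i in range(0, len(data), chunk):
        disks[(i // chunk) % n_disks].append(data[i:i + chunk])
    return disks


def unstripe(disks: List[List[bytes]]) -> bytes:
    """Reassemble the original bytes by reading the disks in round-robin order."""
    out = []
    longest = max(len(d) for d in disks)
    for row in range(longest):
        for d in disks:
            if row < len(d):
                out.append(d[row])
    return b"".join(out)


payload = b"ABCDEFGHIJ"
layout = stripe(payload, n_disks=3, chunk=2)  # three "workstation disks"
restored = unstripe(layout)
```

Because consecutive chunks live on different disks, a reader can fetch them from several machines at once, which is where the throughput gain comes from.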

NOW Opportunities: “Parallel Computing”

• Harnessing the power of multiple idle workstations in a NOW can support high-performance parallel applications.

• NOW principles:
  – Avoid going to disk by using RAM on other network nodes (assumes network access is faster than disk access)
  – Further speedup may be achieved by parallelizing the computation and striping the data to multiple disks.
  – Allow user processes to access the network directly rather than going through the operating system (one way of reducing access times)

Berkeley NOW Features

• GLUnix (Global Layer UNIX) is a layer on top of the UNIX operating systems running on the workstations.

• Applications running on GLUnix have a protected virtual operating system layer which catches UNIX system calls and translates them into GLUnix calls.

• Serverless Network File System – xFS
  – Avoided central server bottleneck
  – Cooperative file system (basically, peer-to-peer)

GLUnix Sociology

• Their biggest problem: the reluctance of users to share their computing resources; e.g.,
  – Will it affect my interactive response time?
  – Will it mean my big job only gets to run at night?

• They guaranteed each active user full workstation capability by
  – migrating guest processes when the user returns;
  – saving user state before accepting a new process, so that the file cache and memory could be restored when the user returns, thus avoiding a performance hit from having to re-acquire the working set of a process.

• They planned to dedicate a significant amount of system-wide resources to large problems to compete with existing MPPs performance-wise.

Summary

• Successful, in their opinion.

• Ran for several years in the late 90s on the Berkeley CS system

• Key enabling technologies
  – Scalable, high-performance network
  – Fast access to the network for user processes
  – Global operating system layer to support system resources as a true shared pool.

CONDOR – pre 2012
http://research.cs.wisc.edu/condor/

• Goal: “…to develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources”


HTCONDOR – post 2012
http://research.cs.wisc.edu/htcondor/

• Goal: “…to develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources” (note that the goal remains unchanged)


CONDOR Definitions

• HTC computing – “… problems that require weeks or months of computation to solve. …this type of research needs a computing environment that delivers large amounts of computational power over a long period of time.”

• Compare to High Performance computing (HPC) which “…delivers a tremendous amount of power over a short period of time.”

Overview

• Condor can be used to manage computing clusters. It is designed to take advantage of idle machines

• Condor lets users submit many (batch) jobs at the same time. Result: tremendous amounts of computation with very little user intervention.

• No need to rewrite code - just link to Condor libraries

Overview

• Condor runs on most contemporary operating systems: Windows, UNIX, Linux, MacOS, ...

• Several clusters can be combined into a “flock”, to achieve grid processing capabilities.

• Globus tools are used.

Features

• Checkpoints: save the complete state of the process
  – Critical for programs that run for long periods of time: to recover from crashes, to vacate machines whose user has returned, or for process migration due to other reasons.

• Remote system calls: data resides on the home machine and system calls are directed there. Provides protection for host machines.
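
The checkpoint-and-resume idea can be illustrated with a small application-level sketch. This is not how Condor implements it (the slides describe saving the complete process state without changing the program); here the program saves its own state with `pickle`, and the `run` function, the state layout, and the `stop_after` parameter are assumptions made for the example.

```python
import os
import pickle
import tempfile


def run(total_steps: int, ckpt_path: str, stop_after: int = -1) -> dict:
    """Sum the integers 0..total_steps-1, checkpointing after every step.

    If stop_after >= 0, return early after that many steps in this call,
    imitating a job that is vacated from its machine.
    """
    # Resume from the checkpoint if one exists; otherwise start fresh.
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"next_step": 0, "total": 0}

    done_this_call = 0
    while state["next_step"] < total_steps:
        if done_this_call == stop_after:
            return state  # interrupted; progress is already on disk
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done_this_call += 1
        # Write to a temp file, then rename, so a crash never leaves a torn checkpoint.
        tmp = ckpt_path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, ckpt_path)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
partial = run(10, ckpt, stop_after=4)  # "evicted" after 4 steps
final = run(10, ckpt)                  # resumes at step 4, as if on another machine
```

The second call does not repeat the first four steps, which is what makes checkpointing valuable for jobs that run for weeks.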

Features

• Jobs can run anywhere in the cluster (which can be a physical cluster, a virtual cluster, or even a single machine)

• Different machines have different capabilities; when submitting a job Condor users can specify the kind of machine they wish to run on.

• When sets of jobs are submitted it’s possible to define dependencies; i.e., “don’t run Job 3 until jobs 1 and 2 have completed.”
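
The dependency example ("don't run Job 3 until jobs 1 and 2 have completed") amounts to a topological ordering of a job graph. The sketch below uses Python's standard `graphlib` module to compute which jobs may start together; it is not Condor's submission syntax, and the job names are made up.

```python
from graphlib import TopologicalSorter

# Map each job to the set of jobs that must finish before it may start.
# Job names are invented for illustration.
dependencies = {
    "job1": set(),
    "job2": set(),
    "job3": {"job1", "job2"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # jobs whose prerequisites are all done
    waves.append(ready)
    sorter.done(*ready)                 # pretend every job in the wave succeeded
```

Jobs in the same wave have no dependencies on each other, so a scheduler is free to run them on different machines at the same time.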

Slides

• From a talk by Miron Livny, “The Principles and Power of Distributed Computing”, International Winter School on Grid Computing 2010.
  http://research.cs.wisc.edu/condor/talks.html

• Livny is a professor at the University of Wisconsin-Madison, where he heads the Condor Project and other grid/distributed computing projects or centers.

• Paradyn: related research; tools for performance measurement of long-running distributed or parallel programs.
  http://www.paradyn.org/html/overview.html

Condor Daemons

• Title unknown, by Hans Holbein the Younger, from Historiarum Veteris Testamenti icones, 1543

Condor Daemons

master

negotiator

collector

schedd

startd

starter

shadow

procd

kbdd

exec

Condor today

• http://research.cs.wisc.edu/condor/map/


MOSIX

The MOSIX Management System for Linux Clusters, Multi-Clusters, GPU Clusters, and Clouds: A White Paper
A. Barak and A. Shiloh, 2010


What is MOSIX?

• MOSIX (Multicomputer Operating System for UnIX) is an on-line management system that acts like a cluster operating system. [Barak and Shiloh]

• A multicomputer consists of a collection of computers, each with its own memory, that communicate over a network.
  – It is sometimes called a distributed memory multiprocessor.

• MOSIX targets High Performance Computing (HPC) on Linux clusters, multi-clusters, and clouds.


High Performance Computing

• HPC uses parallel processing to execute complex, compute-intensive applications reliably and quickly
  – Sometimes synonymous with systems that operate above a teraflop (10^12 flops) or that require supercomputers
  – Scientific & academic research, engineering applications, and the military are typical users.

• Not to be confused with High Throughput Computing (HTC)
  – HPC: tightly coupled components, needs to run in an environment where communication is fast
  – HTC: sequential batch jobs, can be scheduled on different computers


MOSIX Overview

• Provides a single-system image to users and applications: the system appears to be an SMP.

• Handles interactive and batch jobs

• Supports resource discovery and automatic workload distribution across processors. In other words, processes can migrate from the home computer to other computers in the system

• Implemented as a set of utilities that provide a Linux-like run-time environment.


System Configurations

• MOSIX clusters: connected computers (servers, workstations, or a combination) running Linux and having a single administrator.

• MOSIX multi-cluster: several MOSIX clusters, probably within the same organization, configured to work together
  – MOSIX processes can run in any of the clusters, using their home cluster environment


System Configurations

• MOSIX Clouds: a mixed collection of MOSIX clusters and multi-clusters, Linux clusters, individual servers and workstations, etc., each possibly running a different version of Linux and MOSIX

• MOSIX Cloud users can launch jobs from their home computers to run on remote nodes. The jobs have access to files on the remote node and the home node.

Cluster Partitioning

• MOSIX clusters can be partitioned with each partition being assigned to a different user or designated as a general-use pool.

Processes

• MOSIX recognizes Linux processes as well as MOSIX processes

• Linux processes run in native Linux mode and cannot be migrated to other processors
  – Generally used for administrative tasks

• MOSIX processes usually represent user applications that might need to migrate elsewhere in the system to get good service.

• Each MOSIX process has a unique home node.


MOSIX Features

• Automatic Resource Discovery

• Process Migration

• The Run-Time Environment

• The Priority Method

• Flood Control


Features: Automatic Resource Discovery

• Resources include nodes and their individual resources: current load, available memory, etc.

• Processors periodically assess their own resource state and send this information, along with recent information about other nodes, to a random set of nodes in the system, using a variation of randomized gossip dissemination.

• Result: all nodes have relatively current information about resource availability.
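
A toy simulation shows why pushing information to a few random nodes is enough for every node to learn about every other. This is a generic push-gossip sketch under simplifying assumptions (synchronous rounds, a fixed fanout of 2, static loads); it is not the MOSIX algorithm, and `gossip_round`, the node names, and the timestamped view format are invented for the example.

```python
import random
from typing import Dict, Tuple

# views[node] maps a peer name to (timestamp, load); newer timestamps win.
View = Dict[str, Tuple[int, float]]


def gossip_round(views: Dict[str, View], loads: Dict[str, float],
                 now: int, fanout: int, rng: random.Random) -> None:
    """Each node refreshes its own entry, then pushes its view to random peers."""
    names = sorted(views)
    for sender in names:
        views[sender][sender] = (now, loads[sender])
        others = [n for n in names if n != sender]
        for target in rng.sample(others, min(fanout, len(others))):
            for peer, entry in views[sender].items():
                known = views[target].get(peer)
                # Keep only the most recent information about each peer.
                if known is None or entry[0] > known[0]:
                    views[target][peer] = entry


rng = random.Random(0)  # fixed seed so the run is repeatable
loads = {f"n{i}": float(i) for i in range(8)}
views: Dict[str, View] = {name: {} for name in loads}
rounds_needed = 0
for rounds_needed in range(1, 101):
    gossip_round(views, loads, rounds_needed, fanout=2, rng=rng)
    if all(len(v) == len(loads) for v in views.values()):
        break
```

No node ever contacts all the others, yet after a handful of rounds each view covers the whole cluster, because every message also carries second-hand information.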


Features: Process Migration

• MOSIX supports preemptive process migration, meaning a process can be moved after it has begun to execute
  – This is more difficult than non-preemptive migration because the entire process state must be moved

• Migration may be initiated by the user or automatically by MOSIX
  – Reasons: load balancing, moving a process closer to resources it needs, etc.


Features: Process Migration

• Performance-monitoring algorithms run continuously to profile processes and make migration decisions.
  – Process profiles consist of information such as size, rate of system calls, how much I/O & IPC is generated, and so forth.

• Migration decisions are based on the profiles and on the resource status of cluster machines.
  – Node speed and current load, process size versus available memory, and other process characteristics all inform the decision.
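
How such inputs might combine into a decision can be sketched with a simple heuristic. Everything here is an assumption for illustration: the `Node` and `Profile` fields, the single `remote_penalty` number standing in for system-call and I/O cost, and the `min_gain` threshold are not taken from MOSIX.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    speed: float       # relative CPU speed (made-up units)
    load: int          # processes currently competing for the CPU
    free_mem_mb: int


@dataclass
class Profile:
    mem_mb: int            # process size
    remote_penalty: float  # fraction of speed lost away from home (system calls, I/O)


def choose_node(proc: Profile, home: Node, others: List[Node],
                min_gain: float = 1.2) -> Node:
    """Return where the process should run; home unless a remote node wins clearly."""
    # On the home node the process is already counted in `load`.
    home_rate = home.speed / max(home.load, 1)
    best, best_rate = home, home_rate * min_gain
    for node in others:
        if node.free_mem_mb < proc.mem_mb:
            continue  # process size versus available memory
        # Joining a node adds one competitor; remote execution pays the penalty.
        rate = node.speed / (node.load + 1) * (1.0 - proc.remote_penalty)
        if rate > best_rate:
            best, best_rate = node, rate
    return best


home = Node("home", speed=1.0, load=4, free_mem_mb=2048)
others = [Node("a", 1.0, 0, 512), Node("b", 2.0, 1, 4096)]
cpu_bound = Profile(mem_mb=1024, remote_penalty=0.1)
io_bound = Profile(mem_mb=1024, remote_penalty=0.8)
cpu_choice = choose_node(cpu_bound, home, others).name
io_bound_choice = choose_node(io_bound, home, others).name
```

In this toy setup the CPU-bound process moves to the faster node, while the I/O-heavy process stays home because forwarding its calls would cost more than the extra CPU is worth.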


Features: Process Migration

• The MOSIX layer supports process migration by making it possible for processes to execute as if they still were running on the home node.

• System calls are intercepted by MOSIX and forwarded to the home node when necessary (most of the time).

• Users don’t need to modify a program to make it migratable.


Features: Process Migration

• The downside of process migration is added overhead

• The developers evaluated performance by running three programs: one CPU intensive, one using a small amount of I/O, and one using a significant amount of I/O.

• Each program was run as a Linux process, as a MOSIX process migrated to a node in the same cluster, and as a MOSIX process migrated to another cluster. Each configuration was run several times.

• Results: The programs that were CPU intensive or used only a little I/O ran efficiently in the local cluster and the remote cluster, given a fast network (1Gb/s Ethernet).


Features: Process Migration

• Socket migration
  – Socket: network communication endpoint, represented by an IP address and a port number
  – Migratable sockets let processes communicate directly without having to go through the home node
  – This is accomplished by giving each process a “mailbox” (message queue) to which other processes can send messages
  – Actual process location is thus transparent to the communication mechanism
  – In-order message reception is guaranteed
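
The mailbox idea can be modelled in a few lines: senders address a process, not a node, so migration does not disturb communication. This is an in-memory toy under that assumption, not MOSIX's socket implementation; `Cluster`, `Mailbox`, and the method names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict


class Mailbox:
    """FIFO message queue owned by one process; it travels with the process."""

    def __init__(self) -> None:
        self.messages: Deque[str] = deque()


class Cluster:
    def __init__(self) -> None:
        self.location: Dict[int, str] = {}       # pid -> node currently hosting it
        self.mailboxes: Dict[int, Mailbox] = {}  # pid -> its mailbox

    def spawn(self, pid: int, node: str) -> None:
        self.location[pid] = node
        self.mailboxes[pid] = Mailbox()

    def migrate(self, pid: int, node: str) -> None:
        # Only the location changes; queued messages stay with the process.
        self.location[pid] = node

    def send(self, to_pid: int, msg: str) -> None:
        # Senders name the process, never the node it happens to run on.
        self.mailboxes[to_pid].messages.append(msg)

    def receive(self, pid: int) -> str:
        # popleft() on a FIFO queue preserves the order in which messages arrived.
        return self.mailboxes[pid].messages.popleft()


cluster = Cluster()
cluster.spawn(7, "node-a")
cluster.send(7, "m1")
cluster.migrate(7, "node-b")  # the receiver moves between the two sends
cluster.send(7, "m2")
received = [cluster.receive(7), cluster.receive(7)]
where = cluster.location[7]
```

The sender's code is identical before and after the migration, and the messages still arrive in the order they were sent.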


Features: Process Migration

• Host nodes are protected from migrated processes by the MOSIX software, which guarantees a secure run-time environment (sandbox).

• The only local resources accessible to the migrated process are the CPU and the memory assigned to the process.


Features: The Priority Method

• Guarantees that local processes and processes with a higher priority can evict migrated processes and low priority processes

• Cluster owners are allowed to reject processes from clusters that owners do not wish to share resources with.
  – NOW and Condor have a similar philosophy


Features: Flood Control

• Flooding: a user (intentionally or accidentally) creates a large number of processes that threaten to swamp the system.

• Control mechanisms:
  – Load-balancing algorithm won’t migrate a process to a node that doesn’t have enough memory
  – Any node can set a limit on the number of processes it will accept; attempts to create more will result in “frozen” processes that are swapped out to disk until there is room for them
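
Both control mechanisms can be sketched with a small node model. The class and method names are hypothetical, and a Python `deque` stands in for processes swapped out to disk; the actual MOSIX policy may differ in detail.

```python
from collections import deque
from typing import Deque, List


class Node:
    def __init__(self, max_procs: int, free_mem_mb: int) -> None:
        self.max_procs = max_procs
        self.free_mem_mb = free_mem_mb
        self.running: List[str] = []
        self.frozen: Deque[str] = deque()  # stands in for "swapped out to disk"

    def can_accept_migrant(self, mem_mb: int) -> bool:
        """Load-balancer check: never migrate to a node that lacks memory or room."""
        return mem_mb <= self.free_mem_mb and len(self.running) < self.max_procs

    def create(self, name: str) -> str:
        """Start a process, or freeze it if the node is at its limit."""
        if len(self.running) < self.max_procs:
            self.running.append(name)
            return "running"
        self.frozen.append(name)
        return "frozen"

    def finish(self, name: str) -> None:
        """Remove a finished process and thaw the oldest frozen one, if any."""
        self.running.remove(name)
        if self.frozen:
            self.running.append(self.frozen.popleft())


node = Node(max_procs=2, free_mem_mb=1024)
states = [node.create(f"p{i}") for i in range(4)]  # a small "flood" of 4 processes
too_big = node.can_accept_migrant(mem_mb=2048)
node.finish("p0")                                  # frees a slot, so p2 is thawed
```

The flood never exceeds the node's limit: excess processes wait in the frozen queue and are admitted one at a time as slots free up.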


MOSIX Virtual OpenCL (VCL)

• Open Computing Language (OpenCL) supports programs that execute on heterogeneous platforms (CPUs, GPUs, and other processors)
  – It consists of a language and an API that enables platform definition and control
  – It enables the use of graphics processors (GPUs) for non-graphics computing

• MOSIX applications can create OpenCL “contexts” that aggregate devices from several nodes


MOSIX Reach the Clouds (MRC)


• MRC allows applications to run in a MOSIX cloud without having to pre-copy their files.

• They can use files from their home system and files from the target node.

• The idea is to support file-sharing, and to allow applications to run in the cloud without having to store their data there.

Checkpoint and Recovery

• A process checkpoint is a copy of the program’s state at a given point in time.

• The purpose is to be able to stop a program and restart it later, or to recover from an error by “rolling back” to a previous point and resuming execution.

• It is an important tool for migratory environments because it is one way to stop a process on one machine and start it on another. It’s particularly important if processes are sometimes “frozen”.

• MOSIX provides checkpointing for most processes, but some may not run correctly after being recovered.


Other Features

• MOSIX supports batch jobs (hard to migrate interactive jobs!)

• MOSIX can run directly on top of Linux, or in a Virtual Machine.


Conclusion

• MOSIX provides an operating-system-like management system for sharing resources in Linux clusters, multi-clusters, and clouds

• Gives the illusion of running on a multiprocessor

• Users don’t need to modify applications

• Important features include resource discovery, dynamic workload distribution, the ability for processes to migrate to the location of available resources, and other features such as flood control that protect both nodes and processes.


Dandelion: Compiler and Runtime for Heterogeneous Systems

Christopher J. Rossbach et al., SOSP ’13

• Designed for heterogeneous computer systems running data-parallel applications
  – General purpose computers, clouds, multicore with CPUs and GPUs, FPGAs

• “…automatically and transparently distributes data-parallel portions of a program to available computing resources, …”

• For the .NET environment; programmers typically use C# or F#

Dandelion

• Provides a single-system image

• Programmer writes a sequential program; the system parallelizes it and distributes it across currently available resources.

• Complex – programming language, compilers, and runtime systems must cooperate
