
Cluster Operating System Support For Parallel Autonomic Computing

Andrzej M. Goscinski, J. Silcock, M. Hobbs
School of Information Technology
Deakin University
Geelong, Vic 3217, Australia


June 2004, COSET’2004

A Need for More than Execution Performance

Performance is a critical assessment criterion
Security, reliability, and ease of programming are neglected
Furthermore
– Parallel computers are seen as being user unfriendly
– Parallel processing is not used on a daily basis
– Ordinary users have to be involved in programming activities that are of an operating system nature
– Ordinary engineers, managers, etc. do not have, and should not have, the specialized knowledge needed to program operating system oriented activities


Aim of Our Research

IBM has launched a comprehensive program
– “to re-examine an obsession with faster, smaller, and more powerful”
– “to look at the evolution of computing from a more holistic perspective”
IBM’s Autonomic Computing - one of the Grand Challenges
Parallel processing on non-dedicated clusters could benefit from the Autonomic Computing vision
Aim: to show a general design of services and an initial implementation of a system that moves parallel processing on clusters to the computing mainstream using the Autonomic Computing vision


IBM’s Autonomic Computing

The name “autonomic” has not caught on everywhere, if only because it’s IBM’s
– Microsoft prefers “trustworthy”
– Others prefer the more generic “self-managing”
Many see “autonomic computing” as one of the basic parts of a revolutionary technology that
– Will start the new .com boom
– Will move parallel computing on clusters to the computing mainstream


IBM’s Autonomic Computing

Characteristics of autonomic computing systems: such a system
– knows itself
– configures and reconfigures itself under varying and unpredictable conditions
– optimizes its working
– performs something akin to healing
– provides self-protection
– knows its surrounding environment
– exists in an open (non-hermetic) environment
– anticipates the optimized resources needed while keeping its complexity hidden


Related Work

A number of projects related to Autonomic Computing are mentioned on the IBM website
While many of the reported projects engage in some aspects of Autonomic Computing, none engage in research to develop a system that has all eight of the required characteristics
None of the projects addresses parallel processing, in particular parallel processing on non-dedicated clusters


Design of Autonomic Elements (Services) Providing Autonomic Computing on Non-dedicated Clusters

We have proposed and designed a set of autonomic elements that must be provided to develop an autonomic computing environment on a non-dedicated cluster
Three component levels
– Services
– Computers
– Non-dedicated cluster
Note: we have not addressed
– Hardware aspects
– Administration aspects


Cluster Knows Itself

A need for resource discovery
This autonomic element runs on each computer
Activities
– Acquires knowledge of static parameters of computers
    – processor type (e.g., speed)
    – memory size
    – available software
– Acquires knowledge of dynamic parameters of clusters
    – computers’ load
    – available memory
    – communication pattern and volume
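The split between static and dynamic parameters can be made concrete with a small data model. The following is only a sketch under stated assumptions: the class names, field names, and units are hypothetical and are not taken from the slides or from Holos. It shows one resource discovery element per computer that records static parameters once, refreshes dynamic ones, and stores reports received from peers.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StaticParams:
    """Parameters that do not change while the computer is running."""
    processor_type: str
    cpu_mhz: int
    memory_mb: int
    software: frozenset


@dataclass
class DynamicParams:
    """Parameters that must be re-sampled periodically."""
    load: float                                        # hypothetical unit: runnable processes per CPU
    free_memory_mb: int
    bytes_sent_to: dict = field(default_factory=dict)  # peer name -> communication volume


@dataclass
class ResourceRecord:
    computer: str
    static: StaticParams
    dynamic: DynamicParams


class ResourceDiscovery:
    """One instance per computer; keeps the latest record for every known computer."""

    def __init__(self, local: ResourceRecord):
        self.local = local
        self.known = {local.computer: local}

    def sample(self, load, free_memory_mb, bytes_sent_to):
        # Static parameters are collected once; only the dynamic part is refreshed.
        self.local.dynamic = DynamicParams(load, free_memory_mb, dict(bytes_sent_to))

    def receive(self, record: ResourceRecord):
        # Called when a peer's resource discovery element reports its state.
        self.known[record.computer] = record


rd_i = ResourceDiscovery(ResourceRecord(
    "computer_i",
    StaticParams("x86", 2400, 1024, frozenset({"PVM", "MPI"})),
    DynamicParams(load=0.0, free_memory_mb=1024),
))
rd_i.sample(load=1.5, free_memory_mb=512, bytes_sent_to={"computer_j": 4096})
rd_i.receive(ResourceRecord(
    "computer_j",
    StaticParams("x86", 1800, 512, frozenset({"MPI"})),
    DynamicParams(load=0.2, free_memory_mb=400),
))
print(sorted(rd_i.known))
```

How the records travel between computers is left out on purpose; the slides only state that the element runs on each computer.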


Resource Discovery Service Design

[Figure: Resource Discovery Service design. Computer i and Computer j each show a Resource Discovery element, a CPU, Main Memory, and Computation elements 1 and 2. Labelled information: Computational Load & Parameters, Local Communication Load, Remote Communication Load, Communication Pattern & Load.]


Cluster Configures and Reconfigures Itself under Varying and Unpredictable Conditions

In a non-dedicated cluster there are times when
– Some computers are lightly loaded or idle
– Some computers cannot be used because
    – their owners removed them from the shared pool of resources
    – they are heavily loaded
To offer high availability, i.e., to configure and reconfigure itself, the system
– Forms parallel virtual clusters adaptively and dynamically
– Bases the forming on load and changing resources
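As an illustration of forming a virtual cluster from load and availability, here is a minimal sketch. The function name, the report format, and the load threshold are assumptions made for the example; the slides do not specify the selection policy.

```python
def form_virtual_cluster(reports, needed, max_load=0.5):
    """Pick up to `needed` usable computers, least loaded first.

    `reports` maps a computer name to a dict with keys
    'available' (False when the owner has withdrawn it) and 'load'.
    """
    usable = [
        (info["load"], name)
        for name, info in reports.items()
        if info["available"] and info["load"] <= max_load
    ]
    return [name for _, name in sorted(usable)[:needed]]


reports_t0 = {
    "c1": {"available": True, "load": 0.1},
    "c2": {"available": True, "load": 0.9},   # heavily loaded
    "c3": {"available": False, "load": 0.0},  # withdrawn by its owner
    "c4": {"available": True, "load": 0.3},
}
cluster_t0 = form_virtual_cluster(reports_t0, needed=3)

# Later the owner returns c3 and c1 becomes busy, so the cluster is re-formed.
reports_t1 = dict(reports_t0,
                  c1={"available": True, "load": 0.8},
                  c3={"available": True, "load": 0.0})
cluster_t1 = form_virtual_cluster(reports_t1, needed=3)

print(cluster_t0)
print(cluster_t1)
```

Re-running the same function on fresh reports is what makes the membership adaptive: the virtual cluster at a later time may contain different computers than the earlier one.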


Availability Service Design

[Figure: Availability Service design. Labels: Availability Services; RD on each computer; Virtual Parallel Cluster (t0), Virtual Parallel Cluster (t1), Virtual Parallel Cluster (t2), Virtual Parallel Cluster (t3), where times t0 < t1 < t2 < t3.]


Cluster Should Optimize Its Working

Application computation elements should be placed optimally
To improve performance there is a need for knowledge of
– Computation load
– Available memory
– Communication costs
To optimize the cluster’s working there is
– Static allocation and load balancing
– Ability to change performance indices that reflect user objectives
– Computation element migration, creation and duplication
– Setting of computation priorities of applications
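One way to read "performance indices that reflect user objectives" is as a weighted cost over load, memory, and communication. The sketch below is an assumption, not the Holos algorithm: the index formula, the weights, and the greedy placement loop are all hypothetical. It shows how changing the weights changes where computation elements are placed.

```python
def performance_index(computer, element, placed, weights):
    """Lower is better. Combines load, memory pressure, and communication cost."""
    w_load, w_mem, w_comm = weights
    if computer["free_memory_mb"] < element["memory_mb"]:
        return float("inf")  # the element does not fit on this computer
    # Communication cost: volume exchanged with partners already placed elsewhere.
    remote_volume = sum(
        volume for partner, volume in element["talks_to"].items()
        if partner in placed and placed[partner] != computer["name"]
    )
    memory_pressure = element["memory_mb"] / computer["free_memory_mb"]
    return w_load * computer["load"] + w_mem * memory_pressure + w_comm * remote_volume


def allocate(elements, computers, weights=(1.0, 1.0, 0.01)):
    """Greedy static allocation: place each element where the index is lowest."""
    placed = {}
    for element in elements:
        best = min(computers, key=lambda c: performance_index(c, element, placed, weights))
        if performance_index(best, element, placed, weights) == float("inf"):
            raise RuntimeError(f"no computer can hold {element['name']}")
        placed[element["name"]] = best["name"]
        # Account for the resources the new element will consume.
        best["load"] += 1.0
        best["free_memory_mb"] -= element["memory_mb"]
    return placed


def make_computers():
    return [{"name": "C1", "load": 0.0, "free_memory_mb": 512},
            {"name": "C2", "load": 0.0, "free_memory_mb": 512}]


elements = [
    {"name": "P1", "memory_mb": 100, "talks_to": {"P2": 500}},
    {"name": "P2", "memory_mb": 100, "talks_to": {"P1": 500}},
    {"name": "P3", "memory_mb": 100, "talks_to": {}},
]

# Index that penalises remote communication: chatty partners are co-located.
comm_aware = allocate(elements, make_computers(), weights=(1.0, 1.0, 0.01))
# Index that ignores communication: elements are spread by load and memory alone.
load_only = allocate(elements, make_computers(), weights=(1.0, 1.0, 0.0))
print(comm_aware)
print(load_only)
```

With the communication weight switched on, P1 and P2 end up on the same computer; with it switched off, they are separated.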


High Performance Service Design

[Figure: High Performance Service design. Labels: Availability Services; Virtual Parallel Cluster with computers C1, C2, C3, ..., Cn and processes P1, P2, Pi, Pj; Global Scheduler with Static Allocation and Load Balancing; Migration. Mappings shown: { where: P1 → C1, P2 → C2, ..., {Pi, Pj} → Cn } and { where, which, when: Pi : Cn → C3 }.]


Cluster Should Perform Something Akin To Healing

Hardware and software faults can occur
Failures lead to the termination of computations
To provide something akin to healing
– Faults are identified and reported
– Checkpointing of parallel computation elements of applications is provided
– Recovery from failures is employed
– Migration of applications from faulty computers to healthy computers is carried out automatically
– Redundant/replicated services are provided
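The checkpoint and recovery steps can be sketched in a few lines. This is an illustration only: the classes are hypothetical, an in-memory dictionary stands in for stable storage, and `pickle` is used simply as a convenient way to copy state; none of this is taken from Holos.

```python
import pickle


class ComputationElement:
    def __init__(self, name, state=None):
        self.name = name
        self.state = state if state is not None else {"step": 0, "partial_sum": 0}

    def run_step(self):
        self.state["step"] += 1
        self.state["partial_sum"] += self.state["step"]


class CheckpointStore:
    """Stands in for stable storage (a disk or another computer's memory)."""

    def __init__(self):
        self._saved = {}

    def coordinated_checkpoint(self, elements):
        # All elements are saved at the same point so the set is mutually consistent.
        self._saved = {e.name: pickle.dumps(e.state) for e in elements}

    def recover(self, name):
        # Rebuild the element from its last checkpoint, e.g. on a healthy computer.
        return ComputationElement(name, pickle.loads(self._saved[name]))


store = CheckpointStore()
elements = [ComputationElement("P1"), ComputationElement("P2")]
for _ in range(3):
    for e in elements:
        e.run_step()
store.coordinated_checkpoint(elements)

elements[0].run_step()   # progress made after the checkpoint...
del elements[0]          # ...is lost when P1's computer fails
recovered = store.recover("P1")
print(recovered.state)
```

In a coordinated scheme the surviving elements would normally also roll back to the same checkpoint so that all of them resume from a consistent state; the sketch omits that step.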


Self-Healing Service Design

[Figure: Self-Healing Service design. Labels: Computation Element i; Checkpointing (coordinated); Recovery; Checkpoint for Computation Element i (shown three times); Disk; Compute Elem i after crash recovery; computers C1, C2, Cj, Ck.]


Clusters Should Provide Self-Protection

Computation elements of parallel applications are distributed
Computation elements communicate using messages
They are subject to passive and active attacks
To provide self-protection:
– Virus detection and recovery must be offered
– Resource protection should be a mandatory service
– Encryption, as a countermeasure against passive attacks, should be used
– Authentication, as a countermeasure against active attacks, should be used
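To illustrate authentication as a countermeasure against active attacks (modified or forged messages), the sketch below tags each message with an HMAC. This is a generic technique chosen for the example, not the mechanism the slides attribute to Holos; the shared key, the message format, and the function names are assumptions.

```python
import hashlib
import hmac


def sign(key: bytes, sender: str, payload: bytes) -> bytes:
    """Return an authentication tag that binds the payload to its sender."""
    return hmac.new(key, sender.encode() + b"|" + payload, hashlib.sha256).digest()


def verify(key: bytes, sender: str, payload: bytes, tag: bytes) -> bool:
    # compare_digest performs a constant-time comparison of the two tags.
    return hmac.compare_digest(sign(key, sender, payload), tag)


shared_key = b"hypothetical-cluster-key"   # assumption: distributed out of band
message = b"partial result: 42"
tag = sign(shared_key, "P1", message)

print(verify(shared_key, "P1", message, tag))                # genuine message
print(verify(shared_key, "P1", b"partial result: 43", tag))  # modified in transit
print(verify(shared_key, "P2", message, tag))                # forged sender
```

Authentication of this kind does not hide the message contents; protection against passive attacks (eavesdropping) still requires encryption, as the slide notes.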


To Allow a System to Know Its Surrounding Environment and to Prevent a System From Existing in a Hermetic Environment

There are applications that require
– More computation power
– Specialized software
– Unique peripheral devices, etc.
Many owners cannot afford such resources
Some owners can offer their services and resources to appropriate users


To Allow a System to Know Its Surrounding Environment and to Prevent a System From Existing in a Hermetic Environment

To benefit from existing unique resources
– Resource discovery of other clusters is provided
– Advertising of services is in place
– Systems are able to cooperate
– Negotiation is in use
– Brokerage of resources and services is used
– Resources are shared in a distributed manner
– “The move toward a grid” should be in place


Grid-like Service Design

[Figure: Grid-like Service design. Labels: Brokerage Services in each of Cluster 1, Cluster 2, Cluster 3, ..., Cluster n; Computational Services; Storage/Memory Services; Printer Services; Information Services; Advertisement; Exporting Services; Withdrawal Services; Import Requests.]


A Cluster Should Anticipate the Optimized Resources Needed While Keeping Its Complexity Hidden

The scarcity of software to assist ordinary programmers limits the harnessing of the computing power of non-dedicated clusters
This implies
– A programming environment that is simple to use
– Knowledge of resource distribution is not needed
– Message passing and shared memory programming are supported transparently


Easy Programming Service Design

[Figure: Easy Programming Service design. Labels: Programming Environment; Shared Memory; Message Passing or PVM / MPI; DSM; Communication Primitives; System Services of an Operating System; Kernel Services of an Operating System.]


The Holos Services for Autonomic Computing Clusters

Holos is built to demonstrate that it is possible to develop an autonomic non-dedicated cluster that
– could be routinely employed by ordinary engineers, managers, etc.
– is able to support next generation application software executing on clusters
We followed the recommendations of IBM’s vision regarding autonomic elements
We decided to view autonomic elements as processes
– Each computer is a multi-process system with its objectives
– A cluster is a set of multi-process systems with its objectives


Holos

[Figure: Holos architecture. System Servers: Global Scheduler, Execution Server, Migration Server, Checkpoint Server, Resource Discovery Server, DSM Server, Brokerage Server. Kernel Servers: IPC Server, Process Manage Server, Space Manage Server. GENESIS Microkernel. Parallel Processes: MP / PVM / MPI Process, DSM Process.]

Holos was developed based on the P2P and microkernel paradigms
The microkernel provides services such as
– local IPC
– basic paging operations
– interrupt handling
– context switching
Three groups of processes:
– kernel servers
– system servers
– application processes
Kernel and system servers are stationary, application processes are mobile
All processes communicate using messages


System Servers Form a Basis of an Autonomic Operating System for Nondedicated Clusters

Resource Discovery Server - collects data about computation and communication load
Availability Server - dynamically and adaptively forms a parallel virtual cluster for the application
Global Scheduling Server - maps application processes using static allocation and dynamic load balancing on the computers of the virtual parallel cluster
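Dynamic load balancing has to answer three questions: which process to move, from where to where, and when. The sketch below is a deliberately simple, hypothetical policy (process count as the only load measure, a fixed imbalance threshold) and is not the Holos scheduler; it only illustrates the shape of such a decision.

```python
def choose_migration(placement, threshold=2):
    """Return (process, source, destination), or None if the cluster is balanced.

    `placement` maps each computer in the virtual cluster to the list of
    application processes currently running on it.
    """
    source = max(placement, key=lambda c: len(placement[c]))
    destination = min(placement, key=lambda c: len(placement[c]))
    if len(placement[source]) - len(placement[destination]) < threshold:
        return None  # moving a process would not reduce the imbalance enough
    return placement[source][-1], source, destination


def rebalance(placement, threshold=2):
    """Apply migrations until no pair of computers differs by `threshold` or more."""
    moves = []
    while (move := choose_migration(placement, threshold)) is not None:
        process, source, destination = move
        placement[source].remove(process)
        placement[destination].append(process)
        moves.append(move)
    return moves


placement = {"C1": ["P1"], "C2": ["P2"], "C3": [], "Cn": ["Pi", "Pj", "Pk"]}
moves = rebalance(placement)
print(moves)
print(placement)
```

A fuller policy would use the load, memory, and communication data gathered by resource discovery rather than a bare process count.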


System Servers Form a Basis of an Autonomic Operating System for Nondedicated Clusters

Execution Server - coordinates the single, multiple and group creation and duplication of application processes on both local and remote computers
Migration Server - coordinates moving application processes to other computers
DSM Server - hides the distributed nature of the cluster’s memory and allows writing code as though using physically shared memory


System Servers Form a Basis of an Autonomic Operating System for Nondedicated Clusters

Checkpoint Server - coordinates creation of checkpoints for an executing application
Fault Recovery Server - recovers application processes / applications using checkpoints
IAC Server - supports remote interprocess communication and supports group communication within sets of application processes
Brokerage Server - supports advertising and sharing services through service exporting, importing and revoking
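The export, import, and revoke operations of a brokerage service can be pictured as a small registry per cluster. The class and method names below are hypothetical and the peer lookup is a plain loop; the sketch only illustrates the three operations named above, not the Holos Brokerage Server itself.

```python
class Broker:
    """One broker per cluster; peers are the brokers of other clusters."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.exported = {}   # service name -> description, offered by this cluster
        self.peers = []

    def export_service(self, name, description):
        # Advertise a local service so that other clusters can find it.
        self.exported[name] = description

    def revoke_service(self, name):
        # Withdraw a service; later import requests will no longer see it.
        self.exported.pop(name, None)

    def import_service(self, name):
        """Ask peer brokers for a service; return (cluster, description) or None."""
        for peer in self.peers:
            if name in peer.exported:
                return peer.cluster, peer.exported[name]
        return None


cluster1, cluster2 = Broker("cluster1"), Broker("cluster2")
cluster1.peers.append(cluster2)

cluster2.export_service("printer", "colour laser")
found = cluster1.import_service("printer")
cluster2.revoke_service("printer")
after_revoke = cluster1.import_service("printer")
print(found, after_revoke)
```

Negotiation between clusters, which the slides also mention, would sit on top of a registry like this and is not modelled here.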


Holos Possesses the Autonomic Computing Characteristics

Autonomic Computing Requirement | Cooperating Holos Servers (Relationships Among Autonomic Elements)
To allow a system to know itself | Resource Discovery Server
A system must configure and reconfigure itself under varying and unpredictable conditions | Resource Discovery Server, Global Scheduling Server, Migration Server, Execution Server, and Availability Server
A system must optimize its working | Global Scheduling Server, Migration Server, and Execution Server
A system must perform something akin to healing | Checkpoint Server, Recovery Server, Migration Server, Global Scheduling Server
A system must provide self-protection | Capabilities in the form of System Names
A system must know its surrounding environment | Resource Discovery Server, and Brokerage Server
A system cannot exist in a hermetic environment | Interprocess Communication Server, and Brokerage Server
A system must anticipate the optimized resources needed while keeping its complexity hidden (most critical for the user) | DSM Server, Execution Server, DSM Programming Environment, Message Passing Programming Environment, PVM/MPI Programming Environment


Conclusion

Autonomic computing has been shown to be a basic part of a revolutionary technology that
– Could move parallel computing on non-dedicated clusters to the computing mainstream
– (Whether it will start the new .com boom is yet to be shown)
The development of the Holos cluster operating system demonstrates that it is possible to build an autonomic non-dedicated cluster
The Holos cluster operating system has been built from scratch