![Page 1: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/1.jpg)
Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager
Ralph H. Castain, Ph.D.
Cisco Systems, Inc.
![Page 2: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/2.jpg)
Outline
• Overview
• Key pieces
  - OpenRTE
  - uPNP
• ORCM
  - Architecture
  - Fault behavior
• Future directions
![Page 3: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/3.jpg)
© 2006 Cisco Systems, Inc. All rights reserved.
System Software Requirements
1) Turn on once with remote access thereafter
2) Non-Stop == max 20 events/day lasting < 200ms each
3) Hitless SW Upgrades and Downgrades
4) Upgrade/downgrade SW components across delta versions
5) Field Patchable
6) Beta Test New Features in situ
7) Extensive Trace Facilities: on Routes, Tunnels, Subscribers,…
8) Configuration
9) Clear APIs; minimize application awareness
10) Extensive remote capabilities for fault management, software maintenance and software installations
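As a rough reading of requirement 2, assume the worst case in which all 20 events occur and each lasts the full 200 ms. Under that assumption (mine, not stated on the slide), the implied daily downtime budget and availability work out to:

```latex
\[
20 \times 0.2\,\text{s} = 4\,\text{s/day},
\qquad
1 - \frac{4\,\text{s}}{86\,400\,\text{s}} \approx 0.99995
\]
```

That is roughly 99.995% availability, or about 24 minutes of accumulated interruption per year.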
![Page 4: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/4.jpg)
Our Approach
• Distributed redundancy
  - NO master
  - Multiple copies of everything
  - Running in tracking mode
    • Parallel, seeing identical input
    • Multiple ways of selecting leader
• Utilize component architecture
  - Multiple ways to do something => framework!
  - Create an initial working base
  - Encourage experimentation
![Page 5: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/5.jpg)
Methodology
• Exploit open source software
  - Reduce development time
  - Encourage outside participation
  - Cross-fertilize with HPC community
• Write new cluster manager (ORCM)
  - Exploit new capabilities
  - Potential dual-use for HPC clusters
  - Encourage outside contributions
![Page 6: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/6.jpg)
Open Source ≠ Free
Pro
• Widespread exposure
  - ORTE on thousands of systems around world
  - Surface & address problems
• Community support
  - Others can help solve problems
  - Expanded access to tools (e.g., debuggers)
• Energy
  - Other ideas, methods
Con
• Your timeline ≠ my timeline
  - No penalty for late contributions
  - Academic contributors have other priorities
• Compromise: a required art
  - Code must be designed to support multiple approaches
  - Nobody wins all the time
  - Adds time to implementation
![Page 7: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/7.jpg)
Outline
• Overview
• Key pieces
  - OpenRTE
  - uPNP
• ORCM
  - Architecture
  - Fault behavior
• Future directions
3-day workshop
![Page 8: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/8.jpg)
A Convergence of Ideas
[Diagram: PACX-MPI (HLRS), LAM/MPI (IU), LA-MPI (LANL), and FT-MPI (U of TN) feed into Open MPI. Robustness (CSU), Fault Detection (LANL, Industry), Grid (many), Autonomous Computing (many), and FDDP (Semi. Mfg. Industry) feed into Resilient Computing Systems. Both feed into OpenRTE.]
![Page 9: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/9.jpg)
Program Objective
*Cell = one or more computers sharing a common launch environment/point
![Page 10: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/10.jpg)
Participants
Developers
• DOE/NNSA*
  - Los Alamos Nat Lab
  - Sandia Nat Lab
  - Oak Ridge Nat Lab
• Universities
  - Indiana University
  - Univ of Tennessee
  - Univ of Houston
  - HLRS, Stuttgart
Support
• Industry
  - Cisco
  - Oracle
  - IBM
  - Microsoft*
  - Apple*
  - Multiple interconnect vendors
• Open source teams
  - OFED, autotools, Mercurial
*Providing funding
![Page 11: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/11.jpg)
Reliance on Components
• Formalized interfaces
  - Specifies “black box” implementation
  - Different implementations available at run-time
  - Can compose different systems on the fly
[Diagram: Caller connected to Interface 1, Interface 2, Interface 3]
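The idea of a formalized interface with interchangeable run-time implementations can be sketched as follows. This is a minimal illustration in Python for brevity (ORTE itself is written in C); the `Framework` class, its `register` and `select` methods, and the two launcher components are hypothetical names, not ORTE's API.

```python
from typing import Callable, Dict, Tuple


class Framework:
    """One formalized interface with interchangeable implementations (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        # component name -> (priority, implementation)
        self._components: Dict[str, Tuple[int, Callable[[str], str]]] = {}

    def register(self, name: str, priority: int, impl: Callable[[str], str]) -> None:
        self._components[name] = (priority, impl)

    def select(self, requested: str = "") -> Callable[[str], str]:
        """Return the requested component, else the highest-priority one."""
        if requested:
            return self._components[requested][1]
        best = max(self._components, key=lambda n: self._components[n][0])
        return self._components[best][1]


launcher = Framework("launcher")
launcher.register("ssh", 10, lambda app: f"ssh-launch {app}")
launcher.register("local", 50, lambda app: f"fork-exec {app}")

# The caller sees only the interface; the implementation is chosen at run time.
default_launch = launcher.select()
forced_launch = launcher.select("ssh")
```

The caller never names a concrete implementation unless it wants to, which is what allows different systems to be composed from the same set of interfaces.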
![Page 12: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/12.jpg)
OpenRTE and Components
• Components are shared libraries
  - Central set of components in installation tree
  - Users can also have components under $HOME
• Can add / remove components after install
  - No need to recompile / re-link apps
  - Download / install new components
  - Develop new components safely
• Update “on-the-fly”
  - Add, update components while running
  - Frameworks “pause” during update
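One way to picture a framework that "pauses" during an update is a lock shared by callers and the updater. The sketch below is an assumption about the general technique, not a description of how ORTE implements it; `PausableFramework` and its methods are hypothetical.

```python
import threading
from typing import Callable


class PausableFramework:
    """Holds one active component; callers block briefly while it is swapped (hypothetical)."""

    def __init__(self, impl: Callable[[int], int]) -> None:
        self._impl = impl
        self._lock = threading.Lock()

    def call(self, value: int) -> int:
        # Calls take the lock, so none can run in the middle of an update.
        with self._lock:
            return self._impl(value)

    def update(self, new_impl: Callable[[int], int]) -> None:
        # "Pause": hold the lock while the component is replaced.
        with self._lock:
            self._impl = new_impl


fw = PausableFramework(lambda x: x + 1)
before = fw.call(10)
fw.update(lambda x: x * 2)  # swapped without restarting the caller
after = fw.call(10)
```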
![Page 13: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/13.jpg)
Component Benefits
• Stable, production quality environment for 3rd party researchers
  - Can experiment inside the system without rebuilding everything else
  - Small learning curve (learn a few components, not the entire implementation)
  - Allow wide use, experience before exposing work
• Vendors can quickly roll out support for new platforms
  - Write only the components you want/need to change
  - Protect intellectual property
![Page 14: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/14.jpg)
ORTE: Resiliency*
• Fault
  - Events that hinder the correct operation of a process.
    • May not actually be a “failure” of a component, but can cause system-level failure or performance degradation below specified level
  - Effect may be immediate or some time in the future.
  - Usually are rare. May not have many data examples.
• Fault prediction
  - Estimate probability of incipient fault within some time period in the future
• Fault Tolerance (reactive, static)
  - Ability to recover from a fault
• Robustness (metric)
  - How much can the system absorb without catastrophic consequences
• Resilience (proactive, dynamic)
  - Dynamically configure system to minimize impact of potential faults
*standalone presentation
![Page 15: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/15.jpg)
Key Frameworks
Error Manager (Errmgr)
• Receives all process state updates
  - Sensor, waitpid
  - Includes predictions
• Determines response strategy
  - Restart locally, globally, abort
• Executes recovery
  - Accounts for fault groups to avoid repeated failover
Sensor
• Monitors software and hardware state-of-health
  - Sentinel file size, mod & access times
  - Memory footprint
  - Temperature
  - Heartbeat
  - ECC errors
• Predicts incipient faults
  - Trend, fingerprint
  - AI-based algos coming
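A trend-based prediction of the kind the Sensor framework lists could look like the sketch below: fit a straight line to recent samples and estimate when it crosses a limit. The least-squares fit, the function name `time_to_limit`, and the sample data are illustrative assumptions; the slides do not say which trend algorithm ORCM uses.

```python
from typing import List, Optional, Tuple


def time_to_limit(samples: List[Tuple[float, float]], limit: float) -> Optional[float]:
    """Fit a least-squares line to (time, value) samples and return the time
    at which the line crosses `limit`, or None if the value is not rising."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    return (limit - intercept) / slope


# Hypothetical memory footprint in MB, sampled every 10 s.
footprint = [(0.0, 100.0), (10.0, 105.0), (20.0, 110.0), (30.0, 115.0)]
crossing = time_to_limit(footprint, limit=200.0)
```

An error manager could compare the predicted crossing time with the current time and order a restart or relocation before the limit is actually reached.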
![Page 16: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/16.jpg)
Outline
• Overview
• Key pieces
  - OpenRTE
  - uPNP
• ORCM
  - Architecture
  - Fault behavior
• Future directions
![Page 17: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/17.jpg)
Universal PNP
• Widely adopted standard
• ORCM uses only a part
  - PNP discovery via announcement on std multicast channel
    • Includes application id, contact info
    • All applications respond
  - Wireup “storm” limits scalability
  - Various algorithms for storm reduction
  - Each application assigned own “channel”
    • All output from members of that application
    • Input sent to that application given to all members
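The announce-and-respond discovery described above, and why it produces a wireup "storm", can be illustrated with an in-memory stand-in for multicast. Everything here (`MulticastBus`, `App`, the `"system"` channel name, the message fields) is a hypothetical simplification, not the ORCM wire protocol.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class MulticastBus:
    """In-memory stand-in for multicast: every subscriber of a channel gets each message."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Handler]] = defaultdict(list)
        self.sent = 0  # total messages sent, to make the "storm" visible

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._subs[channel].append(handler)

    def send(self, channel: str, msg: dict) -> None:
        self.sent += 1
        for handler in list(self._subs[channel]):
            handler(msg)


SYSTEM = "system"  # hypothetical name for the standard discovery channel


class App:
    def __init__(self, bus: MulticastBus, app_id: str, contact: str) -> None:
        self.bus, self.app_id, self.contact = bus, app_id, contact
        self.peers: Dict[str, str] = {}  # contact info -> application id
        bus.subscribe(SYSTEM, self._on_system)

    def announce(self) -> None:
        self.bus.send(SYSTEM, {"type": "announce", "app": self.app_id, "contact": self.contact})

    def _on_system(self, msg: dict) -> None:
        if msg["contact"] == self.contact:
            return  # ignore our own traffic
        self.peers[msg["contact"]] = msg["app"]
        if msg["type"] == "announce":  # every application responds
            self.bus.send(SYSTEM, {"type": "response", "app": self.app_id, "contact": self.contact})


bus = MulticastBus()
a = App(bus, "routing", "node0:5000")
b = App(bus, "routing", "node1:5000")
c = App(bus, "logger", "node2:5000")
c.announce()
```

With N applications already running, a single announcement triggers N responses, each of which is itself delivered to everyone; that growth is the scalability limit the slide refers to.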
![Page 18: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/18.jpg)
Outline
• Overview
• Key pieces
  - OpenRTE
  - uPNP
• ORCM
  - Architecture
  - Fault behavior
• Future directions
![Page 19: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/19.jpg)
ORCM DVM
• One per node
  - Started at node boot or launched by tool
  - Locally spawns and monitors processes, system health sensors
  - Small footprint (≤1Mb)
• Each daemon tracks existence of others
  - PNP wireup
  - Know where all processes are located
[Diagram: three orcmd daemons connected by the predefined “System” multicast channel]
![Page 20: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/20.jpg)
Parallel DVMs
• Allows
  - Concurrent development, testing in production environment
  - Sharing of development resources
• Unique identifier (ORTE jobid)
  - Maintains separation between orcmds
  - Each application belongs to its respective DVM
  - No cross-DVM communication allowed
![Page 21: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/21.jpg)
Configuration Mgmt
[Diagram: three orcmd daemons; the orcmd with the lowest vpid opens the cfgi framework. cfgi components: confd, tool, file. External sources: confd daemon, orcm-start, file. Arrow labels: connect?, subscribe, recv config, set, file?]
![Page 22: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/22.jpg)
Configuration Mgmt
[Diagram: same configuration flow as on the previous slide (orcmd daemons, cfgi framework with confd, tool, and file components; confd daemon, orcm-start, file)]
Update any missing config info
Assume “leader” duties
![Page 23: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/23.jpg)
Application Launch
[Diagram: three orcmd daemons; cfgi framework with confd, tool, and file components fed by the confd daemon, orcm-start, and file. Labels: Config change (#procs, location); Launch msg sent on the predefined “System” multicast channel]
![Page 24: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/24.jpg)
Resilient Mapper
• Fault groups
  - Nodes with common failure mode
  - Node can belong to multiple fault groups
  - Defined in system file
• Map instances across fault groups
  - Minimize probability of cascading failures
  - One instance/fault group
  - Pick lightest loaded node in group
  - Randomly map extras
• Next-generation algorithms
  - Failure mode probability => fault group selection
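The mapping rules above (one instance per fault group, lightest loaded node in the group, extras placed randomly) can be sketched as follows. The function name, the data layout, and the group and node names are assumptions for illustration; the actual mapper and its system-file format are not shown in the slides.

```python
import random
from typing import Dict, List


def map_instances(fault_groups: Dict[str, List[str]], load: Dict[str, int],
                  n_instances: int, rng: random.Random) -> List[str]:
    """Place one instance per fault group on its lightest-loaded node,
    then place any extras on randomly chosen nodes."""
    load = dict(load)  # work on a copy so the caller's view is untouched
    placement: List[str] = []
    for group in sorted(fault_groups):
        if len(placement) == n_instances:
            break
        # Lightest load wins; the node name breaks ties deterministically.
        node = min(fault_groups[group], key=lambda n: (load[n], n))
        placement.append(node)
        load[node] += 1
    all_nodes = sorted({n for nodes in fault_groups.values() for n in nodes})
    while len(placement) < n_instances:
        node = rng.choice(all_nodes)
        placement.append(node)
        load[node] += 1
    return placement


# Hypothetical fault groups; note that n1 belongs to two of them.
groups = {"power-A": ["n0", "n1"], "power-B": ["n2", "n3"], "switch-1": ["n1", "n4"]}
current_load = {"n0": 3, "n1": 1, "n2": 0, "n3": 2, "n4": 0}
placement = map_instances(groups, current_load, n_instances=3, rng=random.Random(0))
extras = map_instances(groups, current_load, n_instances=5, rng=random.Random(0))
```

Because a node can sit in several fault groups, this simple version can still co-locate two instances on one node; weighting by failure-mode probability, as the last bullet suggests, would be one way to refine it.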
![Page 25: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/25.jpg)
Multiple Replicas
• Multiple copies of each executable
  - Run on separate fault groups
  - Async, independent
• Shared pnp channel
  - Input: recvd by all
  - Output: broadcast to all, recvd by those who registered for input
    • Leader determined by recvr
![Page 26: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/26.jpg)
Leader Selection
• Two forms of leader selection
  - Internal to ORCM DVM
  - External facing
• Internal - framework
  - App-specific module
  - Configuration specified
  - Lowest rank
  - First contact
  - None
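Since the receiver decides which replica it treats as leader, the selection modules can be thought of as interchangeable functions over the replicas a receiver has heard from. The sketch below is illustrative only: the `Replica` tuple, the module names, and the reading of "None" as "accept output from every replica" are assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A replica as seen by a receiver: (rank, arrival order of its first message)
Replica = Tuple[int, int]


def lowest_rank(replicas: List[Replica]) -> Optional[Replica]:
    return min(replicas, key=lambda r: r[0]) if replicas else None


def first_contact(replicas: List[Replica]) -> Optional[Replica]:
    return min(replicas, key=lambda r: r[1]) if replicas else None


def no_leader(replicas: List[Replica]) -> Optional[Replica]:
    return None  # assumed meaning of "None": no single leader is chosen


LEADER_MODULES: Dict[str, Callable[[List[Replica]], Optional[Replica]]] = {
    "lowest_rank": lowest_rank,
    "first_contact": first_contact,
    "none": no_leader,
}

replicas = [(2, 0), (0, 2), (1, 1)]
by_rank = LEADER_MODULES["lowest_rank"](replicas)
by_contact = LEADER_MODULES["first_contact"](replicas)

# If the current leader fails, the receiver simply re-runs its module.
survivors = [r for r in replicas if r != by_rank]
new_leader = LEADER_MODULES["lowest_rank"](survivors)
```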
![Page 27: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/27.jpg)
External Connections
orcm-connector
Input
• Broadcast on respective PNP channel
Output
• Determines “leader” to supply output to rest of world
• Utilize any leader method in framework
![Page 28: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/28.jpg)
Testing in Production
[Diagram: orcm-logger and a logger framework with db, file, syslog, and console components]
![Page 29: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/29.jpg)
Software Maintenance
• On-the-fly module activation
  - Configuration manager can select new modules to load, reload, activate
  - Change priorities of active modules
• Full replacement
  - When more than a module needs updating
  - Start replacement version
  - Configuration manager switches “leader”
  - Stop old version
![Page 30: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/30.jpg)
Detecting Failures
• Application failures - detected by local daemon
  - Monitors for self-induced problems
    • Memory and cpu usage
    • Orders termination if limits exceeded or are trending to exceed
  - Detects unexpected failures via waitpid
• Hardware failures
  - Local hardware sensors continuously report status
    • Read by local daemon
    • Projects potential failure modes to pre-order relocation of processes, shutdown node
  - Detected by DVM when daemon misses heartbeats
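Detection by missed heartbeats can be sketched as a table of last-seen times checked against a deadline. The class name, the interval, and the "three missed beats" threshold below are hypothetical; the slides do not give ORCM's actual values.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Declares a daemon failed after `max_missed` heartbeat intervals of silence."""

    def __init__(self, interval: float, max_missed: int) -> None:
        self.interval = interval
        self.max_missed = max_missed
        self._last_seen: Dict[str, float] = {}

    def beat(self, daemon: str, now: float) -> None:
        self._last_seen[daemon] = now

    def failed(self, now: float) -> List[str]:
        deadline = self.interval * self.max_missed
        return sorted(d for d, t in self._last_seen.items() if now - t > deadline)


monitor = HeartbeatMonitor(interval=1.0, max_missed=3)
monitor.beat("orcmd-0", now=0.0)
monitor.beat("orcmd-1", now=0.0)
monitor.beat("orcmd-0", now=2.0)  # orcmd-1 has gone quiet
```

Timestamps are passed in explicitly so the logic can be exercised without waiting on a real clock; a daemon would supply a monotonic time source instead.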
![Page 31: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/31.jpg)
Application Failure
• Local daemon
  - Detects (or predicts) failure
  - Locally restarts up to specified max #local-restarts
  - Utilizes resilient mapper to determine re-location
  - Sends launch message to all daemons
• Replacement app
  - Announces itself on application public address channel
  - Receives responses - registers own inputs
  - Begins operation
• Connected applications
  - Select new “leader” based on current module
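The local daemon's decision between restarting in place and relocating can be sketched as a small policy function. The names `recover` and `pick_other_node`, and the choice to reset the counter after a relocation, are assumptions made for this example.

```python
from typing import Callable, Dict, Tuple


def recover(proc: str, node: str, restarts: Dict[str, int], max_local_restarts: int,
            pick_new_node: Callable[[str], str]) -> Tuple[str, str]:
    """Return ("restart", node) while under the local-restart limit,
    otherwise ("relocate", new_node) as chosen by the mapper."""
    used = restarts.get(proc, 0)
    if used < max_local_restarts:
        restarts[proc] = used + 1
        return ("restart", node)
    restarts[proc] = 0  # assumption: the counter resets once the process moves
    return ("relocate", pick_new_node(node))


def pick_other_node(failed_node: str) -> str:
    # Stand-in for the resilient mapper: just pick the other of two nodes.
    return "node1" if failed_node == "node0" else "node0"


restarts: Dict[str, int] = {}
actions = [recover("routed", "node0", restarts, 2, pick_other_node) for _ in range(3)]
```

Three consecutive failures of the same process yield two local restarts followed by one relocation, after which a launch message would go out to all daemons.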
![Page 32: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/32.jpg)
Node Failure
[Diagram: three orcmd daemons; cfgi framework with confd, tool, and file components]
Next higher orcmd becomes leader
Open/init cfgi framework
Update any missing config info
Mark node as “down”
Relocate application processes from failed node
Connected apps failover leader per active leader module
Attempt to restart
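The daemon-level leadership rule (lowest vpid leads, and the next higher orcmd takes over on failure) reduces to a one-line computation over the set of daemons still known to be alive. This is a sketch under the assumption that every daemon holds the same membership view, which the slides imply but do not spell out; `current_leader` is a hypothetical name.

```python
from typing import Optional, Set


def current_leader(alive_vpids: Set[int]) -> Optional[int]:
    """The lowest surviving vpid leads; when it fails, the next higher one takes over."""
    return min(alive_vpids) if alive_vpids else None


alive = {0, 1, 2}
leader_before = current_leader(alive)
alive.discard(0)  # the leader's node fails, e.g. after missed heartbeats
leader_after = current_leader(alive)
```

If all daemons agree on who is alive, each arrives at the same leader independently, so no separate election message is needed in this simplified picture.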
![Page 33: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/33.jpg)
Node Replacement/Addition
• Auto-boot of local daemon on power up
  - Daemon announces to DVM
  - All DVM members add node to available resources
• Reboot/restart
  - Relocate original procs back up to some max number of times (need smarter algo here)
  - Leadership remains unaffected to avoid “bounce”
• Processes will map to new resource as start/restart demands
  - Future: rebalance existing load upon node availability
![Page 34: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/34.jpg)
Outline
• Overview
• Key pieces
  - OpenRTE
  - uPNP
• ORCM
  - Architecture
  - Fault behavior
• Future directions
![Page 35: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/35.jpg)
System Software Requirements
1) Turn on once with remote access thereafter
2) Non-Stop == max 20 events/day lasting < 200ms each
3) Hitless SW Upgrades and Downgrades
4) Upgrade/downgrade SW components across delta versions
5) Field Patchable
6) Beta Test New Features in situ
7) Extensive Trace Facilities: on Routes, Tunnels, Subscribers,…
8) Configuration
9) Clear APIs; minimize application awareness
10) Extensive remote capabilities for fault management, software maintenance and software installations
~5ms recovery
Start new app triplet, kill old one
New app triplet, register for production input
Boot-level startup
Start/stop triplets, leader selection
![Page 36: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/36.jpg)
Still A Ways To Go
• Security
  - Who can order ORCM to launch/stop apps?
  - Who can “log” output from which apps?
  - Network extent of communications?
• Communications
  - Message size, fragmentation support
  - Speed of underlying transport
  - Truly reliable multicast
  - Asynchronous messaging
![Page 37: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/37.jpg)
Still A Ways To Go
• Transfer of state
  - How does a restarted application replica regain the state of its prior existence?
  - How do we re-sync state across replicas so outputs track?
• Deterministic outputs
  - Same output from replicas tracking same inputs
    • Assumes deterministic algorithms
  - Can we support non-deterministic algorithms?
    • Random channel selection to balance loads
    • Decisions based on instantaneous traffic sampling
![Page 38: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/38.jpg)
Still A Ways To Go
• Enhanced algorithms
  - Mapping
  - Leader selection
• Fault prediction
  - Implementation and algorithms
  - Expanded sensors
• Replication vs rapid restart
  - If we can restart in a few millisecs, do we really need replication?
![Page 39: Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc](https://reader035.vdocuments.net/reader035/viewer/2022062718/56649e885503460f94b8c48d/html5/thumbnails/39.jpg)
Concluding Remarks
http://www.open-mpi.org
http://www.open-mpi.org/projects/orcm