The Management of Microsoft Cluster Server Computing Environments by Richard R. Lee President, Data Storage Technologies, Inc., and author of the book “Windows NT Microsoft Cluster Server”


Table of Contents

Overview

Microsoft Cluster Server (MSCS)

Cluster-Aware Applications

The MSCS Cluster Administrator

Management Challenges and Shortcomings

ClusterX for Microsoft Cluster Server (MSCS)

MSCS and VERITAS ClusterX

VERITAS ClusterX Deployment

Summary and Recommendations

Overview

Microsoft Cluster Server (MSCS) has quickly become the de facto solution for high-availability clustering in the Windows NT space of the enterprise computing market. While certainly not the first to market in the growing segment of high-availability computing (Digital, Tandem, NCR, VERITAS, Vinca and several other solutions were all available well in advance of MSCS), Microsoft has nonetheless captured the hearts and minds of most IT users based upon MSCS’s tight integration with the Windows NT operating system (OS), along with the use of a formidable team of hardware partners to co-develop and extensively test the product prior to release. These so-called “Early Adopter partners” included IBM, Compaq, Tandem, Digital, NCR and Data General¹, all leaders in clustering technology for the past 20 years and major contributors of intellectual property to MSCS.

Like many clustering solutions of the day, Microsoft Cluster Server (Version 1.0 is part of Windows NT Enterprise Edition, which was released in the fall of 1997) can be both difficult to deploy and challenging to manage. It uses a centralized management console called the Cluster Administrator (CA) for setup, manipulation and monitoring of all Resources and Failover Groups. The CA can be challenging to use even with an individual two-node cluster, and virtually impossible to use in multi-cluster environments².

To create a more manageable MSCS environment, one that delivers on the clustering promises of TCO reductions and availability enhancements, VERITAS Software has developed an MSCS cluster management solution called VERITAS ClusterX for MSCS. It is a cluster, application and NT service configuration/management solution for use by administrators in MSCS-based computing environments. Announced in 1998 by NuView, Inc. and acquired by VERITAS in August 1999, ClusterX is now available in its third version (3.0) with many enhanced features and benefits for MSCS administrators to use in any size of Microsoft Cluster Server-based computing environment. ClusterX has been especially tailored to manage distributed clusters in any enterprise computing environment³.

Within the context of this white paper we will discuss the background of both MSCS and ClusterX. You will quickly see why ClusterX is becoming the de facto Cluster Management Solution for MSCS, regardless of the size of the deployment (from a single-cluster HA solution to an enterprise-oriented, multi-cluster environment).


1 Intel was also an “early adopter” partner, acting in a consultative capacity based on its CPU and system chipset technology expertise, as well as with respect to its Standard High-Volume Server motherboard market initiatives.

2 Using the CA you can open connections to multiple clusters (by name), but can only manage one cluster environment at a time. Most multi-cluster environments use a dedicated NT Workstation running CA for each cluster (these workstations must have unique IP addresses established at the time the cluster is first set up).

3 A recent survey of 150+ MSCS administrators conducted by Giga Information Group and Sunbelt Software indicates that many sites already have multiple MSCS clusters deployed, with more planned for the future.

Microsoft Cluster Server (MSCS)

Microsoft Cluster Server (MSCS) today is a two-node high-availability solution for use in application environments that require major enhancements to Availability⁴ with respect to stand-alone NT Server solutions⁵. It is based upon the use of intelligent middleware that resides between the Windows NT Operating System (OS) and the Applications and Services that the user would like enhanced.

MSCS uses a “shared nothing” architectural model, where each node (server) has ownership of specific storage resources located on a shared-storage bus (SCSI, FC or SSA) except in times of failover. During failover, ownership of these storage resources is transferred (via SCSI lock/release commands) from the failed node to the surviving node. The surviving node has duplicate instances of the applications designated for failover installed on it. These are started once the Cluster Executive initiates the transfer of storage resource ownership, along with all the other Resources (if any) contained in the Failover Group⁶. After this transfer and the subsequent start-up of its copies of the failed application or service, the surviving node resumes the operations that were interrupted at the time of the failover, e.g., file and print sharing, web services, database transactions and queries (via roll-back restart to the last “committed” transaction). The surviving node also takes ownership of the Quorum Resource, a special disk or volume that contains the cluster database⁷. This mode of operation continues until the failed node is revived and brought back online⁸. At that time the failed node (or application/NT service) can have its Physical Disk Resource (and the others in the Failover Group) transferred back to its ownership on either an automatic or a manual basis. It then resumes normal operations until a failure on either node (or its applications/NT services) occurs again.
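The ownership hand-off described above can be sketched as a small simulation. This is an illustrative model only, not MSCS code: the class names, node names and group names are hypothetical, and the model ignores SCSI reservations, application restart and timing.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverGroup:
    name: str
    resources: list       # e.g. ["Physical Disk", "SQL Server"]
    preferred_owner: str  # node that normally runs the group
    owner: str = ""       # node that currently owns the group

    def __post_init__(self):
        if not self.owner:
            self.owner = self.preferred_owner


@dataclass
class Cluster:
    nodes: dict                                 # node name -> True if the node is up
    groups: list = field(default_factory=list)
    quorum_owner: str = ""

    def fail_node(self, failed: str) -> None:
        """Mark a node down and move everything it owned to a surviving node."""
        self.nodes[failed] = False
        survivors = [n for n, up in self.nodes.items() if up]
        if not survivors:
            raise RuntimeError("no surviving node to fail over to")
        target = survivors[0]
        for group in self.groups:
            if group.owner == failed:
                group.owner = target        # disks and other resources change owner
        if self.quorum_owner == failed:
            self.quorum_owner = target      # survivor also takes the Quorum Resource

    def fail_back(self, revived: str) -> None:
        """Bring a node back online and return the groups that prefer it."""
        self.nodes[revived] = True
        for group in self.groups:
            if group.preferred_owner == revived:
                group.owner = revived


cluster = Cluster(
    nodes={"NODE-A": True, "NODE-B": True},
    groups=[
        FailoverGroup("SQL Group", ["Physical Disk", "SQL Server"], "NODE-A"),
        FailoverGroup("File Group", ["Physical Disk", "File Share"], "NODE-B"),
    ],
    quorum_owner="NODE-A",
)
cluster.fail_node("NODE-A")
owners_after_failover = {g.name: g.owner for g in cluster.groups}
cluster.fail_back("NODE-A")
owners_after_failback = {g.name: g.owner for g in cluster.groups}
```

In this sketch the Quorum Resource stays with the surviving node after failback; whether it should move back is a policy choice the model leaves out.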


4 MSCS provides improvements in Scalability through the use of Symmetric Virtual Servers. SQL Server 7.0 can operate in this mode. Manageability enhancements have been limited to a Single System Image for the most part until the recent availability of VERITAS ClusterX.

5 Windows NT Server Standard Edition provides approximately 97% Availability out of the box. Using Windows NT Enterprise Edition, including MSCS, along with other built-in availability enhancement techniques can improve Availability to approximately 99.5%.

6 The Physical Disk Resource is the most basic individual resource used by MSCS and its Failover Groups. All Failover Groups are comprised of at least a Physical Disk Resource.

7 Only one node can own the Quorum Resource (either a disk or a logical volume). Its ownership is determined by a special Quorum algorithm.

Figure 1: An architectural block diagram of MSCS deployed in an Active/Active mode

MSCS can be deployed on an application-by-application basis in a number of ways. There are five deployment models outlined by Microsoft. Each takes advantage of the specific attributes that MSCS has to offer with respect to enhancing availability, scalability and manageability. These five basic models are as follows:

1. Active/Active: In Active/Active mode, both nodes are allowed to have varying workloads and applications residing on them. They each perform work independent of the other, with instances of those applications set up for failover/failback residing on both nodes.

2. Active/Standby: In Active/Standby mode, the active node is online and doing work, while the inactive node sits in a “hot standby” mode waiting for any type of failure to occur on the primary node.

3. Partial Cluster Solution: In this mode there is a mixture of applications/resources that can Failover/Failback, along with those that cannot (non-cluster-aware). The cluster-aware applications are set up in a normal manner under MSCS using shared storage resources, and those that aren’t cluster-aware use local storage resources found on the node where they permanently reside.

4. Virtual Server Only: This model utilizes MSCS virtual server mode, without having formed an actual cluster. It can be deployed on any node that has cluster server running at the time.

5. Hybrid Solution: The hybrid model is a combination of all the others previously described.

These deployment models are intended to support several types of Failover scenarios. These scenarios are designed to be tailored to the specifics of each application type. Typical of these scenarios are the following:

A. Automatic Failover w/Automatic Failback: In this scenario, the application or service that fails or becomes unavailable automatically fails over to its alternate node when there is a loss of heartbeat or, in the case of a cluster-aware application, when the “LooksAlive” and “IsAlive” messages are not returned. When the failed node or application returns to its normal state and the heartbeat has been re-established, the Failover Group’s resources are all returned automatically to their original state and owner.

B. Automatic Failover/Manual Failback: In this scenario, the failed node or application is brought back online manually by the administrator. This allows for thorough testing and monitoring prior to the transfer – eliminating any potential for the Failover Group to fail again and bring down the entire cluster.

C. Manual Failover/Manual Failback: This scenario is used for facilitating rolling upgrades and routine maintenance of the cluster. Using full manual control of the cluster, the administrator can bring applications, services and the node itself offline gracefully, with full monitoring. All designated applications can then be restarted on the alternate node, while upgrades or routine maintenance operations are performed on the offline node. Once these are complete, this node can be brought back online and its original workload and Failover Groups transferred back. At that time the alternate node can be brought offline and its workload transferred to the first node, so that it in turn can undergo routine maintenance or have applications upgraded.

These are just a few of the Failover scenarios that are supported by MSCS. The administrator has the option of creating many more depending upon his or her requirements on an application-by-application basis.
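The three scenarios above differ only in who triggers each move. A minimal sketch of that decision follows; the policy names, event names and returned strings are hypothetical labels chosen for illustration and are not MSCS settings.

```python
from enum import Enum


class Policy(Enum):
    AUTO_FAILOVER_AUTO_FAILBACK = "A"
    AUTO_FAILOVER_MANUAL_FAILBACK = "B"
    MANUAL_FAILOVER_MANUAL_FAILBACK = "C"


def next_action(policy, event, admin_approved=False):
    """Return what happens to a Failover Group when `event` occurs.

    event is "owner_failed" or "owner_recovered" (hypothetical event names).
    admin_approved models the administrator explicitly requesting the move.
    """
    if event == "owner_failed":
        if policy is Policy.MANUAL_FAILOVER_MANUAL_FAILBACK:
            return "move group" if admin_approved else "wait for administrator"
        return "move group"
    if event == "owner_recovered":
        if policy is Policy.AUTO_FAILOVER_AUTO_FAILBACK:
            return "return group to original owner"
        if admin_approved:
            return "return group to original owner"
        return "wait for administrator"
    raise ValueError(f"unknown event: {event}")
```

For example, under scenario B a recovered node gets its group back only after the administrator has tested it: `next_action(Policy.AUTO_FAILOVER_MANUAL_FAILBACK, "owner_recovered")` returns `"wait for administrator"` until `admin_approved=True` is passed.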


8 MSCS, like many other clustering solutions, uses a “heartbeat” signal to monitor the status of nodes in the cluster. This heartbeat signal is sent over a “private network” used for inter-node communications only. The heartbeat in its most rudimentary form is a single “ping” that is sent back and forth between the nodes at specific intervals. If this ping signal is not returned for any reason, the Cluster Executive attempts to use the Public Network or the SCSI bus to re-establish a heartbeat. In the event that all of these paths fail to respond to the heartbeat ping, the cluster will begin Failover.

One of the key differentiators found in MSCS with respect to other NT-based clustering solutions is the use of a “cluster-aware” API developed by Microsoft and its partners. This API is used to create cluster-aware applications and services that increase the granularity of Failover detection down to specific Applications and Services, rather than just node-wide, allowing failed applications and services to be failed over as opposed to the entire node. Most system failures or “hangs” are software-driven according to current statistics, so this increase in granularity allows software errors to be overcome without failing over an entire node and all of its applications and services.
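The heartbeat fallback described in the footnote above (private network first, then the Public Network or the SCSI bus, with Failover only when every path is silent) can be sketched as follows. The function name and the path labels are hypothetical, and the order in which the two fallback paths are tried is an assumption.

```python
def should_fail_over(path_alive, paths=("private network", "public network", "SCSI bus")):
    """Try each heartbeat path in order; fail over only if none answers.

    path_alive is a callable that takes a path name and returns True when
    the peer node answered the heartbeat on that path.
    Returns (fail_over, path_that_answered_or_None).
    """
    for path in paths:
        if path_alive(path):
            return False, path
    return True, None


# The private network is down, but the peer still answers on the public network.
status = {"private network": False, "public network": True, "SCSI bus": True}
decision = should_fail_over(lambda p: status[p])

# Every path is silent, so Failover begins.
all_down = should_fail_over(lambda p: False)
```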

Cluster-Aware Applications

One of MSCS’s most significant differentiators with respect to competing clustering solutions is its software interface specification called the “Wolfpack API.” This API (some 120+ pages in length) is designed to let software developers take full (or partial) advantage of the power of MSCS with the applications that they develop. It significantly increases the granularity of Failure detection (down to the service level), and provides mechanisms for more comprehensive application monitoring and reporting. In future versions of MSCS, it will be used to help provide unprecedented levels of linear scaling through inter-node communications (use of the MSCS private network and the VIA specification).

Early usage of this API has been dominated by Microsoft in the form of Enterprise Editions of its familiar BackOffice products. These include:

• SQL Server 7.0 (and 6.5)

• Exchange 5.5

• Internet Information Server

• Message Queue Server

• Transaction Server

• SMB File Share

• Print Spooler

All take advantage of the Wolfpack API in one manner or another (LooksAlive vs. IsAlive), and can be easily configured as virtual servers (including symmetric).
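The difference between the two health checks named above can be illustrated with a small polling simulation. This is a sketch under stated assumptions, not the Wolfpack API itself: it assumes LooksAlive is a cheap check run at every polling interval, IsAlive is a thorough check run less often, and a failed cheap check triggers an immediate thorough check. The function name, the tick-based timing and the ratio of 12 are hypothetical.

```python
def poll_resource(looks_alive, is_alive, ticks, is_alive_every=12):
    """Simulate a monitor polling one resource for `ticks` polling intervals.

    looks_alive / is_alive are callables that take the tick number and return
    True when the check passes. The cheap check runs on every tick; the
    thorough check runs every `is_alive_every` ticks, or at once if the cheap
    check fails. Returns the tick at which the resource was declared failed,
    or None if it stayed healthy.
    """
    for tick in range(1, ticks + 1):
        thorough_due = tick % is_alive_every == 0
        if not looks_alive(tick) or thorough_due:
            if not is_alive(tick):
                return tick
    return None


# A hung service: the process still exists (the cheap check passes), but it
# stops answering real requests from tick 20 onward (the thorough check fails).
failed_at = poll_resource(
    looks_alive=lambda tick: True,
    is_alive=lambda tick: tick < 20,
    ticks=60,
)
```

In this run the hang that starts at tick 20 is only caught at tick 24, the next scheduled thorough check, which shows why an application that implements only the cheap check offers coarser failure detection.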

These first-generation cluster-aware applications provide significant availability enhancements over their non-cluster-aware counterparts (with some scalability enhancements as well when you use virtual symmetric servers and partitioned data sets), but can be difficult to set up and administer. These applications come bundled with setup wizards, but much of the work in setting them up is done manually via the Cluster Administrator. This can prove tedious and confusing to the untrained administrator and can be a source of errors for even the well-trained administrator.


Wolfpack API

Management Function   Function Call         Description
Cluster               OpenCluster           Open a connection to a cluster
Cluster               SetClusterName        Set the name for a cluster
Node                  EvictClusterNode      Delete a node from the cluster db
Resource              FailClusterResource   Initiate a resource failure
Group                 MoveClusterGroup      Move a group and its resources to another node
Configuration         ClusterRegOpenKey     Open a config db key

Figure 2: A sampling of commands from the Wolfpack API

The MSCS Cluster Administrator

Included with MSCS is a management and configuration tool known as the Cluster Administrator or CA. It is an explorer-type GUI for use in setting up, managing and monitoring your cluster. The Cluster Administrator must be run on an NT Workstation designated during setup and provides the interface to the cluster during every phase of its operation⁹. The Cluster Administrator also contains a number of setup wizards for creating and installing applications and services on the cluster.

Although straightforward in approach, the CA can be very challenging and difficult to use in day-to-day administration of MSCS. Many of these challenges are driven by the lack of true visualization of the cluster and its resources during operation and setup. The administrator is left in many cases to fend for him or herself to determine the status of key services and hardware components not displayed by the CA.

All of the commands and functions found in the CA are duplicated under a command-line version known as Cluster.exe. These commands can be set up under several scripting scenarios, including Windows Scripting Host.
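As an illustration of scripting the command-line tool, the sketch below builds and prints a group command without executing it. The argument layout (`cluster <cluster> group <group> /<option>[:<node>]`) and the `moveto` and `status` options are assumptions that should be checked against `cluster /?` on the target system; the cluster, group and node names are hypothetical.

```python
import subprocess


def group_command(cluster, group, option, node=None):
    """Build the argument list for a Cluster.exe group operation.

    The layout is an assumption; verify it with `cluster /?` before use.
    """
    switch = f"/{option}:{node}" if node else f"/{option}"
    return ["cluster", cluster, "group", group, switch]


def run(args, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(f'"{a}"' if " " in a else a for a in args))
    if not dry_run:
        subprocess.run(args, check=True)


move = group_command("CLUSTER1", "SQL Group", "moveto", "NODE-B")
status = group_command("CLUSTER1", "SQL Group", "status")
run(move)  # dry run: prints the command line only
```

Keeping the dry run as the default makes it possible to review a batch of planned moves before anything touches a live cluster.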


Figure 3: CA View of a Cluster-aware application and its Resources (Exchange 5.5 EE)

Figure 4: The MSCS Cluster Administrator. Shown on the right are the Resources in the Cluster Group

9 The current Cluster Administrator will be used in early releases of Windows 2000 Advanced Server. It will ultimately be incorporated into the Microsoft Management Console as a snap-in. No time frame has been announced for this upgrade.

Management Challenges and Shortcomings

The MSCS Cluster Administrator is sorely lacking in both functionality and flexibility. It is difficult to install and configure applications and Failover Groups for even the most seasoned administrator. It has limited management functions, and cannot span across multiple clusters, much less perform rudimentary cluster tasks such as Load Balancing and Cluster Database protection and management. The CA was designed in a vacuum with little feedback from users in the field. The majority of sites that evaluated MSCS and the CA had never had a GUI-based management tool (most clusters are CLI or script-driven) to work with and did not learn until much later how limited the Cluster Administrator really was. This level of frustration, along with the inability to maximize the capabilities of MSCS with its own management tools, prompted the development of ClusterX.

ClusterX for Microsoft Cluster Server (MSCS)

ClusterX is generally described as a “one to many” management tool for use in MSCS-based clustering environments. It was designed to enhance the existing MSCS Cluster Administrator, in order to provide dramatic improvements in the scope of functionality options available to administrators, as well as to substantially increase the number of clusters that can be effectively managed from a single console. It was designed specifically for the Windows NT Server environment, and is based upon such Microsoft standards as:

• COM (Component Object Model) and DCOM (Distributed COM)

• ActiveX scripting technology (JScript and VBScript)

• The Microsoft Management Console (MMC “snap-in”)

• ActiveX Containers (ClusterX can be used with IE 5.0 or later)

ClusterX was designed to be non-blocking so that multiple tasks can be performed in parallel. It is also multi-threaded in design so that multiple commands can be processed simultaneously, with support for prioritizing those commands.

ClusterX adds major functionality to any MSCS environment. The new capabilities that it provides are key to administrators achieving the full capabilities of their MSCS clusters. These new capabilities are:

• The creation of a load balancing infrastructure across all cluster nodes. Administrators can now monitor workloads on each node on a real-time basis. Information is displayed in terms of percentage utilization of available resources (CPU, etc.), allowing the administrator to move Failover Groups off over-worked servers to less heavily used ones, while continuing to monitor them no matter where they ultimately reside.

• The addition of backup and restore functions to protect Cluster Configuration Data. ClusterX allows the system administrator to back up the entire cluster configuration information. This data can then be used to restore the cluster in the event of a catastrophic failure (cluster-wide or on a single node).

• The support for duplicating Failover Groups across multiple clusters. A knowledge-based rules system allows administrators to set up and manage complex Resources and Applications.

• The ability to move Resources from one Failover Group to another. Using the Dependency View, administrators can simply “drag ’n drop” resources from one group to another.

• The creation and manipulation of Failover Group Dependency trees. Failover Groups and their complex dependencies can be manipulated and viewed in their logical context.

• The creation of comprehensive logs for all cluster activities, problems and changes. User actions, and all other cluster activity, are logged and displayed in a consolidated Audit Log View. Multiple levels of filtering can be applied to display log information in the most relevant manner.


• Setup wizards for use with cluster-aware applications and services. Advisors and wizards are provided for the most popular cluster-aware applications and Windows NT services to allow for enhanced setup and configuration. These tools can configure MSCS as well as the applications themselves.
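The load balancing capability in the list above (watching percentage utilization per node and moving Failover Groups off over-worked servers) can be sketched as a simple decision function. This is not how ClusterX is implemented: the 80% threshold, the per-group load estimates and all names are hypothetical assumptions for illustration.

```python
def suggest_move(node_load, group_load, group_owner, threshold=80.0):
    """Suggest moving one Failover Group off the busiest node.

    node_load:   node name -> CPU utilization in percent
    group_load:  group name -> share of CPU the group is estimated to use
    group_owner: group name -> node that currently owns the group
    Returns (group, source, target), or None when no move is worthwhile.
    """
    busiest = max(node_load, key=node_load.get)
    idlest = min(node_load, key=node_load.get)
    if busiest == idlest or node_load[busiest] < threshold:
        return None
    candidates = [g for g, n in group_owner.items() if n == busiest]
    # Keep only the moves that leave the target node below the threshold.
    fits = [g for g in candidates if node_load[idlest] + group_load[g] < threshold]
    if not fits:
        return None
    group = max(fits, key=group_load.get)  # move the heaviest group that fits
    return group, busiest, idlest


suggestion = suggest_move(
    node_load={"NODE-A": 92.0, "NODE-B": 35.0},
    group_load={"SQL Group": 50.0, "File Group": 20.0, "Web Group": 30.0},
    group_owner={"SQL Group": "NODE-A", "File Group": "NODE-A", "Web Group": "NODE-B"},
)
```

Here the SQL Group is the heaviest load on the busy node, but moving it would push the other node past the threshold, so the sketch suggests moving the File Group instead.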

All of these major functional areas (as well as some others not mentioned here) provide significant enhancements over what is available with the MSCS Cluster Administrator. In total, ClusterX and its expanded capabilities deliver on the promises that MSCS (and Microsoft) made during its development with respect to improvements in Availability, along with attendant reductions in Management burden and costs – both key components of everyone’s efforts to reduce TCO, while increasing operational efficiencies.

Figure 5: A VERITAS ClusterX “Enterprise-wide” view of Clusters, Failover Groups and Resources

Figure 6: A schematic view of two clusters managed by VERITAS ClusterX


MSCS and VERITAS ClusterX

Microsoft Cluster Server is the cornerstone of Microsoft’s efforts to move Windows NT into the Enterprise Computing space of the market. It solves major problems associated with availability, scalability and manageability that have plagued Windows NT since its earliest days. Given this significance, Microsoft and its hardware/software partners have invested heavily in MSCS with respect to core product development and testing. What they missed along the way in this process was creating a simple-to-use but powerful management environment for these clusters. They also apparently envisioned that end-users would only deploy MSCS as autonomous clusters with little need to manage multiples of these in most circumstances. Both of these short-sighted decisions left a gaping hole in Microsoft’s first-generation cluster offering – MSCS.

Additionally, the growth in the number of cluster-aware applications available for MSCS exposed other shortcomings in the CA. Although many of these applications came with setup wizards, the wizards did not set the application up correctly on the cluster itself; they merely set up the application and its single or multiple instances. The Systems Administrator was then required to intervene and either set up the application’s Failover Groups and their Resources (including dependencies) in advance of installing the application itself, or tweak them after the fact. Administrators were left to their own devices to fashion dependency trees and to determine how they were structured. This did little to make IT managers believe that MSCS was going to save time and reduce costs.

Responding to this growing level of frustration expressed by early adopters of this technology, NuView began to develop ClusterX as the enterprise-wide, comprehensive management and configuration tool that systems administrators were looking for.

ClusterX came to market at just the right time to meet the growing demands of MSCS deployers and administrators. ClusterX Version 1.0 filled in the most gaping holes found in the management and configuration of MSCS, and the functionality has continued to grow to where it is today with V3.0 VERITAS ClusterX for MSCS. The capabilities of ClusterX have not gone unnoticed within the MSCS hardware camps in the computer industry. At this time, Data General, Unisys and Dell Computer are bundling versions of ClusterX with their MSCS solutions. It is anticipated that this list will grow eventually to include all providers of MSCS solutions as ClusterX becomes the de facto standard for MSCS management and configuration.
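To make the idea of a dependency tree concrete, the sketch below derives the order in which the Resources of one Failover Group must be brought online. The group layout is a hypothetical file share example, not taken from any product documentation, and the code uses Python's standard `graphlib` module (Python 3.9 or later) rather than anything from MSCS or ClusterX.

```python
from graphlib import TopologicalSorter

# Resource -> the resources it depends on (hypothetical File Share group).
dependencies = {
    "Physical Disk": set(),
    "IP Address": set(),
    "Network Name": {"IP Address"},
    "File Share": {"Physical Disk", "Network Name"},
}

# A resource can only come online after everything it depends on is online.
online_order = list(TopologicalSorter(dependencies).static_order())

# Taking the group offline walks the same tree in reverse.
offline_order = list(reversed(online_order))
```

A circular dependency, which is an easy mistake when dependencies are entered by hand, makes `static_order()` raise `graphlib.CycleError` instead of returning an order.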

Figure 7: Several VERITAS ClusterX Dependency Trees on an MSCS Cluster


VERITAS ClusterX Deployment

The deployment of ClusterX is very straightforward. During installation, ClusterX assigns agents to each cluster node (server) that it manages. These agents provide feedback and logging to the ClusterX client workstation that is used for command and control. This data is then used to support policy-based management of each cluster, along with load balancing activities.

ClusterX is then configured for each cluster. This process is based on ClusterX “learning” the specifics of the cluster and its Failover Groups and Resources. It also examines each cluster’s hardware configuration with respect to public and private network connections, CPU and memory information, physical disk devices, etc. All of this information is then displayed to the administrator graphically, as well as logged by ClusterX.

As shown in several of the figures in this white paper, ClusterX displays on a single console all information about the clusters it manages, with varying messages displayed as icons or popups.

With regard to interoperability with other systems used across the enterprise, ClusterX is easily integrated into the frameworks of such enterprise management applications as Unicenter TNG, HP OpenView, Tivoli TME, etc.

Currently, ClusterX has advisors and wizards to support the most popular applications and NT services being deployed in MSCS environments. These include:

• SMB File Shares

• Print Spoolers

• Microsoft Exchange 5.5 EE

• Microsoft SQL Server 6.5/7.0 EE

• Microsoft Internet Information Server 3.0/4.0

ClusterX has proven easy to install and to become familiar with, all within a short time from startup. It has easy-to-read messages and icons and gives Systems Administrators all the information and tools required to manage and support MSCS clusters – regardless of how widely they are dispersed or the complexity of the applications and NT services that they have clustered.


Summary and Recommendations

Microsoft Cluster Server and ClusterX go hand in hand. They work synergistically to create a Windows NT clustering offering that rivals long-established high-availability solutions found in the Unix and proprietary OS spaces of the marketplace. They combine forces to meet the challenge of improving the Availability, Manageability and, to a lesser extent, Scalability of major Windows NT applications such as SQL Server, Exchange and IIS. In addition to improving these major shortcomings found in Windows NT, they reduce costs. These costs are not only in respect to TCO (total cost of ownership), but include business losses due to downtime. MSCS and ClusterX reduce downtime losses from hundreds of hours per year to hours per year in many cases. Additionally, with the current wave of service level agreement offerings available from many vendors in the market, ClusterX allows MSCS users to effectively monitor compliance with these agreements.
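As a rough check on the downtime claim, the availability figures quoted in footnote 5 (approximately 97% for a stand-alone server and approximately 99.5% with MSCS and related techniques) can be converted into hours per year. The calculation below is simple arithmetic and assumes a 365-day year of continuous operation.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours(availability_percent):
    """Expected hours of downtime per year at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


standalone = downtime_hours(97.0)  # about 262.8 hours per year
clustered = downtime_hours(99.5)   # about 43.8 hours per year
```

On those two figures, clustering cuts expected downtime by roughly a factor of six, from about 263 hours to about 44 hours per year.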

As MSCS evolves under the Windows 2000 umbrella, so will ClusterX. The need for ClusterX will not be eliminated by the introduction of this major upgrade of Windows NT – it will be enhanced. With the delivery of a multi-node clustering solution from Microsoft looming on the horizon, the requirements for a “best of breed” management and configuration tool will be paramount. Investments in ClusterX today will be protected long into the future, enhancing the ROI of this investment even further.

I recommend the use of ClusterX in any MSCS environment, regardless of size or complexity. It will go a long way in guaranteeing you the ROI increases and TCO reductions that clustering on Windows NT promised several years ago, along with making your application environment as bullet-proof and manageable as possible.

For further information on ClusterX and VERITAS see their web site at www.veritas.com or phone them at 1-800-327-2232 (in North America) or 1-407-531-7501.

Richard R. Lee
Data Storage Technologies, Inc.
[email protected]
(201)-251-6620
May-July 1999



VERITAS Software
Corporate Headquarters
1600 Plymouth Street
Mountain View, CA 94043

North American Sales Headquarters
400 International Parkway
Heathrow, FL 32746
800-327-2232 or 407-531-7501
407-531-7730 Fax

Global Locations

United Kingdom: 0800-614-961 or 44-(0)870-2431000, 44-(0)870-2431001 Fax

France: 33-1-41-91-96-37, 33-1-41-91-96-38 Fax

Germany: 49-(0)69-9509-6188, 49-(0)69-9509-6264 Fax

South Africa: 27-11-448-2080, 27-11-448-1980 Fax

Australia: 1-800-BACKUP, 61-(0)2-8904-9833 Fax

Hong Kong: 852-2507-2233, 852-2598-7788 Fax

Japan: 81-3-5532-8217, 81-3-5532-0887 Fax

Malaysia: 60-3-715-9297, 60-3-715-9291 Fax

Singapore: 65-488-7596, 65-488-7525 Fax

China: 86-10-62638358, 86-10-62638359 Fax

Electronic communication

E-Mail: [email protected]

World Wide Web: http://www.veritas.com

90-01095-910 • NT03-CLUXWPR-9900

© 2000 VERITAS Software Corp. All rights reserved. VERITAS is a registered trademark of VERITAS Software Corporation in the US and other countries. The VERITAS logo, Business Without Interruption and VERITAS ClusterX are trademarks of VERITAS Software Corporation in the US and other countries. Other product names mentioned herein may be trademarks and/or registered trademarks of their respective companies. Printed in USA. March 2000.