CMAN Questions

What is Cluster Manager (cman)?

It depends on which version of the code you are running. Basically, cluster manager is a component of the cluster project that handles communications between nodes in the cluster.

In the latest cluster code, cman is just a userland program that interfaces with the OpenAIS membership and messaging system.

In the previous versions, cman was a kernel module whose job was to keep a "heartbeat" message moving throughout the cluster, letting all the nodes know that the others are alive.

It also handles cluster membership messages, determining when a node enters or leaves the cluster.

What does Quorum mean and why is it necessary?

Quorum is a voting algorithm used by the cluster manager.

A cluster can only function correctly if there is general agreement between the members about the state of the cluster. We say a cluster has 'quorum' if a majority of nodes are alive, communicating, and agree on the active cluster members. So in a thirteen-node cluster, quorum is only reached if seven or more nodes are communicating. If enough nodes fail that only six are left communicating, the cluster loses quorum and can no longer function.

It's necessary for a cluster to maintain quorum to prevent 'split-brain' problems. If we didn't enforce quorum, a communication error on that same thirteen-node cluster might cause a situation where six nodes are operating on the shared disk, and another six are also operating on it, independently. Because of the communication error, the two partial clusters would overwrite areas of the disk and corrupt the file system. With quorum rules enforced, only one of the partial clusters can use the shared storage, thus protecting data integrity.

Quorum doesn't prevent split-brain situations, but it does decide who is dominant and allowed to function in the cluster. Should split-brain occur, quorum prevents more than one cluster group from doing anything.

How can I define a two-node cluster if a majority is needed to reach quorum?

We had to allow two-node clusters, so we made a special exception to the quorum rules. There is a special setting "two_node" in the /etc/cluster/cluster.conf file that looks like this:

<cman expected_votes="1" two_node="1"/>

This will allow one node to be considered enough to establish a quorum. Note that if you configure a quorum disk/partition, you don't want two_node="1".

What is a tie-breaker, and do I need one in two-node clusters?

Tie-breakers are additional heuristics that allow a cluster partition to decide whether or not it is quorate in the event of an even split, prior to fencing. A typical tie-breaker construct is an IP tie-breaker, sometimes called a ping node. With such a tie-breaker, nodes not only monitor each other, but also an upstream router that is on the same path as cluster communications. If the two nodes lose contact with each other, the one that wins is the one that can still ping the upstream router.

Of course, there are cases, such as a switch loop, where it is possible for two nodes to see the upstream router but not each other, causing what is called a split brain. This is why fencing is required in cases where tie-breakers are used.

Other types of tie-breakers include a shared partition, often called a quorum disk, which provides additional details. clumanager 1.2.x (Red Hat Cluster Suite 3) had a disk tie-breaker that allowed operation if the network went down, as long as both nodes were still communicating over the shared partition.

More complex tie-breaker schemes exist, such as QDisk (part of linux-cluster). QDisk allows arbitrary heuristics to be specified. These allow each node to determine its own fitness for participation in the cluster. It is often used as a simple IP tie-breaker, however. See the qdisk(5) manual page for more information.

CMAN has no internal tie-breakers for various reasons. However, tie-breakers can be implemented using the API. This API allows quorum device registration and updating. For an example, look at the QDisk source code.

You might need a tie-breaker if you:

* Have a two-node configuration with the fence devices on a different network path than the path used for cluster communication
* Have a two-node configuration where fencing is at the fabric level, especially for SCSI reservations

However, if you have a correct network & fencing configuration in your cluster, a tie-breaker only adds complexity, except in pathological cases.

If both nodes in a two-node cluster lose contact with each other, don't they try to fence each other?

They do. When each node recognizes that the other has stopped responding, it will try to fence the other. It can be like a gunfight at the O.K. Corral, and the node that's quickest on the draw (first to fence the other) wins. Unfortunately, both nodes can end up going down simultaneously, losing the whole cluster.


It's possible to avoid this by using a network power switch that serializes the two fencing operations. That ensures that one node is rebooted and the second never fences the first. For other configurations, see below.

What is the best two-node network & fencing configuration?

In a two-node cluster (where you are using two_node="1" in the cluster configuration, and without QDisk), there are several considerations you need to be aware of:

* If you are using per-node power management of any sort where the device is not shared between cluster nodes, it must be connected to the same network used by CMAN for cluster communication. Failure to do so can result in both nodes simultaneously fencing each other, leaving the entire cluster dead, or in a fence loop. Typically, this includes all integrated power management solutions (iLO, IPMI, RSA, ERA, IBM Blade Center, Egenera BladeFrame, Dell DRAC, etc.), but also includes remote power switches (APC, WTI) if the devices are not shared between the two nodes.

* It is best to use power-type fencing. SAN or SCSI-reservation fencing might work, as long as it meets the above requirements.

* If you cannot meet the above requirements, you should consider using a quorum disk or partition.

What if the fenced node comes back up and still can't contact the other? Will it corrupt my file system?

The two_node cluster.conf option allows one node to have quorum by itself. A network partition between the nodes won't result in a corrupt fs because each node will try to fence the other when it comes up prior to mounting gfs.

Strangely, if you have a persistent network problem and the fencing device is still accessible to both nodes, this can result in a "A reboots B, B reboots A" fencing loop.

This problem can be worked around by using a quorum disk or partition to break the tie, or using a specific network & fencing configuration.

I lost quorum on my six-node cluster, but my remaining three nodes can still write to my GFS volume. Did you just lie?

It's possible to still write to a GFS volume, even without quorum, but ONLY if the three nodes that left the cluster didn't have the GFS volume mounted. It's not a problem because if a partitioned cluster is ever formed that gains quorum, it will fence the nodes in the inquorate partition before doing anything.


If, on the other hand, nodes failed while they had gfs mounted and quorum was lost, then gfs activity on the remaining nodes will be mostly blocked. If it's not then it may be a bug.

Can I have a mixed cluster with some nodes at RHEL4U1 and some at RHEL4U2?

You can't mix RHEL4 U1 and U2 systems in a cluster because there were changes between U1 and U2 that changed the format of internal messages that are sent around the cluster.

Since U2, we now require these messages to be backward-compatible, so mixing U2 and U3 or U3 and U4 shouldn't be a problem.

How do I add a third node to my two-node cluster?

Unfortunately, two-node clusters are a special case. A two-node cluster needs two nodes to establish quorum, but only one node to maintain quorum. This special status is set by the "two_node" option in the cman section of cluster.conf, and that setting can only be reset by shutting down the cluster. Therefore, the only way to add a third node is to:

1. Shut down the cluster software on both nodes.
2. Add the third node into your /etc/cluster/cluster.conf file.
3. Get rid of the two_node="1" option in cluster.conf.
4. Copy the modified cluster.conf to your third node.
5. Restart all three nodes.

The system-config-cluster gui gets rid of the two_node option automatically when you add a third node. Also, note that this does not apply to two-node clusters with a quorum disk/partition. If you have a quorum disk/partition defined, you don't want to use the two_node option to begin with.
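For reference, once the third node has been added and two_node removed, the relevant parts of cluster.conf might look something like the sketch below (node names are placeholders; fencing and other sections are omitted):

<cman/>
<clusternodes>
  <clusternode name="node1" votes="1" .../>
  <clusternode name="node2" votes="1" .../>
  <clusternode name="node3" votes="1" .../>
</clusternodes>

With three one-vote nodes and no two_node setting, the default rules apply and quorum is reached at two votes.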

Adding subsequent nodes to a three-or-more node cluster is easy and the cluster does not need to be stopped to do it.

1. Add the node to your cluster.conf.
2. Increment the config file version number near the top and save the changes.
3. Run ccs_tool update /etc/cluster/cluster.conf to propagate the file to the cluster.
4. Use cman_tool status | grep "Config version" to get the internal version number.
5. Use cman_tool version -r <new config version>.
6. Start the cluster software on the additional node.
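Put together, and assuming the file is edited on an existing cluster member, the sequence looks roughly like this (the version number 3 is a placeholder; which init scripts you start on the new node depends on what you actually run):

# after editing /etc/cluster/cluster.conf and bumping the config version
ccs_tool update /etc/cluster/cluster.conf       # propagate the file to the cluster
cman_tool status | grep "Config version"        # check the version cman is currently using
cman_tool version -r 3                          # tell cman about the new config version
# then, on the new node:
service cman start                              # plus clvmd/gfs/rgmanager as appropriate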

I removed a node from cluster.conf but the cluster software and services kept running. What did I do wrong?

You're supposed to stop the node before removing it from the cluster.conf.

How can I rename my cluster?

Here's the procedure:

1. Unmount all GFS partitions and stop all clustering software on all nodes in the cluster.
2. Change name="old_cluster_name" to name="new_cluster_name" in /etc/cluster/cluster.conf.
3. If you have GFS partitions in your cluster, you need to change their superblock to use the new name. For example: gfs_tool sb /dev/vg_name/gfs1 table new_cluster_name:gfs1
4. Restart the clustering software on all nodes in the cluster.
5. Remount your GFS partitions.
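A condensed sketch of the procedure on one node, using the service names that appear elsewhere in this FAQ (adjust to whatever you actually run) and the example volume from step 3:

# on every node: unmount GFS and stop the cluster stack
for i in rgmanager gfs2 gfs clvmd; do service ${i} stop; done
fence_tool leave
cman_tool leave remove
# edit /etc/cluster/cluster.conf: name="new_cluster_name"
# once, while the volume is unmounted everywhere:
gfs_tool sb /dev/vg_name/gfs1 table new_cluster_name:gfs1
# then restart the clustering software on all nodes and remount the GFS partitions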

What's the proper way to shut down my cluster?

Halting a single node in the cluster will seem like a communication failure to the other nodes. Errors will be logged and the fencing code will get called, etc. So there's a procedure for properly shutting down a cluster. Here's what you should do:

Use the "cman_tool leave remove" command before shutting down each node. That will force the remaining nodes to adjust quorum to accomodate the missing node and not treat it as an error.

Follow these steps:

for i in rgmanager gfs2 gfs; do service ${i} stop; done
fence_tool leave
cman_tool leave remove

Why does the cman daemon keep shutting down and reconnecting?

Additional info: When I try to start cman, I see these messages in /var/log/messages:

Sep 11 11:44:26 server01 ccsd[24972]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5
Sep 11 11:44:26 server01 ccsd[24972]: Initial status:: Inquorate
Sep 11 11:44:57 server01 ccsd[24972]: Cluster is quorate. Allowing connections.
Sep 11 11:44:57 server01 ccsd[24972]: Cluster manager shutdown. Attemping to reconnect...

I see these messages in dmesg:

CMAN: forming a new cluster
CMAN: quorum regained, resuming activity
CMAN: sendmsg failed: -13
CMAN: No functional network interfaces, leaving cluster
CMAN: sendmsg failed: -13
CMAN: we are leaving the cluster.
CMAN: Waiting to join or form a Linux-cluster
CMAN: sendmsg failed: -13

This is almost always caused by a mismatch between the kernel and user space CMAN code. Update the CMAN user tools to fix the problem.

I've heard there are issues with using an even/odd number of nodes. Is it true?

No, it's not true. There is only one special case: two node clusters have special rules for determining quorum. See question 3 above.

What is a quorum disk/partition and what does it do for you?

A quorum disk or partition is a section of a disk that's set up for use with components of the cluster project. It has a couple of purposes. Again, I'll explain with an example.

Suppose you have nodes A and B, and node A fails to get several of cluster manager's "heartbeat" packets from node B. Node A doesn't know why it hasn't received the packets, but there are several possibilities: either node B has failed, the network switch or hub has failed, node A's network adapter has failed, or maybe node B was just too busy to send the packet. That can happen if your cluster is extremely large, your systems are extremely busy, or your network is flaky.

Node A doesn't know which is the case, and it doesn't know whether the problem lies within itself or with node B. This is especially problematic in a two-node cluster because both nodes, out of touch with one another, can try to fence the other.

So before fencing a node, it would be nice to have another way to check if the other node is really alive, even though we can't seem to contact it. A quorum disk gives you the ability to do just that. Before fencing a node that's out of touch, the cluster software can check whether the node is still alive based on whether it has written data to the quorum partition.

In the case of two-node systems, the quorum disk also acts as a tie-breaker. If a node has access to the quorum disk and the network, that counts as two votes.

A node that has lost contact with the network or the quorum disk has lost a vote, and therefore may safely be fenced.

Is a quorum disk/partition needed for a two-node cluster?

In older versions of the Cluster Project, a quorum disk was needed to break ties in a two-node cluster. Early versions of Red Hat Enterprise Linux 4 (RHEL4) did not have quorum disks, but the feature was added back as an option in RHEL4U4.


In RHCS 4 update 4 and beyond, see the man page for qdisk for more information. As of September 2006, you need to edit your configuration file by hand to add quorum disk support. The system-config-cluster gui does not currently support adding or editing quorum disk properties.

Whether or not a quorum disk is needed is up to you. It is possible to configure a two-node cluster in such a manner that no tie-breaker (or quorum disk) is required. Here are some reasons you might want/need a quorum disk:

* If you have a special requirement to go down from X -> 1 nodes in a single transition. For example, if you have a 3/1 network partition in a 4-node cluster, where the 1-node partition is the only node which still has network connectivity. (Generally, the surviving node is not going to be able to handle the load...)
* If you have a special situation causing a need for a tie-breaker in general.
* If you have a need to determine node fitness based on factors which are not handled by CMAN.

In any case, please be aware that use of a quorum disk requires additional configuration information and testing.

How do I set up a quorum disk/partition?

The best way to start is to do "man qdisk" and read the qdisk.5 man page. This has good information about the setup of quorum disks.

Note that if you configure a quorum disk/partition, you don't want two_node="1" or expected_votes="2" since the quorum disk solves the voting imbalance. You want two_node="0" and expected_votes="3" (or nodes + 1 if it's not a two-node cluster). However, since 0 is the default value for two_node, you don't need to specify it at all. If this is an existing two-node cluster and you're changing the two_node value from "1" to "0", you'll have to stop the entire cluster and restart it after the configuration is changed (normally, the cluster doesn't have to be stopped and restarted for configuration changes, but two_node is a special case.) Basically, you want something like this in your /etc/cluster/cluster.conf:

<cman two_node="0" expected_votes="3" .../> <clusternodes> <clusternode name="node1" votes="1" .../> <clusternode name="node2" votes="1" .../> </clusternodes> <quorumd device="/dev/mapper/lun01" votes="1"/>

Note: You don't have to use a disk or partition to prevent two-node fence-cycles; you can also use the two-node network & fencing configuration described above. You can set up a number of different heuristics for the qdisk daemon. For example, you can set up a redundant NIC with a crossover cable and use ping operations to the local router/switch to break the tie (this is typical, actually, and is called an IP tie-breaker). A heuristic can be made to check anything, as long as it is a shared resource.
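For illustration, a quorumd stanza with a simple ping heuristic might look something like this (the router address and the interval/tko/score values are made-up examples; see qdisk(5) for the exact attribute semantics):

<quorumd interval="1" tko="10" votes="1" label="myqdisk">
  <heuristic program="ping -c1 -w1 192.168.0.254" score="1" interval="2" tko="3"/>
</quorumd>

The idea is the one described above: a node that can no longer reach the upstream router loses the heuristic's score and, with it, its claim on the quorum disk vote.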

Do I really need a shared disk to use QDisk?

Currently, yes. There have been suggestions to make qdiskd operate in a 'diskless' mode in order to help prevent a fence-race (i.e. prevent a node from attempting to fence another node), but no work has been done in this area (yet).

Are the quorum disk votes reported in "Total_votes" from cman_tool nodes?

Yes. If the quorum disk is registered correctly with cman, you should see the votes it contributes, and also its "node name", in cman_tool nodes.

What's the minimum size of a quorum disk/partition?

The official answer is 10MB. The real number is something like 100KB, but we'd like to reserve 10MB for possible future expansion and features.

Is quorum disk/partition reserved for two-node clusters, and if not, how many nodes can it support?

Currently a quorum disk/partition may be used in clusters of up to 16 nodes.

In a 2 node cluster, what happens if both nodes lose the heartbeat but they can still see the quorum disk? Don't they still have quorum and cause split-brain?

First of all, no, they don't cause split-brain. As soon as heartbeat contact is lost, both nodes will realize something is wrong and lock GFS until it gets resolved and someone is fenced.

What actually happens depends on the configuration and the heuristics you build. The qdisk code allows you to build non-cluster heuristics to determine the fitness of each node beyond the heartbeat. With the heuristics in place, you can, for example, allow the node running a specific service to have priority over the other node. It's a way of saying "This node should win any tie" in case of a heartbeat failure. The winner fences the loser.

If both nodes still have a majority score according to their heuristics, then both nodes will try to fence each other, and the fastest node kills the other. Showdown at the Cluster Corral. The remaining node will have quorum along with the qdisk, and GFS will run normally under that node. When the "loser" reboots, unlike with a plain two_node cman setup, it will not become quorate with just the quorum disk/partition, so it cannot cause split-brain that way either.

At this point (4-Apr-2007), if there are no heuristics defined whatsoever, the QDisk master node wins (and fences the non-master node).

If my cluster is mission-critical, can I override quorum rules and have a "last-man-standing" cluster that's still functioning?

This may not be a good idea in most cases because of the dangers of split-brain, but there is a way you can do this: you can adjust the "votes" for the quorum disk to be equal to the number of nodes in the cluster, minus one.

For example, if you have a four-node cluster, you can set the quorum disk votes to 3, and expected_votes to 7. That way, even if three of the four nodes die, the remaining node may still function. That's because the quorum disk's 3 votes plus the remaining node's 1 vote makes a total of 4 votes out of 7, which is enough to establish quorum. Additionally, all of the nodes can be online without qdiskd running (which you might need to take down for maintenance or reconfiguration), since the four node votes alone are also enough for quorum.
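For that four-node example, the relevant cluster.conf fragments might look roughly like this (node names, the device path and the label are placeholders):

<cman expected_votes="7"/>
<clusternodes>
  <clusternode name="node1" votes="1" .../>
  <clusternode name="node2" votes="1" .../>
  <clusternode name="node3" votes="1" .../>
  <clusternode name="node4" votes="1" .../>
</clusternodes>
<quorumd device="/dev/mapper/lun01" votes="3" label="myqdisk"/>

With 7 expected votes, quorum is 4, so either all four nodes (4 votes) or a single node plus the quorum disk (1 + 3 votes) is enough to keep the cluster running.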

My cluster won't come up. It says: kernel: CMAN: Cluster membership rejected. What do I do?

One or more of the nodes in your cluster is rejecting the membership of this node. Check the syslog (/var/log/messages) on all remaining nodes in the cluster for messages regarding why the membership was rejected.

This message will appear when another node is rejecting the node in question and it WILL tell syslog (/var/log/messages) why unless you have kernel logging switched off for some reason. There are several reasons your node may be rejected:

* Mismatched cluster.conf version numbers.
* Mismatched cluster names.
* Mismatched cluster number (a hash of the name).
* Node has the wrong node ID (i.e. it joined with the same name and a different node ID, or vice versa).
* CMAN protocol version differs (or other software mismatch; there are several error messages for these, but they boil down to the same thing).

Something else you might like to try is changing the port number that this cluster is using, or changing the cluster name to something totally different.

If you find that things work after doing this then you can be sure there is another cluster with that name or number on the network. If not, then you need to double/triple check that the config files really do all match on all nodes.

I've seen this message happen when I've accidentally done something like this:

1. Created a cluster.conf file with 5 nodes: A, B, C, D and E.
2. Tried to bring up the cluster.
3. Realized that node E has the wrong software, has no access to the SAN, has a hardware problem or whatever.
4. Removed node E from cluster.conf because it really doesn't belong in the cluster after all.
5. Updated all five machines with the new cluster.conf.
6. Rebooted nodes A, B, C and D to restart the cluster.

Guess what? None of the nodes come up in a cluster. Can you guess why?


It's because node E still thinks it's part of the cluster and still has a claim on the cluster name. You still need to shut down the cluster software on E, or else reboot it before the correct nodes can form a cluster.

Is it a problem if node order isn't the same for all nodes in cman_tool services?

No, this isn't a problem and can be ignored. Some nodes may report [1 2 3 4 5] while others report a different order, like [4 3 5 2 1]. This merely has to do with the order in which cman join messages are received.

Why does cman_tool leave say "cman_tool: Can't leave cluster while there are X active subsystems"?

This message indicates that you tried to leave the cluster from a node that still has active cluster resources, such as mounted GFS file systems.

A node cannot leave the cluster if there are subsystems (e.g. DLM, GFS, rgmanager) active. You should unmount all GFS filesystems, stop the rgmanager service, stop the clvmd service, stop fenced and anything else using the cluster manager before using cman_tool leave. You can use cman_tool status and cman_tool services to see how many (and which) services are running.
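In practice, the sequence on the node that wants to leave looks something like this (the mount point is an example; the service names are the ones used elsewhere in this FAQ):

cman_tool services                     # see which subsystems are still active
umount /mnt/mygfs                      # unmount every GFS file system
service rgmanager stop
service clvmd stop
fence_tool leave                       # leave the fence domain
cman_tool services                     # should now show nothing that would block the leave
cman_tool leave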

What are these services/subsystems and how do I make sense of what cman_tool services prints?

Although this may be an over-simplification, you can think of the services as a big membership roster for different special interest groups or clubs. Each "service-name" pair corresponds to access to a unique resource, and each node corresponds to a voting member in the club.

So let's weave an inane piece of fiction around this concept: let's pretend that a journalist named Sam wants to write an article for her newspaper, "The National Conspiracy Theorist." To write her article, she needs access to secret knowledge kept hidden for centuries by a secret society known only as "The Group." The only way she can become a member is to petition the existing members to join, and the decision must be unanimously in her favor. But The Group is so secretive, they don't even know each other's names; every member is assigned a unique ID number. Their only means of communication is through a chat room, and they won't even speak to you unless you're a member or unless you know how to become a member.

So she logs into the chat room and joins the channel #default. In the chat room, she can see there are seven members of The Group. They're not listed in order, but they're all there.

[root@roth-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[7 6 1 2 3 4 5]


She finds a blog (called "cluster.conf") and reads from it that her own ID number is 8. So she sends them a message: "Node 8 wants to join the default group".

Secretly, the other members take attendance to make sure all the members are present and accounted for. Then they take a vote. If all of them vote yes, she's allowed into the group and she becomes the next member. Her ID number is added to the list of members.

[root@roth-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[7 6 1 2 3 4 5 8]

Now that she's a member of the Group, she is told that the secrets of the order are not given to ordinary newbies; they're kept in a locked space. They are stored in an office building owned by the order, that they oddly call "clvmd." Since she's a newbie, she has to petition the other members to get a key to the clvmd office building. After a similar vote, they agree to give her a key, and they keep track of everyone who has a key.

[root@roth-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[7 6 1 2 3 4 5 8]
DLM Lock Space:  "clvmd"                             7   3 run       -
[7 6 1 2 3 4 5 8]

Eager to write her article, she drives to the clvmd office building, unlocks the door, and goes inside. She's heard rumors that the secrets are kept in suite labeled "secret". She goes from room to room until she finds a door marked "secret." Then she discovers that the door is locked and her key doesn't fit. Again, she has to petition the others for a key. They tell her that there are actually two adjacent rooms inside the suite, the "DLM" room and the "GFS" room, each holding a different set of secrets.

Four of the members (3, 4, 6 and 7) never really cared what was in those rooms, so they never bothered to learn the grueling rituals, and consequently, they were never issued keys to the two secret rooms. So after months of training, Sam once again petitions the other members to join the "secret rooms" group. She writes "Node 8 wants to join the 'secret' DLM group" and sends it to the members who have a key: #1, #2 and #5. She sends them a similar message for the other room as well: "Node 8 wants to join the 'secret' GFS group". Having performed all the necessary rituals, they agree, and she's issued a duplicate key for both secret rooms.

[root@roth-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[7 6 1 2 3 4 5 8]
DLM Lock Space:  "clvmd"                             7   3 run       -
[7 6 1 2 3 4 5 8]
DLM Lock Space:  "secret"                           12   8 run       -
[1 2 5 8]
GFS Mount Group: "secret"                           13   9 run       -
[1 2 5 8]

Then something shocking rocks the secret society: member 2 went into cardiac arrest and died on the operating table. Clearly, something must be done to recover the keys held by member 2. In order to secure the contents of both rooms, no one is allowed to touch the information in the secret rooms until they've verified member 2 was really dead and recovered his keys. The members decide to leave that task to the most senior member, member 7.

That night, when no one is watching, Member 7 breaks into the morgue, verifies #2 is really dead, and steals back the key from his pocket. Then #7 drives to the office building, returns all the secrets he had borrowed from the secret room. (They call it "recovery".) He also informs the other members that #2 is truly dead and #2 is taken off the group membership lists. Relieved that their secrets are safe, the others are now allowed access to the secret rooms.

[root@roth-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[7 6 1 3 4 5 8]
DLM Lock Space:  "clvmd"                             7   3 run       -
[7 6 1 3 4 5 8]
DLM Lock Space:  "secret"                           12   8 run       -
[1 5 8]
GFS Mount Group: "secret"                           13   9 run       -
[1 5 8]

You get the picture... Each of these "services" keeps a list of members who are allowed access, and that's how the cluster software on each node knows which others to contact for locking purposes. Each GFS file system has two groups that are joined when the file system is mounted; one for GFS and one for DLM.

The "state" of each service corresponds to its status in the group: "run" means it's a normal member. There are also states corresponding to joining the group, leaving the group, recovering its locks, etc.

What can cause a node to leave the cluster?

A node may leave the cluster for many reasons. Among them:

* Shutdown: cman_tool leave was run on this node.
* Killed by another node: the node was killed either by cman_tool kill or by qdisk.
* Panic: cman failed to allocate memory for a critical data structure, or some other very bad internal failure occurred.
* Removed: like Shutdown, but the remainder of the cluster can adjust quorum downwards to keep working.
* Membership rejected: the node attempted to join a cluster but its cluster.conf file did not match that of the other nodes. To find the real reason for this you need to examine the syslog of all the valid cluster members to find out why it was rejected.
* Inconsistent cluster view: this is usually indicative of a bug, but it can also happen if the network is extremely unreliable.
* Missed too many heartbeats: this means what it says. All nodes are expected to broadcast a heartbeat every 5 seconds (by default). If none is received within 21 seconds (by default) then the node is removed for this reason. The heartbeat values may be changed from their defaults.
* No response to messages: this usually happens during a state transition to add or remove another node from a group. The reporting node sent a message five times (by default) to the named node and did not get a response.

How do I change the time interval for the heartbeat messages?

Just add hello_timer="value" to the cman section in your cluster.conf file. For example:

<cman hello_timer="5">

The default value is 5 seconds.

How do I change the time after which a non-responsive node is considered dead?

For RHEL4 and STABLE branches: Just add deadnode_timeout="value" to the cman section in your cluster.conf file. For example:

<cman deadnode_timeout="21"/>

The default value is 21 seconds.

For RHEL5 and STABLE2 branches: Just add token="value" to the totem section in your cluster.conf file. Note that the totem token timeout value is specified in milliseconds, not seconds. The equivalent for the above example is:

<totem token="21000"/>

The default value is 10000 milliseconds (or 10 seconds). It is important to change this value if you are using QDisk on RHEL5/STABLE2; 21000 should work if you left QDiskd's interval/tko at their default values.
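To make the relationship concrete: assuming qdiskd is left at its default interval="1" and tko="10" (roughly 10 seconds before the quorum disk declares a node dead), the idea is to give the totem token comfortably more time than that window, which is where the 21000 figure above comes from:

<quorumd interval="1" tko="10" .../>
<totem token="21000"/>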

What does "split-brain" mean?

"Split brain" is a condition whereby two or more computers or groups of computers lose contact with one another but still act as if the cluster were intact. This is like having two governments trying to rule the same country. If multiple computers are allowed to write to the same file system without knowledge of what the other nodes are doing, it will quickly lead to data corruption and other serious problems.

Split-brain is prevented by enforcing quorum rules (which say that no group of nodes may operate unless they are in contact with a majority of all nodes) and fencing (which makes sure nodes outside of the quorum are prevented from interfering with the cluster).

What's the "right" way to get cman to use a different NIC, say, eth2 rather than eth0? ¶

There are several reasons for doing this. You may want to do this in cases where you want the cman heartbeat messages to be on a dedicated network so that a heavily used network doesn't cause heartbeat messages to be missed (and nodes in your cluster to be fenced). Second, you may have security reasons for wanting to keep these messages off of an Internet-facing network.

First, you want to configure your alternate NIC to have its own IP address, and the settings that go with that (subnet, etc).

Next, add an entry into /etc/hosts (on all nodes) for the ip address associated with the NIC you want to use. In this case, eth2. One way to do this is to append a suffix to the original host name. For example, if your node is "node-01" you could give it the name "node-01-p" (-p for private network). For example, your /etc/hosts file might look like this:

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
10.0.0.1        node-01
192.168.0.1     node-01-p

Once you've done this, you need to make sure that your cluster.conf uses the name with the -p suffix rather than the old name. Note that "-p" is just a suggestion; you could use "-internal" or anything else, really.
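In other words, the clusternode entries should use the private-network names; a minimal sketch with the example host above plus a hypothetical second node:

<clusternodes>
  <clusternode name="node-01-p" votes="1" .../>
  <clusternode name="node-02-p" votes="1" .../>
</clusternodes>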

If you're using RHEL4.4 or above, or 5.1 or above, that's all you need to do. There is code in cman to look at all the active network interfaces on the node and find the one that corresponds to the entry in cluster.conf. Note that this only works on ipv4 interfaces.

Does cluster suite use multicast or broadcast?


By default, the older cluster infrastructure (RHEL4, STABLE and so on) uses broadcast. By default, the newer cluster infrastructure with openais (RHEL5, HEAD and so on) uses multicast. You can configure a RHEL4 cluster to use multicast rather than broadcast. However, you can't switch openais to use broadcast.

Is it possible to configure a cluster with nodes running on different networks (subnets)?

Yes, it is. If you configure the cluster to use multicast rather than broadcast (there is an option for this in system-config-cluster) then the nodes can be on different subnets.

Be careful that any switches and/or routers between the nodes are of good specification and are set to pass multicast traffic through.

How can I configure my RHEL4 cluster to use multicast rather than broadcast?

Put something like this in your cluster.conf file:

<clusternode name="node1"><multicast addr="224.0.0.1" interface="eth0"/></clusternode>

On RHEL5, why do I get "cman not started: Can't bind to local cman socket /usr/sbin/cman_tool"?

There is currently a known problem with RHEL5 whereby system-config-cluster is trying to improperly access /usr/sbin/cman_tool (cman_tool currently resides in /sbin). We'll correct the problem, but in the meanwhile, you can circumvent the problem by creating a symlink from /sbin/cman_tool to /usr/sbin/. For example:

[root@node-01 ~]# ln -s /sbin/cman_tool /usr/sbin/cman_tool

If this is not your problem, read on:

Ordinarily, this message would mean that cman could not create the local socket in /var/run for communication with the cluster clients.

cman tries to create /var/run/cman_client and /var/run/cman_admin. Things like cman_tool, groupd and ccsd talk to cman over this link. If these sockets can't be created, then you'll get this error.

Check /var/run is writable and able to hold Unix domain sockets.
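A couple of quick checks, using the paths mentioned above:

ls -ld /var/run                                   # must exist and be writable by root
ls -l /var/run/cman_client /var/run/cman_admin    # the sockets cman tries to create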

On Fedora 8, CMAN won't start, complaining about "aisexec not started". How do I fix it?


On Fedora 8 and other distributions where the core supports multiple architectures (ex: x86, x86_64), you must have a matched set of packages installed. A cman package for x86_64 will not work with an x86 (i386/i686) openais package, and vice-versa. To see if you have a mixed set, run:

WRONG:

[root@ayanami ~]# file `which cman_tool`; file `which aisexec`
/usr/sbin/cman_tool: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, stripped
/usr/sbin/aisexec: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, stripped

RIGHT:

[root@ayanami ~]# file `which cman_tool`; file `which aisexec`
/usr/sbin/cman_tool: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, stripped
/usr/sbin/aisexec: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, stripped

You need to use the same architecture as your kernel for running the userland parts of the cluster packages; on x86_64, this generally means you should only have the x86_64 versions of the cluster packages installed.

rpm -e cman.i386 openais.i386 rgmanager.i386 ...
yum install -y cman.x86_64 openais.x86_64 rgmanager.x86_64 ...

Note: If you were having trouble getting things up, there's a chance that an old aisexec process might be running on one of the nodes; make sure you kill it before trying to start again!

My RHEL5 or similar cluster won't work with my Cisco switch.

Some nodes can not see each other; ping works! Why?

Two nodes in different blade frames can not see each other. Why?

These problems are caused by multicast routing problems. Assuming you have already checked your firewall configuration, read on.

Solution #1: Fix your switch

Some Cisco (and other software-based) switches do not support IP multicast in their default configuration.


Since openais uses multicast for cluster communications, you may have to enable it in the switch in order to use the cluster software.

Before making any changes to your Cisco switches, it is advisable to contact your Cisco TAC to ensure the changes will have no negative consequences in your network. Please visit this page for more information: OpenAIS - Cisco Switches

Solution #2: Work around the switch

In some environments, it's possible to simply change the multicast address that OpenAIS uses from within cluster.conf rather than reconfiguring your switch. To do this, add a multicast tag to your cluster.conf:

<cman ...>
  <multicast addr="225.0.0.13"/>
</cman>

The address 225.0.0.x is known to work in some environments when the standard openais multicast address does not. Note, however, that this address lies within a reserved range of multicast addresses and may not be suitable for use in the future:

225.000.000.000-231.255.255.255 Reserved [IANA]

Source: IANA - Multicast Addresses

My RHEL5 or similar cluster won't work with my HP switch.

Some HP servers and switches do not play well together when using Linux. More information, and a workaround, is available here.

I created a large RHEL5 cluster but it falls apart when I boot it.

The default parameters for a RHEL5 cluster are usually enough to get a small to medium size cluster running, say up to around 16 nodes.

Beyond that limit some tuning needs to be done. Here are some parameters I have used to get larger clusters running. Note that this increases the time taken for dead nodes to be detected quite considerably.

The following cluster.conf extract allowed me to get 45 nodes running:

<totem token="50000" consensus="45000" join="6000" send_join="880" token_retransmits_before_loss_const="10"/>

To get beyond that some seriously large numbers are needed. Here's what I did to get 60 nodes working:

<totem token="60000" consensus="45000" join="15000" send_join="1000" token_retransmits_before_loss_const="100">


These numbers are not definitive and might not work perfectly at your site. Other variables such as network and host load come into play. But they should, I hope, be a good starting point for people wanting to run larger RHEL5 clusters.

[MAIN ] Killing node mynode01 because it has rejoined the cluster with existing state

What this message means is that a node was a valid member of the cluster once; it then left the cluster (without being fenced) and rejoined automatically. This can sometimes happen if the ethernet is disconnected for a time, usually a few seconds.

If a node leaves the cluster, it MUST rejoin using the cman_tool join command with no services running. The usual way to make this happen is to reboot the node and let the init script do its job, and if fencing is configured correctly that is what normally happens. It could be that fencing is too slow to manage this, or that the cluster is made up of two nodes without a quorum disk, so that the 'other' node doesn't have quorum and cannot initiate fencing.

What must not happen is that the node is ejected from the cluster and the system manager simply reruns the init script from the command line. This will almost certainly not clear out running services.

Another (more common) cause of this, is slow responding of some Cisco switches as documented above.

What is the "Dirty" flag that cman_tool shows, and should I be worried? ¶

The short answer is "No, you should not be worried".

All this flag indicates is that there are cman daemons running that have state which cannot be rebuilt without a node reboot. This can be as simple (in concept!) as a DLM lockspace or a fence domain. When a cluster has state, the dirty flag is set (it cannot be reset), and this prevents two stateful clusters merging, as the two states cannot be reconciled. In some cases this can cause the message shown above. Many daemons can set this flag: for example, fence_tool join will set it, via fenced, as will clvmd (because it instantiates a lock space). Think of it as a "we have some daemons running" flag, if you like!

The main reason for the flag is to prevent state corruption where the cluster is evenly split (so that fencing cannot occur) and tries to merge back again. Neither side of the cluster knows if the other side's state has changed and there is no mechanism for performing a state merge. So one side gets marked disallowed or is fenced, depending on quorum. Fencing can only be done by a quorate partition.

This flag has been renamed to "HaveState" in STABLE3 so as to panic people less. In general most users can ignore this flag.

Chrissie's plea to people submitting logs for bug reports

Please, please *always* attach full logs. I'd much rather have 2GB of log files to wade through than 1K of truncated logs that don't show what I'm looking for.

I'm very good at filtering log files, it's my job and I've been doing it for a very long time now! And it's quite possible that I might spot something important that looks insignificant to you.

Cluster name limitations

* 15 non-NUL (ASCII 0) characters
* You can use the 'alias' attribute to make a more descriptive name.
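For example (hypothetical names; the alias attribute is the one mentioned above):

<cluster name="prodweb01" alias="Production Web Cluster">
  ...
</cluster>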

https://fedorahosted.org/cluster/wiki/FAQ/CMAN