


The RSCT component provides a set of services designed to address issues related to the high availability of a system. It includes two subsystems, as shown in Figure A-1 on page 254: Topology Services and Group Services.

Both of these distributed subsystems operate within a domain. A domain is a set of machines upon which the RSCT components execute and, exclusive of other machines, provide their services.

The CSM product uses only a relatively small subset of the RSCT components. This subset is primarily composed of a set of commands. Therefore, some parts of this chapter relate only to the GPFS product and not to CSM.

Note 

Because both the SRC and RSCT components are required (and shipped) by both CSM and GPFS, it is important to use the same version of SRC and RSCT during the installation. Refer to Chapter 5, "Cluster installation and configuration with CSM" on page 99 and Chapter 7, "GPFS installation and configuration" on page 185 for details.

Topology Services subsystem

Topology Services provides high-availability subsystems with network adapter status, node connectivity information, and a reliable messaging service. The adapter status and node connectivity information is provided to the Group Services subsystem upon request. Group Services makes the information available to its client subsystems. The reliable messaging service, which takes advantage of node connectivity information to reliably deliver a message to a destination node, is available to the other high-availability subsystems.

The adapter status and node connectivity information is discovered by an instance of the subsystem on one node, participating in concert with instances of the subsystem on other nodes, to form a ring of cooperating subsystem instances. This ring is known as a heartbeat ring, because each node sends a heartbeat message to one of its neighbors and expects to receive a heartbeat from its other neighbor.

Actually, each subsystem instance can form multiple rings, one for each network it is monitoring. This system of heartbeat messages enables each member to monitor one of its neighbors and to report to the heartbeat ring leader, called the Group Leader, if it stops responding. The Group Leader, in turn, forms a new heartbeat ring based on such reports and requests for new adapters to join the membership. Every time a new group is formed, it lists which adapters are present and which adapters are absent, making up the adapter status notification that is sent to Group Services.

The Topology Services subsystem consists of the following elements:

Topology Services daemon (hatsd)

The daemon is the central component of the Topology Services subsystem and is located in the /usr/sbin/rsct/bin/ directory. This daemon runs on each node in the GPFS cluster.

When each daemon starts, it first reads its configuration from a file set up by the startup command (cthats). This file is called the machines list file, because it lists all the nodes that are part of the configuration and the IP addresses of each adapter on each of those nodes.

From this file, the daemon knows the IP address and node number of all the potential heartbeat ring members.

The Topology Services subsystem's directive is to form as large a heartbeat ring as possible. To form this ring, the daemon on one node must alert the daemons on the other nodes to its presence by sending a proclaim message. According to a hierarchy defined by the Topology Services component, a daemon can send a proclaim message only to IP addresses lower than its own and can accept a proclaim message only from an IP address higher than its own. Also, a daemon proclaims only if it is the leader of a ring.

When a daemon first starts up, it builds a heartbeat ring for every local adapter, containing only that local adapter. This is called a singleton group, and the daemon is the Group Leader of each of these singleton groups.

To manage the changes in these groups, Topology Services defines the following roles for each group:

Each of these roles is dynamic, which means that every time a new heartbeat ring is formed, the role of each member is re-evaluated and assigned.

In summary, Group Leaders send and receive proclaim messages. If the proclaim is from a Group Leader with a higher IP address, then the Group Leader with the lower address replies with a join request. The higher address Group Leader forms a new group with all members from both groups. All members monitor their upstream neighbor for heartbeats. If a sufficient number of heartbeats are missed, a message is sent to the Group Leader and the unresponsive adapter will be dropped from the group. Whenever there is a membership change, Group Services is notified if it asked to be.

The Group Leader also accumulates node connectivity information, constructs a connectivity graph, and routes connections from its node to every other node in the GPFS cluster. The group connectivity information is sent to all nodes so that they can update their graphs and also compute routes from their node to any other node. It is this traversal of the graph on each node that determines which node membership notification is provided to each node. Nodes to which there is no route are considered unreachable and are marked as down. Whenever the graph changes, routes are recalculated, and a list of nodes that have connectivity is generated and made available to Group Services.

When a network adapter fails or has a problem on one node, the daemon will, for a short time, attempt to form a singleton group, since the adapter is unable to communicate with any other adapter on the network. Topology Services then invokes its self-death logic, which attempts to determine whether the adapter is still working. This invokes network diagnostics to determine whether the adapter is able to receive data packets from the network. The daemon tries to have data packets sent to the adapter; if it cannot receive any network traffic, the adapter is considered to be down, and Group Services is notified that all adapters in the group are down.

After an adapter that was down recovers, the daemon will eventually find that the adapter is working again by using a mechanism similar to the self-death logic, and will form a singleton group. This should allow the adapter to form a larger group with the other adapters in the network. An adapter up notification for the local adapter is sent to the Group Services subsystem.

Pluggable NIMs

Topology Services' pluggable NIMs are processes started by the Topology Services daemon to monitor each local adapter. The NIM is responsible for:

Port numbers and sockets

The Topology Services subsystem uses several types of communications:

Topology Services commands

The Topology Services subsystem contains several commands, such as:

For details on the above commands, refer to the IBM General Parallel File System (GPFS) for Linux: RSCT Guide and Reference, SA22-7854.

Configuring and operating Topology Services

The following sections describe how the components of the Topology Services subsystem work together to provide topology services. Included are discussions of Topology Services tasks.

Attention: 

Under normal operating conditions, Topology Services is controlled by GPFS. Manual intervention in Topology Services may cause GPFS to fail. Be very cautious when manually configuring or operating Topology Services.

The Topology Services subsystem is contained in the rsct.basic RPM. After the RPM is installed, you may change the default configuration options using the cthatsctrl command.
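Before changing any options, you can confirm that the subsystem is installed. A minimal sketch, assuming only the rsct.basic package name mentioned above (the rpm and grep invocations are standard commands, not taken from this appendix):

rpm -q rsct.basic
rpm -ql rsct.basic | grep cthatsctrl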

Initializing the Topology Services daemon

Normally, the Topology Services daemon is started by GPFS. If necessary, you can start the Topology Services daemon using the cthatsctrl command or the startsrc command directly. The first part of initialization is done by the startup command, cthats. It starts the hatsd daemon, which completes the initialization steps.

During this initialization, the startup command does the following:

  1. Determines the number of the local node.

  2. Obtains the name of the cluster from the GPFS cluster data.

  3. Retrieves the machines.lst file from the GPFS cluster data.

  4. Performs file maintenance in the log directory and current working directory to remove the oldest log and rename any core files that might have been generated.

  5. Starts the Topology Services hatsd daemon.

The daemon then continues the initialization with the following steps:

  1. Reads the current machines list file and initializes internal data structures.

  2. Initializes daemon-to-daemon communication, as well as client communication.

  3. Starts the NIMs.

  4. For each local adapter defined, forms a membership consisting of only the local adapter.

At this stage, the daemon is now in its initialized state and ready to communicate with Topology Services daemons on other nodes. The intent is to expand each singleton membership group formed during initialization to contain as many members as possible. Eventually, as long as all adapters in a particular network can communicate with each other, there will be a single group to which all adapters belong.
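If you do need to start or inspect the subsystem manually, a minimal sketch using the SRC commands mentioned above is shown here; the cthats subsystem name is the one that appears in the lssrc examples later in this appendix:

startsrc -s cthats
lssrc -s cthats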

Merging all adapters into a single group

Initially, the subsystem starts out as N singleton groups, one for each node. Each of those daemons is the Group Leader of its own singleton group and knows, from the configuration information, which other adapters could join the group. The next step is to begin proclaiming to subordinate nodes.

The proclaim logic tries to find members as efficiently as possible. For the first three proclaim cycles, daemons proclaim only to their own subnet, and if the subnet is broadcast-capable, that message is broadcast. Because all daemons start out as singletons, the result is that the singleton groups evolve into M groups, where M is the number of subnets that span this heartbeat ring. On the fourth proclaim cycle, those M Group Leaders send proclaims to adapters that are outside their local subnet.

This causes a merging of groups into larger and larger groups until they have coalesced into a single group.

From the time the groups are formed as singletons until they reach a stabilization point, the groups are considered unstable. The stabilization point is reached when a heartbeat ring has no group changes for an interval of 10 times the heartbeat send interval. Up to that point, the proclaim logic continues on a four-cycle operation, where three cycles proclaim only to the local subnets and one cycle proclaims to adapters outside the local subnet. After the heartbeat ring has reached stability, proclaim messages go out to all adapters not currently in the group, regardless of the subnet to which they belong. Adapter groups that are unstable are not used when computing the node connectivity graph.

Topology Services daemon operations

Normal operation of the Topology Services subsystem does not require administrative intervention.

The subsystem is designed to recover from temporary failures, such as node failures or failures of individual Topology Services daemons. Topology Services also provides indications of higher-level system failures.

However, there are some operational characteristics of interest to system administrators, and after adding or removing nodes or adapters you might need to refresh the subsystem. The maximum node number allowed is 2047 and the maximum number of networks it can monitor is 16.

Topology Services is meant to be sensitive to network response, and this sensitivity is tunable. However, other conditions can degrade the ability of Topology Services to accurately report on adapter or node membership. One such condition is failure to schedule the daemon process in a timely manner, which can cause daemons to be late in sending their heartbeats by a significant amount. This can happen because an interrupt rate is too high, the rate of paging activity is too high, or there are other problems. If the daemon is prevented from running for long enough, the node might not be able to send out heartbeat messages and will be considered down by its peer daemons. When GPFS is notified of the node down indication, it will perform recovery procedures and take over the resources and roles of the node, which in this case is undesirable. Whenever these conditions exist, analyze the problem carefully to fully understand it.

Because Topology Services must get processor time slices in a timely fashion, do not intentionally starve it of CPU time; doing so can cause false failure indications.
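If you suspect that scheduling delays are causing false down indications, one quick and purely illustrative check is to look at the priority and CPU usage of the hatsd daemon named earlier in this appendix; the ps options used here are standard:

ps -C hatsd -o pid,ni,pri,pcpu,etime,cmd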

Attention: 

The network options that enable IP source routing are, by default, set to values chosen for security reasons. Changing them, as Topology Services requires, may make the node vulnerable to network attacks. System administrators are advised to use other methods to protect the cluster from such attacks.

Topology Services requires the IP source routing feature to deliver its data packets when the networks are broken into several network partitions. The network options must be set correctly to enable IP source routing, and the Topology Services startup command will set them.
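The specific options set by the startup command are not listed here. As an illustration only, on a Linux node the acceptance of source-routed packets is typically governed by standard kernel parameters such as the following (these names come from the Linux kernel documentation, not from this appendix), which can be inspected with sysctl:

sysctl net.ipv4.conf.all.accept_source_route
sysctl net.ipv4.conf.default.accept_source_route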

Tuning the Topology Services subsystem

The default settings for the frequency and sensitivity attributes discussed in "Configuring and operating Topology Services" on page 261 are overly aggressive for GPFS clusters that have more than 128 nodes or run under heavy load conditions. Using the default settings in such environments can result in false failure indications. Decide which settings are suitable for your system by considering the following:

By default, Topology Services uses the settings shown in Table A-4.

Table A-4: Topology Services defaults

Frequency    Sensitivity    Seconds to detect node failure
1            4              8

You can adjust the tunable attributes by using the cthatstune command. For example, to change the frequency attribute to the value 2 on network gpfs and then refresh the Topology Services subsystem, use the command:

cthatstune -f gpfs:2 -r
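Note that the defaults in Table A-4 are consistent with a detection time of roughly frequency x sensitivity x 2 seconds (1 x 4 x 2 = 8). Under that assumption, which is inferred from the table rather than stated explicitly in this appendix, raising the frequency to 2 seconds as in the example above would roughly double the node failure detection time on the gpfs network to about 16 seconds.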

Refreshing the Topology Services daemon

When your system configuration is changed (such as by adding or removing nodes or adapters), the Topology Services subsystem needs to be refreshed before it can recognize the new configuration.

To refresh the Topology Services subsystem, run either the cthatsctrl command or the cthatstune command, both with the -r option, on any node in the GPFS cluster.
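For example, after adding a node to the cluster, either of the following commands, run on any node that is up, refreshes the subsystem (the -r option is the refresh option described above):

cthatstune -r
cthatsctrl -r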

Note that if there are nodes in the GPFS cluster that are unreachable with Topology Services active, they will not be refreshed. Also, if the connectivity problem is resolved such that Topology Services on that node is not restarted, the node refreshes itself to remove the old configuration. Otherwise, it will not acknowledge nodes or adapters that are part of the configuration.

Topology Services procedures

Normally, the Topology Services subsystem runs itself without requiring administrator intervention. On occasion, you might need to check the status of the subsystem. You can display the operational status of the Topology Services daemon by issuing the lssrc command.

Topology Services monitors the Ethernet, the Myrinet switch, and the Token-Ring networks. To see the status of the networks, you need to run the command on a node that is up:

lssrc -ls cthats

In response, the lssrc command writes the status information to the standard output. The information includes:

Example A-2 shows the output from the lssrc -ls cthats command on a node.

Example A-2: lssrc -ls cthats output

[root@node001 root]# lssrc -ls cthats
Subsystem         Group            PID     Status
 cthats           cthats           1632    active
Network Name   Indx Defd Mbrs St Adapter ID      Group ID
gpfs           [ 0]    5    5  S 10.2.1.1        10.2.1.141
gpfs           [ 0] myri0           0x85b72fa4    0x85b82c56
HB Interval = 1 secs. Sensitivity = 15 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent    : 194904 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 211334 ICMP 0 Dropped: 0
NIM's PID: 1731
gpfs2          [ 1]    5    5  S 10.0.3.1        10.0.3.141
gpfs2          [ 1] eth0            0x85b72fa5    0x85b82c5a
HB Interval = 1 secs. Sensitivity = 15 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent    : 194903 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 211337 ICMP 0 Dropped: 0
NIM's PID: 1734
1 locally connected Client with PID:
hagsd(1749)
Configuration Instance = 1035415331
Default: HB Interval = 1 secs. Sensitivity = 8 missed beats
Daemon employs no security
Segments pinned: Text Data Stack.
Text segment size: 593 KB. Static data segment size: 628 KB.
Dynamic data segment size: 937. Number of outstanding malloc: 383
User time 7 sec. System time 2 sec.
Number of page faults: 844. Process swapped out 0 times.
Number of nodes up: 5. Number of nodes down: 0.
[root@node001 root]#
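Reading this output: the cthats subsystem is active with PID 1632 and monitors two networks, gpfs (on the Myrinet adapter myri0) and gpfs2 (on eth0). Each network has five adapters defined and five current members, the heartbeat interval is 1 second with a sensitivity of 15 missed beats, the only locally connected client is the Group Services daemon hagsd, and all five nodes are reported as up.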

Group Services (GS) subsystem

The function of the Group Services subsystem is to provide other subsystems with a distributed coordination and synchronization service. Other subsystems that utilize or depend upon Group Services are called client subsystems. Each client subsystem forms one or more groups by having its processes connect to the Group Services subsystem and use the various Group Services interfaces. A process of a client subsystem is called a GS client.

A group consists of two pieces of information: the list of processes that have joined the group (the membership list) and a group state value.

Group Services guarantees that all processes that are members of a group see the same values for the group information, and that they see all changes to the group information in the correct order. In addition, the processes may initiate changes to the group information via certain protocols that are controlled by Group Services.

A GS client that has joined a group is called a provider. A GS client that wants only to monitor a group without being able to initiate changes in the group is called a subscriber.

Once a GS client has initialized its connection to Group Services, it can join a group and become a provider. All other GS clients that have already joined the group (those that have already become providers) are notified as part of the join protocol about the new providers that want to join. The existing providers can either accept new joiners unconditionally (by establishing a one-phase join protocol) or vote on the protocol (by establishing an n-phase protocol). During the vote, they can choose to approve the request and accept the new provider into the group, or reject the request and refuse to allow the new provider to join.

Group Services monitors the status of all the processes that are members of a group. If either a process or the node on which a process is executing fails, Group Services initiates a failure notification that informs the remaining providers in the group that one or more providers have been lost.

Join and failure protocols are used to modify the membership list of the group. Any provider in the group may also propose protocols to modify the state value of the group. All protocols are either unconditional (one-phase) protocols, which are automatically approved, or conditional (n-phase) protocols, sometimes called voted-on protocols, which are voted on by the providers.

During each phase of an n-phase protocol, each provider can take application-specific action and must vote to approve, reject, or continue the protocol. The protocol completes when it is either approved (the proposed changes become established in the group), or rejected (the proposed changes are dropped).

Group Services components

The Group Services subsystem consists of the following components:

Group Services dependencies

The Group Services subsystem depends on the following resources:

Configuring and operating Group Services

The following sections describe the various aspects of configuring and operating Group Services.

Group Services subsystem configuration

Group Services subsystem configuration is performed by the cthagsctrl command, which is included in the RSCT package.

The cthagsctrl command provides a number of functions for controlling the operation of the Group Services subsystem. These are:

Initializing the Group Services daemon

Under normal conditions, the Group Services daemon is started by GPFS. If necessary, the Group Services daemon can be started using the cthagsctrl command or the startsrc command directly.
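If you do need to start Group Services manually, a minimal sketch mirroring the Topology Services example given earlier (the cthags subsystem name is the one shown in Example A-3 below) is:

startsrc -s cthags
lssrc -s cthags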

During initialization, the Group Services daemon performs the following steps:

Group Services daemon operation

Normal operation of the Group Services subsystem requires no administrative intervention. The subsystem normally recovers automatically from temporary failures, such as node failures or failures of Group Services daemons. However, there are some operational characteristics that might be of interest to administrators:

On occasion, you may need to check the status of the subsystem. You can display the operational status of the Group Services daemon by issuing the lssrc command, as shown in Example A-3.

Example A-3: Verify that Group Services is running

[root@storage001 root]# lssrc -ls cthags
Subsystem         Group            PID     Status
 cthags           cthags           2772    active
1 locally-connected clients. Their PIDs:
3223(mmfsd)
HA Group Services domain information:
Domain established by node 1
Number of groups known locally: 3
                       Number of    Number of local
Group name             providers    providers/subscribers
GpfsRec.1              5            1            0
Gpfs.1                 5            1            0
NsdGpfs.1              5            1            0
[root@storage001 root]#
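Reading this output: the cthags subsystem is active with PID 2772 and has one locally connected client, the GPFS daemon mmfsd (PID 3223). The Group Services domain was established by node 1, and three GPFS-related groups are known locally (GpfsRec.1, Gpfs.1, and NsdGpfs.1), each with five providers across the cluster and one local provider on this node.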


