
6.2 Remote controlling nodes

The IBM Cluster 1350 is designed as a fully remotely manageable solution. The combination of management components means that physical access to the system is seldom, if ever, required. In this section, we outline some of these capabilities.

6.2.1 Power control

The onboard service processors on the xSeries nodes control the mainboard power. Via the RSA, nodes may be powered on, off, and reset from the management node.

To check the current status of all the nodes in the cluster, use the query argument to rpower, as in Example 6-12.

Example 6-12: Using rpower to query the power status of all nodes in the cluster

[root@master /]# rpower -a query
storage1.cluster.com on
node1.cluster.com on
node2.cluster.com on
node3.cluster.com off
node4.cluster.com on
[root@master /]#

Normally, nodes are rebooted using the reboot or shutdown commands (usually run via dsh). If a node locks up completely and you would otherwise press the reset switch on the front panel, use the reboot option of rpower instead; it has an identical effect.
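For a routine clean reboot of every node over the network, the shutdown command can be run through dsh. The following is a minimal sketch, assuming dsh is configured for all nodes and they are reachable over the cluster network:

# dsh -a /sbin/shutdown -r now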

To hard-reset node4 with rpower, use:

# rpower -n node4 reboot

Similarly, as shown in Example 6-13, nodes may be powered on or off using rpower.

Example 6-13: Power controlling nodes with rpower

[root@master /]# rpower -n node4 off
node4.cluster.com off complete rc=0
[root@master /]# rpower -n node4 on
node4.cluster.com on complete rc=0

If rpower returns errors, the IBM.HWCTRLRM daemon may have crashed. To restart it, run:

# stopsrc -s IBM.HWCTRLRM
# startsrc -s IBM.HWCTRLRM

If rpower fails for all the nodes on a particular RSA, it could be that the RSA has locked up and needs restarting. rpower may be used to restart hardware control points. To restart the RSA that controls node1, use:

# rpower -n node1 resetsp_hcp

If this fails, use a Web browser on the management node to log in to the RSA and restart it. Some information on this can be found in 5.2.8, "System Management hardware configuration" on page 122.

6.2.2 Console access

It is possible to access the nodes' consoles via the terminal server(s). Console access is most commonly required when a network interface has accidentally been misconfigured or in the case of boot-time problems. Besides the usual serial console support (accessing a LILO or GRUB prompt, watching boot messages, and getting a login prompt), the x335s support watching the BIOS boot screen and accessing BIOS setup over the serial port.

Console access is significantly slower than the cluster network (9600 baud) and so is generally used only when the network is unavailable or unsuitable. Normal access to the nodes is over the cluster network via ssh, rlogin, and so on.

If you are running in X Windows, you can access the console of a particular node with:

# rconsole -n <hostname>

This will start the console in a new terminal on the current display.

If you are logged into the management server remotely (via ssh, for example) and want to use the current window to display the console, add the -t switch, as in Example 6-14. Note that you may need to press Enter before the login prompt appears.

Example 6-14: Remote console access with rconsole

[root@master eqnx]# rconsole -t -n node1
[Enter `^Ec?' for help]
Connected.

Red Hat Linux release 7.3 (Valhalla)
Kernel 2.4.18-10smp on an i686

node1 login: [disconnect]

To exit from the console, use ^Ec. (Ctrl-E, then c, then a period).

From X Windows, the rconsole command can also be used to monitor a number of nodes. This is particularly useful when installing multiple nodes. Running

# rconsole -a

will open a tiny terminal for each node in the cluster. Although these terminals are too small to read any details, they may be used to get an idea of what is happening. If the install bars stop moving or the kernel messages appear irregular, the window may be changed to full-size by holding down Ctrl, right-clicking on the window, and selecting Medium from the menu.

6.2.3 Node availability monitor

CSM continually monitors the reachability of the nodes in the cluster via RSCT. Example 6-15 shows the use of lsnode to display this information.

Example 6-15: Checking node availability using lsnode

[root@master /]# lsnode -p
node1: 1
node2: 1
node3: 1
node4: 1
storage1: 1
[root@master /]#

A value of 1 indicates the node is up and the RMC daemons are communicating correctly; a value of 0 indicates the node is down. An error is denoted by the value 127.

This output format is the same as that of dsh, so dshbak may be used to group the up and down nodes, as in Example 6-16.

Example 6-16: Using dshbak to format lsnode output

[root@master /]# lsnode -p | dshbak -c
HOSTS -------------------------------------------------------------------------
node1, node2, node3, storage1
-------------------------------------------------------------------------------
1
HOSTS -------------------------------------------------------------------------
node4
-------------------------------------------------------------------------------
0
[root@master /]#

Occasionally, lsnode -p may report a node as 127. This value is used in CSM to represent "unknown" and is an error condition; it can be caused by a number of factors.
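To quickly list only the nodes that are not reporting a status of 1, the lsnode output may be filtered with standard text tools. The following is a minimal sketch, assuming the hostname: status output format shown above:

# lsnode -p | awk -F': ' '$2 != 1 {print $1}'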

6.2.4 Hardware status and management

The onboard service processors on the cluster nodes monitor various physical aspects of the node, such as fan speed and temperatures. These can be queried, via the RSAs, using the lshwstat command.

Example 6-17 shows the use of the lshwstat command to view the temperatures on node1.

Example 6-17: Querying a node temperature using lshwstat

[root@master /]# lshwstat -n node1 temp
node1.cluster.com:
CPU 1 Temperature = 39.0 (102.2 F)
  Hard Shutdown: 95.0 (203 F)
  Soft Shutdown: 90.0 (194 F)
  Warning: 85.0 (185 F)
  Warning Reset: 78.0 (172.4 F)
CPU 2 Temperature = 39.0 (102.2 F)
  Hard Shutdown: 95.0 (203 F)
  Soft Shutdown: 90.0 (194 F)
  Warning: 85.0 (185 F)
  Warning Reset: 78.0 (172.4 F)
DASD 1 Temperature = 29.0 (84.2 F)
Ambient Temperature = 25.0 (77 F)
[root@master /]#

The lshwstat command can be used to query many aspects of an individual node or of node groups, although querying a large number of nodes can be slow. Besides monitoring the health of the system, lshwstat can be used to query the hardware configuration of the nodes. lshwstat -h gives a complete list of all the options that may be queried, as shown in Example 6-18.

Example 6-18: lshwstat -h output

[root@master /]# lshwstat -h
Usage: lshwstat -h
       lshwstat [-v] { -a | -n hostname[,hostname...] | -N nodegroup[,nodegroup...] }
                { cputemp | disktemp | ambtemp | temp | voltage | fanspeed |
                  power | powertime | reboots | state | cpuspeed | maxdimm |
                  insdimm | memory | model | serialnum | asset | all }
   -a  runs the command on all the nodes in the cluster.
   -h  writes usage information to standard output.
   -n hostname[,hostname...]  specifies a node or list of nodes.
   -N nodegroup[,nodegroup...]  specifies one or more nodegroups.
   -v  verbose output.
   cputemp   -- reports CPU temperatures
   disktemp  -- reports DASD temperatures
   ambtemp   -- reports system ambient temperature
   temp      -- reports all the above temperatures
   voltage   -- reports VRM and system board voltages
   fanspeed  -- reports percent of maximum fan is rotating
   power     -- reports current power status
   powertime -- reports number of hours running since last reboot
   reboots   -- reports number of system restarts
   state     -- reports current system status
   cpuspeed  -- reports speed of CPUs in MHz
   maxdimm   -- reports maximum supported DIMMs
   insdimm   -- reports number of installed DIMMs
   memory    -- reports total system memory in MB
   model     -- reports model type
   serialnum -- reports model serial number
   asset     -- reports service processor asset tag
   all       -- reports all the above information
[root@master /]#
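For instance, the -N flag shown above can be combined with the fanspeed option to report fan speeds for an entire node group. This is only an illustrative sketch; ComputeNodes is a placeholder for a node group actually defined in your cluster:

# lshwstat -N ComputeNodes fanspeed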

