In 5.1, "CSM concepts and architecture" on page 212, we touch on the topics of CSM management and administration as a basic introduction to the main features of CSM and how they function. In this section, we examine these administration topics in detail, using examples and sample scenarios, in the areas covered by the following subsections.
5.3.1 Log file management
CSM writes to several different log files during installation and cluster management. These log files are available on the management server and on the managed nodes, and they help in determining the status of a command and in troubleshooting CSM issues. Most of the CSM log files on the management server are located in the /var/log/csm directory. Table 5-1 lists the log files on the management server and their purpose.
Table 5-1. Log files on management server
Table 5-2 on page 253 lists the log files on managed nodes and their purpose.
Table 5-2. Log files on managed nodes
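A quick way to check the status of a CSM operation is to inspect these logs directly. A minimal sketch, assuming the installnode.log file written during node installation:
# ls /var/log/csm                             (log files on the management server)
# tail -n 50 /var/log/csm/installnode.log     (review the end of the node installation log)
# dsh -a ls /var/log/csm                      (check which log files exist on the managed nodes)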
5.3.2 Node groups
Managed nodes can be grouped together by using the nodegrp command. Distributed commands can be issued against groups for common tasks, instead of performing them on each node. Default node groups created at install time are shown in Example 5-27.
Example 5-27. nodegrp command
# nodegrp
ManagedNodes
AutoyastNodes
ppcSLES81Nodes
AllNodes
SuSE82Nodes
SLES72Nodes
pSeriesNodes
SLES81Nodes
LinuxNodes
PreManagedNodes
xSeriesNodes
EmptyGroup
APCNodes
RedHat9Nodes
MinManagedNodes

New node groups are created with the nodegrp command:
# nodegrp -a lpar1,lpar2 testgroup
This creates a group called testgroup, which includes nodes lpar1 and lpar2. For more information, refer to the nodegrp man page. Distributed commands such as dsh can be run against node groups:
# dsh -w testgroup date
5.3.3 Hardware control
The CSM hardware control feature is used to remotely control HMC-attached pSeries servers. Remote nodes can be powered on and off, their power status can be queried, and a remote console can be opened from the management server. For the hardware control function to work, all pSeries servers must be connected to an HMC, and the HMC must be able to communicate with the management server. Figure 5-7 shows the hardware control design for a simple CSM cluster.
Figure 5-7. pSeries CSM cluster with hardware control using HMC
Hardware control uses openCIMOM (open source software) and the conserver software to communicate with the HMC and issue remote commands. During startup, the IBM.HWCTRLRM daemon subscribes to HMC openCIMOM events and maintains their state. Conserver is started at boot time on the management server and reads its configuration file, /etc/opt/conserver/conserver.cf. The following hardware control commands are available on the management server:
rpower - Powers nodes on and off and queries power status
rconsole - Opens a remote serial console for nodes
chrconsolecfg - Removes, adds, and rewrites conserver configuration file entries
rconsolerefresh - Refreshes conserver on the management server
getadapters - Obtains MAC addresses of remote nodes
lshwinfo - Collects node information from hardware control points
systemid - Stores the user ID and encrypted password required to access remote hardware
The rpower and rconsole commands are the most frequently used hardware control commands, and we discuss them in detail here.
Remote power
Remote power commands access the CSM database for node attribute information. The PowerMethod node attribute must be set to hmc to access pSeries nodes. HardwareControlPoint is the host name or IP address of the Hardware Management Console (HMC). HardwareControlNodeId identifies the managed node on the HMC to which it is attached. Other node attributes, such as HWModel, HWSerialNum, and HWType, are obtained automatically using lshwinfo. Remote power configuration is outlined in 5.2.5, "Installing CSM on the management server" on page 228.
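As a hedged illustration (the node name lpar1 is hypothetical), the power-related attributes can be checked and a node power-cycled from the management server:
# lsnode -l lpar1                     (list all attributes, including the power-related ones)
# chnode -n lpar1 PowerMethod=hmc     (set the power method, if it is not already set)
# rpower -n lpar1 query               (query the power status of the node)
# rpower -n lpar1 on                  (power the node on)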
Remote console
The rconsole command communicates with the console server to open a remote console to nodes over the management VLAN and serial connections. The HMC acts as the remote console server, listening for requests from the management server. Only one read-write console, but multiple read-only consoles, can be opened to each node by using the rconsole command.
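A minimal sketch, assuming the -t flag for opening the console in the current terminal instead of a new window:
# rconsole -n lpar1        (open a remote console to lpar1)
# rconsole -t -n lpar1     (open the console in the current terminal)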
5.3.4 Configuration File Manager (CFM)
Configuration File Manager (CFM) is a CSM component that centralizes and distributes files across the managed nodes in a management cluster. It is similar to file collections on IBM PSSP. Common files, such as /etc/hosts, are distributed across the cluster from the management server using a push mechanism driven by root's crontab and/or event monitoring. CFM uses rdist to distribute files. Refer to 5.1.7, "CSM diagnostic probes" on page 220 for more information on hostname changes. CFM uses /cfmroot as its main root directory, which is linked to /etc/opt/csm/cfmroot on the management server; file permissions are preserved while copying. Make sure that you have enough space in your root directory, or create /cfmroot on a separate partition and symlink it from /etc/opt/csm/cfmroot. Example 5-28 shows cfmupdatenode usage.
Example 5-28. cfmupdatenode usage
Usage: cfmupdatenode [-h] [-v | -V] [-a | -N node_group[,node_group] | --file file]
       [-b] [[-y] [-c]] [-q [-s]] [-r remote_shell_path] [-t timeout]
       [-M number_of_max_children] [-d location_for_distfile] [-f filename]
       [[-n] node_list]

-a  Files are distributed to all nodes. This option cannot be used with the -N or host positional arguments.
-b  Backup. Preserve the existing configuration file (on nodes) as "filename".OLD.
-c  Perform a binary comparison on files and transfer them if they differ.
-d distfile_location  cfmupdatenode generates a distfile in the given (absolute) path and exits without transferring files. This way the user can execute rdist with the given distfile and any options desired.
-f filename  Only update the given filename. The filename must be the absolute path name of the file, and the file must reside in the cfmroot directory.
--file filename  Specifies a file that contains a list of node names. If the file name is "-", the list is read from stdin. The file can contain multiple lines, and each line can have one or more node names, separated by spaces.
-h  Writes the usage statement to standard out.
[-n] node_list  Specifies a list of node hostnames, IP addresses, or node ranges on which to run the command. (See the noderange man page for information on node ranges.)
-M number_of_maximum_children  Set the number of nodes to update concurrently. (The default is 32.)
-N node_group[,node_group...]  Specifies one or more node groups on which to run the command.
-q  Queries for out-of-date CFM files across the cluster.
-s  Reports which nodes are up to date by comparing last CFM update times. Must be used with the -q option.
-r remote_shell_path  Path to the remote shell. (The default is the DSH_REMOTE_CMD environment variable, or /usr/bin/rsh.)
-t timeout  Set the timeout period (in seconds) for waiting for a response from a remote process. (The default is 900.)
-v | -V  Verbose mode.
-y  Younger mode. Does not update files younger than the master copy.
Note: CFM can be set up prior to running the installnode command; common files are then distributed as part of node installation.
At CSM install time, root's crontab is updated with an entry to run cfmupdatenode every day at midnight. This can be changed to suit your requirements.
# crontab -l | grep cfmupdate
0 0 * * * /opt/csm/bin/cfmupdatenode -a 1>>/var/log/csm/cfmerror.log 2>>/var/log/csm/cfmerror.log
Some common features of CFM, along with usage examples, are described here.
Whenever a file in /cfmroot is modified, the changes are propagated to all managed nodes in the cluster.
Note: Use caution when enabling CFM event monitoring, as it can impact system performance.
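For example, to distribute a common /etc/hosts file immediately, rather than waiting for the cron entry, the file can be placed under /cfmroot and pushed manually (a minimal sketch):
# mkdir -p /cfmroot/etc
# cp /etc/hosts /cfmroot/etc/hosts     (files under /cfmroot mirror their destination path on the nodes)
# cfmupdatenode -a                     (push the changed file to all managed nodes now)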
User id management with CFM
CFM can be used to implement centralized user id management in your management domain. User ids and passwords are generated on the management server, stored under /cfmroot, and distributed to nodes as scheduled. Copy the following files to /cfmroot to set up effective user id management:
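A minimal sketch, assuming the standard Linux user database files are the ones being centralized (adjust the list to your authentication setup):
# cp /etc/passwd /cfmroot/etc/passwd
# cp /etc/group /cfmroot/etc/group
# cp /etc/shadow /cfmroot/etc/shadow    (keep the restrictive permissions on this file)
# cfmupdatenode -a                      (distribute the files to the managed nodes)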
Be aware that any user id and password changes made on the nodes will be lost once centralized user id management is implemented. However, you can force users to change their passwords on the management server instead of on the nodes. Set up scripts or tools to centralize user id creation and password changes by group on the management server, and disable password command privileges on the managed nodes. CFM distributes files to managed nodes, but never deletes them. If a file needs to be deleted, delete it manually or with a dsh command from the management server. All CFM updates and errors are logged to /var/log/csm/cfmchange.log and /var/log/csm/cfmerror.log. For more information, refer to IBM Cluster Systems Management for Linux: Administration Guide, SA22-7873.
5.3.5 Software maintenance
The CSM Software Maintenance System (SMS) is used to install, query, update, and delete Linux RPM packages on the management server and managed nodes. These tasks are performed using the smsupdatenode command. The open source Autoupdate software is a prerequisite for using SMS. SMS uses either install mode, to install new RPM packages, or update mode, to update existing RPM packages on cluster nodes. Preview (test) mode only tests the update without actually installing the packages. The SMS directory structure is rooted at /csminstall/Linux/InstallOSName/InstallOSVersion/InstallOSArchitecture, with RPMS, updates, and install subdirectories that hold the SMS RPMs, updates, and install packages, respectively. A sample SMS directory structure on SuSE 8.1 looks like the following:
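A hedged sketch of that layout, assuming a distribution name of SLES, version 8.1, and a ppc64 package architecture (the actual directory names come from the corresponding node attributes):
/csminstall/Linux/SLES/8.1/ppc64/RPMS
/csminstall/Linux/SLES/8.1/ppc64/updates
/csminstall/Linux/SLES/8.1/ppc64/install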
Copy the requisite RPM packages into the respective subdirectories from the install or update CDs.
Note: SMS maintains RPM packages only; OS patch CDs cannot be applied directly to update OS packages. Instead, copy the RPM packages from the patch CDs to the respective subdirectories and then issue smsupdatenode, as sketched below:
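A minimal sketch of that workflow, assuming the sample SLES 8.1 ppc64 layout shown earlier and an illustrative CD mount point and path:
# mount /media/cdrom                                          (mount the patch CD)
# cp /media/cdrom/path/to/rpms/*.rpm /csminstall/Linux/SLES/8.1/ppc64/updates/
# smsupdatenode -a                                            (apply the newer RPM versions to all managed nodes)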
Example 5-30 shows the usage of smsupdatenode.
Example 5-30. smsupdatenode usage
Usage: smsupdatenode [-h] [-a | -N node_group[,node_group] | --file file] [-v | -V]
       [-t | --test] [-q | --query [-c | --common]] [--noinsdeps]
       [-r "remote shell path"] [-i | --install packagename[,packagename]]
       [-e | --erase {--deps | --nodeps} packagename[,packagename]]
       [-p | --packages packagename[,packagename]] [[-n] node_list]
       smsupdatenode [--path pkg_path] --copy {attr=value... | hostname}

-a  Run Software Maintenance on all nodes.
--copy {attr=value... | hostname}  Copy the distribution CD-ROMs corresponding to the given attributes or hostname to the correct /csminstall directory. If you give attr=value pairs, they must come at the end of the command line. The valid attributes are: InstallDistributionName, InstallDistributionVersion, InstallPkgArchitecture. If a hostname is given, the distribution CD-ROMs and destination directory are determined by the node's attributes.
-e | --erase {--deps | --nodeps} packagename[,packagename]  Removes the RPM packages specified after either the --deps or --nodeps option.
--deps  Removes all packages dependent on the package targeted for removal.
--nodeps  Only removes this package and leaves the dependent packages installed.
--file filename  Specifies a file that contains a list of node names. If the file name is "-", the list is read from stdin. The file can contain multiple lines, and each line can have one or more node names, separated by spaces.
-h  Writes the usage statement to standard out.
[-n] node_list  Specifies a list of node hostnames, IP addresses, or node ranges on which to run the command. (See the noderange man page for information on node ranges.)
-i | --install packagename[,packagename]  Installs the given RPM packages.
-N node_group[,node_group...]  Specifies one or more node groups on which to run the command.
--noinsdeps  Do not install RPM dependencies.
-p | --packages packagename[,packagename]  Only update the given packages. The user does not have to give the absolute path; it is determined by looking under the directory structure corresponding to the node.
--path pkg_path  Specifies one or more directories, separated by colons, that contain copies of the distribution CD-ROMs. The default on a Linux system is /mnt/cdrom, and the default on an AIX system is /dev/cd0. This flag may only be used with the --copy flag.
-q | --query [-c | --common]  Query all the RPMs installed on the target machines and report the RPMs installed that are not common to every node.
-c | --common  Also report the common set of RPMs (installed on every target node).
-r "remote shell path"  Path to use for remote commands. If this is not set, the default is determined by dsh.
-t | --test  Report what would be done by this command without making any changes to the target systems.
-v | -V  Verbose mode.

SMS writes logs to the /var/log/csm/smsupdatenode.log file. Kernel packages are updated as normal RPM packages using SMS. Once upgraded, a kernel cannot be backed out, so use caution when running the smsupdatenode command with any kernel packages (kernel* prefix). Also, make sure to run lilo to reload the boot loader if you upgrade the kernel and want to boot the new kernel.
5.3.6 CSM Monitoring
CSM uses the Reliable Scalable Cluster Technology (RSCT) infrastructure for event monitoring. RSCT has proven to provide a highly available and scalable infrastructure in applications such as GPFS and PSSP. CSM monitoring uses a condition- and response-based system to monitor system resources such as processes, memory, CPU, and file systems. A condition is based on a defined event expression over a monitored resource attribute; if the event expression evaluates to true, an event is generated. For example, file system utilization of /var is a monitored resource attribute, and the condition can be its percent utilization: the expression /var > 90% means that if /var rises above the 90% threshold, the event expression is true and an event is generated. To prevent a flood of generated events, a re-arm expression can be defined; in that case, no further events are generated until the re-arm expression becomes true. A response is one or more actions performed when an event is triggered for a defined condition. Continuing with the file system example, if we define response actions to extend the file system by 1 MB and to notify the system administrator when /var rises above 90%, then once monitoring is started, those response actions are performed automatically whenever /var exceeds 90%. A set of predefined conditions and responses is available after CSM installation. See the IBM Cluster Systems Management for Linux: Administration Guide, SA22-7873, for more information.
Resource Monitoring and Control (RMC) and Resource Managers (RMs)
Resource Monitoring and Control (RMC) and Resource Managers (RM) are components of RSCT and are critical for monitoring.
Table 5-3 lists the available resource managers and their functions.
Table 5-3. Resource managers
Table 5-4 lists the predefined resource classes, which can be listed with the lsrsrc command.
Table 5-4. Predefined resource classes
Running lsrsrc -l Resource_class lists the detailed attributes of each resource class. Check the lsrsrc man page for more details.
Customizing event monitoring
As explained, custom conditions and responses can be created, and custom monitoring can be activated on one or more nodes, as sketched below:
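A hedged sketch using the RSCT condition and response commands; the condition name, node name, and thresholds mirror the /var example above, and the "E-mail root any time" response is the predefined response shown in Example 5-31:
# lscondition                    (list the predefined conditions)
# lsresponse                     (list the predefined responses)
# mkcondition -r IBM.FileSystem -e "PercentTotUsed > 90" -E "PercentTotUsed < 85" -m m -n lpar1 -d "File system space used" "FileSystem Space Used"
# startcondresp "FileSystem Space Used" "E-mail root any time"     (link the condition to the response and start monitoring)
# lscondresp                     (verify that the condition/response pair is Active)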
Example 5-31 shows the output of lscondresp.
Example 5-31. lscondresp output
# lscondresp
Displaying condition with response information:
Condition                        Response                          Node           State
"NodeFullInstallComplete"        "RunCFMToNode"                    "mgmt_server"  "Active"
"NodeManaged"                    "GatherSSHHostKeys"               "mgmt_server"  "Active"
"UpdatenodeFailedStatusChange"   "UpdatenodeFailedStatusResponse"  "mgmt_server"  "Active"
"NodeChanged"                    "rconsoleUpdateResponse"          "mgmt_server"  "Active"
"NodeFullInstallComplete"        "removeArpEntries"                "mgmt_server"  "Active"
"FileSystem Space Used"          "E-mail root any time"            "mgmt_server"  "Active"

If any file system on lpar1 exceeds 90% utilization, our newly created condition triggers an event, and the response action e-mails root. Monitoring re-arms once the file system utilization falls back below 85%. Multiple response actions can be defined for a single condition, and a single response can be assigned to multiple conditions. For the scenario in Example 5-31, other possible response actions could be increasing the file system size, or deleting files older than 60 days from the file system to reclaim space.
5.3.7 Diagnostic probes
CSM diagnostic probes help you diagnose system problems using programs called probes. The probemgr command runs probes to determine problems; users can also write their own diagnostic scripts and run them through probemgr. All predefined probes are located in the /opt/csm/diagnostics/probes directory, and probemgr can read probes from a user-defined directory, specified with the -D option, before the predefined ones. Probes can depend on each other, so they run in a defined order. Example 5-32 shows the usage of probemgr.
Example 5-32. probemgr usage
probemgr [-dh] [-c {0|10|20|127}] [-l {0|1|2|3|4}] [-e prb,prb,...] [-D dir] [-n prb]

-h  Display usage information.
-d  Show the probe dependencies and the run order.
-c  Highest level of exit code returned by a probe that the probe manager permits before terminating. The default value is 10.
    0   - Success
    10  - Success with attention messages
    20  - Failure
    127 - Internal error
-l  Indicates the message output level. The default is 3.
    0 - Show probe manager messages, probe trace messages, probe explanation and suggested action messages, probe attention messages, and probe error messages
    1 - Show probe trace messages, probe explanation and suggested action messages, probe attention messages, and probe error messages
    2 - Show probe explanation and suggested action messages, probe attention messages, and probe error messages
    3 - Show probe attention messages and probe error messages
    4 - Show probe error messages only
-e prb,...  List of probes to exclude when creating the probe dependency tree. This also means that those probes will not be run.
-D dir  Directory where user-specified probes reside.
-n prb  Run the specified probe.

Table 5-5 lists the default predefined probes available and the probe dependencies.
Table 5-5. Probes and dependencies
All probes are run from the management server using the probemgr command. For detailed information on each probe, refer to the probemgr man page.
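Two hedged invocation examples, based on the flags in the usage above:
# probemgr -d        (display the probe dependencies and the order in which probes run)
# probemgr -l 0      (run the probes with the most verbose message output level)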
5.3.8 Querying the CSM database
CSM stores all cluster information, such as nodes and their attributes, in a database at a centralized location in the /var/ct directory. This database is accessed and modified using tools and commands, not directly with a text editor. Table 5-6 on page 268 lists the commands you can use to access the CSM database.
Table 5-6. CSM database commands
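As a hedged illustration of a few of these commands (the node name is an example, and attribute=value is a placeholder):
# lsnode                            (list all nodes defined in the CSM database)
# lsnode -l lpar1                   (show every attribute stored for node lpar1)
# nodegrp                           (list the defined node groups)
# chnode -n lpar1 attribute=value   (change a node attribute in the database)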
5.3.9 Un-installing CSM
CSM is uninstalled by using the uninstallms command on the management server. Not all packages are removed when uninstallms runs. Table 5-7 identifies what is removed and what is not removed by uninstallms.
Table 5-7. Uninstallms features
To completely erase CSM, manually clean up the packages and directories that are not removed by the uninstallms command. Refer to IBM Cluster Systems Management for Linux: Planning and Installation Guide, Version 1.3.2, SA22-7853, for detailed information.
5.3.10 Distributed Command Execution Manager (DCEM)
DCEM is a Cluster Systems Management graphical interface used to run a variety of tasks on networked computers. It is currently not available for pSeries machines.
5.3.11 Backing up CSM
CSM backup and restore features are currently not available for the pSeries Linux management server at version 1.3.2. They are expected to be available in the near future.
5.3.12 CSM problem determination and diagnostics
CSM logs detailed information to various log files on the management server and on managed nodes. These log files are useful in troubleshooting problems. In this section, we discuss some common problems which may be encountered while setting up and running CSM. For more detailed information and diagnostics, refer to the IBM Cluster Systems Management for Linux: Administration Guide, SA22-7873. Table 5-8 lists common CSM problems and their fixes.
Table 5-8. Common CSM problems and fixes
Refer to IBM Cluster Systems Management for Linux: Hardware Control Guide, SA22-7856, for more information on hardware control, HMC connectivity, and RMC issues.