UNIX Fault Management: A Guide for System Administrators

BMC Software provides monitoring capabilities through its PATROL software suite. PATROL is a system, application, and event management suite for system and database administrators. PATROL provides the basic framework for defining thresholds, sending and translating events, and performing other such tasks. PATROL consists of a console, intelligent agents, and Knowledge Modules (KMs). KMs are add-on products that provide the ability to monitor specific components.

Three types of consoles are available:

  • Operator console: Provides the graphical display and administration of systems and applications, as shown in Figure 3-5. Manual corrective actions are performed from here.

    Figure 3-5. PATROL console window showing resources being monitored on system "bakers."

  • Developer console: Provides the capabilities for configuration and customization of remote agents and KMs

  • Event manager console: Adds event filtering, correlation, sorting, and escalation facilities

BMC PATROL runs on a variety of operating systems, including UNIX, NT, OS/390, and NetWare. BMC also provides additional management software for mainframe environments.

Monitored Components

Monitoring in PATROL is provided by KMs, which contain the expertise used by PATROL to know what to monitor and how to react when problems occur. KMs are used to monitor a set of parameters, which can include a description of the monitored attribute, the polling interval, the method for measuring the attribute, and a threshold for abnormal values. The KMs provide rules to detect events and perform corrective actions. Events are sent to an operator console when an error or warning condition occurs.
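
As a rough illustration of the parameter concept described above, the following Python sketch models a parameter as a description, a polling interval, a measurement method, and thresholds. It is not PATROL code and uses no PATROL API. The names Parameter and evaluate, the split into separate warning and alarm thresholds, and the sample disk-usage value are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Parameter:
    """Hypothetical model of a monitored parameter (not a PATROL API)."""
    description: str               # what the monitored attribute means
    polling_interval_s: int        # how often the attribute is measured
    measure: Callable[[], float]   # method used to measure the attribute
    warning_threshold: float       # assumed: value at which a warning is raised
    alarm_threshold: float         # assumed: value at which an alarm is raised

    def evaluate(self) -> str:
        """Measure the attribute once and classify it against the thresholds."""
        value = self.measure()
        if value >= self.alarm_threshold:
            return "ALARM"
        if value >= self.warning_threshold:
            return "WARNING"
        return "OK"


# Illustrative parameter; the lambda stands in for a real measurement.
disk_usage = Parameter(
    description="File system percent used",
    polling_interval_s=300,
    measure=lambda: 93.0,
    warning_threshold=80.0,
    alarm_threshold=90.0,
)
status = disk_usage.evaluate()
```

In this sketch, a status other than "OK" corresponds to the point at which an event would be sent to an operator console.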

PATROL provides a wide variety of operating system, middleware, database, and application KMs. Different combinations of KMs can be purchased, so you can customize the solution for your own environment.

Operating system KMs are provided for UNIX, NT, and other OSs. UNIX platforms include Sun Solaris, HP-UX, and IBM AIX. Middleware KMs include Tuxedo and DCE.

Database KMs include Oracle, Sybase, Informix, DB2, Red Brick, Ingres, and others. BMC PATROL's database KMs provide more metrics than are available from the other products described in this chapter. For example, BMC PATROL provides more than 70 metrics for Oracle. About 50 metrics are monitored for Informix. Metrics include user connection information, active locks, I/O statistics, dictionary hit ratios, and CPU utilization. Server performance is also monitored. BMC formerly had a technical agreement with HP to provide the database KMs for MeasureWare. PATROL provides a strong solution for database management because it encompasses both monitoring and database administration tools.

BMC has bundled its database products into a PATROL Availability Suite for Oracle. The product bundle includes the PATROL KM for Oracle, PATROL DB-Stats for Oracle, PATROL DB-Reorg for Oracle, and the PATROL DB-Integrity products.

Application KMs include SAP R/3, Baan, and PeopleSoft. System and database KMs should also be used with these application KMs.

Details on specific KMs for system, applications, middleware, and databases are covered in separate chapters.

Monitoring Features

PATROL's operator console provides a centralized graphical display in which icons represent system components or other monitored components. Icons change color to correspond to a component's status. Detailed information can be displayed as gauges, or in graphs or text windows.

BMC PATROL has its own configuration interface and its own message browser for receiving events. Alarms can be configured such that events are sent to the console, indicated graphically, and shown in the Event Browser. Events can also be sent to the Message Browser in IT/O, or forwarded via SNMP to other management platforms such as Unicenter TNG or Tivoli TME.

PATROL can graph multiple metrics simultaneously for a single system to help with performance monitoring. If data is logged, historical graphs can also be shown.

Intelligent agents provide the ability to discover system, database, and application components in the enterprise. The agents reside on each server. On an ongoing basis, the agents look for problems. When they encounter problems, they either take preconfigured actions or send notification so that recovery can be done manually.

KMs on the monitored system provide the rules for detecting events and performing recovery actions. Events are sent to an operator console when an error or warning condition occurs. Administrators can customize the recovery actions. The PATROL agent polls for information and can adjust its sampling rate based on its performance impact. Each system has one PATROL agent, but may have many KMs.
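
The source does not describe how the agent adjusts its sampling rate. The sketch below shows one plausible policy, purely as an assumption: lengthen the polling interval whenever the time spent collecting a metric would exceed a fixed fraction of the interval. The function name next_interval, the 1 percent overhead budget, and the one-hour cap are hypothetical.

```python
def next_interval(base_s: float, cost_s: float,
                  max_overhead: float = 0.01, max_s: float = 3600.0) -> float:
    """Pick a polling interval that keeps collection overhead within budget.

    base_s       -- interval the administrator asked for, in seconds
    cost_s       -- measured time one collection takes, in seconds
    max_overhead -- assumed budget: fraction of each interval spent collecting
    max_s        -- assumed upper bound so a metric is never starved
    """
    if base_s <= 0 or max_overhead <= 0:
        raise ValueError("base_s and max_overhead must be positive")
    if cost_s < 0:
        raise ValueError("cost_s must not be negative")
    # cost_s / interval <= max_overhead  implies  interval >= cost_s / max_overhead
    required_s = cost_s / max_overhead
    return min(max(base_s, required_s), max_s)


cheap = next_interval(60.0, 0.2)       # inexpensive metric keeps its 60 s interval
costly = next_interval(60.0, 2.0)      # expensive metric is polled less often
capped = next_interval(60.0, 100.0)    # very expensive metric hits the cap
```

A policy like this slows down only the expensive collections, so inexpensive metrics keep the interval the administrator requested.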

PATROL has been certified to run in an MC/ServiceGuard environment, but inconsistency issues exist. For example, PATROL may report an SAP process failure to its operator console, while MC/ServiceGuard may have already restarted the process or moved it to another system.

Unlike MC/ServiceGuard, PATROL does not provide any failover capability. However, a limited set of automated recovery actions is available.

PATROL can show an application view in the PATROL console. IT/O provides only a node view, although ClusterView adds an application view to IT/O when used in an MC/ServiceGuard environment.

Hewlett-Packard used to rely on BMC PATROL to provide database information to its MeasureWare Agent. The database KMs were resold through a special licensing agreement. HP now uses its SMART Plug-Ins for databases to gather database information for MeasureWare.

Monitor Discovery and Configuration

You need to install the PATROL agent and KMs on each system to be monitored. Once this software is installed, you need to load (or activate) the KMs you intend to use on each system. This can be done from PATROL's operator console.

PATROL agents discover all databases, applications, and key resources when they are started. This makes it easy to start monitoring quickly. The administrator can also define additional applications and databases so that they can be discovered in the future.

Figure 3-5 shows an example of the resources being monitored by PATROL on a system called "bakers." This system has the KMs for UNIX and for Oracle loaded. You can drill down on each of these resource classes to see the metrics being monitored, graph data, and check status.

In addition to monitoring, PATROL can automate recovery actions taken in response to error or failure conditions. The user must assign the desired recovery actions to an event. The concept of operator-initiated actions, available with IT/O, is not supported with PATROL.

PATROL can be configured to respond automatically to specific problems, and can help tune databases for optimal performance. Recovery actions can be performed locally by agents without requiring communication with the console.

Users can set thresholds. Events are sent by the PATROL Event Manager to the PATROL console. All metrics are sampled simultaneously, which can cause unnecessary system overhead for resources that are less critical or that rarely change.

Monitor Developer's Kit

BMC PATROL has an API that enables events to be sent by a non-PATROL program to the PATROL Event Manager. This is similar to IT/O, which has an API for sending RPC messages to its Event Browser. A non-PATROL program can also receive events from PATROL.

The PATROL Scripting Language (PSL), part of the PATROL Developer Console, can be used to create scripts to perform recovery actions by the PATROL agent on managed nodes. PSL can also be used to write parameters, commands, tasks, and discovery procedures for PATROL agents.

Notification Methods

PATROL Alarm Manager can be used to send notifications by pager, by e-mail, or to third-party messaging systems. Users define the resources to monitor and the notification criteria. After configuration, the alarm policies can be distributed to a predefined set of systems. The PATROL Alarm Manager keeps track of additional information, such as the number of events sent by a host, and time periods during which notifications should be sent. PATROL can also send copies of an event to multiple consoles.
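
The bookkeeping described above, a per-host event count plus a time period during which notifications should be sent, can be sketched as follows. This is a minimal illustration and not the PATROL Alarm Manager's actual data model. The class NotificationPolicy, its fields, and the sample business-hours window are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import time


@dataclass
class NotificationPolicy:
    """Hypothetical notification policy (not a PATROL API)."""
    start: time   # beginning of the period in which notifications are sent
    end: time     # end of that period (exclusive)
    events_per_host: Counter = field(default_factory=Counter)

    def in_window(self, at: time) -> bool:
        """Return True if 'at' falls inside the notification period."""
        if self.start <= self.end:
            return self.start <= at < self.end
        # The period wraps past midnight, for example 22:00 to 06:00.
        return at >= self.start or at < self.end

    def record(self, host: str, at: time) -> bool:
        """Count the event for its host and say whether to notify."""
        self.events_per_host[host] += 1
        return self.in_window(at)


policy = NotificationPolicy(start=time(8, 0), end=time(18, 0))
notify_morning = policy.record("bakers", time(9, 30))   # inside the window
notify_night = policy.record("bakers", time(22, 0))     # outside the window
```

Events are counted whether or not a notification goes out, so the per-host totals stay accurate outside the notification period.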

The PATROL Event Translator (PET) can translate messages for various protocols. BMC provides a PET for IT/O, using the OpenView APIs, but it forwards only the event. The operator must switch to the BMC PATROL console to initiate corrective actions.

Diagnostic Capabilities

PATROL can graph multiple metrics simultaneously for a single system to help with performance monitoring. If data is logged, historical graphs can also be shown.

In addition to monitoring, PATROL can automate recovery actions taken in response to error or failure conditions. The user must assign the desired recovery actions to an event. The concept of operator-initiated actions, available with IT/O, is not supported with PATROL.

PATROL can be configured to respond automatically to specific problems, and can help tune databases for optimal performance. Recovery actions can be performed by agents without requiring communication with the console.

In addition to providing help diagnosing and recovering from problems, BMC provides PATROL DB, which includes database administrative tools such as Pathfinder, DB-Alter, DB-Reorg, DB-Change Manager, DB-Integrity, DB-Voyager, and SQL-Explorer. These tools are integrated into the PATROL framework and can be launched from the PATROL console. In contrast, IT/O provides monitoring only, relying on other software vendors to provide administrative tools.

To change UNIX kernel parameters in response to a problem, BMC uses the tool Opportune.

Additional Information

More information about BMC Software can be found on the Web at http://www.bmc.com. BMC PATROL software can be downloaded from its Web site.

In early 1998, BMC Software announced plans to buy BGS Systems, Inc. BMC intends to integrate BGS's Best/1 product for performance analysis and trending into the PATROL software suite. This includes integration of its separate agent, collection, and data store technologies. BGS provides performance management solutions for UNIX, NT, and mainframe systems. Information about BGS Systems, Inc. can be found at http://www.bmc.com.
