Windows Server 2003 on Proliants. Deployment Techniques and Management Tools for System Administrators


This section describes how to troubleshoot ProLiant problems. There isn't sufficient space here to cover every issue, but the basics are covered, along with some helpful Web sites. Whether you fix the machine yourself or call HP support, these steps will help you narrow down the problem and give support personnel the information they need.

Proactive Troubleshooting

Becoming proactive in managing your environment is probably the most efficient way to avoid problems. Although HP endeavors to enhance system and application fault tolerance, problems and failures still occur, most often because of a lack of due diligence and a failure to apply recommended procedures. Assessing and acknowledging risk is a first step in recognizing the need to be proactive. The risk can be defined by asking some questions:

  • What is the risk of not updating virus software?

  • What is the risk of not verifying the backup of the accounting system?

  • What is the risk of using initial equipment cost as the only consideration for its purchase?

  • What is the risk of running systems in unsupported configurations?

  • What is the risk of updating drivers on both nodes of a cluster simultaneously?

  • What is the risk of waiting a month to replace a failed disk in a RAID set?

These examples demonstrate the point of avoiding risk, but might seem like exaggerations and risks that no one would take. However, these are real situations reported to HP's call center.

HP includes system management, monitoring, and troubleshooting tools and utilities with each ProLiant server; however, none of these tools will by itself prevent the risks just described. Using the tools as they were designed will go a very long way toward avoiding risk. Common sense and due diligence are your greatest assets in avoiding and managing risk.

The management tools included with ProLiant servers allow you to proactively monitor server health and keep systems software, drivers, and utilities up-to-date. This, along with a thorough backup and disaster recovery plan, will substantially ease management and reduce the risk of data loss.

Troubleshooting ProLiant Servers

ProLiant ML, DL, and BL servers include troubleshooting procedures in the Installation and Setup and User Guides shipped with each server. The guides provide hardware troubleshooting steps to assist in locating and fixing hardware configuration issues. The underside hood label in every ProLiant system will also assist with server configuration, as shown in Figure 10.58.

Figure 10.58. ProLiant server hood labels assist with server configuration.

The hood label provides supported configuration information for adding processors, memory, and other components.

Configuring New Servers and Troubleshooting

An out-of-the-box system should be prepared according to the instructions included with the server. The fast start setup guide will assist in preparing the server for initial powerup.

The following list covers basic steps for prepping a new server to install the OS:

1. Unpack and prep the server for powerup before adding any optional components.

2. After the server is connected to its power source and in standby mode, inspect the front panel for any error conditions. Refer to the server's User Guide for the front panel fault LED configuration.

3. Power the system on and allow it to complete its Power On Self Test (POST) and initialize standard components.

4. If no errors are encountered and optional components are to be installed, add internal components one at a time; POST the server after each component is installed to verify its status.

5. After all components are installed, insert the SmartStart CD and power the system on. SmartStart will assist you throughout the OS installation.

6. If a manual installation is preferred, run the ROM Based Setup Utility (RBSU), select the OS, and configure storage before attempting to install the OS.

Each ProLiant server is tested prior to shipping and should operate without error out of the box. However, during shipping, components might become unseated or loose. If problems are encountered during initial setup of a server, refer to the server's Setup and Installation Guide for troubleshooting information. Some of the common conditions encountered are described next.

Error: Loose Connections

Actions:

  • Be sure all power cords are securely connected.

  • Be sure all cables are properly aligned and securely connected for all external and internal components.

  • Remove and check all data and power cables for damage. Be sure no cables have bent pins or damaged connectors.

  • If a fixed cable tray is available for the server, be sure the cords and cables connected to the server are correctly routed through the tray.

  • Be sure each device is properly seated.

  • If a device has latches, be sure they are completely closed and locked.

  • Check any interlock or interconnect LEDs that might indicate a component is not connected properly.

  • If problems continue to occur, remove and reinstall each device, checking the connectors and sockets for bent pins or other damage.

Problems Adding Options to a Server

Actions:

1. Refer to the server documentation to be sure the hardware being installed is a supported option on the server. Remove unsupported hardware.

2. Refer to the release notes included with the hardware to be sure the problem is not caused by a last-minute change to the hardware release. If no documentation is available, refer to the HP support Web site at http://www.hp.com/support.

3. Verify new hardware is installed properly. Refer to the device, server, and OS documentation to be sure all requirements are met. Common problems include

  • Incomplete population of a memory bank

  • Installation of a processor without a corresponding PPM

  • Installation of a SCSI device without termination or without proper ID settings

  • Setting of an IDE device to Master/Slave when the other device is set to Cable Select

  • Connection of the data cable, but not the power cable, of a new device

The basic troubleshooting steps for ProLiant hardware problems are

  • Verify no memory, I/O, or interrupt conflicts exist.

  • Verify no loose connections exist. (See previous section.)

  • Verify all cables are connected to the correct locations and are the correct lengths. For more information, refer to the server documentation.

  • Verify other components were not unseated accidentally during the installation of the new hardware component.

  • Verify all necessary software updates, such as device drivers, ROM updates, and patches, are installed and current. For example, if you are using a Smart Array controller, you need the latest Smart Array controller ROM and device driver.

  • Verify all device drivers are the correct ones for the hardware. Uninstall any incorrect drivers before installing the correct drivers.

  • Run RBSU after boards or other options are installed or replaced to be sure all system components recognize the changes. If you do not run the utility, you might receive a POST error message indicating a configuration error. After you check the settings in RBSU, save and exit the utility, and then restart the server. Refer to the HP ROM-Based Setup Utility User Guide for more information.

  • Verify all switch settings are set correctly. For additional information about required switch settings, refer to the labels located on the inside of the server access panel or the server documentation.

  • Verify all boards are properly installed in the server.

  • Run Insight Diagnostics ("HP Insight Diagnostics") to see if it recognizes and tests the device.

  • Uninstall the new hardware.

Problem: Unknown Problem

Actions:

1. Disconnect power to the server.

2. Following the guidelines and cautionary information in the server documentation, strip the server to its most basic configuration by removing every card or device that is not necessary to start the server. Keep the monitor connected to view the server startup process.

3. Reconnect power, and then power the system on.

  • If the video doesn't work, refer to "Video Problems" in the HP ProLiant Server Troubleshooting Guide.

warning

Only authorized technicians trained by HP should attempt to remove the system board. If you believe the system board requires replacement, contact HP Technical Support (see "Contacting HP Technical Support or Authorized Reseller" in the HP ProLiant Server Troubleshooting Guide) before proceeding.

  • If the system fails in this minimum configuration, one of the primary components has failed. If you have already verified that the processor, PPM, power supply, and memory are working before getting to this point, replace the system board. If not, be sure each of those components is working.

  • If the system boots and video is working, add each component back to the server one at a time, restarting the server after each component is added to determine if that component is the cause of the problem. When adding each component back to the server, be sure to disconnect power to the server and follow the guidelines and cautionary information in the server documentation.
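The strip-down-and-add-back procedure above can be sketched as a simple loop. This is only an illustration of the logic, not an HP tool: the function name, the component names, and the `boots` callback (which stands in for a technician powering the server on and watching POST) are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def find_faulty_component(
    base: Sequence[str],
    optional: Sequence[str],
    boots: Callable[[List[str]], bool],
) -> Optional[str]:
    """Return the first component whose addition makes the server fail.

    Returns "BASE" if the minimum configuration itself fails, or None if
    every optional component can be added back without a failure.
    """
    config = list(base)
    if not boots(config):
        # A primary component (processor, PPM, power supply, memory,
        # or system board) is suspect.
        return "BASE"
    for part in optional:
        config.append(part)    # power off, install exactly one component
        if not boots(config):  # power on and observe the startup
            return part
    return None


# Simulated server in which the (hypothetical) tape drive causes the fault.
base = ["system board", "processor", "memory", "video"]
optional = ["array controller", "NIC", "tape drive"]
print(find_faulty_component(base, optional, lambda cfg: "tape drive" not in cfg))
```

Adding one component per restart is what makes the result unambiguous: when a failure appears, only the most recently added part (or its interaction with parts already installed) can be responsible.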

Third-Party Device Problems

Actions:

1. Refer to the server and OS documentation to be sure the server and OS support the device.

2. Verify the latest device drivers ("Maintaining Current Drivers" in the HP ProLiant Server Troubleshooting Guide).

3. Refer to the device documentation to be sure the device is properly installed. For example, a third-party PCI board might be required to be installed on the primary PCI bus.

Testing the Device

Actions:

1. Uninstall the device. If the server works with the device removed and uninstalled, either a problem exists with the device, the server does not support the device, or a conflict exists with another device.

2. If the device is the only device on a bus, be sure the bus works by installing a different device on the bus.

3. Move the device to each of the following locations, restarting the server each time to determine whether the device is working:

  • To a different slot on the same bus.

  • To a PCI slot on a different bus.

  • To the same slot in another working server of the same or similar design.

    If the board works in any of these slots, either the original slot is bad or the board was not properly seated. Reinsert the board into the original slot to verify.

4. If you are testing a board (or a device that connects to a board):

  • Test the board with all other boards removed.

  • Test the server with only that board removed.

warning

Clearing NVRAM deletes your configuration information. Refer to your server documentation for complete instructions before performing this operation or data loss could occur.

5. Clearing NVRAM can resolve various problems. Typically after hardware components have been added or removed, there is the potential that corrupt configuration information is stored in NVRAM, causing performance issues. Clearing NVRAM and reconfiguring the system will create a valid configuration.

HP ProLiant Server Troubleshooting Guide

The Setup and Installation Guide and User Guide available for each ProLiant server contain troubleshooting information specific to that server model. In addition to those guides, the HP ProLiant Server Troubleshooting Guide is available at the HP Web site at http://h20000.www2.hp.com/bc/docs/support/UCR/SupportManual/TPM_338615-2/TPM_338615-2.pdf.

A new version of the guide was released in October of 2003 that contains information tailored to the ProLiant ML, DL and BL servers, which shipped with SmartStart 6.0 or later. The topics covered in the guide include

  • Diagnosing the Problem

  • Hardware Problems

  • Software Problems

  • HP Resources for Troubleshooting

  • Array Diagnostics Utility (ADU) Error Messages

  • POST Error Messages and Beep Codes

  • Event List Error Messages

  • Contacting HP

  • Acronyms and Abbreviations

Previous versions have been used as a reference guide for the Array Diagnostics Utility (ADU) error messages, POST error messages and beep codes, and the event list error messages. The new guide contains additional features that assist in troubleshooting. Troubleshooting flowcharts are a new feature that will assist engineers in diagnosing problems. Sample flowcharts are shown in Figures 10.59 and 10.60.

Figure 10.59. General hardware installation and failure troubleshooting flowchart.

Figure 10.60. Troubleshooting flowchart for OS issues.

The chart shown in Figure 10.59 is a general troubleshooting flowchart for installing hardware or detecting a hardware failure. The OS Boot Problems flowchart, shown in Figure 10.60, assists with the following:

  • Symptoms:

    • Server does not boot a previously installed OS

    • Server does not boot SmartStart

  • Possible causes:

    • Corrupted OS

    • Hard drive subsystem problem

Troubleshooting Utilities

There are several troubleshooting utilities that you should be familiar with.

Survey Utility

Each ProLiant 300, 500, and 700 series server comes with Web-enabled Management Agents called HP Insight Management Agents. They are included on the SmartStart CD in the PSP and installed with a SmartStart assisted install or when applying the PSP. The Survey Utility is now included in the HP Insight Management Agent software and included in the PSP. The Survey features are available when selecting the tools tab in System Management Homepage. Survey sessions are now stored in XML files, displayed as HTML in the browser interface, and also used to perform session comparisons. The XML files can also be viewed by standard browsers. The survey.idi and survey.txt files used by the legacy Survey Utility are not used by Insight Diagnostics. The following steps describe how to use this utility:

1. Install the Windows PSP.

2. Browse to the System Management Homepage and click on the Tools tab (use the URL http://server-name:2301, or https://server-name:2381 for HTTPS secure communications). You can also use the IP address in place of the server name if needed.

3. Click on the Survey Utility link.
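Before opening a browser, it can help to confirm that the System Management Homepage ports are reachable. The following is a minimal sketch using only the Python standard library; the port numbers come from step 2, while the function names and the host name `server-name` are placeholders chosen for this example.

```python
import socket

SMH_HTTP_PORT = 2301   # plain HTTP, per step 2 above
SMH_HTTPS_PORT = 2381  # HTTPS secure communications


def smh_urls(server):
    """Build the System Management Homepage URLs for a name or IP address."""
    return {
        "http": f"http://{server}:{SMH_HTTP_PORT}",
        "https": f"https://{server}:{SMH_HTTPS_PORT}",
    }


def port_is_open(server, port, timeout=3.0):
    """Return True if a TCP connection to server:port succeeds."""
    try:
        with socket.create_connection((server, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host = "server-name"  # placeholder; substitute a real name or IP address
    for scheme, url in smh_urls(host).items():
        port = SMH_HTTPS_PORT if scheme == "https" else SMH_HTTP_PORT
        state = "reachable" if port_is_open(host, port) else "not reachable"
        print(url, state)
```

A closed port here usually points to the agents not being installed or to a firewall between you and the server, rather than to a problem with the Survey Utility itself.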

Survey Utility Legacy Version

The Survey Utility is the legacy online information-gathering agent that runs on ProLiant servers under NetWare, Windows, and Linux. This utility was designed to facilitate the resolution of problems without taking the server offline. It gathers critical hardware and software information from various sources and saves it to the survey.txt file. A collection of the last 10 snapshots, or sessions, is saved in the survey.idi file. Sessions captured on Linux systems are saved in individual survey text files that include date and time stamps in the file name. The current configuration can be viewed by browsing to the Survey Utility Web page. To use this utility, follow these steps:

1. For Windows, run the PSP setup.exe program or run the component.exe file and choose the Install button. The Management CD distribution can be installed by running setup.exe.

2. Open the survey.txt file, or select Survey Utility from the server's System Management Homepage.

HP Insight Diagnostics Online Edition Maintenance Utility

This new utility replaces the Survey Utility. Deployed from the PSP, the HP Insight Diagnostics Online Edition maintenance utility displays information about your server's hardware configuration. It is a new Web-enabled Management Agent provided with the ProLiant Essentials suite of products. As of SmartStart 7.1, HP Insight Diagnostics Online Edition featuring Survey and the Integrated Management Log (IML) Viewer will be replacing the Survey Utility previously included with SmartStart. The online version of HP Insight Diagnostics acts like the Survey Utility it is replacing and does not perform any hardware tests on the system. You will need to uninstall the Survey Utility before beginning the installation of HP Insight Diagnostics Online Edition.

Insight Diagnostics uses a Web browser interface in addition to the command-line interface in an online mode. This enables remote control of the utility and facilitates easy transfer of configuration information from remote machines to a service provider. It can be updated from VCRM and VCA, and with SIM offers proactive notification when an updated version is available.

You can use Insight Diagnostics Online Edition to

  • View the hardware configuration of the machine

  • View the server IML

  • View the software configuration of the machine

  • Compare historical configurations of the machine

  • Capture a new configuration sample
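To illustrate what comparing historical configurations amounts to, the sketch below diffs two configuration snapshots. The XML layout (`<survey>` with `<item name="...">` children) and the sample values are invented for this example; they are not the actual Insight Diagnostics session format, which should be checked against HP's documentation.

```python
import xml.etree.ElementTree as ET


def load_snapshot(xml_text):
    """Flatten a (hypothetical) survey snapshot into {item name: value}."""
    root = ET.fromstring(xml_text)
    return {
        item.get("name"): (item.text or "").strip()
        for item in root.iter("item")
    }


def compare_snapshots(old, new):
    """Return {name: (old value, new value)} for every item that differs."""
    changes = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changes[name] = (old.get(name), new.get(name))
    return changes


# Made-up sample data, for illustration only.
BEFORE = """<survey>
  <item name="System ROM">2003.10.01</item>
  <item name="Memory">1024 MB</item>
</survey>"""

AFTER = """<survey>
  <item name="System ROM">2004.05.01</item>
  <item name="Memory">1024 MB</item>
  <item name="NIC driver">7.80</item>
</survey>"""

for name, (was, now) in compare_snapshots(
        load_snapshot(BEFORE), load_snapshot(AFTER)).items():
    print(f"{name}: {was} -> {now}")
```

A value of None on either side marks an item that was added or removed between the two sessions, which is often the first clue when a problem appears after a change.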

Integrated Management Log (IML)

The IML can be viewed from the Logs tab of the Survey Utility in the System Management Homepage, or with the IML Viewer, which is started by clicking Start, Programs, HP System Tools. The IML utility allows you to view and manage the HP IML on both local and remote systems. The IML is a nonvolatile log used to record significant events that occur in a system and its components. The IML records system events, critical errors, power-on messages, and memory errors. It also records catastrophic hardware and software errors that typically cause a system to fail. The information contained in the log helps you quickly identify and correct problems, thus minimizing downtime.

The displayed IML entries include the following information:

  • Description : A brief description of the event, with details such as slot and chassis.

  • Class : The subsystem where the event occurred, such as Power Subsystem or Disk.

  • Severity : The severity of the event (one of four levels).

  • Count : The number of times the specific event has occurred.

  • Update Time : The time and date this event was updated.

  • Initial Time : The time and date this event was first entered into the log.

Additionally, a colored icon representing the severity of the event is displayed in the column with the event description.

  • Informational : Represents general information about a system event.

  • Repaired : Indicates that corrective action has been taken. Users must mark entries as repaired.

  • Caution : Represents a nonfatal error condition; action should be taken as soon as possible, but the situation is not critical.

  • Critical : Represents a component failure; action should be taken immediately.

The displayed log can be printed, sorted, filtered, saved to a disk file (for historical purposes), and exported to a CSV file (for import into third-party applications). Users with Administrator privileges can mark selected entries as "repaired" (after action has been taken to resolve the problem), and can clear all of the entries from a given machine's log. Logs that have been saved can also be viewed in the utility. These logs can be printed, sorted, and filtered just like the online logs; they can also be exported to a CSV file. The command functions (Mark Repaired and Clear All Entries), as well as Refresh, are disabled when viewing saved logs.

note

You must have Administrator privileges to enable the command functions of this utility.
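Once the log has been exported to a CSV file, it can be filtered outside the viewer. The sketch below assumes the export uses column headers that match the displayed fields (Description, Class, Severity, Count, Update Time, Initial Time); that header layout, the sample rows, and the ranking of the four severity names are assumptions made for illustration, so check them against a real export first.

```python
import csv
import io

# Severity names come from the IML description above; treating them as an
# ascending scale is an assumption made for this sketch.
SEVERITY_RANK = {"Informational": 0, "Repaired": 1, "Caution": 2, "Critical": 3}


def urgent_entries(csv_text, minimum="Caution"):
    """Return exported IML rows at or above `minimum`, most severe first."""
    threshold = SEVERITY_RANK[minimum]
    rows = csv.DictReader(io.StringIO(csv_text))
    hits = [r for r in rows if SEVERITY_RANK.get(r["Severity"], -1) >= threshold]
    return sorted(hits, key=lambda r: SEVERITY_RANK[r["Severity"]], reverse=True)


# Made-up sample export, for illustration only.
SAMPLE = """Description,Class,Severity,Count,Update Time,Initial Time
System fan inserted,Environment,Informational,1,01/05/2004 10:00,01/05/2004 10:00
Corrected memory errors exceeded threshold,Main Memory,Caution,3,01/07/2004 08:15,01/06/2004 22:40
Power supply failure,Power Subsystem,Critical,1,01/08/2004 02:12,01/08/2004 02:12
"""

for row in urgent_entries(SAMPLE):
    print(row["Severity"], "-", row["Class"], "-", row["Description"])
```

Rows with an unrecognized severity value are skipped rather than raising an error, so a slightly different export format degrades to an empty result instead of a crash.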

Insight Diagnostics (Offline)

Insight Diagnostics is a browser-based hardware testing application that can be run offline from the SmartStart CD. It replaces the Server Diagnostics Utility, which is a DOS-based, offline-only tool. It provides the following major features: three types of diagnostic testing, access to the IML, hardware data collection (Survey), and diagnostic test logs.

Tests can be configured to run in time-based or loop-based modes, and interactive or unattended testing modes; and can be customized to test any desired combination of hardware devices. Failures and other errors are gathered in the error log, and a report ticket can be generated that contains all diagnostic errors and IML records. The report ticket can be printed or saved for further troubleshooting. To use Insight Diagnostics:

1. Boot to the SmartStart CD.

2. Select the Maintenance tab.

3. Select Server Diagnostics from the Maintenance Utilities menu.

BIOS Serial Console and EMS

BIOS Serial Console, which is the focus of Insight Diagnostics, can be enabled in RBSU. By default, BIOS Serial Console is disabled.

EMS Support Overview

Emergency Management Services (EMS) support is a Microsoft feature of the Windows Server 2003 OS. It is enabled by default in the OS, but it must also be enabled in the system ROM. Refer to "Operating System Support" in Chapter 2 of the HP BIOS Serial Console User Guide for more information about using supported OSs. By default, EMS support is disabled for ML and DL servers, and is enabled for BL servers.

Configuration in RBSU

As discussed, the BIOS Serial Console/EMS feature is enabled in RBSU. When Enable Local is selected, the OS redirects through the local serial port. When Enable Remote is selected, the OS redirects through iLO or RILOE II. Data becomes available through the browser configured for iLO instead of through the serial port. Enabling remotely requires iLO 1.10 firmware or later.

Emergency Management Services (EMS)

EMS support provides I/O support for all Windows kernel components: the loader, setup, recovery console, OS kernel, blue screens, and the Special Administration Console. The Special Administration Console is a text-mode management console available after the Windows Server 2003 OS has initialized. For more information on EMS support, go to http://www.microsoft.com/hwdev/headless.

Microsoft enables EMS support in the OS, but EMS support also requires ROM support. EMS support, when enabled, assumes the serial port for redirection and can cause interference with other devices attached to the serial port. To avoid interference, EMS is disabled in the system ROM by default on ML and DL servers. To enable this feature, Enable Local or Enable Remote must be selected under the BIOS Serial Console/EMS Support menu in RBSU before installing Windows Server 2003. If you install Windows Server 2003 with EMS disabled, and later decide to enable it, perform the following steps to update the boot.ini file:

1. Enable EMS in RBSU.

2. Run bootcfg /ems on /id 1 from the Windows command line.

3. Reboot.
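After the reboot, you can confirm that boot.ini was updated. On Windows Server 2003, enabling EMS with bootcfg is expected to add a redirect= line to the [boot loader] section and a /redirect switch to the selected OS entry; treat that as an assumption to verify against Microsoft's bootcfg documentation. The sketch below parses boot.ini text and reports both; the function name and the sample file contents are illustrative only.

```python
def ems_status(boot_ini_text):
    """Return (redirect_port, {os_entry_path: has_redirect_switch})."""
    section = None
    redirect_port = None
    entries = {}
    for raw in boot_ini_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().lower()
        elif section == "boot loader" and line.lower().startswith("redirect="):
            redirect_port = line.split("=", 1)[1].strip()
        elif section == "operating systems" and "=" in line:
            path, options = line.split("=", 1)
            entries[path.strip()] = "/redirect" in options.lower().split()
    return redirect_port, entries


# Illustrative boot.ini contents; a real file will differ.
SAMPLE_BOOT_INI = r"""
[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
redirect=COM1
redirectbaudrate=115200
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003, Enterprise" /fastdetect /redirect
"""

port, entries = ems_status(SAMPLE_BOOT_INI)
print("redirect port:", port)
for path, enabled in entries.items():
    print(path, "-> EMS", "on" if enabled else "off")
```

If the port is reported as None or the entry shows EMS off, the bootcfg step did not take effect for that OS entry, and the /id value is the first thing to recheck.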

Using iLO and RILOE II for Remote Troubleshooting

iLO-Based Diagnostics

iLO is an intelligent processor integrated into newer ProLiant servers that provides remote management and administration of a server through a standard browser. iLO provides the following reporting and diagnostic features: access to the IML, iLO event log, server POST results, iLO self-test results, graphical remote console, virtual power button, virtual floppy, and virtual CD.

The Remote Console can be used to monitor the system for POST error messages. The IML and iLO event log record events that are useful for troubleshooting server issues. The virtual floppy and virtual CD-ROM (if licensed) can be used to boot and run Server Diagnostics. A new feature with iLO is the capability to record server Port 84 POST codes as the system boots. These codes document the progress of the server through the bootstrap process.

To access iLO or RILOE II, use any of the following methods:

  • Browse to the RILOE II or iLO network address or DNS name.

  • Select RILOE II or iLO from the server's System Management Homepage.

  • Select the correct RILOE II or iLO Management Processor from the Device list in SIM or IM7.

Running Diagnostics on a Remote System Using the iLO or RILOE II

To run Insight Diagnostics from iLO or RILOE II virtual CD, follow these steps:

1. Browse to and log in to iLO or RILOE II.

2. Select Virtual Media from the Virtual Devices tab.

3. Select Local CD Drive.

4. Select the correct drive letter for the local CD-ROM.

5. Click the Connect button.

6. Insert the SmartStart CD into the local CD-ROM drive.

7. Reboot the server. This boots and loads the SmartStart CD.

8. Select Server Diagnostics from the Maintenance Utilities menu in SmartStart.

To run Server Diagnostics from the iLO or RILOE II virtual floppy, follow these steps:

1. Browse to and log in to iLO or RILOE II.

2. Select Virtual Media from the Virtual Devices tab.

3. Select Local Floppy Drive or Local Image File.

4. If you are using the local floppy drive, click on the drop-down box and select the correct floppy drive letter. Click the Connect button. If you are using an image file, type in the name of the file or browse to it using the Browse button.

5. Reboot the server to boot to the diagnostics utility.

The tools and methods used in troubleshooting ProLiant servers give administrators and managers powerful capabilities for reducing downtime, both by resolving problems directly and by helping HP locate the problem in a timely fashion.

note

iLO now provides "Terminal Services Pass Through" for Windows remote console sessions. ProLiant servers with the iLO advanced pack enabled can leverage iLO's remote console function to provide Terminal Services pass through of a Windows Remote Desktop Connection to Windows Server. See Chapter 15 for details.
