Upgrading and Repairing Servers

Servers store the vital records for your business (or your clients' businesses). When a server goes down, it's essential to get it up and working again. The following sections discuss some of the best ways to troubleshoot problems related to motherboards.

Troubleshooting by Replacing Parts

You can troubleshoot a server in several ways, but in the end, it often comes down to simply reinstalling or replacing parts. That is why you should normally use a simple "known-good spare" technique that requires very little in the way of special tools or sophisticated diagnostics. In its simplest form, say you have two identical servers sitting side by side. One of them has a hardware problem; in this example, let's say one of the memory modules (DIMMs) is defective. Depending on how and where the defect lies, this could manifest itself in symptoms ranging from a completely dead system to one that boots up normally but crashes when running the operating system or a particular application. You observe that the system on the left has the problem but the system on the right works perfectly; they are otherwise identical. The simplest technique for finding the problem is to swap parts from one system to the other, one at a time, retesting after each swap. When you swap the DIMMs and then power up and test (in this case, testing is nothing more than allowing the system to boot up and run some of the installed applications), the problem moves from one system to the other. Because the last item swapped was the DIMM, you have just identified the source of the problem. This does not require an expensive ($2,000 or more) DIMM test machine or any diagnostics software. Because components such as DIMMs are not economical to repair, replacing the defective DIMM provides the needed solution.
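The swap-and-retest loop described above can be sketched in a few lines of Python. This is only an illustration: the part names, the `works` test, and the `find_faulty_part` helper are hypothetical stand-ins for physically swapping components and booting the server.

```python
from typing import Callable, Dict, List, Optional

System = Dict[str, str]  # slot name -> part identifier (hypothetical model)


def find_faulty_part(
    suspect: System,
    known_good: System,
    works: Callable[[System], bool],
    swap_order: List[str],
) -> Optional[str]:
    """Swap one part at a time, retest, and return the slot whose swap moved the fault."""
    for slot in swap_order:
        # Swap a single component between the two systems.
        suspect[slot], known_good[slot] = known_good[slot], suspect[slot]
        # Retest both systems: the fault has moved if the roles are reversed.
        if works(suspect) and not works(known_good):
            return slot
    return None


# Hypothetical test: a system "works" unless it contains the bad DIMM.
def boots_and_runs(system: System) -> bool:
    return "dimm-BAD" not in system.values()


left = {"psu": "psu-1", "dimm": "dimm-BAD", "nic": "nic-1"}
right = {"psu": "psu-2", "dimm": "dimm-2", "nic": "nic-2"}
culprit = find_faulty_part(left, right, boots_and_runs, ["psu", "nic", "dimm"])
print(culprit)  # dimm
```

The important property is that only one part changes between tests, so the swap that moves the symptom identifies the defective component.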

Although this example is very simplistic, replacing parts is often the quickest and easiest way to identify a problem component as opposed to specifically testing each item with diagnostics. What if you don't have an identical system? You can maintain an inventory of known-good spare parts. These are parts that have been previously used, are known to be functional, and can be used to replace a suspicious part in a problem machine. However, this is different from new replacement parts because, when you open a box containing a new component, you really can't be 100% sure that it works. In some situations, you may replace a defective component with another (unknown to you) defective new component, and the problem remains. Not knowing that the new part you just installed was also defective, you could waste a lot of time checking other parts that are not the problem. This technique is also effective because few parts are needed to make up an entry-level server, and the known-good parts don't always have to be the same (for example, a lower-end NIC can be substituted in a system to verify that the original card had failed).

If you are troubleshooting a server that uses a proprietary architecture, such as one that uses processor or memory cartridges, blade-based components, and so on, we recommend that you pull temporary replacements for swapping from an identical system to avoid compatibility problems.

Troubleshooting Using the Bootstrap Approach

Another variation on the replacing-parts theme is the bootstrap approach, which is especially good for what seems to be a dead system. In this approach, you strip the system down to the bare minimum of necessary, functional components and test it to see whether it works. For example, you might strip down a server to the chassis/power supply, bare motherboard, CPU (with heatsink), one bank of RAM, and the display and then power it up to see whether it works. In that stripped configuration, you should see the POST or splash (logo) screen on the display, verifying that the motherboard, CPU, RAM, onboard video, and display are functional. If a keyboard is connected, you should see the three LEDs (Caps Lock, Scroll Lock, and Num Lock) flash within a few seconds after power-on. This indicates that the CPU and motherboard are functioning because the POST routines are testing the keyboard. After you get the system to a minimum of components that are functional, you should reinstall or add one part at a time, testing the system each time you make a change to verify that it still works and that the part you added or changed is not the cause of a problem. Essentially, you rebuild the system from scratch, using the existing parts, but you do it one step at a time.
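As a rough sketch of the bootstrap approach, the following Python models the rebuild as a loop that adds one part and retests. The part names and the `powers_up` test are hypothetical; in practice the "test" is you powering on the server and watching the POST.

```python
from typing import Callable, List, Optional, Set


def bootstrap_rebuild(
    minimum: List[str],
    remaining: List[str],
    works: Callable[[Set[str]], bool],
) -> Optional[str]:
    """Start from the minimum configuration, add one part at a time, and
    return the first part whose addition makes the system fail."""
    installed: Set[str] = set(minimum)
    if not works(installed):
        # The fault is in the stripped-down system itself.
        raise RuntimeError("minimum configuration fails; swap those parts first")
    for part in remaining:
        installed.add(part)
        # Test after every single change.
        if not works(installed):
            return part
    return None


# Hypothetical fault: the system fails whenever the NIC is installed.
def powers_up(installed: Set[str]) -> bool:
    return "NIC" not in installed


failing = bootstrap_rebuild(
    ["power supply", "motherboard", "CPU", "one bank of RAM", "display"],
    ["SCSI adapter", "NIC", "second bank of RAM"],
    powers_up,
)
print(failing)  # NIC
```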

Many times, problems are caused by corrosion on contacts or connectors, so the mere act of disassembling and reassembling a server will "magically" repair it. In some cases, you may disassemble, test, and reassemble systems only to find no problems after the reassembly. How can merely taking a system apart and reassembling it repair a problem? Although it might seem that nothing was changed and everything is installed exactly as it was before, in reality, simply unplugging and replugging renews all the slot and cable connections between devices, which is often all the system needs.

Some useful troubleshooting tips include the following:

  • Eliminate unnecessary variables or components that are not pertinent to the problem.

  • Reinstall, reconfigure, or replace only one component at a time.

  • Test after each change you make.

  • Keep a detailed record (write it down) of each step you take.

  • Don't give up! Every problem has a solution.

  • If you hit a roadblock, take a break or work on another problem. A fresh approach the next day often reveals things you overlooked.

  • Don't overlook the simple or obvious. Double- and triple-check the installation and configuration of each component.

  • Keep in mind that the power supply is one of the most failure-prone parts in a server, as well as one of the most overlooked components. A high-output known-good spare power supply is highly recommended to use for testing suspect systems that use standard ATX or SSI form factors. For servers based on proprietary form factors, swap a power supply from a known-working system for testing.

  • Cables and connections are major causes of problems, so keep replacements of all types on hand.

Before starting any system troubleshooting, you should perform a few basic steps to ensure a consistent starting point and to isolate the failed component:

1. Turn off the server and any peripheral devices. Disconnect all external peripherals from the system, except for the keyboard and video display.

2. Make sure the server is plugged in to a properly grounded power outlet.

3. If the server can be managed locally, make sure the keyboard and video displays are connected to the system. Turn on the video display and turn up the brightness and contrast controls to at least two-thirds of the maximum. Some displays have onscreen controls that might not be intuitive. Consult the display documentation for information on how to adjust these settings. If you can't get any video display but the system seems to be working, try moving the video card to a different slot or try a different video card or monitor.

4. To enable the system to boot from a hard disk, make sure no floppy disk is in the floppy drive. Or put a known-good bootable floppy with diagnostics on it in the floppy drive for testing.

5. Turn on the system. Observe the power supply, chassis fans (if any), and lights on either the system front panel or power supply. If the fans don't spin and the lights don't light, the power supply or motherboard might be defective.

6. Observe the POST. If no errors are detected, the system beeps once and boots up. Errors that display onscreen (nonfatal errors) and that do not lock up the system display a text message that varies according to BIOS type and version. Record any errors that occur and refer to the tables earlier in this chapter for help with text error messages or beep codes.

7. Confirm that the operating system loads successfully.

Problems During the POST

Problems that occur during the POST are usually caused by incorrect hardware configuration or installation. Actual hardware failure is a far less-frequent cause. If you have a POST error, check the following:

  • Are all cables correctly connected and secured?

  • Are the configuration settings correct in setup for the devices you have installed? In particular, ensure that the processor, memory, and hard drive settings are correct.

  • Is the motherboard configured properly? Most recent systems use BIOS-based configuration settings for processor speeds and clock multipliers, so you might need to enter the BIOS setup program to check and change configurations. If the motherboard uses jumper blocks or DIP switches for some or all configuration, check those as well.

  • Are all resource settings on add-in boards and peripheral devices set so that no conflicts exist (for example, two add-in boards sharing the same interrupt)?

  • Is the power supply set to the proper input voltage (110V–120V or 220V–240V)?

  • Are adapter boards and disk drives installed correctly?

  • Is a bootable hard disk (properly partitioned and formatted) installed? (This may not apply to some systems.)

  • Does the BIOS support the drive you have installed, and if so, are the parameters entered correctly?

  • Is a bootable floppy disk installed in drive A:?

  • Are all memory SIMMs or DIMMs installed correctly? Try reseating them and moving them around in different slots.

  • Is the operating system properly installed?

Problems Running Software

Problems running application software (especially new software) are usually caused by or related to the software itself or are due to the fact that the software is incompatible with the system. Here is a list of items to check in that case:

  • Check whether the system meets the minimum hardware requirements for the software. Check the software documentation to be sure.

  • Ensure that the software is correctly installed. Reinstall it if necessary.

  • Check to see that the latest drivers are installed.

  • Scan the system for viruses using the latest antivirus software.

Resource Conflicts

Problems related to add-in boards are usually related to improper board installation or resource (interrupt, DMA, or I/O address) conflicts. You need to be sure to check drivers for the latest versions and ensure that the card is compatible with your system and the operating system version you are using.

Sometimes adapter cards can be picky about which slot they are running in. Despite the fact that, technically, a PCI or ISA adapter should be able to run in any of the slots, minor timing or signal variations sometimes occur from slot to slot. Simply moving a card from one slot to another can make a failing card begin to work properly. Sometimes moving a card works just by the inadvertent cleaning (wiping) of the contacts that takes place when removing and reinstalling the card, but in other cases you can duplicate the problem by inserting the card back into its original slot. When all else fails, you should try moving the cards around. Because some motherboards share a single IRQ between two PCI slots or between a PCI and an AGP slot, changing one of the PCI cards to another slot can resolve conflicts.
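When you suspect a resource conflict, it can help to list which devices share an interrupt. The sketch below assumes you have already written down the IRQ assignments (for example, from the BIOS Setup or the operating system's device listing); the device names and IRQ numbers are hypothetical. Keep in mind that a shared IRQ between PCI devices is not always a fault; the list only shows where to look first.

```python
from collections import defaultdict
from typing import Dict, List


def find_shared_irqs(assignments: Dict[str, int]) -> Dict[int, List[str]]:
    """Group devices by IRQ and keep only the IRQs used by more than one device."""
    by_irq: Dict[int, List[str]] = defaultdict(list)
    for device, irq in assignments.items():
        by_irq[irq].append(device)
    return {irq: sorted(devs) for irq, devs in by_irq.items() if len(devs) > 1}


# Hypothetical assignments recorded from a problem server.
assignments = {
    "NIC in PCI slot 1": 11,
    "SCSI adapter in PCI slot 2": 11,
    "serial port COM1": 4,
}
shared = find_shared_irqs(assignments)
print(shared)  # {11: ['NIC in PCI slot 1', 'SCSI adapter in PCI slot 2']}
```

If two cards turn up on the same IRQ, moving one of them to a different slot is the first thing to try.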

Caution

Note that PCI cards become slot specific after their drivers are installed. So if you move the card to another slot, the PnP resource manager sees it as if you have removed one card and installed a new one. You must therefore install the drivers all over again for that card. You should not move a PCI card to a different slot unless you are prepared with all the drivers at hand to perform the driver installation. ISA cards don't share this quirk because the system is not aware of which slot an ISA card is in.

Special Server Problems

If problems occur after a system has been running and without any hardware or software changes having been made, a hardware fault has possibly occurred. Here is a list of items to check in that case:

  • Try reinstalling the software that has crashed or refuses to run.

  • Try clearing CMOS RAM (many systems use a jumper on the motherboard to clear CMOS) and running setup.

  • Check for loose cables, a marginal power supply, or other random component failures.

  • Check to see if a transient voltage spike, power outage, or brownout might have occurred. Symptoms of voltage spikes include a flickering video display, unexpected system reboots, and the system not responding to user commands. Reload the software and try again.

  • Try reseating the memory modules (SIMMs, DIMMs, or RIMMs).

The following sections answer some of the most frequently asked troubleshooting questions.

When I power on the system, I see the power LED light and hear the fans spin, but nothing else ever happens

The fact that the LEDs illuminate and fans spin indicates that the power supply is partially working, but that does not exclude it from being defective. This is a classic "dead" system, which can be caused by almost any defective hardware component.

Power supplies seem to have more problems than most other components, so you should immediately use a multimeter to measure the outputs at the power supply connectors and ensure that they are within ±5% of their rated voltages. Even if the voltage measurements check out, you should swap in a high-quality, high-power, known-good spare supply and retest. If that doesn't solve the problem, you should revert to the bootstrap approach mentioned earlier, which is to strip the system down to just the chassis/power supply, motherboard, CPU (with heatsink), one bank of RAM (one DIMM), and a video card and display. If the motherboard now starts, begin adding the components you removed, one at a time, retesting after each change. If the symptoms remain, use a POST card (if you have one) to see whether the board is partially functional and where it stops. Also, try replacing the video card, RAM, CPU, and finally the motherboard, and verify the CPU and (especially) the heatsink installation.
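The 5% tolerance check is simple arithmetic, sketched here in Python. The rail names and meter readings are example values chosen for illustration, not measurements from a real system.

```python
from typing import Dict, List, Tuple


def within_tolerance(rated: float, measured: float, tolerance: float = 0.05) -> bool:
    """True if the measured voltage is within the given fraction of the rated voltage."""
    return abs(measured - rated) <= abs(rated) * tolerance


def out_of_spec(readings: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the rails whose (rated, measured) pair falls outside the tolerance."""
    return [rail for rail, (rated, measured) in readings.items()
            if not within_tolerance(rated, measured)]


# Example multimeter readings as (rated, measured) volts; values are illustrative.
readings = {
    "+3.3V": (3.3, 3.31),
    "+5V": (5.0, 4.62),    # 0.38V low; the allowed deviation is 0.25V
    "+12V": (12.0, 12.2),
}
print(out_of_spec(readings))  # ['+5V']
```

A rail that passes this check can still sag under load, which is why swapping in a known-good supply remains the more conclusive test.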

The system beeps when I turn it on, but there is nothing on the screen

The beep indicates a failure detected by the ROM POST routines. Look up the beep code in the table corresponding to the ROM version in your motherboard. This can typically be found in the motherboard manual; however, you can also find the beep codes for the most popular AMI, Award, and Phoenix BIOS earlier in this chapter.

I see STOP or STOP ERROR in Windows NT/2000/2003

Many things, including corrupted files, viruses, incorrectly configured hardware, and failing hardware, can cause Windows STOP errors. The most valuable resource for handling any error message displayed by Windows is the Microsoft Knowledge Base (MSKB), an online compendium of more than 250,000 articles covering all Microsoft products. You can visit the Knowledge Base at http://support.microsoft.com, and from there you can use the search tool to retrieve information specific to your problem.

For example, say you are receiving Stop 0x0000007B errors in Windows Server 2003. In this case, you should visit the Knowledge Base and enter the error message in the search box. For example, you can type stop 7B error Windows Server 2003 in the box, and the Knowledge Base gives you two articles, one of which is Microsoft Knowledge Base article number 324103, titled "HOW TO: Troubleshoot 'Stop 0x0000007B' Errors in Windows XP." When you click this link, you are taken to the article at http://support.microsoft.com/default.aspx?scid=kb;en-us;324103, which has a complete description of the problem and solutions for Windows XP and related operating systems (Windows Server 2003 is based on Windows XP). The article states that this error could be caused by the following:

  • Boot-sector viruses

  • Device driver issues

  • Hardware issues

  • Other issues

The article explains each issue and solution in detail. All things considered, the Knowledge Base is a valuable resource for dealing with any problems related to or reported by any version of Windows or any other Microsoft software.

I'm having other types of Windows problems

This is another example where the Microsoft Knowledge Base comes to the rescue. For example, assume that you can't shut down your Windows Server 2003–based server. By searching for shutdown problems Windows Server 2003 (substitute the version of Windows you are using), you can quickly find several articles that can help you troubleshoot this type of problem. This problem has been caused by bugs in motherboard ROM (try upgrading your motherboard ROM to the latest version), bugs in the various Windows versions (visit www.windowsupdate.com and install the latest fixes, patches, and service packs), and configuration or hardware problems. The Knowledge Base articles provide more complete explanations of the Windows issues.

I'm having problems with Linux

Because there are many different Linux distributions in use, there are several places to check for help with Linux-related problems of all types. In addition to checking the official website for your Linux distribution, try these additional resources:

  • www.debian-administration.org For administrators of Debian GNU/Linux–based distributions

  • www.aboutdebian.org For users of Debian GNU/Linux and related distributions

  • www.linuxquestions.org Forums, tutorials, podcasts, and other help

  • http://linux-nfs.org Help for using the Network File System with Linux, including client and server patches, bugs, and FAQs

  • www.apachefriends.org Help for users of the Apache webserver

  • www.linux.org News, distributions, tutorials, and other help for Linux users

  • www.tdlp The Linux Documentation Project is full of FAQs and how-tos

  • http://linux-ip.net/html/linux-ip.html A useful guide to Linux networking with TCP/IP

I'm having problems with UNIX

UNIX implementations, unlike Linux, are proprietary to a particular hardware platform. However, many commands are the same or are very similar across different UNIX platforms. In addition to checking with your hardware vendor for help with your UNIX implementation, try the following websites:

  • www.unix.com The UNIX Forums provide general and distribution-specific help with most versions of UNIX, including Sun Solaris, HP-UX, AIX, SCO, and BSD, as well as OS X (Apple), and Linux.

  • www.tek-tips.com The Tek-Tips website has help for various UNIX-based operating systems. Select Forums, MIS/IT, Operating Systems - UNIX Based to find version-specific help for virtually all UNIX distributions as well as FreeBSD.

  • www.osdata.com/kind/unix.htm The UNIX page at OSdata.com provides links to extensive coverage of specific UNIX implementations. Information such as features, FAQs, links to official and third-party websites, and other help is provided for each implementation.

The power button won't turn off the system

Servers that use the ATX, BTX, or SSI form factors use power supply designs in which the case power switch is connected to the motherboard and not directly to the power supply. This enables the motherboard and operating system to control system shutdown, preventing an unexpected loss of power that could cause data loss or file system corruption. However, if the system experiences a problem and becomes frozen or locked up in some way, the motherboard might not respond to the power button, meaning it does not send a shutdown signal to the power supply. It might seem that you will have to pull the plug to power off the system, but fortunately, a forced shutdown override is provided. You merely press and hold down the system power button (usually on the front of the chassis) for a minimum of 4 seconds, and the system should power off. The only drawback is that because this type of shutdown is forced and not under the control of the motherboard or operating system, unsaved data can be lost, and some file system corruption could result. You should therefore run Chkdsk in Windows to check and correct any file-system issues after a forced shutdown.

If the system fails to power up after you perform an internal upgrade, make sure the front panel power switch cable is properly connected to the appropriate header pins on the motherboard.

For servers that use proprietary form factors, see the system documentation to determine how the power switch operates and how to shut down a system if the power switch fails. Note that in some situations, the power switch might be connected to AC power rather than to DC power. Be sure to disconnect the system from all external power to prevent electric shock.

The modem doesn't work

First verify that the phone line is good and that you have a dial tone. Then check and, if necessary, replace the phone cable from the modem to the wall outlet. If the modem is integrated into the motherboard, check the BIOS Setup to ensure that the modem is enabled. Try clearing the ESCD in the BIOS Setup. This forces the PnP routines to reconfigure the system, which can resolve any conflicts. If the modem is internal and you aren't using the COM1/COM2 serial ports integrated into the motherboard (as for an external modem), try disabling the serial ports to free up additional system resources. Also try removing and reinstalling the modem drivers, ensuring that you are using the most recent drivers from the modem manufacturer. If that doesn't help, try physically removing and reinstalling the modem. If the modem is internal, install it in a different slot. If the modem is external, make sure it has power and is properly connected to the serial or USB port on the PC. Try replacing the external modem power brick and the serial/USB cable. Finally, if you get this far and the modem still doesn't work, try replacing the modem and finally the motherboard.

Note that modems are very susceptible to damage from nearby lightning strikes. Consider adding lightning arrestors or surge suppressors on the phone line running to the modem and unplug the modem during storms. If the modem has failed after a storm, you can be almost certain that it has been damaged by lightning. The strike might have damaged the serial port or motherboard in addition to the modem. Any items damaged by lightning will most likely need to be replaced.

The keyboard doesn't work

The two primary ways to connect a keyboard to a server are via the standard keyboard port (usually called a PS/2 port) and via USB. One problem is that many older systems that have USB ports cannot use a USB keyboard because USB support is provided only by the operating system; this is the case, for instance, if the motherboard has a USB port but does not include USB Legacy Support in the BIOS. This support is specifically for USB keyboards (and mouse devices) and was not common in servers until 1998 or later. Many servers that had such support in the BIOS still had problems with the implementation; in other words, they had bugs in the code that prevented the USB keyboard from working properly. If you are having problems with a USB keyboard, check to ensure that USB Legacy Support is enabled in the BIOS. If you are still having problems, make sure you have installed the latest BIOS for your motherboard and any Windows updates from Microsoft. Some older systems never could properly use a USB keyboard, in which case you should change to a PS/2 keyboard instead. Some keyboards feature both USB and PS/2 interfaces, which offer the flexibility to connect to almost any system.

If a PS/2 keyboard is having problems, the quickest way to verify whether the problem is the keyboard or the motherboard is to replace the keyboard with a known-good spare. In other words, borrow a working keyboard from another system and try it. If it still doesn't work, the keyboard controller on the motherboard is most likely defective, which means the entire board must be replaced.

The monitor appears completely garbled or unreadable

A completely garbled screen is most often due to improper, incorrect, or unsupported settings for the refresh rate, resolution, or color depth. Using incorrect drivers can also cause this. To check the configuration of the video card or onboard video, the first step is to power on the system and verify whether you can see the POST or the system splash screen and enter the BIOS Setup. If the screen looks fine during the POST but goes crazy after Windows starts to load, the problem is almost certainly due to an incorrect setting or configuration of the card. To resolve this, boot a system running Windows 2000 Server or Windows Server 2003 in VGA mode (hold down the F8 function key as Windows starts to load and select VGA mode from the special startup menu listing). This bypasses the current video driver and settings and places the system in the default VGA mode supported by the BIOS on the video card. When the Windows desktop appears, right-click the desktop, select Properties, and then either reconfigure the video settings or change drivers, as necessary.

If the problem occurs from the moment you turn on the system, a hardware problem definitely exists with the video card, cable, or monitor. First, replace the monitor with another one; if the cable is detachable, replace that, too. If replacing the monitor and cable does not solve the problem, the video card or integrated video is probably defective. If the motherboard uses integrated video, install a PCI video card and use it instead. If the system uses a video card, move the video card to a different slot. If video continues to malfunction, replace the card.

The image on the display is distorted (bent), shaking, or wavering

This can often be caused by problems with the power line, such as an electric motor, an air conditioner, a refrigerator, or another device causing interference. Try replacing the power cord, plugging the monitor and/or the system in to a different outlet, or moving it to a different location entirely. This problem can also be caused by local radio transmitters such as a nearby radio or television station or two-way radios being operated in the vicinity of the system. If the monitor image is bent and discolored, it could be due to the shadow mask being magnetized. Turn the monitor on and off repeatedly; this causes the built-in degaussing coil around the perimeter of the tube to activate in an attempt to demagnetize the shadow mask. If this seems to work partially but not completely, you might need to obtain a professional degaussing coil from an electronics or TV service shop to demagnetize the mask. Next, replace the monitor cable, try a different (known-good) monitor, and, finally, replace the video card.

I installed an upgraded processor, but it won't work

First, make sure the motherboard supports the processor that is installed. Also make sure you are using the latest BIOS for your motherboard; check with the motherboard manufacturer to see whether any updates are available for download and install them if any are available. Check the jumper settings (on older boards) or BIOS Setup screens to verify that the processor is properly identified and set properly with respect to the FSB (or CPU bus) speed, clock multiplier, and voltage settings. Make sure the processor is set to run at its rated speed and is not overclocked. If any of the CPU settings in the BIOS Setup are on manual override, set them to automatic instead. Then reseat the processor in the socket. Next, make sure the heatsink is properly installed and you are using thermal interface material (that is, thermal grease) at the mating junction between the CPU and heatsink.

Just because a processor fits in the socket (or slot) on your motherboard does not mean it will work. For a processor to work in a system, the following are required:

  • The CPU must fit in the socket. Because processors with different specifications sometimes use the same socket but the pinout might vary, you must make sure the motherboard supports the processor type as well as the pinout. For example, the Pentium D processor uses the same Socket LGA775 as late-model Pentium 4 processors, but most Pentium 4 server motherboards are not compatible with the Pentium D.

  • The motherboard must support the voltage required by the CPU. Modern motherboards set voltages by reading voltage ID (VID) pins on the processor and then setting the onboard voltage regulator module (VRM) to the appropriate settings. Older boards might not support the generally lower voltage requirements of newer processors.

  • The motherboard ROM BIOS must support the CPU. Modern boards read the CPU to determine the proper FSB (or CPU bus) speed settings as well as the clock multiplier settings for the CPU. Many CPUs have different requirements for cache settings and initialization, as well as for bug fixes and workarounds.

  • The motherboard chipset must support the CPU. In some cases, specific chipset models or revisions might be required to support certain processors.

Before purchasing an upgraded processor for a system, you should first check with the motherboard manufacturer to see whether your board supports the processor. If so, it will meet all the requirements listed previously. Often, BIOS updates are available that enable newer processors to be supported in older boards, beyond what was originally listed in the manual when the board was new. The only way to know for sure is to check with the motherboard manufacturer for updated information regarding supported processors for a particular board.
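The four requirements listed earlier amount to a checklist that must pass in full. The following Python sketch shows the idea; the `Board` and `Cpu` structures, the voltage range, and the support lists are hypothetical, and in practice this information comes from the motherboard manufacturer's documentation.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Cpu:
    model: str
    socket: str
    vcore: float  # required core voltage (illustrative value)


@dataclass
class Board:
    socket: str
    min_vcore: float  # lowest voltage the VRM can supply (illustrative)
    max_vcore: float  # highest voltage the VRM can supply (illustrative)
    bios_supported: Set[str]
    chipset_supported: Set[str]


def upgrade_blockers(board: Board, cpu: Cpu) -> List[str]:
    """Return every requirement the CPU fails; an empty list means it should work."""
    problems = []
    if cpu.socket != board.socket:
        problems.append("socket mismatch")
    if not (board.min_vcore <= cpu.vcore <= board.max_vcore):
        problems.append("voltage not supported by the VRM")
    if cpu.model not in board.bios_supported:
        problems.append("BIOS does not support this CPU")
    if cpu.model not in board.chipset_supported:
        problems.append("chipset does not support this CPU")
    return problems


# Hypothetical Pentium 4 server board; the numbers are made up for illustration.
board = Board("LGA775", 1.2, 1.5, {"Pentium 4"}, {"Pentium 4"})
pentium_4 = Cpu("Pentium 4", "LGA775", 1.3)
pentium_d = Cpu("Pentium D", "LGA775", 1.3)
print(upgrade_blockers(board, pentium_d))
```

In this example the Pentium D fits the socket and passes the voltage check but still fails the BIOS and chipset checks, which mirrors the Socket LGA775 case described earlier.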

The system runs fine for a few minutes but then freezes or locks up

This is the classic symptom of a system that is overheating. Most likely the CPU is overheating, but other components, such as the memory or motherboard chipset, could also be overheating. If the system is custom built from standard components, the design might be insufficient for proper cooling, and bigger heatsinks, more fans, or other solutions might be required. If the system was working fine but now is exhibiting this problem, check to see whether the problem started after any recent changes were made. If so, the change that was made could be the cause of the problem. If no changes were made, most likely something such as a cooling fan has either failed or is starting to fail.

Most modern servers have several fans: one or two inside the power supply, one on the CPU (or positioned to blow on the CPU), and optionally others for the chassis. Slimline 1U and 2U servers often use arrays of multiple small-diameter internal fans. Verify that any and all fans are properly installed and spinning. They should not be making grinding or growling noises, which usually indicate bearing failure. Many newer systems have thermostatically controlled fans; in these systems, it is normal for the fan speeds to change with the temperature. Make sure that the chassis is several inches from walls and that the fan ports are unobstructed. Try removing and reseating the processor; then reinstall the CPU heatsink with new thermal interface material. Check the power supply and verify that it is rated sufficiently to power the system (most should be 300 watts or more). Use a digital multimeter to verify the voltage outputs of the power supply, which should be within ±5% of the rated voltage at each pin. Try replacing the power supply with a high-quality replacement or known-good spare.

I am experiencing intermittent problems with the hard drive(s)

Many entry-level servers use ATA (commonly called ATA/IDE or PATA) interface drives, which consist of a drive and integrated controller, a ribbon cable, and a host adapter circuit in the motherboard. Typically, intermittent problems are found with the cable and the drive; it is much rarer for the host adapter to fail or exhibit problems. Many problems occur with the cables. ATA drives use either 40-conductor or 80-conductor cables, with one 40-pin connector at each end and optionally one in the middle. Drives supporting transfer rates higher than ATA33 (33MBps or Ultra DMA Mode 2) must use 80-conductor cables. Check the cable to ensure that it is not cut or damaged; then try unplugging and replugging it in to the drive and motherboard. Check to see that the cable is not more than 18 inches (46cm) in length because that is the maximum allowed by the ATA specification. This is especially important when you are using the faster ATA100 or ATA133 transfer rates. Try replacing the cable with a new 80-conductor, 18-inch version.
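The two cable rules above (an 80-conductor cable for anything faster than Ultra DMA Mode 2, and a maximum length of 18 inches) can be written as a quick check. The function name and the example values are hypothetical.

```python
from typing import List

MAX_CABLE_INCHES = 18    # maximum length allowed by the ATA specification
FASTEST_40_WIRE_MODE = 2  # Ultra DMA Mode 2 (ATA33)


def ata_cable_problems(udma_mode: int, conductors: int, length_inches: float) -> List[str]:
    """Return the cable rules that a given drive/cable combination violates."""
    problems = []
    if udma_mode > FASTEST_40_WIRE_MODE and conductors != 80:
        problems.append("modes above Ultra DMA Mode 2 need an 80-conductor cable")
    if length_inches > MAX_CABLE_INCHES:
        problems.append("cable is longer than 18 inches")
    return problems


# Example: an ATA100 drive (Ultra DMA Mode 5) on a 24-inch, 40-conductor cable.
problems = ata_cable_problems(udma_mode=5, conductors=40, length_inches=24)
for problem in problems:
    print(problem)
```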

If replacing the cable does not help, replace the drive with a spare, install an OS, and test it to see whether the problem remains. If the problem does remain, the problem is with the motherboard, which most likely needs to be replaced.

SATA drives use a jacketed cable that is thicker but much narrower than ATA/IDE cables. If the SATA cable is folded or creased, replace it. Make sure the SATA cable is tightly connected to the host adapter and the drive.

SCSI drives use cables that resemble ATA/IDE but are wider. In addition to cable problems, SCSI drives and devices can also fail because of conflicting device ID numbers and termination issues.

If the drive continues to fail after you replace the data cables or connect it to another system, the problem is most likely with your original drive. You can simply replace it or try testing, formatting, and reinstalling to see whether the drive can be repaired. To do this, you need the low-level format or test software provided by the drive manufacturer. See the following websites for diagnostic and testing software:

  • Maxtor (includes former Quantum hard disk products) www.maxtor.com (In December 2005 Seagate announced plans to merge with Maxtor in the second half of 2006.)

  • Seagate www.seagate.com

  • Western Digital www.wdc.com

  • Hitachi (includes former IBM hard disk products) www.hitachigst.com

The system won't boot up; the screen says Missing operating system

When your system boots, it reads the first sector from the hard disk, called the master boot record (MBR), and runs the code contained in that sector. The MBR code then reads the partition table (also contained in the MBR) to determine which partition is bootable and where it starts. Then it loads the first sector of the bootable partition, called the volume boot record (VBR), which contains the operating system-specific boot code. However, before executing the VBR, the MBR checks to ensure that the VBR ends with the signature bytes 55AAh. The MBR displays the Missing operating system message if it finds that the first sector of the bootable partition (the VBR) does not end in 55AAh.
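
The same check can be reproduced on a raw disk image, which is sometimes useful when you want to confirm what the boot code is seeing. The Python sketch below is a minimal illustration, not the actual MBR code (which is a small machine-language program). It assumes 512-byte sectors and the conventional MBR layout, neither of which is spelled out above: four 16-byte partition entries starting at byte offset 446, a boot flag of 80h marking the bootable entry, and the starting sector stored as a little-endian value at bytes 8 through 11 of each entry. The function names are hypothetical.

```python
import struct

SECTOR_SIZE = 512             # assumed sector size
PARTITION_TABLE_OFFSET = 446  # start of the four 16-byte partition entries
ENTRY_SIZE = 16
SIGNATURE = b"\x55\xaa"       # last two bytes of a valid boot sector


def find_active_partition_lba(mbr):
    """Return the starting sector of the first bootable entry, or None."""
    for i in range(4):
        offset = PARTITION_TABLE_OFFSET + ENTRY_SIZE * i
        entry = mbr[offset:offset + ENTRY_SIZE]
        boot_flag = entry[0]
        start_lba = struct.unpack_from("<I", entry, 8)[0]
        if boot_flag == 0x80:
            return start_lba
    return None


def vbr_is_valid(image):
    """Mimic the MBR's test: does the bootable partition's first sector end in 55 AA?"""
    mbr = image[:SECTOR_SIZE]
    start = find_active_partition_lba(mbr)
    if start is None:
        return False  # no partition is marked bootable
    vbr = image[start * SECTOR_SIZE:(start + 1) * SECTOR_SIZE]
    return len(vbr) == SECTOR_SIZE and vbr[-2:] == SIGNATURE
```

To try it, read the beginning of a disk image file into a bytes object and pass it to vbr_is_valid. A result of False corresponds to the condition that produces the Missing operating system message, although the sketch does not tell you which of the causes below is responsible.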

Several things can cause this to occur, including the following:

  • The drive parameters entered in the BIOS Setup are incorrect or corrupted. These are the parameters that define the drive that you entered in the BIOS Setup, and they're stored in a CMOS RAM chip powered by a battery on the motherboard. Incorrect parameters cause the MBR program to translate differently and read the wrong VBR sector, thus displaying the Missing operating system message. A dead CMOS battery can also cause this because it loses or corrupts the stored drive translation and transfer mode parameters. In fact, a dead battery is one of the more likely causes. To repair this problem, check the CMOS battery and replace it if necessary; then run the BIOS Setup, go to the hard drive parameter screen, and enter the correct drive parameters. Note that most drive parameters should be set to Auto or Autodetect.

  • The drive is not yet partitioned and formatted on this system. This is a normal error if you try to boot the system from the hard disk before the OS installation is complete. Boot to an OS startup disk (floppy or CD) and run the setup program, which prompts you through the partitioning and formatting process during the OS installation.

  • The MBR and/or partition tables are corrupt. This can be caused by boot sector viruses, among other things. To repair the MBR on an x86-based server using Windows 2000 Server or Windows Server 2003, insert the original Windows distribution CD and shut down the computer. Turn on the computer and select the option to boot from the CD. Select Repair and the option to run the Recovery Console. Log in to the system and use the FIXBOOT and FIXMBR commands to rewrite boot files and fix the MBR. Exit the Recovery Console and restart the system. See the Microsoft Knowledge Base article 326215 at http://support.microsoft.com for details for Windows Server 2003, or see article 229716 for Windows 2000 Server. If you use Linux, see "All About Linux: How to Repair a Corrupt MBR and boot into Linux," at http://linuxhelp.blogspot.com/2005/11/how-to-repair-corrupt-mbr-and-boot.html. For other distributions or for UNIX versions, see your operating system's documentation for help.

The system is experiencing intermittent memory errors

If the memory was recently added or some other change was made to the system, you should undo that addition/change to see whether it is the cause. If it's not, remove and reseat all memory modules. If the contacts look corroded, clean them with contact cleaner and then apply contact enhancer for protection. Check the memory settings in the BIOS Setup; generally, all settings should be on automatic settings. Some BIOS setup programs refer to Automatic as "by SPD" or something similar (the SPD is the serial presence detect chip that stores the default memory timing settings on DIMM modules). Next, upgrade to the latest BIOS for your motherboard and remove all memory except one bank. Then run only one bank of memory, but in the second or third bank position. A socket can develop a problem, and most motherboards do not require that the sockets be filled in numeric order. Also, replace the remaining module with one of the others that was removed, a new module, or a known-good spare. Note that if your motherboard uses pairs of memory (as in a dual-channel or redundant arrangement), you might need to use two or more modules.

If you get this far, the problem is most likely either the motherboard or the power supply, or possibly some other component in the system. Remove other components from the system to see whether they are causing problems. Reseat the CPU and replace the power supply with a high-quality new unit or a known-good spare. Finally, try replacing the motherboard.

The system locks up frequently and sometimes reboots on its own

This is one of the classic symptoms of a power supply problem. The power supply is designed to send a special Power_Good signal to the motherboard when it has passed its own internal tests and outputs are stable. If this signal is dropped, even for an instant, the system resets. Problems with the power good circuit cause lockups and spontaneous rebooting. This can also be caused if the power at the wall outlet is not correct. Verify the power supply output with a digital multimeter; all outputs should be within ±5% of the rated voltages. Use a tester for the wall outlet to ensure that it is properly wired and verify that the voltage is near 120V. Replace the power cord or power strip between the power supply and wall outlet.

Unfortunately, the intermittent nature makes this problem difficult to solve. If the problem is not with the wall outlet power, check the power connection(s) between the power supply and the motherboard. If your server uses a standard 20- or 24-pin ATX- or SSI-style connector, intermittent operation can be caused by not snapping the connector into place. Shut off the system and verify that the connector is fully inserted and locked into place. If the system uses multiple power connectors (which is common with some types of redundant power supplies), make sure each connector is completely inserted and locked in place (if the connector features a locking mechanism).

If the system continues to perform erratically, determine whether you have another system of the same type that is working properly, and swap power supplies between the two systems. If the failing system performs properly with the known-good power supply, the original power supply is defective and should be replaced. If the problem stays with the original system, other power-related components, such as the power-supply paralleling board (PSPB) used on some Dell servers, redundant power supply modules, or internal power cables, may be defective. Continue to swap suspect parts for known-working parts until you determine the source of the problem. If the problem stops after you swap out a part, the swapped part has failed and should be replaced.

If the system continues to run erratically after you swap all parts of the power supply/distribution system, consider other components, such as memory modules, the processor, or the motherboard. Reseat the CPU and reinstall the heatsink with new thermal interface material. Then reseat the memory modules, run only one bank of memory, and replace the motherboard if all other options fail.

If you must replace the power supply, try to get a larger-wattage-rated unit if possible. This is easy to do if the server uses a standard form factor, such as ATX12V. If the server uses a proprietary power supply form factor, you might not be able to get a higher-rated unit.

I installed a 200GB ATA/IDE drive in my server, but it is recognizing only 137GB

Motherboard ROM BIOSs have been updated throughout the years to support larger and larger ATA/IDE drives. BIOSs older than August 1994 are typically limited to drives of up to 528MB, whereas BIOSs older than January 1998 are limited to 8.4GB. Most BIOSs dated 1998 or newer support drives up to 137GB, and those dated September 2002 or newer should support drives larger than 137GB. These are only general guidelines; to accurately determine this for a specific system, you should check with your motherboard manufacturer. If your server's BIOS does not support the full capacity of your hard disk, try these solutions:

  • Check with the server motherboard or system vendor for a BIOS upgrade that provides 48-bit LBA support.

  • If a BIOS upgrade is not available, install a PCI host adapter that provides 48-bit LBA support, such as the Promise Ultra100 TX2, Ultra133 TX2, or most current ATA RAID adapters from various vendors. Connect the drive to the host adapter, which contains its own BIOS.

Do not use the dynamic drive overlay or similar boot code replacement options offered by a hard disk vendor's installation programs. These options do not work with Windows Server 2003 or with Linux, and in any event, they create a nonstandard disk configuration.

If you are using Windows 2000 Server, make sure you have Service Pack 3 or greater installed. Windows Server 2003 has native 48-bit LBA support. Linux distributions based on the Linux kernel 2.4.20 or greater, such as Red Hat Linux 9, SUSE Linux 9, and Mandriva (Mandrake) Linux 9.2 and newer versions also have native 48-bit LBA support.

You should also download and install the latest versions of the correct motherboard/chipset drivers for your hardware and operating system from your hardware vendor's website.

Note that if you use an external USB or IEEE 1394 hard disk, 48-bit LBA support is not an issue; the hardware in the external enclosure takes care of handling the drive's entire capacity.
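
The capacity limits quoted at the start of this section follow from how many sectors each addressing scheme can describe. The Python sketch below works through the arithmetic. The cylinder/head/sector geometries and the 28-bit and 48-bit address widths are the conventional figures behind these limits rather than values stated above, and 512-byte sectors are assumed throughout.

```python
SECTOR_BYTES = 512  # assumed sector size


def chs_capacity(cylinders, heads, sectors_per_track):
    """Capacity in bytes addressable with a cylinder/head/sector geometry."""
    return cylinders * heads * sectors_per_track * SECTOR_BYTES


def lba_capacity(address_bits):
    """Capacity in bytes addressable with an LBA of the given width."""
    return (2 ** address_bits) * SECTOR_BYTES


LIMITS = [
    ("CHS, 1,024 x 16 x 63 (528MB limit)", chs_capacity(1024, 16, 63)),
    ("CHS, 1,024 x 256 x 63 (8.4GB limit)", chs_capacity(1024, 256, 63)),
    ("28-bit LBA (137GB limit)", lba_capacity(28)),
    ("48-bit LBA", lba_capacity(48)),
]

if __name__ == "__main__":
    for label, size in LIMITS:
        print(f"{label}: {size:,} bytes")
```

The 28-bit case works out to 137,438,953,472 bytes, which is why a 200GB drive on a controller or BIOS without 48-bit LBA support reports only about 137GB.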

My CD-ROM/DVD drive doesn't work

CD and DVD drives are some of the most failure-prone components in a PC. It is not uncommon for one to suddenly fail after a year or so of use.

If you are having problems with a drive that was just installed, check the installation and configuration of the drive. Check the jumper settings on the drive. If you're using an 80-conductor cable, the drive should be jumpered to Cable Select; if you're using a 40-conductor cable, the drive should be set to either master or slave (depending on whether it is the only drive on the cable). Check the cable to ensure that it is not nicked or cut and is a maximum of 18 inches long (the maximum allowed by the ATA specification). Replace the cable with a new one or a known-good spare, preferably using an 80-conductor cable. Make sure the drive power is connected and verify that power is available at the connector by using a digital multimeter. Also make sure the BIOS Setup is set properly for the drive and verify that the drive is detected during the boot process. Finally, try replacing the drive and, if necessary, the motherboard.

If the drive had already been installed and was working before, first try reading different discs, preferably commercial-stamped discs rather than writable or rewritable ones. Then try the procedures listed previously.

My USB port or device doesn't work

Make sure you have enabled the USB ports in the BIOS Setup. Make sure your operating system supports USB; Windows NT 4 does not support USB ports, whereas Windows 2000 and Windows Server 2003 do have USB support. Then remove any hubs and plug the device directly in to the root hub connections on your system. Replace the cable. Many USB devices require additional power, so ensure that your device has an external power supply connected if one is required. Replace the power supply.

I installed an additional memory module, but the system doesn't recognize it

Verify that the memory is compatible with your motherboard. Many subtle variations exist in memory types that can appear to be identical on the surface. Just because it fits in the slot does not mean the memory will work properly with your system. Check your motherboard manual for the specific type of memory your system requires and possibly for a list of supported modules. You can visit www.crucial.com and use its memory selector to determine the exact type of memory for a specific system or motherboard. Also note that all motherboards have limits to the amount of memory they support, and many boards today support only up to 512MB or 1GB. Again, consult the motherboard manual or manufacturer for information on the limits for your board.

If you are sure you have the correct type of memory, follow the memory troubleshooting steps listed previously for intermittent memory problems.

I installed a new drive, but it doesn't work, and the drive LED remains lit

This is the classic symptom for a cable plugged in backward. Both ATA and floppy drives are designed to use cables with keyed connectors; however, some cables are available that lack this keying, which means they can easily be installed backward. When the cable is installed backward into either the motherboard or the drive, the LED on the drive remains lit and the drive does not function. In some cases, this can also cause the entire system to freeze. Check the cables to ensure that they are plugged in properly at both ends; the stripe on the cable indicates pin-1 orientation. On the drive, pin 1 is typically oriented toward the power connector. On the motherboard, look for orientation marks silk-screened on the board or observe the orientation of the other cables plugged in (all cables follow the same orientation).

While I was updating my BIOS, the system froze, and now the system is dead

This can occur when a flash ROM upgrade goes awry. Fortunately, most motherboards have a recovery routine that can be enabled via a jumper on the board. When enabled, the recovery routine causes the system to look for a floppy with the BIOS update program on it. If you haven't done so already, you need to download an updated BIOS from the motherboard manufacturer and follow its directions for placing the BIOS update program on a bootable floppy. Then set the BIOS recovery mode via the jumper on the motherboard, power on the system, and wait until the procedure completes. It usually takes up to 5 minutes, and you might hear beeping to indicate the start and end of the procedure. When the recovery is complete, turn off the system and restore the recovery jumper to the original (normal) settings.

If your motherboard does not feature BIOS recovery capability, you might have to send the board to the manufacturer for repair.

I installed a PCI video card in an older system with PCI slots, and it doesn't work

The PCI bus has gone through several revisions; most older slots are 2.0 type, and most newer cards need 2.1 or later PCI slots. The version of PCI your system has is dictated by the motherboard chipset. If you install a newer video or other PCI card that requires 2.1 slots in a system with 2.0 slots, often the system won't boot up or operate at all.

If you check the chipset reference information in Chapter 3, you might be able to determine which revision of PCI slots your motherboard has by knowing which chipset it has. If this is your problem, the only solution is to change either the card or motherboard so that they are both compatible.
