Linux Annoyances for Geeks: Getting the Most Flexible System in the World Just the Way You Want It
8.2. My Hard Drive Is Failing and I Need a Backup, Fast
It's best to configure a regular backup of your entire system. But hard drives are large. Gigabytes of data take time to copy. So you can't be blamed for avoiding backups as long as possible. (That is, until there is a hard drive failure.) While you might have configured backups for those workstations that you administer, other people might not have been so farsighted and may look to you as a Linux geek when they hit the inevitable disk problem. Thus, you may be asked to recover the data of a less experienced Linux user who forgot to back up his hard drive.
8.2.1. Symptoms
One symptom of an imminent hard drive failure is the following message, which you might see during the Power On Self Test (POST) process:
    1720 - S.M.A.R.T Hard Drive detects imminent failure(Failing Attr:05h)
    Please back up the contents of the hard drive and run HDD self test in F2 setup
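If the system still boots, the drive's own SMART data can help corroborate the POST warning. The following is a minimal sketch and not part of the original account: it assumes the smartmontools package (which provides `smartctl`) is installed and that the failing disk is `/dev/hda`, a hypothetical device name you would need to adjust. The "05h" in the message is a hexadecimal attribute ID; the sketch converts it to the decimal ID (5) used in the attribute table printed by `smartctl`, where it is commonly listed as the reallocated sector count.

```shell
#!/bin/sh
# Attribute ID as printed in the POST message ("Failing Attr:05h"), in hex.
attr_hex="05"

# Convert to the decimal ID shown in smartctl's attribute table.
attr_id=$(printf '%d' "0x$attr_hex")
echo "Failing SMART attribute ID (decimal): $attr_id"

# ASSUMPTION: /dev/hda is a placeholder for the failing disk.
disk=/dev/hda

# Query the drive only if smartctl is installed and the device exists.
# smartctl needs root; its exit status is a bitmask, so a nonzero value
# is tolerated here rather than treated as a script failure.
if command -v smartctl >/dev/null 2>&1 && [ -b "$disk" ]; then
    smartctl -H "$disk" || true   # overall health self-assessment
    smartctl -A "$disk" || true   # vendor attribute table (look for ID $attr_id)
fi
```

Each query makes the drive do a little more work, so on a disk that is already failing it is probably wise to check once and then move on to the backup.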
While you could run the HDD self-test, chances are good that if you see this message, your hard drive is about to fail. So you should take steps right away to recover what you can. The first thing you should do is mark the bad blocks; we've described this process in the previous annoyance.
At this point, you've applied the fsck command to your system. You've tried the regular backup techniques described in Chapter 2. You've marked the bad blocks with the techniques described in the previous annoyance. Commands such as dd or tar fail because they find errors when they hit bad blocks.
First and foremost, save the files that you can't live without. Next, proceed with an emergency backup of the entire hard drive, described in the next section.
8.2.2. Configuring an Emergency Backup
To explain what you should do to back up a failing drive in terms as concrete and easy to follow as possible, I'll revisit a recent frightening day when my laptop hard drive failed, and describe the steps I took. It should not be hard for you to apply the lesson to another disk failure. I start with a narrative, followed by a step-by-step description of what I did to recover and transfer my data to a new hard drive. While this may be repetitive, if your hard drive is failing, it's important to get these steps right the first time.
When the symptoms described in earlier sections showed me that my laptop hard drive was failing, my first step was to save the critical files that I absolutely needed. But that was not enough. I had spent several hours configuring Debian on this laptop computer and would have been really annoyed if I had to start over. I needed an emergency backup.
Fortunately, I had a large external IEEE 1394 (FireWire) hard drive, which had plenty of space for my Debian partitions. Generally, most distributions with Linux kernel 2.6 have no problems with IEEE 1394 hard drives.
I bought another hard drive to replace the one currently on my laptop. It turned out that I could get a significantly larger drive for just a little more money. This made things easier because I could specify slightly larger partitions than I had on my old disk, rather than spend a lot of effort trying to re-create each partition at exactly the same size. Once you realize that you need a new hard drive, you may want to order it as soon as possible, as shipping can take time.
Because my hard drive seemed ready to fail, I needed to minimize the stress on that drive. I also needed a magic tool that could ignore the errors associated with the bad blocks on my drive while copying the partitions or all the files within them. What I needed was a Linux distribution that recognized my IEEE 1394 hard drive, included a magic backup tool, and could be loaded directly from a CD.
From previous experience, I knew that when I boot Knoppix with kernel 2.6, it recognizes and allows me to partition, format, and mount my IEEE 1394 hard drive. If that didn't work, I knew Knoppix recognized my network card; I could have backed up my partitions over my network.
As for the magic tool, current versions of Knoppix include the dd_rescue command. As it's designed to ignore errors such as bad blocks on a partition, it was what I needed at that moment. For more information on dd_rescue, see http://www.garloff.de/kurt/linux/ddrescue/.
I booted my system with a Knoppix CD. Because it loaded Linux and the associated utilities onto a RAM disk, it minimized the stress on my hard drive. If you have a different magic tool, you may be able to use another CD-based distribution such as Ubuntu or SUSE Live CD.
Next, I loaded and mounted my backup media. I formatted my external hard drive to the ext3 filesystem. Knoppix recognizes standard external drives and network connections, generally with little difficulty.
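The preparation just described (format the external drive as ext3, mount it, then copy a failing partition with dd_rescue) might look roughly like the sketch below. It is an illustration under stated assumptions, not the exact commands from this account: `/dev/hda1` (a partition on the failing disk), `/dev/sda1` (a partition on the external drive), and the mount point `/mnt/rescue` are all hypothetical names. The `run` helper is also an addition for safety; by default it only prints what would be executed.

```shell
#!/bin/sh
# ASSUMPTIONS: every device name and path below is a placeholder.
SRC=/dev/hda1          # partition on the failing laptop drive
DEST_DEV=/dev/sda1     # partition on the external IEEE 1394 drive
MNT=/mnt/rescue        # hypothetical mount point for the external drive
IMG="$MNT/hda1.img"    # image file that will hold the rescued partition

# Dry run by default. Set DRY_RUN=0 only after checking the names above:
# mkfs.ext3 erases everything on DEST_DEV.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run mkfs.ext3 "$DEST_DEV"              # format the external partition as ext3
run mkdir -p "$MNT"                    # create the mount point
run mount -t ext3 "$DEST_DEV" "$MNT"   # mount the backup media
run dd_rescue -v "$SRC" "$IMG"         # copy the partition, continuing past read errors
```

Copying to an image file rather than directly onto another partition leaves the rescued data untouched until the replacement drive is partitioned and ready. Real runs need root privileges, which a Knoppix session provides without a password.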
After formatting partitions on my IEEE 1394 drive, I rebooted into Knoppix to make sure the new partitions were properly written. Most of these commands require superuser mode, but when you boot Knoppix from a CD, no root password is required. Finally, I could use dd_rescue to save the data I could, and then write that data to the new laptop hard drive. Before you start, make sure you have the following available:
Now that I had the basic story and the tools I needed, I took the following steps to rescue my laptop hard drive: