Linux Annoyances for Geeks: Getting the Most Flexible System in the World Just the Way You Want It
8.1. I Can't Boot Because the Partition Is Corrupt
There are a number of reasons why partitions become corrupt. You may have lost power. Minor electrical surges can affect what is written to a drive. As hard drives wear out, bad blocks can corrupt your data. Yes, hard drive specifications suggest that the mean time between failures is several hundred thousand hours, which corresponds to several decades. But that's just an average, under ideal conditions. If all hard drives were that reliable, RAID would not be quite so popular.

If your hard drive is failing, you may not be able to fix the problem. The best that you can do is minimize the corruption until you can create a backup. We'll show you how to back up data from a failing hard drive in the next annoyance.

One reason for the popularity of the Reiser filesystem is its sensitivity to hard drive corruption. If you find corruption on your reiserfs-formatted filesystems, you'll probably have a bit more time to save your data.

8.1.1. Symptoms of Corruption
In this chapter, we'll describe two categories of filesystem corruption. The first, whose symptoms are described in the following annoyance, occurs when a hard drive wears out. The second is the occasional glitch that you can recover from while preserving the data on your disk.

The temporary glitch is most commonly associated with a power failure. For example, once when I tripped over a cord, I lost power on my desktop computer. The next time I booted that computer, I saw the following message:

    *** An error occurred during the filesystem check.
    *** Dropping you to a shell; the system will reboot
    *** when you leave the shell.
    Give root password for maintenance

This problem is most commonly associated with filesystems that do not include a journal, such as ext2. Whenever there's corruption, there's a risk that Linux won't be able to find some of your files. Journaling filesystems keep a log of pending changes, which lets them return to a consistent state after a crash. But journaling is not a guarantee. I've had this error even on a journaled ext3 filesystem.

8.1.2. Basic Checks with fsck
Whenever there is corruption, the first Linux command you should use is fsck. Ideally, you can apply this command alone to a specific, unmounted partition. For example, I managed to clean one partition with this simple fsck command:

    # fsck /dev/hda6
    fsck 1.35 (28-Feb-2004)
    e2fsck 1.35 (28-Feb-2004)
    /: recovering journal
    Cleaning orphaned inode 16915 (uid=1000, gid=0, mode=0140600, size=0)
    Cleaning orphaned inode 16914 (uid=1000, gid=0, mode=0140600, size=0)
    Cleaning orphaned inode 16909 (uid=1000, gid=0, mode=0140600, size=0)
    Cleaning orphaned inode 302828 (uid=0, gid=0, mode=020600, size=0)
    /: clean, 165245/525888 files, 694569/1050241 blocks
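Because fsck is best applied to an unmounted partition, it can help to test for that before you run it. The following is a minimal sketch, not a standard tool: the function names is_mounted and safe_fsck are hypothetical, and the check assumes the partition appears in /proc/mounts under the same device name you pass in (the root filesystem is sometimes listed under a different name).

```shell
#!/bin/sh
# Sketch: refuse to run fsck on a partition that is still mounted.
# is_mounted and safe_fsck are hypothetical helper names.

# Succeeds if the device in $1 is listed in the mount table.
# $2 optionally names a different mount table; the default is /proc/mounts.
is_mounted() {
    dev=$1
    table=${2:-/proc/mounts}
    # The first field of each line in the mount table is the device name.
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$table"
}

# Runs fsck on $1 only when the device is not mounted.
safe_fsck() {
    dev=$1
    if is_mounted "$dev"; then
        echo "$dev is mounted; unmount it or boot a rescue system first" >&2
        return 1
    fi
    fsck "$dev"
}
```

As root, you might then run safe_fsck /dev/hda6 in place of the bare fsck command.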
On most Linux systems, fsck works on a variety of filesystem formats. Try entering ls /sbin/fsck*. You should find a variety of commands, such as:

    /sbin/fsck
    /sbin/fsck.ext3
    /sbin/fsck.msdos
    /sbin/fsck.xfs
    /sbin/fsck.cramfs
    /sbin/fsck.jfs
    /sbin/fsck.reiserfs
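Each of those files follows the pattern fsck.type, where the suffix is a filesystem type. As a rough illustration of that naming convention, the sketch below maps a type to its helper. The function name fsck_helper_for is hypothetical, and looking in a single directory is a simplification of what the real fsck does.

```shell
#!/bin/sh
# Sketch: find the filesystem-specific checker for a given type.
# fsck_helper_for is a hypothetical name; searching only one directory
# (default /sbin) is a simplifying assumption.
fsck_helper_for() {
    fstype=$1
    dir=${2:-/sbin}
    helper="$dir/fsck.$fstype"
    if [ -x "$helper" ]; then
        echo "$helper"
    else
        echo "no fsck helper for $fstype in $dir" >&2
        return 1
    fi
}
```

For example, fsck_helper_for "$(blkid -o value -s TYPE /dev/hda6)" prints the helper for whatever type blkid detects on that partition. To see which helper fsck itself would pick without running it, try fsck -N /dev/hda6.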
Thus, fsck is a frontend for all the filesystem-specific commands on your system. The proper utility is chosen automatically by fsck based on the type of the filesystem you run it on.

8.1.3. Finding Bad Blocks
If your system still has bad blocks, it may be the first sign of an impending failure. Hard drives can include hundreds of thousands of blocks. If one goes bad, that may not be the end of the world. But it may be a symptom of other problems. Many Linux gurus believe that is the time to get a new hard drive.

If you're still not sure, the badblocks command can help you determine if your hard drive is in trouble. For example, the following command writes the number of each bad block it finds to a file named blockbad:

    # badblocks -v /dev/hda7 -o blockbad
    Checking for bad blocks (read-only test): 697008/ 1050241
If badblocks finds nothing, the previous fsck command probably fixed any errors on that filesystem, and you can continue using Linux normally. The following output is evidence that the repair was completely successful:

    0 bad blocks
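The file written by the -o option holds one block number per line, so counting its lines gives you the same answer in a form you can script. This is a small sketch under that assumption. The function names count_bad_blocks and report_bad_blocks are hypothetical, and blockbad is the output file from the badblocks command above.

```shell
#!/bin/sh
# Sketch: summarize the list that badblocks wrote with its -o option.
# count_bad_blocks and report_bad_blocks are hypothetical helper names.

# Prints the number of non-empty lines (one bad block per line) in $1.
count_bad_blocks() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}

# Prints a one-line verdict for the list in $1.
report_bad_blocks() {
    n=$(count_bad_blocks "$1")
    if [ "$n" -eq 0 ]; then
        echo "no bad blocks recorded"
    else
        echo "$n bad blocks recorded; back up now and plan for a new drive"
    fi
}
```

After the scan finishes, report_bad_blocks blockbad tells you at a glance whether the drive needs further attention.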
When bad blocks remain, you should rerun fsck with more severe options, described in the next section. If you need to keep the hard drive working until a new one arrives, back it up as soon as possible. We show you how to do this with a partially corrupt partition in the next annoyance. But until that new hard drive arrives, there are things you can do to keep your current hard drive going.

8.1.4. Fixing Bad Blocks
The fsck command can help you check, mark, and fix bad blocks, and can help preserve the health of your filesystems. For that reason, current distributions force a periodic fsck on each filesystem formatted in the popular ext2 and ext3 formats. You can do your own fsck maintenance with the switches shown in Table 8-1; some of these switches are not documented on the fsck manpage.
For example, the following command marks the bad blocks on your system. If you're fortunate, each fsck "pass" of your partition proceeds without incident. The following is sample output from a run on a good partition.

    # fsck -cyfv /dev/hda5
    fsck 1.35 (28-Feb-2004)
    e2fsck 1.35 (28-Feb-2004)
    Checking for bad blocks (read-only test): done
    Pass 1: Checking inodes, blocks and sizes
    Pass 2: Checking directory structure
    Pass 3: Checking directory connectivity
    Pass 4: Checking reference counts
    Pass 5: Checking group summary information
    . . .
However, I had problems with a different partition. In the middle of this process, the test seemed to stop. I was tempted to interrupt the command by pressing Ctrl-C, but progress continued after a few minutes. As you can see here, the test turned up problems:

    Duplicate blocks found.... invoking duplicate block passes
    Pass 1B: Rescan for duplicate/bad blocks
    Duplicate/bad block(s) in inode 1448: 13568
    Pass 1C Scan directories for inodes with dup blocks.
    Error reading block 697043 (Attempt to read block from filesystem resulted in a short read). Ignore error? yes
    Force rewrite? yes
    ....
    Pass 1D: Reconciling duplicate blocks
    (There are 4 inodes containing duplicate/bad blocks)
    File <The journal inode> (inode #8, mod time Fri Nov 12 08:43:05 2005)
      has 10 duplicate block(s), shared with 1 file(s):
        <The bad blocks inode> (inode #1, mod time Fri Jan 7 12:11:24 2006)
    Clone duplicate/bad blocks? yes
    Error reading block 4049 (Attempt to read block from filesystem resulted in short read). Ignore error? yes
    Force rewrite? yes
The check continued, revealing hundreds of errors. But the most important error is near the beginning of the output. As you can see, there is corruption even in the journal. Any pointers from the journal to other files are thus suspect.

After your bad blocks are marked, Linux knows to avoid reading data from those locations. The time is right for a backup. If standard techniques described in Chapter 2 don't work, see the next annoyance.
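When a check scrolls past hundreds of messages, the exit status of fsck is a more compact summary of what happened. The fsck manpage documents that status as a sum of the conditions listed in the sketch below. The function name describe_fsck_status is hypothetical, and the wording of each message is paraphrased.

```shell
#!/bin/sh
# Sketch: translate an fsck exit status into readable messages.
# describe_fsck_status is a hypothetical helper name. The bit values
# are the ones listed on the fsck manpage; the message text is paraphrased.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    # Each entry is "bit:meaning"; the status is the sum of the bits set.
    for pair in "1:errors corrected" \
                "2:system should be rebooted" \
                "4:errors left uncorrected" \
                "8:operational error" \
                "16:usage or syntax error" \
                "32:canceled by user request" \
                "128:shared-library error"; do
        bit=${pair%%:*}
        text=${pair#*:}
        if [ $((status & bit)) -ne 0 ]; then
            echo "$text"
        fi
    done
}
```

One way to use it is to save the status immediately after the check, as in fsck -cyfv /dev/hda5; describe_fsck_status $?. If "errors left uncorrected" appears, treat the backup as urgent.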