How Linux Works: What Every Superuser Should Know

2017-07-07 02:10:07

2.4 Filesystems

A filesystem is a database of files and directories that you can attach to a Unix system at the root (/) or some other directory (like /usr) in a currently attached filesystem. At one time, filesystems resided on disks and other physical media used exclusively for data storage. However, the tree-like directory structure and I/O interface of filesystems are quite versatile, so filesystems now perform a variety of tasks.

2.4.1 Filesystem Types

Linux supports an extraordinarily large number of filesystems, including native designs optimized for Linux, foreign types such as the Windows FAT family, universal filesystems like ISO9660, and others. The following list includes the most common types of filesystems for data storage; the type names as recognized by Linux are in parentheses next to the boldfaced filesystem names.

The Second Extended filesystem (ext2) is native to Linux. It is fairly quick, and it defragments itself. Nearly every Linux system uses ext2 or its newer, journaled version, ext3.

Third Extended filesystems (ext3) are ext2 filesystems augmented with journal support. This can make recovery from an abrupt system reboot or failure quicker and less painful.

ISO9660 (iso9660) is a CD-ROM standard. Most CD-ROMs use some variety of ISO9660 extension; Linux supports them.

FAT filesystems (msdos, vfat, umsdos) pertain to Microsoft systems. The simple msdos type supports the very primitive monocase variety in MS-DOS systems. For Windows filesystems, use vfat. The umsdos filesystem is peculiar to Linux; it supports Unix features such as symbolic links on top of an MS-DOS filesystem. It is also not very common.

The Reiser filesystem (reiserfs) is relatively new. It supports a journal and is optimized for fairly small files, a condition that often occurs in Unix systems.

2.4.2 Creating a Filesystem

You cannot mount and store files on a partition that does not contain a filesystem. The partitioning process described in Section 2.3.4 does not create any filesystems; you must place the filesystems on the partitions in a separate step. To create a Second Extended (ext2) filesystem, use the mke2fs program on the target device, as in this example for /dev/hdc3:

mke2fs /dev/hdc3

The mke2fs program automatically determines the number of blocks in a device and sets some reasonable defaults. Unless you really know what you're doing and feel like reading the mke2fs(8) manual page in detail, you shouldn't change these.

When you create a filesystem, you initialize its database, including the superblock and the inode tables. The superblock is at the top level of the database, and it's so important that mke2fs creates a number of backups in case the original is destroyed. You may wish to record a few of the superblock backup numbers when mke2fs runs, in case you need to recover it later in the event of a disk failure (see Section 2.4.8).

Warning

Filesystem creation is a rare task that you should only need to perform after adding a new disk or repartitioning an old disk. You should create a filesystem just once for each new partition that has no preexisting data (or data that you want to remove). Creating a new filesystem on top of an existing filesystem will effectively destroy the old data.

Creating ext3 Filesystems

The only substantial difference between ext2 and ext3 filesystems is that ext3 filesystems have a journal file containing changes not yet written to the regular filesystem database. To create an ext3 filesystem, use the -j option to mke2fs:

mke2fs -j /dev/disk_device

Don't worry if you forget the -j option when creating a filesystem. You can add a journal file to an existing filesystem with the tune2fs utility. Here's an example:

tune2fs -j /dev/hda1

When upgrading a filesystem to ext3, don't forget to change the ext2 to ext3 in the /etc/fstab file.
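As a rough sketch of that last step, the following changes the type field for one device in a scratch copy of an fstab-style table, not in /etc/fstab itself. The two sample entries and the temporary file are assumptions made for illustration; on a real system you would more likely edit /etc/fstab by hand.

```shell
# Work on a scratch copy, never directly on /etc/fstab.
fstab_copy=$(mktemp)
cat > "$fstab_copy" <<'EOF'
/dev/hda1 / ext2 defaults,errors=remount-ro 0 1
/dev/hda3 /usr ext2 defaults 0 2
EOF

# Change the third field (the filesystem type) from ext2 to ext3,
# but only on the line whose device is /dev/hda1.
converted=$(awk '$1 == "/dev/hda1" && $3 == "ext2" { $3 = "ext3" } { print }' "$fstab_copy")
printf '%s\n' "$converted"

rm -f "$fstab_copy"
```

Matching on the device field keeps other ext2 entries (here, /usr) untouched until you have added a journal to them as well.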

2.4.3 Mounting a Filesystem

On Unix, the process of attaching a filesystem is called mounting. When the system boots, the kernel reads some configuration data and mounts / based on that data. To mount a filesystem, you must know the following:

The filesystem's device (such as a disk partition; where the actual filesystem data resides).

The filesystem type, or design. Operating system developers use different types to adapt to their particular system, for backward compatibility, or for other reasons that aren't necessarily that good. For example, the ext2-/ext3-based filesystems common on Linux are quite different from the FAT-based types found on many Windows machines.

The mount point; that is, the place in the current system's directory hierarchy where the filesystem will be attached. The mount point is always a normal directory. For instance, Linux uses /cdrom as a mount point for CD-ROM devices. The mount point need not be directly below /; it may be anywhere on the system.

When mounting a filesystem, the common terminology is "mount a device on a mount point." To learn the current filesystem status of your system, run mount. The output looks like this:

/dev/hda1 on / type ext2 (rw,errors=remount-ro)
proc on /proc type proc (rw)
/dev/hda3 on /usr type ext2 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/bus/usb type usbdevfs (rw)

Each line corresponds to one currently mounted filesystem, with items in this order:

The device, such as /dev/hda3. Notice that some of these aren't real devices (proc, for example); these are stand-ins for real device names, because these special-purpose filesystems do not need devices.

The word on.

The mount point.

The word type.

The filesystem type, usually in the form of a short identifier.

Mount options (in parentheses); see Section 2.4.5 for more details.
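Because the fields always appear in this order, you can pick them out by position. The following is a minimal sketch that runs awk over a few sample lines copied from the listing above instead of over live mount output; the variable names are made up for the example, and it assumes that mount points contain no spaces.

```shell
# Sample lines in the format that mount prints.
mount_output='/dev/hda1 on / type ext2 (rw,errors=remount-ro)
proc on /proc type proc (rw)
/dev/hda3 on /usr type ext2 (rw)'

# Fields: 1 = device, 2 = "on", 3 = mount point, 4 = "type",
#         5 = filesystem type, 6 = options in parentheses.
ext2_mounts=$(printf '%s\n' "$mount_output" |
    awk '$5 == "ext2" { print $3 " is on " $1 }')
printf '%s\n' "$ext2_mounts"
```

To try this on a running system, replace the printf with the mount command itself.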

To mount a filesystem, use the mount command as follows with the filesystem type, device, and desired mount point:

mount -t type device mountpoint

For example, to mount the Second Extended filesystem /dev/hdb3 on /home/extra, use this command:

mount -t ext2 /dev/hdb3 /home/extra

To unmount (detach) a filesystem, use the umount command:

umount mountpoint

See Section 2.4.6 for a few more long options.

2.4.4 Filesystem Buffering

Linux, like other versions of Unix, buffers (caches) all requested changes to filesystems in memory before actually writing the changes to the disk. This cache system is transparent to the user and improves performance because the kernel can perform a large collection of file writes at once instead of performing the changes on demand.

When you unmount a filesystem with umount, the kernel automatically synchronizes with the disk. At any other time, you can force the kernel to write the changes in its buffer to the disk by running the sync command. If (for whatever reason) you can't unmount a filesystem before you turn off the system, make sure that you run sync first.

2.4.5 Filesystem Mount Options

There are many ways to change the mount command behavior. This is often necessary with removable media or when performing system maintenance.

The total number of mount options is staggering. The very extensive mount(8) manual page is a good reference, but it's hard to know where to start and what you can safely ignore.

Options fall into two rough categories: general options and filesystem-specific options. General options include -t for specifying the filesystem type, which was mentioned earlier. By contrast, a filesystem-specific option pertains only to certain filesystem types. To activate a filesystem option, use the -o switch followed by the option. For example, -o norock turns off Rock Ridge extensions on an ISO9660 filesystem, but it has no meaning for any other kind of filesystem.

Short Options

The most important general options are the following:

-r The -r option mounts the filesystem in read-only mode. This has a number of uses, from write protection to bootstrapping. You don't need to specify this option when accessing a read-only device such as a CD-ROM; the system will do it for you (and will also tell you about the read-only status).

-n The -n option ensures that mount does not try to update the system mount database, /etc/mtab. The mount operation fails when it cannot write to this file. This is important at boot time, because the root partition (and therefore, the system mount database) is read-only at first. You will also find this option handy if you are trying to fix a system problem in single-user mode (see Section 3.2.4), because the system mount database may not be available at the time.

-t The -t type option specifies the filesystem type.

Long Options

Short options like -r are too limited for the ever-increasing number of mount options; there are too few letters in the alphabet to accommodate all possible options. Short options are also troublesome because it is difficult to determine an option's meaning based on a single letter. Many general options and all filesystem-specific options use a longer, more flexible option format.

To use long options with mount on the command line, start with -o and supply some keywords. Here is a complete example with the long options in boldface:

mount -t vfat /dev/hda1 /dos -o ro,conv=auto

There are two long options here, ro and conv=auto. The ro option specifies read-only mode, and it is the same as the -r short option. The conv=auto option is a filesystem option telling the kernel to automatically convert certain text files from the DOS newline format to the Unix style (which will be explained shortly).

The most useful long options are the following:

exec, noexec Enables or disables execution of programs on the filesystem.

suid, nosuid Enables or disables setuid programs (see Section 1.17).

ro, rw Mounts the filesystem as read-only or read-write.

remount Reattaches a currently mounted filesystem at the same mount point. The only real reason to do this is to change mount options, and the most frequent case is making a read-only filesystem writable. An example of why you might use this is when the system leaves the root in read-only mode during crash recovery. The following command remounts the root in read-write mode (you need the -n option because the mount command cannot write to the system mount database when the root is read-only):

mount -n -o remount /

The preceding command assumes that the correct device listing for / is in /etc/fstab (explained in the next section). If it is not, you must specify the device.

norock, nojoliet (ISO9660 filesystem) Disables Rock Ridge (Unix) or Joliet (Microsoft) extensions. Be warned that plain, raw ISO9660 is really ugly.

conv=rule (FAT-based filesystems) Converts the newline characters in files based on rule, which can be binary, text, or auto. The default is binary, which disables any character translation. To treat all files as text, use text. The auto setting converts files based on their extension. For example, a .jpg file gets no special treatment, but a .txt file does. Be careful with this option, because it can damage files. You may want to use it in read-only mode.

2.4.6 The /etc/fstab Filesystem Table

To mount filesystems at boot time and take the drudgery out of the mount command, Linux systems keep a permanent list of filesystems and options in /etc/fstab. This is a plain text file in a very simple format, as this example shows:

/dev/hda1   /        ext2     defaults,errors=remount-ro   0   1
/dev/hda2   none     swap     sw                           0   0
/dev/hda3   /usr     ext2     defaults                     0   2
proc        /proc    proc     defaults                     0   0
/dev/hdc    /cdrom   iso9660  ro,user,nosuid,noauto        0   0

Each line corresponds to one filesystem, broken into six fields:

The device. Notice that the /proc entry has a stand-in device.

The mount point.

The filesystem type. You may not recognize swap, for /dev/hda2. This is a swap partition (see Section 2.5).

Options.

Backup information for the dump command; dump does not see common use, but you should always specify this field with a 0.

The filesystem integrity test order (see the fsck command in Section 2.4.8). To ensure that fsck always runs on the root first, you should always set this to 1 for the root filesystem and 2 for any other filesystems on a hard disk. Use 0 to disable the bootup check for everything else, including CD-ROM drives, swap, and the /proc filesystem.
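To see how the sixth field orders the boot-time checks, here is a small sketch that reads the sample table from this section out of a shell variable (not the real /etc/fstab) and lists only the entries that would be checked, in order. The variable names are invented for the example.

```shell
fstab_sample='/dev/hda1 / ext2 defaults,errors=remount-ro 0 1
/dev/hda2 none swap sw 0 0
/dev/hda3 /usr ext2 defaults 0 2
proc /proc proc defaults 0 0
/dev/hdc /cdrom iso9660 ro,user,nosuid,noauto 0 0'

# Skip comment lines and incomplete entries, keep entries whose sixth
# field is nonzero, and sort by that field.
check_order=$(printf '%s\n' "$fstab_sample" |
    awk '!/^#/ && NF >= 6 && $6 > 0 { print $6, $1, $2 }' |
    sort -n)
printf '%s\n' "$check_order"
```

The swap, /proc, and CD-ROM entries drop out because their last field is 0.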

When using mount, you can take some shortcuts if the filesystem you want to work with is in /etc/fstab. For the example fstab above, to mount a CD-ROM, you need only run

mount /cdrom

You can also try to mount all entries in /etc/fstab that do not contain the noauto option at once, with this command:

mount -a

You may have noticed some new options in the preceding fstab listing, namely defaults, errors, noauto, and user. These aren't covered in Section 2.4.5 because they don't make any sense outside of the /etc/fstab file. The meanings are as follows:

defaults This uses the mount defaults: read-write mode, enable device files, executables, the setuid bit, and so on. You should use this when you don't want to give the filesystem any special options, but you do want to fill all fields in /etc/fstab.

errors This ext2-specific parameter sets the system behavior if there is trouble mounting a filesystem. The default is normally errors=continue, meaning that the kernel should return an error code and keep running. To get the kernel to try again in read-only mode, use errors=remount-ro. The errors=panic setting tells the kernel (and your system) to halt when there is a problem.

noauto This option tells a mount -a command to ignore the entry. Use this to prevent a boot-time mount of a removable-media device, such as a CD-ROM or floppy drive.

user This option allows normal users to run mount on this entry. This can be handy for enabling access to CD-ROM drives. Because users can put a setuid-root file on removable media with another system, this option also sets nosuid, noexec, and nodev (to bar special device files). The fstab example in this section explicitly sets nosuid.
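The options field is a comma-separated list, so a script can split it to find out how an entry will be treated. This sketch works on two sample entries from this section held in a variable (an assumption for illustration, as are the variable names) and reports whether each entry carries noauto or user.

```shell
fstab_sample='/dev/hda3 /usr ext2 defaults 0 2
/dev/hdc /cdrom iso9660 ro,user,nosuid,noauto 0 0'

# Split field 4 on commas and flag entries carrying noauto or user.
option_report=$(printf '%s\n' "$fstab_sample" | awk '
{
    n = split($4, opts, ",")
    noauto = "no"; user = "no"
    for (i = 1; i <= n; i++) {
        if (opts[i] == "noauto") noauto = "yes"
        if (opts[i] == "user")   user = "yes"
    }
    print $2 ": noauto=" noauto " user=" user
}')
printf '%s\n' "$option_report"
```

Here /cdrom is skipped by mount -a and may be mounted by normal users, while /usr is neither.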

2.4.7 Filesystem Capacity

To view the size and utilization of your currently mounted filesystems, use the df command. The output looks like this:

Filesystem   1024-blocks     Used   Available  Capacity  Mounted on
/dev/hda1        1011928    71400      889124        7%  /
/dev/hda3       17710044  9485296     7325108       56%  /usr

The listing has the following fields:

Filesystem The filesystem device

1024-blocks The total capacity of the filesystem in blocks of 1024 bytes

Used The number of occupied blocks

Available The number of free blocks

Capacity The percentage of blocks in use

Mounted on The mount point

It is relatively easy to see that the two filesystems here are roughly 1GB and 17.5GB in size. However, the capacity numbers may look a little strange because 71400 + 889124 does not equal 1011928, and 9485296 does not constitute 56 percent of 17710044. In both cases, 5 percent of the total capacity is unaccounted for. Nevertheless, the space is there. These hidden blocks are called the reserved blocks, and only the superuser may use the space if the rest of the partition fills up. This keeps system servers from immediately failing when they run out of disk space.
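The missing space is easy to work out with shell arithmetic. This sketch uses the /dev/hda1 numbers from the listing above; the variable names are made up for the example.

```shell
# Values from the df listing for /dev/hda1 (1024-byte blocks).
total=1011928
used=71400
available=889124

# Whatever is neither used nor available is reserved for the superuser.
reserved=$((total - used - available))
reserved_percent=$((reserved * 100 / total))   # integer division truncates

echo "reserved blocks: $reserved (about ${reserved_percent}% of $total)"
```

The same subtraction for /dev/hda3 leaves 899640 blocks, again about 5 percent of the total.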

If your disk fills up and you need to know where all of those space-hogging, illegal MP3s are, use the du command. With no arguments, du prints the disk usage of every directory in the directory hierarchy, starting at the current working directory. (That's kind of a mouthful, so just run cd /; du to get the idea. Press CONTROL-C when you get bored.) The du -s command turns on summary mode to print only the grand total. If you want to evaluate a particular directory, change to that directory and run du -s *.

Note

The use of 1024-byte blocks in df and du output is not the POSIX standard. Some systems insist on displaying the numbers in 512-byte blocks. To get around this, use the -k option (both utilities support this). The df program also supports the -m option to list capacities in one-megabyte blocks.

The following pipeline is a handy way to create a searchable output file (du_out) and see the results on the terminal at the same time.

du | tee du_out
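To try the pipeline without touching your own files, you can point it at a throwaway directory tree. The sketch below builds one with mktemp; the directory and file names are invented for the example, and -k is used so that the units are 1024-byte blocks regardless of the system default.

```shell
# Build a small scratch tree so the example does not depend on your files.
scratch=$(mktemp -d)
du_out=$(mktemp)
mkdir -p "$scratch/music" "$scratch/docs"
printf 'some data\n' > "$scratch/music/track.txt"
printf 'more data\n' > "$scratch/docs/notes.txt"

# tee writes the report to the output file and echoes it to the terminal.
du -k "$scratch" | tee "$du_out"

# Later, search the saved file instead of running du again.
matches=$(grep -c '/music$' "$du_out")
echo "lines mentioning the music directory: $matches"

rm -rf "$scratch" "$du_out"
```

The sizes that du reports for such tiny files depend on the underlying filesystem, so expect the numbers to vary from system to system.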

2.4.8 Checking and Repairing Filesystems

The optimizations that Unix filesystems offer are made possible by a sophisticated database-like mechanism. For filesystems to work seamlessly, the kernel has to trust that there are no errors in a mounted filesystem. Otherwise, serious errors such as data loss and system crashes can happen.

The most frequent cause of a filesystem error is shutting down the system in a rude way (for example, with the power switch on the computer). The system's filesystem cache in memory may not match the data on the disk, and the system also may be in the process of altering the filesystem when you decide to give the computer a kick. Even though a new generation of filesystems supports journals to make filesystem corruption far less common, you should always shut the system down properly (see Section 3.1.5). Furthermore, filesystem checks are still necessary every now and then as sanity checks.

You need to remember one command name to check a filesystem: fsck. However, there is a different version of this tool for each filesystem type that Linux supports. The information presented here is specific to second and third extended (ext2/ext3) filesystems and the e2fsck utility. You generally don't need to type e2fsck, though, unless fsck can't figure out the filesystem type, or you're looking for the e2fsck manual page.

To run fsck in interactive manual mode, use the device or the mount point (in /etc/fstab) as the argument. For example:

fsck /dev/hdd1

Warning

Never use fsck on a mounted filesystem. The kernel may alter the disk data as you run the check, causing mismatches that can crash your system and corrupt files. There is only one exception. If you mount the root as read-only in single-user mode, you may use fsck on the root filesystem.

In manual mode, fsck prints verbose status reports on its passes, which should look something like this when there are no problems:

Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/hdd1: 11/1976 files (0.0% non-contiguous), 265/7891 blocks

If fsck finds a problem in manual mode, it stops and asks you a question relevant to fixing the problem. These questions deal with the internal structure of the filesystem, such as reconnecting loose inodes and clearing blocks. The reconnection business means that fsck found a file that doesn't appear to have a name; reconnecting places the file in the filesystem's lost+found directory, with a number as its name. You need to guess the original name based on the content of the file.

In general, it's pointless to sit through the fsck process if you just made the mistake of an impolite shutdown. e2fsck has a -p option to automatically fix silly problems without asking you, aborting if there is a serious error. This is so common that Linux distributions run some variant of fsck -p at boot time (fsck -a is also common).

However, if you suspect that there is some major disaster, such as a hardware failure or device misconfiguration, you need to decide on a course of action, because fsck can really mess up a filesystem with larger problems. A telltale sign of a serious problem is a lot of questions in manual mode.

If you think that something really bad happened, try running fsck -n to check over the filesystem without modifying anything. If there's some sort of problem with the device configuration (an incorrect number of blocks in the partition table, loose cables, whatever) that you think you can fix, then fix it before running fsck for real. You're likely to lose a lot of data otherwise.

If you suspect that only the superblock, a key filesystem database component, is corrupt (for example, someone wrote to the beginning of the disk partition), you might be able to recover the filesystem with one of the superblock backups that mke2fs creates. Use fsck -b num to replace the corrupted superblock with an alternate at block num.

You may not know where to find a backup superblock, because you didn't write the numbers down when mke2fs ran. If the filesystem was created with the default values, you can try mke2fs -n on the device to view a list of superblock backup numbers without destroying your data (again, make dead sure that you're using -n , because you'll really tear up the filesystem otherwise).
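If you would like to see what mke2fs -n and a read-only check print before you ever need them, you can practice on an ordinary file instead of a real partition. The following is only a sketch under several assumptions: that the e2fsprogs tools (mke2fs and e2fsck) are installed, that they may live in /sbin or /usr/sbin, and that a 16MB scratch file created with mktemp is an acceptable stand-in for a disk device. The -F option is there because the target is a regular file, not a block device.

```shell
# The tools often live in /sbin, which may not be on a normal user's PATH.
PATH="$PATH:/sbin:/usr/sbin"
img=$(mktemp)
have_tools=no
check_status=0

if command -v mke2fs >/dev/null 2>&1 && command -v e2fsck >/dev/null 2>&1; then
    have_tools=yes

    # A 16MB file of zeros stands in for a partition.
    dd if=/dev/zero of="$img" bs=1024 count=16384 2>/dev/null

    # Create a filesystem on the scratch file (-q keeps mke2fs quiet).
    mke2fs -q -F "$img" </dev/null

    # Dry run: with -n, mke2fs only reports what it would do, including
    # the superblock backup locations, and writes nothing.
    mke2fs -n -F "$img" </dev/null

    # Forced check (-f) that answers "no" to every question (-n),
    # so the image is never modified.
    e2fsck -fn "$img" </dev/null
    check_status=$?
fi

rm -f "$img"
```

Because everything happens inside a temporary file, a mistyped option here costs nothing, which is not true of the same commands aimed at a real device.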

If the device still appears to function properly except for a few small parts, you can run fsck -c before a manual fsck to search for bad blocks. Such a failure is somewhat rare.

Checking ext3 Filesystems

You normally do not need to check ext3 filesystems because the journal ensures data integrity. However, you may wish to mount an ext3 filesystem in ext2 mode. The kernel will not mount an ext3 filesystem that contains a non-empty journal (if you don't shut your system down cleanly, you can expect that the journal contains some data). To flush the journal in an ext3 filesystem to the regular filesystem database, run e2fsck as follows:

e2fsck -fy /dev/disk_device

The Worst Case

Disk problems that are worse in severity leave you with few choices:

You can try to pull the entire filesystem from the disk with dd and transfer it to a partition on another disk that's the same size.

You could try to patch up the filesystem as well as you can, mount it in read-only mode, and salvage what you can.

In both cases, you still need to repair the filesystem before you mount it (unless you feel like picking through the raw data by hand). To answer y to all of the fsck questions, use fsck -y, but do this as a last resort.

Note

There is an advanced utility called debugfs for users with in-depth knowledge of filesystems, or for those who feel like experimenting on a filesystem that isn't important.

If you're really desperate, such as in the event of a catastrophic disk failure without backups, there isn't a lot you can do other than try to get a professional service to "scrape the platters."

2.4.9 Special-Purpose Filesystems

Not all filesystems represent storage on physical media. Most versions of Unix have filesystems that serve as system interfaces. This idea goes back a long way; the /dev mechanism is an early model of using files for I/O interfaces. The /proc idea came from the eighth edition of research Unix [Killian]. Things really got rolling when the people at Bell Labs (including many of the original Unix designers) created Plan 9 [Bell Labs], a research operating system that took filesystem abstraction to a whole new level.

The special filesystem types in common use on Linux include the following:

proc, mounted on /proc. The name "proc" is actually an abbreviation of "process." Each numbered directory inside /proc is actually the process ID of a current process on the system, and the files in those directories represent various aspects of the processes. /proc/self represents the current process. The Linux proc filesystem includes a great deal of additional kernel and hardware information, such as /proc/cpuinfo. Purists shudder and say that this additional information does not belong in /proc, but rather in /dev or some other directory, but it's probably too late to change it in Linux now.

usbdevfs, mounted on /proc/bus/usb. Programs that interact with the USB interface and its devices often need the files here. The files in a usbdevfs filesystem provide interesting information on the bus status.

tmpfs, mounted on /dev/shm. You can employ your physical memory and swap space as temporary storage with tmpfs. You can mount tmpfs wherever you like, using the size and nr_blocks long options to control the maximum size. However, you must be careful not to pour things into a tmpfs, because your system will eventually run out of memory, and programs will start to crash. For years, Sun systems used a version of tmpfs for /tmp, and this is a frequent problem on long-running systems.
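You can explore the proc filesystem with ordinary file tools. The sketch below is Linux-specific and guarded so that it does nothing on systems without /proc; it relies on the shell's read builtin typically opening /proc/self/stat from within the shell process itself, so the process ID it sees is normally the shell's own. The variable names are invented for the example.

```shell
# Linux-specific: do nothing if /proc is not laid out as described here.
if [ -r /proc/self/stat ]; then
    # The first field of the stat file is the process ID of whichever
    # process opened it (normally this shell, because read is a builtin).
    read -r self_pid rest < /proc/self/stat
    echo "process ID seen through /proc/self: $self_pid"

    # Each running process has a numbered directory under /proc.
    ls -d "/proc/$self_pid"

    # Kernel and hardware information sits alongside the process directories.
    if [ -r /proc/cpuinfo ]; then
        head -n 5 /proc/cpuinfo
    fi
fi
```

Nothing here needs superuser privileges, because these files are readable by ordinary users on a typical system.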