8.2. Structure of an Inode To allow files to be allocated concurrently and to provide random access within files, FreeBSD uses the concept of an index node, or inode. The inode contains information about the contents of the file; see Figure 8.1 (on page 298). This information includes the following: The type and access mode for the file The file's owner and group-access identifiers The time that the file was created, when it was most recently read and written, and when its inode was most recently updated by the system The size of the file in bytes The number of physical blocks used by the file (including blocks used to hold indirect pointers and extended attributes) The number of directory entries that reference the file The kernel and user setable flags that describe characteristics of the file The generation number of the file (a randomly selected number assigned to the inode each time that the latter is allocated to a new file; the generation number is used by NFS to detect references to deleted files) The blocksize of the data blocks referenced by the mode (typically the same as, but sometimes larger than, the filesystem blocksize) The size of the extended attribute information Figure 8.1. The structure of an inode.
Notably missing in the mode is the filename. Filenames are maintained in directories rather than in modes because a file may have many names, or links, and the name of a file can be large (up to 255 bytes in length). Directories are described in Section 8.3. To create a new name for a file, the system increments the count of the number of names referring to that inode. Then the new name is entered in a directory, along with the number of the inode. Conversely, when a name is deleted, the entry is deleted from a directory, and the name count for the inode is then decremented. When the name count reaches zero, the system deallocates the inode by putting all the mode's blocks back on a list of free blocks. The inode also contains an array of pointers to the blocks in the file. The system can convert from a logical block number to a physical sector number by indexing into the array using the logical block number. A null array entry shows that no block has been allocated and will cause a block of zeros to be returned on a read. On a write of such an entry, a new block is allocated, the array entry is updated with the new block number, and the data are written to the disk. Inodes are fixed in size, and most files are small, so the array of pointers must be small for efficient use of space. The first 12 array entries are allocated in the inode itself. For typical filesystems, this implementation allows the first 96 or 192 Kbyte of data to be located directly via a simple indexed lookup. For somewhat larger files, Figure 8.1 shows that the inodc contains a single indirect pointer that points to a single indirect block of pointers to data blocks. To find the 100th logical block of a file, the system first fetches the block identified by the indirect pointer and then indexes into the 88th block (100 minus 12 direct pointers) and fetches that data block. For files that are bigger than a few megabytes, the single indirect block is eventually exhausted; these files must resort to using a double indirect block, which is a pointer to a block of pointers to pointers to data blocks. For files of multiple Gbyte, the system uses a triple indirect block, which contains three levels of pointer before reaching the data block. Although indirect blocks appear to increase the number of disk accesses required to get a block of data, the overhead of the transfer is typically much lower. In Section 6.6, we discussed the management of the cache that holds recently used disk blocks. The first time that a block of indirect pointers is needed, it is brought into the cache. Further accesses to the indirect pointers find the block already resident in memory; thus, they require only a single disk access to get the data. Changes to the Inode Format Traditionally, the FreeBSD fast filesystem (which we shall refer to in this book as UFS1) [McKusick et al., 1984] and its derivatives have used 32-bit pointers to reference the blocks used by a file on the disk. The UFS1 filesystem was designed in the early 1980s when the largest disks were 330 Mbytes. There was debate at the time whether it was worth squandering 32 bits per block pointer rather than using the 24-bit block pointers of the filesystem that it replaced. Luckily the futurist view prevailed, and the design used 32-bit block pointers. Over the 20 years since it has been deployed, storage systems have grown to hold over a terabyte of data. Depending on the block size configuration, the 32-bit block pointers of UFS1 run out of space in the 1 to 4 terabyte range. While some stopgap measures can be used to extend the maximum-size storage systems supported by UFS1, by 2002 it became clear the only long-term solution was to use 64-bit block pointers. Thus, we decided to build a new filesystem, UFS2, that would use 64-bit block pointers. We considered the alternatives between trying to make incremental changes to the existing UFS1 filesystem versus importing another existing filesystem such as XFS [Sweeney et al., 1996], or ReiserFS [Reiser, 2001]. We also considered writing a new filesystem from scratch so that we could take advantage of recent filesystem research and experience. We chose to extend the UFS1 filesystem because this approach allowed us to reuse most of the existing UFS1 code base. The benefits of this decision were that UFS2 was developed and deployed quickly, it became stable and reliable rapidly, and the same code base could be used to support both UFS1 and UFS2 filesystem formats. Over 90 percent of the code base is shared, so bug fixes and feature or performance enhancements usually apply to both filesystem formats. The on-disk inodcs used by UFS1 are 128 bytes in size and have only two unused 32-bit fields. It would not be possible to convert to 64-bit block pointers without reducing the number of direct block pointers from 12 to 5. Doing so would dramatically increase the amount of wasted space, since only direct block pointers can reference fragments, so the only alternative is to increase the size of the on-disk mode to 256 bytes. Once one is committed to changing to a new on-disk format for the modes, it is possible to include other inode-related changes that were not possible within the constraints of the old modes. While it was tempting to throw in everything that has ever been suggested over the last 20 years, we felt that it was best to limit the addition of new capabilities to those that were likely to have a clear benefit. Every new addition adds complexity that has a cost both in maintainability and performance. Obscure or little-used features may add conditional checks in frequently executed code paths such as read and write, slowing down the overall performance of the filesystem even if they are not used. Extended Attributes A major addition in UFS2 is support for extended attributes. Extended attributes are a piece of auxiliary data storage associated with an inode that can be used to store auxiliary data that is separate from the contents of the file. The idea is similar to the concept of data forks used in the Apple filesystem [Apple, 2003]. By integrating the extended attributes into the mode itself, it is possible to provide the same integrity guarantees as are made for the contents of the file itself. Specifically, the successful completion of an fsync system call ensures that the file data, the extended attributes, and all names and paths leading to the names of the file are in stable store. The current implementation has space in the mode to store up to two blocks of extended attributes. The new UFS2 inode format had room for up to five additional 64-bit pointers. Thus, the number of extended attribute blocks could have been between one to five blocks. We chose to allocate two blocks to the extended attributes and to leave the other three as spares for future use. By having two, all the code had to be prepared to deal with an array of pointers, so if the number got expanded into the remaining spares in the future, the existing implementation will work without changes to the source code. By saving three spares, we provided a reasonable amount of space for future needs. And if the decision to allow only two blocks proves to be too little space, one or more of the spares can be used to expand the size of the extended attributes in the future. If vastly more extended attribute space is needed, a spare could be used as an indirect pointer to extended attribute data blocks. Figure 8.2 shows the format used for the extended attributes. The first field of the header of each attribute is its length. Applications that do not understand the name space or name can simply skip over the unknown attribute by adding the length to their current position to get to the next attribute. Thus, many different applications can share the usage of the extended attribute space, even if they do not understand each other's data types. Figure 8.2. Format of extended attributes. The header of each attribute has a 4-byte length, 1-byte name-space class, 1-byte content pad length, 1-byte name length, and name. The name is padded so that the contents start on an 8-byte boundary. The contents are padded to the size shown by the "content pad length" field. The size of the contents can be calculated by subtracting from the length the size of the header (including the name) and the content pad length.
The first of two initial uses for extended attributes is to support an access control list, generally referred to as an ACL. An ACL replaces the group permissions for a file with a more specific list of the users that are permitted to access the files. The ACL also includes a list of the permissions that each user is granted. These permissions include the traditional read, write, and execute permissions along with other properties such as the right to rename or delete the file [Rhodes, 2003]. Earlier implementations of ACLs were done with a single auxiliary file per filesystem that was indexed by the inode number and had a small fixed-sized area to store the ACL permissions. The small size was to keep the size of the auxiliary file reasonable, since it had to have space for every possible inode in the filesystem. There were two problems with this implementation. The fixed size of the space per inode to store the ACL information meant that it was not possible to give access to long lists of users. The second problem was that it was difficult to atomically commit changes to the ACL list for a file, since an update required that both the file inode and the ACL file be written to have the update take effect [Watson, 2000]. Both problems with the auxiliary file implementation of ACLs are fixed by storing the ACL information directly in the extended-attribute data area of the inode. Because of the large size of the extended attribute data area (a minimum of 8 Kbytes and typically 32 Kbytes), long lists of ACL information can be easily stored. Space used to store extended attribute information is proportional to the number of inodes with extended attributes and the size of the ACL lists that they use. Atomic update of the information is much easier, since writing the inode will update the inode attributes and the set of data that it references including the extended attributes in one disk operation. While it would be possible to update the old auxiliary file on every fsync system call done on the filesystem, the cost of doing so would be prohibitive. Here, the kernel knows if the extended attribute data block for an inode is dirty and can write just that data block during an fsync call on the inode. The second use for extended attributes is for data labeling. Data labels provide permissions for a mandatory access control (MAC) framework enforced by the kernel. The kernel's MAC framework permits dynamically introduced system-security modules to modify system security functionality. This framework can be used to support a variety of new security services, including traditional labeled mandatory access control models. The framework provides a series of entry points that are called by code supporting various kernel services, especially with respects to access control points and object creation. The framework then calls out to security modules to offer them the opportunity to modify security behavior at those MAC entry points. Thus, the filesystem does not codify how the labels are used or enforced. It simply stores the labels associated with the inode and produces them when a security modules needs to query them to do a permission check [Watson, 2001; Watson et al., 2003]. We considered storing symbolic links in the extended attribute area but chose not to do so for four reasons: First, most symbolic links fit within the 120 bytes normally used to store the direct and indirect pointers and thus do not need a disk block to be allocated to hold them. If the symbolic link is large enough to require storage in a disk block, the time to access an extended storage block is the same as the time to access a regular data block. Since symbolic links rarely have any extended attributes, there would be no savings in storage, since a filesystem fragment would be needed whether it was stored in a regular data block or in an extended storage block. If it were stored in the extended storage area, it would take more time to traverse down the attribute list to find it. New Filesystem Capabilities Several other improvements were made when the enlarged inode format was created. We decided to get an early jump on the year 2038 problem when the 32-bit time fields overflow (specifically, Tue Jan 19 03:14:08 2038 GMT, which could be a really ugly way to usher in the first author's 84th birthday). We expanded the time fields (which hold seconds-since-1970) for access, modification, and inode-modification times from 32 bits to 64 bits. At plus or minus 136 billion years that should carry us from well before the universe was created until long after our sun has burned itself out. We left the nanoseconds fields for these times at 32 bits because we did not feel that added resolution was going to be useful in the foreseeable future. We considered expanding the time to only 48 bits. We chose to go to 64 bits, since 64 bits is a native size that can be easily manipulated with existing and likely future architectures. Using 48 bits would have required an extra unpacking or packing step each time the field was read or written. Also, going to 64 bits ensures enough bits for all likely measured time so it will not have to be enlarged. We also added a new time field (also 64 bits) to hold the birth time (also commonly called the creation time) of the file. The birth time is set when the inode is first allocated and is not changed thereafter. It has been added to the structure returned by the stat system call so that applications can determine its value and so that archiving programs such as dump, tar, and pax can save this value along with the other file times. The birth time was added to a previously spare field in the stat system call structure so that the size of the structure did not change. Thus, old versions of programs that use the stat call continue to work. To date, only the dump program has been changed to save the birth time value. This new version of dump, which can dump both UFS1 and UFS2 filesystems, creates a new dump format that is not readable by older versions of restore. The updated version of restore can identify and restore from both old and new dump formats. The birth times are only available and setable from the new dump format. The utimes system call sets the access and modification times of a file to a specified set of values. It is used primarily by archive retrieval programs to set a newly extracted file's times back to those associated with the file's times in the archive. With the addition of birth time, we added a new system call that allows the setting of access, modification, and birth times. However, we realized that many existing applications will not be changed to use the new utimes system call. The result will be that the files that they retrieved from archives will have a newer birth time than access or modification times. To provide a sensible birth time for applications that are unaware of the birth time attribute, we changed the semantics of the utimes system call so that if the birth time was newer than the value of the modification time that it was setting, it sets the birth time to the same time as the modification time. An application that is aware of the birth time attribute can set both the birth time and the modification time by doing two calls to utimes. First, it calls utimes with a modification time equal to the saved birth time, and then it calls utimes a second time with a modification time equal to the (presumably newer) saved modification time. For filesystems that do not store birth times, the second call will overwrite the first call resulting in the same values for access and modification times as they would have previously gotten. For filesystems that support birth time, it will be properly set. Most happily for the application writers, they will not have to conditionally compile the name of utimes for BSD and non-BSD systems. They just write their applications to call the standard interface twice knowing that the right thing will happen on all systems and filesystems. For those applications that value speed of execution over portability can use the new version of the utimes system call that allows all time values to be set with one call. File Flags FreeBSD has two system calls, chflags and fchflags, that set the 32-bit user-flags word in the inode. The flags are included in the stat structure so that they can be inspected. The owner of the file or the superuser can set the low 16 bits. Currently, there are flags defined to mark a file as append-only, immutable, and not needing to be dumped. An immutable file may not be changed, moved, or deleted. An append-only file is immutable, except data may be appended to it. The user append-only and immutable flags may be changed by the owner of the file or the superuser. Only the superuser can set the high 16 bits. Currently, there are flags defined to mark a file as append-only and immutable. Once set, the append-only and immutable flags in the top 16 bits cannot be cleared when the system is secure. The kernel runs with four different levels of security. Any superuser process can raise the security level, but only the init process can lower that level (the init program is described in Section 14.6). Security levels are defined as follows: -1. Permanently insecure mode: Always run system in level 0 mode (must be compiled into the kernel). 0. Insecure mode: Immutable and append-only flags may be turned off. All devices can be read or written, subject to their permissions. 1. Secure mode: The superuser-settable immutable and append-only flags cannot be cleared; disks for mounted filesystems and kernel memory (/dev/mem and /dev/kmem) are read-only. 2. Highly secure mode: This mode is the same as secure mode, except that disks are always read-only whether mounted or not. This level precludes even a superuser process from tampering with filesystems by unmounting them, but it also inhibits formatting of new filesystems. Normally, the system runs with level 0 security while in single-user mode, and with level 1 security while in multiuser mode. If level 2 security is desired while the system is running in multiuser mode, it should be set in the /etc/rc startup script (the /etc/rc script is described in Section 14.6). Files marked immutable by the superuser cannot be changed except by someone with physical access to either the machine or the system console. Files marked immutable include those that are frequently the subject of attack by intruders (e.g., login and su). The append-only flag is typically used for critical system logs. If an intruder breaks in, he will be unable to cover his tracks. Although simple in concept, these two features improve the security of a system dramatically. One change in the UFS2 inode format was to split the flags field into two separate 32-bit fields: one for flags that can be set by applications (as in UFS1) and a new field for flags maintained strictly by the kernel. An example of a kernel flag is the SNAPSHOT flag used to label a file as being a snapshot. Another kernel-only flag is OPAQUE, which is used by the union filesystem to mark a directory that should not make the layers below it visible. By moving these kernel flags from the high 16 bits of the user-flags field into a separate kernel-flags field, they will not be accidentally set or cleared by a naive or malicious application. Dynamic Inodes A common complaint about the UFS1 filesystem is that it preallocates all its inodes at the time that the filesystem is created. For filesystems with millions of files, the initialization of the filesystem can take several hours. Additionally, the filesystem creation program, newfs, had to assume that every filesystem would be filled with many small files and allocate a lot more inodes than were likely to ever be used. If a UFS1 filesystem uses up all its inodes, the only way to get more is to dump, rebuild, and restore the filesystem. The UFS2 filesystem resolves these problems by dynamically allocating its inodes. The usual implementation of dynamically allocated inodes requires a separate filesystem data structure (typically referred to as the inode file) that tracks the current set of inodes. The management and maintenance of this extra data structure adds overhead and complexity and often degrades performance. To avoid these costs, UFS2 preallocates a range of inode numbers and a set of blocks for each cylinder group (cylinder groups are described in Section 8.9). Initially each cylinder group has two blocks of inodes allocated (a typical block holds 32 or 64 inodes). When the blocks fill up, the next block of inodes in the set is allocated and initialized. The set of blocks that may be allocated to inodes is held as part of the free-space reserve until all other space in the filesystem is allocated. Only then can it be used for file data. In theory, a filesystem could fill, using up all the blocks set aside for inodes. Later, after large files have been removed and many small files created to replace them, the filesystem might find itself unable to allocated the needed inodes because all the space set aside for inodes was still in use. Here, it would be necessary to reallocate existing files to move them to new locations outside the inode area. Such code has not been written as we do not expect that this condition will arise in practice, since the free space reserve used on most filesystems (8 percent) exceeds the amount of space needed for inodes (typically less than 6 percent). On these systems only a process running with root privileges would ever be able to allocate the inode blocks. Should the code prove necessary in real use, it can be written at that time. Until it is written, filesystems hitting this condition will return an "out of inodes" error on attempts to create new files. A side benefit of dynamically allocating inodes is that the time to create a new filesystem in UFS2 is about 1 percent of the time that it takes in UFS1. A filesystem that would take one hour to build in a UFS1 format can be built in under a minute in the UFS2 format. While filesystem creations are not a common operation, having them build quickly does matter to the system administrators that have to do such tasks with some regularity. The cost of dynamically allocating inodes is one extra disk write for every 64 new inodes that are created. Although this cost is low compared to the other costs of creating 64 new files, some systems administrators might want to preallocate more than the minimal number of inodes. If such a demand arises, it would be trivial to add a flag to the newfs program to preallocate additional inodes at the time that the filesystem is created. Inode Management Most of the activity in the local filesystem revolves around inodes. As described in Section 6.6, the kernel keeps a list of active and recently accessed vnodes. The decisions regarding how many and which files should be cached are made by the vnode layer based on information about activity across all filesystems. Each local filesystem will have a subset of the system vnodes to manage. Each uses an inode supplemented with some additional information to identify and locate the set of files for which it is responsible. Figure 8.3 shows the location of the inodes within the system. Figure 8.3. Layout of kernel tables.
Reviewing the material in Section 6.4, each process has a process open-file table that has slots for up to a system-imposed limit of file descriptors; this table is maintained as part of the process state. When a user process opens a file (or socket), an unused slot is located in the process's open-file table; the small integer file descriptor that is returned on a successful open is an index value into this table. The per-process file-table entry points to a system open-file entry, which contains information about the underlying file or socket represented by the descriptor. For files, the file table points to the vnode representing the open file. For the local filesystem, the vnode references an inode. It is the inode that identifies the file itself. The first step in opening a file is to find the file's associated inode. The lookup request is given to the filesystem associated with the directory currently being searched. When the local filesystem finds the name in the directory, it gets the inode number of the associated file. First, the filesystem searches its collection of inodes to see whether the requested inode is already in memory. To avoid doing a linear scan of all its entries, the system keeps a set of hash chains keyed on inode number and filesystem identifier; see Figure 8.4. If the inode is not in the table, such as the first time a file is opened, the filesystem must request a new vnode. When a new vnode is allocated to the local filesystem, a new structure to hold the inode is allocated. Figure 8.4. Structure of the inode table.
The next step is to locate the disk block containing the inode and to read that block into a buffer in system memory. When the disk I/O completes, the inode is copied from the disk buffer into the newly allocated inode entry. In addition to the information contained in the disk portion of the inode, the inode table itself maintains supplemental information while the inode is in memory. This information includes the hash chains described previously, as well as flags showing the inode's status, reference counts on its use, and information to manage locks. The information also contains pointers to other kernel data structures of frequent interest, such as the superblock for the filesystem containing the inode. When the last reference to a file is closed, the local filesystem is notified that the file has become inactive. When it is inactivated, the inode times will be updated, and the inode may be written to disk. However, it remains on the hash list so that it can be found if it is reopened. After being inactive for a period determined by the vnode layer based on demand for vnodes in all the filesystems, the vnode will be reclaimed. When a vnode for a local file is reclaimed, the inode is removed from the previous filesystem's hash chain, and, if the inode is dirty, its contents are written back to disk. The space for the inode is then deallocated, so that the vnode will be ready for use by a new filesystem client. |