Mac OS X Internals: A Systems Approach

8.10. Unified Buffer Cache (UBC)

Historically, UNIX allocated a portion of physical memory to be used as the buffer cache. The goal was to improve performance by caching disk blocks in memory, thereby avoiding having to go to the disk when reading or writing data. Before the advent of unified buffer caching, a cached buffer was identified by a device number and a block number. Modern operating systems, including Mac OS X, use a unified approach wherein in-memory contents of files reside in the same namespace as regular memory.

The UBC conceptually exists in the BSD portion of the kernel. Each vnode corresponding to a regular file contains a reference to a ubc_info structure, which acts as a bridge between vnodes and the corresponding VM objects. Note that UBC information is not valid for system vnodes (marked as VSYSTEM), even if the vnode is otherwise regular. When a vnode is created (say, because of an open() system call), a ubc_info structure is allocated and initialized.

// bsd/sys/ubc_internal.h

struct ubc_info {
    memory_object_t          ui_pager;    // for example, the vnode pager
    memory_object_control_t  ui_control;  // pager control port
    long                     ui_flags;
    struct vnode            *ui_vnode;    // our vnode
    struct ucred            *ui_cred;     // credentials for NFS paging
    off_t                    ui_size;     // file size for vnode
    struct cl_readahead     *cl_rahead;   // cluster read-ahead context
    struct cl_writebehind   *cl_wbehind;  // cluster write-behind context
};

// bsd/sys/vnode_internal.h

struct vnode {
    ...
    union {
        struct mount    *vu_mountedhere; // pointer to mounted vfs (VDIR)
        struct socket   *vu_socket;      // Unix IPC (VSOCK)
        struct specinfo *vu_specinfo;    // device (VCHR, VBLK)
        struct fifoinfo *vu_fifoinfo;    // fifo (VFIFO)
        struct ubc_info *vu_ubcinfo;     // regular file (VREG)
    } v_un;
    ...
};

The UBC's job is to cache file-backed and anonymous memory in physical memory using a greedy approach: It will attempt to consume all available physical memory. This is especially relevant for 32-bit processes on a 64-bit machine with more than 4GB of physical memory. Although no single 32-bit process can directly address more than 4GB of virtual memory, the larger physical memory benefits all processes as it amounts to a larger buffer cache. As we saw earlier, resident pages are evicted using an LRU-like page replacement policy. Recently used pages, say, corresponding to a file that was recently read, or memory that was recently allocated, are likely to be found in the buffer cache.

You can see the buffer cache at work by using the fs_usage utility. As we saw in Chapter 6, fs_usage uses the kernel's kdebug facility to perform fine-grained tracing of kernel events. The page-fault handler (vm_fault() [osfmk/vm/vm_fault.c]) creates trace records for various types of page faults.

// bsd/sys/kdebug.h

#define DBG_ZERO_FILL_FAULT  1
#define DBG_PAGEIN_FAULT     2
#define DBG_COW_FAULT        3
#define DBG_CACHE_HIT_FAULT  4

Specifically, a fault of type DBG_CACHE_HIT_FAULT means that the handler found the page in the UBC. A fault of type DBG_PAGEIN_FAULT means that the handler had to issue I/O for that page fault. fs_usage reports these two events as CACHE_HIT and PAGE_IN, respectively. Running fs_usage to report system-wide cache hits and page-ins should show that normally, many of the I/O requests are satisfied from the UBC.

$ sudo fs_usage -f cachehit
...
11:26:36 CACHE_HIT        0.000002 WindowServer
11:26:36 CACHE_HIT        0.000002 WindowServer

Data caching can be disabled on a per-file basis by using the F_NOCACHE command with the fcntl() system call, which sets the VNOCACHE_DATA flag in the corresponding vnode. The cluster I/O layer examines this flag and performs I/O appropriately.
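For example, a user program could disable (and later re-enable) data caching for a file it has open along the following lines. This is a minimal user-space sketch; the file name handling and error reporting are illustrative.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    if ((fd = open(argv[1], O_RDONLY)) < 0) {
        perror("open");
        return 1;
    }

    // a nonzero third argument turns data caching off for this file
    // (sets VNOCACHE_DATA on the vnode); zero turns it back on
    if (fcntl(fd, F_NOCACHE, 1) < 0)
        perror("fcntl(F_NOCACHE)");

    // ... uncached reads or writes would go here ...

    (void)fcntl(fd, F_NOCACHE, 0); // re-enable caching
    close(fd);
    return 0;
}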

8.10.1. The UBC Interface

The UBC exports several routines for use by file systems. Figure 8-20 shows routines that operate on vnodes. For example, ubc_setsize(), which informs the UBC of a file size change, may be called when a file system's write routine extends the file. ubc_msync() can be used to flush out all dirty pages of an mmap()'ed vnode, for example:

int     ret;
vnode_t vp;
off_t   current_size;
...
current_size = ubc_getsize(vp);
if (current_size)
    ret = ubc_msync(vp,            // vnode
                    (off_t)0,      // beginning offset
                    current_size,  // ending offset
                    NULL,          // residual offset
                    UBC_PUSHDIRTY | UBC_SYNC); // flags

// UBC_PUSHDIRTY pushes any dirty pages in the given range to the backing store
// UBC_SYNC waits for the I/O generated by UBC_PUSHDIRTY to complete

Figure 8-20. Examples of exported UBC routines

// convert logical block number to file offset
off_t ubc_blktooff(vnode_t vp, daddr64_t blkno);

// convert file offset to logical block number
daddr64_t ubc_offtoblk(vnode_t vp, off_t offset);

// retrieve the file size
off_t ubc_getsize(vnode_t vp);

// file size has changed
int ubc_setsize(vnode_t vp, off_t new_size);

// get credentials from the ubc_info structure
struct ucred *ubc_getcred(vnode_t vp);

// set credentials in the ubc_info structure, but only if no credentials
// are currently set
int ubc_setcred(vnode_t vp, struct proc *p);

// perform the clean/invalidate operation(s) specified by flags on the range
// specified by (start, end) in the memory object that backs this vnode
errno_t ubc_msync(vnode_t vp, off_t start, off_t end, off_t *resid, int flags);

// ask the memory object that backs this vnode if any pages are resident
int ubc_pages_resident(vnode_t vp);
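As an example of using the routines in Figure 8-20, a file system's write path that has just extended a file might inform the UBC of the new size roughly as follows. This is only a hedged sketch: vp and new_eof are assumed to be supplied by the surrounding code, and error handling is omitted.

// hypothetical fragment of a file system's write path
vnode_t vp;       // the file's vnode (assumed to be set up by the caller)
off_t   new_eof;  // the new end-of-file after the write (assumed)
off_t   old_eof;
...
old_eof = ubc_getsize(vp);
if (new_eof > old_eof) {
    // the write extended the file; tell the UBC about the new size
    (void)ubc_setsize(vp, new_eof);
}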

Moreover, the UBC provides routines such as the following for working with UPLs (a usage sketch follows the list).

  • ubc_create_upl() creates a UPL given a vnode, offset, and size.

  • ubc_upl_map() maps an entire UPL into an address space. ubc_upl_unmap() is the corresponding unmap function.

  • ubc_upl_commit(), ubc_upl_commit_range(), ubc_upl_abort(), and ubc_upl_abort_range() are UBC wrappers around UPL functions for committing or aborting UPLs in their entirety or a range within.
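The following is a hedged sketch of how these routines might fit together in kernel code, say, while operating directly on a range of a file's pages. The variable names, the flag values of 0, and the error handling are illustrative assumptions rather than code from any particular file system; the actual prototypes are declared in bsd/sys/ubc.h.

// hypothetical kernel fragment: create a UPL covering 'size' bytes of the
// file at offset 'f_offset', map it, touch the pages, and commit the UPL;
// vp, f_offset, and size are assumed to be supplied by the caller
upl_t            upl;
upl_page_info_t *pl;
vm_offset_t      addr;
kern_return_t    kr;
...
kr = ubc_create_upl(vp, f_offset, size, &upl, &pl, 0); // 0: no special flags (assumed)
if (kr != KERN_SUCCESS)
    return (EINVAL);

kr = ubc_upl_map(upl, &addr);    // map the entire UPL into the kernel map
if (kr != KERN_SUCCESS) {
    (void)ubc_upl_abort(upl, 0); // give the pages back without committing them
    return (EINVAL);
}

// ... operate on the pages at 'addr' ...

(void)ubc_upl_unmap(upl);        // undo the mapping
(void)ubc_upl_commit(upl);       // commit the entire UPL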

8.10.2. The NFS Buffer Cache

Not all types of system caches are unified, and some cannot be unified. For example, file system metadata, which is not a part of the file from the user's standpoint, needs to be cached independently. Moreover, performance considerations can make a private buffer cache more appealing in some circumstances. This is why the NFS implementation in the Mac OS X kernel uses a private buffer cache with an NFS-specific buffer structure (struct nfsbuf [bsd/nfs/nfsnode.h]).

Mac OS X versions prior to 10.3 did not use a separate buffer cache for NFS.

NFS version 3 allows a client to ask the server to perform an unstable write, wherein data is written to the server, but the server is not required to commit it to stable storage before replying. This way, the server can respond immediately to the client. Subsequently, the client can use the new COMMIT operation, also introduced in NFS version 3, to ask the server to commit the data to stable storage. Moreover, NFS version 3 provides a mechanism that allows a client to write the data to the server again if the server lost uncommitted data, perhaps because of a server reboot.

int
nfs_doio(struct nfsbuf *bp, kauth_cred_t cr, proc_t p)
{
    ...
    if (ISSET(bp->nb_flags, NB_WRITE)) { // we are doing a write
        ...
        if (/* a dirty range needs to be written out */) {
            ...
            error = nfs_writerpc(...); // let this be an unstable write
            ...
            if (!error && iomode == NFSV3WRITE_UNSTABLE) {
                ...
                SET(bp->nb_flags, NB_NEEDCOMMIT);
                ...
            }
            ...
        }
        ...
    }
    ...
}

The regular buffer cache and cluster I/O mechanisms are not aware of the NFS-specific concept of unstable writes. In particular, once a client has completed an unstable write, the corresponding buffers in the NFS buffer cache are tagged as NB_NEEDCOMMIT.

NFS also uses its own asynchronous I/O daemon (nfsiod). The regular buffer laundry thread, bcleanbuf_thread() [bsd/vfs/vfs_bio.c], is again not aware of unstable writes. While cleaning dirty NFS buffers, the laundry thread cannot help the NFS client code coalesce COMMIT requests corresponding to multiple NB_NEEDCOMMIT buffers. Instead, it would remove one buffer at a time from the laundry queue and issue I/O for it. Consequently, NFS would have to send individual COMMIT requests, which would hurt performance and increase network traffic.

Another difference between the NFS and regular buffer caches is that the former explicitly supports buffers with multiple pages. The regular buffer cache provides a single bit (B_WASDIRTY) in the buf structure for marking a page that was found dirty in the cache. The nfsbuf structure, in contrast, contains per-page bitmaps (nb_valid and nb_dirty) that allow up to 32 pages in a buffer to be individually marked as valid or dirty. Larger NFS buffers help improve NFS I/O performance.

// bsd/nfs/nfsnode.h

struct nfsbuf {
    ...
    u_int32_t nb_valid; // valid pages in the buffer
    u_int32_t nb_dirty; // dirty pages in the buffer
    ...
};

#define NBPGVALID(BP,P)     (((BP)->nb_valid >> (P)) & 0x1)
#define NBPGDIRTY(BP,P)     (((BP)->nb_dirty >> (P)) & 0x1)
#define NBPGVALID_SET(BP,P) ((BP)->nb_valid |= (1 << (P)))
#define NBPGDIRTY_SET(BP,P) ((BP)->nb_dirty |= (1 << (P)))
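For instance, code that iterates over an NFS buffer's pages could use these macros along the following lines (a hypothetical sketch; bp and npages are assumed to be set up by the surrounding NFS buffer code).

// hypothetical fragment: count the dirty pages in an nfsbuf
struct nfsbuf *bp;      // assumed to refer to a valid NFS buffer
int            npages;  // number of pages in the buffer (assumed)
int            pg, ndirty = 0;
...
for (pg = 0; pg < npages; pg++) {
    if (NBPGDIRTY(bp, pg))
        ndirty++;
}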
