In Figure 11-6 we see an overview of the LVM pseudodriver organization within the kernel. As mentioned earlier, an application thread requests a file open by referencing a file system pathname. The virtual file system resolves this to a specific vnode within the kernel file system table. The vnode contains the device number of the file system on which the file is located. For LVM-based file systems, the device number directs all I/O requests to the LVM pseudodriver via the kernel's device switch table. Once an I/O request is received by the driver, it must be passed through several layers.

Strategy Layer: This layer receives the initial request for a block I/O transaction to a specific file system block. The kernel facilitates this request by passing a buf structure containing the logical volume number (b_dev), request flags defining the transaction (b_flags), the block number within the logical volume (b_blkno), the byte count of the request (b_bcount), and various options (b_options). The strategy layer must validate the request, checking the availability of the requested volume and checking the requested block against the size of the volume (a simplified sketch of this handoff follows Figure 11-6).

Mirror Consistency Layer: If a logical volume has mirroring configured, then this layer must coordinate mirror writes. A volume may be configured to cache mirrored write requests in a Mirror Write Cache (MWC). A volume group is divided into logical track groups (LTGs), and a cached write request is first registered in one of the MWC's cache records (one per LTG) in kernel memory and also written to an mwc_entry on one of the physical volumes of the volume group. Originally the MWC tracked 32 LTGs per volume group, but with the 11i release this was increased to 126. The cache record does not contain the user data, but it does register the intent to update data on the disk. When the write has been completed to all mirror copies, the cache record is cleared. In the case of a system crash and reboot, the cache records on the physical volumes may be used to identify which LTGs may be out of sync. If the MWC is not used, then all logical extents on the mirrors would have to be resynchronized.

Scheduling Layer: This layer makes full use of the kernel-resident copies of the volume group's configuration information. The actual location and number of mirrors are converted into one or more physical requests. The scheduling layer accepts the buf pointer from the previous layers and directs it on its way through LVM. A logical volume may be configured to follow one of several scheduling strategies. It may be LVM_RESERVED if the request is to the reserved area on the disk (the actual metadata structures used to configure and manage the volume). Normal read and write requests may be either LVM_SEQUENTIAL or LVM_PARALLEL; this affects the methodology used for mirrored read and write requests. Finally, the request could be flagged as LVM_STRIPE for parallel striped operations.

Physical Layer: This last layer is where the rubber meets the road. The LVM driver passes requests and their associated buf structures to the actual physical device drivers responsible for the mapped physical volumes.

Figure 11-6. The LVM Pseudodriver Architecture
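To make the strategy-layer handoff more concrete, the following is a minimal C sketch of the buf fields the strategy layer consults and the kind of validation it performs. The structure and function names are illustrative simplifications, not the actual HP-UX declarations.

/*
 * Illustrative sketch only: a simplified view of the buf fields named above,
 * not the real HP-UX kernel declarations.
 */
struct lvm_buf_view {
    int  b_dev;       /* device number; the minor encodes the logical volume */
    int  b_flags;     /* request flags defining the transaction              */
    long b_blkno;     /* starting block number within the logical volume     */
    long b_bcount;    /* byte count of the request                           */
    int  b_options;   /* assorted request options                            */
};

/* The strategy layer rejects requests for unavailable volumes or for
 * blocks that fall outside the volume. */
static int
lv_strategy_check(const struct lvm_buf_view *bp, long lv_size_in_blocks)
{
    if (lv_size_in_blocks <= 0)                              /* volume not available */
        return -1;
    if (bp->b_blkno < 0 || bp->b_blkno >= lv_size_in_blocks) /* out of range */
        return -1;
    return 0;
}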
Work Queues: Keeping a Request on Track

Because the LVM pseudodriver is a kernel resource, it is common for it to be processing multiple requests in the various layers at any one point in time. In addition to servicing multiple threads, buffers may be queued while awaiting access to a physical device or transfer to the next layer within the LVM driver itself. In Figure 11-7, we see that there are a number of queues on which a request may find itself.

Figure 11-7. Work Queues
They may be divided into four different categories.

Global queues: The pf_wait_Q contains all requests that may not be completed due to a power failure. A request on this queue is waiting for termination and cleanup.

Per-volume-group queues: The vg_cache_wait queue holds requests waiting on a free entry in their volume group's MWC. The vg_cache_write queue provides a linked list of physical volumes available for an MWC update to disk. When the LVM driver needs to store MWC data to a physical disk in its volume group, it selects the one at the head of this queue.

Per-physical-volume queues: All requests scheduled for a specific physical volume are linked to the pv_ready_Q. The pv_cache_wait queue holds requests waiting for their MWC data to be written to a physical disk. The LVM system supports a feature known as physical volume links (pv links). This feature allows for automatic switchover from one bus to an alternate bus; that is, when a failure occurs on a bus controller, if the system has an alternate path available, I/O is switched to it. The pv_wait_Q holds requests waiting for the pv links switch to take effect. Currently, pv links support an active/passive mode of operation, meaning that only one interface may be active at a time; if it fails, the standby passive interface is activated. In the future, this may be enhanced to allow active/active configurations, where both interfaces share the load and increase overall throughput while still providing true hot-standby functionality.

Per-logical-volume queues: The work_Q is actually a per-logical-volume array of all outstanding requests for the volume. The array entries are the other queues on which the individual requests are currently linked. Since this master queue has knowledge of all current outstanding requests for a volume, the kernel strategy layer makes use of this information to serialize I/O requests whenever possible. The lv_ready_Q is a holding place for requests waiting to be passed to the MWC layer in the pseudodriver.

Next, let's consider the data structures stored in the kernel to support the operations of the LVM subsystem.

Kernel Resident Data Structures

When a volume group is activated (at boot or via the vgchange -a y command), its metadata is copied to kernel-resident structures. Figure 11-8 presents an overview of these structures. The starting point for these structures is the kernel volgrp[] array. As individual volgrp structures are among the largest in the kernel, their number is limited by the kernel tunable maxvgs, which defaults to 10. Note: if you are creating a new volume group directory and group file, the volume group number passed to the mknod command (the first two digits of the minor number argument) should not exceed this tunable value. The volume group's number is used as the index into the kernel-resident volgrp[] array (a small sketch of this indexing follows Figure 11-8).

Figure 11-8. Kernel-Resident Configuration Structures
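The note above about the mknod minor number can be illustrated with a small, self-contained sketch: it decodes a volume group number from an LVM device minor of the form 0xNN00LL (NN being the volume group number, LL the logical volume number) and bounds it by the default maxvgs value before using it as a volgrp[] index. The encoding, macro, and function names are assumptions for illustration only, not the kernel's actual code.

#include <stdio.h>

#define MAXVGS_DEFAULT 10                 /* default value of the maxvgs tunable */

/* Hypothetical helper: extract the volume group number (the "first two
 * digits" of the minor number) and check it against maxvgs. */
static int vg_index_from_minor(unsigned int minor)
{
    unsigned int vg_num = (minor >> 16) & 0xff;

    if (vg_num >= MAXVGS_DEFAULT)         /* would fall outside volgrp[] */
        return -1;
    return (int)vg_num;
}

int main(void)
{
    /* e.g., a group file created with "mknod group c 64 0x010000" */
    printf("volgrp index = %d\n", vg_index_from_minor(0x010000));
    return 0;
}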
Let's begin by examining the volgrp structure (Listing 11.10).

Listing 11.10. q4> fields struct volgrp

We start with lock pointers and counters

0 0 4 0 * vg_lock.interlock
4 0 4 0 u_int vg_lock.delay
8 0 4 0 int vg_lock.read_count
12 0 1 0 char vg_lock.want_write
13 0 1 0 char vg_lock.want_upgrade
14 0 1 0 char vg_lock.waiting
15 0 1 0 char vg_lock.no_swap

Next, a pointer to the lvol array (sized to 256), the number of logical volumes, a lock, a pointer to the pvol array and its size, the major number for the volume group pseudo-device file (0x40, which acts as a sanity check), the volume group identifier, and a count of the open volumes

16 0 4 0 * lvols
20 0 4 0 u_int num_lvols
24 0 4 0 * vg_pvolsListLock.lvc_slock
28 0 4 0 * pvols
32 0 4 0 u_int size_pvols
36 0 4 0 u_int num_pvols
40 0 4 0 int major_num
44 0 4 0 u_int vg_id.id1
48 0 4 0 u_int vg_id.id2
52 0 2 0 short vg_extshift
54 0 2 0 short vg_opencount
56 0 4 0 u_int vg_flags

VG_LOST_QUORUM | run quorum is lost |
VG_ACTIVATED | volume group is activated |
VG_NOLVOPENS | disallow lvol opens |
VG_READONLY | volume group activated read-only |
60 0 4 0 * vg_intlock.lvc_slock

Total number of requests processed in the strategy layer and the current pending requests

64 0 4 0 int vg_totalcount
68 0 4 0 int vg_requestcount
72 0 4 0 * vg_ca_intlock.lvc_slock

Byte offset 80 through 539 holds a variety of MWC information structures. This section is examined later in this chapter.
Pointers and offsets to various related structures

540 0 4 0 * vg_vgda
544 0 4 0 u_int vg_LVentry_off
548 0 4 0 u_int vg_PVentry_off
552 0 4 0 u_int vg_PVentry_len
556 0 4 0 u_int vg_VGtrail_off

Byte offset 560 through 1087 contains volume group status area data.
Configured limits for the volume group: logical volumes, physical volumes, physical extents, extent size, data area length, status area length, mirror cache size, the volume group number (a quick sanity check), available physical volumes, data area and status area block sizes, cluster locking ID, and configuration mode (used in conjunction with ServiceGuard configuration)

1088 0 2 0 u_short vg_maxlvs
1090 0 2 0 u_short vg_maxpvs
1092 0 2 0 u_short vg_maxpxs
1096 0 4 0 u_int vg_pxsize
1100 0 4 0 u_int vgda_len
1104 0 4 0 u_int vgsa_len
1108 0 4 0 u_int mcr_len
1112 0 4 0 int vg_num
1116 0 2 0 u_short vg_npv_avail
1118 0 2 0 u_short vg_npv_newavail
1120 0 2 0 u_short vgda_blkfactor
1122 0 2 0 u_short vgsa_blkfactor
1124 0 4 0 u_int vg_cluster_id
1128 0 4 0 int vg_config_mode

CLV_VG_CONF_STD | non-special mode |
CLV_VG_CONF_EXCL | exclusive activation mode |
CLV_VG_CONF_SHAR | shared activation mode |
The remainder of the structure holds shared mode data (if applicable), volume group switching, and spare information.

The lvol and pvol data is populated from that found in the PVRA and VGRA structures on the volume group's physical disks (Listings 11.11 and 11.12).

Listing 11.11. q4> fields struct lvol

Various queue pointers and addresses

0 0 4 0 * work_Q
4 0 4 0 * lv_ready_Q.lv_head
8 0 4 0 * lv_ready_Q.lv_tail
12 0 4 0 int lv_ready_Q.lv_count

The logical extent array pointer (used during resync operations)

16 0 4 0 * lv_lext

Three physical extent pointer maps for mapping mirrored extents

20 0 4 0 * lv_exts[0]
24 0 4 0 * lv_exts[1]
28 0 4 0 * lv_exts[2]

Pointer to the schedule queue, the number of stripes, and the stripe size

32 0 4 0 * lv_schedule
36 0 2 0 u_short lv_stripes
38 0 2 0 u_short lv_stripesize

An assortment of lock pointers and primitives

40 0 4 0 * lv_lock.interlock
44 0 4 0 u_int lv_lock.delay
48 0 4 0 int lv_lock.read_count
52 0 1 0 char lv_lock.want_write
53 0 1 0 char lv_lock.want_upgrade
54 0 1 0 char lv_lock.waiting
55 0 1 0 char lv_lock.no_swap
56 0 4 0 * lv_intlock.lvc_slock
60 0 4 0 int lv_complcnt

Next are the cumulative request count, the pending request count, and the current status flag

64 0 4 0 int lv_totalcount
68 0 4 0 int lv_requestcount
72 0 2 0 short lv_status
74 0 2 0 short lv_allow_cfgcmd_rslvr
76 0 2 0 u_short lv_ref
78 0 2 0 u_short lv_rawavoid
80 0 2 0 u_short lv_rawoptions

The lvol's physical extent count, maximum number of logical extents, and the number of in-use logical extents

84 0 4 0 u_int lv_curpxs
88 0 2 0 u_short lv_maxlxs
90 0 2 0 u_short lv_curlxs
92 0 2 0 u_short lv_flags

LVM_RESERVED | group file lvol0 strategy |
LVM_SEQUENTIAL | sequential scheduling flag |
LVM_PARALLEL | parallel scheduling flag |
LVM_STRIPE | striping enabled |
LVM_DYNAMIC | dynamic scheduling (not in current use) |
LVM_STRIPE_NEW | new stripe (not in current use) |
Current scheduling strategy, mirror count, and a pointer to the bit allocation map for the logical volume

94 0 1 0 u_char lv_sched_strat
95 0 1 0 u_char lv_maxmirrors
96 0 4 0 * lv_bitmap
100 0 2 0 u_short lv_partner
102 0 2 0 u_short lv_mimwchit
104 0 2 0 u_short lv_mimwcmiss
108 0 4 0 u_int lv_mirxfers
112 0 4 0 u_int lv_mircount
116 0 4 0 u_int lv_miwxfers
120 0 4 0 u_int lv_miwcount
124 0 4 0 * lv_vg

Byte offset 128 through 511 contains raw buffer data. Byte offset 512 through 895 contains logical volume disk sort information.
Number of seconds for a request timeout

896 0 4 0 u_int lv_io_timeout

Listing 11.12. q4> fields struct pvol

Pointers to the volume group structure, the lvmrec structure, and the bad block directory for this pvol

0 0 4 0 * pv_vg
4 0 4 0 * pv_lvmrec
8 0 4 0 * pv_bbdir

The maximum and current number of entries in the bad block directory

12 0 4 0 u_int pv_maxdefects
16 0 4 0 u_int pv_curdefects
20 0 4 0 u_int pv_vgdats[0].tv_sec
24 0 4 0 long pv_vgdats[0].tv_usec
28 0 4 0 u_int pv_vgdats[1].tv_sec
32 0 4 0 long pv_vgdats[1].tv_usec
36 0 4 0 int pv_vgra_psn
40 0 4 0 int pv_data_psn
44 0 4 0 u_int pv_pxspace

The total number of physical extents and the number of free extents for the pvol

48 0 2 0 u_short pv_pxcount
50 0 2 0 u_short pv_freepxs
52 0 4 0 * pv_intlock.lvc_slock
56 0 4 0 int pv_armpos

Work queue pointers

60 0 4 0 * pv_ready_Q.lv_head
64 0 4 0 * pv_ready_Q.lv_tail
68 0 4 0 int pv_ready_Q.lv_count

Cumulative number of transfers to this pvol, number of pending requests, status flags, and the pvol's index number within its volume group

72 0 4 0 int pv_totxf
76 0 2 0 short pv_curxfs
78 0 2 0 u_short pv_flags
80 0 1 0 u_char pv_flags2
81 0 1 0 u_char pv_num
84 0 4 0 int pv_sa_psn[0]
88 0 4 0 int pv_sa_psn[1]
92 0 4 0 u_int pv_vgsats[0].tv_sec
96 0 4 0 long pv_vgsats[0].tv_usec
100 0 4 0 u_int pv_vgsats[1].tv_sec
104 0 4 0 long pv_vgsats[1].tv_usec
108 0 4 0 * pv_cache_wait.lv_head
112 0 4 0 * pv_cache_wait.lv_tail
116 0 4 0 int pv_cache_wait.lv_count
120 0 4 0 * pv_cache_next
124 0 4 0 * pv_mwc_rec
128 0 4 0 u_int pv_mwc_latest.tv_sec
132 0 4 0 long pv_mwc_latest.tv_usec
136 0 4 0 int pv_mwc_flags
140 0 4 0 int pv_mwc_loc[0]
144 0 4 0 int pv_mwc_loc[1]
148 0 4 0 int altpool_psn
152 0 4 0 int altpool_next
156 0 4 0 int altpool_end

Physical volume defects array

160 0 4 0 * pv_defects[0]
----------------------------------------------
412 0 4 0 * pv_defects[63]
416 0 4 0 * freelist
420 0 4 0 * freelist_ptr
424 0 4 0 u_int freelistsize
428 0 4 0 u_int bbdirsize

Byte offset 432 through 523 contains the physical volume attribute data.
vnode and pv-links information

524 0 4 0 * currentPhysicalLink
528 0 4 0 * pv_wait_Q.lv_head
532 0 4 0 * pv_wait_Q.lv_tail
536 0 4 0 int pv_wait_Q.lv_count

Byte offset 540 through 927 contains physical volume buffer data.
The size of a read/write request (a multiple of 1 KB)

928 0 2 0 u_short pv_blkfactor
930 0 2 0 u_short sgio_flags

Byte offset 934 through 1011 contains physical volume spare information.
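All of the work queues that appear in these listings (lv_ready_Q, pv_ready_Q, pv_cache_wait, pv_wait_Q) share the same head/tail/count shape. The following is a minimal sketch of that shape and of appending a request to it, assuming a simple singly linked chain through the buf structures; the member and function names are illustrative, not the kernel's own.

#include <stddef.h>

struct buf {
    struct buf *av_forw;        /* illustrative chain pointer */
};

struct lvm_queue {
    struct buf *lv_head;        /* first pending request           */
    struct buf *lv_tail;        /* last pending request            */
    int         lv_count;       /* number of requests on the queue */
};

static void lvm_enqueue(struct lvm_queue *q, struct buf *bp)
{
    bp->av_forw = NULL;
    if (q->lv_tail != NULL)
        q->lv_tail->av_forw = bp;   /* append behind the current tail */
    else
        q->lv_head = bp;            /* queue was empty */
    q->lv_tail = bp;
    q->lv_count++;
}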
Now we have in the kernel all the configuration data necessary to allow translation from a logical volume offset to a physical volume offset.

One-Way and Two-Way Mirroring Mechanics and Options

Mirroring allows for either two copies (one-way mirroring) or three copies (two-way mirroring) of individual logical volumes under LVM control. The basic premise of mirroring is very straightforward, but the behind-the-scenes mechanics require additional consideration. A major concern with mirrored operations is assuring that each mirror copy has the same data. This really comes to bear when a write is requested. To help make sure that the data is consistent across all copies, LVM incorporates an MWC strategy. When a logical volume is configured to be mirrored, it may be configured to use one of three mirror-caching policies.

NONE: Choosing this policy disables all internal consistency checks. Extents are not marked as stale at activation, and mirrors are not synchronized. This may be suitable for a swap volume.

NOMWC: No MWC records are kept, and there is no performance cost during normal operation. At volume group activation time, all but one mirror copy will be marked as stale. The activation process may take some time, as all extents of the stale copies will have to be copied from the non-stale copy.

MWC: If this policy is selected, then for any write request to proceed on a mirrored volume, a request is passed to the MWC layer of the driver. An entry must be made in an MWC structure and copied to one of the volume group's physical volumes before the request may continue through to the next LVM layer. If a resynchronization is required, all track groups represented by an active entry in the disk-based copies of the cache data must be synchronized. Activation here proceeds more quickly than with NOMWC, since only LTGs with incomplete MWC entries need to be synchronized. The performance hit lies with the requirement that the MWC be written to disk before LVM advances the request through its various work queues (this ordering is sketched following Figure 11-9).

Note that while all mirrors should be identical following resynchronization, there is no way to know which mirror had the most current data: the mirror chosen to be copied to the others is selected at random.

The MWC data on a physical disk is stored in the mwc_entry structure we examined earlier in this chapter and is sized to hold 126 individual entries (as of HP-UX 11i). There is also a kernel-based copy of this information (see Figure 11-9).

Figure 11-9. Mirror Write Consistency Records
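Conceptually, the MWC policy imposes one ordering constraint ahead of every mirrored write. The C sketch below illustrates that ordering only; all structure and function names are hypothetical stand-ins, and the real driver works through the work queues shown in Figure 11-7 rather than direct synchronous calls.

enum { MWC_ENTRIES = 126 };                /* cache records per volume group (11i) */

struct mwc_cache {
    unsigned int ltg[MWC_ENTRIES];         /* LTGs with writes in flight */
    int          used;
};

/* Stubs standing in for the real disk I/O paths. */
static void flush_mwc_to_disk(struct mwc_cache *c) { (void)c; }
static void write_all_mirrors(unsigned int ltg)    { (void)ltg; }

static int mirrored_write(struct mwc_cache *c, unsigned int ltg_number)
{
    if (c->used == MWC_ENTRIES)
        return -1;                         /* no free record: wait on vg_cache_wait  */
    c->ltg[c->used++] = ltg_number;        /* 1. register the intent in memory       */
    flush_mwc_to_disk(c);                  /* 2. persist the mwc_entry to a pvol     */
    write_all_mirrors(ltg_number);         /* 3. issue the write to every mirror     */
    c->used--;                             /* 4. clear the record once copies finish */
    return 0;
}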
Listing 11.13 is an extracted portion of a listing (with annotation) created using q4> fields struct volgrp.

Listing 11.13. q4> fields struct volgrp

This contains the spinlock structure for controlling mp-access to this data

72 0 4 0 * vg_ca_intlock.lvc_slock

Byte offset 80 through 463 contains the vg_cache_lbuf structure used by the MWC to store MWC entries to a physical volume.
Next we have the linkage pointers to the vg_cache_wait or vg_cache_write wait queues

464 0 4 0 * vg_cache_wait.lv_head
468 0 4 0 * vg_cache_wait.lv_tail
472 0 4 0 int vg_cache_wait.lv_count
476 0 4 0 * vg_cache_write.lv_pvhead
480 0 4 0 * vg_cache_write.lv_pvtail
484 0 4 0 int vg_cache_write.lv_pvcount

The vg_mwc_rec points to the memory-resident copy of the LVM record data

488 0 4 0 * vg_mwc_rec

This points to the beginning of the memory-resident cache array, followed by a pointer to the least recently used element in the list

492 0 4 0 * ca_part2
496 0 4 0 * ca_lst

This is the hash list used to speed searches for a cached entry

500 0 4 0 * ca_hash[0]
---------------------------------------
528 0 4 0 * ca_hash[7]

The number of current free entries, the total number of entries, and the number of changed entries in memory (dirty entries)

532 0 1 0 u_char ca_free
533 0 1 0 u_char ca_size
534 0 1 0 u_char ca_chgcount

The cache flags

535 0 1 0 u_char ca_flags

CACHE_ACTIVATED | cache has been initialized |
CACHE_INFLIGHT | cache being written to disk |
CACHE_CHANGED | memory cache is currently dirty |
CACHE_CLEAN | something is waiting for disk write to complete |
536 0 2 0 u_short ca_clean_lvnum
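To close, here is a hedged sketch of how the eight-bucket ca_hash[] array can speed the search for a cached record, keyed here by LTG number. The hash function, entry layout, and names are assumptions for illustration, not the kernel's implementation.

#include <stddef.h>

#define CA_HASH_SIZE 8

struct ca_entry {
    unsigned int     ltg;           /* logical track group this record covers */
    struct ca_entry *hash_next;     /* next entry in the same hash bucket     */
};

static struct ca_entry *ca_hash[CA_HASH_SIZE];

static struct ca_entry *ca_lookup(unsigned int ltg)
{
    struct ca_entry *e;

    /* simple modulo hash into one of the eight buckets */
    for (e = ca_hash[ltg % CA_HASH_SIZE]; e != NULL; e = e->hash_next)
        if (e->ltg == ltg)
            return e;               /* hit: write intent already recorded  */
    return NULL;                    /* miss: a free record must be claimed */
}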