We start our examination of LVM data structures with the layout of the physical volumes. Figure 11-4 shows an overview of the LVM metadata structures on a physical volume. Bootable LVM disks are created with the pvcreate -B option and have a logical interchange format (LIF) file system header located in the first 8 KB of the disk. LIF is actually an ancient file system type used by the HP-UX boot loader. In the case of an LVM disk, it is a simple directory structure containing pointers to boot files stored in the boot disk reserved area (BDRA) on bootable disks.

Following the boot block, we see the physical volume reserved area (PVRA). This structure contains the LVM record, which stores information about this specific physical volume and offset pointers for each of the LVM structures on the disk. In effect, you may think of the LVM record as a type of LVM superblock for the disk. In addition, there is a bad block directory with relocation information for blocks identified by LVM as needing to be replaced. As we mentioned, there is an area for the BDRA if this is a bootable disk. Also note the duplicates of each structure.

Next is the volume group reserved area (VGRA), which in turn contains the volume group descriptor area (VGDA), the volume group status area (VGSA), and the mirror consistency record (MCR). The VGDA structure is critical to LVM's ability to map logical to physical extents. The majority of the extent-mapping information is held in the VGDA along with information about the volume group the disk belongs to. The VGSA contains stale extent and missing physical volume information. The MCR provides space for the mirror write cache (MWC) consistency data stored on the drive. As with the PVRA, there are duplicates of each structure. Figure 11-5 illustrates the PVRA and VGRA in greater detail.

Figure 11-5. PVRA and VGRA Components
After the VGRA is the bulk of the disk's space, the area mapped as physical extents. Only whole extents may be allocated within LVM, so a partial extent will remain unused. In the worst case, the wasted space is slightly smaller than one extent. This is one consideration when deciding on the extent size: the larger the extent size, the larger the possible waste may be. Considering that modern disk drives are measured in gigabytes, the potential to waste just under 4, or even 8 or 16, megabytes doesn't seem to be much of a concern.

Bringing up the end is an area for bad block relocation, known as the bad block pool, and a small optional structure for cluster-locking information if the volume group is to be used in conjunction with an HP Serviceguard cluster environment. Bad block relocation under LVM control may be disabled during configuration. Considering that most modern disk drive controllers handle the relocation of bad blocks, many administrators choose to disable bad block checking within LVM.

The PVRA and VGRA

The kernel defines the location of several key structures on an LVM physical volume (see Table 11-1). Disk drives are block-oriented devices. Access to their data must be performed a sector at a time. The basic size is either 1 KB per sector or 2 KB per sector on newer drives. The primary LVM record is located at sector 8. The primary bad block directory is at sector 9 for 1 KB/sector drives and at sector 10 for 2 KB/sector drives. The secondary LVM record starts at sector 72. The secondary bad block directory is at sector 73 for 1 KB/sector drives and at sector 74 for 2 KB/sector drives. The overall size of the PVRA is set to 128 sectors, and the bad block directory is set to 55 sectors.

Table 11-1. Kernel Parameters for LVM Disk-Based Structures

Kernel parameter | #define | Sector # |
---|---|---|
Primary LVM record | PVRA_LVM_REC_SN1 | 8 |
Primary bad block directory | PVRA_BBDIR_SN1 | 9 (or 10) |
Secondary LVM record | PVRA_LVM_REC_SN2 | 72 |
Secondary bad block directory | PVRA_BBDIR_SN2 | 73 (or 74) |
Primary boot data record | BDRA_BDR_SN1 | 128 |
Secondary boot data record | BDRA_BDR_SN2 | 136 |
Overall size of the BDRA | BDRA_SIZE | 16 |
Length of the boot disk record | BDRA_BDR_LENGTH | 2 |
Length of the physical volume list | BDRA_PVL_LENGTH | 6 |
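The sector numbers in Table 11-1 can be turned into byte offsets with a line of arithmetic. The following Python fragment is only a sketch for the 1 KB/sector case. It assumes that sectors are numbered from zero and that a byte offset is simply the sector number multiplied by the sector size; the dictionary and function names are hypothetical and are not part of any HP-UX interface.

```python
SECTOR_SIZE = 1024  # bytes; the 1 KB/sector case described in the text

# Sector numbers for 1 KB/sector drives, copied from Table 11-1.
PVRA_SECTORS = {
    "PVRA_LVM_REC_SN1": 8,
    "PVRA_BBDIR_SN1": 9,
    "PVRA_LVM_REC_SN2": 72,
    "PVRA_BBDIR_SN2": 73,
    "BDRA_BDR_SN1": 128,
    "BDRA_BDR_SN2": 136,
}


def byte_offset(name, sector_size=SECTOR_SIZE):
    """Return the assumed byte offset of a structure from the start of the disk."""
    return PVRA_SECTORS[name] * sector_size


# Print each structure with its sector number and assumed byte offset.
for name, sector in PVRA_SECTORS.items():
    print("%-18s sector %3d  byte offset %6d" % (name, sector, byte_offset(name)))
```

Under these assumptions the primary LVM record lands at byte 8192, directly after the 8 KB that the text assigns to the LIF header, and the primary boot data record lands at sector 128, which is where a 128-sector PVRA would end.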
Now let's examine the disk-resident structures in greater detail (Listings 11.1 and 11.2). We use our friend q4 to examine the fields of the various structures. These listings have been annotated, and in some cases redundant fields have been truncated. The PVRA begins with the lv_lvmrec structure.

Listing 11.1.
q4> fields struct lv_lvmrec

The first element of this structure is the structure's magic ID and is set to LVMREC01. It is followed by the double-word physical volume and volume group unique ID numbers.

    0 0 8 0 char[8] lvm_id
    8 0 4 0 u_int pv_id.id1
    12 0 4 0 u_int pv_id.id2
    16 0 4 0 u_int vg_id.id1
    20 0 4 0 u_int vg_id.id2

Next are the pointers and lengths of the other structures on this disk.

    24 0 4 0 u_int last_psn
    28 0 4 0 u_int pv_num
    32 0 4 0 u_int vgra_len
    36 0 4 0 u_int vgra_psn
    40 0 4 0 u_int vgda_len
    44 0 4 0 u_int vgsa_len
    48 0 4 0 u_int vgda_psn1
    52 0 4 0 u_int vgda_psn2
    56 0 4 0 u_int mcr_len
    60 0 4 0 u_int mcr_psn1
    64 0 4 0 u_int mcr_psn2
    68 0 4 0 u_int data_len
    72 0 4 0 u_int data_psn

We also see the physical extent size configured for this volume group and additional structure pointers.

    76 0 4 0 u_int pxsize
    80 0 4 0 u_int pxspace
    84 0 4 0 u_int altpool_len
    88 0 4 0 u_int altpool_psn
    92 0 4 0 u_int maxdefects
    96 0 4 0 u_int io_timeout
    100 0 4 0 u_int bdra_len
    104 0 4 0 u_int bdra_psn
    108 0 4 0 u_int bdr_len
    112 0 4 0 u_int bdr_psn1
    116 0 4 0 u_int bdr_psn2
    120 0 4 0 u_int pvl_len
    124 0 4 0 u_int pvl_psn1
    128 0 4 0 u_int pvl_psn2
    132 0 4 0 u_int cl_lock_flags
    136 0 4 0 u_int cl_lock_psn
    140 0 4 0 u_int cluster_id
    144 0 4 0 int conf_act_mode
    148 0 4 0 u_int orig_pv.pv_id.id1
    152 0 4 0 u_int orig_pv.pv_id.id2
    156 0 2 0 u_short orig_pv.pv_pxcount
    158 0 2 0 u_short orig_pv.pv_pxalloc
    160 0 1 0 u_char orig_pv.pv_num

Following the bad block directory information is the BDRA for bootable disks.

Listing 11.2.
q4> fields struct lv_bootdata

Again we begin with a magic ID, HPLVMBDR, a timestamp, and a version number.

    0 0 8 0 char[8] bd_magic
    8 0 4 0 u_int bd_timestamp
    12 0 2 0 short bd_version

Next the root volume group's boot, dump, and swap volumes are identified (note that some of these fields are not currently being used but are in place for future enhancements).

    14 0 2 0 short bd_numrootPVs
    16 0 2 0 short bd_numswapPVs
    18 0 2 0 short bd_numdumpPVs
    20 0 4 0 u_int bd_rootVGID.id1
    24 0 4 0 u_int bd_rootVGID.id2
    28 0 4 0 u_int bd_swapVGID.id1
    32 0 4 0 u_int bd_swapVGID.id2
    36 0 4 0 u_int bd_dumpVGID.id1
    40 0 4 0 u_int bd_dumpVGID.id2
    44 0 4 0 int bd_rootvg
    48 0 4 0 int bd_swapvg
    52 0 4 0 int bd_dumpvg
    56 0 4 0 int bd_rootlv[0]
    --------------------------------
    180 0 4 0 int bd_rootlv[31]
    184 0 4 0 int bd_swaplv[0]
    --------------------------------
    308 0 4 0 int bd_swaplv[31]
    312 0 4 0 int bd_dumplv[0]
    --------------------------------
    436 0 4 0 int bd_dumplv[31]
    440 0 4 0 int bd_rootPVs
    444 0 4 0 int bd_swapPVs
    448 0 4 0 int bd_dumpPVs
    452 0 4 0 u_int bd_rootPVsize
    456 0 4 0 u_int bd_swapPVsize
    460 0 4 0 u_int bd_dumpPVsize
    464 0 4 0 int bd_rootPVcksum
    468 0 4 0 int bd_swapPVcksum
    472 0 4 0 int bd_dumpPVcksum
    476 0 2 0 short bd_boot[0]
    -----------------------------------
    482 0 2 0 short bd_boot[3]
    484 0 2 0 short bd_rootdisks[0]
    -----------------------------------
    546 0 2 0 short bd_rootdisks[31]
    548 0 2 0 short bd_swapdisks[0]
    -----------------------------------
    610 0 2 0 short bd_swapdisks[31]
    612 0 2 0 short bd_dumpdisks[0]
    -----------------------------------
    674 0 2 0 short bd_dumpdisks[31]
    676 0 2 0 short bd_rootmaint[0]
    -----------------------------------
    738 0 2 0 short bd_rootmaint[31]
    740 0 2 0 short bd_swapmaint[0]
    -----------------------------------
    802 0 2 0 short bd_swapmaint[31]
    804 0 2 0 short bd_dumpmaint[0]
    -----------------------------------
    866 0 2 0 short bd_dumpmaint[31]
    868 0 4 0 int bd_flags
    872 0 4 0 int bd_reserved[0]
    -----------------------------------
    2040 0 4 0 int bd_reserved[292]
    2044 0 4 0 int bd_checksum

Let's switch our focus to the VGRA and its components. The first part of the VGRA is the VGDA, which includes four main structures: VG_header, lvol[], pvol[], and VG_trailer. The configurable maximum numbers of logical volumes and physical volumes per volume group are used to size the lvol[] and pvol[] arrays, respectively. The volume group tunables max_lv and max_pv may be set with the vgcreate command. Let's take a look at Listings 11.3 and 11.4:

Listing 11.3.
q4> fields struct VG_header

First are the timestamp and the volume group identifier.

    0 0 4 0 int vg_timestamp.tv_sec
    4 0 4 0 int vg_timestamp.tv_usec
    8 0 4 0 u_int vg_id.id1
    12 0 4 0 u_int vg_id.id2

Next is the maximum number of logical volumes and physical volumes for the volume group.

    16 0 2 0 u_short maxlvs
    18 0 2 0 u_short numpvs

The maximum number of physical extents for the volume group and its status flag follow.

    20 0 2 0 u_short maxpxs
    22 0 2 0 u_short flags
    24 0 4 0 u_int reserved2
    28 0 4 0 u_int reserved3

Listing 11.4.
q4> fields struct LV_entry

We start with the maximum size for the logical volume and its state flags.

    0 0 2 0 u_short maxlxs
    2 0 2 0 u_short lv_flags

Flag | Description |
---|---|
LVM_LVDEFINED | Logical volume entry defined |
LVM_DISABLED | lvol unavailable |
LVM_RDONLY | lvol read only |
LVM_NORELOC | bad blocks not relocated |
LVM_VERIFY | all writes to be verified |
LVM_STRICT | allocate mirror on distinct pvols |
LVM_NOMWC | no mirror consistency checks for this lvol |
LVM_PVG_STRICT | allocate mirrors from distinct PVGs |
LVM_CONSISTENCY | mirror consistency recovery required |
LVM_CLEAN | lvol has no pending writes |
LVM_CONTIGUOUS | allocate contiguous physical extents for this lvol |
Next is the configured scheduling strategy (sequential or parallel).

    4 0 1 0 u_char sched_strat

The number of mirrors, the number of stripes, and the stripe size are recorded in the next three parameters.

    5 0 1 0 u_char maxmirrors
    6 0 2 0 u_short stripes
    8 0 2 0 u_short stripe_size
    10 0 2 0 u_short reserved2

The timeout is the number of seconds allowed before a scheduled LV I/O fails.

    12 0 4 0 u_int lv_io_timeout

Each pvol[] entry consists of a PV_header and a PX_entry[]. The PX_entry[] array is sized in accordance with the maximum number of extents allowed per physical volume (max_pe may be set with the vgcreate command). This array contains the final word when it comes to which logical extent is mapped to which physical extent. The index into the PX_entry[] array represents the physical extent number; the array data contains the logical volume and logical extent IDs to which it is mapped. See Listings 11.5, 11.6, and 11.7.

Listing 11.5.
q4> fields struct PV_header

First are the physical volume identifier, an extent count, and the pvol flags.

    0 0 4 0 u_int pv_id.id1
    4 0 4 0 u_int pv_id.id2
    8 0 2 0 u_short px_count
    10 0 2 0 u_short pv_flags

Flag | Description |
---|---|
LVM_PVDEFINED | this entry is used |
LVM_PVNOALLOC | no extent allocation is allowed |
LVM_NOVGDA | pvol does not contain a VGDA |
LVM_PVRORELOC | no new defects relocated |
LVM_PVMISSING | pvol is missing |
LVM_NOATTACHED | pvol not attached |
LVM_PVPOWERFAIL | pvol is power-failing |
LVM_PVNEEDSYNC | pvol needs re-sync |
LVM_PVALTLINK | pvol not the primary link |
LVM_PVINUSE | pvol is being configured |
LVM_PVCFGRSTORD | pvol had config data restored |
LVM_PVSWITCHLINK | pvol path requires switch |
LVM_PVMOSWBACK | don't switch links back |
LVM_PVSPARD | pvol is a spare |
LVM_PVDATA_SPARED | pvol failed, data has been spared |
Next is the number of entries in the pvol extent map.

    12 0 2 0 u_short pv_msize

The final field is the maximum number of defects that may be relocated.

    14 0 2 0 u_short pv_defectlim

Listing 11.6.
q4> fields struct PX_entry

The physical extent table entries map to a logical volume and a logical extent number.

    0 0 2 0 u_short lv_index
    2 0 2 0 u_short lx_num

Listing 11.7.
q4> fields struct VG_trailer

The trailer structure is a finishing thought to the VGDA.

    0 0 4 0 int vg_timestamp.tv_sec
    4 0 4 0 int vg_timestamp.tv_usec
    8 0 4 0 u_int reserved1
    12 0 4 0 u_int reserved2
    16 0 4 0 u_int reserved3
    20 0 4 0 u_int vgda_cksum
    24 0 8 0 char[8] vgda_magic

You may be wondering why there are duplicate copies of so many of the disk-resident data structures. When a disk-based structure is updated by the LVM pseudo-driver, only one of the copies is written on each physical volume (except in the case of a volume group with a single physical volume, where both copies are updated). When the data needs to be read by the LVM driver, it chooses the newest copy. As an additional sanity check, the timestamps in the structure header and trailer are compared. If they match, we can assume that the last write to the disk was successful; if they don't match, we try another copy. Remember that all the disks in the volume group contain the same extent-mapping information for redundancy.

Following the VGDA is the VGSA, consisting of the SA_header and a trailer (Listing 11.8).

Listing 11.8.
q4> fields struct SA_header

The first two fields hold a timestamp.

    0 0 4 0 int sa_h_timestamp.tv_sec
    4 0 4 0 int sa_h_timestamp.tv_usec

Next is the maximum number of physical extents per physical volume and the maximum number of physical volumes for the volume group.

    8 0 2 0 u_short sa_maxpxs
    10 0 2 0 u_short sa_maxpvs
    12 0 4 0 u_int reserved1

The final component in the structure is a checksum.

    16 0 4 0 u_int sa_checksum

The corresponding trailer is 16 bytes in length and consists of the VGSA magic number ("VGSA0001") and a timestamp.

The MCR completes the primary structures in the VGRA and contains the disk copies of data from the MWC. The mwc_entry structure (Listing 11.9) contains the disk copies of mirror consistency cache information. We discuss the way these records are used later in this chapter.

Listing 11.9.
q4> fields struct mwc_entry

Timestamps surround 126 sets of logical volume number, track group shift, and logical track group number.

    0 0 4 0 int b_tmstamp.tv_sec
    4 0 4 0 int b_tmstamp.tv_usec
    8 0 2 0 u_short ca_p1[0].lv_number
    10 0 2 0 u_short ca_p1[0].ltgshift
    12 0 4 0 u_int ca_p1[0].lv_ltg
    ---------------------------------------------
    1008 0 2 0 u_short ca_p1[125].lv_number
    1010 0 2 0 u_short ca_p1[125].ltgshift
    1012 0 4 0 u_int ca_p1[125].lv_ltg
    1016 0 4 0 int e_tmstamp.tv_sec
    1020 0 4 0 int e_tmstamp.tv_usec

Finally we have a 1024-character pad.

    1024 0 1024 0 char[1024] pad
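To make the field offsets in Listing 11.9 concrete, the record can be described with a Python struct format string. This is only an illustrative sketch and not how the LVM driver itself reads the MCR. It assumes big-endian byte order and no compiler padding beyond what the listing shows, the function names are hypothetical, and the sample values are made up.

```python
import struct

MWC_SLOTS = 126  # number of ca_p1[] entries shown in Listing 11.9

# Assumption: big-endian, unpadded fields. Two ints for b_tmstamp, then
# (u_short, u_short, u_int) per slot, two ints for e_tmstamp, then the pad.
MWC_FORMAT = ">ii" + "HHI" * MWC_SLOTS + "ii" + "1024s"
MWC_SIZE = struct.calcsize(MWC_FORMAT)


def parse_mwc_entry(raw):
    """Decode a raw mwc_entry image into its timestamps and cache slots."""
    if len(raw) != MWC_SIZE:
        raise ValueError("expected %d bytes, got %d" % (MWC_SIZE, len(raw)))
    values = struct.unpack(MWC_FORMAT, raw)
    slot_fields = values[2:2 + 3 * MWC_SLOTS]
    slots = [
        {
            "lv_number": slot_fields[3 * i],
            "ltgshift": slot_fields[3 * i + 1],
            "lv_ltg": slot_fields[3 * i + 2],
        }
        for i in range(MWC_SLOTS)
    ]
    end = 2 + 3 * MWC_SLOTS
    return {
        "b_tmstamp": values[0:2],
        "ca_p1": slots,
        "e_tmstamp": values[end:end + 2],
    }


def build_example():
    """Pack a synthetic record: one used slot and matching timestamps."""
    fields = [100, 5]                      # b_tmstamp (tv_sec, tv_usec)
    fields += [3, 8, 42]                   # ca_p1[0]: made-up values
    fields += [0, 0, 0] * (MWC_SLOTS - 1)  # remaining slots left empty
    fields += [100, 5]                     # e_tmstamp
    fields.append(b"")                     # pad; struct zero-fills it
    return struct.pack(MWC_FORMAT, *fields)


record = parse_mwc_entry(build_example())
print(MWC_SIZE, record["ca_p1"][0], record["b_tmstamp"] == record["e_tmstamp"])
```

With this format string MWC_SIZE evaluates to 2048, which agrees with the listing: 1024 bytes of timestamps and cache slots followed by the 1024-byte pad.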