The name pdir is a bit of a throwback. Originally, HP-UX used an inverse page directory that mapped physical page numbers to virtual page numbers. This meant that while the size of the table was easy to determine (simply one entry for each page), searching for a specific virtual address was a lengthy process. To reduce search times, a hashed table of virtual addresses was created. In this format, a virtual page address is used to create a hashtable index value. The index is then tested against a mask value to make sure it doesn't exceed the size of the hashtable, and the result is used as an offset into the htbl (or htbl2_0).

The hashing algorithm varies according to the hardware width. For a narrow system, the routine is as follows:

pdirhash1_1(space, offset) & (mask)

We left-shift the space value by 5 bits. The offset is right-shifted by 12 bits (these low-order bits are the byte offset into a page and as such are not part of the virtual page number). These two values are exclusively ORed to create the initial hashtable index. To ensure that the index does not exceed the size of the hashtable, an AND operation is performed between the initial index value and the mask (the size of the table, a power of two, minus one).

This is where we answer the question asked previously: by shifting the space value 5 bits to the left, the last 5 bits of the virtual page number translate directly into the hashed index. Effectively, this means that 32 sequential virtual page numbers map to 32 consecutive htbl entries.

For a wide system:

pdirhash2_0(space, offset) & (mask)

pdirhash2_0 shifts the space value 10 bits to the right (recall from our discussion of the Global Virtual Address (GVA) formation in Chapter 2, "PA-RISC 2.0 Architecture," that the lowest 10 bits of the space number are always set to 0).
Depending on whether this is a narrow or wide kernel, additional bit operations are performed on the offset value (again from our GVA discussion, there are only 30 actual virtual page address bits implemented by the current processors: 64 - 2 space register selection bits - 20 unimplemented bits - 12 page offset bits = 30 bits). As with the narrow version, to ensure that the index does not exceed the size of the hashtable, an AND operation is performed between the initial index value and the mask.

For the most part, the index calculation involves bit shifts, exclusive ORs, and ANDs. The indexing functions are quick and dirty, but keep in mind that the hardware performs these same operations in parallel: each time an implicit address is referenced, the hardware calculates the index as an atomic operation and temporarily stores its value in CR28. If a TLB fault occurs, the hardware uses CR28 as an index into the kernel hashtable in an attempt to find the correct translation. The only reason the kernel needs to duplicate this effort is so that it can move virtual pages in and out of the physical memory map and keep the htbl/pdir in sync with the changes it makes.

Creating a Sparse Table

You may be wondering what happens if two virtual page numbers hash to the same index, which is a very real possibility. When this happens, a forward linkage pointer directs the fault handler to an extension pde. Extension pdes are allocated from pages of pde structures managed by the kernel. The hashtable pages and the pages allocated for the extension pde entries are collectively called the sparse pdir.

When an extension hpde entry needs to be stored, the kernel allocates the space from the head of pdir_free_list (see Figure 6-5). In order to maintain a supply of these structures for use when needed, several kernel parameters are utilized. The pdir_free_list and pdir_free_list_tail pointers point to the front and back of a linked list of available structures.
Whenever the number of free entries begins to run low (drops below 256), the kernel allocates an additional page to the effort, carves it up into hpdes, and links them to the free list.

Figure 6-5. Sparse Tables
For wide systems, the free lists are maintained on a per-spu (system processor unit) basis, and the pointers are pd_fl2_0[spu].head and pd_fl2_0[spu].tail. Once memory pages have been allocated by the kernel for sparse structures, they are not returned to the memory free list. In actual practice, the virtual-to-physical mapping system rarely sees more than three virtual addresses hashed to the same index.

The hashed sparse table structure serves our needs very well, but what if you know the physical address and need to determine the virtual address (or addresses) currently assigned to it?

Page Frame to Virtual Page Frame Tables

Let's address a structure that is completely independent of the underlying hardware architecture. The kernel may know the physical page number and need to determine whether any virtual pages are currently mapped to it. This scenario occurs at various points when the kernel is trying to clean up resources following a process termination or panic.

In Figure 6-6, we see the page frame number to virtual address table, or pfn_to_virt_ptr, as its current incarnation is named. In early HP-UX systems, this was a very simple table: it had one entry for each physical page configured on the system, and the physical page number was the index used to get the associated virtual address, if there was one. As the complexity of the Hewlett-Packard computer hardware platforms grew, so did the complexity of physical-to-virtual mapping. The V-Class introduced partitioned memory designs to the HP-UX kernel, and in order to avoid waste in the pfn_to_virt table, it was converted to a partitioned design, thus the new name pfn_to_virt_ptr.

Figure 6-6. Physical to Virtual Address Translation
The pfn_to_virt_ptr structure merely directs us to the appropriate page partition table that contains the actual pfn_to_virt_entry structures. The granularity of the partitioning is determined by the kernel parameter PFN_CONTIGUOUS_PAGES (currently defaulted to 4096). To find the correct pfn_to_virt_entry, we divide the physical page number by 4096 to produce an index and an offset (the remainder). First, we index into pfn_to_virt_ptr and find the pointer to the pp_tbl structure containing our pfn_to_virt_entry. The remainder is our index into this pp_tbl. Let's take a look at an annotated listing of these two structures (Listings 6.3 and 6.4).

Listing 6.3. q4> fields struct pfn_to_virt_ptr

The pp_tbl points to a block of "page frame number to virtual" entries, and the pp_offset directs us to the first valid entry in the block. Each block maps 4096 physical page translations (determined by the kernel parameter PFN_CONTIGUOUS_PAGES)
0 0 4 0 * pp_tbl
4 0 4 0 int pp_offset

Listing 6.4. q4> fields struct pfn_to_virt_entry

The first word of each entry contains either the virtual space number or INVALID_SPACE_ENTRY (0xffffffff). The second word is either the page offset or a pointer to an alias structure
0 0 4 0 u_int space
4 0 4 0 u_long alias_or_offset.offset_page
4 0 4 0 * alias_or_offset.alias

Virtual Address Aliasing

Virtual address aliasing was introduced to the HP-UX kernel with the release of HP-UX 10.0. This feature was incorporated to facilitate a "copy-on-write" scenario in the process creation fork() system call. Figure 6-7 demonstrates the kernel data structure changes this entailed.

Figure 6-7. Virtual Address Aliasing
In the case of a single virtual page number being assigned to a specific physical page, no alias structure is required; the physical-to-virtual translation follows the previously discussed model. If more than one virtual address needs to be associated with a physical page, then its pfn_to_virt_entry space value is set to INVALID_SPACE_ENTRY (0xffffffff), and the offset value becomes a pointer to an alias data structure. The alias structure contains a virtual space and offset and a pointer to the next alias structure, if one exists (Listing 6.5).

Listing 6.5. q4> fields struct alias

0 0 4 0 * aa_next
4 0 4 0 u_int space
8 0 0 7 u_int aa_savear
8 7 3 1 u_int aa_saveprot
12 0 4 0 u_long offset_page

A free list of unused alias structures is maintained by the kernel and pointed to by aa_entfreelist (the alias structure free list). Another kernel parameter, min_alias_entries, sets a low-water mark for free alias structures. When the available count, aa_entcnt, falls below this limit, the kernel allocates a new page of alias structures.

When an aliased virtual address is created, an entry must also be added to the hashtable (htbl/pdir). If the new virtual address maps to a used hashtable entry, a sparse entry must be allocated. Space for one of these alias pdes is allocated from the aa_pdirfreelist instead of the system's pdir_free_list; the aa_pdirfreelist is maintained to avoid overtaxing the pdir_free_list. The system has pointers to both the head and the tail of this free list, aa_pdirfreelist and aa_pdirfreelist_tail respectively. For a wide system, the pointers are aa_pdirfreelist2_0 and aa_pdirfreelist_tail2_0. As with the alias structures, the kernel monitors their use and availability. The count aa_pdircnt is monitored; if it falls below min_alias_pdirs (256 by default), a new page of alias pdes is allocated. The total number of kernel alias pdes is stored in max_aapdir; this value may grow over time but will never shrink.
Pages allocated to the kernel for use as alias pdes or alias structures are never returned to the free page list. They represent a type of high-water mark, and the current thought is that if you need them once, you may need them again; tying up a few pages is less costly than the overhead of reallocating them. Now that we can convert from virtual to physical and back, let's consider how the kernel manages its many regions of virtual pages.

Regions of Pages

The region is the workhorse of the kernel memory management subsystem. It represents the highest level of memory resource management available to a process (via pointers from the process's pregion structures). As we see in Figure 6-8, a region is simply a collection of consecutive virtual pages. Each region occupies a unique space in the system's VAS, and the kernel must make sure regions are assigned to the correct quadrant type. As far as the region structure is concerned, there are only two types of regions: shared and private.

Figure 6-8. Region of Page Frames
The region structure contains the organizational pointers and status flags used to manage it. Let's look at an annotated listing of the region structure (Listing 6.6).

Listing 6.6. q4> fields struct region

The region flags indicate the current region state:
0x00000004 RF_ALLOC This region structure is allocated
0x00000020 RF_UNALIGNED For support of unaligned code pages
0x00000080 RF_WANTLOCK A lock requestor is waiting
0x00000200 RF_EVERSWP Set if the b-tree was ever swapped
0x00000400 RF_NOWSWP Set if the b-tree is currently swapped
0x00002000 RF_IOMAP Region created by an iomap() call
0x00004000 RF_LOCAL Front store is NFS but local swap has been allocated for text
0x00008000 RF_EXCLUSIVE Memory mapped MAP_EXCLUSIVE
0x00200000 RF_SUPERPAGE_TEXT Variable page size in use
0x02000000 RF_PGSIZE_FIXED Variable page size not in use
0x01000000 RF_MPROTECTED Indicates that mprotect() has been called for this region
0 0 4 0 enum4 r_flags

A region is either RT_PRIVATE (0x1) or RT_SHARED (0x2)
4 0 4 0 enum4 r_type

Number of pages in this region
8 0 4 0 int r_pgsz

Current number of valid vfd entries in the region's b-tree
12 0 4 0 int r_nvalid

Number of valid pages at the time of deactivation; used by the swapper
16 0 4 0 int r_dnvalid

If the RF_SWLAZY flag is set (controlled by the chatr command), this is the number of pages with an actual swap page allocated.
If lazy swap is not enabled, then this is the total number of allocated and reserved pages for the region
20 0 4 0 int r_swalloc

The number of pages allocated and reserved in pseudo-swap
24 0 4 0 int r_swapmem
28 0 4 0 int r_vfd_swapmem

Used with the mlock() call
32 0 4 0 int r_lockmem

Forward and backward pointers linking all regions using pseudo-swap (the head of this list is pointed to by the kernel pointer pswaplist)
36 0 4 0 * r_pswapf
40 0 4 0 * r_pswapb

A reference count of all pregions pointing to this region
44 0 2 0 u_short r_refcnt

A region zombie is one whose a.out file is remote (NFS-mounted) and has had its contents modified
46 0 2 0 short r_zomb

For page-aligned regions being mapped from a vnode, this is the offset into the file for the first page of the region
48 0 4 0 int r_off

Number of pregions sharing this region that are currently "in-core"
52 0 2 0 u_short r_incore

If a b-tree has been moved to swap, this points to the location of the first page in its page list (each subsequent page's swap location is contained in a pointer at the end of the previous page)
56 0 4 0 u_int r_dbd
60 0 4 0 int r_scan

The front store and back store pointers point to the preferred paging location for this region's pages in the event of high memory pressure
64 0 4 0 * r_fstore
68 0 4 0 * r_bstore

All active regions are linked via the next two pointers (the head of this list is regactive)
72 0 4 0 * r_forw
76 0 4 0 * r_back

Unaligned regions may be accessed by the TEXTHASH algorithm (there are 32 hash headers in the kernel structure texts[TEXTHSHSZ]) and linked by this pointer
80 0 4 0 * r_hchain

Next we locate the text image in the front store by its offset and length in bytes
84 0 4 0 u_long r_byte
88 0 4 0 u_long r_bytelen

Additional locking structures and flags come next.
92 0 4 0 * r_lock.interlock
96 0 4 0 u_int r_lock.delay
100 0 4 0 u_int r_lock.write_waiters
104 0 4 0 int r_lock.read_count
108 0 1 0 char r_lock.want_write
109 0 1 0 char r_lock.want_upgrade
110 0 1 0 char r_lock.waiting
111 0 1 0 char r_lock.rwl_flags
112 0 4 0 * r_lock.l_kthread
116 0 1 0 u_char r_mlock.b_lock
118 0 2 0 u_short r_mlock.order
120 0 4 0 * r_mlock.owner

Number of page I/Os currently in process for this region
124 0 4 0 int r_poip

Pointer to the root of the b-tree (if a b-tree is being used)
128 0 4 0 * r_root

The key value may be UNUSED_IDX (0x7fffffff) if we are not using a b-tree, or DONTUSE_IDX (0x7ffffffe) if we are using the b-tree pointed to by r_root. Any other value in r_key is a key associated with r_chunk
132 0 4 0 int r_key
136 0 4 0 * r_chunk

All regions associated with a front-store vnode are linked using the next two pointers
140 0 4 0 * r_next
144 0 4 0 * r_prev
148 0 4 0 * r_preg_un.r_un_pregskl
148 0 4 0 * r_preg_un.r_un_pregion
152 0 4 0 * r_psklh.l_header.n_next[0]
156 0 4 0 * r_psklh.l_header.n_next[1]
160 0 4 0 * r_psklh.l_header.n_next[2]
164 0 4 0 * r_psklh.l_header.n_next[3]
168 0 4 0 * r_psklh.l_header.n_prev
172 0 4 0 u_long r_psklh.l_header.n_key
176 0 4 0 u_long r_psklh.l_header.n_value
180 0 1 0 char r_psklh.l_header.n_flags
181 0 1 0 char r_psklh.l_header.n_cookie
184 0 4 0 * r_psklh.l_tail
188 0 4 0 * r_psklh.l_cache
192 0 4 0 * r_psklh.l_cmpf
196 0 4 0 int r_psklh.l_level
200 0 1 0 char r_psklh.l_cookie
204 0 4 0 * r_excproc
208 0 4 0 * r_lchain
212 0 4 0 int r_mlockswap

This contains the performance-optimized page sizing hint
216 0 4 0 int r_pgszhint

The following hardware-dependent layer (structure hdlregion) contains hints used when attaching the region to a pregion
220 0 4 0 u_int r_hdl.r_space
224 0 4 0 u_int r_hdl.r_prot
228 0 4 0 * r_hdl.r_vaddr
232 0 2 0 u_short r_hdl.r_hdlflags
236 0 4 0 * r_mrg
240 0 4 0 int r_mrg_status
244 0 4 0 int r_dbd_asyncinflight

A pointer to the spinlock used to protect this
structure
248 0 4 0 * r_spinlock

As we study the fields of the region, you may have noticed that it doesn't include a spot for the virtual address, just the number of page frames it manages. From the kernel's point of view, it isn't really necessary to know the virtual address, only the number of pages in the region and how they are to be used (you may recall that earlier we described the VAS as a conceptual device for memory management). We need to examine the pregion-to-region linkages in order to explain how regions may be shared.