Inside Microsoft Windows 2000, Third Edition (Microsoft Programming Series)
Whereas working sets describe the resident pages owned by a process or the system, the page frame number (PFN) database describes the state of each page in physical memory. Pages are in one of eight states, as shown in Table 7-22.
Table 7-22 Page States
Status | Description |
---|---|
Active (also called Valid) | The page is part of a working set (either a process working set or the system working set) or it's not in any working set (for example, a nonpaged kernel page), and a valid PTE points to it. |
Transition | A temporary state for a page that isn't owned by a working set and isn't on any paging list. A page is in this state when an I/O to the page is in progress. The PTE is encoded so that collided page faults can be recognized and handled properly. |
Standby | The page previously belonged to a working set but was removed. The page wasn't modified since it was last written to disk. The PTE still refers to the physical page but is marked invalid and in transition. |
Modified | The page previously belonged to a working set but was removed. However, the page was modified while it was in use and its current contents haven't yet been written to disk. The PTE still refers to the physical page but is marked invalid and in transition. It must be written to disk before the physical page can be reused. |
Modified no-write | Same as a modified page, except that it has been marked so that the memory manager's modified page writer won't write it to disk. The cache manager marks pages as modified no-write at the request of file system drivers. For example, NTFS uses this state for pages containing file system metadata so that it can first ensure that transaction log entries are flushed to disk before the pages they are protecting are written to disk. (NTFS transaction logging is explained in Chapter 12.) |
Free | The page is free but has unspecified dirty data in it. (These pages can't be given as a user page to a user process without being initialized with zeros, for security reasons.) |
Zeroed | The page is free and has been initialized with zeros by the zero page thread. |
Bad | The page has generated parity or other hardware errors and can't be used. |
The PFN database consists of an array of structures that represent each physical page of memory on the system. The PFN database and its relationship to page tables are shown in Figure 7-19. As this figure shows, valid PTEs point to entries in the PFN database, and the PFN database entries (for nonprototype PFNs) point back to the page table that is using them. PFN database entries for prototype PFNs point back to the prototype PTE instead.
Figure 7-19 Page tables and the page frame number database
Of the page states listed in Table 7-22, six are organized into linked lists so that the memory manager can quickly locate pages of a specific type. (Active/valid pages and transition pages aren't in any systemwide page list.) Figure 7-20 shows an example of how these entries are linked together.
Figure 7-20 Page lists in the PFN database
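As a rough illustration of this bookkeeping, the following Python sketch models the eight page states from Table 7-22 and marks the six that are kept on systemwide lists. The names PageState, LISTED_STATES, and is_on_page_list are hypothetical and chosen only for illustration; they are not the kernel's identifiers.

```python
from enum import Enum


class PageState(Enum):
    """The eight page states from Table 7-22 (names are illustrative)."""
    ACTIVE = "active"
    TRANSITION = "transition"
    STANDBY = "standby"
    MODIFIED = "modified"
    MODIFIED_NO_WRITE = "modified no-write"
    FREE = "free"
    ZEROED = "zeroed"
    BAD = "bad"


# Active/valid and transition pages aren't on any systemwide list;
# pages in the other six states are.
LISTED_STATES = frozenset(PageState) - {PageState.ACTIVE, PageState.TRANSITION}


def is_on_page_list(state):
    """Return True if pages in this state are linked into a systemwide list."""
    return state in LISTED_STATES
```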
In the next section, you'll find out how these linked lists are used to satisfy page faults and how pages move to and from the various lists.
EXPERIMENT
Viewing the PFN Database
Using the kernel debugger !memusage command, you can dump the size of the various paging lists. The following is the output from this command:
    kd> !memusage
     loading PFN database
    loading (99% complete)
                 Zeroed:      8 (    32 kb)
                   Free:      0 (     0 kb)
                Standby:   2809 ( 11236 kb)
               Modified:    756 (  3024 kb)
        ModifiedNoWrite:      1 (     4 kb)
           Active/Valid:  29150 (116600 kb)
             Transition:     10 (    40 kb)
                Unknown:      0 (     0 kb)
                  TOTAL:  32734 (130936 kb)
    Building kernel map
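The kilobyte figures in this output follow directly from the page counts. The short Python sketch below redoes the arithmetic, assuming the 4-KB x86 page size that the output itself implies (8 zeroed pages are reported as 32 kb). The variable and function names are illustrative only.

```python
PAGE_SIZE_KB = 4  # assumed x86 page size; consistent with "8 ( 32 kb)" above

# Page counts copied from the !memusage output above.
counts = {
    "Zeroed": 8,
    "Free": 0,
    "Standby": 2809,
    "Modified": 756,
    "ModifiedNoWrite": 1,
    "Active/Valid": 29150,
    "Transition": 10,
    "Unknown": 0,
}


def to_kb(pages):
    """Convert a page count to kilobytes."""
    return pages * PAGE_SIZE_KB


total_pages = sum(counts.values())
total_kb = to_kb(total_pages)
```

Here total_pages works out to 32734 and total_kb to 130936, matching the TOTAL line.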
Page List Dynamics
Figure 7-21 shows a state diagram for page frame transitions. For simplicity, the modified-no-write list isn't shown. Page frames move between the paging lists in the following ways:
- When the memory manager needs a zero-initialized page to service a demand-zero page fault (a reference to a page that is defined to be all zeros or to a user-mode committed private page that has never been accessed), it first attempts to get one from the zero page list; if the list is empty, it gets one from the free page list and zeros the page. If the free list is empty, it goes to the standby list and zeros that page.
- When the memory manager doesn't require a zero-initialized page, it goes first to the free list; if that's empty, it goes to the zeroed list. If the zeroed list is empty, it goes to the standby list. Before the memory manager can use a page frame from the standby list, it must first backtrack and remove the reference from the invalid PTE (or prototype PTE) that still points to the page frame. Because entries in the PFN database contain pointers back to the previous user's page table (or to a prototype PTE for shared pages), the memory manager can quickly find the PTE and make the appropriate change.
- When a process has to give up a page out of its working set (either because it referenced a new page and its working set was full or because the memory manager trimmed its working set), the page goes to the standby list if the page was clean (not modified) or to the modified list if the page was modified while it was resident. When a process exits, all its private pages go to the free list. Likewise, when the last reference to a page-file-backed section is closed, the section's pages go to the free list.
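The search order in the rules above can be sketched in a few lines of Python. This is a simplified model, not the memory manager's code: the function names and the dictionary of deques are hypothetical, and taking pages from the head of each list is an assumption.

```python
from collections import deque


def take_page(lists, need_zeroed):
    """Pick a page frame following the list search order described above.

    lists maps "zeroed", "free", and "standby" to deques of PFNs.
    Returns (pfn, source_list, must_zero), or None if all lists are empty.
    """
    if need_zeroed:
        order = ("zeroed", "free", "standby")
    else:
        order = ("free", "zeroed", "standby")
    for name in order:
        if lists[name]:
            pfn = lists[name].popleft()  # assumption: take from the head
            # A page from the standby list would also need its old PTE
            # (or prototype PTE) reference removed; not modeled here.
            must_zero = need_zeroed and name != "zeroed"
            return pfn, name, must_zero
    return None


def trim_destination(was_modified):
    """List that a page trimmed from a working set moves to."""
    return "modified" if was_modified else "standby"
```

For example, with an empty zeroed list, a demand-zero fault is served from the free list and the page must be zeroed first; a fault that doesn't need zeros prefers the free list even when zeroed pages exist.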
One reason zero-initialized pages are required is to meet C2 security requirements. C2 specifies that user-mode processes must be given initialized page frames to prevent them from reading a previous process's memory contents. Therefore, the memory manager gives user-mode processes zeroed page frames unless the page is being read in from a mapped file. If that's the case, the memory manager prefers to use nonzeroed page frames, initializing them with the data off the disk.
Figure 7-21 State diagram for page frames
The zero page list is populated from the free list by a system thread called the zero page thread (thread 0 in the System process). The zero page thread waits on an event object to signal it to go to work. When the free list has eight or more pages, this event is signaled. However, the zero page thread will run only if no other threads are running, because the zero page thread runs at priority 0 and the lowest priority that a user thread can be set to is 1.
EXPERIMENT
Viewing Page Fault Behavior
With the Pfmon tool in the Windows 2000 resource kit, you can watch page fault behavior as it occurs. A soft fault is a page fault satisfied from one of the transition lists. A hard fault requires a read from disk. The following example is a portion of output you'll see if you start Notepad with Pfmon and then exit. Be sure to notice the summary of page fault activity at the end.
    C:\> pfmon notepad
    SOFT: KiUserApcDispatcher : KiUserApcDispatcher
    SOFT: LdrInitializeThunk : LdrInitializeThunk
    SOFT: 0x77f61016 : : 0x77f61016
    SOFT: 0x77f6105b : : fltused+0xe00
    HARD: 0x77f6105b : : fltused+0xe00
    SOFT: LdrQueryImageFileExecutionOptions : LdrQueryImageFileExecutionOptions
    SOFT: RtlAppendUnicodeToString : RtlAppendUnicodeToString
    SOFT: RtlInitUnicodeString : RtlInitUnicodeString
When the modified list gets too big, or if the size of the zeroed and standby lists falls below a minimum threshold (as indicated by the kernel variable MmMinimumFreePages, which is computed at system boot time), a system thread called the modified page writer is awakened to write pages back to disk and move the pages to the standby list.
Modified Page Writer
The modified page writer is responsible for limiting the size of the modified page list by writing pages back to disk when the list becomes too big. It consists of two system threads: one to write out modified pages (MiModifiedPageWriter) to the paging file and a second one to write modified pages to mapped files (MiMappedPageWriter). Two threads are required to avoid creating a deadlock, which would occur if the writing of mapped file pages caused a page fault that in turn required a free page when no free pages were available (thus requiring the modified page writer to create more free pages). By having the modified page writer perform mapped file paging I/Os from a second system thread, that thread can wait without blocking regular page file I/O.
Both threads run at priority 17 and, after initialization, wait on separate event objects to trigger their operation. The modified page writer event is triggered for one of two reasons:
- When the number of modified pages exceeds the maximum value computed at system initialization (MmModifiedPageMaximum)
- When the number of available pages (MmAvailablePages) goes below MmMinimumFreePages
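The two trigger conditions amount to a simple predicate. The Python sketch below expresses it with hypothetical parameter names that mirror the kernel variables mentioned above; it illustrates the logic only and is not the kernel's actual check.

```python
def modified_writer_should_wake(modified_pages, available_pages,
                                modified_page_maximum, minimum_free_pages):
    """Return True if either trigger condition described above holds.

    modified_page_maximum stands in for MmModifiedPageMaximum and
    minimum_free_pages for MmMinimumFreePages.
    """
    too_many_modified = modified_pages > modified_page_maximum
    too_few_available = available_pages < minimum_free_pages
    return too_many_modified or too_few_available
```

Either condition alone is enough: a short modified list doesn't prevent a wake-up when available pages run low.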
Table 7-23 shows the number of pages that trigger the waking of the modified page writer to reduce the size of the modified list and how many pages it leaves on the list. As with other memory management variables, these values are computed at system boot time and depend on the amount of physical memory.
Table 7-23 Modified Page Writer Values
Memory Size | Modified Page Threshold | Retain Modified Pages |
---|---|---|
< 12 MB | 100 | 40 |
12-19 MB | 150 | 80 |
19-33 MB | 300 | 150 |
>33 MB (special case) | 400 | 800 |
The modified page writer waits on an additional event (MiMappedPagesTooOldEvent) that is set after a predetermined number of seconds (MmModifiedPageLifeInSeconds) to indicate that mapped pages (not modified pages) should be written to disk. By default, this value is 300 seconds (5 minutes). (You can override this value by adding the DWORD registry value HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\ModifiedPageLife.) The reason for this additional event is to reduce data loss in the case of a system crash or power failure by eventually writing out modified mapped pages even if the modified list hasn't reached the thresholds listed in Table 7-23.
When invoked, the mapped page writer attempts to write as many pages as possible to disk with a single I/O request. It accomplishes this by examining the original PTE field of the PFN database elements for pages on the modified page list to locate pages in contiguous locations on the disk. Once a list is created, the pages are removed from the modified list, an I/O request is issued, and at successful completion of the I/O request, the pages are placed at the tail of the standby list.
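The idea of batching pages that sit next to each other on disk can be illustrated with a small Python sketch. It assumes each modified page is described by a (PFN, disk page offset) pair, where the offset stands in for the location recorded in the original PTE field. The function name and the sort-then-group approach are illustrative assumptions; the text above doesn't describe the actual scan.

```python
def cluster_by_disk_offset(pages):
    """Group (pfn, disk_page_offset) pairs into runs of consecutive offsets.

    Each returned run could, in principle, be written with one I/O request.
    """
    runs = []
    for pfn, offset in sorted(pages, key=lambda pair: pair[1]):
        if runs and offset == runs[-1][-1][1] + 1:
            runs[-1].append((pfn, offset))   # extends the current run
        else:
            runs.append([(pfn, offset)])     # starts a new run
    return runs
```

Given pages at disk offsets 7, 5, 6, and 20, the sketch produces one run covering offsets 5 through 7 and a second run holding only offset 20.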
Pages that are in the process of being written can be referenced by another thread. When this happens, the reference count and the share count in the PFN entry that represents the physical page are incremented to indicate that another process is using the page. When the I/O operation completes, the modified page writer notices that the share count is no longer 0 and doesn't place the page on the standby list.
PFN Data Structures
Although PFN database entries are of fixed length, they can be in several different states, depending on the state of the page. Thus, individual fields have different meanings depending on the state. The states of a PFN entry are shown in Figure 7-22.
Figure 7-22 States of PFN database entries
Several fields are the same for several of the PFN types, but others are specific to a given type of PFN. The following fields appear in more than one PFN type:
- PTE address Virtual address of the PTE that points to this page.
- Reference count The number of references to this page. The reference count is incremented when a page is first added to a working set and/or when the page is locked in memory for I/O (for example, by a device driver). The reference count is decremented when the share count becomes 0 or when pages are unlocked from memory. When the share count becomes 0, the page is no longer owned by a working set. Then, depending on the reference count, the PFN database entry that describes the page is updated to add the page to the free, standby, or modified list.
- Type The type of page represented by this PFN (active/valid, transition, standby, modified, modified no-write, free, zeroed, or bad).
- Flags The information contained in the flags field is shown in Table 7-24.
- Original PTE contents All PFN database entries contain the original contents of the PTE that pointed to the page (which could be a prototype PTE). Saving the contents of the PTE allows it to be restored when the physical page is no longer resident.
- PFN of PTE Physical page number of the page table page containing the PTE that points to this page.
Table 7-24 Flags Within PFN Database Entries
Flag | Meaning |
---|---|
Modified state | Indicates whether the page was modified. (If the page is modified, its contents must be saved to disk before removing it from memory.) |
Prototype PTE | Indicates that the PTE referenced by the PFN entry is a prototype PTE. (That is, this page is sharable.) |
Parity error | Indicates that the physical page contains parity or error correction code (ECC) errors. |
Read in progress | Indicates that an in-page operation is in progress for the page. The first DWORD contains the address of the event object that will be signaled when the I/O is complete; also used to indicate the first PFN for nonpaged pool allocations. |
Write in progress | Indicates that a page write operation is in progress. The first DWORD contains the address of the event object that will be signaled when the I/O is complete; also used to indicate the last PFN for nonpaged pool allocations. |
Start of nonpaged pool | For nonpaged pool pages, indicates that this is the first PFN for a given nonpaged pool allocation. |
End of nonpaged pool | For nonpaged pool pages, indicates that this is the last PFN for a given nonpaged pool allocation. |
In-page error | Indicates that an I/O error occurred during the in-page operation on this page. (In this case, the first field in the PFN contains the error code.) |
The remaining fields are specific to the type of PFN. For example, the first PFN in Figure 7-22 represents a page that is active and part of a working set. The share count field represents the number of PTEs that refer to this page. (Pages marked read-only, copy-on-write, or shared read/write can be shared by multiple processes.) For page table pages, this field is the number of valid PTEs in the page table. As long as the share count is greater than 0, the page isn't eligible for removal from memory.
The working set index field is an index into the process working set list (or the system or session working set list, or zero if not in any working set) where the virtual address that maps this physical page resides. If the page is a private page, the working set index field refers directly to the entry in the working set list because the page is mapped only at a single virtual address. In the case of a shared page, the working set index is a hint that is guaranteed to be correct only for the first process that made the page valid. (Other processes will try to use the same index where possible.) The process that initially sets this field is guaranteed to refer to the proper index and doesn't need to add a working set list hash entry referenced by the virtual address into its working set hash tree. This guarantee reduces the size of the working set hash tree and makes searches faster for these particular direct entries.
The second PFN in Figure 7-22 is for a page on either the standby or the modified list. In this case, the forward and backward link fields link the elements of the list together. This linking allows pages to be easily manipulated to satisfy page faults. When a page is on one of the lists, the share count is by definition 0 (because no working set is using the page) and therefore can be overlaid with the backward link. However, the reference count might not be 0 because an I/O could be in progress for this page (for example, when the page is being written to disk).
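The interplay between the share count and the reference count can be modeled with a small Python class. This is a deliberately simplified sketch: the class and method names are hypothetical, "first added to a working set" is read as the share count going from 0 to 1, and the free-list case (for example, process exit) is left out.

```python
class PfnEntry:
    """Toy model of the share count and reference count rules (illustrative)."""

    def __init__(self, modified=False):
        self.share_count = 0
        self.reference_count = 0
        self.modified = modified
        self.list_name = None  # None while the page is on no paging list

    def add_to_working_set(self):
        self.list_name = None
        self.share_count += 1
        if self.share_count == 1:       # first working set to own the page
            self.reference_count += 1

    def remove_from_working_set(self):
        self.share_count -= 1
        if self.share_count == 0:       # no working set owns the page now
            self._drop_reference()

    def lock_for_io(self):
        self.reference_count += 1

    def unlock_after_io(self):
        self._drop_reference()

    def _drop_reference(self):
        self.reference_count -= 1
        if self.reference_count == 0 and self.share_count == 0:
            self.list_name = "modified" if self.modified else "standby"
```

In this model, a page that is trimmed while locked for I/O keeps a nonzero reference count and joins a list only after the I/O lock is released.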
The third PFN in Figure 7-22 is for a page on the free or zeroed list. Besides being linked together within the two lists, these PFN database entries use an additional field to link physical pages by "color," their location in the processor's memory cache. Windows 2000 attempts to minimize unnecessary thrashing of CPU memory caches by spreading physical pages across different parts of the cache. It achieves this optimization by avoiding using the same cache entry for two different pages wherever possible. For systems with direct mapped caches, optimally using the hardware's capabilities can result in a significant performance advantage.
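To make the notion of color concrete, the Python sketch below uses the textbook model for a direct-mapped, physically indexed cache: two pages collide when their page frame numbers are equal modulo the number of page-sized slots in the cache. The formula, the 256-KB cache size in the example, and the function names are assumptions for illustration; the text above doesn't say how Windows 2000 computes a page's color.

```python
PAGE_SIZE = 4096  # assumed x86 page size in bytes


def page_color(pfn, cache_size_bytes):
    """Color of a page frame under the simple direct-mapped cache model."""
    num_colors = cache_size_bytes // PAGE_SIZE
    return pfn % num_colors


def pick_page(free_pfns, wanted_color, cache_size_bytes):
    """Prefer a free page of the wanted color; otherwise take any free page."""
    for pfn in free_pfns:
        if page_color(pfn, cache_size_bytes) == wanted_color:
            return pfn
    return free_pfns[0] if free_pfns else None
```

With a 256-KB cache there are 64 colors under this model, so page frames 64 and 128 share color 0 and would compete for the same cache entries, whereas frame 65 has color 1.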
The fourth PFN in Figure 7-22 is for a page that has an I/O in progress (for example, a page read). While the I/O is in progress, the first field points to an event object that will be signaled when the I/O completes. If an in-page error occurs, this field contains the Windows 2000 error status code representing the I/O error. This PFN type is used to resolve collided page faults.
EXPERIMENT
Viewing PFN Entries
You can examine individual PFN entries with the kernel debugger !pfn command. You first need to supply the PFN as an argument. (For example, !pfn 0 shows the first entry, !pfn 1 shows the second, and so on.) In the following example, the PTE for virtual address 0x50000 is displayed, followed by the PFN that contains the page directory and then the actual page:
    kd> !pte 50000
    00050000 - PDE at C0300000        PTE at C0000140
              contains 00700067      contains 00DAA047
            pfn 00700 --DA--UWV    pfn 00DAA --D---UWV

    kd> !pfn 700
      PFN 00000700 at address 827CD800
      flink       00000004  blink / share count 00000010  pteaddress C0300000
      reference count 0001                                 color 0
      restore pte 00000080  containing page        00030  Active     M
      Modified

    kd> !pfn daa
      PFN 00000DAA at address 827D77F0
      flink       00000077  blink / share count 00000001  pteaddress C0000140
      reference count 0001                                 color 0
      restore pte 00000080  containing page        00700  Active     M
      Modified
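The addresses and page frame numbers in this output can be reproduced by hand. The Python sketch below assumes the standard 32-bit x86 (non-PAE) layout: 4-KB pages, 4-byte PTEs, page tables mapped at 0xC0000000, and the page directory at 0xC0300000, which is consistent with the addresses the debugger printed. The function names are illustrative.

```python
PTE_BASE = 0xC0000000  # assumed base of the mapped page tables (x86, non-PAE)
PDE_BASE = 0xC0300000  # assumed base of the mapped page directory

# Hardware flag bits in an x86 PTE that appear in the debugger's flag string.
FLAG_BITS = {"V": 0x01, "W": 0x02, "U": 0x04, "A": 0x20, "D": 0x40}


def pte_address(va):
    """Virtual address of the PTE that maps va."""
    return PTE_BASE + (va >> 12) * 4


def pde_address(va):
    """Virtual address of the PDE that maps va."""
    return PDE_BASE + (va >> 22) * 4


def pte_pfn(pte):
    """Page frame number stored in the upper 20 bits of a PTE."""
    return pte >> 12


def pte_flags(pte):
    """Set of flag letters (V, W, U, A, D) that are set in the PTE."""
    return {name for name, bit in FLAG_BITS.items() if pte & bit}
```

For virtual address 0x50000, pte_address gives 0xC0000140 and pde_address gives 0xC0300000. Decoding the PDE contents 00700067 yields PFN 0x700, and decoding the PTE contents 00DAA047 yields PFN 0xDAA, which are the two frames passed to !pfn in the example.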
In addition to the PFN database, the system variables in Table 7-25 describe the overall state of physical memory.
Table 7-25 System Variables That Describe Physical Memory
Variable | Description |
---|---|
MmNumberOfPhysicalPages | Total number of physical pages available on the system |
MmAvailablePages | Total number of available pages on the system—the sum of the pages on the zeroed, free, and standby lists |
MmResidentAvailablePages | Total number of physical pages that would be available if every process were at its minimum working set size |
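As a worked example of the MmAvailablePages definition, the following Python sketch sums the three list sizes. The sample counts are the zeroed, free, and standby figures from the !memusage output shown earlier in this section; the function name is hypothetical.

```python
def available_pages(zeroed, free, standby):
    """Sum of the zeroed, free, and standby list sizes, per Table 7-25."""
    return zeroed + free + standby


# Counts taken from the earlier !memusage output: 8 zeroed, 0 free, 2809 standby.
sample_available = available_pages(zeroed=8, free=0, standby=2809)
sample_available_kb = sample_available * 4  # assuming 4-KB pages
```

For that sample system, the result is 2817 available pages, or 11268 KB.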