The Linux Kernel Primer. A Top-Down Approach for x86 and PowerPC Architectures
4.10. Page Fault
Throughout its lifespan, a process might attempt to access an address that belongs to its address space but is not loaded in RAM. Alternatively, it might access a page that is in RAM but attempt an action that violates the page's permission settings (for example, writing to a read-only area). When either happens, the hardware raises a page fault exception, which the kernel traps. The page fault handler is the exception handler that manages these errors in a program's page access. When the page is simply missing, the kernel allocates it and fetches its contents from storage if necessary.
Each architecture has an architecture-dependent function that handles page faults. Both x86 and PPC call the function do_page_fault(). The x86 page fault handler do_page_fault(*regs, error_code) is located in /arch/i386/mm/fault.c. The PowerPC page fault handler do_page_fault(*regs, address, error_code) is located in /arch/ppc/mm/fault.c. The two are similar enough that a discussion of do_page_fault() for the x86 covers the functionality of the PowerPC version.
The major difference between the two architectures lies in how the fault information is gathered and stored before do_page_fault() is called. We first explain the specifics of x86 page fault handling and then explain the do_page_fault() function. We follow this explanation by highlighting the differences seen in PowerPC.
4.10.1. x86 Page Fault Exception
The x86 page fault handler do_page_fault() is called as the result of hardware interrupt 14. This interrupt occurs when the processor identifies the following conditions to be true:
Upon raising this interrupt, the processor saves two valuable pieces of information:
The regs parameter of do_page_fault() is a struct that contains the system registers, and the error_code parameter uses a 3-bit field to describe the source of the fault.
4.10.2. Page Fault Handler
For both architectures, the do_page_fault() function uses this information to take one of several actions. The code follows a fairly complicated series of checks and ends up with one of the following:
-----------------------------------------------------------------------------
arch/i386/mm/fault.c
212 asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code)
213 {
214     struct task_struct *tsk;
215     struct mm_struct *mm;
216     struct vm_area_struct * vma;
217     unsigned long address;
218     unsigned long page;
219     int write;
220     siginfo_t info;
221
222     /* get the address */
223     __asm__("movl %%cr2,%0":"=r" (address));
...
232     tsk = current;
233
234     info.si_code = SEGV_MAPERR;
-----------------------------------------------------------------------------
Line 223
The address at which the page fault occurred is stored in the cr2 control register. The linear address is read and the local variable address is set to hold the value.
Line 232
The task_struct pointer tsk is set to point at the task_struct current.
Now, we are ready to find out more about where the address that generated the page fault comes from. Figure 4.14 illustrates the flow of the following lines of code:
-----------------------------------------------------------------------------
arch/i386/mm/fault.c
246     if (unlikely(address >= TASK_SIZE)) {
247         if (!(error_code & 5))
248             goto vmalloc_fault;
...
253         goto bad_area_nosemaphore;
254     }
...
257     mm = tsk->mm;
...
-----------------------------------------------------------------------------
Figure 4.14. Page Fault I
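The masks applied to error_code in the listing above are easier to follow once the three low bits are given names. The following is a minimal user-space sketch, assuming the conventional x86 layout (bit 0 set for a protection fault on a present page, bit 1 set for a write access, bit 2 set for a user-mode access), which agrees with the case comments in the good_area listing. The PF_* names and the helper functions are hypothetical and are not taken from arch/i386/mm/fault.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three low bits of the x86 page fault error code. */
enum {
    PF_PROT  = 1 << 0, /* clear: page not present, set: protection violation */
    PF_WRITE = 1 << 1, /* clear: read or execute access, set: write access   */
    PF_USER  = 1 << 2  /* clear: fault in kernel mode, set: fault in user mode */
};

static bool fault_is_protection(unsigned long error_code)
{
    return (error_code & PF_PROT) != 0;
}

static bool fault_is_write(unsigned long error_code)
{
    return (error_code & PF_WRITE) != 0;
}

static bool fault_is_user(unsigned long error_code)
{
    return (error_code & PF_USER) != 0;
}

/* Same condition as `!(error_code & 5)` on line 247: bits 0 and 2 are both
 * clear, so the kernel touched a page that is not present. */
static bool is_kernel_not_present(unsigned long error_code)
{
    return (error_code & (PF_PROT | PF_USER)) == 0;
}

/* Small self-check that doubles as a usage example. */
static void check_error_code_helpers(void)
{
    assert(is_kernel_not_present(0));  /* kernel read of a missing page   */
    assert(is_kernel_not_present(2));  /* kernel write to a missing page  */
    assert(!is_kernel_not_present(4)); /* user-mode access                */
    assert(!is_kernel_not_present(1)); /* protection fault                */
}
```

Only an error code of 0 or 2 reaches vmalloc_fault; every other value at an address at or above TASK_SIZE ends at bad_area_nosemaphore.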
Lines 246-248
This code checks if the address at which the page fault occurred was in kernel module space (that is, in a noncontiguous memory area). Noncontiguous memory area addresses have their linear address >= TASK_SIZE. If it was, it checks if bits 0 and 2 of the error_code are clear. Recall from Table 4.7 that this indicates that the error is caused by trying to access a kernel page that is not present. If so, this indicates that the page fault occurred in kernel mode and the code at label vmalloc_fault: is called.
Line 253
If we get here, it means that although the access occurred in a noncontiguous memory area, it occurred in user mode, hit a protection fault, or both. In this case, we jump to the label bad_area_nosemaphore:.
Line 257
This sets the local variable mm to point to the current task's memory descriptor. If the current task is a kernel thread, this value is NULL. This becomes significant in the next code lines.
At this point, we have determined that the page fault did not occur in a noncontiguous memory area. Again, Figure 4.15 illustrates the flow of the following lines of code:
-----------------------------------------------------------------------------
arch/i386/mm/fault.c
...
262     if (in_atomic() || !mm)
263         goto bad_area_nosemaphore;
264
265     down_read(&mm->mmap_sem);
266
267     vma = find_vma(mm, address);
268     if (!vma)
269         goto bad_area;
270     if (vma->vm_start <= address)
271         goto good_area;
272     if (!(vma->vm_flags & VM_GROWSDOWN))
273         goto bad_area;
274     if (error_code & 4) {
...
281         if (address + 32 < regs->esp)
282             goto bad_area;
283     }
284     if (expand_stack(vma, address))
285         goto bad_area;
...
-----------------------------------------------------------------------------
Figure 4.15. Page Fault II
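The decision sequence on lines 267-285 can be imitated in user space. The sketch below is a simplified model, not kernel code: struct area, AREA_GROWSDOWN, find_area(), and classify() are hypothetical stand-ins for vm_area_struct, VM_GROWSDOWN, find_vma(), and the goto chain. It assumes that find_vma() returns the first area that ends above the address, which is why an area can be returned even though the address lies below its start.

```c
#include <assert.h>
#include <stddef.h>

#define AREA_GROWSDOWN 0x1UL /* hypothetical stand-in for VM_GROWSDOWN */

struct area {                /* simplified stand-in for vm_area_struct */
    unsigned long start;     /* first address inside the area */
    unsigned long end;       /* first address past the area   */
    unsigned long flags;
};

enum verdict { GOOD_AREA, BAD_AREA, GROW_STACK };

/* Returns the first area whose end lies above addr, or NULL.
 * Assumes the array is sorted by address and the areas do not overlap. */
static const struct area *find_area(const struct area *areas, size_t n,
                                    unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr < areas[i].end)
            return &areas[i];
    return NULL;
}

static enum verdict classify(const struct area *areas, size_t n,
                             unsigned long addr, int user_mode,
                             unsigned long sp)
{
    const struct area *a = find_area(areas, n, addr);

    if (!a)
        return BAD_AREA;                /* lines 268-269 */
    if (a->start <= addr)
        return GOOD_AREA;               /* lines 270-271 */
    if (!(a->flags & AREA_GROWSDOWN))
        return BAD_AREA;                /* lines 272-273 */
    if (user_mode && addr + 32 < sp)
        return BAD_AREA;                /* lines 274-282 */
    return GROW_STACK;                  /* the kernel calls expand_stack() here */
}

/* Small self-check that doubles as a usage example. */
static void check_classify(void)
{
    const struct area map[] = {
        { 0x1000UL, 0x3000UL, 0 },              /* ordinary mapping */
        { 0x8000UL, 0x9000UL, AREA_GROWSDOWN }, /* stack            */
    };

    assert(classify(map, 2, 0x2000UL, 1, 0x8800UL) == GOOD_AREA);
    assert(classify(map, 2, 0x9000UL, 1, 0x8800UL) == BAD_AREA);   /* above every area */
    assert(classify(map, 2, 0x0500UL, 1, 0x8800UL) == BAD_AREA);   /* below a fixed area */
    assert(classify(map, 2, 0x7ff0UL, 1, 0x7ff0UL) == GROW_STACK); /* just below the stack */
    assert(classify(map, 2, 0x4000UL, 1, 0x8000UL) == BAD_AREA);   /* far below the stack pointer */
}
```

The model leaves out the locking and the possibility that expand_stack() itself fails, both of which the real handler must deal with.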
Lines 262-263
In this code block, we check to see if the fault occurred while executing in an atomic context (such as an interrupt handler) or in a kernel thread, which has no memory descriptor (mm is NULL). If it did, we jump to label bad_area_nosemaphore:.
Line 265
At this point, we are about to search through the memory areas of the current process, so we set a read lock on the memory descriptor's semaphore.
Lines 267-269
Given that, at this point, we know the address that generated the page fault is not in a kernel thread or in an interrupt handler, we search the address space of the process to see if the address is in one of its memory areas. If it is not there, jump to label bad_area:.
Lines 270-271
If we found a valid region within the process address space, we jump to label good_area:.
Lines 272-273
If the region we found starts above the offending address, the address is not inside it, so we check whether this nearest region is allowed to grow downward (VM_GROWSDOWN) to fit the page. If not, we jump to the label bad_area:.
Lines 274-284
Otherwise, the offending address might be the result of a stack operation. If expanding the stack does not help, jump to the label bad_area:.
Now, we proceed to explain what each of the label jump points do. We begin with the label vmalloc_fault, which is illustrated in Figure 4.16:
-----------------------------------------------------------------------------
arch/i386/mm/fault.c
473 vmalloc_fault:
    {
        int index = pgd_index(address);
        pgd_t *pgd, *pgd_k;
        pmd_t *pmd, *pmd_k;
        pte_t *pte_k;
        asm("movl %%cr3,%0":"=r" (pgd));
        pgd = index + (pgd_t *)__va(pgd);
        pgd_k = init_mm.pgd + index;
491     if (!pgd_present(*pgd_k))
            goto no_context;
        pmd = pmd_offset(pgd, address);
        pmd_k = pmd_offset(pgd_k, address);
        if (!pmd_present(*pmd_k))
            goto no_context;
        set_pmd(pmd, *pmd_k);
        pte_k = pte_offset_kernel(pmd_k, address);
506     if (!pte_present(*pte_k))
507         goto no_context;
508     return;
509 }
-----------------------------------------------------------------------------
Figure 4.16. Label vmalloc_fault
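The heart of the listing above is the set_pmd() call, which copies an entry from the kernel's reference page tables (init_mm) into the page tables of the current process. The following sketch models that idea with plain arrays. It collapses the pgd, pmd, and pte levels into a single table, and ENTRIES, ENTRY_PRESENT, and sync_kernel_entry() are hypothetical names chosen for illustration.

```c
#include <assert.h>

#define ENTRIES       8     /* hypothetical table size, for illustration only */
#define ENTRY_PRESENT 0x1UL /* hypothetical "present" bit                     */

/* Copies one entry from the kernel's reference table into a process table.
 * Returns 0 on success, or -1 if the kernel itself has no mapping there,
 * which is the case where the real handler jumps to no_context. */
static int sync_kernel_entry(unsigned long *process_table,
                             const unsigned long *kernel_table,
                             unsigned int index)
{
    assert(index < ENTRIES);

    if (!(kernel_table[index] & ENTRY_PRESENT))
        return -1;

    process_table[index] = kernel_table[index];
    return 0;
}

/* Small self-check that doubles as a usage example. */
static void check_sync_kernel_entry(void)
{
    unsigned long kernel_table[ENTRIES] = { 0 };
    unsigned long process_table[ENTRIES] = { 0 };

    kernel_table[3] = 0x5000UL | ENTRY_PRESENT;

    assert(sync_kernel_entry(process_table, kernel_table, 3) == 0);
    assert(process_table[3] == kernel_table[3]);

    /* Entry 4 is not present in the kernel table, so nothing is copied. */
    assert(sync_kernel_entry(process_table, kernel_table, 4) == -1);
    assert(process_table[4] == 0);
}
```

Copying on demand like this means a new kernel mapping does not have to be pushed into every process's page tables at the moment it is created.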
Lines 473-509
The current process Page Global Directory is referenced (by way of cr3) and saved in the variable pgd, and the kernel Page Global Directory is referenced by pgd_k (likewise for the pmd and the pte variables). If the offending address is not valid in the kernel paging system, the code jumps to the no_context: label. Otherwise, set_pmd() copies the kernel's page middle directory entry into the page tables of the current process, so the process now uses the kernel mapping.
Now, we look at the label good_area:. At this point, we know that the memory area holding the offending address exists within the address space of the process. Now, we need to ensure that the access permissions were correct. Figure 4.17 shows the flow diagram:
-----------------------------------------------------------------------------
arch/i386/mm/fault.c
290 good_area:
291     info.si_code = SEGV_ACCERR;
292     write = 0;
293     switch (error_code & 3) {
294     default: /* 3: write, present */
...
        /* fall through */
300     case 2: /* write, not present */
301         if (!(vma->vm_flags & VM_WRITE))
302             goto bad_area;
303         write++;
304         break;
305     case 1: /* read, present */
306         goto bad_area;
307     case 0: /* read, not present */
308         if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
309             goto bad_area;
310     }
-----------------------------------------------------------------------------
Figure 4.17. Label good_area
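The switch in the listing above can be restated as a small function that reports whether the access may continue to handle_mm_fault(). This is a sketch only: AREA_READ, AREA_WRITE, and AREA_EXEC are hypothetical flag values standing in for VM_READ, VM_WRITE, and VM_EXEC, access_permitted() is a hypothetical name, and the lines elided from the default case are not modeled.

```c
#include <assert.h>

/* Hypothetical flag values; the kernel's VM_READ, VM_WRITE, and VM_EXEC
 * are defined elsewhere and may differ. */
#define AREA_READ  0x1UL
#define AREA_WRITE 0x2UL
#define AREA_EXEC  0x4UL

/* Returns 1 if the access may proceed, 0 where the kernel would jump to
 * bad_area. *write is set to 1 for a write access and to 0 otherwise. */
static int access_permitted(unsigned long error_code,
                            unsigned long area_flags, int *write)
{
    *write = 0;

    switch (error_code & 3) {
    default: /* 3: write, present (elided lines not modeled) */
        /* fall through */
    case 2:  /* write, not present */
        if (!(area_flags & AREA_WRITE))
            return 0;
        *write = 1;
        break;
    case 1:  /* read, present */
        return 0;
    case 0:  /* read, not present */
        if (!(area_flags & (AREA_READ | AREA_EXEC)))
            return 0;
        break;
    }
    return 1;
}

/* Small self-check that doubles as a usage example. */
static void check_access_permitted(void)
{
    int write;

    /* A write to a writable area goes ahead with write == 1. */
    assert(access_permitted(2, AREA_READ | AREA_WRITE, &write) == 1);
    assert(write == 1);

    /* A write to a read-only area is refused. */
    assert(access_permitted(2, AREA_READ, &write) == 0);

    /* A read of a missing page in a readable area goes ahead with write == 0. */
    assert(access_permitted(0, AREA_READ, &write) == 1);
    assert(write == 0);
}
```

Note that a read of a page that is already present is always refused, whatever the area's flags are.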
Lines 294-304
If the page fault was caused by a memory access that was a write (recall that in this case bit 1 of the error code, the left-most of the two bits examined by the switch, is set to 1), we check if our memory area is writeable. If it is not, we have a mismatch of permissions and we jump to the label bad_area:. If it is writeable, we break out of the switch statement and eventually proceed to handle_mm_fault() with the local variable write set to 1.
Lines 305-309
If the page fault was caused by a read or execute access and the page is present, we jump to the label bad_area: because this constitutes a clear permissions violation. If the page is not present, we check to see if the memory area has read or execute permissions. If it does not, we jump to the label bad_area: because even if we were to fetch the page, the permissions would not allow the operation. If it does, we fall out of the switch statement and eventually proceed to handle_mm_fault() with the local variable write set to 0.
The following label marks the code we fall through to when the permission checks come out OK. It is appropriately labeled survive:.
-----------------------------------------------------------------------------
arch/i386/mm/fault.c
    survive:
318     switch (handle_mm_fault(mm, vma, address, write)) {
        case VM_FAULT_MINOR:
            tsk->min_flt++;
            break;
        case VM_FAULT_MAJOR:
            tsk->maj_flt++;
            break;
        case VM_FAULT_SIGBUS:
            goto do_sigbus;
        case VM_FAULT_OOM:
            goto out_of_memory;
329     default:
            BUG();
        }
-----------------------------------------------------------------------------
Lines 318-329
The function handle_mm_fault() is called with the current memory descriptor (mm), the descriptor of the memory area that contains the offending address (vma), the offending address, and whether the access was a read/execute or a write. The switch statement catches us if we fail at handling the fault, which ensures we exit gracefully.
The following code snippet describes the flow of the labels bad_area and bad_area_nosemaphore. When we jump to this point, we know that either
Now, we need to determine if the access is from within kernel mode. The following code and Figure 4.18 illustrate the flow of these labels:
-----------------------------------------------------------------------------
arch/i386/mm/fault.c
348 bad_area:
349     up_read(&mm->mmap_sem);
350
351 bad_area_nosemaphore:
352     /* User mode accesses just cause a SIGSEGV */
353     if (error_code & 4) {
354         if (is_prefetch(regs, address))
355             return;
356
357         tsk->thread.cr2 = address;
358         tsk->thread.error_code = error_code;
359         tsk->thread.trap_no = 14;
360         info.si_signo = SIGSEGV;
361         info.si_errno = 0;
362         /* info.si_code has been set above */
363         info.si_addr = (void *)address;
364         force_sig_info(SIGSEGV, &info, tsk);
365         return;
366     }
-----------------------------------------------------------------------------
Figure 4.18. Label bad_area
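From user space, the effect of lines 234, 291, and 360-364 shows up in the siginfo_t delivered with SIGSEGV: si_code is SEGV_MAPERR when no memory area covers the address and SEGV_ACCERR when the area exists but forbids the access. The following sketch, which assumes a Linux system with POSIX signals and mmap(), triggers both cases on purpose and recovers with siglongjmp(). All function names in it are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf recover_point;
static volatile sig_atomic_t seen_code;

/* Records why the kernel sent SIGSEGV, then jumps back past the bad access. */
static void on_segv(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    (void)context;
    seen_code = info->si_code;
    siglongjmp(recover_point, 1);
}

/* Writes one byte to p. Returns the si_code of the resulting SIGSEGV,
 * or 0 if the write succeeded. */
static int segv_code_for_write(void *p)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    seen_code = 0;
    if (sigsetjmp(recover_point, 1) == 0)
        *(volatile char *)p = 1;

    sigaction(SIGSEGV, &old, NULL); /* restore the previous handler */
    return seen_code;
}

/* The memory area exists but is read-only: expected to yield SEGV_ACCERR. */
static int code_for_readonly_page(void)
{
    size_t size = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, size, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int code;

    assert(p != MAP_FAILED);
    code = segv_code_for_write(p);
    munmap(p, size);
    return code;
}

/* The memory area has been removed: expected to yield SEGV_MAPERR. */
static int code_for_unmapped_page(void)
{
    size_t size = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    assert(p != MAP_FAILED);
    munmap(p, size);
    return segv_code_for_write(p);
}
```

On a typical Linux system, code_for_readonly_page() is expected to return SEGV_ACCERR and code_for_unmapped_page() to return SEGV_MAPERR, although tools that intercept memory accesses or signals can change this.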
Line 348
The function up_read() releases the read lock on the semaphore of the process' memory descriptor. Notice that we jump to the label bad_area only after placing a read lock on the memory descriptor's semaphore to look through its memory areas and see if our address was within the process address space. Otherwise, we jump to the label bad_area_nosemaphore. The only difference between the two is the lifting of the read lock on the semaphore.
Lines 351-353
Because the address is not in the address space, we now check to see if the error was generated in user mode. If you recall from Table 4.7, bit 2 of the error code (mask value 4) is set when the error occurred in user mode.
Lines 354-366
We have determined that the error occurred in user mode, so we send a SIGSEGV signal (trap 14). The following code snippet describes the flow of the label no_context. When we jump to this point, we know that either
Figure 4.19 illustrates the flow diagram of the label no_context:
-----------------------------------------------------------------------------
arch/i386/mm/fault.c
388 no_context:
390     if (fixup_exception(regs))
            return;
432     die("Oops", regs, error_code);
        bust_spinlocks(0);
        do_exit(SIGKILL);
-----------------------------------------------------------------------------
Figure 4.19. Label no_context
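The fixup_exception() call in the listing above relies on a table that pairs each instruction that is allowed to fault with the address of its recovery code. The following sketch shows the lookup in isolation. struct ex_entry, search_table(), and fixup_ip() are hypothetical names, and both the sorted table and the binary search are assumptions of this sketch rather than details taken from the listing.

```c
#include <assert.h>
#include <stddef.h>

struct ex_entry {        /* hypothetical exception table entry */
    unsigned long insn;  /* address of an instruction that may fault */
    unsigned long fixup; /* address of its recovery code             */
};

/* Binary search over a table sorted by insn. Returns NULL if addr is absent. */
static const struct ex_entry *search_table(const struct ex_entry *table,
                                           size_t n, unsigned long addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid].insn == addr)
            return &table[mid];
        if (table[mid].insn < addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

/* Analogue of fixup_exception(): on a hit, redirects *ip to the recovery
 * code and returns 1. On a miss, leaves *ip alone and returns 0, which is
 * the case that ends in the oops. */
static int fixup_ip(const struct ex_entry *table, size_t n, unsigned long *ip)
{
    const struct ex_entry *e = search_table(table, n, *ip);

    if (!e)
        return 0;
    *ip = e->fixup;
    return 1;
}

/* Small self-check that doubles as a usage example. */
static void check_fixup_ip(void)
{
    const struct ex_entry table[] = {
        { 0x1000UL, 0x9000UL },
        { 0x1040UL, 0x9010UL },
        { 0x2000UL, 0x9020UL },
    };
    unsigned long ip = 0x1040UL;

    assert(fixup_ip(table, 3, &ip) == 1);
    assert(ip == 0x9010UL);

    ip = 0x1234UL; /* not a registered instruction */
    assert(fixup_ip(table, 3, &ip) == 0);
    assert(ip == 0x1234UL);
}
```

Changing the saved instruction pointer is what lets execution resume in the recovery code once the handler returns.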
Line 390
The function fixup_exception() uses the eip passed in to search an exception table for the offending instruction. If the instruction is in the table, it must have already been compiled with "hidden" fault handling code built in. The page fault handler, do_page_fault(), uses the fault handling code as a return address and jumps to it. The code can then flag an error.
Line 432
If there is not an entry in the exception table for the offending instruction, the code that jumped to label no_context ends up with the oops screen dump.
4.10.3. PowerPC Page Fault Exception
The PowerPC page fault handler do_page_fault() is called as a result of an instruction or data storage exception. Because of the subtle differences between the various versions of the PowerPC processors, the error codes are in a slightly different format, but yield similar information. The bits of interest are whether the offending operation was a read or write, and if it was a protection fault. The PowerPC page fault handler do_page_fault() does not initiate the oops error. In PowerPC, the label no_context code is combined with the label bad_area code and placed in a function called bad_page_fault(), which ends by producing a segmentation fault. This function also has the fixup function that traverses the exception_table.