Mac OS X Internals: A Systems Approach
8.16. Memory Allocation in the Kernel
Figure 8-43 shows an overview of kernel-level memory allocation functions in Mac OS X. The numerical labels are rough indicators of how low-level each group of functions is. For example, page-level allocation, which carries the lowest label, is the lowest-level allocation mechanism, since it allocates memory directly from the list of free pages in the Mach VM subsystem.

Figure 8-43. An overview of memory allocation in the Mac OS X kernel
Figure 8-44 shows an overview of kernel-level memory deallocation functions.

Figure 8-44. An overview of memory deallocation in the Mac OS X kernel
8.16.1. Page-Level Allocation
Page-level allocation is performed in the kernel by vm_page_alloc() [osfmk/vm/vm_resident.c]. This function requires a VM object and an offset as arguments. It then attempts to allocate a page associated with the VM object/offset pair. The VM object can be the kernel VM object (kernel_object), or it can be a newly allocated VM object.

vm_page_alloc() first calls vm_page_grab() [osfmk/vm/vm_resident.c] to remove a page from the free list. If the free list is too small, vm_page_grab() fails, returning VM_PAGE_NULL. However, if the requesting thread is a VM-privileged thread, vm_page_grab() consumes a page from the reserved pool. If no reserved pages are available, vm_page_grab() waits for a page to become available. If vm_page_grab() returns a valid page, vm_page_alloc() calls vm_page_insert() [osfmk/vm/vm_resident.c] to insert the page into the hash table that maps VM object/offset pairs to pages, that is, the virtual-to-physical (VP) table. The VM object's resident page count is also incremented.

kernel_memory_allocate() [osfmk/vm/vm_kern.c] is the master entry point for allocating kernel memory, in that most, but not all, pathways to memory allocation go through this function.

kern_return_t
kernel_memory_allocate(
    vm_map_t     map,    // the VM map to allocate into
    vm_offset_t *addrp,  // pointer to start of new memory
    vm_size_t    size,   // size to allocate (rounded up to a page size multiple)
    vm_offset_t  mask,   // mask specifying a particular alignment
    int          flags); // KMA_HERE, KMA_NOPAGEWAIT, KMA_KOBJECT
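The vm_page_grab() policy described above, free-list allocation backed by a reserved pool that only VM-privileged threads may tap, can be illustrated with a small userspace sketch. All names and counts here (toy_page_grab, TOY_RESERVED_PAGES) are invented for illustration; this is not the XNU code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of vm_page_grab()'s policy: ordinary threads fail once only
 * the reserved pool remains; VM-privileged threads may consume it.
 * (In the kernel, a privileged thread with no reserved pages left would
 * wait for a page rather than fail.) */
#define TOY_RESERVED_PAGES 4

/* Returns true if a page was grabbed (decrementing *free_pages),
 * false where the kernel would return VM_PAGE_NULL or wait. */
static bool toy_page_grab(int *free_pages, bool vm_privileged)
{
    if (*free_pages > TOY_RESERVED_PAGES ||
        (vm_privileged && *free_pages > 0)) {
        (*free_pages)--;
        return true;
    }
    return false;
}
```

The two-tier check mirrors the text: the common path succeeds only while the free list is above the reserved minimum, and the privileged path may drain the reserve down to zero.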
The flag bits are used as follows: KMA_HERE means that *addrp already holds the base address at which to allocate; KMA_NOPAGEWAIT means that the allocation should fail rather than wait for a page to become available; and KMA_KOBJECT means that the allocation should be made in the kernel object (kernel_object) rather than in a newly allocated VM object.
kernel_memory_allocate() calls vm_map_find_space() [osfmk/vm/vm_map.c] to find and allocate a virtual address range in the VM map. A new VM map entry is initialized as a result. As shown in Figure 8-43, kernel_memory_allocate() calls vm_page_alloc() to allocate pages. If the VM object is newly allocated, it passes a zero offset to vm_page_alloc(). If the kernel object is being used, the offset is the difference between the address returned by vm_map_find_space() and the minimum kernel address (VM_MIN_KERNEL_ADDRESS, defined to be 0x1000 in osfmk/mach/ppc/vm_param.h).

8.16.2. kmem_alloc
The kmem_alloc family of functions is implemented in osfmk/vm/vm_kern.c. These functions are intended for use in the Mach portion of the kernel.

kern_return_t kmem_alloc(vm_map_t map, vm_offset_t *addrp, vm_size_t size);
kern_return_t kmem_alloc_wired(vm_map_t map, vm_offset_t *addrp, vm_size_t size);
kern_return_t kmem_alloc_aligned(vm_map_t map, vm_offset_t *addrp, vm_size_t size);
kern_return_t kmem_alloc_pageable(vm_map_t map, vm_offset_t *addrp, vm_size_t size);
kern_return_t kmem_alloc_contig(vm_map_t map, vm_offset_t *addrp, vm_size_t size,
                                vm_offset_t mask, int flags);
kern_return_t kmem_realloc(vm_map_t map, vm_offset_t oldaddr, vm_size_t oldsize,
                           vm_offset_t *newaddrp, vm_size_t newsize);
void kmem_free(vm_map_t map, vm_offset_t addr, vm_size_t size);
Except for kmem_alloc_pageable(), all the kmem_alloc functions allocate wired memory.

8.16.3. The Mach Zone Allocator
The Mach zone allocator is a fast memory allocation mechanism with garbage collection. As shown in Figure 843, several allocation functions in the kernel directly or indirectly use the zone allocator. A zone is a collection of fixed-size memory blocks that are accessible through an efficient interface for allocation and deallocation. The kernel typically creates a zone for each class of data structure to be managed. Examples of data structures for which the Mac OS X kernel creates individual zones include the following:
The host_zone_info() Mach routine retrieves information about Mach zones from the kernel. It returns an array of zone names and another array of zone_info structures [<mach_debug/zone_info.h>]. The zprint command-line program uses host_zone_info() to retrieve and display information about all zones in the kernel.

$ zprint
                        elem    cur    max    cur    max     cur  alloc  alloc
zone name               size   size   size  #elts  #elts   inuse   size  count
-------------------------------------------------------------------------------
zones                     80    11K    12K    152    153      89     4K     51
vm.objects               136  6562K  8748K  49410  65867   39804     4K     30 C
vm.object.hash.entries    20   693K   768K  35496  39321   24754     4K    204 C
...
pmap_mappings             64 25861K 52479K 413789 839665  272627     4K     64 C
kalloc.large           59229  2949K  4360K     51     75      51    57K      1
...

Note that zprint's output includes the size of an object in each zone (the elem size column). You can pipe zprint's output through the sort command to see that several zones have the same element sizes. A single physical page is never shared between two or more zones. In other words, all zone-allocated objects on a physical page will be of the same type.

$ zprint | sort +1 -n
...
alarms                    44     3K     4K     93     93       1     4K     93 C
kernel.map.entries        44  4151K  4152K  96628  96628    9582     4K     93
non-kernel.map.entries    44  1194K  1536K  27807  35746   18963     4K     93 C
semaphores                44    35K  1092K    837  25413     680     4K     93 C
vm.pages                  44 32834K     0K 764153      0  763069     4K     93 C
...
A zone is described in the kernel by a zone structure (struct zone).

// osfmk/kern/zalloc.h

struct zone {
    int          count;         // number of elements used now
    vm_offset_t  free_elements;
    decl_mutex_data(,lock)      // generic lock
    vm_size_t    cur_size;      // current memory utilization
    vm_size_t    max_size;      // how large this zone can grow
    vm_size_t    elem_size;     // size of an element
    vm_size_t    alloc_size;    // chunk size for more memory
    char        *zone_name;     // string describing the zone
    ...
    struct zone *next_zone;     // link for all-zones list
    ...
};
A new zone is initialized by calling zinit(), which returns a pointer to a newly created zone structure (zone_t). Various subsystems use zinit() to initialize the zones they need.

zone_t
zinit(vm_size_t   size,  // size of each object
      vm_size_t   max,   // maximum size in bytes the zone may reach
      vm_size_t   alloc, // allocation size
      const char *name); // a string that describes the objects in the zone
The allocation size specified in the zinit() call is the amount of memory to add to the zone each time the zone becomes empty, that is, when there are no free elements on the zone's free list. The allocation size is automatically rounded up to an integral number of pages.

Note that zone structures are themselves allocated from a zone of zones (zone_zone). When the zone allocator is initialized during kernel bootstrap, it calls zinit() to initialize the zone of zones. zinit() treats this initialization specially: It calls zget_space() [osfmk/kern/zalloc.c] to allocate contiguous, nonpaged space through the master kernel memory allocator (kernel_memory_allocate() [osfmk/vm/vm_kern.c]). Other calls to zinit() allocate zone structures from the zone of zones through zalloc() [osfmk/kern/zalloc.c].

// osfmk/kern/zalloc.c

// zone data structures are themselves stored in a zone
zone_t zone_zone = ZONE_NULL;

zone_t
zinit(vm_size_t size, vm_size_t max, vm_size_t alloc, const char *name)
{
    zone_t z;

    if (zone_zone == ZONE_NULL) {
        if (zget_space(sizeof(struct zone), (vm_offset_t *)&z) != KERN_SUCCESS)
            return(ZONE_NULL);
    } else
        z = (zone_t)zalloc(zone_zone);

    // initialize various fields of the newly allocated zone structure

    thread_call_setup(&z->call_async_alloc, zalloc_async, z);

    // add the zone structure to the end of the list of all zones

    return(z);
}

void
zone_bootstrap(void)
{
    ...

    // this is the first call to zinit()
    zone_zone = zinit(sizeof(struct zone), 128 * sizeof(struct zone),
                      sizeof(struct zone), "zones");

    // this zone's empty pages will not be garbage collected
    zone_change(zone_zone, Z_COLLECT, FALSE);

    ...
}

zinit() populates the various fields of a newly allocated zone structure. In particular, it sets the zone's current size to 0 and the zone's free list to NULL. Therefore, at this point, the zone's memory pool is empty. Before returning, zinit() arranges for zalloc_async() [osfmk/kern/zalloc.c] to run by setting up a callout.
zalloc_async() attempts to allocate a single element from the empty zone, which causes memory to be allocated for the zone. zalloc_async() then immediately frees the dummy allocation.

// osfmk/kern/zalloc.c

void
zalloc_async(thread_call_param_t p0, __unused thread_call_param_t p1)
{
    void *elt;

    elt = zalloc_canblock((zone_t)p0, TRUE);
    zfree((zone_t)p0, elt);
    lock_zone((zone_t)p0);
    ((zone_t)p0)->async_pending = FALSE;
    unlock_zone((zone_t)p0);
}

The zone allocator exports several functions for memory allocation, deallocation, and zone configuration. Figure 8-45 shows the important functions.

Figure 8-45. Zone allocator functions
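To make the zone mechanics concrete, here is a minimal userspace sketch of a fixed-size-element zone whose free list is threaded through the free elements themselves, the same basic technique zalloc() and zfree() use. All names (toy_zone, toy_zcram, and so on) are invented for illustration; this is not the osfmk/kern/zalloc.c implementation:

```c
#include <assert.h>
#include <stddef.h>

/* A toy zone: fixed-size elements, with the free list threaded through
 * the free elements themselves (each free element's first word points
 * to the next free element). elem_size must be >= sizeof(void *). */
struct toy_zone {
    size_t elem_size;     /* size of each element */
    void  *free_elements; /* head of the free list */
    int    count;         /* number of elements used now */
};

/* Carve a chunk of memory into elements and push each on the free list
 * (the kernel's analogue is zcram(), called when a zone grows). */
static void toy_zcram(struct toy_zone *z, void *mem, size_t size)
{
    char *p = mem;
    for (size_t off = 0; off + z->elem_size <= size; off += z->elem_size) {
        *(void **)(p + off) = z->free_elements;
        z->free_elements = p + off;
    }
}

static void *toy_zalloc(struct toy_zone *z)
{
    void *elem = z->free_elements;
    if (elem == NULL)
        return NULL;              /* the real zalloc() would grow the zone */
    z->free_elements = *(void **)elem;
    z->count++;
    return elem;
}

static void toy_zfree(struct toy_zone *z, void *elem)
{
    *(void **)elem = z->free_elements;
    z->free_elements = elem;
    z->count--;
}
```

Note how freeing pushes the element back on the front of the list, so allocation is a constant-time pop and the most recently freed element is handed out first.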
The zone_change() function allows the following Boolean flags to be modified for a zone: Z_EXHAUST (whether the zone is exhaustible, in which case allocation simply fails when the zone is empty), Z_COLLECT (whether the zone is collectable, that is, subject to garbage collection), Z_EXPAND (whether the zone may be expanded with new memory when it becomes empty), and Z_FOREIGN (whether the zone may contain memory that was not allocated through the zone allocator).
The typical kernel usage of zalloc() is blocking, that is, the caller is willing to wait if memory is not available immediately. The zalloc_noblock() and zget() functions attempt to allocate memory with no allowance for blocking and therefore can return NULL if no memory is available. As shown in Figure 8-43, the zone allocator eventually allocates memory through kernel_memory_allocate() [osfmk/vm/vm_kern.c]. If the system is low on available memory, this function returns KERN_RESOURCE_SHORTAGE, which causes the zone allocator to wait for a page to become available. However, if kernel_memory_allocate() fails because there is no more kernel virtual address space left, the zone allocator causes a kernel panic.

Freeing a zone element through zfree() [osfmk/kern/zalloc.c] causes the element to be added to the zone's free list and the zone's count of in-use elements to be decremented.

A collectable zone's unused pages are periodically garbage collected. During VM subsystem initialization, the kernel calls zone_init() [osfmk/kern/zalloc.c] to create a map for the zone allocator (zone_map) as a submap of the kernel map. zone_init() also sets up garbage collection information: It allocates wired memory for the zone page table, a linked list that contains one element, a zone_page_table_entry structure, for each page assigned to a zone.

// osfmk/kern/zalloc.c

struct zone_page_table_entry {
    struct zone_page_table_entry *link;
    short alloc_count;
    short collect_count;
};
The alloc_count field of the zone_page_table_entry structure is the total number of elements from that page assigned to the zone, whereas the collect_count field is the number of elements from that page on the zone's free list. Consider the following sequence of steps as an example of new memory being added to a zone.
The zone garbage collector, zone_gc() [osfmk/kern/zalloc.c], is invoked by consider_zone_gc() [osfmk/kern/zalloc.c]. The latter ensures that garbage collection is performed at most once per minute, unless someone else has explicitly requested a garbage collection. The page-out daemon calls consider_zone_gc().
zfree() can request explicit garbage collection if the system is low on memory and the zone from which the element is being freed has an element size of a page size or more.
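The alloc_count/collect_count accounting introduced above drives the garbage collector's reclaim decision. Here is a hypothetical model of the per-page test; the field names mirror the kernel's zone_page_table_entry, but the functions are illustrative sketches, not the zalloc.c code:

```c
#include <assert.h>

/* Toy per-page accounting for zone garbage collection: a page can be
 * reclaimed only when every element assigned to it (alloc_count) is
 * also sitting on the zone's free list (collect_count). */
struct toy_zone_page {
    short alloc_count;    /* elements from this page assigned to the zone */
    short collect_count;  /* elements from this page found on the free list */
};

/* Pass 1: for each free element, bump its page's collect_count. */
static void toy_page_collect(struct toy_zone_page *pg)
{
    pg->collect_count++;
}

/* Pass 2: a page whose free elements account for all of its elements
 * can be returned to the kernel (the real code calls kmem_free()). */
static int toy_page_collectable(const struct toy_zone_page *pg)
{
    return pg->collect_count == pg->alloc_count;
}
```

A page with even one element still in use fails the pass-2 comparison, which is why a mostly-free zone can still hold on to many of its pages.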
zone_gc() makes two passes over each collectable zone.[24] In the first pass, it calls zone_page_collect() [osfmk/kern/zalloc.c] on each free element. zone_page_collect() increments the appropriate collect_count value by one. In the second pass, it calls zone_page_collectable() on each element, which compares the collect_count and alloc_count values for that page. If the values are equal, the page can be reclaimed, since all elements on that page are free. zone_gc() tracks such pages in a list of pages to be freed and eventually frees them by calling kmem_free().

[24] zone_gc() can skip a collectable zone if the zone has less than 10% of its elements free or if the amount of free memory in the zone is less than twice its allocation size.

8.16.4. The Kalloc Family
The kalloc family of functions, implemented in osfmk/kern/kalloc.c, provides access to a fast general-purpose memory allocator built atop the zone allocator. kalloc() uses a 16MB submap (kalloc_map) of the kernel map from which it allocates its memory. The limited submap size helps avoid virtual memory fragmentation. kalloc() supports a set of allocation sizes, ranging from as little as KALLOC_MINSIZE bytes (16 bytes by default) to several kilobytes. Each size is a power of 2. When the allocator is initialized, it calls zinit() to create a zone for each allocation size that it handles. Each zone's name is set to reflect the zone's associated size, as shown in Figure 8-46. These are the so-called power-of-2 zones.

Figure 8-46. Printing sizes of kalloc zones supported in the kernel
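The mapping from a requested size to one of the power-of-2 zones amounts to rounding the size up to the next power of 2, with KALLOC_MINSIZE as the floor. A sketch of that computation follows; toy_kalloc_zone_size is an invented name, and the kernel performs this lookup with precomputed zone tables rather than a loop:

```c
#include <assert.h>
#include <stddef.h>

/* Floor for the smallest kalloc zone (KALLOC_MINSIZE is 16 by default). */
#define TOY_KALLOC_MINSIZE 16

/* Round a request up to the power-of-2 zone size that would serve it. */
static size_t toy_kalloc_zone_size(size_t size)
{
    size_t z = TOY_KALLOC_MINSIZE;
    while (z < size)
        z <<= 1;          /* 16, 32, 64, 128, ... */
    return z;
}
```

The rounding explains the space overhead visible in zprint output: a 17-byte request occupies a 32-byte element, and in the worst case nearly half of each element is unused.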
Note that the zone named kalloc.large in the zprint output in Figure 8-46 is not real; it is a fake zone used for reporting on too-large-for-a-zone objects that were allocated through kmem_alloc(). The kalloc family provides malloc-style functions, along with versions that attempt memory allocation without blocking.

void * kalloc(vm_size_t size);
void * kalloc_noblock(vm_size_t size);
void * kalloc_canblock(vm_size_t size, boolean_t canblock);
void   krealloc(void **addrp, vm_size_t old_size, vm_size_t new_size,
                simple_lock_t lock);
void   kfree(void *data, vm_size_t size);
Both kalloc() and kalloc_noblock() are simple wrappers around kalloc_canblock(), which prefers to get memory through zalloc_canblock(), unless the allocation size is too large: kalloc_max_prerounded (8193 bytes by default) or more. krealloc() uses kmem_realloc() if the existing allocation is already too large for a kalloc zone. If the new size is also too large, krealloc() uses kmem_alloc() to allocate new memory, copies existing data into it using bcopy(), and frees the old memory. If the new memory fits in a kalloc zone, krealloc() uses zalloc() to allocate the new memory. It still must copy the existing data and free the old memory, since there is no "zrealloc" function.

8.16.5. The OSMalloc Family
The file osfmk/kern/kalloc.c implements another family of memory allocation functions: the OSMalloc family.

OSMallocTag OSMalloc_Tagalloc(const char *str, uint32_t flags);
void        OSMalloc_Tagfree(OSMallocTag tag);
void *      OSMalloc(uint32_t size, OSMallocTag tag);
void *      OSMalloc_nowait(uint32_t size, OSMallocTag tag);
void *      OSMalloc_noblock(uint32_t size, OSMallocTag tag);
void        OSFree(void *addr, uint32_t size, OSMallocTag tag);
The key aspect of these functions is their use of a tag structure, which encapsulates certain properties of allocations made with that tag.

#define OSMT_MAX_NAME 64

typedef struct _OSMallocTag_ {
    queue_chain_t OSMT_link;
    uint32_t      OSMT_refcnt;
    uint32_t      OSMT_state;
    uint32_t      OSMT_attr;
    char          OSMT_name[OSMT_MAX_NAME];
} *OSMallocTag;
Here is an example use of the OSMalloc functions:

#include <libkern/OSMalloc.h>

OSMallocTag my_tag;

void
my_init(void)
{
    my_tag = OSMalloc_Tagalloc("My Tag Name", OSMT_ATTR_PAGEABLE);
    ...
}

void
my_uninit(void)
{
    OSMalloc_Tagfree(my_tag);
}

void
some_function(...)
{
    void *p = OSMalloc(some_size, my_tag);
}
OSMalloc_Tagalloc() calls kalloc() to allocate a tag structure. The tag's name and attributes are set based on the arguments passed to OSMalloc_Tagalloc(). The tag's reference count is initialized to one, and the tag is placed on a global list of tags. Thereafter, memory is allocated using one of the OSMalloc allocation functions, which in turn use kalloc(), kalloc_noblock(), or kmem_alloc_pageable() for the actual allocation. Each allocation increments the tag's reference count by one.

8.16.6. Memory Allocation in the I/O Kit
The I/O Kit provides its own interface for memory allocation in the kernel.

void * IOMalloc(vm_size_t size);
void * IOMallocPageable(vm_size_t size, vm_size_t alignment);
void * IOMallocAligned(vm_size_t size, vm_size_t alignment);
void * IOMallocContiguous(vm_size_t size, vm_size_t alignment,
                          IOPhysicalAddress *physicalAddress);
void   IOFree(void *address, vm_size_t size);
void   IOFreePageable(void *address, vm_size_t size);
void   IOFreeAligned(void *address, vm_size_t size);
void   IOFreeContiguous(void *address, vm_size_t size);
IOMalloc() allocates general-purpose, wired memory in the kernel map by simply calling kalloc(). Since kalloc() can block, IOMalloc() must not be called while holding a simple lock or from an interrupt context. Moreover, since kalloc() offers no alignment guarantees, IOMalloc() should not be called when a specific alignment is desired. Memory allocated through IOMalloc() is freed through IOFree(), which simply calls kfree(). The latter too can block.

Pageable memory with an alignment restriction is allocated through IOMallocPageable(), whose alignment argument specifies the desired alignment in bytes. The I/O Kit maintains a bookkeeping data structure (gIOKitPageableSpace) for pageable memory.

// iokit/Kernel/IOLib.c

enum { kIOMaxPageableMaps    = 16 };
enum { kIOPageableMapSize    = 96 * 1024 * 1024 };
enum { kIOPageableMaxMapSize = 96 * 1024 * 1024 };

static struct {
    UInt32     count;
    UInt32     hint;
    IOMapData  maps[kIOMaxPageableMaps];
    lck_mtx_t *lock;
} gIOKitPageableSpace;

The maps array of gIOKitPageableSpace contains submaps allocated from the kernel map. During bootstrap, the I/O Kit initializes the first entry of this array by allocating a 96MB (kIOPageableMapSize) pageable map. IOMallocPageable() calls IOIteratePageableMaps(), which first attempts to allocate memory from an existing pageable map, failing which it fills the next slot, up to a maximum of kIOMaxPageableMaps slots, of the maps array with a newly allocated map. The eventual memory allocation is done through kmem_alloc_pageable(). When such memory is freed through IOFreePageable(), the maps array is consulted to determine which map the address being freed belongs to, after which kmem_free() is called to actually free the memory.

Wired memory with an alignment restriction is allocated through IOMallocAligned(), whose alignment argument specifies the desired alignment in bytes.
If the adjusted allocation size (after accounting for the alignment) is equal to or more than the page size, IOMallocAligned() uses kernel_memory_allocate(); otherwise, it uses kalloc(). Correspondingly, the memory is freed through kmem_free() or kfree().

IOMallocContiguous() allocates physically contiguous, wired, alignment-restricted memory in the kernel map. Optionally, this function returns the physical address of the allocated memory if a non-NULL pointer for holding the physical address is passed as an argument. When the adjusted allocation size is less than a page or exactly one page, physical contiguity is trivially present. In these two cases, IOMallocContiguous() uses kalloc() and kernel_memory_allocate(), respectively, for the underlying allocation. When multiple physically contiguous pages are requested, the allocation is handled by kmem_alloc_contig(). Like vm_page_alloc(), this function also causes memory to be allocated directly from the free list. It calls vm_page_find_contiguous() [osfmk/vm/vm_resident.c], which traverses the free list, inserting free pages into a private sublist sorted on the physical address. As soon as a contiguous range large enough to fulfill the contiguous allocation request is detected in the sublist, the function allocates the corresponding pages and returns the remaining pages collected on the sublist to the free list. Because of the free list sorting, this function can take a substantial time to run when the free list is very large, for example, soon after bootstrapping on a system with a large amount of physical memory.

When the caller requests the newly allocated memory's physical address to be returned, IOMallocContiguous() first retrieves the corresponding physical page from the pmap layer by calling pmap_find_phys() [osfmk/ppc/pmap.c]. If the DART IOMMU[25] is present and active on the system, the address of this page is not returned as is.
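The search that vm_page_find_contiguous() performs, sort the free pages by physical address and then look for a long-enough run of consecutive page numbers, can be modeled in a few lines. This is illustrative only: the kernel works on vm_page structures threaded on a private sublist, not a plain integer array, and toy_find_contiguous is an invented name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int cmp_ppnum(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* Sort the free physical page numbers, then scan for a run of npages
 * consecutive numbers. Returns the first page number of the run, or -1
 * if no suitably large contiguous range exists. */
static int toy_find_contiguous(int *free_ppnums, size_t n, size_t npages)
{
    if (n == 0 || npages == 0 || npages > n)
        return -1;
    qsort(free_ppnums, n, sizeof(int), cmp_ppnum);  /* the costly step */
    size_t run = 1;
    if (run == npages)
        return free_ppnums[0];
    for (size_t i = 1; i < n; i++) {
        run = (free_ppnums[i] == free_ppnums[i - 1] + 1) ? run + 1 : 1;
        if (run == npages)
            return free_ppnums[i - npages + 1];
    }
    return -1;
}
```

The sort dominates the cost, which is why the text notes that the operation can take substantial time when the free list is very large.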
As we noted earlier, the DART translates I/O Kit-visible 32-bit "physical for I/O" addresses to 64-bit "true" physical addresses. Code running in the I/O Kit environment cannot even see the true physical address. In fact, even if such code attempted to use a 64-bit physical address, the DART would not be able to translate it, and an error would occur.

[25] We will discuss the DART in Section 10.3.

If the DART is active, IOMallocContiguous() calls it to allocate an appropriately sized I/O memory range; the address of this allocation is the "physical" address that is returned. Moreover, IOMallocContiguous() has to insert each "true" physical page into the I/O memory range by calling the DART's "insert" function. Since IOFreeContiguous() must call the DART to undo this work, IOMallocContiguous() saves the virtual address and the I/O address in an _IOMallocContiguousEntry structure. The I/O Kit maintains these structures in a linked list. When the memory is freed, the caller provides the virtual address, which the I/O Kit uses to search for the I/O address on this linked list. Once the I/O address is found, the structure is removed from the list and the DART allocation is freed.

// iokit/Kernel/IOLib.c

struct _IOMallocContiguousEntry {
    void         *virtual; // caller-visible virtual address
    ppnum_t       ioBase;  // caller-visible "physical" address
    queue_chain_t link;    // chained to other contiguous entries
};
typedef struct _IOMallocContiguousEntry _IOMallocContiguousEntry;

8.16.7. Memory Allocation in the Kernel's BSD Portion
The BSD portion of the kernel provides _MALLOC() [bsd/kern/kern_malloc.c] and _MALLOC_ZONE() [bsd/kern/kern_malloc.c] for memory allocation. The header file bsd/sys/malloc.h defines the MALLOC() and MALLOC_ZONE() macros, which are trivial wrappers around _MALLOC() and _MALLOC_ZONE(), respectively.

void * _MALLOC(size_t size, int type, int flags);
void   _FREE(void *addr, int type);
void * _MALLOC_ZONE(size_t size, int type, int flags);
void   _FREE_ZONE(void *elem, size_t size, int type);
The BSD-specific allocator designates different types of memory with different numerical values, where the "memory type" (the type argument), which is specified by the caller, represents the purpose of the memory. For example, M_FILEPROC memory is used for open file structures, and M_SOCKET memory is used for socket structures. The various known types are defined in bsd/sys/malloc.h. The value M_LAST is one more than the last known type's value.

This allocator is initialized during kernel bootstrap by a call to kmeminit() [bsd/kern/kern_malloc.c], which goes through a predefined array of kmzones structures (struct kmzones [bsd/kern/kern_malloc.c]). As shown in Figure 8-47, there is one kmzones structure for each type of memory supported by the BSD allocator.

Figure 8-47. Array of memory types supported by the BSD memory allocator
Moreover, each type has a string name. These names are defined in bsd/sys/malloc.h in another array.

// bsd/sys/malloc.h

#define INITKMEMNAMES { \
    "free",          /* 0 M_FREE */ \
    "mbuf",          /* 1 M_MBUF */ \
    "devbuf",        /* 2 M_DEVBUF */ \
    "socket",        /* 3 M_SOCKET */ \
    "pcb",           /* 4 M_PCB */ \
    "routetbl",      /* 5 M_RTABLE */ \
    ... \
    "kauth",         /* 100 M_KAUTH */ \
    "dummynet",      /* 101 M_DUMMYNET */ \
    "unsafe_fsnode"  /* 102 M_UNSAFEFS */ \
}
...
As kmeminit() iterates over the array of kmzones, it examines each entry's kz_elemsize and kz_zalloczone fields. Entries with kz_elemsize values of -1 are skipped. The other entries are handled as follows.

If kz_zalloczone is KMZ_CREATEZONE, kmeminit() calls zinit() to initialize a zone using kz_elemsize as the size of an element of the zone, 1MB as the maximum memory to use, PAGE_SIZE as the allocation size, and the corresponding string in the memname array as the zone's name. The kz_zalloczone field is set to this newly initialized zone.

If kz_zalloczone is KMZ_LOOKUPZONE, kmeminit() calls kalloc_zone() to simply look up the kernel memory allocator (kalloc) zone with the appropriate allocation size. The kz_zalloczone field is set to the found zone or to ZONE_NULL if none is found.

If kz_zalloczone is KMZ_SHAREZONE, the entry shares the zone with the entry at index kz_elemsize in the kmzones array. For example, the kmzones entry for M_RTABLE shares the zone with the entry for M_MBUF. kmeminit() sets the kz_zalloczone and kz_elemsize fields of a KMZ_SHAREZONE entry to those of the "shared with" zone.

Thereafter, _MALLOC_ZONE() uses its type argument as an index into the kmzones array. If the specified type is greater than the last known type, there is a kernel panic. If the allocation request's size matches the kz_elemsize field of kmzones[type], _MALLOC_ZONE() calls the Mach zone allocator to allocate from the zone pointed to by the kz_zalloczone field of kmzones[type]. If the sizes do not match, _MALLOC_ZONE() uses kalloc() or kalloc_noblock(), depending on whether the M_NOWAIT bit is clear or set, respectively, in the flags argument.

Similarly, _MALLOC() calls kalloc() or kalloc_noblock() to allocate memory. The type argument is not used, but if its value exceeds the last known BSD malloc type, _MALLOC() still causes a kernel panic. _MALLOC() uses a bookkeeping data structure of its own to track allocated memory.
It adds the size of this data structure (struct _mhead) to the size of the incoming allocation request.

struct _mhead {
    size_t mlen;   // used to record the length of allocated memory
    char   dat[0]; // this is returned by _MALLOC()
};
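The header trick is straightforward to model in userspace: allocate room for the header plus the request, record the total length in the header, and return the bytes just past it. The toy_ names below are invented, and the real _MALLOC() does more (type checking, kernel allocators); this sketch only shows the bookkeeping layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy version of the _mhead bookkeeping header. */
struct toy_mhead {
    size_t mlen;     /* total length, header included */
    char   dat[];    /* caller-visible memory starts here */
};

/* Allocate size bytes behind a hidden header; zero models the M_ZERO flag. */
static void *toy_malloc(size_t size, int zero)
{
    struct toy_mhead *h = malloc(sizeof(*h) + size);
    if (h == NULL)
        return NULL;
    h->mlen = sizeof(*h) + size;
    if (zero)
        memset(h->dat, 0, size);
    return h->dat;               /* caller never sees the header */
}

/* Step back from the caller's pointer to recover the header, free the
 * whole block, and return the recorded length so it can be observed. */
static size_t toy_free(void *addr)
{
    struct toy_mhead *h =
        (struct toy_mhead *)((char *)addr - offsetof(struct toy_mhead, dat));
    size_t len = h->mlen;
    free(h);
    return len;
}
```

Because the length travels with the allocation, the free path needs only the caller's pointer, which is why _FREE() does not take a size argument the way kfree() does.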
Moreover, if the M_ZERO bit is set in the flags argument, _MALLOC() calls bzero() to zero-fill the memory.

8.16.8. Memory Allocation in libkern's C++ Environment
As we noted in Section 2.4.4, libkern defines OSObject as the root base class for the Mac OS X kernel. The new and delete operators for OSObject call kalloc() and kfree(), respectively.

// libkern/c++/OSObject.cpp

void *
OSObject::operator new(size_t size)
{
    void *mem = (void *)kalloc(size);
    ...
    return mem;
}

void
OSObject::operator delete(void *mem, size_t size)
{
    kfree((vm_offset_t)mem, size);
    ...
}