
4.5. Slab Allocator's Lifecycle

Now, we explore the interaction of caches and the slab allocator throughout the lifecycle of the kernel. The kernel needs to make sure that certain structures are in place to support memory area requests on the part of processes and the creation of specialized caches on the part of dynamically loadable modules.

A few global structures play key roles for the slab allocator. Some of these were mentioned in passing earlier in the chapter. Let's look at these global variables.

4.5.1. Global Variables of the Slab Allocator

There are a number of global variables that are associated with the slab allocator. These include

  • cache_cache. The cache descriptor for the cache that is to contain all other cache descriptors. The human-readable name of this cache is kmem_cache. This cache descriptor is the only one that is statically allocated.

  • cache_chain. The list head that anchors the list of cache descriptors.

  • cache_chain_sem. The semaphore that controls access to cache_chain.[9] Every time an element (a new cache descriptor) is added to the chain, this semaphore must be acquired with a down() and released with an up(). A brief usage sketch follows this list.

    [9] Semaphores are discussed in detail in Chapter 9, "Building the Linux Kernel."

  • malloc_sizes[]. The array that holds the cache descriptors for the DMA and non-DMA caches that correspond to a general cache.
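Here is a minimal sketch of the locking discipline around cache_chain, assuming the 2.6-era list and semaphore primitives that appear in the listings later in this section. It is an illustration, not kernel source:

-----------------------------------------------------------------------------
/* Illustrative only: walk the cache chain under cache_chain_sem. */
static void print_cache_names(void)
{
    struct list_head *p;

    down(&cache_chain_sem);        /* sleep until we own the chain */
    list_for_each(p, &cache_chain) {
        kmem_cache_t *pc = list_entry(p, kmem_cache_t, next);
        printk("%s\n", pc->name);
    }
    up(&cache_chain_sem);          /* release for other chain users */
}
-----------------------------------------------------------------------------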

Before the slab allocator is initialized, these structures are already in place. Let's look at their creation:

-----------------------------------------------------------------------------
mm/slab.c
486 static kmem_cache_t cache_cache = {
487     .lists      = LIST3_INIT(cache_cache.lists),
488     .batchcount = 1,
489     .limit      = BOOT_CPUCACHE_ENTRIES,
490     .objsize    = sizeof(kmem_cache_t),
491     .flags      = SLAB_NO_REAP,
492     .spinlock   = SPIN_LOCK_UNLOCKED,
493     .color_off  = L1_CACHE_BYTES,
494     .name       = "kmem_cache",
495 };
496
497 /* Guard access to the cache-chain. */
498 static struct semaphore cache_chain_sem;
499
500 struct list_head cache_chain;
-----------------------------------------------------------------------------

The cache_cache cache descriptor has the SLAB_NO_REAP flag, so even if memory is low, this cache is retained throughout the life of the kernel. Note that the cache_chain semaphore is only defined here, not initialized; the initialization occurs during system startup, in the call to kmem_cache_init(), which we explore in Section 4.5.2.1. First, let's look at how the malloc_sizes[] array is set up:

-----------------------------------------------------------------------------
mm/slab.c
462 struct cache_sizes malloc_sizes[] = {
463 #define CACHE(x) { .cs_size = (x) },
464 #include <linux/kmalloc_sizes.h>
465     { 0, }
466 #undef CACHE
467 };
-----------------------------------------------------------------------------

This piece of code initializes the malloc_sizes[] array and sets the cs_size field according to the values defined in include/linux/kmalloc_sizes.h. As mentioned, the cache sizes can span from 32 bytes to 131,072 bytes depending on the specific kernel configurations.[10]

[10] There are a few additional configuration options that result in more general caches of sizes larger than 131,072. For more information, see include/linux/kmalloc_sizes.h.
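To make the macro trick concrete, here is roughly what the preprocessor produces; the specific sizes are an assumption about one possible configuration of kmalloc_sizes.h:

-----------------------------------------------------------------------------
/* Sketch of the expansion, assuming kmalloc_sizes.h begins with
 * CACHE(32) CACHE(64) ... on this configuration. */
struct cache_sizes malloc_sizes[] = {
    { .cs_size = 32 },
    { .cs_size = 64 },
    /* ...one entry per CACHE(n) in kmalloc_sizes.h... */
    { 0, }    /* sentinel that terminates the array */
};
-----------------------------------------------------------------------------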

With these global variables in place, the kernel proceeds to initialize the slab allocator by calling kmem_cache_init() from init/main.c.[11] This function takes care of initializing the cache chain, its semaphore, the general caches, and the kmem_cache cache; in essence, all the global variables that the slab allocator uses for slab management. At this point, specialized caches can be created. The function used to create caches is kmem_cache_create().

[11] Chapter 9 covers the initialization process linearly from power on. We see how kmem_cache_init() fits into the bootstrapping process.

4.5.2. Creating a Cache

The creation of a cache involves three steps:

1. Allocation and initialization of the cache descriptor
2. Calculation of the slab coloring and object size
3. Addition of the cache to the cache_chain list

General caches are set up during system initialization by kmem_cache_init() (mm/slab.c). Specialized caches are created by way of a call to kmem_cache_create().

We now look at each of these functions.

4.5.2.1. kmem_cache_init()

This is where the cache_chain and general caches are created. This function is called during the initialization process. Notice that the function has __init preceding the function name. As discussed in Chapter 2, "Exploration Toolkit," this indicates that the function is loaded into memory that gets wiped after the bootstrap and initialization process is over.

-----------------------------------------------------------------------------
mm/slab.c
659 void __init kmem_cache_init(void)
660 {
661     size_t left_over;
662     struct cache_sizes *sizes;
663     struct cache_names *names;
...
669     if (num_physpages > (32 << 20) >> PAGE_SHIFT)
670         slab_break_gfp_order = BREAK_GFP_ORDER_HI;
671
672
-----------------------------------------------------------------------------

Lines 661–663

The variables sizes and names are the heads of the arrays for the kmalloc caches (the general caches, whose sizes are geometrically distributed). At this point, these arrays are located in the __init data area. Be aware that kmalloc() does not exist yet: kmalloc() uses the malloc_sizes array, and that is precisely what we are setting up now. At this point, all we have is the statically allocated cache_cache descriptor.
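For orientation, the following is a simplified sketch of how kmalloc() uses malloc_sizes once initialization completes. It condenses the logic in mm/slab.c and is not a verbatim copy:

-----------------------------------------------------------------------------
/* Simplified sketch of kmalloc(): find the first general cache whose
 * objects are large enough for the request and allocate from it. */
void *kmalloc(size_t size, int flags)
{
    struct cache_sizes *csizep = malloc_sizes;

    for (; csizep->cs_size; csizep++) {   /* { 0, } sentinel ends loop */
        if (size > csizep->cs_size)
            continue;
        return kmem_cache_alloc(flags & GFP_DMA ?
                csizep->cs_dmacachep : csizep->cs_cachep, flags);
    }
    return NULL;   /* larger than the largest general cache */
}
-----------------------------------------------------------------------------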

Lines 669–670

This code block determines how many pages a slab can use, which depends entirely on how much memory is available. On both x86 and PPC, PAGE_SHIFT (include/asm/page.h) evaluates to 12, so (32 << 20) >> PAGE_SHIFT works out to 8,192 pages, or 32MB. If num_physpages exceeds that value, the machine has more than 32MB of memory and slab_break_gfp_order is set to BREAK_GFP_ORDER_HI, which allows a slab to span a higher-order (multi-page) allocation; otherwise, each slab is limited to a single page.

-----------------------------------------------------------------------------
mm/slab.c
690     init_MUTEX(&cache_chain_sem);
691     INIT_LIST_HEAD(&cache_chain);
692     list_add(&cache_cache.next, &cache_chain);
693     cache_cache.array[smp_processor_id()] = &initarray_cache.cache;
694
695     cache_estimate(0, cache_cache.objsize, 0,
696             &left_over, &cache_cache.num);
697     if (!cache_cache.num)
698         BUG();
699
...
-----------------------------------------------------------------------------

Line 690

This line initializes the cache_chain semaphore cache_chain_sem.

Line 691

Initialize the cache_chain list where all the cache descriptors are stored.

Line 692

Add the cache_cache descriptor to the cache_chain list.

Line 693

Set up the per-CPU caches. The details of this are beyond the scope of this book.

Lines 695–698

This block is a sanity check verifying that at least one cache descriptor can be allocated in each cache_cache slab. It also sets the cache_cache descriptor's num field and calculates how much space will be left over. The left-over space is used for slab coloring. Slab coloring is a method by which the kernel reduces cache alignment-related performance hits.
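To see what coloring buys, consider a toy model with assumed numbers: a 32-byte L1 cache line (color_off) and 200 left-over bytes per slab. Neither number comes from the kernel source; they are chosen only to show the offset cycling:

-----------------------------------------------------------------------------
#include <stdio.h>

/* Toy model of slab coloring: spend the left-over bytes of each slab as
 * a per-slab starting offset so that objects of consecutive slabs land
 * in different hardware cache lines. */
int main(void)
{
    unsigned int color_off = 32;     /* assumed L1 cache line size */
    unsigned int left_over = 200;    /* assumed waste per slab     */
    unsigned int colors = left_over / color_off + 1;  /* 7 colors  */
    unsigned int next = 0, slab;

    for (slab = 0; slab < 8; slab++) {
        printf("slab %u: first object at offset %u\n",
               slab, next * color_off);
        next = (next + 1) % colors;  /* cycle through the colors   */
    }
    return 0;
}
-----------------------------------------------------------------------------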

-----------------------------------------------------------------------------
mm/slab.c
705     sizes = malloc_sizes;
706     names = cache_names;
707
708     while (sizes->cs_size) {
...
714         sizes->cs_cachep = kmem_cache_create(
715             names->name, sizes->cs_size,
716             0, SLAB_HWCACHE_ALIGN, NULL, NULL);
717         if (!sizes->cs_cachep)
718             BUG();
719
...
725
726         sizes->cs_dmacachep = kmem_cache_create(
727             names->name_dma, sizes->cs_size,
728             0, SLAB_CACHE_DMA|SLAB_HWCACHE_ALIGN, NULL, NULL);
729         if (!sizes->cs_dmacachep)
730             BUG();
731
732         sizes++;
733         names++;
734     }
-----------------------------------------------------------------------------

Line 708

This line checks whether we have reached the end of the sizes array. The array's last element is always set to 0, so the loop condition holds until we reach that terminating cell.

Lines 714–718

Create the next kmalloc cache for normal (non-DMA) allocations and verify that the creation succeeded. See the section, "kmem_cache_create()."

Lines 726–730

This block creates the caches for DMA allocation.

Lines 732–733

Go to the next element in the sizes and names arrays.

The remainder of the kmem_cache_init() function handles the replacement of the temporary bootstrapping data with kmalloc()-allocated data. We leave out the explanation of this because it is not directly pertinent to the initialization of the cache descriptors.

4.5.2.2. kmem_cache_create()

At times, the memory regions provided by the general caches are not sufficient; this function is called when a specialized cache needs to be created. The steps required to create a specialized cache are not unlike those required to create a general cache: allocate and initialize the cache descriptor, align the objects, align the slab descriptors, and add the cache to the cache chain. This function does not have __init in front of its name because persistent memory is available by the time it is called:

-----------------------------------------------------------------------------
mm/slab.c
1027 kmem_cache_t *
1028 kmem_cache_create (const char *name, size_t size, size_t offset,
1029     unsigned long flags, void (*ctor)(void*, kmem_cache_t *, unsigned long),
1030     void (*dtor)(void*, kmem_cache_t *, unsigned long))
1031 {
1032     const char *func_nm = KERN_ERR "kmem_create: ";
1033     size_t left_over, align, slab_size;
1034     kmem_cache_t *cachep = NULL;
...
-----------------------------------------------------------------------------

Let's look at the function parameters of kmem_cache_create().

name

This is the name used to identify the cache. This gets stored in the name field of the cache descriptor and displayed in /proc/slabinfo.

size

This parameter specifies the size (in bytes) of the objects that are contained in this cache. This value is stored in the objsize field of the cache descriptor.

offset

This value determines where the objects are placed within a page.

flags

The flags parameter is related to the slab. Refer to Table 4.4 for a description of the cache descriptor flags field and possible values.

ctor and dtor

ctor and dtor are respectively the constructor and destructor that are called upon creation or destruction of objects in this memory region.
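Putting the parameters together, a hypothetical module might create a specialized cache as follows. struct my_object, my_ctor, and the cache name are inventions for illustration; only the kmem_cache_create() signature comes from the listing above:

-----------------------------------------------------------------------------
struct my_object {
    int  id;
    char name[16];
};

static kmem_cache_t *my_cachep;

/* Constructor: invoked for each object when a new slab is populated. */
static void my_ctor(void *obj, kmem_cache_t *cachep, unsigned long flags)
{
    memset(obj, 0, sizeof(struct my_object));
}

static int __init my_module_init(void)
{
    my_cachep = kmem_cache_create("my_object_cache",
            sizeof(struct my_object),   /* size                      */
            0,                          /* offset: default placement */
            SLAB_HWCACHE_ALIGN,         /* flags                     */
            my_ctor,                    /* ctor                      */
            NULL);                      /* dtor: none                */
    if (!my_cachep)
        return -ENOMEM;
    return 0;
}
-----------------------------------------------------------------------------

Objects would then be obtained with kmem_cache_alloc(my_cachep, SLAB_KERNEL) and returned with kmem_cache_free(my_cachep, obj).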

This function performs sizable debugging and sanity checks that we do not cover here. See the code for more details:

-----------------------------------------------------------------------------
mm/slab.c
1079     /* Get cache's description obj. */
1080     cachep = (kmem_cache_t *) kmem_cache_alloc(&cache_cache, SLAB_KERNEL);
1081     if (!cachep)
1082         goto opps;
1083     memset(cachep, 0, sizeof(kmem_cache_t));
1084
...
1144     do {
1145         unsigned int break_flag = 0;
1146 cal_wastage:
1147         cache_estimate(cachep->gfporder, size, flags,
1148                 &left_over, &cachep->num);
...
1174     } while (1);
1175
1176     if (!cachep->num) {
1177         printk("kmem_cache_create: couldn't create cache %s.\n", name);
1178         kmem_cache_free(&cache_cache, cachep);
1179         cachep = NULL;
1180         goto opps;
1181     }
-----------------------------------------------------------------------------

Lines 1079–1084

This is where the cache descriptor is allocated. Following this is the portion of the code that is involved with the alignment of objects in the slab. We leave this portion out of this discussion.

Lines 1144–1174

This is where the number of objects in the cache is determined. The bulk of the work is done by cache_estimate(). Recall that the value is stored in the num field of the cache descriptor.

-----------------------------------------------------------------------------
mm/slab.c
...
1201     cachep->flags = flags;
1202     cachep->gfpflags = 0;
1203     if (flags & SLAB_CACHE_DMA)
1204         cachep->gfpflags |= GFP_DMA;
1205     spin_lock_init(&cachep->spinlock);
1206     cachep->objsize = size;
1207     /* NUMA */
1208     INIT_LIST_HEAD(&cachep->lists.slabs_full);
1209     INIT_LIST_HEAD(&cachep->lists.slabs_partial);
1210     INIT_LIST_HEAD(&cachep->lists.slabs_free);
1211
1212     if (flags & CFLGS_OFF_SLAB)
1213         cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0);
1214     cachep->ctor = ctor;
1215     cachep->dtor = dtor;
1216     cachep->name = name;
1217
...
1242
1243     cachep->lists.next_reap = jiffies + REAPTIMEOUT_LIST3 +
1244         ((unsigned long)cachep)%REAPTIMEOUT_LIST3;
1245
1246     /* Need the semaphore to access the chain. */
1247     down(&cache_chain_sem);
1248     {
1249         struct list_head *p;
1250         mm_segment_t old_fs;
1251
1252         old_fs = get_fs();
1253         set_fs(KERNEL_DS);
1254         list_for_each(p, &cache_chain) {
1255             kmem_cache_t *pc = list_entry(p, kmem_cache_t, next);
1256             char tmp;
...
1265             if (!strcmp(pc->name, name)) {
1266                 printk("kmem_cache_create: duplicate cache %s\n", name);
1267                 up(&cache_chain_sem);
1268                 BUG();
1269             }
1270         }
1271         set_fs(old_fs);
1272     }
1273
1274     /* cache setup completed, link it into the list */
1275     list_add(&cachep->next, &cache_chain);
1276     up(&cache_chain_sem);
1277 opps:
1278     return cachep;
1279 }
-----------------------------------------------------------------------------

Just prior to this, the slab's objects are aligned to the hardware cache and the coloring is computed. The color and color_off fields of the cache descriptor are filled out.

Lines 1200–1217

This code block initializes the cache descriptor fields much like we saw in kmem_cache_init().

Lines 1243–1244

The time for the next cache reap is set.

Lines 1247–1276

At this point, the cache descriptor is initialized and all the information regarding the cache has been calculated and stored. The cache_chain semaphore is taken, the chain is walked to ensure that no existing cache already uses the requested name, and the new cache descriptor is then linked into the cache_chain list before the semaphore is released.

4.5.3. Slab Creation and cache_grow()

When a cache is created, it starts empty of slabs. In fact, slabs are not allocated until a request for an object demonstrates a need for a new slab. This happens when the cache descriptor's lists.slabs_partial and lists.slabs_free fields are empty. At this point, we won't relate how the request for memory translates into the request for an object within a particular cache. For now, we take for granted that this translation has occurred and concentrate on the technical implementation within the slab allocator.

A slab is created within a cache by cache_grow(). When we create a slab, we not only allocate and initialize its descriptor; we also allocate the actual memory. To this end, we need to interface with the buddy system to request the pages. This is done by kmem_getpages() (mm/slab.c).
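The buddy-system interface is small. A condensed sketch of kmem_getpages() follows; the full version in mm/slab.c also handles page-state accounting:

-----------------------------------------------------------------------------
/* Condensed sketch of kmem_getpages(): request 2^gfporder contiguous
 * pages from the buddy system on behalf of the cache. */
static void *kmem_getpages(kmem_cache_t *cachep, int flags)
{
    void *addr;

    flags |= cachep->gfpflags;   /* e.g., GFP_DMA for DMA caches */
    addr = (void *)__get_free_pages(flags, cachep->gfporder);
    return addr;                 /* NULL if the buddy system failed */
}
-----------------------------------------------------------------------------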

4.5.3.1. cache_grow()

The cache_grow() function grows the number of slabs within a cache by 1. It is called only when no free objects are available in the cache. This occurs when lists.slabs_partial and lists.slabs_free are empty:

-----------------------------------------------------------------------------
mm/slab.c
1546 static int cache_grow (kmem_cache_t * cachep, int flags)
1547 {
...
-----------------------------------------------------------------------------

The parameters passed to the function are

  • cachep. This is the cache descriptor of the cache to be grown.

  • flags. These flags will be involved in the creation of the slab.

-----------------------------------------------------------------------------
mm/slab.c
1572     check_irq_off();
1573     spin_lock(&cachep->spinlock);
...
1581
1582     spin_unlock(&cachep->spinlock);
1583
1584     if (local_flags & __GFP_WAIT)
1585         local_irq_enable();
-----------------------------------------------------------------------------

Lines 1572–1573

Prepare for manipulating the cache descriptor's fields by disabling interrupts and locking the descriptor.

Lines 1582–1585

Unlock the cache descriptor and, if the allocation is allowed to sleep (__GFP_WAIT is set), re-enable interrupts.

-----------------------------------------------------------------------------
mm/slab.c
...
1597     if (!(objp = kmem_getpages(cachep, flags)))
1598         goto failed;
1599
1600     /* Get slab management. */
1601     if (!(slabp = alloc_slabmgmt(cachep, objp, offset, local_flags)))
1602         goto opps1;
...
1605     i = 1 << cachep->gfporder;
1606     page = virt_to_page(objp);
1607     do {
1608         SET_PAGE_CACHE(page, cachep);
1609         SET_PAGE_SLAB(page, slabp);
1610         SetPageSlab(page);
1611         inc_page_state(nr_slab);
1612         page++;
1613     } while (--i) ;
1614
1615     cache_init_objs(cachep, slabp, ctor_flags);
-----------------------------------------------------------------------------

Lines 1597–1598

Interface with the buddy system to acquire page(s) for the slab.

Lines 1601–1602

Place the slab descriptor where it needs to go. Recall that slab descriptors can be stored on-slab (within the slab itself) or off-slab (in a suitably sized general cache).
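A condensed sketch of the decision inside alloc_slabmgmt(), paraphrased from mm/slab.c rather than quoted, shows the two placements:

-----------------------------------------------------------------------------
/* Paraphrased from alloc_slabmgmt(): the slab descriptor either comes
 * from a separate general cache (off-slab) or sits at the colored
 * offset at the start of the slab's own pages (on-slab). */
if (OFF_SLAB(cachep))
    slabp = kmem_cache_alloc(cachep->slabp_cache, local_flags);
else
    slabp = objp + offset;   /* descriptor lives inside the slab */
-----------------------------------------------------------------------------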

Lines 1605–1613

The pages need to be associated with the cache and slab descriptors.

Line 1615

Initialize all the objects in the slab.

-----------------------------------------------------------------------------
mm/slab.c
1616     if (local_flags & __GFP_WAIT)
1617         local_irq_disable();
1618     check_irq_off();
1619     spin_lock(&cachep->spinlock);
1620
1621     /* Make slab active. */
1622     list_add_tail(&slabp->list, &(list3_data(cachep)->slabs_free));
1623     STATS_INC_GROWN(cachep);
1624     list3_data(cachep)->free_objects += cachep->num;
1625     spin_unlock(&cachep->spinlock);
1626     return 1;
1627 opps1:
1628     kmem_freepages(cachep, objp);
1629 failed:
1630     if (local_flags & __GFP_WAIT)
1631         local_irq_disable();
1632     return 0;
1633 }
-----------------------------------------------------------------------------

Lines 1616–1619

Because we are about to access and change descriptor fields, we need to disable interrupts and lock the data.

Lines 1622–1624

Add the new slab descriptor to the lists.slabs_free field of the cache descriptor, and update the grow statistics and the cache's count of free objects.

Lines 1625–1626

Unlock the spinlock and return 1 to report success.

Lines 1627–1628

This label is reached if the slab-management allocation fails after the pages have already been acquired. In that case, the pages are returned to the buddy system.

Lines 1629–1632

If interrupts were enabled earlier (the __GFP_WAIT case), disable them again so that the function returns with interrupts in the same state in which it was entered.

4.5.4. Slab Destruction: Returning Memory and kmem_cache_destroy()

Both caches and slabs can be destroyed. Caches can be shrunk or destroyed to return memory to the free memory pool, and the kernel calls these functions when memory is low. In either case, slabs are destroyed and their pages are returned to the buddy system for recycling. kmem_cache_destroy() gets rid of a cache; we explore this function in depth. Caches can be reaped and shrunk by kmem_cache_reap() and kmem_cache_shrink(), respectively (both in mm/slab.c). The function that interfaces with the buddy system is kmem_freepages() (mm/slab.c).

4.5.4.1. kmem_cache_destroy()

There are a few instances when a cache would need to be removed. Dynamically loadable modules (assuming no persistent memory across loading and unloading) that create caches must destroy them upon unloading to free up the memory and to ensure that the cache won't be duplicated the next time the module is loaded. Thus, the specialized caches are generally destroyed in this manner.
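Continuing the hypothetical module from the kmem_cache_create() example in Section 4.5.2.2, its exit routine would destroy the cache; my_module_exit and my_cachep are the same illustrative names used there:

-----------------------------------------------------------------------------
/* Hypothetical module-exit counterpart to the creation example:
 * destroy the specialized cache so a later reload does not hit the
 * duplicate-name BUG() in kmem_cache_create(). */
static void __exit my_module_exit(void)
{
    if (kmem_cache_destroy(my_cachep))
        printk(KERN_ERR "my_object_cache: objects still in use\n");
}
-----------------------------------------------------------------------------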

The steps to destroy a cache are the reverse of the steps to create one. Alignment issues are not a concern upon destruction of a cache, only the deletion of descriptors and freeing of memory. The steps to destroy a cache can be summarized as

1. Remove the cache from the cache chain.
2. Delete the slab descriptors.
3. Delete the cache descriptor.

-----------------------------------------------------------------------------
mm/slab.c
1421 int kmem_cache_destroy (kmem_cache_t * cachep)
1422 {
1423     int i;
1424
1425     if (!cachep || in_interrupt())
1426         BUG();
1427
1428     /* Find the cache in the chain of caches. */
1429     down(&cache_chain_sem);
1430     /*
1431      * the chain is never empty, cache_cache is never destroyed
1432      */
1433     list_del(&cachep->next);
1434     up(&cache_chain_sem);
1435
1436     if (__cache_shrink(cachep)) {
1437         slab_error(cachep, "Can't free all objects");
1438         down(&cache_chain_sem);
1439         list_add(&cachep->next, &cache_chain);
1440         up(&cache_chain_sem);
1441         return 1;
1442     }
1443
...
1450     kmem_cache_free(&cache_cache, cachep);
1451
1452     return 0;
1453 }
-----------------------------------------------------------------------------

The function parameter cachep is a pointer to the cache descriptor of the cache that is to be destroyed.

Lines 1425–1426

This sanity check ensures that we are not in interrupt context and that the cache descriptor pointer is not NULL.

Lines 1429–1434

Acquire the cache_chain semaphore, delete the cache from the cache chain, and release the cache chain semaphore.

Lines 1436–1442

This is where the bulk of the work related to freeing the unused slabs takes place. If __cache_shrink() returns a nonzero value, objects in the cache are still in use and the cache cannot be destroyed. In that case, we reverse the previous step and relink the cache descriptor into the cache_chain, again acquiring the cache_chain semaphore first and releasing it once we finish.

Line 1450

We finish by freeing the cache descriptor itself back to cache_cache.
