One of the biggest problems currently facing the Linux memory management
subsystem is memory fragmentation. This is the result of several developments
in other parts of the Linux kernel, most importantly the growth of each
process's kernel stack to 8 kB and the dynamic allocation of DMA and networking
buffers. These factors, together with a general speedup of both peripheral
hardware and the device drivers, have led to a situation where the currently
used buddy allocator just can't cut it anymore. This white-paper is divided
into three pieces: the problem, the
solution and some actual code. I need a lot of
comments and hints for possible improvement, so feel free to email
them to me...
There have been (and there are) several workarounds for this fragmentation issue; one of them (PTE chaining) even involves a physical-to-logical translation, almost a reverse page table-like solution. With that project, we can swap out pages based on their physical address, thus force-freeing the one page that blocks an entire 128 kB area. This would solve most of our problems, except when that last page is unswappable, for example a page table or a program's kernel stack. In that case, we're screwed regardless of which deallocation scheme we're using.
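To make the idea concrete, here is a minimal sketch of the PTE chaining scheme; the names and layout are hypothetical and only illustrate the reverse-mapping principle, not the actual project's code: every physical page keeps a chain of the page table entries that map it, so a page can be unmapped (and then swapped out) given only its physical address.
/*
 * Sketch only: hypothetical types for the reverse-mapping (PTE chaining)
 * idea described above.
 */
#include <stddef.h>

typedef unsigned long pte_t;            /* stand-in for the real PTE type */

struct pte_chain {
        struct pte_chain *next;         /* next mapping of the same page  */
        pte_t *ptep;                    /* the page table entry itself    */
};

struct phys_page {
        struct pte_chain *mappings;     /* all PTEs pointing at this page */
};

/*
 * Unmap one physical page by walking its chain and clearing every PTE
 * that maps it; afterwards the page can be freed, unblocking the
 * surrounding 128 kB area.  (Real code would also flush the TLB and
 * write dirty data to swap.)
 */
static void force_unmap(struct phys_page *page)
{
        struct pte_chain *pc;

        for (pc = page->mappings; pc != NULL; pc = pc->next)
                *pc->ptep = 0;
        page->mappings = NULL;
}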
Because our inability to hand out larger chunks of memory affects system functionality and could even affect system stability, it seems warranted to sacrifice a little bit of speed (the buddy system is fast!) in order to solve most of the above problems. The main problem with the current system is that it doesn't differentiate between swappable and unswappable memory, leading to a system where page tables and other cruft are scattered all over the place, making it impossible to free up one large contiguous area.
This problem is made even worse by the fact that on some architectures we can only do DMA to addresses under 16 MB, and it will undoubtedly show up again in a few years when we all have 16 GB of memory and try to do DMA to those oldie 32 bit PCI cards that don't support dual address cycle mode :-)
In the current Linux kernel, we have the following uses for memory: kernel stacks, SLAB allocations, page tables, user pages and DMA buffers.
In addition to this, we can differentiate between 3 different kinds of memory: DMA memory (the only memory some architectures can DMA to), normal memory and slow (uncached) memory.
This leads to the following zone allocation orders:
| SLAB and kernel stack | user memory | page tables | DMA buffers |
This means that when, for instance, we run out of user memory and there is still enough free memory available, we first try to grab a zone of 'normal' memory; if that fails we look for a free zone of slow memory, and DMA memory is tried last.
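A small sketch of that fallback order; grab_free_zone() is a hypothetical helper that hands back a free zone of the requested memory type, or NULL if none is available:
/*
 * Sketch of the allocation order for user memory described above.
 */
#include <stddef.h>

enum mem_type { MEM_NORMAL, MEM_SLOW, MEM_DMA };

struct mem_zone;
extern struct mem_zone *grab_free_zone(enum mem_type type);    /* hypothetical */

static struct mem_zone *grab_user_zone(void)
{
        static const enum mem_type order[] = { MEM_NORMAL, MEM_SLOW, MEM_DMA };
        size_t i;

        for (i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
                struct mem_zone *zone = grab_free_zone(order[i]);
                if (zone != NULL)
                        return zone;
        }
        return NULL;    /* nothing free at all; a zone will have to be reclaimed */
}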
User pages are kept on a number of linked lists: active, inactive, clean and free. Newly allocated pages enter the system on the inactive queue; the physical pages backing them are taken from the free queue first, and from the clean queue when we're out of free pages. Inactive pages either get promoted to the active queue (when they're in heavy use) or demoted to the clean queue (if they're dirty, we have to clean them first). Pages on the clean queue are also unmapped from the page tables and thus already 'halfway swapped out'. Pages only enter the free list when a program free()s pages or when we add a new zone to the user area.
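To make the queue transitions concrete, here is a minimal sketch with hypothetical names; it only shows the decision made for a page found on the inactive queue:
/*
 * Sketch only: hypothetical types showing where a page on the inactive
 * queue moves next, as described above.
 */
enum page_queue { PQ_ACTIVE, PQ_INACTIVE, PQ_CLEAN, PQ_FREE };

struct user_page {
        enum page_queue queue;          /* which queue the page is on        */
        int referenced;                 /* touched since the last scan?      */
        int dirty;                      /* must be written out before reuse  */
};

static enum page_queue age_inactive_page(struct user_page *page)
{
        if (page->referenced)
                return PQ_ACTIVE;       /* in heavy use: promote             */
        if (page->dirty)
                return PQ_INACTIVE;     /* must be cleaned before moving on  */
        return PQ_CLEAN;                /* unmapped, 'halfway swapped out'   */
}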
In order to be able to free new zones (for when SLAB gets overly active), we need to be able to mark a relatively free zone force-freeable. Upon scanning such a page, kswapd will free it and make sure it isn't allocated again. When the PTE chaining system gets integrated into the kernel, we can simply force-free a user zone with relatively few active pages when the system runs out of free zones. Until then we'll need to keep two free zones and walk the page tables to find and free the pages.
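A sketch of what force-freeing a zone could look like, using the struct mem_zone and ZONE_RECL flag defined below and a hypothetical swap_out_page() helper:
/*
 * Sketch only: mark a zone as being reclaimed so the allocator won't hand
 * out its pages anymore, then push every page in the 32 page zone out.
 * swap_out_page() is a hypothetical helper; struct mem_zone and ZONE_RECL
 * are defined further down in this paper.
 */
extern void swap_out_page(struct mem_zone *zone, int index);

static void force_free_zone(struct mem_zone *zone)
{
        int i;

        zone->flags |= ZONE_RECL;       /* no new allocations from this zone */
        for (i = 0; i < 32; i++)
                swap_out_page(zone, i);
}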
/*
 * The struct mem_zone is used to describe a 32 page memory area.
 */
struct mem_zone {
        struct mem_zone *prev, *next;   /* the previous and next zone on this list */
        unsigned long used;             /* used pages bitmap for SLAB, etc !!! count for user */
        unsigned long flags;
};
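A short sketch of how the dual-purpose 'used' field could be read, per the comment above; the helper names are hypothetical:
/*
 * Sketch only: 'used' is a bitmap (one bit per page in the 32 page zone)
 * for SLAB, stack and page table zones, and a plain page count for user
 * zones.
 */
static int zone_page_used(struct mem_zone *zone, int index)
{
        return (zone->used >> index) & 1;       /* bitmap interpretation */
}

static void zone_mark_page_used(struct mem_zone *zone, int index)
{
        zone->used |= 1UL << index;
}

static unsigned long user_zone_page_count(struct mem_zone *zone)
{
        return zone->used;                      /* count interpretation */
}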
/*
 * Flags for struct mem_zone->flags
 */
#define ZONE_DMA        0x00000001      /* DMA memory */
#define ZONE_SLOW       0x00000002      /* uncached/slow memory */
#define ZONE_USER       0x00000004      /* usermode pages, these defines are for paranoia only */
#define ZONE_SLAB       0x00000008      /* large SLAB */
#define ZONE_STK        0x00000010      /* kernel stack and order-1 SLAB (and order-0 SLAB if there is slow memory) */
#define ZONE_PTBL       0x00000020      /* page tables and one-page SLAB (except when there is slow memory) */
#define ZONE_DMABUF     0x00000040      /* DMA buffers */
#define ZONE_RECL       0x00000080      /* we are reclaiming this zone */
#define ZONE_0          0x00000100      /* loose pages allocated */
#define ZONE_1          0x00000200      /* order-1 (2^1 = 2 page) chunks allocated */
#define ZONE_2          0x00000400      /* etc... in order to help in buddy-like allocation for */
#define ZONE_3          0x00000800      /* large SLAB zones on small memory machines */
#define ZONE_4          0x00001000
#define ZONE_5          0x00002000
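As an illustration of the ZONE_0 through ZONE_5 bits (an assumption on my part about how they would be used): since the order-N flag is simply ZONE_0 shifted left by N, recording and testing which chunk sizes live in a large SLAB zone is a one-liner.
/*
 * Sketch only: track which allocation orders are present in a large SLAB
 * zone, to help the buddy-like allocation mentioned above.
 */
static void zone_note_order(struct mem_zone *zone, int order)
{
        zone->flags |= ZONE_0 << order;         /* ZONE_1 == ZONE_0 << 1, etc. */
}

static int zone_has_order(struct mem_zone *zone, int order)
{
        return (zone->flags & (ZONE_0 << order)) != 0;
}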
/*
 * Memory statistics
 */
typedef struct {
        unsigned long used;
        unsigned long free;
} zone_stats_t;
struct memstats {
        zone_stats_t ptbl;
        zone_stats_t stk;
        zone_stats_t slab;
        zone_stats_t dma;

        /* slightly different structs for these */
        struct {
                unsigned long active;
                unsigned long inactive;
                unsigned long clean;    /* we do lazy reclamation */
                unsigned long free;
        } user;

        struct {
                unsigned long dma;      /* different memory types */
                unsigned long normal;
                unsigned long slow;
        } free;

        struct {
                unsigned long num_physpages;
                unsigned long reserved; /* reserved pages */
                unsigned long kernel;   /* taken by static kernel stuff */
        } misc;
};
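For example (a hypothetical helper of my own, not part of the design), the number of user pages we can hand out without doing any work is simply the free count plus the clean count, since clean pages are already unmapped:
/*
 * Sketch only: user pages available without any further work are the
 * free pages plus the clean pages (the latter are already unmapped from
 * the page tables).
 */
static unsigned long user_pages_readily_available(const struct memstats *stats)
{
        return stats->user.free + stats->user.clean;
}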
/* This is where we find the different zones */
struct memzones {
        struct {
                struct mem_zone dma;
                struct mem_zone normal;
                struct mem_zone slow;
        } free;
        struct mem_zone dma;
        struct mem_zone user;
        struct mem_zone slab;
        struct mem_zone stk;
        struct mem_zone ptbl;
};
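Finally, a sketch of how the free lists in struct memzones could back the allocation order described earlier; it assumes (my assumption, not the design's) that the embedded struct mem_zone members act as list heads whose next pointer is NULL when the list is empty.
/*
 * Sketch only: pick the free list for a given memory type and hand back
 * the first zone on it, or NULL if that list is empty.
 */
#include <stddef.h>

static struct mem_zone *first_free_zone(struct memzones *zones, unsigned long type)
{
        struct mem_zone *head;

        if (type & ZONE_DMA)
                head = &zones->free.dma;
        else if (type & ZONE_SLOW)
                head = &zones->free.slow;
        else
                head = &zones->free.normal;

        return head->next;
}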