
Linux MM: design for a zone based memory allocator

Rik van Riel, July 1998

 
One of the biggest problems currently facing the Linux memory management subsystem is memory fragmentation. This is the result of several developments in other parts of the Linux kernel, most importantly the growth of each process's kernel stack to 8 kB and the dynamic allocation of DMA and networking buffers. These factors, together with a general speedup of both peripheral hardware and the device drivers, have led to a situation where the currently used buddy allocator just can't cut it anymore. This white paper is divided into three parts: the problem, the solution and some actual code. I need a lot of comments and hints for possible improvement, so feel free to email them to me...

The problem

The problem is caused by the fact that memory is allocated in chunks of different sizes. For most types of usage we just allocate memory one page (4 kB on most machines) at a time, but sometimes we give out larger pieces of memory (2, 4, 8, 16 or 32 pages at once). Because most UNIX (and Linux) machines run with memory completely full (free memory is wasted memory), it is next to impossible to free larger areas, and the best we can do is be very careful not to hand out those large areas when we only need a small one.
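To make the rounding concrete, here is a minimal sketch (the function name is mine, not the kernel's) of how a buddy-style allocator maps a request onto a power-of-two block of pages:

/*
 * Minimal illustration (not kernel code): a buddy-style allocator
 * hands out blocks of 2^order pages, so every request is rounded up
 * to the next power of two and must be satisfied from one contiguous,
 * properly aligned area.
 */
static unsigned int buddy_order(unsigned int pages)
{
	unsigned int order = 0;

	while ((1U << order) < pages)
		order++;
	return order;	/* e.g. 5 pages -> order 3 -> 8 contiguous pages */
}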

There have been (and there are) several workarounds for this fragmentation issue; one of them (PTE chaining) even involves a physical-to-logical translation scheme, almost a reverse page table. With that project, we can swap out pages based on their physical address, thus force-freeing the one page that blocks an entire 128 kB area. This would solve most of our problems, except when that last page is unswappable, for example a page table or a program's kernel stack. In that case, we're screwed regardless of what deallocation scheme we're using.

Because our inability to hand out larger chunks of memory affects system functionality, and could even affect system stability, it seems warranted to sacrifice a little bit of speed (the buddy system is fast!) in order to solve most of the above problems. The main problem with the current system is that it doesn't differentiate between swappable and unswappable memory, leading to a system where page tables and other cruft are scattered all over the system, making it impossible to free up one large contiguous area.

This problem is made even worse by the fact that on some architectures we can only do DMA to addresses under 16 MB, and it will undoubtedly show up again in some years when we all have 16 GB of memory and try to do DMA to those oldie 32-bit PCI cards that don't support dual address cycle mode :-)

The solution

The solution is to hand out memory in zones of 128 kB each, and to use each zone for one type of usage only. Then we can be sure that no page tables interfere with the freeing of a zone of user memory, and we can always just free an area of memory.
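A sketch of the arithmetic this implies, assuming 4 kB pages (the names below are illustrative, not from the draft header): a zone is 32 pages = 128 kB, so the zone a page belongs to follows directly from its page frame number.

#define PAGES_PER_ZONE	32	/* 32 * 4 kB = 128 kB */

static inline unsigned long zone_index(unsigned long pfn)
{
	return pfn / PAGES_PER_ZONE;	/* which zone this page is in */
}

static inline unsigned long zone_offset(unsigned long pfn)
{
	return pfn % PAGES_PER_ZONE;	/* the page's slot within its zone */
}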

In the current Linux kernel, we have the following uses for memory:

  • user memory
  • page tables
  • kernel stack
  • small SLAB (one- and two-page allocations)
  • large SLAB
  • DMA buffers

For small (< 16 MB) machines, the above scheme is overkill and we treat several types of usage as one. We can, for instance, treat large SLAB and DMA the same, and small SLAB, kernel stack and page tables can be allocated in the same zones too. Small SLAB and kernel stack will be treated the same on every machine; the distinction is only made because I want the documentation to be complete.

In addition to this, we can differentiate between three different kinds of memory:

  • DMA memory (memory that can be used for DMA; on some architectures only addresses under 16 MB qualify)
  • normal memory
  • slow memory (uncached or otherwise slow memory)

Since we don't want to waste the slow memory we might have, we can use it for page tables and user memory that isn't used very often. If we have user memory in slow memory and it turns out that it is used very often, we can always use the swap code to relocate it to fast memory. DMA memory is scarce, so we want to allocate it only when we specifically need it or when we don't have any other memory left.

This leads to the following zone allocation orders:

  • SLAB and kernel stack: normal memory, then DMA memory, then slow memory
  • user memory: normal memory, then slow memory, then DMA memory
  • page tables: slow memory, then normal memory, then DMA memory
  • DMA buffers: DMA memory only
This means that when, for instance, we run out of user memory and there is enough free memory available, we first try to grab a zone of normal memory; if that fails we look for a free area of slow memory, and DMA memory is tried last.
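An illustrative sketch of these fallback orders (none of these names appear in the draft header below): each usage type walks its preference list until a memory type with a free zone is found.

struct mem_zone;	/* defined in the draft header below */

enum memtype { MEM_NORMAL, MEM_DMA, MEM_SLOW, MEM_END };

/* hypothetical helper: pop a free zone of the given memory type */
extern struct mem_zone *grab_free_zone(enum memtype type);

static const enum memtype user_order[] = { MEM_NORMAL, MEM_SLOW, MEM_DMA, MEM_END };
static const enum memtype ptbl_order[] = { MEM_SLOW, MEM_NORMAL, MEM_DMA, MEM_END };

static struct mem_zone *alloc_zone(const enum memtype *order)
{
	for (; *order != MEM_END; order++) {
		struct mem_zone *z = grab_free_zone(*order);
		if (z)
			return z;	/* got a zone of the preferred type */
	}
	return NULL;	/* nothing free at all: time to force-free a zone */
}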

Page allocation

For SLAB, page table and DMA memory we always try to allocate from the fullest zone available, and we grab a free zone when we're out of our own memory. In order to grab the fullest zone, we keep these zones in a (partially?) sorted order. For large SLAB/DMA areas we will also want to keep in mind the sizes of the memory chunks previously allocated in this zone.
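With the zone lists kept sorted by fullness, the fullest zone is simply the head of the list. How full a zone is can be read off the 32-bit 'used' bitmap from the draft header below; a sketch (the function is mine):

/*
 * Illustrative: count the used pages in a zone by taking the
 * population count of its 'used' bitmap, one bit per page,
 * 32 pages per zone.
 */
static int zone_pages_used(unsigned long used)	/* the 'used' bitmap */
{
	int count = 0;

	while (used) {
		count += used & 1;
		used >>= 1;
	}
	return count;
}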

User pages are kept on a number of linked lists: active, inactive, clean and free. We allocate new pages in the inactive queue and perform allocations from the free queue first, moving to the clean queue when we're out of free pages. Inactive pages get either promoted to the active queue (when they're in heavy use) or demoted to the clean queue (when they're dirty, we have to clean them first). Pages in the clean queue are also unmapped from the page table and thus already 'halfway swapped out'. Pages only enter the free list when a program free()s pages or when we add a new zone to the user area.
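This life cycle can be summarized as a small state machine; a sketch (the queue names come from the text above, the code and its parameters are illustrative):

enum page_queue { PQ_FREE, PQ_INACTIVE, PQ_ACTIVE, PQ_CLEAN };

static enum page_queue age_user_page(enum page_queue q, int referenced, int dirty)
{
	switch (q) {
	case PQ_INACTIVE:
		if (referenced)
			return PQ_ACTIVE;	/* in heavy use: promote */
		if (dirty)
			return PQ_INACTIVE;	/* must be cleaned before demotion */
		return PQ_CLEAN;		/* unmapped, 'halfway swapped out' */
	case PQ_ACTIVE:
		return referenced ? PQ_ACTIVE : PQ_INACTIVE;
	default:
		return q;			/* clean and free pages just wait */
	}
}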

In order to be able to free new zones (for when SLAB gets overly active), we need to be able to mark a relatively free zone force-freeable. Upon scanning such a page, kswapd will free it and make sure it isn't allocated again. When the PTE chaining system gets integrated into the kernel, we can just force-free a user zone with relatively few active pages when the system runs out of free zones. Until then, we'll need to keep two free zones and walk the page tables to find and free the pages.

Actual code

There's not much actual code yet, but all the administrative details are ready. ALPHA status reached: the .h file is ready :)
/*
 * The struct mem_zone is used to describe a 32-page memory area.
 */

struct mem_zone {
	struct mem_zone *prev, *next;	/* the previous and next zone on this list */
	unsigned long used;		/* used-pages bitmap for SLAB, etc.; a count for user zones */
	unsigned long flags;
};
    
/*
 * Flags for mem_zone->flags
 */

#define ZONE_DMA	0x00000001	/* DMA memory */
#define ZONE_SLOW	0x00000002	/* uncached/slow memory */
#define ZONE_USER	0x00000004	/* usermode pages; these defines are for paranoia only */
#define ZONE_SLAB	0x00000008	/* large SLAB */
#define ZONE_STK	0x00000010	/* kernel stack and order-1 SLAB (and order-0 SLAB if there is slow memory) */
#define ZONE_PTBL	0x00000020	/* page tables and one-page SLAB (except when there is slow memory) */
#define ZONE_DMABUF	0x00000040	/* DMA buffers */
#define ZONE_RECL	0x00000080	/* we are reclaiming this zone */
#define ZONE_0		0x00000100	/* loose pages allocated */
#define ZONE_1		0x00000200	/* order-1 (2^1 = 2 page) chunks allocated */
#define ZONE_2		0x00000400	/* etc... in order to help in buddy-like allocation for */
#define ZONE_3		0x00000800	/* large SLAB zones on small-memory machines */
#define ZONE_4		0x00001000
#define ZONE_5		0x00002000
    
/*
 * Memory statistics
 */

typedef struct {
	unsigned long used;
	unsigned long free;
} zone_stats_t;

struct memstats {
	zone_stats_t ptbl;
	zone_stats_t stk;
	zone_stats_t slab;
	zone_stats_t dma;
	/* Slightly different structs for these */
	struct {
		unsigned long active;
		unsigned long inactive;
		unsigned long clean;	/* we do lazy reclamation */
		unsigned long free;
	} user;
	struct {
		unsigned long dma;	/* different memory types */
		unsigned long normal;
		unsigned long slow;
	} free;
	struct {
		unsigned long num_physpages;
		unsigned long reserved;	/* reserved pages */
		unsigned long kernel;	/* taken by static kernel stuff */
	} misc;
};
    
/* This is where we find the different zones */

struct memzones {
	struct {
		struct mem_zone dma;
		struct mem_zone normal;
		struct mem_zone slow;
	} free;
	struct mem_zone dma;
	struct mem_zone user;
	struct mem_zone slab;
	struct mem_zone stk;
	struct mem_zone ptbl;
};
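To show how the pieces above fit together, a small usage sketch; these helpers are mine, not part of the draft header.

/*
 * Illustrative helpers: mark page 'n' of a SLAB/stack/ptbl-style zone
 * used in its 32-bit bitmap, and test a zone's memory type flag.
 */
static inline void zone_set_used(struct mem_zone *z, int n)
{
	z->used |= 1UL << n;			/* n in [0, 31] */
}

static inline int zone_is_dma(struct mem_zone *z)
{
	return (z->flags & ZONE_DMA) != 0;
}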