Date: Thu, 23 Jul 1998 13:48:28 +0100
Message-Id: <199807231248.NAA04764@dax.dcs.ed.ac.uk>
From: "Stephen C. Tweedie"
Subject: Good and bad news on 2.1.110, and a fix
To: Linus Torvalds, Alan Cox, "David S. Miller", Bill Hawes,
	Ingo Molnar, Mark Hemment
Cc: linux-mm@kvack.org, linux-kernel@vger.rutgers.edu, Stephen Tweedie

As the subject says, 2.1.110 is both very, very promising and a
stability nightmare, depending on what you are doing with it.
Fortunately, a very simple failsafe mechanism against the observed
problems seems to deal with them extremely efficiently, without any
other performance impact, and the resulting VM appears to be very
stable.

The new memory code in get_free_pages() is founded on a good principle:
if you don't let non-atomic consumers at the last few pages, then the
last high-order pages stay reserved for atomic use.  That's great as
far as it goes.  When 2.1.110 gets going, it works a treat: performance
on low memory is by *far* the best of any recent kernel, and approaches
2.0 performance.  I'm seriously impressed.

However, if the memory usage pattern just happens by chance to leave no
high-order pages free, then free_memory_available() simply gives up in
disgust.  My first attempt at booting 2.1.110 on a low-memory setup
failed because 4k NFS deadlocked during login.  Shift-ScrollLock showed
all the free memory in 4k and 8k pages, but 4k NFS requires 16k pages.

The problem is twofold.  First of all, with low memory it is enormously
harder to get free 16k pages than free 8k pages.  With the default
SLAB_BREAK_GFP_ORDER setting of two, the slab allocator tries to
allocate 16k slabs for every object over 2048 bytes.  Setting this to
one instead improved things dramatically, and I haven't been able to
reproduce the problem since.  This does make some allocations less
efficient, but since it is primarily networking that creates atomic
demand for allocations larger than 8k, we can expect those allocations
to be sufficiently short-lived that packing density is not important.

The second problem is more serious: free_memory_available() simply
doesn't care about page orders any more, and if you don't have enough
high-order pages then you won't be given any.  A "ping -s 3000" to a
2.1.110 box doing NFS will kill it, even on a 16MB configuration.
Depending on the amount of background VM activity (eg. cron), it may
unstick itself a few minutes after the attack stops, but a complete
session freeze of four or five minutes is still pretty bad.
Shift-ScrollLock on the 16MB box in this state shows all 97 free pages
being of order 0; there are no higher-order pages available at all,
even after the ping flood.

Linus, the patch at the end fixes both of these problems for me, in a
painless manner.  The patch to slab.c simply makes SLAB_BREAK_GFP_ORDER
dependent on the memory size, defaulting to 1 instead of 2 if the
machine has less than 16MB.  The patch to page_alloc.c is a minimal fix
for the fragmentation problem: it records allocation failures for
high-order pages, and forces free_memory_available() to return false
until a page of at least that order becomes available.
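[Editorial illustration: the following is a minimal userspace sketch of
that failure-feedback idea, not the kernel code (the real patch is at
the end of this mail).  free_count[], alloc_block() and
memory_available() are names invented for the example, and buddy
splitting, coalescing and the nr_free_pages thresholds are all
ignored.]

#include <stdio.h>

#define NR_MEM_LISTS 6		/* orders 0 (4k) through 5 (128k) */

static int free_count[NR_MEM_LISTS];	/* free blocks at each order */
static int max_failed_order;		/* highest order seen to fail */

static int alloc_block(int order)
{
	if (free_count[order] > 0) {
		free_count[order]--;
		if (order >= max_failed_order)
			max_failed_order = 0;	/* fragmentation eased */
		return 1;
	}
	if (order > max_failed_order)
		max_failed_order = order;	/* remember the failure */
	return 0;
}

static int memory_available(void)
{
	int order;

	/* Refuse to say "available" until a block of at least the
	 * failed order is back on a free list, no matter how many
	 * order-0 pages there are.  (The real free_memory_available()
	 * also checks nr_free_pages against freepages.low; omitted.) */
	for (order = max_failed_order; order < NR_MEM_LISTS; order++)
		if (free_count[order] > 0)
			return 1;
	return max_failed_order == 0;
}

int main(void)
{
	free_count[0] = 97;	/* 97 free 4k pages, nothing bigger */
	alloc_block(2);		/* a 16k (order-2) request fails */
	printf("available after order-2 failure: %d\n", memory_available());

	free_count[3] = 1;	/* a 32k block becomes free */
	printf("available with an order-3 block: %d\n", memory_available());
	return 0;
}

[The point is the asymmetry: one failed order-2 allocation keeps
memory_available() returning false until a block of order 2 or higher
reappears, even though plenty of order-0 pages are free.]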
The impact should be low, since with the SLAB_BREAK_GFP_ORDER patch
2.1.111-pre1 seems to survive pretty well anyway (and hence won't
invoke the new mechanism), but under major atomic allocation load the
patch allows even low-memory machines to survive the ping attack
handsomely (even with 8k NFS on a 6.5MB configuration).  I get tons of
"IP: queue_glue: no memory for gluing queue" failures, but enough NFS
retries get through, even during the ping flood, to prevent any
NFS-server-unreachable errors.

2.1.110 has fixed most of the VM problems I've been tracking.  It
eliminates the rusting memory death: a "find /" can still grow the
inode cache enough to cause a small but perceptible and permanent
performance drop on low memory, but at least it is stable, does not
appear to be cumulative, and does not result in swap deaths.  The page
cache is trimmed very nicely, and compiles are running more smoothly
than ever on 2.1.  The only stability problems I found were the atomic
allocation failures: failing to boot is a _serious_ problem.  With
these fixes in place, even that problem appears to have vanished.

It's looking good.  Comments?

--Stephen

----------------------------------------------------------------
--- mm/page_alloc.c.~1~	Wed Jul 22 14:48:23 1998
+++ mm/page_alloc.c	Thu Jul 23 13:00:54 1998
@@ -31,6 +31,8 @@
 int nr_swap_pages = 0;
 int nr_free_pages = 0;
 
+static int max_failed_order;
+
 /*
  * Free area management
  *
@@ -114,6 +116,23 @@
 {
 	static int available = 1;
 
+	/* First, perform a very simple test for fragmentation */
+	if (max_failed_order) {
+		unsigned long flags;
+		struct free_area_struct * list;
+		spin_lock_irqsave(&page_alloc_lock, flags);
+		for (list = free_area+max_failed_order;
+		     list < free_area+NR_MEM_LISTS;
+		     list++) {
+			if (list->next != memory_head(list))
+				break;
+		}
+		spin_unlock_irqrestore(&page_alloc_lock, flags);
+		if (list == free_area+NR_MEM_LISTS)
+			return 0;
+		max_failed_order = 0;
+	}
+
 	if (nr_free_pages < freepages.low) {
 		available = 0;
 		return 0;
@@ -209,6 +228,8 @@
 			nr_free_pages -= 1 << order; \
 			EXPAND(ret, map_nr, order, new_order, area); \
 			spin_unlock_irqrestore(&page_alloc_lock, flags); \
+			if (order >= max_failed_order) \
+				max_failed_order = 0; \
 			return ADDRESS(map_nr); \
 		} \
 		prev = ret; \
@@ -263,6 +284,8 @@
 	spin_lock_irqsave(&page_alloc_lock, flags);
 	RMQUEUE(order, (gfp_mask & GFP_DMA));
 	spin_unlock_irqrestore(&page_alloc_lock, flags);
+	if (order > max_failed_order)
+		max_failed_order = order;
 nopage:
 	return 0;
 }
--- mm/slab.c.~1~	Wed Jul  8 14:35:46 1998
+++ mm/slab.c	Thu Jul 23 12:41:57 1998
@@ -313,7 +313,9 @@
 /* If the num of objs per slab is <= SLAB_MIN_OBJS_PER_SLAB,
  * then the page order must be less than this before trying the next order.
  */
-#define SLAB_BREAK_GFP_ORDER	2
+#define SLAB_BREAK_GFP_ORDER_HI	2
+#define SLAB_BREAK_GFP_ORDER_LO	1
+static int slab_break_gfp_order = SLAB_BREAK_GFP_ORDER_LO;
 
 /* Macros for storing/retrieving the cachep and or slab from the
  * global 'mem_map'.  With off-slab bufctls, these are used to find the
@@ -447,6 +449,11 @@
 	cache_cache.c_colour = (i-(cache_cache.c_num*size))/L1_CACHE_BYTES;
 	cache_cache.c_colour_next = cache_cache.c_colour;
 
+	/* Fragmentation resistance on low memory */
+	if ((num_physpages * PAGE_SIZE) < 16 * 1024 * 1024)
+		slab_break_gfp_order = SLAB_BREAK_GFP_ORDER_LO;
+	else
+		slab_break_gfp_order = SLAB_BREAK_GFP_ORDER_HI;
 	return start;
 }
 
@@ -869,7 +876,7 @@
 	 * bad for the gfp()s.
 	 */
 	if (cachep->c_num <= SLAB_MIN_OBJS_PER_SLAB) {
-		if (cachep->c_gfporder < SLAB_BREAK_GFP_ORDER)
+		if (cachep->c_gfporder < slab_break_gfp_order)
 			goto next;
 	}
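
[Editorial illustration: to make the 16k-slabs-for-objects-over-2048-
bytes behaviour described above concrete, here is a hypothetical
userspace sketch of the sizing rule the break order imposes.  It is a
simplification of the real cache-sizing loop in mm/slab.c, not a copy
of it; slab_order() and the constant values are invented for the
example.]

#include <stdio.h>

#define PAGE_SIZE		4096
#define SLAB_MIN_OBJS_PER_SLAB	4

static int slab_order(unsigned long obj_size, int break_order)
{
	int order = 0;

	/* Grow the slab while it holds too few objects, but never
	 * past the break order. */
	for (;;) {
		unsigned long slab_bytes = (unsigned long)PAGE_SIZE << order;
		if (slab_bytes / obj_size > SLAB_MIN_OBJS_PER_SLAB)
			break;		/* enough objects fit */
		if (order >= break_order)
			break;		/* refuse to grow further */
		order++;
	}
	return order;
}

int main(void)
{
	/* A 3000-byte object, e.g. a large network buffer: */
	int o2 = slab_order(3000, 2);
	int o1 = slab_order(3000, 1);

	printf("break order 2: order-%d (%dk) slabs\n",
	       o2, (PAGE_SIZE << o2) / 1024);	/* order-2 -> 16k */
	printf("break order 1: order-%d (%dk) slabs\n",
	       o1, (PAGE_SIZE << o1) / 1024);	/* order-1 -> 8k */
	return 0;
}

[With 4k pages, break order 2 lets a cache of large objects climb to
order-2 (16k) slabs, exactly the allocation size that becomes scarce
under fragmentation; break order 1 caps such caches at 8k, trading a
little packing efficiency for far more satisfiable allocations.]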