From mboxrd@z Thu Jan 1 00:00:00 1970 Message-ID: <464B4D43.9020002@shadowen.org> Date: Wed, 16 May 2007 19:28:19 +0100 From: Andy Whitcroft MIME-Version: 1.0 Subject: Re: [PATCH 2/2] Only check absolute watermarks for ALLOC_HIGH and ALLOC_HARDER allocations References: <20070514173218.6787.56089.sendpatchset@skynet.skynet.ie> <20070514173259.6787.58533.sendpatchset@skynet.skynet.ie> <464AF589.2000000@yahoo.com.au> <20070516132419.GA18542@skynet.ie> <464B089C.9070805@yahoo.com.au> <20070516140038.GA10225@skynet.ie> <464B110E.2040309@yahoo.com.au> In-Reply-To: <464B110E.2040309@yahoo.com.au> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Return-Path: To: Nick Piggin , clameter@sgi.com Cc: Mel Gorman , nicolas.mailhot@laposte.net, akpm@linux-foundation.org, linux-mm@kvack.org List-ID: Nick Piggin wrote: > Mel Gorman wrote: >> On (16/05/07 23:35), Nick Piggin didst pronounce: >> >>> Mel Gorman wrote: > >>>> In page_alloc.c >>>> >>>> if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait) >>>> alloc_flags |= ALLOC_HARDER; >>>> >>>> See the !wait part. >>> >>> And the || part. >>> >> >> >> I doubt a rt_task is thrilled to be entering direct reclaim. > > Doesn't mean you should break the watermarks. !wait allocations don't > always happen from interrupt context either, and it is possible to see > code doing The problem perhaps here is that we are not able to allocate at all despite having large amounts of memory free. In the original problem report we had a failing order-2 allocation when order-7 pages were free, and we were over the reserve. Indeed in experiments with this algorithm I am finding that it is common to fail an order 2 allocation when more than 2* the reserve is available. More on this at the end ... > if (!alloc(GFP_KERNEL&~__GFP_WAIT)) { > spin_unlock() > alloc(GFP_KERNEL) > spin_lock() > } > > >>>> The ALLOC_HIGH applies to __GFP_HIGH allocations which are allowed to >>>> dip into emergency pools and go below the reserve. >>> >>> And some of them can sleep too. >>> >> >> >> If you feel very strongly about it, I can back out the ALLOC_HIGH part >> for >> __GFP_HIGH allocations but it looks like at a glance that users of >> __GFP_HIGH >> are not too keen on sleeping; > > I feel strongly about not breaking these things which are specifically > there > for a reason and that are being changed seemingly because of the false > impression that kswapd doesn't proactively free pages for them. The interaction with kswapd is not instantaneous. When an allocation at high order fails to allocate at the low watermarks it will indeed wake up kswapd and that will work to release memory at the order specified. However, if it is already reclaiming at another order it will not switch up until it next completes a pass. For an allocator who cannot sleep this is very likely to be too late. This is never going to help a bursty allocator. >>>> ALLOC_HARDER is an urgent allocation class. >>> >>> And HIGH is even more, and MEMALLOC even more again. >>> >> >> >> HIGH => ALLOC_HIGH => obey watermarks at order-0 >> >> Somewhat counter-intuitively, with the current code if the allocation is >> a really high priority but can sleep, it can actually allocate without >> any >> watermarks at all > > I didn't understand what you meant? > > >>>> What actually happens is that high-order allocations fail even though >>>> the watermarks are met because they cannot enter direct reclaim. >>> >>> Yeah, they fail leaving some spare for more urgent allocations. Like >>> how the order-0 allocations work. >> >> >> order-0 watermarks are still in place. After the patch, it is still not >> possible for the allocations to break the watermarks there. > > The watermarks for higher order pages you could say are implicit but > still there. They are scaled down from the order-0 watermarks, so they > should behave in the same way. I just can't understand why you're > bypassing these if you think the order-0 behaviour is OK. The problem is the watermarks for the higher orders are actually much stricter than for low orders. This is a by product of the way in which the algorithm calculates the current free at each iteration, taking away the pages at smaller order. The effective free pages at each order is scaled by the ratio of the free pages at that order to all the sum of all higher orders. The effective min at each order is halved. Due to the nature of the reclaim strategy we will always expect to see exponentially more order-0 pages than order-1 etc and so on, making it hugely more difficult to allocate a page at these higher orders. >>> They should also kick kswapd to start freeing pages _before_ they start >>> failing too. >>> >> >> >> Should prehaps, but from what I read kswapd is only kicked into action >> when the first allocation attempt has already failed. > > Well that's wrong unless you are allocating with GFP_THISNODE, in which > case that is specifically the behaviour that is asked for. kswapd is kicked when we cannot allocate at the normal low water mark, we will then attempt a further allocation at min/2 etc. However we are as likely to fail the second as the effective low water mark for higher order pages is significantly higher than for order-0. So kswapd will be woken, but it has a huge job on its hands to get us from order-0 low order to order-N low water. As we cannot sleep we are very likely to fail. I did some testing with the current algorithm in a test harness. That testing seems to show that the effect of the reserve can be majorly higher than the real reserve. If we look at the OOM from the original report we can see there was some 7700 pages free at the time of the allocation. The effective reserve for an ALLOC_HARD allocation is only 711 pages, and yet we cannot allocate any pages over order-0. total free: 7768 reserve : 1423 0 1 2 3 4 5 6 7 8 9 10 7560 0 8 0 1 1 0 1 0 0 0 allocation order : 0 effective free : 7768 effective reserve: 711.5 allocation order : 1 effective free : 7767 effective reserve: 711.5 0 207 355 FAIL Looking at the figures above dispassionately it is hard to fault the logic of the allocator denying this allocation. There are indeed very few pages at those orders and some (where possible) are reserved for PF_MEM tasks, for reclaim itself. However, the reservation system takes no account of higher orders, so we can always end up in a situation where there only order-0 pages free; all higher orders have been split. This gives us a constraint on all reclaim processing, it must only involve order-0 pages else it could deadlock. BUT if that is true and reclaim only uses order-0 pages then there is in actually no point in retaining any PF_MEM reserve at higher order as it would never be used. What does this mean: 1) any slab which is used from the reclaim path _must_ use order-0 allocations, 2) any slab which is allocated from atomically _should_ use order-0 allocations. My understanding is all slabs within a slub slab cache have to be the same order. So we need to ensure that any slab that might be used from the reclaim path must only use order-0 pages. Also it seems that any slab that is allocated from atomically will have to use order-0 pages in order to remain reliable. Christoph, do we have any facility to tag caches to use a specific allocation order? I think that probabally means that the second of our patches here is not the right approach to this problem long term. A patch which sets the critical slabs to order-0 is probabally the way forward. Having said that, as I mentioned before if all of PF_MEM processing is (and it must be to be safe) order-0 then actually some relaxing of the watermarks above order-0 may well be in order. It is interesting to note the state of the system with the kswapd patch to target reclaim at SLUB order, figures below. We get significantly closer to real OOM before failing the order-2 allocations. To my mind that indicates that this change is pretty beneficial under memory pressure. Especially for Andrews e1000 workload. We might consider using that patch with the minimum order set to PAGE_ALLOC_COSTLY_ORDER. total free: 2889 reserve : 1423 0 1 2 3 4 5 6 7 8 9 2619 27 6 0 0 2 0 1 0 0 allocation order : 0 effective free : 2889 effective reserve: 711.5 allocation order : 1 effective free : 2888 effective reserve: 711.5 0 269 355 FAIL -apw -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org