Date: Tue, 1 Nov 2005 00:59:41 +0000 (GMT)
From: Mel Gorman
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
In-Reply-To: <4366A8D1.7020507@yahoo.com.au>
To: Nick Piggin
Cc: "Martin J. Bligh" , Andrew Morton , kravetz@us.ibm.com,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    lhms-devel@lists.sourceforge.net

On Tue, 1 Nov 2005, Nick Piggin wrote:

> Martin J. Bligh wrote:
> > > We think that Mel's patches will allow us to reintroduce Rohit's
> > > optimisation.
> >
> > ... frankly, it happens without Rohit's patch as well (under more stress).
> > If we want an OS that is robust, and supports higher order allocations,
> > we need to start caring about fragmentation. Not just for large pages,
> > and hotplug, but also for more common things like jumbo GigE frames,
> > CIFS, various device drivers, kernel stacks > 4K etc.
>
> But it doesn't seem to be a great problem right now, apart from hotplug
> and hugepages. Some jumbo GigE drivers use higher order allocations, but
> I think there are moves to get away from that (e1000, for example).
>

GigE drivers, and any other subsystem, will not use higher order
allocations if they know the underlying allocator is not going to satisfy
the request. These patches are the starting point for properly supporting
large allocations. I will admit that this set of patches is not going to
solve the whole problem, but it is a start that can be built upon.

> > To me, the question is "do we support higher order allocations, or not?".
> > Pretending we do, making a half-assed job of it, and then it not working
> > well under pressure is not helping anyone. I'm told, for instance, that
> > AMD64 requires > 4K stacks - that's pretty fundamental, as just one
>
> And i386 had required 8K stacks for a long long time too.
>
> > instance. I'd rather make Linux pretty bulletproof - the added feature
> > stuff is just a bonus that comes for free with that.
> >
>
> But this doesn't exactly make Linux bulletproof, AFAIKS it doesn't work
> well on small memory systems, and it can still get fragmented and not work.

Small memory systems are unlikely to care about satisfying large
allocations. These patches should not adversely affect small memory
systems, although it is likely that a smaller value of MAX_ORDER would
have to be used to help with fragmentation.

You are right that we can still get fragmented. Preventing all
fragmentation would require more work, but these patches would still be
the starting point. It makes sense to start with this patchset now and
move on to the more complex stuff later. If these patches are in, we
could later do things like the following (a rough sketch of the first
idea follows the list):

o Configurable option that controls how strict fallback is. In a
  situation where we absolutely do not want to fragment, do not allow
  kernel allocations to fall back to EasyRclm areas. Instead, teach
  kswapd to reclaim pages from the Fallback and KernNoRclm areas.
o Configurable option that gets kswapd to keep the KernNoRclm, KernRclm
  and Fallback areas free of EasyRclm pages. This would prevent awkward
  kernel pages ending up in the wrong areas, at the cost of more work
  for kswapd.
o Linearly scan memory to reclaim contiguous groups of pages to satisfy
  larger allocations. The usemap gives hints on which regions are worth
  trying to reclaim. I have a set of patches that does something like
  this; it was able to satisfy large allocations reliably, but it is
  slow right now and needs a lot of work.

All these ideas need a mechanism like this set of patches to group
related pages together.
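
To make the first idea a bit more concrete, here is a rough userspace
model of how a strict-fallback switch could gate the allocator's
fallback walk. This only illustrates the idea, it is not code from the
patchset; the area names mirror this thread, but the particular fallback
order, the strict_fallback knob and the pick_area() helper are made up
for the example.

	/*
	 * Illustrative model of the strict-fallback idea; NOT the
	 * patchset's code. With strict_fallback set, a kernel
	 * allocation refuses to spill into the EasyRclm areas and the
	 * caller would reclaim from Fallback/KernNoRclm instead.
	 */
	#include <stdio.h>

	enum area { KERNNORCLM, KERNRCLM, FALLBACK, EASYRCLM, NR_AREAS };

	/* For each area type, the order in which free areas are tried. */
	static const enum area fallback[NR_AREAS][NR_AREAS] = {
		[KERNNORCLM] = { KERNNORCLM, FALLBACK, KERNRCLM, EASYRCLM },
		[KERNRCLM]   = { KERNRCLM, FALLBACK, KERNNORCLM, EASYRCLM },
		[FALLBACK]   = { FALLBACK, KERNNORCLM, KERNRCLM, EASYRCLM },
		[EASYRCLM]   = { EASYRCLM, FALLBACK, KERNRCLM, KERNNORCLM },
	};

	static int strict_fallback = 1;	/* the proposed configurable option */

	/* Pick an area to allocate from, or -1 to go and reclaim instead. */
	static int pick_area(enum area type, const int free_pages[NR_AREAS])
	{
		int i;

		for (i = 0; i < NR_AREAS; i++) {
			enum area t = fallback[type][i];

			/* Strict: kernel allocs never fragment EasyRclm. */
			if (strict_fallback && type != EASYRCLM && t == EASYRCLM)
				break;
			if (free_pages[t] > 0)
				return t;
		}
		return -1;
	}

	int main(void)
	{
		/* Only the EasyRclm area has free pages left. */
		int free_pages[NR_AREAS] = { 0, 0, 0, 128 };

		printf("kernel alloc -> area %d\n",
		       pick_area(KERNNORCLM, free_pages));
		printf("user alloc   -> area %d\n",
		       pick_area(EASYRCLM, free_pages));
		return 0;
	}

With the knob clear, the loop degenerates to the normal fallback walk;
with it set, the kernel allocation above returns -1 and the caller would
kick reclaim rather than fragment the easily-reclaimed area.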
This set of patches still helps fragmentation now, although not in a
100% reliable fashion. My desktop, which is running a kernel with these
patches applied, has been up for 33 hours and managed to allocate 80
order-10 blocks (about 320MiB with 4KiB pages) from ZONE_NORMAL, which
is about 42% of the zone, while xmms, X, konqueror and a pile of
terminals were running. That is pretty decent, even if it's not perfect.

With this approach, an easyrclm region can be reclaimed to help satisfy
a large allocation. This would be harder to do with a zone-based
approach.

Obviously, stealing easyrclm pages to satisfy a high order allocation
could end up fragmenting the system given enough time. This worst-case
scenario would occur if high order allocations were in heavy demand,
they were kernel allocations *and* they were long lived. If we find in
the future that this worst case occurs frequently, the easiest solution
would be to use __GFP_KERNRCLM for short-lived high order allocations
rather than its current usage for caches like the icache. This would
set up areas that tend to have high order free blocks in them.

> IMO in order to make Linux bulletproof, just have fallbacks for anything
> greater than about order 2 allocations.
>

What sort of fallbacks? Private pools of pages of the larger order for
subsystems that need large pages are hardly desirable.

> From what I have seen, by far our biggest problems in the mm are due to
> page reclaim, and these patches will make our reclaim behaviour more
> complex I think.
>

This patchset does not touch reclaim at all. The lists this patch really
affects are the zone freelists, not the LRU lists that page reclaim
deals with. It is only later, when we want to try to guarantee
large-order allocations, that we will have to change page reclaim.

-- 
Mel Gorman
Part-time Phd Student                          Java Applications Developer
University of Limerick                         IBM Dublin Software Lab
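
P.S. For anyone who has not dug through the allocator, the distinction
drawn in that last paragraph looks roughly like this. The fields
approximate the 2.6-era struct zone, trimmed down to the two kinds of
lists being discussed; a simplified picture, not the patched structures
themselves.

	struct list_head { struct list_head *next, *prev; };

	/* Buddy allocator state, one instance per allocation order. */
	struct free_area {
		struct list_head free_list;	/* free blocks: the lists this
						 * patchset splits up by
						 * allocation type */
		unsigned long nr_free;
	};

	struct zone {
		/* Zone freelists: what this patchset changes. */
		struct free_area free_area[11];	/* MAX_ORDER is 11 in 2.6 */

		/* LRU lists walked by page reclaim: left alone. */
		struct list_head active_list;
		struct list_head inactive_list;
	};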