From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Fri, 15 Sep 2006 21:48:22 -0700 From: Andrew Morton Subject: Re: [PATCH] GFP_THISNODE for the slab allocator Message-Id: <20060915214822.1c15c2cb.akpm@osdl.org> In-Reply-To: <20060915203816.fd260a0b.pj@sgi.com> References: <20060914220011.2be9100a.akpm@osdl.org> <20060914234926.9b58fd77.pj@sgi.com> <20060915002325.bffe27d1.akpm@osdl.org> <20060915012810.81d9b0e3.akpm@osdl.org> <20060915203816.fd260a0b.pj@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Return-Path: To: Paul Jackson Cc: clameter@sgi.com, linux-mm@kvack.org, rientjes@google.com, ak@suse.de List-ID: On Fri, 15 Sep 2006 20:38:16 -0700 Paul Jackson wrote: > [Adding Andi to cc list, since I mention him below. -pj] > > Andrew wrote: > > I'm thinking a) is easily solved by adding an array of the zones inside the > > `struct cpuset', and change get_page_from_freelist() to only look at those > > zones. > > ... > > err, if we cache the most-recently-allocated-from zone in the cpuset then > > we don't need the array-of-zones, do we? We'll only need to do a zone > > waddle when switching from one zone to the next, which is super-rare. > > > > That's much simpler. > > ... > > And locking becomes simpler too. It's just a check of > > cpuset_zone_allowed(current->cpuset->current_allocation_zone) > > This will blow chunks performance wise, with the current cpuset locking > scheme. > > Just one current_allocation_zone would not be enough. Each node that > the cpuset allowed would require its own current_allocation_zone. For > example, on a big honkin NUMA box with 2 CPUs per Node, tasks running > on CPU 32, Node 16, might be able to find free memory right on that > Node 16. But another task in the same cpuset running on CPU 112, Node > 56 might have to scan past a dozen Nodes to Node 68 to find memory. > > Accessing anything from a cpuset that depends on what nodes it allows > requires taking the global mutex callback_mutex (in kernel/cpuset.c). > We don't want to put a global mutex on the page alloc hot code path. > > Anything we need to access frequently from a tasks cpuset has to be > cached in its task struct. > > Three alternative possibilities: > > 1) Perhaps these most-recently-allocated-from-zone's shouldn't be > properties of the cpuset, nor even of the task, but of the zone structs. > > If each zone struct on the zonelist had an additional flag bit marking > the zones that had no free memory, then we could navigate the zonelist > pretty quickly. One more bit per zone struct would be enough to track > a simple rescan mechanism, so that we could detect when a node that > had formerly run out of memory once again had free memory. > > One or two bits per zone struct would be way cheaper, so far as > data space requirements. > > Downside - it still hits each zone struct - suboptimal cache trashing. > One less pointer chase than z->zone_pgdat->node_id, but still not > great. > > 2) It may be sufficient to locally optimize get_page_from_freelist()'s > calls to cpuset_zone_allowed() - basically open code cpuset_zone_allowed, > or at least refine its invocation. > > This might require a second nodemask in the task struct, for the typically > larger set of nodes that GFP_KERNEL allocations can use, more than just > the nodes that GFP_USER can use. Such a second nodemask in the task struct > would enable me to avoid taking the global callback_mutex for some GFP_KERNEL > allocations on tight memory systems. > > Downside #1 - still requires z->zone_pgdat->node_id. Andrew suspects > that this is enough of a problem in itself. From the profile, which > showed cpuset_zone_allowed(), not get_page_from_freelist(), at the > top of the list, given that the node id is evaluated in the > get_page_from_freelist() routine, I was figuring that the real > problem was in the cpuset_zone_allowed() code. Perhaps some testing > of a simple hack approximation to this patch will tell us - next week. > > Downside #2 - may require the above mentioned additional nodemask_t > in the task struct. > > 3) The custom zonelist option - which was part of my original cpuset > proposal, and which Andi K and I have gone back and forth on, with > each of us liking and disliking it, at different times. See further > my latest writeup on this option: > > http://lkml.org/lkml/2005/11/5/252 > Date Sat, 5 Nov 2005 20:18:41 -0800 > From Paul Jackson > Subject Re: [PATCH]: Clean up of __alloc_pages > > My current plan - see if somehow I can code up and get tested (2), > since a rough approximation to it would be trivial to code. If that > works, go with it, unless someone convinces me otherwise. If (2) can't > do the job, try (1), since that seems easier to code. If that fails, > or someone shoots that down, or Andi makes a good enough case for (3), > give (3) a go - that's the hardest path, and risks the most collateral > damage to the behaviour of the memory paging subsystem. > hm. Why is it not sufficient to cache the most-recent zone* in task_struct? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org