From: Paul Jackson <pj@sgi.com>
To: Andrew Morton <akpm@osdl.org>
Cc: clameter@sgi.com, linux-mm@kvack.org, rientjes@google.com, ak@suse.de
Subject: Re: [PATCH] GFP_THISNODE for the slab allocator
Date: Fri, 15 Sep 2006 20:38:16 -0700 [thread overview]
Message-ID: <20060915203816.fd260a0b.pj@sgi.com> (raw)
In-Reply-To: <20060915012810.81d9b0e3.akpm@osdl.org>
[Adding Andi to cc list, since I mention him below. -pj]
Andrew wrote:
> I'm thinking a) is easily solved by adding an array of the zones inside the
> `struct cpuset', and change get_page_from_freelist() to only look at those
> zones.
> ...
> err, if we cache the most-recently-allocated-from zone in the cpuset then
> we don't need the array-of-zones, do we? We'll only need to do a zone
> waddle when switching from one zone to the next, which is super-rare.
>
> That's much simpler.
> ...
> And locking becomes simpler too. It's just a check of
> cpuset_zone_allowed(current->cpuset->current_allocation_zone)
This will blow chunks performance wise, with the current cpuset locking
scheme.
Just one current_allocation_zone would not be enough. Each node that
the cpuset allowed would require its own current_allocation_zone. For
example, on a big honkin NUMA box with 2 CPUs per Node, tasks running
on CPU 32, Node 16, might be able to find free memory right on that
Node 16. But another task in the same cpuset running on CPU 112, Node
56 might have to scan past a dozen Nodes to Node 68 to find memory.
Accessing anything from a cpuset that depends on what nodes it allows
requires taking the global mutex callback_mutex (in kernel/cpuset.c).
We don't want to put a global mutex on the page alloc hot code path.
Anything we need to access frequently from a tasks cpuset has to be
cached in its task struct.
Three alternative possibilities:
1) Perhaps these most-recently-allocated-from-zone's shouldn't be
properties of the cpuset, nor even of the task, but of the zone structs.
If each zone struct on the zonelist had an additional flag bit marking
the zones that had no free memory, then we could navigate the zonelist
pretty quickly. One more bit per zone struct would be enough to track
a simple rescan mechanism, so that we could detect when a node that
had formerly run out of memory once again had free memory.
One or two bits per zone struct would be way cheaper, so far as
data space requirements.
Downside - it still hits each zone struct - suboptimal cache trashing.
One less pointer chase than z->zone_pgdat->node_id, but still not
great.
2) It may be sufficient to locally optimize get_page_from_freelist()'s
calls to cpuset_zone_allowed() - basically open code cpuset_zone_allowed,
or at least refine its invocation.
This might require a second nodemask in the task struct, for the typically
larger set of nodes that GFP_KERNEL allocations can use, more than just
the nodes that GFP_USER can use. Such a second nodemask in the task struct
would enable me to avoid taking the global callback_mutex for some GFP_KERNEL
allocations on tight memory systems.
Downside #1 - still requires z->zone_pgdat->node_id. Andrew suspects
that this is enough of a problem in itself. From the profile, which
showed cpuset_zone_allowed(), not get_page_from_freelist(), at the
top of the list, given that the node id is evaluated in the
get_page_from_freelist() routine, I was figuring that the real
problem was in the cpuset_zone_allowed() code. Perhaps some testing
of a simple hack approximation to this patch will tell us - next week.
Downside #2 - may require the above mentioned additional nodemask_t
in the task struct.
3) The custom zonelist option - which was part of my original cpuset
proposal, and which Andi K and I have gone back and forth on, with
each of us liking and disliking it, at different times. See further
my latest writeup on this option:
http://lkml.org/lkml/2005/11/5/252
Date Sat, 5 Nov 2005 20:18:41 -0800
From Paul Jackson <pj@sgi.com>
Subject Re: [PATCH]: Clean up of __alloc_pages
My current plan - see if somehow I can code up and get tested (2),
since a rough approximation to it would be trivial to code. If that
works, go with it, unless someone convinces me otherwise. If (2) can't
do the job, try (1), since that seems easier to code. If that fails,
or someone shoots that down, or Andi makes a good enough case for (3),
give (3) a go - that's the hardest path, and risks the most collateral
damage to the behaviour of the memory paging subsystem.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2006-09-16 3:38 UTC|newest]
Thread overview: 82+ messages / expand[flat|nested] mbox.gz Atom feed top
2006-09-13 23:50 Christoph Lameter
2006-09-15 5:00 ` Andrew Morton
2006-09-15 6:49 ` Paul Jackson
2006-09-15 7:23 ` Andrew Morton
2006-09-15 7:44 ` Paul Jackson
2006-09-15 8:06 ` Andrew Morton
2006-09-15 15:53 ` David Rientjes
2006-09-15 23:03 ` David Rientjes
2006-09-16 0:04 ` Paul Jackson
2006-09-16 1:36 ` Andrew Morton
2006-09-16 2:23 ` Christoph Lameter
2006-09-16 4:34 ` Andrew Morton
2006-09-16 3:28 ` [PATCH] Add node to zone for the NUMA case Christoph Lameter
2006-09-16 3:40 ` Paul Jackson
2006-09-16 3:45 ` [PATCH] GFP_THISNODE for the slab allocator Paul Jackson
2006-09-16 2:47 ` Christoph Lameter
2006-09-17 3:45 ` David Rientjes
2006-09-17 11:17 ` Paul Jackson
2006-09-17 12:41 ` Christoph Lameter
2006-09-17 13:03 ` Paul Jackson
2006-09-17 20:36 ` David Rientjes
2006-09-17 21:20 ` Paul Jackson
2006-09-17 22:27 ` Paul Jackson
2006-09-17 23:49 ` David Rientjes
2006-09-18 2:20 ` Paul Jackson
2006-09-18 16:34 ` Paul Jackson
2006-09-18 17:49 ` David Rientjes
2006-09-18 20:46 ` Paul Jackson
2006-09-19 20:52 ` David Rientjes
2006-09-19 21:26 ` Christoph Lameter
2006-09-19 21:50 ` David Rientjes
2006-09-21 22:11 ` David Rientjes
2006-09-22 10:10 ` Nick Piggin
2006-09-22 16:26 ` Paul Jackson
2006-09-22 16:36 ` Christoph Lameter
2006-09-15 8:28 ` Andrew Morton
2006-09-16 3:38 ` Paul Jackson [this message]
2006-09-16 4:42 ` Andi Kleen
2006-09-16 11:38 ` Paul Jackson
2006-09-16 4:48 ` Andrew Morton
2006-09-16 11:30 ` Paul Jackson
2006-09-16 15:18 ` Andrew Morton
2006-09-17 9:28 ` Paul Jackson
2006-09-17 9:51 ` Nick Piggin
2006-09-17 11:15 ` Paul Jackson
2006-09-17 12:44 ` Nick Piggin
2006-09-17 13:19 ` Paul Jackson
2006-09-17 13:52 ` Nick Piggin
2006-09-17 21:19 ` Paul Jackson
2006-09-18 12:44 ` [PATCH] mm: exempt pcp alloc from watermarks Peter Zijlstra
2006-09-18 20:20 ` Christoph Lameter
2006-09-18 20:43 ` Peter Zijlstra
2006-09-19 14:35 ` Nick Piggin
2006-09-19 14:44 ` Christoph Lameter
2006-09-19 15:02 ` Nick Piggin
2006-09-19 14:51 ` Peter Zijlstra
2006-09-19 15:10 ` Nick Piggin
2006-09-19 15:05 ` Peter Zijlstra
2006-09-19 15:39 ` Christoph Lameter
2006-09-17 16:29 ` [PATCH] GFP_THISNODE for the slab allocator Andrew Morton
2006-09-18 2:11 ` Paul Jackson
2006-09-18 5:09 ` Andrew Morton
2006-09-18 7:49 ` Paul Jackson
2006-09-16 11:48 ` Paul Jackson
2006-09-16 15:38 ` Andrew Morton
2006-09-16 21:51 ` Paul Jackson
2006-09-16 23:10 ` Andrew Morton
2006-09-17 4:37 ` Christoph Lameter
2006-09-17 4:55 ` Andrew Morton
2006-09-17 12:09 ` Paul Jackson
2006-09-17 12:36 ` Christoph Lameter
2006-09-17 13:06 ` Paul Jackson
2006-09-19 19:17 ` David Rientjes
2006-09-19 19:19 ` David Rientjes
2006-09-19 19:31 ` Christoph Lameter
2006-09-19 21:12 ` David Rientjes
2006-09-19 21:28 ` Christoph Lameter
2006-09-19 21:53 ` Paul Jackson
2006-09-15 17:08 ` Christoph Lameter
2006-09-15 17:37 ` [PATCH] Add NUMA_BUILD definition in kernel.h to avoid #ifdef CONFIG_NUMA Christoph Lameter
2006-09-15 17:38 ` [PATCH] Disable GFP_THISNODE in the non-NUMA case Christoph Lameter
2006-09-15 17:42 ` [PATCH] GFP_THISNODE for the slab allocator V2 Christoph Lameter
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20060915203816.fd260a0b.pj@sgi.com \
--to=pj@sgi.com \
--cc=ak@suse.de \
--cc=akpm@osdl.org \
--cc=clameter@sgi.com \
--cc=linux-mm@kvack.org \
--cc=rientjes@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox