From: Andrew Morton <akpm@osdl.org>
To: Paul Jackson <pj@sgi.com>
Cc: clameter@sgi.com, linux-mm@kvack.org, David Rientjes <rientjes@google.com>
Subject: Re: [PATCH] GFP_THISNODE for the slab allocator
Date: Fri, 15 Sep 2006 00:23:25 -0700
Message-ID: <20060915002325.bffe27d1.akpm@osdl.org>
In-Reply-To: <20060914234926.9b58fd77.pj@sgi.com>
On Thu, 14 Sep 2006 23:49:26 -0700
Paul Jackson <pj@sgi.com> wrote:
> Andrew wrote:
> > hm, there's cpuset_zone_allowed() again.
> >
> > I have a feeling that we need to nuke that thing: take a 128-node machine,
> > create a cpuset which has 64 memnodes, consume all the memory in 60 of
> > them, do some heavy page allocation, then stick a thermometer into
> > get_page_from_freelist()?
>
> Hmmm ... are you worried that if get_page_from_freelist() has to scan
> many nodes before it finds memory, that it will end up spending more
> CPU cycles than we'd like calling cpuset_zone_allowed()?
>
> The essential thing that cpuset_zone_allowed() does, in the most common
> case, is to determine if a zone is on one of the nodes the task is allowed
> to use.
>
> The get_page_from_freelist() and cpuset_zone_allowed() code is optimized
> for the case that memory is usually found in the first few zones in the
> zonelist.
>
> Here's the relevant portion of the get_page_from_freelist() code, as it
> stands now:
>
>
> ============================================================
> static struct page *
> get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
> 		struct zonelist *zonelist, int alloc_flags)
> {
> 	struct zone **z = zonelist->zones;
> 	...
> 	do {
> 		if ((alloc_flags & ALLOC_CPUSET) &&
> 				!cpuset_zone_allowed(*z, gfp_mask))
> 			continue;
> 		... if zone z has free pages, use them ...
> 	} while (*(++z) != NULL);
> ============================================================
>
>
> For the purposes of discussion here, let me open code the hot code
> path down into cpuset_zone_allowed(), so we can see more what's
> happening here. Here's the open coded rewrite:
>
>
> ============================================================
> static struct page *
> get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
> 		struct zonelist *zonelist, int alloc_flags)
> {
> 	struct zone **z = zonelist->zones;
> 	...
> 	int do_cpuset_check = !in_interrupt() && (alloc_flags & ALLOC_CPUSET);
>
> 	do {
> 		int node = (*z)->zone_pgdat->node_id;
>
> 		if (do_cpuset_check &&
> 				!node_isset(node, current->mems_allowed) &&
> 				!cpuset_zone_allowed_slow_path_check())
> 			continue;
> 		... if zone z has free pages, use them ...
> 	} while (*(++z) != NULL);
> ============================================================
>
>
> With this open coding, we can see what cpuset_zone_allowed() is doing
> here. The key thing it must do each loop (each zone z) is to ask if
> that zone's node is set in current->mems_allowed.
>
> My hypothetical routine 'cpuset_zone_allowed_slow_path_check()'
> contains the infrequently executed code path. Usually, either we are
> not doing the cpuset check (because we are in interrupt), or we are
> checking and the check passes because the 'node' is allowed in
> current->mems_allowed.
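>
> To make that concrete, here is a minimal sketch of what that slow
> path check might look like.  It is illustrative only: the hardwall
> test and the nearest_exclusive_ancestor() walk approximate what the
> real cpuset code does, but don't take the details as exact (the
> in_interrupt() test is already hoisted into do_cpuset_check above):
>
> ============================================================
> static int cpuset_zone_allowed_slow_path_check(struct zone *z,
> 						gfp_t gfp_mask)
> {
> 	int node = z->zone_pgdat->node_id;
> 	const struct cpuset *cs;
> 	int allowed;
>
> 	if (gfp_mask & __GFP_HARDWALL)
> 		return 0;	/* hardwall request: no escape upward */
> 	/*
> 	 * Node is outside mems_allowed and the request is not
> 	 * hardwalled: walk up the cpuset hierarchy to the nearest
> 	 * mem_exclusive ancestor and see if the node lies within it.
> 	 */
> 	mutex_lock(&callback_mutex);
> 	cs = nearest_exclusive_ancestor(current->cpuset);
> 	allowed = node_isset(node, cs->mems_allowed);
> 	mutex_unlock(&callback_mutex);
> 	return allowed;
> }
> ============================================================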
>
> This code is optimized for the case that we find memory in a node
> fairly near the front of the zonelist. If we have to go scavenging
> down a long list of zones before we find a node with free memory, then
> yes, we are sucking wind calling cpuset_zone_allowed(), or my
> hypothetical cpuset_zone_allowed_slow_path_check(), many times.
>
> I guess that was your concern.
You got it.
> I don't think we should be tuning especially hard for that case.
Well, some bright spark went and had the idea of using cpusets and fake NUMA
nodes as a means of memory partitioning, didn't he?
David (cc'ed here) did some testing for me. A fake-64-node machine with
60/64ths of its memory in a 60-node "container". We filled up 40-odd of
those nodes with malloc+memset+sleep and then ran a kernel build in the
remainder.
System time went through the roof. We still need to get the profile, but
I'll eat my hat if the cause isn't get_page_from_freelist() waddling across
60-odd zones for each page allocation.
> On a big honking NUMA box, if we have to go scavenging for memory
> dozens or hundreds of nodes removed from the scene of the memory fault,
> then **even if we found that precious free page of memory instantly**
> (in zero cost CPU cycles in the above code) we're -still- screwed.
>
> Well, the user of that machine is still screwed. They have overloaded
> its memory, forcing poor NUMA placement. It's obviously not as bad as
> swap hell, but it's not good either. There is nothing that the above
> code can do to make the "Non-Uniform" part of "NUMA" magically
> disappear. Recall that these zonelists are sorted by distance from
> the starting node; so the further down the list we go, the slower the
> memory we get, relative to the task's current CPU.
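>
> For illustration only, on a hypothetical 4-node box, the zonelist
> that node 0's allocations start from would be ordered something
> like this:
>
> ============================================================
> /*
>  * node 0's zonelist, sorted by increasing NUMA distance:
>  *
>  *   zones[0] = node 0, ZONE_NORMAL   (local)
>  *   zones[1] = node 1, ZONE_NORMAL   (one hop)
>  *   zones[2] = node 2, ZONE_NORMAL   (two hops)
>  *   zones[3] = node 3, ZONE_NORMAL   (two hops)
>  *   zones[4] = NULL                  (terminator)
>  */
> ============================================================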
>
> We shouldn't be heavily tuning for this case, and I am not aware of any
> real-world situation where a user who fully realized what was going on
> would reasonably have concluded otherwise.
gotcha ;)
> By 'not heavily tuning', I mean we should be more interested in minimizing
> kernel text size and cache footprint here than in optimizing CPU cycles
> for the case of having to frequently scan a long way down a long zonelist.
>
There are two problems:
a) the linear search across nodes which are not in the cpuset
b) the linear search across nodes which _are_ in the cpuset, but which
are used up.
I'm thinking a) is easily solved by adding an array of the zones inside the
`struct cpuset', and changing get_page_from_freelist() to only look at those
zones.
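
Something like this, as a sketch (the `zones' field and when it gets
rebuilt are invented here for illustration, not existing code):

============================================================
struct cpuset {
	...
	/*
	 * Zones spanned by mems_allowed, in zonelist order,
	 * NULL-terminated.  Rebuilt whenever mems_allowed changes.
	 */
	struct zone **zones;
};
============================================================

Then the allocator's scan loop degenerates to walking cs->zones
directly, with no per-zone cpuset_zone_allowed() call at all.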
And b) can, I think, be solved by caching the most-recently-allocated-from
zone* inside the cpuset as well. This might alter page allocation
behaviour a bit. And we'd need to do an exhaustive search at some point in
there.
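
As a sketch of that too (cs->zones and cs->last_alloc_zone being the
invented fields from above, the rest real zone machinery):

============================================================
static struct zone *cpuset_zone_hint(struct cpuset *cs, int order)
{
	struct zone **z;
	struct zone *cached = cs->last_alloc_zone;

	/* fast path: retry wherever we last succeeded */
	if (cached && zone_watermark_ok(cached, order,
				cached->pages_low, 0, 0))
		return cached;

	/* exhaustive fallback, remembering any new winner */
	for (z = cs->zones; *z != NULL; z++)
		if (zone_watermark_ok(*z, order, (*z)->pages_low, 0, 0))
			return cs->last_alloc_zone = *z;
	return NULL;
}
============================================================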
The nasty part is locking that array of zones, and its length, and the
cached zone*. I guess it'd need to be RCUed.
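
The read side could then stay lockless in the allocator hot path,
roughly:

============================================================
	struct zone **zones;

	rcu_read_lock();
	zones = rcu_dereference(cs->zones);
	/* ... scan the array ... */
	rcu_read_unlock();
============================================================

with the update side building a fresh array, publishing it with
rcu_assign_pointer() and freeing the old one after a grace period.
Again, just a sketch of the idea.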