linux-mm.kvack.org archive mirror
From: Andrew Morton <akpm@osdl.org>
To: Paul Jackson <pj@sgi.com>
Cc: clameter@sgi.com, linux-mm@kvack.org,
	David Rientjes <rientjes@google.com>
Subject: Re: [PATCH] GFP_THISNODE for the slab allocator
Date: Fri, 15 Sep 2006 00:23:25 -0700	[thread overview]
Message-ID: <20060915002325.bffe27d1.akpm@osdl.org> (raw)
In-Reply-To: <20060914234926.9b58fd77.pj@sgi.com>

On Thu, 14 Sep 2006 23:49:26 -0700
Paul Jackson <pj@sgi.com> wrote:

> Andrew wrote:
> > hm, there's cpuset_zone_allowed() again.
> > 
> > I have a feeling that we need to nuke that thing: take a 128-node machine,
> > create a cpuset which has 64 memnodes, consume all the memory in 60 of
> > them, do some heavy page allocation, then stick a thermometer into
> > get_page_from_freelist()?
> 
> Hmmm ... are you worried that if get_page_from_freelist() has to scan
> many nodes before it finds memory, that it will end up spending more
> CPU cycles than we'd like calling cpuset_zone_allowed()?
> 
> The essential thing that cpuset_zone_allowed() does, in the most common
> case, is to determine if a zone is on one of the nodes the task is allowed
> to use.
> 
> The get_page_from_freelist() and cpuset_zone_allowed() code is optimized
> for the case that memory is usually found in the first few zones in the
> zonelist.
> 
> Here's the relevant portion of the get_page_from_freelist() code, as it
> stands now:
> 
> 
> ============================================================
> static struct page *
> get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
>                 struct zonelist *zonelist, int alloc_flags)
> {
>         struct zone **z = zonelist->zones;
> 	...
>         do {
>                 if ((alloc_flags & ALLOC_CPUSET) &&
>                                 !cpuset_zone_allowed(*z, gfp_mask))
>                         continue;
>                 ... if zone z has free pages, use them ...
>         } while (*(++z) != NULL);
> ============================================================
> 
> 
> For the purposes of discussion here, let me open code the hot code
> path down into cpuset_zone_allowed(), so we can see more what's
> happening here.  Here's the open coded rewrite:
> 
> 
> ============================================================
> static struct page *
> get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
>                 struct zonelist *zonelist, int alloc_flags)
> {
>         struct zone **z = zonelist->zones;
> 	...
> 	int do_cpuset_check = !in_interrupt() && (alloc_flags & ALLOC_CPUSET);
> 
>         do {
> 		int node = (*z)->zone_pgdat->node_id;
>                 if (do_cpuset_check &&
> 				!node_isset(node, current->mems_allowed) &&
> 				!cpuset_zone_allowed_slow_path_check())
>                         continue;
>                 ... if zone z has free pages, use them ...
>         } while (*(++z) != NULL);
> ============================================================
> 
> 
> With this open coding, we can see what cpuset_zone_allowed() is doing
> here.  The key thing it must do each loop (each zone z) is to ask if
> that zone's node is set in current->mems_allowed.
> 
> My hypothetical routine 'cpuset_zone_allowed_slow_path_check()'
> contains the infrequently executed code path.  Usually, either we are
> not doing the cpuset check (because we are in interrupt), or we are
> checking and the check passes because the 'node' is allowed in
> current->mems_allowed.
> 
> This code is optimized for the case that we find memory in a node
> fairly near the front of the zonelist.  If we have to go scavenging
> down a long list of zones before we find a node with free memory, then
> yes, we are sucking wind calling cpuset_zone_allowed(), or my
> hypothetical cpuset_zone_allowed_slow_path_check(), many times.
> 
> I guess that was your concern.

You got it.

> I don't think we should be tuning especially hard for that case.

Well, some bright spark went and had the idea of using cpusets and fake NUMA
nodes as a means of memory partitioning, didn't he?

David (cc'ed here) did some testing for me.  A fake-64-node machine with
60/64ths of its memory in a 60-node "container".  We filled up 40-odd of
those nodes with malloc+memset+sleep and then ran a kernel build in the
remainder.

System time went through the roof.  We still need to get the profile, but
I'll eat my hat if the cause isn't get_page_from_freelist() waddling across
60-odd zones for each page allocation.

> On a big honking NUMA box, if we have to go scavenging for memory
> dozens or hundreds of nodes removed from the scene of the memory fault,
> then **even if we found that precious free page of memory instantly**
> (in zero cost CPU cycles in the above code) we're -still- screwed.
> 
> Well, the user of that machine is still screwed.  They have overloaded
> its memory, forcing poor NUMA placement. It's obviously not as bad as
> swap hell, but it's not good either.  There is nothing that the above
> code can do to make the "Non-Uniform" part of "NUMA" magically
> disappear.  Recall that these zonelists are sorted by distance from
> the starting node; so the further down the list we go, the slower the
> memory we get, relative to the task's current CPU.
> 
> We shouldn't be heavily tuning for this case, and I am not aware of any
> real world situations where real users would have reasonably determined
> otherwise, had they had full realization of what was going on.

gotcha ;)

> By 'not heavily tuning', I mean we should be more interested in minimizing
> kernel text size and cache footprint here than in optimizing CPU cycles
> for the case of having to frequently scan a long way down a long zonelist.
> 

There are two problems:

a) the linear search across nodes which are not in the cpuset

b) the linear search across nodes which _are_ in the cpuset, but which
   are used up.

I'm thinking a) is easily solved by adding an array of the zones inside the
`struct cpuset', and change get_page_from_freelist() to only look at those
zones.

And b) can, I think, be solved by caching the most-recently-allocated-from
zone* inside the cpuset as well.  This might alter page allocation
behaviour a bit.  And we'd need to do an exhaustive search at some point in
there.

The nasty part is locking that array of zones, and its length, and the
cached zone*.  I guess it'd need to be RCUed.
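The two proposals above can be sketched in user-space C.  This is only a
model of the idea, not kernel code: the names (struct cpuset_zones,
czs_alloc and friends) are hypothetical, and the RCU locking question is
ignored entirely.  It shows (a) keeping a compact array of only the
allowed zones so disallowed nodes are never scanned, and (b) caching the
index of the most-recently-allocated-from zone so the next allocation
starts there rather than at the front.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a memory zone. */
struct zone {
	int node_id;
	long free_pages;
};

/*
 * Proposal (a): the cpuset carries an array holding ONLY the zones it
 * is allowed to use, so the allocator never touches disallowed nodes.
 * Proposal (b): last_alloc caches where the previous allocation hit.
 */
struct cpuset_zones {
	struct zone **zones;
	int nr_zones;
	int last_alloc;
};

/*
 * Take one page.  Start the scan at the cached zone and wrap around,
 * which gives the exhaustive fallback search over all allowed zones.
 * Starting at the cache rather than the zonelist head is exactly the
 * "might alter page allocation behaviour a bit" caveat above.
 */
static struct zone *czs_alloc(struct cpuset_zones *cs)
{
	for (int i = 0; i < cs->nr_zones; i++) {
		int idx = (cs->last_alloc + i) % cs->nr_zones;
		struct zone *z = cs->zones[idx];

		if (z->free_pages > 0) {
			z->free_pages--;
			cs->last_alloc = idx;	/* remember the hit */
			return z;
		}
	}
	return NULL;	/* every allowed zone is exhausted */
}
```

With two allowed zones where the first is empty, the scan skips to the
second, caches it, and subsequent allocations begin there directly.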

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


Thread overview: 82+ messages
2006-09-13 23:50 Christoph Lameter
2006-09-15  5:00 ` Andrew Morton
2006-09-15  6:49   ` Paul Jackson
2006-09-15  7:23     ` Andrew Morton [this message]
2006-09-15  7:44       ` Paul Jackson
2006-09-15  8:06         ` Andrew Morton
2006-09-15 15:53           ` David Rientjes
2006-09-15 23:03           ` David Rientjes
2006-09-16  0:04             ` Paul Jackson
2006-09-16  1:36               ` Andrew Morton
2006-09-16  2:23                 ` Christoph Lameter
2006-09-16  4:34                   ` Andrew Morton
2006-09-16  3:28                 ` [PATCH] Add node to zone for the NUMA case Christoph Lameter
2006-09-16  3:40                   ` Paul Jackson
2006-09-16  3:45                 ` [PATCH] GFP_THISNODE for the slab allocator Paul Jackson
2006-09-16  2:47             ` Christoph Lameter
2006-09-17  3:45             ` David Rientjes
2006-09-17 11:17               ` Paul Jackson
2006-09-17 12:41                 ` Christoph Lameter
2006-09-17 13:03                   ` Paul Jackson
2006-09-17 20:36                     ` David Rientjes
2006-09-17 21:20                       ` Paul Jackson
2006-09-17 22:27                       ` Paul Jackson
2006-09-17 23:49                         ` David Rientjes
2006-09-18  2:20                           ` Paul Jackson
2006-09-18 16:34                             ` Paul Jackson
2006-09-18 17:49                               ` David Rientjes
2006-09-18 20:46                                 ` Paul Jackson
2006-09-19 20:52                               ` David Rientjes
2006-09-19 21:26                                 ` Christoph Lameter
2006-09-19 21:50                                   ` David Rientjes
2006-09-21 22:11                                 ` David Rientjes
2006-09-22 10:10                                   ` Nick Piggin
2006-09-22 16:26                                   ` Paul Jackson
2006-09-22 16:36                                     ` Christoph Lameter
2006-09-15  8:28       ` Andrew Morton
2006-09-16  3:38         ` Paul Jackson
2006-09-16  4:42           ` Andi Kleen
2006-09-16 11:38             ` Paul Jackson
2006-09-16  4:48           ` Andrew Morton
2006-09-16 11:30             ` Paul Jackson
2006-09-16 15:18               ` Andrew Morton
2006-09-17  9:28                 ` Paul Jackson
2006-09-17  9:51                   ` Nick Piggin
2006-09-17 11:15                     ` Paul Jackson
2006-09-17 12:44                       ` Nick Piggin
2006-09-17 13:19                         ` Paul Jackson
2006-09-17 13:52                           ` Nick Piggin
2006-09-17 21:19                             ` Paul Jackson
2006-09-18 12:44                             ` [PATCH] mm: exempt pcp alloc from watermarks Peter Zijlstra
2006-09-18 20:20                               ` Christoph Lameter
2006-09-18 20:43                                 ` Peter Zijlstra
2006-09-19 14:35                               ` Nick Piggin
2006-09-19 14:44                                 ` Christoph Lameter
2006-09-19 15:02                                   ` Nick Piggin
2006-09-19 14:51                                 ` Peter Zijlstra
2006-09-19 15:10                                   ` Nick Piggin
2006-09-19 15:05                                     ` Peter Zijlstra
2006-09-19 15:39                                       ` Christoph Lameter
2006-09-17 16:29                   ` [PATCH] GFP_THISNODE for the slab allocator Andrew Morton
2006-09-18  2:11                     ` Paul Jackson
2006-09-18  5:09                       ` Andrew Morton
2006-09-18  7:49                         ` Paul Jackson
2006-09-16 11:48       ` Paul Jackson
2006-09-16 15:38         ` Andrew Morton
2006-09-16 21:51           ` Paul Jackson
2006-09-16 23:10             ` Andrew Morton
2006-09-17  4:37               ` Christoph Lameter
2006-09-17  4:55                 ` Andrew Morton
2006-09-17 12:09                   ` Paul Jackson
2006-09-17 12:36                   ` Christoph Lameter
2006-09-17 13:06                     ` Paul Jackson
2006-09-19 19:17                 ` David Rientjes
2006-09-19 19:19                   ` David Rientjes
2006-09-19 19:31                   ` Christoph Lameter
2006-09-19 21:12                     ` David Rientjes
2006-09-19 21:28                       ` Christoph Lameter
2006-09-19 21:53                         ` Paul Jackson
2006-09-15 17:08   ` Christoph Lameter
2006-09-15 17:37   ` [PATCH] Add NUMA_BUILD definition in kernel.h to avoid #ifdef CONFIG_NUMA Christoph Lameter
2006-09-15 17:38   ` [PATCH] Disable GFP_THISNODE in the non-NUMA case Christoph Lameter
2006-09-15 17:42   ` [PATCH] GFP_THISNODE for the slab allocator V2 Christoph Lameter
