linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Paul Jackson <pj@sgi.com>
To: Andrew Morton <akpm@osdl.org>
Cc: clameter@sgi.com, linux-mm@kvack.org
Subject: Re: [PATCH] GFP_THISNODE for the slab allocator
Date: Thu, 14 Sep 2006 23:49:26 -0700	[thread overview]
Message-ID: <20060914234926.9b58fd77.pj@sgi.com> (raw)
In-Reply-To: <20060914220011.2be9100a.akpm@osdl.org>

Andrew wrote:
> hm, there's cpuset_zone_allowed() again.
> 
> I have a feeling that we need to nuke that thing: take a 128-node machine,
> create a cpuset which has 64 memnodes, consume all the memory in 60 of
> them, do some heavy page allocation, then stick a thermometer into
> get_page_from_freelist()?

Hmmm ... are you worried that if get_page_from_freelist() has to scan
many nodes before it finds memory, that it will end up spending more
CPU cycles than we'd like calling cpuset_zone_allowed()?

The essential thing that cpuset_zone_allowed() does, in the most common
case, is to determine if a zone is on one of the nodes the task is allowed
to use.

The get_page_from_freelist() and cpuset_zone_allowed() code is optimized
for the case that memory is usually found in the first few zones in the
zonelist.

Here's the relevant portion of the get_page_from_freelist() code, as it
stands now:


============================================================
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
                struct zonelist *zonelist, int alloc_flags)
{
        struct zone **z = zonelist->zones;
	...
        do {
                if ((alloc_flags & ALLOC_CPUSET) &&
                                !cpuset_zone_allowed(*z, gfp_mask))
                        continue;
                ... if zone z has free pages, use them ...
        } while (*(++z) != NULL);
============================================================


For the purposes of discussion here, let me open code the hot code
path down into cpuset_zone_allowed(), so we can see more what's
happening here.  Here's the open coded rewrite:


============================================================
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
                struct zonelist *zonelist, int alloc_flags)
{
        struct zone **z = zonelist->zones;
	...
	int do_cpuset_check = !in_interrupt() && alloc_flags & ALLOC_CPUSET;

        do {
		int node = z->zone_pgdat->node_id
                if (do_cpuset_check &&
				!node_isset(node, current->mems_allowed) &&
				!cpuset_zone_allowed_slow_path_check())
                        continue;
                ... if zone z has free pages, use them ...
        } while (*(++z) != NULL);
============================================================


With this open coding, we can see what cpuset_zone_allowed() is doing
here.  The key thing it must do each loop (each zone z) is to ask if
that zone's node is set in current->mems_allowed.

My hypothetical routine 'cpuset_zone_allowed_slow_path_check()'
contains the infrequently executed code path.  Usually, either we are
not doing the cpuset check (because we are in interrupt), or we are
checking and the check passes because the 'node' is allowed in
current->mems_allowed.

This code is optimized for the case that we find memory in a node
fairly near the front of the zonelist.  If we have to go scavanging
down a long list of zones before we find a node with free memory, then
yes, we are sucking wind calling cpuset_zone_allowed(), or my
hypothetical cpuset_zone_allowed_slow_path_check(), many times.

I guess that was your concern.

I don't think we should be tuning especially hard for that case.

On a big honking NUMA box, if we have to go scavanging for memory
dozens or hundreds of nodes removed from the scene of the memory fault,
then **even if we found that precious free page of memory instantly**
(in zero cost CPU cycles in the above code) we're -still- screwed.

Well, the user of that machine is still screwed.  They have overloaded
its memory, forcing poor NUMA placement. It's obviously not as bad as
swap hell, but it's not good either.  There is nothing that the above
code can do to make the "Non-Uniform" part of "NUMA" magically
disappear.  Recall that these zonelists are sorted by distance from
the starting node; so the further down the list we go, the slower the
memory we get, relative to the tasks current CPU.

We shouldn't be heavily tuning for this case, and I am not aware of any
real world situations where real users would have reasonably determined
otherwise, had they had full realization of what was going on.

By 'not heavily tuning', I mean we should be more interested in minimizing
kernel text size and cache footprint here than in optimizing CPU cycles
for the case of having to frequently scan a long way down a long zonelist.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2006-09-15  6:49 UTC|newest]

Thread overview: 82+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-09-13 23:50 Christoph Lameter
2006-09-15  5:00 ` Andrew Morton
2006-09-15  6:49   ` Paul Jackson [this message]
2006-09-15  7:23     ` Andrew Morton
2006-09-15  7:44       ` Paul Jackson
2006-09-15  8:06         ` Andrew Morton
2006-09-15 15:53           ` David Rientjes
2006-09-15 23:03           ` David Rientjes
2006-09-16  0:04             ` Paul Jackson
2006-09-16  1:36               ` Andrew Morton
2006-09-16  2:23                 ` Christoph Lameter
2006-09-16  4:34                   ` Andrew Morton
2006-09-16  3:28                 ` [PATCH] Add node to zone for the NUMA case Christoph Lameter
2006-09-16  3:40                   ` Paul Jackson
2006-09-16  3:45                 ` [PATCH] GFP_THISNODE for the slab allocator Paul Jackson
2006-09-16  2:47             ` Christoph Lameter
2006-09-17  3:45             ` David Rientjes
2006-09-17 11:17               ` Paul Jackson
2006-09-17 12:41                 ` Christoph Lameter
2006-09-17 13:03                   ` Paul Jackson
2006-09-17 20:36                     ` David Rientjes
2006-09-17 21:20                       ` Paul Jackson
2006-09-17 22:27                       ` Paul Jackson
2006-09-17 23:49                         ` David Rientjes
2006-09-18  2:20                           ` Paul Jackson
2006-09-18 16:34                             ` Paul Jackson
2006-09-18 17:49                               ` David Rientjes
2006-09-18 20:46                                 ` Paul Jackson
2006-09-19 20:52                               ` David Rientjes
2006-09-19 21:26                                 ` Christoph Lameter
2006-09-19 21:50                                   ` David Rientjes
2006-09-21 22:11                                 ` David Rientjes
2006-09-22 10:10                                   ` Nick Piggin
2006-09-22 16:26                                   ` Paul Jackson
2006-09-22 16:36                                     ` Christoph Lameter
2006-09-15  8:28       ` Andrew Morton
2006-09-16  3:38         ` Paul Jackson
2006-09-16  4:42           ` Andi Kleen
2006-09-16 11:38             ` Paul Jackson
2006-09-16  4:48           ` Andrew Morton
2006-09-16 11:30             ` Paul Jackson
2006-09-16 15:18               ` Andrew Morton
2006-09-17  9:28                 ` Paul Jackson
2006-09-17  9:51                   ` Nick Piggin
2006-09-17 11:15                     ` Paul Jackson
2006-09-17 12:44                       ` Nick Piggin
2006-09-17 13:19                         ` Paul Jackson
2006-09-17 13:52                           ` Nick Piggin
2006-09-17 21:19                             ` Paul Jackson
2006-09-18 12:44                             ` [PATCH] mm: exempt pcp alloc from watermarks Peter Zijlstra
2006-09-18 20:20                               ` Christoph Lameter
2006-09-18 20:43                                 ` Peter Zijlstra
2006-09-19 14:35                               ` Nick Piggin
2006-09-19 14:44                                 ` Christoph Lameter
2006-09-19 15:02                                   ` Nick Piggin
2006-09-19 14:51                                 ` Peter Zijlstra
2006-09-19 15:10                                   ` Nick Piggin
2006-09-19 15:05                                     ` Peter Zijlstra
2006-09-19 15:39                                       ` Christoph Lameter
2006-09-17 16:29                   ` [PATCH] GFP_THISNODE for the slab allocator Andrew Morton
2006-09-18  2:11                     ` Paul Jackson
2006-09-18  5:09                       ` Andrew Morton
2006-09-18  7:49                         ` Paul Jackson
2006-09-16 11:48       ` Paul Jackson
2006-09-16 15:38         ` Andrew Morton
2006-09-16 21:51           ` Paul Jackson
2006-09-16 23:10             ` Andrew Morton
2006-09-17  4:37               ` Christoph Lameter
2006-09-17  4:55                 ` Andrew Morton
2006-09-17 12:09                   ` Paul Jackson
2006-09-17 12:36                   ` Christoph Lameter
2006-09-17 13:06                     ` Paul Jackson
2006-09-19 19:17                 ` David Rientjes
2006-09-19 19:19                   ` David Rientjes
2006-09-19 19:31                   ` Christoph Lameter
2006-09-19 21:12                     ` David Rientjes
2006-09-19 21:28                       ` Christoph Lameter
2006-09-19 21:53                         ` Paul Jackson
2006-09-15 17:08   ` Christoph Lameter
2006-09-15 17:37   ` [PATCH] Add NUMA_BUILD definition in kernel.h to avoid #ifdef CONFIG_NUMA Christoph Lameter
2006-09-15 17:38   ` [PATCH] Disable GFP_THISNODE in the non-NUMA case Christoph Lameter
2006-09-15 17:42   ` [PATCH] GFP_THISNODE for the slab allocator V2 Christoph Lameter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20060914234926.9b58fd77.pj@sgi.com \
    --to=pj@sgi.com \
    --cc=akpm@osdl.org \
    --cc=clameter@sgi.com \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox