From: Paul Jackson <pj@sgi.com>
To: Andrew Morton <akpm@osdl.org>
Cc: clameter@sgi.com, linux-mm@kvack.org, rientjes@google.com
Subject: Re: [PATCH] GFP_THISNODE for the slab allocator
Date: Sat, 16 Sep 2006 14:51:17 -0700 [thread overview]
Message-ID: <20060916145117.9b44786d.pj@sgi.com> (raw)
In-Reply-To: <20060916083825.ba88eee8.akpm@osdl.org>
Andrew wrote:
> Pretty much all loads? If you haven't consumed most of the "container"'s
> memory then you have overprovisioned its size.
Not so on real NUMA boxes. If you configure your system so that
you have to go a long way off-node for much of your memory,
then your performance is screwed.
No one in their right mind would run a memory hog that eats 40 nodes
of memory and a kernel build both in the same 60 node, small CPU
count cpuset on a real NUMA box.
The primary motivation for cpusets is to improve memory locality on
NUMA boxes. You're using fake numa and cpusets to simulate destroying
memory locality.
On a real 64 node NUMA box, there would be 64 differently sorted
zonelists, each one centered on a different node. The kernel build
would be running on different CPUs, associated with different nodes
than the memory hog, and it would be using zonelists that had the
unloaded (still has free memory) nodes at the front of the list.
Aha - maybe this is the problem - the fake numa stuff is missing the
properly sorted zone lists.
We normally pick which zone list to use based on which CPU the
thread is running on, but that assumes there are at least as many
CPUs as Nodes, so that we have a many-to-one map from CPUs to Nodes.
In this fake numa setup, we've got way more Nodes than CPUs.
The current macro wrapper around __alloc_pages() in gfp.h relies
on per-node data to select a zonelist, and the node is selected based
on the task's current CPU. You can see this in the following code
from mmzone.h and gfp.h:

    #define numa_node_id()  (cpu_to_node(raw_smp_processor_id()))

    nid = numa_node_id();
    return __alloc_pages(gfp_mask, order,
            NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
For the fake numa mechanism to work well, it would need to assign a
different zonelist to the kernel build tasks in this test case than
it assigned to the memory hogs, each sorted differently, with the
unloaded nodes at the front. The definition of numa_node_id() would
have to depend on more than just the processor id for this to work.
Perhaps the memory hog and the kernel build should be in separate
cpusets, and the zonelists they were assigned should depend on their
cpuset (not just their current CPU, as it is now), so that each got
a zonelist with a different sort. Then the kernel build wouldn't
have to waddle past all the used-up memory hog nodes to find some
free memory.
...
Well, I intentionally overstated the case a bit. I doubt that we
should put a big effort -at this time- into elaborating the zonelists
and numa_node_id() mechanisms for x86_64 fake numa configurations.
Rather, we should recognize that it is only in such configurations that
the existing code in get_page_from_freelist() and cpuset_zone_allowed()
has the suboptimal performance observed here, due to the different
zonelist configuration.
In an earlier message, Andrew wrote:
> Guys, it's still 50 cachelines per page. This one needs more than tweaking
> - algorithmic changes are needed.
I disagree (respectfully disagree -- Andrew has *way* more experience
hacking memory code than I ;).
I still don't know if this x86_64 fake numa mechanism has any real
life outside of entertaining kernel memory hackers.
We should first see how far the localized tweaks get us.
I suspect the tweaks will help quite a bit and provide an adequate
basis for effective machine partitioning, as proposed by the various
bright sparks. If this partitioning proves useful, then I agree
that we should elaborate the algorithms and data structures to also
handle this usage well.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401