linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Paul Jackson <pj@sgi.com>
To: Andrew Morton <akpm@osdl.org>
Cc: clameter@sgi.com, linux-mm@kvack.org, rientjes@google.com, ak@suse.de
Subject: Re: [PATCH] GFP_THISNODE for the slab allocator
Date: Fri, 15 Sep 2006 20:38:16 -0700	[thread overview]
Message-ID: <20060915203816.fd260a0b.pj@sgi.com> (raw)
In-Reply-To: <20060915012810.81d9b0e3.akpm@osdl.org>

[Adding Andi to cc list, since I mention him below. -pj]

Andrew wrote:
> I'm thinking a) is easily solved by adding an array of the zones inside the
> `struct cpuset', and change get_page_from_freelist() to only look at those
> zones.
> ...
> err, if we cache the most-recently-allocated-from zone in the cpuset then
> we don't need the array-of-zones, do we?  We'll only need to do a zone
> waddle when switching from one zone to the next, which is super-rare.
> 
> That's much simpler.
> ...
> And locking becomes simpler too.  It's just a check of
> cpuset_zone_allowed(current->cpuset->current_allocation_zone)

This will blow chunks performance wise, with the current cpuset locking
scheme.

Just one current_allocation_zone would not be enough.  Each node that
the cpuset allowed would require its own current_allocation_zone.  For
example, on a big honkin NUMA box with 2 CPUs per Node, tasks running
on CPU 32, Node 16, might be able to find free memory right on that
Node 16.  But another task in the same cpuset running on CPU 112, Node
56 might have to scan past a dozen Nodes to Node 68 to find memory.

Accessing anything from a cpuset that depends on what nodes it allows
requires taking the global mutex callback_mutex (in kernel/cpuset.c).
We don't want to put a global mutex on the page alloc hot code path.

Anything we need to access frequently from a tasks cpuset has to be
cached in its task struct.

Three alternative possibilities:

1)  Perhaps these most-recently-allocated-from-zone's shouldn't be
    properties of the cpuset, nor even of the task, but of the zone structs.

    If each zone struct on the zonelist had an additional flag bit marking
    the zones that had no free memory, then we could navigate the zonelist
    pretty quickly.  One more bit per zone struct would be enough to track
    a simple rescan mechanism, so that we could detect when a node that
    had formerly run out of memory once again had free memory.

    One or two bits per zone struct would be way cheaper, so far as
    data space requirements.

    Downside - it still hits each zone struct - suboptimal cache trashing.
    One less pointer chase than z->zone_pgdat->node_id, but still not
    great.

2)  It may be sufficient to locally optimize get_page_from_freelist()'s
    calls to cpuset_zone_allowed() - basically open code cpuset_zone_allowed,
    or at least refine its invocation.

    This might require a second nodemask in the task struct, for the typically
    larger set of nodes that GFP_KERNEL allocations can use, more than just
    the nodes that GFP_USER can use.  Such a second nodemask in the task struct
    would enable me to avoid taking the global callback_mutex for some GFP_KERNEL 
    allocations on tight memory systems.

    Downside #1 - still requires z->zone_pgdat->node_id.  Andrew suspects
    that this is enough of a problem in itself.  From the profile, which
    showed cpuset_zone_allowed(), not get_page_from_freelist(), at the
    top of the list, given that the node id is evaluated in the
    get_page_from_freelist() routine, I was figuring that the real
    problem was in the cpuset_zone_allowed() code.  Perhaps some testing
    of a simple hack approximation to this patch will tell us - next week.

    Downside #2 - may require the above mentioned additional nodemask_t
    in the task struct.

3)  The custom zonelist option - which was part of my original cpuset
    proposal, and which Andi K and I have gone back and forth on, with
    each of us liking and disliking it, at different times.  See further
    my latest writeup on this option:

      http://lkml.org/lkml/2005/11/5/252
      Date	Sat, 5 Nov 2005 20:18:41 -0800
      From	Paul Jackson <pj@sgi.com>
      Subject	Re: [PATCH]: Clean up of __alloc_pages

My current plan - see if somehow I can code up and get tested (2),
since a rough approximation to it would be trivial to code.  If that
works, go with it, unless someone convinces me otherwise.  If (2) can't
do the job, try (1), since that seems easier to code.  If that fails,
or someone shoots that down, or Andi makes a good enough case for (3),
give (3) a go - that's the hardest path, and risks the most collateral
damage to the behaviour of the memory paging subsystem.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2006-09-16  3:38 UTC|newest]

Thread overview: 82+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-09-13 23:50 Christoph Lameter
2006-09-15  5:00 ` Andrew Morton
2006-09-15  6:49   ` Paul Jackson
2006-09-15  7:23     ` Andrew Morton
2006-09-15  7:44       ` Paul Jackson
2006-09-15  8:06         ` Andrew Morton
2006-09-15 15:53           ` David Rientjes
2006-09-15 23:03           ` David Rientjes
2006-09-16  0:04             ` Paul Jackson
2006-09-16  1:36               ` Andrew Morton
2006-09-16  2:23                 ` Christoph Lameter
2006-09-16  4:34                   ` Andrew Morton
2006-09-16  3:28                 ` [PATCH] Add node to zone for the NUMA case Christoph Lameter
2006-09-16  3:40                   ` Paul Jackson
2006-09-16  3:45                 ` [PATCH] GFP_THISNODE for the slab allocator Paul Jackson
2006-09-16  2:47             ` Christoph Lameter
2006-09-17  3:45             ` David Rientjes
2006-09-17 11:17               ` Paul Jackson
2006-09-17 12:41                 ` Christoph Lameter
2006-09-17 13:03                   ` Paul Jackson
2006-09-17 20:36                     ` David Rientjes
2006-09-17 21:20                       ` Paul Jackson
2006-09-17 22:27                       ` Paul Jackson
2006-09-17 23:49                         ` David Rientjes
2006-09-18  2:20                           ` Paul Jackson
2006-09-18 16:34                             ` Paul Jackson
2006-09-18 17:49                               ` David Rientjes
2006-09-18 20:46                                 ` Paul Jackson
2006-09-19 20:52                               ` David Rientjes
2006-09-19 21:26                                 ` Christoph Lameter
2006-09-19 21:50                                   ` David Rientjes
2006-09-21 22:11                                 ` David Rientjes
2006-09-22 10:10                                   ` Nick Piggin
2006-09-22 16:26                                   ` Paul Jackson
2006-09-22 16:36                                     ` Christoph Lameter
2006-09-15  8:28       ` Andrew Morton
2006-09-16  3:38         ` Paul Jackson [this message]
2006-09-16  4:42           ` Andi Kleen
2006-09-16 11:38             ` Paul Jackson
2006-09-16  4:48           ` Andrew Morton
2006-09-16 11:30             ` Paul Jackson
2006-09-16 15:18               ` Andrew Morton
2006-09-17  9:28                 ` Paul Jackson
2006-09-17  9:51                   ` Nick Piggin
2006-09-17 11:15                     ` Paul Jackson
2006-09-17 12:44                       ` Nick Piggin
2006-09-17 13:19                         ` Paul Jackson
2006-09-17 13:52                           ` Nick Piggin
2006-09-17 21:19                             ` Paul Jackson
2006-09-18 12:44                             ` [PATCH] mm: exempt pcp alloc from watermarks Peter Zijlstra
2006-09-18 20:20                               ` Christoph Lameter
2006-09-18 20:43                                 ` Peter Zijlstra
2006-09-19 14:35                               ` Nick Piggin
2006-09-19 14:44                                 ` Christoph Lameter
2006-09-19 15:02                                   ` Nick Piggin
2006-09-19 14:51                                 ` Peter Zijlstra
2006-09-19 15:10                                   ` Nick Piggin
2006-09-19 15:05                                     ` Peter Zijlstra
2006-09-19 15:39                                       ` Christoph Lameter
2006-09-17 16:29                   ` [PATCH] GFP_THISNODE for the slab allocator Andrew Morton
2006-09-18  2:11                     ` Paul Jackson
2006-09-18  5:09                       ` Andrew Morton
2006-09-18  7:49                         ` Paul Jackson
2006-09-16 11:48       ` Paul Jackson
2006-09-16 15:38         ` Andrew Morton
2006-09-16 21:51           ` Paul Jackson
2006-09-16 23:10             ` Andrew Morton
2006-09-17  4:37               ` Christoph Lameter
2006-09-17  4:55                 ` Andrew Morton
2006-09-17 12:09                   ` Paul Jackson
2006-09-17 12:36                   ` Christoph Lameter
2006-09-17 13:06                     ` Paul Jackson
2006-09-19 19:17                 ` David Rientjes
2006-09-19 19:19                   ` David Rientjes
2006-09-19 19:31                   ` Christoph Lameter
2006-09-19 21:12                     ` David Rientjes
2006-09-19 21:28                       ` Christoph Lameter
2006-09-19 21:53                         ` Paul Jackson
2006-09-15 17:08   ` Christoph Lameter
2006-09-15 17:37   ` [PATCH] Add NUMA_BUILD definition in kernel.h to avoid #ifdef CONFIG_NUMA Christoph Lameter
2006-09-15 17:38   ` [PATCH] Disable GFP_THISNODE in the non-NUMA case Christoph Lameter
2006-09-15 17:42   ` [PATCH] GFP_THISNODE for the slab allocator V2 Christoph Lameter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20060915203816.fd260a0b.pj@sgi.com \
    --to=pj@sgi.com \
    --cc=ak@suse.de \
    --cc=akpm@osdl.org \
    --cc=clameter@sgi.com \
    --cc=linux-mm@kvack.org \
    --cc=rientjes@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox