linux-mm.kvack.org archive mirror
From: Adam Litke <agl@us.ibm.com>
To: Paul Jackson <pj@sgi.com>
Cc: linux-mm@kvack.org, mel@skynet.ie, apw@shadowen.org,
	wli@holomorphy.com, clameter@sgi.com, kenchen@google.com
Subject: Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
Date: Fri, 13 Jul 2007 16:05:42 -0500
Message-ID: <1184360742.16671.55.camel@localhost.localdomain>
In-Reply-To: <20070713130508.6f5b9bbb.pj@sgi.com>

On Fri, 2007-07-13 at 13:05 -0700, Paul Jackson wrote:
> Adam wrote:
> > +	/*
> > +	 * I haven't figured out how to incorporate this cpuset bodge into
> > +	 * the dynamic hugetlb pool yet.  Hopefully someone more familiar with
> > +	 * cpusets can weigh in on their desired semantics.  Maybe we can just
> > +	 * drop this check?
> > +	 *
> >  	if (chg > cpuset_mems_nr(free_huge_pages_node))
> >  		return -ENOMEM;
> > +	 */
> 
> I can't figure out the value of this check either -- Ken Chen added it, perhaps
> he can comment.

To be honest, I just don't think a global hugetlb pool and cpusets are
compatible, period.  I wonder if moving to the mempool interface and
having dynamically adjustable per-cpuset hugetlb mempools (ick) could make
things saner.  It's on my list to see if mempools could be used to
replace the custom hugetlb pool code.  Otherwise, Mel's zone_movable
stuff could possibly remove the need for hugetlb pools as we know them.

> But the cpuset behaviour of this hugetlb stuff looks suspicious to me:
>  1) The code in alloc_fresh_huge_page() seems to round robin over
>     the entire system, spreading the hugetlb pages uniformly on all nodes.
>     If a task in one small cpuset starts aggressively allocating hugetlb
>     pages, do you think this will work, Adam -- looks to me like we will end
>     up calling alloc_fresh_huge_page() many times, most of which will fail to
>     alloc_pages_node() anything because the 'static nid' clock hand will be
>     pointing at a node outside of the current task's cpuset (not in that task's
>     mems_allowed).  Inefficient, but I guess ok.

Very good point.  I guess we call alloc_fresh_huge_page() in two scenarios
now: 1) by echoing a number into /proc/sys/vm/nr_hugepages, and 2) by
trying to dynamically increase the pool size for a particular process.
Case 1 is not in the context of any process (per se), so node_online_map
makes sense.  For case 2 we could teach __alloc_fresh_huge_page() to take
a nodemask.  That could get nasty, though, since we'd have to move away
from a static variable to get proper interleaving.
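
Roughly what I have in mind for the allocation side is the sketch below.
Completely untested and hand-waved: the function name is just what I've been
calling it here, and the static nid would really need to become per-mask
state to interleave fairly when different masks are passed in.

/* needs linux/nodemask.h, linux/gfp.h; HUGETLB_PAGE_ORDER from hugetlb code */
static struct page *__alloc_fresh_huge_page(nodemask_t *nodes_allowed)
{
	/* Hand-waved: should be per-mask state for fair interleaving */
	static int nid;
	struct page *page = NULL;
	int start_nid;

	if (nodes_empty(*nodes_allowed))
		return NULL;

	/* Keep the round-robin hand inside the allowed mask */
	if (!node_isset(nid, *nodes_allowed))
		nid = first_node(*nodes_allowed);
	start_nid = nid;

	do {
		page = alloc_pages_node(nid,
				GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN,
				HUGETLB_PAGE_ORDER);
		/* Advance, wrapping within the mask instead of
		 * node_online_map */
		nid = next_node(nid, *nodes_allowed);
		if (nid == MAX_NUMNODES)
			nid = first_node(*nodes_allowed);
	} while (!page && nid != start_nid);

	return page;
}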

>  2) I don't see what keeps us from picking hugetlb pages off -any- node in the
>     system, perhaps way outside the current cpuset.  We shouldn't be looking for
>     enough available (free_huge_pages - resv_huge_pages) pages in the whole
>     system.  Rather we should be looking for and reserving enough such pages
>     that are in the current task's cpuset (set in its mems_allowed, to be
>     precise).  Folks aren't going to want their hugetlb pages coming from
>     outside their task's cpuset.

Hmm, I see what you mean, but cpuset handling is already broken here because
we use the global resv_huge_pages counter.  I realize that's what the
cpuset_mems_nr() check was meant to address, but it isn't correct.

Perhaps if we make sure __alloc_fresh_huge_page() can be restricted to a
nodemask, then we can avoid stealing pages from other cpusets.  But we'd
still be stuck with the existing problem for shared mappings: cpusets +
our strict_reservation algorithm cannot provide guarantees (like we can
without cpusets).
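
At the call sites I imagine it would look something like this (again
untested, and whether current->mems_allowed is really the right mask to use
here is exactly the cpuset semantics question above):

	/* sysctl path (nr_hugepages): no task context, keep spreading
	 * over all online nodes as we do today */
	page = __alloc_fresh_huge_page(&node_online_map);

	/* dynamic pool growth for a mapping: stay within the caller's
	 * cpuset */
	page = __alloc_fresh_huge_page(&current->mems_allowed);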

>  3) If there is some code I missed (good chance) that enforces the rule that
>     a task can only get a hugetlb page from a node in its cpuset, then this
>     uniform global allocation of hugetlb pages, as noted in (1) above, can't
>     be right.  Either it will force all nodes, including many nodes outside
>     of the current task's cpuset, to bulk up on free hugetlb pages, just to
>     get enough of them on nodes allowed by the current task's cpuset, or else
>     it will fail to get enough on nodes local to the current task's cpuset.
>     I don't understand the logic well enough to know which, but either way
>     sucks.

I'll cook up a __alloc_fresh_huge_page(nodemask) patch and see if that
makes things better.  Thanks for your review and comments.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

Thread overview: 29+ messages
2007-07-13 15:16 [PATCH 0/5] [RFC] Dynamic hugetlb pool resizing Adam Litke
2007-07-13 15:16 ` [PATCH 1/5] [hugetlb] Introduce BASE_PAGES_PER_HPAGE constant Adam Litke
2007-07-23 19:43   ` Christoph Lameter
2007-07-23 19:52     ` Adam Litke
2007-07-13 15:16 ` [PATCH 2/5] [hugetlb] Account for hugepages as locked_vm Adam Litke
2007-07-13 15:16 ` [PATCH 3/5] [hugetlb] Move update_and_free_page so it can be used by alloc functions Adam Litke
2007-07-13 15:17 ` [PATCH 4/5] [hugetlb] Try to grow pool on alloc_huge_page failure Adam Litke
2007-07-13 15:17 ` [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings Adam Litke
2007-07-13 20:05   ` Paul Jackson
2007-07-13 21:05     ` Adam Litke [this message]
2007-07-13 21:24       ` Ken Chen
2007-07-13 21:29       ` Christoph Lameter
2007-07-13 21:38         ` Ken Chen
2007-07-13 21:47           ` Christoph Lameter
2007-07-13 22:21           ` Paul Jackson
2007-07-13 21:38       ` Paul Jackson
2007-07-17 23:42         ` Nish Aravamudan
2007-07-18 14:44           ` Lee Schermerhorn
2007-07-18 15:17             ` Nish Aravamudan
2007-07-18 16:02               ` Lee Schermerhorn
2007-07-18 21:16                 ` Nish Aravamudan
2007-07-18 21:40                   ` Lee Schermerhorn
2007-07-19  1:52                 ` Paul Mundt
2007-07-20 20:35                   ` Nish Aravamudan
2007-07-20 20:53                     ` Lee Schermerhorn
2007-07-20 21:12                       ` Nish Aravamudan
2007-07-21 16:57                     ` Paul Mundt
2007-07-13 23:15       ` Nish Aravamudan
2007-07-13 21:09     ` Ken Chen
