From: Adam Litke <agl@us.ibm.com>
To: Paul Jackson <pj@sgi.com>
Cc: linux-mm@kvack.org, mel@skynet.ie, apw@shadowen.org,
wli@holomorphy.com, clameter@sgi.com, kenchen@google.com
Subject: Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
Date: Fri, 13 Jul 2007 16:05:42 -0500
Message-ID: <1184360742.16671.55.camel@localhost.localdomain>
In-Reply-To: <20070713130508.6f5b9bbb.pj@sgi.com>
On Fri, 2007-07-13 at 13:05 -0700, Paul Jackson wrote:
> Adam wrote:
> > + /*
> > + * I haven't figured out how to incorporate this cpuset bodge into
> > + * the dynamic hugetlb pool yet. Hopefully someone more familiar with
> > + * cpusets can weigh in on their desired semantics. Maybe we can just
> > + * drop this check?
> > + *
> > if (chg > cpuset_mems_nr(free_huge_pages_node))
> > return -ENOMEM;
> > + */
>
> I can't figure out the value of this check either -- Ken Chen added it, perhaps
> he can comment.
To be honest, I just don't think a global hugetlb pool and cpusets are
compatible, period. I wonder if moving to the mempool interface and
having dynamically adjustable per-cpuset hugetlb mempools (ick) could
make things saner. It's on my list to see if mempools could be used to
replace the custom hugetlb pool code. Otherwise, Mel's zone_movable
stuff could possibly remove the need for hugetlb pools as we know them.
> But the cpuset behaviour of this hugetlb stuff looks suspicious to me:
> 1) The code in alloc_fresh_huge_page() seems to round robin over
> the entire system, spreading the hugetlb pages uniformly on all nodes.
> > If a task in one small cpuset starts aggressively allocating hugetlb
> pages, do you think this will work, Adam -- looks to me like we will end
> up calling alloc_fresh_huge_page() many times, most of which will fail to
> alloc_pages_node() anything because the 'static nid' clock hand will be
> > pointing at a node outside of the current task's cpuset (not in that task's
> mems_allowed). Inefficient, but I guess ok.
Very good point. I guess we call alloc_fresh_huge_page in two scenarios
now... 1) By echoing a number into /proc/sys/vm/nr_hugepages, and 2) by
trying to dynamically increase the pool size for a particular process.
Case 1 is not in the context of any process (per se) and so
node_online_map makes sense. For case 2 we could teach
__alloc_fresh_huge_page() to take a nodemask. That could get nasty
though since we'd have to move away from a static variable to get proper
interleaving.
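
Roughly what I'm picturing (completely untested sketch -- the _nodes
name and the caller-owned next_nid cursor are made up for illustration,
the rest follows what alloc_fresh_huge_page() does today, as I read it):

static int alloc_fresh_huge_page_nodes(nodemask_t *nodes, int *next_nid)
{
        struct page *page;
        int nid;

        /*
         * Interleave over the allowed nodes with a cursor the caller
         * owns, instead of the global 'static int prev_nid'.
         */
        nid = next_node(*next_nid, *nodes);
        if (nid >= MAX_NUMNODES)
                nid = first_node(*nodes);
        if (nid >= MAX_NUMNODES)        /* empty nodemask */
                return 0;
        *next_nid = nid;

        page = alloc_pages_node(nid,
                        htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
                        HUGETLB_PAGE_ORDER);
        if (!page)
                return 0;

        set_compound_page_dtor(page, free_huge_page);
        spin_lock(&hugetlb_lock);
        nr_huge_pages++;
        nr_huge_pages_node[nid]++;
        spin_unlock(&hugetlb_lock);
        put_page(page);         /* hand the fresh page to the pool */
        return 1;
}

Case 1 would then pass node_online_map and keep its current behaviour,
and case 2 would pass the task's mems_allowed.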
> 2) I don't see what keeps us from picking hugetlb pages off -any- node in the
> system, perhaps way outside the current cpuset. We shouldn't be looking for
> enough available (free_huge_pages - resv_huge_pages) pages in the whole
> system. Rather we should be looking for and reserving enough such pages
> that are in the current task's cpuset (set in its mems_allowed, to be precise).
> Folks aren't going to want their hugetlb pages coming from outside their
> task's cpuset.
Hmm, I see what you mean, but cpusets are already broken because we use
the global resv_huge_pages counter. I realize that's what the
cpuset_mems_nr() check was meant to address, but it's not correct.
Perhaps if we make sure __alloc_fresh_huge_page() can be restricted to a
nodemask, then we can avoid stealing pages from other cpusets. But we'd
still be stuck with the existing problem for shared mappings: cpusets +
our strict_reservation algorithm cannot provide guarantees (like we can
without cpusets).
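
The only version of the accounting check I can come up with is to count
free huge pages over the allowed nodes only, which is all that
cpuset_mems_nr(free_huge_pages_node) really does today. Spelled out
(the helper name is made up):

static unsigned long free_huge_pages_allowed(void)
{
        unsigned long nr = 0;
        int nid;

        for_each_node_mask(nid, current->mems_allowed)
                nr += free_huge_pages_node[nid];
        return nr;
}

        /* in hugetlb_reserve_pages(), instead of the global pool check */
        if (chg > free_huge_pages_allowed())
                return -ENOMEM;

But resv_huge_pages is still a single global counter, so the reservation
isn't tied to any particular nodes, and I don't see how that keeps the
guarantee once cpusets are in the mix.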
> 3) If there is some code I missed (good chance) that enforces the rule that
> a task can only get a hugetlb page from a node in its cpuset, then this
> uniform global allocation of hugetlb pages, as noted in (1) above, can't
> be right. Either it will force all nodes, including many nodes outside
> of the current task's cpuset, to bulk up on free hugetlb pages, just to
> get enough of them on nodes allowed by the current task's cpuset, or else
> it will fail to get enough on nodes local to the current task's cpuset.
> I don't understand the logic well enough to know which, but either way
> sucks.
I'll cook up a __alloc_fresh_huge_page(nodemask) patch and see if that
makes things better. Thanks for your review and comments.
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Thread overview: 29+ messages
2007-07-13 15:16 [PATCH 0/5] [RFC] Dynamic hugetlb pool resizing Adam Litke
2007-07-13 15:16 ` [PATCH 1/5] [hugetlb] Introduce BASE_PAGES_PER_HPAGE constant Adam Litke
2007-07-23 19:43 ` Christoph Lameter
2007-07-23 19:52 ` Adam Litke
2007-07-13 15:16 ` [PATCH 2/5] [hugetlb] Account for hugepages as locked_vm Adam Litke
2007-07-13 15:16 ` [PATCH 3/5] [hugetlb] Move update_and_free_page so it can be used by alloc functions Adam Litke
2007-07-13 15:17 ` [PATCH 4/5] [hugetlb] Try to grow pool on alloc_huge_page failure Adam Litke
2007-07-13 15:17 ` [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings Adam Litke
2007-07-13 20:05 ` Paul Jackson
2007-07-13 21:05 ` Adam Litke [this message]
2007-07-13 21:24 ` Ken Chen
2007-07-13 21:29 ` Christoph Lameter
2007-07-13 21:38 ` Ken Chen
2007-07-13 21:47 ` Christoph Lameter
2007-07-13 22:21 ` Paul Jackson
2007-07-13 21:38 ` Paul Jackson
2007-07-17 23:42 ` Nish Aravamudan
2007-07-18 14:44 ` Lee Schermerhorn
2007-07-18 15:17 ` Nish Aravamudan
2007-07-18 16:02 ` Lee Schermerhorn
2007-07-18 21:16 ` Nish Aravamudan
2007-07-18 21:40 ` Lee Schermerhorn
2007-07-19 1:52 ` Paul Mundt
2007-07-20 20:35 ` Nish Aravamudan
2007-07-20 20:53 ` Lee Schermerhorn
2007-07-20 21:12 ` Nish Aravamudan
2007-07-21 16:57 ` Paul Mundt
2007-07-13 23:15 ` Nish Aravamudan
2007-07-13 21:09 ` Ken Chen