From: David Rientjes <rientjes@google.com>
To: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>,
linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Nishanth Aravamudan <nacc@us.ibm.com>,
Adam Litke <agl@us.ibm.com>, Andy Whitcroft <apw@canonical.com>,
eric.whitney@hp.com, Ranjit Manomohan <ranjitm@google.com>
Subject: Re: [PATCH 0/5] Huge Pages Nodes Allowed
Date: Thu, 25 Jun 2009 12:22:12 -0700 (PDT)
Message-ID: <alpine.DEB.2.00.0906251155250.30090@chino.kir.corp.google.com>
In-Reply-To: <1245896060.6439.159.camel@lts-notebook>

On Wed, 24 Jun 2009, Lee Schermerhorn wrote:
> Would having cpusets constrain huge page pool allocation meet your
> needs?
>
It would, but it seems like an unnecessary inconvenience. The admin task
would have to join a cpuset for as long as it is allocating hugepages for
an application. It is also more difficult to expand a cpuset to include a
new node that has a specific threshold of hugepages available if the way
to find that node is to write to
/sys/devices/system/node/node*/hugepages-<size>kB/nr_hugepages to
preallocate hugepages, see which node has the least fragmentation to
support such an allocation in the first place, and then free them again if
they cannot be allocated.

It also doesn't support users who don't have CONFIG_CPUSETS, but do
mbind their memory to a subset of nodes that need hugepages while others
do not.
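
To make the probing step concrete, here is a minimal sketch of it in Python. It assumes the per-node attribute path exactly as written above; the function names, the default 2048 kB size, and the `root` parameter (so the sketch can be pointed at a scratch directory rather than the real sysfs) are all illustrative and not an existing interface.

```python
from pathlib import Path

# Path layout copied from the message above; `root` is a parameter so the
# sketch can be exercised against a scratch directory instead of real sysfs.
DEFAULT_ROOT = "/sys/devices/system/node"


def nr_hugepages_path(node, size_kb, root=DEFAULT_ROOT):
    """Return the per-node nr_hugepages attribute for one hugepage size."""
    return Path(root) / f"node{node}" / f"hugepages-{size_kb}kB" / "nr_hugepages"


def probe_nodes(nodes, want, size_kb=2048, root=DEFAULT_ROOT):
    """Ask each node for `want` hugepages, record what it granted, then restore.

    Returns {node: pages actually present after the request}. On a real
    system the value read back may be lower than `want` if the node is too
    fragmented to satisfy the request.
    """
    granted = {}
    for node in nodes:
        attr = nr_hugepages_path(node, size_kb, root)
        before = attr.read_text().strip()
        attr.write_text(str(want))
        granted[node] = int(attr.read_text().strip())
        attr.write_text(before)  # free whatever the probe allocated
    return granted


def nodes_meeting_threshold(granted, threshold):
    """Nodes whose probe reached at least `threshold` hugepages."""
    return sorted(n for n, got in granted.items() if got >= threshold)
```

The probe restores the previous value on every node, which is the "freeing them if they cannot be allocated" step; a cpuset would then be expanded only to a node returned by nodes_meeting_threshold().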
> > This could become pretty cryptic:
> >
> > hugepagesz=2M hugepages=(0:10,1:20) hugepagesz=1G \
> > hugepages=(2:10,3:10)
> >
> > and I assume we'd use `count' of 99999 for nodes of unknown sizes where we
> > simply want to allocate as many hugepages as possible.
>
> If one needed that capability--"allocate as many as possible"--then,
> yes, I guess any ridiculously large count would do the trick.
>
I remember that on older kernels large hugepages= values would keep the
system from booting, because subsequent kernel allocations would fail once
it was oom. We need to do hugepages= allocation as early as possible to
avoid additional fragmentation later, but not so much of it that we
completely oom the kernel. The only way to prevent that is with a maximum
hugepage watermark (which would no longer be global but per zone with the
hugepages=(node:count,...) support) to allow at least a certain threshold
of memory to be free to the kernel for boot.
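
As a rough illustration of the watermark idea, the sketch below parses the hugepages=(node:count,...) form discussed in this thread and caps each request so that some memory stays free for boot. The syntax is only a proposal here, and every name in the code (parse_node_counts, clamp_to_watermark, reserve_pages, node_pages_total) is hypothetical. The reserve is tracked per node for simplicity, although the text above talks about a per-zone watermark.

```python
import re

# Matches the proposed "(node:count,node:count,...)" value, digits only.
_ARG = re.compile(r"^\((\d+:\d+(?:,\d+:\d+)*)\)$")


def parse_node_counts(value):
    """Parse the proposed '(node:count,...)' form into {node: count}."""
    match = _ARG.match(value)
    if match is None:
        raise ValueError(f"not a (node:count,...) list: {value!r}")
    counts = {}
    for pair in match.group(1).split(","):
        node, count = pair.split(":")
        counts[int(node)] = int(count)
    return counts


def clamp_to_watermark(counts, node_pages_total, reserve_pages, pages_per_hugepage):
    """Cap each node's request so `reserve_pages` base pages stay free.

    node_pages_total maps node -> base pages on that node (hypothetical
    input; a real kernel would take this from its own accounting).
    """
    clamped = {}
    for node, want in counts.items():
        usable = max(node_pages_total[node] - reserve_pages, 0)
        clamped[node] = min(want, usable // pages_per_hugepage)
    return clamped
```

With a cap like this, a "ridiculously large count" such as 99999 degrades to "as many as possible while leaving the reserve", instead of starving later kernel allocations.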
> > We'd still need to support hugepages=N for large NUMA machines so we don't
> > have to specify the same number of hugepages per node for a true
> > interleave, which would require an extremely large command line. And then
> > the behavior of
> >
> > hugepagesz=1G hugepages=(0:10,1:20) hugepages=30
> >
> > needs to be defined. In that case, does hugepages=30 override the
> > previous settings if this system only has dual nodes? If so, for SGI's 1K
> > node systems it's going to be difficult to specify many nodes with 10
> > hugepages and a few with 20. So perhaps hugepages=(node:count,...) should
> > increment or decrement the hugepages= value, if specified?
>
> Mel mentioned that we probably don't need boot command line hugepage
> allocation all that much with lumpy reclaim, etc. I can see his point.

Understood, especially with the complexity of specifying them on the
command line in the first place :) The only concern I have is for users
who want hugepages preallocated on a specific set of nodes at boot. They
are required to use much higher hugepages= values to allocate on all system
nodes and then just free the pages on the nodes they aren't interested in.
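
The overhead is easy to put a number on. The sketch below assumes boot-time allocation spreads hugepages=N evenly over all online nodes and that every node can satisfy its share; the function name and arguments are made up for illustration.

```python
def boot_count_for_subset(per_node_want, wanted_nodes, online_nodes):
    """Return (hugepages= value, {node: pages to free after boot}).

    Assumes an even spread of the boot-time request over `online_nodes`.
    """
    total = per_node_want * len(online_nodes)
    to_free = {n: per_node_want for n in online_nodes if n not in wanted_nodes}
    return total, to_free
```

For 10 hugepages on each of nodes 0 and 1 of a four-node machine, boot_count_for_subset(10, {0, 1}, [0, 1, 2, 3]) asks for hugepages=40 and then frees 10 pages on each of nodes 2 and 3, so half of what was allocated at boot is thrown away.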
> If we can't allocate all the hugepages we need from an early init script
> or similar, we probably don't have enough memory anyway. For
> compatibility, I supposed we need to retain the hugepages= parameter.
> And, we've added the hugepagesz parameter, so we need to retain that.
> But, maybe we should initially limit per node allocations to sysfs node
> attributes post boot?
>
Agreed, it solves the early boot oom failures as well.
> -------------
> Related question: do you think we need per node overcommit limits?

I do, because applications constrained to an exclusive cpuset will only be
able to allocate from that cpuset's set of allowable nodes anyway, so the
global overcommit limits aren't in effect. There needs to be a mechanism
to allow surplus allocations to take place for such constrained tasks.

The only reason I've proposed these hugepage tunables to be attributes of
the system's nodes and not attributes of the individual cpusets is that
non-exclusive cpusets may share nodes among siblings and parents will
always share nodes with children, so the tunables could become
inconsistent with one another. For all other purposes, they really are a
characteristic of the cpuset's job, however.

> I'm
> having difficulty understanding what the semantics of the global limit
> would be with per node limits--i.e., how would one distribute the global
> limit across nodes [for backwards compatibility]. With nr_hugepages,
> today we just do a best effort to distribute the requested number of
> pages over the on-line nodes. If we fail to allocate that many, we
> don't remember the initial request, just how many we actually allocated
> where ever they landed. But, I don't see how that works with limits. I
> suppose we could arrange that if you don't specify a per node limit, the
> global limit applies when attempting to allocate a surplus page on a
> given node. If you do [so specify], then the respective node limit
> applies, whether or not the sum of per node surplus pages exceeds the
> global limit.
>
Hmm, yes, that's a concern. I think the easiest way to do it would be to
respect both the global and node surplus limits when the node limit is 0,
and respect the node surplus limit when it is positive. The setting of
either the global limit or the node limit does not change the other. This
deals with the cpuset case where applications constrained to a cpuset may
only allocate from their own nodes.

This would allow the global limit to exceed the sum of the node limits and
for hugepage allocation to fail, but that would have required a node limit
to be set on each online node. In such a situation, the global limit has,
in effect, been obsoleted and its value no longer matters.
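
One possible reading of those rules, sketched in Python. This is an interpretation, not a settled design: it treats a node limit of 0 (or a missing entry) as "unset", so in that case only the global limit ends up deciding, and it treats a positive node limit as the only check for that node. The function and argument names are hypothetical.

```python
def may_allocate_surplus(node, node_surplus, node_limits, global_limit):
    """Decide whether one more surplus hugepage may be allocated on `node`.

    node_surplus: {node: surplus pages currently on that node}
    node_limits:  {node: per-node surplus limit; 0 or missing means unset}
    global_limit: the existing system-wide surplus limit
    """
    limit = node_limits.get(node, 0)
    if limit > 0:
        # A positive node limit is the only check; the global limit is
        # ignored, even if total surplus already exceeds it.
        return node_surplus.get(node, 0) < limit
    # No node limit set: fall back to the global limit over all nodes.
    return sum(node_surplus.values()) < global_limit
```

If every online node has a positive limit, the fallback branch is never reached, which is the situation described above where the global limit no longer matters.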