From: David Rientjes <rientjes@google.com>
To: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>,
	linux-mm@kvack.org, linux-numa@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Greg KH <gregkh@suse.de>, Nishanth Aravamudan <nacc@us.ibm.com>,
	Andi Kleen <andi@firstfloor.org>, Adam Litke <agl@us.ibm.com>,
	Andy Whitcroft <apw@canonical.com>,
	eric.whitney@hp.com
Subject: Re: [PATCH 4/4] hugetlb: add per node hstate attributes
Date: Fri, 31 Jul 2009 12:55:08 -0700 (PDT)
Message-ID: <alpine.DEB.2.00.0907311239190.22732@chino.kir.corp.google.com>
In-Reply-To: <20090731103632.GB28766@csn.ul.ie>

On Fri, 31 Jul 2009, Mel Gorman wrote:

> > Google is going to need this support regardless of what finally gets
> > merged into mainline, so I'm thrilled you've implemented this version.
> > 
> 
> The fact that there is a definite use case in mind lends weight to this
> approach but I want to be 100% sure that a hugetlbfs-specific interface
> is required in this case.
> 

It's not necessarily required over the mempolicy approach for allocation 
since it's quite simple to just do

	numactl --membind nodemask echo 10 >			\
		/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

on the nodemask for which you want to allocate 10 additional hugepages 
(or, if node-targeted allocations are really necessary, to use
numactl --preferred node in succession to get a balanced interleave, for 
example.)
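
A rough sketch of the "numactl --preferred in succession" idea, assuming a
kernel in which writes to the hugepage count file honor the caller's
mempolicy (that is the proposal under discussion, not a given). The helper
name alloc_one_per_node and its arguments are made up for illustration. The
file holds a total rather than an increment, so the sketch re-reads it
before each write and asks for one more page:

```shell
# Hypothetical helper: ask for one more hugepage on each listed node in turn.
# $1 is the hugepage count file (for example
# /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages); the remaining
# arguments are node ids. Assumes the file honors the writer's mempolicy.
alloc_one_per_node() {
    hp_file=$1
    shift
    for node in "$@"; do
        # Re-read the current total: an earlier request may have fallen short.
        cur=$(cat "$hp_file")
        numactl --preferred="$node" echo "$((cur + 1))" > "$hp_file"
    done
}
```

Something like "alloc_one_per_node $file 1 2 3", repeated ten times, would
aim for ten pages per node, but the per-node result still has to be checked
afterwards since any single allocation can fail.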

> I don't know the setup, but let's say something like the following is
> happening
> 
> 1. job scheduler creates cpuset of subset of nodes
> 2. job scheduler creates memory policy for subset of nodes
> 3. initialisation job starts, reserves huge pages. If a memory policy is
>    already in place, it will reserve them in the correct places

This is where per-node nr_hugepages attributes would be helpful.  It may 
not be possible for the desired number of hugepages to be evenly allocated 
on each node in the subset for MPOL_INTERLEAVE.

If the subset is {1, 2, 3}, for instance, it's possible to get hugepage 
quantities on those nodes as {10, 5, 10}.  The preferred userspace 
solution may be to either change its subset of the cpuset nodes to 
allocate 10 hugepages on another node and not use node 2, or to deallocate 
hugepages on nodes 1 and 3 so it matches node 2.
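
Userspace first has to notice the imbalance. A minimal sketch of that check,
assuming each node's meminfo file carries a "Node N HugePages_Total:" line;
the helper name node_hugepages is made up, and the sysfs root is passed in
so it can be pointed at a test tree:

```shell
# Hypothetical helper: print "<node> <HugePages_Total>" for each listed node.
# $1 is the node sysfs root (normally /sys/devices/system/node); the
# remaining arguments are node ids.
node_hugepages() {
    nh_root=$1
    shift
    for node in "$@"; do
        # Expected line format: "Node 2 HugePages_Total:     5"
        count=$(awk '$3 == "HugePages_Total:" { print $4 }' \
            "$nh_root/node$node/meminfo")
        echo "$node $count"
    done
}
```

For the case above, "node_hugepages /sys/devices/system/node 1 2 3" would
print one line per node, and the short count on node 2 stands out.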

With the per-node nr_hugepages attributes, that's trivial.  With the 
mempolicy based approach, you'd need to do this (I guess):

 - to change the subset of cpuset nodes: construct a mempolicy of
   MPOL_PREFERRED on node 2, deallocate via the global nr_hugepages file,
   select (or allocate) another cpuset node, construct another mempolicy
   of MPOL_PREFERRED on that new node, allocate, check, reiterate, and

 - to deallocate on nodes 1 and 3: construct a mempolicy of MPOL_BIND on
   nodes 1 and 3, deallocate via the global nr_hugepages.

I'm not sure at the moment whether mempolicies are honored when freeing 
hugepages via /sys/kernel/mm/hugepages/*/nr_hugepages or whether freeing 
is simply round-robin, so the second solution may not even work.
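
For comparison, here is roughly what the second case looks like with
per-node attributes. This is only a sketch: the helper name trim_to_min is
made up, and the path layout node<N>/hugepages/hugepages-2048kB/nr_hugepages
under the node sysfs root is an assumption that may not match what the
patch under review exports.

```shell
# Hypothetical helper: shrink every listed node to the smallest per-node
# hugepage count among them. $1 is the node sysfs root; the remaining
# arguments are node ids. The per-node path layout is an assumption.
trim_to_min() {
    tm_root=$1
    shift
    min=
    # First pass: find the smallest count across the listed nodes.
    for node in "$@"; do
        n=$(cat "$tm_root/node$node/hugepages/hugepages-2048kB/nr_hugepages")
        if [ -z "$min" ] || [ "$n" -lt "$min" ]; then
            min=$n
        fi
    done
    # Second pass: write that count to each node's own attribute.
    for node in "$@"; do
        echo "$min" > "$tm_root/node$node/hugepages/hugepages-2048kB/nr_hugepages"
    done
}
```

With {10, 5, 10} on nodes {1, 2, 3}, "trim_to_min $root 1 2 3" writes 5 to
each node's file, with no mempolicy juggling needed.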

> 4. Job completes
> 5. job scheduler frees the pages reserved for the job freeing up pages
>    on the subset of nodes
> 
> i.e. if the job scheduler already has a memory policy of its own, or
> even some child process of that job scheduler, it should just be able to
> set nr_hugepages and have them reserved on the correct nodes.
> 

Right, allocation is simple with the mempolicy based approach, but given 
that hugepage allocation doesn't always produce the per-node distribution 
userspace asked for, and that freeing is more difficult, it's easier to 
use per-node controls.

> With the per-node-attribute approach, little stops a process going
> outside of its subset of allowed nodes.
> 

If you are allowed the capability to allocate system-wide resources for 
hugepages (and you can change your own mempolicy to MPOL_DEFAULT whenever 
you want, of course), that doesn't seem like an issue.
