* tmpfs round-robin NUMA allocation
@ 2004-08-02 22:52 Brent Casavant
2004-08-02 23:06 ` Martin J. Bligh
2004-08-02 23:29 ` Andi Kleen
From: Brent Casavant @ 2004-08-02 22:52 UTC
To: linux-mm; +Cc: hugh, ak
Hello,
OK, I swear that it's complete coincidence that something else I'm
looking at happens to fall into tmpfs...
It would be helpful to be able to round-robin tmpfs allocations
(except for /dev/null and shm) between nodes on NUMA systems.
This avoids putting undue memory pressure on a single node, while
leaving other nodes less used.
I looked at using the MPOL_INTERLEAVE policy to accomplish this,
however I think there's a flaw with that approach. Since that
policy uses the vm_pgoff value (which for tmpfs is determined by
the inode swap page index) to determine the node from which to
allocate, it seems that we'll overload the first few available
nodes for interleaving instead of evenly distributing pages.
This will be particularly exacerbated if there are a large number
of small files in the tmpfs filesystem.
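To make the concern concrete, the offset-based selection boils down to
roughly the following (simplified for illustration only; this is not
the actual mm/mempolicy.c code):

/*
 * Simplified illustration of offset-based interleaving.  The node is
 * a pure function of the page's offset within the file.
 */
static int interleave_node(unsigned long pgoff, int nr_nodes)
{
	return pgoff % nr_nodes;
}

/*
 * Every one-page tmpfs file has pgoff == 0, so a large population of
 * small files piles its pages onto the first node(s) in the mask
 * rather than spreading them across the machine.
 */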
I see two possible ways to address this, and hope you can give me
some guidance as to which one you'd prefer to see implemented.
The first, and more hackerly, way of addressing the problem is to
use MPOL_INTERLEAVE, but change the tmpfs shmem_alloc_page() code
(for CONFIG_NUMA) to perform its own round-robinning of the vm_pgoff
value instead of deriving it from the swap page index. This should
be straightforward to do, and will add very little additional code.
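A minimal sketch of that idea, with made-up names (this is the shape
of the change, not the current shmem internals):

/* Hypothetical per-mount rotor; names are illustrative only. */
struct shmem_rr {
	atomic_t next;
};

/*
 * Feed the interleave policy a rotor value instead of the page's
 * real index, so successive allocations walk the nodes even when
 * every file is tiny.
 */
static unsigned long shmem_rr_pgoff(struct shmem_rr *rr)
{
	return (unsigned long)atomic_inc_return(&rr->next);
}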
The second, and more elegant, way of addressing the problem is to
create a new MPOL_ROUNDROBIN policy, which would be identical to
MPOL_INTERLEAVE, except it would use either a counter or rotor to
choose the node from which to allocate. This would probably be
just a bit more code than the previous idea, but would also provide
a more general facility that could be useful elsewhere.
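The node-selection step of such a policy might look roughly like this
(hypothetical names, not an actual mempolicy patch):

/*
 * Hypothetical MPOL_ROUNDROBIN rotor: pick the next allowed node on
 * every allocation, ignoring the page offset entirely.
 */
struct rr_state {
	atomic_t rotor;
	int nr_nodes;		/* nodes allowed by the policy */
};

static int rr_next_node(struct rr_state *rr)
{
	unsigned int n = (unsigned int)atomic_inc_return(&rr->rotor);

	return n % rr->nr_nodes;
}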
In either case, I would set each inode to use the corresponding policy
by default for tmpfs files. If an application "knows better" than
to use round-robin allocation in some circumstance, it could use
the mbind() call to change the placement for a particular mmap'd tmpfs
file.
For the /dev/null and shm uses of tmpfs I would leave things as-is.
The MPOL_DEFAULT policy will usually be appropriate in these cases,
and mbind() can alter that situation on a case-by-case basis.
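For example, an application could do something along these lines for
an mmap'd tmpfs file (userspace sketch using the mbind() wrapper from
libnuma's numaif.h; node 0 and the helper name are purely illustrative,
and error handling is omitted):

#include <numaif.h>
#include <sys/mman.h>

/* Bind an mmap'd tmpfs file's pages to node 0 instead of the
 * filesystem's default placement.  Example values only. */
static void *map_bound_to_node0(int fd, size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	unsigned long nodemask = 1UL << 0;	/* node 0 */

	if (p != MAP_FAILED)
		mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);
	return p;
}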
So, the big decision is whether I should put the round-robining
into tmpfs itself, or write the more general mechanism for the
NUMA memory policy code.
Thoughts/opinions valued,
Brent
--
Brent Casavant bcasavan@sgi.com Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
* Re: tmpfs round-robin NUMA allocation
2004-08-02 22:52 tmpfs round-robin NUMA allocation Brent Casavant
@ 2004-08-02 23:06 ` Martin J. Bligh
2004-08-02 23:29 ` Andi Kleen
From: Martin J. Bligh @ 2004-08-02 23:06 UTC
To: Brent Casavant, linux-mm; +Cc: hugh, ak
> I looked at using the MPOL_INTERLEAVE policy to accomplish this,
> however I think there's a flaw with that approach. Since that
> policy uses the vm_pgoff value (which for tmpfs is determined by
> the inode swap page index) to determine the node from which to
> allocate, it seems that we'll overload the first few available
> nodes for interleaving instead of evenly distributing pages.
> This will be particularly exacerbated if there are a large number
> of small files in the tmpfs filesystem.
...
> So, the big decision is whether I should put the round-robining
> into tmpfs itself, or write the more general mechanism for the
> NUMA memory policy code.
Doesn't really seem like a tmpfs problem - I'd think the general
mod would be more appropriate. But rather than creating another
policy, would it not be easier to just add a static "node offset"
on a per-file basis (i.e. make them all start on different nodes)?
Either according to the node we created the file from, or just
a random node?
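i.e. roughly something like this (illustrative only, names made up):

/* Record a starting node when the file is created, and offset the
 * existing pgoff-based interleave by it, so different files start
 * on different nodes. */
struct file_rr_hint {
	int start_node;		/* creating node, or a random one */
};

static int file_interleave_node(struct file_rr_hint *hint,
				unsigned long pgoff, int nr_nodes)
{
	return (hint->start_node + pgoff) % nr_nodes;
}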
M.
* Re: tmpfs round-robin NUMA allocation
2004-08-02 22:52 tmpfs round-robin NUMA allocation Brent Casavant
2004-08-02 23:06 ` Martin J. Bligh
@ 2004-08-02 23:29 ` Andi Kleen
2004-08-04 21:27 ` Brent Casavant
From: Andi Kleen @ 2004-08-02 23:29 UTC
To: Brent Casavant; +Cc: linux-mm, hugh
On Mon, 2 Aug 2004 17:52:52 -0500
Brent Casavant <bcasavan@sgi.com> wrote:
> Hello,
>
> OK, I swear that it's complete coincidence that something else I'm
> looking at happens to fall into tmpfs...
>
> It would be helpful to be able to round-robin tmpfs allocations
> (except for /dev/null and shm) between nodes on NUMA systems.
> This avoids putting undue memory pressure on a single node, while
> leaving other nodes less used.
Hmm, maybe for tmpfs used as normal files. But when tmpfs is used as shmfs,
local policy may be better (consider a database allocating its shared cache
- I suspect for those local policy is the better default).
Perhaps it would make sense to only do interleaving by default when
tmpfs is accessed via read/write, but not via an mmap fault.
In general, a default interleaving policy looks like a good idea for all
read/write page cache operations, except for anonymous memory of course.
The rationale is the same - these are mostly file caches
and a bit of additional latency for a file read is not an
issue, but memory pressure on specific nodes is.
The VM/VFS keep these paths separated, so it would be possible to do.
So if you do it please do it for everybody.
Longer term I would like to have arbitrary policy for named
page cache objects (not only tmpfs/hugetlbfs), but that is a
bit more work. Using interleaving for it would be a good start.
>
> I looked at using the MPOL_INTERLEAVE policy to accomplish this,
> however I think there's a flaw with that approach. Since that
> policy uses the vm_pgoff value (which for tmpfs is determined by
> the inode swap page index) to determine the node from which to
> allocate, it seems that we'll overload the first few available
> nodes for interleaving instead of evenly distributing pages.
> This will be particularly exacerbated if there are a large number
> of small files in the tmpfs filesystem.
>
> I see two possible ways to address this, and hope you can give me
> some guidance as to which one you'd prefer to see implemented.
>
> The first, and more hackerly, way of addressing the problem is to
> use MPOL_INTERLEAVE, but change the tmpfs shmem_alloc_page() code
> (for CONFIG_NUMA) to perform its own round-robinning of the vm_pgoff
> value instead of deriving it from the swap page index. This should
> be straightforward to do, and will add very little additional code.
>
> The second, and more elegant, way of addressing the problem is to
> create a new MPOL_ROUNDROBIN policy, which would be identical to
> MPOL_INTERLEAVE, except it would use either a counter or rotor to
> choose the node from which to allocate. This would probably be
> just a bit more code than the previous idea, but would also provide
> a more general facility that could be useful elsewhere.
>
> In either case, I would set each inode to use the corresponding policy
> by default for tmpfs files. If an application "knows better" than
> to use round-robin allocation in some circumstance, it could use
> the mbind() call to change the placement for a particular mmap'd tmpfs
> file.
>
> For the /dev/null and shm uses of tmpfs I would leave things as-is.
> The MPOL_DEFAULT policy will usually be appropriate in these cases,
> and mbind() can alter that situation on a case-by-case basis.
My thoughts exactly.
> So, the big decision is whether I should put the round-robining
> into tmpfs itself, or write the more general mechanism for the
> NUMA memory policy code.
I don't like using a global variable for this. The problem is that
while it is quite evenly distributed at the beginning, as soon as
pages get dropped you can end up with worst-case scenarios again.
I would prefer to use an "exact", but more global, approach. How about
something like (inodenumber + pgoff) % numnodes?
Anonymous memory can use the process pid instead of the inode number.
So basically I would suggest splitting alloc_page_vma() into two
new functions, one used for mmap faults, the other used for
read/write. The only difference would be the default policy when
the vma and process policies are NULL.
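Roughly, with made-up names, something like:

/* Stable, even placement derived from the object itself rather than
 * a global counter.  Sketch only. */
static int pagecache_default_node(unsigned long ino, unsigned long pgoff,
				  int nr_nodes)
{
	return (ino + pgoff) % nr_nodes;
}

/* mmap fault path: keep today's node-local default. */
struct page *alloc_page_vma_fault(unsigned int gfp,
				  struct vm_area_struct *vma,
				  unsigned long addr)
{
	return alloc_page_vma(gfp, vma, addr);
}

/* read/write path: when both the vma and process policies are NULL,
 * fall back to the interleaved default above. */
struct page *alloc_page_vma_rw(unsigned int gfp,
			       struct vm_area_struct *vma,
			       unsigned long addr,
			       unsigned long ino, unsigned long pgoff)
{
	/* (policy lookup elided) */
	return alloc_pages_node(pagecache_default_node(ino, pgoff,
						       num_online_nodes()),
				gfp, 0);
}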
-Andi
* Re: tmpfs round-robin NUMA allocation
2004-08-02 23:29 ` Andi Kleen
@ 2004-08-04 21:27 ` Brent Casavant
From: Brent Casavant @ 2004-08-04 21:27 UTC
To: Andi Kleen; +Cc: linux-mm, hugh
On Tue, 3 Aug 2004, Andi Kleen wrote:
> On Mon, 2 Aug 2004 17:52:52 -0500
> Brent Casavant <bcasavan@sgi.com> wrote:
>
> > The second, and more elegant, way of addressing the problem is to
> > create a new MPOL_ROUNDROBIN policy, which would be identical to
> > MPOL_INTERLEAVE, except it would use either a counter or rotor to
> > choose the node from which to allocate. This would probably be
> > just a bit more code than the previous idea, but would also provide
> > a more general facility that could be useful elsewhere.
[snip]
> I don't like the using a global variable for this. The problem
> is that it is quite evenly distributed at the beginning, as soon
> as pages get dropped you can end up with worst case scenarios again.
>
> I would prefer to use an "exact", but more global approach. How about
> something like (inodenumber + pgoff) % numnodes ?
> anonymous memory can use the process pid instead of inode number.
Perhaps I'm missing something here. How would using a value like
(inodenumber + pgoff) % numnodes help alleviate the problem of
memory becoming unbalanced as pages are dropped? The pages are
dropped when the file is deleted. For any given way of selecting
the node from which the allocation is made, there's probably a
pathological case where the memory can become unbalanced as files
are deleted.
I'm not really shooting for perfectly even page distribution -- just
something close enough that we don't end up with significant lumpiness.
What I'm trying to avoid is this situation from a freshly booted system:
--- cut here ---
Nid MemTotal MemFree MemUsed (in kB)
0 1940880 1416992 523888
1 1955840 1851904 103936
2 1955840 1875840 80000
8 1955840 1925408 30432
9 1955840 1397824 558016
10 1955840 1660096 295744
11 1955840 1924480 31360
12 1955824 1925696 30128
. . .
190 1955840 1930816 25024
191 1955840 1930816 25024
192 1955840 1930880 24960
193 1955824 1406624 549200
194 1955840 1929824 26016
248 1955840 1930496 25344
249 1955840 1930816 25024
250 1955840 1799776 156064
251 1955824 1930752 25072
. . .
--- cut here ---
Granted, in this particular example there are factors other than tmpfs
contributing to the problem (i.e. certain kernel hash tables), but I'm
tackling one problem at a time.
I can think of even better methods than round-robin to ensure a very
even distribution (e.g. a policy which allocates from the least used
node), but these all seem like a bit of overkill.
Thanks,
Brent
--
Brent Casavant bcasavan@sgi.com Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.