From: "Ray Bryant" <raybry@mpdtxmail.amd.com>
To: Andi Kleen <ak@suse.de>
Cc: Martin Hicks <mort@sgi.com>, Ingo Molnar <mingo@elte.hu>,
Linux MM <linux-mm@kvack.org>, Andrew Morton <akpm@osdl.org>,
torvalds@osdl.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] VM: add vm.free_node_memory sysctl
Date: Fri, 5 Aug 2005 12:45:58 -0500
Message-ID: <200508051245.59528.raybry@mpdtxmail.amd.com>
In-Reply-To: <20050803200808.GE8266@wotan.suse.de>
On Wednesday 03 August 2005 15:08, Andi Kleen wrote:
> >
> > Hmmm.... What happens if there are already mapped pages (e. g. mapped in
> > the sense that pages are mapped into an address space) on the node and
> > you want to allocate some more, but can't because the node is full of
> > clean page cache pages? Then one would have to set the memhog argument
> > to the right thing to
>
> If you have a bind policy in the memory grabbing program then the standard
> try_to_free_pages should DTRT. That is because we generated a custom zone
> list only containing nodes in that zone and the zone reclaim only looks
> into those.
>
It may depend on what your definition of DTRT is here. :-)
As I understand things, suppose a node already has some mapped memory
allocated, and one starts up a numactl -bind node memhog nodesize-slop so
as to clear some clean page cache pages from that node. Unless the "slop"
is sized in proportion to the amount of mapped memory in use on the node,
the existing mapped memory will get swapped out in order to satisfy the
new request. In addition, clean page-cache pages will get discarded. I
think what Martin and I would prefer to see is an interface that allows
one to just get rid of the clean page cache (or at least enough of it) so
that additional mapped page allocations will occur locally to the node
without causing swapping.
AFAIK, the number of mapped pages on the node is not exported to user space
(by, for example, /sys). So there is no good way to size the "slop" to
allow for an existing allocation. If there were, then using a bound memory
hog would likely be a reasonable replacement for Martin's syscall to release
all free page cache, at least for small to medium sized systems.
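
To make the sizing problem above concrete, here is a minimal sketch of the
arithmetic a bound memory hog would need. The function name and all three
parameters are hypothetical, and mapped_bytes is assumed to come from a
per-node interface that, as noted above, is not available.

```c
#include <stdint.h>

/*
 * Hypothetical helper: how many bytes a node-bound memhog could grab
 * without pushing the node's already-mapped pages out to swap.
 *
 * node_bytes   - total memory on the node
 * mapped_bytes - memory already mapped on the node (assumed known)
 * slop_bytes   - safety margin left untouched
 *
 * Returns 0 when nothing can be grabbed safely.
 */
static uint64_t memhog_target_bytes(uint64_t node_bytes,
                                    uint64_t mapped_bytes,
                                    uint64_t slop_bytes)
{
    uint64_t reserved = mapped_bytes + slop_bytes;

    /* Guard against wrap-around and against a reservation that
     * already covers the whole node. */
    if (reserved < mapped_bytes || reserved >= node_bytes)
        return 0;

    return node_bytes - reserved;
}
```

Without a trustworthy value for mapped_bytes the result is only a guess,
which is the gap described above.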
> With preferred or other policies it's different though, in those cases
> t_t_f_p will also look into other nodes because the policy is not binding.
>
> That said it might even be possible to make non bind policies more
> aggressive at freeing in the current node before looking into other nodes.
> I think the zone balancing has been mostly tuned on non NUMA systems, so
> some improvements might be possible here.
>
> Most people don't use BIND and changing the default policies like this
> might give NUMA systems a better "out of the box" experience. However this
> memory balance is very subtle code and easy to break, so this would need
> some care.
>
Of course!
> I don't think sysctls or new syscalls are the way to go here though.
>
The reason we ended up with a sysctl/syscall (to control the aggressiveness
with which __alloc_pages will try to free page cache before spilling) is that
deciding whether or not to spend the effort to free up page cache pages on
the local node before spilling is a workload-dependent optimization. For
an HPC application it is typically worth the effort to try to free local
node page cache before spilling off node, because the program will run
long enough for the improvement due to getting local storage to dominate
the extra cost of doing the page allocation. For file server
workloads, for example, it is typically important to minimize the time to do
the page allocation; if it turns out to be on a remote node it really doesn't
matter that much. So it seems to me that we need some way for the
application to tell the system which approach it prefers based on the type of
workload it is -- hence the sysctl or syscall approach.
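
As an illustration only, the per-workload choice described above might be
sketched like this. The enum, the function, and its parameters are all
hypothetical; this is not the interface from Martin's patch or any
existing kernel code.

```c
#include <stdbool.h>

/* Hypothetical per-workload preference, set by the application
 * or the administrator. */
enum spill_pref {
    SPILL_QUICKLY,        /* e.g. file server: minimize allocation time */
    RECLAIM_LOCAL_FIRST   /* e.g. HPC: pay up front for local memory */
};

/*
 * Decide whether to try freeing clean page cache on the local node
 * before spilling the allocation to a remote node.
 */
static bool should_reclaim_locally(enum spill_pref pref,
                                   unsigned long clean_cache_pages,
                                   unsigned long pages_needed)
{
    if (pref != RECLAIM_LOCAL_FIRST)
        return false;

    /* Only worth the effort if discarding clean page cache alone
     * could satisfy the request, i.e. without swapping. */
    return clean_cache_pages >= pages_needed;
}
```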
> -Andi
--
Ray Bryant
AMD Performance Labs Austin, Tx
512-602-0038 (o) 512-507-7807 (c)