linux-mm.kvack.org archive mirror
* [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim
@ 2005-04-27 15:08 Martin Hicks
  2005-04-27 17:36 ` Nikita Danilov
  2005-04-28  6:33 ` Andrew Morton
  0 siblings, 2 replies; 12+ messages in thread
From: Martin Hicks @ 2005-04-27 15:08 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM; +Cc: Ray Bryant, ak

Hi,

The following set of patches is in response to the first round of comments
that were sent out in February:

http://marc.theaimsgroup.com/?l=linux-kernel&m=110839604924587&w=2

The consensus of this thread was that manual reclaim should happen
through a syscall and be per-node, and that automatic reclaim should
probably happen through mempolicy hints.

This set is against 2.6.12-rc2-mm2 (sorry for not being against -mm3;
it doesn't boot correctly on Altix).

The patches introduce two different ways to free up page cache from a
node: manually through a syscall and automatically through flag
modifiers to a mempolicy.

Currently if a job is started and there is page cache lying around on a
particular node then allocations will spill onto remote nodes and page
cache won't be reclaimed until the whole system is short on memory.
This can result in a significant performance hit for HPC applications
that planned on that memory being allocated locally.

Here's a little summary of the patches in the set:

1/4:  Merge LRU pages

This is the opposite of isolate_lru_pages().  It merges pages from
a list back onto the appropriate LRU lists.
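
Roughly, a sketch of what such a merge helper looks like (illustrative
only -- the function name and details below aren't lifted from the patch,
and the real code would also fix up the zone's nr_active/nr_inactive
counters):

/*
 * Illustrative sketch only: put isolated pages back on the LRU of
 * their zone, choosing the active or inactive list via PageActive().
 */
static void merge_lru_pages(struct zone *zone, struct list_head *list)
{
	struct page *page;

	spin_lock_irq(&zone->lru_lock);
	while (!list_empty(list)) {
		page = list_entry(list->prev, struct page, lru);
		list_del(&page->lru);
		if (PageActive(page))
			list_add(&page->lru, &zone->active_list);
		else
			list_add(&page->lru, &zone->inactive_list);
		/* the real patch would also adjust zone->nr_active
		 * and zone->nr_inactive here */
	}
	spin_unlock_irq(&zone->lru_lock);
}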

2/4:  Local reclaim core

The reclaim code.  It extends shrink_list() so it can be used to scan
the active list as well.  The core of all of this is
reclaim_clean_pages().  It tries to remove a specified number of pages
from a zone's cache.  It does this without swapping or doing writebacks.
The goal here is to free easily freeable pages.
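
In rough terms the core looks like this (a sketch, not the patch itself;
field names follow 2.6 vmscan.c, may_swap is the new flag this series
adds, and the real reclaim_clean_pages() is in patch 2/4):

/*
 * Sketch of the "free easily freeable pages" idea: run the normal
 * shrinker over one zone with writeback and swap disabled, so only
 * clean, unmapped page cache gets dropped.
 */
static int reclaim_clean_pages(struct zone *zone, int nr_pages,
			       unsigned int gfp_mask)
{
	struct scan_control sc = {
		.gfp_mask	  = gfp_mask,
		.may_writepage	  = 0,	/* never start I/O */
		.may_swap	  = 0,	/* never touch swap (new flag) */
		.nr_mapped	  = read_page_state(nr_mapped),
		.swap_cluster_max = max(nr_pages, SWAP_CLUSTER_MAX),
		.priority	  = 0,	/* scan aggressively */
	};

	shrink_zone(zone, &sc);
	return sc.nr_reclaimed;
}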

3/4:  toss_page_cache_node() syscall

This adds the manual reclaim method via a syscall.

4/4:  localreclaim flags for mempolicies

Adds a flags argument to set_mempolicy() and adds a new mempolicy.
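
From userspace the intended use is roughly the following; the flag name,
its value, and the 4-argument call below are placeholders for
illustration only -- the real names and ABI are defined in the patch:

#include <numaif.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* placeholder name and value -- see patch 4/4 for the real flag */
#define MPOL_F_RECLAIM	0x1

int main(void)
{
	unsigned long nodemask = 1UL << 0;	/* job runs on node 0 */

	/* bind to node 0 and hint that local page cache should be
	 * reclaimed before allocations spill onto remote nodes */
	if (syscall(__NR_set_mempolicy, MPOL_BIND, &nodemask,
		    8 * sizeof(nodemask), MPOL_F_RECLAIM) < 0)
		perror("set_mempolicy");

	/* ... run the memory-hungry part of the job ... */
	return 0;
}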

mh

-- 
Martin Hicks   ||   Silicon Graphics Inc.   ||   mort@sgi.com


* Re: [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim
  2005-04-27 15:08 [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim Martin Hicks
@ 2005-04-27 17:36 ` Nikita Danilov
  2005-04-28  6:33 ` Andrew Morton
  1 sibling, 0 replies; 12+ messages in thread
From: Nikita Danilov @ 2005-04-27 17:36 UTC (permalink / raw)
  To: Martin Hicks; +Cc: Ray Bryant, ak, linux-mm

Martin Hicks writes:

[...]

 > 
 > The reclaim code.  It extends shrink_list() so it can be used to scan
 > the active list as well.  The core of all of this is
 > reclaim_clean_pages().  It tries to remove a specified number of pages
 > from a zone's cache.  It does this without swapping or doing writebacks.
 > The goal here is to free easily freeable pages.

That's probably not very relevant for the scenario you describe, but
reclaiming clean pages first looks quite similar to the behavior Linux
had when the VM still had separate inactive_clean and inactive_dirty
queues. The problem with that approach was that by skipping dirty pages,
LRU ordering was destroyed, and the system soon started reclaiming hot
read-only pages while ignoring cold but dirty ones.

Nikita.


* Re: [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim
  2005-04-27 15:08 [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim Martin Hicks
  2005-04-27 17:36 ` Nikita Danilov
@ 2005-04-28  6:33 ` Andrew Morton
  2005-04-28 11:16   ` Nick Piggin
                     ` (2 more replies)
  1 sibling, 3 replies; 12+ messages in thread
From: Andrew Morton @ 2005-04-28  6:33 UTC (permalink / raw)
  To: Martin Hicks; +Cc: linux-mm, raybry, ak

Martin Hicks <mort@sgi.com> wrote:
>
> The patches introduce two different ways to free up page cache from a
>  node: manually through a syscall and automatically through flag
>  modifiers to a mempolicy.

Backing up and thinking about this a bit more....

>  Currently if a job is started and there is page cache lying around on a
>  particular node then allocations will spill onto remote nodes and page
>  cache won't be reclaimed until the whole system is short on memory.
>  This can result in a significant performance hit for HPC applications
>  that planned on that memory being allocated locally.

Why do it this way at all?

Is it not possible to change the page allocator's zone fallback mechanism
so that once the local node's zones' pages are all allocated, we don't
simply advance onto the next node?  Instead, could we not perform a bit of
reclaim on this node's zones first?  Only advance onto the next nodes if
things aren't working out?


* Re: [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim
  2005-04-28  6:33 ` Andrew Morton
@ 2005-04-28 11:16   ` Nick Piggin
  2005-04-28 11:56   ` Rik van Riel
  2005-05-03  7:17   ` Ray Bryant
  2 siblings, 0 replies; 12+ messages in thread
From: Nick Piggin @ 2005-04-28 11:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Martin Hicks, linux-mm, raybry, ak

Andrew Morton wrote:
> Martin Hicks <mort@sgi.com> wrote:
> 
>>The patches introduce two different ways to free up page cache from a
>> node: manually through a syscall and automatically through flag
>> modifiers to a mempolicy.
> 
> 
> Backing up and thinking about this a bit more....
> 
> 
>> Currently if a job is started and there is page cache lying around on a
>> particular node then allocations will spill onto remote nodes and page
>> cache won't be reclaimed until the whole system is short on memory.
>> This can result in a significant performance hit for HPC applications
>> that planned on that memory being allocated locally.
> 
> 
> Why do it this way at all?
> 
> Is it not possible to change the page allocator's zone fallback mechanism
> so that once the local node's zones' pages are all allocated, we don't
> simply advance onto the next node?  Instead, could we not perform a bit of
> reclaim on this node's zones first?  Only advance onto the next nodes if
> things aren't working out?

Yeah. I've got a patch that does this. It is quite possible - you have
to strike some balance so you don't go to shit on workloads that have
a working set larger than a single node's memory, but...

-- 
SUSE Labs, Novell Inc.



* Re: [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim
  2005-04-28  6:33 ` Andrew Morton
  2005-04-28 11:16   ` Nick Piggin
@ 2005-04-28 11:56   ` Rik van Riel
  2005-04-28 12:53     ` Martin Hicks
  2005-05-03  7:17   ` Ray Bryant
  2 siblings, 1 reply; 12+ messages in thread
From: Rik van Riel @ 2005-04-28 11:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Martin Hicks, linux-mm, raybry, ak

On Wed, 27 Apr 2005, Andrew Morton wrote:

> Is it not possible to change the page allocator's zone fallback mechanism
> so that once the local node's zones' pages are all allocated, we don't
> simply advance onto the next node?  Instead, could we not perform a bit of
> reclaim on this node's zones first?  Only advance onto the next nodes if
> things aren't working out?

IMHO that's the best idea.  The patches posted add new
mechanisms to the VM and have the potential to disturb
LRU ordering quite a bit - which could make the VM
worse under load.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


* Re: [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim
  2005-04-28 11:56   ` Rik van Riel
@ 2005-04-28 12:53     ` Martin Hicks
  0 siblings, 0 replies; 12+ messages in thread
From: Martin Hicks @ 2005-04-28 12:53 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-mm, raybry, ak

On Thu, Apr 28, 2005 at 07:56:07AM -0400, Rik van Riel wrote:
> On Wed, 27 Apr 2005, Andrew Morton wrote:
> 
> > Is it not possible to change the page allocator's zone fallback mechanism
> > so that once the local node's zones' pages are all allocated, we don't
> > simply advance onto the next node?  Instead, could we not perform a bit of
> > reclaim on this node's zones first?  Only advance onto the next nodes if
> > things aren't working out?
> 
> IMHO that's the best idea.  The patches posted add new
> mechanisms to the VM and have the potential to disturb
> LRU ordering quite a bit - which could make the VM
> worse under load.

I'd like to see Nick's patch.  Through the mempolicy the patch does take
the approach of freeing memory on the preferred node before going
offnode.  I agree that the patch disturbs LRU ordering.  The reason that
I have to destroy LRU ordering is so that I don't have to scan through
the same Dirty/Locked/whatever pages on the tail of the LRU list during
each call to reclaim_clean_pages().

mh

--
Martin Hicks   ||   Silicon Graphics Inc.   ||   mort@sgi.com


* Re: [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim
  2005-04-28  6:33 ` Andrew Morton
  2005-04-28 11:16   ` Nick Piggin
  2005-04-28 11:56   ` Rik van Riel
@ 2005-05-03  7:17   ` Ray Bryant
  2005-05-03  8:08     ` Andrew Morton
  2 siblings, 1 reply; 12+ messages in thread
From: Ray Bryant @ 2005-05-03  7:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Martin Hicks, linux-mm, raybry, ak

Andrew Morton wrote:
> Martin Hicks <mort@sgi.com> wrote:
> 
>>The patches introduce two different ways to free up page cache from a
>> node: manually through a syscall and automatically through flag
>> modifiers to a mempolicy.
> 
> 
> Backing up and thinking about this a bit more....
> 
> 
>> Currently if a job is started and there is page cache lying around on a
>> particular node then allocations will spill onto remote nodes and page
>> cache won't be reclaimed until the whole system is short on memory.
>> This can result in a significant performance hit for HPC applications
>> that planned on that memory being allocated locally.
> 
> 
> Why do it this way at all?
> 
> Is it not possible to change the page allocator's zone fallback mechanism
> so that once the local node's zones' pages are all allocated, we don't
> simply advance onto the next node?  Instead, could we not perform a bit of
> reclaim on this node's zones first?  Only advance onto the next nodes if
> things aren't working out?
> 

Effectively, that is what we are trying to do with this set of patches.

Let me see if I can describe the problem we are trying to solve a little
more clearly, and to explain how we got to this particular set of patches.

Before we start on that, however, it is important to understand that this
is a crucial performance optimization for certain kinds of workloads
(admittedly only on NUMA hardware).  That is why we have made this a
controllable policy that would be enabled only for those workloads where
it makes sense.  When the policy is not enabled, the code is neutral with
respect to VM algorithms.  It is not expected that this code would be
enabled for a traditional workload where LRU aging is important.  So,
while it is true that the proposed patch does modify LRU ordering, that
should not be a fundamental argument against this patchset, since for
workloads where keeping the LRU ordering correct is important, the page
cache reclaim code would not be enabled.

Secondly, I would observe that I have run benchmarks of OpenMP applications
with and without these types of page cache reclaiming optimizations.  If we
don't have the kind of operations needed (more on the scenarios below) there
can be a 30-40% reduction in performance due to the fact that storage which
the application believes is local to a thread is actually allocated on a
remote node.  So the optimizations proposed here do matter, and they can
be quite significant.

So what is the problem we are trying to solve?
----------------------------------------------

We are trying to fix the "stickiness" of page-cache pages on NUMA systems
for workloads where local allocation of storage is crucial.  (Note well,
this is not all workloads.)  In many cases, caching disk data in memory is
very important to performance, so the correct tradeoff to make in most
cases is to allocate remotely when a local page is not available rather
than to look for local pages that could be freed instead.

However, the typical scenario we run up against is the following:  We start
up a long running parallel application.  As part of the application work flow,
a large amount of data is staged (copied) from a distributed file system
to higher speed local storage.  The data being copied can be 10's to 100's
of GB.  This data is brought into the page cache and the pages become
cleaned through the normal I/O process.  Remember the curse of a large
NUMA machine is that there is lots of memory, but practically none of it
is local.  (e.g. on a 512 CPU Altix with two CPUs per node, if each node
has the same amount of memory, only 1/256th of the memory available is
local.)

So what happens due to copying this data is that a non-trivial number
(but not all) of the nodes on the machine become completely filled
with page cache.  Now when the parallel application starts, it pins
processes to nodes and tries to allocate storage local to those processes.
This is required for good performance -- the applications are optimized to
place heavily referenced data in storage that the application expects to be
local to the thread.  Since some of the nodes are full of page cache, the
processes that are running on those nodes don't get local storage and hence
run more slowly.  We then run up against the second rule of parallel
processing:  A parallel application only runs as quickly as its slowest
thread.  So performance of the entire parallel job is impacted because a few
of the threads didn't get the local storage they expected.

What we have done for our current production kernels to work around this
problem is to implement "manual" page cache reclaim.  This is the
toss_page_cache_nodes patch that we previously submitted.  The disadvantage
of that patch is that it is a "big hammer".  It causes all clean page-cache
pages on the nodes to be released.

The idea of the current patch is to only reclaim as much clean page-cache as
is required for the application, by reacting to allocation requests and
freeing storage proportional to these requests.

Why must this be an optionally controlled component of the VM?
--------------------------------------------------------------

Because this is fundamentally a workload-dependent optimization.  Many
workloads want the normal VM algorithms to apply.  Caching data is
important, and until the entire system is under memory pressure,
it makes sense to keep that data in storage.  New page allocation
requests that come in and that can be allocated remotely should be
allocated on a remote node since the system has no way of knowing
how important getting local storage is to the application.  (Equivalently,
the O/S has no way of knowing how long and how intensely the newly
allocated page will be used.  So it cannot make an intelligent
tradeoff about where to allocate the page.)

Effectively, the interface we are proposing here is a way of telling
the system that for this application, getting local storage is more
important than caching data.  It needs to be optional because this
trade off does not apply to all applications.  But for our parallel
application, which may run for 10's to 100's of hours, getting local
storage is crucial and the O/S should work quite hard to try to
allocate local storage.  The time spent doing that now will be more
than made up for by the increased efficiency of the application during
its long run.  So we need a way to tell the O/S that this is
the case for this particular application.

Why can't the VM system just figure this out?
---------------------------------------------

One of the common responses to changes in the VM system for optimizations
of this type is that we instead should devote our efforts to improving
the VM system algorithms and that we are taking an "easy way out" by
putting a hack into the VM system.  Fundamentally, the VM system cannot
predict the future behavior of the application in order to correctly
make this tradeoff.  Since the application programmer (in this environment)
typically knows a lot about the behavior of the application, it simply
makes sense to allow the developer a way of telling the operating system
what is about to happen rather than having the O/S make a guess.

Without this interface, the O/S's tradeoff will normally be to allocate 
remotely if local storage is not available.  In the past, it has been 
suggested that the way to react to improper local/remote storage is to watch 
the application (using storage reference counters in the memory interconnect,
for example) and to find pages that appear to be incorrectly placed and
to move those pages.  (This is the so-called "Automatic Page Migration"
problem.)  Our experience at SGI with such algorithms is that they don't
work very well.  Part of the reason is that the penalty for making a
mistake is very high -- moving a page takes a long time, and if you
move it to the wrong node you can be very sorry.  The other part of the
problem is that by using sampling-based methods to figure out page
placement, you only have partially correct information, and this leads
to occasionally making mistakes.  The combination of these two factors
results in poor placement decisions and a corresponding poorly
performing system.

A further problem is that sampling is historical rather
than predictive.  Just when the O/S has enough samples to make a
migration decision, the computation can start a new phase, possibly
invalidating the decision the operating system has made, without the
operating system's knowledge.  So it does the wrong thing.

Why isn't it good enough to use the synchronous page cache reclaim path?
-------------------------------------------------------------------------

There are basically two reasons: (1) we have found it to be too slow
(running the entire synchronous reclaim path on even a moderately
large Altix system can take 10's of minutes), and (2) it is indiscriminate
in that it can also free mapped pages, and we want to keep those around.
Effectively what we are looking for here is a way to tell the VM system that
allocating local storage is more important to this application than caching
clean file system pages.

(Setting vm_swappiness=0 doesn't do this correctly because it is global to
the system rather than per-application, and in certain cases we have found
that setting vm_swappiness=0 can cause the VM system to live-lock if the
system then becomes overcommitted with mapped pages.)

Why isn't POSIX_FADV_DONTNEED good enough here?
----------------------------------------------

We've tried that too.  If the application is sufficiently aware of what
files it has opened, it could schedule those page cache pages to be
released.  Unfortunately, this doesn't handle the case of the last
application that ran and wrote out a bunch of data before it terminated,
nor does it deal very well with shell scripts that stage data onto and
off of the compute node as part of the job's workflow.
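
For completeness, the per-file variant looks something like this (a
minimal sketch; the caller has to already know which files were staged,
which is exactly the problem described above):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Ask the kernel to drop the clean page cache for one staged file. */
static int drop_file_cache(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		perror(path);
		return -1;
	}
	fdatasync(fd);	/* dirty pages would otherwise just be skipped */
	if (posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) != 0)
		fprintf(stderr, "posix_fadvise failed for %s\n", path);
	close(fd);
	return 0;
}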

So how did we end up with this particular set of patches?
--------------------------------------------------------

This set of patches is based, in part, on experience with our 2.4.21
based kernels.  Those kernels had an "automatic page cache reclaim"
facility, and our benchmarks have shown that it is almost as good as
using the "manual page cache reclaim" approach we previously proposed.
Our experience with those kernels was that using the synchronous reclaim
path was too slow, so we special-cased the search with code that
paralleled the existing code but would only release clean page-cache
pages.

For 2.6.x, we didn't want code that duplicated much of the VM path
in a separate routine, but instead wanted to slightly modify the
existing VM routines so they would only release clean page-cache
pages and not release mapped storage.  Hence, the extensions that
were proposed to the "scan control" structure.

Originally, we wanted to start with the "manual page cache release"
code we previously proposed, but that got shot down, so here we are
with the "automatic page cache release" approach.

I hope this all helps, rather than hinders, the discussion of Martin's
patchset.  Discussion, complaints, and flames all happily accepted
by yours truly,

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------


* Re: [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim
  2005-05-03  7:17   ` Ray Bryant
@ 2005-05-03  8:08     ` Andrew Morton
  2005-05-03 13:21       ` Martin Hicks
  2005-05-12 18:53       ` Martin Hicks
  0 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2005-05-03  8:08 UTC (permalink / raw)
  To: Ray Bryant; +Cc: mort, linux-mm, raybry, ak

Ray Bryant <raybry@engr.sgi.com> wrote:
>
> ...
> One of the common responses to changes in the VM system for optimizations
> of this type is that we instead should devote our efforts to improving
> the VM system algorithms and that we are taking an "easy way out" by
> putting a hack into the VM system.

There's that plus the question which forever lurks around funky SGI patches:

	How many machines in the world want this feature?

Because if the answer is "twelve" then gee it becomes hard to justify
merging things into the mainline kernel.  Particularly when they add
complexity to page reclaim.

>  Fundamentally, the VM system cannot
> predict the future behavior of the application in order to correctly
> make this tradeoff.

Yup.  But we could add a knob to each zone which says, during page
allocation "be more reluctant to advance onto the next node - do some
direct reclaim instead"

And the good thing about that is that it is an easier merge because it's a
simpler patch and because it's useful to more machines.  People can tune it
and get better (or worse) performance from existing apps on NUMA.

Yes, if it's a "simple" patch then it _might_ do a bit of swapout or
something.  But the VM does prefer to reclaim clean pagecache first (as
well as slab, which is a bonus for this approach).

Worth trying, at least?

> 
> Why isn't POSIX_FADV_DONTNEED good enough here?
> ----------------------------------------------

I was going to ask that.

> We've tried that too.  If the application is sufficiently aware of what
> files it has opened, it could schedule those page cache pages to be
> released.  Unfortunately, this doesn't handle the case of the last
> application that ran and wrote out a bunch of data before it terminated,
> nor does it deal very well with shell scripts that stage data onto and
> off of the compute node as part the job's workflow.

Ah.  But to do this we need to be able to answer the question "what files
are in pagecache, and how much pagecache do they have".  (And "on what
nodes", but let's ignore that coz it's hard ;)) And something like this
would be an easier merge because it's useful to more than twelve machines.

It could be done in userspace, really.  Hack into glibc's open() and
creat() to log file opening activity, something silly like that.
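
Something along these lines, say (sketch only -- a real shim would also
want to wrap creat(), open64() and friends):

/* log_open.c: build with  gcc -shared -fPIC log_open.c -o log_open.so -ldl
 * and run the job under  LD_PRELOAD=./log_open.so  to record which files
 * it touches, so their pagecache can be dropped afterwards. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>

int open(const char *path, int flags, ...)
{
	static int (*real_open)(const char *, int, ...);
	mode_t mode = 0;

	if (!real_open)
		real_open = (int (*)(const char *, int, ...))
				dlsym(RTLD_NEXT, "open");

	if (flags & O_CREAT) {
		va_list ap;

		va_start(ap, flags);
		mode = (mode_t)va_arg(ap, int);
		va_end(ap);
	}
	fprintf(stderr, "opened: %s\n", path);
	return real_open(path, flags, mode);
}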


* Re: [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim
  2005-05-03  8:08     ` Andrew Morton
@ 2005-05-03 13:21       ` Martin Hicks
  2005-05-04  1:23         ` Andrew Morton
  2005-05-12 18:53       ` Martin Hicks
  1 sibling, 1 reply; 12+ messages in thread
From: Martin Hicks @ 2005-05-03 13:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ray Bryant, mort, linux-mm, ak

On Tue, May 03, 2005 at 01:08:46AM -0700, Andrew Morton wrote:
> Ray Bryant <raybry@engr.sgi.com> wrote:
> >
> > ...
> > One of the common responses to changes in the VM system for optimizations
> > of this type is that we instead should devote our efforts to improving
> > the VM system algorithms and that we are taking an "easy way out" by
> > putting a hack into the VM system.
> 
> There's that plus the question which forever lurks around funky SGI patches:
> 
> 	How many machines in the world want this feature?
> 
> Because if the answer is "twelve" then gee it becomes hard to justify
> merging things into the mainline kernel.  Particularly when they add
> complexity to page reclaim.

And vendors seem hesitant because it isn't upstream.... chicken?  egg?

> 
> >  Fundamentally, the VM system cannot
> > predict the future behavior of the application in order to correctly
> > make this tradeoff.
> 
> Yup.  But we could add a knob to each zone which says, during page
> allocation "be more reluctant to advance onto the next node - do some
> direct reclaim instead"
> 
> And the good thing about that is that it is an easier merge because it's a
> simpler patch and because it's useful to more machines.  People can tune it
> and get better (or worse) performance from existing apps on NUMA.

The problem is that it really can't be a machine-wide policy.  This is
something that, at the very least, has to be limited to a cpuset.  I
chose to use the mempolicy infrastructure because this seemed like the
best method for sending hints to the allocator, based on the first discussion.

> Yes, if it's a "simple" patch then it _might_ do a bit of swapout or
> something.  But the VM does prefer to reclaim clean pagecache first (as
> well as slab, which is a bonus for this approach).
> 
> Worth trying, at least?

Well, another limitation of this is that we then only get inactive pages
reclaimed.  When the reclaim policy is in place the allocator is going
to ignore LRU and try really hard to get local memory.

mh

-- 
Martin Hicks   ||   Silicon Graphics Inc.   ||   mort@sgi.com


* Re: [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim
  2005-05-03 13:21       ` Martin Hicks
@ 2005-05-04  1:23         ` Andrew Morton
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2005-05-04  1:23 UTC (permalink / raw)
  To: Martin Hicks; +Cc: raybry, linux-mm, ak

Martin Hicks <mort@sgi.com> wrote:
>
> 
> On Tue, May 03, 2005 at 01:08:46AM -0700, Andrew Morton wrote:
> > Ray Bryant <raybry@engr.sgi.com> wrote:
> > >
> > > ...
> > > One of the common responses to changes in the VM system for optimizations
> > > of this type is that we instead should devote our efforts to improving
> > > the VM system algorithms and that we are taking an "easy way out" by
> > > putting a hack into the VM system.
> > 
> > There's that plus the question which forever lurks around funky SGI patches:
> > 
> > 	How many machines in the world want this feature?
> > 
> > Because if the answer is "twelve" then gee it becomes hard to justify
> > merging things into the mainline kernel.  Particularly when they add
> > complexity to page reclaim.
> 
> And vendors seem hesitant because it isn't upstream.... chicken?  egg?
> 

That's between SGI and vendors, to some extent.  Generally, yes, I very
much want to keep vendor trees and the public tree in sync.  But a patch
like this is relatively intrusive, adds to long-term maintenance cost and
on the other hand is extremely specialised.  It's really hard to justify
adding this work to the public tree, IMO.

Which is why I'd like to see whether you can come up with something which
is either useful to a wider range of users or which adds less maintenance
complexity.



* Re: [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim
  2005-05-03  8:08     ` Andrew Morton
  2005-05-03 13:21       ` Martin Hicks
@ 2005-05-12 18:53       ` Martin Hicks
  2005-05-12 18:57         ` Martin Hicks
  1 sibling, 1 reply; 12+ messages in thread
From: Martin Hicks @ 2005-05-12 18:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ray Bryant, linux-mm, ak

On Tue, May 03, 2005 at 01:08:46AM -0700, Andrew Morton wrote:
> 
> Yup.  But we could add a knob to each zone which says, during page
> allocation "be more reluctant to advance onto the next node - do some
> direct reclaim instead"
> 
> And the good thing about that is that it is an easier merge because it's a
> simpler patch and because it's useful to more machines.  People can tune it
> and get better (or worse) performance from existing apps on NUMA.
> 
> Yes, if it's a "simple" patch then it _might_ do a bit of swapout or
> something.  But the VM does prefer to reclaim clean pagecache first (as
> well as slab, which is a bonus for this approach).
> 
> Worth trying, at least?

So, I did this as an exercise.  A few things came up:

1)  If you just call directly into the reclaim code then it swaps a LOT.
I stuck my "don't swap" flag back in, just to see what would happen.  It
works a lot better if you can tell it to just not swap.

2)  With a per-zone on/off flag for reclaim, I then ran into
trouble where the allocator always reclaims pages, even when it
shouldn't.  Filling pagecache with files will start reclaiming from the
preferred zone as soon as the zone fills, leaving the rest of the zones
unused.

My last patch, using mempolicies, got this right because the core
kernel, which wasn't set to use reclaim, would just allocate off-node
for stuff like page cache pages.

3)  This patch has no code that limits the amount of scanning that is done
under really heavy memory stress.  A "make -j" kernel build takes more
time to complete than I'm willing to wait, while a stock kernel does
complete the run in 15-20 minutes.

Scanning too much is really the biggest problem.  I want to keep using
refill_inactive_list(), so that I don't futz with the LRU ordering or
resort to reclaiming active pages like I was doing in my old patch.

4) Under trivial tests, this patch helps NUMA machines get local memory
more often.  The silly test was to just fill node 0 with page cache and
then run a "make -j8" kernbench test on node 0 (a 2-CPU node).

Without zone reclaiming turned on, all memory allocations go to node 1.
With the reclaiming on, page cache is reclaimed and gcc gets all local
memory.

This is a real problem.  We even see it on modest 8p/32G build servers
because there is lots of pagecache kicking around and a lot of the
allocations end up being remote.

zone reclaiming on:

Average Optimal -j 8 Load Run:
Elapsed Time 703.87
User Time 1337.77
System Time 47.94
Percent CPU 196
Context Switches 73669
Sleeps 58874

zone reclaiming off:

Average Optimal -j 8 Load Run:
Elapsed Time 741.22
User Time 1396.97
System Time 65.14
Percent CPU 197
Context Switches 73211
Sleeps 58996

mh

-- 
Martin Hicks   ||   Silicon Graphics Inc.   ||   mort@sgi.com


* Re: [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim
  2005-05-12 18:53       ` Martin Hicks
@ 2005-05-12 18:57         ` Martin Hicks
  0 siblings, 0 replies; 12+ messages in thread
From: Martin Hicks @ 2005-05-12 18:57 UTC (permalink / raw)
  To: Martin Hicks; +Cc: Andrew Morton, Ray Bryant, linux-mm, ak

On Thu, May 12, 2005 at 02:53:02PM -0400, Martin Hicks wrote:
> 
> So, I did this as an exercise.  A few things came up:

and this time here's the patch.  It's against something like
2.6.12-rc3-mm3.

mh

Index: linux-2.6.12-rc3/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.12-rc3.orig/arch/ia64/kernel/entry.S	2005-05-12 10:07:56.000000000 -0700
+++ linux-2.6.12-rc3/arch/ia64/kernel/entry.S	2005-05-12 10:08:14.000000000 -0700
@@ -1573,7 +1573,7 @@ sys_call_table:
 	data8 sys_keyctl
 	data8 sys_ni_syscall
 	data8 sys_ni_syscall			// 1275
-	data8 sys_ni_syscall
+	data8 sys_set_zone_reclaim
 	data8 sys_ni_syscall
 	data8 sys_ni_syscall
 	data8 sys_ni_syscall
Index: linux-2.6.12-rc3/include/linux/mmzone.h
===================================================================
--- linux-2.6.12-rc3.orig/include/linux/mmzone.h	2005-05-12 10:07:56.000000000 -0700
+++ linux-2.6.12-rc3/include/linux/mmzone.h	2005-05-12 10:12:20.000000000 -0700
@@ -163,6 +163,12 @@ struct zone {
 	int temp_priority;
 	int prev_priority;
 
+	/*
+	 * Does the zone try to reclaim before allowing the allocator
+	 * to try the next zone?
+	 */
+	int reclaim_pages;
+	int reclaim_pages_failed;
 
 	ZONE_PADDING(_pad2_)
 	/* Rarely used or read-mostly fields */
Index: linux-2.6.12-rc3/include/linux/swap.h
===================================================================
--- linux-2.6.12-rc3.orig/include/linux/swap.h	2005-05-12 10:07:56.000000000 -0700
+++ linux-2.6.12-rc3/include/linux/swap.h	2005-05-12 10:08:14.000000000 -0700
@@ -173,6 +173,7 @@ extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
 extern int try_to_free_pages(struct zone **, unsigned int, unsigned int);
+extern int zone_reclaim(struct zone *, unsigned int, unsigned int);
 extern int shrink_all_memory(int);
 extern int vm_swappiness;
 
Index: linux-2.6.12-rc3/mm/page_alloc.c
===================================================================
--- linux-2.6.12-rc3.orig/mm/page_alloc.c	2005-05-12 10:07:56.000000000 -0700
+++ linux-2.6.12-rc3/mm/page_alloc.c	2005-05-12 10:13:30.000000000 -0700
@@ -349,6 +349,7 @@ free_pages_bulk(struct zone *zone, int c
 
 	spin_lock_irqsave(&zone->lock, flags);
 	zone->all_unreclaimable = 0;
+	zone->reclaim_pages_failed = 0;
 	zone->pages_scanned = 0;
 	while (!list_empty(list) && count--) {
 		page = list_entry(list->prev, struct page, lru);
@@ -761,14 +762,29 @@ __alloc_pages(unsigned int __nocast gfp_
  restart:
 	/* Go through the zonelist once, looking for a zone with enough free */
 	for (i = 0; (z = zones[i]) != NULL; i++) {
-
-		if (!zone_watermark_ok(z, order, z->pages_low,
-				       classzone_idx, 0, 0))
-			continue;
+		int do_reclaim = z->reclaim_pages;
 
 		if (!cpuset_zone_allowed(z))
 			continue;
 
+		/*
+		 * If the zone is to attempt early page reclaim then this loop
+		 * will try to reclaim pages and check the watermark a second
+		 * time before giving up and falling back to the next zone.
+		 */
+	zone_reclaim_retry:
+		if (!zone_watermark_ok(z, order, z->pages_low,
+				       classzone_idx, 0, 0)) {
+			if (!do_reclaim)
+				continue;
+			else {
+				zone_reclaim(z, gfp_mask, order);
+				/* Only try reclaim once */
+				do_reclaim = 0;
+				goto zone_reclaim_retry;
+			}
+		}
+
 		page = buffered_rmqueue(z, order, gfp_mask);
 		if (page)
 			goto got_pg;
Index: linux-2.6.12-rc3/mm/vmscan.c
===================================================================
--- linux-2.6.12-rc3.orig/mm/vmscan.c	2005-05-12 10:07:56.000000000 -0700
+++ linux-2.6.12-rc3/mm/vmscan.c	2005-05-12 10:11:31.000000000 -0700
@@ -73,6 +73,7 @@ struct scan_control {
 	unsigned int gfp_mask;
 
 	int may_writepage;
+	int may_swap;
 
 	/* This context's SWAP_CLUSTER_MAX. If freeing memory for
 	 * suspend, we effectively ignore SWAP_CLUSTER_MAX.
@@ -414,7 +415,7 @@ static int shrink_list(struct list_head 
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
 		 */
-		if (PageAnon(page) && !PageSwapCache(page)) {
+		if (PageAnon(page) && !PageSwapCache(page) && sc->may_swap) {
 			void *cookie = page->mapping;
 			pgoff_t index = page->index;
 
@@ -944,6 +945,7 @@ int try_to_free_pages(struct zone **zone
 
 	sc.gfp_mask = gfp_mask;
 	sc.may_writepage = 0;
+	sc.may_swap = 1;
 
 	inc_page_state(allocstall);
 
@@ -1044,6 +1046,7 @@ loop_again:
 	total_reclaimed = 0;
 	sc.gfp_mask = GFP_KERNEL;
 	sc.may_writepage = 0;
+	sc.may_swap = 1;
 	sc.nr_mapped = read_page_state(nr_mapped);
 
 	inc_page_state(pageoutrun);
@@ -1335,3 +1338,69 @@ static int __init kswapd_init(void)
 }
 
 module_init(kswapd_init)
+
+
+/*
+ * Try to free up some pages from this zone through reclaim.
+ */
+int zone_reclaim(struct zone *zone, unsigned int gfp_mask, unsigned int order)
+{
+	struct scan_control sc;
+	int nr_pages = 1 << order;
+	int priority;
+	int total_reclaimed = 0;
+
+	/* The reclaim may sleep, so don't do it if sleep isn't allowed */
+	if (!(gfp_mask & __GFP_WAIT))
+		return 0;
+	if (zone->reclaim_pages_failed)
+		return 0;
+
+	sc.gfp_mask = gfp_mask;
+	sc.may_writepage = 0;
+	sc.may_swap = 0;
+	sc.nr_mapped = read_page_state(nr_mapped);
+	sc.nr_scanned = 0;
+	sc.nr_reclaimed = 0;
+	sc.priority = 0;  /* scan at the highest priority */
+
+	if (nr_pages > SWAP_CLUSTER_MAX)
+		sc.swap_cluster_max = nr_pages;
+	else
+		sc.swap_cluster_max = SWAP_CLUSTER_MAX;
+
+	shrink_zone(zone, &sc);
+
+	total_reclaimed = sc.nr_reclaimed;
+	if (total_reclaimed < nr_pages)
+		zone->reclaim_pages_failed = 1;
+	return total_reclaimed;
+}
+
+asmlinkage long sys_set_zone_reclaim(unsigned int node, unsigned int zone,
+				     unsigned int state)
+{
+	struct zone *z;
+	int i;
+
+	if (node >= MAX_NUMNODES || !node_online(node))
+		return -EINVAL;
+
+	/* This will break if we ever add more zones */
+	if (!(zone & (1<<ZONE_DMA|1<<ZONE_NORMAL|1<<ZONE_HIGHMEM)))
+		return -EINVAL;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		if (!(zone & 1<<i))
+			continue;
+
+		z = &NODE_DATA(node)->node_zones[i];
+
+		if (state)
+			z->reclaim_pages = 1;
+		else
+			z->reclaim_pages = 0;
+	}
+
+	return 0;
+}
Index: linux-2.6.12-rc3/kernel/sys_ni.c
===================================================================
--- linux-2.6.12-rc3.orig/kernel/sys_ni.c	2005-05-12 10:07:56.000000000 -0700
+++ linux-2.6.12-rc3/kernel/sys_ni.c	2005-05-12 10:09:18.000000000 -0700
@@ -77,6 +77,7 @@ cond_syscall(sys_request_key);
 cond_syscall(sys_keyctl);
 cond_syscall(compat_sys_keyctl);
 cond_syscall(compat_sys_socketcall);
+cond_syscall(sys_set_zone_reclaim);
 
 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read);
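
For anyone who wants to poke at it from userspace, something like this
flips the per-zone flag (1276 is the ia64 slot the patch fills; the
__NR_ macro name below is just local to the example):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define __NR_set_zone_reclaim	1276	/* ia64 slot used by the patch */

/* usage: set_zone_reclaim <node> <zone-bitmask> <0|1> */
int main(int argc, char **argv)
{
	unsigned int node, zones, state;

	if (argc != 4) {
		fprintf(stderr, "usage: %s <node> <zone-mask> <0|1>\n",
			argv[0]);
		return 1;
	}
	node  = strtoul(argv[1], NULL, 0);
	zones = strtoul(argv[2], NULL, 0);
	state = strtoul(argv[3], NULL, 0);

	if (syscall(__NR_set_zone_reclaim, node, zones, state) < 0) {
		perror("set_zone_reclaim");
		return 1;
	}
	return 0;
}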




Thread overview: 12+ messages
2005-04-27 15:08 [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim Martin Hicks
2005-04-27 17:36 ` Nikita Danilov
2005-04-28  6:33 ` Andrew Morton
2005-04-28 11:16   ` Nick Piggin
2005-04-28 11:56   ` Rik van Riel
2005-04-28 12:53     ` Martin Hicks
2005-05-03  7:17   ` Ray Bryant
2005-05-03  8:08     ` Andrew Morton
2005-05-03 13:21       ` Martin Hicks
2005-05-04  1:23         ` Andrew Morton
2005-05-12 18:53       ` Martin Hicks
2005-05-12 18:57         ` Martin Hicks
