From: Andrea Arcangeli <andrea@suse.de>
To: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Subject: Re: [PATCH 01 of 16] remove nr_scan_inactive/active
Date: Sun, 10 Jun 2007 20:17:00 +0200
Message-ID: <20070610181700.GC7443@v2.random>
In-Reply-To: <466C36AE.3000101@redhat.com>

On Sun, Jun 10, 2007 at 01:36:46PM -0400, Rik van Riel wrote:
> Andrea Arcangeli wrote:
>
> >- else
> >+ nr_inactive = zone_page_state(zone, NR_INACTIVE) >> priority;
> >+ if (nr_inactive < sc->swap_cluster_max)
> > nr_inactive = 0;
>
> This is a problem.
>
> On workloads with lots of anonymous memory, for example
> running a very large JVM or simply stressing the system
> with AIM7, the inactive list can be very small.
>
> If dozens (or even hundreds) of tasks get into the
> pageout code simultaneously, they will all spend a lot
> of time moving pages from the active to the inactive
> list, but they will not even try to free any of the
> (few) inactive pages the system has!
>
> We have observed systems in stress tests that spent
> well over 10 minutes in shrink_active_list before
> the first call to shrink_inactive_list was made.
>
> Your code looks like it could exacerbate that situation,
> by not having zone->nr_scan_inactive increment between
> calls.

If all tasks spend 10 minutes in shrink_active_list before the first
call to shrink_inactive_list, that could mean you hit the very race I'm
trying to fix with this patch (i.e. nr_*active becoming huge because
the race triggers, and then thrashing over the few pages left in the
*active_list until the artificially boosted nr_*active finally goes
down to zero in all tasks that read it at the unlucky time when it was
huge). So my patch may actually fix your situation completely, if your
trouble was nr_scan_active becoming huge for no good reason, just
because many tasks entered the VM at the same time on big-SMP systems.
Did you monitor the real sizes of the active lists during those 10
minutes and compare them to the nr_active stored on the stack?
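
A minimal sketch of the interleaving described above, written as a
single-threaded model rather than kernel code. It assumes the shared
counter is updated as "add, read, reset" without a lock; the function
names and the "+ 1" in the per-call share are illustrative assumptions,
not taken from the patch.

```c
/* Per-call share of the list to scan at a given priority (hypothetical helper). */
static unsigned long scan_share(unsigned long list_pages, int priority)
{
	return (list_pages >> priority) + 1;
}

/* Each task completes add, read and reset before the next one starts. */
static unsigned long serialized_total(unsigned int ntasks,
				      unsigned long list_pages,
				      int priority, unsigned long batch)
{
	unsigned long shared = 0, total = 0;
	unsigned int i;

	for (i = 0; i < ntasks; i++) {
		shared += scan_share(list_pages, priority);
		if (shared >= batch) {
			total += shared;	/* this task scans what it read */
			shared = 0;
		}
	}
	return total;
}

/* Unlucky interleaving: every task adds its share before any task resets. */
static unsigned long unlucky_total(unsigned int ntasks,
				   unsigned long list_pages,
				   int priority, unsigned long batch)
{
	unsigned long shared = 0, total = 0;
	unsigned int i;

	for (i = 0; i < ntasks; i++)
		shared += scan_share(list_pages, priority);

	for (i = 0; i < ntasks; i++) {
		unsigned long nr_active = shared;	/* stack copy of the inflated value */

		if (nr_active >= batch)
			total += nr_active;
	}
	return total;
}
```

With 8 tasks, a 1000-page list, priority 2 and a batch of 32, the
serialized case asks for 2008 pages of scanning in total, while the
unlucky interleaving asks for 16064, far more than the list holds.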

Normally, if the highest priority passes only call into
shrink_active_list, that's because the two lists need rebalancing. But
I fail to see how it could ever take 10 minutes for the first
shrink_inactive_list to trigger with my patch applied, while if it
happens in current vanilla that could be the race triggering, or
otherwise something unrelated going wrong in the VM.

Overall this code seems quite flaky in its current "racy" form, so I
doubt it can be allowed to live as-is. In fact, even if we fix the race
with a slow shared lock in a fast path, or if we only make sure not to
exacerbate your situation with something as simple and lock-less as
"nr_active = min(sizeof_active_list, nr_scan_active)", I think it
would still be wrong to do more work in the current task if we have
other tasks helping us at the same time. We should do nothing more,
nothing less. So I think if we want those counters to avoid restarting
from zero at each priority step (which I understand is your worry),
those counters should be on the stack, task-local. That will still take
into account the previously unscanned "nr_inactive" value.
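
The two alternatives mentioned above can be sketched roughly as
follows. This is only an illustration under stated assumptions: the
helper names, the carry variable and the way the clamp is combined with
the carry are hypothetical, and the per-step share follows the
"list size >> priority" form of the quoted hunk.

```c
static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Lock-less clamp: never ask to scan more pages than the list holds. */
static unsigned long clamped_target(unsigned long list_pages,
				    unsigned long nr_scan)
{
	return min_ul(list_pages, nr_scan);
}

/*
 * Task-local carry: *carry lives on the stack of the reclaiming task and
 * is shared by all of that task's priority steps, so no other task can
 * inflate it. Returns the number of pages to scan at this step, or 0 if
 * the batch size has not been reached yet (the carry is kept for the
 * next step).
 */
static unsigned long local_scan_target(unsigned long *carry,
				       unsigned long list_pages,
				       int priority, unsigned long batch)
{
	unsigned long nr;

	*carry += list_pages >> priority;
	if (*carry < batch)
		return 0;

	nr = clamped_target(list_pages, *carry);
	*carry = 0;
	return nr;
}
```

For a 100-page list and a batch of 32, the step at priority 2
contributes 25 pages and scans nothing, and the step at priority 1 then
scans 75 pages (25 carried over plus 50), instead of restarting from
zero.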

Not sure what's best. I have the feeling that introducing task-local
*nr_active and *nr_inactive counters shared by all priority steps won't
move the VM needle much, but I sure wouldn't be against it. It will
change the balancing to be fairer, but in practice I don't expect
huge differences: there are only 12 steps anyway, so the inactive list
should be shrunk very quickly even if the active list is huge.
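
To make the priority steps concrete, here is a small sketch of the
arithmetic in the quoted hunk: at each priority the scan target is the
list size shifted right by the priority, and it is dropped to zero when
it is below the batch size. The function name is hypothetical;
DEF_PRIORITY is assumed to be 12, matching the steps mentioned above.

```c
#define DEF_PRIORITY 12

/*
 * Walk the priorities from DEF_PRIORITY down to 0 and return the first
 * one at which (list_pages >> priority) reaches the batch size, i.e. the
 * first step that would scan the list at all. Returns -1 if no step does.
 */
static int first_scanning_priority(unsigned long list_pages,
				   unsigned long batch)
{
	int priority;

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		if ((list_pages >> priority) >= batch)
			return priority;
	}
	return -1;
}
```

With a batch of 32, a list of 131072 pages is scanned from the very
first step (priority 12), a 1000-page list is first scanned at
priority 4, and a 20-page list never reaches the batch size at any
step when each step starts from zero.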

I'm only generally against the current per-zone, global and racy
approach without limits, which can potentially exacerbate your
situation when nr_active becomes huge despite the active list being
very small.

Thanks.