Re: [PATCH 01 of 24] remove nr_scan_inactive/active

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Andrew Morton <akpm@linux-foundation.org>
To: Andrea Arcangeli <andrea@suse.de>
Cc: linux-mm@kvack.org, David Rientjes <rientjes@google.com>
Subject: Re: [PATCH 01 of 24] remove nr_scan_inactive/active
Date: Wed, 12 Sep 2007 04:44:50 -0700	[thread overview]
Message-ID: <20070912044450.cef400fa.akpm@linux-foundation.org> (raw)
In-Reply-To: <c8ec651562ad6514753e.1187786928@v2.random>

On Wed, 22 Aug 2007 14:48:48 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778124 -7200
> # Node ID c8ec651562ad6514753e408596e30d7d9e448a51
> # Parent  b03dfad58a311488ec373c30fd5dc97dc03aecae
> remove nr_scan_inactive/active
> 
> The older atomic_add/atomic_set were pointless (atomic_set vs atomic_add would
> race), but removing them didn't actually remove the race, the race is still
> there, for the same reasons atomic_add/set couldn't prevent it. This is really
> the kind of code that I dislike because it's sort of buggy, and it shouldn't be
> making any measurable difference and when it does something for real it can
> only hurt!
> 
> The real focus is on shrink_zone (ignore the other places where it's being used
> that are even less interesting). Assume two tasks adds to nr_scan_*active at
> the same time (first line of the old buggy code), they'll effectively double their
> scan rate, for no good reason. What can happen is that instead of scanning
> nr_entries each, they'll scan nr_entries*2 each. The more CPUs the bigger the
> race and the higher the multiplication effect and the harder it will be to
> detect oom. In the case that nr_*active < sc->swap_cluster_max, regardless of
> whatever future invocation of alloc_pages, we'll be going down in the
> priorities in the current alloc_pages invocation if the DEF_PRIORITY was too
> high to make any work, so again accumulating the nr_scan_*active doesn't seem
> interesting even when it's smaller than sc->swap_cluster_max. Each task should
> work for itself without much care of what the others are doing.

You're coming at this from the wrong end of town.  The code in there is to
address small zones (actually small LRU lists) at "easy" scanning
priorities.  I suspect you just broke it in that region of operation.

Does that above text describe something which you've observed and measured
in practice, or is it theoretical-from-code-inspection?

> ...
>

We go from this:

	/*
	 * Add one to `nr_to_scan' just to make sure that the kernel will
	 * slowly sift through the active list.
	 */
	zone->nr_scan_active +=
		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
	nr_active = zone->nr_scan_active;
	if (nr_active >= sc->swap_cluster_max)
		zone->nr_scan_active = 0;
	else
		nr_active = 0;

	zone->nr_scan_inactive +=
		(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
	nr_inactive = zone->nr_scan_inactive;
	if (nr_inactive >= sc->swap_cluster_max)
		zone->nr_scan_inactive = 0;
	else
		nr_inactive = 0;

	while (nr_active || nr_inactive) {

to this:

	/*
	 * Add one to `nr_to_scan' just to make sure that the kernel will
	 * slowly sift through the active list.
	 */
	nr_active = zone_page_state(zone, NR_ACTIVE) >> priority;
	if (nr_active < sc->swap_cluster_max)
		nr_active = 0;
	nr_inactive = zone_page_state(zone, NR_INACTIVE) >> priority;
	if (nr_inactive < sc->swap_cluster_max)
		nr_inactive = 0;

	while (nr_active || nr_inactive) {

I have issues.

The old code took care of the situtaion where zone_page_state(zone,
NR_ACTIVE) is smaller than (1 << priority): do a bit of reclaim in that
case anyway.  This is a minor issue, as we'll at least perform some
scanning when priority is low.  But you should have depeted the now-wrong
comment.

More serious issue: the logic in there takes care of balancing a small LRU
list.  If (zone_page_state(zone, NR_ACTIVE)>>priority) is, umm, "3" then
we'll add "3" into zone->nr_scan_active and then leave the zone alone. 
Once we've done this enough times, the "3"s will add up to something which
is larger than swap_cluster_max and then we'll do a round of scanning for
real.

Your change breaks that logic and there is potential that a small LRU will
be underscanned, especially when reclaim is not under distress.

I don't know how serious this change is, but it's a change for the worse
and it would take quite a bit of thought and careful testing to be able to
justify this change.

According to the above-described logic, one would think that it would be
more accurate to replace the existing

	if (nr_active >= sc->swap_cluster_max)
		zone->nr_scan_active = 0;

with

	if (nr_active >= sc->swap_cluster_max)
		zone->nr_scan_active -= sc->swap_cluster_max;

and for twelve seconds on 12 March 2004 we were partially doing that, but
then I merged this:

commit 4d5e349b89e4017ddbdbd06345e94c59e8b851b7
Author: akpm <akpm>
Date:   Fri Mar 12 16:25:24 2004 +0000

    [PATCH] fix vm-batch-inactive-scanning.patch

    - prevent nr_scan_inactive from going negative

    - compare `count' with SWAP_CLUSTER_MAX, not `max_scan'

    - Use ">= SWAP_CLUSTER_MAX", not "> SWAP_CLUSTER_MAX".

    BKrev: 4051e474u37Zwj2o6Q5o5NeVCL-5kQ

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb86cb2..65824df 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -757,14 +757,14 @@ shrink_zone(struct zone *zone, int max_s
 	ratio = (unsigned long)SWAP_CLUSTER_MAX * zone->nr_active /
 				((zone->nr_inactive | 1) * 2);
 	atomic_add(ratio+1, &zone->nr_scan_active);
-	if (atomic_read(&zone->nr_scan_active) > SWAP_CLUSTER_MAX) {
+	count = atomic_read(&zone->nr_scan_active);
+	if (count >= SWAP_CLUSTER_MAX) {
 		/*
 		 * Don't try to bring down too many pages in one attempt.
 		 * If this fails, the caller will increase `priority' and
 		 * we'll try again, with an increased chance of reclaiming
 		 * mapped memory.
 		 */
-		count = atomic_read(&zone->nr_scan_active);
 		if (count > SWAP_CLUSTER_MAX * 4)
 			count = SWAP_CLUSTER_MAX * 4;
 		atomic_set(&zone->nr_scan_active, 0);
@@ -773,8 +773,8 @@ shrink_zone(struct zone *zone, int max_s

 	atomic_add(max_scan, &zone->nr_scan_inactive);
 	count = atomic_read(&zone->nr_scan_inactive);
-	if (max_scan > SWAP_CLUSTER_MAX) {
-		atomic_sub(count, &zone->nr_scan_inactive);
+	if (count >= SWAP_CLUSTER_MAX) {
+		atomic_set(&zone->nr_scan_inactive, 0);
 		return shrink_cache(zone, gfp_mask, count, total_scanned);
 	}
 	return 0;

which made both the inactive and active list scanning the same (and
inaccurate).

So I'm thinking that a correct fix to all these problems is to go back to
atomics and to not just set the counters to zero, but to subtract the
number-of-scanned-pages from them as we're supposed to so.

An alternative approach might be to only touch nr_scanned_[in]active at all
when (zone_page_state(zone, NR_ACTIVE) >> priority) is less than
(1<<priority).  So most of the time we'll just go in there and scan the
full swap_cluster_max pages.  And the nr_scan_[in]active counters are
purely used as "fractional" counters to prevent the underscanning in the
corner cases to which I referred above.

Yet another alternative approach would be to remove the batching
altogether.  If (zone_page_state(zone, NR_ACTIVE) >> priority) evaluates to
"3", well, just go in and scan three pages.  That should address any
accuracy problems and it will address the problem which you're addressing,
but it will add unknown-but-probably-small computational cost.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2007-09-12 11:44 UTC|newest]

Thread overview: 113+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
2007-08-22 12:48 ` [PATCH 01 of 24] remove nr_scan_inactive/active Andrea Arcangeli
2007-09-12 11:44   ` Andrew Morton [this message]
2008-01-02 17:50     ` Andrea Arcangeli
2007-08-22 12:48 ` [PATCH 02 of 24] avoid oom deadlock in nfs_create_request Andrea Arcangeli
2007-09-12 23:54   ` Christoph Lameter
2007-08-22 12:48 ` [PATCH 03 of 24] prevent oom deadlocks during read/write operations Andrea Arcangeli
2007-09-12 11:56   ` Andrew Morton
2007-09-12  2:18     ` Nick Piggin
2008-01-03  0:53     ` Andrea Arcangeli
2007-08-22 12:48 ` [PATCH 04 of 24] serialize oom killer Andrea Arcangeli
2007-09-12 12:02   ` Andrew Morton
2007-09-12 12:04     ` Andrew Morton
2007-09-12 12:11       ` Andrea Arcangeli
2008-01-03  0:55     ` Andrea Arcangeli
2007-09-13  0:09   ` Christoph Lameter
2007-09-13 18:32     ` David Rientjes
2007-09-13 18:37       ` Christoph Lameter
2007-09-13 18:46         ` David Rientjes
2007-09-13 18:53           ` Christoph Lameter
2007-09-14  0:36             ` David Rientjes
2007-09-14  2:31               ` Christoph Lameter
2007-09-14  3:33                 ` David Rientjes
2007-09-18 16:44                   ` David Rientjes
2007-09-18 16:44                     ` [patch 1/4] oom: move prototypes to appropriate header file David Rientjes
2007-09-18 16:44                       ` [patch 2/4] oom: move constraints to enum David Rientjes
2007-09-18 16:44                         ` [patch 3/4] oom: save zonelist pointer for oom killer calls David Rientjes
2007-09-18 16:44                           ` [patch 4/4] oom: serialize out of memory calls David Rientjes
2007-09-18 19:54                             ` Christoph Lameter
2007-09-18 19:56                               ` David Rientjes
2007-09-18 20:01                                 ` Christoph Lameter
2007-09-18 20:06                                   ` David Rientjes
2007-09-18 20:23                                     ` [patch 5/4] oom: rename serialization helper functions David Rientjes
2007-09-18 20:26                                       ` Christoph Lameter
2007-09-18 20:39                                         ` [patch 5/4 v2] " David Rientjes
2007-09-18 20:59                                           ` Christoph Lameter
2007-09-18 19:57                           ` [patch 3/4] oom: save zonelist pointer for oom killer calls Christoph Lameter
2007-09-18 20:13                             ` David Rientjes
2007-09-18 20:16                               ` Christoph Lameter
2007-09-18 20:47                                 ` [patch 6/4] oom: pass null to kfree if zonelist is not cleared David Rientjes
2007-09-18 21:01                                   ` Christoph Lameter
2007-09-18 21:13                                     ` David Rientjes
2007-09-18 21:25                                       ` Christoph Lameter
2007-09-18 22:16                                         ` David Rientjes
2007-09-19 17:09                                           ` Paul Jackson
2007-09-19 18:21                                             ` David Rientjes
2007-09-18 19:55                         ` [patch 2/4] oom: move constraints to enum Christoph Lameter
2007-08-22 12:48 ` [PATCH 05 of 24] avoid selecting already killed tasks Andrea Arcangeli
2007-09-13  0:13   ` Christoph Lameter
2007-08-22 12:48 ` [PATCH 06 of 24] reduce the probability of an OOM livelock Andrea Arcangeli
2007-09-12 12:17   ` Andrew Morton
2008-01-03  1:03     ` Andrea Arcangeli
2007-08-22 12:48 ` [PATCH 07 of 24] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli
2007-09-12 12:18   ` Andrew Morton
2007-09-13  0:26     ` Christoph Lameter
2007-08-22 12:48 ` [PATCH 08 of 24] don't depend on PF_EXITING tasks to go away Andrea Arcangeli
2007-09-12 12:20   ` Andrew Morton
2008-01-03  0:56     ` Andrea Arcangeli
2007-08-22 12:48 ` [PATCH 09 of 24] fallback killing more tasks if tif-memdie doesn't " Andrea Arcangeli
2007-09-12 12:30   ` Andrew Morton
2007-09-12 12:34     ` Andrew Morton
2008-01-03  1:06     ` Andrea Arcangeli
2007-08-22 12:48 ` [PATCH 10 of 24] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli
2007-09-12 12:42   ` Andrew Morton
2007-09-13  0:36     ` Christoph Lameter
2007-09-21 19:10   ` David Rientjes
2008-01-03  1:08     ` Andrea Arcangeli
2007-08-22 12:48 ` [PATCH 11 of 24] the oom schedule timeout isn't needed with the VM_is_OOM logic Andrea Arcangeli
2007-09-12 12:44   ` Andrew Morton
2007-08-22 12:48 ` [PATCH 12 of 24] show mem information only when a task is actually being killed Andrea Arcangeli
2007-09-12 12:49   ` Andrew Morton
2007-08-22 12:49 ` [PATCH 13 of 24] simplify oom heuristics Andrea Arcangeli
2007-09-12 12:52   ` Andrew Morton
2007-09-12 13:40     ` Andrea Arcangeli
2007-09-12 20:52       ` Andrew Morton
2007-08-22 12:49 ` [PATCH 14 of 24] oom select should only take rss into account Andrea Arcangeli
2007-09-13  0:43   ` Christoph Lameter
2007-08-22 12:49 ` [PATCH 15 of 24] limit reclaim if enough pages have been freed Andrea Arcangeli
2007-09-12 12:57   ` Andrew Morton
2008-01-03  1:12     ` Andrea Arcangeli
2007-09-12 12:58   ` Andrew Morton
2007-09-12 13:38     ` Andrea Arcangeli
2007-08-22 12:49 ` [PATCH 16 of 24] avoid some lock operation in vm fast path Andrea Arcangeli
2007-09-12 12:59   ` Andrew Morton
2007-09-13  0:49     ` Christoph Lameter
2007-09-13  1:16       ` Andrew Morton
2007-09-13  1:33         ` Christoph Lameter
2007-09-13  1:41           ` KAMEZAWA Hiroyuki
2007-09-13  1:44           ` Andrew Morton
2007-08-22 12:49 ` [PATCH 17 of 24] apply the anti deadlock features only to global oom Andrea Arcangeli
2007-09-12 13:02   ` Andrew Morton
2007-09-13  0:53     ` Christoph Lameter
2007-09-13  0:52   ` Christoph Lameter
2007-08-22 12:49 ` [PATCH 18 of 24] run panic the same way in both places Andrea Arcangeli
2007-09-13  0:54   ` Christoph Lameter
2007-08-22 12:49 ` [PATCH 19 of 24] cacheline align VM_is_OOM to prevent false sharing Andrea Arcangeli
2007-09-12 13:02   ` Andrew Morton
2007-09-12 13:36     ` Andrea Arcangeli
2007-09-13  0:55       ` Christoph Lameter
2007-08-22 12:49 ` [PATCH 20 of 24] extract deadlock helper function Andrea Arcangeli
2007-08-22 12:49 ` [PATCH 21 of 24] select process to kill for cpusets Andrea Arcangeli
2007-09-12 13:05   ` Andrew Morton
2007-09-13  0:59     ` Christoph Lameter
2007-09-13  5:13       ` David Rientjes
2007-09-13 17:55         ` Christoph Lameter
2007-08-22 12:49 ` [PATCH 22 of 24] extract select helper function Andrea Arcangeli
2007-08-22 12:49 ` [PATCH 23 of 24] serialize for cpusets Andrea Arcangeli
2007-09-12 13:10   ` Andrew Morton
2007-09-12 13:34     ` Andrea Arcangeli
2007-09-12 19:08     ` David Rientjes
2007-09-13  1:02     ` Christoph Lameter
2007-08-22 12:49 ` [PATCH 24 of 24] add oom_kill_asking_task flag Andrea Arcangeli
2007-09-12 13:11   ` Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070912044450.cef400fa.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=andrea@suse.de \
    --cc=linux-mm@kvack.org \
    --cc=rientjes@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox