From: Larry Woodman <lwoodman@redhat.com>
To: Rik van Riel <riel@redhat.com>
Cc: kosaki.motohiro@jp.fujitsu.com, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	aarcange@redhat.com
Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone
Date: Fri, 11 Dec 2009 06:49:19 -0500
Message-ID: <4B2231BF.6040407@redhat.com>
In-Reply-To: <20091210185626.26f9828a@cuia.bos.redhat.com>

Rik van Riel wrote:
> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways.  The trouble starts when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
>
> Not only can the system suffer enormous slowdowns because of
> lock contention (and conditional reschedules) between thousands
> of processes in the page reclaim code, but each process will try
> to free up to SWAP_CLUSTER_MAX pages, even when the system already
> has lots of memory free.  In Larry's case, this resulted in over
> 6000 processes fighting over locks in the page reclaim code, even
> though the system already had 1.5GB of free memory.
>
> It should be possible to avoid both of those issues at once, by
> simply limiting how many processes are active in the page reclaim
> code simultaneously.
>
> If too many processes are active doing page reclaim in one zone,
> simply go to sleep in shrink_zone().
>
> On wakeup, check whether enough memory has been freed already
> before jumping into the page reclaim code ourselves.  We want
> to use the same threshold here that is used in the page allocator
> for deciding whether or not to call the page reclaim code in the
> first place, otherwise some unlucky processes could end up freeing
> memory for the rest of the system.
>
> Reported-by: Larry Woodman <lwoodman@redhat.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
>
> --- 
> This patch is against today's MMOTM tree. It has only been compile tested,
> I do not have an AIM7 system standing by.
>
> Larry, does this fix your issue?
>
>  Documentation/sysctl/vm.txt |   18 ++++++++++++++++++
>  include/linux/mmzone.h      |    4 ++++
>  include/linux/swap.h        |    1 +
>  kernel/sysctl.c             |    7 +++++++
>  mm/page_alloc.c             |    3 +++
>  mm/vmscan.c                 |   38 ++++++++++++++++++++++++++++++++++++++
>  6 files changed, 71 insertions(+)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index fc5790d..5cf766f 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/vm:
>  - legacy_va_layout
>  - lowmem_reserve_ratio
>  - max_map_count
> +- max_zone_concurrent_reclaimers
>  - memory_failure_early_kill
>  - memory_failure_recovery
>  - min_free_kbytes
> @@ -278,6 +279,23 @@ The default value is 65536.
>  
>  =============================================================
>  
> +max_zone_concurrent_reclaimers:
> +
> +The number of processes that are allowed to simultaneously reclaim
> +memory from a particular memory zone.
> +
> +With certain workloads, hundreds of processes end up in the page
> +reclaim code simultaneously.  This can cause large slowdowns due
> +to lock contention, freeing of way too much memory and occasionally
> +false OOM kills.
> +
> +To avoid these problems, only allow a smaller number of processes
> +to reclaim pages from each memory zone simultaneously.
> +
> +The default value is 8.
> +
> +=============================================================
> +
>  memory_failure_early_kill:
>  
>  Control how to kill processes when uncorrected memory error (typically
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 30fe668..ed614b8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -345,6 +345,10 @@ struct zone {
>  	/* Zone statistics */
>  	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
>  
> +	/* Number of processes running page reclaim code on this zone. */
> +	atomic_t		concurrent_reclaimers;
> +	wait_queue_head_t	reclaim_wait;
> +
>  	/*
>  	 * prev_priority holds the scanning priority for this zone.  It is
>  	 * defined as the scanning priority at which we achieved our reclaim
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a2602a8..661eec7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -254,6 +254,7 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
>  extern int vm_swappiness;
>  extern int remove_mapping(struct address_space *mapping, struct page *page);
>  extern long vm_total_pages;
> +extern int max_zone_concurrent_reclaimers;
>  
>  #ifdef CONFIG_NUMA
>  extern int zone_reclaim_mode;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6ff0ae6..89b919c 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1270,6 +1270,13 @@ static struct ctl_table vm_table[] = {
>  		.extra1		= &zero,
>  		.extra2		= &one,
>  	},
> +	{
> +		.procname	= "max_zone_concurrent_reclaimers",
> +		.data		= &max_zone_concurrent_reclaimers,
> +		.maxlen		= sizeof(max_zone_concurrent_reclaimers),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
>  #endif
>  
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 11ae66e..ca9cae1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3852,6 +3852,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>  
>  		zone->prev_priority = DEF_PRIORITY;
>  
> +		atomic_set(&zone->concurrent_reclaimers, 0);
> +		init_waitqueue_head(&zone->reclaim_wait);
> +
>  		zone_pcp_init(zone);
>  		for_each_lru(l) {
>  			INIT_LIST_HEAD(&zone->lru[l].list);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2bbee91..cf3ef29 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -40,6 +40,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/delayacct.h>
>  #include <linux/sysctl.h>
> +#include <linux/wait.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -129,6 +130,17 @@ struct scan_control {
>  int vm_swappiness = 60;
>  long vm_total_pages;	/* The total number of pages which the VM controls */
>  
> +/*
> + * Maximum number of processes concurrently running the page
> + * reclaim code in a memory zone.  Having too many processes
> + * just results in them burning CPU time waiting for locks,
> + * so we're better off limiting page reclaim to a sane number
> + * of processes at a time.  We do this per zone so local node
> + * reclaim on one NUMA node will not block other nodes from
> + * making progress.
> + */
> +int max_zone_concurrent_reclaimers = 8;
> +
>  static LIST_HEAD(shrinker_list);
>  static DECLARE_RWSEM(shrinker_rwsem);
>  
> @@ -1600,6 +1612,29 @@ static void shrink_zone(int priority, struct zone *zone,
>  	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
>  	int noswap = 0;
>  
> +	if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
> +					max_zone_concurrent_reclaimers) {
> +		/*
> +		 * Do not add to the lock contention if this zone has
> +		 * enough processes doing page reclaim already, since
> +		 * we would just make things slower.
> +		 */
> +		sleep_on(&zone->reclaim_wait);
> +
> +		/*
> +		 * If other processes freed enough memory while we waited,
> +		 * break out of the loop and go back to the allocator.
> +		 */
> +		if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
> +					0, 0)) {
> +			wake_up(&zone->reclaim_wait);
> +			sc->nr_reclaimed += nr_to_reclaim;
> +			return;
> +		}
> +	}
> +
> +	atomic_inc(&zone->concurrent_reclaimers);
> +
>  	/* If we have no swap space, do not bother scanning anon pages. */
>  	if (!sc->may_swap || (nr_swap_pages <= 0)) {
>  		noswap = 1;
> @@ -1655,6 +1690,9 @@ static void shrink_zone(int priority, struct zone *zone,
>  		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
>  
>  	throttle_vm_writeout(sc->gfp_mask);
> +
> +	atomic_dec(&zone->concurrent_reclaimers);
> +	wake_up(&zone->reclaim_wait);
>  }
>  
>  /*
>
>   
FYI everyone, I put together a similar patch (it doesn't block
explicitly; it relies on vm_throttle) for RHEL5 (a 2.6.18-based kernel),
and it fixes hangs caused by several CPUs spinning on zone->lru_lock.
This might eliminate the need to add complexity to the reclaim code.
It's impossible to make that code parallel enough to allow hundreds or
even thousands of processes in it at the same time.

I will backport this patch and test it, and also test it on the latest
kernel.

Larry
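
For readers who want to experiment with the idea outside the kernel, the
throttling scheme described in the quoted patch (cap the number of
concurrent reclaimers per zone, sleep when the cap is reached, and
recheck the free-memory threshold on wakeup) might be modelled in
userspace roughly as follows. This is only a sketch under stated
assumptions, not the patch's implementation: struct zone_throttle,
throttle_init(), throttle_enter() and throttle_exit() are hypothetical
names, a pthread mutex and condition variable stand in for the atomic
counter and wait queue, and free_pages/low_watermark are toy counters
standing in for zone_watermark_ok(). The sketch compares with >= and
does the check and the increment under one lock, so the cap is exact;
the posted patch compares with > and increments separately.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the per-zone state in the patch. */
struct zone_throttle {
	pthread_mutex_t lock;
	pthread_cond_t wait;    /* plays the role of zone->reclaim_wait */
	int reclaimers;         /* callers currently "reclaiming" */
	int max_reclaimers;     /* plays the role of the sysctl limit */
	long free_pages;        /* toy free-page counter */
	long low_watermark;     /* toy stand-in for the low watermark */
};

void throttle_init(struct zone_throttle *z, int max_reclaimers,
		   long free_pages, long low_watermark)
{
	pthread_mutex_init(&z->lock, NULL);
	pthread_cond_init(&z->wait, NULL);
	z->reclaimers = 0;
	z->max_reclaimers = max_reclaimers;
	z->free_pages = free_pages;
	z->low_watermark = low_watermark;
}

/*
 * Returns true if the caller should go on to reclaim (and must call
 * throttle_exit() afterwards).  Returns false if the caller had to wait
 * and enough memory was free by the time it woke up.
 */
bool throttle_enter(struct zone_throttle *z)
{
	pthread_mutex_lock(&z->lock);
	while (z->reclaimers >= z->max_reclaimers) {
		/* Enough reclaimers already; do not add to the contention. */
		pthread_cond_wait(&z->wait, &z->lock);

		/* Others may have freed enough memory while we slept. */
		if (z->free_pages >= z->low_watermark) {
			/* Pass the wakeup on to the next waiter. */
			pthread_cond_signal(&z->wait);
			pthread_mutex_unlock(&z->lock);
			return false;
		}
	}
	z->reclaimers++;
	pthread_mutex_unlock(&z->lock);
	return true;
}

/* Called when a reclaimer finishes; freed is the number of pages it freed. */
void throttle_exit(struct zone_throttle *z, long freed)
{
	pthread_mutex_lock(&z->lock);
	assert(z->reclaimers > 0);
	z->reclaimers--;
	z->free_pages += freed;
	pthread_cond_signal(&z->wait);
	pthread_mutex_unlock(&z->lock);
}
```

A caller would wrap its reclaim work in throttle_enter()/throttle_exit()
and return straight to its allocation attempt when throttle_enter()
returns false.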
