* [PATCH 1/4] Properly account for the number of page cache pages zone_reclaim() can reclaim
2009-06-09 17:01 [PATCH 0/4] [RFC] Functional fix to zone_reclaim() and bring behaviour more in line with expectations V2 Mel Gorman
@ 2009-06-09 17:01 ` Mel Gorman
2009-06-09 18:15 ` Rik van Riel
2009-06-10 1:19 ` Wu Fengguang
2009-06-09 17:01 ` [PATCH 2/4] Do not unconditionally treat zones that fail zone_reclaim() as full Mel Gorman
` (2 subsequent siblings)
3 siblings, 2 replies; 25+ messages in thread
From: Mel Gorman @ 2009-06-09 17:01 UTC (permalink / raw)
To: Mel Gorman, KOSAKI Motohiro, Rik van Riel, Christoph Lameter,
yanmin.zhang, Wu Fengguang, linuxram
Cc: linux-mm, LKML
On NUMA machines, the administrator can configure zone_reclaim_mode, which
is a more targeted form of direct reclaim. On machines with large NUMA
distances, for example, zone_reclaim_mode defaults to 1, meaning that clean
unmapped pages will be reclaimed if the zone watermarks are not being met.
There is a heuristic that determines whether the scan is worthwhile, but the
problem is that the heuristic is not being properly applied and basically
assumes zone_reclaim_mode is 1 whenever it is enabled.
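(For context, the reclaim_mode bits referred to throughout this patch are
defined in mm/vmscan.c roughly as follows; comments approximate.)

#define RECLAIM_OFF 0
#define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
#define RECLAIM_SWAP (1<<2)	/* Swap pages out during reclaim */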
Historically, once enabled, the heuristic depended on NR_FILE_PAGES, which may
include swapcache pages that the reclaim_mode cannot deal with. Patch
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
KOSAKI Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
pages that were not file-backed, such as swapcache, and made a calculation
based on the inactive, active and mapped file pages. This is far superior
when zone_reclaim==1, but if RECLAIM_SWAP is set then NR_FILE_PAGES is a
reasonable starting figure.
This patch alters how zone_reclaim() works out how many pages it might be
able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set in
the reclaim_mode, it considers NR_FILE_PAGES as potential candidates;
otherwise it uses NR_{IN}ACTIVE_FILE - NR_FILE_MAPPED to discount swapcache
and other non-file-backed pages. If RECLAIM_WRITE is not set, then
NR_FILE_DIRTY pages are not candidates. If RECLAIM_SWAP is not set, then
NR_FILE_MAPPED pages are not.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/vmscan.c | 52 ++++++++++++++++++++++++++++++++++++++--------------
1 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2ddcfc8..2bfc76e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2333,6 +2333,41 @@ int sysctl_min_unmapped_ratio = 1;
*/
int sysctl_min_slab_ratio = 5;
+static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
+{
+ return zone_page_state(zone, NR_INACTIVE_FILE) +
+ zone_page_state(zone, NR_ACTIVE_FILE) -
+ zone_page_state(zone, NR_FILE_MAPPED);
+}
+
+/* Work out how many page cache pages we can reclaim in this reclaim_mode */
+static inline long zone_pagecache_reclaimable(struct zone *zone)
+{
+ long nr_pagecache_reclaimable;
+ long delta = 0;
+
+ /*
+ * If RECLAIM_SWAP is set, then all file pages are considered
+ * potentially reclaimable. Otherwise, we have to worry about
+ * pages like swapcache and zone_unmapped_file_pages() provides
+ * a better estimate
+ */
+ if (zone_reclaim_mode & RECLAIM_SWAP)
+ nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
+ else
+ nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
+
+ /* If we can't clean pages, remove dirty pages from consideration */
+ if (!(zone_reclaim_mode & RECLAIM_WRITE))
+ delta += zone_page_state(zone, NR_FILE_DIRTY);
+
+ /* Beware of double accounting */
+ if (delta < nr_pagecache_reclaimable)
+ nr_pagecache_reclaimable -= delta;
+
+ return nr_pagecache_reclaimable;
+}
+
/*
* Try to free up some pages from this zone through reclaim.
*/
@@ -2355,7 +2390,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
.isolate_pages = isolate_pages_global,
};
unsigned long slab_reclaimable;
- long nr_unmapped_file_pages;
disable_swap_token();
cond_resched();
@@ -2368,11 +2402,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
- nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
- zone_page_state(zone, NR_ACTIVE_FILE) -
- zone_page_state(zone, NR_FILE_MAPPED);
-
- if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
+ if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
/*
* Free memory by calling shrink zone with increasing
* priorities until we have enough memory freed.
@@ -2419,8 +2449,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
int node_id;
int ret;
- long nr_unmapped_file_pages;
- long nr_slab_reclaimable;
/*
* Zone reclaim reclaims unmapped file backed pages and
@@ -2432,12 +2460,8 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* if less than a specified percentage of the zone is used by
* unmapped file backed pages.
*/
- nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
- zone_page_state(zone, NR_ACTIVE_FILE) -
- zone_page_state(zone, NR_FILE_MAPPED);
- nr_slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
- if (nr_unmapped_file_pages <= zone->min_unmapped_pages &&
- nr_slab_reclaimable <= zone->min_slab_pages)
+ if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
+ zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
return 0;
if (zone_is_all_unreclaimable(zone))
--
1.5.6.5
* Re: [PATCH 1/4] Properly account for the number of page cache pages zone_reclaim() can reclaim
2009-06-09 17:01 ` [PATCH 1/4] Properly account for the number of page cache pages zone_reclaim() can reclaim Mel Gorman
@ 2009-06-09 18:15 ` Rik van Riel
2009-06-10 1:19 ` Wu Fengguang
1 sibling, 0 replies; 25+ messages in thread
From: Rik van Riel @ 2009-06-09 18:15 UTC (permalink / raw)
To: Mel Gorman
Cc: KOSAKI Motohiro, Christoph Lameter, yanmin.zhang, Wu Fengguang,
linuxram, linux-mm, LKML
Mel Gorman wrote:
> This patch alters how zone_reclaim() works out how many pages it might be
> able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set
> in the reclaim_mode it will either consider NR_FILE_PAGES as potential
> candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
> swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
> then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
> not set, then NR_FILE_MAPPED are not.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
* Re: [PATCH 1/4] Properly account for the number of page cache pages zone_reclaim() can reclaim
2009-06-09 17:01 ` [PATCH 1/4] Properly account for the number of page cache pages zone_reclaim() can reclaim Mel Gorman
2009-06-09 18:15 ` Rik van Riel
@ 2009-06-10 1:19 ` Wu Fengguang
2009-06-10 7:31 ` KOSAKI Motohiro
2009-06-10 10:31 ` Mel Gorman
1 sibling, 2 replies; 25+ messages in thread
From: Wu Fengguang @ 2009-06-10 1:19 UTC (permalink / raw)
To: Mel Gorman
Cc: KOSAKI Motohiro, Rik van Riel, Christoph Lameter, Zhang, Yanmin,
linuxram, linux-mm, LKML
On Wed, Jun 10, 2009 at 01:01:41AM +0800, Mel Gorman wrote:
> On NUMA machines, the administrator can configure zone_reclaim_mode that
> is a more targetted form of direct reclaim. On machines with large NUMA
> distances for example, a zone_reclaim_mode defaults to 1 meaning that clean
> unmapped pages will be reclaimed if the zone watermarks are not being met.
>
> There is a heuristic that determines if the scan is worthwhile but the
> problem is that the heuristic is not being properly applied and is basically
> assuming zone_reclaim_mode is 1 if it is enabled.
>
> Historically, once enabled it was depending on NR_FILE_PAGES which may
> include swapcache pages that the reclaim_mode cannot deal with. Patch
> vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
> Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
> pages that were not file-backed such as swapcache and made a calculation
> based on the inactive, active and mapped files. This is far superior
> when zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
> reasonable starting figure.
>
> This patch alters how zone_reclaim() works out how many pages it might be
> able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set
> in the reclaim_mode it will either consider NR_FILE_PAGES as potential
> candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
> swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
> then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
> not set, then NR_FILE_MAPPED are not.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Christoph Lameter <cl@linux-foundation.org>
> ---
> mm/vmscan.c | 52 ++++++++++++++++++++++++++++++++++++++--------------
> 1 files changed, 38 insertions(+), 14 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2ddcfc8..2bfc76e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2333,6 +2333,41 @@ int sysctl_min_unmapped_ratio = 1;
> */
> int sysctl_min_slab_ratio = 5;
>
> +static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
> +{
> + return zone_page_state(zone, NR_INACTIVE_FILE) +
> + zone_page_state(zone, NR_ACTIVE_FILE) -
> + zone_page_state(zone, NR_FILE_MAPPED);
This may underflow if too many tmpfs pages are mapped: mapped tmpfs pages
are counted in NR_FILE_MAPPED but live on the swap-backed LRU lists, so
they are not part of NR_INACTIVE_FILE/NR_ACTIVE_FILE.
> +}
> +
> +/* Work out how many page cache pages we can reclaim in this reclaim_mode */
> +static inline long zone_pagecache_reclaimable(struct zone *zone)
> +{
> + long nr_pagecache_reclaimable;
> + long delta = 0;
> +
> + /*
> + * If RECLAIM_SWAP is set, then all file pages are considered
> + * potentially reclaimable. Otherwise, we have to worry about
> + * pages like swapcache and zone_unmapped_file_pages() provides
> + * a better estimate
> + */
> + if (zone_reclaim_mode & RECLAIM_SWAP)
> + nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> + else
> + nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
> +
> + /* If we can't clean pages, remove dirty pages from consideration */
> + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> + delta += zone_page_state(zone, NR_FILE_DIRTY);
> +
> + /* Beware of double accounting */
The double accounting happens for NR_FILE_MAPPED but not for
NR_FILE_DIRTY (dirty tmpfs pages won't be accounted there), so this comment
is more suitable for zone_unmapped_file_pages(). But the double
accounting does affect this abstraction. So a more reasonable
sequence could be to first subtract NR_FILE_DIRTY and then
conditionally subtract NR_FILE_MAPPED?
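A rough sketch of that sequence, for illustration only (this assumes
NR_FILE_PAGES as the single starting figure, which is an interpretation,
and is untested):

/* Illustrative sketch: subtract dirty pages first, then mapped pages */
static inline long zone_pagecache_reclaimable(struct zone *zone)
{
	long nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
	long delta = 0;

	/* If we can't write pages back, dirty pages are not candidates */
	if (!(zone_reclaim_mode & RECLAIM_WRITE))
		delta += zone_page_state(zone, NR_FILE_DIRTY);

	/* If we can't swap, mapped file pages are not candidates either */
	if (!(zone_reclaim_mode & RECLAIM_SWAP))
		delta += zone_page_state(zone, NR_FILE_MAPPED);

	/* Guard against underflow from per-cpu counter drift */
	if (delta < nr_pagecache_reclaimable)
		nr_pagecache_reclaimable -= delta;

	return nr_pagecache_reclaimable;
}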
Or better to introduce a new counter NR_TMPFS_MAPPED to fix this mess?
Thanks,
Fengguang
> + if (delta < nr_pagecache_reclaimable)
> + nr_pagecache_reclaimable -= delta;
> +
> + return nr_pagecache_reclaimable;
> +}
> +
> /*
> * Try to free up some pages from this zone through reclaim.
> */
> @@ -2355,7 +2390,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> .isolate_pages = isolate_pages_global,
> };
> unsigned long slab_reclaimable;
> - long nr_unmapped_file_pages;
>
> disable_swap_token();
> cond_resched();
> @@ -2368,11 +2402,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> reclaim_state.reclaimed_slab = 0;
> p->reclaim_state = &reclaim_state;
>
> - nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> - zone_page_state(zone, NR_ACTIVE_FILE) -
> - zone_page_state(zone, NR_FILE_MAPPED);
> -
> - if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
> + if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
> /*
> * Free memory by calling shrink zone with increasing
> * priorities until we have enough memory freed.
> @@ -2419,8 +2449,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> {
> int node_id;
> int ret;
> - long nr_unmapped_file_pages;
> - long nr_slab_reclaimable;
>
> /*
> * Zone reclaim reclaims unmapped file backed pages and
> @@ -2432,12 +2460,8 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> * if less than a specified percentage of the zone is used by
> * unmapped file backed pages.
> */
> - nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> - zone_page_state(zone, NR_ACTIVE_FILE) -
> - zone_page_state(zone, NR_FILE_MAPPED);
> - nr_slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
> - if (nr_unmapped_file_pages <= zone->min_unmapped_pages &&
> - nr_slab_reclaimable <= zone->min_slab_pages)
> + if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
> + zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
> return 0;
>
> if (zone_is_all_unreclaimable(zone))
> --
> 1.5.6.5
* Re: [PATCH 1/4] Properly account for the number of page cache pages zone_reclaim() can reclaim
2009-06-10 1:19 ` Wu Fengguang
@ 2009-06-10 7:31 ` KOSAKI Motohiro
2009-06-10 10:31 ` Mel Gorman
1 sibling, 0 replies; 25+ messages in thread
From: KOSAKI Motohiro @ 2009-06-10 7:31 UTC (permalink / raw)
To: Wu Fengguang
Cc: kosaki.motohiro, Mel Gorman, Rik van Riel, Christoph Lameter,
Zhang, Yanmin, linuxram, linux-mm, LKML
> On Wed, Jun 10, 2009 at 01:01:41AM +0800, Mel Gorman wrote:
> > On NUMA machines, the administrator can configure zone_reclaim_mode that
> > is a more targetted form of direct reclaim. On machines with large NUMA
> > distances for example, a zone_reclaim_mode defaults to 1 meaning that clean
> > unmapped pages will be reclaimed if the zone watermarks are not being met.
> >
> > There is a heuristic that determines if the scan is worthwhile but the
> > problem is that the heuristic is not being properly applied and is basically
> > assuming zone_reclaim_mode is 1 if it is enabled.
> >
> > Historically, once enabled it was depending on NR_FILE_PAGES which may
> > include swapcache pages that the reclaim_mode cannot deal with. Patch
> > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
> > Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
> > pages that were not file-backed such as swapcache and made a calculation
> > based on the inactive, active and mapped files. This is far superior
> > when zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
> > reasonable starting figure.
> >
> > This patch alters how zone_reclaim() works out how many pages it might be
> > able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set
> > in the reclaim_mode it will either consider NR_FILE_PAGES as potential
> > candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
> > swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
> > then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
> > not set, then NR_FILE_MAPPED are not.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Christoph Lameter <cl@linux-foundation.org>
> > ---
> > mm/vmscan.c | 52 ++++++++++++++++++++++++++++++++++++++--------------
> > 1 files changed, 38 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2ddcfc8..2bfc76e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2333,6 +2333,41 @@ int sysctl_min_unmapped_ratio = 1;
> > */
> > int sysctl_min_slab_ratio = 5;
> >
> > +static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
> > +{
> > + return zone_page_state(zone, NR_INACTIVE_FILE) +
> > + zone_page_state(zone, NR_ACTIVE_FILE) -
> > + zone_page_state(zone, NR_FILE_MAPPED);
>
> This may underflow if too many tmpfs pages are mapped.
Sorry, my fault.
I'm preparing an updated patch.
* Re: [PATCH 1/4] Properly account for the number of page cache pages zone_reclaim() can reclaim
2009-06-10 1:19 ` Wu Fengguang
2009-06-10 7:31 ` KOSAKI Motohiro
@ 2009-06-10 10:31 ` Mel Gorman
2009-06-10 11:59 ` Wu Fengguang
1 sibling, 1 reply; 25+ messages in thread
From: Mel Gorman @ 2009-06-10 10:31 UTC (permalink / raw)
To: Wu Fengguang
Cc: KOSAKI Motohiro, Rik van Riel, Christoph Lameter, Zhang, Yanmin,
linuxram, linux-mm, LKML
On Wed, Jun 10, 2009 at 09:19:39AM +0800, Wu Fengguang wrote:
> On Wed, Jun 10, 2009 at 01:01:41AM +0800, Mel Gorman wrote:
> > On NUMA machines, the administrator can configure zone_reclaim_mode that
> > is a more targetted form of direct reclaim. On machines with large NUMA
> > distances for example, a zone_reclaim_mode defaults to 1 meaning that clean
> > unmapped pages will be reclaimed if the zone watermarks are not being met.
> >
> > There is a heuristic that determines if the scan is worthwhile but the
> > problem is that the heuristic is not being properly applied and is basically
> > assuming zone_reclaim_mode is 1 if it is enabled.
> >
> > Historically, once enabled it was depending on NR_FILE_PAGES which may
> > include swapcache pages that the reclaim_mode cannot deal with. Patch
> > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
> > Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
> > pages that were not file-backed such as swapcache and made a calculation
> > based on the inactive, active and mapped files. This is far superior
> > when zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
> > reasonable starting figure.
> >
> > This patch alters how zone_reclaim() works out how many pages it might be
> > able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set
> > in the reclaim_mode it will either consider NR_FILE_PAGES as potential
> > candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
> > swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
> > then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
> > not set, then NR_FILE_MAPPED are not.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Christoph Lameter <cl@linux-foundation.org>
> > ---
> > mm/vmscan.c | 52 ++++++++++++++++++++++++++++++++++++++--------------
> > 1 files changed, 38 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2ddcfc8..2bfc76e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2333,6 +2333,41 @@ int sysctl_min_unmapped_ratio = 1;
> > */
> > int sysctl_min_slab_ratio = 5;
> >
> > +static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
> > +{
> > + return zone_page_state(zone, NR_INACTIVE_FILE) +
> > + zone_page_state(zone, NR_ACTIVE_FILE) -
> > + zone_page_state(zone, NR_FILE_MAPPED);
>
> This may underflow if too many tmpfs pages are mapped.
>
You're right. This is also a bug now in mmotm for patch
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch, which
is where I took this code from without thinking deeply enough about it.
Well spotted.
Should this be something like?
static unsigned long zone_unmapped_file_pages(struct zone *zone)
{
	unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
	unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
				 zone_page_state(zone, NR_ACTIVE_FILE);

	return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
}
?
If that returns 0, it does mean that there are very few pages that the
current reclaim_mode is going to be able to deal with so even if the
count is not perfect, it should be good enough for what we need it for.
> > +}
> > +
> > +/* Work out how many page cache pages we can reclaim in this reclaim_mode */
> > +static inline long zone_pagecache_reclaimable(struct zone *zone)
> > +{
> > + long nr_pagecache_reclaimable;
> > + long delta = 0;
> > +
> > + /*
> > + * If RECLAIM_SWAP is set, then all file pages are considered
> > + * potentially reclaimable. Otherwise, we have to worry about
> > + * pages like swapcache and zone_unmapped_file_pages() provides
> > + * a better estimate
> > + */
> > + if (zone_reclaim_mode & RECLAIM_SWAP)
> > + nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> > + else
> > + nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
> > +
> > + /* If we can't clean pages, remove dirty pages from consideration */
> > + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> > + delta += zone_page_state(zone, NR_FILE_DIRTY);
> > +
> > + /* Beware of double accounting */
>
> The double accounting happens for NR_FILE_MAPPED but not
> NR_FILE_DIRTY(dirty tmpfs pages won't be accounted),
I should have taken that out. In an interim version, delta was altered
more than once in a way that could have caused underflow.
> so this comment
> is more suitable for zone_unmapped_file_pages(). But the double
> accounting does affects this abstraction. So a more reasonable
> sequence could be to first substract NR_FILE_DIRTY and then
> conditionally substract NR_FILE_MAPPED?
The end result is the same, I believe, and I prefer having
zone_unmapped_file_pages() do just that and nothing else because it's
in line with what zone_lru_pages() does.
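(For comparison, zone_lru_pages() in mm/vmscan.c at this point is
approximately the following simple sum of the per-zone LRU counters:)

static unsigned long zone_lru_pages(struct zone *zone)
{
	return (zone_page_state(zone, NR_ACTIVE_ANON)
		+ zone_page_state(zone, NR_ACTIVE_FILE)
		+ zone_page_state(zone, NR_INACTIVE_ANON)
		+ zone_page_state(zone, NR_INACTIVE_FILE));
}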
>
> Or better to introduce a new counter NR_TMPFS_MAPPED to fix this mess?
>
I considered such a counter and dismissed it, but maybe it merits wider discussion.
My problem with it is that it would affect the pagecache add/remove hot paths
and a few other sites and increase the amount of accounting we do within a
zone. It seemed unjustified to help a seldom-executed slow path that only
runs on NUMA.
> > + if (delta < nr_pagecache_reclaimable)
> > + nr_pagecache_reclaimable -= delta;
> > +
> > + return nr_pagecache_reclaimable;
> > +}
> > +
> > /*
> > * Try to free up some pages from this zone through reclaim.
> > */
> > @@ -2355,7 +2390,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > .isolate_pages = isolate_pages_global,
> > };
> > unsigned long slab_reclaimable;
> > - long nr_unmapped_file_pages;
> >
> > disable_swap_token();
> > cond_resched();
> > @@ -2368,11 +2402,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > reclaim_state.reclaimed_slab = 0;
> > p->reclaim_state = &reclaim_state;
> >
> > - nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> > - zone_page_state(zone, NR_ACTIVE_FILE) -
> > - zone_page_state(zone, NR_FILE_MAPPED);
> > -
> > - if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
> > + if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
> > /*
> > * Free memory by calling shrink zone with increasing
> > * priorities until we have enough memory freed.
> > @@ -2419,8 +2449,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > {
> > int node_id;
> > int ret;
> > - long nr_unmapped_file_pages;
> > - long nr_slab_reclaimable;
> >
> > /*
> > * Zone reclaim reclaims unmapped file backed pages and
> > @@ -2432,12 +2460,8 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > * if less than a specified percentage of the zone is used by
> > * unmapped file backed pages.
> > */
> > - nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> > - zone_page_state(zone, NR_ACTIVE_FILE) -
> > - zone_page_state(zone, NR_FILE_MAPPED);
> > - nr_slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
> > - if (nr_unmapped_file_pages <= zone->min_unmapped_pages &&
> > - nr_slab_reclaimable <= zone->min_slab_pages)
> > + if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
> > + zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
> > return 0;
> >
> > if (zone_is_all_unreclaimable(zone))
> > --
> > 1.5.6.5
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/4] Properly account for the number of page cache pages zone_reclaim() can reclaim
2009-06-10 10:31 ` Mel Gorman
@ 2009-06-10 11:59 ` Wu Fengguang
2009-06-10 13:41 ` Mel Gorman
2009-06-11 3:26 ` KOSAKI Motohiro
0 siblings, 2 replies; 25+ messages in thread
From: Wu Fengguang @ 2009-06-10 11:59 UTC (permalink / raw)
To: Mel Gorman
Cc: KOSAKI Motohiro, Rik van Riel, Christoph Lameter, Zhang, Yanmin,
linuxram, linux-mm, LKML
On Wed, Jun 10, 2009 at 06:31:53PM +0800, Mel Gorman wrote:
> On Wed, Jun 10, 2009 at 09:19:39AM +0800, Wu Fengguang wrote:
> > On Wed, Jun 10, 2009 at 01:01:41AM +0800, Mel Gorman wrote:
> > > On NUMA machines, the administrator can configure zone_reclaim_mode that
> > > is a more targetted form of direct reclaim. On machines with large NUMA
> > > distances for example, a zone_reclaim_mode defaults to 1 meaning that clean
> > > unmapped pages will be reclaimed if the zone watermarks are not being met.
> > >
> > > There is a heuristic that determines if the scan is worthwhile but the
> > > problem is that the heuristic is not being properly applied and is basically
> > > assuming zone_reclaim_mode is 1 if it is enabled.
> > >
> > > Historically, once enabled it was depending on NR_FILE_PAGES which may
> > > include swapcache pages that the reclaim_mode cannot deal with. Patch
> > > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
> > > Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
> > > pages that were not file-backed such as swapcache and made a calculation
> > > based on the inactive, active and mapped files. This is far superior
> > > when zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
> > > reasonable starting figure.
> > >
> > > This patch alters how zone_reclaim() works out how many pages it might be
> > > able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set
> > > in the reclaim_mode it will either consider NR_FILE_PAGES as potential
> > > candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
> > > swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
> > > then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
> > > not set, then NR_FILE_MAPPED are not.
> > >
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > Acked-by: Christoph Lameter <cl@linux-foundation.org>
> > > ---
> > > mm/vmscan.c | 52 ++++++++++++++++++++++++++++++++++++++--------------
> > > 1 files changed, 38 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 2ddcfc8..2bfc76e 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2333,6 +2333,41 @@ int sysctl_min_unmapped_ratio = 1;
> > > */
> > > int sysctl_min_slab_ratio = 5;
> > >
> > > +static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
> > > +{
> > > + return zone_page_state(zone, NR_INACTIVE_FILE) +
> > > + zone_page_state(zone, NR_ACTIVE_FILE) -
> > > + zone_page_state(zone, NR_FILE_MAPPED);
> >
> > This may underflow if too many tmpfs pages are mapped.
> >
>
> You're right. This is also a bug now in mmotm for patch
> vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch which
> is where I took this code out of and didn't think deeply enough about.
> Well spotted.
>
> Should this be something like?
>
> static unsigned long zone_unmapped_file_pages(struct zone *zone)
> {
> unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
> unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE)
> zone_page_state(zone, NR_ACTIVE_FILE);
>
> return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
> }
>
> ?
>
> If that returns 0, it does mean that there are very few pages that the
> current reclaim_mode is going to be able to deal with so even if the
> count is not perfect, it should be good enough for what we need it for.
Agreed. We'd rather give up direct zone reclaim than risk busy looping ;)
> > > +}
> > > +
> > > +/* Work out how many page cache pages we can reclaim in this reclaim_mode */
> > > +static inline long zone_pagecache_reclaimable(struct zone *zone)
> > > +{
> > > + long nr_pagecache_reclaimable;
> > > + long delta = 0;
> > > +
> > > + /*
> > > + * If RECLAIM_SWAP is set, then all file pages are considered
> > > + * potentially reclaimable. Otherwise, we have to worry about
> > > + * pages like swapcache and zone_unmapped_file_pages() provides
> > > + * a better estimate
> > > + */
> > > + if (zone_reclaim_mode & RECLAIM_SWAP)
> > > + nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> > > + else
> > > + nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
> > > +
> > > + /* If we can't clean pages, remove dirty pages from consideration */
> > > + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> > > + delta += zone_page_state(zone, NR_FILE_DIRTY);
> > > +
> > > + /* Beware of double accounting */
> >
> > The double accounting happens for NR_FILE_MAPPED but not
> > NR_FILE_DIRTY(dirty tmpfs pages won't be accounted),
>
> I should have taken that out. In an interim version, delta was altered
> more than once in a way that could have caused underflow.
>
> > so this comment
> > is more suitable for zone_unmapped_file_pages(). But the double
> > accounting does affects this abstraction. So a more reasonable
> > sequence could be to first substract NR_FILE_DIRTY and then
> > conditionally substract NR_FILE_MAPPED?
>
> The end result is the same I believe and I prefer having the
> zone_unmapped_file_pages() doing just that and nothing else because it's
> in line with what zone_lru_pages() does.
OK.
> > Or better to introduce a new counter NR_TMPFS_MAPPED to fix this mess?
> >
>
> I considered such a counter and dismissed it but maybe it merits wider discussion.
>
> My problem with it is that it would affect the pagecache add/remove hot paths
> and a few other sites and increase the amount of accouting we do within a
> zone. It seemed unjustified to help a seldom executed slow path that only
> runs on NUMA.
We are not talking about NR_TMPFS_PAGES, but NR_TMPFS_MAPPED :)
We only need to account for it in page_add_file_rmap() and page_remove_rmap(),
and I don't think those are particularly hot paths. The relative cost is low enough.
It will look like this.
---
include/linux/mmzone.h | 1 +
mm/rmap.c | 4 ++++
2 files changed, 5 insertions(+)
--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -99,6 +99,7 @@ enum zone_stat_item {
NR_VMSCAN_WRITE,
/* Second 128 byte cacheline */
NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
+ NR_TMPFS_MAPPED,
#ifdef CONFIG_NUMA
NUMA_HIT, /* allocated in intended node */
NUMA_MISS, /* allocated in non intended node */
--- linux.orig/mm/rmap.c
+++ linux/mm/rmap.c
@@ -844,6 +844,8 @@ void page_add_file_rmap(struct page *pag
{
if (atomic_inc_and_test(&page->_mapcount)) {
__inc_zone_page_state(page, NR_FILE_MAPPED);
+ if (PageSwapBacked(page))
+ __inc_zone_page_state(page, NR_TMPFS_MAPPED);
mem_cgroup_update_mapped_file_stat(page, 1);
}
}
@@ -894,6 +896,8 @@ void page_remove_rmap(struct page *page)
mem_cgroup_uncharge_page(page);
__dec_zone_page_state(page,
PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
+ if (!PageAnon(page) && PageSwapBacked(page))
+ __dec_zone_page_state(page, NR_TMPFS_MAPPED);
mem_cgroup_update_mapped_file_stat(page, -1);
/*
* It would be tidy to reset the PageAnon mapping here,
* Re: [PATCH 1/4] Properly account for the number of page cache pages zone_reclaim() can reclaim
2009-06-10 11:59 ` Wu Fengguang
@ 2009-06-10 13:41 ` Mel Gorman
2009-06-10 22:42 ` Ram Pai
2009-06-11 1:29 ` Wu Fengguang
2009-06-11 3:26 ` KOSAKI Motohiro
1 sibling, 2 replies; 25+ messages in thread
From: Mel Gorman @ 2009-06-10 13:41 UTC (permalink / raw)
To: Wu Fengguang
Cc: KOSAKI Motohiro, Rik van Riel, Christoph Lameter, Zhang, Yanmin,
linuxram, linux-mm, LKML
On Wed, Jun 10, 2009 at 07:59:44PM +0800, Wu Fengguang wrote:
> On Wed, Jun 10, 2009 at 06:31:53PM +0800, Mel Gorman wrote:
> > On Wed, Jun 10, 2009 at 09:19:39AM +0800, Wu Fengguang wrote:
> > > On Wed, Jun 10, 2009 at 01:01:41AM +0800, Mel Gorman wrote:
> > > > On NUMA machines, the administrator can configure zone_reclaim_mode that
> > > > is a more targetted form of direct reclaim. On machines with large NUMA
> > > > distances for example, a zone_reclaim_mode defaults to 1 meaning that clean
> > > > unmapped pages will be reclaimed if the zone watermarks are not being met.
> > > >
> > > > There is a heuristic that determines if the scan is worthwhile but the
> > > > problem is that the heuristic is not being properly applied and is basically
> > > > assuming zone_reclaim_mode is 1 if it is enabled.
> > > >
> > > > Historically, once enabled it was depending on NR_FILE_PAGES which may
> > > > include swapcache pages that the reclaim_mode cannot deal with. Patch
> > > > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
> > > > Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
> > > > pages that were not file-backed such as swapcache and made a calculation
> > > > based on the inactive, active and mapped files. This is far superior
> > > > when zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
> > > > reasonable starting figure.
> > > >
> > > > This patch alters how zone_reclaim() works out how many pages it might be
> > > > able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set
> > > > in the reclaim_mode it will either consider NR_FILE_PAGES as potential
> > > > candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
> > > > swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
> > > > then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
> > > > not set, then NR_FILE_MAPPED are not.
> > > >
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > Acked-by: Christoph Lameter <cl@linux-foundation.org>
> > > > ---
> > > > mm/vmscan.c | 52 ++++++++++++++++++++++++++++++++++++++--------------
> > > > 1 files changed, 38 insertions(+), 14 deletions(-)
> > > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 2ddcfc8..2bfc76e 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2333,6 +2333,41 @@ int sysctl_min_unmapped_ratio = 1;
> > > > */
> > > > int sysctl_min_slab_ratio = 5;
> > > >
> > > > +static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
> > > > +{
> > > > + return zone_page_state(zone, NR_INACTIVE_FILE) +
> > > > + zone_page_state(zone, NR_ACTIVE_FILE) -
> > > > + zone_page_state(zone, NR_FILE_MAPPED);
> > >
> > > This may underflow if too many tmpfs pages are mapped.
> > >
> >
> > You're right. This is also a bug now in mmotm for patch
> > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch which
> > is where I took this code out of and didn't think deeply enough about.
> > Well spotted.
> >
> > Should this be something like?
> >
> > static unsigned long zone_unmapped_file_pages(struct zone *zone)
> > {
> > unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
> > unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE)
> > zone_page_state(zone, NR_ACTIVE_FILE);
> >
> > return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
> > }
> >
> > ?
> >
> > If that returns 0, it does mean that there are very few pages that the
> > current reclaim_mode is going to be able to deal with so even if the
> > count is not perfect, it should be good enough for what we need it for.
>
> Agreed. We opt to give up direct zone reclaim than to risk busy looping ;)
>
Yep. Those busy loops doth chew up the CPU time, heat the planet and
wear out Ye Olde Bugzilla with the wailing of unhappy users :)
> > > > +}
> > > > +
> > > > +/* Work out how many page cache pages we can reclaim in this reclaim_mode */
> > > > +static inline long zone_pagecache_reclaimable(struct zone *zone)
> > > > +{
> > > > + long nr_pagecache_reclaimable;
> > > > + long delta = 0;
> > > > +
> > > > + /*
> > > > + * If RECLAIM_SWAP is set, then all file pages are considered
> > > > + * potentially reclaimable. Otherwise, we have to worry about
> > > > + * pages like swapcache and zone_unmapped_file_pages() provides
> > > > + * a better estimate
> > > > + */
> > > > + if (zone_reclaim_mode & RECLAIM_SWAP)
> > > > + nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> > > > + else
> > > > + nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
> > > > +
> > > > + /* If we can't clean pages, remove dirty pages from consideration */
> > > > + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> > > > + delta += zone_page_state(zone, NR_FILE_DIRTY);
> > > > +
> > > > + /* Beware of double accounting */
> > >
> > > The double accounting happens for NR_FILE_MAPPED but not
> > > NR_FILE_DIRTY(dirty tmpfs pages won't be accounted),
> >
> > I should have taken that out. In an interim version, delta was altered
> > more than once in a way that could have caused underflow.
> >
> > > so this comment
> > > is more suitable for zone_unmapped_file_pages(). But the double
> > > accounting does affects this abstraction. So a more reasonable
> > > sequence could be to first substract NR_FILE_DIRTY and then
> > > conditionally substract NR_FILE_MAPPED?
> >
> > The end result is the same I believe and I prefer having the
> > zone_unmapped_file_pages() doing just that and nothing else because it's
> > in line with what zone_lru_pages() does.
>
> OK.
>
> > > Or better to introduce a new counter NR_TMPFS_MAPPED to fix this mess?
> > >
> >
> > I considered such a counter and dismissed it but maybe it merits wider discussion.
> >
> > My problem with it is that it would affect the pagecache add/remove hot paths
> > and a few other sites and increase the amount of accouting we do within a
> > zone. It seemed unjustified to help a seldom executed slow path that only
> > runs on NUMA.
>
> We are not talking about NR_TMPFS_PAGES, but NR_TMPFS_MAPPED :)
>
> We only need to account it in page_add_file_rmap() and page_remove_rmap(),
> I don't think they are too hot paths. And the relative cost is low enough.
>
> It will look like this.
>
Ok, you're right, that is much simpler than what I had in mind. I was fixated
on accounting for TMPFS pages. I think this patch has definite possibilities
and would help us with the tmpfs problem. If the tests come back "failed",
I'll be taking this logic and seeing if it can be made to work.
What about ramfs pages though? They have similar problems to tmpfs but are
not swap-backed, right?
> ---
> include/linux/mmzone.h | 1 +
> mm/rmap.c | 4 ++++
> 2 files changed, 5 insertions(+)
>
> --- linux.orig/include/linux/mmzone.h
> +++ linux/include/linux/mmzone.h
> @@ -99,6 +99,7 @@ enum zone_stat_item {
> NR_VMSCAN_WRITE,
> /* Second 128 byte cacheline */
> NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
> + NR_TMPFS_MAPPED,
> #ifdef CONFIG_NUMA
> NUMA_HIT, /* allocated in intended node */
> NUMA_MISS, /* allocated in non intended node */
> --- linux.orig/mm/rmap.c
> +++ linux/mm/rmap.c
> @@ -844,6 +844,8 @@ void page_add_file_rmap(struct page *pag
> {
> if (atomic_inc_and_test(&page->_mapcount)) {
> __inc_zone_page_state(page, NR_FILE_MAPPED);
> + if (PageSwapBacked(page))
> + __inc_zone_page_state(page, NR_TMPFS_MAPPED);
> mem_cgroup_update_mapped_file_stat(page, 1);
> }
> }
> @@ -894,6 +896,8 @@ void page_remove_rmap(struct page *page)
> mem_cgroup_uncharge_page(page);
> __dec_zone_page_state(page,
> PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
> + if (!PageAnon(page) && PageSwapBacked(page))
> + __dec_zone_page_state(page, NR_TMPFS_MAPPED);
> mem_cgroup_update_mapped_file_stat(page, -1);
> /*
> * It would be tidy to reset the PageAnon mapping here,
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/4] Properly account for the number of page cache pages zone_reclaim() can reclaim
2009-06-10 13:41 ` Mel Gorman
@ 2009-06-10 22:42 ` Ram Pai
2009-06-11 13:52 ` Mel Gorman
2009-06-11 1:29 ` Wu Fengguang
1 sibling, 1 reply; 25+ messages in thread
From: Ram Pai @ 2009-06-10 22:42 UTC (permalink / raw)
To: Mel Gorman
Cc: Wu Fengguang, KOSAKI Motohiro, Rik van Riel, Christoph Lameter,
Zhang, Yanmin, linux-mm, LKML
On Wed, 2009-06-10 at 14:41 +0100, Mel Gorman wrote:
> On Wed, Jun 10, 2009 at 07:59:44PM +0800, Wu Fengguang wrote:
> > On Wed, Jun 10, 2009 at 06:31:53PM +0800, Mel Gorman wrote:
> > > On Wed, Jun 10, 2009 at 09:19:39AM +0800, Wu Fengguang wrote:
> > > > On Wed, Jun 10, 2009 at 01:01:41AM +0800, Mel Gorman wrote:
> > > > > On NUMA machines, the administrator can configure zone_reclaim_mode that
> > > > > is a more targetted form of direct reclaim. On machines with large NUMA
> > > > > distances for example, a zone_reclaim_mode defaults to 1 meaning that clean
> > > > > unmapped pages will be reclaimed if the zone watermarks are not being met.
> > > > >
> > > > > There is a heuristic that determines if the scan is worthwhile but the
> > > > > problem is that the heuristic is not being properly applied and is basically
> > > > > assuming zone_reclaim_mode is 1 if it is enabled.
> > > > >
> > > > > Historically, once enabled it was depending on NR_FILE_PAGES which may
> > > > > include swapcache pages that the reclaim_mode cannot deal with. Patch
> > > > > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
> > > > > Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
> > > > > pages that were not file-backed such as swapcache and made a calculation
> > > > > based on the inactive, active and mapped files. This is far superior
> > > > > when zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
> > > > > reasonable starting figure.
> > > > >
> > > > > This patch alters how zone_reclaim() works out how many pages it might be
> > > > > able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set
> > > > > in the reclaim_mode it will either consider NR_FILE_PAGES as potential
> > > > > candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
> > > > > swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
> > > > > then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
> > > > > not set, then NR_FILE_MAPPED are not.
> > > > >
> > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > > Acked-by: Christoph Lameter <cl@linux-foundation.org>
> > > > > ---
> > > > > mm/vmscan.c | 52 ++++++++++++++++++++++++++++++++++++++--------------
> > > > > 1 files changed, 38 insertions(+), 14 deletions(-)
> > > > >
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index 2ddcfc8..2bfc76e 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -2333,6 +2333,41 @@ int sysctl_min_unmapped_ratio = 1;
> > > > > */
> > > > > int sysctl_min_slab_ratio = 5;
> > > > >
> > > > > +static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
> > > > > +{
> > > > > + return zone_page_state(zone, NR_INACTIVE_FILE) +
> > > > > + zone_page_state(zone, NR_ACTIVE_FILE) -
> > > > > + zone_page_state(zone, NR_FILE_MAPPED);
> > > >
> > > > This may underflow if too many tmpfs pages are mapped.
> > > >
> > >
> > > You're right. This is also a bug now in mmotm for patch
> > > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch which
> > > is where I took this code out of and didn't think deeply enough about.
> > > Well spotted.
> > >
> > > Should this be something like?
> > >
> > > static unsigned long zone_unmapped_file_pages(struct zone *zone)
> > > {
> > > unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
> > > unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE)
> > > zone_page_state(zone, NR_ACTIVE_FILE);
> > >
> > > return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
> > > }
> > >
> > > ?
> > >
> > > If that returns 0, it does mean that there are very few pages that the
> > > current reclaim_mode is going to be able to deal with so even if the
> > > count is not perfect, it should be good enough for what we need it for.
> >
> > Agreed. We opt to give up direct zone reclaim than to risk busy looping ;)
> >
>
> Yep. Those busy loops doth chew up the CPU time, heat the planet and
> wear out Ye Olde Bugzilla with the wailing of unhappy users :)
>
> > > > > +}
> > > > > +
> > > > > +/* Work out how many page cache pages we can reclaim in this reclaim_mode */
> > > > > +static inline long zone_pagecache_reclaimable(struct zone *zone)
> > > > > +{
> > > > > + long nr_pagecache_reclaimable;
> > > > > + long delta = 0;
> > > > > +
> > > > > + /*
> > > > > + * If RECLAIM_SWAP is set, then all file pages are considered
> > > > > + * potentially reclaimable. Otherwise, we have to worry about
> > > > > + * pages like swapcache and zone_unmapped_file_pages() provides
> > > > > + * a better estimate
> > > > > + */
> > > > > + if (zone_reclaim_mode & RECLAIM_SWAP)
> > > > > + nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> > > > > + else
> > > > > + nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
> > > > > +
> > > > > + /* If we can't clean pages, remove dirty pages from consideration */
> > > > > + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> > > > > + delta += zone_page_state(zone, NR_FILE_DIRTY);
> > > > > +
> > > > > + /* Beware of double accounting */
> > > >
> > > > The double accounting happens for NR_FILE_MAPPED but not
> > > > NR_FILE_DIRTY(dirty tmpfs pages won't be accounted),
> > >
> > > I should have taken that out. In an interim version, delta was altered
> > > more than once in a way that could have caused underflow.
> > >
> > > > so this comment
> > > > is more suitable for zone_unmapped_file_pages(). But the double
> > > > accounting does affects this abstraction. So a more reasonable
> > > > sequence could be to first substract NR_FILE_DIRTY and then
> > > > conditionally substract NR_FILE_MAPPED?
> > >
> > > The end result is the same I believe and I prefer having the
> > > zone_unmapped_file_pages() doing just that and nothing else because it's
> > > in line with what zone_lru_pages() does.
> >
> > OK.
> >
> > > > Or better to introduce a new counter NR_TMPFS_MAPPED to fix this mess?
> > > >
> > >
> > > I considered such a counter and dismissed it but maybe it merits wider discussion.
> > >
> > > My problem with it is that it would affect the pagecache add/remove hot paths
> > > and a few other sites and increase the amount of accouting we do within a
> > > zone. It seemed unjustified to help a seldom executed slow path that only
> > > runs on NUMA.
> >
> > We are not talking about NR_TMPFS_PAGES, but NR_TMPFS_MAPPED :)
> >
> > We only need to account it in page_add_file_rmap() and page_remove_rmap(),
> > I don't think they are too hot paths. And the relative cost is low enough.
> >
> > It will look like this.
> >
>
> Ok, you're right, that is much simplier than what I had in mind. I was fixated
> on accounting for TMPFS pages. I think this patch has definite possibilities
> and would help us with the tmpfs problem. If the tests come back "failed",
> I'll be adding taking this logic and seeing can it be made work
And the results look great! While constantly watching /proc/zoneinfo, I
observed that, unlike before, there was no unnecessary attempt to scan for
reclaimable pages; instead, pages were allocated from the other node's
ZONE_NORMAL.
RP
* Re: [PATCH 1/4] Properly account for the number of page cache pages zone_reclaim() can reclaim
2009-06-10 22:42 ` Ram Pai
@ 2009-06-11 13:52 ` Mel Gorman
0 siblings, 0 replies; 25+ messages in thread
From: Mel Gorman @ 2009-06-11 13:52 UTC (permalink / raw)
To: Ram Pai
Cc: Wu Fengguang, KOSAKI Motohiro, Rik van Riel, Christoph Lameter,
Zhang, Yanmin, linux-mm, LKML
On Wed, Jun 10, 2009 at 03:42:59PM -0700, Ram Pai wrote:
> On Wed, 2009-06-10 at 14:41 +0100, Mel Gorman wrote:
> > On Wed, Jun 10, 2009 at 07:59:44PM +0800, Wu Fengguang wrote:
> > > On Wed, Jun 10, 2009 at 06:31:53PM +0800, Mel Gorman wrote:
> > > > On Wed, Jun 10, 2009 at 09:19:39AM +0800, Wu Fengguang wrote:
> > > > > On Wed, Jun 10, 2009 at 01:01:41AM +0800, Mel Gorman wrote:
> > > > > > On NUMA machines, the administrator can configure zone_reclaim_mode that
> > > > > > is a more targetted form of direct reclaim. On machines with large NUMA
> > > > > > distances for example, a zone_reclaim_mode defaults to 1 meaning that clean
> > > > > > unmapped pages will be reclaimed if the zone watermarks are not being met.
> > > > > >
> > > > > > There is a heuristic that determines if the scan is worthwhile but the
> > > > > > problem is that the heuristic is not being properly applied and is basically
> > > > > > assuming zone_reclaim_mode is 1 if it is enabled.
> > > > > >
> > > > > > Historically, once enabled it was depending on NR_FILE_PAGES which may
> > > > > > include swapcache pages that the reclaim_mode cannot deal with. Patch
> > > > > > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
> > > > > > Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
> > > > > > pages that were not file-backed such as swapcache and made a calculation
> > > > > > based on the inactive, active and mapped files. This is far superior
> > > > > > when zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
> > > > > > reasonable starting figure.
> > > > > >
> > > > > > This patch alters how zone_reclaim() works out how many pages it might be
> > > > > > able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set
> > > > > > in the reclaim_mode it will either consider NR_FILE_PAGES as potential
> > > > > > candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
> > > > > > swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
> > > > > > then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
> > > > > > not set, then NR_FILE_MAPPED are not.
> > > > > >
> > > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > > > Acked-by: Christoph Lameter <cl@linux-foundation.org>
> > > > > > ---
> > > > > > mm/vmscan.c | 52 ++++++++++++++++++++++++++++++++++++++--------------
> > > > > > 1 files changed, 38 insertions(+), 14 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index 2ddcfc8..2bfc76e 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > > > @@ -2333,6 +2333,41 @@ int sysctl_min_unmapped_ratio = 1;
> > > > > > */
> > > > > > int sysctl_min_slab_ratio = 5;
> > > > > >
> > > > > > +static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
> > > > > > +{
> > > > > > + return zone_page_state(zone, NR_INACTIVE_FILE) +
> > > > > > + zone_page_state(zone, NR_ACTIVE_FILE) -
> > > > > > + zone_page_state(zone, NR_FILE_MAPPED);
> > > > >
> > > > > This may underflow if too many tmpfs pages are mapped.
> > > > >
> > > >
> > > > You're right. This is also a bug now in mmotm for patch
> > > > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch which
> > > > is where I took this code out of and didn't think deeply enough about.
> > > > Well spotted.
> > > >
> > > > Should this be something like?
> > > >
> > > > static unsigned long zone_unmapped_file_pages(struct zone *zone)
> > > > {
> > > > unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
> > > > unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE)
> > > > zone_page_state(zone, NR_ACTIVE_FILE);
> > > >
> > > > return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
> > > > }
> > > >
> > > > ?
> > > >
> > > > If that returns 0, it does mean that there are very few pages that the
> > > > current reclaim_mode is going to be able to deal with so even if the
> > > > count is not perfect, it should be good enough for what we need it for.
> > >
> > > Agreed. We opt to give up direct zone reclaim than to risk busy looping ;)
> > >
> >
> > Yep. Those busy loops doth chew up the CPU time, heat the planet and
> > wear out Ye Olde Bugzilla with the wailing of unhappy users :)
> >
> > > > > > +}
> > > > > > +
> > > > > > +/* Work out how many page cache pages we can reclaim in this reclaim_mode */
> > > > > > +static inline long zone_pagecache_reclaimable(struct zone *zone)
> > > > > > +{
> > > > > > + long nr_pagecache_reclaimable;
> > > > > > + long delta = 0;
> > > > > > +
> > > > > > + /*
> > > > > > + * If RECLAIM_SWAP is set, then all file pages are considered
> > > > > > + * potentially reclaimable. Otherwise, we have to worry about
> > > > > > + * pages like swapcache and zone_unmapped_file_pages() provides
> > > > > > + * a better estimate
> > > > > > + */
> > > > > > + if (zone_reclaim_mode & RECLAIM_SWAP)
> > > > > > + nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> > > > > > + else
> > > > > > + nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
> > > > > > +
> > > > > > + /* If we can't clean pages, remove dirty pages from consideration */
> > > > > > + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> > > > > > + delta += zone_page_state(zone, NR_FILE_DIRTY);
> > > > > > +
> > > > > > + /* Beware of double accounting */
> > > > >
> > > > > The double accounting happens for NR_FILE_MAPPED but not
> > > > > NR_FILE_DIRTY(dirty tmpfs pages won't be accounted),
> > > >
> > > > I should have taken that out. In an interim version, delta was altered
> > > > more than once in a way that could have caused underflow.
> > > >
> > > > > so this comment
> > > > > is more suitable for zone_unmapped_file_pages(). But the double
> > > > > accounting does affects this abstraction. So a more reasonable
> > > > > sequence could be to first substract NR_FILE_DIRTY and then
> > > > > conditionally substract NR_FILE_MAPPED?
> > > >
> > > > The end result is the same I believe and I prefer having the
> > > > zone_unmapped_file_pages() doing just that and nothing else because it's
> > > > in line with what zone_lru_pages() does.
> > >
> > > OK.
> > >
> > > > > Or better to introduce a new counter NR_TMPFS_MAPPED to fix this mess?
> > > > >
> > > >
> > > > I considered such a counter and dismissed it but maybe it merits wider discussion.
> > > >
> > > > My problem with it is that it would affect the pagecache add/remove hot paths
> > > > and a few other sites and increase the amount of accouting we do within a
> > > > zone. It seemed unjustified to help a seldom executed slow path that only
> > > > runs on NUMA.
> > >
> > > We are not talking about NR_TMPFS_PAGES, but NR_TMPFS_MAPPED :)
> > >
> > > We only need to account it in page_add_file_rmap() and page_remove_rmap(),
> > > I don't think they are too hot paths. And the relative cost is low enough.
> > >
> > > It will look like this.
> > >
> >
> > Ok, you're right, that is much simpler than what I had in mind. I was fixated
> > on accounting for TMPFS pages. I think this patch has definite possibilities
> > and would help us with the tmpfs problem. If the tests come back "failed",
> > I'll take this logic and see if it can be made to work.
>
>
> And the results look great! While constantly watching /proc/zoneinfo, I
> observed that, unlike earlier, there was no unnecessary attempt to scan for
> reclaimable pages; instead, pages were allocated from the other node's
> ZONE_NORMAL.
>
Happy days, thanks a lot for testing and reporting.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/4] Properly account for the number of page cache pages zone_reclaim() can reclaim
2009-06-10 13:41 ` Mel Gorman
2009-06-10 22:42 ` Ram Pai
@ 2009-06-11 1:29 ` Wu Fengguang
1 sibling, 0 replies; 25+ messages in thread
From: Wu Fengguang @ 2009-06-11 1:29 UTC (permalink / raw)
To: Mel Gorman
Cc: KOSAKI Motohiro, Rik van Riel, Christoph Lameter, Zhang, Yanmin,
linuxram, linux-mm, LKML
On Wed, Jun 10, 2009 at 09:41:34PM +0800, Mel Gorman wrote:
> On Wed, Jun 10, 2009 at 07:59:44PM +0800, Wu Fengguang wrote:
> > On Wed, Jun 10, 2009 at 06:31:53PM +0800, Mel Gorman wrote:
> > > On Wed, Jun 10, 2009 at 09:19:39AM +0800, Wu Fengguang wrote:
> > > > On Wed, Jun 10, 2009 at 01:01:41AM +0800, Mel Gorman wrote:
> > > > > On NUMA machines, the administrator can configure zone_reclaim_mode that
> > > > > is a more targetted form of direct reclaim. On machines with large NUMA
> > > > > distances for example, a zone_reclaim_mode defaults to 1 meaning that clean
> > > > > unmapped pages will be reclaimed if the zone watermarks are not being met.
> > > > >
> > > > > There is a heuristic that determines if the scan is worthwhile but the
> > > > > problem is that the heuristic is not being properly applied and is basically
> > > > > assuming zone_reclaim_mode is 1 if it is enabled.
> > > > >
> > > > > Historically, once enabled it was depending on NR_FILE_PAGES which may
> > > > > include swapcache pages that the reclaim_mode cannot deal with. Patch
> > > > > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
> > > > > Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
> > > > > pages that were not file-backed such as swapcache and made a calculation
> > > > > based on the inactive, active and mapped files. This is far superior
> > > > > when zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
> > > > > reasonable starting figure.
> > > > >
> > > > > This patch alters how zone_reclaim() works out how many pages it might be
> > > > > able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set
> > > > > in the reclaim_mode it will either consider NR_FILE_PAGES as potential
> > > > > candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
> > > > > swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
> > > > > then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
> > > > > not set, then NR_FILE_MAPPED are not.
> > > > >
> > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > > Acked-by: Christoph Lameter <cl@linux-foundation.org>
> > > > > ---
> > > > > mm/vmscan.c | 52 ++++++++++++++++++++++++++++++++++++++--------------
> > > > > 1 files changed, 38 insertions(+), 14 deletions(-)
> > > > >
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index 2ddcfc8..2bfc76e 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -2333,6 +2333,41 @@ int sysctl_min_unmapped_ratio = 1;
> > > > > */
> > > > > int sysctl_min_slab_ratio = 5;
> > > > >
> > > > > +static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
> > > > > +{
> > > > > + return zone_page_state(zone, NR_INACTIVE_FILE) +
> > > > > + zone_page_state(zone, NR_ACTIVE_FILE) -
> > > > > + zone_page_state(zone, NR_FILE_MAPPED);
> > > >
> > > > This may underflow if too many tmpfs pages are mapped.
> > > >
> > >
> > > You're right. This is also a bug now in mmotm for patch
> > > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch, which
> > > is where I took this code from without thinking deeply enough about it.
> > > Well spotted.
> > >
> > > Should this be something like?
> > >
> > > static unsigned long zone_unmapped_file_pages(struct zone *zone)
> > > {
> > > unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
> > > unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
> > > zone_page_state(zone, NR_ACTIVE_FILE);
> > >
> > > return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
> > > }
> > >
> > > ?
> > >
> > > If that returns 0, it does mean that there are very few pages that the
> > > current reclaim_mode is going to be able to deal with so even if the
> > > count is not perfect, it should be good enough for what we need it for.
> >
> > Agreed. We would rather give up direct zone reclaim than risk busy looping ;)
> >
>
> Yep. Those busy loops doth chew up the CPU time, heat the planet and
> wear out Ye Olde Bugzilla with the wailing of unhappy users :)
>
> > > > > +}
> > > > > +
> > > > > +/* Work out how many page cache pages we can reclaim in this reclaim_mode */
> > > > > +static inline long zone_pagecache_reclaimable(struct zone *zone)
> > > > > +{
> > > > > + long nr_pagecache_reclaimable;
> > > > > + long delta = 0;
> > > > > +
> > > > > + /*
> > > > > + * If RECLAIM_SWAP is set, then all file pages are considered
> > > > > + * potentially reclaimable. Otherwise, we have to worry about
> > > > > + * pages like swapcache and zone_unmapped_file_pages() provides
> > > > > + * a better estimate
> > > > > + */
> > > > > + if (zone_reclaim_mode & RECLAIM_SWAP)
> > > > > + nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> > > > > + else
> > > > > + nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
> > > > > +
> > > > > + /* If we can't clean pages, remove dirty pages from consideration */
> > > > > + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> > > > > + delta += zone_page_state(zone, NR_FILE_DIRTY);
> > > > > +
> > > > > + /* Beware of double accounting */
> > > >
> > > > The double accounting happens for NR_FILE_MAPPED but not
> > > > NR_FILE_DIRTY(dirty tmpfs pages won't be accounted),
> > >
> > > I should have taken that out. In an interim version, delta was altered
> > > more than once in a way that could have caused underflow.
> > >
> > > > so this comment
> > > > is more suitable for zone_unmapped_file_pages(). But the double
> > > > accounting does affect this abstraction. So a more reasonable
> > > > sequence could be to first subtract NR_FILE_DIRTY and then
> > > > conditionally subtract NR_FILE_MAPPED?
> > >
> > > The end result is the same I believe and I prefer having the
> > > zone_unmapped_file_pages() doing just that and nothing else because it's
> > > in line with what zone_lru_pages() does.
> >
> > OK.
> >
> > > > Or better to introduce a new counter NR_TMPFS_MAPPED to fix this mess?
> > > >
> > >
> > > I considered such a counter and dismissed it but maybe it merits wider discussion.
> > >
> > > My problem with it is that it would affect the pagecache add/remove hot paths
> > > and a few other sites and increase the amount of accounting we do within a
> > > zone. It seemed unjustified to help a seldom executed slow path that only
> > > runs on NUMA.
> >
> > We are not talking about NR_TMPFS_PAGES, but NR_TMPFS_MAPPED :)
> >
> > We only need to account it in page_add_file_rmap() and page_remove_rmap(),
> > I don't think they are too hot paths. And the relative cost is low enough.
> >
> > It will look like this.
> >
>
> Ok, you're right, that is much simpler than what I had in mind. I was fixated
> on accounting for TMPFS pages. I think this patch has definite possibilities
> and would help us with the tmpfs problem. If the tests come back "failed",
> I'll take this logic and see if it can be made to work.
OK, thank you.
> What about ramfs pages though? They have similar problems to tmpfs but are
> not swap-backed, right?
We don't care about ramfs pages because they are unevictable :)
Thanks,
Fengguang
> > ---
> > include/linux/mmzone.h | 1 +
> > mm/rmap.c | 4 ++++
> > 2 files changed, 5 insertions(+)
> >
> > --- linux.orig/include/linux/mmzone.h
> > +++ linux/include/linux/mmzone.h
> > @@ -99,6 +99,7 @@ enum zone_stat_item {
> > NR_VMSCAN_WRITE,
> > /* Second 128 byte cacheline */
> > NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
> > + NR_TMPFS_MAPPED,
> > #ifdef CONFIG_NUMA
> > NUMA_HIT, /* allocated in intended node */
> > NUMA_MISS, /* allocated in non intended node */
> > --- linux.orig/mm/rmap.c
> > +++ linux/mm/rmap.c
> > @@ -844,6 +844,8 @@ void page_add_file_rmap(struct page *pag
> > {
> > if (atomic_inc_and_test(&page->_mapcount)) {
> > __inc_zone_page_state(page, NR_FILE_MAPPED);
> > + if (PageSwapBacked(page))
> > + __inc_zone_page_state(page, NR_TMPFS_MAPPED);
> > mem_cgroup_update_mapped_file_stat(page, 1);
> > }
> > }
> > @@ -894,6 +896,8 @@ void page_remove_rmap(struct page *page)
> > mem_cgroup_uncharge_page(page);
> > __dec_zone_page_state(page,
> > PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
> > + if (!PageAnon(page) && PageSwapBacked(page))
> > + __dec_zone_page_state(page, NR_TMPFS_MAPPED);
> > mem_cgroup_update_mapped_file_stat(page, -1);
> > /*
> > * It would be tidy to reset the PageAnon mapping here,
> >
>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/4] Properly account for the number of page cache pages zone_reclaim() can reclaim
2009-06-10 11:59 ` Wu Fengguang
2009-06-10 13:41 ` Mel Gorman
@ 2009-06-11 3:26 ` KOSAKI Motohiro
1 sibling, 0 replies; 25+ messages in thread
From: KOSAKI Motohiro @ 2009-06-11 3:26 UTC (permalink / raw)
To: Wu Fengguang
Cc: kosaki.motohiro, Mel Gorman, Rik van Riel, Christoph Lameter,
Zhang, Yanmin, linuxram, linux-mm, LKML
> We are not talking about NR_TMPFS_PAGES, but NR_TMPFS_MAPPED :)
>
> We only need to account it in page_add_file_rmap() and page_remove_rmap(),
> I don't think they are too hot paths. And the relative cost is low enough.
>
> It will look like this.
>
> ---
> include/linux/mmzone.h | 1 +
> mm/rmap.c | 4 ++++
> 2 files changed, 5 insertions(+)
>
> --- linux.orig/include/linux/mmzone.h
> +++ linux/include/linux/mmzone.h
> @@ -99,6 +99,7 @@ enum zone_stat_item {
> NR_VMSCAN_WRITE,
> /* Second 128 byte cacheline */
> NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
> + NR_TMPFS_MAPPED,
> #ifdef CONFIG_NUMA
> NUMA_HIT, /* allocated in intended node */
> NUMA_MISS, /* allocated in non intended node */
> --- linux.orig/mm/rmap.c
> +++ linux/mm/rmap.c
> @@ -844,6 +844,8 @@ void page_add_file_rmap(struct page *pag
> {
> if (atomic_inc_and_test(&page->_mapcount)) {
> __inc_zone_page_state(page, NR_FILE_MAPPED);
> + if (PageSwapBacked(page))
> + __inc_zone_page_state(page, NR_TMPFS_MAPPED);
> mem_cgroup_update_mapped_file_stat(page, 1);
> }
> }
> @@ -894,6 +896,8 @@ void page_remove_rmap(struct page *page)
> mem_cgroup_uncharge_page(page);
> __dec_zone_page_state(page,
> PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
> + if (!PageAnon(page) && PageSwapBacked(page))
> + __dec_zone_page_state(page, NR_TMPFS_MAPPED);
> mem_cgroup_update_mapped_file_stat(page, -1);
> /*
> * It would be tidy to reset the PageAnon mapping here,
I think this patch looks good. Thanks :)
But I have one request.
Could you please rename NR_TMPFS_MAPPED to NR_SWAP_BACKED_FILE_MAPPED?
I mean, mm/shmem isn't only used for tmpfs, but is also used by ipc/shm and
/dev/zero.
NR_TMPFS_MAPPED seems a bit misleading.
* [PATCH 2/4] Do not unconditionally treat zones that fail zone_reclaim() as full
2009-06-09 17:01 [PATCH 0/4] [RFC] Functional fix to zone_reclaim() and bring behaviour more in line with expectations V2 Mel Gorman
2009-06-09 17:01 ` [PATCH 1/4] Properly account for the number of page cache pages zone_reclaim() can reclaim Mel Gorman
@ 2009-06-09 17:01 ` Mel Gorman
2009-06-09 18:11 ` Rik van Riel
2009-06-10 1:52 ` KOSAKI Motohiro
2009-06-09 17:01 ` [PATCH 3/4] Count the number of times zone_reclaim() scans and fails Mel Gorman
2009-06-09 17:01 ` [PATCH 4/4] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA Mel Gorman
3 siblings, 2 replies; 25+ messages in thread
From: Mel Gorman @ 2009-06-09 17:01 UTC (permalink / raw)
To: Mel Gorman, KOSAKI Motohiro, Rik van Riel, Christoph Lameter,
yanmin.zhang, Wu Fengguang, linuxram
Cc: linux-mm, LKML
On NUMA machines, the administrator can configure zone_reclaim_mode, which
is a more targeted form of direct reclaim. On machines with large NUMA
distances, for example, zone_reclaim_mode defaults to 1, meaning that clean
unmapped pages will be reclaimed if the zone watermarks are not being
met. The problem is that if zone_reclaim() fails at all, the zone
gets marked full.
This can cause situations where a zone is usable, but is being skipped
because it has been considered full. Take a situation where a large tmpfs
mount is occupying a large percentage of memory overall. The pages do not
get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
and the zonelist cache considers it not worth trying in the future.
This patch makes zone_reclaim() return more fine-grained information about
what occurred when zone_reclaim() failed. The zone only gets marked full if
it really is unreclaimable. If it is a case where the scan did not occur, or
not enough pages were reclaimed with the limited reclaim_mode, then the
zone is simply skipped.
There is a side-effect to this patch. Currently, if zone_reclaim()
successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would
go ahead. With this patch applied, zone watermarks are rechecked after
zone_reclaim() does some work.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/internal.h | 4 ++++
mm/page_alloc.c | 26 ++++++++++++++++++++++----
mm/vmscan.c | 11 ++++++-----
3 files changed, 32 insertions(+), 9 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index f02c750..f290c4d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -259,4 +259,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int flags,
struct page **pages, struct vm_area_struct **vmas);
+#define ZONE_RECLAIM_NOSCAN -2
+#define ZONE_RECLAIM_FULL -1
+#define ZONE_RECLAIM_SOME 0
+#define ZONE_RECLAIM_SUCCESS 1
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d35e753..667ffbb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1477,15 +1477,33 @@ zonelist_scan:
BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
unsigned long mark;
+ int ret;
+
mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
- if (!zone_watermark_ok(zone, order, mark,
- classzone_idx, alloc_flags)) {
- if (!zone_reclaim_mode ||
- !zone_reclaim(zone, gfp_mask, order))
+ if (zone_watermark_ok(zone, order, mark,
+ classzone_idx, alloc_flags))
+ goto try_this_zone;
+
+ if (zone_reclaim_mode == 0)
+ goto this_zone_full;
+
+ ret = zone_reclaim(zone, gfp_mask, order);
+ switch (ret) {
+ case ZONE_RECLAIM_NOSCAN:
+ /* did not scan */
+ goto try_next_zone;
+ case ZONE_RECLAIM_FULL:
+ /* scanned but unreclaimable */
+ goto this_zone_full;
+ default:
+ /* did we reclaim enough */
+ if (!zone_watermark_ok(zone, order, mark,
+ classzone_idx, alloc_flags))
goto this_zone_full;
}
}
+try_this_zone:
page = buffered_rmqueue(preferred_zone, zone, order,
gfp_mask, migratetype);
if (page)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bfc76e..e862fc9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2462,16 +2462,16 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
*/
if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
- return 0;
+ return ZONE_RECLAIM_FULL;
if (zone_is_all_unreclaimable(zone))
- return 0;
+ return ZONE_RECLAIM_FULL;
/*
* Do not scan if the allocation should not be delayed.
*/
if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
- return 0;
+ return ZONE_RECLAIM_NOSCAN;
/*
* Only run zone reclaim on the local zone or on zones that do not
@@ -2481,10 +2481,11 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
*/
node_id = zone_to_nid(zone);
if (node_state(node_id, N_CPU) && node_id != numa_node_id())
- return 0;
+ return ZONE_RECLAIM_NOSCAN;
if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
- return 0;
+ return ZONE_RECLAIM_NOSCAN;
+
ret = __zone_reclaim(zone, gfp_mask, order);
zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
--
1.5.6.5
* Re: [PATCH 2/4] Do not unconditionally treat zones that fail zone_reclaim() as full
2009-06-09 17:01 ` [PATCH 2/4] Do not unconditionally treat zones that fail zone_reclaim() as full Mel Gorman
@ 2009-06-09 18:11 ` Rik van Riel
2009-06-10 1:52 ` KOSAKI Motohiro
1 sibling, 0 replies; 25+ messages in thread
From: Rik van Riel @ 2009-06-09 18:11 UTC (permalink / raw)
To: Mel Gorman
Cc: KOSAKI Motohiro, Christoph Lameter, yanmin.zhang, Wu Fengguang,
linuxram, linux-mm, LKML
Mel Gorman wrote:
> There is a side-effect to this patch. Currently, if zone_reclaim()
> successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would
> go ahead. With this patch applied, zone watermarks are rechecked after
> zone_reclaim() does some work.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
* Re: [PATCH 2/4] Do not unconditionally treat zones that fail zone_reclaim() as full
2009-06-09 17:01 ` [PATCH 2/4] Do not unconditionally treat zones that fail zone_reclaim() as full Mel Gorman
2009-06-09 18:11 ` Rik van Riel
@ 2009-06-10 1:52 ` KOSAKI Motohiro
1 sibling, 0 replies; 25+ messages in thread
From: KOSAKI Motohiro @ 2009-06-10 1:52 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Rik van Riel, Christoph Lameter, yanmin.zhang,
Wu Fengguang, linuxram, linux-mm, LKML
> mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
> - if (!zone_watermark_ok(zone, order, mark,
> - classzone_idx, alloc_flags)) {
> - if (!zone_reclaim_mode ||
> - !zone_reclaim(zone, gfp_mask, order))
> + if (zone_watermark_ok(zone, order, mark,
> + classzone_idx, alloc_flags))
> + goto try_this_zone;
> +
> + if (zone_reclaim_mode == 0)
> + goto this_zone_full;
> +
> + ret = zone_reclaim(zone, gfp_mask, order);
> + switch (ret) {
> + case ZONE_RECLAIM_NOSCAN:
> + /* did not scan */
> + goto try_next_zone;
> + case ZONE_RECLAIM_FULL:
> + /* scanned but unreclaimable */
> + goto this_zone_full;
> + default:
> + /* did we reclaim enough */
> + if (!zone_watermark_ok(zone, order, mark,
> + classzone_idx, alloc_flags))
> goto this_zone_full;
OK, this version's changes are more minimal than the previous one.
I'm not afraid of this patch now. Thanks.
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
* [PATCH 3/4] Count the number of times zone_reclaim() scans and fails
2009-06-09 17:01 [PATCH 0/4] [RFC] Functional fix to zone_reclaim() and bring behaviour more in line with expectations V2 Mel Gorman
2009-06-09 17:01 ` [PATCH 1/4] Properly account for the number of page cache pages zone_reclaim() can reclaim Mel Gorman
2009-06-09 17:01 ` [PATCH 2/4] Do not unconditionally treat zones that fail zone_reclaim() as full Mel Gorman
@ 2009-06-09 17:01 ` Mel Gorman
2009-06-09 18:56 ` Rik van Riel
` (2 more replies)
2009-06-09 17:01 ` [PATCH 4/4] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA Mel Gorman
3 siblings, 3 replies; 25+ messages in thread
From: Mel Gorman @ 2009-06-09 17:01 UTC (permalink / raw)
To: Mel Gorman, KOSAKI Motohiro, Rik van Riel, Christoph Lameter,
yanmin.zhang, Wu Fengguang, linuxram
Cc: linux-mm, LKML
On NUMA machines, the administrator can configure zone_reclaim_mode, which
is a more targeted form of direct reclaim. On machines with large NUMA
distances, for example, zone_reclaim_mode defaults to 1, meaning that clean
unmapped pages will be reclaimed if the zone watermarks are not being met.
There is a heuristic that determines if the scan is worthwhile, but it is
possible that the heuristic will fail and the CPU gets tied up scanning
uselessly. Detecting the situation requires some guesswork and experimentation,
so this patch adds a counter "zreclaim_failed" to /proc/vmstat. If this
counter is increasing rapidly during high CPU utilisation, then the resolution
to the problem may be to set /proc/sys/vm/zone_reclaim_mode to 0.
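As an illustrative check only (the counter name matches the patch below), a
rapidly-increasing counter can be watched with something like:

	watch -d 'grep zreclaim_failed /proc/vmstat'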
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/vmstat.h | 3 +++
mm/vmscan.c | 4 ++++
mm/vmstat.c | 3 +++
3 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index ff4696c..416f748 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -36,6 +36,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGSTEAL),
FOR_ALL_ZONES(PGSCAN_KSWAPD),
FOR_ALL_ZONES(PGSCAN_DIRECT),
+#ifdef CONFIG_NUMA
+ PGSCAN_ZONERECLAIM_FAILED,
+#endif
PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
#ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e862fc9..8be4582 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2489,6 +2489,10 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
ret = __zone_reclaim(zone, gfp_mask, order);
zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
+ if (!ret) {
+ count_vm_events(PGSCAN_ZONERECLAIM_FAILED, 1);
+ }
+
return ret;
}
#endif
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1e3aa81..02677d1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -673,6 +673,9 @@ static const char * const vmstat_text[] = {
TEXTS_FOR_ZONES("pgscan_kswapd")
TEXTS_FOR_ZONES("pgscan_direct")
+#ifdef CONFIG_NUMA
+ "zreclaim_failed",
+#endif
"pginodesteal",
"slabs_scanned",
"kswapd_steal",
--
1.5.6.5
* Re: [PATCH 3/4] Count the number of times zone_reclaim() scans and fails
2009-06-09 17:01 ` [PATCH 3/4] Count the number of times zone_reclaim() scans and fails Mel Gorman
@ 2009-06-09 18:56 ` Rik van Riel
2009-06-10 1:47 ` KOSAKI Motohiro
2009-06-10 2:10 ` Wu Fengguang
2 siblings, 0 replies; 25+ messages in thread
From: Rik van Riel @ 2009-06-09 18:56 UTC (permalink / raw)
To: Mel Gorman
Cc: KOSAKI Motohiro, Christoph Lameter, yanmin.zhang, Wu Fengguang,
linuxram, linux-mm, LKML
Mel Gorman wrote:
> On NUMA machines, the administrator can configure zone_reclaim_mode that
> is a more targetted form of direct reclaim. On machines with large NUMA
> distances for example, a zone_reclaim_mode defaults to 1 meaning that clean
> unmapped pages will be reclaimed if the zone watermarks are not being met.
>
> There is a heuristic that determines if the scan is worthwhile but it is
> possible that the heuristic will fail and the CPU gets tied up scanning
> uselessly. Detecting the situation requires some guesswork and experimentation
> so this patch adds a counter "zreclaim_failed" to /proc/vmstat. If during
> high CPU utilisation this counter is increasing rapidly, then the resolution
> to the problem may be to set /proc/sys/vm/zone_reclaim_mode to 0.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
* Re: [PATCH 3/4] Count the number of times zone_reclaim() scans and fails
2009-06-09 17:01 ` [PATCH 3/4] Count the number of times zone_reclaim() scans and fails Mel Gorman
2009-06-09 18:56 ` Rik van Riel
@ 2009-06-10 1:47 ` KOSAKI Motohiro
2009-06-10 10:36 ` Mel Gorman
2009-06-10 2:10 ` Wu Fengguang
2 siblings, 1 reply; 25+ messages in thread
From: KOSAKI Motohiro @ 2009-06-10 1:47 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Rik van Riel, Christoph Lameter, yanmin.zhang,
Wu Fengguang, linuxram, linux-mm, LKML
Hi
I like this patch. Thank you, Mel.
> @@ -2489,6 +2489,10 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> ret = __zone_reclaim(zone, gfp_mask, order);
> zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
>
> + if (!ret) {
> + count_vm_events(PGSCAN_ZONERECLAIM_FAILED, 1);
> + }
> +
> return ret;
count_vm_event(PGSCAN_ZONERECLAIM_FAILED)?
* Re: [PATCH 3/4] Count the number of times zone_reclaim() scans and fails
2009-06-10 1:47 ` KOSAKI Motohiro
@ 2009-06-10 10:36 ` Mel Gorman
0 siblings, 0 replies; 25+ messages in thread
From: Mel Gorman @ 2009-06-10 10:36 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Rik van Riel, Christoph Lameter, yanmin.zhang, Wu Fengguang,
linuxram, linux-mm, LKML
On Wed, Jun 10, 2009 at 10:47:20AM +0900, KOSAKI Motohiro wrote:
> Hi
>
> I like this patch. thank you mel.
>
> > @@ -2489,6 +2489,10 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > ret = __zone_reclaim(zone, gfp_mask, order);
> > zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
> >
> > + if (!ret) {
> > + count_vm_events(PGSCAN_ZONERECLAIM_FAILED, 1);
> > + }
> > +
> > return ret;
>
> count_vm_event(PGSCAN_ZONERECLAIM_FAILED)?
>
/me slaps self
Yes, that makes more sense.
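i.e. the hunk would presumably end up looking like this (untested sketch):

	if (!ret)
		count_vm_event(PGSCAN_ZONERECLAIM_FAILED);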
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 3/4] Count the number of times zone_reclaim() scans and fails
2009-06-09 17:01 ` [PATCH 3/4] Count the number of times zone_reclaim() scans and fails Mel Gorman
2009-06-09 18:56 ` Rik van Riel
2009-06-10 1:47 ` KOSAKI Motohiro
@ 2009-06-10 2:10 ` Wu Fengguang
2009-06-10 10:40 ` Mel Gorman
2 siblings, 1 reply; 25+ messages in thread
From: Wu Fengguang @ 2009-06-10 2:10 UTC (permalink / raw)
To: Mel Gorman
Cc: KOSAKI Motohiro, Rik van Riel, Christoph Lameter, Zhang, Yanmin,
linuxram, linux-mm, LKML
On Wed, Jun 10, 2009 at 01:01:43AM +0800, Mel Gorman wrote:
> On NUMA machines, the administrator can configure zone_reclaim_mode that
> is a more targetted form of direct reclaim. On machines with large NUMA
> distances for example, a zone_reclaim_mode defaults to 1 meaning that clean
> unmapped pages will be reclaimed if the zone watermarks are not being met.
>
> There is a heuristic that determines if the scan is worthwhile but it is
> possible that the heuristic will fail and the CPU gets tied up scanning
> uselessly. Detecting the situation requires some guesswork and experimentation
> so this patch adds a counter "zreclaim_failed" to /proc/vmstat. If during
> high CPU utilisation this counter is increasing rapidly, then the resolution
> to the problem may be to set /proc/sys/vm/zone_reclaim_mode to 0.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> include/linux/vmstat.h | 3 +++
> mm/vmscan.c | 4 ++++
> mm/vmstat.c | 3 +++
> 3 files changed, 10 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index ff4696c..416f748 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -36,6 +36,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> FOR_ALL_ZONES(PGSTEAL),
> FOR_ALL_ZONES(PGSCAN_KSWAPD),
> FOR_ALL_ZONES(PGSCAN_DIRECT),
> +#ifdef CONFIG_NUMA
> + PGSCAN_ZONERECLAIM_FAILED,
> +#endif
I'd rather refine the zone accounting (i.e. mapped tmpfs pages)
so that we know whether a zone scan is going to be fruitless. Then
we can get rid of the remedy patches 3 and 4.
We don't have to worry about swap cache pages accounted as file pages,
since there is no double accounting in NR_FILE_PAGES for tmpfs pages.
We don't have to worry about MLOCKED pages, because they may defeat
the estimation temporarily, but after one or several more zone scans,
MLOCKED pages will go to the unevictable list, hence this cause of
zone reclaim failure won't be persistent.
Any more known accounting holes?
Thanks,
Fengguang
> PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> #ifdef CONFIG_HUGETLB_PAGE
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e862fc9..8be4582 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2489,6 +2489,10 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> ret = __zone_reclaim(zone, gfp_mask, order);
> zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
>
> + if (!ret) {
> + count_vm_events(PGSCAN_ZONERECLAIM_FAILED, 1);
> + }
> +
> return ret;
> }
> #endif
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 1e3aa81..02677d1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -673,6 +673,9 @@ static const char * const vmstat_text[] = {
> TEXTS_FOR_ZONES("pgscan_kswapd")
> TEXTS_FOR_ZONES("pgscan_direct")
>
> +#ifdef CONFIG_NUMA
> + "zreclaim_failed",
> +#endif
> "pginodesteal",
> "slabs_scanned",
> "kswapd_steal",
> --
> 1.5.6.5
* Re: [PATCH 3/4] Count the number of times zone_reclaim() scans and fails
2009-06-10 2:10 ` Wu Fengguang
@ 2009-06-10 10:40 ` Mel Gorman
0 siblings, 0 replies; 25+ messages in thread
From: Mel Gorman @ 2009-06-10 10:40 UTC (permalink / raw)
To: Wu Fengguang
Cc: KOSAKI Motohiro, Rik van Riel, Christoph Lameter, Zhang, Yanmin,
linuxram, linux-mm, LKML
On Wed, Jun 10, 2009 at 10:10:28AM +0800, Wu Fengguang wrote:
> On Wed, Jun 10, 2009 at 01:01:43AM +0800, Mel Gorman wrote:
> > On NUMA machines, the administrator can configure zone_reclaim_mode that
> > is a more targetted form of direct reclaim. On machines with large NUMA
> > distances for example, a zone_reclaim_mode defaults to 1 meaning that clean
> > unmapped pages will be reclaimed if the zone watermarks are not being met.
> >
> > There is a heuristic that determines if the scan is worthwhile but it is
> > possible that the heuristic will fail and the CPU gets tied up scanning
> > uselessly. Detecting the situation requires some guesswork and experimentation
> > so this patch adds a counter "zreclaim_failed" to /proc/vmstat. If during
> > high CPU utilisation this counter is increasing rapidly, then the resolution
> > to the problem may be to set /proc/sys/vm/zone_reclaim_mode to 0.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > include/linux/vmstat.h | 3 +++
> > mm/vmscan.c | 4 ++++
> > mm/vmstat.c | 3 +++
> > 3 files changed, 10 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index ff4696c..416f748 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -36,6 +36,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > FOR_ALL_ZONES(PGSTEAL),
> > FOR_ALL_ZONES(PGSCAN_KSWAPD),
> > FOR_ALL_ZONES(PGSCAN_DIRECT),
> > +#ifdef CONFIG_NUMA
> > + PGSCAN_ZONERECLAIM_FAILED,
> > +#endif
>
> I'd rather to refine the zone accounting (ie. mapped tmpfs pages)
> so that we know whether a zone scan is going to be fruitless. Then
> we can get rid of the remedy patches 3 and 4.
>
This patch is not a remedy patch as such. tmpfs might not be the only
trigger case for this zone_reclaim() excessive scan problem. In the
event it's occurring, we want to be able to better pinpoint why we are
spinning at 100% CPU. It's to reduce the debug time if/when this problem
is encountered.
On the mapped tmpfs page accounting, I mentioned the problems I see with
this in another mail. It would alter a number of paths, particularly the
page cache add/remove paths, to maintain the counters we need to avoid
tmpfs in this slow NUMA-specific path. I'm hoping that can be avoided.
> We don't have to worry about swap cache pages accounted as file pages.
> Since there are no double accounting in NR_FILE_PAGES for tmpfs pages.
>
> We don't have to worry about MLOCKED pages, because they may defeat
> the estimation temporarily, but after one or several more zone scans,
> MLOCKED pages will go to the unevictable list, hence this cause of
> zone reclaim failure won't be persistent.
>
> Any more known accounting holes?
>
Not that I'm aware of but if/when they show up, I'd like to be able to
detect the situation easily, hence this patch.
> Thanks,
> Fengguang
>
> > PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> > PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> > #ifdef CONFIG_HUGETLB_PAGE
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index e862fc9..8be4582 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2489,6 +2489,10 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > ret = __zone_reclaim(zone, gfp_mask, order);
> > zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
> >
> > + if (!ret) {
> > + count_vm_events(PGSCAN_ZONERECLAIM_FAILED, 1);
> > + }
> > +
> > return ret;
> > }
> > #endif
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 1e3aa81..02677d1 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -673,6 +673,9 @@ static const char * const vmstat_text[] = {
> > TEXTS_FOR_ZONES("pgscan_kswapd")
> > TEXTS_FOR_ZONES("pgscan_direct")
> >
> > +#ifdef CONFIG_NUMA
> > + "zreclaim_failed",
> > +#endif
> > "pginodesteal",
> > "slabs_scanned",
> > "kswapd_steal",
> > --
> > 1.5.6.5
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* [PATCH 4/4] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA
2009-06-09 17:01 [PATCH 0/4] [RFC] Functional fix to zone_reclaim() and bring behaviour more in line with expectations V2 Mel Gorman
` (2 preceding siblings ...)
2009-06-09 17:01 ` [PATCH 3/4] Count the number of times zone_reclaim() scans and fails Mel Gorman
@ 2009-06-09 17:01 ` Mel Gorman
2009-06-10 1:53 ` KOSAKI Motohiro
2009-06-10 5:54 ` Andrew Morton
3 siblings, 2 replies; 25+ messages in thread
From: Mel Gorman @ 2009-06-09 17:01 UTC (permalink / raw)
To: Mel Gorman, KOSAKI Motohiro, Rik van Riel, Christoph Lameter,
yanmin.zhang, Wu Fengguang, linuxram
Cc: linux-mm, LKML
On NUMA machines, the administrator can configure zone_reclaim_mode, which is a
more targeted form of direct reclaim. On machines with large NUMA distances,
zone_reclaim_mode defaults to 1, meaning that clean unmapped pages will be
reclaimed if the zone watermarks are not being met. The problem is that
zone_reclaim() may get into a situation where it scans excessively without
making progress.
One such situation occurred where a large tmpfs mount occupied a
large percentage of memory overall. The pages did not get reclaimed by
zone_reclaim(), but the lists were uselessly scanned frequently, making the
CPU spin at 100%. The observation in the field was that malloc() stalled
for a long time (minutes in some cases) when this situation occurred. This
situation should be resolved now and there are counters in place that
detect when the scan-avoidance heuristics break, but the heuristics might
still not be bulletproof. If they fail again, the kernel should respond
in some fashion other than uselessly scanning and chewing up CPU time.
This patch reintroduces zone_reclaim_interval, which was removed by commit
34aa1330f9b3c5783d269851d467326525207422 [zoned vm counters: zone_reclaim:
remove /proc/sys/vm/zone_reclaim_interval]. In the event the scan-avoidance
heuristics fail, the event is counted and zone_reclaim_interval avoids
excessive scanning.
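For example (a usage sketch only; the sysctl added below takes a value in
seconds):

	# retry local reclaim after 5 seconds instead of the default 30
	echo 5 > /proc/sys/vm/zone_reclaim_interval

	# or never back off and always attempt local reclaim
	echo 0 > /proc/sys/vm/zone_reclaim_interval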
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
Documentation/sysctl/vm.txt | 15 +++++++++++++++
include/linux/mmzone.h | 9 +++++++++
include/linux/swap.h | 1 +
kernel/sysctl.c | 9 +++++++++
mm/vmscan.c | 24 ++++++++++++++++++++++++
5 files changed, 58 insertions(+), 0 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 0ea5adb..22ffc3e 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -52,6 +52,7 @@ Currently, these files are in /proc/sys/vm:
- swappiness
- vfs_cache_pressure
- zone_reclaim_mode
+- zone_reclaim_interval
==============================================================
@@ -621,4 +622,18 @@ Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.
+================================================================
+
+zone_reclaim_interval:
+
+The time allowed for off-node allocations after zone reclaim
+has failed to reclaim enough pages to allow a local allocation.
+
+Time is set in seconds and set by default to 30 seconds.
+
+Reduce the interval if undesired off-node allocations occur or
+set to 0 to always try and reclaim pages for node-local memory.
+However, too frequent scans will have a negative impact on
+off-node allocation performance and manifest as high CPU usage.
+
============ End of Document =================================
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8895985..3a53e1c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -335,6 +335,15 @@ struct zone {
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
/*
+ * timestamp (in jiffies) of the last zone_reclaim that scanned
+ * but failed to free enough pages. This is used to avoid repeated
+ * scans when zone_reclaim() is unable to detect in advance that
+ * the scanning is useless. This can happen for example if a zone
+ * has large numbers of clean unmapped file pages on tmpfs
+ */
+ unsigned long zone_reclaim_failure;
+
+ /*
* prev_priority holds the scanning priority for this zone. It is
* defined as the scanning priority at which we achieved our reclaim
* target at the previous try_to_free_pages() or balance_pgdat()
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c88b366..28a01e3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -225,6 +225,7 @@ extern long vm_total_pages;
#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
+extern int zone_reclaim_interval;
extern int sysctl_min_unmapped_ratio;
extern int sysctl_min_slab_ratio;
extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 0554886..2afffa5 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1221,6 +1221,15 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
},
{
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "zone_reclaim_interval",
+ .data = &zone_reclaim_interval,
+ .maxlen = sizeof(zone_reclaim_interval),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_jiffies,
+ .strategy = &sysctl_jiffies,
+ },
+ {
.ctl_name = VM_MIN_UNMAPPED,
.procname = "min_unmapped_ratio",
.data = &sysctl_min_unmapped_ratio,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8be4582..5fa4843 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2315,6 +2315,13 @@ int zone_reclaim_mode __read_mostly;
#define RECLAIM_SWAP (1<<2) /* Swap pages out during reclaim */
/*
+ * Minimum time between zone_reclaim() scans that failed. Ordinarily, a
+ * scan will not fail because it will be determined in advance if it can
+ * succeed but this does not always work. See mmzone.h
+ */
+int zone_reclaim_interval __read_mostly = 30*HZ;
+
+/*
* Priority for ZONE_RECLAIM. This determines the fraction of pages
* of a node considered for each zone_reclaim. 4 scans 1/16th of
* a zone.
@@ -2464,6 +2471,15 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
return ZONE_RECLAIM_FULL;
+ /* Watch for jiffie wraparound */
+ if (unlikely(jiffies < zone->zone_reclaim_failure))
+ zone->zone_reclaim_failure = jiffies;
+
+ /* Do not attempt a scan if scanning failed recently */
+ if (time_before(jiffies,
+ zone->zone_reclaim_failure + zone_reclaim_interval))
+ return ZONE_RECLAIM_FULL;
+
if (zone_is_all_unreclaimable(zone))
return ZONE_RECLAIM_FULL;
@@ -2491,6 +2507,14 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
if (!ret) {
count_vm_events(PGSCAN_ZONERECLAIM_FAILED, 1);
+
+ /*
+ * We were unable to reclaim enough pages to stay on node and
+ * unable to detect in advance that the scan would fail. Allow
+ * off node accesses for zone_reclaim_interval jiffies before
+ * trying zone_reclaim() again
+ */
+ zone->zone_reclaim_failure = jiffies;
}
return ret;
--
1.5.6.5
* Re: [PATCH 4/4] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA
2009-06-09 17:01 ` [PATCH 4/4] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA Mel Gorman
@ 2009-06-10 1:53 ` KOSAKI Motohiro
2009-06-10 5:54 ` Andrew Morton
1 sibling, 0 replies; 25+ messages in thread
From: KOSAKI Motohiro @ 2009-06-10 1:53 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Rik van Riel, Christoph Lameter, yanmin.zhang,
Wu Fengguang, linuxram, linux-mm, LKML
> +
> +zone_reclaim_interval:
> +
> +The time allowed for off-node allocations after zone reclaim
> +has failed to reclaim enough pages to allow a local allocation.
> +
> +Time is set in seconds and set by default to 30 seconds.
> +
> +Reduce the interval if undesired off-node allocations occur or
> +set to 0 to always try and reclaim pages for node-local memory.
> +However, too frequent scans will have a negative impact on
> +off-node allocation performance and manifest as high CPU usage.
> +
Good documentation :)
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
-kosaki
> ============ End of Document =================================
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8895985..3a53e1c 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -335,6 +335,15 @@ struct zone {
> atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
>
> /*
> + * timestamp (in jiffies) of the last zone_reclaim that scanned
> + * but failed to free enough pages. This is used to avoid repeated
> + * scans when zone_reclaim() is unable to detect in advance that
> + * the scanning is useless. This can happen for example if a zone
> + * has large numbers of clean unmapped file pages on tmpfs
> + */
> + unsigned long zone_reclaim_failure;
> +
> + /*
> * prev_priority holds the scanning priority for this zone. It is
> * defined as the scanning priority at which we achieved our reclaim
> * target at the previous try_to_free_pages() or balance_pgdat()
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index c88b366..28a01e3 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -225,6 +225,7 @@ extern long vm_total_pages;
>
> #ifdef CONFIG_NUMA
> extern int zone_reclaim_mode;
> +extern int zone_reclaim_interval;
> extern int sysctl_min_unmapped_ratio;
> extern int sysctl_min_slab_ratio;
> extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 0554886..2afffa5 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1221,6 +1221,15 @@ static struct ctl_table vm_table[] = {
> .extra1 = &zero,
> },
> {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "zone_reclaim_interval",
> + .data = &zone_reclaim_interval,
> + .maxlen = sizeof(zone_reclaim_interval),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec_jiffies,
> + .strategy = &sysctl_jiffies,
> + },
> + {
> .ctl_name = VM_MIN_UNMAPPED,
> .procname = "min_unmapped_ratio",
> .data = &sysctl_min_unmapped_ratio,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8be4582..5fa4843 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2315,6 +2315,13 @@ int zone_reclaim_mode __read_mostly;
> #define RECLAIM_SWAP (1<<2) /* Swap pages out during reclaim */
>
> /*
> + * Minimum time between zone_reclaim() scans that failed. Ordinarily, a
> + * scan will not fail because it will be determined in advance if it can
> + * succeeed but this does not always work. See mmzone.h
> + */
> +int zone_reclaim_interval __read_mostly = 30*HZ;
> +
> +/*
> * Priority for ZONE_RECLAIM. This determines the fraction of pages
> * of a node considered for each zone_reclaim. 4 scans 1/16th of
> * a zone.
> @@ -2464,6 +2471,15 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
> return ZONE_RECLAIM_FULL;
>
> + /* Watch for jiffie wraparound */
> + if (unlikely(jiffies < zone->zone_reclaim_failure))
> + zone->zone_reclaim_failure = jiffies;
> +
> + /* Do not attempt a scan if scanning failed recently */
> + if (time_before(jiffies,
> + zone->zone_reclaim_failure + zone_reclaim_interval))
> + return ZONE_RECLAIM_FULL;
> +
> if (zone_is_all_unreclaimable(zone))
> return ZONE_RECLAIM_FULL;
>
> @@ -2491,6 +2507,14 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>
> if (!ret) {
> count_vm_events(PGSCAN_ZONERECLAIM_FAILED, 1);
> +
> + /*
> + * We were unable to reclaim enough pages to stay on node and
> + * unable to detect in advance that the scan would fail. Allow
> + * off node accesses for zone_reclaim_inteval jiffies before
> + * trying zone_reclaim() again
> + */
> + zone->zone_reclaim_failure = jiffies;
> }
>
> return ret;
> --
> 1.5.6.5
>
* Re: [PATCH 4/4] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA
2009-06-09 17:01 ` [PATCH 4/4] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA Mel Gorman
2009-06-10 1:53 ` KOSAKI Motohiro
@ 2009-06-10 5:54 ` Andrew Morton
2009-06-10 10:48 ` Mel Gorman
1 sibling, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2009-06-10 5:54 UTC (permalink / raw)
To: Mel Gorman
Cc: KOSAKI Motohiro, Rik van Riel, Christoph Lameter, yanmin.zhang,
Wu Fengguang, linuxram, linux-mm, LKML
On Tue, 9 Jun 2009 18:01:44 +0100 Mel Gorman <mel@csn.ul.ie> wrote:
> On NUMA machines, the administrator can configure zone_reclaim_mode that is a
> more targetted form of direct reclaim. On machines with large NUMA distances,
> zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be
> reclaimed if the zone watermarks are not being met. The problem is that
> zone_reclaim() may get into a situation where it scans excessively without
> making progress.
>
> One such situation occured where a large tmpfs mount occupied a
> large percentage of memory overall. The pages did not get reclaimed by
> zone_reclaim(), but the lists are uselessly scanned frequencly making the
> CPU spin at 100%. The observation in the field was that malloc() stalled
> for a long time (minutes in some cases) when this situation occurs. This
> situation should be resolved now and there are counters in place that
> detect when the scan-avoidance heuristics break but the heuristics might
> still not be bullet proof. If they fail again, the kernel should respond
> in some fashion other than scanning uselessly chewing up CPU time.
>
> This patch reintroduces zone_reclaim_interval which was removed by commit
> 34aa1330f9b3c5783d269851d467326525207422 [zoned vm counters: zone_reclaim:
> remove /proc/sys/vm/zone_reclaim_interval. In the event the scan-avoidance
> heuristics fail, the event is counted and zone_reclaim_interval avoids
> excessive scanning.
More distressed fretting!
Pages can be allocated and freed and reclaimed at rates anywhere
between zero per second and one million per second or more. So what
sense does it make to pace MM activity by wall-time??
A better clock for pacing MM activity is page-allocation-attempts, or
pages-scanned, etc.
* Re: [PATCH 4/4] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA
2009-06-10 5:54 ` Andrew Morton
@ 2009-06-10 10:48 ` Mel Gorman
0 siblings, 0 replies; 25+ messages in thread
From: Mel Gorman @ 2009-06-10 10:48 UTC (permalink / raw)
To: Andrew Morton
Cc: KOSAKI Motohiro, Rik van Riel, Christoph Lameter, yanmin.zhang,
Wu Fengguang, linuxram, linux-mm, LKML
On Tue, Jun 09, 2009 at 10:54:25PM -0700, Andrew Morton wrote:
> On Tue, 9 Jun 2009 18:01:44 +0100 Mel Gorman <mel@csn.ul.ie> wrote:
>
> > On NUMA machines, the administrator can configure zone_reclaim_mode that is a
> > more targetted form of direct reclaim. On machines with large NUMA distances,
> > zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be
> > reclaimed if the zone watermarks are not being met. The problem is that
> > zone_reclaim() may get into a situation where it scans excessively without
> > making progress.
> >
> > One such situation occured where a large tmpfs mount occupied a
> > large percentage of memory overall. The pages did not get reclaimed by
> > zone_reclaim(), but the lists are uselessly scanned frequencly making the
> > CPU spin at 100%. The observation in the field was that malloc() stalled
> > for a long time (minutes in some cases) when this situation occurs. This
> > situation should be resolved now and there are counters in place that
> > detect when the scan-avoidance heuristics break but the heuristics might
> > still not be bullet proof. If they fail again, the kernel should respond
> > in some fashion other than scanning uselessly chewing up CPU time.
> >
> > This patch reintroduces zone_reclaim_interval which was removed by commit
> > 34aa1330f9b3c5783d269851d467326525207422 [zoned vm counters: zone_reclaim:
> > remove /proc/sys/vm/zone_reclaim_interval. In the event the scan-avoidance
> > heuristics fail, the event is counted and zone_reclaim_interval avoids
> > excessive scanning.
>
> More distressed fretting!
>
Not at all. One day I'll get a significant patch completed without any
eyebrows raised and the world will end :).
> Pages can be allocated and freed and reclaimed at rates anywhere
> between zero per second to one million per second or more. So what
> sense does it make to pace MM activity by wall-time??
>
None - this is a brute-force workaround for when the scan heuristic breaks,
and was lifted directly from an old patch by Christoph. It could be much
better.
> A better clock for pacing MM activity is page-allocation-attempts, or
> pages-scanned, etc.
>
Agreed. Wu convinced me of that and had some good suggestions. I'm waiting
to hear back from the testers on the new scan heuristics to see if they
are working or not. If they are working now, I'll drop this patch. If we
decide we need it, I'll update with Wu's work. I need to send the testers a
new version now though because of the underflow problem.
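A rough sketch of pacing by attempts rather than wall time might look like
the following (zone_reclaim_backoff would be a new, hypothetical per-zone
counter and the backoff length here is arbitrary):

	/* in zone_reclaim(), before deciding to scan */
	if (zone->zone_reclaim_backoff) {
		zone->zone_reclaim_backoff--;
		return ZONE_RECLAIM_FULL;	/* skip: too soon after a failed scan */
	}

	ret = __zone_reclaim(zone, gfp_mask, order);
	if (!ret)
		/* back off for the next N allocation attempts, not N seconds */
		zone->zone_reclaim_backoff = 1024;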
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab