[PATCH] mm: do batched scans for mem

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm: do batched scans for mem_cgroup
@ 2009-08-20  2:49 Wu Fengguang
  2009-08-20  2:52 ` [PATCH] mm: make nr_scan_try_batch() more safe on races Wu Fengguang
  2009-08-20  3:13 ` [PATCH] mm: do batched scans for mem_cgroup KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 19+ messages in thread
From: Wu Fengguang @ 2009-08-20  2:49 UTC (permalink / raw)
  To: Andrew Morton, Balbir Singh
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
	Hugh Dickins, Christoph Lameter, Mel Gorman, LKML, linux-mm,
	nishimura, lizf, menage

For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
in which case shrink_list() _still_ calls isolate_pages() with the much
larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
scan rate by up to 32 times.

For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
So when shrink_zone() expects to scan 4 pages in the active/inactive
list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.

The accesses to nr_saved_scan are not lock protected and so not 100%
accurate, however we can tolerate small errors and the resulted small
imbalanced scan rates between zones.

This batching won't blur up the cgroup limits, since it is driven by
"pages reclaimed" rather than "pages scanned". When shrink_zone()
decides to cancel (and save) one smallish scan, it may well be called
again to accumulate up nr_saved_scan.

It could possibly be a problem for some tiny mem_cgroup (which may be
_full_ scanned too much times in order to accumulate up nr_saved_scan).

CC: Rik van Riel <riel@redhat.com>
CC: Minchan Kim <minchan.kim@gmail.com>
CC: Balbir Singh <balbir@linux.vnet.ibm.com>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/memcontrol.h |    3 +++
 mm/memcontrol.c            |   12 ++++++++++++
 mm/vmscan.c                |    9 +++++----
 3 files changed, 20 insertions(+), 4 deletions(-)

--- linux.orig/include/linux/memcontrol.h	2009-08-20 10:41:05.000000000 +0800
+++ linux/include/linux/memcontrol.h	2009-08-20 10:43:22.000000000 +0800
@@ -98,6 +98,9 @@ int mem_cgroup_inactive_file_is_low(stru
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
 				       struct zone *zone,
 				       enum lru_list lru);
+unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
+					 struct zone *zone,
+					 enum lru_list lru);
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
 						      struct zone *zone);
 struct zone_reclaim_stat*
--- linux.orig/mm/memcontrol.c	2009-08-20 10:43:20.000000000 +0800
+++ linux/mm/memcontrol.c	2009-08-20 10:43:22.000000000 +0800
@@ -115,6 +115,7 @@ struct mem_cgroup_per_zone {
 	 */
 	struct list_head	lists[NR_LRU_LISTS];
 	unsigned long		count[NR_LRU_LISTS];
+	unsigned long		nr_saved_scan[NR_LRU_LISTS];
 
 	struct zone_reclaim_stat reclaim_stat;
 };
@@ -597,6 +598,17 @@ unsigned long mem_cgroup_zone_nr_pages(s
 	return MEM_CGROUP_ZSTAT(mz, lru);
 }
 
+unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
+					 struct zone *zone,
+					 enum lru_list lru)
+{
+	int nid = zone->zone_pgdat->node_id;
+	int zid = zone_idx(zone);
+	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+	return &mz->nr_saved_scan[lru];
+}
+
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
 						      struct zone *zone)
 {
--- linux.orig/mm/vmscan.c	2009-08-20 10:40:56.000000000 +0800
+++ linux/mm/vmscan.c	2009-08-20 10:43:22.000000000 +0800
@@ -1534,6 +1534,7 @@ static void shrink_zone(int priority, st
 	for_each_evictable_lru(l) {
 		int file = is_file_lru(l);
 		unsigned long scan;
+		unsigned long *saved_scan;
 
 		scan = zone_nr_pages(zone, sc, l);
 		if (priority || noswap) {
@@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st
 			scan = (scan * percent[file]) / 100;
 		}
 		if (scanning_global_lru(sc))
-			nr[l] = nr_scan_try_batch(scan,
-						  &zone->lru[l].nr_saved_scan,
-						  swap_cluster_max);
+			saved_scan = &zone->lru[l].nr_saved_scan;
 		else
-			nr[l] = scan;
+			saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup,
+							       zone, l);
+		nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max);
 	}
 
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] mm: make nr_scan_try_batch() more safe on races
  2009-08-20  2:49 [PATCH] mm: do batched scans for mem_cgroup Wu Fengguang
@ 2009-08-20  2:52 ` Wu Fengguang
  2009-08-20  3:17   ` [RFC][PATCH] mm: remove unnecessary loop inside shrink_inactive_list() Wu Fengguang
  2009-08-20  3:13 ` [PATCH] mm: do batched scans for mem_cgroup KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2009-08-20  2:52 UTC (permalink / raw)
  To: Andrew Morton, Balbir Singh
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
	Hugh Dickins, Christoph Lameter, Mel Gorman, LKML, linux-mm,
	nishimura, lizf, menage

nr_scan_try_batch() can be called concurrently on the same zone
and the non-atomic calculations can go wrong. This is not a big
problem as long as the errors are small and won't impact the
balanced zone aging noticeably.

@nr_to_scan could be much larger values than @swap_cluster_max.
So don't store such large values to *nr_saved_scan directly,
which helps reducing possible errors on races.

CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/vmscan.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

--- linux.orig/mm/vmscan.c	2009-08-20 10:46:30.000000000 +0800
+++ linux/mm/vmscan.c	2009-08-20 10:49:36.000000000 +0800
@@ -1496,15 +1496,14 @@ static unsigned long nr_scan_try_batch(u
 				       unsigned long *nr_saved_scan,
 				       unsigned long swap_cluster_max)
 {
-	unsigned long nr;
-
-	*nr_saved_scan += nr_to_scan;
-	nr = *nr_saved_scan;
+	unsigned long nr = *nr_saved_scan + nr_to_scan;
 
 	if (nr >= swap_cluster_max)
 		*nr_saved_scan = 0;
-	else
+	else {
+		*nr_saved_scan = nr;
 		nr = 0;
+	}
 
 	return nr;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC][PATCH] mm: remove unnecessary loop inside shrink_inactive_list()
  2009-08-20  2:52 ` [PATCH] mm: make nr_scan_try_batch() more safe on races Wu Fengguang
@ 2009-08-20  3:17   ` Wu Fengguang
  2009-08-21 11:09     ` KOSAKI Motohiro
  0 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2009-08-20  3:17 UTC (permalink / raw)
  To: Andrew Morton, Balbir Singh
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
	Hugh Dickins, Christoph Lameter, Mel Gorman, LKML, linux-mm,
	nishimura, lizf, menage

shrink_inactive_list() won't be called to scan too much pages
(unless in hibernation code which is fine) or too few pages (ie.
batching is taken care of by the callers).  So we can just remove the
big loop and isolate the exact number of pages requested.

Just a RFC, and a scratch patch to show the basic idea.
Please kindly NAK quick if you don't like it ;)

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/vmscan.c |   32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

--- linux.orig/mm/vmscan.c	2009-08-20 10:16:18.000000000 +0800
+++ linux/mm/vmscan.c	2009-08-20 10:24:34.000000000 +0800
@@ -1032,16 +1032,22 @@ int isolate_lru_page(struct page *page)
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
  */
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 			struct zone *zone, struct scan_control *sc,
 			int priority, int file)
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
-	unsigned long nr_scanned = 0;
-	unsigned long nr_reclaimed = 0;
+	unsigned long nr_reclaimed;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
-	int lumpy_reclaim = 0;
+	int lumpy_reclaim;
+	struct page *page;
+	unsigned long nr_taken;
+	unsigned long nr_scan;
+	unsigned long nr_freed;
+	unsigned long nr_active;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+	int mode;
 
 	/*
 	 * If we need a large contiguous chunk of memory, or have
@@ -1054,21 +1060,17 @@ static unsigned long shrink_inactive_lis
 		lumpy_reclaim = 1;
 	else if (sc->order && priority < DEF_PRIORITY - 2)
 		lumpy_reclaim = 1;
+	else
+		lumpy_reclaim = 0;
+
+	mode = lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE;
 
 	pagevec_init(&pvec, 1);
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
-	do {
-		struct page *page;
-		unsigned long nr_taken;
-		unsigned long nr_scan;
-		unsigned long nr_freed;
-		unsigned long nr_active;
-		unsigned int count[NR_LRU_LISTS] = { 0, };
-		int mode = lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE;
 
-		nr_taken = sc->isolate_pages(sc->swap_cluster_max,
+		nr_taken = sc->isolate_pages(nr_to_scan,
 			     &page_list, &nr_scan, sc->order, mode,
 				zone, sc->mem_cgroup, 0, file);
 		nr_active = clear_active_flags(&page_list, count);
@@ -1093,7 +1095,6 @@ static unsigned long shrink_inactive_lis
 
 		spin_unlock_irq(&zone->lru_lock);
 
-		nr_scanned += nr_scan;
 		nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
 
 		/*
@@ -1117,7 +1118,7 @@ static unsigned long shrink_inactive_lis
 							PAGEOUT_IO_SYNC);
 		}
 
-		nr_reclaimed += nr_freed;
+		nr_reclaimed = nr_freed;
 		local_irq_disable();
 		if (current_is_kswapd()) {
 			__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
@@ -1158,7 +1159,6 @@ static unsigned long shrink_inactive_lis
 				spin_lock_irq(&zone->lru_lock);
 			}
 		}
-  	} while (nr_scanned < max_scan);
 	spin_unlock(&zone->lru_lock);
 done:
 	local_irq_enable();

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH] mm: remove unnecessary loop inside shrink_inactive_list()
  2009-08-20  3:17   ` [RFC][PATCH] mm: remove unnecessary loop inside shrink_inactive_list() Wu Fengguang
@ 2009-08-21 11:09     ` KOSAKI Motohiro
  2009-08-21 11:22       ` Wu Fengguang
  0 siblings, 1 reply; 19+ messages in thread
From: KOSAKI Motohiro @ 2009-08-21 11:09 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Rik van Riel,
	Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
	Hugh Dickins, Christoph Lameter, Mel Gorman, LKML, linux-mm,
	nishimura, lizf, menage

2009/8/20 Wu Fengguang <fengguang.wu@intel.com>:
> shrink_inactive_list() won't be called to scan too much pages
> (unless in hibernation code which is fine) or too few pages (ie.
> batching is taken care of by the callers).  So we can just remove the
> big loop and isolate the exact number of pages requested.
>
> Just a RFC, and a scratch patch to show the basic idea.
> Please kindly NAK quick if you don't like it ;)

Hm, I think this patch taks only cleanups. right?
if so, I don't find any objection reason.




> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/vmscan.c |   32 ++++++++++++++++----------------
>  1 file changed, 16 insertions(+), 16 deletions(-)
>
> --- linux.orig/mm/vmscan.c      2009-08-20 10:16:18.000000000 +0800
> +++ linux/mm/vmscan.c   2009-08-20 10:24:34.000000000 +0800
> @@ -1032,16 +1032,22 @@ int isolate_lru_page(struct page *page)
>  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
>  * of reclaimed pages
>  */
> -static unsigned long shrink_inactive_list(unsigned long max_scan,
> +static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>                        struct zone *zone, struct scan_control *sc,
>                        int priority, int file)
>  {
>        LIST_HEAD(page_list);
>        struct pagevec pvec;
> -       unsigned long nr_scanned = 0;
> -       unsigned long nr_reclaimed = 0;
> +       unsigned long nr_reclaimed;
>        struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> -       int lumpy_reclaim = 0;
> +       int lumpy_reclaim;
> +       struct page *page;
> +       unsigned long nr_taken;
> +       unsigned long nr_scan;
> +       unsigned long nr_freed;
> +       unsigned long nr_active;
> +       unsigned int count[NR_LRU_LISTS] = { 0, };
> +       int mode;
>
>        /*
>         * If we need a large contiguous chunk of memory, or have
> @@ -1054,21 +1060,17 @@ static unsigned long shrink_inactive_lis
>                lumpy_reclaim = 1;
>        else if (sc->order && priority < DEF_PRIORITY - 2)
>                lumpy_reclaim = 1;
> +       else
> +               lumpy_reclaim = 0;
> +
> +       mode = lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE;
>
>        pagevec_init(&pvec, 1);
>
>        lru_add_drain();
>        spin_lock_irq(&zone->lru_lock);
> -       do {
> -               struct page *page;
> -               unsigned long nr_taken;
> -               unsigned long nr_scan;
> -               unsigned long nr_freed;
> -               unsigned long nr_active;
> -               unsigned int count[NR_LRU_LISTS] = { 0, };
> -               int mode = lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE;
>
> -               nr_taken = sc->isolate_pages(sc->swap_cluster_max,
> +               nr_taken = sc->isolate_pages(nr_to_scan,
>                             &page_list, &nr_scan, sc->order, mode,
>                                zone, sc->mem_cgroup, 0, file);
>                nr_active = clear_active_flags(&page_list, count);
> @@ -1093,7 +1095,6 @@ static unsigned long shrink_inactive_lis
>
>                spin_unlock_irq(&zone->lru_lock);
>
> -               nr_scanned += nr_scan;
>                nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
>
>                /*
> @@ -1117,7 +1118,7 @@ static unsigned long shrink_inactive_lis
>                                                        PAGEOUT_IO_SYNC);
>                }
>
> -               nr_reclaimed += nr_freed;
> +               nr_reclaimed = nr_freed;

maybe, nr_freed can be removed perfectly. it have the same meaning as
nr_reclaimed.



>                local_irq_disable();
>                if (current_is_kswapd()) {
>                        __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
> @@ -1158,7 +1159,6 @@ static unsigned long shrink_inactive_lis
>                                spin_lock_irq(&zone->lru_lock);
>                        }
>                }
> -       } while (nr_scanned < max_scan);
>        spin_unlock(&zone->lru_lock);
>  done:
>        local_irq_enable();
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH] mm: remove unnecessary loop inside shrink_inactive_list()
  2009-08-21 11:09     ` KOSAKI Motohiro
@ 2009-08-21 11:22       ` Wu Fengguang
  2009-08-27  0:20         ` KOSAKI Motohiro
  0 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2009-08-21 11:22 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Rik van Riel,
	Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
	Hugh Dickins, Christoph Lameter, Mel Gorman, LKML, linux-mm,
	nishimura, lizf, menage

On Fri, Aug 21, 2009 at 07:09:17PM +0800, KOSAKI Motohiro wrote:
> 2009/8/20 Wu Fengguang <fengguang.wu@intel.com>:
> > shrink_inactive_list() won't be called to scan too much pages
> > (unless in hibernation code which is fine) or too few pages (ie.
> > batching is taken care of by the callers). A So we can just remove the
> > big loop and isolate the exact number of pages requested.
> >
> > Just a RFC, and a scratch patch to show the basic idea.
> > Please kindly NAK quick if you don't like it ;)
> 
> Hm, I think this patch taks only cleanups. right?
> if so, I don't find any objection reason.

Mostly cleanups, but one behavior change here: 

> > - A  A  A  A  A  A  A  nr_taken = sc->isolate_pages(sc->swap_cluster_max,
> > + A  A  A  A  A  A  A  nr_taken = sc->isolate_pages(nr_to_scan,
> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  &page_list, &nr_scan, sc->order, mode,
> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A zone, sc->mem_cgroup, 0, file);

The new behavior is to scan exactly the number of pages that
shrink_zone() or other callers tell it. It won't try to "round it up"
to 32 pages. This new behavior is in line with shrink_active_list()'s
current status as well as shrink_zone()'s expectation.

shrink_zone() may still submit scan requests for <32 pages, which is
suboptimal. I'll try to eliminate that totally with more patches.

> > A  A  A  A  A  A  A  A nr_active = clear_active_flags(&page_list, count);
> > @@ -1093,7 +1095,6 @@ static unsigned long shrink_inactive_lis
> >
> > A  A  A  A  A  A  A  A spin_unlock_irq(&zone->lru_lock);
> >
> > - A  A  A  A  A  A  A  nr_scanned += nr_scan;
> > A  A  A  A  A  A  A  A nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> >
> > A  A  A  A  A  A  A  A /*
> > @@ -1117,7 +1118,7 @@ static unsigned long shrink_inactive_lis
> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A PAGEOUT_IO_SYNC);
> > A  A  A  A  A  A  A  A }
> >
> > - A  A  A  A  A  A  A  nr_reclaimed += nr_freed;
> > + A  A  A  A  A  A  A  nr_reclaimed = nr_freed;
> 
> maybe, nr_freed can be removed perfectly. it have the same meaning as
> nr_reclaimed.

Yes, good spot!

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH] mm: remove unnecessary loop inside shrink_inactive_list()
  2009-08-21 11:22       ` Wu Fengguang
@ 2009-08-27  0:20         ` KOSAKI Motohiro
  0 siblings, 0 replies; 19+ messages in thread
From: KOSAKI Motohiro @ 2009-08-27  0:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki,
	Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
	Dike, Jeffrey G, Hugh Dickins, Christoph Lameter, Mel Gorman,
	LKML, linux-mm, nishimura, lizf, menage

> On Fri, Aug 21, 2009 at 07:09:17PM +0800, KOSAKI Motohiro wrote:
> > 2009/8/20 Wu Fengguang <fengguang.wu@intel.com>:
> > > shrink_inactive_list() won't be called to scan too much pages
> > > (unless in hibernation code which is fine) or too few pages (ie.
> > > batching is taken care of by the callers). A So we can just remove the
> > > big loop and isolate the exact number of pages requested.
> > >
> > > Just a RFC, and a scratch patch to show the basic idea.
> > > Please kindly NAK quick if you don't like it ;)
> > 
> > Hm, I think this patch taks only cleanups. right?
> > if so, I don't find any objection reason.
> 
> Mostly cleanups, but one behavior change here: 
> 
> > > - A  A  A  A  A  A  A  nr_taken = sc->isolate_pages(sc->swap_cluster_max,
> > > + A  A  A  A  A  A  A  nr_taken = sc->isolate_pages(nr_to_scan,
> > > A  A  A  A  A  A  A  A  A  A  A  A  A  A  &page_list, &nr_scan, sc->order, mode,
> > > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A zone, sc->mem_cgroup, 0, file);
> 
> The new behavior is to scan exactly the number of pages that
> shrink_zone() or other callers tell it. It won't try to "round it up"
> to 32 pages. This new behavior is in line with shrink_active_list()'s
> current status as well as shrink_zone()'s expectation.
> 
> shrink_zone() may still submit scan requests for <32 pages, which is
> suboptimal. I'll try to eliminate that totally with more patches.

Your explanation seems makes sense.
I'll wait your next spin :)





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm: do batched scans for mem_cgroup
  2009-08-20  2:49 [PATCH] mm: do batched scans for mem_cgroup Wu Fengguang
  2009-08-20  2:52 ` [PATCH] mm: make nr_scan_try_batch() more safe on races Wu Fengguang
@ 2009-08-20  3:13 ` KAMEZAWA Hiroyuki
  2009-08-20  4:05   ` [PATCH -v2] " Wu Fengguang
  1 sibling, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-20  3:13 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Balbir Singh, KOSAKI Motohiro, Rik van Riel,
	Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
	Hugh Dickins, Christoph Lameter, Mel Gorman, LKML, linux-mm,
	nishimura, lizf, menage

On Thu, 20 Aug 2009 10:49:29 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> in which case shrink_list() _still_ calls isolate_pages() with the much
> larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
> scan rate by up to 32 times.
> 
> For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> So when shrink_zone() expects to scan 4 pages in the active/inactive
> list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> 
> The accesses to nr_saved_scan are not lock protected and so not 100%
> accurate, however we can tolerate small errors and the resulted small
> imbalanced scan rates between zones.
> 
> This batching won't blur up the cgroup limits, since it is driven by
> "pages reclaimed" rather than "pages scanned". When shrink_zone()
> decides to cancel (and save) one smallish scan, it may well be called
> again to accumulate up nr_saved_scan.
> 
> It could possibly be a problem for some tiny mem_cgroup (which may be
> _full_ scanned too much times in order to accumulate up nr_saved_scan).
> 
> CC: Rik van Riel <riel@redhat.com>
> CC: Minchan Kim <minchan.kim@gmail.com>
> CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---

Hmm, how about this ? 
==
Now, nr_saved_scan is tied to zone's LRU.
But, considering how vmscan works, it should be tied to reclaim_stat.

By this, memcg can make use of nr_saved_scan information seamlessly.

---
 include/linux/mmzone.h |    2 +-
 mm/vmscan.c            |   19 ++++++++++---------
 2 files changed, 11 insertions(+), 10 deletions(-)

Index: linux-2.6.31-rc6/include/linux/mmzone.h
===================================================================
--- linux-2.6.31-rc6.orig/include/linux/mmzone.h	2009-08-17 15:14:21.000000000 +0900
+++ linux-2.6.31-rc6/include/linux/mmzone.h	2009-08-20 12:10:44.000000000 +0900
@@ -269,6 +269,7 @@
 	 */
 	unsigned long		recent_rotated[2];
 	unsigned long		recent_scanned[2];
+	unsigned long nr_saved_scan[NR_LRU_LISTS];/* accumulated for batching */
 };
 
 struct zone {
@@ -323,7 +324,6 @@
 	spinlock_t		lru_lock;	
 	struct zone_lru {
 		struct list_head list;
-		unsigned long nr_saved_scan;	/* accumulated for batching */
 	} lru[NR_LRU_LISTS];
 
 	struct zone_reclaim_stat reclaim_stat;
Index: linux-2.6.31-rc6/mm/vmscan.c
===================================================================
--- linux-2.6.31-rc6.orig/mm/vmscan.c	2009-08-17 15:14:21.000000000 +0900
+++ linux-2.6.31-rc6/mm/vmscan.c	2009-08-20 12:17:47.000000000 +0900
@@ -1521,6 +1521,7 @@
 	enum lru_list l;
 	unsigned long nr_reclaimed = sc->nr_reclaimed;
 	unsigned long swap_cluster_max = sc->swap_cluster_max;
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 	int noswap = 0;
 
 	/* If we have no swap space, do not bother scanning anon pages. */
@@ -1540,12 +1541,9 @@
 			scan >>= priority;
 			scan = (scan * percent[file]) / 100;
 		}
-		if (scanning_global_lru(sc))
-			nr[l] = nr_scan_try_batch(scan,
-						  &zone->lru[l].nr_saved_scan,
-						  swap_cluster_max);
-		else
-			nr[l] = scan;
+		nr[l] = nr_scan_try_batch(scan,
+					  &recalim_stat->nr_saved_scan[l],
+					  swap_cluster_max);
 	}
 
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
@@ -2128,6 +2126,7 @@
 {
 	struct zone *zone;
 	unsigned long nr_reclaimed = 0;
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
 	for_each_populated_zone(zone) {
 		enum lru_list l;
@@ -2144,11 +2143,13 @@
 						l == LRU_ACTIVE_FILE))
 				continue;
 
-			zone->lru[l].nr_saved_scan += (lru_pages >> prio) + 1;
-			if (zone->lru[l].nr_saved_scan >= nr_pages || pass > 3) {
+			reclaim_stat->nr_saved_scan[l] +=
+						(lru_pages >> prio) + 1;
+			if (reclaim_stat->nr_saved_scan[l]
+						>= nr_pages || pass > 3) {
 				unsigned long nr_to_scan;
 
-				zone->lru[l].nr_saved_scan = 0;
+				reclaim_stat->nr_saved_scan[l] = 0;
 				nr_to_scan = min(nr_pages, lru_pages);
 				nr_reclaimed += shrink_list(l, nr_to_scan, zone,
 								sc, prio);





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH -v2] mm: do batched scans for mem_cgroup
  2009-08-20  3:13 ` [PATCH] mm: do batched scans for mem_cgroup KAMEZAWA Hiroyuki
@ 2009-08-20  4:05   ` Wu Fengguang
  2009-08-20  4:06     ` KAMEZAWA Hiroyuki
                       ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Wu Fengguang @ 2009-08-20  4:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, KOSAKI Motohiro, Rik van Riel,
	Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
	Hugh Dickins, Christoph Lameter, Mel Gorman, LKML, linux-mm,
	nishimura, lizf, menage

On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
> On Thu, 20 Aug 2009 10:49:29 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > in which case shrink_list() _still_ calls isolate_pages() with the much
> > larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
> > scan rate by up to 32 times.
> > 
> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> > 
> > The accesses to nr_saved_scan are not lock protected and so not 100%
> > accurate, however we can tolerate small errors and the resulted small
> > imbalanced scan rates between zones.
> > 
> > This batching won't blur up the cgroup limits, since it is driven by
> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > decides to cancel (and save) one smallish scan, it may well be called
> > again to accumulate up nr_saved_scan.
> > 
> > It could possibly be a problem for some tiny mem_cgroup (which may be
> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> > 
> > CC: Rik van Riel <riel@redhat.com>
> > CC: Minchan Kim <minchan.kim@gmail.com>
> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> 
> Hmm, how about this ? 
> ==
> Now, nr_saved_scan is tied to zone's LRU.
> But, considering how vmscan works, it should be tied to reclaim_stat.
> 
> By this, memcg can make use of nr_saved_scan information seamlessly.

Good idea, full patch updated with your signed-off-by :)

Thanks,
Fengguang
---
mm: do batched scans for mem_cgroup

For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
in which case shrink_list() _still_ calls isolate_pages() with the much
larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
scan rate by up to 32 times.

For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
So when shrink_zone() expects to scan 4 pages in the active/inactive
list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.

The accesses to nr_saved_scan are not lock protected and so not 100%
accurate, however we can tolerate small errors and the resulted small
imbalanced scan rates between zones.

This batching won't blur up the cgroup limits, since it is driven by
"pages reclaimed" rather than "pages scanned". When shrink_zone()
decides to cancel (and save) one smallish scan, it may well be called
again to accumulate up nr_saved_scan.

It could possibly be a problem for some tiny mem_cgroup (which may be
_full_ scanned too much times in order to accumulate up nr_saved_scan).

CC: Rik van Riel <riel@redhat.com>
CC: Minchan Kim <minchan.kim@gmail.com>
CC: Balbir Singh <balbir@linux.vnet.ibm.com>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/mmzone.h |    6 +++++-
 mm/page_alloc.c        |    2 +-
 mm/vmscan.c            |   20 +++++++++++---------
 3 files changed, 17 insertions(+), 11 deletions(-)

--- linux.orig/include/linux/mmzone.h	2009-07-30 10:45:15.000000000 +0800
+++ linux/include/linux/mmzone.h	2009-08-20 11:51:08.000000000 +0800
@@ -269,6 +269,11 @@ struct zone_reclaim_stat {
 	 */
 	unsigned long		recent_rotated[2];
 	unsigned long		recent_scanned[2];
+
+	/*
+	 * accumulated for batching
+	 */
+	unsigned long		nr_saved_scan[NR_LRU_LISTS];
 };
 
 struct zone {
@@ -323,7 +328,6 @@ struct zone {
 	spinlock_t		lru_lock;	
 	struct zone_lru {
 		struct list_head list;
-		unsigned long nr_saved_scan;	/* accumulated for batching */
 	} lru[NR_LRU_LISTS];
 
 	struct zone_reclaim_stat reclaim_stat;
--- linux.orig/mm/vmscan.c	2009-08-20 11:48:46.000000000 +0800
+++ linux/mm/vmscan.c	2009-08-20 12:00:55.000000000 +0800
@@ -1521,6 +1521,7 @@ static void shrink_zone(int priority, st
 	enum lru_list l;
 	unsigned long nr_reclaimed = sc->nr_reclaimed;
 	unsigned long swap_cluster_max = sc->swap_cluster_max;
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 	int noswap = 0;
 
 	/* If we have no swap space, do not bother scanning anon pages. */
@@ -1540,12 +1541,9 @@ static void shrink_zone(int priority, st
 			scan >>= priority;
 			scan = (scan * percent[file]) / 100;
 		}
-		if (scanning_global_lru(sc))
-			nr[l] = nr_scan_try_batch(scan,
-						  &zone->lru[l].nr_saved_scan,
-						  swap_cluster_max);
-		else
-			nr[l] = scan;
+		nr[l] = nr_scan_try_batch(scan,
+					  &reclaim_stat->nr_saved_scan[l],
+					  swap_cluster_max);
 	}
 
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
@@ -2128,6 +2126,7 @@ static void shrink_all_zones(unsigned lo
 {
 	struct zone *zone;
 	unsigned long nr_reclaimed = 0;
+	struct zone_reclaim_stat *reclaim_stat;
 
 	for_each_populated_zone(zone) {
 		enum lru_list l;
@@ -2144,11 +2143,14 @@ static void shrink_all_zones(unsigned lo
 						l == LRU_ACTIVE_FILE))
 				continue;
 
-			zone->lru[l].nr_saved_scan += (lru_pages >> prio) + 1;
-			if (zone->lru[l].nr_saved_scan >= nr_pages || pass > 3) {
+			reclaim_stat = get_reclaim_stat(zone, sc);
+			reclaim_stat->nr_saved_scan[l] +=
+						(lru_pages >> prio) + 1;
+			if (reclaim_stat->nr_saved_scan[l]
+						>= nr_pages || pass > 3) {
 				unsigned long nr_to_scan;
 
-				zone->lru[l].nr_saved_scan = 0;
+				reclaim_stat->nr_saved_scan[l] = 0;
 				nr_to_scan = min(nr_pages, lru_pages);
 				nr_reclaimed += shrink_list(l, nr_to_scan, zone,
 								sc, prio);
--- linux.orig/mm/page_alloc.c	2009-08-20 11:57:54.000000000 +0800
+++ linux/mm/page_alloc.c	2009-08-20 11:58:39.000000000 +0800
@@ -3716,7 +3716,7 @@ static void __paginginit free_area_init_
 		zone_pcp_init(zone);
 		for_each_lru(l) {
 			INIT_LIST_HEAD(&zone->lru[l].list);
-			zone->lru[l].nr_saved_scan = 0;
+			zone->reclaim_stat.nr_saved_scan[l] = 0;
 		}
 		zone->reclaim_stat.recent_rotated[0] = 0;
 		zone->reclaim_stat.recent_rotated[1] = 0;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
  2009-08-20  4:05   ` [PATCH -v2] " Wu Fengguang
@ 2009-08-20  4:06     ` KAMEZAWA Hiroyuki
  2009-08-20  5:16     ` Balbir Singh
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-20  4:06 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Balbir Singh, KOSAKI Motohiro, Rik van Riel,
	Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
	Hugh Dickins, Christoph Lameter, Mel Gorman, LKML, linux-mm,
	nishimura, lizf, menage

On Thu, 20 Aug 2009 12:05:33 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
> > On Thu, 20 Aug 2009 10:49:29 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > > in which case shrink_list() _still_ calls isolate_pages() with the much
> > > larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
> > > scan rate by up to 32 times.
> > > 
> > > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> > > 
> > > The accesses to nr_saved_scan are not lock protected and so not 100%
> > > accurate, however we can tolerate small errors and the resulted small
> > > imbalanced scan rates between zones.
> > > 
> > > This batching won't blur up the cgroup limits, since it is driven by
> > > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > > decides to cancel (and save) one smallish scan, it may well be called
> > > again to accumulate up nr_saved_scan.
> > > 
> > > It could possibly be a problem for some tiny mem_cgroup (which may be
> > > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> > > 
> > > CC: Rik van Riel <riel@redhat.com>
> > > CC: Minchan Kim <minchan.kim@gmail.com>
> > > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > 
> > Hmm, how about this ? 
> > ==
> > Now, nr_saved_scan is tied to zone's LRU.
> > But, considering how vmscan works, it should be tied to reclaim_stat.
> > 
> > By this, memcg can make use of nr_saved_scan information seamlessly.
> 
> Good idea, full patch updated with your signed-off-by :)
> 
looks nice :)
thanks,
-Kame


> Thanks,
> Fengguang
> ---
> mm: do batched scans for mem_cgroup
> 
> For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> in which case shrink_list() _still_ calls isolate_pages() with the much
> larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
> scan rate by up to 32 times.
> 
> For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> So when shrink_zone() expects to scan 4 pages in the active/inactive
> list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> 
> The accesses to nr_saved_scan are not lock protected and so not 100%
> accurate, however we can tolerate small errors and the resulted small
> imbalanced scan rates between zones.
> 
> This batching won't blur up the cgroup limits, since it is driven by
> "pages reclaimed" rather than "pages scanned". When shrink_zone()
> decides to cancel (and save) one smallish scan, it may well be called
> again to accumulate up nr_saved_scan.
> 
> It could possibly be a problem for some tiny mem_cgroup (which may be
> _full_ scanned too much times in order to accumulate up nr_saved_scan).
> 
> CC: Rik van Riel <riel@redhat.com>
> CC: Minchan Kim <minchan.kim@gmail.com>
> CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/mmzone.h |    6 +++++-
>  mm/page_alloc.c        |    2 +-
>  mm/vmscan.c            |   20 +++++++++++---------
>  3 files changed, 17 insertions(+), 11 deletions(-)
> 
> --- linux.orig/include/linux/mmzone.h	2009-07-30 10:45:15.000000000 +0800
> +++ linux/include/linux/mmzone.h	2009-08-20 11:51:08.000000000 +0800
> @@ -269,6 +269,11 @@ struct zone_reclaim_stat {
>  	 */
>  	unsigned long		recent_rotated[2];
>  	unsigned long		recent_scanned[2];
> +
> +	/*
> +	 * accumulated for batching
> +	 */
> +	unsigned long		nr_saved_scan[NR_LRU_LISTS];
>  };
>  
>  struct zone {
> @@ -323,7 +328,6 @@ struct zone {
>  	spinlock_t		lru_lock;	
>  	struct zone_lru {
>  		struct list_head list;
> -		unsigned long nr_saved_scan;	/* accumulated for batching */
>  	} lru[NR_LRU_LISTS];
>  
>  	struct zone_reclaim_stat reclaim_stat;
> --- linux.orig/mm/vmscan.c	2009-08-20 11:48:46.000000000 +0800
> +++ linux/mm/vmscan.c	2009-08-20 12:00:55.000000000 +0800
> @@ -1521,6 +1521,7 @@ static void shrink_zone(int priority, st
>  	enum lru_list l;
>  	unsigned long nr_reclaimed = sc->nr_reclaimed;
>  	unsigned long swap_cluster_max = sc->swap_cluster_max;
> +	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
>  	int noswap = 0;
>  
>  	/* If we have no swap space, do not bother scanning anon pages. */
> @@ -1540,12 +1541,9 @@ static void shrink_zone(int priority, st
>  			scan >>= priority;
>  			scan = (scan * percent[file]) / 100;
>  		}
> -		if (scanning_global_lru(sc))
> -			nr[l] = nr_scan_try_batch(scan,
> -						  &zone->lru[l].nr_saved_scan,
> -						  swap_cluster_max);
> -		else
> -			nr[l] = scan;
> +		nr[l] = nr_scan_try_batch(scan,
> +					  &reclaim_stat->nr_saved_scan[l],
> +					  swap_cluster_max);
>  	}
>  
>  	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> @@ -2128,6 +2126,7 @@ static void shrink_all_zones(unsigned lo
>  {
>  	struct zone *zone;
>  	unsigned long nr_reclaimed = 0;
> +	struct zone_reclaim_stat *reclaim_stat;
>  
>  	for_each_populated_zone(zone) {
>  		enum lru_list l;
> @@ -2144,11 +2143,14 @@ static void shrink_all_zones(unsigned lo
>  						l == LRU_ACTIVE_FILE))
>  				continue;
>  
> -			zone->lru[l].nr_saved_scan += (lru_pages >> prio) + 1;
> -			if (zone->lru[l].nr_saved_scan >= nr_pages || pass > 3) {
> +			reclaim_stat = get_reclaim_stat(zone, sc);
> +			reclaim_stat->nr_saved_scan[l] +=
> +						(lru_pages >> prio) + 1;
> +			if (reclaim_stat->nr_saved_scan[l]
> +						>= nr_pages || pass > 3) {
>  				unsigned long nr_to_scan;
>  
> -				zone->lru[l].nr_saved_scan = 0;
> +				reclaim_stat->nr_saved_scan[l] = 0;
>  				nr_to_scan = min(nr_pages, lru_pages);
>  				nr_reclaimed += shrink_list(l, nr_to_scan, zone,
>  								sc, prio);
> --- linux.orig/mm/page_alloc.c	2009-08-20 11:57:54.000000000 +0800
> +++ linux/mm/page_alloc.c	2009-08-20 11:58:39.000000000 +0800
> @@ -3716,7 +3716,7 @@ static void __paginginit free_area_init_
>  		zone_pcp_init(zone);
>  		for_each_lru(l) {
>  			INIT_LIST_HEAD(&zone->lru[l].list);
> -			zone->lru[l].nr_saved_scan = 0;
> +			zone->reclaim_stat.nr_saved_scan[l] = 0;
>  		}
>  		zone->reclaim_stat.recent_rotated[0] = 0;
>  		zone->reclaim_stat.recent_rotated[1] = 0;
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
  2009-08-20  4:05   ` [PATCH -v2] " Wu Fengguang
  2009-08-20  4:06     ` KAMEZAWA Hiroyuki
@ 2009-08-20  5:16     ` Balbir Singh
  2009-08-21  1:39       ` Wu Fengguang
  2009-08-20 11:01     ` Minchan Kim
  2009-08-21  3:55     ` Minchan Kim
  3 siblings, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2009-08-20  5:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, KOSAKI Motohiro, Rik van Riel,
	Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
	Hugh Dickins, Christoph Lameter, Mel Gorman, LKML, linux-mm,
	nishimura, lizf, menage

* Wu Fengguang <fengguang.wu@intel.com> [2009-08-20 12:05:33]:

> On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
> > On Thu, 20 Aug 2009 10:49:29 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > > in which case shrink_list() _still_ calls isolate_pages() with the much
> > > larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
> > > scan rate by up to 32 times.
> > > 
> > > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> > > 
> > > The accesses to nr_saved_scan are not lock protected and so not 100%
> > > accurate, however we can tolerate small errors and the resulted small
> > > imbalanced scan rates between zones.
> > > 
> > > This batching won't blur up the cgroup limits, since it is driven by
> > > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > > decides to cancel (and save) one smallish scan, it may well be called
> > > again to accumulate up nr_saved_scan.
> > > 
> > > It could possibly be a problem for some tiny mem_cgroup (which may be
> > > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> > > 
> > > CC: Rik van Riel <riel@redhat.com>
> > > CC: Minchan Kim <minchan.kim@gmail.com>
> > > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > 
> > Hmm, how about this ? 
> > ==
> > Now, nr_saved_scan is tied to zone's LRU.
> > But, considering how vmscan works, it should be tied to reclaim_stat.
> > 
> > By this, memcg can make use of nr_saved_scan information seamlessly.
> 
> Good idea, full patch updated with your signed-off-by :)
> 
> Thanks,
> Fengguang
> ---
> mm: do batched scans for mem_cgroup
> 
> For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> in which case shrink_list() _still_ calls isolate_pages() with the much
> larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
> scan rate by up to 32 times.
> 
> For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> So when shrink_zone() expects to scan 4 pages in the active/inactive
> list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> 
> The accesses to nr_saved_scan are not lock protected and so not 100%
> accurate, however we can tolerate small errors and the resulted small
> imbalanced scan rates between zones.
> 
> This batching won't blur up the cgroup limits, since it is driven by
> "pages reclaimed" rather than "pages scanned". When shrink_zone()
> decides to cancel (and save) one smallish scan, it may well be called
> again to accumulate up nr_saved_scan.
> 
> It could possibly be a problem for some tiny mem_cgroup (which may be
> _full_ scanned too much times in order to accumulate up nr_saved_scan).
>

Looks good to me, how did you test it?

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 
 
-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
  2009-08-20  5:16     ` Balbir Singh
@ 2009-08-21  1:39       ` Wu Fengguang
  2009-08-21  1:46         ` Wu Fengguang
  0 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2009-08-21  1:39 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, KOSAKI Motohiro, Rik van Riel,
	Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
	Hugh Dickins, Christoph Lameter, Mel Gorman, LKML, linux-mm,
	nishimura, lizf, menage

On Thu, Aug 20, 2009 at 01:16:56PM +0800, Balbir Singh wrote:
> * Wu Fengguang <fengguang.wu@intel.com> [2009-08-20 12:05:33]:
> 
> > On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 20 Aug 2009 10:49:29 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > > > in which case shrink_list() _still_ calls isolate_pages() with the much
> > > > larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
> > > > scan rate by up to 32 times.
> > > > 
> > > > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > > > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > > > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> > > > 
> > > > The accesses to nr_saved_scan are not lock protected and so not 100%
> > > > accurate, however we can tolerate small errors and the resulted small
> > > > imbalanced scan rates between zones.
> > > > 
> > > > This batching won't blur up the cgroup limits, since it is driven by
> > > > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > > > decides to cancel (and save) one smallish scan, it may well be called
> > > > again to accumulate up nr_saved_scan.
> > > > 
> > > > It could possibly be a problem for some tiny mem_cgroup (which may be
> > > > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> > > > 
> > > > CC: Rik van Riel <riel@redhat.com>
> > > > CC: Minchan Kim <minchan.kim@gmail.com>
> > > > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > ---
> > > 
> > > Hmm, how about this ? 
> > > ==
> > > Now, nr_saved_scan is tied to zone's LRU.
> > > But, considering how vmscan works, it should be tied to reclaim_stat.
> > > 
> > > By this, memcg can make use of nr_saved_scan information seamlessly.
> > 
> > Good idea, full patch updated with your signed-off-by :)
> > 
> > Thanks,
> > Fengguang
> > ---
> > mm: do batched scans for mem_cgroup
> > 
> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > in which case shrink_list() _still_ calls isolate_pages() with the much
> > larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
> > scan rate by up to 32 times.
> > 
> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> > 
> > The accesses to nr_saved_scan are not lock protected and so not 100%
> > accurate, however we can tolerate small errors and the resulted small
> > imbalanced scan rates between zones.
> > 
> > This batching won't blur up the cgroup limits, since it is driven by
> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > decides to cancel (and save) one smallish scan, it may well be called
> > again to accumulate up nr_saved_scan.
> > 
> > It could possibly be a problem for some tiny mem_cgroup (which may be
> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> >
> 
> Looks good to me, how did you test it?

I observed the shrink_inactive_list() calls with this patch:

        @@ -1043,6 +1043,13 @@ static unsigned long shrink_inactive_lis
                struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
                int lumpy_reclaim = 0;

        +       if (!scanning_global_lru(sc))
        +               printk("shrink inactive %s count=%lu scan=%lu\n",
        +                      file ? "file" : "anon",
        +                      mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone,
        +                                               LRU_INACTIVE_ANON + 2 * !!file),
        +                      max_scan);

and these commands:

        mkdir /cgroup/0
        echo 100M > /cgroup/0/memory.limit_in_bytes
        echo $$ > /cgroup/0/tasks
        cp /tmp/10G /dev/null

before patch:

        [ 3682.646008] shrink inactive file count=25535 scan=6
        [ 3682.661548] shrink inactive file count=25535 scan=6
        [ 3682.666933] shrink inactive file count=25535 scan=6
        [ 3682.682865] shrink inactive file count=25535 scan=6
        [ 3682.688572] shrink inactive file count=25535 scan=6
        [ 3682.703908] shrink inactive file count=25535 scan=6
        [ 3682.709431] shrink inactive file count=25535 scan=6

after patch:

        [  223.146544] shrink inactive file count=25531 scan=32
        [  223.152060] shrink inactive file count=25507 scan=10
        [  223.167503] shrink inactive file count=25531 scan=32
        [  223.173426] shrink inactive file count=25507 scan=10
        [  223.188764] shrink inactive file count=25531 scan=32
        [  223.194270] shrink inactive file count=25507 scan=10
        [  223.209885] shrink inactive file count=25531 scan=32
        [  223.215388] shrink inactive file count=25507 scan=10

Before patch, the inactive list is over scanned by 30/6=5 times;
After patch, it is over scanned by 64/42=1.5 times. It's much better,
and can be further improved if necessary.

> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>

Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
  2009-08-21  1:39       ` Wu Fengguang
@ 2009-08-21  1:46         ` Wu Fengguang
  0 siblings, 0 replies; 19+ messages in thread
From: Wu Fengguang @ 2009-08-21  1:46 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, KOSAKI Motohiro, Rik van Riel,
	Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
	Hugh Dickins, Christoph Lameter, Mel Gorman, LKML, linux-mm,
	nishimura, lizf, menage

On Fri, Aug 21, 2009 at 09:39:26AM +0800, Wu Fengguang wrote:
> On Thu, Aug 20, 2009 at 01:16:56PM +0800, Balbir Singh wrote:
> > * Wu Fengguang <fengguang.wu@intel.com> [2009-08-20 12:05:33]:
> > 
> > > On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
> > > > On Thu, 20 Aug 2009 10:49:29 +0800
> > > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > > 
> > > > > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > > > > in which case shrink_list() _still_ calls isolate_pages() with the much
> > > > > larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
> > > > > scan rate by up to 32 times.
> > > > > 
> > > > > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > > > > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > > > > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> > > > > 
> > > > > The accesses to nr_saved_scan are not lock protected and so not 100%
> > > > > accurate, however we can tolerate small errors and the resulted small
> > > > > imbalanced scan rates between zones.
> > > > > 
> > > > > This batching won't blur up the cgroup limits, since it is driven by
> > > > > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > > > > decides to cancel (and save) one smallish scan, it may well be called
> > > > > again to accumulate up nr_saved_scan.
> > > > > 
> > > > > It could possibly be a problem for some tiny mem_cgroup (which may be
> > > > > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> > > > > 
> > > > > CC: Rik van Riel <riel@redhat.com>
> > > > > CC: Minchan Kim <minchan.kim@gmail.com>
> > > > > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > > > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > > ---
> > > > 
> > > > Hmm, how about this ? 
> > > > ==
> > > > Now, nr_saved_scan is tied to zone's LRU.
> > > > But, considering how vmscan works, it should be tied to reclaim_stat.
> > > > 
> > > > By this, memcg can make use of nr_saved_scan information seamlessly.
> > > 
> > > Good idea, full patch updated with your signed-off-by :)
> > > 
> > > Thanks,
> > > Fengguang
> > > ---
> > > mm: do batched scans for mem_cgroup
> > > 
> > > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > > in which case shrink_list() _still_ calls isolate_pages() with the much
> > > larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
> > > scan rate by up to 32 times.
> > > 
> > > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> > > 
> > > The accesses to nr_saved_scan are not lock protected and so not 100%
> > > accurate, however we can tolerate small errors and the resulted small
> > > imbalanced scan rates between zones.
> > > 
> > > This batching won't blur up the cgroup limits, since it is driven by
> > > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > > decides to cancel (and save) one smallish scan, it may well be called
> > > again to accumulate up nr_saved_scan.
> > > 
> > > It could possibly be a problem for some tiny mem_cgroup (which may be
> > > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> > >
> > 
> > Looks good to me, how did you test it?
> 
> I observed the shrink_inactive_list() calls with this patch:
> 
>         @@ -1043,6 +1043,13 @@ static unsigned long shrink_inactive_lis
>                 struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
>                 int lumpy_reclaim = 0;
> 
>         +       if (!scanning_global_lru(sc))
>         +               printk("shrink inactive %s count=%lu scan=%lu\n",
>         +                      file ? "file" : "anon",
>         +                      mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone,
>         +                                               LRU_INACTIVE_ANON + 2 * !!file),
>         +                      max_scan);
> 
> and these commands:
> 
>         mkdir /cgroup/0
>         echo 100M > /cgroup/0/memory.limit_in_bytes
>         echo $$ > /cgroup/0/tasks
>         cp /tmp/10G /dev/null

And I can reduce the limit to 1M and 500K without triggering OOM:


[  963.329746] shrink inactive file count=201 scan=32
[  963.335076] shrink inactive file count=177 scan=15
[  963.350719] shrink inactive file count=201 scan=32
[  963.356020] shrink inactive file count=177 scan=15
[  963.371914] shrink inactive file count=201 scan=32
[  963.377225] shrink inactive file count=177 scan=15
[  963.393022] shrink inactive file count=201 scan=32
[  963.398362] shrink inactive file count=177 scan=15


[ 1103.951251] shrink inactive file count=70 scan=32
[ 1104.054242] shrink inactive file count=46 scan=32
[ 1104.077381] shrink inactive file count=70 scan=32
[ 1104.083095] shrink inactive file count=73 scan=32
[ 1104.088513] shrink inactive file count=45 scan=2
[ 1104.113545] shrink inactive file count=70 scan=32
[ 1104.118915] shrink inactive file count=73 scan=32
[ 1104.124612] shrink inactive file count=45 scan=2
[ 1104.130093] shrink inactive file count=69 scan=32

So the patch is pretty safe for tiny mem cgroups.

Thanks,
Fengguang

> before patch:
> 
>         [ 3682.646008] shrink inactive file count=25535 scan=6
>         [ 3682.661548] shrink inactive file count=25535 scan=6
>         [ 3682.666933] shrink inactive file count=25535 scan=6
>         [ 3682.682865] shrink inactive file count=25535 scan=6
>         [ 3682.688572] shrink inactive file count=25535 scan=6
>         [ 3682.703908] shrink inactive file count=25535 scan=6
>         [ 3682.709431] shrink inactive file count=25535 scan=6
> 
> after patch:
> 
>         [  223.146544] shrink inactive file count=25531 scan=32
>         [  223.152060] shrink inactive file count=25507 scan=10
>         [  223.167503] shrink inactive file count=25531 scan=32
>         [  223.173426] shrink inactive file count=25507 scan=10
>         [  223.188764] shrink inactive file count=25531 scan=32
>         [  223.194270] shrink inactive file count=25507 scan=10
>         [  223.209885] shrink inactive file count=25531 scan=32
>         [  223.215388] shrink inactive file count=25507 scan=10
> 
> Before patch, the inactive list is over scanned by 30/6=5 times;
> After patch, it is over scanned by 64/42=1.5 times. It's much better,
> and can be further improved if necessary.
> 
> > Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
  2009-08-20  4:05   ` [PATCH -v2] " Wu Fengguang
  2009-08-20  4:06     ` KAMEZAWA Hiroyuki
  2009-08-20  5:16     ` Balbir Singh
@ 2009-08-20 11:01     ` Minchan Kim
  2009-08-20 11:49       ` Wu Fengguang
  2009-08-21  3:55     ` Minchan Kim
  3 siblings, 1 reply; 19+ messages in thread
From: Minchan Kim @ 2009-08-20 11:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Balbir Singh, KOSAKI Motohiro,
	Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
	Dike, Jeffrey G, Hugh Dickins, Christoph Lameter, Mel Gorman,
	LKML, linux-mm, nishimura, lizf, menage

Hi, Wu.

On Thu, Aug 20, 2009 at 1:05 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
>> On Thu, 20 Aug 2009 10:49:29 +0800
>> Wu Fengguang <fengguang.wu@intel.com> wrote:
>>
>> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
>> > in which case shrink_list() _still_ calls isolate_pages() with the much
>> > larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
>> > scan rate by up to 32 times.
>> >
>> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
>> > So when shrink_zone() expects to scan 4 pages in the active/inactive
>> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
>> >
>> > The accesses to nr_saved_scan are not lock protected and so not 100%
>> > accurate, however we can tolerate small errors and the resulted small
>> > imbalanced scan rates between zones.
>> >
>> > This batching won't blur up the cgroup limits, since it is driven by
>> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
>> > decides to cancel (and save) one smallish scan, it may well be called
>> > again to accumulate up nr_saved_scan.
>> >
>> > It could possibly be a problem for some tiny mem_cgroup (which may be
>> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
>> >
>> > CC: Rik van Riel <riel@redhat.com>
>> > CC: Minchan Kim <minchan.kim@gmail.com>
>> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
>> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
>> > ---
>>
>> Hmm, how about this ?
>> ==
>> Now, nr_saved_scan is tied to zone's LRU.
>> But, considering how vmscan works, it should be tied to reclaim_stat.
>>
>> By this, memcg can make use of nr_saved_scan information seamlessly.
>
> Good idea, full patch updated with your signed-off-by :)
>
> Thanks,
> Fengguang
> ---
> mm: do batched scans for mem_cgroup
>
> For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> in which case shrink_list() _still_ calls isolate_pages() with the much
> larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
> scan rate by up to 32 times.

Yes. It can scan 32 times pages in only inactive list, not active list.

> For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> So when shrink_zone() expects to scan 4 pages in the active/inactive
> list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.

Active list scan would be scanned in 4,  inactive list  is 32.

>
> The accesses to nr_saved_scan are not lock protected and so not 100%
> accurate, however we can tolerate small errors and the resulted small
> imbalanced scan rates between zones.

Yes.

> This batching won't blur up the cgroup limits, since it is driven by
> "pages reclaimed" rather than "pages scanned". When shrink_zone()
> decides to cancel (and save) one smallish scan, it may well be called
> again to accumulate up nr_saved_scan.

You mean nr_scan_try_batch logic ?
But that logic works for just global reclaim?
Now am I missing something?

Could you elaborate more? :)

> It could possibly be a problem for some tiny mem_cgroup (which may be
> _full_ scanned too much times in order to accumulate up nr_saved_scan).
>
> CC: Rik van Riel <riel@redhat.com>
> CC: Minchan Kim <minchan.kim@gmail.com>
> CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/mmzone.h |    6 +++++-
>  mm/page_alloc.c        |    2 +-
>  mm/vmscan.c            |   20 +++++++++++---------
>  3 files changed, 17 insertions(+), 11 deletions(-)
>
> --- linux.orig/include/linux/mmzone.h   2009-07-30 10:45:15.000000000 +0800
> +++ linux/include/linux/mmzone.h        2009-08-20 11:51:08.000000000 +0800
> @@ -269,6 +269,11 @@ struct zone_reclaim_stat {
>         */
>        unsigned long           recent_rotated[2];
>        unsigned long           recent_scanned[2];
> +
> +       /*
> +        * accumulated for batching
> +        */
> +       unsigned long           nr_saved_scan[NR_LRU_LISTS];
>  };
>
>  struct zone {
> @@ -323,7 +328,6 @@ struct zone {
>        spinlock_t              lru_lock;
>        struct zone_lru {
>                struct list_head list;
> -               unsigned long nr_saved_scan;    /* accumulated for batching */
>        } lru[NR_LRU_LISTS];
>
>        struct zone_reclaim_stat reclaim_stat;
> --- linux.orig/mm/vmscan.c      2009-08-20 11:48:46.000000000 +0800
> +++ linux/mm/vmscan.c   2009-08-20 12:00:55.000000000 +0800
> @@ -1521,6 +1521,7 @@ static void shrink_zone(int priority, st
>        enum lru_list l;
>        unsigned long nr_reclaimed = sc->nr_reclaimed;
>        unsigned long swap_cluster_max = sc->swap_cluster_max;
> +       struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
>        int noswap = 0;
>
>        /* If we have no swap space, do not bother scanning anon pages. */
> @@ -1540,12 +1541,9 @@ static void shrink_zone(int priority, st
>                        scan >>= priority;
>                        scan = (scan * percent[file]) / 100;
>                }
> -               if (scanning_global_lru(sc))
> -                       nr[l] = nr_scan_try_batch(scan,
> -                                                 &zone->lru[l].nr_saved_scan,
> -                                                 swap_cluster_max);
> -               else
> -                       nr[l] = scan;
> +               nr[l] = nr_scan_try_batch(scan,
> +                                         &reclaim_stat->nr_saved_scan[l],
> +                                         swap_cluster_max);
>        }
>
>        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> @@ -2128,6 +2126,7 @@ static void shrink_all_zones(unsigned lo
>  {
>        struct zone *zone;
>        unsigned long nr_reclaimed = 0;
> +       struct zone_reclaim_stat *reclaim_stat;
>
>        for_each_populated_zone(zone) {
>                enum lru_list l;
> @@ -2144,11 +2143,14 @@ static void shrink_all_zones(unsigned lo
>                                                l == LRU_ACTIVE_FILE))
>                                continue;
>
> -                       zone->lru[l].nr_saved_scan += (lru_pages >> prio) + 1;
> -                       if (zone->lru[l].nr_saved_scan >= nr_pages || pass > 3) {
> +                       reclaim_stat = get_reclaim_stat(zone, sc);
> +                       reclaim_stat->nr_saved_scan[l] +=
> +                                               (lru_pages >> prio) + 1;
> +                       if (reclaim_stat->nr_saved_scan[l]
> +                                               >= nr_pages || pass > 3) {
>                                unsigned long nr_to_scan;
>
> -                               zone->lru[l].nr_saved_scan = 0;
> +                               reclaim_stat->nr_saved_scan[l] = 0;
>                                nr_to_scan = min(nr_pages, lru_pages);
>                                nr_reclaimed += shrink_list(l, nr_to_scan, zone,
>                                                                sc, prio);
> --- linux.orig/mm/page_alloc.c  2009-08-20 11:57:54.000000000 +0800
> +++ linux/mm/page_alloc.c       2009-08-20 11:58:39.000000000 +0800
> @@ -3716,7 +3716,7 @@ static void __paginginit free_area_init_
>                zone_pcp_init(zone);
>                for_each_lru(l) {
>                        INIT_LIST_HEAD(&zone->lru[l].list);
> -                       zone->lru[l].nr_saved_scan = 0;
> +                       zone->reclaim_stat.nr_saved_scan[l] = 0;
>                }
>                zone->reclaim_stat.recent_rotated[0] = 0;
>                zone->reclaim_stat.recent_rotated[1] = 0;
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>



-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
  2009-08-20 11:01     ` Minchan Kim
@ 2009-08-20 11:49       ` Wu Fengguang
  2009-08-20 12:13         ` Minchan Kim
  0 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2009-08-20 11:49 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Balbir Singh, KOSAKI Motohiro,
	Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
	Dike, Jeffrey G, Hugh Dickins, Christoph Lameter, Mel Gorman,
	LKML, linux-mm, nishimura, lizf, menage

On Thu, Aug 20, 2009 at 07:01:21PM +0800, Minchan Kim wrote:
> Hi, Wu.
> 
> On Thu, Aug 20, 2009 at 1:05 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
> >> On Thu, 20 Aug 2009 10:49:29 +0800
> >> Wu Fengguang <fengguang.wu@intel.com> wrote:
> >>
> >> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> >> > in which case shrink_list() _still_ calls isolate_pages() with the much
> >> > larger SWAP_CLUSTER_MAX. A It effectively scales up the inactive list
> >> > scan rate by up to 32 times.
> >> >
> >> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> >> > So when shrink_zone() expects to scan 4 pages in the active/inactive
> >> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> >> >
> >> > The accesses to nr_saved_scan are not lock protected and so not 100%
> >> > accurate, however we can tolerate small errors and the resulted small
> >> > imbalanced scan rates between zones.
> >> >
> >> > This batching won't blur up the cgroup limits, since it is driven by
> >> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> >> > decides to cancel (and save) one smallish scan, it may well be called
> >> > again to accumulate up nr_saved_scan.
> >> >
> >> > It could possibly be a problem for some tiny mem_cgroup (which may be
> >> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> >> >
> >> > CC: Rik van Riel <riel@redhat.com>
> >> > CC: Minchan Kim <minchan.kim@gmail.com>
> >> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> >> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >> > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> >> > ---
> >>
> >> Hmm, how about this ?
> >> ==
> >> Now, nr_saved_scan is tied to zone's LRU.
> >> But, considering how vmscan works, it should be tied to reclaim_stat.
> >>
> >> By this, memcg can make use of nr_saved_scan information seamlessly.
> >
> > Good idea, full patch updated with your signed-off-by :)
> >
> > Thanks,
> > Fengguang
> > ---
> > mm: do batched scans for mem_cgroup
> >
> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > in which case shrink_list() _still_ calls isolate_pages() with the much
> > larger SWAP_CLUSTER_MAX. A It effectively scales up the inactive list
> > scan rate by up to 32 times.
> 
> Yes. It can scan 32 times pages in only inactive list, not active list.

Yes and no ;)

inactive anon list over scanned => inactive_anon_is_low() == TRUE
                                => shrink_active_list()
                                => active anon list over scanned

So the end result may be

- anon inactive  => over scanned
- anon active    => over scanned (maybe not as much)
- file inactive  => over scanned
- file active    => under scanned (relatively)

> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> 
> Active list scan would be scanned in 4,  inactive list  is 32.

Exactly.

> >
> > The accesses to nr_saved_scan are not lock protected and so not 100%
> > accurate, however we can tolerate small errors and the resulted small
> > imbalanced scan rates between zones.
> 
> Yes.
> 
> > This batching won't blur up the cgroup limits, since it is driven by
> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > decides to cancel (and save) one smallish scan, it may well be called
> > again to accumulate up nr_saved_scan.
> 
> You mean nr_scan_try_batch logic ?
> But that logic works for just global reclaim?
> Now am I missing something?
> 
> Could you elaborate more? :)

Sorry for the confusion. The above paragraph originates from Balbir's
concern:

        This might be a concern (although not a big ATM), since we can't
        afford to miss limits by much. If a cgroup is near its limit and we
        drop scanning it. We'll have to work out what this means for the end
        user. May be more fundamental look through is required at the priority
        based logic of exposing how much to scan, I don't know.

Thanks,
Fengguang

> > It could possibly be a problem for some tiny mem_cgroup (which may be
> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> >
> > CC: Rik van Riel <riel@redhat.com>
> > CC: Minchan Kim <minchan.kim@gmail.com>
> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> > A include/linux/mmzone.h | A  A 6 +++++-
> > A mm/page_alloc.c A  A  A  A | A  A 2 +-
> > A mm/vmscan.c A  A  A  A  A  A | A  20 +++++++++++---------
> > A 3 files changed, 17 insertions(+), 11 deletions(-)
> >
> > --- linux.orig/include/linux/mmzone.h A  2009-07-30 10:45:15.000000000 +0800
> > +++ linux/include/linux/mmzone.h A  A  A  A 2009-08-20 11:51:08.000000000 +0800
> > @@ -269,6 +269,11 @@ struct zone_reclaim_stat {
> > A  A  A  A  */
> > A  A  A  A unsigned long A  A  A  A  A  recent_rotated[2];
> > A  A  A  A unsigned long A  A  A  A  A  recent_scanned[2];
> > +
> > + A  A  A  /*
> > + A  A  A  A * accumulated for batching
> > + A  A  A  A */
> > + A  A  A  unsigned long A  A  A  A  A  nr_saved_scan[NR_LRU_LISTS];
> > A };
> >
> > A struct zone {
> > @@ -323,7 +328,6 @@ struct zone {
> > A  A  A  A spinlock_t A  A  A  A  A  A  A lru_lock;
> > A  A  A  A struct zone_lru {
> > A  A  A  A  A  A  A  A struct list_head list;
> > - A  A  A  A  A  A  A  unsigned long nr_saved_scan; A  A /* accumulated for batching */
> > A  A  A  A } lru[NR_LRU_LISTS];
> >
> > A  A  A  A struct zone_reclaim_stat reclaim_stat;
> > --- linux.orig/mm/vmscan.c A  A  A 2009-08-20 11:48:46.000000000 +0800
> > +++ linux/mm/vmscan.c A  2009-08-20 12:00:55.000000000 +0800
> > @@ -1521,6 +1521,7 @@ static void shrink_zone(int priority, st
> > A  A  A  A enum lru_list l;
> > A  A  A  A unsigned long nr_reclaimed = sc->nr_reclaimed;
> > A  A  A  A unsigned long swap_cluster_max = sc->swap_cluster_max;
> > + A  A  A  struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> > A  A  A  A int noswap = 0;
> >
> > A  A  A  A /* If we have no swap space, do not bother scanning anon pages. */
> > @@ -1540,12 +1541,9 @@ static void shrink_zone(int priority, st
> > A  A  A  A  A  A  A  A  A  A  A  A scan >>= priority;
> > A  A  A  A  A  A  A  A  A  A  A  A scan = (scan * percent[file]) / 100;
> > A  A  A  A  A  A  A  A }
> > - A  A  A  A  A  A  A  if (scanning_global_lru(sc))
> > - A  A  A  A  A  A  A  A  A  A  A  nr[l] = nr_scan_try_batch(scan,
> > - A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  &zone->lru[l].nr_saved_scan,
> > - A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  swap_cluster_max);
> > - A  A  A  A  A  A  A  else
> > - A  A  A  A  A  A  A  A  A  A  A  nr[l] = scan;
> > + A  A  A  A  A  A  A  nr[l] = nr_scan_try_batch(scan,
> > + A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  &reclaim_stat->nr_saved_scan[l],
> > + A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  swap_cluster_max);
> > A  A  A  A }
> >
> > A  A  A  A while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> > @@ -2128,6 +2126,7 @@ static void shrink_all_zones(unsigned lo
> > A {
> > A  A  A  A struct zone *zone;
> > A  A  A  A unsigned long nr_reclaimed = 0;
> > + A  A  A  struct zone_reclaim_stat *reclaim_stat;
> >
> > A  A  A  A for_each_populated_zone(zone) {
> > A  A  A  A  A  A  A  A enum lru_list l;
> > @@ -2144,11 +2143,14 @@ static void shrink_all_zones(unsigned lo
> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A l == LRU_ACTIVE_FILE))
> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A continue;
> >
> > - A  A  A  A  A  A  A  A  A  A  A  zone->lru[l].nr_saved_scan += (lru_pages >> prio) + 1;
> > - A  A  A  A  A  A  A  A  A  A  A  if (zone->lru[l].nr_saved_scan >= nr_pages || pass > 3) {
> > + A  A  A  A  A  A  A  A  A  A  A  reclaim_stat = get_reclaim_stat(zone, sc);
> > + A  A  A  A  A  A  A  A  A  A  A  reclaim_stat->nr_saved_scan[l] +=
> > + A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  (lru_pages >> prio) + 1;
> > + A  A  A  A  A  A  A  A  A  A  A  if (reclaim_stat->nr_saved_scan[l]
> > + A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  >= nr_pages || pass > 3) {
> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A unsigned long nr_to_scan;
> >
> > - A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  zone->lru[l].nr_saved_scan = 0;
> > + A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  reclaim_stat->nr_saved_scan[l] = 0;
> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A nr_to_scan = min(nr_pages, lru_pages);
> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A nr_reclaimed += shrink_list(l, nr_to_scan, zone,
> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A sc, prio);
> > --- linux.orig/mm/page_alloc.c A 2009-08-20 11:57:54.000000000 +0800
> > +++ linux/mm/page_alloc.c A  A  A  2009-08-20 11:58:39.000000000 +0800
> > @@ -3716,7 +3716,7 @@ static void __paginginit free_area_init_
> > A  A  A  A  A  A  A  A zone_pcp_init(zone);
> > A  A  A  A  A  A  A  A for_each_lru(l) {
> > A  A  A  A  A  A  A  A  A  A  A  A INIT_LIST_HEAD(&zone->lru[l].list);
> > - A  A  A  A  A  A  A  A  A  A  A  zone->lru[l].nr_saved_scan = 0;
> > + A  A  A  A  A  A  A  A  A  A  A  zone->reclaim_stat.nr_saved_scan[l] = 0;
> > A  A  A  A  A  A  A  A }
> > A  A  A  A  A  A  A  A zone->reclaim_stat.recent_rotated[0] = 0;
> > A  A  A  A  A  A  A  A zone->reclaim_stat.recent_rotated[1] = 0;
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org. A For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
  2009-08-20 11:49       ` Wu Fengguang
@ 2009-08-20 12:13         ` Minchan Kim
  2009-08-20 12:32           ` Wu Fengguang
  0 siblings, 1 reply; 19+ messages in thread
From: Minchan Kim @ 2009-08-20 12:13 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Balbir Singh, KOSAKI Motohiro,
	Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
	Dike, Jeffrey G, Hugh Dickins, Christoph Lameter, Mel Gorman,
	LKML, linux-mm, nishimura, lizf, menage

On Thu, Aug 20, 2009 at 8:49 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Thu, Aug 20, 2009 at 07:01:21PM +0800, Minchan Kim wrote:
>> Hi, Wu.
>>
>> On Thu, Aug 20, 2009 at 1:05 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> > On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
>> >> On Thu, 20 Aug 2009 10:49:29 +0800
>> >> Wu Fengguang <fengguang.wu@intel.com> wrote:
>> >>
>> >> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
>> >> > in which case shrink_list() _still_ calls isolate_pages() with the much
>> >> > larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
>> >> > scan rate by up to 32 times.
>> >> >
>> >> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
>> >> > So when shrink_zone() expects to scan 4 pages in the active/inactive
>> >> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
>> >> >
>> >> > The accesses to nr_saved_scan are not lock protected and so not 100%
>> >> > accurate, however we can tolerate small errors and the resulted small
>> >> > imbalanced scan rates between zones.
>> >> >
>> >> > This batching won't blur up the cgroup limits, since it is driven by
>> >> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
>> >> > decides to cancel (and save) one smallish scan, it may well be called
>> >> > again to accumulate up nr_saved_scan.
>> >> >
>> >> > It could possibly be a problem for some tiny mem_cgroup (which may be
>> >> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
>> >> >
>> >> > CC: Rik van Riel <riel@redhat.com>
>> >> > CC: Minchan Kim <minchan.kim@gmail.com>
>> >> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
>> >> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> >> > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> >> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
>> >> > ---
>> >>
>> >> Hmm, how about this ?
>> >> ==
>> >> Now, nr_saved_scan is tied to zone's LRU.
>> >> But, considering how vmscan works, it should be tied to reclaim_stat.
>> >>
>> >> By this, memcg can make use of nr_saved_scan information seamlessly.
>> >
>> > Good idea, full patch updated with your signed-off-by :)
>> >
>> > Thanks,
>> > Fengguang
>> > ---
>> > mm: do batched scans for mem_cgroup
>> >
>> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
>> > in which case shrink_list() _still_ calls isolate_pages() with the much
>> > larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
>> > scan rate by up to 32 times.
>>
>> Yes. It can scan 32 times pages in only inactive list, not active list.
>
> Yes and no ;)
>
> inactive anon list over scanned => inactive_anon_is_low() == TRUE
>                                => shrink_active_list()
>                                => active anon list over scanned

Why inactive anon list is overscanned in case mem_cgroup ?

in shrink_zone,
1) The vm doesn't accumulate nr[l].
2) Below routine store min value to nr_to_scan.
nr_to_scan = min(nr[l], swap_cluster_max);
ex) if nr[l] = 4, vm calls shrink_active_list with 4 as nr_to_scan.
So I think overscan doesn't occur in active list.

> So the end result may be
>
> - anon inactive  => over scanned
> - anon active    => over scanned (maybe not as much)
> - file inactive  => over scanned
> - file active    => under scanned (relatively)
>
>> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
>> > So when shrink_zone() expects to scan 4 pages in the active/inactive
>> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
>>
>> Active list scan would be scanned in 4,  inactive list  is 32.
>
> Exactly.
>
>> >
>> > The accesses to nr_saved_scan are not lock protected and so not 100%
>> > accurate, however we can tolerate small errors and the resulted small
>> > imbalanced scan rates between zones.
>>
>> Yes.
>>
>> > This batching won't blur up the cgroup limits, since it is driven by
>> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
>> > decides to cancel (and save) one smallish scan, it may well be called
>> > again to accumulate up nr_saved_scan.
>>
>> You mean nr_scan_try_batch logic ?
>> But that logic works for just global reclaim?
>> Now am I missing something?
>>
>> Could you elaborate more? :)
>
> Sorry for the confusion. The above paragraph originates from Balbir's
> concern:
>
>        This might be a concern (although not a big ATM), since we can't
>        afford to miss limits by much. If a cgroup is near its limit and we
>        drop scanning it. We'll have to work out what this means for the end

Why does mem_cgroup drops scanning ?
It's because nr_scan_try_batch? or something ?

Sorry. Still, I can't understand your point. :(

>        user. May be more fundamental look through is required at the priority
>        based logic of exposing how much to scan, I don't know.
>
> Thanks,
> Fengguang
>
>> > It could possibly be a problem for some tiny mem_cgroup (which may be
>> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
>> >
>> > CC: Rik van Riel <riel@redhat.com>
>> > CC: Minchan Kim <minchan.kim@gmail.com>
>> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
>> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
>> > ---
>> >  include/linux/mmzone.h |    6 +++++-
>> >  mm/page_alloc.c        |    2 +-
>> >  mm/vmscan.c            |   20 +++++++++++---------
>> >  3 files changed, 17 insertions(+), 11 deletions(-)
>> >
>> > --- linux.orig/include/linux/mmzone.h   2009-07-30 10:45:15.000000000 +0800
>> > +++ linux/include/linux/mmzone.h        2009-08-20 11:51:08.000000000 +0800
>> > @@ -269,6 +269,11 @@ struct zone_reclaim_stat {
>> >         */
>> >        unsigned long           recent_rotated[2];
>> >        unsigned long           recent_scanned[2];
>> > +
>> > +       /*
>> > +        * accumulated for batching
>> > +        */
>> > +       unsigned long           nr_saved_scan[NR_LRU_LISTS];
>> >  };
>> >
>> >  struct zone {
>> > @@ -323,7 +328,6 @@ struct zone {
>> >        spinlock_t              lru_lock;
>> >        struct zone_lru {
>> >                struct list_head list;
>> > -               unsigned long nr_saved_scan;    /* accumulated for batching */
>> >        } lru[NR_LRU_LISTS];
>> >
>> >        struct zone_reclaim_stat reclaim_stat;
>> > --- linux.orig/mm/vmscan.c      2009-08-20 11:48:46.000000000 +0800
>> > +++ linux/mm/vmscan.c   2009-08-20 12:00:55.000000000 +0800
>> > @@ -1521,6 +1521,7 @@ static void shrink_zone(int priority, st
>> >        enum lru_list l;
>> >        unsigned long nr_reclaimed = sc->nr_reclaimed;
>> >        unsigned long swap_cluster_max = sc->swap_cluster_max;
>> > +       struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
>> >        int noswap = 0;
>> >
>> >        /* If we have no swap space, do not bother scanning anon pages. */
>> > @@ -1540,12 +1541,9 @@ static void shrink_zone(int priority, st
>> >                        scan >>= priority;
>> >                        scan = (scan * percent[file]) / 100;
>> >                }
>> > -               if (scanning_global_lru(sc))
>> > -                       nr[l] = nr_scan_try_batch(scan,
>> > -                                                 &zone->lru[l].nr_saved_scan,
>> > -                                                 swap_cluster_max);
>> > -               else
>> > -                       nr[l] = scan;
>> > +               nr[l] = nr_scan_try_batch(scan,
>> > +                                         &reclaim_stat->nr_saved_scan[l],
>> > +                                         swap_cluster_max);
>> >        }
>> >
>> >        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
>> > @@ -2128,6 +2126,7 @@ static void shrink_all_zones(unsigned lo
>> >  {
>> >        struct zone *zone;
>> >        unsigned long nr_reclaimed = 0;
>> > +       struct zone_reclaim_stat *reclaim_stat;
>> >
>> >        for_each_populated_zone(zone) {
>> >                enum lru_list l;
>> > @@ -2144,11 +2143,14 @@ static void shrink_all_zones(unsigned lo
>> >                                                l == LRU_ACTIVE_FILE))
>> >                                continue;
>> >
>> > -                       zone->lru[l].nr_saved_scan += (lru_pages >> prio) + 1;
>> > -                       if (zone->lru[l].nr_saved_scan >= nr_pages || pass > 3) {
>> > +                       reclaim_stat = get_reclaim_stat(zone, sc);
>> > +                       reclaim_stat->nr_saved_scan[l] +=
>> > +                                               (lru_pages >> prio) + 1;
>> > +                       if (reclaim_stat->nr_saved_scan[l]
>> > +                                               >= nr_pages || pass > 3) {
>> >                                unsigned long nr_to_scan;
>> >
>> > -                               zone->lru[l].nr_saved_scan = 0;
>> > +                               reclaim_stat->nr_saved_scan[l] = 0;
>> >                                nr_to_scan = min(nr_pages, lru_pages);
>> >                                nr_reclaimed += shrink_list(l, nr_to_scan, zone,
>> >                                                                sc, prio);
>> > --- linux.orig/mm/page_alloc.c  2009-08-20 11:57:54.000000000 +0800
>> > +++ linux/mm/page_alloc.c       2009-08-20 11:58:39.000000000 +0800
>> > @@ -3716,7 +3716,7 @@ static void __paginginit free_area_init_
>> >                zone_pcp_init(zone);
>> >                for_each_lru(l) {
>> >                        INIT_LIST_HEAD(&zone->lru[l].list);
>> > -                       zone->lru[l].nr_saved_scan = 0;
>> > +                       zone->reclaim_stat.nr_saved_scan[l] = 0;
>> >                }
>> >                zone->reclaim_stat.recent_rotated[0] = 0;
>> >                zone->reclaim_stat.recent_rotated[1] = 0;
>> >
>> > --
>> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> > the body to majordomo@kvack.org.  For more info on Linux MM,
>> > see: http://www.linux-mm.org/ .
>> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>> >
>>
>>
>>
>> --
>> Kind regards,
>> Minchan Kim
>



-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
  2009-08-20 12:13         ` Minchan Kim
@ 2009-08-20 12:32           ` Wu Fengguang
  0 siblings, 0 replies; 19+ messages in thread
From: Wu Fengguang @ 2009-08-20 12:32 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Balbir Singh, KOSAKI Motohiro,
	Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
	Dike, Jeffrey G, Hugh Dickins, Christoph Lameter, Mel Gorman,
	LKML, linux-mm, nishimura, lizf, menage

On Thu, Aug 20, 2009 at 08:13:59PM +0800, Minchan Kim wrote:
> On Thu, Aug 20, 2009 at 8:49 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Thu, Aug 20, 2009 at 07:01:21PM +0800, Minchan Kim wrote:
> >> Hi, Wu.
> >>
> >> On Thu, Aug 20, 2009 at 1:05 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> > On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
> >> >> On Thu, 20 Aug 2009 10:49:29 +0800
> >> >> Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> >>
> >> >> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> >> >> > in which case shrink_list() _still_ calls isolate_pages() with the much
> >> >> > larger SWAP_CLUSTER_MAX. A It effectively scales up the inactive list
> >> >> > scan rate by up to 32 times.
> >> >> >
> >> >> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> >> >> > So when shrink_zone() expects to scan 4 pages in the active/inactive
> >> >> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> >> >> >
> >> >> > The accesses to nr_saved_scan are not lock protected and so not 100%
> >> >> > accurate, however we can tolerate small errors and the resulted small
> >> >> > imbalanced scan rates between zones.
> >> >> >
> >> >> > This batching won't blur up the cgroup limits, since it is driven by
> >> >> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> >> >> > decides to cancel (and save) one smallish scan, it may well be called
> >> >> > again to accumulate up nr_saved_scan.
> >> >> >
> >> >> > It could possibly be a problem for some tiny mem_cgroup (which may be
> >> >> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> >> >> >
> >> >> > CC: Rik van Riel <riel@redhat.com>
> >> >> > CC: Minchan Kim <minchan.kim@gmail.com>
> >> >> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> >> >> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >> >> > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> >> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> >> >> > ---
> >> >>
> >> >> Hmm, how about this ?
> >> >> ==
> >> >> Now, nr_saved_scan is tied to zone's LRU.
> >> >> But, considering how vmscan works, it should be tied to reclaim_stat.
> >> >>
> >> >> By this, memcg can make use of nr_saved_scan information seamlessly.
> >> >
> >> > Good idea, full patch updated with your signed-off-by :)
> >> >
> >> > Thanks,
> >> > Fengguang
> >> > ---
> >> > mm: do batched scans for mem_cgroup
> >> >
> >> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> >> > in which case shrink_list() _still_ calls isolate_pages() with the much
> >> > larger SWAP_CLUSTER_MAX. A It effectively scales up the inactive list
> >> > scan rate by up to 32 times.
> >>
> >> Yes. It can scan 32 times pages in only inactive list, not active list.
> >
> > Yes and no ;)
> >
> > inactive anon list over scanned => inactive_anon_is_low() == TRUE
> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A => shrink_active_list()
> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A => active anon list over scanned
> 
> Why inactive anon list is overscanned in case mem_cgroup ?
> 
> in shrink_zone,
> 1) The vm doesn't accumulate nr[l].
> 2) Below routine store min value to nr_to_scan.
> nr_to_scan = min(nr[l], swap_cluster_max);
> ex) if nr[l] = 4, vm calls shrink_active_list with 4 as nr_to_scan.

It's not over scanned here, but at end of shrink_zone():

        /* 
         * Even if we did not try to evict anon pages at all, we want to
         * rebalance the anon lru active/inactive ratio.
         */
        if (inactive_anon_is_low(zone, sc) && nr_swap_pages > 0)
                shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);

as well as balance_pgdat():

                        /*
                         * Do some background aging of the anon list, to give
                         * pages a chance to be referenced before reclaiming.
                         */
                        if (inactive_anon_is_low(zone, &sc))
                                shrink_active_list(SWAP_CLUSTER_MAX, zone,
                                                        &sc, priority, 0);

So the anon lists are over scanned compared to the active file list.

> So I think overscan doesn't occur in active list.
> 
> > So the end result may be
> >
> > - anon inactive A => over scanned
> > - anon active A  A => over scanned (maybe not as much)
> > - file inactive A => over scanned
> > - file active A  A => under scanned (relatively)
> >
> >> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> >> > So when shrink_zone() expects to scan 4 pages in the active/inactive
> >> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> >>
> >> Active list scan would be scanned in 4, A inactive list A is 32.
> >
> > Exactly.
> >
> >> >
> >> > The accesses to nr_saved_scan are not lock protected and so not 100%
> >> > accurate, however we can tolerate small errors and the resulted small
> >> > imbalanced scan rates between zones.
> >>
> >> Yes.
> >>
> >> > This batching won't blur up the cgroup limits, since it is driven by
> >> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> >> > decides to cancel (and save) one smallish scan, it may well be called
> >> > again to accumulate up nr_saved_scan.
> >>
> >> You mean nr_scan_try_batch logic ?
> >> But that logic works for just global reclaim?
> >> Now am I missing something?
> >>
> >> Could you elaborate more? :)
> >
> > Sorry for the confusion. The above paragraph originates from Balbir's
> > concern:
> >
> > A  A  A  A This might be a concern (although not a big ATM), since we can't
> > A  A  A  A afford to miss limits by much. If a cgroup is near its limit and we
> > A  A  A  A drop scanning it. We'll have to work out what this means for the end
> 
> Why does mem_cgroup drops scanning ?

Right, it has no reason to drop scanning, as long as it has not
reclaimed enough pages.

> It's because nr_scan_try_batch? or something ?

nr_scan_try_batch may only make this invocation of shrink_zone() drop scanning.
But balance_pgdat() etc. will re-call shrink_zone() to make progress.

> Sorry. Still, I can't understand your point. :(

It's _nothing_ wrong with you to not able to understand it :)
Sorry, I was explaining a null issue indeed. I'd better just remove that paragraph..
 
Thanks,
Fengguang

> > A  A  A  A user. May be more fundamental look through is required at the priority
> > A  A  A  A based logic of exposing how much to scan, I don't know.
> >
> > Thanks,
> > Fengguang
> >
> >> > It could possibly be a problem for some tiny mem_cgroup (which may be
> >> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> >> >
> >> > CC: Rik van Riel <riel@redhat.com>
> >> > CC: Minchan Kim <minchan.kim@gmail.com>
> >> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> >> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> >> > ---
> >> > A include/linux/mmzone.h | A  A 6 +++++-
> >> > A mm/page_alloc.c A  A  A  A | A  A 2 +-
> >> > A mm/vmscan.c A  A  A  A  A  A | A  20 +++++++++++---------
> >> > A 3 files changed, 17 insertions(+), 11 deletions(-)
> >> >
> >> > --- linux.orig/include/linux/mmzone.h A  2009-07-30 10:45:15.000000000 +0800
> >> > +++ linux/include/linux/mmzone.h A  A  A  A 2009-08-20 11:51:08.000000000 +0800
> >> > @@ -269,6 +269,11 @@ struct zone_reclaim_stat {
> >> > A  A  A  A  */
> >> > A  A  A  A unsigned long A  A  A  A  A  recent_rotated[2];
> >> > A  A  A  A unsigned long A  A  A  A  A  recent_scanned[2];
> >> > +
> >> > + A  A  A  /*
> >> > + A  A  A  A * accumulated for batching
> >> > + A  A  A  A */
> >> > + A  A  A  unsigned long A  A  A  A  A  nr_saved_scan[NR_LRU_LISTS];
> >> > A };
> >> >
> >> > A struct zone {
> >> > @@ -323,7 +328,6 @@ struct zone {
> >> > A  A  A  A spinlock_t A  A  A  A  A  A  A lru_lock;
> >> > A  A  A  A struct zone_lru {
> >> > A  A  A  A  A  A  A  A struct list_head list;
> >> > - A  A  A  A  A  A  A  unsigned long nr_saved_scan; A  A /* accumulated for batching */
> >> > A  A  A  A } lru[NR_LRU_LISTS];
> >> >
> >> > A  A  A  A struct zone_reclaim_stat reclaim_stat;
> >> > --- linux.orig/mm/vmscan.c A  A  A 2009-08-20 11:48:46.000000000 +0800
> >> > +++ linux/mm/vmscan.c A  2009-08-20 12:00:55.000000000 +0800
> >> > @@ -1521,6 +1521,7 @@ static void shrink_zone(int priority, st
> >> > A  A  A  A enum lru_list l;
> >> > A  A  A  A unsigned long nr_reclaimed = sc->nr_reclaimed;
> >> > A  A  A  A unsigned long swap_cluster_max = sc->swap_cluster_max;
> >> > + A  A  A  struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> >> > A  A  A  A int noswap = 0;
> >> >
> >> > A  A  A  A /* If we have no swap space, do not bother scanning anon pages. */
> >> > @@ -1540,12 +1541,9 @@ static void shrink_zone(int priority, st
> >> > A  A  A  A  A  A  A  A  A  A  A  A scan >>= priority;
> >> > A  A  A  A  A  A  A  A  A  A  A  A scan = (scan * percent[file]) / 100;
> >> > A  A  A  A  A  A  A  A }
> >> > - A  A  A  A  A  A  A  if (scanning_global_lru(sc))
> >> > - A  A  A  A  A  A  A  A  A  A  A  nr[l] = nr_scan_try_batch(scan,
> >> > - A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  &zone->lru[l].nr_saved_scan,
> >> > - A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  swap_cluster_max);
> >> > - A  A  A  A  A  A  A  else
> >> > - A  A  A  A  A  A  A  A  A  A  A  nr[l] = scan;
> >> > + A  A  A  A  A  A  A  nr[l] = nr_scan_try_batch(scan,
> >> > + A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  &reclaim_stat->nr_saved_scan[l],
> >> > + A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  swap_cluster_max);
> >> > A  A  A  A }
> >> >
> >> > A  A  A  A while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> >> > @@ -2128,6 +2126,7 @@ static void shrink_all_zones(unsigned lo
> >> > A {
> >> > A  A  A  A struct zone *zone;
> >> > A  A  A  A unsigned long nr_reclaimed = 0;
> >> > + A  A  A  struct zone_reclaim_stat *reclaim_stat;
> >> >
> >> > A  A  A  A for_each_populated_zone(zone) {
> >> > A  A  A  A  A  A  A  A enum lru_list l;
> >> > @@ -2144,11 +2143,14 @@ static void shrink_all_zones(unsigned lo
> >> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A l == LRU_ACTIVE_FILE))
> >> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A continue;
> >> >
> >> > - A  A  A  A  A  A  A  A  A  A  A  zone->lru[l].nr_saved_scan += (lru_pages >> prio) + 1;
> >> > - A  A  A  A  A  A  A  A  A  A  A  if (zone->lru[l].nr_saved_scan >= nr_pages || pass > 3) {
> >> > + A  A  A  A  A  A  A  A  A  A  A  reclaim_stat = get_reclaim_stat(zone, sc);
> >> > + A  A  A  A  A  A  A  A  A  A  A  reclaim_stat->nr_saved_scan[l] +=
> >> > + A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  (lru_pages >> prio) + 1;
> >> > + A  A  A  A  A  A  A  A  A  A  A  if (reclaim_stat->nr_saved_scan[l]
> >> > + A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  >= nr_pages || pass > 3) {
> >> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A unsigned long nr_to_scan;
> >> >
> >> > - A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  zone->lru[l].nr_saved_scan = 0;
> >> > + A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  reclaim_stat->nr_saved_scan[l] = 0;
> >> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A nr_to_scan = min(nr_pages, lru_pages);
> >> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A nr_reclaimed += shrink_list(l, nr_to_scan, zone,
> >> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A sc, prio);
> >> > --- linux.orig/mm/page_alloc.c A 2009-08-20 11:57:54.000000000 +0800
> >> > +++ linux/mm/page_alloc.c A  A  A  2009-08-20 11:58:39.000000000 +0800
> >> > @@ -3716,7 +3716,7 @@ static void __paginginit free_area_init_
> >> > A  A  A  A  A  A  A  A zone_pcp_init(zone);
> >> > A  A  A  A  A  A  A  A for_each_lru(l) {
> >> > A  A  A  A  A  A  A  A  A  A  A  A INIT_LIST_HEAD(&zone->lru[l].list);
> >> > - A  A  A  A  A  A  A  A  A  A  A  zone->lru[l].nr_saved_scan = 0;
> >> > + A  A  A  A  A  A  A  A  A  A  A  zone->reclaim_stat.nr_saved_scan[l] = 0;
> >> > A  A  A  A  A  A  A  A }
> >> > A  A  A  A  A  A  A  A zone->reclaim_stat.recent_rotated[0] = 0;
> >> > A  A  A  A  A  A  A  A zone->reclaim_stat.recent_rotated[1] = 0;
> >> >
> >> > --
> >> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >> > the body to majordomo@kvack.org. A For more info on Linux MM,
> >> > see: http://www.linux-mm.org/ .
> >> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >> >
> >>
> >>
> >>
> >> --
> >> Kind regards,
> >> Minchan Kim
> >
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
  2009-08-20  4:05   ` [PATCH -v2] " Wu Fengguang
                       ` (2 preceding siblings ...)
  2009-08-20 11:01     ` Minchan Kim
@ 2009-08-21  3:55     ` Minchan Kim
  2009-08-21  7:27       ` [PATCH -v2 changelog updated] " Wu Fengguang
  3 siblings, 1 reply; 19+ messages in thread
From: Minchan Kim @ 2009-08-21  3:55 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Balbir Singh, KOSAKI Motohiro,
	Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
	Dike, Jeffrey G, Hugh Dickins, Christoph Lameter, Mel Gorman,
	LKML, linux-mm, nishimura, lizf, menage

On Thu, Aug 20, 2009 at 1:05 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
>> On Thu, 20 Aug 2009 10:49:29 +0800
>> Wu Fengguang <fengguang.wu@intel.com> wrote:
>>
>> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
>> > in which case shrink_list() _still_ calls isolate_pages() with the much
>> > larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
>> > scan rate by up to 32 times.
>> >
>> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
>> > So when shrink_zone() expects to scan 4 pages in the active/inactive
>> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
>> >
>> > The accesses to nr_saved_scan are not lock protected and so not 100%
>> > accurate, however we can tolerate small errors and the resulted small
>> > imbalanced scan rates between zones.
>> >
>> > This batching won't blur up the cgroup limits, since it is driven by
>> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
>> > decides to cancel (and save) one smallish scan, it may well be called
>> > again to accumulate up nr_saved_scan.
>> >
>> > It could possibly be a problem for some tiny mem_cgroup (which may be
>> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
>> >
>> > CC: Rik van Riel <riel@redhat.com>
>> > CC: Minchan Kim <minchan.kim@gmail.com>
>> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
>> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

It looks better than now :)

I hope you will rewrite description and add test result in changelog. :)
Thanks for your great effort.

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH -v2 changelog updated] mm: do batched scans for mem_cgroup
  2009-08-21  3:55     ` Minchan Kim
@ 2009-08-21  7:27       ` Wu Fengguang
  2009-08-21 10:57         ` KOSAKI Motohiro
  0 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2009-08-21  7:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Balbir Singh, KOSAKI Motohiro,
	Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
	Dike, Jeffrey G, Hugh Dickins, Christoph Lameter, Mel Gorman,
	LKML, linux-mm, nishimura, lizf, menage

For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
in which case shrink_list() _still_ calls isolate_pages() with the much
larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
scan rate by up to 32 times.

For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
So when shrink_zone() expects to scan 4 pages in the active/inactive
list, the active list will be scanned 4 pages, while the inactive list
will be (over) scanned SWAP_CLUSTER_MAX=32 pages in effect. And that
could break the balance between the two lists.

It can further impact the scan of anon active list, due to the anon
active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():

inactive anon list over scanned => inactive_anon_is_low() == TRUE
                                => shrink_active_list()
                                => active anon list over scanned

So the end result may be

- anon inactive  => over scanned
- anon active    => over scanned (maybe not as much)
- file inactive  => over scanned
- file active    => under scanned (relatively)

The accesses to nr_saved_scan are not lock protected and so not 100%
accurate, however we can tolerate small errors and the resulted small
imbalanced scan rates between zones.

CC: Rik van Riel <riel@redhat.com>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/mmzone.h |    6 +++++-
 mm/page_alloc.c        |    2 +-
 mm/vmscan.c            |   20 +++++++++++---------
 3 files changed, 17 insertions(+), 11 deletions(-)

--- linux.orig/include/linux/mmzone.h	2009-08-21 15:02:50.000000000 +0800
+++ linux/include/linux/mmzone.h	2009-08-21 15:03:25.000000000 +0800
@@ -269,6 +269,11 @@ struct zone_reclaim_stat {
 	 */
 	unsigned long		recent_rotated[2];
 	unsigned long		recent_scanned[2];
+
+	/*
+	 * accumulated for batching
+	 */
+	unsigned long		nr_saved_scan[NR_LRU_LISTS];
 };
 
 struct zone {
@@ -323,7 +328,6 @@ struct zone {
 	spinlock_t		lru_lock;	
 	struct zone_lru {
 		struct list_head list;
-		unsigned long nr_saved_scan;	/* accumulated for batching */
 	} lru[NR_LRU_LISTS];
 
 	struct zone_reclaim_stat reclaim_stat;
--- linux.orig/mm/vmscan.c	2009-08-21 15:03:15.000000000 +0800
+++ linux/mm/vmscan.c	2009-08-21 15:03:25.000000000 +0800
@@ -1521,6 +1521,7 @@ static void shrink_zone(int priority, st
 	enum lru_list l;
 	unsigned long nr_reclaimed = sc->nr_reclaimed;
 	unsigned long swap_cluster_max = sc->swap_cluster_max;
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 	int noswap = 0;
 
 	/* If we have no swap space, do not bother scanning anon pages. */
@@ -1540,12 +1541,9 @@ static void shrink_zone(int priority, st
 			scan >>= priority;
 			scan = (scan * percent[file]) / 100;
 		}
-		if (scanning_global_lru(sc))
-			nr[l] = nr_scan_try_batch(scan,
-						  &zone->lru[l].nr_saved_scan,
-						  swap_cluster_max);
-		else
-			nr[l] = scan;
+		nr[l] = nr_scan_try_batch(scan,
+					  &reclaim_stat->nr_saved_scan[l],
+					  swap_cluster_max);
 	}
 
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
@@ -2128,6 +2126,7 @@ static void shrink_all_zones(unsigned lo
 {
 	struct zone *zone;
 	unsigned long nr_reclaimed = 0;
+	struct zone_reclaim_stat *reclaim_stat;
 
 	for_each_populated_zone(zone) {
 		enum lru_list l;
@@ -2144,11 +2143,14 @@ static void shrink_all_zones(unsigned lo
 						l == LRU_ACTIVE_FILE))
 				continue;
 
-			zone->lru[l].nr_saved_scan += (lru_pages >> prio) + 1;
-			if (zone->lru[l].nr_saved_scan >= nr_pages || pass > 3) {
+			reclaim_stat = get_reclaim_stat(zone, sc);
+			reclaim_stat->nr_saved_scan[l] +=
+						(lru_pages >> prio) + 1;
+			if (reclaim_stat->nr_saved_scan[l]
+						>= nr_pages || pass > 3) {
 				unsigned long nr_to_scan;
 
-				zone->lru[l].nr_saved_scan = 0;
+				reclaim_stat->nr_saved_scan[l] = 0;
 				nr_to_scan = min(nr_pages, lru_pages);
 				nr_reclaimed += shrink_list(l, nr_to_scan, zone,
 								sc, prio);
--- linux.orig/mm/page_alloc.c	2009-08-21 15:02:50.000000000 +0800
+++ linux/mm/page_alloc.c	2009-08-21 15:03:25.000000000 +0800
@@ -3734,7 +3734,7 @@ static void __paginginit free_area_init_
 		zone_pcp_init(zone);
 		for_each_lru(l) {
 			INIT_LIST_HEAD(&zone->lru[l].list);
-			zone->lru[l].nr_saved_scan = 0;
+			zone->reclaim_stat.nr_saved_scan[l] = 0;
 		}
 		zone->reclaim_stat.recent_rotated[0] = 0;
 		zone->reclaim_stat.recent_rotated[1] = 0;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH -v2 changelog updated] mm: do batched scans for mem_cgroup
  2009-08-21  7:27       ` [PATCH -v2 changelog updated] " Wu Fengguang
@ 2009-08-21 10:57         ` KOSAKI Motohiro
  0 siblings, 0 replies; 19+ messages in thread
From: KOSAKI Motohiro @ 2009-08-21 10:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Minchan Kim, KAMEZAWA Hiroyuki, Andrew Morton, Balbir Singh,
	Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
	Dike, Jeffrey G, Hugh Dickins, Christoph Lameter, Mel Gorman,
	LKML, linux-mm, nishimura, lizf, menage

2009/8/21 Wu Fengguang <fengguang.wu@intel.com>:
> For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> in which case shrink_list() _still_ calls isolate_pages() with the much
> larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
> scan rate by up to 32 times.
>
> For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> So when shrink_zone() expects to scan 4 pages in the active/inactive
> list, the active list will be scanned 4 pages, while the inactive list
> will be (over) scanned SWAP_CLUSTER_MAX=32 pages in effect. And that
> could break the balance between the two lists.
>
> It can further impact the scan of anon active list, due to the anon
> active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():
>
> inactive anon list over scanned => inactive_anon_is_low() == TRUE
>                                => shrink_active_list()
>                                => active anon list over scanned
>
> So the end result may be
>
> - anon inactive  => over scanned
> - anon active    => over scanned (maybe not as much)
> - file inactive  => over scanned
> - file active    => under scanned (relatively)
>
> The accesses to nr_saved_scan are not lock protected and so not 100%
> accurate, however we can tolerate small errors and the resulted small
> imbalanced scan rates between zones.
>

Looks good to me.
  Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2009-08-27  0:20 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-20  2:49 [PATCH] mm: do batched scans for mem_cgroup Wu Fengguang
2009-08-20  2:52 ` [PATCH] mm: make nr_scan_try_batch() more safe on races Wu Fengguang
2009-08-20  3:17   ` [RFC][PATCH] mm: remove unnecessary loop inside shrink_inactive_list() Wu Fengguang
2009-08-21 11:09     ` KOSAKI Motohiro
2009-08-21 11:22       ` Wu Fengguang
2009-08-27  0:20         ` KOSAKI Motohiro
2009-08-20  3:13 ` [PATCH] mm: do batched scans for mem_cgroup KAMEZAWA Hiroyuki
2009-08-20  4:05   ` [PATCH -v2] " Wu Fengguang
2009-08-20  4:06     ` KAMEZAWA Hiroyuki
2009-08-20  5:16     ` Balbir Singh
2009-08-21  1:39       ` Wu Fengguang
2009-08-21  1:46         ` Wu Fengguang
2009-08-20 11:01     ` Minchan Kim
2009-08-20 11:49       ` Wu Fengguang
2009-08-20 12:13         ` Minchan Kim
2009-08-20 12:32           ` Wu Fengguang
2009-08-21  3:55     ` Minchan Kim
2009-08-21  7:27       ` [PATCH -v2 changelog updated] " Wu Fengguang
2009-08-21 10:57         ` KOSAKI Motohiro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox