* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
2009-08-20 4:05 ` [PATCH -v2] " Wu Fengguang
@ 2009-08-20 4:06 ` KAMEZAWA Hiroyuki
2009-08-20 5:16 ` Balbir Singh
` (2 subsequent siblings)
3 siblings, 0 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-20 4:06 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Balbir Singh, KOSAKI Motohiro, Rik van Riel,
Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
Hugh Dickins, Christoph Lameter, Mel Gorman, LKML, linux-mm,
nishimura, lizf, menage
On Thu, 20 Aug 2009 12:05:33 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
> > On Thu, 20 Aug 2009 10:49:29 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > > in which case shrink_list() _still_ calls isolate_pages() with the much
> > > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> > > scan rate by up to 32 times.
> > >
> > > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> > >
> > > The accesses to nr_saved_scan are not lock protected and so not 100%
> > > accurate, however we can tolerate small errors and the resulted small
> > > imbalanced scan rates between zones.
> > >
> > > This batching won't blur up the cgroup limits, since it is driven by
> > > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > > decides to cancel (and save) one smallish scan, it may well be called
> > > again to accumulate up nr_saved_scan.
> > >
> > > It could possibly be a problem for some tiny mem_cgroup (which may be
> > > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> > >
> > > CC: Rik van Riel <riel@redhat.com>
> > > CC: Minchan Kim <minchan.kim@gmail.com>
> > > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> >
> > Hmm, how about this ?
> > ==
> > Now, nr_saved_scan is tied to zone's LRU.
> > But, considering how vmscan works, it should be tied to reclaim_stat.
> >
> > By this, memcg can make use of nr_saved_scan information seamlessly.
>
> Good idea, full patch updated with your signed-off-by :)
>
looks nice :)
thanks,
-Kame
> Thanks,
> Fengguang
> ---
> mm: do batched scans for mem_cgroup
>
> For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> in which case shrink_list() _still_ calls isolate_pages() with the much
> larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> scan rate by up to 32 times.
>
> For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> So when shrink_zone() expects to scan 4 pages in the active/inactive
> list, SWAP_CLUSTER_MAX=32 pages will be scanned in effect.
>
> The accesses to nr_saved_scan are not lock protected and so not 100%
> accurate; however we can tolerate small errors and the resulting small
> imbalance in scan rates between zones.
>
> This batching won't blur the cgroup limits, since it is driven by
> "pages reclaimed" rather than "pages scanned". When shrink_zone()
> decides to cancel (and save) one smallish scan, it may well be called
> again to accumulate nr_saved_scan.
>
> It could possibly be a problem for some tiny mem_cgroup (which may be
> scanned in _full_ too many times in order to accumulate nr_saved_scan).
>
> CC: Rik van Riel <riel@redhat.com>
> CC: Minchan Kim <minchan.kim@gmail.com>
> CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> include/linux/mmzone.h | 6 +++++-
> mm/page_alloc.c | 2 +-
> mm/vmscan.c | 20 +++++++++++---------
> 3 files changed, 17 insertions(+), 11 deletions(-)
>
> --- linux.orig/include/linux/mmzone.h 2009-07-30 10:45:15.000000000 +0800
> +++ linux/include/linux/mmzone.h 2009-08-20 11:51:08.000000000 +0800
> @@ -269,6 +269,11 @@ struct zone_reclaim_stat {
> */
> unsigned long recent_rotated[2];
> unsigned long recent_scanned[2];
> +
> + /*
> + * accumulated for batching
> + */
> + unsigned long nr_saved_scan[NR_LRU_LISTS];
> };
>
> struct zone {
> @@ -323,7 +328,6 @@ struct zone {
> spinlock_t lru_lock;
> struct zone_lru {
> struct list_head list;
> - unsigned long nr_saved_scan; /* accumulated for batching */
> } lru[NR_LRU_LISTS];
>
> struct zone_reclaim_stat reclaim_stat;
> --- linux.orig/mm/vmscan.c 2009-08-20 11:48:46.000000000 +0800
> +++ linux/mm/vmscan.c 2009-08-20 12:00:55.000000000 +0800
> @@ -1521,6 +1521,7 @@ static void shrink_zone(int priority, st
> enum lru_list l;
> unsigned long nr_reclaimed = sc->nr_reclaimed;
> unsigned long swap_cluster_max = sc->swap_cluster_max;
> + struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> int noswap = 0;
>
> /* If we have no swap space, do not bother scanning anon pages. */
> @@ -1540,12 +1541,9 @@ static void shrink_zone(int priority, st
> scan >>= priority;
> scan = (scan * percent[file]) / 100;
> }
> - if (scanning_global_lru(sc))
> - nr[l] = nr_scan_try_batch(scan,
> - &zone->lru[l].nr_saved_scan,
> - swap_cluster_max);
> - else
> - nr[l] = scan;
> + nr[l] = nr_scan_try_batch(scan,
> + &reclaim_stat->nr_saved_scan[l],
> + swap_cluster_max);
> }
>
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> @@ -2128,6 +2126,7 @@ static void shrink_all_zones(unsigned lo
> {
> struct zone *zone;
> unsigned long nr_reclaimed = 0;
> + struct zone_reclaim_stat *reclaim_stat;
>
> for_each_populated_zone(zone) {
> enum lru_list l;
> @@ -2144,11 +2143,14 @@ static void shrink_all_zones(unsigned lo
> l == LRU_ACTIVE_FILE))
> continue;
>
> - zone->lru[l].nr_saved_scan += (lru_pages >> prio) + 1;
> - if (zone->lru[l].nr_saved_scan >= nr_pages || pass > 3) {
> + reclaim_stat = get_reclaim_stat(zone, sc);
> + reclaim_stat->nr_saved_scan[l] +=
> + (lru_pages >> prio) + 1;
> + if (reclaim_stat->nr_saved_scan[l]
> + >= nr_pages || pass > 3) {
> unsigned long nr_to_scan;
>
> - zone->lru[l].nr_saved_scan = 0;
> + reclaim_stat->nr_saved_scan[l] = 0;
> nr_to_scan = min(nr_pages, lru_pages);
> nr_reclaimed += shrink_list(l, nr_to_scan, zone,
> sc, prio);
> --- linux.orig/mm/page_alloc.c 2009-08-20 11:57:54.000000000 +0800
> +++ linux/mm/page_alloc.c 2009-08-20 11:58:39.000000000 +0800
> @@ -3716,7 +3716,7 @@ static void __paginginit free_area_init_
> zone_pcp_init(zone);
> for_each_lru(l) {
> INIT_LIST_HEAD(&zone->lru[l].list);
> - zone->lru[l].nr_saved_scan = 0;
> + zone->reclaim_stat.nr_saved_scan[l] = 0;
> }
> zone->reclaim_stat.recent_rotated[0] = 0;
> zone->reclaim_stat.recent_rotated[1] = 0;
>
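For reference, the nr_scan_try_batch() helper that this patch extends to memcg
accumulates sub-batch scan requests until they reach swap_cluster_max. A sketch
of its logic, from memory of the vmscan.c of this period (not necessarily the
exact upstream code):

static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
                                       unsigned long *nr_saved_scan,
                                       unsigned long swap_cluster_max)
{
        unsigned long nr;

        /* add this round's (possibly tiny) request to the saved total */
        *nr_saved_scan += nr_to_scan;
        nr = *nr_saved_scan;

        if (nr >= swap_cluster_max)
                *nr_saved_scan = 0;     /* big enough: release the whole batch */
        else
                nr = 0;                 /* too small: keep saving, scan nothing */

        return nr;
}

With this, a memcg zone that would otherwise be asked to scan 1-4 pages per call
instead skips a few calls and then scans batches at or near SWAP_CLUSTER_MAX, as
the scan=32/scan=10 figures later in the thread show.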
* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
2009-08-20 4:05 ` [PATCH -v2] " Wu Fengguang
2009-08-20 4:06 ` KAMEZAWA Hiroyuki
@ 2009-08-20 5:16 ` Balbir Singh
2009-08-21 1:39 ` Wu Fengguang
2009-08-20 11:01 ` Minchan Kim
2009-08-21 3:55 ` Minchan Kim
3 siblings, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2009-08-20 5:16 UTC (permalink / raw)
To: Wu Fengguang
Cc: KAMEZAWA Hiroyuki, Andrew Morton, KOSAKI Motohiro, Rik van Riel,
Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
Hugh Dickins, Christoph Lameter, Mel Gorman, LKML, linux-mm,
nishimura, lizf, menage
* Wu Fengguang <fengguang.wu@intel.com> [2009-08-20 12:05:33]:
> On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
> > On Thu, 20 Aug 2009 10:49:29 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > > in which case shrink_list() _still_ calls isolate_pages() with the much
> > > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> > > scan rate by up to 32 times.
> > >
> > > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> > >
> > > The accesses to nr_saved_scan are not lock protected and so not 100%
> > > accurate, however we can tolerate small errors and the resulted small
> > > imbalanced scan rates between zones.
> > >
> > > This batching won't blur up the cgroup limits, since it is driven by
> > > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > > decides to cancel (and save) one smallish scan, it may well be called
> > > again to accumulate up nr_saved_scan.
> > >
> > > It could possibly be a problem for some tiny mem_cgroup (which may be
> > > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> > >
> > > CC: Rik van Riel <riel@redhat.com>
> > > CC: Minchan Kim <minchan.kim@gmail.com>
> > > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> >
> > Hmm, how about this ?
> > ==
> > Now, nr_saved_scan is tied to zone's LRU.
> > But, considering how vmscan works, it should be tied to reclaim_stat.
> >
> > By this, memcg can make use of nr_saved_scan information seamlessly.
>
> Good idea, full patch updated with your signed-off-by :)
>
> Thanks,
> Fengguang
> ---
> mm: do batched scans for mem_cgroup
>
> For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> in which case shrink_list() _still_ calls isolate_pages() with the much
> larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> scan rate by up to 32 times.
>
> For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> So when shrink_zone() expects to scan 4 pages in the active/inactive
> list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
>
> The accesses to nr_saved_scan are not lock protected and so not 100%
> accurate, however we can tolerate small errors and the resulted small
> imbalanced scan rates between zones.
>
> This batching won't blur up the cgroup limits, since it is driven by
> "pages reclaimed" rather than "pages scanned". When shrink_zone()
> decides to cancel (and save) one smallish scan, it may well be called
> again to accumulate up nr_saved_scan.
>
> It could possibly be a problem for some tiny mem_cgroup (which may be
> _full_ scanned too much times in order to accumulate up nr_saved_scan).
>
Looks good to me, how did you test it?
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
--
Balbir
* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
2009-08-20 5:16 ` Balbir Singh
@ 2009-08-21 1:39 ` Wu Fengguang
2009-08-21 1:46 ` Wu Fengguang
0 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2009-08-21 1:39 UTC (permalink / raw)
To: Balbir Singh
Cc: KAMEZAWA Hiroyuki, Andrew Morton, KOSAKI Motohiro, Rik van Riel,
Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
Hugh Dickins, Christoph Lameter, Mel Gorman, LKML, linux-mm,
nishimura, lizf, menage
On Thu, Aug 20, 2009 at 01:16:56PM +0800, Balbir Singh wrote:
> * Wu Fengguang <fengguang.wu@intel.com> [2009-08-20 12:05:33]:
>
> > On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 20 Aug 2009 10:49:29 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > >
> > > > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > > > in which case shrink_list() _still_ calls isolate_pages() with the much
> > > > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> > > > scan rate by up to 32 times.
> > > >
> > > > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > > > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > > > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> > > >
> > > > The accesses to nr_saved_scan are not lock protected and so not 100%
> > > > accurate, however we can tolerate small errors and the resulted small
> > > > imbalanced scan rates between zones.
> > > >
> > > > This batching won't blur up the cgroup limits, since it is driven by
> > > > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > > > decides to cancel (and save) one smallish scan, it may well be called
> > > > again to accumulate up nr_saved_scan.
> > > >
> > > > It could possibly be a problem for some tiny mem_cgroup (which may be
> > > > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> > > >
> > > > CC: Rik van Riel <riel@redhat.com>
> > > > CC: Minchan Kim <minchan.kim@gmail.com>
> > > > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > ---
> > >
> > > Hmm, how about this ?
> > > ==
> > > Now, nr_saved_scan is tied to zone's LRU.
> > > But, considering how vmscan works, it should be tied to reclaim_stat.
> > >
> > > By this, memcg can make use of nr_saved_scan information seamlessly.
> >
> > Good idea, full patch updated with your signed-off-by :)
> >
> > Thanks,
> > Fengguang
> > ---
> > mm: do batched scans for mem_cgroup
> >
> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > in which case shrink_list() _still_ calls isolate_pages() with the much
> > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> > scan rate by up to 32 times.
> >
> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> >
> > The accesses to nr_saved_scan are not lock protected and so not 100%
> > accurate, however we can tolerate small errors and the resulted small
> > imbalanced scan rates between zones.
> >
> > This batching won't blur up the cgroup limits, since it is driven by
> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > decides to cancel (and save) one smallish scan, it may well be called
> > again to accumulate up nr_saved_scan.
> >
> > It could possibly be a problem for some tiny mem_cgroup (which may be
> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> >
>
> Looks good to me, how did you test it?
I observed the shrink_inactive_list() calls with this patch:
@@ -1043,6 +1043,13 @@ static unsigned long shrink_inactive_lis
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
int lumpy_reclaim = 0;
+ if (!scanning_global_lru(sc))
+ printk("shrink inactive %s count=%lu scan=%lu\n",
+ file ? "file" : "anon",
+ mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone,
+ LRU_INACTIVE_ANON + 2 * !!file),
+ max_scan);
and these commands:
mkdir /cgroup/0
echo 100M > /cgroup/0/memory.limit_in_bytes
echo $$ > /cgroup/0/tasks
cp /tmp/10G /dev/null
before patch:
[ 3682.646008] shrink inactive file count=25535 scan=6
[ 3682.661548] shrink inactive file count=25535 scan=6
[ 3682.666933] shrink inactive file count=25535 scan=6
[ 3682.682865] shrink inactive file count=25535 scan=6
[ 3682.688572] shrink inactive file count=25535 scan=6
[ 3682.703908] shrink inactive file count=25535 scan=6
[ 3682.709431] shrink inactive file count=25535 scan=6
after patch:
[ 223.146544] shrink inactive file count=25531 scan=32
[ 223.152060] shrink inactive file count=25507 scan=10
[ 223.167503] shrink inactive file count=25531 scan=32
[ 223.173426] shrink inactive file count=25507 scan=10
[ 223.188764] shrink inactive file count=25531 scan=32
[ 223.194270] shrink inactive file count=25507 scan=10
[ 223.209885] shrink inactive file count=25531 scan=32
[ 223.215388] shrink inactive file count=25507 scan=10
Before patch, the inactive list is over scanned by 30/6=5 times;
After patch, it is over scanned by 64/42=1.5 times. It's much better,
and can be further improved if necessary.
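The remaining 1.5x factor comes from shrink_inactive_list() itself, which
isolates pages in sc->swap_cluster_max sized chunks no matter how small max_scan
is. A simplified sketch of that inner loop, again from memory of the vmscan.c of
this period (argument details may not be exact):

        unsigned long nr_scanned = 0;

        while (nr_scanned < max_scan) {
                unsigned long nr_scan;
                unsigned long nr_taken;

                /* isolation granularity is a full cluster, not max_scan */
                nr_taken = sc->isolate_pages(sc->swap_cluster_max,
                                             &page_list, &nr_scan, sc->order,
                                             mode, zone, sc->mem_cgroup,
                                             0, file);
                nr_scanned += nr_scan;

                /* ... shrink_page_list() on the isolated pages ... */
        }

That is why a request for max_scan=6 (before the patch) or for the max_scan=10
batch remainder (after the patch) still ends up scanning a whole 32-page cluster.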
> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Thanks!
* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
2009-08-21 1:39 ` Wu Fengguang
@ 2009-08-21 1:46 ` Wu Fengguang
0 siblings, 0 replies; 19+ messages in thread
From: Wu Fengguang @ 2009-08-21 1:46 UTC (permalink / raw)
To: Balbir Singh
Cc: KAMEZAWA Hiroyuki, Andrew Morton, KOSAKI Motohiro, Rik van Riel,
Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G,
Hugh Dickins, Christoph Lameter, Mel Gorman, LKML, linux-mm,
nishimura, lizf, menage
On Fri, Aug 21, 2009 at 09:39:26AM +0800, Wu Fengguang wrote:
> On Thu, Aug 20, 2009 at 01:16:56PM +0800, Balbir Singh wrote:
> > * Wu Fengguang <fengguang.wu@intel.com> [2009-08-20 12:05:33]:
> >
> > > On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
> > > > On Thu, 20 Aug 2009 10:49:29 +0800
> > > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > >
> > > > > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > > > > in which case shrink_list() _still_ calls isolate_pages() with the much
> > > > > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> > > > > scan rate by up to 32 times.
> > > > >
> > > > > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > > > > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > > > > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> > > > >
> > > > > The accesses to nr_saved_scan are not lock protected and so not 100%
> > > > > accurate, however we can tolerate small errors and the resulted small
> > > > > imbalanced scan rates between zones.
> > > > >
> > > > > This batching won't blur up the cgroup limits, since it is driven by
> > > > > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > > > > decides to cancel (and save) one smallish scan, it may well be called
> > > > > again to accumulate up nr_saved_scan.
> > > > >
> > > > > It could possibly be a problem for some tiny mem_cgroup (which may be
> > > > > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> > > > >
> > > > > CC: Rik van Riel <riel@redhat.com>
> > > > > CC: Minchan Kim <minchan.kim@gmail.com>
> > > > > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > > > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > > ---
> > > >
> > > > Hmm, how about this ?
> > > > ==
> > > > Now, nr_saved_scan is tied to zone's LRU.
> > > > But, considering how vmscan works, it should be tied to reclaim_stat.
> > > >
> > > > By this, memcg can make use of nr_saved_scan information seamlessly.
> > >
> > > Good idea, full patch updated with your signed-off-by :)
> > >
> > > Thanks,
> > > Fengguang
> > > ---
> > > mm: do batched scans for mem_cgroup
> > >
> > > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > > in which case shrink_list() _still_ calls isolate_pages() with the much
> > > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> > > scan rate by up to 32 times.
> > >
> > > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> > >
> > > The accesses to nr_saved_scan are not lock protected and so not 100%
> > > accurate, however we can tolerate small errors and the resulted small
> > > imbalanced scan rates between zones.
> > >
> > > This batching won't blur up the cgroup limits, since it is driven by
> > > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > > decides to cancel (and save) one smallish scan, it may well be called
> > > again to accumulate up nr_saved_scan.
> > >
> > > It could possibly be a problem for some tiny mem_cgroup (which may be
> > > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> > >
> >
> > Looks good to me, how did you test it?
>
> I observed the shrink_inactive_list() calls with this patch:
>
> @@ -1043,6 +1043,13 @@ static unsigned long shrink_inactive_lis
> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> int lumpy_reclaim = 0;
>
> + if (!scanning_global_lru(sc))
> + printk("shrink inactive %s count=%lu scan=%lu\n",
> + file ? "file" : "anon",
> + mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone,
> + LRU_INACTIVE_ANON + 2 * !!file),
> + max_scan);
>
> and these commands:
>
> mkdir /cgroup/0
> echo 100M > /cgroup/0/memory.limit_in_bytes
> echo $$ > /cgroup/0/tasks
> cp /tmp/10G /dev/null
And I can reduce the limit to 1M and 500K without triggering OOM:
[ 963.329746] shrink inactive file count=201 scan=32
[ 963.335076] shrink inactive file count=177 scan=15
[ 963.350719] shrink inactive file count=201 scan=32
[ 963.356020] shrink inactive file count=177 scan=15
[ 963.371914] shrink inactive file count=201 scan=32
[ 963.377225] shrink inactive file count=177 scan=15
[ 963.393022] shrink inactive file count=201 scan=32
[ 963.398362] shrink inactive file count=177 scan=15
[ 1103.951251] shrink inactive file count=70 scan=32
[ 1104.054242] shrink inactive file count=46 scan=32
[ 1104.077381] shrink inactive file count=70 scan=32
[ 1104.083095] shrink inactive file count=73 scan=32
[ 1104.088513] shrink inactive file count=45 scan=2
[ 1104.113545] shrink inactive file count=70 scan=32
[ 1104.118915] shrink inactive file count=73 scan=32
[ 1104.124612] shrink inactive file count=45 scan=2
[ 1104.130093] shrink inactive file count=69 scan=32
So the patch is pretty safe for tiny mem cgroups.
Thanks,
Fengguang
> before patch:
>
> [ 3682.646008] shrink inactive file count=25535 scan=6
> [ 3682.661548] shrink inactive file count=25535 scan=6
> [ 3682.666933] shrink inactive file count=25535 scan=6
> [ 3682.682865] shrink inactive file count=25535 scan=6
> [ 3682.688572] shrink inactive file count=25535 scan=6
> [ 3682.703908] shrink inactive file count=25535 scan=6
> [ 3682.709431] shrink inactive file count=25535 scan=6
>
> after patch:
>
> [ 223.146544] shrink inactive file count=25531 scan=32
> [ 223.152060] shrink inactive file count=25507 scan=10
> [ 223.167503] shrink inactive file count=25531 scan=32
> [ 223.173426] shrink inactive file count=25507 scan=10
> [ 223.188764] shrink inactive file count=25531 scan=32
> [ 223.194270] shrink inactive file count=25507 scan=10
> [ 223.209885] shrink inactive file count=25531 scan=32
> [ 223.215388] shrink inactive file count=25507 scan=10
>
> Before patch, the inactive list is over scanned by 30/6=5 times;
> After patch, it is over scanned by 64/42=1.5 times. It's much better,
> and can be further improved if necessary.
>
> > Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> Thanks!
* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
2009-08-20 4:05 ` [PATCH -v2] " Wu Fengguang
2009-08-20 4:06 ` KAMEZAWA Hiroyuki
2009-08-20 5:16 ` Balbir Singh
@ 2009-08-20 11:01 ` Minchan Kim
2009-08-20 11:49 ` Wu Fengguang
2009-08-21 3:55 ` Minchan Kim
3 siblings, 1 reply; 19+ messages in thread
From: Minchan Kim @ 2009-08-20 11:01 UTC (permalink / raw)
To: Wu Fengguang
Cc: KAMEZAWA Hiroyuki, Andrew Morton, Balbir Singh, KOSAKI Motohiro,
Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
Dike, Jeffrey G, Hugh Dickins, Christoph Lameter, Mel Gorman,
LKML, linux-mm, nishimura, lizf, menage
Hi, Wu.
On Thu, Aug 20, 2009 at 1:05 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
>> On Thu, 20 Aug 2009 10:49:29 +0800
>> Wu Fengguang <fengguang.wu@intel.com> wrote:
>>
>> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
>> > in which case shrink_list() _still_ calls isolate_pages() with the much
>> > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
>> > scan rate by up to 32 times.
>> >
>> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
>> > So when shrink_zone() expects to scan 4 pages in the active/inactive
>> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
>> >
>> > The accesses to nr_saved_scan are not lock protected and so not 100%
>> > accurate, however we can tolerate small errors and the resulted small
>> > imbalanced scan rates between zones.
>> >
>> > This batching won't blur up the cgroup limits, since it is driven by
>> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
>> > decides to cancel (and save) one smallish scan, it may well be called
>> > again to accumulate up nr_saved_scan.
>> >
>> > It could possibly be a problem for some tiny mem_cgroup (which may be
>> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
>> >
>> > CC: Rik van Riel <riel@redhat.com>
>> > CC: Minchan Kim <minchan.kim@gmail.com>
>> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
>> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
>> > ---
>>
>> Hmm, how about this ?
>> ==
>> Now, nr_saved_scan is tied to zone's LRU.
>> But, considering how vmscan works, it should be tied to reclaim_stat.
>>
>> By this, memcg can make use of nr_saved_scan information seamlessly.
>
> Good idea, full patch updated with your signed-off-by :)
>
> Thanks,
> Fengguang
> ---
> mm: do batched scans for mem_cgroup
>
> For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> in which case shrink_list() _still_ calls isolate_pages() with the much
> larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> scan rate by up to 32 times.
Yes. It can scan 32 times as many pages, but only on the inactive list, not the active list.
> For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> So when shrink_zone() expects to scan 4 pages in the active/inactive
> list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
The active list would be scanned with 4 pages, the inactive list with 32.
>
> The accesses to nr_saved_scan are not lock protected and so not 100%
> accurate, however we can tolerate small errors and the resulted small
> imbalanced scan rates between zones.
Yes.
> This batching won't blur up the cgroup limits, since it is driven by
> "pages reclaimed" rather than "pages scanned". When shrink_zone()
> decides to cancel (and save) one smallish scan, it may well be called
> again to accumulate up nr_saved_scan.
You mean the nr_scan_try_batch logic?
But that logic only works for global reclaim, doesn't it?
Or am I missing something?
Could you elaborate more? :)
> It could possibly be a problem for some tiny mem_cgroup (which may be
> _full_ scanned too much times in order to accumulate up nr_saved_scan).
>
> CC: Rik van Riel <riel@redhat.com>
> CC: Minchan Kim <minchan.kim@gmail.com>
> CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> include/linux/mmzone.h | 6 +++++-
> mm/page_alloc.c | 2 +-
> mm/vmscan.c | 20 +++++++++++---------
> 3 files changed, 17 insertions(+), 11 deletions(-)
>
> --- linux.orig/include/linux/mmzone.h 2009-07-30 10:45:15.000000000 +0800
> +++ linux/include/linux/mmzone.h 2009-08-20 11:51:08.000000000 +0800
> @@ -269,6 +269,11 @@ struct zone_reclaim_stat {
> */
> unsigned long recent_rotated[2];
> unsigned long recent_scanned[2];
> +
> + /*
> + * accumulated for batching
> + */
> + unsigned long nr_saved_scan[NR_LRU_LISTS];
> };
>
> struct zone {
> @@ -323,7 +328,6 @@ struct zone {
> spinlock_t lru_lock;
> struct zone_lru {
> struct list_head list;
> - unsigned long nr_saved_scan; /* accumulated for batching */
> } lru[NR_LRU_LISTS];
>
> struct zone_reclaim_stat reclaim_stat;
> --- linux.orig/mm/vmscan.c 2009-08-20 11:48:46.000000000 +0800
> +++ linux/mm/vmscan.c 2009-08-20 12:00:55.000000000 +0800
> @@ -1521,6 +1521,7 @@ static void shrink_zone(int priority, st
> enum lru_list l;
> unsigned long nr_reclaimed = sc->nr_reclaimed;
> unsigned long swap_cluster_max = sc->swap_cluster_max;
> + struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> int noswap = 0;
>
> /* If we have no swap space, do not bother scanning anon pages. */
> @@ -1540,12 +1541,9 @@ static void shrink_zone(int priority, st
> scan >>= priority;
> scan = (scan * percent[file]) / 100;
> }
> - if (scanning_global_lru(sc))
> - nr[l] = nr_scan_try_batch(scan,
> - &zone->lru[l].nr_saved_scan,
> - swap_cluster_max);
> - else
> - nr[l] = scan;
> + nr[l] = nr_scan_try_batch(scan,
> + &reclaim_stat->nr_saved_scan[l],
> + swap_cluster_max);
> }
>
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> @@ -2128,6 +2126,7 @@ static void shrink_all_zones(unsigned lo
> {
> struct zone *zone;
> unsigned long nr_reclaimed = 0;
> + struct zone_reclaim_stat *reclaim_stat;
>
> for_each_populated_zone(zone) {
> enum lru_list l;
> @@ -2144,11 +2143,14 @@ static void shrink_all_zones(unsigned lo
> l == LRU_ACTIVE_FILE))
> continue;
>
> - zone->lru[l].nr_saved_scan += (lru_pages >> prio) + 1;
> - if (zone->lru[l].nr_saved_scan >= nr_pages || pass > 3) {
> + reclaim_stat = get_reclaim_stat(zone, sc);
> + reclaim_stat->nr_saved_scan[l] +=
> + (lru_pages >> prio) + 1;
> + if (reclaim_stat->nr_saved_scan[l]
> + >= nr_pages || pass > 3) {
> unsigned long nr_to_scan;
>
> - zone->lru[l].nr_saved_scan = 0;
> + reclaim_stat->nr_saved_scan[l] = 0;
> nr_to_scan = min(nr_pages, lru_pages);
> nr_reclaimed += shrink_list(l, nr_to_scan, zone,
> sc, prio);
> --- linux.orig/mm/page_alloc.c 2009-08-20 11:57:54.000000000 +0800
> +++ linux/mm/page_alloc.c 2009-08-20 11:58:39.000000000 +0800
> @@ -3716,7 +3716,7 @@ static void __paginginit free_area_init_
> zone_pcp_init(zone);
> for_each_lru(l) {
> INIT_LIST_HEAD(&zone->lru[l].list);
> - zone->lru[l].nr_saved_scan = 0;
> + zone->reclaim_stat.nr_saved_scan[l] = 0;
> }
> zone->reclaim_stat.recent_rotated[0] = 0;
> zone->reclaim_stat.recent_rotated[1] = 0;
>
--
Kind regards,
Minchan Kim
* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
2009-08-20 11:01 ` Minchan Kim
@ 2009-08-20 11:49 ` Wu Fengguang
2009-08-20 12:13 ` Minchan Kim
0 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2009-08-20 11:49 UTC (permalink / raw)
To: Minchan Kim
Cc: KAMEZAWA Hiroyuki, Andrew Morton, Balbir Singh, KOSAKI Motohiro,
Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
Dike, Jeffrey G, Hugh Dickins, Christoph Lameter, Mel Gorman,
LKML, linux-mm, nishimura, lizf, menage
On Thu, Aug 20, 2009 at 07:01:21PM +0800, Minchan Kim wrote:
> Hi, Wu.
>
> On Thu, Aug 20, 2009 at 1:05 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
> >> On Thu, 20 Aug 2009 10:49:29 +0800
> >> Wu Fengguang <fengguang.wu@intel.com> wrote:
> >>
> >> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> >> > in which case shrink_list() _still_ calls isolate_pages() with the much
> >> > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> >> > scan rate by up to 32 times.
> >> >
> >> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> >> > So when shrink_zone() expects to scan 4 pages in the active/inactive
> >> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> >> >
> >> > The accesses to nr_saved_scan are not lock protected and so not 100%
> >> > accurate, however we can tolerate small errors and the resulted small
> >> > imbalanced scan rates between zones.
> >> >
> >> > This batching won't blur up the cgroup limits, since it is driven by
> >> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> >> > decides to cancel (and save) one smallish scan, it may well be called
> >> > again to accumulate up nr_saved_scan.
> >> >
> >> > It could possibly be a problem for some tiny mem_cgroup (which may be
> >> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> >> >
> >> > CC: Rik van Riel <riel@redhat.com>
> >> > CC: Minchan Kim <minchan.kim@gmail.com>
> >> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> >> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >> > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> >> > ---
> >>
> >> Hmm, how about this ?
> >> ==
> >> Now, nr_saved_scan is tied to zone's LRU.
> >> But, considering how vmscan works, it should be tied to reclaim_stat.
> >>
> >> By this, memcg can make use of nr_saved_scan information seamlessly.
> >
> > Good idea, full patch updated with your signed-off-by :)
> >
> > Thanks,
> > Fengguang
> > ---
> > mm: do batched scans for mem_cgroup
> >
> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > in which case shrink_list() _still_ calls isolate_pages() with the much
> > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> > scan rate by up to 32 times.
>
> Yes. It can scan 32 times pages in only inactive list, not active list.
Yes and no ;)
inactive anon list over scanned => inactive_anon_is_low() == TRUE
=> shrink_active_list()
=> active anon list over scanned
So the end result may be
- anon inactive => over scanned
- anon active => over scanned (maybe not as much)
- file inactive => over scanned
- file active => under scanned (relatively)
> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
>
> Active list scan would be scanned in 4, inactive list is 32.
Exactly.
> >
> > The accesses to nr_saved_scan are not lock protected and so not 100%
> > accurate, however we can tolerate small errors and the resulted small
> > imbalanced scan rates between zones.
>
> Yes.
>
> > This batching won't blur up the cgroup limits, since it is driven by
> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > decides to cancel (and save) one smallish scan, it may well be called
> > again to accumulate up nr_saved_scan.
>
> You mean nr_scan_try_batch logic ?
> But that logic works for just global reclaim?
> Now am I missing something?
>
> Could you elaborate more? :)
Sorry for the confusion. The above paragraph originates from Balbir's
concern:
This might be a concern (although not a big ATM), since we can't
afford to miss limits by much. If a cgroup is near its limit and we
drop scanning it. We'll have to work out what this means for the end
user. May be more fundamental look through is required at the priority
based logic of exposing how much to scan, I don't know.
Thanks,
Fengguang
> > It could possibly be a problem for some tiny mem_cgroup (which may be
> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> >
> > CC: Rik van Riel <riel@redhat.com>
> > CC: Minchan Kim <minchan.kim@gmail.com>
> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> > include/linux/mmzone.h | 6 +++++-
> > mm/page_alloc.c | 2 +-
> > mm/vmscan.c | 20 +++++++++++---------
> > 3 files changed, 17 insertions(+), 11 deletions(-)
> >
> > --- linux.orig/include/linux/mmzone.h 2009-07-30 10:45:15.000000000 +0800
> > +++ linux/include/linux/mmzone.h 2009-08-20 11:51:08.000000000 +0800
> > @@ -269,6 +269,11 @@ struct zone_reclaim_stat {
> > */
> > unsigned long recent_rotated[2];
> > unsigned long recent_scanned[2];
> > +
> > + /*
> > + * accumulated for batching
> > + */
> > + unsigned long nr_saved_scan[NR_LRU_LISTS];
> > };
> >
> > struct zone {
> > @@ -323,7 +328,6 @@ struct zone {
> > spinlock_t lru_lock;
> > struct zone_lru {
> > struct list_head list;
> > - unsigned long nr_saved_scan; /* accumulated for batching */
> > } lru[NR_LRU_LISTS];
> >
> > struct zone_reclaim_stat reclaim_stat;
> > --- linux.orig/mm/vmscan.c 2009-08-20 11:48:46.000000000 +0800
> > +++ linux/mm/vmscan.c 2009-08-20 12:00:55.000000000 +0800
> > @@ -1521,6 +1521,7 @@ static void shrink_zone(int priority, st
> > enum lru_list l;
> > unsigned long nr_reclaimed = sc->nr_reclaimed;
> > unsigned long swap_cluster_max = sc->swap_cluster_max;
> > + struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> > int noswap = 0;
> >
> > /* If we have no swap space, do not bother scanning anon pages. */
> > @@ -1540,12 +1541,9 @@ static void shrink_zone(int priority, st
> > scan >>= priority;
> > scan = (scan * percent[file]) / 100;
> > }
> > - if (scanning_global_lru(sc))
> > - nr[l] = nr_scan_try_batch(scan,
> > - &zone->lru[l].nr_saved_scan,
> > - swap_cluster_max);
> > - else
> > - nr[l] = scan;
> > + nr[l] = nr_scan_try_batch(scan,
> > + &reclaim_stat->nr_saved_scan[l],
> > + swap_cluster_max);
> > }
> >
> > while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> > @@ -2128,6 +2126,7 @@ static void shrink_all_zones(unsigned lo
> > {
> > struct zone *zone;
> > unsigned long nr_reclaimed = 0;
> > + struct zone_reclaim_stat *reclaim_stat;
> >
> > for_each_populated_zone(zone) {
> > enum lru_list l;
> > @@ -2144,11 +2143,14 @@ static void shrink_all_zones(unsigned lo
> > l == LRU_ACTIVE_FILE))
> > continue;
> >
> > - zone->lru[l].nr_saved_scan += (lru_pages >> prio) + 1;
> > - if (zone->lru[l].nr_saved_scan >= nr_pages || pass > 3) {
> > + reclaim_stat = get_reclaim_stat(zone, sc);
> > + reclaim_stat->nr_saved_scan[l] +=
> > + (lru_pages >> prio) + 1;
> > + if (reclaim_stat->nr_saved_scan[l]
> > + >= nr_pages || pass > 3) {
> > unsigned long nr_to_scan;
> >
> > - zone->lru[l].nr_saved_scan = 0;
> > + reclaim_stat->nr_saved_scan[l] = 0;
> > nr_to_scan = min(nr_pages, lru_pages);
> > nr_reclaimed += shrink_list(l, nr_to_scan, zone,
> > sc, prio);
> > --- linux.orig/mm/page_alloc.c 2009-08-20 11:57:54.000000000 +0800
> > +++ linux/mm/page_alloc.c 2009-08-20 11:58:39.000000000 +0800
> > @@ -3716,7 +3716,7 @@ static void __paginginit free_area_init_
> > zone_pcp_init(zone);
> > for_each_lru(l) {
> > INIT_LIST_HEAD(&zone->lru[l].list);
> > - zone->lru[l].nr_saved_scan = 0;
> > + zone->reclaim_stat.nr_saved_scan[l] = 0;
> > }
> > zone->reclaim_stat.recent_rotated[0] = 0;
> > zone->reclaim_stat.recent_rotated[1] = 0;
> >
>
>
>
> --
> Kind regards,
> Minchan Kim
* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
2009-08-20 11:49 ` Wu Fengguang
@ 2009-08-20 12:13 ` Minchan Kim
2009-08-20 12:32 ` Wu Fengguang
0 siblings, 1 reply; 19+ messages in thread
From: Minchan Kim @ 2009-08-20 12:13 UTC (permalink / raw)
To: Wu Fengguang
Cc: KAMEZAWA Hiroyuki, Andrew Morton, Balbir Singh, KOSAKI Motohiro,
Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
Dike, Jeffrey G, Hugh Dickins, Christoph Lameter, Mel Gorman,
LKML, linux-mm, nishimura, lizf, menage
On Thu, Aug 20, 2009 at 8:49 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Thu, Aug 20, 2009 at 07:01:21PM +0800, Minchan Kim wrote:
>> Hi, Wu.
>>
>> On Thu, Aug 20, 2009 at 1:05 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> > On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
>> >> On Thu, 20 Aug 2009 10:49:29 +0800
>> >> Wu Fengguang <fengguang.wu@intel.com> wrote:
>> >>
>> >> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
>> >> > in which case shrink_list() _still_ calls isolate_pages() with the much
>> >> > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
>> >> > scan rate by up to 32 times.
>> >> >
>> >> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
>> >> > So when shrink_zone() expects to scan 4 pages in the active/inactive
>> >> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
>> >> >
>> >> > The accesses to nr_saved_scan are not lock protected and so not 100%
>> >> > accurate, however we can tolerate small errors and the resulted small
>> >> > imbalanced scan rates between zones.
>> >> >
>> >> > This batching won't blur up the cgroup limits, since it is driven by
>> >> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
>> >> > decides to cancel (and save) one smallish scan, it may well be called
>> >> > again to accumulate up nr_saved_scan.
>> >> >
>> >> > It could possibly be a problem for some tiny mem_cgroup (which may be
>> >> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
>> >> >
>> >> > CC: Rik van Riel <riel@redhat.com>
>> >> > CC: Minchan Kim <minchan.kim@gmail.com>
>> >> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
>> >> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> >> > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> >> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
>> >> > ---
>> >>
>> >> Hmm, how about this ?
>> >> ==
>> >> Now, nr_saved_scan is tied to zone's LRU.
>> >> But, considering how vmscan works, it should be tied to reclaim_stat.
>> >>
>> >> By this, memcg can make use of nr_saved_scan information seamlessly.
>> >
>> > Good idea, full patch updated with your signed-off-by :)
>> >
>> > Thanks,
>> > Fengguang
>> > ---
>> > mm: do batched scans for mem_cgroup
>> >
>> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
>> > in which case shrink_list() _still_ calls isolate_pages() with the much
>> > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
>> > scan rate by up to 32 times.
>>
>> Yes. It can scan 32 times pages in only inactive list, not active list.
>
> Yes and no ;)
>
> inactive anon list over scanned => inactive_anon_is_low() == TRUE
> => shrink_active_list()
> => active anon list over scanned
Why is the inactive anon list overscanned in the mem_cgroup case?
In shrink_zone,
1) The VM doesn't accumulate nr[l].
2) The routine below stores the min value in nr_to_scan.
nr_to_scan = min(nr[l], swap_cluster_max);
e.g. if nr[l] = 4, the VM calls shrink_active_list with 4 as nr_to_scan.
So I think overscan doesn't occur on the active list.
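For context, the dispatch loop in question sits in shrink_zone() and looks
roughly like this (a sketch from memory of the code of this period, not the
exact loop):

        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
               nr[LRU_INACTIVE_FILE]) {
                for_each_evictable_lru(l) {
                        if (nr[l]) {
                                /* each list gets at most one cluster per pass */
                                nr_to_scan = min(nr[l], swap_cluster_max);
                                nr[l] -= nr_to_scan;

                                nr_reclaimed += shrink_list(l, nr_to_scan,
                                                            zone, sc, priority);
                        }
                }
                /* bails out early once enough pages have been reclaimed */
        }

The min() caps any single shrink_list() call at swap_cluster_max.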
> So the end result may be
>
> - anon inactive => over scanned
> - anon active => over scanned (maybe not as much)
> - file inactive => over scanned
> - file active => under scanned (relatively)
>
>> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
>> > So when shrink_zone() expects to scan 4 pages in the active/inactive
>> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
>>
>> Active list scan would be scanned in 4, inactive list is 32.
>
> Exactly.
>
>> >
>> > The accesses to nr_saved_scan are not lock protected and so not 100%
>> > accurate, however we can tolerate small errors and the resulted small
>> > imbalanced scan rates between zones.
>>
>> Yes.
>>
>> > This batching won't blur up the cgroup limits, since it is driven by
>> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
>> > decides to cancel (and save) one smallish scan, it may well be called
>> > again to accumulate up nr_saved_scan.
>>
>> You mean nr_scan_try_batch logic ?
>> But that logic works for just global reclaim?
>> Now am I missing something?
>>
>> Could you elaborate more? :)
>
> Sorry for the confusion. The above paragraph originates from Balbir's
> concern:
>
> This might be a concern (although not a big ATM), since we can't
> afford to miss limits by much. If a cgroup is near its limit and we
> drop scanning it. We'll have to work out what this means for the end
Why does mem_cgroup drop scanning?
Is it because of nr_scan_try_batch, or something else?
Sorry, I still can't understand your point. :(
> user. May be more fundamental look through is required at the priority
> based logic of exposing how much to scan, I don't know.
>
> Thanks,
> Fengguang
>
>> > It could possibly be a problem for some tiny mem_cgroup (which may be
>> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
>> >
>> > CC: Rik van Riel <riel@redhat.com>
>> > CC: Minchan Kim <minchan.kim@gmail.com>
>> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
>> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
>> > ---
>> > include/linux/mmzone.h | 6 +++++-
>> > mm/page_alloc.c | 2 +-
>> > mm/vmscan.c | 20 +++++++++++---------
>> > 3 files changed, 17 insertions(+), 11 deletions(-)
>> >
>> > --- linux.orig/include/linux/mmzone.h 2009-07-30 10:45:15.000000000 +0800
>> > +++ linux/include/linux/mmzone.h 2009-08-20 11:51:08.000000000 +0800
>> > @@ -269,6 +269,11 @@ struct zone_reclaim_stat {
>> > */
>> > unsigned long recent_rotated[2];
>> > unsigned long recent_scanned[2];
>> > +
>> > + /*
>> > + * accumulated for batching
>> > + */
>> > + unsigned long nr_saved_scan[NR_LRU_LISTS];
>> > };
>> >
>> > struct zone {
>> > @@ -323,7 +328,6 @@ struct zone {
>> > spinlock_t lru_lock;
>> > struct zone_lru {
>> > struct list_head list;
>> > - unsigned long nr_saved_scan; /* accumulated for batching */
>> > } lru[NR_LRU_LISTS];
>> >
>> > struct zone_reclaim_stat reclaim_stat;
>> > --- linux.orig/mm/vmscan.c 2009-08-20 11:48:46.000000000 +0800
>> > +++ linux/mm/vmscan.c 2009-08-20 12:00:55.000000000 +0800
>> > @@ -1521,6 +1521,7 @@ static void shrink_zone(int priority, st
>> > enum lru_list l;
>> > unsigned long nr_reclaimed = sc->nr_reclaimed;
>> > unsigned long swap_cluster_max = sc->swap_cluster_max;
>> > + struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
>> > int noswap = 0;
>> >
>> > /* If we have no swap space, do not bother scanning anon pages. */
>> > @@ -1540,12 +1541,9 @@ static void shrink_zone(int priority, st
>> > scan >>= priority;
>> > scan = (scan * percent[file]) / 100;
>> > }
>> > - if (scanning_global_lru(sc))
>> > - nr[l] = nr_scan_try_batch(scan,
>> > - &zone->lru[l].nr_saved_scan,
>> > - swap_cluster_max);
>> > - else
>> > - nr[l] = scan;
>> > + nr[l] = nr_scan_try_batch(scan,
>> > + &reclaim_stat->nr_saved_scan[l],
>> > + swap_cluster_max);
>> > }
>> >
>> > while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
>> > @@ -2128,6 +2126,7 @@ static void shrink_all_zones(unsigned lo
>> > {
>> > struct zone *zone;
>> > unsigned long nr_reclaimed = 0;
>> > + struct zone_reclaim_stat *reclaim_stat;
>> >
>> > for_each_populated_zone(zone) {
>> > enum lru_list l;
>> > @@ -2144,11 +2143,14 @@ static void shrink_all_zones(unsigned lo
>> > l == LRU_ACTIVE_FILE))
>> > continue;
>> >
>> > - zone->lru[l].nr_saved_scan += (lru_pages >> prio) + 1;
>> > - if (zone->lru[l].nr_saved_scan >= nr_pages || pass > 3) {
>> > + reclaim_stat = get_reclaim_stat(zone, sc);
>> > + reclaim_stat->nr_saved_scan[l] +=
>> > + (lru_pages >> prio) + 1;
>> > + if (reclaim_stat->nr_saved_scan[l]
>> > + >= nr_pages || pass > 3) {
>> > unsigned long nr_to_scan;
>> >
>> > - zone->lru[l].nr_saved_scan = 0;
>> > + reclaim_stat->nr_saved_scan[l] = 0;
>> > nr_to_scan = min(nr_pages, lru_pages);
>> > nr_reclaimed += shrink_list(l, nr_to_scan, zone,
>> > sc, prio);
>> > --- linux.orig/mm/page_alloc.c 2009-08-20 11:57:54.000000000 +0800
>> > +++ linux/mm/page_alloc.c 2009-08-20 11:58:39.000000000 +0800
>> > @@ -3716,7 +3716,7 @@ static void __paginginit free_area_init_
>> > zone_pcp_init(zone);
>> > for_each_lru(l) {
>> > INIT_LIST_HEAD(&zone->lru[l].list);
>> > - zone->lru[l].nr_saved_scan = 0;
>> > + zone->reclaim_stat.nr_saved_scan[l] = 0;
>> > }
>> > zone->reclaim_stat.recent_rotated[0] = 0;
>> > zone->reclaim_stat.recent_rotated[1] = 0;
>> >
>>
>>
>>
>> --
>> Kind regards,
>> Minchan Kim
>
--
Kind regards,
Minchan Kim
* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
2009-08-20 12:13 ` Minchan Kim
@ 2009-08-20 12:32 ` Wu Fengguang
0 siblings, 0 replies; 19+ messages in thread
From: Wu Fengguang @ 2009-08-20 12:32 UTC (permalink / raw)
To: Minchan Kim
Cc: KAMEZAWA Hiroyuki, Andrew Morton, Balbir Singh, KOSAKI Motohiro,
Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
Dike, Jeffrey G, Hugh Dickins, Christoph Lameter, Mel Gorman,
LKML, linux-mm, nishimura, lizf, menage
On Thu, Aug 20, 2009 at 08:13:59PM +0800, Minchan Kim wrote:
> On Thu, Aug 20, 2009 at 8:49 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Thu, Aug 20, 2009 at 07:01:21PM +0800, Minchan Kim wrote:
> >> Hi, Wu.
> >>
> >> On Thu, Aug 20, 2009 at 1:05 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> > On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
> >> >> On Thu, 20 Aug 2009 10:49:29 +0800
> >> >> Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> >>
> >> >> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> >> >> > in which case shrink_list() _still_ calls isolate_pages() with the much
> >> >> > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> >> >> > scan rate by up to 32 times.
> >> >> >
> >> >> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> >> >> > So when shrink_zone() expects to scan 4 pages in the active/inactive
> >> >> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> >> >> >
> >> >> > The accesses to nr_saved_scan are not lock protected and so not 100%
> >> >> > accurate, however we can tolerate small errors and the resulted small
> >> >> > imbalanced scan rates between zones.
> >> >> >
> >> >> > This batching won't blur up the cgroup limits, since it is driven by
> >> >> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> >> >> > decides to cancel (and save) one smallish scan, it may well be called
> >> >> > again to accumulate up nr_saved_scan.
> >> >> >
> >> >> > It could possibly be a problem for some tiny mem_cgroup (which may be
> >> >> > _full_ scanned too much times in order to accumulate up nr_saved_scan).
> >> >> >
> >> >> > CC: Rik van Riel <riel@redhat.com>
> >> >> > CC: Minchan Kim <minchan.kim@gmail.com>
> >> >> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> >> >> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >> >> > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> >> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> >> >> > ---
> >> >>
> >> >> Hmm, how about this ?
> >> >> ==
> >> >> Now, nr_saved_scan is tied to zone's LRU.
> >> >> But, considering how vmscan works, it should be tied to reclaim_stat.
> >> >>
> >> >> By this, memcg can make use of nr_saved_scan information seamlessly.
> >> >
> >> > Good idea, full patch updated with your signed-off-by :)
> >> >
> >> > Thanks,
> >> > Fengguang
> >> > ---
> >> > mm: do batched scans for mem_cgroup
> >> >
> >> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> >> > in which case shrink_list() _still_ calls isolate_pages() with the much
> >> > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> >> > scan rate by up to 32 times.
> >>
> >> Yes. It can scan 32 times pages in only inactive list, not active list.
> >
> > Yes and no ;)
> >
> > inactive anon list over scanned => inactive_anon_is_low() == TRUE
> > => shrink_active_list()
> > => active anon list over scanned
>
> Why inactive anon list is overscanned in case mem_cgroup ?
>
> in shrink_zone,
> 1) The vm doesn't accumulate nr[l].
> 2) Below routine store min value to nr_to_scan.
> nr_to_scan = min(nr[l], swap_cluster_max);
> ex) if nr[l] = 4, vm calls shrink_active_list with 4 as nr_to_scan.
It's not over scanned here, but at the end of shrink_zone():
/*
* Even if we did not try to evict anon pages at all, we want to
* rebalance the anon lru active/inactive ratio.
*/
if (inactive_anon_is_low(zone, sc) && nr_swap_pages > 0)
shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
as well as balance_pgdat():
/*
* Do some background aging of the anon list, to give
* pages a chance to be referenced before reclaiming.
*/
if (inactive_anon_is_low(zone, &sc))
shrink_active_list(SWAP_CLUSTER_MAX, zone,
&sc, priority, 0);
So the anon lists are over scanned compared to the active file list.
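The check that triggers those extra shrink_active_list() calls is
inactive_anon_is_low(); its global variant is roughly the following (a sketch
from memory of the code of this period; the memcg case goes through an
equivalent per-cgroup check):

static int inactive_anon_is_low_global(struct zone *zone)
{
        unsigned long active, inactive;

        active = zone_page_state(zone, NR_ACTIVE_ANON);
        inactive = zone_page_state(zone, NR_INACTIVE_ANON);

        /* ask for more deactivation whenever the inactive anon list is small */
        if (inactive * zone->inactive_ratio < active)
                return 1;

        return 0;
}

Whenever it reports the inactive list as low, another SWAP_CLUSTER_MAX sized
shrink_active_list() pass is made on the anon side, independent of the nr[]
budget computed earlier in shrink_zone().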
> So I think overscan doesn't occur in active list.
>
> > So the end result may be
> >
> > - anon inactive => over scanned
> > - anon active => over scanned (maybe not as much)
> > - file inactive => over scanned
> > - file active => under scanned (relatively)
> >
> >> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> >> > So when shrink_zone() expects to scan 4 pages in the active/inactive
> >> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
> >>
> >> Active list scan would be scanned in 4, inactive list is 32.
> >
> > Exactly.
> >
> >> >
> >> > The accesses to nr_saved_scan are not lock protected and so not 100%
> >> > accurate; however, we can tolerate small errors and the resulting small
> >> > imbalance in scan rates between zones.
> >>
> >> Yes.
> >>
> >> > This batching won't blur up the cgroup limits, since it is driven by
> >> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> >> > decides to cancel (and save) one smallish scan, it may well be called
> >> > again to accumulate up nr_saved_scan.
> >>
> >> You mean the nr_scan_try_batch logic?
> >> But that logic works only for global reclaim, doesn't it?
> >> Or am I missing something?
> >>
> >> Could you elaborate more? :)
> >
> > Sorry for the confusion. The above paragraph originates from Balbir's
> > concern:
> >
> >         This might be a concern (although not a big ATM), since we can't
> >         afford to miss limits by much. If a cgroup is near its limit and we
> >         drop scanning it. We'll have to work out what this means for the end
>
> Why does mem_cgroup drop scanning?
Right, it has no reason to drop scanning, as long as it has not
reclaimed enough pages.
> Is it because of nr_scan_try_batch, or something else?
nr_scan_try_batch may only make this invocation of shrink_zone() drop scanning.
But balance_pgdat() etc. will re-call shrink_zone() to make progress.
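To illustrate the cancel-and-save behaviour, here is a small standalone
sketch built around nr_scan_try_batch(); the helper is written from memory
of the current mm/vmscan.c, so treat the exact code as an approximation.
A request of 4 pages is saved up on the first seven calls and handed out
as one 32-page batch on the eighth, so a cancelled small scan is deferred
rather than lost:

/* batch_demo.c -- standalone demo of the nr_scan_try_batch() idea */
#include <stdio.h>

static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
                                       unsigned long *nr_saved_scan,
                                       unsigned long swap_cluster_max)
{
        unsigned long nr;

        *nr_saved_scan += nr_to_scan;
        nr = *nr_saved_scan;

        if (nr >= swap_cluster_max)
                *nr_saved_scan = 0;     /* hand out the whole batch */
        else
                nr = 0;                 /* too small: defer the scan */

        return nr;
}

int main(void)
{
        unsigned long nr_saved_scan = 0;
        int i;

        /* eight shrink_zone() invocations, each wanting to scan 4 pages */
        for (i = 1; i <= 8; i++) {
                unsigned long nr = nr_scan_try_batch(4, &nr_saved_scan, 32);

                printf("call %d: scan %lu pages (saved %lu)\n",
                       i, nr, nr_saved_scan);
        }

        return 0;
}

Calls 1-7 print "scan 0 pages" while the saved count grows, and call 8
prints "scan 32 pages", which is the batching described above.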
> Sorry. Still, I can't understand your point. :(
There's _nothing_ wrong with you for not understanding it :)
Sorry, I was indeed explaining a non-issue. I'd better just remove that paragraph.
Thanks,
Fengguang
> >         user. May be more fundamental look through is required at the priority
> >         based logic of exposing how much to scan, I don't know.
> >
> > Thanks,
> > Fengguang
> >
> >> > It could possibly be a problem for some tiny mem_cgroup (which may be
> >> > _full_ scanned too many times in order to accumulate up nr_saved_scan).
> >> >
> >> > CC: Rik van Riel <riel@redhat.com>
> >> > CC: Minchan Kim <minchan.kim@gmail.com>
> >> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
> >> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> >> > ---
> >> >  include/linux/mmzone.h |    6 +++++-
> >> >  mm/page_alloc.c        |    2 +-
> >> >  mm/vmscan.c            |   20 +++++++++++---------
> >> >  3 files changed, 17 insertions(+), 11 deletions(-)
> >> >
> >> > --- linux.orig/include/linux/mmzone.h    2009-07-30 10:45:15.000000000 +0800
> >> > +++ linux/include/linux/mmzone.h         2009-08-20 11:51:08.000000000 +0800
> >> > @@ -269,6 +269,11 @@ struct zone_reclaim_stat {
> >> >          */
> >> >         unsigned long           recent_rotated[2];
> >> >         unsigned long           recent_scanned[2];
> >> > +
> >> > +       /*
> >> > +        * accumulated for batching
> >> > +        */
> >> > +       unsigned long           nr_saved_scan[NR_LRU_LISTS];
> >> >  };
> >> >
> >> >  struct zone {
> >> > @@ -323,7 +328,6 @@ struct zone {
> >> >         spinlock_t              lru_lock;
> >> >         struct zone_lru {
> >> >                 struct list_head list;
> >> > -               unsigned long nr_saved_scan;    /* accumulated for batching */
> >> >         } lru[NR_LRU_LISTS];
> >> >
> >> >         struct zone_reclaim_stat reclaim_stat;
> >> > --- linux.orig/mm/vmscan.c       2009-08-20 11:48:46.000000000 +0800
> >> > +++ linux/mm/vmscan.c            2009-08-20 12:00:55.000000000 +0800
> >> > @@ -1521,6 +1521,7 @@ static void shrink_zone(int priority, st
> >> >         enum lru_list l;
> >> >         unsigned long nr_reclaimed = sc->nr_reclaimed;
> >> >         unsigned long swap_cluster_max = sc->swap_cluster_max;
> >> > +       struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> >> >         int noswap = 0;
> >> >
> >> >         /* If we have no swap space, do not bother scanning anon pages. */
> >> > @@ -1540,12 +1541,9 @@ static void shrink_zone(int priority, st
> >> >                         scan >>= priority;
> >> >                         scan = (scan * percent[file]) / 100;
> >> >                 }
> >> > -               if (scanning_global_lru(sc))
> >> > -                       nr[l] = nr_scan_try_batch(scan,
> >> > -                                                 &zone->lru[l].nr_saved_scan,
> >> > -                                                 swap_cluster_max);
> >> > -               else
> >> > -                       nr[l] = scan;
> >> > +               nr[l] = nr_scan_try_batch(scan,
> >> > +                                         &reclaim_stat->nr_saved_scan[l],
> >> > +                                         swap_cluster_max);
> >> >         }
> >> >
> >> >         while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> >> > @@ -2128,6 +2126,7 @@ static void shrink_all_zones(unsigned lo
> >> >  {
> >> >         struct zone *zone;
> >> >         unsigned long nr_reclaimed = 0;
> >> > +       struct zone_reclaim_stat *reclaim_stat;
> >> >
> >> >         for_each_populated_zone(zone) {
> >> >                 enum lru_list l;
> >> > @@ -2144,11 +2143,14 @@ static void shrink_all_zones(unsigned lo
> >> >                                                 l == LRU_ACTIVE_FILE))
> >> >                                 continue;
> >> >
> >> > -                       zone->lru[l].nr_saved_scan += (lru_pages >> prio) + 1;
> >> > -                       if (zone->lru[l].nr_saved_scan >= nr_pages || pass > 3) {
> >> > +                       reclaim_stat = get_reclaim_stat(zone, sc);
> >> > +                       reclaim_stat->nr_saved_scan[l] +=
> >> > +                                               (lru_pages >> prio) + 1;
> >> > +                       if (reclaim_stat->nr_saved_scan[l]
> >> > +                                               >= nr_pages || pass > 3) {
> >> >                                 unsigned long nr_to_scan;
> >> >
> >> > -                               zone->lru[l].nr_saved_scan = 0;
> >> > +                               reclaim_stat->nr_saved_scan[l] = 0;
> >> >                                 nr_to_scan = min(nr_pages, lru_pages);
> >> >                                 nr_reclaimed += shrink_list(l, nr_to_scan, zone,
> >> >                                                             sc, prio);
> >> > --- linux.orig/mm/page_alloc.c   2009-08-20 11:57:54.000000000 +0800
> >> > +++ linux/mm/page_alloc.c        2009-08-20 11:58:39.000000000 +0800
> >> > @@ -3716,7 +3716,7 @@ static void __paginginit free_area_init_
> >> >                 zone_pcp_init(zone);
> >> >                 for_each_lru(l) {
> >> >                         INIT_LIST_HEAD(&zone->lru[l].list);
> >> > -                       zone->lru[l].nr_saved_scan = 0;
> >> > +                       zone->reclaim_stat.nr_saved_scan[l] = 0;
> >> >                 }
> >> >                 zone->reclaim_stat.recent_rotated[0] = 0;
> >> >                 zone->reclaim_stat.recent_rotated[1] = 0;
> >> >
> >>
> >>
> >>
> >> --
> >> Kind regards,
> >> Minchan Kim
> >
>
>
>
> --
> Kind regards,
> Minchan Kim
* Re: [PATCH -v2] mm: do batched scans for mem_cgroup
2009-08-20 4:05 ` [PATCH -v2] " Wu Fengguang
` (2 preceding siblings ...)
2009-08-20 11:01 ` Minchan Kim
@ 2009-08-21 3:55 ` Minchan Kim
2009-08-21 7:27 ` [PATCH -v2 changelog updated] " Wu Fengguang
3 siblings, 1 reply; 19+ messages in thread
From: Minchan Kim @ 2009-08-21 3:55 UTC (permalink / raw)
To: Wu Fengguang
Cc: KAMEZAWA Hiroyuki, Andrew Morton, Balbir Singh, KOSAKI Motohiro,
Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
Dike, Jeffrey G, Hugh Dickins, Christoph Lameter, Mel Gorman,
LKML, linux-mm, nishimura, lizf, menage
On Thu, Aug 20, 2009 at 1:05 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
>> On Thu, 20 Aug 2009 10:49:29 +0800
>> Wu Fengguang <fengguang.wu@intel.com> wrote:
>>
>> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
>> > in which case shrink_list() _still_ calls isolate_pages() with the much
>> > larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
>> > scan rate by up to 32 times.
>> >
>> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
>> > So when shrink_zone() expects to scan 4 pages in the active/inactive
>> > list, it will be scanned SWAP_CLUSTER_MAX=32 pages in effect.
>> >
>> > The accesses to nr_saved_scan are not lock protected and so not 100%
>> > accurate; however, we can tolerate small errors and the resulting small
>> > imbalance in scan rates between zones.
>> >
>> > This batching won't blur up the cgroup limits, since it is driven by
>> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
>> > decides to cancel (and save) one smallish scan, it may well be called
>> > again to accumulate up nr_saved_scan.
>> >
>> > It could possibly be a problem for some tiny mem_cgroup (which may be
>> > _full_ scanned too many times in order to accumulate up nr_saved_scan).
>> >
>> > CC: Rik van Riel <riel@redhat.com>
>> > CC: Minchan Kim <minchan.kim@gmail.com>
>> > CC: Balbir Singh <balbir@linux.vnet.ibm.com>
>> > CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
It looks better than now :)
I hope you will rewrite the description and add test results to the changelog. :)
Thanks for your great effort.
--
Kind regards,
Minchan Kim
* [PATCH -v2 changelog updated] mm: do batched scans for mem_cgroup
2009-08-21 3:55 ` Minchan Kim
@ 2009-08-21 7:27 ` Wu Fengguang
2009-08-21 10:57 ` KOSAKI Motohiro
0 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2009-08-21 7:27 UTC (permalink / raw)
To: Minchan Kim
Cc: KAMEZAWA Hiroyuki, Andrew Morton, Balbir Singh, KOSAKI Motohiro,
Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
Dike, Jeffrey G, Hugh Dickins, Christoph Lameter, Mel Gorman,
LKML, linux-mm, nishimura, lizf, menage
For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
in which case shrink_list() _still_ calls isolate_pages() with the much
larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
scan rate by up to 32 times.
For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
So when shrink_zone() expects to scan 4 pages in the active/inactive
list, the active list will be scanned 4 pages, while the inactive list
will be (over) scanned SWAP_CLUSTER_MAX=32 pages in effect. And that
could break the balance between the two lists.
It can further impact the scan of anon active list, due to the anon
active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():
inactive anon list over scanned => inactive_anon_is_low() == TRUE
=> shrink_active_list()
=> active anon list over scanned
So the end result may be
- anon inactive => over scanned
- anon active => over scanned (maybe not as much)
- file inactive => over scanned
- file active => under scanned (relatively)
The accesses to nr_saved_scan are not lock protected and so not 100%
accurate; however, we can tolerate small errors and the resulting small
imbalance in scan rates between zones.
CC: Rik van Riel <riel@redhat.com>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/mmzone.h |    6 +++++-
 mm/page_alloc.c        |    2 +-
 mm/vmscan.c            |   20 +++++++++++---------
 3 files changed, 17 insertions(+), 11 deletions(-)

--- linux.orig/include/linux/mmzone.h    2009-08-21 15:02:50.000000000 +0800
+++ linux/include/linux/mmzone.h         2009-08-21 15:03:25.000000000 +0800
@@ -269,6 +269,11 @@ struct zone_reclaim_stat {
         */
        unsigned long           recent_rotated[2];
        unsigned long           recent_scanned[2];
+
+       /*
+        * accumulated for batching
+        */
+       unsigned long           nr_saved_scan[NR_LRU_LISTS];
 };

 struct zone {
@@ -323,7 +328,6 @@ struct zone {
        spinlock_t              lru_lock;
        struct zone_lru {
                struct list_head list;
-               unsigned long nr_saved_scan;    /* accumulated for batching */
        } lru[NR_LRU_LISTS];

        struct zone_reclaim_stat reclaim_stat;
--- linux.orig/mm/vmscan.c       2009-08-21 15:03:15.000000000 +0800
+++ linux/mm/vmscan.c            2009-08-21 15:03:25.000000000 +0800
@@ -1521,6 +1521,7 @@ static void shrink_zone(int priority, st
        enum lru_list l;
        unsigned long nr_reclaimed = sc->nr_reclaimed;
        unsigned long swap_cluster_max = sc->swap_cluster_max;
+       struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
        int noswap = 0;

        /* If we have no swap space, do not bother scanning anon pages. */
@@ -1540,12 +1541,9 @@ static void shrink_zone(int priority, st
                        scan >>= priority;
                        scan = (scan * percent[file]) / 100;
                }
-               if (scanning_global_lru(sc))
-                       nr[l] = nr_scan_try_batch(scan,
-                                                 &zone->lru[l].nr_saved_scan,
-                                                 swap_cluster_max);
-               else
-                       nr[l] = scan;
+               nr[l] = nr_scan_try_batch(scan,
+                                         &reclaim_stat->nr_saved_scan[l],
+                                         swap_cluster_max);
        }

        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
@@ -2128,6 +2126,7 @@ static void shrink_all_zones(unsigned lo
 {
        struct zone *zone;
        unsigned long nr_reclaimed = 0;
+       struct zone_reclaim_stat *reclaim_stat;

        for_each_populated_zone(zone) {
                enum lru_list l;
@@ -2144,11 +2143,14 @@ static void shrink_all_zones(unsigned lo
                                                l == LRU_ACTIVE_FILE))
                                continue;

-                       zone->lru[l].nr_saved_scan += (lru_pages >> prio) + 1;
-                       if (zone->lru[l].nr_saved_scan >= nr_pages || pass > 3) {
+                       reclaim_stat = get_reclaim_stat(zone, sc);
+                       reclaim_stat->nr_saved_scan[l] +=
+                                               (lru_pages >> prio) + 1;
+                       if (reclaim_stat->nr_saved_scan[l]
+                                               >= nr_pages || pass > 3) {
                                unsigned long nr_to_scan;

-                               zone->lru[l].nr_saved_scan = 0;
+                               reclaim_stat->nr_saved_scan[l] = 0;
                                nr_to_scan = min(nr_pages, lru_pages);
                                nr_reclaimed += shrink_list(l, nr_to_scan, zone,
                                                            sc, prio);
--- linux.orig/mm/page_alloc.c   2009-08-21 15:02:50.000000000 +0800
+++ linux/mm/page_alloc.c        2009-08-21 15:03:25.000000000 +0800
@@ -3734,7 +3734,7 @@ static void __paginginit free_area_init_
                zone_pcp_init(zone);
                for_each_lru(l) {
                        INIT_LIST_HEAD(&zone->lru[l].list);
-                       zone->lru[l].nr_saved_scan = 0;
+                       zone->reclaim_stat.nr_saved_scan[l] = 0;
                }
                zone->reclaim_stat.recent_rotated[0] = 0;
                zone->reclaim_stat.recent_rotated[1] = 0;
* Re: [PATCH -v2 changelog updated] mm: do batched scans for mem_cgroup
2009-08-21 7:27 ` [PATCH -v2 changelog updated] " Wu Fengguang
@ 2009-08-21 10:57 ` KOSAKI Motohiro
0 siblings, 0 replies; 19+ messages in thread
From: KOSAKI Motohiro @ 2009-08-21 10:57 UTC (permalink / raw)
To: Wu Fengguang
Cc: Minchan Kim, KAMEZAWA Hiroyuki, Andrew Morton, Balbir Singh,
Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli,
Dike, Jeffrey G, Hugh Dickins, Christoph Lameter, Mel Gorman,
LKML, linux-mm, nishimura, lizf, menage
2009/8/21 Wu Fengguang <fengguang.wu@intel.com>:
> For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> in which case shrink_list() _still_ calls isolate_pages() with the much
> larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list
> scan rate by up to 32 times.
>
> For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> So when shrink_zone() expects to scan 4 pages in the active/inactive
> list, the active list will be scanned 4 pages, while the inactive list
> will be (over) scanned SWAP_CLUSTER_MAX=32 pages in effect. And that
> could break the balance between the two lists.
>
> It can further impact the scan of anon active list, due to the anon
> active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():
>
> inactive anon list over scanned => inactive_anon_is_low() == TRUE
> => shrink_active_list()
> => active anon list over scanned
>
> So the end result may be
>
> - anon inactive => over scanned
> - anon active => over scanned (maybe not as much)
> - file inactive => over scanned
> - file active => under scanned (relatively)
>
> The accesses to nr_saved_scan are not lock protected and so not 100%
> accurate; however, we can tolerate small errors and the resulting small
> imbalance in scan rates between zones.
>
Looks good to me.
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>