Subject: Re: [PATCH V4 06/10] Per-memcg background reclaim.
From: Ying Han <yinghan@google.com>
Date: Fri, 15 Apr 2011 11:00:08 -0700
To: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh, Tejun Heo,
 Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
 Johannes Weiner, Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
 Zhu Yanhai, linux-mm@kvack.org

On Fri, Apr 15, 2011 at 1:14 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 14 Apr 2011 23:08:40 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > On Thu, Apr 14, 2011 at 6:11 PM, KAMEZAWA Hiroyuki
> > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > As you know, memcg works against the user's memory, and that memory
> > > should be in the highmem zone. Memcg-kswapd is not for memory
> > > shortage, but for voluntary page dropping by the _user_.
> >
> > In some sense, yes. But it is also related to memory shortage on fully
> > packed machines.
>
> No. _At this point_, this is just for freeing pages voluntarily before
> hitting the limit, to gain performance. Anyway, this understanding does
> not affect the patch itself.
>
> > > If this memcg-kswapd drops pages from lower zones first, ah, ok,
> > > that's good for the system, because memcg's pages should be in a
> > > higher zone if we have free memory.
> > >
> > > So, I think the reason for the dma->highmem ordering is different
> > > from global kswapd's.
> >
> > Yes. I agree that the logic of the dma->highmem ordering is not exactly
> > the same for per-memcg kswapd and per-node kswapd. But the page
> > allocation still happens on the other side, and this is still good for
> > the system, as you pointed out.
> > > > +	for (i = 0; i < pgdat->nr_zones; i++) {
> > > > +		struct zone *zone = pgdat->node_zones + i;
> > > > +
> > > > +		if (!populated_zone(zone))
> > > > +			continue;
> > > > +
> > > > +		sc->nr_scanned = 0;
> > > > +		shrink_zone(priority, zone, sc);
> > > > +		total_scanned += sc->nr_scanned;
> > > > +
> > > > +		/*
> > > > +		 * If we've done a decent amount of scanning and
> > > > +		 * the reclaim ratio is low, start doing writepage
> > > > +		 * even in laptop mode
> > > > +		 */
> > > > +		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > > > +		    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
> > > > +			sc->may_writepage = 1;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	sc->nr_scanned = total_scanned;
> > > > +	return;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Per cgroup background reclaim.
> > > > + * TODO: Take off the order since memcg always does order 0
> > > > + */
> > > > +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> > > > +					      int order)
> > > > +{
> > > > +	int i, nid;
> > > > +	int start_node;
> > > > +	int priority;
> > > > +	bool wmark_ok;
> > > > +	int loop;
> > > > +	pg_data_t *pgdat;
> > > > +	nodemask_t do_nodes;
> > > > +	unsigned long total_scanned;
> > > > +	struct scan_control sc = {
> > > > +		.gfp_mask = GFP_KERNEL,
> > > > +		.may_unmap = 1,
> > > > +		.may_swap = 1,
> > > > +		.nr_to_reclaim = ULONG_MAX,
> > > > +		.swappiness = vm_swappiness,
> > > > +		.order = order,
> > > > +		.mem_cgroup = mem_cont,
> > > > +	};
> > > > +
> > > > +loop_again:
> > > > +	do_nodes = NODE_MASK_NONE;
> > > > +	sc.may_writepage = !laptop_mode;
> > >
> > > I think may_writepage should start from '0' always. We're not sure
> > > the system is in memory shortage... we just want to release memory
> > > voluntarily. writepage will add huge costs, I guess.
> > >
> > > For example,
> > >	sc.may_writepage = !!loop
> > > may be better for memcg.
> > >
> > > BTW, you set nr_to_reclaim to ULONG_MAX here and don't modify it
> > > later. I think you should add some logic to set it to the right
> > > value.
> > >
> > > For example, before calling shrink_zone(),
> > >
> > >	sc->nr_to_reclaim = min(SWAP_CLUSTER_MAX,
> > >				memcg_usage_in_this_zone() / 100); /* 1% in this zone */
> > >
> > > if we love 'fair pressure for each zone'.
> >
> > Hmm. I don't get it. Leaving nr_to_reclaim at ULONG_MAX in the kswapd
> > case is intended to apply equal memory pressure to each zone.
>
> And it still needs to reclaim memory from the zone.
> memcg can visit other zones/nodes because it is not bound to a single
> zone/pgdat.
>
> > So in shrink_zone, we won't bail out in the following condition:
> >
> >	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> >					nr[LRU_INACTIVE_FILE]) {
> >
> >		if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
> >			break;
> >	}
>
> Yes. So, by setting nr_to_reclaim to a proper value for a zone, we can
> visit the next zone/node sooner. memcg's kswapd is not requested to free
> memory from a particular node/zone. (But we'll need a hint for bias,
> later.)
>
> With nr_to_reclaim at ULONG_MAX, to quit this loop we need to keep going
> until every nr[lru] reaches 0. When memcg kswapd finds that memcg's usage
> is difficult to reduce below high_wmark, priority rises dramatically and
> we'll see a long loop in this zone if the zone is busy.
>
> For memcg kswapd, it can visit the next zone rather than looping more.
> Then, we'll be able to reduce cpu usage and contention by memcg_kswapd.
>
> I think this do-more/skip-and-next logic will be a difficult issue and
> will need long-term research to get right. For now, I bet ULONG_MAX is
> not the right choice. SWAP_CLUSTER_MAX, as the usual try_to_free_pages()
> uses, will be enough. With that, we can visit the next node.
>
> Thanks,
> -Kame

Fair enough, and that makes sense. I will make the change in the next post,
roughly along the lines of the sketch below.

--Ying
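(A minimal, untested sketch of the two adjustments discussed above, shown
as fragments against the hunks quoted earlier: may_writepage starting from
0, and a bounded per-zone reclaim target. Nothing here is final code.)

	/* in balance_mem_cgroup_pgdat(): start with writepage disabled
	 * and only enable it on later passes, since we are freeing
	 * voluntarily rather than reacting to memory shortage */
loop_again:
	do_nodes = NODE_MASK_NONE;
	sc.may_writepage = !!loop;

	/* in the per-zone loop: cap the per-zone target so shrink_zone()
	 * bails out early and we move on to the next zone/node instead
	 * of draining every nr[lru] to zero */
	for (i = 0; i < pgdat->nr_zones; i++) {
		struct zone *zone = pgdat->node_zones + i;

		if (!populated_zone(zone))
			continue;

		sc->nr_to_reclaim = SWAP_CLUSTER_MAX;
		sc->nr_scanned = 0;
		shrink_zone(priority, zone, sc);
		total_scanned += sc->nr_scanned;
	}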