Re: [PATCH V4 06/10] Per-memcg background reclaim.

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Minchan Kim <minchan.kim@gmail.com>,
	Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>,
	Balbir Singh <balbir@linux.vnet.ibm.com>,
	Tejun Heo <tj@kernel.org>, Pavel Emelyanov <xemul@openvz.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Li Zefan <lizf@cn.fujitsu.com>, Mel Gorman <mel@csn.ul.ie>,
	Christoph Lameter <cl@linux.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Rik van Riel <riel@redhat.com>, Hugh Dickins <hughd@google.com>,
	Michal Hocko <mhocko@suse.cz>,
	Dave Hansen <dave@linux.vnet.ibm.com>,
	Zhu Yanhai <zhu.yanhai@gmail.com>,
	linux-mm@kvack.org
Subject: Re: [PATCH V4 06/10] Per-memcg background reclaim.
Date: Fri, 15 Apr 2011 17:14:37 +0900	[thread overview]
Message-ID: <20110415171437.098392da.kamezawa.hiroyu@jp.fujitsu.com> (raw)
In-Reply-To: <BANLkTin0r26b2JgRJkXwLxP4m5HGAaxH=A@mail.gmail.com>

On Thu, 14 Apr 2011 23:08:40 -0700
Ying Han <yinghan@google.com> wrote:

> On Thu, Apr 14, 2011 at 6:11 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:

> >
> > As you know, memcg works against user's memory, memory should be in highmem
> > zone.
> > Memcg-kswapd is not for memory-shortage, but for voluntary page dropping by
> > _user_.
> >
> 
> in some sense, yes. but it would also related to memory-shortage on fully
> packed machines.
> 

No. _at this point_, this is just for freeing volutary before hitting limit
to gain performance. Anyway, this understainding is not affecting the patch
itself.

> >
> > If this memcg-kswapd drops pages from lower zones first, ah, ok, it's good
> > for
> > the system because memcg's pages should be on higher zone if we have free
> > memory.
> >
> > So, I think the reason for dma->highmem is different from global kswapd.
> >
> 
> yes. I agree that the logic of dma->highmem ordering is not exactly the same
> from per-memcg kswapd and per-node kswapd. But still the page allocation
> happens on the other side, and this is still good for the system as you
> pointed out.
> 
> >
> >
> >
> >
> > > +     for (i = 0; i < pgdat->nr_zones; i++) {
> > > +             struct zone *zone = pgdat->node_zones + i;
> > > +
> > > +             if (!populated_zone(zone))
> > > +                     continue;
> > > +
> > > +             sc->nr_scanned = 0;
> > > +             shrink_zone(priority, zone, sc);
> > > +             total_scanned += sc->nr_scanned;
> > > +
> > > +             /*
> > > +              * If we've done a decent amount of scanning and
> > > +              * the reclaim ratio is low, start doing writepage
> > > +              * even in laptop mode
> > > +              */
> > > +             if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > > +                 total_scanned > sc->nr_reclaimed + sc->nr_reclaimed /
> > 2) {
> > > +                     sc->may_writepage = 1;
> > > +             }
> > > +     }
> > > +
> > > +     sc->nr_scanned = total_scanned;
> > > +     return;
> > > +}
> > > +
> > > +/*
> > > + * Per cgroup background reclaim.
> > > + * TODO: Take off the order since memcg always do order 0
> > > + */
> > > +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup
> > *mem_cont,
> > > +                                           int order)
> > > +{
> > > +     int i, nid;
> > > +     int start_node;
> > > +     int priority;
> > > +     bool wmark_ok;
> > > +     int loop;
> > > +     pg_data_t *pgdat;
> > > +     nodemask_t do_nodes;
> > > +     unsigned long total_scanned;
> > > +     struct scan_control sc = {
> > > +             .gfp_mask = GFP_KERNEL,
> > > +             .may_unmap = 1,
> > > +             .may_swap = 1,
> > > +             .nr_to_reclaim = ULONG_MAX,
> > > +             .swappiness = vm_swappiness,
> > > +             .order = order,
> > > +             .mem_cgroup = mem_cont,
> > > +     };
> > > +
> > > +loop_again:
> > > +     do_nodes = NODE_MASK_NONE;
> > > +     sc.may_writepage = !laptop_mode;
> >
> > I think may_writepage should start from '0' always. We're not sure
> > the system is in memory shortage...we just want to release memory
> > volunatary. write_page will add huge costs, I guess.
> >
> > For exmaple,
> >        sc.may_writepage = !!loop
> > may be better for memcg.
> >
> > BTW, you set nr_to_reclaim as ULONG_MAX here and doesn't modify it later.
> >
> > I think you should add some logic to fix it to right value.
> >
> > For example, before calling shrink_zone(),
> >
> > sc->nr_to_reclaim = min(SWAP_CLUSETR_MAX, memcg_usage_in_this_zone() /
> > 100);  # 1% in this zone.
> >
> > if we love 'fair pressure for each zone'.
> >
> 
> Hmm. I don't get it. Leaving the nr_to_reclaim to be ULONG_MAX in kswapd
> case which is intended to add equal memory pressure for each zone. 

And it need to reclaim memory from the zone.
memcg can visit other zone/node because it's not work for zone/pgdat.

> So in the shrink_zone, we won't bail out in the following condition:
> 
> 
> >-------while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> > >------->------->------->------->-------nr[LRU_INACTIVE_FILE]) {
> >
> 
>  >------->-------if (nr_reclaimed >= nr_to_reclaim && priority <
> DEF_PRIORITY)
> >------->------->-------break;
> 
> }

Yes. So, by setting nr_to_reclaim to be proper value for a zone,
we can visit next zone/node sooner. memcg's kswapd is not requrested to
free memory from a node/zone. (But we'll need a hint for bias, later.)

By making nr_reclaimed to be ULONG_MAX, to quit this loop, we need to
loop until all nr[lru] to be 0. When memcg kswapd finds that memcg's usage
is difficult to be reduced under high_wmark, priority goes up dramatically
and we'll see long loop in this zone if zone is busy.

For memcg kswapd, it can visit next zone rather than loop more. Then,
we'll be able to reduce cpu usage and contention by memcg_kswapd.

I think this do-more/skip-and-next logic will be a difficult issue
and need to be maintained with long time research. For now, I bet
ULONG_MAX is not a choice. As usual try_to_free_page() does,
SWAP_CLUSTER_MAX will be enough. As it is, we can visit next node.

Thanks,
-Kame



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2011-04-15  8:21 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-04-14 22:54 [PATCH V4 00/10] memcg: per cgroup " Ying Han
2011-04-14 22:54 ` [PATCH V4 01/10] Add kswapd descriptor Ying Han
2011-04-15  0:04   ` KAMEZAWA Hiroyuki
2011-04-15  3:35     ` Ying Han
2011-04-15  4:16       ` KAMEZAWA Hiroyuki
2011-04-15 21:46         ` Ying Han
2011-04-14 22:54 ` [PATCH V4 02/10] Add per memcg reclaim watermarks Ying Han
2011-04-15  0:16   ` KAMEZAWA Hiroyuki
2011-04-15  3:45     ` Ying Han
2011-04-14 22:54 ` [PATCH V4 03/10] New APIs to adjust per-memcg wmarks Ying Han
2011-04-15  0:25   ` KAMEZAWA Hiroyuki
2011-04-15  4:00     ` Ying Han
2011-04-14 22:54 ` [PATCH V4 04/10] Infrastructure to support per-memcg reclaim Ying Han
2011-04-15  0:34   ` KAMEZAWA Hiroyuki
2011-04-15  4:04     ` Ying Han
2011-04-14 22:54 ` [PATCH V4 05/10] Implement the select_victim_node within memcg Ying Han
2011-04-15  0:40   ` KAMEZAWA Hiroyuki
2011-04-15  4:36     ` Ying Han
2011-04-14 22:54 ` [PATCH V4 06/10] Per-memcg background reclaim Ying Han
2011-04-15  1:11   ` KAMEZAWA Hiroyuki
2011-04-15  6:08     ` Ying Han
2011-04-15  8:14       ` KAMEZAWA Hiroyuki [this message]
2011-04-15 18:00         ` Ying Han
2011-04-15  6:26     ` Ying Han
2011-04-14 22:54 ` [PATCH V4 07/10] Add per-memcg zone "unreclaimable" Ying Han
2011-04-15  1:32   ` KAMEZAWA Hiroyuki
2012-03-19  8:27     ` Zhu Yanhai
2012-03-20  5:45       ` Ying Han
2012-03-22  1:13         ` Zhu Yanhai
2011-04-14 22:54 ` [PATCH V4 08/10] Enable per-memcg background reclaim Ying Han
2011-04-15  1:34   ` KAMEZAWA Hiroyuki
2011-04-14 22:54 ` [PATCH V4 09/10] Add API to export per-memcg kswapd pid Ying Han
2011-04-15  1:40   ` KAMEZAWA Hiroyuki
2011-04-15  4:47     ` Ying Han
2011-04-14 22:54 ` [PATCH V4 10/10] Add some per-memcg stats Ying Han
2011-04-15  9:40 ` [PATCH V4 00/10] memcg: per cgroup background reclaim Michal Hocko
2011-04-15 16:40   ` Ying Han
2011-04-18  9:13     ` Michal Hocko
2011-04-18 17:01       ` Ying Han
2011-04-18 18:42         ` Michal Hocko
2011-04-18 22:27           ` Ying Han
2011-04-19  2:48             ` Zhu Yanhai
2011-04-19  3:46               ` Ying Han

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110415171437.098392da.kamezawa.hiroyu@jp.fujitsu.com \
    --to=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=akpm@linux-foundation.org \
    --cc=balbir@linux.vnet.ibm.com \
    --cc=cl@linux.com \
    --cc=dave@linux.vnet.ibm.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-mm@kvack.org \
    --cc=lizf@cn.fujitsu.com \
    --cc=mel@csn.ul.ie \
    --cc=mhocko@suse.cz \
    --cc=minchan.kim@gmail.com \
    --cc=nishimura@mxp.nes.nec.co.jp \
    --cc=riel@redhat.com \
    --cc=tj@kernel.org \
    --cc=xemul@openvz.org \
    --cc=yinghan@google.com \
    --cc=zhu.yanhai@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox