Re: [PATCH V3 0/2] memcg softlimit reclaim rework


From: Ying Han <yinghan@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>, Mel Gorman <mel@csn.ul.ie>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Rik van Riel <riel@redhat.com>, Hillf Danton <dhillf@gmail.com>,
	Hugh Dickins <hughd@google.com>,
	Dan Magenheimer <dan.magenheimer@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org
Subject: Re: [PATCH V3 0/2] memcg softlimit reclaim rework
Date: Wed, 18 Apr 2012 11:00:40 -0700
Message-ID: <CALWz4iz_17fQa=EfT2KqvJUGyHQFc5v9r+7b947yMbocC9rrjA@mail.gmail.com>
In-Reply-To: <20120418122448.GB1771@cmpxchg.org>

On Wed, Apr 18, 2012 at 5:24 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Apr 17, 2012 at 09:37:46AM -0700, Ying Han wrote:
>> The "soft_limit" was introduced in memcg to support over-committing the
>> memory resource on the host. Each cgroup configures its "hard_limit" where
>> it will be throttled or OOM killed by going over the limit. However, the
>> cgroup can go above the "soft_limit" as long as there is no system-wide
>> memory contention. So, the "soft_limit" is the kernel mechanism for
>> re-distributing system spare memory among cgroups.
>>
>> This patch reworks the softlimit reclaim by hooking it into the new global
>> reclaim scheme. So the global reclaim path including direct reclaim and
>> background reclaim will respect the memcg softlimit.
>>
>> v3..v2:
>> 1. rebase the patch on 3.4-rc3
>> 2. squash the commits of replacing the old implementation with new
>> implementation into one commit. This is to make sure to leave the tree
>> in stable state between each commit.
>> 3. removed the commit which changes the nr_to_reclaim for global reclaim
>> case. The need of that patch is not obvious now.
>>
>> Note:
>> 1. the new implementation of softlimit reclaim is rather simple and first
>> step for further optimizations. there is no memory pressure balancing between
>> memcgs for each zone, and that is something we would like to add as follow-ups.
>>
>> 2. this patch is slightly different from the last one posted from Johannes
>> http://comments.gmane.org/gmane.linux.kernel.mm/72382
>> where his patch is closer to the reverted implementation by doing hierarchical
>> reclaim for each selected memcg. However, that is not expected behavior from
>> user perspective. Considering the following example:
>>
>> root (32G capacity)
>> --> A (hard limit 20G, soft limit 15G, usage 16G)
>>    --> A1 (soft limit 5G, usage 4G)
>>    --> A2 (soft limit 10G, usage 12G)
>> --> B (hard limit 20G, soft limit 10G, usage 16G)
>>
>> Under global reclaim, we shouldn't add pressure on A1 although its parent(A)
>> exceeds softlimit. This is what admin expects by setting softlimit to the
>> actual working set size and only reclaim pages under softlimit if system has
>> trouble to reclaim.
>
> Actually, this is exactly what the admin expects when creating a
> hierarchy, because she defines that A1 is a child of A and is
> responsible for the memory situation in its parent.

> That's the single point of having a hierarchy.  Why do you create them
> if you don't want their behaviour?

I agree with hierarchical reclaim, which pushes the pressure down
from A to A1 and A2. But that only applies naturally to hard_limit,
not to soft_limit.

One of the use cases for creating a hierarchy is to get finer-grained
accounting for subsets of processes while they share the same hard
limit.

Imagine A1 and A2 had not been created and all the processes ran
directly under A to start with. The problem with that setup is that
they all share a single accounting, and memcg naturally provides
finer-grained accounting by creating sub-cgroups under A. After
setting "use_hierarchy" to 1, direct reclaim from A (A hits its
hard_limit) should also reclaim from A1 and A2 regardless of each
one's individual usage_in_bytes, since both A1 and A2 contribute to
A's charge.

However, we need to be more selective for soft_limit, since most users
set it to protect the cgroup's working set. We don't want to reclaim
A1's anon pages when reclaiming A2's cold page cache pages could
satisfy the page allocation.

Note that the soft_limit setting is always optional, unlike
hard_limit. Once the admin chooses to set it, he/she wants to protect
the hot memory of each cgroup.
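
To make the difference concrete, here is a minimal Python sketch of the
two selection rules applied to the example hierarchy from the cover
letter. It is only a model of the discussion, not kernel code: the
Group class and both function names are hypothetical, and A's 16G is
treated as hierarchical usage (the sum of A1 and A2).

```python
from dataclasses import dataclass
from typing import List, Optional

GB = 1 << 30


@dataclass
class Group:
    name: str
    usage: int                        # hierarchical usage in bytes
    soft_limit: int                   # soft limit in bytes
    parent: Optional["Group"] = None  # None for a top-level group

    def over_soft_limit(self) -> bool:
        return self.usage > self.soft_limit


def targets_per_group(groups: List[Group]) -> List[str]:
    """Selective rule: a group is a target only if it exceeds its own soft limit."""
    return [g.name for g in groups if g.over_soft_limit()]


def targets_hierarchical(groups: List[Group]) -> List[str]:
    """Hierarchical rule: a group is a target if it or any ancestor exceeds its soft limit."""
    selected = []
    for g in groups:
        node = g
        while node is not None:
            if node.over_soft_limit():
                selected.append(g.name)
                break
            node = node.parent
    return selected


# The example from the cover letter.
a = Group("A", 16 * GB, 15 * GB)
a1 = Group("A1", 4 * GB, 5 * GB, parent=a)
a2 = Group("A2", 12 * GB, 10 * GB, parent=a)
b = Group("B", 16 * GB, 10 * GB)
groups = [a, a1, a2, b]

print(targets_per_group(groups))     # ['A', 'A2', 'B']
print(targets_hierarchical(groups))  # ['A', 'A1', 'A2', 'B']
```

The only difference between the two results is A1: it stays under its
own soft limit, but the hierarchical rule still selects it because its
parent A is over.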

>
> And A does not have its own pages (usage is just the sum of its
> children), what SHOULD its soft limit even mean in your example?

A does have pages on its LRU: pages allocated for processes running
directly under A, and also the pages re-parented after rmdir of
A1/A2. The softlimit of A covers both cases.

>
> If you had
>
>    A (hard 20G, usage 16G)
>       A1 (soft  5G, usage  4G)
>       A2 (soft 10G, usage 12G)
>    B (hard 20G, soft 10G, usage 16G)
>
> (i.e. no soft limit on A), you could reasonably make it so that on
> global reclaim, only A2 and B would get reclaimed, like you want it
> to, while still keeping the hierarchical properties of soft limits.

> If you want soft limits applied to leaf nodes only, don't set them
> anywhere else..?

No softlimit on A means leaving it at the default value:

unlimited (now): pages linked to A's LRU will not get a chance to be
reclaimed at all under softlimit reclaim.

0 (after this patch): it will always end up reclaiming from A's
children.
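
One reading of these two cases, as a small Python sketch. It applies
the hierarchical rule (a group is a target if it or any ancestor is
over its soft limit) to the hierarchy above with no soft limit
configured on A. The table layout and helper names are hypothetical,
and the usage numbers are the ones from the example.

```python
GB = 1 << 30
UNLIMITED = float("inf")


def build(default_soft_limit):
    """Return {name: (usage, soft_limit, parent)} with A left at the default."""
    return {
        "A": (16 * GB, default_soft_limit, None),
        "A1": (4 * GB, 5 * GB, "A"),
        "A2": (12 * GB, 10 * GB, "A"),
    }


def is_target(name, table):
    """True if the group or any of its ancestors exceeds its soft limit."""
    node = name
    while node is not None:
        usage, soft_limit, parent = table[node]
        if usage > soft_limit:
            return True
        node = parent
    return False


for label, default in (("unlimited", UNLIMITED), ("0", 0)):
    table = build(default)
    print(label, {name: is_target(name, table) for name in table})
```

With the unlimited default, A is never over its soft limit, so its own
LRU pages are never selected and only A2 is a target. With a default
of 0, A is always over, so A1 and A2 are always targets regardless of
their own settings.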

> Ultimately, we want to support nesting memcgs within containers.  For
> this reason, they need to be applied hierarchically, or the admin of
> the host does not have soft limit control over untrusted guest groups:
>
>    container A (hard 20G, soft 16G)
>      group A-1 (soft 100G)
>    container B (hard 20G, soft 16G)
>      group B-1
>
> In this case under global memory pressure, contrary to your claims, we
> actually do want to reclaim from A-1, not just from B-1.  Otherwise, a
> container could gain priority over another one by setting ridiculous
> soft limits.

This is a misconfiguration of softlimit, assuming the machine capacity
is < 100G. I am wondering whether we should design the system to
accommodate misconfiguration at the cost of breaking the expectations
of a properly configured system.

> We have been at this point a couple times.  Could you please explain
> what you are trying to do in the first place, why you need
> hierarchies, why you configure them like you do?

The hierarchy is needed for sharing one hard_limit while also getting
finer-grained accounting. The soft_limit is set to protect the working
set of each cgroup on the system; it works purely as a filter that
prioritizes the reclaim order, and only once the whole system is under
memory contention.
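
As a rough sketch of what "filter and prioritize" means here: reclaim
first takes from groups above their soft limit, and only touches the
protected groups if that is not enough. The Python below is a toy
model, not the patch itself. The function name is hypothetical, the
"reclaimable" amounts (in GB) are made-up numbers for illustration,
and how the kernel decides when to fall back is not modeled.

```python
def plan_reclaim(groups, target):
    """Return {name: amount} taking first from groups over their soft limit."""
    plan = {}
    remaining = target

    def take_from(candidates):
        nonlocal remaining
        for g in candidates:
            if remaining <= 0:
                return
            amount = min(g["reclaimable"], remaining)
            if amount > 0:
                plan[g["name"]] = amount
                remaining -= amount

    over = [g for g in groups if g["usage"] > g["soft_limit"]]
    under = [g for g in groups if g["usage"] <= g["soft_limit"]]
    take_from(over)    # pass 1: groups above their soft limit
    take_from(under)   # pass 2: fall back to protected groups
    return plan


# Usage and soft limits (GB) from the example; "reclaimable" is hypothetical.
groups = [
    {"name": "A1", "usage": 4, "soft_limit": 5, "reclaimable": 1},
    {"name": "A2", "usage": 12, "soft_limit": 10, "reclaimable": 3},
    {"name": "B", "usage": 16, "soft_limit": 10, "reclaimable": 2},
]

print(plan_reclaim(groups, 4))  # {'A2': 3, 'B': 1}
print(plan_reclaim(groups, 6))  # {'A2': 3, 'B': 2, 'A1': 1}
```

In the first call A1 is left alone because A2 and B can satisfy the
request; in the second, A1 is reclaimed only after the groups over
their soft limit have nothing left to give.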

In my mind, soft_limit should be optional, and admins should only set
it if they know what they want to do with it. The main use case we
have for it now is protecting the working set, and that is the
expectation when they choose to set it.

--Ying

>
> Thanks

