From: Yang Shi <yang.shi@linux.alibaba.com>
To: David Rientjes <rientjes@google.com>
Cc: ktkhai@virtuozzo.com, hannes@cmpxchg.org, mhocko@suse.com,
kirill.shutemov@linux.intel.com, hughd@google.com,
shakeelb@google.com, akpm@linux-foundation.org,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/3] Make deferred split shrinker memcg aware
Date: Wed, 29 May 2019 10:34:24 +0800
Message-ID: <2e23bd8c-6120-5a86-9e9e-ab43b02ce150@linux.alibaba.com>
In-Reply-To: <alpine.DEB.2.21.1905281817090.86034@chino.kir.corp.google.com>

On 5/29/19 9:22 AM, David Rientjes wrote:
> On Tue, 28 May 2019, Yang Shi wrote:
>
>> I got some reports from our internal application team about memcg OOM.
>> Even though the application has been killed by the OOM killer, a lot of
>> THPs remain resident; page reclaim doesn't reclaim them at all.
>>
>> Some investigation shows they are on the deferred split queue. Memcg direct
>> reclaim can't shrink them since the THP deferred split shrinker is not memcg
>> aware, which may cause premature OOM in the memcg. The issue can be
>> reproduced easily with the test below:
>>
> Right, we've also encountered this. I talked to Kirill about it a week or
> so ago where the suggestion was to split all compound pages on the
> deferred split queues under the presence of even memory pressure.
>
> That breaks cgroup isolation and perhaps unfairly penalizes workloads that
> are running attached to other memcg hierarchies that are not under
> pressure because their compound pages are now split as a side effect.
> There is a benefit to keeping these compound pages around while not under
> memory pressure if all pages are subsequently mapped again.
Yes, I do agree. I tried other approaches too, and making the deferred
split queue per memcg seems to be the best one.
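As a rough illustration of why a per-memcg queue preserves the isolation
described above, here is a minimal userspace sketch. It is not code from
the patches: struct thp, struct split_queue, struct memcg, queue_thp() and
shrink_memcg() are hypothetical names, and "splitting" is modeled as simply
counting the subpages that would be freed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names come from the patches. */
struct thp {
	struct thp *next;	/* link on the owning deferred split queue */
	int nr_unmapped;	/* subpages a split could free (assumed known) */
};

struct split_queue {
	struct thp *head;
	size_t len;
};

struct memcg {
	struct split_queue deferred;	/* one deferred split queue per memcg */
};

/* Put a partially unmapped THP on the queue of the memcg that owns it. */
static void queue_thp(struct memcg *cg, struct thp *page)
{
	page->next = cg->deferred.head;
	cg->deferred.head = page;
	cg->deferred.len++;
}

/*
 * Model reclaim for one memcg: "split" every THP on that memcg's queue
 * and return the number of subpages freed. Queues of other memcgs are
 * never touched, which is the isolation property under discussion.
 */
static int shrink_memcg(struct memcg *cg)
{
	int freed = 0;

	while (cg->deferred.head) {
		struct thp *page = cg->deferred.head;

		cg->deferred.head = page->next;
		page->next = NULL;
		cg->deferred.len--;
		freed += page->nr_unmapped;
	}
	assert(cg->deferred.len == 0);
	return freed;
}
```

With two such memcgs, calling shrink_memcg() on the one under pressure
drains only its own queue; THPs queued on the other memcg stay intact.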
>
>> $ cgcreate -g memory:thp
>> $ echo 4G > /sys/fs/cgroup/memory/thp/memory.limit_in_bytes
>> $ cgexec -g memory:thp ./transhuge-stress 4000
>>
>> transhuge-stress comes from the kernel selftests.
>>
>> It is easy to hit OOM, but there are still a lot of THPs on the deferred split
>> queue; memcg direct reclaim can't touch them since the deferred split
>> shrinker is not memcg aware.
>>
> Yes, we have seen this on at least 4.15 as well.
>
>> Make the deferred split shrinker memcg aware by introducing a per-memcg
>> deferred split queue. A THP should be on either the per-node or the per-memcg
>> deferred split queue: the per-memcg one if it belongs to a memcg. When the
>> page is migrated to another memcg, it is moved to the target memcg's deferred
>> split queue too.
>>
>> Also, when a page is freed, delete the THP from the deferred split queue
>> before the memcg uncharge so that the page's memcg information is still
>> available.
>>
>> Reuse the second tail page's deferred_list for per memcg list since the same
>> THP can't be on multiple deferred split queues at the same time.
>>
>> Remove THP specific destructor since it is not used anymore with memcg aware
>> THP shrinker (Please see the commit log of patch 2/3 for the details).
>>
>> Make the deferred split shrinker not depend on memcg kmem since it is not slab.
>> It doesn't make sense to skip shrinking THPs just because memcg kmem is disabled.
>>
>> With the above changes, the test shown above doesn't trigger OOM anymore,
>> even with cgroup.memory=nokmem.
>>
> I'm curious if your internal applications team is also asking for
> statistics on how much memory can be freed if the deferred split queues
> can be shrunk? We have applications that monitor their own memory usage
No, but this reminds me. The THPs on the deferred split queue should be
counted as available memory too.
> through memcg stats or usage and proactively try to reduce that usage when
> it is growing too large. The deferred split queues have significantly
> increased both memcg usage and rss when they've upgraded kernels.
>
> How are your applications monitoring how much memory from deferred split
> queues can be freed on memory pressure? Any thoughts on providing it as a
> memcg stat?
I don't think they have such a monitor. I saw that rss_huge was abnormal
in the memcg stat even after the application was killed by OOM, so I
realized the deferred split queue may play a role here.

The memcg stat doesn't have counters for available memory like the global
vmstat does. It may be better to add such statistics, or to extend
reclaimable "slab" to shrinkable/reclaimable "memory".
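
For reference, the rss_huge check mentioned above can be done from
userspace by reading the memcg's memory.stat file and picking out one
counter. The helper below is only a sketch under assumptions: the name
memcg_stat_lookup() is hypothetical, and it expects the caller to have
already read the file (for the test cgroup above that would presumably be
/sys/fs/cgroup/memory/thp/memory.stat) into a NUL-terminated buffer of
"name value" lines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one "name value" line in memory.stat-style text.
 * Returns 0 and stores the value on success, -1 if the key is absent
 * or its value cannot be parsed. Hypothetical helper, not a kernel API.
 */
static int memcg_stat_lookup(const char *text, const char *key,
			     unsigned long long *value)
{
	size_t keylen;
	const char *line = text;

	assert(text && key && value);
	keylen = strlen(key);

	while (line && *line) {
		/* Require a space after the key so "rss" does not match "rss_huge". */
		if (strncmp(line, key, keylen) == 0 && line[keylen] == ' ')
			return sscanf(line + keylen, "%llu", value) == 1 ? 0 : -1;

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

Comparing the "rss_huge" value against "rss" after the workload has been
killed gives a rough idea of how much huge page memory is still charged
to the memcg. It does not say how much of that sits on the deferred split
queue, which is the statistic that is missing today.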
>
> Thanks!