Subject: Re: [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat
From: ning zhang <ningzhang@linux.alibaba.com>
To: Yang Shi
Cc: "Kirill A. Shutemov", Linux MM, Andrew Morton, Johannes Weiner,
 Michal Hocko, Vladimir Davydov, Yu Zhao, Gang Deng
Date: Mon, 1 Nov 2021 10:50:28 +0800
Message-ID: <30787ee3-895c-09b7-ebec-2f5885ac9769@linux.alibaba.com>
References: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com>
 <20211028141333.kgcjgsnrrjuq4hjx@box.shutemov.name>

On 2021/10/30 12:56 AM, Yang Shi wrote:
> On Fri, Oct 29, 2021 at 5:08 AM ning zhang wrote:
>>
>> On 2021/10/28 10:13 PM, Kirill A. Shutemov wrote:
>>> On Thu, Oct 28, 2021 at 07:56:49PM +0800, Ning Zhang wrote:
>>>> As we know, THP may lead to memory bloat, which may cause OOM.
>>>> Testing some apps, we found that the bloat comes from huge pages
>>>> that contain zero subpages (whether accessed or not), and that
>>>> most zero subpages are concentrated in a few huge pages.
>>>>
>>>> Following is a text_classification_rnn case for tensorflow:
>>>>
>>>>   zero_subpages    huge_pages    waste
>>>>   [  0,   1)          186        0.00%
>>>>   [  1,   2)           23        0.01%
>>>>   [  2,   4)           36        0.02%
>>>>   [  4,   8)           67        0.08%
>>>>   [  8,  16)           80        0.23%
>>>>   [ 16,  32)          109        0.61%
>>>>   [ 32,  64)           44        0.49%
>>>>   [ 64, 128)           12        0.30%
>>>>   [128, 256)           28        1.54%
>>>>   [256, 513)          159       18.03%
>>>>
>>>> In this case, 187 huge pages (25% of the total huge pages) contain
>>>> 128 or more zero subpages each, and they waste 19.57% of the total
>>>> RSS. That means we could reclaim 19.57% of the memory by splitting
>>>> those 187 huge pages and reclaiming their zero subpages.
>>>>
>>>> This patchset introduces a mechanism to split huge pages that have
>>>> zero subpages and to reclaim those subpages.
>>>>
>>>> We add each anonymous huge page to a list to reduce the cost of
>>>> finding candidates. When memory reclaim is triggered, the list is
>>>> walked, and a huge page that contains enough zero subpages may be
>>>> reclaimed: it is split, and its zero subpages are replaced by
>>>> ZERO_PAGE(0).
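To make the detection step concrete for readers new to the thread: it
boils down to mapping each 4K subpage and checking whether it is all
zeroes. A minimal sketch of the idea (the helper name is made up for
illustration and this is not the actual patch code; thp_nr_pages(),
kmap_local_page() and memchr_inv() are the real kernel interfaces):

/*
 * Illustrative sketch only: count the all-zero 4K subpages of a THP
 * so the caller can decide whether splitting it is worthwhile.
 * Locking and page-state checks are elided.
 */
static int thp_count_zero_subpages(struct page *head)
{
	int i, zero = 0;

	for (i = 0; i < thp_nr_pages(head); i++) {
		void *addr = kmap_local_page(head + i);

		/* memchr_inv() returns NULL if the whole page is zero */
		if (!memchr_inv(addr, 0, PAGE_SIZE))
			zero++;
		kunmap_local(addr);
	}
	return zero;
}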
>>> Does it actually help your workload?
>>>
>>> I mean, this will only be triggered via vmscan, which was going to
>>> split the pages and free them anyway.
>>>
>>> You prioritize splitting THPs and freeing zero subpages over
>>> reclaiming other pages. That may or may not be the right thing to
>>> do, depending on the workload.
>>>
>>> Maybe it makes more sense to check for all-zero pages just after
>>> split_huge_page_to_list() in vmscan and free such pages immediately,
>>> rather than add all this complexity?
>>>
>> The purpose of zero subpage reclaim (ZSR) is to pick out the huge
>> pages that contain waste and reclaim them.
>>
>> We do this for two reasons:
>> 1. If swap is off, anonymous pages are not scanned at all, so we
>>    never get the opportunity to split a huge page. ZSR helps here.
>> 2. If swap is on, splitting first not only splits the huge page but
>>    also swaps out the nonzero subpages, while ZSR only splits the
>>    huge page. Splitting first therefore causes more performance
>>    degradation; and if ZSR can't reclaim enough pages, swap still
>>    works as before.
>>
>> Why use a separate ZSR list instead of the default LRU list?
>>
>> Because scanning for target huge pages can cost a lot of CPU when
>> many regular and huge pages coexist, and it can be especially bad
>> when swap is off, since we may end up scanning the whole LRU list
>> many times. A huge page is deleted from the ZSR list once it has
>> been scanned, so each page is scanned only once. That is hard to
>> achieve with the LRU list, because new pages keep being added to it
>> while we scan.
>>
>> Also, we can use the reclaim priority to prefer reclaiming
>> file-backed pages first, for example by triggering ZSR only when
>> the priority drops below 4.
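As a rough illustration of where that gate sits (a sketch under stated
assumptions: zsr_list, zsr_node, ZSR_THRESHOLD and
zsr_split_and_drop_zero() are made-up names, and it assumes a list node
added to struct page for ZSR linkage; sc->priority and the list helpers
are real kernel interfaces, and thp_count_zero_subpages() is the
earlier sketch):

#define ZSR_SCAN_BATCH	32	/* per-pass cap, see the 32-page limit below */
#define ZSR_PRIORITY	4	/* run only under real memory pressure */
#define ZSR_THRESHOLD	128	/* knee of the histogram above */

/*
 * Sketch of a priority-gated ZSR pass (locking elided): walk at most
 * a small batch of huge pages off the ZSR list and split those with
 * enough zero subpages.  Pages are unlinked as they are visited, so
 * each huge page is scanned only once.
 */
static unsigned long zsr_reclaim(struct lruvec *lruvec,
				 struct scan_control *sc)
{
	unsigned long reclaimed = 0;
	struct page *page, *next;
	int scanned = 0;

	/* DEF_PRIORITY is 12 and counts down toward 0 as pressure grows */
	if (sc->priority >= ZSR_PRIORITY)
		return 0;

	list_for_each_entry_safe(page, next, &lruvec->zsr_list, zsr_node) {
		if (scanned++ >= ZSR_SCAN_BATCH)
			break;
		list_del_init(&page->zsr_node);
		if (thp_count_zero_subpages(page) >= ZSR_THRESHOLD)
			reclaimed += zsr_split_and_drop_zero(page);
	}
	return reclaimed;
}

The point of the gate is that file-backed reclaim gets the first
several priority rounds to itself before any THP is touched.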
> I'm not sure whether this will help workloads in general. The problem
> is that it doesn't check whether the huge page is "hot"; it just picks
> the first huge page off the list, which IIUC is a FIFO list. But if
> the huge page is "hot", then even with some internal access imbalance
> it may be better to keep it, since the performance gain may outweigh
> the memory saving. And if the huge page is not "hot", then the
> question is why it became a THP in the first place.

We don't split all the huge pages; we only split a huge page that
contains enough zero subpages. It is hard to check whether an anonymous
page is hot or cold, and we are working on that.

We scan at most 32 huge pages per pass, except in the last reclaim
loop. I think we could start ZSR only when the priority is 1 or 2, or
maybe only when it is 0. At that point, if we didn't start ZSR, the
process would be killed by the OOM killer.

> Let's step back and think about whether allocating a THP upon first
> access is good for such an area or workload at all. We should be able
> to check the access imbalance at allocation time instead of at reclaim
> time. Currently anonymous THP supports just three modes: always,
> madvise and never. Both always and madvise try to allocate a THP in
> the page-fault path (assuming anonymous THP) upon first access. I'm
> wondering if we could add a "defer" mode, which would defer THP
> allocation/collapse to khugepaged instead of the page-fault path. Then
> all the knobs used by khugepaged would apply, particularly
> max_ptes_none in your case. You could set a low max_ptes_none if you
> prefer memory saving. IMHO, this seems much simpler than scanning a
> possibly quite long list to find suitable candidates, splitting them,
> and then replacing subpages with the zero page.
>
> Of course this may have some performance impact, since the THP
> installation is delayed for some time. That could be mitigated by
> respecting MADV_HUGEPAGE.
>
> Anyway, just a wild idea.
>
>>>> Yu Zhao has done some similar work to speed things up when a huge
>>>> page is swapped out or migrated [1]. We do it in the normal
>>>> memory-shrink path instead, for the swap-off case, to avoid OOM.
>>>>
>>>> In the future we will also reclaim "cold" huge pages proactively,
>>>> to preserve the performance of THP as far as possible. Beyond
>>>> that, some users want memory usage with THP to equal the usage
>>>> with 4K pages.
>>> Proactive reclaim can be harmful if your max_ptes_none setting
>>> allows khugepaged to recreate the THP afterwards.
>> Thanks! We will consider it.
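For other readers, the hazard Kirill points at: khugepaged counts both
empty PTEs and zero-page PTEs against max_ptes_none, so a huge page
that ZSR has split and refilled with ZERO_PAGE(0) mappings is an
immediate re-collapse candidate. A condensed sketch of that check
(simplified from my reading of khugepaged, not the exact kernel code):

/*
 * Empty PTEs and zero-page PTEs both count as "none", so with the
 * default max_ptes_none of 511 a single populated PTE is enough to
 * pull a split range straight back into a THP.
 */
static bool range_worth_collapsing(pte_t *ptep, unsigned int max_ptes_none)
{
	unsigned int i, none_or_zero = 0;

	for (i = 0; i < HPAGE_PMD_NR; i++) {
		pte_t pte = ptep[i];

		if (pte_none(pte) ||
		    (pte_present(pte) && is_zero_pfn(pte_pfn(pte))))
			none_or_zero++;
	}
	return none_or_zero <= max_ptes_none;
}

So a proactive ZSR pass only sticks if max_ptes_none is lowered (or
khugepaged is kept away from those VMAs); otherwise the split is undone
and the bloat comes back.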