Subject: Re: [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat
From: ning zhang <ningzhang@linux.alibaba.com>
To: "Kirill A. Shutemov"
Cc: linux-mm@kvack.org, Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov, Yu Zhao, Gang Deng
Date: Fri, 29 Oct 2021 20:07:35 +0800
In-Reply-To: <20211028141333.kgcjgsnrrjuq4hjx@box.shutemov.name>
References: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com> <20211028141333.kgcjgsnrrjuq4hjx@box.shutemov.name>

On 2021/10/28 10:13 PM, Kirill A. Shutemov wrote:
> On Thu, Oct 28, 2021 at 07:56:49PM +0800, Ning Zhang wrote:
>> As we know, THP may lead to memory bloat, which may cause OOM.
>> Through testing with some applications, we found that the cause of
>> the memory bloat is that a huge page may contain zero subpages
>> (whether accessed or not), and that most zero subpages are
>> concentrated in a few huge pages.
>>
>> Following is a text_classification_rnn case for TensorFlow:
>>
>> zero_subpages   huge_pages   waste
>> [   0,   1)        186        0.00%
>> [   1,   2)         23        0.01%
>> [   2,   4)         36        0.02%
>> [   4,   8)         67        0.08%
>> [   8,  16)         80        0.23%
>> [  16,  32)        109        0.61%
>> [  32,  64)         44        0.49%
>> [  64, 128)         12        0.30%
>> [ 128, 256)         28        1.54%
>> [ 256, 513)        159       18.03%
>>
>> In this case, 187 huge pages (25% of the total huge pages) contain
>> more than 128 zero subpages, and those huge pages account for 19.57%
>> of the total RSS. That means we can reclaim 19.57% of memory by
>> splitting the 187 huge pages and reclaiming their zero subpages.
>>
>> This patch set introduces a new mechanism that splits huge pages
>> containing zero subpages and reclaims those subpages.
>>
>> We add each anonymous huge page to a list to reduce the cost of
>> finding candidate huge pages. When memory reclaim is triggered, the
>> list is walked, and a huge page containing enough zero subpages may
>> be reclaimed, with its zero subpages replaced by ZERO_PAGE(0).
> Does it actually help your workload?
>
> I mean this will only be triggered via vmscan that was going to split
> pages and free anyway.
>
> You prioritize splitting THP and freeing zero subpages over reclaiming
> other pages. It may or may not be the right thing to do, depending on
> the workload.
>
> Maybe it makes more sense to check for all-zero pages just after
> split_huge_page_to_list() in vmscan and free such pages immediately
> rather than add all this complexity?
>
The purpose of zero-subpage reclaim (ZSR) is to pick out the huge pages
that contain waste and reclaim them. We do this for two reasons:

1. If swap is off, anonymous pages will not be scanned, so we never get
   the opportunity to split the huge page. ZSR helps in this case.

2. If swap is on, splitting first will not only split the huge page but
   also swap out the nonzero subpages, while ZSR only splits the huge
   page.
   Splitting first will result in more performance degradation. If ZSR
   cannot reclaim enough pages, swap can still work.

Why use a separate ZSR list instead of the default LRU list? Because
scanning for target huge pages can incur high CPU overhead when many
regular pages and huge pages coexist, and it can be especially bad when
swap is off, since we may scan the whole LRU list many times. A huge
page is deleted from the ZSR list once it has been scanned, so each
page is scanned only once. The LRU list is hard to use for this
because new pages may be added to it continuously while we scan.

Also, we can use the reclaim priority to prefer reclaiming file-backed
pages, for example by triggering ZSR only when the priority drops
below 4.

>> Yu Zhao has done some similar work to accelerate splitting when a
>> huge page is swapped out or migrated [1]. We instead do this in the
>> normal memory shrink path, to avoid OOM in the swap-off case.
>>
>> In the future, we will reclaim "cold" huge pages proactively, to
>> preserve the performance benefit of THP as far as possible. Beyond
>> that, some users want memory usage with THP to equal the usage with
>> 4K pages.
> Proactive reclaim can be harmful if your max_ptes_none setting allows
> THP to be recreated.

Thanks! We will consider it.