Subject: Re: [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat
From: ning zhang <ningzhang@linux.alibaba.com>
To: Yang Shi
Cc: "Kirill A. Shutemov", Linux MM, Andrew Morton, Johannes Weiner,
 Michal Hocko, Vladimir Davydov, Yu Zhao, Gang Deng
Date: Mon, 1 Nov 2021 10:50:28 +0800
Message-ID: <30787ee3-895c-09b7-ebec-2f5885ac9769@linux.alibaba.com>
References: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com>
 <20211028141333.kgcjgsnrrjuq4hjx@box.shutemov.name>

On 2021/10/30 12:56 AM, Yang Shi wrote:
> On Fri, Oct 29, 2021 at 5:08 AM ning zhang wrote:
>>
>> On 2021/10/28 10:13 PM, Kirill A. Shutemov wrote:
>>> On Thu, Oct 28, 2021 at 07:56:49PM +0800, Ning Zhang wrote:
>>>> As we know, THP may lead to memory bloat, which may cause OOM.
>>>> Testing some apps, we found that the bloat comes from huge pages
>>>> that contain zero subpages (whether accessed or not), and that
>>>> most zero subpages are concentrated in a few huge pages.
>>>>
>>>> Following is a text_classification_rnn case for tensorflow:
>>>>
>>>>   zero_subpages    huge_pages    waste
>>>>   [  0,   1)          186        0.00%
>>>>   [  1,   2)           23        0.01%
>>>>   [  2,   4)           36        0.02%
>>>>   [  4,   8)           67        0.08%
>>>>   [  8,  16)           80        0.23%
>>>>   [ 16,  32)          109        0.61%
>>>>   [ 32,  64)           44        0.49%
>>>>   [ 64, 128)           12        0.30%
>>>>   [128, 256)           28        1.54%
>>>>   [256, 513)          159       18.03%
>>>>
>>>> In this case, 187 huge pages (25% of the total huge pages) contain
>>>> 128 or more zero subpages each, and they waste 19.57% of the total
>>>> RSS. That means we could reclaim 19.57% of the memory by splitting
>>>> those 187 huge pages and reclaiming their zero subpages.
>>>>
>>>> This patchset introduces a mechanism to split huge pages that have
>>>> zero subpages and to reclaim those subpages.
>>>>
>>>> We add each anonymous huge page to a list to reduce the cost of
>>>> finding candidates. When memory reclaim is triggered, the list is
>>>> walked, and a huge page that contains enough zero subpages may be
>>>> reclaimed: it is split, and its zero subpages are replaced by
>>>> ZERO_PAGE(0).
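To make the detection step concrete for readers new to the thread: it
boils down to mapping each 4K subpage and checking whether it is all
zeroes. A minimal sketch of the idea (the helper name is made up for
illustration and this is not the actual patch code; thp_nr_pages(),
kmap_local_page() and memchr_inv() are the real kernel interfaces):

/*
 * Illustrative sketch only: count the all-zero 4K subpages of a THP
 * so the caller can decide whether splitting it is worthwhile.
 * Locking and page-state checks are elided.
 */
static int thp_count_zero_subpages(struct page *head)
{
	int i, zero = 0;

	for (i = 0; i < thp_nr_pages(head); i++) {
		void *addr = kmap_local_page(head + i);

		/* memchr_inv() returns NULL if the whole page is zero */
		if (!memchr_inv(addr, 0, PAGE_SIZE))
			zero++;
		kunmap_local(addr);
	}
	return zero;
}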
>>> Does it actually help your workload?
>>>
>>> I mean, this will only be triggered via vmscan, which was going to
>>> split the pages and free them anyway.
>>>
>>> You prioritize splitting THPs and freeing zero subpages over
>>> reclaiming other pages. That may or may not be the right thing to
>>> do, depending on the workload.
>>>
>>> Maybe it makes more sense to check for all-zero pages just after
>>> split_huge_page_to_list() in vmscan and free such pages immediately,
>>> rather than add all this complexity?
>>>
>> The purpose of zero subpage reclaim (ZSR) is to pick out the huge
>> pages that contain waste and reclaim them.
>>
>> We do this for two reasons:
>> 1. If swap is off, anonymous pages are not scanned at all, so we
>>    never get the opportunity to split a huge page. ZSR helps here.
>> 2. If swap is on, splitting first not only splits the huge page but
>>    also swaps out the nonzero subpages, while ZSR only splits the
>>    huge page. Splitting first therefore causes more performance
>>    degradation; and if ZSR can't reclaim enough pages, swap still
>>    works as before.
>>
>> Why use a separate ZSR list instead of the default LRU list?
>>
>> Because scanning for target huge pages can cost a lot of CPU when
>> many regular and huge pages coexist, and it can be especially bad
>> when swap is off, since we may end up scanning the whole LRU list
>> many times. A huge page is deleted from the ZSR list once it has
>> been scanned, so each page is scanned only once. That is hard to
>> achieve with the LRU list, because new pages keep being added to it
>> while we scan.
>>
>> Also, we can use the reclaim priority to prefer reclaiming
>> file-backed pages first, for example by triggering ZSR only when
>> the priority drops below 4.
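As a rough illustration of where that gate sits (a sketch under stated
assumptions: zsr_list, zsr_node, ZSR_THRESHOLD and
zsr_split_and_drop_zero() are made-up names, and it assumes a list node
added to struct page for ZSR linkage; sc->priority and the list helpers
are real kernel interfaces, and thp_count_zero_subpages() is the
earlier sketch):

#define ZSR_SCAN_BATCH	32	/* per-pass cap, see the 32-page limit below */
#define ZSR_PRIORITY	4	/* run only under real memory pressure */
#define ZSR_THRESHOLD	128	/* knee of the histogram above */

/*
 * Sketch of a priority-gated ZSR pass (locking elided): walk at most
 * a small batch of huge pages off the ZSR list and split those with
 * enough zero subpages.  Pages are unlinked as they are visited, so
 * each huge page is scanned only once.
 */
static unsigned long zsr_reclaim(struct lruvec *lruvec,
				 struct scan_control *sc)
{
	unsigned long reclaimed = 0;
	struct page *page, *next;
	int scanned = 0;

	/* DEF_PRIORITY is 12 and counts down toward 0 as pressure grows */
	if (sc->priority >= ZSR_PRIORITY)
		return 0;

	list_for_each_entry_safe(page, next, &lruvec->zsr_list, zsr_node) {
		if (scanned++ >= ZSR_SCAN_BATCH)
			break;
		list_del_init(&page->zsr_node);
		if (thp_count_zero_subpages(page) >= ZSR_THRESHOLD)
			reclaimed += zsr_split_and_drop_zero(page);
	}
	return reclaimed;
}

The point of the gate is that file-backed reclaim gets the first
several priority rounds to itself before any THP is touched.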
> I'm not sure whether this will help workloads in general. The problem
> is that it doesn't check whether the huge page is "hot"; it just picks
> the first huge page off the list, which IIUC is a FIFO list. But if
> the huge page is "hot", then even with some internal access imbalance
> it may be better to keep it, since the performance gain may outweigh
> the memory saving. And if the huge page is not "hot", then the
> question is why it became a THP in the first place.

We don't split all the huge pages; we only split a huge page that
contains enough zero subpages. It is hard to check whether an anonymous
page is hot or cold, and we are working on that.

We scan at most 32 huge pages per pass, except in the last reclaim
loop. I think we could start ZSR only when the priority is 1 or 2, or
maybe only when it is 0. At that point, if we didn't start ZSR, the
process would be killed by the OOM killer.

> Let's step back and think about whether allocating a THP upon first
> access is good for such an area or workload at all. We should be able
> to check the access imbalance at allocation time instead of at reclaim
> time. Currently anonymous THP supports just three modes: always,
> madvise and never. Both always and madvise try to allocate a THP in
> the page-fault path (assuming anonymous THP) upon first access. I'm
> wondering if we could add a "defer" mode, which would defer THP
> allocation/collapse to khugepaged instead of the page-fault path. Then
> all the knobs used by khugepaged would apply, particularly
> max_ptes_none in your case. You could set a low max_ptes_none if you
> prefer memory saving. IMHO, this seems much simpler than scanning a
> possibly quite long list to find suitable candidates, splitting them,
> and then replacing subpages with the zero page.
>
> Of course this may have some performance impact, since the THP
> installation is delayed for some time. That could be mitigated by
> respecting MADV_HUGEPAGE.
>
> Anyway, just a wild idea.
>
>>>> Yu Zhao has done some similar work to speed things up when a huge
>>>> page is swapped out or migrated [1]. We do it in the normal
>>>> memory-shrink path instead, for the swap-off case, to avoid OOM.
>>>>
>>>> In the future we will also reclaim "cold" huge pages proactively,
>>>> to preserve the performance of THP as far as possible. Beyond
>>>> that, some users want memory usage with THP to equal the usage
>>>> with 4K pages.
>>> Proactive reclaim can be harmful if your max_ptes_none setting
>>> allows khugepaged to recreate the THP afterwards.
>> Thanks! We will consider it.
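For other readers, the hazard Kirill points at: khugepaged counts both
empty PTEs and zero-page PTEs against max_ptes_none, so a huge page
that ZSR has split and refilled with ZERO_PAGE(0) mappings is an
immediate re-collapse candidate. A condensed sketch of that check
(simplified from my reading of khugepaged, not the exact kernel code):

/*
 * Empty PTEs and zero-page PTEs both count as "none", so with the
 * default max_ptes_none of 511 a single populated PTE is enough to
 * pull a split range straight back into a THP.
 */
static bool range_worth_collapsing(pte_t *ptep, unsigned int max_ptes_none)
{
	unsigned int i, none_or_zero = 0;

	for (i = 0; i < HPAGE_PMD_NR; i++) {
		pte_t pte = ptep[i];

		if (pte_none(pte) ||
		    (pte_present(pte) && is_zero_pfn(pte_pfn(pte))))
			none_or_zero++;
	}
	return none_or_zero <= max_ptes_none;
}

So a proactive ZSR pass only sticks if max_ptes_none is lowered (or
khugepaged is kept away from those VMAs); otherwise the split is undone
and the bloat comes back.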