From: Yang Shi <shy828301@gmail.com>
Date: Fri, 29 Oct 2021 09:56:14 -0700
Subject: Re: [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat
To: ning zhang
Cc: "Kirill A. Shutemov", Linux MM, Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov, Yu Zhao, Gang Deng
References: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com> <20211028141333.kgcjgsnrrjuq4hjx@box.shutemov.name>

On Fri, Oct 29, 2021 at 5:08 AM ning zhang wrote:
>
> On 2021/10/28 10:13 PM, Kirill A. Shutemov wrote:
> > On Thu, Oct 28, 2021 at 07:56:49PM +0800, Ning Zhang wrote:
> >> As we know, THP may lead to memory bloat, which may cause OOM.
> >> Testing with some apps, we found that the cause of the memory
> >> bloat is that a huge page may contain some zero subpages (whether
> >> accessed or not), and that most zero subpages are concentrated in
> >> a few huge pages.
> >>
> >> The following is a text_classification_rnn case for TensorFlow:
> >>
> >>   zero_subpages   huge_pages    waste
> >>   [   0,   1)           186     0.00%
> >>   [   1,   2)            23     0.01%
> >>   [   2,   4)            36     0.02%
> >>   [   4,   8)            67     0.08%
> >>   [   8,  16)            80     0.23%
> >>   [  16,  32)           109     0.61%
> >>   [  32,  64)            44     0.49%
> >>   [  64, 128)            12     0.30%
> >>   [ 128, 256)            28     1.54%
> >>   [ 256, 513)           159    18.03%
> >>
> >> In this case, there are 187 huge pages (25% of the total huge pages)
> >> that contain more than 128 zero subpages, and these huge pages waste
> >> 19.57% of the total RSS. This means we could reclaim 19.57% of memory
> >> by splitting those 187 huge pages and reclaiming their zero subpages.
> >>
> >> This patchset introduces a new mechanism to split huge pages that
> >> contain zero subpages and reclaim those zero subpages.
> >>
> >> We add each anonymous huge page to a list to reduce the cost of
> >> finding candidate huge pages. When memory reclaim is triggered,
> >> the list is walked, and huge pages that contain enough zero
> >> subpages may be reclaimed; the zero subpages are replaced by
> >> ZERO_PAGE(0).
> >
> > Does it actually help your workload?
> >
> > I mean, this will only be triggered via vmscan, which was going to
> > split the pages and free them anyway.
> >
> > You prioritize splitting THPs and freeing zero subpages over
> > reclaiming other pages. That may or may not be the right thing to
> > do, depending on the workload.
> >
> > Maybe it makes more sense to check for all-zero pages just after
> > split_huge_page_to_list() in vmscan and free such pages immediately,
> > rather than add all this complexity?
>
> The purpose of zero subpage reclaim (ZSR) is to pick out the huge
> pages that have waste and reclaim them.
>
> We do this for two reasons:
> 1. If swap is off, anonymous pages will not be scanned, so we never
>    get the opportunity to split the huge page. ZSR helps here.
> 2. If swap is on, splitting first will not only split the huge page
>    but also swap out the nonzero subpages, while ZSR only splits the
>    huge page. Splitting first results in more performance
>    degradation. If ZSR can't reclaim enough pages, swap can still
>    work.
>
> Why use a separate ZSR list instead of the default LRU list?
>
> Because scanning for target huge pages may cause high CPU overhead
> when there are a lot of both regular and huge pages, and it can be
> especially bad when swap is off, since we may scan the whole LRU list
> many times. A huge page is deleted from the ZSR list once it has been
> scanned, so each page is scanned only once.
It's hard to use LRU list, because it may add new pages into LRU li= st > continuously when scanning. > > Also, we can decrease the priority to prioritize reclaiming file-backed > page. > For example, only triggerring ZSR when the priority is less than 4. I'm not sure if this will help the workloads in general or not. The problem is it doesn't check if the huge page is "hot" or not. It just picks up the first huge page from the list, which seems like a FIFO list IIUC. But if the huge page is "hot" even though there is some internal access imbalance it may be better to keep the huge page since the performance gain may outperform the memory saving. But if the huge page is not "hot", then I think the question is why it is a THP in the first place. Let's step back to think about whether allocating THP upon first access for such area or workload is good or not. We should be able to check the access imbalance in allocation stage instead of reclaim stage. Currently anonymous THP just supports 3 modes: always, madvise and none. Both always and madvise tries to allocate THP in page fault path (assuming anonymous THP) upon first access. I'm wondering if we could add a "defer" mode or not. It defers THP allocation/collapse to khugepaged instead of in page fault path. Then all the knobs used by khugepaged could be applied, particularly max_ptes_none in your case. You could set a low max_ptes_none if you prefer memory saving. IMHO, this seems much simpler than scanning list (may be quite long) to find out suitable candidate then split then replace to zero page. Of course this may have some potential performance impact since the THP install is delayed for some time. This could be optimized by respecting MADV_HUGEPAGE. Anyway, just some wild idea. > >> Yu Zhao has done some similar work when the huge page is swap out > >> or migrated to accelerate[1]. While we do this in the normal memory > >> shrink path for the swapoff scene to avoid OOM. 
> >>
> >> In the future, we will do proactive reclaim to reclaim "cold"
> >> huge pages proactively, to preserve the performance of THP as far
> >> as possible. In addition, some users want the memory usage with
> >> THP to be equal to the usage with 4K pages.
> >
> > Proactive reclaim can be harmful if your max_ptes_none allows the
> > THP to be recreated.
>
> Thanks! We will consider it.
>
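P.S. For anyone wanting to experiment with the max_ptes_none angle
mentioned above: the khugepaged knobs involved already exist under the
standard THP sysfs paths; only the "defer" mode value itself is
hypothetical (it is not in mainline). A rough sketch, run as root:

```shell
# Current anonymous THP modes; a "defer" mode would be a fourth value here.
cat /sys/kernel/mm/transparent_hugepage/enabled

# max_ptes_none: khugepaged refuses to collapse a 2M region if more than
# this many of its 512 PTEs are unmapped/none. Lowering it (default 511)
# trades THP coverage for less memory bloat from zero-filled subpages.
echo 64 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

# How often khugepaged wakes up to scan for collapse candidates.
cat /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
```

With a hypothetical "defer" mode, these knobs would govern first-time
THP creation as well, not just re-collapse after a split.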