From: ning zhang <ningzhang@linux.alibaba.com>
To: Yang Shi <shy828301@gmail.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>,
Linux MM <linux-mm@kvack.org>,
Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
Yu Zhao <yuzhao@google.com>,
Gang Deng <gavin.dg@linux.alibaba.com>
Subject: Re: [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat
Date: Mon, 1 Nov 2021 10:50:28 +0800 [thread overview]
Message-ID: <30787ee3-895c-09b7-ebec-2f5885ac9769@linux.alibaba.com> (raw)
In-Reply-To: <CAHbLzkoTUCKnkWkj4Pc-xtcijieG61xqkg8pb10Equo_MtiV3A@mail.gmail.com>
On 2021/10/30 at 12:56 AM, Yang Shi wrote:
> On Fri, Oct 29, 2021 at 5:08 AM ning zhang <ningzhang@linux.alibaba.com> wrote:
>>
>> On 2021/10/28 at 10:13 PM, Kirill A. Shutemov wrote:
>>> On Thu, Oct 28, 2021 at 07:56:49PM +0800, Ning Zhang wrote:
>>>> As we know, THP may lead to memory bloat, which may cause OOM.
>>>> Through testing with some apps, we found that the cause of the
>>>> bloat is that a huge page may contain zero subpages (whether
>>>> accessed or not), and that most zero subpages are concentrated
>>>> in a few huge pages.
>>>>
>>>> Following is a text_classification_rnn case for tensorflow:
>>>>
>>>> zero_subpages   huge_pages   waste
>>>> [   0,   1)        186       0.00%
>>>> [   1,   2)         23       0.01%
>>>> [   2,   4)         36       0.02%
>>>> [   4,   8)         67       0.08%
>>>> [   8,  16)         80       0.23%
>>>> [  16,  32)        109       0.61%
>>>> [  32,  64)         44       0.49%
>>>> [  64, 128)         12       0.30%
>>>> [ 128, 256)         28       1.54%
>>>> [ 256, 513)        159      18.03%
>>>>
>>>> In this case, there are 187 huge pages (25% of the total huge
>>>> pages) that contain more than 128 zero subpages (159 + 28 from
>>>> the last two buckets). These huge pages account for 19.57%
>>>> (18.03% + 1.54%) of the total RSS, so we can reclaim 19.57% of
>>>> memory by splitting the 187 huge pages and reclaiming their zero
>>>> subpages.
>>>>
>>>> This patchset introduces a new mechanism to split huge pages that
>>>> contain zero subpages and reclaim those zero subpages.
>>>>
>>>> We add each anonymous huge page to a list to reduce the cost of
>>>> finding such huge pages. When memory reclaim is triggered, the
>>>> list is walked, and a huge page containing enough zero subpages
>>>> may be reclaimed; its zero subpages are replaced with
>>>> ZERO_PAGE(0).
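
(For illustration, a minimal sketch of the detection step;
is_zero_subpage() is a hypothetical name, while memchr_inv() and
kmap_local_page() are existing kernel primitives:)

    static bool is_zero_subpage(struct page *subpage)
    {
        void *addr;
        bool zero;

        addr = kmap_local_page(subpage);
        /* memchr_inv() returns NULL if every byte equals 0. */
        zero = !memchr_inv(addr, 0, PAGE_SIZE);
        kunmap_local(addr);

        return zero;
    }
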
>>> Does it actually help your workload?
>>>
>>> I mean this will only be triggered via vmscan that was going to split
>>> pages and free anyway.
>>>
>>> You prioritize splitting THP and freeing zero subpages over reclaiming
>>> other pages. It may or may not be the right thing to do, depending on
>>> the workload.
>>>
>>> Maybe it makes more sense to check for all-zero pages just after
>>> split_huge_page_to_list() in vmscan and free such pages immediately,
>>> rather than adding all this complexity?
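
(A rough sketch of that alternative, reusing the hypothetical
is_zero_subpage() helper above; split_huge_page_to_list() is the
existing function, and when called from shrink_page_list() the freed
tail pages end up on the given list:)

    /* In shrink_page_list(), after the THP was going to be split anyway: */
    if (!split_huge_page_to_list(page, page_list)) {
        struct page *sub, *next;

        /* Walk the list, which now also holds the tail pages. */
        list_for_each_entry_safe(sub, next, page_list, lru) {
            /* Free all-zero subpages at once instead of swapping them. */
            if (is_zero_subpage(sub))
                free_zero_subpage(sub);  /* hypothetical helper */
        }
    }
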
>>>
>> The purpose of zero subpage reclaim (ZSR) is to pick out the huge
>> pages that contain waste and reclaim them.
>>
>> We do this for two reasons:
>> 1. If swap is off, anonymous pages will not be scanned, so we never
>>    get the opportunity to split the huge page. ZSR helps in this case.
>> 2. If swap is on, splitting first will not only split the huge page
>>    but also swap out the nonzero subpages, while ZSR only splits the
>>    huge page. Splitting first therefore causes more performance
>>    degradation. If ZSR can't reclaim enough pages, swap can still
>>    work.
>>
>> Why use a separate ZSR list instead of the default LRU list?
>>
>> Because scanning for target huge pages may cause high CPU overhead
>> when there are a lot of both regular and huge pages, and it can be
>> especially bad when swap is off: we may scan the whole LRU list many
>> times. A huge page is deleted from the ZSR list when it is scanned,
>> so each page is scanned only once. The LRU list is hard to use for
>> this, because new pages may be added to it continuously while we are
>> scanning.
>>
>> Also, we can use the reclaim priority to prefer reclaiming
>> file-backed pages first, for example by only triggering ZSR when the
>> priority is less than 4.
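
(A sketch of that priority gate; ZSR_PRIO and zsr_reclaim() are
hypothetical names, while struct scan_control and sc->priority exist
in mm/vmscan.c and count down from DEF_PRIORITY == 12:)

    #define ZSR_PRIO 4  /* hypothetical threshold */

    static void maybe_zsr_reclaim(struct lruvec *lruvec,
                                  struct scan_control *sc)
    {
        /*
         * sc->priority counts down toward 0 as reclaim becomes more
         * desperate; fall back to ZSR only once file-backed reclaim
         * has had its chance.
         */
        if (sc->priority < ZSR_PRIO)
            zsr_reclaim(lruvec, sc);
    }
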
> I'm not sure if this will help the workloads in general or not. The
> problem is it doesn't check if the huge page is "hot" or not. It just
> picks up the first huge page from the list, which seems like a FIFO
> list IIUC. But if the huge page is "hot", even though there is some
> internal access imbalance, it may be better to keep the huge page,
> since the performance gain may outweigh the memory saving. But if the huge
> page is not "hot", then I think the question is why it is a THP in the
> first place.
We don't split all huge pages; we only split a huge page that
contains enough zero subpages. It's hard to check whether an
anonymous page is hot or cold, and we are working on that.
When reclaiming, we scan at most 32 huge pages per pass, except
in the last loop (see the sketch below). I think we could start
ZSR only when the priority is 1 or 2, or maybe only when it is 0;
in that case, if we don't start ZSR, the process will be killed
by the OOM killer.
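
(A sketch of the bounded walk; all zsr_* names are hypothetical:)

    #define ZSR_SCAN_BATCH 32  /* max huge pages scanned per pass */

    static unsigned long zsr_scan(struct list_head *zsr_list, int threshold)
    {
        struct page *page, *next;
        unsigned long reclaimed = 0;
        int scanned = 0;

        list_for_each_entry_safe(page, next, zsr_list, lru) {
            if (scanned++ >= ZSR_SCAN_BATCH)
                break;
            /* Delete from the list so each page is scanned only once. */
            list_del_init(&page->lru);
            if (zsr_count_zero_subpages(page) >= threshold)
                reclaimed += zsr_split_and_reclaim(page);
        }

        return reclaimed;
    }
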
>
> Let's step back and think about whether allocating THP upon first
> access for such an area or workload is good or not. We should be able
> to check the access imbalance at allocation stage instead of reclaim
> stage. Currently anonymous THP just supports 3 modes: always, madvise
> and none. Both always and madvise try to allocate THP in the page
> fault path (assuming anonymous THP) upon first access. I'm wondering
> if we could add a "defer" mode. It would defer THP allocation/collapse
> to khugepaged instead of the page fault path. Then all the knobs used
> by khugepaged could be applied, particularly max_ptes_none in your
> case. You could set a low max_ptes_none if you prefer memory saving.
> IMHO, this seems much simpler than scanning a list (which may be quite
> long) to find a suitable candidate, then splitting it, then replacing
> with the zero page.
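
A "defer" mode could indeed reuse the existing max_ptes_none gate. For
reference, this is roughly what __collapse_huge_page_isolate() in
mm/khugepaged.c already does (variable names follow the kernel; the
surrounding setup is elided):

    pte_t *_pte;
    int none_or_zero = 0;

    for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte++) {
        if (pte_none(*_pte) &&
            ++none_or_zero > khugepaged_max_ptes_none)
            return SCAN_EXCEED_NONE_PTE;  /* too sparse, skip collapse */
    }

The knob is exposed at
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none and
defaults to HPAGE_PMD_NR - 1 (511 on x86-64), so lowering it would
directly encode a memory-saving preference.
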
>
> Of course this may have some performance impact, since the THP
> installation is delayed for some time. This could be optimized by
> respecting MADV_HUGEPAGE.
>
> Anyway, just some wild idea.
>
>>>> Yu Zhao has done some similar work to speed up the case where a
>>>> huge page is swapped out or migrated [1], while we do this in the
>>>> normal memory shrink path, for the swap-off scenario, to avoid OOM.
>>>>
>>>> In the future, we will reclaim "cold" huge pages proactively. This
>>>> is to keep the performance benefit of THP as far as possible. In
>>>> addition, some users want the memory usage with THP to be equal to
>>>> the usage with 4K pages.
>>> Proactive reclaim can be harmful if your max_ptes_none allows the
>>> THP to be recreated.
>> Thanks! We will consider it.
Thread overview: 17+ messages
2021-10-28 11:56 Ning Zhang
2021-10-28 11:56 ` [RFC 1/6] mm, thp: introduce thp zero subpages reclaim Ning Zhang
2021-10-28 12:53 ` Matthew Wilcox
2021-10-29 12:16 ` ning zhang
2021-10-28 11:56 ` [RFC 2/6] mm, thp: add a global interface for zero subapges reclaim Ning Zhang
2021-10-28 11:56 ` [RFC 3/6] mm, thp: introduce zero subpages reclaim threshold Ning Zhang
2021-10-28 11:56 ` [RFC 4/6] mm, thp: introduce a controller to trigger zero subpages reclaim Ning Zhang
2021-10-28 11:56 ` [RFC 5/6] mm, thp: add some statistics for " Ning Zhang
2021-10-28 11:56 ` [RFC 6/6] mm, thp: add document " Ning Zhang
2021-10-28 14:13 ` [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Kirill A. Shutemov
2021-10-29 12:07 ` ning zhang
2021-10-29 16:56 ` Yang Shi
2021-11-01 2:50 ` ning zhang [this message]
2021-10-29 13:38 ` Michal Hocko
2021-10-29 16:12 ` ning zhang
2021-11-01 9:20 ` Michal Hocko
2021-11-08 3:24 ` ning zhang