From: ning zhang <ningzhang@linux.alibaba.com>
To: Yang Shi <shy828301@gmail.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>,
Linux MM <linux-mm@kvack.org>,
Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
Yu Zhao <yuzhao@google.com>,
Gang Deng <gavin.dg@linux.alibaba.com>
Subject: Re: [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat
Date: Mon, 1 Nov 2021 10:50:28 +0800 [thread overview]
Message-ID: <30787ee3-895c-09b7-ebec-2f5885ac9769@linux.alibaba.com> (raw)
In-Reply-To: <CAHbLzkoTUCKnkWkj4Pc-xtcijieG61xqkg8pb10Equo_MtiV3A@mail.gmail.com>
On 2021/10/30 at 12:56 AM, Yang Shi wrote:
> On Fri, Oct 29, 2021 at 5:08 AM ning zhang <ningzhang@linux.alibaba.com> wrote:
>>
>> On 2021/10/28 at 10:13 PM, Kirill A. Shutemov wrote:
>>> On Thu, Oct 28, 2021 at 07:56:49PM +0800, Ning Zhang wrote:
>>>> As we know, THP may lead to memory bloat, which may cause OOM.
>>>> Through testing with some apps, we found that the cause of the
>>>> bloat is that a huge page may contain zero subpages (whether
>>>> accessed or not), and that most zero subpages are concentrated
>>>> in a few huge pages.
>>>>
>>>> Following is a text_classification_rnn case for tensorflow:
>>>>
>>>> zero_subpages   huge_pages   waste
>>>> [   0,   1)        186       0.00%
>>>> [   1,   2)         23       0.01%
>>>> [   2,   4)         36       0.02%
>>>> [   4,   8)         67       0.08%
>>>> [   8,  16)         80       0.23%
>>>> [  16,  32)        109       0.61%
>>>> [  32,  64)         44       0.49%
>>>> [  64, 128)         12       0.30%
>>>> [ 128, 256)         28       1.54%
>>>> [ 256, 513)        159      18.03%
>>>>
>>>> In this case, there are 187 huge pages (25% of the total huge
>>>> pages) that contain more than 128 zero subpages (159 + 28 from
>>>> the last two buckets). These huge pages account for 19.57%
>>>> (18.03% + 1.54%) of the total RSS, so we can reclaim 19.57% of
>>>> memory by splitting the 187 huge pages and reclaiming their zero
>>>> subpages.
>>>>
>>>> This patchset introduces a new mechanism to split huge pages that
>>>> contain zero subpages and reclaim those zero subpages.
>>>>
>>>> We add each anonymous huge page to a list to reduce the cost of
>>>> finding such huge pages. When memory reclaim is triggered, the
>>>> list is walked, and a huge page containing enough zero subpages
>>>> may be reclaimed; its zero subpages are replaced with
>>>> ZERO_PAGE(0).
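
(For illustration, a minimal sketch of the detection step;
is_zero_subpage() is a hypothetical name, while memchr_inv() and
kmap_local_page() are existing kernel primitives:)

    static bool is_zero_subpage(struct page *subpage)
    {
        void *addr;
        bool zero;

        addr = kmap_local_page(subpage);
        /* memchr_inv() returns NULL if every byte equals 0. */
        zero = !memchr_inv(addr, 0, PAGE_SIZE);
        kunmap_local(addr);

        return zero;
    }
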
>>> Does it actually help your workload?
>>>
>>> I mean this will only be triggered via vmscan that was going to split
>>> pages and free anyway.
>>>
>>> You prioritize splitting THP and freeing zero subpages over reclaiming
>>> other pages. It may or may not be the right thing to do, depending on
>>> the workload.
>>>
>>> Maybe it makes more sense to check for all-zero pages just after
>>> split_huge_page_to_list() in vmscan and free such pages immediately,
>>> rather than adding all this complexity?
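
(A rough sketch of that alternative, reusing the hypothetical
is_zero_subpage() helper above; split_huge_page_to_list() is the
existing function, and when called from shrink_page_list() the freed
tail pages end up on the given list:)

    /* In shrink_page_list(), after the THP was going to be split anyway: */
    if (!split_huge_page_to_list(page, page_list)) {
        struct page *sub, *next;

        /* Walk the list, which now also holds the tail pages. */
        list_for_each_entry_safe(sub, next, page_list, lru) {
            /* Free all-zero subpages at once instead of swapping them. */
            if (is_zero_subpage(sub))
                free_zero_subpage(sub);  /* hypothetical helper */
        }
    }
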
>>>
>> The purpose of zero subpage reclaim (ZSR) is to pick out the huge
>> pages that contain waste and reclaim them.
>>
>> We do this for two reasons:
>> 1. If swap is off, anonymous pages will not be scanned, so we never
>>    get the opportunity to split the huge page. ZSR helps in this case.
>> 2. If swap is on, splitting first will not only split the huge page
>>    but also swap out the nonzero subpages, while ZSR only splits the
>>    huge page. Splitting first therefore causes more performance
>>    degradation. If ZSR can't reclaim enough pages, swap can still
>>    work.
>>
>> Why use a separate ZSR list instead of the default LRU list?
>>
>> Because scanning for target huge pages may cause high CPU overhead
>> when there are a lot of both regular and huge pages, and it can be
>> especially bad when swap is off: we may scan the whole LRU list many
>> times. A huge page is deleted from the ZSR list when it is scanned,
>> so each page is scanned only once. The LRU list is hard to use for
>> this, because new pages may be added to it continuously while we are
>> scanning.
>>
>> Also, we can use the reclaim priority to prefer reclaiming
>> file-backed pages first, for example by only triggering ZSR when the
>> priority is less than 4.
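
(A sketch of that priority gate; ZSR_PRIO and zsr_reclaim() are
hypothetical names, while struct scan_control and sc->priority exist
in mm/vmscan.c and count down from DEF_PRIORITY == 12:)

    #define ZSR_PRIO 4  /* hypothetical threshold */

    static void maybe_zsr_reclaim(struct lruvec *lruvec,
                                  struct scan_control *sc)
    {
        /*
         * sc->priority counts down toward 0 as reclaim becomes more
         * desperate; fall back to ZSR only once file-backed reclaim
         * has had its chance.
         */
        if (sc->priority < ZSR_PRIO)
            zsr_reclaim(lruvec, sc);
    }
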
> I'm not sure if this will help the workloads in general or not. The
> problem is it doesn't check if the huge page is "hot" or not. It just
> picks up the first huge page from the list, which seems like a FIFO
> list IIUC. But if the huge page is "hot", even though there is some
> internal access imbalance, it may be better to keep the huge page,
> since the performance gain may outweigh the memory saving. But if the huge
> page is not "hot", then I think the question is why it is a THP in the
> first place.
We don't split all huge pages; we only split a huge page that
contains enough zero subpages. It's hard to check whether an
anonymous page is hot or cold, and we are working on that.
When reclaiming, we scan at most 32 huge pages per pass, except
in the last loop (see the sketch below). I think we could start
ZSR only when the priority is 1 or 2, or maybe only when it is 0;
in that case, if we don't start ZSR, the process will be killed
by the OOM killer.
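
(A sketch of the bounded walk; all zsr_* names are hypothetical:)

    #define ZSR_SCAN_BATCH 32  /* max huge pages scanned per pass */

    static unsigned long zsr_scan(struct list_head *zsr_list, int threshold)
    {
        struct page *page, *next;
        unsigned long reclaimed = 0;
        int scanned = 0;

        list_for_each_entry_safe(page, next, zsr_list, lru) {
            if (scanned++ >= ZSR_SCAN_BATCH)
                break;
            /* Delete from the list so each page is scanned only once. */
            list_del_init(&page->lru);
            if (zsr_count_zero_subpages(page) >= threshold)
                reclaimed += zsr_split_and_reclaim(page);
        }

        return reclaimed;
    }
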
>
> Let's step back and think about whether allocating THP upon first
> access for such an area or workload is good or not. We should be able
> to check the access imbalance at allocation stage instead of reclaim
> stage. Currently anonymous THP just supports 3 modes: always, madvise
> and none. Both always and madvise try to allocate THP in the page
> fault path (assuming anonymous THP) upon first access. I'm wondering
> if we could add a "defer" mode. It would defer THP allocation/collapse
> to khugepaged instead of the page fault path. Then all the knobs used
> by khugepaged could be applied, particularly max_ptes_none in your
> case. You could set a low max_ptes_none if you prefer memory saving.
> IMHO, this seems much simpler than scanning a list (which may be quite
> long) to find a suitable candidate, then splitting it, then replacing
> with the zero page.
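
A "defer" mode could indeed reuse the existing max_ptes_none gate. For
reference, this is roughly what __collapse_huge_page_isolate() in
mm/khugepaged.c already does (variable names follow the kernel; the
surrounding setup is elided):

    pte_t *_pte;
    int none_or_zero = 0;

    for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte++) {
        if (pte_none(*_pte) &&
            ++none_or_zero > khugepaged_max_ptes_none)
            return SCAN_EXCEED_NONE_PTE;  /* too sparse, skip collapse */
    }

The knob is exposed at
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none and
defaults to HPAGE_PMD_NR - 1 (511 on x86-64), so lowering it would
directly encode a memory-saving preference.
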
>
> Of course this may have some performance impact, since the THP
> installation is delayed for some time. This could be optimized by
> respecting MADV_HUGEPAGE.
>
> Anyway, just some wild idea.
>
>>>> Yu Zhao has done some similar work to speed up the case where a
>>>> huge page is swapped out or migrated [1], while we do this in the
>>>> normal memory shrink path, for the swap-off scenario, to avoid OOM.
>>>>
>>>> In the future, we will reclaim "cold" huge pages proactively. This
>>>> is to keep the performance benefit of THP as far as possible. In
>>>> addition, some users want the memory usage with THP to be equal to
>>>> the usage with 4K pages.
>>> Proactive reclaim can be harmful if your max_ptes_none allows the
>>> THP to be recreated.
>> Thanks! We will consider it.
Thread overview: 17+ messages
2021-10-28 11:56 Ning Zhang
2021-10-28 11:56 ` [RFC 1/6] mm, thp: introduce thp zero subpages reclaim Ning Zhang
2021-10-28 12:53 ` Matthew Wilcox
2021-10-29 12:16 ` ning zhang
2021-10-28 11:56 ` [RFC 2/6] mm, thp: add a global interface for zero subapges reclaim Ning Zhang
2021-10-28 11:56 ` [RFC 3/6] mm, thp: introduce zero subpages reclaim threshold Ning Zhang
2021-10-28 11:56 ` [RFC 4/6] mm, thp: introduce a controller to trigger zero subpages reclaim Ning Zhang
2021-10-28 11:56 ` [RFC 5/6] mm, thp: add some statistics for " Ning Zhang
2021-10-28 11:56 ` [RFC 6/6] mm, thp: add document " Ning Zhang
2021-10-28 14:13 ` [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Kirill A. Shutemov
2021-10-29 12:07 ` ning zhang
2021-10-29 16:56 ` Yang Shi
2021-11-01 2:50 ` ning zhang [this message]
2021-10-29 13:38 ` Michal Hocko
2021-10-29 16:12 ` ning zhang
2021-11-01 9:20 ` Michal Hocko
2021-11-08 3:24 ` ning zhang