Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization

From: Yu Zhao <yuzhao@google.com>
To: Yang Shi <shy828301@gmail.com>
Cc: "Alex Zhu (Kernel)" <alexlzhu@fb.com>, Rik van Riel <riel@fb.com>,
	Kernel Team <Kernel-team@fb.com>,
	 "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"willy@infradead.org" <willy@infradead.org>,
	 "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	 "akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	Ning Zhang <ningzhang@linux.alibaba.com>,
	 Miaohe Lin <linmiaohe@huawei.com>
Subject: Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Date: Thu, 11 Aug 2022 16:59:28 -0600
Message-ID: <CAOUHufaqV1diQ8W1omcxMoMcbgEjR7n3MJct=-v=dPttx20KMA@mail.gmail.com>
In-Reply-To: <CAHbLzkpRVOOGK2d0L25A4PKfEbHngm-WEWWH34UYVg1O40Tiig@mail.gmail.com>

On Thu, Aug 11, 2022 at 4:12 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Aug 11, 2022 at 2:55 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Thu, Aug 11, 2022 at 1:20 PM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
> > >
> > > Hi Yu,
> > >
> > > I’ve updated your patch set from last year to work with folio and am testing it now. The functionality in split_huge_page() is the same as what I have. Was there any follow up work done later?
> >
> > Yes, but it won't change the landscape any time soon (see below). So
> > please feel free to continue along your current direction.
> >
> > > If not, I would like to incorporate this into what I have and then resubmit, referencing the original patchset. We need this functionality for the shrinker, but even the changes to split_huge_page() by themselves should show some performance improvement when used by the existing deferred_split_huge_page().
> >
> > SGTM. Thanks!
> >
> > A side note:
> >
> > I'm working on a new mode: THP=auto, meaning the kernel will detect
> > internal fragmentation of 2MB compound pages to decide whether to map
> > them by PMDs or split them under memory pressure. The general workflow
> > of this new mode is as follows.
>
> I tend to agree that avoiding allocating THP in the first place is the
> preferred way to avoid internal fragmentation. But I got some
> questions about your design/implementation:
>
> >
> > In the page fault path:
> > 1. Compound pages are allocated as usual.
> > 2. Each is mapped by 512 consecutive PTEs rather than a PMD.
> > 3. There will be more TLB misses but the same number of page faults.
> > 4. TLB coalescing can mitigate the performance degradation.
>
> Why not just allocate base pages in the first place? Khugepaged has
> the max_ptes_none tunable to detect internal fragmentation. If you worry
> about the zero page, you could add a max_pte_zero tunable.
>
> Or did you investigate whether the new MADV_COLLAPSE may be helpful or
> not? It leaves the decision to the userspace.

There are two problems we have to work around:
1. External fragmentation that prevents later compound page allocations.
2. The cost of taking mmap_lock for write.

IIRC, the first reference I listed describes the first problem. (It
uses a similar reservation technique.) From a very high level, smaller
page allocations add more entropy than larger ones and accelerate the
system toward equilibrium, in which state the system can't allocate
more THPs without exerting additional force (compaction).

Reserving compound pages delays the equilibrium whereas MADV_COLLAPSE
tries to reverse the equilibrium. The latter has a higher cost. In
addition, it needs to take mmap_lock for write.

> > In khugepaged:
> > 1. Check the dirty bit in the PTEs mapping a compound page, to
> > determine its utilization.
> > 2. Remap compound pages that meet a certain utilization threshold by
> > PMDs in place, i.e., no migrations.
> >
> > In the reclaim path, e.g., MGLRU page table scanning:
> > 1. Decide whether compound pages mapped by PTEs should be split based
> > on their utilizations and memory pressure, e.g., reclaim priority.
> > 2. Clean subpages should be freed directly after split, rather than swapped out.
> >
> > N.B.
> > 1. This workflow relies on the dirty bit rather than examining the content of a page.
> > 2. Sampling can be done by periodically switching between a PMD and
> > 512 consecutive PTEs.
> > 3. It only needs to hold mmap_lock for read because this special mode
> > (512 consecutive PTEs) is not considered the split mode.
> > 4. Don't hold your breath :)
> >
> > Other references:
> > 1. https://www.usenix.org/system/files/atc20-zhu-weixi_0.pdf
> > 2. https://www.usenix.org/system/files/osdi21-hunter.pdf
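
To make the khugepaged and reclaim steps quoted above a bit more
concrete, here is a minimal userspace sketch, not kernel code. It
assumes a 2MB compound page mapped by 512 4KB PTEs and measures
utilization as the number of PTEs with the dirty bit set. The names
(thp_utilization, thp_decide, PTE_DIRTY), the two thresholds and the
boolean pressure flag are all hypothetical; the real heuristics (e.g.,
how reclaim priority is factored in) are not settled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one 2MB compound page mapped by 512 4KB PTEs. */
#define SUBPAGES_PER_THP 512
#define PTE_DIRTY 0x1u /* illustrative flag bit, not a real arch encoding */

enum thp_action { THP_KEEP_PTES, THP_REMAP_PMD, THP_SPLIT };

/* Utilization: number of subpages whose PTE has the dirty bit set. */
static size_t thp_utilization(const uint32_t ptes[SUBPAGES_PER_THP])
{
	size_t dirty = 0;

	for (size_t i = 0; i < SUBPAGES_PER_THP; i++)
		if (ptes[i] & PTE_DIRTY)
			dirty++;
	return dirty;
}

/*
 * khugepaged side: remap by a PMD in place once utilization reaches
 * remap_threshold. Reclaim side: split when under memory pressure and
 * utilization is below split_threshold. Otherwise keep the 512 PTEs.
 */
static enum thp_action thp_decide(const uint32_t ptes[SUBPAGES_PER_THP],
				  size_t remap_threshold,
				  size_t split_threshold,
				  bool under_pressure)
{
	size_t used;

	assert(split_threshold <= remap_threshold);
	assert(remap_threshold <= SUBPAGES_PER_THP);

	used = thp_utilization(ptes);
	if (used >= remap_threshold)
		return THP_REMAP_PMD;
	if (under_pressure && used < split_threshold)
		return THP_SPLIT;
	return THP_KEEP_PTES;
}

/* After a split, clean subpages can be freed directly (no swap-out). */
static size_t thp_freeable_after_split(const uint32_t ptes[SUBPAGES_PER_THP])
{
	return SUBPAGES_PER_THP - thp_utilization(ptes);
}
```

With 100 of 512 subpages dirty, a remap threshold of 256 and a split
threshold of 128, this sketch keeps the PTE mapping when there is no
pressure and splits under pressure, leaving 412 clean subpages that
could be freed directly.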
