Re: [PATCH v3 0/6] mm: split underutilized THPs

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Usama Arif <usamaarif642@gmail.com>
To: Andi Kleen <ak@linux.intel.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
	roman.gushchin@linux.dev, yuzhao@google.com, david@redhat.com,
	baohua@kernel.org, ryan.roberts@arm.com, rppt@kernel.org,
	willy@infradead.org, cerasuolodomenico@gmail.com, corbet@lwn.net,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	kernel-team@meta.com
Subject: Re: [PATCH v3 0/6] mm: split underutilized THPs
Date: Wed, 14 Aug 2024 11:13:35 +0100	[thread overview]
Message-ID: <e73c801a-8f36-44d8-a267-dd993aeccf35@gmail.com> (raw)
In-Reply-To: <87y150mj6f.fsf@linux.intel.com>

On 13/08/2024 18:22, Andi Kleen wrote:
> Usama Arif <usamaarif642@gmail.com> writes:
>>
>> This patch-series is an attempt to mitigate the issue of running out of
>> memory when THP is always enabled. During runtime whenever a THP is being
>> faulted in or collapsed by khugepaged, the THP is added to a list.
>> Whenever memory reclaim happens, the kernel runs the deferred_split
>> shrinker which goes through the list and checks if the THP was underutilized,
>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
> 
> Sometimes when writing a benchmark I fill things with zero explictly
> to avoid faults later. For example if you want to measure memory
> read bandwidth you need to fault the pages first, but that fault
> pattern may well be zero.
> 
> With your patch if there is memory pressure there are two effects:
> 
> - If things are remapped to the zero page the benchmark
> reading memory may give unrealistically good results because
> what is thinks is a big memory area is actually only backed
> by a single page.
> 
> - If I expect to write I may end up with an unexpected zeropage->real
> memory fault if the pages got remapped. 
> 
> I expect such patterns can happen without benchmarking too.
> I could see it being a problem for latency sensitive applications.
> 
> Now you could argue that this all should only happen under memory
> pressure and when that happens things may be slow anyways and your
> patch will still be an improvement.
> 
> Maybe that's true but there might be still corner cases
> which are negatively impacted by this. I don't have a good solution
> other than a tunable, but I expect it will cause problems for someone.
> 

There are currently 2 knobs to control behaviour of THP low utilization shrinker introduced in this series.

/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none:
The current default value for this is HPAGE_PMD_NR - 1 (511 for x86). If set to 511, the shrinker will immediately remove the folio from the deferred_list (Please see first if statement in thp_underutilized in Patch 5) and split is not attempted. Not a single page is checked at this point and there is no memory accesses done to impact performance.
If someone sets its to 510, it will exit as soon as a single page containing non-zero data is encountered (the else part in thp_underutilized).

/sys/kernel/mm/transparent_hugepage/thp_low_util_shrinker:
Introduced in patch 6, if someone really doesn't want to enable the shrinker, then they can set this to false. The folio will not be added to the _deferred_list at fault or collapse time, and it will be as if these patches didn't exist. Personally, I don't think its absolutely necessary to have this, but I added it incase someone comes up with some corner case.

For the first effect you mentioned, with the default behaviour of the patches with max_ptes_none set to 511, there will be no splitting of THPs, so you will get the same performance as without the series. 
If there is some benchmark that allocates all of the system memory with zeropages, causing shrinker to run and if someone has changed max_ptes_none and if they have kept thp_low_util_shrinker enabled and if all the benchmark does is read those pages, thus giving good memory results, then that benchmark is not really useful and the good results it gives is not unrealistic but a result of these patches. The stress example I have in the cover letter is an example. With these patches you can run stress or any other benchmark that behaves like this and still run other applications at the same time that consume memory, so the improvement is not unrealistic.

For the second effect of memory faults affecting latency sensitive applications, if THP is always enabled, and such applications are running out of memory resulting in shrinker to run, then a higher priority should be to have memory to run (which the shrinker will provide) rather than stalling for memory creating memory pressure which will result in latency spikes and possibly OOM killer being invoked killing the application.

I think we should focus on real world applications for which I have posted numbers in the cover letter and not tailor this for some benchmarks. If there is some real world low latency application where you could show these patches causing an issue, I would be happy to look into it. But again, with the default max_ptes_none of 511, it wouldn't.

> The other problem I have with your patch is that it may cause the kernel
> to pollute CPU caches in the background, which again will cause noise in
> the system. Instead of plain memchr_inv, you should probably use some
> primitive to bypass caches or use a NTA prefetch hint at least.
> 

A few points on this:
- the page is checked in 2 places, at shrink time and at split time, so having the page in cache is useful and needed.
- there is stuff like this already done in the kernel when there is memory pressure, for e.g. at swap time [1]. Its not memchr_inv, but doing the exact same thing as memchr_inv.
- At the time the shrinker runs, one of the highest priority of the kernel/system is to get free memory. We should not try to make this slower by messing around with caches.

I think the current behaviour in the patches is good because of the above points. But also I don't think there is a standard way of doing NTA prefetch across all architectures, x86 prefetch does it [1], but arm prefetch [2] does pld1keep, i.e. keep the data in L1 cache which is the opposite of what NTA prefetch is intended doing. 

[1] https://elixir.bootlin.com/linux/v6.10.4/source/mm/zswap.c#L1390
[2] https://elixir.bootlin.com/linux/v6.10.4/source/arch/x86/include/asm/processor.h#L614
[3] https://elixir.bootlin.com/linux/v6.10.4/source/arch/arm64/include/asm/processor.h#L360

next prev parent reply	other threads:[~2024-08-14 10:13 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-13 12:02 Usama Arif
2024-08-13 12:02 ` [PATCH v3 1/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
2024-08-15 18:47   ` Kairui Song
2024-08-15 19:16     ` Usama Arif
2024-08-16 16:55       ` Kairui Song
2024-08-16 17:02         ` Usama Arif
2024-08-16 18:11           ` Kairui Song
2024-08-13 12:02 ` [PATCH v3 2/6] mm: remap unused subpages to shared zeropage " Usama Arif
2024-08-13 12:02 ` [PATCH v3 3/6] mm: selftest to verify zero-filled pages are mapped to zeropage Usama Arif
2024-08-13 12:02 ` [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios Usama Arif
2024-08-14  3:30   ` Yu Zhao
2024-08-14 10:20     ` Usama Arif
2024-08-14 10:44   ` Barry Song
2024-08-14 10:52     ` Barry Song
2024-08-14 11:11     ` Usama Arif
2024-08-14 11:20       ` Barry Song
2024-08-14 11:26         ` Barry Song
2024-08-14 11:30         ` Usama Arif
2024-08-14 11:10   ` Barry Song
2024-08-14 11:20     ` Usama Arif
2024-08-14 11:23       ` Barry Song
2024-08-14 12:36         ` Usama Arif
2024-08-14 23:05           ` Barry Song
2024-08-15 15:25             ` Usama Arif
2024-08-15 23:30               ` Andrew Morton
2024-08-16  2:50                 ` Yu Zhao
2024-08-15 16:33   ` David Hildenbrand
2024-08-15 17:10     ` Usama Arif
2024-08-15 21:06       ` Barry Song
2024-08-15 21:08       ` David Hildenbrand
2024-08-16 15:44   ` Matthew Wilcox
2024-08-16 16:08     ` Usama Arif
2024-08-16 16:28       ` Matthew Wilcox
2024-08-16 16:41         ` Usama Arif
2024-08-13 12:02 ` [PATCH v3 5/6] mm: split underutilized THPs Usama Arif
2024-08-13 12:02 ` [PATCH v3 6/6] mm: add sysfs entry to disable splitting " Usama Arif
2024-08-13 17:22 ` [PATCH v3 0/6] mm: split " Andi Kleen
2024-08-14 10:13   ` Usama Arif [this message]
2024-08-18  5:13 ` Hugh Dickins
2024-08-18  7:45   ` David Hildenbrand
2024-08-19  2:38     ` Usama Arif
2024-08-19  2:36   ` Usama Arif

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e73c801a-8f36-44d8-a267-dd993aeccf35@gmail.com \
    --to=usamaarif642@gmail.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=cerasuolodomenico@gmail.com \
    --cc=corbet@lwn.net \
    --cc=david@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@meta.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=riel@surriel.com \
    --cc=roman.gushchin@linux.dev \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=shakeel.butt@linux.dev \
    --cc=willy@infradead.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox