From: David Hildenbrand <david@redhat.com>
To: Ryan Roberts <ryan.roberts@arm.com>, Nico Pache <npache@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
anshuman.khandual@arm.com, catalin.marinas@arm.com,
cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com,
apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org,
baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu,
haowenchao22@gmail.com, hughd@google.com,
aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
peterx@redhat.com, ioworker0@gmail.com,
wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
zhengqi.arch@bytedance.com, jhubbard@nvidia.com,
21cnbao@gmail.com, willy@infradead.org,
kirill.shutemov@linux.intel.com, aarcange@redhat.com,
raquini@redhat.com, dev.jain@arm.com, sunnanyong@huawei.com,
usamaarif642@gmail.com, audra@redhat.com,
akpm@linux-foundation.org
Subject: Re: [RFC 00/11] khugepaged: mTHP support
Date: Mon, 20 Jan 2025 19:39:03 +0100
Message-ID: <95472249-44f6-4764-a5fa-fac834eb5a49@redhat.com>
In-Reply-To: <9bf875ad-3e31-464d-bccd-7c737a2c53bc@arm.com>
On 20.01.25 17:27, Ryan Roberts wrote:
> On 20/01/2025 13:56, David Hildenbrand wrote:
>> On 20.01.25 14:37, Ryan Roberts wrote:
>>> On 20/01/2025 12:54, David Hildenbrand wrote:
>>>>>> I think the 1 problem that emerged during review of Dev's series, which we
>>>>>> don't
>>>>>> have a proper solution to yet, is the issue of "creep", where regions can be
>>>>>> collapsed to progressively higher orders through iterative scans. At each
>>>>>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the
>>>>>> collapse
>>>>>> effectively adds more non-none ptes so the next scan will then collapse to
>>>>>> even
>>>>>> higher order. Does your solution suffer from this (theoretical/edge case)
>>>>>> issue?
>>>>>> If not, how did you solve?
>>>>>
>>>>> Yes sadly it suffers from the same issue. bringing max_ptes_none much
>>>>> lower as a default would "help".
>>>>
>>>> Can we just keep it simple and only support max_ptes_none = 511 ("pagefault
>>>> behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0 ("deferred behavior") and
>>>> document that the other weird configurations will make mTHP skip, because "weird
>>>> and unexpected" ? :)
>
> nit: Rather than values of max_ptes_none other than 0 and max making mTHP skip,
> perhaps it's better to say we round to closest of 0 and max?
Maybe. Rounding down always implies doing something not necessarily desired.
In any case, I assume most setups just have the default values here ... :)
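To make the two options concrete, here is a minimal userspace sketch, not kernel code. It assumes 4 KiB base pages with a 2 MiB PMD (512 PTEs); the function names `mthp_policy` and `round_to_extreme` are hypothetical and exist only for illustration.

```python
PMD_NR_PAGES = 512  # assumption: 4 KiB base pages, 2 MiB PMD


def mthp_policy(max_ptes_none: int) -> str:
    """Option 1: support only the two extremes, skip mTHP collapse otherwise."""
    if max_ptes_none == PMD_NR_PAGES - 1:
        return "pagefault"  # collapse as eagerly as a page fault would
    if max_ptes_none == 0:
        return "deferred"   # collapse only fully populated ranges
    return "skip"           # any other value: no mTHP collapse


def round_to_extreme(max_ptes_none: int) -> int:
    """Option 2 (Ryan's nit): snap to the nearer of 0 and PMD_NR_PAGES - 1."""
    top = PMD_NR_PAGES - 1
    # top is odd, so there is never an exact tie to break.
    return top if 2 * max_ptes_none > top else 0
```

With option 2, a value such as 255 would silently behave like 0 and 256 like 511, which is the "doing something not necessarily desired" concern.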
>
>>>>
>>>
>>> That sounds like a great simplification in principle!
>>
>> And certainly a much easier one to start with :)
>>
>> If we ever get the request to support something else, maybe that's also where we
>> can learn *why*, and what we would actually want to do with mTHP.
>>
>>> We would need to consider
>>> the swap and shared tunables too though. Perhaps we can pull a similar trick
>>> with those?
>>
>> Swapped and shared are a bit more challenging, because they are set to "/ 2" or
>> "/ 8" heuristics.
>>
>>
>> One simple starting point here is of course to say "when collapsing mTHP, all
>> have to be unshared and all have to be swapped in", so to essentially ignore
>> both tunables (in a memory friendly way, as if they are set to 0) for mTHP
>> collapse and worry about that later, when really required.
>
> For swap, if we assume we start with the whole VMA swapped out, I think setting
> max_ptes_swap to 0 could still cause the "creep" problem if faulting pages back
> in sequentially? I guess that's creep due to faulting pattern though, so at
> least it's not due to collapse. Doesn't feel ideal though.
>
> I'm not sure what the semantic of "shared" is? I'm guessing it's specifically
> for private COWed pages, and khugepaged will trigger the COW on collapse?
Yes.
> So
> again depending on the pattern of writes we could still end up with creep in a
> similar way to swap?
I think the answer to both is "yes", so this is a simple starting point but not
necessarily what we want long term. The creep is at least "not wasting
more memory", because we don't collapse where PMD wouldn't have collapsed.
After all, right now we don't collapse mTHP at all; with this we would collapse
mTHP in many scenarios, so we don't have to be perfect initially.
Deriving stuff for small THP sizes when configured for PMD THP sizes is
not easy to do right.
>
>>
>> Two alternatives I discussed with Nico for these (not sure which is implemented
>> here) is to calculate it proportionally to the folio order we are collapsing:
>
> You're only listing one option here... what's the other one you discussed?
>
Ah sorry, reshuffled it and then had to rush.
The other thing I had in mind is to scan the whole PMD range, and
skip the whole PMD range if it doesn't obey the max_ptes_*
stuff. Not perfect, but will mean that we behave just like PMD collapse
would, unless I am missing something.
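A rough sketch of that gate, as a userspace toy model rather than kernel code. It assumes 512 PTEs per PMD and represents each PTE as one of the strings "none", "swap", "shared" or "present"; both the representation and the name `pmd_range_obeys_limits` are made up for illustration.

```python
from collections import Counter

PMD_NR_PAGES = 512  # assumption: 4 KiB base pages, 2 MiB PMD


def pmd_range_obeys_limits(pte_states, max_ptes_none, max_ptes_swap,
                           max_ptes_shared):
    """Return True if the whole PMD range passes the unscaled PMD limits.

    Only when this returns True would any mTHP collapse inside the range
    be considered; otherwise the whole PMD range is skipped.
    """
    if len(pte_states) != PMD_NR_PAGES:
        raise ValueError("expected one state per PTE in the PMD range")
    counts = Counter(pte_states)  # missing states count as 0
    return (counts["none"] <= max_ptes_none
            and counts["swap"] <= max_ptes_swap
            and counts["shared"] <= max_ptes_shared)
```

Because the check uses the PMD-sized limits unchanged, no smaller order can collapse in a range where a PMD collapse would have been refused.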
>>
>> Assuming max_ptes_swap = 64 (PMD: 512 PTEs) and we are collapsing a 1 MiB mTHP
>> (256 PTEs), 32 PTEs would be allowed to be swapped out.
>
> Yeah this is exactly what Dev's version is doing at the moment. But that's the
> behaviour that leads to the "creep" problem.
Right.
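The proportional scaling and the resulting creep can be reproduced with a small userspace toy model. This is a sketch under stated assumptions, not what either series implements: 512 PTEs per PMD, scaling done by a plain right shift, only the aligned block starting at PTE 0 is considered, the largest eligible order is tried first, and the minimum mTHP order is 2. All function names are hypothetical.

```python
PMD_ORDER = 9  # assumption: 4 KiB base pages, 2 MiB PMD, 512 PTEs


def scaled_limit(pmd_limit, order):
    """Scale a PMD-sized max_ptes_* value down to a smaller collapse order."""
    return pmd_limit >> (PMD_ORDER - order)


def scan_once(present, max_ptes_none, min_order=2):
    """One scan pass: collapse the largest eligible block starting at PTE 0.

    present is a list of 512 booleans, modified in place on collapse.
    Returns the collapsed order, or None if nothing was collapsed.
    """
    for order in range(PMD_ORDER, min_order - 1, -1):
        size = 1 << order
        populated = sum(present[:size])
        if populated == size:
            return None  # already fully populated at this order
        if populated and size - populated <= scaled_limit(max_ptes_none, order):
            present[:size] = [True] * size  # collapse fills the none PTEs
            return order
    return None


def collapse_history(populated_ptes, max_ptes_none):
    """Repeat scans until nothing changes; return the orders collapsed."""
    present = [i < populated_ptes for i in range(1 << PMD_ORDER)]
    history = []
    while (order := scan_once(present, max_ptes_none)) is not None:
        history.append(order)
    return history
```

In this model `scaled_limit(64, 8)` gives 32, matching the max_ptes_swap example above. `collapse_history(2, 256)` returns `[2, 3, 4, 5, 6, 7, 8, 9]`: two populated PTEs creep all the way to a PMD-sized collapse, one order per scan. `collapse_history(3, 255)` returns `[2]`, so in this model the creep only occurs once the limit reaches half of the range.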
--
Cheers,
David / dhildenb