From: Dev Jain <dev.jain@arm.com>
To: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
anshuman.khandual@arm.com, catalin.marinas@arm.com,
cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com,
apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org,
baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu,
haowenchao22@gmail.com, hughd@google.com,
aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
peterx@redhat.com, ioworker0@gmail.com,
wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
zhengqi.arch@bytedance.com, jhubbard@nvidia.com,
21cnbao@gmail.com, willy@infradead.org,
kirill.shutemov@linux.intel.com, david@redhat.com,
aarcange@redhat.com, raquini@redhat.com, sunnanyong@huawei.com,
usamaarif642@gmail.com, audra@redhat.com,
akpm@linux-foundation.org
Subject: Re: [RFC 00/11] khugepaged: mTHP support
Date: Fri, 24 Jan 2025 12:43:39 +0530 [thread overview]
Message-ID: <8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com> (raw)
In-Reply-To: <CAA1CXcCOWNDsVOWj8apTqoFaJJsG=op4637qgAJ1k6GzmTtXEA@mail.gmail.com>
On 24/01/25 1:54 am, Nico Pache wrote:
> On Sun, Jan 19, 2025 at 10:18 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> --- snip ---
>>>>
>>>> Althogh to be honest, it's not super clear to me what the benefit of the bitmap
>>>> is vs just iterating through the PTEs like Dev does; is there a significant cost
>>>> saving in practice? On the face of it, it seems like it might be uneeded complexity.
>>> The bitmap was to encode the state of PMD without needing rescanning
>>> (or refactor a lot of code). We keep the scan runtime constant at 512
>>> (for x86). Dev did some good analysis for this here
>>> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>>
>> I think I swayed away and over-analyzed, and probably did not make my
>> main objection clear enough, so let us cut to the chase.
>> *Why* is it correct to remember the state of the PMD?
>>
>> In__collapse_huge_page_isolate(), we check the PTEs against the sysfs
>> tunables again, since we dropped the lock. The bitmap thingy which you
>> are doing, and in general, any algorithm which tries to remember the
>> state of the PMD, violates the entire point of max_ptes_*. Take for
>> example: Suppose the PTE table had a lot of shared ptes. After you drop
>> the PTL, you do this: scan_bitmap() -> read_unlock() ->
>> alloc_charge_folio() -> read_lock() -> read_unlock()....which is a lot
> per your recommendation I dropped the read_lock() -> read_unlock() and
> made it a conditional unlock
That's not the one I was talking about here...
>> of stuff. Now, you do write_lock(), which means that you need to wait
>> for all faulting/forking/mremap/mmap etc to stop. Suppose this process
>> forks and then a lot of PTEs become shared. The point of max_ptes_shared
>> is to stop the collapse here, since we do not want memory bloat
>> (collapse will grab more memory from the buddy and the old memory won't
>> be freed because it has a reference from the parent/child).
>
> That's a fair point, but given the other feedback, my current
> implementation now requires mTHPs to have no shared/swap, and ive
> improved the sysctl interactions for the set_bitmap and the
> max_ptes_none check in the _isolate function.
I am guessing you are following the policy of letting the creep happen
for none ptes, and assuming shared and swap to be zero.
>
> As for *why* remembering the state is correct. It just prevents
> needing to rescan.
That is what I am saying...if collapse_huge_page() fails, then you have
dropped the mmap write lock, so now the state of the PTEs may have
changed, so you must rescan...
>
>> Another example would be, a sysadmin does not want too much memory
>> wastage from khugepaged, so we decide to set max_ptes_none low. When you
>> scan the PTE table you justify the collapse. After you drop the PTL and
>> the mmap_lock, a munmap() happens in the region, no longer justifying
>> the collapse. If you have a lot of VMAs of size <= 2MB, then any
>> munmap() on a VMA will happen on the single PTE table present.
>>
>> So, IMHO before even jumping on analyzing the bitmap algorithm, we need
>> to ask whether any algorithm remembering the state of the PMD is even
>> conceptually right.
>
> Both the issues you raised dont really have to do with the bitmap...
Correct, my issue is with any general algorithm remembering PTE state.
> they are fair points, but they are more of a criticism of my sysctl
> handling. Ive cleaned up the max_ptes_none interactions, and now that
> we dont plan to initially support swap/shared both these problems are
> 'gone'.
>>
>> Then, you have the harder task of proving that your optimization is
>> actually an optimization, that it is not turned into being futile
>> because of overhead. From a high-level mathematical PoV, you are saving
>> iterations. Any mathematical analysis has the underlying assumption that
>> every iteration is equal. But the list [pte, pte + 1, ....., pte + (1 <<
>> order)] is virtually and physically contiguous in memory so prefetching
>> helps us. You are trying to save on pte memory references, but then look
>> at the number of bitmap memory references you have created, not to
>> mention that you are doing a (costly?) division operation in there, you
>> have a while loop, a stack, new structs, and if conditions. I do not see
>> how this is any faster than a naive linear scan.
>
> Yeah it's hard to say without real performance testing. I hope to
> include some performance results with my next post.
>
>>
>>> This prevents needing to hold the read lock for longer, and prevents
>>> needing to reacquire it too.
>>
>> My implementation does not hold the read lock for longer. What you mean
>> to say is, I need to reacquire the lock, and this is by design, to
> yes sorry.
>> ensure correctness, which boils down to what I wrote above.
> The write lock is what ensures correctness, not the read lock. The
> read lock is to gain insight of potential collapse candidates while
> avoiding the cost of the write lock.
>
> Cheers!
> -- Nico
>>
>
next prev parent reply other threads:[~2025-01-24 7:13 UTC|newest]
Thread overview: 53+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-08 23:31 Nico Pache
2025-01-08 23:31 ` [RFC 01/11] introduce khugepaged_collapse_single_pmd to collapse a single pmd Nico Pache
2025-01-10 6:25 ` Dev Jain
2025-01-08 23:31 ` [RFC 02/11] khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot Nico Pache
2025-01-08 23:31 ` [RFC 03/11] khugepaged: Don't allocate khugepaged mm_slot early Nico Pache
2025-01-10 6:11 ` Dev Jain
2025-01-10 19:37 ` Nico Pache
2025-01-08 23:31 ` [RFC 04/11] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
2025-01-08 23:31 ` [RFC 05/11] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2025-01-08 23:31 ` [RFC 06/11] khugepaged: generalize alloc_charge_folio " Nico Pache
2025-01-10 6:23 ` Dev Jain
2025-01-10 19:41 ` Nico Pache
2025-01-08 23:31 ` [RFC 07/11] khugepaged: generalize __collapse_huge_page_* " Nico Pache
2025-01-10 6:38 ` Dev Jain
2025-01-08 23:31 ` [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
2025-01-10 9:05 ` Dev Jain
2025-01-10 21:48 ` Nico Pache
2025-01-12 11:23 ` Dev Jain
2025-01-13 22:25 ` Nico Pache
2025-01-10 14:54 ` Dev Jain
2025-01-10 21:48 ` Nico Pache
2025-01-12 15:13 ` Dev Jain
2025-01-12 16:41 ` Dev Jain
2025-01-08 23:31 ` [RFC 09/11] khugepaged: add " Nico Pache
2025-01-10 9:20 ` Dev Jain
2025-01-10 13:36 ` Dev Jain
2025-01-08 23:31 ` [RFC 10/11] khugepaged: remove max_ptes_none restriction on the pmd scan Nico Pache
2025-01-08 23:31 ` [RFC 11/11] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2025-01-09 6:22 ` [RFC 00/11] khugepaged: mTHP support Dev Jain
2025-01-10 2:27 ` Nico Pache
2025-01-10 4:56 ` Dev Jain
2025-01-10 22:01 ` Nico Pache
2025-01-12 14:11 ` Dev Jain
2025-01-13 23:00 ` Nico Pache
2025-01-09 6:27 ` Dev Jain
2025-01-10 1:28 ` Nico Pache
2025-01-16 9:47 ` Ryan Roberts
2025-01-16 20:53 ` Nico Pache
2025-01-20 5:17 ` Dev Jain
2025-01-23 20:24 ` Nico Pache
2025-01-24 7:13 ` Dev Jain [this message]
2025-01-24 7:38 ` Dev Jain
2025-01-20 12:49 ` Ryan Roberts
2025-01-23 20:42 ` Nico Pache
2025-01-20 12:54 ` David Hildenbrand
2025-01-20 13:37 ` Ryan Roberts
2025-01-20 13:56 ` David Hildenbrand
2025-01-20 16:27 ` Ryan Roberts
2025-01-20 18:39 ` David Hildenbrand
2025-01-21 9:48 ` Ryan Roberts
2025-01-21 10:19 ` David Hildenbrand
2025-01-27 9:31 ` Dev Jain
2025-01-22 5:18 ` Dev Jain
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com \
--to=dev.jain@arm.com \
--cc=21cnbao@gmail.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=aneesh.kumar@kernel.org \
--cc=anshuman.khandual@arm.com \
--cc=apopple@nvidia.com \
--cc=audra@redhat.com \
--cc=baohua@kernel.org \
--cc=catalin.marinas@arm.com \
--cc=cl@gentwo.org \
--cc=dave.hansen@linux.intel.com \
--cc=david@redhat.com \
--cc=haowenchao22@gmail.com \
--cc=hughd@google.com \
--cc=ioworker0@gmail.com \
--cc=jack@suse.cz \
--cc=jglisse@google.com \
--cc=jhubbard@nvidia.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.com \
--cc=npache@redhat.com \
--cc=peterx@redhat.com \
--cc=raquini@redhat.com \
--cc=ryan.roberts@arm.com \
--cc=srivatsa@csail.mit.edu \
--cc=sunnanyong@huawei.com \
--cc=surenb@google.com \
--cc=usamaarif642@gmail.com \
--cc=vbabka@suse.cz \
--cc=vishal.moola@gmail.com \
--cc=wangkefeng.wang@huawei.com \
--cc=will@kernel.org \
--cc=willy@infradead.org \
--cc=yang@os.amperecomputing.com \
--cc=zhengqi.arch@bytedance.com \
--cc=ziy@nvidia.com \
--cc=zokeefe@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox