From: Nico Pache <npache@redhat.com>
To: Dev Jain <dev.jain@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
anshuman.khandual@arm.com, catalin.marinas@arm.com,
cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com,
apopple@nvidia.com, dave.hansen@linux.intel.com,
will@kernel.org, baohua@kernel.org, jack@suse.cz,
srivatsa@csail.mit.edu, haowenchao22@gmail.com,
hughd@google.com, aneesh.kumar@kernel.org,
yang@os.amperecomputing.com, peterx@redhat.com,
ioworker0@gmail.com, wangkefeng.wang@huawei.com, ziy@nvidia.com,
jglisse@google.com, surenb@google.com, vishal.moola@gmail.com,
zokeefe@google.com, zhengqi.arch@bytedance.com,
jhubbard@nvidia.com, 21cnbao@gmail.com, willy@infradead.org,
kirill.shutemov@linux.intel.com, david@redhat.com,
aarcange@redhat.com, raquini@redhat.com, sunnanyong@huawei.com,
usamaarif642@gmail.com, audra@redhat.com,
akpm@linux-foundation.org
Subject: Re: [RFC 00/11] khugepaged: mTHP support
Date: Thu, 23 Jan 2025 13:24:01 -0700 [thread overview]
Message-ID: <CAA1CXcCOWNDsVOWj8apTqoFaJJsG=op4637qgAJ1k6GzmTtXEA@mail.gmail.com> (raw)
In-Reply-To: <209ba507-fab8-4011-bc36-1cc38a303800@arm.com>
On Sun, Jan 19, 2025 at 10:18 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> --- snip ---
> >>
> >> Althogh to be honest, it's not super clear to me what the benefit of the bitmap
> >> is vs just iterating through the PTEs like Dev does; is there a significant cost
> >> saving in practice? On the face of it, it seems like it might be uneeded complexity.
> > The bitmap was to encode the state of PMD without needing rescanning
> > (or refactor a lot of code). We keep the scan runtime constant at 512
> > (for x86). Dev did some good analysis for this here
> > https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>
> I think I swayed away and over-analyzed, and probably did not make my
> main objection clear enough, so let us cut to the chase.
> *Why* is it correct to remember the state of the PMD?
>
> In__collapse_huge_page_isolate(), we check the PTEs against the sysfs
> tunables again, since we dropped the lock. The bitmap thingy which you
> are doing, and in general, any algorithm which tries to remember the
> state of the PMD, violates the entire point of max_ptes_*. Take for
> example: Suppose the PTE table had a lot of shared ptes. After you drop
> the PTL, you do this: scan_bitmap() -> read_unlock() ->
> alloc_charge_folio() -> read_lock() -> read_unlock()....which is a lot
per your recommendation I dropped the read_lock() -> read_unlock() and
made it a conditional unlock
> of stuff. Now, you do write_lock(), which means that you need to wait
> for all faulting/forking/mremap/mmap etc to stop. Suppose this process
> forks and then a lot of PTEs become shared. The point of max_ptes_shared
> is to stop the collapse here, since we do not want memory bloat
> (collapse will grab more memory from the buddy and the old memory won't
> be freed because it has a reference from the parent/child).
That's a fair point, but given the other feedback, my current
implementation now requires mTHPs to have no shared/swap, and ive
improved the sysctl interactions for the set_bitmap and the
max_ptes_none check in the _isolate function.
As for *why* remembering the state is correct. It just prevents
needing to rescan.
> Another example would be, a sysadmin does not want too much memory
> wastage from khugepaged, so we decide to set max_ptes_none low. When you
> scan the PTE table you justify the collapse. After you drop the PTL and
> the mmap_lock, a munmap() happens in the region, no longer justifying
> the collapse. If you have a lot of VMAs of size <= 2MB, then any
> munmap() on a VMA will happen on the single PTE table present.
>
> So, IMHO before even jumping on analyzing the bitmap algorithm, we need
> to ask whether any algorithm remembering the state of the PMD is even
> conceptually right.
Both the issues you raised dont really have to do with the bitmap...
they are fair points, but they are more of a criticism of my sysctl
handling. Ive cleaned up the max_ptes_none interactions, and now that
we dont plan to initially support swap/shared both these problems are
'gone'.
>
> Then, you have the harder task of proving that your optimization is
> actually an optimization, that it is not turned into being futile
> because of overhead. From a high-level mathematical PoV, you are saving
> iterations. Any mathematical analysis has the underlying assumption that
> every iteration is equal. But the list [pte, pte + 1, ....., pte + (1 <<
> order)] is virtually and physically contiguous in memory so prefetching
> helps us. You are trying to save on pte memory references, but then look
> at the number of bitmap memory references you have created, not to
> mention that you are doing a (costly?) division operation in there, you
> have a while loop, a stack, new structs, and if conditions. I do not see
> how this is any faster than a naive linear scan.
Yeah it's hard to say without real performance testing. I hope to
include some performance results with my next post.
>
> > This prevents needing to hold the read lock for longer, and prevents
> > needing to reacquire it too.
>
> My implementation does not hold the read lock for longer. What you mean
> to say is, I need to reacquire the lock, and this is by design, to
yes sorry.
> ensure correctness, which boils down to what I wrote above.
The write lock is what ensures correctness, not the read lock. The
read lock is to gain insight of potential collapse candidates while
avoiding the cost of the write lock.
Cheers!
-- Nico
>
next prev parent reply other threads:[~2025-01-23 20:24 UTC|newest]
Thread overview: 53+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-08 23:31 Nico Pache
2025-01-08 23:31 ` [RFC 01/11] introduce khugepaged_collapse_single_pmd to collapse a single pmd Nico Pache
2025-01-10 6:25 ` Dev Jain
2025-01-08 23:31 ` [RFC 02/11] khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot Nico Pache
2025-01-08 23:31 ` [RFC 03/11] khugepaged: Don't allocate khugepaged mm_slot early Nico Pache
2025-01-10 6:11 ` Dev Jain
2025-01-10 19:37 ` Nico Pache
2025-01-08 23:31 ` [RFC 04/11] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
2025-01-08 23:31 ` [RFC 05/11] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2025-01-08 23:31 ` [RFC 06/11] khugepaged: generalize alloc_charge_folio " Nico Pache
2025-01-10 6:23 ` Dev Jain
2025-01-10 19:41 ` Nico Pache
2025-01-08 23:31 ` [RFC 07/11] khugepaged: generalize __collapse_huge_page_* " Nico Pache
2025-01-10 6:38 ` Dev Jain
2025-01-08 23:31 ` [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
2025-01-10 9:05 ` Dev Jain
2025-01-10 21:48 ` Nico Pache
2025-01-12 11:23 ` Dev Jain
2025-01-13 22:25 ` Nico Pache
2025-01-10 14:54 ` Dev Jain
2025-01-10 21:48 ` Nico Pache
2025-01-12 15:13 ` Dev Jain
2025-01-12 16:41 ` Dev Jain
2025-01-08 23:31 ` [RFC 09/11] khugepaged: add " Nico Pache
2025-01-10 9:20 ` Dev Jain
2025-01-10 13:36 ` Dev Jain
2025-01-08 23:31 ` [RFC 10/11] khugepaged: remove max_ptes_none restriction on the pmd scan Nico Pache
2025-01-08 23:31 ` [RFC 11/11] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2025-01-09 6:22 ` [RFC 00/11] khugepaged: mTHP support Dev Jain
2025-01-10 2:27 ` Nico Pache
2025-01-10 4:56 ` Dev Jain
2025-01-10 22:01 ` Nico Pache
2025-01-12 14:11 ` Dev Jain
2025-01-13 23:00 ` Nico Pache
2025-01-09 6:27 ` Dev Jain
2025-01-10 1:28 ` Nico Pache
2025-01-16 9:47 ` Ryan Roberts
2025-01-16 20:53 ` Nico Pache
2025-01-20 5:17 ` Dev Jain
2025-01-23 20:24 ` Nico Pache [this message]
2025-01-24 7:13 ` Dev Jain
2025-01-24 7:38 ` Dev Jain
2025-01-20 12:49 ` Ryan Roberts
2025-01-23 20:42 ` Nico Pache
2025-01-20 12:54 ` David Hildenbrand
2025-01-20 13:37 ` Ryan Roberts
2025-01-20 13:56 ` David Hildenbrand
2025-01-20 16:27 ` Ryan Roberts
2025-01-20 18:39 ` David Hildenbrand
2025-01-21 9:48 ` Ryan Roberts
2025-01-21 10:19 ` David Hildenbrand
2025-01-27 9:31 ` Dev Jain
2025-01-22 5:18 ` Dev Jain
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAA1CXcCOWNDsVOWj8apTqoFaJJsG=op4637qgAJ1k6GzmTtXEA@mail.gmail.com' \
--to=npache@redhat.com \
--cc=21cnbao@gmail.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=aneesh.kumar@kernel.org \
--cc=anshuman.khandual@arm.com \
--cc=apopple@nvidia.com \
--cc=audra@redhat.com \
--cc=baohua@kernel.org \
--cc=catalin.marinas@arm.com \
--cc=cl@gentwo.org \
--cc=dave.hansen@linux.intel.com \
--cc=david@redhat.com \
--cc=dev.jain@arm.com \
--cc=haowenchao22@gmail.com \
--cc=hughd@google.com \
--cc=ioworker0@gmail.com \
--cc=jack@suse.cz \
--cc=jglisse@google.com \
--cc=jhubbard@nvidia.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.com \
--cc=peterx@redhat.com \
--cc=raquini@redhat.com \
--cc=ryan.roberts@arm.com \
--cc=srivatsa@csail.mit.edu \
--cc=sunnanyong@huawei.com \
--cc=surenb@google.com \
--cc=usamaarif642@gmail.com \
--cc=vbabka@suse.cz \
--cc=vishal.moola@gmail.com \
--cc=wangkefeng.wang@huawei.com \
--cc=will@kernel.org \
--cc=willy@infradead.org \
--cc=yang@os.amperecomputing.com \
--cc=zhengqi.arch@bytedance.com \
--cc=ziy@nvidia.com \
--cc=zokeefe@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox