From: David Hildenbrand <david@redhat.com>
To: Oscar Salvador <osalvador@suse.de>
Cc: lsf-pc@lists.linux-foundation.org, Peter Xu <peterx@redhat.com>,
Muchun Song <muchun.song@linux.dev>,
linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk
Date: Tue, 4 Feb 2025 21:40:16 +0100
Message-ID: <74ecaa8b-9e94-4ba8-a2f0-a312607516ba@redhat.com>
In-Reply-To: <Z6CWMBKlN0-RHI2P@localhost.localdomain>
On 03.02.25 11:10, Oscar Salvador wrote:
> On Fri, Jan 31, 2025 at 12:19:24AM +0100, David Hildenbrand wrote:
>> On 30.01.25 22:36, Oscar Salvador wrote:
>> Hi Oscar,
>
> Hi David
Hi,
>
>>>
>>> HugeTLB has its own way of dealing with things.
>>> E.g: HugeTLB interprets everything as a pte: huge_pte_uffd_wp, huge_pte_clear_uffd_wp,
>>> huge_pte_dirty, huge_pte_modify, huge_pte_wrprotect etc.
>>>
>>> One of the challenges that this raises is that if we want pmd/pud walkers to
>>> be able to make sense of hugetlb stuff, we need to implement pud/pmd
>>> (maybe some pmd we already have because of THP) variants of those.
>>
>> that's the easy case I'm afraid. The real problem are cont-pte constructs (or worse)
>> abstracted by hugetlb to be a single unit ("hugetlb pte").
>
> Yes, that is a PITA to be honest.
> E.g: Most of the code that lives in fs/proc/task_mmu.c works under that
> assumption.
> I was checking how we could make cont-{pmd,pte} work for hugetlb in that
> area (just because it happens to be the first area I am trying to convert).
> E.g: Let us have a look at smaps_pte_range/smaps_pte_entry.
>
> I came up with something like the following (completely untested, just a
> PoC) to make smaps_pte_range work for cont-ptes.
> We would also need to change all the references of PAGE_SIZE to the
> actual size of the folio (if there are any).
Unfortunately it's not that easy: you can have only parts of a large
folio mapped. It would be PAGE_SIZE * nr_ptes when batching.
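Purely illustrative (made-up snippet, assuming smaps_pte_entry() learns
how many PTEs it consumed), the accounting would have to be based on the
batch:

	/*
	 * "nr" is the number of PTEs we just batched. Only part of a
	 * large folio might be mapped here, so we must not blindly
	 * account folio_size().
	 */
	const unsigned long size = nr * PAGE_SIZE;

	mss->resident += size;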
>
> But looking closer, folio_pte_batch is only declared in mm/internal.h,
> so I am not sure I can make use of that outside the mm/ realm?
Sure, we could export that. We'd also need something similar for PMDs,
and for batching all other non-present stuff (e.g., migration entries).
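Completely untested sketch of what batching non-present entries could
look like (made-up helper, only to illustrate the idea):

	static int nonpresent_pte_batch(pte_t *start_ptep, unsigned long addr,
					unsigned long end, pte_t pte)
	{
		const swp_entry_t entry = pte_to_swp_entry(pte);
		const unsigned long offset = swp_offset(entry);
		int nr = 1;

		/*
		 * Batch consecutive non-present PTEs that encode the same
		 * swap/migration type with consecutive offsets, similar in
		 * spirit to what folio_pte_batch() does for present PTEs.
		 */
		while (addr + nr * PAGE_SIZE < end) {
			const pte_t next = ptep_get(start_ptep + nr);
			swp_entry_t next_entry;

			if (pte_present(next) || pte_none(next))
				break;
			next_entry = pte_to_swp_entry(next);
			if (swp_type(next_entry) != swp_type(entry) ||
			    swp_offset(next_entry) != offset + nr)
				break;
			nr++;
		}
		return nr;
	}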
> If not, how could we work that around?
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 193dd5f91fbf..21fb122ebcac 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -802,7 +802,7 @@ static void mss_hugetlb_update(struct mem_size_stats *mss, struct folio *folio,
> #endif
>
> static void smaps_pte_entry(pte_t *pte, unsigned long addr,
> - struct mm_walk *walk)
> + struct mm_walk *walk, int nr_batch_entries)
> {
> struct mem_size_stats *mss = walk->private;
> struct vm_area_struct *vma = walk->vma;
> @@ -813,8 +813,19 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>
> if (pte_present(ptent)) {
> page = vm_normal_page(vma, addr, ptent);
> - young = pte_young(ptent);
> - dirty = pte_dirty(ptent);
> +		if (nr_batch_entries && page) {
> +			struct folio *folio = page_folio(page);
> +			int nr;
> +
> +			nr = folio_pte_batch(folio, addr, pte, ptent,
> +					     nr_batch_entries, 0, NULL,
> +					     &young, &dirty);
> + if (nr != nr_batch_entries) {
> + walk->action = ACTION_AGAIN;
> + return;
> + }
> + } else {
> + young = pte_young(ptent);
> + dirty = pte_dirty(ptent);
> + }
> 		present = true;
> 	} else if (is_swap_pte(ptent)) {
> swp_entry_t swpent = pte_to_swp_entry(ptent);
> @@ -935,7 +946,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> struct mm_walk *walk)
> {
> struct vm_area_struct *vma = walk->vma;
> - pte_t *pte;
> + pte_t *pte, ptent;
> spinlock_t *ptl;
>
> ptl = pmd_huge_lock(pmd, vma);
> @@ -950,8 +961,21 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> walk->action = ACTION_AGAIN;
> return 0;
> }
> +
> + ptent = ptep_get(pte);
> + if (pte_present(ptent)) {
> +		struct folio *folio = vm_normal_folio(vma, addr, ptent);
> +
> +		if (folio && folio_test_large(folio)) {
> + /* Let us use pte batching */
> + smaps_pte_entry(pte, addr, walk, (end - addr) / PAGE_SIZE);
> +			pte_unmap_unlock(pte, ptl);
> +
> + return 0;
> + }
> + }
> +
> for (; addr != end; pte++, addr += PAGE_SIZE)
> - smaps_pte_entry(pte, addr, walk);
> + smaps_pte_entry(pte, addr, walk, 0);
> pte_unmap_unlock(pte - 1, ptl);
> out:
> cond_resched();
>
I think this is the wrong approach. We should replace that pagewalk API
usage by something better that hides all the batching.
The interface would look similar to folio_walk (no callbacks, handling
of locking), but (a) work on ranges; (b) work also on non-folio entries;
and (c) batch all suitable entries in the range.
Something like a pt_range_walk_start() that returns a "type" (folio
range, migration entries, swap range, ...) + stores other details
(range, level, ptep, ...) in a structure like folio_walk, and then
provide mechanisms to continue the walk (pt_walk_continue()) or abort it
(pt_walk_done()). Similar to page_vma_mapped_walk(), but not specific
to a given page/folio.
Then, we would simply process the output of that, with the hope that,
for hugetlb, it will just batch all cont-pte / cont-pmd entries into a
single return value.
That will make the R/O walking as in task_mmu.c easier, hopefully.
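Very roughly, and with all names made up, I'm imagining something like:

	enum pt_range_type {
		PT_RANGE_DONE,		/* the range is exhausted */
		PT_RANGE_NONE,		/* nothing mapped in this chunk */
		PT_RANGE_FOLIO,		/* (part of) a mapped folio */
		PT_RANGE_MIGRATION,	/* batched migration entries */
		PT_RANGE_SWAP,		/* batched swap entries */
	};

	struct pt_range_walk {
		unsigned long addr;	/* start of the current batch */
		unsigned long size;	/* size of the current batch */
		int level;		/* PTE / PMD / PUD */
		pte_t *ptep;		/* on the PTE level, with the PTL held */
		struct folio *folio;	/* for PT_RANGE_FOLIO */
		/* ... */
	};

	enum pt_range_type pt_range_walk_start(struct pt_range_walk *rw,
			struct vm_area_struct *vma, unsigned long addr,
			unsigned long end);
	enum pt_range_type pt_walk_continue(struct pt_range_walk *rw);
	void pt_walk_done(struct pt_range_walk *rw);

A reader like smaps would then boil down to:

	type = pt_range_walk_start(&rw, vma, addr, end);
	while (type != PT_RANGE_DONE) {
		switch (type) {
		case PT_RANGE_FOLIO:
			/* account rw.size worth of rw.folio */
			break;
		case PT_RANGE_SWAP:
			/* account rw.size worth of swap entries */
			break;
		default:
			break;
		}
		type = pt_walk_continue(&rw);
	}
	pt_walk_done(&rw);

For cont-pte hugetlb, the hope would be that PT_RANGE_FOLIO comes back
exactly once, with rw.size covering the whole cont range.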
Not so much with PTE/PMD modifications, like damon_mkold_ops ... :( But
maybe that just has to be special-cased for hugetlb, somehow ...
>
>
>> For "ordinary" pages, the cont-pte bit (as on arm64) is nowadays transparently
>> managed: you can modify any PTE part of the cont-gang and it will just
>> work as expected, transparently.
>
> But that is because AFAIU, on arm64 it knows if it's dealing with a
> cont-pte and it queries the status of the whole "gang", right?
Yes, that's how I remember it works on arm64.
Note, though, that non-present PTEs don't use the cont bit. So one must
always "batch" these manually.
>
> I see that huge_ptep_get/set_huge_pte_at checks whether it's dealing
> with cont-ptes and modifies/queries all sub-ptes.
> How is that done for non-hugetlb pages?
Similarly, at least on arm64. When setting individual PTEs, it
automatically sets the cont-pte bit once it detects that the applicable
PTEs are all "cont-compatible" (e.g., belong to the same folio, have the
same protection bits). Other architectures might require more work
(powerpc).
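Conceptually, the fold check when setting a PTE looks something like
this (heavily simplified sketch; the real logic lives in
arch/arm64/mm/contpte.c and is much more careful, and prot_compatible()
is a made-up stand-in for comparing the relevant bits):

	static void try_fold_contpte(unsigned long addr, pte_t *ptep, pte_t pte)
	{
		const unsigned long start = ALIGN_DOWN(addr, CONT_PTE_SIZE);
		pte_t *first_ptep = ptep - (addr - start) / PAGE_SIZE;
		const unsigned long first_pfn = pte_pfn(pte) -
						(addr - start) / PAGE_SIZE;
		int i;

		for (i = 0; i < CONT_PTES; i++) {
			pte_t cur = ptep_get(first_ptep + i);

			/*
			 * Foldable only if all entries map consecutive pfns
			 * with identical protection bits (ignoring
			 * access/dirty).
			 */
			if (!pte_present(cur) ||
			    pte_pfn(cur) != first_pfn + i ||
			    !prot_compatible(cur, pte))
				return;
		}
		/* Now rewrite the whole block with the CONT bit set. */
	}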
--
Cheers,
David / dhildenb