From: Oscar Salvador <osalvador@suse.de>
To: David Hildenbrand <david@redhat.com>
Cc: lsf-pc@lists.linux-foundation.org, Peter Xu <peterx@redhat.com>,
Muchun Song <muchun.song@linux.dev>,
linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk
Date: Mon, 3 Feb 2025 11:10:56 +0100 [thread overview]
Message-ID: <Z6CWMBKlN0-RHI2P@localhost.localdomain> (raw)
In-Reply-To: <4c50a439-e2b8-4f54-ba3d-366d0e2961b2@redhat.com>
On Fri, Jan 31, 2025 at 12:19:24AM +0100, David Hildenbrand wrote:
> On 30.01.25 22:36, Oscar Salvador wrote:
> Hi Oscar,
Hi David
> >
> > HugeTLB has its own way of dealing with things.
> > E.g: HugeTLB interprets everything as a pte: huge_pte_uffd_wp, huge_pte_clear_uffd_wp,
> > huge_pte_dirty, huge_pte_modify, huge_pte_wrprotect etc.
> >
> > One of the challenges that this raises is that if we want pmd/pud walkers to
> > be able to make sense of hugetlb stuff, we need to implement pud/pmd
> > (maybe some pmd we already have because of THP) variants of those.
>
> that's the easy case I'm afraid. The real problem are cont-pte constructs (or worse)
> abstracted by hugetlb to be a single unit ("hugetlb pte").
Yes, that is a PITA to be honest.
E.g: Most of the code that lives in fs/proc/task_mmu.c works under that
assumption.
I was checking how we could make cont-{pmd,pte} work for hugetlb in that
area (just because it happens to be the first area I am trying to convert).
E.g: Let us have a look at smaps_pte_range/smaps_pte_entry.
I came up with something like the following (completely untested, just a
PoC) to make smaps_pte_range work for cont-ptes.
We would also need to change all references to PAGE_SIZE to the actual
size of the folio (if there are any).
But looking closer, folio_pte_batch is only declared in mm/internal.h,
so I am not sure I can make use of it outside the mm/ realm.
If not, how could we work around that? (One possible workaround is
sketched right after the diff.)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 193dd5f91fbf..21fb122ebcac 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -802,7 +802,7 @@ static void mss_hugetlb_update(struct mem_size_stats *mss, struct folio *folio,
#endif
static void smaps_pte_entry(pte_t *pte, unsigned long addr,
- struct mm_walk *walk)
+ struct mm_walk *walk, int nr_batch_entries)
{
struct mem_size_stats *mss = walk->private;
struct vm_area_struct *vma = walk->vma;
@@ -813,8 +813,19 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
if (pte_present(ptent)) {
page = vm_normal_page(vma, addr, ptent);
- young = pte_young(ptent);
- dirty = pte_dirty(ptent);
+ if (nr_batch_entries) {
+ int nr;
+ nr = folio_pte_batch(page_folio(page), addr, pte,
+ ptent, nr_batch_entries, 0, NULL,
+ &young, &dirty);
+ if (nr != nr_batch_entries) {
+ walk->action = ACTION_AGAIN;
+ return;
+ }
+ } else {
+ young = pte_young(ptent);
+ dirty = pte_dirty(ptent);
+ }
present = true;
} else if (is_swap_pte(ptent)) {
swp_entry_t swpent = pte_to_swp_entry(ptent);
@@ -935,7 +946,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
struct vm_area_struct *vma = walk->vma;
- pte_t *pte;
+ pte_t *pte, ptent;
spinlock_t *ptl;
ptl = pmd_huge_lock(pmd, vma);
@@ -950,8 +961,21 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
walk->action = ACTION_AGAIN;
return 0;
}
+
+ ptent = ptep_get(pte);
+ if (pte_present(ptent)) {
+ struct folio *folio = vm_normal_folio(vma, addr, ptent);
+ if (folio && folio_test_large(folio)) {
+ /* Let us use pte batching */
+ smaps_pte_entry(pte, addr, walk, (end - addr) / PAGE_SIZE);
+ pte_unmap_unlock(pte, ptl);
+
+ return 0;
+ }
+ }
+
for (; addr != end; pte++, addr += PAGE_SIZE)
- smaps_pte_entry(pte, addr, walk);
+ smaps_pte_entry(pte, addr, walk, 0);
pte_unmap_unlock(pte - 1, ptl);
out:
cond_resched();
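Regarding the folio_pte_batch question above: if moving the declaration
out of mm/internal.h is not an option, I guess a thin out-of-line wrapper
in mm/, exposed from a public header, could do. Something along these
lines (just a sketch, hypothetical names, untested):

/* include/linux/mm.h (hypothetical declaration) */
int folio_pte_batch_walk(struct folio *folio, unsigned long addr, pte_t *ptep,
			 pte_t pte, int max_nr, bool *any_young, bool *any_dirty);

/* mm/util.c (hypothetical): forward to the mm/internal.h helper so that
 * callers outside mm/ (fs/proc/task_mmu.c here) do not need to see it.
 */
int folio_pte_batch_walk(struct folio *folio, unsigned long addr, pte_t *ptep,
			 pte_t pte, int max_nr, bool *any_young, bool *any_dirty)
{
	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
			       any_young, any_dirty);
}

Whether something like that would be acceptable is exactly the kind of
thing I would like to sort out.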
> For "ordinary" pages, the cont-pte bit (as on arm64) is nowadays transparently
> managed: you can modify any PTE part of the cont-gang and it will just
> work as expected, transparently.
But that is because, AFAIU, on arm64 the code knows when it is dealing
with a cont-pte and queries the status of the whole "gang", right?
I see that huge_ptep_get/set_huge_pte_at check whether they are dealing
with cont-ptes and query/modify all sub-ptes.
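For reference, the hugetlb side looks roughly like this, if I read
arch/arm64/mm/hugetlbpage.c correctly (a simplified, untested sketch from
memory, not the exact upstream code; helper names may differ across versions):

pte_t huge_ptep_get_sketch(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
	pte_t orig_pte = ptep_get(ptep);
	size_t pgsize;
	int ncontig, i;

	/* Not a contiguous mapping: the single entry already says it all. */
	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
		return orig_pte;

	/* Fold dirty/young of every sub-pte into the logical hugetlb pte. */
	ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
	for (i = 0; i < ncontig; i++, ptep++) {
		pte_t pte = ptep_get(ptep);

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);
		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}
	return orig_pte;
}

So the caller always sees the aggregated state of the gang, and
set_huge_pte_at() similarly writes all ncontig entries in one go.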
How is that done for non-hugetlb pages?
> Not so with hugetlb, where you have to modify (or even query) the whole thing.
I see.
>
> For GUP it was easier, because it was able to grab all information it needed
> from the sub-ptes fairly easily, and it doesn't modify any page tables.
>
>
> I ran into this problem with folio_walk, and had to document it rather nastily:
>
> * WARNING: Modifying page table entries in hugetlb VMAs requires a lot of care.
> * For example, PMD page table sharing might require prior unsharing. Also,
> * logical hugetlb entries might span multiple physical page table entries,
> * which *must* be modified in a single operation (set_huge_pte_at(),
> * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
> * not correspond to the first physical entry of a logical hugetlb entry.
>
> I wanted to use it to rewrite the uprobe code to also handle hugetlb with
> less special casing, but that work stalled so far. I think my next attempt would rule
> out any non-pmd / non-pud hugetlb pages to make it somewhat simpler.
>
> It all gets weird with things like:
>
> commit 0549e76663730235a10395a7af7ad3d3ce6e2402
> Author: Christophe Leroy <christophe.leroy@csgroup.eu>
> Date: Tue Jul 2 15:51:25 2024 +0200
>
> powerpc/8xx: rework support for 8M pages using contiguous PTE entries
> In order to fit better with standard Linux page tables layout, add support
> for 8M pages using contiguous PTE entries in a standard page table. Page
> tables will then be populated with 1024 similar entries and two PMD
> entries will point to that page table.
> The PMD entries also get a flag to tell it is addressing an 8M page, this
> is required for the HW tablewalk assistance.
>
> Where we are walking a PTE table, but actually there is another PTE table we
> have to modify in the same go.
>
>
> Very hard to make that non-hugetlb aware, as it's simply completely different compared
> to ordinary page table walking/modifications today.
>
> Maybe there are ideas to tackle that, and I'd be very interested in them.
Yes, that one is very nasty, and as I recall from other conversations
with Christophe, we ran out of bits on powerpc-8xx to somehow flag that
entry as special, so I am not really sure how to handle it either.
Although it would be a pity if that were the thing that ends up
stalling this.
--
Oscar Salvador
SUSE Labs