linux-mm.kvack.org archive mirror
* [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk
@ 2025-01-30 21:36 Oscar Salvador
  2025-01-30 22:45 ` Peter Xu
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Oscar Salvador @ 2025-01-30 21:36 UTC (permalink / raw)
  To: lsf-pc; +Cc: David Hildenbrand, Peter Xu, Muchun Song, linux-mm

Hi,

last year Peter Xu gave a presentation at LSF/MM on how to better integrate hugetlb
into the mm core.
There are several reasons we want to do that, but one could say that the two that
matter the most are 1) reducing code duplication and 2) making hugetlb less special.

During the last year several patches that went in that direction were merged, e.g.
the gup hugetlb unification [1], mprotect for dax PUDs [2], and moving hugetlb into
the generic unmapping path [3], to name some.

There was also a discussion on how to integrate hugetlb into the generic pagewalk,
getting rid of a lot of code by doing so and ending up with a generic path that can
handle everything.
A first stab at this was [4] (a very basic draft).

Although a second version is in the works, I would like to present some concerns
I have with that work.

HugeTLB has its own way of dealing with things.
E.g.: HugeTLB interprets everything as a pte: huge_pte_uffd_wp, huge_pte_clear_uffd_wp,
huge_pte_dirty, huge_pte_modify, huge_pte_wrprotect, etc.

One of the challenges this raises is that if we want pmd/pud walkers to be able
to make sense of hugetlb mappings, we need to implement pud/pmd variants of those
helpers (some pmd variants we already have because of THP).

E.g.: HugeTLB code uses is_swap_pte and pte_to_swp_entry.
If we want PUD walkers to be able to handle hugetlb, we would need some sort of
is_swap_pud and pud_to_swp_entry implementations.
The same goes for a handful of other functions (e.g. huge_pte_*_uffd_wp,
hugetlb pte markers, etc.)

This has never been a problem because hugetlb has its own way of doing things
and we implemented code around that logic, but it falls off a cliff as soon as
we want to make hugetlb less special and more generic, because we need to start
implementing pud/pmd variants of all those pte_* helpers.

I would like to know how people feel about this: whether it is something worth
pursuing, or whether we live with the fact that HugeTLB is special and it stays
that way.

[1]
https://patchwork.kernel.org/project/linux-mm/cover/20240327152332.950956-1-peterx@redhat.com/
[2]
https://patchwork.kernel.org/project/linux-mm/cover/20240812181225.1360970-1-peterx@redhat.com/
[3]
https://patchwork.kernel.org/project/linux-mm/cover/20241007075037.267650-1-osalvador@suse.de/
[4]
https://patchwork.kernel.org/project/linux-mm/cover/20240704043132.28501-1-osalvador@suse.de/

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk
  2025-01-30 21:36 [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk Oscar Salvador
@ 2025-01-30 22:45 ` Peter Xu
  2025-01-30 22:46 ` Matthew Wilcox
  2025-01-30 23:19 ` David Hildenbrand
  2 siblings, 0 replies; 11+ messages in thread
From: Peter Xu @ 2025-01-30 22:45 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: lsf-pc, David Hildenbrand, Muchun Song, linux-mm

On Thu, Jan 30, 2025 at 10:36:51PM +0100, Oscar Salvador wrote:
> Hi,

Hello, Oscar,

> 
> last year Peter Xu gave a presentation at LSF/MM on how to better integrate hugetlb
> in the mm core.
> There are several reasons we want to do that, but one could say that the two that
> matter the most are 1) code duplication and 2) making hugetlb less special.
> 
> During the last year several patches that went in that direction were merged e.g:
> gup hugetlb unify [1], mprotect for dax PUDs [2], hugetlb into generic unmapping
> path [3] to name some.
> 
> There was also a concern on how to integrate hugetlb into the generic pagewalk,
> getting rid by doing so of a lot of code and have a generic path that could handle
> everything. 
> This was first worked in [4] (very basic draft).
> 
> Although a second version is in the works, I would like to present some concerns
> I have wrt. that work.
> 
> HugeTLB has its own way of dealing with things.
> E.g: HugeTLB interprets everything as a pte: huge_pte_uffd_wp, huge_pte_clear_uffd_wp,
> huge_pte_dirty, huge_pte_modify, huge_pte_wrprotect etc.
> 
> One of the challenges that this raises is that if we want pmd/pud walkers to
> be able to make sense of hugetlb stuff, we need to implement pud/pmd
> (maybe some pmd we already have because of THP) variants of those.
> 
> E.g: HugeTLB code uses is_swap_pte and pte_to_swp_entry.
> If we want PUD walkers to be able to handle hugetlb, this means that we would
> need some sort of is_swap_pud and pud_to_swp_entry implementations.
> The same happens with a handful of other functions (e.g: huge_pte_*_uffd_wp,
> hugetlb pte markers, etc.)
> 
> This has never been a problem because hugetlb has its way of doing things
> and we implemented code around that logic, but this falls off the cliff as
> soon as we want to make it less special and more generic, because we need to
> start implementing all those pte_* variants for pud/pmd_* 
> 
> I would like to know how people feel about it, whether this is something worth
> pursuing, or we live with the fact that HugeTLB is special, and so it remains
> this way.
> 
> [1]
> https://patchwork.kernel.org/project/linux-mm/cover/20240327152332.950956-1-peterx@redhat.com/
> [2]
> https://patchwork.kernel.org/project/linux-mm/cover/20240812181225.1360970-1-peterx@redhat.com/
> [3]
> https://patchwork.kernel.org/project/linux-mm/cover/20241007075037.267650-1-osalvador@suse.de/
> [4]
> https://patchwork.kernel.org/project/linux-mm/cover/20240704043132.28501-1-osalvador@suse.de/

Thanks for bringing up this topic.

I won't be able to apply for LSF/MM this year due to some family plans, but I
can share some quick thoughts in case they're helpful.  We also had some
relevant discussions on this, but I guess most of them are not on the list.

I definitely agree with you that such a cleanup of hugetlb would always be
nice, especially on the pgtable side.  Fundamentally, that's because huge
mappings in the pgtables are no different between hugetlbfs and the other
forms of huge mappings, at least on the archs I'm aware of.  In general, the
pgtable part only defines the size of a mapping, not its attributes.

I have also shared with you before the concern I have with this work: we have
limited resources not only in developers willing to do the cleanup, but also
in whoever will be able to review it properly.  In general, knowing that
hugetlbfs may be feature-frozen after the HGM attempt, I started to evaluate
the pros and cons of such a global cleanup, and whether it will pay off for
everyone, given the risk of easily breaking existing hugetlbfs users.

I wouldn't be surprised if the whole effort took at least a 5-digit LOC change
to complete, even assuming the idea is 100% workable and the code 100%
perfect; it can still be quite some effort.

The gain of the whole work would be a clean code base for pgtables, with no
functional change (hopefully!  unless we regress some perf here and there..
normally the hugetlb API is _slightly_ faster, even if uglier, per my
possibly-outdated impression..).

So that's a major concern I have: whether we should stick with cleaning
everything up, or think about other approaches, e.g.:

  - We could still pick the low-hanging fruit where we see fit: changes that
    are self-contained and have a direct benefit.  E.g. I think it may still
    make sense to finish your page walk API rewrites, at least if they're
    already halfway through (which is my gut feeling, but you know best..).

  - We could think about refactoring hugetlb in a way that makes it more
    usable and provides new features, rather than reworking a feature-frozen
    base where we can't get more than "cleanups".

The latter is also why I started looking at integrating HugeTLB pages /
folios without hugetlbfs's presence.  So far gmem does look like a good
container for that, as confidential computing will have a similar demand to
allocate 1G pages from somewhere, and that "somewhere" shares a lot of the
common issues hugetlbfs has to resolve as well.  That means it makes sense to
me to rework that part of hugetlbfs to suit more consumers (which I have
started to call "hugetlb pages / folios" vs. "hugetlbfs", just to
differentiate from the file system).

And if gmem 1G can work with CoCo, it should be pretty simple to extend it to
!CoCo (which fundamentally means consuming gmem folios in place, with no need
for private<->shared conversions).  That means there's a chance for a VM
cloud provider (private or public) to move over to gmem, completely replacing
hugetlbfs 1G for either confidential or normal VMs.

Then there would be a hugetlb-based (not hugetlbfs-based) solution that is
not feature-frozen, and meanwhile whatever we rework on hugetlb from this
perspective will not only be cleanups, but will pave the way for anything
built on top of hugetlb folios to work properly.

That doesn't mean we should stop cleaning up hugetlbfs code, though.  So
there's still the third option of choosing to try to finish this work.  It'll
just be challenging in different ways.

Sorry for all these pretty random thoughts.  Please ignore them if some (if
not all..) don't apply to the topic you plan to discuss!

Thanks,

-- 
Peter Xu




* Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk
  2025-01-30 21:36 [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk Oscar Salvador
  2025-01-30 22:45 ` Peter Xu
@ 2025-01-30 22:46 ` Matthew Wilcox
  2025-01-30 23:19 ` David Hildenbrand
  2 siblings, 0 replies; 11+ messages in thread
From: Matthew Wilcox @ 2025-01-30 22:46 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: lsf-pc, David Hildenbrand, Peter Xu, Muchun Song, linux-mm

On Thu, Jan 30, 2025 at 10:36:51PM +0100, Oscar Salvador wrote:
> HugeTLB has its own way of dealing with things.
> E.g: HugeTLB interprets everything as a pte: huge_pte_uffd_wp, huge_pte_clear_uffd_wp,
> huge_pte_dirty, huge_pte_modify, huge_pte_wrprotect etc.

This is a bug, not a feature.  It makes some horrendous assumptions about how
each architecture encodes its ptes/pmds/puds (i.e. that they're all
compatible).  By and large they're mostly compatible, but there are some
awful hacks to work around the cases where they aren't.

> One of the challenges that this raises is that if we want pmd/pud walkers to
> be able to make sense of hugetlb stuff, we need to implement pud/pmd
> (maybe some pmd we already have because of THP) variants of those.

That's a good thing!  Type safety is good!  hugetlbfs tries to defeat it, and
it works rather too well.




* Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk
  2025-01-30 21:36 [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk Oscar Salvador
  2025-01-30 22:45 ` Peter Xu
  2025-01-30 22:46 ` Matthew Wilcox
@ 2025-01-30 23:19 ` David Hildenbrand
  2025-01-31 15:42   ` Christophe Leroy
  2025-02-03 10:10   ` Oscar Salvador
  2 siblings, 2 replies; 11+ messages in thread
From: David Hildenbrand @ 2025-01-30 23:19 UTC (permalink / raw)
  To: Oscar Salvador, lsf-pc; +Cc: Peter Xu, Muchun Song, linux-mm

On 30.01.25 22:36, Oscar Salvador wrote:
> Hi,
> 
> last year Peter Xu gave a presentation at LSF/MM on how to better integrate hugetlb
> in the mm core.
> There are several reasons we want to do that, but one could say that the two that
> matter the most are 1) code duplication and 2) making hugetlb less special.
> 
> During the last year several patches that went in that direction were merged e.g:
> gup hugetlb unify [1], mprotect for dax PUDs [2], hugetlb into generic unmapping
> path [3] to name some.
> 
> There was also a concern on how to integrate hugetlb into the generic pagewalk,
> getting rid by doing so of a lot of code and have a generic path that could handle
> everything.
> This was first worked in [4] (very basic draft).
> 
> Although a second version is in the works, I would like to present some concerns
> I have wrt. that work.

Hi Oscar,

> 
> HugeTLB has its own way of dealing with things.
> E.g: HugeTLB interprets everything as a pte: huge_pte_uffd_wp, huge_pte_clear_uffd_wp,
> huge_pte_dirty, huge_pte_modify, huge_pte_wrprotect etc.
> 
> One of the challenges that this raises is that if we want pmd/pud walkers to
> be able to make sense of hugetlb stuff, we need to implement pud/pmd
> (maybe some pmd we already have because of THP) variants of those.

that's the easy case, I'm afraid. The real problem is cont-pte constructs (or worse)
abstracted by hugetlb into a single unit ("hugetlb pte").

For "ordinary" pages, the cont-pte bit (as on arm64) is nowadays transparently
managed: you can modify any PTE part of the cont-gang and it will just
work as expected, transparently.

Not so with hugetlb, where you have to modify (or even query) the whole thing.

For GUP it was easier, because it could grab all the information it needed
from the sub-ptes fairly easily, and it doesn't modify any page tables.


I ran into this problem with folio_walk, and had to document it rather nastily:

  * WARNING: Modifying page table entries in hugetlb VMAs requires a lot of care.
  * For example, PMD page table sharing might require prior unsharing. Also,
  * logical hugetlb entries might span multiple physical page table entries,
  * which *must* be modified in a single operation (set_huge_pte_at(),
  * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
  * not correspond to the first physical entry of a logical hugetlb entry.

I wanted to use it to rewrite the uprobe code to also handle hugetlb with less
special casing, but that work has stalled so far. I think my next attempt
would rule out any non-pmd / non-pud hugetlb pages to make it somewhat simpler.

It all gets weird with things like:

commit 0549e76663730235a10395a7af7ad3d3ce6e2402
Author: Christophe Leroy <christophe.leroy@csgroup.eu>
Date:   Tue Jul 2 15:51:25 2024 +0200

     powerpc/8xx: rework support for 8M pages using contiguous PTE entries
     
     In order to fit better with standard Linux page tables layout, add support
     for 8M pages using contiguous PTE entries in a standard page table.  Page
     tables will then be populated with 1024 similar entries and two PMD
     entries will point to that page table.
     
     The PMD entries also get a flag to tell it is addressing an 8M page, this
     is required for the HW tablewalk assistance.

Where we are walking a PTE table, but there is actually another PTE table we
have to modify in the same go.


Very hard to make that non-hugetlb aware, as it's simply completely different
from ordinary page table walking/modification today.

Maybe there are ideas to tackle that, and I'd be very interested in them.

-- 
Cheers,

David / dhildenb




* Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk
  2025-01-30 23:19 ` David Hildenbrand
@ 2025-01-31 15:42   ` Christophe Leroy
  2025-02-04 20:19     ` David Hildenbrand
  2025-02-03 10:10   ` Oscar Salvador
  1 sibling, 1 reply; 11+ messages in thread
From: Christophe Leroy @ 2025-01-31 15:42 UTC (permalink / raw)
  To: David Hildenbrand, Oscar Salvador, lsf-pc; +Cc: Peter Xu, Muchun Song, linux-mm



On 31/01/2025 at 00:19, David Hildenbrand wrote:
> On 30.01.25 22:36, Oscar Salvador wrote:
>> Hi,
>>
>> last year Peter Xu gave a presentation at LSF/MM on how to better 
>> integrate hugetlb
>> in the mm core.
>> There are several reasons we want to do that, but one could say that 
>> the two that
>> matter the most are 1) code duplication and 2) making hugetlb less 
>> special.
>>
>> During the last year several patches that went in that direction were 
>> merged e.g:
>> gup hugetlb unify [1], mprotect for dax PUDs [2], hugetlb into generic 
>> unmapping
>> path [3] to name some.
>>
>> There was also a concern on how to integrate hugetlb into the generic 
>> pagewalk,
>> getting rid by doing so of a lot of code and have a generic path that 
>> could handle
>> everything.
>> This was first worked in [4] (very basic draft).
>>
>> Although a second version is in the works, I would like to present 
>> some concerns
>> I have wrt. that work.
> 
> Hi Oscar,
> 
>>
>> HugeTLB has its own way of dealing with things.
>> E.g: HugeTLB interprets everything as a pte: huge_pte_uffd_wp, 
>> huge_pte_clear_uffd_wp,
>> huge_pte_dirty, huge_pte_modify, huge_pte_wrprotect etc.
>>
>> One of the challenges that this raises is that if we want pmd/pud 
>> walkers to
>> be able to make sense of hugetlb stuff, we need to implement pud/pmd
>> (maybe some pmd we already have because of THP) variants of those.
> 
> that's the easy case I'm afraid. The real problem are cont-pte 
> constructs (or worse)
> abstracted by hugetlb to be a single unit ("hugetlb pte").
> 
> For "ordinary" pages, the cont-pte bit (as on arm64) is nowadays 
> transparently
> managed: you can modify any PTE part of the cont-gang and it will just
> work as expected, transparently.
> 
> Not so with hugetlb, where you have to modify (or even query) the whole 
> thing.
> 
> For GUP it was easier, because it was able to grab all information it 
> needed
> from the sub-ptes fairly easily, and it doesn't modify any page tabls.
> 
> 
> I ran into this problem with folio_walk, and had to document it rather 
> nastily:
> 
>   * WARNING: Modifying page table entries in hugetlb VMAs requires a lot 
> of care.
>   * For example, PMD page table sharing might require prior unsharing. 
> Also,
>   * logical hugetlb entries might span multiple physical page table 
> entries,
>   * which *must* be modified in a single operation (set_huge_pte_at(),
>   * huge_ptep_set_*, ...). Note that the page table entry stored in @fw 
> might
>   * not correspond to the first physical entry of a logical hugetlb entry.
> 
> I wanted to use it to rewrite the uprobe code to also handle hugetlb with
> less special casing, but that work stalled so far. I think my next 
> attempt would rule
> out any non-pmd / non-pud hugetlb pages to make it somewhat simpler.
> 
> It all gets weird with things like:
> 
> commit 0549e76663730235a10395a7af7ad3d3ce6e2402
> Author: Christophe Leroy <christophe.leroy@csgroup.eu>
> Date:   Tue Jul 2 15:51:25 2024 +0200
> 
>      powerpc/8xx: rework support for 8M pages using contiguous PTE entries
>      In order to fit better with standard Linux page tables layout, add 
> support
>      for 8M pages using contiguous PTE entries in a standard page 
> table.  Page
>      tables will then be populated with 1024 similar entries and two PMD
>      entries will point to that page table.
>      The PMD entries also get a flag to tell it is addressing an 8M 
> page, this
>      is required for the HW tablewalk assistance.
> 
> Where we are walking a PTE table, but actually there is another PTE 
> table we
> have to modify in the same go.
> 
> 
> Very hard to make that non-hugetlb aware, as it's simply completely 
> different compared
> to ordinary page table walking/modifications today.
> 
> Maybe there are ideas to tackle that, and I'd be very interested in them.
> 


But at least that 8xx change allowed us to get rid of huge page 
directories (hugepd), which was even more painful IIUC.

Nevertheless, can't we turn that into a standard walk one way or another?

While we walk, we reach a PMD entry which is marked as CONT-PMD but is not 
tagged as a leaf entry, so there is a page table below. PMD_SIZE is 4M but 
the page size is 8M, so once you've walked the page table entirely you know 
you still have 4M to go, so you have to walk the second PMD and the page 
table it points to.

By the way, I don't know whether it helps or makes things worse, but indeed 
from a HW point of view there is no need to replicate the PTE entry 1024 
times. Here we used a standard page table because it looked more generic 
from the kernel's point of view, but all the HW needs is a single PTE 
located at a page-aligned address. That's what we had when we used huge 
page directories (hugepd). It was even easier because both PMD entries were 
pointing to the same hugepd entry, hence no need for CONT-PTE-like 
management at the PTE level.

Christophe



* Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk
  2025-01-30 23:19 ` David Hildenbrand
  2025-01-31 15:42   ` Christophe Leroy
@ 2025-02-03 10:10   ` Oscar Salvador
  2025-02-04 20:40     ` David Hildenbrand
  1 sibling, 1 reply; 11+ messages in thread
From: Oscar Salvador @ 2025-02-03 10:10 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: lsf-pc, Peter Xu, Muchun Song, linux-mm

On Fri, Jan 31, 2025 at 12:19:24AM +0100, David Hildenbrand wrote:
> On 30.01.25 22:36, Oscar Salvador wrote:
> Hi Oscar,

Hi David

> > 
> > HugeTLB has its own way of dealing with things.
> > E.g: HugeTLB interprets everything as a pte: huge_pte_uffd_wp, huge_pte_clear_uffd_wp,
> > huge_pte_dirty, huge_pte_modify, huge_pte_wrprotect etc.
> > 
> > One of the challenges that this raises is that if we want pmd/pud walkers to
> > be able to make sense of hugetlb stuff, we need to implement pud/pmd
> > (maybe some pmd we already have because of THP) variants of those.
> 
> that's the easy case I'm afraid. The real problem are cont-pte constructs (or worse)
> abstracted by hugetlb to be a single unit ("hugetlb pte").

Yes, that is a PITA, to be honest.
E.g.: most of the code that lives in fs/proc/task_mmu.c works under that
assumption.
I was checking how we could make cont-{pmd,pte} work for hugetlb in that
area (just because it happens to be the first area I am trying to convert).
E.g.: let us have a look at smaps_pte_range/smaps_pte_entry.

I came up with something like the following (completely untested, just a
PoC) to make smaps_pte_range work for cont-ptes.
We would also need to change all the references to PAGE_SIZE to the
actual size of the folio (if there are any).

But looking closer, folio_pte_batch is only declared in mm/internal.h,
so I am not sure I can make use of it outside the mm/ realm?
If not, how could we work around that?

 diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
 index 193dd5f91fbf..21fb122ebcac 100644
 --- a/fs/proc/task_mmu.c
 +++ b/fs/proc/task_mmu.c
 @@ -802,7 +802,7 @@ static void mss_hugetlb_update(struct mem_size_stats *mss, struct folio *folio,
  #endif
 
  static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 -		struct mm_walk *walk)
 +		struct mm_walk *walk, int nr_batch_entries)
  {
  	struct mem_size_stats *mss = walk->private;
  	struct vm_area_struct *vma = walk->vma;
 @@ -813,8 +813,19 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 
  	if (pte_present(ptent)) {
  		page = vm_normal_page(vma, addr, ptent);
 -		young = pte_young(ptent);
 -		dirty = pte_dirty(ptent);
 +		if (nr_batch_entries) {
 +			int nr;
 +			nr = folio_pte_batch(page_folio(page), addr, pte,
 +					     ptent, nr_batch_entries, 0, NULL,
 +					     &young, &dirty);
 +			if (nr != nr_batch_entries) {
 +				walk->action = ACTION_AGAIN;
 +				return;
 +			}
 +		} else {
 +			young = pte_young(ptent);
 +			dirty = pte_dirty(ptent);
 +		}
  		present = true;
  	} else if (is_swap_pte(ptent)) {
  		swp_entry_t swpent = pte_to_swp_entry(ptent);
 @@ -935,7 +946,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
  			   struct mm_walk *walk)
  {
  	struct vm_area_struct *vma = walk->vma;
 -	pte_t *pte;
 +	pte_t *pte, ptent;
  	spinlock_t *ptl;
 
  	ptl = pmd_huge_lock(pmd, vma);
 @@ -950,8 +961,21 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
  		walk->action = ACTION_AGAIN;
  		return 0;
  	}
 +
 +	ptent = ptep_get(pte);
 +	if (pte_present(ptent)) {
 +		struct folio *folio = vm_normal_folio(vma, addr, ptent);
 +		if (folio_test_large(folio)) {
 +			/* Let us use pte batching */
 +			smaps_pte_entry(pte, addr, walk, (end - addr) / PAGE_SIZE);
 +			pte_unmap_unlock(pte, ptl);
 +
 +			return 0;
 +		}
 +	}
 +
  	for (; addr != end; pte++, addr += PAGE_SIZE)
 -		smaps_pte_entry(pte, addr, walk);
 +		smaps_pte_entry(pte, addr, walk, 0);
  	pte_unmap_unlock(pte - 1, ptl);
  out:
  	cond_resched();



> For "ordinary" pages, the cont-pte bit (as on arm64) is nowadays transparently
> managed: you can modify any PTE part of the cont-gang and it will just
> work as expected, transparently.

But that is because, AFAIU, on arm64 it knows when it's dealing with a
cont-pte and it queries the status of the whole "gang", right?

I see that huge_ptep_get/set_huge_pte_at checks whether it's dealing
with cont-ptes and modifies/queries all sub-ptes.
How is that done for non-hugetlb pages?

> Not so with hugetlb, where you have to modify (or even query) the whole thing.

I see.

> 
> For GUP it was easier, because it was able to grab all information it needed
> from the sub-ptes fairly easily, and it doesn't modify any page tabls.
> 
> 
> I ran into this problem with folio_walk, and had to document it rather nastily:
> 
>  * WARNING: Modifying page table entries in hugetlb VMAs requires a lot of care.
>  * For example, PMD page table sharing might require prior unsharing. Also,
>  * logical hugetlb entries might span multiple physical page table entries,
>  * which *must* be modified in a single operation (set_huge_pte_at(),
>  * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
>  * not correspond to the first physical entry of a logical hugetlb entry.
> 
> I wanted to use it to rewrite the uprobe code to also handle hugetlb with
> less special casing, but that work stalled so far. I think my next attempt would rule
> out any non-pmd / non-pud hugetlb pages to make it somewhat simpler.
> 
> It all gets weird with things like:
> 
> commit 0549e76663730235a10395a7af7ad3d3ce6e2402
> Author: Christophe Leroy <christophe.leroy@csgroup.eu>
> Date:   Tue Jul 2 15:51:25 2024 +0200
> 
>     powerpc/8xx: rework support for 8M pages using contiguous PTE entries
>     In order to fit better with standard Linux page tables layout, add support
>     for 8M pages using contiguous PTE entries in a standard page table.  Page
>     tables will then be populated with 1024 similar entries and two PMD
>     entries will point to that page table.
>     The PMD entries also get a flag to tell it is addressing an 8M page, this
>     is required for the HW tablewalk assistance.
> 
> Where we are walking a PTE table, but actually there is another PTE table we
> have to modify in the same go.
> 
> 
> Very hard to make that non-hugetlb aware, as it's simply completely different compared
> to ordinary page table walking/modifications today.
> 
> Maybe there are ideas to tackle that, and I'd be very interested in them.

Yes, that one is very nasty, and as I recall from other conversations
with Christophe, we ran out of bits (powerpc 8xx) to somehow flag that
entry as special, so I am not really sure either.
Although it would be a pity if that were the thing that ends up
stalling this work.


-- 
Oscar Salvador
SUSE Labs



* Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk
  2025-01-31 15:42   ` Christophe Leroy
@ 2025-02-04 20:19     ` David Hildenbrand
  0 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2025-02-04 20:19 UTC (permalink / raw)
  To: Christophe Leroy, Oscar Salvador, lsf-pc; +Cc: Peter Xu, Muchun Song, linux-mm

>>
>> commit 0549e76663730235a10395a7af7ad3d3ce6e2402
>> Author: Christophe Leroy <christophe.leroy@csgroup.eu>
>> Date:   Tue Jul 2 15:51:25 2024 +0200
>>
>>       powerpc/8xx: rework support for 8M pages using contiguous PTE entries
>>       In order to fit better with standard Linux page tables layout, add
>> support
>>       for 8M pages using contiguous PTE entries in a standard page
>> table.  Page
>>       tables will then be populated with 1024 similar entries and two PMD
>>       entries will point to that page table.
>>       The PMD entries also get a flag to tell it is addressing an 8M
>> page, this
>>       is required for the HW tablewalk assistance.
>>
>> Where we are walking a PTE table, but actually there is another PTE
>> table we
>> have to modify in the same go.
>>
>>
>> Very hard to make that non-hugetlb aware, as it's simply completely
>> different compared
>> to ordinary page table walking/modifications today.
>>
>> Maybe there are ideas to tackle that, and I'd be very interested in them.
>>
> 
> 
> But at least that 8xx change allowed us to get ride of huge page
> directories (hugepd) which was even more painful IIUC.

Yes, don't get me wrong, it was a clear win to get rid of hugepd, 
allowing GUP and folio_walk to work in a non-hugetlb fashion: at least 
when all we want to do is look up which page is mapped at a given 
address.

Unfortunately, that's not what all page table walkers do.

> 
> Neverthless, can't we turn that into a standard walk in a way or another ?
> 
> While we walk we reach a PMD entry which is marked as a CONT-PMD, but it
> is not tagged as a leaf entry, so there is a page table below. PMD_SIZE
> is 4M but the page size is 8M so once you've walked the page table
> entirely you know you still have 4M to go so you have to walk the second
> PMD and the page table it points to.

We would somehow have to fake that it is a PMD leaf, and realize that 
both are cont so we can batch both PMDs. The PTE page table handling 
is a bit of a pain, though.

... and modifying entries is a bit of a pain as well, unless we can 
hide all that somehow in the powerpc pmd setters.

Hmm, far from ideal, at least at this stage, because we don't really 
support cont-pmd outside of hugetlb, and a lot of page table walkers 
would have to be taught to deal with cont-pmd.

> 
> By the way, don't know it can help or make things worse, but indeed from
> a HW point of view there is no need to replicate 1024 times the PTE
> entry. Here we used a standard page table because it looked more generic
> from kernel point of view, but all the HW needs is a single PTE located
> at a page aligned address. Thats what we had when we used huge page
> directories (hugepd). It was even easier because both PMD entries were
> pointing to the same hugepd entry hence no need of CONT-PTE-like
> management at PTE level.

Ah, I see. I'll have to think about that a bit ... far from trivial.

-- 
Cheers,

David / dhildenb




* Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk
  2025-02-03 10:10   ` Oscar Salvador
@ 2025-02-04 20:40     ` David Hildenbrand
  2025-02-05  9:33       ` Oscar Salvador
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2025-02-04 20:40 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: lsf-pc, Peter Xu, Muchun Song, linux-mm

On 03.02.25 11:10, Oscar Salvador wrote:
> On Fri, Jan 31, 2025 at 12:19:24AM +0100, David Hildenbrand wrote:
>> On 30.01.25 22:36, Oscar Salvador wrote:
>> Hi Oscar,
> 
> Hi David

Hi,

> 
>>>
>>> HugeTLB has its own way of dealing with things.
>>> E.g: HugeTLB interprets everything as a pte: huge_pte_uffd_wp, huge_pte_clear_uffd_wp,
>>> huge_pte_dirty, huge_pte_modify, huge_pte_wrprotect etc.
>>>
>>> One of the challenges that this raises is that if we want pmd/pud walkers to
>>> be able to make sense of hugetlb stuff, we need to implement pud/pmd
>>> (maybe some pmd we already have because of THP) variants of those.
>>
>> that's the easy case I'm afraid. The real problem are cont-pte constructs (or worse)
>> abstracted by hugetlb to be a single unit ("hugetlb pte").
> 
> Yes, that is a PITA to be honest.
> E.g: Most of the code that lives in fs/proc/task_mmu.c works under that
> assumption.
> I was checking how we could make cont-{pmd,pte} work for hugetlb in that
> area (just because it happens to be the first area I am trying to convert).
> E.g: Let us have a look at smaps_pte_range/smaps_pte_entry.
> 
> I came up with something like the following (completely untested, just a
> PoC) to make smaps_pte_range work for cont-ptes.
> We would also need to change all references to PAGE_SIZE to the
> actual size of the folio (if there are any).

Unfortunately not that easy. You can only have parts of a large folio 
mapped. It would be PAGE_SIZE * nr_ptes when batching.

> 
> But looking closer, folio_pte_batch is only declared in mm/internal.h,
> so I am not sure I can make use of that outside mm/ realm?

Sure, we could export that. We'd need something similar for PMDs, and 
batching all other non-present stuff (e.g., migration entries)

> If not, how could we work that around?
> 
>   diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>   index 193dd5f91fbf..21fb122ebcac 100644
>   --- a/fs/proc/task_mmu.c
>   +++ b/fs/proc/task_mmu.c
>   @@ -802,7 +802,7 @@ static void mss_hugetlb_update(struct mem_size_stats *mss, struct folio *folio,
>    #endif
>   
>    static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>   -		struct mm_walk *walk)
>   +		struct mm_walk *walk, int nr_batch_entries)
>    {
>    	struct mem_size_stats *mss = walk->private;
>    	struct vm_area_struct *vma = walk->vma;
>   @@ -813,8 +813,19 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>   
>    	if (pte_present(ptent)) {
>    		page = vm_normal_page(vma, addr, ptent);
>   -		young = pte_young(ptent);
>   -		dirty = pte_dirty(ptent);
>   +		if (nr_batch_entries) {
>   +			int nr;
>   +			nr = folio_pte_batch(page_folio(page), addr, pte,
>   +					     ptent, nr_batch_entries, 0, NULL,
>   +					     &young, &dirty);
>   +			if (nr != nr_batch_entries) {
>   +				walk->action = ACTION_AGAIN;
>   +				return;
>   +			}
>   +		} else {
>   +			young = pte_young(ptent);
>   +			dirty = pte_dirty(ptent);
>   +		}
>    		present = true;
>    	} else if (is_swap_pte(ptent)) {
>    		swp_entry_t swpent = pte_to_swp_entry(ptent);
>   @@ -935,7 +946,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>    			   struct mm_walk *walk)
>    {
>    	struct vm_area_struct *vma = walk->vma;
>   -	pte_t *pte;
>   +	pte_t *pte, ptent;
>    	spinlock_t *ptl;
>   
>    	ptl = pmd_huge_lock(pmd, vma);
>   @@ -950,8 +961,21 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>    		walk->action = ACTION_AGAIN;
>    		return 0;
>    	}
>   +
>   +	ptent = ptep_get(pte);
>   +	if (pte_present(ptent)) {
>   +		struct folio *folio = vm_normal_folio(vma, addr, ptent);
>   +		if (folio_test_large(folio)) {
>   +			/* Let us use pte batching */
>   +			smaps_pte_entry(pte, addr, walk, (end - addr) / PAGE_SIZE);
>   +			pte_unmap_unlock(pte, ptl);
>   +
>   +			return 0;
>   +		}
>   +	}
>   +
>    	for (; addr != end; pte++, addr += PAGE_SIZE)
>   -		smaps_pte_entry(pte, addr, walk);
>   +		smaps_pte_entry(pte, addr, walk, 0);
>    	pte_unmap_unlock(pte - 1, ptl);
>    out:
>    	cond_resched();
> 

I think this is the wrong approach. We should replace that pagewalk API 
usage by something better that hides all the batching.

The interface would look similar to folio_walk (no callbacks, handling 
of locking), but (a) work on ranges; (b) work also on non-folio entries; 
and (c) batch all suitable entries in the range.

Something like a pt_range_walk_start() that returns a "type" (folio 
range, migration entries, swap range, ...) + stores other details 
(range, level, ptep, ...) in a structure like folio_walk, to then 
provide mechanisms to continue (pt_walk_continue()) to walk or abort 
(pt_walk_done()) it. Similar to page_vma_mapped_walk(), but not specific 
to a given page/folio.

Then, we would simply process the output of that. With the hope that, 
for hugetlb it will just batch all cont-pte / cont-pmd entries into a 
single return value.

That will make the R/O walking as in task_mmu.c easier, hopefully.

Not so much with PTE/PMD modifications, like  damon_mkold_ops ... :( But 
maybe that just has to be special-cased for hugetlb, somehow ...
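To make the proposed contract a bit more tangible: the sketch below is a purely illustrative userspace model, not kernel code. Every name in it (pt_range_walk, PT_WALK_*, mock_pte) is hypothetical, loosely following the pt_range_walk_start()/pt_walk_continue() idea above, and only shows the batching behavior a caller would see.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical range types a walker could return. */
enum pt_walk_type { PT_WALK_NONE, PT_WALK_FOLIO, PT_WALK_MIGRATION, PT_WALK_SWAP };

/* Mock PTE -- illustrative only, not the kernel's pte_t. */
struct mock_pte {
	enum pt_walk_type type;	/* what kind of entry this is */
	unsigned long pfn;	/* mock "mapped pfn" for folio entries */
	unsigned int prot;	/* mock protection bits */
};

/* Output of one walk step, in the spirit of struct folio_walk. */
struct pt_range_walk {
	enum pt_walk_type type;	/* type of the batched range */
	size_t start;		/* index of the first entry in the batch */
	size_t nr;		/* number of batched entries */
};

/*
 * Return the position after the batch starting at @pos. Consecutive
 * entries are merged into one batch when they have the same type; folio
 * entries additionally need consecutive pfns and identical protection
 * bits, mimicking cont-pte-style batching of a large folio.
 */
static size_t pt_range_walk_next(const struct mock_pte *ptes, size_t n,
				 size_t pos, struct pt_range_walk *rw)
{
	if (pos >= n) {
		rw->type = PT_WALK_NONE;
		rw->nr = 0;
		return pos;
	}
	rw->type = ptes[pos].type;
	rw->start = pos;
	rw->nr = 1;
	for (size_t i = pos + 1; i < n; i++) {
		if (ptes[i].type != rw->type)
			break;
		if (rw->type == PT_WALK_FOLIO &&
		    (ptes[i].pfn != ptes[i - 1].pfn + 1 ||
		     ptes[i].prot != ptes[pos].prot))
			break;
		rw->nr++;
	}
	return pos + rw->nr;
}
```

A consumer like the smaps code would then loop over batches and switch on rw.type instead of receiving one callback per PTE: four cont-compatible entries of one folio come back as a single PT_WALK_FOLIO batch with nr == 4, and a swap entry as its own batch.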

> 
> 
>> For "ordinary" pages, the cont-pte bit (as on arm64) is nowadays transparently
>> managed: you can modify any PTE part of the cont-gang and it will just
>> work as expected, transparently.
> 
> But that is because AFAIU, on arm64 it knows if it's dealing with a
> cont-pte and it queries the status of the whole "gang", right?

Yes, that's how I remember it works on arm64.

Note, though, that non-present PTEs don't use the cont bit. So one 
must always "batch" these manually.

> 
> I see that huge_ptep_get/set_huge_pte_at checks whether it's dealing
> with cont-ptes and modifies/queries all sub-ptes.
> How is that done for non-hugetlb pages?

Similarly, at least on arm64. When setting individual PTEs, it 
automatically sets the cont-pte bit once it detects that the applicable 
PTEs are all "cont-compatible" (e.g., belong to same folio, same 
protection bits). Other architectures might require more work (powerpc).
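To make "cont-compatible" concrete, here is a minimal userspace sketch. The mock types and the alignment rule are assumptions for illustration only (the real logic lives in the arm64 contpte code); it just models the check an architecture would perform before transparently setting the contiguous bit:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CONT_PTES 16	/* arm64 with 4K pages: 16 contiguous PTEs map 64K */

/* Mock PTE -- illustrative only, not the kernel's pte_t. */
struct mock_pte {
	bool present;
	unsigned long pfn;
	unsigned int prot;
};

/*
 * A block of CONT_PTES entries is "cont-compatible" when every entry is
 * present, the pfns are consecutive and naturally aligned to the block
 * size, and the protection bits all match -- roughly the conditions
 * checked before the contiguous bit can be set on the whole block.
 */
static bool cont_compatible(const struct mock_pte pte[CONT_PTES])
{
	if (!pte[0].present || pte[0].pfn % CONT_PTES != 0)
		return false;
	for (size_t i = 1; i < CONT_PTES; i++) {
		if (!pte[i].present ||
		    pte[i].pfn != pte[0].pfn + i ||
		    pte[i].prot != pte[0].prot)
			return false;
	}
	return true;
}
```

Note how a single non-present entry disqualifies the whole block, which matches the point above that non-present PTEs never carry the cont bit and must always be batched manually.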


-- 
Cheers,

David / dhildenb




* Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk
  2025-02-04 20:40     ` David Hildenbrand
@ 2025-02-05  9:33       ` Oscar Salvador
  2025-02-11 13:31         ` David Hildenbrand
  0 siblings, 1 reply; 11+ messages in thread
From: Oscar Salvador @ 2025-02-05  9:33 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: lsf-pc, Peter Xu, Muchun Song, linux-mm

On Tue, Feb 04, 2025 at 09:40:16PM +0100, David Hildenbrand wrote:
> Unfortunately not that easy. You can only have parts of a large folio
> mapped. It would be PAGE_SIZE * nr_ptes when batching.

I see, you are right.

> I think this is the wrong approach. We should replace that pagewalk API
> usage by something better that hides all the batching.
> 
> The interface would look similar to folio_walk (no callbacks, handling of
> locking), but (a) work on ranges; (b) work also on non-folio entries; and
> (c) batch all suitable entries in the range.
> 
> Something like a pt_range_walk_start() that returns a "type" (folio range,
> migration entries, swap range, ...) + stores other details (range, level,
> ptep, ...) in a structure like folio_walk, to then provide mechanisms to
> continue (pt_walk_continue()) to walk or abort (pt_walk_done()) it. Similar
> to page_vma_mapped_walk(), but not specific to a given page/folio.
> 
> Then, we would simply process the output of that. With the hope that, for
> hugetlb it will just batch all cont-pte / cont-pmd entries into a single
> return value.
> 
> That will make the R/O walking as in task_mmu.c easier, hopefully.
> 
> Not so much with PTE/PMD modifications, like  damon_mkold_ops ... :( But
> maybe that just has to be special-cased for hugetlb, somehow ...

Ok, let me see if we are on the same page.
You are basically saying that we should replace the existing pagewalk API with
something similar to what you described above.

I have to confess that when you first mentioned this back in July when I
posted the RFC, I felt disheartened, because it implies an even bigger
surgery.

But having felt the mess of dealing with cont-{pmd,pud}s, and the
inability of the existing API to handle that in a clean way (without
having to teach each and every function about it), maybe it is the
only way to do this 1) right and 2) cleanly.

I thought that maybe we could get away with doing the batching somehow
before calling into the callbacks, e.g. at walk_{pud,pmd,pte}_range
level, but I am not sure 1) whether that is possible and 2) how ugly it
would look.

So, given that this is not a really urgent matter that needs to be
fixed asap, maybe the way to go is to create an API that can deal with
all of that, abstracting away these details.

I am willing to take a shot at this, if we are clear that it makes sense
to pursue this road.


-- 
Oscar Salvador
SUSE Labs



* Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk
  2025-02-05  9:33       ` Oscar Salvador
@ 2025-02-11 13:31         ` David Hildenbrand
  2025-02-12  9:13           ` Oscar Salvador
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2025-02-11 13:31 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: lsf-pc, Peter Xu, Muchun Song, linux-mm

On 05.02.25 10:33, Oscar Salvador wrote:
> On Tue, Feb 04, 2025 at 09:40:16PM +0100, David Hildenbrand wrote:
>> Unfortunately not that easy. You can only have parts of a large folio
>> mapped. It would be PAGE_SIZE * nr_ptes when batching.
> 
> I see, you are right.
> 
>> I think this is the wrong approach. We should replace that pagewalk API
>> usage by something better that hides all the batching.
>>
>> The interface would look similar to folio_walk (no callbacks, handling of
>> locking), but (a) work on ranges; (b) work also on non-folio entries; and
>> (c) batch all suitable entries in the range.
>>
>> Something like a pt_range_walk_start() that returns a "type" (folio range,
>> migration entries, swap range, ...) + stores other details (range, level,
>> ptep, ...) in a structure like folio_walk, to then provide mechanisms to
>> continue (pt_walk_continue()) to walk or abort (pt_walk_done()) it. Similar
>> to page_vma_mapped_walk(), but not specific to a given page/folio.
>>
>> Then, we would simply process the output of that. With the hope that, for
>> hugetlb it will just batch all cont-pte / cont-pmd entries into a single
>> return value.
>>
>> That will make the R/O walking as in task_mmu.c easier, hopefully.
>>
>> Not so much with PTE/PMD modifications, like  damon_mkold_ops ... :( But
>> maybe that just has to be special-cased for hugetlb, somehow ...
> 
> Ok, let me see if we are on the same page.
> You are basically saying that we should replace the existing pagewalk API with
> something similar to what you described above.

Yes, something that adds a better abstraction and walks ranges (e.g., 
folio range, swap range, ..) moving away from the per-pte callbacks etc.

> 
> I have to confess that when you first mentioned this back in July when I
> posted the RFC, I felt disheartened, because it implies an even bigger
> surgery.

Yeah. But it will be worth it in the long run (at least I hope so ;) )

> 
> But having felt the mess of dealing with cont-{pmd,pud}s, and the
> inability of the existing API to handle that in a clean way (without
> having to teach each and every function about it), maybe it is the
> only way to do this 1) right and 2) cleanly.

Jup

> I thought that maybe we could get away with doing the batching somehow
> before calling into the callbacks, e.g. at walk_{pud,pmd,pte}_range
> level, but I am not sure 1) whether that is possible and 2) how ugly it
> would look.

Exactly.

> 
> So, given that this is not a really urgent matter that needs to be
> fixed asap, maybe the way to go is to create an API that can deal with
> all of that, abstracting away these details.

Yes, it's been on my todo list for longer (after I completed 
folio_walk), but I didn't get to it yet.

> 
> I am willing to take a shot at this, if we are clear that it makes sense
> to pursue this road.

Yes, feel free to use what I propose as a starting point. If I get to 
prototype something in the meantime, I'll send it to you. (staring at my 
inbox, I'm not so sure I'll get to it shortly :) )

-- 
Cheers,

David / dhildenb




* Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk
  2025-02-11 13:31         ` David Hildenbrand
@ 2025-02-12  9:13           ` Oscar Salvador
  0 siblings, 0 replies; 11+ messages in thread
From: Oscar Salvador @ 2025-02-12  9:13 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: lsf-pc, Peter Xu, Muchun Song, linux-mm

On Tue, Feb 11, 2025 at 02:31:58PM +0100, David Hildenbrand wrote:
> Yes, something that adds a better abstraction and walks ranges (e.g., folio
> range, swap range, ..) moving away from the per-pte callbacks etc.

Yes, that makes sense.

> Yes, feel free to use what I propose as a starting point. If I get to
> prototype something in the meantime, I'll send it to you. (staring at my
> inbox, I'm not so sure I'll get to it shortly :) )

Yesterday I started to prototype something. Once I have something more
solid I will get back to you, so we can see whether we are aligned on
the API and what can be improved/fixed.

Thanks David for your insights!


-- 
Oscar Salvador
SUSE Labs



end of thread, other threads:[~2025-02-12  9:13 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-30 21:36 [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk Oscar Salvador
2025-01-30 22:45 ` Peter Xu
2025-01-30 22:46 ` Matthew Wilcox
2025-01-30 23:19 ` David Hildenbrand
2025-01-31 15:42   ` Christophe Leroy
2025-02-04 20:19     ` David Hildenbrand
2025-02-03 10:10   ` Oscar Salvador
2025-02-04 20:40     ` David Hildenbrand
2025-02-05  9:33       ` Oscar Salvador
2025-02-11 13:31         ` David Hildenbrand
2025-02-12  9:13           ` Oscar Salvador
