linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Oscar Salvador <osalvador@suse.de>
Cc: lsf-pc@lists.linux-foundation.org, Peter Xu <peterx@redhat.com>,
	Muchun Song <muchun.song@linux.dev>,
	linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk
Date: Tue, 4 Feb 2025 21:40:16 +0100	[thread overview]
Message-ID: <74ecaa8b-9e94-4ba8-a2f0-a312607516ba@redhat.com> (raw)
In-Reply-To: <Z6CWMBKlN0-RHI2P@localhost.localdomain>

On 03.02.25 11:10, Oscar Salvador wrote:
> On Fri, Jan 31, 2025 at 12:19:24AM +0100, David Hildenbrand wrote:
>> On 30.01.25 22:36, Oscar Salvador wrote:
>> Hi Oscar,
> 
> Hi David

Hi,

> 
>>>
>>> HugeTLB has its own way of dealing with things.
>>> E.g: HugeTLB interprets everything as a pte: huge_pte_uffd_wp, huge_pte_clear_uffd_wp,
>>> huge_pte_dirty, huge_pte_modify, huge_pte_wrprotect etc.
>>>
>>> One of the challenges that this raises is that if we want pmd/pud walkers to
>>> be able to make sense of hugetlb stuff, we need to implement pud/pmd
>>> (maybe some pmd we already have because of THP) variants of those.
>>
>> that's the easy case I'm afraid. The real problem are cont-pte constructs (or worse)
>> abstracted by hugetlb to be a single unit ("hugetlb pte").
> 
> Yes, that is a PITA to be honest.
> E.g: Most of the code that lives in fs/proc/task_mmu.c works under that
> assumption.
> I was checking how we could make cont-{pmd,pte} work for hugetlb in that
> area (just because it happens to be the first are I am trying to convert).
> E.g: Let us have a look at smaps_pte_range/smaps_pte_entry.
> 
> I came up with something like the following (completely untested, just a
> PoC) to make smaps_pte_range work for cont-ptes.
> We would also need to change all the references of PAGE_SIZE to the
> actual size of the folio (if there are).

Unfortunately not that easy. You can only have parts of a large folio 
mapped. It would be PAGE_SIZE * nr_ptes when batching.

> 
> But looking closer, folio_pte_batch is only declared in mm/internal.h,
> so I am not sure I can make use of that outside mm/ realm?

Sure, we could export that. We'd need something similar for PMDs, and 
batching all other non-present stuff (e.g., migration entries)

> If not, how could we work that around?
> 
>   diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>   index 193dd5f91fbf..21fb122ebcac 100644
>   --- a/fs/proc/task_mmu.c
>   +++ b/fs/proc/task_mmu.c
>   @@ -802,7 +802,7 @@ static void mss_hugetlb_update(struct mem_size_stats *mss, struct folio *folio,
>    #endif
>   
>    static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>   -		struct mm_walk *walk)
>   +		struct mm_walk *walk, int nr_batch_entries)
>    {
>    	struct mem_size_stats *mss = walk->private;
>    	struct vm_area_struct *vma = walk->vma;
>   @@ -813,8 +813,19 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>   
>    	if (pte_present(ptent)) {
>    		page = vm_normal_page(vma, addr, ptent);
>   -		young = pte_young(ptent);
>   -		dirty = pte_dirty(ptent);
>   +		if (nr_batch_entries) {
>   +			int nr;
>   +			nr = folio_pte_batch(folio_page(page, 0), addr, pte,
>   +					     ptent, max_nr, 0, NULL, &young,
>   +					     &dirty);
>   +			if (nr != nr_batch_entries) {
>   +				walk->action = ACTION_AGAIN;
>   +				return;
>   +			}
>   +		} else {
>   +			young = pte_young(ptent);
>   +			dirty = pte_dirty(ptent);
>   +		}
 >    		present = true;>    	} else if (is_swap_pte(ptent)) {
>    		swp_entry_t swpent = pte_to_swp_entry(ptent);
>   @@ -935,7 +946,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>    			   struct mm_walk *walk)
>    {
>    	struct vm_area_struct *vma = walk->vma;
>   -	pte_t *pte;
>   +	pte_t *pte, ptent;
>    	spinlock_t *ptl;
>   
>    	ptl = pmd_huge_lock(pmd, vma);
>   @@ -950,8 +961,21 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>    		walk->action = ACTION_AGAIN;
>    		return 0;
>    	}
>   +
>   +	ptent = ptep_get(pte);
>   +	if (pte_present(ptent)) {
>   +		struct folio *folio = vm_normal_folio(vma, addr, ptent);
>   +		if (folio_test_large(folio)) {
>   +			/* Let us use pte batching */
>   +			smaps_pte_entry(pte, addr, walk, (end - addr) / PAGE_SIZE);
>   +			pte_unmap_unlock(pte - 1, ptl);
>   +
>   +			return 0;
>   +		}
>   +	}
>   +
>    	for (; addr != end; pte++, addr += PAGE_SIZE)
>   -		smaps_pte_entry(pte, addr, walk);
>   +		smaps_pte_entry(pte, addr, walk, 0);
>    	pte_unmap_unlock(pte - 1, ptl);
>    out:
>    	cond_resched();
> 

I think this is the wrong approach. We should replace that pagewalk API 
usage by something better that hides all the batching.

The interface would look similar to folio_walk (no callbacks, handling 
of locking), but (a) work on ranges; (b) work also on non-folio entries; 
and (c) batch all suitable entries in the range.

Something like a pt_range_walk_start() that returns a "type" (folio 
range, migration entries, swap range, ...) + stores other details 
(range, level, ptep, ...) in a structure like folio_walk, to then 
provide mechanisms to continue (pt_walk_continue()) to walk or abort 
(pt_walk_done()) it. Similar to page_vma_mapped_walk(), but not specific 
to a given page/folio.

Then, we would simply process the output of that. With the hope that, 
for hugetlb it will just batch all cont-pte / cont-pmd entries into a 
single return value.

That will make the R/O walking as in task_mmu.c easier, hopefully.

Not so much with PTE/PMD modifications, like  damon_mkold_ops ... :( But 
maybe that just has to be special-cased for hugetlb, somehow ...

> 
> 
>> For "ordinary" pages, the cont-pte bit (as on arm64) is nowadays transparently
>> managed: you can modify any PTE part of the cont-gang and it will just
>> work as expected, transparently.
> 
> But that is because AFAIU, on arm64 it knows if it's dealing with a
> cont-pte and it queries the status of the whole "gang", right?

Yes, that's how I remember it works on arm64.

Note that, though, that non-present PTEs don't use the cont bit. So one 
must "batch" these always manually.

> 
> I see that huge_ptep_get/set_huge_pte_at checks whether it's dealing
> with cont-ptes and modifies/queries all sub-ptes.
> How is that done for non-hugetlb pages?

Similarly, at least on arm64. When setting individual PTEs, it 
automatically sets the cont-pte bit once it detects that the applicable 
PTEs are all "cont-compatible" (e.g., belong to same folio, same 
protection bits). Other architectures might require more work (powrpc).


-- 
Cheers,

David / dhildenb



  reply	other threads:[~2025-02-04 20:40 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-30 21:36 Oscar Salvador
2025-01-30 22:45 ` Peter Xu
2025-01-30 22:46 ` Matthew Wilcox
2025-01-30 23:19 ` David Hildenbrand
2025-01-31 15:42   ` Christophe Leroy
2025-02-04 20:19     ` David Hildenbrand
2025-02-03 10:10   ` Oscar Salvador
2025-02-04 20:40     ` David Hildenbrand [this message]
2025-02-05  9:33       ` Oscar Salvador
2025-02-11 13:31         ` David Hildenbrand
2025-02-12  9:13           ` Oscar Salvador

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=74ecaa8b-9e94-4ba8-a2f0-a312607516ba@redhat.com \
    --to=david@redhat.com \
    --cc=linux-mm@kvack.org \
    --cc=lsf-pc@lists.linux-foundation.org \
    --cc=muchun.song@linux.dev \
    --cc=osalvador@suse.de \
    --cc=peterx@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox