From: Nico Pache <npache@redhat.com>
To: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	aarcange@redhat.com, akpm@linux-foundation.org,
	anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org,
	baolin.wang@linux.alibaba.com, byungchul@sk.com,
	catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net,
	dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com,
	gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com,
	jackmanb@google.com, jack@suse.cz, jannh@google.com,
	jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org,
	lance.yang@linux.dev, Liam.Howlett@oracle.com,
	lorenzo.stoakes@oracle.com, mathieu.desnoyers@efficios.com,
	matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com,
	peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com,
	raquini@redhat.com, rdunlap@infradead.org,
	richard.weiyang@gmail.com, rientjes@google.com,
	rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com,
	shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com,
	thomas.hellstrom@linux.intel.com, tiwai@suse.de,
	usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com,
	wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org,
	yang@os.amperecomputing.com, ying.huang@linux.alibaba.com,
	ziy@nvidia.com, zokeefe@google.com
Subject: Re: [PATCH mm-unstable v3 1/5] mm: consolidate anonymous folio PTE mapping into helpers
Date: Wed, 18 Mar 2026 10:48:55 -0600	[thread overview]
Message-ID: <42341241-23dc-439e-b32b-da011743adce@redhat.com> (raw)
In-Reply-To: <5eae219d-62b8-4aea-905d-7a9b59b9892b@lucifer.local>



On 3/16/26 12:17 PM, Lorenzo Stoakes (Oracle) wrote:
> On Wed, Mar 11, 2026 at 03:13:11PM -0600, Nico Pache wrote:
>> The anonymous page fault handler in do_anonymous_page() open-codes the
>> sequence to map a newly allocated anonymous folio at the PTE level:
>> 	- construct the PTE entry
>> 	- add rmap
>> 	- add to LRU
>> 	- set the PTEs
>> 	- update the MMU cache.
> 
> Yikes yeah this all needs work. Thanks for looking at this!

np! I believe it looks much cleaner now, and it has the added benefit of
cleaning up some of the mTHP patchset.

I also believe you suggested this, so I'll add your Suggested-by when I add your Reviewed-by tag :)

> 
>>
>> Introduce a two helpers to consolidate this duplicated logic, mirroring the
> 
> NIT: 'Introduce a two helpers' -> "introduce two helpers'

ack!

> 
>> existing map_anon_folio_pmd_nopf() pattern for PMD-level mappings:
>>
>> 	map_anon_folio_pte_nopf(): constructs the PTE entry, takes folio
>> 	references, adds anon rmap and LRU. This function also handles the
>> 	uffd_wp that can occur in the pf variant.
>>
>> 	map_anon_folio_pte_pf(): extends the nopf variant to handle MM_ANONPAGES
>> 	counter updates, and mTHP fault allocation statistics for the page fault
>> 	path.
> 
> MEGA nit, not sure why you're not just putting this in a bullet list, just
> weird to see not-code indented here :)

I'll drop the indentation!

> 
>>
>> The zero-page read path in do_anonymous_page() is also untangled from the
>> shared setpte label, since it does not allocate a folio and should not
>> share the same mapping sequence as the write path. Make nr_pages = 1
>> rather than relying on the variable. This makes it more clear that we
>> are operating on the zero page only.
>>
>> This refactoring will also help reduce code duplication between mm/memory.c
>> and mm/khugepaged.c, and provides a clean API for PTE-level anonymous folio
>> mapping that can be reused by future callers.
> 
> Maybe worth mentioning subseqent patches that will use what you set up
> here?

OK, I'll add something like "... that will be used when adding mTHP support to
khugepaged, and may find future reuse by other callers." Speaking of which, I
tried to leverage this elsewhere, and I believe that will take extra focus and
time, but it may be doable.

> 
> Also you split things out into _nopf() and _pf() variants, it might be
> worth saying exactly why you're doing that or what you are preparing to do?

OK, I'll expand on this in the "bullet list" you referenced above.

> 
>>
>> Reviewed-by: Dev Jain <dev.jain@arm.com>
>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>> Signed-off-by: Nico Pache <npache@redhat.com>
> 
> There's nits above and below, but overall the logic looks good, so with
> nits addressed/reasonably responded to:
> 
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

Thanks :) I'll take care of those.

> 
>> ---
>>  include/linux/mm.h |  4 ++++
>>  mm/memory.c        | 60 +++++++++++++++++++++++++++++++---------------
>>  2 files changed, 45 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 4c4fd55fc823..9fea354bd17f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -4903,4 +4903,8 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
>>
>>  void snapshot_page(struct page_snapshot *ps, const struct page *page);
>>
>> +void map_anon_folio_pte_nopf(struct folio *folio, pte_t *pte,
>> +		struct vm_area_struct *vma, unsigned long addr,
>> +		bool uffd_wp);
> 
> How I hate how uffd infiltrates all our code like this.
> 
> Not your fault :)
> 
>> +
>>  #endif /* _LINUX_MM_H */
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 6aa0ea4af1fc..5c8bf1eb55f5 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5197,6 +5197,37 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>  	return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
>>  }
>>
>> +void map_anon_folio_pte_nopf(struct folio *folio, pte_t *pte,
>> +		struct vm_area_struct *vma, unsigned long addr,
>> +		bool uffd_wp)
>> +{
>> +	unsigned int nr_pages = folio_nr_pages(folio);
> 
> const would be good
> 
>> +	pte_t entry = folio_mk_pte(folio, vma->vm_page_prot);
>> +
>> +	entry = pte_sw_mkyoung(entry);
>> +
>> +	if (vma->vm_flags & VM_WRITE)
>> +		entry = pte_mkwrite(pte_mkdirty(entry), vma);
>> +	if (uffd_wp)
>> +		entry = pte_mkuffd_wp(entry);
>> +
>> +	folio_ref_add(folio, nr_pages - 1);
>> +	folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
>> +	folio_add_lru_vma(folio, vma);
>> +	set_ptes(vma->vm_mm, addr, pte, entry, nr_pages);
>> +	update_mmu_cache_range(NULL, vma, addr, pte, nr_pages);
>> +}
>> +
>> +static void map_anon_folio_pte_pf(struct folio *folio, pte_t *pte,
>> +		struct vm_area_struct *vma, unsigned long addr, bool uffd_wp)
>> +{
>> +	unsigned int order = folio_order(folio);
> 
> const would be good here also!

Ack on the consts.

> 
>> +
>> +	map_anon_folio_pte_nopf(folio, pte, vma, addr, uffd_wp);
>> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1 << order);
> 
> Is 1 << order strictly right here? This field is a long value, so 1L <<
> order maybe? I get nervous about these shifts...
> 
> Note that folio_large_nr_pages() uses 1L << order so that does seem
> preferable.

Ok sounds good thanks!

> 
>> +	count_mthp_stat(order, MTHP_STAT_ANON_FAULT_ALLOC);
>> +}
>> +
>>  /*
>>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>   * but allow concurrent faults), and pte mapped but not yet locked.
>> @@ -5243,7 +5274,14 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  			return handle_userfault(vmf, VM_UFFD_MISSING);
>>  		}
>> -		goto setpte;
>> +		if (vmf_orig_pte_uffd_wp(vmf))
>> +			entry = pte_mkuffd_wp(entry);
>> +		set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
> 
> How I _despise_ how uffd is implemented in mm. Feels like open coded
> nonsense spills out everywhere.
> 
> Not your fault obviously :)
> 
>> +
>> +		/* No need to invalidate - it was non-present before */
>> +		update_mmu_cache_range(vmf, vma, addr, vmf->pte,
>> +				       /*nr_pages=*/ 1);
> 
> Is there any point in passing vmf here given you pass NULL above, and it
> appears that nobody actually uses this? I guess it doesn't matter but
> seeing this immediately made we question why you set it in one, and not the
> other?
> 
> Maybe I'm mistaken and some arch uses it? Don't think so though.
> 
> Also can't we then just use update_mmu_cache() which is the single-page
> wrapper of this AFAICT? That'd make it even simpler.

> 
> Having done this, is there any reason to keep the annoying and confusing
> initial assignment of nr_pages = 1 at declaration time?
> 
> It seems that nr_pages is unconditionally assigned before it's used
> anywhere now at line 5298:
> 
> 	nr_pages = folio_nr_pages(folio);
> 	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> 	...
> 
> It's kinda weird to use nr_pages again after you go out of your way to
> avoid it using folio_nr_pages() in map_anon_folio_pte_nopf() and
> folio_order() in map_anon_folio_pte_pf().

Yes, I believe so, thank you, that is cleaner!  In the past I had nr_pages as a
variable, but David suggested being explicit with the "1" so it would be obvious
it's a single page... update_mmu_cache() should indicate the same thing.

> 
> But yeah, ok we align the address and it's yucky maybe leave for now (but
> we can definitely stop defaulting nr_pages to 1 :)
> 
>> +		goto unlock;
>>  	}
>>
>>  	/* Allocate our own private page. */
>> @@ -5267,11 +5305,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  	 */
>>  	__folio_mark_uptodate(folio);
>>
>> -	entry = folio_mk_pte(folio, vma->vm_page_prot);
>> -	entry = pte_sw_mkyoung(entry);
>> -	if (vma->vm_flags & VM_WRITE)
>> -		entry = pte_mkwrite(pte_mkdirty(entry), vma);
>> -
>>  	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>  	if (!vmf->pte)
>>  		goto release;
>> @@ -5293,19 +5326,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  		folio_put(folio);
>>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>>  	}
>> -
>> -	folio_ref_add(folio, nr_pages - 1);
>> -	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>> -	count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC);
>> -	folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
>> -	folio_add_lru_vma(folio, vma);
>> -setpte:
>> -	if (vmf_orig_pte_uffd_wp(vmf))
>> -		entry = pte_mkuffd_wp(entry);
>> -	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>> -
>> -	/* No need to invalidate - it was non-present before */
>> -	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
>> +	map_anon_folio_pte_pf(folio, vmf->pte, vma, addr,
>> +			      vmf_orig_pte_uffd_wp(vmf));
> 
> So we're going from:
> 
> entry = folio_mk_pte(...)
> entry = pte_sw_mkyoung(...)
> if (write)
> 	entry = pte_mkwrite(also dirty...)
> folio_ref_add(nr_pages - 1)
> add_mm_counter(... MM_ANON_PAGES, nr_pages)
> count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC)
> folio_add_new_anon_rmap(.., RMAP_EXCLUSIVE)
> folio_add_lru_vma(folio, vma)
> if (vmf_orig_pte_uffd_wp(vmf))
> 	entry = pte_mkuffd_wp(entry)
> set_ptes(mm, addr, vmf->pte, entry, nr_pages)
> update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages)
> 
> To:
> 
> <map_anon_folio_pte_nopf>
> entry = folio_mk_pte(...)
> entry = pte_sw_mkyoung(...)
> if (write)
> 	entry = pte_mkwrite(also dirty...)
> if (vmf_orig_pte_uffd_wp(vmf) <passed in via uffd_wp>) <-- reordered
> 	entry = pte_mkuffd_wp(entry)
> folio_ref_add(nr_pages - 1)
> folio_add_new_anon_rmap(.., RMAP_EXCLUSIVE)
> folio_add_lru_vma(folio, vma)
> set_ptes(mm, addr, pte, entry, nr_pages)
> update_mmu_cache_range(NULL, vma, addr, pte, nr_pages)
> 
> <map_anon_folio_pte_pf>
> add_mm_counter(... MM_ANON_PAGES, nr_pages) <-- reordered
> count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC) <-- reordered
> 
> But the reorderings seem fine, and it is achieving the same thing.
> 
> All the parameters being passed seem correct too.

Thanks for the review and for verifying the logic :) I was particularly wary of
changing the page fault handler, so I'm glad multiple people have confirmed this
seems fine.

Cheers,
-- Nico

> 
>>  unlock:
>>  	if (vmf->pte)
>>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>> --
>> 2.53.0
>>
> 
> Cheers, Lorenzo
> 


