Re: [PATCH 8/9] mm/rmap: introduce folio_try_share_anon_rmap_ptes

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Dev Jain <dev.jain@arm.com>
To: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
Cc: akpm@linux-foundation.org, axelrasmussen@google.com,
	yuanchu@google.com, david@kernel.org, hughd@google.com,
	chrisl@kernel.org, kasong@tencent.com, weixugc@google.com,
	Liam.Howlett@oracle.com, vbabka@kernel.org, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, riel@surriel.com,
	harry.yoo@oracle.com, jannh@google.com, pfalcato@suse.de,
	baolin.wang@linux.alibaba.com, shikemeng@huaweicloud.com,
	nphamcs@gmail.com, bhe@redhat.com, baohua@kernel.org,
	youngjun.park@lge.com, ziy@nvidia.com, kas@kernel.org,
	willy@infradead.org, yuzhao@google.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, ryan.roberts@arm.com,
	anshuman.khandual@arm.com
Subject: Re: [PATCH 8/9] mm/rmap: introduce folio_try_share_anon_rmap_ptes
Date: Wed, 11 Mar 2026 13:39:25 +0530	[thread overview]
Message-ID: <aeb28e78-4943-4330-b323-799dd80207f2@arm.com> (raw)
In-Reply-To: <3ff22f08-5617-4c05-8fa0-1d808806e322@lucifer.local>



On 10/03/26 3:08 pm, Lorenzo Stoakes (Oracle) wrote:
> On Tue, Mar 10, 2026 at 01:00:12PM +0530, Dev Jain wrote:
>> In the quest of enabling batched unmapping of anonymous folios, we need to
>> handle the sharing of exclusive pages. Hence, a batched version of
>> folio_try_share_anon_rmap_pte is required.
>>
>> Currently, the sole purpose of nr_pages in __folio_try_share_anon_rmap is
>> to do some rmap sanity checks. Add helpers to set and clear the
>> PageAnonExclusive bit on a batch of nr_pages. Note that
>> __folio_try_share_anon_rmap can receive nr_pages == HPAGE_PMD_NR from the
>> PMD path, but currently we only clear the bit on the head page. Retain this
>> behaviour by setting nr_pages = 1 in case the caller is
>> folio_try_share_anon_rmap_pmd.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>  include/linux/page-flags.h | 11 +++++++++++
>>  include/linux/rmap.h       | 28 ++++++++++++++++++++++++++--
>>  mm/rmap.c                  |  2 +-
>>  3 files changed, 38 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index 0e03d816e8b9d..1d74ed9a28c41 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -1178,6 +1178,17 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
>>  	__clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
>>  }
>>
>> +static __always_inline void ClearPagesAnonExclusive(struct page *page,
>> +		unsigned int nr)
> 
> You're randomly moving between nr and nr_pages, can we just consistently use
> nr_pages please.

Okay.

> 
>> +{
>> +	for (;;) {
>> +		ClearPageAnonExclusive(page);
>> +		if (--nr == 0)
> 
> You really require nr to != 0 here or otherwise you're going to be clearing 4
> billion pages :)

I'm following the pattern in pgtable.h, see set_ptes,
clear_young_dirty_ptes, etc.

> 
>> +			break;
>> +		++page;
>> +	}
>> +}
> 
> Can we put this in mm.h or somewhere else please, and can we do away with this

What is the problem with page-flags.h? I am not very aware on which
function to put in which header file semantically, so please educate me on
this.

> HorribleNamingConvention, this is new, we can 'get away' with making it
> something sensible :)

I'll name it folio_clear_pages_anon_exclusive.


> 
> I wonder if we shouldn't also add a folio pointer here, and some
> VM_WARN_ON_ONCE()'s. Like:
> 
> static inline void folio_clear_page_batch(struct folio *folio,
> 					  struct page *first_subpage,
> 					  unsigned int nr_pages)
> {
> 	struct page *subpage = first_subpage;
> 
> 	VM_WARN_ON_ONCE(!nr_pages);
> 	VM_WARN_ON_ONCE(... check first_subpage in folio ...);
> 	VM_WARN_ON_ONCE(... check first_subpage -> first_subpage + nr_pages in folio ...);

I like what you are saying, but __folio_rmap_sanity_checks in the caller
checks exactly this :)

> 
> 	while (nr_pages--)
> 		ClearPageAnonExclusive(subpage++);
> }
> 
>> +
>>  #ifdef CONFIG_MMU
>>  #define __PG_MLOCKED		(1UL << PG_mlocked)
>>  #else
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index 1b7720c66ac87..7a67776dca3fe 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -712,9 +712,13 @@ static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
>>  	VM_WARN_ON_FOLIO(!PageAnonExclusive(page), folio);
>>  	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>
>> +	/* We only clear anon-exclusive from head page of PMD folio */
> 
> Is this accurate? David? I thought anon exclusive was per-subpage for any large
> folio...?

The current behaviour is to do this only. I was also surprised with this,
so I had dug in and found out:

https://lore.kernel.org/all/20220428083441.37290-13-david@redhat.com/

where David says:

"Long story short: once
PTE-mapped, we have to track information about exclusivity per sub-page,
but until then, we can just track it for the compound page in the head
page and not having to update a whole bunch of subpages all of the time
for a simple PMD mapping of a THP."


> 
> If we're explicitly doing this for some reason here, then why introduce it?
> 
>> +	if (level == PGTABLE_LEVEL_PMD)
>> +		nr_pages = 1;
>> +
>>  	/* device private folios cannot get pinned via GUP. */
>>  	if (unlikely(folio_is_device_private(folio))) {
>> -		ClearPageAnonExclusive(page);
>> +		ClearPagesAnonExclusive(page, nr_pages);
> 
> I really kind of hate this 'we are looking at subpage X with variable page in
> folio Y, but we don't mention Y' thing. It's super confusing that we have a
> pointer to a thing which sometimes we deref and treat as a value we care about
> and sometimes treat as an array.
> 
> This pattern exists throughout all the batched stuff and I kind of hate it
> everywhere.
> 
> I guess the batching means that we are looking at a sub-folio range.
> 
> If C had a better type system we could somehow have a type that encoded this,
> but it doesn't :>)
> 
> But I wonder if we shouldn't just go ahead and rename page -> pages and be
> consistent about this?

Agreed. You are correct in saying that this function should receive struct
folio to assert that we are essentially in a page array, and some sanity
checking should happen. But the callers are already doing the checking
in folio_rmap_sanity_checks. Let me think on this.


> 
>>  		return 0;
>>  	}
>>
>> @@ -766,7 +770,7 @@ static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
>>
>>  	if (unlikely(folio_maybe_dma_pinned(folio)))
>>  		return -EBUSY;
>> -	ClearPageAnonExclusive(page);
>> +	ClearPagesAnonExclusive(page, nr_pages);
>>
>>  	/*
>>  	 * This is conceptually a smp_wmb() paired with the smp_rmb() in
>> @@ -804,6 +808,26 @@ static inline int folio_try_share_anon_rmap_pte(struct folio *folio,
>>  	return __folio_try_share_anon_rmap(folio, page, 1, PGTABLE_LEVEL_PTE);
>>  }
>>
>> +/**
>> + * folio_try_share_anon_rmap_ptes - try marking exclusive anonymous pages
>> + *				    mapped by PTEs possibly shared to prepare
>> + *				   for KSM or temporary unmapping
> 
> This description is very confusing. 'Try marking exclusive anonymous pages
> [... marking them as what?] mapped by PTEs[, or (]possibly shared[, or )] to
> prepare for KSM[under what circumstances? Why mention KSM here?] or temporary
> unmapping [why temporary?]
> 
> OK I think you mean to say 'marking' them as 'possibly' shared.
> 
> But really by 'shared' you mean clearing anon exclusive right? So maybe the
> function name and description should reference that instead.
> 
> But this needs clarifying. This isn't an exercise in minimum number of words to
> describe the function.
> 
> Ohhh now I see this is what the comment is in folio_try_share_anon_rmap_pte() :P
> 
> Well, I wish we could update the original too ;) but OK this is fine as-is to
> matc that then.
> 
>> + * @folio:	The folio to share a mapping of
>> + * @page:	The first mapped exclusive page of the batch
>> + * @nr_pages:	The number of pages to share (batch size)
>> + *
>> + * See folio_try_share_anon_rmap_pte for full description.
>> + *
>> + * Context: The caller needs to hold the page table lock and has to have the
>> + * page table entries cleared/invalidated. Those PTEs used to map consecutive
>> + * pages of the folio passed here. The PTEs are all in the same PMD and VMA.
> 
> Can we VM_WARN_ON_ONCE() any of this? Not completely a necessity.

Again, we have WARN checks in folio_rmap_sanity_checks, and even in
folio_pte_batch. I am afraid of duplication.

> 
>> + */
>> +static inline int folio_try_share_anon_rmap_ptes(struct folio *folio,
>> +		struct page *page, unsigned int nr)
>> +{
>> +	return __folio_try_share_anon_rmap(folio, page, nr, PGTABLE_LEVEL_PTE);
>> +}
>> +
>>  /**
>>   * folio_try_share_anon_rmap_pmd - try marking an exclusive anonymous page
>>   *				   range mapped by a PMD possibly shared to
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 42f6b00cced01..bba5b571946d8 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2300,7 +2300,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>
>>  			/* See folio_try_share_anon_rmap(): clear PTE first. */
>>  			if (anon_exclusive &&
>> -			    folio_try_share_anon_rmap_pte(folio, subpage)) {
>> +			    folio_try_share_anon_rmap_ptes(folio, subpage, 1)) {
> 
> I guess this is because you intend to make this batched later with >1, but I
> don't see the point of adding this since folio_try_share_anon_rmap_pte() already
> does what you're doing here.
> 
> So why not just change this when you actually batch?

It is generally the consensus to introduce a new function along with its
caller. Although one may argue that the caller introduced here is not
a functional change (still passing nr_pages = 1). So I am fine doing
what you suggest.

> 
> Buuuut.... haven't you not already changed this whole function to now 'jump'
> ahead if batched, so why are we only specifying nr_pages = 1 here?

Because...please bear with the insanity :) currently we are in a ridiculous
situation where

nr_pages can be > 1 for file folios, and lazyfree folios, *and* it is
required that the VMA is non-uffd.

So, the "jump" thingy I was doing in the previous patch was adding support
for file folios, belonging to uffd VMAs (see pte_install_uffd_wp_if_needed,
we need to handle uffd-wp marker for file folios only, and also I should
have mentioned "file folio" in the subject line of that patch, of course
I missed that because reasoning through this code is very difficult)

To answer your question, currently for anon folios nr_pages == 1, so
the jump is a no-op.

When I had discovered the uffd-wp bug some weeks back, I was pushing
back against the idea of hacking around it by disabling batching
for uffd-VMAs in folio_unmap_pte_batch, but solve it then and there
properly. Now we have too many cases - first we added lazyfree support,
then file-non-uffd support, my patch 5 adds file-uffd support, and
last patch finally completes this with anon support.


> 
> Honestly I think this function needs to be fully refactored away from the
> appalling giant-ball-of-string mess it is now before we try to add in batching
> to be honest.
> 
> Let's not accumulate more technical debt.

I agree, I am happy to help in cleaning up this function.

> 
> 
>>  				folio_put_swap(folio, subpage, 1);
>>  				set_pte_at(mm, address, pvmw.pte, pteval);
>>  				goto walk_abort;
>> --
>> 2.34.1
>>
> 
> Thanks, Lorenzo

next prev parent reply	other threads:[~2026-03-11  8:09 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-10  7:30 [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Dev Jain
2026-03-10  7:30 ` [PATCH 1/9] mm/rmap: make nr_pages signed in try_to_unmap_one Dev Jain
2026-03-10  7:56   ` Lorenzo Stoakes (Oracle)
2026-03-10  8:06     ` David Hildenbrand (Arm)
2026-03-10  8:23       ` Dev Jain
2026-03-10 12:40         ` Matthew Wilcox
2026-03-11  4:54           ` Dev Jain
2026-03-10  7:30 ` [PATCH 2/9] mm/rmap: initialize nr_pages to 1 at loop start " Dev Jain
2026-03-10  8:10   ` Lorenzo Stoakes (Oracle)
2026-03-10  8:31     ` Dev Jain
2026-03-10  8:39       ` Lorenzo Stoakes (Oracle)
2026-03-10  8:43         ` Dev Jain
2026-03-10  7:30 ` [PATCH 3/9] mm/rmap: refactor lazyfree unmap commit path to commit_ttu_lazyfree_folio() Dev Jain
2026-03-10  8:19   ` Lorenzo Stoakes (Oracle)
2026-03-10  8:42     ` Dev Jain
2026-03-19 15:53       ` Lorenzo Stoakes (Oracle)
2026-03-10  7:30 ` [PATCH 4/9] mm/memory: Batch set uffd-wp markers during zapping Dev Jain
2026-03-10  7:30 ` [PATCH 5/9] mm/rmap: batch unmap folios belonging to uffd-wp VMAs Dev Jain
2026-03-10  8:34   ` Lorenzo Stoakes (Oracle)
2026-03-10 23:32     ` Barry Song
2026-03-11  4:14       ` Barry Song
2026-03-11  4:52         ` Dev Jain
2026-03-11  4:56     ` Dev Jain
2026-03-10  7:30 ` [PATCH 6/9] mm/swapfile: Make folio_dup_swap batchable Dev Jain
2026-03-10  8:27   ` Kairui Song
2026-03-10  8:46     ` Dev Jain
2026-03-10  8:49   ` Lorenzo Stoakes (Oracle)
2026-03-11  5:42     ` Dev Jain
2026-03-19 15:26       ` Lorenzo Stoakes (Oracle)
2026-03-19 16:47       ` Matthew Wilcox
2026-03-18  0:20   ` kernel test robot
2026-03-10  7:30 ` [PATCH 7/9] mm/swapfile: Make folio_put_swap batchable Dev Jain
2026-03-10  8:29   ` Kairui Song
2026-03-10  8:50     ` Dev Jain
2026-03-10  8:55   ` Lorenzo Stoakes (Oracle)
2026-03-18  1:04   ` kernel test robot
2026-03-10  7:30 ` [PATCH 8/9] mm/rmap: introduce folio_try_share_anon_rmap_ptes Dev Jain
2026-03-10  9:38   ` Lorenzo Stoakes (Oracle)
2026-03-11  8:09     ` Dev Jain [this message]
2026-03-12  8:19       ` Wei Yang
2026-03-19 15:47       ` Lorenzo Stoakes (Oracle)
2026-04-08  7:14         ` Dev Jain
2026-03-10  7:30 ` [PATCH 9/9] mm/rmap: enable batch unmapping of anonymous folios Dev Jain
2026-03-10  8:02 ` [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Lorenzo Stoakes (Oracle)
2026-03-10  9:28   ` Dev Jain
2026-03-10 12:59 ` Lance Yang
2026-03-11  8:11   ` Dev Jain

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aeb28e78-4943-4330-b323-799dd80207f2@arm.com \
    --to=dev.jain@arm.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=anshuman.khandual@arm.com \
    --cc=axelrasmussen@google.com \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=bhe@redhat.com \
    --cc=chrisl@kernel.org \
    --cc=david@kernel.org \
    --cc=harry.yoo@oracle.com \
    --cc=hughd@google.com \
    --cc=jannh@google.com \
    --cc=kas@kernel.org \
    --cc=kasong@tencent.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=mhocko@suse.com \
    --cc=nphamcs@gmail.com \
    --cc=pfalcato@suse.de \
    --cc=riel@surriel.com \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=shikemeng@huaweicloud.com \
    --cc=surenb@google.com \
    --cc=vbabka@kernel.org \
    --cc=weixugc@google.com \
    --cc=willy@infradead.org \
    --cc=youngjun.park@lge.com \
    --cc=yuanchu@google.com \
    --cc=yuzhao@google.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox