Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Dev Jain <dev.jain@arm.com>
To: Baolin Wang <baolin.wang@linux.alibaba.com>,
	Barry Song <21cnbao@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>,
	akpm@linux-foundation.org, david@kernel.org,
	catalin.marinas@arm.com, will@kernel.org,
	lorenzo.stoakes@oracle.com, ryan.roberts@arm.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, riel@surriel.com,
	harry.yoo@oracle.com, jannh@google.com, willy@infradead.org,
	linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
Date: Mon, 19 Jan 2026 12:06:59 +0530	[thread overview]
Message-ID: <6ed22a34-ce8a-4531-9776-b28e1a942450@arm.com> (raw)
In-Reply-To: <29634bee-18c6-42c2-ac7f-703d4dfed867@linux.alibaba.com>


On 19/01/26 11:20 am, Baolin Wang wrote:
>
>
> On 1/18/26 1:46 PM, Dev Jain wrote:
>>
>> On 16/01/26 7:58 pm, Barry Song wrote:
>>> On Fri, Jan 16, 2026 at 5:53 PM Dev Jain <dev.jain@arm.com> wrote:
>>>>
>>>> On 07/01/26 7:16 am, Wei Yang wrote:
>>>>> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>>>>>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com>
>>>>>> wrote:
>>>>>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>>>>>> Similar to folio_referenced_one(), we can apply batched unmapping
>>>>>>>> for file
>>>>>>>> large folios to optimize the performance of file folios reclamation.
>>>>>>>>
>>>>>>>> Barry previously implemented batched unmapping for lazyfree
>>>>>>>> anonymous large
>>>>>>>> folios[1] and did not further optimize anonymous large folios or
>>>>>>>> file-backed
>>>>>>>> large folios at that stage. As for file-backed large folios, the
>>>>>>>> batched
>>>>>>>> unmapping support is relatively straightforward, as we only need
>>>>>>>> to clear
>>>>>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>>>>>
>>>>>>>> Performance testing:
>>>>>>>> Allocate 10G clean file-backed folios by mmap() in a memory
>>>>>>>> cgroup, and try to
>>>>>>>> reclaim 8G file-backed folios via the memory.reclaim interface. I
>>>>>>>> can observe
>>>>>>>> 75% performance improvement on my Arm64 32-core server (and 50%+
>>>>>>>> improvement
>>>>>>>> on my X86 machine) with this patch.
>>>>>>>>
>>>>>>>> W/o patch:
>>>>>>>> real    0m1.018s
>>>>>>>> user    0m0.000s
>>>>>>>> sys     0m1.018s
>>>>>>>>
>>>>>>>> W/ patch:
>>>>>>>> real   0m0.249s
>>>>>>>> user   0m0.000s
>>>>>>>> sys    0m0.249s
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>> Acked-by: Barry Song <baohua@kernel.org>
>>>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>> ---
>>>>>>>> mm/rmap.c | 7 ++++---
>>>>>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>>> index 985ab0b085ba..e1d16003c514 100644
>>>>>>>> --- a/mm/rmap.c
>>>>>>>> +++ b/mm/rmap.c
>>>>>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int
>>>>>>>> folio_unmap_pte_batch(struct folio *folio,
>>>>>>>>        end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>>>>>        max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>>>>>
>>>>>>>> -      /* We only support lazyfree batching for now ... */
>>>>>>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>>>>>> +      /* We only support lazyfree or file folios batching for now
>>>>>>>> ... */
>>>>>>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>>>>>                return 1;
>>>>>>>> +
>>>>>>>>        if (pte_unused(pte))
>>>>>>>>                return 1;
>>>>>>>>
>>>>>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio
>>>>>>>> *folio, struct vm_area_struct *vma,
>>>>>>>>                         *
>>>>>>>>                         * See Documentation/mm/mmu_notifier.rst
>>>>>>>>                         */
>>>>>>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
>>>>>>>> +                      add_mm_counter(mm, mm_counter_file(folio),
>>>>>>>> -nr_pages);
>>>>>>>>                }
>>>>>>>> discard:
>>>>>>>>                if (unlikely(folio_test_hugetlb(folio))) {
>>>>>>>> -- 
>>>>>>>> 2.47.3
>>>>>>>>
>>>>>>> Hi, Baolin
>>>>>>>
>>>>>>> When reading your patch, I come up one small question.
>>>>>>>
>>>>>>> Current try_to_unmap_one() has following structure:
>>>>>>>
>>>>>>>      try_to_unmap_one()
>>>>>>>          while (page_vma_mapped_walk(&pvmw)) {
>>>>>>>              nr_pages = folio_unmap_pte_batch()
>>>>>>>
>>>>>>>              if (nr_pages = folio_nr_pages(folio))
>>>>>>>                  goto walk_done;
>>>>>>>          }
>>>>>>>
>>>>>>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>>>>>>>
>>>>>>> If my understanding is correct, page_vma_mapped_walk() would start
>>>>>>> from
>>>>>>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already
>>>>>>> cleared to
>>>>>>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>>>>>
>>>>>>> Not sure my understanding is correct, if so do we have some reason
>>>>>>> not to
>>>>>>> skip the cleared range?
>>>>>> I don’t quite understand your question. For nr_pages > 1 but not equal
>>>>>> to nr_pages, page_vma_mapped_walk will skip the nr_pages - 1 PTEs
>>>>>> inside.
>>>>>>
>>>>>> take a look:
>>>>>>
>>>>>> next_pte:
>>>>>>                 do {
>>>>>>                         pvmw->address += PAGE_SIZE;
>>>>>>                         if (pvmw->address >= end)
>>>>>>                                 return not_found(pvmw);
>>>>>>                         /* Did we cross page table boundary? */
>>>>>>                         if ((pvmw->address & (PMD_SIZE - PAGE_SIZE))
>>>>>> == 0) {
>>>>>>                                 if (pvmw->ptl) {
>>>>>>                                         spin_unlock(pvmw->ptl);
>>>>>>                                         pvmw->ptl = NULL;
>>>>>>                                 }
>>>>>>                                 pte_unmap(pvmw->pte);
>>>>>>                                 pvmw->pte = NULL;
>>>>>>                                 pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>>>>>                                 goto restart;
>>>>>>                         }
>>>>>>                         pvmw->pte++;
>>>>>>                 } while (pte_none(ptep_get(pvmw->pte)));
>>>>>>
>>>>> Yes, we do it in page_vma_mapped_walk() now. Since they are
>>>>> pte_none(), they
>>>>> will be skipped.
>>>>>
>>>>> I mean maybe we can skip it in try_to_unmap_one(), for example:
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 9e5bd4834481..ea1afec7c802 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio
>>>>> *folio, struct vm_area_struct *vma,
>>>>>                 */
>>>>>                if (nr_pages == folio_nr_pages(folio))
>>>>>                        goto walk_done;
>>>>> +             else {
>>>>> +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
>>>>> +                     pvmw.pte += nr_pages - 1;
>>>>> +             }
>>>>>                continue;
>>>>>   walk_abort:
>>>>>                ret = false;
>>>> I am of the opinion that we should do something like this. In the
>>>> internal pvmw code,
>>> I am still not convinced that skipping PTEs in try_to_unmap_one()
>>> is the right place. If we really want to skip certain PTEs early,
>>> should we instead hint page_vma_mapped_walk()? That said, I don't
>>> see much value in doing so, since in most cases nr is either 1 or
>>> folio_nr_pages(folio).
>>>
>>>> we keep skipping ptes till the ptes are none. With my proposed
>>>> uffd-fix [1], if the old
>>>> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert
>>>> all ptes from none
>>>> to not none, and we will lose the batching effect. I also plan to
>>>> extend support to
>>>> anonymous folios (therefore generalizing for all types of memory)
>>>> which will set a
>>>> batch of ptes as swap, and the internal pvmw code won't be able to
>>>> skip through the
>>>> batch.
>>> Thanks for catching this, Dev. I already filter out some of the more
>>> complex cases, for example:
>>> if (pte_unused(pte))
>>>          return 1;
>>>
>>> Since the userfaultfd write-protection case is also a corner case,
>>> could we filter it out as well?
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index c86f1135222b..6bb8ba6f046e 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1870,6 +1870,9 @@ static inline unsigned int
>>> folio_unmap_pte_batch(struct folio *folio,
>>>          if (pte_unused(pte))
>>>                  return 1;
>>>
>>> +       if (userfaultfd_wp(vma))
>>> +               return 1;
>>> +
>>>          return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
>>> }
>>>
>>> Just offering a second option — yours is probably better.
>>
>> No. This is not an edge case. This is a case which gets exposed by your
>> work, and
>> I believe that if you intend to get the file folio batching thingy in,
>> then you
>> need to fix the uffd stuff too.
>
> Barry’s point isn’t that this is an edge case. I think he means that uffd
> is not a common performance-sensitive scenario in production. Also, we
> typically fall back to per-page handling for uffd cases (see
> finish_fault() and alloc_anon_folio()). So I perfer to follow Barry’s
> suggestion and filter out the uffd cases until we have test case to show
> performance improvement. 

I am of the opinion that you are making the wrong analogy here. The
per-page fault fidelity is *required* for uffd.

When you say you want to support file folio batched unmapping, I think it's
inappropriate to say "let us refuse to

batch if the pte mapping the file folio is smeared with a particular bit
and consider it a totally different case". Instead

of getting in folio (all memory types) batched unmapping in, we have
already broken this to "lazyfree folio", then

"file folio", the remaining being "anon folio". Now you intend to break
"file folio" to "file folio non uffd" and "file folio uffd".


Just my 2C, I don't opine strongly here.


>
> I also think you can continue iterating your patch[1] to support batched
> unmapping for uffd VMAs, and provide data to evaluate its value.
>
> [1]
> https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/

next prev parent reply	other threads:[~2026-01-19  6:37 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-26  6:07 [PATCH v5 0/5] support batch checking of references and unmapping for " Baolin Wang
2025-12-26  6:07 ` [PATCH v5 1/5] mm: rmap: support batched checks of the references " Baolin Wang
2026-01-07  6:01   ` Harry Yoo
2026-02-09  8:49   ` David Hildenbrand (Arm)
2026-02-09  9:14     ` Baolin Wang
2026-02-09  9:20       ` David Hildenbrand (Arm)
2026-02-09  9:25         ` Baolin Wang
2025-12-26  6:07 ` [PATCH v5 2/5] arm64: mm: factor out the address and ptep alignment into a new helper Baolin Wang
2026-02-09  8:50   ` David Hildenbrand (Arm)
2025-12-26  6:07 ` [PATCH v5 3/5] arm64: mm: support batch clearing of the young flag for large folios Baolin Wang
2026-01-02 12:21   ` Ryan Roberts
2026-02-09  9:02   ` David Hildenbrand (Arm)
2025-12-26  6:07 ` [PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes() Baolin Wang
2026-01-28 11:47   ` Chris Mason
2026-01-29  1:42     ` Baolin Wang
2026-02-09  9:09       ` David Hildenbrand (Arm)
2026-02-09  9:36         ` Baolin Wang
2026-02-09  9:55           ` David Hildenbrand (Arm)
2026-02-09 10:13             ` Baolin Wang
2026-02-16  0:24               ` Alistair Popple
2025-12-26  6:07 ` [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
2026-01-06 13:22   ` Wei Yang
2026-01-06 21:29     ` Barry Song
2026-01-07  1:46       ` Wei Yang
2026-01-07  2:21         ` Barry Song
2026-01-07  2:29           ` Baolin Wang
2026-01-07  3:31             ` Wei Yang
2026-01-16  9:53         ` Dev Jain
2026-01-16 11:14           ` Lorenzo Stoakes
2026-01-16 14:28           ` Barry Song
2026-01-16 15:23             ` Barry Song
2026-01-16 15:49             ` Baolin Wang
2026-01-18  5:46             ` Dev Jain
2026-01-19  5:50               ` Baolin Wang
2026-01-19  6:36                 ` Dev Jain [this message]
2026-01-19  7:22                   ` Baolin Wang
2026-01-16 15:14           ` Barry Song
2026-01-18  5:48             ` Dev Jain
2026-01-07  6:54   ` Harry Yoo
2026-01-16  8:42   ` Lorenzo Stoakes
2026-01-16 16:26   ` [PATCH] mm: rmap: skip batched unmapping for UFFD vmas Baolin Wang
2026-02-09  9:54     ` David Hildenbrand (Arm)
2026-02-09 10:49       ` Barry Song
2026-02-09 10:58         ` David Hildenbrand (Arm)
2026-02-10 12:01         ` Dev Jain
2026-02-09  9:38   ` [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios David Hildenbrand (Arm)
2026-02-09  9:43     ` Baolin Wang
2026-02-13  5:19       ` Barry Song
2026-02-18 12:26         ` Dev Jain
2026-01-16  8:41 ` [PATCH v5 0/5] support batch checking of references and unmapping for " Lorenzo Stoakes
2026-01-16 10:53   ` David Hildenbrand (Red Hat)
2026-01-16 10:52 ` David Hildenbrand (Red Hat)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6ed22a34-ce8a-4531-9776-b28e1a942450@arm.com \
    --to=dev.jain@arm.com \
    --cc=21cnbao@gmail.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=catalin.marinas@arm.com \
    --cc=david@kernel.org \
    --cc=harry.yoo@oracle.com \
    --cc=jannh@google.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mhocko@suse.com \
    --cc=richard.weiyang@gmail.com \
    --cc=riel@surriel.com \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox