* [PATCH 0/2] mm: Improve mlock tracking for large folios
From: kirill @ 2025-09-18 11:21 UTC
To: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox
Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Shakeel Butt, linux-mm, linux-kernel,
Kiryl Shutsemau
From: Kiryl Shutsemau <kas@kernel.org>
We do not mlock large folios when adding them to rmap; mlocking is
deferred until reclaim. This leads to a substantial undercount of
Mlocked in /proc/meminfo.

This patchset improves the situation by mlocking large folios that are
fully mapped to the VMA.

Partially mapped large folios are still not accounted for, but this
brings the meminfo value closer to the truth and makes it more useful.
Kiryl Shutsemau (2):
mm/fault: Try to map the entire file folio in finish_fault()
mm/rmap: Improve mlock tracking for large folios
mm/memory.c | 9 ++-------
mm/rmap.c | 13 ++++---------
2 files changed, 6 insertions(+), 16 deletions(-)
--
2.50.1
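
The undercount described above is easy to observe from userspace. The
following is a rough illustration only (not part of the series); it assumes
a page-cache-backed file of at least 2 MiB and simply compares the Mlocked
counter in /proc/meminfo before and after mlock():

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Read the "Mlocked:" line from /proc/meminfo (value in kB). */
static long mlocked_kb(void)
{
        FILE *f = fopen("/proc/meminfo", "r");
        char line[128];
        long kb = -1;

        while (f && fgets(line, sizeof(line), f)) {
                if (sscanf(line, "Mlocked: %ld kB", &kb) == 1)
                        break;
        }
        if (f)
                fclose(f);
        return kb;
}

int main(int argc, char **argv)
{
        size_t len = 2UL << 20;  /* assumes the file is at least 2 MiB */
        int fd = open(argc > 1 ? argv[1] : "testfile", O_RDONLY);
        void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);

        if (fd < 0 || p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        printf("Mlocked before: %ld kB\n", mlocked_kb());
        if (mlock(p, len))
                perror("mlock");
        printf("Mlocked after:  %ld kB\n", mlocked_kb());

        /* Without the series, much of a large-folio-backed range may not
         * show up in Mlocked until reclaim revisits it. */
        return 0;
}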
* [PATCH 1/2] mm/fault: Try to map the entire file folio in finish_fault()

From: kirill @ 2025-09-18 11:21 UTC
To: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox
Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
    Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
    Johannes Weiner, Shakeel Butt, linux-mm, linux-kernel,
    Kiryl Shutsemau

From: Kiryl Shutsemau <kas@kernel.org>

The finish_fault() function uses per-page fault for file folios. This
only occurs for file folios smaller than PMD_SIZE.

The comment suggests that this approach prevents RSS inflation.
However, it only prevents RSS accounting. The folio is still mapped to
the process, and the fact that it is mapped by a single PTE does not
affect memory pressure. Additionally, the kernel's ability to map
large folios as PMD if they are large enough does not support this
argument.

When possible, map large folios in one shot. This reduces the number of
minor page faults and allows for TLB coalescing.

Mapping large folios at once will allow the rmap code to mlock it on
add, as it will recognize that it is fully mapped and mlocking is safe.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 mm/memory.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0ba4f6b71847..812a7d9f6531 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5386,13 +5386,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 
         nr_pages = folio_nr_pages(folio);
 
-        /*
-         * Using per-page fault to maintain the uffd semantics, and same
-         * approach also applies to non shmem/tmpfs faults to avoid
-         * inflating the RSS of the process.
-         */
-        if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
-            unlikely(needs_fallback)) {
+        /* Using per-page fault to maintain the uffd semantics */
+        if (unlikely(userfaultfd_armed(vma)) || unlikely(needs_fallback)) {
                 nr_pages = 1;
         } else if (nr_pages > 1) {
                 pgoff_t idx = folio_page_idx(folio, page);
-- 
2.50.1
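
The "fewer minor page faults" effect mentioned in the commit message can be
sanity-checked from userspace. The sketch below is an illustration only (not
from the patch); it assumes a file named "testfile" of at least 2 MiB and
counts minor faults with getrusage() while touching the mapping:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
        size_t len = 2UL << 20, i;
        int fd = open("testfile", O_RDONLY);    /* assumed to be >= 2 MiB */
        volatile char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        struct rusage before, after;
        char sum = 0;

        if (fd < 0 || p == MAP_FAILED)
                return 1;

        getrusage(RUSAGE_SELF, &before);
        for (i = 0; i < len; i += 4096)
                sum += p[i];                    /* fault in the whole range */
        getrusage(RUSAGE_SELF, &after);

        /* With whole-folio mapping in finish_fault(), ru_minflt should grow
         * by roughly one per folio rather than one per 4K page. */
        printf("minor faults: %ld (sum %d)\n",
               after.ru_minflt - before.ru_minflt, sum);
        return 0;
}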
* Re: [PATCH 1/2] mm/fault: Try to map the entire file folio in finish_fault()

From: David Hildenbrand @ 2025-09-18 11:30 UTC
To: kirill, Andrew Morton, Hugh Dickins, Matthew Wilcox
Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
    Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
    Johannes Weiner, Shakeel Butt, linux-mm, linux-kernel,
    Kiryl Shutsemau, Baolin Wang

On 18.09.25 13:21, kirill@shutemov.name wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> The finish_fault() function uses per-page fault for file folios. This
> only occurs for file folios smaller than PMD_SIZE.
>
> The comment suggests that this approach prevents RSS inflation.
> However, it only prevents RSS accounting. The folio is still mapped to
> the process, and the fact that it is mapped by a single PTE does not
> affect memory pressure. Additionally, the kernel's ability to map
> large folios as PMD if they are large enough does not support this
> argument.
>
> When possible, map large folios in one shot. This reduces the number of
> minor page faults and allows for TLB coalescing.
>
> Mapping large folios at once will allow the rmap code to mlock it on
> add, as it will recognize that it is fully mapped and mlocking is safe.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> [...]

I could have sworn that we recently discussed that.

Ah yes, there it is

https://lkml.kernel.org/r/a1c9ba0f-544d-4204-ad3b-60fe1be2ab32@linux.alibaba.com

CCing Baolin as he wanted to look into this.

-- 
Cheers

David / dhildenb
* Re: [PATCH 1/2] mm/fault: Try to map the entire file folio in finish_fault()

From: Lorenzo Stoakes @ 2025-09-18 13:13 UTC
To: David Hildenbrand
Cc: kirill, Andrew Morton, Hugh Dickins, Matthew Wilcox,
    Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
    Michal Hocko, Rik van Riel, Harry Yoo, Johannes Weiner,
    Shakeel Butt, linux-mm, linux-kernel, Kiryl Shutsemau, Baolin Wang

On Thu, Sep 18, 2025 at 01:30:32PM +0200, David Hildenbrand wrote:
> On 18.09.25 13:21, kirill@shutemov.name wrote:
> > [...]
>
> I could have sworn that we recently discussed that.
>
> Ah yes, there it is
>
> https://lkml.kernel.org/r/a1c9ba0f-544d-4204-ad3b-60fe1be2ab32@linux.alibaba.com
>
> CCing Baolin as he wanted to look into this.
>
> --
> Cheers
>
> David / dhildenb
>

Yeah Baolin already did work here [0] so let's get his input first I think! :)

[0]: https://lore.kernel.org/linux-mm/440940e78aeb7430c5cc8b6d2088ae98265b9809.1751599072.git.baolin.wang@linux.alibaba.com/
* Re: [PATCH 1/2] mm/fault: Try to map the entire file folio in finish_fault()

From: Baolin Wang @ 2025-09-19 2:52 UTC
To: Lorenzo Stoakes, David Hildenbrand
Cc: kirill, Andrew Morton, Hugh Dickins, Matthew Wilcox,
    Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
    Michal Hocko, Rik van Riel, Harry Yoo, Johannes Weiner,
    Shakeel Butt, linux-mm, linux-kernel, Kiryl Shutsemau, hughd

On 2025/9/18 21:13, Lorenzo Stoakes wrote:
> On Thu, Sep 18, 2025 at 01:30:32PM +0200, David Hildenbrand wrote:
>> On 18.09.25 13:21, kirill@shutemov.name wrote:
>>> [...]
>>
>> I could have sworn that we recently discussed that.
>>
>> Ah yes, there it is
>>
>> https://lkml.kernel.org/r/a1c9ba0f-544d-4204-ad3b-60fe1be2ab32@linux.alibaba.com
>>
>> CCing Baolin as he wanted to look into this.
>
> Yeah Baolin already did work here [0] so let's get his input first I think! :)
>
> [0]: https://lore.kernel.org/linux-mm/440940e78aeb7430c5cc8b6d2088ae98265b9809.1751599072.git.baolin.wang@linux.alibaba.com/

Thanks for CCing me. Also CCing Hugh.

Hugh previously suggested adding restrictions to the mapping of file
folios (using fault_around_bytes). However, personally, I am not
inclined to use fault_around_bytes as the control, because:

1. This doesn't cause serious write amplification issues.

2. It will inflate the RSS of the process, but does that matter? It
doesn't seem very important.

3. The default configuration for 'fault_around_bytes' is 65536 (16
pages), which is too small for mapping large file folios.

4. We could try adjusting 'fault_around_bytes' to a larger value, but
we've found in real customer environments that 'fault_around_bytes' can
lead to more aggressive readahead, impacting performance. So if
'fault_around_bytes' controls more, it will bring more different
intersecting factors into play.

Therefore, I personally prefer Kiryl's patch (it's what I intended to
do, but I haven't had the time:().
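
For reference, the fault_around_bytes knob discussed above is a debugfs
tunable; the path used below is its usual location but is an assumption here
(it requires debugfs to be mounted and root privileges). A minimal sketch for
reading it:

#include <stdio.h>

int main(void)
{
        /* Assumed location of the fault-around tunable. */
        FILE *f = fopen("/sys/kernel/debug/fault_around_bytes", "r");
        unsigned long bytes;

        if (!f || fscanf(f, "%lu", &bytes) != 1) {
                perror("fault_around_bytes");
                return 1;
        }
        printf("fault_around_bytes: %lu (%lu pages of 4K)\n",
               bytes, bytes / 4096);
        return 0;
}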
* [PATCH 2/2] mm/rmap: Improve mlock tracking for large folios

From: kirill @ 2025-09-18 11:21 UTC
To: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox
Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
    Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
    Johannes Weiner, Shakeel Butt, linux-mm, linux-kernel,
    Kiryl Shutsemau

From: Kiryl Shutsemau <kas@kernel.org>

The kernel currently does not mlock large folios when adding them to
rmap, stating that it is difficult to confirm that the folio is fully
mapped and safe to mlock it. However, nowadays the caller passes a
number of pages of the folio that are getting mapped, making it easy to
check if the entire folio is mapped to the VMA.

mlock the folio on rmap if it is fully mapped to the VMA.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 mm/rmap.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 568198e9efc2..ca8d4ef42c2d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1478,13 +1478,8 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
                 PageAnonExclusive(cur_page), folio);
         }
 
-        /*
-         * For large folio, only mlock it if it's fully mapped to VMA. It's
-         * not easy to check whether the large folio is fully mapped to VMA
-         * here. Only mlock normal 4K folio and leave page reclaim to handle
-         * large folio.
-         */
-        if (!folio_test_large(folio))
+        /* Only mlock it if the folio is fully mapped to the VMA */
+        if (folio_nr_pages(folio) == nr_pages)
                 mlock_vma_folio(folio, vma);
 }
 
@@ -1620,8 +1615,8 @@ static __always_inline void __folio_add_file_rmap(struct folio *folio,
         nr = __folio_add_rmap(folio, page, nr_pages, vma, level, &nr_pmdmapped);
         __folio_mod_stat(folio, nr, nr_pmdmapped);
 
-        /* See comments in folio_add_anon_rmap_*() */
-        if (!folio_test_large(folio))
+        /* Only mlock it if the folio is fully mapped to the VMA */
+        if (folio_nr_pages(folio) == nr_pages)
                 mlock_vma_folio(folio, vma);
 }
 
-- 
2.50.1
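
The policy this patch introduces is simply "mlock at rmap-add time only when
the number of pages being mapped equals the number of pages in the folio".
A toy userspace model of that decision (illustration only, not kernel code;
the real check sits in __folio_add_anon_rmap() and __folio_add_file_rmap()):

#include <stdbool.h>
#include <stdio.h>

struct toy_folio {
        unsigned int nr_pages;  /* stands in for folio_nr_pages() */
        bool mlocked;
};

/* Model of the new rmap-side policy: only a fully mapped folio in a
 * VM_LOCKED VMA is mlocked on add; partial mappings are left to reclaim. */
static void toy_add_rmap(struct toy_folio *folio, unsigned int nr_pages_mapped,
                         bool vma_locked)
{
        if (vma_locked && folio->nr_pages == nr_pages_mapped)
                folio->mlocked = true;  /* stands in for mlock_vma_folio() */
}

int main(void)
{
        struct toy_folio full = { .nr_pages = 16 }, partial = { .nr_pages = 16 };

        toy_add_rmap(&full, 16, true);          /* fully mapped -> mlocked */
        toy_add_rmap(&partial, 4, true);        /* partially mapped -> deferred */
        printf("full: %d, partial: %d\n", full.mlocked, partial.mlocked);
        return 0;
}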
* Re: [PATCH 2/2] mm/rmap: Improve mlock tracking for large folios

From: David Hildenbrand @ 2025-09-18 11:31 UTC
To: kirill, Andrew Morton, Hugh Dickins, Matthew Wilcox
Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
    Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
    Johannes Weiner, Shakeel Butt, linux-mm, linux-kernel,
    Kiryl Shutsemau

On 18.09.25 13:21, kirill@shutemov.name wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> The kernel currently does not mlock large folios when adding them to
> rmap, stating that it is difficult to confirm that the folio is fully
> mapped and safe to mlock it. However, nowadays the caller passes a
> number of pages of the folio that are getting mapped, making it easy to
> check if the entire folio is mapped to the VMA.
>
> mlock the folio on rmap if it is fully mapped to the VMA.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb
* Re: [PATCH 2/2] mm/rmap: Improve mlock tracking for large folios

From: Lorenzo Stoakes @ 2025-09-18 13:10 UTC
To: kirill
Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
    Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
    Michal Hocko, Rik van Riel, Harry Yoo, Johannes Weiner,
    Shakeel Butt, linux-mm, linux-kernel, Kiryl Shutsemau

On Thu, Sep 18, 2025 at 12:21:57PM +0100, kirill@shutemov.name wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> The kernel currently does not mlock large folios when adding them to
> rmap, stating that it is difficult to confirm that the folio is fully
> mapped and safe to mlock it. However, nowadays the caller passes a
> number of pages of the folio that are getting mapped, making it easy to
> check if the entire folio is mapped to the VMA.
>
> mlock the folio on rmap if it is fully mapped to the VMA.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>

The logic looks good to me, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

But note the comments below.

> ---
>  mm/rmap.c | 13 ++++---------
>  1 file changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 568198e9efc2..ca8d4ef42c2d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1478,13 +1478,8 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>                  PageAnonExclusive(cur_page), folio);
>          }
>
> -        /*
> -         * For large folio, only mlock it if it's fully mapped to VMA. It's
> -         * not easy to check whether the large folio is fully mapped to VMA
> -         * here. Only mlock normal 4K folio and leave page reclaim to handle
> -         * large folio.
> -         */
> -        if (!folio_test_large(folio))
> +        /* Only mlock it if the folio is fully mapped to the VMA */
> +        if (folio_nr_pages(folio) == nr_pages)

OK this is nice, as partially mapped will have folio_nr_pages() != nr_pages. So
logically this must be correct.

>                  mlock_vma_folio(folio, vma);
>  }
>
> @@ -1620,8 +1615,8 @@ static __always_inline void __folio_add_file_rmap(struct folio *folio,
>          nr = __folio_add_rmap(folio, page, nr_pages, vma, level, &nr_pmdmapped);
>          __folio_mod_stat(folio, nr, nr_pmdmapped);
>
> -        /* See comments in folio_add_anon_rmap_*() */
> -        if (!folio_test_large(folio))
> +        /* Only mlock it if the folio is fully mapped to the VMA */
> +        if (folio_nr_pages(folio) == nr_pages)
>                  mlock_vma_folio(folio, vma);
>  }
>
> --
> 2.50.1
>

I see in try_to_unmap_one():

        if (!(flags & TTU_IGNORE_MLOCK) &&
            (vma->vm_flags & VM_LOCKED)) {
                /* Restore the mlock which got missed */
                if (!folio_test_large(folio))
                        mlock_vma_folio(folio, vma);

Do we care about this?

It seems like folio_referenced_one() does some similar logic:

        if (vma->vm_flags & VM_LOCKED) {
                if (!folio_test_large(folio) || !pvmw.pte) {
                        /* Restore the mlock which got missed */
                        mlock_vma_folio(folio, vma);
                        page_vma_mapped_walk_done(&pvmw);
                        pra->vm_flags |= VM_LOCKED;
                        return false; /* To break the loop */
                }

...

        if ((vma->vm_flags & VM_LOCKED) &&
            folio_test_large(folio) &&
            folio_within_vma(folio, vma)) {
                unsigned long s_align, e_align;

                s_align = ALIGN_DOWN(start, PMD_SIZE);
                e_align = ALIGN_DOWN(start + folio_size(folio) - 1, PMD_SIZE);

                /* folio doesn't cross page table boundary and fully mapped */
                if ((s_align == e_align) && (ptes == folio_nr_pages(folio))) {
                        /* Restore the mlock which got missed */
                        mlock_vma_folio(folio, vma);
                        pra->vm_flags |= VM_LOCKED;
                        return false; /* To break the loop */
                }
        }

So maybe we could do something similar in try_to_unmap_one()?
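
The "folio doesn't cross page table boundary" test above amounts to checking
that the first and last byte of the folio are covered by the same PMD-level
page table, so a single page table lock covers all of its PTEs. A standalone
sketch of that arithmetic (illustration only; the 2 MiB PMD_SIZE value is an
assumption matching x86-64 with 4K pages):

#include <stdbool.h>
#include <stdio.h>

#define PMD_SIZE        (1UL << 21)     /* assumed: 2 MiB, x86-64 with 4K pages */
#define ALIGN_DOWN(x, a)        ((x) & ~((a) - 1))

/* True if [start, start + size) is covered by a single PMD-level page
 * table, i.e. one page table lock covers all of the folio's PTEs. */
static bool fits_one_page_table(unsigned long start, unsigned long size)
{
        return ALIGN_DOWN(start, PMD_SIZE) ==
               ALIGN_DOWN(start + size - 1, PMD_SIZE);
}

int main(void)
{
        /* 64K folio mapped at a 64K-aligned address: stays in one table. */
        printf("%d\n", fits_one_page_table(0x7f0000010000UL, 0x10000));
        /* Same size mapped so that it straddles a 2M boundary: it does not. */
        printf("%d\n", fits_one_page_table(0x7f00001f8000UL, 0x10000));
        return 0;
}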
* Re: [PATCH 2/2] mm/rmap: Improve mlock tracking for large folios

From: Kiryl Shutsemau @ 2025-09-18 13:48 UTC
To: Lorenzo Stoakes, Yin Fengwei
Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
    Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
    Michal Hocko, Rik van Riel, Harry Yoo, Johannes Weiner,
    Shakeel Butt, linux-mm, linux-kernel

On Thu, Sep 18, 2025 at 02:10:05PM +0100, Lorenzo Stoakes wrote:
> [...]
>
> I see in try_to_unmap_one():
>
>         if (!(flags & TTU_IGNORE_MLOCK) &&
>             (vma->vm_flags & VM_LOCKED)) {
>                 /* Restore the mlock which got missed */
>                 if (!folio_test_large(folio))
>                         mlock_vma_folio(folio, vma);
>
> Do we care about this?
>
> It seems like folio_referenced_one() does some similar logic:
>
>         if (vma->vm_flags & VM_LOCKED) {
>                 if (!folio_test_large(folio) || !pvmw.pte) {
>                         /* Restore the mlock which got missed */
>                         mlock_vma_folio(folio, vma);
>                         page_vma_mapped_walk_done(&pvmw);
>                         pra->vm_flags |= VM_LOCKED;
>                         return false; /* To break the loop */
>                 }
>
> ...
>
>         if ((vma->vm_flags & VM_LOCKED) &&
>             folio_test_large(folio) &&
>             folio_within_vma(folio, vma)) {
>                 unsigned long s_align, e_align;
>
>                 s_align = ALIGN_DOWN(start, PMD_SIZE);
>                 e_align = ALIGN_DOWN(start + folio_size(folio) - 1, PMD_SIZE);
>
>                 /* folio doesn't cross page table boundary and fully mapped */
>                 if ((s_align == e_align) && (ptes == folio_nr_pages(folio))) {
>                         /* Restore the mlock which got missed */
>                         mlock_vma_folio(folio, vma);
>                         pra->vm_flags |= VM_LOCKED;
>                         return false; /* To break the loop */
>                 }
>         }
>
> So maybe we could do something similar in try_to_unmap_one()?

Hm. This seems to be buggy to me.

mlock_vma_folio() has to be called with ptl taken, no? It gets dropped
by this place.

+Fengwei.

I think this has to be handled inside the loop once ptes reaches
folio_nr_pages(folio).

Maybe something like this (untested):

diff --git a/mm/rmap.c b/mm/rmap.c
index ca8d4ef42c2d..719f1c99470c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -858,17 +858,13 @@ static bool folio_referenced_one(struct folio *folio,
                 address = pvmw.address;
 
                 if (vma->vm_flags & VM_LOCKED) {
-                        if (!folio_test_large(folio) || !pvmw.pte) {
-                                /* Restore the mlock which got missed */
-                                mlock_vma_folio(folio, vma);
-                                page_vma_mapped_walk_done(&pvmw);
-                                pra->vm_flags |= VM_LOCKED;
-                                return false; /* To break the loop */
-                        }
+                        unsigned long s_align, e_align;
+
+                        /* Small folio or PMD-mapped large folio */
+                        if (!folio_test_large(folio) || !pvmw.pte)
+                                goto restore_mlock;
+
                         /*
-                         * For large folio fully mapped to VMA, will
-                         * be handled after the pvmw loop.
-                         *
                          * For large folio cross VMA boundaries, it's
                          * expected to be picked by page reclaim. But
                          * should skip reference of pages which are in
@@ -878,7 +874,23 @@ static bool folio_referenced_one(struct folio *folio,
                          */
                         ptes++;
                         pra->mapcount--;
-                        continue;
+
+                        /* Folio must be fully mapped to be mlocked */
+                        if (ptes != folio_nr_pages(folio))
+                                continue;
+
+                        s_align = ALIGN_DOWN(start, PMD_SIZE);
+                        e_align = ALIGN_DOWN(start + folio_size(folio) - 1, PMD_SIZE);
+
+                        /* folio doesn't cross page table */
+                        if (s_align != e_align)
+                                continue;
+restore_mlock:
+                        /* Restore the mlock which got missed */
+                        mlock_vma_folio(folio, vma);
+                        page_vma_mapped_walk_done(&pvmw);
+                        pra->vm_flags |= VM_LOCKED;
+                        return false; /* To break the loop */
                 }
 
                 /*
@@ -914,23 +926,6 @@ static bool folio_referenced_one(struct folio *folio,
                 pra->mapcount--;
         }
 
-        if ((vma->vm_flags & VM_LOCKED) &&
-                        folio_test_large(folio) &&
-                        folio_within_vma(folio, vma)) {
-                unsigned long s_align, e_align;
-
-                s_align = ALIGN_DOWN(start, PMD_SIZE);
-                e_align = ALIGN_DOWN(start + folio_size(folio) - 1, PMD_SIZE);
-
-                /* folio doesn't cross page table boundary and fully mapped */
-                if ((s_align == e_align) && (ptes == folio_nr_pages(folio))) {
-                        /* Restore the mlock which got missed */
-                        mlock_vma_folio(folio, vma);
-                        pra->vm_flags |= VM_LOCKED;
-                        return false; /* To break the loop */
-                }
-        }
-
         if (referenced)
                 folio_clear_idle(folio);
         if (folio_test_clear_young(folio))

-- 
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH 2/2] mm/rmap: Improve mlock tracking for large folios

From: Kiryl Shutsemau @ 2025-09-18 14:58 UTC
To: Lorenzo Stoakes, Yin Fengwei
Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
    Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
    Michal Hocko, Rik van Riel, Harry Yoo, Johannes Weiner,
    Shakeel Butt, linux-mm, linux-kernel

On Thu, Sep 18, 2025 at 02:48:27PM +0100, Kiryl Shutsemau wrote:
> > So maybe we could do something similar in try_to_unmap_one()?
>
> Hm. This seems to be buggy to me.
>
> mlock_vma_folio() has to be called with ptl taken, no? It gets dropped
> by this place.
>
> +Fengwei.
>
> I think this has to be handled inside the loop once ptes reaches
> folio_nr_pages(folio).
>
> Maybe something like this (untested):

With a little bit more tinkering I've come up with the change below.
Still untested.

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6cd020eea37a..86975033cb96 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -928,6 +928,11 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 /* Look for migration entries rather than present PTEs */
 #define PVMW_MIGRATION (1 << 1)
 
+/* Result flags */
+
+/* The page is mapped across a page table boundary */
+#define PVMW_PGTABLE_CROSSED (1 << 16)
+
 struct page_vma_mapped_walk {
         unsigned long pfn;
         unsigned long nr_pages;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e981a1a292d2..a184b88743c3 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -309,6 +309,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
                         }
                         pte_unmap(pvmw->pte);
                         pvmw->pte = NULL;
+                        pvmw->flags |= PVMW_PGTABLE_CROSSED;
                         goto restart;
                 }
                 pvmw->pte++;
diff --git a/mm/rmap.c b/mm/rmap.c
index ca8d4ef42c2d..afe2711f4e3d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -851,34 +851,34 @@ static bool folio_referenced_one(struct folio *folio,
 {
         struct folio_referenced_arg *pra = arg;
         DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
-        int referenced = 0;
-        unsigned long start = address, ptes = 0;
+        int ptes = 0, referenced = 0;
 
         while (page_vma_mapped_walk(&pvmw)) {
                 address = pvmw.address;
 
                 if (vma->vm_flags & VM_LOCKED) {
-                        if (!folio_test_large(folio) || !pvmw.pte) {
-                                /* Restore the mlock which got missed */
-                                mlock_vma_folio(folio, vma);
-                                page_vma_mapped_walk_done(&pvmw);
-                                pra->vm_flags |= VM_LOCKED;
-                                return false; /* To break the loop */
-                        }
-                        /*
-                         * For large folio fully mapped to VMA, will
-                         * be handled after the pvmw loop.
-                         *
-                         * For large folio cross VMA boundaries, it's
-                         * expected to be picked by page reclaim. But
-                         * should skip reference of pages which are in
-                         * the range of VM_LOCKED vma. As page reclaim
-                         * should just count the reference of pages out
-                         * the range of VM_LOCKED vma.
-                         */
                         ptes++;
                         pra->mapcount--;
-                        continue;
+
+                        /* Only mlock fully mapped pages */
+                        if (pvmw.pte && ptes != pvmw.nr_pages)
+                                continue;
+
+                        /*
+                         * All PTEs must be protected by page table lock in
+                         * order to mlock the page.
+                         *
+                         * If a page table boundary has been crossed, the
+                         * current ptl only protects part of the ptes.
+                         */
+                        if (pvmw.flags & PVMW_PGTABLE_CROSSED)
+                                continue;
+
+                        /* Restore the mlock which got missed */
+                        mlock_vma_folio(folio, vma);
+                        page_vma_mapped_walk_done(&pvmw);
+                        pra->vm_flags |= VM_LOCKED;
+                        return false; /* To break the loop */
                 }
 
                 /*
@@ -914,23 +914,6 @@ static bool folio_referenced_one(struct folio *folio,
                 pra->mapcount--;
         }
 
-        if ((vma->vm_flags & VM_LOCKED) &&
-                        folio_test_large(folio) &&
-                        folio_within_vma(folio, vma)) {
-                unsigned long s_align, e_align;
-
-                s_align = ALIGN_DOWN(start, PMD_SIZE);
-                e_align = ALIGN_DOWN(start + folio_size(folio) - 1, PMD_SIZE);
-
-                /* folio doesn't cross page table boundary and fully mapped */
-                if ((s_align == e_align) && (ptes == folio_nr_pages(folio))) {
-                        /* Restore the mlock which got missed */
-                        mlock_vma_folio(folio, vma);
-                        pra->vm_flags |= VM_LOCKED;
-                        return false; /* To break the loop */
-                }
-        }
-
         if (referenced)
                 folio_clear_idle(folio);
         if (folio_test_clear_young(folio))
@@ -1882,6 +1865,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
         unsigned long nr_pages = 1, end_addr;
         unsigned long pfn;
         unsigned long hsz = 0;
+        int ptes = 0;
 
         /*
          * When racing against e.g. zap_pte_range() on another cpu,
@@ -1922,9 +1906,24 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
                  */
                 if (!(flags & TTU_IGNORE_MLOCK) &&
                     (vma->vm_flags & VM_LOCKED)) {
+                        ptes++;
+
+                        /* Only mlock fully mapped pages */
+                        if (pvmw.pte && ptes != pvmw.nr_pages)
+                                goto walk_abort;
+
+                        /*
+                         * All PTEs must be protected by page table lock in
+                         * order to mlock the page.
+                         *
+                         * If a page table boundary has been crossed, the
+                         * current ptl only protects part of the ptes.
+                         */
+                        if (pvmw.flags & PVMW_PGTABLE_CROSSED)
+                                goto walk_abort;
+
                         /* Restore the mlock which got missed */
-                        if (!folio_test_large(folio))
-                                mlock_vma_folio(folio, vma);
+                        mlock_vma_folio(folio, vma);
                         goto walk_abort;
                 }
 
-- 
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH 2/2] mm/rmap: Improve mlock tracking for large folios

From: Johannes Weiner @ 2025-09-18 14:38 UTC
To: kirill
Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
    Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
    Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
    Shakeel Butt, linux-mm, linux-kernel, Kiryl Shutsemau

On Thu, Sep 18, 2025 at 12:21:57PM +0100, kirill@shutemov.name wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> The kernel currently does not mlock large folios when adding them to
> rmap, stating that it is difficult to confirm that the folio is fully
> mapped and safe to mlock it. However, nowadays the caller passes a
> number of pages of the folio that are getting mapped, making it easy to
> check if the entire folio is mapped to the VMA.
>
> mlock the folio on rmap if it is fully mapped to the VMA.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

> ---
>  mm/rmap.c | 13 ++++---------
>  1 file changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 568198e9efc2..ca8d4ef42c2d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1478,13 +1478,8 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>                  PageAnonExclusive(cur_page), folio);
>          }
>
> -        /*
> -         * For large folio, only mlock it if it's fully mapped to VMA. It's
> -         * not easy to check whether the large folio is fully mapped to VMA
> -         * here. Only mlock normal 4K folio and leave page reclaim to handle
> -         * large folio.
> -         */
> -        if (!folio_test_large(folio))
> +        /* Only mlock it if the folio is fully mapped to the VMA */
> +        if (folio_nr_pages(folio) == nr_pages)
>                  mlock_vma_folio(folio, vma);

Minor nit, but it might be useful to still mention in the comment that
partially mapped folios are punted to reclaim.
* Re: [PATCH 2/2] mm/rmap: Improve mlock tracking for large folios

From: Shakeel Butt @ 2025-09-18 19:32 UTC
To: kirill
Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
    Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
    Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
    Johannes Weiner, linux-mm, linux-kernel, Kiryl Shutsemau

On Thu, Sep 18, 2025 at 12:21:57PM +0100, kirill@shutemov.name wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> The kernel currently does not mlock large folios when adding them to
> rmap, stating that it is difficult to confirm that the folio is fully
> mapped and safe to mlock it. However, nowadays the caller passes a
> number of pages of the folio that are getting mapped, making it easy to
> check if the entire folio is mapped to the VMA.
>
> mlock the folio on rmap if it is fully mapped to the VMA.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>

It would be interesting to state how this specific issue was causing
problems in our production, particularly for workloads that were doing
load shedding based on mlock stats.
* Re: [PATCH 0/2] mm: Improve mlock tracking for large folios

From: Lorenzo Stoakes @ 2025-09-18 13:14 UTC
To: kirill
Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
    Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
    Michal Hocko, Rik van Riel, Harry Yoo, Johannes Weiner,
    Shakeel Butt, linux-mm, linux-kernel, Kiryl Shutsemau

On Thu, Sep 18, 2025 at 12:21:55PM +0100, kirill@shutemov.name wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> We do not mlock large folios on adding them to rmap, deferring until
> reclaim. It leads to substantial undercount of Mlocked in /proc/meminfo.
>
> This patchset improves the situation by mlocking large folios fully
> mapped to the VMA.
>
> Partially mapped large folios are still not accounted, but it brings
> meminfo value closer to the truth and makes it useful.
>
> Kiryl Shutsemau (2):
>   mm/fault: Try to map the entire file folio in finish_fault()

I feel like you need to speak more about this change in the cover letter.

>   mm/rmap: Improve mlock tracking for large folios
>
>  mm/memory.c |  9 ++-------
>  mm/rmap.c   | 13 ++++---------
>  2 files changed, 6 insertions(+), 16 deletions(-)
>
> --
> 2.50.1
>

FYI I compile tested each commit, mm self test tested blah blah + all
looking good.

So just about Baolin's input really for 1/2.