[PATCH v2 0/3] support large folio for mlock

linux-mm.kvack.org archive mirror

* [PATCH v2 0/3] support large folio for mlock
@ 2023-08-09  6:11 Yin Fengwei
  2023-08-09  6:11 ` [PATCH v2 1/3] mm: add functions folio_in_range() and folio_within_vma() Yin Fengwei
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Yin Fengwei @ 2023-08-09  6:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel, akpm, yuzhao, willy, hughd, yosryahmed,
	ryan.roberts, david, shy828301
  Cc: fengwei.yin

Yu mentioned at [1] that mlock() can't be applied to large folios.

I studied the related code and here is my understanding:
- For RLIMIT_MEMLOCK, there is no problem, because the RLIMIT_MEMLOCK
  accounting is not tied to the underlying pages. Mlocking or
  munlocking the underlying pages doesn't affect the RLIMIT_MEMLOCK
  accounting, which is always correct.

- For keeping the pages in RAM, there is no problem either. At least
  in try_to_unmap_one(), once the VM_LOCKED bit is detected in the
  VMA's vm_flags, the folio is kept whether or not the folio itself
  is mlocked.

So mlock works functionally for large folios. But it's not optimized,
because page reclaim needs to scan these large folios and may split
them.

This series classifies large folios for mlock into four types:
  - The large folio is in a VM_LOCKED range and fully mapped to the
    range

  - The large folio is in a VM_LOCKED range but not fully mapped to
    the range

  - The large folio crosses a VM_LOCKED VMA boundary

  - The large folio crosses a last-level page table boundary

For the first type, we mlock the large folio so page reclaim will skip it.

For the second and third types, we don't mlock the large folio. Since
the pages not mapped to the VM_LOCKED range are mapped to a
non-VM_LOCKED range, the large folio can be picked by page reclaim and
split when the system is under memory pressure. The pages not mapped
to the VM_LOCKED range can then be reclaimed.

For the fourth type, we don't mlock the large folio because holding one
page table lock can't prevent the part mapped by another last-level
page table from being unmapped. Thanks to Ryan for pointing this out.
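
As an illustration only (this is not code from the series), the four
cases and the resulting mlock decision can be modelled in userspace C
roughly as below. The struct names, the classify_folio() and
should_mlock() helpers, the order in which overlapping cases are
reported, and the 4K page / 2M last-level page table span are all
assumptions made for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumptions of this model only: 4K pages, one last-level page table spans 2M. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PMD_SIZE   (UINT64_C(1) << 21)
#define MODEL_ALIGN_DOWN(x, a) ((x) & ~((a) - 1))

enum folio_case {
	FOLIO_FULLY_MAPPED,       /* type 1: in range and fully mapped */
	FOLIO_PARTIALLY_MAPPED,   /* type 2: in range, not fully mapped */
	FOLIO_CROSSES_VMA,        /* type 3: crosses (or lies outside) the range */
	FOLIO_CROSSES_PAGE_TABLE, /* type 4: spans two last-level page tables */
};

/* Hypothetical stand-in for a VM_LOCKED VMA range [start, end). */
struct model_range {
	uint64_t start, end;
};

/* Hypothetical stand-in for a large folio mapped at a virtual address. */
struct model_folio {
	uint64_t addr;          /* address where the first page would be mapped */
	unsigned int nr_pages;  /* pages in the folio */
	unsigned int nr_mapped; /* pages found mapped by PTEs in the range */
};

static enum folio_case classify_folio(const struct model_folio *f,
				      const struct model_range *r)
{
	uint64_t size = (uint64_t)f->nr_pages << MODEL_PAGE_SHIFT;
	uint64_t last = f->addr + size - 1;

	/* The check order is a choice of this sketch, not of the series. */
	if (f->addr < r->start || last >= r->end)
		return FOLIO_CROSSES_VMA;
	if (MODEL_ALIGN_DOWN(f->addr, MODEL_PMD_SIZE) !=
	    MODEL_ALIGN_DOWN(last, MODEL_PMD_SIZE))
		return FOLIO_CROSSES_PAGE_TABLE;
	if (f->nr_mapped != f->nr_pages)
		return FOLIO_PARTIALLY_MAPPED;
	return FOLIO_FULLY_MAPPED;
}

/* Only the first type is mlocked; the other three stay visible to reclaim. */
static bool should_mlock(const struct model_folio *f,
			 const struct model_range *r)
{
	return classify_folio(f, r) == FOLIO_FULLY_MAPPED;
}
```

In this model a 64K folio mapped at 0x3f8000 inside a locked range
[0x200000, 0x600000) is reported as the fourth type, because its first
and last bytes fall under different 2M-aligned page tables.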


To check whether the folio is fully mapped to the range, the PTEs need
to be checked to see whether each page of the folio is mapped. This
requires taking the page table lock and is a heavy operation. So far,
the only places that need this check are madvise and page reclaim, and
these functions already have their own PTE iterators.


patch1 introduces an API to check whether a large folio is in a VMA range.
patch2 makes page reclaim/mlock_vma_folio/munlock_vma_folio support
       large folio mlock/munlock.
patch3 makes the mlock/munlock syscalls support large folios.

Testing done:
  - kernel selftests. No new failures introduced.

v1 was posted here [2].

Yu also mentioned a race which can make a folio unevictable after
munlock during the RFC v2 discussion [3]. We decided that the race
doesn't block this series because:
  - The race was not introduced by this series

  - We have a fix for the race that looks OK. It needs to wait for
    the mlock_count fixing patch, as Yosry Ahmed suggested [4]

ChangeLog from V1:
  - Remove the PTE check from folio_in_range() and reuse the page
    table iterators (in madvise and folio_referenced_one) to check
    in the callers whether the folio is fully mapped

  - Avoid mlocking a folio which crosses a last-level page table
    boundary. Thanks to Ryan for pointing this out.

  - Drop the pte_none() check when iterating the page table because
    we only care about the pte_present() case.

  - Move folio_test_large() out of m(un)lock_vma_folio()


ChangeLog from RFC v2:
  - Removed RFC

  - Dropped the folio_is_large() check as suggested by both Yu and Hugh

  - Besides the address/pgoff check, also check the page table
    entry when checking whether the folio is in the range. This
    handles the mremap case where the address/pgoff is in range but
    the folio can't be identified as in range.

  - Fixed one issue in page_add_anon_rmap() and page_add_file_rmap()
    introduced by RFC v2. These two functions can be called multiple
    times against one folio, and page_remove_rmap() may not be called
    the same number of times, which can leave mlock_count imbalanced.
    Fix it by skipping the mlock of large folios in these two functions.

[1] https://lore.kernel.org/linux-mm/CAOUHufbtNPkdktjt_5qM45GegVO-rCFOMkSh0HQminQ12zsV8Q@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/20230728070929.2487065-1-fengwei.yin@intel.com/
[3] https://lore.kernel.org/linux-mm/CAOUHufZ6=9P_=CAOQyw0xw-3q707q-1FVV09dBNDC-hpcpj2Pg@mail.gmail.com/
[4] https://lore.kernel.org/linux-mm/CAJD7tkZJFG=7xs=9otc5CKs6odWu48daUuZP9Wd9Z-sZF07hXg@mail.gmail.com/

Yin Fengwei (3):
  mm: add functions folio_in_range() and folio_within_vma()
  mm: handle large folio when large folio in VM_LOCKED VMA range
  mm: mlock: update mlock_pte_range to handle large folio

 mm/internal.h | 58 ++++++++++++++++++++++++++++++++++++--------
 mm/mlock.c    | 66 +++++++++++++++++++++++++++++++++++++++++++++++++--
 mm/rmap.c     | 66 ++++++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 167 insertions(+), 23 deletions(-)

-- 
2.39.2




* [PATCH v2 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-09  6:11 [PATCH v2 0/3] support large folio for mlock Yin Fengwei
@ 2023-08-09  6:11 ` Yin Fengwei
  2023-08-09 19:34   ` Ryan Roberts
  2023-08-09  6:11 ` [PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range Yin Fengwei
  2023-08-09  6:11 ` [PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio Yin Fengwei
  2 siblings, 1 reply; 6+ messages in thread
From: Yin Fengwei @ 2023-08-09  6:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel, akpm, yuzhao, willy, hughd, yosryahmed,
	ryan.roberts, david, shy828301
  Cc: fengwei.yin

Add folio_in_range(), which will be used to check whether the folio
is mapped to a specific VMA and whether the mapping address of the
folio is in the range.

Also add a helper function folio_within_vma(), based on
folio_in_range(), to check whether the folio is in the range of the VMA.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/internal.h | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index 154da4f0d557..5d1b71010fd2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -585,6 +585,41 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
 				   bool write, int *locked);
 extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
 			       unsigned long bytes);
+
+static inline bool
+folio_in_range(struct folio *folio, struct vm_area_struct *vma,
+		unsigned long start, unsigned long end)
+{
+	pgoff_t pgoff, addr;
+	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+
+	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
+	if (start > end)
+		return false;
+
+	if (start < vma->vm_start)
+		start = vma->vm_start;
+
+	if (end > vma->vm_end)
+		end = vma->vm_end;
+
+	pgoff = folio_pgoff(folio);
+
+	/* if folio start address is not in vma range */
+	if (!in_range(pgoff, vma->vm_pgoff, vma_pglen))
+		return false;
+
+	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+
+	return !(addr < start || end - addr < folio_size(folio));
+}
+
+static inline bool
+folio_within_vma(struct folio *folio, struct vm_area_struct *vma)
+{
+	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end);
+}
+
 /*
  * mlock_vma_folio() and munlock_vma_folio():
  * should be called with vma's mmap_lock held for read or write,
-- 
2.39.2




* Re: [PATCH v2 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-09  6:11 ` [PATCH v2 1/3] mm: add functions folio_in_range() and folio_within_vma() Yin Fengwei
@ 2023-08-09 19:34   ` Ryan Roberts
  2023-08-10  1:30     ` Yin Fengwei
  0 siblings, 1 reply; 6+ messages in thread
From: Ryan Roberts @ 2023-08-09 19:34 UTC (permalink / raw)
  To: Yin Fengwei, linux-mm, linux-kernel, akpm, yuzhao, willy, hughd,
	yosryahmed, david, shy828301

On 09/08/2023 07:11, Yin Fengwei wrote:
> It will be used to check whether the folio is mapped to specific
> VMA and whether the mapping address of folio is in the range.
> 
> Also a helper function folio_within_vma() to check whether folio
> is in the range of vma based on folio_in_range().
> 
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> ---
>  mm/internal.h | 35 +++++++++++++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 154da4f0d557..5d1b71010fd2 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -585,6 +585,41 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>  				   bool write, int *locked);
>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>  			       unsigned long bytes);
> +
> +static inline bool
> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
> +		unsigned long start, unsigned long end)

I still think it would be beneficial to have a comment block describing the
requirements and behaviour of the function:

 - folio must have at least 1 page that is mapped in vma
 - the result tells you if the folio lies within the range, but it does not tell
you that all of its pages are actually _mapped_ (e.g. they may not have been
faulted in yet).

 - I think [start, end) is intended to intersect with the vma too? (although I'm
pretty sure the logic works if it doesn't?)

> +{
> +	pgoff_t pgoff, addr;
> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> +
> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
> +	if (start > end)
> +		return false;
> +
> +	if (start < vma->vm_start)
> +		start = vma->vm_start;
> +
> +	if (end > vma->vm_end)
> +		end = vma->vm_end;
> +
> +	pgoff = folio_pgoff(folio);
> +
> +	/* if folio start address is not in vma range */
> +	if (!in_range(pgoff, vma->vm_pgoff, vma_pglen))
> +		return false;
> +
> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> +
> +	return !(addr < start || end - addr < folio_size(folio));
> +}
> +
> +static inline bool
> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma)

why call this *within* but call the folio_in_range() *in*? Feels cleaner to use
the same word for both.

> +{
> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end);
> +}
> +
>  /*
>   * mlock_vma_folio() and munlock_vma_folio():
>   * should be called with vma's mmap_lock held for read or write,




* Re: [PATCH v2 1/3] mm: add functions folio_in_range() and folio_within_vma()
  2023-08-09 19:34   ` Ryan Roberts
@ 2023-08-10  1:30     ` Yin Fengwei
  0 siblings, 0 replies; 6+ messages in thread
From: Yin Fengwei @ 2023-08-10  1:30 UTC (permalink / raw)
  To: Ryan Roberts, linux-mm, linux-kernel, akpm, yuzhao, willy, hughd,
	yosryahmed, david, shy828301



On 8/10/23 03:34, Ryan Roberts wrote:
> On 09/08/2023 07:11, Yin Fengwei wrote:
>> It will be used to check whether the folio is mapped to specific
>> VMA and whether the mapping address of folio is in the range.
>>
>> Also a helper function folio_within_vma() to check whether folio
>> is in the range of vma based on folio_in_range().
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>> ---
>>  mm/internal.h | 35 +++++++++++++++++++++++++++++++++++
>>  1 file changed, 35 insertions(+)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 154da4f0d557..5d1b71010fd2 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -585,6 +585,41 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>  				   bool write, int *locked);
>>  extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>  			       unsigned long bytes);
>> +
>> +static inline bool
>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>> +		unsigned long start, unsigned long end)
> 
> I still think it would be beneficial to have a comment block describing the
> requirements and behaviour of the function:
Definitely. Thanks a lot for reminding me again.

> 
>  - folio must have at least 1 page that is mapped in vma
This is typical usage. 

>  - the result tells you if the folio lies within the range, but it does not tell
> you that all of its pages are actually _mapped_ (e.g. they may not have been
> faulted in yet).
Exactly. Something like the following:

This function can't tell whether the folio is fully mapped in the range, as it
doesn't check whether all pages of the folio are mapped by the page table of
the VMA. The caller needs to check the page table if it cares about full
mapping.

In typical usage (mlock or madvise), the caller first calls this function to
check whether the folio is in the range. Then, once it knows at least 1 page of
the folio is mapped by the page table of the VMA, it checks the page table to
find out whether the folio is fully mapped to the range.
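
To make the address arithmetic concrete, here is a minimal userspace
model of the range check. It is a sketch, not the kernel function: the
struct layouts, the model_ names and the 4K page size are assumptions,
and the explicit addr <= end guard in the final comparison is an
addition of this sketch so that the model does not depend on unsigned
wraparound in end - addr.

```c
#include <stdbool.h>

/* Assumption of this model only: 4K pages. */
#define MODEL_PAGE_SHIFT 12

/* Hypothetical stand-ins for the fields the check reads. */
struct model_vma {
	unsigned long vm_start, vm_end; /* [vm_start, vm_end) */
	unsigned long vm_pgoff;         /* page offset of vm_start */
};

struct model_folio {
	unsigned long pgoff;    /* page offset of the folio's first page */
	unsigned long nr_pages; /* pages in the folio */
};

static bool model_folio_in_range(const struct model_folio *folio,
				 const struct model_vma *vma,
				 unsigned long start, unsigned long end)
{
	unsigned long vma_pglen =
		(vma->vm_end - vma->vm_start) >> MODEL_PAGE_SHIFT;
	unsigned long size = folio->nr_pages << MODEL_PAGE_SHIFT;
	unsigned long addr;

	if (start > end)
		return false;

	/* Clamp the queried range to the VMA. */
	if (start < vma->vm_start)
		start = vma->vm_start;
	if (end > vma->vm_end)
		end = vma->vm_end;

	/* The folio's first page must fall inside the VMA's pgoff window. */
	if (folio->pgoff < vma->vm_pgoff ||
	    folio->pgoff - vma->vm_pgoff >= vma_pglen)
		return false;

	/* Address at which the folio's first page would be mapped. */
	addr = vma->vm_start +
	       ((folio->pgoff - vma->vm_pgoff) << MODEL_PAGE_SHIFT);

	/* The whole folio must fit in [start, end). */
	return addr >= start && addr <= end && end - addr >= size;
}
```

As described above, a true result only says that the folio's address
span lies within the range. It says nothing about whether every page
is actually mapped, which the caller still has to learn from the page
table.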

> 
>  - I think [start, end) is intended to intersect with the vma too? (although I'm
> pretty sure the logic works if it doesn't?)
> 
>> +{
>> +	pgoff_t pgoff, addr;
>> +	unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>> +
>> +	VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>> +	if (start > end)
>> +		return false;
>> +
>> +	if (start < vma->vm_start)
>> +		start = vma->vm_start;
>> +
>> +	if (end > vma->vm_end)
>> +		end = vma->vm_end;
>> +
>> +	pgoff = folio_pgoff(folio);
>> +
>> +	/* if folio start address is not in vma range */
>> +	if (!in_range(pgoff, vma->vm_pgoff, vma_pglen))
>> +		return false;
>> +
>> +	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>> +
>> +	return !(addr < start || end - addr < folio_size(folio));
>> +}
>> +
>> +static inline bool
>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma)
> 
> why call this *within* but call the folio_in_range() *in*? Feels cleaner to use
> the same word for both.
Good point. I will change folio_in_range() to folio_within_range().


Regards
Yin, Fengwei

> 
>> +{
>> +	return folio_in_range(folio, vma, vma->vm_start, vma->vm_end);
>> +}
>> +
>>  /*
>>   * mlock_vma_folio() and munlock_vma_folio():
>>   * should be called with vma's mmap_lock held for read or write,
> 



* [PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range
  2023-08-09  6:11 [PATCH v2 0/3] support large folio for mlock Yin Fengwei
  2023-08-09  6:11 ` [PATCH v2 1/3] mm: add functions folio_in_range() and folio_within_vma() Yin Fengwei
@ 2023-08-09  6:11 ` Yin Fengwei
  2023-08-09  6:11 ` [PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio Yin Fengwei
  2 siblings, 0 replies; 6+ messages in thread
From: Yin Fengwei @ 2023-08-09  6:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel, akpm, yuzhao, willy, hughd, yosryahmed,
	ryan.roberts, david, shy828301
  Cc: fengwei.yin

If a large folio is in the range of a VM_LOCKED VMA, it should be
mlocked to avoid being picked by page reclaim, which may split
the large folio and then mlock each page again.

Mlock this kind of large folio to prevent it from being picked by
page reclaim.

For a large folio which crosses the boundary of a VM_LOCKED VMA
or is not fully mapped to a VM_LOCKED VMA, we'd better not mlock
it. Then, if the system is under memory pressure, this kind of
large folio will be split and the pages out of the VM_LOCKED VMA
can be reclaimed.

Ideally, for a large folio, we should mlock it when it is fully
mapped to the VMA and munlock it if any page is unmapped from the
VMA. But it's not easy to detect whether the large folio is fully
mapped to the VMA in some cases (like add/remove rmap). So we
update mlock_vma_folio() and munlock_vma_folio() to mlock/munlock
the folio according to vma->vm_flags, and let the callers decide
whether they should call these two functions.

For add rmap, only mlock normal 4K folios and postpone large folio
handling to the page reclaim phase, where it is possible to reuse
the page table iterator to detect whether the folio is fully
mapped. For remove rmap, invoke munlock_vma_folio() to munlock the
folio unconditionally, because removing the rmap leaves the folio
not fully mapped to the VMA.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/internal.h | 23 ++++++++++--------
 mm/rmap.c     | 66 ++++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 68 insertions(+), 21 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 5d1b71010fd2..b14fb2d8b04c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -628,14 +628,10 @@ folio_within_vma(struct folio *folio, struct vm_area_struct *vma)
  * mlock is usually called at the end of page_add_*_rmap(), munlock at
  * the end of page_remove_rmap(); but new anon folios are managed by
  * folio_add_lru_vma() calling mlock_new_folio().
- *
- * @compound is used to include pmd mappings of THPs, but filter out
- * pte mappings of THPs, which cannot be consistently counted: a pte
- * mapping of the THP head cannot be distinguished by the page alone.
  */
 void mlock_folio(struct folio *folio);
 static inline void mlock_vma_folio(struct folio *folio,
-			struct vm_area_struct *vma, bool compound)
+				struct vm_area_struct *vma)
 {
 	/*
 	 * The VM_SPECIAL check here serves two purposes.
@@ -645,17 +641,24 @@ static inline void mlock_vma_folio(struct folio *folio,
 	 *    file->f_op->mmap() is using vm_insert_page(s), when VM_LOCKED may
 	 *    still be set while VM_SPECIAL bits are added: so ignore it then.
 	 */
-	if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
-	    (compound || !folio_test_large(folio)))
+	if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED))
 		mlock_folio(folio);
 }
 
 void munlock_folio(struct folio *folio);
 static inline void munlock_vma_folio(struct folio *folio,
-			struct vm_area_struct *vma, bool compound)
+					struct vm_area_struct *vma)
 {
-	if (unlikely(vma->vm_flags & VM_LOCKED) &&
-	    (compound || !folio_test_large(folio)))
+	/*
+	 * munlock if the function is called. Ideally, we should only
+	 * do munlock if any page of folio is unmapped from VMA and
+	 * cause folio not fully mapped to VMA.
+	 *
+	 * But it's not easy to confirm that's the situation. So we
+	 * always munlock the folio and page reclaim will correct it
+	 * if it's wrong.
+	 */
+	if (unlikely(vma->vm_flags & VM_LOCKED))
 		munlock_folio(folio);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 3c20d0d79905..dae0443e9ab0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -798,6 +798,7 @@ struct folio_referenced_arg {
 	unsigned long vm_flags;
 	struct mem_cgroup *memcg;
 };
+
 /*
  * arg: folio_referenced_arg will be passed
  */
@@ -807,17 +808,33 @@ static bool folio_referenced_one(struct folio *folio,
 	struct folio_referenced_arg *pra = arg;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	int referenced = 0;
+	unsigned long start = address, ptes = 0;
 
 	while (page_vma_mapped_walk(&pvmw)) {
 		address = pvmw.address;
 
-		if ((vma->vm_flags & VM_LOCKED) &&
-		    (!folio_test_large(folio) || !pvmw.pte)) {
-			/* Restore the mlock which got missed */
-			mlock_vma_folio(folio, vma, !pvmw.pte);
-			page_vma_mapped_walk_done(&pvmw);
-			pra->vm_flags |= VM_LOCKED;
-			return false; /* To break the loop */
+		if (vma->vm_flags & VM_LOCKED) {
+			if (!folio_test_large(folio) || !pvmw.pte) {
+				/* Restore the mlock which got missed */
+				mlock_vma_folio(folio, vma);
+				page_vma_mapped_walk_done(&pvmw);
+				pra->vm_flags |= VM_LOCKED;
+				return false; /* To break the loop */
+			}
+			/*
+			 * A large folio fully mapped to the VMA will
+			 * be handled after the pvmw loop.
+			 *
+			 * A large folio crossing VMA boundaries is
+			 * expected to be picked by page reclaim. But
+			 * skip the references of pages which are in
+			 * the range of the VM_LOCKED vma, as page reclaim
+			 * should only count the references of pages
+			 * outside the range of the VM_LOCKED vma.
+			 */
+			ptes++;
+			pra->mapcount--;
+			continue;
 		}
 
 		if (pvmw.pte) {
@@ -842,6 +859,23 @@ static bool folio_referenced_one(struct folio *folio,
 		pra->mapcount--;
 	}
 
+	if ((vma->vm_flags & VM_LOCKED) &&
+			folio_test_large(folio) &&
+			folio_within_vma(folio, vma)) {
+		unsigned long s_align, e_align;
+
+		s_align = ALIGN_DOWN(start, PMD_SIZE);
+		e_align = ALIGN_DOWN(start + folio_size(folio) - 1, PMD_SIZE);
+
+		/* folio doesn't cross page table boundary and fully mapped */
+		if ((s_align == e_align) && (ptes == folio_nr_pages(folio))) {
+			/* Restore the mlock which got missed */
+			mlock_vma_folio(folio, vma);
+			pra->vm_flags |= VM_LOCKED;
+			return false; /* To break the loop */
+		}
+	}
+
 	if (referenced)
 		folio_clear_idle(folio);
 	if (folio_test_clear_young(folio))
@@ -1260,7 +1294,14 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
 			__page_check_anon_rmap(folio, page, vma, address);
 	}
 
-	mlock_vma_folio(folio, vma, compound);
+	/*
+	 * For large folio, only mlock it if it's fully mapped to VMA. It's
+	 * not easy to check whether the large folio is fully mapped to VMA
+	 * here. Only mlock normal 4K folio and leave page reclaim to handle
+	 * large folio.
+	 */
+	if (!folio_test_large(folio))
+		mlock_vma_folio(folio, vma);
 }
 
 void folio_add_new_anon_rmap_range(struct folio *folio,
@@ -1371,7 +1412,9 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page,
 	if (nr)
 		__lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr);
 
-	mlock_vma_folio(folio, vma, compound);
+	/* See comments in page_add_anon_rmap() */
+	if (!folio_test_large(folio))
+		mlock_vma_folio(folio, vma);
 }
 
 /**
@@ -1482,7 +1525,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 	 * it's only reliable while mapped.
 	 */
 
-	munlock_vma_folio(folio, vma, compound);
+	munlock_vma_folio(folio, vma);
 }
 
 /*
@@ -1543,7 +1586,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		if (!(flags & TTU_IGNORE_MLOCK) &&
 		    (vma->vm_flags & VM_LOCKED)) {
 			/* Restore the mlock which got missed */
-			mlock_vma_folio(folio, vma, false);
+			if (!folio_test_large(folio))
+				mlock_vma_folio(folio, vma);
 			page_vma_mapped_walk_done(&pvmw);
 			ret = false;
 			break;
-- 
2.39.2




* [PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio
  2023-08-09  6:11 [PATCH v2 0/3] support large folio for mlock Yin Fengwei
  2023-08-09  6:11 ` [PATCH v2 1/3] mm: add functions folio_in_range() and folio_within_vma() Yin Fengwei
  2023-08-09  6:11 ` [PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range Yin Fengwei
@ 2023-08-09  6:11 ` Yin Fengwei
  2 siblings, 0 replies; 6+ messages in thread
From: Yin Fengwei @ 2023-08-09  6:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel, akpm, yuzhao, willy, hughd, yosryahmed,
	ryan.roberts, david, shy828301
  Cc: fengwei.yin

The current kernel only locks base-size folios during the mlock syscall.
Add large folio support with the following rules:
  - Only mlock a large folio when it's in the VM_LOCKED VMA range
    and fully mapped in the page table.

    A fully mapped folio is required because, if the folio is not
    fully mapped to a VM_LOCKED VMA and the system is under memory
    pressure, page reclaim is allowed to pick up this folio, split
    it and reclaim the pages which are not in the VM_LOCKED VMA.

  - munlock applies to a large folio which is in the VMA range
    or crosses the VMA boundary.

    This is required to handle the case where the large folio is
    mlocked and the VMA is later split in the middle of the large folio.
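
The "fully mapped" rule relies on counting how many consecutive PTEs,
starting from the current one, map pages of the same folio. A rough
userspace model of that counting is sketched below. struct model_pte,
model_mlock_step() and the 4K page size are assumptions made for
illustration, and the caller is assumed to pass a first PTE that is
present and maps a page of the folio.

```c
#include <stdbool.h>

/* Assumption of this model only: 4K pages. */
#define MODEL_PAGE_SHIFT 12

/* Hypothetical, simplified PTE: a present bit and the mapped pfn. */
struct model_pte {
	bool present;
	unsigned long pfn;
};

/*
 * Count consecutive PTEs, starting at pte[0], that are present and map
 * pages of the folio [folio_pfn, folio_pfn + nr), without walking past
 * the address range [addr, end).
 */
static unsigned int model_mlock_step(unsigned long folio_pfn, unsigned int nr,
				     const struct model_pte *pte,
				     unsigned long addr, unsigned long end)
{
	unsigned int count, limit, i;

	/* A base-size folio always advances by exactly one PTE. */
	if (nr == 1)
		return 1;

	/* Pages of the folio left from the current PTE onwards ... */
	count = (unsigned int)(folio_pfn + nr - pte[0].pfn);
	/* ... capped by the number of PTEs left in the walked range. */
	limit = (unsigned int)((end - addr) >> MODEL_PAGE_SHIFT);
	if (count > limit)
		count = limit;

	for (i = 0; i < count; i++) {
		if (!pte[i].present)
			break;
		/* Unsigned subtraction also rejects pfns below folio_pfn. */
		if (pte[i].pfn - folio_pfn >= nr)
			break;
	}

	return i;
}
```

A caller would treat the folio as fully mapped only when the returned
step equals the folio's page count. Any smaller step means a hole, a
foreign page, or the end of the walked range was reached first, so the
folio is skipped for mlock.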

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/mlock.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 64 insertions(+), 2 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 06bdfab83b58..1da1996745e7 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -305,6 +305,58 @@ void munlock_folio(struct folio *folio)
 	local_unlock(&mlock_fbatch.lock);
 }
 
+static inline unsigned int folio_mlock_step(struct folio *folio,
+		pte_t *pte, unsigned long addr, unsigned long end)
+{
+	unsigned int count, i, nr = folio_nr_pages(folio);
+	unsigned long pfn = folio_pfn(folio);
+	pte_t ptent = ptep_get(pte);
+
+	if (!folio_test_large(folio))
+		return 1;
+
+	count = pfn + nr - pte_pfn(ptent);
+	count = min_t(unsigned int, count, (end - addr) >> PAGE_SHIFT);
+
+	for (i = 0; i < count; i++, pte++) {
+		pte_t entry = ptep_get(pte);
+
+		if (!pte_present(entry))
+			break;
+		if (pte_pfn(entry) - pfn >= nr)
+			break;
+	}
+
+	return i;
+}
+
+static inline bool allow_mlock_munlock(struct folio *folio,
+		struct vm_area_struct *vma, unsigned long start,
+		unsigned long end, unsigned int step)
+{
+	/*
+	 * For unlock, allow munlocking a large folio which is
+	 * partially mapped to the VMA, as it's possible that the
+	 * large folio was mlocked and the VMA was split later.
+	 *
+	 * During memory pressure, such a large folio can be
+	 * split, and the pages that are not in the VM_LOCKED
+	 * VMA can be reclaimed.
+	 */
+	if (!(vma->vm_flags & VM_LOCKED))
+		return true;
+
+	/* folio not in range [start, end), skip mlock */
+	if (!folio_in_range(folio, vma, start, end))
+		return false;
+
+	/* folio is not fully mapped, skip mlock */
+	if (step != folio_nr_pages(folio))
+		return false;
+
+	return true;
+}
+
 static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
 			   unsigned long end, struct mm_walk *walk)
 
@@ -314,6 +366,8 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t *start_pte, *pte;
 	pte_t ptent;
 	struct folio *folio;
+	unsigned int step = 1;
+	unsigned long start = addr;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -334,6 +388,7 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
 		walk->action = ACTION_AGAIN;
 		return 0;
 	}
+
 	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = ptep_get(pte);
 		if (!pte_present(ptent))
@@ -341,12 +396,19 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
 		folio = vm_normal_folio(vma, addr, ptent);
 		if (!folio || folio_is_zone_device(folio))
 			continue;
-		if (folio_test_large(folio))
-			continue;
+
+		step = folio_mlock_step(folio, pte, addr, end);
+		if (!allow_mlock_munlock(folio, vma, start, end, step))
+			goto next_entry;
+
 		if (vma->vm_flags & VM_LOCKED)
 			mlock_folio(folio);
 		else
 			munlock_folio(folio);
+
+next_entry:
+		pte += step - 1;
+		addr += (step - 1) << PAGE_SHIFT;
 	}
 	pte_unmap(start_pte);
 out:
-- 
2.39.2



