* [PATCHv2 1/5] mm/page_vma_mapped: Track if the page is mapped across page table boundary
2025-09-19 12:40 [PATCHv2 0/5] mm: Improve mlock tracking for large folios Kiryl Shutsemau
@ 2025-09-19 12:40 ` Kiryl Shutsemau
2025-09-19 20:25 ` Shakeel Butt
2025-09-19 12:40 ` [PATCHv2 2/5] mm/rmap: Fix a mlock race condition in folio_referenced_one() Kiryl Shutsemau
` (3 subsequent siblings)
4 siblings, 1 reply; 17+ messages in thread
From: Kiryl Shutsemau @ 2025-09-19 12:40 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox
Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Shakeel Butt, Baolin Wang, linux-mm,
linux-kernel, Kiryl Shutsemau
From: Kiryl Shutsemau <kas@kernel.org>
Add a PVMW_PGTABLE_CROSSSED flag that page_vma_mapped_walk() will set if
the page is mapped across a page table boundary. Unlike the other PVMW_*
flags, this one is a result of page_vma_mapped_walk() and is not set by
the caller.
folio_referenced_one() will use it to detect whether it is safe to mlock
the folio.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
include/linux/rmap.h | 5 +++++
mm/page_vma_mapped.c | 1 +
2 files changed, 6 insertions(+)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6cd020eea37a..04797cea3205 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -928,6 +928,11 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
/* Look for migration entries rather than present PTEs */
#define PVMW_MIGRATION (1 << 1)
+/* Result flags */
+
+/* The page is mapped across page boundary */
+#define PVMW_PGTABLE_CROSSSED (1 << 16)
+
struct page_vma_mapped_walk {
unsigned long pfn;
unsigned long nr_pages;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e981a1a292d2..a184b88743c3 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -309,6 +309,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
}
pte_unmap(pvmw->pte);
pvmw->pte = NULL;
+ pvmw->flags |= PVMW_PGTABLE_CROSSSED;
goto restart;
}
pvmw->pte++;
--
2.50.1
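As a rough illustration of what the new flag reports (not part of the patch
itself): a folio crosses a page table boundary when its first and last mapped
byte round down to different page-table-sized chunks, which is the same
s_align/e_align comparison that patch 2 removes from folio_referenced_one().
Below is a minimal userspace sketch of that condition; the 2 MiB span per page
table and the sample addresses are assumptions for illustration (x86-64-style
4 KiB pages), not values taken from kernel headers.

#include <stdbool.h>
#include <stdio.h>

/* Assumed geometry: 512 PTEs per page table * 4 KiB pages. */
#define PMD_SPAN (2UL << 20)

/* Same condition as the old ALIGN_DOWN(start/end, PMD_SIZE) comparison. */
static bool crosses_page_table(unsigned long addr, unsigned long folio_size)
{
	unsigned long first = addr & ~(PMD_SPAN - 1);
	unsigned long last = (addr + folio_size - 1) & ~(PMD_SPAN - 1);

	return first != last;
}

int main(void)
{
	/* 64 KiB folio mapped 32 KiB before a 2 MiB boundary: crosses (1). */
	printf("%d\n", crosses_page_table(0x200000 - 0x8000, 0x10000));
	/* The same folio mapped exactly at a 2 MiB boundary: does not (0). */
	printf("%d\n", crosses_page_table(0x200000, 0x10000));
	return 0;
}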
* Re: [PATCHv2 1/5] mm/page_vma_mapped: Track if the page is mapped across page table boundary
2025-09-19 12:40 ` [PATCHv2 1/5] mm/page_vma_mapped: Track if the page is mapped across page table boundary Kiryl Shutsemau
@ 2025-09-19 20:25 ` Shakeel Butt
2025-09-22 16:13 ` Kiryl Shutsemau
0 siblings, 1 reply; 17+ messages in thread
From: Shakeel Butt @ 2025-09-19 20:25 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Baolin Wang, linux-mm, linux-kernel,
Kiryl Shutsemau
On Fri, Sep 19, 2025 at 01:40:32PM +0100, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> Add a PVMW_PGTABLE_CROSSSED flag that page_vma_mapped_walk() will set if
> the page is mapped across a page table boundary. Unlike the other PVMW_*
> flags, this one is a result of page_vma_mapped_walk() and is not set by
> the caller.
>
> folio_referenced_one() will use it to detect whether it is safe to mlock
> the folio.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
> include/linux/rmap.h | 5 +++++
> mm/page_vma_mapped.c | 1 +
> 2 files changed, 6 insertions(+)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 6cd020eea37a..04797cea3205 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -928,6 +928,11 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
> /* Look for migration entries rather than present PTEs */
> #define PVMW_MIGRATION (1 << 1)
>
> +/* Result flags */
> +
> +/* The page is mapped across page boundary */
I think you meant "page table boundary" in the above comment.
> +#define PVMW_PGTABLE_CROSSSED (1 << 16)
> +
> struct page_vma_mapped_walk {
> unsigned long pfn;
> unsigned long nr_pages;
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e981a1a292d2..a184b88743c3 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -309,6 +309,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> }
> pte_unmap(pvmw->pte);
> pvmw->pte = NULL;
> + pvmw->flags |= PVMW_PGTABLE_CROSSSED;
> goto restart;
> }
> pvmw->pte++;
> --
> 2.50.1
>
* Re: [PATCHv2 1/5] mm/page_vma_mapped: Track if the page is mapped across page table boundary
2025-09-19 20:25 ` Shakeel Butt
@ 2025-09-22 16:13 ` Kiryl Shutsemau
0 siblings, 0 replies; 17+ messages in thread
From: Kiryl Shutsemau @ 2025-09-22 16:13 UTC (permalink / raw)
To: Shakeel Butt
Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Baolin Wang, linux-mm, linux-kernel
On Fri, Sep 19, 2025 at 01:25:36PM -0700, Shakeel Butt wrote:
> On Fri, Sep 19, 2025 at 01:40:32PM +0100, Kiryl Shutsemau wrote:
> > From: Kiryl Shutsemau <kas@kernel.org>
> >
> > Add a PVMW_PGTABLE_CROSSSED flag that page_vma_mapped_walk() will set if
> > the page is mapped across a page table boundary. Unlike the other PVMW_*
> > flags, this one is a result of page_vma_mapped_walk() and is not set by
> > the caller.
> >
> > folio_referenced_one() will use it to detect whether it is safe to mlock
> > the folio.
> >
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>
> > ---
> > include/linux/rmap.h | 5 +++++
> > mm/page_vma_mapped.c | 1 +
> > 2 files changed, 6 insertions(+)
> >
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index 6cd020eea37a..04797cea3205 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -928,6 +928,11 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
> > /* Look for migration entries rather than present PTEs */
> > #define PVMW_MIGRATION (1 << 1)
> >
> > +/* Result flags */
> > +
> > +/* The page is mapped across page boundary */
>
> I think you meant "page table boundary" in above comment.
Right. Will fix in the v3.
--
Kiryl Shutsemau / Kirill A. Shutemov
* [PATCHv2 2/5] mm/rmap: Fix a mlock race condition in folio_referenced_one()
2025-09-19 12:40 [PATCHv2 0/5] mm: Improve mlock tracking for large folios Kiryl Shutsemau
2025-09-19 12:40 ` [PATCHv2 1/5] mm/page_vma_mapped: Track if the page is mapped across page table boundary Kiryl Shutsemau
@ 2025-09-19 12:40 ` Kiryl Shutsemau
2025-09-19 21:18 ` Shakeel Butt
2025-09-19 12:40 ` [PATCHv2 3/5] mm/rmap: mlock large folios in try_to_unmap_one() Kiryl Shutsemau
` (2 subsequent siblings)
4 siblings, 1 reply; 17+ messages in thread
From: Kiryl Shutsemau @ 2025-09-19 12:40 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox
Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Shakeel Butt, Baolin Wang, linux-mm,
linux-kernel, Kiryl Shutsemau
From: Kiryl Shutsemau <kas@kernel.org>
The mlock_vma_folio() function requires the page table lock to be held
in order to safely mlock the folio. However, folio_referenced_one()
mlocks large folios outside of the page_vma_mapped_walk() loop, where
the page table lock has already been dropped.
Rework the mlock logic to use the same code path inside the loop for
both large and small folios.
Use PVMW_PGTABLE_CROSSED to detect when the folio is mapped across a
page table boundary.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
mm/rmap.c | 59 ++++++++++++++++++++-----------------------------------
1 file changed, 21 insertions(+), 38 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 568198e9efc2..3d0235f332de 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -851,34 +851,34 @@ static bool folio_referenced_one(struct folio *folio,
{
struct folio_referenced_arg *pra = arg;
DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
- int referenced = 0;
- unsigned long start = address, ptes = 0;
+ int ptes = 0, referenced = 0;
while (page_vma_mapped_walk(&pvmw)) {
address = pvmw.address;
if (vma->vm_flags & VM_LOCKED) {
- if (!folio_test_large(folio) || !pvmw.pte) {
- /* Restore the mlock which got missed */
- mlock_vma_folio(folio, vma);
- page_vma_mapped_walk_done(&pvmw);
- pra->vm_flags |= VM_LOCKED;
- return false; /* To break the loop */
- }
- /*
- * For large folio fully mapped to VMA, will
- * be handled after the pvmw loop.
- *
- * For large folio cross VMA boundaries, it's
- * expected to be picked by page reclaim. But
- * should skip reference of pages which are in
- * the range of VM_LOCKED vma. As page reclaim
- * should just count the reference of pages out
- * the range of VM_LOCKED vma.
- */
ptes++;
pra->mapcount--;
- continue;
+
+ /* Only mlock fully mapped pages */
+ if (pvmw.pte && ptes != pvmw.nr_pages)
+ continue;
+
+ /*
+ * All PTEs must be protected by the page table lock in
+ * order to mlock the page.
+ *
+ * If the page table boundary has been crossed, the current
+ * ptl only protects part of the ptes.
+ */
+ if (pvmw.flags & PVMW_PGTABLE_CROSSSED)
+ continue;
+
+ /* Restore the mlock which got missed */
+ mlock_vma_folio(folio, vma);
+ page_vma_mapped_walk_done(&pvmw);
+ pra->vm_flags |= VM_LOCKED;
+ return false; /* To break the loop */
}
/*
@@ -914,23 +914,6 @@ static bool folio_referenced_one(struct folio *folio,
pra->mapcount--;
}
- if ((vma->vm_flags & VM_LOCKED) &&
- folio_test_large(folio) &&
- folio_within_vma(folio, vma)) {
- unsigned long s_align, e_align;
-
- s_align = ALIGN_DOWN(start, PMD_SIZE);
- e_align = ALIGN_DOWN(start + folio_size(folio) - 1, PMD_SIZE);
-
- /* folio doesn't cross page table boundary and fully mapped */
- if ((s_align == e_align) && (ptes == folio_nr_pages(folio))) {
- /* Restore the mlock which got missed */
- mlock_vma_folio(folio, vma);
- pra->vm_flags |= VM_LOCKED;
- return false; /* To break the loop */
- }
- }
-
if (referenced)
folio_clear_idle(folio);
if (folio_test_clear_young(folio))
--
2.50.1
* Re: [PATCHv2 2/5] mm/rmap: Fix a mlock race condition in folio_referenced_one()
2025-09-19 12:40 ` [PATCHv2 2/5] mm/rmap: Fix a mlock race condition in folio_referenced_one() Kiryl Shutsemau
@ 2025-09-19 21:18 ` Shakeel Butt
0 siblings, 0 replies; 17+ messages in thread
From: Shakeel Butt @ 2025-09-19 21:18 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Baolin Wang, linux-mm, linux-kernel,
Kiryl Shutsemau
On Fri, Sep 19, 2025 at 01:40:33PM +0100, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> The mlock_vma_folio() function requires the page table lock to be held
> in order to safely mlock the folio. However, folio_referenced_one()
> mlocks large folios outside of the page_vma_mapped_walk() loop, where
> the page table lock has already been dropped.
>
> Rework the mlock logic to use the same code path inside the loop for
> both large and small folios.
>
> Use PVMW_PGTABLE_CROSSED to detect when the folio is mapped across a
> page table boundary.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Nice, this simplifies the code.
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
* [PATCHv2 3/5] mm/rmap: mlock large folios in try_to_unmap_one()
2025-09-19 12:40 [PATCHv2 0/5] mm: Improve mlock tracking for large folios Kiryl Shutsemau
2025-09-19 12:40 ` [PATCHv2 1/5] mm/page_vma_mapped: Track if the page is mapped across page table boundary Kiryl Shutsemau
2025-09-19 12:40 ` [PATCHv2 2/5] mm/rmap: Fix a mlock race condition in folio_referenced_one() Kiryl Shutsemau
@ 2025-09-19 12:40 ` Kiryl Shutsemau
2025-09-19 21:27 ` Shakeel Butt
2025-09-19 12:40 ` [PATCHv2 4/5] mm/fault: Try to map the entire file folio in finish_fault() Kiryl Shutsemau
2025-09-19 12:40 ` [PATCHv2 5/5] mm/rmap: Improve mlock tracking for large folios Kiryl Shutsemau
4 siblings, 1 reply; 17+ messages in thread
From: Kiryl Shutsemau @ 2025-09-19 12:40 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox
Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Shakeel Butt, Baolin Wang, linux-mm,
linux-kernel, Kiryl Shutsemau
From: Kiryl Shutsemau <kas@kernel.org>
Currently, try_to_unmap_one() only tries to mlock small folios.
Use logic similar to folio_referenced_one() to mlock large folios:
only do this for fully mapped folios and under page table lock that
protects all page table entries.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
mm/rmap.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 3d0235f332de..482e6504fa88 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1870,6 +1870,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
unsigned long nr_pages = 1, end_addr;
unsigned long pfn;
unsigned long hsz = 0;
+ int ptes = 0;
/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -1910,10 +1911,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
if (!(flags & TTU_IGNORE_MLOCK) &&
(vma->vm_flags & VM_LOCKED)) {
+ ptes++;
+ ret = false;
+
+ /* Only mlock fully mapped pages */
+ if (pvmw.pte && ptes != pvmw.nr_pages)
+ continue;
+
+ /*
+ * All PTEs must be protected by the page table lock in
+ * order to mlock the page.
+ *
+ * If the page table boundary has been crossed, the current
+ * ptl only protects part of the ptes.
+ */
+ if (pvmw.flags & PVMW_PGTABLE_CROSSSED)
+ goto walk_done;
+
/* Restore the mlock which got missed */
- if (!folio_test_large(folio))
- mlock_vma_folio(folio, vma);
- goto walk_abort;
+ mlock_vma_folio(folio, vma);
+ goto walk_done;
}
if (!pvmw.pte) {
--
2.50.1
* Re: [PATCHv2 3/5] mm/rmap: mlock large folios in try_to_unmap_one()
2025-09-19 12:40 ` [PATCHv2 3/5] mm/rmap: mlock large folios in try_to_unmap_one() Kiryl Shutsemau
@ 2025-09-19 21:27 ` Shakeel Butt
2025-09-22 9:51 ` Kiryl Shutsemau
0 siblings, 1 reply; 17+ messages in thread
From: Shakeel Butt @ 2025-09-19 21:27 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Baolin Wang, linux-mm, linux-kernel,
Kiryl Shutsemau
On Fri, Sep 19, 2025 at 01:40:34PM +0100, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> Currently, try_to_unmap_one() only tries to mlock small folios.
>
> Use logic similar to folio_referenced_one() to mlock large folios:
> only do this for fully mapped folios and under page table lock that
> protects all page table entries.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
> mm/rmap.c | 23 ++++++++++++++++++++---
> 1 file changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 3d0235f332de..482e6504fa88 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1870,6 +1870,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> unsigned long nr_pages = 1, end_addr;
> unsigned long pfn;
> unsigned long hsz = 0;
> + int ptes = 0;
>
> /*
> * When racing against e.g. zap_pte_range() on another cpu,
> @@ -1910,10 +1911,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> */
> if (!(flags & TTU_IGNORE_MLOCK) &&
> (vma->vm_flags & VM_LOCKED)) {
> + ptes++;
> + ret = false;
> +
> + /* Only mlock fully mapped pages */
> + if (pvmw.pte && ptes != pvmw.nr_pages)
> + continue;
> +
> + /*
> + * All PTEs must be protected by the page table lock in
> + * order to mlock the page.
> + *
> + * If the page table boundary has been crossed, the current
> + * ptl only protects part of the ptes.
> + */
> + if (pvmw.flags & PVMW_PGTABLE_CROSSSED)
> + goto walk_done;
Should it be goto walk_abort?
> +
> /* Restore the mlock which got missed */
> - if (!folio_test_large(folio))
> - mlock_vma_folio(folio, vma);
> - goto walk_abort;
> + mlock_vma_folio(folio, vma);
> + goto walk_done;
Here too?
> }
>
> if (!pvmw.pte) {
> --
> 2.50.1
>
* Re: [PATCHv2 3/5] mm/rmap: mlock large folios in try_to_unmap_one()
2025-09-19 21:27 ` Shakeel Butt
@ 2025-09-22 9:51 ` Kiryl Shutsemau
2025-09-22 20:16 ` Shakeel Butt
0 siblings, 1 reply; 17+ messages in thread
From: Kiryl Shutsemau @ 2025-09-22 9:51 UTC (permalink / raw)
To: Shakeel Butt
Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Baolin Wang, linux-mm, linux-kernel
On Fri, Sep 19, 2025 at 02:27:40PM -0700, Shakeel Butt wrote:
> On Fri, Sep 19, 2025 at 01:40:34PM +0100, Kiryl Shutsemau wrote:
> > From: Kiryl Shutsemau <kas@kernel.org>
> >
> > Currently, try_to_unmap_one() only tries to mlock small folios.
> >
> > Use logic similar to folio_referenced_one() to mlock large folios:
> > only do this for fully mapped folios and under page table lock that
> > protects all page table entries.
> >
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> > ---
> > mm/rmap.c | 23 ++++++++++++++++++++---
> > 1 file changed, 20 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 3d0235f332de..482e6504fa88 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1870,6 +1870,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > unsigned long nr_pages = 1, end_addr;
> > unsigned long pfn;
> > unsigned long hsz = 0;
> > + int ptes = 0;
> >
> > /*
> > * When racing against e.g. zap_pte_range() on another cpu,
> > @@ -1910,10 +1911,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > */
> > if (!(flags & TTU_IGNORE_MLOCK) &&
> > (vma->vm_flags & VM_LOCKED)) {
> > + ptes++;
> > + ret = false;
> > +
> > + /* Only mlock fully mapped pages */
> > + if (pvmw.pte && ptes != pvmw.nr_pages)
> > + continue;
> > +
> > + /*
> > + * All PTEs must be protected by the page table lock in
> > + * order to mlock the page.
> > + *
> > + * If the page table boundary has been crossed, the current
> > + * ptl only protects part of the ptes.
> > + */
> > + if (pvmw.flags & PVMW_PGTABLE_CROSSSED)
> > + goto walk_done;
>
> Should it be goto walk_abort?
I already have to set ret to false above to make it work for partially
mapped large folios. So walk_done is enough here.
--
Kiryl Shutsemau / Kirill A. Shutemov
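For readers following the walk_abort vs walk_done point: the difference, as
the exchange above implies (paraphrased, not copied from mm/rmap.c), is only
that walk_abort forces ret to false before finishing the walk. A toy, runnable
model of that control flow, with all names hypothetical:

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model only. The loop stands in for the pvmw walk over a VM_LOCKED
 * VMA: ret is set to false for every pte that is counted, so by the time
 * the final pte takes the goto, jumping to walk_done yields the same
 * result as walk_abort would.
 */
static bool walk(int nr_ptes, bool use_abort)
{
	bool ret = true;
	int ptes = 0;

	for (int i = 0; i < nr_ptes; i++) {
		ptes++;
		ret = false;		/* set for every locked pte */
		if (ptes != nr_ptes)
			continue;	/* keep counting the remaining ptes */
		if (use_abort)
			goto walk_abort;
		goto walk_done;
	}
	return ret;

walk_abort:
	ret = false;
	/* falls through to walk_done */
walk_done:
	/* page_vma_mapped_walk_done(&pvmw) would go here in the kernel */
	return ret;
}

int main(void)
{
	printf("walk_abort: %d, walk_done: %d\n", walk(4, true), walk(4, false));
	return 0;
}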
* Re: [PATCHv2 3/5] mm/rmap: mlock large folios in try_to_unmap_one()
2025-09-22 9:51 ` Kiryl Shutsemau
@ 2025-09-22 20:16 ` Shakeel Butt
0 siblings, 0 replies; 17+ messages in thread
From: Shakeel Butt @ 2025-09-22 20:16 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Baolin Wang, linux-mm, linux-kernel
On Mon, Sep 22, 2025 at 10:51:35AM +0100, Kiryl Shutsemau wrote:
> On Fri, Sep 19, 2025 at 02:27:40PM -0700, Shakeel Butt wrote:
> > On Fri, Sep 19, 2025 at 01:40:34PM +0100, Kiryl Shutsemau wrote:
> > > From: Kiryl Shutsemau <kas@kernel.org>
> > >
> > > Currently, try_to_unmap_one() only tries to mlock small folios.
> > >
> > > Use logic similar to folio_referenced_one() to mlock large folios:
> > > only do this for fully mapped folios and under page table lock that
> > > protects all page table entries.
> > >
> > > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> > > ---
> > > mm/rmap.c | 23 ++++++++++++++++++++---
> > > 1 file changed, 20 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 3d0235f332de..482e6504fa88 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -1870,6 +1870,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > > unsigned long nr_pages = 1, end_addr;
> > > unsigned long pfn;
> > > unsigned long hsz = 0;
> > > + int ptes = 0;
> > >
> > > /*
> > > * When racing against e.g. zap_pte_range() on another cpu,
> > > @@ -1910,10 +1911,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > > */
> > > if (!(flags & TTU_IGNORE_MLOCK) &&
> > > (vma->vm_flags & VM_LOCKED)) {
> > > + ptes++;
> > > + ret = false;
> > > +
> > > + /* Only mlock fully mapped pages */
> > > + if (pvmw.pte && ptes != pvmw.nr_pages)
> > > + continue;
> > > +
> > > + /*
> > > + * All PTEs must be protected by the page table lock in
> > > + * order to mlock the page.
> > > + *
> > > + * If the page table boundary has been crossed, the current
> > > + * ptl only protects part of the ptes.
> > > + */
> > > + if (pvmw.flags & PVMW_PGTABLE_CROSSSED)
> > > + goto walk_done;
> >
> > Should it be goto walk_abort?
>
> I already have to set ret to false above to make it work for partially
> mapped large folios. So walk_done is enough here.
Indeed and I missed that. What do you think about adding a comment when
setting ret to false? Everywhere else we are jumping to abort for
scenarios which need to break the rmap walk loop, but here we need to
keep counting the ptes for mlock handling. Anyways it's just a nit.
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
* [PATCHv2 4/5] mm/fault: Try to map the entire file folio in finish_fault()
2025-09-19 12:40 [PATCHv2 0/5] mm: Improve mlock tracking for large folios Kiryl Shutsemau
` (2 preceding siblings ...)
2025-09-19 12:40 ` [PATCHv2 3/5] mm/rmap: mlock large folios in try_to_unmap_one() Kiryl Shutsemau
@ 2025-09-19 12:40 ` Kiryl Shutsemau
2025-09-19 21:28 ` Shakeel Butt
` (2 more replies)
2025-09-19 12:40 ` [PATCHv2 5/5] mm/rmap: Improve mlock tracking for large folios Kiryl Shutsemau
4 siblings, 3 replies; 17+ messages in thread
From: Kiryl Shutsemau @ 2025-09-19 12:40 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox
Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Shakeel Butt, Baolin Wang, linux-mm,
linux-kernel, Kiryl Shutsemau
From: Kiryl Shutsemau <kas@kernel.org>
The finish_fault() function uses per-page fault for file folios. This
only occurs for file folios smaller than PMD_SIZE.
The comment suggests that this approach prevents RSS inflation.
However, it only prevents RSS accounting. The folio is still mapped to
the process, and the fact that it is mapped by a single PTE does not
affect memory pressure. Additionally, the kernel's ability to map
large folios as PMD if they are large enough does not support this
argument.
When possible, map large folios in one shot. This reduces the number of
minor page faults and allows for TLB coalescing.
Mapping large folios at once will allow the rmap code to mlock it on
add, as it will recognize that it is fully mapped and mlocking is safe.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
mm/memory.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 0ba4f6b71847..812a7d9f6531 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5386,13 +5386,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
nr_pages = folio_nr_pages(folio);
- /*
- * Using per-page fault to maintain the uffd semantics, and same
- * approach also applies to non shmem/tmpfs faults to avoid
- * inflating the RSS of the process.
- */
- if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
- unlikely(needs_fallback)) {
+ /* Using per-page fault to maintain the uffd semantics */
+ if (unlikely(userfaultfd_armed(vma)) || unlikely(needs_fallback)) {
nr_pages = 1;
} else if (nr_pages > 1) {
pgoff_t idx = folio_page_idx(folio, page);
--
2.50.1
* Re: [PATCHv2 4/5] mm/fault: Try to map the entire file folio in finish_fault()
2025-09-19 12:40 ` [PATCHv2 4/5] mm/fault: Try to map the entire file folio in finish_fault() Kiryl Shutsemau
@ 2025-09-19 21:28 ` Shakeel Butt
2025-09-20 7:37 ` Matthew Wilcox
2025-09-22 3:19 ` Baolin Wang
2 siblings, 0 replies; 17+ messages in thread
From: Shakeel Butt @ 2025-09-19 21:28 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Baolin Wang, linux-mm, linux-kernel,
Kiryl Shutsemau
On Fri, Sep 19, 2025 at 01:40:35PM +0100, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> The finish_fault() function uses per-page fault for file folios. This
> only occurs for file folios smaller than PMD_SIZE.
>
> The comment suggests that this approach prevents RSS inflation.
> However, it only prevents RSS accounting. The folio is still mapped to
> the process, and the fact that it is mapped by a single PTE does not
> affect memory pressure. Additionally, the kernel's ability to map
> large folios as PMD if they are large enough does not support this
> argument.
>
> When possible, map large folios in one shot. This reduces the number of
> minor page faults and allows for TLB coalescing.
>
> Mapping large folios at once will allow the rmap code to mlock it on
> add, as it will recognize that it is fully mapped and mlocking is safe.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
* Re: [PATCHv2 4/5] mm/fault: Try to map the entire file folio in finish_fault()
2025-09-19 12:40 ` [PATCHv2 4/5] mm/fault: Try to map the entire file folio in finish_fault() Kiryl Shutsemau
2025-09-19 21:28 ` Shakeel Butt
@ 2025-09-20 7:37 ` Matthew Wilcox
2025-09-22 16:16 ` Kiryl Shutsemau
2025-09-22 3:19 ` Baolin Wang
2 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2025-09-20 7:37 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Shakeel Butt, Baolin Wang, linux-mm,
linux-kernel, Kiryl Shutsemau
On Fri, Sep 19, 2025 at 01:40:35PM +0100, Kiryl Shutsemau wrote:
> The finish_fault() function uses per-page fault for file folios. This
> only occurs for file folios smaller than PMD_SIZE.
>
> The comment suggests that this approach prevents RSS inflation.
> However, it only prevents RSS accounting. The folio is still mapped to
> the process, and the fact that it is mapped by a single PTE does not
> affect memory pressure. Additionally, the kernel's ability to map
> large folios as PMD if they are large enough does not support this
> argument.
>
> When possible, map large folios in one shot. This reduces the number of
> minor page faults and allows for TLB coalescing.
>
> Mapping large folios at once will allow the rmap code to mlock it on
> add, as it will recognize that it is fully mapped and mlocking is safe.
Does this patch have any measurable effect? Almost all folios are
mapped through do_fault_around(). I'm not objecting to the patch,
but the commit message maybe makes this sound more important than it is.
* Re: [PATCHv2 4/5] mm/fault: Try to map the entire file folio in finish_fault()
2025-09-20 7:37 ` Matthew Wilcox
@ 2025-09-22 16:16 ` Kiryl Shutsemau
0 siblings, 0 replies; 17+ messages in thread
From: Kiryl Shutsemau @ 2025-09-22 16:16 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Shakeel Butt, Baolin Wang, linux-mm,
linux-kernel
On Sat, Sep 20, 2025 at 08:37:37AM +0100, Matthew Wilcox wrote:
> On Fri, Sep 19, 2025 at 01:40:35PM +0100, Kiryl Shutsemau wrote:
> > The finish_fault() function uses per-page fault for file folios. This
> > only occurs for file folios smaller than PMD_SIZE.
> >
> > The comment suggests that this approach prevents RSS inflation.
> > However, it only prevents RSS accounting. The folio is still mapped to
> > the process, and the fact that it is mapped by a single PTE does not
> > affect memory pressure. Additionally, the kernel's ability to map
> > large folios as PMD if they are large enough does not support this
> > argument.
> >
> > When possible, map large folios in one shot. This reduces the number of
> > minor page faults and allows for TLB coalescing.
> >
> > Mapping large folios at once will allow the rmap code to mlock it on
> > add, as it will recognize that it is fully mapped and mlocking is safe.
>
> Does this patch have any measurable effect? Almost all folios are
> mapped through do_fault_around(). I'm not objecting to the patch,
> but the commit message maybe makes this sound more important than it is.
You are right. My test cases used write faults to populate the VMA.
Mlock accounting is still broken with faultaround.
I will look into this.
I think we would need to rethink how we handle large folios in the
faultaround.
--
Kiryl Shutsemau / Kirill A. Shutemov
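A sketch of how such a measurement could be driven from userspace (illustrative
only; the file path, the 4 KiB stride, and the overall approach are assumptions,
not the actual test case mentioned above). It populates one mapping of a file
with read accesses, which normally go through the fault-around path, and another
with write stores, which go through finish_fault(), and reports the minor-fault
count for each via getrusage():

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <unistd.h>

static long minflt(void)
{
	struct rusage ru;

	getrusage(RUSAGE_SELF, &ru);
	return ru.ru_minflt;
}

/* Touch one byte per 4 KiB page and return how many minor faults it took. */
static long populate(char *p, size_t len, int do_write)
{
	long before = minflt();
	volatile char sink = 0;

	for (size_t off = 0; off < len; off += 4096) {
		if (do_write)
			p[off] = 1;	/* write fault; note: dirties the file */
		else
			sink = p[off];	/* read fault */
	}
	return minflt() - before;
}

int main(void)
{
	int fd = open("/tmp/testfile", O_RDWR);		/* placeholder path */
	struct stat st;

	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	char *r = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	char *w = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	if (r == MAP_FAILED || w == MAP_FAILED)
		return 1;

	printf("read-populate:  %ld minor faults\n", populate(r, st.st_size, 0));
	printf("write-populate: %ld minor faults\n", populate(w, st.st_size, 1));
	return 0;
}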
* Re: [PATCHv2 4/5] mm/fault: Try to map the entire file folio in finish_fault()
2025-09-19 12:40 ` [PATCHv2 4/5] mm/fault: Try to map the entire file folio in finish_fault() Kiryl Shutsemau
2025-09-19 21:28 ` Shakeel Butt
2025-09-20 7:37 ` Matthew Wilcox
@ 2025-09-22 3:19 ` Baolin Wang
2 siblings, 0 replies; 17+ messages in thread
From: Baolin Wang @ 2025-09-22 3:19 UTC (permalink / raw)
To: Kiryl Shutsemau, Andrew Morton, David Hildenbrand, Hugh Dickins,
Matthew Wilcox
Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Shakeel Butt, linux-mm, linux-kernel,
Kiryl Shutsemau
On 2025/9/19 20:40, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> The finish_fault() function uses per-page fault for file folios. This
> only occurs for file folios smaller than PMD_SIZE.
>
> The comment suggests that this approach prevents RSS inflation.
> However, it only prevents RSS accounting. The folio is still mapped to
> the process, and the fact that it is mapped by a single PTE does not
> affect memory pressure. Additionally, the kernel's ability to map
> large folios as PMD if they are large enough does not support this
> argument.
>
> When possible, map large folios in one shot. This reduces the number of
> minor page faults and allows for TLB coalescing.
>
> Mapping large folios at once will allow the rmap code to mlock it on
> add, as it will recognize that it is fully mapped and mlocking is safe.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
LGTM. Thanks.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> mm/memory.c | 9 ++-------
> 1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0ba4f6b71847..812a7d9f6531 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5386,13 +5386,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>
> nr_pages = folio_nr_pages(folio);
>
> - /*
> - * Using per-page fault to maintain the uffd semantics, and same
> - * approach also applies to non shmem/tmpfs faults to avoid
> - * inflating the RSS of the process.
> - */
> - if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> - unlikely(needs_fallback)) {
> + /* Using per-page fault to maintain the uffd semantics */
> + if (unlikely(userfaultfd_armed(vma)) || unlikely(needs_fallback)) {
> nr_pages = 1;
> } else if (nr_pages > 1) {
> pgoff_t idx = folio_page_idx(folio, page);
* [PATCHv2 5/5] mm/rmap: Improve mlock tracking for large folios
2025-09-19 12:40 [PATCHv2 0/5] mm: Improve mlock tracking for large folios Kiryl Shutsemau
` (3 preceding siblings ...)
2025-09-19 12:40 ` [PATCHv2 4/5] mm/fault: Try to map the entire file folio in finish_fault() Kiryl Shutsemau
@ 2025-09-19 12:40 ` Kiryl Shutsemau
2025-09-22 3:20 ` Baolin Wang
4 siblings, 1 reply; 17+ messages in thread
From: Kiryl Shutsemau @ 2025-09-19 12:40 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox
Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Shakeel Butt, Baolin Wang, linux-mm,
linux-kernel, Kiryl Shutsemau
From: Kiryl Shutsemau <kas@kernel.org>
The kernel currently does not mlock large folios when adding them to
rmap, stating that it is difficult to confirm that the folio is fully
mapped and therefore safe to mlock.
This leads to a significant undercount of Mlocked in /proc/meminfo,
causing problems in production where the stat was used to estimate
system utilization and determine if load shedding is required.
However, nowadays the caller passes the number of pages of the folio
that are getting mapped, making it easy to check if the entire folio is
mapped to the VMA.
mlock the folio on rmap if it is fully mapped to the VMA.
Mlocked in /proc/meminfo can still undercount, but the value is closer
to the truth and is useful for userspace.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/rmap.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 482e6504fa88..6e09956670f4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1462,12 +1462,12 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
}
/*
- * For large folio, only mlock it if it's fully mapped to VMA. It's
- * not easy to check whether the large folio is fully mapped to VMA
- * here. Only mlock normal 4K folio and leave page reclaim to handle
- * large folio.
+ * Only mlock it if the folio is fully mapped to the VMA.
+ *
+ * Partially mapped folios can be split on reclaim and part outside
+ * of mlocked VMA can be evicted or freed.
*/
- if (!folio_test_large(folio))
+ if (folio_nr_pages(folio) == nr_pages)
mlock_vma_folio(folio, vma);
}
@@ -1603,8 +1603,13 @@ static __always_inline void __folio_add_file_rmap(struct folio *folio,
nr = __folio_add_rmap(folio, page, nr_pages, vma, level, &nr_pmdmapped);
__folio_mod_stat(folio, nr, nr_pmdmapped);
- /* See comments in folio_add_anon_rmap_*() */
- if (!folio_test_large(folio))
+ /*
+ * Only mlock it if the folio is fully mapped to the VMA.
+ *
+ * Partially mapped folios can be split on reclaim and part outside
+ * of mlocked VMA can be evicted or freed.
+ */
+ if (folio_nr_pages(folio) == nr_pages)
mlock_vma_folio(folio, vma);
}
--
2.50.1
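For completeness, how userspace typically observes the counter this series is
about: a minimal sketch that maps a file, mlocks the mapping, and prints the
Mlocked: line from /proc/meminfo before and after. Illustrative only; the path
is a placeholder, mlock() may require a sufficient RLIMIT_MEMLOCK, and the
counter is system-wide, so unrelated activity shows up in it as well.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void print_mlocked(const char *tag)
{
	char line[128];
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "Mlocked:", 8)) {
			printf("%s %s", tag, line);
			break;
		}
	}
	fclose(f);
}

int main(void)
{
	int fd = open("/tmp/testfile", O_RDONLY);	/* placeholder path */
	struct stat st;

	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	print_mlocked("before:");
	if (mlock(p, st.st_size))	/* may fail without enough RLIMIT_MEMLOCK */
		perror("mlock");
	print_mlocked("after: ");

	munlock(p, st.st_size);
	munmap(p, st.st_size);
	close(fd);
	return 0;
}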
* Re: [PATCHv2 5/5] mm/rmap: Improve mlock tracking for large folios
2025-09-19 12:40 ` [PATCHv2 5/5] mm/rmap: Improve mlock tracking for large folios Kiryl Shutsemau
@ 2025-09-22 3:20 ` Baolin Wang
0 siblings, 0 replies; 17+ messages in thread
From: Baolin Wang @ 2025-09-22 3:20 UTC (permalink / raw)
To: Kiryl Shutsemau, Andrew Morton, David Hildenbrand, Hugh Dickins,
Matthew Wilcox
Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
Johannes Weiner, Shakeel Butt, linux-mm, linux-kernel,
Kiryl Shutsemau
On 2025/9/19 20:40, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> The kernel currently does not mlock large folios when adding them to
> rmap, stating that it is difficult to confirm that the folio is fully
> mapped and therefore safe to mlock.
>
> This leads to a significant undercount of Mlocked in /proc/meminfo,
> causing problems in production where the stat was used to estimate
> system utilization and determine if load shedding is required.
>
> However, nowadays the caller passes the number of pages of the folio
> that are getting mapped, making it easy to check if the entire folio is
> mapped to the VMA.
>
> mlock the folio on rmap if it is fully mapped to the VMA.
>
> Mlocked in /proc/meminfo can still undercount, but the value is closer
> to the truth and is useful for userspace.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> mm/rmap.c | 19 ++++++++++++-------
> 1 file changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 482e6504fa88..6e09956670f4 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1462,12 +1462,12 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
> }
>
> /*
> - * For large folio, only mlock it if it's fully mapped to VMA. It's
> - * not easy to check whether the large folio is fully mapped to VMA
> - * here. Only mlock normal 4K folio and leave page reclaim to handle
> - * large folio.
> + * Only mlock it if the folio is fully mapped to the VMA.
> + *
> + * Partially mapped folios can be split on reclaim and part outside
> + * of mlocked VMA can be evicted or freed.
> */
> - if (!folio_test_large(folio))
> + if (folio_nr_pages(folio) == nr_pages)
> mlock_vma_folio(folio, vma);
> }
>
> @@ -1603,8 +1603,13 @@ static __always_inline void __folio_add_file_rmap(struct folio *folio,
> nr = __folio_add_rmap(folio, page, nr_pages, vma, level, &nr_pmdmapped);
> __folio_mod_stat(folio, nr, nr_pmdmapped);
>
> - /* See comments in folio_add_anon_rmap_*() */
> - if (!folio_test_large(folio))
> + /*
> + * Only mlock it if the folio is fully mapped to the VMA.
> + *
> + * Partially mapped folios can be split on reclaim and part outside
> + * of mlocked VMA can be evicted or freed.
> + */
> + if (folio_nr_pages(folio) == nr_pages)
> mlock_vma_folio(folio, vma);
> }
>