* [PATCH v3 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
@ 2025-04-10 0:00 SeongJae Park
2025-04-10 0:00 ` [PATCH v3 1/4] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() SeongJae Park
` (3 more replies)
0 siblings, 4 replies; 7+ messages in thread
From: SeongJae Park @ 2025-04-10 0:00 UTC (permalink / raw)
To: Andrew Morton
Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
Lorenzo Stoakes, Rik van Riel, Shakeel Butt, Vlastimil Babka,
kernel-team, linux-kernel, linux-mm
When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or
MADV_FREE with multiple address ranges, tlb flushes happen for each of
the given address ranges. Because such tlb flushes are for the same
process, doing those in a batch is more efficient while still being
safe. Modify the process_madvise() entry level code path to do such
batched tlb flushes, while the internal unmap logic only gathers the
tlb entries to flush.
In more detail, modify the entry functions to initialize an mmu_gather
object and pass it to the internal logic, and make the internal logic
only gather the tlb entries to flush into the received mmu_gather
object. After all internal function calls are done, the entry functions
flush the gathered tlb entries at once.
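Conceptually, the resulting entry path looks like the simplified sketch
below. It is only an illustration of the pattern; the real code wraps
the mmu_gather handling in helpers (madvise_init_tlb() and
madvise_finish_tlb(), added by the second patch) and applies it only to
the tlb-batching-capable behaviors, as the patches below show.

        struct mmu_gather tlb;
        struct madvise_behavior madv_behavior = {
                .behavior = behavior,
                .tlb = &tlb,
        };

        tlb_gather_mmu(&tlb, mm);       /* initialize once per user request */
        /* internal logic only gathers tlb entries into madv_behavior.tlb */
        madvise_do_behavior(mm, start, len_in, &madv_behavior);
        tlb_finish_mmu(&tlb);           /* flush the gathered entries at once */
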
Because process_madvise() and madvise() share the internal unmap logic,
make the same change to the madvise() entry code as well, to keep the
code consistent and cleaner. It is only for keeping the code clean, and
shouldn't degrade madvise(). It could rather provide a potential tlb
flush reduction benefit for the case where there are multiple vmas for
the given address range. That is only a side effect of the effort to
keep the code clean, so we don't measure it separately.
Similar optimizations might be applicable to other madvise behaviors such
as MADV_COLD and MADV_PAGEOUT. Those are simply out of the scope of
this patch series, though.
Patches Sequence
================
The first patch defines a new data structure for managing the
information that is required for batched tlb flushes (mmu_gather and
behavior), and updates the code paths of the MADV_DONTNEED[_LOCKED] and
MADV_FREE handling internal logic to receive it.
The second patch batches tlb flushes for MADV_FREE handling for both
madvise() and process_madvise().
The remaining two patches are for MADV_DONTNEED[_LOCKED] tlb flush
batching. The third patch splits zap_page_range_single() for batching
of MADV_DONTNEED[_LOCKED] handling. The fourth patch batches the tlb
flushes for the hint using the sub-logic that the third patch split
out, and the helpers for batched tlb flushes that were introduced for
the MADV_FREE case by the second patch.
Test Results
============
I measured the latency to apply MADV_DONTNEED advice to 256 MiB of
memory using multiple process_madvise() calls. I apply the advice at
4 KiB region granularity, but with a varying batch size per
process_madvise() call (vlen), from 1 to 1024. The source code for the
measurement is available on GitHub[1]. To reduce measurement errors, I
did the measurement five times.
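For reference, the core of the measurement is conceptually similar to
the sketch below. It is only an illustrative sketch (the function name
and structure are hypothetical, and error handling is omitted); the
actual benchmark is the code at [1].

        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <sys/uio.h>
        #include <unistd.h>

        /*
         * Apply MADV_DONTNEED to 'total' bytes at 'base' of the target
         * process, 'region' (4 KiB here) bytes per iovec entry, and
         * 'vlen' entries per process_madvise() call.
         */
        static void apply_dontneed(int pidfd, char *base, size_t total,
                                   size_t region, size_t vlen)
        {
                struct iovec iov[1024]; /* vlen is at most 1024 in the test */
                size_t nr, done = 0;

                while (done < total) {
                        for (nr = 0; nr < vlen && done < total;
                                        nr++, done += region) {
                                iov[nr].iov_base = base + done;
                                iov[nr].iov_len = region;
                        }
                        /* after this series: one batched tlb flush per call */
                        syscall(__NR_process_madvise, pidfd, iov, nr,
                                        MADV_DONTNEED, 0);
                }
        }
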
The measurement results are shown below. The 'sz_batch' column shows
the batch size of the process_madvise() calls. The 'Before' and 'After'
columns show the averages of the latencies in nanoseconds, measured
five times on kernels built without and with the tlb flush batching of
this series (patches 3 and 4), respectively. For the baseline, the
mm-new tree of 2025-04-09[2] has been used, after reverting the second
version of this patch series and adding a temporary fix for the
!CONFIG_DEBUG_VM build failure[3]. The 'B-stdev' and 'A-stdev' columns
show the ratio of the standard deviation of the latency measurements to
their average, in percent, for 'Before' and 'After', respectively.
'Latency_reduction' shows the reduction in latency that 'After'
achieves compared to 'Before', in percent. Higher 'Latency_reduction'
values mean larger efficiency improvements.
sz_batch    Before         B-stdev    After          A-stdev    Latency_reduction
1           146386348      2.78       111327360.6    3.13       23.95
2           108222130      1.54       72131173.6     2.39       33.35
4           93617846.8     2.76       51859294.4     2.50       44.61
8           80555150.4     2.38       44328790       1.58       44.97
16          77272777       1.62       37489433.2     1.16       51.48
32          76478465.2     2.75       33570506       3.48       56.10
64          75810266.6     1.15       27037652.6     1.61       64.34
128         73222748       3.86       25517629.4     3.30       65.15
256         72534970.8     2.31       25002180.4     0.94       65.53
512         71809392       5.12       24152285.4     2.41       66.37
1024        73281170.2     4.53       24183615       2.09       67.00
Unexpectedly, the latency is reduced (improved) even with a batch size
of one. I think some compiler optimizations have affected that, as was
also observed with the first version of this patch series. So, please
focus on the relation between the improvement and the batch size. As
expected, tlb flush batching provides a latency reduction that is
proportional to the batch size. The efficiency gain ranges from about
33 percent with batch size 2 up to 67 percent with batch size 1,024.
Please note that this is a very simple microbenchmark, so the real
efficiency gain on real workloads could be very different.
Changelog
=========
Changes from v2
(https://lore.kernel.org/20250404210700.2156-1-sj@kernel.org)
- Fix typos on cover letter
- Rename madvise_behavior pointers to madv_behavior
- Rename notify_unmap_single_vma() to zap_page_range_single_batched()
- Add a sanity check of tlb parameter to zap_page_range_single_batched()
- Add missed full stop of a comment
- Add details to MADV_DONTNEED tlb flush batching commit message
- Collect Reviewed-by: from Lorenzo for the second patch
Changes from v1
(https://lore.kernel.org/20250310172318.653630-1-sj@kernel.org)
- Split code cleanup part out
- Keep the order between tlb flushes and hugetlb_zap_end()
- Put mm/memory change just before its real usage
- Add VM_WARN_ON_ONCE() for invalid tlb argument to unmap_vma_single()
- Cleanups following nice reviewers suggestions
Changes from RFC
(https://lore.kernel.org/20250305181611.54484-1-sj@kernel.org)
- Clarify motivation of the change on the cover letter
- Add average and stdev of evaluation results
- Show latency reduction on evaluation results
- Fix !CONFIG_MEMORY_FAILURE build error
- Rename is_memory_populate() to is_madvise_populate()
- Squash patches 5-8
- Add kerneldoc for unmap_vm_area_struct()
- Squash patches 10 and 11
- Squash patches 12-14
- Squash patches 15 and 16
References
==========
[1] https://github.com/sjp38/eval_proc_madvise
[2] commit 3923d30a2d51 ("mm-mempolicy-support-memory-hotplug-in-weighted-interleave-checkpatch-fixes") # mm-new
[3] https://lore.kernel.org/20250409165452.305371-1-sj@kernel.org
SeongJae Park (4):
mm/madvise: define and use madvise_behavior struct for
madvise_do_behavior()
mm/madvise: batch tlb flushes for MADV_FREE
mm/memory: split non-tlb flushing part from zap_page_range_single()
mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED]
mm/internal.h | 3 ++
mm/madvise.c | 101 ++++++++++++++++++++++++++++++++++++++------------
mm/memory.c | 47 ++++++++++++++++++-----
3 files changed, 118 insertions(+), 33 deletions(-)
base-commit: 5d1eb3ed3b3aee67f6d1bda64ef710bfcf52f342
--
2.39.5
* [PATCH v3 1/4] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior()
2025-04-10 0:00 [PATCH v3 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
@ 2025-04-10 0:00 ` SeongJae Park
2025-04-10 0:00 ` [PATCH v3 2/4] mm/madvise: batch tlb flushes for MADV_FREE SeongJae Park
` (2 subsequent siblings)
3 siblings, 0 replies; 7+ messages in thread
From: SeongJae Park @ 2025-04-10 0:00 UTC (permalink / raw)
To: Andrew Morton
Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
Lorenzo Stoakes, Rik van Riel, Shakeel Butt, Vlastimil Babka,
kernel-team, linux-kernel, linux-mm
To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and
MADV_FREE, an mmu_gather object needs to be passed to the internal
logic in addition to the behavior integer. Using a struct makes that
easy without increasing the number of parameters of all the code paths
towards the internal logic. Define a struct for the purpose and use it
on the code path that starts from madvise_do_behavior() and ends at
madvise_dontneed_free(). Note that this changes the madvise_walk_vmas()
visitor type signature, too. Specifically, it changes its 'arg' type
from 'unsigned long' to the new struct pointer.
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 37 +++++++++++++++++++++++++------------
1 file changed, 25 insertions(+), 12 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index b17f684322ad..26fa868b41af 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -48,6 +48,11 @@ struct madvise_walk_private {
bool pageout;
};
+struct madvise_behavior {
+ int behavior;
+ struct mmu_gather *tlb;
+};
+
/*
* Any behaviour which results in changes to the vma->vm_flags needs to
* take mmap_lock for writing. Others, which simply traverse vmas, need
@@ -893,8 +898,9 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
static long madvise_dontneed_free(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end,
- int behavior)
+ struct madvise_behavior *madv_behavior)
{
+ int behavior = madv_behavior->behavior;
struct mm_struct *mm = vma->vm_mm;
*prev = vma;
@@ -1249,8 +1255,10 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
static int madvise_vma_behavior(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end,
- unsigned long behavior)
+ void *behavior_arg)
{
+ struct madvise_behavior *arg = behavior_arg;
+ int behavior = arg->behavior;
int error;
struct anon_vma_name *anon_name;
unsigned long new_flags = vma->vm_flags;
@@ -1270,7 +1278,7 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
case MADV_FREE:
case MADV_DONTNEED:
case MADV_DONTNEED_LOCKED:
- return madvise_dontneed_free(vma, prev, start, end, behavior);
+ return madvise_dontneed_free(vma, prev, start, end, arg);
case MADV_NORMAL:
new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
break;
@@ -1487,10 +1495,10 @@ static bool process_madvise_remote_valid(int behavior)
*/
static
int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
- unsigned long end, unsigned long arg,
+ unsigned long end, void *arg,
int (*visit)(struct vm_area_struct *vma,
struct vm_area_struct **prev, unsigned long start,
- unsigned long end, unsigned long arg))
+ unsigned long end, void *arg))
{
struct vm_area_struct *vma;
struct vm_area_struct *prev;
@@ -1548,7 +1556,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
static int madvise_vma_anon_name(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end,
- unsigned long anon_name)
+ void *anon_name)
{
int error;
@@ -1557,7 +1565,7 @@ static int madvise_vma_anon_name(struct vm_area_struct *vma,
return -EBADF;
error = madvise_update_vma(vma, prev, start, end, vma->vm_flags,
- (struct anon_vma_name *)anon_name);
+ anon_name);
/*
* madvise() returns EAGAIN if kernel resources, such as
@@ -1589,7 +1597,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
if (end == start)
return 0;
- return madvise_walk_vmas(mm, start, end, (unsigned long)anon_name,
+ return madvise_walk_vmas(mm, start, end, anon_name,
madvise_vma_anon_name);
}
#endif /* CONFIG_ANON_VMA_NAME */
@@ -1677,8 +1685,10 @@ static bool is_madvise_populate(int behavior)
}
static int madvise_do_behavior(struct mm_struct *mm,
- unsigned long start, size_t len_in, int behavior)
+ unsigned long start, size_t len_in,
+ struct madvise_behavior *madv_behavior)
{
+ int behavior = madv_behavior->behavior;
struct blk_plug plug;
unsigned long end;
int error;
@@ -1692,7 +1702,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
if (is_madvise_populate(behavior))
error = madvise_populate(mm, start, end, behavior);
else
- error = madvise_walk_vmas(mm, start, end, behavior,
+ error = madvise_walk_vmas(mm, start, end, madv_behavior,
madvise_vma_behavior);
blk_finish_plug(&plug);
return error;
@@ -1773,13 +1783,14 @@ static int madvise_do_behavior(struct mm_struct *mm,
int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
{
int error;
+ struct madvise_behavior madv_behavior = {.behavior = behavior};
if (madvise_should_skip(start, len_in, behavior, &error))
return error;
error = madvise_lock(mm, behavior);
if (error)
return error;
- error = madvise_do_behavior(mm, start, len_in, behavior);
+ error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
madvise_unlock(mm, behavior);
return error;
@@ -1796,6 +1807,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
{
ssize_t ret = 0;
size_t total_len;
+ struct madvise_behavior madv_behavior = {.behavior = behavior};
total_len = iov_iter_count(iter);
@@ -1811,7 +1823,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
if (madvise_should_skip(start, len_in, behavior, &error))
ret = error;
else
- ret = madvise_do_behavior(mm, start, len_in, behavior);
+ ret = madvise_do_behavior(mm, start, len_in,
+ &madv_behavior);
/*
* An madvise operation is attempting to restart the syscall,
* but we cannot proceed as it would not be correct to repeat
--
2.39.5
* [PATCH v3 2/4] mm/madvise: batch tlb flushes for MADV_FREE
2025-04-10 0:00 [PATCH v3 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
2025-04-10 0:00 ` [PATCH v3 1/4] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() SeongJae Park
@ 2025-04-10 0:00 ` SeongJae Park
2025-04-10 0:00 ` [PATCH v3 3/4] mm/memory: split non-tlb flushing part from zap_page_range_single() SeongJae Park
2025-04-10 0:00 ` [PATCH v3 4/4] mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED] SeongJae Park
3 siblings, 0 replies; 7+ messages in thread
From: SeongJae Park @ 2025-04-10 0:00 UTC (permalink / raw)
To: Andrew Morton
Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
Lorenzo Stoakes, Rik van Riel, Shakeel Butt, Vlastimil Babka,
kernel-team, linux-kernel, linux-mm
MADV_FREE handling for [process_]madvise() flushes the tlb for each vma
of each address range. Update the logic to do the tlb flushes in a
batched way. Initialize an mmu_gather object from do_madvise() and
vector_madvise(), which are the entry level functions for madvise() and
process_madvise(), respectively, and pass those objects to the function
for the per-vma work via the madvise_behavior struct. Make the per-vma
logic not flush the tlb on its own but just save the tlb entries to
flush into the received mmu_gather object. Finally, the entry level
functions flush the tlb entries that were gathered for the entire user
request, at once.
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 57 ++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 46 insertions(+), 11 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 26fa868b41af..951038a9f36f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -799,12 +799,13 @@ static const struct mm_walk_ops madvise_free_walk_ops = {
.walk_lock = PGWALK_RDLOCK,
};
-static int madvise_free_single_vma(struct vm_area_struct *vma,
+static int madvise_free_single_vma(struct madvise_behavior *madv_behavior,
+ struct vm_area_struct *vma,
unsigned long start_addr, unsigned long end_addr)
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_notifier_range range;
- struct mmu_gather tlb;
+ struct mmu_gather *tlb = madv_behavior->tlb;
/* MADV_FREE works for only anon vma at the moment */
if (!vma_is_anonymous(vma))
@@ -820,17 +821,14 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
range.start, range.end);
lru_add_drain();
- tlb_gather_mmu(&tlb, mm);
update_hiwater_rss(mm);
mmu_notifier_invalidate_range_start(&range);
- tlb_start_vma(&tlb, vma);
+ tlb_start_vma(tlb, vma);
walk_page_range(vma->vm_mm, range.start, range.end,
- &madvise_free_walk_ops, &tlb);
- tlb_end_vma(&tlb, vma);
+ &madvise_free_walk_ops, tlb);
+ tlb_end_vma(tlb, vma);
mmu_notifier_invalidate_range_end(&range);
- tlb_finish_mmu(&tlb);
-
return 0;
}
@@ -954,7 +952,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
return madvise_dontneed_single_vma(vma, start, end);
else if (behavior == MADV_FREE)
- return madvise_free_single_vma(vma, start, end);
+ return madvise_free_single_vma(madv_behavior, vma, start, end);
else
return -EINVAL;
}
@@ -1627,6 +1625,29 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
mmap_read_unlock(mm);
}
+static bool madvise_batch_tlb_flush(int behavior)
+{
+ switch (behavior) {
+ case MADV_FREE:
+ return true;
+ default:
+ return false;
+ }
+}
+
+static void madvise_init_tlb(struct madvise_behavior *madv_behavior,
+ struct mm_struct *mm)
+{
+ if (madvise_batch_tlb_flush(madv_behavior->behavior))
+ tlb_gather_mmu(madv_behavior->tlb, mm);
+}
+
+static void madvise_finish_tlb(struct madvise_behavior *madv_behavior)
+{
+ if (madvise_batch_tlb_flush(madv_behavior->behavior))
+ tlb_finish_mmu(madv_behavior->tlb);
+}
+
static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
{
size_t len;
@@ -1783,14 +1804,20 @@ static int madvise_do_behavior(struct mm_struct *mm,
int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
{
int error;
- struct madvise_behavior madv_behavior = {.behavior = behavior};
+ struct mmu_gather tlb;
+ struct madvise_behavior madv_behavior = {
+ .behavior = behavior,
+ .tlb = &tlb,
+ };
if (madvise_should_skip(start, len_in, behavior, &error))
return error;
error = madvise_lock(mm, behavior);
if (error)
return error;
+ madvise_init_tlb(&madv_behavior, mm);
error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
+ madvise_finish_tlb(&madv_behavior);
madvise_unlock(mm, behavior);
return error;
@@ -1807,13 +1834,18 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
{
ssize_t ret = 0;
size_t total_len;
- struct madvise_behavior madv_behavior = {.behavior = behavior};
+ struct mmu_gather tlb;
+ struct madvise_behavior madv_behavior = {
+ .behavior = behavior,
+ .tlb = &tlb,
+ };
total_len = iov_iter_count(iter);
ret = madvise_lock(mm, behavior);
if (ret)
return ret;
+ madvise_init_tlb(&madv_behavior, mm);
while (iov_iter_count(iter)) {
unsigned long start = (unsigned long)iter_iov_addr(iter);
@@ -1842,14 +1874,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
}
/* Drop and reacquire lock to unwind race. */
+ madvise_finish_tlb(&madv_behavior);
madvise_unlock(mm, behavior);
madvise_lock(mm, behavior);
+ madvise_init_tlb(&madv_behavior, mm);
continue;
}
if (ret < 0)
break;
iov_iter_advance(iter, iter_iov_len(iter));
}
+ madvise_finish_tlb(&madv_behavior);
madvise_unlock(mm, behavior);
ret = (total_len - iov_iter_count(iter)) ? : ret;
--
2.39.5
* [PATCH v3 3/4] mm/memory: split non-tlb flushing part from zap_page_range_single()
2025-04-10 0:00 [PATCH v3 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
2025-04-10 0:00 ` [PATCH v3 1/4] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() SeongJae Park
2025-04-10 0:00 ` [PATCH v3 2/4] mm/madvise: batch tlb flushes for MADV_FREE SeongJae Park
@ 2025-04-10 0:00 ` SeongJae Park
2025-04-11 13:08 ` Lorenzo Stoakes
2025-04-10 0:00 ` [PATCH v3 4/4] mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED] SeongJae Park
3 siblings, 1 reply; 7+ messages in thread
From: SeongJae Park @ 2025-04-10 0:00 UTC (permalink / raw)
To: Andrew Morton
Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
Lorenzo Stoakes, Rik van Riel, Shakeel Butt, Vlastimil Babka,
kernel-team, linux-kernel, linux-mm
Some zap_page_range_single() callers, such as [process_]madvise() with
MADV_DONTNEED[_LOCKED], cannot batch tlb flushes because
zap_page_range_single() flushes the tlb for each invocation. Split out
the body of zap_page_range_single(), except for the mmu_gather object
initialization and the flushing of the gathered tlb entries, for such
batched tlb flushing usage.
To avoid hugetlb page allocation failures from concurrent page faults,
though, the tlb flush should be done before unlocking hugetlb faults.
Do the flush and the unlock in that order inside the split-out function
for the hugetlb vma case. Refer to commit 2820b0f09be9 ("hugetlbfs:
close race between MADV_DONTNEED and page fault") for more details
about the concurrent faults' page allocation failure problem.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/memory.c | 49 +++++++++++++++++++++++++++++++++++++++----------
1 file changed, 39 insertions(+), 10 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index fda6d6429a27..690695643dfb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1998,36 +1998,65 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
mmu_notifier_invalidate_range_end(&range);
}
-/**
- * zap_page_range_single - remove user pages in a given range
+/*
+ * zap_page_range_single_batched - remove user pages in a given range
+ * @tlb: pointer to the caller's struct mmu_gather
* @vma: vm_area_struct holding the applicable pages
- * @address: starting address of pages to zap
- * @size: number of bytes to zap
+ * @address: starting address of pages to remove
+ * @size: number of bytes to remove
* @details: details of shared cache invalidation
*
- * The range must fit into one VMA.
+ * @tlb shouldn't be NULL. The range must fit into one VMA. If @vma is for
+ * hugetlb, @tlb is flushed and re-initialized by this function.
*/
-void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+static void zap_page_range_single_batched(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *details)
{
const unsigned long end = address + size;
struct mmu_notifier_range range;
- struct mmu_gather tlb;
+
+ VM_WARN_ON_ONCE(!tlb || tlb->mm != vma->vm_mm);
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
address, end);
hugetlb_zap_begin(vma, &range.start, &range.end);
- tlb_gather_mmu(&tlb, vma->vm_mm);
update_hiwater_rss(vma->vm_mm);
mmu_notifier_invalidate_range_start(&range);
/*
* unmap 'address-end' not 'range.start-range.end' as range
* could have been expanded for hugetlb pmd sharing.
*/
- unmap_single_vma(&tlb, vma, address, end, details, false);
+ unmap_single_vma(tlb, vma, address, end, details, false);
mmu_notifier_invalidate_range_end(&range);
+ if (is_vm_hugetlb_page(vma)) {
+ /*
+ * flush tlb and free resources before hugetlb_zap_end(), to
+ * avoid concurrent page faults' allocation failure.
+ */
+ tlb_finish_mmu(tlb);
+ hugetlb_zap_end(vma, details);
+ tlb_gather_mmu(tlb, vma->vm_mm);
+ }
+}
+
+/**
+ * zap_page_range_single - remove user pages in a given range
+ * @vma: vm_area_struct holding the applicable pages
+ * @address: starting address of pages to zap
+ * @size: number of bytes to zap
+ * @details: details of shared cache invalidation
+ *
+ * The range must fit into one VMA.
+ */
+void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+ unsigned long size, struct zap_details *details)
+{
+ struct mmu_gather tlb;
+
+ tlb_gather_mmu(&tlb, vma->vm_mm);
+ zap_page_range_single_batched(&tlb, vma, address, size, details);
tlb_finish_mmu(&tlb);
- hugetlb_zap_end(vma, details);
}
/**
--
2.39.5
* [PATCH v3 4/4] mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED]
2025-04-10 0:00 [PATCH v3 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (2 preceding siblings ...)
2025-04-10 0:00 ` [PATCH v3 3/4] mm/memory: split non-tlb flushing part from zap_page_range_single() SeongJae Park
@ 2025-04-10 0:00 ` SeongJae Park
2025-04-11 13:10 ` Lorenzo Stoakes
3 siblings, 1 reply; 7+ messages in thread
From: SeongJae Park @ 2025-04-10 0:00 UTC (permalink / raw)
To: Andrew Morton
Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
Lorenzo Stoakes, Rik van Riel, Shakeel Butt, Vlastimil Babka,
kernel-team, linux-kernel, linux-mm
MADV_DONTNEED[_LOCKED] handling for [process_]madvise() flushes the tlb
for each vma of each address range. Update the logic to do the tlb
flushes in a batched way. Initialize an mmu_gather object from
do_madvise() and vector_madvise(), which are the entry level functions
for madvise() and process_madvise(), respectively, and pass those
objects to the function for the per-vma work via the madvise_behavior
struct. Make the per-vma logic not flush the tlb on its own but just
save the tlb entries to flush into the received mmu_gather object. For
this internal logic change, make zap_page_range_single_batched()
non-static and use it directly from madvise_dontneed_single_vma().
Finally, the entry level functions flush the tlb entries that were
gathered for the entire user request, at once.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/internal.h | 3 +++
mm/madvise.c | 11 ++++++++---
mm/memory.c | 4 ++--
3 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index ef92e88738fe..c5f9dd007215 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -435,6 +435,9 @@ void unmap_page_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
struct zap_details *details);
+void zap_page_range_single_batched(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long addr,
+ unsigned long size, struct zap_details *details);
int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
gfp_t gfp);
diff --git a/mm/madvise.c b/mm/madvise.c
index 951038a9f36f..8433ac9b27e0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -851,7 +851,8 @@ static int madvise_free_single_vma(struct madvise_behavior *madv_behavior,
* An interface that causes the system to free clean pages and flush
* dirty pages is already available as msync(MS_INVALIDATE).
*/
-static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
+static long madvise_dontneed_single_vma(struct madvise_behavior *madv_behavior,
+ struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
struct zap_details details = {
@@ -859,7 +860,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
.even_cows = true,
};
- zap_page_range_single(vma, start, end - start, &details);
+ zap_page_range_single_batched(
+ madv_behavior->tlb, vma, start, end - start, &details);
return 0;
}
@@ -950,7 +952,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
}
if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
- return madvise_dontneed_single_vma(vma, start, end);
+ return madvise_dontneed_single_vma(
+ madv_behavior, vma, start, end);
else if (behavior == MADV_FREE)
return madvise_free_single_vma(madv_behavior, vma, start, end);
else
@@ -1628,6 +1631,8 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
static bool madvise_batch_tlb_flush(int behavior)
{
switch (behavior) {
+ case MADV_DONTNEED:
+ case MADV_DONTNEED_LOCKED:
case MADV_FREE:
return true;
default:
diff --git a/mm/memory.c b/mm/memory.c
index 690695643dfb..559f3e194438 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1998,7 +1998,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
mmu_notifier_invalidate_range_end(&range);
}
-/*
+/**
* zap_page_range_single_batched - remove user pages in a given range
* @tlb: pointer to the caller's struct mmu_gather
* @vma: vm_area_struct holding the applicable pages
@@ -2009,7 +2009,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
* @tlb shouldn't be NULL. The range must fit into one VMA. If @vma is for
* hugetlb, @tlb is flushed and re-initialized by this function.
*/
-static void zap_page_range_single_batched(struct mmu_gather *tlb,
+void zap_page_range_single_batched(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *details)
{
--
2.39.5
* Re: [PATCH v3 3/4] mm/memory: split non-tlb flushing part from zap_page_range_single()
2025-04-10 0:00 ` [PATCH v3 3/4] mm/memory: split non-tlb flushing part from zap_page_range_single() SeongJae Park
@ 2025-04-11 13:08 ` Lorenzo Stoakes
0 siblings, 0 replies; 7+ messages in thread
From: Lorenzo Stoakes @ 2025-04-11 13:08 UTC (permalink / raw)
To: SeongJae Park
Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand, Rik van Riel,
Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, Apr 09, 2025 at 05:00:21PM -0700, SeongJae Park wrote:
> Some zap_page_range_single() callers, such as [process_]madvise() with
> MADV_DONTNEED[_LOCKED], cannot batch tlb flushes because
> zap_page_range_single() flushes the tlb for each invocation. Split out
> the body of zap_page_range_single(), except for the mmu_gather object
> initialization and the flushing of the gathered tlb entries, for such
> batched tlb flushing usage.
>
> To avoid hugetlb page allocation failures from concurrent page faults,
> though, the tlb flush should be done before unlocking hugetlb faults.
> Do the flush and the unlock in that order inside the split-out function
> for the hugetlb vma case. Refer to commit 2820b0f09be9 ("hugetlbfs:
> close race between MADV_DONTNEED and page fault") for more details
> about the concurrent faults' page allocation failure problem.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
LGTM, thanks!
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/memory.c | 49 +++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 39 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index fda6d6429a27..690695643dfb 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1998,36 +1998,65 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> mmu_notifier_invalidate_range_end(&range);
> }
>
> -/**
> - * zap_page_range_single - remove user pages in a given range
> +/*
> + * zap_page_range_single_batched - remove user pages in a given range
> + * @tlb: pointer to the caller's struct mmu_gather
> * @vma: vm_area_struct holding the applicable pages
> - * @address: starting address of pages to zap
> - * @size: number of bytes to zap
> + * @address: starting address of pages to remove
> + * @size: number of bytes to remove
> * @details: details of shared cache invalidation
> *
> - * The range must fit into one VMA.
> + * @tlb shouldn't be NULL. The range must fit into one VMA. If @vma is for
> + * hugetlb, @tlb is flushed and re-initialized by this function.
> */
> -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> +static void zap_page_range_single_batched(struct mmu_gather *tlb,
> + struct vm_area_struct *vma, unsigned long address,
> unsigned long size, struct zap_details *details)
> {
> const unsigned long end = address + size;
> struct mmu_notifier_range range;
> - struct mmu_gather tlb;
> +
> + VM_WARN_ON_ONCE(!tlb || tlb->mm != vma->vm_mm);
>
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> address, end);
> hugetlb_zap_begin(vma, &range.start, &range.end);
> - tlb_gather_mmu(&tlb, vma->vm_mm);
> update_hiwater_rss(vma->vm_mm);
> mmu_notifier_invalidate_range_start(&range);
> /*
> * unmap 'address-end' not 'range.start-range.end' as range
> * could have been expanded for hugetlb pmd sharing.
> */
> - unmap_single_vma(&tlb, vma, address, end, details, false);
> + unmap_single_vma(tlb, vma, address, end, details, false);
> mmu_notifier_invalidate_range_end(&range);
> + if (is_vm_hugetlb_page(vma)) {
> + /*
> + * flush tlb and free resources before hugetlb_zap_end(), to
> + * avoid concurrent page faults' allocation failure.
> + */
> + tlb_finish_mmu(tlb);
> + hugetlb_zap_end(vma, details);
> + tlb_gather_mmu(tlb, vma->vm_mm);
> + }
> +}
> +
> +/**
> + * zap_page_range_single - remove user pages in a given range
> + * @vma: vm_area_struct holding the applicable pages
> + * @address: starting address of pages to zap
> + * @size: number of bytes to zap
> + * @details: details of shared cache invalidation
> + *
> + * The range must fit into one VMA.
> + */
> +void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> + unsigned long size, struct zap_details *details)
> +{
> + struct mmu_gather tlb;
> +
> + tlb_gather_mmu(&tlb, vma->vm_mm);
> + zap_page_range_single_batched(&tlb, vma, address, size, details);
> tlb_finish_mmu(&tlb);
> - hugetlb_zap_end(vma, details);
> }
>
> /**
> --
> 2.39.5
* Re: [PATCH v3 4/4] mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED]
2025-04-10 0:00 ` [PATCH v3 4/4] mm/madvise: batch tlb flushes for MADV_DONTNEED[_LOCKED] SeongJae Park
@ 2025-04-11 13:10 ` Lorenzo Stoakes
0 siblings, 0 replies; 7+ messages in thread
From: Lorenzo Stoakes @ 2025-04-11 13:10 UTC (permalink / raw)
To: SeongJae Park
Cc: Andrew Morton, Liam R.Howlett, David Hildenbrand, Rik van Riel,
Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, Apr 09, 2025 at 05:00:22PM -0700, SeongJae Park wrote:
> MADV_DONTNEED[_LOCKED] handling for [process_]madvise() flushes the tlb
> for each vma of each address range. Update the logic to do the tlb
> flushes in a batched way. Initialize an mmu_gather object from
> do_madvise() and vector_madvise(), which are the entry level functions
> for madvise() and process_madvise(), respectively, and pass those
> objects to the function for the per-vma work via the madvise_behavior
> struct. Make the per-vma logic not flush the tlb on its own but just
> save the tlb entries to flush into the received mmu_gather object. For
> this internal logic change, make zap_page_range_single_batched()
> non-static and use it directly from madvise_dontneed_single_vma().
> Finally, the entry level functions flush the tlb entries that were
> gathered for the entire user request, at once.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
Thanks, as usual always a pleasure to review your series :) Cheers for
these changes!
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/internal.h | 3 +++
> mm/madvise.c | 11 ++++++++---
> mm/memory.c | 4 ++--
> 3 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index ef92e88738fe..c5f9dd007215 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -435,6 +435,9 @@ void unmap_page_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma,
> unsigned long addr, unsigned long end,
> struct zap_details *details);
> +void zap_page_range_single_batched(struct mmu_gather *tlb,
> + struct vm_area_struct *vma, unsigned long addr,
> + unsigned long size, struct zap_details *details);
> int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
> gfp_t gfp);
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 951038a9f36f..8433ac9b27e0 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -851,7 +851,8 @@ static int madvise_free_single_vma(struct madvise_behavior *madv_behavior,
> * An interface that causes the system to free clean pages and flush
> * dirty pages is already available as msync(MS_INVALIDATE).
> */
> -static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> +static long madvise_dontneed_single_vma(struct madvise_behavior *madv_behavior,
> + struct vm_area_struct *vma,
> unsigned long start, unsigned long end)
> {
> struct zap_details details = {
> @@ -859,7 +860,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> .even_cows = true,
> };
>
> - zap_page_range_single(vma, start, end - start, &details);
> + zap_page_range_single_batched(
> + madv_behavior->tlb, vma, start, end - start, &details);
> return 0;
> }
>
> @@ -950,7 +952,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> }
>
> if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
> - return madvise_dontneed_single_vma(vma, start, end);
> + return madvise_dontneed_single_vma(
> + madv_behavior, vma, start, end);
> else if (behavior == MADV_FREE)
> return madvise_free_single_vma(madv_behavior, vma, start, end);
> else
> @@ -1628,6 +1631,8 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
> static bool madvise_batch_tlb_flush(int behavior)
> {
> switch (behavior) {
> + case MADV_DONTNEED:
> + case MADV_DONTNEED_LOCKED:
> case MADV_FREE:
> return true;
> default:
> diff --git a/mm/memory.c b/mm/memory.c
> index 690695643dfb..559f3e194438 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1998,7 +1998,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> mmu_notifier_invalidate_range_end(&range);
> }
>
> -/*
> +/**
> * zap_page_range_single_batched - remove user pages in a given range
> * @tlb: pointer to the caller's struct mmu_gather
> * @vma: vm_area_struct holding the applicable pages
> @@ -2009,7 +2009,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> * @tlb shouldn't be NULL. The range must fit into one VMA. If @vma is for
> * hugetlb, @tlb is flushed and re-initialized by this function.
> */
> -static void zap_page_range_single_batched(struct mmu_gather *tlb,
> +void zap_page_range_single_batched(struct mmu_gather *tlb,
> struct vm_area_struct *vma, unsigned long address,
> unsigned long size, struct zap_details *details)
> {
> --
> 2.39.5