linux-mm.kvack.org archive mirror
* [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
@ 2025-03-10 17:23 SeongJae Park
  2025-03-10 17:23 ` [PATCH 1/9] mm/madvise: use is_memory_failure() from madvise_do_behavior() SeongJae Park
                   ` (10 more replies)
  0 siblings, 11 replies; 42+ messages in thread
From: SeongJae Park @ 2025-03-10 17:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or
MADV_FREE with multiple address ranges, a tlb flush happens for each of
the given address ranges.  Because such tlb flushes are for the same
process, doing them in a batch is more efficient while still being
safe.  Modify the process_madvise() entry-level code path to do such
batched tlb flushes, while the internal unmap logics do only the
gathering of the tlb entries to flush.

In more detail, modify the entry functions to initialize an mmu_gather
object and pass it to the internal logics.  Make the internal logics
only gather the tlb entries to flush into the received mmu_gather
object.  After all internal function calls are done, the entry
functions flush the gathered tlb entries at once.

The inefficiency should be smaller for the madvise() use case, since it
receives only a single address range.  But if there are multiple vmas
for the range, the same problem can happen.  It is unclear whether such
a use case is common and the inefficiency is significant.  Make the
change for madvise(), too, since it doesn't really change madvise()
internal behavior while it helps keep the code shared between
process_madvise() and madvise() internal logics clean.

Patch Sequence
==============

The first four patches are minor cleanups of madvise.c for readability.

The fifth patch defines a new data structure for managing the
information required for batched tlb flushes (mmu_gather and behavior),
and updates the code paths of the MADV_DONTNEED[_LOCKED] and MADV_FREE
handling internal logics to receive it.

The sixth and seventh patches make the internal logics for handling
MADV_DONTNEED[_LOCKED] and MADV_FREE ready for batched tlb flushing.
The patches keep support for the unbatched tlb flush use case, for
fine-grained and safe transitions.

The eighth patch updates the madvise() and process_madvise() code to do
the batched tlb flushes, utilizing the changes introduced by the
previous patches.

The final, ninth patch removes the internal logics' support code for the
unbatched tlb flush use case, which is no longer used.

Test Results
============

I measured the latency of applying MADV_DONTNEED advice to 256 MiB of
memory using multiple process_madvise() calls.  I applied the advice at
4 KiB region granularity, with the batch size per process_madvise()
call (vlen) varying from 1 to 1024.  The source code for the
measurement is available at GitHub[1].  To reduce measurement errors, I
did the measurement five times.

The measurement results are as below.  The 'sz_batch' column shows the
batch size of the process_madvise() calls.  The 'Before' and 'After'
columns show the average latencies in nanoseconds, measured five times,
on kernels built without and with the tlb flush batching patches of
this series, respectively.  For the baseline, the mm-unstable tree of
2025-03-07[2] has been used.  The 'B-stdev' and 'A-stdev' columns show
the ratios of the standard deviation of the latency measurements to the
average, in percent, for 'Before' and 'After', respectively.
'Latency_reduction' shows the reduction of the latency that the series
achieves, in percent.  Higher 'Latency_reduction' values mean greater
efficiency improvements.

    sz_batch   Before        B-stdev   After         A-stdev   Latency_reduction
    1          128691595.4   6.09      106878698.4   2.76      16.95
    2          94046750.8    3.30      68691907      2.79      26.96
    4          80538496.8    5.35      50230673.8    5.30      37.63
    8          72672225.2    5.50      43918112      3.54      39.57
    16         66955104.4    4.25      36549941.2    1.62      45.41
    32         65814679      5.89      33060291      3.30      49.77
    64         65099205.2    2.69      26003756.4    1.56      60.06
    128        62591307.2    4.02      24008180.4    1.61      61.64
    256        64124368.6    2.93      23868856      2.20      62.78
    512        62325618      5.72      23311330.6    1.74      62.60
    1024       64802138.4    5.05      23651695.2    3.40      63.50

Interestingly, the latency is reduced (improved) even with batch size
1.  I suspect some compiler optimizations have affected that, as also
observed with the previous process_madvise() mmap_lock optimization
patch series[3].

So, let's focus on the relation between the improvement and the batch
size.  As expected, tlb flush batching provides a latency reduction
roughly proportional to the batch size.  The efficiency gain ranges
from about 27 percent with batch size 2 up to 63 percent with batch
size 1,024.

Please note that this is a very simple microbenchmark, so the
efficiency gain on real workloads could be very different.

Changes from RFC
(https://lore.kernel.org/20250305181611.54484-1-sj@kernel.org)
- Clarify motivation of the change on the cover letter
- Add average and stdev of evaluation results
- Show latency reduction on evaluation results
- Fix !CONFIG_MEMORY_FAILURE build error
- Rename is_memory_populate() to is_madvise_populate()
- Squash patches 5-8
- Add kerneldoc for unmap_vm_area_struct()
- Squash patches 10 and 11
- Squash patches 12-14
- Squash patches 15 and 16

References
==========

[1] https://github.com/sjp38/eval_proc_madvise
[2] commit e664d7d28a7c ("selftest: test system mappings are sealed") # mm-unstable
[3] https://lore.kernel.org/20250211182833.4193-1-sj@kernel.org

SeongJae Park (9):
  mm/madvise: use is_memory_failure() from madvise_do_behavior()
  mm/madvise: split out populate behavior check logic
  mm/madvise: deduplicate madvise_do_behavior() skip case handlings
  mm/madvise: remove len parameter of madvise_do_behavior()
  mm/madvise: define and use madvise_behavior struct for
    madvise_do_behavior()
  mm/memory: split non-tlb flushing part from zap_page_range_single()
  mm/madvise: let madvise_{dontneed,free}_single_vma() caller batches
    tlb flushes
  mm/madvise: batch tlb flushes for
    [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE})
  mm/madvise: remove !tlb support from
    madvise_{dontneed,free}_single_vma()

 mm/internal.h |   3 +
 mm/madvise.c  | 221 +++++++++++++++++++++++++++++++++-----------------
 mm/memory.c   |  38 ++++++---
 3 files changed, 176 insertions(+), 86 deletions(-)


base-commit: e993f5f5b0ac851cf60578cfee5488031dfaa80c
-- 
2.39.5


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/9] mm/madvise: use is_memory_failure() from madvise_do_behavior()
  2025-03-10 17:23 [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
@ 2025-03-10 17:23 ` SeongJae Park
  2025-03-11 11:27   ` Lorenzo Stoakes
  2025-03-10 17:23 ` [PATCH 2/9] mm/madvise: split out populate behavior check logic SeongJae Park
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: SeongJae Park @ 2025-03-10 17:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

To reduce redundant open-coded checks of CONFIG_MEMORY_FAILURE and
MADV_{HWPOISON,SOFT_OFFLINE} in madvise_[un]lock(), is_memory_failure()
has been introduced.  madvise_do_behavior() is still doing the same
open-coded check, though.  Use is_memory_failure() instead.

To avoid a build failure in the !CONFIG_MEMORY_FAILURE case, implement
an empty madvise_inject_error() for that config.  Also move the
definition of is_memory_failure() inside the #ifdef
CONFIG_MEMORY_FAILURE clause for the madvise_inject_error() definition,
to reduce duplicated ifdef clauses.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/madvise.c | 49 +++++++++++++++++++++++++++----------------------
 1 file changed, 27 insertions(+), 22 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 388dc289b5d1..c3ab1f283b18 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1392,7 +1392,32 @@ static int madvise_inject_error(int behavior,
 
 	return 0;
 }
-#endif
+
+static bool is_memory_failure(int behavior)
+{
+	switch (behavior) {
+	case MADV_HWPOISON:
+	case MADV_SOFT_OFFLINE:
+		return true;
+	default:
+		return false;
+	}
+}
+
+#else
+
+static int madvise_inject_error(int behavior,
+		unsigned long start, unsigned long end)
+{
+	return 0;
+}
+
+static bool is_memory_failure(int behavior)
+{
+	return false;
+}
+
+#endif	/* CONFIG_MEMORY_FAILURE */
 
 static bool
 madvise_behavior_valid(int behavior)
@@ -1569,24 +1594,6 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 }
 #endif /* CONFIG_ANON_VMA_NAME */
 
-#ifdef CONFIG_MEMORY_FAILURE
-static bool is_memory_failure(int behavior)
-{
-	switch (behavior) {
-	case MADV_HWPOISON:
-	case MADV_SOFT_OFFLINE:
-		return true;
-	default:
-		return false;
-	}
-}
-#else
-static bool is_memory_failure(int behavior)
-{
-	return false;
-}
-#endif
-
 static int madvise_lock(struct mm_struct *mm, int behavior)
 {
 	if (is_memory_failure(behavior))
@@ -1640,10 +1647,8 @@ static int madvise_do_behavior(struct mm_struct *mm,
 	unsigned long end;
 	int error;
 
-#ifdef CONFIG_MEMORY_FAILURE
-	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
+	if (is_memory_failure(behavior))
 		return madvise_inject_error(behavior, start, start + len_in);
-#endif
 	start = untagged_addr_remote(mm, start);
 	end = start + len;
 
-- 
2.39.5



* [PATCH 2/9] mm/madvise: split out populate behavior check logic
  2025-03-10 17:23 [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
  2025-03-10 17:23 ` [PATCH 1/9] mm/madvise: use is_memory_failure() from madvise_do_behavior() SeongJae Park
@ 2025-03-10 17:23 ` SeongJae Park
  2025-03-11 11:29   ` Lorenzo Stoakes
  2025-03-10 17:23 ` [PATCH 3/9] mm/madvise: deduplicate madvise_do_behavior() skip case handlings SeongJae Park
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: SeongJae Park @ 2025-03-10 17:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

madvise_do_behavior() has a long open-coded 'behavior' check for
MADV_POPULATE_{READ,WRITE}.  It adds multiple layers[1] and arguably
makes the code take longer to read.  Like is_memory_failure(), split
out the check to a separate function.  This does not technically remove
the additional layer, but it discourages further extending of the
switch-case.  It also makes madvise_do_behavior() shorter and therefore
easier to read.

[1] https://lore.kernel.org/bd6d0bf1-c79e-46bd-a810-9791efb9ad73@lucifer.local

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/madvise.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index c3ab1f283b18..611db868ae38 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1640,6 +1640,17 @@ static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
 	return true;
 }
 
+static bool is_madvise_populate(int behavior)
+{
+	switch (behavior) {
+	case MADV_POPULATE_READ:
+	case MADV_POPULATE_WRITE:
+		return true;
+	default:
+		return false;
+	}
+}
+
 static int madvise_do_behavior(struct mm_struct *mm,
 		unsigned long start, size_t len_in, size_t len, int behavior)
 {
@@ -1653,16 +1664,11 @@ static int madvise_do_behavior(struct mm_struct *mm,
 	end = start + len;
 
 	blk_start_plug(&plug);
-	switch (behavior) {
-	case MADV_POPULATE_READ:
-	case MADV_POPULATE_WRITE:
+	if (is_madvise_populate(behavior))
 		error = madvise_populate(mm, start, end, behavior);
-		break;
-	default:
+	else
 		error = madvise_walk_vmas(mm, start, end, behavior,
 					  madvise_vma_behavior);
-		break;
-	}
 	blk_finish_plug(&plug);
 	return error;
 }
-- 
2.39.5



* [PATCH 3/9] mm/madvise: deduplicate madvise_do_behavior() skip case handlings
  2025-03-10 17:23 [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
  2025-03-10 17:23 ` [PATCH 1/9] mm/madvise: use is_memory_failure() from madvise_do_behavior() SeongJae Park
  2025-03-10 17:23 ` [PATCH 2/9] mm/madvise: split out populate behavior check logic SeongJae Park
@ 2025-03-10 17:23 ` SeongJae Park
  2025-03-11 12:02   ` Lorenzo Stoakes
  2025-03-10 17:23 ` [PATCH 4/9] mm/madvise: remove len parameter of madvise_do_behavior() SeongJae Park
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: SeongJae Park @ 2025-03-10 17:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

The logic for checking whether a given madvise() request for a single
memory range can skip the real work, namely madvise_do_behavior(), is
duplicated in do_madvise() and vector_madvise().  Split out the logic
to a function and reuse it.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/madvise.c | 53 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 30 insertions(+), 23 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 611db868ae38..764ec1f2475b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1640,6 +1640,27 @@ static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
 	return true;
 }
 
+/*
+ * madvise_should_skip() - Return whether a madvise request can skip the real work.
+ * @start:	Start address of madvise-requested address range.
+ * @len_in:	Length of madvise-requested address range.
+ * @behavior:	Requested madvise behavior.
+ * @err:	Pointer to store an error code from the check.
+ */
+static bool madvise_should_skip(unsigned long start, size_t len_in,
+		int behavior, int *err)
+{
+	if (!is_valid_madvise(start, len_in, behavior)) {
+		*err = -EINVAL;
+		return true;
+	}
+	if (start + PAGE_ALIGN(len_in) == start) {
+		*err = 0;
+		return true;
+	}
+	return false;
+}
+
 static bool is_madvise_populate(int behavior)
 {
 	switch (behavior) {
@@ -1747,23 +1768,15 @@ static int madvise_do_behavior(struct mm_struct *mm,
  */
 int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
 {
-	unsigned long end;
 	int error;
-	size_t len;
-
-	if (!is_valid_madvise(start, len_in, behavior))
-		return -EINVAL;
-
-	len = PAGE_ALIGN(len_in);
-	end = start + len;
-
-	if (end == start)
-		return 0;
 
+	if (madvise_should_skip(start, len_in, behavior, &error))
+		return error;
 	error = madvise_lock(mm, behavior);
 	if (error)
 		return error;
-	error = madvise_do_behavior(mm, start, len_in, len, behavior);
+	error = madvise_do_behavior(mm, start, len_in, PAGE_ALIGN(len_in),
+			behavior);
 	madvise_unlock(mm, behavior);
 
 	return error;
@@ -1790,19 +1803,13 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 	while (iov_iter_count(iter)) {
 		unsigned long start = (unsigned long)iter_iov_addr(iter);
 		size_t len_in = iter_iov_len(iter);
-		size_t len;
-
-		if (!is_valid_madvise(start, len_in, behavior)) {
-			ret = -EINVAL;
-			break;
-		}
+		int error;
 
-		len = PAGE_ALIGN(len_in);
-		if (start + len == start)
-			ret = 0;
+		if (madvise_should_skip(start, len_in, behavior, &error))
+			ret = error;
 		else
-			ret = madvise_do_behavior(mm, start, len_in, len,
-					behavior);
+			ret = madvise_do_behavior(mm, start, len_in,
+					PAGE_ALIGN(len_in), behavior);
 		/*
 		 * An madvise operation is attempting to restart the syscall,
 		 * but we cannot proceed as it would not be correct to repeat
-- 
2.39.5



* [PATCH 4/9] mm/madvise: remove len parameter of madvise_do_behavior()
  2025-03-10 17:23 [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
                   ` (2 preceding siblings ...)
  2025-03-10 17:23 ` [PATCH 3/9] mm/madvise: deduplicate madvise_do_behavior() skip case handlings SeongJae Park
@ 2025-03-10 17:23 ` SeongJae Park
  2025-03-11 12:05   ` Lorenzo Stoakes
  2025-03-10 17:23 ` [PATCH 5/9] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() SeongJae Park
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: SeongJae Park @ 2025-03-10 17:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

Because the madvise_should_skip() logic is factored out, making
madvise_do_behavior() calculate 'len' on its own rather than receiving
it as a parameter makes the code simpler.  Remove the parameter.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/madvise.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 764ec1f2475b..469c25690a0e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1673,7 +1673,7 @@ static bool is_madvise_populate(int behavior)
 }
 
 static int madvise_do_behavior(struct mm_struct *mm,
-		unsigned long start, size_t len_in, size_t len, int behavior)
+		unsigned long start, size_t len_in, int behavior)
 {
 	struct blk_plug plug;
 	unsigned long end;
@@ -1682,7 +1682,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
 	if (is_memory_failure(behavior))
 		return madvise_inject_error(behavior, start, start + len_in);
 	start = untagged_addr_remote(mm, start);
-	end = start + len;
+	end = start + PAGE_ALIGN(len_in);
 
 	blk_start_plug(&plug);
 	if (is_madvise_populate(behavior))
@@ -1775,8 +1775,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 	error = madvise_lock(mm, behavior);
 	if (error)
 		return error;
-	error = madvise_do_behavior(mm, start, len_in, PAGE_ALIGN(len_in),
-			behavior);
+	error = madvise_do_behavior(mm, start, len_in, behavior);
 	madvise_unlock(mm, behavior);
 
 	return error;
@@ -1808,8 +1807,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 		if (madvise_should_skip(start, len_in, behavior, &error))
 			ret = error;
 		else
-			ret = madvise_do_behavior(mm, start, len_in,
-					PAGE_ALIGN(len_in), behavior);
+			ret = madvise_do_behavior(mm, start, len_in, behavior);
 		/*
 		 * An madvise operation is attempting to restart the syscall,
 		 * but we cannot proceed as it would not be correct to repeat
-- 
2.39.5



* [PATCH 5/9] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior()
  2025-03-10 17:23 [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
                   ` (3 preceding siblings ...)
  2025-03-10 17:23 ` [PATCH 4/9] mm/madvise: remove len parameter of madvise_do_behavior() SeongJae Park
@ 2025-03-10 17:23 ` SeongJae Park
  2025-03-11 12:17   ` Lorenzo Stoakes
  2025-03-10 17:23 ` [PATCH 6/9] mm/memory: split non-tlb flushing part from zap_page_range_single() SeongJae Park
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: SeongJae Park @ 2025-03-10 17:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and
MADV_FREE, an mmu_gather object needs to be passed to the internal
logics in addition to the behavior integer.  Using a struct can make
this easy without increasing the number of parameters of all code paths
towards the internal logic.  Define a struct for the purpose and use it
on the code path that starts from madvise_do_behavior() and ends on
madvise_dontneed_free().

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/madvise.c | 36 ++++++++++++++++++++++++------------
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 469c25690a0e..ba2a78795207 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -890,11 +890,16 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
 	return true;
 }
 
+struct madvise_behavior {
+	int behavior;
+};
+
 static long madvise_dontneed_free(struct vm_area_struct *vma,
 				  struct vm_area_struct **prev,
 				  unsigned long start, unsigned long end,
-				  int behavior)
+				  struct madvise_behavior *madv_behavior)
 {
+	int behavior = madv_behavior->behavior;
 	struct mm_struct *mm = vma->vm_mm;
 
 	*prev = vma;
@@ -1249,8 +1254,10 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
 static int madvise_vma_behavior(struct vm_area_struct *vma,
 				struct vm_area_struct **prev,
 				unsigned long start, unsigned long end,
-				unsigned long behavior)
+				void *behavior_arg)
 {
+	struct madvise_behavior *arg = behavior_arg;
+	int behavior = arg->behavior;
 	int error;
 	struct anon_vma_name *anon_name;
 	unsigned long new_flags = vma->vm_flags;
@@ -1270,7 +1277,7 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 	case MADV_FREE:
 	case MADV_DONTNEED:
 	case MADV_DONTNEED_LOCKED:
-		return madvise_dontneed_free(vma, prev, start, end, behavior);
+		return madvise_dontneed_free(vma, prev, start, end, arg);
 	case MADV_NORMAL:
 		new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
 		break;
@@ -1487,10 +1494,10 @@ static bool process_madvise_remote_valid(int behavior)
  */
 static
 int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
-		      unsigned long end, unsigned long arg,
+		      unsigned long end, void *arg,
 		      int (*visit)(struct vm_area_struct *vma,
 				   struct vm_area_struct **prev, unsigned long start,
-				   unsigned long end, unsigned long arg))
+				   unsigned long end, void *arg))
 {
 	struct vm_area_struct *vma;
 	struct vm_area_struct *prev;
@@ -1548,7 +1555,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
 static int madvise_vma_anon_name(struct vm_area_struct *vma,
 				 struct vm_area_struct **prev,
 				 unsigned long start, unsigned long end,
-				 unsigned long anon_name)
+				 void *anon_name)
 {
 	int error;
 
@@ -1557,7 +1564,7 @@ static int madvise_vma_anon_name(struct vm_area_struct *vma,
 		return -EBADF;
 
 	error = madvise_update_vma(vma, prev, start, end, vma->vm_flags,
-				   (struct anon_vma_name *)anon_name);
+				   anon_name);
 
 	/*
 	 * madvise() returns EAGAIN if kernel resources, such as
@@ -1589,7 +1596,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 	if (end == start)
 		return 0;
 
-	return madvise_walk_vmas(mm, start, end, (unsigned long)anon_name,
+	return madvise_walk_vmas(mm, start, end, anon_name,
 				 madvise_vma_anon_name);
 }
 #endif /* CONFIG_ANON_VMA_NAME */
@@ -1673,8 +1680,10 @@ static bool is_madvise_populate(int behavior)
 }
 
 static int madvise_do_behavior(struct mm_struct *mm,
-		unsigned long start, size_t len_in, int behavior)
+		unsigned long start, size_t len_in,
+		struct madvise_behavior *madv_behavior)
 {
+	int behavior = madv_behavior->behavior;
 	struct blk_plug plug;
 	unsigned long end;
 	int error;
@@ -1688,7 +1697,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
 	if (is_madvise_populate(behavior))
 		error = madvise_populate(mm, start, end, behavior);
 	else
-		error = madvise_walk_vmas(mm, start, end, behavior,
+		error = madvise_walk_vmas(mm, start, end, madv_behavior,
 					  madvise_vma_behavior);
 	blk_finish_plug(&plug);
 	return error;
@@ -1769,13 +1778,14 @@ static int madvise_do_behavior(struct mm_struct *mm,
 int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
 {
 	int error;
+	struct madvise_behavior madv_behavior = {.behavior = behavior};
 
 	if (madvise_should_skip(start, len_in, behavior, &error))
 		return error;
 	error = madvise_lock(mm, behavior);
 	if (error)
 		return error;
-	error = madvise_do_behavior(mm, start, len_in, behavior);
+	error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
 	madvise_unlock(mm, behavior);
 
 	return error;
@@ -1792,6 +1802,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 {
 	ssize_t ret = 0;
 	size_t total_len;
+	struct madvise_behavior madv_behavior = {.behavior = behavior};
 
 	total_len = iov_iter_count(iter);
 
@@ -1807,7 +1818,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 		if (madvise_should_skip(start, len_in, behavior, &error))
 			ret = error;
 		else
-			ret = madvise_do_behavior(mm, start, len_in, behavior);
+			ret = madvise_do_behavior(mm, start, len_in,
+					&madv_behavior);
 		/*
 		 * An madvise operation is attempting to restart the syscall,
 		 * but we cannot proceed as it would not be correct to repeat
-- 
2.39.5



* [PATCH 6/9] mm/memory: split non-tlb flushing part from zap_page_range_single()
  2025-03-10 17:23 [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
                   ` (4 preceding siblings ...)
  2025-03-10 17:23 ` [PATCH 5/9] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() SeongJae Park
@ 2025-03-10 17:23 ` SeongJae Park
  2025-03-11 12:45   ` Lorenzo Stoakes
  2025-04-01  1:45   ` Liam R. Howlett
  2025-03-10 17:23 ` [PATCH 7/9] mm/madvise: let madvise_{dontneed,free}_single_vma() caller batches tlb flushes SeongJae Park
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 42+ messages in thread
From: SeongJae Park @ 2025-03-10 17:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

Some zap_page_range_single() callers, such as [process_]madvise() with
MADV_DONTNEED[_LOCKED], cannot batch tlb flushes because
zap_page_range_single() does tlb flushing for each invocation.  Split
out the body of zap_page_range_single(), except for the mmu_gather
object initialization and the flushing of the gathered tlb entries, for
such batched tlb flushing usage.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/memory.c | 36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 78c7ee62795e..88c478e2ed1a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1995,38 +1995,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
 	mmu_notifier_invalidate_range_end(&range);
 }
 
-/**
- * zap_page_range_single - remove user pages in a given range
- * @vma: vm_area_struct holding the applicable pages
- * @address: starting address of pages to zap
- * @size: number of bytes to zap
- * @details: details of shared cache invalidation
- *
- * The range must fit into one VMA.
- */
-void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+static void unmap_vma_single(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *details)
 {
 	const unsigned long end = address + size;
 	struct mmu_notifier_range range;
-	struct mmu_gather tlb;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
 				address, end);
 	hugetlb_zap_begin(vma, &range.start, &range.end);
-	tlb_gather_mmu(&tlb, vma->vm_mm);
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
 	/*
 	 * unmap 'address-end' not 'range.start-range.end' as range
 	 * could have been expanded for hugetlb pmd sharing.
 	 */
-	unmap_single_vma(&tlb, vma, address, end, details, false);
+	unmap_single_vma(tlb, vma, address, end, details, false);
 	mmu_notifier_invalidate_range_end(&range);
-	tlb_finish_mmu(&tlb);
 	hugetlb_zap_end(vma, details);
 }
 
+/**
+ * zap_page_range_single - remove user pages in a given range
+ * @vma: vm_area_struct holding the applicable pages
+ * @address: starting address of pages to zap
+ * @size: number of bytes to zap
+ * @details: details of shared cache invalidation
+ *
+ * The range must fit into one VMA.
+ */
+void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+		unsigned long size, struct zap_details *details)
+{
+	struct mmu_gather tlb;
+
+	tlb_gather_mmu(&tlb, vma->vm_mm);
+	unmap_vma_single(&tlb, vma, address, size, details);
+	tlb_finish_mmu(&tlb);
+}
+
 /**
  * zap_vma_ptes - remove ptes mapping the vma
  * @vma: vm_area_struct holding ptes to be zapped
-- 
2.39.5



* [PATCH 7/9] mm/madvise: let madvise_{dontneed,free}_single_vma() caller batches tlb flushes
  2025-03-10 17:23 [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
                   ` (5 preceding siblings ...)
  2025-03-10 17:23 ` [PATCH 6/9] mm/memory: split non-tlb flushing part from zap_page_range_single() SeongJae Park
@ 2025-03-10 17:23 ` SeongJae Park
  2025-03-11 13:07   ` Lorenzo Stoakes
  2025-03-10 17:23 ` [PATCH 8/9] mm/madvise: batch tlb flushes for [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE}) SeongJae Park
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: SeongJae Park @ 2025-03-10 17:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

Update the madvise_dontneed_single_vma() and madvise_free_single_vma()
functions so that the caller can pass an mmu_gather object that is
initialized and finished outside, for batched tlb flushes.  Also modify
their internal code to support such usage by skipping the
initialization and finishing of a self-allocated mmu_gather object when
a valid mmu_gather object is received.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/internal.h |  3 +++
 mm/madvise.c  | 37 +++++++++++++++++++++++++------------
 mm/memory.c   | 16 +++++++++++++---
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 0caa64dc2cb7..ce7fb2383f65 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -438,6 +438,9 @@ void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details);
+void unmap_vma_single(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		      unsigned long addr, unsigned long size,
+		      struct zap_details *details);
 int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
 			   gfp_t gfp);
 
diff --git a/mm/madvise.c b/mm/madvise.c
index ba2a78795207..d7ea71c6422c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -794,12 +794,19 @@ static const struct mm_walk_ops madvise_free_walk_ops = {
 	.walk_lock		= PGWALK_RDLOCK,
 };
 
-static int madvise_free_single_vma(struct vm_area_struct *vma,
-			unsigned long start_addr, unsigned long end_addr)
+static int madvise_free_single_vma(
+		struct mmu_gather *caller_tlb, struct vm_area_struct *vma,
+		unsigned long start_addr, unsigned long end_addr)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_notifier_range range;
-	struct mmu_gather tlb;
+	struct mmu_gather self_tlb;
+	struct mmu_gather *tlb;
+
+	if (caller_tlb)
+		tlb = caller_tlb;
+	else
+		tlb = &self_tlb;
 
 	/* MADV_FREE works for only anon vma at the moment */
 	if (!vma_is_anonymous(vma))
@@ -815,16 +822,18 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 				range.start, range.end);
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm);
+	if (!caller_tlb)
+		tlb_gather_mmu(tlb, mm);
 	update_hiwater_rss(mm);
 
 	mmu_notifier_invalidate_range_start(&range);
-	tlb_start_vma(&tlb, vma);
+	tlb_start_vma(tlb, vma);
 	walk_page_range(vma->vm_mm, range.start, range.end,
-			&madvise_free_walk_ops, &tlb);
-	tlb_end_vma(&tlb, vma);
+			&madvise_free_walk_ops, tlb);
+	tlb_end_vma(tlb, vma);
 	mmu_notifier_invalidate_range_end(&range);
-	tlb_finish_mmu(&tlb);
+	if (!caller_tlb)
+		tlb_finish_mmu(tlb);
 
 	return 0;
 }
@@ -848,7 +857,8 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
  * An interface that causes the system to free clean pages and flush
  * dirty pages is already available as msync(MS_INVALIDATE).
  */
-static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
+static long madvise_dontneed_single_vma(struct mmu_gather *tlb,
+					struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
 	struct zap_details details = {
@@ -856,7 +866,10 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 		.even_cows = true,
 	};
 
-	zap_page_range_single(vma, start, end - start, &details);
+	if (!tlb)
+		zap_page_range_single(vma, start, end - start, &details);
+	else
+		unmap_vma_single(tlb, vma, start, end - start, &details);
 	return 0;
 }
 
@@ -951,9 +964,9 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 	}
 
 	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
-		return madvise_dontneed_single_vma(vma, start, end);
+		return madvise_dontneed_single_vma(NULL, vma, start, end);
 	else if (behavior == MADV_FREE)
-		return madvise_free_single_vma(vma, start, end);
+		return madvise_free_single_vma(NULL, vma, start, end);
 	else
 		return -EINVAL;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 88c478e2ed1a..3256b9713cbd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1995,9 +1995,19 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
 	mmu_notifier_invalidate_range_end(&range);
 }
 
-static void unmap_vma_single(struct mmu_gather *tlb,
-		struct vm_area_struct *vma, unsigned long address,
-		unsigned long size, struct zap_details *details)
+/**
+ * unmap_vma_single - remove user pages in a given range
+ * @tlb: pointer to the caller's struct mmu_gather
+ * @vma: vm_area_struct holding the applicable pages
+ * @address: starting address of the pages
+ * @size: number of bytes to remove
+ * @details: details of shared cache invalidation
+ *
+ * @tlb shouldn't be NULL.  The range must fit into one VMA.
+ */
+void unmap_vma_single(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		      unsigned long address, unsigned long size,
+		      struct zap_details *details)
 {
 	const unsigned long end = address + size;
 	struct mmu_notifier_range range;
-- 
2.39.5


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 8/9] mm/madvise: batch tlb flushes for [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE})
  2025-03-10 17:23 [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
                   ` (6 preceding siblings ...)
  2025-03-10 17:23 ` [PATCH 7/9] mm/madvise: let madvise_{dontneed,free}_single_vma() caller batches tlb flushes SeongJae Park
@ 2025-03-10 17:23 ` SeongJae Park
  2025-03-11 13:59   ` Lorenzo Stoakes
  2025-04-01 21:17   ` SeongJae Park
  2025-03-10 17:23 ` [PATCH 9/9] mm/madvise: remove !tlb support from madvise_{dontneed,free}_single_vma() SeongJae Park
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 42+ messages in thread
From: SeongJae Park @ 2025-03-10 17:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

The MADV_DONTNEED[_LOCKED] and MADV_FREE internal logic for
[process_]madvise() can now be invoked with batched tlb flushes.  Update
vector_madvise() and do_madvise(), which are called for the two system
calls respectively, to use it efficiently: initialize an mmu_gather
object before starting the internal work, and flush the gathered tlb
entries at once after all the internal work is done.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/madvise.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 47 insertions(+), 4 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index d7ea71c6422c..d5f4ce3041a4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -905,6 +905,7 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
 
 struct madvise_behavior {
 	int behavior;
+	struct mmu_gather *tlb;
 };
 
 static long madvise_dontneed_free(struct vm_area_struct *vma,
@@ -964,9 +965,11 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 	}
 
 	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
-		return madvise_dontneed_single_vma(NULL, vma, start, end);
+		return madvise_dontneed_single_vma(
+				madv_behavior->tlb, vma, start, end);
 	else if (behavior == MADV_FREE)
-		return madvise_free_single_vma(NULL, vma, start, end);
+		return madvise_free_single_vma(
+				madv_behavior->tlb, vma, start, end);
 	else
 		return -EINVAL;
 }
@@ -1639,6 +1642,32 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
 		mmap_read_unlock(mm);
 }
 
+static bool madvise_batch_tlb_flush(int behavior)
+{
+	switch (behavior) {
+	case MADV_DONTNEED:
+	case MADV_DONTNEED_LOCKED:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static void madvise_init_tlb(struct madvise_behavior *madv_behavior,
+		struct mm_struct *mm)
+{
+	if (!madvise_batch_tlb_flush(madv_behavior->behavior))
+		return;
+	tlb_gather_mmu(madv_behavior->tlb, mm);
+}
+
+static void madvise_finish_tlb(struct madvise_behavior *madv_behavior)
+{
+	if (!madvise_batch_tlb_flush(madv_behavior->behavior))
+		return;
+	tlb_finish_mmu(madv_behavior->tlb);
+}
+
 static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
 {
 	size_t len;
@@ -1791,14 +1820,20 @@ static int madvise_do_behavior(struct mm_struct *mm,
 int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
 {
 	int error;
-	struct madvise_behavior madv_behavior = {.behavior = behavior};
+	struct mmu_gather tlb;
+	struct madvise_behavior madv_behavior = {
+		.behavior = behavior,
+		.tlb = &tlb,
+	};
 
 	if (madvise_should_skip(start, len_in, behavior, &error))
 		return error;
 	error = madvise_lock(mm, behavior);
 	if (error)
 		return error;
+	madvise_init_tlb(&madv_behavior, mm);
 	error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
+	madvise_finish_tlb(&madv_behavior);
 	madvise_unlock(mm, behavior);
 
 	return error;
@@ -1815,13 +1850,18 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 {
 	ssize_t ret = 0;
 	size_t total_len;
-	struct madvise_behavior madv_behavior = {.behavior = behavior};
+	struct mmu_gather tlb;
+	struct madvise_behavior madv_behavior = {
+		.behavior = behavior,
+		.tlb = &tlb,
+	};
 
 	total_len = iov_iter_count(iter);
 
 	ret = madvise_lock(mm, behavior);
 	if (ret)
 		return ret;
+	madvise_init_tlb(&madv_behavior, mm);
 
 	while (iov_iter_count(iter)) {
 		unsigned long start = (unsigned long)iter_iov_addr(iter);
@@ -1850,14 +1890,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 			}
 
 			/* Drop and reacquire lock to unwind race. */
+			madvise_finish_tlb(&madv_behavior);
 			madvise_unlock(mm, behavior);
 			madvise_lock(mm, behavior);
+			madvise_init_tlb(&madv_behavior, mm);
 			continue;
 		}
 		if (ret < 0)
 			break;
 		iov_iter_advance(iter, iter_iov_len(iter));
 	}
+	madvise_finish_tlb(&madv_behavior);
 	madvise_unlock(mm, behavior);
 
 	ret = (total_len - iov_iter_count(iter)) ? : ret;
-- 
2.39.5



* [PATCH 9/9] mm/madvise: remove !tlb support from madvise_{dontneed,free}_single_vma()
  2025-03-10 17:23 [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
                   ` (7 preceding siblings ...)
  2025-03-10 17:23 ` [PATCH 8/9] mm/madvise: batch tlb flushes for [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE}) SeongJae Park
@ 2025-03-10 17:23 ` SeongJae Park
  2025-03-11 14:01   ` Lorenzo Stoakes
  2025-03-10 22:39 ` [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE Andrew Morton
  2025-03-11 12:49 ` Lorenzo Stoakes
  10 siblings, 1 reply; 42+ messages in thread
From: SeongJae Park @ 2025-03-10 17:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

madvise_dontneed_single_vma() and madvise_free_single_vma() support both
batched and unbatched tlb flush use cases, depending on the value of the
received tlb parameter.  That support was added for a safe, incremental
transition from unbatched to batched flushes.  Now that the transition
is complete, there is no remaining unbatched tlb flush use case.  Remove
the code supporting the now-unused cases.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/madvise.c | 19 ++-----------------
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index d5f4ce3041a4..25af0a24c00b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -795,18 +795,11 @@ static const struct mm_walk_ops madvise_free_walk_ops = {
 };
 
 static int madvise_free_single_vma(
-		struct mmu_gather *caller_tlb, struct vm_area_struct *vma,
+		struct mmu_gather *tlb, struct vm_area_struct *vma,
 		unsigned long start_addr, unsigned long end_addr)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_notifier_range range;
-	struct mmu_gather self_tlb;
-	struct mmu_gather *tlb;
-
-	if (caller_tlb)
-		tlb = caller_tlb;
-	else
-		tlb = &self_tlb;
 
 	/* MADV_FREE works for only anon vma at the moment */
 	if (!vma_is_anonymous(vma))
@@ -822,8 +815,6 @@ static int madvise_free_single_vma(
 				range.start, range.end);
 
 	lru_add_drain();
-	if (!caller_tlb)
-		tlb_gather_mmu(tlb, mm);
 	update_hiwater_rss(mm);
 
 	mmu_notifier_invalidate_range_start(&range);
@@ -832,9 +823,6 @@ static int madvise_free_single_vma(
 			&madvise_free_walk_ops, tlb);
 	tlb_end_vma(tlb, vma);
 	mmu_notifier_invalidate_range_end(&range);
-	if (!caller_tlb)
-		tlb_finish_mmu(tlb);
-
 	return 0;
 }
 
@@ -866,10 +854,7 @@ static long madvise_dontneed_single_vma(struct mmu_gather *tlb,
 		.even_cows = true,
 	};
 
-	if (!tlb)
-		zap_page_range_single(vma, start, end - start, &details);
-	else
-		unmap_vma_single(tlb, vma, start, end - start, &details);
+	unmap_vma_single(tlb, vma, start, end - start, &details);
 	return 0;
 }
 
-- 
2.39.5



* Re: [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
  2025-03-10 17:23 [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
                   ` (8 preceding siblings ...)
  2025-03-10 17:23 ` [PATCH 9/9] mm/madvise: remove !tlb support from madvise_{dontneed,free}_single_vma() SeongJae Park
@ 2025-03-10 22:39 ` Andrew Morton
  2025-03-10 23:15   ` Shakeel Butt
  2025-03-10 23:27   ` SeongJae Park
  2025-03-11 12:49 ` Lorenzo Stoakes
  10 siblings, 2 replies; 42+ messages in thread
From: Andrew Morton @ 2025-03-10 22:39 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Liam R. Howlett, David Hildenbrand, Lorenzo Stoakes,
	Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm

On Mon, 10 Mar 2025 10:23:09 -0700 SeongJae Park <sj@kernel.org> wrote:

>  It is unclear if such use case
> is common and the inefficiency is significant. 

Well, we could conduct a survey,

Can you add some logging to detect when userspace performs such an
madvise() call, then run that kernel on some "typical" machines which
are running "typical" workloads?  That should give us a feeling for how
often userspace does this, and hence will help us understand the usefulness
of this patchset.



* Re: [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
  2025-03-10 22:39 ` [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE Andrew Morton
@ 2025-03-10 23:15   ` Shakeel Butt
  2025-03-10 23:36     ` Roman Gushchin
  2025-03-10 23:27   ` SeongJae Park
  1 sibling, 1 reply; 42+ messages in thread
From: Shakeel Butt @ 2025-03-10 23:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm

On Mon, Mar 10, 2025 at 03:39:21PM -0700, Andrew Morton wrote:
> On Mon, 10 Mar 2025 10:23:09 -0700 SeongJae Park <sj@kernel.org> wrote:
> 
> >  It is unclear if such use case
> > is common and the inefficiency is significant. 
> 
> Well, we could conduct a survey,
> 
> Can you add some logging to detect when userspace performs such an
> madvise() call, then run that kernel on some "typical" machines which
> are running "typical" workloads?  That should give us a feeling for how
> often userspace does this, and hence will help us understand the usefulness
> of this patchset.

Just for clarification, this patchset is very useful for
process_madvise(), and the experiment results show that.  I think what
SJ wants to say is that, specifically for the madvise() syscall, this
change might or might not be that helpful.  It will be helpful if the
user application is madvising regions comprising multiple vmas.  Either
way, this patchset is very useful for process_madvise().



* Re: [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
  2025-03-10 22:39 ` [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE Andrew Morton
  2025-03-10 23:15   ` Shakeel Butt
@ 2025-03-10 23:27   ` SeongJae Park
  1 sibling, 0 replies; 42+ messages in thread
From: SeongJae Park @ 2025-03-10 23:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

On Mon, 10 Mar 2025 15:39:21 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 10 Mar 2025 10:23:09 -0700 SeongJae Park <sj@kernel.org> wrote:
> 
> >  It is unclear if such use case
> > is common and the inefficiency is significant. 
> 
> Well, we could conduct a survey,
> 
> Can you add some logging to detect when userspace performs such an
> madvise() call, then run that kernel on some "typical" machines which
> are running "typical" workloads?  That should give us a feeling for how
> often userspace does this,

I agree that could make this patch series more informative.

> and hence will help us understand the usefulness
> of this patchset.

Nevertheless, what this patchset is really trying to optimize is not the
madvise() use case, but process_madvise() use.  I believe that usage is
common, given the vectorization-based semantics of process_madvise().
Jemalloc is also adding[1] that kind of use case.  And the benefit is
clear, given the microbenchmark results that I shared.

Also, this patchset shouldn't introduce new regressions to madvise().

Hence, I think the survey could be interesting and helpful, but shouldn't
be a blocker for this patch series.  Could you please let me know if I'm
missing something and if you still want the survey?

[1] https://github.com/jemalloc/jemalloc/pull/2794


Thanks,
SJ



* Re: [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
  2025-03-10 23:15   ` Shakeel Butt
@ 2025-03-10 23:36     ` Roman Gushchin
  2025-03-11 11:17       ` Lorenzo Stoakes
  0 siblings, 1 reply; 42+ messages in thread
From: Roman Gushchin @ 2025-03-10 23:36 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, SeongJae Park, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm

On Mon, Mar 10, 2025 at 04:15:06PM -0700, Shakeel Butt wrote:
> On Mon, Mar 10, 2025 at 03:39:21PM -0700, Andrew Morton wrote:
> > On Mon, 10 Mar 2025 10:23:09 -0700 SeongJae Park <sj@kernel.org> wrote:
> > 
> > >  It is unclear if such use case
> > > is common and the inefficiency is significant. 
> > 
> > Well, we could conduct a survey,
> > 
> > Can you add some logging to detect when userspace performs such an
> > madvise() call, then run that kernel on some "typical" machines which
> > are running "typical" workloads?  That should give us a feeling for how
> > often userspace does this, and hence will help us understand the usefulness
> > of this patchset.
> 
> Just for the clarification, this patchset is very useful for the
> process_madvise() and the experiment results show that.

+1

Google carried an internal version of a vectorized madvise() which
was much faster than process_madvise() the last time I measured it.
I hope SJ's patchset will (at least partially) close this gap,
which would hopefully allow us to drop the internal implementation
in favor of process_madvise().



* Re: [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
  2025-03-10 23:36     ` Roman Gushchin
@ 2025-03-11 11:17       ` Lorenzo Stoakes
  0 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-03-11 11:17 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, Andrew Morton, SeongJae Park, Liam R. Howlett,
	David Hildenbrand, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm

On Mon, Mar 10, 2025 at 11:36:52PM +0000, Roman Gushchin wrote:
> On Mon, Mar 10, 2025 at 04:15:06PM -0700, Shakeel Butt wrote:
> > On Mon, Mar 10, 2025 at 03:39:21PM -0700, Andrew Morton wrote:
> > > On Mon, 10 Mar 2025 10:23:09 -0700 SeongJae Park <sj@kernel.org> wrote:
> > >
> > > >  It is unclear if such use case
> > > > is common and the inefficiency is significant.
> > >
> > > Well, we could conduct a survey,
> > >
> > > Can you add some logging to detect when userspace performs such an
> > > madvise() call, then run that kernel on some "typical" machines which
> > > are running "typical" workloads?  That should give us a feeling for how
> > > often userspace does this, and hence will help us understand the usefulness
> > > of this patchset.
> >
> > Just for the clarification, this patchset is very useful for the
> > process_madvise() and the experiment results show that.
>
> +1
>
> Google carried an internal version for a vectorized madvise() which
> was much faster than process_madvise() last time I measured it.
> I hope SJ's patchset will (partially) address this difference,
> which will hopefully allow to drop the internal implementation
> for process_madvise.

Relatedly I also feel, at some point, we ought to remove the UIO_FASTIOV
limit on process_madvise().

But one for a future series...



* Re: [PATCH 1/9] mm/madvise: use is_memory_failure() from madvise_do_behavior()
  2025-03-10 17:23 ` [PATCH 1/9] mm/madvise: use is_memory_failure() from madvise_do_behavior() SeongJae Park
@ 2025-03-11 11:27   ` Lorenzo Stoakes
  0 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-03-11 11:27 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand, Shakeel Butt,
	Vlastimil Babka, kernel-team, linux-kernel, linux-mm

On Mon, Mar 10, 2025 at 10:23:10AM -0700, SeongJae Park wrote:
> To reduce redundant open-coded checks of CONFIG_MEMORY_FAILURE and
> MADV_{HWPOISON,SOFT_OFFLINE} in madvise_[un]lock(), is_memory_failure()
> has introduced.  madvise_do_behavior() is still doing the same

NIT: 'is' introduced.

> open-coded check, though.  Use is_memory_failure() instead.
>
> To avoid build failure on !CONFIG_MEMORY_FAILURE case, implement an
> empty madvise_inject_error() under the config.  Also move the definition
> of is_memory_failure() inside #ifdef CONFIG_MEMORY_FAILURE clause for
> madvise_inject_error() definition, to reduce duplicated ifdef clauses.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>

Good improvement, thanks!

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/madvise.c | 49 +++++++++++++++++++++++++++----------------------
>  1 file changed, 27 insertions(+), 22 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 388dc289b5d1..c3ab1f283b18 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1392,7 +1392,32 @@ static int madvise_inject_error(int behavior,
>
>  	return 0;
>  }
> -#endif
> +
> +static bool is_memory_failure(int behavior)
> +{
> +	switch (behavior) {
> +	case MADV_HWPOISON:
> +	case MADV_SOFT_OFFLINE:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
> +#else
> +
> +static int madvise_inject_error(int behavior,
> +		unsigned long start, unsigned long end)
> +{
> +	return 0;
> +}
> +
> +static bool is_memory_failure(int behavior)
> +{
> +	return false;
> +}
> +
> +#endif	/* CONFIG_MEMORY_FAILURE */
>
>  static bool
>  madvise_behavior_valid(int behavior)
> @@ -1569,24 +1594,6 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>  }
>  #endif /* CONFIG_ANON_VMA_NAME */
>
> -#ifdef CONFIG_MEMORY_FAILURE
> -static bool is_memory_failure(int behavior)
> -{
> -	switch (behavior) {
> -	case MADV_HWPOISON:
> -	case MADV_SOFT_OFFLINE:
> -		return true;
> -	default:
> -		return false;
> -	}
> -}
> -#else
> -static bool is_memory_failure(int behavior)
> -{
> -	return false;
> -}
> -#endif
> -
>  static int madvise_lock(struct mm_struct *mm, int behavior)
>  {
>  	if (is_memory_failure(behavior))
> @@ -1640,10 +1647,8 @@ static int madvise_do_behavior(struct mm_struct *mm,
>  	unsigned long end;
>  	int error;
>
> -#ifdef CONFIG_MEMORY_FAILURE
> -	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> +	if (is_memory_failure(behavior))
>  		return madvise_inject_error(behavior, start, start + len_in);
> -#endif
>  	start = untagged_addr_remote(mm, start);
>  	end = start + len;
>
> --
> 2.39.5



* Re: [PATCH 2/9] mm/madvise: split out populate behavior check logic
  2025-03-10 17:23 ` [PATCH 2/9] mm/madvise: split out populate behavior check logic SeongJae Park
@ 2025-03-11 11:29   ` Lorenzo Stoakes
  0 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-03-11 11:29 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand, Shakeel Butt,
	Vlastimil Babka, kernel-team, linux-kernel, linux-mm

On Mon, Mar 10, 2025 at 10:23:11AM -0700, SeongJae Park wrote:
> madvise_do_behavior() has a long open-coded 'behavior' check for
> MADV_POPULATE_{READ,WRITE}.  It adds multiple layers[1] and make the
> code arguably take longer time to read.  Like is_memory_failure(), split
> out the check to a separate function.  This is not technically removing
> the additional layer but discourage further extending the switch-case.
> Also it makes madvise_do_behavior() code shorter and therefore easier to
> read.
>
> [1] https://lore.kernel.org/bd6d0bf1-c79e-46bd-a810-9791efb9ad73@lucifer.local
>
> Signed-off-by: SeongJae Park <sj@kernel.org>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/madvise.c | 20 +++++++++++++-------
>  1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c3ab1f283b18..611db868ae38 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1640,6 +1640,17 @@ static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
>  	return true;
>  }
>
> +static bool is_madvise_populate(int behavior)
> +{
> +	switch (behavior) {
> +	case MADV_POPULATE_READ:
> +	case MADV_POPULATE_WRITE:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
>  static int madvise_do_behavior(struct mm_struct *mm,
>  		unsigned long start, size_t len_in, size_t len, int behavior)
>  {
> @@ -1653,16 +1664,11 @@ static int madvise_do_behavior(struct mm_struct *mm,
>  	end = start + len;
>
>  	blk_start_plug(&plug);
> -	switch (behavior) {
> -	case MADV_POPULATE_READ:
> -	case MADV_POPULATE_WRITE:
> +	if (is_madvise_populate(behavior))
>  		error = madvise_populate(mm, start, end, behavior);
> -		break;
> -	default:
> +	else
>  		error = madvise_walk_vmas(mm, start, end, behavior,
>  					  madvise_vma_behavior);
> -		break;
> -	}
>  	blk_finish_plug(&plug);
>  	return error;
>  }
> --
> 2.39.5



* Re: [PATCH 3/9] mm/madvise: deduplicate madvise_do_behavior() skip case handlings
  2025-03-10 17:23 ` [PATCH 3/9] mm/madvise: deduplicate madvise_do_behavior() skip case handlings SeongJae Park
@ 2025-03-11 12:02   ` Lorenzo Stoakes
  2025-03-11 20:54     ` SeongJae Park
  0 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-03-11 12:02 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand, Shakeel Butt,
	Vlastimil Babka, kernel-team, linux-kernel, linux-mm

On Mon, Mar 10, 2025 at 10:23:12AM -0700, SeongJae Park wrote:
> The logic for checking if a given madvise() request for a single memory
> range can skip real work, namely madvise_do_behavior(), is duplicated in
> do_madvise() and vector_madvise().  Split out the logic to a function
> and resue it.

NIT: typo :)

>
> Signed-off-by: SeongJae Park <sj@kernel.org>

madvise_set_anon_name() seems to do something similar, but somewhat
differently... not sure if you address this in a later commit but worth
looking at too!

Anyway this seems sane, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Note nits!

> ---
>  mm/madvise.c | 53 +++++++++++++++++++++++++++++-----------------------
>  1 file changed, 30 insertions(+), 23 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 611db868ae38..764ec1f2475b 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1640,6 +1640,27 @@ static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
>  	return true;
>  }
>
> +/*
> + * madvise_should_skip() - Return if an madivse request can skip real works.

NIT: 'real works' sounds strange.

I'd say something like 'if the specified behaviour is invalid or nothing
would occur, we skip this operation. In the former case we return an
error.'

> + * @start:	Start address of madvise-requested address range.
> + * @len_in:	Length of madvise-requested address range.
> + * @behavior:	Requested madvise behavor.
> + * @err:	Pointer to store an error code from the check.
> + */
> +static bool madvise_should_skip(unsigned long start, size_t len_in,
> +		int behavior, int *err)
> +{
> +	if (!is_valid_madvise(start, len_in, behavior)) {
> +		*err = -EINVAL;
> +		return true;
> +	}
> +	if (start + PAGE_ALIGN(len_in) == start) {
> +		*err = 0;
> +		return true;
> +	}
> +	return false;
> +}
> +
>  static bool is_madvise_populate(int behavior)
>  {
>  	switch (behavior) {
> @@ -1747,23 +1768,15 @@ static int madvise_do_behavior(struct mm_struct *mm,
>   */
>  int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
>  {
> -	unsigned long end;
>  	int error;
> -	size_t len;
> -
> -	if (!is_valid_madvise(start, len_in, behavior))
> -		return -EINVAL;
> -
> -	len = PAGE_ALIGN(len_in);
> -	end = start + len;
> -
> -	if (end == start)
> -		return 0;
>
> +	if (madvise_should_skip(start, len_in, behavior, &error))
> +		return error;
>  	error = madvise_lock(mm, behavior);
>  	if (error)
>  		return error;
> -	error = madvise_do_behavior(mm, start, len_in, len, behavior);
> +	error = madvise_do_behavior(mm, start, len_in, PAGE_ALIGN(len_in),
> +			behavior);
>  	madvise_unlock(mm, behavior);
>
>  	return error;
> @@ -1790,19 +1803,13 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  	while (iov_iter_count(iter)) {
>  		unsigned long start = (unsigned long)iter_iov_addr(iter);
>  		size_t len_in = iter_iov_len(iter);
> -		size_t len;
> -
> -		if (!is_valid_madvise(start, len_in, behavior)) {
> -			ret = -EINVAL;
> -			break;
> -		}
> +		int error;
>
> -		len = PAGE_ALIGN(len_in);
> -		if (start + len == start)
> -			ret = 0;
> +		if (madvise_should_skip(start, len_in, behavior, &error))
> +			ret = error;
>  		else
> -			ret = madvise_do_behavior(mm, start, len_in, len,
> -					behavior);
> +			ret = madvise_do_behavior(mm, start, len_in,
> +					PAGE_ALIGN(len_in), behavior);
>  		/*
>  		 * An madvise operation is attempting to restart the syscall,
>  		 * but we cannot proceed as it would not be correct to repeat
> --
> 2.39.5



* Re: [PATCH 4/9] mm/madvise: remove len parameter of madvise_do_behavior()
  2025-03-10 17:23 ` [PATCH 4/9] mm/madvise: remove len parameter of madvise_do_behavior() SeongJae Park
@ 2025-03-11 12:05   ` Lorenzo Stoakes
  0 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-03-11 12:05 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand, Shakeel Butt,
	Vlastimil Babka, kernel-team, linux-kernel, linux-mm

On Mon, Mar 10, 2025 at 10:23:13AM -0700, SeongJae Park wrote:
> Because madise_should_skip() logic is factored out, making
> madvise_do_behavior() calculates 'len' on its own rather then receiving
> it as a parameter makes code simpler.  Remove the parameter.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/madvise.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 764ec1f2475b..469c25690a0e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1673,7 +1673,7 @@ static bool is_madvise_populate(int behavior)
>  }
>
>  static int madvise_do_behavior(struct mm_struct *mm,
> -		unsigned long start, size_t len_in, size_t len, int behavior)
> +		unsigned long start, size_t len_in, int behavior)
>  {
>  	struct blk_plug plug;
>  	unsigned long end;
> @@ -1682,7 +1682,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
>  	if (is_memory_failure(behavior))
>  		return madvise_inject_error(behavior, start, start + len_in);
>  	start = untagged_addr_remote(mm, start);
> -	end = start + len;
> +	end = start + PAGE_ALIGN(len_in);
>
>  	blk_start_plug(&plug);
>  	if (is_madvise_populate(behavior))
> @@ -1775,8 +1775,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>  	error = madvise_lock(mm, behavior);
>  	if (error)
>  		return error;
> -	error = madvise_do_behavior(mm, start, len_in, PAGE_ALIGN(len_in),
> -			behavior);
> +	error = madvise_do_behavior(mm, start, len_in, behavior);
>  	madvise_unlock(mm, behavior);
>
>  	return error;
> @@ -1808,8 +1807,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  		if (madvise_should_skip(start, len_in, behavior, &error))
>  			ret = error;
>  		else
> -			ret = madvise_do_behavior(mm, start, len_in,
> -					PAGE_ALIGN(len_in), behavior);
> +			ret = madvise_do_behavior(mm, start, len_in, behavior);
>  		/*
>  		 * An madvise operation is attempting to restart the syscall,
>  		 * but we cannot proceed as it would not be correct to repeat
> --
> 2.39.5



* Re: [PATCH 5/9] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior()
  2025-03-10 17:23 ` [PATCH 5/9] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() SeongJae Park
@ 2025-03-11 12:17   ` Lorenzo Stoakes
  2025-03-11 20:56     ` SeongJae Park
  0 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-03-11 12:17 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand, Shakeel Butt,
	Vlastimil Babka, kernel-team, linux-kernel, linux-mm

On Mon, Mar 10, 2025 at 10:23:14AM -0700, SeongJae Park wrote:
> To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and
> MADV_FREE, an mmu_gather object needs to be passed to the internal
> logic in addition to the behavior integer.  Using a struct makes this
> easy without increasing the number of parameters of all code paths
> towards the internal logic.  Define a struct for the purpose and use it
> on the code path that starts from madvise_do_behavior() and ends on
> madvise_dontneed_free().

Oh a helper struct! I like these!

Nitty but...

I wonder if we should just add the mmu_gather field immediately even if
it isn't used yet?

Also I feel like this patch and 6 should be swapped around, as you are
laying the groundwork here for patch 7 but then doing something unrelated
in 6, unless I'm missing something.

Also maybe add a bit in commit msg about changing the madvise_walk_vmas()
visitor type signature (I wonder if that'd be better as a typedef tbh?)

However, this change looks fine aside from nits (and you know, helper
struct and I'm sold obviously ;) so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

>
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
>  mm/madvise.c | 36 ++++++++++++++++++++++++------------
>  1 file changed, 24 insertions(+), 12 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 469c25690a0e..ba2a78795207 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -890,11 +890,16 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
>  	return true;
>  }
>
> +struct madvise_behavior {
> +	int behavior;
> +};
> +
>  static long madvise_dontneed_free(struct vm_area_struct *vma,
>  				  struct vm_area_struct **prev,
>  				  unsigned long start, unsigned long end,
> -				  int behavior)
> +				  struct madvise_behavior *madv_behavior)

Nitty, but not sure about the need for 'madv_' here. I think keeping this as
'behavior' is fine, as the type is very clear.

>  {
> +	int behavior = madv_behavior->behavior;
>  	struct mm_struct *mm = vma->vm_mm;
>
>  	*prev = vma;
> @@ -1249,8 +1254,10 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
>  static int madvise_vma_behavior(struct vm_area_struct *vma,
>  				struct vm_area_struct **prev,
>  				unsigned long start, unsigned long end,
> -				unsigned long behavior)
> +				void *behavior_arg)
>  {
> +	struct madvise_behavior *arg = behavior_arg;
> +	int behavior = arg->behavior;
>  	int error;
>  	struct anon_vma_name *anon_name;
>  	unsigned long new_flags = vma->vm_flags;
> @@ -1270,7 +1277,7 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>  	case MADV_FREE:
>  	case MADV_DONTNEED:
>  	case MADV_DONTNEED_LOCKED:
> -		return madvise_dontneed_free(vma, prev, start, end, behavior);
> +		return madvise_dontneed_free(vma, prev, start, end, arg);
>  	case MADV_NORMAL:
>  		new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
>  		break;
> @@ -1487,10 +1494,10 @@ static bool process_madvise_remote_valid(int behavior)
>   */
>  static
>  int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> -		      unsigned long end, unsigned long arg,
> +		      unsigned long end, void *arg,
>  		      int (*visit)(struct vm_area_struct *vma,
>  				   struct vm_area_struct **prev, unsigned long start,
> -				   unsigned long end, unsigned long arg))
> +				   unsigned long end, void *arg))
>  {
>  	struct vm_area_struct *vma;
>  	struct vm_area_struct *prev;
> @@ -1548,7 +1555,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
>  static int madvise_vma_anon_name(struct vm_area_struct *vma,
>  				 struct vm_area_struct **prev,
>  				 unsigned long start, unsigned long end,
> -				 unsigned long anon_name)
> +				 void *anon_name)
>  {
>  	int error;
>
> @@ -1557,7 +1564,7 @@ static int madvise_vma_anon_name(struct vm_area_struct *vma,
>  		return -EBADF;
>
>  	error = madvise_update_vma(vma, prev, start, end, vma->vm_flags,
> -				   (struct anon_vma_name *)anon_name);
> +				   anon_name);
>
>  	/*
>  	 * madvise() returns EAGAIN if kernel resources, such as
> @@ -1589,7 +1596,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>  	if (end == start)
>  		return 0;
>
> -	return madvise_walk_vmas(mm, start, end, (unsigned long)anon_name,
> +	return madvise_walk_vmas(mm, start, end, anon_name,
>  				 madvise_vma_anon_name);
>  }
>  #endif /* CONFIG_ANON_VMA_NAME */
> @@ -1673,8 +1680,10 @@ static bool is_madvise_populate(int behavior)
>  }
>
>  static int madvise_do_behavior(struct mm_struct *mm,
> -		unsigned long start, size_t len_in, int behavior)
> +		unsigned long start, size_t len_in,
> +		struct madvise_behavior *madv_behavior)
>  {
> +	int behavior = madv_behavior->behavior;
>  	struct blk_plug plug;
>  	unsigned long end;
>  	int error;
> @@ -1688,7 +1697,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
>  	if (is_madvise_populate(behavior))
>  		error = madvise_populate(mm, start, end, behavior);
>  	else
> -		error = madvise_walk_vmas(mm, start, end, behavior,
> +		error = madvise_walk_vmas(mm, start, end, madv_behavior,
>  					  madvise_vma_behavior);
>  	blk_finish_plug(&plug);
>  	return error;
> @@ -1769,13 +1778,14 @@ static int madvise_do_behavior(struct mm_struct *mm,
>  int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
>  {
>  	int error;
> +	struct madvise_behavior madv_behavior = {.behavior = behavior};
>
>  	if (madvise_should_skip(start, len_in, behavior, &error))
>  		return error;
>  	error = madvise_lock(mm, behavior);
>  	if (error)
>  		return error;
> -	error = madvise_do_behavior(mm, start, len_in, behavior);
> +	error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
>  	madvise_unlock(mm, behavior);
>
>  	return error;
> @@ -1792,6 +1802,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  {
>  	ssize_t ret = 0;
>  	size_t total_len;
> +	struct madvise_behavior madv_behavior = {.behavior = behavior};
>
>  	total_len = iov_iter_count(iter);
>
> @@ -1807,7 +1818,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  		if (madvise_should_skip(start, len_in, behavior, &error))
>  			ret = error;
>  		else
> -			ret = madvise_do_behavior(mm, start, len_in, behavior);
> +			ret = madvise_do_behavior(mm, start, len_in,
> +					&madv_behavior);
>  		/*
>  		 * An madvise operation is attempting to restart the syscall,
>  		 * but we cannot proceed as it would not be correct to repeat
> --
> 2.39.5


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 6/9] mm/memory: split non-tlb flushing part from zap_page_range_single()
  2025-03-10 17:23 ` [PATCH 6/9] mm/memory: split non-tlb flushing part from zap_page_range_single() SeongJae Park
@ 2025-03-11 12:45   ` Lorenzo Stoakes
  2025-03-11 20:58     ` SeongJae Park
  2025-04-01  1:45   ` Liam R. Howlett
  1 sibling, 1 reply; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-03-11 12:45 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand, Shakeel Butt,
	Vlastimil Babka, kernel-team, linux-kernel, linux-mm

On Mon, Mar 10, 2025 at 10:23:15AM -0700, SeongJae Park wrote:
> Some zap_page_range_single() callers, such as [process_]madvise() with
> MADV_DONTNEED[_LOCKED], cannot batch tlb flushes because
> zap_page_range_single() does tlb flushing for each invocation.  Split
> out the body of zap_page_range_single(), except the mmu_gather object
> initialization and gathered tlb entries flushing parts, for such
> batched tlb flushing usage.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
>  mm/memory.c | 36 ++++++++++++++++++++++--------------
>  1 file changed, 22 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 78c7ee62795e..88c478e2ed1a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1995,38 +1995,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
>  	mmu_notifier_invalidate_range_end(&range);
>  }
>
> -/**
> - * zap_page_range_single - remove user pages in a given range
> - * @vma: vm_area_struct holding the applicable pages
> - * @address: starting address of pages to zap
> - * @size: number of bytes to zap
> - * @details: details of shared cache invalidation
> - *
> - * The range must fit into one VMA.
> - */
> -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> +static void unmap_vma_single(struct mmu_gather *tlb,
> +		struct vm_area_struct *vma, unsigned long address,
>  		unsigned long size, struct zap_details *details)
>  {
>  	const unsigned long end = address + size;
>  	struct mmu_notifier_range range;
> -	struct mmu_gather tlb;
>
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
>  				address, end);
>  	hugetlb_zap_begin(vma, &range.start, &range.end);
> -	tlb_gather_mmu(&tlb, vma->vm_mm);
>  	update_hiwater_rss(vma->vm_mm);
>  	mmu_notifier_invalidate_range_start(&range);
>  	/*
>  	 * unmap 'address-end' not 'range.start-range.end' as range
>  	 * could have been expanded for hugetlb pmd sharing.
>  	 */
> -	unmap_single_vma(&tlb, vma, address, end, details, false);
> +	unmap_single_vma(tlb, vma, address, end, details, false);
>  	mmu_notifier_invalidate_range_end(&range);
> -	tlb_finish_mmu(&tlb);
>  	hugetlb_zap_end(vma, details);

Previously hugetlb_zap_end() would happen after tlb_finish_mmu(), now it happens
before?

This seems like a major problem with this change. If not you need to explain why
not in the commit message.

>  }
>
> +/**
> + * zap_page_range_single - remove user pages in a given range
> + * @vma: vm_area_struct holding the applicable pages
> + * @address: starting address of pages to zap
> + * @size: number of bytes to zap
> + * @details: details of shared cache invalidation
> + *
> + * The range must fit into one VMA.
> + */
> +void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> +		unsigned long size, struct zap_details *details)
> +{
> +	struct mmu_gather tlb;
> +
> +	tlb_gather_mmu(&tlb, vma->vm_mm);
> +	unmap_vma_single(&tlb, vma, address, size, details);
> +	tlb_finish_mmu(&tlb);
> +}
> +
>  /**
>   * zap_vma_ptes - remove ptes mapping the vma
>   * @vma: vm_area_struct holding ptes to be zapped
> --
> 2.39.5


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
  2025-03-10 17:23 [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
                   ` (9 preceding siblings ...)
  2025-03-10 22:39 ` [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE Andrew Morton
@ 2025-03-11 12:49 ` Lorenzo Stoakes
  2025-03-11 21:03   ` SeongJae Park
  10 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-03-11 12:49 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand, Shakeel Butt,
	Vlastimil Babka, kernel-team, linux-kernel, linux-mm

The below commit message talks about tlb bits, but you spend a lot of this
series refactoring mm/madvise.c.

Can we just separate out the two into separate series please?

I am doing the same kind of thing with mremap() at the moment, but sent the
refactor _first_ before the work that builds upon it :)

Your refactoring is great, so I want to be able to take that (and Andrew's
objections obviously don't apply there), and then we can address tlb stuff
separately and in a more focused way.

On Mon, Mar 10, 2025 at 10:23:09AM -0700, SeongJae Park wrote:
> When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or
> MADV_FREE with multiple address ranges, tlb flushes happen for each of
> the given address ranges.  Because such tlb flushes are for the same
> process, doing them in a batch is more efficient while still being
> safe.  Modify the process_madvise() entry level code path to do such
> batched tlb flushes, while the internal unmap logic does only the
> gathering of the tlb entries to flush.
>
> In more detail, modify the entry functions to initialize an mmu_gather
> object and pass it to the internal logic.  Then make the internal logic
> do only the gathering of the tlb entries to flush into the received
> mmu_gather object.  After all internal function calls are done, the
> entry functions flush the gathered tlb entries at once.
>
> The inefficiency should be smaller in the madvise() use case, since it
> receives only a single address range.  But if there are multiple vmas
> for the range, the same problem can happen.  It is unclear whether such
> a use case is common and the inefficiency is significant.  Make the
> change for madvise(), too, since it doesn't really change madvise()
> internal behavior while helping keep the code shared between the
> process_madvise() and madvise() internal logic clean.
>
> Patch Sequence
> ==============
>
> First four patches are minor cleanups of madvise.c for readability.
>
> Fifth patch defines a new data structure for managing information
> required for batched tlb flushes (mmu_gather and behavior), and
> updates code paths for MADV_DONTNEED[_LOCKED] and MADV_FREE handling
> internal logic to receive it.
>
> Sixth and seventh patches make the internal logic for handling
> MADV_DONTNEED[_LOCKED] and MADV_FREE ready for batched tlb flushing.
> The patches keep supporting the unbatched tlb flushes use case, for
> fine-grained and safe transitions.
>
> Eighth patch updates madvise() and process_madvise() code to do the
> batched tlb flushes utilizing the previous patches introduced changes.
>
> The final ninth patch removes the internal logic's support code for
> the unbatched tlb flushes use case, which is no longer used.
>
> Test Results
> ============
>
> I measured the latency to apply MADV_DONTNEED advice to 256 MiB memory
> using multiple process_madvise() calls.  I apply the advice in 4 KiB
> sized regions granularity, but with varying batch size per
> process_madvise() call (vlen) from 1 to 1024.  The source code for the
> measurement is available at GitHub[1].  To reduce measurement errors, I
> did the measurement five times.
>
> The measurement results are as below.  'sz_batch' column shows the batch
> size of process_madvise() calls.  'Before' and 'After' columns show the
> average of latencies in nanoseconds that measured five times on kernels
> that built without and with the tlb flushes batching patch of this
> series, respectively.  For the baseline, mm-unstable tree of
> 2025-03-07[2] has been used.  'B-stdev' and 'A-stdev' columns show
> ratios of latency measurements standard deviation to average in percent
> for 'Before' and 'After', respectively.  'Latency_reduction' shows the
> reduction of the latency that the commit has achieved, in percent.
> Higher 'Latency_reduction' values mean more efficiency improvements.
>
>     sz_batch   Before        B-stdev   After         A-stdev   Latency_reduction
>     1          128691595.4   6.09      106878698.4   2.76      16.95
>     2          94046750.8    3.30      68691907      2.79      26.96
>     4          80538496.8    5.35      50230673.8    5.30      37.63
>     8          72672225.2    5.50      43918112      3.54      39.57
>     16         66955104.4    4.25      36549941.2    1.62      45.41
>     32         65814679      5.89      33060291      3.30      49.77
>     64         65099205.2    2.69      26003756.4    1.56      60.06
>     128        62591307.2    4.02      24008180.4    1.61      61.64
>     256        64124368.6    2.93      23868856      2.20      62.78
>     512        62325618      5.72      23311330.6    1.74      62.60
>     1024       64802138.4    5.05      23651695.2    3.40      63.50
>
> Interestingly, the latency has reduced (improved) even with batch size
> 1.  I think some compiler optimizations have affected that, as was also
> observed with the previous process_madvise() mmap_lock optimization
> patch series[3].
>
> So, let's focus on the relation between the improvement and the batch
> size.  As expected, tlb flushes batching provides latency reduction
> proportional to the batch size.  The efficiency gain ranges from about
> 27 percent with batch size 2 up to 63 percent with batch size
> 1,024.
>
> Please note that this is a very simple microbenchmark, so real
> efficiency gain on real workload could be very different.
>
> Changes from RFC
> (https://lore.kernel.org/20250305181611.54484-1-sj@kernel.org)
> - Clarify motivation of the change on the cover letter
> - Add average and stdev of evaluation results
> - Show latency reduction on evaluation results
> - Fix !CONFIG_MEMORY_FAILURE build error
> - Rename is_memory_populate() to is_madvise_populate()
> - Squash patches 5-8
> - Add kerneldoc for unmap_vm_area_struct()
> - Squash patches 10 and 11
> - Squash patches 12-14
> - Squash patches 15 and 16
>
> References
> ==========
>
> [1] https://github.com/sjp38/eval_proc_madvise
> [2] commit e664d7d28a7c ("selftest: test system mappings are sealed") # mm-unstable
> [3] https://lore.kernel.org/20250211182833.4193-1-sj@kernel.org
>
> SeongJae Park (9):
>   mm/madvise: use is_memory_failure() from madvise_do_behavior()
>   mm/madvise: split out populate behavior check logic
>   mm/madvise: deduplicate madvise_do_behavior() skip case handlings
>   mm/madvise: remove len parameter of madvise_do_behavior()
>   mm/madvise: define and use madvise_behavior struct for
>     madvise_do_behavior()
>   mm/memory: split non-tlb flushing part from zap_page_range_single()
>   mm/madvise: let madvise_{dontneed,free}_single_vma() caller batches
>     tlb flushes
>   mm/madvise: batch tlb flushes for
>     [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE})
>   mm/madvise: remove !tlb support from
>     madvise_{dontneed,free}_single_vma()
>
>  mm/internal.h |   3 +
>  mm/madvise.c  | 221 +++++++++++++++++++++++++++++++++-----------------
>  mm/memory.c   |  38 ++++++---
>  3 files changed, 176 insertions(+), 86 deletions(-)
>
>
> base-commit: e993f5f5b0ac851cf60578cfee5488031dfaa80c
> --
> 2.39.5


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 7/9] mm/madvise: let madvise_{dontneed,free}_single_vma() caller batches tlb flushes
  2025-03-10 17:23 ` [PATCH 7/9] mm/madvise: let madvise_{dontneed,free}_single_vma() caller batches tlb flushes SeongJae Park
@ 2025-03-11 13:07   ` Lorenzo Stoakes
  2025-03-11 21:00     ` SeongJae Park
  0 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-03-11 13:07 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand, Shakeel Butt,
	Vlastimil Babka, kernel-team, linux-kernel, linux-mm

Super super UBER nitty but... pretty sure the subject here should be <= 75
chars right? :P

On Mon, Mar 10, 2025 at 10:23:16AM -0700, SeongJae Park wrote:
> Update madvise_dontneed_single_vma() and madvise_free_single_vma()
> functions so that the caller can pass an mmu_gather object that should
> be initialized and will be finished outside, for batched tlb flushes.
> Also modify their internal code to support such usage by skipping the
> initialization and finishing of the self-allocated mmu_gather object
> if a valid mmu_gather object was received.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
>  mm/internal.h |  3 +++
>  mm/madvise.c  | 37 +++++++++++++++++++++++++------------
>  mm/memory.c   | 16 +++++++++++++---
>  3 files changed, 41 insertions(+), 15 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 0caa64dc2cb7..ce7fb2383f65 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -438,6 +438,9 @@ void unmap_page_range(struct mmu_gather *tlb,
>  			     struct vm_area_struct *vma,
>  			     unsigned long addr, unsigned long end,
>  			     struct zap_details *details);
> +void unmap_vma_single(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		      unsigned long addr, unsigned long size,
> +		      struct zap_details *details);
>  int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
>  			   gfp_t gfp);
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index ba2a78795207..d7ea71c6422c 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -794,12 +794,19 @@ static const struct mm_walk_ops madvise_free_walk_ops = {
>  	.walk_lock		= PGWALK_RDLOCK,
>  };
>
> -static int madvise_free_single_vma(struct vm_area_struct *vma,
> -			unsigned long start_addr, unsigned long end_addr)
> +static int madvise_free_single_vma(
> +		struct mmu_gather *caller_tlb, struct vm_area_struct *vma,

I find this interface horrible, and super confusing. It's not clear at all
what's going on here.

Why not use your new helper struct to add a field you can thread through
here?

> +		unsigned long start_addr, unsigned long end_addr)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct mmu_notifier_range range;
> -	struct mmu_gather tlb;
> +	struct mmu_gather self_tlb;
> +	struct mmu_gather *tlb;
> +
> +	if (caller_tlb)
> +		tlb = caller_tlb;
> +	else
> +		tlb = &self_tlb;
>
>  	/* MADV_FREE works for only anon vma at the moment */
>  	if (!vma_is_anonymous(vma))
> @@ -815,16 +822,18 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>  				range.start, range.end);
>
>  	lru_add_drain();
> -	tlb_gather_mmu(&tlb, mm);
> +	if (!caller_tlb)
> +		tlb_gather_mmu(tlb, mm);

Yeah really don't like this.

Abstracting the mmu_gather struct into the helper struct (which I see
you do in a subsequent patch anyway) would be ideal, if you could find a
way to make that work.

But if not, then:

if (behavior->batched_tlb)
	tlb_gather_mmu(&tlb, mm);

etc. etc.

Would work better.

>  	update_hiwater_rss(mm);
>
>  	mmu_notifier_invalidate_range_start(&range);
> -	tlb_start_vma(&tlb, vma);
> +	tlb_start_vma(tlb, vma);

Also not a fan of making tlb refer to a pointer now when before it
didn't... I mean that's more of a nit and maybe unavoidable, but still!

I mean yeah ok this is probably unavoidable, ignore.

>  	walk_page_range(vma->vm_mm, range.start, range.end,
> -			&madvise_free_walk_ops, &tlb);
> -	tlb_end_vma(&tlb, vma);
> +			&madvise_free_walk_ops, tlb);
> +	tlb_end_vma(tlb, vma);
>  	mmu_notifier_invalidate_range_end(&range);
> -	tlb_finish_mmu(&tlb);
> +	if (!caller_tlb)
> +		tlb_finish_mmu(tlb);
>
>  	return 0;
>  }
> @@ -848,7 +857,8 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>   * An interface that causes the system to free clean pages and flush
>   * dirty pages is already available as msync(MS_INVALIDATE).
>   */
> -static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> +static long madvise_dontneed_single_vma(struct mmu_gather *tlb,
> +					struct vm_area_struct *vma,
>  					unsigned long start, unsigned long end)
>  {
>  	struct zap_details details = {
> @@ -856,7 +866,10 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>  		.even_cows = true,
>  	};
>
> -	zap_page_range_single(vma, start, end - start, &details);
> +	if (!tlb)
> +		zap_page_range_single(vma, start, end - start, &details);

Please don't put the negation case first, it's confusing. Swap them!


> +	else
> +		unmap_vma_single(tlb, vma, start, end - start, &details);
>  	return 0;
>  }
>
> @@ -951,9 +964,9 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>  	}
>
>  	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
> -		return madvise_dontneed_single_vma(vma, start, end);
> +		return madvise_dontneed_single_vma(NULL, vma, start, end);
>  	else if (behavior == MADV_FREE)
> -		return madvise_free_single_vma(vma, start, end);
> +		return madvise_free_single_vma(NULL, vma, start, end);

Not to labour the point, but this is also horrid, passing a mystery NULL
parameter first...

>  	else
>  		return -EINVAL;
>  }
> diff --git a/mm/memory.c b/mm/memory.c
> index 88c478e2ed1a..3256b9713cbd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1995,9 +1995,19 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
>  	mmu_notifier_invalidate_range_end(&range);
>  }
>
> -static void unmap_vma_single(struct mmu_gather *tlb,
> -		struct vm_area_struct *vma, unsigned long address,
> -		unsigned long size, struct zap_details *details)
> +/**
> + * unmap_vma_single - remove user pages in a given range
> + * @tlb: pointer to the caller's struct mmu_gather
> + * @vma: vm_area_struct holding the applicable pages
> + * @address: starting address of the pages
> + * @size: number of bytes to remove
> + * @details: details of shared cache invalidation
> + *
> + * @tlb shouldn't be NULL.  The range must fit into one VMA.

Can we add some VM_WARN_ON[_ONCE]()'s for these conditions please?

Thanks for documenting!

> + */
> +void unmap_vma_single(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		      unsigned long address, unsigned long size,
> +		      struct zap_details *details)
>  {
>  	const unsigned long end = address + size;
>  	struct mmu_notifier_range range;
> --
> 2.39.5


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 8/9] mm/madvise: batch tlb flushes for [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE})
  2025-03-10 17:23 ` [PATCH 8/9] mm/madvise: batch tlb flushes for [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE}) SeongJae Park
@ 2025-03-11 13:59   ` Lorenzo Stoakes
  2025-03-11 21:01     ` SeongJae Park
  2025-04-01 21:17   ` SeongJae Park
  1 sibling, 1 reply; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-03-11 13:59 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand, Shakeel Butt,
	Vlastimil Babka, kernel-team, linux-kernel, linux-mm,
	Rik van Riel

+cc Rik on this, as he's working on TLB flush-related stuff. Maybe worth
cc-ing him on series respins too? Unless Rik objects of course :P

Again, nit, but your subject line/first line of commit message is
definitely too long here! :)

On Mon, Mar 10, 2025 at 10:23:17AM -0700, SeongJae Park wrote:
> MADV_DONTNEED[_LOCKED] and MADV_FREE internal logics for
> [process_]madvise() can be invoked with batched tlb flushes.  Update
> vector_madvise() and do_madvise(), which are called for the two system
> calls  respectively, to use those in the efficient way.  Initialize an
> mmu_gather object before starting the internal works, and flush the
> gathered tlb entries at once after all the internal works are done.

super nit but logics -> logic and works -> work :)

I think we need more here as to why you're restricting to
MADV_DONTNEED_LOCKED and MADV_FREE. I see pageout initialises a tlb gather
object, so does cold, etc. etc.?

>
> Signed-off-by: SeongJae Park <sj@kernel.org>

This is really nice, I love how we're able to evolve this towards batching
flushes.

Overall though I'd like you to address some of the concerns here before
giving tags... :)

> ---
>  mm/madvise.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 47 insertions(+), 4 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index d7ea71c6422c..d5f4ce3041a4 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -905,6 +905,7 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
>
>  struct madvise_behavior {
>  	int behavior;
> +	struct mmu_gather *tlb;
>  };

Aha! Good :)

I see in 9/9 you actually pull the caller_tlb stuff out, I still feel like
we should be threading this state through further, if possible, rather than
passing in behavior->tlb as a parameter.

But this is nitty I suppose!

>
>  static long madvise_dontneed_free(struct vm_area_struct *vma,
> @@ -964,9 +965,11 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>  	}
>
>  	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
> -		return madvise_dontneed_single_vma(NULL, vma, start, end);
> +		return madvise_dontneed_single_vma(
> +				madv_behavior->tlb, vma, start, end);
>  	else if (behavior == MADV_FREE)
> -		return madvise_free_single_vma(NULL, vma, start, end);
> +		return madvise_free_single_vma(
> +				madv_behavior->tlb, vma, start, end);

Yeah as I said above be nice to just pass madv_behavior, makes things more
flexible to pass a pointer to the helper struct through, as you can

>  	else
>  		return -EINVAL;
>  }
> @@ -1639,6 +1642,32 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
>  		mmap_read_unlock(mm);
>  }
>
> +static bool madvise_batch_tlb_flush(int behavior)
> +{
> +	switch (behavior) {
> +	case MADV_DONTNEED:
> +	case MADV_DONTNEED_LOCKED:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}

I kind of hate this madvise_ prefix stuff, like we're in mm/madvise.c, it's
pretty obvious static functions are related to madvise :) but this is a
pre-existing thing, not your fault, and it's actually right to maintain
consistency with this.

So this is purely a whine that can be >/dev/null.

> +
> +static void madvise_init_tlb(struct madvise_behavior *madv_behavior,
> +		struct mm_struct *mm)
> +{
> +	if (!madvise_batch_tlb_flush(madv_behavior->behavior))
> +		return;
> +	tlb_gather_mmu(madv_behavior->tlb, mm);
> +}
> +
> +static void madvise_finish_tlb(struct madvise_behavior *madv_behavior)
> +{
> +	if (!madvise_batch_tlb_flush(madv_behavior->behavior))
> +		return;
> +	tlb_finish_mmu(madv_behavior->tlb);
> +}
> +

Nitty, but for both of these, usually I like the guard clause pattern, but
since it's such a trivial thing I think it reads better as:

	if (madvise_batch_tlb_flush(madv_behavior->behavior))
		tlb_gather_mmu(madv_behavior->tlb, mm);

and:

	if (madvise_batch_tlb_flush(madv_behavior->behavior))
		tlb_finish_mmu(madv_behavior->tlb);

>  static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
>  {
>  	size_t len;
> @@ -1791,14 +1820,20 @@ static int madvise_do_behavior(struct mm_struct *mm,
>  int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
>  {
>  	int error;
> -	struct madvise_behavior madv_behavior = {.behavior = behavior};
> +	struct mmu_gather tlb;
> +	struct madvise_behavior madv_behavior = {
> +		.behavior = behavior,
> +		.tlb = &tlb,
> +	};
>
>  	if (madvise_should_skip(start, len_in, behavior, &error))
>  		return error;
>  	error = madvise_lock(mm, behavior);
>  	if (error)
>  		return error;
> +	madvise_init_tlb(&madv_behavior, mm);
>  	error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
> +	madvise_finish_tlb(&madv_behavior);
>  	madvise_unlock(mm, behavior);
>
>  	return error;
> @@ -1815,13 +1850,18 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  {
>  	ssize_t ret = 0;
>  	size_t total_len;
> -	struct madvise_behavior madv_behavior = {.behavior = behavior};
> +	struct mmu_gather tlb;
> +	struct madvise_behavior madv_behavior = {
> +		.behavior = behavior,
> +		.tlb = &tlb,
> +	};
>
>  	total_len = iov_iter_count(iter);
>
>  	ret = madvise_lock(mm, behavior);
>  	if (ret)
>  		return ret;
> +	madvise_init_tlb(&madv_behavior, mm);
>
>  	while (iov_iter_count(iter)) {
>  		unsigned long start = (unsigned long)iter_iov_addr(iter);
> @@ -1850,14 +1890,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  			}
>
>  			/* Drop and reacquire lock to unwind race. */
> +			madvise_finish_tlb(&madv_behavior);
>  			madvise_unlock(mm, behavior);
>  			madvise_lock(mm, behavior);
> +			madvise_init_tlb(&madv_behavior, mm);

Fun, but I guess necessary. I strongly suspect this code path will never
actually happen, but we need to account for it.

>  			continue;
>  		}
>  		if (ret < 0)
>  			break;
>  		iov_iter_advance(iter, iter_iov_len(iter));
>  	}
> +	madvise_finish_tlb(&madv_behavior);
>  	madvise_unlock(mm, behavior);
>
>  	ret = (total_len - iov_iter_count(iter)) ? : ret;
> --
> 2.39.5


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 9/9] mm/madvise: remove !tlb support from madvise_{dontneed,free}_single_vma()
  2025-03-10 17:23 ` [PATCH 9/9] mm/madvise: remove !tlb support from madvise_{dontneed,free}_single_vma() SeongJae Park
@ 2025-03-11 14:01   ` Lorenzo Stoakes
  2025-03-11 21:02     ` SeongJae Park
  0 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-03-11 14:01 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand, Shakeel Butt,
	Vlastimil Babka, kernel-team, linux-kernel, linux-mm

On Mon, Mar 10, 2025 at 10:23:18AM -0700, SeongJae Park wrote:
> madvise_dontneed_single_vma() and madvise_free_single_vma() support both
> batched tlb flushes and unbatched tlb flushes use cases depending on
> received tlb parameter's value.  The supports were for safe and fine
> transition of the usages from the unbatched flushes to the batched ones.
> Now the transition is done, and therefore there is no real unbatched tlb
> flushes use case.  Remove the code for supporting the no more being used
> cases.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>

Obviously I support this based on previous preview :) but I wonder if we
can avoid this horrid caller_tlb pattern in the first instance.

FWIW:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/madvise.c | 19 ++-----------------
>  1 file changed, 2 insertions(+), 17 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index d5f4ce3041a4..25af0a24c00b 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -795,18 +795,11 @@ static const struct mm_walk_ops madvise_free_walk_ops = {
>  };
>
>  static int madvise_free_single_vma(
> -		struct mmu_gather *caller_tlb, struct vm_area_struct *vma,
> +		struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		unsigned long start_addr, unsigned long end_addr)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct mmu_notifier_range range;
> -	struct mmu_gather self_tlb;
> -	struct mmu_gather *tlb;
> -
> -	if (caller_tlb)
> -		tlb = caller_tlb;
> -	else
> -		tlb = &self_tlb;
>
>  	/* MADV_FREE works for only anon vma at the moment */
>  	if (!vma_is_anonymous(vma))
> @@ -822,8 +815,6 @@ static int madvise_free_single_vma(
>  				range.start, range.end);
>
>  	lru_add_drain();
> -	if (!caller_tlb)
> -		tlb_gather_mmu(tlb, mm);
>  	update_hiwater_rss(mm);
>
>  	mmu_notifier_invalidate_range_start(&range);
> @@ -832,9 +823,6 @@ static int madvise_free_single_vma(
>  			&madvise_free_walk_ops, tlb);
>  	tlb_end_vma(tlb, vma);
>  	mmu_notifier_invalidate_range_end(&range);
> -	if (!caller_tlb)
> -		tlb_finish_mmu(tlb);
> -
>  	return 0;
>  }
>
> @@ -866,10 +854,7 @@ static long madvise_dontneed_single_vma(struct mmu_gather *tlb,
>  		.even_cows = true,
>  	};
>
> -	if (!tlb)
> -		zap_page_range_single(vma, start, end - start, &details);
> -	else
> -		unmap_vma_single(tlb, vma, start, end - start, &details);
> +	unmap_vma_single(tlb, vma, start, end - start, &details);
>  	return 0;
>  }
>
> --
> 2.39.5



* Re: [PATCH 3/9] mm/madvise: deduplicate madvise_do_behavior() skip case handlings
  2025-03-11 12:02   ` Lorenzo Stoakes
@ 2025-03-11 20:54     ` SeongJae Park
  0 siblings, 0 replies; 42+ messages in thread
From: SeongJae Park @ 2025-03-11 20:54 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: SeongJae Park, Andrew Morton, Liam R. Howlett, David Hildenbrand,
	Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm

On Tue, 11 Mar 2025 12:02:48 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Mon, Mar 10, 2025 at 10:23:12AM -0700, SeongJae Park wrote:
> > The logic for checking if a given madvise() request for a single memory
> > range can skip real work, namely madvise_do_behavior(), is duplicated in
> > do_madvise() and vector_madvise().  Split out the logic to a function
> > and resue it.
> 
> NIT: typo :)
> 
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> 
> madvise_set_anon_name() seems to do something similar, but somewhat
> differently... not sure if you address this in a later commit but worth
> looking at too!

Thank you for sharing this good finding!  I don't have a plan to address it
for now.

> 
> Anyway this seems sane, so:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thank you for your review!

> 
> Note nits!
> 
> > ---
> >  mm/madvise.c | 53 +++++++++++++++++++++++++++++-----------------------
> >  1 file changed, 30 insertions(+), 23 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 611db868ae38..764ec1f2475b 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1640,6 +1640,27 @@ static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
> >  	return true;
> >  }
> >
> > +/*
> > + * madvise_should_skip() - Return if an madivse request can skip real works.
> 
> NIT: 'real works' sounds strange.
> 
> I'd say something like 'if the specified behaviour is invalid or nothing
> would occur, we skip this operation. In the former case we return an
> error.'

Thanks for this nice suggestion.  I will update it accordingly in the next version.


Thanks,
SJ

[...]



* Re: [PATCH 5/9] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior()
  2025-03-11 12:17   ` Lorenzo Stoakes
@ 2025-03-11 20:56     ` SeongJae Park
  2025-03-12  5:47       ` Lorenzo Stoakes
  0 siblings, 1 reply; 42+ messages in thread
From: SeongJae Park @ 2025-03-11 20:56 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: SeongJae Park, Andrew Morton, Liam R. Howlett, David Hildenbrand,
	Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm

On Tue, 11 Mar 2025 12:17:40 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Mon, Mar 10, 2025 at 10:23:14AM -0700, SeongJae Park wrote:
> > To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and
> > MADV_FREE, an mmu_gather object in addition to the behavior integer need
> > to be passed to the internal logics.  Using a struct can make it easy
> > without increasing the number of parameters of all code paths towards
> > the internal logic.  Define a struct for the purpose and use it on the
> > code path that starts from madvise_do_behavior() and ends on
> > madvise_dontneed_free().
> 
> Oh a helper struct! I like these!
> 
> Nitty but...
> 
> I wonder if we should just add the the mmu_gather field immediately even if
> it isn't used yet?

I will do so in the next spin.

> 
> Also I feel like this patch and 6 should be swapped around, as you are
> laying the groundwork here for patch 7 but then doing something unrelated
> in 6, unless I'm missing something.

I actually introduced patch 6 before this one initially.  But I started
thinking this patch could help not only tlb flush batching but also potential
future madvise behavior extensions, and ended up changing the order to the
current one.  I have no strong opinion and your suggestion also sounds nice to
me.  I will swap those in the next version unless someone objects.

> 
> Also maybe add a bit in commit msg about changing the madvise_walk_vmas()
> visitor type signature

I will do so, in the next version.

> (I wonder if that'd be better as a typedef tbh?)

Something like below?

    typedef void *madvise_walk_arg;

I think that could make the code easier to read.  But I feel the void pointer
is also not so bad for the current simple static-function use case, so I'd
like to keep this as is if you don't mind.

Please let me know if I'm missing your point.

> 
> However, this change looks fine aside from nits (and you know, helper
> struct and I'm sold obviously ;) so:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thank you! :)

> 
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> >  mm/madvise.c | 36 ++++++++++++++++++++++++------------
> >  1 file changed, 24 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 469c25690a0e..ba2a78795207 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -890,11 +890,16 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
> >  	return true;
> >  }
> >
> > +struct madvise_behavior {
> > +	int behavior;
> > +};
> > +
> >  static long madvise_dontneed_free(struct vm_area_struct *vma,
> >  				  struct vm_area_struct **prev,
> >  				  unsigned long start, unsigned long end,
> > -				  int behavior)
> > +				  struct madvise_behavior *madv_behavior)
> 
> Nitty, but not sure about the need for 'madv_' here. I think keeping this as
> 'behavior' is fine, as the type is very clear.

Agreed.  I wanted to reduce the name conflict causing diff lines, but I think
your suggestion is the right thing to do for long term.


Thanks,
SJ

[...]



* Re: [PATCH 6/9] mm/memory: split non-tlb flushing part from zap_page_range_single()
  2025-03-11 12:45   ` Lorenzo Stoakes
@ 2025-03-11 20:58     ` SeongJae Park
  2025-03-31 20:24       ` SeongJae Park
  0 siblings, 1 reply; 42+ messages in thread
From: SeongJae Park @ 2025-03-11 20:58 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: SeongJae Park, Andrew Morton, Liam R. Howlett, David Hildenbrand,
	Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm, Rik van Riel

On Tue, 11 Mar 2025 12:45:44 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Mon, Mar 10, 2025 at 10:23:15AM -0700, SeongJae Park wrote:
> > Some of zap_page_range_single() callers such as [process_]madvise() with
> > MADV_DONEED[_LOCKED] cannot batch tlb flushes because
> > zap_page_range_single() does tlb flushing for each invocation.  Split
> > out the body of zap_page_range_single() except mmu_gather object
> > initialization and gathered tlb entries flushing parts for such batched
> > tlb flushing usage.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> >  mm/memory.c | 36 ++++++++++++++++++++++--------------
> >  1 file changed, 22 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 78c7ee62795e..88c478e2ed1a 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1995,38 +1995,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> >  	mmu_notifier_invalidate_range_end(&range);
> >  }
> >
> > -/**
> > - * zap_page_range_single - remove user pages in a given range
> > - * @vma: vm_area_struct holding the applicable pages
> > - * @address: starting address of pages to zap
> > - * @size: number of bytes to zap
> > - * @details: details of shared cache invalidation
> > - *
> > - * The range must fit into one VMA.
> > - */
> > -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > +static void unmap_vma_single(struct mmu_gather *tlb,
> > +		struct vm_area_struct *vma, unsigned long address,
> >  		unsigned long size, struct zap_details *details)
> >  {
> >  	const unsigned long end = address + size;
> >  	struct mmu_notifier_range range;
> > -	struct mmu_gather tlb;
> >
> >  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> >  				address, end);
> >  	hugetlb_zap_begin(vma, &range.start, &range.end);
> > -	tlb_gather_mmu(&tlb, vma->vm_mm);
> >  	update_hiwater_rss(vma->vm_mm);
> >  	mmu_notifier_invalidate_range_start(&range);
> >  	/*
> >  	 * unmap 'address-end' not 'range.start-range.end' as range
> >  	 * could have been expanded for hugetlb pmd sharing.
> >  	 */
> > -	unmap_single_vma(&tlb, vma, address, end, details, false);
> > +	unmap_single_vma(tlb, vma, address, end, details, false);
> >  	mmu_notifier_invalidate_range_end(&range);
> > -	tlb_finish_mmu(&tlb);
> >  	hugetlb_zap_end(vma, details);
> 
> Previously hugetlb_zap_end() would happen after tlb_finish_mmu(), now it happens
> before?
> 
> This seems like a major problem with this change.

Oh, you're right.  This could re-introduce the racy hugetlb allocation failure
problem that was fixed by commit 2820b0f09be9 ("hugetlbfs: close race between
MADV_DONTNEED and page fault").  That is, this patch could make hugetlb
allocation failures increase while MADV_DONTNEED is in progress.

Maybe a straightforward fix for the problem is doing hugetlb_zap_end() for all
vmas in a batched manner, similar to the tlb flushes.  For example, add a
list or an array for the vmas in 'struct madvise_behavior', let
'unmap_vma_single()' add each vma there, and call hugetlb_zap_end() for the
gathered vmas in vector_madvise() or do_madvise().  Does that make sense?

Also Cc-ing Rik, who is the author of commit 2820b0f09be9 ("hugetlbfs:
close race between MADV_DONTNEED and page fault"), in case I'm missing
something important.

> If not you need to explain why
> not in the commit message.

I now think it is a problem.  If it turns out I'm wrong, I will of course add
the reason to the commit message.


Thanks,
SJ

[...]



* Re: [PATCH 7/9] mm/madvise: let madvise_{dontneed,free}_single_vma() caller batches tlb flushes
  2025-03-11 13:07   ` Lorenzo Stoakes
@ 2025-03-11 21:00     ` SeongJae Park
  0 siblings, 0 replies; 42+ messages in thread
From: SeongJae Park @ 2025-03-11 21:00 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: SeongJae Park, Andrew Morton, Liam R. Howlett, David Hildenbrand,
	Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm

On Tue, 11 Mar 2025 13:07:25 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> Super super UBER nitty but... pretty sure the subject here should be <= 75
> chars right? :P

I believe that's not a hard limit, but I will try to make it shorter in the
next spin.

> 
> On Mon, Mar 10, 2025 at 10:23:16AM -0700, SeongJae Park wrote:
> > Update madvise_dontneed_single_vma() and madvise_free_single_vma()
> > functions so that the caller can pass an mmu_gather object that should
> > be initialized and will be finished outside, for batched tlb flushes.
> > Also modify their internal code to support such usage by skipping the
> > initialization and finishing of self-allocated mmu_gather object if it
> > received a valid mmu_gather object.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> >  mm/internal.h |  3 +++
> >  mm/madvise.c  | 37 +++++++++++++++++++++++++------------
> >  mm/memory.c   | 16 +++++++++++++---
> >  3 files changed, 41 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 0caa64dc2cb7..ce7fb2383f65 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -438,6 +438,9 @@ void unmap_page_range(struct mmu_gather *tlb,
> >  			     struct vm_area_struct *vma,
> >  			     unsigned long addr, unsigned long end,
> >  			     struct zap_details *details);
> > +void unmap_vma_single(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > +		      unsigned long addr, unsigned long size,
> > +		      struct zap_details *details);
> >  int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
> >  			   gfp_t gfp);
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index ba2a78795207..d7ea71c6422c 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -794,12 +794,19 @@ static const struct mm_walk_ops madvise_free_walk_ops = {
> >  	.walk_lock		= PGWALK_RDLOCK,
> >  };
> >
> > -static int madvise_free_single_vma(struct vm_area_struct *vma,
> > -			unsigned long start_addr, unsigned long end_addr)
> > +static int madvise_free_single_vma(
> > +		struct mmu_gather *caller_tlb, struct vm_area_struct *vma,
> 
> I find this interface horrible, and super confusing. It's not clear at all
> what's going on here.
> 
> Why not use your new helper struct to add a field you can thread through
> here?

I will do so in the next spin.

> 
> > +		unsigned long start_addr, unsigned long end_addr)
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	struct mmu_notifier_range range;
> > -	struct mmu_gather tlb;
> > +	struct mmu_gather self_tlb;
> > +	struct mmu_gather *tlb;
> > +
> > +	if (caller_tlb)
> > +		tlb = caller_tlb;
> > +	else
> > +		tlb = &self_tlb;
> >
> >  	/* MADV_FREE works for only anon vma at the moment */
> >  	if (!vma_is_anonymous(vma))
> > @@ -815,16 +822,18 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> >  				range.start, range.end);
> >
> >  	lru_add_drain();
> > -	tlb_gather_mmu(&tlb, mm);
> > +	if (!caller_tlb)
> > +		tlb_gather_mmu(tlb, mm);
> 
> Yeah really don't like this.
> 
> Ideally we'd abstract the mmu_gather struct to the helper struct (which I
> see you do in a subsequent patch anyway) would be ideal if you could find a
> way to make that work.
> 
> But if not, then:
> 
> if (behavior->batched_tlb)
> 	tlb_gather_mmu(&tlb, mm);
> 
> etc. etc.
> 
> Would work better.

Agreed.

> 
> >  	update_hiwater_rss(mm);
> >
> >  	mmu_notifier_invalidate_range_start(&range);
> > -	tlb_start_vma(&tlb, vma);
> > +	tlb_start_vma(tlb, vma);
> 
> Also not a fan of making tlb refer to a pointer now when before it
> didn't... I mean that's more of a nit and maybe unavoidable, but still!
> 
> I mean yeah ok this is probably unavoidable, ignore.

Yeah... I also haven't found a good way to make this much cleaner without the
follow-up cleanup, for now.

> 
> >  	walk_page_range(vma->vm_mm, range.start, range.end,
> > -			&madvise_free_walk_ops, &tlb);
> > -	tlb_end_vma(&tlb, vma);
> > +			&madvise_free_walk_ops, tlb);
> > +	tlb_end_vma(tlb, vma);
> >  	mmu_notifier_invalidate_range_end(&range);
> > -	tlb_finish_mmu(&tlb);
> > +	if (!caller_tlb)
> > +		tlb_finish_mmu(tlb);
> >
> >  	return 0;
> >  }
> > @@ -848,7 +857,8 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> >   * An interface that causes the system to free clean pages and flush
> >   * dirty pages is already available as msync(MS_INVALIDATE).
> >   */
> > -static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > +static long madvise_dontneed_single_vma(struct mmu_gather *tlb,
> > +					struct vm_area_struct *vma,
> >  					unsigned long start, unsigned long end)
> >  {
> >  	struct zap_details details = {
> > @@ -856,7 +866,10 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> >  		.even_cows = true,
> >  	};
> >
> > -	zap_page_range_single(vma, start, end - start, &details);
> > +	if (!tlb)
> > +		zap_page_range_single(vma, start, end - start, &details);
> 
> Please don't put the negation case first, it's confusing. Swap them!

Ok, I will do so.

> 
> 
> > +	else
> > +		unmap_vma_single(tlb, vma, start, end - start, &details);
> >  	return 0;
> >  }
> >
> > @@ -951,9 +964,9 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >  	}
> >
> >  	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
> > -		return madvise_dontneed_single_vma(vma, start, end);
> > +		return madvise_dontneed_single_vma(NULL, vma, start, end);
> >  	else if (behavior == MADV_FREE)
> > -		return madvise_free_single_vma(vma, start, end);
> > +		return madvise_free_single_vma(NULL, vma, start, end);
> 
> Not to labour the point, but this is also horrid, passing a mystery NULL
> parameter first...

Agreed again.  I will just pass the madvise_behavior struct in the next spin.

> 
> >  	else
> >  		return -EINVAL;
> >  }
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 88c478e2ed1a..3256b9713cbd 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1995,9 +1995,19 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> >  	mmu_notifier_invalidate_range_end(&range);
> >  }
> >
> > -static void unmap_vma_single(struct mmu_gather *tlb,
> > -		struct vm_area_struct *vma, unsigned long address,
> > -		unsigned long size, struct zap_details *details)
> > +/**
> > + * unmap_vma_single - remove user pages in a given range
> > + * @tlb: pointer to the caller's struct mmu_gather
> > + * @vma: vm_area_struct holding the applicable pages
> > + * @address: starting address of the pages
> > + * @size: number of bytes to remove
> > + * @details: details of shared cache invalidation
> > + *
> > + * @tlb shouldn't be NULL.  The range must fit into one VMA.
> 
> Can we add some VM_WARN_ON[_ONCE]()'s for these conditions please?

Nice suggestion, I will do so.

> 
> Thanks for documenting!

Kudos to Shakeel, who suggested this kerneldoc comment :)


Thanks,
SJ

[...]



* Re: [PATCH 8/9] mm/madvise: batch tlb flushes for [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE})
  2025-03-11 13:59   ` Lorenzo Stoakes
@ 2025-03-11 21:01     ` SeongJae Park
  0 siblings, 0 replies; 42+ messages in thread
From: SeongJae Park @ 2025-03-11 21:01 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: SeongJae Park, Andrew Morton, Liam R. Howlett, David Hildenbrand,
	Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm, Rik van Riel

On Tue, 11 Mar 2025 13:59:10 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> +cc Rik on this, as he's working on TLB flush-related stuff. Maybe worth
> cc-ing him on series respins too? Unless Rik objects of course :P
> 
> Again, nit, but your subject line/first line of commit message is
> definitely too long here! :)

I will reduce.

> 
> On Mon, Mar 10, 2025 at 10:23:17AM -0700, SeongJae Park wrote:
> > MADV_DONTNEED[_LOCKED] and MADV_FREE internal logics for
> > [process_]madvise() can be invoked with batched tlb flushes.  Update
> > vector_madvise() and do_madvise(), which are called for the two system
> > calls  respectively, to use those in the efficient way.  Initialize an
> > mmu_gather object before starting the internal works, and flush the
> > gathered tlb entries at once after all the internal works are done.
> 
> super nit but logics -> logic and works -> work :)
> 
> I think we need more here as to why you're restricting to
> MADV_DONTNEED_LOCKED and MADV_FREE. I see pageout initialises a tlb gather
> object, so does cold, etc. etc.?

Good point.  I'm just trying to start with small things.  I will clarify this
in the next spin.

> 
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> 
> This is really nice, I love how we're able to evolve this towards batching
> flushes.
> 
> Overall though I'd like you to address some of the concerns here before
> giving tags... :)

Thank you for nice comments! :)

> 
> > ---
> >  mm/madvise.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 47 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index d7ea71c6422c..d5f4ce3041a4 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -905,6 +905,7 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
> >
> >  struct madvise_behavior {
> >  	int behavior;
> > +	struct mmu_gather *tlb;
> >  };
> 
> Aha! Good :)
> 
> I see in 9/9 you actually pull the caller_tlb stuff out, I still feel like
> we should be threading this state through further, if possible, rather than
> passing in behavior->tlb as a parameter.

Yes, I will do so.

> 
> But this is nitty I suppose!
> 
> >
> >  static long madvise_dontneed_free(struct vm_area_struct *vma,
> > @@ -964,9 +965,11 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >  	}
> >
> >  	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
> > -		return madvise_dontneed_single_vma(NULL, vma, start, end);
> > +		return madvise_dontneed_single_vma(
> > +				madv_behavior->tlb, vma, start, end);
> >  	else if (behavior == MADV_FREE)
> > -		return madvise_free_single_vma(NULL, vma, start, end);
> > +		return madvise_free_single_vma(
> > +				madv_behavior->tlb, vma, start, end);
> 
> Yeah as I said above be nice to just pass madv_behavior, makes things more
> flexible to pass a pointer to the helper struct through, as you can

Yes.

> 
> >  	else
> >  		return -EINVAL;
> >  }
> > @@ -1639,6 +1642,32 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
> >  		mmap_read_unlock(mm);
> >  }
> >
> > +static bool madvise_batch_tlb_flush(int behavior)
> > +{
> > +	switch (behavior) {
> > +	case MADV_DONTNEED:
> > +	case MADV_DONTNEED_LOCKED:
> > +		return true;
> > +	default:
> > +		return false;
> > +	}
> > +}
> 
> I kind of hate this madvise_ prefix stuff, like we're in mm/madvise.c, it's
> pretty obvious static functions are related to madvise :) but this is a
> pre-existing thing, not your fault, and it's actually right to maintain
> consistency with this.
> 
> So this is purely a whine that can be >/dev/null.

Thank you for understanding :)

> 
> > +
> > +static void madvise_init_tlb(struct madvise_behavior *madv_behavior,
> > +		struct mm_struct *mm)
> > +{
> > +	if (!madvise_batch_tlb_flush(madv_behavior->behavior))
> > +		return;
> > +	tlb_gather_mmu(madv_behavior->tlb, mm);
> > +}
> > +
> > +static void madvise_finish_tlb(struct madvise_behavior *madv_behavior)
> > +{
> > +	if (!madvise_batch_tlb_flush(madv_behavior->behavior))
> > +		return;
> > +	tlb_finish_mmu(madv_behavior->tlb);
> > +}
> > +
> 
> Nitty, but for both of these, usually I like the guard clause pattern, but
> since it's such a trivial thing I think it reads better as:
> 
> 	if (madvise_batch_tlb_flush(madv_behavior->behavior))
> 		tlb_gather_mmu(madv_behavior->tlb, mm);
> 
> and:
> 
> 	if (madvise_batch_tlb_flush(madv_behavior->behavior))
> 		tlb_finish_mmu(madv_behavior->tlb);

Totally agreed, thank you for catching this.


Thanks,
SJ

[...]



* Re: [PATCH 9/9] mm/madvise: remove !tlb support from madvise_{dontneed,free}_single_vma()
  2025-03-11 14:01   ` Lorenzo Stoakes
@ 2025-03-11 21:02     ` SeongJae Park
  2025-03-12 13:46       ` Lorenzo Stoakes
  0 siblings, 1 reply; 42+ messages in thread
From: SeongJae Park @ 2025-03-11 21:02 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: SeongJae Park, Andrew Morton, Liam R. Howlett, David Hildenbrand,
	Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm

On Tue, 11 Mar 2025 14:01:20 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Mon, Mar 10, 2025 at 10:23:18AM -0700, SeongJae Park wrote:
> > madvise_dontneed_single_vma() and madvise_free_single_vma() support both
> > batched tlb flushes and unbatched tlb flushes use cases depending on
> > received tlb parameter's value.  The supports were for safe and fine
> > transition of the usages from the unbatched flushes to the batched ones.
> > Now the transition is done, and therefore there is no real unbatched tlb
> > flushes use case.  Remove the code for supporting the no more being used
> > cases.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> 
> Obviously I support this based on previous preview :) but I wonder if we
> can avoid this horrid caller_tlb pattern in the first instance.

I will try, though I have no good idea for that for now.

Maybe we could simply squash patches 7-9.  I'm a bit concerned that it would
make the changes unnecessarily mixed and not small, but I have no strong
opinion about it.  Please feel free to let me know if you want that.

> 
> FWIW:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Appreciate your reviews!


Thanks,
SJ

[...]



* Re: [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
  2025-03-11 12:49 ` Lorenzo Stoakes
@ 2025-03-11 21:03   ` SeongJae Park
  0 siblings, 0 replies; 42+ messages in thread
From: SeongJae Park @ 2025-03-11 21:03 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: SeongJae Park, Andrew Morton, Liam R. Howlett, David Hildenbrand,
	Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm

On Tue, 11 Mar 2025 12:49:38 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> The below commit message talks about tlb bits, but you spend a lot of this
> series refactoring mm/madvise.c.
> 
> Can we just separate out the two into separate series please?
> 
> I am doing the same kind of thing with mremap() at the moment, but sent the
> refactor _first_ before the work that builds upon it :)
> 
> Your refactoring is great, so I want to be able to take that (and Andrew's
> objections obviously don't apply there), and then we can address tlb stuff
> separately and in a more focused way.

Thank you for the nice suggestion and I agree.  I will separate those in the
next spin.


Thanks,
SJ

[...]



* Re: [PATCH 5/9] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior()
  2025-03-11 20:56     ` SeongJae Park
@ 2025-03-12  5:47       ` Lorenzo Stoakes
  2025-03-12 17:23         ` SeongJae Park
  0 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-03-12  5:47 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand, Shakeel Butt,
	Vlastimil Babka, kernel-team, linux-kernel, linux-mm

On Tue, Mar 11, 2025 at 01:56:17PM -0700, SeongJae Park wrote:
> On Tue, 11 Mar 2025 12:17:40 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > On Mon, Mar 10, 2025 at 10:23:14AM -0700, SeongJae Park wrote:
> > > To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and
> > > MADV_FREE, an mmu_gather object in addition to the behavior integer need
> > > to be passed to the internal logics.  Using a struct can make it easy
> > > without increasing the number of parameters of all code paths towards
> > > the internal logic.  Define a struct for the purpose and use it on the
> > > code path that starts from madvise_do_behavior() and ends on
> > > madvise_dontneed_free().
> >
> > Oh a helper struct! I like these!
> >
> > Nitty but...
> >
> > I wonder if we should just add the the mmu_gather field immediately even if
> > it isn't used yet?
>
> I will do so in the next spin.
>
> >
> > Also I feel like this patch and 6 should be swapped around, as you are
> > laying the groundwork here for patch 7 but then doing something unrelated
> > in 6, unless I'm missing something.
>
> I actually introduced patch 6 before this one initially.  But I started
> thinking this patch could help not only tlb flush batching but also potential
> future madvise behavior extensions, and ended up changing the order to the
> current one.  I have no strong opinion and your suggestion also sounds nice to
> me.  I will swap those in the next version unless someone objects.
>
> >
> > Also maybe add a bit in commit msg about changing the madvise_walk_vmas()
> > visitor type signature
>
> I will do so, in the next version.
>
> > (I wonder if that'd be better as a typedef tbh?)
>
> Something like below?
>
>     typedef void *madvise_walk_arg;
>
> I think that could make the code easier to read.  But I feel the void pointer
> is also not so bad for the current simple static-function use case, so I'd
> like to keep this as is if you don't mind.
>
> Please let me know if I'm missing your point.

No, to be clear I meant the:

int (*visit)(struct vm_area_struct *vma,
				   struct vm_area_struct **prev, unsigned long start,
				   unsigned long end, unsigned long arg)

Function pointer.

But this is not a big deal and let's leave it as-is for now, we can address
this later potentially! :)

>
> >
> > However, this change looks fine aside from nits (and you know, helper
> > struct and I'm sold obviously ;) so:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Thank you! :)
>
> >
> > >
> > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > > ---
> > >  mm/madvise.c | 36 ++++++++++++++++++++++++------------
> > >  1 file changed, 24 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 469c25690a0e..ba2a78795207 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -890,11 +890,16 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
> > >  	return true;
> > >  }
> > >
> > > +struct madvise_behavior {
> > > +	int behavior;
> > > +};
> > > +
> > >  static long madvise_dontneed_free(struct vm_area_struct *vma,
> > >  				  struct vm_area_struct **prev,
> > >  				  unsigned long start, unsigned long end,
> > > -				  int behavior)
> > > +				  struct madvise_behavior *madv_behavior)
> >
> > Nitty, but not sure about the need for 'madv_' here. I think keeping this as
> > 'behavior' is fine, as the type is very clear.
>
> Agreed.  I wanted to reduce the diff lines caused by the name conflict, but I
> think your suggestion is the right thing to do for the long term.
>
>
> Thanks,
> SJ
>
> [...]

Thanks for being so flexible on the feedback! Appreciated :>)


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 9/9] mm/madvise: remove !tlb support from madvise_{dontneed,free}_single_vma()
  2025-03-11 21:02     ` SeongJae Park
@ 2025-03-12 13:46       ` Lorenzo Stoakes
  2025-04-01 21:22         ` SeongJae Park
  0 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-03-12 13:46 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand, Shakeel Butt,
	Vlastimil Babka, kernel-team, linux-kernel, linux-mm

On Tue, Mar 11, 2025 at 02:02:11PM -0700, SeongJae Park wrote:
> On Tue, 11 Mar 2025 14:01:20 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > On Mon, Mar 10, 2025 at 10:23:18AM -0700, SeongJae Park wrote:
> > > madvise_dontneed_single_vma() and madvise_free_single_vma() support both
> > > batched and unbatched tlb flush use cases depending on the received tlb
> > > parameter's value.  The support was for a safe and smooth transition of
> > > the usages from the unbatched flushes to the batched ones.  Now the
> > > transition is done, and therefore there is no real unbatched tlb flush
> > > use case.  Remove the code supporting the no-longer-used cases.
> > >
> > > Signed-off-by: SeongJae Park <sj@kernel.org>
> >
> > Obviously I support this based on previous preview :) but I wonder if we
> > can avoid this horrid caller_tlb pattern in the first instance.
>
> I will try, though I have no good idea for that for now.
>
> Maybe we could simply squash patches 7-9.  I'm a bit concerned it would make
> the changes unnecessarily mixed and not small, but I have no strong opinion
> about it.  Please feel free to let me know if you want that.

Yeah, though maybe try to make things as incremental as possible within
that?

>
> >
> > FWIW:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Appreciate your reviews!

No problem! Feel free to propagate the tag to the respin (assuming no major changes :)
thanks for writing good clean code!

>
>
> Thanks,
> SJ
>
> [...]



* Re: [PATCH 5/9] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior()
  2025-03-12  5:47       ` Lorenzo Stoakes
@ 2025-03-12 17:23         ` SeongJae Park
  0 siblings, 0 replies; 42+ messages in thread
From: SeongJae Park @ 2025-03-12 17:23 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: SeongJae Park, Andrew Morton, Liam R. Howlett, David Hildenbrand,
	Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm

On Wed, 12 Mar 2025 05:47:02 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Tue, Mar 11, 2025 at 01:56:17PM -0700, SeongJae Park wrote:
> > On Tue, 11 Mar 2025 12:17:40 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> >
> > > On Mon, Mar 10, 2025 at 10:23:14AM -0700, SeongJae Park wrote:
[...]
> > > (I wonder if that'd be better as a typedef tbh?)
> >
> > Something like below?
> >
> >     typedef void *madvise_walk_arg;
> >
> > I think that could make the code easier to read.  But I feel the void pointer
> > is also not too bad for the current simple static-functions use case, so I'd
> > like to keep this as is if you don't mind.
> >
> > Please let me know if I'm missing your point.
> 
> No, to be clear, I meant the:
> 
> int (*visit)(struct vm_area_struct *vma,
> 				   struct vm_area_struct **prev, unsigned long start,
> 				   unsigned long end, unsigned long arg)
> 
> Function pointer.

Thanks for clarifying!  And I agree this is a good idea.

> 
> But this is not a big deal and let's leave it as-is for now, we can address
> this later potentially! :)

Agreed, either! :)

[...]
> Thanks for being so flexible on the feedback! Appreciated :>)

Thank you for your nice and helpful reviews :)


Thanks,
SJ



* Re: [PATCH 6/9] mm/memory: split non-tlb flushing part from zap_page_range_single()
  2025-03-11 20:58     ` SeongJae Park
@ 2025-03-31 20:24       ` SeongJae Park
  0 siblings, 0 replies; 42+ messages in thread
From: SeongJae Park @ 2025-03-31 20:24 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Lorenzo Stoakes, Andrew Morton, Liam R. Howlett,
	David Hildenbrand, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm, Rik van Riel

On Tue, 11 Mar 2025 13:58:01 -0700 SeongJae Park <sj@kernel.org> wrote:

> On Tue, 11 Mar 2025 12:45:44 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> 
> > On Mon, Mar 10, 2025 at 10:23:15AM -0700, SeongJae Park wrote:
> > > Some of zap_page_range_single() callers such as [process_]madvise() with
> > > MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
> > > zap_page_range_single() does tlb flushing for each invocation.  Split
> > > out the body of zap_page_range_single() except mmu_gather object
> > > initialization and gathered tlb entries flushing parts for such batched
> > > tlb flushing usage.
> > >
> > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > > ---
> > >  mm/memory.c | 36 ++++++++++++++++++++++--------------
> > >  1 file changed, 22 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 78c7ee62795e..88c478e2ed1a 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1995,38 +1995,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> > >  	mmu_notifier_invalidate_range_end(&range);
> > >  }
> > >
> > > -/**
> > > - * zap_page_range_single - remove user pages in a given range
> > > - * @vma: vm_area_struct holding the applicable pages
> > > - * @address: starting address of pages to zap
> > > - * @size: number of bytes to zap
> > > - * @details: details of shared cache invalidation
> > > - *
> > > - * The range must fit into one VMA.
> > > - */
> > > -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > > +static void unmap_vma_single(struct mmu_gather *tlb,
> > > +		struct vm_area_struct *vma, unsigned long address,
> > >  		unsigned long size, struct zap_details *details)
> > >  {
> > >  	const unsigned long end = address + size;
> > >  	struct mmu_notifier_range range;
> > > -	struct mmu_gather tlb;
> > >
> > >  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> > >  				address, end);
> > >  	hugetlb_zap_begin(vma, &range.start, &range.end);
> > > -	tlb_gather_mmu(&tlb, vma->vm_mm);
> > >  	update_hiwater_rss(vma->vm_mm);
> > >  	mmu_notifier_invalidate_range_start(&range);
> > >  	/*
> > >  	 * unmap 'address-end' not 'range.start-range.end' as range
> > >  	 * could have been expanded for hugetlb pmd sharing.
> > >  	 */
> > > -	unmap_single_vma(&tlb, vma, address, end, details, false);
> > > +	unmap_single_vma(tlb, vma, address, end, details, false);
> > >  	mmu_notifier_invalidate_range_end(&range);
> > > -	tlb_finish_mmu(&tlb);
> > >  	hugetlb_zap_end(vma, details);
> > 
> > Previously hugetlb_zap_end() would happen after tlb_finish_mmu(), now it happens
> > before?
> > 
> > This seems like a major problem with this change.
> 
> Oh, you're right.  This could re-introduce the racy hugetlb allocation failure
> problem that was fixed by commit 2820b0f09be9 ("hugetlbfs: close race between
> MADV_DONTNEED and page fault").  That is, this patch can make hugetlb
> allocation failures increase while MADV_DONTNEED is going on.
> 
> Maybe a straightforward fix of the problem is doing hugetlb_zap_end() for all
> vmas in a batched manner, similar to that for tlb flush.  For example, add a
> list or an array for the vmas in 'struct madvise_behavior', let
> 'unmap_vma_single()' add each vma there, and call hugetlb_zap_end() for
> gathered vmas at vector_madvise() or do_madvise().  Does that make sense?
> 
> Also Cc-ing Rik, who is the author of the commit 2820b0f09be9 ("hugetlbfs:
> close race between MADV_DONTNEED and page fault") for a case that I'm missing
> something important.

I now think the straightforward fix I mentioned in the previous message might
be an unnecessarily big change.  Maybe letting the unmap_vma_single() callers
do hugetlb_zap_end() and tlb_finish_mmu() on their own in the correct sequence
could be another way?  Then zap_page_range_single() can do the calls for each
invocation as it did before.  process_madvise() could do batched tlb flushes
only for the non-hugetlb case.  That is, gather the tlb entries as this
version of the patch series proposes, as usual.  But check whether the address
range is for hugetlb and therefore really requires a hugetlb_zap_end() call.
If so, flush the so-far gathered tlb entries, call hugetlb_zap_end(), and then
start the next batch.

In other words, I'm proposing to split the batched flushes when a hugetlb vma
is encountered.  This means that the tlb flush overhead reduction might be
smaller than expected if process_madvise() for unmapping hugetlb pages is
intensively invoked.  But I don't think that's a common use case.  Having the
benefit for non-hugetlb pages with a simple change first, and revisiting the
hugetlb case later once a problem shows up, might be a reasonable way forward,
in my opinion.

For example, my idea could be implemented like below, on top of this entire
patch series.

diff --git a/mm/madvise.c b/mm/madvise.c
index 4021db51aeda..e6a74e7ef864 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -861,6 +861,20 @@ static long madvise_dontneed_single_vma(struct madvise_behavior *behavior,
 	};
 
 	unmap_vma_single(behavior->tlb, vma, start, end - start, &details);
+	/*
+	 * hugetlb_zap_end() should be called after tlb_finish_mmu() to avoid
+	 * hugetlb_zap_end() should be called after tlb_finish_mmu() to avoid
+	 * hugetlb faults on the tlb-flushing memory happening before the
+	 * memory is freed.  If not, the faults may fail memory allocation.
+	 *
+	 * If hugetlb_zap_end() really needs to be called, flush the so-far
+	 * gathered tlb entries, invoke hugetlb_zap_end(), and start another
+	 * batch of tlb flushes for the remaining unmap work.
+	if (is_vm_hugetlb_page(vma)) {
+		tlb_finish_mmu(behavior->tlb);
+		hugetlb_zap_end(vma, &details);
+		tlb_gather_mmu(behavior->tlb, vma->vm_mm);
+	}
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index add8d540cb63..4431630d3240 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2023,7 +2023,6 @@ void unmap_vma_single(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 */
 	unmap_single_vma(tlb, vma, address, end, details, false);
 	mmu_notifier_invalidate_range_end(&range);
-	hugetlb_zap_end(vma, details);
 }
 
 /**
@@ -2043,6 +2042,7 @@ void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
 	tlb_gather_mmu(&tlb, vma->vm_mm);
 	unmap_vma_single(&tlb, vma, address, size, details);
 	tlb_finish_mmu(&tlb);
+	hugetlb_zap_end(vma, details);
 }
 
 /**

Any concern or something I'm missing?


Thanks,
SJ

[...]



* Re: [PATCH 6/9] mm/memory: split non-tlb flushing part from zap_page_range_single()
  2025-03-10 17:23 ` [PATCH 6/9] mm/memory: split non-tlb flushing part from zap_page_range_single() SeongJae Park
  2025-03-11 12:45   ` Lorenzo Stoakes
@ 2025-04-01  1:45   ` Liam R. Howlett
  2025-04-01  2:48     ` SeongJae Park
  1 sibling, 1 reply; 42+ messages in thread
From: Liam R. Howlett @ 2025-04-01  1:45 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

* SeongJae Park <sj@kernel.org> [250310 13:24]:
> Some of zap_page_range_single() callers such as [process_]madvise() with
> MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
> zap_page_range_single() does tlb flushing for each invocation.  Split
> out the body of zap_page_range_single() except mmu_gather object
> initialization and gathered tlb entries flushing parts for such batched
> tlb flushing usage.
> 
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
>  mm/memory.c | 36 ++++++++++++++++++++++--------------
>  1 file changed, 22 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 78c7ee62795e..88c478e2ed1a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1995,38 +1995,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
>  	mmu_notifier_invalidate_range_end(&range);
>  }
>  
> -/**
> - * zap_page_range_single - remove user pages in a given range
> - * @vma: vm_area_struct holding the applicable pages
> - * @address: starting address of pages to zap
> - * @size: number of bytes to zap
> - * @details: details of shared cache invalidation
> - *
> - * The range must fit into one VMA.
> - */
> -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> +static void unmap_vma_single(struct mmu_gather *tlb,

I could not, for the life of me, figure out what was going on here until
I realised that it is a new function name and not unmap_single_vma(),
which is called below.

Can we name this differently somehow?  notify_unmap_single_vma() or
something better?

Also, maybe add a description of the function to this patch vs the next
patch?

> +		struct vm_area_struct *vma, unsigned long address,
>  		unsigned long size, struct zap_details *details)
>  {
>  	const unsigned long end = address + size;
>  	struct mmu_notifier_range range;
> -	struct mmu_gather tlb;
>  
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
>  				address, end);
>  	hugetlb_zap_begin(vma, &range.start, &range.end);
> -	tlb_gather_mmu(&tlb, vma->vm_mm);
>  	update_hiwater_rss(vma->vm_mm);
>  	mmu_notifier_invalidate_range_start(&range);
>  	/*
>  	 * unmap 'address-end' not 'range.start-range.end' as range
>  	 * could have been expanded for hugetlb pmd sharing.
>  	 */
> -	unmap_single_vma(&tlb, vma, address, end, details, false);
> +	unmap_single_vma(tlb, vma, address, end, details, false);
>  	mmu_notifier_invalidate_range_end(&range);
> -	tlb_finish_mmu(&tlb);
>  	hugetlb_zap_end(vma, details);
>  }
>  
> +/**
> + * zap_page_range_single - remove user pages in a given range
> + * @vma: vm_area_struct holding the applicable pages
> + * @address: starting address of pages to zap
> + * @size: number of bytes to zap
> + * @details: details of shared cache invalidation
> + *
> + * The range must fit into one VMA.
> + */
> +void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> +		unsigned long size, struct zap_details *details)
> +{
> +	struct mmu_gather tlb;
> +
> +	tlb_gather_mmu(&tlb, vma->vm_mm);
> +	unmap_vma_single(&tlb, vma, address, size, details);
> +	tlb_finish_mmu(&tlb);
> +}
> +
>  /**
>   * zap_vma_ptes - remove ptes mapping the vma
>   * @vma: vm_area_struct holding ptes to be zapped
> -- 
> 2.39.5
> 

[1]. https://lore.kernel.org/lkml/1406212541-25975-1-git-send-email-joro@8bytes.org/



* Re: [PATCH 6/9] mm/memory: split non-tlb flushing part from zap_page_range_single()
  2025-04-01  1:45   ` Liam R. Howlett
@ 2025-04-01  2:48     ` SeongJae Park
  2025-04-01 14:03       ` Liam R. Howlett
  0 siblings, 1 reply; 42+ messages in thread
From: SeongJae Park @ 2025-04-01  2:48 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: SeongJae Park, Andrew Morton, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

On Mon, 31 Mar 2025 21:45:40 -0400 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:

> * SeongJae Park <sj@kernel.org> [250310 13:24]:
> > Some of zap_page_range_single() callers such as [process_]madvise() with
> > MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
> > zap_page_range_single() does tlb flushing for each invocation.  Split
> > out the body of zap_page_range_single() except mmu_gather object
> > initialization and gathered tlb entries flushing parts for such batched
> > tlb flushing usage.
> > 
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> >  mm/memory.c | 36 ++++++++++++++++++++++--------------
> >  1 file changed, 22 insertions(+), 14 deletions(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 78c7ee62795e..88c478e2ed1a 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1995,38 +1995,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> >  	mmu_notifier_invalidate_range_end(&range);
> >  }
> >  
> > -/**
> > - * zap_page_range_single - remove user pages in a given range
> > - * @vma: vm_area_struct holding the applicable pages
> > - * @address: starting address of pages to zap
> > - * @size: number of bytes to zap
> > - * @details: details of shared cache invalidation
> > - *
> > - * The range must fit into one VMA.
> > - */
> > -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > +static void unmap_vma_single(struct mmu_gather *tlb,
> 
> I could not, for the life of me, figure out what was going on here until
> I realised that it is a new function name and not unmap_single_vma(),
> which is called below.

Agreed, definitely the name is confusing, especially given the existence of
unmap_single_vma().

> 
> Can we name this differently somehow?  notify_unmap_single_vma() or
> something better?

notify_unmap_single_vma() sounds good to me.  I'll use the name in the next
revision unless we find a better one.

> 
> Also, maybe add a description of the function to this patch vs the next
> patch?

That makes sense.  In the next revision, I will add the kernel-doc comment
here, but not as a valid kernel-doc comment (maybe starting with /* instead of
/**) since this function is a static function as of this patch.  On the next
patch that makes this non-static, I will make the comment a valid kernel-doc
comment with a minimal change.

I prefer not having a valid kernel-doc comment for a static function, but
that's just a personal preference and I have no strong reason to object the
other way.  Please feel free to let me know if you prefer making it a valid
kernel-doc comment starting from this patch.

Thank you for nice suggestions, Liam.


Thanks,
SJ

[...]



* Re: [PATCH 6/9] mm/memory: split non-tlb flushing part from zap_page_range_single()
  2025-04-01  2:48     ` SeongJae Park
@ 2025-04-01 14:03       ` Liam R. Howlett
  2025-04-01 21:25         ` SeongJae Park
  0 siblings, 1 reply; 42+ messages in thread
From: Liam R. Howlett @ 2025-04-01 14:03 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

* SeongJae Park <sj@kernel.org> [250331 22:48]:
> On Mon, 31 Mar 2025 21:45:40 -0400 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:
> 
> > * SeongJae Park <sj@kernel.org> [250310 13:24]:
> > > Some of zap_page_range_single() callers such as [process_]madvise() with
> > > MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
> > > zap_page_range_single() does tlb flushing for each invocation.  Split
> > > out the body of zap_page_range_single() except mmu_gather object
> > > initialization and gathered tlb entries flushing parts for such batched
> > > tlb flushing usage.
> > > 
> > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > > ---
> > >  mm/memory.c | 36 ++++++++++++++++++++++--------------
> > >  1 file changed, 22 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 78c7ee62795e..88c478e2ed1a 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1995,38 +1995,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> > >  	mmu_notifier_invalidate_range_end(&range);
> > >  }
> > >  
> > > -/**
> > > - * zap_page_range_single - remove user pages in a given range
> > > - * @vma: vm_area_struct holding the applicable pages
> > > - * @address: starting address of pages to zap
> > > - * @size: number of bytes to zap
> > > - * @details: details of shared cache invalidation
> > > - *
> > > - * The range must fit into one VMA.
> > > - */
> > > -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > > +static void unmap_vma_single(struct mmu_gather *tlb,
> > 
> > I could not, for the life of me, figure out what was going on here until
> > I realised that it is a new function name and not unmap_single_vma(),
> > which is called below.
> 
> Agreed, definitely the name is confusing, especially given the existence of
> unmap_single_vma().
> 
> > 
> > Can we name this differently somehow?  notify_unmap_single_vma() or
> > something better?
> 
> notify_unmap_single_vma() sounds good to me.  I'll use the name in the next
> revision unless we find a better one.

Thanks.  I don't really mind if you name it anything else, as long as it
reduces the confusion.

> 
> > 
> > Also, maybe add a description of the function to this patch vs the next
> > patch?
> 
> That makes sense.  In the next revision, I will add the kernel-doc comment
> here, but not as a valid kernel-doc comment (maybe starting with /* instead of
> /**) since this function is a static function as of this patch.  On the next
> patch that makes this non-static, I will make the comment a valid kernel-doc
> comment with a minimal change.
> 
> I prefer not having a valid kernel-doc comment for a static function, but
> that's just a personal preference and I have no strong reason to object the
> other way.  Please feel free to let me know if you prefer making it a valid
> kernel-doc comment starting from this patch.
> 

Yes, that was what I was thinking as well.

...

Thanks,
Liam



* Re: [PATCH 8/9] mm/madvise: batch tlb flushes for [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE})
  2025-03-10 17:23 ` [PATCH 8/9] mm/madvise: batch tlb flushes for [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE}) SeongJae Park
  2025-03-11 13:59   ` Lorenzo Stoakes
@ 2025-04-01 21:17   ` SeongJae Park
  1 sibling, 0 replies; 42+ messages in thread
From: SeongJae Park @ 2025-04-01 21:17 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

On Mon, 10 Mar 2025 10:23:17 -0700 SeongJae Park <sj@kernel.org> wrote:

> MADV_DONTNEED[_LOCKED] and MADV_FREE internal logics for
> [process_]madvise() can be invoked with batched tlb flushes.  Update
> vector_madvise() and do_madvise(), which are called for the two system
> calls respectively, to use those in an efficient way.  Initialize an
> mmu_gather object before starting the internal work, and flush the
> gathered tlb entries at once after all the internal work is done.
> 
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
>  mm/madvise.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 47 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index d7ea71c6422c..d5f4ce3041a4 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
[...]
> @@ -1639,6 +1642,32 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
>  		mmap_read_unlock(mm);
>  }
>  
> +static bool madvise_batch_tlb_flush(int behavior)
> +{
> +	switch (behavior) {
> +	case MADV_DONTNEED:
> +	case MADV_DONTNEED_LOCKED:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}

The above function should return true for MADV_FREE, too.  I will make it so
in the next spin.


Thanks,
SJ

[...]



* Re: [PATCH 9/9] mm/madvise: remove !tlb support from madvise_{dontneed,free}_single_vma()
  2025-03-12 13:46       ` Lorenzo Stoakes
@ 2025-04-01 21:22         ` SeongJae Park
  0 siblings, 0 replies; 42+ messages in thread
From: SeongJae Park @ 2025-04-01 21:22 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: SeongJae Park, Andrew Morton, Liam R. Howlett, David Hildenbrand,
	Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
	linux-mm

On Wed, 12 Mar 2025 13:46:38 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Tue, Mar 11, 2025 at 02:02:11PM -0700, SeongJae Park wrote:
> > On Tue, 11 Mar 2025 14:01:20 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> >
> > > On Mon, Mar 10, 2025 at 10:23:18AM -0700, SeongJae Park wrote:
> > > > madvise_dontneed_single_vma() and madvise_free_single_vma() support both
> > > > batched and unbatched tlb flush use cases depending on the received tlb
> > > > parameter's value.  The support was for a safe and smooth transition of
> > > > the usages from the unbatched flushes to the batched ones.  Now the
> > > > transition is done, and therefore there is no real unbatched tlb flush
> > > > use case.  Remove the code supporting the no-longer-used cases.
> > > >
> > > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > >
> > > Obviously I support this based on previous preview :) but I wonder if we
> > > can avoid this horrid caller_tlb pattern in the first instance.
> >
> > I will try, though I have no good idea for that for now.
> >
> > Maybe we could simply squash patches 7-9.  I'm a bit concerned it would make
> > the changes unnecessarily mixed and not small, but I have no strong opinion
> > about it.  Please feel free to let me know if you want that.
> 
> Yeah, though maybe try to make things as incremental as possible within
> that?

Now I think we can make the entire batching change for MADV_FREE first, and
then make another change for MADV_DONTNEED[_LOCKED].  That way, the caller_tlb
pattern will not be introduced at all, and the changes in each individual
commit would be small and focused.

Please let me know if you have any concern about it.  If I don't hear any
concerns, I will format the next spin that way.


Thanks,
SJ

[...]



* Re: [PATCH 6/9] mm/memory: split non-tlb flushing part from zap_page_range_single()
  2025-04-01 14:03       ` Liam R. Howlett
@ 2025-04-01 21:25         ` SeongJae Park
  0 siblings, 0 replies; 42+ messages in thread
From: SeongJae Park @ 2025-04-01 21:25 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: SeongJae Park, Andrew Morton, Liam R. Howlett, David Hildenbrand,
	Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
	linux-kernel, linux-mm

On Tue, 1 Apr 2025 10:03:17 -0400 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:

> * SeongJae Park <sj@kernel.org> [250331 22:48]:
> > On Mon, 31 Mar 2025 21:45:40 -0400 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:
> > 
> > > * SeongJae Park <sj@kernel.org> [250310 13:24]:
[...] 
> Thanks.  I don't really mind if you name it anything else, as long as it
> reduces the confusion.

Fully agreed and thanks again for the nice name suggestion.

[...]
> > That makes sense.  In the next revision, I will add the kernel-doc comment
> > here, but not as a valid kernel-doc comment (maybe starting with /* instead
> > of /**) since this function is a static function as of this patch.  On the
> > next patch that makes this non-static, I will make the comment a valid
> > kernel-doc comment with a minimal change.
> > 
> > I prefer not having a valid kernel-doc comment for a static function, but
> > that's just a personal preference and I have no strong reason to object the
> > other way.  Please feel free to let me know if you prefer making it a valid
> > kernel-doc comment starting from this patch.
> > 
> 
> Yes, that was what I was thinking as well.

Thanks for kindly clarifying, Liam :)


Thanks,
SJ

[...]



end of thread, other threads:[~2025-04-01 21:25 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-10 17:23 [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
2025-03-10 17:23 ` [PATCH 1/9] mm/madvise: use is_memory_failure() from madvise_do_behavior() SeongJae Park
2025-03-11 11:27   ` Lorenzo Stoakes
2025-03-10 17:23 ` [PATCH 2/9] mm/madvise: split out populate behavior check logic SeongJae Park
2025-03-11 11:29   ` Lorenzo Stoakes
2025-03-10 17:23 ` [PATCH 3/9] mm/madvise: deduplicate madvise_do_behavior() skip case handlings SeongJae Park
2025-03-11 12:02   ` Lorenzo Stoakes
2025-03-11 20:54     ` SeongJae Park
2025-03-10 17:23 ` [PATCH 4/9] mm/madvise: remove len parameter of madvise_do_behavior() SeongJae Park
2025-03-11 12:05   ` Lorenzo Stoakes
2025-03-10 17:23 ` [PATCH 5/9] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() SeongJae Park
2025-03-11 12:17   ` Lorenzo Stoakes
2025-03-11 20:56     ` SeongJae Park
2025-03-12  5:47       ` Lorenzo Stoakes
2025-03-12 17:23         ` SeongJae Park
2025-03-10 17:23 ` [PATCH 6/9] mm/memory: split non-tlb flushing part from zap_page_range_single() SeongJae Park
2025-03-11 12:45   ` Lorenzo Stoakes
2025-03-11 20:58     ` SeongJae Park
2025-03-31 20:24       ` SeongJae Park
2025-04-01  1:45   ` Liam R. Howlett
2025-04-01  2:48     ` SeongJae Park
2025-04-01 14:03       ` Liam R. Howlett
2025-04-01 21:25         ` SeongJae Park
2025-03-10 17:23 ` [PATCH 7/9] mm/madvise: let madvise_{dontneed,free}_single_vma() caller batches tlb flushes SeongJae Park
2025-03-11 13:07   ` Lorenzo Stoakes
2025-03-11 21:00     ` SeongJae Park
2025-03-10 17:23 ` [PATCH 8/9] mm/madvise: batch tlb flushes for [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE}) SeongJae Park
2025-03-11 13:59   ` Lorenzo Stoakes
2025-03-11 21:01     ` SeongJae Park
2025-04-01 21:17   ` SeongJae Park
2025-03-10 17:23 ` [PATCH 9/9] mm/madvise: remove !tlb support from madvise_{dontneed,free}_single_vma() SeongJae Park
2025-03-11 14:01   ` Lorenzo Stoakes
2025-03-11 21:02     ` SeongJae Park
2025-03-12 13:46       ` Lorenzo Stoakes
2025-04-01 21:22         ` SeongJae Park
2025-03-10 22:39 ` [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE Andrew Morton
2025-03-10 23:15   ` Shakeel Butt
2025-03-10 23:36     ` Roman Gushchin
2025-03-11 11:17       ` Lorenzo Stoakes
2025-03-10 23:27   ` SeongJae Park
2025-03-11 12:49 ` Lorenzo Stoakes
2025-03-11 21:03   ` SeongJae Park
