* [RFC PATCH 01/16] mm/madvise: use is_memory_failure() from madvise_do_behavior()
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
@ 2025-03-05 18:15 ` SeongJae Park
2025-03-05 20:25 ` Shakeel Butt
2025-03-05 18:15 ` [RFC PATCH 02/16] mm/madvise: split out populate behavior check logic SeongJae Park
` (17 subsequent siblings)
18 siblings, 1 reply; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 18:15 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
To reduce redundant open-coded checks of CONFIG_MEMORY_FAILURE and
MADV_{HWPOISON,SOFT_OFFLINE} in madvise_[un]lock(), is_memory_failure()
has been introduced. madvise_do_behavior() is still doing the same
open-coded check, though. Use is_memory_failure() instead.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 388dc289b5d1..dbc8fec05cc6 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1640,10 +1640,8 @@ static int madvise_do_behavior(struct mm_struct *mm,
unsigned long end;
int error;
-#ifdef CONFIG_MEMORY_FAILURE
- if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
+ if (is_memory_failure(behavior))
return madvise_inject_error(behavior, start, start + len_in);
-#endif
start = untagged_addr_remote(mm, start);
end = start + len;
--
2.39.5
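For context, the is_memory_failure() helper used above is introduced earlier
in the series and is not shown in this excerpt; a minimal sketch of its
assumed shape (which also explains the build issue raised in the reply below):

#ifdef CONFIG_MEMORY_FAILURE
static bool is_memory_failure(int behavior)
{
	switch (behavior) {
	case MADV_HWPOISON:
	case MADV_SOFT_OFFLINE:
		return true;
	default:
		return false;
	}
}
#else
static bool is_memory_failure(int behavior)
{
	/* No memory failure injection behaviors without CONFIG_MEMORY_FAILURE. */
	return false;
}
#endif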
* Re: [RFC PATCH 01/16] mm/madvise: use is_memory_failure() from madvise_do_behavior()
2025-03-05 18:15 ` [RFC PATCH 01/16] mm/madvise: use is_memory_failure() from madvise_do_behavior() SeongJae Park
@ 2025-03-05 20:25 ` Shakeel Butt
2025-03-05 23:13 ` SeongJae Park
0 siblings, 1 reply; 48+ messages in thread
From: Shakeel Butt @ 2025-03-05 20:25 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, Mar 05, 2025 at 10:15:56AM -0800, SeongJae Park wrote:
> To reduce redundant open-coded checks of CONFIG_MEMORY_FAILURE and
> MADV_{HWPOISON,SOFT_OFFLINE} in madvise_[un]lock(), is_memory_failure()
> has been introduced. madvise_do_behavior() is still doing the same
> open-coded check, though. Use is_memory_failure() instead.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
> mm/madvise.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 388dc289b5d1..dbc8fec05cc6 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1640,10 +1640,8 @@ static int madvise_do_behavior(struct mm_struct *mm,
> unsigned long end;
> int error;
>
> -#ifdef CONFIG_MEMORY_FAILURE
> - if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> + if (is_memory_failure(behavior))
> return madvise_inject_error(behavior, start, start + len_in);
You might want to either define an empty madvise_inject_error() for
!CONFIG_MEMORY_FAILURE or keep the CONFIG_MEMORY_FAILURE #ifdef here.
> -#endif
> start = untagged_addr_remote(mm, start);
> end = start + len;
>
> --
> 2.39.5
* Re: [RFC PATCH 01/16] mm/madvise: use is_memory_failure() from madvise_do_behavior()
2025-03-05 20:25 ` Shakeel Butt
@ 2025-03-05 23:13 ` SeongJae Park
0 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 23:13 UTC (permalink / raw)
To: Shakeel Butt
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, 5 Mar 2025 12:25:51 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On Wed, Mar 05, 2025 at 10:15:56AM -0800, SeongJae Park wrote:
> > To reduce redundant open-coded checks of CONFIG_MEMORY_FAILURE and
> > MADV_{HWPOISON,SOFT_OFFLINE} in madvise_[un]lock(), is_memory_failure()
> > has been introduced. madvise_do_behavior() is still doing the same
> > open-coded check, though. Use is_memory_failure() instead.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> > mm/madvise.c | 4 +---
> > 1 file changed, 1 insertion(+), 3 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 388dc289b5d1..dbc8fec05cc6 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1640,10 +1640,8 @@ static int madvise_do_behavior(struct mm_struct *mm,
> > unsigned long end;
> > int error;
> >
> > -#ifdef CONFIG_MEMORY_FAILURE
> > - if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> > + if (is_memory_failure(behavior))
> > return madvise_inject_error(behavior, start, start + len_in);
>
> You might want to either define an empty madvise_inject_error() for
> !CONFIG_MEMORY_FAILURE or keep the CONFIG_MEMORY_FAILURE #ifdef here.
Good catch. I confirmed the build fails when !CONFIG_MEMORY_FAILURE. I will
define an empty madvise_inject_error().
>
> > -#endif
> > start = untagged_addr_remote(mm, start);
> > end = start + len;
> >
> > --
> > 2.39.5
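A minimal sketch of the agreed fix, assuming the madvise_inject_error()
signature implied by the call site above; the stub exists only to satisfy the
compiler, since is_memory_failure() is constant false when
!CONFIG_MEMORY_FAILURE:

#ifndef CONFIG_MEMORY_FAILURE
static int madvise_inject_error(int behavior,
		unsigned long start, unsigned long end)
{
	/* Never reached: is_memory_failure() is always false here. */
	return 0;
}
#endif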
* [RFC PATCH 02/16] mm/madvise: split out populate behavior check logic
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
2025-03-05 18:15 ` [RFC PATCH 01/16] mm/madvise: use is_memory_failure() from madvise_do_behavior() SeongJae Park
@ 2025-03-05 18:15 ` SeongJae Park
2025-03-05 20:32 ` Shakeel Butt
2025-03-05 18:15 ` [RFC PATCH 03/16] mm/madvise: deduplicate madvise_do_behavior() skip case handlings SeongJae Park
` (16 subsequent siblings)
18 siblings, 1 reply; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 18:15 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
madvise_do_behavior() has a long open-coded 'behavior' check for
MADV_POPULATE_{READ,WRITE}. It adds multiple layers[1] and makes the
code arguably take longer to read. Like is_memory_failure(), split
out the check to a separate function. This does not technically remove
the additional layer, but it discourages further extending the switch-case.
It also makes the madvise_do_behavior() code shorter and therefore easier
to read.
[1] https://lore.kernel.org/bd6d0bf1-c79e-46bd-a810-9791efb9ad73@lucifer.local
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index dbc8fec05cc6..4a91590656dc 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1633,6 +1633,17 @@ static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
return true;
}
+static bool is_memory_populate(int behavior)
+{
+ switch (behavior) {
+ case MADV_POPULATE_READ:
+ case MADV_POPULATE_WRITE:
+ return true;
+ default:
+ return false;
+ }
+}
+
static int madvise_do_behavior(struct mm_struct *mm,
unsigned long start, size_t len_in, size_t len, int behavior)
{
@@ -1646,16 +1657,11 @@ static int madvise_do_behavior(struct mm_struct *mm,
end = start + len;
blk_start_plug(&plug);
- switch (behavior) {
- case MADV_POPULATE_READ:
- case MADV_POPULATE_WRITE:
+ if (is_memory_populate(behavior))
error = madvise_populate(mm, start, end, behavior);
- break;
- default:
+ else
error = madvise_walk_vmas(mm, start, end, behavior,
madvise_vma_behavior);
- break;
- }
blk_finish_plug(&plug);
return error;
}
--
2.39.5
* Re: [RFC PATCH 02/16] mm/madvise: split out populate behavior check logic
2025-03-05 18:15 ` [RFC PATCH 02/16] mm/madvise: split out populate behavior check logic SeongJae Park
@ 2025-03-05 20:32 ` Shakeel Butt
2025-03-05 23:18 ` SeongJae Park
0 siblings, 1 reply; 48+ messages in thread
From: Shakeel Butt @ 2025-03-05 20:32 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, Mar 05, 2025 at 10:15:57AM -0800, SeongJae Park wrote:
> madvise_do_behavior() has a long open-coded 'behavior' check for
> MADV_POPULATE_{READ,WRITE}. It adds multiple layers[1] and makes the
> code arguably take longer to read. Like is_memory_failure(), split
> out the check to a separate function. This does not technically remove
> the additional layer, but it discourages further extending the switch-case.
> It also makes the madvise_do_behavior() code shorter and therefore easier
> to read.
>
> [1] https://lore.kernel.org/bd6d0bf1-c79e-46bd-a810-9791efb9ad73@lucifer.local
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
> mm/madvise.c | 20 +++++++++++++-------
> 1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index dbc8fec05cc6..4a91590656dc 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1633,6 +1633,17 @@ static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
> return true;
> }
>
> +static bool is_memory_populate(int behavior)
No strong opinion on this patch, but if you want to keep it, the above
name feels weird. How about either is_madvise_populate() or
is_populate_memory()?
> +{
> + switch (behavior) {
> + case MADV_POPULATE_READ:
> + case MADV_POPULATE_WRITE:
> + return true;
> + default:
> + return false;
> + }
> +}
> +
> static int madvise_do_behavior(struct mm_struct *mm,
> unsigned long start, size_t len_in, size_t len, int behavior)
> {
> @@ -1646,16 +1657,11 @@ static int madvise_do_behavior(struct mm_struct *mm,
> end = start + len;
>
> blk_start_plug(&plug);
> - switch (behavior) {
> - case MADV_POPULATE_READ:
> - case MADV_POPULATE_WRITE:
> + if (is_memory_populate(behavior))
> error = madvise_populate(mm, start, end, behavior);
> - break;
> - default:
> + else
> error = madvise_walk_vmas(mm, start, end, behavior,
> madvise_vma_behavior);
> - break;
> - }
> blk_finish_plug(&plug);
> return error;
> }
> --
> 2.39.5
* Re: [RFC PATCH 02/16] mm/madvise: split out populate behavior check logic
2025-03-05 20:32 ` Shakeel Butt
@ 2025-03-05 23:18 ` SeongJae Park
0 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 23:18 UTC (permalink / raw)
To: Shakeel Butt
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, 5 Mar 2025 12:32:52 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On Wed, Mar 05, 2025 at 10:15:57AM -0800, SeongJae Park wrote:
> > madvise_do_behavior() has a long open-coded 'behavior' check for
> > MADV_POPULATE_{READ,WRITE}. It adds multiple layers[1] and makes the
> > code arguably take longer to read. Like is_memory_failure(), split
> > out the check to a separate function. This does not technically remove
> > the additional layer, but it discourages further extending the switch-case.
> > It also makes the madvise_do_behavior() code shorter and therefore easier
> > to read.
> >
> > [1] https://lore.kernel.org/bd6d0bf1-c79e-46bd-a810-9791efb9ad73@lucifer.local
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> > mm/madvise.c | 20 +++++++++++++-------
> > 1 file changed, 13 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index dbc8fec05cc6..4a91590656dc 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1633,6 +1633,17 @@ static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
> > return true;
> > }
> >
> > +static bool is_memory_populate(int behavior)
>
> No strong opinion on this patch, but if you want to keep it, the above
> name feels weird. How about either is_madvise_populate() or
> is_populate_memory()?
I wanted to make this read consistently with other similar-purpose ones like
is_memory_failure(behavior). I have no strong opinion, either, though.
Unless someone speaks up here, I will rename this to is_madvise_populate()
in the next version.
>
> > +{
> > + switch (behavior) {
> > + case MADV_POPULATE_READ:
> > + case MADV_POPULATE_WRITE:
> > + return true;
> > + default:
> > + return false;
> > + }
> > +}
Thanks,
SJ
[...]
* [RFC PATCH 03/16] mm/madvise: deduplicate madvise_do_behavior() skip case handlings
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
2025-03-05 18:15 ` [RFC PATCH 01/16] mm/madvise: use is_memory_failure() from madvise_do_behavior() SeongJae Park
2025-03-05 18:15 ` [RFC PATCH 02/16] mm/madvise: split out populate behavior check logic SeongJae Park
@ 2025-03-05 18:15 ` SeongJae Park
2025-03-05 18:15 ` [RFC PATCH 04/16] mm/madvise: remove len parameter of madvise_do_behavior() SeongJae Park
` (15 subsequent siblings)
18 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 18:15 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
The logic for checking if a given madvise() request for a single memory
range can skip real work, namely madvise_do_behavior(), is duplicated in
do_madvise() and vector_madvise(). Split out the logic to a function
and reuse it.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 53 +++++++++++++++++++++++++++++-----------------------
1 file changed, 30 insertions(+), 23 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 4a91590656dc..265b325d8829 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1633,6 +1633,27 @@ static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
return true;
}
+/*
+ * madvise_should_skip() - Return if a madvise request can skip real work.
+ * @start: Start address of madvise-requested address range.
+ * @len_in: Length of madvise-requested address range.
+ * @behavior: Requested madvise behavior.
+ * @err: Pointer to store an error code from the check.
+ */
+static bool madvise_should_skip(unsigned long start, size_t len_in,
+ int behavior, int *err)
+{
+ if (!is_valid_madvise(start, len_in, behavior)) {
+ *err = -EINVAL;
+ return true;
+ }
+ if (start + PAGE_ALIGN(len_in) == start) {
+ *err = 0;
+ return true;
+ }
+ return false;
+}
+
static bool is_memory_populate(int behavior)
{
switch (behavior) {
@@ -1740,23 +1761,15 @@ static int madvise_do_behavior(struct mm_struct *mm,
*/
int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
{
- unsigned long end;
int error;
- size_t len;
-
- if (!is_valid_madvise(start, len_in, behavior))
- return -EINVAL;
-
- len = PAGE_ALIGN(len_in);
- end = start + len;
-
- if (end == start)
- return 0;
+ if (madvise_should_skip(start, len_in, behavior, &error))
+ return error;
error = madvise_lock(mm, behavior);
if (error)
return error;
- error = madvise_do_behavior(mm, start, len_in, len, behavior);
+ error = madvise_do_behavior(mm, start, len_in, PAGE_ALIGN(len_in),
+ behavior);
madvise_unlock(mm, behavior);
return error;
@@ -1783,19 +1796,13 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
while (iov_iter_count(iter)) {
unsigned long start = (unsigned long)iter_iov_addr(iter);
size_t len_in = iter_iov_len(iter);
- size_t len;
-
- if (!is_valid_madvise(start, len_in, behavior)) {
- ret = -EINVAL;
- break;
- }
+ int error;
- len = PAGE_ALIGN(len_in);
- if (start + len == start)
- ret = 0;
+ if (madvise_should_skip(start, len_in, behavior, &error))
+ ret = error;
else
- ret = madvise_do_behavior(mm, start, len_in, len,
- behavior);
+ ret = madvise_do_behavior(mm, start, len_in,
+ PAGE_ALIGN(len_in), behavior);
/*
* An madvise operation is attempting to restart the syscall,
* but we cannot proceed as it would not be correct to repeat
--
2.39.5
* [RFC PATCH 04/16] mm/madvise: remove len parameter of madvise_do_behavior()
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (2 preceding siblings ...)
2025-03-05 18:15 ` [RFC PATCH 03/16] mm/madvise: deduplicate madvise_do_behavior() skip case handlings SeongJae Park
@ 2025-03-05 18:15 ` SeongJae Park
2025-03-05 18:16 ` [RFC PATCH 05/16] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() SeongJae Park
` (14 subsequent siblings)
18 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 18:15 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
Because the madvise_should_skip() logic is factored out, making
madvise_do_behavior() calculate 'len' on its own rather than receiving
it as a parameter makes the code simpler. Remove the parameter.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 265b325d8829..c5e1a4d1df72 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1666,7 +1666,7 @@ static bool is_memory_populate(int behavior)
}
static int madvise_do_behavior(struct mm_struct *mm,
- unsigned long start, size_t len_in, size_t len, int behavior)
+ unsigned long start, size_t len_in, int behavior)
{
struct blk_plug plug;
unsigned long end;
@@ -1675,7 +1675,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
if (is_memory_failure(behavior))
return madvise_inject_error(behavior, start, start + len_in);
start = untagged_addr_remote(mm, start);
- end = start + len;
+ end = start + PAGE_ALIGN(len_in);
blk_start_plug(&plug);
if (is_memory_populate(behavior))
@@ -1768,8 +1768,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
error = madvise_lock(mm, behavior);
if (error)
return error;
- error = madvise_do_behavior(mm, start, len_in, PAGE_ALIGN(len_in),
- behavior);
+ error = madvise_do_behavior(mm, start, len_in, behavior);
madvise_unlock(mm, behavior);
return error;
@@ -1801,8 +1800,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
if (madvise_should_skip(start, len_in, behavior, &error))
ret = error;
else
- ret = madvise_do_behavior(mm, start, len_in,
- PAGE_ALIGN(len_in), behavior);
+ ret = madvise_do_behavior(mm, start, len_in, behavior);
/*
* An madvise operation is attempting to restart the syscall,
* but we cannot proceed as it would not be correct to repeat
--
2.39.5
* [RFC PATCH 05/16] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior()
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (3 preceding siblings ...)
2025-03-05 18:15 ` [RFC PATCH 04/16] mm/madvise: remove len parameter of madvise_do_behavior() SeongJae Park
@ 2025-03-05 18:16 ` SeongJae Park
2025-03-05 21:02 ` Shakeel Butt
2025-03-05 18:16 ` [RFC PATCH 06/16] mm/madvise: pass madvise_behavior struct to madvise_vma_behavior() SeongJae Park
` (13 subsequent siblings)
18 siblings, 1 reply; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 18:16 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and
MADV_FREE, an mmu_gather object in addition to the behavior integer needs
to be passed to the internal logic. Define a struct for passing such
information together with the behavior value without increasing the number
of parameters of all code paths towards the internal logic.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index c5e1a4d1df72..3346e593e07d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1665,9 +1665,15 @@ static bool is_memory_populate(int behavior)
}
}
+struct madvise_behavior {
+ int behavior;
+};
+
static int madvise_do_behavior(struct mm_struct *mm,
- unsigned long start, size_t len_in, int behavior)
+ unsigned long start, size_t len_in,
+ struct madvise_behavior *madv_behavior)
{
+ int behavior = madv_behavior->behavior;
struct blk_plug plug;
unsigned long end;
int error;
@@ -1762,13 +1768,14 @@ static int madvise_do_behavior(struct mm_struct *mm,
int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
{
int error;
+ struct madvise_behavior madv_behavior = {.behavior = behavior};
if (madvise_should_skip(start, len_in, behavior, &error))
return error;
error = madvise_lock(mm, behavior);
if (error)
return error;
- error = madvise_do_behavior(mm, start, len_in, behavior);
+ error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
madvise_unlock(mm, behavior);
return error;
@@ -1785,6 +1792,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
{
ssize_t ret = 0;
size_t total_len;
+ struct madvise_behavior madv_behavior = {.behavior = behavior};
total_len = iov_iter_count(iter);
@@ -1800,7 +1808,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
if (madvise_should_skip(start, len_in, behavior, &error))
ret = error;
else
- ret = madvise_do_behavior(mm, start, len_in, behavior);
+ ret = madvise_do_behavior(mm, start, len_in,
+ &madv_behavior);
/*
* An madvise operation is attempting to restart the syscall,
* but we cannot proceed as it would not be correct to repeat
--
2.39.5
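To make the direction concrete, the commit message suggests the struct is
meant to eventually grow roughly as follows; the 'tlb' field is an assumption
about a later patch in the series, not code from this one:

struct madvise_behavior {
	int behavior;
	/* Assumed later addition, for gathering tlb entries across vmas: */
	struct mmu_gather *tlb;
};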
* Re: [RFC PATCH 05/16] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior()
2025-03-05 18:16 ` [RFC PATCH 05/16] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() SeongJae Park
@ 2025-03-05 21:02 ` Shakeel Butt
2025-03-05 21:40 ` Shakeel Butt
0 siblings, 1 reply; 48+ messages in thread
From: Shakeel Butt @ 2025-03-05 21:02 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, Mar 05, 2025 at 10:16:00AM -0800, SeongJae Park wrote:
> To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and
> MADV_FREE, an mmu_gather object in addition to the behavior integer needs
> to be passed to the internal logic. Define a struct for passing such
> information together with the behavior value without increasing the number
> of parameters of all code paths towards the internal logic.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
> mm/madvise.c | 15 ++++++++++++---
> 1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c5e1a4d1df72..3346e593e07d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1665,9 +1665,15 @@ static bool is_memory_populate(int behavior)
> }
> }
>
> +struct madvise_behavior {
> + int behavior;
> +};
I think patches 5 to 8 are just causing churn and would be much better
as a single patch. Also, why not make the above struct a general
madvise visit param struct which can be used by both
madvise_vma_anon_name() and madvise_vma_behavior()?
* Re: [RFC PATCH 05/16] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior()
2025-03-05 21:02 ` Shakeel Butt
@ 2025-03-05 21:40 ` Shakeel Butt
2025-03-05 23:56 ` SeongJae Park
0 siblings, 1 reply; 48+ messages in thread
From: Shakeel Butt @ 2025-03-05 21:40 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, Mar 05, 2025 at 01:02:15PM -0800, Shakeel Butt wrote:
> On Wed, Mar 05, 2025 at 10:16:00AM -0800, SeongJae Park wrote:
> > To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and
> > MADV_FREE, an mmu_gather object in addition to the behavior integer needs
> > to be passed to the internal logic. Define a struct for passing such
> > information together with the behavior value without increasing the number
> > of parameters of all code paths towards the internal logic.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> > mm/madvise.c | 15 ++++++++++++---
> > 1 file changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index c5e1a4d1df72..3346e593e07d 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1665,9 +1665,15 @@ static bool is_memory_populate(int behavior)
> > }
> > }
> >
> > +struct madvise_behavior {
> > + int behavior;
> > +};
>
> I think patches 5 to 8 are just causing churn and would be much better
> as a single patch. Also, why not make the above struct a general
> madvise visit param struct which can be used by both
> madvise_vma_anon_name() and madvise_vma_behavior()?
Something like following:
diff --git a/mm/madvise.c b/mm/madvise.c
index c5e1a4d1df72..cbc3817366a6 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -890,11 +890,17 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
return true;
}
+struct madvise_walk_param {
+ int behavior;
+ struct anon_vma_name *anon_name;
+};
+
static long madvise_dontneed_free(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end,
- int behavior)
+ struct madvise_walk_param *arg)
{
+ int behavior = arg->behavior;
struct mm_struct *mm = vma->vm_mm;
*prev = vma;
@@ -1249,8 +1255,9 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
static int madvise_vma_behavior(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end,
- unsigned long behavior)
+ struct madvise_walk_param *arg)
{
+ int behavior = arg->behavior;
int error;
struct anon_vma_name *anon_name;
unsigned long new_flags = vma->vm_flags;
@@ -1270,7 +1277,7 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
case MADV_FREE:
case MADV_DONTNEED:
case MADV_DONTNEED_LOCKED:
- return madvise_dontneed_free(vma, prev, start, end, behavior);
+ return madvise_dontneed_free(vma, prev, start, end, arg);
case MADV_NORMAL:
new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
break;
@@ -1462,10 +1469,10 @@ static bool process_madvise_remote_valid(int behavior)
*/
static
int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
- unsigned long end, unsigned long arg,
+ unsigned long end, struct madvise_walk_param *arg,
int (*visit)(struct vm_area_struct *vma,
struct vm_area_struct **prev, unsigned long start,
- unsigned long end, unsigned long arg))
+ unsigned long end, struct madvise_walk_param *arg))
{
struct vm_area_struct *vma;
struct vm_area_struct *prev;
@@ -1523,7 +1530,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
static int madvise_vma_anon_name(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end,
- unsigned long anon_name)
+ struct madvise_walk_param *arg)
{
int error;
@@ -1532,7 +1539,7 @@ static int madvise_vma_anon_name(struct vm_area_struct *vma,
return -EBADF;
error = madvise_update_vma(vma, prev, start, end, vma->vm_flags,
- (struct anon_vma_name *)anon_name);
+ arg->anon_name);
/*
* madvise() returns EAGAIN if kernel resources, such as
@@ -1548,6 +1555,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
{
unsigned long end;
unsigned long len;
+ struct madvise_walk_param param = { .anon_name = anon_name };
if (start & ~PAGE_MASK)
return -EINVAL;
@@ -1564,7 +1572,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
if (end == start)
return 0;
- return madvise_walk_vmas(mm, start, end, (unsigned long)anon_name,
+ return madvise_walk_vmas(mm, start, end, &param,
madvise_vma_anon_name);
}
#endif /* CONFIG_ANON_VMA_NAME */
@@ -1666,8 +1674,10 @@ static bool is_memory_populate(int behavior)
}
static int madvise_do_behavior(struct mm_struct *mm,
- unsigned long start, size_t len_in, int behavior)
+ unsigned long start, size_t len_in,
+ struct madvise_walk_param *arg)
{
+ int behavior = arg->behavior;
struct blk_plug plug;
unsigned long end;
int error;
@@ -1681,7 +1691,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
if (is_memory_populate(behavior))
error = madvise_populate(mm, start, end, behavior);
else
- error = madvise_walk_vmas(mm, start, end, behavior,
+ error = madvise_walk_vmas(mm, start, end, arg,
madvise_vma_behavior);
blk_finish_plug(&plug);
return error;
@@ -1762,13 +1772,14 @@ static int madvise_do_behavior(struct mm_struct *mm,
int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
{
int error;
+ struct madvise_walk_param arg = {.behavior = behavior};
if (madvise_should_skip(start, len_in, behavior, &error))
return error;
error = madvise_lock(mm, behavior);
if (error)
return error;
- error = madvise_do_behavior(mm, start, len_in, behavior);
+ error = madvise_do_behavior(mm, start, len_in, &arg);
madvise_unlock(mm, behavior);
return error;
@@ -1785,6 +1796,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
{
ssize_t ret = 0;
size_t total_len;
+ struct madvise_walk_param arg = {.behavior = behavior};
total_len = iov_iter_count(iter);
@@ -1800,7 +1812,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
if (madvise_should_skip(start, len_in, behavior, &error))
ret = error;
else
- ret = madvise_do_behavior(mm, start, len_in, behavior);
+ ret = madvise_do_behavior(mm, start, len_in, &arg);
/*
* An madvise operation is attempting to restart the syscall,
* but we cannot proceed as it would not be correct to repeat
--
2.43.5
* Re: [RFC PATCH 05/16] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior()
2025-03-05 21:40 ` Shakeel Butt
@ 2025-03-05 23:56 ` SeongJae Park
2025-03-06 3:37 ` Shakeel Butt
0 siblings, 1 reply; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 23:56 UTC (permalink / raw)
To: Shakeel Butt
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, 5 Mar 2025 13:40:17 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On Wed, Mar 05, 2025 at 01:02:15PM -0800, Shakeel Butt wrote:
> > On Wed, Mar 05, 2025 at 10:16:00AM -0800, SeongJae Park wrote:
> > > To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and
> > > MADV_FREE, an mmu_gather object in addition to the behavior integer needs
> > > to be passed to the internal logic. Define a struct for passing such
> > > information together with the behavior value without increasing the number
> > > of parameters of all code paths towards the internal logic.
> > >
> > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > > ---
> > > mm/madvise.c | 15 ++++++++++++---
> > > 1 file changed, 12 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index c5e1a4d1df72..3346e593e07d 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -1665,9 +1665,15 @@ static bool is_memory_populate(int behavior)
> > > }
> > > }
> > >
> > > +struct madvise_behavior {
> > > + int behavior;
> > > +};
> >
> > I think patches 5 to 8 are just causing churn and would be much better
> > as a single patch.
I agree. I will do so in the next version.
> > Also, why not make the above struct a general
> > madvise visit param struct which can be used by both
> > madvise_vma_anon_name() and madvise_vma_behavior()?
I was also thinking we could further extend the struct for cleaner and more
efficient code, so I agree with your fundamental point. I ended up deciding
to focus on tlb flush batching at the moment, though.
However, what you are suggesting is a bit different from what I was thinking.
To me, madvise_walk_vmas() is for general purposes, and hence the visit
functions should be able to receive an argument of arbitrary type. It is true
that there are only two visit functions, but they seem to do work for very
different purposes, so having a same-type argument doesn't seem very
straightforward for understanding its usage, nor efficient, from my humble
viewpoint.
>
> Something like following:
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c5e1a4d1df72..cbc3817366a6 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -890,11 +890,17 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
> return true;
> }
>
> +struct madvise_walk_param {
> + int behavior;
> + struct anon_vma_name *anon_name;
> +};
Only madvise_vma_behavior() will use the 'behavior' field. And only
madvise_vma_anon_name() will use the 'anon_name' field. But I will have to
assume both functions _might_ use both fields when reading the
madvise_walk_vmas() function code. That doesn't make my humble code reading
that simple and straightforward.
Also, populating and passing a data structure that has data that would not
really be used seems not very efficient to me.
[...]
> @@ -1666,8 +1674,10 @@ static bool is_memory_populate(int behavior)
> }
>
> static int madvise_do_behavior(struct mm_struct *mm,
> - unsigned long start, size_t len_in, int behavior)
> + unsigned long start, size_t len_in,
> + struct madvise_walk_param *arg)
> {
> + int behavior = arg->behavior;
> struct blk_plug plug;
> unsigned long end;
> int error;
> @@ -1681,7 +1691,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
> if (is_memory_populate(behavior))
> error = madvise_populate(mm, start, end, behavior);
'arg' is for the madvise_walk_vmas() visit function, but we're using it as a
capsule for passing information that will be used by madvise_do_behavior().
This also seems not very straightforward from my humble perspective.
I have no strong opinion, and maybe my humble taste is too peculiar. But if
this is not a blocker for tlb flush batching, I'd like to suggest keeping this
part as is for now, and revisiting it for more code cleanup later. What do you
think, Shakeel?
Thanks,
SJ
* Re: [RFC PATCH 05/16] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior()
2025-03-05 23:56 ` SeongJae Park
@ 2025-03-06 3:37 ` Shakeel Butt
2025-03-06 4:18 ` SeongJae Park
0 siblings, 1 reply; 48+ messages in thread
From: Shakeel Butt @ 2025-03-06 3:37 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, Mar 05, 2025 at 03:56:32PM -0800, SeongJae Park wrote:
> On Wed, 5 Mar 2025 13:40:17 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> >
> > +struct madvise_walk_param {
> > + int behavior;
> > + struct anon_vma_name *anon_name;
> > +};
>
> Only madvise_vma_behavior() will use the 'behavior' field. And only
> madvise_vma_anon_name() will use the 'anon_name' field. But I will have to
> assume both functions _might_ use both fields when reading the
> madvise_walk_vmas() function code. That doesn't make my humble code reading
> that simple and straightforward.
>
> Also, populating and passing a data structure that has data that would not
> really be used seems not very efficient to me.
>
This is not a new pattern. You can find a lot of examples in the kernel
where a struct encapsulates multiple fields and its pointer is passed
around rather than those fields (or a subset of them). You can check out
zap_details, vm_fault, and fs_parameter. If some fields are mutually
exclusive, you can have a union in the struct. Also, I am not sure what
you mean by "not efficient" here. Inefficient in what sense? Unused
memory or something else?
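For illustration, the mutually-exclusive-fields idea applied to the struct
proposed above could look like this (a sketch of the suggestion, not code
from the series):

struct madvise_walk_param {
	union {
		int behavior;			 /* used by madvise_vma_behavior() */
		struct anon_vma_name *anon_name; /* used by madvise_vma_anon_name() */
	};
};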
> [...]
> > @@ -1666,8 +1674,10 @@ static bool is_memory_populate(int behavior)
> > }
> >
> > static int madvise_do_behavior(struct mm_struct *mm,
> > - unsigned long start, size_t len_in, int behavior)
> > + unsigned long start, size_t len_in,
> > + struct madvise_walk_param *arg)
> > {
> > + int behavior = arg->behavior;
> > struct blk_plug plug;
> > unsigned long end;
> > int error;
> > @@ -1681,7 +1691,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
> > if (is_memory_populate(behavior))
> > error = madvise_populate(mm, start, end, behavior);
>
> 'arg' is for the madvise_walk_vmas() visit function, but we're using it as a
> capsule for passing information that will be used by madvise_do_behavior().
> This also seems not very straightforward from my humble perspective.
Here we can keep behavior as a parameter to madvise_walk_vmas() and define
struct madvise_walk_param inside it with the passed behavior. Anyway,
this is a very common pattern in the kernel.
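That is, something like the following sketch, reusing the madvise_walk_param
definition from the diff above (illustrative only; the real walk loop is
omitted):

static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
		unsigned long end, int behavior,
		int (*visit)(struct vm_area_struct *vma,
			struct vm_area_struct **prev, unsigned long start,
			unsigned long end, struct madvise_walk_param *arg))
{
	/* The struct is built internally from the plain 'behavior' parameter. */
	struct madvise_walk_param param = { .behavior = behavior };
	int error = 0;

	/*
	 * The existing vma-walking loop stays unchanged (omitted here); each
	 * iteration calls visit(vma, &prev, start, tmp, &param) instead of
	 * receiving 'behavior' directly.
	 */
	return error;
}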
>
> I have no strong opinion, and maybe my humble taste is too peculiar. But if
> this is not a blocker for tlb flush batching, I'd like to suggest keeping this
> part as is for now, and revisiting it for more code cleanup later. What do you
> think, Shakeel?
>
Squashing patches 5 to 8 into one is the main request from me. My other
suggestion you can ignore, but let's see what others say.
* Re: [RFC PATCH 05/16] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior()
2025-03-06 3:37 ` Shakeel Butt
@ 2025-03-06 4:18 ` SeongJae Park
0 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-06 4:18 UTC (permalink / raw)
To: Shakeel Butt
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, 5 Mar 2025 19:37:28 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On Wed, Mar 05, 2025 at 03:56:32PM -0800, SeongJae Park wrote:
> > On Wed, 5 Mar 2025 13:40:17 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > >
> > > +struct madvise_walk_param {
> > > + int behavior;
> > > + struct anon_vma_name *anon_name;
> > > +};
> >
> > Only madvise_vma_behavior() will use the 'behavior' field. And only
> > madvise_vma_anon_name() will use the 'anon_name' field. But I will have to
> > assume both functions _might_ use both fields when reading the
> > madvise_walk_vmas() function code. That doesn't make my humble code reading
> > that simple and straightforward.
> >
> > Also, populating and passing a data structure that has data that would not
> > really be used seems not very efficient to me.
> >
>
> This is not a new pattern. You can find a lot of examples in the kernel
> where a struct encapsulates multiple fields and its pointer is passed
> around rather than those fields (or a subset of them). You can check out
> zap_details, vm_fault, and fs_parameter. If some fields are mutually
> exclusive, you can have a union in the struct.
You're right. But we do so for readability and/or efficiency, right? And what
I'm saying is that I'm not very sure this change makes the code much easier
to read or more efficient.
> Also, I am not sure what
> you mean by "not efficient" here. Inefficient in what sense? Unused
> memory or something else?
I meant unused memory in this context. It is negligible, of course.
Nevertheless, what I wanted to say is not that the change will degrade
efficiency too much, but that I'm not clearly seeing the efficiency benefit
that explains why the change should be made.
>
> > [...]
> > > @@ -1666,8 +1674,10 @@ static bool is_memory_populate(int behavior)
> > > }
> > >
> > > static int madvise_do_behavior(struct mm_struct *mm,
> > > - unsigned long start, size_t len_in, int behavior)
> > > + unsigned long start, size_t len_in,
> > > + struct madvise_walk_param *arg)
> > > {
> > > + int behavior = arg->behavior;
> > > struct blk_plug plug;
> > > unsigned long end;
> > > int error;
> > > @@ -1681,7 +1691,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
> > > if (is_memory_populate(behavior))
> > > error = madvise_populate(mm, start, end, behavior);
> >
> > 'arg' is for the madvise_walk_vmas() visit function, but we're using it as a
> > capsule for passing information that will be used by madvise_do_behavior().
> > This also seems not very straightforward from my humble perspective.
>
> Here we can keep behavior as a parameter to madvise_walk_vmas() and define
> struct madvise_walk_param inside it with the passed behavior. Anyway,
> this is a very common pattern in the kernel.
>
> >
> > I have no strong opinion, and maybe my humble taste is too peculiar. But if
> > this is not a blocker for tlb flush batching, I'd like to suggest keeping this
> > part as is for now, and revisiting it for more code cleanup later. What do you
> > think, Shakeel?
> >
>
> Squashing patches 5 to 8 into one is the main request from me. My other
> suggestion you can ignore, but let's see what others say.
Makes sense. Thank you for your inputs :)
Thanks,
SJ
* [RFC PATCH 06/16] mm/madvise: pass madvise_behavior struct to madvise_vma_behavior()
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (4 preceding siblings ...)
2025-03-05 18:16 ` [RFC PATCH 05/16] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() SeongJae Park
@ 2025-03-05 18:16 ` SeongJae Park
2025-03-05 18:16 ` [RFC PATCH 07/16] mm/madvise: make madvise_walk_vmas() visit function receives a void pointer SeongJae Park
` (12 subsequent siblings)
18 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 18:16 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
Most internal madvise logic is executed by the madvise_walk_vmas() visit
function, namely madvise_vma_behavior(). The function receives only a
single behavior value, so it is difficult to extend for tlb flush batching
and potential future optimizations that may require more information than
a single behavior value. Modify it to receive a 'struct madvise_behavior'
object instead, for the upcoming tlb flush batching change and possible
future optimizations.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 3346e593e07d..8c4c128eaeb7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1241,6 +1241,10 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
&guard_remove_walk_ops, NULL);
}
+struct madvise_behavior {
+ int behavior;
+};
+
/*
* Apply an madvise behavior to a region of a vma. madvise_update_vma
* will handle splitting a vm area into separate areas, each area with its own
@@ -1249,8 +1253,10 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
static int madvise_vma_behavior(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end,
- unsigned long behavior)
+ unsigned long behavior_arg)
{
+ struct madvise_behavior *arg = (struct madvise_behavior *)behavior_arg;
+ int behavior = arg->behavior;
int error;
struct anon_vma_name *anon_name;
unsigned long new_flags = vma->vm_flags;
@@ -1665,10 +1671,6 @@ static bool is_memory_populate(int behavior)
}
}
-struct madvise_behavior {
- int behavior;
-};
-
static int madvise_do_behavior(struct mm_struct *mm,
unsigned long start, size_t len_in,
struct madvise_behavior *madv_behavior)
@@ -1687,7 +1689,8 @@ static int madvise_do_behavior(struct mm_struct *mm,
if (is_memory_populate(behavior))
error = madvise_populate(mm, start, end, behavior);
else
- error = madvise_walk_vmas(mm, start, end, behavior,
+ error = madvise_walk_vmas(mm, start, end,
+ (unsigned long)madv_behavior,
madvise_vma_behavior);
blk_finish_plug(&plug);
return error;
--
2.39.5
* [RFC PATCH 07/16] mm/madvise: make madvise_walk_vmas() visit function receives a void pointer
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (5 preceding siblings ...)
2025-03-05 18:16 ` [RFC PATCH 06/16] mm/madvise: pass madvise_behavior struct to madvise_vma_behavior() SeongJae Park
@ 2025-03-05 18:16 ` SeongJae Park
2025-03-05 18:16 ` [RFC PATCH 08/16] mm/madvise: pass madvise_behavior struct to madvise_dontneed_free() SeongJae Park
` (11 subsequent siblings)
18 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 18:16 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
madvise_walk_vmas() is used with two visit functions, namely
madvise_vma_anon_name() and madvise_vma_behavior(). The visit function
type is defined to receive an 'unsigned long' type argument. But both
visit functions need a pointer argument, so the arguments are cast
between 'unsigned long' and the real pointer types. It is more
idiomatic and clean to use a void pointer type for such cases. Update
the visit function type to receive a void pointer as the argument and
clean up the type-casting code.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 8c4c128eaeb7..6fa7dabe5bad 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1253,9 +1253,9 @@ struct madvise_behavior {
static int madvise_vma_behavior(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end,
- unsigned long behavior_arg)
+ void *behavior_arg)
{
- struct madvise_behavior *arg = (struct madvise_behavior *)behavior_arg;
+ struct madvise_behavior *arg = behavior_arg;
int behavior = arg->behavior;
int error;
struct anon_vma_name *anon_name;
@@ -1468,10 +1468,10 @@ static bool process_madvise_remote_valid(int behavior)
*/
static
int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
- unsigned long end, unsigned long arg,
+ unsigned long end, void *arg,
int (*visit)(struct vm_area_struct *vma,
struct vm_area_struct **prev, unsigned long start,
- unsigned long end, unsigned long arg))
+ unsigned long end, void *arg))
{
struct vm_area_struct *vma;
struct vm_area_struct *prev;
@@ -1529,7 +1529,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
static int madvise_vma_anon_name(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end,
- unsigned long anon_name)
+ void *anon_name)
{
int error;
@@ -1538,7 +1538,7 @@ static int madvise_vma_anon_name(struct vm_area_struct *vma,
return -EBADF;
error = madvise_update_vma(vma, prev, start, end, vma->vm_flags,
- (struct anon_vma_name *)anon_name);
+ anon_name);
/*
* madvise() returns EAGAIN if kernel resources, such as
@@ -1570,7 +1570,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
if (end == start)
return 0;
- return madvise_walk_vmas(mm, start, end, (unsigned long)anon_name,
+ return madvise_walk_vmas(mm, start, end, anon_name,
madvise_vma_anon_name);
}
#endif /* CONFIG_ANON_VMA_NAME */
@@ -1689,8 +1689,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
if (is_memory_populate(behavior))
error = madvise_populate(mm, start, end, behavior);
else
- error = madvise_walk_vmas(mm, start, end,
- (unsigned long)madv_behavior,
+ error = madvise_walk_vmas(mm, start, end, madv_behavior,
madvise_vma_behavior);
blk_finish_plug(&plug);
return error;
--
2.39.5
* [RFC PATCH 08/16] mm/madvise: pass madvise_behavior struct to madvise_dontneed_free()
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (6 preceding siblings ...)
2025-03-05 18:16 ` [RFC PATCH 07/16] mm/madvise: make madvise_walk_vmas() visit function receives a void pointer SeongJae Park
@ 2025-03-05 18:16 ` SeongJae Park
2025-03-05 18:16 ` [RFC PATCH 09/16] mm/memory: split non-tlb flushing part from zap_page_range_single() SeongJae Park
` (10 subsequent siblings)
18 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 18:16 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
madvise_dontneed_free() can be optimized to batch tlb flushes by
receiving an mmu_gather object that is initialized and finished outside,
and by only gathering the tlb entries that need to be flushed into the
received mmu_gather object. But the function receives only a single
behavior integer value. Update it to receive a 'struct madvise_behavior'
object so that the upcoming tlb flush batching optimization can pass the
mmu_gather object.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 6fa7dabe5bad..73a4323197e2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -890,11 +890,16 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
return true;
}
+struct madvise_behavior {
+ int behavior;
+};
+
static long madvise_dontneed_free(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end,
- int behavior)
+ struct madvise_behavior *madv_behavior)
{
+ int behavior = madv_behavior->behavior;
struct mm_struct *mm = vma->vm_mm;
*prev = vma;
@@ -1241,10 +1246,6 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
&guard_remove_walk_ops, NULL);
}
-struct madvise_behavior {
- int behavior;
-};
-
/*
* Apply an madvise behavior to a region of a vma. madvise_update_vma
* will handle splitting a vm area into separate areas, each area with its own
@@ -1276,7 +1277,7 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
case MADV_FREE:
case MADV_DONTNEED:
case MADV_DONTNEED_LOCKED:
- return madvise_dontneed_free(vma, prev, start, end, behavior);
+ return madvise_dontneed_free(vma, prev, start, end, arg);
case MADV_NORMAL:
new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
break;
--
2.39.5
* [RFC PATCH 09/16] mm/memory: split non-tlb flushing part from zap_page_range_single()
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (7 preceding siblings ...)
2025-03-05 18:16 ` [RFC PATCH 08/16] mm/madvise: pass madvise_behavior struct to madvise_dontneed_free() SeongJae Park
@ 2025-03-05 18:16 ` SeongJae Park
2025-03-06 18:45 ` Shakeel Butt
2025-03-05 18:16 ` [RFC PATCH 10/16] mm/madvise: let madvise_dontneed_single_vma() caller batches tlb flushes SeongJae Park
` (9 subsequent siblings)
18 siblings, 1 reply; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 18:16 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
Some zap_page_range_single() callers, such as [process_]madvise() with
MADV_DONTNEED[_LOCKED], cannot batch tlb flushes because
zap_page_range_single() does tlb flushing for each invocation. Split
out the core part of zap_page_range_single(), excluding the mmu_gather
object initialization and the flushing of gathered tlb entries, for such
batched tlb flushing usage.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/memory.c | 36 ++++++++++++++++++++++--------------
1 file changed, 22 insertions(+), 14 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index a838c8c44bfd..aadb2844c701 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2011,38 +2011,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
mmu_notifier_invalidate_range_end(&range);
}
-/**
- * zap_page_range_single - remove user pages in a given range
- * @vma: vm_area_struct holding the applicable pages
- * @address: starting address of pages to zap
- * @size: number of bytes to zap
- * @details: details of shared cache invalidation
- *
- * The range must fit into one VMA.
- */
-void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+static void unmap_vma_single(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *details)
{
const unsigned long end = address + size;
struct mmu_notifier_range range;
- struct mmu_gather tlb;
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
address, end);
hugetlb_zap_begin(vma, &range.start, &range.end);
- tlb_gather_mmu(&tlb, vma->vm_mm);
update_hiwater_rss(vma->vm_mm);
mmu_notifier_invalidate_range_start(&range);
/*
* unmap 'address-end' not 'range.start-range.end' as range
* could have been expanded for hugetlb pmd sharing.
*/
- unmap_single_vma(&tlb, vma, address, end, details, false);
+ unmap_single_vma(tlb, vma, address, end, details, false);
mmu_notifier_invalidate_range_end(&range);
- tlb_finish_mmu(&tlb);
hugetlb_zap_end(vma, details);
}
+/**
+ * zap_page_range_single - remove user pages in a given range
+ * @vma: vm_area_struct holding the applicable pages
+ * @address: starting address of pages to zap
+ * @size: number of bytes to zap
+ * @details: details of shared cache invalidation
+ *
+ * The range must fit into one VMA.
+ */
+void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+ unsigned long size, struct zap_details *details)
+{
+ struct mmu_gather tlb;
+
+ tlb_gather_mmu(&tlb, vma->vm_mm);
+ unmap_vma_single(&tlb, vma, address, size, details);
+ tlb_finish_mmu(&tlb);
+}
+
/**
* zap_vma_ptes - remove ptes mapping the vma
* @vma: vm_area_struct holding ptes to be zapped
--
2.39.5
* Re: [RFC PATCH 09/16] mm/memory: split non-tlb flushing part from zap_page_range_single()
2025-03-05 18:16 ` [RFC PATCH 09/16] mm/memory: split non-tlb flushing part from zap_page_range_single() SeongJae Park
@ 2025-03-06 18:45 ` Shakeel Butt
2025-03-06 19:09 ` SeongJae Park
0 siblings, 1 reply; 48+ messages in thread
From: Shakeel Butt @ 2025-03-06 18:45 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, Mar 05, 2025 at 10:16:04AM -0800, SeongJae Park wrote:
> Some zap_page_range_single() callers, such as [process_]madvise() with
> MADV_DONTNEED[_LOCKED], cannot batch tlb flushes because
> zap_page_range_single() does tlb flushing for each invocation. Split
> out the core part of zap_page_range_single(), excluding the mmu_gather
> object initialization and the flushing of gathered tlb entries, for such
> batched tlb flushing usage.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
> mm/memory.c | 36 ++++++++++++++++++++++--------------
> 1 file changed, 22 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a838c8c44bfd..aadb2844c701 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2011,38 +2011,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> mmu_notifier_invalidate_range_end(&range);
> }
>
> -/**
> - * zap_page_range_single - remove user pages in a given range
> - * @vma: vm_area_struct holding the applicable pages
> - * @address: starting address of pages to zap
> - * @size: number of bytes to zap
> - * @details: details of shared cache invalidation
> - *
> - * The range must fit into one VMA.
> - */
> -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> +static void unmap_vma_single(struct mmu_gather *tlb,
> + struct vm_area_struct *vma, unsigned long address,
> unsigned long size, struct zap_details *details)
> {
Please add kerneldoc for this function and explicitly specify that tlb
cannot be NULL. Maybe do that in the patch where you make it
non-static.
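One possible shape for the requested kerneldoc, modeled on the
zap_page_range_single() comment moved by this patch (the wording is only a
suggestion):

/**
 * unmap_vma_single - unmap a single vma range, gathering tlb entries
 * @tlb: mmu_gather for batching the tlb flushes; must not be NULL, and must
 *	be set up with tlb_gather_mmu() and finished with tlb_finish_mmu()
 *	by the caller
 * @vma: vm_area_struct holding the applicable pages
 * @address: starting address of pages to unmap
 * @size: number of bytes to unmap
 * @details: details of shared cache invalidation
 *
 * The range must fit into one VMA.
 */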
* Re: [RFC PATCH 09/16] mm/memory: split non-tlb flushing part from zap_page_range_single()
2025-03-06 18:45 ` Shakeel Butt
@ 2025-03-06 19:09 ` SeongJae Park
0 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-06 19:09 UTC (permalink / raw)
To: Shakeel Butt
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Thu, 6 Mar 2025 10:45:01 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On Wed, Mar 05, 2025 at 10:16:04AM -0800, SeongJae Park wrote:
> > Some zap_page_range_single() callers, such as [process_]madvise() with
> > MADV_DONTNEED[_LOCKED], cannot batch tlb flushes because
> > zap_page_range_single() does tlb flushing for each invocation. Split
> > out the core part of zap_page_range_single(), excluding the mmu_gather
> > object initialization and the flushing of gathered tlb entries, for such
> > batched tlb flushing usage.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> > mm/memory.c | 36 ++++++++++++++++++++++--------------
> > 1 file changed, 22 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index a838c8c44bfd..aadb2844c701 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2011,38 +2011,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> > mmu_notifier_invalidate_range_end(&range);
> > }
> >
> > -/**
> > - * zap_page_range_single - remove user pages in a given range
> > - * @vma: vm_area_struct holding the applicable pages
> > - * @address: starting address of pages to zap
> > - * @size: number of bytes to zap
> > - * @details: details of shared cache invalidation
> > - *
> > - * The range must fit into one VMA.
> > - */
> > -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > +static void unmap_vma_single(struct mmu_gather *tlb,
> > + struct vm_area_struct *vma, unsigned long address,
> > unsigned long size, struct zap_details *details)
> > {
>
> Please add kerneldoc for this function and explicitly specify that tlb
> can not be NULL. Maybe do that in the patch where you make it
> non-static.
Thank you for this nice suggestion, I will do so in the next version.
Thanks,
SJ
^ permalink raw reply [flat|nested] 48+ messages in thread
* [RFC PATCH 10/16] mm/madvise: let madvise_dontneed_single_vma() caller batches tlb flushes
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (8 preceding siblings ...)
2025-03-05 18:16 ` [RFC PATCH 09/16] mm/memory: split non-tlb flushing part from zap_page_range_single() SeongJae Park
@ 2025-03-05 18:16 ` SeongJae Park
2025-03-06 18:36 ` Shakeel Butt
2025-03-05 18:16 ` [RFC PATCH 11/16] mm/madvise: let madvise_free_single_vma() " SeongJae Park
` (8 subsequent siblings)
18 siblings, 1 reply; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 18:16 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
Update the madvise_dontneed_single_vma() function so that the caller
can pass an mmu_gather object that is initialized and finished outside,
for batched tlb flushes. Also modify madvise_dontneed_single_vma()
internal code to support such usage, by skipping the initialization and
finishing of a self-allocated mmu_gather object when it receives a
valid mmu_gather object.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/internal.h | 3 +++
mm/madvise.c | 10 +++++++---
mm/memory.c | 6 +++---
3 files changed, 13 insertions(+), 6 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 0413822ee34b..45b8d08cd176 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -438,6 +438,9 @@ void unmap_page_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
struct zap_details *details);
+void unmap_vma_single(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ unsigned long addr, unsigned long size,
+ struct zap_details *details);
int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
gfp_t gfp);
diff --git a/mm/madvise.c b/mm/madvise.c
index 73a4323197e2..22da6699613c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -848,7 +848,8 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
* An interface that causes the system to free clean pages and flush
* dirty pages is already available as msync(MS_INVALIDATE).
*/
-static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
+static long madvise_dontneed_single_vma(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
struct zap_details details = {
@@ -856,7 +857,10 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
.even_cows = true,
};
- zap_page_range_single(vma, start, end - start, &details);
+ if (!tlb)
+ zap_page_range_single(vma, start, end - start, &details);
+ else
+ unmap_vma_single(tlb, vma, start, end - start, &details);
return 0;
}
@@ -951,7 +955,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
}
if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
- return madvise_dontneed_single_vma(vma, start, end);
+ return madvise_dontneed_single_vma(NULL, vma, start, end);
else if (behavior == MADV_FREE)
return madvise_free_single_vma(vma, start, end);
else
diff --git a/mm/memory.c b/mm/memory.c
index aadb2844c701..ba011c064936 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2011,9 +2011,9 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
mmu_notifier_invalidate_range_end(&range);
}
-static void unmap_vma_single(struct mmu_gather *tlb,
- struct vm_area_struct *vma, unsigned long address,
- unsigned long size, struct zap_details *details)
+void unmap_vma_single(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ unsigned long address, unsigned long size,
+ struct zap_details *details)
{
const unsigned long end = address + size;
struct mmu_notifier_range range;
--
2.39.5
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 10/16] mm/madvise: let madvise_dontneed_single_vma() caller batches tlb flushes
2025-03-05 18:16 ` [RFC PATCH 10/16] mm/madvise: let madvise_dontneed_single_vma() caller batches tlb flushes SeongJae Park
@ 2025-03-06 18:36 ` Shakeel Butt
2025-03-06 19:10 ` SeongJae Park
0 siblings, 1 reply; 48+ messages in thread
From: Shakeel Butt @ 2025-03-06 18:36 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, Mar 05, 2025 at 10:16:05AM -0800, SeongJae Park wrote:
> Update the madvise_dontneed_single_vma() function so that the caller
> can pass an mmu_gather object that is initialized and finished outside,
> for batched tlb flushes. Also modify madvise_dontneed_single_vma()
> internal code to support such usage, by skipping the initialization and
> finishing of a self-allocated mmu_gather object when it receives a
> valid mmu_gather object.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
Please squash patches 10 and 11.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 10/16] mm/madvise: let madvise_dontneed_single_vma() caller batches tlb flushes
2025-03-06 18:36 ` Shakeel Butt
@ 2025-03-06 19:10 ` SeongJae Park
0 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-06 19:10 UTC (permalink / raw)
To: Shakeel Butt
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Thu, 6 Mar 2025 10:36:08 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On Wed, Mar 05, 2025 at 10:16:05AM -0800, SeongJae Park wrote:
> > Update the madvise_dontneed_single_vma() function so that the caller
> > can pass an mmu_gather object that is initialized and finished outside,
> > for batched tlb flushes. Also modify madvise_dontneed_single_vma()
> > internal code to support such usage, by skipping the initialization and
> > finishing of a self-allocated mmu_gather object when it receives a
> > valid mmu_gather object.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
>
> Please squash patches 10 and 11.
I will do so in the next version.
Thanks,
SJ
^ permalink raw reply [flat|nested] 48+ messages in thread
* [RFC PATCH 11/16] mm/madvise: let madvise_free_single_vma() caller batches tlb flushes
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (9 preceding siblings ...)
2025-03-05 18:16 ` [RFC PATCH 10/16] mm/madvise: let madvise_dontneed_single_vma() caller batches tlb flushes SeongJae Park
@ 2025-03-05 18:16 ` SeongJae Park
2025-03-05 18:16 ` [RFC PATCH 12/16] mm/madvise: batch tlb flushes for process_madvise(MADV_DONTNEED[_LOCKED]) SeongJae Park
` (7 subsequent siblings)
18 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 18:16 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
Update the madvise_free_single_vma() function so that the caller can
pass an mmu_gather object that is initialized and finished outside, for
batched tlb flushes. Also modify madvise_free_single_vma() internal
code to support such usage while keeping support of the old usage,
where the mmu_gather object is not passed.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 27 ++++++++++++++++++---------
1 file changed, 18 insertions(+), 9 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 22da6699613c..767c5d21ee75 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -794,12 +794,19 @@ static const struct mm_walk_ops madvise_free_walk_ops = {
.walk_lock = PGWALK_RDLOCK,
};
-static int madvise_free_single_vma(struct vm_area_struct *vma,
- unsigned long start_addr, unsigned long end_addr)
+static int madvise_free_single_vma(
+ struct mmu_gather *caller_tlb, struct vm_area_struct *vma,
+ unsigned long start_addr, unsigned long end_addr)
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_notifier_range range;
- struct mmu_gather tlb;
+ struct mmu_gather self_tlb;
+ struct mmu_gather *tlb;
+
+ if (caller_tlb)
+ tlb = caller_tlb;
+ else
+ tlb = &self_tlb;
/* MADV_FREE works for only anon vma at the moment */
if (!vma_is_anonymous(vma))
@@ -815,16 +822,18 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
range.start, range.end);
lru_add_drain();
- tlb_gather_mmu(&tlb, mm);
+ if (!caller_tlb)
+ tlb_gather_mmu(tlb, mm);
update_hiwater_rss(mm);
mmu_notifier_invalidate_range_start(&range);
- tlb_start_vma(&tlb, vma);
+ tlb_start_vma(tlb, vma);
walk_page_range(vma->vm_mm, range.start, range.end,
- &madvise_free_walk_ops, &tlb);
- tlb_end_vma(&tlb, vma);
+ &madvise_free_walk_ops, tlb);
+ tlb_end_vma(tlb, vma);
mmu_notifier_invalidate_range_end(&range);
- tlb_finish_mmu(&tlb);
+ if (!caller_tlb)
+ tlb_finish_mmu(tlb);
return 0;
}
@@ -957,7 +966,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
return madvise_dontneed_single_vma(NULL, vma, start, end);
else if (behavior == MADV_FREE)
- return madvise_free_single_vma(vma, start, end);
+ return madvise_free_single_vma(NULL, vma, start, end);
else
return -EINVAL;
}
--
2.39.5
^ permalink raw reply [flat|nested] 48+ messages in thread
* [RFC PATCH 12/16] mm/madvise: batch tlb flushes for process_madvise(MADV_DONTNEED[_LOCKED])
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (10 preceding siblings ...)
2025-03-05 18:16 ` [RFC PATCH 11/16] mm/madvise: let madvise_free_single_vma() " SeongJae Park
@ 2025-03-05 18:16 ` SeongJae Park
2025-03-06 18:36 ` Shakeel Butt
2025-03-05 18:16 ` [RFC PATCH 13/16] mm/madvise: batch tlb flushes for process_madvise(MADV_FREE) SeongJae Park
` (6 subsequent siblings)
18 siblings, 1 reply; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 18:16 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
MADV_DONTNEED[_LOCKED] internal logic can now be invoked with batched
tlb flushes. Update vector_madvise(), which is called for
process_madvise(), to use it in that way, by passing an mmu_gather
object that it initializes before starting the internal work, and
flushing the gathered tlb entries at once after all the internal work
is done.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 40 ++++++++++++++++++++++++++++++++++++++--
1 file changed, 38 insertions(+), 2 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 767c5d21ee75..efa4184d6cf5 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -905,6 +905,7 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
struct madvise_behavior {
int behavior;
+ struct mmu_gather *tlb;
};
static long madvise_dontneed_free(struct vm_area_struct *vma,
@@ -964,7 +965,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
}
if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
- return madvise_dontneed_single_vma(NULL, vma, start, end);
+ return madvise_dontneed_single_vma(
+ madv_behavior->tlb, vma, start, end);
else if (behavior == MADV_FREE)
return madvise_free_single_vma(NULL, vma, start, end);
else
@@ -1802,19 +1804,50 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
return do_madvise(current->mm, start, len_in, behavior);
}
+static bool vector_madvise_batch_tlb_flush(int behavior)
+{
+ switch (behavior) {
+ case MADV_DONTNEED:
+ case MADV_DONTNEED_LOCKED:
+ return true;
+ default:
+ return false;
+ }
+}
+
+static void vector_madvise_init_tlb(struct madvise_behavior *madv_behavior,
+ struct mm_struct *mm)
+{
+ if (!vector_madvise_batch_tlb_flush(madv_behavior->behavior))
+ return;
+ tlb_gather_mmu(madv_behavior->tlb, mm);
+}
+
+static void vector_madvise_finish_tlb(struct madvise_behavior *madv_behavior)
+{
+ if (!vector_madvise_batch_tlb_flush(madv_behavior->behavior))
+ return;
+ tlb_finish_mmu(madv_behavior->tlb);
+}
+
/* Perform an madvise operation over a vector of addresses and lengths. */
static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
int behavior)
{
ssize_t ret = 0;
size_t total_len;
- struct madvise_behavior madv_behavior = {.behavior = behavior};
+ struct mmu_gather tlb;
+ struct madvise_behavior madv_behavior = {
+ .behavior = behavior,
+ .tlb = &tlb,
+ };
total_len = iov_iter_count(iter);
ret = madvise_lock(mm, behavior);
if (ret)
return ret;
+ vector_madvise_init_tlb(&madv_behavior, mm);
while (iov_iter_count(iter)) {
unsigned long start = (unsigned long)iter_iov_addr(iter);
@@ -1843,14 +1876,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
}
/* Drop and reacquire lock to unwind race. */
+ vector_madvise_finish_tlb(&madv_behavior);
madvise_unlock(mm, behavior);
madvise_lock(mm, behavior);
+ vector_madvise_init_tlb(&madv_behavior, mm);
continue;
}
if (ret < 0)
break;
iov_iter_advance(iter, iter_iov_len(iter));
}
+ vector_madvise_finish_tlb(&madv_behavior);
madvise_unlock(mm, behavior);
ret = (total_len - iov_iter_count(iter)) ? : ret;
--
2.39.5
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 12/16] mm/madvise: batch tlb flushes for process_madvise(MADV_DONTNEED[_LOCKED])
2025-03-05 18:16 ` [RFC PATCH 12/16] mm/madvise: batch tlb flushes for process_madvise(MADV_DONTNEED[_LOCKED]) SeongJae Park
@ 2025-03-06 18:36 ` Shakeel Butt
2025-03-06 19:11 ` SeongJae Park
0 siblings, 1 reply; 48+ messages in thread
From: Shakeel Butt @ 2025-03-06 18:36 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, Mar 05, 2025 at 10:16:07AM -0800, SeongJae Park wrote:
> MADV_DONTNEED[_LOCKED] internal logic can now be invoked with batched
> tlb flushes. Update vector_madvise(), which is called for
> process_madvise(), to use it in that way, by passing an mmu_gather
> object that it initializes before starting the internal work, and
> flushing the gathered tlb entries at once after all the internal work
> is done.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
Please squash patches 12, 13 and 14.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 12/16] mm/madvise: batch tlb flushes for process_madvise(MADV_DONTNEED[_LOCKED])
2025-03-06 18:36 ` Shakeel Butt
@ 2025-03-06 19:11 ` SeongJae Park
0 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-06 19:11 UTC (permalink / raw)
To: Shakeel Butt
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Thu, 6 Mar 2025 10:36:43 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On Wed, Mar 05, 2025 at 10:16:07AM -0800, SeongJae Park wrote:
> > MADV_DONTNEED[_LOCKED] internal logic can now be invoked with batched
> > tlb flushes. Update vector_madvise(), which is called for
> > process_madvise(), to use it in that way, by passing an mmu_gather
> > object that it initializes before starting the internal work, and
> > flushing the gathered tlb entries at once after all the internal work
> > is done.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
>
> Please squash patches 12, 13 and 14.
Surely I will do so in the next spin, for all your squash requests, including
those for the 15th and 16th patches.
Thanks,
SJ
^ permalink raw reply [flat|nested] 48+ messages in thread
* [RFC PATCH 13/16] mm/madvise: batch tlb flushes for process_madvise(MADV_FREE)
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (11 preceding siblings ...)
2025-03-05 18:16 ` [RFC PATCH 12/16] mm/madvise: batch tlb flushes for process_madvise(MADV_DONTNEED[_LOCKED]) SeongJae Park
@ 2025-03-05 18:16 ` SeongJae Park
2025-03-05 18:16 ` [RFC PATCH 14/16] mm/madvise: batch tlb flushes for madvise(MADV_{DONTNEED[_LOCKED],FREE} SeongJae Park
` (5 subsequent siblings)
18 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 18:16 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
MADV_FREE internal logic can now be invoked with batched tlb flushes.
Update vector_madvise(), which is called for process_madvise(), to use
it in that efficient way, by passing an mmu_gather object that it
initializes before starting the work, and flushing the tlb entries at
once after all the internal work is done.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index efa4184d6cf5..f1beadb6176a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -968,7 +968,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
return madvise_dontneed_single_vma(
madv_behavior->tlb, vma, start, end);
else if (behavior == MADV_FREE)
- return madvise_free_single_vma(NULL, vma, start, end);
+ return madvise_free_single_vma(
+ madv_behavior->tlb, vma, start, end);
else
return -EINVAL;
}
--
2.39.5
^ permalink raw reply [flat|nested] 48+ messages in thread
* [RFC PATCH 14/16] mm/madvise: batch tlb flushes for madvise(MADV_{DONTNEED[_LOCKED],FREE}
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (12 preceding siblings ...)
2025-03-05 18:16 ` [RFC PATCH 13/16] mm/madvise: batch tlb flushes for process_madvise(MADV_FREE) SeongJae Park
@ 2025-03-05 18:16 ` SeongJae Park
2025-03-05 18:16 ` [RFC PATCH 15/16] mm/madvise: remove !tlb support from madvise_dontneed_single_vma() SeongJae Park
` (4 subsequent siblings)
18 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 18:16 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
MADV_DONTNEED[_LOCKED] and MADV_FREE internal logic can now be invoked
with batched tlb flushes. madvise() is called for a single address
range, but if there are multiple vmas for the range, a tlb flush will
happen multiple times. Update do_madvise(), which is called for
madvise(), to use the batching in the efficient way, by passing an
mmu_gather object that it initializes before starting the work, and
flushing the tlb entries at once after all the internal work is done.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 68 ++++++++++++++++++++++++++++------------------------
1 file changed, 37 insertions(+), 31 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index f1beadb6176a..0d292b8e1a0e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1635,6 +1635,32 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
mmap_read_unlock(mm);
}
+static bool madvise_batch_tlb_flush(int behavior)
+{
+ switch (behavior) {
+ case MADV_DONTNEED:
+ case MADV_DONTNEED_LOCKED:
+ return true;
+ default:
+ return false;
+ }
+}
+
+static void madvise_init_tlb(struct madvise_behavior *madv_behavior,
+ struct mm_struct *mm)
+{
+ if (!madvise_batch_tlb_flush(madv_behavior->behavior))
+ return;
+ tlb_gather_mmu(madv_behavior->tlb, mm);
+}
+
+static void madvise_finish_tlb(struct madvise_behavior *madv_behavior)
+{
+ if (!madvise_batch_tlb_flush(madv_behavior->behavior))
+ return;
+ tlb_finish_mmu(madv_behavior->tlb);
+}
+
static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
{
size_t len;
@@ -1787,14 +1813,20 @@ static int madvise_do_behavior(struct mm_struct *mm,
int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
{
int error;
- struct madvise_behavior madv_behavior = {.behavior = behavior};
+ struct mmu_gather tlb;
+ struct madvise_behavior madv_behavior = {
+ .behavior = behavior,
+ .tlb = &tlb,
+ };
if (madvise_should_skip(start, len_in, behavior, &error))
return error;
error = madvise_lock(mm, behavior);
if (error)
return error;
+ madvise_init_tlb(&madv_behavior, mm);
error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
+ madvise_finish_tlb(&madv_behavior);
madvise_unlock(mm, behavior);
return error;
@@ -1805,32 +1837,6 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
return do_madvise(current->mm, start, len_in, behavior);
}
-static bool vector_madvise_batch_tlb_flush(int behavior)
-{
- switch (behavior) {
- case MADV_DONTNEED:
- case MADV_DONTNEED_LOCKED:
- return true;
- default:
- return false;
- }
-}
-
-static void vector_madvise_init_tlb(struct madvise_behavior *madv_behavior,
- struct mm_struct *mm)
-{
- if (!vector_madvise_batch_tlb_flush(madv_behavior->behavior))
- return;
- tlb_gather_mmu(madv_behavior->tlb, mm);
-}
-
-static void vector_madvise_finish_tlb(struct madvise_behavior *madv_behavior)
-{
- if (!vector_madvise_batch_tlb_flush(madv_behavior->behavior))
- return;
- tlb_finish_mmu(madv_behavior->tlb);
-}
-
/* Perform an madvise operation over a vector of addresses and lengths. */
static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
int behavior)
@@ -1848,7 +1854,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
ret = madvise_lock(mm, behavior);
if (ret)
return ret;
- vector_madvise_init_tlb(&madv_behavior, mm);
+ madvise_init_tlb(&madv_behavior, mm);
while (iov_iter_count(iter)) {
unsigned long start = (unsigned long)iter_iov_addr(iter);
@@ -1877,17 +1883,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
}
/* Drop and reacquire lock to unwind race. */
- vector_madvise_finish_tlb(&madv_behavior);
+ madvise_finish_tlb(&madv_behavior);
madvise_unlock(mm, behavior);
madvise_lock(mm, behavior);
- vector_madvise_init_tlb(&madv_behavior, mm);
+ madvise_init_tlb(&madv_behavior, mm);
continue;
}
if (ret < 0)
break;
iov_iter_advance(iter, iter_iov_len(iter));
}
- vector_madvise_finish_tlb(&madv_behavior);
+ madvise_finish_tlb(&madv_behavior);
madvise_unlock(mm, behavior);
ret = (total_len - iov_iter_count(iter)) ? : ret;
--
2.39.5
^ permalink raw reply [flat|nested] 48+ messages in thread
* [RFC PATCH 15/16] mm/madvise: remove !tlb support from madvise_dontneed_single_vma()
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (13 preceding siblings ...)
2025-03-05 18:16 ` [RFC PATCH 14/16] mm/madvise: batch tlb flushes for madvise(MADV_{DONTNEED[_LOCKED],FREE} SeongJae Park
@ 2025-03-05 18:16 ` SeongJae Park
2025-03-06 18:37 ` Shakeel Butt
2025-03-05 18:16 ` [RFC PATCH 16/16] mm/madvise: remove !caller_tlb case of madvise_free_single_vma() SeongJae Park
` (3 subsequent siblings)
18 siblings, 1 reply; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 18:16 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
madvise_dontneed_single_vma() supports both batched and unbatched tlb
flush use cases, depending on the received tlb parameter's value. Both
were supported for a safe and smooth transition of the usages from
unbatched flushes to batched ones. Now the transition is done, and
therefore there is no real unbatched tlb flush use case. Remove the
code supporting the no longer used input case.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 0d292b8e1a0e..1dd2c25c83d8 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -866,10 +866,7 @@ static long madvise_dontneed_single_vma(struct mmu_gather *tlb,
.even_cows = true,
};
- if (!tlb)
- zap_page_range_single(vma, start, end - start, &details);
- else
- unmap_vma_single(tlb, vma, start, end - start, &details);
+ unmap_vma_single(tlb, vma, start, end - start, &details);
return 0;
}
--
2.39.5
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 15/16] mm/madvise: remove !tlb support from madvise_dontneed_single_vma()
2025-03-05 18:16 ` [RFC PATCH 15/16] mm/madvise: remove !tlb support from madvise_dontneed_single_vma() SeongJae Park
@ 2025-03-06 18:37 ` Shakeel Butt
0 siblings, 0 replies; 48+ messages in thread
From: Shakeel Butt @ 2025-03-06 18:37 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, Mar 05, 2025 at 10:16:10AM -0800, SeongJae Park wrote:
> madvise_dontneed_single_vma() supports both batched and unbatched tlb
> flush use cases, depending on the received tlb parameter's value. Both
> were supported for a safe and smooth transition of the usages from
> unbatched flushes to batched ones. Now the transition is done, and
> therefore there is no real unbatched tlb flush use case. Remove the
> code supporting the no longer used input case.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
Please squash patches 15 and 16.
^ permalink raw reply [flat|nested] 48+ messages in thread
* [RFC PATCH 16/16] mm/madvise: remove !caller_tlb case of madvise_free_single_vma()
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (14 preceding siblings ...)
2025-03-05 18:16 ` [RFC PATCH 15/16] mm/madvise: remove !tlb support from madvise_dontneed_single_vma() SeongJae Park
@ 2025-03-05 18:16 ` SeongJae Park
2025-03-05 18:56 ` [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE Matthew Wilcox
` (2 subsequent siblings)
18 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 18:16 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
madvise_free_single_vma() supports both batched and unbatched tlb
flush use cases, depending on the received tlb parameter's value. Both
were supported for a safe and smooth transition of the usages from
unbatched flushes to batched ones. Now the transition is done, and
therefore there is no real unbatched tlb flush use case. Remove the
code supporting the no longer used input case.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 14 +-------------
1 file changed, 1 insertion(+), 13 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 1dd2c25c83d8..03ba5ff0cf9b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -795,18 +795,11 @@ static const struct mm_walk_ops madvise_free_walk_ops = {
};
static int madvise_free_single_vma(
- struct mmu_gather *caller_tlb, struct vm_area_struct *vma,
+ struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long start_addr, unsigned long end_addr)
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_notifier_range range;
- struct mmu_gather self_tlb;
- struct mmu_gather *tlb;
-
- if (caller_tlb)
- tlb = caller_tlb;
- else
- tlb = &self_tlb;
/* MADV_FREE works for only anon vma at the moment */
if (!vma_is_anonymous(vma))
@@ -822,8 +815,6 @@ static int madvise_free_single_vma(
range.start, range.end);
lru_add_drain();
- if (!caller_tlb)
- tlb_gather_mmu(tlb, mm);
update_hiwater_rss(mm);
mmu_notifier_invalidate_range_start(&range);
@@ -832,9 +823,6 @@ static int madvise_free_single_vma(
&madvise_free_walk_ops, tlb);
tlb_end_vma(tlb, vma);
mmu_notifier_invalidate_range_end(&range);
- if (!caller_tlb)
- tlb_finish_mmu(tlb);
-
return 0;
}
--
2.39.5
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (15 preceding siblings ...)
2025-03-05 18:16 ` [RFC PATCH 16/16] mm/madvise: remove !caller_tlb case of madvise_free_single_vma() SeongJae Park
@ 2025-03-05 18:56 ` Matthew Wilcox
2025-03-05 19:19 ` David Hildenbrand
2025-03-05 20:22 ` Shakeel Butt
2025-03-05 20:36 ` Nadav Amit
18 siblings, 1 reply; 48+ messages in thread
From: Matthew Wilcox @ 2025-03-05 18:56 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> can happen for each vma of the given address ranges. Because such tlb
> flushes are for address ranges of the same process, doing those in a batch
> is more efficient while still being safe. Modify madvise() and
> process_madvise() entry level code path to do such batched tlb flushes,
> while the internal unmap logics do only gathering of the tlb entries to
> flush.
Do real applications actually do madvise requests that span multiple
VMAs? It just seems weird to me. Like, each vma comes from a separate
call to mmap [1], so why would it make sense for an application to
call madvise() across a VMA boundary?
[1] Yes, I know we sometimes merge and/or split VMAs
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
2025-03-05 18:56 ` [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE Matthew Wilcox
@ 2025-03-05 19:19 ` David Hildenbrand
2025-03-05 19:26 ` Lorenzo Stoakes
2025-03-05 19:46 ` Shakeel Butt
0 siblings, 2 replies; 48+ messages in thread
From: David Hildenbrand @ 2025-03-05 19:19 UTC (permalink / raw)
To: Matthew Wilcox, SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, Lorenzo Stoakes, Shakeel Butt,
Vlastimil Babka, kernel-team, linux-kernel, linux-mm
On 05.03.25 19:56, Matthew Wilcox wrote:
> On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
>> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
>> can happen for each vma of the given address ranges. Because such tlb
>> flushes are for address ranges of the same process, doing those in a batch
>> is more efficient while still being safe. Modify madvise() and
>> process_madvise() entry level code path to do such batched tlb flushes,
>> while the internal unmap logics do only gathering of the tlb entries to
>> flush.
>
> Do real applications actually do madvise requests that span multiple
> VMAs? It just seems weird to me. Like, each vma comes from a separate
> call to mmap [1], so why would it make sense for an application to
> call madvise() across a VMA boundary?
I had the same question. If this happens in an app, I would assume that
a single MADV_DONTNEED call would usually not span multiple VMAs, and
if it does, not that many (and that often) that we would really care
about it.
OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED
version would make more sense to me. I don't recall if process_madvise()
allows for that already, and if it does, is this series primarily
tackling optimizing that?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
2025-03-05 19:19 ` David Hildenbrand
@ 2025-03-05 19:26 ` Lorenzo Stoakes
2025-03-05 19:35 ` David Hildenbrand
2025-03-05 19:46 ` Shakeel Butt
1 sibling, 1 reply; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-03-05 19:26 UTC (permalink / raw)
To: David Hildenbrand
Cc: Matthew Wilcox, SeongJae Park, Liam R. Howlett, Andrew Morton,
Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
> On 05.03.25 19:56, Matthew Wilcox wrote:
> > On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> > > For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> > > can happen for each vma of the given address ranges. Because such tlb
> > > flushes are for address ranges of the same process, doing those in a batch
> > > is more efficient while still being safe. Modify madvise() and
> > > process_madvise() entry level code path to do such batched tlb flushes,
> > > while the internal unmap logics do only gathering of the tlb entries to
> > > flush.
> >
> > Do real applications actually do madvise requests that span multiple
> > VMAs? It just seems weird to me. Like, each vma comes from a separate
> > call to mmap [1], so why would it make sense for an application to
> > call madvise() across a VMA boundary?
>
> I had the same question. If this happens in an app, I would assume that a
> single MADV_DONTNEED call would usually not span multiple VMAs, and if it
> does, not that many (and that often) that we would really care about it.
>
> OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
> would make more sense to me. I don't recall if process_madvise() allows for
> that already, and if it does, is this series primarily tackling optimizing
> that?
Yeah it's weird, but people can get caught out by unexpected failures to merge
if they do fun stuff with mremap().
Then again mremap() itself _mandates_ that you only span a single VMA (or part
of one) :)
Can we talk about the _true_ horror show that - you can span multiple VMAs _with
gaps_ and it'll allow you, only it'll return -ENOMEM at the end?
In madvise_walk_vmas():
for (;;) {
...
if (start < vma->vm_start) {
unmapped_error = -ENOMEM;
start = vma->vm_start;
...
}
...
error = visit(vma, &prev, start, tmp, arg);
if (error)
return error;
...
}
return unmapped_error;
So, you have no idea if that -ENOMEM is due to a gap, or due to the
operation returning an -ENOMEM?
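To make the ambiguity concrete, a quick untested userspace sketch - the
two pages around the hole do get the advice applied, yet the call
reports ENOMEM regardless:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	/* Map three anonymous pages, then punch a hole in the middle one. */
	char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	munmap(p + page, page);

	/* The still-mapped first and third pages are zapped, but the gap
	 * makes this return -1 with errno == ENOMEM - indistinguishable
	 * from the operation itself failing. */
	if (madvise(p, 3 * page, MADV_DONTNEED))
		printf("madvise: %s\n", strerror(errno));
	return 0;
}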
I mean can we just drop this? Does anybody in their right mind rely on
this? Or is it intentional, to somehow deal with a racing unmap?
But, no, we hold an mmap lock so that's not it.
Yeah OK so can we drop this madness? :) or am I missing some very important
detail about why we allow this?
I guess spanning multiple VMAs we _have_ to leave in because plausibly
there are users of that out there?
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
2025-03-05 19:26 ` Lorenzo Stoakes
@ 2025-03-05 19:35 ` David Hildenbrand
2025-03-05 19:39 ` Lorenzo Stoakes
0 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2025-03-05 19:35 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Matthew Wilcox, SeongJae Park, Liam R. Howlett, Andrew Morton,
Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On 05.03.25 20:26, Lorenzo Stoakes wrote:
> On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
>> On 05.03.25 19:56, Matthew Wilcox wrote:
>>> On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
>>>> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
>>>> can happen for each vma of the given address ranges. Because such tlb
>>>> flushes are for address ranges of the same process, doing those in a batch
>>>> is more efficient while still being safe. Modify madvise() and
>>>> process_madvise() entry level code path to do such batched tlb flushes,
>>>> while the internal unmap logics do only gathering of the tlb entries to
>>>> flush.
>>>
>>> Do real applications actually do madvise requests that span multiple
>>> VMAs? It just seems weird to me. Like, each vma comes from a separate
>>> call to mmap [1], so why would it make sense for an application to
>>> call madvise() across a VMA boundary?
>>
>> I had the same question. If this happens in an app, I would assume that a
>> single MADV_DONTNEED call would usually not span multiple VMAs, and if it
>> does, not that many (and that often) that we would really care about it.
>>
>> OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
>> would make more sense to me. I don't recall if process_madvise() allows for
>> that already, and if it does, is this series primarily tackling optimizing
>> that?
>
> Yeah it's weird, but people can get caught out by unexpected failures to merge
> if they do fun stuff with mremap().
>
> Then again mremap() itself _mandates_ that you only span a single VMA (or part
> of one) :)
Maybe some garbage collection use cases that shuffle individual pages,
and later free larger chunks using MADV_DONTNEED. Doesn't sound unlikely.
>
> Can we talk about the _true_ horror show that - you can span multiple VMAs _with
> gaps_ and it'll allow you, only it'll return -ENOMEM at the end?
>
> In madvise_walk_vmas():
>
> for (;;) {
> ...
>
> if (start < vma->vm_start) {
> unmapped_error = -ENOMEM;
> start = vma->vm_start;
> ...
> }
>
> ...
>
> error = visit(vma, &prev, start, tmp, arg);
> if (error)
> return error;
>
> ...
> }
>
> return unmapped_error;
>
> So, you have no idea if that -ENOMEM is due to a gap, or due to the
> operation returning an -ENOMEM?
> I mean can we just drop this? Does anybody in their right mind rely on
> this? Or is it intentional, to somehow deal with a racing unmap?
> But, no, we hold an mmap lock so that's not it.
Races could still happen if user space would do it from separate
threads. But then, who would prevent user space from doing another
mmap() and getting pages zapped ... so that sounds unlikely.
>
> Yeah OK so can we drop this madness? :) or am I missing some very important
> detail about why we allow this?
I stumbled over that myself a while ago. It's well documented behavior
in the man page :(
At that point I stopped caring, because apparently somebody else cared
enough to document that clearly in the man page :)
>
> I guess spanning multiple VMAs we _have_ to leave in because plausibly
> there are users of that out there?
Spanning multiple VMAs can probably easily happen. At least in QEMU I
know some sane ways to trigger it on guest memory. But, all corner
cases, so nothing relevant for performance.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
2025-03-05 19:35 ` David Hildenbrand
@ 2025-03-05 19:39 ` Lorenzo Stoakes
0 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-03-05 19:39 UTC (permalink / raw)
To: David Hildenbrand
Cc: Matthew Wilcox, SeongJae Park, Liam R. Howlett, Andrew Morton,
Shakeel Butt, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, Mar 05, 2025 at 08:35:36PM +0100, David Hildenbrand wrote:
> On 05.03.25 20:26, Lorenzo Stoakes wrote:
> > On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
> > > On 05.03.25 19:56, Matthew Wilcox wrote:
> > > > On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> > > > > For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> > > > > can happen for each vma of the given address ranges. Because such tlb
> > > > > flushes are for address ranges of the same process, doing those in a batch
> > > > > is more efficient while still being safe. Modify madvise() and
> > > > > process_madvise() entry level code path to do such batched tlb flushes,
> > > > > while the internal unmap logics do only gathering of the tlb entries to
> > > > > flush.
> > > >
> > > > Do real applications actually do madvise requests that span multiple
> > > > VMAs? It just seems weird to me. Like, each vma comes from a separate
> > > > call to mmap [1], so why would it make sense for an application to
> > > > call madvise() across a VMA boundary?
> > >
> > > I had the same question. If this happens in an app, I would assume that a
> > > single MADV_DONTNEED call would usually not span multiple VMAs, and if it
> > > does, not that many (and that often) that we would really care about it.
> > >
> > > OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
> > > would make more sense to me. I don't recall if process_madvise() allows for
> > > that already, and if it does, is this series primarily tackling optimizing
> > > that?
> >
> > Yeah it's weird, but people can get caught out by unexpected failures to merge
> > if they do fun stuff with mremap().
> >
> > Then again mremap() itself _mandates_ that you only span a single VMA (or part
> > of one) :)
>
> Maybe some garbage collection use cases that shuffle individual pages, and
> later free larger chunks using MADV_DONTNEED. Doesn't sound unlikely.
>
> >
> > Can we talk about the _true_ horror show that - you can span multiple VMAs _with
> > gaps_ and it'll allow you, only it'll return -ENOMEM at the end?
> >
> > In madvise_walk_vmas():
> >
> > for (;;) {
> > ...
> >
> > if (start < vma->vm_start) {
> > unmapped_error = -ENOMEM;
> > start = vma->vm_start;
> > ...
> > }
> >
> > ...
> >
> > error = visit(vma, &prev, start, tmp, arg);
> > if (error)
> > return error;
> >
> > ...
> > }
> >
> > return unmapped_error;
> >
> > So, you have no idea if that -ENOMEM is due to a gap, or due to the
> > operation returning an -ENOMEM?
> > I mean can we just drop this? Does anybody in their right mind rely on
> > this? Or is it intentional, to somehow deal with a racing unmap?
> > But, no, we hold an mmap lock so that's not it.
>
> Races could still happen if user space would do it from separate threads.
> But then, who would prevent user space from doing another mmap() and getting
> pages zapped ... so that sounds unlikely.
Ah yeah, I mean if you got unlucky on timing, munmap()/mmap() or mremap() with
MAP_FIXED quickly unmaps on another thread before you grab the mmap lock and fun
times.
I mean for that to happen you'd have to be doing something... very odd or insane
so. But still.
>
> >
> > Yeah OK so can we drop this madness? :) or am I missing some very important
> > detail about why we allow this?
>
> I stumbled over that myself a while ago. It's well documented behavior in
> the man page :(
Haha ok I guess we have to live with it then.
>
> At that point I stopped caring, because apparently somebody else cared
> enough to document that clearly in the man page :)
>
> >
> > I guess spanning multiple VMAs we _have_ to leave in because plausibly
> > there are users of that out there?
>
> Spanning multiple VMAs can probably easily happen. At least in QEMU I know
> some sane ways to trigger it on guest memory. But, all corner cases, so
> nothing relevant for performance.
>
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
2025-03-05 19:19 ` David Hildenbrand
2025-03-05 19:26 ` Lorenzo Stoakes
@ 2025-03-05 19:46 ` Shakeel Butt
2025-03-05 19:49 ` David Hildenbrand
2025-03-05 19:49 ` Lorenzo Stoakes
1 sibling, 2 replies; 48+ messages in thread
From: Shakeel Butt @ 2025-03-05 19:46 UTC (permalink / raw)
To: David Hildenbrand
Cc: Matthew Wilcox, SeongJae Park, Liam R. Howlett, Andrew Morton,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
> On 05.03.25 19:56, Matthew Wilcox wrote:
> > On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> > > For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> > > can happen for each vma of the given address ranges. Because such tlb
> > > flushes are for address ranges of the same process, doing those in a batch
> > > is more efficient while still being safe. Modify madvise() and
> > > process_madvise() entry level code path to do such batched tlb flushes,
> > > while the internal unmap logics do only gathering of the tlb entries to
> > > flush.
> >
> > Do real applications actually do madvise requests that span multiple
> > VMAs? It just seems weird to me. Like, each vma comes from a separate
> > call to mmap [1], so why would it make sense for an application to
> > call madvise() across a VMA boundary?
>
> I had the same question. If this happens in an app, I would assume that a
> single MADV_DONTNEED call would usually not span multiple VMAs, and if it
> does, not that many (and that often) that we would really care about it.
IMHO madvise() is just an add-on and the real motivation behind this
series is your next point.
>
> OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
> would make more sense to me. I don't recall if process_madvise() allows for
> that already, and if it does, is this series primarily tackling optimizing
> that?
Yes process_madvise() allows that and that is what SJ has benchmarked
and reported in the cover letter. In addition, we are adding
process_madvise() support in jemalloc which will land soon.
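For reference, the caller side looks roughly like the below. This is
just an untested sketch: it assumes __NR_process_madvise is defined by
the installed headers, and the pidfd for the self case can come from
pidfd_open(getpid(), 0).

#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Apply MADV_DONTNEED to many ranges of the target process in a single
 * call. With this series, the kernel can gather the tlb entries across
 * all of the ranges and flush them just once at the end. */
static ssize_t dontneed_ranges(int pidfd, const struct iovec *ranges,
			       size_t nr_ranges)
{
	return syscall(__NR_process_madvise, pidfd, ranges, nr_ranges,
		       MADV_DONTNEED, 0);
}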
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
2025-03-05 19:46 ` Shakeel Butt
@ 2025-03-05 19:49 ` David Hildenbrand
2025-03-05 20:59 ` SeongJae Park
2025-03-05 19:49 ` Lorenzo Stoakes
1 sibling, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2025-03-05 19:49 UTC (permalink / raw)
To: Shakeel Butt
Cc: Matthew Wilcox, SeongJae Park, Liam R. Howlett, Andrew Morton,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On 05.03.25 20:46, Shakeel Butt wrote:
> On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
>> On 05.03.25 19:56, Matthew Wilcox wrote:
>>> On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
>>>> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
>>>> can happen for each vma of the given address ranges. Because such tlb
>>>> flushes are for address ranges of the same process, doing those in a batch
>>>> is more efficient while still being safe. Modify madvise() and
>>>> process_madvise() entry level code path to do such batched tlb flushes,
>>>> while the internal unmap logics do only gathering of the tlb entries to
>>>> flush.
>>>
>>> Do real applications actually do madvise requests that span multiple
>>> VMAs? It just seems weird to me. Like, each vma comes from a separate
>>> call to mmap [1], so why would it make sense for an application to
>>> call madvise() across a VMA boundary?
>>
>> I had the same question. If this happens in an app, I would assume that a
>> single MADV_DONTNEED call would usually not span multiple VMAs, and if it
>> does, not that many (and that often) that we would really care about it.
>
> IMHO madvise() is just an add-on and the real motivation behind this
> series is your next point.
>
>>
>> OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
>> would make more sense to me. I don't recall if process_madvise() allows for
>> that already, and if it does, is this series primarily tackling optimizing
>> that?
>
> Yes process_madvise() allows that and that is what SJ has benchmarked
> and reported in the cover letter. In addition, we are adding
> process_madvise() support in jemalloc which will land soon.
Makes a lot of sense to me!
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
2025-03-05 19:49 ` David Hildenbrand
@ 2025-03-05 20:59 ` SeongJae Park
0 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 20:59 UTC (permalink / raw)
To: David Hildenbrand
Cc: SeongJae Park, Shakeel Butt, Matthew Wilcox, Liam R. Howlett,
Andrew Morton, Lorenzo Stoakes, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
On Wed, 5 Mar 2025 20:49:13 +0100 David Hildenbrand <david@redhat.com> wrote:
> On 05.03.25 20:46, Shakeel Butt wrote:
> > On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
> >> On 05.03.25 19:56, Matthew Wilcox wrote:
> >>> On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> >>>> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> >>>> can happen for each vma of the given address ranges. Because such tlb
> >>>> flushes are for address ranges of the same process, doing those in a batch
> >>>> is more efficient while still being safe. Modify madvise() and
> >>>> process_madvise() entry level code path to do such batched tlb flushes,
> >>>> while the internal unmap logics do only gathering of the tlb entries to
> >>>> flush.
> >>>
> >>> Do real applications actually do madvise requests that span multiple
> >>> VMAs? It just seems weird to me. Like, each vma comes from a separate
> >>> call to mmap [1], so why would it make sense for an application to
> >>> call madvise() across a VMA boundary?
> >>
> >> I had the same question. If this happens in an app, I would assume that a
> >> single MADV_DONTNEED call would usually not span multiple VMAs, and if it
> >> does, not that many (and that often) that we would really care about it.
> >
> > IMHO madvise() is just an add-on and the real motivation behind this
> > series is your next point.
> >
> >>
> >> OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
> >> would make more sense to me. I don't recall if process_madvise() allows for
> >> that already, and if it does, is this series primarily tackling optimizing
> >> that?
> >
> > Yes process_madvise() allows that and that is what SJ has benchmarked
> > and reported in the cover letter. In addition, we are adding
> > process_madvise() support in jemalloc which will land soon.
Shakeel is correct. Thank you for making the early clarification, Shakeel.
Also sorry for causing confusion. I will make this point clearer in the next spin.
>
> Makes a lot of sense to me!
Seems Shakeel already addressed all questions so far, but please feel free to
raise more questions for anything not yet clear!
>
> --
> Cheers,
>
> David / dhildenb
Thanks,
SJ
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
2025-03-05 19:46 ` Shakeel Butt
2025-03-05 19:49 ` David Hildenbrand
@ 2025-03-05 19:49 ` Lorenzo Stoakes
2025-03-05 19:57 ` Shakeel Butt
1 sibling, 1 reply; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-03-05 19:49 UTC (permalink / raw)
To: Shakeel Butt
Cc: David Hildenbrand, Matthew Wilcox, SeongJae Park,
Liam R. Howlett, Andrew Morton, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
On Wed, Mar 05, 2025 at 11:46:31AM -0800, Shakeel Butt wrote:
> On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
> > On 05.03.25 19:56, Matthew Wilcox wrote:
> > > On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> > > > For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> > > > can happen for each vma of the given address ranges. Because such tlb
> > > > flushes are for address ranges of the same process, doing those in a batch
> > > > is more efficient while still being safe. Modify madvise() and
> > > > process_madvise() entry level code path to do such batched tlb flushes,
> > > > while the internal unmap logics do only gathering of the tlb entries to
> > > > flush.
> > >
> > > Do real applications actually do madvise requests that span multiple
> > > VMAs? It just seems weird to me. Like, each vma comes from a separate
> > > call to mmap [1], so why would it make sense for an application to
> > > call madvise() across a VMA boundary?
> >
> > I had the same question. If this happens in an app, I would assume that a
> > single MADV_DONTNEED call would usually not span multiple VMAs, and if it
> > does, not that many (and that often) that we would really care about it.
>
> IMHO madvise() is just an add-on and the real motivation behind this
> series is your next point.
>
> >
> > OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
> > would make more sense to me. I don't recall if process_madvise() allows for
> > that already, and if it does, is this series primarily tackling optimizing
> > that?
>
> Yes process_madvise() allows that and that is what SJ has benchmarked
> and reported in the cover letter. In addition, we are adding
> process_madvise() support in jemalloc which will land soon.
>
Feels like me adjusting that to allow for batched usage for guard regions
has opened up unexpected avenues, which is really cool to see :)
I presume the usage is intended for PIDFD_SELF usage right?
At some point we need to look at allowing larger iovec size. This was
something I was planning to look at at some point, but my workload is
really overwhelming + that's low priority for me so happy for you guys to
handle that if you want.
Can discuss at lsf if you guys will be there also :)
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
2025-03-05 19:49 ` Lorenzo Stoakes
@ 2025-03-05 19:57 ` Shakeel Butt
2025-03-05 22:46 ` SeongJae Park
0 siblings, 1 reply; 48+ messages in thread
From: Shakeel Butt @ 2025-03-05 19:57 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, Matthew Wilcox, SeongJae Park,
Liam R. Howlett, Andrew Morton, Vlastimil Babka, kernel-team,
linux-kernel, linux-mm
On Wed, Mar 05, 2025 at 07:49:50PM +0000, Lorenzo Stoakes wrote:
> On Wed, Mar 05, 2025 at 11:46:31AM -0800, Shakeel Butt wrote:
> > On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
> > > On 05.03.25 19:56, Matthew Wilcox wrote:
> > > > On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> > > > > For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> > > > > can happen for each vma of the given address ranges. Because such tlb
> > > > > flushes are for address ranges of the same process, doing those in a batch
> > > > > is more efficient while still being safe. Modify madvise() and
> > > > > process_madvise() entry level code path to do such batched tlb flushes,
> > > > > while the internal unmap logics do only gathering of the tlb entries to
> > > > > flush.
> > > >
> > > > Do real applications actually do madvise requests that span multiple
> > > > VMAs? It just seems weird to me. Like, each vma comes from a separate
> > > > call to mmap [1], so why would it make sense for an application to
> > > > call madvise() across a VMA boundary?
> > >
> > > I had the same question. If this happens in an app, I would assume that a
> > > single MADV_DONTNEED call would usually not span multiple VMAs, and if it
> > > does, not that many (and that often) that we would really care about it.
> >
> > IMHO madvise() is just an add-on and the real motivation behind this
> > series is your next point.
> >
> > >
> > > OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
> > > would make more sense to me. I don't recall if process_madvise() allows for
> > > that already, and if it does, is this series primarily tackling optimizing
> > > that?
> >
> > Yes process_madvise() allows that and that is what SJ has benchmarked
> > and reported in the cover letter. In addition, we are adding
> > process_madvise() support in jemalloc which will land soon.
> >
>
> Feels like me adjusting that to allow for batched usage for guard regions
> has opened up unexpected avenues, which is really cool to see :)
>
> I presume the usage is intended for PIDFD_SELF usage right?
Yes.
>
> At some point we need to look at allowing larger iovec size. This was
> something I was planning to look at at some point, but my workload is
> really overwhelming + that's low priority for me so happy for you guys to
> handle that if you want.
>
> Can discuss at lsf if you guys will be there also :)
Yup, we will be there and will be happy to discuss.
Also the draft of jemalloc using process_madvise() is at [1].
[1] https://github.com/jemalloc/jemalloc/pull/2794
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
2025-03-05 19:57 ` Shakeel Butt
@ 2025-03-05 22:46 ` SeongJae Park
0 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 22:46 UTC (permalink / raw)
To: Shakeel Butt
Cc: SeongJae Park, Lorenzo Stoakes, David Hildenbrand,
Matthew Wilcox, Liam R. Howlett, Andrew Morton, Vlastimil Babka,
kernel-team, linux-kernel, linux-mm
On Wed, 5 Mar 2025 11:57:34 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On Wed, Mar 05, 2025 at 07:49:50PM +0000, Lorenzo Stoakes wrote:
> > On Wed, Mar 05, 2025 at 11:46:31AM -0800, Shakeel Butt wrote:
> > > On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
> > > > On 05.03.25 19:56, Matthew Wilcox wrote:
> > > > > On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> > > > > > For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> > > > > > can happen for each vma of the given address ranges. Because such tlb
> > > > > > flushes are for address ranges of the same process, doing those in a batch
> > > > > > is more efficient while still being safe. Modify madvise() and
> > > > > > process_madvise() entry level code path to do such batched tlb flushes,
> > > > > > while the internal unmap logics do only gathering of the tlb entries to
> > > > > > flush.
> > > > >
> > > > > Do real applications actually do madvise requests that span multiple
> > > > > VMAs? It just seems weird to me. Like, each vma comes from a separate
> > > > > call to mmap [1], so why would it make sense for an application to
> > > > > call madvise() across a VMA boundary?
> > > >
> > > > I had the same question. If this happens in an app, I would assume that a
> > > > single MADV_DONTNEED call would usually not span multiple VMAs, and if it
> > > > does, not that many (and not that often) that we would really care about it.
> > >
> > > IMHO madvise() is just an add-on and the real motivation behind this
> > > series is your next point.
> > >
> > > >
> > > > OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
> > > > would make more sense to me. I don't recall if process_madvise() allows for
> > > > that already, and if it does, is this series primarily tackling optimizing
> > > > that?
> > >
> > > Yes process_madvise() allows that and that is what SJ has benchmarked
> > > and reported in the cover letter. In addition, we are adding
> > > process_madvise() support in jemalloc which will land soon.
> > >
> >
> > Feels like me adjusting that to allow for batched usage for guard regions
> > has opened up unexpected avenues, which is really cool to see :)
> >
> > I presume the usage is intended for PIDFD_SELF, right?
>
> Yes.
Indeed. Thank you for opening up the door, Lorenzo :)
>
> >
> > At some point we need to look at allowing a larger iovec size. This was
> > something I was planning to look at, but my workload is really
> > overwhelming and it's low priority for me, so happy for you guys to
> > handle that if you want.
> >
> > Can discuss at LSF if you guys will be there also :)
>
> Yup, we will be there and will be happy to discuss.
Me, too. Looking forward to the day!
Thanks,
SJ
[...]
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (16 preceding siblings ...)
2025-03-05 18:56 ` [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE Matthew Wilcox
@ 2025-03-05 20:22 ` Shakeel Butt
2025-03-05 22:58 ` SeongJae Park
2025-03-05 20:36 ` Nadav Amit
18 siblings, 1 reply; 48+ messages in thread
From: Shakeel Butt @ 2025-03-05 20:22 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> can happen for each vma of the given address ranges. Because such tlb
> flushes are for address ranges of the same process, doing those in a batch
> is more efficient while still being safe. Modify madvise() and
> process_madvise() entry level code path to do such batched tlb flushes,
> while the internal unmap logics do only gathering of the tlb entries to
> flush.
>
> In more detail, modify the entry functions to initialize an mmu_gather
> object and pass it to the internal logics. Also modify the internal
> logics to do only gathering of the tlb entries to flush into the
> received mmu_gather object. After all internal function calls are done,
> the entry functions finish the mmu_gather object to flush the gathered
> tlb entries in one batch.
>
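For readers less familiar with mmu_gather, the shape being described is
roughly the following. This is a sketch, not the actual patches:
tlb_gather_mmu() and tlb_finish_mmu() are the existing mmu_gather entry
points, while madvise_gather_range() is a placeholder name for the
series' internal gathering logic.

static int madvise_batched(struct mm_struct *mm, unsigned long start,
			   size_t len, int behavior)
{
	struct mmu_gather tlb;
	int error;

	tlb_gather_mmu(&tlb, mm);	/* initialize once per call */
	/*
	 * The internal logic walks the vmas and only records the
	 * entries to flush into 'tlb', instead of flushing per vma.
	 */
	error = madvise_gather_range(mm, start, start + len,
				     behavior, &tlb);
	tlb_finish_mmu(&tlb);		/* one flush for the whole range */
	return error;
}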
> Patches Sequence
> ================
>
> First four patches are minor cleanups of madvise.c for readability.
>
> Following four patches (patches 5-8) define a new data structure for
> managing the information required for batched tlb flushing (mmu_gather
> and behavior), and update code paths for MADV_DONTNEED[_LOCKED] and
> MADV_FREE handling internal logics to receive it.
>
> Three patches (patches 9-11) for making internal MADV_DONTNEED[_LOCKED]
> and MADV_FREE handling logic ready for batched tlb flushing follow.
I think you forgot to complete the above sentence, or the 'follow' at the
end seems weird.
> The
> patches keep the support of unbatched tlb flushes use case, for
> fine-grained and safe transitions.
>
> Next three patches (patches 12-14) update madvise() and
> process_madvise() code to do the batched tlb flushes utilizing the
> previous patches introduced changes.
>
> Final two patches (patches 15-16) clean up the internal logics'
> unbatched tlb flushes use case support code, which is no longer used.
>
> Test Results
> ============
>
> I measured the time to apply MADV_DONTNEED advice to 256 MiB of memory
> using multiple process_madvise() calls. I apply the advice at 4 KiB
> region granularity, but with varying batch sizes (vlen) from 1 to
> 1024. The source code for the measurement is available at GitHub[1].
>
> The measurement results are as below. The 'sz_batches' column shows the
> batch size of process_madvise() calls. The 'before' and 'after' columns
> are the measured time to apply MADV_DONTNEED to the 256 MiB memory
> buffer in nanoseconds, on kernels built without and with the
> MADV_DONTNEED tlb flushes batching patch of this series, respectively.
> For the baseline, the mm-unstable tree of 2025-03-04[2] has been used.
> The 'after/before' column is the ratio of 'after' to 'before'. So an
> 'after/before' value lower than 1.0 means this patch increased
> efficiency over the baseline, and a lower value means better efficiency.
I would recommend replacing the after/before column with a percentage,
i.e. percentage improvement or degradation.
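For example, working from the quoted table below, the sz_batches=1024
row would read as (1 - 22676017/44440870) * 100 ~= 49.0% improvement,
and the sz_batches=1 row as about a 3.6% degradation.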
>
> sz_batches before after after/before
> 1 102842895 106507398 1.03563204828102
> 2 73364942 74529223 1.01586971880929
> 4 58823633 51608504 0.877343022998937
> 8 47532390 44820223 0.942940655834895
> 16 43591587 36727177 0.842529018271347
> 32 44207282 33946975 0.767904595446515
> 64 41832437 26738286 0.639175910310939
> 128 40278193 23262940 0.577556694263817
> 256 41568533 22355103 0.537789077136785
> 512 41626638 22822516 0.54826709762148
> 1024 44440870 22676017 0.510251419470411
>
> For batch sizes <= 2, tlb flushes batching shows no big difference,
> only slight overhead. I think that's within the error range of this
> simple micro-benchmark, and can therefore be ignored.
I would recommend running the experiment multiple times and reporting
averages and standard deviations, which will support your error range
claim.
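Something as simple as the following could summarize the repeated runs;
a minimal userspace sketch, with the per-run measurement loop elided
(link with -lm):

#include <math.h>

/*
 * Summarize n repeated measurements (in nanoseconds) as mean and
 * sample standard deviation.  Requires n >= 2.
 */
static void summarize(const double *ns, int n, double *mean, double *sd)
{
	double sum = 0.0, var = 0.0;
	int i;

	for (i = 0; i < n; i++)
		sum += ns[i];
	*mean = sum / n;
	for (i = 0; i < n; i++)
		var += (ns[i] - *mean) * (ns[i] - *mean);
	*sd = sqrt(var / (n - 1));
}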
> Starting from batch size
> 4, however, tlb flushes batching shows a clear efficiency gain. The
> efficiency gain tends to be proportional to the batch size, as expected.
> The efficiency gain ranges from about 13 percent with batch size 4 up
> to 49 percent with batch size 1,024.
>
> Please note that this is a very simple microbenchmark, so the real
> efficiency gain on a real workload could be very different.
>
I think you are running a single-threaded benchmark on an idle machine.
I expect this series to be much more beneficial on a loaded machine and
for multi-threaded applications. No need to test that scenario, but if
you have already done so, it would be good to report.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
2025-03-05 20:22 ` Shakeel Butt
@ 2025-03-05 22:58 ` SeongJae Park
0 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 22:58 UTC (permalink / raw)
To: Shakeel Butt
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, kernel-team, linux-kernel,
linux-mm
On Wed, 5 Mar 2025 12:22:25 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> > For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> > can happen for each vma of the given address ranges. Because such tlb
> > flushes are for address ranges of the same process, doing those in a batch
> > is more efficient while still being safe. Modify madvise() and
> > process_madvise() entry level code path to do such batched tlb flushes,
> > while the internal unmap logics do only gathering of the tlb entries to
> > flush.
> >
> > In more detail, modify the entry functions to initialize an mmu_gather
> > object and pass it to the internal logics. Also modify the internal
> > logics to do only gathering of the tlb entries to flush into the
> > received mmu_gather object. After all internal function calls are done,
> > the entry functions finish the mmu_gather object to flush the gathered
> > tlb entries in one batch.
> >
> > Patches Sequence
> > ================
> >
> > First four patches are minor cleanups of madvise.c for readability.
> >
> > Following four patches (patches 5-8) define a new data structure for
> > managing the information required for batched tlb flushing (mmu_gather
> > and behavior), and update code paths for MADV_DONTNEED[_LOCKED] and
> > MADV_FREE handling internal logics to receive it.
> >
> > Three patches (patches 9-11) for making internal MADV_DONTNEED[_LOCKED]
> > and MADV_FREE handling logic ready for batched tlb flushing follow.
>
> I think you forgot to complete the above sentence, or the 'follow' at the
> end seems weird.
Thank you for catching this. I just wanted to say these three patches come
after the previous ones. I will wordsmith this part in the next version.
>
> > The
> > patches keep the support of unbatched tlb flushes use case, for
> > fine-grained and safe transitions.
> >
> > Next three patches (patches 12-14) update madvise() and
> > process_madvise() code to do the batched tlb flushes utilizing the
> > previous patches introduced changes.
> >
> > Final two patches (patches 15-16) clean up the internal logics'
> > unbatched tlb flushes use case support code, which is no longer used.
> >
> > Test Results
> > ============
> >
> > I measured the time to apply MADV_DONTNEED advice to 256 MiB of memory
> > using multiple process_madvise() calls. I apply the advice at 4 KiB
> > region granularity, but with varying batch sizes (vlen) from 1 to
> > 1024. The source code for the measurement is available at GitHub[1].
> >
> > The measurement results are as below. The 'sz_batches' column shows the
> > batch size of process_madvise() calls. The 'before' and 'after' columns
> > are the measured time to apply MADV_DONTNEED to the 256 MiB memory
> > buffer in nanoseconds, on kernels built without and with the
> > MADV_DONTNEED tlb flushes batching patch of this series, respectively.
> > For the baseline, the mm-unstable tree of 2025-03-04[2] has been used.
> > The 'after/before' column is the ratio of 'after' to 'before'. So an
> > 'after/before' value lower than 1.0 means this patch increased
> > efficiency over the baseline, and a lower value means better efficiency.
>
> I would recommend replacing the after/before column with a percentage,
> i.e. percentage improvement or degradation.
Thank you for the nice suggestion. I will do so in the next version.
>
> >
> > sz_batches before after after/before
> > 1 102842895 106507398 1.03563204828102
> > 2 73364942 74529223 1.01586971880929
> > 4 58823633 51608504 0.877343022998937
> > 8 47532390 44820223 0.942940655834895
> > 16 43591587 36727177 0.842529018271347
> > 32 44207282 33946975 0.767904595446515
> > 64 41832437 26738286 0.639175910310939
> > 128 40278193 23262940 0.577556694263817
> > 256 41568533 22355103 0.537789077136785
> > 512 41626638 22822516 0.54826709762148
> > 1024 44440870 22676017 0.510251419470411
> >
> > For batch sizes <= 2, tlb flushes batching shows no big difference,
> > only slight overhead. I think that's within the error range of this
> > simple micro-benchmark, and can therefore be ignored.
>
> I would recommend running the experiment multiple times and reporting
> averages and standard deviations, which will support your error range
> claim.
Again, good suggestion. I will do so.
>
> > Starting from batch size
> > 4, however, tlb flushes batching shows a clear efficiency gain. The
> > efficiency gain tends to be proportional to the batch size, as expected.
> > The efficiency gain ranges from about 13 percent with batch size 4 up
> > to 49 percent with batch size 1,024.
> >
> > Please note that this is a very simple microbenchmark, so the real
> > efficiency gain on a real workload could be very different.
> >
>
> I think you are running a single-threaded benchmark on an idle machine.
> I expect this series to be much more beneficial on a loaded machine and
> for multi-threaded applications.
Your understanding of my test setup is correct, and I agree with your expectation.
> No need to test that scenario, but if you
> have already done so, it would be good to report.
I don't have such test results, or plans for them on a specific timeline, for
now. I will of course share them if I get a chance.
Thanks,
SJ
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
2025-03-05 18:15 [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
` (17 preceding siblings ...)
2025-03-05 20:22 ` Shakeel Butt
@ 2025-03-05 20:36 ` Nadav Amit
2025-03-05 23:02 ` SeongJae Park
18 siblings, 1 reply; 48+ messages in thread
From: Nadav Amit @ 2025-03-05 20:36 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
Linux Kernel Mailing List, open list:MEMORY MANAGEMENT
> On 5 Mar 2025, at 20:15, SeongJae Park <sj@kernel.org> wrote:
>
> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> can happen for each vma of the given address ranges. Because such tlb
> flushes are for address ranges of the same process, doing those in a batch
> is more efficient while still being safe. Modify madvise() and
> process_madvise() entry level code path to do such batched tlb flushes,
> while the internal unmap logics do only gathering of the tlb entries to
> flush.
I made some related (similar?) patches in the past. You might find
something useful in the discussion there. I think your version avoids
some of the “mistakes” I made.
[1] https://lore.kernel.org/all/20210926161259.238054-1-namit@vmware.com/T/#m23ccd29bad04a963c4d8c64ec3581f7c301c7806
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
2025-03-05 20:36 ` Nadav Amit
@ 2025-03-05 23:02 ` SeongJae Park
0 siblings, 0 replies; 48+ messages in thread
From: SeongJae Park @ 2025-03-05 23:02 UTC (permalink / raw)
To: Nadav Amit
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, kernel-team,
Linux Kernel Mailing List, open list:MEMORY MANAGEMENT
On Wed, 5 Mar 2025 22:36:45 +0200 Nadav Amit <nadav.amit@gmail.com> wrote:
>
> > On 5 Mar 2025, at 20:15, SeongJae Park <sj@kernel.org> wrote:
> >
> > For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> > can happen for each vma of the given address ranges. Because such tlb
> > flushes are for address ranges of the same process, doing those in a batch
> > is more efficient while still being safe. Modify madvise() and
> > process_madvise() entry level code path to do such batched tlb flushes,
> > while the internal unmap logics do only gathering of the tlb entries to
> > flush.
>
> I made some related (similar?) patches in the past. You might find
> something useful in the discussion there. I think your version avoids
> some of the “mistakes” I made.
Thank you for sharing this! I am sure I can learn and use many things from this
work and the previous discussions.
>
> [1] https://lore.kernel.org/all/20210926161259.238054-1-namit@vmware.com/T/#m23ccd29bad04a963c4d8c64ec3581f7c301c7806
Thanks,
SJ
^ permalink raw reply [flat|nested] 48+ messages in thread