* [RFC PATCH v2 0/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-01-17 1:30 SeongJae Park
From: SeongJae Park @ 2025-01-17 1:30 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
process_madvise() calls do_madvise() for each address range. Hence, each
do_madvise() invocation acquires and releases the same mmap_lock. Optimize
away the redundant lock operations by splitting the internal logic of
do_madvise(), including the mmap_lock operations, into small helpers, and
calling those helpers directly from process_madvise() in a sequence that
removes the redundant locking.
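The resulting call pattern looks roughly like below.  This is only an
illustrative sketch: for_each_range() is pseudocode standing in for the
iovec loop, while madvise_lock(), madvise_do_behavior() and
madvise_unlock() are the helpers that the following patches introduce.
	/* Before: every range pays a full lock/unlock round trip. */
	for_each_range(range)
		error = do_madvise(mm, range->start, range->len_in, behavior);
	/* After: a single lock/unlock pair covers all ranges. */
	error = madvise_lock(mm, behavior);
	if (error)
		return error;
	for_each_range(range)
		error = madvise_do_behavior(mm, range->start, range->len_in,
				range->len, behavior);
	madvise_unlock(mm, behavior);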
Changes from RFC v1 (20250111004618.1566-1-sj@kernel.org)
- Split do_madvise() into smaller helpers and call those from
vector_madvise(), instead of adding a flag to do_madvise() (Liam R. Howlett)
SeongJae Park (4):
mm/madvise: split out mmap locking operations for madvise()
mm/madvise: split out madvise input validity check
mm/madvise: split out madvise() behavior execution
mm/madvise: remove redundant mmap_lock operations from
process_madvise()
mm/madvise.c | 150 +++++++++++++++++++++++++++++++++++----------------
1 file changed, 103 insertions(+), 47 deletions(-)
base-commit: b43ba6938d01ad4487028592109d4116a28b7afa
--
2.39.5
* [RFC PATCH v2 1/4] mm/madvise: split out mmap locking operations for madvise()
@ 2025-01-17 1:30 ` SeongJae Park
From: SeongJae Park @ 2025-01-17 1:30 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
Split out the madvise behavior-dependent mmap_lock operations from
do_madvise(), for easier reuse of the logic in an upcoming change.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 45 ++++++++++++++++++++++++++++++++-------------
1 file changed, 32 insertions(+), 13 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 49f3a75046f6..ae0964bc4d88 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1565,6 +1565,33 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
madvise_vma_anon_name);
}
#endif /* CONFIG_ANON_VMA_NAME */
+
+static int madvise_lock(struct mm_struct *mm, int behavior)
+{
+
+#ifdef CONFIG_MEMORY_FAILURE
+ if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
+ return 0;
+#endif
+
+ if (madvise_need_mmap_write(behavior)) {
+ if (mmap_write_lock_killable(mm))
+ return -EINTR;
+ } else {
+ mmap_read_lock(mm);
+ }
+ return 0;
+
+}
+
+static void madvise_unlock(struct mm_struct *mm, int behavior)
+{
+ if (madvise_need_mmap_write(behavior))
+ mmap_write_unlock(mm);
+ else
+ mmap_read_unlock(mm);
+}
+
/*
* The madvise(2) system call.
*
@@ -1641,7 +1668,6 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
{
unsigned long end;
int error;
- int write;
size_t len;
struct blk_plug plug;
@@ -1663,19 +1689,15 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
if (end == start)
return 0;
+ error = madvise_lock(mm, behavior);
+ if (error)
+ return error;
+
#ifdef CONFIG_MEMORY_FAILURE
if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
return madvise_inject_error(behavior, start, start + len_in);
#endif
- write = madvise_need_mmap_write(behavior);
- if (write) {
- if (mmap_write_lock_killable(mm))
- return -EINTR;
- } else {
- mmap_read_lock(mm);
- }
-
start = untagged_addr_remote(mm, start);
end = start + len;
@@ -1692,10 +1714,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
}
blk_finish_plug(&plug);
- if (write)
- mmap_write_unlock(mm);
- else
- mmap_read_unlock(mm);
+ madvise_unlock(mm, behavior);
return error;
}
--
2.39.5
* [RFC PATCH v2 2/4] mm/madvise: split out madvise input validity check
@ 2025-01-17 1:30 ` SeongJae Park
From: SeongJae Park @ 2025-01-17 1:30 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
Split out the madvise parameter validation logic from do_madvise(), for
easy reuse of the logic in a future change.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 32 ++++++++++++++++++++++----------
1 file changed, 22 insertions(+), 10 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index ae0964bc4d88..9cc31efe875a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1592,6 +1592,27 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
mmap_read_unlock(mm);
}
+static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
+{
+ size_t len;
+
+ if (!madvise_behavior_valid(behavior))
+ return false;
+
+ if (!PAGE_ALIGNED(start))
+ return false;
+ len = PAGE_ALIGN(len_in);
+
+ /* Check to see whether len was rounded up from small -ve to zero */
+ if (len_in && !len)
+ return false;
+
+ if (start + len < start)
+ return false;
+
+ return true;
+}
+
/*
* The madvise(2) system call.
*
@@ -1671,20 +1692,11 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
size_t len;
struct blk_plug plug;
- if (!madvise_behavior_valid(behavior))
+ if (!is_valid_madvise(start, len_in, behavior))
return -EINVAL;
- if (!PAGE_ALIGNED(start))
- return -EINVAL;
len = PAGE_ALIGN(len_in);
-
- /* Check to see whether len was rounded up from small -ve to zero */
- if (len_in && !len)
- return -EINVAL;
-
end = start + len;
- if (end < start)
- return -EINVAL;
if (end == start)
return 0;
--
2.39.5
* [RFC PATCH v2 3/4] mm/madvise: split out madvise() behavior execution
@ 2025-01-17 1:30 ` SeongJae Park
From: SeongJae Park @ 2025-01-17 1:30 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
Split out the madvise behavior-applying logic from do_madvise() to make
it easier to reuse in the following change.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 53 +++++++++++++++++++++++++++++-----------------------
1 file changed, 30 insertions(+), 23 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 9cc31efe875a..913936a5c353 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1613,6 +1613,35 @@ static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
return true;
}
+static int madvise_do_behavior(struct mm_struct *mm,
+ unsigned long start, size_t len_in, size_t len, int behavior)
+{
+ struct blk_plug plug;
+ unsigned long end;
+ int error;
+
+#ifdef CONFIG_MEMORY_FAILURE
+ if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
+ return madvise_inject_error(behavior, start, start + len_in);
+#endif
+ start = untagged_addr_remote(mm, start);
+ end = start + len;
+
+ blk_start_plug(&plug);
+ switch (behavior) {
+ case MADV_POPULATE_READ:
+ case MADV_POPULATE_WRITE:
+ error = madvise_populate(mm, start, end, behavior);
+ break;
+ default:
+ error = madvise_walk_vmas(mm, start, end, behavior,
+ madvise_vma_behavior);
+ break;
+ }
+ blk_finish_plug(&plug);
+ return error;
+}
+
/*
* The madvise(2) system call.
*
@@ -1690,7 +1719,6 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
unsigned long end;
int error;
size_t len;
- struct blk_plug plug;
if (!is_valid_madvise(start, len_in, behavior))
return -EINVAL;
@@ -1704,28 +1732,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
error = madvise_lock(mm, behavior);
if (error)
return error;
-
-#ifdef CONFIG_MEMORY_FAILURE
- if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
- return madvise_inject_error(behavior, start, start + len_in);
-#endif
-
- start = untagged_addr_remote(mm, start);
- end = start + len;
-
- blk_start_plug(&plug);
- switch (behavior) {
- case MADV_POPULATE_READ:
- case MADV_POPULATE_WRITE:
- error = madvise_populate(mm, start, end, behavior);
- break;
- default:
- error = madvise_walk_vmas(mm, start, end, behavior,
- madvise_vma_behavior);
- break;
- }
- blk_finish_plug(&plug);
-
+ error = madvise_do_behavior(mm, start, len_in, len, behavior);
madvise_unlock(mm, behavior);
return error;
--
2.39.5
* [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-01-17 1:30 ` SeongJae Park
From: SeongJae Park @ 2025-01-17 1:30 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
Optimize away redundant mmap lock operations in process_madvise() by
taking the mmap lock first, and then doing the remaining work for all
ranges in the loop under the single lock.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 913936a5c353..1a796a03a4fb 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1752,9 +1752,26 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
total_len = iov_iter_count(iter);
+ ret = madvise_lock(mm, behavior);
+ if (ret)
+ return ret;
+
while (iov_iter_count(iter)) {
- ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
- iter_iov_len(iter), behavior);
+ unsigned long start = (unsigned long)iter_iov_addr(iter);
+ size_t len_in = iter_iov_len(iter);
+ size_t len;
+
+ if (!is_valid_madvise(start, len_in, behavior)) {
+ ret = -EINVAL;
+ break;
+ }
+
+ len = PAGE_ALIGN(len_in);
+ if (start + len == start)
+ ret = 0;
+ else
+ ret = madvise_do_behavior(mm, start, len_in, len,
+ behavior);
/*
* An madvise operation is attempting to restart the syscall,
* but we cannot proceed as it would not be correct to repeat
@@ -1776,6 +1793,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
break;
iov_iter_advance(iter, iter_iov_len(iter));
}
+ madvise_unlock(mm, behavior);
ret = (total_len - iov_iter_count(iter)) ? : ret;
--
2.39.5
* Re: [RFC PATCH v2 1/4] mm/madvise: split out mmap locking operations for madvise()
@ 2025-01-29 19:18 ` Shakeel Butt
From: Shakeel Butt @ 2025-01-29 19:18 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, linux-kernel, linux-mm
On Thu, Jan 16, 2025 at 05:30:55PM -0800, SeongJae Park wrote:
> Split out the madvise behavior-dependent mmap_lock operations from
> do_madvise(), for easier reuse of the logic in an upcoming change.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
* Re: [RFC PATCH v2 2/4] mm/madvise: split out madvise input validity check
@ 2025-01-29 19:18 ` Shakeel Butt
From: Shakeel Butt @ 2025-01-29 19:18 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, linux-kernel, linux-mm
On Thu, Jan 16, 2025 at 05:30:56PM -0800, SeongJae Park wrote:
> Split out the madvise parameters validation logic from do_madvise(), for
> easy reuse of the logic from a future change.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
* Re: [RFC PATCH v2 3/4] mm/madvise: split out madvise() behavior execution
@ 2025-01-29 19:19 ` Shakeel Butt
From: Shakeel Butt @ 2025-01-29 19:19 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, linux-kernel, linux-mm
On Thu, Jan 16, 2025 at 05:30:57PM -0800, SeongJae Park wrote:
> Split out the madvise behavior applying logic from do_madvise() to make
> it easier to reuse from the following change.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
* Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-01-29 19:20 ` Shakeel Butt
From: Shakeel Butt @ 2025-01-29 19:20 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, linux-kernel, linux-mm
On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> Optimize redundant mmap lock operations from process_madvise() by
> directly doing the mmap locking first, and then the remaining works for
> all ranges in the loop.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
* Re: [RFC PATCH v2 0/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-01-29 19:22 ` Shakeel Butt
From: Shakeel Butt @ 2025-01-29 19:22 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, linux-kernel, linux-mm
On Thu, Jan 16, 2025 at 05:30:54PM -0800, SeongJae Park wrote:
> process_madvise() calls do_madvise() for each address range. Then, each
> do_madvise() invocation holds and releases same mmap_lock. Optimize the
> redundant lock operations by splitting do_madvise() internal logics
> including the mmap_lock operations, and calling the small logics
> directly from process_madvise() in a sequence that removes the redundant
> locking.
>
You skipped the evaluation section which was present in v1.  If you
decide to post a new version, please include it; otherwise we can ask
Andrew to pick up the evaluation section from v1 [1].
[1] https://lore.kernel.org/all/20250111004618.1566-1-sj@kernel.org/
* Re: [RFC PATCH v2 0/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-01-29 21:09 ` SeongJae Park
From: SeongJae Park @ 2025-01-29 21:09 UTC (permalink / raw)
To: Shakeel Butt
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, linux-kernel, linux-mm
Hi Shakeel,
On Wed, 29 Jan 2025 11:22:29 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On Thu, Jan 16, 2025 at 05:30:54PM -0800, SeongJae Park wrote:
> > process_madvise() calls do_madvise() for each address range. Then, each
> > do_madvise() invocation holds and releases same mmap_lock. Optimize the
> > redundant lock operations by splitting do_madvise() internal logics
> > including the mmap_lock operations, and calling the small logics
> > directly from process_madvise() in a sequence that removes the redundant
> > locking.
> >
>
> You skipped the evaluation section which was present in the v1. If you
> decide to post a new version, please include that otherwise we can ask
> Andrew to pickup the evaluation section from the v1 [1].
>
> [1] https://lore.kernel.org/all/20250111004618.1566-1-sj@kernel.org/
Sure. I will run the evaluation again on the new version and include the
results in the next version.
Also, thank you very much for the reviews :)
Thanks,
SJ
* Re: [RFC PATCH v2 1/4] mm/madvise: split out mmap locking operations for madvise()
@ 2025-01-31 15:58 ` Lorenzo Stoakes
From: Lorenzo Stoakes @ 2025-01-31 15:58 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand, Shakeel Butt,
Vlastimil Babka, linux-kernel, linux-mm
On Thu, Jan 16, 2025 at 05:30:55PM -0800, SeongJae Park wrote:
> Split out the madvise behavior-dependent mmap_lock operations from
> do_madvise(), for easier reuse of the logic in an upcoming change.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
Looks good to me, a decent cleanup regardless of application.
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/madvise.c | 45 ++++++++++++++++++++++++++++++++-------------
> 1 file changed, 32 insertions(+), 13 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 49f3a75046f6..ae0964bc4d88 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1565,6 +1565,33 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> madvise_vma_anon_name);
> }
> #endif /* CONFIG_ANON_VMA_NAME */
> +
> +static int madvise_lock(struct mm_struct *mm, int behavior)
> +{
> +
> +#ifdef CONFIG_MEMORY_FAILURE
> + if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> + return 0;
> +#endif
> +
> + if (madvise_need_mmap_write(behavior)) {
> + if (mmap_write_lock_killable(mm))
> + return -EINTR;
> + } else {
> + mmap_read_lock(mm);
> + }
> + return 0;
> +
> +}
> +
> +static void madvise_unlock(struct mm_struct *mm, int behavior)
> +{
> + if (madvise_need_mmap_write(behavior))
> + mmap_write_unlock(mm);
> + else
> + mmap_read_unlock(mm);
> +}
> +
> /*
> * The madvise(2) system call.
> *
> @@ -1641,7 +1668,6 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> {
> unsigned long end;
> int error;
> - int write;
> size_t len;
> struct blk_plug plug;
>
> @@ -1663,19 +1689,15 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> if (end == start)
> return 0;
>
> + error = madvise_lock(mm, behavior);
> + if (error)
> + return error;
> +
> #ifdef CONFIG_MEMORY_FAILURE
> if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> return madvise_inject_error(behavior, start, start + len_in);
> #endif
>
> - write = madvise_need_mmap_write(behavior);
> - if (write) {
> - if (mmap_write_lock_killable(mm))
> - return -EINTR;
> - } else {
> - mmap_read_lock(mm);
> - }
> -
> start = untagged_addr_remote(mm, start);
> end = start + len;
>
> @@ -1692,10 +1714,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> }
> blk_finish_plug(&plug);
>
> - if (write)
> - mmap_write_unlock(mm);
> - else
> - mmap_read_unlock(mm);
> + madvise_unlock(mm, behavior);
>
> return error;
> }
> --
> 2.39.5
* Re: [RFC PATCH v2 2/4] mm/madvise: split out madvise input validity check
@ 2025-01-31 16:01 ` Lorenzo Stoakes
From: Lorenzo Stoakes @ 2025-01-31 16:01 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand, Shakeel Butt,
Vlastimil Babka, linux-kernel, linux-mm
On Thu, Jan 16, 2025 at 05:30:56PM -0800, SeongJae Park wrote:
> Split out the madvise parameters validation logic from do_madvise(), for
> easy reuse of the logic from a future change.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
Another decent cleanup, regardless of application, so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/madvise.c | 32 ++++++++++++++++++++++----------
> 1 file changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index ae0964bc4d88..9cc31efe875a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1592,6 +1592,27 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
> mmap_read_unlock(mm);
> }
>
> +static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
> +{
> + size_t len;
> +
> + if (!madvise_behavior_valid(behavior))
> + return false;
> +
> + if (!PAGE_ALIGNED(start))
> + return false;
> + len = PAGE_ALIGN(len_in);
Kind of a pity to duplicate this, but not exactly a big deal.
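For what it's worth, one way to avoid the duplication would be to let the
validator hand the aligned length back -- a rough, untested sketch (the
out-parameter shape is just illustrative, not something this series needs):
	static bool is_valid_madvise(unsigned long start, size_t len_in,
			size_t *lenp, int behavior)
	{
		if (!madvise_behavior_valid(behavior))
			return false;
		if (!PAGE_ALIGNED(start))
			return false;
		*lenp = PAGE_ALIGN(len_in);
		/* Check whether len was rounded up from small -ve to zero. */
		if (len_in && !*lenp)
			return false;
		if (start + *lenp < start)
			return false;
		return true;
	}
Callers would then read the aligned length back instead of re-deriving it.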
> +
> + /* Check to see whether len was rounded up from small -ve to zero */
> + if (len_in && !len)
> + return false;
> +
> + if (start + len < start)
> + return false;
> +
> + return true;
> +}
> +
> /*
> * The madvise(2) system call.
> *
> @@ -1671,20 +1692,11 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> size_t len;
> struct blk_plug plug;
>
> - if (!madvise_behavior_valid(behavior))
> + if (!is_valid_madvise(start, len_in, behavior))
> return -EINVAL;
>
> - if (!PAGE_ALIGNED(start))
> - return -EINVAL;
> len = PAGE_ALIGN(len_in);
> -
> - /* Check to see whether len was rounded up from small -ve to zero */
> - if (len_in && !len)
> - return -EINVAL;
> -
> end = start + len;
> - if (end < start)
> - return -EINVAL;
>
> if (end == start)
> return 0;
> --
> 2.39.5
* Re: [RFC PATCH v2 0/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-01-31 16:04 ` Liam R. Howlett
From: Liam R. Howlett @ 2025-01-31 16:04 UTC (permalink / raw)
To: SeongJae Park
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Shakeel Butt,
Vlastimil Babka, linux-kernel, linux-mm
* SeongJae Park <sj@kernel.org> [250116 20:31]:
> process_madvise() calls do_madvise() for each address range. Then, each
> do_madvise() invocation holds and releases same mmap_lock. Optimize the
> redundant lock operations by splitting do_madvise() internal logics
> including the mmap_lock operations, and calling the small logics
> directly from process_madvise() in a sequence that removes the redundant
> locking.
>
> Changes from RFC v1 (20250111004618.1566-1-sj@kernel.org)
> - Split out do_madvise() and use those from vector_madvise(), instead of
> adding a flag to do_madvise() (Liam R. Howlett)
I was waiting for a non-RFC to re-examine the series. It looks like a
good clean up.
Do you think you'll send out a non-RFC version soon?
>
> SeongJae Park (4):
> mm/madvise: split out mmap locking operations for madvise()
> mm/madvise: split out madvise input validity check
> mm/madvise: split out madvise() behavior execution
> mm/madvise: remove redundant mmap_lock operations from
> process_madvise()
>
> mm/madvise.c | 150 +++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 103 insertions(+), 47 deletions(-)
>
>
> base-commit: b43ba6938d01ad4487028592109d4116a28b7afa
> --
> 2.39.5
* Re: [RFC PATCH v2 3/4] mm/madvise: split out madvise() behavior execution
@ 2025-01-31 16:10 ` Lorenzo Stoakes
From: Lorenzo Stoakes @ 2025-01-31 16:10 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand, Shakeel Butt,
Vlastimil Babka, linux-kernel, linux-mm
On Thu, Jan 16, 2025 at 05:30:57PM -0800, SeongJae Park wrote:
> Split out the madvise behavior applying logic from do_madvise() to make
> it easier to reuse from the following change.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
Thanks, once again this is a nice cleanup, and you're doing stuff I thought of
doing but never got round to :P so inevitably:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/madvise.c | 53 +++++++++++++++++++++++++++++-----------------------
> 1 file changed, 30 insertions(+), 23 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 9cc31efe875a..913936a5c353 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1613,6 +1613,35 @@ static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
> return true;
> }
>
> +static int madvise_do_behavior(struct mm_struct *mm,
> + unsigned long start, size_t len_in, size_t len, int behavior)
> +{
> + struct blk_plug plug;
> + unsigned long end;
> + int error;
> +
> +#ifdef CONFIG_MEMORY_FAILURE
> + if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> + return madvise_inject_error(behavior, start, start + len_in);
> +#endif
> + start = untagged_addr_remote(mm, start);
> + end = start + len;
> +
> + blk_start_plug(&plug);
> + switch (behavior) {
I kind of hate this, I'd like somebody (maybe me maybe you maybe somebody else)
to go further and refactor this thing with fire and fumigation, because having
multiple layers of how we do stuff is just ugh.
Not your fault.
Not even related to your series.
But a moan I shall commit to list nonetheless! ;)
> + case MADV_POPULATE_READ:
> + case MADV_POPULATE_WRITE:
> + error = madvise_populate(mm, start, end, behavior);
> + break;
> + default:
> + error = madvise_walk_vmas(mm, start, end, behavior,
> + madvise_vma_behavior);
> + break;
> + }
> + blk_finish_plug(&plug);
> + return error;
> +}
> +
> /*
> * The madvise(2) system call.
> *
> @@ -1690,7 +1719,6 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> unsigned long end;
> int error;
> size_t len;
> - struct blk_plug plug;
>
> if (!is_valid_madvise(start, len_in, behavior))
> return -EINVAL;
> @@ -1704,28 +1732,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> error = madvise_lock(mm, behavior);
> if (error)
> return error;
> -
> -#ifdef CONFIG_MEMORY_FAILURE
> - if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> - return madvise_inject_error(behavior, start, start + len_in);
> -#endif
> -
> - start = untagged_addr_remote(mm, start);
> - end = start + len;
> -
> - blk_start_plug(&plug);
> - switch (behavior) {
> - case MADV_POPULATE_READ:
> - case MADV_POPULATE_WRITE:
> - error = madvise_populate(mm, start, end, behavior);
> - break;
> - default:
> - error = madvise_walk_vmas(mm, start, end, behavior,
> - madvise_vma_behavior);
> - break;
> - }
> - blk_finish_plug(&plug);
> -
> + error = madvise_do_behavior(mm, start, len_in, len, behavior);
> madvise_unlock(mm, behavior);
>
> return error;
> --
> 2.39.5
* Re: [RFC PATCH v2 0/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-01-31 16:30 ` SeongJae Park
From: SeongJae Park @ 2025-01-31 16:30 UTC (permalink / raw)
To: Liam R. Howlett
Cc: SeongJae Park, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Shakeel Butt, Vlastimil Babka, linux-kernel, linux-mm
On Fri, 31 Jan 2025 11:04:51 -0500 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:
> * SeongJae Park <sj@kernel.org> [250116 20:31]:
> > process_madvise() calls do_madvise() for each address range. Then, each
> > do_madvise() invocation holds and releases same mmap_lock. Optimize the
> > redundant lock operations by splitting do_madvise() internal logics
> > including the mmap_lock operations, and calling the small logics
> > directly from process_madvise() in a sequence that removes the redundant
> > locking.
> >
> > Changes from RFC v1 (20250111004618.1566-1-sj@kernel.org)
> > - Split out do_madvise() and use those from vector_madvise(), instead of
> > adding a flag to do_madvise() (Liam R. Howlett)
>
> I was waiting for a non-RFC to re-examine the series. It looks like a
> good clean up.
Thank you Liam :)
>
> Do you think you'll send out a non-RFC version soon?
I'm planning to send it next week.
Thanks,
SJ
>
> >
> > SeongJae Park (4):
> > mm/madvise: split out mmap locking operations for madvise()
> > mm/madvise: split out madvise input validity check
> > mm/madvise: split out madvise() behavior execution
> > mm/madvise: remove redundant mmap_lock operations from
> > process_madvise()
> >
> > mm/madvise.c | 150 +++++++++++++++++++++++++++++++++++----------------
> > 1 file changed, 103 insertions(+), 47 deletions(-)
> >
> >
> > base-commit: b43ba6938d01ad4487028592109d4116a28b7afa
> > --
> > 2.39.5
* Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-01-31 16:53 ` Lorenzo Stoakes
From: Lorenzo Stoakes @ 2025-01-31 16:53 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand, Shakeel Butt,
Vlastimil Babka, linux-kernel, linux-mm
On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> Optimize redundant mmap lock operations from process_madvise() by
> directly doing the mmap locking first, and then the remaining works for
> all ranges in the loop.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
I wonder if this might increase lock contention because now all of the
vector operations will hold the relevant mm lock without releasing after
each operation?
Probably it's OK given the limited size of the iov, but maybe in future we'd
want to set a limit on the number of ranges we process before we
drop/reacquire the lock?
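Something like the below is what I have in mind for that future limit
(completely untested sketch; MADV_LOCK_BATCH is an invented constant):
	#define MADV_LOCK_BATCH 8	/* invented for illustration */
	size_t done = 0;
	ret = madvise_lock(mm, behavior);
	if (ret)
		return ret;
	while (iov_iter_count(iter)) {
		/* ... validate and apply one range under the lock ... */
		/* Periodically let other mmap_lock waiters in. */
		if (++done % MADV_LOCK_BATCH == 0) {
			madvise_unlock(mm, behavior);
			ret = madvise_lock(mm, behavior);
			if (ret)
				return ret;
		}
	}
	madvise_unlock(mm, behavior);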
I've tested this against my PIDFD_SELF changes [0] and the
process_madvise() invocation in the guard-pages tests which utilises
process_madvise() in a perhaps 'unusual' way so is a good test bed for
this, and all working fine.
Amazingly, this appears to be the only place (afaict) where
process_madvise() is actually tested...
Buuuut, I think this change is incorrect, I comment on this below. Should
be an easy fix though.
[0]: https://lore.kernel.org/all/cover.1738268370.git.lorenzo.stoakes@oracle.com/
> ---
> mm/madvise.c | 22 ++++++++++++++++++++--
> 1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 913936a5c353..1a796a03a4fb 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1752,9 +1752,26 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>
> total_len = iov_iter_count(iter);
>
> + ret = madvise_lock(mm, behavior);
> + if (ret)
> + return ret;
> +
> while (iov_iter_count(iter)) {
> - ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
> - iter_iov_len(iter), behavior);
> + unsigned long start = (unsigned long)iter_iov_addr(iter);
This is nicer than just passing in.
> + size_t len_in = iter_iov_len(iter);
Equally so here...
> + size_t len;
> +
> + if (!is_valid_madvise(start, len_in, behavior)) {
> + ret = -EINVAL;
> + break;
> + }
> +
> + len = PAGE_ALIGN(len_in);
> + if (start + len == start)
> + ret = 0;
> + else
> + ret = madvise_do_behavior(mm, start, len_in, len,
> + behavior);
Very slight duplication here (I wonder if we can somehow wrap the 'if zero
length return 0' thing?).
But it doesn't really matter, probably better to spell it out, actually.
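If we did want to wrap it, something as trivial as this would do
(hypothetical helper, name invented on the spot):
	/* Return true when the page-aligned length leaves nothing to do. */
	static inline bool madvise_is_nop(unsigned long start, size_t len)
	{
		return start + len == start;
	}
and then:
	ret = madvise_is_nop(start, len) ? 0 :
		madvise_do_behavior(mm, start, len_in, len, behavior);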
> /*
> * An madvise operation is attempting to restart the syscall,
> * but we cannot proceed as it would not be correct to repeat
> @@ -1776,6 +1793,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> break;
> iov_iter_advance(iter, iter_iov_len(iter));
> }
> + madvise_unlock(mm, behavior);
>
> ret = (total_len - iov_iter_count(iter)) ? : ret;
So I think this is now wrong because of the work I did recently. In this code:
/*
* An madvise operation is attempting to restart the syscall,
* but we cannot proceed as it would not be correct to repeat
* the operation in aggregate, and would be surprising to the
* user.
*
* As we have already dropped locks, it is safe to just loop and
* try again. We check for fatal signals in case we need exit
* early anyway.
*/
if (ret == -ERESTARTNOINTR) {
if (fatal_signal_pending(current)) {
ret = -EINTR;
break;
}
continue;
}
Note that it assumes the locks have been dropped before simply trying
again, as the only way this would happen is because of a race, and we may
end up stuck in a loop if we just hold on to the lock.
So I'd suggest updating this comment and changing the code like this:
if (ret == -ERESTARTNOINTR) {
if (fatal_signal_pending(current)) {
ret = -EINTR;
break;
}
+ /* Drop and reacquire lock to unwind race. */
+ madvise_unlock(mm, behavior);
+ madvise_lock(mm, behavior);
continue;
}
Which brings back the existing behaviour.
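One more thought: madvise_lock() can itself fail with -EINTR on the
write-lock path, so an even more careful variant (untested sketch) would
check the re-lock result too -- keeping in mind the lock is not held on
that failure path, so we must return rather than fall through to
madvise_unlock():
	if (ret == -ERESTARTNOINTR) {
		if (fatal_signal_pending(current)) {
			ret = -EINTR;
			break;
		}
		/* Drop and reacquire lock to unwind race. */
		madvise_unlock(mm, behavior);
		ret = madvise_lock(mm, behavior);
		if (ret)
			return ret;	/* lock not held, do not unlock */
		continue;
	}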
By the way I hate that this function swallows error codes. But that's not
your fault, and is now established user-facing behaviour so yeah. Big sigh.
>
> --
> 2.39.5
* Re: [RFC PATCH v2 0/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-01-31 16:55 ` Lorenzo Stoakes
From: Lorenzo Stoakes @ 2025-01-31 16:55 UTC (permalink / raw)
To: Liam R. Howlett, SeongJae Park, Andrew Morton, David Hildenbrand,
Shakeel Butt, Vlastimil Babka, linux-kernel, linux-mm
On Fri, Jan 31, 2025 at 11:04:51AM -0500, Liam R. Howlett wrote:
> * SeongJae Park <sj@kernel.org> [250116 20:31]:
> > process_madvise() calls do_madvise() for each address range. Then, each
> > do_madvise() invocation holds and releases same mmap_lock. Optimize the
> > redundant lock operations by splitting do_madvise() internal logics
> > including the mmap_lock operations, and calling the small logics
> > directly from process_madvise() in a sequence that removes the redundant
> > locking.
> >
> > Changes from RFC v1 (20250111004618.1566-1-sj@kernel.org)
> > - Split out do_madvise() and use those from vector_madvise(), instead of
> > adding a flag to do_madvise() (Liam R. Howlett)
>
> I was waiting for a non-RFC to re-examine the series. It looks like a
> good clean up.
>
> Do you think you'll send out a non-RFC version soon?
This is definitely a great cleanup, there's a problem with patch 3/3, but
SJ - feel free to un-RFC with the fix I suggested - and then happy to give
R-b and T-b tags!
Thanks for doing this!
Cheers, Lorenzo
>
> >
> > SeongJae Park (4):
> > mm/madvise: split out mmap locking operations for madvise()
> > mm/madvise: split out madvise input validity check
> > mm/madvise: split out madvise() behavior execution
> > mm/madvise: remove redundant mmap_lock operations from
> > process_madvise()
> >
> > mm/madvise.c | 150 +++++++++++++++++++++++++++++++++++----------------
> > 1 file changed, 103 insertions(+), 47 deletions(-)
> >
> >
> > base-commit: b43ba6938d01ad4487028592109d4116a28b7afa
> > --
> > 2.39.5
* Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-01-31 17:31 ` Davidlohr Bueso
From: Davidlohr Bueso @ 2025-01-31 17:31 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Shakeel Butt, Vlastimil Babka, linux-kernel, linux-mm
On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
>On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
>> Optimize redundant mmap lock operations from process_madvise() by
>> directly doing the mmap locking first, and then the remaining works for
>> all ranges in the loop.
>>
>> Signed-off-by: SeongJae Park <sj@kernel.org>
>
>I wonder if this might increase lock contention because now all of the
>vector operations will hold the relevant mm lock without releasing after
>each operation?
That was exactly my concern. While afaict the numbers presented in v1
are quite nice, this is ultimately a micro-benchmark, where no other
unrelated threads are impacted by these new hold times.
>Probably it's ok given limited size of iov, but maybe in future we'd want
>to set a limit on the ranges before we drop/reacquire lock?
imo, this should best be done in the same patch/series. Maybe extend
the benchmark to use IOV_MAX and find a sweet spot?
Thanks,
Davidlohr
* Re: [RFC PATCH v2 1/4] mm/madvise: split out mmap locking operations for madvise()
@ 2025-01-31 17:33 ` Davidlohr Bueso
From: Davidlohr Bueso @ 2025-01-31 17:33 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
On Thu, 16 Jan 2025, SeongJae Park wrote:
>Split out the madvise behavior-dependent mmap_lock operations from
>do_madvise(), for easier reuse of the logic in an upcoming change.
>
>Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
* Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-01-31 17:47 ` Liam R. Howlett
From: Liam R. Howlett @ 2025-01-31 17:47 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: Lorenzo Stoakes, SeongJae Park, Andrew Morton, David Hildenbrand,
Shakeel Butt, Vlastimil Babka, linux-kernel, linux-mm
* Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
> On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
>
> > On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > > Optimize redundant mmap lock operations from process_madvise() by
> > > directly doing the mmap locking first, and then the remaining works for
> > > all ranges in the loop.
> > >
> > > Signed-off-by: SeongJae Park <sj@kernel.org>
> >
> > I wonder if this might increase lock contention because now all of the
> > vector operations will hold the relevant mm lock without releasing after
> > each operation?
>
> That was exactly my concern. While afaict the numbers presented in v1
> are quite nice, this is ultimately a micro-benchmark, where no other
> unrelated threads are impacted by these new hold times.
Indeed, I was also concerned about this scenario.
But this method does have the added advantage of keeping the vma space
in the same state as it was expected during the initial call - although
the race does still exist on looking vs acting on the data. This would
just remove the intermediate changes.
>
> > Probably it's ok given limited size of iov, but maybe in future we'd want
> > to set a limit on the ranges before we drop/reacquire lock?
>
> imo, this should best be done in the same patch/series. Maybe extend
> the benchmark to use IOV_MAX and find a sweet spot?
Are you worried this is over-engineering for a problem that may never be
an issue, or is there a particular usecase you have in mind?
It is probably worth investigating, and maybe a potential usecase would
help with the targeted sweet spot?
Thanks,
Liam
* Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-01-31 17:51 ` Lorenzo Stoakes
From: Lorenzo Stoakes @ 2025-01-31 17:51 UTC (permalink / raw)
To: Liam R. Howlett, Davidlohr Bueso, SeongJae Park, Andrew Morton,
David Hildenbrand, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
On Fri, Jan 31, 2025 at 12:47:24PM -0500, Liam R. Howlett wrote:
> * Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
> > On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
> >
> > > On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > > > Optimize redundant mmap lock operations from process_madvise() by
> > > > directly doing the mmap locking first, and then the remaining works for
> > > > all ranges in the loop.
> > > >
> > > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > >
> > > I wonder if this might increase lock contention because now all of the
> > > vector operations will hold the relevant mm lock without releasing after
> > > each operation?
> >
> > That was exactly my concern. While afaict the numbers presented in v1
> > are quite nice, this is ultimately a micro-benchmark, where no other
> > unrelated threads are impacted by these new hold times.
>
> Indeed, I was also concerned about this scenario.
>
> But this method does have the added advantage of keeping the vma space
> in the same state as it was expected during the initial call - although
> the race does still exist on looking vs acting on the data. This would
> just remove the intermediate changes.
>
> >
> > > Probably it's ok given limited size of iov, but maybe in future we'd want
> > > to set a limit on the ranges before we drop/reacquire lock?
> >
> > imo, this should best be done in the same patch/series. Maybe extend
> > the benchmark to use IOV_MAX and find a sweet spot?
>
> Are you worried this is over-engineering for a problem that may never be
> an issue, or is there a particular usecase you have in mind?
>
> It is probably worth investigating, and maybe a potential usecase would
> help with the targeted sweet spot?
>
Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
high, but rather UIO_FASTIOV, which is limited to 8 entries.
(Some have been surprised by this limitation...!)
So I think at this point scaling isn't a huge issue, I raise it because in
future we may want to increase this limit, at which point we should think about
it, which is why I sort of hand-waved it away a bit.
> Thanks,
> Liam
>
* Re: [RFC PATCH v2 0/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-01-31 17:53 ` Lorenzo Stoakes
From: Lorenzo Stoakes @ 2025-01-31 17:53 UTC (permalink / raw)
To: Liam R. Howlett, SeongJae Park, Andrew Morton, David Hildenbrand,
Shakeel Butt, Vlastimil Babka, linux-kernel, linux-mm
On Fri, Jan 31, 2025 at 04:55:05PM +0000, Lorenzo Stoakes wrote:
> On Fri, Jan 31, 2025 at 11:04:51AM -0500, Liam R. Howlett wrote:
> > * SeongJae Park <sj@kernel.org> [250116 20:31]:
> > > process_madvise() calls do_madvise() for each address range. Then, each
> > > do_madvise() invocation holds and releases same mmap_lock. Optimize the
> > > redundant lock operations by splitting do_madvise() internal logics
> > > including the mmap_lock operations, and calling the small logics
> > > directly from process_madvise() in a sequence that removes the redundant
> > > locking.
> > >
> > > Changes from RFC v1 (20250111004618.1566-1-sj@kernel.org)
> > > - Split out do_madvise() and use those from vector_madvise(), instead of
> > > adding a flag to do_madvise() (Liam R. Howlett)
> >
> > I was waiting for a non-RFC to re-examine the series. It looks like a
> > good clean up.
> >
> > Do you think you'll send out a non-RFC version soon?
>
> This is definitely a great cleanup, there's a problem with patch 3/3, but
> SJ - feel free to un-RFC with the fix I suggested - and then happy to give
> R-b and T-b tags!
Liam's pedantry brought me here ;)
Obviously by 3/3 I mean 4/4 here.
Maybe not obviously. But anyway.
>
> Thanks for doing this!
>
> Cheers, Lorenzo
>
> >
> > >
> > > SeongJae Park (4):
> > > mm/madvise: split out mmap locking operations for madvise()
> > > mm/madvise: split out madvise input validity check
> > > mm/madvise: split out madvise() behavior execution
> > > mm/madvise: remove redundant mmap_lock operations from
> > > process_madvise()
> > >
> > > mm/madvise.c | 150 +++++++++++++++++++++++++++++++++++----------------
> > > 1 file changed, 103 insertions(+), 47 deletions(-)
> > >
> > >
> > > base-commit: b43ba6938d01ad4487028592109d4116a28b7afa
> > > --
> > > 2.39.5
* Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-01-31 17:58 ` Davidlohr Bueso
From: Davidlohr Bueso @ 2025-01-31 17:58 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Liam R. Howlett, SeongJae Park, Andrew Morton, David Hildenbrand,
Shakeel Butt, Vlastimil Babka, linux-kernel, linux-mm
On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
>Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
>high, but rather UIO_FASTIOV, which is limited to 8 entries.
>
>(Some have been surprised by this limitation...!)
Ah if that is the case then I'm fine with this patch as is.
>
>So I think at this point scaling isn't a huge issue, I raise it because in
>future we may want to increase this limit, at which point we should think about
>it, which is why I sort of hand-waved it away a bit.
Agreed.
SJ, feel free to add my:
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
* Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-01-31 19:17 ` Shakeel Butt
From: Shakeel Butt @ 2025-01-31 19:17 UTC (permalink / raw)
To: Liam R. Howlett, Davidlohr Bueso, Lorenzo Stoakes, SeongJae Park,
Andrew Morton, David Hildenbrand, Vlastimil Babka, linux-kernel,
linux-mm
On Fri, Jan 31, 2025 at 12:47:24PM -0500, Liam R. Howlett wrote:
> * Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
> > On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
> >
> > > On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > > > Optimize redundant mmap lock operations from process_madvise() by
> > > > directly doing the mmap locking first, and then the remaining works for
> > > > all ranges in the loop.
> > > >
> > > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > >
> > > I wonder if this might increase lock contention because now all of the
> > > vector operations will hold the relevant mm lock without releasing after
> > > each operation?
> >
> > That was exactly my concern. While afaict the numbers presented in v1
> > are quite nice, this is ultimately a micro-benchmark, where no other
> > unrelated threads are impacted by these new hold times.
>
> Indeed, I was also concerned about this scenario.
>
> But this method does have the added advantage of keeping the vma space
> in the same state as it was expected during the initial call - although
> the race does still exist on looking vs acting on the data. This would
> just remove the intermediate changes.
>
> >
> > > Probably it's ok given limited size of iov, but maybe in future we'd want
> > > to set a limit on the ranges before we drop/reacquire lock?
> >
> > imo, this should best be done in the same patch/series. Maybe extend
> > the benchmark to use IOV_MAX and find a sweet spot?
>
> Are you worried this is over-engineering for a problem that may never be
> an issue, or is there a particular usecase you have in mind?
>
> It is probably worth investigating, and maybe a potential usecase would
> help with the targeted sweet spot?
>
>
Lorenzo already explained that it is not an issue at the moment. I think
this is a good discussion to have, as I think we will be expanding the
limit from UIO_FASTIOV to a higher value (maybe UIO_MAXIOV) soon in a
followup. At the moment, my gut feeling is that the batch size of regions
given to the syscall will depend on the advice parameter. For example,
for MADV_[NO]HUGEPAGE, which is a simple flag [re]set, a single write
lock and possibly a single tree traversal will be better than multiple
write lock/unlock operations even for a large batch size. Anyway, we will
need some experimental data to show that.
JFYI SJ is planning to work on two improvements: (1) single tree
traversal for all the given regions and (2) TLB flush batching for
MADV_DONTNEED[_LOCKED] and MADV_FREE.
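Roughly, the shape I imagine for (2) is a single mmu_gather spanning the
whole batch instead of one gather (and one TLB flush) per range -- an
untested sketch:
	struct mmu_gather tlb;
	tlb_gather_mmu(&tlb, mm);	/* one gather for the whole batch */
	while (iov_iter_count(iter)) {
		/* ... zap each range, accumulating pending TLB flushes ... */
	}
	tlb_finish_mmu(&tlb);		/* one flush covers all ranges */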
Thanks everyone for your time and feedback.
* Re: [RFC PATCH v2 2/4] mm/madvise: split out madvise input validity check
@ 2025-01-31 19:19 ` Davidlohr Bueso
From: Davidlohr Bueso @ 2025-01-31 19:19 UTC (permalink / raw)
To: SeongJae Park
Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
On Thu, 16 Jan 2025, SeongJae Park wrote:
>Split out the madvise parameters validation logic from do_madvise(), for
>easy reuse of the logic from a future change.
>
>Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
* Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-02-04 18:56 ` SeongJae Park
From: SeongJae Park @ 2025-02-04 18:56 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Shakeel Butt, Vlastimil Babka, linux-kernel, linux-mm
On Fri, 31 Jan 2025 16:53:17 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > Optimize redundant mmap lock operations from process_madvise() by
> > directly doing the mmap locking first, and then the remaining works for
> > all ranges in the loop.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
[...]
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1752,9 +1752,26 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
[...]
> > /*
> > * An madvise operation is attempting to restart the syscall,
> > * but we cannot proceed as it would not be correct to repeat
> > @@ -1776,6 +1793,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> > break;
> > iov_iter_advance(iter, iter_iov_len(iter));
> > }
> > + madvise_unlock(mm, behavior);
> >
> > ret = (total_len - iov_iter_count(iter)) ? : ret;
>
> So I think this is now wrong because of the work I did recently. In this code:
>
> /*
> * An madvise operation is attempting to restart the syscall,
> * but we cannot proceed as it would not be correct to repeat
> * the operation in aggregate, and would be surprising to the
> * user.
> *
> * As we have already dropped locks, it is safe to just loop and
> * try again. We check for fatal signals in case we need exit
> * early anyway.
> */
> if (ret == -ERESTARTNOINTR) {
> if (fatal_signal_pending(current)) {
> ret = -EINTR;
> break;
> }
> continue;
> }
>
> Note that it assumes the locks have been dropped before simply trying
> again, as the only way this would happen is because of a race, and we may
> end up stuck in a loop if we just hold on to the lock.
Nice catch!
>
> So I'd suggest updating this comment and changing the code like this:
>
> if (ret == -ERESTARTNOINTR) {
> if (fatal_signal_pending(current)) {
> ret = -EINTR;
> break;
> }
>
> + /* Drop and reacquire lock to unwind race. */
> + madvise_unlock(mm, behaviour);
> + madvise_lock(mm, behaviour);
> continue;
> }
>
> Which brings back the existing behaviour.
Thank you for this kind suggestion. I will update the next version of this
patch accordingly.
>
> By the way I hate that this function swallows error codes. But that's not
> your fault, and is now established user-facing behaviour so yeah. Big sigh.
>
> >
> > --
> > 2.39.5
Thanks,
SJ
* Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-02-04 19:53 ` SeongJae Park
From: SeongJae Park @ 2025-02-04 19:53 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, Liam R. Howlett, Davidlohr Bueso, Andrew Morton,
David Hildenbrand, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
On Fri, 31 Jan 2025 17:51:45 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Fri, Jan 31, 2025 at 12:47:24PM -0500, Liam R. Howlett wrote:
> > * Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
> > > On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
> > >
> > > > On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > > > > Optimize redundant mmap lock operations from process_madvise() by
> > > > > directly doing the mmap locking first, and then the remaining works for
> > > > > all ranges in the loop.
> > > > >
> > > > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > > >
> > > > I wonder if this might increase lock contention because now all of the
> > > > vector operations will hold the relevant mm lock without releasing after
> > > > each operation?
> > >
> > > That was exactly my concern. While afaict the numbers presented in v1
> > > are quite nice, this is ultimately a micro-benchmark, where no other
> > > unrelated threads are impacted by these new hold times.
> >
> > Indeed, I was also concerned about this scenario.
> >
> > But this method does have the added advantage of keeping the vma space
> > in the same state as it was expected during the initial call - although
> > the race does still exist on looking vs acting on the data. This would
> > just remove the intermediate changes.
> >
> > >
> > > > Probably it's ok given limited size of iov,
I think so. Also, users could adjust the batching size for their workloads.
> > > > but maybe in future we'd want
> > > > to set a limit on the ranges before we drop/reacquire lock?
> > >
> > > imo, this should best be done in the same patch/series. Maybe extend
> > > the benchmark to use IOV_MAX and find a sweet spot?
> >
> > Are you worried this is over-engineering for a problem that may never be
> > an issue, or is there a particular usecase you have in mind?
> >
> > It is probably worth investigating, and maybe a potential usecase would
> > help with the targeted sweet spot?
I think the sweet spot may depend on the workload and the advice type, so
selection of the benchmark will be important. Do you have a benchmark in
mind? My humble microbenchmark[1] evaluates only single-threaded usage
performance, so it may not be appropriate here. I have actually run the
evaluation with batching sizes up to IOV_MAX (1024), but it showed no clear
sweet spot.
> >
>
> Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
> high, but rather UIO_FASTIOV, which is limited to 8 entries.
process_madvise() indeed has an iovec array of UIO_FASTIOV size, namely
iovstack. But, if I understood the code correctly, iovstack is used only as a
fast path when the user requests advising fewer than UIO_FASTIOV regions.

I actually confirmed I can make the loop iterate 1024 times, using my
microbenchmark[1]. I checked by running the program with the
'eval_pmadv $((4*1024*1024)) $((4*1024)) $((4*1024*1024))' command and
counting the number of iterations of the vector_madvise() main loop using
printk(). Please let me know if I'm missing something.
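For reference, if I'm reading do_process_madvise() correctly, the relevant
part looks roughly like below (reconstructed from memory, so details may be
slightly off):

        struct iovec iovstack[UIO_FASTIOV];
        struct iovec *iov = iovstack;
        struct iov_iter iter;

        /*
         * import_iovec() uses the on-stack 'iovstack' only when the vector
         * fits in it; for larger vectors it allocates a bigger array and
         * points 'iov' at that.  So UIO_FASTIOV is only a fast path, not a
         * limit on the number of ranges.
         */
        ret = import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack),
                        &iov, &iter);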
>
> (Some have been surprised by this limitation...!)
>
> So I think at this point scaling isn't a huge issue, I raise it because in
> future we may want to increase this limit, at which point we should think about
> it, which is why I sort of hand-waved it away a bit.
I personally think this may still not be a huge issue, especially given that
users can test and tune the batching size. But I'd like to hear more opinions
if any are available.
[1] https://github.com/sjp38/eval_proc_madvise/blob/master/eval_pmadv.c
Thanks,
SJ
>
> > Thanks,
> > Liam
> >
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
2025-02-04 19:53 ` SeongJae Park
@ 2025-02-06 6:28 ` SeongJae Park
0 siblings, 0 replies; 31+ messages in thread
From: SeongJae Park @ 2025-02-06 6:28 UTC (permalink / raw)
To: SeongJae Park
Cc: Lorenzo Stoakes, Liam R. Howlett, Davidlohr Bueso, Andrew Morton,
David Hildenbrand, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
On Tue, 4 Feb 2025 11:53:43 -0800 SeongJae Park <sj@kernel.org> wrote:
> On Fri, 31 Jan 2025 17:51:45 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > On Fri, Jan 31, 2025 at 12:47:24PM -0500, Liam R. Howlett wrote:
> > > * Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
> > > > On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
> > > >
> > > > > On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > > > > > Optimize away redundant mmap lock operations in process_madvise()
> > > > > > by doing the mmap locking first, and then performing the remaining
> > > > > > work for all ranges inside the loop.
> > > > > >
> > > > > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > > > >
> > > > > I wonder if this might increase lock contention because now all of the
> > > > > vector operations will hold the relevant mm lock without releasing after
> > > > > each operation?
[...]
> >
> > Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
> > high, but rather UIO_FASTIOV, which is limited to 8 entries.
>
> process_madvise() indeed has an iovec array of UIO_FASTIOV size, namely
> iovstack. But, if I understood the code correctly, iovstack is used only as a
> fast path when the user requests advising fewer than UIO_FASTIOV regions.
>
> I actually confirmed I can make the loop iterate 1024 times, using my
> microbenchmark[1]. I checked by running the program with the
> 'eval_pmadv $((4*1024*1024)) $((4*1024)) $((4*1024*1024))' command and
> counting the number of iterations of the vector_madvise() main loop using
> printk(). Please let me know if I'm missing something.
>
> >
> > (Some have been surprised by this limitation...!)
> >
> > So I think at this point scaling isn't a huge issue, I raise it because in
> > future we may want to increase this limit, at which point we should think about
> > it, which is why I sort of hand-waved it away a bit.
>
> I personally think this may still not be a huge issue, especially given that
> users can test and tune the batching size. But I'd like to hear more opinions
> if any are available.
Maybe I should have waited longer for more opinions, but I just impatiently
sent the next spin of this patch series[1]. Because Davidlohr and Lorenzo
promised their tags based on the assumption of the UIO_FASTIOV limit, while I
think that may not be an effective limit, I didn't add their tags to the
fourth patch of the series. Please let me know if you have any concerns about
that, either on this thread or on the new patch series thread[1].
[1] https://lore.kernel.org/20250206061517.2958-1-sj@kernel.org
>
> [1] https://github.com/sjp38/eval_proc_madvise/blob/master/eval_pmadv.c
Thanks,
SJ
[...]
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
2025-01-31 17:51 ` Lorenzo Stoakes
2025-01-31 17:58 ` Davidlohr Bueso
2025-02-04 19:53 ` SeongJae Park
@ 2025-05-17 19:28 ` Lorenzo Stoakes
2025-05-19 18:25 ` SeongJae Park
2 siblings, 1 reply; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-05-17 19:28 UTC (permalink / raw)
To: Liam R. Howlett, Davidlohr Bueso, SeongJae Park, Andrew Morton,
David Hildenbrand, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
On Fri, Jan 31, 2025 at 05:51:45PM +0000, Lorenzo Stoakes wrote:
> On Fri, Jan 31, 2025 at 12:47:24PM -0500, Liam R. Howlett wrote:
> > * Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
> > > On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
> > >
> > > > On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > > > > Optimize away redundant mmap lock operations in process_madvise() by
> > > > > doing the mmap locking first, and then performing the remaining work
> > > > > for all ranges inside the loop.
> > > > >
> > > > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > > >
> > > > I wonder if this might increase lock contention because now all of the
> > > > vector operations will hold the relevant mm lock without releasing after
> > > > each operation?
> > >
> > > That was exactly my concern. While afaict the numbers presented in v1
> > > are quite nice, this is ultimately a micro-benchmark, where no other
> > > unrelated threads are impacted by these new hold times.
> >
> > Indeed, I was also concerned about this scenario.
> >
> > But this method does have the added advantage of keeping the vma space
> > in the same state as it was expected during the initial call - although
> > the race does still exist on looking vs acting on the data. This would
> > just remove the intermediate changes.
> >
> > >
> > > > Probably it's ok given limited size of iov, but maybe in future we'd want
> > > > to set a limit on the ranges before we drop/reacquire lock?
> > >
> > > imo, this should best be done in the same patch/series. Maybe extend
> > > the benchmark to use IOV_MAX and find a sweet spot?
> >
> > Are you worried this is over-engineering for a problem that may never be
> > an issue, or is there a particular usecase you have in mind?
> >
> > It is probably worth investigating, and maybe a potential usecase would
> > help with the targeted sweet spot?
> >
>
> Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
> high, but rather UIO_FASTIOV, which is limited to 8 entries.
>
> (Some have been surprised by this limitation...!)
Surprised, perhaps because I was wrong about this :) Apologies for that.
SJ raised this in [0] and the non-RFC version of this series is over at [1].
[0]: https://lore.kernel.org/all/20250517162048.36347-1-sj@kernel.org/
[1]: https://lore.kernel.org/all/20250206061517.2958-1-sj@kernel.org/
We should revisit this and determine whether the lock drop/reacquire is
required, perhaps by doing some experiments with heavy operations using
UIO_MAXIOV entries?
SJ - could you take a look at this please?
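Something like the below might do as a worst-case driver (a rough, untested
userspace sketch; names and sizes are placeholders), ideally while another
thread in the target keeps calling mmap()/munmap() so any contention shows
up:

        #include <sys/syscall.h>
        #include <sys/uio.h>
        #include <unistd.h>

        #define NR_RANGES 1024 /* UIO_MAXIOV */

        /*
         * Advise NR_RANGES ranges in a single process_madvise() call, so
         * with this series the mmap_lock is held across the whole vector.
         */
        static ssize_t advise_max_vector(int pidfd, char *base,
                        size_t range_sz, int advice)
        {
                struct iovec iov[NR_RANGES];

                for (int i = 0; i < NR_RANGES; i++) {
                        iov[i].iov_base = base + i * range_sz;
                        iov[i].iov_len = range_sz;
                }
                return syscall(SYS_process_madvise, pidfd, iov, NR_RANGES,
                                advice, 0);
        }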
>
> So I think at this point scaling isn't a huge issue, I raise it because in
> future we may want to increase this limit, at which point we should think about
> it, which is why I sort of hand-waved it away a bit.
Again as I said here, I suspect _probably_ this won't be too much of an
issue - but it is absolutely one we need to address.
>
> > Thanks,
> > Liam
> >
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
2025-05-17 19:28 ` Lorenzo Stoakes
@ 2025-05-19 18:25 ` SeongJae Park
0 siblings, 0 replies; 31+ messages in thread
From: SeongJae Park @ 2025-05-19 18:25 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, Liam R. Howlett, Davidlohr Bueso, Andrew Morton,
David Hildenbrand, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
On Sat, 17 May 2025 20:28:49 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Fri, Jan 31, 2025 at 05:51:45PM +0000, Lorenzo Stoakes wrote:
> > On Fri, Jan 31, 2025 at 12:47:24PM -0500, Liam R. Howlett wrote:
> > > * Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
> > > > On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
> > > >
> > > > > On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > > > > > Optimize away redundant mmap lock operations in process_madvise()
> > > > > > by doing the mmap locking first, and then performing the remaining
> > > > > > work for all ranges inside the loop.
> > > > > >
> > > > > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > > > >
> > > > > I wonder if this might increase lock contention because now all of the
> > > > > vector operations will hold the relevant mm lock without releasing after
> > > > > each operation?
> > > >
> > > > That was exactly my concern. While afaict the numbers presented in v1
> > > > are quite nice, this is ultimately a micro-benchmark, where no other
> > > > unrelated threads are impacted by these new hold times.
> > >
> > > Indeed, I was also concerned about this scenario.
> > >
> > > But this method does have the added advantage of keeping the vma space
> > > in the same state as it was expected during the initial call - although
> > > the race does still exist on looking vs acting on the data. This would
> > > just remove the intermediate changes.
> > >
> > > >
> > > > > Probably it's ok given limited size of iov, but maybe in future we'd want
> > > > > to set a limit on the ranges before we drop/reacquire lock?
> > > >
> > > > imo, this should best be done in the same patch/series. Maybe extend
> > > > the benchmark to use IOV_MAX and find a sweet spot?
> > >
> > > Are you worried this is over-engineering for a problem that may never be
> > > an issue, or is there a particular usecase you have in mind?
> > >
> > > It is probably worth investigating, and maybe a potential usecase would
> > > help with the targeted sweet spot?
> > >
> >
> > Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
> > high, but rather UIO_FASTIOV, which is limited to 8 entries.
> >
> > (Some have been surprised by this limitation...!)
>
> Surprised, perhaps because I was wrong about this :) Apologies for that.
>
> SJ raised this in [0] and the non-RFC version of this series is over at [1].
>
> [0]: https://lore.kernel.org/all/20250517162048.36347-1-sj@kernel.org/
> [1]: https://lore.kernel.org/all/20250206061517.2958-1-sj@kernel.org/
I actually mentioned[1] that I think the real limit is UIO_MAXIOV, but still
that wouldn't be a real problem since users can tune the batching size.
Indeed, jemalloc has made a change[2] to use process_madvise() with a
batching size of up to 128.
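For illustration, such userspace batching could look roughly like the below.
This is a hypothetical helper, not jemalloc's actual code; only the batch
size of 128 is taken from that change:

        #include <sys/syscall.h>
        #include <sys/uio.h>
        #include <unistd.h>

        /*
         * Advise 'n' ranges in chunks of at most 128 entries per
         * process_madvise() call, so each call holds mmap_lock for only
         * one batch.  Short (partial) advice returns are ignored for
         * brevity.
         */
        static int advise_batched(int pidfd, const struct iovec *iov,
                        size_t n, int advice)
        {
                size_t done = 0;

                while (done < n) {
                        size_t batch = n - done < 128 ? n - done : 128;

                        if (syscall(SYS_process_madvise, pidfd, iov + done,
                                        batch, advice, 0) < 0)
                                return -1;
                        done += batch;
                }
                return 0;
        }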
I impatiently sent[3] the next revision without giving you enough time to
reply, though.
>
> We should revisit this and determine whether the drop/reacquire lock is
> required, perhaps doing some experiments around heavy operations using
> UIO_MAXIOV entries?
>
> SJ - could you take a look at this please?
We had a chance to test this against a production workload, and found no
visible regression. The workload does not call process_madvise() intensively,
though. Our internal testing of kernels carrying this change also hasn't
found any problem so far, though the process_madvise() calls in that testing
are also not intensive, to the best of my knowledge.

So my thought about UIO_MAXIOV is the same. I anticipate no issue (until
someone yells ;) ) and haven't found evidence of a problem. But, same as in
the previous discussion[1], I agree more testing would be good, while I have
no good list of benchmarks for this. It would be nice if someone could
suggest suitable benchmarks.
>
> >
> > So I think at this point scaling isn't a huge issue, I raise it because in
> > future we may want to increase this limit, at which point we should think about
> > it, which is why I sort of hand-waved it away a bit.
>
> Again as I said here, I suspect _probably_ this won't be too much of an
> issue - but it is absolutely one we need to address.
Yes, I agree :)
[1] https://lore.kernel.org/20250204195343.16500-1-sj@kernel.org
[2] https://github.com/jemalloc/jemalloc/pull/2794/commits/c3604456d4c1f570348a
[3] https://lore.kernel.org/20250206062801.3060-1-sj@kernel.org
Thanks,
SJ
[...]
^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2025-05-19 18:25 UTC | newest]
Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-17 1:30 [RFC PATCH v2 0/4] mm/madvise: remove redundant mmap_lock operations from process_madvise() SeongJae Park
2025-01-17 1:30 ` [RFC PATCH v2 1/4] mm/madvise: split out mmap locking operations for madvise() SeongJae Park
2025-01-29 19:18 ` Shakeel Butt
2025-01-31 15:58 ` Lorenzo Stoakes
2025-01-31 17:33 ` Davidlohr Bueso
2025-01-17 1:30 ` [RFC PATCH v2 2/4] mm/madvise: split out madvise input validity check SeongJae Park
2025-01-29 19:18 ` Shakeel Butt
2025-01-31 16:01 ` Lorenzo Stoakes
2025-01-31 19:19 ` Davidlohr Bueso
2025-01-17 1:30 ` [RFC PATCH v2 3/4] mm/madvise: split out madvise() behavior execution SeongJae Park
2025-01-29 19:19 ` Shakeel Butt
2025-01-31 16:10 ` Lorenzo Stoakes
2025-01-17 1:30 ` [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise() SeongJae Park
2025-01-29 19:20 ` Shakeel Butt
2025-01-31 16:53 ` Lorenzo Stoakes
2025-01-31 17:31 ` Davidlohr Bueso
2025-01-31 17:47 ` Liam R. Howlett
2025-01-31 17:51 ` Lorenzo Stoakes
2025-01-31 17:58 ` Davidlohr Bueso
2025-02-04 19:53 ` SeongJae Park
2025-02-06 6:28 ` SeongJae Park
2025-05-17 19:28 ` Lorenzo Stoakes
2025-05-19 18:25 ` SeongJae Park
2025-01-31 19:17 ` Shakeel Butt
2025-02-04 18:56 ` SeongJae Park
2025-01-29 19:22 ` [RFC PATCH v2 0/4] " Shakeel Butt
2025-01-29 21:09 ` SeongJae Park
2025-01-31 16:04 ` Liam R. Howlett
2025-01-31 16:30 ` SeongJae Park
2025-01-31 16:55 ` Lorenzo Stoakes
2025-01-31 17:53 ` Lorenzo Stoakes