* [PATCH 1/4] mm/madvise: split out mmap locking operations for madvise()
2025-02-06 6:15 [PATCH 0/4] mm/madvise: remove redundant mmap_lock operations from process_madvise() SeongJae Park
@ 2025-02-06 6:15 ` SeongJae Park
2025-02-06 20:27 ` Liam R. Howlett
2025-02-06 6:15 ` [PATCH 2/4] mm/madvise: split out madvise input validity check SeongJae Park
` (3 subsequent siblings)
4 siblings, 1 reply; 17+ messages in thread
From: SeongJae Park @ 2025-02-06 6:15 UTC (permalink / raw)
To: Andrew Morton
Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
Davidlohr Bueso, Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka,
linux-kernel, linux-mm
Split out the madvise behavior-dependent mmap_lock operations from
do_madvise(), for easier reuse of the logic in an upcoming change.
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 45 ++++++++++++++++++++++++++++++++-------------
1 file changed, 32 insertions(+), 13 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index c8e28d51978a..df5a87a1846a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1567,6 +1567,33 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
madvise_vma_anon_name);
}
#endif /* CONFIG_ANON_VMA_NAME */
+
+static int madvise_lock(struct mm_struct *mm, int behavior)
+{
+
+#ifdef CONFIG_MEMORY_FAILURE
+ if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
+ return 0;
+#endif
+
+ if (madvise_need_mmap_write(behavior)) {
+ if (mmap_write_lock_killable(mm))
+ return -EINTR;
+ } else {
+ mmap_read_lock(mm);
+ }
+ return 0;
+
+}
+
+static void madvise_unlock(struct mm_struct *mm, int behavior)
+{
+ if (madvise_need_mmap_write(behavior))
+ mmap_write_unlock(mm);
+ else
+ mmap_read_unlock(mm);
+}
+
/*
* The madvise(2) system call.
*
@@ -1643,7 +1670,6 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
{
unsigned long end;
int error;
- int write;
size_t len;
struct blk_plug plug;
@@ -1665,19 +1691,15 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
if (end == start)
return 0;
+ error = madvise_lock(mm, behavior);
+ if (error)
+ return error;
+
#ifdef CONFIG_MEMORY_FAILURE
if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
return madvise_inject_error(behavior, start, start + len_in);
#endif
- write = madvise_need_mmap_write(behavior);
- if (write) {
- if (mmap_write_lock_killable(mm))
- return -EINTR;
- } else {
- mmap_read_lock(mm);
- }
-
start = untagged_addr_remote(mm, start);
end = start + len;
@@ -1694,10 +1716,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
}
blk_finish_plug(&plug);
- if (write)
- mmap_write_unlock(mm);
- else
- mmap_read_unlock(mm);
+ madvise_unlock(mm, behavior);
return error;
}
--
2.39.5
^ permalink raw reply [flat|nested] 17+ messages in thread* Re: [PATCH 1/4] mm/madvise: split out mmap locking operations for madvise()
2025-02-06 6:15 ` [PATCH 1/4] mm/madvise: split out mmap locking operations for madvise() SeongJae Park
@ 2025-02-06 20:27 ` Liam R. Howlett
0 siblings, 0 replies; 17+ messages in thread
From: Liam R. Howlett @ 2025-02-06 20:27 UTC (permalink / raw)
To: SeongJae Park
Cc: Andrew Morton, David Hildenbrand, Davidlohr Bueso,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
* SeongJae Park <sj@kernel.org> [250206 01:15]:
> Split out the madvise behavior-dependent mmap_lock operations from
> do_madvise(), for easier reuse of the logic in an upcoming change.
>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
> Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Liam R. Howlett <howlett@gmail.com>
> ---
> mm/madvise.c | 45 ++++++++++++++++++++++++++++++++-------------
> 1 file changed, 32 insertions(+), 13 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c8e28d51978a..df5a87a1846a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1567,6 +1567,33 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> madvise_vma_anon_name);
> }
> #endif /* CONFIG_ANON_VMA_NAME */
> +
> +static int madvise_lock(struct mm_struct *mm, int behavior)
> +{
> +
> +#ifdef CONFIG_MEMORY_FAILURE
> + if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> + return 0;
> +#endif
> +
> + if (madvise_need_mmap_write(behavior)) {
> + if (mmap_write_lock_killable(mm))
> + return -EINTR;
> + } else {
> + mmap_read_lock(mm);
> + }
> + return 0;
> +
> +}
> +
> +static void madvise_unlock(struct mm_struct *mm, int behavior)
> +{
> + if (madvise_need_mmap_write(behavior))
> + mmap_write_unlock(mm);
> + else
> + mmap_read_unlock(mm);
> +}
> +
> /*
> * The madvise(2) system call.
> *
> @@ -1643,7 +1670,6 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> {
> unsigned long end;
> int error;
> - int write;
> size_t len;
> struct blk_plug plug;
>
> @@ -1665,19 +1691,15 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> if (end == start)
> return 0;
>
> + error = madvise_lock(mm, behavior);
> + if (error)
> + return error;
> +
> #ifdef CONFIG_MEMORY_FAILURE
> if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> return madvise_inject_error(behavior, start, start + len_in);
> #endif
>
> - write = madvise_need_mmap_write(behavior);
> - if (write) {
> - if (mmap_write_lock_killable(mm))
> - return -EINTR;
> - } else {
> - mmap_read_lock(mm);
> - }
> -
> start = untagged_addr_remote(mm, start);
> end = start + len;
>
> @@ -1694,10 +1716,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> }
> blk_finish_plug(&plug);
>
> - if (write)
> - mmap_write_unlock(mm);
> - else
> - mmap_read_unlock(mm);
> + madvise_unlock(mm, behavior);
>
> return error;
> }
> --
> 2.39.5
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 2/4] mm/madvise: split out madvise input validity check
2025-02-06 6:15 [PATCH 0/4] mm/madvise: remove redundant mmap_lock operations from process_madvise() SeongJae Park
2025-02-06 6:15 ` [PATCH 1/4] mm/madvise: split out mmap locking operations for madvise() SeongJae Park
@ 2025-02-06 6:15 ` SeongJae Park
2025-02-06 20:29 ` Liam R. Howlett
2025-02-06 6:15 ` [PATCH 3/4] mm/madvise: split out madvise() behavior execution SeongJae Park
` (2 subsequent siblings)
4 siblings, 1 reply; 17+ messages in thread
From: SeongJae Park @ 2025-02-06 6:15 UTC (permalink / raw)
To: Andrew Morton
Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
Davidlohr Bueso, Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka,
linux-kernel, linux-mm
Split out the madvise parameters validation logic from do_madvise(), for
easy reuse of the logic from a future change.
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 32 ++++++++++++++++++++++----------
1 file changed, 22 insertions(+), 10 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index df5a87a1846a..efab2878be7c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1594,6 +1594,27 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
mmap_read_unlock(mm);
}
+static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
+{
+ size_t len;
+
+ if (!madvise_behavior_valid(behavior))
+ return false;
+
+ if (!PAGE_ALIGNED(start))
+ return false;
+ len = PAGE_ALIGN(len_in);
+
+ /* Check to see whether len was rounded up from small -ve to zero */
+ if (len_in && !len)
+ return false;
+
+ if (start + len < start)
+ return false;
+
+ return true;
+}
+
/*
* The madvise(2) system call.
*
@@ -1673,20 +1694,11 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
size_t len;
struct blk_plug plug;
- if (!madvise_behavior_valid(behavior))
+ if (!is_valid_madvise(start, len_in, behavior))
return -EINVAL;
- if (!PAGE_ALIGNED(start))
- return -EINVAL;
len = PAGE_ALIGN(len_in);
-
- /* Check to see whether len was rounded up from small -ve to zero */
- if (len_in && !len)
- return -EINVAL;
-
end = start + len;
- if (end < start)
- return -EINVAL;
if (end == start)
return 0;
--
2.39.5
^ permalink raw reply [flat|nested] 17+ messages in thread* Re: [PATCH 2/4] mm/madvise: split out madvise input validity check
2025-02-06 6:15 ` [PATCH 2/4] mm/madvise: split out madvise input validity check SeongJae Park
@ 2025-02-06 20:29 ` Liam R. Howlett
0 siblings, 0 replies; 17+ messages in thread
From: Liam R. Howlett @ 2025-02-06 20:29 UTC (permalink / raw)
To: SeongJae Park
Cc: Andrew Morton, David Hildenbrand, Davidlohr Bueso,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
* SeongJae Park <sj@kernel.org> [250206 01:15]:
> Split out the madvise parameters validation logic from do_madvise(), for
> easy reuse of the logic from a future change.
>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
> Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Liam R. Howlett <howlett@gmail.com>
> ---
> mm/madvise.c | 32 ++++++++++++++++++++++----------
> 1 file changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index df5a87a1846a..efab2878be7c 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1594,6 +1594,27 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
> mmap_read_unlock(mm);
> }
>
> +static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
> +{
> + size_t len;
> +
> + if (!madvise_behavior_valid(behavior))
> + return false;
> +
> + if (!PAGE_ALIGNED(start))
> + return false;
> + len = PAGE_ALIGN(len_in);
> +
> + /* Check to see whether len was rounded up from small -ve to zero */
> + if (len_in && !len)
> + return false;
> +
> + if (start + len < start)
> + return false;
> +
> + return true;
> +}
> +
> /*
> * The madvise(2) system call.
> *
> @@ -1673,20 +1694,11 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> size_t len;
> struct blk_plug plug;
>
> - if (!madvise_behavior_valid(behavior))
> + if (!is_valid_madvise(start, len_in, behavior))
> return -EINVAL;
>
> - if (!PAGE_ALIGNED(start))
> - return -EINVAL;
> len = PAGE_ALIGN(len_in);
> -
> - /* Check to see whether len was rounded up from small -ve to zero */
> - if (len_in && !len)
> - return -EINVAL;
> -
> end = start + len;
> - if (end < start)
> - return -EINVAL;
>
> if (end == start)
> return 0;
> --
> 2.39.5
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 3/4] mm/madvise: split out madvise() behavior execution
2025-02-06 6:15 [PATCH 0/4] mm/madvise: remove redundant mmap_lock operations from process_madvise() SeongJae Park
2025-02-06 6:15 ` [PATCH 1/4] mm/madvise: split out mmap locking operations for madvise() SeongJae Park
2025-02-06 6:15 ` [PATCH 2/4] mm/madvise: split out madvise input validity check SeongJae Park
@ 2025-02-06 6:15 ` SeongJae Park
2025-02-06 20:30 ` Liam R. Howlett
2025-02-06 6:15 ` [PATCH 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise() SeongJae Park
2025-02-11 8:48 ` [PATCH 0/4] " Vern Hao
4 siblings, 1 reply; 17+ messages in thread
From: SeongJae Park @ 2025-02-06 6:15 UTC (permalink / raw)
To: Andrew Morton
Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
Davidlohr Bueso, Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka,
linux-kernel, linux-mm
Split out the madvise behavior applying logic from do_madvise() to make
it easier to reuse from the following change.
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 53 +++++++++++++++++++++++++++++-----------------------
1 file changed, 30 insertions(+), 23 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index efab2878be7c..31e5df75b926 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1615,6 +1615,35 @@ static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
return true;
}
+static int madvise_do_behavior(struct mm_struct *mm,
+ unsigned long start, size_t len_in, size_t len, int behavior)
+{
+ struct blk_plug plug;
+ unsigned long end;
+ int error;
+
+#ifdef CONFIG_MEMORY_FAILURE
+ if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
+ return madvise_inject_error(behavior, start, start + len_in);
+#endif
+ start = untagged_addr_remote(mm, start);
+ end = start + len;
+
+ blk_start_plug(&plug);
+ switch (behavior) {
+ case MADV_POPULATE_READ:
+ case MADV_POPULATE_WRITE:
+ error = madvise_populate(mm, start, end, behavior);
+ break;
+ default:
+ error = madvise_walk_vmas(mm, start, end, behavior,
+ madvise_vma_behavior);
+ break;
+ }
+ blk_finish_plug(&plug);
+ return error;
+}
+
/*
* The madvise(2) system call.
*
@@ -1692,7 +1721,6 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
unsigned long end;
int error;
size_t len;
- struct blk_plug plug;
if (!is_valid_madvise(start, len_in, behavior))
return -EINVAL;
@@ -1706,28 +1734,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
error = madvise_lock(mm, behavior);
if (error)
return error;
-
-#ifdef CONFIG_MEMORY_FAILURE
- if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
- return madvise_inject_error(behavior, start, start + len_in);
-#endif
-
- start = untagged_addr_remote(mm, start);
- end = start + len;
-
- blk_start_plug(&plug);
- switch (behavior) {
- case MADV_POPULATE_READ:
- case MADV_POPULATE_WRITE:
- error = madvise_populate(mm, start, end, behavior);
- break;
- default:
- error = madvise_walk_vmas(mm, start, end, behavior,
- madvise_vma_behavior);
- break;
- }
- blk_finish_plug(&plug);
-
+ error = madvise_do_behavior(mm, start, len_in, len, behavior);
madvise_unlock(mm, behavior);
return error;
--
2.39.5
^ permalink raw reply [flat|nested] 17+ messages in thread* Re: [PATCH 3/4] mm/madvise: split out madvise() behavior execution
2025-02-06 6:15 ` [PATCH 3/4] mm/madvise: split out madvise() behavior execution SeongJae Park
@ 2025-02-06 20:30 ` Liam R. Howlett
0 siblings, 0 replies; 17+ messages in thread
From: Liam R. Howlett @ 2025-02-06 20:30 UTC (permalink / raw)
To: SeongJae Park
Cc: Andrew Morton, David Hildenbrand, Davidlohr Bueso,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
* SeongJae Park <sj@kernel.org> [250206 01:15]:
> Split out the madvise behavior applying logic from do_madvise() to make
> it easier to reuse from the following change.
>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Liam R. Howlett <howlett@gmail.com>
> ---
> mm/madvise.c | 53 +++++++++++++++++++++++++++++-----------------------
> 1 file changed, 30 insertions(+), 23 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index efab2878be7c..31e5df75b926 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1615,6 +1615,35 @@ static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior)
> return true;
> }
>
> +static int madvise_do_behavior(struct mm_struct *mm,
> + unsigned long start, size_t len_in, size_t len, int behavior)
> +{
> + struct blk_plug plug;
> + unsigned long end;
> + int error;
> +
> +#ifdef CONFIG_MEMORY_FAILURE
> + if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> + return madvise_inject_error(behavior, start, start + len_in);
> +#endif
> + start = untagged_addr_remote(mm, start);
> + end = start + len;
> +
> + blk_start_plug(&plug);
> + switch (behavior) {
> + case MADV_POPULATE_READ:
> + case MADV_POPULATE_WRITE:
> + error = madvise_populate(mm, start, end, behavior);
> + break;
> + default:
> + error = madvise_walk_vmas(mm, start, end, behavior,
> + madvise_vma_behavior);
> + break;
> + }
> + blk_finish_plug(&plug);
> + return error;
> +}
> +
> /*
> * The madvise(2) system call.
> *
> @@ -1692,7 +1721,6 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> unsigned long end;
> int error;
> size_t len;
> - struct blk_plug plug;
>
> if (!is_valid_madvise(start, len_in, behavior))
> return -EINVAL;
> @@ -1706,28 +1734,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> error = madvise_lock(mm, behavior);
> if (error)
> return error;
> -
> -#ifdef CONFIG_MEMORY_FAILURE
> - if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> - return madvise_inject_error(behavior, start, start + len_in);
> -#endif
> -
> - start = untagged_addr_remote(mm, start);
> - end = start + len;
> -
> - blk_start_plug(&plug);
> - switch (behavior) {
> - case MADV_POPULATE_READ:
> - case MADV_POPULATE_WRITE:
> - error = madvise_populate(mm, start, end, behavior);
> - break;
> - default:
> - error = madvise_walk_vmas(mm, start, end, behavior,
> - madvise_vma_behavior);
> - break;
> - }
> - blk_finish_plug(&plug);
> -
> + error = madvise_do_behavior(mm, start, len_in, len, behavior);
> madvise_unlock(mm, behavior);
>
> return error;
> --
> 2.39.5
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
2025-02-06 6:15 [PATCH 0/4] mm/madvise: remove redundant mmap_lock operations from process_madvise() SeongJae Park
` (2 preceding siblings ...)
2025-02-06 6:15 ` [PATCH 3/4] mm/madvise: split out madvise() behavior execution SeongJae Park
@ 2025-02-06 6:15 ` SeongJae Park
2025-02-06 13:04 ` Lorenzo Stoakes
` (2 more replies)
2025-02-11 8:48 ` [PATCH 0/4] " Vern Hao
4 siblings, 3 replies; 17+ messages in thread
From: SeongJae Park @ 2025-02-06 6:15 UTC (permalink / raw)
To: Andrew Morton
Cc: SeongJae Park, Liam R. Howlett, David Hildenbrand,
Davidlohr Bueso, Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka,
linux-kernel, linux-mm
Optimize redundant mmap lock operations from process_madvise() by
directly doing the mmap locking first, and then the remaining works for
all ranges in the loop.
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 31e5df75b926..5a0a1fc99d27 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1754,9 +1754,26 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
total_len = iov_iter_count(iter);
+ ret = madvise_lock(mm, behavior);
+ if (ret)
+ return ret;
+
while (iov_iter_count(iter)) {
- ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
- iter_iov_len(iter), behavior);
+ unsigned long start = (unsigned long)iter_iov_addr(iter);
+ size_t len_in = iter_iov_len(iter);
+ size_t len;
+
+ if (!is_valid_madvise(start, len_in, behavior)) {
+ ret = -EINVAL;
+ break;
+ }
+
+ len = PAGE_ALIGN(len_in);
+ if (start + len == start)
+ ret = 0;
+ else
+ ret = madvise_do_behavior(mm, start, len_in, len,
+ behavior);
/*
* An madvise operation is attempting to restart the syscall,
* but we cannot proceed as it would not be correct to repeat
@@ -1772,12 +1789,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
ret = -EINTR;
break;
}
+
+ /* Drop and reacquire lock to unwind race. */
+ madvise_unlock(mm, behavior);
+ madvise_lock(mm, behavior);
continue;
}
if (ret < 0)
break;
iov_iter_advance(iter, iter_iov_len(iter));
}
+ madvise_unlock(mm, behavior);
ret = (total_len - iov_iter_count(iter)) ? : ret;
--
2.39.5
^ permalink raw reply [flat|nested] 17+ messages in thread* Re: [PATCH 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
2025-02-06 6:15 ` [PATCH 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise() SeongJae Park
@ 2025-02-06 13:04 ` Lorenzo Stoakes
2025-02-06 16:53 ` SeongJae Park
2025-02-06 20:32 ` Liam R. Howlett
2025-02-11 5:30 ` Lai, Yi
2 siblings, 1 reply; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-02-06 13:04 UTC (permalink / raw)
To: SeongJae Park
Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand,
Davidlohr Bueso, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
On Wed, Feb 05, 2025 at 10:15:17PM -0800, SeongJae Park wrote:
> Optimize redundant mmap lock operations from process_madvise() by
> directly doing the mmap locking first, and then the remaining works for
> all ranges in the loop.
>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
But please fixup the comment as below. Maybe Andrew could do it if I am
giving enough info to go on, otherwise it could be a quick fix-patch? I
attach a lazy attempt at a fix-patch in case that's useful.
> ---
> mm/madvise.c | 26 ++++++++++++++++++++++++--
> 1 file changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 31e5df75b926..5a0a1fc99d27 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1754,9 +1754,26 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>
> total_len = iov_iter_count(iter);
>
> + ret = madvise_lock(mm, behavior);
> + if (ret)
> + return ret;
> +
> while (iov_iter_count(iter)) {
> - ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
> - iter_iov_len(iter), behavior);
> + unsigned long start = (unsigned long)iter_iov_addr(iter);
> + size_t len_in = iter_iov_len(iter);
> + size_t len;
> +
> + if (!is_valid_madvise(start, len_in, behavior)) {
> + ret = -EINVAL;
> + break;
> + }
> +
> + len = PAGE_ALIGN(len_in);
> + if (start + len == start)
> + ret = 0;
> + else
> + ret = madvise_do_behavior(mm, start, len_in, len,
> + behavior);
> /*
> * An madvise operation is attempting to restart the syscall,
> * but we cannot proceed as it would not be correct to repeat
This comment is now incorrect, please fix up the 2nd paragraph so it is
something like:
/*
* An madvise operation is attempting to restart the syscall,
* but we cannot proceed as it would not be correct to repeat
* the operation in aggregate, and would be surprising to the
* user.
*
* We drop and reacquire locks so it is safe to just loop and
* try again. We check for fatal signals in case we need exit
* early anyway.
*/
See below fixpatch if that's easier...
> @@ -1772,12 +1789,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> ret = -EINTR;
> break;
> }
> +
> + /* Drop and reacquire lock to unwind race. */
> + madvise_unlock(mm, behavior);
> + madvise_lock(mm, behavior);
Thanks for adding!
> continue;
> }
> if (ret < 0)
> break;
> iov_iter_advance(iter, iter_iov_len(iter));
> }
> + madvise_unlock(mm, behavior);
>
> ret = (total_len - iov_iter_count(iter)) ? : ret;
>
> --
> 2.39.5
For convenience I attach a lazy, untested fixpatch which may or may not work :)
----8<----
From 86e99a658a5e0195050d6bb9e19975f54bf14e7a Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Thu, 6 Feb 2025 13:02:52 +0000
Subject: [PATCH] foo
---
mm/madvise.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index c8e28d51978a..ab5f5da1571f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1725,7 +1725,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
* the operation in aggregate, and would be surprising to the
* user.
*
- * As we have already dropped locks, it is safe to just loop and
+ * We drop and reacquire locks so it is safe to just loop and
* try again. We check for fatal signals in case we need exit
* early anyway.
*/
--
2.48.1
^ permalink raw reply [flat|nested] 17+ messages in thread* Re: [PATCH 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
2025-02-06 13:04 ` Lorenzo Stoakes
@ 2025-02-06 16:53 ` SeongJae Park
0 siblings, 0 replies; 17+ messages in thread
From: SeongJae Park @ 2025-02-06 16:53 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, Andrew Morton, Liam R. Howlett, David Hildenbrand,
Davidlohr Bueso, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
On Thu, 6 Feb 2025 13:04:53 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Wed, Feb 05, 2025 at 10:15:17PM -0800, SeongJae Park wrote:
> > Optimize redundant mmap lock operations from process_madvise() by
> > directly doing the mmap locking first, and then the remaining works for
> > all ranges in the loop.
> >
> > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Signed-off-by: SeongJae Park <sj@kernel.org>
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> But please fixup the comment as below. Maybe Andrew could do it if I am
> giving enough info to go on, otherwise it could be a quick fix-patch? I
> attach a lazy attempt at a fix-patch in case that's useful.
Thank you for your reviews and this fixup! Since Andrew already picked this
patch in mm-unstable, I will keep my eyes on the tree and make some action if
needed, to ensure your suggestion is applied.
Thanks,
SJ
[...]
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
2025-02-06 6:15 ` [PATCH 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise() SeongJae Park
2025-02-06 13:04 ` Lorenzo Stoakes
@ 2025-02-06 20:32 ` Liam R. Howlett
2025-02-11 5:30 ` Lai, Yi
2 siblings, 0 replies; 17+ messages in thread
From: Liam R. Howlett @ 2025-02-06 20:32 UTC (permalink / raw)
To: SeongJae Park
Cc: Andrew Morton, David Hildenbrand, Davidlohr Bueso,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
* SeongJae Park <sj@kernel.org> [250206 01:15]:
> Optimize redundant mmap lock operations from process_madvise() by
> directly doing the mmap locking first, and then the remaining works for
> all ranges in the loop.
>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: SeongJae Park <sj@kernel.org>
It might be worth calling out the drop/reacquire in the change log as
well as the comment like Lorenzo said.
Reviewed-by: Liam R. Howlett <howlett@gmail.com>
> ---
> mm/madvise.c | 26 ++++++++++++++++++++++++--
> 1 file changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 31e5df75b926..5a0a1fc99d27 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1754,9 +1754,26 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>
> total_len = iov_iter_count(iter);
>
> + ret = madvise_lock(mm, behavior);
> + if (ret)
> + return ret;
> +
> while (iov_iter_count(iter)) {
> - ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
> - iter_iov_len(iter), behavior);
> + unsigned long start = (unsigned long)iter_iov_addr(iter);
> + size_t len_in = iter_iov_len(iter);
> + size_t len;
> +
> + if (!is_valid_madvise(start, len_in, behavior)) {
> + ret = -EINVAL;
> + break;
> + }
> +
> + len = PAGE_ALIGN(len_in);
> + if (start + len == start)
> + ret = 0;
> + else
> + ret = madvise_do_behavior(mm, start, len_in, len,
> + behavior);
> /*
> * An madvise operation is attempting to restart the syscall,
> * but we cannot proceed as it would not be correct to repeat
> @@ -1772,12 +1789,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> ret = -EINTR;
> break;
> }
> +
> + /* Drop and reacquire lock to unwind race. */
> + madvise_unlock(mm, behavior);
> + madvise_lock(mm, behavior);
> continue;
> }
> if (ret < 0)
> break;
> iov_iter_advance(iter, iter_iov_len(iter));
> }
> + madvise_unlock(mm, behavior);
>
> ret = (total_len - iov_iter_count(iter)) ? : ret;
>
> --
> 2.39.5
^ permalink raw reply [flat|nested] 17+ messages in thread* Re: [PATCH 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
2025-02-06 6:15 ` [PATCH 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise() SeongJae Park
2025-02-06 13:04 ` Lorenzo Stoakes
2025-02-06 20:32 ` Liam R. Howlett
@ 2025-02-11 5:30 ` Lai, Yi
2025-02-11 6:37 ` SeongJae Park
2025-02-11 10:34 ` Lorenzo Stoakes
2 siblings, 2 replies; 17+ messages in thread
From: Lai, Yi @ 2025-02-11 5:30 UTC (permalink / raw)
To: SeongJae Park
Cc: Andrew Morton, Liam R. Howlett, David Hildenbrand,
Davidlohr Bueso, Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka,
linux-kernel, linux-mm, yi1.lai
On Wed, Feb 05, 2025 at 10:15:17PM -0800, SeongJae Park wrote:
> Optimize redundant mmap lock operations from process_madvise() by
> directly doing the mmap locking first, and then the remaining works for
> all ranges in the loop.
>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
> mm/madvise.c | 26 ++++++++++++++++++++++++--
> 1 file changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 31e5df75b926..5a0a1fc99d27 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1754,9 +1754,26 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>
> total_len = iov_iter_count(iter);
>
> + ret = madvise_lock(mm, behavior);
> + if (ret)
> + return ret;
> +
> while (iov_iter_count(iter)) {
> - ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
> - iter_iov_len(iter), behavior);
> + unsigned long start = (unsigned long)iter_iov_addr(iter);
> + size_t len_in = iter_iov_len(iter);
> + size_t len;
> +
> + if (!is_valid_madvise(start, len_in, behavior)) {
> + ret = -EINVAL;
> + break;
> + }
> +
> + len = PAGE_ALIGN(len_in);
> + if (start + len == start)
> + ret = 0;
> + else
> + ret = madvise_do_behavior(mm, start, len_in, len,
> + behavior);
> /*
> * An madvise operation is attempting to restart the syscall,
> * but we cannot proceed as it would not be correct to repeat
> @@ -1772,12 +1789,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> ret = -EINTR;
> break;
> }
> +
> + /* Drop and reacquire lock to unwind race. */
> + madvise_unlock(mm, behavior);
> + madvise_lock(mm, behavior);
> continue;
> }
> if (ret < 0)
> break;
> iov_iter_advance(iter, iter_iov_len(iter));
> }
> + madvise_unlock(mm, behavior);
>
> ret = (total_len - iov_iter_count(iter)) ? : ret;
>
Hi SeongJae Park,
Greetings!
I used Syzkaller and found that there is WARNING in madvise_unlock in linux-next tag - next-20250210.
After bisection and the first bad commit is:
"
ec68fbd9e99f mm/madvise: remove redundant mmap_lock operations from process_madvise()
"
All detailed into can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250210_144836_madvise_unlock/bzImage_df5d6180169ae06a2eac57e33b077ad6f6252440
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/250210_144836_madvise_unlock/df5d6180169ae06a2eac57e33b077ad6f6252440_dmesg.log
"
[ 135.191347] Injecting memory failure for pfn 0x8ea0 at process virtual address 0x20e7f000
[ 135.194964] Memory failure: 0x8ea0: recovery action for reserved kernel page: Ignored
[ 135.195584] ------------[ cut here ]------------
[ 135.195863] WARNING: CPU: 1 PID: 680 at ./include/linux/rwsem.h:203 madvise_unlock+0x17e/0x1a0
[ 135.196395] Modules linked in:
[ 135.196612] CPU: 1 UID: 0 PID: 680 Comm: repro Not tainted 6.14.0-rc2-next-20250210-df5d6180169a #1
[ 135.197135] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 135.197818] RIP: 0010:madvise_unlock+0x17e/0x1a0
[ 135.198108] Code: a1 9f ff 31 f6 49 8d bc 24 e0 01 00 00 e8 fa 80 d5 03 31 ff 89 c3 89 c6 e8 9f 9b 9f ff 85 db 0f 85 1c ff ff ffe
[ 135.199154] RSP: 0018:ffff888022647e88 EFLAGS: 00010293
[ 135.199468] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff81e878f1
[ 135.199876] RDX: ffff88802264a540 RSI: ffffffff81e878fe RDI: 0000000000000005
[ 135.200285] RBP: ffff888022647ea0 R08: 0000000000000001 R09: ffffed10044c8f25
[ 135.200692] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888021116180
[ 135.201100] R13: ffff8880211162f0 R14: 0000000000004000 R15: ffff888021116180
[ 135.201531] FS: 00007f0327a07600(0000) GS:ffff88806c500000(0000) knlGS:0000000000000000
[ 135.201996] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 135.202329] CR2: 00007f032773e7f0 CR3: 0000000021bd4003 CR4: 0000000000770ef0
[ 135.202737] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 135.203155] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 135.203563] PKRU: 55555554
[ 135.203731] Call Trace:
[ 135.203883] <TASK>
[ 135.204020] ? show_regs+0x6d/0x80
[ 135.204240] ? __warn+0xf3/0x390
[ 135.204447] ? report_bug+0x25e/0x4b0
[ 135.204692] ? madvise_unlock+0x17e/0x1a0
[ 135.204937] ? report_bug+0x2cb/0x4b0
[ 135.205167] ? madvise_unlock+0x17e/0x1a0
[ 135.205413] ? madvise_unlock+0x17f/0x1a0
[ 135.205706] ? handle_bug+0xf1/0x190
[ 135.206188] ? exc_invalid_op+0x3c/0x80
[ 135.206423] ? asm_exc_invalid_op+0x1f/0x30
[ 135.206678] ? madvise_unlock+0x171/0x1a0
[ 135.206919] ? madvise_unlock+0x17e/0x1a0
[ 135.207156] ? madvise_unlock+0x17e/0x1a0
[ 135.207388] ? madvise_unlock+0x17e/0x1a0
[ 135.207623] do_madvise+0x14f/0x1a0
[ 135.207836] __x64_sys_madvise+0xb2/0x120
[ 135.208067] ? syscall_trace_enter+0x14f/0x280
[ 135.208328] x64_sys_call+0x19b1/0x2150
[ 135.208552] do_syscall_64+0x6d/0x140
[ 135.208766] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 135.209050] RIP: 0033:0x7f032763ee5d
[ 135.209261] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b8
[ 135.210261] RSP: 002b:00007ffcaafd26c8 EFLAGS: 00000207 ORIG_RAX: 000000000000001c
[ 135.210678] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f032763ee5d
[ 135.211069] RDX: 0000000000000064 RSI: 0000000000004000 RDI: 0000000020e7f000
[ 135.211458] RBP: 00007ffcaafd26d0 R08: 00007ffcaafd2140 R09: 00007ffcaafd2700
[ 135.211851] R10: 0000000000000000 R11: 0000000000000207 R12: 00007ffcaafd2828
[ 135.212248] R13: 00000000004016da R14: 0000000000403e08 R15: 00007f0327a4e000
[ 135.212656] </TASK>
[ 135.212794] irq event stamp: 807
[ 135.212987] hardirqs last enabled at (815): [<ffffffff81663b55>] __up_console_sem+0x95/0xb0
[ 135.213477] hardirqs last disabled at (824): [<ffffffff81663b3a>] __up_console_sem+0x7a/0xb0
[ 135.213951] softirqs last enabled at (628): [<ffffffff8148c40e>] __irq_exit_rcu+0x10e/0x170
[ 135.214426] softirqs last disabled at (623): [<ffffffff8148c40e>] __irq_exit_rcu+0x10e/0x170
[ 135.214910] ---[ end trace 0000000000000000 ]---
[ 135.215182]
[ 135.215283] =====================================
[ 135.215550] WARNING: bad unlock balance detected!
[ 135.215817] 6.14.0-rc2-next-20250210-df5d6180169a #1 Tainted: G W
[ 135.216234] -------------------------------------
[ 135.216502] repro/680 is trying to release lock (&mm->mmap_lock) at:
[ 135.216863] [<ffffffff81e87854>] madvise_unlock+0xd4/0x1a0
[ 135.217179] but there are no more locks to release!
[ 135.217457]
[ 135.217457] other info that might help us debug this:
[ 135.217819] no locks held by repro/680.
[ 135.218046]
[ 135.218046] stack backtrace:
[ 135.218296] CPU: 1 UID: 0 PID: 680 Comm: repro Tainted: G W 6.14.0-rc2-next-20250210-df5d6180169a #1
[ 135.218308] Tainted: [W]=WARN
[ 135.218310] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 135.218315] Call Trace:
[ 135.218317] <TASK>
[ 135.218320] dump_stack_lvl+0xea/0x150
[ 135.218330] ? madvise_unlock+0xd4/0x1a0
[ 135.218341] dump_stack+0x19/0x20
[ 135.218349] print_unlock_imbalance_bug+0x1b5/0x200
[ 135.218368] ? madvise_unlock+0xd4/0x1a0
[ 135.218379] lock_release+0x5bc/0x870
[ 135.218386] ? madvise_unlock+0x17f/0x1a0
[ 135.218397] ? handle_bug+0xf1/0x190
[ 135.218407] ? __pfx_lock_release+0x10/0x10
[ 135.218415] ? exc_invalid_op+0x3c/0x80
[ 135.218426] ? asm_exc_invalid_op+0x1f/0x30
[ 135.218441] up_write+0x31/0x550
[ 135.218449] ? madvise_unlock+0x17e/0x1a0
[ 135.218462] madvise_unlock+0xd4/0x1a0
[ 135.218474] do_madvise+0x14f/0x1a0
[ 135.218487] __x64_sys_madvise+0xb2/0x120
[ 135.218500] ? syscall_trace_enter+0x14f/0x280
[ 135.218511] x64_sys_call+0x19b1/0x2150
[ 135.218522] do_syscall_64+0x6d/0x140
[ 135.218531] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 135.218542] RIP: 0033:0x7f032763ee5d
[ 135.218548] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b8
[ 135.218556] RSP: 002b:00007ffcaafd26c8 EFLAGS: 00000207 ORIG_RAX: 000000000000001c
[ 135.218562] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f032763ee5d
[ 135.218569] RDX: 0000000000000064 RSI: 0000000000004000 RDI: 0000000020e7f000
[ 135.218573] RBP: 00007ffcaafd26d0 R08: 00007ffcaafd2140 R09: 00007ffcaafd2700
[ 135.218578] R10: 0000000000000000 R11: 0000000000000207 R12: 00007ffcaafd2828
[ 135.218582] R13: 00000000004016da R14: 0000000000403e08 R15: 00007f0327a4e000
[ 135.218594] </TASK>
[ 135.228272] ------------[ cut here ]------------
[ 135.228541] DEBUG_RWSEMS_WARN_ON((rwsem_owner(sem) != current) && !rwsem_test_oflags(sem, RWSEM_NONSPINNABLE)): count = 0x0, magy
[ 135.229541] WARNING: CPU: 1 PID: 680 at kernel/locking/rwsem.c:1367 up_write+0x451/0x550
[ 135.229997] Modules linked in:
[ 135.230182] CPU: 1 UID: 0 PID: 680 Comm: repro Tainted: G W 6.14.0-rc2-next-20250210-df5d6180169a #1
[ 135.230758] Tainted: [W]=WARN
[ 135.230940] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 135.231565] RIP: 0010:up_write+0x451/0x550
[ 135.231806] Code: ea 03 80 3c 02 00 0f 85 d5 00 00 00 49 8b 14 24 53 4d 89 f9 4c 89 f1 48 c7 c6 40 bd eb 85 48 c7 c7 60 bc eb 858
[ 135.232814] RSP: 0018:ffff888022647e30 EFLAGS: 00010282
[ 135.233113] RAX: 0000000000000000 RBX: ffffffff85ebbba0 RCX: ffffffff8146cfd3
[ 135.233528] RDX: ffff88802264a540 RSI: ffffffff8146cfe0 RDI: 0000000000000001
[ 135.233929] RBP: ffff888022647e78 R08: 0000000000000001 R09: ffffed10044c8f66
[ 135.234325] R10: 0000000000000001 R11: 57525f4755424544 R12: ffff8880211162f0
[ 135.234722] R13: ffff8880211162f8 R14: ffff8880211162f0 R15: ffff88802264a540
[ 135.235122] FS: 00007f0327a07600(0000) GS:ffff88806c500000(0000) knlGS:0000000000000000
[ 135.235584] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 135.235912] CR2: 00007f032773e7f0 CR3: 0000000021bd4003 CR4: 0000000000770ef0
[ 135.236314] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 135.236713] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 135.237114] PKRU: 55555554
[ 135.237276] Call Trace:
[ 135.237445] <TASK>
[ 135.237580] ? show_regs+0x6d/0x80
[ 135.237789] ? __warn+0xf3/0x390
[ 135.237987] ? find_bug+0x310/0x490
[ 135.238199] ? up_write+0x451/0x550
[ 135.238412] ? report_bug+0x2cb/0x4b0
[ 135.238636] ? up_write+0x451/0x550
[ 135.238853] ? up_write+0x452/0x550
[ 135.239065] ? handle_bug+0xf1/0x190
[ 135.239285] ? exc_invalid_op+0x3c/0x80
[ 135.239515] ? asm_exc_invalid_op+0x1f/0x30
[ 135.239768] ? __warn_printk+0x173/0x2e0
[ 135.240001] ? __warn_printk+0x180/0x2e0
[ 135.240235] ? up_write+0x451/0x550
[ 135.240445] ? madvise_unlock+0x17e/0x1a0
[ 135.240686] madvise_unlock+0xd4/0x1a0
[ 135.240913] do_madvise+0x14f/0x1a0
[ 135.241126] __x64_sys_madvise+0xb2/0x120
[ 135.241364] ? syscall_trace_enter+0x14f/0x280
[ 135.241647] x64_sys_call+0x19b1/0x2150
[ 135.241887] do_syscall_64+0x6d/0x140
[ 135.242113] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 135.242407] RIP: 0033:0x7f032763ee5d
[ 135.242619] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b8
[ 135.243646] RSP: 002b:00007ffcaafd26c8 EFLAGS: 00000207 ORIG_RAX: 000000000000001c
[ 135.244075] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f032763ee5d
[ 135.244476] RDX: 0000000000000064 RSI: 0000000000004000 RDI: 0000000020e7f000
[ 135.244877] RBP: 00007ffcaafd26d0 R08: 00007ffcaafd2140 R09: 00007ffcaafd2700
[ 135.245279] R10: 0000000000000000 R11: 0000000000000207 R12: 00007ffcaafd2828
[ 135.245696] R13: 00000000004016da R14: 0000000000403e08 R15: 00007f0327a4e000
[ 135.246106] </TASK>
[ 135.246242] irq event stamp: 857
[ 135.246434] hardirqs last enabled at (857): [<ffffffff81663b55>] __up_console_sem+0x95/0xb0
[ 135.246915] hardirqs last disabled at (856): [<ffffffff81663b3a>] __up_console_sem+0x7a/0xb0
[ 135.247392] softirqs last enabled at (628): [<ffffffff8148c40e>] __irq_exit_rcu+0x10e/0x170
[ 135.247871] softirqs last disabled at (623): [<ffffffff8148c40e>] __irq_exit_rcu+0x10e/0x170
[ 135.248347] ---[ end trace 0000000000000000 ]---
"
Hope this cound be insightful to you.
Regards,
Yi Lai
---
If you don't need the following environment to reproduce the problem or if you
already have one reproduced environment, please ignore the following information.
How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0
// start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
// You could change the bzImage_xxx as you want
// Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost
After login vm(virtual machine) successfully, you could transfer reproduced
binary to the vm by below way, and reproduce the problem in vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/
Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage //x should equal or less than cpu num your pc has
Fill the bzImage file into above start3.sh to load the target kernel in vm.
Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install
> --
> 2.39.5
^ permalink raw reply [flat|nested] 17+ messages in thread* Re: [PATCH 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
2025-02-11 5:30 ` Lai, Yi
@ 2025-02-11 6:37 ` SeongJae Park
2025-02-11 10:34 ` Lorenzo Stoakes
1 sibling, 0 replies; 17+ messages in thread
From: SeongJae Park @ 2025-02-11 6:37 UTC (permalink / raw)
To: Lai, Yi
Cc: SeongJae Park, Andrew Morton, Liam R. Howlett, David Hildenbrand,
Davidlohr Bueso, Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka,
linux-kernel, linux-mm, yi1.lai
Hello Lai,
On Tue, 11 Feb 2025 13:30:49 +0800 "Lai, Yi" <yi1.lai@linux.intel.com> wrote:
[...]
> Hi SeongJae Park,
>
> Greetings!
>
> I used Syzkaller and found that there is WARNING in madvise_unlock in linux-next tag - next-20250210.
Thank you so much for this nice report! I just sent a fix:
https://lore.kernel.org/20250211063201.5106-1-sj@kernel.org
>
> After bisection and the first bad commit is:
> "
> ec68fbd9e99f mm/madvise: remove redundant mmap_lock operations from process_madvise()
> "
Nonetheless, I think the real first bad commit is f19c9d7b57cf ("mm/madvise:
split out madvise() behavior execution"). I confirmed I can reproduce the
issue using your reproducer on the commit. And I think the fix may better to
be squashed into an earlier commit, 948a0a9ea070 ("mm/madvise: split out mmap
locking operations for madvise()"). Please refer to the fix for details about
why I think so, and let me know if anything seems wrong.
Thanks,
SJ
[...]
^ permalink raw reply [flat|nested] 17+ messages in thread* Re: [PATCH 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
2025-02-11 5:30 ` Lai, Yi
2025-02-11 6:37 ` SeongJae Park
@ 2025-02-11 10:34 ` Lorenzo Stoakes
2025-02-11 18:32 ` SeongJae Park
1 sibling, 1 reply; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-02-11 10:34 UTC (permalink / raw)
To: Lai, Yi
Cc: SeongJae Park, Andrew Morton, Liam R. Howlett, David Hildenbrand,
Davidlohr Bueso, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm, yi1.lai, Naresh Kamboju, Arnd Bergmann
+cc Naresh, as [0] does not appear to be in lore which seems to be having issues
(meaning I can't reply to that thread at all), also +cc Arnd for his reply in [1].
[0]:https://lwn.net/ml/linux-mm/CA+G9fYt5QwJ4_F8fJj7jx9_0Le9kOVSeG38ox9qnKqwsrDdvHQ@mail.gmail.com/
[1]:https://lwn.net/ml/linux-mm/fa1a7a10-f892-4e7e-acb4-0b058aa53d88@app.fastmail.com/
The report here from Yi Lai (thanks for that!) appears to be the same thing.
SJ - I think the issue is we're unconditionally trying to unlock even in the
case of MADV_HWPOISON, MADV_SOFT_OFFLINE. E.g.:
static int madvise_lock(struct mm_struct *mm, int behavior)
{
#ifdef CONFIG_MEMORY_FAILURE
if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
return 0;
#endif
...
}
But on unlock...
static void madvise_unlock(struct mm_struct *mm, int behavior)
{
if (madvise_need_mmap_write(behavior))
mmap_write_unlock(mm);
else
mmap_read_unlock(mm);
}
So we just need to insert an equivalent line (ideally abstracted to a helper) on
unlock also.
OK having just typed that I realise SJ sent a fix already. Lore being broken
means I can't link to it. Sigh.
Will reply there.
On Tue, Feb 11, 2025 at 01:30:49PM +0800, Lai, Yi wrote:
> On Wed, Feb 05, 2025 at 10:15:17PM -0800, SeongJae Park wrote:
> > Optimize redundant mmap lock operations from process_madvise() by
> > directly doing the mmap locking first, and then the remaining works for
> > all ranges in the loop.
> >
> > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> > mm/madvise.c | 26 ++++++++++++++++++++++++--
> > 1 file changed, 24 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 31e5df75b926..5a0a1fc99d27 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1754,9 +1754,26 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> >
> > total_len = iov_iter_count(iter);
> >
> > + ret = madvise_lock(mm, behavior);
> > + if (ret)
> > + return ret;
> > +
> > while (iov_iter_count(iter)) {
> > - ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
> > - iter_iov_len(iter), behavior);
> > + unsigned long start = (unsigned long)iter_iov_addr(iter);
> > + size_t len_in = iter_iov_len(iter);
> > + size_t len;
> > +
> > + if (!is_valid_madvise(start, len_in, behavior)) {
> > + ret = -EINVAL;
> > + break;
> > + }
> > +
> > + len = PAGE_ALIGN(len_in);
> > + if (start + len == start)
> > + ret = 0;
> > + else
> > + ret = madvise_do_behavior(mm, start, len_in, len,
> > + behavior);
> > /*
> > * An madvise operation is attempting to restart the syscall,
> > * but we cannot proceed as it would not be correct to repeat
> > @@ -1772,12 +1789,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> > ret = -EINTR;
> > break;
> > }
> > +
> > + /* Drop and reacquire lock to unwind race. */
> > + madvise_unlock(mm, behavior);
> > + madvise_lock(mm, behavior);
> > continue;
> > }
> > if (ret < 0)
> > break;
> > iov_iter_advance(iter, iter_iov_len(iter));
> > }
> > + madvise_unlock(mm, behavior);
> >
> > ret = (total_len - iov_iter_count(iter)) ? : ret;
> >
>
> Hi SeongJae Park,
>
> Greetings!
>
> I used Syzkaller and found that there is WARNING in madvise_unlock in linux-next tag - next-20250210.
>
> After bisection and the first bad commit is:
> "
> ec68fbd9e99f mm/madvise: remove redundant mmap_lock operations from process_madvise()
> "
>
> All detailed into can be found at:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock
> Syzkaller repro code:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock/repro.c
> Syzkaller repro syscall steps:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock/repro.prog
> Syzkaller report:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock/repro.report
> Kconfig(make olddefconfig):
> https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock/kconfig_origin
> Bisect info:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250210_144836_madvise_unlock/bisect_info.log
> bzImage:
> https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250210_144836_madvise_unlock/bzImage_df5d6180169ae06a2eac57e33b077ad6f6252440
> Issue dmesg:
> https://github.com/laifryiee/syzkaller_logs/blob/main/250210_144836_madvise_unlock/df5d6180169ae06a2eac57e33b077ad6f6252440_dmesg.log
>
> "
> [ 135.191347] Injecting memory failure for pfn 0x8ea0 at process virtual address 0x20e7f000
> [ 135.194964] Memory failure: 0x8ea0: recovery action for reserved kernel page: Ignored
> [ 135.195584] ------------[ cut here ]------------
> [ 135.195863] WARNING: CPU: 1 PID: 680 at ./include/linux/rwsem.h:203 madvise_unlock+0x17e/0x1a0
> [ 135.196395] Modules linked in:
> [ 135.196612] CPU: 1 UID: 0 PID: 680 Comm: repro Not tainted 6.14.0-rc2-next-20250210-df5d6180169a #1
> [ 135.197135] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [ 135.197818] RIP: 0010:madvise_unlock+0x17e/0x1a0
> [ 135.198108] Code: a1 9f ff 31 f6 49 8d bc 24 e0 01 00 00 e8 fa 80 d5 03 31 ff 89 c3 89 c6 e8 9f 9b 9f ff 85 db 0f 85 1c ff ff ffe
> [ 135.199154] RSP: 0018:ffff888022647e88 EFLAGS: 00010293
> [ 135.199468] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff81e878f1
> [ 135.199876] RDX: ffff88802264a540 RSI: ffffffff81e878fe RDI: 0000000000000005
> [ 135.200285] RBP: ffff888022647ea0 R08: 0000000000000001 R09: ffffed10044c8f25
> [ 135.200692] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888021116180
> [ 135.201100] R13: ffff8880211162f0 R14: 0000000000004000 R15: ffff888021116180
> [ 135.201531] FS: 00007f0327a07600(0000) GS:ffff88806c500000(0000) knlGS:0000000000000000
> [ 135.201996] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 135.202329] CR2: 00007f032773e7f0 CR3: 0000000021bd4003 CR4: 0000000000770ef0
> [ 135.202737] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 135.203155] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [ 135.203563] PKRU: 55555554
> [ 135.203731] Call Trace:
> [ 135.203883] <TASK>
> [ 135.204020] ? show_regs+0x6d/0x80
> [ 135.204240] ? __warn+0xf3/0x390
> [ 135.204447] ? report_bug+0x25e/0x4b0
> [ 135.204692] ? madvise_unlock+0x17e/0x1a0
> [ 135.204937] ? report_bug+0x2cb/0x4b0
> [ 135.205167] ? madvise_unlock+0x17e/0x1a0
> [ 135.205413] ? madvise_unlock+0x17f/0x1a0
> [ 135.205706] ? handle_bug+0xf1/0x190
> [ 135.206188] ? exc_invalid_op+0x3c/0x80
> [ 135.206423] ? asm_exc_invalid_op+0x1f/0x30
> [ 135.206678] ? madvise_unlock+0x171/0x1a0
> [ 135.206919] ? madvise_unlock+0x17e/0x1a0
> [ 135.207156] ? madvise_unlock+0x17e/0x1a0
> [ 135.207388] ? madvise_unlock+0x17e/0x1a0
> [ 135.207623] do_madvise+0x14f/0x1a0
> [ 135.207836] __x64_sys_madvise+0xb2/0x120
> [ 135.208067] ? syscall_trace_enter+0x14f/0x280
> [ 135.208328] x64_sys_call+0x19b1/0x2150
> [ 135.208552] do_syscall_64+0x6d/0x140
> [ 135.208766] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 135.209050] RIP: 0033:0x7f032763ee5d
> [ 135.209261] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b8
> [ 135.210261] RSP: 002b:00007ffcaafd26c8 EFLAGS: 00000207 ORIG_RAX: 000000000000001c
> [ 135.210678] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f032763ee5d
> [ 135.211069] RDX: 0000000000000064 RSI: 0000000000004000 RDI: 0000000020e7f000
> [ 135.211458] RBP: 00007ffcaafd26d0 R08: 00007ffcaafd2140 R09: 00007ffcaafd2700
> [ 135.211851] R10: 0000000000000000 R11: 0000000000000207 R12: 00007ffcaafd2828
> [ 135.212248] R13: 00000000004016da R14: 0000000000403e08 R15: 00007f0327a4e000
> [ 135.212656] </TASK>
> [ 135.212794] irq event stamp: 807
> [ 135.212987] hardirqs last enabled at (815): [<ffffffff81663b55>] __up_console_sem+0x95/0xb0
> [ 135.213477] hardirqs last disabled at (824): [<ffffffff81663b3a>] __up_console_sem+0x7a/0xb0
> [ 135.213951] softirqs last enabled at (628): [<ffffffff8148c40e>] __irq_exit_rcu+0x10e/0x170
> [ 135.214426] softirqs last disabled at (623): [<ffffffff8148c40e>] __irq_exit_rcu+0x10e/0x170
> [ 135.214910] ---[ end trace 0000000000000000 ]---
> [ 135.215182]
> [ 135.215283] =====================================
> [ 135.215550] WARNING: bad unlock balance detected!
> [ 135.215817] 6.14.0-rc2-next-20250210-df5d6180169a #1 Tainted: G W
> [ 135.216234] -------------------------------------
> [ 135.216502] repro/680 is trying to release lock (&mm->mmap_lock) at:
> [ 135.216863] [<ffffffff81e87854>] madvise_unlock+0xd4/0x1a0
> [ 135.217179] but there are no more locks to release!
> [ 135.217457]
> [ 135.217457] other info that might help us debug this:
> [ 135.217819] no locks held by repro/680.
> [ 135.218046]
> [ 135.218046] stack backtrace:
> [ 135.218296] CPU: 1 UID: 0 PID: 680 Comm: repro Tainted: G W 6.14.0-rc2-next-20250210-df5d6180169a #1
> [ 135.218308] Tainted: [W]=WARN
> [ 135.218310] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [ 135.218315] Call Trace:
> [ 135.218317] <TASK>
> [ 135.218320] dump_stack_lvl+0xea/0x150
> [ 135.218330] ? madvise_unlock+0xd4/0x1a0
> [ 135.218341] dump_stack+0x19/0x20
> [ 135.218349] print_unlock_imbalance_bug+0x1b5/0x200
> [ 135.218368] ? madvise_unlock+0xd4/0x1a0
> [ 135.218379] lock_release+0x5bc/0x870
> [ 135.218386] ? madvise_unlock+0x17f/0x1a0
> [ 135.218397] ? handle_bug+0xf1/0x190
> [ 135.218407] ? __pfx_lock_release+0x10/0x10
> [ 135.218415] ? exc_invalid_op+0x3c/0x80
> [ 135.218426] ? asm_exc_invalid_op+0x1f/0x30
> [ 135.218441] up_write+0x31/0x550
> [ 135.218449] ? madvise_unlock+0x17e/0x1a0
> [ 135.218462] madvise_unlock+0xd4/0x1a0
> [ 135.218474] do_madvise+0x14f/0x1a0
> [ 135.218487] __x64_sys_madvise+0xb2/0x120
> [ 135.218500] ? syscall_trace_enter+0x14f/0x280
> [ 135.218511] x64_sys_call+0x19b1/0x2150
> [ 135.218522] do_syscall_64+0x6d/0x140
> [ 135.218531] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 135.218542] RIP: 0033:0x7f032763ee5d
> [ 135.218548] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b8
> [ 135.218556] RSP: 002b:00007ffcaafd26c8 EFLAGS: 00000207 ORIG_RAX: 000000000000001c
> [ 135.218562] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f032763ee5d
> [ 135.218569] RDX: 0000000000000064 RSI: 0000000000004000 RDI: 0000000020e7f000
> [ 135.218573] RBP: 00007ffcaafd26d0 R08: 00007ffcaafd2140 R09: 00007ffcaafd2700
> [ 135.218578] R10: 0000000000000000 R11: 0000000000000207 R12: 00007ffcaafd2828
> [ 135.218582] R13: 00000000004016da R14: 0000000000403e08 R15: 00007f0327a4e000
> [ 135.218594] </TASK>
> [ 135.228272] ------------[ cut here ]------------
> [ 135.228541] DEBUG_RWSEMS_WARN_ON((rwsem_owner(sem) != current) && !rwsem_test_oflags(sem, RWSEM_NONSPINNABLE)): count = 0x0, magy
> [ 135.229541] WARNING: CPU: 1 PID: 680 at kernel/locking/rwsem.c:1367 up_write+0x451/0x550
> [ 135.229997] Modules linked in:
> [ 135.230182] CPU: 1 UID: 0 PID: 680 Comm: repro Tainted: G W 6.14.0-rc2-next-20250210-df5d6180169a #1
> [ 135.230758] Tainted: [W]=WARN
> [ 135.230940] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [ 135.231565] RIP: 0010:up_write+0x451/0x550
> [ 135.231806] Code: ea 03 80 3c 02 00 0f 85 d5 00 00 00 49 8b 14 24 53 4d 89 f9 4c 89 f1 48 c7 c6 40 bd eb 85 48 c7 c7 60 bc eb 858
> [ 135.232814] RSP: 0018:ffff888022647e30 EFLAGS: 00010282
> [ 135.233113] RAX: 0000000000000000 RBX: ffffffff85ebbba0 RCX: ffffffff8146cfd3
> [ 135.233528] RDX: ffff88802264a540 RSI: ffffffff8146cfe0 RDI: 0000000000000001
> [ 135.233929] RBP: ffff888022647e78 R08: 0000000000000001 R09: ffffed10044c8f66
> [ 135.234325] R10: 0000000000000001 R11: 57525f4755424544 R12: ffff8880211162f0
> [ 135.234722] R13: ffff8880211162f8 R14: ffff8880211162f0 R15: ffff88802264a540
> [ 135.235122] FS: 00007f0327a07600(0000) GS:ffff88806c500000(0000) knlGS:0000000000000000
> [ 135.235584] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 135.235912] CR2: 00007f032773e7f0 CR3: 0000000021bd4003 CR4: 0000000000770ef0
> [ 135.236314] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 135.236713] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [ 135.237114] PKRU: 55555554
> [ 135.237276] Call Trace:
> [ 135.237445] <TASK>
> [ 135.237580] ? show_regs+0x6d/0x80
> [ 135.237789] ? __warn+0xf3/0x390
> [ 135.237987] ? find_bug+0x310/0x490
> [ 135.238199] ? up_write+0x451/0x550
> [ 135.238412] ? report_bug+0x2cb/0x4b0
> [ 135.238636] ? up_write+0x451/0x550
> [ 135.238853] ? up_write+0x452/0x550
> [ 135.239065] ? handle_bug+0xf1/0x190
> [ 135.239285] ? exc_invalid_op+0x3c/0x80
> [ 135.239515] ? asm_exc_invalid_op+0x1f/0x30
> [ 135.239768] ? __warn_printk+0x173/0x2e0
> [ 135.240001] ? __warn_printk+0x180/0x2e0
> [ 135.240235] ? up_write+0x451/0x550
> [ 135.240445] ? madvise_unlock+0x17e/0x1a0
> [ 135.240686] madvise_unlock+0xd4/0x1a0
> [ 135.240913] do_madvise+0x14f/0x1a0
> [ 135.241126] __x64_sys_madvise+0xb2/0x120
> [ 135.241364] ? syscall_trace_enter+0x14f/0x280
> [ 135.241647] x64_sys_call+0x19b1/0x2150
> [ 135.241887] do_syscall_64+0x6d/0x140
> [ 135.242113] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 135.242407] RIP: 0033:0x7f032763ee5d
> [ 135.242619] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b8
> [ 135.243646] RSP: 002b:00007ffcaafd26c8 EFLAGS: 00000207 ORIG_RAX: 000000000000001c
> [ 135.244075] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f032763ee5d
> [ 135.244476] RDX: 0000000000000064 RSI: 0000000000004000 RDI: 0000000020e7f000
> [ 135.244877] RBP: 00007ffcaafd26d0 R08: 00007ffcaafd2140 R09: 00007ffcaafd2700
> [ 135.245279] R10: 0000000000000000 R11: 0000000000000207 R12: 00007ffcaafd2828
> [ 135.245696] R13: 00000000004016da R14: 0000000000403e08 R15: 00007f0327a4e000
> [ 135.246106] </TASK>
> [ 135.246242] irq event stamp: 857
> [ 135.246434] hardirqs last enabled at (857): [<ffffffff81663b55>] __up_console_sem+0x95/0xb0
> [ 135.246915] hardirqs last disabled at (856): [<ffffffff81663b3a>] __up_console_sem+0x7a/0xb0
> [ 135.247392] softirqs last enabled at (628): [<ffffffff8148c40e>] __irq_exit_rcu+0x10e/0x170
> [ 135.247871] softirqs last disabled at (623): [<ffffffff8148c40e>] __irq_exit_rcu+0x10e/0x170
> [ 135.248347] ---[ end trace 0000000000000000 ]---
> "
>
> Hope this cound be insightful to you.
>
> Regards,
> Yi Lai
>
> ---
>
> If you don't need the following environment to reproduce the problem or if you
> already have one reproduced environment, please ignore the following information.
>
> How to reproduce:
> git clone https://gitlab.com/xupengfe/repro_vm_env.git
> cd repro_vm_env
> tar -xvf repro_vm_env.tar.gz
> cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0
> // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
> // You could change the bzImage_xxx as you want
> // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
> You could use below command to log in, there is no password for root.
> ssh -p 10023 root@localhost
>
> After login vm(virtual machine) successfully, you could transfer reproduced
> binary to the vm by below way, and reproduce the problem in vm:
> gcc -pthread -o repro repro.c
> scp -P 10023 repro root@localhost:/root/
>
> Get the bzImage for target kernel:
> Please use target kconfig and copy it to kernel_src/.config
> make olddefconfig
> make -jx bzImage //x should equal or less than cpu num your pc has
>
> Fill the bzImage file into above start3.sh to load the target kernel in vm.
>
>
> Tips:
> If you already have qemu-system-x86_64, please ignore below info.
> If you want to install qemu v7.1.0 version:
> git clone https://github.com/qemu/qemu.git
> cd qemu
> git checkout -f v7.1.0
> mkdir build
> cd build
> yum install -y ninja-build.x86_64
> yum -y install libslirp-devel.x86_64
> ../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
> make
> make install
>
> > --
> > 2.39.5
^ permalink raw reply [flat|nested] 17+ messages in thread* Re: [PATCH 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
2025-02-11 10:34 ` Lorenzo Stoakes
@ 2025-02-11 18:32 ` SeongJae Park
0 siblings, 0 replies; 17+ messages in thread
From: SeongJae Park @ 2025-02-11 18:32 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, Lai, Yi, Andrew Morton, Liam R. Howlett,
David Hildenbrand, Davidlohr Bueso, Shakeel Butt,
Vlastimil Babka, linux-kernel, linux-mm, yi1.lai, Naresh Kamboju,
Arnd Bergmann
On Tue, 11 Feb 2025 10:34:19 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> +cc Naresh, as [0] does not appear to be in lore which seems to be having issues
> (meaning I can't reply to that thread at all), also +cc Arnd for his reply in [1].
>
> [0]:https://lwn.net/ml/linux-mm/CA+G9fYt5QwJ4_F8fJj7jx9_0Le9kOVSeG38ox9qnKqwsrDdvHQ@mail.gmail.com/
> [1]:https://lwn.net/ml/linux-mm/fa1a7a10-f892-4e7e-acb4-0b058aa53d88@app.fastmail.com/
>
> The report here from Yi Lai (thanks for that!) appears to be the same thing.
[...]
> OK having just typed that I realise SJ sent a fix already. Lore being broken
> means I can't link to it. Sigh.
>
> Will reply there.
No worry, and yes, let's continue discussion on the thread. Seems lore is also
working again, so for others' reference, the thread is at
https://lore.kernel.org/20250211063201.5106-1-sj@kernel.org
Thanks,
SJ
[...]
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
2025-02-06 6:15 [PATCH 0/4] mm/madvise: remove redundant mmap_lock operations from process_madvise() SeongJae Park
` (3 preceding siblings ...)
2025-02-06 6:15 ` [PATCH 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise() SeongJae Park
@ 2025-02-11 8:48 ` Vern Hao
2025-02-11 18:28 ` SeongJae Park
4 siblings, 1 reply; 17+ messages in thread
From: Vern Hao @ 2025-02-11 8:48 UTC (permalink / raw)
To: SeongJae Park, Andrew Morton
Cc: Liam R. Howlett, David Hildenbrand, Davidlohr Bueso,
Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka, linux-kernel,
linux-mm
[-- Attachment #1: Type: text/plain, Size: 4828 bytes --]
On 2025/2/6 14:15, SeongJae Park wrote:
> process_madvise() calls do_madvise() for each address range. Then, each
> do_madvise() invocation holds and releases same mmap_lock. Optimize the
> redundant lock operations by splitting do_madvise() internal logics
> including the mmap_lock operations, and calling the small logics
> directly from process_madvise() in a sequence that removes the redundant
> locking. As a result of this change, process_madvise() becomes more
> efficient and less racy in terms of its results and latency.
>
> Note that the potential downside of this series is that other mmap_lock
> holders may take more time due to the increased length of mmap_lock
> critical section for process_madvise() calls. But there is maximum
> limit in the kernel space (IOV_MAX), and the user space can control the
> critical section length by setting the request size. Hence, the
> downside would be limited and controllable.
>
> Evaluation
> ==========
>
> I measured the time to apply MADV_DONTNEED advice to 256 MiB memory
> using multiple madvise() calls, 4 KiB per each call. I also do the same
> with process_madvise(), but with varying batch size (vlen) from 1 to
> 1024. The source code for the measurement is available at GitHub[1].
> Because the microbenchmark result is not that stable, I ran each
> configuration five times and use the average.
>
> The measurement results are as below. 'sz_batches' column shows the
> batch size of process_madvise() calls. '0' batch size is for madvise()
> calls case.
Hi, i just wonder why these patches can reduce latency time on call
madvise() DONT_NEED.
> 'before' and 'after' columns are the measured time to apply
> MADV_DONTNEED to the 256 MiB memory buffer in nanoseconds, on kernels
> that built without and with the last patch of this series, respectively.
> So lower value means better efficiency. 'after/before' column is the
> ratio of 'after' to 'before'.
>
> sz_batches before after after/before
> 0 146294215.2 121280536.2 0.829017989769427
> 1 165851018.8 136305598.2 0.821855658085351
> 2 129469321.2 103740383.6 0.801273866569094
> 4 110369232.4 87835896.2 0.795836795182785
> 8 102906232.4 77420920.2 0.752344327397609
> 16 97551017.4 74959714.4 0.768415506038587
> 32 94809848.2 71200848.4 0.750985786305689
> 64 96087575.6 72593180 0.755489765942227
> 128 96154163.8 68517055.4 0.712575022154163
> 256 92901257.6 69054216.6 0.743307662177439
> 512 93646170.8 67053296.2 0.716028168874151
> 1024 92663219.2 70168196.8 0.75723892830177
>
> In despite of the unstable nature of the tet program, the trend is
> somewhat we can expect. The measurement shows this patch reduces the
> process_madvise() latency, proportional to the batching size. The
> latency gain was about 20% with the batch size 2, and it has increased
> to about 28% with the batch size 512, since more number of mmap locking
> is reduced with larger batch size.
>
> Note that the standard devitation of the measurements for each
> sz_batches configuration was ranging from 1.9% to 7.2%. That is, this
> result is still not very stable. The average of the standard deviations
> for different batch sizes were 4.62% and 4.70% for the 'before' and
> 'after' kernel measurements.
>
> Also note that this patch has somehow decreased latencies of madvise()
> and single batch size process_madvise(). Seems this code path is small
> enough to significantly be affected by compiler optimizations including
> inlining of split-out functions. Please focus on only the improvement
> amount that changed by the batch size.
>
> Changelog
> =========
>
> Changes from RFC v2
> (https://lore.kernel.org/20250117013058.1843-1-sj@kernel.org)
> - Release and acquire mmap lock again when a race-caused failure happens
> (Lorenzo Stoakes)
> - Collected Reviewed-by: tags from Shakeel, Lorenzo and Davidlohr.
>
> Changes from RFC v1
> (https://lore.kernel.org/20250111004618.1566-1-sj@kernel.org)
> - Split out do_madvise() and use those from vector_madvise(), instead of
> adding a flag to do_madvise() (Liam R. Howlett)
>
> [1]https://github.com/sjp38/eval_proc_madvise
>
> SeongJae Park (4):
> mm/madvise: split out mmap locking operations for madvise()
> mm/madvise: split out madvise input validity check
> mm/madvise: split out madvise() behavior execution
> mm/madvise: remove redundant mmap_lock operations from
> process_madvise()
>
> mm/madvise.c | 154 +++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 107 insertions(+), 47 deletions(-)
>
>
> base-commit: f104b8534d19f31443a4fe6cb701bdb15fd931eb
[-- Attachment #2: Type: text/html, Size: 5484 bytes --]
^ permalink raw reply [flat|nested] 17+ messages in thread* Re: [PATCH 0/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
2025-02-11 8:48 ` [PATCH 0/4] " Vern Hao
@ 2025-02-11 18:28 ` SeongJae Park
0 siblings, 0 replies; 17+ messages in thread
From: SeongJae Park @ 2025-02-11 18:28 UTC (permalink / raw)
To: Vern Hao
Cc: SeongJae Park, Andrew Morton, Liam R. Howlett, David Hildenbrand,
Davidlohr Bueso, Lorenzo Stoakes, Shakeel Butt, Vlastimil Babka,
linux-kernel, linux-mm
Hi Vern,
On Tue, 11 Feb 2025 16:48:06 +0800 Vern Hao <haoxing990@gmail.com> wrote:
>
> On 2025/2/6 14:15, SeongJae Park wrote:
> > process_madvise() calls do_madvise() for each address range. Then, each
> > do_madvise() invocation holds and releases same mmap_lock. Optimize the
> > redundant lock operations by splitting do_madvise() internal logics
> > including the mmap_lock operations, and calling the small logics
> > directly from process_madvise() in a sequence that removes the redundant
> > locking. As a result of this change, process_madvise() becomes more
> > efficient and less racy in terms of its results and latency.
[...]
> >
> > Evaluation
> > ==========
> >
[...]
> > The measurement results are as below. 'sz_batches' column shows the
> > batch size of process_madvise() calls. '0' batch size is for madvise()
> > calls case.
> Hi, i just wonder why these patches can reduce latency time on call
> madvise() DONT_NEED.
Thank you for asking this!
> > 'before' and 'after' columns are the measured time to apply
> > MADV_DONTNEED to the 256 MiB memory buffer in nanoseconds, on kernels
> > that built without and with the last patch of this series, respectively.
> > So lower value means better efficiency. 'after/before' column is the
> > ratio of 'after' to 'before'.
> >
> > sz_batches before after after/before
> > 0 146294215.2 121280536.2 0.829017989769427
> > 1 165851018.8 136305598.2 0.821855658085351
> > 2 129469321.2 103740383.6 0.801273866569094
> > 4 110369232.4 87835896.2 0.795836795182785
> > 8 102906232.4 77420920.2 0.752344327397609
> > 16 97551017.4 74959714.4 0.768415506038587
> > 32 94809848.2 71200848.4 0.750985786305689
> > 64 96087575.6 72593180 0.755489765942227
> > 128 96154163.8 68517055.4 0.712575022154163
> > 256 92901257.6 69054216.6 0.743307662177439
> > 512 93646170.8 67053296.2 0.716028168874151
> > 1024 92663219.2 70168196.8 0.75723892830177
[...]
> > Also note that this patch has somehow decreased latencies of madvise()
> > and single batch size process_madvise(). Seems this code path is small
> > enough to significantly be affected by compiler optimizations including
> > inlining of split-out functions. Please focus on only the improvement
> > amount that changed by the batch size.
I believe the above paragraph may answer your question. Please let me know if
not.
Thanks,
SJ
[...]
^ permalink raw reply [flat|nested] 17+ messages in thread