* [PATCH v2] fix AMDGPU failure with periodic signal
@ 2025-11-07 17:48 Mikulas Patocka
2026-01-02 19:02 ` Lorenzo Stoakes
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Mikulas Patocka @ 2025-11-07 17:48 UTC (permalink / raw)
To: Alex Deucher, Christian König, Andrew Morton, David Hildenbrand
Cc: amd-gfx, linux-mm
If a process sets up a timer that periodically sends a signal in short
intervals and if it uses OpenCL on AMDGPU at the same time, we get random
errors. Sometimes, probing the OpenCL device fails (strace shows that
open("/dev/kfd") failed with -EINTR). Sometimes we get the message
"amdgpu: init_user_pages: Failed to register MMU notifier: -4" in the
syslog.
The bug can be reproduced with this program:
http://www.jikos.cz/~mikulas/testcases/opencl/opencl-bug-small.c
The root cause for these failures is in the function mm_take_all_locks.
This function fails with -EINTR if there is a pending signal. The -EINTR is
propagated up the call stack to userspace and userspace fails if it gets
this error.
There is the following call chain: kfd_open -> kfd_create_process ->
create_process -> mmu_notifier_get -> mmu_notifier_get_locked ->
__mmu_notifier_register -> mm_take_all_locks -> "return -EINTR"
If the failure happens in init_user_pages, there is the following call
chain: init_user_pages -> amdgpu_hmm_register ->
mmu_interval_notifier_insert -> mmu_notifier_register ->
__mmu_notifier_register -> mm_take_all_locks -> "return -EINTR"
In order to fix these failures, this commit changes
signal_pending(current) to fatal_signal_pending(current) in
mm_take_all_locks, so that it is interrupted only if the signal is
actually killing the process.
Also, this commit skips pr_err in init_user_pages if the process is being
killed - in this situation, there was no error and so we don't want to
report it in the syslog.
I'm submitting this patch for the stable kernels, because this bug may
cause random failures in any OpenCL code.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 9 +++++++--
mm/vma.c | 8 ++++----
2 files changed, 11 insertions(+), 6 deletions(-)
Index: linux-6.17.7/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
===================================================================
--- linux-6.17.7.orig/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ linux-6.17.7/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1069,8 +1069,13 @@ static int init_user_pages(struct kgd_me
ret = amdgpu_hmm_register(bo, user_addr);
if (ret) {
- pr_err("%s: Failed to register MMU notifier: %d\n",
- __func__, ret);
+ /*
+ * If we got EINTR because the process was killed, don't report
+ * it, because no error happened.
+ */
+ if (!(fatal_signal_pending(current) && ret == -EINTR))
+ pr_err("%s: Failed to register MMU notifier: %d\n",
+ __func__, ret);
goto out;
}
Index: linux-6.17.7/mm/vma.c
===================================================================
--- linux-6.17.7.orig/mm/vma.c
+++ linux-6.17.7/mm/vma.c
@@ -2175,14 +2175,14 @@ int mm_take_all_locks(struct mm_struct *
* is reached.
*/
for_each_vma(vmi, vma) {
- if (signal_pending(current))
+ if (fatal_signal_pending(current))
goto out_unlock;
vma_start_write(vma);
}
vma_iter_init(&vmi, mm, 0);
for_each_vma(vmi, vma) {
- if (signal_pending(current))
+ if (fatal_signal_pending(current))
goto out_unlock;
if (vma->vm_file && vma->vm_file->f_mapping &&
is_vm_hugetlb_page(vma))
@@ -2191,7 +2191,7 @@ int mm_take_all_locks(struct mm_struct *
vma_iter_init(&vmi, mm, 0);
for_each_vma(vmi, vma) {
- if (signal_pending(current))
+ if (fatal_signal_pending(current))
goto out_unlock;
if (vma->vm_file && vma->vm_file->f_mapping &&
!is_vm_hugetlb_page(vma))
@@ -2200,7 +2200,7 @@ int mm_take_all_locks(struct mm_struct *
vma_iter_init(&vmi, mm, 0);
for_each_vma(vmi, vma) {
- if (signal_pending(current))
+ if (fatal_signal_pending(current))
goto out_unlock;
if (vma->anon_vma)
list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
* Re: [PATCH v2] fix AMDGPU failure with periodic signal
2025-11-07 17:48 [PATCH v2] fix AMDGPU failure with periodic signal Mikulas Patocka
@ 2026-01-02 19:02 ` Lorenzo Stoakes
2026-01-02 19:08 ` Lorenzo Stoakes
2026-01-04 21:12 ` Mikulas Patocka
2026-01-02 19:15 ` Lorenzo Stoakes
2026-01-06 11:51 ` Lorenzo Stoakes
2 siblings, 2 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2026-01-02 19:02 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Alex Deucher, Christian König, Andrew Morton,
David Hildenbrand, amd-gfx, linux-mm, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato
+cc literally everyone you should have cc'd in mm :/
Hi Mikulas,
You really need to check MAINTAINERS, you've sent a patch that changes mm/vma.c
without cc'ing a single maintainer or reviewer of that file. I just happened to
notice this by chance, even lei seemed to mess up the file query for some
reason.
I'm confused in general about this patch, you sent it on 7th Nov? And it's been
ignored until now and then taken without review to the hotfixes queue?
Andrew - what's going on here? The patch looks fine but we do need to be made
aware of this stuff!
And it's seemingly against a specific stable version?... I guess this code is
antiquated so safe but still.
Thanks, Lorenzo
On Fri, Nov 07, 2025 at 06:48:01PM +0100, Mikulas Patocka wrote:
> If a process sets up a timer that periodically sends a signal in short
> intervals and if it uses OpenCL on AMDGPU at the same time, we get random
> errors. Sometimes, probing the OpenCL device fails (strace shows that
> open("/dev/kfd") failed with -EINTR). Sometimes we get the message
> "amdgpu: init_user_pages: Failed to register MMU notifier: -4" in the
> syslog.
>
> The bug can be reproduced with this program:
> http://www.jikos.cz/~mikulas/testcases/opencl/opencl-bug-small.c
>
> The root cause for these failures is in the function mm_take_all_locks.
> This function fails with -EINTR if there is a pending signal. The -EINTR is
> propagated up the call stack to userspace and userspace fails if it gets
> this error.
>
> There is the following call chain: kfd_open -> kfd_create_process ->
> create_process -> mmu_notifier_get -> mmu_notifier_get_locked ->
> __mmu_notifier_register -> mm_take_all_locks -> "return -EINTR"
>
> If the failure happens in init_user_pages, there is the following call
> chain: init_user_pages -> amdgpu_hmm_register ->
> mmu_interval_notifier_insert -> mmu_notifier_register ->
> __mmu_notifier_register -> mm_take_all_locks -> "return -EINTR"
>
> In order to fix these failures, this commit changes
> signal_pending(current) to fatal_signal_pending(current) in
> mm_take_all_locks, so that it is interrupted only if the signal is
> actually killing the process.
>
> Also, this commit skips pr_err in init_user_pages if the process is being
> killed - in this situation, there was no error and so we don't want to
> report it in the syslog.
>
> I'm submitting this patch for the stable kernels, because this bug may
> cause random failures in any OpenCL code.
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org
>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 9 +++++++--
> mm/vma.c | 8 ++++----
> 2 files changed, 11 insertions(+), 6 deletions(-)
>
> Index: linux-6.17.7/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> ===================================================================
> --- linux-6.17.7.orig/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ linux-6.17.7/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1069,8 +1069,13 @@ static int init_user_pages(struct kgd_me
>
> ret = amdgpu_hmm_register(bo, user_addr);
> if (ret) {
> - pr_err("%s: Failed to register MMU notifier: %d\n",
> - __func__, ret);
> + /*
> + * If we got EINTR because the process was killed, don't report
> + * it, because no error happened.
> + */
> + if (!(fatal_signal_pending(current) && ret == -EINTR))
> + pr_err("%s: Failed to register MMU notifier: %d\n",
> + __func__, ret);
> goto out;
> }
>
> Index: linux-6.17.7/mm/vma.c
> ===================================================================
> --- linux-6.17.7.orig/mm/vma.c
> +++ linux-6.17.7/mm/vma.c
> @@ -2175,14 +2175,14 @@ int mm_take_all_locks(struct mm_struct *
> * is reached.
> */
> for_each_vma(vmi, vma) {
> - if (signal_pending(current))
> + if (fatal_signal_pending(current))
> goto out_unlock;
> vma_start_write(vma);
> }
>
> vma_iter_init(&vmi, mm, 0);
> for_each_vma(vmi, vma) {
> - if (signal_pending(current))
> + if (fatal_signal_pending(current))
> goto out_unlock;
> if (vma->vm_file && vma->vm_file->f_mapping &&
> is_vm_hugetlb_page(vma))
> @@ -2191,7 +2191,7 @@ int mm_take_all_locks(struct mm_struct *
>
> vma_iter_init(&vmi, mm, 0);
> for_each_vma(vmi, vma) {
> - if (signal_pending(current))
> + if (fatal_signal_pending(current))
> goto out_unlock;
> if (vma->vm_file && vma->vm_file->f_mapping &&
> !is_vm_hugetlb_page(vma))
> @@ -2200,7 +2200,7 @@ int mm_take_all_locks(struct mm_struct *
>
> vma_iter_init(&vmi, mm, 0);
> for_each_vma(vmi, vma) {
> - if (signal_pending(current))
> + if (fatal_signal_pending(current))
> goto out_unlock;
> if (vma->anon_vma)
> list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
>
>
* Re: [PATCH v2] fix AMDGPU failure with periodic signal
2026-01-02 19:02 ` Lorenzo Stoakes
@ 2026-01-02 19:08 ` Lorenzo Stoakes
2026-01-03 17:58 ` Andrew Morton
2026-01-04 21:12 ` Mikulas Patocka
1 sibling, 1 reply; 17+ messages in thread
From: Lorenzo Stoakes @ 2026-01-02 19:08 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Alex Deucher, Christian König, Andrew Morton,
David Hildenbrand, amd-gfx, linux-mm, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato
On Fri, Jan 02, 2026 at 07:02:40PM +0000, Lorenzo Stoakes wrote:
> +cc literally everyone you should have cc'd in mm :/
>
> Hi Mikulas,
>
> You really need to check MAINTAINERS, you've sent a patch that changes mm/vma.c
> without cc'ing a single maintainer or reviewer of that file. I just happened to
> notice this by chance, even lei seemed to mess up the file query for some
> reason.
Ah yes, it's because this patch breaks the VMA userland tests.
You need to modify tools/testing/vma/vma_internal.h and rename signal_pending() to
fatal_signal_pending().
You can check it by going to the tools/testing/vma directory, running make, and
executing the vma executable.
This one I don't blame you for, there were meant to be CI tests for this in mm
but for some reason that's just not been done.
But this needs fixing. If this is being backported to all human history, you
probably don't want to touch the tests in the same patch, but that leaves
commits with broken tests. An alternative would be a patch applied before this
one that adds fatal_signal_pending() to vma_internal.h.
But not sure how feasible that is? Andrew?
Thanks.
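
For illustration only, the kind of userland stub being asked for might look
roughly like the sketch below. It assumes the test header stubs out signal
checks with functions that always report "nothing pending"; the struct here is
a hypothetical placeholder and not the definition in
tools/testing/vma/vma_internal.h, so the real header should be checked before
copying anything.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical placeholder; the real test header supplies its own task type. */
struct task_struct {
	int placeholder;
};

/*
 * Userland stub (assumed shape): the test binary never queues signals for
 * the fake task, so always report that no fatal signal is pending.
 */
static inline bool fatal_signal_pending(struct task_struct *p)
{
	(void)p;
	return false;
}
```

With a stub like this in place, code that calls fatal_signal_pending(current)
compiles in userland and the lock-taking loops never bail out early.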
* Re: [PATCH v2] fix AMDGPU failure with periodic signal
2025-11-07 17:48 [PATCH v2] fix AMDGPU failure with periodic signal Mikulas Patocka
2026-01-02 19:02 ` Lorenzo Stoakes
@ 2026-01-02 19:15 ` Lorenzo Stoakes
2026-01-06 11:51 ` Lorenzo Stoakes
2 siblings, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2026-01-02 19:15 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Alex Deucher, Christian König, Andrew Morton,
David Hildenbrand, amd-gfx, linux-mm
And now I've become aware of this patch (!) some review...
This subject is weird, this looks like an mm fix so the subject should be 'mm:
only interrupt taking all mm locks on fatal signal' or something.
It should describe what the patch is doing in general (hint: it doesn't only
affect AMD GPUs). So this is actively bad. Please change it.
You haven't provided a link to v1 of the patch which is not helpful.
On Fri, Nov 07, 2025 at 06:48:01PM +0100, Mikulas Patocka wrote:
> If a process sets up a timer that periodically sends a signal in short
> intervals and if it uses OpenCL on AMDGPU at the same time, we get random
> errors. Sometimes, probing the OpenCL device fails (strace shows that
> open("/dev/kfd") failed with -EINTR). Sometimes we get the message
> "amdgpu: init_user_pages: Failed to register MMU notifier: -4" in the
> syslog.
> The bug can be reproduced with this program:
> http://www.jikos.cz/~mikulas/testcases/opencl/opencl-bug-small.c
Please don't provide random links in commit messages. They die. Just include the
reproducer in the commit message.
>
> The root cause for these failures is in the function mm_take_all_locks.
> This function fails with -EINTR if there is a pending signal. The -EINTR is
> propagated up the call stack to userspace and userspace fails if it gets
> this error.
>
> There is the following call chain: kfd_open -> kfd_create_process ->
> create_process -> mmu_notifier_get -> mmu_notifier_get_locked ->
> __mmu_notifier_register -> mm_take_all_locks -> "return -EINTR"
>
> If the failure happens in init_user_pages, there is the following call
> chain: init_user_pages -> amdgpu_hmm_register ->
> mmu_interval_notifier_insert -> mmu_notifier_register ->
> __mmu_notifier_register -> mm_take_all_locks -> "return -EINTR"
>
> In order to fix these failures, this commit changes
> signal_pending(current) to fatal_signal_pending(current) in
> mm_take_all_locks, so that it is interrupted only if the signal is
> actually killing the process.
>
> Also, this commit skips pr_err in init_user_pages if the process is being
> killed - in this situation, there was no error and so we don't want to
> report it in the syslog.
>
> I'm submitting this patch for the stable kernels, because this bug may
> cause random failures in any OpenCL code.
I mean, in general it might cause failures in any code which relies upon
mm_take_all_locks() or the mmu notifier logic you mention above right? I'd make
the commit message more generally about how it's sensible to check for fatal
signals, and then reference your use case as an example.
You're changing core mm code, the commit message should reflect that.
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org
No fixes tag...? How can the stable guys know which stable kernels to apply this
against? Please fix.
>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 9 +++++++--
> mm/vma.c | 8 ++++----
> 2 files changed, 11 insertions(+), 6 deletions(-)
>
> Index: linux-6.17.7/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> ===================================================================
Whatever you're doing here is just wrong. You send mm patches against Andrew's
tree, mm-unstable probably best branch at the moment.
I mean this has been taken already, but now no tooling like 'b4 shazam'
etc. will work here.
> --- linux-6.17.7.orig/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ linux-6.17.7/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1069,8 +1069,13 @@ static int init_user_pages(struct kgd_me
>
> ret = amdgpu_hmm_register(bo, user_addr);
> if (ret) {
> - pr_err("%s: Failed to register MMU notifier: %d\n",
> - __func__, ret);
> + /*
> + * If we got EINTR because the process was killed, don't report
> + * it, because no error happened.
> + */
> + if (!(fatal_signal_pending(current) && ret == -EINTR))
> + pr_err("%s: Failed to register MMU notifier: %d\n",
> + __func__, ret);
Why are you doing this here? It seems to me that this isn't really important,
and the bit you want to backport is _just_ the mm stuff.
Since you're going to be backporting to every single stable kernel, I suggest
you drop this and do it as a separate patch?
> goto out;
> }
>
> Index: linux-6.17.7/mm/vma.c
> ===================================================================
> --- linux-6.17.7.orig/mm/vma.c
> +++ linux-6.17.7/mm/vma.c
> @@ -2175,14 +2175,14 @@ int mm_take_all_locks(struct mm_struct *
> * is reached.
> */
> for_each_vma(vmi, vma) {
> - if (signal_pending(current))
> + if (fatal_signal_pending(current))
> goto out_unlock;
> vma_start_write(vma);
> }
>
> vma_iter_init(&vmi, mm, 0);
> for_each_vma(vmi, vma) {
> - if (signal_pending(current))
> + if (fatal_signal_pending(current))
> goto out_unlock;
> if (vma->vm_file && vma->vm_file->f_mapping &&
> is_vm_hugetlb_page(vma))
> @@ -2191,7 +2191,7 @@ int mm_take_all_locks(struct mm_struct *
>
> vma_iter_init(&vmi, mm, 0);
> for_each_vma(vmi, vma) {
> - if (signal_pending(current))
> + if (fatal_signal_pending(current))
> goto out_unlock;
> if (vma->vm_file && vma->vm_file->f_mapping &&
> !is_vm_hugetlb_page(vma))
> @@ -2200,7 +2200,7 @@ int mm_take_all_locks(struct mm_struct *
>
> vma_iter_init(&vmi, mm, 0);
> for_each_vma(vmi, vma) {
> - if (signal_pending(current))
> + if (fatal_signal_pending(current))
> goto out_unlock;
> if (vma->anon_vma)
> list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
>
>
This change seems reasonable to me at a glance, I don't think there's any
legitimate reason why we'd want to interrupt on _any_ signal here, a bit of
kernel archeology suggests this has been here since ~2008 so people probably
just assumed there were reasons (TM) why we checked signals in general here.
Others might have some great insight into why we did that though...
Cheers, Lorenzo
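
To make the behavioural difference concrete, here is a small userspace toy
model of the check, not kernel code. All names prefixed with toy_ are
hypothetical, and the bitmask is a simplification of the kernel's pending
signal set. It relies on the fact that the kernel treats a task as fatally
signalled when SIGKILL is pending for it.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>

/* Toy stand-in for a task: bit (sig - 1) is set while signal sig is pending. */
struct toy_task {
	unsigned long long pending;
};

static void toy_queue_signal(struct toy_task *t, int sig)
{
	t->pending |= 1ULL << (sig - 1);
}

/* Old check: any pending signal counts. */
static bool toy_signal_pending(const struct toy_task *t)
{
	return t->pending != 0;
}

/* New check: only a pending SIGKILL counts. */
static bool toy_fatal_signal_pending(const struct toy_task *t)
{
	return (t->pending & (1ULL << (SIGKILL - 1))) != 0;
}

/* Walk nr_vmas pretend VMAs, bailing out with -EINTR as the patched loops do. */
static int toy_take_all_locks(const struct toy_task *t, int nr_vmas,
			      bool fatal_only)
{
	for (int i = 0; i < nr_vmas; i++) {
		bool interrupted = fatal_only ? toy_fatal_signal_pending(t)
					      : toy_signal_pending(t);
		if (interrupted)
			return -EINTR;
		/* A real implementation would lock VMA i here. */
	}
	return 0;
}
```

With only a timer signal such as SIGALRM queued, the old-style check returns
-EINTR while the fatal-only check lets the walk finish; once SIGKILL is queued,
both bail out.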
* Re: [PATCH v2] fix AMDGPU failure with periodic signal
2026-01-02 19:08 ` Lorenzo Stoakes
@ 2026-01-03 17:58 ` Andrew Morton
2026-01-05 10:43 ` Lorenzo Stoakes
0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2026-01-03 17:58 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Mikulas Patocka, Alex Deucher, Christian König,
David Hildenbrand, amd-gfx, linux-mm, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato
On Fri, 2 Jan 2026 19:08:37 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Fri, Jan 02, 2026 at 07:02:40PM +0000, Lorenzo Stoakes wrote:
> > +cc literally everyone you should have cc'd in mm :/
> >
> > Hi Mikulas,
> >
> > You really need to check MAINTAINERS, you've sent a patch that changes mm/vma.c
> > without cc'ing a single maintainer or reviewer of that file. I just happened to
> > notice this by chance, even lei seemed to mess up the file query for some
> > reason.
>
> Ah yes, it's because this patch breaks the VMA userland tests.
>
> You need to modify tools/testing/vma/vma_internal.h and rename signal_pending() to
> fatal_signal_pending().
>
> You can check it by going to the tools/testing/vma directory running make and
> executing the vma executable.
>
> This one I don't blame you for, there were meant to be CI tests for this in mm
> but for some reason that's just not been done.
>
> But this needs fixing. If this is being backported to all human history you
> probably don't want to do that, but that leaves commits with broken tests in so
> an alternative would be to add a patch that gets added before this one that adds
> fatal_signal_pending() to vma_internal.h.
>
> But not sure how feasible that is? Andrew?
Not understanding why it requires a separate patch. Can we modify this
patch so it makes the necessary alterations to selftests?
* Re: [PATCH v2] fix AMDGPU failure with periodic signal
2026-01-02 19:02 ` Lorenzo Stoakes
2026-01-02 19:08 ` Lorenzo Stoakes
@ 2026-01-04 21:12 ` Mikulas Patocka
2026-01-05 10:45 ` Lorenzo Stoakes
1 sibling, 1 reply; 17+ messages in thread
From: Mikulas Patocka @ 2026-01-04 21:12 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Alex Deucher, Christian König, Andrew Morton,
David Hildenbrand, amd-gfx, linux-mm, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato
On Fri, 2 Jan 2026, Lorenzo Stoakes wrote:
> +cc literally everyone you should have cc'd in mm :/
>
> Hi Mikulas,
>
> You really need to check MAINTAINERS, you've sent a patch that changes mm/vma.c
> without cc'ing a single maintainer or reviewer of that file. I just happened to
> notice this by chance, even lei seemed to mess up the file query for some
> reason.
I saw
MEMORY MANAGEMENT
M: Andrew Morton <akpm@linux-foundation.org>
L: linux-mm@kvack.org
S: Maintained
in the MAINTAINERS file, so I sent the patch to Andrew and to
linux-mm@kvack.org. I should also have sent it to the people in the "MEMORY
MANAGEMENT - CORE" section, but I missed it.
> I'm confused in general about this patch, you sent it on 7th Nov? And it's been
> ignored until now and then taken without review to the hotfixes queue?
I'm developing code that translates parallelizable loops written in the
Ajla programming language (www.ajla-lang.cz) into OpenCL and runs them on
the graphics card. Ajla sets up a periodic timer that sends a signal for
scheduling purposes and this signal interferes with OpenCL, causing the
-EINTR failures.
So far, I worked around this bug by blocking all signals around the
functions clGetPlatformIDs and clGetDeviceIDs - but it would be better to
fix it in the Linux kernel and remove the signal-blocking hacks.
Mikulas
* Re: [PATCH v2] fix AMDGPU failure with periodic signal
2026-01-03 17:58 ` Andrew Morton
@ 2026-01-05 10:43 ` Lorenzo Stoakes
0 siblings, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2026-01-05 10:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Mikulas Patocka, Alex Deucher, Christian König,
David Hildenbrand, amd-gfx, linux-mm, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato
On Sat, Jan 03, 2026 at 09:58:45AM -0800, Andrew Morton wrote:
> On Fri, 2 Jan 2026 19:08:37 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > On Fri, Jan 02, 2026 at 07:02:40PM +0000, Lorenzo Stoakes wrote:
> > > +cc literally everyone you should have cc'd in mm :/
> > >
> > > Hi Mikulas,
> > >
> > > You really need to check MAINTAINERS, you've sent a patch that changes mm/vma.c
> > > without cc'ing a single maintainer or reviewer of that file. I just happened to
> > > notice this by chance, even lei seemed to mess up the file query for some
> > > reason.
> >
> > Ah yes, it's because this patch breaks the VMA userland tests.
> >
> > You need to modify tools/testing/vma/vma_internal.h and rename signal_pending() to
> > fatal_signal_pending().
> >
> > You can check it by going to the tools/testing/vma directory running make and
> > executing the vma executable.
> >
> > This one I don't blame you for, there were meant to be CI tests for this in mm
> > but for some reason that's just not been done.
> >
> > But this needs fixing. If this is being backported to all human history you
> > probably don't want to do that, but that leaves commits with broken tests in so
> > an alternative would be to add a patch that gets added before this one that adds
> > fatal_signal_pending() to vma_internal.h.
> >
> > But not sure how feasible that is? Andrew?
>
> Not understanding why it requires a separate patch. Can we modify this
> patch so it makes the necessary alterations to selftests?
>
Because afaict this needs backporting to every single stable kernel (we
need a fixes tag clearly, I reviewed that elsewhere), so doing other stuff
might make it not-backportable or at least very very unreasonably painful.
Thanks, Lorenzo
* Re: [PATCH v2] fix AMDGPU failure with periodic signal
2026-01-04 21:12 ` Mikulas Patocka
@ 2026-01-05 10:45 ` Lorenzo Stoakes
2026-01-05 10:59 ` Lorenzo Stoakes
0 siblings, 1 reply; 17+ messages in thread
From: Lorenzo Stoakes @ 2026-01-05 10:45 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Alex Deucher, Christian König, Andrew Morton,
David Hildenbrand, amd-gfx, linux-mm, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato
On Sun, Jan 04, 2026 at 10:12:50PM +0100, Mikulas Patocka wrote:
>
>
> On Fri, 2 Jan 2026, Lorenzo Stoakes wrote:
>
> > +cc literally everyone you should have cc'd in mm :/
> >
> > Hi Mikulas,
> >
> > You really need to check MAINTAINERS, you've sent a patch that changes mm/vma.c
> > without cc'ing a single maintainer or reviewer of that file. I just happened to
> > notice this by chance, even lei seemed to mess up the file query for some
> > reason.
>
> I saw
>
> MEMORY MANAGEMENT
> M: Andrew Morton <akpm@linux-foundation.org>
> L: linux-mm@kvack.org
> S: Maintained
>
> in the MAINTAINERS file, so I sent the patch to Andrew and to
> linux-mm@kvack.org. I should have sent it also to people on the "MEMORY
> MANAGEMENT - CORE" section, but I missed it.
Yup things have changed :) scripts/get_maintainer.pl --no-git ... is the way to
be sure.
I understand that if you haven't touched mm for a while you'd assume it was as
it was, but now we really do try to have people assigned to each bit.
>
> > I'm confused in general about this patch, you sent it on 7th Nov? And it's been
> > ignored until now and then taken without review to the hotfixes queue?
>
> I'm developing code that translates parallelizable loops written in the
> Ajla programming language (www.ajla-lang.cz) into OpenCL and runs them on
> the graphics card. Ajla sets up a periodic timer that sends a signal for
> scheduling purposes and this signal interferes with OpenCL, causing the
> -EINTR failures.
>
> So far, I worked around this bug by blocking all signals around the
> functions clGetPlatformIDs and clGetDeviceIDs - but it would be better to
> fix it in the Linux kernel and remove the signal-blocking hacks.
Sure I get that, I'm just confused as to why this was suddenly taken to
mm-unstable-hotfix out of the blue a couple months later.
Anyway I sent another mail with review, please rework this patch accordingly.
>
> Mikulas
>
>
Thanks, Lorenzo
* Re: [PATCH v2] fix AMDGPU failure with periodic signal
2026-01-05 10:45 ` Lorenzo Stoakes
@ 2026-01-05 10:59 ` Lorenzo Stoakes
0 siblings, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2026-01-05 10:59 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Alex Deucher, Christian König, Andrew Morton,
David Hildenbrand, amd-gfx, linux-mm, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato
On Mon, Jan 05, 2026 at 10:45:25AM +0000, Lorenzo Stoakes wrote:
> On Sun, Jan 04, 2026 at 10:12:50PM +0100, Mikulas Patocka wrote:
> >
> >
> > On Fri, 2 Jan 2026, Lorenzo Stoakes wrote:
> >
> > > +cc literally everyone you should have cc'd in mm :/
> > >
> > > Hi Mikulas,
> > >
> > > You really need to check MAINTAINERS, you've sent a patch that changes mm/vma.c
> > > without cc'ing a single maintainer or reviewer of that file. I just happened to
> > > notice this by chance, even lei seemed to mess up the file query for some
> > > reason.
> >
> > I saw
> >
> > MEMORY MANAGEMENT
> > M: Andrew Morton <akpm@linux-foundation.org>
> > L: linux-mm@kvack.org
> > S: Maintained
> >
> > in the MAINTAINERS file, so I sent the patch to Andrew and to
> > linux-mm@kvack.org. I should have sent it also to people on the "MEMORY
> > MANAGEMENT - CORE" section, but I missed it.
>
> Yup things have changed :) scripts/get_maintainer.pl --no-git ... is the way to
> be sure.
>
> Understand if you haven't touched mm for a while you'd assume it was as was but
> now we really do try to have people assigned to each bit.
>
> >
> > > I'm confused in general about this patch, you sent it on 7th Nov? And it's been
> > > ignored until now and then taken without review to the hotfixes queue?
> >
> > I'm developing code that translates parallelizable loops written in the
> > Ajla programming language (www.ajla-lang.cz) into OpenCL and runs them on
> > the graphics card. Ajla sets up a periodic timer that sends a signal for
> > scheduling purposes and this signal interferes with OpenCL, causing the
> > -EINTR failures.
> >
> > So far, I worked around this bug by blocking all signals around the
> > functions clGetPlatformIDs and clGetDeviceIDs - but it would be better to
> > fix it in the Linux kernel and remove the signal-blocking hacks.
>
> Sure I get that, I'm just confused as to why this was suddenly taken to
> mm-unstable-hotfix out of the blue a couple months later.
>
> Anyway I sent another mail with review, please rework this patch accordingly.
OK sorry you already have, sorry I am catching up with ~700 mails after annual
leave :)
>
> >
> > Mikulas
> >
> >
>
> Thanks, Lorenzo
>
* Re: [PATCH v2] fix AMDGPU failure with periodic signal
2025-11-07 17:48 [PATCH v2] fix AMDGPU failure with periodic signal Mikulas Patocka
2026-01-02 19:02 ` Lorenzo Stoakes
2026-01-02 19:15 ` Lorenzo Stoakes
@ 2026-01-06 11:51 ` Lorenzo Stoakes
2026-01-06 18:12 ` Andrew Morton
2 siblings, 1 reply; 17+ messages in thread
From: Lorenzo Stoakes @ 2026-01-06 11:51 UTC (permalink / raw)
To: Andrew Morton
Cc: Alex Deucher, Mikulas Patocka, Christian König,
David Hildenbrand, amd-gfx, linux-mm
Andrew,
I'm not sure if the git repos are lagging vs. quilt, but as reported this
patch breaks the VMA tests, and the tests are _still_ broken.
Yet it's still in mm-new, mm-unstable, and even mm-hotfixes-unstable.
This is interfering with my work, can we please drop this.
Also the v3 is currently being debated, so surely should have been dropped
until we have this resolved?
Thanks, Lorenzo
On Fri, Nov 07, 2025 at 06:48:01PM +0100, Mikulas Patocka wrote:
> If a process sets up a timer that periodically sends a signal in short
> intervals and if it uses OpenCL on AMDGPU at the same time, we get random
> errors. Sometimes, probing the OpenCL device fails (strace shows that
> open("/dev/kfd") failed with -EINTR). Sometimes we get the message
> "amdgpu: init_user_pages: Failed to register MMU notifier: -4" in the
> syslog.
>
> The bug can be reproduced with this program:
> http://www.jikos.cz/~mikulas/testcases/opencl/opencl-bug-small.c
>
> The root cause for these failures is in the function mm_take_all_locks.
> This function fails with -EINTR if there is a pending signal. The -EINTR is
> propagated up the call stack to userspace and userspace fails if it gets
> this error.
>
> There is the following call chain: kfd_open -> kfd_create_process ->
> create_process -> mmu_notifier_get -> mmu_notifier_get_locked ->
> __mmu_notifier_register -> mm_take_all_locks -> "return -EINTR"
>
> If the failure happens in init_user_pages, there is the following call
> chain: init_user_pages -> amdgpu_hmm_register ->
> mmu_interval_notifier_insert -> mmu_notifier_register ->
> __mmu_notifier_register -> mm_take_all_locks -> "return -EINTR"
>
> In order to fix these failures, this commit changes
> signal_pending(current) to fatal_signal_pending(current) in
> mm_take_all_locks, so that it is interrupted only if the signal is
> actually killing the process.
>
> Also, this commit skips pr_err in init_user_pages if the process is being
> killed - in this situation, there was no error and so we don't want to
> report it in the syslog.
>
> I'm submitting this patch for the stable kernels, because this bug may
> cause random failures in any OpenCL code.
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org
>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 9 +++++++--
> mm/vma.c | 8 ++++----
> 2 files changed, 11 insertions(+), 6 deletions(-)
>
> Index: linux-6.17.7/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> ===================================================================
> --- linux-6.17.7.orig/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ linux-6.17.7/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1069,8 +1069,13 @@ static int init_user_pages(struct kgd_me
>
>  	ret = amdgpu_hmm_register(bo, user_addr);
>  	if (ret) {
> -		pr_err("%s: Failed to register MMU notifier: %d\n",
> -		       __func__, ret);
> +		/*
> +		 * If we got EINTR because the process was killed, don't report
> +		 * it, because no error happened.
> +		 */
> +		if (!(fatal_signal_pending(current) && ret == -EINTR))
> +			pr_err("%s: Failed to register MMU notifier: %d\n",
> +			       __func__, ret);
>  		goto out;
>  	}
>
> Index: linux-6.17.7/mm/vma.c
> ===================================================================
> --- linux-6.17.7.orig/mm/vma.c
> +++ linux-6.17.7/mm/vma.c
> @@ -2175,14 +2175,14 @@ int mm_take_all_locks(struct mm_struct *
>  	 * is reached.
>  	 */
>  	for_each_vma(vmi, vma) {
> -		if (signal_pending(current))
> +		if (fatal_signal_pending(current))
>  			goto out_unlock;
>  		vma_start_write(vma);
>  	}
>
>  	vma_iter_init(&vmi, mm, 0);
>  	for_each_vma(vmi, vma) {
> -		if (signal_pending(current))
> +		if (fatal_signal_pending(current))
>  			goto out_unlock;
>  		if (vma->vm_file && vma->vm_file->f_mapping &&
>  				is_vm_hugetlb_page(vma))
> @@ -2191,7 +2191,7 @@ int mm_take_all_locks(struct mm_struct *
>
>  	vma_iter_init(&vmi, mm, 0);
>  	for_each_vma(vmi, vma) {
> -		if (signal_pending(current))
> +		if (fatal_signal_pending(current))
>  			goto out_unlock;
>  		if (vma->vm_file && vma->vm_file->f_mapping &&
>  				!is_vm_hugetlb_page(vma))
> @@ -2200,7 +2200,7 @@ int mm_take_all_locks(struct mm_struct *
>
>  	vma_iter_init(&vmi, mm, 0);
>  	for_each_vma(vmi, vma) {
> -		if (signal_pending(current))
> +		if (fatal_signal_pending(current))
>  			goto out_unlock;
>  		if (vma->anon_vma)
>  			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
>
>
* Re: [PATCH v2] fix AMDGPU failure with periodic signal
2026-01-06 11:51 ` Lorenzo Stoakes
@ 2026-01-06 18:12 ` Andrew Morton
2026-01-06 18:24 ` Lorenzo Stoakes
0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2026-01-06 18:12 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Alex Deucher, Mikulas Patocka, Christian König,
David Hildenbrand, amd-gfx, linux-mm
On Tue, 6 Jan 2026 11:51:49 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> I'm not sure if the git repos are lagging vs. quilt, but as reported this
> patch breaks the VMA tests, and the tests are _still_ broken.
>
> Yet it's still in mm-new, mm-unstable, and even mm-hotfixes-unstable.
>
> This is interfering with my work, can we please drop this.
>
> Also the v3 is currently being debated, so surely should have been dropped
> until we have this resolved?
Well. I don't drop fixes unless it's decided to be a non-issue or
unless a better fix is available.
I've done this for ever - I've held onto "wrong" fixes for *years*.
View this as a weird issue-tracking system for a project which has no
issue-tracking system. It's to prevent issues from falling through
cracks and getting lost.
It's unfortunate that this one causes disruption so I guess I'll loudly
comment it out and track the issue that way.
* Re: [PATCH v2] fix AMDGPU failure with periodic signal
2026-01-06 18:12 ` Andrew Morton
@ 2026-01-06 18:24 ` Lorenzo Stoakes
2026-01-06 20:59 ` Andrew Morton
0 siblings, 1 reply; 17+ messages in thread
From: Lorenzo Stoakes @ 2026-01-06 18:24 UTC (permalink / raw)
To: Andrew Morton
Cc: Alex Deucher, Mikulas Patocka, Christian König,
David Hildenbrand, amd-gfx, linux-mm
On Tue, Jan 06, 2026 at 10:12:49AM -0800, Andrew Morton wrote:
> On Tue, 6 Jan 2026 11:51:49 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > I'm not sure if the git repos are lagging vs. quilt, but as reported this
> > patch breaks the VMA tests, and the tests are _still_ broken.
> >
> > Yet it's still in mm-new, mm-unstable, and even mm-hotfixes-unstable.
> >
> > This is interfering with my work, can we please drop this.
> >
> > Also the v3 is currently being debated, so surely should have been dropped
> > until we have this resolved?
>
> Well. I don't drop fixes unless it's decided to be a non-issue or
> unless a better fix is available.
Even if it breaks the build and that's been reported on-list?
>
> I've done this for ever - I've held onto "wrong" fixes for *years*.
> View this as a weird issue-tracking system for a project which has no
> issue-tracking system. It's to prevent issues from falling through
> cracks and getting lost.
I think a lot of the issue is these processes seem to work to you but those
on the ground are finding them not to work.
The kernel today is not the same as the kernel X years ago, esp. in terms
of sheer volume.
Having a patch that none of the relevant maintainers/reviewers have seen
land in an -rc out of the blue is a really serious problem.
Also it was taken 2 months after it was submitted, so nobody could have
_possibly_ picked this up by reading the list. This is why I am really
underlining this case.
Again, requiring an M signoff fixes this completely.
No patch should be merged without review, most certainly not one expedited
to an -rc.
>
> It's unfortunate that this one causes disruption so I guess I'll loudly
> comment it out and track the issue that way.
>
I think we need a better approach, yes.
We in mm are really very responsive compared to most, I think asking people
to wait and resend if somehow it got missed is considerably saner than
'well I'll take any patch purporting to be a fix from anyone so we keep
track of stuff'.
Cheers, Lorenzo
* Re: [PATCH v2] fix AMDGPU failure with periodic signal
2026-01-06 18:24 ` Lorenzo Stoakes
@ 2026-01-06 20:59 ` Andrew Morton
2026-01-06 21:52 ` Pedro Falcato
2026-01-07 12:33 ` [PATCH v2] fix AMDGPU failure with periodic signal Lorenzo Stoakes
0 siblings, 2 replies; 17+ messages in thread
From: Andrew Morton @ 2026-01-06 20:59 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Alex Deucher, Mikulas Patocka, Christian König,
David Hildenbrand, amd-gfx, linux-mm
On Tue, 6 Jan 2026 18:24:10 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Tue, Jan 06, 2026 at 10:12:49AM -0800, Andrew Morton wrote:
> > On Tue, 6 Jan 2026 11:51:49 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> >
> > > I'm not sure if the git repos are lagging vs. quilt, but as reported this
> > > patch breaks the VMA tests, and the tests are _still_ broken.
> > >
> > > Yet it's still in mm-new, mm-unstable, and even mm-hotfixes-unstable.
> > >
> > > This is interfering with my work, can we please drop this.
> > >
> > > Also the v3 is currently being debated, so surely should have been dropped
> > > until we have this resolved?
> >
> > Well. I don't drop fixes unless it's decided to be a non-issue or
> > unless a better fix is available.
>
> Even if it breaks the build and that's been reported on-list?
I addressed that.
> >
> > I've done this for ever - I've held onto "wrong" fixes for *years*.
> > View this as a weird issue-tracking system for a project which has no
> > issue-tracking system. It's to prevent issues from falling through
> > cracks and getting lost.
>
> I think a lot of the issue is these processes seem to work to you but those
> on the ground are finding them not to work.
>
> The kernel today is not the same as the kernel X years ago, esp. in terms
> of sheer volume.
>
> Having a patch that none of the relevant maintainers/reviewers have seen
> land in an -rc out of the blue is a really serious problem.
It isn't in -rc. It's in mm-hotfixes-unstable and it's marked "acks?",
which means not to go upstream without further consideration.
> Also it was taken 2 months after it was submitted, so nobody could have
> _possibly_ picked this up by reading the list. This is why I am really
> underlining this case.
That's why I grabbed it. Had I not done so, this issue would have been
lost. What I do *worked*.
> >
> > It's unfortunate that this one causes disruption so I guess I'll loudly
> > comment it out and track the issue that way.
> >
>
> I think we need a better approach, yes.
>
> We in mm are really very responsive compared to most, I think asking people
> to wait and resend if somehow it got missed is considerably saner than
> 'well I'll take any patch purporting to be a fix from anyone so we keep
> track of stuff'.
If someone wants to step up and be MM issue tracking person then great.
I don't want to be that person.
And let me reiterate: had I not done this, the issue Mikulas identified
would have remained unaddressed.
* Re: [PATCH v2] fix AMDGPU failure with periodic signal
2026-01-06 20:59 ` Andrew Morton
@ 2026-01-06 21:52 ` Pedro Falcato
2026-01-08 6:26 ` Finding mm patches to review before those are pulled into the mainline (was "Re: [PATCH v2] fix AMDGPU failure with periodic signal") SeongJae Park
2026-01-07 12:33 ` [PATCH v2] fix AMDGPU failure with periodic signal Lorenzo Stoakes
1 sibling, 1 reply; 17+ messages in thread
From: Pedro Falcato @ 2026-01-06 21:52 UTC (permalink / raw)
To: Andrew Morton
Cc: Lorenzo Stoakes, Alex Deucher, Mikulas Patocka,
Christian König, David Hildenbrand, amd-gfx, linux-mm
On Tue, Jan 06, 2026 at 12:59:12PM -0800, Andrew Morton wrote:
> On Tue, 6 Jan 2026 18:24:10 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > On Tue, Jan 06, 2026 at 10:12:49AM -0800, Andrew Morton wrote:
> > > On Tue, 6 Jan 2026 11:51:49 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > > I'm not sure if the git repos are lagging vs. quilt, but as reported this
> > > > patch breaks the VMA tests, and the tests are _still_ broken.
> > > >
> > > > Yet it's still in mm-new, mm-unstable, and even mm-hotfixes-unstable.
> > > >
> > > > This is interfering with my work, can we please drop this.
> > > >
> > > > Also the v3 is currently being debated, so surely should have been dropped
> > > > until we have this resolved?
> > >
> > > Well. I don't drop fixes unless it's decided to be a non-issue or
> > > unless a better fix is available.
> >
> > Even if it breaks the build and that's been reported on-list?
>
> I addressed that.
>
> > >
> > > I've done this for ever - I've held onto "wrong" fixes for *years*.
> > > View this as a weird issue-tracking system for a project which has no
> > > issue-tracking system. It's to prevent issues from falling through
> > > cracks and getting lost.
> >
> > I think a lot of the issue is these processes seem to work to you but those
> > on the ground are finding them not to work.
> >
> > The kernel today is not the same as the kernel X years ago, esp. in terms
> > of sheer volume.
> >
> > Having a patch that none of the relevant maintainers/reviewers have seen
> > land in an -rc out of the blue is a really serious problem.
>
> It isn't in -rc. It's in mm-hotfixes-unstable and it's marked "acks?",
> which means not to go upstream without further consideration.
>
> > Also it was taken 2 months after it was submitted, so nobody could have
> > _possibly_ picked this up by reading the list. This is why I am really
> > underlining this case.
>
> That's why I grabbed it. Had I not done so, this issue would have been
> lost. What I do *worked*.
>
> > >
> > > It's unfortunate that this one causes disruption so I guess I'll loudly
> > > comment it out and track the issue that way.
> > >
> >
> > I think we need a better approach, yes.
> >
> > We in mm are really very responsive compared to most, I think asking people
> > to wait and resend if somehow it got missed is considerably saner than
> > 'well I'll take any patch purporting to be a fix from anyone so we keep
> > track of stuff'.
>
> If someone wants to step up and be MM issue tracking person then great.
> I don't want to be that person.
>
> And let me reiterate: had I not done this, the issue Mikulas identified
> would have remained unaddressed.
>
I understand your point. I don't think anyone wants to see patches falling
through the cracks. But we also don't want patches to get applied without
any review.
Perhaps it's time to deploy something like Patchwork to help track
outstanding patches?
--
Pedro
* Re: [PATCH v2] fix AMDGPU failure with periodic signal
2026-01-06 20:59 ` Andrew Morton
2026-01-06 21:52 ` Pedro Falcato
@ 2026-01-07 12:33 ` Lorenzo Stoakes
1 sibling, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2026-01-07 12:33 UTC (permalink / raw)
To: Andrew Morton
Cc: Alex Deucher, Mikulas Patocka, Christian König,
David Hildenbrand, amd-gfx, linux-mm
On Tue, Jan 06, 2026 at 12:59:12PM -0800, Andrew Morton wrote:
> On Tue, 6 Jan 2026 18:24:10 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > On Tue, Jan 06, 2026 at 10:12:49AM -0800, Andrew Morton wrote:
> > > On Tue, 6 Jan 2026 11:51:49 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > > I'm not sure if the git repos are lagging vs. quilt, but as reported this
> > > > patch breaks the VMA tests, and the tests are _still_ broken.
> > > >
> > > > Yet it's still in mm-new, mm-unstable, and even mm-hotfixes-unstable.
> > > >
> > > > This is interfering with my work, can we please drop this.
> > > >
> > > > Also the v3 is currently being debated, so surely should have been dropped
> > > > until we have this resolved?
> > >
> > > Well. I don't drop fixes unless it's decided to be a non-issue or
> > > unless a better fix is available.
> >
> > Even if it breaks the build and that's been reported on-list?
>
> I addressed that.
I don't recall? If it's just git tree lag fine.
>
> > >
> > > I've done this for ever - I've held onto "wrong" fixes for *years*.
> > > View this as a weird issue-tracking system for a project which has no
> > > issue-tracking system. It's to prevent issues from falling through
> > > cracks and getting lost.
> >
> > I think a lot of the issue is these processes seem to work to you but those
> > on the ground are finding them not to work.
> >
> > The kernel today is not the same as the kernel X years ago, esp. in terms
> > of sheer volume.
> >
> > Having a patch that none of the relevant maintainers/reviewers have seen
> > land in an -rc out of the blue is a really serious problem.
>
> It isn't in -rc. It's in mm-hotfixes-unstable and it's marked "acks?",
> which means not to go upstream without further consideration.
OK this is another inscrutable part of the process, my understanding of
mm-hotfixes-unstable was 'will go to -rc unless otherwise noted', now it seems
to be 'maybe rc maybe not, but unlike other series, negative review will not
drop this from the tree, even if it causes compile issues'.
>
> > Also it was taken 2 months after it was submitted, so nobody could have
> > _possibly_ picked this up by reading the list. This is why I am really
> > underlining this case.
>
> That's why I grabbed it. Had I not done so, this issue would have been
> lost. What I do *worked*.
Um what worked? You didn't ping any reviewers/maintainers, there was no
recent mail on the mailing list for anybody to notice, your process would
have resulted in this patch going into an -rc without review right?
Was the hope that somebody would notice in a release kernel that something
was broken at some stage? I really do hope the process isn't to rely on
that?
It's only because I happened to notice this (it broke my tests) that
there's even been any review, and note that the patch itself was broken (no
fixes tag, numerous issues) + has caused much debate - that was because of
_my_ efforts.
I think the problem here is your processes are pushing down work onto other
people. Merging patches without maintainer signoff forces reviewers to jump
on things out of fear broken things will be merged (as they have been MANY
times).
I would humbly suggest you listen to the people doing the actual work
here. It may to you seem that your approach works, but if the people who
are doing the work required are telling you it's not, then surely it serves
the community better to listen to them?
I am for instance no longer going to be checking mm-new, and I feel that I
was one of the only people to actually try to report build/test issues
there.
>
> > >
> > > It's unfortunate that this one causes disruption so I guess I'll loudly
> > > comment it out and track the issue that way.
> > >
> >
> > I think we need a better approach, yes.
> >
> > We in mm are really very responsive compared to most, I think asking people
> > to wait and resend if somehow it got missed is considerably saner than
> > 'well I'll take any patch purporting to be a fix from anyone so we keep
> > track of stuff'.
>
> If someone wants to step up and be MM issue tracking person then great.
> I don't want to be that person.
You're not tracking issues, you're tracking submitted bug fixes that may or
may not be valid, and deferring the actual work of checking that to others.
I don't think arbitrarily accepting fix patches is helpful here at all.
And surely keeping a list of submitted patches and a reminder to ping
people (if they were cc'd...!) to look at them is not too egregious?
I could take a look at doing that if you're not happy to do that, though
I'd really want to feel my work would be appreciated, and that's not really
the impression I'm getting lately.
>
> And let me reiterate: had I not done this, the issue Mikulas identified
> would have remained unaddressed.
>
Firstly it's highly dubious there is any issue here (see review).
But more importantly - this is simply not true, you queueing this to be
merged without notifying anybody, and the email being 2 months old, meant
nobody could possibly have predicted it landing (you'd have to be reading
every mail in mm-commits too...)
So the most likely outcome here is this would have got merged unreviewed
with nobody realising, including the people who maintain mm/vma.c which it
changes.
Again it took the work of the actual reviewers for this to get
attention. The correct course here would have been to ping the memory
mapping people, tell Mikulas to cc us in future, then await maintainer
sign-off.
Thanks, Lorenzo
* Finding mm patches to review before those are pulled into the mainline (was "Re: [PATCH v2] fix AMDGPU failure with periodic signal")
2026-01-06 21:52 ` Pedro Falcato
@ 2026-01-08 6:26 ` SeongJae Park
2026-01-08 9:05 ` Lorenzo Stoakes
0 siblings, 1 reply; 17+ messages in thread
From: SeongJae Park @ 2026-01-08 6:26 UTC (permalink / raw)
To: Pedro Falcato
Cc: SeongJae Park, Andrew Morton, Lorenzo Stoakes, Alex Deucher,
Mikulas Patocka, Christian König, David Hildenbrand,
amd-gfx, linux-mm
On Tue, 6 Jan 2026 21:52:25 +0000 Pedro Falcato <pfalcato@suse.de> wrote:
[...]
> I understand your point. I don't think anyone wants to see patches falling
> through the cracks. But we also don't want patches to get applied without
> any review.
I can also clearly see that both Andrew and Lorenzo are doing their best to make
the Linux kernel better, and in good faith. I always appreciate such efforts,
and both of their opinions make sense to me in their own ways.
>
> Perhaps it's time to deploy something like Patchwork to help track
> outstanding patches?
Nooo... I'm too dumb and lazy to learn how to use Patchwork...
I believe we always have room to improve, though. One way to resolve the concerns
raised here would be to ask Andrew, someone else, or some tool to ping the relevant
reviewers of patches that Andrew wants to add to the mm tree. But I think that
might be too much to ask of a single human, especially for mm, which is a
huge subsystem with many reviewers. And because the reviewers have their
own tastes, the solution may not fit all the reviewers very well. For
example, someone might dislike getting such notification mails directly in
their inbox.
In the past, I actually considered making and running a tool that scans for patch
mails that do not Cc the relevant reviewers according to get_maintainer.pl, and
forwards those to the missing reviewers. But I didn't make it because I worried
about polluting someone's inbox. I should also confess I worried about my
electricity bill :)
As an alternative, I was wondering what would happen if reviewers treated the mm
tree as a kind of compact, curated version of the mailing list. That is, using the
mm tree as the place where we can more easily find patches that we need
to review asap. If it turns out there is no time to review immediately, the
reviewer can always ask Andrew to wait.
Finding such patches in the mm tree may be much easier than finding them on the
mailing list, since the number of patches to look through is much smaller, and
writing scripts for that would be much easier, since we can use our favorite
tool, git. For example, I just wrote the simple script below to find such patches
for DAMON in the mm tree:
'''
#!/bin/bash

# Usage: commits_to_review.sh <commits>, e.g. mm-stable..mm-new
if [ $# -ne 1 ]
then
	echo "Usage: $0 <commits>"
	exit 1
fi

commits=$1

review_missed=""

for commit in $(git log --reverse "$commits" --pretty=%H)
do
	commit_content=$(git show "$commit")
	# Skip commits that do not mention DAMON at all.
	if ! echo "$commit_content" | grep damon --quiet
	then
		continue
	fi
	# Skip patches that already carry my Signed-off-by:.
	if echo "$commit_content" | grep "Signed-off-by: SeongJae Park" --quiet
	then
		continue
	fi
	# Whatever is left without my Reviewed-by: still needs a look.
	if ! echo "$commit_content" | grep "Reviewed-by: SeongJae Park" --quiet
	then
		review_missed+="$commit "
	fi
done

for commit in $review_missed
do
	desc=$(git log -1 "$commit" --pretty="%h (\"%s\")")
	echo "review missed for $desc"
done
'''
And it indeed found some interesting patches for me:
'''
$ bash commits_to_review.sh mm-stable..mm-new
review missed for cb844296e68a ("mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE")
review missed for 7bc3a776d611 ("mm: add basic tests for lazy_mmu")
review missed for e8dd7a6b54a8 ("mm/damon: fix typos in comments")
review missed for 999d5100ccf7 ("memcg: rename mem_cgroup_ino() to mem_cgroup_id()")
'''
The first two patches are false positives of the script, but the last two
patches were ones I actually needed to take more care of. Thanks to Andrew
adding a Link: tag to each patch in the mm tree, taking the follow-up action was also
super easy and smooth for me. I like the results more than I expected, and
decided to keep using the script.
And I now realize this approach would also work for people who are not listed
in MAINTAINERS but are still looking for patches to review.
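The same check can be parameterized for other reviewers and areas. Below is a
minimal sketch, not part of the script above: the function names needs_review
and list_unreviewed, the keyword and reviewer arguments, and the extra
Acked-by: check are all assumptions made for illustration.

```shell
#!/bin/bash
# Sketch only: function names and arguments are hypothetical.

# needs_review KEYWORD REVIEWER
# Reads one commit (e.g. "git show" output) on stdin. Succeeds when the
# commit mentions KEYWORD but carries no Signed-off-by:, Reviewed-by: or
# Acked-by: tag from REVIEWER. Both matches are fixed-string, case-sensitive.
needs_review() {
	local keyword reviewer content tag
	keyword=$1
	reviewer=$2
	content=$(cat)

	# Not related to the area of interest.
	printf '%s\n' "$content" | grep -q -F -- "$keyword" || return 1

	# Already authored, reviewed or acked by this person.
	for tag in Signed-off-by Reviewed-by Acked-by
	do
		if printf '%s\n' "$content" | grep -q -F -- "$tag: $reviewer"
		then
			return 1
		fi
	done
	return 0
}

# list_unreviewed RANGE KEYWORD REVIEWER
# Prints one line per commit in RANGE that still needs REVIEWER's attention.
list_unreviewed() {
	local range keyword reviewer commit
	range=$1
	keyword=$2
	reviewer=$3
	for commit in $(git log --reverse --pretty=%H "$range")
	do
		if git show "$commit" | needs_review "$keyword" "$reviewer"
		then
			git log -1 --pretty='review missed for %h ("%s")' "$commit"
		fi
	done
}
```

For instance, running list_unreviewed mm-stable..mm-new damon "SeongJae Park"
from a clone of the mm tree should report roughly the same commits as the
script above, minus any that carry an Acked-by: from the reviewer.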
Even though the idea and the script may not work for others, I just wanted to
share this, only hoping it helps in finding another idea, anything other than
Patchwork [1].
[1] No offence but a joke, Pedro ;) Sorry if it was not funny.
Thanks,
SJ
[...]
* Re: Finding mm patches to review before those are pulled into the mainline (was "Re: [PATCH v2] fix AMDGPU failure with periodic signal")
2026-01-08 6:26 ` Finding mm patches to review before those are pulled into the mainline (was "Re: [PATCH v2] fix AMDGPU failure with periodic signal") SeongJae Park
@ 2026-01-08 9:05 ` Lorenzo Stoakes
0 siblings, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2026-01-08 9:05 UTC (permalink / raw)
To: SeongJae Park
Cc: Pedro Falcato, Andrew Morton, Alex Deucher, Mikulas Patocka,
Christian König, David Hildenbrand, amd-gfx, linux-mm
On Wed, Jan 07, 2026 at 10:26:35PM -0800, SeongJae Park wrote:
> On Tue, 6 Jan 2026 21:52:25 +0000 Pedro Falcato <pfalcato@suse.de> wrote:
>
> [...]
> > I understand your point. I don't think anyone wants to see patches falling
> > through the cracks. But we also don't want patches to get applied without
> > any review.
>
> I can also clearly see that both Andrew and Lorenzo are doing their best to make
> the Linux kernel better, and in good faith. I always appreciate such efforts,
> and both of their opinions make sense to me in their own ways.
I mean, sorry but what? This makes me think you haven't understood what's
happened here.
This patch would have been merged _without any review whatsoever_ given the
circumstances (no M/R cc'd, the original email not replied to and from 2 months
ago).
I don't understand how we can be both-sides-ing here - treating merging stuff to
mainline without review for the _core_ infrastructure of the kernel as a bug
tracker is simply crazy.
Anything that involves merging stuff to mainline simply to track it is
insane. This is core infrastructure, we shouldn't be breaking mainline in order
to track stuff.
Yet again the whole thing is set up in such a way that reviewers have to jump on
things, rather than submitters having to - gasp - wait sometimes.
>
> >
> > Perhaps it's time to deploy something like Patchwork to help track
> > outstanding patches?
I mean they're not outstanding, they get merged. So I'm not sure how you'd track
stuff that way?
All of this is predicated on Andrew no longer merging things without
review. Otherwise it's moot.
>
> Nooo... I'm too dumb and lazy to learn how to use Patchwork...
>
> I believe we always have room to improve, though. One way to resolve the concerns
> raised here would be to ask Andrew, someone else, or some tool to ping the relevant
> reviewers of patches that Andrew wants to add to the mm tree. But I think that
> might be too much to ask of a single human, especially for mm, which is a
> huge subsystem with many reviewers. And because the reviewers have their
> own tastes, the solution may not fit all the reviewers very well. For
> example, someone might dislike getting such notification mails directly in
> their inbox.
I mean no, please - each individual sub-maintainer might have series they're
well aware of not being reviewed yet.
And again, any kind of tracking is really not meaningful if we just merge stuff
anyway.
>
> In the past, I actually considered making and running a tool that scans for patch
> mails that do not Cc the relevant reviewers according to get_maintainer.pl, and
> forwards those to the missing reviewers. But I didn't make it because I worried
> about polluting someone's inbox. I should also confess I worried about my
> electricity bill :)
>
> As an alternative, I was wondering what would happen if reviewers treated the mm
> tree as a kind of compact, curated version of the mailing list. That is, using the
> mm tree as the place where we can more easily find patches that we need
> to review asap. If it turns out there is no time to review immediately, the
> reviewer can always ask Andrew to wait.
>
> Finding such patches in the mm tree may be much easier than finding them on the
> mailing list, since the number of patches to look through is much smaller, and
> writing scripts for that would be much easier, since we can use our favorite
> tool, git. For example, I just wrote the simple script below to find such patches
> for DAMON in the mm tree:
>
>
> '''
> #!/bin/bash
>
> # Usage: commits_to_review.sh <commits>, e.g. mm-stable..mm-new
> if [ $# -ne 1 ]
> then
> 	echo "Usage: $0 <commits>"
> 	exit 1
> fi
>
> commits=$1
>
> review_missed=""
>
> for commit in $(git log --reverse "$commits" --pretty=%H)
> do
> 	commit_content=$(git show "$commit")
> 	# Skip commits that do not mention DAMON at all.
> 	if ! echo "$commit_content" | grep damon --quiet
> 	then
> 		continue
> 	fi
> 	# Skip patches that already carry my Signed-off-by:.
> 	if echo "$commit_content" | grep "Signed-off-by: SeongJae Park" --quiet
> 	then
> 		continue
> 	fi
> 	# Whatever is left without my Reviewed-by: still needs a look.
> 	if ! echo "$commit_content" | grep "Reviewed-by: SeongJae Park" --quiet
> 	then
> 		review_missed+="$commit "
> 	fi
> done
>
> for commit in $review_missed
> do
> 	desc=$(git log -1 "$commit" --pretty="%h (\"%s\")")
> 	echo "review missed for $desc"
> done
> '''
>
> And it indeed found some interesting patches for me:
>
> '''
> $ bash commits_to_review.sh mm-stable..mm-new
> review missed for cb844296e68a ("mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE")
> review missed for 7bc3a776d611 ("mm: add basic tests for lazy_mmu")
> review missed for e8dd7a6b54a8 ("mm/damon: fix typos in comments")
> review missed for 999d5100ccf7 ("memcg: rename mem_cgroup_ino() to mem_cgroup_id()")
> '''
>
> The first two patches are false positives of the script, but the last two
> patches were ones I actually needed to take more care of. Thanks to Andrew
I mean thanks for putting something forward :) but we'd really need to avoid
false positives and you're going to need all sorts of parameters for this to be
any way useful.
I mean we have people sending series in all kinds of ways, and if somebody sends
a v3 for a v2 that had tonnes of review we don't want to hear about that.
And for instance a huge series like Nico's I _am well aware of_ and don't need
nagging about.
In my opinion this is something for each individual sub-maintainer to keep track
of, and really submitters need to get used to resending if necessary.
I think the point Andrew was making here was to track _hotfixes_ especially, and
so you'd need to narrow down to those and really only those from at least 1 or 2
weeks ago that were not yet merged.
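As a rough sketch of that narrowing (an illustration only, not an existing
tool): the helper names is_stale and list_stale are hypothetical, and the
author date is assumed to approximate when the patch was posted, which may not
hold for every commit in the tree.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical.

# is_stale AUTHOR_EPOCH MIN_DAYS [NOW_EPOCH]
# Succeeds when AUTHOR_EPOCH lies at least MIN_DAYS whole days before
# NOW_EPOCH (default: the current time).
is_stale() {
	local authored min_days now
	authored=$1
	min_days=$2
	now=${3:-$(date +%s)}
	[ $(( (now - authored) / 86400 )) -ge "$min_days" ]
}

# list_stale RANGE MIN_DAYS
# Prints the commits in RANGE whose author date is at least MIN_DAYS old.
list_stale() {
	local range min_days authored commit
	range=$1
	min_days=$2
	# %at is the author date as a Unix timestamp, %H the full commit hash.
	git log --reverse --pretty='%at %H' "$range" |
	while read -r authored commit
	do
		if is_stale "$authored" "$min_days"
		then
			git log -1 --pretty='pending for review: %h ("%s")' "$commit"
		fi
	done
}
```

Something like list_stale mm-hotfixes-stable..mm-hotfixes-unstable 14 would
then list hotfixes posted at least two weeks ago; the branch names in that
range are an assumption about the tree layout.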
But we also can't do any of this without clarity on how the trees work.
Right now it's totally unclear how and when and why patches go to different
trees.
So any kind of effort like this is predicated on a bunch of other changes
happening first, and I'm really losing any hope that any of that will happen at
this stage.
> adding a Link: tag to each patch in the mm tree, taking the follow-up action was also
> super easy and smooth for me. I like the results more than I expected, and
> decided to keep using the script.
>
> And I now realize this approach would also work for people who are not listed
> in MAINTAINERS but are still looking for patches to review.
>
> Even though the idea and the script may not work for others, I just wanted to
> share this, only hoping it helps in finding another idea, anything other than
> Patchwork [1].
>
> [1] No offence but a joke, Pedro ;) Sorry if it was not funny.
>
>
> Thanks,
> SJ
>
> [...]
>
Thanks for the input overall, but unless we change the default-merge policy all
of this is completely moot.
I imagine this will have the same impact as my saying I am no longer going to be
using/checking mm-new, but I'm pulling back from review quite a bit while things
remain as they are (you may have noticed I have already significantly reduced
it).
The current approach is unworkable and I burned out on review a couple cycles
ago when we had the insane influx (esp to THP).
So while tooling is nice and all, you still need people to do the reviewing.
And while mm is 'jump on it or it gets merged' it's just pushing all the stress
down to the people doing the actual work, no matter what tooling we use.
We need to change that in my opinion, but I'm no longer hopeful that we will any
time soon.
Cheers, Lorenzo
Thread overview: 17+ messages
2025-11-07 17:48 [PATCH v2] fix AMDGPU failure with periodic signal Mikulas Patocka
2026-01-02 19:02 ` Lorenzo Stoakes
2026-01-02 19:08 ` Lorenzo Stoakes
2026-01-03 17:58 ` Andrew Morton
2026-01-05 10:43 ` Lorenzo Stoakes
2026-01-04 21:12 ` Mikulas Patocka
2026-01-05 10:45 ` Lorenzo Stoakes
2026-01-05 10:59 ` Lorenzo Stoakes
2026-01-02 19:15 ` Lorenzo Stoakes
2026-01-06 11:51 ` Lorenzo Stoakes
2026-01-06 18:12 ` Andrew Morton
2026-01-06 18:24 ` Lorenzo Stoakes
2026-01-06 20:59 ` Andrew Morton
2026-01-06 21:52 ` Pedro Falcato
2026-01-08 6:26 ` Finding mm patches to review before those are pulled into the mainline (was "Re: [PATCH v2] fix AMDGPU failure with periodic signal") SeongJae Park
2026-01-08 9:05 ` Lorenzo Stoakes
2026-01-07 12:33 ` [PATCH v2] fix AMDGPU failure with periodic signal Lorenzo Stoakes