linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/2] Fix failures with signals and AMD OpenCL
@ 2026-01-07 20:31 Mikulas Patocka
  2026-01-07 20:31 ` [PATCH v4 1/2] mm_take_all_locks: change -EINTR to -ERESTARTSYS Mikulas Patocka
  2026-01-07 20:31 ` [PATCH v4 2/2] amdgpu: delete the "Failed to register MMU notifier" message Mikulas Patocka
  0 siblings, 2 replies; 9+ messages in thread
From: Mikulas Patocka @ 2026-01-07 20:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pedro Falcato, Liam R. Howlett, Lorenzo Stoakes, Alex Deucher,
	Christian König, David Hildenbrand, amd-gfx, linux-mm,
	Vlastimil Babka, Jann Horn

This patch set fixes failures with AMD OpenCL and signals.

The first patch makes the kernel return -ERESTARTSYS instead of -EINTR,
so that the syscall is restarted if the signal has the SA_RESTART flag.

The second patch removes a spurious error message.

Mikulas



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v4 1/2] mm_take_all_locks: change -EINTR to -ERESTARTSYS
  2026-01-07 20:31 [PATCH v4 0/2] Fix failures with signals and AMD OpenCL Mikulas Patocka
@ 2026-01-07 20:31 ` Mikulas Patocka
  2026-01-07 20:55   ` Matthew Wilcox
  2026-01-08 15:33   ` Liam R. Howlett
  2026-01-07 20:31 ` [PATCH v4 2/2] amdgpu: delete the "Failed to register MMU notifier" message Mikulas Patocka
  1 sibling, 2 replies; 9+ messages in thread
From: Mikulas Patocka @ 2026-01-07 20:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pedro Falcato, Liam R. Howlett, Lorenzo Stoakes, Alex Deucher,
	Christian König, David Hildenbrand, amd-gfx, linux-mm,
	Vlastimil Babka, Jann Horn, Mikulas Patocka, stable

If a process receives a signal while it executes some kernel code that
calls mm_take_all_locks, we get -EINTR error. The -EINTR is propagated up
the call stack to userspace and userspace may fail if it gets this
error.

This commit changes -EINTR to -ERESTARTSYS, so that if the signal handler
was installed with the SA_RESTART flag, the operation is automatically
restarted.

For example, this problem happens when using OpenCL on AMDGPU. If some
signal races with clGetDeviceIDs, clGetDeviceIDs returns an error
CL_DEVICE_NOT_FOUND (and strace shows that open("/dev/kfd") failed with
EINTR).

This problem can be reproduced with the following program.

To run this program, you need AMD graphics card and the package
"rocm-opencl" installed. You must not have the package "mesa-opencl-icd"
installed, because it redirects the default OpenCL implementation to
itself.

include <stdio.h>
include <stdlib.h>
include <unistd.h>
include <string.h>
include <signal.h>
include <sys/time.h>

define CL_TARGET_OPENCL_VERSION	300
include <CL/opencl.h>

static void fn(void)
{
	while (1) {
		int32_t err;
		cl_device_id device;
		err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
		if (err != CL_SUCCESS) {
			fprintf(stderr, "clGetDeviceIDs failed: %d\n", err);
			exit(1);
		}
		write(2, "-", 1);
	}
}

static void alrm(int sig)
{
	write(2, ".", 1);
}

int main(void)
{
	struct itimerval it;
	struct sigaction sa;
	memset(&sa, 0, sizeof sa);
	sa.sa_handler = alrm;
	sa.sa_flags = SA_RESTART;
	sigaction(SIGALRM, &sa, NULL);
	it.it_interval.tv_sec = 0;
	it.it_interval.tv_usec = 50;
	it.it_value.tv_sec = 0;
	it.it_value.tv_usec = 50;
	setitimer(ITIMER_REAL, &it, NULL);
	fn();
	return 1;
}

I'm submitting this patch for the stable kernels, because the AMD ROCm
stack fails if it receives EINTR from open (it seems to restart EINTR
from ioctl correctly). The process may receive signals at unpredictable
times, so the OpenCL implementation may fail at unpredictable times.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-November/133141.html
Link: https://yhbt.net/lore/linux-mm/6f16b618-26fc-3031-abe8-65c2090262e7@redhat.com/T/#u
Cc: stable@vger.kernel.org
Fixes: 7906d00cd1f6 ("mmu-notifiers: add mm_take_all_locks() operation")
---
 mm/vma.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: mm/mm/vma.c
===================================================================
--- mm.orig/mm/vma.c	2026-01-07 20:11:21.000000000 +0100
+++ mm/mm/vma.c	2026-01-07 20:11:21.000000000 +0100
@@ -2202,7 +2202,7 @@ int mm_take_all_locks(struct mm_struct *
 
 out_unlock:
 	mm_drop_all_locks(mm);
-	return -EINTR;
+	return -ERESTARTSYS;
 }
 
 static void vm_unlock_anon_vma(struct anon_vma *anon_vma)



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v4 2/2] amdgpu: delete the "Failed to register MMU notifier" message
  2026-01-07 20:31 [PATCH v4 0/2] Fix failures with signals and AMD OpenCL Mikulas Patocka
  2026-01-07 20:31 ` [PATCH v4 1/2] mm_take_all_locks: change -EINTR to -ERESTARTSYS Mikulas Patocka
@ 2026-01-07 20:31 ` Mikulas Patocka
  2026-01-08 15:20   ` Lorenzo Stoakes
  1 sibling, 1 reply; 9+ messages in thread
From: Mikulas Patocka @ 2026-01-07 20:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pedro Falcato, Liam R. Howlett, Lorenzo Stoakes, Alex Deucher,
	Christian König, David Hildenbrand, amd-gfx, linux-mm,
	Vlastimil Babka, Jann Horn, Mikulas Patocka

This error may happen if mm_take_all_locks was interrupted by a signal -
the userspace will retry an ioctl that returned -EINTR, so there is no
need to report it to the syslog.

Christian König suggested to remove this message at all, so let's do it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

Index: mm/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
===================================================================
--- mm.orig/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c	2026-01-07 20:09:51.000000000 +0100
+++ mm/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c	2026-01-07 20:12:11.000000000 +0100
@@ -1069,11 +1069,8 @@ static int init_user_pages(struct kgd_me
 	}
 
 	ret = amdgpu_hmm_register(bo, user_addr);
-	if (ret) {
-		pr_err("%s: Failed to register MMU notifier: %d\n",
-		       __func__, ret);
+	if (ret)
 		goto out;
-	}
 
 	if (criu_resume) {
 		/*



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 1/2] mm_take_all_locks: change -EINTR to -ERESTARTSYS
  2026-01-07 20:31 ` [PATCH v4 1/2] mm_take_all_locks: change -EINTR to -ERESTARTSYS Mikulas Patocka
@ 2026-01-07 20:55   ` Matthew Wilcox
  2026-01-07 21:29     ` Mikulas Patocka
  2026-01-08 15:33   ` Liam R. Howlett
  1 sibling, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2026-01-07 20:55 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrew Morton, Pedro Falcato, Liam R. Howlett, Lorenzo Stoakes,
	Alex Deucher, Christian König, David Hildenbrand, amd-gfx,
	linux-mm, Vlastimil Babka, Jann Horn, stable

On Wed, Jan 07, 2026 at 09:31:14PM +0100, Mikulas Patocka wrote:
> This commit changes -EINTR to -ERESTARTSYS, so that if the signal handler
> was installed with the SA_RESTART flag, the operation is automatically
> restarted.

No, this is bonkers.  If you get a signal, you return -EINTR.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 1/2] mm_take_all_locks: change -EINTR to -ERESTARTSYS
  2026-01-07 20:55   ` Matthew Wilcox
@ 2026-01-07 21:29     ` Mikulas Patocka
  2026-01-08 15:25       ` Lorenzo Stoakes
  0 siblings, 1 reply; 9+ messages in thread
From: Mikulas Patocka @ 2026-01-07 21:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Pedro Falcato, Liam R. Howlett, Lorenzo Stoakes,
	Alex Deucher, Christian König, David Hildenbrand, amd-gfx,
	linux-mm, Vlastimil Babka, Jann Horn, stable



On Wed, 7 Jan 2026, Matthew Wilcox wrote:

> On Wed, Jan 07, 2026 at 09:31:14PM +0100, Mikulas Patocka wrote:
> > This commit changes -EINTR to -ERESTARTSYS, so that if the signal handler
> > was installed with the SA_RESTART flag, the operation is automatically
> > restarted.
> 
> No, this is bonkers.  If you get a signal, you return -EINTR.

Why?

fifo_open returns -ERESTARTSYS, so why not here?

Mikulas



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 2/2] amdgpu: delete the "Failed to register MMU notifier" message
  2026-01-07 20:31 ` [PATCH v4 2/2] amdgpu: delete the "Failed to register MMU notifier" message Mikulas Patocka
@ 2026-01-08 15:20   ` Lorenzo Stoakes
  2026-01-08 15:30     ` Liam R. Howlett
  0 siblings, 1 reply; 9+ messages in thread
From: Lorenzo Stoakes @ 2026-01-08 15:20 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrew Morton, Pedro Falcato, Liam R. Howlett, Alex Deucher,
	Christian König, David Hildenbrand, amd-gfx, linux-mm,
	Vlastimil Babka, Jann Horn

On Wed, Jan 07, 2026 at 09:31:15PM +0100, Mikulas Patocka wrote:
> This error may happen if mm_take_all_locks was interrupted by a signal -
> the userspace will retry an ioctl that returned -EINTR, so there is no
> need to report it to the syslog.
>
> Christian König suggested to remove this message at all, so let's do it.
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

I don't really see the point in this being part of the series here... seems
to me a patch that could be applied any time, so why not send that
separately to the amd driver people?

I mean maybe it's not al that important but still.

>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c |    5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> Index: mm/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> ===================================================================
> --- mm.orig/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c	2026-01-07 20:09:51.000000000 +0100
> +++ mm/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c	2026-01-07 20:12:11.000000000 +0100
> @@ -1069,11 +1069,8 @@ static int init_user_pages(struct kgd_me
>  	}
>
>  	ret = amdgpu_hmm_register(bo, user_addr);
> -	if (ret) {
> -		pr_err("%s: Failed to register MMU notifier: %d\n",
> -		       __func__, ret);
> +	if (ret)
>  		goto out;
> -	}
>
>  	if (criu_resume) {
>  		/*
>
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 1/2] mm_take_all_locks: change -EINTR to -ERESTARTSYS
  2026-01-07 21:29     ` Mikulas Patocka
@ 2026-01-08 15:25       ` Lorenzo Stoakes
  0 siblings, 0 replies; 9+ messages in thread
From: Lorenzo Stoakes @ 2026-01-08 15:25 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Matthew Wilcox, Andrew Morton, Pedro Falcato, Liam R. Howlett,
	Alex Deucher, Christian König, David Hildenbrand, amd-gfx,
	linux-mm, Vlastimil Babka, Jann Horn, stable

On Wed, Jan 07, 2026 at 10:29:44PM +0100, Mikulas Patocka wrote:
>
>
> On Wed, 7 Jan 2026, Matthew Wilcox wrote:
>
> > On Wed, Jan 07, 2026 at 09:31:14PM +0100, Mikulas Patocka wrote:
> > > This commit changes -EINTR to -ERESTARTSYS, so that if the signal handler
> > > was installed with the SA_RESTART flag, the operation is automatically
> > > restarted.
> >
> > No, this is bonkers.  If you get a signal, you return -EINTR.
>
> Why?
>
> fifo_open returns -ERESTARTSYS, so why not here?

This is fundamentally changing what this does for all callers, and we've
simply not encountered this issue elsewhere.

Given how long it might take to work it might be interrupted by another
signal on the next attempt, whose to say that it might not result in a
situation where it can never complete?

I thnk it should stay as it is.

Also the comment above literally has:

 * mm_take_all_locks() and mm_drop_all_locks are expensive operations
 * that may have to take thousand of locks.
 *
 * mm_take_all_locks() can fail if it's interrupted by signals.
 */

Which at any rate would need to be changed even if we were to change it.

I'm more and more inclined to say let's just drop this series in general,
and you should go fix how signals are handled in the driver/userland code.

This is really feeling like the wrong place to fix this.

>
> Mikulas
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 2/2] amdgpu: delete the "Failed to register MMU notifier" message
  2026-01-08 15:20   ` Lorenzo Stoakes
@ 2026-01-08 15:30     ` Liam R. Howlett
  0 siblings, 0 replies; 9+ messages in thread
From: Liam R. Howlett @ 2026-01-08 15:30 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Mikulas Patocka, Andrew Morton, Pedro Falcato, Alex Deucher,
	Christian König, David Hildenbrand, amd-gfx, linux-mm,
	Vlastimil Babka, Jann Horn

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [260108 10:20]:
> On Wed, Jan 07, 2026 at 09:31:15PM +0100, Mikulas Patocka wrote:
> > This error may happen if mm_take_all_locks was interrupted by a signal -
> > the userspace will retry an ioctl that returned -EINTR, so there is no
> > need to report it to the syslog.
> >
> > Christian König suggested to remove this message at all, so let's do it.
> >
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> 
> I don't really see the point in this being part of the series here... seems
> to me a patch that could be applied any time, so why not send that
> separately to the amd driver people?
> 
> I mean maybe it's not al that important but still.

I really think changing this to info would be better.  It was (and may
be) useful to some?

Obviously the amd folks need to work on signal handling and this would
help them..

But then again, as Lorenzo said, we don't really need to be involved in
this patch.

> 
> >
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c |    5 +----
> >  1 file changed, 1 insertion(+), 4 deletions(-)
> >
> > Index: mm/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> > ===================================================================
> > --- mm.orig/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c	2026-01-07 20:09:51.000000000 +0100
> > +++ mm/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c	2026-01-07 20:12:11.000000000 +0100
> > @@ -1069,11 +1069,8 @@ static int init_user_pages(struct kgd_me
> >  	}
> >
> >  	ret = amdgpu_hmm_register(bo, user_addr);
> > -	if (ret) {
> > -		pr_err("%s: Failed to register MMU notifier: %d\n",
> > -		       __func__, ret);
> > +	if (ret)
> >  		goto out;
> > -	}
> >
> >  	if (criu_resume) {
> >  		/*
> >
> >


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 1/2] mm_take_all_locks: change -EINTR to -ERESTARTSYS
  2026-01-07 20:31 ` [PATCH v4 1/2] mm_take_all_locks: change -EINTR to -ERESTARTSYS Mikulas Patocka
  2026-01-07 20:55   ` Matthew Wilcox
@ 2026-01-08 15:33   ` Liam R. Howlett
  1 sibling, 0 replies; 9+ messages in thread
From: Liam R. Howlett @ 2026-01-08 15:33 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrew Morton, Pedro Falcato, Lorenzo Stoakes, Alex Deucher,
	Christian König, David Hildenbrand, amd-gfx, linux-mm,
	Vlastimil Babka, Jann Horn, stable

* Mikulas Patocka <mpatocka@redhat.com> [260107 15:32]:
> If a process receives a signal while it executes some kernel code that
> calls mm_take_all_locks, we get -EINTR error. The -EINTR is propagated up
> the call stack to userspace and userspace may fail if it gets this
> error.
> 
> This commit changes -EINTR to -ERESTARTSYS, so that if the signal handler
> was installed with the SA_RESTART flag, the operation is automatically
> restarted.
> 
> For example, this problem happens when using OpenCL on AMDGPU. If some
> signal races with clGetDeviceIDs, clGetDeviceIDs returns an error
> CL_DEVICE_NOT_FOUND (and strace shows that open("/dev/kfd") failed with
> EINTR).
> 
> This problem can be reproduced with the following program.
> 
> To run this program, you need AMD graphics card and the package
> "rocm-opencl" installed. You must not have the package "mesa-opencl-icd"
> installed, because it redirects the default OpenCL implementation to
> itself.

What about [1]?  This reproducer should be fixed in the ROCm userspace
library, as stated by Christian König.

1.  https://lore.kernel.org/linux-mm/5c8ccd03-9a3a-47c0-9a89-be0abf6c1fb5@amd.com/

Thanks,
Liam


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-01-08 15:34 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-07 20:31 [PATCH v4 0/2] Fix failures with signals and AMD OpenCL Mikulas Patocka
2026-01-07 20:31 ` [PATCH v4 1/2] mm_take_all_locks: change -EINTR to -ERESTARTSYS Mikulas Patocka
2026-01-07 20:55   ` Matthew Wilcox
2026-01-07 21:29     ` Mikulas Patocka
2026-01-08 15:25       ` Lorenzo Stoakes
2026-01-08 15:33   ` Liam R. Howlett
2026-01-07 20:31 ` [PATCH v4 2/2] amdgpu: delete the "Failed to register MMU notifier" message Mikulas Patocka
2026-01-08 15:20   ` Lorenzo Stoakes
2026-01-08 15:30     ` Liam R. Howlett

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox