linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] fix AMDGPU failure with periodic signal
@ 2025-11-07 17:48 Mikulas Patocka
  2026-01-02 19:02 ` Lorenzo Stoakes
  2026-01-02 19:15 ` Lorenzo Stoakes
  0 siblings, 2 replies; 4+ messages in thread
From: Mikulas Patocka @ 2025-11-07 17:48 UTC (permalink / raw)
  To: Alex Deucher, Christian König, Andrew Morton, David Hildenbrand
  Cc: amd-gfx, linux-mm

If a process sets up a timer that periodically sends a signal in short
intervals and if it uses OpenCL on AMDGPU at the same time, we get random
errors. Sometimes, probing the OpenCL device fails (strace shows that
open("/dev/kfd") failed with -EINTR). Sometimes we get the message
"amdgpu: init_user_pages: Failed to register MMU notifier: -4" in the
syslog.

The bug can be reproduced with this program:
http://www.jikos.cz/~mikulas/testcases/opencl/opencl-bug-small.c

The root cause for these failures is in the function mm_take_all_locks.
This function fails with -EINTR if there is pending signal. The -EINTR is
propagated up the call stack to userspace and userspace fails if it gets
this error.

There is the following call chain: kfd_open -> kfd_create_process ->
create_process -> mmu_notifier_get -> mmu_notifier_get_locked ->
__mmu_notifier_register -> mm_take_all_locks -> "return -EINTR"

If the failure happens in init_user_pages, there is the following call
chain: init_user_pages -> amdgpu_hmm_register ->
mmu_interval_notifier_insert -> mmu_notifier_register ->
__mmu_notifier_register -> mm_take_all_locks -> "return -EINTR"

In order to fix these failures, this commit changes
signal_pending(current) to fatal_signal_pending(current) in
mm_take_all_locks, so that it is interrupted only if the signal is
actually killing the process.

Also, this commit skips pr_err in init_user_pages if the process is being
killed - in this situation, there was no error and so we don't want to
report it in the syslog.

I'm submitting this patch for the stable kernels, because this bug may
cause random failures in any OpenCL code.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org

---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c |    9 +++++++--
 mm/vma.c                                         |    8 ++++----
 2 files changed, 11 insertions(+), 6 deletions(-)

Index: linux-6.17.7/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
===================================================================
--- linux-6.17.7.orig/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ linux-6.17.7/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1069,8 +1069,13 @@ static int init_user_pages(struct kgd_me
 
 	ret = amdgpu_hmm_register(bo, user_addr);
 	if (ret) {
-		pr_err("%s: Failed to register MMU notifier: %d\n",
-		       __func__, ret);
+		/*
+		 * If we got EINTR because the process was killed, don't report
+		 * it, because no error happened.
+		 */
+		if (!(fatal_signal_pending(current) && ret == -EINTR))
+			pr_err("%s: Failed to register MMU notifier: %d\n",
+			       __func__, ret);
 		goto out;
 	}
 
Index: linux-6.17.7/mm/vma.c
===================================================================
--- linux-6.17.7.orig/mm/vma.c
+++ linux-6.17.7/mm/vma.c
@@ -2175,14 +2175,14 @@ int mm_take_all_locks(struct mm_struct *
 	 * is reached.
 	 */
 	for_each_vma(vmi, vma) {
-		if (signal_pending(current))
+		if (fatal_signal_pending(current))
 			goto out_unlock;
 		vma_start_write(vma);
 	}
 
 	vma_iter_init(&vmi, mm, 0);
 	for_each_vma(vmi, vma) {
-		if (signal_pending(current))
+		if (fatal_signal_pending(current))
 			goto out_unlock;
 		if (vma->vm_file && vma->vm_file->f_mapping &&
 				is_vm_hugetlb_page(vma))
@@ -2191,7 +2191,7 @@ int mm_take_all_locks(struct mm_struct *
 
 	vma_iter_init(&vmi, mm, 0);
 	for_each_vma(vmi, vma) {
-		if (signal_pending(current))
+		if (fatal_signal_pending(current))
 			goto out_unlock;
 		if (vma->vm_file && vma->vm_file->f_mapping &&
 				!is_vm_hugetlb_page(vma))
@@ -2200,7 +2200,7 @@ int mm_take_all_locks(struct mm_struct *
 
 	vma_iter_init(&vmi, mm, 0);
 	for_each_vma(vmi, vma) {
-		if (signal_pending(current))
+		if (fatal_signal_pending(current))
 			goto out_unlock;
 		if (vma->anon_vma)
 			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] fix AMDGPU failure with periodic signal
  2025-11-07 17:48 [PATCH v2] fix AMDGPU failure with periodic signal Mikulas Patocka
@ 2026-01-02 19:02 ` Lorenzo Stoakes
  2026-01-02 19:08   ` Lorenzo Stoakes
  2026-01-02 19:15 ` Lorenzo Stoakes
  1 sibling, 1 reply; 4+ messages in thread
From: Lorenzo Stoakes @ 2026-01-02 19:02 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alex Deucher, Christian König, Andrew Morton,
	David Hildenbrand, amd-gfx, linux-mm, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato

+cc literally everyone you should have cc'd in mm :/

Hi Mikulas,

You really need to check MAINTAINERS, you've sent a patch that changes mm/vma.c
without cc'ing a single maintainer or reviewer of that file. I just happened to
notice this by chance, even lei seemed to mess up the file query for some
reason.

I'm confused in general about this patch, you sent it on 7th Nov? And it's been
ignored until now and then taken without review to the hotfixes queue?

Andrew - what's going on here? The patch looks fine but we do need to be made
aware of this stuff!

And it's seemingly against a specific stable version?... I guess this code is
antiquated so safe but still.

Thanks, Lorenzo

On Fri, Nov 07, 2025 at 06:48:01PM +0100, Mikulas Patocka wrote:
> If a process sets up a timer that periodically sends a signal in short
> intervals and if it uses OpenCL on AMDGPU at the same time, we get random
> errors. Sometimes, probing the OpenCL device fails (strace shows that
> open("/dev/kfd") failed with -EINTR). Sometimes we get the message
> "amdgpu: init_user_pages: Failed to register MMU notifier: -4" in the
> syslog.
>
> The bug can be reproduced with this program:
> http://www.jikos.cz/~mikulas/testcases/opencl/opencl-bug-small.c
>
> The root cause for these failures is in the function mm_take_all_locks.
> This function fails with -EINTR if there is pending signal. The -EINTR is
> propagated up the call stack to userspace and userspace fails if it gets
> this error.
>
> There is the following call chain: kfd_open -> kfd_create_process ->
> create_process -> mmu_notifier_get -> mmu_notifier_get_locked ->
> __mmu_notifier_register -> mm_take_all_locks -> "return -EINTR"
>
> If the failure happens in init_user_pages, there is the following call
> chain: init_user_pages -> amdgpu_hmm_register ->
> mmu_interval_notifier_insert -> mmu_notifier_register ->
> __mmu_notifier_register -> mm_take_all_locks -> "return -EINTR"
>
> In order to fix these failures, this commit changes
> signal_pending(current) to fatal_signal_pending(current) in
> mm_take_all_locks, so that it is interrupted only if the signal is
> actually killing the process.
>
> Also, this commit skips pr_err in init_user_pages if the process is being
> killed - in this situation, there was no error and so we don't want to
> report it in the syslog.
>
> I'm submitting this patch for the stable kernels, because this bug may
> cause random failures in any OpenCL code.
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org
>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c |    9 +++++++--
>  mm/vma.c                                         |    8 ++++----
>  2 files changed, 11 insertions(+), 6 deletions(-)
>
> Index: linux-6.17.7/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> ===================================================================
> --- linux-6.17.7.orig/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ linux-6.17.7/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1069,8 +1069,13 @@ static int init_user_pages(struct kgd_me
>
>  	ret = amdgpu_hmm_register(bo, user_addr);
>  	if (ret) {
> -		pr_err("%s: Failed to register MMU notifier: %d\n",
> -		       __func__, ret);
> +		/*
> +		 * If we got EINTR because the process was killed, don't report
> +		 * it, because no error happened.
> +		 */
> +		if (!(fatal_signal_pending(current) && ret == -EINTR))
> +			pr_err("%s: Failed to register MMU notifier: %d\n",
> +			       __func__, ret);
>  		goto out;
>  	}
>
> Index: linux-6.17.7/mm/vma.c
> ===================================================================
> --- linux-6.17.7.orig/mm/vma.c
> +++ linux-6.17.7/mm/vma.c
> @@ -2175,14 +2175,14 @@ int mm_take_all_locks(struct mm_struct *
>  	 * is reached.
>  	 */
>  	for_each_vma(vmi, vma) {
> -		if (signal_pending(current))
> +		if (fatal_signal_pending(current))
>  			goto out_unlock;
>  		vma_start_write(vma);
>  	}
>
>  	vma_iter_init(&vmi, mm, 0);
>  	for_each_vma(vmi, vma) {
> -		if (signal_pending(current))
> +		if (fatal_signal_pending(current))
>  			goto out_unlock;
>  		if (vma->vm_file && vma->vm_file->f_mapping &&
>  				is_vm_hugetlb_page(vma))
> @@ -2191,7 +2191,7 @@ int mm_take_all_locks(struct mm_struct *
>
>  	vma_iter_init(&vmi, mm, 0);
>  	for_each_vma(vmi, vma) {
> -		if (signal_pending(current))
> +		if (fatal_signal_pending(current))
>  			goto out_unlock;
>  		if (vma->vm_file && vma->vm_file->f_mapping &&
>  				!is_vm_hugetlb_page(vma))
> @@ -2200,7 +2200,7 @@ int mm_take_all_locks(struct mm_struct *
>
>  	vma_iter_init(&vmi, mm, 0);
>  	for_each_vma(vmi, vma) {
> -		if (signal_pending(current))
> +		if (fatal_signal_pending(current))
>  			goto out_unlock;
>  		if (vma->anon_vma)
>  			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
>
>


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] fix AMDGPU failure with periodic signal
  2026-01-02 19:02 ` Lorenzo Stoakes
@ 2026-01-02 19:08   ` Lorenzo Stoakes
  0 siblings, 0 replies; 4+ messages in thread
From: Lorenzo Stoakes @ 2026-01-02 19:08 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alex Deucher, Christian König, Andrew Morton,
	David Hildenbrand, amd-gfx, linux-mm, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato

On Fri, Jan 02, 2026 at 07:02:40PM +0000, Lorenzo Stoakes wrote:
> +cc literally everyone you should have cc'd in mm :/
>
> Hi Mikulas,
>
> You really need to check MAINTAINERS, you've sent a patch that changes mm/vma.c
> without cc'ing a single maintainer or reviewer of that file. I just happened to
> notice this by chance, even lei seemed to mess up the file query for some
> reason.

Ah yes, it's because this patch breaks the VMA userland tests.

You need to modify tools/testing/vma/vma_internal.h and rename signal_pending() to
fatal_signal_pending().

You can check it by going to the tools/testing/vma directory running make and
executing the vma executable.

This one I don't blame you for, there were meant to be CI tests for this in mm
but for some reason that's just not been done.

But this needs fixing. If this is being backported to all human history you
probably don't want to do that, but that leaves commits with broken tests in so
an alternative would be to add a patch that gets added before this one that adds
fatal_signal_pending() to vma_internal.h.

But not sure how feasible that is? Andrew?

Thanks.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] fix AMDGPU failure with periodic signal
  2025-11-07 17:48 [PATCH v2] fix AMDGPU failure with periodic signal Mikulas Patocka
  2026-01-02 19:02 ` Lorenzo Stoakes
@ 2026-01-02 19:15 ` Lorenzo Stoakes
  1 sibling, 0 replies; 4+ messages in thread
From: Lorenzo Stoakes @ 2026-01-02 19:15 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alex Deucher, Christian König, Andrew Morton,
	David Hildenbrand, amd-gfx, linux-mm

And now I've become aware of this patch (!) some review...

This subject is weird, this looks like an mm fix so the subject should be 'mm:
only interrupt taking all mm locks on fatal signal' or something.

It should describe what the patch is doing in general (hint: it doesn't only
affect AMD GPUs). So this is actively bad. Please change it.

You haven't provided a link to v1 of the patch which is not helpful.

On Fri, Nov 07, 2025 at 06:48:01PM +0100, Mikulas Patocka wrote:
> If a process sets up a timer that periodically sends a signal in short
> intervals and if it uses OpenCL on AMDGPU at the same time, we get random
> errors. Sometimes, probing the OpenCL device fails (strace shows that
> open("/dev/kfd") failed with -EINTR). Sometimes we get the message
> "amdgpu: init_user_pages: Failed to register MMU notifier: -4" in the
> syslog.

> The bug can be reproduced with this program:
> http://www.jikos.cz/~mikulas/testcases/opencl/opencl-bug-small.c

Please don't provide random links in commit messages. They die. Just include the
reproducer in the commit message.

>
> The root cause for these failures is in the function mm_take_all_locks.
> This function fails with -EINTR if there is pending signal. The -EINTR is
> propagated up the call stack to userspace and userspace fails if it gets
> this error.
>
> There is the following call chain: kfd_open -> kfd_create_process ->
> create_process -> mmu_notifier_get -> mmu_notifier_get_locked ->
> __mmu_notifier_register -> mm_take_all_locks -> "return -EINTR"
>
> If the failure happens in init_user_pages, there is the following call
> chain: init_user_pages -> amdgpu_hmm_register ->
> mmu_interval_notifier_insert -> mmu_notifier_register ->
> __mmu_notifier_register -> mm_take_all_locks -> "return -EINTR"
>
> In order to fix these failures, this commit changes
> signal_pending(current) to fatal_signal_pending(current) in
> mm_take_all_locks, so that it is interrupted only if the signal is
> actually killing the process.
>
> Also, this commit skips pr_err in init_user_pages if the process is being
> killed - in this situation, there was no error and so we don't want to
> report it in the syslog.
>
> I'm submitting this patch for the stable kernels, because this bug may
> cause random failures in any OpenCL code.

I mean, in general it might cause failures in any code which relies upon
mm_take_all_locks() or the mmu notifier logic you mention above right? I'd make
the commit message more generally about how it's sensible to check for fatal
signals, and then reference your use case as an example.

You're changing core mm code, the commit message should reflect that.

>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org

No fixes tag...? How can the stable guys know which stable kernels to apply this
against? Please fix.

>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c |    9 +++++++--
>  mm/vma.c                                         |    8 ++++----
>  2 files changed, 11 insertions(+), 6 deletions(-)
>
> Index: linux-6.17.7/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> ===================================================================

Whatever you're doing here is just wrong. You send mm patches against Andrew's
tree, mm-unstable probably best branch at the moment.

I mean this has been taken already, but now no tooling like 'b4 shazam'
etc. will work here.

> --- linux-6.17.7.orig/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ linux-6.17.7/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1069,8 +1069,13 @@ static int init_user_pages(struct kgd_me
>
>  	ret = amdgpu_hmm_register(bo, user_addr);
>  	if (ret) {
> -		pr_err("%s: Failed to register MMU notifier: %d\n",
> -		       __func__, ret);
> +		/*
> +		 * If we got EINTR because the process was killed, don't report
> +		 * it, because no error happened.
> +		 */
> +		if (!(fatal_signal_pending(current) && ret == -EINTR))
> +			pr_err("%s: Failed to register MMU notifier: %d\n",
> +			       __func__, ret);

Why are you doing this here? It seems to me that this isn't really important,
and the bit you want to backport is _just_ the mm stuff.

Since you're going to be backporting to every single stable kernel, I suggest
you drop this and do it as a separate patch?

>  		goto out;
>  	}
>
> Index: linux-6.17.7/mm/vma.c
> ===================================================================
> --- linux-6.17.7.orig/mm/vma.c
> +++ linux-6.17.7/mm/vma.c
> @@ -2175,14 +2175,14 @@ int mm_take_all_locks(struct mm_struct *
>  	 * is reached.
>  	 */
>  	for_each_vma(vmi, vma) {
> -		if (signal_pending(current))
> +		if (fatal_signal_pending(current))

>  			goto out_unlock;
>  		vma_start_write(vma);
>  	}
>
>  	vma_iter_init(&vmi, mm, 0);
>  	for_each_vma(vmi, vma) {
> -		if (signal_pending(current))
> +		if (fatal_signal_pending(current))
>  			goto out_unlock;
>  		if (vma->vm_file && vma->vm_file->f_mapping &&
>  				is_vm_hugetlb_page(vma))
> @@ -2191,7 +2191,7 @@ int mm_take_all_locks(struct mm_struct *
>
>  	vma_iter_init(&vmi, mm, 0);
>  	for_each_vma(vmi, vma) {
> -		if (signal_pending(current))
> +		if (fatal_signal_pending(current))
>  			goto out_unlock;
>  		if (vma->vm_file && vma->vm_file->f_mapping &&
>  				!is_vm_hugetlb_page(vma))
> @@ -2200,7 +2200,7 @@ int mm_take_all_locks(struct mm_struct *
>
>  	vma_iter_init(&vmi, mm, 0);
>  	for_each_vma(vmi, vma) {
> -		if (signal_pending(current))
> +		if (fatal_signal_pending(current))
>  			goto out_unlock;
>  		if (vma->anon_vma)
>  			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
>
>

This change seems reasonable to me at a glance, I don't think there's any
legitimate reason why we'd want to interrupt on _any_ signal here, a bit of
kernel archeology suggests this has been here since ~2008 so people probably
just assumed there were reasons (TM) why we checked signals in general here.

Others might have some great insight into why we did that though...

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-01-02 19:15 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-07 17:48 [PATCH v2] fix AMDGPU failure with periodic signal Mikulas Patocka
2026-01-02 19:02 ` Lorenzo Stoakes
2026-01-02 19:08   ` Lorenzo Stoakes
2026-01-02 19:15 ` Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox