* [PATCH v2] fix AMDGPU failure with periodic signal
From: Mikulas Patocka @ 2025-11-07 17:48 UTC
To: Alex Deucher, Christian König, Andrew Morton, David Hildenbrand
Cc: amd-gfx, linux-mm
If a process sets up a timer that sends a signal periodically at short
intervals and uses OpenCL on AMDGPU at the same time, it gets random
errors. Sometimes probing the OpenCL device fails (strace shows that
open("/dev/kfd") failed with -EINTR). Sometimes the message
"amdgpu: init_user_pages: Failed to register MMU notifier: -4" appears in
the syslog.
The bug can be reproduced with this program:
http://www.jikos.cz/~mikulas/testcases/opencl/opencl-bug-small.c
The root cause of these failures is in the function mm_take_all_locks.
This function fails with -EINTR if there is a pending signal. The -EINTR
is propagated up the call stack to userspace, and userspace fails when it
gets this error.
If the failure happens when opening /dev/kfd, there is the following call
chain: kfd_open -> kfd_create_process -> create_process ->
mmu_notifier_get -> mmu_notifier_get_locked ->
__mmu_notifier_register -> mm_take_all_locks -> "return -EINTR"
If the failure happens in init_user_pages, there is the following call
chain: init_user_pages -> amdgpu_hmm_register ->
mmu_interval_notifier_insert -> mmu_notifier_register ->
__mmu_notifier_register -> mm_take_all_locks -> "return -EINTR"
In order to fix these failures, this commit changes
signal_pending(current) to fatal_signal_pending(current) in
mm_take_all_locks, so that it is interrupted only if the signal is
actually killing the process.
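
The effect of the change can be illustrated with a simplified userspace
model. This is not kernel code: struct task_model and the model_*
functions are hypothetical, and the model only assumes that the kernel's
fatal_signal_pending() is true when SIGKILL is pending for the task,
while signal_pending() is true for any pending signal:

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a task: one bit per pending signal. */
struct task_model {
	uint64_t pending;
};

/* Mark signal sig as pending in the model. */
static void model_raise(struct task_model *t, int sig)
{
	t->pending |= UINT64_C(1) << (sig - 1);
}

/* Models signal_pending(): true for any pending signal. */
static bool model_signal_pending(const struct task_model *t)
{
	return t->pending != 0;
}

/* Models fatal_signal_pending(): true only if SIGKILL is pending. */
static bool model_fatal_signal_pending(const struct task_model *t)
{
	return (t->pending >> (SIGKILL - 1)) & 1;
}

/*
 * Models the bail-out check in the locking loops.
 * only_fatal == false: behaviour before the patch.
 * only_fatal == true:  behaviour after the patch.
 */
static int model_take_all_locks(const struct task_model *t, bool only_fatal)
{
	bool interrupted = only_fatal ? model_fatal_signal_pending(t)
				      : model_signal_pending(t);

	return interrupted ? -EINTR : 0;
}
```

In this model a pending SIGALRM makes the pre-patch check return -EINTR,
while the post-patch check returns 0 until SIGKILL is raised.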
Also, this commit skips pr_err in init_user_pages if the process is being
killed. In this situation no error has happened, so we don't want to
report one in the syslog.
I'm submitting this patch for the stable kernels, because this bug may
cause random failures in any OpenCL code.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 9 +++++++--
mm/vma.c | 8 ++++----
2 files changed, 11 insertions(+), 6 deletions(-)
Index: linux-6.17.7/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
===================================================================
--- linux-6.17.7.orig/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ linux-6.17.7/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1069,8 +1069,13 @@ static int init_user_pages(struct kgd_me
ret = amdgpu_hmm_register(bo, user_addr);
if (ret) {
- pr_err("%s: Failed to register MMU notifier: %d\n",
- __func__, ret);
+ /*
+ * If we got EINTR because the process was killed, don't report
+ * it, because no error happened.
+ */
+ if (!(fatal_signal_pending(current) && ret == -EINTR))
+ pr_err("%s: Failed to register MMU notifier: %d\n",
+ __func__, ret);
goto out;
}
Index: linux-6.17.7/mm/vma.c
===================================================================
--- linux-6.17.7.orig/mm/vma.c
+++ linux-6.17.7/mm/vma.c
@@ -2175,14 +2175,14 @@ int mm_take_all_locks(struct mm_struct *
* is reached.
*/
for_each_vma(vmi, vma) {
- if (signal_pending(current))
+ if (fatal_signal_pending(current))
goto out_unlock;
vma_start_write(vma);
}
vma_iter_init(&vmi, mm, 0);
for_each_vma(vmi, vma) {
- if (signal_pending(current))
+ if (fatal_signal_pending(current))
goto out_unlock;
if (vma->vm_file && vma->vm_file->f_mapping &&
is_vm_hugetlb_page(vma))
@@ -2191,7 +2191,7 @@ int mm_take_all_locks(struct mm_struct *
vma_iter_init(&vmi, mm, 0);
for_each_vma(vmi, vma) {
- if (signal_pending(current))
+ if (fatal_signal_pending(current))
goto out_unlock;
if (vma->vm_file && vma->vm_file->f_mapping &&
!is_vm_hugetlb_page(vma))
@@ -2200,7 +2200,7 @@ int mm_take_all_locks(struct mm_struct *
vma_iter_init(&vmi, mm, 0);
for_each_vma(vmi, vma) {
- if (signal_pending(current))
+ if (fatal_signal_pending(current))
goto out_unlock;
if (vma->anon_vma)
list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)