From: Mikulas Patocka <mpatocka@redhat.com>
To: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: "Lorenzo Stoakes" <lorenzo.stoakes@oracle.com>,
"Alex Deucher" <alexander.deucher@amd.com>,
"Christian König" <christian.koenig@amd.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
"David Hildenbrand" <david@redhat.com>,
amd-gfx@lists.freedesktop.org, linux-mm@kvack.org,
"Vlastimil Babka" <vbabka@suse.cz>,
"Jann Horn" <jannh@google.com>,
"Pedro Falcato" <pfalcato@suse.de>
Subject: Re: [PATCH v3 2/3] mm: only interrupt taking all mm locks on fatal signal
Date: Tue, 6 Jan 2026 21:19:59 +0100 (CET) [thread overview]
Message-ID: <6633f8ed-f432-f4c4-3fe2-8c14248cadab@redhat.com> (raw)
In-Reply-To: <j3dy3g5mchtdzxldtnqu5nwaalbr6ec4ceim3nuu6nwcddmqjc@7dgzr4m7pli2>
On Tue, 6 Jan 2026, Liam R. Howlett wrote:
> * Mikulas Patocka <mpatocka@redhat.com> [260105 15:08]:
> >
> > > If you only get the error message sometimes, does that mean there is
> > > another signal check that isn't covered by this change - or another call
> > > path?
> >
> > This call path is also triggered by -EINTR from mm_take_all_locks:
> > "init_user_pages -> amdgpu_hmm_register -> mmu_interval_notifier_insert ->
> > mmu_notifier_register -> __mmu_notifier_register -> mm_take_all_locks ->
> > return -EINTR". I am not expert in the GPU code, so I don't know how much
> > serious it is.
>
> Okay, so the other call paths also end up getting the -EINTR from this
> function? Can you please add that detail to the commit message?
Yes. I'd like to ask the GPU people to look at it and say how much damage
this -EINTR could do. I don't know - I just saw the messages "Failed to
register MMU notifier: -4" in the syslog.
> This means that -EINTR can no longer be returned from open(), right?
> Otherwise you are just reducing a race condition between open() and a
> signal entering from your timer.
EINTR can be returned from open() in cases when it was historically
behaving this way - such as opening a fifo when there is no matching
process having it open.
But I think that opening /dev/kfd doesn't fall into this category.
NFS has an "intr" flag that makes the filesystem syscalls interruptible by
signals. It is off by default, because many programs don't expect EINTR
when opening, reading or writing plain files on a filesystem.
> Any other -EINTR system call will also cause you problems since you
> continuously send signals to your process, so we'll have to change them
> all for this to work?
I use SA_RESTART for the signals. And I retry all the syscalls on EINTR
just in case SA_RESTART didn't work. So, I don't experience random
failures in my code due to the periodic signal.
But there is code that I have no control over - such as the OpenCL shared
library.
> This is the userspace ignoring what the error code means and just
> aborting on any error. This is a change in behaviour on the kernel side
> to work around what they are doing.
>
> It also sounds like it can be avoided by userspace not sending signals
> during the open process, or even to
So far, I worked around this issue by blocking all signals around
clGetPlatformIDs and clGetDeviceIDs - but this is a hack.
> retry at a higher level if a recoverable error occurs.
If clGetDeviceIDs fails and I call clGetDeviceIDs again, it doesn't even
attempt to open /dev/kfd again and fails right away. So, I can't work
around it by retrying it.
> > Even if I disabled the periodic timer, the failure could be triggered by
> > other signals, for example SIGWINCH when the user resizes the terminal, or
> > SIGCHLD when a subprocess exits.
>
> Those are also not random, they are expected signals caused by events.
From the process's point of view, they are random - the process doesn't
know when the user will drag the corner of the terminal window and resize
it. If the process spawns a subprocess, it cannot predict when will the
subprocess exit and SIGCHLD will be delivered.
If we don't change it, we end up with unreliable software stack that can
fail during rare events, such as dragging the corner of the window.
> I'm trying to say this git commit message is wrong and misleading.
OK, so I'll try to rewrite the commit message and submit version 4 of the
patch.
Mikulas
next prev parent reply other threads:[~2026-01-06 20:20 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-01-04 21:17 Mikulas Patocka
2026-01-05 10:42 ` Vlastimil Babka
2026-01-05 12:15 ` Lorenzo Stoakes
2026-01-05 18:15 ` Liam R. Howlett
2026-01-05 20:08 ` Mikulas Patocka
2026-01-06 17:40 ` Liam R. Howlett
2026-01-06 20:19 ` Mikulas Patocka [this message]
2026-01-06 21:56 ` Pedro Falcato
2026-01-07 20:14 ` Mikulas Patocka
2026-01-07 8:43 ` Vlastimil Babka
2026-01-07 9:25 ` Michel Dänzer
2026-01-06 11:36 ` Michel Dänzer
2026-01-06 12:52 ` Mikulas Patocka
2026-01-06 15:03 ` David Hildenbrand (Red Hat)
2026-01-07 9:55 ` Vlastimil Babka
2026-01-07 22:19 ` David Hildenbrand (Red Hat)
2026-01-06 14:57 ` Vlastimil Babka
2026-01-07 9:50 ` Christian König
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=6633f8ed-f432-f4c4-3fe2-8c14248cadab@redhat.com \
--to=mpatocka@redhat.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=alexander.deucher@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=christian.koenig@amd.com \
--cc=david@redhat.com \
--cc=jannh@google.com \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=pfalcato@suse.de \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox