Re: [PATCH v3 2/3] mm: only interrupt taking all mm locks on fatal signal

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Mikulas Patocka <mpatocka@redhat.com>
To: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: "Lorenzo Stoakes" <lorenzo.stoakes@oracle.com>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"David Hildenbrand" <david@redhat.com>,
	amd-gfx@lists.freedesktop.org, linux-mm@kvack.org,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Jann Horn" <jannh@google.com>,
	"Pedro Falcato" <pfalcato@suse.de>
Subject: Re: [PATCH v3 2/3] mm: only interrupt taking all mm locks on fatal signal
Date: Mon, 5 Jan 2026 21:08:31 +0100 (CET)	[thread overview]
Message-ID: <e60858c9-12a6-2b04-35ae-9d676f14db2a@redhat.com> (raw)
In-Reply-To: <7whbqlfrwjr4z2d4bpny3rjyl5tetdyx7ccf52uvby7hgywoym@6l6m2xcytez7>



On Mon, 5 Jan 2026, Liam R. Howlett wrote:

> I may be missing context because I didn't get 1/3 of this patch set,
> nor can I find it in my ML searching.  Nor did I get the cover letter,
> or find it.  Is this series threaded?

You can ignore the patch 1/3, it just changes memory management test 
suite.

> What you are doing is changing a really horrible loop across all VMAs,
> that happens 4 times, to a less interruptible method.
> 
> I'm not sure I'm okay with this.  Everyone else does seem fine with it,
> because userspace basically never checks error codes for a retry (or
> really anything, and sometimes not even for an error at all).

Everyone else does seem fine because they don't use periodic signals :-)

OpenCL is not the first thing that got broken by periodic signals. In the 
past, I found bugs in Linux/alpha, Linux/hppa, Cygwin, Intel Software 
Developer's Emulator regarding periodic signals. fork() was also buggy and 
it was fixed by c3ad2c3b02e953ead2b8d52a0c9e70312930c3d0.

> But this could potentially have larger consequences for those
> applications that register signal handlers, couldn't it?  That is, they
> expect to get a return based on some existing code but now it won't
> return and the application is forced to wait for all locks to be taken
> regardless of how long it takes - or to force kill the application?

Do you have any userspace application that expects open("/dev/kfd") to be 
interrupted with -EINTR and breaks when it isn't?

> We regularly get people requesting the default number of vmas be
> increased.  This means that processes that approach max_map_count will
> wait until 4 loops through the vmas before they can be interrupted, or
> they have to kill the process.
> 
> > 
> > For example, this bug happens when using OpenCL on AMDGPU. Sometimes,
> > probing the OpenCL device fails (strace shows that open("/dev/kfd")
> > failed with -EINTR). Sometimes we get the message "amdgpu:
> > init_user_pages: Failed to register MMU notifier: -4" in the syslog.
> 
> If you only get the error message sometimes, does that mean there is
> another signal check that isn't covered by this change - or another call
> path?

This call path is also triggered by -EINTR from mm_take_all_locks: 
"init_user_pages -> amdgpu_hmm_register -> mmu_interval_notifier_insert -> 
mmu_notifier_register -> __mmu_notifier_register -> mm_take_all_locks -> 
return -EINTR". I am not expert in the GPU code, so I don't know how much 
serious it is.

> > The bug can be reproduced with the following program.
> > 
> > To run this program, you need AMD graphics card and the package
> > "rocm-opencl" installed. You must not have the package "mesa-opencl-icd"
> > installed, because it redirects the default OpenCL implementation to
> > itself.
> 
> I'm not saying it's wrong to change the signal handling, but this is
> very much working around a bug in userspace constantly hammering a task
> with signals and then is surprised there is a response that the kernel
> was interrupted.
> 
> This seems to imply that all signal handling should only happen on fatal
> signals?

No - the kernel should do what applications expect. open is (according to 
the man page) supposed to be interrupted when opening slow devices (for 
example fifo). I'm wondering whether /dev/kfd should be considered a slow 
device or not.

> ...
> > 
> > I'm submitting this patch for the stable kernels, because this bug may
> > cause random failures in any code that calls mm_take_all_locks.
> 
> They aren't random failures, they are a response to a signal sent to the
> process that may be taking a very long time to do something.
> 
> I really don't see how continuously sending signals at a short interval
> interrupting system calls can be considered random failures, especially
> when the return is -EINTR which literally means "Interrupted system
> call".  How else would you interrupt a system call, if not a signal?

The AMDGPU OpenCL implementation attempts to open /dev/kfd and if it gets 
-EINTR, it behaves as if OpenCL were unavailable - it won't report itself 
in clGetPlatformIDs and it will make clGetDeviceIDs fail.

So, we are dealing with random failures - any single signal received at 
wrong time can make OpenCL fail.

Even if I disabled the periodic timer, the failure could be triggered by 
other signals, for example SIGWINCH when the user resizes the terminal, or 
SIGCHLD when a subprocess exits.

> I feel like we are making the code less versatile to work around the
> fact that userspace didn't realise that system calls could be
> interrupted without a fatal signal.
> 
> And from that view, I consider this a functional change and not a bug
> fix.
> 
> Thanks,
> Liam

In practice, I use 10ms timer and I get occasional OpenCL failures. In the 
test example, I used 50us timer, so that it is reproduced reliably - but 
decreasing the timer frequency doesn't fix the failure, it just makes it 
happen less often.

Mikulas

next prev parent reply	other threads:[~2026-01-05 20:08 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-04 21:17 Mikulas Patocka
2026-01-05 10:42 ` Vlastimil Babka
2026-01-05 12:15   ` Lorenzo Stoakes
2026-01-05 18:15 ` Liam R. Howlett
2026-01-05 20:08   ` Mikulas Patocka [this message]
2026-01-06 17:40     ` Liam R. Howlett
2026-01-06 20:19       ` Mikulas Patocka
2026-01-06 21:56         ` Pedro Falcato
2026-01-07 20:14           ` Mikulas Patocka
2026-01-07  8:43         ` Vlastimil Babka
2026-01-07  9:25         ` Michel Dänzer
2026-01-06 11:36   ` Michel Dänzer
2026-01-06 12:52     ` Mikulas Patocka
2026-01-06 15:03       ` David Hildenbrand (Red Hat)
2026-01-07  9:55         ` Vlastimil Babka
2026-01-07 22:19           ` David Hildenbrand (Red Hat)
2026-01-06 14:57     ` Vlastimil Babka
2026-01-07  9:50 ` Christian König

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e60858c9-12a6-2b04-35ae-9d676f14db2a@redhat.com \
    --to=mpatocka@redhat.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=david@redhat.com \
    --cc=jannh@google.com \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=pfalcato@suse.de \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox