* [LSF/MM/BPF TOPIC] Replacing TASK_(UN)INTERRUPTIBLE with regions of uninterruptibility
@ 2024-02-02 8:51 David Howells
2024-02-02 9:08 ` Miklos Szeredi
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: David Howells @ 2024-02-02 8:51 UTC (permalink / raw)
To: lsf-pc
Cc: dhowells, Matthew Wilcox, Kent Overstreet, dwmw2, linux-fsdevel,
linux-mm
Hi,
We have various locks, mutexes, etc., that are taken on entry to filesystem
code, for example, and a bunch of them are taken interruptibly or killably (or
ought to be) - but filesystem code might be called into from uninterruptible
code, such as the memory allocator, fscache, etc..
Does it make sense to replace TASK_{INTERRUPTIBLE,KILLABLE,UNINTERRUPTIBLE}
with begin/end functions that define the area of uninterruptibility? The
regions would need to handle being nested, so maybe some sort of counter in
the task_struct counter would work.
David
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Replacing TASK_(UN)INTERRUPTIBLE with regions of uninterruptibility
2024-02-02 8:51 [LSF/MM/BPF TOPIC] Replacing TASK_(UN)INTERRUPTIBLE with regions of uninterruptibility David Howells
@ 2024-02-02 9:08 ` Miklos Szeredi
2024-02-02 9:43 ` Kent Overstreet
2024-02-02 10:30 ` David Howells
2024-02-02 13:28 ` Matthew Wilcox
2 siblings, 1 reply; 11+ messages in thread
From: Miklos Szeredi @ 2024-02-02 9:08 UTC (permalink / raw)
To: David Howells
Cc: lsf-pc, Matthew Wilcox, Kent Overstreet, dwmw2, linux-fsdevel, linux-mm
On Fri, 2 Feb 2024 at 09:51, David Howells <dhowells@redhat.com> wrote:
>
> Hi,
>
> We have various locks, mutexes, etc., that are taken on entry to filesystem
> code, for example, and a bunch of them are taken interruptibly or killably (or
> ought to be) - but filesystem code might be called into from uninterruptible
> code, such as the memory allocator, fscache, etc..
Are you suggesting to make lots more filesystem/vfs/mm sleeps
killable? That would present problems with being called from certain
contexts.
Or are there bugs already?
Thanks,
Miklos
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Replacing TASK_(UN)INTERRUPTIBLE with regions of uninterruptibility
2024-02-02 9:08 ` Miklos Szeredi
@ 2024-02-02 9:43 ` Kent Overstreet
0 siblings, 0 replies; 11+ messages in thread
From: Kent Overstreet @ 2024-02-02 9:43 UTC (permalink / raw)
To: Miklos Szeredi
Cc: David Howells, lsf-pc, Matthew Wilcox, dwmw2, linux-fsdevel, linux-mm
On Fri, Feb 02, 2024 at 10:08:59AM +0100, Miklos Szeredi wrote:
> On Fri, 2 Feb 2024 at 09:51, David Howells <dhowells@redhat.com> wrote:
> >
> > Hi,
> >
> > We have various locks, mutexes, etc., that are taken on entry to filesystem
> > code, for example, and a bunch of them are taken interruptibly or killably (or
> > ought to be) - but filesystem code might be called into from uninterruptible
> > code, such as the memory allocator, fscache, etc..
>
> Are you suggesting to make lots more filesystem/vfs/mm sleeps
> killable? That would present problems with being called from certain
> contexts.
>
> Or are there bugs already?
I believe it's both.
Potentially we could get rid of e.g. mutex_lock_interruptible() and
mutex_lock_killable() so our API surface goes down, and I've seen at
least one bug where we were checking for a signal pending and bailing
out where we really didn't want to be.
I think we'd need the new API prototyped in order to have a concrete
discussion though.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Replacing TASK_(UN)INTERRUPTIBLE with regions of uninterruptibility
2024-02-02 8:51 [LSF/MM/BPF TOPIC] Replacing TASK_(UN)INTERRUPTIBLE with regions of uninterruptibility David Howells
2024-02-02 9:08 ` Miklos Szeredi
@ 2024-02-02 10:30 ` David Howells
2024-02-02 10:46 ` Miklos Szeredi
2024-02-02 11:22 ` David Howells
2024-02-02 13:28 ` Matthew Wilcox
2 siblings, 2 replies; 11+ messages in thread
From: David Howells @ 2024-02-02 10:30 UTC (permalink / raw)
To: Miklos Szeredi
Cc: dhowells, lsf-pc, Matthew Wilcox, Kent Overstreet, dwmw2,
linux-fsdevel, linux-mm
Miklos Szeredi <miklos@szeredi.hu> wrote:
> > We have various locks, mutexes, etc., that are taken on entry to
> > filesystem code, for example, and a bunch of them are taken interruptibly
> > or killably (or ought to be) - but filesystem code might be called into
> > from uninterruptible code, such as the memory allocator, fscache, etc..
>
> Are you suggesting to make lots more filesystem/vfs/mm sleeps
> killable? That would present problems with being called from certain
> contexts.
No, it wouldn't. What I'm suggesting is something like:
overlayfs_mkdir(inode)
{
inode_lock(inode); <--- This could be interruptible
...
begin_task_uninterruptible();
...
do_stuff();
...
inode->lower->inode->mkdir(inode->lower->inode);
^--- say ext4_mkdir
...
do_more_stuff();
end_task_uninterruptible();
inode_unlock(inode);
}
ext4_mkdir(inode)
{
inode_lock(inode); <--- This would be interruptible, but
called from overlayfs above is now
uninterruptible
...
}
You bracket the context where interruptibility is bad and then everyone you
call is forced to be uninterruptible. This would need to be nestable also:
begin_task_uninterruptible();
begin_task_uninterruptible();
begin_task_uninterruptible();
...
end_task_uninterruptible(); // must not become interruptible
end_task_uninterruptible(); // nor this
end_task_uninterruptible(); // from here *could* be interruptible again
Obviously, we would need to gradate this to accommodate killability also.
David
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Replacing TASK_(UN)INTERRUPTIBLE with regions of uninterruptibility
2024-02-02 10:30 ` David Howells
@ 2024-02-02 10:46 ` Miklos Szeredi
2024-02-02 11:22 ` David Howells
1 sibling, 0 replies; 11+ messages in thread
From: Miklos Szeredi @ 2024-02-02 10:46 UTC (permalink / raw)
To: David Howells
Cc: lsf-pc, Matthew Wilcox, Kent Overstreet, dwmw2, linux-fsdevel, linux-mm
On Fri, 2 Feb 2024 at 11:30, David Howells <dhowells@redhat.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > > We have various locks, mutexes, etc., that are taken on entry to
> > > filesystem code, for example, and a bunch of them are taken interruptibly
> > > or killably (or ought to be) - but filesystem code might be called into
> > > from uninterruptible code, such as the memory allocator, fscache, etc..
> >
> > Are you suggesting to make lots more filesystem/vfs/mm sleeps
> > killable? That would present problems with being called from certain
> > contexts.
>
> No, it wouldn't. What I'm suggesting is something like:
>
> overlayfs_mkdir(inode)
> {
> inode_lock(inode); <--- This could be interruptible
Just making inode_lock() interruptible would break everything.
So I assume this is not what you meant, but that we add error handling
to each and every inode_lock() call?
> ...
> begin_task_uninterruptible();
Yes, I understand that this is your suggestion.
For overlayfs it doesn't really make sense, but for network fs and
fuse I guess it could be interesting. I would have thought that
making all fs related sleeping locks interruptible is not something
that would be worth the effort, but maybe you have a very good use
case.
Thanks,
Miklos
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Replacing TASK_(UN)INTERRUPTIBLE with regions of uninterruptibility
2024-02-02 10:30 ` David Howells
2024-02-02 10:46 ` Miklos Szeredi
@ 2024-02-02 11:22 ` David Howells
2024-02-02 12:06 ` Miklos Szeredi
` (2 more replies)
1 sibling, 3 replies; 11+ messages in thread
From: David Howells @ 2024-02-02 11:22 UTC (permalink / raw)
To: Miklos Szeredi
Cc: dhowells, lsf-pc, Matthew Wilcox, Kent Overstreet, dwmw2,
linux-fsdevel, linux-mm
Miklos Szeredi <miklos@szeredi.hu> wrote:
> Just making inode_lock() interruptible would break everything.
Why? Obviously, you'd need to check the result of the inode_lock(), which I
didn't put in my very rough example code, but why would taking the lock at the
front of a vfs op like mkdir be a problem?
> For overlayfs it doesn't really make sense, but for network fs and
> fuse I guess it could be interesting.
But overlayfs calls down into other filesystems - and those might be, say,
network filesystems that want to be interruptible.
David
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Replacing TASK_(UN)INTERRUPTIBLE with regions of uninterruptibility
2024-02-02 11:22 ` David Howells
@ 2024-02-02 12:06 ` Miklos Szeredi
2024-02-02 12:44 ` Kent Overstreet
2024-02-02 16:23 ` Al Viro
2 siblings, 0 replies; 11+ messages in thread
From: Miklos Szeredi @ 2024-02-02 12:06 UTC (permalink / raw)
To: David Howells
Cc: lsf-pc, Matthew Wilcox, Kent Overstreet, dwmw2, linux-fsdevel, linux-mm
On Fri, 2 Feb 2024 at 12:22, David Howells <dhowells@redhat.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > Just making inode_lock() interruptible would break everything.
>
> Why? Obviously, you'd need to check the result of the inode_lock(), which I
> didn't put in my very rough example code, but why would taking the lock at the
> front of a vfs op like mkdir be a problem?
No problem in theory. It just looks like a tedious job doing all that.
But please go for it, it's a worthy goal in my opinion.
Thanks,
Miklos
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Replacing TASK_(UN)INTERRUPTIBLE with regions of uninterruptibility
2024-02-02 11:22 ` David Howells
2024-02-02 12:06 ` Miklos Szeredi
@ 2024-02-02 12:44 ` Kent Overstreet
2024-02-02 16:23 ` Al Viro
2 siblings, 0 replies; 11+ messages in thread
From: Kent Overstreet @ 2024-02-02 12:44 UTC (permalink / raw)
To: David Howells
Cc: Miklos Szeredi, lsf-pc, Matthew Wilcox, dwmw2, linux-fsdevel, linux-mm
On Fri, Feb 02, 2024 at 11:22:15AM +0000, David Howells wrote:
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > Just making inode_lock() interruptible would break everything.
>
> Why? Obviously, you'd need to check the result of the inode_lock(), which I
> didn't put in my very rough example code, but why would taking the lock at the
> front of a vfs op like mkdir be a problem?
Existing callers don't check for errors, so
maybe-interruptible-depending-on-context has to be a new function.
> > For overlayfs it doesn't really make sense, but for network fs and
> > fuse I guess it could be interesting.
>
> But overlayfs calls down into other filesystems - and those might be, say,
> network filesystems that want to be interruptible.
yup, and our interruptible vs. non interruptible stuff has always been a
wacky patchwork
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Replacing TASK_(UN)INTERRUPTIBLE with regions of uninterruptibility
2024-02-02 8:51 [LSF/MM/BPF TOPIC] Replacing TASK_(UN)INTERRUPTIBLE with regions of uninterruptibility David Howells
2024-02-02 9:08 ` Miklos Szeredi
2024-02-02 10:30 ` David Howells
@ 2024-02-02 13:28 ` Matthew Wilcox
2 siblings, 0 replies; 11+ messages in thread
From: Matthew Wilcox @ 2024-02-02 13:28 UTC (permalink / raw)
To: David Howells; +Cc: lsf-pc, Kent Overstreet, dwmw2, linux-fsdevel, linux-mm
On Fri, Feb 02, 2024 at 08:51:22AM +0000, David Howells wrote:
> Hi,
>
> We have various locks, mutexes, etc., that are taken on entry to filesystem
> code, for example, and a bunch of them are taken interruptibly or killably (or
> ought to be) - but filesystem code might be called into from uninterruptible
> code, such as the memory allocator, fscache, etc..
>
> Does it make sense to replace TASK_{INTERRUPTIBLE,KILLABLE,UNINTERRUPTIBLE}
> with begin/end functions that define the area of uninterruptibility? The
> regions would need to handle being nested, so maybe some sort of counter in
> the task_struct counter would work.
I don't think filesystems ever want to do stuff interruptible.
That allows any signal, including SIGALRM, SIGWINCH or SIGUSR[12] to
interrupt your sleep, and user code just isn't written to handle that
(short reads, etc).
But killable is useful. If I hit ^C, I want the task to die. Maybe some
of its resources hang around futilely waiting for a packet that will never
arrive; so be it. So I would not define inode_lock_interruptible().
But I would define inode_lock_killable(). I'd be happy to see that
overridden by a task_set_uninterruptible(), but we do need a new API to
give filesystems a chance to handle "I didn't get the lock, abort".
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Replacing TASK_(UN)INTERRUPTIBLE with regions of uninterruptibility
2024-02-02 11:22 ` David Howells
2024-02-02 12:06 ` Miklos Szeredi
2024-02-02 12:44 ` Kent Overstreet
@ 2024-02-02 16:23 ` Al Viro
2024-02-03 17:27 ` Kent Overstreet
2 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2024-02-02 16:23 UTC (permalink / raw)
To: David Howells
Cc: Miklos Szeredi, lsf-pc, Matthew Wilcox, Kent Overstreet, dwmw2,
linux-fsdevel, linux-mm
On Fri, Feb 02, 2024 at 11:22:15AM +0000, David Howells wrote:
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > Just making inode_lock() interruptible would break everything.
>
> Why? Obviously, you'd need to check the result of the inode_lock(), which I
> didn't put in my very rough example code, but why would taking the lock at the
> front of a vfs op like mkdir be a problem?
Plenty of new failure exits to maintain?
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Replacing TASK_(UN)INTERRUPTIBLE with regions of uninterruptibility
2024-02-02 16:23 ` Al Viro
@ 2024-02-03 17:27 ` Kent Overstreet
0 siblings, 0 replies; 11+ messages in thread
From: Kent Overstreet @ 2024-02-03 17:27 UTC (permalink / raw)
To: Al Viro
Cc: David Howells, Miklos Szeredi, lsf-pc, Matthew Wilcox, dwmw2,
linux-fsdevel, linux-mm, Dave Chinner
On Fri, Feb 02, 2024 at 04:23:46PM +0000, Al Viro wrote:
> On Fri, Feb 02, 2024 at 11:22:15AM +0000, David Howells wrote:
> > Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > > Just making inode_lock() interruptible would break everything.
> >
> > Why? Obviously, you'd need to check the result of the inode_lock(), which I
> > didn't put in my very rough example code, but why would taking the lock at the
> > front of a vfs op like mkdir be a problem?
>
> Plenty of new failure exits to maintain?
I don't currently see a reason to go around converting existing
uninterruptible sleeps; the main benefit of the proposal as I see it
would be that we could mark sleeps as either interruptible or killable
correctly, since that really depends on what syscall we're in and what
userspace is expecting. If kernel code can correctly do one it can do
both, so this is a pretty straightforward change.
But it is an interesting idea, I'd be curious to see what comes out of
playing around with some refactorings.
There's some other wait_event() related ideas kicking around too...
Willy and Dave and I were talking about the "asynchronous waits" that
io_uring is wanting to do - I believe this is currently just done in an
ad-hoc way for waiting on a folio lock.
It seemed like it might be possible to do this in a more generic way by
simply dynamically allocating the waitlist entry, and signalling via
task_struct the wait/wakeup should be delivered to a kiocb, instead of
to a thread.
Another thing I've been wanting to do is embed a sequence number in
wait_queue_head_t, which would be incremented on wakeup. This would
change prepare_to_wait() to "read current sequence number", then later
we sleep until the sequence number has changed from what we initially
read.
This would let us fix double expansion of the wait condition in the
wait_event() macros, and it would also mean we're not flipping task
state before running the cond expression...
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2024-02-03 17:27 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-02 8:51 [LSF/MM/BPF TOPIC] Replacing TASK_(UN)INTERRUPTIBLE with regions of uninterruptibility David Howells
2024-02-02 9:08 ` Miklos Szeredi
2024-02-02 9:43 ` Kent Overstreet
2024-02-02 10:30 ` David Howells
2024-02-02 10:46 ` Miklos Szeredi
2024-02-02 11:22 ` David Howells
2024-02-02 12:06 ` Miklos Szeredi
2024-02-02 12:44 ` Kent Overstreet
2024-02-02 16:23 ` Al Viro
2024-02-03 17:27 ` Kent Overstreet
2024-02-02 13:28 ` Matthew Wilcox
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox