* Need advice with iput() deadlock during writeback
@ 2025-09-17 8:07 Max Kellermann
2025-09-17 8:23 ` Mateusz Guzik
0 siblings, 1 reply; 22+ messages in thread
From: Max Kellermann @ 2025-09-17 8:07 UTC (permalink / raw)
To: linux-fsdevel, Linux Memory Management List, ceph-devel
Hi,
I am currently hunting several deadlock bugs in the Ceph filesystem
that have been causing server downtimes repeatedly.
One of the deadlocks looks like this:
INFO: task kworker/u777:6:1270802 blocked for more than 122 seconds.
Not tainted 6.16.7-i1-es #773
task:kworker/u777:6 state:D stack:0 pid:1270802 tgid:1270802
ppid:2 task_flags:0x4208060 flags:0x00004000
Workqueue: writeback wb_workfn (flush-ceph-3)
Call Trace:
<TASK>
__schedule+0x4ea/0x17d0
schedule+0x1c/0xc0
inode_wait_for_writeback+0x71/0xb0
evict+0xcf/0x200
ceph_put_wrbuffer_cap_refs+0xdd/0x220
ceph_invalidate_folio+0x97/0xc0
ceph_writepages_start+0x127b/0x14d0
do_writepages+0xba/0x150
__writeback_single_inode+0x34/0x290
writeback_sb_inodes+0x203/0x470
__writeback_inodes_wb+0x4c/0xe0
wb_writeback+0x189/0x2b0
wb_workfn+0x30b/0x3d0
process_one_work+0x143/0x2b0
There's a writeback, and during that writeback, Ceph invokes iput()
releasing the last reference to that inode; iput() sees there's
pending writeback and waits for writeback to complete. But there's
nobody who will ever be able to finish writeback, because this is the
very thread that is supposed to finish writeback, so it's waiting for
itself.
It seems to me that iput() is a rather dangerous function because it
can easily block for a long time, and must never be called while
holding any lock. I wonder if all iput() callers are aware of this...
Anyway, I was wondering who is usually supposed to hold the inode
reference during writeback. If there is pending writeback, somebody
must still have a reference, or else the inode could have been evicted
before writeback even started - does that lead to UAF when writeback
actually happens?
One idea would be to postpone iput() calls to a workqueue to have it
in a different, safe context. Of course, that sounds overhead - and it
feels like a lousy kludge. There must be another way, a canonical
approach to avoiding this deadlock. I have a feeling that Ceph is
behaving weirdly, that Ceph is "holding it wrong".
I tried to trace ext4 writeback but found the inode reference counter
to be 1, the only reference being held by the dcache. But what if I
flush the dcache in the middle of writeback... I don't get it.
FS and MM experts - please help me understand how this is supposed to work.
Max
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 8:07 Need advice with iput() deadlock during writeback Max Kellermann
@ 2025-09-17 8:23 ` Mateusz Guzik
2025-09-17 8:38 ` Max Kellermann
0 siblings, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2025-09-17 8:23 UTC (permalink / raw)
To: Max Kellermann; +Cc: linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 10:07:11AM +0200, Max Kellermann wrote:
> Hi,
>
> I am currently hunting several deadlock bugs in the Ceph filesystem
> that have been causing server downtimes repeatedly.
>
> One of the deadlocks looks like this:
>
> INFO: task kworker/u777:6:1270802 blocked for more than 122 seconds.
> Not tainted 6.16.7-i1-es #773
> task:kworker/u777:6 state:D stack:0 pid:1270802 tgid:1270802
> ppid:2 task_flags:0x4208060 flags:0x00004000
> Workqueue: writeback wb_workfn (flush-ceph-3)
> Call Trace:
> <TASK>
> __schedule+0x4ea/0x17d0
> schedule+0x1c/0xc0
> inode_wait_for_writeback+0x71/0xb0
> evict+0xcf/0x200
> ceph_put_wrbuffer_cap_refs+0xdd/0x220
> ceph_invalidate_folio+0x97/0xc0
> ceph_writepages_start+0x127b/0x14d0
> do_writepages+0xba/0x150
> __writeback_single_inode+0x34/0x290
> writeback_sb_inodes+0x203/0x470
> __writeback_inodes_wb+0x4c/0xe0
> wb_writeback+0x189/0x2b0
> wb_workfn+0x30b/0x3d0
> process_one_work+0x143/0x2b0
>
> There's a writeback, and during that writeback, Ceph invokes iput()
> releasing the last reference to that inode; iput() sees there's
> pending writeback and waits for writeback to complete. But there's
> nobody who will ever be able to finish writeback, because this is the
> very thread that is supposed to finish writeback, so it's waiting for
> itself.
>
So that we are clear, this is a legally held ref by ceph and you are
legally releasing it? It's not that the code assumes there is a ref
because it came from writeback?
> Anyway, I was wondering who is usually supposed to hold the inode
> reference during writeback. If there is pending writeback, somebody
> must still have a reference, or else the inode could have been evicted
> before writeback even started - does that lead to UAF when writeback
> actually happens?
>
One of the ways to stall inode teardown is to have writeback running. It
does not need a reference because inode_wait_for_writeback() explicitly
waits for it like in the very deadlock you encountered.
> One idea would be to postpone iput() calls to a workqueue to have it
> in a different, safe context. Of course, that sounds overhead - and it
> feels like a lousy kludge. There must be another way, a canonical
> approach to avoiding this deadlock. I have a feeling that Ceph is
> behaving weirdly, that Ceph is "holding it wrong".
Doing it *by default* is indeed a no-go.
I don't know what other filesystems are doing, I would consider iput()
from writeback to be a bug.
However, assuming that's not avoidable, iput_async() or whatever could
be added to sort this out in a similar way fput() is.
As a temporary bandaid iput() itself could check if I_SYNC is set and if
so roll with the iput_async() option.
I can cook something up later.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 8:23 ` Mateusz Guzik
@ 2025-09-17 8:38 ` Max Kellermann
2025-09-17 8:59 ` Mateusz Guzik
0 siblings, 1 reply; 22+ messages in thread
From: Max Kellermann @ 2025-09-17 8:38 UTC (permalink / raw)
To: Mateusz Guzik; +Cc: linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 10:23 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> So that we are clear, this is a legally held ref by ceph and you are
> legally releasing it? It's not that the code assumes there is a ref
> because it came from writeback?
Porbably yes, but I'm not 100% sure - Ceph code looks weird to me. A
reference is released that was previously taken by
ceph_take_cap_refs(), but there's another condition that causes an
iput() call - if ceph_try_drop_cap_snap() returns true - but I don't
see where that reference was taken.
> One of the ways to stall inode teardown is to have writeback running. It
> does not need a reference because inode_wait_for_writeback() explicitly
> waits for it like in the very deadlock you encountered.
Ah, right. No UAF. But I wonder if that's the best way to do it - why
not let writeback hold its own reference, and eliminate
inode_wait_for_writeback()? (and its latency)
> However, assuming that's not avoidable, iput_async() or whatever could
> be added to sort this out in a similar way fput() is.
>
> As a temporary bandaid iput() itself could check if I_SYNC is set and if
> so roll with the iput_async() option.
>
> I can cook something up later.
My idea was something like iput_safe() and that function would defer
the actual iput() call if the reference counter is 1 (i.e. the caller
is holding the last reference).
That would avoid the other Ceph deadlock bug I found - because Ceph,
like all filesystems that are built on top netfs, uses
netfs_wait_for_outstanding_io() in the evict_inode callback. Because
guess what happens when the Ceph messenger worker that handles I/O
completion decides to call iput()...
Due to the evict_inode callback and the unknowns hidden behind it,
checking I_SYNC is not enough.
Almost all iput() calls are fine because they just do an atomic
decrement, but all kinds of scary stuff can happen if the last
reference is released. Callers that are not confident that this is
safe shall then use my new iput_safe() instead of iput().
I can write such a patch, but I wanted you experts to first confirm
that this is a good idea that would be acceptable for merging (or
maybe Ceph is just weird and there's a simpler way to avoid this).
Max
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 8:38 ` Max Kellermann
@ 2025-09-17 8:59 ` Mateusz Guzik
2025-09-17 9:20 ` Max Kellermann
2025-09-17 20:14 ` Al Viro
0 siblings, 2 replies; 22+ messages in thread
From: Mateusz Guzik @ 2025-09-17 8:59 UTC (permalink / raw)
To: Max Kellermann; +Cc: linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 10:38 AM Max Kellermann
<max.kellermann@ionos.com> wrote:
>
> On Wed, Sep 17, 2025 at 10:23 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > One of the ways to stall inode teardown is to have writeback running. It
> > does not need a reference because inode_wait_for_writeback() explicitly
> > waits for it like in the very deadlock you encountered.
>
> Ah, right. No UAF. But I wonder if that's the best way to do it - why
> not let writeback hold its own reference, and eliminate
> inode_wait_for_writeback()? (and its latency)
>
There happens to be a temporarily inactive discussion related to it, see:
https://lore.kernel.org/linux-fsdevel/cover.1756222464.git.josef@toxicpanda.com/
but also the followup:
https://lore.kernel.org/linux-fsdevel/eeu47pjcaxkfol2o2bltigfjvrz6eecdjwtilnmnprqh7dhdn7@rqi35ya5ilmv/
The patchset posted there retains inode_wait_for_writeback().
Suppose you are to get rid of it. In that case you have a corner case
where the writeback thread has to issue ->evict_inode() for arbitrary
filesystems, and that's quite a change and I'm not at all convinced
that's safe.
> > However, assuming that's not avoidable, iput_async() or whatever could
> > be added to sort this out in a similar way fput() is.
> >
> > As a temporary bandaid iput() itself could check if I_SYNC is set and if
> > so roll with the iput_async() option.
> >
> > I can cook something up later.
>
> My idea was something like iput_safe() and that function would defer
> the actual iput() call if the reference counter is 1 (i.e. the caller
> is holding the last reference).
>
That's the same as my proposal.
> Almost all iput() calls are fine because they just do an atomic
> decrement, but all kinds of scary stuff can happen if the last
> reference is released. Callers that are not confident that this is
> safe shall then use my new iput_safe() instead of iput().
>
Note that vast majority of real-world calls to iput already come with
a count of 1, but it may be this is not true for ceph.
> I can write such a patch, but I wanted you experts to first confirm
> that this is a good idea that would be acceptable for merging (or
> maybe Ceph is just weird and there's a simpler way to avoid this).
>
So the problem here is where to put linkage for the delegated work.
Another issue is that as is nobody knows who set I_SYNC and that
probably should change on kernel with CONFIG_DEBUG_VFS. As luck would
have it I posted a related patchset here:
https://lore.kernel.org/linux-fsdevel/20250916135900.2170346-1-mjguzik@gmail.com/T/#t
with that in place and debug enabled we can panic early on the first
iput so you don't have to wait to trigger the problem
I suspect the best short-term fix is to implement ceph-private async
iput with linkage coming from struct ceph_inode_info or whatever other
struct applicable.
You can use __fput_deferred() as a reference (put intended). Note this
one assumes the obj is already unrefed, but for iput_async it would be
best to also postpone it to that routine.
A sketch, incomplete:
static DECLARE_DELAYED_WORK(delayed_ceph_iput_work, delayed_ceph_iput);
static void __ceph_iput_async(struct callback_head *work)
{
struct ceph_inode_info *ci = container_of(work, struct
ceph_inode_info, async_task_work);
iput(&ci->netfs.inode);
}
void ceph_iput_async(struct ceph_inode_info *ci)
{
struct inode *inode = &ci->netfs.inode;
if (atomic_add_unless(&inode->i_count, -1, 1))
return;
if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
init_task_work(&ci->async_task_work, __ceph_iput_async);
if (!task_work_add(task, &ci->async_task_work, TWA_RESUME))
return;
}
if (llist_add(&ci->async_llist, &delayed_ceph_iput_list))
schedule_delayed_work(&delayed_ceph_iput_work, 1);
}
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 8:59 ` Mateusz Guzik
@ 2025-09-17 9:20 ` Max Kellermann
2025-09-17 9:32 ` Mateusz Guzik
2025-09-17 12:48 ` Max Kellermann
2025-09-17 20:14 ` Al Viro
1 sibling, 2 replies; 22+ messages in thread
From: Max Kellermann @ 2025-09-17 9:20 UTC (permalink / raw)
To: Mateusz Guzik; +Cc: linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 10:59 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> There happens to be a temporarily inactive discussion related to it, see:
> https://lore.kernel.org/linux-fsdevel/cover.1756222464.git.josef@toxicpanda.com/
>
> but also the followup:
> https://lore.kernel.org/linux-fsdevel/eeu47pjcaxkfol2o2bltigfjvrz6eecdjwtilnmnprqh7dhdn7@rqi35ya5ilmv/
>
> The patchset posted there retains inode_wait_for_writeback().
That is indeed a very interesting thread tackling a very similar
problem. I guess I can learn a bit from the discussion.
> > My idea was something like iput_safe() and that function would defer
> > the actual iput() call if the reference counter is 1 (i.e. the caller
> > is holding the last reference).
> >
>
> That's the same as my proposal.
The real difference (aside from naming) is that I wanted to change
only callers in unsafe contexts to the new function. But I guess most
people calling iput() are not aware of its dangers and if we look
closer, more existing bugs may be revealed.
For example, the Ceph bugs only occur under memory pressure (via
memcg) - only when the dcache happens to be flushed and the process
doing the writes had already exited, thus nobody else was still
holding a reference to the inode. These are rare circumstances for
normal people, but on our servers, that happens all the time.
> Note that vast majority of real-world calls to iput already come with
> a count of 1, but it may be this is not true for ceph.
Not my experience - I traced iput() and found that this was very rare
- because the dcache is almost always holding a reference and inodes
are only ever evicted if the dcache decides to drop them.
> I suspect the best short-term fix is to implement ceph-private async
> iput with linkage coming from struct ceph_inode_info or whatever other
> struct applicable.
I had already started writing exactly this, very similar to your
sketch. That's what I'm going to finish now - and it will produce a
patch that will hopefully be appropriate for a stable backport. This
Ceph deadlock bug appears to affect all Linux versions.
> if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> init_task_work(&ci->async_task_work, __ceph_iput_async);
> if (!task_work_add(task, &ci->async_task_work, TWA_RESUME))
> return;
> }
This part isn't useful for inodes, is it? I suppose this code exists
in fput() only to guarantee that all file handles are really closed
before returning to userspace, right? And we don't need that for
inodes?
Thanks for your helpful advice, Mateusz!
Max
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 9:20 ` Max Kellermann
@ 2025-09-17 9:32 ` Mateusz Guzik
2025-09-17 12:48 ` Max Kellermann
1 sibling, 0 replies; 22+ messages in thread
From: Mateusz Guzik @ 2025-09-17 9:32 UTC (permalink / raw)
To: Max Kellermann; +Cc: linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 11:20 AM Max Kellermann
<max.kellermann@ionos.com> wrote:
>
> On Wed, Sep 17, 2025 at 10:59 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > My idea was something like iput_safe() and that function would defer
> > > the actual iput() call if the reference counter is 1 (i.e. the caller
> > > is holding the last reference).
> > >
> >
> > That's the same as my proposal.
>
> The real difference (aside from naming) is that I wanted to change
> only callers in unsafe contexts to the new function. But I guess most
> people calling iput() are not aware of its dangers and if we look
> closer, more existing bugs may be revealed.
>
I noted iput() handling this is a possibility, but also that it would
be best avoided.
We are in agreement here mate.
> > Note that vast majority of real-world calls to iput already come with
> > a count of 1, but it may be this is not true for ceph.
>
> Not my experience - I traced iput() and found that this was very rare
> - because the dcache is almost always holding a reference and inodes
> are only ever evicted if the dcache decides to drop them.
>
Most of the calls I had seen are from dcache. ;-)
> > I suspect the best short-term fix is to implement ceph-private async
> > iput with linkage coming from struct ceph_inode_info or whatever other
> > struct applicable.
>
> I had already started writing exactly this, very similar to your
> sketch. That's what I'm going to finish now - and it will produce a
> patch that will hopefully be appropriate for a stable backport. This
> Ceph deadlock bug appears to affect all Linux versions.
>
Sounds like a plan. After the inode_state_ accessor thing is sorted
out I'll add the diagnostics to catch unsafe iput() use.
So I had a look at inode layout with pahole and there is a pluggable
8-byte hole in it.
llist takes 8 bytes, so it can just fit right in without growing the
struct above what it is now.
Unfortunately task_work is 16 bytes, so embedding that sucker would
grow the struct but that's perhaps tolerable. Not my call.
If making sure to postpone the last unref there is no way to union
this with anything that I can see as the inode must remain safe to use
-- someone could have picked it up.
Maybe something could be figured out if iput_async already unrefs, but
this would require fuckery with flags to make sure nobody messes with
the inode.
> > if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> > init_task_work(&ci->async_task_work, __ceph_iput_async);
> > if (!task_work_add(task, &ci->async_task_work, TWA_RESUME))
> > return;
> > }
>
> This part isn't useful for inodes, is it? I suppose this code exists
> in fput() only to guarantee that all file handles are really closed
> before returning to userspace, right? And we don't need that for
> inodes?
>
No, the fput thing is to avoid a problem of a similar nature. As
*final* fput can start taking arbitrary locks, go to sleep or use a
lot of stack, it is woefully unsafe to be called from arbitrary
places. The current machinery guarantees anything other than an atomic
decrement is postponed to syscall boundary or a task queue if the
former is not possible so that these are not a factor.
Postponing to syscall boundary as opposed to blindly queueing up makes
the "right" thread do the work.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 9:20 ` Max Kellermann
2025-09-17 9:32 ` Mateusz Guzik
@ 2025-09-17 12:48 ` Max Kellermann
1 sibling, 0 replies; 22+ messages in thread
From: Max Kellermann @ 2025-09-17 12:48 UTC (permalink / raw)
To: Mateusz Guzik; +Cc: linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 11:20 AM Max Kellermann
<max.kellermann@ionos.com> wrote:
> I had already started writing exactly this, very similar to your
> sketch.
I just submitted the patch, and it was even simpler than my first
draft, because I could use the existing work_struct in ceph_inode_info
and donate the inode reference to it.
I'd welcome your opinion on this approach.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 8:59 ` Mateusz Guzik
2025-09-17 9:20 ` Max Kellermann
@ 2025-09-17 20:14 ` Al Viro
2025-09-17 20:19 ` Max Kellermann
2025-09-17 20:23 ` Mateusz Guzik
1 sibling, 2 replies; 22+ messages in thread
From: Al Viro @ 2025-09-17 20:14 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Max Kellermann, linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 10:59:23AM +0200, Mateusz Guzik wrote:
> A sketch, incomplete:
> static DECLARE_DELAYED_WORK(delayed_ceph_iput_work, delayed_ceph_iput);
>
> static void __ceph_iput_async(struct callback_head *work)
> {
> struct ceph_inode_info *ci = container_of(work, struct
> ceph_inode_info, async_task_work);
> iput(&ci->netfs.inode);
> }
>
> void ceph_iput_async(struct ceph_inode_info *ci)
> {
> struct inode *inode = &ci->netfs.inode;
>
> if (atomic_add_unless(&inode->i_count, -1, 1))
> return;
>
> if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> init_task_work(&ci->async_task_work, __ceph_iput_async);
> if (!task_work_add(task, &ci->async_task_work, TWA_RESUME))
> return;
> }
>
> if (llist_add(&ci->async_llist, &delayed_ceph_iput_list))
> schedule_delayed_work(&delayed_ceph_iput_work, 1);
> }
Looks rather dangerous - what do you do on fs shutdown?
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 20:14 ` Al Viro
@ 2025-09-17 20:19 ` Max Kellermann
2025-09-17 20:29 ` Al Viro
2025-09-17 20:23 ` Mateusz Guzik
1 sibling, 1 reply; 22+ messages in thread
From: Max Kellermann @ 2025-09-17 20:19 UTC (permalink / raw)
To: Al Viro
Cc: Mateusz Guzik, linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 10:14 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> Looks rather dangerous - what do you do on fs shutdown?
Sorry, I'm new to this, I don't know how fs shutdown works - stupid
question: is my code any more dangerous than what's already happening
with ceph_queue_inode_work()?
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 20:14 ` Al Viro
2025-09-17 20:19 ` Max Kellermann
@ 2025-09-17 20:23 ` Mateusz Guzik
2025-09-17 20:34 ` Al Viro
1 sibling, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2025-09-17 20:23 UTC (permalink / raw)
To: Al Viro
Cc: Max Kellermann, linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 10:14 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Sep 17, 2025 at 10:59:23AM +0200, Mateusz Guzik wrote:
>
> > A sketch, incomplete:
> > static DECLARE_DELAYED_WORK(delayed_ceph_iput_work, delayed_ceph_iput);
> >
> > static void __ceph_iput_async(struct callback_head *work)
> > {
> > struct ceph_inode_info *ci = container_of(work, struct
> > ceph_inode_info, async_task_work);
> > iput(&ci->netfs.inode);
> > }
> >
> > void ceph_iput_async(struct ceph_inode_info *ci)
> > {
> > struct inode *inode = &ci->netfs.inode;
> >
> > if (atomic_add_unless(&inode->i_count, -1, 1))
> > return;
> >
> > if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> > init_task_work(&ci->async_task_work, __ceph_iput_async);
> > if (!task_work_add(task, &ci->async_task_work, TWA_RESUME))
> > return;
> > }
> >
> > if (llist_add(&ci->async_llist, &delayed_ceph_iput_list))
> > schedule_delayed_work(&delayed_ceph_iput_work, 1);
> > }
>
> Looks rather dangerous - what do you do on fs shutdown?
Can you elaborate?
This should be equivalent to some random piece of code holding onto a
reference for a time.
I would expect whatever unmount/other teardown would proceed after it
gets rid of it.
Although for the queue at hand something can force flush it.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 20:19 ` Max Kellermann
@ 2025-09-17 20:29 ` Al Viro
2025-09-17 20:32 ` Max Kellermann
0 siblings, 1 reply; 22+ messages in thread
From: Al Viro @ 2025-09-17 20:29 UTC (permalink / raw)
To: Max Kellermann
Cc: Mateusz Guzik, linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 10:19:27PM +0200, Max Kellermann wrote:
> On Wed, Sep 17, 2025 at 10:14 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > Looks rather dangerous - what do you do on fs shutdown?
>
> Sorry, I'm new to this, I don't know how fs shutdown works - stupid
> question: is my code any more dangerous than what's already happening
> with ceph_queue_inode_work()?
umount /wherever/the/fuck/it/is/mounted
calls umount(2), which removes the mount from the tree, then calls
deactivate_super(), dropping the active reference to superblock.
If it hadn't been mounted elsewhere, that's the last reference and
we this:
shrinker_free(s->s_shrink);
fs->kill_sb(s);
kill_super_notify(s);
/*
* Since list_lru_destroy() may sleep, we cannot call it from
* put_super(), where we hold the sb_lock. Therefore we destroy
* the lru lists right now.
*/
list_lru_destroy(&s->s_dentry_lru);
list_lru_destroy(&s->s_inode_lru);
put_filesystem(fs);
put_super(s);
At some point ->kill_sb() will call generic_shutdown_super() (in case of ceph
that's done via kill_anon_super()), where we get to evict_inodes(). Any
busy inode at that point is a bad problem...
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 20:29 ` Al Viro
@ 2025-09-17 20:32 ` Max Kellermann
0 siblings, 0 replies; 22+ messages in thread
From: Max Kellermann @ 2025-09-17 20:32 UTC (permalink / raw)
To: Al Viro
Cc: Mateusz Guzik, linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 10:29 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> At some point ->kill_sb() will call generic_shutdown_super() (in case of ceph
> that's done via kill_anon_super()), where we get to evict_inodes(). Any
> busy inode at that point is a bad problem...
And before that, Ceph will call flush_fs_workqueues().
But again: if this were wrong, would this be a new bug in my patch, or
would this be an old existing bug? Knowing this is important for me to
understand the nature of the (potential) problem you're raising.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 20:23 ` Mateusz Guzik
@ 2025-09-17 20:34 ` Al Viro
2025-09-17 20:36 ` Max Kellermann
2025-09-17 20:39 ` Mateusz Guzik
0 siblings, 2 replies; 22+ messages in thread
From: Al Viro @ 2025-09-17 20:34 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Max Kellermann, linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 10:23:00PM +0200, Mateusz Guzik wrote:
> This should be equivalent to some random piece of code holding onto a
> reference for a time.
As in "Busy inodes after unmount"?
> I would expect whatever unmount/other teardown would proceed after it
> gets rid of it.
Gets rid of it how, exactly?
> Although for the queue at hand something can force flush it.
Suppose two threads do umount() on two different filesystems. The first
one to flush picks *everything* you've delayed and starts handling that.
The second sees nothing to do and proceeds to taking the filesystem
it's unmounting apart, right under the nose of the first thread doing
work on both filesystems...
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 20:34 ` Al Viro
@ 2025-09-17 20:36 ` Max Kellermann
2025-09-17 21:10 ` Al Viro
2025-09-17 20:39 ` Mateusz Guzik
1 sibling, 1 reply; 22+ messages in thread
From: Max Kellermann @ 2025-09-17 20:36 UTC (permalink / raw)
To: Al Viro
Cc: Mateusz Guzik, linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 10:34 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> Suppose two threads do umount() on two different filesystems. The first
> one to flush picks *everything* you've delayed and starts handling that.
> The second sees nothing to do and proceeds to taking the filesystem
> it's unmounting apart, right under the nose of the first thread doing
> work on both filesystems...
Each filesystem (struct ceph_fs_client) has its own inode_wq.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 20:34 ` Al Viro
2025-09-17 20:36 ` Max Kellermann
@ 2025-09-17 20:39 ` Mateusz Guzik
2025-09-17 21:02 ` Al Viro
1 sibling, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2025-09-17 20:39 UTC (permalink / raw)
To: Al Viro
Cc: Max Kellermann, linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 10:34 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Sep 17, 2025 at 10:23:00PM +0200, Mateusz Guzik wrote:
>
> > This should be equivalent to some random piece of code holding onto a
> > reference for a time.
>
> As in "Busy inodes after unmount"?
>
Where I'm from (the BSD land) vnodes (the "inode" equivalent) are the
base object if you will.
If there are busy inodes and you want to unmount, it fails. If you
force an unmount, there is some shenanigans, but it ultimately waits
for inodes to disappear.
Linux has to have something of the sort for dentries, otherwise the
current fput stuff would not be safe. I find it surprising to learn
inodes are treated differently.
> > I would expect whatever unmount/other teardown would proceed after it
> > gets rid of it.
>
> Gets rid of it how, exactly?
>
In this context I mean whatever code holding onto it stopped.
> > Although for the queue at hand something can force flush it.
>
> Suppose two threads do umount() on two different filesystems. The first
> one to flush picks *everything* you've delayed and starts handling that.
> The second sees nothing to do and proceeds to taking the filesystem
> it's unmounting apart, right under the nose of the first thread doing
> work on both filesystems...
Per the above, the assumption was unmount would stall waiting for
these inodes to get processed.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 20:39 ` Mateusz Guzik
@ 2025-09-17 21:02 ` Al Viro
2025-09-17 21:18 ` Mateusz Guzik
2025-09-17 21:42 ` Al Viro
0 siblings, 2 replies; 22+ messages in thread
From: Al Viro @ 2025-09-17 21:02 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Max Kellermann, linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 10:39:22PM +0200, Mateusz Guzik wrote:
> Linux has to have something of the sort for dentries, otherwise the
> current fput stuff would not be safe. I find it surprising to learn
> inodes are treated differently.
If you are looking at vnode counterparts, dentries are closer to that.
Inodes are secondary.
And no, it's not a "wait for references to go away" - every file holds
a _pair_ of references, one to mount and another to dentry.
Additional references to mount => umount() gets -EBUSY, lazy umount()
(with MNT_DETACH) gets the sucker removed from the mount tree, with
shutdown deferred (at least) until the last reference to mount goes away.
Once the mount refcount hits zero and the damn thing gets taken apart,
an active reference to superblock (i.e. to filesystem instance) is
dropped.
If that was not the last one (e.g. it's mounted elsewhere as well), we
are not waiting for anything. If it *was* the last active ref, we
shut the filesystem instance down; that's _it_ - once you are into
->kill_sb(), it's all over.
Linux VFS is seriously different from Heidemann's-derived ones you'll find in
BSD land these days. Different taxonomy of objects, among other things...
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 20:36 ` Max Kellermann
@ 2025-09-17 21:10 ` Al Viro
2025-09-17 21:19 ` Max Kellermann
0 siblings, 1 reply; 22+ messages in thread
From: Al Viro @ 2025-09-17 21:10 UTC (permalink / raw)
To: Max Kellermann
Cc: Mateusz Guzik, linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 10:36:48PM +0200, Max Kellermann wrote:
> On Wed, Sep 17, 2025 at 10:34 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > Suppose two threads do umount() on two different filesystems. The first
> > one to flush picks *everything* you've delayed and starts handling that.
> > The second sees nothing to do and proceeds to taking the filesystem
> > it's unmounting apart, right under the nose of the first thread doing
> > work on both filesystems...
>
> Each filesystem (struct ceph_fs_client) has its own inode_wq.
Yes, but
if (llist_add(&ci->async_llist, &delayed_ceph_iput_list))
schedule_delayed_work(&delayed_ceph_iput_work, 1);
won't have anything to do with that.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 21:02 ` Al Viro
@ 2025-09-17 21:18 ` Mateusz Guzik
2025-09-17 21:42 ` Al Viro
1 sibling, 0 replies; 22+ messages in thread
From: Mateusz Guzik @ 2025-09-17 21:18 UTC (permalink / raw)
To: Al Viro
Cc: Max Kellermann, linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 11:02 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Sep 17, 2025 at 10:39:22PM +0200, Mateusz Guzik wrote:
>
> > Linux has to have something of the sort for dentries, otherwise the
> > current fput stuff would not be safe. I find it surprising to learn
> > inodes are treated differently.
>
> If you are looking at vnode counterparts, dentries are closer to that.
> Inodes are secondary.
>
> And no, it's not a "wait for references to go away" - every file holds
> a _pair_ of references, one to mount and another to dentry.
>
That I know, I just blindly assumed inodes also stall teardown in some
way. That said I see generic_shutdown_super() and indeed it works as
you described and now as I assumed.
> Additional references to mount => umount() gets -EBUSY, lazy umount()
> (with MNT_DETACH) gets the sucker removed from the mount tree, with
> shutdown deferred (at least) until the last reference to mount goes away.
>
> Once the mount refcount hits zero and the damn thing gets taken apart,
> an active reference to superblock (i.e. to filesystem instance) is
> dropped.
>
> If that was not the last one (e.g. it's mounted elsewhere as well), we
> are not waiting for anything. If it *was* the last active ref, we
> shut the filesystem instance down; that's _it_ - once you are into
> ->kill_sb(), it's all over.
>
Ye I see. Thanks for the breakdown.
> Linux VFS is seriously different from Heidemann's-derived ones you'll find in
> BSD land these days. Different taxonomy of objects, among other things...
So I keep finding.
But in this case I have to mention btrfs has some hand-rolled delayed
iput (see btrfs_run_delayed_iputs et al), perhaps something you may
want to take look at if you had not.
More importantly though, that's 2 filesystems which would like to do
async iput. Would be nice(tm) if the layer provided a helper to do it
safely.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 21:10 ` Al Viro
@ 2025-09-17 21:19 ` Max Kellermann
2025-09-17 21:20 ` Mateusz Guzik
0 siblings, 1 reply; 22+ messages in thread
From: Max Kellermann @ 2025-09-17 21:19 UTC (permalink / raw)
To: Al Viro
Cc: Mateusz Guzik, linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 11:10 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > Each filesystem (struct ceph_fs_client) has its own inode_wq.
>
> Yes, but
> if (llist_add(&ci->async_llist, &delayed_ceph_iput_list))
> schedule_delayed_work(&delayed_ceph_iput_work, 1);
> won't have anything to do with that.
Mateusz did not mention that the list must be flushed on umount, but
that's what "incomplete sketch" means.
(The patch I submitted uses inode_wq, but that's a different thread.)
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 21:19 ` Max Kellermann
@ 2025-09-17 21:20 ` Mateusz Guzik
0 siblings, 0 replies; 22+ messages in thread
From: Mateusz Guzik @ 2025-09-17 21:20 UTC (permalink / raw)
To: Max Kellermann
Cc: Al Viro, linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 11:19 PM Max Kellermann
<max.kellermann@ionos.com> wrote:
>
> On Wed, Sep 17, 2025 at 11:10 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > Each filesystem (struct ceph_fs_client) has its own inode_wq.
> >
> > Yes, but
> > if (llist_add(&ci->async_llist, &delayed_ceph_iput_list))
> > schedule_delayed_work(&delayed_ceph_iput_work, 1);
> > won't have anything to do with that.
>
> Mateusz did not mention that the list must be flushed on umount, but
> that's what "incomplete sketch" means.
>
> (The patch I submitted uses inode_wq, but that's a different thread.)
I assumed someone would flush it to speed up the unmount.
I fully admit I did not realize it was a correctness issue here (see
my other mail).
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 21:02 ` Al Viro
2025-09-17 21:18 ` Mateusz Guzik
@ 2025-09-17 21:42 ` Al Viro
2025-09-17 22:58 ` Mateusz Guzik
1 sibling, 1 reply; 22+ messages in thread
From: Al Viro @ 2025-09-17 21:42 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Max Kellermann, linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 10:02:41PM +0100, Al Viro wrote:
> On Wed, Sep 17, 2025 at 10:39:22PM +0200, Mateusz Guzik wrote:
>
> > Linux has to have something of the sort for dentries, otherwise the
> > current fput stuff would not be safe. I find it surprising to learn
> > inodes are treated differently.
>
> If you are looking at vnode counterparts, dentries are closer to that.
> Inodes are secondary.
>
> And no, it's not a "wait for references to go away" - every file holds
> a _pair_ of references, one to mount and another to dentry.
>
> Additional references to mount => umount() gets -EBUSY, lazy umount()
> (with MNT_DETACH) gets the sucker removed from the mount tree, with
> shutdown deferred (at least) until the last reference to mount goes away.
>
> Once the mount refcount hits zero and the damn thing gets taken apart,
> an active reference to superblock (i.e. to filesystem instance) is
> dropped.
>
> If that was not the last one (e.g. it's mounted elsewhere as well), we
> are not waiting for anything. If it *was* the last active ref, we
> shut the filesystem instance down; that's _it_ - once you are into
> ->kill_sb(), it's all over.
>
> Linux VFS is seriously different from Heidemann's-derived ones you'll find in
> BSD land these days. Different taxonomy of objects, among other things...
FWIW, the basic overview of objects:
super_block: filesystem instance. Two refcounts (passive and active, having
positive active refcount counts as one passive reference). Shutdown when
active refcount gets to zero; freeing of in-core struct super_block - when
passive gets there.
mount: a subtree of an active filesystem. Most of them are in mount tree(s),
but they might exist on their own - e.g. pipefs one, etc. Has a refcount,
bears an active reference to fs instance (super_block) *and* a reference to
a dentry belonging to that instance - root of the (sub)tree visible in
it. Shutdown when refcount hits zero. Being in mount tree contributes
to refcount; that contribution goes away when it's detached from the tree
(on umount, normally). Refcount is responsible for -EBUSY from non-lazy
umount; lazy one (umount -l, umount2(path, MNT_DETACH)) dissolves the entire
subtree that used to be mounted at that point and shuts down everything
that had refcounts reach zero, leaving the rest until their refcounts drop
to zero too. Shutdown drops the superblock and root dentry refs.
inode & dentry: that's what vnodes map onto. Dentry is the main object,
inode is secondary. Each belongs to a specific fs instance for the entire
lifetime. Dentries form a forest; inodes are attached to some of them.
Details are a lot more involved than anything that would fit into a short
overview. Both are refcounted, attaching dentry to an inode contributes
1 to inode's refcount. Child dentry contributes 1 to refcount of parent.
Shutdown does *not* happen until the dentry refcount hits zero; once it's
zero, the normal policy is "keep it around if it's still hashed", but
filesystem may say "no point keeping it". Memory pressure => kill the
ones with zero refcount (and if their parents had been pinned only by
those children, take the parents out as well, etc.). Filesystem shutdown =>
kick out everything with zero refcount, complain if anything's left after
that (shrink_dcache_for_umount() does it, so if filesystem kept anything
pinned internally, it would better drop those before we get to that
point). evict_inodes() does the same to inodes.
file: the usual; open IO channel, as on any Unix. Carries a reference to
dentry and to mount. Shutdown happens when refcount goes to zero, normally
delayed until return to userland, when we are on shallow stack and without
any locks held. Incidentally, sockets and pipes come with those as well -
none of the "sockets don't have a vnode" headache.
cwd (and process's root as well): a pair of mount and dentry references.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Need advice with iput() deadlock during writeback
2025-09-17 21:42 ` Al Viro
@ 2025-09-17 22:58 ` Mateusz Guzik
0 siblings, 0 replies; 22+ messages in thread
From: Mateusz Guzik @ 2025-09-17 22:58 UTC (permalink / raw)
To: Al Viro
Cc: Max Kellermann, linux-fsdevel, Linux Memory Management List, ceph-devel
On Wed, Sep 17, 2025 at 11:42 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Sep 17, 2025 at 10:02:41PM +0100, Al Viro wrote:
> > On Wed, Sep 17, 2025 at 10:39:22PM +0200, Mateusz Guzik wrote:
> >
> > > Linux has to have something of the sort for dentries, otherwise the
> > > current fput stuff would not be safe. I find it surprising to learn
> > > inodes are treated differently.
> >
> > If you are looking at vnode counterparts, dentries are closer to that.
> > Inodes are secondary.
> >
> > And no, it's not a "wait for references to go away" - every file holds
> > a _pair_ of references, one to mount and another to dentry.
> >
> > Additional references to mount => umount() gets -EBUSY, lazy umount()
> > (with MNT_DETACH) gets the sucker removed from the mount tree, with
> > shutdown deferred (at least) until the last reference to mount goes away.
> >
> > Once the mount refcount hits zero and the damn thing gets taken apart,
> > an active reference to superblock (i.e. to filesystem instance) is
> > dropped.
> >
> > If that was not the last one (e.g. it's mounted elsewhere as well), we
> > are not waiting for anything. If it *was* the last active ref, we
> > shut the filesystem instance down; that's _it_ - once you are into
> > ->kill_sb(), it's all over.
> >
> > Linux VFS is seriously different from Heidemann's-derived ones you'll find in
> > BSD land these days. Different taxonomy of objects, among other things...
>
> FWIW, the basic overview of objects:
>
> super_block: filesystem instance. Two refcounts (passive and active, having
> positive active refcount counts as one passive reference). Shutdown when
> active refcount gets to zero; freeing of in-core struct super_block - when
> passive gets there.
>
> mount: a subtree of an active filesystem. Most of them are in mount tree(s),
> but they might exist on their own - e.g. pipefs one, etc. Has a refcount,
> bears an active reference to fs instance (super_block) *and* a reference to
> a dentry belonging to that instance - root of the (sub)tree visible in
> it. Shutdown when refcount hits zero. Being in mount tree contributes
> to refcount; that contribution goes away when it's detached from the tree
> (on umount, normally). Refcount is responsible for -EBUSY from non-lazy
> umount; lazy one (umount -l, umount2(path, MNT_DETACH)) dissolves the entire
> subtree that used to be mounted at that point and shuts down everything
> that had refcounts reach zero, leaving the rest until their refcounts drop
> to zero too. Shutdown drops the superblock and root dentry refs.
>
> inode & dentry: that's what vnodes map onto. Dentry is the main object,
> inode is secondary. Each belongs to a specific fs instance for the entire
> lifetime. Dentries form a forest; inodes are attached to some of them.
> Details are a lot more involved than anything that would fit into a short
> overview. Both are refcounted, attaching dentry to an inode contributes
> 1 to inode's refcount. Child dentry contributes 1 to refcount of parent.
> Shutdown does *not* happen until the dentry refcount hits zero; once it's
> zero, the normal policy is "keep it around if it's still hashed", but
> filesystem may say "no point keeping it". Memory pressure => kill the
> ones with zero refcount (and if their parents had been pinned only by
> those children, take the parents out as well, etc.). Filesystem shutdown =>
> kick out everything with zero refcount, complain if anything's left after
> that (shrink_dcache_for_umount() does it, so if filesystem kept anything
> pinned internally, it would better drop those before we get to that
> point). evict_inodes() does the same to inodes.
>
> file: the usual; open IO channel, as on any Unix. Carries a reference to
> dentry and to mount. Shutdown happens when refcount goes to zero, normally
> delayed until return to userland, when we are on shallow stack and without
> any locks held. Incidentally, sockets and pipes come with those as well -
> none of the "sockets don't have a vnode" headache.
>
> cwd (and process's root as well): a pair of mount and dentry references.
I groked most of it from my prior poking around, thanks for the write up though.
The real question though is how can a filesystem safely manage keeping
extra refs on inodes vs unmount. Per your explanation the usual safety
net does not apply. Frankly it makes igrab/iput sound very dangerous
in their own right.
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2025-09-17 22:58 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-17 8:07 Need advice with iput() deadlock during writeback Max Kellermann
2025-09-17 8:23 ` Mateusz Guzik
2025-09-17 8:38 ` Max Kellermann
2025-09-17 8:59 ` Mateusz Guzik
2025-09-17 9:20 ` Max Kellermann
2025-09-17 9:32 ` Mateusz Guzik
2025-09-17 12:48 ` Max Kellermann
2025-09-17 20:14 ` Al Viro
2025-09-17 20:19 ` Max Kellermann
2025-09-17 20:29 ` Al Viro
2025-09-17 20:32 ` Max Kellermann
2025-09-17 20:23 ` Mateusz Guzik
2025-09-17 20:34 ` Al Viro
2025-09-17 20:36 ` Max Kellermann
2025-09-17 21:10 ` Al Viro
2025-09-17 21:19 ` Max Kellermann
2025-09-17 21:20 ` Mateusz Guzik
2025-09-17 20:39 ` Mateusz Guzik
2025-09-17 21:02 ` Al Viro
2025-09-17 21:18 ` Mateusz Guzik
2025-09-17 21:42 ` Al Viro
2025-09-17 22:58 ` Mateusz Guzik
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox