From: Mateusz Guzik <mjguzik@gmail.com>
To: Max Kellermann <max.kellermann@ionos.com>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Linux Memory Management List <linux-mm@kvack.org>,
ceph-devel@vger.kernel.org
Subject: Re: Need advice with iput() deadlock during writeback
Date: Wed, 17 Sep 2025 10:59:23 +0200 [thread overview]
Message-ID: <CAGudoHEqNYWMqDiogc9Q_s9QMQHB6Rm_1dUzcC7B0GFBrqS=1g@mail.gmail.com> (raw)
In-Reply-To: <CAKPOu+--m8eppmF5+fofG=AKAMu5K_meF44UH4XiL8V3_X_rJg@mail.gmail.com>
On Wed, Sep 17, 2025 at 10:38 AM Max Kellermann
<max.kellermann@ionos.com> wrote:
>
> On Wed, Sep 17, 2025 at 10:23 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > One of the ways to stall inode teardown is to have writeback running. It
> > does not need a reference because inode_wait_for_writeback() explicitly
> > waits for it like in the very deadlock you encountered.
>
> Ah, right. No UAF. But I wonder if that's the best way to do it - why
> not let writeback hold its own reference, and eliminate
> inode_wait_for_writeback()? (and its latency)
>
There happens to be a temporarily inactive discussion related to it, see:
https://lore.kernel.org/linux-fsdevel/cover.1756222464.git.josef@toxicpanda.com/
but also the followup:
https://lore.kernel.org/linux-fsdevel/eeu47pjcaxkfol2o2bltigfjvrz6eecdjwtilnmnprqh7dhdn7@rqi35ya5ilmv/
The patchset posted there retains inode_wait_for_writeback().
Suppose you are to get rid of it. In that case you have a corner case
where the writeback thread has to issue ->evict_inode() for arbitrary
filesystems, and that's quite a change and I'm not at all convinced
that's safe.
> > However, assuming that's not avoidable, iput_async() or whatever could
> > be added to sort this out in a similar way fput() is.
> >
> > As a temporary bandaid iput() itself could check if I_SYNC is set and if
> > so roll with the iput_async() option.
> >
> > I can cook something up later.
>
> My idea was something like iput_safe() and that function would defer
> the actual iput() call if the reference counter is 1 (i.e. the caller
> is holding the last reference).
>
That's the same as my proposal.
> Almost all iput() calls are fine because they just do an atomic
> decrement, but all kinds of scary stuff can happen if the last
> reference is released. Callers that are not confident that this is
> safe shall then use my new iput_safe() instead of iput().
>
Note that vast majority of real-world calls to iput already come with
a count of 1, but it may be this is not true for ceph.
> I can write such a patch, but I wanted you experts to first confirm
> that this is a good idea that would be acceptable for merging (or
> maybe Ceph is just weird and there's a simpler way to avoid this).
>
So the problem here is where to put linkage for the delegated work.
Another issue is that as is nobody knows who set I_SYNC and that
probably should change on kernel with CONFIG_DEBUG_VFS. As luck would
have it I posted a related patchset here:
https://lore.kernel.org/linux-fsdevel/20250916135900.2170346-1-mjguzik@gmail.com/T/#t
with that in place and debug enabled we can panic early on the first
iput so you don't have to wait to trigger the problem
I suspect the best short-term fix is to implement ceph-private async
iput with linkage coming from struct ceph_inode_info or whatever other
struct applicable.
You can use __fput_deferred() as a reference (put intended). Note this
one assumes the obj is already unrefed, but for iput_async it would be
best to also postpone it to that routine.
A sketch, incomplete:
static DECLARE_DELAYED_WORK(delayed_ceph_iput_work, delayed_ceph_iput);
static void __ceph_iput_async(struct callback_head *work)
{
struct ceph_inode_info *ci = container_of(work, struct
ceph_inode_info, async_task_work);
iput(&ci->netfs.inode);
}
void ceph_iput_async(struct ceph_inode_info *ci)
{
struct inode *inode = &ci->netfs.inode;
if (atomic_add_unless(&inode->i_count, -1, 1))
return;
if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
init_task_work(&ci->async_task_work, __ceph_iput_async);
if (!task_work_add(task, &ci->async_task_work, TWA_RESUME))
return;
}
if (llist_add(&ci->async_llist, &delayed_ceph_iput_list))
schedule_delayed_work(&delayed_ceph_iput_work, 1);
}
next prev parent reply other threads:[~2025-09-17 8:59 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-17 8:07 Max Kellermann
2025-09-17 8:23 ` Mateusz Guzik
2025-09-17 8:38 ` Max Kellermann
2025-09-17 8:59 ` Mateusz Guzik [this message]
2025-09-17 9:20 ` Max Kellermann
2025-09-17 9:32 ` Mateusz Guzik
2025-09-17 12:48 ` Max Kellermann
2025-09-17 20:14 ` Al Viro
2025-09-17 20:19 ` Max Kellermann
2025-09-17 20:29 ` Al Viro
2025-09-17 20:32 ` Max Kellermann
2025-09-17 20:23 ` Mateusz Guzik
2025-09-17 20:34 ` Al Viro
2025-09-17 20:36 ` Max Kellermann
2025-09-17 21:10 ` Al Viro
2025-09-17 21:19 ` Max Kellermann
2025-09-17 21:20 ` Mateusz Guzik
2025-09-17 20:39 ` Mateusz Guzik
2025-09-17 21:02 ` Al Viro
2025-09-17 21:18 ` Mateusz Guzik
2025-09-17 21:42 ` Al Viro
2025-09-17 22:58 ` Mateusz Guzik
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAGudoHEqNYWMqDiogc9Q_s9QMQHB6Rm_1dUzcC7B0GFBrqS=1g@mail.gmail.com' \
--to=mjguzik@gmail.com \
--cc=ceph-devel@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=max.kellermann@ionos.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox