Re: Need advice with iput() deadlock during writeback

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Mateusz Guzik <mjguzik@gmail.com>
To: Max Kellermann <max.kellermann@ionos.com>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	 Linux Memory Management List <linux-mm@kvack.org>,
	ceph-devel@vger.kernel.org
Subject: Re: Need advice with iput() deadlock during writeback
Date: Wed, 17 Sep 2025 11:32:16 +0200	[thread overview]
Message-ID: <CAGudoHFNKkjjqR=JYQdFZ2cBgSnBsSnzX6njwDtmj40ZDxJ+jg@mail.gmail.com> (raw)
In-Reply-To: <CAKPOu+_B=0G-csXEw2OshD6ZJm0+Ex9dRNf6bHpVuQFgBB7-Zw@mail.gmail.com>

On Wed, Sep 17, 2025 at 11:20 AM Max Kellermann
<max.kellermann@ionos.com> wrote:
>
> On Wed, Sep 17, 2025 at 10:59 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > My idea was something like iput_safe() and that function would defer
> > > the actual iput() call if the reference counter is 1 (i.e. the caller
> > > is holding the last reference).
> > >
> >
> > That's the same as my proposal.
>
> The real difference (aside from naming) is that I wanted to change
> only callers in unsafe contexts to the new function. But I guess most
> people calling iput() are not aware of its dangers and if we look
> closer, more existing bugs may be revealed.
>

I noted iput() handling this is a possibility, but also that it would
be best avoided.

We are in agreement here mate.

> > Note  that vast majority of real-world calls to iput already come with
> > a count of 1, but it may be this is not true for ceph.
>
> Not my experience - I traced iput() and found that this was very rare
> - because the dcache is almost always holding a reference and inodes
> are only ever evicted if the dcache decides to drop them.
>

Most of the calls I had seen are from dcache. ;-)

> > I suspect the best short-term fix is to implement ceph-private async
> > iput with linkage coming from struct ceph_inode_info or whatever other
> > struct applicable.
>
> I had already started writing exactly this, very similar to your
> sketch. That's what I'm going to finish now - and it will produce a
> patch that will hopefully be appropriate for a stable backport. This
> Ceph deadlock bug appears to affect all Linux versions.
>

Sounds like a plan. After the inode_state_ accessor thing is sorted
out I'll add the diagnostics to catch unsafe iput() use.

So I had a look at inode layout with pahole and there is a pluggable
8-byte hole in it.

llist takes 8 bytes, so it can just fit right in without growing the
struct above what it is now.

Unfortunately task_work is 16 bytes, so embedding that sucker would
grow the struct but that's perhaps tolerable. Not my call.

If making sure to postpone the last unref there is no way to union
this with anything that I can see as the inode must remain safe to use
-- someone could have picked it up.

Maybe something could be figured out if iput_async already unrefs, but
this would require fuckery with flags to make sure nobody messes with
the inode.

> >         if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> >                 init_task_work(&ci->async_task_work, __ceph_iput_async);
> >                 if (!task_work_add(task, &ci->async_task_work, TWA_RESUME))
> >                         return;
> >         }
>
> This part isn't useful for inodes, is it? I suppose this code exists
> in fput() only to guarantee that all file handles are really closed
> before returning to userspace, right? And we don't need that for
> inodes?
>

No, the fput thing is to avoid a problem of a similar nature. As
*final* fput can start taking arbitrary locks, go to sleep or use a
lot of stack, it is woefully unsafe to be called from arbitrary
places. The current machinery guarantees anything other than an atomic
decrement is postponed to syscall boundary or a task queue if the
former is not possible so that these are not a factor.

Postponing to syscall boundary as opposed to blindly queueing up makes
the "right" thread do the work.

next prev parent reply	other threads:[~2025-09-17  9:32 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-17  8:07 Max Kellermann
2025-09-17  8:23 ` Mateusz Guzik
2025-09-17  8:38   ` Max Kellermann
2025-09-17  8:59     ` Mateusz Guzik
2025-09-17  9:20       ` Max Kellermann
2025-09-17  9:32         ` Mateusz Guzik [this message]
2025-09-17 12:48         ` Max Kellermann
2025-09-17 20:14       ` Al Viro
2025-09-17 20:19         ` Max Kellermann
2025-09-17 20:29           ` Al Viro
2025-09-17 20:32             ` Max Kellermann
2025-09-17 20:23         ` Mateusz Guzik
2025-09-17 20:34           ` Al Viro
2025-09-17 20:36             ` Max Kellermann
2025-09-17 21:10               ` Al Viro
2025-09-17 21:19                 ` Max Kellermann
2025-09-17 21:20                   ` Mateusz Guzik
2025-09-17 20:39             ` Mateusz Guzik
2025-09-17 21:02               ` Al Viro
2025-09-17 21:18                 ` Mateusz Guzik
2025-09-17 21:42                 ` Al Viro
2025-09-17 22:58                   ` Mateusz Guzik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAGudoHFNKkjjqR=JYQdFZ2cBgSnBsSnzX6njwDtmj40ZDxJ+jg@mail.gmail.com' \
    --to=mjguzik@gmail.com \
    --cc=ceph-devel@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=max.kellermann@ionos.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox