linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Christian Theune <ct@flyingcircus.io>
Cc: Dave Chinner <david@fromorbit.com>,
	Matthew Wilcox <willy@infradead.org>, Chris Mason <clm@meta.com>,
	 Jens Axboe <axboe@kernel.dk>,
	linux-mm@kvack.org,
	 "linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org,  linux-kernel@vger.kernel.org,
	Daniel Dao <dqminh@cloudflare.com>,
	 regressions@lists.linux.dev, regressions@leemhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)
Date: Mon, 30 Sep 2024 11:46:30 -0700	[thread overview]
Message-ID: <CAHk-=wj6YRm2fpYHjZxNfKCC_N+X=T=ay+69g7tJ2cnziYT8=g@mail.gmail.com> (raw)
In-Reply-To: <02121707-E630-4E7E-837B-8F53B4C28721@flyingcircus.io>

On Mon, 30 Sept 2024 at 10:35, Christian Theune <ct@flyingcircus.io> wrote:
>
> Sep 27 00:51:20 <redactedhostname>13 kernel:  folio_wait_bit_common+0x13f/0x340
> Sep 27 00:51:20 <redactedhostname>13 kernel:  folio_wait_writeback+0x2b/0x80

Gaah. Every single case you point to is that folio_wait_writeback() case.

And this might be an old old annoyance.

folio_wait_writeback() is insane. It does

        while (folio_test_writeback(folio)) {
                trace_folio_wait_writeback(folio, folio_mapping(folio));
                folio_wait_bit(folio, PG_writeback);
        }

and the reason that is insane is that PG_writeback isn't some kind of
exclusive state. So folio_wait_bit() will return once somebody has
ended writeback, but *new* writeback can easily have been started
afterwards. So then we go back to wait...

And even after it eventually returns (possibly after having waited for
hundreds of other processes writing back that folio - imagine lots of
other threads doing writes to it and 'fdatasync()' or whatever) the
caller *still* can't actually assume that the writeback bit is clear,
because somebody else might have started writeback again.

Anyway, it's insane, but it's insane for a *reason*. We've tried to
fix this before, long before it was a folio op. See commit
c2407cf7d22d ("mm: make wait_on_page_writeback() wait for multiple
pending writebacks").

IOW, this code is known-broken and might have extreme unfairness
issues (although I had blissfully forgotten about it), because while
the actual writeback *bit* itself is set and cleared atomically, the
wakeup for the bit is asynchronous and can be delayed almost
arbitrarily, so you can get basically spurious wakeups that were from
a previous bit clear.

So the "wait many times" is crazy, but it's sadly a necessary crazy as
things are right now.

Now, many callers hold the page lock while doing this, and in that
case new writeback cases shouldn't happen, and so repeating the loop
should be extremely limited.

But "many" is not "all". For example, __filemap_fdatawait_range() very
much doesn't hold the lock on the pages it waits for, so afaik this
can cause that unfairness and starvation issue.

That said, while every one of your traces are for that
folio_wait_writeback(), the last one is for the truncate case, and
that one *does* hold the page lock and so shouldn't see this potential
unfairness issue.

So the code here is questionable, and might cause some issues, but the
starvation of folio_wait_writeback() can't explain _all_ the cases you
see.

                  Linus


  reply	other threads:[~2024-09-30 18:46 UTC|newest]

Thread overview: 81+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-12 21:18 Christian Theune
2024-09-12 21:55 ` Matthew Wilcox
2024-09-12 22:11   ` Christian Theune
2024-09-12 22:12   ` Jens Axboe
2024-09-12 22:25     ` Linus Torvalds
2024-09-12 22:30       ` Jens Axboe
2024-09-12 22:56         ` Linus Torvalds
2024-09-13  3:44           ` Matthew Wilcox
2024-09-13 13:23             ` Christian Theune
2024-09-13 12:11       ` Christian Brauner
2024-09-16 13:29         ` Matthew Wilcox
2024-09-18  9:51           ` Christian Brauner
2024-09-13 15:30       ` Chris Mason
2024-09-13 15:51         ` Matthew Wilcox
2024-09-13 16:33           ` Chris Mason
2024-09-13 18:15             ` Matthew Wilcox
2024-09-13 21:24               ` Linus Torvalds
2024-09-13 21:30                 ` Matthew Wilcox
2024-09-13 16:04       ` David Howells
2024-09-13 16:37         ` Chris Mason
2024-09-16  0:00       ` Dave Chinner
2024-09-16  4:20         ` Linus Torvalds
2024-09-16  8:47           ` Chris Mason
2024-09-17  9:32             ` Matthew Wilcox
2024-09-17  9:36               ` Chris Mason
2024-09-17 10:11               ` Christian Theune
2024-09-17 11:13               ` Chris Mason
2024-09-17 13:25                 ` Matthew Wilcox
2024-09-18  6:37                   ` Jens Axboe
2024-09-18  9:28                     ` Chris Mason
2024-09-18 12:23                       ` Chris Mason
2024-09-18 13:34                       ` Matthew Wilcox
2024-09-18 13:51                         ` Linus Torvalds
2024-09-18 14:12                           ` Matthew Wilcox
2024-09-18 14:39                             ` Linus Torvalds
2024-09-18 17:12                               ` Matthew Wilcox
2024-09-18 16:37                             ` Chris Mason
2024-09-19  1:43                         ` Dave Chinner
2024-09-19  3:03                           ` Linus Torvalds
2024-09-19  3:12                             ` Linus Torvalds
2024-09-19  3:38                               ` Jens Axboe
2024-09-19  4:32                                 ` Linus Torvalds
2024-09-19  4:42                                   ` Jens Axboe
2024-09-19  4:36                                 ` Matthew Wilcox
2024-09-19  4:46                                   ` Jens Axboe
2024-09-19  5:20                                     ` Jens Axboe
2024-09-19  4:46                                   ` Linus Torvalds
2024-09-20 13:54                                   ` Chris Mason
2024-09-24 15:58                                     ` Matthew Wilcox
2024-09-24 17:16                                     ` Sam James
2024-09-25 16:06                                       ` Kairui Song
2024-09-25 16:42                                         ` Christian Theune
2024-09-27 14:51                                         ` Sam James
2024-09-27 14:58                                           ` Jens Axboe
2024-10-01 21:10                                             ` Kairui Song
2024-09-24 19:17                                     ` Chris Mason
2024-09-24 19:24                                       ` Linus Torvalds
2024-09-19  6:34                               ` Christian Theune
2024-09-19  6:57                                 ` Linus Torvalds
2024-09-19 10:19                                   ` Christian Theune
2024-09-30 17:34                                     ` Christian Theune
2024-09-30 18:46                                       ` Linus Torvalds [this message]
2024-09-30 19:25                                         ` Christian Theune
2024-09-30 20:12                                           ` Linus Torvalds
2024-09-30 20:56                                             ` Matthew Wilcox
2024-09-30 22:42                                               ` Davidlohr Bueso
2024-09-30 23:00                                                 ` Davidlohr Bueso
2024-09-30 23:53                                               ` Linus Torvalds
2024-10-01  0:56                                       ` Chris Mason
2024-10-01  7:54                                         ` Christian Theune
2024-10-10  6:29                                         ` Christian Theune
2024-10-11  7:27                                           ` Christian Theune
2024-10-11  9:08                                             ` Christian Theune
2024-10-11 13:06                                               ` Chris Mason
2024-10-11 13:50                                                 ` Christian Theune
2024-10-12 17:01                                                 ` Linus Torvalds
2024-12-02 10:44                                                   ` Christian Theune
2024-10-01  2:22                                       ` Dave Chinner
2024-09-16  7:14         ` Christian Theune
2024-09-16 12:16           ` Matthew Wilcox
2024-09-18  8:31           ` Christian Theune

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHk-=wj6YRm2fpYHjZxNfKCC_N+X=T=ay+69g7tJ2cnziYT8=g@mail.gmail.com' \
    --to=torvalds@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=clm@meta.com \
    --cc=ct@flyingcircus.io \
    --cc=david@fromorbit.com \
    --cc=dqminh@cloudflare.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=regressions@leemhuis.info \
    --cc=regressions@lists.linux.dev \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox