Re: [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Hillf Danton <hdanton@sina.com>
To: "Theodore Ts'o" <tytso@mit.edu>
Cc: John Hubbard <jhubbard@nvidia.com>,
	linux-ext4@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
Date: Sat, 26 Feb 2022 08:18:43 +0800	[thread overview]
Message-ID: <20220226001843.2520-1-hdanton@sina.com> (raw)
In-Reply-To: <YhlkcYjozFmt3Kl4@mit.edu>

On Fri, 25 Feb 2022 18:21:21 -0500 Theodore Ts'o wrote:
> On Fri, Feb 25, 2022 at 01:33:33PM -0800, John Hubbard wrote:
> > On 2/25/22 13:23, Theodore Ts'o wrote:
> > > [un]pin_user_pages_remote is dirtying pages without properly warning
> > > the file system in advance.  This was noted by Jan Kara in 2018[1] and
> > 
> > In 2018, [un]pin_user_pages_remote did not exist. And so what Jan reported
> > was actually that dio_bio_complete() was calling set_page_dirty_lock()
> > on pages that were not (any longer) set up for that.
> 
> Fair enough, there are two problems that are getting conflated here,
> and that's my bad.  The problem which Jan pointed out is one where the
> Direct I/O read path triggered a page fault, so page_mkwrite() was
> actually called.  So in this case, the file system was actually
> notified, and the page was marked dirty after the file system was
> notified.  But then the DIO read was racing with the page cleaner,
> which would call writepage(), and then clear the page, and then remove
> the buffer_heads.  Then dio_bio_complete() would call set_page_dirty()
> a second time, and that's what would trigger the BUG.
> 
> But in the syzbot reproducer, it's a different problem.  In this case,
> process_vm_writev() calling [un]pin_user_pages_remote(), and
> page_mkwrite() is never getting called.  So there is no need to race
> with the page cleaner, and so the BUG triggers much more reliably.
> 
> > > more recently has resulted in bug reports by Syzbot in various Android
> > > kernels[2].
> > > 
> > > This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
> > 
> > Is it, really? unpin_user_pages_dirty_lock() moved the set_page_dirty_lock()
> > call into mm/gup.c, but that merely refactored things. The callers are
> > all over the kernel, and those callers are what need changing in order
> > to fix this.
> 
> >From my perspective, the bug is calling set_page_dirty() without first
> calling the file system's page_mkwrite().  This is necessary since the
> file system needs to allocate file system data blocks in preparation
> for a future writeback.
> 
> Now, calling page_mkwrite() by itself is not enough, since the moment
> you make the page dirty, the page cleaner could go ahead and call
> writepage() behind your back and clean it.  In actual practice, with a
> Direct I/O read request racing with writeback, this is race was quite
> hard to hit, because the that would imply that the background
> writepage() call would have to complete ahead of the synchronous read
> request, and the block layer generally prioritizes synchronous reads
> ahead of background write requests.  So in practice, this race was
> ***very*** hard to hit.  Jan may have reported it in 2018, but I don't
> think I've ever seen it happen myself.
> 
> For process_vm_writev() this is a case where user pages are pinned and
> then released in short order, so I suspect that race with the page
> cleaner would also be very hard to hit.  But we could completely
> remove the potential for the race, and also make things kinder for
> f2fs and btrfs's compressed file write support, by making things work
> much like the write(2) system call.  Imagine if we had a
> "pin_user_pages_local()" which calls write_begin(), and a
> "unpin_user_pages_local()" which calls write_end(), and the
> presumption with the "[un]pin_user_pages_local" API is that you don't
> hold the pinned pages for very long --- say, not across a system call
> boundary, and then it would work the same way the write(2) system call
> works does except that in the case of process_vm_writev(2) the pages
> are identified by another process's address space where they happen to
> be mapped.
> 
> This obviously doesn't work when pinning pages for remote DMA, because
> in that case the time between pin_user_pages_remote() and
> unpin_user_pages_remote() could be a long, long time, so that means we
> can't use using write_begin/write_end; we'd need to call page_mkwrite()
> when the pages are first pinned and then somehow prevent the page
> cleaner from touching a dirty page which is pinned for use by the
> remote DMA.

Sad to see it here given the attempt that no gup-pinned page will be put
under writeback. [05]

Hillf
> 
> Does that make sense?
> 
> 							- Ted
> 

[05] https://lore.kernel.org/linux-mm/20191103112113.8256-1-hdanton@sina.com/

          parent reply	other threads:[~2022-02-26  0:18 UTC|newest]

Thread overview: expand[flat|nested]  mbox.gz  Atom feed
 [parent not found: <YhlkcYjozFmt3Kl4@mit.edu>]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220226001843.2520-1-hdanton@sina.com \
    --to=hdanton@sina.com \
    --cc=jhubbard@nvidia.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox