From: Dave Chinner <david@fromorbit.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org
Subject: Re: [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED
Date: Wed, 11 Dec 2019 11:23:49 +1100
Message-ID: <20191211002349.GC19213@dread.disaster.area>
In-Reply-To: <20191210162454.8608-4-axboe@kernel.dk>

On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
> If RWF_UNCACHED is set for io_uring (or pwritev2(2)), we'll drop the
> cache instantiated for buffered writes. If new pages aren't
> instantiated, we leave them alone. This provides similar semantics to
> reads with RWF_UNCACHED set.

So what about filesystems that don't use generic_perform_write()?
i.e. anything that uses the iomap infrastructure
(iomap_file_buffered_write()) instead of generic_file_write_iter()
will currently ignore RWF_UNCACHED. That's XFS and gfs2 right now,
but there are likely to be more in the near future as more
filesystems are ported to the iomap infrastructure.
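
FWIW, a minimal (untested) sketch of what a caller would do - the
RWF_UNCACHED define below is a stand-in, the real value comes from
the uapi patch in this series.  If I'm reading patch 1 correctly,
on XFS/gfs2 this call succeeds but the flag simply has no effect
without the iomap side being wired up:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* stand-in; use the uapi patch's value */
#endif

int main(int argc, char **argv)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd;

	if (argc < 2)
		return 1;
	memset(buf, 0xaa, sizeof(buf));
	fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* buffered write, asking the kernel not to keep the pages cached */
	if (pwritev2(fd, &iov, 1, 0, RWF_UNCACHED) < 0)
		perror("pwritev2");
	close(fd);
	return 0;
}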

I'd also really like to see extensive fsx and fsstress testing of
this new IO mode before it is committed - it is going to exercise
page cache coherency across different operations in new and unique
ways. That means we need patches to fstests to detect and use this
functionality when available, and new tests that explicitly exercise
combinations of buffered, mmap, dio and uncached IO for a range of
different IO sizes and alignments (e.g. mixing sector sized uncached
IO with page sized buffered/mmap/dio and vice versa).
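
As a starting point, the sort of mixing I mean looks something like
this (rough, untested sketch; again the RWF_UNCACHED value is a
stand-in for whatever the uapi patch defines) - interleave plain
buffered and uncached writes, then verify the whole file back
through mmap:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* stand-in; use the uapi patch's value */
#endif

#define CHUNK	4096
#define NCHUNKS	256

int main(int argc, char **argv)
{
	char buf[CHUNK];
	unsigned char *map;
	int fd, i;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <testfile>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* interleave plain buffered and uncached writes across the file */
	for (i = 0; i < NCHUNKS; i++) {
		struct iovec iov = { .iov_base = buf, .iov_len = CHUNK };
		off_t off = (off_t)i * CHUNK;
		ssize_t ret;

		memset(buf, i & 0xff, CHUNK);
		if (i & 1)
			ret = pwritev2(fd, &iov, 1, off, RWF_UNCACHED);
		else
			ret = pwrite(fd, buf, CHUNK, off);
		if (ret != CHUNK) {
			perror("write");
			return 1;
		}
	}

	/* read the whole file back through mmap and verify */
	map = mmap(NULL, (size_t)NCHUNKS * CHUNK, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	for (i = 0; i < NCHUNKS; i++) {
		int j;

		for (j = 0; j < CHUNK; j++) {
			if (map[(size_t)i * CHUNK + j] != (i & 0xff)) {
				fprintf(stderr, "mismatch: chunk %d byte %d\n",
					i, j);
				return 1;
			}
		}
	}
	munmap(map, (size_t)NCHUNKS * CHUNK);
	close(fd);
	printf("verified %d chunks\n", NCHUNKS);
	return 0;
}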

We are not going to have a repeat of the copy_file_range() data
corruption fuckups, which happened because no testing was done and
no test infrastructure was written before the new API was committed.

> +void write_drop_cached_pages(struct page **pgs, struct address_space *mapping,
> +			     unsigned *nr)
> +{
> +	loff_t start, end;
> +	int i;
> +
> +	end = 0;
> +	start = LLONG_MAX;
> +	for (i = 0; i < *nr; i++) {
> +		struct page *page = pgs[i];
> +		loff_t off;
> +
> +		off = (loff_t) page_to_index(page) << PAGE_SHIFT;
> +		if (off < start)
> +			start = off;
> +		if (off > end)
> +			end = off;
> +		get_page(page);
> +	}
> +
> +	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
> +
> +	for (i = 0; i < *nr; i++) {
> +		struct page *page = pgs[i];
> +
> +		lock_page(page);
> +		if (page->mapping == mapping) {
> +			wait_on_page_writeback(page);
> +			if (!page_has_private(page) ||
> +			    try_to_release_page(page, 0))
> +				remove_mapping(mapping, page);
> +		}
> +		unlock_page(page);
> +	}
> +	*nr = 0;
> +}
> +EXPORT_SYMBOL_GPL(write_drop_cached_pages);
> +
> +#define GPW_PAGE_BATCH		16

In terms of performance, file fragmentation and premature filesystem
aging, this is also going to suck *really badly* for filesystems
that use delayed allocation because it is going to force conversion
of delayed allocation extents during the write() call. IOWs,
it adds all the overheads of doing delayed allocation, but it reaps
none of the benefits because it doesn't allow large contiguous
extents to build up in memory before physical allocation occurs.
i.e. there is no "delayed" in this allocation....

So it might work fine on a pristine, empty filesystem where it is
easy to find contiguous free space across multiple allocations, but
it's going to suck after a few months of production usage has
fragmented all the free space into tiny pieces...
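
A crude way to demonstrate that (rough, untested sketch; stand-in
RWF_UNCACHED value again): write a large file in 64k chunks with and
without the flag and compare "filefrag" output afterwards.  On an
aged, fragmented filesystem the uncached run should show many more
extents, because every 64k batch is allocated separately rather than
as one big delalloc extent at writeback time:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* stand-in; use the uapi patch's value */
#endif

#define CHUNK	65536		/* 16 pages == GPW_PAGE_BATCH on 4k pages */
#define NCHUNKS	16384		/* 1GB total */

static char buf[CHUNK];

int main(int argc, char **argv)
{
	int fd, i, uncached = (argc > 2);

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file> [uncached]\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 0x5a, CHUNK);

	/*
	 * In uncached mode each CHUNK is pushed into writeback from
	 * inside write(), so delalloc is converted in 64k hunks; in
	 * plain buffered mode the whole delalloc extent can be
	 * allocated in one go when background writeback finally runs.
	 */
	for (i = 0; i < NCHUNKS; i++) {
		off_t off = (off_t)i * CHUNK;
		ssize_t ret;

		if (uncached) {
			struct iovec iov = { .iov_base = buf, .iov_len = CHUNK };

			ret = pwritev2(fd, &iov, 1, off, RWF_UNCACHED);
		} else {
			ret = pwrite(fd, buf, CHUNK, off);
		}
		if (ret != CHUNK) {
			perror("write");
			return 1;
		}
	}
	fsync(fd);
	close(fd);
	/* now compare "filefrag <file>" between the two modes */
	return 0;
}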

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

