Date: Wed, 11 Dec 2019 11:23:49 +1100
From: Dave Chinner <david@fromorbit.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org
Subject: Re: [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED
Message-ID: <20191211002349.GC19213@dread.disaster.area>
References: <20191210162454.8608-1-axboe@kernel.dk>
 <20191210162454.8608-4-axboe@kernel.dk>
In-Reply-To: <20191210162454.8608-4-axboe@kernel.dk>

On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
> If RWF_UNCACHED is set for io_uring (or pwritev2(2)), we'll drop the
> cache instantiated for buffered writes. If new pages aren't
> instantiated, we leave them alone. This provides similar semantics to
> reads with RWF_UNCACHED set.

So what about filesystems that don't use generic_perform_write()?
Anything that uses the iomap infrastructure
(i.e. iomap_file_buffered_write()) instead of
generic_file_write_iter() will currently ignore RWF_UNCACHED. That's
XFS and gfs2 right now, but there are likely to be more in the near
future as more filesystems are ported to the iomap infrastructure.
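To make that gap concrete, here's a rough, entirely untested sketch of
what an iomap-based ->write_iter would need to grow. It assumes the
earlier patches in this series propagate RWF_UNCACHED into
iocb->ki_flags as IOCB_UNCACHED, and drop_write_cache_range() and
example_iomap_ops are hypothetical stand-ins for a range-based helper
(doing roughly what your write_drop_cached_pages() below does) and a
filesystem's iomap ops:

/*
 * Untested sketch only. IOCB_UNCACHED is assumed to be set from
 * RWF_UNCACHED by earlier patches in this series;
 * drop_write_cache_range() is a hypothetical helper that writes back
 * and then tries to evict the page cache over the given byte range.
 */
static ssize_t
example_buffered_write(struct kiocb *iocb, struct iov_iter *from)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	loff_t pos = iocb->ki_pos;
	ssize_t ret;

	ret = iomap_file_buffered_write(iocb, from, &example_iomap_ops);
	if (ret > 0) {
		iocb->ki_pos += ret;
		/* Kick out the page cache this write just instantiated. */
		if (iocb->ki_flags & IOCB_UNCACHED)
			drop_write_cache_range(mapping, pos,
					       pos + ret - 1);
	}
	return ret;
}

i.e. the flag has to be checked somewhere the iomap write path can
actually see it - right now nothing outside generic_perform_write()
ever looks at it.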
I'd also really like to see extensive fsx and fsstress testing of this
new IO mode before it is committed - this is going to exercise page
cache coherency across different operations in new and unique ways.
That means we need patches to fstests to detect and use this
functionality when available, and new tests that explicitly exercise
combinations of buffered, mmap, dio and uncached IO for a range of
different IO sizes and alignments (e.g. mixing sector sized uncached
IO with page sized buffered/mmap/dio and vice versa).

We are not going to have a repeat of the copy_file_range() data
corruption fuckups because no testing was done and no test
infrastructure was written before the new API was committed.

> +void write_drop_cached_pages(struct page **pgs, struct address_space *mapping,
> +			     unsigned *nr)
> +{
> +	loff_t start, end;
> +	int i;
> +
> +	end = 0;
> +	start = LLONG_MAX;
> +	for (i = 0; i < *nr; i++) {
> +		struct page *page = pgs[i];
> +		loff_t off;
> +
> +		off = (loff_t) page_to_index(page) << PAGE_SHIFT;
> +		if (off < start)
> +			start = off;
> +		if (off > end)
> +			end = off;
> +		get_page(page);
> +	}
> +
> +	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
> +
> +	for (i = 0; i < *nr; i++) {
> +		struct page *page = pgs[i];
> +
> +		lock_page(page);
> +		if (page->mapping == mapping) {
> +			wait_on_page_writeback(page);
> +			if (!page_has_private(page) ||
> +			    try_to_release_page(page, 0))
> +				remove_mapping(mapping, page);
> +		}
> +		unlock_page(page);
> +	}
> +	*nr = 0;
> +}
> +EXPORT_SYMBOL_GPL(write_drop_cached_pages);
> +
> +#define GPW_PAGE_BATCH	16

In terms of performance, file fragmentation and premature filesystem
aging, this is also going to suck *really badly* for filesystems that
use delayed allocation, because it is going to force conversion of
delayed allocation extents during the write() call. IOWs, it adds all
the overheads of doing delayed allocation, but it reaps none of the
benefits because it doesn't allow large contiguous extents to build up
in memory before physical allocation occurs. i.e. there is no
"delayed" in this allocation.... With GPW_PAGE_BATCH at 16, a 1GiB
write on 4kB pages turns into roughly 16,000 separate 64kB
writeback-and-allocate cycles instead of a single large delayed
allocation.

So it might work fine on a pristine, empty filesystem where it is easy
to find contiguous free space across multiple allocations, but it's
going to suck after a few months of production usage has fragmented
all the free space into tiny pieces...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com