From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 6 Dec 2024 09:37:51 -0800
From: "Darrick J. Wong" <djwong@kernel.org>
To: Jens Axboe <axboe@kernel.dk>
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hannes@cmpxchg.org,
	clm@meta.com, linux-kernel@vger.kernel.org, willy@infradead.org,
	kirill@shutemov.name, bfoster@redhat.com
Subject: Re: [PATCHSET v6 0/12] Uncached buffered IO
Message-ID: <20241206173751.GI7864@frogsfrogsfrogs>
References: <20241203153232.92224-2-axboe@kernel.dk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20241203153232.92224-2-axboe@kernel.dk>

On Tue, Dec 03, 2024 at 08:31:36AM -0700, Jens Axboe wrote:
> Hi,
>
> 5 years ago I posted patches adding support for RWF_UNCACHED, as a way
> to do buffered IO that isn't page cache persistent. The approach back
> then was to have private pages for IO, and then get rid of them once IO
> was done. But that then runs into all the issues that O_DIRECT has, in
> terms of synchronizing with the page cache.
>
> So here's a new approach to the same concept, but using the page cache
> as synchronization. That makes RWF_UNCACHED less special, in that it's
> just page cache IO, except it prunes the ranges once IO is completed.
>
> Why do this, you may ask? The tldr is that device speeds are only
> getting faster, while reclaim is not. Doing normal buffered IO can be
> very unpredictable, and suck up a lot of resources on the reclaim side.
> This leads people to use O_DIRECT as a work-around, which has its own
> set of restrictions in terms of size, offset, and length of IO. It's
> also inherently synchronous, and now you need async IO as well. While
> the latter isn't necessarily a big problem as we have good options
> available there, it also should not be a requirement when all you want
> to do is read or write some data without caching.
>
> Even on desktop type systems, a normal NVMe device can fill the entire
> page cache in seconds. On the big system I used for testing, there's a
> lot more RAM, but also a lot more devices. As can be seen in some of the
> results in the following patches, you can still fill RAM in seconds even
> when there's 1TB of it. Hence this problem isn't solely a "big
> hyperscaler system" issue, it's common across the board.
>
> Common for both reads and writes with RWF_UNCACHED is that they use the
> page cache for IO. Reads work just like a normal buffered read would,
> with the only exception being that the touched ranges will get pruned
> after data has been copied. For writes, the ranges will get writeback
> kicked off before the syscall returns, and then writeback completion
> will prune the range. Hence writes aren't synchronous, and it's easy to
> pipeline writes using RWF_UNCACHED. Folios that aren't instantiated by
> RWF_UNCACHED IO are left untouched. This means that uncached IO
> will take advantage of the page cache for uptodate data, but not leave
> anything it instantiated/created in cache.
>
> File systems need to support this. The patches add support for the
> generic filemap helpers, and for iomap. Then ext4 and XFS are marked as
> supporting it. The last patch adds support for btrfs as well, lightly
> tested. The read side is already done by filemap, only the write side
> needs a bit of help. The amount of code here is really trivial, and the
> only reason the fs opt-in is necessary is to have an RWF_UNCACHED IO
> return -EOPNOTSUPP just in case the fs doesn't use either the generic
> paths or iomap. Adding "support" to other file systems should be
> trivial, most of the time just a one-liner adding FOP_UNCACHED to the
> fop_flags in the file_operations struct.
>
> Performance results are in patch 8 for reads and patch 10 for writes,
> with the tldr being that I see about a 65% improvement in performance
> for both, with fully predictable IO times. CPU reduction is substantial
> as well, with no kswapd activity at all for reclaim when using uncached
> IO.
>
> Using it from applications is trivial - just set RWF_UNCACHED for the
> read or write, using pwritev2(2) or preadv2(2). For io_uring, same
> thing, just set RWF_UNCACHED in sqe->rw_flags for a buffered read/write
> operation. And that's it.
>
> Patches 1..7 are just prep patches, and should have no functional
> changes at all. Patch 8 adds support for the filemap path for
> RWF_UNCACHED reads, patch 11 adds support for filemap RWF_UNCACHED
> writes. In the below mentioned branch, there are then patches to
> adopt uncached reads and writes for ext4, xfs, and btrfs.
>
> Passes full xfstests and fsx overnight runs, no issues observed. That
> includes the vm running the testing also using RWF_UNCACHED on the host.
> I'll post fsstress and fsx patches for RWF_UNCACHED separately. As far
> as I'm concerned, no further work needs doing here.
>
> And git tree for the patches is here:
>
> https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.8

Oh good, I much prefer browsing git branches these days. :)
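For concreteness, here is a minimal userspace sketch of the pwritev2(2)
usage the cover letter describes above. It is only an illustration: the
fallback RWF_UNCACHED value and the file name are assumptions, and the
real definition should come from the uapi <linux/fs.h> of a kernel
carrying this series. If the kernel lacks support, the write fails with
-EOPNOTSUPP per the cover letter.

	/*
	 * Sketch of an uncached buffered write via pwritev2(2).
	 * RWF_UNCACHED is introduced by this patchset; the fallback
	 * value below is an assumption and should be checked against
	 * the patched <linux/fs.h>.
	 */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/uio.h>
	#include <unistd.h>

	#ifndef RWF_UNCACHED
	#define RWF_UNCACHED	0x00000080	/* assumed value, see note above */
	#endif

	int main(void)
	{
		static char buf[65536];
		struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
		ssize_t ret;
		int fd;

		memset(buf, 'x', sizeof(buf));

		fd = open("testfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		/*
		 * A normal buffered write, except writeback is kicked off
		 * before the syscall returns and the written range is
		 * dropped from the page cache once writeback completes.
		 */
		ret = pwritev2(fd, &iov, 1, 0, RWF_UNCACHED);
		if (ret < 0)
			perror("pwritev2(RWF_UNCACHED)");

		close(fd);
		return ret < 0;
	}

For io_uring, per the cover letter, the same flag goes into
sqe->rw_flags of a buffered read/write SQE.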
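And a kernel-side sketch of the per-filesystem opt-in mentioned in the
cover letter, for a made-up filesystem: FOP_UNCACHED comes from this
series, the filesystem name and the choice of generic helpers here are
purely illustrative.

	/*
	 * Hypothetical filesystem opting in to RWF_UNCACHED: the only
	 * uncached-specific piece is FOP_UNCACHED in fop_flags; the
	 * generic filemap paths handle the rest.
	 */
	#include <linux/fs.h>

	static const struct file_operations examplefs_file_operations = {
		.llseek		= generic_file_llseek,
		.read_iter	= generic_file_read_iter,
		.write_iter	= generic_file_write_iter,
		.mmap		= generic_file_mmap,
		.fop_flags	= FOP_UNCACHED,
	};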
* mm/filemap: change filemap_create_folio() to take a struct kiocb
* mm/readahead: add folio allocation helper
* mm: add PG_uncached page flag
* mm/readahead: add readahead_control->uncached member
* mm/filemap: use page_cache_sync_ra() to kick off read-ahead
* mm/truncate: add folio_unmap_invalidate() helper

The mm patches look ok to me, but I think you ought to get at least an
ack from willy since they're largely pagecache changes.

* fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag

See more detailed reply in the thread.

* mm/filemap: add read support for RWF_UNCACHED

Looks cleaner now that we don't even unmap the page if it's dirty.

* mm/filemap: drop uncached pages when writeback completes
* mm/filemap: add filemap_fdatawrite_range_kick() helper
* mm/filemap: make buffered writes work with RWF_UNCACHED

See more detailed reply in the thread.

* mm: add FGP_UNCACHED folio creation flag

I appreciate that !UNCACHED callers of __filemap_get_folio now clear
the uncached bit if it's set.

Now I proceed into the rest of your branch, because I felt like it:

* ext4: add RWF_UNCACHED write support

(Dunno about the WARN_ON removals in this patch, but this is really
Ted's call anyway).

* iomap: make buffered writes work with RWF_UNCACHED

The commit message references an "iocb_uncached_write" but I don't find
any such function in the extended patchset? Otherwise this looks ready
to me. Thanks for changing it only to set uncached if we're actually
creating a folio, and not just returning one that was already in the
pagecache.

* xfs: punt uncached write completions to the completion wq

Dumb nit: spaces between "IOMAP_F_SHARED|IOMAP_F_UNCACHED" in this
patch.

* xfs: flag as supporting FOP_UNCACHED

Otherwise the xfs changes look ready too.

* btrfs: add support for uncached writes
* block: support uncached IO

Not sure why the definition of bio_dirty_lock gets moved around, but in
principle this looks ok to me too.

For the whole pile of mm changes (aka patches 1-6,8-10,12),
Acked-by: "Darrick J. Wong" <djwong@kernel.org>

--D

>
>  include/linux/fs.h             |  21 +++++-
>  include/linux/page-flags.h     |   5 ++
>  include/linux/pagemap.h        |  14 ++++
>  include/trace/events/mmflags.h |   3 +-
>  include/uapi/linux/fs.h        |   6 +-
>  mm/filemap.c                   | 114 +++++++++++++++++++++++++++++----
>  mm/readahead.c                 |  22 +++++--
>  mm/swap.c                      |   2 +
>  mm/truncate.c                  |  35 ++++++----
>  9 files changed, 187 insertions(+), 35 deletions(-)
>
> Since v5
> - Skip invalidation in filemap_uncached_read() if the folio is dirty
>   as well, retaining the uncached setting for later cleaning to do
>   the actual invalidation.
> - Use the same trylock approach in read invalidation as the writeback
>   invalidation does.
> - Swap order of patches 10 and 11 to fix a bisection issue.
> - Split core mm changes and fs series patches. Once the generic side
>   has been approved, I'll send out the fs series separately.
> - Rebase on 6.13-rc1
>
> --
> Jens Axboe
>
>