Date: Thu, 19 Dec 2024 09:54:35 -0700
Subject: Re: [PATCH] nfs: flag as supporting FOP_DONTCACHE
From: Jens Axboe
To: Mike Snitzer
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hannes@cmpxchg.org, clm@meta.com, linux-kernel@vger.kernel.org, willy@infradead.org, kirill@shutemov.name, bfoster@redhat.com, linux-nfs@vger.kernel.org
References: <20241213155557.105419-1-axboe@kernel.dk>

On 12/18/24 10:16 AM, Mike Snitzer wrote:
> On Fri, Dec 13, 2024 at 08:55:14AM -0700, Jens Axboe wrote:
>> Hi,
>>
>> 5 years ago I posted patches adding support for RWF_UNCACHED, as a way
>> to do buffered IO that isn't page cache persistent.
>> The approach back then was to have private pages for IO, and then get
>> rid of them once IO was done. But that then runs into all the issues
>> that O_DIRECT has, in terms of synchronizing with the page cache.
>>
>> So here's a new approach to the same concept, but using the page cache
>> as synchronization. Due to excessive bike shedding on the naming, this
>> is now named RWF_DONTCACHE, and is less special in that it's just page
>> cache IO, except it prunes the ranges once IO is completed.
>>
>> Why do this, you may ask? The tldr is that device speeds are only
>> getting faster, while reclaim is not. Doing normal buffered IO can be
>> very unpredictable, and suck up a lot of resources on the reclaim side.
>> This leads people to use O_DIRECT as a work-around, which has its own
>> set of restrictions in terms of size, offset, and length of IO. It's
>> also inherently synchronous, and now you need async IO as well. While
>> the latter isn't necessarily a big problem as we have good options
>> available there, it also should not be a requirement when all you want
>> to do is read or write some data without caching.
>>
>> Even on desktop type systems, a normal NVMe device can fill the entire
>> page cache in seconds. On the big system I used for testing, there's a
>> lot more RAM, but also a lot more devices. As can be seen in some of the
>> results in the following patches, you can still fill RAM in seconds even
>> when there's 1TB of it. Hence this problem isn't solely a "big
>> hyperscaler system" issue, it's common across the board.
>>
>> Common for both reads and writes with RWF_DONTCACHE is that they use the
>> page cache for IO. Reads work just like a normal buffered read would,
>> with the only exception being that the touched ranges will get pruned
>> after data has been copied. For writes, the ranges will get writeback
>> kicked off before the syscall returns, and then writeback completion
>> will prune the range.
>> Hence writes aren't synchronous, and it's easy to pipeline writes using
>> RWF_DONTCACHE. Folios that aren't instantiated by RWF_DONTCACHE IO are
>> left untouched. This means that uncached IO will take advantage of the
>> page cache for uptodate data, but not leave anything it
>> instantiated/created in cache.
>>
>> File systems need to support this. This patchset adds support for the
>> generic read path, which covers file systems like ext4. Patches exist to
>> add support for iomap/XFS and btrfs as well, which sit on top of this
>> series. If RWF_DONTCACHE IO is attempted on a file system that doesn't
>> support it, -EOPNOTSUPP is returned. Hence the user can rely on it
>> either working as designed, or flagging an error if that's not the
>> case. The intent here is to give the application a sensible fallback
>> path - eg, it may fall back to O_DIRECT if appropriate, or just live
>> with the fact that uncached IO isn't available and do normal buffered
>> IO.
>>
>> Adding "support" to other file systems should be trivial, most of the
>> time just a one-liner adding FOP_DONTCACHE to the fop_flags in the
>> file_operations struct.
>>
>> Performance results are in patch 8 for reads, and you can find the write
>> side results in the XFS patch adding support for DONTCACHE writes for
>> XFS:
>>
>> https://git.kernel.dk/cgit/linux/commit/?h=buffered-uncached.9&id=edd7b1c910c5251941c6ba179f44b4c81a089019
>>
>> with the tldr being that I see about a 65% improvement in performance
>> for both, with fully predictable IO times. CPU reduction is substantial
>> as well, with no kswapd activity at all for reclaim when using
>> uncached IO.
>>
>> Using it from applications is trivial - just set RWF_DONTCACHE for the
>> read or write, using pwritev2(2) or preadv2(2). For io_uring, same
>> thing, just set RWF_DONTCACHE in sqe->rw_flags for a buffered read/write
>> operation. And that's it.
>>
>> Patches 1..7 are just prep patches, and should have no functional
>> changes at all.
>> Patch 8 adds support for the filemap path for RWF_DONTCACHE reads, and
>> patches 9..11 are just prep patches for supporting the write side of
>> uncached IO. In the below mentioned branch, there are then patches to
>> adopt uncached reads and writes for xfs, btrfs, and ext4. The latter
>> currently relies on a bit of a hack for passing whether this is an
>> uncached write or not through ->write_begin(), which can hopefully go
>> away once ext4 adopts iomap for buffered writes. I say this is a hack
>> as it's not the prettiest way to do it, however it is fully solid and
>> will work just fine.
>>
>> Passes full xfstests and fsx overnight runs, no issues observed. That
>> includes the vm running the testing also using RWF_DONTCACHE on the
>> host. I'll post fsstress and fsx patches for RWF_DONTCACHE separately.
>> As far as I'm concerned, no further work needs doing here.
>>
>> And git tree for the patches is here:
>>
>> https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.9
>>
>>  include/linux/fs.h             | 21 +++++++-
>>  include/linux/page-flags.h     |  5 ++
>>  include/linux/pagemap.h        |  1 +
>>  include/trace/events/mmflags.h |  3 +-
>>  include/uapi/linux/fs.h        |  6 ++-
>>  mm/filemap.c                   | 97 +++++++++++++++++++++++++++++-----
>>  mm/internal.h                  |  2 +
>>  mm/readahead.c                 | 22 ++++++--
>>  mm/swap.c                      |  2 +
>>  mm/truncate.c                  | 54 ++++++++++---------
>>  10 files changed, 166 insertions(+), 47 deletions(-)
>>
>> Since v6
>> - Rename the PG_uncached flag to PG_dropbehind
>> - Shuffle patches around a bit, most notably so the foliop_uncached
>>   patch goes with the ext4 support
>> - Get rid of foliop_uncached hack for btrfs (Christoph)
>> - Get rid of passing in struct address_space to filemap_create_folio()
>> - Inline invalidate_complete_folio2() in folio_unmap_invalidate() rather
>>   than keep it as a separate helper
>> - Rebase on top of current master
>>
>> --
>> Jens Axboe
>>
>>
>
>
> Hi Jens,
>
> You may recall I tested NFS to work with UNCACHED (now DONTCACHE).
> I've rebased the required small changes, feel free to append this to
> your series if you like.
>
> More work is needed to inform knfsd to selectively use DONTCACHE, but
> that will require more effort and coordination amongst the NFS kernel
> team.

Thanks Mike, I'll add it to the part 2 mix.

--
Jens Axboe
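
For reference, a minimal sketch of the application-side usage described in the quoted cover letter: set RWF_DONTCACHE on preadv2(2)/pwritev2(2), and fall back to normal buffered IO when uncached IO isn't available. The helper names (read_dontcache, write_dontcache, dontcache_roundtrip) are hypothetical and not part of the series. The fallback define of 0x80 for RWF_DONTCACHE is an assumption for builds against older uapi headers. The cover letter only names -EOPNOTSUPP; treating ENOSYS and EINVAL the same way is an extra defensive assumption.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumption: value as found in recent uapi headers; older headers lack it. */
#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE 0x00000080
#endif

/*
 * Nonzero if the error suggests uncached IO is unavailable, as opposed to
 * the IO itself failing. Only EOPNOTSUPP comes from the cover letter; the
 * other two are a defensive assumption for old kernels or sandboxes.
 */
static int dontcache_unavailable(int err)
{
	return err == EOPNOTSUPP || err == ENOSYS || err == EINVAL;
}

/* Buffered read that asks for the touched range to be pruned afterwards. */
static ssize_t read_dontcache(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret;

	assert(buf != NULL);
	ret = preadv2(fd, &iov, 1, off, RWF_DONTCACHE);
	if (ret < 0 && dontcache_unavailable(errno))
		ret = pread(fd, buf, len, off);	/* plain buffered fallback */
	return ret;
}

/* Buffered write that asks for the range to be pruned after writeback. */
static ssize_t write_dontcache(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret;

	assert(buf != NULL);
	ret = pwritev2(fd, &iov, 1, off, RWF_DONTCACHE);
	if (ret < 0 && dontcache_unavailable(errno))
		ret = pwrite(fd, buf, len, off);	/* plain buffered fallback */
	return ret;
}

/* Write a short message to 'path', read it back, return 0 if it matches. */
int dontcache_roundtrip(const char *path)
{
	static const char msg[] = "uncached hello";
	char back[sizeof(msg)] = { 0 };
	int fd, ok;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	ok = write_dontcache(fd, msg, sizeof(msg), 0) == (ssize_t)sizeof(msg) &&
	     read_dontcache(fd, back, sizeof(back), 0) == (ssize_t)sizeof(back) &&
	     memcmp(msg, back, sizeof(msg)) == 0;

	close(fd);
	unlink(path);
	return ok ? 0 : -1;
}
```

Calling dontcache_roundtrip() on a scratch path should behave the same whether or not the underlying file system sets FOP_DONTCACHE; only the caching side effects differ. An application that prefers O_DIRECT over normal buffered IO as its fallback would swap out the pread()/pwrite() branch.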