Date: Fri, 2 Feb 2024 14:29:40 +0000
From: Matthew Wilcox
To: David Howells
Cc: lsf-pc@lists.linux-foundation.org, netfs@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] Large folios, swap and fscache
In-Reply-To: <2701740.1706864989@warthog.procyon.org.uk>

On Fri, Feb 02, 2024 at 09:09:49AM +0000, David Howells wrote:
> The topic came up in a recent discussion about how to deal with large folios
> when it comes to swap as a swap device is normally considered a simple array
> of PAGE_SIZE-sized elements that can be indexed by a single integer.
>
> With the advent of large folios, however, we might need to change this in
> order to be better able to swap out a compound page efficiently.
> Swap fragmentation raises its head, as does the need to potentially save
> multiple indices per folio. Does swap need to grow more filesystem features?

I didn't mention this during the meeting, but there are more reasons to
do something like this. For example, even with large folios, it doesn't
make sense to drive writing to swap on a per-folio basis. We should be
writing out large chunks of virtual address space in a single write to
the swap device, just like we do large chunks of files in ->writepages.

Another reason to do something different is that we're starting to see
block devices with bs>PS. That means we'll _have_ to write out larger
chunks than a single page. For reads, we can discard the extra data,
but it'd be better to swap back in the entire block rather than
individual pages.

So my modest proposal is that we completely rearchitect how we handle
swap. Instead of putting swp entries in the page tables (and in shmem's
case in the page cache), we turn swap into an (object, offset) lookup
(just like a filesystem). That means that each anon_vma becomes its own
swap object and each shmem inode becomes its own swap object. The swap
system can then borrow techniques from whichever filesystem it likes to
do (object, offset, length) -> n x (device, block) mappings.

> Further to this, we have at least two ways to cache data on disk/flash/etc. -
> swap and fscache - and both want to set aside disk space for their operation.
> Might it be possible to combine the two?
>
> One thing I want to look at for fscache is the possibility of switching from a
> file-per-object-based approach to a tagged cache more akin to the way OpenAFS
> does things. In OpenAFS, you have a whole bunch of small files, each
> containing a single block (e.g. 256K) of data, and an index that maps a
> particular {volume,file,version,block} to one of these files in the cache.

I think my proposal above works for you?
For each file you want to cache, create a swap object, and then tell
swap when you want to read/write to the local swap object. What you do
need is to persist the objects over a power cycle. That shouldn't be
too hard ... after all, filesystems manage to do it. All we need to do
is figure out how to name the lookup (I don't think we need to use
strings to name the swap object, but obviously we could). Maybe it's
just a stream of bytes.
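
To make the (object, offset, length) -> n x (device, block) idea a bit
more concrete, here is a rough userspace sketch in Python. Nothing in
it is kernel code or an existing API: SwapObject, Extent, add_extent
and lookup are names made up for illustration, and it assumes one
block per page and non-overlapping extents.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Extent:
    offset: int  # first page offset within the swap object
    length: int  # number of pages covered by this extent
    device: int  # hypothetical swap device id
    block: int   # first block on that device (one block per page here)


@dataclass
class SwapObject:
    """Stands in for one anon_vma or shmem inode; purely illustrative."""
    extents: list = field(default_factory=list)  # kept sorted by offset

    def add_extent(self, extent):
        # Insert in offset order; extents are assumed not to overlap.
        starts = [e.offset for e in self.extents]
        self.extents.insert(bisect_right(starts, extent.offset), extent)

    def lookup(self, offset, length):
        """Map (offset, length) to a list of (device, block, npages)."""
        end = offset + length
        pieces = []
        for e in self.extents:
            lo = max(offset, e.offset)
            hi = min(end, e.offset + e.length)
            if lo < hi:  # the request overlaps this extent
                pieces.append((e.device, e.block + (lo - e.offset), hi - lo))
        return pieces  # pages with no backing extent are simply absent


obj = SwapObject()
obj.add_extent(Extent(offset=16, length=16, device=1, block=40))
obj.add_extent(Extent(offset=0, length=16, device=0, block=1000))

# A 16-page request that straddles the two extents comes back as two
# pieces, one per device.
pieces = obj.lookup(8, 16)
```

A real implementation would presumably want a tree rather than a
sorted list, and would have to round requests to the device block size
in the bs>PS case; both are left out of this sketch.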