From: Chris Li
Date: Thu, 29 Feb 2024 11:31:12 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Large folios, swap and fscache
To: Matthew Wilcox
Cc: David Howells, lsf-pc@lists.linux-foundation.org, netfs@lists.linux.dev, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
References: <2701740.1706864989@warthog.procyon.org.uk>

Hi Matthew,

On Fri, Feb 2, 2024 at 6:29 AM Matthew Wilcox wrote:
>
> On Fri, Feb 02, 2024 at 09:09:49AM +0000, David Howells wrote:
> > The topic came up in a recent discussion about how to deal with large folios
> > when it comes to swap as a swap device is normally considered a simple array
> > of PAGE_SIZE-sized elements that can be indexed by a single integer.
> >
> > With the advent of large folios, however, we might need to change this in
> > order to be better able to swap out a compound page efficiently. Swap
> > fragmentation raises its head, as does the need to potentially save multiple
> > indices per folio. Does swap need to grow more filesystem features?
>
> I didn't mention this during the meeting, but there are more reasons
> to do something like this. For example, even with large folios, it
> doesn't make sense to drive writing to swap on a per-folio basis. We
> should be writing out large chunks of virtual address space in a single
> write to the swap device, just like we do large chunks of files in
> ->writepages.

I have thought about your proposal after the THP meeting. One
observation is that swap write and swap read have some asymmetries.
For swap read, you always know which VMA you are reading into.
However, swap write that is driven by the LRU list
(shrink_folio_list) does not have the VMA information in hand. In
fact, the same folio might be mapped by two different processes. It
would need to do the rmap walk to find the VMA. So organizing swap
write around the VMA mapping is not convenient for the LRU reclaim
writeback case.

Chris

> Another reason to do something different is that we're starting to see
> block devices with bs>PS. That means we'll _have_ to write out larger
> chunks than a single page.
> For reads, we can discard the extra data,
> but it'd be better to swap back in the entire block rather than
> individual pages.
>
> So my modest proposal is that we completely rearchitect how we handle
> swap. Instead of putting swp entries in the page tables (and in shmem's
> case in the page cache), we turn swap into an (object, offset) lookup
> (just like a filesystem). That means that each anon_vma becomes its
> own swap object and each shmem inode becomes its own swap object.
> The swap system can then borrow techniques from whichever filesystem
> it likes to do (object, offset, length) -> n x (device, block) mappings.
>
> > Further to this, we have at least two ways to cache data on disk/flash/etc. -
> > swap and fscache - and both want to set aside disk space for their operation.
> > Might it be possible to combine the two?
> >
> > One thing I want to look at for fscache is the possibility of switching from a
> > file-per-object-based approach to a tagged cache more akin to the way OpenAFS
> > does things. In OpenAFS, you have a whole bunch of small files, each
> > containing a single block (e.g. 256K) of data, and an index that maps a
> > particular {volume,file,version,block} to one of these files in the cache.
>
> I think my proposal above works for you? For each file you want to cache,
> create a swap object, and then tell swap when you want to read/write to
> the local swap object. What you do need is to persist the objects over
> a power cycle. That shouldn't be too hard ... after all, filesystems
> manage to do it. All we need to do is figure out how to name the
> lookup (I don't think we need to use strings to name the swap object,
> but obviously we could). Maybe it's just a stream of bytes.
>
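
For illustration only, the quoted "(object, offset, length) -> n x
(device, block)" idea could look roughly like the userspace sketch
below. Nothing here is kernel code or part of the proposal: struct
obj_extent, struct swap_obj, struct dev_range and swap_obj_map() are
hypothetical names, and a sorted array of extents is just one assumed
way a filesystem-style mapping might be stored.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one contiguous run of an object's pages on one device. */
struct obj_extent {
	uint64_t obj_off;	/* first page offset within the object */
	uint64_t nr;		/* number of pages in this run */
	unsigned int dev;	/* hypothetical device id */
	uint64_t blk;		/* first block of the run on that device */
};

/* Hypothetical swap object: extents sorted by obj_off, non-overlapping. */
struct swap_obj {
	const struct obj_extent *ext;
	size_t nr_ext;
};

/* One (device, block) piece of the answer. */
struct dev_range {
	unsigned int dev;
	uint64_t blk;
	uint64_t nr;
};

/*
 * Resolve the object range [off, off + len) into device ranges.
 * Returns the number of ranges written to out[], or -1 if part of the
 * range is unmapped or out[] (max entries) is too small.
 */
static int swap_obj_map(const struct swap_obj *obj, uint64_t off,
			uint64_t len, struct dev_range *out, int max)
{
	int n = 0;

	for (size_t i = 0; i < obj->nr_ext && len > 0; i++) {
		const struct obj_extent *e = &obj->ext[i];

		if (off >= e->obj_off + e->nr)
			continue;	/* extent ends before the range starts */
		if (off < e->obj_off || n == max)
			return -1;	/* hole in the mapping, or out[] full */

		uint64_t skip = off - e->obj_off;
		uint64_t take = e->nr - skip;

		if (take > len)
			take = len;

		out[n].dev = e->dev;
		out[n].blk = e->blk + skip;
		out[n].nr = take;
		n++;

		off += take;
		len -= take;
	}
	return len ? -1 : n;
}
```

With two extents, say pages 0-3 on device 0 and pages 4-11 on device 1,
a request for 6 pages starting at page 2 comes back as two ranges, one
per device. That is the "n x (device, block)" part: the caller sees a
single object range, and the mapping decides how it is split.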