From: Hannes Reinecke <hare@suse.de>
To: Keith Busch <kbusch@kernel.org>, Matthew Wilcox <willy@infradead.org>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org,
linux-ide@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org
Subject: Re: [LSF/MM/BPF TOPIC] Memory folios
Date: Thu, 27 May 2021 09:41:51 +0200
Message-ID: <97698689-0a18-81e0-a0ff-b4f92e56be5b@suse.de>
In-Reply-To: <20210526210742.GA3706388@dhcp-10-100-145-180.wdc.com>

On 5/26/21 11:07 PM, Keith Busch wrote:
> On Fri, May 14, 2021 at 06:48:26PM +0100, Matthew Wilcox wrote:
>> On Mon, May 10, 2021 at 06:56:17PM +0100, Matthew Wilcox wrote:
>>> I don't know exactly how much will be left to discuss about supporting
>>> larger memory allocation units in the page cache by December. In my
>>> ideal world, all the patches I've submitted so far are accepted, I
>>> persuade every filesystem maintainer to convert their own filesystem
>>> and struct page is nothing but a bad memory by December. In reality,
>>> I'm just not that persuasive.
>>>
>>> So, probably some kind of discussion will be worthwhile about
>>> converting the remaining filesystems to use folios, when it's worth
>>> having filesystems opt-in to multi-page folios, what we can do about
>>> buffer-head based filesystems, and so on.
>>>
>>> Hopefully we aren't still discussing whether folios are a good idea
>>> or not by then.
>>
>> I got an email from Hannes today asking about memory folios as they
>> pertain to the block layer, and I thought this would be a good chance
>> to talk about them. If you're not familiar with the term "folio",
>> https://lore.kernel.org/lkml/20210505150628.111735-10-willy@infradead.org/
>> is not a bad introduction.
>>
>> Thanks to the work done by Ming Lei in 2017, the block layer already
>> supports multipage bvecs, so to a first order of approximation, I don't
>> need anything from the block layer on down through the various storage
>> layers. Which is why I haven't been talking to anyone in storage!
>>
>> It might change (slightly) the contents of bios. For example,
>> bvec[n]->bv_offset might now be larger than PAGE_SIZE. Drivers should
>> handle this OK, but probably haven't been audited to make sure they do.
>> Mostly, it's simply that drivers will now see fewer, larger segments
>> in their bios. Once a filesystem supports multipage folios, we will
>> allocate order-N pages as part of readahead (and sufficiently large
>> writes). Dirtiness is tracked on a per-folio basis (not per page),
>> so folios take trips around the LRU as a single unit and finally make
>> it to being written back as a single unit.
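
That audit point is worth spelling out. The per-page iterator still
normalises everything to single-page segments, so only code that walks
the raw bvec array (or uses bio_for_each_bvec()) sees the multipage
entries. A minimal sketch against the in-tree iterators, not taken from
the patch series itself:

/*
 * Sketch only: the two views a driver can take of the same bio.
 */
#include <linux/bio.h>
#include <linux/bvec.h>

static void sketch_walk_bio(struct bio *bio)
{
        struct bio_vec bv;
        struct bvec_iter iter;

        /*
         * Per-page view: each segment is clamped to a single page and
         * its offset stays below PAGE_SIZE, so kmap()-style code keeps
         * working unchanged.
         */
        bio_for_each_segment(bv, bio, iter) {
                /* bv describes at most one page here. */
        }

        /*
         * Multipage view: fewer, larger entries; as noted above,
         * bv.bv_offset may now exceed PAGE_SIZE.
         */
        bio_for_each_bvec(bv, bio, iter) {
                /* Treat (page, offset, len) as one contiguous range. */
        }
}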
>>
>> Drivers still need to cope with sub-folio-sized reads and writes.
>> O_DIRECT still exists and (eg) doing a sub-page, block-aligned write
>> will not necessarily cause readaround to happen. Filesystems may read
>> and write their own metadata at whatever granularity and alignment they
>> see fit. But the vast majority of pagecache I/O will be folio-sized
>> and folio-aligned.
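
Right, and the smallest such request does not even need the page cache.
A 512-byte, block-aligned O_DIRECT write goes straight to the device as
a sub-page I/O regardless of how large the folios above it are. A
minimal userspace sketch, assuming a 512-byte logical block size and a
hypothetical /dev/sdX:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        void *buf;
        int fd;

        /* O_DIRECT wants block-aligned buffers, offsets and lengths. */
        if (posix_memalign(&buf, 4096, 512))
                return 1;
        memset(buf, 0xab, 512);

        fd = open("/dev/sdX", O_WRONLY | O_DIRECT);
        if (fd < 0)
                return 1;

        /* 512 bytes at a 512-byte-aligned offset: far below folio size. */
        if (pwrite(fd, buf, 512, 4096) != 512)
                return 1;

        close(fd);
        free(buf);
        return 0;
}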
>>
>> I do have two small patches which make it easier for the one
>> filesystem that I've converted so far (iomap/xfs) to add folios to bios
>> and get folios back out of bios:
>>
>> https://lore.kernel.org/lkml/20210505150628.111735-72-willy@infradead.org/
>> https://lore.kernel.org/lkml/20210505150628.111735-73-willy@infradead.org/
>>
>> as well as a third patch that estimates how large a bio to allocate,
>> given the current folio that it's working on:
>> https://git.infradead.org/users/willy/pagecache.git/commitdiff/89541b126a59dc7319ad618767e2d880fcadd6c2
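
If I read those right, they add a bio_add_folio() helper and a
bio_for_each_folio_all() iterator; usage from a filesystem would then
look roughly like the sketch below (my reading, not taken from the
iomap conversion itself):

#include <linux/bio.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Submission side: try to add the whole folio as a single segment. */
static bool sketch_add_folio(struct bio *bio, struct folio *folio)
{
        /*
         * Returns false once the bio is full; the caller then submits
         * this bio and continues with a freshly allocated one.
         */
        return bio_add_folio(bio, folio, folio_size(folio), 0);
}

/* Completion side: walk folio-sized chunks instead of pages. */
static void sketch_read_end_io(struct bio *bio)
{
        struct folio_iter fi;

        bio_for_each_folio_all(fi, bio) {
                /* fi.folio, fi.offset and fi.length describe each chunk. */
                folio_mark_uptodate(fi.folio);
                folio_unlock(fi.folio);
        }
        bio_put(bio);
}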
>>
>> It would be possible to make other changes in future. For example, if
>> we decide it'd be better, we could change bvecs from being (page, offset,
>> length) to (folio, offset, length). I don't know that it's worth doing;
>> it would need to be evaluated on its merits. Personally, I'd rather
>> see us move to a (phys_addr, length) pair, but I'm a little busy at the
>> moment.
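
Just to make the phys_addr idea concrete (purely hypothetical, nothing
that exists as code today): such a segment descriptor would drop the
struct page reference entirely and recover it only where still needed,
e.g.:

#include <linux/mm.h>
#include <linux/pfn.h>
#include <linux/types.h>

/* Hypothetical (phys_addr, length) segment descriptor. */
struct phys_vec {
        phys_addr_t     pv_addr;        /* physical start of the segment */
        unsigned int    pv_len;         /* length in bytes */
};

/* Code that still wants a struct page could derive it on demand. */
static inline struct page *phys_vec_page(const struct phys_vec *pv)
{
        return pfn_to_page(PHYS_PFN(pv->pv_addr));
}

static inline unsigned int phys_vec_offset(const struct phys_vec *pv)
{
        return offset_in_page(pv->pv_addr);
}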
>>
>> Hannes has some fun ideas about using the folio work to support larger
>> sector sizes, and I think they're doable.
>
> I'm also interested in this, and was looking into the exact same thing
> recently. Some of the very high capacity SSDs could really benefit
> from better large sector support. If this is a topic for the conference,
> I would like to attend this session.
>
And, of course, so would I :-)
Cheers,
Hannes
--
Dr. Hannes Reinecke                    Kernel Storage Architect
hare@suse.de                           +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)