From: Luis Chamberlain <mcgrof@kernel.org>
To: Matthew Wilcox <willy@infradead.org>
Cc: "Hannes Reinecke" <hare@suse.de>,
"Keith Busch" <kbusch@kernel.org>,
"Theodore Ts'o" <tytso@mit.edu>,
"Pankaj Raghav" <p.raghav@samsung.com>,
"Daniel Gomez" <da.gomez@samsung.com>,
"Javier González" <javier.gonz@samsung.com>,
"Klaus Jensen" <its@irrelevant.dk>,
lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, linux-block@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
Date: Sat, 4 Mar 2023 10:53:47 -0800
Message-ID: <ZAOTu5qk+ax88+d9@bombadil.infradead.org>
In-Reply-To: <ZAOF3p+vqA6pd7px@casper.infradead.org>

On Sat, Mar 04, 2023 at 05:54:38PM +0000, Matthew Wilcox wrote:
> On Sat, Mar 04, 2023 at 06:17:35PM +0100, Hannes Reinecke wrote:
> > On 3/4/23 17:47, Matthew Wilcox wrote:
> > > On Sat, Mar 04, 2023 at 12:08:36PM +0100, Hannes Reinecke wrote:
> > > > We could implement a (virtual) zoned device, and expose each zone as a
> > > > block. That gives us the required large block characteristics, and with
> > > > a bit of luck we might be able to dial up to really large block sizes
> > > > like the 256M sizes on current SMR drives.
> > > > ublk might be a good starting point.
> > >
> > > Ummmm. Is supporting 256MB block sizes really a desired goal? I suggest
> > > that is far past the knee of the curve; if we can only write 256MB chunks
> > > as a single entity, we're looking more at a filesystem redesign than we
> > > are at making filesystems and the MM support 256MB size blocks.
> > >
> > Naa, not really. It _would_ be cool as we could get rid of all the kludges
> > we have nowadays regarding sequential writes.
> > And, remember, 256M is just a number someone thought to be a good
> > compromise. If we end up with a lower number (16M?) we might be able
> > to convince the powers that be to change their zone size.
> > Heck, with 16M block size there wouldn't be a _need_ for zones in
> > the first place.
> >
> > But yeah, 256M is excessive. Initially I would shoot for something
> > like 2M.
>
> I think we're talking about different things (probably different storage
> vendors want different things, or even different people at the same
> storage vendor want different things).
>
> Luis and I are talking about larger LBA sizes. That is, the minimum
> read/write size from the block device is 16kB or 64kB or whatever.
> In this scenario, the minimum amount of space occupied by a file goes
> up from 512 bytes or 4kB to 64kB. That's doable, even if somewhat
> suboptimal.

Yes.

> Your concern seems to be more around shingled devices (or their equivalent
> in SSD terms) where there are large zones which are append-only, but
> you can still random-read 512 byte LBAs. I think there are different
> solutions to these problems, and people are working on both of these
> problems.
>
> But if storage vendors are really pushing for 256MB LBAs, then that's
> going to need a third kind of solution, and I'm not aware of anyone
> working on that.

Hannes had replied to my suggestion about a way to *optimally*
*virtualize* a real storage controller with a larger LBA. In that thread
I was hinting that on the hypervisor we should avoid cache=passthrough
and instead use something like cache=writeback, or even cache=unsafe for
experimentation, with virtio-blk-pci. For a more elaborate description
of these modes see [0], but the skinny is that cache=writeback uses the
host storage controller while the others rely on the host page cache.
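
For example, a rough sketch of the relevant qemu bits (the image path,
memory size and the 16k logical block size below are only placeholders,
and whether a given logical_block_size is accepted depends on the QEMU
version):

  qemu-system-x86_64 -machine q35,accel=kvm -m 4G \
    -drive file=/path/to/test.img,format=raw,if=none,id=d0,cache=writeback \
    -device virtio-blk-pci,drive=d0,logical_block_size=16384,physical_block_size=16384
  # swap cache=writeback for cache=unsafe for throwaway experiments

Inside the guest the exposed LBA size can then be sanity checked via
/sys/block/vda/queue/logical_block_size.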

The latency overhead incurred by whatever we use to replicate larger
LBAs should be mitigated, so I don't think using a zoned storage zone
for this would be a good fit.

I was asking whether or not experimenting with a different host page
cache PAGE_SIZE might help replicate things a bit more realistically,
even if it was suboptimal for the host for the reasons previously noted
as stupid.

If sticking to PAGE_SIZE on the host, another idea may be to use tmpfs +
huge pages so as to at least mitigate TLB lookups.
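
A minimal sketch of that, assuming the host has transparent huge pages
enabled for shmem/tmpfs (the mount point, sizes and image name are made
up):

  # back the guest disk image with huge page backed tmpfs
  mkdir -p /mnt/vm-images
  mount -t tmpfs -o size=16G,huge=always tmpfs /mnt/vm-images
  qemu-img create -f raw /mnt/vm-images/test.img 8G
  # then point the -drive file= above at /mnt/vm-images/test.img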

[0] https://github.com/linux-kdevops/kdevops/commit/94844c4684a51997cb327d2fb0ce491fe4429dfc

  Luis