From: "Martin K. Petersen" <martin.petersen@oracle.com>
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Javier González" <javier.gonz@samsung.com>,
"Matthew Wilcox" <willy@infradead.org>,
"Theodore Ts'o" <tytso@mit.edu>, "Hannes Reinecke" <hare@suse.de>,
"Luis Chamberlain" <mcgrof@kernel.org>,
"Keith Busch" <kbusch@kernel.org>,
"Pankaj Raghav" <p.raghav@samsung.com>,
"Daniel Gomez" <da.gomez@samsung.com>,
lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, linux-block@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
Date: Thu, 09 Mar 2023 10:23:31 -0500
Message-ID: <yq11qlygevs.fsf@ca-mkp.ca.oracle.com>
In-Reply-To: <260064c68b61f4a7bc49f09499e1c107e2a28f31.camel@HansenPartnership.com> (James Bottomley's message of "Thu, 09 Mar 2023 08:11:35 -0500")
James,
> Well a decade ago we did a lot of work to support 4k sector devices.
> Ultimately the industry went with 512 logical/4k physical devices
> because of problems with non-Linux proprietary OSs but you could still
> use 4k today if you wanted (I've actually still got a working 4k SCSI
> drive), so why is no NVMe device doing that?
FWIW, I have SATA, SAS, and NVMe devices that report 4KB logical.
The reason the industry converged on 512e is that the performance
problems were solved by ensuring correct alignment and transfer length.
Almost every I/O we submit is a multiple of 4KB. So if things are
properly aligned wrt. the device's physical block size, it is irrelevant
whether we express CDB fields in units of 512 bytes or 4KB. We're still
transferring the same number of bytes.
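
To make the units argument concrete, here is a minimal Python sketch. The helper name describe_io() and the 4096-byte physical block size are assumptions chosen for illustration; they are not taken from any driver or from this thread. The sketch expresses the same byte range in 512-byte and 4KB logical blocks and checks alignment against the physical block size.

```python
def describe_io(offset_bytes, length_bytes, logical_bs, physical_bs=4096):
    """Express one I/O in logical-block units and check physical alignment.

    Hypothetical helper for illustration; physical_bs=4096 is an assumption.
    """
    if offset_bytes % logical_bs or length_bytes % logical_bs:
        raise ValueError("I/O is not a whole number of logical blocks")
    return {
        "lba": offset_bytes // logical_bs,          # starting logical block
        "nr_blocks": length_bytes // logical_bs,    # transfer length in blocks
        "bytes": length_bytes,                      # bytes on the wire
        "aligned": offset_bytes % physical_bs == 0
        and length_bytes % physical_bs == 0,
    }


# The same 8 KiB write at a 1 MiB offset, on a 512e and on a 4Kn device.
io_512e = describe_io(1 << 20, 8192, logical_bs=512)
io_4kn = describe_io(1 << 20, 8192, logical_bs=4096)

# A single 512-byte write is legal on 512e but not physically aligned.
small_write = describe_io(512, 512, logical_bs=512)

print(io_512e)
print(io_4kn)
print(small_write)
```

Only the LBA and block-count fields differ between the two devices; the byte count and the alignment result are identical. The last case is the sub-physical-block write that a 512e drive accepts but 4Kn cannot express.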
In addition, 512e had two advantages that 4Kn didn't:
1. Legacy applications doing direct I/O and expecting 512-byte blocks
kept working (albeit with a penalty for writes smaller than a
physical block).
2. For things like PI where the 16-bit CRC is underwhelming wrt.
detecting errors in 4096 bytes of data, leaving the protection
interval at 512 bytes was also a benefit. So while 4Kn adoption
looked strong inside enterprise disk arrays initially, several
vendors ended up with 512e for PI reasons.
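
The protection-interval point in (2) can be sketched as follows. This is an illustration only, not kernel code: it uses a bitwise CRC-16 with the T10 DIF polynomial 0x8BB7 and a zero initial value, and the helper names and test payload are made up for the example. It computes one guard tag per protection interval over the same 4096 bytes of data.

```python
T10_DIF_POLY = 0x8BB7  # generator polynomial of the 16-bit T10 DIF guard CRC


def crc16_t10dif(data):
    """Bitwise CRC-16 (MSB first, init 0, no final XOR). Slow but simple."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = (crc << 1) ^ T10_DIF_POLY
            else:
                crc <<= 1
            crc &= 0xFFFF
    return crc


def guard_tags(data, interval):
    """Return one guard CRC per protection interval (hypothetical helper)."""
    if len(data) % interval:
        raise ValueError("data is not a whole number of protection intervals")
    return [
        crc16_t10dif(data[i:i + interval])
        for i in range(0, len(data), interval)
    ]


payload = bytes(range(256)) * 16        # 4096 bytes of illustrative data

tags_512 = guard_tags(payload, 512)     # eight CRCs, each covering 512 bytes
tags_4k = guard_tags(payload, 4096)     # one CRC covering all 4096 bytes

print(len(tags_512), "guard tags at a 512-byte interval")
print(len(tags_4k), "guard tag at a 4096-byte interval")
```

With a 512-byte interval each 16-bit CRC only has to cover 512 bytes, and a corrupted byte is pinned to one interval. With a 4096-byte interval a single 16-bit value has to cover eight times as much data.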
Once I/Os from the OS were properly aligned, there was just no
compelling reason for anyone to go with 4Kn and have to deal with
multiple SKUs, etc.
For NVMe, 4Kn was prevalent for a while but drives have started
gravitating towards 512n/512e. Perhaps because of (1) above. Plus
whatever problems there may be on other platforms as you mentioned...
> This is not to say I think larger block sizes is in any way a bad idea
> ... I just think that given the history, it will be driven by
> application needs rather than what the manufacturers tell us.
I think it would be beneficial for Linux to support filesystem blocks
larger than the page size. Based on experience outlined above, I am not
convinced larger logical block sizes will get much traction. But that
doesn't prevent devices from advertising larger physical/minimum/optimal
I/O sizes and for us to handle those more gracefully than we currently
do.
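
For reference, the sizes a device advertises are visible from userspace under /sys/block/<dev>/queue/. The Python sketch below reads them. The four attribute names are the standard block-layer sysfs files; the helper names and the "largest hint wins" policy in preferred_io_granularity() are assumptions made for illustration, not what the kernel or any filesystem does today.

```python
from pathlib import Path

# Standard block-layer queue attributes, all reported in bytes.
QUEUE_ATTRS = (
    "logical_block_size",
    "physical_block_size",
    "minimum_io_size",
    "optimal_io_size",
)


def read_queue_limits(device, sysfs_root="/sys/block"):
    """Read the advertised block and I/O sizes for one block device."""
    queue_dir = Path(sysfs_root) / device / "queue"
    return {attr: int((queue_dir / attr).read_text()) for attr in QUEUE_ATTRS}


def preferred_io_granularity(limits):
    """Pick the largest advertised hint (hypothetical policy).

    optimal_io_size is 0 when the device reports no preference, so the
    physical block size acts as the floor.
    """
    return max(
        limits["physical_block_size"],
        limits["minimum_io_size"],
        limits["optimal_io_size"],
    )


def is_512e(limits):
    """True for 512-byte logical blocks emulated on larger physical blocks."""
    return (
        limits["logical_block_size"] == 512
        and limits["physical_block_size"] > 512
    )
```

For example, read_queue_limits("nvme0n1") returns a dict of the four values for that device, assuming it exists on the system. A drive can keep a 512-byte logical block there while still advertising a larger physical or optimal size.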
--
Martin K. Petersen Oracle Linux Engineering