* [PATCH v3 0/8] enable bs > ps for block devices
@ 2025-02-21 22:38 Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 1/8] fs/buffer: simplify block_read_full_folio() with bh_offset() Luis Chamberlain
` (8 more replies)
0 siblings, 9 replies; 11+ messages in thread
From: Luis Chamberlain @ 2025-02-21 22:38 UTC (permalink / raw)
To: brauner, akpm, hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
Christian, Andrew,
This v3 series addresses the feedback from the v2 series [0]. The only
patch which was modified was the patch titled "fs/mpage: use blocks_per_folio
instead of blocks_per_page". The main motivation for this series is to
start supporting block devices with logical block sizes larger than 4k;
we do this by addressing the buffer-head support required for the block
device cache.
In the future these changes can be leveraged to also start experimenting
with LBS support for filesystems which support only buffer-heads. This
paves the way for that work.
It is perhaps surprising to some, but since this also lifts the block
device cache sector size support to 64k, devices which support sector
sizes up to 64k can also leverage this to enable filesystems created with
larger sector sizes, up to 64k. The filesystem sector size is used or
documented rather obscurely except for a few filesystems, but in short
it ensures that the filesystem itself will not generate writes smaller
than the specified sector size. In practice this means you can constrain
metadata writes to a minimum size as well, and so be completely
deterministic with regard to the specified sector size for minimum IO
writes. For example, since XFS supports sector sizes up to 32k, these
changes enable filesystems to be created on x86_64 with both the
filesystem block size and the sector size set to 32k, now that the block
device cache limitation is lifted.
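As a hedged illustration of the above (the device path is a placeholder
and this assumes a kernel with these patches applied), creating an XFS
filesystem with both a 32k block size and a 32k sector size would look
like:

```sh
# Illustrative only: /dev/nvme0n1 is a placeholder device.
# Requires a kernel where the block device cache limit is lifted.
mkfs.xfs -f -b size=32k -s size=32k /dev/nvme0n1

# The filesystem then never issues writes, including metadata
# writes, smaller than the 32k sector size.
```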
Since this touches buffer-heads I've run this through fstests on ext4
and found no new regressions. I've also used blktests against a kernel
built with these changes to test block devices with logical block sizes
larger than 4k on x86_64. All changes needed to test block devices with
logical block size support > 4k are now merged in upstream blktests.
I've tested the block layer with blktests against block devices with
logical block sizes up to 64k, which is the max we currently support,
and found no new regressions.
Detailed changes in this series:
- Modifies the commit log for "fs/buffer: remove batching from async
read" as per Willy's request and collects his SOB.
- Collects Reviewed-by tags.
- The patch titled "fs/mpage: use blocks_per_folio instead of blocks_per_page"
received more work to address Willy's point that we should keep the
nr_pages accounting in mpage. This is done by using folio_nr_pages()
on the args passed and adjusting the last_block accounting accordingly.
- Through code inspection, fixed folio_zero_segment() to use
folio_size() as we move to support large folios for unmapped
folio segments in do_mpage_readpage(); this is dealt with in the
patch titled "fs/mpage: use blocks_per_folio instead of blocks_per_page"
as that's where large folios enter the picture.
[0] https://lkml.kernel.org/r/20250204231209.429356-1-mcgrof@kernel.org
Hannes Reinecke (2):
fs/mpage: avoid negative shift for large blocksize
block/bdev: enable large folio support for large logical block sizes
Luis Chamberlain (5):
fs/buffer: simplify block_read_full_folio() with bh_offset()
fs/mpage: use blocks_per_folio instead of blocks_per_page
fs/buffer fs/mpage: remove large folio restriction
block/bdev: lift block size restrictions to 64k
bdev: use bdev_io_min() for statx block size
Matthew Wilcox (1):
fs/buffer: remove batching from async read
block/bdev.c | 11 ++++----
fs/buffer.c | 58 +++++++++++++++++-------------------------
fs/mpage.c | 49 +++++++++++++++++------------------
include/linux/blkdev.h | 8 +++++-
4 files changed, 59 insertions(+), 67 deletions(-)
--
2.47.2
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v3 1/8] fs/buffer: simplify block_read_full_folio() with bh_offset()
2025-02-21 22:38 [PATCH v3 0/8] enable bs > ps for block devices Luis Chamberlain
@ 2025-02-21 22:38 ` Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 2/8] fs/buffer: remove batching from async read Luis Chamberlain
` (7 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Luis Chamberlain @ 2025-02-21 22:38 UTC (permalink / raw)
To: brauner, akpm, hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
When we read over all buffers in a folio we currently use the
buffer index within the folio and the blocksize to get the offset.
Simplify this with bh_offset(). This simplifies the loop while making
no functional changes.
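As a rough userspace sketch of the arithmetic being replaced (all names
here are illustrative models, not the kernel definitions): bh_offset()
derives a buffer's byte offset within its folio from the buffer's data
pointer, which makes the separate index counter redundant:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model, not kernel code: the buffers of one folio laid
 * out back to back in the folio's backing memory. */
struct bh_model {
	char *b_data;	/* points into the folio's backing memory */
};

/* Model of bh_offset(): byte offset of the buffer within its folio. */
static size_t bh_offset_model(const struct bh_model *bh,
			      const char *folio_base)
{
	return (size_t)(bh->b_data - folio_base);
}
```

For buffer i with a given blocksize this equals i * blocksize, which is
why the explicit index variable can be dropped from the loop.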
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index cc8452f60251..b99560e8a142 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2381,7 +2381,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
lblock = div_u64(limit + blocksize - 1, blocksize);
bh = head;
nr = 0;
- i = 0;
do {
if (buffer_uptodate(bh))
@@ -2398,7 +2397,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
page_error = true;
}
if (!buffer_mapped(bh)) {
- folio_zero_range(folio, i * blocksize,
+ folio_zero_range(folio, bh_offset(bh),
blocksize);
if (!err)
set_buffer_uptodate(bh);
@@ -2412,7 +2411,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
continue;
}
arr[nr++] = bh;
- } while (i++, iblock++, (bh = bh->b_this_page) != head);
+ } while (iblock++, (bh = bh->b_this_page) != head);
if (fully_mapped)
folio_set_mappedtodisk(folio);
--
2.47.2
* [PATCH v3 2/8] fs/buffer: remove batching from async read
2025-02-21 22:38 [PATCH v3 0/8] enable bs > ps for block devices Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 1/8] fs/buffer: simplify block_read_full_folio() with bh_offset() Luis Chamberlain
@ 2025-02-21 22:38 ` Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 3/8] fs/mpage: avoid negative shift for large blocksize Luis Chamberlain
` (6 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Luis Chamberlain @ 2025-02-21 22:38 UTC (permalink / raw)
To: brauner, akpm, hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
From: Matthew Wilcox <willy@infradead.org>
block_read_full_folio() currently puts all !uptodate buffers into
an array allocated on the stack, then iterates over it twice, first
locking the buffers and then submitting them for read. We want to
remove this array because it occupies too much stack space on
configurations with a larger PAGE_SIZE (e.g. 512 bytes with 8-byte
pointers and a 64KiB PAGE_SIZE).
We cannot simply submit buffer heads as we find them as the completion
handler needs to be able to tell when all reads are finished, so it can
end the folio read. So we keep one buffer in reserve (using the 'prev'
variable) until the end of the function.
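A minimal userspace sketch of this "keep one in reserve" pattern (the
names and types are illustrative, not the kernel's): each item is
submitted only once its successor is known, so the held-back item is
guaranteed to be the last submission:

```c
#include <stddef.h>

/*
 * Illustrative model: record early submissions in 'submitted' and hand
 * the held-back item to the caller via 'last'. Returns the number of
 * items submitted inside the loop; the caller submits *last afterwards,
 * or ends the read directly when nothing needed IO (return 0, *last
 * untouched).
 */
static size_t submit_with_reserve(const int *items, size_t n,
				  int *submitted, int *last)
{
	const int *prev = NULL;
	size_t i, nsub = 0;

	for (i = 0; i < n; i++) {
		if (prev)
			submitted[nsub++] = *prev; /* submit_bh(REQ_OP_READ, prev) */
		prev = &items[i];
	}
	if (prev)
		*last = *prev;	/* final submission after the loop */
	return nsub;
}
```

Because the final submission always happens after the loop, the
completion handler can safely end the folio read when the last buffer
completes.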
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 51 +++++++++++++++++++++------------------------------
1 file changed, 21 insertions(+), 30 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index b99560e8a142..167fa3e33566 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2361,9 +2361,8 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
{
struct inode *inode = folio->mapping->host;
sector_t iblock, lblock;
- struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+ struct buffer_head *bh, *head, *prev = NULL;
size_t blocksize;
- int nr, i;
int fully_mapped = 1;
bool page_error = false;
loff_t limit = i_size_read(inode);
@@ -2380,7 +2379,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
iblock = div_u64(folio_pos(folio), blocksize);
lblock = div_u64(limit + blocksize - 1, blocksize);
bh = head;
- nr = 0;
do {
if (buffer_uptodate(bh))
@@ -2410,40 +2408,33 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
if (buffer_uptodate(bh))
continue;
}
- arr[nr++] = bh;
+
+ lock_buffer(bh);
+ if (buffer_uptodate(bh)) {
+ unlock_buffer(bh);
+ continue;
+ }
+
+ mark_buffer_async_read(bh);
+ if (prev)
+ submit_bh(REQ_OP_READ, prev);
+ prev = bh;
} while (iblock++, (bh = bh->b_this_page) != head);
if (fully_mapped)
folio_set_mappedtodisk(folio);
- if (!nr) {
- /*
- * All buffers are uptodate or get_block() returned an
- * error when trying to map them - we can finish the read.
- */
- folio_end_read(folio, !page_error);
- return 0;
- }
-
- /* Stage two: lock the buffers */
- for (i = 0; i < nr; i++) {
- bh = arr[i];
- lock_buffer(bh);
- mark_buffer_async_read(bh);
- }
-
/*
- * Stage 3: start the IO. Check for uptodateness
- * inside the buffer lock in case another process reading
- * the underlying blockdev brought it uptodate (the sct fix).
+ * All buffers are uptodate or get_block() returned an error
+ * when trying to map them - we must finish the read because
+ * end_buffer_async_read() will never be called on any buffer
+ * in this folio.
*/
- for (i = 0; i < nr; i++) {
- bh = arr[i];
- if (buffer_uptodate(bh))
- end_buffer_async_read(bh, 1);
- else
- submit_bh(REQ_OP_READ, bh);
- }
+ if (prev)
+ submit_bh(REQ_OP_READ, prev);
+ else
+ folio_end_read(folio, !page_error);
+
return 0;
}
EXPORT_SYMBOL(block_read_full_folio);
--
2.47.2
* [PATCH v3 3/8] fs/mpage: avoid negative shift for large blocksize
2025-02-21 22:38 [PATCH v3 0/8] enable bs > ps for block devices Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 1/8] fs/buffer: simplify block_read_full_folio() with bh_offset() Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 2/8] fs/buffer: remove batching from async read Luis Chamberlain
@ 2025-02-21 22:38 ` Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page Luis Chamberlain
` (5 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Luis Chamberlain @ 2025-02-21 22:38 UTC (permalink / raw)
To: brauner, akpm, hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof, Hannes Reinecke
From: Hannes Reinecke <hare@kernel.org>
For large blocksizes the number of block bits is larger than PAGE_SHIFT,
so calculate the sector number from the byte offset instead. This is
required to enable large folios with buffer-heads.
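A small sketch of why this matters (the PAGE_SHIFT of 12 is an
assumption matching 4k pages): with blkbits > PAGE_SHIFT the old
`index << (PAGE_SHIFT - blkbits)` form shifts by a negative count,
which is undefined behavior in C, while the byte-offset form works for
any block size:

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4k pages */

/* New form: derive the logical block number from the byte offset. */
static uint64_t block_from_pos(uint64_t folio_pos, unsigned blkbits)
{
	return folio_pos >> blkbits;
}

/* Old form, only valid while blkbits <= PAGE_SHIFT. */
static uint64_t block_from_index(uint64_t index, unsigned blkbits)
{
	return index << (MODEL_PAGE_SHIFT - blkbits);
}
```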
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
fs/mpage.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/mpage.c b/fs/mpage.c
index 82aecf372743..a3c82206977f 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -181,7 +181,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
if (folio_buffers(folio))
goto confused;
- block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
+ block_in_file = folio_pos(folio) >> blkbits;
last_block = block_in_file + args->nr_pages * blocks_per_page;
last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
if (last_block > last_block_in_file)
@@ -527,7 +527,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
* The page has no buffers: map it to disk
*/
BUG_ON(!folio_test_uptodate(folio));
- block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
+ block_in_file = folio_pos(folio) >> blkbits;
/*
* Whole page beyond EOF? Skip allocating blocks to avoid leaking
* space.
--
2.47.2
* [PATCH v3 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page
2025-02-21 22:38 [PATCH v3 0/8] enable bs > ps for block devices Luis Chamberlain
` (2 preceding siblings ...)
2025-02-21 22:38 ` [PATCH v3 3/8] fs/mpage: avoid negative shift for large blocksize Luis Chamberlain
@ 2025-02-21 22:38 ` Luis Chamberlain
2025-02-24 7:44 ` Hannes Reinecke
2025-02-21 22:38 ` [PATCH v3 5/8] fs/buffer fs/mpage: remove large folio restriction Luis Chamberlain
` (4 subsequent siblings)
8 siblings, 1 reply; 11+ messages in thread
From: Luis Chamberlain @ 2025-02-21 22:38 UTC (permalink / raw)
To: brauner, akpm, hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
Convert mpage to folios and adjust the accounting for the number of
blocks within a folio instead of within a single page. This also adjusts
the number of pages we should process to the size of the folio, to
ensure we always read a full folio.
Note that the page cache code already ensures do_mpage_readpage() will
work with folios respecting the address space min order; this ensures
that, so long as folio_size() is used for our requirements, mpage will
now also be able to process block sizes larger than the page size.
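The accounting change can be sketched as follows (a userspace model;
the 4k PAGE_SIZE is an assumption):

```c
#include <stdint.h>
#include <stddef.h>

/* blocks_per_folio = folio_size(folio) >> blkbits. With bs > ps this
 * stays >= 1 because the page cache enforces a matching minimum folio
 * order, whereas the old PAGE_SIZE >> blkbits would underflow to 0. */
static unsigned blocks_per_folio(size_t folio_size, unsigned blkbits)
{
	return (unsigned)(folio_size >> blkbits);
}

/* last_block: args->nr_pages counts PAGE_SIZE units, so convert through
 * bytes instead of multiplying by a per-page block count that can be 0. */
static uint64_t last_block(uint64_t block_in_file, unsigned nr_pages,
			   unsigned blkbits)
{
	return block_in_file + (((uint64_t)nr_pages * 4096) >> blkbits);
}
```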
Originally-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/mpage.c | 42 +++++++++++++++++++++---------------------
1 file changed, 21 insertions(+), 21 deletions(-)
diff --git a/fs/mpage.c b/fs/mpage.c
index a3c82206977f..9c8cf4015238 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -107,7 +107,7 @@ static void map_buffer_to_folio(struct folio *folio, struct buffer_head *bh,
* don't make any buffers if there is only one buffer on
* the folio and the folio just needs to be set up to date
*/
- if (inode->i_blkbits == PAGE_SHIFT &&
+ if (inode->i_blkbits == folio_shift(folio) &&
buffer_uptodate(bh)) {
folio_mark_uptodate(folio);
return;
@@ -153,7 +153,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
struct folio *folio = args->folio;
struct inode *inode = folio->mapping->host;
const unsigned blkbits = inode->i_blkbits;
- const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
+ const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
const unsigned blocksize = 1 << blkbits;
struct buffer_head *map_bh = &args->map_bh;
sector_t block_in_file;
@@ -161,7 +161,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
sector_t last_block_in_file;
sector_t first_block;
unsigned page_block;
- unsigned first_hole = blocks_per_page;
+ unsigned first_hole = blocks_per_folio;
struct block_device *bdev = NULL;
int length;
int fully_mapped = 1;
@@ -182,7 +182,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
goto confused;
block_in_file = folio_pos(folio) >> blkbits;
- last_block = block_in_file + args->nr_pages * blocks_per_page;
+ last_block = block_in_file + ((args->nr_pages * PAGE_SIZE) >> blkbits);
last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
if (last_block > last_block_in_file)
last_block = last_block_in_file;
@@ -204,7 +204,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
clear_buffer_mapped(map_bh);
break;
}
- if (page_block == blocks_per_page)
+ if (page_block == blocks_per_folio)
break;
page_block++;
block_in_file++;
@@ -216,7 +216,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
* Then do more get_blocks calls until we are done with this folio.
*/
map_bh->b_folio = folio;
- while (page_block < blocks_per_page) {
+ while (page_block < blocks_per_folio) {
map_bh->b_state = 0;
map_bh->b_size = 0;
@@ -229,7 +229,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
if (!buffer_mapped(map_bh)) {
fully_mapped = 0;
- if (first_hole == blocks_per_page)
+ if (first_hole == blocks_per_folio)
first_hole = page_block;
page_block++;
block_in_file++;
@@ -247,7 +247,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
goto confused;
}
- if (first_hole != blocks_per_page)
+ if (first_hole != blocks_per_folio)
goto confused; /* hole -> non-hole */
/* Contiguous blocks? */
@@ -260,7 +260,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
if (relative_block == nblocks) {
clear_buffer_mapped(map_bh);
break;
- } else if (page_block == blocks_per_page)
+ } else if (page_block == blocks_per_folio)
break;
page_block++;
block_in_file++;
@@ -268,8 +268,8 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
bdev = map_bh->b_bdev;
}
- if (first_hole != blocks_per_page) {
- folio_zero_segment(folio, first_hole << blkbits, PAGE_SIZE);
+ if (first_hole != blocks_per_folio) {
+ folio_zero_segment(folio, first_hole << blkbits, folio_size(folio));
if (first_hole == 0) {
folio_mark_uptodate(folio);
folio_unlock(folio);
@@ -303,10 +303,10 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
relative_block = block_in_file - args->first_logical_block;
nblocks = map_bh->b_size >> blkbits;
if ((buffer_boundary(map_bh) && relative_block == nblocks) ||
- (first_hole != blocks_per_page))
+ (first_hole != blocks_per_folio))
args->bio = mpage_bio_submit_read(args->bio);
else
- args->last_block_in_bio = first_block + blocks_per_page - 1;
+ args->last_block_in_bio = first_block + blocks_per_folio - 1;
out:
return args->bio;
@@ -385,7 +385,7 @@ int mpage_read_folio(struct folio *folio, get_block_t get_block)
{
struct mpage_readpage_args args = {
.folio = folio,
- .nr_pages = 1,
+ .nr_pages = folio_nr_pages(folio),
.get_block = get_block,
};
@@ -456,12 +456,12 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
struct address_space *mapping = folio->mapping;
struct inode *inode = mapping->host;
const unsigned blkbits = inode->i_blkbits;
- const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
+ const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
sector_t last_block;
sector_t block_in_file;
sector_t first_block;
unsigned page_block;
- unsigned first_unmapped = blocks_per_page;
+ unsigned first_unmapped = blocks_per_folio;
struct block_device *bdev = NULL;
int boundary = 0;
sector_t boundary_block = 0;
@@ -486,12 +486,12 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
*/
if (buffer_dirty(bh))
goto confused;
- if (first_unmapped == blocks_per_page)
+ if (first_unmapped == blocks_per_folio)
first_unmapped = page_block;
continue;
}
- if (first_unmapped != blocks_per_page)
+ if (first_unmapped != blocks_per_folio)
goto confused; /* hole -> non-hole */
if (!buffer_dirty(bh) || !buffer_uptodate(bh))
@@ -536,7 +536,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
goto page_is_mapped;
last_block = (i_size - 1) >> blkbits;
map_bh.b_folio = folio;
- for (page_block = 0; page_block < blocks_per_page; ) {
+ for (page_block = 0; page_block < blocks_per_folio; ) {
map_bh.b_state = 0;
map_bh.b_size = 1 << blkbits;
@@ -618,14 +618,14 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
BUG_ON(folio_test_writeback(folio));
folio_start_writeback(folio);
folio_unlock(folio);
- if (boundary || (first_unmapped != blocks_per_page)) {
+ if (boundary || (first_unmapped != blocks_per_folio)) {
bio = mpage_bio_submit_write(bio);
if (boundary_block) {
write_boundary_block(boundary_bdev,
boundary_block, 1 << blkbits);
}
} else {
- mpd->last_block_in_bio = first_block + blocks_per_page - 1;
+ mpd->last_block_in_bio = first_block + blocks_per_folio - 1;
}
goto out;
--
2.47.2
* [PATCH v3 5/8] fs/buffer fs/mpage: remove large folio restriction
2025-02-21 22:38 [PATCH v3 0/8] enable bs > ps for block devices Luis Chamberlain
` (3 preceding siblings ...)
2025-02-21 22:38 ` [PATCH v3 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page Luis Chamberlain
@ 2025-02-21 22:38 ` Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 6/8] block/bdev: enable large folio support for large logical block sizes Luis Chamberlain
` (3 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Luis Chamberlain @ 2025-02-21 22:38 UTC (permalink / raw)
To: brauner, akpm, hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
Now that buffer-heads have been converted over to support large folios
we can remove the built-in VM_BUG_ON_FOLIO() checks which prevented
their use.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 2 --
fs/mpage.c | 3 ---
2 files changed, 5 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 167fa3e33566..194eacbefc95 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2371,8 +2371,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
limit = inode->i_sb->s_maxbytes;
- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
head = folio_create_buffers(folio, inode, 0);
blocksize = head->b_size;
diff --git a/fs/mpage.c b/fs/mpage.c
index 9c8cf4015238..ad7844de87c3 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -170,9 +170,6 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
unsigned relative_block;
gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
- /* MAX_BUF_PER_PAGE, for example */
- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
if (args->is_readahead) {
opf |= REQ_RAHEAD;
gfp |= __GFP_NORETRY | __GFP_NOWARN;
--
2.47.2
* [PATCH v3 6/8] block/bdev: enable large folio support for large logical block sizes
2025-02-21 22:38 [PATCH v3 0/8] enable bs > ps for block devices Luis Chamberlain
` (4 preceding siblings ...)
2025-02-21 22:38 ` [PATCH v3 5/8] fs/buffer fs/mpage: remove large folio restriction Luis Chamberlain
@ 2025-02-21 22:38 ` Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 7/8] block/bdev: lift block size restrictions to 64k Luis Chamberlain
` (2 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Luis Chamberlain @ 2025-02-21 22:38 UTC (permalink / raw)
To: brauner, akpm, hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
From: Hannes Reinecke <hare@suse.de>
Call mapping_set_folio_min_order() when modifying the logical block
size to ensure folios are allocated with the correct size.
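A sketch of the order computation (a userspace model of get_order() for
power-of-two sizes, with a 4k page assumed): a block size of 4k or less
needs order-0 folios, a 16k block size needs order 2, and so on:

```c
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4k pages */

/* Model of get_order() for power-of-two sizes: the smallest folio
 * order whose folio covers a block of 'size' bytes. */
static unsigned order_for_size(size_t size)
{
	size_t folio = (size_t)1 << MODEL_PAGE_SHIFT;
	unsigned order = 0;

	while (folio < size) {
		folio <<= 1;
		order++;
	}
	return order;
}
```

mapping_set_folio_min_order() with this order guarantees the page cache
never allocates a folio smaller than one logical block.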
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Hannes Reinecke <hare@suse.de>
---
block/bdev.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/block/bdev.c b/block/bdev.c
index 9d73a8fbf7f9..8aadf1f23cb4 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -148,6 +148,8 @@ static void set_init_blocksize(struct block_device *bdev)
bsize <<= 1;
}
BD_INODE(bdev)->i_blkbits = blksize_bits(bsize);
+ mapping_set_folio_min_order(BD_INODE(bdev)->i_mapping,
+ get_order(bsize));
}
int set_blocksize(struct file *file, int size)
@@ -169,6 +171,7 @@ int set_blocksize(struct file *file, int size)
if (inode->i_blkbits != blksize_bits(size)) {
sync_blockdev(bdev);
inode->i_blkbits = blksize_bits(size);
+ mapping_set_folio_min_order(inode->i_mapping, get_order(size));
kill_bdev(bdev);
}
return 0;
--
2.47.2
* [PATCH v3 7/8] block/bdev: lift block size restrictions to 64k
2025-02-21 22:38 [PATCH v3 0/8] enable bs > ps for block devices Luis Chamberlain
` (5 preceding siblings ...)
2025-02-21 22:38 ` [PATCH v3 6/8] block/bdev: enable large folio support for large logical block sizes Luis Chamberlain
@ 2025-02-21 22:38 ` Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 8/8] bdev: use bdev_io_min() for statx block size Luis Chamberlain
2025-02-24 10:45 ` [PATCH v3 0/8] enable bs > ps for block devices Christian Brauner
8 siblings, 0 replies; 11+ messages in thread
From: Luis Chamberlain @ 2025-02-21 22:38 UTC (permalink / raw)
To: brauner, akpm, hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
We can now support block sizes larger than PAGE_SIZE, so in theory
we should be able to lift the restriction up to the max supported page
cache order. However, we bound ourselves to what we can currently
validate and test. Through blktests and fstests we can validate up to
64k today.
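The new bound can be modeled in userspace like so (a sketch of the
validation logic, not the kernel source):

```c
/* Model of blk_validate_block_size() with the lifted 64k bound. */
#define MODEL_BLK_MAX_BLOCK_SIZE (64UL * 1024)

static int model_is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

static int model_validate_block_size(unsigned long bsize)
{
	if (bsize < 512 || bsize > MODEL_BLK_MAX_BLOCK_SIZE ||
	    !model_is_power_of_2(bsize))
		return -1;	/* stands in for -EINVAL */
	return 0;
}
```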
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
block/bdev.c | 3 +--
include/linux/blkdev.h | 8 +++++++-
2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index 8aadf1f23cb4..22806ce11e1d 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -183,8 +183,7 @@ int sb_set_blocksize(struct super_block *sb, int size)
{
if (set_blocksize(sb->s_bdev_file, size))
return 0;
- /* If we get here, we know size is power of two
- * and it's value is between 512 and PAGE_SIZE */
+ /* If we get here, we know size is validated */
sb->s_blocksize = size;
sb->s_blocksize_bits = blksize_bits(size);
return sb->s_blocksize;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 248416ecd01c..a97428e8bbbe 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -267,10 +267,16 @@ static inline dev_t disk_devt(struct gendisk *disk)
return MKDEV(disk->major, disk->first_minor);
}
+/*
+ * We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
+ * however we constrain this to what we can validate and test.
+ */
+#define BLK_MAX_BLOCK_SIZE SZ_64K
+
/* blk_validate_limits() validates bsize, so drivers don't usually need to */
static inline int blk_validate_block_size(unsigned long bsize)
{
- if (bsize < 512 || bsize > PAGE_SIZE || !is_power_of_2(bsize))
+ if (bsize < 512 || bsize > BLK_MAX_BLOCK_SIZE || !is_power_of_2(bsize))
return -EINVAL;
return 0;
--
2.47.2
* [PATCH v3 8/8] bdev: use bdev_io_min() for statx block size
2025-02-21 22:38 [PATCH v3 0/8] enable bs > ps for block devices Luis Chamberlain
` (6 preceding siblings ...)
2025-02-21 22:38 ` [PATCH v3 7/8] block/bdev: lift block size restrictions to 64k Luis Chamberlain
@ 2025-02-21 22:38 ` Luis Chamberlain
2025-02-24 10:45 ` [PATCH v3 0/8] enable bs > ps for block devices Christian Brauner
8 siblings, 0 replies; 11+ messages in thread
From: Luis Chamberlain @ 2025-02-21 22:38 UTC (permalink / raw)
To: brauner, akpm, hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
You can use lsblk to query a block device's block size:
lsblk -o MIN-IO /dev/nvme0n1
MIN-IO
4096
The min-io is the minimum IO size the block device prefers for optimal
performance. In turn, we map this to the block device block size.
The block size currently exposed, even for block devices with an
LBA format of 16k, is 4k. Likewise, devices which support a 4k LBA format
but have a larger indirection unit of 16k also expose a block size
of 4k.
This incurs read-modify-writes on direct IO against devices with a
min-io larger than the page size. To fix this, use the block device
min io, which is the minimum optimal IO size the device prefers.
With this we now get:
lsblk -o MIN-IO /dev/nvme0n1
MIN-IO
16384
And so userspace gets the appropriate information it needs for optimal
performance. This is verified with blkalgn against mkfs against a
device with LBA format of 4k but an NPWG of 16k (min io size)
mkfs.xfs -f -b size=16k /dev/nvme3n1
blkalgn -d nvme3n1 --ops Write
Block size : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 0 | |
16384 -> 32767 : 66 |****************************************|
32768 -> 65535 : 0 | |
65536 -> 131071 : 0 | |
131072 -> 262143 : 2 |* |
Block size: 14 - 66
Block size: 17 - 2
Algn size : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 0 | |
16384 -> 32767 : 66 |****************************************|
32768 -> 65535 : 0 | |
65536 -> 131071 : 0 | |
131072 -> 262143 : 2 |* |
Algn size: 14 - 66
Algn size: 17 - 2
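From userspace, this value is reported as stx_blksize by statx(2) and as
st_blksize by classic stat(2); a minimal sketch (querying "/" here only
to show the call shape — against a block device node, with this patch,
the result is bdev_io_min()):

```c
#include <sys/stat.h>

/* Returns st_blksize for 'path', or 0 on error. Against a block
 * device node this is the preferred IO size userspace should align
 * writes to for optimal performance. */
static long query_blksize(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return 0;
	return (long)st.st_blksize;
}
```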
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
block/bdev.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index 22806ce11e1d..3bd948e6438d 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1276,9 +1276,6 @@ void bdev_statx(struct path *path, struct kstat *stat,
struct inode *backing_inode;
struct block_device *bdev;
- if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC)))
- return;
-
backing_inode = d_backing_inode(path->dentry);
/*
@@ -1305,6 +1302,8 @@ void bdev_statx(struct path *path, struct kstat *stat,
queue_atomic_write_unit_max_bytes(bd_queue));
}
+ stat->blksize = bdev_io_min(bdev);
+
blkdev_put_no_open(bdev);
}
--
2.47.2
* Re: [PATCH v3 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page
2025-02-21 22:38 ` [PATCH v3 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page Luis Chamberlain
@ 2025-02-24 7:44 ` Hannes Reinecke
0 siblings, 0 replies; 11+ messages in thread
From: Hannes Reinecke @ 2025-02-24 7:44 UTC (permalink / raw)
To: Luis Chamberlain, brauner, akpm, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel
On 2/21/25 23:38, Luis Chamberlain wrote:
> Convert mpage to folios and adjust accounting for the number of blocks
> within a folio instead of a single page. This also adjusts the number
> of pages we should process to be the size of the folio to ensure we
> always read a full folio.
>
> Note that the page cache code already ensures do_mpage_readpage() will
> work with folios respecting the address space min order, this ensures
> that so long as folio_size() is used for our requirements mpage will
> also now be able to process block sizes larger than the page size.
>
> Originally-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> fs/mpage.c | 42 +++++++++++++++++++++---------------------
> 1 file changed, 21 insertions(+), 21 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH v3 0/8] enable bs > ps for block devices
2025-02-21 22:38 [PATCH v3 0/8] enable bs > ps for block devices Luis Chamberlain
` (7 preceding siblings ...)
2025-02-21 22:38 ` [PATCH v3 8/8] bdev: use bdev_io_min() for statx block size Luis Chamberlain
@ 2025-02-24 10:45 ` Christian Brauner
8 siblings, 0 replies; 11+ messages in thread
From: Christian Brauner @ 2025-02-24 10:45 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Christian Brauner, john.g.garry, hch, ritesh.list, linux-fsdevel,
linux-xfs, linux-mm, linux-block, gost.dev, p.raghav, da.gomez,
kernel, akpm, hare, willy, dave, david, djwong, kbusch
On Fri, 21 Feb 2025 14:38:15 -0800, Luis Chamberlain wrote:
> Christian, Andrew,
>
> This v3 series addresses the feedback from the v2 series [0]. The only
> patch which was modified was the patch titled "fs/mpage: use blocks_per_folio
> instead of blocks_per_page". The motivation for this series is to mainly
> start supporting block devices with logical block sizes larger than 4k,
> we do this by addressing buffer-head support required for the block
> device cache.
>
> [...]
Applied to the vfs-6.15.pagesize branch of the vfs/vfs.git tree.
Patches in the vfs-6.15.pagesize branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.15.pagesize
[1/8] fs/buffer: simplify block_read_full_folio() with bh_offset()
https://git.kernel.org/vfs/vfs/c/753aadebf2e3
[2/8] fs/buffer: remove batching from async read
https://git.kernel.org/vfs/vfs/c/b72e591f74de
[3/8] fs/mpage: avoid negative shift for large blocksize
https://git.kernel.org/vfs/vfs/c/86c60efd7c0e
[4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page
https://git.kernel.org/vfs/vfs/c/8b45a4f4133d
[5/8] fs/buffer fs/mpage: remove large folio restriction
https://git.kernel.org/vfs/vfs/c/e59e97d42b05
[6/8] block/bdev: enable large folio support for large logical block sizes
https://git.kernel.org/vfs/vfs/c/3c20917120ce
[7/8] block/bdev: lift block size restrictions to 64k
https://git.kernel.org/vfs/vfs/c/47dd67532303
[8/8] bdev: use bdev_io_min() for statx block size
https://git.kernel.org/vfs/vfs/c/425fbcd62d2e
end of thread, other threads:[~2025-02-24 10:45 UTC | newest]
Thread overview: 11+ messages
-- links below jump to the message on this page --
2025-02-21 22:38 [PATCH v3 0/8] enable bs > ps for block devices Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 1/8] fs/buffer: simplify block_read_full_folio() with bh_offset() Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 2/8] fs/buffer: remove batching from async read Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 3/8] fs/mpage: avoid negative shift for large blocksize Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page Luis Chamberlain
2025-02-24 7:44 ` Hannes Reinecke
2025-02-21 22:38 ` [PATCH v3 5/8] fs/buffer fs/mpage: remove large folio restriction Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 6/8] block/bdev: enable large folio support for large logical block sizes Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 7/8] block/bdev: lift block size restrictions to 64k Luis Chamberlain
2025-02-21 22:38 ` [PATCH v3 8/8] bdev: use bdev_io_min() for statx block size Luis Chamberlain
2025-02-24 10:45 ` [PATCH v3 0/8] enable bs > ps for block devices Christian Brauner