linux-mm.kvack.org archive mirror
* [RFC 0/8] enable bs > ps for block devices
@ 2024-11-13  9:47 Luis Chamberlain
  2024-11-13  9:47 ` [RFC 1/8] fs/mpage: use blocks_per_folio instead of blocks_per_page Luis Chamberlain
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Luis Chamberlain @ 2024-11-13  9:47 UTC (permalink / raw)
  To: willy, hch, hare, david, djwong
  Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
	mcgrof

The last time this was addressed was at LSFMM this year in May [0],
before LBS was on its way upstream. LBS is now in v6.12 and so the delta
required is much smaller now.

Before the turkey massacre slows part of the world down, here's a refresh.

Although Hannes' patches were in PATCH form, testing quickly showed that
they weren't quite ready yet. I've only done cursory testing so far but have
also incorporated all of the fixes and feedback we could accumulate over
time. And so I'm sticking to RFC to reflect that this still needs thorough
testing. It at least does not crash for me yet, and it's a major rebase
onto v6.12-rc7.

The biggest changes now are these last patches:

  - block/bdev: lift block size restrictions and use common definition
  - nvme: remove superfluous block size check
  - bdev: use bdev_io_min() for statx block size

The buffer-head patches I think should be ready.

If the consolidation of the max block size is good, perhaps we also use it
for the iomap max zero page. Note that in theory we should be able to get
up to a block size of 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER); in practice,
testing shows that needs much more love [1], although prospects indeed show
we should be able to get up to 2 MiB on x86_64. And so I think we should first
reduce scope to 64k for now, test all this, and then embark on the
64k --> 2 MiB journey.

Thoughts?

If you want this in a tree, you can get it from the kdevops branch
large-block-buffer-heads-for-next [2].

[0] https://lore.kernel.org/all/20240514173900.62207-1-hare@kernel.org/
[1] https://github.com/linux-kdevops/linux/commit/266f2c700be55bdb5626d521230597673c83c91d#diff-79b436371fdb3ddf0e7ad9bd4c9afe05160f7953438e650a77519b882904c56bL272
[2] https://github.com/linux-kdevops/linux/tree/large-block-buffer-heads-for-next

Hannes Reinecke (4):
  fs/mpage: use blocks_per_folio instead of blocks_per_page
  fs/mpage: avoid negative shift for large blocksize
  fs/buffer: restart block_read_full_folio() to avoid array overflow
  block/bdev: enable large folio support for large logical block sizes

Luis Chamberlain (4):
  fs/buffer fs/mpage: remove large folio restriction
  block/bdev: lift block size restrictions and use common definition
  nvme: remove superfluous block size check
  bdev: use bdev_io_min() for statx block size

 block/bdev.c             |  9 +++++---
 drivers/nvme/host/core.c | 10 ---------
 fs/buffer.c              | 21 ++++++++++++++----
 fs/mpage.c               | 47 +++++++++++++++++++---------------------
 fs/stat.c                |  2 +-
 include/linux/blkdev.h   |  6 ++++-
 6 files changed, 51 insertions(+), 44 deletions(-)

-- 
2.43.0




* [RFC 1/8] fs/mpage: use blocks_per_folio instead of blocks_per_page
  2024-11-13  9:47 [RFC 0/8] enable bs > ps for block devices Luis Chamberlain
@ 2024-11-13  9:47 ` Luis Chamberlain
  2024-11-13  9:47 ` [RFC 2/8] fs/mpage: avoid negative shift for large blocksize Luis Chamberlain
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Luis Chamberlain @ 2024-11-13  9:47 UTC (permalink / raw)
  To: willy, hch, hare, david, djwong
  Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
	mcgrof

From: Hannes Reinecke <hare@suse.de>

Convert mpage to folios and associate the number of blocks with
a folio and not a page.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 fs/mpage.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/fs/mpage.c b/fs/mpage.c
index b5b5ddf9d513..5b5989c7c2a0 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -107,7 +107,7 @@ static void map_buffer_to_folio(struct folio *folio, struct buffer_head *bh,
 		 * don't make any buffers if there is only one buffer on
 		 * the folio and the folio just needs to be set up to date
 		 */
-		if (inode->i_blkbits == PAGE_SHIFT &&
+		if (inode->i_blkbits == folio_shift(folio) &&
 		    buffer_uptodate(bh)) {
 			folio_mark_uptodate(folio);
 			return;
@@ -153,7 +153,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 	struct folio *folio = args->folio;
 	struct inode *inode = folio->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
-	const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
+	const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
 	const unsigned blocksize = 1 << blkbits;
 	struct buffer_head *map_bh = &args->map_bh;
 	sector_t block_in_file;
@@ -161,7 +161,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 	sector_t last_block_in_file;
 	sector_t first_block;
 	unsigned page_block;
-	unsigned first_hole = blocks_per_page;
+	unsigned first_hole = blocks_per_folio;
 	struct block_device *bdev = NULL;
 	int length;
 	int fully_mapped = 1;
@@ -182,7 +182,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 		goto confused;
 
 	block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
-	last_block = block_in_file + args->nr_pages * blocks_per_page;
+	last_block = block_in_file + args->nr_pages * blocks_per_folio;
 	last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
 	if (last_block > last_block_in_file)
 		last_block = last_block_in_file;
@@ -204,7 +204,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 				clear_buffer_mapped(map_bh);
 				break;
 			}
-			if (page_block == blocks_per_page)
+			if (page_block == blocks_per_folio)
 				break;
 			page_block++;
 			block_in_file++;
@@ -216,7 +216,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 	 * Then do more get_blocks calls until we are done with this folio.
 	 */
 	map_bh->b_folio = folio;
-	while (page_block < blocks_per_page) {
+	while (page_block < blocks_per_folio) {
 		map_bh->b_state = 0;
 		map_bh->b_size = 0;
 
@@ -229,7 +229,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 
 		if (!buffer_mapped(map_bh)) {
 			fully_mapped = 0;
-			if (first_hole == blocks_per_page)
+			if (first_hole == blocks_per_folio)
 				first_hole = page_block;
 			page_block++;
 			block_in_file++;
@@ -247,7 +247,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 			goto confused;
 		}
 	
-		if (first_hole != blocks_per_page)
+		if (first_hole != blocks_per_folio)
 			goto confused;		/* hole -> non-hole */
 
 		/* Contiguous blocks? */
@@ -260,7 +260,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 			if (relative_block == nblocks) {
 				clear_buffer_mapped(map_bh);
 				break;
-			} else if (page_block == blocks_per_page)
+			} else if (page_block == blocks_per_folio)
 				break;
 			page_block++;
 			block_in_file++;
@@ -268,7 +268,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 		bdev = map_bh->b_bdev;
 	}
 
-	if (first_hole != blocks_per_page) {
+	if (first_hole != blocks_per_folio) {
 		folio_zero_segment(folio, first_hole << blkbits, PAGE_SIZE);
 		if (first_hole == 0) {
 			folio_mark_uptodate(folio);
@@ -303,10 +303,10 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 	relative_block = block_in_file - args->first_logical_block;
 	nblocks = map_bh->b_size >> blkbits;
 	if ((buffer_boundary(map_bh) && relative_block == nblocks) ||
-	    (first_hole != blocks_per_page))
+	    (first_hole != blocks_per_folio))
 		args->bio = mpage_bio_submit_read(args->bio);
 	else
-		args->last_block_in_bio = first_block + blocks_per_page - 1;
+		args->last_block_in_bio = first_block + blocks_per_folio - 1;
 out:
 	return args->bio;
 
@@ -385,7 +385,7 @@ int mpage_read_folio(struct folio *folio, get_block_t get_block)
 {
 	struct mpage_readpage_args args = {
 		.folio = folio,
-		.nr_pages = 1,
+		.nr_pages = folio_nr_pages(folio),
 		.get_block = get_block,
 	};
 
@@ -456,12 +456,12 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 	struct address_space *mapping = folio->mapping;
 	struct inode *inode = mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
-	const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
+	const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
 	sector_t last_block;
 	sector_t block_in_file;
 	sector_t first_block;
 	unsigned page_block;
-	unsigned first_unmapped = blocks_per_page;
+	unsigned first_unmapped = blocks_per_folio;
 	struct block_device *bdev = NULL;
 	int boundary = 0;
 	sector_t boundary_block = 0;
@@ -486,12 +486,12 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 				 */
 				if (buffer_dirty(bh))
 					goto confused;
-				if (first_unmapped == blocks_per_page)
+				if (first_unmapped == blocks_per_folio)
 					first_unmapped = page_block;
 				continue;
 			}
 
-			if (first_unmapped != blocks_per_page)
+			if (first_unmapped != blocks_per_folio)
 				goto confused;	/* hole -> non-hole */
 
 			if (!buffer_dirty(bh) || !buffer_uptodate(bh))
@@ -536,7 +536,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 		goto page_is_mapped;
 	last_block = (i_size - 1) >> blkbits;
 	map_bh.b_folio = folio;
-	for (page_block = 0; page_block < blocks_per_page; ) {
+	for (page_block = 0; page_block < blocks_per_folio; ) {
 
 		map_bh.b_state = 0;
 		map_bh.b_size = 1 << blkbits;
@@ -618,14 +618,14 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 	BUG_ON(folio_test_writeback(folio));
 	folio_start_writeback(folio);
 	folio_unlock(folio);
-	if (boundary || (first_unmapped != blocks_per_page)) {
+	if (boundary || (first_unmapped != blocks_per_folio)) {
 		bio = mpage_bio_submit_write(bio);
 		if (boundary_block) {
 			write_boundary_block(boundary_bdev,
 					boundary_block, 1 << blkbits);
 		}
 	} else {
-		mpd->last_block_in_bio = first_block + blocks_per_page - 1;
+		mpd->last_block_in_bio = first_block + blocks_per_folio - 1;
 	}
 	goto out;
 
-- 
2.43.0




* [RFC 2/8] fs/mpage: avoid negative shift for large blocksize
  2024-11-13  9:47 [RFC 0/8] enable bs > ps for block devices Luis Chamberlain
  2024-11-13  9:47 ` [RFC 1/8] fs/mpage: use blocks_per_folio instead of blocks_per_page Luis Chamberlain
@ 2024-11-13  9:47 ` Luis Chamberlain
  2024-11-13 14:06   ` Matthew Wilcox
  2024-11-13  9:47 ` [RFC 3/8] fs/buffer: restart block_read_full_folio() to avoid array overflow Luis Chamberlain
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2024-11-13  9:47 UTC (permalink / raw)
  To: willy, hch, hare, david, djwong
  Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
	mcgrof, Hannes Reinecke

From: Hannes Reinecke <hare@kernel.org>

For large block sizes the number of block bits is larger than PAGE_SHIFT,
so calculate the sector number from the page cache index by shifting up by
PAGE_SHIFT first and then down by blkbits, instead of shifting by the (now
negative) difference.

With this in place we can now enable large folios with buffer-heads.
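
To make the failure mode concrete, here is a small standalone toy (not
kernel code; PAGE_SHIFT and the block size are hard-coded purely for
illustration):

#include <stdio.h>

int main(void)
{
	const unsigned int page_shift = 12;	/* 4k pages, illustration only */
	const unsigned int blkbits = 14;	/* 16k logical block size */
	unsigned long long index = 3;		/* page cache index */

	/*
	 * The old expression, index << (page_shift - blkbits), would shift
	 * by 12 - 14, i.e. a negative / underflowed amount, which is
	 * undefined behaviour.  Shifting up by page_shift first and then
	 * down by blkbits stays well defined for any blkbits.
	 */
	unsigned long long block_in_file = (index << page_shift) >> blkbits;

	printf("block_in_file = %llu\n", block_in_file);	/* prints 0 */
	return 0;
}

Matthew's suggestion further down in the thread, folio_pos(folio) >> blkbits,
is the equivalent in-kernel form.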

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 fs/mpage.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/mpage.c b/fs/mpage.c
index 5b5989c7c2a0..ff76600380ca 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -181,7 +181,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 	if (folio_buffers(folio))
 		goto confused;
 
-	block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
+	block_in_file = (sector_t)(((loff_t)folio->index << PAGE_SHIFT) >> blkbits);
 	last_block = block_in_file + args->nr_pages * blocks_per_folio;
 	last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
 	if (last_block > last_block_in_file)
@@ -527,7 +527,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 	 * The page has no buffers: map it to disk
 	 */
 	BUG_ON(!folio_test_uptodate(folio));
-	block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
+	block_in_file = (sector_t)(((loff_t)folio->index << PAGE_SHIFT) >> blkbits);
 	/*
 	 * Whole page beyond EOF? Skip allocating blocks to avoid leaking
 	 * space.
-- 
2.43.0




* [RFC 3/8] fs/buffer: restart block_read_full_folio() to avoid array overflow
  2024-11-13  9:47 [RFC 0/8] enable bs > ps for block devices Luis Chamberlain
  2024-11-13  9:47 ` [RFC 1/8] fs/mpage: use blocks_per_folio instead of blocks_per_page Luis Chamberlain
  2024-11-13  9:47 ` [RFC 2/8] fs/mpage: avoid negative shift for large blocksize Luis Chamberlain
@ 2024-11-13  9:47 ` Luis Chamberlain
  2024-11-13 18:50   ` Matthew Wilcox
  2024-11-13  9:47 ` [RFC 4/8] fs/buffer fs/mpage: remove large folio restriction Luis Chamberlain
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2024-11-13  9:47 UTC (permalink / raw)
  To: willy, hch, hare, david, djwong
  Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
	mcgrof

From: Hannes Reinecke <hare@suse.de>

block_read_full_folio() uses an on-stack array to hold any buffer_heads
which should be updated. The array is sized for the number of buffer_heads
per PAGE_SIZE, which of course will overflow for large folios.
So instead of increasing the size of the array (and thereby incurring
a possible stack overflow for really large folios), stop the iteration
when the array is filled up, submit these buffer_heads, and restart
the iteration with the remaining buffer_heads.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 fs/buffer.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 1fc9a50def0b..818c9c5840fe 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2366,7 +2366,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 {
 	struct inode *inode = folio->mapping->host;
 	sector_t iblock, lblock;
-	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	struct buffer_head *bh, *head, *restart_bh = NULL, *arr[MAX_BUF_PER_PAGE];
 	size_t blocksize;
 	int nr, i;
 	int fully_mapped = 1;
@@ -2385,6 +2385,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 	iblock = div_u64(folio_pos(folio), blocksize);
 	lblock = div_u64(limit + blocksize - 1, blocksize);
 	bh = head;
+restart:
 	nr = 0;
 	i = 0;
 
@@ -2417,7 +2418,12 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 				continue;
 		}
 		arr[nr++] = bh;
-	} while (i++, iblock++, (bh = bh->b_this_page) != head);
+	} while (i++, iblock++, (bh = bh->b_this_page) != head && nr < MAX_BUF_PER_PAGE);
+
+	if (nr == MAX_BUF_PER_PAGE && bh != head)
+		restart_bh = bh;
+	else
+		restart_bh = NULL;
 
 	if (fully_mapped)
 		folio_set_mappedtodisk(folio);
@@ -2450,6 +2456,15 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 		else
 			submit_bh(REQ_OP_READ, bh);
 	}
+
+	/*
+	 * Found more buffers than 'arr' could hold,
+	 * restart to submit the remaining ones.
+	 */
+	if (restart_bh) {
+		bh = restart_bh;
+		goto restart;
+	}
 	return 0;
 }
 EXPORT_SYMBOL(block_read_full_folio);
-- 
2.43.0




* [RFC 4/8] fs/buffer fs/mpage: remove large folio restriction
  2024-11-13  9:47 [RFC 0/8] enable bs > ps for block devices Luis Chamberlain
                   ` (2 preceding siblings ...)
  2024-11-13  9:47 ` [RFC 3/8] fs/buffer: restart block_read_full_folio() to avoid array overflow Luis Chamberlain
@ 2024-11-13  9:47 ` Luis Chamberlain
  2024-11-13  9:55   ` Hannes Reinecke
  2024-11-13  9:47 ` [RFC 5/8] block/bdev: enable large folio support for large logical block sizes Luis Chamberlain
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2024-11-13  9:47 UTC (permalink / raw)
  To: willy, hch, hare, david, djwong
  Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
	mcgrof

Now that buffer-heads have been converted over to support large folios,
we can remove the built-in VM_BUG_ON_FOLIO() checks which prevent
their use.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/buffer.c | 2 --
 fs/mpage.c  | 3 ---
 2 files changed, 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 818c9c5840fe..85471d2c0df9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2377,8 +2377,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 	if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
 		limit = inode->i_sb->s_maxbytes;
 
-	VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
 	head = folio_create_buffers(folio, inode, 0);
 	blocksize = head->b_size;
 
diff --git a/fs/mpage.c b/fs/mpage.c
index ff76600380ca..a71a4a0b34f4 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -170,9 +170,6 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 	unsigned relative_block;
 	gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
 
-	/* MAX_BUF_PER_PAGE, for example */
-	VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
 	if (args->is_readahead) {
 		opf |= REQ_RAHEAD;
 		gfp |= __GFP_NORETRY | __GFP_NOWARN;
-- 
2.43.0




* [RFC 5/8] block/bdev: enable large folio support for large logical block sizes
  2024-11-13  9:47 [RFC 0/8] enable bs > ps for block devices Luis Chamberlain
                   ` (3 preceding siblings ...)
  2024-11-13  9:47 ` [RFC 4/8] fs/buffer fs/mpage: remove large folio restriction Luis Chamberlain
@ 2024-11-13  9:47 ` Luis Chamberlain
  2024-11-13  9:47 ` [RFC 6/8] block/bdev: lift block size restrictions and use common definition Luis Chamberlain
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Luis Chamberlain @ 2024-11-13  9:47 UTC (permalink / raw)
  To: willy, hch, hare, david, djwong
  Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
	mcgrof

From: Hannes Reinecke <hare@suse.de>

Call mapping_set_folio_min_order() when modifying the logical block
size to ensure folios are allocated with the correct size.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 block/bdev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/bdev.c b/block/bdev.c
index 738e3c8457e7..167d82b46781 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -148,6 +148,8 @@ static void set_init_blocksize(struct block_device *bdev)
 		bsize <<= 1;
 	}
 	BD_INODE(bdev)->i_blkbits = blksize_bits(bsize);
+	mapping_set_folio_min_order(BD_INODE(bdev)->i_mapping,
+				    get_order(bsize));
 }
 
 int set_blocksize(struct file *file, int size)
@@ -170,6 +172,7 @@ int set_blocksize(struct file *file, int size)
 	if (inode->i_blkbits != blksize_bits(size)) {
 		sync_blockdev(bdev);
 		inode->i_blkbits = blksize_bits(size);
+		mapping_set_folio_min_order(inode->i_mapping, get_order(size));
 		kill_bdev(bdev);
 	}
 	return 0;
-- 
2.43.0




* [RFC 6/8] block/bdev: lift block size restrictions and use common definition
  2024-11-13  9:47 [RFC 0/8] enable bs > ps for block devices Luis Chamberlain
                   ` (4 preceding siblings ...)
  2024-11-13  9:47 ` [RFC 5/8] block/bdev: enable large folio support for large logical block sizes Luis Chamberlain
@ 2024-11-13  9:47 ` Luis Chamberlain
  2024-11-13  9:57   ` Hannes Reinecke
                     ` (2 more replies)
  2024-11-13  9:47 ` [RFC 7/8] nvme: remove superfluous block size check Luis Chamberlain
  2024-11-13  9:47 ` [RFC 8/8] bdev: use bdev_io_min() for statx block size Luis Chamberlain
  7 siblings, 3 replies; 21+ messages in thread
From: Luis Chamberlain @ 2024-11-13  9:47 UTC (permalink / raw)
  To: willy, hch, hare, david, djwong
  Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
	mcgrof

We can now support block sizes larger than PAGE_SIZE, so lift
the restriction up to the max supported page cache order and
just bake this into a common helper used by the block layer.

We bound ourselves to 64k because beyond that we need more testing.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/bdev.c           | 5 ++---
 include/linux/blkdev.h | 6 +++++-
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 167d82b46781..3a5fd65f6c8e 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -157,8 +157,7 @@ int set_blocksize(struct file *file, int size)
 	struct inode *inode = file->f_mapping->host;
 	struct block_device *bdev = I_BDEV(inode);
 
-	/* Size must be a power of two, and between 512 and PAGE_SIZE */
-	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
+	if (blk_validate_block_size(size))
 		return -EINVAL;
 
 	/* Size cannot be smaller than the size supported by the device */
@@ -185,7 +184,7 @@ int sb_set_blocksize(struct super_block *sb, int size)
 	if (set_blocksize(sb->s_bdev_file, size))
 		return 0;
 	/* If we get here, we know size is power of two
-	 * and it's value is between 512 and PAGE_SIZE */
+	 * and it's value is larger than 512 */
 	sb->s_blocksize = size;
 	sb->s_blocksize_bits = blksize_bits(size);
 	return sb->s_blocksize;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 50c3b959da28..cc9fca1fceaa 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -25,6 +25,7 @@
 #include <linux/uuid.h>
 #include <linux/xarray.h>
 #include <linux/file.h>
+#include <linux/pagemap.h>
 
 struct module;
 struct request_queue;
@@ -268,10 +269,13 @@ static inline dev_t disk_devt(struct gendisk *disk)
 	return MKDEV(disk->major, disk->first_minor);
 }
 
+/* We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER) */
+#define BLK_MAX_BLOCK_SIZE      (SZ_64K)
+
 /* blk_validate_limits() validates bsize, so drivers don't usually need to */
 static inline int blk_validate_block_size(unsigned long bsize)
 {
-	if (bsize < 512 || bsize > PAGE_SIZE || !is_power_of_2(bsize))
+	if (bsize < 512 || bsize > BLK_MAX_BLOCK_SIZE || !is_power_of_2(bsize))
 		return -EINVAL;
 
 	return 0;
-- 
2.43.0




* [RFC 7/8] nvme: remove superfluous block size check
  2024-11-13  9:47 [RFC 0/8] enable bs > ps for block devices Luis Chamberlain
                   ` (5 preceding siblings ...)
  2024-11-13  9:47 ` [RFC 6/8] block/bdev: lift block size restrictions and use common definition Luis Chamberlain
@ 2024-11-13  9:47 ` Luis Chamberlain
  2024-11-13  9:57   ` Hannes Reinecke
  2024-11-13  9:47 ` [RFC 8/8] bdev: use bdev_io_min() for statx block size Luis Chamberlain
  7 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2024-11-13  9:47 UTC (permalink / raw)
  To: willy, hch, hare, david, djwong
  Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
	mcgrof

The block layer already validates proper block sizes with
blk_validate_block_size() for us, so we can remove this
now-superfluous check.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/nvme/host/core.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 855b42c92284..86ff872cf6bd 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2022,16 +2022,6 @@ static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
 	u32 atomic_bs, phys_bs, io_opt = 0;
 	bool valid = true;
 
-	/*
-	 * The block layer can't support LBA sizes larger than the page size
-	 * or smaller than a sector size yet, so catch this early and don't
-	 * allow block I/O.
-	 */
-	if (head->lba_shift > PAGE_SHIFT || head->lba_shift < SECTOR_SHIFT) {
-		bs = (1 << 9);
-		valid = false;
-	}
-
 	atomic_bs = phys_bs = bs;
 	if (id->nabo == 0) {
 		/*
-- 
2.43.0




* [RFC 8/8] bdev: use bdev_io_min() for statx block size
  2024-11-13  9:47 [RFC 0/8] enable bs > ps for block devices Luis Chamberlain
                   ` (6 preceding siblings ...)
  2024-11-13  9:47 ` [RFC 7/8] nvme: remove superfluous block size check Luis Chamberlain
@ 2024-11-13  9:47 ` Luis Chamberlain
  2024-11-13  9:59   ` Hannes Reinecke
  2024-11-18  7:08   ` Christoph Hellwig
  7 siblings, 2 replies; 21+ messages in thread
From: Luis Chamberlain @ 2024-11-13  9:47 UTC (permalink / raw)
  To: willy, hch, hare, david, djwong
  Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
	mcgrof

You can use lsblk to query for a block device's block size:

lsblk -o MIN-IO /dev/nvme0n1
MIN-IO
 4096

The min-io is the minimum I/O size the block device prefers for optimal
performance. In turn we map this to the block device block size.
The block size currently exposed, even for block devices with an
LBA format of 16k, is 4k. Likewise, devices which support a 4k LBA format
but have a larger Indirection Unit of 16k have an exposed block size
of 4k.

This incurs read-modify-writes on direct IO against devices with a
min-io larger than the page size. To fix this, use the block device
min-io, the minimum I/O size the device prefers.

With this we now get:

lsblk -o MIN-IO /dev/nvme0n1
MIN-IO
 16384

And so userspace gets the appropriate information it needs for optimal
performance. This is verified with blkalgn while running mkfs against a
device with an LBA format of 4k but an NPWG of 16k (the min-io size):

mkfs.xfs -f -b size=16k  /dev/nvme3n1
blkalgn -d nvme3n1 --ops Write

     Block size          : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 66       |****************************************|
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 2        |*                                       |
Block size: 14 - 66
Block size: 17 - 2

     Algn size           : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 66       |****************************************|
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 2        |*                                       |
Algn size: 14 - 66
Algn size: 17 - 2
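
As an aside, a minimal userspace sketch (assuming a glibc new enough to
provide the statx() wrapper; the device path is only an example) showing
where this value surfaces:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct statx stx;

	/* stx_blksize is what this patch changes for block device nodes */
	if (statx(AT_FDCWD, "/dev/nvme0n1", 0, STATX_BASIC_STATS, &stx) != 0) {
		perror("statx");
		return 1;
	}
	printf("stx_blksize: %u\n", stx.stx_blksize);
	return 0;
}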

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/bdev.c | 1 +
 fs/stat.c    | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/bdev.c b/block/bdev.c
index 3a5fd65f6c8e..4dcc501ed953 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1306,6 +1306,7 @@ void bdev_statx(struct path *path, struct kstat *stat,
 			queue_atomic_write_unit_max_bytes(bd_queue));
 	}
 
+	stat->blksize = (unsigned int) bdev_io_min(bdev);
 	blkdev_put_no_open(bdev);
 }
 
diff --git a/fs/stat.c b/fs/stat.c
index 41e598376d7e..9b579c0b5153 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -268,7 +268,7 @@ static int vfs_statx_path(struct path *path, int flags, struct kstat *stat,
 	 * obtained from the bdev backing inode.
 	 */
 	if (S_ISBLK(stat->mode))
-		bdev_statx(path, stat, request_mask);
+		bdev_statx(path, stat, request_mask | STATX_DIOALIGN);
 
 	return error;
 }
-- 
2.43.0




* Re: [RFC 4/8] fs/buffer fs/mpage: remove large folio restriction
  2024-11-13  9:47 ` [RFC 4/8] fs/buffer fs/mpage: remove large folio restriction Luis Chamberlain
@ 2024-11-13  9:55   ` Hannes Reinecke
  0 siblings, 0 replies; 21+ messages in thread
From: Hannes Reinecke @ 2024-11-13  9:55 UTC (permalink / raw)
  To: Luis Chamberlain, willy, hch, david, djwong
  Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel

On 11/13/24 10:47, Luis Chamberlain wrote:
> Now that buffer-heads has been converted over to support large folios
> we can remove the built-in VM_BUG_ON_FOLIO() checks which prevents
> their use.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>   fs/buffer.c | 2 --
>   fs/mpage.c  | 3 ---
>   2 files changed, 5 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [RFC 6/8] block/bdev: lift block size restrictions and use common definition
  2024-11-13  9:47 ` [RFC 6/8] block/bdev: lift block size restrictions and use common definition Luis Chamberlain
@ 2024-11-13  9:57   ` Hannes Reinecke
  2024-11-13 14:14   ` Matthew Wilcox
  2024-11-18  9:18   ` John Garry
  2 siblings, 0 replies; 21+ messages in thread
From: Hannes Reinecke @ 2024-11-13  9:57 UTC (permalink / raw)
  To: Luis Chamberlain, willy, hch, david, djwong
  Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel

On 11/13/24 10:47, Luis Chamberlain wrote:
> We now can support blocksizes larger than PAGE_SIZE, so lift
> the restriction up to the max supported page cache order and
> just bake this into a common helper used by the block layer.
> 
> We bound ourselves to 64k, because beyond that we need more testing.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>   block/bdev.c           | 5 ++---
>   include/linux/blkdev.h | 6 +++++-
>   2 files changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 167d82b46781..3a5fd65f6c8e 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -157,8 +157,7 @@ int set_blocksize(struct file *file, int size)
>   	struct inode *inode = file->f_mapping->host;
>   	struct block_device *bdev = I_BDEV(inode);
>   
> -	/* Size must be a power of two, and between 512 and PAGE_SIZE */
> -	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
> +	if (blk_validate_block_size(size))
>   		return -EINVAL;
>   
>   	/* Size cannot be smaller than the size supported by the device */
> @@ -185,7 +184,7 @@ int sb_set_blocksize(struct super_block *sb, int size)
>   	if (set_blocksize(sb->s_bdev_file, size))
>   		return 0;
>   	/* If we get here, we know size is power of two
> -	 * and it's value is between 512 and PAGE_SIZE */
> +	 * and it's value is larger than 512 */
>   	sb->s_blocksize = size;
>   	sb->s_blocksize_bits = blksize_bits(size);
>   	return sb->s_blocksize;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 50c3b959da28..cc9fca1fceaa 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -25,6 +25,7 @@
>   #include <linux/uuid.h>
>   #include <linux/xarray.h>
>   #include <linux/file.h>
> +#include <linux/pagemap.h>
>   
>   struct module;
>   struct request_queue;
> @@ -268,10 +269,13 @@ static inline dev_t disk_devt(struct gendisk *disk)
>   	return MKDEV(disk->major, disk->first_minor);
>   }
>   
> +/* We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER) */
> +#define BLK_MAX_BLOCK_SIZE      (SZ_64K)
> +

Please make the comment a bit more descriptive, indicating that beyond 
64k more testing is required, hence it's not enabled for now.

We _could_ add a config option to make this conditional...
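
Something along these lines perhaps, as a rough sketch only
(CONFIG_BLK_BLOCK_SIZE_FULL is a name made up here, not something in
this series):

/* Sketch only: gate the less-tested range behind a hypothetical knob. */
#ifdef CONFIG_BLK_BLOCK_SIZE_FULL
#define BLK_MAX_BLOCK_SIZE	(1UL << (PAGE_SHIFT + MAX_PAGECACHE_ORDER))
#else
#define BLK_MAX_BLOCK_SIZE	SZ_64K	/* bounded until wider testing */
#endif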

>   /* blk_validate_limits() validates bsize, so drivers don't usually need to */
>   static inline int blk_validate_block_size(unsigned long bsize)
>   {
> -	if (bsize < 512 || bsize > PAGE_SIZE || !is_power_of_2(bsize))
> +	if (bsize < 512 || bsize > BLK_MAX_BLOCK_SIZE || !is_power_of_2(bsize))
>   		return -EINVAL;
>   
>   	return 0;
Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [RFC 7/8] nvme: remove superfluous block size check
  2024-11-13  9:47 ` [RFC 7/8] nvme: remove superfluous block size check Luis Chamberlain
@ 2024-11-13  9:57   ` Hannes Reinecke
  0 siblings, 0 replies; 21+ messages in thread
From: Hannes Reinecke @ 2024-11-13  9:57 UTC (permalink / raw)
  To: Luis Chamberlain, willy, hch, david, djwong
  Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel

On 11/13/24 10:47, Luis Chamberlain wrote:
> The block layer already validates proper block sizes with
> blk_validate_block_size() for us so we can remove this now
> superfluous check.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>   drivers/nvme/host/core.c | 10 ----------
>   1 file changed, 10 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [RFC 8/8] bdev: use bdev_io_min() for statx block size
  2024-11-13  9:47 ` [RFC 8/8] bdev: use bdev_io_min() for statx block size Luis Chamberlain
@ 2024-11-13  9:59   ` Hannes Reinecke
  2024-11-18  7:08   ` Christoph Hellwig
  1 sibling, 0 replies; 21+ messages in thread
From: Hannes Reinecke @ 2024-11-13  9:59 UTC (permalink / raw)
  To: Luis Chamberlain, willy, hch, david, djwong
  Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel

On 11/13/24 10:47, Luis Chamberlain wrote:
> You can use lsblk to query for a block device block device block size:
> 
> lsblk -o MIN-IO /dev/nvme0n1
> MIN-IO
>   4096
> 
> The min-io is the minimum IO the block device prefers for optimal
> performance. In turn we map this to the block device block size.
> The current block size exposed even for block devices with an
> LBA format of 16k is 4k. Likewise devices which support 4k LBA format
> but have a larger Indirection Unit of 16k have an exposed block size
> of 4k.
> 
> This incurs read-modify-writes on direct IO against devices with a
> min-io larger than the page size. To fix this, use the block device
> min io, which is the minimal optimal IO the device prefers.
> 
> With this we now get:
> 
> lsblk -o MIN-IO /dev/nvme0n1
> MIN-IO
>   16384
> 
> And so userspace gets the appropriate information it needs for optimal
> performance. This is verified with blkalgn against mkfs against a
> device with LBA format of 4k but an NPWG of 16k (min io size)
> 
> mkfs.xfs -f -b size=16k  /dev/nvme3n1
> blkalgn -d nvme3n1 --ops Write
> 
>       Block size          : count     distribution
>           0 -> 1          : 0        |                                        |
>           2 -> 3          : 0        |                                        |
>           4 -> 7          : 0        |                                        |
>           8 -> 15         : 0        |                                        |
>          16 -> 31         : 0        |                                        |
>          32 -> 63         : 0        |                                        |
>          64 -> 127        : 0        |                                        |
>         128 -> 255        : 0        |                                        |
>         256 -> 511        : 0        |                                        |
>         512 -> 1023       : 0        |                                        |
>        1024 -> 2047       : 0        |                                        |
>        2048 -> 4095       : 0        |                                        |
>        4096 -> 8191       : 0        |                                        |
>        8192 -> 16383      : 0        |                                        |
>       16384 -> 32767      : 66       |****************************************|
>       32768 -> 65535      : 0        |                                        |
>       65536 -> 131071     : 0        |                                        |
>      131072 -> 262143     : 2        |*                                       |
> Block size: 14 - 66
> Block size: 17 - 2
> 
>       Algn size           : count     distribution
>           0 -> 1          : 0        |                                        |
>           2 -> 3          : 0        |                                        |
>           4 -> 7          : 0        |                                        |
>           8 -> 15         : 0        |                                        |
>          16 -> 31         : 0        |                                        |
>          32 -> 63         : 0        |                                        |
>          64 -> 127        : 0        |                                        |
>         128 -> 255        : 0        |                                        |
>         256 -> 511        : 0        |                                        |
>         512 -> 1023       : 0        |                                        |
>        1024 -> 2047       : 0        |                                        |
>        2048 -> 4095       : 0        |                                        |
>        4096 -> 8191       : 0        |                                        |
>        8192 -> 16383      : 0        |                                        |
>       16384 -> 32767      : 66       |****************************************|
>       32768 -> 65535      : 0        |                                        |
>       65536 -> 131071     : 0        |                                        |
>      131072 -> 262143     : 2        |*                                       |
> Algn size: 14 - 66
> Algn size: 17 - 2
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>   block/bdev.c | 1 +
>   fs/stat.c    | 2 +-
>   2 files changed, 2 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [RFC 2/8] fs/mpage: avoid negative shift for large blocksize
  2024-11-13  9:47 ` [RFC 2/8] fs/mpage: avoid negative shift for large blocksize Luis Chamberlain
@ 2024-11-13 14:06   ` Matthew Wilcox
  2024-11-14 13:47     ` Hannes Reinecke
  0 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2024-11-13 14:06 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hch, hare, david, djwong, john.g.garry, ritesh.list, kbusch,
	linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
	p.raghav, da.gomez, kernel, Hannes Reinecke

On Wed, Nov 13, 2024 at 01:47:21AM -0800, Luis Chamberlain wrote:
> +++ b/fs/mpage.c
> @@ -181,7 +181,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
>  	if (folio_buffers(folio))
>  		goto confused;
>  
> -	block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
> +	block_in_file = (sector_t)(((loff_t)folio->index << PAGE_SHIFT) >> blkbits);

	block_in_file = folio_pos(folio) >> blkbits;
?

> @@ -527,7 +527,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
>  	 * The page has no buffers: map it to disk
>  	 */
>  	BUG_ON(!folio_test_uptodate(folio));
> -	block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
> +	block_in_file = (sector_t)(((loff_t)folio->index << PAGE_SHIFT) >> blkbits);

Likewise.



* Re: [RFC 6/8] block/bdev: lift block size restrictions and use common definition
  2024-11-13  9:47 ` [RFC 6/8] block/bdev: lift block size restrictions and use common definition Luis Chamberlain
  2024-11-13  9:57   ` Hannes Reinecke
@ 2024-11-13 14:14   ` Matthew Wilcox
  2024-11-18  9:18   ` John Garry
  2 siblings, 0 replies; 21+ messages in thread
From: Matthew Wilcox @ 2024-11-13 14:14 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hch, hare, david, djwong, john.g.garry, ritesh.list, kbusch,
	linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
	p.raghav, da.gomez, kernel

On Wed, Nov 13, 2024 at 01:47:25AM -0800, Luis Chamberlain wrote:
> @@ -185,7 +184,7 @@ int sb_set_blocksize(struct super_block *sb, int size)
>  	if (set_blocksize(sb->s_bdev_file, size))
>  		return 0;
>  	/* If we get here, we know size is power of two
> -	 * and it's value is between 512 and PAGE_SIZE */
> +	 * and it's value is larger than 512 */

If you're changing this line, please delete the incorrect apostrophe.

>  	sb->s_blocksize = size;
>  	sb->s_blocksize_bits = blksize_bits(size);
>  	return sb->s_blocksize;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 50c3b959da28..cc9fca1fceaa 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -25,6 +25,7 @@
>  #include <linux/uuid.h>
>  #include <linux/xarray.h>
>  #include <linux/file.h>
> +#include <linux/pagemap.h>

Why do we need to add this include?

> @@ -268,10 +269,13 @@ static inline dev_t disk_devt(struct gendisk *disk)
>  	return MKDEV(disk->major, disk->first_minor);
>  }
>  
> +/* We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER) */
> +#define BLK_MAX_BLOCK_SIZE      (SZ_64K)

I think we need CONFIG_TRANSPARENT_HUGEPAGE to go over PAGE_SIZE.




* Re: [RFC 3/8] fs/buffer: restart block_read_full_folio() to avoid array overflow
  2024-11-13  9:47 ` [RFC 3/8] fs/buffer: restart block_read_full_folio() to avoid array overflow Luis Chamberlain
@ 2024-11-13 18:50   ` Matthew Wilcox
  0 siblings, 0 replies; 21+ messages in thread
From: Matthew Wilcox @ 2024-11-13 18:50 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hch, hare, david, djwong, john.g.garry, ritesh.list, kbusch,
	linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
	p.raghav, da.gomez, kernel

On Wed, Nov 13, 2024 at 01:47:22AM -0800, Luis Chamberlain wrote:
> +++ b/fs/buffer.c
> @@ -2366,7 +2366,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>  {
>  	struct inode *inode = folio->mapping->host;
>  	sector_t iblock, lblock;
> -	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
> +	struct buffer_head *bh, *head, *restart_bh = NULL, *arr[MAX_BUF_PER_PAGE];

MAX_BUF_PER_PAGE is a pain.  There are configs like hexagon which have
256kB pages and so this array ends up being 512 * 8 bytes = 4kB in size
which spews stack growth warnings.  Can we just make this 8?

> @@ -2385,6 +2385,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>  	iblock = div_u64(folio_pos(folio), blocksize);
>  	lblock = div_u64(limit + blocksize - 1, blocksize);
>  	bh = head;
> +restart:
>  	nr = 0;
>  	i = 0;
>  
> @@ -2417,7 +2418,12 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>  				continue;
>  		}
>  		arr[nr++] = bh;
> -	} while (i++, iblock++, (bh = bh->b_this_page) != head);
> +	} while (i++, iblock++, (bh = bh->b_this_page) != head && nr < MAX_BUF_PER_PAGE);
> +
> +	if (nr == MAX_BUF_PER_PAGE && bh != head)
> +		restart_bh = bh;
> +	else
> +		restart_bh = NULL;
>  
>  	if (fully_mapped)
>  		folio_set_mappedtodisk(folio);
> @@ -2450,6 +2456,15 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>  		else
>  			submit_bh(REQ_OP_READ, bh);
>  	}
> +
> +	/*
> +	 * Found more buffers than 'arr' could hold,
> +	 * restart to submit the remaining ones.
> +	 */
> +	if (restart_bh) {
> +		bh = restart_bh;
> +		goto restart;
> +	}
>  	return 0;

This isn't right.

Let's assume we need 16 blocks to fill in this folio and we have 8
entries in 'arr'.

        nr = 0;
        i = 0;

        do {
                if (buffer_uptodate(bh))
                        continue;
...
                arr[nr++] = bh;
        } while (i++, iblock++, (bh = bh->b_this_page) != head);

        for (i = 0; i < nr; i++) {
                bh = arr[i];
                        submit_bh(REQ_OP_READ, bh);

OK, so first time round, we've submitted 8 I/Os.  Now we see that
restart_bh is not NULL and so we go round again.

This time, we happen to find that the last 8 BHs are uptodate.
And so we take this path:

        if (!nr) {
                /*
                 * All buffers are uptodate or get_block() returned an
                 * error when trying to map them - we can finish the read.
                 */
                folio_end_read(folio, !page_error);

oops, we forgot about the 8 buffers we already submitted for read.



* Re: [RFC 2/8] fs/mpage: avoid negative shift for large blocksize
  2024-11-13 14:06   ` Matthew Wilcox
@ 2024-11-14 13:47     ` Hannes Reinecke
  0 siblings, 0 replies; 21+ messages in thread
From: Hannes Reinecke @ 2024-11-14 13:47 UTC (permalink / raw)
  To: Matthew Wilcox, Luis Chamberlain
  Cc: hch, david, djwong, john.g.garry, ritesh.list, kbusch,
	linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
	p.raghav, da.gomez, kernel, Hannes Reinecke

On 11/13/24 15:06, Matthew Wilcox wrote:
> On Wed, Nov 13, 2024 at 01:47:21AM -0800, Luis Chamberlain wrote:
>> +++ b/fs/mpage.c
>> @@ -181,7 +181,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
>>   	if (folio_buffers(folio))
>>   		goto confused;
>>   
>> -	block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
>> +	block_in_file = (sector_t)(((loff_t)folio->index << PAGE_SHIFT) >> blkbits);
> 
> 	block_in_file = folio_pos(folio) >> blkbits;
> ?
> 
>> @@ -527,7 +527,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
>>   	 * The page has no buffers: map it to disk
>>   	 */
>>   	BUG_ON(!folio_test_uptodate(folio));
>> -	block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
>> +	block_in_file = (sector_t)(((loff_t)folio->index << PAGE_SHIFT) >> blkbits);
> 
> Likewise.

Yeah. /me not being able to find proper macros ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [RFC 8/8] bdev: use bdev_io_min() for statx block size
  2024-11-13  9:47 ` [RFC 8/8] bdev: use bdev_io_min() for statx block size Luis Chamberlain
  2024-11-13  9:59   ` Hannes Reinecke
@ 2024-11-18  7:08   ` Christoph Hellwig
  2024-11-18 21:16     ` Luis Chamberlain
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2024-11-18  7:08 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: willy, hch, hare, david, djwong, john.g.garry, ritesh.list,
	kbusch, linux-fsdevel, linux-xfs, linux-mm, linux-block,
	gost.dev, p.raghav, da.gomez, kernel

On Wed, Nov 13, 2024 at 01:47:27AM -0800, Luis Chamberlain wrote:
> The min-io is the minimum IO the block device prefers for optimal
> performance. In turn we map this to the block device block size.

It's not the block size, but (to quote the man page) 'the "preferred"
block size for efficient filesystem I/O'.  While the difference might
sound minor, it actually is important.

> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 3a5fd65f6c8e..4dcc501ed953 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -1306,6 +1306,7 @@ void bdev_statx(struct path *path, struct kstat *stat,
>  			queue_atomic_write_unit_max_bytes(bd_queue));
>  	}
>  
> +	stat->blksize = (unsigned int) bdev_io_min(bdev);

No need for the cast.

>  	if (S_ISBLK(stat->mode))
> -		bdev_statx(path, stat, request_mask);
> +		bdev_statx(path, stat, request_mask | STATX_DIOALIGN);

And this is both unrelated and wrong.




* Re: [RFC 6/8] block/bdev: lift block size restrictions and use common definition
  2024-11-13  9:47 ` [RFC 6/8] block/bdev: lift block size restrictions and use common definition Luis Chamberlain
  2024-11-13  9:57   ` Hannes Reinecke
  2024-11-13 14:14   ` Matthew Wilcox
@ 2024-11-18  9:18   ` John Garry
  2 siblings, 0 replies; 21+ messages in thread
From: John Garry @ 2024-11-18  9:18 UTC (permalink / raw)
  To: Luis Chamberlain, willy, hch, hare, david, djwong
  Cc: ritesh.list, kbusch, linux-fsdevel, linux-xfs, linux-mm,
	linux-block, gost.dev, p.raghav, da.gomez, kernel

On 13/11/2024 09:47, Luis Chamberlain wrote:
>   #include <linux/uuid.h>
>   #include <linux/xarray.h>
>   #include <linux/file.h>
> +#include <linux/pagemap.h>
>   
>   struct module;
>   struct request_queue;
> @@ -268,10 +269,13 @@ static inline dev_t disk_devt(struct gendisk *disk)
>   	return MKDEV(disk->major, disk->first_minor);
>   }
>   
> +/* We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER) */

I feel that this comment can be reworked.

I think that what we want to say is that the hard limit is 1 << (PAGE_SHIFT
+ MAX_PAGECACHE_ORDER), but we set it at a sensible size of 64K.

> +#define BLK_MAX_BLOCK_SIZE      (SZ_64K)
> +




* Re: [RFC 8/8] bdev: use bdev_io_min() for statx block size
  2024-11-18  7:08   ` Christoph Hellwig
@ 2024-11-18 21:16     ` Luis Chamberlain
  2024-11-19  6:08       ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2024-11-18 21:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: willy, hare, david, djwong, john.g.garry, ritesh.list, kbusch,
	linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
	p.raghav, da.gomez, kernel, nilay

On Mon, Nov 18, 2024 at 08:08:05AM +0100, Christoph Hellwig wrote:
> On Wed, Nov 13, 2024 at 01:47:27AM -0800, Luis Chamberlain wrote:
> >  	if (S_ISBLK(stat->mode))
> > -		bdev_statx(path, stat, request_mask);
> > +		bdev_statx(path, stat, request_mask | STATX_DIOALIGN);
> 
> And this is both unrelated and wrong.

I knew this was an eyesore, but was not sure if we really wanted to
go through the trouble of adding a new field for blksize alone, but come
to think of it, with it at least userspace knows for sure it's getting
it, whereas by default it was not.

If we add it, then since it would be added post LBS support, it could also
signal that a kernel supports LBS. That may be a useful clue for default
mkfs in case it is set and larger than today's 4k default.

So how about:

diff --git a/block/bdev.c b/block/bdev.c
index 3a5fd65f6c8e..f5d7cda97616 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1277,7 +1277,8 @@ void bdev_statx(struct path *path, struct kstat *stat,
 	struct inode *backing_inode;
 	struct block_device *bdev;
 
-	if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC)))
+	if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC |
+			      STATX_BLKSIZE)))
 		return;
 
 	backing_inode = d_backing_inode(path->dentry);
@@ -1306,6 +1307,11 @@ void bdev_statx(struct path *path, struct kstat *stat,
 			queue_atomic_write_unit_max_bytes(bd_queue));
 	}
 
+	if (request_mask & STATX_BLKSIZE) {
+		stat->blksize = (unsigned int) bdev_io_min(bdev);
+		stat->result_mask |= STATX_BLKSIZE;
+	}
+
 	blkdev_put_no_open(bdev);
 }
 
diff --git a/fs/stat.c b/fs/stat.c
index 41e598376d7e..d4cb2296b42d 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -268,7 +268,7 @@ static int vfs_statx_path(struct path *path, int flags, struct kstat *stat,
 	 * obtained from the bdev backing inode.
 	 */
 	if (S_ISBLK(stat->mode))
-		bdev_statx(path, stat, request_mask);
+		bdev_statx(path, stat, request_mask | STATX_BLKSIZE);
 
 	return error;
 }
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 887a25286441..b7e180bf72b8 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -164,6 +164,7 @@ struct statx {
 #define STATX_MNT_ID_UNIQUE	0x00004000U	/* Want/got extended stx_mount_id */
 #define STATX_SUBVOL		0x00008000U	/* Want/got stx_subvol */
 #define STATX_WRITE_ATOMIC	0x00010000U	/* Want/got atomic_write_* fields */
+#define STATX_BLKSIZE		0x00020000U	/* Want/got stx_blksize */
 
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
 



* Re: [RFC 8/8] bdev: use bdev_io_min() for statx block size
  2024-11-18 21:16     ` Luis Chamberlain
@ 2024-11-19  6:08       ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2024-11-19  6:08 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, willy, hare, david, djwong, john.g.garry,
	ritesh.list, kbusch, linux-fsdevel, linux-xfs, linux-mm,
	linux-block, gost.dev, p.raghav, da.gomez, kernel, nilay

On Mon, Nov 18, 2024 at 01:16:37PM -0800, Luis Chamberlain wrote:
> On Mon, Nov 18, 2024 at 08:08:05AM +0100, Christoph Hellwig wrote:
> > On Wed, Nov 13, 2024 at 01:47:27AM -0800, Luis Chamberlain wrote:
> > >  	if (S_ISBLK(stat->mode))
> > > -		bdev_statx(path, stat, request_mask);
> > > +		bdev_statx(path, stat, request_mask | STATX_DIOALIGN);
> > 
> > And this is both unrelated and wrong.
> 
> I knew this was an eyesore, but was not sure if we really wanted to
> go through the trouble of adding a new field for blksize alone, but come
> to think of it, with it at least userspace knows for sure its
> getting where as befault it was not.

Huh?  The only thing this does is force filling out the dio align
fields when not requested.  It has nothing to do with block sizes.

> So how about:
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 3a5fd65f6c8e..f5d7cda97616 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -1277,7 +1277,8 @@ void bdev_statx(struct path *path, struct kstat *stat,
>  	struct inode *backing_inode;
>  	struct block_device *bdev;
>  
> -	if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC)))
> +	if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC |
> +			      STATX_BLKSIZE)))

Just drop this conditional entirely and you're fine.



