* [RFC v2 00/11] enable bs > ps for block devices
@ 2024-12-14 3:10 Luis Chamberlain
2024-12-14 3:10 ` [RFC v2 01/11] fs/buffer: move async batch read code into a helper Luis Chamberlain
` (10 more replies)
0 siblings, 11 replies; 30+ messages in thread
From: Luis Chamberlain @ 2024-12-14 3:10 UTC (permalink / raw)
To: willy, hch, hare, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
Now that LBS is upstream as of v6.12 and we have bs > ps support for
filesystems, this series adds support for bs > ps block devices. The first
RFC v1 was posted [0] before the turkey massacre in the US. The changes in
this v2 address all the feedback from that series.
The changes on this v2:
- Simplify with folio_pos() as suggested by Matthew Wilcox
- Reduce the buffer head array of size MAX_BUF_PER_PAGE to avoid
stack growth warnings on systems with large base page sizes, such
as Hexagon with its 256 KiB base page size, by adding the async batch
helper bh_read_batch_async() and iterator support for
block_read_full_folio()
- Simplify bdev_io_min() as suggested by Christoph Hellwig
- Collect tags, make minor comment enhancements, and remove new header
inclusions where not actually needed
This still goes out as an RFC as I'm not yet done with my testing. As part
of my test plan I'm including a baseline of ext4, since we're mucking around
with buffer heads and testing XFS alone won't help ensure we don't regress
existing buffer head users. I'm also testing XFS with 32k sector size support,
given part of this enablement is to allow filesystems to also increase their
supported sector size.
Patches 2-4 are really the meat and bones of these changes and careful
review of them is appreciated. I suspect there is a bit of bikeshedding
potential for those patches as well.
If you'd like to help test this, it is available in the kdevops linux
branch large-block-buffer-heads-for-next [1]. It is based on v6.13-rc2, and
that tree carries a fix not yet merged in v6.13-rc2 which is required for LBS.
That fix is already being tested and planned for v6.13-rc3; I carry it since
otherwise you wouldn't be able to mount any LBS filesystem with a filesystem
block size larger than 16k.
[0] https://lkml.kernel.org/r/20241113094727.1497722-1-mcgrof@kernel.org
[1] https://github.com/linux-kdevops/linux/tree/large-block-buffer-heads-for-next
Hannes Reinecke (3):
fs/mpage: use blocks_per_folio instead of blocks_per_page
fs/mpage: avoid negative shift for large blocksize
block/bdev: enable large folio support for large logical block sizes
Luis Chamberlain (8):
fs/buffer: move async batch read code into a helper
fs/buffer: add a for_each_bh() for block_read_full_folio()
fs/buffer: add iteration support for block_read_full_folio()
fs/buffer: reduce stack usage on bh_read_iter()
fs/buffer fs/mpage: remove large folio restriction
block/bdev: lift block size restrictions and use common definition
nvme: remove superfluous block size check
bdev: use bdev_io_min() for statx block size
block/bdev.c | 13 +--
drivers/nvme/host/core.c | 10 --
fs/buffer.c | 209 +++++++++++++++++++++++++++------------
fs/mpage.c | 47 +++++----
include/linux/blkdev.h | 11 ++-
5 files changed, 187 insertions(+), 103 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC v2 01/11] fs/buffer: move async batch read code into a helper
2024-12-14 3:10 [RFC v2 00/11] enable bs > ps for block devices Luis Chamberlain
@ 2024-12-14 3:10 ` Luis Chamberlain
2024-12-17 9:56 ` Hannes Reinecke
2024-12-14 3:10 ` [RFC v2 02/11] fs/buffer: add a for_each_bh() for block_read_full_folio() Luis Chamberlain
` (9 subsequent siblings)
10 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2024-12-14 3:10 UTC (permalink / raw)
To: willy, hch, hare, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
Move the code from block_read_full_folio() which does a batch of async
reads into a helper.
No functional changes.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 73 +++++++++++++++++++++++++++++++----------------------
1 file changed, 43 insertions(+), 30 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index cc8452f60251..580451337efa 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2350,6 +2350,48 @@ bool block_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
}
EXPORT_SYMBOL(block_is_partially_uptodate);
+static void bh_read_batch_async(struct folio *folio,
+ int nr, struct buffer_head *arr[],
+ bool fully_mapped, bool no_reads,
+ bool any_get_block_error)
+{
+ int i;
+ struct buffer_head *bh;
+
+ if (fully_mapped)
+ folio_set_mappedtodisk(folio);
+
+ if (no_reads) {
+ /*
+ * All buffers are uptodate or get_block() returned an
+ * error when trying to map them all - we can finish
+ * the read.
+ */
+ folio_end_read(folio, !any_get_block_error);
+ return;
+ }
+
+ /* Stage one: lock the buffers */
+ for (i = 0; i < nr; i++) {
+ bh = arr[i];
+ lock_buffer(bh);
+ mark_buffer_async_read(bh);
+ }
+
+ /*
+ * Stage two: start the IO. Check for uptodateness
+ * inside the buffer lock in case another process reading
+ * the underlying blockdev brought it uptodate (the sct fix).
+ */
+ for (i = 0; i < nr; i++) {
+ bh = arr[i];
+ if (buffer_uptodate(bh))
+ end_buffer_async_read(bh, 1);
+ else
+ submit_bh(REQ_OP_READ, bh);
+ }
+}
+
/*
* Generic "read_folio" function for block devices that have the normal
* get_block functionality. This is most of the block device filesystems.
@@ -2414,37 +2456,8 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
arr[nr++] = bh;
} while (i++, iblock++, (bh = bh->b_this_page) != head);
- if (fully_mapped)
- folio_set_mappedtodisk(folio);
-
- if (!nr) {
- /*
- * All buffers are uptodate or get_block() returned an
- * error when trying to map them - we can finish the read.
- */
- folio_end_read(folio, !page_error);
- return 0;
- }
-
- /* Stage two: lock the buffers */
- for (i = 0; i < nr; i++) {
- bh = arr[i];
- lock_buffer(bh);
- mark_buffer_async_read(bh);
- }
+ bh_read_batch_async(folio, nr, arr, fully_mapped, nr == 0, page_error);
- /*
- * Stage 3: start the IO. Check for uptodateness
- * inside the buffer lock in case another process reading
- * the underlying blockdev brought it uptodate (the sct fix).
- */
- for (i = 0; i < nr; i++) {
- bh = arr[i];
- if (buffer_uptodate(bh))
- end_buffer_async_read(bh, 1);
- else
- submit_bh(REQ_OP_READ, bh);
- }
return 0;
}
EXPORT_SYMBOL(block_read_full_folio);
--
2.43.0
* [RFC v2 02/11] fs/buffer: add a for_each_bh() for block_read_full_folio()
2024-12-14 3:10 [RFC v2 00/11] enable bs > ps for block devices Luis Chamberlain
2024-12-14 3:10 ` [RFC v2 01/11] fs/buffer: move async batch read code into a helper Luis Chamberlain
@ 2024-12-14 3:10 ` Luis Chamberlain
2024-12-14 4:02 ` Matthew Wilcox
2024-12-17 9:57 ` Hannes Reinecke
2024-12-14 3:10 ` [RFC v2 03/11] fs/buffer: add iteration support " Luis Chamberlain
` (8 subsequent siblings)
10 siblings, 2 replies; 30+ messages in thread
From: Luis Chamberlain @ 2024-12-14 3:10 UTC (permalink / raw)
To: willy, hch, hare, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
We want to be able to work through all buffer heads on a folio
for an async read, but in the future we want to support the option
to stop before we've processed all linked buffer heads. To make
the code easier to read and follow, and to allow it to be expanded
in subsequent patches, adopt a for_each_bh(tmp, head) loop instead
of a do { ... } while () loop.
This introduces no functional changes.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 580451337efa..108e1c36fc1a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2392,6 +2392,17 @@ static void bh_read_batch_async(struct folio *folio,
}
}
+#define bh_is_last(__bh, __head) ((__bh)->b_this_page == (__head))
+
+#define bh_next(__bh, __head) \
+ (bh_is_last(__bh, __head) ? NULL : (__bh)->b_this_page)
+
+/* Starts from the provided head */
+#define for_each_bh(__tmp, __head) \
+ for ((__tmp) = (__head); \
+ (__tmp); \
+ (__tmp) = bh_next(__tmp, __head))
+
/*
* Generic "read_folio" function for block devices that have the normal
* get_block functionality. This is most of the block device filesystems.
@@ -2421,11 +2432,10 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
iblock = div_u64(folio_pos(folio), blocksize);
lblock = div_u64(limit + blocksize - 1, blocksize);
- bh = head;
nr = 0;
i = 0;
- do {
+ for_each_bh(bh, head) {
if (buffer_uptodate(bh))
continue;
@@ -2454,7 +2464,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
continue;
}
arr[nr++] = bh;
- } while (i++, iblock++, (bh = bh->b_this_page) != head);
+ i++;
+ iblock++;
+ }
bh_read_batch_async(folio, nr, arr, fully_mapped, nr == 0, page_error);
--
2.43.0
* [RFC v2 03/11] fs/buffer: add iteration support for block_read_full_folio()
2024-12-14 3:10 [RFC v2 00/11] enable bs > ps for block devices Luis Chamberlain
2024-12-14 3:10 ` [RFC v2 01/11] fs/buffer: move async batch read code into a helper Luis Chamberlain
2024-12-14 3:10 ` [RFC v2 02/11] fs/buffer: add a for_each_bh() for block_read_full_folio() Luis Chamberlain
@ 2024-12-14 3:10 ` Luis Chamberlain
2024-12-17 10:00 ` Hannes Reinecke
2024-12-14 3:10 ` [RFC v2 04/11] fs/buffer: reduce stack usage on bh_read_iter() Luis Chamberlain
` (7 subsequent siblings)
10 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2024-12-14 3:10 UTC (permalink / raw)
To: willy, hch, hare, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
Provide a helper to iterate over the buffer heads on a folio. We do this
as a preliminary step to make the subsequent changes easier to read.
Right now we use an on-stack array of size MAX_BUF_PER_PAGE to loop over
all buffer heads in a folio; on CPUs where the base page size is quite
large, such as Hexagon with 256 KiB page size support, this can mean the
kernel ends up spewing stack growth warnings.
To be able to break this down into smaller chunks, add support for
processing a smaller array of buffer heads at a time. The array size
used is not changed yet, that will be done in a subsequent patch;
this just adds the iterator support and logic.
While at it, clarify the booleans used by bh_read_batch_async() and how
they are only meaningful once we've processed all buffer heads of a
folio, that is, when we're on the last buffer head in a folio:
* bh_folio_reads
* unmapped
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 130 +++++++++++++++++++++++++++++++++++++---------------
1 file changed, 94 insertions(+), 36 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 108e1c36fc1a..f8e6a5454dbb 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2397,57 +2397,64 @@ static void bh_read_batch_async(struct folio *folio,
#define bh_next(__bh, __head) \
(bh_is_last(__bh, __head) ? NULL : (__bh)->b_this_page)
+/* Starts from a pivot which you initialize */
+#define for_each_bh_pivot(__pivot, __last, __head) \
+ for ((__pivot) = __last = (__pivot); \
+ (__pivot); \
+ (__pivot) = bh_next(__pivot, __head), \
+ (__last) = (__pivot) ? (__pivot) : (__last))
+
/* Starts from the provided head */
#define for_each_bh(__tmp, __head) \
for ((__tmp) = (__head); \
(__tmp); \
(__tmp) = bh_next(__tmp, __head))
+struct bh_iter {
+ sector_t iblock;
+ get_block_t *get_block;
+ bool any_get_block_error;
+ int unmapped;
+ int bh_folio_reads;
+};
+
/*
- * Generic "read_folio" function for block devices that have the normal
- * get_block functionality. This is most of the block device filesystems.
- * Reads the folio asynchronously --- the unlock_buffer() and
- * set/clear_buffer_uptodate() functions propagate buffer state into the
- * folio once IO has completed.
+ * Reads up to MAX_BUF_PER_PAGE buffer heads at a time on a folio over the
+ * given block range iblock to lblock, and updates iter->bh_folio_reads
+ * with the number of buffer heads on the folio which were not uptodate or
+ * not mapped and for which we issued an async read. Returns the last
+ * buffer head we worked on.
*/
-int block_read_full_folio(struct folio *folio, get_block_t *get_block)
-{
- struct inode *inode = folio->mapping->host;
- sector_t iblock, lblock;
- struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
- size_t blocksize;
- int nr, i;
- int fully_mapped = 1;
- bool page_error = false;
- loff_t limit = i_size_read(inode);
-
- /* This is needed for ext4. */
- if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
- limit = inode->i_sb->s_maxbytes;
-
- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
- head = folio_create_buffers(folio, inode, 0);
- blocksize = head->b_size;
+static struct buffer_head *bh_read_iter(struct folio *folio,
+ struct buffer_head *pivot,
+ struct buffer_head *head,
+ struct inode *inode,
+ struct bh_iter *iter, sector_t lblock)
+{
+ struct buffer_head *arr[MAX_BUF_PER_PAGE];
+ struct buffer_head *bh = pivot, *last;
+ int nr = 0, i = 0;
+ size_t blocksize = head->b_size;
+ bool no_reads = false;
+ bool fully_mapped = false;
+
+ /* collect buffers not uptodate and not mapped yet */
+ for_each_bh_pivot(bh, last, head) {
+ BUG_ON(nr >= MAX_BUF_PER_PAGE);
- iblock = div_u64(folio_pos(folio), blocksize);
- lblock = div_u64(limit + blocksize - 1, blocksize);
- nr = 0;
- i = 0;
-
- for_each_bh(bh, head) {
if (buffer_uptodate(bh))
continue;
if (!buffer_mapped(bh)) {
int err = 0;
- fully_mapped = 0;
- if (iblock < lblock) {
+ iter->unmapped++;
+ if (iter->iblock < lblock) {
WARN_ON(bh->b_size != blocksize);
- err = get_block(inode, iblock, bh, 0);
+ err = iter->get_block(inode, iter->iblock,
+ bh, 0);
if (err)
- page_error = true;
+ iter->any_get_block_error = true;
}
if (!buffer_mapped(bh)) {
folio_zero_range(folio, i * blocksize,
@@ -2465,10 +2472,61 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
}
arr[nr++] = bh;
i++;
- iblock++;
+ iter->iblock++;
+ }
+
+ iter->bh_folio_reads += nr;
+
+ WARN_ON_ONCE(!bh_is_last(last, head));
+
+ if (bh_is_last(last, head)) {
+ if (!iter->bh_folio_reads)
+ no_reads = true;
+ if (!iter->unmapped)
+ fully_mapped = true;
}
- bh_read_batch_async(folio, nr, arr, fully_mapped, nr == 0, page_error);
+ bh_read_batch_async(folio, nr, arr, fully_mapped, no_reads,
+ iter->any_get_block_error);
+
+ return last;
+}
+
+/*
+ * Generic "read_folio" function for block devices that have the normal
+ * get_block functionality. This is most of the block device filesystems.
+ * Reads the folio asynchronously --- the unlock_buffer() and
+ * set/clear_buffer_uptodate() functions propagate buffer state into the
+ * folio once IO has completed.
+ */
+int block_read_full_folio(struct folio *folio, get_block_t *get_block)
+{
+ struct inode *inode = folio->mapping->host;
+ sector_t lblock;
+ size_t blocksize;
+ struct buffer_head *bh, *head;
+ struct bh_iter iter = {
+ .get_block = get_block,
+ .unmapped = 0,
+ .any_get_block_error = false,
+ .bh_folio_reads = 0,
+ };
+ loff_t limit = i_size_read(inode);
+
+ /* This is needed for ext4. */
+ if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
+ limit = inode->i_sb->s_maxbytes;
+
+ VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
+
+ head = folio_create_buffers(folio, inode, 0);
+ blocksize = head->b_size;
+
+ iter.iblock = div_u64(folio_pos(folio), blocksize);
+ lblock = div_u64(limit + blocksize - 1, blocksize);
+
+ for_each_bh(bh, head)
+ bh = bh_read_iter(folio, bh, head, inode, &iter, lblock);
return 0;
}
--
2.43.0
* [RFC v2 04/11] fs/buffer: reduce stack usage on bh_read_iter()
2024-12-14 3:10 [RFC v2 00/11] enable bs > ps for block devices Luis Chamberlain
` (2 preceding siblings ...)
2024-12-14 3:10 ` [RFC v2 03/11] fs/buffer: add iteration support " Luis Chamberlain
@ 2024-12-14 3:10 ` Luis Chamberlain
2024-12-17 10:04 ` Hannes Reinecke
2024-12-14 3:10 ` [RFC v2 05/11] fs/mpage: use blocks_per_folio instead of blocks_per_page Luis Chamberlain
` (6 subsequent siblings)
10 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2024-12-14 3:10 UTC (permalink / raw)
To: willy, hch, hare, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
Now that we can read buffer heads from a folio asynchronously in
chunks, we can reduce the array size used by bh_read_iter().
Use an array of 8 to avoid stack growth warnings on systems with
huge base page sizes.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index f8e6a5454dbb..b4994c48e6ee 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2410,7 +2410,10 @@ static void bh_read_batch_async(struct folio *folio,
(__tmp); \
(__tmp) = bh_next(__tmp, __head))
+#define MAX_BUF_CHUNK 8
+
struct bh_iter {
+ int chunk_number;
sector_t iblock;
get_block_t *get_block;
bool any_get_block_error;
@@ -2419,7 +2422,7 @@ struct bh_iter {
};
/*
- * Reads up to MAX_BUF_PER_PAGE buffer heads at a time on a folio on the given
+ * Reads up to MAX_BUF_CHUNK buffer heads at a time on a folio on the given
* block range iblock to lblock and helps update the number of buffer-heads
* which were not uptodate or unmapped for which we issued an async read for
* on iter->bh_folio_reads for the full folio. Returns the last buffer-head we
@@ -2431,16 +2434,18 @@ static struct buffer_head *bh_read_iter(struct folio *folio,
struct inode *inode,
struct bh_iter *iter, sector_t lblock)
{
- struct buffer_head *arr[MAX_BUF_PER_PAGE];
+ struct buffer_head *arr[MAX_BUF_CHUNK];
struct buffer_head *bh = pivot, *last;
int nr = 0, i = 0;
size_t blocksize = head->b_size;
+ int chunk_idx = MAX_BUF_CHUNK * iter->chunk_number;
bool no_reads = false;
bool fully_mapped = false;
/* collect buffers not uptodate and not mapped yet */
for_each_bh_pivot(bh, last, head) {
- BUG_ON(nr >= MAX_BUF_PER_PAGE);
+ if (nr >= MAX_BUF_CHUNK)
+ break;
if (buffer_uptodate(bh))
continue;
@@ -2457,7 +2462,8 @@ static struct buffer_head *bh_read_iter(struct folio *folio,
iter->any_get_block_error = true;
}
if (!buffer_mapped(bh)) {
- folio_zero_range(folio, i * blocksize,
+ folio_zero_range(folio,
+ (i + chunk_idx) * blocksize,
blocksize);
if (!err)
set_buffer_uptodate(bh);
@@ -2476,8 +2482,7 @@ static struct buffer_head *bh_read_iter(struct folio *folio,
}
iter->bh_folio_reads += nr;
-
- WARN_ON_ONCE(!bh_is_last(last, head));
+ iter->chunk_number++;
if (bh_is_last(last, head)) {
if (!iter->bh_folio_reads)
@@ -2507,6 +2512,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
struct buffer_head *bh, *head;
struct bh_iter iter = {
.get_block = get_block,
+ .chunk_number = 0,
.unmapped = 0,
.any_get_block_error = false,
.bh_folio_reads = 0,
--
2.43.0
* [RFC v2 05/11] fs/mpage: use blocks_per_folio instead of blocks_per_page
2024-12-14 3:10 [RFC v2 00/11] enable bs > ps for block devices Luis Chamberlain
` (3 preceding siblings ...)
2024-12-14 3:10 ` [RFC v2 04/11] fs/buffer: reduce stack usage on bh_read_iter() Luis Chamberlain
@ 2024-12-14 3:10 ` Luis Chamberlain
2024-12-14 4:46 ` Matthew Wilcox
2024-12-14 3:10 ` [RFC v2 06/11] fs/mpage: avoid negative shift for large blocksize Luis Chamberlain
` (5 subsequent siblings)
10 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2024-12-14 3:10 UTC (permalink / raw)
To: willy, hch, hare, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
From: Hannes Reinecke <hare@suse.de>
Convert mpage to folios and associate the number of blocks with
a folio and not a page.
Signed-off-by: Hannes Reinecke <hare@suse.de>
---
fs/mpage.c | 40 ++++++++++++++++++++--------------------
1 file changed, 20 insertions(+), 20 deletions(-)
diff --git a/fs/mpage.c b/fs/mpage.c
index 82aecf372743..eb6fee7de529 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -107,7 +107,7 @@ static void map_buffer_to_folio(struct folio *folio, struct buffer_head *bh,
* don't make any buffers if there is only one buffer on
* the folio and the folio just needs to be set up to date
*/
- if (inode->i_blkbits == PAGE_SHIFT &&
+ if (inode->i_blkbits == folio_shift(folio) &&
buffer_uptodate(bh)) {
folio_mark_uptodate(folio);
return;
@@ -153,7 +153,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
struct folio *folio = args->folio;
struct inode *inode = folio->mapping->host;
const unsigned blkbits = inode->i_blkbits;
- const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
+ const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
const unsigned blocksize = 1 << blkbits;
struct buffer_head *map_bh = &args->map_bh;
sector_t block_in_file;
@@ -161,7 +161,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
sector_t last_block_in_file;
sector_t first_block;
unsigned page_block;
- unsigned first_hole = blocks_per_page;
+ unsigned first_hole = blocks_per_folio;
struct block_device *bdev = NULL;
int length;
int fully_mapped = 1;
@@ -182,7 +182,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
goto confused;
block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
- last_block = block_in_file + args->nr_pages * blocks_per_page;
+ last_block = block_in_file + args->nr_pages * blocks_per_folio;
last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
if (last_block > last_block_in_file)
last_block = last_block_in_file;
@@ -204,7 +204,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
clear_buffer_mapped(map_bh);
break;
}
- if (page_block == blocks_per_page)
+ if (page_block == blocks_per_folio)
break;
page_block++;
block_in_file++;
@@ -216,7 +216,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
* Then do more get_blocks calls until we are done with this folio.
*/
map_bh->b_folio = folio;
- while (page_block < blocks_per_page) {
+ while (page_block < blocks_per_folio) {
map_bh->b_state = 0;
map_bh->b_size = 0;
@@ -229,7 +229,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
if (!buffer_mapped(map_bh)) {
fully_mapped = 0;
- if (first_hole == blocks_per_page)
+ if (first_hole == blocks_per_folio)
first_hole = page_block;
page_block++;
block_in_file++;
@@ -247,7 +247,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
goto confused;
}
- if (first_hole != blocks_per_page)
+ if (first_hole != blocks_per_folio)
goto confused; /* hole -> non-hole */
/* Contiguous blocks? */
@@ -260,7 +260,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
if (relative_block == nblocks) {
clear_buffer_mapped(map_bh);
break;
- } else if (page_block == blocks_per_page)
+ } else if (page_block == blocks_per_folio)
break;
page_block++;
block_in_file++;
@@ -268,7 +268,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
bdev = map_bh->b_bdev;
}
- if (first_hole != blocks_per_page) {
+ if (first_hole != blocks_per_folio) {
folio_zero_segment(folio, first_hole << blkbits, PAGE_SIZE);
if (first_hole == 0) {
folio_mark_uptodate(folio);
@@ -303,10 +303,10 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
relative_block = block_in_file - args->first_logical_block;
nblocks = map_bh->b_size >> blkbits;
if ((buffer_boundary(map_bh) && relative_block == nblocks) ||
- (first_hole != blocks_per_page))
+ (first_hole != blocks_per_folio))
args->bio = mpage_bio_submit_read(args->bio);
else
- args->last_block_in_bio = first_block + blocks_per_page - 1;
+ args->last_block_in_bio = first_block + blocks_per_folio - 1;
out:
return args->bio;
@@ -385,7 +385,7 @@ int mpage_read_folio(struct folio *folio, get_block_t get_block)
{
struct mpage_readpage_args args = {
.folio = folio,
- .nr_pages = 1,
+ .nr_pages = folio_nr_pages(folio),
.get_block = get_block,
};
@@ -456,12 +456,12 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
struct address_space *mapping = folio->mapping;
struct inode *inode = mapping->host;
const unsigned blkbits = inode->i_blkbits;
- const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
+ const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
sector_t last_block;
sector_t block_in_file;
sector_t first_block;
unsigned page_block;
- unsigned first_unmapped = blocks_per_page;
+ unsigned first_unmapped = blocks_per_folio;
struct block_device *bdev = NULL;
int boundary = 0;
sector_t boundary_block = 0;
@@ -486,12 +486,12 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
*/
if (buffer_dirty(bh))
goto confused;
- if (first_unmapped == blocks_per_page)
+ if (first_unmapped == blocks_per_folio)
first_unmapped = page_block;
continue;
}
- if (first_unmapped != blocks_per_page)
+ if (first_unmapped != blocks_per_folio)
goto confused; /* hole -> non-hole */
if (!buffer_dirty(bh) || !buffer_uptodate(bh))
@@ -536,7 +536,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
goto page_is_mapped;
last_block = (i_size - 1) >> blkbits;
map_bh.b_folio = folio;
- for (page_block = 0; page_block < blocks_per_page; ) {
+ for (page_block = 0; page_block < blocks_per_folio; ) {
map_bh.b_state = 0;
map_bh.b_size = 1 << blkbits;
@@ -618,14 +618,14 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
BUG_ON(folio_test_writeback(folio));
folio_start_writeback(folio);
folio_unlock(folio);
- if (boundary || (first_unmapped != blocks_per_page)) {
+ if (boundary || (first_unmapped != blocks_per_folio)) {
bio = mpage_bio_submit_write(bio);
if (boundary_block) {
write_boundary_block(boundary_bdev,
boundary_block, 1 << blkbits);
}
} else {
- mpd->last_block_in_bio = first_block + blocks_per_page - 1;
+ mpd->last_block_in_bio = first_block + blocks_per_folio - 1;
}
goto out;
--
2.43.0
* [RFC v2 06/11] fs/mpage: avoid negative shift for large blocksize
2024-12-14 3:10 [RFC v2 00/11] enable bs > ps for block devices Luis Chamberlain
` (4 preceding siblings ...)
2024-12-14 3:10 ` [RFC v2 05/11] fs/mpage: use blocks_per_folio instead of blocks_per_page Luis Chamberlain
@ 2024-12-14 3:10 ` Luis Chamberlain
2024-12-14 3:10 ` [RFC v2 07/11] fs/buffer fs/mpage: remove large folio restriction Luis Chamberlain
` (4 subsequent siblings)
10 siblings, 0 replies; 30+ messages in thread
From: Luis Chamberlain @ 2024-12-14 3:10 UTC (permalink / raw)
To: willy, hch, hare, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof, Hannes Reinecke
From: Hannes Reinecke <hare@kernel.org>
For large block sizes the number of block bits is larger than PAGE_SHIFT,
so instead use folio_pos(folio) >> blkbits to calculate the sector
number.
With this in place we can now enable large folios with buffer-heads.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
fs/mpage.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/mpage.c b/fs/mpage.c
index eb6fee7de529..c6bb2a9706a1 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -181,7 +181,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
if (folio_buffers(folio))
goto confused;
- block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
+ block_in_file = folio_pos(folio) >> blkbits;
last_block = block_in_file + args->nr_pages * blocks_per_folio;
last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
if (last_block > last_block_in_file)
@@ -527,7 +527,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
* The page has no buffers: map it to disk
*/
BUG_ON(!folio_test_uptodate(folio));
- block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
+ block_in_file = folio_pos(folio) >> blkbits;
/*
* Whole page beyond EOF? Skip allocating blocks to avoid leaking
* space.
--
2.43.0
* [RFC v2 07/11] fs/buffer fs/mpage: remove large folio restriction
2024-12-14 3:10 [RFC v2 00/11] enable bs > ps for block devices Luis Chamberlain
` (5 preceding siblings ...)
2024-12-14 3:10 ` [RFC v2 06/11] fs/mpage: avoid negative shift for large blocksize Luis Chamberlain
@ 2024-12-14 3:10 ` Luis Chamberlain
2024-12-14 3:10 ` [RFC v2 08/11] block/bdev: enable large folio support for large logical block sizes Luis Chamberlain
` (3 subsequent siblings)
10 siblings, 0 replies; 30+ messages in thread
From: Luis Chamberlain @ 2024-12-14 3:10 UTC (permalink / raw)
To: willy, hch, hare, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
Now that buffer-heads have been converted over to support large folios
we can remove the built-in VM_BUG_ON_FOLIO() checks which prevented
their use.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 2 --
fs/mpage.c | 3 ---
2 files changed, 5 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index b4994c48e6ee..4296bfb06fb1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2523,8 +2523,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
limit = inode->i_sb->s_maxbytes;
- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
head = folio_create_buffers(folio, inode, 0);
blocksize = head->b_size;
diff --git a/fs/mpage.c b/fs/mpage.c
index c6bb2a9706a1..c1b85be8df64 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -170,9 +170,6 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
unsigned relative_block;
gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
- /* MAX_BUF_PER_PAGE, for example */
- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
if (args->is_readahead) {
opf |= REQ_RAHEAD;
gfp |= __GFP_NORETRY | __GFP_NOWARN;
--
2.43.0
* [RFC v2 08/11] block/bdev: enable large folio support for large logical block sizes
2024-12-14 3:10 [RFC v2 00/11] enable bs > ps for block devices Luis Chamberlain
` (6 preceding siblings ...)
2024-12-14 3:10 ` [RFC v2 07/11] fs/buffer fs/mpage: remove large folio restriction Luis Chamberlain
@ 2024-12-14 3:10 ` Luis Chamberlain
2024-12-14 3:10 ` [RFC v2 09/11] block/bdev: lift block size restrictions and use common definition Luis Chamberlain
` (2 subsequent siblings)
10 siblings, 0 replies; 30+ messages in thread
From: Luis Chamberlain @ 2024-12-14 3:10 UTC (permalink / raw)
To: willy, hch, hare, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
From: Hannes Reinecke <hare@suse.de>
Call mapping_set_folio_min_order() when modifying the logical block
size to ensure folios are allocated with the correct size.
Signed-off-by: Hannes Reinecke <hare@suse.de>
---
block/bdev.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/block/bdev.c b/block/bdev.c
index 738e3c8457e7..167d82b46781 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -148,6 +148,8 @@ static void set_init_blocksize(struct block_device *bdev)
bsize <<= 1;
}
BD_INODE(bdev)->i_blkbits = blksize_bits(bsize);
+ mapping_set_folio_min_order(BD_INODE(bdev)->i_mapping,
+ get_order(bsize));
}
int set_blocksize(struct file *file, int size)
@@ -170,6 +172,7 @@ int set_blocksize(struct file *file, int size)
if (inode->i_blkbits != blksize_bits(size)) {
sync_blockdev(bdev);
inode->i_blkbits = blksize_bits(size);
+ mapping_set_folio_min_order(inode->i_mapping, get_order(size));
kill_bdev(bdev);
}
return 0;
--
2.43.0
* [RFC v2 09/11] block/bdev: lift block size restrictions and use common definition
2024-12-14 3:10 [RFC v2 00/11] enable bs > ps for block devices Luis Chamberlain
` (7 preceding siblings ...)
2024-12-14 3:10 ` [RFC v2 08/11] block/bdev: enable large folio support for large logical block sizes Luis Chamberlain
@ 2024-12-14 3:10 ` Luis Chamberlain
2024-12-16 8:55 ` John Garry
` (2 more replies)
2024-12-14 3:10 ` [RFC v2 10/11] nvme: remove superfluous block size check Luis Chamberlain
2024-12-14 3:10 ` [RFC v2 11/11] bdev: use bdev_io_min() for statx block size Luis Chamberlain
10 siblings, 3 replies; 30+ messages in thread
From: Luis Chamberlain @ 2024-12-14 3:10 UTC (permalink / raw)
To: willy, hch, hare, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
We can now support block sizes larger than PAGE_SIZE, so lift
the restriction up to the max supported page cache order and
bake this into a common helper used by the block layer.
We bound ourselves to 64k as a sensible limit. The hard limit,
however, is 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER).
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
block/bdev.c | 5 ++---
include/linux/blkdev.h | 11 ++++++++++-
2 files changed, 12 insertions(+), 4 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index 167d82b46781..b57dc4bff81b 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -157,8 +157,7 @@ int set_blocksize(struct file *file, int size)
struct inode *inode = file->f_mapping->host;
struct block_device *bdev = I_BDEV(inode);
- /* Size must be a power of two, and between 512 and PAGE_SIZE */
- if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
+ if (blk_validate_block_size(size))
return -EINVAL;
/* Size cannot be smaller than the size supported by the device */
@@ -185,7 +184,7 @@ int sb_set_blocksize(struct super_block *sb, int size)
if (set_blocksize(sb->s_bdev_file, size))
return 0;
/* If we get here, we know size is power of two
- * and it's value is between 512 and PAGE_SIZE */
+ * and its value is at least 512 */
sb->s_blocksize = size;
sb->s_blocksize_bits = blksize_bits(size);
return sb->s_blocksize;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 08a727b40816..a7303a55ed2a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -269,10 +269,19 @@ static inline dev_t disk_devt(struct gendisk *disk)
return MKDEV(disk->major, disk->first_minor);
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * The hard limit is 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER).
+ */
+#define BLK_MAX_BLOCK_SIZE (SZ_64K)
+#else
+#define BLK_MAX_BLOCK_SIZE (PAGE_SIZE)
+#endif
+
/* blk_validate_limits() validates bsize, so drivers don't usually need to */
static inline int blk_validate_block_size(unsigned long bsize)
{
- if (bsize < 512 || bsize > PAGE_SIZE || !is_power_of_2(bsize))
+ if (bsize < 512 || bsize > BLK_MAX_BLOCK_SIZE || !is_power_of_2(bsize))
return -EINVAL;
return 0;
--
2.43.0
* [RFC v2 10/11] nvme: remove superfluous block size check
2024-12-14 3:10 [RFC v2 00/11] enable bs > ps for block devices Luis Chamberlain
` (8 preceding siblings ...)
2024-12-14 3:10 ` [RFC v2 09/11] block/bdev: lift block size restrictions and use common definition Luis Chamberlain
@ 2024-12-14 3:10 ` Luis Chamberlain
2024-12-15 0:39 ` Matthew Wilcox
2024-12-14 3:10 ` [RFC v2 11/11] bdev: use bdev_io_min() for statx block size Luis Chamberlain
10 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2024-12-14 3:10 UTC (permalink / raw)
To: willy, hch, hare, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
The block layer already validates proper block sizes with
blk_validate_block_size() for us, so we can remove this
now-superfluous check.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
drivers/nvme/host/core.c | 10 ----------
1 file changed, 10 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d169a30eb935..bbb5e9d2415c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2029,16 +2029,6 @@ static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
u32 atomic_bs, phys_bs, io_opt = 0;
bool valid = true;
- /*
- * The block layer can't support LBA sizes larger than the page size
- * or smaller than a sector size yet, so catch this early and don't
- * allow block I/O.
- */
- if (head->lba_shift > PAGE_SHIFT || head->lba_shift < SECTOR_SHIFT) {
- bs = (1 << 9);
- valid = false;
- }
-
atomic_bs = phys_bs = bs;
if (id->nabo == 0) {
/*
--
2.43.0
* [RFC v2 11/11] bdev: use bdev_io_min() for statx block size
2024-12-14 3:10 [RFC v2 00/11] enable bs > ps for block devices Luis Chamberlain
` (9 preceding siblings ...)
2024-12-14 3:10 ` [RFC v2 10/11] nvme: remove superfluous block size check Luis Chamberlain
@ 2024-12-14 3:10 ` Luis Chamberlain
10 siblings, 0 replies; 30+ messages in thread
From: Luis Chamberlain @ 2024-12-14 3:10 UTC (permalink / raw)
To: willy, hch, hare, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
You can use lsblk to query a block device's block size:
lsblk -o MIN-IO /dev/nvme0n1
MIN-IO
4096
The min-io is the minimum IO size the block device prefers for optimal
performance. In turn, we map this to the block device block size.
The block size currently exposed is 4k even for block devices with an
LBA format of 16k. Likewise, devices which support a 4k LBA format
but have a larger indirection unit of 16k have an exposed block size
of 4k.
This incurs read-modify-writes on direct IO against devices with a
min-io larger than the page size. To fix this, use the block device
min-io, which is the minimal optimal IO size the device prefers.
With this we now get:
lsblk -o MIN-IO /dev/nvme0n1
MIN-IO
16384
And so userspace gets the appropriate information it needs for optimal
performance. This is verified with blkalgn while running mkfs against a
device with an LBA format of 4k but an NPWG of 16k (min-io size):
mkfs.xfs -f -b size=16k /dev/nvme3n1
blkalgn -d nvme3n1 --ops Write
Block size : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 0 | |
16384 -> 32767 : 66 |****************************************|
32768 -> 65535 : 0 | |
65536 -> 131071 : 0 | |
131072 -> 262143 : 2 |* |
Block size: 14 - 66
Block size: 17 - 2
Algn size : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 0 | |
16384 -> 32767 : 66 |****************************************|
32768 -> 65535 : 0 | |
65536 -> 131071 : 0 | |
131072 -> 262143 : 2 |* |
Algn size: 14 - 66
Algn size: 17 - 2
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
block/bdev.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index b57dc4bff81b..b1be720bd485 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1277,9 +1277,6 @@ void bdev_statx(struct path *path, struct kstat *stat,
struct inode *backing_inode;
struct block_device *bdev;
- if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC)))
- return;
-
backing_inode = d_backing_inode(path->dentry);
/*
@@ -1306,6 +1303,8 @@ void bdev_statx(struct path *path, struct kstat *stat,
queue_atomic_write_unit_max_bytes(bd_queue));
}
+ stat->blksize = bdev_io_min(bdev);
+
blkdev_put_no_open(bdev);
}
--
2.43.0
* Re: [RFC v2 02/11] fs/buffer: add a for_each_bh() for block_read_full_folio()
2024-12-14 3:10 ` [RFC v2 02/11] fs/buffer: add a for_each_bh() for block_read_full_folio() Luis Chamberlain
@ 2024-12-14 4:02 ` Matthew Wilcox
2024-12-16 18:56 ` Luis Chamberlain
2024-12-17 9:57 ` Hannes Reinecke
1 sibling, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2024-12-14 4:02 UTC (permalink / raw)
To: Luis Chamberlain
Cc: hch, hare, dave, david, djwong, john.g.garry, ritesh.list,
kbusch, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
On Fri, Dec 13, 2024 at 07:10:40PM -0800, Luis Chamberlain wrote:
> - do {
> + for_each_bh(bh, head) {
> if (buffer_uptodate(bh))
> continue;
>
> @@ -2454,7 +2464,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> continue;
> }
> arr[nr++] = bh;
> - } while (i++, iblock++, (bh = bh->b_this_page) != head);
> + i++;
> + iblock++;
> + }
This is non-equivalent. That 'continue' you can see would increment i
and iblock. Now it doesn't.
* Re: [RFC v2 05/11] fs/mpage: use blocks_per_folio instead of blocks_per_page
2024-12-14 3:10 ` [RFC v2 05/11] fs/mpage: use blocks_per_folio instead of blocks_per_page Luis Chamberlain
@ 2024-12-14 4:46 ` Matthew Wilcox
0 siblings, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2024-12-14 4:46 UTC (permalink / raw)
To: Luis Chamberlain
Cc: hch, hare, dave, david, djwong, john.g.garry, ritesh.list,
kbusch, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
On Fri, Dec 13, 2024 at 07:10:43PM -0800, Luis Chamberlain wrote:
Something is wrong here ... I'm just not sure what (nor am I sure why
testing hasn't caught it).
> - const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
> + const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
OK, fine ...
> - last_block = block_in_file + args->nr_pages * blocks_per_page;
> + last_block = block_in_file + args->nr_pages * blocks_per_folio;
So we've scaled up the right hand side of this, that's good.
> @@ -385,7 +385,7 @@ int mpage_read_folio(struct folio *folio, get_block_t get_block)
> {
> struct mpage_readpage_args args = {
> .folio = folio,
> - .nr_pages = 1,
> + .nr_pages = folio_nr_pages(folio),
Oh, but we've also scaled up the left hand side. So instead of being,
say, 4 times larger, it's now 16 times larger.
Now, where do we use it?
if (block_in_file < last_block) {
map_bh->b_size = (last_block-block_in_file) << blkbits;
if (args->get_block(inode, block_in_file, map_bh, 0))
goto confused;
so we're going to ask the filesystem for 4x as much data as we actually
need. Guess it'll depend on the filesystem whether that's a bad thing
or not.
I guess I think the right solution here is to keep working in terms of
pages. readahead_count() returns the number of pages, not the number of
folios (remember we can have a mix of different size folios in the same
readahead_batch).
But I can be argued with.
* Re: [RFC v2 10/11] nvme: remove superfluous block size check
2024-12-14 3:10 ` [RFC v2 10/11] nvme: remove superfluous block size check Luis Chamberlain
@ 2024-12-15 0:39 ` Matthew Wilcox
0 siblings, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2024-12-15 0:39 UTC (permalink / raw)
To: Luis Chamberlain
Cc: hch, hare, dave, david, djwong, john.g.garry, ritesh.list,
kbusch, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
On Fri, Dec 13, 2024 at 07:10:48PM -0800, Luis Chamberlain wrote:
> The block layer already validates proper block sizes with
> blk_validate_block_size() for us so we can remove this now
> superfluous check.
If this patch were correct, it wouldn't go far enough ("valid" is now
only assigned true, so it can be removed and the function could return
void, etc).
But I think there's still utility in checking the current configuration
to see if it can be supported by the block layer, and setting capacity
to 0 if it's not (so that the namespace can be reconfigured).
* Re: [RFC v2 09/11] block/bdev: lift block size restrictions and use common definition
2024-12-14 3:10 ` [RFC v2 09/11] block/bdev: lift block size restrictions and use common definition Luis Chamberlain
@ 2024-12-16 8:55 ` John Garry
2024-12-16 9:19 ` Ming Lei
2024-12-17 20:51 ` John Garry
2024-12-17 10:05 ` Hannes Reinecke
2024-12-17 21:00 ` Bart Van Assche
2 siblings, 2 replies; 30+ messages in thread
From: John Garry @ 2024-12-16 8:55 UTC (permalink / raw)
To: Luis Chamberlain, willy, hch, hare, dave, david, djwong
Cc: ritesh.list, kbusch, linux-fsdevel, linux-xfs, linux-mm,
linux-block, gost.dev, p.raghav, da.gomez, kernel
On 14/12/2024 03:10, Luis Chamberlain wrote:
> index 167d82b46781..b57dc4bff81b 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -157,8 +157,7 @@ int set_blocksize(struct file *file, int size)
> struct inode *inode = file->f_mapping->host;
> struct block_device *bdev = I_BDEV(inode);
>
> - /* Size must be a power of two, and between 512 and PAGE_SIZE */
> - if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
> + if (blk_validate_block_size(size))
> return -EINVAL;
I suppose that this can be sent as a separate patch to be merged now.
* Re: [RFC v2 09/11] block/bdev: lift block size restrictions and use common definition
2024-12-16 8:55 ` John Garry
@ 2024-12-16 9:19 ` Ming Lei
2024-12-16 10:13 ` John Garry
2024-12-17 20:51 ` John Garry
1 sibling, 1 reply; 30+ messages in thread
From: Ming Lei @ 2024-12-16 9:19 UTC (permalink / raw)
To: John Garry
Cc: Luis Chamberlain, willy, hch, hare, dave, david, djwong,
ritesh.list, kbusch, linux-fsdevel, linux-xfs, linux-mm,
linux-block, gost.dev, p.raghav, da.gomez, kernel
On Mon, Dec 16, 2024 at 4:58 PM John Garry <john.g.garry@oracle.com> wrote:
>
> On 14/12/2024 03:10, Luis Chamberlain wrote:
> > index 167d82b46781..b57dc4bff81b 100644
> > --- a/block/bdev.c
> > +++ b/block/bdev.c
> > @@ -157,8 +157,7 @@ int set_blocksize(struct file *file, int size)
> > struct inode *inode = file->f_mapping->host;
> > struct block_device *bdev = I_BDEV(inode);
> >
> > - /* Size must be a power of two, and between 512 and PAGE_SIZE */
> > - if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
> > + if (blk_validate_block_size(size))
> > return -EINVAL;
>
> I suppose that this can be sent as a separate patch to be merged now.
There have been some bugs found in the PAGE_SIZE == 64K case, and I
think it is bad to use PAGE_SIZE for validating many hw/queue limits; we
might have to fix them first.
Such as:
1) blk_validate_limits()
- failure if max_segment_size is less than 64K
- max_user_sectors
if (lim->max_user_sectors < PAGE_SIZE / SECTOR_SIZE)
return -EINVAL;
2) bio_may_need_split()
- max_segment_size may be less than 64K
Thanks,
* Re: [RFC v2 09/11] block/bdev: lift block size restrictions and use common definition
2024-12-16 9:19 ` Ming Lei
@ 2024-12-16 10:13 ` John Garry
2024-12-16 10:23 ` Ming Lei
0 siblings, 1 reply; 30+ messages in thread
From: John Garry @ 2024-12-16 10:13 UTC (permalink / raw)
To: Ming Lei
Cc: Luis Chamberlain, willy, hch, hare, dave, david, djwong,
ritesh.list, kbusch, linux-fsdevel, linux-xfs, linux-mm,
linux-block, gost.dev, p.raghav, da.gomez, kernel
On 16/12/2024 09:19, Ming Lei wrote:
> On Mon, Dec 16, 2024 at 4:58 PM John Garry <john.g.garry@oracle.com> wrote:
>>
>> On 14/12/2024 03:10, Luis Chamberlain wrote:
>>> index 167d82b46781..b57dc4bff81b 100644
>>> --- a/block/bdev.c
>>> +++ b/block/bdev.c
>>> @@ -157,8 +157,7 @@ int set_blocksize(struct file *file, int size)
>>> struct inode *inode = file->f_mapping->host;
>>> struct block_device *bdev = I_BDEV(inode);
>>>
>>> - /* Size must be a power of two, and between 512 and PAGE_SIZE */
>>> - if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
>>> + if (blk_validate_block_size(size))
>>> return -EINVAL;
>>
>> I suppose that this can be sent as a separate patch to be merged now.
>
> There have been some bugs found in case that PAGE_SIZE == 64K, and I
> think it is bad to use PAGE_SIZE for validating many hw/queue limits, we might
> have to fix them first.
I am just suggesting to remove duplicated code, as these checks are the
same as blk_validate_block_size()'s.
>
> Such as:
Isn't the below list just enforcing block layer requirements? And so
only block drivers need fixing for PAGE_SIZE > 4K (or cannot be used for
PAGE_SIZE > 4K), right?
>
> 1) blk_validate_limits()
>
> - failure if max_segment_size is less than 64K
> > - max_user_sectors
>
> if (lim->max_user_sectors < PAGE_SIZE / SECTOR_SIZE)
> return -EINVAL;
>
> 2) bio_may_need_split()
>
> - max_segment_size may be less than 64K
>
> Thanks,
>
* Re: [RFC v2 09/11] block/bdev: lift block size restrictions and use common definition
2024-12-16 10:13 ` John Garry
@ 2024-12-16 10:23 ` Ming Lei
0 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2024-12-16 10:23 UTC (permalink / raw)
To: John Garry
Cc: Luis Chamberlain, willy, hch, hare, dave, david, djwong,
ritesh.list, kbusch, linux-fsdevel, linux-xfs, linux-mm,
linux-block, gost.dev, p.raghav, da.gomez, kernel
On Mon, Dec 16, 2024 at 6:14 PM John Garry <john.g.garry@oracle.com> wrote:
>
> On 16/12/2024 09:19, Ming Lei wrote:
> > On Mon, Dec 16, 2024 at 4:58 PM John Garry <john.g.garry@oracle.com> wrote:
> >>
> >> On 14/12/2024 03:10, Luis Chamberlain wrote:
> >>> index 167d82b46781..b57dc4bff81b 100644
> >>> --- a/block/bdev.c
> >>> +++ b/block/bdev.c
> >>> @@ -157,8 +157,7 @@ int set_blocksize(struct file *file, int size)
> >>> struct inode *inode = file->f_mapping->host;
> >>> struct block_device *bdev = I_BDEV(inode);
> >>>
> >>> - /* Size must be a power of two, and between 512 and PAGE_SIZE */
> >>> - if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
> >>> + if (blk_validate_block_size(size))
> >>> return -EINVAL;
> >>
> >> I suppose that this can be sent as a separate patch to be merged now.
> >
> > There have been some bugs found in case that PAGE_SIZE == 64K, and I
> > think it is bad to use PAGE_SIZE for validating many hw/queue limits, we might
> > have to fix them first.
>
> I am just suggesting to remove duplicated code, as these checks are same
> as blk_validate_block_size()
My fault, I misunderstood your point as pushing this single patch only.
>
> >
> > Such as:
>
> Aren't the below list just enforcing block layer requirements? And so
> only block drivers need fixing for PAGE_SIZE > 4K (or cannot be used for
> PAGE_SIZE > 4K), right?
It is the block layer which should be fixed to support PAGE_SIZE > 4K.
Thanks,
* Re: [RFC v2 02/11] fs/buffer: add a for_each_bh() for block_read_full_folio()
2024-12-14 4:02 ` Matthew Wilcox
@ 2024-12-16 18:56 ` Luis Chamberlain
2024-12-16 20:05 ` Luis Chamberlain
2024-12-17 8:46 ` Luis Chamberlain
0 siblings, 2 replies; 30+ messages in thread
From: Luis Chamberlain @ 2024-12-16 18:56 UTC (permalink / raw)
To: Matthew Wilcox
Cc: hch, hare, dave, david, djwong, john.g.garry, ritesh.list,
kbusch, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
On Sat, Dec 14, 2024 at 04:02:53AM +0000, Matthew Wilcox wrote:
> On Fri, Dec 13, 2024 at 07:10:40PM -0800, Luis Chamberlain wrote:
> > - do {
> > + for_each_bh(bh, head) {
> > if (buffer_uptodate(bh))
> > continue;
> >
> > @@ -2454,7 +2464,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> > continue;
> > }
> > arr[nr++] = bh;
> > - } while (i++, iblock++, (bh = bh->b_this_page) != head);
> > + i++;
> > + iblock++;
> > + }
>
> This is non-equivalent. That 'continue' you can see would increment i
> and iblock. Now it doesn't.
Thanks, not sure how I missed that! With that fix in place I ran a full
baseline against ext4 and all XFS profiles.
For ext4 the new failures I see are just:
* generic/044
* generic/045
* generic/046
These are cases where we race writing a file, truncate it, and then
verify that if the file is non-zero it has extents. I'll run a regression
test to see which commit messes this up.
For XFS I've tested 20 XFS profiles (non-LBS) and 4 LBS profiles, using
the latest kdevops-results-archive test results for "fixes-6.13_2024-12-11"
as the baseline and these patches + the loop fix you mentioned as the
test; I mostly see these failures I need to look into:
* xfs/009
* xfs/059
* xfs/155
* xfs/168
* xfs/185
* xfs/301
* generic/753
I'm not sure yet if these are flaky or real. The LBS profiles are using 4k sector sizes.
Also, when testing with the XFS 32k sector size profile, generic/470
reveals device mapper needs to be updated to reject larger sector sizes
it does not yet support, as we do with the NVMe block driver.
The full set of failures for XFS with 32k sector sizes:
generic/054 generic/055 generic/081 generic/102 generic/172 generic/223
generic/347 generic/405 generic/455 generic/457 generic/482 generic/500
generic/741 xfs/014 xfs/020 xfs/032 xfs/049 xfs/078 xfs/129 xfs/144
xfs/149 xfs/164 xfs/165 xfs/170 xfs/174 xfs/188 xfs/206 xfs/216 xfs/234
xfs/250 xfs/253 xfs/284 xfs/289 xfs/292 xfs/294 xfs/503 xfs/514 xfs/522
xfs/524 xfs/543 xfs/597 xfs/598 xfs/604 xfs/605 xfs/606 xfs/614 xfs/631
xfs/806
The full output I get by comparing the test results from
fixes-6.13_2024-12-11 and the run I just did, from inside
kdevops-results-archive:
./bin/compare-results-fstests.py d48182fc621f87bc941ef4445e4585a3891923e9 cd7aa6fc6e46733a5dcf6a10b89566cabe0beaf
Comparing commits:
Baseline: d48182fc621f | linux-xfs-kpd: Merge tag 'fixes-6.13_2024-12-11' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into next-rc
Test: cd7aa6fc6e46 | linux-xfs-kpd: loop fix noted by willy
Baseline Kernel:6.13.0-rc2+
Test Kernel: 6.13.0-rc2+
Test Results Comparison:
================================================================================
Profile: xfs_crc
New Failures:
+ xfs/059
Profile: xfs_crc_rtdev_extsize_28k
New Failures:
+ xfs/301
Resolved Failures:
- xfs/185
Profile: xfs_crc_rtdev_extsize_64k
New Failures:
+ xfs/155
+ xfs/301
Resolved Failures:
- xfs/629
Profile: xfs_nocrc
New Failures:
+ generic/753
Profile: xfs_nocrc_2k
New Failures:
+ xfs/009
Profile: xfs_nocrc_4k
New Failures:
+ xfs/301
Profile: xfs_reflink_1024
New Failures:
+ xfs/168
Resolved Failures:
- xfs/033
Profile: xfs_reflink_16k_4ks
New Failures:
+ xfs/059
Profile: xfs_reflink_8k_4ks
New Failures:
+ xfs/301
Profile: xfs_reflink_dir_bsize_8k
New Failures:
+ xfs/301
Profile: xfs_reflink_stripe_len
New Failures:
+ xfs/301
[0] https://github.com/linux-kdevops/kdevops-results-archive
Luis
* Re: [RFC v2 02/11] fs/buffer: add a for_each_bh() for block_read_full_folio()
2024-12-16 18:56 ` Luis Chamberlain
@ 2024-12-16 20:05 ` Luis Chamberlain
2024-12-16 21:46 ` Luis Chamberlain
2024-12-17 8:46 ` Luis Chamberlain
1 sibling, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2024-12-16 20:05 UTC (permalink / raw)
To: Matthew Wilcox
Cc: hch, hare, dave, david, djwong, john.g.garry, ritesh.list,
kbusch, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
On Mon, Dec 16, 2024 at 10:56:44AM -0800, Luis Chamberlain wrote:
> On Sat, Dec 14, 2024 at 04:02:53AM +0000, Matthew Wilcox wrote:
> > On Fri, Dec 13, 2024 at 07:10:40PM -0800, Luis Chamberlain wrote:
> > > - do {
> > > + for_each_bh(bh, head) {
> > > if (buffer_uptodate(bh))
> > > continue;
> > >
> > > @@ -2454,7 +2464,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> > > continue;
> > > }
> > > arr[nr++] = bh;
> > > - } while (i++, iblock++, (bh = bh->b_this_page) != head);
> > > + i++;
> > > + iblock++;
> > > + }
> >
> > This is non-equivalent. That 'continue' you can see would increment i
> > and iblock. Now it doesn't.
>
> Thanks, not sure how I missed that! With that fix in place I ran a full
> baseline against ext4 and all XFS profiles.
>
> For ext4 the new failures I see are just:
>
> * generic/044
> * generic/045
> * generic/046
Oh my, these all fail on vanilla v6.12-rc2 so it's not the code which is
at fault.
Luis
* Re: [RFC v2 02/11] fs/buffer: add a for_each_bh() for block_read_full_folio()
2024-12-16 20:05 ` Luis Chamberlain
@ 2024-12-16 21:46 ` Luis Chamberlain
0 siblings, 0 replies; 30+ messages in thread
From: Luis Chamberlain @ 2024-12-16 21:46 UTC (permalink / raw)
To: Matthew Wilcox
Cc: hch, hare, dave, david, djwong, john.g.garry, ritesh.list,
kbusch, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
On Mon, Dec 16, 2024 at 12:05:54PM -0800, Luis Chamberlain wrote:
> On Mon, Dec 16, 2024 at 10:56:44AM -0800, Luis Chamberlain wrote:
> > On Sat, Dec 14, 2024 at 04:02:53AM +0000, Matthew Wilcox wrote:
> > > On Fri, Dec 13, 2024 at 07:10:40PM -0800, Luis Chamberlain wrote:
> > > > - do {
> > > > + for_each_bh(bh, head) {
> > > > if (buffer_uptodate(bh))
> > > > continue;
> > > >
> > > > @@ -2454,7 +2464,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> > > > continue;
> > > > }
> > > > arr[nr++] = bh;
> > > > - } while (i++, iblock++, (bh = bh->b_this_page) != head);
> > > > + i++;
> > > > + iblock++;
> > > > + }
> > >
> > > This is non-equivalent. That 'continue' you can see would increment i
> > > and iblock. Now it doesn't.
> >
> > Thanks, not sure how I missed that! With that fix in place I ran a full
> > baseline against ext4 and all XFS profiles.
> >
> > For ext4 the new failures I see are just:
> >
> > * generic/044
> > * generic/045
> > * generic/046
>
> Oh my, these all fail on vanilla v6.12-rc2 so its not the code which is
> at fault.
I looked inside my bag of "tribal knowledge" and found that these are
known to fail because by default ext4 uses the mount -o data=ordered mode
in favor of performance instead of the mount -o data=journal mode.
And I confirm using mount -o data=journal fixes this for both the
v6.13-rc2 baseline and with these patches. In fstests you do that with:
MOUNT_OPTIONS='-o data=journal'
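For reference, a minimal fstests local.config sketch with that option set (the device and mount paths here are hypothetical examples, not taken from this thread):

```shell
# Hypothetical fstests local.config for reproducing the ext4 baseline
# with journaled data; TEST_DEV/SCRATCH_DEV are example devices.
cat > local.config <<'EOF'
export FSTYP=ext4
export TEST_DEV=/dev/vdb
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/vdc
export SCRATCH_MNT=/mnt/scratch
export MOUNT_OPTIONS='-o data=journal'
EOF

# Then run just the three tests in question:
./check generic/044 generic/045 generic/046
```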
And so these failures are part of the baseline; so far no
regressions have been found with ext4 with this patch series.
Luis
* Re: [RFC v2 02/11] fs/buffer: add a for_each_bh() for block_read_full_folio()
2024-12-16 18:56 ` Luis Chamberlain
2024-12-16 20:05 ` Luis Chamberlain
@ 2024-12-17 8:46 ` Luis Chamberlain
1 sibling, 0 replies; 30+ messages in thread
From: Luis Chamberlain @ 2024-12-17 8:46 UTC (permalink / raw)
To: Matthew Wilcox
Cc: hch, hare, dave, david, djwong, john.g.garry, ritesh.list,
kbusch, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
So all XFS failures were due to flaky tests, reproducible
on the baseline.
Luis
* Re: [RFC v2 01/11] fs/buffer: move async batch read code into a helper
2024-12-14 3:10 ` [RFC v2 01/11] fs/buffer: move async batch read code into a helper Luis Chamberlain
@ 2024-12-17 9:56 ` Hannes Reinecke
0 siblings, 0 replies; 30+ messages in thread
From: Hannes Reinecke @ 2024-12-17 9:56 UTC (permalink / raw)
To: Luis Chamberlain, willy, hch, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel
On 12/14/24 04:10, Luis Chamberlain wrote:
> Move the code from block_read_full_folio() which does a batch of async
> reads into a helper.
>
> No functional changes.
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> fs/buffer.c | 73 +++++++++++++++++++++++++++++++----------------------
> 1 file changed, 43 insertions(+), 30 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index cc8452f60251..580451337efa 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2350,6 +2350,48 @@ bool block_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
> }
> EXPORT_SYMBOL(block_is_partially_uptodate);
>
> +static void bh_read_batch_async(struct folio *folio,
> + int nr, struct buffer_head *arr[],
> + bool fully_mapped, bool no_reads,
> + bool any_get_block_error)
> +{
> + int i;
> + struct buffer_head *bh;
> +
> + if (fully_mapped)
> + folio_set_mappedtodisk(folio);
> +
> + if (no_reads) {
> + /*
> + * All buffers are uptodate, or get_block() returned an
> + * error when trying to map them all, so we can
> + * finish the read.
> + */
> + folio_end_read(folio, !any_get_block_error);
> + return;
> + }
> +
> + /* Stage one: lock the buffers */
Now you messed up documentation:
Originally this was 'stage two', so now we have two 'stage one'
comments.
Please use the original documentation convention and add a note
to the helper that it's contingent on the 'stage 1' in the
calling function.
> + for (i = 0; i < nr; i++) {
> + bh = arr[i];
> + lock_buffer(bh);
> + mark_buffer_async_read(bh);
> + }
> +
> + /*
> + * Stage 2: start the IO. Check for uptodateness
> + * inside the buffer lock in case another process reading
> + * the underlying blockdev brought it uptodate (the sct fix).
> + */
Same here; should be 'stage 3' to be consistent.
> + for (i = 0; i < nr; i++) {
> + bh = arr[i];
> + if (buffer_uptodate(bh))
> + end_buffer_async_read(bh, 1);
> + else
> + submit_bh(REQ_OP_READ, bh);
> + }
> +}
> +
> /*
> * Generic "read_folio" function for block devices that have the normal
> * get_block functionality. This is most of the block device filesystems.
> @@ -2414,37 +2456,8 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> arr[nr++] = bh;
> } while (i++, iblock++, (bh = bh->b_this_page) != head);
>
> - if (fully_mapped)
> - folio_set_mappedtodisk(folio);
> -
> - if (!nr) {
> - /*
> - * All buffers are uptodate or get_block() returned an
> - * error when trying to map them - we can finish the read.
> - */
> - folio_end_read(folio, !page_error);
> - return 0;
> - }
> -
> - /* Stage two: lock the buffers */
> - for (i = 0; i < nr; i++) {
> - bh = arr[i];
> - lock_buffer(bh);
> - mark_buffer_async_read(bh);
> - }
> + bh_read_batch_async(folio, nr, arr, fully_mapped, nr == 0, page_error);
>
> - /*
> - * Stage 3: start the IO. Check for uptodateness
> - * inside the buffer lock in case another process reading
> - * the underlying blockdev brought it uptodate (the sct fix).
> - */
> - for (i = 0; i < nr; i++) {
> - bh = arr[i];
> - if (buffer_uptodate(bh))
> - end_buffer_async_read(bh, 1);
> - else
> - submit_bh(REQ_OP_READ, bh);
> - }
> return 0;
> }
> EXPORT_SYMBOL(block_read_full_folio);
Otherwise looks good.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [RFC v2 02/11] fs/buffer: add a for_each_bh() for block_read_full_folio()
2024-12-14 3:10 ` [RFC v2 02/11] fs/buffer: add a for_each_bh() for block_read_full_folio() Luis Chamberlain
2024-12-14 4:02 ` Matthew Wilcox
@ 2024-12-17 9:57 ` Hannes Reinecke
1 sibling, 0 replies; 30+ messages in thread
From: Hannes Reinecke @ 2024-12-17 9:57 UTC (permalink / raw)
To: Luis Chamberlain, willy, hch, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel
On 12/14/24 04:10, Luis Chamberlain wrote:
> We want to be able to work through all buffer heads on a folio
> for an async read, but in the future we want to support the option
> to stop before we've processed all linked buffer heads. Adopt a
> for_each_bh(tmp, head) loop instead of a do { ... } while () loop
> to make the code easier to read and follow, and easier to expand
> in subsequent patches.
>
> This introduces no functional changes.
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> fs/buffer.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
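As a sketch of what such an iterator could look like, the macro below is a hypothetical userspace reconstruction over the circular b_this_page list, not necessarily the exact macro the patch adds:

```c
#include <stddef.h>

/*
 * Hypothetical reconstruction of for_each_bh(): buffer heads on a
 * folio are linked in a circular list via b_this_page, so iterate
 * from head until the list wraps back around to it.
 */
struct buffer_head {
	struct buffer_head *b_this_page;
};

#define for_each_bh(tmp, head)				\
	for (tmp = (head);				\
	     tmp;					\
	     tmp = (tmp->b_this_page != (head)) ?	\
			tmp->b_this_page : NULL)

/* Link n buffer heads into a circular list and return the head */
static struct buffer_head *make_ring(struct buffer_head *arr, int n)
{
	int i;

	for (i = 0; i < n; i++)
		arr[i].b_this_page = &arr[(i + 1) % n];
	return &arr[0];
}

/* Count the buffer heads on a folio using the iterator */
static int count_bhs(struct buffer_head *head)
{
	struct buffer_head *tmp;
	int n = 0;

	for_each_bh(tmp, head)
		n++;
	return n;
}
```

The advantage over a do { ... } while () loop is that a later patch can break out early, or resume from a saved position, without restructuring the loop body.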
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
* Re: [RFC v2 03/11] fs/buffer: add iteration support for block_read_full_folio()
2024-12-14 3:10 ` [RFC v2 03/11] fs/buffer: add iteration support " Luis Chamberlain
@ 2024-12-17 10:00 ` Hannes Reinecke
0 siblings, 0 replies; 30+ messages in thread
From: Hannes Reinecke @ 2024-12-17 10:00 UTC (permalink / raw)
To: Luis Chamberlain, willy, hch, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel
On 12/14/24 04:10, Luis Chamberlain wrote:
> Provide a helper to iterate over the buffer heads on a folio. We do
> this as a preliminary step to make the subsequent changes easier to
> read. Right now we use an on-stack array of size MAX_BUF_PER_PAGE to
> loop over all buffer heads in a folio, however on CPUs where the
> system page size is much larger, such as Hexagon with its 256 KiB
> page size support, the kernel can end up spewing stack growth
> warnings.
>
> To be able to break this down into smaller array chunks add support for
> processing smaller array chunks of buffer heads at a time. The used
> array size is not changed yet, that will be done in a subsequent patch,
> this just adds the iterator support and logic.
>
> While at it, clarify the booleans used by bh_read_batch_async() and
> how they are only meaningful once we've processed all buffer heads
> of a folio, that is, when we're on the last buffer head in a folio:
>
> * bh_folio_reads
> * unmapped
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> fs/buffer.c | 130 +++++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 94 insertions(+), 36 deletions(-)
>
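The batching idea described above can be sketched in userspace as follows; the chunk size of 8 and the helper name are illustrative, matching the series' description rather than its exact code:

```c
/*
 * Userspace sketch of chopping a folio's buffer heads into small
 * fixed-size batches rather than one MAX_BUF_PER_PAGE on-stack
 * array.  With a 256 KiB page and 512-byte blocks that array would
 * hold 512 entries; batches of 8 keep the stack footprint constant
 * regardless of base page size.
 */
#define MAX_BUF_PER_BATCH 8

/* Record the size of each batch; returns the number of batches */
static int process_in_batches(int total, int *batch_sizes)
{
	int done = 0, nbatches = 0;

	while (done < total) {
		int nr = total - done;

		if (nr > MAX_BUF_PER_BATCH)
			nr = MAX_BUF_PER_BATCH;
		/*
		 * A real implementation would invoke the async batch
		 * read helper on these nr buffer heads here.
		 */
		batch_sizes[nbatches++] = nr;
		done += nr;
	}
	return nbatches;
}
```

A 512-buffer folio is thus processed as 64 batches of 8, and a 10-buffer folio as one batch of 8 followed by one of 2.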
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
* Re: [RFC v2 04/11] fs/buffer: reduce stack usage on bh_read_iter()
2024-12-14 3:10 ` [RFC v2 04/11] fs/buffer: reduce stack usage on bh_read_iter() Luis Chamberlain
@ 2024-12-17 10:04 ` Hannes Reinecke
0 siblings, 0 replies; 30+ messages in thread
From: Hannes Reinecke @ 2024-12-17 10:04 UTC (permalink / raw)
To: Luis Chamberlain, willy, hch, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel
On 12/14/24 04:10, Luis Chamberlain wrote:
> Now that we can read buffer heads from a folio asynchronously in
> chunks, we can chop up bh_read_iter() with a smaller array size.
> Use an array of 8 to avoid stack growth warnings on systems with
> huge base page sizes.
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> fs/buffer.c | 18 ++++++++++++------
> 1 file changed, 12 insertions(+), 6 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
* Re: [RFC v2 09/11] block/bdev: lift block size restrictions and use common definition
2024-12-14 3:10 ` [RFC v2 09/11] block/bdev: lift block size restrictions and use common definition Luis Chamberlain
2024-12-16 8:55 ` John Garry
@ 2024-12-17 10:05 ` Hannes Reinecke
2024-12-17 21:00 ` Bart Van Assche
2 siblings, 0 replies; 30+ messages in thread
From: Hannes Reinecke @ 2024-12-17 10:05 UTC (permalink / raw)
To: Luis Chamberlain, willy, hch, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel
On 12/14/24 04:10, Luis Chamberlain wrote:
> We can now support blocksizes larger than PAGE_SIZE, so lift
> the restriction up to the max supported page cache order and
> just bake this into a common helper used by the block layer.
>
> We bound ourselves to 64k as a sensible limit. The hard limit,
> however, is 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER).
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> block/bdev.c | 5 ++---
> include/linux/blkdev.h | 11 ++++++++++-
> 2 files changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/block/bdev.c b/block/bdev.c
> index 167d82b46781..b57dc4bff81b 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -157,8 +157,7 @@ int set_blocksize(struct file *file, int size)
> struct inode *inode = file->f_mapping->host;
> struct block_device *bdev = I_BDEV(inode);
>
> - /* Size must be a power of two, and between 512 and PAGE_SIZE */
> - if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
> + if (blk_validate_block_size(size))
> return -EINVAL;
>
> /* Size cannot be smaller than the size supported by the device */
> @@ -185,7 +184,7 @@ int sb_set_blocksize(struct super_block *sb, int size)
> if (set_blocksize(sb->s_bdev_file, size))
> return 0;
> /* If we get here, we know size is power of two
> - * and it's value is between 512 and PAGE_SIZE */
> + * and its value is larger than 512 */
> sb->s_blocksize = size;
> sb->s_blocksize_bits = blksize_bits(size);
> return sb->s_blocksize;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 08a727b40816..a7303a55ed2a 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -269,10 +269,19 @@ static inline dev_t disk_devt(struct gendisk *disk)
> return MKDEV(disk->major, disk->first_minor);
> }
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/*
> + * The hard limit is (1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER).
As this is the hard limit, one wonders why we don't use it ...
So please add a comment explaining why we restrict it to 64k.
> + */
> +#define BLK_MAX_BLOCK_SIZE (SZ_64K)
> +#else
> +#define BLK_MAX_BLOCK_SIZE (PAGE_SIZE)
> +#endif
> +
> /* blk_validate_limits() validates bsize, so drivers don't usually need to */
> static inline int blk_validate_block_size(unsigned long bsize)
> {
> - if (bsize < 512 || bsize > PAGE_SIZE || !is_power_of_2(bsize))
> + if (bsize < 512 || bsize > BLK_MAX_BLOCK_SIZE || !is_power_of_2(bsize))
> return -EINVAL;
>
> return 0;
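For reference, the lifted validation behaves as in this userspace model; the constant values are assumptions mirroring the patch, with CONFIG_TRANSPARENT_HUGEPAGE assumed enabled so the cap is 64 KiB:

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Userspace model of blk_validate_block_size() after the patch:
 * the block size must be a power of two between 512 bytes and
 * BLK_MAX_BLOCK_SIZE (64 KiB with large folio support).
 */
#define SZ_64K			(64 * 1024)
#define BLK_MAX_BLOCK_SIZE	SZ_64K

static bool is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

static int blk_validate_block_size(unsigned long bsize)
{
	if (bsize < 512 || bsize > BLK_MAX_BLOCK_SIZE ||
	    !is_power_of_2(bsize))
		return -EINVAL;
	return 0;
}
```

With this, set_blocksize() accepts 8k, 16k, 32k and 64k block sizes that the old `size > PAGE_SIZE` check rejected on 4k-page systems.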
Otherwise:
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
* Re: [RFC v2 09/11] block/bdev: lift block size restrictions and use common definition
2024-12-16 8:55 ` John Garry
2024-12-16 9:19 ` Ming Lei
@ 2024-12-17 20:51 ` John Garry
1 sibling, 0 replies; 30+ messages in thread
From: John Garry @ 2024-12-17 20:51 UTC (permalink / raw)
To: Luis Chamberlain, willy, hch, hare, dave, david, djwong
Cc: ritesh.list, kbusch, linux-fsdevel, linux-xfs, linux-mm,
linux-block, gost.dev, p.raghav, da.gomez, kernel
On 16/12/2024 08:55, John Garry wrote:
> On 14/12/2024 03:10, Luis Chamberlain wrote:
>> index 167d82b46781..b57dc4bff81b 100644
>> --- a/block/bdev.c
>> +++ b/block/bdev.c
>> @@ -157,8 +157,7 @@ int set_blocksize(struct file *file, int size)
>> struct inode *inode = file->f_mapping->host;
>> struct block_device *bdev = I_BDEV(inode);
>> - /* Size must be a power of two, and between 512 and PAGE_SIZE */
>> - if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
>> + if (blk_validate_block_size(size))
>> return -EINVAL;
>
> I suppose that this can be sent as a separate patch to be merged now.
And if you want to send this as a separate prep change, feel free to add:
Reviewed-by: John Garry <john.g.garry@oracle.com>
* Re: [RFC v2 09/11] block/bdev: lift block size restrictions and use common definition
2024-12-14 3:10 ` [RFC v2 09/11] block/bdev: lift block size restrictions and use common definition Luis Chamberlain
2024-12-16 8:55 ` John Garry
2024-12-17 10:05 ` Hannes Reinecke
@ 2024-12-17 21:00 ` Bart Van Assche
2 siblings, 0 replies; 30+ messages in thread
From: Bart Van Assche @ 2024-12-17 21:00 UTC (permalink / raw)
To: Luis Chamberlain, willy, hch, hare, dave, david, djwong
Cc: john.g.garry, ritesh.list, kbusch, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel
On 12/13/24 7:10 PM, Luis Chamberlain wrote:
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/*
> + * The hard limit is (1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER).
> + */
A closing parenthesis is missing from the above comment.
> +#define BLK_MAX_BLOCK_SIZE (SZ_64K)
> +#else
> +#define BLK_MAX_BLOCK_SIZE (PAGE_SIZE)
> +#endif
Parentheses are never necessary around constants, since the definition of
the constant should itself include parentheses if necessary.
Thanks,
Bart.
Thread overview: 30+ messages
2024-12-14 3:10 [RFC v2 00/11] enable bs > ps for block devices Luis Chamberlain
2024-12-14 3:10 ` [RFC v2 01/11] fs/buffer: move async batch read code into a helper Luis Chamberlain
2024-12-17 9:56 ` Hannes Reinecke
2024-12-14 3:10 ` [RFC v2 02/11] fs/buffer: add a for_each_bh() for block_read_full_folio() Luis Chamberlain
2024-12-14 4:02 ` Matthew Wilcox
2024-12-16 18:56 ` Luis Chamberlain
2024-12-16 20:05 ` Luis Chamberlain
2024-12-16 21:46 ` Luis Chamberlain
2024-12-17 8:46 ` Luis Chamberlain
2024-12-17 9:57 ` Hannes Reinecke
2024-12-14 3:10 ` [RFC v2 03/11] fs/buffer: add iteration support " Luis Chamberlain
2024-12-17 10:00 ` Hannes Reinecke
2024-12-14 3:10 ` [RFC v2 04/11] fs/buffer: reduce stack usage on bh_read_iter() Luis Chamberlain
2024-12-17 10:04 ` Hannes Reinecke
2024-12-14 3:10 ` [RFC v2 05/11] fs/mpage: use blocks_per_folio instead of blocks_per_page Luis Chamberlain
2024-12-14 4:46 ` Matthew Wilcox
2024-12-14 3:10 ` [RFC v2 06/11] fs/mpage: avoid negative shift for large blocksize Luis Chamberlain
2024-12-14 3:10 ` [RFC v2 07/11] fs/buffer fs/mpage: remove large folio restriction Luis Chamberlain
2024-12-14 3:10 ` [RFC v2 08/11] block/bdev: enable large folio support for large logical block sizes Luis Chamberlain
2024-12-14 3:10 ` [RFC v2 09/11] block/bdev: lift block size restrictions and use common definition Luis Chamberlain
2024-12-16 8:55 ` John Garry
2024-12-16 9:19 ` Ming Lei
2024-12-16 10:13 ` John Garry
2024-12-16 10:23 ` Ming Lei
2024-12-17 20:51 ` John Garry
2024-12-17 10:05 ` Hannes Reinecke
2024-12-17 21:00 ` Bart Van Assche
2024-12-14 3:10 ` [RFC v2 10/11] nvme: remove superfluous block size check Luis Chamberlain
2024-12-15 0:39 ` Matthew Wilcox
2024-12-14 3:10 ` [RFC v2 11/11] bdev: use bdev_io_min() for statx block size Luis Chamberlain