linux-mm.kvack.org archive mirror
* [PATCH v12 00/10] enable bs > ps in XFS
@ 2024-08-15  9:08 Pankaj Raghav (Samsung)
  2024-08-15  9:08 ` [PATCH v12 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
                   ` (12 more replies)
  0 siblings, 13 replies; 28+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-15  9:08 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts

From: Pankaj Raghav <p.raghav@samsung.com>

This is the 12th version of the series that enables experimental support
for block size > page size (Large Block Size) in XFS. Please consider this
for inclusion in 6.12.
The series is based on fs-next, as I was not able to run tests on
the latest linux-next.

The context and motivation can be seen in the cover letter of the RFC
v1 [0]. We also recorded a talk about this effort at LPC [1] for anyone
who would like more context.

A lot of emphasis has been put on testing using kdevops, starting with an XFS
baseline [3]. The testing has been split into regression and progression.

Regression testing:
In regression testing, we ran the whole test suite to check for regressions on
existing profiles due to the page cache changes.

I also ran the split_huge_page_test selftest on an XFS filesystem to check
that huge page splits in min order chunks are done correctly.

No regressions were found with these patches added on top.

Progression testing:
For progression testing, we tested 8k, 16k, 32k and 64k block sizes. To
compare with existing support, an ARM VM with a 64k base page size (without
our patches) was used as a reference to distinguish actual failures caused
by LBS support on a 4k base page size system.

No new failures were found with the LBS support.

We've done some preliminary performance tests with fio on XFS with a 4k
block size against pmem and NVMe, using buffered IO and Direct IO,
comparing a vanilla kernel against one with these patches applied, and
detected no regressions.

We ran sysbench on postgres and mysql for several hours on LBS XFS
without any issues.

We also wrote an eBPF tool called blkalgn [5] to verify that IO sent to the
device is aligned and at least the filesystem block size in length.

For those who want this in a git tree, it is available under the
large-block-minorder-for-next-v12 tag in the kdevops linux tree [6].

[0] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
[1] https://www.youtube.com/watch?v=ar72r5Xf7x4
[2] https://lkml.kernel.org/r/20240501153120.4094530-1-willy@infradead.org
[3] https://github.com/linux-kdevops/kdevops/blob/master/docs/xfs-bugs.md
489 non-critical issues and 55 critical issues. We've determined and reported
that the 55 critical issues all fall into 5 common XFS asserts or hung
tasks and 2 memory management asserts.
[4] https://github.com/linux-kdevops/fstests/tree/lbs-fixes
[5] https://github.com/iovisor/bcc/pull/4813
[6] https://github.com/linux-kdevops/linux/
[7] https://lore.kernel.org/linux-kernel/Zl20pc-YlIWCSy6Z@casper.infradead.org/#t

Changes since v11:
- Minor string alignment fixup.
- Collected RVB from Dave.

Dave Chinner (1):
  xfs: use kvmalloc for xattr buffers

Luis Chamberlain (1):
  mm: split a folio in minimum folio order chunks

Matthew Wilcox (Oracle) (1):
  fs: Allow fine-grained control of folio sizes

Pankaj Raghav (7):
  filemap: allocate mapping_min_order folios in the page cache
  readahead: allocate folios with mapping_min_order in readahead
  filemap: cap PTE range to be created to allowed zero fill in
    folio_map_range()
  iomap: fix iomap_dio_zero() for fs bs > system page size
  xfs: expose block size in stat
  xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  xfs: enable block size larger than page size support

 fs/iomap/buffered-io.c        |   4 +-
 fs/iomap/direct-io.c          |  45 +++++++++++--
 fs/xfs/libxfs/xfs_attr_leaf.c |  15 ++---
 fs/xfs/libxfs/xfs_ialloc.c    |   5 ++
 fs/xfs/libxfs/xfs_shared.h    |   3 +
 fs/xfs/xfs_icache.c           |   6 +-
 fs/xfs/xfs_iops.c             |   2 +-
 fs/xfs/xfs_mount.c            |   8 ++-
 fs/xfs/xfs_super.c            |  28 +++++---
 include/linux/huge_mm.h       |  14 ++--
 include/linux/pagemap.h       | 122 ++++++++++++++++++++++++++++++----
 mm/filemap.c                  |  36 ++++++----
 mm/huge_memory.c              |  59 ++++++++++++++--
 mm/readahead.c                |  83 +++++++++++++++++------
 14 files changed, 345 insertions(+), 85 deletions(-)


base-commit: bb62fbd2b0e31b2ed5dccf1dc4489460137fdf5c
-- 
2.44.1

* [PATCH v12 01/10] fs: Allow fine-grained control of folio sizes
  2024-08-15  9:08 [PATCH v12 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
@ 2024-08-15  9:08 ` Pankaj Raghav (Samsung)
  2024-08-15  9:08 ` [PATCH v12 02/10] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-15  9:08 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

We need filesystems to be able to communicate acceptable folio sizes
to the pagecache for a variety of uses (e.g. large block sizes).
Support a range of folio sizes between order-0 and order-31.
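
As an illustrative sketch (the caller and order values here are
hypothetical; the helpers are the ones added below), a filesystem that
wants a 16k minimum folio size on a 4k page size system could set the
range from its inode constructor:

	/* allow folios from order 2 (16k) up to the pagecache maximum */
	mapping_set_folio_order_range(inode->i_mapping, 2,
				      MAX_PAGECACHE_ORDER);
	/* or equivalently, when only a minimum is needed */
	mapping_set_folio_min_order(inode->i_mapping, 2);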

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 include/linux/pagemap.h | 89 ++++++++++++++++++++++++++++++++++-------
 mm/filemap.c            |  6 +--
 mm/readahead.c          |  4 +-
 3 files changed, 79 insertions(+), 20 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d9c7edb6422bd..75bbe88b89904 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -204,14 +204,20 @@ enum mapping_flags {
 	AS_EXITING	= 4, 	/* final truncate in progress */
 	/* writeback related tags are not used */
 	AS_NO_WRITEBACK_TAGS = 5,
-	AS_LARGE_FOLIO_SUPPORT = 6,
-	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
-	AS_STABLE_WRITES,	/* must wait for writeback before modifying
+	AS_RELEASE_ALWAYS = 6,	/* Call ->release_folio(), even if no private data */
+	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
 				   folio contents */
-	AS_INACCESSIBLE,	/* Do not attempt direct R/W access to the mapping,
-				   including to move the mapping */
+	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
+	/* Bits 16-25 are used for FOLIO_ORDER */
+	AS_FOLIO_ORDER_BITS = 5,
+	AS_FOLIO_ORDER_MIN = 16,
+	AS_FOLIO_ORDER_MAX = AS_FOLIO_ORDER_MIN + AS_FOLIO_ORDER_BITS,
 };
 
+#define AS_FOLIO_ORDER_MASK     ((1u << AS_FOLIO_ORDER_BITS) - 1)
+#define AS_FOLIO_ORDER_MIN_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MIN)
+#define AS_FOLIO_ORDER_MAX_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MAX)
+
 /**
  * mapping_set_error - record a writeback error in the address_space
  * @mapping: the mapping in which an error should be set
@@ -367,9 +373,51 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 #define MAX_XAS_ORDER		(XA_CHUNK_SHIFT * 2 - 1)
 #define MAX_PAGECACHE_ORDER	min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
 
+/*
+ * mapping_set_folio_order_range() - Set the orders supported by a file.
+ * @mapping: The address space of the file.
+ * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
+ * @max: Maximum folio order (between @min-MAX_PAGECACHE_ORDER inclusive).
+ *
+ * The filesystem should call this function in its inode constructor to
+ * indicate which base size (min) and maximum size (max) of folio the VFS
+ * can use to cache the contents of the file.  This should only be used
+ * if the filesystem needs special handling of folio sizes (ie there is
+ * something the core cannot know).
+ * Do not tune it based on, eg, i_size.
+ *
+ * Context: This should not be called while the inode is active as it
+ * is non-atomic.
+ */
+static inline void mapping_set_folio_order_range(struct address_space *mapping,
+						 unsigned int min,
+						 unsigned int max)
+{
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return;
+
+	if (min > MAX_PAGECACHE_ORDER)
+		min = MAX_PAGECACHE_ORDER;
+
+	if (max > MAX_PAGECACHE_ORDER)
+		max = MAX_PAGECACHE_ORDER;
+
+	if (max < min)
+		max = min;
+
+	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
+		(min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);
+}
+
+static inline void mapping_set_folio_min_order(struct address_space *mapping,
+					       unsigned int min)
+{
+	mapping_set_folio_order_range(mapping, min, MAX_PAGECACHE_ORDER);
+}
+
 /**
  * mapping_set_large_folios() - Indicate the file supports large folios.
- * @mapping: The file.
+ * @mapping: The address space of the file.
  *
  * The filesystem should call this function in its inode constructor to
  * indicate that the VFS can use large folios to cache the contents of
@@ -380,7 +428,23 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
  */
 static inline void mapping_set_large_folios(struct address_space *mapping)
 {
-	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	mapping_set_folio_order_range(mapping, 0, MAX_PAGECACHE_ORDER);
+}
+
+static inline unsigned int
+mapping_max_folio_order(const struct address_space *mapping)
+{
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return 0;
+	return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
+}
+
+static inline unsigned int
+mapping_min_folio_order(const struct address_space *mapping)
+{
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return 0;
+	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
 }
 
 /*
@@ -389,20 +453,17 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
  */
 static inline bool mapping_large_folio_support(struct address_space *mapping)
 {
-	/* AS_LARGE_FOLIO_SUPPORT is only reasonable for pagecache folios */
+	/* AS_FOLIO_ORDER is only reasonable for pagecache folios */
 	VM_WARN_ONCE((unsigned long)mapping & PAGE_MAPPING_ANON,
 			"Anonymous mapping always supports large folio");
 
-	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	return mapping_max_folio_order(mapping) > 0;
 }
 
 /* Return the maximum folio size for this pagecache mapping, in bytes. */
-static inline size_t mapping_max_folio_size(struct address_space *mapping)
+static inline size_t mapping_max_folio_size(const struct address_space *mapping)
 {
-	if (mapping_large_folio_support(mapping))
-		return PAGE_SIZE << MAX_PAGECACHE_ORDER;
-	return PAGE_SIZE;
+	return PAGE_SIZE << mapping_max_folio_order(mapping);
 }
 
 static inline int filemap_nr_thps(struct address_space *mapping)
diff --git a/mm/filemap.c b/mm/filemap.c
index 29fec1fccd0a6..6c4489ada3ecc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1933,10 +1933,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
 			fgp_flags |= FGP_LOCK;
 
-		if (!mapping_large_folio_support(mapping))
-			order = 0;
-		if (order > MAX_PAGECACHE_ORDER)
-			order = MAX_PAGECACHE_ORDER;
+		if (order > mapping_max_folio_order(mapping))
+			order = mapping_max_folio_order(mapping);
 		/* If we're not aligned, allocate a smaller folio */
 		if (index & ((1UL << order) - 1))
 			order = __ffs(index);
diff --git a/mm/readahead.c b/mm/readahead.c
index 517c0be7ce665..3e5239e9e1777 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -449,10 +449,10 @@ void page_cache_ra_order(struct readahead_control *ractl,
 
 	limit = min(limit, index + ra->size - 1);
 
-	if (new_order < MAX_PAGECACHE_ORDER)
+	if (new_order < mapping_max_folio_order(mapping))
 		new_order += 2;
 
-	new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
+	new_order = min(mapping_max_folio_order(mapping), new_order);
 	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
 
 	/* See comment in page_cache_ra_unbounded() */
-- 
2.44.1

* [PATCH v12 02/10] filemap: allocate mapping_min_order folios in the page cache
  2024-08-15  9:08 [PATCH v12 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
  2024-08-15  9:08 ` [PATCH v12 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-08-15  9:08 ` Pankaj Raghav (Samsung)
  2024-08-15  9:08 ` [PATCH v12 03/10] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-15  9:08 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts

From: Pankaj Raghav <p.raghav@samsung.com>

filemap_create_folio() and do_read_cache_folio() were always allocating
folios of order 0. __filemap_get_folio() was trying to allocate higher
order folios when fgp_flags had a higher order hint set, but it would fall
back to an order-0 folio if the higher order memory allocation failed.

Supporting mapping_min_order implies that we guarantee each folio in the
page cache has at least an order of mapping_min_order. When adding new
folios to the page cache we must also ensure the index used is aligned to
the mapping_min_order as the page cache requires the index to be aligned
to the order of the folio.
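
For example (values hypothetical), with mapping_min_order = 2 on a 4k page
size system, min_nrpages is 4 and the new mapping_align_index() helper
rounds the index down to that boundary:

	/* min_order = 2 => mapping_min_folio_nrpages() == 4 */
	index = mapping_align_index(mapping, 7);	/* returns 4 */
	index = mapping_align_index(mapping, 8);	/* returns 8 */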

Co-developed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 20 ++++++++++++++++++++
 mm/filemap.c            | 24 ++++++++++++++++--------
 2 files changed, 36 insertions(+), 8 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 75bbe88b89904..3a876d6801a90 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -447,6 +447,26 @@ mapping_min_folio_order(const struct address_space *mapping)
 	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
 }
 
+static inline unsigned long
+mapping_min_folio_nrpages(struct address_space *mapping)
+{
+	return 1UL << mapping_min_folio_order(mapping);
+}
+
+/**
+ * mapping_align_index() - Align index for this mapping.
+ * @mapping: The address_space.
+ *
+ * The index of a folio must be naturally aligned.  If you are adding a
+ * new folio to the page cache and need to know what index to give it,
+ * call this function.
+ */
+static inline pgoff_t mapping_align_index(struct address_space *mapping,
+					  pgoff_t index)
+{
+	return round_down(index, mapping_min_folio_nrpages(mapping));
+}
+
 /*
  * Large folio support currently depends on THP.  These dependencies are
  * being worked on but are not yet fixed.
diff --git a/mm/filemap.c b/mm/filemap.c
index 6c4489ada3ecc..623c0f988da79 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -859,6 +859,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
+	VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
+			folio);
 	mapping_set_update(&xas, mapping);
 
 	VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
@@ -1919,8 +1921,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
-		unsigned order = FGF_GET_ORDER(fgp_flags);
+		unsigned int min_order = mapping_min_folio_order(mapping);
+		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
 		int err;
+		index = mapping_align_index(mapping, index);
 
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
 			gfp |= __GFP_WRITE;
@@ -1943,7 +1947,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp_t alloc_gfp = gfp;
 
 			err = -ENOMEM;
-			if (order > 0)
+			if (order > min_order)
 				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
 			folio = filemap_alloc_folio(alloc_gfp, order);
 			if (!folio)
@@ -1958,7 +1962,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 				break;
 			folio_put(folio);
 			folio = NULL;
-		} while (order-- > 0);
+		} while (order-- > min_order);
 
 		if (err == -EEXIST)
 			goto repeat;
@@ -2447,13 +2451,15 @@ static int filemap_update_page(struct kiocb *iocb,
 }
 
 static int filemap_create_folio(struct file *file,
-		struct address_space *mapping, pgoff_t index,
+		struct address_space *mapping, loff_t pos,
 		struct folio_batch *fbatch)
 {
 	struct folio *folio;
 	int error;
+	unsigned int min_order = mapping_min_folio_order(mapping);
+	pgoff_t index;
 
-	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
 	if (!folio)
 		return -ENOMEM;
 
@@ -2471,6 +2477,7 @@ static int filemap_create_folio(struct file *file,
 	 * well to keep locking rules simple.
 	 */
 	filemap_invalidate_lock_shared(mapping);
+	index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
 	error = filemap_add_folio(mapping, folio, index,
 			mapping_gfp_constraint(mapping, GFP_KERNEL));
 	if (error == -EEXIST)
@@ -2531,8 +2538,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 	if (!folio_batch_count(fbatch)) {
 		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
 			return -EAGAIN;
-		err = filemap_create_folio(filp, mapping,
-				iocb->ki_pos >> PAGE_SHIFT, fbatch);
+		err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
 		if (err == AOP_TRUNCATED_PAGE)
 			goto retry;
 		return err;
@@ -3748,9 +3754,11 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
 repeat:
 	folio = filemap_get_folio(mapping, index);
 	if (IS_ERR(folio)) {
-		folio = filemap_alloc_folio(gfp, 0);
+		folio = filemap_alloc_folio(gfp,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			return ERR_PTR(-ENOMEM);
+		index = mapping_align_index(mapping, index);
 		err = filemap_add_folio(mapping, folio, index, gfp);
 		if (unlikely(err)) {
 			folio_put(folio);
-- 
2.44.1

* [PATCH v12 03/10] readahead: allocate folios with mapping_min_order in readahead
  2024-08-15  9:08 [PATCH v12 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
  2024-08-15  9:08 ` [PATCH v12 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
  2024-08-15  9:08 ` [PATCH v12 02/10] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
@ 2024-08-15  9:08 ` Pankaj Raghav (Samsung)
  2024-08-15  9:08 ` [PATCH v12 04/10] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-15  9:08 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts

From: Pankaj Raghav <p.raghav@samsung.com>

page_cache_ra_unbounded() was allocating single pages (order-0 folios) if
no folio was found at an index. Allocate mapping_min_order folios instead,
as we need to guarantee the minimum order if it is set.

page_cache_ra_order() tries to allocate folios with a higher order if the
index aligns with that order. Modify it so that the order does not go
below the mapping_min_order requirement of the page cache. This function
will do the right thing even if the new_order passed is less than the
mapping_min_order. When adding new folios to the page cache we must also
ensure the index used is aligned to the mapping_min_order, as the page
cache requires the index to be aligned to the order of the folio.

readahead_expand() is called from readahead aops to extend the range of
the readahead, so this function can assume ractl->_index is already
aligned to min_order.
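
As a sketch of the resulting order clamping in page_cache_ra_order()
(numbers hypothetical): with min_order = 2 and ra->size = 32 pages, an
incoming new_order of 0 ends up as order 2:

	if (new_order < mapping_max_folio_order(mapping))
		new_order += 2;			/* usual readahead ramp-up */
	new_order = min(mapping_max_folio_order(mapping), new_order);
	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
	new_order = max(new_order, min_order);	/* never below min_order */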

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Co-developed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
---
 mm/readahead.c | 79 ++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 61 insertions(+), 18 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 3e5239e9e1777..2078c42777a62 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -206,9 +206,10 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 		unsigned long nr_to_read, unsigned long lookahead_size)
 {
 	struct address_space *mapping = ractl->mapping;
-	unsigned long index = readahead_index(ractl);
+	unsigned long ra_folio_index, index = readahead_index(ractl);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	unsigned long i;
+	unsigned long mark, i = 0;
+	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -223,10 +224,24 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	unsigned int nofs = memalloc_nofs_save();
 
 	filemap_invalidate_lock_shared(mapping);
+	index = mapping_align_index(mapping, index);
+
+	/*
+	 * As iterator `i` is aligned to min_nrpages, round_up the
+	 * difference between nr_to_read and lookahead_size to mark the
+	 * index that only has lookahead or "async_region" to set the
+	 * readahead flag.
+	 */
+	ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size,
+				  min_nrpages);
+	mark = ra_folio_index - index;
+	nr_to_read += readahead_index(ractl) - index;
+	ractl->_index = index;
+
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	for (i = 0; i < nr_to_read; i++) {
+	while (i < nr_to_read) {
 		struct folio *folio = xa_load(&mapping->i_pages, index + i);
 		int ret;
 
@@ -240,12 +255,13 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			 * not worth getting one just for that.
 			 */
 			read_pages(ractl);
-			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			ractl->_index += min_nrpages;
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			break;
 
@@ -255,14 +271,15 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			if (ret == -ENOMEM)
 				break;
 			read_pages(ractl);
-			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			ractl->_index += min_nrpages;
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
-		if (i == nr_to_read - lookahead_size)
+		if (i == mark)
 			folio_set_readahead(folio);
 		ractl->_workingset |= folio_test_workingset(folio);
-		ractl->_nr_pages++;
+		ractl->_nr_pages += min_nrpages;
+		i += min_nrpages;
 	}
 
 	/*
@@ -438,13 +455,19 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	struct address_space *mapping = ractl->mapping;
 	pgoff_t start = readahead_index(ractl);
 	pgoff_t index = start;
+	unsigned int min_order = mapping_min_folio_order(mapping);
 	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
 	pgoff_t mark = index + ra->size - ra->async_size;
 	unsigned int nofs;
 	int err = 0;
 	gfp_t gfp = readahead_gfp_mask(mapping);
+	unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
 
-	if (!mapping_large_folio_support(mapping) || ra->size < 4)
+	/*
+	 * Fallback when size < min_nrpages as each folio should be
+	 * at least min_nrpages anyway.
+	 */
+	if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
 		goto fallback;
 
 	limit = min(limit, index + ra->size - 1);
@@ -454,10 +477,19 @@ void page_cache_ra_order(struct readahead_control *ractl,
 
 	new_order = min(mapping_max_folio_order(mapping), new_order);
 	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
+	new_order = max(new_order, min_order);
 
 	/* See comment in page_cache_ra_unbounded() */
 	nofs = memalloc_nofs_save();
 	filemap_invalidate_lock_shared(mapping);
+	/*
+	 * If the new_order is greater than min_order and index is
+	 * already aligned to new_order, then this will be noop as index
+	 * aligned to new_order should also be aligned to min_order.
+	 */
+	ractl->_index = mapping_align_index(mapping, index);
+	index = readahead_index(ractl);
+
 	while (index <= limit) {
 		unsigned int order = new_order;
 
@@ -465,7 +497,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		if (index & ((1UL << order) - 1))
 			order = __ffs(index);
 		/* Don't allocate pages past EOF */
-		while (index + (1UL << order) - 1 > limit)
+		while (order > min_order && index + (1UL << order) - 1 > limit)
 			order--;
 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
 		if (err)
@@ -703,8 +735,15 @@ void readahead_expand(struct readahead_control *ractl,
 	struct file_ra_state *ra = ractl->ra;
 	pgoff_t new_index, new_nr_pages;
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+	unsigned long min_nrpages = mapping_min_folio_nrpages(mapping);
+	unsigned int min_order = mapping_min_folio_order(mapping);
 
 	new_index = new_start / PAGE_SIZE;
+	/*
+	 * Readahead code should have aligned the ractl->_index to
+	 * min_nrpages before calling readahead aops.
+	 */
+	VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
 
 	/* Expand the leading edge downwards */
 	while (ractl->_index > new_index) {
@@ -714,9 +753,11 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, min_order);
 		if (!folio)
 			return;
+
+		index = mapping_align_index(mapping, index);
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
 			folio_put(folio);
 			return;
@@ -726,7 +767,7 @@ void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset = true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
-		ractl->_nr_pages++;
+		ractl->_nr_pages += min_nrpages;
 		ractl->_index = folio->index;
 	}
 
@@ -741,9 +782,11 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, min_order);
 		if (!folio)
 			return;
+
+		index = mapping_align_index(mapping, index);
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
 			folio_put(folio);
 			return;
@@ -753,10 +796,10 @@ void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset = true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
-		ractl->_nr_pages++;
+		ractl->_nr_pages += min_nrpages;
 		if (ra) {
-			ra->size++;
-			ra->async_size++;
+			ra->size += min_nrpages;
+			ra->async_size += min_nrpages;
 		}
 	}
 }
-- 
2.44.1

* [PATCH v12 04/10] mm: split a folio in minimum folio order chunks
  2024-08-15  9:08 [PATCH v12 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (2 preceding siblings ...)
  2024-08-15  9:08 ` [PATCH v12 03/10] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
@ 2024-08-15  9:08 ` Pankaj Raghav (Samsung)
  2024-08-15  9:08 ` [PATCH v12 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-15  9:08 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts

From: Luis Chamberlain <mcgrof@kernel.org>

split_folio() and split_folio_to_list() assume order 0. To support min
order for non-anonymous folios, we must expand these to check the folio
mapping order and use that.

Set new_order to be at least minimum folio order if it is set in
split_huge_page_to_list() so that we can maintain minimum folio order
requirement in the page cache.

Update the debugfs write files used for testing to ensure the order
is respected as well. We simply enforce the min order when a file
mapping is used.
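
A sketch of the resulting behaviour for a pagecache folio (hypothetical
min_order of 2, i.e. 16k folios on 4k pages):

	err = split_folio(folio);	/* splits down to order 2, not 0 */
	err = split_huge_page_to_list_to_order(&folio->page, NULL, 0);
					/* now warns and returns -EINVAL */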

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/huge_mm.h | 14 +++++++---
 mm/huge_memory.c        | 59 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e25d9ebfdf89a..7c50aeed05228 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -96,6 +96,8 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
 	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
 
+#define split_folio(f) split_folio_to_list(f, NULL)
+
 #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 #define HPAGE_PMD_SHIFT PMD_SHIFT
 #define HPAGE_PUD_SHIFT PUD_SHIFT
@@ -317,9 +319,10 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 bool can_split_folio(struct folio *folio, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
+int split_folio_to_list(struct folio *folio, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
-	return split_huge_page_to_list_to_order(page, NULL, 0);
+	return split_folio(page_folio(page));
 }
 void deferred_split_folio(struct folio *folio);
 
@@ -484,6 +487,12 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
+
+static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+	return 0;
+}
+
 static inline void deferred_split_folio(struct folio *folio) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
@@ -598,7 +607,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
 	return split_folio_to_list_to_order(folio, NULL, new_order);
 }
 
-#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
-#define split_folio(f) split_folio_to_order(f, 0)
-
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f4be468e06a49..1a273625eb507 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3082,6 +3082,9 @@ bool can_split_folio(struct folio *folio, int *pextra_pins)
  * released, or if some unexpected race happened (e.g., anon VMA disappeared,
  * truncation).
  *
+ * Callers should ensure that the order respects the address space mapping
+ * min-order if one is set for non-anonymous folios.
+ *
  * Returns -EINVAL when trying to split to an order that is incompatible
  * with the folio. Splitting to order 0 is compatible with all folios.
  */
@@ -3163,6 +3166,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		mapping = NULL;
 		anon_vma_lock_write(anon_vma);
 	} else {
+		unsigned int min_order;
 		gfp_t gfp;
 
 		mapping = folio->mapping;
@@ -3173,6 +3177,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 			goto out;
 		}
 
+		min_order = mapping_min_folio_order(folio->mapping);
+		if (new_order < min_order) {
+			VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u",
+				     min_order);
+			ret = -EINVAL;
+			goto out;
+		}
+
 		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
 							GFP_RECLAIM_MASK);
 
@@ -3285,6 +3297,25 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	return ret;
 }
 
+int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+	unsigned int min_order = 0;
+
+	if (folio_test_anon(folio))
+		goto out;
+
+	if (!folio->mapping) {
+		if (folio_test_pmd_mappable(folio))
+			count_vm_event(THP_SPLIT_PAGE_FAILED);
+		return -EBUSY;
+	}
+
+	min_order = mapping_min_folio_order(folio->mapping);
+out:
+	return split_huge_page_to_list_to_order(&folio->page, list,
+							min_order);
+}
+
 void __folio_undo_large_rmappable(struct folio *folio)
 {
 	struct deferred_split *ds_queue;
@@ -3515,6 +3546,8 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		struct vm_area_struct *vma = vma_lookup(mm, addr);
 		struct page *page;
 		struct folio *folio;
+		struct address_space *mapping;
+		unsigned int target_order = new_order;
 
 		if (!vma)
 			break;
@@ -3535,7 +3568,13 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		if (!is_transparent_hugepage(folio))
 			goto next;
 
-		if (new_order >= folio_order(folio))
+		if (!folio_test_anon(folio)) {
+			mapping = folio->mapping;
+			target_order = max(new_order,
+					   mapping_min_folio_order(mapping));
+		}
+
+		if (target_order >= folio_order(folio))
 			goto next;
 
 		total++;
@@ -3551,9 +3590,13 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		if (!folio_trylock(folio))
 			goto next;
 
-		if (!split_folio_to_order(folio, new_order))
+		if (!folio_test_anon(folio) && folio->mapping != mapping)
+			goto unlock;
+
+		if (!split_folio_to_order(folio, target_order))
 			split++;
 
+unlock:
 		folio_unlock(folio);
 next:
 		folio_put(folio);
@@ -3578,6 +3621,7 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 	pgoff_t index;
 	int nr_pages = 1;
 	unsigned long total = 0, split = 0;
+	unsigned int min_order;
 
 	file = getname_kernel(file_path);
 	if (IS_ERR(file))
@@ -3591,9 +3635,11 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		 file_path, off_start, off_end);
 
 	mapping = candidate->f_mapping;
+	min_order = mapping_min_folio_order(mapping);
 
 	for (index = off_start; index < off_end; index += nr_pages) {
 		struct folio *folio = filemap_get_folio(mapping, index);
+		unsigned int target_order = new_order;
 
 		nr_pages = 1;
 		if (IS_ERR(folio))
@@ -3602,18 +3648,23 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		if (!folio_test_large(folio))
 			goto next;
 
+		target_order = max(new_order, min_order);
 		total++;
 		nr_pages = folio_nr_pages(folio);
 
-		if (new_order >= folio_order(folio))
+		if (target_order >= folio_order(folio))
 			goto next;
 
 		if (!folio_trylock(folio))
 			goto next;
 
-		if (!split_folio_to_order(folio, new_order))
+		if (folio->mapping != mapping)
+			goto unlock;
+
+		if (!split_folio_to_order(folio, target_order))
 			split++;
 
+unlock:
 		folio_unlock(folio);
 next:
 		folio_put(folio);
-- 
2.44.1

* [PATCH v12 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  2024-08-15  9:08 [PATCH v12 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (3 preceding siblings ...)
  2024-08-15  9:08 ` [PATCH v12 04/10] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
@ 2024-08-15  9:08 ` Pankaj Raghav (Samsung)
  2024-08-15  9:08 ` [PATCH v12 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-15  9:08 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts

From: Pankaj Raghav <p.raghav@samsung.com>

Usually the page cache does not extend beyond the size of the inode, so
no PTEs are created for folios that extend beyond that size.

But with LBS support, we might extend the page cache beyond the size of
the inode, as we need to guarantee folios of minimum order. While doing a
read, do_fault_around() can create PTEs for pages that lie beyond the EOF,
leading to an incorrect error return when accessing a page beyond the
mapped file.

Cap the PTE range to be created for the page cache up to the end of
file (EOF) in filemap_map_pages() so that the error codes returned are
consistent with POSIX [1] for LBS configurations.

generic/749 has been created to trigger this edge case. This also fixes
generic/749 for tmpfs with huge=always on systems with 4k base page size.

[1](from mmap(2))  SIGBUS
    Attempted access to a page of the buffer that lies beyond the end
    of the mapped file.  For an explanation of the treatment  of  the
    bytes  in  the  page that corresponds to the end of a mapped file
    that is not a multiple of the page size, see NOTES.
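
A worked example of the cap (numbers hypothetical): with a 16k minimum
folio size on 4k pages and i_size = 10000, the folio spans pgoff 0-3 but
only pgoff 0-2 contain file data:

	file_end = DIV_ROUND_UP(10000, 4096) - 1;	/* == 2 */
	if (end_pgoff > file_end)
		end_pgoff = file_end;	/* no PTE for pgoff 3 past EOF */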

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 mm/filemap.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 623c0f988da79..77b583a7aabd1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3608,7 +3608,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	struct vm_area_struct *vma = vmf->vma;
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
-	pgoff_t last_pgoff = start_pgoff;
+	pgoff_t file_end, last_pgoff = start_pgoff;
 	unsigned long addr;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
 	struct folio *folio;
@@ -3634,6 +3634,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		goto out;
 	}
 
+	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+	if (end_pgoff > file_end)
+		end_pgoff = file_end;
+
 	folio_type = mm_counter_file(folio);
 	do {
 		unsigned long end;
-- 
2.44.1

* [PATCH v12 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-08-15  9:08 [PATCH v12 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (4 preceding siblings ...)
  2024-08-15  9:08 ` [PATCH v12 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
@ 2024-08-15  9:08 ` Pankaj Raghav (Samsung)
  2024-08-15  9:08 ` [PATCH v12 07/10] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-15  9:08 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts, Dave Chinner

From: Pankaj Raghav <p.raghav@samsung.com>

iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
< fs block size. iomap_dio_zero() has an implicit assumption that the fs
block size < page_size. This is true for most filesystems at the moment.

If the block size > page size, this will send the contents of the page
next to the zero page (as len > PAGE_SIZE) to the underlying block device,
causing FS corruption.

iomap is a generic infrastructure and it should not make any assumptions
about the fs block size and the page size of the system.
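
A sketch of how the old code corrupts data (numbers hypothetical): with a
16k fs block size on 4k pages, a sub-block DIO write can require zeroing
more than one page worth of data:

	/* DIO write at pos = 24576, i.e. 8k into a 16k block */
	pad = pos & (fs_block_size - 1);	/* 24576 & 16383 == 8192 */
	/* old: __bio_add_page(bio, ZERO_PAGE(0), 8192, 0) also sends
	 * the 4k of memory that follows the zero page to the device */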

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap/buffered-io.c |  4 ++--
 fs/iomap/direct-io.c   | 45 ++++++++++++++++++++++++++++++++++++------
 2 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9b4ca3811a242..cdab801e9d635 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -2007,10 +2007,10 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
 }
 EXPORT_SYMBOL_GPL(iomap_writepages);
 
-static int __init iomap_init(void)
+static int __init iomap_buffered_init(void)
 {
 	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
 			   offsetof(struct iomap_ioend, io_bio),
 			   BIOSET_NEED_BVECS);
 }
-fs_initcall(iomap_init);
+fs_initcall(iomap_buffered_init);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46e..c02b266bba525 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -11,6 +11,7 @@
 #include <linux/iomap.h>
 #include <linux/backing-dev.h>
 #include <linux/uio.h>
+#include <linux/set_memory.h>
 #include <linux/task_io_accounting_ops.h>
 #include "trace.h"
 
@@ -27,6 +28,13 @@
 #define IOMAP_DIO_WRITE		(1U << 30)
 #define IOMAP_DIO_DIRTY		(1U << 31)
 
+/*
+ * Used for sub block zeroing in iomap_dio_zero()
+ */
+#define IOMAP_ZERO_PAGE_SIZE (SZ_64K)
+#define IOMAP_ZERO_PAGE_ORDER (get_order(IOMAP_ZERO_PAGE_SIZE))
+static struct page *zero_page;
+
 struct iomap_dio {
 	struct kiocb		*iocb;
 	const struct iomap_dio_ops *dops;
@@ -232,13 +240,20 @@ void iomap_dio_bio_end_io(struct bio *bio)
 }
 EXPORT_SYMBOL_GPL(iomap_dio_bio_end_io);
 
-static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
+static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 		loff_t pos, unsigned len)
 {
 	struct inode *inode = file_inode(dio->iocb->ki_filp);
-	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
+	if (!len)
+		return 0;
+	/*
+	 * Max block size supported is 64k
+	 */
+	if (WARN_ON_ONCE(len > IOMAP_ZERO_PAGE_SIZE))
+		return -EINVAL;
+
 	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
@@ -246,8 +261,9 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	__bio_add_page(bio, zero_page, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
+	return 0;
 }
 
 /*
@@ -356,8 +372,10 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	if (need_zeroout) {
 		/* zero out from the start of the block to the write offset */
 		pad = pos & (fs_block_size - 1);
-		if (pad)
-			iomap_dio_zero(iter, dio, pos - pad, pad);
+
+		ret = iomap_dio_zero(iter, dio, pos - pad, pad);
+		if (ret)
+			goto out;
 	}
 
 	/*
@@ -431,7 +449,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		/* zero out from the end of the write to the end of the block */
 		pad = pos & (fs_block_size - 1);
 		if (pad)
-			iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
+			ret = iomap_dio_zero(iter, dio, pos,
+					     fs_block_size - pad);
 	}
 out:
 	/* Undo iter limitation to current extent */
@@ -753,3 +772,17 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	return iomap_dio_complete(dio);
 }
 EXPORT_SYMBOL_GPL(iomap_dio_rw);
+
+static int __init iomap_dio_init(void)
+{
+	zero_page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
+				IOMAP_ZERO_PAGE_ORDER);
+
+	if (!zero_page)
+		return -ENOMEM;
+
+	set_memory_ro((unsigned long)page_address(zero_page),
+		      1U << IOMAP_ZERO_PAGE_ORDER);
+	return 0;
+}
+fs_initcall(iomap_dio_init);
-- 
2.44.1

* [PATCH v12 07/10] xfs: use kvmalloc for xattr buffers
  2024-08-15  9:08 [PATCH v12 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (5 preceding siblings ...)
  2024-08-15  9:08 ` [PATCH v12 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
@ 2024-08-15  9:08 ` Pankaj Raghav (Samsung)
  2024-08-15  9:08 ` [PATCH v12 08/10] xfs: expose block size in stat Pankaj Raghav (Samsung)
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-15  9:08 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Pankaj Raghav reported that when filesystem block size is larger
than page size, the xattr code can use kmalloc() for high order
allocations. This triggers a useless warning in the allocator as it
is a __GFP_NOFAIL allocation here:

static inline
struct page *rmqueue(struct zone *preferred_zone,
                        struct zone *zone, unsigned int order,
                        gfp_t gfp_flags, unsigned int alloc_flags,
                        int migratetype)
{
        struct page *page;

        /*
         * We most definitely don't want callers attempting to
         * allocate greater than order-1 page units with __GFP_NOFAIL.
         */
>>>>    WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
...

Fix this by changing all these call sites to use kvmalloc(), which
will strip the NOFAIL from the kmalloc attempt and if that fails
will do a __GFP_NOFAIL vmalloc().

This is not an issue that production systems will see, as filesystems
with block size > page size cannot currently be mounted by the kernel;
Pankaj is developing this functionality right now.
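
For reference, the warning trips because the block size allocation order
exceeds 1 (numbers hypothetical):

	/* 64k fs blocks on a 4k page kernel */
	order = get_order(65536);	/* == 4, so the __GFP_NOFAIL
					 * WARN_ON_ONCE above fires */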

Reported-by: Pankaj Raghav <kernel@pankajraghav.com>
Fixes: f078d4ea8276 ("xfs: convert kmem_alloc() to kmalloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/xfs/libxfs/xfs_attr_leaf.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index b9e98950eb3d8..09f4cb061a6e0 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -1138,10 +1138,7 @@ xfs_attr3_leaf_to_shortform(
 
 	trace_xfs_attr_leaf_to_sf(args);
 
-	tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
-	if (!tmpbuffer)
-		return -ENOMEM;
-
+	tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
 	memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
 
 	leaf = (xfs_attr_leafblock_t *)tmpbuffer;
@@ -1205,7 +1202,7 @@ xfs_attr3_leaf_to_shortform(
 	error = 0;
 
 out:
-	kfree(tmpbuffer);
+	kvfree(tmpbuffer);
 	return error;
 }
 
@@ -1613,7 +1610,7 @@ xfs_attr3_leaf_compact(
 
 	trace_xfs_attr_leaf_compact(args);
 
-	tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
+	tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
 	memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
 	memset(bp->b_addr, 0, args->geo->blksize);
 	leaf_src = (xfs_attr_leafblock_t *)tmpbuffer;
@@ -1651,7 +1648,7 @@ xfs_attr3_leaf_compact(
 	 */
 	xfs_trans_log_buf(trans, bp, 0, args->geo->blksize - 1);
 
-	kfree(tmpbuffer);
+	kvfree(tmpbuffer);
 }
 
 /*
@@ -2330,7 +2327,7 @@ xfs_attr3_leaf_unbalance(
 		struct xfs_attr_leafblock *tmp_leaf;
 		struct xfs_attr3_icleaf_hdr tmphdr;
 
-		tmp_leaf = kzalloc(state->args->geo->blksize,
+		tmp_leaf = kvzalloc(state->args->geo->blksize,
 				GFP_KERNEL | __GFP_NOFAIL);
 
 		/*
@@ -2371,7 +2368,7 @@ xfs_attr3_leaf_unbalance(
 		}
 		memcpy(save_leaf, tmp_leaf, state->args->geo->blksize);
 		savehdr = tmphdr; /* struct copy */
-		kfree(tmp_leaf);
+		kvfree(tmp_leaf);
 	}
 
 	xfs_attr3_leaf_hdr_to_disk(state->args->geo, save_leaf, &savehdr);
-- 
2.44.1

* [PATCH v12 08/10] xfs: expose block size in stat
  2024-08-15  9:08 [PATCH v12 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (6 preceding siblings ...)
  2024-08-15  9:08 ` [PATCH v12 07/10] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
@ 2024-08-15  9:08 ` Pankaj Raghav (Samsung)
  2024-08-15  9:08 ` [PATCH v12 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-15  9:08 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts, Dave Chinner

From: Pankaj Raghav <p.raghav@samsung.com>

For block size larger than page size, the unit of efficient IO is
the block size, not the page size. Leaving stat() to report
PAGE_SIZE as the block size causes test programs like fsx to issue
illegal ranges for operations that require block size alignment
(e.g. fallocate() insert range). Hence update the preferred IO size
to reflect the block size in this case.

This change is based on a patch originally from Dave Chinner.[1]

[1] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-16-david@fromorbit.com/
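
As an illustration of the visible effect (path and sizes hypothetical), on
an 8k block size filesystem mounted on a 4k page size system:

	struct stat st;

	stat("/mnt/xfs/file", &st);
	/* st.st_blksize now reports 8192 (sb_blocksize) rather than
	 * 4096 (PAGE_SIZE), so fsx issues block-aligned operations */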

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a1c4a350a6dbf..2b8dbe8bf1381 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -567,7 +567,7 @@ xfs_stat_blksize(
 			return 1U << mp->m_allocsize_log;
 	}
 
-	return PAGE_SIZE;
+	return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);
 }
 
 STATIC int
-- 
2.44.1

* [PATCH v12 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-08-15  9:08 [PATCH v12 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (7 preceding siblings ...)
  2024-08-15  9:08 ` [PATCH v12 08/10] xfs: expose block size in stat Pankaj Raghav (Samsung)
@ 2024-08-15  9:08 ` Pankaj Raghav (Samsung)
  2024-08-15  9:08 ` [PATCH v12 10/10] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-15  9:08 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts, Dave Chinner

From: Pankaj Raghav <p.raghav@samsung.com>

Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
make the calculation generic so that the page cache count can be
calculated correctly for LBS.
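
A sketch of why the old expression breaks for LBS (numbers hypothetical):
with sb_blocklog = 16 (64k blocks) and PAGE_SHIFT = 12, the old code
shifted by a negative amount. The generic form computes bytes first:

	/* old: nblocks >> (PAGE_SHIFT - sbp->sb_blocklog), i.e. >> -4 */
	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
		return -EFBIG;	/* nblocks << 16 overflowed 64 bits */
	if (max_bytes >> PAGE_SHIFT > ULONG_MAX)
		return -EFBIG;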

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_mount.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 09eef1721ef4f..3949f720b5354 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -132,11 +132,16 @@ xfs_sb_validate_fsb_count(
 	xfs_sb_t	*sbp,
 	uint64_t	nblocks)
 {
+	uint64_t		max_bytes;
+
 	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
 
+	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
+		return -EFBIG;
+
 	/* Limited by ULONG_MAX of page cache index */
-	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
+	if (max_bytes >> PAGE_SHIFT > ULONG_MAX)
 		return -EFBIG;
 	return 0;
 }
-- 
2.44.1

* [PATCH v12 10/10] xfs: enable block size larger than page size support
  2024-08-15  9:08 [PATCH v12 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (8 preceding siblings ...)
  2024-08-15  9:08 ` [PATCH v12 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
@ 2024-08-15  9:08 ` Pankaj Raghav (Samsung)
  2024-08-16 19:31 ` [PATCH v12 00/10] enable bs > ps in XFS David Howells
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-15  9:08 UTC (permalink / raw)
  To: brauner, akpm
  Cc: chandan.babu, linux-fsdevel, djwong, hare, gost.dev, linux-xfs,
	kernel, hch, david, Zi Yan, yang, linux-kernel, linux-mm, willy,
	john.g.garry, cl, p.raghav, mcgrof, ryan.roberts, Dave Chinner

From: Pankaj Raghav <p.raghav@samsung.com>

The page cache now has the ability to enforce a minimum order when
allocating a folio, which is a prerequisite for adding support for block
size > page size.
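
The minimum order is derived directly from the block geometry; for example
(hypothetical geometry), a 16k block size on a 4k page size system yields:

	/* sb_blocklog = 14, PAGE_SHIFT = 12 */
	igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT; /* 2 */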

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c |  5 +++++
 fs/xfs/libxfs/xfs_shared.h |  3 +++
 fs/xfs/xfs_icache.c        |  6 ++++--
 fs/xfs/xfs_mount.c         |  1 -
 fs/xfs/xfs_super.c         | 28 ++++++++++++++++++++--------
 include/linux/pagemap.h    | 13 +++++++++++++
 6 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 0af5b7a33d055..1921b689888b8 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -3033,6 +3033,11 @@ xfs_ialloc_setup_geometry(
 		igeo->ialloc_align = mp->m_dalign;
 	else
 		igeo->ialloc_align = 0;
+
+	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
+		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
+	else
+		igeo->min_folio_order = 0;
 }
 
 /* Compute the location of the root directory inode that is laid out by mkfs. */
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 2f7413afbf46c..33b84a3a83ff6 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -224,6 +224,9 @@ struct xfs_ino_geometry {
 	/* precomputed value for di_flags2 */
 	uint64_t	new_diflags2;
 
+	/* minimum folio order of a page cache allocation */
+	unsigned int	min_folio_order;
+
 };
 
 #endif /* __XFS_SHARED_H__ */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index cf629302d48e7..0fcf235e50235 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -88,7 +88,8 @@ xfs_inode_alloc(
 
 	/* VFS doesn't initialise i_mode! */
 	VFS_I(ip)->i_mode = 0;
-	mapping_set_large_folios(VFS_I(ip)->i_mapping);
+	mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -325,7 +326,8 @@ xfs_reinit_inode(
 	inode->i_uid = uid;
 	inode->i_gid = gid;
 	inode->i_state = state;
-	mapping_set_large_folios(inode->i_mapping);
+	mapping_set_folio_min_order(inode->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 3949f720b5354..c6933440f8066 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -134,7 +134,6 @@ xfs_sb_validate_fsb_count(
 {
 	uint64_t		max_bytes;
 
-	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
 
 	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 210481b03fdb4..8cd76a01b543f 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1638,16 +1638,28 @@ xfs_fs_fill_super(
 		goto out_free_sb;
 	}
 
-	/*
-	 * Until this is fixed only page-sized or smaller data blocks work.
-	 */
 	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
-		xfs_warn(mp,
-		"File system with blocksize %d bytes. "
-		"Only pagesize (%ld) or less will currently work.",
+		size_t max_folio_size = mapping_max_folio_size_supported();
+
+		if (!xfs_has_crc(mp)) {
+			xfs_warn(mp,
+"V4 Filesystem with blocksize %d bytes. Only pagesize (%ld) or less is supported.",
 				mp->m_sb.sb_blocksize, PAGE_SIZE);
-		error = -ENOSYS;
-		goto out_free_sb;
+			error = -ENOSYS;
+			goto out_free_sb;
+		}
+
+		if (mp->m_sb.sb_blocksize > max_folio_size) {
+			xfs_warn(mp,
+"block size (%u bytes) not supported; Only block size (%ld) or less is supported",
+				mp->m_sb.sb_blocksize, max_folio_size);
+			error = -ENOSYS;
+			goto out_free_sb;
+		}
+
+		xfs_warn(mp,
+"EXPERIMENTAL: V5 Filesystem with Large Block Size (%d bytes) enabled.",
+			mp->m_sb.sb_blocksize);
 	}
 
 	/* Ensure this filesystem fits in the page cache limits */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3a876d6801a90..61a7649d86e57 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -373,6 +373,19 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 #define MAX_XAS_ORDER		(XA_CHUNK_SHIFT * 2 - 1)
 #define MAX_PAGECACHE_ORDER	min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
 
+/*
+ * mapping_max_folio_size_supported() - Check the max folio size supported
+ *
+ * The filesystem should call this function at mount time if there is a
+ * requirement on the folio mapping size in the page cache.
+ */
+static inline size_t mapping_max_folio_size_supported(void)
+{
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return 1U << (PAGE_SHIFT + MAX_PAGECACHE_ORDER);
+	return PAGE_SIZE;
+}
+
 /*
  * mapping_set_folio_order_range() - Set the orders supported by a file.
  * @mapping: The address space of the file.
-- 
2.44.1
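
For a sense of scale, here is how the new helper works out (a worked
example, not part of the patch; the exact MAX_PAGECACHE_ORDER depends on
the architecture and config):

	/*
	 * Illustration only: on x86-64 with CONFIG_TRANSPARENT_HUGEPAGE,
	 * PAGE_SHIFT is 12 and MAX_PAGECACHE_ORDER is typically 9 (the
	 * PMD order), so mapping_max_folio_size_supported() returns
	 *	1U << (12 + 9) = 2 MiB.
	 * Without THP it falls back to PAGE_SIZE, which is why the
	 * xfs_fs_fill_super() hunk above rejects any block size larger
	 * than this limit at mount time.
	 */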



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-15  9:08 [PATCH v12 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (9 preceding siblings ...)
  2024-08-15  9:08 ` [PATCH v12 10/10] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
@ 2024-08-16 19:31 ` David Howells
  2024-08-18 16:51   ` Pankaj Raghav (Samsung)
                     ` (4 more replies)
  2024-08-19 15:17 ` David Howells
  2024-08-19 16:51 ` David Howells
  12 siblings, 5 replies; 28+ messages in thread
From: David Howells @ 2024-08-16 19:31 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: dhowells, brauner, akpm, chandan.babu, linux-fsdevel, djwong,
	hare, gost.dev, linux-xfs, hch, david, Zi Yan, yang,
	linux-kernel, linux-mm, willy, john.g.garry, cl, p.raghav,
	mcgrof, ryan.roberts

Hi Pankaj,

I applied the first five patches and set minimum folio size for afs files to
8K (see attached patch) and ran some tests.

With simple tests, I can see in the trace log that it is definitely creating
8K folios where it would previously create 4K folios.

However, with 'xfstests -g quick', generic/075 generic/112 generic/393 fail
where they didn't previously.  I won't be able to look into this more till
Monday.

If you want to try using afs for yourself, install the kafs-client package
(available on Fedora and Debian), do 'systemctl start afs.mount' and then you
can, say, do:

	ls /afs/openafs.org/www/docs.openafs.org/

and browse the publicly accessible files under there.

David
---
commit d676df787baee3b710b9f0d284b21518473feb3c
Author: David Howells <dhowells@redhat.com>
Date:   Fri Aug 16 19:54:25 2024 +0100

    afs: [DEBUGGING] Set min folio order

diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 3acf5e050072..c3842cba92e7 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -104,6 +104,7 @@ static int afs_inode_init_from_status(struct afs_operation *op,
 		inode->i_fop	= &afs_file_operations;
 		inode->i_mapping->a_ops	= &afs_file_aops;
 		mapping_set_large_folios(inode->i_mapping);
+		mapping_set_folio_min_order(inode->i_mapping, 1);
 		break;
 	case AFS_FTYPE_DIR:
 		inode->i_mode	= S_IFDIR |  (status->mode & S_IALLUGO);



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-16 19:31 ` [PATCH v12 00/10] enable bs > ps in XFS David Howells
@ 2024-08-18 16:51   ` Pankaj Raghav (Samsung)
  2024-08-18 20:16   ` David Howells
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-18 16:51 UTC (permalink / raw)
  To: David Howells
  Cc: brauner, akpm, chandan.babu, linux-fsdevel, djwong, hare,
	gost.dev, linux-xfs, hch, david, Zi Yan, yang, linux-kernel,
	linux-mm, willy, john.g.garry, cl, p.raghav, mcgrof,
	ryan.roberts

Hi David,

On Fri, Aug 16, 2024 at 08:31:03PM +0100, David Howells wrote:
> Hi Pankaj,
> 
> I applied the first five patches and set minimum folio size for afs files to
> 8K (see attached patch) and ran some tests.
> 
> With simple tests, I can see in the trace log that it is definitely creating
> 8K folios where it would previously create 4K folios.
> 
> However, with 'xfstests -g quick', generic/075 generic/112 generic/393 fail
> where they didn't previously.  I won't be able to look into this more till
> Monday.

Thanks for trying it out!

As you might have seen from the whole patchset, filesystems will
typically require some changes to support min order correctly. That is
why this patchset only enables XFS to use min order to support bs > ps.

In the case of XFS (a block-based FS), we set the min order to the FS
block size, as that is the smallest unit of operation in the data path,
and we know for sure there are no implicit PAGE_SIZE assumptions.

I am no expert in network filesystems, but are you sure there are no
PAGE_SIZE assumptions when manipulating folios from the page cache in
AFS?

Similar to AFS, XFS also supported large folios, but we found some bugs
when we set the min order to the block size of the FS.
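
To make the numbers concrete, here is roughly how the min order falls
out of the superblock geometry (a sketch following the
xfs_ialloc_setup_geometry() hunk in this series; sb_blocksize and
sb_blocklog stand in for the mp->m_sb fields):

	/* Assuming a 4k base page (PAGE_SHIFT = 12) and a 16k block size: */
	unsigned int sb_blocksize = 16384;
	unsigned int sb_blocklog = 14;		/* log2(16384) */
	unsigned int min_folio_order = 0;

	if (sb_blocksize > PAGE_SIZE)		/* 16384 > 4096 */
		min_folio_order = sb_blocklog - PAGE_SHIFT;	/* 14 - 12 = 2 */

	/* Folios for this inode are then at least 2^2 = 4 pages (16k),
	 * and 16k-aligned in the file. */
	mapping_set_folio_min_order(inode->i_mapping, min_folio_order);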
> 
> If you want to try using afs for yourself, install the kafs-client package
> (available on Fedora and Debian), do 'systemctl start afs.mount' and then you
> can, say, do:
> 
> 	ls /afs/openafs.org/www/docs.openafs.org/
> 
> and browse the publicly accessible files under there.

Great. But is this enough to run fstests? I assume I also need an afs
server to run them?

Are the tests just failing or are you getting some kernel panic?

> 
> David
> ---
> commit d676df787baee3b710b9f0d284b21518473feb3c
> Author: David Howells <dhowells@redhat.com>
> Date:   Fri Aug 16 19:54:25 2024 +0100
> 
>     afs: [DEBUGGING] Set min folio order
> 
> diff --git a/fs/afs/inode.c b/fs/afs/inode.c
> index 3acf5e050072..c3842cba92e7 100644
> --- a/fs/afs/inode.c
> +++ b/fs/afs/inode.c
> @@ -104,6 +104,7 @@ static int afs_inode_init_from_status(struct afs_operation *op,
>  		inode->i_fop	= &afs_file_operations;
>  		inode->i_mapping->a_ops	= &afs_file_aops;
>  		mapping_set_large_folios(inode->i_mapping);
> +		mapping_set_folio_min_order(inode->i_mapping, 1);
>  		break;
>  	case AFS_FTYPE_DIR:
>  		inode->i_mode	= S_IFDIR |  (status->mode & S_IALLUGO);
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-16 19:31 ` [PATCH v12 00/10] enable bs > ps in XFS David Howells
  2024-08-18 16:51   ` Pankaj Raghav (Samsung)
@ 2024-08-18 20:16   ` David Howells
  2024-08-19  7:24     ` Hannes Reinecke
  2024-08-19 12:25     ` David Howells
  2024-08-19 11:46   ` David Howells
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 28+ messages in thread
From: David Howells @ 2024-08-18 20:16 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: dhowells, brauner, akpm, chandan.babu, linux-fsdevel, djwong,
	hare, gost.dev, linux-xfs, hch, david, Zi Yan, yang,
	linux-kernel, linux-mm, willy, john.g.garry, cl, p.raghav,
	mcgrof, ryan.roberts

Pankaj Raghav (Samsung) <kernel@pankajraghav.com> wrote:

> I am no expert in network filesystems, but are you sure there are no
> PAGE_SIZE assumptions when manipulating folios from the page cache in
> AFS?

Note that I've moved the knowledge of the pagecache out of 9p, afs and cifs
into netfslib and intend to do the same to ceph.  The client filesystems just
provide read and write ops to netfslib, and netfslib uses those to do ordinary
buffered I/O, unbuffered I/O (selectable by mount option on some filesystems)
and DIO.

That said, I'm not sure that I haven't made some PAGE_SIZE assumptions.  I
don't *think* I have since netfslib is fully multipage folio capable, but I
can't guarantee it.

Mostly this was just a note to you that there might be an issue with your code
- but I haven't investigated it yet and it could well be in my code.

Apparently, I also need to update xfstests, so it could be that too.

> > 	ls /afs/openafs.org/www/docs.openafs.org/
> > 
> > and browse the publicly accessible files under there.
> 
> Great. But is this enough to run fstests? I assume I also need an afs
> server to run them?

Sadly not, but if you turn on some tracepoints, you can see netfslib operating
under the bonnet.

> Are the tests just failing or are you getting some kernel panic?

Just failing.

Thanks,
David



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-18 20:16   ` David Howells
@ 2024-08-19  7:24     ` Hannes Reinecke
  2024-08-19  7:37       ` Pankaj Raghav (Samsung)
  2024-08-19 12:25     ` David Howells
  1 sibling, 1 reply; 28+ messages in thread
From: Hannes Reinecke @ 2024-08-19  7:24 UTC (permalink / raw)
  To: David Howells, Pankaj Raghav (Samsung)
  Cc: brauner, akpm, chandan.babu, linux-fsdevel, djwong, gost.dev,
	linux-xfs, hch, david, Zi Yan, yang, linux-kernel, linux-mm,
	willy, john.g.garry, cl, p.raghav, mcgrof, ryan.roberts

On 8/18/24 22:16, David Howells wrote:
> Pankaj Raghav (Samsung) <kernel@pankajraghav.com> wrote:
> 
>> I am no expert in network filesystems, but are you sure there are no
>> PAGE_SIZE assumptions when manipulating folios from the page cache in
>> AFS?
> 
> Note that I've moved the knowledge of the pagecache out of 9p, afs and cifs
> into netfslib and intend to do the same to ceph.  The client filesystems just
> provide read and write ops to netfslib, and netfslib uses those to do ordinary
> buffered I/O, unbuffered I/O (selectable by mount option on some filesystems)
> and DIO.
> 
> That said, I'm not sure that I haven't made some PAGE_SIZE assumptions.  I
> don't *think* I have since netfslib is fully multipage folio capable, but I
> can't guarantee it.
> 
I guess you did:

static int afs_fill_super(struct super_block *sb, struct afs_fs_context *ctx)
{
         struct afs_super_info *as = AFS_FS_S(sb);
         struct inode *inode = NULL;
         int ret;

         _enter("");

         /* fill in the superblock */
         sb->s_blocksize         = PAGE_SIZE;
         sb->s_blocksize_bits    = PAGE_SHIFT;
         sb->s_maxbytes          = MAX_LFS_FILESIZE;
         sb->s_magic             = AFS_FS_MAGIC;
         sb->s_op                = &afs_super_ops;

IE you essentially nail AFS to use PAGE_SIZE.
Not sure how you would tell AFS to use a different block size;
maybe a mount option?

And there are several other places which will need to be modified;
e.g. afs_mntpt_set_params() is trying to read from a page, which
won't fly with large blocks (converted to read_full_folio()?),
and, of course, the infamous AFS_DIR_BLOCKS_PER_PAGE, which will
overflow for large blocks.

So some work is required, but everything looks doable.
Maybe I can find some time before LPC.

> Mostly this was just a note to you that there might be an issue with your code
> - but I haven't investigated it yet and it could well be in my code.
> 
Hmm. I'd rather fix the obvious places in afs first; just do a quick
grep for 'PAGE_', that'll give you a good impression of places to look at.
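
Something like this (one possible invocation) gives a quick survey:

	git grep -n 'PAGE_' fs/afs fs/netfs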

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-19  7:24     ` Hannes Reinecke
@ 2024-08-19  7:37       ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 28+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-19  7:37 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: David Howells, brauner, akpm, chandan.babu, linux-fsdevel,
	djwong, gost.dev, linux-xfs, hch, david, Zi Yan, yang,
	linux-kernel, linux-mm, willy, john.g.garry, cl, p.raghav,
	mcgrof, ryan.roberts

On Mon, Aug 19, 2024 at 09:24:11AM +0200, Hannes Reinecke wrote:
> On 8/18/24 22:16, David Howells wrote:
> > Pankaj Raghav (Samsung) <kernel@pankajraghav.com> wrote:
> > 
> > > I am no expert in network filesystems, but are you sure there are no
> > > PAGE_SIZE assumptions when manipulating folios from the page cache in
> > > AFS?
> > 
> > Note that I've moved the knowledge of the pagecache out of 9p, afs and cifs
> > into netfslib and intend to do the same to ceph.  The client filesystems just
> > provide read and write ops to netfslib, and netfslib uses those to do ordinary
> > buffered I/O, unbuffered I/O (selectable by mount option on some filesystems)
> > and DIO.
> > 
> > That said, I'm not sure that I haven't made some PAGE_SIZE assumptions.  I
> > don't *think* I have since netfslib is fully multipage folio capable, but I
> > can't guarantee it.
> > 
> I guess you did:
> 
> static int afs_fill_super(struct super_block *sb, struct afs_fs_context *ctx)
> {
>         struct afs_super_info *as = AFS_FS_S(sb);
>         struct inode *inode = NULL;
>         int ret;
> 
>         _enter("");
> 
>         /* fill in the superblock */
>         sb->s_blocksize         = PAGE_SIZE;
>         sb->s_blocksize_bits    = PAGE_SHIFT;
>         sb->s_maxbytes          = MAX_LFS_FILESIZE;
>         sb->s_magic             = AFS_FS_MAGIC;
>         sb->s_op                = &afs_super_ops;
> 
> IE you essentially nail AFS to use PAGE_SIZE.
> Not sure how you would tell AFS to use a different block size;
> maybe a mount option?

I saw this as well, but I didn't see this variable being used anywhere.
Probably it has no meaning for network-based FSs?

> And there are several other places which will need to be modified;
> e.g. afs_mntpt_set_params() is trying to read from a page, which
> won't fly with large blocks (converted to read_full_folio()?),
> and, of course, the infamous AFS_DIR_BLOCKS_PER_PAGE, which will
> overflow for large blocks.

But the min folio order is set only for AFS_FTYPE_FILE and not
for AFS_FTYPE_DIR.

> 
> So some work is required, but everything looks doable.
> Maybe I can find some time before LPC.
> 
> > Mostly this was just a note to you that there might be an issue with your code
> > - but I haven't investigated it yet and it could well be in my code.
> > 
> Hmm. I'd rather fix the obvious places in afs first; just do a quick
> grep for 'PAGE_', that'll give you a good impression of places to look at.
> 
Agree.

> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-16 19:31 ` [PATCH v12 00/10] enable bs > ps in XFS David Howells
  2024-08-18 16:51   ` Pankaj Raghav (Samsung)
  2024-08-18 20:16   ` David Howells
@ 2024-08-19 11:46   ` David Howells
  2024-08-19 12:48     ` Hannes Reinecke
                       ` (3 more replies)
  2024-08-19 11:59   ` David Howells
  2024-08-20 23:24   ` David Howells
  4 siblings, 4 replies; 28+ messages in thread
From: David Howells @ 2024-08-19 11:46 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: dhowells, brauner, akpm, chandan.babu, linux-fsdevel, djwong,
	hare, gost.dev, linux-xfs, hch, david, Zi Yan, yang,
	linux-kernel, linux-mm, willy, john.g.garry, cl, p.raghav,
	mcgrof, ryan.roberts

Hi Pankaj,

I can reproduce the problem with:

xfs_io -t -f -c "pwrite -S 0x58 0 40" -c "fsync" -c "truncate 4" -c "truncate 4096" /xfstest.test/wubble; od -x /xfstest.test/wubble

borrowed from generic/393.  I've distilled it down to the attached C program.

Turning on tracing and adding a bit more, I can see the problem happening.
Here's an excerpt of the tracing (I've added some non-upstream tracepoints).
Firstly, you can see the second pwrite at fpos 0, 40 bytes (ie. 0x28):

 pankaj-5833: netfs_write_iter: WRITE-ITER i=9e s=0 l=28 f=0
 pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 mod-streamw

Then first ftruncate() is called to reduce the file size to 4:

 pankaj-5833: netfs_truncate: ni=9e isz=2028 rsz=2028 zp=4000 to=4
 pankaj-5833: netfs_inval_folio: pfn=116fec i=0009e ix=00000-00001 o=4 l=1ffc d=78787878
 pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part
 pankaj-5833: netfs_set_size: ni=9e resize-file isz=4 rsz=4 zp=4

You can see the invalidate_folio call, with the offset at 0x4 and the length as
0x1ffc.  The data at the beginning of the page is 0x78787878.  This looks
correct.

Then second ftruncate() is called to increase the file size to 4096
(ie. 0x1000):

 pankaj-5833: netfs_truncate: ni=9e isz=4 rsz=4 zp=4 to=1000
 pankaj-5833: netfs_inval_folio: pfn=116fec i=0009e ix=00000-00001 o=1000 l=1000 d=78787878
 pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part
 pankaj-5833: netfs_set_size: ni=9e resize-file isz=1000 rsz=1000 zp=4

And here's the problem: in the invalidate_folio() call, the offset is 0x1000
and the length is 0x1000 (o= and l=).  But that's the wrong half of the folio!
I'm guessing that the caller thereafter clears the other half of the folio -
the bit that should be kept.

David
---
/* Distillation of the generic/393 xfstest */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

#define ERR(x, y) do { if ((long)(x) == -1) { perror(y); exit(1); } } while(0)

static const char xxx[40] = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
static const char yyy[40] = "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy";
static const char dropfile[] = "/proc/sys/vm/drop_caches";
static const char droptype[] = "3";
static const char file[] = "/xfstest.test/wubble";

int main(int argc, char *argv[])
{
        int fd, drop;

	/* Fill in the second 8K block of the file... */
        fd = open(file, O_CREAT|O_TRUNC|O_WRONLY, 0666);
        ERR(fd, "open");
        ERR(ftruncate(fd, 0), "pre-trunc $file");
        ERR(pwrite(fd, yyy, sizeof(yyy), 0x2000), "write-2000");
        ERR(close(fd), "close");

	/* ... and drop the pagecache so that we get a streaming
	 * write, attaching some private data to the folio.
	 */
        drop = open(dropfile, O_WRONLY);
        ERR(drop, dropfile);
        ERR(write(drop, droptype, sizeof(droptype) - 1), "write-drop");
        ERR(close(drop), "close-drop");

        fd = open(file, O_WRONLY, 0666);
        ERR(fd, "reopen");
	/* Make a streaming write on the first 8K block (needs O_WRONLY). */
        ERR(pwrite(fd, xxx, sizeof(xxx), 0), "write-0");
	/* Now use truncate to shrink and reexpand. */
        ERR(ftruncate(fd, 4), "trunc-4");
        ERR(ftruncate(fd, 4096), "trunc-4096");
        ERR(close(fd), "close-2");
        exit(0);
}



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-16 19:31 ` [PATCH v12 00/10] enable bs > ps in XFS David Howells
                     ` (2 preceding siblings ...)
  2024-08-19 11:46   ` David Howells
@ 2024-08-19 11:59   ` David Howells
  2024-08-20 23:24   ` David Howells
  4 siblings, 0 replies; 28+ messages in thread
From: David Howells @ 2024-08-19 11:59 UTC (permalink / raw)
  Cc: dhowells, Pankaj Raghav (Samsung),
	brauner, akpm, chandan.babu, linux-fsdevel, djwong, hare,
	gost.dev, linux-xfs, hch, david, Zi Yan, yang, linux-kernel,
	linux-mm, willy, john.g.garry, cl, p.raghav, mcgrof,
	ryan.roberts

David Howells <dhowells@redhat.com> wrote:

> You can see the invalidate_folio call, with the offset at 0x4 and the length as
> 0x1ffc.  The data at the beginning of the page is 0x78787878.  This looks
> correct.
> 
> Then second ftruncate() is called to increase the file size to 4096
> (ie. 0x1000):
> 
>  pankaj-5833: netfs_truncate: ni=9e isz=4 rsz=4 zp=4 to=1000
>  pankaj-5833: netfs_inval_folio: pfn=116fec i=0009e ix=00000-00001 o=1000 l=1000 d=78787878
>  pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part
>  pankaj-5833: netfs_set_size: ni=9e resize-file isz=1000 rsz=1000 zp=4
> 
> And here's the problem: in the invalidate_folio() call, the offset is 0x1000
> and the length is 0x1000 (o= and l=).  But that's the wrong half of the folio!
> I'm guessing that the caller thereafter clears the other half of the folio -
> the bit that should be kept.

Actually, I think I'm wrong in my evaluation - I think that's the region to be
invalidated, not the region to be kept.

David



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-18 20:16   ` David Howells
  2024-08-19  7:24     ` Hannes Reinecke
@ 2024-08-19 12:25     ` David Howells
  1 sibling, 0 replies; 28+ messages in thread
From: David Howells @ 2024-08-19 12:25 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: dhowells, Pankaj Raghav (Samsung),
	brauner, akpm, chandan.babu, linux-fsdevel, djwong, gost.dev,
	linux-xfs, hch, david, Zi Yan, yang, linux-kernel, linux-mm,
	willy, john.g.garry, cl, p.raghav, mcgrof, ryan.roberts

Hannes Reinecke <hare@suse.de> wrote:

> IE you essentially nail AFS to use PAGE_SIZE.
> Not sure how you would tell AFS to use a different block size;
> maybe a mount option?

As far as I know:

        sb->s_blocksize         = PAGE_SIZE;
        sb->s_blocksize_bits    = PAGE_SHIFT;

isn't used by the VM.

> Hmm. I'd rather fix the obvious places in afs first; just do a quick
> grep for 'PAGE_', that'll give you a good impression of places to look at.

Sure:

   fs/afs/dir.c:   nr_pages = (i_size + PAGE_SIZE - 1) / PAGE_SIZE;
   fs/afs/dir.c:   req->len = nr_pages * PAGE_SIZE; /* We can ask for more than there is */
   fs/afs/dir.c:           task_io_account_read(PAGE_SIZE * req->nr_pages);
   fs/afs/dir.c:           folio = __filemap_get_folio(dir->i_mapping, ctx->pos / PAGE_SIZE,
   fs/afs/xdr_fs.h:#define AFS_DIR_BLOCKS_PER_PAGE (PAGE_SIZE / AFS_DIR_BLOCK_SIZE)

Those only affect directories.

   fs/afs/mntpt.c:         if (size < 2 || size > PAGE_SIZE - 1)

That only affects mountpoint symlinks.

   fs/afs/super.c: sb->s_blocksize         = PAGE_SIZE;

This is the only thing (and sb->s_blocksize_bits) that might affect files.  I
checked, and doubling this and adding 1 to bits does not alter the outcome.

Now, the VM wrangling is offloaded to netfslib, and most of that is to do with
converting between indices and file positions.  Going through the usages of
PAGE_SIZE there:

   fs/netfs/buffered_read.c:               size += PAGE_SIZE << order;

That was recording the size of a folio readahead allocated.

   fs/netfs/buffered_read.c:       size_t nr_bvec = flen / PAGE_SIZE + 2;
   fs/netfs/buffered_read.c:               part = min_t(size_t, to - off, PAGE_SIZE);

Those two are used to fill in the gaps around a partial page - but that didn't
appear in the logs.

   fs/netfs/buffered_write.c:      pgoff_t index = pos / PAGE_SIZE;
   fs/netfs/buffered_write.c:              fgp_flags |= fgf_set_order(pos % PAGE_SIZE + part);

Those two are used when asking __filemap_get_folio() to allocate a folio to
write into.  I got a folio of the right size and index, so that's not the
problem.

   fs/netfs/fscache_io.c:  pgoff_t first = start / PAGE_SIZE;
   fs/netfs/fscache_io.c:  pgoff_t last = (start + len - 1) / PAGE_SIZE;

Caching is not enabled at the moment, so these don't happen.

   fs/netfs/iterator.c:            cur_npages = DIV_ROUND_UP(ret, PAGE_SIZE);
   fs/netfs/iterator.c:                    len = ret > PAGE_SIZE ? PAGE_SIZE : ret;

I'm not doing DIO, so these aren't used.

   fs/netfs/iterator.c:    pgoff_t index = pos / PAGE_SIZE;

I'm not using an ITER_XARRAY iterator, so this doesn't happen.

   fs/netfs/misc.c:        rreq->io_iter.count += PAGE_SIZE << order;

This is just multiplying up the folio size to add to the byte count.

   fs/netfs/read_collect.c:        fsize = PAGE_SIZE << subreq->curr_folio_order;
   fs/netfs/read_collect.c:            WARN_ON_ONCE(folioq_folio(folioq, slot)->index != fpos / PAGE_SIZE)) {

These two are converting between a file pos and an index - but only during
read, and I can see from wireshark that we're writing the wrong data to the
server before we get this far.

And that's all the PAGE_SIZE usages in afs and netfslib.

David



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-19 11:46   ` David Howells
@ 2024-08-19 12:48     ` Hannes Reinecke
  2024-08-19 14:08     ` David Howells
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 28+ messages in thread
From: Hannes Reinecke @ 2024-08-19 12:48 UTC (permalink / raw)
  To: David Howells, Pankaj Raghav (Samsung)
  Cc: brauner, akpm, chandan.babu, linux-fsdevel, djwong, gost.dev,
	linux-xfs, hch, david, Zi Yan, yang, linux-kernel, linux-mm,
	willy, john.g.garry, cl, p.raghav, mcgrof, ryan.roberts

On 8/19/24 13:46, David Howells wrote:
> Hi Pankaj,
> 
> I can reproduce the problem with:
> 
> xfs_io -t -f -c "pwrite -S 0x58 0 40" -c "fsync" -c "truncate 4" -c "truncate 4096" /xfstest.test/wubble; od -x /xfstest.test/wubble
> 
> borrowed from generic/393.  I've distilled it down to the attached C program.
> 
> Turning on tracing and adding a bit more, I can see the problem happening.
> Here's an excerpt of the tracing (I've added some non-upstream tracepoints).
> Firstly, you can see the second pwrite at fpos 0, 40 bytes (ie. 0x28):
> 
>   pankaj-5833: netfs_write_iter: WRITE-ITER i=9e s=0 l=28 f=0
>   pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 mod-streamw
> 
> Then first ftruncate() is called to reduce the file size to 4:
> 
>   pankaj-5833: netfs_truncate: ni=9e isz=2028 rsz=2028 zp=4000 to=4
>   pankaj-5833: netfs_inval_folio: pfn=116fec i=0009e ix=00000-00001 o=4 l=1ffc d=78787878
>   pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part
>   pankaj-5833: netfs_set_size: ni=9e resize-file isz=4 rsz=4 zp=4
> 
> You can see the invalidate_folio call, with the offset at 0x4 and the length as
> 0x1ffc.  The data at the beginning of the page is 0x78787878.  This looks
> correct.
> 
> Then second ftruncate() is called to increase the file size to 4096
> (ie. 0x1000):
> 
>   pankaj-5833: netfs_truncate: ni=9e isz=4 rsz=4 zp=4 to=1000
>   pankaj-5833: netfs_inval_folio: pfn=116fec i=0009e ix=00000-00001 o=1000 l=1000 d=78787878
>   pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part
>   pankaj-5833: netfs_set_size: ni=9e resize-file isz=1000 rsz=1000 zp=4
> 
> And here's the problem: in the invalidate_folio() call, the offset is 0x1000
> and the length is 0x1000 (o= and l=).  But that's the wrong half of the folio!
> I'm guessing that the caller thereafter clears the other half of the folio -
> the bit that should be kept.
> 
> David
> ---
> /* Distillation of the generic/393 xfstest */
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <fcntl.h>
> 
> #define ERR(x, y) do { if ((long)(x) == -1) { perror(y); exit(1); } } while(0)
> 
> static const char xxx[40] = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
> static const char yyy[40] = "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy";
> static const char dropfile[] = "/proc/sys/vm/drop_caches";
> static const char droptype[] = "3";
> static const char file[] = "/xfstest.test/wubble";
> 
> int main(int argc, char *argv[])
> {
>          int fd, drop;
> 
> 	/* Fill in the second 8K block of the file... */
>          fd = open(file, O_CREAT|O_TRUNC|O_WRONLY, 0666);
>          ERR(fd, "open");
>          ERR(ftruncate(fd, 0), "pre-trunc $file");
>          ERR(pwrite(fd, yyy, sizeof(yyy), 0x2000), "write-2000");
>          ERR(close(fd), "close");
> 
> 	/* ... and drop the pagecache so that we get a streaming
> 	 * write, attaching some private data to the folio.
> 	 */
>          drop = open(dropfile, O_WRONLY);
>          ERR(drop, dropfile);
>          ERR(write(drop, droptype, sizeof(droptype) - 1), "write-drop");
>          ERR(close(drop), "close-drop");
> 
>          fd = open(file, O_WRONLY, 0666);
>          ERR(fd, "reopen");
> 	/* Make a streaming write on the first 8K block (needs O_WRONLY). */
>          ERR(pwrite(fd, xxx, sizeof(xxx), 0), "write-0");
> 	/* Now use truncate to shrink and reexpand. */
>          ERR(ftruncate(fd, 4), "trunc-4");
>          ERR(ftruncate(fd, 4096), "trunc-4096");
>          ERR(close(fd), "close-2");
>          exit(0);
> }
> 

Wouldn't the second truncate end up with a 4k file, and not an 8k?
IE the resulting file will be:
After step 1: 8k
After step 2: 4
After step 3: 4k

Hmm?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-19 11:46   ` David Howells
  2024-08-19 12:48     ` Hannes Reinecke
@ 2024-08-19 14:08     ` David Howells
  2024-08-19 16:39     ` Pankaj Raghav (Samsung)
  2024-08-19 18:40     ` David Howells
  3 siblings, 0 replies; 28+ messages in thread
From: David Howells @ 2024-08-19 14:08 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: dhowells, Pankaj Raghav (Samsung),
	brauner, akpm, chandan.babu, linux-fsdevel, djwong, gost.dev,
	linux-xfs, hch, david, Zi Yan, yang, linux-kernel, linux-mm,
	willy, john.g.garry, cl, p.raghav, mcgrof, ryan.roberts

Hannes Reinecke <hare@suse.de> wrote:

> Wouldn't the second truncate end up with a 4k file, and not an 8k?
> IE the resulting file will be:
> After step 1: 8k
> After step 2: 4
> After step 3: 4k

Yes, but the folio should still be an 8K folio, and it is:

>   pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part

as indicated by the inclusive folio index range ix=00000-00001.

The problem is that the bottom four bytes of the file are getting cleared
somewhere.  They *should* be "XXXX", but they're all zeros.

David



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-15  9:08 [PATCH v12 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (10 preceding siblings ...)
  2024-08-16 19:31 ` [PATCH v12 00/10] enable bs > ps in XFS David Howells
@ 2024-08-19 15:17 ` David Howells
  2024-08-19 16:51 ` David Howells
  12 siblings, 0 replies; 28+ messages in thread
From: David Howells @ 2024-08-19 15:17 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: dhowells, brauner, akpm, chandan.babu, linux-fsdevel, djwong,
	hare, gost.dev, linux-xfs, hch, david, Zi Yan, yang,
	linux-kernel, linux-mm, willy, john.g.garry, cl, p.raghav,
	mcgrof, ryan.roberts

Okay, the code in netfs_invalidate_folio() isn't correct in the way it trims
the dirty region left by streaming writes.  Attached is a patch that shows some
of the changes I need to make - but this is not yet working.

David
---
diff --git a/fs/netfs/misc.c b/fs/netfs/misc.c
index eaa0a992d178..e237c771eeb5 100644
--- a/fs/netfs/misc.c
+++ b/fs/netfs/misc.c
@@ -214,18 +215,34 @@ void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length)
 		/* We have a partially uptodate page from a streaming write. */
 		unsigned int fstart = finfo->dirty_offset;
 		unsigned int fend = fstart + finfo->dirty_len;
-		unsigned int end = offset + length;
+		unsigned int iend = offset + length;
 
 		if (offset >= fend)
 			return;
-		if (end <= fstart)
+		if (iend <= fstart)
+			return;
+
+		/* The invalidation region overlaps the data.  If the region
+		 * covers the start of the data, we either move along the start
+		 * or just erase the data entirely.
+		 */
+		if (offset <= fstart) {
+			if (iend >= fend)
+				goto erase_completely;
+			/* Move the start of the data. */
+			finfo->dirty_len = fend - iend;
+			finfo->dirty_offset = offset;
 			return;
-		if (offset <= fstart && end >= fend)
-			goto erase_completely;
-		if (offset <= fstart && end > fstart)
-			goto reduce_len;
-		if (offset > fstart && end >= fend)
-			goto move_start;
+		}
+
+		/* Reduce the length of the data if the invalidation region
+		 * covers the tail part.
+		 */
+		if (iend >= fend) {
+			finfo->dirty_len = offset - fstart;
+			return;
+		}
+
 		/* A partial write was split.  The caller has already zeroed
 		 * it, so just absorb the hole.
 		 */
@@ -238,12 +261,6 @@ void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length)
 	folio_clear_uptodate(folio);
 	kfree(finfo);
 	return;
-reduce_len:
-	finfo->dirty_len = offset + length - finfo->dirty_offset;
-	return;
-move_start:
-	finfo->dirty_len -= offset - finfo->dirty_offset;
-	finfo->dirty_offset = offset;
 }
 EXPORT_SYMBOL(netfs_invalidate_folio);
 
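
For reference, these are the overlap cases the function has to handle for
a streaming-write region [fstart, fend) against an invalidation
[offset, iend) - a summary of the intent only, since the patch above is
still work in progress:

	/*
	 *   iend <= fstart or offset >= fend -> no overlap, keep the data
	 *   offset <= fstart, iend >= fend   -> erase the data completely
	 *   offset <= fstart, iend <  fend   -> trim the front of the data
	 *   offset >  fstart, iend >= fend   -> trim the tail of the data
	 *   otherwise (a hole inside)        -> already zeroed; absorb it
	 */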


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-19 11:46   ` David Howells
  2024-08-19 12:48     ` Hannes Reinecke
  2024-08-19 14:08     ` David Howells
@ 2024-08-19 16:39     ` Pankaj Raghav (Samsung)
  2024-08-19 18:40     ` David Howells
  3 siblings, 0 replies; 28+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-19 16:39 UTC (permalink / raw)
  To: David Howells
  Cc: brauner, akpm, chandan.babu, linux-fsdevel, djwong, hare,
	gost.dev, linux-xfs, hch, david, Zi Yan, yang, linux-kernel,
	linux-mm, willy, john.g.garry, cl, p.raghav, mcgrof,
	ryan.roberts

> ---
> /* Distillation of the generic/393 xfstest */
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <fcntl.h>
> 
> #define ERR(x, y) do { if ((long)(x) == -1) { perror(y); exit(1); } } while(0)
> 
> static const char xxx[40] = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
> static const char yyy[40] = "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy";
> static const char dropfile[] = "/proc/sys/vm/drop_caches";
> static const char droptype[] = "3";
> static const char file[] = "/xfstest.test/wubble";
> 
> int main(int argc, char *argv[])
> {
>         int fd, drop;
> 
> 	/* Fill in the second 8K block of the file... */
>         fd = open(file, O_CREAT|O_TRUNC|O_WRONLY, 0666);
>         ERR(fd, "open");
>         ERR(ftruncate(fd, 0), "pre-trunc $file");
>         ERR(pwrite(fd, yyy, sizeof(yyy), 0x2000), "write-2000");
>         ERR(close(fd), "close");
> 
> 	/* ... and drop the pagecache so that we get a streaming
> 	 * write, attaching some private data to the folio.
> 	 */
>         drop = open(dropfile, O_WRONLY);
>         ERR(drop, dropfile);
>         ERR(write(drop, droptype, sizeof(droptype) - 1), "write-drop");
>         ERR(close(drop), "close-drop");
> 
>         fd = open(file, O_WRONLY, 0666);
>         ERR(fd, "reopen");
> 	/* Make a streaming write on the first 8K block (needs O_WRONLY). */
>         ERR(pwrite(fd, xxx, sizeof(xxx), 0), "write-0");
> 	/* Now use truncate to shrink and reexpand. */
>         ERR(ftruncate(fd, 4), "trunc-4");
>         ERR(ftruncate(fd, 4096), "trunc-4096");
>         ERR(close(fd), "close-2");
>         exit(0);
> }

I tried this code on XFS, and it is working as expected (I am getting
xxxx).

[nix-shell:~/xfstests]# hexdump -C /media/test/wubble
00000000  78 78 78 78 00 00 00 00  00 00 00 00 00 00 00 00  |xxxx............|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000

I did some tracing as well and here are the results.

$ trace-cmd record -e xfs_file_fsync -e xfs_file_buffered_write -e xfs_setattr -e xfs_zero_eof -F -c ./a.out

[nix-shell:~/xfstests]# trace-cmd report
cpus=4
           a.out-3872  [003] 84120.161472: xfs_setattr:          dev 259:0 ino 0x103 iflags 0x0
           a.out-3872  [003] 84120.172109: xfs_setattr:          dev 259:0 ino 0x103 iflags 0x20 
           a.out-3872  [003] 84120.172151: xfs_zero_eof:         dev 259:0 ino 0x103 isize 0x0 disize 0x0 pos 0x0 bytecount 0x2000 // First truncate
           a.out-3872  [003] 84120.172156: xfs_file_buffered_write: dev 259:0 ino 0x103 disize 0x0 pos 0x2000 bytecount 0x28
           a.out-3872  [003] 84120.185423: xfs_file_buffered_write: dev 259:0 ino 0x103 disize 0x2028 pos 0x0 bytecount 0x28
           a.out-3872  [003] 84120.185477: xfs_setattr:          dev 259:0 ino 0x103 iflags 0x0
           a.out-3872  [003] 84120.186493: xfs_setattr:          dev 259:0 ino 0x103 iflags 0x20
           a.out-3872  [003] 84120.186495: xfs_zero_eof:         dev 259:0 ino 0x103 isize 0x4 disize 0x4 pos 0x4 bytecount 0xffc // Third truncate

The first and third truncates result in calls to xfs_zero_eof, as we are
increasing the size of the file.

When we do the second ftruncate(fd, 4), we call into iomap_truncate_page() with
offset 0:

int
iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
		const struct iomap_ops *ops)
{
	unsigned int blocksize = i_blocksize(inode);
	unsigned int off = pos & (blocksize - 1);

	/* Block boundary? Nothing to do */
	if (!off)
		return 0;
	return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops);
}

As you can see, we take into account the blocksize (from which the min
order is derived during inode init) and make sure the sub-block zeroing
is done correctly.
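
As an illustration of that arithmetic with a large block size
(hypothetical numbers, not taken from the trace above):

	/* ftruncate(fd, 4) on a filesystem with an 8k block size: */
	unsigned int blocksize = 8192;
	loff_t pos = 4;
	unsigned int off = pos & (blocksize - 1);	/* = 4, not aligned */

	/* off != 0, so iomap_truncate_page() calls
	 * iomap_zero_range(inode, 4, 8188, ...), zeroing the rest of the
	 * block while leaving the first 4 bytes ("xxxx") intact. */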

Also, if you look at iomap_invalidate_folio(), we don't remove the folio's
private data until the whole folio is invalidated.

I doubt we are doing anything wrong in the page cache layer with these
patches.

All we do with minorder support is make sure we always allocate folios
in the page cache that are at least min order in size and aligned to the
min order (PATCH 2 and 3), and we maintain this even when we do a split
(PATCH 4).
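
In other words, the invariant is (a sketch, assuming min order 2, i.e.
16k folios on 4k pages):

	pgoff_t index = 5;	/* any page index into the file */
	unsigned int min_order = 2;
	pgoff_t folio_index = round_down(index, 1UL << min_order);	/* = 4 */

	/* The folio covering index 5 starts at index 4 and spans indices
	 * 4-7, so it is both at least 16k in size and 16k-aligned. */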

I hope this helps!

--
Pankaj


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-15  9:08 [PATCH v12 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (11 preceding siblings ...)
  2024-08-19 15:17 ` David Howells
@ 2024-08-19 16:51 ` David Howells
  12 siblings, 0 replies; 28+ messages in thread
From: David Howells @ 2024-08-19 16:51 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: dhowells, brauner, akpm, chandan.babu, linux-fsdevel, djwong,
	hare, gost.dev, linux-xfs, hch, david, Zi Yan, yang,
	linux-kernel, linux-mm, willy, john.g.garry, cl, p.raghav,
	mcgrof, ryan.roberts

Okay, I think there is a bug in your patches also.  If I do:

	xfs_io -t -f -c "pwrite -S 0x58 0 40" -c "fsync" \
		-c "truncate 4" -c "truncate 4096" \
		/xfstest.test/wubble; od /xfstest.test/wubble

I see:

  xfs_io-6059: netfs_truncate: ni=9e isz=1000 rsz=1000 zp=0 to=0
  xfs_io-6059: netfs_set_size: ni=9e resize-file isz=0 rsz=0 zp=0
  xfs_io-6059: netfs_write_iter: WRITE-ITER i=9e s=0 l=28 f=0
  xfs_io-6059: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 mod-n-clear d=5858585858585858
  xfs_io-6059: netfs_write: R=0000000c WRITEBACK c=00000002 i=9e by=0-ffffffffffffffff
  xfs_io-6059: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 store d=5858585858585858
  xfs_io-6059: netfs_sreq: R=0000000c[1] UPLD PREP  f=00 s=0 0/0 e=0
  xfs_io-6059: netfs_sreq: R=0000000c[1] UPLD SUBMT f=100 s=0 0/28 e=0
 kworker-5948: netfs_sreq: R=0000000c[1] UPLD TERM  f=100 s=0 28/28 e=0
 kworker-5948: netfs_rreq: R=0000000c WB COLLECT f=2120
 kworker-5948: netfs_sreq: R=0000000c[1] UPLD FREE  f=00 s=0 28/28 e=0
 kworker-5948: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 clear d=5858585858585858
 kworker-5948: netfs_rreq: R=0000000c WB WR-DONE f=2120
 kworker-5948: netfs_rreq: R=0000000c WB WAKE-IP f=2120
 kworker-5948: netfs_rreq: R=0000000c WB FREE    f=2100
  xfs_io-6059: netfs_truncate: ni=9e isz=28 rsz=28 zp=0 to=4
  xfs_io-6059: netfs_set_size: ni=9e resize-file isz=4 rsz=4 zp=0

But ->release_folio() should have been called here because netfs_inode_init()
would have called mapping_set_release_always() for ordinary afs files.

  xfs_io-6059: netfs_truncate: ni=9e isz=4 rsz=4 zp=0 to=1000
  xfs_io-6059: netfs_set_size: ni=9e resize-file isz=1000 rsz=1000 zp=0
      od-6060: netfs_read: R=0000000d READAHEAD c=00000002 ni=9e s=0 l=2000 sz=1000
      od-6060: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 read d=58585858
      od-6060: netfs_sreq: R=0000000d[1] ---- ADD   f=00 s=0 0/2000 e=0
      od-6060: netfs_sreq: R=0000000d[1] ZERO SUBMT f=00 s=0 0/2000 e=0
      od-6060: netfs_sreq: R=0000000d[1] ZERO CLEAR f=02 s=0 2000/2000 e=0
      od-6060: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 read-done d=0
      od-6060: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 read-unlock d=0
      od-6060: netfs_sreq: R=0000000d[1] ZERO TERM  f=02 s=0 2000/2000 e=0
      od-6060: netfs_sreq: R=0000000d[1] ZERO FREE  f=02 s=0 2000/2000 e=0
      od-6060: netfs_rreq: R=0000000d RA ASSESS  f=20
      od-6060: netfs_rreq: R=0000000d RA WAKE-IP f=20
      od-6060: netfs_rreq: R=0000000d RA DONE    f=00
      od-6060: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 read-put d=0
 kworker-5948: netfs_rreq: R=0000000d RA FREE    f=00

David



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-19 11:46   ` David Howells
                       ` (2 preceding siblings ...)
  2024-08-19 16:39     ` Pankaj Raghav (Samsung)
@ 2024-08-19 18:40     ` David Howells
  2024-08-20  9:17       ` Pankaj Raghav (Samsung)
  3 siblings, 1 reply; 28+ messages in thread
From: David Howells @ 2024-08-19 18:40 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: dhowells, brauner, akpm, chandan.babu, linux-fsdevel, djwong,
	hare, gost.dev, linux-xfs, hch, david, Zi Yan, yang,
	linux-kernel, linux-mm, willy, john.g.garry, cl, p.raghav,
	mcgrof, ryan.roberts

Pankaj Raghav (Samsung) <kernel@pankajraghav.com> wrote:

> I tried this code on XFS, and it is working as expected (I am getting
> xxxx).

XFS doesn't try to use mapping_set_release_always().

David



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-19 18:40     ` David Howells
@ 2024-08-20  9:17       ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 28+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-20  9:17 UTC (permalink / raw)
  To: David Howells
  Cc: brauner, akpm, chandan.babu, linux-fsdevel, djwong, hare,
	gost.dev, linux-xfs, hch, david, Zi Yan, yang, linux-kernel,
	linux-mm, willy, john.g.garry, cl, p.raghav, mcgrof,
	ryan.roberts

On Mon, Aug 19, 2024 at 07:40:44PM +0100, David Howells wrote:
> Pankaj Raghav (Samsung) <kernel@pankajraghav.com> wrote:
> 
> > I tried this code on XFS, and it is working as expected (I am getting
> > xxxx).
> 
> XFS doesn't try to use mapping_set_release_always().

Thanks, David, for digging deep. It is indeed a bug in this patchset
(PATCH 1). I think I overlooked the way we mask the folio order bits
when we changed it some time back.

But I still don't know why AS_RELEASE_ALWAYS is being cleared, because
it is BIT 6 and the existing bug should not affect BIT 6.

The following triggers an ASSERT failure.

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0fcf235e5023..35961d73d54a 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -88,9 +88,13 @@ xfs_inode_alloc(
 
        /* VFS doesn't initialise i_mode! */
        VFS_I(ip)->i_mode = 0;
+       mapping_set_unevictable(VFS_I(ip)->i_mapping);
        mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
                                    M_IGEO(mp)->min_folio_order);
 
+       ASSERT(mapping_unevictable(VFS_I(ip)->i_mapping) == 1);
+
+       mapping_clear_unevictable(VFS_I(ip)->i_mapping);
        XFS_STATS_INC(mp, vn_active);
        ASSERT(atomic_read(&ip->i_pincount) == 0);
        ASSERT(ip->i_ino == 0);
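
To spell out the mask arithmetic behind this (values as defined in
PATCH 1; the shifted min/max order fields live at higher bit positions):

	#define AS_FOLIO_ORDER_BITS	5
	#define AS_FOLIO_ORDER_MASK	((1u << AS_FOLIO_ORDER_BITS) - 1)	/* 0x1f */

	/* The buggy code masked with ~AS_FOLIO_ORDER_MASK, i.e. it
	 * cleared bits 0-4 of mapping->flags - which hold unrelated
	 * flags such as AS_UNEVICTABLE - while leaving the shifted
	 * order fields alone.  That is exactly why the ASSERT above
	 * fires. */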

The patch that fixes this is:

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 61a7649d86e5..5e245b8dcfd6 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -217,6 +217,7 @@ enum mapping_flags {
 #define AS_FOLIO_ORDER_MASK     ((1u << AS_FOLIO_ORDER_BITS) - 1)
 #define AS_FOLIO_ORDER_MIN_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MIN)
 #define AS_FOLIO_ORDER_MAX_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MAX)
+#define AS_FOLIO_ORDER_MIN_MAX_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
 
 /**
  * mapping_set_error - record a writeback error in the address_space
@@ -418,7 +419,7 @@ static inline void mapping_set_folio_order_range(struct address_space *mapping,
        if (max < min)
                max = min;
 
-       mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
+       mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MIN_MAX_MASK) |
                (min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);
 }
 
Could you try this patch and see if it fixes it by any chance?

--
Pankaj


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-16 19:31 ` [PATCH v12 00/10] enable bs > ps in XFS David Howells
                     ` (3 preceding siblings ...)
  2024-08-19 11:59   ` David Howells
@ 2024-08-20 23:24   ` David Howells
  2024-08-21  7:16     ` Pankaj Raghav (Samsung)
  4 siblings, 1 reply; 28+ messages in thread
From: David Howells @ 2024-08-20 23:24 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: dhowells, brauner, akpm, chandan.babu, linux-fsdevel, djwong,
	hare, gost.dev, linux-xfs, hch, david, Zi Yan, yang,
	linux-kernel, linux-mm, willy, john.g.garry, cl, p.raghav,
	mcgrof, ryan.roberts

Okay, I think I've found the bugs in my code and in truncate.  It appears
they're exposed by your code, but they exist upstream.  You can add:

	Tested-by: David Howells <dhowells@redhat.com>

to patches 1-5 if you wish.

David



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v12 00/10] enable bs > ps in XFS
  2024-08-20 23:24   ` David Howells
@ 2024-08-21  7:16     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 28+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-21  7:16 UTC (permalink / raw)
  To: David Howells
  Cc: brauner, akpm, chandan.babu, linux-fsdevel, djwong, hare,
	gost.dev, linux-xfs, hch, david, Zi Yan, yang, linux-kernel,
	linux-mm, willy, john.g.garry, cl, p.raghav, mcgrof,
	ryan.roberts

On Wed, Aug 21, 2024 at 12:24:24AM +0100, David Howells wrote:
> Okay, I think I've found the bugs in my code and in truncate.  It appears
> they're exposed by your code, but they exist upstream.  You can add:
> 
> 	Tested-by: David Howells <dhowells@redhat.com>
> 
> to patches 1-5 if you wish.

Thanks David. I will send a new version with your Tested-by and the one
fix in the first patch.

-- 
Pankaj


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2024-08-21  7:17 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-15  9:08 [PATCH v12 00/10] enable bs > ps in XFS Pankaj Raghav (Samsung)
2024-08-15  9:08 ` [PATCH v12 01/10] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
2024-08-15  9:08 ` [PATCH v12 02/10] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
2024-08-15  9:08 ` [PATCH v12 03/10] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
2024-08-15  9:08 ` [PATCH v12 04/10] mm: split a folio in minimum folio order chunks Pankaj Raghav (Samsung)
2024-08-15  9:08 ` [PATCH v12 05/10] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Pankaj Raghav (Samsung)
2024-08-15  9:08 ` [PATCH v12 06/10] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
2024-08-15  9:08 ` [PATCH v12 07/10] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
2024-08-15  9:08 ` [PATCH v12 08/10] xfs: expose block size in stat Pankaj Raghav (Samsung)
2024-08-15  9:08 ` [PATCH v12 09/10] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
2024-08-15  9:08 ` [PATCH v12 10/10] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
2024-08-16 19:31 ` [PATCH v12 00/10] enable bs > ps in XFS David Howells
2024-08-18 16:51   ` Pankaj Raghav (Samsung)
2024-08-18 20:16   ` David Howells
2024-08-19  7:24     ` Hannes Reinecke
2024-08-19  7:37       ` Pankaj Raghav (Samsung)
2024-08-19 12:25     ` David Howells
2024-08-19 11:46   ` David Howells
2024-08-19 12:48     ` Hannes Reinecke
2024-08-19 14:08     ` David Howells
2024-08-19 16:39     ` Pankaj Raghav (Samsung)
2024-08-19 18:40     ` David Howells
2024-08-20  9:17       ` Pankaj Raghav (Samsung)
2024-08-19 11:59   ` David Howells
2024-08-20 23:24   ` David Howells
2024-08-21  7:16     ` Pankaj Raghav (Samsung)
2024-08-19 15:17 ` David Howells
2024-08-19 16:51 ` David Howells

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox