linux-mm.kvack.org archive mirror
* [PATCH v2 00/13] enable bs > ps in XFS
@ 2024-03-01 16:44 Pankaj Raghav (Samsung)
  2024-03-01 16:44 ` [PATCH v2 01/13] mm: Support order-1 folios in the page cache Pankaj Raghav (Samsung)
                   ` (12 more replies)
  0 siblings, 13 replies; 20+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-01 16:44 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: djwong, mcgrof, linux-mm, hare, david, akpm, gost.dev,
	linux-kernel, chandan.babu, willy, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

This is the second version of the series that enables block size > page size
(Large Block Size) in XFS. The context and motivation can be found in the
cover letter of RFC v1 [1]. We also recorded a talk about this effort at
LPC [3], for anyone who would like more context.

A lot of emphasis has been put on testing using kdevops. The testing has
been split into regression and progression.

Regression testing:
In regression testing, we ran the whole test suite to check for
*regressions on existing profiles caused by the page cache changes.

No regression was found with the patches added on top.

*The baseline for regression testing was created with a SOAK_DURATION of
2.5 hours, using about 7-8 XFS test clusters to run fstests in a loop over
70 times. We then scraped the results for critical failures (crashes, XFS
or page cache asserts, or hung tasks) and reported these to the community
as well [4].

Progression testing:
For progression testing, we tested 8k, 16k, 32k and 64k block sizes.
To compare against existing support, an ARM VM with a 64k base page size
(without our patches) was used as a reference to check for failures
actually caused by LBS support on a 4k base page size system.

There are some common upstream failures for bs=64k that need to be
fixed [5].
There are also some tests that assume block size < page size and need to
be fixed. I have a tree with fixes for xfstests here [6], which I will be
sending to the list soon.

No new failures were found with the LBS support.

We've done some preliminary performance tests with fio on XFS with a 4k
block size against pmem and NVMe, using buffered IO and Direct IO, comparing
vanilla v6.8-rc4 with v6.8-rc4 plus these patches applied, and detected no
regressions.

We also wrote an eBPF tool called blkalgn [7] to check whether the IO sent
to the device is aligned and at least the filesystem block size in length.
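
For illustration, the property the tool checks for each request can be
written as a small C predicate (this is only a sketch of the condition
being verified, not the eBPF implementation; the function name is made up):

#include <stdbool.h>

static bool io_is_lbs_friendly(unsigned long long offset, unsigned int len,
			       unsigned int fs_block_size)
{
	/* The IO must start on a filesystem block boundary ... */
	if (offset % fs_block_size)
		return false;
	/* ... and cover at least one full filesystem block. */
	return len >= fs_block_size;
}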

Git tree:
https://github.com/linux-kdevops/linux/tree/large-block-minorder-6.8.0-rc5-v2

[1] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
[2] https://lore.kernel.org/linux-xfs/20240213093713.1753368-1-kernel@pankajraghav.com/
[3] https://www.youtube.com/watch?v=ar72r5Xf7x4
[4] https://github.com/linux-kdevops/kdevops/blob/master/docs/xfs-bugs.md
489 non-critical issues and 55 critical issues. We've determined and reported
that the 55 critical issues all fall into 5 common XFS asserts or hung tasks
and 2 memory management asserts.
[5] https://lore.kernel.org/linux-xfs/fe7fec1c-3b08-430f-9c95-ea76b237acf4@samsung.com/
[6] https://github.com/Panky-codes/xfstests/tree/lbs-fixes
[7] https://github.com/iovisor/bcc/pull/4813

Changes since v1:
- Round up to nearest min nr pages in ra_init
- Calculate index in filemap_create instead of doing in
  filemap_get_pages
- Remove unnecessary BUG_ONs in the delete path
- Use check_shl_overflow instead of check_mul_overflow
- Cast to uint32_t instead of unsigned long in xfs_stat_blksize

Changes since RFC v2:
- Move order 1 patch above the 1st patch
- Remove order == 1 conditional in `fs: Allow fine-grained control of
folio sizes`. This fixed generic/630 that was reported in the previous version.
- Hide the max order and expose `mapping_set_folio_min_order` instead.
- Add new helper mapping_start_index_align and DEFINE_READAHEAD_ALIGN
- Don't call `page_cache_ra_order` with min order in do_mmap_sync_readahead
- Simplify ondemand readahead by only aligning the start index at the end
- Don't cap ra_pages based on bdi->io_pages
- Use `checked_mul_overflow` while calculating bytes in validate_fsb
- Remove config lbs option
- Add a warning while mounting a LBS kernel
- Add Acked-by and Reviewed-by from Hannes and Darrick.

Changes since RFC v1:
- Added willy's patch to enable order-1 folios.
- Unified common page cache effort from Hannes LBS work.
- Added a new helper min_nrpages and added CONFIG_THP for enabling mapping_large_folio_support
- Don't split a folio if it has minorder set. Remove the old code where we set extra pins if it has that requirement.
- Split the XFS code so that the mapping count validation is a separate patch, and put the icache code changes together with enabling bs > ps.
- Added CONFIG_XFS_LBS option
- align the index in do_read_cache_folio()
- Removed truncate changes
- Fixed generic/091 with iomap changes to iomap_dio_zero function.
- Took care of folio truncation scenario in page_cache_ra_unbounded() that happens after read_pages if a folio was found.
- Squashed and moved commits around
- Rebased on top of v6.8-rc4

Hannes Reinecke (1):
  readahead: rework loop in page_cache_ra_unbounded()

Luis Chamberlain (3):
  filemap: align the index to mapping_min_order in the page cache
  readahead: round up file_ra_state->ra_pages to mapping_min_nrpages
  readahead: align index to mapping_min_order in ondemand_ra and
    force_ra

Matthew Wilcox (Oracle) (2):
  mm: Support order-1 folios in the page cache
  fs: Allow fine-grained control of folio sizes

Pankaj Raghav (7):
  filemap: use mapping_min_order while allocating folios
  readahead: allocate folios with mapping_min_order in
    ra_(unbounded|order)
  mm: do not split a folio if it has minimum folio order requirement
  iomap: fix iomap_dio_zero() for fs bs > system page size
  xfs: expose block size in stat
  xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  xfs: enable block size larger than page size support

 fs/iomap/direct-io.c       |  13 ++++-
 fs/xfs/libxfs/xfs_ialloc.c |   5 ++
 fs/xfs/libxfs/xfs_shared.h |   3 +
 fs/xfs/xfs_icache.c        |   6 +-
 fs/xfs/xfs_iops.c          |   2 +-
 fs/xfs/xfs_mount.c         |  10 +++-
 fs/xfs/xfs_super.c         |  10 +---
 include/linux/huge_mm.h    |   7 ++-
 include/linux/pagemap.h    | 110 ++++++++++++++++++++++++++++++-------
 mm/filemap.c               |  35 ++++++++----
 mm/huge_memory.c           |  36 ++++++++++--
 mm/internal.h              |   4 +-
 mm/readahead.c             |  73 +++++++++++++++++-------
 13 files changed, 236 insertions(+), 78 deletions(-)


base-commit: b401b621758e46812da61fa58a67c3fd8d91de0d
-- 
2.43.0




* [PATCH v2 01/13] mm: Support order-1 folios in the page cache
  2024-03-01 16:44 [PATCH v2 00/13] enable bs > ps in XFS Pankaj Raghav (Samsung)
@ 2024-03-01 16:44 ` Pankaj Raghav (Samsung)
  2024-03-01 17:08   ` Hannes Reinecke
  2024-03-01 16:44 ` [PATCH v2 02/13] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 20+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-01 16:44 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: djwong, mcgrof, linux-mm, hare, david, akpm, gost.dev,
	linux-kernel, chandan.babu, willy

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Folios of order 1 have no space to store the deferred list.  This is
not a problem for the page cache as file-backed folios are never
placed on the deferred list.  All we need to do is prevent the core
MM from touching the deferred list for order 1 folios and remove the
code which prevented us from allocating order 1 folios.

Link: https://lore.kernel.org/linux-mm/90344ea7-4eec-47ee-5996-0c22f42d6a6a@google.com/
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/huge_mm.h |  7 +++++--
 mm/filemap.c            |  2 --
 mm/huge_memory.c        | 23 ++++++++++++++++++-----
 mm/internal.h           |  4 +---
 mm/readahead.c          |  3 ---
 5 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5adb86af35fc..916a2a539517 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -263,7 +263,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
 
-void folio_prep_large_rmappable(struct folio *folio);
+struct folio *folio_prep_large_rmappable(struct folio *folio);
 bool can_split_folio(struct folio *folio, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
@@ -410,7 +410,10 @@ static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 	return 0;
 }
 
-static inline void folio_prep_large_rmappable(struct folio *folio) {}
+static inline struct folio *folio_prep_large_rmappable(struct folio *folio)
+{
+	return folio;
+}
 
 #define transparent_hugepage_flags 0UL
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 750e779c23db..2b00442b9d19 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1912,8 +1912,6 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp_t alloc_gfp = gfp;
 
 			err = -ENOMEM;
-			if (order == 1)
-				order = 0;
 			if (order > 0)
 				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
 			folio = filemap_alloc_folio(alloc_gfp, order);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 94c958f7ebb5..81fd1ba57088 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -788,11 +788,15 @@ struct deferred_split *get_deferred_split_queue(struct folio *folio)
 }
 #endif
 
-void folio_prep_large_rmappable(struct folio *folio)
+struct folio *folio_prep_large_rmappable(struct folio *folio)
 {
-	VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
-	INIT_LIST_HEAD(&folio->_deferred_list);
+	if (!folio || !folio_test_large(folio))
+		return folio;
+	if (folio_order(folio) > 1)
+		INIT_LIST_HEAD(&folio->_deferred_list);
 	folio_set_large_rmappable(folio);
+
+	return folio;
 }
 
 static inline bool is_transparent_hugepage(struct folio *folio)
@@ -3082,7 +3086,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	/* Prevent deferred_split_scan() touching ->_refcount */
 	spin_lock(&ds_queue->split_queue_lock);
 	if (folio_ref_freeze(folio, 1 + extra_pins)) {
-		if (!list_empty(&folio->_deferred_list)) {
+		if (folio_order(folio) > 1 &&
+		    !list_empty(&folio->_deferred_list)) {
 			ds_queue->split_queue_len--;
 			list_del(&folio->_deferred_list);
 		}
@@ -3133,6 +3138,9 @@ void folio_undo_large_rmappable(struct folio *folio)
 	struct deferred_split *ds_queue;
 	unsigned long flags;
 
+	if (folio_order(folio) <= 1)
+		return;
+
 	/*
 	 * At this point, there is no one trying to add the folio to
 	 * deferred_list. If folio is not in deferred_list, it's safe
@@ -3158,7 +3166,12 @@ void deferred_split_folio(struct folio *folio)
 #endif
 	unsigned long flags;
 
-	VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
+	/*
+	 * Order 1 folios have no space for a deferred list, but we also
+	 * won't waste much memory by not adding them to the deferred list.
+	 */
+	if (folio_order(folio) <= 1)
+		return;
 
 	/*
 	 * The try_to_unmap() in page reclaim path might reach here too,
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..5174b5b0c344 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -419,9 +419,7 @@ static inline struct folio *page_rmappable_folio(struct page *page)
 {
 	struct folio *folio = (struct folio *)page;
 
-	if (folio && folio_order(folio) > 1)
-		folio_prep_large_rmappable(folio);
-	return folio;
+	return folio_prep_large_rmappable(folio);
 }
 
 static inline void prep_compound_head(struct page *page, unsigned int order)
diff --git a/mm/readahead.c b/mm/readahead.c
index 2648ec4f0494..369c70e2be42 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -516,9 +516,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		/* Don't allocate pages past EOF */
 		while (index + (1UL << order) - 1 > limit)
 			order--;
-		/* THP machinery does not support order-1 */
-		if (order == 1)
-			order = 0;
 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
 		if (err)
 			break;
-- 
2.43.0




* [PATCH v2 02/13] fs: Allow fine-grained control of folio sizes
  2024-03-01 16:44 [PATCH v2 00/13] enable bs > ps in XFS Pankaj Raghav (Samsung)
  2024-03-01 16:44 ` [PATCH v2 01/13] mm: Support order-1 folios in the page cache Pankaj Raghav (Samsung)
@ 2024-03-01 16:44 ` Pankaj Raghav (Samsung)
  2024-03-01 17:09   ` Hannes Reinecke
  2024-03-01 16:44 ` [PATCH v2 03/13] filemap: align the index to mapping_min_order in the page cache Pankaj Raghav (Samsung)
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 20+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-01 16:44 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: djwong, mcgrof, linux-mm, hare, david, akpm, gost.dev,
	linux-kernel, chandan.babu, willy, Pankaj Raghav

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Some filesystems want to be able to ensure that folios that are added to
the page cache are at least a certain size.
Add mapping_set_folio_min_order() to allow this level of control.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
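Not part of the diff below: a minimal usage sketch, assuming a filesystem
whose block size (1 << blkbits) may exceed the page size. The function name
is made up for illustration; XFS does the equivalent in patch 13 of this
series.

static void example_fs_setup_mapping(struct inode *inode, unsigned int blkbits)
{
	unsigned int min_order = 0;

	/* One folio must span at least one filesystem block. */
	if (blkbits > PAGE_SHIFT)
		min_order = blkbits - PAGE_SHIFT;

	/* Call from the inode constructor, before the inode is active. */
	mapping_set_folio_min_order(inode->i_mapping, min_order);
}
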
 include/linux/pagemap.h | 100 ++++++++++++++++++++++++++++++++--------
 1 file changed, 80 insertions(+), 20 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2df35e65557d..fc8eb9c94e9c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -202,13 +202,18 @@ enum mapping_flags {
 	AS_EXITING	= 4, 	/* final truncate in progress */
 	/* writeback related tags are not used */
 	AS_NO_WRITEBACK_TAGS = 5,
-	AS_LARGE_FOLIO_SUPPORT = 6,
-	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
-	AS_STABLE_WRITES,	/* must wait for writeback before modifying
+	AS_RELEASE_ALWAYS = 6,	/* Call ->release_folio(), even if no private data */
+	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
 				   folio contents */
-	AS_UNMOVABLE,		/* The mapping cannot be moved, ever */
+	AS_FOLIO_ORDER_MIN = 8,
+	AS_FOLIO_ORDER_MAX = 13, /* Bit 8-17 are used for FOLIO_ORDER */
+	AS_UNMOVABLE = 18,		/* The mapping cannot be moved, ever */
 };
 
+#define AS_FOLIO_ORDER_MIN_MASK 0x00001f00
+#define AS_FOLIO_ORDER_MAX_MASK 0x0003e000
+#define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
+
 /**
  * mapping_set_error - record a writeback error in the address_space
  * @mapping: the mapping in which an error should be set
@@ -344,9 +349,47 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 	m->gfp_mask = mask;
 }
 
+/*
+ * There are some parts of the kernel which assume that PMD entries
+ * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
+ * limit the maximum allocation order to PMD size.  I'm not aware of any
+ * assumptions about maximum order if THP are disabled, but 8 seems like
+ * a good order (that's 1MB if you're using 4kB pages)
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
+#else
+#define MAX_PAGECACHE_ORDER	8
+#endif
+
+/*
+ * mapping_set_folio_min_order() - Set the minimum folio order
+ * @mapping: The address_space.
+ * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
+ *
+ * The filesystem should call this function in its inode constructor to
+ * indicate which base size of folio the VFS can use to cache the contents
+ * of the file.  This should only be used if the filesystem needs special
+ * handling of folio sizes (ie there is something the core cannot know).
+ * Do not tune it based on, eg, i_size.
+ *
+ * Context: This should not be called while the inode is active as it
+ * is non-atomic.
+ */
+static inline void mapping_set_folio_min_order(struct address_space *mapping,
+					       unsigned int min)
+{
+	if (min > MAX_PAGECACHE_ORDER)
+		min = MAX_PAGECACHE_ORDER;
+
+	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
+			 (min << AS_FOLIO_ORDER_MIN) |
+			 (MAX_PAGECACHE_ORDER << AS_FOLIO_ORDER_MAX);
+}
+
 /**
  * mapping_set_large_folios() - Indicate the file supports large folios.
- * @mapping: The file.
+ * @mapping: The address_space.
  *
  * The filesystem should call this function in its inode constructor to
  * indicate that the VFS can use large folios to cache the contents of
@@ -357,7 +400,37 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
  */
 static inline void mapping_set_large_folios(struct address_space *mapping)
 {
-	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	mapping_set_folio_min_order(mapping, 0);
+}
+
+static inline unsigned int mapping_max_folio_order(struct address_space *mapping)
+{
+	return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
+}
+
+static inline unsigned int mapping_min_folio_order(struct address_space *mapping)
+{
+	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
+}
+
+static inline unsigned long mapping_min_folio_nrpages(struct address_space *mapping)
+{
+	return 1UL << mapping_min_folio_order(mapping);
+}
+
+/**
+ * mapping_align_start_index() - Align starting index based on the min
+ * folio order of the page cache.
+ * @mapping: The address_space.
+ *
+ * Ensure the index used is aligned to the minimum folio order when adding
+ * new folios to the page cache by rounding down to the nearest minimum
+ * folio number of pages.
+ */
+static inline pgoff_t mapping_align_start_index(struct address_space *mapping,
+						pgoff_t index)
+{
+	return round_down(index, mapping_min_folio_nrpages(mapping));
 }
 
 /*
@@ -367,7 +440,7 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
 static inline bool mapping_large_folio_support(struct address_space *mapping)
 {
 	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	       (mapping_max_folio_order(mapping) > 0);
 }
 
 static inline int filemap_nr_thps(struct address_space *mapping)
@@ -528,19 +601,6 @@ static inline void *detach_page_private(struct page *page)
 	return folio_detach_private(page_folio(page));
 }
 
-/*
- * There are some parts of the kernel which assume that PMD entries
- * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
- * limit the maximum allocation order to PMD size.  I'm not aware of any
- * assumptions about maximum order if THP are disabled, but 8 seems like
- * a good order (that's 1MB if you're using 4kB pages)
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
-#else
-#define MAX_PAGECACHE_ORDER	8
-#endif
-
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
 #else
-- 
2.43.0




* [PATCH v2 03/13] filemap: align the index to mapping_min_order in the page cache
  2024-03-01 16:44 [PATCH v2 00/13] enable bs > ps in XFS Pankaj Raghav (Samsung)
  2024-03-01 16:44 ` [PATCH v2 01/13] mm: Support order-1 folios in the page cache Pankaj Raghav (Samsung)
  2024-03-01 16:44 ` [PATCH v2 02/13] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-03-01 16:44 ` Pankaj Raghav (Samsung)
  2024-03-01 19:26   ` Matthew Wilcox
  2024-03-01 16:44 ` [PATCH v2 04/13] filemap: use mapping_min_order while allocating folios Pankaj Raghav (Samsung)
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 20+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-01 16:44 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: djwong, mcgrof, linux-mm, hare, david, akpm, gost.dev,
	linux-kernel, chandan.babu, willy, Pankaj Raghav

From: Luis Chamberlain <mcgrof@kernel.org>

Supporting mapping_min_order implies that we guarantee each folio in the
page cache has at least an order of mapping_min_order. So when adding new
folios to the page cache we must ensure the index used is aligned to the
mapping_min_order as the page cache requires the index to be aligned to
the order of the folio.

A folio with an order higher than min_order is by definition a multiple of
the min_order in size. If an index is aligned to an order higher than the
min_order, it will also be aligned to the min_order.

This effectively introduces no functional change when min_order is not set,
other than a few rounding computations that result in the same value.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
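To illustrate the alignment rule above with a hypothetical min_order of 2
(folios of at least 4 pages):

	/* mapping_min_folio_order() == 2, mapping_min_folio_nrpages() == 4 */
	mapping_align_start_index(mapping, 0);	/* -> 0 */
	mapping_align_start_index(mapping, 5);	/* -> 4 */
	mapping_align_start_index(mapping, 7);	/* -> 4 */
	mapping_align_start_index(mapping, 8);	/* -> 8 */

	/*
	 * An index that is aligned to a higher order, say order 4 (a
	 * multiple of 16 pages), is automatically a multiple of 4 pages,
	 * so folios larger than min_order need no extra rounding.
	 */
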
 include/linux/pagemap.h | 10 +++++++++-
 mm/filemap.c            | 16 ++++++++++------
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index fc8eb9c94e9c..b3cf8ef89826 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1328,6 +1328,14 @@ struct readahead_control {
 		._index = i,						\
 	}
 
+#define DEFINE_READAHEAD_ALIGNED(ractl, f, r, m, i)			\
+	struct readahead_control ractl = {				\
+		.file = f,						\
+		.mapping = m,						\
+		.ra = r,						\
+		._index = mapping_align_start_index(m, i),		\
+	}
+
 #define VM_READAHEAD_PAGES	(SZ_128K / PAGE_SIZE)
 
 void page_cache_ra_unbounded(struct readahead_control *,
@@ -1356,7 +1364,7 @@ void page_cache_sync_readahead(struct address_space *mapping,
 		struct file_ra_state *ra, struct file *file, pgoff_t index,
 		unsigned long req_count)
 {
-	DEFINE_READAHEAD(ractl, file, ra, mapping, index);
+	DEFINE_READAHEAD_ALIGNED(ractl, file, ra, mapping, index);
 	page_cache_sync_ra(&ractl, req_count);
 }
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 2b00442b9d19..96fe5c7fe094 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2416,11 +2416,13 @@ static int filemap_update_page(struct kiocb *iocb,
 }
 
 static int filemap_create_folio(struct file *file,
-		struct address_space *mapping, pgoff_t index,
+		struct address_space *mapping, loff_t pos,
 		struct folio_batch *fbatch)
 {
 	struct folio *folio;
 	int error;
+	unsigned int min_order = mapping_min_folio_order(mapping);
+	pgoff_t index;
 
 	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
 	if (!folio)
@@ -2440,6 +2442,8 @@ static int filemap_create_folio(struct file *file,
 	 * well to keep locking rules simple.
 	 */
 	filemap_invalidate_lock_shared(mapping);
+	/* index in PAGE units but aligned to min_order number of pages. */
+	index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
 	error = filemap_add_folio(mapping, folio, index,
 			mapping_gfp_constraint(mapping, GFP_KERNEL));
 	if (error == -EEXIST)
@@ -2500,8 +2504,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 	if (!folio_batch_count(fbatch)) {
 		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
 			return -EAGAIN;
-		err = filemap_create_folio(filp, mapping,
-				iocb->ki_pos >> PAGE_SHIFT, fbatch);
+		err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
 		if (err == AOP_TRUNCATED_PAGE)
 			goto retry;
 		return err;
@@ -3093,7 +3096,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	struct file *file = vmf->vma->vm_file;
 	struct file_ra_state *ra = &file->f_ra;
 	struct address_space *mapping = file->f_mapping;
-	DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
+	DEFINE_READAHEAD_ALIGNED(ractl, file, ra, mapping, vmf->pgoff);
 	struct file *fpin = NULL;
 	unsigned long vm_flags = vmf->vma->vm_flags;
 	unsigned int mmap_miss;
@@ -3147,7 +3150,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
 	ra->size = ra->ra_pages;
 	ra->async_size = ra->ra_pages / 4;
-	ractl._index = ra->start;
+	ractl._index = mapping_align_start_index(mapping, ra->start);
 	page_cache_ra_order(&ractl, ra, 0);
 	return fpin;
 }
@@ -3162,7 +3165,7 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
 {
 	struct file *file = vmf->vma->vm_file;
 	struct file_ra_state *ra = &file->f_ra;
-	DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff);
+	DEFINE_READAHEAD_ALIGNED(ractl, file, ra, file->f_mapping, vmf->pgoff);
 	struct file *fpin = NULL;
 	unsigned int mmap_miss;
 
@@ -3657,6 +3660,7 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
 	struct folio *folio;
 	int err;
 
+	index = mapping_align_start_index(mapping, index);
 	if (!filler)
 		filler = mapping->a_ops->read_folio;
 repeat:
-- 
2.43.0




* [PATCH v2 04/13] filemap: use mapping_min_order while allocating folios
  2024-03-01 16:44 [PATCH v2 00/13] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (2 preceding siblings ...)
  2024-03-01 16:44 ` [PATCH v2 03/13] filemap: align the index to mapping_min_order in the page cache Pankaj Raghav (Samsung)
@ 2024-03-01 16:44 ` Pankaj Raghav (Samsung)
  2024-03-01 16:44 ` [PATCH v2 05/13] readahead: round up file_ra_state->ra_pages to mapping_min_nrpages Pankaj Raghav (Samsung)
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-01 16:44 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: djwong, mcgrof, linux-mm, hare, david, akpm, gost.dev,
	linux-kernel, chandan.babu, willy, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

filemap_create_folio() and do_read_cache_folio() were always allocating
folios of order 0. __filemap_get_folio() was trying to allocate higher
order folios when fgp_flags had a higher order hint set, but it would fall
back to an order-0 folio if the higher order memory allocation failed.

As we bring in the notion of mapping_min_order, make sure these functions
allocate folios of at least mapping_min_order, as we need to guarantee that
in the page cache.

Add an additional VM_BUG_ON() in __filemap_add_folio() to catch errors
where we add folios that have an order less than min_order.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
---
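As an example of the fallback behaviour after this change (the values below
are chosen purely for illustration):

	/*
	 * FGF_GET_ORDER(fgp_flags) == 5, mapping_min_folio_order() == 2:
	 * allocation attempts fall back 5 -> 4 -> 3 -> 2 and then stop.
	 * Orders 1 and 0 are never tried, because the page cache must not
	 * hold folios smaller than the mapping's minimum order.
	 */
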
 mm/filemap.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 96fe5c7fe094..3e621c6344f7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -849,6 +849,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
+	VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
+			folio);
 	mapping_set_update(&xas, mapping);
 
 	if (!huge) {
@@ -1886,8 +1888,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
-		unsigned order = FGF_GET_ORDER(fgp_flags);
+		unsigned int min_order = mapping_min_folio_order(mapping);
+		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
 		int err;
+		index = mapping_align_start_index(mapping, index);
 
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
 			gfp |= __GFP_WRITE;
@@ -1912,8 +1916,11 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp_t alloc_gfp = gfp;
 
 			err = -ENOMEM;
+			if (order < min_order)
+				order = min_order;
 			if (order > 0)
 				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
+
 			folio = filemap_alloc_folio(alloc_gfp, order);
 			if (!folio)
 				continue;
@@ -1927,7 +1934,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 				break;
 			folio_put(folio);
 			folio = NULL;
-		} while (order-- > 0);
+		} while (order-- > min_order);
 
 		if (err == -EEXIST)
 			goto repeat;
@@ -2424,7 +2431,8 @@ static int filemap_create_folio(struct file *file,
 	unsigned int min_order = mapping_min_folio_order(mapping);
 	pgoff_t index;
 
-	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
+				    min_order);
 	if (!folio)
 		return -ENOMEM;
 
@@ -3666,7 +3674,8 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
 repeat:
 	folio = filemap_get_folio(mapping, index);
 	if (IS_ERR(folio)) {
-		folio = filemap_alloc_folio(gfp, 0);
+		folio = filemap_alloc_folio(gfp,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			return ERR_PTR(-ENOMEM);
 		err = filemap_add_folio(mapping, folio, index, gfp);
-- 
2.43.0




* [PATCH v2 05/13] readahead: round up file_ra_state->ra_pages to mapping_min_nrpages
  2024-03-01 16:44 [PATCH v2 00/13] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (3 preceding siblings ...)
  2024-03-01 16:44 ` [PATCH v2 04/13] filemap: use mapping_min_order while allocating folios Pankaj Raghav (Samsung)
@ 2024-03-01 16:44 ` Pankaj Raghav (Samsung)
  2024-03-01 16:44 ` [PATCH v2 06/13] readahead: align index to mapping_min_order in ondemand_ra and force_ra Pankaj Raghav (Samsung)
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-01 16:44 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: djwong, mcgrof, linux-mm, hare, david, akpm, gost.dev,
	linux-kernel, chandan.babu, willy, Pankaj Raghav

From: Luis Chamberlain <mcgrof@kernel.org>

As we will be adding multiples of mapping_min_nrpages to the page cache,
initialize file_ra_state->ra_pages with bdi->ra_pages rounded up to the
nearest multiple of mapping_min_nrpages.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
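For example, with a 64k block size on a 4k page size system,
mapping_min_folio_nrpages() is 16 and:

	round_up(32, 16);	/* -> 32: the common 128k bdi default is kept */
	round_up(20, 16);	/* -> 32: never less than one minimum folio   */
	round_up(0, 16);	/* -> 0:  readahead stays disabled             */
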
 mm/readahead.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 369c70e2be42..6336c1736cc9 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -138,7 +138,8 @@
 void
 file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
 {
-	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
+	ra->ra_pages = round_up(inode_to_bdi(mapping->host)->ra_pages,
+				mapping_min_folio_nrpages(mapping));
 	ra->prev_pos = -1;
 }
 EXPORT_SYMBOL_GPL(file_ra_state_init);
-- 
2.43.0




* [PATCH v2 06/13] readahead: align index to mapping_min_order in ondemand_ra and force_ra
  2024-03-01 16:44 [PATCH v2 00/13] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (4 preceding siblings ...)
  2024-03-01 16:44 ` [PATCH v2 05/13] readahead: round up file_ra_state->ra_pages to mapping_min_nrpages Pankaj Raghav (Samsung)
@ 2024-03-01 16:44 ` Pankaj Raghav (Samsung)
  2024-03-01 16:44 ` [PATCH v2 07/13] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-01 16:44 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: djwong, mcgrof, linux-mm, hare, david, akpm, gost.dev,
	linux-kernel, chandan.babu, willy, Pankaj Raghav

From: Luis Chamberlain <mcgrof@kernel.org>

Align the ra->start and ra->size to mapping_min_order in
ondemand_readahead(), and align the index to mapping_min_order in
force_page_cache_ra(). This will ensure that the folios allocated for
readahead that are added to the page cache are aligned to
mapping_min_order.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
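A small worked example of the force_page_cache_ra() change (numbers chosen
for illustration only):

	/*
	 * min_nrpages == 4, request: index = 6, nr_to_read = 8.
	 *
	 * new_index = mapping_align_start_index(mapping, 6) == 4
	 * nr_to_read += 6 - 4, so nr_to_read == 10
	 *
	 * The readahead now covers pages 4..13 instead of 6..13, keeping
	 * every folio aligned to the minimum folio order.
	 */
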
 mm/readahead.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 6336c1736cc9..0197cb91cf85 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -310,7 +310,9 @@ void force_page_cache_ra(struct readahead_control *ractl,
 	struct address_space *mapping = ractl->mapping;
 	struct file_ra_state *ra = ractl->ra;
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
-	unsigned long max_pages, index;
+	unsigned long max_pages;
+	pgoff_t index, new_index;
+	unsigned long min_nrpages = mapping_min_folio_nrpages(mapping);
 
 	if (unlikely(!mapping->a_ops->read_folio && !mapping->a_ops->readahead))
 		return;
@@ -320,7 +322,14 @@ void force_page_cache_ra(struct readahead_control *ractl,
 	 * be up to the optimal hardware IO size
 	 */
 	index = readahead_index(ractl);
+	new_index = mapping_align_start_index(mapping, index);
+	if (new_index != index) {
+		nr_to_read += index - new_index;
+		index = new_index;
+	}
+
 	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
+	max_pages = max_t(unsigned long, max_pages, min_nrpages);
 	nr_to_read = min_t(unsigned long, nr_to_read, max_pages);
 	while (nr_to_read) {
 		unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_SIZE;
@@ -328,6 +337,7 @@ void force_page_cache_ra(struct readahead_control *ractl,
 		if (this_chunk > nr_to_read)
 			this_chunk = nr_to_read;
 		ractl->_index = index;
+		VM_BUG_ON(!IS_ALIGNED(index, min_nrpages));
 		do_page_cache_ra(ractl, this_chunk, 0);
 
 		index += this_chunk;
@@ -554,8 +564,11 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	unsigned long add_pages;
 	pgoff_t index = readahead_index(ractl);
 	pgoff_t expected, prev_index;
-	unsigned int order = folio ? folio_order(folio) : 0;
+	unsigned int min_order = mapping_min_folio_order(ractl->mapping);
+	unsigned int min_nrpages = mapping_min_folio_nrpages(ractl->mapping);
+	unsigned int order = folio ? folio_order(folio) : min_order;
 
+	VM_BUG_ON(!IS_ALIGNED(index, min_nrpages));
 	/*
 	 * If the request exceeds the readahead window, allow the read to
 	 * be up to the optimal hardware IO size
@@ -577,7 +590,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 			1UL << order);
 	if (index == expected || index == (ra->start + ra->size)) {
 		ra->start += ra->size;
-		ra->size = get_next_ra_size(ra, max_pages);
+		ra->size = max(get_next_ra_size(ra, max_pages), min_nrpages);
 		ra->async_size = ra->size;
 		goto readit;
 	}
@@ -602,7 +615,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 		ra->start = start;
 		ra->size = start - index;	/* old async_size */
 		ra->size += req_size;
-		ra->size = get_next_ra_size(ra, max_pages);
+		ra->size = max(get_next_ra_size(ra, max_pages), min_nrpages);
 		ra->async_size = ra->size;
 		goto readit;
 	}
@@ -639,7 +652,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 
 initial_readahead:
 	ra->start = index;
-	ra->size = get_init_ra_size(req_size, max_pages);
+	ra->size = max(min_nrpages, get_init_ra_size(req_size, max_pages));
 	ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
 
 readit:
@@ -650,7 +663,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	 * Take care of maximum IO pages as above.
 	 */
 	if (index == ra->start && ra->size == ra->async_size) {
-		add_pages = get_next_ra_size(ra, max_pages);
+		add_pages = max(get_next_ra_size(ra, max_pages), min_nrpages);
 		if (ra->size + add_pages <= max_pages) {
 			ra->async_size = add_pages;
 			ra->size += add_pages;
@@ -660,7 +673,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 		}
 	}
 
-	ractl->_index = ra->start;
+	ractl->_index = mapping_align_start_index(ractl->mapping, ra->start);
 	page_cache_ra_order(ractl, ra, order);
 }
 
-- 
2.43.0




* [PATCH v2 07/13] readahead: rework loop in page_cache_ra_unbounded()
  2024-03-01 16:44 [PATCH v2 00/13] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (5 preceding siblings ...)
  2024-03-01 16:44 ` [PATCH v2 06/13] readahead: align index to mapping_min_order in ondemand_ra and force_ra Pankaj Raghav (Samsung)
@ 2024-03-01 16:44 ` Pankaj Raghav (Samsung)
  2024-03-01 16:44 ` [PATCH v2 08/13] readahead: allocate folios with mapping_min_order in ra_(unbounded|order) Pankaj Raghav (Samsung)
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-01 16:44 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: djwong, mcgrof, linux-mm, hare, david, akpm, gost.dev,
	linux-kernel, chandan.babu, willy, Pankaj Raghav

From: Hannes Reinecke <hare@suse.de>

Rework the loop in page_cache_ra_unbounded() to advance with
the number of pages in a folio instead of just one page at a time.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
---
 mm/readahead.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 0197cb91cf85..65fbb9e78615 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -209,7 +209,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	struct address_space *mapping = ractl->mapping;
 	unsigned long index = readahead_index(ractl);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	unsigned long i;
+	unsigned long i = 0;
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -227,7 +227,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	for (i = 0; i < nr_to_read; i++) {
+	while (i < nr_to_read) {
 		struct folio *folio = xa_load(&mapping->i_pages, index + i);
 
 		if (folio && !xa_is_value(folio)) {
@@ -240,8 +240,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			 * not worth getting one just for that.
 			 */
 			read_pages(ractl);
-			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			ractl->_index += folio_nr_pages(folio);
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 
@@ -253,13 +253,14 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			folio_put(folio);
 			read_pages(ractl);
 			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 		if (i == nr_to_read - lookahead_size)
 			folio_set_readahead(folio);
 		ractl->_workingset |= folio_test_workingset(folio);
-		ractl->_nr_pages++;
+		ractl->_nr_pages += folio_nr_pages(folio);
+		i += folio_nr_pages(folio);
 	}
 
 	/*
-- 
2.43.0




* [PATCH v2 08/13] readahead: allocate folios with mapping_min_order in ra_(unbounded|order)
  2024-03-01 16:44 [PATCH v2 00/13] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (6 preceding siblings ...)
  2024-03-01 16:44 ` [PATCH v2 07/13] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
@ 2024-03-01 16:44 ` Pankaj Raghav (Samsung)
  2024-03-01 16:44 ` [PATCH v2 09/13] mm: do not split a folio if it has minimum folio order requirement Pankaj Raghav (Samsung)
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-01 16:44 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: djwong, mcgrof, linux-mm, hare, david, akpm, gost.dev,
	linux-kernel, chandan.babu, willy, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

Allocate folios with at least mapping_min_order in
page_cache_ra_unbounded() and page_cache_ra_order() as we need to
guarantee a minimum order in the page cache.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 mm/readahead.c | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 65fbb9e78615..4e3a6f763f5c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -210,6 +210,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	unsigned long index = readahead_index(ractl);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
 	unsigned long i = 0;
+	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -231,6 +232,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 		struct folio *folio = xa_load(&mapping->i_pages, index + i);
 
 		if (folio && !xa_is_value(folio)) {
+			long nr_pages = folio_nr_pages(folio);
+
 			/*
 			 * Page already present?  Kick off the current batch
 			 * of contiguous pages before continuing with the
@@ -240,19 +243,31 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			 * not worth getting one just for that.
 			 */
 			read_pages(ractl);
-			ractl->_index += folio_nr_pages(folio);
+
+			/*
+			 * Move the ractl->_index by at least min_pages
+			 * if the folio got truncated to respect the
+			 * alignment constraint in the page cache.
+			 *
+			 */
+			if (mapping != folio->mapping)
+				nr_pages = min_nrpages;
+
+			VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
+			ractl->_index += nr_pages;
 			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			break;
 		if (filemap_add_folio(mapping, folio, index + i,
 					gfp_mask) < 0) {
 			folio_put(folio);
 			read_pages(ractl);
-			ractl->_index++;
+			ractl->_index += min_nrpages;
 			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
@@ -500,6 +515,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
 {
 	struct address_space *mapping = ractl->mapping;
 	pgoff_t index = readahead_index(ractl);
+	unsigned int min_order = mapping_min_folio_order(mapping);
 	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
 	pgoff_t mark = index + ra->size - ra->async_size;
 	int err = 0;
@@ -526,8 +542,13 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		if (index & ((1UL << order) - 1))
 			order = __ffs(index);
 		/* Don't allocate pages past EOF */
-		while (index + (1UL << order) - 1 > limit)
+		while (order > min_order && index + (1UL << order) - 1 > limit)
 			order--;
+
+		if (order < min_order)
+			order = min_order;
+
+		VM_BUG_ON(index & ((1UL << order) - 1));
 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
 		if (err)
 			break;
-- 
2.43.0




* [PATCH v2 09/13] mm: do not split a folio if it has minimum folio order requirement
  2024-03-01 16:44 [PATCH v2 00/13] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (7 preceding siblings ...)
  2024-03-01 16:44 ` [PATCH v2 08/13] readahead: allocate folios with mapping_min_order in ra_(unbounded|order) Pankaj Raghav (Samsung)
@ 2024-03-01 16:44 ` Pankaj Raghav (Samsung)
  2024-03-01 16:44 ` [PATCH v2 10/13] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-01 16:44 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: djwong, mcgrof, linux-mm, hare, david, akpm, gost.dev,
	linux-kernel, chandan.babu, willy, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

As we don't have a way to split a folio to any given lower folio
order yet, avoid splitting the folio in split_huge_page_to_list() if it
has a minimum folio order requirement.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 mm/huge_memory.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 81fd1ba57088..6ec3417638a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3030,6 +3030,19 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			goto out;
 		}
 
+		/*
+		 * Do not split if mapping has minimum folio order
+		 * requirement.
+		 *
+		 * XXX: Once we have support for splitting to any lower
+		 * folio order, then it could be split based on the
+		 * min_folio_order.
+		 */
+		if (mapping_min_folio_order(mapping)) {
+			ret = -EAGAIN;
+			goto out;
+		}
+
 		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
 							GFP_RECLAIM_MASK);
 
-- 
2.43.0




* [PATCH v2 10/13] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-03-01 16:44 [PATCH v2 00/13] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (8 preceding siblings ...)
  2024-03-01 16:44 ` [PATCH v2 09/13] mm: do not split a folio if it has minimum folio order requirement Pankaj Raghav (Samsung)
@ 2024-03-01 16:44 ` Pankaj Raghav (Samsung)
  2024-03-01 16:44 ` [PATCH v2 11/13] xfs: expose block size in stat Pankaj Raghav (Samsung)
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-01 16:44 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: djwong, mcgrof, linux-mm, hare, david, akpm, gost.dev,
	linux-kernel, chandan.babu, willy, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

iomap_dio_zero() will pad an fs block with zeroes if the direct IO size
is smaller than the fs block size. iomap_dio_zero() has an implicit
assumption that the fs block size never exceeds the page size. This is
true for most filesystems at the moment.

If the block size > page size, this will send the contents of the page
next to the zero page (as len > PAGE_SIZE) to the underlying block device,
causing FS corruption.

iomap is generic infrastructure and it should not make any assumptions
about the fs block size and the page size of the system.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
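A worked example of the case this fixes, assuming a 64k filesystem block
size and 4k pages:

	/*
	 * A 4k direct write at the start of a 64k block leaves
	 * len = 64k - 4k = 61440 bytes to zero.
	 *
	 * Before: __bio_add_page(bio, ZERO_PAGE(0), 61440, 0) adds a
	 * single 61440-byte vec over a 4k page, so the bio also covers
	 * whatever sits next to the zero page.
	 *
	 * After: the loop adds ZERO_PAGE(0) fifteen times, 4096 bytes
	 * each, and the bio is sized for BIO_MAX_VECS such pages.
	 */
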
 fs/iomap/direct-io.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..04f6c5548136 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
-	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
+
+	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
+
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	while (len) {
+		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
+
+		__bio_add_page(bio, page, io_len, 0);
+		len -= io_len;
+	}
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
 
-- 
2.43.0




* [PATCH v2 11/13] xfs: expose block size in stat
  2024-03-01 16:44 [PATCH v2 00/13] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (9 preceding siblings ...)
  2024-03-01 16:44 ` [PATCH v2 10/13] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
@ 2024-03-01 16:44 ` Pankaj Raghav (Samsung)
  2024-03-01 16:44 ` [PATCH v2 12/13] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
  2024-03-01 16:44 ` [PATCH v2 13/13] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
  12 siblings, 0 replies; 20+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-01 16:44 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: djwong, mcgrof, linux-mm, hare, david, akpm, gost.dev,
	linux-kernel, chandan.babu, willy, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

For block size larger than page size, the unit of efficient IO is
the block size, not the page size. Leaving stat() to report
PAGE_SIZE as the block size causes test programs like fsx to issue
illegal ranges for operations that require block size alignment
(e.g. fallocate() insert range). Hence update the preferred IO size
to reflect the block size in this case.

This change is based on a patch originally from Dave Chinner.[1]

[1] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-16-david@fromorbit.com/

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
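From user space the change looks like this (a sketch only; the mount point
is hypothetical and error handling is omitted):

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct stat st;

	if (stat("/mnt/xfs-16k/file", &st) == 0)
		printf("preferred IO size: %ld\n", (long)st.st_blksize);

	/*
	 * With a 16k block size on 4k pages this now prints 16384
	 * instead of 4096, so tools like fsx generate block-aligned
	 * ranges for operations such as fallocate() insert range.
	 */
	return 0;
}
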
 fs/xfs/xfs_iops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a0d77f5f512e..7ee829f7d708 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -543,7 +543,7 @@ xfs_stat_blksize(
 			return 1U << mp->m_allocsize_log;
 	}
 
-	return PAGE_SIZE;
+	return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);
 }
 
 STATIC int
-- 
2.43.0




* [PATCH v2 12/13] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-03-01 16:44 [PATCH v2 00/13] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (10 preceding siblings ...)
  2024-03-01 16:44 ` [PATCH v2 11/13] xfs: expose block size in stat Pankaj Raghav (Samsung)
@ 2024-03-01 16:44 ` Pankaj Raghav (Samsung)
  2024-03-01 16:44 ` [PATCH v2 13/13] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
  12 siblings, 0 replies; 20+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-01 16:44 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: djwong, mcgrof, linux-mm, hare, david, akpm, gost.dev,
	linux-kernel, chandan.babu, willy, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

Instead of assuming that PAGE_SHIFT is always at least as large as the
blocklog, make the calculation generic so that the page cache count can be
calculated correctly for LBS.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
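To see why the calculation is reordered, assume a hypothetical 64k block
size (sb_blocklog = 16) on a 4k page size system (PAGE_SHIFT = 12):

	/*
	 * Old form: nblocks >> (PAGE_SHIFT - sb_blocklog)
	 *         = nblocks >> (12 - 16), a negative shift count.
	 *
	 * New form: nblocks << 16 (checked with check_shl_overflow())
	 *           gives max_bytes; max_bytes >> 12 gives the highest
	 *           page cache index, which is compared against ULONG_MAX.
	 */
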
 fs/xfs/xfs_mount.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index aabb25dc3efa..9cf800586da7 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -133,9 +133,16 @@ xfs_sb_validate_fsb_count(
 {
 	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
+	uint64_t max_index;
+	uint64_t max_bytes;
+
+	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
+		return -EFBIG;
 
 	/* Limited by ULONG_MAX of page cache index */
-	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
+	max_index = max_bytes >> PAGE_SHIFT;
+
+	if (max_index > ULONG_MAX)
 		return -EFBIG;
 	return 0;
 }
-- 
2.43.0




* [PATCH v2 13/13] xfs: enable block size larger than page size support
  2024-03-01 16:44 [PATCH v2 00/13] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (11 preceding siblings ...)
  2024-03-01 16:44 ` [PATCH v2 12/13] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
@ 2024-03-01 16:44 ` Pankaj Raghav (Samsung)
  12 siblings, 0 replies; 20+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-01 16:44 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: djwong, mcgrof, linux-mm, hare, david, akpm, gost.dev,
	linux-kernel, chandan.babu, willy, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

The page cache now has the ability to have a minimum order when allocating
a folio, which is a prerequisite for adding support for block size > page
size.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
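For reference, the min_folio_order values this computes on a 4k page size
system:

	/*
	 * PAGE_SHIFT == 12 (4k pages):
	 *
	 *   block size   sb_blocklog   min_folio_order   min folio size
	 *        4k           12              0                 4k
	 *        8k           13              1                 8k
	 *       16k           14              2                16k
	 *       64k           16              4                64k
	 */
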
 fs/xfs/libxfs/xfs_ialloc.c |  5 +++++
 fs/xfs/libxfs/xfs_shared.h |  3 +++
 fs/xfs/xfs_icache.c        |  6 ++++--
 fs/xfs/xfs_mount.c         |  1 -
 fs/xfs/xfs_super.c         | 10 ++--------
 5 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 2361a22035b0..c040bd6271fd 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -2892,6 +2892,11 @@ xfs_ialloc_setup_geometry(
 		igeo->ialloc_align = mp->m_dalign;
 	else
 		igeo->ialloc_align = 0;
+
+	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
+		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
+	else
+		igeo->min_folio_order = 0;
 }
 
 /* Compute the location of the root directory inode that is laid out by mkfs. */
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 4220d3584c1b..67ed406e7a81 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -188,6 +188,9 @@ struct xfs_ino_geometry {
 	/* precomputed value for di_flags2 */
 	uint64_t	new_diflags2;
 
+	/* minimum folio order of a page cache allocation */
+	unsigned int	min_folio_order;
+
 };
 
 #endif /* __XFS_SHARED_H__ */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index dba514a2c84d..a1857000e2cd 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -88,7 +88,8 @@ xfs_inode_alloc(
 	/* VFS doesn't initialise i_mode or i_state! */
 	VFS_I(ip)->i_mode = 0;
 	VFS_I(ip)->i_state = 0;
-	mapping_set_large_folios(VFS_I(ip)->i_mapping);
+	mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -323,7 +324,8 @@ xfs_reinit_inode(
 	inode->i_rdev = dev;
 	inode->i_uid = uid;
 	inode->i_gid = gid;
-	mapping_set_large_folios(inode->i_mapping);
+	mapping_set_folio_min_order(inode->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 9cf800586da7..a77e927807e5 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -131,7 +131,6 @@ xfs_sb_validate_fsb_count(
 	xfs_sb_t	*sbp,
 	uint64_t	nblocks)
 {
-	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
 	uint64_t max_index;
 	uint64_t max_bytes;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 5a2512d20bd0..685ce7bf7324 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1625,16 +1625,10 @@ xfs_fs_fill_super(
 		goto out_free_sb;
 	}
 
-	/*
-	 * Until this is fixed only page-sized or smaller data blocks work.
-	 */
 	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
 		xfs_warn(mp,
-		"File system with blocksize %d bytes. "
-		"Only pagesize (%ld) or less will currently work.",
-				mp->m_sb.sb_blocksize, PAGE_SIZE);
-		error = -ENOSYS;
-		goto out_free_sb;
+"EXPERIMENTAL: Filesystem with Large Block Size (%d bytes) enabled.",
+			mp->m_sb.sb_blocksize);
 	}
 
 	/* Ensure this filesystem fits in the page cache limits */
-- 
2.43.0




* Re: [PATCH v2 01/13] mm: Support order-1 folios in the page cache
  2024-03-01 16:44 ` [PATCH v2 01/13] mm: Support order-1 folios in the page cache Pankaj Raghav (Samsung)
@ 2024-03-01 17:08   ` Hannes Reinecke
  0 siblings, 0 replies; 20+ messages in thread
From: Hannes Reinecke @ 2024-03-01 17:08 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), linux-fsdevel, linux-xfs
  Cc: djwong, mcgrof, linux-mm, david, akpm, gost.dev, linux-kernel,
	chandan.babu, willy

On 3/1/24 17:44, Pankaj Raghav (Samsung) wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Folios of order 1 have no space to store the deferred list.  This is
> not a problem for the page cache as file-backed folios are never
> placed on the deferred list.  All we need to do is prevent the core
> MM from touching the deferred list for order 1 folios and remove the
> code which prevented us from allocating order 1 folios.
> 
> Link: https://lore.kernel.org/linux-mm/90344ea7-4eec-47ee-5996-0c22f42d6a6a@google.com/
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>   include/linux/huge_mm.h |  7 +++++--
>   mm/filemap.c            |  2 --
>   mm/huge_memory.c        | 23 ++++++++++++++++++-----
>   mm/internal.h           |  4 +---
>   mm/readahead.c          |  3 ---
>   5 files changed, 24 insertions(+), 15 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 02/13] fs: Allow fine-grained control of folio sizes
  2024-03-01 16:44 ` [PATCH v2 02/13] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-03-01 17:09   ` Hannes Reinecke
  0 siblings, 0 replies; 20+ messages in thread
From: Hannes Reinecke @ 2024-03-01 17:09 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), linux-fsdevel, linux-xfs
  Cc: djwong, mcgrof, linux-mm, david, akpm, gost.dev, linux-kernel,
	chandan.babu, willy, Pankaj Raghav

On 3/1/24 17:44, Pankaj Raghav (Samsung) wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Some filesystems want to be able to ensure that folios that are added to
> the page cache are at least a certain size.
> Add mapping_set_folio_min_order() to allow this level of control.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>   include/linux/pagemap.h | 100 ++++++++++++++++++++++++++++++++--------
>   1 file changed, 80 insertions(+), 20 deletions(-)
> 
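Conceptually, the new helper records a minimum folio order on the
address_space so that later page cache allocations can honour it. A
minimal sketch of that idea; the shift/mask values and the read-side
helper below are assumptions, not the actual pagemap.h changes:

/* Sketch only: bit layout chosen for illustration. */
#define AS_FOLIO_ORDER_MIN_SHIFT	8
#define AS_FOLIO_ORDER_MIN_MASK		(0x1fUL << AS_FOLIO_ORDER_MIN_SHIFT)

static inline void mapping_set_folio_min_order(struct address_space *mapping,
					       unsigned int min)
{
	/* record the minimum order a folio in this mapping may have */
	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MIN_MASK) |
			 ((unsigned long)min << AS_FOLIO_ORDER_MIN_SHIFT);
}

static inline unsigned int mapping_min_folio_order(struct address_space *mapping)
{
	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >>
	       AS_FOLIO_ORDER_MIN_SHIFT;
}
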
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 03/13] filemap: align the index to mapping_min_order in the page cache
  2024-03-01 16:44 ` [PATCH v2 03/13] filemap: align the index to mapping_min_order in the page cache Pankaj Raghav (Samsung)
@ 2024-03-01 19:26   ` Matthew Wilcox
  2024-03-01 20:04     ` Kent Overstreet
  2024-03-04 15:36     ` Pankaj Raghav (Samsung)
  0 siblings, 2 replies; 20+ messages in thread
From: Matthew Wilcox @ 2024-03-01 19:26 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-fsdevel, linux-xfs, djwong, mcgrof, linux-mm, hare, david,
	akpm, gost.dev, linux-kernel, chandan.babu, Pankaj Raghav

On Fri, Mar 01, 2024 at 05:44:34PM +0100, Pankaj Raghav (Samsung) wrote:
> +#define DEFINE_READAHEAD_ALIGNED(ractl, f, r, m, i)			\
> +	struct readahead_control ractl = {				\
> +		.file = f,						\
> +		.mapping = m,						\
> +		.ra = r,						\
> +		._index = mapping_align_start_index(m, i),		\
> +	}

My point was that you didn't need to do any of this.

Look, I've tried to give constructive review, but I feel like I'm going
to have to be blunt.  There is no evidence of design or understanding
in these patches or their commit messages.  You don't have a coherent
message about "These things have to be aligned; these things can be at
arbitrary alignment".  If you have thought about it, it doesn't show.

Maybe you just need to go back over the patches and read them as a series,
but it feels like "Oh, there's a hole here, patch it; another hole here,
patch it" without thinking about what's going on and why.

I want to help, but it feels like it'd be easier to do all the work myself
at this point, and that's not good for me, and it's not good for you.

So, let's start off: Is the index in ractl aligned or not, and why do
you believe that's the right approach?  And review each of the patches
in this series with the answer to that question in mind because you are
currently inconsistent.
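
For reference, aligning an index to the mapping's minimum order simply
means rounding it down to a multiple of the minimum folio size in
pages; a sketch of what a helper like the quoted
mapping_align_start_index() is assumed to do:

static inline pgoff_t mapping_align_start_index(struct address_space *mapping,
						pgoff_t index)
{
	/*
	 * Sketch only: round the index down to the mapping's minimum
	 * folio boundary; mapping_min_folio_order() is assumed to
	 * return the order set by the filesystem.
	 */
	return round_down(index, 1UL << mapping_min_folio_order(mapping));
}

With a minimum order of 2 (16k blocks on 4k pages), index 5 becomes 4,
so whichever folio ends up covering the request starts on a block-size
boundary.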


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 03/13] filemap: align the index to mapping_min_order in the page cache
  2024-03-01 19:26   ` Matthew Wilcox
@ 2024-03-01 20:04     ` Kent Overstreet
  2024-03-04 15:38       ` Pankaj Raghav (Samsung)
  2024-03-04 15:36     ` Pankaj Raghav (Samsung)
  1 sibling, 1 reply; 20+ messages in thread
From: Kent Overstreet @ 2024-03-01 20:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Pankaj Raghav (Samsung),
	linux-fsdevel, linux-xfs, djwong, mcgrof, linux-mm, hare, david,
	akpm, gost.dev, linux-kernel, chandan.babu, Pankaj Raghav

On Fri, Mar 01, 2024 at 07:26:55PM +0000, Matthew Wilcox wrote:
> On Fri, Mar 01, 2024 at 05:44:34PM +0100, Pankaj Raghav (Samsung) wrote:
> > +#define DEFINE_READAHEAD_ALIGNED(ractl, f, r, m, i)			\
> > +	struct readahead_control ractl = {				\
> > +		.file = f,						\
> > +		.mapping = m,						\
> > +		.ra = r,						\
> > +		._index = mapping_align_start_index(m, i),		\
> > +	}
> 
> My point was that you didn't need to do any of this.
> 
> Look, I've tried to give constructive review, but I feel like I'm going
> to have to be blunt.  There is no evidence of design or understanding
> in these patches or their commit messages.  You don't have a coherent
> message about "These things have to be aligned; these things can be at
> arbitrary alignment".  If you have thought about it, it doesn't show.

Don't you think you might be going off a bit much? I looked over these
patches after we talked privately, and they looked pretty sensible to
me...

Yes, we _always_ want more thorough commit messages that properly
explain the motivations for changes, but in my experience that's the
thing that takes the longest to learn how to do well as an engineer...
ease up a bit.

> So, let's start off: Is the index in ractl aligned or not, and why do
> you believe that's the right approach?  And review each of the patches
> in this series with the answer to that question in mind because you are
> currently inconsistent.

^ this is a real point though, DEFINE_READAHEAD_ALIGNED() feels off to
me.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 03/13] filemap: align the index to mapping_min_order in the page cache
  2024-03-01 19:26   ` Matthew Wilcox
  2024-03-01 20:04     ` Kent Overstreet
@ 2024-03-04 15:36     ` Pankaj Raghav (Samsung)
  1 sibling, 0 replies; 20+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-04 15:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-xfs, djwong, mcgrof, linux-mm, hare, david,
	akpm, gost.dev, linux-kernel, chandan.babu, Pankaj Raghav

On Fri, Mar 01, 2024 at 07:26:55PM +0000, Matthew Wilcox wrote:
> On Fri, Mar 01, 2024 at 05:44:34PM +0100, Pankaj Raghav (Samsung) wrote:
> > +#define DEFINE_READAHEAD_ALIGNED(ractl, f, r, m, i)			\
> > +	struct readahead_control ractl = {				\
> > +		.file = f,						\
> > +		.mapping = m,						\
> > +		.ra = r,						\
> > +		._index = mapping_align_start_index(m, i),		\
> > +	}
> 
> My point was that you didn't need to do any of this.
> 
Got it. I probably didn't understand your earlier comment properly.

> Look, I've tried to give constructive review, but I feel like I'm going
> to have to be blunt.  There is no evidence of design or understanding
> in these patches or their commit messages.  You don't have a coherent
> message about "These things have to be aligned; these things can be at
> arbitrary alignment".  If you have thought about it, it doesn't show.
> 
> Maybe you just need to go back over the patches and read them as a series,
> but it feels like "Oh, there's a hole here, patch it; another hole here,
> patch it" without thinking about what's going on and why.
> 
> I want to help, but it feels like it'd be easier to do all the work myself
> at this point, and that's not good for me, and it's not good for you.
> 
> So, let's start off: Is the index in ractl aligned or not, and why do
> you believe that's the right approach?  And review each of the patches
> in this series with the answer to that question in mind because you are
> currently inconsistent.

Thanks for the feedback, and I get your comment about inconsistency,
especially the part where we align the index in places where it
probably doesn't even matter. As someone who is fairly new to the inner
workings of the page cache, I was unsure about choosing the right
abstraction to enforce alignment.

I am going through all the patches now based on your feedback and
changing the commit messages to clarify the intent.

--
Pankaj


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 03/13] filemap: align the index to mapping_min_order in the page cache
  2024-03-01 20:04     ` Kent Overstreet
@ 2024-03-04 15:38       ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 20+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-04 15:38 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, linux-fsdevel, linux-xfs, djwong, mcgrof,
	linux-mm, hare, david, akpm, gost.dev, linux-kernel,
	chandan.babu, Pankaj Raghav

On Fri, Mar 01, 2024 at 03:04:33PM -0500, Kent Overstreet wrote:
> On Fri, Mar 01, 2024 at 07:26:55PM +0000, Matthew Wilcox wrote:
> > On Fri, Mar 01, 2024 at 05:44:34PM +0100, Pankaj Raghav (Samsung) wrote:
> > > +#define DEFINE_READAHEAD_ALIGNED(ractl, f, r, m, i)			\
> > > +	struct readahead_control ractl = {				\
> > > +		.file = f,						\
> > > +		.mapping = m,						\
> > > +		.ra = r,						\
> > > +		._index = mapping_align_start_index(m, i),		\
> > > +	}
> > 
> > My point was that you didn't need to do any of this.
> > 
> > Look, I've tried to give constructive review, but I feel like I'm going
> > to have to be blunt.  There is no evidence of design or understanding
> > in these patches or their commit messages.  You don't have a coherent
> > message about "These things have to be aligned; these things can be at
> > arbitrary alignment".  If you have thought about it, it doesn't show.
> 
> Don't you think you might be going off a bit much? I looked over these
> patches after we talked privately, and they looked pretty sensible to
> me...
> 
> Yes, we _always_ want more thorough commit messages that properly
> explain the motivations for changes, but in my experience that's the
> thing that takes the longest to learn how to do well as an engineer...
> ease up a bit.
> 
> > So, let's start off: Is the index in ractl aligned or not, and why do
> > you believe that's the right approach?  And review each of the patches
> > in this series with the answer to that question in mind because you are
> > currently inconsistent.
> 
> ^ this is a real point though, DEFINE_READAHEAD_ALIGNED() feels off to
> me.

Thanks Kent. I am going over the patches again and revising them based
on the feedback.


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-03-04 15:38 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-01 16:44 [PATCH v2 00/13] enable bs > ps in XFS Pankaj Raghav (Samsung)
2024-03-01 16:44 ` [PATCH v2 01/13] mm: Support order-1 folios in the page cache Pankaj Raghav (Samsung)
2024-03-01 17:08   ` Hannes Reinecke
2024-03-01 16:44 ` [PATCH v2 02/13] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
2024-03-01 17:09   ` Hannes Reinecke
2024-03-01 16:44 ` [PATCH v2 03/13] filemap: align the index to mapping_min_order in the page cache Pankaj Raghav (Samsung)
2024-03-01 19:26   ` Matthew Wilcox
2024-03-01 20:04     ` Kent Overstreet
2024-03-04 15:38       ` Pankaj Raghav (Samsung)
2024-03-04 15:36     ` Pankaj Raghav (Samsung)
2024-03-01 16:44 ` [PATCH v2 04/13] filemap: use mapping_min_order while allocating folios Pankaj Raghav (Samsung)
2024-03-01 16:44 ` [PATCH v2 05/13] readahead: round up file_ra_state->ra_pages to mapping_min_nrpages Pankaj Raghav (Samsung)
2024-03-01 16:44 ` [PATCH v2 06/13] readahead: align index to mapping_min_order in ondemand_ra and force_ra Pankaj Raghav (Samsung)
2024-03-01 16:44 ` [PATCH v2 07/13] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
2024-03-01 16:44 ` [PATCH v2 08/13] readahead: allocate folios with mapping_min_order in ra_(unbounded|order) Pankaj Raghav (Samsung)
2024-03-01 16:44 ` [PATCH v2 09/13] mm: do not split a folio if it has minimum folio order requirement Pankaj Raghav (Samsung)
2024-03-01 16:44 ` [PATCH v2 10/13] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
2024-03-01 16:44 ` [PATCH v2 11/13] xfs: expose block size in stat Pankaj Raghav (Samsung)
2024-03-01 16:44 ` [PATCH v2 12/13] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
2024-03-01 16:44 ` [PATCH v2 13/13] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox