linux-mm.kvack.org archive mirror
* [PATCH v7 0/8] Buddy allocator like (or non-uniform) folio split
@ 2025-02-11 15:50 Zi Yan
  2025-02-11 15:50 ` [PATCH v7 1/8] xarray: add xas_try_split() to split a multi-index entry Zi Yan
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Zi Yan @ 2025-02-11 15:50 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, Baolin Wang,
	linux-kselftest, linux-kernel, Zi Yan

Hi Matthew,

Can you please take a look at Patch 1 and let me know if the new xarray
function looks good to you? I will send __filemap_add_folio() and
shmem_split_large_entry() changes separately.

Hi all,

This patchset adds a new buddy allocator like (or non-uniform) large folio
split, from an order-n folio to order-m folios with m < n. Compared with the
existing uniform split, it
1. reduces the total number of after-split folios from 2^(n-m) to n-m+1;
2. reduces the amount of memory needed for a multi-index xarray split from
   2^(n/6-m/6) to n/6-m/6, assuming XA_CHUNK_SHIFT=6;
3. keeps more large folios after a split: instead of producing only order-m
   folios, the after-split folios range from order-(n-1) down to order-m.
For example, splitting an order-9 folio to order-0 generates 10
(or 11 for anonymous memory) folios instead of 512, allocates 1 xa_node
instead of 8, and leaves 1 order-8, 1 order-7, ..., 1 order-1 and 2 order-0
folios (or 4 order-0 folios for anonymous memory) instead of 512 order-0 folios.
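
As a sanity check on those numbers, here is a small throwaway userspace
sketch (not part of the series) that just reproduces the folio-count
arithmetic above; the helper names are made up for illustration:

/* Illustrative arithmetic only, not kernel code from this series. */
#include <stdio.h>

/* Uniform split: every piece of an order-n folio ends up at order-m. */
static long uniform_count(int n, int m)
{
	return 1L << (n - m);
}

/*
 * Non-uniform split: one folio per order from n-1 down to m, plus the
 * final order-m folio; anonymous memory cannot keep an order-1 folio,
 * so splitting to order-0 costs one extra folio there.
 */
static long nonuniform_count(int n, int m, int anon)
{
	return (n - m + 1) + ((anon && m == 0) ? 1 : 0);
}

int main(void)
{
	printf("order-9 -> order-0, uniform:            %ld folios\n",
	       uniform_count(9, 0));
	printf("order-9 -> order-0, non-uniform (file): %ld folios\n",
	       nonuniform_count(9, 0, 0));
	printf("order-9 -> order-0, non-uniform (anon): %ld folios\n",
	       nonuniform_count(9, 0, 1));
	return 0;
}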

The series is based on mm-everything-2025-02-07-05-27 with V6 reverted and is
ready to be merged.


Instead of duplicating the existing split_huge_page*() code, __folio_split()
is introduced as the shared backend for both
split_huge_page_to_list_to_order() and folio_split(). __folio_split()
supports both uniform split and buddy allocator like (or non-uniform) split.
All existing split_huge_page*() users can be gradually converted to
folio_split() where possible. In this patchset, I converted
truncate_inode_partial_folio() to use folio_split().
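
For a rough idea of what a converted caller looks like, here is a minimal
hypothetical sketch using the folio_split() signature added in patch 4 (the
caller name and context are made up; real users also need the locking,
refcount and error handling shown in the patches):

/*
 * Hypothetical caller sketch, not taken from this series: split a locked,
 * unpinned folio so that the subpage at @page ends up in an order-0 folio
 * while keeping the rest of the folio as large as possible.
 */
static int isolate_subpage_sketch(struct folio *folio, struct page *page)
{
	int ret = folio_split(folio, 0, page, NULL);

	/*
	 * For the non-uniform split, -ENOMEM can be returned after a partial
	 * split, so the caller must re-examine the (possibly smaller) folio.
	 */
	return ret;
}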

xfstests quick group passed for both tmpfs and xfs.


Changelog
===
From V6[8]:
1. Added an xarray function xas_try_split() to support iterative folio split,
   removing the need to use xas_split_alloc() and xas_split(). The
   function guarantees that at most one xa_node is allocated for each
   call.
2. Added concrete numbers of after-split folios and xa_node savings to the
   cover letter and commit logs (per Andrew).

From V5[7]:
1. Split shmem to any lower order patches are in mm tree, so dropped
   from this series.
2. Renamed split_folio_at() to try_folio_split() to clarify that the
   non-uniform split will not be used if it is not supported.

From V4[6]:
1. Enabled shmem support in both uniform and buddy allocator like split
   and added selftests for it.
2. Added functions to check if uniform split and buddy allocator like
   split are supported for the given folio and order.
3. Made truncate fall back to uniform split if buddy allocator like split is
   not supported (CONFIG_READ_ONLY_THP_FOR_FS and FSes without large folio
   support).
4. Added the missing folio_clear_has_hwpoisoned() to
   __split_unmapped_folio().

From V3[5]:
1. Used xas_split_alloc(GFP_NOWAIT) instead of xas_nomem(), since extra
   operations inside xas_split_alloc() are needed for correctness.
2. Enabled folio_split() for shmem and no issue was found with xfstests
   quick test group.
3. Split both ends of a truncate range in truncate_inode_partial_folio()
   to avoid wasting memory in shmem truncate (per David Hildenbrand).
4. Removed page_in_folio_offset() since page_folio() does the same
   thing.
5. Finished truncate related tests from xfstests quick test group on XFS and
   tmpfs without issues.
6. Disabled buddy allocator like split for CONFIG_READ_ONLY_THP_FOR_FS
   and FSes without large folio support. This check was missed in prior
   versions.

From V2[3]:
1. Incorporated all the feedback from Kirill[4].
2. Used GFP_NOWAIT for xas_nomem().
3. Tested the code path when xas_nomem() fails.
4. Added selftests for folio_split().
5. Fixed a build error when THP is not configured.

From V1[2]:
1. Split the original patch 1 into multiple ones for easy review (per
   Kirill).
2. Added xas_destroy() to avoid memory leak.
3. Fixed nr_dropped not used error (per kernel test robot).
4. Added proper error handling when xas_nomem() fails to allocate memory
   for xas_split() during buddy allocator like split.

From RFC[1]:
1. Merged backend code of split_huge_page_to_list_to_order() and
   folio_split(). The same code is used for both uniform split and buddy
   allocator like split.
2. Used xas_nomem() instead of xas_split_alloc() for folio_split().
3. folio_split() now leaves the first after-split folio unlocked,
   instead of the one containing the given page, since
   the caller of truncate_inode_partial_folio() locks and unlocks the
   first folio.
4. Extended split_huge_page debugfs to use folio_split().
5. Added truncate_inode_partial_folio() as first user of folio_split().


Design
===

folio_split() splits a large folio in the same way as the buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if a user wants to free the
3rd subpage in an order-9 folio, folio_split() will split the order-9 folio
as:
O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon,
O-1,      O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is pagecache,
since anon folios do not support order-1 yet.

The split process is similar to the existing approach:
1. Unmap all page mappings (split PMD mappings if they exist);
2. Split metadata like memcg, page owner, page alloc tag;
3. Copy metadata in struct folio to sub pages. Instead of splitting
   the whole folio into multiple smaller ones with the same order in one
   shot, this approach splits the folio iteratively. Taking the example
   above, it first splits the original order-9 folio into two order-8
   folios, then splits the left order-8 folio into two order-7 folios,
   and so on;
4. Post-process the split folios, e.g., update mapping->i_pages for
   pagecache folios, adjust folio refcounts, add split folios to the
   corresponding list;
5. Remap the split folios;
6. Unlock the split folios.


__split_unmapped_folio() and __split_folio_to_order() replace
__split_huge_page() and __split_huge_page_tail() respectively.
__split_unmapped_folio() uses different approaches to perform
uniform split and buddy allocator like split:
1. uniform split: a single call to __split_folio_to_order() is used to
   uniformly split the given folio. All resulting folios are put back on
   the list after the split. The folio containing the given page is left
   for the caller to unlock; the others are unlocked.

2. buddy allocator like (or non-uniform) split: (old_order - new_order) calls
   to __split_folio_to_order() are used, each splitting the given folio from
   order N to order N-1. After each call, the target folio becomes the one
   containing the page given as the folio_split() parameter, and folios not
   containing that page are put back on the list. The folio containing the
   page is put back on the list once its order reaches new_order. All folios
   are unlocked except the first after-split folio, which is left for the
   caller to unlock; a rough sketch of this loop follows.
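
The sketch below uses only the helpers added in patch 2; the real
__split_unmapped_folio() also does the xarray, memcg/page-owner, refcount
unfreezing and LRU/list work at each step and stops on errors:

/* Conceptual sketch of the non-uniform split loop; error paths omitted. */
static void nonuniform_split_sketch(struct folio *folio, struct page *page,
				    int new_order)
{
	int split_order;

	for (split_order = folio_order(folio) - 1;
	     split_order >= new_order; split_order--) {
		/* anonymous folios cannot stay at order-1 */
		if (folio_test_anon(folio) && split_order == 1)
			continue;

		__split_folio_to_order(folio, split_order);

		/*
		 * Keep working on the half that now contains @page; the
		 * other after-split folios are released to the LRU/@list.
		 */
		folio = page_folio(page);
	}
}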


Patch Overview
===
1. Patch 1 added a new xarray function xas_try_split() to perform
   iterative xarray split.
2. Patch 2 added __split_unmapped_folio() and __split_folio_to_order() to
   prepare for moving to new backend split code.

3. Patch 3 moved common code in split_huge_page_to_list_to_order() to
   __folio_split().

4. Patch 4 added new folio_split() and made
   split_huge_page_to_list_to_order() share the new
   __split_unmapped_folio() with folio_split().

5. Patch 5 removed no longer used __split_huge_page() and
   __split_huge_page_tail().

6. Patch 6 added a new in_folio_offset to split_huge_page debugfs for
   folio_split() test.

7. Patch 7 used try_folio_split() for truncate operation.

8. Patch 8 added folio_split() tests.


Any comments and/or suggestions are welcome. Thanks.

[1] https://lore.kernel.org/linux-mm/20241008223748.555845-1-ziy@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20241028180932.1319265-1-ziy@nvidia.com/
[3] https://lore.kernel.org/linux-mm/20241101150357.1752726-1-ziy@nvidia.com/
[4] https://lore.kernel.org/linux-mm/e6ppwz5t4p4kvir6eqzoto4y5fmdjdxdyvxvtw43ncly4l4ogr@7ruqsay6i2h2/
[5] https://lore.kernel.org/linux-mm/20241205001839.2582020-1-ziy@nvidia.com/
[6] https://lore.kernel.org/linux-mm/20250106165513.104899-1-ziy@nvidia.com/
[7] https://lore.kernel.org/linux-mm/20250116211042.741543-1-ziy@nvidia.com/
[8] https://lore.kernel.org/linux-mm/20250205031417.1771278-1-ziy@nvidia.com/

Zi Yan (8):
  xarray: add xas_try_split() to split a multi-index entry.
  mm/huge_memory: add two new (not yet used) functions for folio_split()
  mm/huge_memory: move folio split common code to __folio_split()
  mm/huge_memory: add buddy allocator like (non-uniform) folio_split()
  mm/huge_memory: remove the old, unused __split_huge_page()
  mm/huge_memory: add folio_split() to debugfs testing interface.
  mm/truncate: use buddy allocator like folio split for truncate
    operation.
  selftests/mm: add tests for folio_split(), buddy allocator like split.

 Documentation/core-api/xarray.rst             |  14 +-
 include/linux/huge_mm.h                       |  36 +
 include/linux/xarray.h                        |   7 +
 lib/test_xarray.c                             |  47 ++
 lib/xarray.c                                  | 136 +++-
 mm/huge_memory.c                              | 751 ++++++++++++------
 mm/truncate.c                                 |  31 +-
 tools/testing/radix-tree/Makefile             |   1 +
 .../selftests/mm/split_huge_page_test.c       |  34 +-
 9 files changed, 772 insertions(+), 285 deletions(-)

-- 
2.47.2




* [PATCH v7 1/8] xarray: add xas_try_split() to split a multi-index entry.
  2025-02-11 15:50 [PATCH v7 0/8] Buddy allocator like (or non-uniform) folio split Zi Yan
@ 2025-02-11 15:50 ` Zi Yan
  2025-02-12  0:57   ` Zi Yan
  2025-02-17 21:44   ` David Hildenbrand
  2025-02-11 15:50 ` [PATCH v7 2/8] mm/huge_memory: add two new (not yet used) functions for folio_split() Zi Yan
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Zi Yan @ 2025-02-11 15:50 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, Baolin Wang,
	linux-kselftest, linux-kernel, Zi Yan

This is a preparation patch for the non-uniform folio split, which always
splits a folio in half iteratively, and for minimal xarray entry splitting.

Currently, xas_split_alloc() and xas_split() always split all slots of a
multi-index entry. They need as many xa_nodes as the number of to-be-split
slots. For example, to split an order-9 entry, which takes 2^(9-6)=8
slots, assuming XA_CHUNK_SHIFT is 6 (!CONFIG_BASE_SMALL), 8 xa_nodes are
needed. Instead, xas_try_split() is intended to be used iteratively to split
the order-9 entry into 2 order-8 entries, then split one order-8 entry,
based on the given index, into 2 order-7 entries, ..., and split one order-1
entry into 2 order-0 entries. Only the order-6 step needs a new xa_node, and
xas_try_split() will try to allocate one at that point. As a result,
xas_try_split() only needs one xa_node instead of 8.

When a new xa_node is needed during the split, xas_try_split() can try to
allocate one, but no more than one. -ENOMEM is returned if a node cannot be
allocated. -EINVAL is returned if a sibling entry would be split or a
cascading split would happen, i.e., when two or more new xa_nodes are
needed; these cases are not supported by xas_try_split().

xas_split_alloc() and xas_split() split an order-9 to order-0:

         ---------------------------------
         |   |   |   |   |   |   |   |   |
         | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
         |   |   |   |   |   |   |   |   |
         ---------------------------------
           |   |                   |   |
     -------   ---               ---   -------
     |           |     ...       |           |
     V           V               V           V
----------- -----------     ----------- -----------
| xa_node | | xa_node | ... | xa_node | | xa_node |
----------- -----------     ----------- -----------

xas_try_split() splits an order-9 to order-0:
   ---------------------------------
   |   |   |   |   |   |   |   |   |
   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
   |   |   |   |   |   |   |   |   |
   ---------------------------------
     |
     |
     V
-----------
| xa_node |
-----------
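
To make the intended calling pattern concrete, here is a minimal sketch of
the iterative use described above, modeled on how patch 2 calls it from
__split_unmapped_folio(); the wrapper function is hypothetical and the
caller is assumed to already hold the xa_lock:

/*
 * Hypothetical sketch: shrink the multi-index entry at @index from @order
 * down to @new_order one step at a time, with xa_lock held by the caller.
 */
static int try_split_entry_sketch(struct xarray *xa, unsigned long index,
				  void *entry, unsigned int order,
				  unsigned int new_order)
{
	XA_STATE(xas, xa, index);
	unsigned int cur;

	for (cur = order; cur > new_order; cur--) {
		/* the target order for this step is one below the current order */
		xas_set_order(&xas, index, cur - 1);
		xas_try_split(&xas, entry, cur, GFP_NOWAIT);
		if (xas_error(&xas))
			return xas_error(&xas);	/* -ENOMEM or -EINVAL */
	}
	return 0;
}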

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 Documentation/core-api/xarray.rst |  14 ++-
 include/linux/xarray.h            |   7 ++
 lib/test_xarray.c                 |  47 +++++++++++
 lib/xarray.c                      | 136 ++++++++++++++++++++++++++----
 tools/testing/radix-tree/Makefile |   1 +
 5 files changed, 188 insertions(+), 17 deletions(-)

diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
index f6a3eef4fe7f..c6c91cbd0c3c 100644
--- a/Documentation/core-api/xarray.rst
+++ b/Documentation/core-api/xarray.rst
@@ -489,7 +489,19 @@ Storing ``NULL`` into any index of a multi-index entry will set the
 entry at every index to ``NULL`` and dissolve the tie.  A multi-index
 entry can be split into entries occupying smaller ranges by calling
 xas_split_alloc() without the xa_lock held, followed by taking the lock
-and calling xas_split().
+and calling xas_split() or calling xas_try_split() with xa_lock. The
+difference between xas_split_alloc()+xas_split() and xas_try_split() is
+that xas_split_alloc() + xas_split() split the entry from the original
+order to the new order in one shot uniformly, whereas xas_try_split()
+iteratively splits the entry containing the index non-uniformly.
+For example, to split an order-9 entry, which takes 2^(9-6)=8 slots,
+assuming ``XA_CHUNK_SHIFT`` is 6, xas_split_alloc() + xas_split() need
+8 xa_node. xas_try_split() splits the order-9 entry into
+2 order-8 entries, then split one order-8 entry, based on the given index,
+to 2 order-7 entries, ..., and split one order-1 entry to 2 order-0 entries.
+When splitting the order-6 entry and a new xa_node is needed, xas_try_split()
+will try to allocate one if possible. As a result, xas_try_split() would only
+need 1 xa_node instead of 8.
 
 Functions and structures
 ========================
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 0b618ec04115..9eb8c7425090 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -1555,6 +1555,8 @@ int xa_get_order(struct xarray *, unsigned long index);
 int xas_get_order(struct xa_state *xas);
 void xas_split(struct xa_state *, void *entry, unsigned int order);
 void xas_split_alloc(struct xa_state *, void *entry, unsigned int order, gfp_t);
+void xas_try_split(struct xa_state *xas, void *entry, unsigned int order,
+		gfp_t gfp);
 #else
 static inline int xa_get_order(struct xarray *xa, unsigned long index)
 {
@@ -1576,6 +1578,11 @@ static inline void xas_split_alloc(struct xa_state *xas, void *entry,
 		unsigned int order, gfp_t gfp)
 {
 }
+
+static inline void xas_try_split(struct xa_state *xas, void *entry,
+		unsigned int order, gfp_t gfp)
+{
+}
 #endif
 
 /**
diff --git a/lib/test_xarray.c b/lib/test_xarray.c
index 6932a26f4927..598ca38a2f5b 100644
--- a/lib/test_xarray.c
+++ b/lib/test_xarray.c
@@ -1857,6 +1857,49 @@ static void check_split_1(struct xarray *xa, unsigned long index,
 	xa_destroy(xa);
 }
 
+static void check_split_2(struct xarray *xa, unsigned long index,
+				unsigned int order, unsigned int new_order)
+{
+	XA_STATE_ORDER(xas, xa, index, new_order);
+	unsigned int i, found;
+	void *entry;
+
+	xa_store_order(xa, index, order, xa, GFP_KERNEL);
+	xa_set_mark(xa, index, XA_MARK_1);
+
+	xas_lock(&xas);
+	xas_try_split(&xas, xa, order, GFP_KERNEL);
+	if (((new_order / XA_CHUNK_SHIFT) < (order / XA_CHUNK_SHIFT)) &&
+	    new_order < order - 1) {
+		XA_BUG_ON(xa, !xas_error(&xas) || xas_error(&xas) != -EINVAL);
+		xas_unlock(&xas);
+		goto out;
+	}
+	for (i = 0; i < (1 << order); i += (1 << new_order))
+		__xa_store(xa, index + i, xa_mk_index(index + i), 0);
+	xas_unlock(&xas);
+
+	for (i = 0; i < (1 << order); i++) {
+		unsigned int val = index + (i & ~((1 << new_order) - 1));
+		XA_BUG_ON(xa, xa_load(xa, index + i) != xa_mk_index(val));
+	}
+
+	xa_set_mark(xa, index, XA_MARK_0);
+	XA_BUG_ON(xa, !xa_get_mark(xa, index, XA_MARK_0));
+
+	xas_set_order(&xas, index, 0);
+	found = 0;
+	rcu_read_lock();
+	xas_for_each_marked(&xas, entry, ULONG_MAX, XA_MARK_1) {
+		found++;
+		XA_BUG_ON(xa, xa_is_internal(entry));
+	}
+	rcu_read_unlock();
+	XA_BUG_ON(xa, found != 1 << (order - new_order));
+out:
+	xa_destroy(xa);
+}
+
 static noinline void check_split(struct xarray *xa)
 {
 	unsigned int order, new_order;
@@ -1868,6 +1911,10 @@ static noinline void check_split(struct xarray *xa)
 			check_split_1(xa, 0, order, new_order);
 			check_split_1(xa, 1UL << order, order, new_order);
 			check_split_1(xa, 3UL << order, order, new_order);
+
+			check_split_2(xa, 0, order, new_order);
+			check_split_2(xa, 1UL << order, order, new_order);
+			check_split_2(xa, 3UL << order, order, new_order);
 		}
 	}
 }
diff --git a/lib/xarray.c b/lib/xarray.c
index 116e9286c64e..c38beca77830 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1007,6 +1007,31 @@ static void node_set_marks(struct xa_node *node, unsigned int offset,
 	}
 }
 
+static struct xa_node *__xas_alloc_node_for_split(struct xa_state *xas,
+		void *entry, gfp_t gfp)
+{
+	unsigned int i;
+	void *sibling = NULL;
+	struct xa_node *node;
+	unsigned int mask = xas->xa_sibs;
+
+	node = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
+	if (!node)
+		return NULL;
+	node->array = xas->xa;
+	for (i = 0; i < XA_CHUNK_SIZE; i++) {
+		if ((i & mask) == 0) {
+			RCU_INIT_POINTER(node->slots[i], entry);
+			sibling = xa_mk_sibling(i);
+		} else {
+			RCU_INIT_POINTER(node->slots[i], sibling);
+		}
+	}
+	RCU_INIT_POINTER(node->parent, xas->xa_alloc);
+
+	return node;
+}
+
 /**
  * xas_split_alloc() - Allocate memory for splitting an entry.
  * @xas: XArray operation state.
@@ -1025,7 +1050,6 @@ void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order,
 		gfp_t gfp)
 {
 	unsigned int sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1;
-	unsigned int mask = xas->xa_sibs;
 
 	/* XXX: no support for splitting really large entries yet */
 	if (WARN_ON(xas->xa_shift + 2 * XA_CHUNK_SHIFT <= order))
@@ -1034,23 +1058,9 @@ void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order,
 		return;
 
 	do {
-		unsigned int i;
-		void *sibling = NULL;
-		struct xa_node *node;
-
-		node = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
+		struct xa_node *node = __xas_alloc_node_for_split(xas, entry, gfp);
 		if (!node)
 			goto nomem;
-		node->array = xas->xa;
-		for (i = 0; i < XA_CHUNK_SIZE; i++) {
-			if ((i & mask) == 0) {
-				RCU_INIT_POINTER(node->slots[i], entry);
-				sibling = xa_mk_sibling(i);
-			} else {
-				RCU_INIT_POINTER(node->slots[i], sibling);
-			}
-		}
-		RCU_INIT_POINTER(node->parent, xas->xa_alloc);
 		xas->xa_alloc = node;
 	} while (sibs-- > 0);
 
@@ -1122,6 +1132,100 @@ void xas_split(struct xa_state *xas, void *entry, unsigned int order)
 	xas_update(xas, node);
 }
 EXPORT_SYMBOL_GPL(xas_split);
+
+/**
+ * xas_try_split() - Try to split a multi-index entry.
+ * @xas: XArray operation state.
+ * @entry: New entry to store in the array.
+ * @order: Current entry order.
+ * @gfp: Memory allocation flags.
+ *
+ * The size of the new entries is set in @xas.  The value in @entry is
+ * copied to all the replacement entries. If and only if one xa_node needs to
+ * be allocated, the function will use @gfp to get one. If more than one
+ * xa_node is needed, the function returns -EINVAL.
+ *
+ * Context: Any context.  The caller should hold the xa_lock.
+ */
+void xas_try_split(struct xa_state *xas, void *entry, unsigned int order,
+		gfp_t gfp)
+{
+	unsigned int sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1;
+	unsigned int offset, marks;
+	struct xa_node *node;
+	void *curr = xas_load(xas);
+	int values = 0;
+
+	node = xas->xa_node;
+	if (xas_top(node))
+		return;
+
+	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
+		gfp |= __GFP_ACCOUNT;
+
+	marks = node_get_marks(node, xas->xa_offset);
+
+	offset = xas->xa_offset + sibs;
+	do {
+		if (xas->xa_shift < node->shift) {
+			struct xa_node *child = xas->xa_alloc;
+			unsigned int expected_sibs =
+				(1 << ((order - 1) % XA_CHUNK_SHIFT)) - 1;
+
+			/*
+			 * No support for splitting sibling entries
+			 * (horizontally) or cascading splits (vertically),
+			 * which require two or more new xa_nodes: if one
+			 * xa_node allocation failed, it would be hard to
+			 * free the prior allocations.
+			 */
+			if (sibs || xas->xa_sibs != expected_sibs) {
+				xas_destroy(xas);
+				xas_set_err(xas, -EINVAL);
+				return;
+			}
+
+			if (!child) {
+				child = __xas_alloc_node_for_split(xas, entry,
+						gfp);
+				if (!child) {
+					xas_destroy(xas);
+					xas_set_err(xas, -ENOMEM);
+					return;
+				}
+			}
+
+			xas->xa_alloc = rcu_dereference_raw(child->parent);
+			child->shift = node->shift - XA_CHUNK_SHIFT;
+			child->offset = offset;
+			child->count = XA_CHUNK_SIZE;
+			child->nr_values = xa_is_value(entry) ?
+					XA_CHUNK_SIZE : 0;
+			RCU_INIT_POINTER(child->parent, node);
+			node_set_marks(node, offset, child, xas->xa_sibs,
+					marks);
+			rcu_assign_pointer(node->slots[offset],
+					xa_mk_node(child));
+			if (xa_is_value(curr))
+				values--;
+			xas_update(xas, child);
+		} else {
+			unsigned int canon = offset - xas->xa_sibs;
+
+			node_set_marks(node, canon, NULL, 0, marks);
+			rcu_assign_pointer(node->slots[canon], entry);
+			while (offset > canon)
+				rcu_assign_pointer(node->slots[offset--],
+						xa_mk_sibling(canon));
+			values += (xa_is_value(entry) - xa_is_value(curr)) *
+					(xas->xa_sibs + 1);
+		}
+	} while (offset-- > xas->xa_offset);
+
+	node->nr_values += values;
+	xas_update(xas, node);
+}
+EXPORT_SYMBOL_GPL(xas_try_split);
 #endif
 
 /**
diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile
index 8b3591a51e1f..b2a6660bbd92 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -14,6 +14,7 @@ include ../shared/shared.mk
 
 main:	$(OFILES)
 
+xarray.o: ../../../lib/test_xarray.c
 idr-test.o: ../../../lib/test_ida.c
 idr-test: idr-test.o $(CORE_OFILES)
 
-- 
2.47.2




* [PATCH v7 2/8] mm/huge_memory: add two new (not yet used) functions for folio_split()
  2025-02-11 15:50 [PATCH v7 0/8] Buddy allocator like (or non-uniform) folio split Zi Yan
  2025-02-11 15:50 ` [PATCH v7 1/8] xarray: add xas_try_split() to split a multi-index entry Zi Yan
@ 2025-02-11 15:50 ` Zi Yan
  2025-02-14 21:59   ` David Hildenbrand
  2025-02-15  1:52   ` Zi Yan
  2025-02-11 15:50 ` [PATCH v7 3/8] mm/huge_memory: move folio split common code to __folio_split() Zi Yan
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Zi Yan @ 2025-02-11 15:50 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, Baolin Wang,
	linux-kselftest, linux-kernel, Zi Yan

This is a preparation patch; the two added functions are not used yet.

The added __split_unmapped_folio() is able to split a folio with
its mapping removed in two manners: 1) uniform split (the existing way),
and 2) buddy allocator like split.

The added __split_folio_to_order() can split a folio into any lower order.
For uniform split, __split_unmapped_folio() calls it once to split
the given folio to the new order. For buddy allocator like split,
__split_unmapped_folio() calls it (folio_order - new_order) times,
and each call splits the folio containing the given page to the next
lower order.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 349 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 348 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a0277f4154c2..12d3f515c408 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3262,7 +3262,6 @@ static void remap_page(struct folio *folio, unsigned long nr, int flags)
 static void lru_add_page_tail(struct folio *folio, struct page *tail,
 		struct lruvec *lruvec, struct list_head *list)
 {
-	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
 	VM_BUG_ON_FOLIO(PageLRU(tail), folio);
 	lockdep_assert_held(&lruvec->lru_lock);
 
@@ -3506,6 +3505,354 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
 					caller_pins;
 }
 
+/*
+ * It splits @folio into @new_order folios and copies the @folio metadata to
+ * all the resulting folios.
+ */
+static int __split_folio_to_order(struct folio *folio, int new_order)
+{
+	int curr_order = folio_order(folio);
+	long nr_pages = folio_nr_pages(folio);
+	long new_nr_pages = 1 << new_order;
+	long index;
+
+	if (curr_order <= new_order)
+		return -EINVAL;
+
+	/*
+	 * Skip the first new_nr_pages, since the new folio formed from them
+	 * keeps all the flags from the original folio.
+	 */
+	for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {
+		struct page *head = &folio->page;
+		struct page *new_head = head + index;
+
+		/*
+		 * Careful: new_folio is not a "real" folio before we cleared PageTail.
+		 * Don't pass it around before clear_compound_head().
+		 */
+		struct folio *new_folio = (struct folio *)new_head;
+
+		VM_BUG_ON_PAGE(atomic_read(&new_head->_mapcount) != -1, new_head);
+
+		/*
+		 * Clone page flags before unfreezing refcount.
+		 *
+		 * After successful get_page_unless_zero() might follow flags change,
+		 * for example lock_page() which set PG_waiters.
+		 *
+		 * Note that for mapped sub-pages of an anonymous THP,
+		 * PG_anon_exclusive has been cleared in unmap_folio() and is stored in
+		 * the migration entry instead from where remap_page() will restore it.
+		 * We can still have PG_anon_exclusive set on effectively unmapped and
+		 * unreferenced sub-pages of an anonymous THP: we can simply drop
+		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
+		 */
+		new_head->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		new_head->flags |= (head->flags &
+				((1L << PG_referenced) |
+				 (1L << PG_swapbacked) |
+				 (1L << PG_swapcache) |
+				 (1L << PG_mlocked) |
+				 (1L << PG_uptodate) |
+				 (1L << PG_active) |
+				 (1L << PG_workingset) |
+				 (1L << PG_locked) |
+				 (1L << PG_unevictable) |
+#ifdef CONFIG_ARCH_USES_PG_ARCH_2
+				 (1L << PG_arch_2) |
+#endif
+#ifdef CONFIG_ARCH_USES_PG_ARCH_3
+				 (1L << PG_arch_3) |
+#endif
+				 (1L << PG_dirty) |
+				 LRU_GEN_MASK | LRU_REFS_MASK));
+
+		/* ->mapping in first and second tail page is replaced by other uses */
+		VM_BUG_ON_PAGE(new_nr_pages > 2 && new_head->mapping != TAIL_MAPPING,
+			       new_head);
+		new_head->mapping = head->mapping;
+		new_head->index = head->index + index;
+
+		/*
+		 * page->private should not be set in tail pages. Fix up and warn once
+		 * if private is unexpectedly set.
+		 */
+		if (unlikely(new_head->private)) {
+			VM_WARN_ON_ONCE_PAGE(true, new_head);
+			new_head->private = 0;
+		}
+
+		if (folio_test_swapcache(folio))
+			new_folio->swap.val = folio->swap.val + index;
+
+		/* Page flags must be visible before we make the page non-compound. */
+		smp_wmb();
+
+		/*
+		 * Clear PageTail before unfreezing page refcount.
+		 *
+		 * After successful get_page_unless_zero() might follow put_page()
+		 * which needs correct compound_head().
+		 */
+		clear_compound_head(new_head);
+		if (new_order) {
+			prep_compound_page(new_head, new_order);
+			folio_set_large_rmappable(new_folio);
+
+			folio_set_order(folio, new_order);
+		}
+
+		if (folio_test_young(folio))
+			folio_set_young(new_folio);
+		if (folio_test_idle(folio))
+			folio_set_idle(new_folio);
+
+		folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
+	}
+
+	if (!new_order)
+		ClearPageCompound(&folio->page);
+
+	return 0;
+}
+
+/*
+ * It splits an unmapped @folio to lower order smaller folios in two ways.
+ * @folio: the to-be-split folio
+ * @new_order: the smallest order of the after split folios (since buddy
+ *             allocator like split generates folios with orders from @folio's
+ *             order - 1 to new_order).
+ * @page: in buddy allocator like split, the folio containing @page will be
+ *        split until its order becomes @new_order.
+ * @list: the after split folios will be added to @list if it is not NULL,
+ *        otherwise to LRU lists.
+ * @end: the end of the file @folio maps to. -1 if @folio is anonymous memory.
+ * @xas: xa_state pointing to folio->mapping->i_pages and locked by caller
+ * @mapping: @folio->mapping
+ * @uniform_split: if the split is uniform or not (buddy allocator like split)
+ *
+ *
+ * 1. uniform split: the given @folio is split into multiple @new_order small folios,
+ *    where all small folios have the same order. This is done when
+ *    uniform_split is true.
+ * 2. buddy allocator like (non-uniform) split: the given @folio is split into
+ *    half and one of the half (containing the given page) is split into half
+ *    until the given @page's order becomes @new_order. This is done when
+ *    uniform_split is false.
+ *
+ * The high level flow for these two methods is:
+ * 1. uniform split: a single __split_folio_to_order() is called to split the
+ *    @folio into @new_order, then we traverse all the resulting folios one by
+ *    one in PFN ascending order and perform stats, unfreeze, adding to list,
+ *    and file mapping index operations.
+ * 2. non-uniform split: in general, folio_order - @new_order calls to
+ *    __split_folio_to_order() are made in a for loop to split the @folio
+ *    to one lower order at a time. The resulting small folios are processed
+ *    like what is done during the traversal in 1, except the one containing
+ *    @page, which is split in the next iteration.
+ *
+ * After splitting, the caller's folio reference will be transferred to the
+ * folio containing @page. The other folios may be freed if they are not mapped.
+ *
+ * In terms of locking, after splitting,
+ * 1. uniform split leaves @page (or the folio containing it) locked;
+ * 2. buddy allocator like (non-uniform) split leaves @folio locked.
+ *
+ *
+ * For !uniform_split, when -ENOMEM is returned, the original folio might be
+ * split. The caller needs to check the input folio.
+ */
+static int __split_unmapped_folio(struct folio *folio, int new_order,
+		struct page *page, struct list_head *list, pgoff_t end,
+		struct xa_state *xas, struct address_space *mapping,
+		bool uniform_split)
+{
+	struct lruvec *lruvec;
+	struct address_space *swap_cache = NULL;
+	struct folio *origin_folio = folio;
+	struct folio *next_folio = folio_next(folio);
+	struct folio *new_folio;
+	struct folio *next;
+	int order = folio_order(folio);
+	int split_order;
+	int start_order = uniform_split ? new_order : order - 1;
+	int nr_dropped = 0;
+	int ret = 0;
+	bool stop_split = false;
+
+	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
+		/* a swapcache folio can only be uniformly split to order-0 */
+		if (!uniform_split || new_order != 0)
+			return -EINVAL;
+
+		swap_cache = swap_address_space(folio->swap);
+		xa_lock(&swap_cache->i_pages);
+	}
+
+	if (folio_test_anon(folio))
+		mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
+
+	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
+	lruvec = folio_lruvec_lock(folio);
+
+	folio_clear_has_hwpoisoned(folio);
+
+	/*
+	 * split to new_order one order at a time. For uniform split,
+	 * folio is split to new_order directly.
+	 */
+	for (split_order = start_order;
+	     split_order >= new_order && !stop_split;
+	     split_order--) {
+		int old_order = folio_order(folio);
+		struct folio *release;
+		struct folio *end_folio = folio_next(folio);
+		int status;
+
+		/* order-1 anonymous folio is not supported */
+		if (folio_test_anon(folio) && split_order == 1)
+			continue;
+		if (uniform_split && split_order != new_order)
+			continue;
+
+		if (mapping) {
+			/*
+			 * uniform split has xas_split_alloc() called before
+			 * irq is disabled to allocate enough memory, whereas
+			 * non-uniform split can handle ENOMEM.
+			 */
+			if (uniform_split)
+				xas_split(xas, folio, old_order);
+			else {
+				xas_set_order(xas, folio->index, split_order);
+				xas_try_split(xas, folio, old_order,
+						GFP_NOWAIT);
+				if (xas_error(xas)) {
+					ret = xas_error(xas);
+					stop_split = true;
+					goto after_split;
+				}
+			}
+		}
+
+		/* complete memcg works before add pages to LRU */
+		split_page_memcg(&folio->page, old_order, split_order);
+		split_page_owner(&folio->page, old_order, split_order);
+		pgalloc_tag_split(folio, old_order, split_order);
+
+		status = __split_folio_to_order(folio, split_order);
+
+		if (status < 0) {
+			stop_split = true;
+			ret = -EINVAL;
+		}
+
+after_split:
+		/*
+		 * Iterate through after-split folios and perform related
+		 * operations. But in buddy allocator like split, the folio
+		 * containing the specified page is skipped until its order
+		 * is new_order, since the folio will be worked on in next
+		 * iteration.
+		 */
+		for (release = folio, next = folio_next(folio);
+		     release != end_folio;
+		     release = next, next = folio_next(next)) {
+			/*
+			 * for buddy allocator like split, the folio containing
+			 * page will be split next and should not be released,
+			 * until the folio's order is new_order or stop_split
+			 * is set to true by the above xas_split() failure.
+			 */
+			if (release == page_folio(page)) {
+				folio = release;
+				if (split_order != new_order && !stop_split)
+					continue;
+			}
+			if (folio_test_anon(release)) {
+				mod_mthp_stat(folio_order(release),
+						MTHP_STAT_NR_ANON, 1);
+			}
+
+			/*
+			 * Unfreeze refcount first. Additional reference from
+			 * page cache.
+			 */
+			folio_ref_unfreeze(release,
+				1 + ((!folio_test_anon(origin_folio) ||
+				     folio_test_swapcache(origin_folio)) ?
+					     folio_nr_pages(release) : 0));
+
+			if (release != origin_folio)
+				lru_add_page_tail(origin_folio, &release->page,
+						lruvec, list);
+
+			/* Some pages can be beyond EOF: drop them from page cache */
+			if (release->index >= end) {
+				if (shmem_mapping(origin_folio->mapping))
+					nr_dropped += folio_nr_pages(release);
+				else if (folio_test_clear_dirty(release))
+					folio_account_cleaned(release,
+						inode_to_wb(origin_folio->mapping->host));
+				__filemap_remove_folio(release, NULL);
+				folio_put(release);
+			} else if (!folio_test_anon(release)) {
+				__xa_store(&origin_folio->mapping->i_pages,
+						release->index, &release->page, 0);
+			} else if (swap_cache) {
+				__xa_store(&swap_cache->i_pages,
+						swap_cache_index(release->swap),
+						&release->page, 0);
+			}
+		}
+	}
+
+	unlock_page_lruvec(lruvec);
+
+	if (folio_test_anon(origin_folio)) {
+		if (folio_test_swapcache(origin_folio))
+			xa_unlock(&swap_cache->i_pages);
+	} else
+		xa_unlock(&mapping->i_pages);
+
+	/* Caller disabled irqs, so they are still disabled here */
+	local_irq_enable();
+
+	if (nr_dropped)
+		shmem_uncharge(mapping->host, nr_dropped);
+
+	remap_page(origin_folio, 1 << order,
+			folio_test_anon(origin_folio) ?
+				RMP_USE_SHARED_ZEROPAGE : 0);
+
+	/*
+	 * At this point, folio should contain the specified page.
+	 * For uniform split, it is left for caller to unlock.
+	 * For buddy allocator like split, the first after-split folio is left
+	 * for caller to unlock.
+	 */
+	for (new_folio = origin_folio, next = folio_next(origin_folio);
+	     new_folio != next_folio;
+	     new_folio = next, next = folio_next(next)) {
+		if (uniform_split && new_folio == folio)
+			continue;
+		if (!uniform_split && new_folio == origin_folio)
+			continue;
+
+		folio_unlock(new_folio);
+		/*
+		 * Subpages may be freed if there wasn't any mapping
+		 * like if add_to_swap() is running on a lru page that
+		 * had its mapping zapped. And freeing these pages
+		 * requires taking the lru_lock so we do the put_page
+		 * of the tail pages after the split is complete.
+		 */
+		free_page_and_swap_cache(&new_folio->page);
+	}
+	return ret;
+}
+
 /*
  * This function splits a large folio into smaller folios of order @new_order.
  * @page can point to any page of the large folio to split. The split operation
-- 
2.47.2




* [PATCH v7 3/8] mm/huge_memory: move folio split common code to __folio_split()
  2025-02-11 15:50 [PATCH v7 0/8] Buddy allocator like (or non-uniform) folio split Zi Yan
  2025-02-11 15:50 ` [PATCH v7 1/8] xarray: add xas_try_split() to split a multi-index entry Zi Yan
  2025-02-11 15:50 ` [PATCH v7 2/8] mm/huge_memory: add two new (not yet used) functions for folio_split() Zi Yan
@ 2025-02-11 15:50 ` Zi Yan
  2025-02-11 15:50 ` [PATCH v7 4/8] mm/huge_memory: add buddy allocator like (non-uniform) folio_split() Zi Yan
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Zi Yan @ 2025-02-11 15:50 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, Baolin Wang,
	linux-kselftest, linux-kernel, Zi Yan

This is a preparation patch for folio_split().

In the upcoming patch, folio_split() will share the folio unmapping and
remapping code with split_huge_page_to_list_to_order(), so move that code
into a common function, __folio_split(), first.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 107 +++++++++++++++++++++++++----------------------
 1 file changed, 57 insertions(+), 50 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 12d3f515c408..21ebe2dec5a4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3853,57 +3853,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 	return ret;
 }
 
-/*
- * This function splits a large folio into smaller folios of order @new_order.
- * @page can point to any page of the large folio to split. The split operation
- * does not change the position of @page.
- *
- * Prerequisites:
- *
- * 1) The caller must hold a reference on the @page's owning folio, also known
- *    as the large folio.
- *
- * 2) The large folio must be locked.
- *
- * 3) The folio must not be pinned. Any unexpected folio references, including
- *    GUP pins, will result in the folio not getting split; instead, the caller
- *    will receive an -EAGAIN.
- *
- * 4) @new_order > 1, usually. Splitting to order-1 anonymous folios is not
- *    supported for non-file-backed folios, because folio->_deferred_list, which
- *    is used by partially mapped folios, is stored in subpage 2, but an order-1
- *    folio only has subpages 0 and 1. File-backed order-1 folios are supported,
- *    since they do not use _deferred_list.
- *
- * After splitting, the caller's folio reference will be transferred to @page,
- * resulting in a raised refcount of @page after this call. The other pages may
- * be freed if they are not mapped.
- *
- * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
- *
- * Pages in @new_order will inherit the mapping, flags, and so on from the
- * huge page.
- *
- * Returns 0 if the huge page was split successfully.
- *
- * Returns -EAGAIN if the folio has unexpected reference (e.g., GUP) or if
- * the folio was concurrently removed from the page cache.
- *
- * Returns -EBUSY when trying to split the huge zeropage, if the folio is
- * under writeback, if fs-specific folio metadata cannot currently be
- * released, or if some unexpected race happened (e.g., anon VMA disappeared,
- * truncation).
- *
- * Callers should ensure that the order respects the address space mapping
- * min-order if one is set for non-anonymous folios.
- *
- * Returns -EINVAL when trying to split to an order that is incompatible
- * with the folio. Splitting to order 0 is compatible with all folios.
- */
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
-				     unsigned int new_order)
+static int __folio_split(struct folio *folio, unsigned int new_order,
+		struct page *page, struct list_head *list)
 {
-	struct folio *folio = page_folio(page);
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
 	/* reset xarray order to new order after split */
 	XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
@@ -4113,6 +4065,61 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	return ret;
 }
 
+/*
+ * This function splits a large folio into smaller folios of order @new_order.
+ * @page can point to any page of the large folio to split. The split operation
+ * does not change the position of @page.
+ *
+ * Prerequisites:
+ *
+ * 1) The caller must hold a reference on the @page's owning folio, also known
+ *    as the large folio.
+ *
+ * 2) The large folio must be locked.
+ *
+ * 3) The folio must not be pinned. Any unexpected folio references, including
+ *    GUP pins, will result in the folio not getting split; instead, the caller
+ *    will receive an -EAGAIN.
+ *
+ * 4) @new_order > 1, usually. Splitting to order-1 anonymous folios is not
+ *    supported for non-file-backed folios, because folio->_deferred_list, which
+ *    is used by partially mapped folios, is stored in subpage 2, but an order-1
+ *    folio only has subpages 0 and 1. File-backed order-1 folios are supported,
+ *    since they do not use _deferred_list.
+ *
+ * After splitting, the caller's folio reference will be transferred to @page,
+ * resulting in a raised refcount of @page after this call. The other pages may
+ * be freed if they are not mapped.
+ *
+ * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
+ *
+ * Pages in @new_order will inherit the mapping, flags, and so on from the
+ * huge page.
+ *
+ * Returns 0 if the huge page was split successfully.
+ *
+ * Returns -EAGAIN if the folio has unexpected reference (e.g., GUP) or if
+ * the folio was concurrently removed from the page cache.
+ *
+ * Returns -EBUSY when trying to split the huge zeropage, if the folio is
+ * under writeback, if fs-specific folio metadata cannot currently be
+ * released, or if some unexpected race happened (e.g., anon VMA disappeared,
+ * truncation).
+ *
+ * Callers should ensure that the order respects the address space mapping
+ * min-order if one is set for non-anonymous folios.
+ *
+ * Returns -EINVAL when trying to split to an order that is incompatible
+ * with the folio. Splitting to order 0 is compatible with all folios.
+ */
+int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+				     unsigned int new_order)
+{
+	struct folio *folio = page_folio(page);
+
+	return __folio_split(folio, new_order, page, list);
+}
+
 int min_order_for_split(struct folio *folio)
 {
 	if (folio_test_anon(folio))
-- 
2.47.2




* [PATCH v7 4/8] mm/huge_memory: add buddy allocator like (non-uniform) folio_split()
  2025-02-11 15:50 [PATCH v7 0/8] Buddy allocator like (or non-uniform) folio split Zi Yan
                   ` (2 preceding siblings ...)
  2025-02-11 15:50 ` [PATCH v7 3/8] mm/huge_memory: move folio split common code to __folio_split() Zi Yan
@ 2025-02-11 15:50 ` Zi Yan
  2025-02-16 10:32   ` David Hildenbrand
  2025-02-11 15:50 ` [PATCH v7 5/8] mm/huge_memory: remove the old, unused __split_huge_page() Zi Yan
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Zi Yan @ 2025-02-11 15:50 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, Baolin Wang,
	linux-kselftest, linux-kernel, Zi Yan

folio_split() splits a large folio in the same way as the buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if a user wants to free the
3rd subpage in an order-9 folio, folio_split() will split the order-9 folio
as:
O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon,
since anon folios do not support order-1 yet.
-----------------------------------------------------------------
|   |   |   |   |     |   |       |                             |
|O-0|O-0|O-0|O-0| O-2 |...|  O-7  |             O-8             |
|   |   |   |   |     |   |       |                             |
-----------------------------------------------------------------

O-1,      O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is pagecache
---------------------------------------------------------------
|     |   |   |     |   |       |                             |
| O-1 |O-0|O-0| O-2 |...|  O-7  |             O-8             |
|     |   |   |     |   |       |                             |
---------------------------------------------------------------

It generates fewer folios (i.e., 11 or 10) than the existing page split
approach, which splits the order-9 folio into 512 order-0 folios. It also
reduces the number of new xa_nodes needed during a pagecache folio split
from 8 to 1, potentially decreasing the folio split failure rate due to
memory constraints.

folio_split() and the existing split_huge_page_to_list_to_order() share
the folio unmapping and remapping code in __folio_split() and the common
backend split code in __split_unmapped_folio(), using the uniform_split
variable to distinguish their operations.

uniform_split_supported() and non_uniform_split_supported() are added
to factor out check code and will be used outside __folio_split() in the
following commit.
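
As a usage illustration of the folio_split() kernel-doc added below
(splitting an order-9 folio at its third order-3 chunk), a hypothetical
caller might look like this; the function name is made up and the usual
prerequisites (folio locked, unpinned, caller holds a reference) still
apply:

/*
 * Hypothetical example matching the folio_split() kernel-doc added below:
 * split a locked, unpinned order-9 pagecache folio so that its third
 * order-3 chunk becomes a standalone order-3 folio.
 */
static int split_out_third_chunk(struct folio *folio)
{
	/* first page of the third order-3 chunk: index 2 * 8 = 16 */
	struct page *page = folio_page(folio, 2 * (1 << 3));

	return folio_split(folio, 3, page, NULL);
}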

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 137 ++++++++++++++++++++++++++++++++++-------------
 1 file changed, 100 insertions(+), 37 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 21ebe2dec5a4..400dfe8a6e60 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3853,12 +3853,68 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 	return ret;
 }
 
+static bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
+		bool warns)
+{
+	/* order-1 is not supported for anonymous THP. */
+	if (folio_test_anon(folio) && new_order == 1) {
+		VM_WARN_ONCE(warns, "Cannot split to order-1 folio");
+		return false;
+	}
+
+	/*
+	 * No split if the file system does not support large folio.
+	 * Note that we might still have THPs in such mappings due to
+	 * CONFIG_READ_ONLY_THP_FOR_FS. But in that case, the mapping
+	 * does not actually support large folios properly.
+	 */
+	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
+	    !mapping_large_folio_support(folio->mapping)) {
+		VM_WARN_ONCE(warns,
+			"Cannot split file folio to non-0 order");
+		return false;
+	}
+
+	/* Only swapping a whole PMD-mapped folio is supported */
+	if (folio_test_swapcache(folio)) {
+		VM_WARN_ONCE(warns,
+			"Cannot split swapcache folio to non-0 order");
+		return false;
+	}
+
+	return true;
+}
+
+/* See comments in non_uniform_split_supported() */
+static bool uniform_split_supported(struct folio *folio, unsigned int new_order,
+		bool warns)
+{
+	if (folio_test_anon(folio) && new_order == 1) {
+		VM_WARN_ONCE(warns, "Cannot split to order-1 folio");
+		return false;
+	}
+
+	if (new_order) {
+		if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
+		    !mapping_large_folio_support(folio->mapping)) {
+			VM_WARN_ONCE(warns,
+				"Cannot split file folio to non-0 order");
+			return false;
+		}
+		if (folio_test_swapcache(folio)) {
+			VM_WARN_ONCE(warns,
+				"Cannot split swapcache folio to non-0 order");
+			return false;
+		}
+	}
+	return true;
+}
+
 static int __folio_split(struct folio *folio, unsigned int new_order,
-		struct page *page, struct list_head *list)
+		struct page *page, struct list_head *list, bool uniform_split)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
-	/* reset xarray order to new order after split */
-	XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
+	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
 	bool is_anon = folio_test_anon(folio);
 	struct address_space *mapping = NULL;
 	struct anon_vma *anon_vma = NULL;
@@ -3873,29 +3929,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 	if (new_order >= folio_order(folio))
 		return -EINVAL;
 
-	if (is_anon) {
-		/* order-1 is not supported for anonymous THP. */
-		if (new_order == 1) {
-			VM_WARN_ONCE(1, "Cannot split to order-1 folio");
-			return -EINVAL;
-		}
-	} else if (new_order) {
-		/*
-		 * No split if the file system does not support large folio.
-		 * Note that we might still have THPs in such mappings due to
-		 * CONFIG_READ_ONLY_THP_FOR_FS. But in that case, the mapping
-		 * does not actually support large folios properly.
-		 */
-		if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
-		    !mapping_large_folio_support(folio->mapping)) {
-			VM_WARN_ONCE(1,
-				"Cannot split file folio to non-0 order");
-			return -EINVAL;
-		}
-	}
+	if (uniform_split && !uniform_split_supported(folio, new_order, true))
+		return -EINVAL;
 
-	/* Only swapping a whole PMD-mapped folio is supported */
-	if (folio_test_swapcache(folio) && new_order)
+	if (!uniform_split &&
+	    !non_uniform_split_supported(folio, new_order, true))
 		return -EINVAL;
 
 	is_hzp = is_huge_zero_folio(folio);
@@ -3952,10 +3990,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			goto out;
 		}
 
-		xas_split_alloc(&xas, folio, folio_order(folio), gfp);
-		if (xas_error(&xas)) {
-			ret = xas_error(&xas);
-			goto out;
+		if (uniform_split) {
+			xas_set_order(&xas, folio->index, new_order);
+			xas_split_alloc(&xas, folio, folio_order(folio), gfp);
+			if (xas_error(&xas)) {
+				ret = xas_error(&xas);
+				goto out;
+			}
 		}
 
 		anon_vma = NULL;
@@ -4020,7 +4061,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		if (mapping) {
 			int nr = folio_nr_pages(folio);
 
-			xas_split(&xas, folio, folio_order(folio));
 			if (folio_test_pmd_mappable(folio) &&
 			    new_order < HPAGE_PMD_ORDER) {
 				if (folio_test_swapbacked(folio)) {
@@ -4034,12 +4074,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			}
 		}
 
-		if (is_anon) {
-			mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
-			mod_mthp_stat(new_order, MTHP_STAT_NR_ANON, 1 << (order - new_order));
-		}
-		__split_huge_page(page, list, end, new_order);
-		ret = 0;
+		ret = __split_unmapped_folio(page_folio(page), new_order,
+				page, list, end, &xas, mapping, uniform_split);
 	} else {
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:
@@ -4117,7 +4153,34 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 {
 	struct folio *folio = page_folio(page);
 
-	return __folio_split(folio, new_order, page, list);
+	return __folio_split(folio, new_order, page, list, true);
+}
+
+/*
+ * folio_split: split a folio at @page to a @new_order folio
+ * @folio: folio to split
+ * @new_order: the order of the new folio
+ * @page: a page within the new folio
+ *
+ * return: 0: successful, <0 failed (if -ENOMEM is returned, @folio might be
+ * split but not to @new_order, the caller needs to check)
+ *
+ * It has the same prerequisites and returns as
+ * split_huge_page_to_list_to_order().
+ *
+ * Split a folio at @page to a @new_order folio, leaving the remaining
+ * subpages of the original folio as large as possible. For example, split
+ * an order-9 folio at its third order-3 subpage to an order-3 folio. There
+ * are 2^6=64 order-3 subpages in an order-9 folio and the result will be a
+ * set of folios with different orders, with the new folio shown in braces:
+ * [order-4, {order-3}, order-3, order-5, order-6, order-7, order-8].
+ *
+ * After split, folio is left locked for caller.
+ */
+int folio_split(struct folio *folio, unsigned int new_order,
+		struct page *page, struct list_head *list)
+{
+	return __folio_split(folio, new_order, page, list, false);
 }
 
 int min_order_for_split(struct folio *folio)
-- 
2.47.2




* [PATCH v7 5/8] mm/huge_memory: remove the old, unused __split_huge_page()
  2025-02-11 15:50 [PATCH v7 0/8] Buddy allocator like (or non-uniform) folio split Zi Yan
                   ` (3 preceding siblings ...)
  2025-02-11 15:50 ` [PATCH v7 4/8] mm/huge_memory: add buddy allocator like (non-uniform) folio_split() Zi Yan
@ 2025-02-11 15:50 ` Zi Yan
  2025-02-11 15:50 ` [PATCH v7 6/8] mm/huge_memory: add folio_split() to debugfs testing interface Zi Yan
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Zi Yan @ 2025-02-11 15:50 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, Baolin Wang,
	linux-kselftest, linux-kernel, Zi Yan

Now that split_huge_page_to_list_to_order() uses the new backend split code
in __split_unmapped_folio(), the old __split_huge_page() and
__split_huge_page_tail() can be removed.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 207 -----------------------------------------------
 1 file changed, 207 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 400dfe8a6e60..437d0cd13663 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3281,213 +3281,6 @@ static void lru_add_page_tail(struct folio *folio, struct page *tail,
 	}
 }
 
-static void __split_huge_page_tail(struct folio *folio, int tail,
-		struct lruvec *lruvec, struct list_head *list,
-		unsigned int new_order)
-{
-	struct page *head = &folio->page;
-	struct page *page_tail = head + tail;
-	/*
-	 * Careful: new_folio is not a "real" folio before we cleared PageTail.
-	 * Don't pass it around before clear_compound_head().
-	 */
-	struct folio *new_folio = (struct folio *)page_tail;
-
-	VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
-
-	/*
-	 * Clone page flags before unfreezing refcount.
-	 *
-	 * After successful get_page_unless_zero() might follow flags change,
-	 * for example lock_page() which set PG_waiters.
-	 *
-	 * Note that for mapped sub-pages of an anonymous THP,
-	 * PG_anon_exclusive has been cleared in unmap_folio() and is stored in
-	 * the migration entry instead from where remap_page() will restore it.
-	 * We can still have PG_anon_exclusive set on effectively unmapped and
-	 * unreferenced sub-pages of an anonymous THP: we can simply drop
-	 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
-	 */
-	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
-	page_tail->flags |= (head->flags &
-			((1L << PG_referenced) |
-			 (1L << PG_swapbacked) |
-			 (1L << PG_swapcache) |
-			 (1L << PG_mlocked) |
-			 (1L << PG_uptodate) |
-			 (1L << PG_active) |
-			 (1L << PG_workingset) |
-			 (1L << PG_locked) |
-			 (1L << PG_unevictable) |
-#ifdef CONFIG_ARCH_USES_PG_ARCH_2
-			 (1L << PG_arch_2) |
-#endif
-#ifdef CONFIG_ARCH_USES_PG_ARCH_3
-			 (1L << PG_arch_3) |
-#endif
-			 (1L << PG_dirty) |
-			 LRU_GEN_MASK | LRU_REFS_MASK));
-
-	/* ->mapping in first and second tail page is replaced by other uses */
-	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
-			page_tail);
-	new_folio->mapping = folio->mapping;
-	new_folio->index = folio->index + tail;
-
-	/*
-	 * page->private should not be set in tail pages. Fix up and warn once
-	 * if private is unexpectedly set.
-	 */
-	if (unlikely(page_tail->private)) {
-		VM_WARN_ON_ONCE_PAGE(true, page_tail);
-		page_tail->private = 0;
-	}
-	if (folio_test_swapcache(folio))
-		new_folio->swap.val = folio->swap.val + tail;
-
-	/* Page flags must be visible before we make the page non-compound. */
-	smp_wmb();
-
-	/*
-	 * Clear PageTail before unfreezing page refcount.
-	 *
-	 * After successful get_page_unless_zero() might follow put_page()
-	 * which needs correct compound_head().
-	 */
-	clear_compound_head(page_tail);
-	if (new_order) {
-		prep_compound_page(page_tail, new_order);
-		folio_set_large_rmappable(new_folio);
-	}
-
-	/* Finally unfreeze refcount. Additional reference from page cache. */
-	page_ref_unfreeze(page_tail,
-		1 + ((!folio_test_anon(folio) || folio_test_swapcache(folio)) ?
-			     folio_nr_pages(new_folio) : 0));
-
-	if (folio_test_young(folio))
-		folio_set_young(new_folio);
-	if (folio_test_idle(folio))
-		folio_set_idle(new_folio);
-
-	folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
-
-	/*
-	 * always add to the tail because some iterators expect new
-	 * pages to show after the currently processed elements - e.g.
-	 * migrate_pages
-	 */
-	lru_add_page_tail(folio, page_tail, lruvec, list);
-}
-
-static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned int new_order)
-{
-	struct folio *folio = page_folio(page);
-	struct page *head = &folio->page;
-	struct lruvec *lruvec;
-	struct address_space *swap_cache = NULL;
-	unsigned long offset = 0;
-	int i, nr_dropped = 0;
-	unsigned int new_nr = 1 << new_order;
-	int order = folio_order(folio);
-	unsigned int nr = 1 << order;
-
-	/* complete memcg works before add pages to LRU */
-	split_page_memcg(head, order, new_order);
-
-	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
-		offset = swap_cache_index(folio->swap);
-		swap_cache = swap_address_space(folio->swap);
-		xa_lock(&swap_cache->i_pages);
-	}
-
-	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
-	lruvec = folio_lruvec_lock(folio);
-
-	folio_clear_has_hwpoisoned(folio);
-
-	for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
-		struct folio *tail;
-		__split_huge_page_tail(folio, i, lruvec, list, new_order);
-		tail = page_folio(head + i);
-		/* Some pages can be beyond EOF: drop them from page cache */
-		if (tail->index >= end) {
-			if (shmem_mapping(folio->mapping))
-				nr_dropped += new_nr;
-			else if (folio_test_clear_dirty(tail))
-				folio_account_cleaned(tail,
-					inode_to_wb(folio->mapping->host));
-			__filemap_remove_folio(tail, NULL);
-			folio_put(tail);
-		} else if (!folio_test_anon(folio)) {
-			__xa_store(&folio->mapping->i_pages, tail->index,
-					tail, 0);
-		} else if (swap_cache) {
-			__xa_store(&swap_cache->i_pages, offset + i,
-					tail, 0);
-		}
-	}
-
-	if (!new_order)
-		ClearPageCompound(head);
-	else {
-		struct folio *new_folio = (struct folio *)head;
-
-		folio_set_order(new_folio, new_order);
-	}
-	unlock_page_lruvec(lruvec);
-	/* Caller disabled irqs, so they are still disabled here */
-
-	split_page_owner(head, order, new_order);
-	pgalloc_tag_split(folio, order, new_order);
-
-	/* See comment in __split_huge_page_tail() */
-	if (folio_test_anon(folio)) {
-		/* Additional pin to swap cache */
-		if (folio_test_swapcache(folio)) {
-			folio_ref_add(folio, 1 + new_nr);
-			xa_unlock(&swap_cache->i_pages);
-		} else {
-			folio_ref_inc(folio);
-		}
-	} else {
-		/* Additional pin to page cache */
-		folio_ref_add(folio, 1 + new_nr);
-		xa_unlock(&folio->mapping->i_pages);
-	}
-	local_irq_enable();
-
-	if (nr_dropped)
-		shmem_uncharge(folio->mapping->host, nr_dropped);
-	remap_page(folio, nr, PageAnon(head) ? RMP_USE_SHARED_ZEROPAGE : 0);
-
-	/*
-	 * set page to its compound_head when split to non order-0 pages, so
-	 * we can skip unlocking it below, since PG_locked is transferred to
-	 * the compound_head of the page and the caller will unlock it.
-	 */
-	if (new_order)
-		page = compound_head(page);
-
-	for (i = 0; i < nr; i += new_nr) {
-		struct page *subpage = head + i;
-		struct folio *new_folio = page_folio(subpage);
-		if (subpage == page)
-			continue;
-		folio_unlock(new_folio);
-
-		/*
-		 * Subpages may be freed if there wasn't any mapping
-		 * like if add_to_swap() is running on a lru page that
-		 * had its mapping zapped. And freeing these pages
-		 * requires taking the lru_lock so we do the put_page
-		 * of the tail pages after the split is complete.
-		 */
-		free_page_and_swap_cache(subpage);
-	}
-}
-
 /* Racy check whether the huge page can be split */
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
 {
-- 
2.47.2



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v7 6/8] mm/huge_memory: add folio_split() to debugfs testing interface
  2025-02-11 15:50 [PATCH v7 0/8] Buddy allocator like (or non-uniform) folio split Zi Yan
                   ` (4 preceding siblings ...)
  2025-02-11 15:50 ` [PATCH v7 5/8] mm/huge_memory: remove the old, unused __split_huge_page() Zi Yan
@ 2025-02-11 15:50 ` Zi Yan
  2025-02-11 15:50 ` [PATCH v7 7/8] mm/truncate: use buddy allocator like folio split for truncate operation Zi Yan
  2025-02-11 15:50 ` [PATCH v7 8/8] selftests/mm: add tests for folio_split(), buddy allocator like split Zi Yan
  7 siblings, 0 replies; 26+ messages in thread
From: Zi Yan @ 2025-02-11 15:50 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, Baolin Wang,
	linux-kselftest, linux-kernel, Zi Yan

This allows testing folio_split() by specifying an additional in-folio page
offset parameter to the split_huge_page debugfs interface.
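
As a usage sketch (the field names below are placeholders, not literal
syntax), the new optional field is appended to the strings already accepted
by <debugfs>/split_huge_pages:

	<pid>,0x<vaddr_start>,0x<vaddr_end>[,<new_order>[,<in_folio_offset>]]
	<path>,0x<off_start>,0x<off_end>[,<new_order>[,<in_folio_offset>]]

If in_folio_offset is omitted or falls outside the folio, the interface keeps
the existing behavior and performs a uniform split_folio_to_order().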

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 47 ++++++++++++++++++++++++++++++++++-------------
 1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 437d0cd13663..05c09b791676 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4295,7 +4295,8 @@ static inline bool vma_not_suitable_for_thp_split(struct vm_area_struct *vma)
 }
 
 static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
-				unsigned long vaddr_end, unsigned int new_order)
+				unsigned long vaddr_end, unsigned int new_order,
+				long in_folio_offset)
 {
 	int ret = 0;
 	struct task_struct *task;
@@ -4379,8 +4380,16 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		if (!folio_test_anon(folio) && folio->mapping != mapping)
 			goto unlock;
 
-		if (!split_folio_to_order(folio, target_order))
-			split++;
+		if (in_folio_offset < 0 ||
+		    in_folio_offset >= folio_nr_pages(folio)) {
+			if (!split_folio_to_order(folio, target_order))
+				split++;
+		} else {
+			struct page *split_at = folio_page(folio,
+							   in_folio_offset);
+			if (!folio_split(folio, target_order, split_at, NULL))
+				split++;
+		}
 
 unlock:
 
@@ -4403,7 +4412,8 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 }
 
 static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
-				pgoff_t off_end, unsigned int new_order)
+				pgoff_t off_end, unsigned int new_order,
+				long in_folio_offset)
 {
 	struct filename *file;
 	struct file *candidate;
@@ -4452,8 +4462,15 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		if (folio->mapping != mapping)
 			goto unlock;
 
-		if (!split_folio_to_order(folio, target_order))
-			split++;
+		if (in_folio_offset < 0 || in_folio_offset >= nr_pages) {
+			if (!split_folio_to_order(folio, target_order))
+				split++;
+		} else {
+			struct page *split_at = folio_page(folio,
+							   in_folio_offset);
+			if (!folio_split(folio, target_order, split_at, NULL))
+				split++;
+		}
 
 unlock:
 		folio_unlock(folio);
@@ -4486,6 +4503,7 @@ static ssize_t split_huge_pages_write(struct file *file, const char __user *buf,
 	int pid;
 	unsigned long vaddr_start, vaddr_end;
 	unsigned int new_order = 0;
+	long in_folio_offset = -1;
 
 	ret = mutex_lock_interruptible(&split_debug_mutex);
 	if (ret)
@@ -4514,30 +4532,33 @@ static ssize_t split_huge_pages_write(struct file *file, const char __user *buf,
 			goto out;
 		}
 
-		ret = sscanf(tok_buf, "0x%lx,0x%lx,%d", &off_start,
-			    &off_end, &new_order);
-		if (ret != 2 && ret != 3) {
+		ret = sscanf(tok_buf, "0x%lx,0x%lx,%d,%ld", &off_start, &off_end,
+				&new_order, &in_folio_offset);
+		if (ret != 2 && ret != 3 && ret != 4) {
 			ret = -EINVAL;
 			goto out;
 		}
-		ret = split_huge_pages_in_file(file_path, off_start, off_end, new_order);
+		ret = split_huge_pages_in_file(file_path, off_start, off_end,
+				new_order, in_folio_offset);
 		if (!ret)
 			ret = input_len;
 
 		goto out;
 	}
 
-	ret = sscanf(input_buf, "%d,0x%lx,0x%lx,%d", &pid, &vaddr_start, &vaddr_end, &new_order);
+	ret = sscanf(input_buf, "%d,0x%lx,0x%lx,%d,%ld", &pid, &vaddr_start,
+			&vaddr_end, &new_order, &in_folio_offset);
 	if (ret == 1 && pid == 1) {
 		split_huge_pages_all();
 		ret = strlen(input_buf);
 		goto out;
-	} else if (ret != 3 && ret != 4) {
+	} else if (ret != 3 && ret != 4 && ret != 5) {
 		ret = -EINVAL;
 		goto out;
 	}
 
-	ret = split_huge_pages_pid(pid, vaddr_start, vaddr_end, new_order);
+	ret = split_huge_pages_pid(pid, vaddr_start, vaddr_end, new_order,
+			in_folio_offset);
 	if (!ret)
 		ret = strlen(input_buf);
 out:
-- 
2.47.2



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v7 7/8] mm/truncate: use buddy allocator like folio split for truncate operation
  2025-02-11 15:50 [PATCH v7 0/8] Buddy allocator like (or non-uniform) folio split Zi Yan
                   ` (5 preceding siblings ...)
  2025-02-11 15:50 ` [PATCH v7 6/8] mm/huge_memory: add folio_split() to debugfs testing interface Zi Yan
@ 2025-02-11 15:50 ` Zi Yan
  2025-02-11 15:50 ` [PATCH v7 8/8] selftests/mm: add tests for folio_split(), buddy allocator like split Zi Yan
  7 siblings, 0 replies; 26+ messages in thread
From: Zi Yan @ 2025-02-11 15:50 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, Baolin Wang,
	linux-kselftest, linux-kernel, Zi Yan

Instead of splitting the large folio uniformly during truncation, try to
use a buddy allocator like split at the start of the truncation range to
minimize the number of resulting folios, if it is supported.
try_folio_split() is introduced to use folio_split() if supported, and to
fall back to uniform split otherwise.

For example, to truncate an order-4 folio
[0, 1, 2, 3, 4, 5, ..., 15]
between [3, 10] (inclusive), folio_split() splits the folio to
[0,1], [2], [3], [4..7], [8..15]; [3] and [4..7] can be dropped, and
[8..15] is kept with zeros in [8..10]. Then another folio_split() is
done at 10, so [8..10] can be dropped.

One possible optimization is to make folio_split() split a folio based on
a given range, like [3..10] above. But that complicates folio_split(), so it
will be investigated when necessary.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/huge_mm.h | 36 ++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c        |  4 ++--
 mm/truncate.c           | 31 ++++++++++++++++++++++++++++++-
 3 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9e7be666bb6c..c00669b4fdcd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -343,6 +343,36 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
 int min_order_for_split(struct folio *folio);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
+bool uniform_split_supported(struct folio *folio, unsigned int new_order,
+		bool warns);
+bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
+		bool warns);
+int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
+		struct list_head *list);
+/*
+ * try_folio_split - try to split a @folio at @page using non uniform split.
+ * @folio: folio to be split
+ * @page: split to order-0 at the given page
+ * @list: store the after-split folios
+ *
+ * Try to split a @folio at @page using non uniform split to order-0, if
+ * non uniform split is not supported, fall back to uniform split.
+ *
+ * Return: 0: split is successful, otherwise split failed.
+ */
+static inline int try_folio_split(struct folio *folio, struct page *page,
+		struct list_head *list)
+{
+	int ret = min_order_for_split(folio);
+
+	if (ret < 0)
+		return ret;
+
+	if (!non_uniform_split_supported(folio, 0, false))
+		return split_huge_page_to_list_to_order(&folio->page, list,
+				ret);
+	return folio_split(folio, ret, page, list);
+}
 static inline int split_huge_page(struct page *page)
 {
 	struct folio *folio = page_folio(page);
@@ -535,6 +565,12 @@ static inline int split_folio_to_list(struct folio *folio, struct list_head *lis
 	return 0;
 }
 
+static inline int try_folio_split(struct folio *folio, struct page *page,
+		struct list_head *list)
+{
+	return 0;
+}
+
 static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 05c09b791676..ab80348f33dd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3646,7 +3646,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 	return ret;
 }
 
-static bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
+bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
 		bool warns)
 {
 	/* order-1 is not supported for anonymous THP. */
@@ -3679,7 +3679,7 @@ static bool non_uniform_split_supported(struct folio *folio, unsigned int new_or
 }
 
 /* See comments in non_uniform_split_supported() */
-static bool uniform_split_supported(struct folio *folio, unsigned int new_order,
+bool uniform_split_supported(struct folio *folio, unsigned int new_order,
 		bool warns)
 {
 	if (folio_test_anon(folio) && new_order == 1) {
diff --git a/mm/truncate.c b/mm/truncate.c
index 0395e578d946..031d0be19f42 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -192,6 +192,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 {
 	loff_t pos = folio_pos(folio);
 	unsigned int offset, length;
+	struct page *split_at, *split_at2;
 
 	if (pos < start)
 		offset = start - pos;
@@ -221,8 +222,36 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 		folio_invalidate(folio, offset, length);
 	if (!folio_test_large(folio))
 		return true;
-	if (split_folio(folio) == 0)
+
+	split_at = folio_page(folio, PAGE_ALIGN_DOWN(offset) / PAGE_SIZE);
+	split_at2 = folio_page(folio,
+			PAGE_ALIGN_DOWN(offset + length) / PAGE_SIZE);
+
+	if (!try_folio_split(folio, split_at, NULL)) {
+		/*
+		 * try to split at offset + length to make sure folios within
+		 * the range can be dropped, especially to avoid memory waste
+		 * for shmem truncate
+		 */
+		struct folio *folio2 = page_folio(split_at2);
+
+		if (!folio_try_get(folio2))
+			goto no_split;
+
+		if (!folio_test_large(folio2))
+			goto out;
+
+		if (!folio_trylock(folio2))
+			goto out;
+
+		/* split result does not matter here */
+		try_folio_split(folio2, split_at2, NULL);
+		folio_unlock(folio2);
+out:
+		folio_put(folio2);
+no_split:
 		return true;
+	}
 	if (folio_test_dirty(folio))
 		return false;
 	truncate_inode_folio(folio->mapping, folio);
-- 
2.47.2



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v7 8/8] selftests/mm: add tests for folio_split(), buddy allocator like split
  2025-02-11 15:50 [PATCH v7 0/8] Buddy allocator like (or non-uniform) folio split Zi Yan
                   ` (6 preceding siblings ...)
  2025-02-11 15:50 ` [PATCH v7 7/8] mm/truncate: use buddy allocator like folio split for truncate operation Zi Yan
@ 2025-02-11 15:50 ` Zi Yan
  7 siblings, 0 replies; 26+ messages in thread
From: Zi Yan @ 2025-02-11 15:50 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, Baolin Wang,
	linux-kselftest, linux-kernel, Zi Yan

The added test splits page cache folios to orders from 0 to 8 at different
in-folio offsets.
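
For reference, assuming 4KB base pages and a 2MB PMD folio (512 pages), the
offset loop in the hunk below steps by MAX(512 / 4, 1 << i) = MAX(128, 1 << i):
orders 0 to 7 are each tested at offsets 0, 128, 256 and 384, and order 8 at
offsets 0 and 256, which is where the additional 8*4+2 entries in
ksft_set_plan() come from.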

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 .../selftests/mm/split_huge_page_test.c       | 34 +++++++++++++++----
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c
index e0304046b1a0..719c5e2a6624 100644
--- a/tools/testing/selftests/mm/split_huge_page_test.c
+++ b/tools/testing/selftests/mm/split_huge_page_test.c
@@ -14,6 +14,7 @@
 #include <fcntl.h>
 #include <sys/mman.h>
 #include <sys/mount.h>
+#include <sys/param.h>
 #include <malloc.h>
 #include <stdbool.h>
 #include <time.h>
@@ -456,7 +457,8 @@ int create_pagecache_thp_and_fd(const char *testfile, size_t fd_size, int *fd,
 	return -1;
 }
 
-void split_thp_in_pagecache_to_order(size_t fd_size, int order, const char *fs_loc)
+void split_thp_in_pagecache_to_order_at(size_t fd_size, const char *fs_loc,
+		int order, int offset)
 {
 	int fd;
 	char *addr;
@@ -474,7 +476,12 @@ void split_thp_in_pagecache_to_order(size_t fd_size, int order, const char *fs_l
 		return;
 	err = 0;
 
-	write_debugfs(PID_FMT, getpid(), (uint64_t)addr, (uint64_t)addr + fd_size, order);
+	if (offset == -1)
+		write_debugfs(PID_FMT, getpid(), (uint64_t)addr,
+			      (uint64_t)addr + fd_size, order);
+	else
+		write_debugfs(PID_FMT, getpid(), (uint64_t)addr,
+			      (uint64_t)addr + fd_size, order, offset);
 
 	for (i = 0; i < fd_size; i++)
 		if (*(addr + i) != (char)i) {
@@ -493,9 +500,15 @@ void split_thp_in_pagecache_to_order(size_t fd_size, int order, const char *fs_l
 	munmap(addr, fd_size);
 	close(fd);
 	unlink(testfile);
-	if (err)
-		ksft_exit_fail_msg("Split PMD-mapped pagecache folio to order %d failed\n", order);
-	ksft_test_result_pass("Split PMD-mapped pagecache folio to order %d passed\n", order);
+	if (offset == -1) {
+		if (err)
+			ksft_exit_fail_msg("Split PMD-mapped pagecache folio to order %d failed\n", order);
+		ksft_test_result_pass("Split PMD-mapped pagecache folio to order %d passed\n", order);
+	} else {
+		if (err)
+			ksft_exit_fail_msg("Split PMD-mapped pagecache folio to order %d at in-folio offset %d failed\n", order, offset);
+		ksft_test_result_pass("Split PMD-mapped pagecache folio to order %d at in-folio offset %d passed\n", order, offset);
+	}
 }
 
 int main(int argc, char **argv)
@@ -506,6 +519,7 @@ int main(int argc, char **argv)
 	char fs_loc_template[] = "/tmp/thp_fs_XXXXXX";
 	const char *fs_loc;
 	bool created_tmp;
+	int offset;
 
 	ksft_print_header();
 
@@ -517,7 +531,7 @@ int main(int argc, char **argv)
 	if (argc > 1)
 		optional_xfs_path = argv[1];
 
-	ksft_set_plan(1+8+1+9+9);
+	ksft_set_plan(1+8+1+9+9+8*4+2);
 
 	pagesize = getpagesize();
 	pageshift = ffs(pagesize) - 1;
@@ -540,7 +554,13 @@ int main(int argc, char **argv)
 	created_tmp = prepare_thp_fs(optional_xfs_path, fs_loc_template,
 			&fs_loc);
 	for (i = 8; i >= 0; i--)
-		split_thp_in_pagecache_to_order(fd_size, i, fs_loc);
+		split_thp_in_pagecache_to_order_at(fd_size, fs_loc, i, -1);
+
+	for (i = 0; i < 9; i++)
+		for (offset = 0;
+		     offset < pmd_pagesize / pagesize;
+		     offset += MAX(pmd_pagesize / pagesize / 4, 1 << i))
+			split_thp_in_pagecache_to_order_at(fd_size, fs_loc, i, offset);
 	cleanup_thp_fs(fs_loc, created_tmp);
 
 	ksft_finished();
-- 
2.47.2



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/8] xarray: add xas_try_split() to split a multi-index entry
  2025-02-11 15:50 ` [PATCH v7 1/8] xarray: add xas_try_split() to split a multi-index entry Zi Yan
@ 2025-02-12  0:57   ` Zi Yan
  2025-02-12  1:51     ` Zi Yan
  2025-02-17 21:44   ` David Hildenbrand
  1 sibling, 1 reply; 26+ messages in thread
From: Zi Yan @ 2025-02-12  0:57 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, Baolin Wang,
	linux-kselftest, linux-kernel, Zi Yan

On 11 Feb 2025, at 10:50, Zi Yan wrote:

> It is a preparation patch for non-uniform folio split, which always split
> a folio into half iteratively, and minimal xarray entry split.
>
> Currently, xas_split_alloc() and xas_split() always split all slots from a
> multi-index entry. They cost the same number of xa_node as the to-be-split
> slots. For example, to split an order-9 entry, which takes 2^(9-6)=8
> slots, assuming XA_CHUNK_SHIFT is 6 (!CONFIG_BASE_SMALL), 8 xa_node are
> needed. Instead xas_try_split() is intended to be used iteratively to split
> the order-9 entry into 2 order-8 entries, then split one order-8 entry,
> based on the given index, to 2 order-7 entries, ..., and split one order-1
> entry to 2 order-0 entries. When splitting the order-6 entry and a new
> xa_node is needed, xas_try_split() will try to allocate one if possible.
> As a result, xas_try_split() would only need one xa_node instead of 8.
>
> When a new xa_node is needed during the split, xas_try_split() can try to
> allocate one but no more. -ENOMEM will be returned if a node cannot be
> allocated. -EINVAL will be returned if a sibling node is split or
> cascade split happens, where two or more new nodes are needed, and these
> are not supported by xas_try_split().
>
> xas_split_alloc() and xas_split() split an order-9 to order-0:
>
>          ---------------------------------
>          |   |   |   |   |   |   |   |   |
>          | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
>          |   |   |   |   |   |   |   |   |
>          ---------------------------------
>            |   |                   |   |
>      -------   ---               ---   -------
>      |           |     ...       |           |
>      V           V               V           V
> ----------- -----------     ----------- -----------
> | xa_node | | xa_node | ... | xa_node | | xa_node |
> ----------- -----------     ----------- -----------
>
> xas_try_split() splits an order-9 to order-0:
>    ---------------------------------
>    |   |   |   |   |   |   |   |   |
>    | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
>    |   |   |   |   |   |   |   |   |
>    ---------------------------------
>      |
>      |
>      V
> -----------
> | xa_node |
> -----------
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  Documentation/core-api/xarray.rst |  14 ++-
>  include/linux/xarray.h            |   7 ++
>  lib/test_xarray.c                 |  47 +++++++++++
>  lib/xarray.c                      | 136 ++++++++++++++++++++++++++----
>  tools/testing/radix-tree/Makefile |   1 +
>  5 files changed, 188 insertions(+), 17 deletions(-)

Hi Andrew,

Do you mind folding the diff below into this one? I changed the function
name but forgot to update the one in the xarray test. Thanks.

diff --git a/lib/test_xarray.c b/lib/test_xarray.c
index 598ca38a2f5b..cc2dd325158f 100644
--- a/lib/test_xarray.c
+++ b/lib/test_xarray.c
@@ -1868,7 +1868,7 @@ static void check_split_2(struct xarray *xa, unsigned long index,
 	xa_set_mark(xa, index, XA_MARK_1);

 	xas_lock(&xas);
-	xas_try_halve(&xas, xa, order, GFP_KERNEL);
+	xas_try_split(&xas, xa, order, GFP_KERNEL);
 	if (((new_order / XA_CHUNK_SHIFT) < (order / XA_CHUNK_SHIFT)) &&
 	    new_order < order - 1) {
 		XA_BUG_ON(xa, !xas_error(&xas) || xas_error(&xas) != -EINVAL);


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/8] xarray: add xas_try_split() to split a multi-index entry
  2025-02-12  0:57   ` Zi Yan
@ 2025-02-12  1:51     ` Zi Yan
  0 siblings, 0 replies; 26+ messages in thread
From: Zi Yan @ 2025-02-12  1:51 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Kirill A . Shutemov, Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, Baolin Wang,
	linux-kselftest, linux-kernel, Zi Yan

On 11 Feb 2025, at 19:57, Zi Yan wrote:

> On 11 Feb 2025, at 10:50, Zi Yan wrote:
>
>> It is a preparation patch for non-uniform folio split, which always split
>> a folio into half iteratively, and minimal xarray entry split.
>>
>> Currently, xas_split_alloc() and xas_split() always split all slots from a
>> multi-index entry. They cost the same number of xa_node as the to-be-split
>> slots. For example, to split an order-9 entry, which takes 2^(9-6)=8
>> slots, assuming XA_CHUNK_SHIFT is 6 (!CONFIG_BASE_SMALL), 8 xa_node are
>> needed. Instead xas_try_split() is intended to be used iteratively to split
>> the order-9 entry into 2 order-8 entries, then split one order-8 entry,
>> based on the given index, to 2 order-7 entries, ..., and split one order-1
>> entry to 2 order-0 entries. When splitting the order-6 entry and a new
>> xa_node is needed, xas_try_split() will try to allocate one if possible.
>> As a result, xas_try_split() would only need one xa_node instead of 8.
>>
>> When a new xa_node is needed during the split, xas_try_split() can try to
>> allocate one but no more. -ENOMEM will be returned if a node cannot be
>> allocated. -EINVAL will be returned if a sibling node is split or
>> cascade split happens, where two or more new nodes are needed, and these
>> are not supported by xas_try_split().
>>
>> xas_split_alloc() and xas_split() split an order-9 to order-0:
>>
>>          ---------------------------------
>>          |   |   |   |   |   |   |   |   |
>>          | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
>>          |   |   |   |   |   |   |   |   |
>>          ---------------------------------
>>            |   |                   |   |
>>      -------   ---               ---   -------
>>      |           |     ...       |           |
>>      V           V               V           V
>> ----------- -----------     ----------- -----------
>> | xa_node | | xa_node | ... | xa_node | | xa_node |
>> ----------- -----------     ----------- -----------
>>
>> xas_try_split() splits an order-9 to order-0:
>>    ---------------------------------
>>    |   |   |   |   |   |   |   |   |
>>    | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
>>    |   |   |   |   |   |   |   |   |
>>    ---------------------------------
>>      |
>>      |
>>      V
>> -----------
>> | xa_node |
>> -----------
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  Documentation/core-api/xarray.rst |  14 ++-
>>  include/linux/xarray.h            |   7 ++
>>  lib/test_xarray.c                 |  47 +++++++++++
>>  lib/xarray.c                      | 136 ++++++++++++++++++++++++++----
>>  tools/testing/radix-tree/Makefile |   1 +
>>  5 files changed, 188 insertions(+), 17 deletions(-)
>
> Hi Andrew,
>
> Do you mind folding the diff below into this one? I changed the function
> name but forgot to update the one in the xarray test. Thanks.

From bdf3b10f2ebcd09898ba7a277ac7107c25b8c71b Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Tue, 11 Feb 2025 20:48:55 -0500
Subject: [PATCH] correct the function name.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 lib/test_xarray.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/test_xarray.c b/lib/test_xarray.c
index 598ca38a2f5b..cc2dd325158f 100644
--- a/lib/test_xarray.c
+++ b/lib/test_xarray.c
@@ -1868,7 +1868,7 @@ static void check_split_2(struct xarray *xa, unsigned long index,
 	xa_set_mark(xa, index, XA_MARK_1);

 	xas_lock(&xas);
-	xas_try_halve(&xas, xa, order, GFP_KERNEL);
+	xas_try_split(&xas, xa, order, GFP_KERNEL);
 	if (((new_order / XA_CHUNK_SHIFT) < (order / XA_CHUNK_SHIFT)) &&
 	    new_order < order - 1) {
 		XA_BUG_ON(xa, !xas_error(&xas) || xas_error(&xas) != -EINVAL);
-- 
2.47.2



Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 2/8] mm/huge_memory: add two new (not yet used) functions for folio_split()
  2025-02-11 15:50 ` [PATCH v7 2/8] mm/huge_memory: add two new (not yet used) functions for folio_split() Zi Yan
@ 2025-02-14 21:59   ` David Hildenbrand
  2025-02-14 22:03     ` Zi Yan
  2025-02-15  1:52   ` Zi Yan
  1 sibling, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-02-14 21:59 UTC (permalink / raw)
  To: Zi Yan, linux-mm, Andrew Morton, Kirill A . Shutemov,
	Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, Yang Shi, Miaohe Lin, Kefeng Wang,
	Yu Zhao, John Hubbard, Baolin Wang, linux-kselftest,
	linux-kernel

On 11.02.25 16:50, Zi Yan wrote:
> This is a preparation patch, both added functions are not used yet.
> 
> The added __split_unmapped_folio() is able to split a folio with
> its mapping removed in two manners: 1) uniform split (the existing way),
> and 2) buddy allocator like split.
> 
> The added __split_folio_to_order() can split a folio into any lower order.
> For uniform split, __split_unmapped_folio() calls it once to split
> the given folio to the new order. For buddy allocator split,
> __split_unmapped_folio() calls it (folio_order - new_order) times
> and each time splits the folio containing the given page to one lower
> order.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   mm/huge_memory.c | 349 ++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 348 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a0277f4154c2..12d3f515c408 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3262,7 +3262,6 @@ static void remap_page(struct folio *folio, unsigned long nr, int flags)
>   static void lru_add_page_tail(struct folio *folio, struct page *tail,
>   		struct lruvec *lruvec, struct list_head *list)
>   {
> -	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>   	VM_BUG_ON_FOLIO(PageLRU(tail), folio);
>   	lockdep_assert_held(&lruvec->lru_lock);
>   
> @@ -3506,6 +3505,354 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>   					caller_pins;
>   }
>   
> +/*
> + * It splits @folio into @new_order folios and copies the @folio metadata to
> + * all the resulting folios.
> + */
> +static int __split_folio_to_order(struct folio *folio, int new_order)
> +{
> +	int curr_order = folio_order(folio);
> +	long nr_pages = folio_nr_pages(folio);
> +	long new_nr_pages = 1 << new_order;
> +	long index;
> +
> +	if (curr_order <= new_order)
> +		return -EINVAL;
> +
> +	/*
> +	 * Skip the first new_nr_pages, since the new folio from them have all
> +	 * the flags from the original folio.
> +	 */
> +	for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {
> +		struct page *head = &folio->page;
> +		struct page *new_head = head + index;
> +
> +		/*
> +		 * Careful: new_folio is not a "real" folio before we cleared PageTail.
> +		 * Don't pass it around before clear_compound_head().
> +		 */
> +		struct folio *new_folio = (struct folio *)new_head;
> +
> +		VM_BUG_ON_PAGE(atomic_read(&new_head->_mapcount) != -1, new_head);
> +
> +		/*
> +		 * Clone page flags before unfreezing refcount.
> +		 *
> +		 * After successful get_page_unless_zero() might follow flags change,
> +		 * for example lock_page() which set PG_waiters.
> +		 *
> +		 * Note that for mapped sub-pages of an anonymous THP,
> +		 * PG_anon_exclusive has been cleared in unmap_folio() and is stored in
> +		 * the migration entry instead from where remap_page() will restore it.
> +		 * We can still have PG_anon_exclusive set on effectively unmapped and
> +		 * unreferenced sub-pages of an anonymous THP: we can simply drop
> +		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
> +		 */
> +		new_head->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +		new_head->flags |= (head->flags &
> +				((1L << PG_referenced) |
> +				 (1L << PG_swapbacked) |
> +				 (1L << PG_swapcache) |
> +				 (1L << PG_mlocked) |
> +				 (1L << PG_uptodate) |
> +				 (1L << PG_active) |
> +				 (1L << PG_workingset) |
> +				 (1L << PG_locked) |
> +				 (1L << PG_unevictable) |
> +#ifdef CONFIG_ARCH_USES_PG_ARCH_2
> +				 (1L << PG_arch_2) |
> +#endif
> +#ifdef CONFIG_ARCH_USES_PG_ARCH_3
> +				 (1L << PG_arch_3) |
> +#endif
> +				 (1L << PG_dirty) |
> +				 LRU_GEN_MASK | LRU_REFS_MASK));
> +
> +		/* ->mapping in first and second tail page is replaced by other uses */
> +		VM_BUG_ON_PAGE(new_nr_pages > 2 && new_head->mapping != TAIL_MAPPING,
> +			       new_head);
> +		new_head->mapping = head->mapping;
> +		new_head->index = head->index + index;
> +
> +		/*
> +		 * page->private should not be set in tail pages. Fix up and warn once
> +		 * if private is unexpectedly set.
> +		 */
> +		if (unlikely(new_head->private)) {
> +			VM_WARN_ON_ONCE_PAGE(true, new_head);
> +			new_head->private = 0;
> +		}
> +
> +		if (folio_test_swapcache(folio))
> +			new_folio->swap.val = folio->swap.val + index;
> +
> +		/* Page flags must be visible before we make the page non-compound. */
> +		smp_wmb();
> +
> +		/*
> +		 * Clear PageTail before unfreezing page refcount.
> +		 *
> +		 * After successful get_page_unless_zero() might follow put_page()
> +		 * which needs correct compound_head().
> +		 */
> +		clear_compound_head(new_head);
> +		if (new_order) {
> +			prep_compound_page(new_head, new_order);
> +			folio_set_large_rmappable(new_folio);
> +
> +			folio_set_order(folio, new_order);
> +		}
> +
> +		if (folio_test_young(folio))
> +			folio_set_young(new_folio);
> +		if (folio_test_idle(folio))
> +			folio_set_idle(new_folio);
> +
> +		folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
> +	}
> +
> +	if (!new_order)
> +		ClearPageCompound(&folio->page);
> +
> +	return 0;
> +}
> +
> +/*
> + * It splits an unmapped @folio to lower order smaller folios in two ways.
> + * @folio: the to-be-split folio
> + * @new_order: the smallest order of the after split folios (since buddy
> + *             allocator like split generates folios with orders from @folio's
> + *             order - 1 to new_order).
> + * @page: in buddy allocator like split, the folio containing @page will be
> + *        split until its order becomes @new_order.
> + * @list: the after split folios will be added to @list if it is not NULL,
> + *        otherwise to LRU lists.
> + * @end: the end of the file @folio maps to. -1 if @folio is anonymous memory.
> + * @xas: xa_state pointing to folio->mapping->i_pages and locked by caller
> + * @mapping: @folio->mapping
> + * @uniform_split: if the split is uniform or not (buddy allocator like split)
> + *
> + *
> + * 1. uniform split: the given @folio into multiple @new_order small folios,
> + *    where all small folios have the same order. This is done when
> + *    uniform_split is true.
> + * 2. buddy allocator like (non-uniform) split: the given @folio is split into
> + *    half and one of the half (containing the given page) is split into half
> + *    until the given @page's order becomes @new_order. This is done when
> + *    uniform_split is false.
> + *
> + * The high level flow for these two methods are:
> + * 1. uniform split: a single __split_folio_to_order() is called to split the
> + *    @folio into @new_order, then we traverse all the resulting folios one by
> + *    one in PFN ascending order and perform stats, unfreeze, adding to list,
> + *    and file mapping index operations.
> + * 2. non-uniform split: in general, folio_order - @new_order calls to
> + *    __split_folio_to_order() are made in a for loop to split the @folio
> + *    to one lower order at a time. The resulting small folios are processed
> + *    like what is done during the traversal in 1, except the one containing
> + *    @page, which is split in next for loop.
> + *
> + * After splitting, the caller's folio reference will be transferred to the
> + * folio containing @page. The other folios may be freed if they are not mapped.
> + *
> + * In terms of locking, after splitting,
> + * 1. uniform split leaves @page (or the folio contains it) locked;
> + * 2. buddy allocator like (non-uniform) split leaves @folio locked.
> + *
> + *
> + * For !uniform_split, when -ENOMEM is returned, the original folio might be
> + * split. The caller needs to check the input folio.
> + */
> +static int __split_unmapped_folio(struct folio *folio, int new_order,
> +		struct page *page, struct list_head *list, pgoff_t end,
> +		struct xa_state *xas, struct address_space *mapping,
> +		bool uniform_split)
> +{
> +	struct lruvec *lruvec;
> +	struct address_space *swap_cache = NULL;
> +	struct folio *origin_folio = folio;
> +	struct folio *next_folio = folio_next(folio);
> +	struct folio *new_folio;
> +	struct folio *next;
> +	int order = folio_order(folio);
> +	int split_order;
> +	int start_order = uniform_split ? new_order : order - 1;
> +	int nr_dropped = 0;
> +	int ret = 0;
> +	bool stop_split = false;
> +
> +	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
> +		/* a swapcache folio can only be uniformly split to order-0 */
> +		if (!uniform_split || new_order != 0)
> +			return -EINVAL;
> +
> +		swap_cache = swap_address_space(folio->swap);
> +		xa_lock(&swap_cache->i_pages);
> +	}
> +
> +	if (folio_test_anon(folio))
> +		mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
> +
> +	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> +	lruvec = folio_lruvec_lock(folio);
> +
> +	folio_clear_has_hwpoisoned(folio);
> +
> +	/*
> +	 * split to new_order one order at a time. For uniform split,
> +	 * folio is split to new_order directly.
> +	 */
> +	for (split_order = start_order;
> +	     split_order >= new_order && !stop_split;
> +	     split_order--) {
> +		int old_order = folio_order(folio);
> +		struct folio *release;
> +		struct folio *end_folio = folio_next(folio);
> +		int status;
> +
> +		/* order-1 anonymous folio is not supported */
> +		if (folio_test_anon(folio) && split_order == 1)
> +			continue;
> +		if (uniform_split && split_order != new_order)
> +			continue;
> +
> +		if (mapping) {
> +			/*
> +			 * uniform split has xas_split_alloc() called before
> +			 * irq is disabled to allocate enough memory, whereas
> +			 * non-uniform split can handle ENOMEM.
> +			 */
> +			if (uniform_split)
> +				xas_split(xas, folio, old_order);
> +			else {
> +				xas_set_order(xas, folio->index, split_order);
> +				xas_try_split(xas, folio, old_order,
> +						GFP_NOWAIT);
> +				if (xas_error(xas)) {
> +					ret = xas_error(xas);
> +					stop_split = true;
> +					goto after_split;
> +				}
> +			}
> +		}
> +
> +		/* complete memcg works before add pages to LRU */
> +		split_page_memcg(&folio->page, old_order, split_order);
> +		split_page_owner(&folio->page, old_order, split_order);
> +		pgalloc_tag_split(folio, old_order, split_order);
> +
> +		status = __split_folio_to_order(folio, split_order);
> +

Stumbling over that code (sorry for the late reply ... ).

That looks weird. We split memcg/owner/pgalloc ... and then figure out 
in __split_folio_to_order() that we don't want to ... split?

Should that all be moved into __split_folio_to_order() and performed 
only when we really want to split?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 2/8] mm/huge_memory: add two new (not yet used) functions for folio_split()
  2025-02-14 21:59   ` David Hildenbrand
@ 2025-02-14 22:03     ` Zi Yan
  2025-02-14 22:06       ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Zi Yan @ 2025-02-14 22:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Kirill A . Shutemov,
	Matthew Wilcox (Oracle),
	Ryan Roberts, Hugh Dickins, Yang Shi, Miaohe Lin, Kefeng Wang,
	Yu Zhao, John Hubbard, Baolin Wang, linux-kselftest,
	linux-kernel

On 14 Feb 2025, at 16:59, David Hildenbrand wrote:

> On 11.02.25 16:50, Zi Yan wrote:
>> This is a preparation patch, both added functions are not used yet.
>>
>> The added __split_unmapped_folio() is able to split a folio with
>> its mapping removed in two manners: 1) uniform split (the existing way),
>> and 2) buddy allocator like split.
>>
>> The added __split_folio_to_order() can split a folio into any lower order.
>> For uniform split, __split_unmapped_folio() calls it once to split
>> the given folio to the new order. For buddy allocator split,
>> __split_unmapped_folio() calls it (folio_order - new_order) times
>> and each time splits the folio containing the given page to one lower
>> order.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>   mm/huge_memory.c | 349 ++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 348 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index a0277f4154c2..12d3f515c408 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3262,7 +3262,6 @@ static void remap_page(struct folio *folio, unsigned long nr, int flags)
>>   static void lru_add_page_tail(struct folio *folio, struct page *tail,
>>   		struct lruvec *lruvec, struct list_head *list)
>>   {
>> -	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>   	VM_BUG_ON_FOLIO(PageLRU(tail), folio);
>>   	lockdep_assert_held(&lruvec->lru_lock);
>>  @@ -3506,6 +3505,354 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>>   					caller_pins;
>>   }
>>  +/*
>> + * It splits @folio into @new_order folios and copies the @folio metadata to
>> + * all the resulting folios.
>> + */
>> +static int __split_folio_to_order(struct folio *folio, int new_order)
>> +{
>> +	int curr_order = folio_order(folio);
>> +	long nr_pages = folio_nr_pages(folio);
>> +	long new_nr_pages = 1 << new_order;
>> +	long index;
>> +
>> +	if (curr_order <= new_order)
>> +		return -EINVAL;
>> +
>> +	/*
>> +	 * Skip the first new_nr_pages, since the new folio from them have all
>> +	 * the flags from the original folio.
>> +	 */
>> +	for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {
>> +		struct page *head = &folio->page;
>> +		struct page *new_head = head + index;
>> +
>> +		/*
>> +		 * Careful: new_folio is not a "real" folio before we cleared PageTail.
>> +		 * Don't pass it around before clear_compound_head().
>> +		 */
>> +		struct folio *new_folio = (struct folio *)new_head;
>> +
>> +		VM_BUG_ON_PAGE(atomic_read(&new_head->_mapcount) != -1, new_head);
>> +
>> +		/*
>> +		 * Clone page flags before unfreezing refcount.
>> +		 *
>> +		 * After successful get_page_unless_zero() might follow flags change,
>> +		 * for example lock_page() which set PG_waiters.
>> +		 *
>> +		 * Note that for mapped sub-pages of an anonymous THP,
>> +		 * PG_anon_exclusive has been cleared in unmap_folio() and is stored in
>> +		 * the migration entry instead from where remap_page() will restore it.
>> +		 * We can still have PG_anon_exclusive set on effectively unmapped and
>> +		 * unreferenced sub-pages of an anonymous THP: we can simply drop
>> +		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
>> +		 */
>> +		new_head->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
>> +		new_head->flags |= (head->flags &
>> +				((1L << PG_referenced) |
>> +				 (1L << PG_swapbacked) |
>> +				 (1L << PG_swapcache) |
>> +				 (1L << PG_mlocked) |
>> +				 (1L << PG_uptodate) |
>> +				 (1L << PG_active) |
>> +				 (1L << PG_workingset) |
>> +				 (1L << PG_locked) |
>> +				 (1L << PG_unevictable) |
>> +#ifdef CONFIG_ARCH_USES_PG_ARCH_2
>> +				 (1L << PG_arch_2) |
>> +#endif
>> +#ifdef CONFIG_ARCH_USES_PG_ARCH_3
>> +				 (1L << PG_arch_3) |
>> +#endif
>> +				 (1L << PG_dirty) |
>> +				 LRU_GEN_MASK | LRU_REFS_MASK));
>> +
>> +		/* ->mapping in first and second tail page is replaced by other uses */
>> +		VM_BUG_ON_PAGE(new_nr_pages > 2 && new_head->mapping != TAIL_MAPPING,
>> +			       new_head);
>> +		new_head->mapping = head->mapping;
>> +		new_head->index = head->index + index;
>> +
>> +		/*
>> +		 * page->private should not be set in tail pages. Fix up and warn once
>> +		 * if private is unexpectedly set.
>> +		 */
>> +		if (unlikely(new_head->private)) {
>> +			VM_WARN_ON_ONCE_PAGE(true, new_head);
>> +			new_head->private = 0;
>> +		}
>> +
>> +		if (folio_test_swapcache(folio))
>> +			new_folio->swap.val = folio->swap.val + index;
>> +
>> +		/* Page flags must be visible before we make the page non-compound. */
>> +		smp_wmb();
>> +
>> +		/*
>> +		 * Clear PageTail before unfreezing page refcount.
>> +		 *
>> +		 * After successful get_page_unless_zero() might follow put_page()
>> +		 * which needs correct compound_head().
>> +		 */
>> +		clear_compound_head(new_head);
>> +		if (new_order) {
>> +			prep_compound_page(new_head, new_order);
>> +			folio_set_large_rmappable(new_folio);
>> +
>> +			folio_set_order(folio, new_order);
>> +		}
>> +
>> +		if (folio_test_young(folio))
>> +			folio_set_young(new_folio);
>> +		if (folio_test_idle(folio))
>> +			folio_set_idle(new_folio);
>> +
>> +		folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
>> +	}
>> +
>> +	if (!new_order)
>> +		ClearPageCompound(&folio->page);
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * It splits an unmapped @folio to lower order smaller folios in two ways.
>> + * @folio: the to-be-split folio
>> + * @new_order: the smallest order of the after split folios (since buddy
>> + *             allocator like split generates folios with orders from @folio's
>> + *             order - 1 to new_order).
>> + * @page: in buddy allocator like split, the folio containing @page will be
>> + *        split until its order becomes @new_order.
>> + * @list: the after split folios will be added to @list if it is not NULL,
>> + *        otherwise to LRU lists.
>> + * @end: the end of the file @folio maps to. -1 if @folio is anonymous memory.
>> + * @xas: xa_state pointing to folio->mapping->i_pages and locked by caller
>> + * @mapping: @folio->mapping
>> + * @uniform_split: if the split is uniform or not (buddy allocator like split)
>> + *
>> + *
>> + * 1. uniform split: the given @folio into multiple @new_order small folios,
>> + *    where all small folios have the same order. This is done when
>> + *    uniform_split is true.
>> + * 2. buddy allocator like (non-uniform) split: the given @folio is split into
>> + *    half and one of the half (containing the given page) is split into half
>> + *    until the given @page's order becomes @new_order. This is done when
>> + *    uniform_split is false.
>> + *
>> + * The high level flow for these two methods are:
>> + * 1. uniform split: a single __split_folio_to_order() is called to split the
>> + *    @folio into @new_order, then we traverse all the resulting folios one by
>> + *    one in PFN ascending order and perform stats, unfreeze, adding to list,
>> + *    and file mapping index operations.
>> + * 2. non-uniform split: in general, folio_order - @new_order calls to
>> + *    __split_folio_to_order() are made in a for loop to split the @folio
>> + *    to one lower order at a time. The resulting small folios are processed
>> + *    like what is done during the traversal in 1, except the one containing
>> + *    @page, which is split in next for loop.
>> + *
>> + * After splitting, the caller's folio reference will be transferred to the
>> + * folio containing @page. The other folios may be freed if they are not mapped.
>> + *
>> + * In terms of locking, after splitting,
>> + * 1. uniform split leaves @page (or the folio contains it) locked;
>> + * 2. buddy allocator like (non-uniform) split leaves @folio locked.
>> + *
>> + *
>> + * For !uniform_split, when -ENOMEM is returned, the original folio might be
>> + * split. The caller needs to check the input folio.
>> + */
>> +static int __split_unmapped_folio(struct folio *folio, int new_order,
>> +		struct page *page, struct list_head *list, pgoff_t end,
>> +		struct xa_state *xas, struct address_space *mapping,
>> +		bool uniform_split)
>> +{
>> +	struct lruvec *lruvec;
>> +	struct address_space *swap_cache = NULL;
>> +	struct folio *origin_folio = folio;
>> +	struct folio *next_folio = folio_next(folio);
>> +	struct folio *new_folio;
>> +	struct folio *next;
>> +	int order = folio_order(folio);
>> +	int split_order;
>> +	int start_order = uniform_split ? new_order : order - 1;
>> +	int nr_dropped = 0;
>> +	int ret = 0;
>> +	bool stop_split = false;
>> +
>> +	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
>> +		/* a swapcache folio can only be uniformly split to order-0 */
>> +		if (!uniform_split || new_order != 0)
>> +			return -EINVAL;
>> +
>> +		swap_cache = swap_address_space(folio->swap);
>> +		xa_lock(&swap_cache->i_pages);
>> +	}
>> +
>> +	if (folio_test_anon(folio))
>> +		mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
>> +
>> +	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
>> +	lruvec = folio_lruvec_lock(folio);
>> +
>> +	folio_clear_has_hwpoisoned(folio);
>> +
>> +	/*
>> +	 * split to new_order one order at a time. For uniform split,
>> +	 * folio is split to new_order directly.
>> +	 */
>> +	for (split_order = start_order;
>> +	     split_order >= new_order && !stop_split;
>> +	     split_order--) {
>> +		int old_order = folio_order(folio);
>> +		struct folio *release;
>> +		struct folio *end_folio = folio_next(folio);
>> +		int status;
>> +
>> +		/* order-1 anonymous folio is not supported */
>> +		if (folio_test_anon(folio) && split_order == 1)
>> +			continue;
>> +		if (uniform_split && split_order != new_order)
>> +			continue;
>> +
>> +		if (mapping) {
>> +			/*
>> +			 * uniform split has xas_split_alloc() called before
>> +			 * irq is disabled to allocate enough memory, whereas
>> +			 * non-uniform split can handle ENOMEM.
>> +			 */
>> +			if (uniform_split)
>> +				xas_split(xas, folio, old_order);
>> +			else {
>> +				xas_set_order(xas, folio->index, split_order);
>> +				xas_try_split(xas, folio, old_order,
>> +						GFP_NOWAIT);
>> +				if (xas_error(xas)) {
>> +					ret = xas_error(xas);
>> +					stop_split = true;
>> +					goto after_split;
>> +				}
>> +			}
>> +		}
>> +
>> +		/* complete memcg works before add pages to LRU */
>> +		split_page_memcg(&folio->page, old_order, split_order);
>> +		split_page_owner(&folio->page, old_order, split_order);
>> +		pgalloc_tag_split(folio, old_order, split_order);
>> +
>> +		status = __split_folio_to_order(folio, split_order);
>> +
>
> Stumbling over that code (sorry for the late reply ... ).
>
> That looks weird. We split memcg/owner/pgalloc ... and then figure out in __split_folio_to_order() that we don't want to ... split?
>
> Should that all be moved into __split_folio_to_order() and performed only when we really want to split?

Yes, or move it after the status check. In reality, __split_folio_to_order()
only fails if split_order is bigger than the folio's order, which should not
happen. But still, I will fix it in the next version.
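
Roughly something like the following (just a sketch against the hunk quoted
above; the exact error handling is to be settled in the next version):

	status = __split_folio_to_order(folio, split_order);
	if (status < 0) {
		/* nothing was split, so skip the accounting below */
		ret = status;
		stop_split = true;
		goto after_split;
	}

	/* complete memcg works before adding pages to LRU */
	split_page_memcg(&folio->page, old_order, split_order);
	split_page_owner(&folio->page, old_order, split_order);
	pgalloc_tag_split(folio, old_order, split_order);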


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 2/8] mm/huge_memory: add two new (not yet used) functions for folio_split()
  2025-02-14 22:03     ` Zi Yan
@ 2025-02-14 22:06       ` David Hildenbrand
  2025-02-14 22:18         ` Zi Yan
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-02-14 22:06 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Andrew Morton, Kirill A . Shutemov,
	Matthew Wilcox (Oracle),
	Ryan Roberts, Hugh Dickins, Yang Shi, Miaohe Lin, Kefeng Wang,
	Yu Zhao, John Hubbard, Baolin Wang, linux-kselftest,
	linux-kernel

On 14.02.25 23:03, Zi Yan wrote:
> On 14 Feb 2025, at 16:59, David Hildenbrand wrote:
> 
>> On 11.02.25 16:50, Zi Yan wrote:
>>> This is a preparation patch, both added functions are not used yet.
>>>
>>> The added __split_unmapped_folio() is able to split a folio with
>>> its mapping removed in two manners: 1) uniform split (the existing way),
>>> and 2) buddy allocator like split.
>>>
>>> The added __split_folio_to_order() can split a folio into any lower order.
>>> For uniform split, __split_unmapped_folio() calls it once to split
>>> the given folio to the new order. For buddy allocator split,
>>> __split_unmapped_folio() calls it (folio_order - new_order) times
>>> and each time splits the folio containing the given page to one lower
>>> order.
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>    mm/huge_memory.c | 349 ++++++++++++++++++++++++++++++++++++++++++++++-
>>>    1 file changed, 348 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index a0277f4154c2..12d3f515c408 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3262,7 +3262,6 @@ static void remap_page(struct folio *folio, unsigned long nr, int flags)
>>>    static void lru_add_page_tail(struct folio *folio, struct page *tail,
>>>    		struct lruvec *lruvec, struct list_head *list)
>>>    {
>>> -	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>>    	VM_BUG_ON_FOLIO(PageLRU(tail), folio);
>>>    	lockdep_assert_held(&lruvec->lru_lock);
>>>   @@ -3506,6 +3505,354 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>>>    					caller_pins;
>>>    }
>>>   +/*
>>> + * It splits @folio into @new_order folios and copies the @folio metadata to
>>> + * all the resulting folios.
>>> + */
>>> +static int __split_folio_to_order(struct folio *folio, int new_order)
>>> +{
>>> +	int curr_order = folio_order(folio);
>>> +	long nr_pages = folio_nr_pages(folio);
>>> +	long new_nr_pages = 1 << new_order;
>>> +	long index;
>>> +
>>> +	if (curr_order <= new_order)
>>> +		return -EINVAL;
>>> +
>>> +	/*
>>> +	 * Skip the first new_nr_pages, since the new folio from them have all
>>> +	 * the flags from the original folio.
>>> +	 */
>>> +	for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {
>>> +		struct page *head = &folio->page;
>>> +		struct page *new_head = head + index;
>>> +
>>> +		/*
>>> +		 * Careful: new_folio is not a "real" folio before we cleared PageTail.
>>> +		 * Don't pass it around before clear_compound_head().
>>> +		 */
>>> +		struct folio *new_folio = (struct folio *)new_head;
>>> +
>>> +		VM_BUG_ON_PAGE(atomic_read(&new_head->_mapcount) != -1, new_head);
>>> +
>>> +		/*
>>> +		 * Clone page flags before unfreezing refcount.
>>> +		 *
>>> +		 * After successful get_page_unless_zero() might follow flags change,
>>> +		 * for example lock_page() which set PG_waiters.
>>> +		 *
>>> +		 * Note that for mapped sub-pages of an anonymous THP,
>>> +		 * PG_anon_exclusive has been cleared in unmap_folio() and is stored in
>>> +		 * the migration entry instead from where remap_page() will restore it.
>>> +		 * We can still have PG_anon_exclusive set on effectively unmapped and
>>> +		 * unreferenced sub-pages of an anonymous THP: we can simply drop
>>> +		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
>>> +		 */
>>> +		new_head->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>> +		new_head->flags |= (head->flags &
>>> +				((1L << PG_referenced) |
>>> +				 (1L << PG_swapbacked) |
>>> +				 (1L << PG_swapcache) |
>>> +				 (1L << PG_mlocked) |
>>> +				 (1L << PG_uptodate) |
>>> +				 (1L << PG_active) |
>>> +				 (1L << PG_workingset) |
>>> +				 (1L << PG_locked) |
>>> +				 (1L << PG_unevictable) |
>>> +#ifdef CONFIG_ARCH_USES_PG_ARCH_2
>>> +				 (1L << PG_arch_2) |
>>> +#endif
>>> +#ifdef CONFIG_ARCH_USES_PG_ARCH_3
>>> +				 (1L << PG_arch_3) |
>>> +#endif
>>> +				 (1L << PG_dirty) |
>>> +				 LRU_GEN_MASK | LRU_REFS_MASK));
>>> +
>>> +		/* ->mapping in first and second tail page is replaced by other uses */
>>> +		VM_BUG_ON_PAGE(new_nr_pages > 2 && new_head->mapping != TAIL_MAPPING,
>>> +			       new_head);
>>> +		new_head->mapping = head->mapping;
>>> +		new_head->index = head->index + index;
>>> +
>>> +		/*
>>> +		 * page->private should not be set in tail pages. Fix up and warn once
>>> +		 * if private is unexpectedly set.
>>> +		 */
>>> +		if (unlikely(new_head->private)) {
>>> +			VM_WARN_ON_ONCE_PAGE(true, new_head);
>>> +			new_head->private = 0;
>>> +		}
>>> +
>>> +		if (folio_test_swapcache(folio))
>>> +			new_folio->swap.val = folio->swap.val + index;
>>> +
>>> +		/* Page flags must be visible before we make the page non-compound. */
>>> +		smp_wmb();
>>> +
>>> +		/*
>>> +		 * Clear PageTail before unfreezing page refcount.
>>> +		 *
>>> +		 * After successful get_page_unless_zero() might follow put_page()
>>> +		 * which needs correct compound_head().
>>> +		 */
>>> +		clear_compound_head(new_head);
>>> +		if (new_order) {
>>> +			prep_compound_page(new_head, new_order);
>>> +			folio_set_large_rmappable(new_folio);
>>> +
>>> +			folio_set_order(folio, new_order);
>>> +		}
>>> +
>>> +		if (folio_test_young(folio))
>>> +			folio_set_young(new_folio);
>>> +		if (folio_test_idle(folio))
>>> +			folio_set_idle(new_folio);
>>> +
>>> +		folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
>>> +	}
>>> +
>>> +	if (!new_order)
>>> +		ClearPageCompound(&folio->page);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +/*
>>> + * It splits an unmapped @folio to lower order smaller folios in two ways.
>>> + * @folio: the to-be-split folio
>>> + * @new_order: the smallest order of the after split folios (since buddy
>>> + *             allocator like split generates folios with orders from @folio's
>>> + *             order - 1 to new_order).
>>> + * @page: in buddy allocator like split, the folio containing @page will be
>>> + *        split until its order becomes @new_order.
>>> + * @list: the after split folios will be added to @list if it is not NULL,
>>> + *        otherwise to LRU lists.
>>> + * @end: the end of the file @folio maps to. -1 if @folio is anonymous memory.
>>> + * @xas: xa_state pointing to folio->mapping->i_pages and locked by caller
>>> + * @mapping: @folio->mapping
>>> + * @uniform_split: if the split is uniform or not (buddy allocator like split)
>>> + *
>>> + *
>>> + * 1. uniform split: the given @folio is split into multiple @new_order small folios,
>>> + *    where all small folios have the same order. This is done when
>>> + *    uniform_split is true.
>>> + * 2. buddy allocator like (non-uniform) split: the given @folio is split into
>>> + *    half and one of the half (containing the given page) is split into half
>>> + *    until the given @page's order becomes @new_order. This is done when
>>> + *    uniform_split is false.
>>> + *
>>> + * The high level flow for these two methods are:
>>> + * 1. uniform split: a single __split_folio_to_order() is called to split the
>>> + *    @folio into @new_order, then we traverse all the resulting folios one by
>>> + *    one in PFN ascending order and perform stats, unfreeze, adding to list,
>>> + *    and file mapping index operations.
>>> + * 2. non-uniform split: in general, folio_order - @new_order calls to
>>> + *    __split_folio_to_order() are made in a for loop to split the @folio
>>> + *    to one lower order at a time. The resulting small folios are processed
>>> + *    like what is done during the traversal in 1, except the one containing
>>> + *    @page, which is split in next for loop.
>>> + *
>>> + * After splitting, the caller's folio reference will be transferred to the
>>> + * folio containing @page. The other folios may be freed if they are not mapped.
>>> + *
>>> + * In terms of locking, after splitting,
>>> + * 1. uniform split leaves @page (or the folio containing it) locked;
>>> + * 2. buddy allocator like (non-uniform) split leaves @folio locked.
>>> + *
>>> + *
>>> + * For !uniform_split, when -ENOMEM is returned, the original folio might be
>>> + * split. The caller needs to check the input folio.
>>> + */
>>> +static int __split_unmapped_folio(struct folio *folio, int new_order,
>>> +		struct page *page, struct list_head *list, pgoff_t end,
>>> +		struct xa_state *xas, struct address_space *mapping,
>>> +		bool uniform_split)
>>> +{
>>> +	struct lruvec *lruvec;
>>> +	struct address_space *swap_cache = NULL;
>>> +	struct folio *origin_folio = folio;
>>> +	struct folio *next_folio = folio_next(folio);
>>> +	struct folio *new_folio;
>>> +	struct folio *next;
>>> +	int order = folio_order(folio);
>>> +	int split_order;
>>> +	int start_order = uniform_split ? new_order : order - 1;
>>> +	int nr_dropped = 0;
>>> +	int ret = 0;
>>> +	bool stop_split = false;
>>> +
>>> +	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
>>> +		/* a swapcache folio can only be uniformly split to order-0 */
>>> +		if (!uniform_split || new_order != 0)
>>> +			return -EINVAL;
>>> +
>>> +		swap_cache = swap_address_space(folio->swap);
>>> +		xa_lock(&swap_cache->i_pages);
>>> +	}
>>> +
>>> +	if (folio_test_anon(folio))
>>> +		mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
>>> +
>>> +	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
>>> +	lruvec = folio_lruvec_lock(folio);
>>> +
>>> +	folio_clear_has_hwpoisoned(folio);
>>> +
>>> +	/*
>>> +	 * split to new_order one order at a time. For uniform split,
>>> +	 * folio is split to new_order directly.
>>> +	 */
>>> +	for (split_order = start_order;
>>> +	     split_order >= new_order && !stop_split;
>>> +	     split_order--) {
>>> +		int old_order = folio_order(folio);
>>> +		struct folio *release;
>>> +		struct folio *end_folio = folio_next(folio);
>>> +		int status;
>>> +
>>> +		/* order-1 anonymous folio is not supported */
>>> +		if (folio_test_anon(folio) && split_order == 1)
>>> +			continue;
>>> +		if (uniform_split && split_order != new_order)
>>> +			continue;
>>> +
>>> +		if (mapping) {
>>> +			/*
>>> +			 * uniform split has xas_split_alloc() called before
>>> +			 * irq is disabled to allocate enough memory, whereas
>>> +			 * non-uniform split can handle ENOMEM.
>>> +			 */
>>> +			if (uniform_split)
>>> +				xas_split(xas, folio, old_order);
>>> +			else {
>>> +				xas_set_order(xas, folio->index, split_order);
>>> +				xas_try_split(xas, folio, old_order,
>>> +						GFP_NOWAIT);
>>> +				if (xas_error(xas)) {
>>> +					ret = xas_error(xas);
>>> +					stop_split = true;
>>> +					goto after_split;
>>> +				}
>>> +			}
>>> +		}
>>> +
>>> +		/* complete memcg works before add pages to LRU */
>>> +		split_page_memcg(&folio->page, old_order, split_order);
>>> +		split_page_owner(&folio->page, old_order, split_order);
>>> +		pgalloc_tag_split(folio, old_order, split_order);
>>> +
>>> +		status = __split_folio_to_order(folio, split_order);
>>> +
>>
>> Stumbling over that code (sorry for the late reply ... ).
>>
>> That looks weird. We split memcg/owner/pgalloc ... and then figure out in __split_folio_to_order() that we don't want to ... split?
>>
>> Should that all be moved into __split_folio_to_order() and performed only when we really want to split?
> 
> Yes, or move it after the status check. In reality, __split_folio_to_order()
> only fails if split_order is bigger than folio’s order, which should not happen.

Right, I was wondering if this is actually a WARN_ON_ONCE() kind-of 
situation.

Probably  __split_folio_to_order() should never fail, and that 
sanity-check should be done before that splitting code here in the 
single caller.
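
Purely as a sketch of that idea, such a caller-side sanity check in
__split_unmapped_folio() could reuse the error path already used for
xas_try_split() above (illustrative only, not a concrete proposal):

	/*
	 * Illustrative only: catch a bogus split_order before any
	 * memcg/owner/pgalloc split work is done, so that
	 * __split_folio_to_order() itself never has to fail.
	 */
	if (unlikely(split_order >= folio_order(folio))) {
		VM_WARN_ON_ONCE_FOLIO(true, folio);
		ret = -EINVAL;
		stop_split = true;
		goto after_split;
	}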

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 2/8] mm/huge_memory: add two new (not yet used) functions for folio_split()
  2025-02-14 22:06       ` David Hildenbrand
@ 2025-02-14 22:18         ` Zi Yan
  0 siblings, 0 replies; 26+ messages in thread
From: Zi Yan @ 2025-02-14 22:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Kirill A . Shutemov,
	Matthew Wilcox (Oracle),
	Ryan Roberts, Hugh Dickins, Yang Shi, Miaohe Lin, Kefeng Wang,
	Yu Zhao, John Hubbard, Baolin Wang, linux-kselftest,
	linux-kernel

On 14 Feb 2025, at 17:06, David Hildenbrand wrote:

> On 14.02.25 23:03, Zi Yan wrote:
>> On 14 Feb 2025, at 16:59, David Hildenbrand wrote:
>>
>>> On 11.02.25 16:50, Zi Yan wrote:
>>>> This is a preparation patch, both added functions are not used yet.
>>>>
>>>> The added __split_unmapped_folio() is able to split a folio with
>>>> its mapping removed in two manners: 1) uniform split (the existing way),
>>>> and 2) buddy allocator like split.
>>>>
>>>> The added __split_folio_to_order() can split a folio into any lower order.
>>>> For uniform split, __split_unmapped_folio() calls it once to split
>>>> the given folio to the new order. For buddy allocator split,
>>>> __split_unmapped_folio() calls it (folio_order - new_order) times
>>>> and each time splits the folio containing the given page to one lower
>>>> order.
>>>>
>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>> ---
>>>>    mm/huge_memory.c | 349 ++++++++++++++++++++++++++++++++++++++++++++++-
>>>>    1 file changed, 348 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index a0277f4154c2..12d3f515c408 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3262,7 +3262,6 @@ static void remap_page(struct folio *folio, unsigned long nr, int flags)
>>>>    static void lru_add_page_tail(struct folio *folio, struct page *tail,
>>>>    		struct lruvec *lruvec, struct list_head *list)
>>>>    {
>>>> -	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>>>>    	VM_BUG_ON_FOLIO(PageLRU(tail), folio);
>>>>    	lockdep_assert_held(&lruvec->lru_lock);
>>>>   @@ -3506,6 +3505,354 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>>>>    					caller_pins;
>>>>    }
>>>>   +/*
>>>> + * It splits @folio into @new_order folios and copies the @folio metadata to
>>>> + * all the resulting folios.
>>>> + */
>>>> +static int __split_folio_to_order(struct folio *folio, int new_order)
>>>> +{
>>>> +	int curr_order = folio_order(folio);
>>>> +	long nr_pages = folio_nr_pages(folio);
>>>> +	long new_nr_pages = 1 << new_order;
>>>> +	long index;
>>>> +
>>>> +	if (curr_order <= new_order)
>>>> +		return -EINVAL;
>>>> +
>>>> +	/*
>>>> +	 * Skip the first new_nr_pages, since the new folio from them have all
>>>> +	 * the flags from the original folio.
>>>> +	 */
>>>> +	for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {
>>>> +		struct page *head = &folio->page;
>>>> +		struct page *new_head = head + index;
>>>> +
>>>> +		/*
>>>> +		 * Careful: new_folio is not a "real" folio before we cleared PageTail.
>>>> +		 * Don't pass it around before clear_compound_head().
>>>> +		 */
>>>> +		struct folio *new_folio = (struct folio *)new_head;
>>>> +
>>>> +		VM_BUG_ON_PAGE(atomic_read(&new_head->_mapcount) != -1, new_head);
>>>> +
>>>> +		/*
>>>> +		 * Clone page flags before unfreezing refcount.
>>>> +		 *
>>>> +		 * After successful get_page_unless_zero() might follow flags change,
>>>> +		 * for example lock_page() which set PG_waiters.
>>>> +		 *
>>>> +		 * Note that for mapped sub-pages of an anonymous THP,
>>>> +		 * PG_anon_exclusive has been cleared in unmap_folio() and is stored in
>>>> +		 * the migration entry instead from where remap_page() will restore it.
>>>> +		 * We can still have PG_anon_exclusive set on effectively unmapped and
>>>> +		 * unreferenced sub-pages of an anonymous THP: we can simply drop
>>>> +		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
>>>> +		 */
>>>> +		new_head->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>>> +		new_head->flags |= (head->flags &
>>>> +				((1L << PG_referenced) |
>>>> +				 (1L << PG_swapbacked) |
>>>> +				 (1L << PG_swapcache) |
>>>> +				 (1L << PG_mlocked) |
>>>> +				 (1L << PG_uptodate) |
>>>> +				 (1L << PG_active) |
>>>> +				 (1L << PG_workingset) |
>>>> +				 (1L << PG_locked) |
>>>> +				 (1L << PG_unevictable) |
>>>> +#ifdef CONFIG_ARCH_USES_PG_ARCH_2
>>>> +				 (1L << PG_arch_2) |
>>>> +#endif
>>>> +#ifdef CONFIG_ARCH_USES_PG_ARCH_3
>>>> +				 (1L << PG_arch_3) |
>>>> +#endif
>>>> +				 (1L << PG_dirty) |
>>>> +				 LRU_GEN_MASK | LRU_REFS_MASK));
>>>> +
>>>> +		/* ->mapping in first and second tail page is replaced by other uses */
>>>> +		VM_BUG_ON_PAGE(new_nr_pages > 2 && new_head->mapping != TAIL_MAPPING,
>>>> +			       new_head);
>>>> +		new_head->mapping = head->mapping;
>>>> +		new_head->index = head->index + index;
>>>> +
>>>> +		/*
>>>> +		 * page->private should not be set in tail pages. Fix up and warn once
>>>> +		 * if private is unexpectedly set.
>>>> +		 */
>>>> +		if (unlikely(new_head->private)) {
>>>> +			VM_WARN_ON_ONCE_PAGE(true, new_head);
>>>> +			new_head->private = 0;
>>>> +		}
>>>> +
>>>> +		if (folio_test_swapcache(folio))
>>>> +			new_folio->swap.val = folio->swap.val + index;
>>>> +
>>>> +		/* Page flags must be visible before we make the page non-compound. */
>>>> +		smp_wmb();
>>>> +
>>>> +		/*
>>>> +		 * Clear PageTail before unfreezing page refcount.
>>>> +		 *
>>>> +		 * After successful get_page_unless_zero() might follow put_page()
>>>> +		 * which needs correct compound_head().
>>>> +		 */
>>>> +		clear_compound_head(new_head);
>>>> +		if (new_order) {
>>>> +			prep_compound_page(new_head, new_order);
>>>> +			folio_set_large_rmappable(new_folio);
>>>> +
>>>> +			folio_set_order(folio, new_order);
>>>> +		}
>>>> +
>>>> +		if (folio_test_young(folio))
>>>> +			folio_set_young(new_folio);
>>>> +		if (folio_test_idle(folio))
>>>> +			folio_set_idle(new_folio);
>>>> +
>>>> +		folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
>>>> +	}
>>>> +
>>>> +	if (!new_order)
>>>> +		ClearPageCompound(&folio->page);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +/*
>>>> + * It splits an unmapped @folio to lower order smaller folios in two ways.
>>>> + * @folio: the to-be-split folio
>>>> + * @new_order: the smallest order of the after split folios (since buddy
>>>> + *             allocator like split generates folios with orders from @folio's
>>>> + *             order - 1 to new_order).
>>>> + * @page: in buddy allocator like split, the folio containing @page will be
>>>> + *        split until its order becomes @new_order.
>>>> + * @list: the after split folios will be added to @list if it is not NULL,
>>>> + *        otherwise to LRU lists.
>>>> + * @end: the end of the file @folio maps to. -1 if @folio is anonymous memory.
>>>> + * @xas: xa_state pointing to folio->mapping->i_pages and locked by caller
>>>> + * @mapping: @folio->mapping
>>>> + * @uniform_split: if the split is uniform or not (buddy allocator like split)
>>>> + *
>>>> + *
>>>> + * 1. uniform split: the given @folio is split into multiple @new_order small folios,
>>>> + *    where all small folios have the same order. This is done when
>>>> + *    uniform_split is true.
>>>> + * 2. buddy allocator like (non-uniform) split: the given @folio is split into
>>>> + *    half and one of the half (containing the given page) is split into half
>>>> + *    until the given @page's order becomes @new_order. This is done when
>>>> + *    uniform_split is false.
>>>> + *
>>>> + * The high level flow for these two methods are:
>>>> + * 1. uniform split: a single __split_folio_to_order() is called to split the
>>>> + *    @folio into @new_order, then we traverse all the resulting folios one by
>>>> + *    one in PFN ascending order and perform stats, unfreeze, adding to list,
>>>> + *    and file mapping index operations.
>>>> + * 2. non-uniform split: in general, folio_order - @new_order calls to
>>>> + *    __split_folio_to_order() are made in a for loop to split the @folio
>>>> + *    to one lower order at a time. The resulting small folios are processed
>>>> + *    like what is done during the traversal in 1, except the one containing
>>>> + *    @page, which is split in next for loop.
>>>> + *
>>>> + * After splitting, the caller's folio reference will be transferred to the
>>>> + * folio containing @page. The other folios may be freed if they are not mapped.
>>>> + *
>>>> + * In terms of locking, after splitting,
>>>> + * 1. uniform split leaves @page (or the folio containing it) locked;
>>>> + * 2. buddy allocator like (non-uniform) split leaves @folio locked.
>>>> + *
>>>> + *
>>>> + * For !uniform_split, when -ENOMEM is returned, the original folio might be
>>>> + * split. The caller needs to check the input folio.
>>>> + */
>>>> +static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>> +		struct page *page, struct list_head *list, pgoff_t end,
>>>> +		struct xa_state *xas, struct address_space *mapping,
>>>> +		bool uniform_split)
>>>> +{
>>>> +	struct lruvec *lruvec;
>>>> +	struct address_space *swap_cache = NULL;
>>>> +	struct folio *origin_folio = folio;
>>>> +	struct folio *next_folio = folio_next(folio);
>>>> +	struct folio *new_folio;
>>>> +	struct folio *next;
>>>> +	int order = folio_order(folio);
>>>> +	int split_order;
>>>> +	int start_order = uniform_split ? new_order : order - 1;
>>>> +	int nr_dropped = 0;
>>>> +	int ret = 0;
>>>> +	bool stop_split = false;
>>>> +
>>>> +	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
>>>> +		/* a swapcache folio can only be uniformly split to order-0 */
>>>> +		if (!uniform_split || new_order != 0)
>>>> +			return -EINVAL;
>>>> +
>>>> +		swap_cache = swap_address_space(folio->swap);
>>>> +		xa_lock(&swap_cache->i_pages);
>>>> +	}
>>>> +
>>>> +	if (folio_test_anon(folio))
>>>> +		mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
>>>> +
>>>> +	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
>>>> +	lruvec = folio_lruvec_lock(folio);
>>>> +
>>>> +	folio_clear_has_hwpoisoned(folio);
>>>> +
>>>> +	/*
>>>> +	 * split to new_order one order at a time. For uniform split,
>>>> +	 * folio is split to new_order directly.
>>>> +	 */
>>>> +	for (split_order = start_order;
>>>> +	     split_order >= new_order && !stop_split;
>>>> +	     split_order--) {
>>>> +		int old_order = folio_order(folio);
>>>> +		struct folio *release;
>>>> +		struct folio *end_folio = folio_next(folio);
>>>> +		int status;
>>>> +
>>>> +		/* order-1 anonymous folio is not supported */
>>>> +		if (folio_test_anon(folio) && split_order == 1)
>>>> +			continue;
>>>> +		if (uniform_split && split_order != new_order)
>>>> +			continue;
>>>> +
>>>> +		if (mapping) {
>>>> +			/*
>>>> +			 * uniform split has xas_split_alloc() called before
>>>> +			 * irq is disabled to allocate enough memory, whereas
>>>> +			 * non-uniform split can handle ENOMEM.
>>>> +			 */
>>>> +			if (uniform_split)
>>>> +				xas_split(xas, folio, old_order);
>>>> +			else {
>>>> +				xas_set_order(xas, folio->index, split_order);
>>>> +				xas_try_split(xas, folio, old_order,
>>>> +						GFP_NOWAIT);
>>>> +				if (xas_error(xas)) {
>>>> +					ret = xas_error(xas);
>>>> +					stop_split = true;
>>>> +					goto after_split;
>>>> +				}
>>>> +			}
>>>> +		}
>>>> +
>>>> +		/* complete memcg works before add pages to LRU */
>>>> +		split_page_memcg(&folio->page, old_order, split_order);
>>>> +		split_page_owner(&folio->page, old_order, split_order);
>>>> +		pgalloc_tag_split(folio, old_order, split_order);
>>>> +
>>>> +		status = __split_folio_to_order(folio, split_order);
>>>> +
>>>
>>> Stumbling over that code (sorry for the late reply ... ).
>>>
>>> That looks weird. We split memcg/owner/pgalloc ... and then figure out in __split_folio_to_order() that we don't want to ... split?
>>>
>>> Should that all be moved into __split_folio_to_order() and performed only when we really want to split?
>>
>> Yes, or move it after the status check. In reality, __split_folio_to_order()
>> only fails if split_order is bigger than folio’s order, which should not happen.
>
> Right, I was wondering if this is actually a WARN_ON_ONCE() kind-of situation.
>
> Probably  __split_folio_to_order() should never fail, and that sanity-check should be done before that splitting code here in the single caller.

Right. The check in __split_folio_to_order() is redundant. new_order
was checked in __folio_split(). Let me remove the check and make
__split_folio_to_order() never fail.

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 2/8] mm/huge_memory: add two new (not yet used) functions for folio_split()
  2025-02-11 15:50 ` [PATCH v7 2/8] mm/huge_memory: add two new (not yet used) functions for folio_split() Zi Yan
  2025-02-14 21:59   ` David Hildenbrand
@ 2025-02-15  1:52   ` Zi Yan
  1 sibling, 0 replies; 26+ messages in thread
From: Zi Yan @ 2025-02-15  1:52 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Ryan Roberts, Hugh Dickins, David Hildenbrand, Yang Shi,
	Miaohe Lin, Kefeng Wang, Yu Zhao, John Hubbard, Baolin Wang,
	linux-kselftest, linux-kernel, Zi Yan, Matthew Wilcox (Oracle),
	Kirill A . Shutemov

On 11 Feb 2025, at 10:50, Zi Yan wrote:

> This is a preparation patch, both added functions are not used yet.
>
> The added __split_unmapped_folio() is able to split a folio with
> its mapping removed in two manners: 1) uniform split (the existing way),
> and 2) buddy allocator like split.
>
> The added __split_folio_to_order() can split a folio into any lower order.
> For uniform split, __split_unmapped_folio() calls it once to split
> the given folio to the new order. For buddy allocator split,
> __split_unmapped_folio() calls it (folio_order - new_order) times
> and each time splits the folio containing the given page to one lower
> order.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  mm/huge_memory.c | 349 ++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 348 insertions(+), 1 deletion(-)
>

Hi Andrew,

Can you fold the patch below into this one? It addresses the error
reported by syzbot at: https://lore.kernel.org/all/67af65cb.050a0220.21dd3.004a.GAE@google.com/ and the concern raised
by David Hildenbrand at: https://lore.kernel.org/linux-mm/db77d017-4a1e-4a47-9064-e335cb0313af@redhat.com/.

Let me know if you prefer a new version of the whole series.

Thanks.



From a6bd83dfbb1143f1614ede4817cccb1e8cc6290d Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Fri, 14 Feb 2025 16:18:24 -0500
Subject: [PATCH] mm/huge_memory: do not drop the original folio during
 truncate.

The caller expects to handle the original folio itself.

Also make __split_unmapped_folio() never fail, per discussion with David
Hildenbrand.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 22 ++++++----------------
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2eda2a9ec8fc..87cb62c81bf3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3292,16 +3292,12 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
  * It splits @folio into @new_order folios and copies the @folio metadata to
  * all the resulting folios.
  */
-static int __split_folio_to_order(struct folio *folio, int new_order)
+static void __split_folio_to_order(struct folio *folio, int new_order)
 {
-	int curr_order = folio_order(folio);
 	long nr_pages = folio_nr_pages(folio);
 	long new_nr_pages = 1 << new_order;
 	long index;

-	if (curr_order <= new_order)
-		return -EINVAL;
-
 	/*
 	 * Skip the first new_nr_pages, since the new folio from them have all
 	 * the flags from the original folio.
@@ -3396,8 +3392,6 @@ static int __split_folio_to_order(struct folio *folio, int new_order)

 	if (!new_order)
 		ClearPageCompound(&folio->page);
-
-	return 0;
 }

 /*
@@ -3491,7 +3485,6 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 		int old_order = folio_order(folio);
 		struct folio *release;
 		struct folio *end_folio = folio_next(folio);
-		int status;

 		/* order-1 anonymous folio is not supported */
 		if (folio_test_anon(folio) && split_order == 1)
@@ -3524,12 +3517,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 		split_page_owner(&folio->page, old_order, split_order);
 		pgalloc_tag_split(folio, old_order, split_order);

-		status = __split_folio_to_order(folio, split_order);
-
-		if (status < 0) {
-			stop_split = true;
-			ret = -EINVAL;
-		}
+		__split_folio_to_order(folio, split_order);

 after_split:
 		/*
@@ -3567,8 +3555,10 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 				     folio_test_swapcache(origin_folio)) ?
 					     folio_nr_pages(release) : 0));

-			if (release != origin_folio)
-				lru_add_page_tail(origin_folio, &release->page,
+			if (release == origin_folio)
+				continue;
+			
+			lru_add_page_tail(origin_folio, &release->page,
 						lruvec, list);

 			/* Some pages can be beyond EOF: drop them from page cache */
-- 
2.47.2



Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 4/8] mm/huge_memory: add buddy allocator like (non-uniform) folio_split()
  2025-02-11 15:50 ` [PATCH v7 4/8] mm/huge_memory: add buddy allocator like (non-uniform) folio_split() Zi Yan
@ 2025-02-16 10:32   ` David Hildenbrand
  2025-02-16 14:17     ` Zi Yan
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-02-16 10:32 UTC (permalink / raw)
  To: Zi Yan, linux-mm, Andrew Morton, Kirill A . Shutemov,
	Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, Yang Shi, Miaohe Lin, Kefeng Wang,
	Yu Zhao, John Hubbard, Baolin Wang, linux-kselftest,
	linux-kernel

On 11.02.25 16:50, Zi Yan wrote:
> folio_split() splits a large folio in the same way as buddy allocator
> splits a large free page for allocation. The purpose is to minimize the
> number of folios after the split. For example, if user wants to free the
> 3rd subpage in a order-9 folio, folio_split() will split the order-9 folio
> as:
> O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon,
> since anon folio does not support order-1 yet.
> -----------------------------------------------------------------
> |   |   |   |   |     |   |       |                             |
> |O-0|O-0|O-0|O-0| O-2 |...|  O-7  |             O-8             |
> |   |   |   |   |     |   |       |                             |
> -----------------------------------------------------------------
> 
> O-1,      O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is pagecache
> ---------------------------------------------------------------
> |     |   |   |     |   |       |                             |
> | O-1 |O-0|O-0| O-2 |...|  O-7  |             O-8             |
> |     |   |   |     |   |       |                             |
> ---------------------------------------------------------------
> 
> It generates fewer folios (i.e., 11 or 10) than existing page split
> approach, which splits the order-9 to 512 order-0 folios. It also reduces
> the number of new xa_node needed during a pagecache folio split from
> 8 to 1, potentially decreasing the folio split failure rate due to memory
> constraints.
> 
> folio_split() and existing split_huge_page_to_list_to_order() share
> the folio unmapping and remapping code in __folio_split() and the common
> backend split code in __split_unmapped_folio() using
> uniform_split variable to distinguish their operations.
> 
> uniform_split_supported() and non_uniform_split_supported() are added
> to factor out check code and will be used outside __folio_split() in the
> following commit.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   mm/huge_memory.c | 137 ++++++++++++++++++++++++++++++++++-------------
>   1 file changed, 100 insertions(+), 37 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 21ebe2dec5a4..400dfe8a6e60 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3853,12 +3853,68 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>   	return ret;
>   }
>   
> +static bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
> +		bool warns)
> +{
> +	/* order-1 is not supported for anonymous THP. */
> +	if (folio_test_anon(folio) && new_order == 1) {
> +		VM_WARN_ONCE(warns, "Cannot split to order-1 folio");
> +		return false;
> +	}
> +
> +	/*
> +	 * No split if the file system does not support large folio.
> +	 * Note that we might still have THPs in such mappings due to
> +	 * CONFIG_READ_ONLY_THP_FOR_FS. But in that case, the mapping
> +	 * does not actually support large folios properly.
> +	 */
> +	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
> +	    !mapping_large_folio_support(folio->mapping)) {

In this (and a similar case below), you need

if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
     !folio_test_anon(folio) &&
     !mapping_large_folio_support(folio->mapping)) {

Otherwise mapping_large_folio_support() is unhappy:

[root@localhost mm]# ./split_huge_page_test
TAP version 13
1..20
ok 1 Split zero filled huge pages successful
ok 2 Split huge pages to order 0 successful
[  144.936764][T15389] ------------[ cut here ]------------
[  144.937948][T15389] Anonymous mapping always supports large folio
[  144.938164][T15389] WARNING: CPU: 5 PID: 15389 at ./include/linux/pagemap.h:494 uniform_split_supported+0x270/0x290
[  144.941442][T15389] Modules linked in:
[  144.942212][T15389] CPU: 5 UID: 0 PID: 15389 Comm: split_huge_page Not tainted 6.14.0-rc2-00200-gdcbc194183fd #153
[  144.944188][T15389] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
[  144.945934][T15389] RIP: 0010:uniform_split_supported+0x270/0x290
[  144.947144][T15389] Code: ff 89 de e8 c2 20 ca ff 84 db 0f 85 47 fe ff ff e8 05 26 ca ff c6 05 f6 c2 22 06 01 90 48 c7 c7 18 44 fa 86 e8 31 2a ac ff 90 <0f> 0b 90 90 e9 24 fe ff ff e8 e2 25 ca ff 48 c7 c6 08 52 f7 86 48
[  144.950897][T15389] RSP: 0018:ffffc90022813990 EFLAGS: 00010286
[  144.952058][T15389] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff8120ed77
[  144.953559][T15389] RDX: ffff8881326f3880 RSI: ffffffff8120ed8a RDI: ffff8881326f3880
[  144.955045][T15389] RBP: ffffea00062a0000 R08: 0000000000000001 R09: 0000000000000000
[  144.956544][T15389] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000001
[  144.958043][T15389] R13: 0000000000000001 R14: ffff8881328b3951 R15: 0000000000000001
[  144.959898][T15389] FS:  00007fb74cda4740(0000) GS:ffff888277b40000(0000) knlGS:0000000000000000
[  144.961627][T15389] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  144.962886][T15389] CR2: 00007fb74ca00000 CR3: 000000013e072000 CR4: 0000000000750ef0
[  144.964418][T15389] PKRU: 55555554
[  144.965100][T15389] Call Trace:
[  144.965746][T15389]  <TASK>
[  144.966331][T15389]  ? uniform_split_supported+0x270/0x290


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 4/8] mm/huge_memory: add buddy allocator like (non-uniform) folio_split()
  2025-02-16 10:32   ` David Hildenbrand
@ 2025-02-16 14:17     ` Zi Yan
  2025-02-17 15:22       ` Zi Yan
  0 siblings, 1 reply; 26+ messages in thread
From: Zi Yan @ 2025-02-16 14:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Kirill A . Shutemov,
	Matthew Wilcox (Oracle),
	Ryan Roberts, Hugh Dickins, Yang Shi, Miaohe Lin, Kefeng Wang,
	Yu Zhao, John Hubbard, Baolin Wang, linux-kselftest,
	linux-kernel

On 16 Feb 2025, at 5:32, David Hildenbrand wrote:

> On 11.02.25 16:50, Zi Yan wrote:
>> folio_split() splits a large folio in the same way as buddy allocator
>> splits a large free page for allocation. The purpose is to minimize the
>> number of folios after the split. For example, if user wants to free the
>> 3rd subpage in a order-9 folio, folio_split() will split the order-9 folio
>> as:
>> O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon,
>> since anon folio does not support order-1 yet.
>> -----------------------------------------------------------------
>> |   |   |   |   |     |   |       |                             |
>> |O-0|O-0|O-0|O-0| O-2 |...|  O-7  |             O-8             |
>> |   |   |   |   |     |   |       |                             |
>> -----------------------------------------------------------------
>>
>> O-1,      O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is pagecache
>> ---------------------------------------------------------------
>> |     |   |   |     |   |       |                             |
>> | O-1 |O-0|O-0| O-2 |...|  O-7  |             O-8             |
>> |     |   |   |     |   |       |                             |
>> ---------------------------------------------------------------
>>
>> It generates fewer folios (i.e., 11 or 10) than existing page split
>> approach, which splits the order-9 to 512 order-0 folios. It also reduces
>> the number of new xa_node needed during a pagecache folio split from
>> 8 to 1, potentially decreasing the folio split failure rate due to memory
>> constraints.
>>
>> folio_split() and existing split_huge_page_to_list_to_order() share
>> the folio unmapping and remapping code in __folio_split() and the common
>> backend split code in __split_unmapped_folio() using
>> uniform_split variable to distinguish their operations.
>>
>> uniform_split_supported() and non_uniform_split_supported() are added
>> to factor out check code and will be used outside __folio_split() in the
>> following commit.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>   mm/huge_memory.c | 137 ++++++++++++++++++++++++++++++++++-------------
>>   1 file changed, 100 insertions(+), 37 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 21ebe2dec5a4..400dfe8a6e60 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3853,12 +3853,68 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>>   	return ret;
>>   }
>>  +static bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>> +		bool warns)
>> +{
>> +	/* order-1 is not supported for anonymous THP. */
>> +	if (folio_test_anon(folio) && new_order == 1) {
>> +		VM_WARN_ONCE(warns, "Cannot split to order-1 folio");
>> +		return false;
>> +	}
>> +
>> +	/*
>> +	 * No split if the file system does not support large folio.
>> +	 * Note that we might still have THPs in such mappings due to
>> +	 * CONFIG_READ_ONLY_THP_FOR_FS. But in that case, the mapping
>> +	 * does not actually support large folios properly.
>> +	 */
>> +	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
>> +	    !mapping_large_folio_support(folio->mapping)) {
>
> In this (and a similar case below), you need
>
> if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
>     !folio_test_anon(folio) &&
>     !mapping_large_folio_support(folio->mapping)) {
>
> Otherwise mapping_large_folio_support() is unhappy:
>

Thanks. The patch below should fix it.

I am going to send V8, since
1. there have been 4 fixes so far for V7, so a new series would help people
review;

2. based on the discussion with you in the THP cabal meeting, to
convert split_huge_page*() to use __folio_split(), the current
__folio_split() interface becomes awkward. Two changes are needed (a rough
sketch of the resulting interface is below):
   a) use an in-folio offset instead of a struct page, since even in
     truncate_inode_partial_folio() I needed to convert an in-folio offset
     to a struct page to use my current interface;
   b) split_huge_page*()'s caller might hold the page lock at a non-head
     page, so an additional keep_lock_at_in_folio_offset is needed
     to indicate which after-split folio should be kept locked after
     the split is done.
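
A rough sketch of what such an interface could end up looking like, purely
for illustration (the parameter names are taken from a) and b) above and
are not the final V8 code):

	/*
	 * Sketch only: in_folio_offset replaces the current struct page
	 * parameter, and keep_lock_at_in_folio_offset selects which
	 * after-split folio is left locked for the caller.
	 */
	static int __folio_split(struct folio *folio, unsigned int new_order,
			unsigned long in_folio_offset,
			unsigned long keep_lock_at_in_folio_offset,
			struct list_head *list, bool uniform_split);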


From 8b2aa5432c8d726a1fb6ce74c971365650da9370 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Sun, 16 Feb 2025 09:01:29 -0500
Subject: [PATCH] mm/huge_memory: check folio_test_anon() before
 mapping_large_folio_support()

Otherwise mapping_large_folio_support() complains.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 48 ++++++++++++++++++++++++------------------------
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 87cb62c81bf3..deb16fe662c4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3629,20 +3629,19 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
 		bool warns)
 {
-	/* order-1 is not supported for anonymous THP. */
-	if (folio_test_anon(folio) && new_order == 1) {
-		VM_WARN_ONCE(warns, "Cannot split to order-1 folio");
-		return false;
-	}
-
-	/*
-	 * No split if the file system does not support large folio.
-	 * Note that we might still have THPs in such mappings due to
-	 * CONFIG_READ_ONLY_THP_FOR_FS. But in that case, the mapping
-	 * does not actually support large folios properly.
-	 */
-	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
+	if (folio_test_anon(folio)) {
+		/* order-1 is not supported for anonymous THP. */
+		VM_WARN_ONCE(warns && new_order == 1,
+				"Cannot split to order-1 folio");
+		return new_order != 1;
+	} else if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
 	    !mapping_large_folio_support(folio->mapping)) {
+		/*
+		 * No split if the file system does not support large folio.
+		 * Note that we might still have THPs in such mappings due to
+		 * CONFIG_READ_ONLY_THP_FOR_FS. But in that case, the mapping
+		 * does not actually support large folios properly.
+		 */
 		VM_WARN_ONCE(warns,
 			"Cannot split file folio to non-0 order");
 		return false;
@@ -3662,24 +3661,25 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
 bool uniform_split_supported(struct folio *folio, unsigned int new_order,
 		bool warns)
 {
-	if (folio_test_anon(folio) && new_order == 1) {
-		VM_WARN_ONCE(warns, "Cannot split to order-1 folio");
-		return false;
-	}
-
-	if (new_order) {
+	if (folio_test_anon(folio)) {
+		VM_WARN_ONCE(warns && new_order == 1,
+				"Cannot split to order-1 folio");
+		return new_order != 1;
+	} else  if (new_order) {
 		if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
 		    !mapping_large_folio_support(folio->mapping)) {
 			VM_WARN_ONCE(warns,
 				"Cannot split file folio to non-0 order");
 			return false;
 		}
-		if (folio_test_swapcache(folio)) {
-			VM_WARN_ONCE(warns,
-				"Cannot split swapcache folio to non-0 order");
-			return false;
-		}
 	}
+
+	if (new_order && folio_test_swapcache(folio)) {
+		VM_WARN_ONCE(warns,
+			"Cannot split swapcache folio to non-0 order");
+		return false;
+	}
+
 	return true;
 }

-- 
2.47.2



--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 4/8] mm/huge_memory: add buddy allocator like (non-uniform) folio_split()
  2025-02-16 14:17     ` Zi Yan
@ 2025-02-17 15:22       ` Zi Yan
  2025-02-18  4:12         ` Andrew Morton
  0 siblings, 1 reply; 26+ messages in thread
From: Zi Yan @ 2025-02-17 15:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Kirill A . Shutemov, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ryan Roberts, Hugh Dickins, Yang Shi, Miaohe Lin, Kefeng Wang,
	Yu Zhao, John Hubbard, Baolin Wang, linux-kselftest,
	linux-kernel

On 16 Feb 2025, at 9:17, Zi Yan wrote:

> On 16 Feb 2025, at 5:32, David Hildenbrand wrote:
>
>> On 11.02.25 16:50, Zi Yan wrote:
>>> folio_split() splits a large folio in the same way as buddy allocator
>>> splits a large free page for allocation. The purpose is to minimize the
>>> number of folios after the split. For example, if user wants to free the
>>> 3rd subpage in a order-9 folio, folio_split() will split the order-9 folio
>>> as:
>>> O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon,
>>> since anon folio does not support order-1 yet.
>>> -----------------------------------------------------------------
>>> |   |   |   |   |     |   |       |                             |
>>> |O-0|O-0|O-0|O-0| O-2 |...|  O-7  |             O-8             |
>>> |   |   |   |   |     |   |       |                             |
>>> -----------------------------------------------------------------
>>>
>>> O-1,      O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is pagecache
>>> ---------------------------------------------------------------
>>> |     |   |   |     |   |       |                             |
>>> | O-1 |O-0|O-0| O-2 |...|  O-7  |             O-8             |
>>> |     |   |   |     |   |       |                             |
>>> ---------------------------------------------------------------
>>>
>>> It generates fewer folios (i.e., 11 or 10) than existing page split
>>> approach, which splits the order-9 to 512 order-0 folios. It also reduces
>>> the number of new xa_node needed during a pagecache folio split from
>>> 8 to 1, potentially decreasing the folio split failure rate due to memory
>>> constraints.
>>>
>>> folio_split() and existing split_huge_page_to_list_to_order() share
>>> the folio unmapping and remapping code in __folio_split() and the common
>>> backend split code in __split_unmapped_folio() using
>>> uniform_split variable to distinguish their operations.
>>>
>>> uniform_split_supported() and non_uniform_split_supported() are added
>>> to factor out check code and will be used outside __folio_split() in the
>>> following commit.
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>   mm/huge_memory.c | 137 ++++++++++++++++++++++++++++++++++-------------
>>>   1 file changed, 100 insertions(+), 37 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 21ebe2dec5a4..400dfe8a6e60 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3853,12 +3853,68 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>   	return ret;
>>>   }
>>>  +static bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>>> +		bool warns)
>>> +{
>>> +	/* order-1 is not supported for anonymous THP. */
>>> +	if (folio_test_anon(folio) && new_order == 1) {
>>> +		VM_WARN_ONCE(warns, "Cannot split to order-1 folio");
>>> +		return false;
>>> +	}
>>> +
>>> +	/*
>>> +	 * No split if the file system does not support large folio.
>>> +	 * Note that we might still have THPs in such mappings due to
>>> +	 * CONFIG_READ_ONLY_THP_FOR_FS. But in that case, the mapping
>>> +	 * does not actually support large folios properly.
>>> +	 */
>>> +	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
>>> +	    !mapping_large_folio_support(folio->mapping)) {
>>
>> In this (and a similar case below), you need
>>
>> if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
>>     !folio_test_anon(folio) &&
>>     !mapping_large_folio_support(folio->mapping)) {
>>
>> Otherwise mapping_large_folio_support() is unhappy:
>>
>
> Thanks. The patch below should fix it.
>
> I am going to send V8, since
> 1. there have been 4 fixes so far for V7, so a new series would help people
> review;
>
> 2. based on the discussion with you in the THP cabal meeting, to
> convert split_huge_page*() to use __folio_split(), the current
> __folio_split() interface becomes awkward. Two changes are needed:
>    a) use an in-folio offset instead of a struct page, since even in
>      truncate_inode_partial_folio() I needed to convert an in-folio offset
>      to a struct page to use my current interface;
>    b) split_huge_page*()'s caller might hold the page lock at a non-head
>      page, so an additional keep_lock_at_in_folio_offset is needed
>      to indicate which after-split folio should be kept locked after
>      the split is done.
>

Hi Andrew,

I am planning to send V8 to collect all fixup patches I have so far plus
the one below and change the folio_split() interface and some of the code.
What is your preferred method?

1. you can pick up the fixup below and I send a new set of patches to
change folio_split();

2. I collect a new V8 with all fixup patches and folio_split() change.

For 1, the commit history might be messy due to my new folio_split()
change. For 2, the "Minimize xa_node allocation during xarray split" [1]
patchset depends on patch 1 of this series, which adds some extra work
for you to collect V8 (alternatively, I can send V8 without patch 1).

Let me know your thoughts. Thanks.


[1] https://lore.kernel.org/linux-mm/20250213034355.516610-1-ziy@nvidia.com/


>
> From 8b2aa5432c8d726a1fb6ce74c971365650da9370 Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@nvidia.com>
> Date: Sun, 16 Feb 2025 09:01:29 -0500
> Subject: [PATCH] mm/huge_memory: check folio_test_anon() before
>  mapping_large_folio_support()
>
> Otherwise mapping_large_folio_support() complains.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  mm/huge_memory.c | 48 ++++++++++++++++++++++++------------------------
>  1 file changed, 24 insertions(+), 24 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 87cb62c81bf3..deb16fe662c4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3629,20 +3629,19 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>  		bool warns)
>  {
> -	/* order-1 is not supported for anonymous THP. */
> -	if (folio_test_anon(folio) && new_order == 1) {
> -		VM_WARN_ONCE(warns, "Cannot split to order-1 folio");
> -		return false;
> -	}
> -
> -	/*
> -	 * No split if the file system does not support large folio.
> -	 * Note that we might still have THPs in such mappings due to
> -	 * CONFIG_READ_ONLY_THP_FOR_FS. But in that case, the mapping
> -	 * does not actually support large folios properly.
> -	 */
> -	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
> +	if (folio_test_anon(folio)) {
> +		/* order-1 is not supported for anonymous THP. */
> +		VM_WARN_ONCE(warns && new_order == 1,
> +				"Cannot split to order-1 folio");
> +		return new_order != 1;
> +	} else if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
>  	    !mapping_large_folio_support(folio->mapping)) {
> +		/*
> +		 * No split if the file system does not support large folio.
> +		 * Note that we might still have THPs in such mappings due to
> +		 * CONFIG_READ_ONLY_THP_FOR_FS. But in that case, the mapping
> +		 * does not actually support large folios properly.
> +		 */
>  		VM_WARN_ONCE(warns,
>  			"Cannot split file folio to non-0 order");
>  		return false;
> @@ -3662,24 +3661,25 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>  bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>  		bool warns)
>  {
> -	if (folio_test_anon(folio) && new_order == 1) {
> -		VM_WARN_ONCE(warns, "Cannot split to order-1 folio");
> -		return false;
> -	}
> -
> -	if (new_order) {
> +	if (folio_test_anon(folio)) {
> +		VM_WARN_ONCE(warns && new_order == 1,
> +				"Cannot split to order-1 folio");
> +		return new_order != 1;
> +	} else  if (new_order) {
>  		if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
>  		    !mapping_large_folio_support(folio->mapping)) {
>  			VM_WARN_ONCE(warns,
>  				"Cannot split file folio to non-0 order");
>  			return false;
>  		}
> -		if (folio_test_swapcache(folio)) {
> -			VM_WARN_ONCE(warns,
> -				"Cannot split swapcache folio to non-0 order");
> -			return false;
> -		}
>  	}
> +
> +	if (new_order && folio_test_swapcache(folio)) {
> +		VM_WARN_ONCE(warns,
> +			"Cannot split swapcache folio to non-0 order");
> +		return false;
> +	}
> +
>  	return true;
>  }
>
> -- 
> 2.47.2
>
>
>
> --
> Best Regards,
> Yan, Zi


--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/8] xarray: add xas_try_split() to split a multi-index entry.
  2025-02-11 15:50 ` [PATCH v7 1/8] xarray: add xas_try_split() to split a multi-index entry Zi Yan
  2025-02-12  0:57   ` Zi Yan
@ 2025-02-17 21:44   ` David Hildenbrand
  2025-02-17 22:05     ` Zi Yan
  1 sibling, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-02-17 21:44 UTC (permalink / raw)
  To: Zi Yan, linux-mm, Andrew Morton, Kirill A . Shutemov,
	Matthew Wilcox (Oracle)
  Cc: Ryan Roberts, Hugh Dickins, Yang Shi, Miaohe Lin, Kefeng Wang,
	Yu Zhao, John Hubbard, Baolin Wang, linux-kselftest,
	linux-kernel

On 11.02.25 16:50, Zi Yan wrote:
> It is a preparation patch for non-uniform folio split, which always splits
> a folio in half iteratively, and for minimal xarray entry split.
> 
> Currently, xas_split_alloc() and xas_split() always split all slots from a
> multi-index entry. They cost the same number of xa_node as the to-be-split
> slots. For example, to split an order-9 entry, which takes 2^(9-6)=8
> slots, assuming XA_CHUNK_SHIFT is 6 (!CONFIG_BASE_SMALL), 8 xa_node are
> needed. Instead xas_try_split() is intended to be used iteratively to split
> the order-9 entry into 2 order-8 entries, then split one order-8 entry,
> based on the given index, to 2 order-7 entries, ..., and split one order-1
> entry to 2 order-0 entries. When splitting the order-6 entry and a new
> xa_node is needed, xas_try_split() will try to allocate one if possible.
> As a result, xas_try_split() would only need one xa_node instead of 8.
> 
> When a new xa_node is needed during the split, xas_try_split() can try to
> allocate one but no more. -ENOMEM will be returned if a node cannot be
> allocated. -EINVAL will be returned if a sibling node is split or
> cascade split happens, where two or more new nodes are needed, and these
> are not supported by xas_try_split().
> 
> xas_split_alloc() and xas_split() split an order-9 to order-0:
> 
>           ---------------------------------
>           |   |   |   |   |   |   |   |   |
>           | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
>           |   |   |   |   |   |   |   |   |
>           ---------------------------------
>             |   |                   |   |
>       -------   ---               ---   -------
>       |           |     ...       |           |
>       V           V               V           V
> ----------- -----------     ----------- -----------
> | xa_node | | xa_node | ... | xa_node | | xa_node |
> ----------- -----------     ----------- -----------
> 
> xas_try_split() splits an order-9 to order-0:
>     ---------------------------------
>     |   |   |   |   |   |   |   |   |
>     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
>     |   |   |   |   |   |   |   |   |
>     ---------------------------------
>       |
>       |
>       V
> -----------
> | xa_node |
> -----------
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   Documentation/core-api/xarray.rst |  14 ++-
>   include/linux/xarray.h            |   7 ++
>   lib/test_xarray.c                 |  47 +++++++++++
>   lib/xarray.c                      | 136 ++++++++++++++++++++++++++----
>   tools/testing/radix-tree/Makefile |   1 +
>   5 files changed, 188 insertions(+), 17 deletions(-)
> 
> diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
> index f6a3eef4fe7f..c6c91cbd0c3c 100644
> --- a/Documentation/core-api/xarray.rst
> +++ b/Documentation/core-api/xarray.rst
> @@ -489,7 +489,19 @@ Storing ``NULL`` into any index of a multi-index entry will set the
>   entry at every index to ``NULL`` and dissolve the tie.  A multi-index
>   entry can be split into entries occupying smaller ranges by calling
>   xas_split_alloc() without the xa_lock held, followed by taking the lock
> -and calling xas_split().
> +and calling xas_split() or calling xas_try_split() with xa_lock. The
> +difference between xas_split_alloc()+xas_split() and xas_try_split() is
> +that xas_split_alloc() + xas_split() split the entry from the original
> +order to the new order in one shot uniformly, whereas xas_try_split()
> +iteratively splits the entry containing the index non-uniformly.
> +For example, to split an order-9 entry, which takes 2^(9-6)=8 slots,
> +assuming ``XA_CHUNK_SHIFT`` is 6, xas_split_alloc() + xas_split() need
> +8 xa_node. xas_try_split() splits the order-9 entry into
> +2 order-8 entries, then split one order-8 entry, based on the given index,
> +to 2 order-7 entries, ..., and split one order-1 entry to 2 order-0 entries.
> +When splitting the order-6 entry and a new xa_node is needed, xas_try_split()
> +will try to allocate one if possible. As a result, xas_try_split() would only
> +need 1 xa_node instead of 8.
>   
>   Functions and structures
>   ========================
> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> index 0b618ec04115..9eb8c7425090 100644
> --- a/include/linux/xarray.h
> +++ b/include/linux/xarray.h
> @@ -1555,6 +1555,8 @@ int xa_get_order(struct xarray *, unsigned long index);
>   int xas_get_order(struct xa_state *xas);
>   void xas_split(struct xa_state *, void *entry, unsigned int order);
>   void xas_split_alloc(struct xa_state *, void *entry, unsigned int order, gfp_t);
> +void xas_try_split(struct xa_state *xas, void *entry, unsigned int order,
> +		gfp_t gfp);
>   #else
>   static inline int xa_get_order(struct xarray *xa, unsigned long index)
>   {
> @@ -1576,6 +1578,11 @@ static inline void xas_split_alloc(struct xa_state *xas, void *entry,
>   		unsigned int order, gfp_t gfp)
>   {
>   }
> +
> +static inline void xas_try_split(struct xa_state *xas, void *entry,
> +		unsigned int order, gfp_t gfp)
> +{
> +}
>   #endif
>   
>   /**
> diff --git a/lib/test_xarray.c b/lib/test_xarray.c
> index 6932a26f4927..598ca38a2f5b 100644
> --- a/lib/test_xarray.c
> +++ b/lib/test_xarray.c
> @@ -1857,6 +1857,49 @@ static void check_split_1(struct xarray *xa, unsigned long index,
>   	xa_destroy(xa);
>   }
>   
> +static void check_split_2(struct xarray *xa, unsigned long index,
> +				unsigned int order, unsigned int new_order)
> +{
> +	XA_STATE_ORDER(xas, xa, index, new_order);
> +	unsigned int i, found;
> +	void *entry;
> +
> +	xa_store_order(xa, index, order, xa, GFP_KERNEL);
> +	xa_set_mark(xa, index, XA_MARK_1);
> +
> +	xas_lock(&xas);
> +	xas_try_split(&xas, xa, order, GFP_KERNEL);
> +	if (((new_order / XA_CHUNK_SHIFT) < (order / XA_CHUNK_SHIFT)) &&
> +	    new_order < order - 1) {
> +		XA_BUG_ON(xa, !xas_error(&xas) || xas_error(&xas) != -EINVAL);
> +		xas_unlock(&xas);
> +		goto out;
> +	}
> +	for (i = 0; i < (1 << order); i += (1 << new_order))
> +		__xa_store(xa, index + i, xa_mk_index(index + i), 0);
> +	xas_unlock(&xas);
> +
> +	for (i = 0; i < (1 << order); i++) {
> +		unsigned int val = index + (i & ~((1 << new_order) - 1));
> +		XA_BUG_ON(xa, xa_load(xa, index + i) != xa_mk_index(val));
> +	}
> +
> +	xa_set_mark(xa, index, XA_MARK_0);
> +	XA_BUG_ON(xa, !xa_get_mark(xa, index, XA_MARK_0));
> +
> +	xas_set_order(&xas, index, 0);
> +	found = 0;
> +	rcu_read_lock();
> +	xas_for_each_marked(&xas, entry, ULONG_MAX, XA_MARK_1) {
> +		found++;
> +		XA_BUG_ON(xa, xa_is_internal(entry));
> +	}
> +	rcu_read_unlock();
> +	XA_BUG_ON(xa, found != 1 << (order - new_order));
> +out:
> +	xa_destroy(xa);
> +}
> +
>   static noinline void check_split(struct xarray *xa)
>   {
>   	unsigned int order, new_order;
> @@ -1868,6 +1911,10 @@ static noinline void check_split(struct xarray *xa)
>   			check_split_1(xa, 0, order, new_order);
>   			check_split_1(xa, 1UL << order, order, new_order);
>   			check_split_1(xa, 3UL << order, order, new_order);
> +
> +			check_split_2(xa, 0, order, new_order);
> +			check_split_2(xa, 1UL << order, order, new_order);
> +			check_split_2(xa, 3UL << order, order, new_order);
>   		}
>   	}
>   }
> diff --git a/lib/xarray.c b/lib/xarray.c
> index 116e9286c64e..c38beca77830 100644
> --- a/lib/xarray.c
> +++ b/lib/xarray.c
> @@ -1007,6 +1007,31 @@ static void node_set_marks(struct xa_node *node, unsigned int offset,
>   	}
>   }
>   
> +static struct xa_node *__xas_alloc_node_for_split(struct xa_state *xas,
> +		void *entry, gfp_t gfp)
> +{
> +	unsigned int i;
> +	void *sibling = NULL;
> +	struct xa_node *node;
> +	unsigned int mask = xas->xa_sibs;
> +
> +	node = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
> +	if (!node)
> +		return NULL;
> +	node->array = xas->xa;
> +	for (i = 0; i < XA_CHUNK_SIZE; i++) {
> +		if ((i & mask) == 0) {
> +			RCU_INIT_POINTER(node->slots[i], entry);
> +			sibling = xa_mk_sibling(i);
> +		} else {
> +			RCU_INIT_POINTER(node->slots[i], sibling);
> +		}
> +	}
> +	RCU_INIT_POINTER(node->parent, xas->xa_alloc);
> +
> +	return node;
> +}
> +
>   /**
>    * xas_split_alloc() - Allocate memory for splitting an entry.
>    * @xas: XArray operation state.
> @@ -1025,7 +1050,6 @@ void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order,
>   		gfp_t gfp)
>   {
>   	unsigned int sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1;
> -	unsigned int mask = xas->xa_sibs;
>   
>   	/* XXX: no support for splitting really large entries yet */
>   	if (WARN_ON(xas->xa_shift + 2 * XA_CHUNK_SHIFT <= order))
> @@ -1034,23 +1058,9 @@ void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order,
>   		return;
>   
>   	do {
> -		unsigned int i;
> -		void *sibling = NULL;
> -		struct xa_node *node;
> -
> -		node = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
> +		struct xa_node *node = __xas_alloc_node_for_split(xas, entry, gfp);
>   		if (!node)
>   			goto nomem;
> -		node->array = xas->xa;
> -		for (i = 0; i < XA_CHUNK_SIZE; i++) {
> -			if ((i & mask) == 0) {
> -				RCU_INIT_POINTER(node->slots[i], entry);
> -				sibling = xa_mk_sibling(i);
> -			} else {
> -				RCU_INIT_POINTER(node->slots[i], sibling);
> -			}
> -		}
> -		RCU_INIT_POINTER(node->parent, xas->xa_alloc);
>   		xas->xa_alloc = node;
>   	} while (sibs-- > 0);
>   
> @@ -1122,6 +1132,100 @@ void xas_split(struct xa_state *xas, void *entry, unsigned int order)
>   	xas_update(xas, node);
>   }
>   EXPORT_SYMBOL_GPL(xas_split);
> +
> +/**
> + * xas_try_split() - Try to split a multi-index entry.
> + * @xas: XArray operation state.
> + * @entry: New entry to store in the array.
> + * @order: Current entry order.
> + * @gfp: Memory allocation flags.
> + *
> + * The size of the new entries is set in @xas.  The value in @entry is
> + * copied to all the replacement entries. If and only if one xa_node needs to
> + * be allocated, the function will use @gfp to get one. If more than one
> + * xa_node is needed, the function returns -EINVAL.
> + *
> + * Context: Any context.  The caller should hold the xa_lock.
> + */
> +void xas_try_split(struct xa_state *xas, void *entry, unsigned int order,
> +		gfp_t gfp)
> +{
> +	unsigned int sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1;
> +	unsigned int offset, marks;
> +	struct xa_node *node;
> +	void *curr = xas_load(xas);
> +	int values = 0;
> +
> +	node = xas->xa_node;
> +	if (xas_top(node))
> +		return;
> +
> +	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
> +		gfp |= __GFP_ACCOUNT;
> +
> +	marks = node_get_marks(node, xas->xa_offset);
> +
> +	offset = xas->xa_offset + sibs;
> +	do {
> +		if (xas->xa_shift < node->shift) {
> +			struct xa_node *child = xas->xa_alloc;
> +			unsigned int expected_sibs =
> +				(1 << ((order - 1) % XA_CHUNK_SHIFT)) - 1;
> +
> +			/*
> +			 * No support for splitting sibling entries
> +			 * (horizontally) or cascade split (vertically), which
> +			 * requires two or more new xa_nodes.
> +			 * Since if one xa_node allocation fails,
> +			 * it is hard to free the prior allocations.
> +			 */
> +			if (sibs || xas->xa_sibs != expected_sibs) {
> +				xas_destroy(xas);
> +				xas_set_err(xas, -EINVAL);
> +				return;
> +			}
> +
> +			if (!child) {
> +				child = __xas_alloc_node_for_split(xas, entry,
> +						gfp);
> +				if (!child) {
> +					xas_destroy(xas);
> +					xas_set_err(xas, -ENOMEM);
> +					return;
> +				}
> +			}

No expert on this, just wondering ...

... what is the effect if we halfway-through fail the split? Is it okay 
to leave that "partially split" thing in place? Can callers deal with that?

-- 
Cheers,

David / dhildenb




* Re: [PATCH v7 1/8] xarray: add xas_try_split() to split a multi-index entry.
  2025-02-17 21:44   ` David Hildenbrand
@ 2025-02-17 22:05     ` Zi Yan
  2025-02-18 15:44       ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Zi Yan @ 2025-02-17 22:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Kirill A . Shutemov,
	Matthew Wilcox (Oracle),
	Ryan Roberts, Hugh Dickins, Yang Shi, Miaohe Lin, Kefeng Wang,
	Yu Zhao, John Hubbard, Baolin Wang, linux-kselftest,
	linux-kernel

On 17 Feb 2025, at 16:44, David Hildenbrand wrote:

> On 11.02.25 16:50, Zi Yan wrote:
>> [...]
>> +/**
>> + * xas_try_split() - Try to split a multi-index entry.
>> + * @xas: XArray operation state.
>> + * @entry: New entry to store in the array.
>> + * @order: Current entry order.
>> + * @gfp: Memory allocation flags.
>> + *
>> + * The size of the new entries is set in @xas.  The value in @entry is
>> + * copied to all the replacement entries. If and only if one xa_node needs to
>> + * be allocated, the function will use @gfp to get one. If more xa_node are
>> + * needed, the function gives EINVAL error.
>> + *
>> + * Context: Any context.  The caller should hold the xa_lock.
>> + */
>> +void xas_try_split(struct xa_state *xas, void *entry, unsigned int order,
>> +		gfp_t gfp)
>> +{
>> +	unsigned int sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1;
>> +	unsigned int offset, marks;
>> +	struct xa_node *node;
>> +	void *curr = xas_load(xas);
>> +	int values = 0;
>> +
>> +	node = xas->xa_node;
>> +	if (xas_top(node))
>> +		return;
>> +
>> +	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
>> +		gfp |= __GFP_ACCOUNT;
>> +
>> +	marks = node_get_marks(node, xas->xa_offset);
>> +
>> +	offset = xas->xa_offset + sibs;
>> +	do {
>> +		if (xas->xa_shift < node->shift) {
>> +			struct xa_node *child = xas->xa_alloc;
>> +			unsigned int expected_sibs =
>> +				(1 << ((order - 1) % XA_CHUNK_SHIFT)) - 1;
>> +
>> +			/*
>> +			 * No support for splitting sibling entries
>> +			 * (horizontally) or cascade split (vertically), which
>> +			 * requires two or more new xa_nodes.
>> +			 * Since if one xa_node allocation fails,
>> +			 * it is hard to free the prior allocations.
>> +			 */
>> +			if (sibs || xas->xa_sibs != expected_sibs) {
>> +				xas_destroy(xas);
>> +				xas_set_err(xas, -EINVAL);
>> +				return;
>> +			}
>> +
>> +			if (!child) {
>> +				child = __xas_alloc_node_for_split(xas, entry,
>> +						gfp);
>> +				if (!child) {
>> +					xas_destroy(xas);
>> +					xas_set_err(xas, -ENOMEM);
>> +					return;
>> +				}
>> +			}
>
> No expert on this, just wondering ...
>
> ... what is the effect if we halfway-through fail the split? Is it okay to leave that "partially split" thing in place? Can callers deal with that?

Good question.

xas_try_split() restricts the kind of split it performs and is usually used
to split from order N to order N-1:

1. when N is a multiple of XA_CHUNK_SHIFT, a new xa_node is needed, so
either child (namely xas->xa_alloc) is not NULL, meaning someone called
xas_nomem() to allocate an xa_node before xas_try_split(), or child is NULL
and an allocation is needed. If child is still NULL after the allocation,
meaning we are out of memory, no split is done;

2. when N is not a multiple of XA_CHUNK_SHIFT, no new xa_node is needed and
xas_try_split() just rewrites existing slot values to perform the split (the
code after the hunk above). This cannot fail. For this kind of split, since
no new xa_node is needed, the caller is actually allowed to split from N to
an order smaller than N-1, as long as the new order is
>= (N - N % XA_CHUNK_SHIFT).


Various checks make sure xas_try_split() only sees the two situations above:

a. "xas->xa_shift < node->shift" means the split crosses XA_CHUNK_SHIFT, so
at least 1 new xa_node is needed; the else branch only handles case 2 above;

b. for the then branch, the "if (sibs || xas->xa_sibs != expected_sibs)"
check makes sure N is a multiple of XA_CHUNK_SHIFT and the new order is N-1.
In that check, "sibs != 0" means the from-order-N entry covers more than 1
slot, so more than 1 new xa_node would be needed, and
"xas->xa_sibs != expected_sibs" makes sure the new order is N-1 (you can see
it from how expected_sibs is assigned).
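
To tie the two cases together, here is a minimal caller-side sketch (mine,
not taken from the patch; xa, index, entry and order are placeholders, and
GFP_NOWAIT is just one reasonable gfp since xa_lock is held) of an order-N
to order-(N-1) split:

	XA_STATE_ORDER(xas, xa, index, order - 1);

	xas_lock(&xas);
	xas_try_split(&xas, entry, order, GFP_NOWAIT);
	if (xas_error(&xas)) {
		/*
		 * -ENOMEM: the one xa_node that was needed could not be
		 * allocated; -EINVAL: the requested split would need two or
		 * more new xa_nodes. In both cases the multi-index entry is
		 * left untouched.
		 */
	}
	xas_unlock(&xas);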

Let me know if you have any other question.

--
Best Regards,
Yan, Zi



* Re: [PATCH v7 4/8] mm/huge_memory: add buddy allocator like (non-uniform) folio_split()
  2025-02-17 15:22       ` Zi Yan
@ 2025-02-18  4:12         ` Andrew Morton
  2025-02-18 15:23           ` Zi Yan
  0 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2025-02-18  4:12 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Kirill A . Shutemov, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ryan Roberts, Hugh Dickins, Yang Shi, Miaohe Lin, Kefeng Wang,
	Yu Zhao, John Hubbard, Baolin Wang, linux-kselftest,
	linux-kernel

On Mon, 17 Feb 2025 10:22:44 -0500 Zi Yan <ziy@nvidia.com> wrote:

> >
> > Thanks. The patch below should fix it.
> >
> > I am going to send V8, since
> > 1. there have been 4 fixes so far for V7, a new series would help people
> > review;
> >
> > 2.  based on the discussion with you in THP cabal meeting, to
> > convert split_huge_page*() to use __folio_split(), the current
> > __folio_split() interface becomes awkward. Two changes are needed:
> >    a) use in folio offset instead of struct page, since even in
> >      truncate_inode_partial_folio() I needed to convert in folio offset
> >      struct page to use my current interface;
> >    b) split_huge_page*()'s caller might hold the page lock at a non-head
> >      page, so an additional keep_lock_at_in_folio_offset is needed
> >      to indicate which after-split folio should be kept locked after
> >      split is done.
> >
> 
> Hi Andrew,
> 
> I am planing to send V8 to collect all fixup patches I have so far plus
> the one below and change folio_split() interface and some of the code.
> What is your preferred method?
> 
> 1. you can pick up the fixup below and I send a new set of patches to
> change folio_split();
> 
> 2. I collect a new V8 with all fixup patches and folio_split() change.
> 
> For 1, the commit history might be messy due to my new folio_split()
> change. For 2, Minimize xa_node allocation during xarry split [1]
> patchset depends on patch 1 of this series, which adds some extra work
> for you to collect V8 (alternatively, I can send V8 without patch 1).

We're only at -rc3, so I'll remove both series from mm.git.  Please
fully resend both series against mm-unstable?



* Re: [PATCH v7 4/8] mm/huge_memory: add buddy allocator like (non-uniform) folio_split()
  2025-02-18  4:12         ` Andrew Morton
@ 2025-02-18 15:23           ` Zi Yan
  0 siblings, 0 replies; 26+ messages in thread
From: Zi Yan @ 2025-02-18 15:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Kirill A . Shutemov, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ryan Roberts, Hugh Dickins, Yang Shi, Miaohe Lin, Kefeng Wang,
	Yu Zhao, John Hubbard, Baolin Wang, linux-kselftest,
	linux-kernel

On 17 Feb 2025, at 23:12, Andrew Morton wrote:

> On Mon, 17 Feb 2025 10:22:44 -0500 Zi Yan <ziy@nvidia.com> wrote:
>
>>> [...]
>>
>> Hi Andrew,
>>
>> I am planing to send V8 to collect all fixup patches I have so far plus
>> the one below and change folio_split() interface and some of the code.
>> What is your preferred method?
>>
>> 1. you can pick up the fixup below and I send a new set of patches to
>> change folio_split();
>>
>> 2. I collect a new V8 with all fixup patches and folio_split() change.
>>
>> For 1, the commit history might be messy due to my new folio_split()
>> change. For 2, Minimize xa_node allocation during xarry split [1]
>> patchset depends on patch 1 of this series, which adds some extra work
>> for you to collect V8 (alternatively, I can send V8 without patch 1).
>
> We're only at -rc3, so I'll remove both series from mm.git.  Please
> fully resend both series against mm-unstable?

Got it.

Best Regards,
Yan, Zi



* Re: [PATCH v7 1/8] xarray: add xas_try_split() to split a multi-index entry.
  2025-02-17 22:05     ` Zi Yan
@ 2025-02-18 15:44       ` David Hildenbrand
  2025-02-18 16:04         ` Zi Yan
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-02-18 15:44 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Andrew Morton, Kirill A . Shutemov,
	Matthew Wilcox (Oracle),
	Ryan Roberts, Hugh Dickins, Yang Shi, Miaohe Lin, Kefeng Wang,
	Yu Zhao, John Hubbard, Baolin Wang, linux-kselftest,
	linux-kernel

On 17.02.25 23:05, Zi Yan wrote:
> On 17 Feb 2025, at 16:44, David Hildenbrand wrote:
> 
>> On 11.02.25 16:50, Zi Yan wrote:
>>> [...]
>>> +			if (!child) {
>>> +				child = __xas_alloc_node_for_split(xas, entry,
>>> +						gfp);
>>> +				if (!child) {
>>> +					xas_destroy(xas);
>>> +					xas_set_err(xas, -ENOMEM);
>>> +					return;
>>> +				}
>>> +			}
>>
>> No expert on this, just wondering ...
>>
>> ... what is the effect if we halfway-through fail the split? Is it okay to leave that "partially split" thing in place? Can callers deal with that?
> 
> Good question.
> 

Let me rephrase: In __split_unmapped_folio(), we call xas_try_split(). 
If that fails, we stop the split and effectively skip over the 
__split_folio_to_order(). The folio remains unsplit (no order change: 
old_order).

xas_try_split() was instructed to split from old_order -> split_order.

xas_try_split() documents that: "The value in @entry is copied to all 
the replacement entries.", meaning after the split, all entries will be 
pointing at the folio.

Now, can it happen that xas_try_split() would ever perform a partial 
split in any way, when invoked from __split_unmapped_folio(), such that 
we run into the do { } while(); loop and fail with -ENOMEM after already 
having performed changes -- xas_update().

Or is that simply impossible?

Maybe it's just the do { } while(); loop in there that is confusing me. 
(again, no expert)

> xas_try_split() restricts the kind of split it performs and is usually used
> to split from order N to order N-1:

You mean that old_order -> split_order will in the case of 
__split_unmapped_folio() always be a difference of 1?

> 
> 1. when N is a multiple of XA_CHUNK_SHIFT, a new xa_node is needed, so
> either child (namely xas->xa_alloc) is not NULL, meaning someone called
> xas_nomem() to allocate an xa_node before xas_try_split(), or child is NULL
> and an allocation is needed. If child is still NULL after the allocation,
> meaning we are out of memory, no split is done;
> 
> 2. when N is not a multiple of XA_CHUNK_SHIFT, no new xa_node is needed and
> xas_try_split() just rewrites existing slot values to perform the split (the
> code after the hunk above). This cannot fail. For this kind of split, since
> no new xa_node is needed, the caller is actually allowed to split from N to
> an order smaller than N-1, as long as the new order is
> >= (N - N % XA_CHUNK_SHIFT).
> 
> 
> Various checks make sure xas_try_split() only sees the two situations above:
> 
> a. "xas->xa_shift < node->shift" means the split crosses XA_CHUNK_SHIFT, so
> at least 1 new xa_node is needed; the else branch only handles case 2 above;
> 
> b. for the then branch, the "if (sibs || xas->xa_sibs != expected_sibs)"
> check makes sure N is a multiple of XA_CHUNK_SHIFT and the new order is N-1.
> In that check, "sibs != 0" means the from-order-N entry covers more than 1
> slot, so more than 1 new xa_node would be needed, and
> "xas->xa_sibs != expected_sibs" makes sure the new order is N-1 (you can see
> it from how expected_sibs is assigned).

Thanks!

> 
> Let me know if you have any other question.

Thanks for the details!

-- 
Cheers,

David / dhildenb




* Re: [PATCH v7 1/8] xarray: add xas_try_split() to split a multi-index entry.
  2025-02-18 15:44       ` David Hildenbrand
@ 2025-02-18 16:04         ` Zi Yan
  2025-02-18 16:12           ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Zi Yan @ 2025-02-18 16:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Kirill A . Shutemov,
	Matthew Wilcox (Oracle),
	Ryan Roberts, Hugh Dickins, Yang Shi, Miaohe Lin, Kefeng Wang,
	Yu Zhao, John Hubbard, Baolin Wang, linux-kselftest,
	linux-kernel

On 18 Feb 2025, at 10:44, David Hildenbrand wrote:

> On 17.02.25 23:05, Zi Yan wrote:
>> On 17 Feb 2025, at 16:44, David Hildenbrand wrote:
>>
>>> On 11.02.25 16:50, Zi Yan wrote:
>>>> [...]
>>>> +			if (!child) {
>>>> +				child = __xas_alloc_node_for_split(xas, entry,
>>>> +						gfp);
>>>> +				if (!child) {
>>>> +					xas_destroy(xas);
>>>> +					xas_set_err(xas, -ENOMEM);
>>>> +					return;
>>>> +				}
>>>> +			}
>>>
>>> No expert on this, just wondering ...
>>>
>>> ... what is the effect if we halfway-through fail the split? Is it okay to leave that "partially split" thing in place? Can callers deal with that?
>>
>> Good question.
>>
>
> Let me rephrase: In __split_unmapped_folio(), we call xas_try_split(). If that fails, we stop the split and effectively skip over the __split_folio_to_order(). The folio remains unsplit (no order change: old_order).

Right. To be more specific, in the !uniform_split case the original folio
can be partially split, so old_order can change. Namely, if the caller wants
to split an order-9 folio, folio_split() can split it into 2 order-6s,
1 order-7, and 1 order-8, then fail to allocate a new xa_node due to memory
constraints and stop. The caller will get 2 order-6s, 1 order-7, and
1 order-8, and folio_split() returns -ENOMEM. The caller needs to handle
this situation, although it should be quite rare, because unless the caller
is splitting an order-12 (or even higher order) folio down to order-0, at
most 1 xa_node is needed.
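
Spelling that example out (my own walk-through of the above, assuming
XA_CHUNK_SHIFT is 6):

  order-9 -> 1 order-8 + 1 order-8   (slot rewrites only)
  order-8 -> 1 order-7 + 1 order-7   (slot rewrites only)
  order-7 -> 1 order-6 + 1 order-6   (slot rewrites only)
  order-6 -> 1 order-5 + 1 order-5   (needs 1 new xa_node; if that allocation
                                      fails, the split stops here, leaving
                                      1 order-8, 1 order-7 and 2 order-6s)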

>
> xas_try_split() was instructed to split from old_order -> split_order.
>
> xas_try_split() documents that: "The value in @entry is copied to all the replacement entries.", meaning after the split, all entries will be pointing at the folio.

Right.

>
> Now, can it happen that xas_try_split() would ever perform a partial split in any way, when invoked from __split_unmapped_folio(), such that we run into the do { } while(); loop and fail with -ENOMEM after already having performed changes -- xas_update().
>
> Or is that simply impossible?

Right. It is impossible. xas_try_split() either splits by copying @entry
to all the replacement entries, or it tries to allocate a new xa_node,
which can fail with -ENOMEM. These two cases are never mixed.

>
> Maybe it's just the do { } while(); loop in there that is confusing me. (again, no expert)

Yeah, the do-while loop is confusing. Let me restructure the code
so that the do-while loop only runs in the @entry copy case, not in the
xa_node allocation case.

>
>> xas_try_split() restricts the kind of split it performs and is usually used
>> to split from order N to order N-1:
>
> You mean that old_order -> split_order will in the case of __split_unmapped_folio() always be a difference of 1?

Yes, for the !uniform_split case. For the uniform_split case (which
split_huge_page*() uses), xas_split() is used and all required new xa_nodes
are preallocated by xas_split_alloc() in __folio_split().
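
For contrast, the uniform path follows the pattern described in the
documentation hunk of patch 1 (again only a sketch; xa, index, entry,
old_order and new_order are placeholders, and GFP_KERNEL is just an
example gfp):

	XA_STATE_ORDER(xas, xa, index, new_order);

	/* preallocate every xa_node the split will need, without xa_lock */
	xas_split_alloc(&xas, entry, old_order, GFP_KERNEL);
	if (!xas_error(&xas)) {
		xas_lock(&xas);
		xas_split(&xas, entry, old_order);
		xas_unlock(&xas);
	}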

>
>>
>> 1. when N is a multiple of XA_CHUNK_SHIFT, a new xa_node is needed, so
>> either child (namely xas->xa_alloc) is not NULL, meaning someone called
>> xas_nomem() to allocate an xa_node before xas_try_split(), or child is NULL
>> and an allocation is needed. If child is still NULL after the allocation,
>> meaning we are out of memory, no split is done;
>>
>> 2. when N is not a multiple of XA_CHUNK_SHIFT, no new xa_node is needed and
>> xas_try_split() just rewrites existing slot values to perform the split (the
>> code after the hunk above). This cannot fail. For this kind of split, since
>> no new xa_node is needed, the caller is actually allowed to split from N to
>> an order smaller than N-1, as long as the new order is
>> >= (N - N % XA_CHUNK_SHIFT).
>>
>>
>> Various checks make sure xas_try_split() only sees the two situations above:
>>
>> a. "xas->xa_shift < node->shift" means the split crosses XA_CHUNK_SHIFT, so
>> at least 1 new xa_node is needed; the else branch only handles case 2 above;
>>
>> b. for the then branch, the "if (sibs || xas->xa_sibs != expected_sibs)"
>> check makes sure N is a multiple of XA_CHUNK_SHIFT and the new order is N-1.
>> In that check, "sibs != 0" means the from-order-N entry covers more than 1
>> slot, so more than 1 new xa_node would be needed, and
>> "xas->xa_sibs != expected_sibs" makes sure the new order is N-1 (you can see
>> it from how expected_sibs is assigned).
>
> Thanks!
>
>>
>> Let me know if you have any other question.
>
> Thanks for the details!

Thank you for checking the code. :)

Best Regards,
Yan, Zi



* Re: [PATCH v7 1/8] xarray: add xas_try_split() to split a multi-index entry.
  2025-02-18 16:04         ` Zi Yan
@ 2025-02-18 16:12           ` David Hildenbrand
  0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-02-18 16:12 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Andrew Morton, Kirill A . Shutemov,
	Matthew Wilcox (Oracle),
	Ryan Roberts, Hugh Dickins, Yang Shi, Miaohe Lin, Kefeng Wang,
	Yu Zhao, John Hubbard, Baolin Wang, linux-kselftest,
	linux-kernel

>>
>> Now, can it happen that xas_try_split() would ever perform a partial split in any way, when invoked from __split_unmapped_folio(), such that we run into the do { } while(); loop and fail with -ENOMEM after already having performed changes -- xas_update().
>>
>> Or is that simply impossible?
> 
> Right. It is impossible. xas_try_split() either splits by copying @entry
> to all the replacement entries, or it tries to allocate a new xa_node,
> which can fail with -ENOMEM. These two cases are never mixed.
> 
>>
>> Maybe it's just the do { } while(); loop in there that is confusing me. (again, no expert)
> 
> Yeah, the do-while loop is confusing. Let me restructure the code
> so that the do-while loop only runs in the @entry copy case, not in the
> xa_node allocation case.

Great!

> 
>>
>>> xas_try_split() restricts the kind of split it performs and is usually used
>>> to split from order N to order N-1:
>>
>> You mean that old_order -> split_order will in the case of __split_unmapped_folio() always be a difference of 1?
> 
> Yes, for the !uniform_split case. For the uniform_split case (which
> split_huge_page*() uses), xas_split() is used and all required new xa_nodes
> are preallocated by xas_split_alloc() in __folio_split().

Got it, thanks!

-- 
Cheers,

David / dhildenb




end of thread

Thread overview: 26+ messages
2025-02-11 15:50 [PATCH v7 0/8] Buddy allocator like (or non-uniform) folio split Zi Yan
2025-02-11 15:50 ` [PATCH v7 1/8] xarray: add xas_try_split() to split a multi-index entry Zi Yan
2025-02-12  0:57   ` Zi Yan
2025-02-12  1:51     ` Zi Yan
2025-02-17 21:44   ` David Hildenbrand
2025-02-17 22:05     ` Zi Yan
2025-02-18 15:44       ` David Hildenbrand
2025-02-18 16:04         ` Zi Yan
2025-02-18 16:12           ` David Hildenbrand
2025-02-11 15:50 ` [PATCH v7 2/8] mm/huge_memory: add two new (not yet used) functions for folio_split() Zi Yan
2025-02-14 21:59   ` David Hildenbrand
2025-02-14 22:03     ` Zi Yan
2025-02-14 22:06       ` David Hildenbrand
2025-02-14 22:18         ` Zi Yan
2025-02-15  1:52   ` Zi Yan
2025-02-11 15:50 ` [PATCH v7 3/8] mm/huge_memory: move folio split common code to __folio_split() Zi Yan
2025-02-11 15:50 ` [PATCH v7 4/8] mm/huge_memory: add buddy allocator like (non-uniform) folio_split() Zi Yan
2025-02-16 10:32   ` David Hildenbrand
2025-02-16 14:17     ` Zi Yan
2025-02-17 15:22       ` Zi Yan
2025-02-18  4:12         ` Andrew Morton
2025-02-18 15:23           ` Zi Yan
2025-02-11 15:50 ` [PATCH v7 5/8] mm/huge_memory: remove the old, unused __split_huge_page() Zi Yan
2025-02-11 15:50 ` [PATCH v7 6/8] mm/huge_memory: add folio_split() to debugfs testing interface Zi Yan
2025-02-11 15:50 ` [PATCH v7 7/8] mm/truncate: use buddy allocator like folio split for truncate operation Zi Yan
2025-02-11 15:50 ` [PATCH v7 8/8] selftests/mm: add tests for folio_split(), buddy allocator like split Zi Yan
