* [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries
@ 2025-11-17 22:46 Ackerley Tng
2025-11-17 22:46 ` [RFC PATCH 1/4] XArray: Initialize nodes while splitting instead of while allocating Ackerley Tng
` (5 more replies)
0 siblings, 6 replies; 10+ messages in thread
From: Ackerley Tng @ 2025-11-17 22:46 UTC (permalink / raw)
To: willy, akpm, linux-fsdevel, linux-mm, linux-kernel
Cc: david, michael.roth, vannapurve, Ackerley Tng
guest_memfd is planning to store huge pages in the filemap, and
guest_memfd's use of huge pages involves splitting huge pages into
individual pages. Splitting a huge page also involves splitting the
filemap entries for the pages being split.
To summarize the context of how these patches will be used,
+ guest_memfd stores huge pages (up to 1G pages) in the filemap.
+ When splitting a folio, guest_memfd first splits the filemap (XArray)
entries that the folio occupies, and then splits the struct folios
themselves.
+ Splitting a 1G folio into 4K folios requires splitting an entry in a
shift-18 XArray node down to entries in shift-0 nodes, which goes
beyond 2 levels of XArray nodes and is currently not supported.
This work-in-progress series at [1] shows the context of how these
patches for XArray entry splitting will be used.
[1] https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring
This patch series extends xas_split_alloc() to allocate enough nodes
for splitting an XArray entry beyond 2 levels, and extends xas_split()
to use the allocated nodes in such a split.
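To illustrate the intended calling convention, here is a minimal sketch of
splitting a single high-order entry at `index` down to order 0. The helper
name, locking and error handling below are illustrative only and are not
taken from the guest_memfd series:

static int split_to_order_0(struct xarray *xa, unsigned long index,
                            void *entry, unsigned int old_order)
{
        XA_STATE_ORDER(xas, xa, index, 0);      /* target order 0 */

        /* Preallocate every node the split will need (may be many). */
        xas_split_alloc(&xas, entry, old_order, GFP_KERNEL);
        if (xas_error(&xas))
                return xas_error(&xas);

        xas_lock(&xas);
        xas_split(&xas, entry, old_order);
        xas_unlock(&xas);

        return 0;
}

(In practice the caller would afterwards replace the stored entries, e.g.
with the split folios; that part is outside the XArray API.)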
Merging of XArray entries can be performed with xa_store_order() at
the original order, and hence no change to the XArray library is
required.
xas_destroy() cleans up any allocated and unused nodes in struct
xa_state, and so no further changes are necessary there.
Please let me know
+ If this extension is welcome
+ Your thoughts on the approach: is it too many nodes to allocate at
once? Would a recursive implementation be preferred?
+ If there are any bugs, particularly around how xas_split() interacts
with LRU
Thank you!
Ackerley Tng (4):
XArray: Initialize nodes while splitting instead of while allocating
XArray: Update xas_split_alloc() to allocate enough nodes to split
large entries
XArray: Support splitting for arbitrarily large entries
XArray: test: Increase split order test range in check_split()
lib/test_xarray.c | 6 +-
lib/xarray.c | 210 ++++++++++++++++++++++++++++++++++------------
2 files changed, 162 insertions(+), 54 deletions(-)
--
2.52.0.rc1.455.g30608eb744-goog
* [RFC PATCH 1/4] XArray: Initialize nodes while splitting instead of while allocating
2025-11-17 22:46 [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries Ackerley Tng
@ 2025-11-17 22:46 ` Ackerley Tng
2025-11-17 22:46 ` [RFC PATCH 2/4] XArray: Update xas_split_alloc() to allocate enough nodes to split large entries Ackerley Tng
` (4 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Ackerley Tng @ 2025-11-17 22:46 UTC (permalink / raw)
To: willy, akpm, linux-fsdevel, linux-mm, linux-kernel
Cc: david, michael.roth, vannapurve, Ackerley Tng
Currently, newly allocated nodes in xas_split_alloc() are initialized
immediately upon allocation. This occurs before the node's specific role,
shift, and offset within the splitting hierarchy are determined.
Move the call to __xas_init_node_for_split() from xas_split_alloc() to
xas_split(). This removes any dependence on "entry" during node allocation,
in preparation for extending the xas_split* functions to support multi-level
splits, where intermediate nodes need not have their slots initialized.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
After moving the call to __xas_init_node_for_split() from
xas_split_alloc() to xas_split(), void *entry is unused and no longer
needs to be passed to xas_split_alloc().
I would like to get feedback on moving initialization into xas_split()
before making a full refactoring to remove void *entry as a parameter
to xas_split_alloc().
lib/xarray.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/xarray.c b/lib/xarray.c
index edb877c2d51c..636edcf014f1 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1060,7 +1060,6 @@ void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order,
if (!node)
goto nomem;
- __xas_init_node_for_split(xas, node, entry);
RCU_INIT_POINTER(node->parent, xas->xa_alloc);
xas->xa_alloc = node;
} while (sibs-- > 0);
@@ -1103,6 +1102,7 @@ void xas_split(struct xa_state *xas, void *entry, unsigned int order)
struct xa_node *child = xas->xa_alloc;
xas->xa_alloc = rcu_dereference_raw(child->parent);
+ __xas_init_node_for_split(xas, child, entry);
child->shift = node->shift - XA_CHUNK_SHIFT;
child->offset = offset;
child->count = XA_CHUNK_SIZE;
--
2.52.0.rc1.455.g30608eb744-goog
* [RFC PATCH 2/4] XArray: Update xas_split_alloc() to allocate enough nodes to split large entries
2025-11-17 22:46 [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries Ackerley Tng
2025-11-17 22:46 ` [RFC PATCH 1/4] XArray: Initialize nodes while splitting instead of while allocating Ackerley Tng
@ 2025-11-17 22:46 ` Ackerley Tng
2025-11-17 22:47 ` [RFC PATCH 3/4] XArray: Support splitting for arbitrarily " Ackerley Tng
` (3 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Ackerley Tng @ 2025-11-17 22:46 UTC (permalink / raw)
To: willy, akpm, linux-fsdevel, linux-mm, linux-kernel
Cc: david, michael.roth, vannapurve, Ackerley Tng
The xas_split_alloc() function was previously limited in its ability to
handle splits of large entries, specifically those requiring more than one
new level of nodes below the entry being split. It contained a WARN_ON for
such cases and only allocated nodes for a single level of the tree.
Introduce a new helper function, __xas_alloc_nodes(), to centralize the
node allocation logic.
Update xas_split_alloc() to determine the total number of nodes required
across all new levels, then use __xas_alloc_nodes() to allocate them.
This change removes the previous limitation and allows xas_split_alloc() to
allocate enough nodes to support splitting for arbitrarily large entries.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
lib/xarray.c | 52 ++++++++++++++++++++++++++++++++++++----------------
1 file changed, 36 insertions(+), 16 deletions(-)
diff --git a/lib/xarray.c b/lib/xarray.c
index 636edcf014f1..b7c44a75bb03 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1028,6 +1028,27 @@ static void __xas_init_node_for_split(struct xa_state *xas,
}
}
+static void __xas_alloc_nodes(struct xa_state *xas, unsigned int num_nodes, gfp_t gfp)
+{
+ struct xa_node *node;
+ unsigned int i;
+
+ for (i = 0; i < num_nodes; ++i) {
+ node = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
+ if (!node)
+ goto nomem;
+
+ RCU_INIT_POINTER(node->parent, xas->xa_alloc);
+ xas->xa_alloc = node;
+ }
+
+ return;
+
+nomem:
+ xas_destroy(xas);
+ xas_set_err(xas, -ENOMEM);
+}
+
/**
* xas_split_alloc() - Allocate memory for splitting an entry.
* @xas: XArray operation state.
@@ -1046,28 +1067,27 @@ void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order,
gfp_t gfp)
{
unsigned int sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1;
+ unsigned int shift = order - (order % XA_CHUNK_SHIFT);
+ unsigned int level_nodes;
+ unsigned int nodes = 0;
- /* XXX: no support for splitting really large entries yet */
- if (WARN_ON(xas->xa_shift + 2 * XA_CHUNK_SHIFT <= order))
- goto nomem;
- if (xas->xa_shift + XA_CHUNK_SHIFT > order)
+ if (shift <= xas->xa_shift)
return;
- do {
- struct xa_node *node;
+ shift -= XA_CHUNK_SHIFT;
- node = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
- if (!node)
- goto nomem;
+ level_nodes = sibs + 1;
+ for (;;) {
+ nodes += level_nodes;
- RCU_INIT_POINTER(node->parent, xas->xa_alloc);
- xas->xa_alloc = node;
- } while (sibs-- > 0);
+ if (shift == xas->xa_shift)
+ break;
- return;
-nomem:
- xas_destroy(xas);
- xas_set_err(xas, -ENOMEM);
+ nodes *= XA_CHUNK_SIZE;
+ shift -= XA_CHUNK_SHIFT;
+ }
+
+ __xas_alloc_nodes(xas, nodes, gfp);
}
EXPORT_SYMBOL_GPL(xas_split_alloc);
--
2.52.0.rc1.455.g30608eb744-goog
* [RFC PATCH 3/4] XArray: Support splitting for arbitrarily large entries
2025-11-17 22:46 [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries Ackerley Tng
2025-11-17 22:46 ` [RFC PATCH 1/4] XArray: Initialize nodes while splitting instead of while allocating Ackerley Tng
2025-11-17 22:46 ` [RFC PATCH 2/4] XArray: Update xas_split_alloc() to allocate enough nodes to split large entries Ackerley Tng
@ 2025-11-17 22:47 ` Ackerley Tng
2025-11-17 22:47 ` [RFC PATCH 4/4] XArray: test: Increase split order test range in check_split() Ackerley Tng
` (2 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Ackerley Tng @ 2025-11-17 22:47 UTC (permalink / raw)
To: willy, akpm, linux-fsdevel, linux-mm, linux-kernel
Cc: david, michael.roth, vannapurve, Ackerley Tng
The existing xas_split() function primarily supports splitting an entry
into multiple siblings within the current node, or creating child nodes for
one level below.
It does not handle scenarios where the requested split order requires
wiring up multiple levels of new intermediate nodes to form a deeper
subtree. This prevents xas_split() from splitting arbitrarily large
entries.
This commit extends xas_split() to build subtrees of XArray nodes and then
set these subtrees as children of the node where the split is requested.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
I had a recursive implementation which I believe was easier to understand, but I
went with an iterative implementation using a stack because I was concerned about
stack depth in the kernel. Let me know if the recursive implementation is
preferred!
Feedback is always appreciated, and I'd specifically like feedback on:
+ RCU-related handling
+ Handling of xas_update() calls
+ Use of node_set_marks() - can/should that be refactored to be node-focused,
rather than in some cases also updating the child?
+ Can/should xas_split_alloc() read entry using xas_load(), rather than rely on
void *entry passed as an argument?
lib/xarray.c | 158 +++++++++++++++++++++++++++++++++++++++------------
1 file changed, 121 insertions(+), 37 deletions(-)
diff --git a/lib/xarray.c b/lib/xarray.c
index b7c44a75bb03..6fdace2e73df 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1105,52 +1105,136 @@ EXPORT_SYMBOL_GPL(xas_split_alloc);
void xas_split(struct xa_state *xas, void *entry, unsigned int order)
{
unsigned int sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1;
- unsigned int offset, marks;
+ struct xa_node *node_to_split;
+ unsigned int offset_to_split;
+ struct xa_node *stack;
struct xa_node *node;
- void *curr = xas_load(xas);
- int values = 0;
+ unsigned int offset;
+ unsigned int marks;
+ unsigned int i;
+ void *sibling;
- node = xas->xa_node;
- if (xas_top(node))
+ xas_load(xas);
+ node_to_split = xas->xa_node;
+ if (xas_top(node_to_split))
return;
- marks = node_get_marks(node, xas->xa_offset);
+ marks = node_get_marks(node_to_split, xas->xa_offset);
- offset = xas->xa_offset + sibs;
- do {
- if (xas->xa_shift < node->shift) {
- struct xa_node *child = xas->xa_alloc;
-
- xas->xa_alloc = rcu_dereference_raw(child->parent);
- __xas_init_node_for_split(xas, child, entry);
- child->shift = node->shift - XA_CHUNK_SHIFT;
- child->offset = offset;
- child->count = XA_CHUNK_SIZE;
- child->nr_values = xa_is_value(entry) ?
- XA_CHUNK_SIZE : 0;
- RCU_INIT_POINTER(child->parent, node);
- node_set_marks(node, offset, child, xas->xa_sibs,
- marks);
- rcu_assign_pointer(node->slots[offset],
- xa_mk_node(child));
- if (xa_is_value(curr))
- values--;
- xas_update(xas, child);
+ /* Horizontal split: just fill in values in existing node. */
+ if (node_to_split->shift == xas->xa_shift) {
+ offset = xas->xa_offset;
+
+ for (i = offset; i < offset + sibs + 1; i++) {
+ if ((i & xas->xa_sibs) == 0) {
+ node_set_marks(node_to_split, i, NULL, 0, marks);
+ rcu_assign_pointer(node_to_split->slots[i], entry);
+
+ sibling = xa_mk_sibling(i);
+ } else {
+ rcu_assign_pointer(node_to_split->slots[i], sibling);
+ }
+ }
+
+ xas_update(xas, node_to_split);
+ return;
+ }
+
+ /*
+ * Vertical split: build tree bottom-up, so that on any RCU lookup (on
+ * the original tree), the tree is consistent.
+ */
+ offset_to_split = get_offset(xas->xa_index, node_to_split);
+ stack = NULL;
+ offset = 0;
+ for (;;) {
+ unsigned int next_offset;
+
+ if (stack &&
+ stack->shift == node_to_split->shift - XA_CHUNK_SHIFT &&
+ stack->offset == offset_to_split + sibs)
+ break;
+
+ if (stack && stack->offset == XA_CHUNK_SIZE - 1) {
+ unsigned int child_shift;
+
+ node = xas->xa_alloc;
+ xas->xa_alloc = rcu_dereference_raw(node->parent);
+
+ child_shift = stack->shift;
+ for (i = 0; i < XA_CHUNK_SIZE; i++) {
+ struct xa_node *child = stack;
+
+ stack = child->parent;
+
+ node_set_marks(node, child->offset, NULL, 0, marks);
+
+ RCU_INIT_POINTER(child->parent, node);
+ RCU_INIT_POINTER(node->slots[child->offset], xa_mk_node(child));
+ }
+
+ node->array = xas->xa;
+ node->count = XA_CHUNK_SIZE;
+ node->nr_values = 0;
+ node->shift = child_shift + XA_CHUNK_SHIFT;
+ node->offset = 0;
+ if (node->shift == node_to_split->shift - XA_CHUNK_SHIFT)
+ node->offset = offset_to_split;
+ if (stack && stack->shift == node->shift)
+ node->offset = stack->offset + 1;
+
+ next_offset = 0;
+
+ xas_update(xas, node);
} else {
- unsigned int canon = offset - xas->xa_sibs;
+ node = xas->xa_alloc;
+ xas->xa_alloc = rcu_dereference_raw(node->parent);
- node_set_marks(node, canon, NULL, 0, marks);
- rcu_assign_pointer(node->slots[canon], entry);
- while (offset > canon)
- rcu_assign_pointer(node->slots[offset--],
- xa_mk_sibling(canon));
- values += (xa_is_value(entry) - xa_is_value(curr)) *
- (xas->xa_sibs + 1);
+ for (i = 0; i < XA_CHUNK_SIZE; i++) {
+ if ((i & xas->xa_sibs) == 0) {
+ node_set_marks(node, i, NULL, 0, marks);
+ RCU_INIT_POINTER(node->slots[i], entry);
+
+ sibling = xa_mk_sibling(i);
+ } else {
+ RCU_INIT_POINTER(node->slots[i], sibling);
+ }
+ }
+
+ node->array = xas->xa;
+ node->count = XA_CHUNK_SIZE;
+ node->nr_values = xa_is_value(entry) ? XA_CHUNK_SIZE : 0;
+ node->shift = xas->xa_shift;
+ node->offset = offset;
+ if (node->shift == node_to_split->shift - XA_CHUNK_SHIFT)
+ node->offset = offset_to_split;
+ if (stack && stack->shift == node->shift)
+ node->offset = stack->offset + 1;
+
+ next_offset = offset + 1;
}
- } while (offset-- > xas->xa_offset);
- node->nr_values += values;
- xas_update(xas, node);
+ node->parent = stack;
+ stack = node;
+
+ offset = next_offset;
+ }
+
+ /* Combine all the new nodes on the stack into node_to_split. */
+ for (i = 0; i < sibs + 1; i++) {
+ node = stack;
+ stack = node->parent;
+
+ node_set_marks(node_to_split, node->offset, NULL, 0, marks);
+
+ rcu_assign_pointer(node->parent, node_to_split);
+ rcu_assign_pointer(node_to_split->slots[node->offset], xa_mk_node(node));
+ }
+
+ WARN_ON(stack);
+
+ node_to_split->nr_values -= xa_is_value(entry) ? sibs + 1 : 0;
+ xas_update(xas, node_to_split);
}
EXPORT_SYMBOL_GPL(xas_split);
--
2.52.0.rc1.455.g30608eb744-goog
* [RFC PATCH 4/4] XArray: test: Increase split order test range in check_split()
2025-11-17 22:46 [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries Ackerley Tng
` (2 preceding siblings ...)
2025-11-17 22:47 ` [RFC PATCH 3/4] XArray: Support splitting for arbitrarily " Ackerley Tng
@ 2025-11-17 22:47 ` Ackerley Tng
2025-11-17 23:22 ` [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries Matthew Wilcox
2025-11-18 8:46 ` [syzbot ci] " syzbot ci
5 siblings, 0 replies; 10+ messages in thread
From: Ackerley Tng @ 2025-11-17 22:47 UTC (permalink / raw)
To: willy, akpm, linux-fsdevel, linux-mm, linux-kernel
Cc: david, michael.roth, vannapurve, Ackerley Tng
Expand the range of order values for check_split_1() from 2 *
XA_CHUNK_SHIFT to 4 * XA_CHUNK_SHIFT to test splitting beyond 2 levels.
Separate the loops for check_split_1() and check_split_2() calls, since
xas_try_split() does not yet support splitting beyond 2 levels.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
lib/test_xarray.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/lib/test_xarray.c b/lib/test_xarray.c
index 42626fb4dc0e..fbdf647e4ef8 100644
--- a/lib/test_xarray.c
+++ b/lib/test_xarray.c
@@ -1905,12 +1905,16 @@ static noinline void check_split(struct xarray *xa)
XA_BUG_ON(xa, !xa_empty(xa));
- for (order = 1; order < 2 * XA_CHUNK_SHIFT; order++) {
+ for (order = 1; order < 4 * XA_CHUNK_SHIFT; order++) {
for (new_order = 0; new_order < order; new_order++) {
check_split_1(xa, 0, order, new_order);
check_split_1(xa, 1UL << order, order, new_order);
check_split_1(xa, 3UL << order, order, new_order);
+ }
+ }
+ for (order = 1; order < 2 * XA_CHUNK_SHIFT; order++) {
+ for (new_order = 0; new_order < order; new_order++) {
check_split_2(xa, 0, order, new_order);
check_split_2(xa, 1UL << order, order, new_order);
check_split_2(xa, 3UL << order, order, new_order);
--
2.52.0.rc1.455.g30608eb744-goog
* Re: [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries
2025-11-17 22:46 [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries Ackerley Tng
` (3 preceding siblings ...)
2025-11-17 22:47 ` [RFC PATCH 4/4] XArray: test: Increase split order test range in check_split() Ackerley Tng
@ 2025-11-17 23:22 ` Matthew Wilcox
2025-11-17 23:43 ` Ackerley Tng
2025-11-18 8:46 ` [syzbot ci] " syzbot ci
5 siblings, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2025-11-17 23:22 UTC (permalink / raw)
To: Ackerley Tng
Cc: akpm, linux-fsdevel, linux-mm, linux-kernel, david, michael.roth,
vannapurve
On Mon, Nov 17, 2025 at 02:46:57PM -0800, Ackerley Tng wrote:
> guest_memfd is planning to store huge pages in the filemap, and
> guest_memfd's use of huge pages involves splitting of huge pages into
> individual pages. Splitting of huge pages also involves splitting of
> the filemap entries for the pages being split.
Hm, I'm not most concerned about the number of nodes you're allocating.
I'm most concerned that, once we have memdescs, splitting a 1GB page
into 512 * 512 4kB pages is going to involve allocating about 20MB
of memory (80 bytes * 512 * 512). Is this necessary to do all at once?
* Re: [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries
2025-11-17 23:22 ` [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries Matthew Wilcox
@ 2025-11-17 23:43 ` Ackerley Tng
2025-11-18 8:51 ` David Hildenbrand (Red Hat)
2025-12-05 0:38 ` Ackerley Tng
0 siblings, 2 replies; 10+ messages in thread
From: Ackerley Tng @ 2025-11-17 23:43 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, linux-fsdevel, linux-mm, linux-kernel, david, michael.roth,
vannapurve
Matthew Wilcox <willy@infradead.org> writes:
> On Mon, Nov 17, 2025 at 02:46:57PM -0800, Ackerley Tng wrote:
>> guest_memfd is planning to store huge pages in the filemap, and
>> guest_memfd's use of huge pages involves splitting of huge pages into
>> individual pages. Splitting of huge pages also involves splitting of
>> the filemap entries for the pages being split.
>
> Hm, I'm not most concerned about the number of nodes you're allocating.
Thanks for reminding me, I left this out of the original message.
Splitting the xarray entry for a 1G folio (in a shift-18 node for
order=18 on x86), assuming XA_CHUNK_SHIFT is 6, would involve
+ shift-18 node (the original node will be reused - no new allocations)
+ shift-12 node: 1 node allocated
+ shift-6 node : 64 nodes allocated
+ shift-0 node : 64 * 64 = 4096 nodes allocated
This brings the total number of allocated nodes to 4161 nodes. struct
xa_node is 576 bytes, so that's 2396736 bytes or 2.28 MB, so splitting a
1G folio to 4K pages costs ~2.5 MB just in filemap (XArray) entry
splitting. The other large memory cost would be from undoing HVO for the
HugeTLB folio.
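As a back-of-envelope check of the node count (illustrative only, not code
from the series, and assuming the order being split is a multiple of
XA_CHUNK_SHIFT and the target order is 0):

static unsigned int nodes_for_full_split(unsigned int order)
{
        unsigned int level = 1, nodes = 0, shift;

        /* With XA_CHUNK_SHIFT == 6, each node has 64 slots. */
        for (shift = order - XA_CHUNK_SHIFT; ; shift -= XA_CHUNK_SHIFT) {
                nodes += level;         /* 1, then 64, then 4096 */
                if (shift == 0)
                        break;
                level *= XA_CHUNK_SIZE;
        }
        return nodes;                   /* 4161 for order == 18 */
}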
> I'm most concerned that, once we have memdescs, splitting a 1GB page
> into 512 * 512 4kB pages is going to involve allocating about 20MB
> of memory (80 bytes * 512 * 512).
I definitely need to catch up on memdescs. What's the best place for me
to learn/get an overview of how memdescs will describe memory/replace
struct folios?
I think there might be a better way to solve the original problem of
usage tracking with memdesc support, but this was intended to make
progress before memdescs.
> Is this necessary to do all at once?
The plan for guest_memfd was to first split from 1G to 4K, then optimize
on that by splitting in stages, from 1G to 2M as much as possible, then
to 4K only for the page ranges that the guest shared with the host.
* [syzbot ci] Re: Extend xas_split* to support splitting arbitrarily large entries
2025-11-17 22:46 [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries Ackerley Tng
` (4 preceding siblings ...)
2025-11-17 23:22 ` [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries Matthew Wilcox
@ 2025-11-18 8:46 ` syzbot ci
5 siblings, 0 replies; 10+ messages in thread
From: syzbot ci @ 2025-11-18 8:46 UTC (permalink / raw)
To: ackerleytng, akpm, david, linux-fsdevel, linux-kernel, linux-mm,
michael.roth, vannapurve, willy
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v1] Extend xas_split* to support splitting arbitrarily large entries
https://lore.kernel.org/all/20251117224701.1279139-1-ackerleytng@google.com
* [RFC PATCH 1/4] XArray: Initialize nodes while splitting instead of while allocating
* [RFC PATCH 2/4] XArray: Update xas_split_alloc() to allocate enough nodes to split large entries
* [RFC PATCH 3/4] XArray: Support splitting for arbitrarily large entries
* [RFC PATCH 4/4] XArray: test: Increase split order test range in check_split()
and found the following issue:
WARNING: kmalloc bug in bpf_prog_alloc_no_stats
Full report is available here:
https://ci.syzbot.org/series/aa74d39d-0773-4398-bb90-0a6d21365c3d
***
WARNING: kmalloc bug in bpf_prog_alloc_no_stats
tree: mm-new
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base: 41218ede767f6b218185af65ce919d0cade75f6b
arch: amd64
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config: https://ci.syzbot.org/builds/c26972f6-b81e-4d6f-bead-3d77003cf075/config
------------[ cut here ]------------
Unexpected gfp: 0x400000 (__GFP_ACCOUNT). Fixing up to gfp: 0xdc0 (GFP_KERNEL|__GFP_ZERO). Fix your code!
WARNING: CPU: 0 PID: 6465 at mm/vmalloc.c:3938 vmalloc_fix_flags+0x9c/0xe0
Modules linked in:
CPU: 0 UID: 0 PID: 6465 Comm: syz-executor Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:vmalloc_fix_flags+0x9c/0xe0
Code: 81 e6 1f 52 ee ff 89 74 24 30 81 e3 e0 ad 11 00 89 5c 24 20 90 48 c7 c7 c0 b9 76 8b 4c 89 fa 89 d9 4d 89 f0 e8 75 2b 6e ff 90 <0f> 0b 90 90 8b 44 24 20 48 c7 04 24 0e 36 e0 45 4b c7 04 2c 00 00
RSP: 0018:ffffc90005d7fb00 EFLAGS: 00010246
RAX: 6e85c22fb4362300 RBX: 0000000000000dc0 RCX: ffff888176898000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
RBP: ffffc90005d7fb98 R08: ffff888121224293 R09: 1ffff11024244852
R10: dffffc0000000000 R11: ffffed1024244853 R12: 1ffff92000baff60
R13: dffffc0000000000 R14: ffffc90005d7fb20 R15: ffffc90005d7fb30
FS: 000055555be14500(0000) GS:ffff88818eb36000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f653e85c470 CR3: 00000001139ec000 CR4: 00000000000006f0
Call Trace:
<TASK>
__vmalloc_noprof+0xf2/0x120
bpf_prog_alloc_no_stats+0x4a/0x4d0
bpf_prog_alloc+0x3c/0x1a0
bpf_prog_create_from_user+0xa7/0x440
do_seccomp+0x7b1/0xd90
__se_sys_prctl+0xc3c/0x1830
do_syscall_64+0xfa/0xfa0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f653e990b0d
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 18 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 9d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 1b 48 8b 54 24 18 64 48 2b 14 25 28 00 00 00
RSP: 002b:00007fffbd3687c0 EFLAGS: 00000246 ORIG_RAX: 000000000000009d
RAX: ffffffffffffffda RBX: 00007f653ea2cf80 RCX: 00007f653e990b0d
RDX: 00007fffbd368820 RSI: 0000000000000002 RDI: 0000000000000016
RBP: 00007fffbd368830 R08: 0000000000000006 R09: 0000000000000071
R10: 0000000000000071 R11: 0000000000000246 R12: 000000000000006d
R13: 00007fffbd368c58 R14: 00007fffbd368ed8 R15: 0000000000000000
</TASK>
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
* Re: [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries
2025-11-17 23:43 ` Ackerley Tng
@ 2025-11-18 8:51 ` David Hildenbrand (Red Hat)
2025-12-05 0:38 ` Ackerley Tng
1 sibling, 0 replies; 10+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-18 8:51 UTC (permalink / raw)
To: Ackerley Tng, Matthew Wilcox
Cc: akpm, linux-fsdevel, linux-mm, linux-kernel, michael.roth, vannapurve
On 18.11.25 00:43, Ackerley Tng wrote:
> Matthew Wilcox <willy@infradead.org> writes:
>
>> On Mon, Nov 17, 2025 at 02:46:57PM -0800, Ackerley Tng wrote:
>>> guest_memfd is planning to store huge pages in the filemap, and
>>> guest_memfd's use of huge pages involves splitting of huge pages into
>>> individual pages. Splitting of huge pages also involves splitting of
>>> the filemap entries for the pages being split.
>
>>
>> Hm, I'm not most concerned about the number of nodes you're allocating.
>
> Thanks for reminding me, I left this out of the original message.
>
> Splitting the xarray entry for a 1G folio (in a shift-18 node for
> order=18 on x86), assuming XA_CHUNK_SHIFT is 6, would involve
>
> + shift-18 node (the original node will be reused - no new allocations)
> + shift-12 node: 1 node allocated
> + shift-6 node : 64 nodes allocated
> + shift-0 node : 64 * 64 = 4096 nodes allocated
>
> This brings the total number of allocated nodes to 4161 nodes. struct
> xa_node is 576 bytes, so that's 2396736 bytes or 2.28 MB, so splitting a
> 1G folio to 4K pages costs ~2.5 MB just in filemap (XArray) entry
> splitting. The other large memory cost would be from undoing HVO for the
> HugeTLB folio.
>
>> I'm most concerned that, once we have memdescs, splitting a 1GB page
>> into 512 * 512 4kB pages is going to involve allocating about 20MB
>> of memory (80 bytes * 512 * 512).
>
> I definitely need to catch up on memdescs. What's the best place for me
> to learn/get an overview of how memdescs will describe memory/replace
> struct folios?
>
> I think there might be a better way to solve the original problem of
> usage tracking with memdesc support, but this was intended to make
> progress before memdescs.
>
>> Is this necessary to do all at once?
>
> The plan for guest_memfd was to first split from 1G to 4K, then optimize
> on that by splitting in stages, from 1G to 2M as much as possible, then
> to 4K only for the page ranges that the guest shared with the host.
Right, we also discussed the non-uniform split as an optimization in the
future.
--
Cheers
David
* Re: [RFC PATCH 0/4] Extend xas_split* to support splitting arbitrarily large entries
2025-11-17 23:43 ` Ackerley Tng
2025-11-18 8:51 ` David Hildenbrand (Red Hat)
@ 2025-12-05 0:38 ` Ackerley Tng
1 sibling, 0 replies; 10+ messages in thread
From: Ackerley Tng @ 2025-12-05 0:38 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, linux-fsdevel, linux-mm, linux-kernel, david, michael.roth,
vannapurve
Ackerley Tng <ackerleytng@google.com> writes:
> Matthew Wilcox <willy@infradead.org> writes:
>
>> On Mon, Nov 17, 2025 at 02:46:57PM -0800, Ackerley Tng wrote:
>>> guest_memfd is planning to store huge pages in the filemap, and
>>> guest_memfd's use of huge pages involves splitting of huge pages into
>>> individual pages. Splitting of huge pages also involves splitting of
>>> the filemap entries for the pages being split.
>
>>
>> Hm, I'm not most concerned about the number of nodes you're allocating.
>
> Thanks for reminding me, I left this out of the original message.
>
> Splitting the xarray entry for a 1G folio (in a shift-18 node for
> order=18 on x86), assuming XA_CHUNK_SHIFT is 6, would involve
>
> + shift-18 node (the original node will be reused - no new allocations)
> + shift-12 node: 1 node allocated
> + shift-6 node : 64 nodes allocated
> + shift-0 node : 64 * 64 = 4096 nodes allocated
>
> This brings the total number of allocated nodes to 4161 nodes. struct
> xa_node is 576 bytes, so that's 2396736 bytes or 2.28 MB, so splitting a
> 1G folio to 4K pages costs ~2.5 MB just in filemap (XArray) entry
> splitting. The other large memory cost would be from undoing HVO for the
> HugeTLB folio.
>
At the guest_memfd biweekly call this morning, we touched on this topic
again. David pointed out that the ~2MB overhead to store a 1G folio in
the filemap seems a little high.
IIUC the above is correct, so even if we put aside splitting, without
multi-index XArrays, storing a 1G folio in the filemap would incur this
number of nodes of overhead. (Hence multi-index XArrays are great :))
>> I'm most concerned that, once we have memdescs, splitting a 1GB page
>> into 512 * 512 4kB pages is going to involve allocating about 20MB
>> of memory (80 bytes * 512 * 512).
>
> I definitely need to catch up on memdescs. What's the best place for me
> to learn/get an overview of how memdescs will describe memory/replace
> struct folios?
>
> I think there might be a better way to solve the original problem of
> usage tracking with memdesc support, but this was intended to make
> progress before memdescs.
>
>> Is this necessary to do all at once?
>
> The plan for guest_memfd was to first split from 1G to 4K, then optimize
> on that by splitting in stages, from 1G to 2M as much as possible, then
> to 4K only for the page ranges that the guest shared with the host.
David asked if splitting from 1G to 2M would remove the need for this
extension patch series. On the call, I wrongly agreed - looking at the
code again, even though the existing code kind of takes input for the
target order of the split through xas, it actually still does not split
to the requested order.
I think some workarounds could be possible, but for the introduction of
guest_memfd HugeTLB with folio restructuring, taking a dependency on
non-uniform splits (splitting 1G to 511 2M folios and 512 4K folios) is
significant complexity for a single series. It is significant because in
addition to having to deal with non-uniform splits of the folios, we'd
also have to deal with non-uniform HugeTLB vmemmap optimization.
Hence I'm hoping that I could get help reviewing these changes, so that
guest_memfd HugeTLB with non-uniform splits could be handled in a later
stage as an optimization. Besides, David says generalizing this could
help unblock other things (I forgot the detail, maybe David can chime in
here) :)
Thanks!