linux-mm.kvack.org archive mirror
* [PATCH v2 0/2] Minimize xa_node allocation during xarray split
@ 2025-02-18 23:54 Zi Yan
  2025-02-18 23:54 ` [PATCH v2 1/2] mm/filemap: use xas_try_split() in __filemap_add_folio() Zi Yan
  2025-02-18 23:54 ` [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry() Zi Yan
  0 siblings, 2 replies; 19+ messages in thread
From: Zi Yan @ 2025-02-18 23:54 UTC (permalink / raw)
  To: Matthew Wilcox, linux-mm, linux-fsdevel
  Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Kairui Song,
	Miaohe Lin, linux-kernel, Zi Yan

Hi all,

When splitting a multi-index entry in the XArray from order-n to order-m,
the existing xas_split_alloc()+xas_split() approach requires
2^(n % XA_CHUNK_SHIFT) xa_node allocations. But its callers,
__filemap_add_folio() and shmem_split_large_entry(), need at most 1 xa_node.
To minimize xa_node allocations and to remove the limitation that no split
from order-12 (or above) to order-0 (or anything between 0 and 5) is
possible[1], xas_try_split() was added[2], which allocates
(n / XA_CHUNK_SHIFT - m / XA_CHUNK_SHIFT) xa_nodes. It is used for the
non-uniform folio split, but can also be used by __filemap_add_folio()
and shmem_split_large_entry().
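
As a worked example of the two counts above, here is an illustrative
userspace sketch (not kernel code), assuming XA_CHUNK_SHIFT is 6 as in the
examples used throughout this series:

#include <stdio.h>

#define XA_CHUNK_SHIFT 6

int main(void)
{
	unsigned int n = 9, m = 0;	/* split an order-9 entry to order-0 */

	/* xas_split_alloc() + xas_split(): 2^(n % XA_CHUNK_SHIFT) xa_nodes */
	unsigned int old_nodes = 1U << (n % XA_CHUNK_SHIFT);

	/* xas_try_split(): n/XA_CHUNK_SHIFT - m/XA_CHUNK_SHIFT xa_nodes */
	unsigned int new_nodes = n / XA_CHUNK_SHIFT - m / XA_CHUNK_SHIFT;

	/* prints: old: 8 xa_nodes, new: 1 xa_node(s) */
	printf("old: %u xa_nodes, new: %u xa_node(s)\n", old_nodes, new_nodes);
	return 0;
}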

This is a resend on top of the buddy-allocator-like (or non-uniform)
folio split V8[3], which is itself on top of mm-everything-2025-02-15-05-49.

xas_split_alloc() and xas_split() split an order-9 entry to order-0:

         ---------------------------------
         |   |   |   |   |   |   |   |   |
         | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
         |   |   |   |   |   |   |   |   |
         ---------------------------------
           |   |                   |   |
     -------   ---               ---   -------
     |           |     ...       |           |
     V           V               V           V
----------- -----------     ----------- -----------
| xa_node | | xa_node | ... | xa_node | | xa_node |
----------- -----------     ----------- -----------

xas_try_split() splits an order-9 entry to order-0:

   ---------------------------------
   |   |   |   |   |   |   |   |   |
   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
   |   |   |   |   |   |   |   |   |
   ---------------------------------
     |
     |
     V
-----------
| xa_node |
-----------

xas_try_split() is designed to be called iteratively with n = m + 1.
xas_try_split_min_order() is added to minimize the number of calls to
xas_try_split() by telling the caller the next minimal order to split to
instead of n - 1. Splitting order-n to order-m when m = l * XA_CHUNK_SHIFT
does not require any xa_node allocation, and splitting requires 1 xa_node
when n = l * XA_CHUNK_SHIFT and m = n - 1, so it is OK to use
xas_try_split() with n > m + 1 when no new xa_node is needed.
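
For illustration, a userspace sketch of the resulting call sequence
(assuming XA_CHUNK_SHIFT is 6; try_split_min_order() below merely mirrors
the xas_try_split_min_order() helper added in patch 1): splitting an
order-9 entry down to order-0 walks 9 -> 6 -> 5 -> 0, i.e. three
xas_try_split() calls instead of the nine needed with n = m + 1 steps.

#include <stdio.h>

#define XA_CHUNK_SHIFT 6

/* Mirrors the xas_try_split_min_order() logic from patch 1. */
static unsigned int try_split_min_order(unsigned int order)
{
	if (order % XA_CHUNK_SHIFT == 0)
		return order == 0 ? 0 : order - 1;

	return order - (order % XA_CHUNK_SHIFT);
}

int main(void)
{
	unsigned int order = 9, target = 0;	/* split order-9 down to order-0 */

	while (order > target) {
		unsigned int next = try_split_min_order(order);

		if (next < target)
			next = target;
		/* each line printed stands for one xas_try_split() call */
		printf("split order-%u -> order-%u\n", order, next);
		order = next;
	}

	return 0;	/* prints 9 -> 6, 6 -> 5, 5 -> 0 */
}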

xfstests quick group test passed on xfs and tmpfs.

Let me know your comments.


[1] https://lore.kernel.org/linux-mm/Z6YX3RznGLUD07Ao@casper.infradead.org/
[2] https://lore.kernel.org/linux-mm/20250211155034.268962-2-ziy@nvidia.com/
[3] https://lore.kernel.org/linux-mm/20250218235012.1542225-1-ziy@nvidia.com/


Zi Yan (2):
  mm/filemap: use xas_try_split() in __filemap_add_folio()
  mm/shmem: use xas_try_split() in shmem_split_large_entry()

 include/linux/xarray.h |  7 +++++++
 lib/xarray.c           | 25 +++++++++++++++++++++++
 mm/filemap.c           | 46 +++++++++++++++++-------------------------
 mm/shmem.c             | 43 +++++++++++++++------------------------
 4 files changed, 67 insertions(+), 54 deletions(-)

-- 
2.47.2




* [PATCH v2 1/2] mm/filemap: use xas_try_split() in __filemap_add_folio()
  2025-02-18 23:54 [PATCH v2 0/2] Minimize xa_node allocation during xarray split Zi Yan
@ 2025-02-18 23:54 ` Zi Yan
  2025-02-18 23:54 ` [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry() Zi Yan
  1 sibling, 0 replies; 19+ messages in thread
From: Zi Yan @ 2025-02-18 23:54 UTC (permalink / raw)
  To: Matthew Wilcox, linux-mm, linux-fsdevel
  Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Kairui Song,
	Miaohe Lin, linux-kernel, Zi Yan

During __filemap_add_folio(), a shadow entry covering n slots may be
present while a folio covering m slots, with m < n, is to be added.
Instead of splitting all n slots, only the m slots covered by the folio
need to be split, and the remaining n-m slots can retain their shadow
entries with orders ranging from m to n-1.  This method only requires

	(n/XA_CHUNK_SHIFT) - (m/XA_CHUNK_SHIFT)

new xa_nodes instead of

	2^(n % XA_CHUNK_SHIFT) * ((n/XA_CHUNK_SHIFT) - (m/XA_CHUNK_SHIFT))

new xa_nodes, compared to the original xas_split_alloc() + xas_split()
approach.  For example, to insert an order-0 folio when an order-9 shadow
entry is present (assuming XA_CHUNK_SHIFT is 6), 1 xa_node is needed
instead of 8.

xas_try_split_min_order() is introduced to reduce the number of calls to
xas_try_split() during split.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
---
 include/linux/xarray.h |  7 +++++++
 lib/xarray.c           | 25 +++++++++++++++++++++++
 mm/filemap.c           | 46 +++++++++++++++++-------------------------
 3 files changed, 51 insertions(+), 27 deletions(-)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 9eb8c7425090..6ef3d682b189 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -1557,6 +1557,7 @@ void xas_split(struct xa_state *, void *entry, unsigned int order);
 void xas_split_alloc(struct xa_state *, void *entry, unsigned int order, gfp_t);
 void xas_try_split(struct xa_state *xas, void *entry, unsigned int order,
 		gfp_t gfp);
+unsigned int xas_try_split_min_order(unsigned int order);
 #else
 static inline int xa_get_order(struct xarray *xa, unsigned long index)
 {
@@ -1583,6 +1584,12 @@ static inline void xas_try_split(struct xa_state *xas, void *entry,
 		unsigned int order, gfp_t gfp)
 {
 }
+
+static inline unsigned int xas_try_split_min_order(unsigned int order)
+{
+	return 0;
+}
+
 #endif
 
 /**
diff --git a/lib/xarray.c b/lib/xarray.c
index b9a63d7fbd58..e8dd80aa15db 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1133,6 +1133,28 @@ void xas_split(struct xa_state *xas, void *entry, unsigned int order)
 }
 EXPORT_SYMBOL_GPL(xas_split);
 
+/**
+ * xas_try_split_min_order() - Minimal split order xas_try_split() can accept
+ * @order: Current entry order.
+ *
+ * xas_try_split() can split a multi-index entry to an order smaller than
+ * @order - 1 if no new xa_node is needed. This function provides the
+ * minimal order xas_try_split() supports for the given @order.
+ *
+ * Return: the minimal order xas_try_split() supports
+ *
+ * Context: Any context.
+ *
+ */
+unsigned int xas_try_split_min_order(unsigned int order)
+{
+	if (order % XA_CHUNK_SHIFT == 0)
+		return order == 0 ? 0 : order - 1;
+
+	return order - (order % XA_CHUNK_SHIFT);
+}
+EXPORT_SYMBOL_GPL(xas_try_split_min_order);
+
 /**
  * xas_try_split() - Try to split a multi-index entry.
  * @xas: XArray operation state.
@@ -1145,6 +1167,9 @@ EXPORT_SYMBOL_GPL(xas_split);
  * be allocated, the function will use @gfp to get one. If more xa_node are
  * needed, the function gives EINVAL error.
  *
+ * NOTE: use xas_try_split_min_order() to get the next split order instead of
+ * @order - 1 if you want to minimize xas_try_split() calls.
+ *
  * Context: Any context.  The caller should hold the xa_lock.
  */
 void xas_try_split(struct xa_state *xas, void *entry, unsigned int order,
diff --git a/mm/filemap.c b/mm/filemap.c
index 2b860b59a521..c6650de837d0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -857,11 +857,10 @@ EXPORT_SYMBOL_GPL(replace_page_cache_folio);
 noinline int __filemap_add_folio(struct address_space *mapping,
 		struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
 {
-	XA_STATE(xas, &mapping->i_pages, index);
-	void *alloced_shadow = NULL;
-	int alloced_order = 0;
+	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
 	bool huge;
 	long nr;
+	unsigned int forder = folio_order(folio);
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
@@ -870,7 +869,6 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 	mapping_set_update(&xas, mapping);
 
 	VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
-	xas_set_order(&xas, index, folio_order(folio));
 	huge = folio_test_hugetlb(folio);
 	nr = folio_nr_pages(folio);
 
@@ -880,7 +878,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 	folio->index = xas.xa_index;
 
 	for (;;) {
-		int order = -1, split_order = 0;
+		int order = -1;
 		void *entry, *old = NULL;
 
 		xas_lock_irq(&xas);
@@ -898,21 +896,26 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 				order = xas_get_order(&xas);
 		}
 
-		/* entry may have changed before we re-acquire the lock */
-		if (alloced_order && (old != alloced_shadow || order != alloced_order)) {
-			xas_destroy(&xas);
-			alloced_order = 0;
-		}
-
 		if (old) {
-			if (order > 0 && order > folio_order(folio)) {
+			if (order > 0 && order > forder) {
+				unsigned int split_order = max(forder,
+						xas_try_split_min_order(order));
+
 				/* How to handle large swap entries? */
 				BUG_ON(shmem_mapping(mapping));
-				if (!alloced_order) {
-					split_order = order;
-					goto unlock;
+
+				while (order > forder) {
+					xas_set_order(&xas, index, split_order);
+					xas_try_split(&xas, old, order,
+						      GFP_NOWAIT);
+					if (xas_error(&xas))
+						goto unlock;
+					order = split_order;
+					split_order =
+						max(xas_try_split_min_order(
+							    split_order),
+						    forder);
 				}
-				xas_split(&xas, old, order);
 				xas_reset(&xas);
 			}
 			if (shadowp)
@@ -936,17 +939,6 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 unlock:
 		xas_unlock_irq(&xas);
 
-		/* split needed, alloc here and retry. */
-		if (split_order) {
-			xas_split_alloc(&xas, old, split_order, gfp);
-			if (xas_error(&xas))
-				goto error;
-			alloced_shadow = old;
-			alloced_order = split_order;
-			xas_reset(&xas);
-			continue;
-		}
-
 		if (!xas_nomem(&xas, gfp))
 			break;
 	}
-- 
2.47.2




* [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-18 23:54 [PATCH v2 0/2] Minimize xa_node allocation during xarray split Zi Yan
  2025-02-18 23:54 ` [PATCH v2 1/2] mm/filemap: use xas_try_split() in __filemap_add_folio() Zi Yan
@ 2025-02-18 23:54 ` Zi Yan
  2025-02-19 10:04   ` Baolin Wang
  1 sibling, 1 reply; 19+ messages in thread
From: Zi Yan @ 2025-02-18 23:54 UTC (permalink / raw)
  To: Matthew Wilcox, linux-mm, linux-fsdevel
  Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Kairui Song,
	Miaohe Lin, linux-kernel, Zi Yan

During shmem_split_large_entry(), a large swap entry covering n slots is
present and an order-0 folio needs to be inserted.

Instead of splitting all n slots, only the 1 slot covered by the folio
needs to be split, and the remaining n-1 slots can retain their swap
entries with orders ranging from 0 to n-1.  This method only requires
(n/XA_CHUNK_SHIFT) new xa_nodes instead of 2^(n % XA_CHUNK_SHIFT) *
(n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
xas_split_alloc() + xas_split() approach.

For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
is 6), 1 xa_node is needed instead of 8.

xas_try_split_min_order() is used to reduce the number of calls to
xas_try_split() during split.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/shmem.c | 43 ++++++++++++++++---------------------------
 1 file changed, 16 insertions(+), 27 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 671f63063fd4..b35ba250c53d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2162,14 +2162,14 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 {
 	struct address_space *mapping = inode->i_mapping;
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
-	void *alloced_shadow = NULL;
-	int alloced_order = 0, i;
+	int split_order = 0;
+	int i;
 
 	/* Convert user data gfp flags to xarray node gfp flags */
 	gfp &= GFP_RECLAIM_MASK;
 
 	for (;;) {
-		int order = -1, split_order = 0;
+		int order = -1;
 		void *old = NULL;
 
 		xas_lock_irq(&xas);
@@ -2181,20 +2181,21 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 
 		order = xas_get_order(&xas);
 
-		/* Swap entry may have changed before we re-acquire the lock */
-		if (alloced_order &&
-		    (old != alloced_shadow || order != alloced_order)) {
-			xas_destroy(&xas);
-			alloced_order = 0;
-		}
-
 		/* Try to split large swap entry in pagecache */
 		if (order > 0) {
-			if (!alloced_order) {
-				split_order = order;
-				goto unlock;
+			int cur_order = order;
+
+			split_order = xas_try_split_min_order(cur_order);
+
+			while (cur_order > 0) {
+				xas_set_order(&xas, index, split_order);
+				xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
+				if (xas_error(&xas))
+					goto unlock;
+				cur_order = split_order;
+				split_order =
+					xas_try_split_min_order(split_order);
 			}
-			xas_split(&xas, old, order);
 
 			/*
 			 * Re-set the swap entry after splitting, and the swap
@@ -2213,26 +2214,14 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 unlock:
 		xas_unlock_irq(&xas);
 
-		/* split needed, alloc here and retry. */
-		if (split_order) {
-			xas_split_alloc(&xas, old, split_order, gfp);
-			if (xas_error(&xas))
-				goto error;
-			alloced_shadow = old;
-			alloced_order = split_order;
-			xas_reset(&xas);
-			continue;
-		}
-
 		if (!xas_nomem(&xas, gfp))
 			break;
 	}
 
-error:
 	if (xas_error(&xas))
 		return xas_error(&xas);
 
-	return alloced_order;
+	return split_order;
 }
 
 /*
-- 
2.47.2




* Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-18 23:54 ` [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry() Zi Yan
@ 2025-02-19 10:04   ` Baolin Wang
  2025-02-19 16:10     ` Zi Yan
  0 siblings, 1 reply; 19+ messages in thread
From: Baolin Wang @ 2025-02-19 10:04 UTC (permalink / raw)
  To: Zi Yan, Matthew Wilcox, linux-mm, linux-fsdevel
  Cc: Andrew Morton, Hugh Dickins, Kairui Song, Miaohe Lin, linux-kernel

Hi Zi,

Sorry for the late reply due to being busy with other things:)

On 2025/2/19 07:54, Zi Yan wrote:
> During shmem_split_large_entry(), large swap entries are covering n slots
> and an order-0 folio needs to be inserted.
> 
> Instead of splitting all n slots, only the 1 slot covered by the folio
> need to be split and the remaining n-1 shadow entries can be retained with
> orders ranging from 0 to n-1.  This method only requires
> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
> xas_split_alloc() + xas_split() one.
> 
> For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
> is 6), 1 xa_node is needed instead of 8.
> 
> xas_try_split_min_order() is used to reduce the number of calls to
> xas_try_split() during split.

For shmem swapin, if we cannot swap in the whole large folio by skipping 
the swap cache, we will split the large swap entry stored in the shmem 
mapping into order-0 swap entries, rather than splitting it into other 
orders of swap entries. This is because the next time we swap in a shmem 
folio through shmem_swapin_cluster(), it will still be an order 0 folio.

Moreover, I did a quick test swapping in order-6 shmem folios; however, my
test hung, and the console was continuously filled with the following
messages. It seems there are some issues with the shmem swapin handling.
Anyway, I need more time to debug and test.

[ 1037.364644] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[ 1037.364650] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[ 1037.364652] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[ 1037.364654] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[ 1037.364656] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[ 1037.364658] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[ 1037.364659] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[ 1037.364661] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[ 1037.364663] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[ 1037.364665] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[ 1042.368539] pagefault_out_of_memory: 9268696 callbacks suppressed
.......



* Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-19 10:04   ` Baolin Wang
@ 2025-02-19 16:10     ` Zi Yan
  2025-02-20  9:07       ` Baolin Wang
  0 siblings, 1 reply; 19+ messages in thread
From: Zi Yan @ 2025-02-19 16:10 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Matthew Wilcox, linux-mm, linux-fsdevel, Andrew Morton,
	Hugh Dickins, Kairui Song, Miaohe Lin, linux-kernel

On 19 Feb 2025, at 5:04, Baolin Wang wrote:

> Hi Zi,
>
> Sorry for the late reply due to being busy with other things:)

Thank you for taking a look at the patches. :)

>
> On 2025/2/19 07:54, Zi Yan wrote:
>> During shmem_split_large_entry(), large swap entries are covering n slots
>> and an order-0 folio needs to be inserted.
>>
>> Instead of splitting all n slots, only the 1 slot covered by the folio
>> need to be split and the remaining n-1 shadow entries can be retained with
>> orders ranging from 0 to n-1.  This method only requires
>> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
>> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
>> xas_split_alloc() + xas_split() one.
>>
>> For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
>> is 6), 1 xa_node is needed instead of 8.
>>
>> xas_try_split_min_order() is used to reduce the number of calls to
>> xas_try_split() during split.
>
> For shmem swapin, if we cannot swap in the whole large folio by skipping the swap cache, we will split the large swap entry stored in the shmem mapping into order-0 swap entries, rather than splitting it into other orders of swap entries. This is because the next time we swap in a shmem folio through shmem_swapin_cluster(), it will still be an order 0 folio.

Right. But the swapin is one folio at a time, right? shmem_split_large_entry()
should split the large swap entry and give you a slot to store the order-0 folio.
For example, with an order-9 large swap entry, to swap in the first order-0 folio,
the large swap entry will become order-0, order-0, order-1, order-2, ..., order-8
after the split. Then the first order-0 swap entry can be used.
Then, when a second order-0 folio is swapped in, the second order-0 entry can be used.
When the last order-0 folio is swapped in, the order-8 entry would be split into
order-7, order-6, ..., order-1, order-0, order-0, and the last order-0 will be used.

Maybe the swapin code assumes that after shmem_split_large_entry() all swap
entries are order-0, which can lead to issues. There should be some check like:
if the swap entry order > folio order, shmem_split_large_entry() should
be used.
>
> Moreover I did a quick test with swapping in order 6 shmem folios, however, my test hung, and the console was continuously filled with the following information. It seems there are some issues with shmem swapin handling. Anyway, I need more time to debug and test.

To swap in order-6 folios, shmem_split_large_entry() does not allocate
any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see the OOM
errors below. Let me know if there is anything I can help with.

>
> [ 1037.364644] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 1037.364650] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 1037.364652] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 1037.364654] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 1037.364656] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 1037.364658] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 1037.364659] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 1037.364661] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 1037.364663] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 1037.364665] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 1042.368539] pagefault_out_of_memory: 9268696 callbacks suppressed
> .......


Best Regards,
Yan, Zi



* Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-19 16:10     ` Zi Yan
@ 2025-02-20  9:07       ` Baolin Wang
  2025-02-20  9:27         ` Baolin Wang
  0 siblings, 1 reply; 19+ messages in thread
From: Baolin Wang @ 2025-02-20  9:07 UTC (permalink / raw)
  To: Zi Yan
  Cc: Matthew Wilcox, linux-mm, linux-fsdevel, Andrew Morton,
	Hugh Dickins, Kairui Song, Miaohe Lin, linux-kernel



On 2025/2/20 00:10, Zi Yan wrote:
> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
> 
>> Hi Zi,
>>
>> Sorry for the late reply due to being busy with other things:)
> 
> Thank you for taking a look at the patches. :)
> 
>>
>> On 2025/2/19 07:54, Zi Yan wrote:
>>> During shmem_split_large_entry(), large swap entries are covering n slots
>>> and an order-0 folio needs to be inserted.
>>>
>>> Instead of splitting all n slots, only the 1 slot covered by the folio
>>> need to be split and the remaining n-1 shadow entries can be retained with
>>> orders ranging from 0 to n-1.  This method only requires
>>> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
>>> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
>>> xas_split_alloc() + xas_split() one.
>>>
>>> For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
>>> is 6), 1 xa_node is needed instead of 8.
>>>
>>> xas_try_split_min_order() is used to reduce the number of calls to
>>> xas_try_split() during split.
>>
>> For shmem swapin, if we cannot swap in the whole large folio by skipping the swap cache, we will split the large swap entry stored in the shmem mapping into order-0 swap entries, rather than splitting it into other orders of swap entries. This is because the next time we swap in a shmem folio through shmem_swapin_cluster(), it will still be an order 0 folio.
> 
> Right. But the swapin is one folio at a time, right? shmem_split_large_entry()

Yes, for now we always swap in one order-0 folio at a time from the async
swap device. However, for a sync swap device, we will skip the swapcache
and swap in the whole large folio per commit 1dd44c0af4fa, so it will not
call shmem_split_large_entry() in this case.

> should split the large swap entry and give you a slot to store the order-0 folio.
> For example, with an order-9 large swap entry, to swap in first order-0 folio,
> the large swap entry will become order-0, order-0, order-1, order-2,… order-8,
> after the split. Then the first order-0 swap entry can be used.
> Then, when a second order-0 is swapped in, the second order-0 can be used.
> When the last order-0 is swapped in, the order-8 would be split to
> order-7,order-6,…,order-1,order-0, order-0, and the last order-0 will be used.

Yes, understood. However, for sequential swapin scenarios, originally
only one split operation is needed, whereas your approach increases the
number of split operations. Of course, I understand that in
non-sequential swapin scenarios your patch will save some xarray memory.
It might be necessary to evaluate whether the increased number of split
operations has a significant impact on the performance of sequential
swapin.

> Maybe the swapin assumes after shmem_split_large_entry(), all swap entries
> are order-0, which can lead to issues. There should be some check like
> if the swap entry order > folio_order, shmem_split_large_entry() should
> be used.
>>
>> Moreover I did a quick test with swapping in order 6 shmem folios, however, my test hung, and the console was continuously filled with the following information. It seems there are some issues with shmem swapin handling. Anyway, I need more time to debug and test.
> To swap in order-6 folios, shmem_split_large_entry() does not allocate
> any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see OOM
> error below. Let me know if there is anything I can help.

I encountered some issues while testing order-4 and order-6 swapin with
your patches. I roughly reviewed the patch, and it seems that the new
swap entry stored in the shmem mapping was not correctly updated after
the split.

The following logic re-sets the swap entry after the split, and I assumed
that the large swap entry was always split to order 0 beforehand. As your
patch suggests, if a non-uniform split is used, then the logic for
resetting the swap entry needs to be changed? Please correct me if I
missed something.

/*
  * Re-set the swap entry after splitting, and the swap
  * offset of the original large entry must be continuous.
  */
for (i = 0; i < 1 << order; i++) {
	pgoff_t aligned_index = round_down(index, 1 << order);
	swp_entry_t tmp;

	tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
	__xa_store(&mapping->i_pages, aligned_index + i,
		   swp_to_radix_entry(tmp), 0);
}



* Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-20  9:07       ` Baolin Wang
@ 2025-02-20  9:27         ` Baolin Wang
  2025-02-20 13:06           ` Zi Yan
  0 siblings, 1 reply; 19+ messages in thread
From: Baolin Wang @ 2025-02-20  9:27 UTC (permalink / raw)
  To: Zi Yan
  Cc: Matthew Wilcox, linux-mm, linux-fsdevel, Andrew Morton,
	Hugh Dickins, Kairui Song, Miaohe Lin, linux-kernel



On 2025/2/20 17:07, Baolin Wang wrote:
> 
> 
> On 2025/2/20 00:10, Zi Yan wrote:
>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>
>>> Hi Zi,
>>>
>>> Sorry for the late reply due to being busy with other things:)
>>
>> Thank you for taking a look at the patches. :)
>>
>>>
>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>> During shmem_split_large_entry(), large swap entries are covering n 
>>>> slots
>>>> and an order-0 folio needs to be inserted.
>>>>
>>>> Instead of splitting all n slots, only the 1 slot covered by the folio
>>>> need to be split and the remaining n-1 shadow entries can be 
>>>> retained with
>>>> orders ranging from 0 to n-1.  This method only requires
>>>> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
>>>> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
>>>> xas_split_alloc() + xas_split() one.
>>>>
>>>> For example, to split an order-9 large swap entry (assuming 
>>>> XA_CHUNK_SHIFT
>>>> is 6), 1 xa_node is needed instead of 8.
>>>>
>>>> xas_try_split_min_order() is used to reduce the number of calls to
>>>> xas_try_split() during split.
>>>
>>> For shmem swapin, if we cannot swap in the whole large folio by 
>>> skipping the swap cache, we will split the large swap entry stored in 
>>> the shmem mapping into order-0 swap entries, rather than splitting it 
>>> into other orders of swap entries. This is because the next time we 
>>> swap in a shmem folio through shmem_swapin_cluster(), it will still 
>>> be an order 0 folio.
>>
>> Right. But the swapin is one folio at a time, right? 
>> shmem_split_large_entry()
> 
> Yes, now we always swapin an order-0 folio from the async swap device at 
> a time. However, for sync swap device, we will skip the swapcache and 
> swapin the whole large folio by commit 1dd44c0af4fa, so it will not call 
> shmem_split_large_entry() in this case.
> 
>> should split the large swap entry and give you a slot to store the 
>> order-0 folio.
>> For example, with an order-9 large swap entry, to swap in first 
>> order-0 folio,
>> the large swap entry will become order-0, order-0, order-1, order-2,… 
>> order-8,
>> after the split. Then the first order-0 swap entry can be used.
>> Then, when a second order-0 is swapped in, the second order-0 can be 
>> used.
>> When the last order-0 is swapped in, the order-8 would be split to
>> order-7,order-6,…,order-1,order-0, order-0, and the last order-0 will 
>> be used.
> 
> Yes, understood. However, for the sequential swapin scenarios, where 
> originally only one split operation is needed. However, your approach 
> increases the number of split operations. Of course, I understand that 
> in non-sequential swapin scenarios, your patch will save some xarray 
> memory. It might be necessary to evaluate whether the increased split 
> operations will have a significant impact on the performance of 
> sequential swapin?
> 
>> Maybe the swapin assumes after shmem_split_large_entry(), all swap 
>> entries
>> are order-0, which can lead to issues. There should be some check like
>> if the swap entry order > folio_order, shmem_split_large_entry() should
>> be used.
>>>
>>> Moreover I did a quick test with swapping in order 6 shmem folios, 
>>> however, my test hung, and the console was continuously filled with 
>>> the following information. It seems there are some issues with shmem 
>>> swapin handling. Anyway, I need more time to debug and test.
>> To swap in order-6 folios, shmem_split_large_entry() does not allocate
>> any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see OOM
>> error below. Let me know if there is anything I can help.
> 
> I encountered some issues while testing order 4 and order 6 swapin with 
> your patches. And I roughly reviewed the patch, and it seems that the 
> new swap entry stored in the shmem mapping was not correctly updated 
> after the split.
> 
> The following logic is to reset the swap entry after split, and I assume 
> that the large swap entry is always split to order 0 before. As your 
> patch suggests, if a non-uniform split is used, then the logic for 
> resetting the swap entry needs to be changed? Please correct me if I 
> missed something.
> 
> /*
>   * Re-set the swap entry after splitting, and the swap
>   * offset of the original large entry must be continuous.
>   */
> for (i = 0; i < 1 << order; i++) {
>      pgoff_t aligned_index = round_down(index, 1 << order);
>      swp_entry_t tmp;
> 
>      tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>      __xa_store(&mapping->i_pages, aligned_index + i,
>             swp_to_radix_entry(tmp), 0);
> }

In addition, after your patch, shmem_split_large_entry() seems to always
return 0 even though it splits a large swap entry, but we still need to
re-calculate the swap entry value after splitting; otherwise it may
return errors due to shmem_confirm_swap() validation failure.

/*
  * If the large swap entry has already been split, it is
  * necessary to recalculate the new swap entry based on
  * the old order alignment.
  */
if (split_order > 0) {
	pgoff_t offset = index - round_down(index, 1 << split_order);

	swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
}



* Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-20  9:27         ` Baolin Wang
@ 2025-02-20 13:06           ` Zi Yan
  2025-02-21  2:33             ` Zi Yan
  0 siblings, 1 reply; 19+ messages in thread
From: Zi Yan @ 2025-02-20 13:06 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, linux-fsdevel, Matthew Wilcox, Hugh Dickins,
	Kairui Song, Miaohe Lin, linux-kernel, Andrew Morton

On 20 Feb 2025, at 4:27, Baolin Wang wrote:

> On 2025/2/20 17:07, Baolin Wang wrote:
>>
>>
>> On 2025/2/20 00:10, Zi Yan wrote:
>>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>>
>>>> Hi Zi,
>>>>
>>>> Sorry for the late reply due to being busy with other things:)
>>>
>>> Thank you for taking a look at the patches. :)
>>>
>>>>
>>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>>> During shmem_split_large_entry(), large swap entries are covering n slots
>>>>> and an order-0 folio needs to be inserted.
>>>>>
>>>>> Instead of splitting all n slots, only the 1 slot covered by the folio
>>>>> need to be split and the remaining n-1 shadow entries can be retained with
>>>>> orders ranging from 0 to n-1.  This method only requires
>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
>>>>> xas_split_alloc() + xas_split() one.
>>>>>
>>>>> For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
>>>>> is 6), 1 xa_node is needed instead of 8.
>>>>>
>>>>> xas_try_split_min_order() is used to reduce the number of calls to
>>>>> xas_try_split() during split.
>>>>
>>>> For shmem swapin, if we cannot swap in the whole large folio by skipping the swap cache, we will split the large swap entry stored in the shmem mapping into order-0 swap entries, rather than splitting it into other orders of swap entries. This is because the next time we swap in a shmem folio through shmem_swapin_cluster(), it will still be an order 0 folio.
>>>
>>> Right. But the swapin is one folio at a time, right? shmem_split_large_entry()
>>
>> Yes, now we always swapin an order-0 folio from the async swap device at a time. However, for sync swap device, we will skip the swapcache and swapin the whole large folio by commit 1dd44c0af4fa, so it will not call shmem_split_large_entry() in this case.

Got it. I will check the commit.

>>
>>> should split the large swap entry and give you a slot to store the order-0 folio.
>>> For example, with an order-9 large swap entry, to swap in first order-0 folio,
>>> the large swap entry will become order-0, order-0, order-1, order-2,… order-8,
>>> after the split. Then the first order-0 swap entry can be used.
>>> Then, when a second order-0 is swapped in, the second order-0 can be used.
>>> When the last order-0 is swapped in, the order-8 would be split to
>>> order-7,order-6,…,order-1,order-0, order-0, and the last order-0 will be used.
>>
>> Yes, understood. However, for the sequential swapin scenarios, where originally only one split operation is needed. However, your approach increases the number of split operations. Of course, I understand that in non-sequential swapin scenarios, your patch will save some xarray memory. It might be necessary to evaluate whether the increased split operations will have a significant impact on the performance of sequential swapin?

Is there a shmem swapin test I can run to measure this? xas_try_split() should
perform operations similar to the existing xas_split_alloc()+xas_split().

>>
>>> Maybe the swapin assumes after shmem_split_large_entry(), all swap entries
>>> are order-0, which can lead to issues. There should be some check like
>>> if the swap entry order > folio_order, shmem_split_large_entry() should
>>> be used.
>>>>
>>>> Moreover I did a quick test with swapping in order 6 shmem folios, however, my test hung, and the console was continuously filled with the following information. It seems there are some issues with shmem swapin handling. Anyway, I need more time to debug and test.
>>> To swap in order-6 folios, shmem_split_large_entry() does not allocate
>>> any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see OOM
>>> error below. Let me know if there is anything I can help.
>>
>> I encountered some issues while testing order 4 and order 6 swapin with your patches. And I roughly reviewed the patch, and it seems that the new swap entry stored in the shmem mapping was not correctly updated after the split.
>>
>> The following logic is to reset the swap entry after split, and I assume that the large swap entry is always split to order 0 before. As your patch suggests, if a non-uniform split is used, then the logic for resetting the swap entry needs to be changed? Please correct me if I missed something.
>>
>> /*
>>   * Re-set the swap entry after splitting, and the swap
>>   * offset of the original large entry must be continuous.
>>   */
>> for (i = 0; i < 1 << order; i++) {
>>      pgoff_t aligned_index = round_down(index, 1 << order);
>>      swp_entry_t tmp;
>>
>>      tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>      __xa_store(&mapping->i_pages, aligned_index + i,
>>             swp_to_radix_entry(tmp), 0);
>> }

Right. I will need to adjust swp_entry_t. Thanks for pointing this out.

>
> In addition, after your patch, the shmem_split_large_entry() seems always return 0 even though it splits a large swap entry, but we still need re-calculate the swap entry value after splitting, otherwise it may return errors due to shmem_confirm_swap() validation failure.
>
> /*
>  * If the large swap entry has already been split, it is
>  * necessary to recalculate the new swap entry based on
>  * the old order alignment.
>  */
>  if (split_order > 0) {
> 	pgoff_t offset = index - round_down(index, 1 << split_order);
>
> 	swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> }

Got it. I will fix it.

BTW, do you mind sharing your swapin tests so that I can test my new version
properly?

Thanks.

Best Regards,
Yan, Zi



* Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-20 13:06           ` Zi Yan
@ 2025-02-21  2:33             ` Zi Yan
  2025-02-21  2:38               ` Zi Yan
  0 siblings, 1 reply; 19+ messages in thread
From: Zi Yan @ 2025-02-21  2:33 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, linux-fsdevel, Matthew Wilcox, Hugh Dickins,
	Kairui Song, Miaohe Lin, linux-kernel, Andrew Morton

On 20 Feb 2025, at 8:06, Zi Yan wrote:

> On 20 Feb 2025, at 4:27, Baolin Wang wrote:
>
>> On 2025/2/20 17:07, Baolin Wang wrote:
>>>
>>>
>>> On 2025/2/20 00:10, Zi Yan wrote:
>>>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>>>
>>>>> Hi Zi,
>>>>>
>>>>> Sorry for the late reply due to being busy with other things:)
>>>>
>>>> Thank you for taking a look at the patches. :)
>>>>
>>>>>
>>>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>>>> During shmem_split_large_entry(), large swap entries are covering n slots
>>>>>> and an order-0 folio needs to be inserted.
>>>>>>
>>>>>> Instead of splitting all n slots, only the 1 slot covered by the folio
>>>>>> need to be split and the remaining n-1 shadow entries can be retained with
>>>>>> orders ranging from 0 to n-1.  This method only requires
>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
>>>>>> xas_split_alloc() + xas_split() one.
>>>>>>
>>>>>> For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
>>>>>> is 6), 1 xa_node is needed instead of 8.
>>>>>>
>>>>>> xas_try_split_min_order() is used to reduce the number of calls to
>>>>>> xas_try_split() during split.
>>>>>
>>>>> For shmem swapin, if we cannot swap in the whole large folio by skipping the swap cache, we will split the large swap entry stored in the shmem mapping into order-0 swap entries, rather than splitting it into other orders of swap entries. This is because the next time we swap in a shmem folio through shmem_swapin_cluster(), it will still be an order 0 folio.
>>>>
>>>> Right. But the swapin is one folio at a time, right? shmem_split_large_entry()
>>>
>>> Yes, now we always swapin an order-0 folio from the async swap device at a time. However, for sync swap device, we will skip the swapcache and swapin the whole large folio by commit 1dd44c0af4fa, so it will not call shmem_split_large_entry() in this case.
>
> Got it. I will check the commit.
>
>>>
>>>> should split the large swap entry and give you a slot to store the order-0 folio.
>>>> For example, with an order-9 large swap entry, to swap in first order-0 folio,
>>>> the large swap entry will become order-0, order-0, order-1, order-2,… order-8,
>>>> after the split. Then the first order-0 swap entry can be used.
>>>> Then, when a second order-0 is swapped in, the second order-0 can be used.
>>>> When the last order-0 is swapped in, the order-8 would be split to
>>>> order-7,order-6,…,order-1,order-0, order-0, and the last order-0 will be used.
>>>
>>> Yes, understood. However, for the sequential swapin scenarios, where originally only one split operation is needed. However, your approach increases the number of split operations. Of course, I understand that in non-sequential swapin scenarios, your patch will save some xarray memory. It might be necessary to evaluate whether the increased split operations will have a significant impact on the performance of sequential swapin?
>
> Is there a shmem swapin test I can run to measure this? xas_try_split() should
> performance similar operations as existing xas_split_alloc()+xas_split().
>
>>>
>>>> Maybe the swapin assumes after shmem_split_large_entry(), all swap entries
>>>> are order-0, which can lead to issues. There should be some check like
>>>> if the swap entry order > folio_order, shmem_split_large_entry() should
>>>> be used.
>>>>>
>>>>> Moreover I did a quick test with swapping in order 6 shmem folios, however, my test hung, and the console was continuously filled with the following information. It seems there are some issues with shmem swapin handling. Anyway, I need more time to debug and test.
>>>> To swap in order-6 folios, shmem_split_large_entry() does not allocate
>>>> any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see OOM
>>>> error below. Let me know if there is anything I can help.
>>>
>>> I encountered some issues while testing order 4 and order 6 swapin with your patches. And I roughly reviewed the patch, and it seems that the new swap entry stored in the shmem mapping was not correctly updated after the split.
>>>
>>> The following logic is to reset the swap entry after split, and I assume that the large swap entry is always split to order 0 before. As your patch suggests, if a non-uniform split is used, then the logic for resetting the swap entry needs to be changed? Please correct me if I missed something.
>>>
>>> /*
>>>   * Re-set the swap entry after splitting, and the swap
>>>   * offset of the original large entry must be continuous.
>>>   */
>>> for (i = 0; i < 1 << order; i++) {
>>>      pgoff_t aligned_index = round_down(index, 1 << order);
>>>      swp_entry_t tmp;
>>>
>>>      tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>      __xa_store(&mapping->i_pages, aligned_index + i,
>>>             swp_to_radix_entry(tmp), 0);
>>> }
>
> Right. I will need to adjust swp_entry_t. Thanks for pointing this out.
>
>>
>> In addition, after your patch, the shmem_split_large_entry() seems always return 0 even though it splits a large swap entry, but we still need re-calculate the swap entry value after splitting, otherwise it may return errors due to shmem_confirm_swap() validation failure.
>>
>> /*
>>  * If the large swap entry has already been split, it is
>>  * necessary to recalculate the new swap entry based on
>>  * the old order alignment.
>>  */
>>  if (split_order > 0) {
>> 	pgoff_t offset = index - round_down(index, 1 << split_order);
>>
>> 	swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>> }
>
> Got it. I will fix it.
>
> BTW, do you mind sharing your swapin tests so that I can test my new version
> properly?

The diff below adjusts the swp_entry_t and returns the right order after
shmem_split_large_entry(). Let me know if it fixes your issue.

diff --git a/mm/shmem.c b/mm/shmem.c
index b35ba250c53d..190fc36e43ec 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2192,23 +2192,23 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 				xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
 				if (xas_error(&xas))
 					goto unlock;
+
+				/*
+				 * Re-set the swap entry after splitting, and the swap
+				 * offset of the original large entry must be continuous.
+				 */
+				for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
+					pgoff_t aligned_index = round_down(index, 1 << cur_order);
+					swp_entry_t tmp;
+
+					tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
+					__xa_store(&mapping->i_pages, aligned_index + i,
+						   swp_to_radix_entry(tmp), 0);
+				}
 				cur_order = split_order;
 				split_order =
 					xas_try_split_min_order(split_order);
 			}
-
-			/*
-			 * Re-set the swap entry after splitting, and the swap
-			 * offset of the original large entry must be continuous.
-			 */
-			for (i = 0; i < 1 << order; i++) {
-				pgoff_t aligned_index = round_down(index, 1 << order);
-				swp_entry_t tmp;
-
-				tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
-				__xa_store(&mapping->i_pages, aligned_index + i,
-					   swp_to_radix_entry(tmp), 0);
-			}
 		}

 unlock:
@@ -2221,7 +2221,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 	if (xas_error(&xas))
 		return xas_error(&xas);

-	return split_order;
+	return order;
 }

 /*


Best Regards,
Yan, Zi



* Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-21  2:33             ` Zi Yan
@ 2025-02-21  2:38               ` Zi Yan
  2025-02-21  6:17                 ` Baolin Wang
  2025-02-25  9:20                 ` Baolin Wang
  0 siblings, 2 replies; 19+ messages in thread
From: Zi Yan @ 2025-02-21  2:38 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, linux-fsdevel, Matthew Wilcox, Hugh Dickins,
	Kairui Song, Miaohe Lin, linux-kernel, Andrew Morton

On 20 Feb 2025, at 21:33, Zi Yan wrote:

> On 20 Feb 2025, at 8:06, Zi Yan wrote:
>
>> On 20 Feb 2025, at 4:27, Baolin Wang wrote:
>>
>>> On 2025/2/20 17:07, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/2/20 00:10, Zi Yan wrote:
>>>>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>>>>
>>>>>> Hi Zi,
>>>>>>
>>>>>> Sorry for the late reply due to being busy with other things:)
>>>>>
>>>>> Thank you for taking a look at the patches. :)
>>>>>
>>>>>>
>>>>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>>>>> During shmem_split_large_entry(), large swap entries are covering n slots
>>>>>>> and an order-0 folio needs to be inserted.
>>>>>>>
>>>>>>> Instead of splitting all n slots, only the 1 slot covered by the folio
>>>>>>> need to be split and the remaining n-1 shadow entries can be retained with
>>>>>>> orders ranging from 0 to n-1.  This method only requires
>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
>>>>>>> xas_split_alloc() + xas_split() one.
>>>>>>>
>>>>>>> For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
>>>>>>> is 6), 1 xa_node is needed instead of 8.
>>>>>>>
>>>>>>> xas_try_split_min_order() is used to reduce the number of calls to
>>>>>>> xas_try_split() during split.
>>>>>>
>>>>>> For shmem swapin, if we cannot swap in the whole large folio by skipping the swap cache, we will split the large swap entry stored in the shmem mapping into order-0 swap entries, rather than splitting it into other orders of swap entries. This is because the next time we swap in a shmem folio through shmem_swapin_cluster(), it will still be an order 0 folio.
>>>>>
>>>>> Right. But the swapin is one folio at a time, right? shmem_split_large_entry()
>>>>
>>>> Yes, now we always swapin an order-0 folio from the async swap device at a time. However, for sync swap device, we will skip the swapcache and swapin the whole large folio by commit 1dd44c0af4fa, so it will not call shmem_split_large_entry() in this case.
>>
>> Got it. I will check the commit.
>>
>>>>
>>>>> should split the large swap entry and give you a slot to store the order-0 folio.
>>>>> For example, with an order-9 large swap entry, to swap in first order-0 folio,
>>>>> the large swap entry will become order-0, order-0, order-1, order-2,… order-8,
>>>>> after the split. Then the first order-0 swap entry can be used.
>>>>> Then, when a second order-0 is swapped in, the second order-0 can be used.
>>>>> When the last order-0 is swapped in, the order-8 would be split to
>>>>> order-7,order-6,…,order-1,order-0, order-0, and the last order-0 will be used.
>>>>
>>>> Yes, understood. However, for the sequential swapin scenarios, where originally only one split operation is needed. However, your approach increases the number of split operations. Of course, I understand that in non-sequential swapin scenarios, your patch will save some xarray memory. It might be necessary to evaluate whether the increased split operations will have a significant impact on the performance of sequential swapin?
>>
>> Is there a shmem swapin test I can run to measure this? xas_try_split() should
>> performance similar operations as existing xas_split_alloc()+xas_split().
>>
>>>>
>>>>> Maybe the swapin assumes after shmem_split_large_entry(), all swap entries
>>>>> are order-0, which can lead to issues. There should be some check like
>>>>> if the swap entry order > folio_order, shmem_split_large_entry() should
>>>>> be used.
>>>>>>
>>>>>> Moreover I did a quick test with swapping in order 6 shmem folios, however, my test hung, and the console was continuously filled with the following information. It seems there are some issues with shmem swapin handling. Anyway, I need more time to debug and test.
>>>>> To swap in order-6 folios, shmem_split_large_entry() does not allocate
>>>>> any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see OOM
>>>>> error below. Let me know if there is anything I can help.
>>>>
>>>> I encountered some issues while testing order 4 and order 6 swapin with your patches. And I roughly reviewed the patch, and it seems that the new swap entry stored in the shmem mapping was not correctly updated after the split.
>>>>
>>>> The following logic is to reset the swap entry after split, and I assume that the large swap entry is always split to order 0 before. As your patch suggests, if a non-uniform split is used, then the logic for resetting the swap entry needs to be changed? Please correct me if I missed something.
>>>>
>>>> /*
>>>>   * Re-set the swap entry after splitting, and the swap
>>>>   * offset of the original large entry must be continuous.
>>>>   */
>>>> for (i = 0; i < 1 << order; i++) {
>>>>      pgoff_t aligned_index = round_down(index, 1 << order);
>>>>      swp_entry_t tmp;
>>>>
>>>>      tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>>      __xa_store(&mapping->i_pages, aligned_index + i,
>>>>             swp_to_radix_entry(tmp), 0);
>>>> }
>>
>> Right. I will need to adjust swp_entry_t. Thanks for pointing this out.
>>
>>>
>>> In addition, after your patch, the shmem_split_large_entry() seems always return 0 even though it splits a large swap entry, but we still need re-calculate the swap entry value after splitting, otherwise it may return errors due to shmem_confirm_swap() validation failure.
>>>
>>> /*
>>>  * If the large swap entry has already been split, it is
>>>  * necessary to recalculate the new swap entry based on
>>>  * the old order alignment.
>>>  */
>>>  if (split_order > 0) {
>>> 	pgoff_t offset = index - round_down(index, 1 << split_order);
>>>
>>> 	swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>> }
>>
>> Got it. I will fix it.
>>
>> BTW, do you mind sharing your swapin tests so that I can test my new version
>> properly?
>
> The diff below adjusts the swp_entry_t and returns the right order after
> shmem_split_large_entry(). Let me know if it fixes your issue.

Fixed the compilation error. It would be great if you could share a swapin test,
so that I can test locally. Thanks.

diff --git a/mm/shmem.c b/mm/shmem.c
index b35ba250c53d..bfc4ef511391 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2162,7 +2162,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 {
 	struct address_space *mapping = inode->i_mapping;
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
-	int split_order = 0;
+	int split_order = 0, entry_order = 0;
 	int i;

 	/* Convert user data gfp flags to xarray node gfp flags */
@@ -2180,6 +2180,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 		}

 		order = xas_get_order(&xas);
+		entry_order = order;

 		/* Try to split large swap entry in pagecache */
 		if (order > 0) {
@@ -2192,23 +2193,23 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 				xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
 				if (xas_error(&xas))
 					goto unlock;
+
+				/*
+				 * Re-set the swap entry after splitting, and the swap
+				 * offset of the original large entry must be continuous.
+				 */
+				for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
+					pgoff_t aligned_index = round_down(index, 1 << cur_order);
+					swp_entry_t tmp;
+
+					tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
+					__xa_store(&mapping->i_pages, aligned_index + i,
+						   swp_to_radix_entry(tmp), 0);
+				}
 				cur_order = split_order;
 				split_order =
 					xas_try_split_min_order(split_order);
 			}
-
-			/*
-			 * Re-set the swap entry after splitting, and the swap
-			 * offset of the original large entry must be continuous.
-			 */
-			for (i = 0; i < 1 << order; i++) {
-				pgoff_t aligned_index = round_down(index, 1 << order);
-				swp_entry_t tmp;
-
-				tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
-				__xa_store(&mapping->i_pages, aligned_index + i,
-					   swp_to_radix_entry(tmp), 0);
-			}
 		}

 unlock:
@@ -2221,7 +2222,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 	if (xas_error(&xas))
 		return xas_error(&xas);

-	return split_order;
+	return entry_order;
 }

 /*


Best Regards,
Yan, Zi



* Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-21  2:38               ` Zi Yan
@ 2025-02-21  6:17                 ` Baolin Wang
  2025-02-21 23:47                   ` Zi Yan
  2025-02-25  9:20                 ` Baolin Wang
  1 sibling, 1 reply; 19+ messages in thread
From: Baolin Wang @ 2025-02-21  6:17 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-fsdevel, Matthew Wilcox, Hugh Dickins,
	Kairui Song, Miaohe Lin, linux-kernel, Andrew Morton




On 2025/2/21 10:38, Zi Yan wrote:
> On 20 Feb 2025, at 21:33, Zi Yan wrote:
> 
>> On 20 Feb 2025, at 8:06, Zi Yan wrote:
>>
>>> On 20 Feb 2025, at 4:27, Baolin Wang wrote:
>>>
>>>> On 2025/2/20 17:07, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2025/2/20 00:10, Zi Yan wrote:
>>>>>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>>>>>
>>>>>>> Hi Zi,
>>>>>>>
>>>>>>> Sorry for the late reply due to being busy with other things:)
>>>>>>
>>>>>> Thank you for taking a look at the patches. :)
>>>>>>
>>>>>>>
>>>>>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>>>>>> During shmem_split_large_entry(), large swap entries are covering n slots
>>>>>>>> and an order-0 folio needs to be inserted.
>>>>>>>>
>>>>>>>> Instead of splitting all n slots, only the 1 slot covered by the folio
>>>>>>>> need to be split and the remaining n-1 shadow entries can be retained with
>>>>>>>> orders ranging from 0 to n-1.  This method only requires
>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
>>>>>>>> xas_split_alloc() + xas_split() one.
>>>>>>>>
>>>>>>>> For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
>>>>>>>> is 6), 1 xa_node is needed instead of 8.
>>>>>>>>
>>>>>>>> xas_try_split_min_order() is used to reduce the number of calls to
>>>>>>>> xas_try_split() during split.
>>>>>>>
>>>>>>> For shmem swapin, if we cannot swap in the whole large folio by skipping the swap cache, we will split the large swap entry stored in the shmem mapping into order-0 swap entries, rather than splitting it into other orders of swap entries. This is because the next time we swap in a shmem folio through shmem_swapin_cluster(), it will still be an order 0 folio.
>>>>>>
>>>>>> Right. But the swapin is one folio at a time, right? shmem_split_large_entry()
>>>>>
>>>>> Yes, now we always swapin an order-0 folio from the async swap device at a time. However, for sync swap device, we will skip the swapcache and swapin the whole large folio by commit 1dd44c0af4fa, so it will not call shmem_split_large_entry() in this case.
>>>
>>> Got it. I will check the commit.
>>>
>>>>>
>>>>>> should split the large swap entry and give you a slot to store the order-0 folio.
>>>>>> For example, with an order-9 large swap entry, to swap in first order-0 folio,
>>>>>> the large swap entry will become order-0, order-0, order-1, order-2,… order-8,
>>>>>> after the split. Then the first order-0 swap entry can be used.
>>>>>> Then, when a second order-0 is swapped in, the second order-0 can be used.
>>>>>> When the last order-0 is swapped in, the order-8 would be split to
>>>>>> order-7,order-6,…,order-1,order-0, order-0, and the last order-0 will be used.
>>>>>
>>>>> Yes, understood. However, for the sequential swapin scenarios, where originally only one split operation is needed. However, your approach increases the number of split operations. Of course, I understand that in non-sequential swapin scenarios, your patch will save some xarray memory. It might be necessary to evaluate whether the increased split operations will have a significant impact on the performance of sequential swapin?
>>>
>>> Is there a shmem swapin test I can run to measure this? xas_try_split() should
>>> performance similar operations as existing xas_split_alloc()+xas_split().

I think a simple sequential swapin case is enough? Anyway, I can help
evaluate the performance impact of your new patch.

>>>>>> Maybe the swapin assumes after shmem_split_large_entry(), all swap entries
>>>>>> are order-0, which can lead to issues. There should be some check like
>>>>>> if the swap entry order > folio_order, shmem_split_large_entry() should
>>>>>> be used.
>>>>>>>
>>>>>>> Moreover I did a quick test with swapping in order 6 shmem folios, however, my test hung, and the console was continuously filled with the following information. It seems there are some issues with shmem swapin handling. Anyway, I need more time to debug and test.
>>>>>> To swap in order-6 folios, shmem_split_large_entry() does not allocate
>>>>>> any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see OOM
>>>>>> error below. Let me know if there is anything I can help.
>>>>>
>>>>> I encountered some issues while testing order 4 and order 6 swapin with your patches. And I roughly reviewed the patch, and it seems that the new swap entry stored in the shmem mapping was not correctly updated after the split.
>>>>>
>>>>> The following logic is to reset the swap entry after split, and I assume that the large swap entry is always split to order 0 before. As your patch suggests, if a non-uniform split is used, then the logic for resetting the swap entry needs to be changed? Please correct me if I missed something.
>>>>>
>>>>> /*
>>>>>    * Re-set the swap entry after splitting, and the swap
>>>>>    * offset of the original large entry must be continuous.
>>>>>    */
>>>>> for (i = 0; i < 1 << order; i++) {
>>>>>       pgoff_t aligned_index = round_down(index, 1 << order);
>>>>>       swp_entry_t tmp;
>>>>>
>>>>>       tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>>>       __xa_store(&mapping->i_pages, aligned_index + i,
>>>>>              swp_to_radix_entry(tmp), 0);
>>>>> }
>>>
>>> Right. I will need to adjust swp_entry_t. Thanks for pointing this out.
>>>
>>>>
>>>> In addition, after your patch, the shmem_split_large_entry() seems always return 0 even though it splits a large swap entry, but we still need re-calculate the swap entry value after splitting, otherwise it may return errors due to shmem_confirm_swap() validation failure.
>>>>
>>>> /*
>>>>   * If the large swap entry has already been split, it is
>>>>   * necessary to recalculate the new swap entry based on
>>>>   * the old order alignment.
>>>>   */
>>>>   if (split_order > 0) {
>>>> 	pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>
>>>> 	swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>> }
>>>
>>> Got it. I will fix it.
>>>
>>> BTW, do you mind sharing your swapin tests so that I can test my new version
>>> properly?
>>
>> The diff below adjusts the swp_entry_t and returns the right order after
>> shmem_split_large_entry(). Let me know if it fixes your issue.
> 
> Fixed the compilation error. It will be great if you can share a swapin test, so that
> I can test locally. Thanks.

Sure. I've attached 3 shmem swapin test cases to see if they can help
you with testing. I will also find time next week to review and test
your patch.

Additionally, you can use zram as a swap device and disable the
swapcache-skipping feature to test the split logic quickly:

diff --git a/mm/shmem.c b/mm/shmem.c
index 745f130bfb4c..7374d5c1cdde 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2274,7 +2274,7 @@ static int shmem_swapin_folio(struct inode *inode, 
pgoff_t index,
         folio = swap_cache_get_folio(swap, NULL, 0);
         if (!folio) {
                 int order = xa_get_order(&mapping->i_pages, index);
-               bool fallback_order0 = false;
+               bool fallback_order0 = true;
                 int split_order;

                 /* Or update major stats only when swapin succeeds?? */

> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index b35ba250c53d..bfc4ef511391 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2162,7 +2162,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>   {
>   	struct address_space *mapping = inode->i_mapping;
>   	XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
> -	int split_order = 0;
> +	int split_order = 0, entry_order = 0;
>   	int i;
> 
>   	/* Convert user data gfp flags to xarray node gfp flags */
> @@ -2180,6 +2180,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>   		}
> 
>   		order = xas_get_order(&xas);
> +		entry_order = order;
> 
>   		/* Try to split large swap entry in pagecache */
>   		if (order > 0) {
> @@ -2192,23 +2193,23 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>   				xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
>   				if (xas_error(&xas))
>   					goto unlock;
> +
> +				/*
> +				 * Re-set the swap entry after splitting, and the swap
> +				 * offset of the original large entry must be continuous.
> +				 */
> +				for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
> +					pgoff_t aligned_index = round_down(index, 1 << cur_order);
> +					swp_entry_t tmp;
> +
> +					tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
> +					__xa_store(&mapping->i_pages, aligned_index + i,
> +						   swp_to_radix_entry(tmp), 0);
> +				}
>   				cur_order = split_order;
>   				split_order =
>   					xas_try_split_min_order(split_order);
>   			}
> -
> -			/*
> -			 * Re-set the swap entry after splitting, and the swap
> -			 * offset of the original large entry must be continuous.
> -			 */
> -			for (i = 0; i < 1 << order; i++) {
> -				pgoff_t aligned_index = round_down(index, 1 << order);
> -				swp_entry_t tmp;
> -
> -				tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
> -				__xa_store(&mapping->i_pages, aligned_index + i,
> -					   swp_to_radix_entry(tmp), 0);
> -			}
>   		}
> 
>   unlock:
> @@ -2221,7 +2222,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>   	if (xas_error(&xas))
>   		return xas_error(&xas);
> 
> -	return split_order;
> +	return entry_order;
>   }
> 
>   /*
> 
> 
> Best Regards,
> Yan, Zi

[-- Attachment #2: shmem_aligned_swapin.c --]
[-- Type: text/plain, Size: 1500 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

/* 1G size testing */
static int SIZE = 1024UL*1024*1024;
//static int SIZE = 2UL*1024*1024;
//static int SIZE = 64*1024;

int main(void)
{
	pid_t pid;
	char *shared_memory = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (shared_memory == MAP_FAILED) {
		perror("mmap failed");
		exit(EXIT_FAILURE);
	}

	//populate the shmem
	memset(shared_memory, 0xaa, SIZE);

	/* create child */
	pid = fork();
	if (pid < 0) {
		perror("fork failed");
		exit(EXIT_FAILURE);
	} else if (pid == 0) {
		printf("Child process sees shared_memory[0x%lx] = %d\n", (unsigned long)shared_memory, *shared_memory);
		(*shared_memory)++;
		printf("Child process incremented shared_memory to %d\n", *shared_memory);
		exit(0);
	} else {
		/* parent:wait for child to complete */
		wait(NULL);
		printf("Parent process sees shared_memory = %d\n", *shared_memory);
		(*shared_memory)++;
		printf("Parent process incremented shared_memory to %d\n", *shared_memory);
	}

	/* swapout all shmem */
	if (madvise(shared_memory, SIZE, MADV_PAGEOUT)) {
		perror("madvise(MADV_PAGEOUT)");
		exit(1);
	}

	if (madvise(shared_memory, SIZE, MADV_PAGEOUT)) {
		perror("madvise(MADV_PAGEOUT)");
		exit(1);
	}

	memset(shared_memory, 0, SIZE);

	if (munmap(shared_memory, SIZE) == -1) {
		perror("munmap failed");
		exit(EXIT_FAILURE);
	}
	return 0;
}


[-- Attachment #3: shmem_nonaligned_swapin.c --]
[-- Type: text/plain, Size: 1746 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

/* 1G size testing */
static int SIZE = 1024UL*1024*1024;
//static int SIZE = 2UL*1024*1024;
//static int SIZE = 64*1024;

int main(void)
{
	pid_t pid;
	char *second_memory;
	int i;
	char *shared_memory = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (shared_memory == MAP_FAILED) {
		perror("mmap failed");
		exit(EXIT_FAILURE);
	}

	//populate the shmem
	memset(shared_memory, 0xaa, SIZE);

	/* create child */
	pid = fork();
	if (pid < 0) {
		perror("fork failed");
		exit(EXIT_FAILURE);
	} else if (pid == 0) {
		printf("Child process sees shared_memory[0x%lx] = %d\n", (unsigned long)shared_memory, *shared_memory);
		(*shared_memory)++;
		printf("Child process incremented shared_memory to %d\n", *shared_memory);
		exit(0);
	} else {
		/* parent:wait for child to complete */
		wait(NULL);
		printf("Parent process sees shared_memory = %d\n", *shared_memory);
		(*shared_memory)++;
		printf("Parent process incremented shared_memory to %d\n", *shared_memory);
	}

	/* swapout all shmem */
	if (madvise(shared_memory, SIZE, MADV_PAGEOUT)) {
		perror("madvise(MADV_PAGEOUT)");
		exit(1);
	}

	if (madvise(shared_memory, SIZE, MADV_PAGEOUT)) {
		perror("madvise(MADV_PAGEOUT)");
		exit(1);
	}

	//swap in shmem without aligned 64k
	second_memory = shared_memory + 4096 * 3;
	for (i = 0; i < SIZE; i += 4096 * 16) {
		*(second_memory + i) = (char)i;
		*(second_memory + i + 4096 * 3) = (char)i;
		*(second_memory + i + 4096 * 10) = (char)i;
	}

	if (munmap(shared_memory, SIZE) == -1) {
		perror("munmap failed");
		exit(EXIT_FAILURE);
	}
	return 0;
}


[-- Attachment #4: shmem_concurrent_swapin.c --]
[-- Type: text/plain, Size: 2233 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

/* 10G size testing */
static unsigned long SIZE = 10UL*1024*1024*1024;
//static int SIZE = 2UL*1024*1024;
//static int SIZE = 64*1024;

static void child_swapin_shmem(char *shared_memory)
{
	char *second_memory;
	unsigned long i;

	//swap in shmem without aligned 64k
	second_memory = shared_memory + 4096 * 2;
	for (i = 0; i < SIZE; i += 4096 * 16) {
		*(second_memory + i) = (char)i;
		*(second_memory + i + 4096 * 4) = (char)i;
		*(second_memory + i + 4096 * 9) = (char)i;
	}
}

int main(void)
{
	pid_t pid;
	char *second_memory;
	unsigned long i;
	char *shared_memory = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (shared_memory == MAP_FAILED) {
		perror("mmap failed");
		exit(EXIT_FAILURE);
	}

	//populate the shmem
	memset(shared_memory, 0xaa, SIZE);

	/* swapout all shmem */
	if (madvise(shared_memory, SIZE, MADV_PAGEOUT)) {
		perror("madvise(MADV_PAGEOUT)");
		exit(1);
	}

	if (madvise(shared_memory, SIZE, MADV_PAGEOUT)) {
		perror("madvise(MADV_PAGEOUT)");
		exit(1);
	}

	/* create child */
	pid = fork();
	if (pid < 0) {
		perror("fork failed");
		exit(EXIT_FAILURE);
	} else if (pid == 0) {
		printf("Child process sees shared_memory[0x%lx] = %d\n", (unsigned long)shared_memory, *shared_memory);
		(*shared_memory)++;
		child_swapin_shmem(shared_memory);
		printf("Child process incremented shared_memory to %d\n", *shared_memory);
		exit(0);
	} else {
		printf("Parent process sees shared_memory = %d\n", *shared_memory);
		(*shared_memory)++;
		printf("Parent process incremented shared_memory to %d\n", *shared_memory);
	}

	//swap in shmem without aligned 64k
	second_memory = shared_memory + 4096 * 3;
	for (i = 0; i < SIZE; i += 4096 * 16) {
		*(second_memory + i) = (char)i;
		*(second_memory + i + 4096 * 3) = (char)i;
		*(second_memory + i + 4096 * 10) = (char)i;
	}

	/* parent:wait for child to complete */
	wait(NULL);

	if (munmap(shared_memory, SIZE) == -1) {
		perror("munmap failed");
		exit(EXIT_FAILURE);
	}
	return 0;
}


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-21  6:17                 ` Baolin Wang
@ 2025-02-21 23:47                   ` Zi Yan
  2025-02-25  9:25                     ` Baolin Wang
  0 siblings, 1 reply; 19+ messages in thread
From: Zi Yan @ 2025-02-21 23:47 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, linux-fsdevel, Matthew Wilcox, Hugh Dickins,
	Kairui Song, Miaohe Lin, linux-kernel, Andrew Morton

On 21 Feb 2025, at 1:17, Baolin Wang wrote:

> On 2025/2/21 10:38, Zi Yan wrote:
>> On 20 Feb 2025, at 21:33, Zi Yan wrote:
>>
>>> On 20 Feb 2025, at 8:06, Zi Yan wrote:
>>>
>>>> On 20 Feb 2025, at 4:27, Baolin Wang wrote:
>>>>
>>>>> On 2025/2/20 17:07, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2025/2/20 00:10, Zi Yan wrote:
>>>>>>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>>>>>>
>>>>>>>> Hi Zi,
>>>>>>>>
>>>>>>>> Sorry for the late reply due to being busy with other things:)
>>>>>>>
>>>>>>> Thank you for taking a look at the patches. :)
>>>>>>>
>>>>>>>>
>>>>>>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>>>>>>> During shmem_split_large_entry(), large swap entries are covering n slots
>>>>>>>>> and an order-0 folio needs to be inserted.
>>>>>>>>>
>>>>>>>>> Instead of splitting all n slots, only the 1 slot covered by the folio
>>>>>>>>> need to be split and the remaining n-1 shadow entries can be retained with
>>>>>>>>> orders ranging from 0 to n-1.  This method only requires
>>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
>>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
>>>>>>>>> xas_split_alloc() + xas_split() one.
>>>>>>>>>
>>>>>>>>> For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
>>>>>>>>> is 6), 1 xa_node is needed instead of 8.
>>>>>>>>>
>>>>>>>>> xas_try_split_min_order() is used to reduce the number of calls to
>>>>>>>>> xas_try_split() during split.
>>>>>>>>
>>>>>>>> For shmem swapin, if we cannot swap in the whole large folio by skipping the swap cache, we will split the large swap entry stored in the shmem mapping into order-0 swap entries, rather than splitting it into other orders of swap entries. This is because the next time we swap in a shmem folio through shmem_swapin_cluster(), it will still be an order 0 folio.
>>>>>>>
>>>>>>> Right. But the swapin is one folio at a time, right? shmem_split_large_entry()
>>>>>>
>>>>>> Yes, now we always swapin an order-0 folio from the async swap device at a time. However, for sync swap device, we will skip the swapcache and swapin the whole large folio by commit 1dd44c0af4fa, so it will not call shmem_split_large_entry() in this case.
>>>>
>>>> Got it. I will check the commit.
>>>>
>>>>>>
>>>>>>> should split the large swap entry and give you a slot to store the order-0 folio.
>>>>>>> For example, with an order-9 large swap entry, to swap in first order-0 folio,
>>>>>>> the large swap entry will become order-0, order-0, order-1, order-2,… order-8,
>>>>>>> after the split. Then the first order-0 swap entry can be used.
>>>>>>> Then, when a second order-0 is swapped in, the second order-0 can be used.
>>>>>>> When the last order-0 is swapped in, the order-8 would be split to
>>>>>>> order-7,order-6,…,order-1,order-0, order-0, and the last order-0 will be used.
>>>>>>
>>>>>> Yes, understood. However, for the sequential swapin scenarios, where originally only one split operation is needed. However, your approach increases the number of split operations. Of course, I understand that in non-sequential swapin scenarios, your patch will save some xarray memory. It might be necessary to evaluate whether the increased split operations will have a significant impact on the performance of sequential swapin?
>>>>
>>>> Is there a shmem swapin test I can run to measure this? xas_try_split() should
>>>> performance similar operations as existing xas_split_alloc()+xas_split().
>
> I think a simple sequential swapin case is enough? Anyway I can help to evaluate the performance impact with your new patch.
>
>>>>>>> Maybe the swapin assumes after shmem_split_large_entry(), all swap entries
>>>>>>> are order-0, which can lead to issues. There should be some check like
>>>>>>> if the swap entry order > folio_order, shmem_split_large_entry() should
>>>>>>> be used.
>>>>>>>>
>>>>>>>> Moreover I did a quick test with swapping in order 6 shmem folios, however, my test hung, and the console was continuously filled with the following information. It seems there are some issues with shmem swapin handling. Anyway, I need more time to debug and test.
>>>>>>> To swap in order-6 folios, shmem_split_large_entry() does not allocate
>>>>>>> any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see OOM
>>>>>>> error below. Let me know if there is anything I can help.
>>>>>>
>>>>>> I encountered some issues while testing order 4 and order 6 swapin with your patches. And I roughly reviewed the patch, and it seems that the new swap entry stored in the shmem mapping was not correctly updated after the split.
>>>>>>
>>>>>> The following logic is to reset the swap entry after split, and I assume that the large swap entry is always split to order 0 before. As your patch suggests, if a non-uniform split is used, then the logic for resetting the swap entry needs to be changed? Please correct me if I missed something.
>>>>>>
>>>>>> /*
>>>>>>    * Re-set the swap entry after splitting, and the swap
>>>>>>    * offset of the original large entry must be continuous.
>>>>>>    */
>>>>>> for (i = 0; i < 1 << order; i++) {
>>>>>>       pgoff_t aligned_index = round_down(index, 1 << order);
>>>>>>       swp_entry_t tmp;
>>>>>>
>>>>>>       tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>>>>       __xa_store(&mapping->i_pages, aligned_index + i,
>>>>>>              swp_to_radix_entry(tmp), 0);
>>>>>> }
>>>>
>>>> Right. I will need to adjust swp_entry_t. Thanks for pointing this out.
>>>>
>>>>>
>>>>> In addition, after your patch, the shmem_split_large_entry() seems always return 0 even though it splits a large swap entry, but we still need re-calculate the swap entry value after splitting, otherwise it may return errors due to shmem_confirm_swap() validation failure.
>>>>>
>>>>> /*
>>>>>   * If the large swap entry has already been split, it is
>>>>>   * necessary to recalculate the new swap entry based on
>>>>>   * the old order alignment.
>>>>>   */
>>>>>   if (split_order > 0) {
>>>>> 	pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>
>>>>> 	swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>> }
>>>>
>>>> Got it. I will fix it.
>>>>
>>>> BTW, do you mind sharing your swapin tests so that I can test my new version
>>>> properly?
>>>
>>> The diff below adjusts the swp_entry_t and returns the right order after
>>> shmem_split_large_entry(). Let me know if it fixes your issue.
>>
>> Fixed the compilation error. It will be great if you can share a swapin test, so that
>> I can test locally. Thanks.
>
> Sure. I've attached 3 test shmem swapin cases to see if they can help you with testing. I will also find time next week to review and test your patch.
>
> Additionally, you can use zram as a swap device and disable the skipping swapcache feature to test the split logic quickly:
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 745f130bfb4c..7374d5c1cdde 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2274,7 +2274,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>         folio = swap_cache_get_folio(swap, NULL, 0);
>         if (!folio) {
>                 int order = xa_get_order(&mapping->i_pages, index);
> -               bool fallback_order0 = false;
> +               bool fallback_order0 = true;
>                 int split_order;
>
>                 /* Or update major stats only when swapin succeeds?? */

Thank you for the testing programs and the patch above. With zswap enabled,
I do not see any crash. I also tried to mount a tmpfs, dd a file that
is larger than total memory, and read the file out. The system crashed
with my original patch but no longer crashes with my fix.

In terms of performance, I used your shmem_aligned_swapin.c, increased
the shmem size from 1GB to 10GB, and measured the time of the memset at
the end, which swaps the memory back in and triggers the large entry
split. I see no difference with and without my patch.
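
For reference, a minimal sketch of how the final memset in
shmem_aligned_swapin.c could be timed; the exact measurement method is
not spelled out above, so treat this as an illustration only
(timed_memset() is a hypothetical helper, not part of the attached
program):

#include <stdio.h>
#include <string.h>
#include <time.h>

/* Wall-clock time of one memset over the shmem region, in milliseconds. */
void timed_memset(char *shared_memory, size_t size)
{
	struct timespec start, end;
	double ms;

	clock_gettime(CLOCK_MONOTONIC, &start);
	memset(shared_memory, 0, size);		/* swapin path under test */
	clock_gettime(CLOCK_MONOTONIC, &end);

	ms = (end.tv_sec - start.tv_sec) * 1000.0 +
	     (end.tv_nsec - start.tv_nsec) / 1e6;
	printf("memset (swapin) took %.1f ms\n", ms);
}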

I will wait for your results to confirm my fix. Really appreciate your help.

BTW, without zswap, it seems that madvise(MADV_PAGEOUT) does not write
shmem to the swapfile, and during swapin swap_cache_get_folio() always
gets a folio. I wonder what the difference is between zswap and a swapfile.


--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-21  2:38               ` Zi Yan
  2025-02-21  6:17                 ` Baolin Wang
@ 2025-02-25  9:20                 ` Baolin Wang
  2025-02-25 10:15                   ` Baolin Wang
  1 sibling, 1 reply; 19+ messages in thread
From: Baolin Wang @ 2025-02-25  9:20 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-fsdevel, Matthew Wilcox, Hugh Dickins,
	Kairui Song, Miaohe Lin, linux-kernel, Andrew Morton



On 2025/2/21 10:38, Zi Yan wrote:
> On 20 Feb 2025, at 21:33, Zi Yan wrote:
> 
>> On 20 Feb 2025, at 8:06, Zi Yan wrote:
>>
>>> On 20 Feb 2025, at 4:27, Baolin Wang wrote:
>>>
>>>> On 2025/2/20 17:07, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2025/2/20 00:10, Zi Yan wrote:
>>>>>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>>>>>
>>>>>>> Hi Zi,
>>>>>>>
>>>>>>> Sorry for the late reply due to being busy with other things:)
>>>>>>
>>>>>> Thank you for taking a look at the patches. :)
>>>>>>
>>>>>>>
>>>>>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>>>>>> During shmem_split_large_entry(), large swap entries are covering n slots
>>>>>>>> and an order-0 folio needs to be inserted.
>>>>>>>>
>>>>>>>> Instead of splitting all n slots, only the 1 slot covered by the folio
>>>>>>>> need to be split and the remaining n-1 shadow entries can be retained with
>>>>>>>> orders ranging from 0 to n-1.  This method only requires
>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
>>>>>>>> xas_split_alloc() + xas_split() one.
>>>>>>>>
>>>>>>>> For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
>>>>>>>> is 6), 1 xa_node is needed instead of 8.
>>>>>>>>
>>>>>>>> xas_try_split_min_order() is used to reduce the number of calls to
>>>>>>>> xas_try_split() during split.
>>>>>>>
>>>>>>> For shmem swapin, if we cannot swap in the whole large folio by skipping the swap cache, we will split the large swap entry stored in the shmem mapping into order-0 swap entries, rather than splitting it into other orders of swap entries. This is because the next time we swap in a shmem folio through shmem_swapin_cluster(), it will still be an order 0 folio.
>>>>>>
>>>>>> Right. But the swapin is one folio at a time, right? shmem_split_large_entry()
>>>>>
>>>>> Yes, now we always swapin an order-0 folio from the async swap device at a time. However, for sync swap device, we will skip the swapcache and swapin the whole large folio by commit 1dd44c0af4fa, so it will not call shmem_split_large_entry() in this case.
>>>
>>> Got it. I will check the commit.
>>>
>>>>>
>>>>>> should split the large swap entry and give you a slot to store the order-0 folio.
>>>>>> For example, with an order-9 large swap entry, to swap in first order-0 folio,
>>>>>> the large swap entry will become order-0, order-0, order-1, order-2,… order-8,
>>>>>> after the split. Then the first order-0 swap entry can be used.
>>>>>> Then, when a second order-0 is swapped in, the second order-0 can be used.
>>>>>> When the last order-0 is swapped in, the order-8 would be split to
>>>>>> order-7,order-6,…,order-1,order-0, order-0, and the last order-0 will be used.
>>>>>
>>>>> Yes, understood. However, for the sequential swapin scenarios, where originally only one split operation is needed. However, your approach increases the number of split operations. Of course, I understand that in non-sequential swapin scenarios, your patch will save some xarray memory. It might be necessary to evaluate whether the increased split operations will have a significant impact on the performance of sequential swapin?
>>>
>>> Is there a shmem swapin test I can run to measure this? xas_try_split() should
>>> performance similar operations as existing xas_split_alloc()+xas_split().
>>>
>>>>>
>>>>>> Maybe the swapin assumes after shmem_split_large_entry(), all swap entries
>>>>>> are order-0, which can lead to issues. There should be some check like
>>>>>> if the swap entry order > folio_order, shmem_split_large_entry() should
>>>>>> be used.
>>>>>>>
>>>>>>> Moreover I did a quick test with swapping in order 6 shmem folios, however, my test hung, and the console was continuously filled with the following information. It seems there are some issues with shmem swapin handling. Anyway, I need more time to debug and test.
>>>>>> To swap in order-6 folios, shmem_split_large_entry() does not allocate
>>>>>> any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see OOM
>>>>>> error below. Let me know if there is anything I can help.
>>>>>
>>>>> I encountered some issues while testing order 4 and order 6 swapin with your patches. And I roughly reviewed the patch, and it seems that the new swap entry stored in the shmem mapping was not correctly updated after the split.
>>>>>
>>>>> The following logic is to reset the swap entry after split, and I assume that the large swap entry is always split to order 0 before. As your patch suggests, if a non-uniform split is used, then the logic for resetting the swap entry needs to be changed? Please correct me if I missed something.
>>>>>
>>>>> /*
>>>>>    * Re-set the swap entry after splitting, and the swap
>>>>>    * offset of the original large entry must be continuous.
>>>>>    */
>>>>> for (i = 0; i < 1 << order; i++) {
>>>>>       pgoff_t aligned_index = round_down(index, 1 << order);
>>>>>       swp_entry_t tmp;
>>>>>
>>>>>       tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>>>       __xa_store(&mapping->i_pages, aligned_index + i,
>>>>>              swp_to_radix_entry(tmp), 0);
>>>>> }
>>>
>>> Right. I will need to adjust swp_entry_t. Thanks for pointing this out.
>>>
>>>>
>>>> In addition, after your patch, the shmem_split_large_entry() seems always return 0 even though it splits a large swap entry, but we still need re-calculate the swap entry value after splitting, otherwise it may return errors due to shmem_confirm_swap() validation failure.
>>>>
>>>> /*
>>>>   * If the large swap entry has already been split, it is
>>>>   * necessary to recalculate the new swap entry based on
>>>>   * the old order alignment.
>>>>   */
>>>>   if (split_order > 0) {
>>>> 	pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>
>>>> 	swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>> }
>>>
>>> Got it. I will fix it.
>>>
>>> BTW, do you mind sharing your swapin tests so that I can test my new version
>>> properly?
>>
>> The diff below adjusts the swp_entry_t and returns the right order after
>> shmem_split_large_entry(). Let me know if it fixes your issue.
> 
> Fixed the compilation error. It will be great if you can share a swapin test, so that
> I can test locally. Thanks.
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index b35ba250c53d..bfc4ef511391 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2162,7 +2162,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>   {
>   	struct address_space *mapping = inode->i_mapping;
>   	XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
> -	int split_order = 0;
> +	int split_order = 0, entry_order = 0;
>   	int i;
> 
>   	/* Convert user data gfp flags to xarray node gfp flags */
> @@ -2180,6 +2180,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>   		}
> 
>   		order = xas_get_order(&xas);
> +		entry_order = order;
> 
>   		/* Try to split large swap entry in pagecache */
>   		if (order > 0) {
> @@ -2192,23 +2193,23 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>   				xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
>   				if (xas_error(&xas))
>   					goto unlock;
> +
> +				/*
> +				 * Re-set the swap entry after splitting, and the swap
> +				 * offset of the original large entry must be continuous.
> +				 */
> +				for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
> +					pgoff_t aligned_index = round_down(index, 1 << cur_order);
> +					swp_entry_t tmp;
> +
> +					tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
> +					__xa_store(&mapping->i_pages, aligned_index + i,
> +						   swp_to_radix_entry(tmp), 0);
> +				}
>   				cur_order = split_order;
>   				split_order =
>   					xas_try_split_min_order(split_order);
>   			}

This looks incorrect to me. Suppose we are splitting an order-9 swap 
entry, in the first iteration of the loop, it splits the order-9 swap 
entry into 8 order-6 swap entries. At this point, the order-6 swap 
entries are reset, and everything seems fine.

However, in the second iteration, where an order-6 swap entry is split 
into 63 order-0 swap entries, the split operation itself is correct. But 
when resetting the order-0 swap entry, it seems incorrect. Now the 
'cur_order' = 6 and 'split_order' = 0, which means the range for the 
reset index is always between 0 and 63 (see __xa_store()).

 > +for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
 > +	pgoff_t aligned_index = round_down(index, 1 << cur_order);
 > +	swp_entry_t tmp;
 > +
 > +	tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
 > +	__xa_store(&mapping->i_pages, aligned_index + i,
 > +		swp_to_radix_entry(tmp), 0);
 > +}

However, if the index is greater than 63, it appears that it is not set 
to the correct new swap entry after splitting. Please correct me if I
missed anything.

> -
> -			/*
> -			 * Re-set the swap entry after splitting, and the swap
> -			 * offset of the original large entry must be continuous.
> -			 */
> -			for (i = 0; i < 1 << order; i++) {
> -				pgoff_t aligned_index = round_down(index, 1 << order);
> -				swp_entry_t tmp;
> -
> -				tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
> -				__xa_store(&mapping->i_pages, aligned_index + i,
> -					   swp_to_radix_entry(tmp), 0);
> -			}
>   		}
> 
>   unlock:
> @@ -2221,7 +2222,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>   	if (xas_error(&xas))
>   		return xas_error(&xas);
> 
> -	return split_order;
> +	return entry_order;
>   }
> 
>   /*
> 
> 
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-21 23:47                   ` Zi Yan
@ 2025-02-25  9:25                     ` Baolin Wang
  0 siblings, 0 replies; 19+ messages in thread
From: Baolin Wang @ 2025-02-25  9:25 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-fsdevel, Matthew Wilcox, Hugh Dickins,
	Kairui Song, Miaohe Lin, linux-kernel, Andrew Morton



On 2025/2/22 07:47, Zi Yan wrote:
> On 21 Feb 2025, at 1:17, Baolin Wang wrote:
> 
>> On 2025/2/21 10:38, Zi Yan wrote:
>>> On 20 Feb 2025, at 21:33, Zi Yan wrote:
>>>
>>>> On 20 Feb 2025, at 8:06, Zi Yan wrote:
>>>>
>>>>> On 20 Feb 2025, at 4:27, Baolin Wang wrote:
>>>>>
>>>>>> On 2025/2/20 17:07, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2025/2/20 00:10, Zi Yan wrote:
>>>>>>>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>>>>>>>
>>>>>>>>> Hi Zi,
>>>>>>>>>
>>>>>>>>> Sorry for the late reply due to being busy with other things:)
>>>>>>>>
>>>>>>>> Thank you for taking a look at the patches. :)
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>>>>>>>> During shmem_split_large_entry(), large swap entries are covering n slots
>>>>>>>>>> and an order-0 folio needs to be inserted.
>>>>>>>>>>
>>>>>>>>>> Instead of splitting all n slots, only the 1 slot covered by the folio
>>>>>>>>>> need to be split and the remaining n-1 shadow entries can be retained with
>>>>>>>>>> orders ranging from 0 to n-1.  This method only requires
>>>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
>>>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
>>>>>>>>>> xas_split_alloc() + xas_split() one.
>>>>>>>>>>
>>>>>>>>>> For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
>>>>>>>>>> is 6), 1 xa_node is needed instead of 8.
>>>>>>>>>>
>>>>>>>>>> xas_try_split_min_order() is used to reduce the number of calls to
>>>>>>>>>> xas_try_split() during split.
>>>>>>>>>
>>>>>>>>> For shmem swapin, if we cannot swap in the whole large folio by skipping the swap cache, we will split the large swap entry stored in the shmem mapping into order-0 swap entries, rather than splitting it into other orders of swap entries. This is because the next time we swap in a shmem folio through shmem_swapin_cluster(), it will still be an order 0 folio.
>>>>>>>>
>>>>>>>> Right. But the swapin is one folio at a time, right? shmem_split_large_entry()
>>>>>>>
>>>>>>> Yes, now we always swapin an order-0 folio from the async swap device at a time. However, for sync swap device, we will skip the swapcache and swapin the whole large folio by commit 1dd44c0af4fa, so it will not call shmem_split_large_entry() in this case.
>>>>>
>>>>> Got it. I will check the commit.
>>>>>
>>>>>>>
>>>>>>>> should split the large swap entry and give you a slot to store the order-0 folio.
>>>>>>>> For example, with an order-9 large swap entry, to swap in first order-0 folio,
>>>>>>>> the large swap entry will become order-0, order-0, order-1, order-2,… order-8,
>>>>>>>> after the split. Then the first order-0 swap entry can be used.
>>>>>>>> Then, when a second order-0 is swapped in, the second order-0 can be used.
>>>>>>>> When the last order-0 is swapped in, the order-8 would be split to
>>>>>>>> order-7,order-6,…,order-1,order-0, order-0, and the last order-0 will be used.
>>>>>>>
>>>>>>> Yes, understood. However, for the sequential swapin scenarios, where originally only one split operation is needed. However, your approach increases the number of split operations. Of course, I understand that in non-sequential swapin scenarios, your patch will save some xarray memory. It might be necessary to evaluate whether the increased split operations will have a significant impact on the performance of sequential swapin?
>>>>>
>>>>> Is there a shmem swapin test I can run to measure this? xas_try_split() should
>>>>> performance similar operations as existing xas_split_alloc()+xas_split().
>>
>> I think a simple sequential swapin case is enough? Anyway I can help to evaluate the performance impact with your new patch.
>>
>>>>>>>> Maybe the swapin assumes after shmem_split_large_entry(), all swap entries
>>>>>>>> are order-0, which can lead to issues. There should be some check like
>>>>>>>> if the swap entry order > folio_order, shmem_split_large_entry() should
>>>>>>>> be used.
>>>>>>>>>
>>>>>>>>> Moreover I did a quick test with swapping in order 6 shmem folios, however, my test hung, and the console was continuously filled with the following information. It seems there are some issues with shmem swapin handling. Anyway, I need more time to debug and test.
>>>>>>>> To swap in order-6 folios, shmem_split_large_entry() does not allocate
>>>>>>>> any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see OOM
>>>>>>>> error below. Let me know if there is anything I can help.
>>>>>>>
>>>>>>> I encountered some issues while testing order 4 and order 6 swapin with your patches. And I roughly reviewed the patch, and it seems that the new swap entry stored in the shmem mapping was not correctly updated after the split.
>>>>>>>
>>>>>>> The following logic is to reset the swap entry after split, and I assume that the large swap entry is always split to order 0 before. As your patch suggests, if a non-uniform split is used, then the logic for resetting the swap entry needs to be changed? Please correct me if I missed something.
>>>>>>>
>>>>>>> /*
>>>>>>>     * Re-set the swap entry after splitting, and the swap
>>>>>>>     * offset of the original large entry must be continuous.
>>>>>>>     */
>>>>>>> for (i = 0; i < 1 << order; i++) {
>>>>>>>        pgoff_t aligned_index = round_down(index, 1 << order);
>>>>>>>        swp_entry_t tmp;
>>>>>>>
>>>>>>>        tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>>>>>        __xa_store(&mapping->i_pages, aligned_index + i,
>>>>>>>               swp_to_radix_entry(tmp), 0);
>>>>>>> }
>>>>>
>>>>> Right. I will need to adjust swp_entry_t. Thanks for pointing this out.
>>>>>
>>>>>>
>>>>>> In addition, after your patch, the shmem_split_large_entry() seems always return 0 even though it splits a large swap entry, but we still need re-calculate the swap entry value after splitting, otherwise it may return errors due to shmem_confirm_swap() validation failure.
>>>>>>
>>>>>> /*
>>>>>>    * If the large swap entry has already been split, it is
>>>>>>    * necessary to recalculate the new swap entry based on
>>>>>>    * the old order alignment.
>>>>>>    */
>>>>>>    if (split_order > 0) {
>>>>>> 	pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>>
>>>>>> 	swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>>> }
>>>>>
>>>>> Got it. I will fix it.
>>>>>
>>>>> BTW, do you mind sharing your swapin tests so that I can test my new version
>>>>> properly?
>>>>
>>>> The diff below adjusts the swp_entry_t and returns the right order after
>>>> shmem_split_large_entry(). Let me know if it fixes your issue.
>>>
>>> Fixed the compilation error. It will be great if you can share a swapin test, so that
>>> I can test locally. Thanks.
>>
>> Sure. I've attached 3 test shmem swapin cases to see if they can help you with testing. I will also find time next week to review and test your patch.
>>
>> Additionally, you can use zram as a swap device and disable the skipping swapcache feature to test the split logic quickly:
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 745f130bfb4c..7374d5c1cdde 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -2274,7 +2274,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>          folio = swap_cache_get_folio(swap, NULL, 0);
>>          if (!folio) {
>>                  int order = xa_get_order(&mapping->i_pages, index);
>> -               bool fallback_order0 = false;
>> +               bool fallback_order0 = true;
>>                  int split_order;
>>
>>                  /* Or update major stats only when swapin succeeds?? */
> 
> Thank you for the testing programs and the patch above. With zswap enabled,
> I do not see any crash. I also tried to mount a tmpfs, dd a file that
> is larger than total memory, and read the file out. The system crashed
> with my original patch but no longer crashes with my fix.
> 
> In terms of performance, I used your shmem_aligned_swapin.c and increased
> the shmem size from 1GB to 10GB and measured the time of memset at the
> end, which swaps in memory and triggers split large entry. I see no difference
> between with and without my patch.

OK. Great.

> I will wait for your results to confirm my fix. Really appreciate your help.

You are welcome :) and I commented on your fixes.

> BTW, without zswap, it seems that madvise(MADV_PAGEOUT) does not write
> shmem to swapfile and during swapin, swap_cache_get_folio() always gets
> a folio. I wonder what is the difference between zswap and a swapfile.

Right, IIUC a swapfile is not a sync swap device. You can set up zram as a
swap device, which is a sync swap device:

modprobe -v zram
# setup 20G swap
echo 20G > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon /dev/zram0
swapon -s


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-25  9:20                 ` Baolin Wang
@ 2025-02-25 10:15                   ` Baolin Wang
  2025-02-25 16:41                     ` Zi Yan
  0 siblings, 1 reply; 19+ messages in thread
From: Baolin Wang @ 2025-02-25 10:15 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-fsdevel, Matthew Wilcox, Hugh Dickins,
	Kairui Song, Miaohe Lin, linux-kernel, Andrew Morton



On 2025/2/25 17:20, Baolin Wang wrote:
> 
> 
> On 2025/2/21 10:38, Zi Yan wrote:
>> On 20 Feb 2025, at 21:33, Zi Yan wrote:
>>
>>> On 20 Feb 2025, at 8:06, Zi Yan wrote:
>>>
>>>> On 20 Feb 2025, at 4:27, Baolin Wang wrote:
>>>>
>>>>> On 2025/2/20 17:07, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2025/2/20 00:10, Zi Yan wrote:
>>>>>>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>>>>>>
>>>>>>>> Hi Zi,
>>>>>>>>
>>>>>>>> Sorry for the late reply due to being busy with other things:)
>>>>>>>
>>>>>>> Thank you for taking a look at the patches. :)
>>>>>>>
>>>>>>>>
>>>>>>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>>>>>>> During shmem_split_large_entry(), large swap entries are 
>>>>>>>>> covering n slots
>>>>>>>>> and an order-0 folio needs to be inserted.
>>>>>>>>>
>>>>>>>>> Instead of splitting all n slots, only the 1 slot covered by 
>>>>>>>>> the folio
>>>>>>>>> need to be split and the remaining n-1 shadow entries can be 
>>>>>>>>> retained with
>>>>>>>>> orders ranging from 0 to n-1.  This method only requires
>>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
>>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
>>>>>>>>> xas_split_alloc() + xas_split() one.
>>>>>>>>>
>>>>>>>>> For example, to split an order-9 large swap entry (assuming 
>>>>>>>>> XA_CHUNK_SHIFT
>>>>>>>>> is 6), 1 xa_node is needed instead of 8.
>>>>>>>>>
>>>>>>>>> xas_try_split_min_order() is used to reduce the number of calls to
>>>>>>>>> xas_try_split() during split.
>>>>>>>>
>>>>>>>> For shmem swapin, if we cannot swap in the whole large folio by 
>>>>>>>> skipping the swap cache, we will split the large swap entry 
>>>>>>>> stored in the shmem mapping into order-0 swap entries, rather 
>>>>>>>> than splitting it into other orders of swap entries. This is 
>>>>>>>> because the next time we swap in a shmem folio through 
>>>>>>>> shmem_swapin_cluster(), it will still be an order 0 folio.
>>>>>>>
>>>>>>> Right. But the swapin is one folio at a time, right? 
>>>>>>> shmem_split_large_entry()
>>>>>>
>>>>>> Yes, now we always swapin an order-0 folio from the async swap 
>>>>>> device at a time. However, for sync swap device, we will skip the 
>>>>>> swapcache and swapin the whole large folio by commit 1dd44c0af4fa, 
>>>>>> so it will not call shmem_split_large_entry() in this case.
>>>>
>>>> Got it. I will check the commit.
>>>>
>>>>>>
>>>>>>> should split the large swap entry and give you a slot to store 
>>>>>>> the order-0 folio.
>>>>>>> For example, with an order-9 large swap entry, to swap in first 
>>>>>>> order-0 folio,
>>>>>>> the large swap entry will become order-0, order-0, order-1, 
>>>>>>> order-2,… order-8,
>>>>>>> after the split. Then the first order-0 swap entry can be used.
>>>>>>> Then, when a second order-0 is swapped in, the second order-0 can 
>>>>>>> be used.
>>>>>>> When the last order-0 is swapped in, the order-8 would be split to
>>>>>>> order-7,order-6,…,order-1,order-0, order-0, and the last order-0 
>>>>>>> will be used.
>>>>>>
>>>>>> Yes, understood. However, for the sequential swapin scenarios, 
>>>>>> where originally only one split operation is needed. However, your 
>>>>>> approach increases the number of split operations. Of course, I 
>>>>>> understand that in non-sequential swapin scenarios, your patch 
>>>>>> will save some xarray memory. It might be necessary to evaluate 
>>>>>> whether the increased split operations will have a significant 
>>>>>> impact on the performance of sequential swapin?
>>>>
>>>> Is there a shmem swapin test I can run to measure this? 
>>>> xas_try_split() should
>>>> performance similar operations as existing 
>>>> xas_split_alloc()+xas_split().
>>>>
>>>>>>
>>>>>>> Maybe the swapin assumes after shmem_split_large_entry(), all 
>>>>>>> swap entries
>>>>>>> are order-0, which can lead to issues. There should be some check 
>>>>>>> like
>>>>>>> if the swap entry order > folio_order, shmem_split_large_entry() 
>>>>>>> should
>>>>>>> be used.
>>>>>>>>
>>>>>>>> Moreover I did a quick test with swapping in order 6 shmem 
>>>>>>>> folios, however, my test hung, and the console was continuously 
>>>>>>>> filled with the following information. It seems there are some 
>>>>>>>> issues with shmem swapin handling. Anyway, I need more time to 
>>>>>>>> debug and test.
>>>>>>> To swap in order-6 folios, shmem_split_large_entry() does not 
>>>>>>> allocate
>>>>>>> any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see OOM
>>>>>>> error below. Let me know if there is anything I can help.
>>>>>>
>>>>>> I encountered some issues while testing order 4 and order 6 swapin 
>>>>>> with your patches. And I roughly reviewed the patch, and it seems 
>>>>>> that the new swap entry stored in the shmem mapping was not 
>>>>>> correctly updated after the split.
>>>>>>
>>>>>> The following logic is to reset the swap entry after split, and I 
>>>>>> assume that the large swap entry is always split to order 0 
>>>>>> before. As your patch suggests, if a non-uniform split is used, 
>>>>>> then the logic for resetting the swap entry needs to be changed? 
>>>>>> Please correct me if I missed something.
>>>>>>
>>>>>> /*
>>>>>>    * Re-set the swap entry after splitting, and the swap
>>>>>>    * offset of the original large entry must be continuous.
>>>>>>    */
>>>>>> for (i = 0; i < 1 << order; i++) {
>>>>>>       pgoff_t aligned_index = round_down(index, 1 << order);
>>>>>>       swp_entry_t tmp;
>>>>>>
>>>>>>       tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>>>>       __xa_store(&mapping->i_pages, aligned_index + i,
>>>>>>              swp_to_radix_entry(tmp), 0);
>>>>>> }
>>>>
>>>> Right. I will need to adjust swp_entry_t. Thanks for pointing this out.
>>>>
>>>>>
>>>>> In addition, after your patch, the shmem_split_large_entry() seems 
>>>>> always return 0 even though it splits a large swap entry, but we 
>>>>> still need re-calculate the swap entry value after splitting, 
>>>>> otherwise it may return errors due to shmem_confirm_swap() 
>>>>> validation failure.
>>>>>
>>>>> /*
>>>>>   * If the large swap entry has already been split, it is
>>>>>   * necessary to recalculate the new swap entry based on
>>>>>   * the old order alignment.
>>>>>   */
>>>>>   if (split_order > 0) {
>>>>>     pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>
>>>>>     swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>> }
>>>>
>>>> Got it. I will fix it.
>>>>
>>>> BTW, do you mind sharing your swapin tests so that I can test my new 
>>>> version
>>>> properly?
>>>
>>> The diff below adjusts the swp_entry_t and returns the right order after
>>> shmem_split_large_entry(). Let me know if it fixes your issue.
>>
>> Fixed the compilation error. It will be great if you can share a 
>> swapin test, so that
>> I can test locally. Thanks.
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index b35ba250c53d..bfc4ef511391 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -2162,7 +2162,7 @@ static int shmem_split_large_entry(struct inode 
>> *inode, pgoff_t index,
>>   {
>>       struct address_space *mapping = inode->i_mapping;
>>       XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
>> -    int split_order = 0;
>> +    int split_order = 0, entry_order = 0;
>>       int i;
>>
>>       /* Convert user data gfp flags to xarray node gfp flags */
>> @@ -2180,6 +2180,7 @@ static int shmem_split_large_entry(struct inode 
>> *inode, pgoff_t index,
>>           }
>>
>>           order = xas_get_order(&xas);
>> +        entry_order = order;
>>
>>           /* Try to split large swap entry in pagecache */
>>           if (order > 0) {
>> @@ -2192,23 +2193,23 @@ static int shmem_split_large_entry(struct 
>> inode *inode, pgoff_t index,
>>                   xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
>>                   if (xas_error(&xas))
>>                       goto unlock;
>> +
>> +                /*
>> +                 * Re-set the swap entry after splitting, and the swap
>> +                 * offset of the original large entry must be 
>> continuous.
>> +                 */
>> +                for (i = 0; i < 1 << cur_order; i += (1 << 
>> split_order)) {
>> +                    pgoff_t aligned_index = round_down(index, 1 << 
>> cur_order);
>> +                    swp_entry_t tmp;
>> +
>> +                    tmp = swp_entry(swp_type(swap), swp_offset(swap) 
>> + i);
>> +                    __xa_store(&mapping->i_pages, aligned_index + i,
>> +                           swp_to_radix_entry(tmp), 0);
>> +                }
>>                   cur_order = split_order;
>>                   split_order =
>>                       xas_try_split_min_order(split_order);
>>               }
> 
> This looks incorrect to me. Suppose we are splitting an order-9 swap 
> entry, in the first iteration of the loop, it splits the order-9 swap 
> entry into 8 order-6 swap entries. At this point, the order-6 swap 
> entries are reset, and everything seems fine.
> 
> However, in the second iteration, where an order-6 swap entry is split 
> into 63 order-0 swap entries, the split operation itself is correct. But 

typo: 64

> when resetting the order-0 swap entry, it seems incorrect. Now the 
> 'cur_order' = 6 and 'split_order' = 0, which means the range for the 
> reset index is always between 0 and 63 (see __xa_store()).

Sorry for the confusion. The 'aligned_index' will be rounded down by
'cur_order' (which is 6), so the index is correct. But the swap offset
calculated by 'swp_offset(swap) + i' looks incorrect, because the 'i' is
always between 0 and 63.
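
To make this concrete, below is a small user-space sketch (not kernel
code; the values are hypothetical and only the arithmetic matters). It
assumes XA_CHUNK_SHIFT is 6 and an order-9 large entry that starts at
index 0 with swap offset base 0, and compares the offset the quoted loop
would store with the offset that stays continuous with the original
large entry:

#include <stdio.h>

int main(void)
{
	unsigned long index = 100;	/* faulting index, inside the 2nd order-6 block */
	unsigned long entry_order = 9;	/* order of the original large swap entry */
	unsigned long cur_order = 6;	/* second iteration: splitting an order-6 entry */
	unsigned long split_order = 0;
	unsigned long base_offset = 0;	/* swp_offset() of the original large entry */

	unsigned long entry_start = index & ~((1UL << entry_order) - 1);
	unsigned long aligned_index = index & ~((1UL << cur_order) - 1);

	for (unsigned long i = 0; i < (1UL << cur_order); i += (1UL << split_order)) {
		unsigned long slot = aligned_index + i;		/* xarray index written */
		unsigned long stored = base_offset + i;		/* offset the quoted loop stores */
		unsigned long expected = base_offset + (slot - entry_start);

		if (stored != expected)
			printf("slot %lu: stored offset %lu, expected %lu\n",
			       slot, stored, expected);
	}
	return 0;
}

With index = 100, aligned_index is 64, so slots 64..127 are rewritten,
but the stored offsets run 0..63 instead of 64..127, which matches the
concern above.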

>  > +for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
>  > +    pgoff_t aligned_index = round_down(index, 1 << cur_order);
>  > +    swp_entry_t tmp;
>  > +
>  > +    tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>  > +    __xa_store(&mapping->i_pages, aligned_index + i,
>  > +        swp_to_radix_entry(tmp), 0);
>  > +}
> 
> However, if the index is greater than 63, it appears that it is not set 
> to the correct new swap entry after splitting. Please correct me if I 
> missed anything.
> 
>> -
>> -            /*
>> -             * Re-set the swap entry after splitting, and the swap
>> -             * offset of the original large entry must be continuous.
>> -             */
>> -            for (i = 0; i < 1 << order; i++) {
>> -                pgoff_t aligned_index = round_down(index, 1 << order);
>> -                swp_entry_t tmp;
>> -
>> -                tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>> -                __xa_store(&mapping->i_pages, aligned_index + i,
>> -                       swp_to_radix_entry(tmp), 0);
>> -            }
>>           }
>>
>>   unlock:
>> @@ -2221,7 +2222,7 @@ static int shmem_split_large_entry(struct inode 
>> *inode, pgoff_t index,
>>       if (xas_error(&xas))
>>           return xas_error(&xas);
>>
>> -    return split_order;
>> +    return entry_order;
>>   }
>>
>>   /*
>>
>>
>> Best Regards,
>> Yan, Zi


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-25 10:15                   ` Baolin Wang
@ 2025-02-25 16:41                     ` Zi Yan
  2025-02-25 20:32                       ` Zi Yan
  0 siblings, 1 reply; 19+ messages in thread
From: Zi Yan @ 2025-02-25 16:41 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, linux-fsdevel, Matthew Wilcox, Hugh Dickins,
	Kairui Song, Miaohe Lin, linux-kernel, Andrew Morton

On 25 Feb 2025, at 5:15, Baolin Wang wrote:

> On 2025/2/25 17:20, Baolin Wang wrote:
>>
>>
>> On 2025/2/21 10:38, Zi Yan wrote:
>>> On 20 Feb 2025, at 21:33, Zi Yan wrote:
>>>
>>>> On 20 Feb 2025, at 8:06, Zi Yan wrote:
>>>>
>>>>> On 20 Feb 2025, at 4:27, Baolin Wang wrote:
>>>>>
>>>>>> On 2025/2/20 17:07, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2025/2/20 00:10, Zi Yan wrote:
>>>>>>>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>>>>>>>
>>>>>>>>> Hi Zi,
>>>>>>>>>
>>>>>>>>> Sorry for the late reply due to being busy with other things:)
>>>>>>>>
>>>>>>>> Thank you for taking a look at the patches. :)
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>>>>>>>> During shmem_split_large_entry(), large swap entries are covering n slots
>>>>>>>>>> and an order-0 folio needs to be inserted.
>>>>>>>>>>
>>>>>>>>>> Instead of splitting all n slots, only the 1 slot covered by the folio
>>>>>>>>>> need to be split and the remaining n-1 shadow entries can be retained with
>>>>>>>>>> orders ranging from 0 to n-1.  This method only requires
>>>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
>>>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
>>>>>>>>>> xas_split_alloc() + xas_split() one.
>>>>>>>>>>
>>>>>>>>>> For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
>>>>>>>>>> is 6), 1 xa_node is needed instead of 8.
>>>>>>>>>>
>>>>>>>>>> xas_try_split_min_order() is used to reduce the number of calls to
>>>>>>>>>> xas_try_split() during split.
>>>>>>>>>
>>>>>>>>> For shmem swapin, if we cannot swap in the whole large folio by skipping the swap cache, we will split the large swap entry stored in the shmem mapping into order-0 swap entries, rather than splitting it into other orders of swap entries. This is because the next time we swap in a shmem folio through shmem_swapin_cluster(), it will still be an order 0 folio.
>>>>>>>>
>>>>>>>> Right. But the swapin is one folio at a time, right? shmem_split_large_entry()
>>>>>>>
>>>>>>> Yes, now we always swapin an order-0 folio from the async swap device at a time. However, for sync swap device, we will skip the swapcache and swapin the whole large folio by commit 1dd44c0af4fa, so it will not call shmem_split_large_entry() in this case.
>>>>>
>>>>> Got it. I will check the commit.
>>>>>
>>>>>>>
>>>>>>>> should split the large swap entry and give you a slot to store the order-0 folio.
>>>>>>>> For example, with an order-9 large swap entry, to swap in first order-0 folio,
>>>>>>>> the large swap entry will become order-0, order-0, order-1, order-2,… order-8,
>>>>>>>> after the split. Then the first order-0 swap entry can be used.
>>>>>>>> Then, when a second order-0 is swapped in, the second order-0 can be used.
>>>>>>>> When the last order-0 is swapped in, the order-8 would be split to
>>>>>>>> order-7,order-6,…,order-1,order-0, order-0, and the last order-0 will be used.
>>>>>>>
>>>>>>> Yes, understood. However, for the sequential swapin scenarios, where originally only one split operation is needed. However, your approach increases the number of split operations. Of course, I understand that in non-sequential swapin scenarios, your patch will save some xarray memory. It might be necessary to evaluate whether the increased split operations will have a significant impact on the performance of sequential swapin?
>>>>>
>>>>> Is there a shmem swapin test I can run to measure this? xas_try_split() should
>>>>> performance similar operations as existing xas_split_alloc()+xas_split().
>>>>>
>>>>>>>
>>>>>>>> Maybe the swapin assumes after shmem_split_large_entry(), all swap entries
>>>>>>>> are order-0, which can lead to issues. There should be some check like
>>>>>>>> if the swap entry order > folio_order, shmem_split_large_entry() should
>>>>>>>> be used.
>>>>>>>>>
>>>>>>>>> Moreover I did a quick test with swapping in order 6 shmem folios, however, my test hung, and the console was continuously filled with the following information. It seems there are some issues with shmem swapin handling. Anyway, I need more time to debug and test.
>>>>>>>> To swap in order-6 folios, shmem_split_large_entry() does not allocate
>>>>>>>> any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see OOM
>>>>>>>> error below. Let me know if there is anything I can help.
>>>>>>>
>>>>>>> I encountered some issues while testing order 4 and order 6 swapin with your patches. And I roughly reviewed the patch, and it seems that the new swap entry stored in the shmem mapping was not correctly updated after the split.
>>>>>>>
>>>>>>> The following logic is to reset the swap entry after split, and I assume that the large swap entry is always split to order 0 before. As your patch suggests, if a non-uniform split is used, then the logic for resetting the swap entry needs to be changed? Please correct me if I missed something.
>>>>>>>
>>>>>>> /*
>>>>>>>    * Re-set the swap entry after splitting, and the swap
>>>>>>>    * offset of the original large entry must be continuous.
>>>>>>>    */
>>>>>>> for (i = 0; i < 1 << order; i++) {
>>>>>>>       pgoff_t aligned_index = round_down(index, 1 << order);
>>>>>>>       swp_entry_t tmp;
>>>>>>>
>>>>>>>       tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>>>>>       __xa_store(&mapping->i_pages, aligned_index + i,
>>>>>>>              swp_to_radix_entry(tmp), 0);
>>>>>>> }
>>>>>
>>>>> Right. I will need to adjust swp_entry_t. Thanks for pointing this out.
>>>>>
>>>>>>
>>>>>> In addition, after your patch, the shmem_split_large_entry() seems always return 0 even though it splits a large swap entry, but we still need re-calculate the swap entry value after splitting, otherwise it may return errors due to shmem_confirm_swap() validation failure.
>>>>>>
>>>>>> /*
>>>>>>   * If the large swap entry has already been split, it is
>>>>>>   * necessary to recalculate the new swap entry based on
>>>>>>   * the old order alignment.
>>>>>>   */
>>>>>>   if (split_order > 0) {
>>>>>>     pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>>
>>>>>>     swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>>> }
>>>>>
>>>>> Got it. I will fix it.
>>>>>
>>>>> BTW, do you mind sharing your swapin tests so that I can test my new version
>>>>> properly?
>>>>
>>>> The diff below adjusts the swp_entry_t and returns the right order after
>>>> shmem_split_large_entry(). Let me know if it fixes your issue.
>>>
>>> Fixed the compilation error. It will be great if you can share a swapin test, so that
>>> I can test locally. Thanks.
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index b35ba250c53d..bfc4ef511391 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -2162,7 +2162,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>   {
>>>       struct address_space *mapping = inode->i_mapping;
>>>       XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
>>> -    int split_order = 0;
>>> +    int split_order = 0, entry_order = 0;
>>>       int i;
>>>
>>>       /* Convert user data gfp flags to xarray node gfp flags */
>>> @@ -2180,6 +2180,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>           }
>>>
>>>           order = xas_get_order(&xas);
>>> +        entry_order = order;
>>>
>>>           /* Try to split large swap entry in pagecache */
>>>           if (order > 0) {
>>> @@ -2192,23 +2193,23 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>                   xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
>>>                   if (xas_error(&xas))
>>>                       goto unlock;
>>> +
>>> +                /*
>>> +                 * Re-set the swap entry after splitting, and the swap
>>> +                 * offset of the original large entry must be continuous.
>>> +                 */
>>> +                for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
>>> +                    pgoff_t aligned_index = round_down(index, 1 << cur_order);
>>> +                    swp_entry_t tmp;
>>> +
>>> +                    tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>> +                    __xa_store(&mapping->i_pages, aligned_index + i,
>>> +                           swp_to_radix_entry(tmp), 0);
>>> +                }
>>>                   cur_order = split_order;
>>>                   split_order =
>>>                       xas_try_split_min_order(split_order);
>>>               }
>>
>> This looks incorrect to me. Suppose we are splitting an order-9 swap entry, in the first iteration of the loop, it splits the order-9 swap entry into 8 order-6 swap entries. At this point, the order-6 swap entries are reset, and everything seems fine.
>>
>> However, in the second iteration, where an order-6 swap entry is split into 63 order-0 swap entries, the split operation itself is correct. But
>
> typo: 64
>
>> when resetting the order-0 swap entry, it seems incorrect. Now the 'cur_order' = 6 and 'split_order' = 0, which means the range for the reset index is always between 0 and 63 (see __xa_store()).
>
> Sorry for confusing. The 'aligned_index' will be rounded down by 'cur_order' (which is 6), so the index is correct. But the swap offset calculated by 'swp_offset(swap) + i' looks incorrect, cause the 'i' is always between 0 and 63.

Right. I think I need to recalculate swap’s swp_offset for each iteration
by adding the difference of round_down(index, 1 << cur_order) and
round_down(index, 1 << split_order) and use the new swap in this iteration.
Thank you a lot for walking me through the details. I really appreciate it. :)
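
Concretely, picking index = 70 inside an order-9 entry that starts at
index 0 (assuming XA_CHUNK_SHIFT == 6), the split should end up storing:

	1st iteration: cur_order = 9, split_order = 6
		8 order-6 entries at indices 0, 64, ..., 448 with
		swp_offset(swap) + 0, + 64, ..., + 448
	2nd iteration: cur_order = 6, split_order = 0
		64 order-0 entries at indices 64..127 with
		swp_offset(swap) + 64 .. + 127

i.e. the second iteration needs the extra + 64 that the current loop is
missing.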

My tests did not fail probably because I was using linear access pattern
to swap in folios.

>
>>> +for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
>>> +    pgoff_t aligned_index = round_down(index, 1 << cur_order);
>>> +    swp_entry_t tmp;
>>> +
>>> +    tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>> +    __xa_store(&mapping->i_pages, aligned_index + i,
>>> +        swp_to_radix_entry(tmp), 0);
>>> +}
>>
>> However, if the index is greater than 63, it appears that it is not set to the correct new swap entry after splitting. Please corect me if I missed anyting.
>>
>>> -
>>> -            /*
>>> -             * Re-set the swap entry after splitting, and the swap
>>> -             * offset of the original large entry must be continuous.
>>> -             */
>>> -            for (i = 0; i < 1 << order; i++) {
>>> -                pgoff_t aligned_index = round_down(index, 1 << order);
>>> -                swp_entry_t tmp;
>>> -
>>> -                tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>> -                __xa_store(&mapping->i_pages, aligned_index + i,
>>> -                       swp_to_radix_entry(tmp), 0);
>>> -            }
>>>           }
>>>
>>>   unlock:
>>> @@ -2221,7 +2222,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>       if (xas_error(&xas))
>>>           return xas_error(&xas);
>>>
>>> -    return split_order;
>>> +    return entry_order;
>>>   }
>>>
>>>   /*
>>>
>>>
>>> Best Regards,
>>> Yan, Zi


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-25 16:41                     ` Zi Yan
@ 2025-02-25 20:32                       ` Zi Yan
  2025-02-26  6:37                         ` Baolin Wang
  0 siblings, 1 reply; 19+ messages in thread
From: Zi Yan @ 2025-02-25 20:32 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, linux-fsdevel, Matthew Wilcox, Hugh Dickins,
	Kairui Song, Miaohe Lin, linux-kernel, Andrew Morton

On 25 Feb 2025, at 11:41, Zi Yan wrote:

> On 25 Feb 2025, at 5:15, Baolin Wang wrote:
>
>> On 2025/2/25 17:20, Baolin Wang wrote:
>>>
>>>
>>> On 2025/2/21 10:38, Zi Yan wrote:
>>>> On 20 Feb 2025, at 21:33, Zi Yan wrote:
>>>>
>>>>> On 20 Feb 2025, at 8:06, Zi Yan wrote:
>>>>>
>>>>>> On 20 Feb 2025, at 4:27, Baolin Wang wrote:
>>>>>>
>>>>>>> On 2025/2/20 17:07, Baolin Wang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2025/2/20 00:10, Zi Yan wrote:
>>>>>>>>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>>>>>>>>
>>>>>>>>>> Hi Zi,
>>>>>>>>>>
>>>>>>>>>> Sorry for the late reply due to being busy with other things:)
>>>>>>>>>
>>>>>>>>> Thank you for taking a look at the patches. :)
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>>>>>>>>> During shmem_split_large_entry(), large swap entries are covering n slots
>>>>>>>>>>> and an order-0 folio needs to be inserted.
>>>>>>>>>>>
>>>>>>>>>>> Instead of splitting all n slots, only the 1 slot covered by the folio
>>>>>>>>>>> need to be split and the remaining n-1 shadow entries can be retained with
>>>>>>>>>>> orders ranging from 0 to n-1.  This method only requires
>>>>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
>>>>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
>>>>>>>>>>> xas_split_alloc() + xas_split() one.
>>>>>>>>>>>
>>>>>>>>>>> For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
>>>>>>>>>>> is 6), 1 xa_node is needed instead of 8.
>>>>>>>>>>>
>>>>>>>>>>> xas_try_split_min_order() is used to reduce the number of calls to
>>>>>>>>>>> xas_try_split() during split.
>>>>>>>>>>
>>>>>>>>>> For shmem swapin, if we cannot swap in the whole large folio by skipping the swap cache, we will split the large swap entry stored in the shmem mapping into order-0 swap entries, rather than splitting it into other orders of swap entries. This is because the next time we swap in a shmem folio through shmem_swapin_cluster(), it will still be an order 0 folio.
>>>>>>>>>
>>>>>>>>> Right. But the swapin is one folio at a time, right? shmem_split_large_entry()
>>>>>>>>
>>>>>>>> Yes, now we always swapin an order-0 folio from the async swap device at a time. However, for sync swap device, we will skip the swapcache and swapin the whole large folio by commit 1dd44c0af4fa, so it will not call shmem_split_large_entry() in this case.
>>>>>>
>>>>>> Got it. I will check the commit.
>>>>>>
>>>>>>>>
>>>>>>>>> should split the large swap entry and give you a slot to store the order-0 folio.
>>>>>>>>> For example, with an order-9 large swap entry, to swap in first order-0 folio,
>>>>>>>>> the large swap entry will become order-0, order-0, order-1, order-2,… order-8,
>>>>>>>>> after the split. Then the first order-0 swap entry can be used.
>>>>>>>>> Then, when a second order-0 is swapped in, the second order-0 can be used.
>>>>>>>>> When the last order-0 is swapped in, the order-8 would be split to
>>>>>>>>> order-7,order-6,…,order-1,order-0, order-0, and the last order-0 will be used.
>>>>>>>>
>>>>>>>> Yes, understood. However, for the sequential swapin scenarios, where originally only one split operation is needed. However, your approach increases the number of split operations. Of course, I understand that in non-sequential swapin scenarios, your patch will save some xarray memory. It might be necessary to evaluate whether the increased split operations will have a significant impact on the performance of sequential swapin?
>>>>>>
>>>>>> Is there a shmem swapin test I can run to measure this? xas_try_split() should
>>>>>> performance similar operations as existing xas_split_alloc()+xas_split().
>>>>>>
>>>>>>>>
>>>>>>>>> Maybe the swapin assumes after shmem_split_large_entry(), all swap entries
>>>>>>>>> are order-0, which can lead to issues. There should be some check like
>>>>>>>>> if the swap entry order > folio_order, shmem_split_large_entry() should
>>>>>>>>> be used.
>>>>>>>>>>
>>>>>>>>>> Moreover I did a quick test with swapping in order 6 shmem folios, however, my test hung, and the console was continuously filled with the following information. It seems there are some issues with shmem swapin handling. Anyway, I need more time to debug and test.
>>>>>>>>> To swap in order-6 folios, shmem_split_large_entry() does not allocate
>>>>>>>>> any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see OOM
>>>>>>>>> error below. Let me know if there is anything I can help.
>>>>>>>>
>>>>>>>> I encountered some issues while testing order 4 and order 6 swapin with your patches. And I roughly reviewed the patch, and it seems that the new swap entry stored in the shmem mapping was not correctly updated after the split.
>>>>>>>>
>>>>>>>> The following logic is to reset the swap entry after split, and I assume that the large swap entry is always split to order 0 before. As your patch suggests, if a non-uniform split is used, then the logic for resetting the swap entry needs to be changed? Please correct me if I missed something.
>>>>>>>>
>>>>>>>> /*
>>>>>>>>    * Re-set the swap entry after splitting, and the swap
>>>>>>>>    * offset of the original large entry must be continuous.
>>>>>>>>    */
>>>>>>>> for (i = 0; i < 1 << order; i++) {
>>>>>>>>       pgoff_t aligned_index = round_down(index, 1 << order);
>>>>>>>>       swp_entry_t tmp;
>>>>>>>>
>>>>>>>>       tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>>>>>>       __xa_store(&mapping->i_pages, aligned_index + i,
>>>>>>>>              swp_to_radix_entry(tmp), 0);
>>>>>>>> }
>>>>>>
>>>>>> Right. I will need to adjust swp_entry_t. Thanks for pointing this out.
>>>>>>
>>>>>>>
>>>>>>> In addition, after your patch, the shmem_split_large_entry() seems always return 0 even though it splits a large swap entry, but we still need re-calculate the swap entry value after splitting, otherwise it may return errors due to shmem_confirm_swap() validation failure.
>>>>>>>
>>>>>>> /*
>>>>>>>   * If the large swap entry has already been split, it is
>>>>>>>   * necessary to recalculate the new swap entry based on
>>>>>>>   * the old order alignment.
>>>>>>>   */
>>>>>>>   if (split_order > 0) {
>>>>>>>     pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>>>
>>>>>>>     swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>>>> }
>>>>>>
>>>>>> Got it. I will fix it.
>>>>>>
>>>>>> BTW, do you mind sharing your swapin tests so that I can test my new version
>>>>>> properly?
>>>>>
>>>>> The diff below adjusts the swp_entry_t and returns the right order after
>>>>> shmem_split_large_entry(). Let me know if it fixes your issue.
>>>>
>>>> Fixed the compilation error. It will be great if you can share a swapin test, so that
>>>> I can test locally. Thanks.
>>>>
>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>> index b35ba250c53d..bfc4ef511391 100644
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -2162,7 +2162,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>   {
>>>>       struct address_space *mapping = inode->i_mapping;
>>>>       XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
>>>> -    int split_order = 0;
>>>> +    int split_order = 0, entry_order = 0;
>>>>       int i;
>>>>
>>>>       /* Convert user data gfp flags to xarray node gfp flags */
>>>> @@ -2180,6 +2180,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>           }
>>>>
>>>>           order = xas_get_order(&xas);
>>>> +        entry_order = order;
>>>>
>>>>           /* Try to split large swap entry in pagecache */
>>>>           if (order > 0) {
>>>> @@ -2192,23 +2193,23 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>                   xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
>>>>                   if (xas_error(&xas))
>>>>                       goto unlock;
>>>> +
>>>> +                /*
>>>> +                 * Re-set the swap entry after splitting, and the swap
>>>> +                 * offset of the original large entry must be continuous.
>>>> +                 */
>>>> +                for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
>>>> +                    pgoff_t aligned_index = round_down(index, 1 << cur_order);
>>>> +                    swp_entry_t tmp;
>>>> +
>>>> +                    tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>> +                    __xa_store(&mapping->i_pages, aligned_index + i,
>>>> +                           swp_to_radix_entry(tmp), 0);
>>>> +                }
>>>>                   cur_order = split_order;
>>>>                   split_order =
>>>>                       xas_try_split_min_order(split_order);
>>>>               }
>>>
>>> This looks incorrect to me. Suppose we are splitting an order-9 swap entry, in the first iteration of the loop, it splits the order-9 swap entry into 8 order-6 swap entries. At this point, the order-6 swap entries are reset, and everything seems fine.
>>>
>>> However, in the second iteration, where an order-6 swap entry is split into 63 order-0 swap entries, the split operation itself is correct. But
>>
>> typo: 64
>>
>>> when resetting the order-0 swap entry, it seems incorrect. Now the 'cur_order' = 6 and 'split_order' = 0, which means the range for the reset index is always between 0 and 63 (see __xa_store()).
>>
>> Sorry for confusing. The 'aligned_index' will be rounded down by 'cur_order' (which is 6), so the index is correct. But the swap offset calculated by 'swp_offset(swap) + i' looks incorrect, cause the 'i' is always between 0 and 63.
>
> Right. I think I need to recalculate swap’s swp_offset for each iteration
> by adding the difference of round_down(index, 1 << cur_order) and
> round_down(index, 1 << split_order) and use the new swap in this iteration.
> Thank you a lot for walking me through the details. I really appreciate it. :)
>
> My tests did not fail probably because I was using linear access pattern
> to swap in folios.

Here is my new fix on top of my original patch. I tested it with zswap
and a random swapin order without any issue. Let me know if it passes
your tests. Thanks.
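
For reference, a rough sketch of the kind of exercise meant by "a random
swapin order" (illustrative only, not the exact test used here; it needs
swap configured, and hitting the large-swap-entry split path additionally
requires shmem THP/mTHP to be enabled via transparent_hugepage/shmem_enabled):

/* sketch: write a shmem region, push it to swap with MADV_PAGEOUT,
 * then fault the pages back in a random order. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

#define NR_PAGES 512	/* one order-9 range: 2MB with 4KB pages */

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t len = (size_t)NR_PAGES * psz;
	int order[NR_PAGES], i, j, t, fd;
	char *p;

	fd = memfd_create("shmem-swapin-test", 0);
	assert(fd >= 0 && ftruncate(fd, len) == 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	assert(p != MAP_FAILED);

	for (i = 0; i < NR_PAGES; i++)		/* populate the shmem pages */
		p[(size_t)i * psz] = (char)i;

	assert(madvise(p, len, MADV_PAGEOUT) == 0);	/* reclaim them to swap */

	srand(getpid());			/* shuffle the fault order */
	for (i = 0; i < NR_PAGES; i++)
		order[i] = i;
	for (i = NR_PAGES - 1; i > 0; i--) {
		j = rand() % (i + 1);
		t = order[i]; order[i] = order[j]; order[j] = t;
	}

	for (i = 0; i < NR_PAGES; i++)		/* swap back in, random order */
		assert(p[(size_t)order[i] * psz] == (char)order[i]);

	return 0;
}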


From aaf4407546ff08b761435048d0850944d5de211d Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Tue, 25 Feb 2025 12:03:34 -0500
Subject: [PATCH] mm/shmem: fix shmem_split_large_entry()

The swap entry offset was updated incorrectly. Fix it.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/shmem.c | 41 ++++++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 48caa16e8971..f4e58611899f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2153,7 +2153,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 {
 	struct address_space *mapping = inode->i_mapping;
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
-	int split_order = 0;
+	int split_order = 0, entry_order = 0;
 	int i;

 	/* Convert user data gfp flags to xarray node gfp flags */
@@ -2171,35 +2171,46 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 		}

 		order = xas_get_order(&xas);
+		entry_order = order;

 		/* Try to split large swap entry in pagecache */
 		if (order > 0) {
 			int cur_order = order;
+			pgoff_t swap_index = round_down(index, 1 << order);

 			split_order = xas_try_split_min_order(cur_order);

 			while (cur_order > 0) {
+				pgoff_t aligned_index =
+					round_down(index, 1 << cur_order);
+				pgoff_t swap_offset = aligned_index - swap_index;
+
 				xas_set_order(&xas, index, split_order);
 				xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
 				if (xas_error(&xas))
 					goto unlock;
+
+				/*
+				 * Re-set the swap entry after splitting, and
+				 * the swap offset of the original large entry
+				 * must be continuous.
+				 */
+				for (i = 0; i < 1 << cur_order;
+				     i += (1 << split_order)) {
+					swp_entry_t tmp;
+
+					tmp = swp_entry(swp_type(swap),
+							swp_offset(swap) +
+							swap_offset +
+								i);
+					__xa_store(&mapping->i_pages,
+						   aligned_index + i,
+						   swp_to_radix_entry(tmp), 0);
+				}
 				cur_order = split_order;
 				split_order =
 					xas_try_split_min_order(split_order);
 			}
-
-			/*
-			 * Re-set the swap entry after splitting, and the swap
-			 * offset of the original large entry must be continuous.
-			 */
-			for (i = 0; i < 1 << order; i++) {
-				pgoff_t aligned_index = round_down(index, 1 << order);
-				swp_entry_t tmp;
-
-				tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
-				__xa_store(&mapping->i_pages, aligned_index + i,
-					   swp_to_radix_entry(tmp), 0);
-			}
 		}

 unlock:
@@ -2212,7 +2223,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 	if (xas_error(&xas))
 		return xas_error(&xas);

-	return split_order;
+	return entry_order;
 }

 /*
-- 
2.47.2



Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-25 20:32                       ` Zi Yan
@ 2025-02-26  6:37                         ` Baolin Wang
  2025-02-26 15:03                           ` Zi Yan
  0 siblings, 1 reply; 19+ messages in thread
From: Baolin Wang @ 2025-02-26  6:37 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-fsdevel, Matthew Wilcox, Hugh Dickins,
	Kairui Song, Miaohe Lin, linux-kernel, Andrew Morton



On 2025/2/26 04:32, Zi Yan wrote:
> On 25 Feb 2025, at 11:41, Zi Yan wrote:
> 
>> On 25 Feb 2025, at 5:15, Baolin Wang wrote:
>>
>>> On 2025/2/25 17:20, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/2/21 10:38, Zi Yan wrote:
>>>>> On 20 Feb 2025, at 21:33, Zi Yan wrote:
>>>>>
>>>>>> On 20 Feb 2025, at 8:06, Zi Yan wrote:
>>>>>>
>>>>>>> On 20 Feb 2025, at 4:27, Baolin Wang wrote:
>>>>>>>
>>>>>>>> On 2025/2/20 17:07, Baolin Wang wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2025/2/20 00:10, Zi Yan wrote:
>>>>>>>>>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Zi,
>>>>>>>>>>>
>>>>>>>>>>> Sorry for the late reply due to being busy with other things:)
>>>>>>>>>>
>>>>>>>>>> Thank you for taking a look at the patches. :)
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>>>>>>>>>> During shmem_split_large_entry(), large swap entries are covering n slots
>>>>>>>>>>>> and an order-0 folio needs to be inserted.
>>>>>>>>>>>>
>>>>>>>>>>>> Instead of splitting all n slots, only the 1 slot covered by the folio
>>>>>>>>>>>> need to be split and the remaining n-1 shadow entries can be retained with
>>>>>>>>>>>> orders ranging from 0 to n-1.  This method only requires
>>>>>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
>>>>>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
>>>>>>>>>>>> xas_split_alloc() + xas_split() one.
>>>>>>>>>>>>
>>>>>>>>>>>> For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
>>>>>>>>>>>> is 6), 1 xa_node is needed instead of 8.
>>>>>>>>>>>>
>>>>>>>>>>>> xas_try_split_min_order() is used to reduce the number of calls to
>>>>>>>>>>>> xas_try_split() during split.
>>>>>>>>>>>
>>>>>>>>>>> For shmem swapin, if we cannot swap in the whole large folio by skipping the swap cache, we will split the large swap entry stored in the shmem mapping into order-0 swap entries, rather than splitting it into other orders of swap entries. This is because the next time we swap in a shmem folio through shmem_swapin_cluster(), it will still be an order 0 folio.
>>>>>>>>>>
>>>>>>>>>> Right. But the swapin is one folio at a time, right? shmem_split_large_entry()
>>>>>>>>>
>>>>>>>>> Yes, now we always swapin an order-0 folio from the async swap device at a time. However, for sync swap device, we will skip the swapcache and swapin the whole large folio by commit 1dd44c0af4fa, so it will not call shmem_split_large_entry() in this case.
>>>>>>>
>>>>>>> Got it. I will check the commit.
>>>>>>>
>>>>>>>>>
>>>>>>>>>> should split the large swap entry and give you a slot to store the order-0 folio.
>>>>>>>>>> For example, with an order-9 large swap entry, to swap in first order-0 folio,
>>>>>>>>>> the large swap entry will become order-0, order-0, order-1, order-2,… order-8,
>>>>>>>>>> after the split. Then the first order-0 swap entry can be used.
>>>>>>>>>> Then, when a second order-0 is swapped in, the second order-0 can be used.
>>>>>>>>>> When the last order-0 is swapped in, the order-8 would be split to
>>>>>>>>>> order-7,order-6,…,order-1,order-0, order-0, and the last order-0 will be used.
>>>>>>>>>
>>>>>>>>> Yes, understood. However, for the sequential swapin scenarios, where originally only one split operation is needed. However, your approach increases the number of split operations. Of course, I understand that in non-sequential swapin scenarios, your patch will save some xarray memory. It might be necessary to evaluate whether the increased split operations will have a significant impact on the performance of sequential swapin?
>>>>>>>
>>>>>>> Is there a shmem swapin test I can run to measure this? xas_try_split() should
>>>>>>> performance similar operations as existing xas_split_alloc()+xas_split().
>>>>>>>
>>>>>>>>>
>>>>>>>>>> Maybe the swapin assumes after shmem_split_large_entry(), all swap entries
>>>>>>>>>> are order-0, which can lead to issues. There should be some check like
>>>>>>>>>> if the swap entry order > folio_order, shmem_split_large_entry() should
>>>>>>>>>> be used.
>>>>>>>>>>>
>>>>>>>>>>> Moreover I did a quick test with swapping in order 6 shmem folios, however, my test hung, and the console was continuously filled with the following information. It seems there are some issues with shmem swapin handling. Anyway, I need more time to debug and test.
>>>>>>>>>> To swap in order-6 folios, shmem_split_large_entry() does not allocate
>>>>>>>>>> any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see OOM
>>>>>>>>>> error below. Let me know if there is anything I can help.
>>>>>>>>>
>>>>>>>>> I encountered some issues while testing order 4 and order 6 swapin with your patches. And I roughly reviewed the patch, and it seems that the new swap entry stored in the shmem mapping was not correctly updated after the split.
>>>>>>>>>
>>>>>>>>> The following logic is to reset the swap entry after split, and I assume that the large swap entry is always split to order 0 before. As your patch suggests, if a non-uniform split is used, then the logic for resetting the swap entry needs to be changed? Please correct me if I missed something.
>>>>>>>>>
>>>>>>>>> /*
>>>>>>>>>     * Re-set the swap entry after splitting, and the swap
>>>>>>>>>     * offset of the original large entry must be continuous.
>>>>>>>>>     */
>>>>>>>>> for (i = 0; i < 1 << order; i++) {
>>>>>>>>>        pgoff_t aligned_index = round_down(index, 1 << order);
>>>>>>>>>        swp_entry_t tmp;
>>>>>>>>>
>>>>>>>>>        tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>>>>>>>        __xa_store(&mapping->i_pages, aligned_index + i,
>>>>>>>>>               swp_to_radix_entry(tmp), 0);
>>>>>>>>> }
>>>>>>>
>>>>>>> Right. I will need to adjust swp_entry_t. Thanks for pointing this out.
>>>>>>>
>>>>>>>>
>>>>>>>> In addition, after your patch, the shmem_split_large_entry() seems always return 0 even though it splits a large swap entry, but we still need re-calculate the swap entry value after splitting, otherwise it may return errors due to shmem_confirm_swap() validation failure.
>>>>>>>>
>>>>>>>> /*
>>>>>>>>    * If the large swap entry has already been split, it is
>>>>>>>>    * necessary to recalculate the new swap entry based on
>>>>>>>>    * the old order alignment.
>>>>>>>>    */
>>>>>>>>    if (split_order > 0) {
>>>>>>>>      pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>>>>
>>>>>>>>      swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>>>>> }
>>>>>>>
>>>>>>> Got it. I will fix it.
>>>>>>>
>>>>>>> BTW, do you mind sharing your swapin tests so that I can test my new version
>>>>>>> properly?
>>>>>>
>>>>>> The diff below adjusts the swp_entry_t and returns the right order after
>>>>>> shmem_split_large_entry(). Let me know if it fixes your issue.
>>>>>
>>>>> Fixed the compilation error. It will be great if you can share a swapin test, so that
>>>>> I can test locally. Thanks.
>>>>>
>>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>>> index b35ba250c53d..bfc4ef511391 100644
>>>>> --- a/mm/shmem.c
>>>>> +++ b/mm/shmem.c
>>>>> @@ -2162,7 +2162,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>>    {
>>>>>        struct address_space *mapping = inode->i_mapping;
>>>>>        XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
>>>>> -    int split_order = 0;
>>>>> +    int split_order = 0, entry_order = 0;
>>>>>        int i;
>>>>>
>>>>>        /* Convert user data gfp flags to xarray node gfp flags */
>>>>> @@ -2180,6 +2180,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>>            }
>>>>>
>>>>>            order = xas_get_order(&xas);
>>>>> +        entry_order = order;
>>>>>
>>>>>            /* Try to split large swap entry in pagecache */
>>>>>            if (order > 0) {
>>>>> @@ -2192,23 +2193,23 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>>                    xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
>>>>>                    if (xas_error(&xas))
>>>>>                        goto unlock;
>>>>> +
>>>>> +                /*
>>>>> +                 * Re-set the swap entry after splitting, and the swap
>>>>> +                 * offset of the original large entry must be continuous.
>>>>> +                 */
>>>>> +                for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
>>>>> +                    pgoff_t aligned_index = round_down(index, 1 << cur_order);
>>>>> +                    swp_entry_t tmp;
>>>>> +
>>>>> +                    tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>>> +                    __xa_store(&mapping->i_pages, aligned_index + i,
>>>>> +                           swp_to_radix_entry(tmp), 0);
>>>>> +                }
>>>>>                    cur_order = split_order;
>>>>>                    split_order =
>>>>>                        xas_try_split_min_order(split_order);
>>>>>                }
>>>>
>>>> This looks incorrect to me. Suppose we are splitting an order-9 swap entry, in the first iteration of the loop, it splits the order-9 swap entry into 8 order-6 swap entries. At this point, the order-6 swap entries are reset, and everything seems fine.
>>>>
>>>> However, in the second iteration, where an order-6 swap entry is split into 63 order-0 swap entries, the split operation itself is correct. But
>>>
>>> typo: 64
>>>
>>>> when resetting the order-0 swap entry, it seems incorrect. Now the 'cur_order' = 6 and 'split_order' = 0, which means the range for the reset index is always between 0 and 63 (see __xa_store()).
>>>
>>> Sorry for confusing. The 'aligned_index' will be rounded down by 'cur_order' (which is 6), so the index is correct. But the swap offset calculated by 'swp_offset(swap) + i' looks incorrect, cause the 'i' is always between 0 and 63.
>>
>> Right. I think I need to recalculate swap’s swp_offset for each iteration
>> by adding the difference of round_down(index, 1 << cur_order) and
>> round_down(index, 1 << split_order) and use the new swap in this iteration.
>> Thank you a lot for walking me through the details. I really appreciate it. :)
>>
>> My tests did not fail probably because I was using linear access pattern
>> to swap in folios.
> 
> Here is my new fix on top of my original patch. I tested it with zswap
> and a random swapin order without any issue. Let me know if it passes
> your tests. Thanks.
> 
> 
>  From aaf4407546ff08b761435048d0850944d5de211d Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@nvidia.com>
> Date: Tue, 25 Feb 2025 12:03:34 -0500
> Subject: [PATCH] mm/shmem: fix shmem_split_large_entry()
> 
> The swap entry offset was updated incorrectly. Fix it.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   mm/shmem.c | 41 ++++++++++++++++++++++++++---------------
>   1 file changed, 26 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 48caa16e8971..f4e58611899f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2153,7 +2153,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>   {
>   	struct address_space *mapping = inode->i_mapping;
>   	XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
> -	int split_order = 0;
> +	int split_order = 0, entry_order = 0;
>   	int i;
> 
>   	/* Convert user data gfp flags to xarray node gfp flags */
> @@ -2171,35 +2171,46 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>   		}
> 
>   		order = xas_get_order(&xas);
> +		entry_order = order;

It seems ‘entry_order’ and ‘order’ are duplicate variables, and you can 
remove the 'order' variable.

> 
>   		/* Try to split large swap entry in pagecache */
>   		if (order > 0) {

You can change the code as follows:
		if (!entry_order)
			goto unlock;

which can save some indentation.
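
Roughly, the resulting shape would be (just an untested sketch, with the
'order' variable dropped as suggested above and the cur_order/swap_index
declarations hoisted accordingly):

		entry_order = xas_get_order(&xas);
		if (!entry_order)
			goto unlock;

		/* former "if (order > 0)" body, one indent level less */
		cur_order = entry_order;
		swap_index = round_down(index, 1 << entry_order);
		split_order = xas_try_split_min_order(cur_order);
		while (cur_order > 0) {
			...
		}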

>   			int cur_order = order;
> +			pgoff_t swap_index = round_down(index, 1 << order);
> 
>   			split_order = xas_try_split_min_order(cur_order);
> 
>   			while (cur_order > 0) {
> +				pgoff_t aligned_index =
> +					round_down(index, 1 << cur_order);
> +				pgoff_t swap_offset = aligned_index - swap_index;
> +
>   				xas_set_order(&xas, index, split_order);
>   				xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
>   				if (xas_error(&xas))
>   					goto unlock;
> +
> +				/*
> +				 * Re-set the swap entry after splitting, and
> +				 * the swap offset of the original large entry
> +				 * must be continuous.
> +				 */
> +				for (i = 0; i < 1 << cur_order;
> +				     i += (1 << split_order)) {
> +					swp_entry_t tmp;
> +
> +					tmp = swp_entry(swp_type(swap),
> +							swp_offset(swap) +
> +							swap_offset +
> +								i);
> +					__xa_store(&mapping->i_pages,
> +						   aligned_index + i,
> +						   swp_to_radix_entry(tmp), 0);
> +				}
>   				cur_order = split_order;
>   				split_order =
>   					xas_try_split_min_order(split_order);
>   			}
> -
> -			/*
> -			 * Re-set the swap entry after splitting, and the swap
> -			 * offset of the original large entry must be continuous.
> -			 */
> -			for (i = 0; i < 1 << order; i++) {
> -				pgoff_t aligned_index = round_down(index, 1 << order);
> -				swp_entry_t tmp;
> -
> -				tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
> -				__xa_store(&mapping->i_pages, aligned_index + i,
> -					   swp_to_radix_entry(tmp), 0);
> -			}
>   		}
> 
>   unlock:
> @@ -2212,7 +2223,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>   	if (xas_error(&xas))
>   		return xas_error(&xas);
> 
> -	return split_order;
> +	return entry_order;
>   }

I did not find any obvious issues. But could you rebase and resend the
patch with the above coding style issues fixed? (BTW, I posted one bugfix
patch to fix the split issues[1].) I can then do more testing.

[1] https://lore.kernel.org/all/2fe47c557e74e9df5fe2437ccdc6c9115fa1bf70.1740476943.git.baolin.wang@linux.alibaba.com/


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
  2025-02-26  6:37                         ` Baolin Wang
@ 2025-02-26 15:03                           ` Zi Yan
  0 siblings, 0 replies; 19+ messages in thread
From: Zi Yan @ 2025-02-26 15:03 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, linux-fsdevel, Matthew Wilcox, Hugh Dickins,
	Kairui Song, Miaohe Lin, linux-kernel, Andrew Morton

On 26 Feb 2025, at 1:37, Baolin Wang wrote:

> On 2025/2/26 04:32, Zi Yan wrote:
>> On 25 Feb 2025, at 11:41, Zi Yan wrote:
>>
>>> On 25 Feb 2025, at 5:15, Baolin Wang wrote:
>>>
>>>> On 2025/2/25 17:20, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2025/2/21 10:38, Zi Yan wrote:
>>>>>> On 20 Feb 2025, at 21:33, Zi Yan wrote:
>>>>>>
>>>>>>> On 20 Feb 2025, at 8:06, Zi Yan wrote:
>>>>>>>
>>>>>>>> On 20 Feb 2025, at 4:27, Baolin Wang wrote:
>>>>>>>>
>>>>>>>>> On 2025/2/20 17:07, Baolin Wang wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2025/2/20 00:10, Zi Yan wrote:
>>>>>>>>>>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Zi,
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry for the late reply due to being busy with other things:)
>>>>>>>>>>>
>>>>>>>>>>> Thank you for taking a look at the patches. :)
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>>>>>>>>>>> During shmem_split_large_entry(), large swap entries are covering n slots
>>>>>>>>>>>>> and an order-0 folio needs to be inserted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Instead of splitting all n slots, only the 1 slot covered by the folio
>>>>>>>>>>>>> need to be split and the remaining n-1 shadow entries can be retained with
>>>>>>>>>>>>> orders ranging from 0 to n-1.  This method only requires
>>>>>>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
>>>>>>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
>>>>>>>>>>>>> xas_split_alloc() + xas_split() one.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
>>>>>>>>>>>>> is 6), 1 xa_node is needed instead of 8.
>>>>>>>>>>>>>
>>>>>>>>>>>>> xas_try_split_min_order() is used to reduce the number of calls to
>>>>>>>>>>>>> xas_try_split() during split.
>>>>>>>>>>>>
>>>>>>>>>>>> For shmem swapin, if we cannot swap in the whole large folio by skipping the swap cache, we will split the large swap entry stored in the shmem mapping into order-0 swap entries, rather than splitting it into other orders of swap entries. This is because the next time we swap in a shmem folio through shmem_swapin_cluster(), it will still be an order 0 folio.
>>>>>>>>>>>
>>>>>>>>>>> Right. But the swapin is one folio at a time, right? shmem_split_large_entry()
>>>>>>>>>>
>>>>>>>>>> Yes, now we always swapin an order-0 folio from the async swap device at a time. However, for sync swap device, we will skip the swapcache and swapin the whole large folio by commit 1dd44c0af4fa, so it will not call shmem_split_large_entry() in this case.
>>>>>>>>
>>>>>>>> Got it. I will check the commit.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> should split the large swap entry and give you a slot to store the order-0 folio.
>>>>>>>>>>> For example, with an order-9 large swap entry, to swap in first order-0 folio,
>>>>>>>>>>> the large swap entry will become order-0, order-0, order-1, order-2,… order-8,
>>>>>>>>>>> after the split. Then the first order-0 swap entry can be used.
>>>>>>>>>>> Then, when a second order-0 is swapped in, the second order-0 can be used.
>>>>>>>>>>> When the last order-0 is swapped in, the order-8 would be split to
>>>>>>>>>>> order-7,order-6,…,order-1,order-0, order-0, and the last order-0 will be used.
>>>>>>>>>>
>>>>>>>>>> Yes, understood. However, for the sequential swapin scenarios, where originally only one split operation is needed. However, your approach increases the number of split operations. Of course, I understand that in non-sequential swapin scenarios, your patch will save some xarray memory. It might be necessary to evaluate whether the increased split operations will have a significant impact on the performance of sequential swapin?
>>>>>>>>
>>>>>>>> Is there a shmem swapin test I can run to measure this? xas_try_split() should
>>>>>>>> performance similar operations as existing xas_split_alloc()+xas_split().
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Maybe the swapin assumes after shmem_split_large_entry(), all swap entries
>>>>>>>>>>> are order-0, which can lead to issues. There should be some check like
>>>>>>>>>>> if the swap entry order > folio_order, shmem_split_large_entry() should
>>>>>>>>>>> be used.
>>>>>>>>>>>>
>>>>>>>>>>>> Moreover I did a quick test with swapping in order 6 shmem folios, however, my test hung, and the console was continuously filled with the following information. It seems there are some issues with shmem swapin handling. Anyway, I need more time to debug and test.
>>>>>>>>>>> To swap in order-6 folios, shmem_split_large_entry() does not allocate
>>>>>>>>>>> any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see OOM
>>>>>>>>>>> error below. Let me know if there is anything I can help.
>>>>>>>>>>
>>>>>>>>>> I encountered some issues while testing order 4 and order 6 swapin with your patches. And I roughly reviewed the patch, and it seems that the new swap entry stored in the shmem mapping was not correctly updated after the split.
>>>>>>>>>>
>>>>>>>>>> The following logic is to reset the swap entry after split, and I assume that the large swap entry is always split to order 0 before. As your patch suggests, if a non-uniform split is used, then the logic for resetting the swap entry needs to be changed? Please correct me if I missed something.
>>>>>>>>>>
>>>>>>>>>> /*
>>>>>>>>>>     * Re-set the swap entry after splitting, and the swap
>>>>>>>>>>     * offset of the original large entry must be continuous.
>>>>>>>>>>     */
>>>>>>>>>> for (i = 0; i < 1 << order; i++) {
>>>>>>>>>>        pgoff_t aligned_index = round_down(index, 1 << order);
>>>>>>>>>>        swp_entry_t tmp;
>>>>>>>>>>
>>>>>>>>>>        tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>>>>>>>>        __xa_store(&mapping->i_pages, aligned_index + i,
>>>>>>>>>>               swp_to_radix_entry(tmp), 0);
>>>>>>>>>> }
>>>>>>>>
>>>>>>>> Right. I will need to adjust swp_entry_t. Thanks for pointing this out.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> In addition, after your patch, the shmem_split_large_entry() seems always return 0 even though it splits a large swap entry, but we still need re-calculate the swap entry value after splitting, otherwise it may return errors due to shmem_confirm_swap() validation failure.
>>>>>>>>>
>>>>>>>>> /*
>>>>>>>>>    * If the large swap entry has already been split, it is
>>>>>>>>>    * necessary to recalculate the new swap entry based on
>>>>>>>>>    * the old order alignment.
>>>>>>>>>    */
>>>>>>>>>    if (split_order > 0) {
>>>>>>>>>      pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>>>>>
>>>>>>>>>      swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>>>>>> }
>>>>>>>>
>>>>>>>> Got it. I will fix it.
>>>>>>>>
>>>>>>>> BTW, do you mind sharing your swapin tests so that I can test my new version
>>>>>>>> properly?
>>>>>>>
>>>>>>> The diff below adjusts the swp_entry_t and returns the right order after
>>>>>>> shmem_split_large_entry(). Let me know if it fixes your issue.
>>>>>>
>>>>>> Fixed the compilation error. It will be great if you can share a swapin test, so that
>>>>>> I can test locally. Thanks.
>>>>>>
>>>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>>>> index b35ba250c53d..bfc4ef511391 100644
>>>>>> --- a/mm/shmem.c
>>>>>> +++ b/mm/shmem.c
>>>>>> @@ -2162,7 +2162,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>>>    {
>>>>>>        struct address_space *mapping = inode->i_mapping;
>>>>>>        XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
>>>>>> -    int split_order = 0;
>>>>>> +    int split_order = 0, entry_order = 0;
>>>>>>        int i;
>>>>>>
>>>>>>        /* Convert user data gfp flags to xarray node gfp flags */
>>>>>> @@ -2180,6 +2180,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>>>            }
>>>>>>
>>>>>>            order = xas_get_order(&xas);
>>>>>> +        entry_order = order;
>>>>>>
>>>>>>            /* Try to split large swap entry in pagecache */
>>>>>>            if (order > 0) {
>>>>>> @@ -2192,23 +2193,23 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>>>                    xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
>>>>>>                    if (xas_error(&xas))
>>>>>>                        goto unlock;
>>>>>> +
>>>>>> +                /*
>>>>>> +                 * Re-set the swap entry after splitting, and the swap
>>>>>> +                 * offset of the original large entry must be continuous.
>>>>>> +                 */
>>>>>> +                for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
>>>>>> +                    pgoff_t aligned_index = round_down(index, 1 << cur_order);
>>>>>> +                    swp_entry_t tmp;
>>>>>> +
>>>>>> +                    tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>>>> +                    __xa_store(&mapping->i_pages, aligned_index + i,
>>>>>> +                           swp_to_radix_entry(tmp), 0);
>>>>>> +                }
>>>>>>                    cur_order = split_order;
>>>>>>                    split_order =
>>>>>>                        xas_try_split_min_order(split_order);
>>>>>>                }
>>>>>
>>>>> This looks incorrect to me. Suppose we are splitting an order-9 swap entry, in the first iteration of the loop, it splits the order-9 swap entry into 8 order-6 swap entries. At this point, the order-6 swap entries are reset, and everything seems fine.
>>>>>
>>>>> However, in the second iteration, where an order-6 swap entry is split into 63 order-0 swap entries, the split operation itself is correct. But
>>>>
>>>> typo: 64
>>>>
>>>>> when resetting the order-0 swap entry, it seems incorrect. Now the 'cur_order' = 6 and 'split_order' = 0, which means the range for the reset index is always between 0 and 63 (see __xa_store()).
>>>>
>>>> Sorry for confusing. The 'aligned_index' will be rounded down by 'cur_order' (which is 6), so the index is correct. But the swap offset calculated by 'swp_offset(swap) + i' looks incorrect, cause the 'i' is always between 0 and 63.
>>>
>>> Right. I think I need to recalculate swap’s swp_offset for each iteration
>>> by adding the difference of round_down(index, 1 << cur_order) and
>>> round_down(index, 1 << split_order) and use the new swap in this iteration.
>>> Thank you a lot for walking me through the details. I really appreciate it. :)
>>>
>>> My tests did not fail probably because I was using linear access pattern
>>> to swap in folios.
>>
>> Here is my new fix on top of my original patch. I tested it with zswap
>> and a random swapin order without any issue. Let me know if it passes
>> your tests. Thanks.
>>
>>
>>  From aaf4407546ff08b761435048d0850944d5de211d Mon Sep 17 00:00:00 2001
>> From: Zi Yan <ziy@nvidia.com>
>> Date: Tue, 25 Feb 2025 12:03:34 -0500
>> Subject: [PATCH] mm/shmem: fix shmem_split_large_entry()
>>
>> The swap entry offset was updated incorrectly. Fix it.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>   mm/shmem.c | 41 ++++++++++++++++++++++++++---------------
>>   1 file changed, 26 insertions(+), 15 deletions(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 48caa16e8971..f4e58611899f 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -2153,7 +2153,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>   {
>>   	struct address_space *mapping = inode->i_mapping;
>>   	XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
>> -	int split_order = 0;
>> +	int split_order = 0, entry_order = 0;
>>   	int i;
>>
>>   	/* Convert user data gfp flags to xarray node gfp flags */
>> @@ -2171,35 +2171,46 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>   		}
>>
>>   		order = xas_get_order(&xas);
>> +		entry_order = order;
>
> It seems ‘entry_order’ and ‘order’ are duplicate variables, and you can remove the 'order' variable.
Sure. Will remove one of them.


>>
>>   		/* Try to split large swap entry in pagecache */
>>   		if (order > 0) {
>
> You can change the code as follows:
> 		if (!entry_order)
> 			goto unlock;
>
> which can save some indentation.
Sure.

>
>>   			int cur_order = order;
>> +			pgoff_t swap_index = round_down(index, 1 << order);
>>
>>   			split_order = xas_try_split_min_order(cur_order);
>>
>>   			while (cur_order > 0) {
>> +				pgoff_t aligned_index =
>> +					round_down(index, 1 << cur_order);
>> +				pgoff_t swap_offset = aligned_index - swap_index;
>> +
>>   				xas_set_order(&xas, index, split_order);
>>   				xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
>>   				if (xas_error(&xas))
>>   					goto unlock;
>> +
>> +				/*
>> +				 * Re-set the swap entry after splitting, and
>> +				 * the swap offset of the original large entry
>> +				 * must be continuous.
>> +				 */
>> +				for (i = 0; i < 1 << cur_order;
>> +				     i += (1 << split_order)) {
>> +					swp_entry_t tmp;
>> +
>> +					tmp = swp_entry(swp_type(swap),
>> +							swp_offset(swap) +
>> +							swap_offset +
>> +								i);
>> +					__xa_store(&mapping->i_pages,
>> +						   aligned_index + i,
>> +						   swp_to_radix_entry(tmp), 0);
>> +				}
>>   				cur_order = split_order;
>>   				split_order =
>>   					xas_try_split_min_order(split_order);
>>   			}
>> -
>> -			/*
>> -			 * Re-set the swap entry after splitting, and the swap
>> -			 * offset of the original large entry must be continuous.
>> -			 */
>> -			for (i = 0; i < 1 << order; i++) {
>> -				pgoff_t aligned_index = round_down(index, 1 << order);
>> -				swp_entry_t tmp;
>> -
>> -				tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>> -				__xa_store(&mapping->i_pages, aligned_index + i,
>> -					   swp_to_radix_entry(tmp), 0);
>> -			}
>>   		}
>>
>>   unlock:
>> @@ -2212,7 +2223,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>   	if (xas_error(&xas))
>>   		return xas_error(&xas);
>>
>> -	return split_order;
>> +	return entry_order;
>>   }
>
> I did not find any obvious issues. But could you rebase and resend the patch with the above coding style issues fixed? (BTW, I posted one bugfix patch to fix the split issues[1].) I can then do more testing.
>
> [1] https://lore.kernel.org/all/2fe47c557e74e9df5fe2437ccdc6c9115fa1bf70.1740476943.git.baolin.wang@linux.alibaba.com/


No problem. Thank you for the review.

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-02-26 15:04 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-18 23:54 [PATCH v2 0/2] Minimize xa_node allocation during xarry split Zi Yan
2025-02-18 23:54 ` [PATCH v2 1/2] mm/filemap: use xas_try_split() in __filemap_add_folio() Zi Yan
2025-02-18 23:54 ` [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry() Zi Yan
2025-02-19 10:04   ` Baolin Wang
2025-02-19 16:10     ` Zi Yan
2025-02-20  9:07       ` Baolin Wang
2025-02-20  9:27         ` Baolin Wang
2025-02-20 13:06           ` Zi Yan
2025-02-21  2:33             ` Zi Yan
2025-02-21  2:38               ` Zi Yan
2025-02-21  6:17                 ` Baolin Wang
2025-02-21 23:47                   ` Zi Yan
2025-02-25  9:25                     ` Baolin Wang
2025-02-25  9:20                 ` Baolin Wang
2025-02-25 10:15                   ` Baolin Wang
2025-02-25 16:41                     ` Zi Yan
2025-02-25 20:32                       ` Zi Yan
2025-02-26  6:37                         ` Baolin Wang
2025-02-26 15:03                           ` Zi Yan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox