linux-mm.kvack.org archive mirror
* [PATCH v2 0/9] support large folio swap-out and swap-in for shmem
@ 2024-06-18  6:54 Baolin Wang
  2024-06-18  6:54 ` [PATCH v2 1/9] mm: vmscan: add validation before spliting shmem large folio Baolin Wang
                   ` (9 more replies)
  0 siblings, 10 replies; 24+ messages in thread
From: Baolin Wang @ 2024-06-18  6:54 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

Shmem will support large folio allocation [1] [2] to get better performance;
however, memory reclaim still splits the precious large folios when trying
to swap out shmem, which may lead to memory fragmentation and fails to
take advantage of large folios for shmem.

Moreover, the swap code already supports swapping out large folios without
splitting them, and the large folio swap-in series [3] is queued into the
mm-unstable branch. Hence this patch set also supports large folio swap-out
and swap-in for shmem.

Functional testing
==================
I used the latest mm-unstable branch for testing, with Chris's
"mm: swap: mTHP swap allocator base on swap cluster order" series reverted,
since it can cause some problems (Hugh also reported this in [4]).

Machine environment: 32 Arm cores, 120G memory and 50G swap device.

1. I ran the xfstests suite against a tmpfs filesystem with the following
configuration, and did not catch any regressions with this patch set:
FSTYP=tmpfs
export TEST_DIR=/mnt/tempfs_mnt
export TEST_DEV=/mnt/tempfs_mnt
export SCRATCH_MNT=/mnt/scratchdir
export SCRATCH_DEV=/mnt/scratchdir

2. I ran all mm selftests in tools/testing/selftests/mm/, and no
regressions were found.

3. I also wrote several shmem swap test cases, covering shmem splitting,
shmem swapout, shmem swapin, swapoff during shmem swapout, shmem reclaim,
shmem swapin replacement, etc. I tested these cases with 4K and 64K
shmem folio sizes against a swap device, and shmem swap functionality works
well on my machine (I can share these test cases if needed).
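
For reference, a minimal sketch of the kind of case exercised here (the
tmpfs path, the 64K size and the use of MADV_PAGEOUT are illustrative
assumptions, not the exact tests I run; whether the backing folio is 4K
or 64K depends on the tmpfs huge/mTHP configuration): dirty a region of
a tmpfs file, ask reclaim to swap it out, then fault it back in and
verify the contents.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT	21	/* reclaim these pages */
#endif

#define SIZE	(64UL * 1024)

int main(void)
{
	int fd = open("/mnt/tempfs_mnt/swap-test", O_CREAT | O_RDWR, 0600);
	char *p;
	size_t i;

	if (fd < 0 || ftruncate(fd, SIZE))
		exit(1);

	p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		exit(1);

	memset(p, 0xa5, SIZE);			/* populate the shmem folio */
	if (madvise(p, SIZE, MADV_PAGEOUT))	/* push it out to swap */
		perror("madvise(MADV_PAGEOUT)");

	for (i = 0; i < SIZE; i++)		/* swap it back in and verify */
		if (p[i] != (char)0xa5)
			exit(1);

	munmap(p, SIZE);
	close(fd);
	return 0;
}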

Hugh, I think this version has fixed the reference issue you pointed out
before, and I have also fixed some issues related to shmem swapin replacement.
So could you help to take a look at this series when you find some time?
Thanks a lot.

[1] https://lore.kernel.org/all/cover.1717495894.git.baolin.wang@linux.alibaba.com/
[2] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
[3] https://lore.kernel.org/all/20240508224040.190469-6-21cnbao@gmail.com/T/
[4] https://lore.kernel.org/all/8db63194-77fd-e0b8-8601-2bbf04889a5b@google.com/

Changes from v1:
 - Remove useless 'order' variable in shmem_partial_swap_usage(), per Daniel.
 - Add a new patch to return the number of pages being freed in shmem_free_swap(),
 per Daniel.
 - Drop 'orders' parameter for find_get_entries() and find_lock_entries().
 - Round down the index when adding the swapin folio into the pagecache,
 suggested by Hugh.
 - Fix the reference issue when removing folio from pagecache in patch 8.
 - Fix replacing old folio in swap cache in patch 7.

Changes from RFC:
 - Rebased to the latest mm-unstable.
 - Drop the counter name fixing patch, which was queued into mm-hotfixes-stable
 branch.

Baolin Wang (8):
  mm: vmscan: add validation before spliting shmem large folio
  mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM
    flag setting
  mm: shmem: extend shmem_partial_swap_usage() to support large folio
    swap
  mm: filemap: use xa_get_order() to get the swap entry order
  mm: shmem: use swap_free_nr() to free shmem swap entries
  mm: shmem: support large folio allocation for shmem_replace_folio()
  mm: shmem: drop folio reference count using 'nr_pages' in
    shmem_delete_from_page_cache()
  mm: shmem: support large folio swap out

Daniel Gomez (1):
  mm: shmem: return number of pages beeing freed in shmem_free_swap

 drivers/gpu/drm/i915/gem/i915_gem_shmem.c |   1 +
 include/linux/swap.h                      |   4 +-
 include/linux/writeback.h                 |   1 +
 mm/filemap.c                              |   4 +
 mm/shmem.c                                | 107 +++++++++++++---------
 mm/swapfile.c                             |  98 ++++++++++----------
 mm/vmscan.c                               |  22 ++++-
 7 files changed, 142 insertions(+), 95 deletions(-)

-- 
2.39.3




* [PATCH v2 1/9] mm: vmscan: add validation before spliting shmem large folio
  2024-06-18  6:54 [PATCH v2 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
@ 2024-06-18  6:54 ` Baolin Wang
  2024-06-20  6:12   ` Yu Zhao
  2024-06-18  6:54 ` [PATCH v2 2/9] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting Baolin Wang
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 24+ messages in thread
From: Baolin Wang @ 2024-06-18  6:54 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

Add a check for available swap space before splitting a shmem large folio,
to avoid a redundant split, since we cannot write the shmem folio out to
the swap device in this case.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/vmscan.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1807e5d95dda..61465f92283f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1237,6 +1237,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			}
 		} else if (folio_test_swapbacked(folio) &&
 			   folio_test_large(folio)) {
+
+			/*
+			 * Do not split shmem folio if no swap memory
+			 * available.
+			 */
+			if (!total_swap_pages)
+				goto activate_locked;
+
 			/* Split shmem folio */
 			if (split_folio_to_list(folio, folio_list))
 				goto keep_locked;
-- 
2.39.3




* [PATCH v2 2/9] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting
  2024-06-18  6:54 [PATCH v2 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
  2024-06-18  6:54 ` [PATCH v2 1/9] mm: vmscan: add validation before spliting shmem large folio Baolin Wang
@ 2024-06-18  6:54 ` Baolin Wang
  2024-06-18  6:54 ` [PATCH v2 3/9] mm: shmem: extend shmem_partial_swap_usage() to support large folio swap Baolin Wang
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Baolin Wang @ 2024-06-18  6:54 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

To support shmem large folio swap operations, add a new parameter to
swap_shmem_alloc() that allows batch SWAP_MAP_SHMEM flag setting for
shmem swap entries.

While we are at it, use folio_nr_pages() to get the number of pages in
the folio, as a preparation step.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/swap.h |  4 +-
 mm/shmem.c           |  6 ++-
 mm/swapfile.c        | 98 +++++++++++++++++++++++---------------------
 3 files changed, 57 insertions(+), 51 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d33ce740b695..bffb2281840d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -474,7 +474,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern void swap_shmem_alloc(swp_entry_t);
+extern void swap_shmem_alloc(swp_entry_t, int);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free_nr(swp_entry_t entry, int nr_pages);
@@ -541,7 +541,7 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 	return 0;
 }
 
-static inline void swap_shmem_alloc(swp_entry_t swp)
+static inline void swap_shmem_alloc(swp_entry_t swp, int nr)
 {
 }
 
diff --git a/mm/shmem.c b/mm/shmem.c
index d09c6bf1f28a..b90965486631 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1432,6 +1432,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 	swp_entry_t swap;
 	pgoff_t index;
+	int nr_pages;
 
 	/*
 	 * Our capabilities prevent regular writeback or sync from ever calling
@@ -1464,6 +1465,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	index = folio->index;
+	nr_pages = folio_nr_pages(folio);
 
 	/*
 	 * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC
@@ -1516,8 +1518,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	if (add_to_swap_cache(folio, swap,
 			__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
 			NULL) == 0) {
-		shmem_recalc_inode(inode, 0, 1);
-		swap_shmem_alloc(swap);
+		shmem_recalc_inode(inode, 0, nr_pages);
+		swap_shmem_alloc(swap, nr_pages);
 		shmem_delete_from_page_cache(folio, swp_to_radix_entry(swap));
 
 		mutex_unlock(&shmem_swaplist_mutex);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9c6d8e557c0f..1dde413264e2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3362,62 +3362,58 @@ void si_swapinfo(struct sysinfo *val)
  * - swap-cache reference is requested but the entry is not used. -> ENOENT
  * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
  */
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+static int __swap_duplicate(struct swap_info_struct *p, unsigned long offset,
+			    int nr, unsigned char usage)
 {
-	struct swap_info_struct *p;
 	struct swap_cluster_info *ci;
-	unsigned long offset;
 	unsigned char count;
 	unsigned char has_cache;
-	int err;
+	int err, i;
 
-	p = swp_swap_info(entry);
-
-	offset = swp_offset(entry);
 	ci = lock_cluster_or_swap_info(p, offset);
 
-	count = p->swap_map[offset];
-
-	/*
-	 * swapin_readahead() doesn't check if a swap entry is valid, so the
-	 * swap entry could be SWAP_MAP_BAD. Check here with lock held.
-	 */
-	if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
-		err = -ENOENT;
-		goto unlock_out;
-	}
-
-	has_cache = count & SWAP_HAS_CACHE;
-	count &= ~SWAP_HAS_CACHE;
-	err = 0;
-
-	if (usage == SWAP_HAS_CACHE) {
+	for (i = 0; i < nr; i++) {
+		count = p->swap_map[offset + i];
 
-		/* set SWAP_HAS_CACHE if there is no cache and entry is used */
-		if (!has_cache && count)
-			has_cache = SWAP_HAS_CACHE;
-		else if (has_cache)		/* someone else added cache */
-			err = -EEXIST;
-		else				/* no users remaining */
+		/*
+		 * swapin_readahead() doesn't check if a swap entry is valid, so the
+		 * swap entry could be SWAP_MAP_BAD. Check here with lock held.
+		 */
+		if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
 			err = -ENOENT;
+			break;
+		}
 
-	} else if (count || has_cache) {
+		has_cache = count & SWAP_HAS_CACHE;
+		count &= ~SWAP_HAS_CACHE;
+		err = 0;
+
+		if (usage == SWAP_HAS_CACHE) {
+			/* set SWAP_HAS_CACHE if there is no cache and entry is used */
+			if (!has_cache && count)
+				has_cache = SWAP_HAS_CACHE;
+			else if (has_cache)		/* someone else added cache */
+				err = -EEXIST;
+			else				/* no users remaining */
+				err = -ENOENT;
+		} else if (count || has_cache) {
+			if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
+				count += usage;
+			else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
+				err = -EINVAL;
+			else if (swap_count_continued(p, offset + i, count))
+				count = COUNT_CONTINUED;
+			else
+				err = -ENOMEM;
+		} else
+			err = -ENOENT;			/* unused swap entry */
 
-		if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
-			count += usage;
-		else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
-			err = -EINVAL;
-		else if (swap_count_continued(p, offset, count))
-			count = COUNT_CONTINUED;
-		else
-			err = -ENOMEM;
-	} else
-		err = -ENOENT;			/* unused swap entry */
+		if (err)
+			break;
 
-	if (!err)
-		WRITE_ONCE(p->swap_map[offset], count | has_cache);
+		WRITE_ONCE(p->swap_map[offset + i], count | has_cache);
+	}
 
-unlock_out:
 	unlock_cluster_or_swap_info(p, ci);
 	return err;
 }
@@ -3426,9 +3422,12 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
  * Help swapoff by noting that swap entry belongs to shmem/tmpfs
  * (in which case its reference count is never incremented).
  */
-void swap_shmem_alloc(swp_entry_t entry)
+void swap_shmem_alloc(swp_entry_t entry, int nr)
 {
-	__swap_duplicate(entry, SWAP_MAP_SHMEM);
+	struct swap_info_struct *p = swp_swap_info(entry);
+	unsigned long offset = swp_offset(entry);
+
+	__swap_duplicate(p, offset, nr, SWAP_MAP_SHMEM);
 }
 
 /*
@@ -3440,9 +3439,11 @@ void swap_shmem_alloc(swp_entry_t entry)
  */
 int swap_duplicate(swp_entry_t entry)
 {
+	struct swap_info_struct *p = swp_swap_info(entry);
+	unsigned long offset = swp_offset(entry);
 	int err = 0;
 
-	while (!err && __swap_duplicate(entry, 1) == -ENOMEM)
+	while (!err && __swap_duplicate(p, offset, 1, 1) == -ENOMEM)
 		err = add_swap_count_continuation(entry, GFP_ATOMIC);
 	return err;
 }
@@ -3457,7 +3458,10 @@ int swap_duplicate(swp_entry_t entry)
  */
 int swapcache_prepare(swp_entry_t entry)
 {
-	return __swap_duplicate(entry, SWAP_HAS_CACHE);
+	struct swap_info_struct *p = swp_swap_info(entry);
+	unsigned long offset = swp_offset(entry);
+
+	return __swap_duplicate(p, offset, 1, SWAP_HAS_CACHE);
 }
 
 void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
-- 
2.39.3




* [PATCH v2 3/9] mm: shmem: extend shmem_partial_swap_usage() to support large folio swap
  2024-06-18  6:54 [PATCH v2 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
  2024-06-18  6:54 ` [PATCH v2 1/9] mm: vmscan: add validation before spliting shmem large folio Baolin Wang
  2024-06-18  6:54 ` [PATCH v2 2/9] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting Baolin Wang
@ 2024-06-18  6:54 ` Baolin Wang
  2024-06-18  6:54 ` [PATCH v2 4/9] mm: shmem: return number of pages beeing freed in shmem_free_swap Baolin Wang
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Baolin Wang @ 2024-06-18  6:54 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

To support shmem large folio swapout in the following patches, use
xa_get_order() to get the order of the swap entry when calculating the
swap usage of shmem.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index b90965486631..012a06ef39aa 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -870,7 +870,7 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping,
 		if (xas_retry(&xas, page))
 			continue;
 		if (xa_is_value(page))
-			swapped++;
+			swapped += 1 << xa_get_order(xas.xa, xas.xa_index);
 		if (xas.xa_index == max)
 			break;
 		if (need_resched()) {
-- 
2.39.3




* [PATCH v2 4/9] mm: shmem: return number of pages beeing freed in shmem_free_swap
  2024-06-18  6:54 [PATCH v2 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
                   ` (2 preceding siblings ...)
  2024-06-18  6:54 ` [PATCH v2 3/9] mm: shmem: extend shmem_partial_swap_usage() to support large folio swap Baolin Wang
@ 2024-06-18  6:54 ` Baolin Wang
  2024-06-18  6:54 ` [PATCH v2 5/9] mm: filemap: use xa_get_order() to get the swap entry order Baolin Wang
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Baolin Wang @ 2024-06-18  6:54 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

From: Daniel Gomez <da.gomez@samsung.com>

Both shmem_free_swap() callers expect the number of pages being freed. In
the large folio context, this needs to support values larger than the
current return codes of 0 (meaning 1 page was freed) and -ENOENT (meaning
0 pages were freed). In preparation for large folio adoption, make the
shmem_free_swap() routine return the number of pages being freed, so that
returning 0 in this context means 0 pages were freed.

While we are at it, switch to free_swap_and_cache_nr() to free large order
swap entries (change by Baolin Wang).

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 012a06ef39aa..a73d2da54897 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -836,18 +836,22 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
 }
 
 /*
- * Remove swap entry from page cache, free the swap and its page cache.
+ * Remove swap entry from page cache, free the swap and its page cache. Returns
+ * the number of pages being freed. 0 means entry not found in XArray (0 pages
+ * being freed).
  */
-static int shmem_free_swap(struct address_space *mapping,
-			   pgoff_t index, void *radswap)
+static long shmem_free_swap(struct address_space *mapping,
+			    pgoff_t index, void *radswap)
 {
+	int order = xa_get_order(&mapping->i_pages, index);
 	void *old;
 
 	old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
 	if (old != radswap)
-		return -ENOENT;
-	free_swap_and_cache(radix_to_swp_entry(radswap));
-	return 0;
+		return 0;
+	free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);
+
+	return 1 << order;
 }
 
 /*
@@ -999,7 +1003,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			if (xa_is_value(folio)) {
 				if (unfalloc)
 					continue;
-				nr_swaps_freed += !shmem_free_swap(mapping,
+				nr_swaps_freed += shmem_free_swap(mapping,
 							indices[i], folio);
 				continue;
 			}
@@ -1066,14 +1070,17 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			folio = fbatch.folios[i];
 
 			if (xa_is_value(folio)) {
+				long swaps_freed;
+
 				if (unfalloc)
 					continue;
-				if (shmem_free_swap(mapping, indices[i], folio)) {
+				swaps_freed = shmem_free_swap(mapping, indices[i], folio);
+				if (!swaps_freed) {
 					/* Swap was replaced by page: retry */
 					index = indices[i];
 					break;
 				}
-				nr_swaps_freed++;
+				nr_swaps_freed += swaps_freed;
 				continue;
 			}
 
-- 
2.39.3




* [PATCH v2 5/9] mm: filemap: use xa_get_order() to get the swap entry order
  2024-06-18  6:54 [PATCH v2 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
                   ` (3 preceding siblings ...)
  2024-06-18  6:54 ` [PATCH v2 4/9] mm: shmem: return number of pages beeing freed in shmem_free_swap Baolin Wang
@ 2024-06-18  6:54 ` Baolin Wang
  2024-06-18  6:54 ` [PATCH v2 6/9] mm: shmem: use swap_free_nr() to free shmem swap entries Baolin Wang
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Baolin Wang @ 2024-06-18  6:54 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

In the following patches, shmem will support swapping out large folios,
which means shmem mappings may contain large order swap entries, so use
xa_get_order() to get the folio order of the shmem swap entry and update
'*start' correctly.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/filemap.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 876cc64aadd7..ecc785758aaf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2056,6 +2056,8 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
 		folio = fbatch->folios[idx];
 		if (!xa_is_value(folio))
 			nr = folio_nr_pages(folio);
+		else
+			nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
 		*start = indices[idx] + nr;
 	}
 	return folio_batch_count(fbatch);
@@ -2120,6 +2122,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
 		folio = fbatch->folios[idx];
 		if (!xa_is_value(folio))
 			nr = folio_nr_pages(folio);
+		else
+			nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
 		*start = indices[idx] + nr;
 	}
 	return folio_batch_count(fbatch);
-- 
2.39.3




* [PATCH v2 6/9] mm: shmem: use swap_free_nr() to free shmem swap entries
  2024-06-18  6:54 [PATCH v2 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
                   ` (4 preceding siblings ...)
  2024-06-18  6:54 ` [PATCH v2 5/9] mm: filemap: use xa_get_order() to get the swap entry order Baolin Wang
@ 2024-06-18  6:54 ` Baolin Wang
  2024-06-18  6:54 ` [PATCH v2 7/9] mm: shmem: support large folio allocation for shmem_replace_folio() Baolin Wang
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Baolin Wang @ 2024-06-18  6:54 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

As a preparation for supporting shmem large folio swapout, use swap_free_nr()
to free the contiguous swap entries of a shmem large folio when the large
folio is swapped in from the swap cache. In addition, the index should also
be rounded down to a multiple of the number of pages when adding the
swapped-in folio into the pagecache.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index a73d2da54897..4d7996962388 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1950,6 +1950,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	struct address_space *mapping = inode->i_mapping;
 	swp_entry_t swapin_error;
 	void *old;
+	int nr_pages;
 
 	swapin_error = make_poisoned_swp_entry();
 	old = xa_cmpxchg_irq(&mapping->i_pages, index,
@@ -1958,6 +1959,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	if (old != swp_to_radix_entry(swap))
 		return;
 
+	nr_pages = folio_nr_pages(folio);
 	folio_wait_writeback(folio);
 	delete_from_swap_cache(folio);
 	/*
@@ -1965,8 +1967,8 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	 * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
 	 * in shmem_evict_inode().
 	 */
-	shmem_recalc_inode(inode, -1, -1);
-	swap_free(swap);
+	shmem_recalc_inode(inode, -nr_pages, -nr_pages);
+	swap_free_nr(swap, nr_pages);
 }
 
 /*
@@ -1985,7 +1987,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	struct swap_info_struct *si;
 	struct folio *folio = NULL;
 	swp_entry_t swap;
-	int error;
+	int error, nr_pages;
 
 	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
 	swap = radix_to_swp_entry(*foliop);
@@ -2032,6 +2034,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		goto failed;
 	}
 	folio_wait_writeback(folio);
+	nr_pages = folio_nr_pages(folio);
 
 	/*
 	 * Some architectures may have to restore extra metadata to the
@@ -2045,19 +2048,20 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			goto failed;
 	}
 
-	error = shmem_add_to_page_cache(folio, mapping, index,
+	error = shmem_add_to_page_cache(folio, mapping,
+					round_down(index, nr_pages),
 					swp_to_radix_entry(swap), gfp);
 	if (error)
 		goto failed;
 
-	shmem_recalc_inode(inode, 0, -1);
+	shmem_recalc_inode(inode, 0, -nr_pages);
 
 	if (sgp == SGP_WRITE)
 		folio_mark_accessed(folio);
 
 	delete_from_swap_cache(folio);
 	folio_mark_dirty(folio);
-	swap_free(swap);
+	swap_free_nr(swap, nr_pages);
 	put_swap_device(si);
 
 	*foliop = folio;
-- 
2.39.3




* [PATCH v2 7/9] mm: shmem: support large folio allocation for shmem_replace_folio()
  2024-06-18  6:54 [PATCH v2 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
                   ` (5 preceding siblings ...)
  2024-06-18  6:54 ` [PATCH v2 6/9] mm: shmem: use swap_free_nr() to free shmem swap entries Baolin Wang
@ 2024-06-18  6:54 ` Baolin Wang
  2024-06-18  6:54 ` [PATCH v2 8/9] mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache() Baolin Wang
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Baolin Wang @ 2024-06-18  6:54 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

To support large folio swapin for shmem in the following patches, add large
folio allocation for the new replacement folio in shmem_replace_folio().
Moreover, large folios occupy N consecutive entries in the swap cache
instead of using multi-index entries like the page cache, therefore we
should replace each of those consecutive entries in the swap cache instead
of using shmem_replace_entry().

Also update the statistics and the folio reference count using the number
of pages in the folio.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 53 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 30 insertions(+), 23 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 4d7996962388..c0a9253f3a99 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1878,28 +1878,24 @@ static bool shmem_should_replace_folio(struct folio *folio, gfp_t gfp)
 static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 				struct shmem_inode_info *info, pgoff_t index)
 {
-	struct folio *old, *new;
-	struct address_space *swap_mapping;
-	swp_entry_t entry;
-	pgoff_t swap_index;
-	int error;
-
-	old = *foliop;
-	entry = old->swap;
-	swap_index = swap_cache_index(entry);
-	swap_mapping = swap_address_space(entry);
+	struct folio *new, *old = *foliop;
+	swp_entry_t entry = old->swap;
+	struct address_space *swap_mapping = swap_address_space(entry);
+	pgoff_t swap_index = swap_cache_index(entry);
+	XA_STATE(xas, &swap_mapping->i_pages, swap_index);
+	int nr_pages = folio_nr_pages(old);
+	int error = 0, i;
 
 	/*
 	 * We have arrived here because our zones are constrained, so don't
 	 * limit chance of success by further cpuset and node constraints.
 	 */
 	gfp &= ~GFP_CONSTRAINT_MASK;
-	VM_BUG_ON_FOLIO(folio_test_large(old), old);
-	new = shmem_alloc_folio(gfp, 0, info, index);
+	new = shmem_alloc_folio(gfp, folio_order(old), info, index);
 	if (!new)
 		return -ENOMEM;
 
-	folio_get(new);
+	folio_ref_add(new, nr_pages);
 	folio_copy(new, old);
 	flush_dcache_folio(new);
 
@@ -1909,18 +1905,24 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 	new->swap = entry;
 	folio_set_swapcache(new);
 
-	/*
-	 * Our caller will very soon move newpage out of swapcache, but it's
-	 * a nice clean interface for us to replace oldpage by newpage there.
-	 */
+	/* Swap cache still stores N entries instead of a high-order entry */
 	xa_lock_irq(&swap_mapping->i_pages);
-	error = shmem_replace_entry(swap_mapping, swap_index, old, new);
+	for (i = 0; i < nr_pages; i++) {
+		void *item = xas_store(&xas, new);
+
+		if (item != old) {
+			error = -ENOENT;
+			break;
+		}
+
+		xas_next(&xas);
+	}
 	if (!error) {
 		mem_cgroup_replace_folio(old, new);
-		__lruvec_stat_mod_folio(new, NR_FILE_PAGES, 1);
-		__lruvec_stat_mod_folio(new, NR_SHMEM, 1);
-		__lruvec_stat_mod_folio(old, NR_FILE_PAGES, -1);
-		__lruvec_stat_mod_folio(old, NR_SHMEM, -1);
+		__lruvec_stat_mod_folio(new, NR_FILE_PAGES, nr_pages);
+		__lruvec_stat_mod_folio(new, NR_SHMEM, nr_pages);
+		__lruvec_stat_mod_folio(old, NR_FILE_PAGES, -nr_pages);
+		__lruvec_stat_mod_folio(old, NR_SHMEM, -nr_pages);
 	}
 	xa_unlock_irq(&swap_mapping->i_pages);
 
@@ -1940,7 +1942,12 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 	old->private = NULL;
 
 	folio_unlock(old);
-	folio_put_refs(old, 2);
+	/*
+	 * The old folio is removed from the swap cache, so drop its
+	 * 'nr_pages' references, as well as the one temporary reference
+	 * taken from the swap cache.
+	 */
+	folio_put_refs(old, nr_pages + 1);
 	return error;
 }
 
-- 
2.39.3




* [PATCH v2 8/9] mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache()
  2024-06-18  6:54 [PATCH v2 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
                   ` (6 preceding siblings ...)
  2024-06-18  6:54 ` [PATCH v2 7/9] mm: shmem: support large folio allocation for shmem_replace_folio() Baolin Wang
@ 2024-06-18  6:54 ` Baolin Wang
  2024-06-18  6:54 ` [PATCH v2 9/9] mm: shmem: support large folio swap out Baolin Wang
  2024-06-18 20:05 ` [PATCH v2 0/9] support large folio swap-out and swap-in for shmem Andrew Morton
  9 siblings, 0 replies; 24+ messages in thread
From: Baolin Wang @ 2024-06-18  6:54 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

To support large folio swapin/swapout for shmem in the following patches,
drop the folio's reference count by the number of pages contained in the
folio when a shmem folio is deleted from the shmem pagecache after being
added to the swap cache.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index c0a9253f3a99..9a35ee7e7f40 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -831,7 +831,7 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
 	__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
 	__lruvec_stat_mod_folio(folio, NR_SHMEM, -nr);
 	xa_unlock_irq(&mapping->i_pages);
-	folio_put(folio);
+	folio_put_refs(folio, nr);
 	BUG_ON(error);
 }
 
-- 
2.39.3




* [PATCH v2 9/9] mm: shmem: support large folio swap out
  2024-06-18  6:54 [PATCH v2 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
                   ` (7 preceding siblings ...)
  2024-06-18  6:54 ` [PATCH v2 8/9] mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache() Baolin Wang
@ 2024-06-18  6:54 ` Baolin Wang
  2024-06-18 20:05 ` [PATCH v2 0/9] support large folio swap-out and swap-in for shmem Andrew Morton
  9 siblings, 0 replies; 24+ messages in thread
From: Baolin Wang @ 2024-06-18  6:54 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

Shmem will support large folio allocation [1] [2] to get better performance;
however, memory reclaim still splits the precious large folios when trying
to swap out shmem, which may lead to memory fragmentation and fails to
take advantage of large folios for shmem.

Moreover, the swap code already supports swapping out large folios without
splitting them, hence this patch enables large folio swap-out for shmem.

Note that the i915_gem_shmem driver still needs its folios to be split when
swapping, so add a new flag 'split_large_folio' to writeback_control to
indicate that the large folio should be split.

[1] https://lore.kernel.org/all/cover.1717495894.git.baolin.wang@linux.alibaba.com/
[2] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_shmem.c |  1 +
 include/linux/writeback.h                 |  1 +
 mm/shmem.c                                |  3 +--
 mm/vmscan.c                               | 14 ++++++++++++--
 4 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index c5e1c718a6d2..c66cb9c585e1 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -308,6 +308,7 @@ void __shmem_writeback(size_t size, struct address_space *mapping)
 		.range_start = 0,
 		.range_end = LLONG_MAX,
 		.for_reclaim = 1,
+		.split_large_folio = 1,
 	};
 	unsigned long i;
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 112d806ddbe4..6f2599244ae0 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -63,6 +63,7 @@ struct writeback_control {
 	unsigned range_cyclic:1;	/* range_start is cyclic */
 	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
 	unsigned unpinned_netfs_wb:1;	/* Cleared I_PINNING_NETFS_WB */
+	unsigned split_large_folio:1;	/* Split large folio for shmem writeback */
 
 	/*
 	 * When writeback IOs are bounced through async layers, only the
diff --git a/mm/shmem.c b/mm/shmem.c
index 9a35ee7e7f40..2c951d936fc1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -775,7 +775,6 @@ static int shmem_add_to_page_cache(struct folio *folio,
 	VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
-	VM_BUG_ON(expected && folio_test_large(folio));
 
 	folio_ref_add(folio, nr);
 	folio->mapping = mapping;
@@ -1462,7 +1461,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	 * "force", drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
 	 * and its shmem_writeback() needs them to be split when swapping.
 	 */
-	if (folio_test_large(folio)) {
+	if (wbc->split_large_folio && folio_test_large(folio)) {
 		/* Ensure the subpages are still dirty */
 		folio_test_set_dirty(folio);
 		if (split_huge_page(page) < 0)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 61465f92283f..fd503506262d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1245,8 +1245,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			if (!total_swap_pages)
 				goto activate_locked;
 
-			/* Split shmem folio */
-			if (split_folio_to_list(folio, folio_list))
+			/*
+			 * Only split shmem folio when CONFIG_THP_SWAP
+			 * is not enabled.
+			 */
+			if (!IS_ENABLED(CONFIG_THP_SWAP) &&
+			    split_folio_to_list(folio, folio_list))
 				goto keep_locked;
 		}
 
@@ -1348,10 +1352,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			 * starts and then write it out here.
 			 */
 			try_to_unmap_flush_dirty();
+try_pageout:
 			switch (pageout(folio, mapping, &plug)) {
 			case PAGE_KEEP:
 				goto keep_locked;
 			case PAGE_ACTIVATE:
+				if (shmem_mapping(mapping) && folio_test_large(folio) &&
+				    !split_folio_to_list(folio, folio_list)) {
+					nr_pages = 1;
+					goto try_pageout;
+				}
 				goto activate_locked;
 			case PAGE_SUCCESS:
 				stat->nr_pageout += nr_pages;
-- 
2.39.3




* Re: [PATCH v2 0/9] support large folio swap-out and swap-in for shmem
  2024-06-18  6:54 [PATCH v2 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
                   ` (8 preceding siblings ...)
  2024-06-18  6:54 ` [PATCH v2 9/9] mm: shmem: support large folio swap out Baolin Wang
@ 2024-06-18 20:05 ` Andrew Morton
  2024-06-19  1:28   ` Baolin Wang
  9 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2024-06-18 20:05 UTC (permalink / raw)
  To: Baolin Wang
  Cc: hughd, willy, david, wangkefeng.wang, chrisl, ying.huang,
	21cnbao, ryan.roberts, shy828301, ziy, ioworker0, da.gomez,
	p.raghav, linux-mm, linux-kernel

On Tue, 18 Jun 2024 14:54:12 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:

> Shmem will support large folio allocation [1] [2] to get a better performance,
> however, the memory reclaim still splits the precious large folios when trying
> to swap-out shmem, which may lead to the memory fragmentation issue and can not
> take advantage of the large folio for shmeme.
> 
> Moreover, the swap code already supports for swapping out large folio without
> split, and large folio swap-in[3] series is queued into mm-unstable branch.
> Hence this patch set also supports the large folio swap-out and swap-in for
> shmem.

I'll add this to mm-unstable for some exposure, but I wonder how much
testing it will have received by the time the next merge window opens?

> Functional testing
> ==================
> I use the latest mm-unstable branch to test with reverting Chris's
> "mm: swap: mTHP swap allocator base on swap cluster order" series which
> can cause some problems (Hugh also reported in [4]).

I dropped Chris's series from mm-unstable.



* Re: [PATCH v2 0/9] support large folio swap-out and swap-in for shmem
  2024-06-18 20:05 ` [PATCH v2 0/9] support large folio swap-out and swap-in for shmem Andrew Morton
@ 2024-06-19  1:28   ` Baolin Wang
  2024-06-19  8:16     ` Hugh Dickins
  0 siblings, 1 reply; 24+ messages in thread
From: Baolin Wang @ 2024-06-19  1:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: hughd, willy, david, wangkefeng.wang, chrisl, ying.huang,
	21cnbao, ryan.roberts, shy828301, ziy, ioworker0, da.gomez,
	p.raghav, linux-mm, linux-kernel



On 2024/6/19 04:05, Andrew Morton wrote:
> On Tue, 18 Jun 2024 14:54:12 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> 
>> Shmem will support large folio allocation [1] [2] to get a better performance,
>> however, the memory reclaim still splits the precious large folios when trying
>> to swap-out shmem, which may lead to the memory fragmentation issue and can not
>> take advantage of the large folio for shmeme.
>>
>> Moreover, the swap code already supports for swapping out large folio without
>> split, and large folio swap-in[3] series is queued into mm-unstable branch.
>> Hence this patch set also supports the large folio swap-out and swap-in for
>> shmem.
> 
> I'll add this to mm-unstable for some exposure, but I wonder how much
> testing it will have recieved by the time the next merge window opens?

Thanks Andrew. I am fine with this series going to 6.12 if you are 
concerned about insufficient testing (and let's also wait for Hugh's 
comments). Since we (Daniel and I) have some follow-up patches that will 
rely on this swap series, hope this series can be tested as extensively 
as possible to ensure its stability in the mm branch.



* Re: [PATCH v2 0/9] support large folio swap-out and swap-in for shmem
  2024-06-19  1:28   ` Baolin Wang
@ 2024-06-19  8:16     ` Hugh Dickins
  2024-06-19  8:58       ` Baolin Wang
                         ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Hugh Dickins @ 2024-06-19  8:16 UTC (permalink / raw)
  To: Andrew Morton, Baolin Wang
  Cc: hughd, willy, david, wangkefeng.wang, chrisl, ying.huang,
	21cnbao, ryan.roberts, shy828301, ziy, ioworker0, da.gomez,
	p.raghav, linux-mm, linux-kernel

On Wed, 19 Jun 2024, Baolin Wang wrote:
> On 2024/6/19 04:05, Andrew Morton wrote:
> > On Tue, 18 Jun 2024 14:54:12 +0800 Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> > 
> >> Shmem will support large folio allocation [1] [2] to get a better
> >> performance,
> >> however, the memory reclaim still splits the precious large folios when
> >> trying
> >> to swap-out shmem, which may lead to the memory fragmentation issue and can
> >> not
> >> take advantage of the large folio for shmeme.
> >>
> >> Moreover, the swap code already supports for swapping out large folio
> >> without
> >> split, and large folio swap-in[3] series is queued into mm-unstable branch.
> >> Hence this patch set also supports the large folio swap-out and swap-in for
> >> shmem.
> > 
> > I'll add this to mm-unstable for some exposure, but I wonder how much
> > testing it will have recieved by the time the next merge window opens?
> 
> Thanks Andrew. I am fine with this series going to 6.12 if you are concerned
> about insufficient testing (and let's also wait for Hugh's comments). Since we
> (Daniel and I) have some follow-up patches that will rely on this swap series,
> hope this series can be tested as extensively as possible to ensure its
> stability in the mm branch.

Thanks for giving it the exposure, Andrew, but please drop it from
mm-unstable until the next cycle. I'd been about to write to say I
wouldn't be trying it until next cycle, when your mm-commits came in:
so I thought I ought at least to give mm-everything-2024-06-18 a try.

Baolin may have fixed stuff, but he (or the interaction with other mm
work) has broken stuff too: I couldn't get as far with it as with the
previous version. Just "cp -a" of a kernel source tree into a tmpfs
huge=always size=<bigenough> failed with lots of ENOSPCs, and when
"rm -rf"ed lots of WARN_ON(inode->i_blocks) from shmem_evict_inode();
and on second attempt, then a VM_BUG_ON_FOLIO(!folio_contains) from
find_lock_entries().

Or maybe that VM_BUG_ON_FOLIO() was unrelated, but a symptom of the bug
I'm trying to chase even when this series is reverted: some kind of page
double usage, manifesting as miscellaneous "Bad page"s and VM_BUG_ONs,
mostly from page reclaim or from exit_mmap(). I'm still getting a feel
for it, maybe it occurs soon enough for a reliable bisection, maybe not.

(While writing, a run with mm-unstable cut off at 2a9964cc5d27,
drop KSM_KMEM_CACHE(), instead of reverting just Baolin's latest,
has not yet hit any problem: too early to tell but promising.)

And before 2024-06-18, I was working on mm-everything-2024-06-15 minus
Chris Li's mTHP swap series: which worked fairly well, until it locked
up with __try_to_reclaim_swap()'s filemap_get_folio() spinning around
on a page with 0 refcount, while a page table lock is held which one
by one the other CPUs come to want for reclaim. On two machines.

None of these problems seen on Stephen's last next-2024-06-13.
I had wanted to see if mm-everything-2024-06-18 fixed that lockup,
but with the new problems I cannot tell (or it could all be the same
problem: but if so, odd that it manifests consistently differently).

There are way too many mTHP shmem and swap patch series floating
around at the moment, in mm and in fs, for me to cope with:
everyone, please slow down and test more.

Hugh

p.s. I think Andrew Bresticker's do_set_pmd() fix has soaked
long enough, and deserves promotion to hotfix and Linus soon.



* Re: [PATCH v2 0/9] support large folio swap-out and swap-in for shmem
  2024-06-19  8:16     ` Hugh Dickins
@ 2024-06-19  8:58       ` Baolin Wang
  2024-06-20  1:33       ` Andrew Morton
  2024-06-20  4:42       ` Hugh Dickins
  2 siblings, 0 replies; 24+ messages in thread
From: Baolin Wang @ 2024-06-19  8:58 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	linux-mm, linux-kernel



On 2024/6/19 16:16, Hugh Dickins wrote:
> On Wed, 19 Jun 2024, Baolin Wang wrote:
>> On 2024/6/19 04:05, Andrew Morton wrote:
>>> On Tue, 18 Jun 2024 14:54:12 +0800 Baolin Wang
>>> <baolin.wang@linux.alibaba.com> wrote:
>>>
>>>> Shmem will support large folio allocation [1] [2] to get a better
>>>> performance,
>>>> however, the memory reclaim still splits the precious large folios when
>>>> trying
>>>> to swap-out shmem, which may lead to the memory fragmentation issue and can
>>>> not
>>>> take advantage of the large folio for shmeme.
>>>>
>>>> Moreover, the swap code already supports for swapping out large folio
>>>> without
>>>> split, and large folio swap-in[3] series is queued into mm-unstable branch.
>>>> Hence this patch set also supports the large folio swap-out and swap-in for
>>>> shmem.
>>>
>>> I'll add this to mm-unstable for some exposure, but I wonder how much
>>> testing it will have recieved by the time the next merge window opens?
>>
>> Thanks Andrew. I am fine with this series going to 6.12 if you are concerned
>> about insufficient testing (and let's also wait for Hugh's comments). Since we
>> (Daniel and I) have some follow-up patches that will rely on this swap series,
>> hope this series can be tested as extensively as possible to ensure its
>> stability in the mm branch.
> 
> Thanks for giving it the exposure, Andrew, but please drop it from
> mm-unstable until the next cycle. I'd been about to write to say I
> wouldn't be trying it until next cycle, when your mm-commits came in:
> so I thought I ought at least to give mm-everything-2024-06-18 a try.
> 
> Baolin may have fixed stuff, but he (or the interaction with other mm
> work) has broken stuff too: I couldn't get as far with it as with the
> previous version. Just "cp -a" of a kernel source tree into a tmpfs
> huge=always size=<bigenough> failed with lots of ENOSPCs, and when
> "rm -rf"ed lots of WARN_ON(inode->i_blocks) from shmem_evict_inode();
> and on second attempt, then a VM_BUG_ON_FOLIO(!folio_contains) from
> find_lock_entries().

Thanks Hugh for giving it a try. I can also reproduce the 
WARN_ON(inode->i_blocks) issue with today's mm-unstable branch (sadly, 
this issue didn't occur in the older mm-unstable branch), and I now have 
a fix in my local tree. But I will not rush to send out a v3, and I 
agree we can drop this series from the mm-unstable branch until the next cycle.

> Or maybe that VM_BUG_ON_FOLIO() was unrelated, but a symptom of the bug

I did not encounter the VM_BUG_ON_FOLIO() issue, but let me try your 
test case...

> I'm trying to chase even when this series is reverted: some kind of page
> double usage, manifesting as miscellaneous "Bad page"s and VM_BUG_ONs,
> mostly from page reclaim or from exit_mmap(). I'm still getting a feel
> for it, maybe it occurs soon enough for a reliable bisection, maybe not.
> 
> (While writing, a run with mm-unstable cut off at 2a9964cc5d27,
> drop KSM_KMEM_CACHE(), instead of reverting just Baolin's latest,
> has not yet hit any problem: too early to tell but promising.)
> 
> And before 2024-06-18, I was working on mm-everything-2024-06-15 minus
> Chris Li's mTHP swap series: which worked fairly well, until it locked
> up with __try_to_reclaim_swap()'s filemap_get_folio() spinning around
> on a page with 0 refcount, while a page table lock is held which one
> by one the other CPUs come to want for reclaim. On two machines.
> 
> None of these problems seen on Stephen's last next-2024-06-13.
> I had wanted to see if mm-everything-2024-06-18 fixed that lockup,
> but with the new problems I cannot tell (or it could all be the same
> problem: but if so, odd that it manifests consistently differently).
> 
> There are way too many mTHP shmem and swap patch series floating
> around at the moment, in mm and in fs, for me to cope with:
> everyone, please slow down and test more.

Sure. I will continue to do more testing. Thanks.



* Re: [PATCH v2 0/9] support large folio swap-out and swap-in for shmem
  2024-06-19  8:16     ` Hugh Dickins
  2024-06-19  8:58       ` Baolin Wang
@ 2024-06-20  1:33       ` Andrew Morton
  2024-06-20  3:59         ` Hugh Dickins
  2024-06-20  4:42       ` Hugh Dickins
  2 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2024-06-20  1:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Baolin Wang, willy, david, wangkefeng.wang, chrisl, ying.huang,
	21cnbao, ryan.roberts, shy828301, ziy, ioworker0, da.gomez,
	p.raghav, linux-mm, linux-kernel, Andrew Bresticker

On Wed, 19 Jun 2024 01:16:42 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> On Wed, 19 Jun 2024, Baolin Wang wrote:
> > On 2024/6/19 04:05, Andrew Morton wrote:
> > > On Tue, 18 Jun 2024 14:54:12 +0800 Baolin Wang
> > > <baolin.wang@linux.alibaba.com> wrote:
> > > 
> > >> Shmem will support large folio allocation [1] [2] to get a better
> > >> performance,
> > >> however, the memory reclaim still splits the precious large folios when
> > >> trying
> > >> to swap-out shmem, which may lead to the memory fragmentation issue and can
> > >> not
> > >> take advantage of the large folio for shmeme.
> > >>
> > >> Moreover, the swap code already supports for swapping out large folio
> > >> without
> > >> split, and large folio swap-in[3] series is queued into mm-unstable branch.
> > >> Hence this patch set also supports the large folio swap-out and swap-in for
> > >> shmem.
> > > 
> > > I'll add this to mm-unstable for some exposure, but I wonder how much
> > > testing it will have recieved by the time the next merge window opens?
> > 
> > Thanks Andrew. I am fine with this series going to 6.12 if you are concerned
> > about insufficient testing (and let's also wait for Hugh's comments). Since we
> > (Daniel and I) have some follow-up patches that will rely on this swap series,
> > hope this series can be tested as extensively as possible to ensure its
> > stability in the mm branch.
> 
> Thanks for giving it the exposure, Andrew, but please drop it from
> mm-unstable until the next cycle.

Thanks, dropped.

> p.s. I think Andrew Bresticker's do_set_pmd() fix has soaked
> long enough, and deserves promotion to hotfix and Linus soon.

Oh, OK, done.

And it's cc:stable.  I didn't get any sense of urgency for this one -
what is your thinking here?



* Re: [PATCH v2 0/9] support large folio swap-out and swap-in for shmem
  2024-06-20  1:33       ` Andrew Morton
@ 2024-06-20  3:59         ` Hugh Dickins
  2024-06-20  6:41           ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2024-06-20  3:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Baolin Wang, willy, david, wangkefeng.wang, chrisl,
	ying.huang, 21cnbao, ryan.roberts, shy828301, ziy, ioworker0,
	da.gomez, p.raghav, linux-mm, linux-kernel, Andrew Bresticker

On Wed, 19 Jun 2024, Andrew Morton wrote:
> On Wed, 19 Jun 2024 01:16:42 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> > On Wed, 19 Jun 2024, Baolin Wang wrote:
> > > On 2024/6/19 04:05, Andrew Morton wrote:
> > > > On Tue, 18 Jun 2024 14:54:12 +0800 Baolin Wang
> > > > <baolin.wang@linux.alibaba.com> wrote:
> > > > 
> > > >> Shmem will support large folio allocation [1] [2] to get a better
> > > >> performance,
> > > >> however, the memory reclaim still splits the precious large folios when
> > > >> trying
> > > >> to swap-out shmem, which may lead to the memory fragmentation issue and can
> > > >> not
> > > >> take advantage of the large folio for shmeme.
> > > >>
> > > >> Moreover, the swap code already supports for swapping out large folio
> > > >> without
> > > >> split, and large folio swap-in[3] series is queued into mm-unstable branch.
> > > >> Hence this patch set also supports the large folio swap-out and swap-in for
> > > >> shmem.
> > > > 
> > > > I'll add this to mm-unstable for some exposure, but I wonder how much
> > > > testing it will have recieved by the time the next merge window opens?
> > > 
> > > Thanks Andrew. I am fine with this series going to 6.12 if you are concerned
> > > about insufficient testing (and let's also wait for Hugh's comments). Since we
> > > (Daniel and I) have some follow-up patches that will rely on this swap series,
> > > hope this series can be tested as extensively as possible to ensure its
> > > stability in the mm branch.
> > 
> > Thanks for giving it the exposure, Andrew, but please drop it from
> > mm-unstable until the next cycle.
> 
> Thanks, dropped.

Thanks. I'll add a little more info in other mail, against the further
2024-06-18 problems I reported, but tl;dr is they are still a mystery:
I cannot yet say "drop this" or "drop that" or "here's a fix".

> 
> > p.s. I think Andrew Bresticker's do_set_pmd() fix has soaked
> > long enough, and deserves promotion to hotfix and Linus soon.
> 
> Oh, OK, done.
> 
> And it's cc:stable.  I didn't get any sens of urgency for this one -
> what is your thinking here?

I thought you were right to add the cc:stable. The current v6.8..v6.10
state does not result in any crashes or warnings, but it totally (well,
511 times out of 512, in some workloads anyway) defeats the purpose of
shmem+file huge pages - the kernel is going to all the trouble of trying
to allocate those huge pages, but then refuses to map them by PMD unless
the fault happens to occur within the first 4096 bytes (talking x86_64).

I imagine that this causes a significant performance degradation in
some workloads which ask for and are used to getting huge pages there;
and they might also exceed their memory quotas, since a page table has
to be allocated where a PMD-mapping needs none (anon THPs reserve a page
table anyway, to rely on later if splitting, but shmem+file THPs do not).
And it's surprising that no tests were reported as failing.

I did forget that khugepaged should come along later and fix this up
(that used not to be done, but I believe we got it working around v6.6).
The tests where I noticed the low ShmemPmdMapped were too quick for
khugepaged to fix them: but if you'd prefer to avoid cc:stable, we
could say that khugepaged should in due course usually fix the
long-running cases, which are the ones most in need of fixing.

Hugh



* Re: [PATCH v2 0/9] support large folio swap-out and swap-in for shmem
  2024-06-19  8:16     ` Hugh Dickins
  2024-06-19  8:58       ` Baolin Wang
  2024-06-20  1:33       ` Andrew Morton
@ 2024-06-20  4:42       ` Hugh Dickins
  2024-06-20  6:52         ` David Hildenbrand
  2024-06-25  4:54         ` Hugh Dickins
  2 siblings, 2 replies; 24+ messages in thread
From: Hugh Dickins @ 2024-06-20  4:42 UTC (permalink / raw)
  To: Andrew Morton, Barry Song
  Cc: Hugh Dickins, Baolin Wang, willy, david, wangkefeng.wang, chrisl,
	ying.huang, 21cnbao, ryan.roberts, shy828301, ziy, ioworker0,
	da.gomez, p.raghav, linux-mm, linux-kernel

On Wed, 19 Jun 2024, Hugh Dickins wrote:
> 
> and on second attempt, then a VM_BUG_ON_FOLIO(!folio_contains) from
> find_lock_entries().
> 
> Or maybe that VM_BUG_ON_FOLIO() was unrelated, but a symptom of the bug
> I'm trying to chase even when this series is reverted:

Yes, I doubt now that the VM_BUG_ON_FOLIO(!folio_contains) was related
to Baolin's series: much more likely to be an instance of other problems.

> some kind of page
> double usage, manifesting as miscellaneous "Bad page"s and VM_BUG_ONs,
> mostly from page reclaim or from exit_mmap(). I'm still getting a feel
> for it, maybe it occurs soon enough for a reliable bisection, maybe not.
> 
> (While writing, a run with mm-unstable cut off at 2a9964cc5d27,
> drop KSM_KMEM_CACHE(), instead of reverting just Baolin's latest,
> has not yet hit any problem: too early to tell but promising.)

Yes, that ran without trouble for many hours on two machines. I didn't
do a formal bisection, but did appear to narrow it down convincingly to
Barry's folio_add_new_anon_rmap() series: crashes soon on both machines
with Barry's in but Baolin's out, no crashes with both out.

Yet while I was studying Barry's patches trying to explain it, one
of the machines did at last crash: it's as if Barry's has opened a
window which makes these crashes more likely, but not itself to blame.

I'll go back to studying that crash now: two CPUs crashed about the
same time, perhaps they interacted and give a hint at root cause.

(I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
was all about optimizing a known-exclusive case, so it surprises me
to see it being extended to non-exclusive; and I worry over how its
atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
I've never caught up with David's exclusive changes, I'm out of date).

But even if those are wrong, I'd expect them to tend towards a mapped
page becoming unreclaimable, then "Bad page map" when munmapped,
not to any of the double-free symptoms I've actually seen.)

> 
> And before 2024-06-18, I was working on mm-everything-2024-06-15 minus
> Chris Li's mTHP swap series: which worked fairly well, until it locked
> up with __try_to_reclaim_swap()'s filemap_get_folio() spinning around
> on a page with 0 refcount, while a page table lock is held which one
> by one the other CPUs come to want for reclaim. On two machines.

I've not seen that symptom at all since 2024-06-15: intriguing,
but none of us can afford the time to worry about vanished bugs.

Hugh


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/9] mm: vmscan: add validation before spliting shmem large folio
  2024-06-18  6:54 ` [PATCH v2 1/9] mm: vmscan: add validation before spliting shmem large folio Baolin Wang
@ 2024-06-20  6:12   ` Yu Zhao
  2024-06-20  8:26     ` Baolin Wang
  0 siblings, 1 reply; 24+ messages in thread
From: Yu Zhao @ 2024-06-20  6:12 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hughd, willy, david, wangkefeng.wang, chrisl, ying.huang,
	21cnbao, ryan.roberts, shy828301, ziy, ioworker0, da.gomez,
	p.raghav, linux-mm, linux-kernel

On Tue, Jun 18, 2024 at 12:54 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Add a check for available swap space before splitting a shmem large folio,
> to avoid a redundant split, since we cannot write the shmem folio to the
> swap device in this case.

We don't scan anon LRU unless we know we may be able to swap. Even if
there is swap space, we may still not be able to swap. See
can_reclaim_anon_pages().
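
A minimal sketch of the stricter form of that check, for illustration only
and not a replacement patch; it assumes can_reclaim_anon_pages() keeps its
(memcg, nid, sc) arguments and is visible at this point in mm/vmscan.c:

		} else if (folio_test_swapbacked(folio) &&
			   folio_test_large(folio)) {
			/*
			 * Reuse the predicate the anon LRU scan already
			 * uses, so free swap, memcg swap limits and
			 * demotion are all considered, instead of only
			 * testing that some swap is configured.
			 */
			if (!can_reclaim_anon_pages(folio_memcg(folio),
						    pgdat->node_id, sc))
				goto activate_locked;

			/* Split shmem folio */
			if (split_folio_to_list(folio, folio_list))
				goto keep_locked;

The quoted hunk below shows the form actually posted, which tests only
total_swap_pages.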



> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  mm/vmscan.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1807e5d95dda..61465f92283f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1237,6 +1237,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                         }
>                 } else if (folio_test_swapbacked(folio) &&
>                            folio_test_large(folio)) {
> +
> +                       /*
> +                        * Do not split shmem folio if no swap memory
> +                        * available.
> +                        */
> +                       if (!total_swap_pages)
> +                               goto activate_locked;
> +
>                         /* Split shmem folio */
>                         if (split_folio_to_list(folio, folio_list))
>                                 goto keep_locked;
> --
> 2.39.3
>
>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/9] support large folio swap-out and swap-in for shmem
  2024-06-20  3:59         ` Hugh Dickins
@ 2024-06-20  6:41           ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2024-06-20  6:41 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Baolin Wang, willy, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	linux-mm, linux-kernel, Andrew Bresticker

On 20.06.24 05:59, Hugh Dickins wrote:
> On Wed, 19 Jun 2024, Andrew Morton wrote:
>> On Wed, 19 Jun 2024 01:16:42 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
>>> On Wed, 19 Jun 2024, Baolin Wang wrote:
>>>> On 2024/6/19 04:05, Andrew Morton wrote:
>>>>> On Tue, 18 Jun 2024 14:54:12 +0800 Baolin Wang
>>>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>>
>>>>>> Shmem will support large folio allocation [1] [2] to get a better
>>>>>> performance,
>>>>>> however, the memory reclaim still splits the precious large folios when
>>>>>> trying
>>>>>> to swap-out shmem, which may lead to the memory fragmentation issue and can
>>>>>> not
>>>>>> take advantage of the large folio for shmeme.
>>>>>>
>>>>>> Moreover, the swap code already supports for swapping out large folio
>>>>>> without
>>>>>> split, and large folio swap-in[3] series is queued into mm-unstable branch.
>>>>>> Hence this patch set also supports the large folio swap-out and swap-in for
>>>>>> shmem.
>>>>>
>>>>> I'll add this to mm-unstable for some exposure, but I wonder how much
>>>>> testing it will have recieved by the time the next merge window opens?
>>>>
>>>> Thanks Andrew. I am fine with this series going to 6.12 if you are concerned
>>>> about insufficient testing (and let's also wait for Hugh's comments). Since we
>>>> (Daniel and I) have some follow-up patches that will rely on this swap series,
>>>> hope this series can be tested as extensively as possible to ensure its
>>>> stability in the mm branch.
>>>
>>> Thanks for giving it the exposure, Andrew, but please drop it from
>>> mm-unstable until the next cycle.
>>
>> Thanks, dropped.
> 
> Thanks. I'll add a little more info in other mail, against the further
> 2024-06-18 problems I reported, but tl;dr is they are still a mystery:
> I cannot yet say "drop this" or "drop that" or "here's a fix".
> 
>>
>>> p.s. I think Andrew Bresticker's do_set_pmd() fix has soaked
>>> long enough, and deserves promotion to hotfix and Linus soon.
>>
>> Oh, OK, done.
>>
>> And it's cc:stable.  I didn't get any sens of urgency for this one -
>> what is your thinking here?
> 
> I thought you were right to add the cc:stable. The current v6.8..v6.10
> state does not result in any crashes or warnings, but it totally (well,
> 511 times out of 512, in some workloads anyway) defeats the purpose of
> shmem+file huge pages - the kernel is going to all the trouble of trying
> to allocate those huge pages, but then refuses to map them by PMD unless
> the fault happens to occur within the first 4096 bytes (talking x86_64).
> 
> I imagine that this causes a significant performance degradation in
> some workloads which ask for and are used to getting huge pages there;
> and they might also exceed their memory quotas, since a page table has
> to be allocated where a PMD-mapping needs none (anon THPs reserve a page
> table anyway, to rely on later if splitting, but shmem+file THPs do not).
> And it's surprising that no tests were reported as failing.

Exactly my thinking. Either there's a lack of tests, or it doesn't really
happen that often in cases where khugepaged doesn't fix it up.

After all, it's been two kernel releases ...

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/9] support large folio swap-out and swap-in for shmem
  2024-06-20  4:42       ` Hugh Dickins
@ 2024-06-20  6:52         ` David Hildenbrand
  2024-06-20 16:27           ` Hugh Dickins
  2024-06-25  4:54         ` Hugh Dickins
  1 sibling, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2024-06-20  6:52 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton, Barry Song
  Cc: Baolin Wang, willy, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	linux-mm, linux-kernel

> 
> (I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
> was all about optimizing a known-exclusive case, so it surprises me
> to see it being extended to non-exclusive; and I worry over how its
> atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
> I've never caught up with David's exclusive changes, I'm out of date).

We discussed that a while ago: if we weren't holding the folio lock in 
the "folio == swapcache" case at that point (which we do for both 
do_swap_page() and unuse_pte()), things would already be pretty broken.

That's what I added a while ago:

if (unlikely(!folio_test_anon(folio))) {
	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
	/*
	 * For a PTE-mapped large folio, we only know that the single
	 * PTE is exclusive. Further, __folio_set_anon() might not get
	 * folio->index right when not given the address of the head
	 * page.
	 */
	...

We should probably move that VM_WARN_ON_FOLIO() to 
folio_add_new_anon_rmap() and document that it's required in the 
non-exclusive case.

> 
> But even if those are wrong, I'd expect them to tend towards a mapped
> page becoming unreclaimable, then "Bad page map" when munmapped,
> not to any of the double-free symptoms I've actually seen.)

What's the first known-good commit?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/9] mm: vmscan: add validation before spliting shmem large folio
  2024-06-20  6:12   ` Yu Zhao
@ 2024-06-20  8:26     ` Baolin Wang
  0 siblings, 0 replies; 24+ messages in thread
From: Baolin Wang @ 2024-06-20  8:26 UTC (permalink / raw)
  To: Yu Zhao
  Cc: akpm, hughd, willy, david, wangkefeng.wang, chrisl, ying.huang,
	21cnbao, ryan.roberts, shy828301, ziy, ioworker0, da.gomez,
	p.raghav, linux-mm, linux-kernel



On 2024/6/20 14:12, Yu Zhao wrote:
> On Tue, Jun 18, 2024 at 12:54 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>> Add a check for available swap space before splitting a shmem large folio,
>> to avoid a redundant split, since we cannot write the shmem folio to the
>> swap device in this case.
> 
> We don't scan anon LRU unless we know we may be able to swap. Even if
> there is swap space, we may still not be able to swap. See
> can_reclaim_anon_pages().

Right. Another case is that MADV_PAGEOUT can still split a large shmem 
folio even without a swap device. I know this is not a common case, and 
I can drop this patch if it is not very helpful.
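
For illustration, a minimal userspace sketch of that case: a memfd-backed
(shmem) mapping pushed through MADV_PAGEOUT with no swap device configured.
The name, size and flags here are arbitrary, and error handling is minimal:

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21		/* in case an older libc lacks the define */
#endif

int main(void)
{
	size_t len = 2UL << 20;				/* 2MB of shmem */
	int fd = memfd_create("pageout-demo", 0);	/* shmem-backed fd */

	if (fd < 0 || ftruncate(fd, len) != 0)
		return 1;

	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	memset(p, 0x5a, len);		/* populate the shmem folios */

	/*
	 * Even with no swap configured ("swapoff -a"), this drives the
	 * swap-backed folios through the reclaim path, which is where
	 * the large-folio split being discussed can happen.
	 */
	if (madvise(p, len, MADV_PAGEOUT) != 0)
		return 1;

	munmap(p, len);
	close(fd);
	return 0;
}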

Thanks for your input.

>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   mm/vmscan.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 1807e5d95dda..61465f92283f 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1237,6 +1237,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>                          }
>>                  } else if (folio_test_swapbacked(folio) &&
>>                             folio_test_large(folio)) {
>> +
>> +                       /*
>> +                        * Do not split shmem folio if no swap memory
>> +                        * available.
>> +                        */
>> +                       if (!total_swap_pages)
>> +                               goto activate_locked;
>> +
>>                          /* Split shmem folio */
>>                          if (split_folio_to_list(folio, folio_list))
>>                                  goto keep_locked;
>> --
>> 2.39.3
>>
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/9] support large folio swap-out and swap-in for shmem
  2024-06-20  6:52         ` David Hildenbrand
@ 2024-06-20 16:27           ` Hugh Dickins
  2024-06-20 16:51             ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2024-06-20 16:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hugh Dickins, Andrew Morton, Barry Song, Baolin Wang, willy,
	wangkefeng.wang, chrisl, ying.huang, 21cnbao, ryan.roberts,
	shy828301, ziy, ioworker0, da.gomez, p.raghav, linux-mm,
	linux-kernel

On Thu, 20 Jun 2024, David Hildenbrand wrote:

> > 
> > (I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
> > was all about optimizing a known-exclusive case, so it surprises me
> > to see it being extended to non-exclusive; and I worry over how its
> > atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
> > I've never caught up with David's exclusive changes, I'm out of date).
> 
> We discussed that a while ago: if we weren't holding the folio lock in the
> "folio == swapcache" case at that point (which we do for both do_swap_page()
> and unuse_pte()), things would already be pretty broken.

You're thinking of the non-atomic-ness of atomic_set(): I agree that
the folio lock makes that safe (against other adds; but not against
concurrent removes, which could never occur with the old "_new" usage).

But what I'm worried about is the 0 in atomic_set(&page->_mapcount, 0):
once the folio lock has been dropped, another non-exclusive add could
come in and set _mapcount again to 0 instead of to 1 (mapcount 1 when
it should be 2)?

> 
> That's what I added a while ago:
> 
> if (unlikely(!folio_test_anon(folio))) {
> 	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> 	/*
> 	 * For a PTE-mapped large folio, we only know that the single
> 	 * PTE is exclusive. Further, __folio_set_anon() might not get
> 	 * folio->index right when not given the address of the head
> 	 * page.
> 	 */
> 	...
> 
> We should probably move that VM_WARN_ON_FOLIO() to folio_add_new_anon_rmap()
> and document that it's required in the non-exclusive case.
> 
> > 
> > But even if those are wrong, I'd expect them to tend towards a mapped
> > page becoming unreclaimable, then "Bad page map" when munmapped,
> > not to any of the double-free symptoms I've actually seen.)
> 
> What's the first known-good commit?

I cannot answer that with any certainty: we're on the shifting sands
of next and mm-unstable, and the bug itself is looking rather like
something which gets amplified or masked by other changes - witness
my confident arrival at Barry's series as introducing the badness,
only for a longer run then to contradict that conclusion.

There was no sign of a problem in a 20-hour run of the same load on
rc3-based next-2024-06-13 (plus my posted fixes); there has been no
sign of this problem on 6.10-rc1, rc2, rc3 (but I've not tried rc4
itself, mm.git needing the more urgent attention). mm-everything-
2024-06-15 (minus Chris's mTHP swap) did not show this problem, but
did show filemap_get_folio() hang. mm-everything-2024-06-18 (minus
Baolin's mTHP shmem swap) is where I'm hunting it.

Hugh


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/9] support large folio swap-out and swap-in for shmem
  2024-06-20 16:27           ` Hugh Dickins
@ 2024-06-20 16:51             ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2024-06-20 16:51 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Barry Song, Baolin Wang, willy, wangkefeng.wang,
	chrisl, ying.huang, 21cnbao, ryan.roberts, shy828301, ziy,
	ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel

On 20.06.24 18:27, Hugh Dickins wrote:
> On Thu, 20 Jun 2024, David Hildenbrand wrote:
> 
>>>
>>> (I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
>>> was all about optimizing a known-exclusive case, so it surprises me
>>> to see it being extended to non-exclusive; and I worry over how its
>>> atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
>>> I've never caught up with David's exclusive changes, I'm out of date).
>>
>> We discussed that a while ago: if we weren't holding the folio lock in the
>> "folio == swapcache" case at that point (which we do for both do_swap_page()
>> and unuse_pte()), things would already be pretty broken.
> 
> You're thinking of the non-atomic-ness of atomic_set(): I agree that
> the folio lock makes that safe (against other adds; but not against
> concurrent removes, which could never occur with the old "_new" usage).
> 
> But what I'm worried about is the 0 in atomic_set(&page->_mapcount, 0):
> once the folio lock has been dropped, another non-exclusive add could
> come in and set _mapcount again to 0 instead of to 1 (mapcount 1 when
> it should be 2)?

That would mean that we have someone checking folio_test_anon() before 
doing the folio_add_new*(), and *not* performing that check under the 
folio lock, such that it can race with another folio_add_new*() call 
(or not checking folio_test_anon() at all, which would be similarly broken).

When thinking about this in the past, I concluded that such code would be 
inherently racy and wrong already, and that the existing VM_WARN_ON_FOLIO() 
would have revealed that race.

Whoever calls folio_add_new*() either (a) just allocated that thing and 
can be sure that nobody else is doing anything with it concurrently -- no 
need for the folio lock; or (b) calls it only when !folio_test_anon(), and 
synchronizes against any other possible races using the folio lock.

Swapin code falls into category (b); everything else we have in the tree 
falls into category (a).
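
A minimal sketch of the category (b) shape, not a copy of do_swap_page()
or unuse_pte(), and assuming the post-series folio_add_new_anon_rmap()
that takes rmap flags; page, nr_pages and rmap_flags are placeholders:

	/* swapin side: folio is the swapcache folio, held locked */
	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);

	if (!folio_test_anon(folio))
		/* first mapper: initialises anon state and mapcounts */
		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
	else
		/* later mappers: only increment the existing mapcounts */
		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma,
					 address, rmap_flags);

The folio lock is what serializes two would-be first mappers: the loser of
the race sees folio_test_anon() already set and takes the increment path,
so the atomic_set() to 0 in the "new" path cannot clobber an earlier
mapper's count.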

So far my thinking :)

> 
>>
>> That's what I added a while ago:
>>
>> if (unlikely(!folio_test_anon(folio))) {
>> 	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>> 	/*
>> 	 * For a PTE-mapped large folio, we only know that the single
>> 	 * PTE is exclusive. Further, __folio_set_anon() might not get
>> 	 * folio->index right when not given the address of the head
>> 	 * page.
>> 	 */
>> 	...
>>
>> We should probably move that VM_WARN_ON_FOLIO() to folio_add_new_anon_rmap()
>> and document that it's required in the non-exclusive case.
>>
>>>
>>> But even if those are wrong, I'd expect them to tend towards a mapped
>>> page becoming unreclaimable, then "Bad page map" when munmapped,
>>> not to any of the double-free symptoms I've actually seen.)
>>
>> What's the first known-good commit?
> 
> I cannot answer that with any certainty: we're on the shifting sands
> of next and mm-unstable, and the bug itself is looking rather like
> something which gets amplified or masked by other changes - witness
> my confident arrival at Barry's series as introducing the badness,
> only for a longer run then to contradict that conclusion.
> 
> There was no sign of a problem in a 20-hour run of the same load on
> rc3-based next-2024-06-13 (plus my posted fixes); there has been no
> sign of this problem on 6.10-rc1, rc2, rc3 (but I've not tried rc4
> itself, mm.git needing the more urgent attention). mm-everything-
> 2024-06-15 (minus Chris's mTHP swap) did not show this problem, but
> did show filemap_get_folio() hang. mm-everything-2024-06-18 (minus
> Baolin's mTHP shmem swap) is where I'm hunting it.

Good, as long as we suspect that only something very recent is affected.

There was this report against rc3 upstream:

https://lore.kernel.org/all/CA+G9fYvKmr84WzTArmfaypKM9+=Aw0uXCtuUKHQKFCNMGJyOgQ@mail.gmail.com/

It's only been observed once, and who knows what's happening in that NFS 
setup. I suspected NUMA hinting code, but I am not sure if that really 
explains the issue ...

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/9] support large folio swap-out and swap-in for shmem
  2024-06-20  4:42       ` Hugh Dickins
  2024-06-20  6:52         ` David Hildenbrand
@ 2024-06-25  4:54         ` Hugh Dickins
  1 sibling, 0 replies; 24+ messages in thread
From: Hugh Dickins @ 2024-06-25  4:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Barry Song, Baolin Wang, willy, david,
	wangkefeng.wang, chrisl, ying.huang, 21cnbao, ryan.roberts,
	shy828301, ziy, ioworker0, da.gomez, p.raghav, linux-mm,
	linux-kernel

On Wed, 19 Jun 2024, Hugh Dickins wrote:
> On Wed, 19 Jun 2024, Hugh Dickins wrote:
> > 
> > and on second attempt, then a VM_BUG_ON_FOLIO(!folio_contains) from
> > find_lock_entries().
> > 
> > Or maybe that VM_BUG_ON_FOLIO() was unrelated, but a symptom of the bug
> > I'm trying to chase even when this series is reverted:
> 
> Yes, I doubt now that the VM_BUG_ON_FOLIO(!folio_contains) was related
> to Baolin's series: much more likely to be an instance of other problems.
> 
> > some kind of page
> > double usage, manifesting as miscellaneous "Bad page"s and VM_BUG_ONs,
> > mostly from page reclaim or from exit_mmap(). I'm still getting a feel
> > for it, maybe it occurs soon enough for a reliable bisection, maybe not.
> > 
> > (While writing, a run with mm-unstable cut off at 2a9964cc5d27,
> > drop KSM_KMEM_CACHE(), instead of reverting just Baolin's latest,
> > has not yet hit any problem: too early to tell but promising.)
> 
> Yes, that ran without trouble for many hours on two machines. I didn't
> do a formal bisection, but did appear to narrow it down convincingly to
> Barry's folio_add_new_anon_rmap() series: crashes soon on both machines
> with Barry's in but Baolin's out, no crashes with both out.
> 
> Yet while I was studying Barry's patches trying to explain it, one
> of the machines did at last crash: it's as if Barry's has opened a
> window which makes these crashes more likely, but not itself to blame.
> 
> I'll go back to studying that crash now: two CPUs crashed about the
> same time, perhaps they interacted and give a hint at root cause.
> 
> (I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
> was all about optimizing a known-exclusive case, so it surprises me
> to see it being extended to non-exclusive; and I worry over how its
> atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
> I've never caught up with David's exclusive changes, I'm out of date).

David answered on this, thanks: yes, I hadn't got around to seeing that
it only got to be called this way when page anon was not already set:
so agreed, no problem here with the _mapcount 0.

But eventually I came to see what's wrong with folio_add_new_anon_rmap():
this Baolin thread is the wrong place to post the fix, so I'll send it now
in response to Barry's 2/3.  With that fix, and another mentioned below,
mm-unstable becomes very much more stable for me.

There is still something else causing "Bad page"s and maybe double
frees, something perhaps in the intersection of folio migration,
deferred splitting and miniTHP; but it's too rare: it happened last night
when I wanted to post, but has refused to reappear since then. It's not
holding up testing, so I'll give it more thought next time it comes.

> 
> But even if those are wrong, I'd expect them to tend towards a mapped
> page becoming unreclaimable, then "Bad page map" when munmapped,
> not to any of the double-free symptoms I've actually seen.)
> 
> > 
> > And before 2024-06-18, I was working on mm-everything-2024-06-15 minus
> > Chris Li's mTHP swap series: which worked fairly well, until it locked
> > up with __try_to_reclaim_swap()'s filemap_get_folio() spinning around
> > on a page with 0 refcount, while a page table lock is held which one
> > by one the other CPUs come to want for reclaim. On two machines.
> 
> I've not seen that symptom at all since 2024-06-15: intriguing,
> but none of us can afford the time to worry about vanished bugs.

That issue reappeared when I was testing on next-20240621:
I'll send the fix now in response to Kefeng's 2/5.

Hugh


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-06-25  4:54 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-18  6:54 [PATCH v2 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
2024-06-18  6:54 ` [PATCH v2 1/9] mm: vmscan: add validation before spliting shmem large folio Baolin Wang
2024-06-20  6:12   ` Yu Zhao
2024-06-20  8:26     ` Baolin Wang
2024-06-18  6:54 ` [PATCH v2 2/9] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting Baolin Wang
2024-06-18  6:54 ` [PATCH v2 3/9] mm: shmem: extend shmem_partial_swap_usage() to support large folio swap Baolin Wang
2024-06-18  6:54 ` [PATCH v2 4/9] mm: shmem: return number of pages beeing freed in shmem_free_swap Baolin Wang
2024-06-18  6:54 ` [PATCH v2 5/9] mm: filemap: use xa_get_order() to get the swap entry order Baolin Wang
2024-06-18  6:54 ` [PATCH v2 6/9] mm: shmem: use swap_free_nr() to free shmem swap entries Baolin Wang
2024-06-18  6:54 ` [PATCH v2 7/9] mm: shmem: support large folio allocation for shmem_replace_folio() Baolin Wang
2024-06-18  6:54 ` [PATCH v2 8/9] mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache() Baolin Wang
2024-06-18  6:54 ` [PATCH v2 9/9] mm: shmem: support large folio swap out Baolin Wang
2024-06-18 20:05 ` [PATCH v2 0/9] support large folio swap-out and swap-in for shmem Andrew Morton
2024-06-19  1:28   ` Baolin Wang
2024-06-19  8:16     ` Hugh Dickins
2024-06-19  8:58       ` Baolin Wang
2024-06-20  1:33       ` Andrew Morton
2024-06-20  3:59         ` Hugh Dickins
2024-06-20  6:41           ` David Hildenbrand
2024-06-20  4:42       ` Hugh Dickins
2024-06-20  6:52         ` David Hildenbrand
2024-06-20 16:27           ` Hugh Dickins
2024-06-20 16:51             ` David Hildenbrand
2024-06-25  4:54         ` Hugh Dickins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox