* [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin
@ 2025-06-08 19:27 Kairui Song
2025-06-08 21:44 ` kernel test robot
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Kairui Song @ 2025-06-08 19:27 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Kemeng Shi, Chris Li,
Nhat Pham, Baoquan He, Barry Song, Usama Arif, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
The following softlockup can be easily reproduced on my test machine with:
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
swapon /dev/zram0 # zram0 is a 48G swap device
mkdir -p /sys/fs/cgroup/test
echo 1G > /sys/fs/cgroup/test/memory.max
echo $BASHPID > /sys/fs/cgroup/test/cgroup.procs
while true; do
    dd if=/dev/zero of=/tmp/test.img bs=1M count=5120
    cat /tmp/test.img > /dev/null
    rm /tmp/test.img
done
Then after a while:
watchdog: BUG: soft lockup - CPU#0 stuck for 763s! [cat:5787]
Modules linked in: zram virtiofs
CPU: 0 UID: 0 PID: 5787 Comm: cat Kdump: loaded Tainted: G L 6.15.0.orig-gf3021d9246bc-dirty #118 PREEMPT(voluntary)
Tainted: [L]=SOFTLOCKUP
Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
RIP: 0010:mpol_shared_policy_lookup+0xd/0x70
Code: e9 b8 b4 ff ff 31 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 41 54 55 53 <48> 8b 1f 48 85 db 74 41 4c 8d 67 08 48 89 fb 48 89 f5 4c 89 e7 e8
RSP: 0018:ffffc90002b1fc28 EFLAGS: 00000202
RAX: 00000000001c20ca RBX: 0000000000724e1e RCX: 0000000000000001
RDX: ffff888118e214c8 RSI: 0000000000057d42 RDI: ffff888118e21518
RBP: 000000000002bec8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000bf4 R11: 0000000000000000 R12: 0000000000000001
R13: 00000000001c20ca R14: 00000000001c20ca R15: 0000000000000000
FS: 00007f03f995c740(0000) GS:ffff88a07ad9a000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f03f98f1000 CR3: 0000000144626004 CR4: 0000000000770eb0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
shmem_alloc_folio+0x31/0xc0
shmem_swapin_folio+0x309/0xcf0
? filemap_get_entry+0x117/0x1e0
? xas_load+0xd/0xb0
? filemap_get_entry+0x101/0x1e0
shmem_get_folio_gfp+0x2ed/0x5b0
shmem_file_read_iter+0x7f/0x2e0
vfs_read+0x252/0x330
ksys_read+0x68/0xf0
do_syscall_64+0x4c/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f03f9a46991
Code: 00 48 8b 15 81 14 10 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 20 ad 01 00 f3 0f 1e fa 80 3d 35 97 10 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
RSP: 002b:00007fff3c52bd28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000040000 RCX: 00007f03f9a46991
RDX: 0000000000040000 RSI: 00007f03f98ba000 RDI: 0000000000000003
RBP: 00007fff3c52bd50 R08: 0000000000000000 R09: 00007f03f9b9a380
R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000040000
R13: 00007f03f98ba000 R14: 0000000000000003 R15: 0000000000000000
</TASK>
The reason is simple: readahead brought some order 0 folios into the swap
cache, and the mTHP folio being allocated for swapin conflicts with them,
so swapcache_prepare fails and causes shmem_swap_alloc_folio to return
-EEXIST, and shmem simply retries again and again, causing this loop.
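Roughly, the pre-patch control flow looks like this (a condensed,
illustrative sketch rather than the literal code; the error values match
the diff below, the retry label is hypothetical):

	/* shmem swapin on the SWP_SYNCHRONOUS_IO path, simplified */
	folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
	if (IS_ERR(folio)) {
		/*
		 * swapcache_prepare() failed with -EEXIST. When the cause
		 * is a readahead order 0 folio holding SWAP_HAS_CACHE on
		 * one of the entries, that folio never goes away by itself
		 * here, so retrying the swapin spins forever.
		 */
		if (PTR_ERR(folio) == -EEXIST)
			goto retry;
	}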
Fix it the same way the anon mTHP swapin path handles this conflict:
fall back to order 0 swapin when a conflicting folio is found in the
swap cache.
The performance change is very slight: time to swap in 10G of zero
folios (tested 12 times):
Before: 2.49s
After: 2.52s
Fixes: 1dd44c0af4fa1 ("mm: shmem: skip swapcache for swapin of synchronous swap device")
Signed-off-by: Kairui Song <kasong@tencent.com>
---
I found this issue while doing a performance comparison of mm-new
against the swap table series [1] on top of mm-new. This issue no longer
exists once the swap table series is applied, because it eliminates both
SWAP_HAS_CACHE and the SWP_SYNCHRONOUS_IO swapin path completely while
improving performance and simplifying the code, and the swapin race is
solved differently there.
(The zeromap fix might still need to stay for a while, but it could also
be optimized later with the swap table.)
It would be good if the swap table series could get reviewed and merged
to avoid more fixes like this; SWAP_HAS_CACHE and SWP_SYNCHRONOUS_IO
have a history of causing many issues. I'll rebase the swap table series
on top of this fix if it is accepted.
And for a comparison, swapping 10G into shmem:
Before this patch: 2.49s
After this patch: 2.52s
After swap table: 2.37s (removing SWAP_HAS_CACHE and SWP_SYNCHRONOUS_IO;
still not in the best shape, but looking good)
Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [1]
mm/memory.c | 20 --------------------
mm/shmem.c | 12 +++++++++++-
mm/swap.h | 19 +++++++++++++++++++
3 files changed, 30 insertions(+), 21 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 9ead7ab07e8e..3845ed068d74 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4313,26 +4313,6 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
-{
- struct swap_info_struct *si = swp_swap_info(entry);
- pgoff_t offset = swp_offset(entry);
- int i;
-
- /*
- * While allocating a large folio and doing swap_read_folio, which is
- * the case the being faulted pte doesn't have swapcache. We need to
- * ensure all PTEs have no cache as well, otherwise, we might go to
- * swap devices while the content is in swapcache.
- */
- for (i = 0; i < max_nr; i++) {
- if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
- return i;
- }
-
- return i;
-}
-
/*
* Check if the PTEs within a range are contiguous swap entries
* and have consistent swapcache, zeromap.
diff --git a/mm/shmem.c b/mm/shmem.c
index 73182e904f9c..484cd3043a78 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1995,6 +1995,14 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
*/
if (swapcache_prepare(entry, nr_pages)) {
folio_put(new);
+
+ /*
+ * A smaller folio is in the swap cache, mTHP swapin will always fail
+ * until it's gone. Return -EINVAL to fallback to order 0.
+ */
+ if (non_swapcache_batch(entry, nr_pages) != nr_pages)
+ return ERR_PTR(-EINVAL);
+
return ERR_PTR(-EEXIST);
}
@@ -2256,6 +2264,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
folio = swap_cache_get_folio(swap, NULL, 0);
order = xa_get_order(&mapping->i_pages, index);
if (!folio) {
+ int nr_pages = 1 << order;
bool fallback_order0 = false;
/* Or update major stats only when swapin succeeds?? */
@@ -2271,7 +2280,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
* to swapin order-0 folio, as well as for zswap case.
*/
if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
- !zswap_never_enabled()))
+ !zswap_never_enabled() ||
+ nr_pages != swap_zeromap_batch(swap, nr_pages, NULL)))
fallback_order0 = true;
/* Skip swapcache for synchronous device. */
diff --git a/mm/swap.h b/mm/swap.h
index e87a0f19a0ee..2d8ce1102153 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -108,6 +108,25 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
return find_next_bit(sis->zeromap, end, start) - start;
}
+static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
+{
+ struct swap_info_struct *si = swp_swap_info(entry);
+ pgoff_t offset = swp_offset(entry);
+ int i;
+
+ /*
+ * While allocating a large folio and doing mTHP swapin, we need to
+ * ensure all entries are not cached, otherwise, the mTHP folio will
+ * be in conflict with the folio in swap cache.
+ */
+ for (i = 0; i < max_nr; i++) {
+ if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
+ return i;
+ }
+
+ return i;
+}
+
#else /* CONFIG_SWAP */
struct swap_iocb;
static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
--
2.49.0
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin
2025-06-08 19:27 [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin Kairui Song
@ 2025-06-08 21:44 ` kernel test robot
2025-06-08 23:57 ` Barry Song
2025-06-09 8:27 ` Baolin Wang
2 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2025-06-08 21:44 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: llvm, oe-kbuild-all, Andrew Morton, Linux Memory Management List,
Hugh Dickins, Baolin Wang, Kemeng Shi, Chris Li, Nhat Pham,
Baoquan He, Barry Song, Usama Arif, linux-kernel, Kairui Song
Hi Kairui,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-shmem-swap-fix-softlockup-with-mTHP-swapin/20250609-032924
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250608192713.95875-1-ryncsn%40gmail.com
patch subject: [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin
config: arm-randconfig-003-20250609 (https://download.01.org/0day-ci/archive/20250609/202506090525.yM9XIl8O-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project f819f46284f2a79790038e1f6649172789734ae8)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250609/202506090525.yM9XIl8O-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506090525.yM9XIl8O-lkp@intel.com/
All errors (new ones prefixed by >>):
| ^
include/linux/huge_mm.h:108:28: note: expanded from macro 'HPAGE_PMD_SHIFT'
108 | #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
| ^
note: (skipping 2 expansions in backtrace; use -fmacro-backtrace-limit=0 to see all)
include/linux/compiler_types.h:565:2: note: expanded from macro 'compiletime_assert'
565 | __compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:547:4: note: expanded from macro '__compiletime_assert'
547 | __compiletime_error(msg); \
| ^
include/linux/compiler_attributes.h:138:56: note: expanded from macro '__compiletime_error'
138 | # define __compiletime_error(msg) __attribute__((__error__(msg)))
| ^
In file included from mm/shmem.c:24:
In file included from include/linux/fs.h:7:
In file included from include/linux/wait_bit.h:8:
In file included from include/linux/wait.h:7:
include/linux/list.h:37:2: warning: attribute 'error' is already applied with different arguments [-Wignored-attributes]
37 | WRITE_ONCE(list->next, list);
| ^
include/asm-generic/rwonce.h:60:2: note: expanded from macro 'WRITE_ONCE'
60 | compiletime_assert_rwonce_type(x); \
| ^
include/asm-generic/rwonce.h:36:2: note: expanded from macro 'compiletime_assert_rwonce_type'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^
include/linux/compiler_types.h:565:2: note: expanded from macro 'compiletime_assert'
565 | __compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:547:4: note: expanded from macro '__compiletime_assert'
547 | __compiletime_error(msg); \
| ^
include/linux/compiler_attributes.h:138:56: note: expanded from macro '__compiletime_error'
138 | # define __compiletime_error(msg) __attribute__((__error__(msg)))
| ^
mm/shmem.c:1904:20: note: previous attribute is here
1904 | count_vm_event(THP_FILE_FALLBACK);
| ^
include/linux/vm_event_item.h:196:30: note: expanded from macro 'THP_FILE_FALLBACK'
196 | #define THP_FILE_FALLBACK ({ BUILD_BUG(); 0; })
| ^
include/linux/build_bug.h:59:21: note: expanded from macro 'BUILD_BUG'
59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
| ^
include/linux/build_bug.h:39:37: note: expanded from macro 'BUILD_BUG_ON_MSG'
39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
| ^
include/linux/compiler_types.h:565:2: note: expanded from macro 'compiletime_assert'
565 | __compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:547:4: note: expanded from macro '__compiletime_assert'
547 | __compiletime_error(msg); \
| ^
include/linux/compiler_attributes.h:138:56: note: expanded from macro '__compiletime_error'
138 | # define __compiletime_error(msg) __attribute__((__error__(msg)))
| ^
In file included from mm/shmem.c:24:
In file included from include/linux/fs.h:7:
In file included from include/linux/wait_bit.h:8:
In file included from include/linux/wait.h:7:
include/linux/list.h:37:2: warning: attribute 'error' is already applied with different arguments [-Wignored-attributes]
37 | WRITE_ONCE(list->next, list);
| ^
include/asm-generic/rwonce.h:60:2: note: expanded from macro 'WRITE_ONCE'
60 | compiletime_assert_rwonce_type(x); \
| ^
include/asm-generic/rwonce.h:36:2: note: expanded from macro 'compiletime_assert_rwonce_type'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^
include/linux/compiler_types.h:565:2: note: expanded from macro 'compiletime_assert'
565 | __compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:547:4: note: expanded from macro '__compiletime_assert'
547 | __compiletime_error(msg); \
| ^
include/linux/compiler_attributes.h:138:56: note: expanded from macro '__compiletime_error'
138 | # define __compiletime_error(msg) __attribute__((__error__(msg)))
| ^
mm/shmem.c:1905:20: note: previous attribute is here
1905 | count_vm_event(THP_FILE_FALLBACK_CHARGE);
| ^
include/linux/vm_event_item.h:197:37: note: expanded from macro 'THP_FILE_FALLBACK_CHARGE'
197 | #define THP_FILE_FALLBACK_CHARGE ({ BUILD_BUG(); 0; })
| ^
include/linux/build_bug.h:59:21: note: expanded from macro 'BUILD_BUG'
59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
| ^
include/linux/build_bug.h:39:37: note: expanded from macro 'BUILD_BUG_ON_MSG'
39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
| ^
include/linux/compiler_types.h:565:2: note: expanded from macro 'compiletime_assert'
565 | __compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:547:4: note: expanded from macro '__compiletime_assert'
547 | __compiletime_error(msg); \
| ^
include/linux/compiler_attributes.h:138:56: note: expanded from macro '__compiletime_error'
138 | # define __compiletime_error(msg) __attribute__((__error__(msg)))
| ^
>> mm/shmem.c:2003:7: error: call to undeclared function 'non_swapcache_batch'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
2003 | if (non_swapcache_batch(entry, nr_pages) != nr_pages)
| ^
In file included from mm/shmem.c:24:
In file included from include/linux/fs.h:7:
In file included from include/linux/wait_bit.h:8:
In file included from include/linux/wait.h:7:
include/linux/list.h:37:2: warning: attribute 'error' is already applied with different arguments [-Wignored-attributes]
37 | WRITE_ONCE(list->next, list);
| ^
include/asm-generic/rwonce.h:60:2: note: expanded from macro 'WRITE_ONCE'
60 | compiletime_assert_rwonce_type(x); \
| ^
include/asm-generic/rwonce.h:36:2: note: expanded from macro 'compiletime_assert_rwonce_type'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^
include/linux/compiler_types.h:565:2: note: expanded from macro 'compiletime_assert'
565 | __compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:547:4: note: expanded from macro '__compiletime_assert'
547 | __compiletime_error(msg); \
| ^
include/linux/compiler_attributes.h:138:56: note: expanded from macro '__compiletime_error'
138 | # define __compiletime_error(msg) __attribute__((__error__(msg)))
| ^
mm/shmem.c:2531:20: note: previous attribute is here
2531 | count_vm_event(THP_FILE_ALLOC);
| ^
include/linux/vm_event_item.h:195:27: note: expanded from macro 'THP_FILE_ALLOC'
195 | #define THP_FILE_ALLOC ({ BUILD_BUG(); 0; })
| ^
include/linux/build_bug.h:59:21: note: expanded from macro 'BUILD_BUG'
59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
| ^
include/linux/build_bug.h:39:37: note: expanded from macro 'BUILD_BUG_ON_MSG'
39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
| ^
include/linux/compiler_types.h:565:2: note: expanded from macro 'compiletime_assert'
565 | __compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:547:4: note: expanded from macro '__compiletime_assert'
547 | __compiletime_error(msg); \
| ^
include/linux/compiler_attributes.h:138:56: note: expanded from macro '__compiletime_error'
138 | # define __compiletime_error(msg) __attribute__((__error__(msg)))
| ^
In file included from mm/shmem.c:24:
In file included from include/linux/fs.h:7:
In file included from include/linux/wait_bit.h:8:
In file included from include/linux/wait.h:7:
include/linux/list.h:37:2: warning: attribute 'error' is already applied with different arguments [-Wignored-attributes]
37 | WRITE_ONCE(list->next, list);
| ^
include/asm-generic/rwonce.h:60:2: note: expanded from macro 'WRITE_ONCE'
60 | compiletime_assert_rwonce_type(x); \
| ^
include/asm-generic/rwonce.h:36:2: note: expanded from macro 'compiletime_assert_rwonce_type'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^
include/linux/compiler_types.h:565:2: note: expanded from macro 'compiletime_assert'
565 | __compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:547:4: note: expanded from macro '__compiletime_assert'
547 | __compiletime_error(msg); \
| ^
include/linux/compiler_attributes.h:138:56: note: expanded from macro '__compiletime_error'
138 | # define __compiletime_error(msg) __attribute__((__error__(msg)))
| ^
mm/shmem.c:2790:15: note: previous attribute is here
2790 | hpage_size = HPAGE_PMD_SIZE;
| ^
include/linux/huge_mm.h:115:34: note: expanded from macro 'HPAGE_PMD_SIZE'
115 | #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
| ^
include/linux/huge_mm.h:108:28: note: expanded from macro 'HPAGE_PMD_SHIFT'
108 | #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
| ^
include/linux/build_bug.h:59:21: note: expanded from macro 'BUILD_BUG'
59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
| ^
note: (skipping 1 expansions in backtrace; use -fmacro-backtrace-limit=0 to see all)
include/linux/compiler_types.h:565:2: note: expanded from macro 'compiletime_assert'
565 | __compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:547:4: note: expanded from macro '__compiletime_assert'
547 | __compiletime_error(msg); \
| ^
include/linux/compiler_attributes.h:138:56: note: expanded from macro '__compiletime_error'
138 | # define __compiletime_error(msg) __attribute__((__error__(msg)))
| ^
In file included from mm/shmem.c:24:
In file included from include/linux/fs.h:7:
In file included from include/linux/wait_bit.h:8:
In file included from include/linux/wait.h:7:
include/linux/list.h:37:2: warning: attribute 'error' is already applied with different arguments [-Wignored-attributes]
37 | WRITE_ONCE(list->next, list);
| ^
include/asm-generic/rwonce.h:60:2: note: expanded from macro 'WRITE_ONCE'
60 | compiletime_assert_rwonce_type(x); \
| ^
include/asm-generic/rwonce.h:36:2: note: expanded from macro 'compiletime_assert_rwonce_type'
vim +/non_swapcache_batch +2003 mm/shmem.c
1954
1955	static struct folio *shmem_swap_alloc_folio(struct inode *inode,
1956			struct vm_area_struct *vma, pgoff_t index,
1957			swp_entry_t entry, int order, gfp_t gfp)
1958	{
1959		struct shmem_inode_info *info = SHMEM_I(inode);
1960		struct folio *new;
1961		void *shadow;
1962		int nr_pages;
1963
1964		/*
1965		 * We have arrived here because our zones are constrained, so don't
1966		 * limit chance of success with further cpuset and node constraints.
1967		 */
1968		gfp &= ~GFP_CONSTRAINT_MASK;
1969		if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && order > 0) {
1970			gfp_t huge_gfp = vma_thp_gfp_mask(vma);
1971
1972			gfp = limit_gfp_mask(huge_gfp, gfp);
1973		}
1974
1975		new = shmem_alloc_folio(gfp, order, info, index);
1976		if (!new)
1977			return ERR_PTR(-ENOMEM);
1978
1979		nr_pages = folio_nr_pages(new);
1980		if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
1981						   gfp, entry)) {
1982			folio_put(new);
1983			return ERR_PTR(-ENOMEM);
1984		}
1985
1986		/*
1987		 * Prevent parallel swapin from proceeding with the swap cache flag.
1988		 *
1989		 * Of course there is another possible concurrent scenario as well,
1990		 * that is to say, the swap cache flag of a large folio has already
1991		 * been set by swapcache_prepare(), while another thread may have
1992		 * already split the large swap entry stored in the shmem mapping.
1993		 * In this case, shmem_add_to_page_cache() will help identify the
1994		 * concurrent swapin and return -EEXIST.
1995		 */
1996		if (swapcache_prepare(entry, nr_pages)) {
1997			folio_put(new);
1998
1999			/*
2000			 * A smaller folio is in the swap cache, mTHP swapin will always fail
2001			 * until it's gone. Return -EINVAL to fallback to order 0.
2002			 */
> 2003			if (non_swapcache_batch(entry, nr_pages) != nr_pages)
2004				return ERR_PTR(-EINVAL);
2005
2006			return ERR_PTR(-EEXIST);
2007		}
2008
2009		__folio_set_locked(new);
2010		__folio_set_swapbacked(new);
2011		new->swap = entry;
2012
2013		memcg1_swapin(entry, nr_pages);
2014		shadow = get_shadow_from_swap_cache(entry);
2015		if (shadow)
2016			workingset_refault(new, shadow);
2017		folio_add_lru(new);
2018		swap_read_folio(new, NULL);
2019		return new;
2020	}
2021
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
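The failing randconfig most likely has CONFIG_SWAP disabled: the patch
adds non_swapcache_batch() inside the CONFIG_SWAP section of mm/swap.h,
so mm/shmem.c sees no declaration on such configs. A plausible fix (an
assumption on my part, not something stated in this report) is a no-op
stub in the #else /* CONFIG_SWAP */ branch:

	static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
	{
		return 0;
	}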
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin
2025-06-08 19:27 [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin Kairui Song
2025-06-08 21:44 ` kernel test robot
@ 2025-06-08 23:57 ` Barry Song
2025-06-09 2:31 ` Kairui Song
2025-06-09 8:27 ` Baolin Wang
2 siblings, 1 reply; 11+ messages in thread
From: Barry Song @ 2025-06-08 23:57 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Hugh Dickins, Baolin Wang, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Usama Arif, linux-kernel
On Mon, Jun 9, 2025 at 7:27 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> [...]
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 73182e904f9c..484cd3043a78 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1995,6 +1995,14 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
> */
> if (swapcache_prepare(entry, nr_pages)) {
> folio_put(new);
> +
> + /*
> + * A smaller folio is in the swap cache, mTHP swapin will always fail
> + * until it's gone. Return -EINVAL to fallback to order 0.
> + */
> + if (non_swapcache_batch(entry, nr_pages) != nr_pages)
> + return ERR_PTR(-EINVAL);
> +
We're doing this before swapcache_prepare() for mTHP swapin. Why does it
happen after swapcache_prepare() in the shmem case?
> return ERR_PTR(-EEXIST);
> }
>
> @@ -2256,6 +2264,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> folio = swap_cache_get_folio(swap, NULL, 0);
> order = xa_get_order(&mapping->i_pages, index);
> if (!folio) {
> + int nr_pages = 1 << order;
> bool fallback_order0 = false;
>
> /* Or update major stats only when swapin succeeds?? */
> @@ -2271,7 +2280,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> * to swapin order-0 folio, as well as for zswap case.
> */
> if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
> - !zswap_never_enabled()))
> + !zswap_never_enabled() ||
> + nr_pages != swap_zeromap_batch(swap, nr_pages, NULL)))
> fallback_order0 = true;
I mean, why don't we reject large folios at this point instead?
Because if we do the check here, we might end up with a small folio in
swapcache afterward?
>
> [...]
Thanks
Barry
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin
2025-06-08 23:57 ` Barry Song
@ 2025-06-09 2:31 ` Kairui Song
2025-06-09 4:29 ` Barry Song
0 siblings, 1 reply; 11+ messages in thread
From: Kairui Song @ 2025-06-09 2:31 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Andrew Morton, Hugh Dickins, Baolin Wang, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Usama Arif, linux-kernel
On Mon, Jun 9, 2025 at 7:58 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Jun 9, 2025 at 7:27 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > [...]
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 73182e904f9c..484cd3043a78 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1995,6 +1995,14 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
> > */
> > if (swapcache_prepare(entry, nr_pages)) {
> > folio_put(new);
> > +
> > + /*
> > + * A smaller folio is in the swap cache, mTHP swapin will always fail
> > + * until it's gone. Return -EINVAL to fallback to order 0.
> > + */
> > + if (non_swapcache_batch(entry, nr_pages) != nr_pages)
> > + return ERR_PTR(-EINVAL);
> > +
Hi Barry,
> We're doing this before swapcache_prepare() for mTHP swapin. Why does it
> happen after swapcache_prepare() in the shmem case?
`non_swapcache_batch(entry, nr_pages) != nr_pages` is unlikely; that's
the reason no one noticed this issue so far, so moving it after
swapcache_prepare helps avoid the overhead it adds in the common case.
swapcache_prepare already implies this check, but swapcache_prepare can
fail for multiple reasons, and shmem should fall back to order 0 swapin
only when the failure is caused by an existing cache (currently shmem
retries unconditionally).
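In other words, the intended ordering is (condensed from the patch,
using the same names as in the diff):

	if (swapcache_prepare(entry, nr_pages)) {	/* rare failure path */
		folio_put(new);
		/* Only the failure path pays for the O(nr_pages) scan. */
		if (non_swapcache_batch(entry, nr_pages) != nr_pages)
			return ERR_PTR(-EINVAL);	/* cache conflict: fall back */
		return ERR_PTR(-EEXIST);	/* racing swapin: caller retries */
	}
	/* Common case: swapcache_prepare() succeeded, no scan at all. */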
And non_swapcache_batch might not be the best solution here; it can
also have false positives. We could add a full filemap lookup instead,
but that might be overkill for a corner case like this. I still think
merging the swap cache with swap_map using the swap table is the
long-term solution.
Maybe I'm optimizing it prematurely. I can use the easier-to-review
implementation (the same way as anon mTHP) and do a quick benchmark;
if there is no obvious performance change I'll use that style in V2.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin
2025-06-09 2:31 ` Kairui Song
@ 2025-06-09 4:29 ` Barry Song
2025-06-09 8:29 ` Kairui Song
0 siblings, 1 reply; 11+ messages in thread
From: Barry Song @ 2025-06-09 4:29 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Hugh Dickins, Baolin Wang, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Usama Arif, linux-kernel
On Mon, Jun 9, 2025 at 2:32 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, Jun 9, 2025 at 7:58 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Mon, Jun 9, 2025 at 7:27 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > > [...]
> > > >
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index 73182e904f9c..484cd3043a78 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -1995,6 +1995,14 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
> > > */
> > > if (swapcache_prepare(entry, nr_pages)) {
> > > folio_put(new);
> > > +
> > > + /*
> > > + * A smaller folio is in the swap cache, mTHP swapin will always fail
> > > + * until it's gone. Return -EINVAL to fallback to order 0.
> > > + */
> > > + if (non_swapcache_batch(entry, nr_pages) != nr_pages)
> > > + return ERR_PTR(-EINVAL);
> > > +
>
> Hi Barry,
>
> > We're doing this before swapcache_prepare() for mTHP swapin. Why does it
> > happen after swapcache_prepare() in the shmem case?
>
> `non_swapcache_batch(entry, nr_pages) != nr_pages` is unlikely; that's
> the reason no one noticed this issue so far, so moving it after
> swapcache_prepare helps avoid the overhead it adds in the common case.
> swapcache_prepare already implies this check, but swapcache_prepare can
> fail for multiple reasons, and shmem should fall back to order 0 swapin
> only when the failure is caused by an existing cache (currently shmem
> retries unconditionally).
Maybe it's because people are running it on systems with plenty of memory?
Once we run it on a system with limited memory, we might see more failures
allocating large folios and fall back to order-0 more often?
For example, what if there's a 50% chance of failing to allocate large
folios?
>
> And non_swapcache_batch might not be the best solution here; it can
> also have false positives. We could add a full filemap lookup instead,
> but that might be overkill for a corner case like this. I still think
> merging the swap cache with swap_map using the swap table is the
> long-term solution.
>
> Maybe I'm optimizing it prematurely. I can use the easier-to-review
> implementation (the same way as anon mTHP) and do a quick benchmark;
> if there is no obvious performance change I'll use that style in V2.
Right, the current approach is a bit hard to follow, since we ultimately
change the return value from -EEXIST to -EINVAL. It does feel like there’s
some back-and-forth. But anyway, let's look at the data; if the current
approach yields better results, we can refine the code comments to make
it easier to understand.
Thanks
Barry
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin
2025-06-08 19:27 [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin Kairui Song
2025-06-08 21:44 ` kernel test robot
2025-06-08 23:57 ` Barry Song
@ 2025-06-09 8:27 ` Baolin Wang
2025-06-09 8:36 ` Kairui Song
2 siblings, 1 reply; 11+ messages in thread
From: Baolin Wang @ 2025-06-09 8:27 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Kemeng Shi, Chris Li, Nhat Pham,
Baoquan He, Barry Song, Usama Arif, linux-kernel
On 2025/6/9 03:27, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> [...]
>
> The reason is simple: readahead brought some order 0 folios into the swap
> cache, and the mTHP folio being allocated for swapin conflicts with them,
> so swapcache_prepare fails and causes shmem_swap_alloc_folio to return
> -EEXIST, and shmem simply retries again and again, causing this loop.
If swapcache_prepare() fails and we retry, the order of the folio
(order 0) obtained from the swapcache will differ from the order stored
in the shmem mapping, so we will split the large swap entry via the
following logic in shmem_swapin_folio(). So I am not sure why this
causes a softlockup?
	} else if (order != folio_order(folio)) {
		/*
		 * Swap readahead may swap in order 0 folios into swapcache
		 * asynchronously, while the shmem mapping can still stores
		 * large swap entries. In such cases, we should split the
		 * large swap entry to prevent possible data corruption.
		 */
		split_order = shmem_split_large_entry(inode, index, swap, gfp);
		if (split_order < 0) {
			error = split_order;
			goto failed;
		}

		/*
		 * If the large swap entry has already been split, it is
		 * necessary to recalculate the new swap entry based on
		 * the old order alignment.
		 */
		if (split_order > 0) {
			pgoff_t offset = index - round_down(index, 1 << split_order);

			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
		}
	}
>
> [...]
>
> (The zeromap fix might still need to stay for a while, but it could also
> be optimized later with the swap table.)
I don't understand why the zeromap changes are added; this should be
explained explicitly.
> [...]
* Re: [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin
2025-06-09 4:29 ` Barry Song
@ 2025-06-09 8:29 ` Kairui Song
0 siblings, 0 replies; 11+ messages in thread
From: Kairui Song @ 2025-06-09 8:29 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Andrew Morton, Hugh Dickins, Baolin Wang, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Usama Arif, linux-kernel
On Mon, Jun 9, 2025 at 12:30 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Jun 9, 2025 at 2:32 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Mon, Jun 9, 2025 at 7:58 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Mon, Jun 9, 2025 at 7:27 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > [...]
> > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > index 73182e904f9c..484cd3043a78 100644
> > > > --- a/mm/shmem.c
> > > > +++ b/mm/shmem.c
> > > > @@ -1995,6 +1995,14 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
> > > > */
> > > > if (swapcache_prepare(entry, nr_pages)) {
> > > > folio_put(new);
> > > > +
> > > > + /*
> > > > + * A smaller folio is in the swap cache, mTHP swapin will always fail
> > > > + * until it's gone. Return -EINVAL to fallback to order 0.
> > > > + */
> > > > + if (non_swapcache_batch(entry, nr_pages) != nr_pages)
> > > > + return ERR_PTR(-EINVAL);
> > > > +
> >
> > Hi Barry,
> >
> > > We're doing this before swapcache_prepare() for mTHP swapin. Why does it
> > > happen after swapcache_prepare() in the shmem case?
> >
> > `non_swapcache_batch(entry, nr_pages) != nr_pages` is unlikely, which
> > is why no one noticed this issue so far, so moving it after
> > swapcache_prepare helps avoid the overhead it causes in the common
> > case. swapcache_prepare already implies this check, but
> > swapcache_prepare can fail for multiple reasons, and shmem should
> > fall back to order 0 swapin only if the failure is caused by an
> > existing cache entry. (Currently shmem retries unconditionally.)
>
> Maybe it's because people are running it on systems with plenty of memory?
> Once we run it on a system with limited memory, we might see more failures
> allocating large folios and fall back to order-0 more often?
> For example, what if there's a 50% chance of failing to allocate large
> folios?
Under memory pressure the swap cache pages get reclaimed, so it is
also less likely to hit this issue then; it takes a situation where
some swapin falls back to order 0 while memory pressure is not high
enough to cause swap cache reclaim. So in most cases it's unlikely to
run into conflicting small folios, but the chance is high enough to
trigger real issues; the reproducer shell script can trigger this
within 5 minutes on my machine.
>
> >
> > And non_swapcache_batch might not be the best solution here; it also
> > might have false positives. We could add a full filemap lookup here,
> > but that might be overkill for a corner case like this. I still think
> > merging the swap cache with swap_map using the swap table is the
> > long-term solution.
> >
> > Maybe I'm prematurely optimizing it. I can use the easier-to-review
> > implementation (the same way as anon mTHP) and do a quick benchmark;
> > if there is no obvious performance change I'll use that style in V2.
>
> Right, the current approach is a bit hard to follow, since we ultimately
> change the return value from -EEXIST to -EINVAL. It does feel like there’s
> some back-and-forth. But anyway, let’s look at the data—if the current
> approach yields better results, we can refine the code comments to make
> it easier to understand.
After more testing the performance change seems trivial, and
considering that checking SWAP_HAS_CACHE before swapcache_prepare
could help avoid taking the swap cluster lock in certain workloads,
let me use that approach in V2 later today.
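
For reference, here is a minimal sketch of the reordered flow being
discussed for V2 (an assumption based on this thread, not the posted
patch; helper and variable names are taken from the diff above and
from shmem_swap_alloc_folio()):

        new = shmem_alloc_folio(gfp, order, info, index);
        if (!new)
                return ERR_PTR(-ENOMEM);

        /*
         * Scan for conflicting cached entries first, so a conflict
         * falls back to order 0 before swapcache_prepare() is tried.
         */
        if (nr_pages > 1 && non_swapcache_batch(entry, nr_pages) != nr_pages) {
                folio_put(new);
                return ERR_PTR(-EINVAL);
        }

        /* A failure here is most likely transient, so retrying on
         * -EEXIST is safe. */
        if (swapcache_prepare(entry, nr_pages)) {
                folio_put(new);
                return ERR_PTR(-EEXIST);
        }

The scan is still racy against a folio being added right after it, but
a stale result only costs one extra -EEXIST retry round before the
next scan catches the conflict.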
* Re: [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin
2025-06-09 8:27 ` Baolin Wang
@ 2025-06-09 8:36 ` Kairui Song
2025-06-09 8:49 ` Baolin Wang
0 siblings, 1 reply; 11+ messages in thread
From: Kairui Song @ 2025-06-09 8:36 UTC (permalink / raw)
To: Baolin Wang
Cc: linux-mm, Andrew Morton, Hugh Dickins, Kemeng Shi, Chris Li,
Nhat Pham, Baoquan He, Barry Song, Usama Arif, linux-kernel
On Mon, Jun 9, 2025 at 4:27 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
> On 2025/6/9 03:27, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > [...]
> >
> > The reason is simple, readahead brought some order 0 folio in swap
> > cache, and the swapin mTHP folio being allocated is in confict with it,
> > so swapcache_prepare fails and causes shmem_swap_alloc_folio to return
> > -EEXIST, and shmem simply retries again and again causing this loop.
>
> If swapcache_prepare() fails and retries, the folio's order (order 0)
> we get from the swapcache will differ from the order stored in the
> shmem mapping, so we will split the large swap entry with the
> following logic in shmem_swapin_folio(). So I am not sure why this
> causes a softlockup?
>
> } else if (order != folio_order(folio)) {
> /*
> * Swap readahead may swap in order 0 folios into swapcache
> * asynchronously, while the shmem mapping can still stores
> * large swap entries. In such cases, we should split the
> * large swap entry to prevent possible data corruption.
> */
> split_order = shmem_split_large_entry(inode, index, swap, gfp);
> if (split_order < 0) {
> error = split_order;
> goto failed;
> }
>
> /*
> * If the large swap entry has already been split, it is
> * necessary to recalculate the new swap entry based on
> * the old order alignment.
> */
> if (split_order > 0) {
> pgoff_t offset = index - round_down(index, 1 << split_order);
>
> swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> }
> }
For example, if the swap entry in shmem is 0x0 with order 4 (so it
corresponds to swap entries 0x0 - 0x10), and an order 0 folio is
currently cached at swap entry 0xa, then shmem swapin will try to use
an order 4 folio, which will always fail swapcache_prepare, but a
filemap/swapcache lookup using entry 0x0 will return NULL, causing a
loop.
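
To make the loop concrete, here is how non_swapcache_batch() from the
patch behaves in that scenario (a worked example, not new code):

        /*
         * entry covers offsets 0x0..0xf (order 4, nr_pages == 16), and
         * readahead left an order 0 folio cached at offset 0xa:
         *
         *   swap_map[0x0..0x9]: SWAP_HAS_CACHE clear -> keep scanning
         *   swap_map[0xa]:      SWAP_HAS_CACHE set   -> return 10
         *
         * 10 != 16, so shmem_swap_alloc_folio() now returns -EINVAL and
         * shmem falls back to order 0, instead of retrying the order 4
         * swapin forever on -EEXIST.
         */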
>
> >
> > Fix it by applying a similar fix for anon mTHP swapin.
> >
> > [...]
> >
> > (The zeromap fix might still need to stay for a while, but it could
> > also be optimized later with the swap table).
>
> I don't understand why the zeromap changes are being added; this
> should be explained explicitly.
To stay consistent with anon mTHP swapin: swap_zeromap_batch has its
own comment explaining that a hybrid folio with zero and non-zero
pages can't be brought back as a whole. I can mention that in the
commit message.
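
For context, the relevant behavior of swap_zeromap_batch() as used by
the patch (paraphrased here; the comment in mm/swap.h is the
authoritative description):

        /*
         * swap_zeromap_batch() returns how many contiguous entries
         * starting at the given one share the zeromap status of the
         * first, so a return value < max_nr means the range mixes
         * zero and non-zero pages and cannot be swapped in as one
         * large folio.
         */
        if (nr_pages != swap_zeromap_batch(swap, nr_pages, NULL))
                fallback_order0 = true;  /* as done in the patch */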
* Re: [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin
2025-06-09 8:36 ` Kairui Song
@ 2025-06-09 8:49 ` Baolin Wang
2025-06-09 8:55 ` Barry Song
0 siblings, 1 reply; 11+ messages in thread
From: Baolin Wang @ 2025-06-09 8:49 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Hugh Dickins, Kemeng Shi, Chris Li,
Nhat Pham, Baoquan He, Barry Song, Usama Arif, linux-kernel
On 2025/6/9 16:36, Kairui Song wrote:
> On Mon, Jun 9, 2025 at 4:27 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>> On 2025/6/9 03:27, Kairui Song wrote:
>>> [...]
>>>
>>> The reason is simple, readahead brought some order 0 folio in swap
>>> cache, and the swapin mTHP folio being allocated is in confict with it,
>>> so swapcache_prepare fails and causes shmem_swap_alloc_folio to return
>>> -EEXIST, and shmem simply retries again and again causing this loop.
>>
>> If swapcache_prepare() fails and retries, the folio's order (order 0)
>> we get from the swapcache will differ from the order stored in the
>> shmem mapping, so we will split the large swap entry with the
>> following logic in shmem_swapin_folio(). So I am not sure why this
>> causes a softlockup?
>>
>> [...]
>
> For example, if the swap entry in shmem is 0x0 with order 4 (so it
> corresponds to swap entries 0x0 - 0x10), and an order 0 folio is
> currently cached at swap entry 0xa, then shmem swapin will try to use
> an order 4 folio, which will always fail swapcache_prepare, but a
> filemap/swapcache lookup using entry 0x0 will return NULL, causing a
> loop.
OK. Thanks for the explanation.
>>> [...]
>>>
>>> (The zeromap fix might still need to stay for a while, but it could
>>> also be optimized later with the swap table).
>>
>> I don't understand why the zeromap changes are being added; this
>> should be explained explicitly.
>
> To stay consistent with anon mTHP swapin: swap_zeromap_batch has its
> own comment explaining that a hybrid folio with zero and non-zero
> pages can't be brought back as a whole. I can mention that in the
> commit message.
Yes. Thanks.
* Re: [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin
2025-06-09 8:49 ` Baolin Wang
@ 2025-06-09 8:55 ` Barry Song
2025-06-09 9:28 ` Kairui Song
0 siblings, 1 reply; 11+ messages in thread
From: Barry Song @ 2025-06-09 8:55 UTC (permalink / raw)
To: Baolin Wang
Cc: Kairui Song, linux-mm, Andrew Morton, Hugh Dickins, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Usama Arif, linux-kernel
On Mon, Jun 9, 2025 at 8:49 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/6/9 16:36, Kairui Song wrote:
> > On Mon, Jun 9, 2025 at 4:27 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >> On 2025/6/9 03:27, Kairui Song wrote:
> > >>> [...]
> > >>> (The zeromap fix might still need to stay for a while, but it could
> > >>> also be optimized later with the swap table).
> >>
> > >> I don't understand why the zeromap changes are being added; this
> > >> should be explained explicitly.
> >
> > > To stay consistent with anon mTHP swapin: swap_zeromap_batch has its
> > > own comment explaining that a hybrid folio with zero and non-zero
> > > pages can't be brought back as a whole. I can mention that in the
> > > commit message.
For mTHP swapin, we need the zeromap check because we have no way to record
whether there was a prior mTHP swap-out. So we rely on checking the
continuity of swap offsets.
It’s entirely possible that, in the past, several small folios were
swapped out to consecutive locations, and one of them happened to be a
zero folio, while the others were not.
But for shmem, we have a place to record that information - we swapped
out an mTHP, right?
Regarding zeromap: for an mTHP swap-out, we currently can't mark subpages
individually as zeromap; it's either all-zero for every subpage or none are.
So maybe we don't need swap_zeromap_batch() for shmem?
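
To illustrate the point (an example for this thread, not text from the
patch): with a 16-page (order 4) mTHP swap-out, the zeromap bits of
the covered entries can only be uniform:

        zeromap: 1111 1111 1111 1111   /* whole mTHP was zero-filled */
        zeromap: 0000 0000 0000 0000   /* whole mTHP held data */

A mixed pattern such as 0000 0100 0000 0000 can only come from
independent order 0 swap-outs. Anon swapin has no record of how the
entries were written, so it must scan with swap_zeromap_batch(); shmem
does have a record (the large swap entry in the mapping), which is why
the batch check is redundant there.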
>
> Yes. Thanks.
Thanks
Barry
* Re: [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin
2025-06-09 8:55 ` Barry Song
@ 2025-06-09 9:28 ` Kairui Song
0 siblings, 0 replies; 11+ messages in thread
From: Kairui Song @ 2025-06-09 9:28 UTC (permalink / raw)
To: Barry Song
Cc: Baolin Wang, linux-mm, Andrew Morton, Hugh Dickins, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Usama Arif, linux-kernel
On Mon, Jun 9, 2025 at 4:55 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Jun 9, 2025 at 8:49 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> >
> >
> >
> > On 2025/6/9 16:36, Kairui Song wrote:
> > > [...]
>
> For mTHP swapin, we need the zeromap check because we have no way to record
> whether there was a prior mTHP swap-out. So we rely on checking the
> continuity of swap offsets.
>
> It’s entirely possible that, in the past, several small folios were
> swapped out to consecutive locations, and one of them happened to be a
> zero folio, while the others were not.
>
> But for shmem, we have a place to record that information - we swapped
> out an mTHP, right?
>
> Regarding zeromap: for an mTHP swap-out, we currently can't mark subpages
> individually as zeromap; it's either all-zero for every subpage or none are.
Thanks for the clarification! Yes, that's correct; I wasn't sure if
zeromap would mark subpages individually, so I just left the check
there. Will remove the check in V2.
> So maybe we don't need swap_zeromap_batch() for shmem?
Right, it's not needed here; the fix will be simpler.
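
For reference, the order 0 fallback condition in shmem_swapin_folio()
would then shrink back to its pre-patch form (a sketch of the expected
V2 shape, not posted code; it is simply the patch's hunk with the
zeromap condition dropped):

        /*
         * No zeromap check needed: shmem's large swap entry already
         * guarantees the whole range was swapped out as one folio.
         */
        if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
                          !zswap_never_enabled()))
                fallback_order0 = true;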
end of thread
Thread overview: 11+ messages
2025-06-08 19:27 [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin Kairui Song
2025-06-08 21:44 ` kernel test robot
2025-06-08 23:57 ` Barry Song
2025-06-09 2:31 ` Kairui Song
2025-06-09 4:29 ` Barry Song
2025-06-09 8:29 ` Kairui Song
2025-06-09 8:27 ` Baolin Wang
2025-06-09 8:36 ` Kairui Song
2025-06-09 8:49 ` Baolin Wang
2025-06-09 8:55 ` Barry Song
2025-06-09 9:28 ` Kairui Song