linux-mm.kvack.org archive mirror
* [PATCH 1/1] kasan, vmalloc: avoid lock contention when depopulating vmalloc
From: Adrian Huang @ 2024-09-25 13:47 UTC
  To: Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Uladzislau Rezki
  Cc: kasan-dev, linux-mm, linux-kernel, Adrian Huang

From: Adrian Huang <ahuang12@lenovo.com>

When running the test_vmalloc stress test on a 448-core server, the
following soft/hard lockups were observed and the OS eventually panicked.

1) Kernel config
   CONFIG_KASAN=y
   CONFIG_KASAN_VMALLOC=y

2) Reproduction command
   # modprobe test_vmalloc nr_threads=448 run_test_mask=0x1 nr_pages=8

3) OS Log: Detail is in [1].
   watchdog: BUG: soft lockup - CPU#258 stuck for 26s!
   RIP: 0010:native_queued_spin_lock_slowpath+0x504/0x940
   Call Trace:
    do_raw_spin_lock+0x1e7/0x270
    _raw_spin_lock+0x63/0x80
    kasan_depopulate_vmalloc_pte+0x3c/0x70
    apply_to_pte_range+0x127/0x4e0
    apply_to_pmd_range+0x19e/0x5c0
    apply_to_pud_range+0x167/0x510
    __apply_to_page_range+0x2b4/0x7c0
    kasan_release_vmalloc+0xc8/0xd0
    purge_vmap_node+0x190/0x980
    __purge_vmap_area_lazy+0x640/0xa60
    drain_vmap_area_work+0x23/0x30
    process_one_work+0x84a/0x1760
    worker_thread+0x54d/0xc60
    kthread+0x2a8/0x380
    ret_from_fork+0x2d/0x70
    ret_from_fork_asm+0x1a/0x30
   ...
   watchdog: Watchdog detected hard LOCKUP on cpu 8
   watchdog: Watchdog detected hard LOCKUP on cpu 42
   watchdog: Watchdog detected hard LOCKUP on cpu 10
   ...
   Shutting down cpus with NMI
   Kernel Offset: disabled
   pstore: backend (erst) writing error (-28)
   ---[ end Kernel panic - not syncing: Hard LOCKUP ]---

The issue can also be reproduced on a 192-core server and a 256-core
server.

[Root Cause]
The tight loop in kasan_release_vmalloc_node() iteratively calls
kasan_release_vmalloc() to clear the corresponding PTE, which
acquires/releases "init_mm.page_table_lock" in
kasan_depopulate_vmalloc_pte().

lock_stat shows that "init_mm.page_table_lock" is the most contended
lock. This lock_stat data was collected with the following command (with
a reduced thread count so that the OS does not panic), where the max
wait time is 600ms:

  # modprobe test_vmalloc nr_threads=150 run_test_mask=0x1 nr_pages=8

<snip>
------------------------------------------------------------------
class name con-bounces contentions waittime-min   waittime-max ...
------------------------------------------------------------------
init_mm.page_table_lock:  87859653 93020601  0.27 600304.90 ...
  -----------------------
  init_mm.page_table_lock  54332301  [<000000008ce229be>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
  init_mm.page_table_lock   6680902  [<000000009c0800ad>] __pte_alloc_kernel+0x9b/0x370
  init_mm.page_table_lock  31991077  [<00000000180bc35d>] kasan_depopulate_vmalloc_pte+0x3c/0x70
  init_mm.page_table_lock     16321  [<000000003ef0e79b>] __pmd_alloc+0x1d5/0x720
  -----------------------
  init_mm.page_table_lock  50278552  [<000000008ce229be>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
  init_mm.page_table_lock   5725380  [<000000009c0800ad>] __pte_alloc_kernel+0x9b/0x370
  init_mm.page_table_lock  36992410  [<00000000180bc35d>] kasan_depopulate_vmalloc_pte+0x3c/0x70
  init_mm.page_table_lock     24259  [<000000003ef0e79b>] __pmd_alloc+0x1d5/0x720
  ...
<snip>

[Solution]
After revisiting the code paths that set the kasan ptep (pte pointer),
it's unlikely that a kasan ptep is set and cleared simultaneously by
different CPUs. So, use ptep_get_and_clear() to get rid of the spinlock
operation.
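
As a rough illustration of the get-and-clear idea, here is a minimal
userspace sketch. It is not the kernel code: slot_t, slot_populate() and
slot_depopulate() are hypothetical names, and C11 atomic_exchange() is
used only as a stand-in for ptep_get_and_clear(). A zero slot plays the
role of pte_none(). Because the read and the clear happen in one atomic
step, at most one caller can observe the non-zero value and free it, so
no lock is needed around the check.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a page-table entry; 0 means "none". */
typedef _Atomic(uintptr_t) slot_t;

/* Install a freshly allocated buffer; returns 1 on success, 0 on failure. */
static int slot_populate(slot_t *slot, size_t size)
{
	void *buf = malloc(size);

	if (!buf)
		return 0;
	atomic_store(slot, (uintptr_t)buf);
	return 1;
}

/*
 * Fetch the old value and clear the slot in a single atomic step, then
 * free the buffer only if this caller observed a non-zero value.
 * Returns 1 if this call freed the buffer, 0 if the slot was empty.
 */
static int slot_depopulate(slot_t *slot)
{
	uintptr_t old = atomic_exchange(slot, (uintptr_t)0);

	if (old != 0) {
		free((void *)old);
		return 1;
	}
	return 0;
}
```

Calling slot_depopulate() twice on the same slot frees the buffer once:
the second call sees an empty slot and returns 0.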

With this change, the max wait time is 13ms with the following command
(all 448 cores fully stressed):

  # modprobe test_vmalloc nr_threads=448 run_test_mask=0x1 nr_pages=8

<snip>
------------------------------------------------------------------
class name con-bounces contentions waittime-min   waittime-max ...
------------------------------------------------------------------
init_mm.page_table_lock:  109999304  110008477  0.27  13534.76
  -----------------------
  init_mm.page_table_lock 109369156  [<000000001a135943>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
  init_mm.page_table_lock    637661  [<0000000051481d84>] __pte_alloc_kernel+0x9b/0x370
  init_mm.page_table_lock      1660  [<00000000a492cdc5>] __pmd_alloc+0x1d5/0x720
  -----------------------
  init_mm.page_table_lock 109410237  [<000000001a135943>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
  init_mm.page_table_lock    595016  [<0000000051481d84>] __pte_alloc_kernel+0x9b/0x370
  init_mm.page_table_lock      3224  [<00000000a492cdc5>] __pmd_alloc+0x1d5/0x720
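
For reference, the 600ms and 13ms figures follow from the waittime-max
columns of the two lock_stat listings, assuming those values are in
microseconds. A minimal sketch of the arithmetic is below; the helper
names are hypothetical. Since the two runs use different thread counts
(150 vs. 448), the ratio is only a rough indication, not a like-for-like
measurement.

```c
#include <stdio.h>

/* waittime-max values copied from the listings (assumed to be in us). */
#define WAITTIME_MAX_BEFORE_US 600304.90 /* nr_threads=150, unpatched */
#define WAITTIME_MAX_AFTER_US   13534.76 /* nr_threads=448, patched */

/* Convert microseconds to milliseconds. */
static double us_to_ms(double us)
{
	return us / 1000.0;
}

/* Ratio between the two maximum wait times. */
static double wait_ratio(double before_us, double after_us)
{
	return before_us / after_us;
}

/* Print both maxima in milliseconds along with their ratio. */
static void print_waittime_summary(void)
{
	printf("before: %.2f ms\n", us_to_ms(WAITTIME_MAX_BEFORE_US));
	printf("after:  %.2f ms\n", us_to_ms(WAITTIME_MAX_AFTER_US));
	printf("ratio:  %.1fx\n",
	       wait_ratio(WAITTIME_MAX_BEFORE_US, WAITTIME_MAX_AFTER_US));
}
```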

[More verifications on a 448-core server: Passed]
1) test_vmalloc module
   * Each test is run sequentially.

2) stress-ng
   * fork() and exit()
       # stress-ng --fork 448 --timeout 180
   * pthread
       # stress-ng --pthread 448 --timeout 180
   * fork()/exit() and pthread
       # stress-ng --pthread 448 --fork 448 --timeout 180

The above verifications were run repeatedly for more than 24 hours.

[1] https://gist.github.com/AdrianHuang/99d12986a465cc33a38c7a7ceeb6f507

Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
---
 mm/kasan/shadow.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 88d1c9dcb507..985356811aee 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -397,17 +397,13 @@ int kasan_populate_vmalloc(unsigned long addr, unsigned long size)
 static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 					void *unused)
 {
+	pte_t orig_pte = ptep_get_and_clear(&init_mm, addr, ptep);
 	unsigned long page;
 
-	page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
-
-	spin_lock(&init_mm.page_table_lock);
-
-	if (likely(!pte_none(ptep_get(ptep)))) {
-		pte_clear(&init_mm, addr, ptep);
+	if (likely(!pte_none(orig_pte))) {
+		page = (unsigned long)__va(pte_pfn(orig_pte) << PAGE_SHIFT);
 		free_page(page);
 	}
-	spin_unlock(&init_mm.page_table_lock);
 
 	return 0;
 }
-- 
2.34.1



Thread overview: 6+ messages
2024-09-25 13:47 [PATCH 1/1] kasan, vmalloc: avoid lock contention when depopulating vmalloc Adrian Huang
2024-09-25 20:47 ` Andrew Morton
2024-09-26 12:22   ` Huang Adrian
2024-09-26 16:16     ` Uladzislau Rezki
2024-09-30  9:49       ` Huang Adrian
2024-09-30 15:22         ` Uladzislau Rezki
