[PATCH 1/1] mm: protect xa split stuff under lruvec->lru

* [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
@ 2024-04-12  6:43 zhaoyang.huang
  2024-04-12 12:24 ` Matthew Wilcox
  2024-04-12 21:34 ` Andrew Morton
  0 siblings, 2 replies; 21+ messages in thread
From: zhaoyang.huang @ 2024-04-12  6:43 UTC (permalink / raw)
  To: Andrew Morton, Alex Shi, Kirill A . Shutemov, Hugh Dickins,
	Baolin Wang, linux-mm, linux-kernel, Zhaoyang Huang, steve.kang

From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

The livelock in [1] has been reported multiple times since v5.15, where a
zero-ref folio is repeatedly found in the page cache by find_get_entry. A
possible timing sequence is proposed in [2], which can be described briefly
as the lockless xarray operation being harmed by an invalid folio
remaining in slot[offset]. This commit protects the xa split steps
(folio_ref_freeze and __split_huge_page) under lruvec->lru_lock to remove
the race window.

[1]
[167789.800297] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[167726.780305] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-7): P155
[167726.780319] (detected by 3, t=17256977 jiffies, g=19883597, q=2397394)
[167726.780325] task:kswapd0         state:R  running task     stack:   24 pid:  155 ppid:     2 flags:0x00000008
[167789.800308] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-7): P155
[167789.800322] (detected by 3, t=17272732 jiffies, g=19883597, q=2397470)
[167789.800328] task:kswapd0         state:R  running task     stack:   24 pid:  155 ppid:     2 flags:0x00000008
[167789.800339] Call trace:
[167789.800342]  dump_backtrace.cfi_jt+0x0/0x8
[167789.800355]  show_stack+0x1c/0x2c
[167789.800363]  sched_show_task+0x1ac/0x27c
[167789.800370]  print_other_cpu_stall+0x314/0x4dc
[167789.800377]  check_cpu_stall+0x1c4/0x36c
[167789.800382]  rcu_sched_clock_irq+0xe8/0x388
[167789.800389]  update_process_times+0xa0/0xe0
[167789.800396]  tick_sched_timer+0x7c/0xd4
[167789.800404]  __run_hrtimer+0xd8/0x30c
[167789.800408]  hrtimer_interrupt+0x1e4/0x2d0
[167789.800414]  arch_timer_handler_phys+0x5c/0xa0
[167789.800423]  handle_percpu_devid_irq+0xbc/0x318
[167789.800430]  handle_domain_irq+0x7c/0xf0
[167789.800437]  gic_handle_irq+0x54/0x12c
[167789.800445]  call_on_irq_stack+0x40/0x70
[167789.800451]  do_interrupt_handler+0x44/0xa0
[167789.800457]  el1_interrupt+0x34/0x64
[167789.800464]  el1h_64_irq_handler+0x1c/0x2c
[167789.800470]  el1h_64_irq+0x7c/0x80
[167789.800474]  xas_find+0xb4/0x28c
[167789.800481]  find_get_entry+0x3c/0x178
[167789.800487]  find_lock_entries+0x98/0x2f8
[167789.800492]  __invalidate_mapping_pages.llvm.3657204692649320853+0xc8/0x224
[167789.800500]  invalidate_mapping_pages+0x18/0x28
[167789.800506]  inode_lru_isolate+0x140/0x2a4
[167789.800512]  __list_lru_walk_one+0xd8/0x204
[167789.800519]  list_lru_walk_one+0x64/0x90
[167789.800524]  prune_icache_sb+0x54/0xe0
[167789.800529]  super_cache_scan+0x160/0x1ec
[167789.800535]  do_shrink_slab+0x20c/0x5c0
[167789.800541]  shrink_slab+0xf0/0x20c
[167789.800546]  shrink_node_memcgs+0x98/0x320
[167789.800553]  shrink_node+0xe8/0x45c
[167789.800557]  balance_pgdat+0x464/0x814
[167789.800563]  kswapd+0xfc/0x23c
[167789.800567]  kthread+0x164/0x1c8
[167789.800573]  ret_from_fork+0x10/0x20

[2]
Thread_isolate:
1. alloc_contig_range->isolate_migratepages_block isolates a number of
pages to cc->migratepages via pfn
       (folio has refcount: 1 + n (alloc_pages, page_cache))

2. alloc_contig_range->migrate_pages->folio_ref_freeze(folio, 1 +
extra_pins) sets the folio->refcnt to 0

3. alloc_contig_range->migrate_pages->xas_split splits the folio across
the slots, so that it occupies slot[offset] to slot[offset + sibs]

4. alloc_contig_range->migrate_pages->__split_huge_page->folio_lruvec_lock
fails to take the lock, so the folio's refcnt is not set back to 2

5. Thread_kswapd enters the livelock via the chain below
      rcu_read_lock();
   retry:
        find_get_entry
            folio = xas_find
            if(!folio_try_get_rcu) {
                xas_reset;
                goto retry;
            }
      rcu_read_unlock();

5'. Thread_holdlock, as the lruvec->lru_lock holder, could be stalled on
the same core as Thread_kswapd.
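
The retry pattern in step 5 can be illustrated with a minimal userspace
sketch. This is not kernel code: struct toy_folio and the toy_* helpers are
hypothetical names that only mimic the freeze / speculative-get behaviour
described above using C11 atomics, and the retry loop is bounded so that it
terminates, unlike the lookup described in step 5.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio: only the reference count is modelled. */
struct toy_folio {
    atomic_int refcount;
};

/* Freeze: swap the count to 0, but only if it equals the expected value. */
static bool toy_ref_freeze(struct toy_folio *f, int expected)
{
    return atomic_compare_exchange_strong(&f->refcount, &expected, 0);
}

/* Unfreeze: publish a non-zero count again. */
static void toy_ref_unfreeze(struct toy_folio *f, int count)
{
    atomic_store(&f->refcount, count);
}

/* Speculative get: take a reference unless the count is currently 0. */
static bool toy_try_get(struct toy_folio *f)
{
    int old = atomic_load(&f->refcount);

    while (old != 0) {
        /* On failure, old is reloaded with the current count. */
        if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/*
 * Bounded version of the lookup retry loop. Returns the number of failed
 * attempts before a reference was taken, or -1 if the retry budget ran out
 * while the count stayed frozen at 0.
 */
static int toy_lookup(struct toy_folio *f, int max_retries)
{
    for (int tries = 0; tries < max_retries; tries++) {
        if (toy_try_get(f))
            return tries;
    }
    return -1;
}
```

While the count stays frozen, toy_lookup() burns its whole retry budget and
returns -1; once toy_ref_unfreeze() publishes a non-zero count, the first
attempt succeeds. The sketch only shows why a long-lived zero count keeps a
lockless reader spinning; it says nothing about whether that is what
happens in the reported stalls.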

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
 mm/huge_memory.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9859aa4f7553..418e8d03480a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2891,7 +2891,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 {
 	struct folio *folio = page_folio(page);
 	struct page *head = &folio->page;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = folio_lruvec(folio);
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
 	int i, nr_dropped = 0;
@@ -2908,8 +2908,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
-	lruvec = folio_lruvec_lock(folio);
 
 	ClearPageHasHWPoisoned(head);
 
@@ -2942,7 +2940,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 		folio_set_order(new_folio, new_order);
 	}
-	unlock_page_lruvec(lruvec);
 	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, order, new_order);
@@ -2961,7 +2958,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		folio_ref_add(folio, 1 + new_nr);
 		xa_unlock(&folio->mapping->i_pages);
 	}
-	local_irq_enable();
 
 	if (nr_dropped)
 		shmem_uncharge(folio->mapping->host, nr_dropped);
@@ -3048,6 +3044,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	int extra_pins, ret;
 	pgoff_t end;
 	bool is_hzp;
+	struct lruvec *lruvec;
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
@@ -3159,6 +3156,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 
 	/* block interrupt reentry in xa_lock and spinlock */
 	local_irq_disable();
+
+	/*
+	 * Take the lruvec lock before freezing the folio so that the folio
+	 * does not remain in the page cache with refcnt == 0, which could
+	 * make find_get_entry livelock while iterating the xarray.
+	 */
+	lruvec = folio_lruvec_lock(folio);
+
 	if (mapping) {
 		/*
 		 * Check if the folio is present in page cache.
@@ -3203,12 +3208,16 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		}
 
 		__split_huge_page(page, list, end, new_order);
+		unlock_page_lruvec(lruvec);
+		local_irq_enable();
 		ret = 0;
 	} else {
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:
 		if (mapping)
 			xas_unlock(&xas);
+
+		unlock_page_lruvec(lruvec);
 		local_irq_enable();
 		remap_page(folio, folio_nr_pages(folio));
 		ret = -EAGAIN;
-- 
2.25.1



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-04-12  6:43 [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration zhaoyang.huang
@ 2024-04-12 12:24 ` Matthew Wilcox
  2024-04-13  7:10   ` Zhaoyang Huang
  2024-04-12 21:34 ` Andrew Morton
  1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2024-04-12 12:24 UTC (permalink / raw)
  To: zhaoyang.huang
  Cc: Andrew Morton, Alex Shi, Kirill A . Shutemov, Hugh Dickins,
	Baolin Wang, linux-mm, linux-kernel, Zhaoyang Huang, steve.kang

On Fri, Apr 12, 2024 at 02:43:53PM +0800, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> The livelock in [1] has been reported multiple times since v5.15, where a
> zero-ref folio is repeatedly found in the page cache by find_get_entry. A
> possible timing sequence is proposed in [2], which can be described briefly

I have no patience for going through another one of your "analyses".

1. Can you reproduce this bug without this patch?
2. Does the reproducer stop working after this patch?

Otherwise I'm not interested.  Sorry.  You burnt all my good will.

> as the lockless xarray operation being harmed by an invalid folio
> remaining in slot[offset]. This commit protects the xa split steps
> (folio_ref_freeze and __split_huge_page) under lruvec->lru_lock to remove
> the race window.



* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-04-12 12:24 ` Matthew Wilcox
@ 2024-04-13  7:10   ` Zhaoyang Huang
  0 siblings, 0 replies; 21+ messages in thread
From: Zhaoyang Huang @ 2024-04-13  7:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: zhaoyang.huang, Andrew Morton, Alex Shi, Kirill A . Shutemov,
	Hugh Dickins, Baolin Wang, linux-mm, linux-kernel, steve.kang

On Fri, Apr 12, 2024 at 8:24 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 12, 2024 at 02:43:53PM +0800, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > The livelock in [1] has been reported multiple times since v5.15, where a
> > zero-ref folio is repeatedly found in the page cache by find_get_entry. A
> > possible timing sequence is proposed in [2], which can be described briefly
>
> I have no patience for going through another one of your "analyses".
>
> 1. Can you reproduce this bug without this patch?
> 2. Does the reproducer stop working after this patch?
>
> Otherwise I'm not interested.  Sorry.  You burnt all my good will.

This bug has been reported many times over at least two years by three
people, including me (see the links below). Have you ever tried to solve
it? Did Dave and Brian also burn your good will, if you ever had any? Be
aware that you are the maintainer, who has the responsibility for
maintaining the code, not us. "To wear crowns shall bear the heavy or give
up". Put me on your SPAM list, thank you.

https://lore.kernel.org/linux-mm/20221018223042.GJ2703033@dread.disaster.area/
https://lore.kernel.org/linux-mm/Y0%2FkZbIvMgkNhWpM@bfoster/

>
> > the lockless xarray operation could get harmed by an illegal folio
> > remaining on the slot[offset]. This commit would like to protect
> > the xa split stuff(folio_ref_freeze and __split_huge_page) under
> > lruvec->lock to remove the race window.



* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-04-12  6:43 [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration zhaoyang.huang
  2024-04-12 12:24 ` Matthew Wilcox
@ 2024-04-12 21:34 ` Andrew Morton
  2024-04-13  2:01   ` Zhaoyang Huang
  1 sibling, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2024-04-12 21:34 UTC (permalink / raw)
  To: zhaoyang.huang
  Cc: Alex Shi, Kirill A . Shutemov, Hugh Dickins, Baolin Wang,
	linux-mm, linux-kernel, Zhaoyang Huang, steve.kang

On Fri, 12 Apr 2024 14:43:53 +0800 "zhaoyang.huang" <zhaoyang.huang@unisoc.com> wrote:

> The livelock in [1] has been reported multiple times since v5.15,

Are you able to provide us with a means by which others can reproduce this?

Thanks.



* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-04-12 21:34 ` Andrew Morton
@ 2024-04-13  2:01   ` Zhaoyang Huang
  2024-04-15  0:09     ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Zhaoyang Huang @ 2024-04-13  2:01 UTC (permalink / raw)
  To: Andrew Morton, Dave Chinner
  Cc: zhaoyang.huang, Alex Shi, Kirill A . Shutemov, Hugh Dickins,
	Baolin Wang, linux-mm, linux-kernel, steve.kang

Looping in Dave, since he once helped set up a reproducer in
https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
@Dave Chinner, I would like to ask for your kind help: could you
verify this patch in your environment, if convenient? Thanks a lot.


On Sat, Apr 13, 2024 at 5:34 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 12 Apr 2024 14:43:53 +0800 "zhaoyang.huang" <zhaoyang.huang@unisoc.com> wrote:
>
> > The livelock in [1] has been reported multiple times since v5.15,
>
> Are you able to provide us with a means by which others can reproduce this?
>
> Thanks.



* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-04-13  2:01   ` Zhaoyang Huang
@ 2024-04-15  0:09     ` Dave Chinner
  2024-04-15  1:50       ` Zhaoyang Huang
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2024-04-15  0:09 UTC (permalink / raw)
  To: Zhaoyang Huang
  Cc: Andrew Morton, zhaoyang.huang, Alex Shi, Kirill A . Shutemov,
	Hugh Dickins, Baolin Wang, linux-mm, linux-kernel, steve.kang

On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> Looping in Dave, since he once helped set up a reproducer in
> https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
> @Dave Chinner, I would like to ask for your kind help: could you
> verify this patch in your environment, if convenient? Thanks a lot.

I don't have the test environment from 18 months ago available any
more. Also, I haven't seen this problem since that specific test
environment tripped over the issue. Hence I don't have any way of
confirming that the problem is fixed, either, because first I'd have
to reproduce it...

-Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-04-15  0:09     ` Dave Chinner
@ 2024-04-15  1:50       ` Zhaoyang Huang
  2024-05-20 19:42         ` Marcin Wanat
  0 siblings, 1 reply; 21+ messages in thread
From: Zhaoyang Huang @ 2024-04-15  1:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, zhaoyang.huang, Alex Shi, Kirill A . Shutemov,
	Hugh Dickins, Baolin Wang, linux-mm, linux-kernel, steve.kang

On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> > Looping in Dave, since he once helped set up a reproducer in
> > https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
> > @Dave Chinner, I would like to ask for your kind help: could you
> > verify this patch in your environment, if convenient? Thanks a lot.
>
> I don't have the test environment from 18 months ago available any
> more. Also, I haven't seen this problem since that specific test
> environment tripped over the issue. Hence I don't have any way of
> confirming that the problem is fixed, either, because first I'd have
> to reproduce it...

Thanks for the information. I noticed that you reported another soft
lockup related to xas_load in Nov. 2023. This patch is
supposed to help with that. With regard to the version timing,
this commit is actually a revert of <mm/thp: narrow lru locking>
b6769834aac1d467fa1c71277d15688efcbb4d76, which was merged before v5.15.

To save you time, here is a brief description. IMO, b6769834aa
introduces a potential stall between freezing the folio's refcnt and
storing it back to 2, which turns the xas_load->folio_try_get_rcu loop
into a livelock if it stalls the lru_lock's holder.

b6769834aa
    split_huge_page_to_list
-       spin_lock(lru_lock)
        xas_split(&xas, folio, order)
        folio_ref_freeze(folio, 1 + folio_nr_pages(folio))
+       spin_lock(lru_lock)
        xas_store(&xas, offset++, head + i)
        page_ref_add(head, 2)
        spin_unlock(lru_lock)

Sorry in advance if the above doesn't make sense, I am just a
developer who is also suffering from this bug and trying to fix it.
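
The difference between the two lock orderings can be put into a toy,
tick-based model. This is only a sketch of the hypothesis stated above, not
a reproduction of the bug: failed_lookups(), enum lock_order, and the tick
counts are all made up for illustration, and the model assumes a reader
that attempts one lookup per tick and fails whenever the refcount is 0.

```c
#include <assert.h>

/* Hypothetical labels for the two orderings discussed above. */
enum lock_order {
    LOCK_AFTER_FREEZE,   /* freeze the refcount, then take lru_lock */
    LOCK_BEFORE_FREEZE,  /* take lru_lock, then freeze the refcount */
};

/*
 * The splitter waits lock_wait_ticks for the lock and then spends
 * split_ticks on the split itself. Returns the number of reader lookups
 * that observed a refcount of 0.
 */
static int failed_lookups(enum lock_order order, int lock_wait_ticks,
                          int split_ticks)
{
    int refcount = 3;   /* arbitrary non-zero starting value */
    int failed = 0;

    if (order == LOCK_AFTER_FREEZE)
        refcount = 0;   /* frozen before the wait for the lock starts */

    for (int t = 0; t < lock_wait_ticks; t++)   /* waiting for the lock */
        failed += (refcount == 0);

    refcount = 0;       /* frozen while the lock is held */
    for (int t = 0; t < split_ticks; t++)       /* split in progress */
        failed += (refcount == 0);

    refcount = 2;       /* unfrozen: lookups can succeed again */
    assert(refcount != 0);
    return failed;
}
```

In this model the lock wait only adds failed lookups when the freeze comes
first, for example failed_lookups(LOCK_AFTER_FREEZE, 100, 5) counts the
wait plus the split, while failed_lookups(LOCK_BEFORE_FREEZE, 100, 5)
counts the split alone. The model deliberately ignores the case where the
spinning reader itself delays the lock holder, which is the part of the
hypothesis that would turn a long wait into a livelock.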
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com



* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-04-15  1:50       ` Zhaoyang Huang
@ 2024-05-20 19:42         ` Marcin Wanat
  2024-05-21  0:58           ` Zhaoyang Huang
  2024-05-30  8:48           ` Yafang Shao
  0 siblings, 2 replies; 21+ messages in thread
From: Marcin Wanat @ 2024-05-20 19:42 UTC (permalink / raw)
  To: Zhaoyang Huang, Dave Chinner
  Cc: Andrew Morton, zhaoyang.huang, Alex Shi, Kirill A . Shutemov,
	Hugh Dickins, Baolin Wang, linux-mm, linux-kernel, steve.kang

On 15.04.2024 03:50, Zhaoyang Huang wrote:
> On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
>>> Looping in Dave, since he once helped set up a reproducer in
>>> https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
>>> @Dave Chinner, I would like to ask for your kind help: could you
>>> verify this patch in your environment, if convenient? Thanks a lot.
>>
>> I don't have the test environment from 18 months ago available any
>> more. Also, I haven't seen this problem since that specific test
>> environment tripped over the issue. Hence I don't have any way of
>> confirming that the problem is fixed, either, because first I'd have
>> to reproduce it...
> Thanks for the information. I noticed that you reported another soft
> lockup related to xas_load in Nov. 2023. This patch is
> supposed to help with that. With regard to the version timing,
> this commit is actually a revert of <mm/thp: narrow lru locking>
> b6769834aac1d467fa1c71277d15688efcbb4d76, which was merged before v5.15.
>
> To save you time, here is a brief description. IMO, b6769834aa
> introduces a potential stall between freezing the folio's refcnt and
> storing it back to 2, which turns the xas_load->folio_try_get_rcu loop
> into a livelock if it stalls the lru_lock's holder.
>
> b6769834aa
>     split_huge_page_to_list
> -       spin_lock(lru_lock)
>         xas_split(&xas, folio, order)
>         folio_ref_freeze(folio, 1 + folio_nr_pages(folio))
> +       spin_lock(lru_lock)
>         xas_store(&xas, offset++, head + i)
>         page_ref_add(head, 2)
>         spin_unlock(lru_lock)
>
> Sorry in advance if the above doesn't make sense, I am just a
> developer who is also suffering from this bug and trying to fix it.

I am experiencing a similar error on dozens of hosts, with stack traces 
that are all similar:

[627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s! 
[file_get:953301]
[627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT 
xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat 
nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr 
intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common 
isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal 
intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl 
iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support 
dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core 
dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage 
i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf 
ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1 
sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel 
polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw 
drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel 
nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata 
scsi_transport_sas i2c_algo_bit wmi
[627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded 
Tainted: G             L     6.6.30.el9 #2
[627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS 
2.21.2 02/19/2024
[627163.727847] RIP: 0010:xas_descend+0x1b/0x70
[627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6 
0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6 
08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
[627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
[627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX: 
0000000000000020
[627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI: 
ffffc90034a67990
[627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09: 
0000000000008820
[627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12: 
ffffc90034a67a20
[627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15: 
ffffc90034a67a18
[627163.727870] FS:  00007fc5e503b740(0000) GS:ffff88bfefd80000(0000) 
knlGS:0000000000000000
[627163.727871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4: 
00000000007706e0
[627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[627163.727878] PKRU: 55555554
[627163.727879] Call Trace:
[627163.727882]  <IRQ>
[627163.727886]  ? watchdog_timer_fn+0x22a/0x2a0
[627163.727892]  ? softlockup_fn+0x70/0x70
[627163.727895]  ? __hrtimer_run_queues+0x10f/0x2a0
[627163.727903]  ? hrtimer_interrupt+0x106/0x240
[627163.727906]  ? __sysvec_apic_timer_interrupt+0x68/0x170
[627163.727913]  ? sysvec_apic_timer_interrupt+0x9d/0xd0
[627163.727917]  </IRQ>
[627163.727918]  <TASK>
[627163.727920]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[627163.727927]  ? xas_descend+0x1b/0x70
[627163.727930]  xas_load+0x2c/0x40
[627163.727933]  xas_find+0x161/0x1a0
[627163.727937]  find_get_entries+0x77/0x1d0
[627163.727944]  truncate_inode_pages_range+0x244/0x3f0
[627163.727950]  truncate_pagecache+0x44/0x60
[627163.727955]  xfs_setattr_size+0x168/0x490 [xfs]
[627163.728074]  xfs_vn_setattr+0x78/0x140 [xfs]
[627163.728153]  notify_change+0x34f/0x4f0
[627163.728158]  ? _raw_spin_lock+0x13/0x30
[627163.728165]  ? do_truncate+0x80/0xd0
[627163.728169]  do_truncate+0x80/0xd0
[627163.728172]  do_open+0x2ce/0x400
[627163.728177]  path_openat+0x10d/0x280
[627163.728181]  do_filp_open+0xb2/0x150
[627163.728186]  ? check_heap_object+0x34/0x190
[627163.728189]  ? __check_object_size.part.0+0x5a/0x130
[627163.728194]  do_sys_openat2+0x92/0xc0
[627163.728197]  __x64_sys_openat+0x53/0x90
[627163.728200]  do_syscall_64+0x35/0x80
[627163.728206]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
[627163.728210] RIP: 0033:0x7fc5e493e7fb
[627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 
00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 
05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
[627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX: 
0000000000000101
[627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX: 
00007fc5e493e7fb
[627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI: 
00000000ffffff9c
[627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09: 
0000000000000001
[627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12: 
0000000000000241
[627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15: 
0000000000000000
[627163.728227]  </TASK>

I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
However, with long-term kernels 6.1.XX and 6.6.XX
(tested at least 10 different versions), this lockup always appears
after 2-30 days, similar to the report in the original thread.
The more load (for example, copying a lot of local files while
serving 20Gbps traffic), the higher the chance that the bug will appear.

I haven't been able to reproduce this during synthetic tests,
but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
If anyone can provide a patch, I can test it on multiple machines
over the next few days.

Regards,
Marcin



* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-05-20 19:42         ` Marcin Wanat
@ 2024-05-21  0:58           ` Zhaoyang Huang
  2024-05-21  1:00             ` Zhaoyang Huang
  2024-05-30  8:48           ` Yafang Shao
  1 sibling, 1 reply; 21+ messages in thread
From: Zhaoyang Huang @ 2024-05-21  0:58 UTC (permalink / raw)
  To: Marcin Wanat
  Cc: Dave Chinner, Andrew Morton, zhaoyang.huang, Alex Shi,
	Kirill A . Shutemov, Hugh Dickins, Baolin Wang, linux-mm,
	linux-kernel, steve.kang

On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@marcinwanat.pl> wrote:
>
> On 15.04.2024 03:50, Zhaoyang Huang wrote:
> > On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> >>> Looping in Dave, since he once helped set up a reproducer in
> >>> https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
> >>> @Dave Chinner, I would like to ask for your kind help: could you
> >>> verify this patch in your environment, if convenient? Thanks a lot.
> >>
> >> I don't have the test environment from 18 months ago available any
> >> more. Also, I haven't seen this problem since that specific test
> >> environment tripped over the issue. Hence I don't have any way of
> >> confirming that the problem is fixed, either, because first I'd have
> >> to reproduce it...
> > Thanks for the information. I noticed that you reported another soft
> > lockup related to xas_load in Nov. 2023. This patch is
> > supposed to help with that. With regard to the version timing,
> > this commit is actually a revert of <mm/thp: narrow lru locking>
> > b6769834aac1d467fa1c71277d15688efcbb4d76, which was merged before v5.15.
> >
> > To save you time, here is a brief description. IMO, b6769834aa
> > introduces a potential stall between freezing the folio's refcnt and
> > storing it back to 2, which turns the xas_load->folio_try_get_rcu loop
> > into a livelock if it stalls the lru_lock's holder.
> >
> > b6769834aa
> >     split_huge_page_to_list
> > -       spin_lock(lru_lock)
> >         xas_split(&xas, folio, order)
> >         folio_ref_freeze(folio, 1 + folio_nr_pages(folio))
> > +       spin_lock(lru_lock)
> >         xas_store(&xas, offset++, head + i)
> >         page_ref_add(head, 2)
> >         spin_unlock(lru_lock)
> >
> > Sorry in advance if the above doesn't make sense, I am just a
> > developer who is also suffering from this bug and trying to fix it.
>
> I am experiencing a similar error on dozens of hosts, with stack traces
> that are all similar:
>
> [627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s!
> [file_get:953301]
> [627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT
> xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat
> nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr
> intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
> isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
> intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl
> iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support
> dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core
> dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage
> i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf
> ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1
> sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel
> polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw
> drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel
> nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata
> scsi_transport_sas i2c_algo_bit wmi
> [627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded
> Tainted: G             L     6.6.30.el9 #2
> [627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS
> 2.21.2 02/19/2024
> [627163.727847] RIP: 0010:xas_descend+0x1b/0x70
> [627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6
> 0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6
> 08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
> [627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
> [627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX:
> 0000000000000020
> [627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI:
> ffffc90034a67990
> [627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09:
> 0000000000008820
> [627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12:
> ffffc90034a67a20
> [627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15:
> ffffc90034a67a18
> [627163.727870] FS:  00007fc5e503b740(0000) GS:ffff88bfefd80000(0000)
> knlGS:0000000000000000
> [627163.727871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4:
> 00000000007706e0
> [627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [627163.727878] PKRU: 55555554
> [627163.727879] Call Trace:
> [627163.727882]  <IRQ>
> [627163.727886]  ? watchdog_timer_fn+0x22a/0x2a0
> [627163.727892]  ? softlockup_fn+0x70/0x70
> [627163.727895]  ? __hrtimer_run_queues+0x10f/0x2a0
> [627163.727903]  ? hrtimer_interrupt+0x106/0x240
> [627163.727906]  ? __sysvec_apic_timer_interrupt+0x68/0x170
> [627163.727913]  ? sysvec_apic_timer_interrupt+0x9d/0xd0
> [627163.727917]  </IRQ>
> [627163.727918]  <TASK>
> [627163.727920]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [627163.727927]  ? xas_descend+0x1b/0x70
> [627163.727930]  xas_load+0x2c/0x40
> [627163.727933]  xas_find+0x161/0x1a0
> [627163.727937]  find_get_entries+0x77/0x1d0
> [627163.727944]  truncate_inode_pages_range+0x244/0x3f0
> [627163.727950]  truncate_pagecache+0x44/0x60
> [627163.727955]  xfs_setattr_size+0x168/0x490 [xfs]
> [627163.728074]  xfs_vn_setattr+0x78/0x140 [xfs]
> [627163.728153]  notify_change+0x34f/0x4f0
> [627163.728158]  ? _raw_spin_lock+0x13/0x30
> [627163.728165]  ? do_truncate+0x80/0xd0
> [627163.728169]  do_truncate+0x80/0xd0
> [627163.728172]  do_open+0x2ce/0x400
> [627163.728177]  path_openat+0x10d/0x280
> [627163.728181]  do_filp_open+0xb2/0x150
> [627163.728186]  ? check_heap_object+0x34/0x190
> [627163.728189]  ? __check_object_size.part.0+0x5a/0x130
> [627163.728194]  do_sys_openat2+0x92/0xc0
> [627163.728197]  __x64_sys_openat+0x53/0x90
> [627163.728200]  do_syscall_64+0x35/0x80
> [627163.728206]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
> [627163.728210] RIP: 0033:0x7fc5e493e7fb
> [627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18
> 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
> 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
> [627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000101
> [627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX:
> 00007fc5e493e7fb
> [627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI:
> 00000000ffffff9c
> [627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09:
> 0000000000000001
> [627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12:
> 0000000000000241
> [627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15:
> 0000000000000000
> [627163.728227]  </TASK>
>
> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
> However, with long-term kernels 6.1.XX and 6.6.XX
> (tested at least 10 different versions), this lockup always appears
> after 2-30 days, similar to the report in the original thread.
> The more load (for example, copying a lot of local files while
> serving 20Gbps traffic), the higher the chance that the bug will appear.
>
> I haven't been able to reproduce this during synthetic tests,
> but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
> If anyone can provide a patch, I can test it on multiple machines
> over the next few days.
Could you please try this one, which can be applied to 6.6 directly? Thank you!
>
> Regards,
> Marcin


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-05-21  0:58           ` Zhaoyang Huang
@ 2024-05-21  1:00             ` Zhaoyang Huang
  2024-05-21 15:47               ` Marcin Wanat
  0 siblings, 1 reply; 21+ messages in thread
From: Zhaoyang Huang @ 2024-05-21  1:00 UTC (permalink / raw)
  To: Marcin Wanat
  Cc: Dave Chinner, Andrew Morton, zhaoyang.huang, Alex Shi,
	Kirill A . Shutemov, Hugh Dickins, Baolin Wang, linux-mm,
	linux-kernel, steve.kang

On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@marcinwanat.pl> wrote:
> >
> > On 15.04.2024 03:50, Zhaoyang Huang wrote:
> > > On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <david@fromorbit.com> wrote:
> > >>
> > >> On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> > >>> loop Dave, since he has ever helped set up an reproducer in
> > >>> https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/ @Dave Chinner ,
> > >>> I would like to ask for your kindly help on if you can verify
> > >>> this patch on your environment if convenient. Thanks a lot.
> > >>
> > >> I don't have the test environment from 18 months ago available any
> > >> more. Also, I haven't seen this problem since that specific test
> > >> environment tripped over the issue. Hence I don't have any way of
> > >> confirming that the problem is fixed, either, because first I'd
> > >> have to reproduce it...
> > > Thanks for the information. I noticed that you reported another soft
> > > lockup which is related to xas_load since NOV.2023. This patch is
> > > supposed to be helpful for this. With regard to the version timing,
> > > this commit is actually a revert of <mm/thp: narrow lru locking>
> > > b6769834aac1d467fa1c71277d15688efcbb4d76 which is merged before
> > > v5.15.
> > >
> > > For saving your time, a brief description below. IMO, b6769834aa
> > > introduce a potential stall between freeze the folio's refcnt and
> > > store it back to 2, which have the xas_load->folio_try_get_rcu loops
> > > as livelock if it stalls the lru_lock's holder.
> > >
> > > b6769834aa
> > >       split_huge_page_to_list
> > > -       spin_lock(lru_lock)
> > >         xas_split(&xas, folio,order)
> > >         folio_refcnt_freeze(folio, 1 + folio_nr_pages(folio0)
> > > +       spin_lock(lru_lock)
> > >         xas_store(&xas, offset++, head+i)
> > >         page_ref_add(head, 2)
> > >         spin_unlock(lru_lock)
> > >
> > > Sorry in advance if the above doesn't make sense, I am just a
> > > developer who is also suffering from this bug and trying to fix it
> > I am experiencing a similar error on dozens of hosts, with stack traces
> > that are all similar:
> >
> > [627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s!
> > [file_get:953301]
> > [627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT
> > xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat
> > nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr
> > intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
> > isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
> > intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl
> > iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support
> > dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core
> > dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage
> > i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf
> > ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1
> > sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel
> > polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw
> > drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel
> > nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata
> > scsi_transport_sas i2c_algo_bit wmi
> > [627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded
> > Tainted: G             L     6.6.30.el9 #2
> > [627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS
> > 2.21.2 02/19/2024
> > [627163.727847] RIP: 0010:xas_descend+0x1b/0x70
> > [627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6
> > 0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6
> > 08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
> > [627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
> > [627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX:
> > 0000000000000020
> > [627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI:
> > ffffc90034a67990
> > [627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09:
> > 0000000000008820
> > [627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12:
> > ffffc90034a67a20
> > [627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15:
> > ffffc90034a67a18
> > [627163.727870] FS:  00007fc5e503b740(0000) GS:ffff88bfefd80000(0000)
> > knlGS:0000000000000000
> > [627163.727871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4:
> > 00000000007706e0
> > [627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > 0000000000000400
> > [627163.727878] PKRU: 55555554
> > [627163.727879] Call Trace:
> > [627163.727882]  <IRQ>
> > [627163.727886]  ? watchdog_timer_fn+0x22a/0x2a0
> > [627163.727892]  ? softlockup_fn+0x70/0x70
> > [627163.727895]  ? __hrtimer_run_queues+0x10f/0x2a0
> > [627163.727903]  ? hrtimer_interrupt+0x106/0x240
> > [627163.727906]  ? __sysvec_apic_timer_interrupt+0x68/0x170
> > [627163.727913]  ? sysvec_apic_timer_interrupt+0x9d/0xd0
> > [627163.727917]  </IRQ>
> > [627163.727918]  <TASK>
> > [627163.727920]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > [627163.727927]  ? xas_descend+0x1b/0x70
> > [627163.727930]  xas_load+0x2c/0x40
> > [627163.727933]  xas_find+0x161/0x1a0
> > [627163.727937]  find_get_entries+0x77/0x1d0
> > [627163.727944]  truncate_inode_pages_range+0x244/0x3f0
> > [627163.727950]  truncate_pagecache+0x44/0x60
> > [627163.727955]  xfs_setattr_size+0x168/0x490 [xfs]
> > [627163.728074]  xfs_vn_setattr+0x78/0x140 [xfs]
> > [627163.728153]  notify_change+0x34f/0x4f0
> > [627163.728158]  ? _raw_spin_lock+0x13/0x30
> > [627163.728165]  ? do_truncate+0x80/0xd0
> > [627163.728169]  do_truncate+0x80/0xd0
> > [627163.728172]  do_open+0x2ce/0x400
> > [627163.728177]  path_openat+0x10d/0x280
> > [627163.728181]  do_filp_open+0xb2/0x150
> > [627163.728186]  ? check_heap_object+0x34/0x190
> > [627163.728189]  ? __check_object_size.part.0+0x5a/0x130
> > [627163.728194]  do_sys_openat2+0x92/0xc0
> > [627163.728197]  __x64_sys_openat+0x53/0x90
> > [627163.728200]  do_syscall_64+0x35/0x80
> > [627163.728206]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
> > [627163.728210] RIP: 0033:0x7fc5e493e7fb
> > [627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18
> > 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
> > 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
> > [627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000101
> > [627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX:
> > 00007fc5e493e7fb
> > [627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI:
> > 00000000ffffff9c
> > [627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09:
> > 0000000000000001
> > [627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12:
> > 0000000000000241
> > [627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15:
> > 0000000000000000
> > [627163.728227]  </TASK>
> >
> > I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> > and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> > kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
> > However, with long-term kernels 6.1.XX and 6.6.XX,
> > (tested at least 10 different versions), this lockup always appears
> > after 2-30 days, similar to the report in the original thread.
> > The more load (for example, copying a lot of local files while
> > serving 20Gbps traffic), the higher the chance that the bug will appear.
> >
> > I haven't been able to reproduce this during synthetic tests,
> > but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
> > If anyone can provide a patch, I can test it on multiple machines
> > over the next few days.
> Could you please try this one, which can be applied to 6.6 directly? Thank you!
URL: https://lore.kernel.org/linux-mm/20240412064353.133497-1-zhaoyang.huang@unisoc.com/

> >
> > Regards,
> > Marcin


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-05-21  1:00             ` Zhaoyang Huang
@ 2024-05-21 15:47               ` Marcin Wanat
  2024-05-22  5:37                 ` Zhaoyang Huang
  0 siblings, 1 reply; 21+ messages in thread
From: Marcin Wanat @ 2024-05-21 15:47 UTC (permalink / raw)
  To: Zhaoyang Huang
  Cc: Dave Chinner, Andrew Morton, zhaoyang.huang, Alex Shi,
	Kirill A . Shutemov, Hugh Dickins, Baolin Wang, linux-mm,
	linux-kernel, steve.kang

On 21.05.2024 03:00, Zhaoyang Huang wrote:
> On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>>
>> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@marcinwanat.pl> wrote:
>>>
>>> On 15.04.2024 03:50, Zhaoyang Huang wrote:
>>> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
>>> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
>>> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
>>> However, with long-term kernels 6.1.XX and 6.6.XX,
>>> (tested at least 10 different versions), this lockup always appears
>>> after 2-30 days, similar to the report in the original thread.
>>> The more load (for example, copying a lot of local files while
>>> serving 20Gbps traffic), the higher the chance that the bug will appear.
>>>
>>> I haven't been able to reproduce this during synthetic tests,
>>> but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
>>> If anyone can provide a patch, I can test it on multiple machines
>>> over the next few days.
>> Could you please try this one, which can be applied to 6.6 directly? Thank you!
> URL: https://lore.kernel.org/linux-mm/20240412064353.133497-1-zhaoyang.huang@unisoc.com/
> 

Unfortunately, I am unable to cleanly apply this patch against the 
latest 6.6.31


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-05-21 15:47               ` Marcin Wanat
@ 2024-05-22  5:37                 ` Zhaoyang Huang
  2024-05-22 10:13                   ` Marcin Wanat
  0 siblings, 1 reply; 21+ messages in thread
From: Zhaoyang Huang @ 2024-05-22  5:37 UTC (permalink / raw)
  To: Marcin Wanat
  Cc: Dave Chinner, Andrew Morton, zhaoyang.huang, Alex Shi,
	Kirill A . Shutemov, Hugh Dickins, Baolin Wang, linux-mm,
	linux-kernel, steve.kang

On Tue, May 21, 2024 at 11:47 PM Marcin Wanat <private@marcinwanat.pl> wrote:
>
> On 21.05.2024 03:00, Zhaoyang Huang wrote:
> > On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
> >>
> >> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@marcinwanat.pl> wrote:
> >>>
> >>> On 15.04.2024 03:50, Zhaoyang Huang wrote:
> >>> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> >>> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> >>> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
> >>> However, with long-term kernels 6.1.XX and 6.6.XX,
> >>> (tested at least 10 different versions), this lockup always appears
> >>> after 2-30 days, similar to the report in the original thread.
> >>> The more load (for example, copying a lot of local files while
> >>> serving 20Gbps traffic), the higher the chance that the bug will appear.
> >>>
> >>> I haven't been able to reproduce this during synthetic tests,
> >>> but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
> >>> If anyone can provide a patch, I can test it on multiple machines
> >>> over the next few days.
> >> Could you please try this one, which can be applied to 6.6 directly? Thank you!
> > URL: https://lore.kernel.org/linux-mm/20240412064353.133497-1-zhaoyang.huang@unisoc.com/
> >
>
> Unfortunately, I am unable to cleanly apply this patch against the
> latest 6.6.31
Please try the one below, which works on my v6.6-based Android kernel.
Thank you in advance for testing :D

mm/huge_memory.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 064fbd90822b..5899906c326a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2498,7 +2498,7 @@ static void __split_huge_page(struct page *page,
struct list_head *list,
 {
  struct folio *folio = page_folio(page);
  struct page *head = &folio->page;
- struct lruvec *lruvec;
+ struct lruvec *lruvec = folio_lruvec(folio);
  struct address_space *swap_cache = NULL;
  unsigned long offset = 0;
  unsigned int nr = thp_nr_pages(head);
@@ -2513,9 +2513,6 @@ static void __split_huge_page(struct page *page,
struct list_head *list,
  xa_lock(&swap_cache->i_pages);
  }

- /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
- lruvec = folio_lruvec_lock(folio);
-
  ClearPageHasHWPoisoned(head);

  for (i = nr - 1; i >= 1; i--) {
@@ -2541,9 +2538,6 @@ static void __split_huge_page(struct page *page,
struct list_head *list,
  }

  ClearPageCompound(head);
- unlock_page_lruvec(lruvec);
- /* Caller disabled irqs, so they are still disabled here */
-
  split_page_owner(head, nr);

  /* See comment in __split_huge_page_tail() */
@@ -2560,7 +2554,6 @@ static void __split_huge_page(struct page *page,
struct list_head *list,
  page_ref_add(head, 2);
  xa_unlock(&head->mapping->i_pages);
  }
- local_irq_enable();

  if (nr_dropped)
  shmem_uncharge(head->mapping->host, nr_dropped);
@@ -2631,6 +2624,7 @@ int split_huge_page_to_list(struct page *page,
struct list_head *list)
  int extra_pins, ret;
  pgoff_t end;
  bool is_hzp;
+ struct lruvec *lruvec;

  VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
  VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
@@ -2714,6 +2708,14 @@ int split_huge_page_to_list(struct page *page,
struct list_head *list)

  /* block interrupt reentry in xa_lock and spinlock */
  local_irq_disable();
+
+ /*
+ * take lruvec's lock before freeze the folio to prevent the folio
+ * remains in the page cache with refcnt == 0, which could lead to
+ * find_get_entry enters livelock by iterating the xarray.
+ */
+ lruvec = folio_lruvec_lock(folio);
+
  if (mapping) {
  /*
  * Check if the folio is present in page cache.
@@ -2748,12 +2750,16 @@ int split_huge_page_to_list(struct page *page,
struct list_head *list)
  }

  __split_huge_page(page, list, end);
+ unlock_page_lruvec(lruvec);
+ local_irq_enable();
  ret = 0;
  } else {
  spin_unlock(&ds_queue->split_queue_lock);
 fail:
  if (mapping)
  xas_unlock(&xas);
+
+ unlock_page_lruvec(lruvec);
  local_irq_enable();
  remap_page(folio, folio_nr_pages(folio));
  ret = -EAGAIN;
-- 
2.25.1
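
For readers following the thread, the livelock that the comment in the
patch describes can be pictured with a small user-space sketch. It
assumes a simplified model: a lockless lookup keeps retrying a slot for
as long as the folio found there has a refcount of zero, which is how
the commit message describes find_get_entry behaving. All toy_* names
are hypothetical, none of this is kernel code, and max_tries exists only
so that the demonstration terminates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio: only the reference count matters. */
struct toy_folio {
	atomic_int refcount;
};

/* Take a reference unless the count is zero (frozen or being freed). */
static bool toy_try_get(struct toy_folio *f)
{
	int old = atomic_load(&f->refcount);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Freeze: set the count to 0 only if it equals the expected value. */
static bool toy_freeze(struct toy_folio *f, int expected)
{
	return atomic_compare_exchange_strong(&f->refcount, &expected, 0);
}

/* Unfreeze: publish a non-zero count again. */
static void toy_unfreeze(struct toy_folio *f, int count)
{
	assert(count > 0);
	atomic_store(&f->refcount, count);
}

/*
 * Lockless lookup: retry the same slot while the folio refuses a
 * reference. Returns the number of failed attempts before success,
 * or -1 if max_tries attempts all failed.
 */
static int toy_lookup(struct toy_folio *slot, int max_tries)
{
	for (int failed = 0; failed < max_tries; failed++) {
		if (toy_try_get(slot))
			return failed;
	}
	return -1;
}
```

In this model, toy_lookup() makes no progress between toy_freeze() and
toy_unfreeze(), however many times it retries. If whoever froze the
count is itself stalled, the lookup spins for as long as the stall
lasts.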


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-05-22  5:37                 ` Zhaoyang Huang
@ 2024-05-22 10:13                   ` Marcin Wanat
  2024-05-27  8:22                     ` Marcin Wanat
  0 siblings, 1 reply; 21+ messages in thread
From: Marcin Wanat @ 2024-05-22 10:13 UTC (permalink / raw)
  To: Zhaoyang Huang
  Cc: Dave Chinner, Andrew Morton, zhaoyang.huang, Alex Shi,
	Kirill A . Shutemov, Hugh Dickins, Baolin Wang, linux-mm,
	linux-kernel, steve.kang

On 22.05.2024 07:37, Zhaoyang Huang wrote:
> On Tue, May 21, 2024 at 11:47 PM Marcin Wanat <private@marcinwanat.pl> wrote:
>>
>> On 21.05.2024 03:00, Zhaoyang Huang wrote:
>>> On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>>>>
>>>> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@marcinwanat.pl> wrote:
>>>>>
>>>>> On 15.04.2024 03:50, Zhaoyang Huang wrote:
>>>>> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
>>>>> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
>>>>> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
>>>>> However, with long-term kernels 6.1.XX and 6.6.XX,
>>>>> (tested at least 10 different versions), this lockup always appears
>>>>> after 2-30 days, similar to the report in the original thread.
>>>>> The more load (for example, copying a lot of local files while
>>>>> serving 20Gbps traffic), the higher the chance that the bug will appear.
>>>>>
>>>>> I haven't been able to reproduce this during synthetic tests,
>>>>> but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
>>>>> If anyone can provide a patch, I can test it on multiple machines
>>>>> over the next few days.
>>>> Could you please try this one, which can be applied to 6.6 directly? Thank you!
>>> URL: https://lore.kernel.org/linux-mm/20240412064353.133497-1-zhaoyang.huang@unisoc.com/
>>>
>>
>> Unfortunately, I am unable to cleanly apply this patch against the
>> latest 6.6.31
> Please try the one below, which works on my v6.6-based Android kernel.
> Thank you in advance for testing :D
> 
> mm/huge_memory.c | 22 ++++++++++++++--------
>   1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c

I have compiled 6.6.31 with this patch and will test it on multiple 
machines over the next 30 days. I will provide an update after 30 days 
if everything is fine or sooner if any of the hosts experience the same 
soft lockup again.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-05-22 10:13                   ` Marcin Wanat
@ 2024-05-27  8:22                     ` Marcin Wanat
  2024-05-27  8:53                       ` Zhaoyang Huang
  2024-06-14  3:31                       ` Zhaoyang Huang
  0 siblings, 2 replies; 21+ messages in thread
From: Marcin Wanat @ 2024-05-27  8:22 UTC (permalink / raw)
  To: Zhaoyang Huang
  Cc: Dave Chinner, Andrew Morton, zhaoyang.huang, Alex Shi,
	Kirill A . Shutemov, Hugh Dickins, Baolin Wang, linux-mm,
	linux-kernel, steve.kang

On 22.05.2024 12:13, Marcin Wanat wrote:
> On 22.05.2024 07:37, Zhaoyang Huang wrote:
>> On Tue, May 21, 2024 at 11:47 PM Marcin Wanat <private@marcinwanat.pl> 
>> wrote:
>>>
>>> On 21.05.2024 03:00, Zhaoyang Huang wrote:
>>>> On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang 
>>>> <huangzhaoyang@gmail.com> wrote:
>>>>>
>>>>> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat 
>>>>> <private@marcinwanat.pl> wrote:
>>>>>>
>>>>>> On 15.04.2024 03:50, Zhaoyang Huang wrote:
>>>>>> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
>>>>>> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
>>>>>> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT 
>>>>>> affected.
>>>>>> However, with long-term kernels 6.1.XX and 6.6.XX,
>>>>>> (tested at least 10 different versions), this lockup always appears
>>>>>> after 2-30 days, similar to the report in the original thread.
>>>>>> The more load (for example, copying a lot of local files while
>>>>>> serving 20Gbps traffic), the higher the chance that the bug will 
>>>>>> appear.
>>>>>>
>>>>>> I haven't been able to reproduce this during synthetic tests,
>>>>>> but it always occurs in production on 6.1.X and 6.6.X within 2-30 
>>>>>> days.
>>>>>> If anyone can provide a patch, I can test it on multiple machines
>>>>>> over the next few days.
>>>>> Could you please try this one, which can be applied to 6.6
>>>>> directly? Thank you!
>>>> URL: https://lore.kernel.org/linux-mm/20240412064353.133497-1- 
>>>> zhaoyang.huang@unisoc.com/
>>>>
>>>
>>> Unfortunately, I am unable to cleanly apply this patch against the
>>> latest 6.6.31
>> Please try the one below, which works on my v6.6-based Android kernel.
>> Thank you in advance for testing :D
>>
>> mm/huge_memory.c | 22 ++++++++++++++--------
>>   1 file changed, 14 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> 
> I have compiled 6.6.31 with this patch and will test it on multiple 
> machines over the next 30 days. I will provide an update after 30 days 
> if everything is fine or sooner if any of the hosts experience the same 
> soft lockup again.
> 

The first server running 6.6.31 with this patch hung today. The soft
lockup has changed to a hard lockup:

[26887.389623] watchdog: Watchdog detected hard LOCKUP on cpu 21
[26887.389626] Modules linked in: nft_limit xt_limit xt_hashlimit 
ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_connlimit 
nf_conncount tls xt_set ip_set_hash_net ip_set xt_CT xt_conntrack 
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables 
nfnetlink rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency 
intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm 
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass 
rapl intel_cstate ipmi_ssif irdma ext4 mbcache ice iTCO_wdt jbd2 mgag200 
intel_pmc_bxt iTCO_vendor_support ib_uverbs i2c_algo_bit acpi_ipmi 
intel_uncore mei_me drm_shmem_helper pcspkr ib_core i2c_i801 ipmi_si 
drm_kms_helper mei lpc_ich i2c_smbus ioatdma intel_pch_thermal 
ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter joydev tcp_bbr 
drm fuse xfs libcrc32c sd_mod t10_pi sg crct10dif_pclmul crc32_pclmul 
crc32c_intel ixgbe polyval_clmulni ahci polyval_generic libahci mdio 
i40e libata megaraid_sas dca ghash_clmulni_intel wmi
[26887.389682] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G 
      W          6.6.31.el9 #3
[26887.389685] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS 
V5.0.0.12 R1.22.0 for D3384-A1x                    06/04/2018
[26887.389687] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389696] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0 
a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3 
90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
[26887.389698] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
[26887.389700] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX: 
0000000000000000
[26887.389701] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
ffff9ad6c6f67050
[26887.389703] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09: 
0000000000000067
[26887.389704] R10: 0000000000000000 R11: 0000000000000000 R12: 
0000000000000046
[26887.389705] R13: 0000000000000200 R14: 0000000000000000 R15: 
ffffe1138aa98000
[26887.389707] FS:  0000000000000000(0000) GS:ffff9ade20340000(0000) 
knlGS:0000000000000000
[26887.389708] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[26887.389710] CR2: 000000002912809b CR3: 000000064401e003 CR4: 
00000000007706e0
[26887.389711] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[26887.389712] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[26887.389713] PKRU: 55555554
[26887.389714] Call Trace:
[26887.389717]  <NMI>
[26887.389720]  ? watchdog_hardlockup_check+0xac/0x150
[26887.389725]  ? __perf_event_overflow+0x102/0x1d0
[26887.389729]  ? handle_pmi_common+0x189/0x3e0
[26887.389735]  ? set_pte_vaddr_p4d+0x4a/0x60
[26887.389738]  ? flush_tlb_one_kernel+0xa/0x20
[26887.389742]  ? native_set_fixmap+0x65/0x80
[26887.389745]  ? ghes_copy_tofrom_phys+0x75/0x110
[26887.389751]  ? __ghes_peek_estatus.isra.0+0x49/0xb0
[26887.389755]  ? intel_pmu_handle_irq+0x10b/0x230
[26887.389756]  ? perf_event_nmi_handler+0x28/0x50
[26887.389759]  ? nmi_handle+0x58/0x150
[26887.389764]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389768]  ? default_do_nmi+0x6b/0x170
[26887.389770]  ? exc_nmi+0x12c/0x1a0
[26887.389772]  ? end_repeat_nmi+0x16/0x1f
[26887.389777]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389780]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389784]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389787]  </NMI>
[26887.389788]  <TASK>
[26887.389789]  __raw_spin_lock_irqsave+0x3d/0x50
[26887.389793]  folio_lruvec_lock_irqsave+0x5e/0x90
[26887.389798]  __page_cache_release+0x68/0x230
[26887.389801]  ? remove_migration_ptes+0x5c/0x80
[26887.389807]  __folio_put+0x24/0x60
[26887.389808]  __split_huge_page+0x368/0x520
[26887.389812]  split_huge_page_to_list+0x4b3/0x570
[26887.389816]  deferred_split_scan+0x1c8/0x290
[26887.389819]  do_shrink_slab+0x12f/0x2d0
[26887.389824]  shrink_slab_memcg+0x133/0x1d0
[26887.389829]  shrink_node_memcgs+0x18e/0x1d0
[26887.389832]  shrink_node+0xa7/0x370
[26887.389836]  balance_pgdat+0x332/0x6f0
[26887.389842]  kswapd+0xf0/0x190
[26887.389845]  ? balance_pgdat+0x6f0/0x6f0
[26887.389848]  kthread+0xee/0x120
[26887.389851]  ? kthread_complete_and_exit+0x20/0x20
[26887.389853]  ret_from_fork+0x2d/0x50
[26887.389857]  ? kthread_complete_and_exit+0x20/0x20
[26887.389859]  ret_from_fork_asm+0x11/0x20
[26887.389864]  </TASK>
[26887.389865] Kernel panic - not syncing: Hard LOCKUP
[26887.389867] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G 
      W          6.6.31.el9 #3
[26887.389869] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS 
V5.0.0.12 R1.22.0 for D3384-A1x                    06/04/2018
[26887.389870] Call Trace:
[26887.389871]  <NMI>
[26887.389872]  dump_stack_lvl+0x44/0x60
[26887.389877]  panic+0x241/0x330
[26887.389881]  nmi_panic+0x2f/0x40
[26887.389883]  watchdog_hardlockup_check+0x119/0x150
[26887.389886]  __perf_event_overflow+0x102/0x1d0
[26887.389889]  handle_pmi_common+0x189/0x3e0
[26887.389893]  ? set_pte_vaddr_p4d+0x4a/0x60
[26887.389896]  ? flush_tlb_one_kernel+0xa/0x20
[26887.389899]  ? native_set_fixmap+0x65/0x80
[26887.389902]  ? ghes_copy_tofrom_phys+0x75/0x110
[26887.389906]  ? __ghes_peek_estatus.isra.0+0x49/0xb0
[26887.389909]  intel_pmu_handle_irq+0x10b/0x230
[26887.389911]  perf_event_nmi_handler+0x28/0x50
[26887.389913]  nmi_handle+0x58/0x150
[26887.389916]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389920]  default_do_nmi+0x6b/0x170
[26887.389922]  exc_nmi+0x12c/0x1a0
[26887.389923]  end_repeat_nmi+0x16/0x1f
[26887.389926] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389930] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0 
a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3 
90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
[26887.389931] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
[26887.389933] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX: 
0000000000000000
[26887.389934] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
ffff9ad6c6f67050
[26887.389935] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09: 
0000000000000067
[26887.389936] R10: 0000000000000000 R11: 0000000000000000 R12: 
0000000000000046
[26887.389937] R13: 0000000000000200 R14: 0000000000000000 R15: 
ffffe1138aa98000
[26887.389940]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389943]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389946]  </NMI>
[26887.389947]  <TASK>
[26887.389947]  __raw_spin_lock_irqsave+0x3d/0x50
[26887.389950]  folio_lruvec_lock_irqsave+0x5e/0x90
[26887.389953]  __page_cache_release+0x68/0x230
[26887.389955]  ? remove_migration_ptes+0x5c/0x80
[26887.389958]  __folio_put+0x24/0x60
[26887.389960]  __split_huge_page+0x368/0x520
[26887.389963]  split_huge_page_to_list+0x4b3/0x570
[26887.389967]  deferred_split_scan+0x1c8/0x290
[26887.389971]  do_shrink_slab+0x12f/0x2d0
[26887.389974]  shrink_slab_memcg+0x133/0x1d0
[26887.389978]  shrink_node_memcgs+0x18e/0x1d0
[26887.389982]  shrink_node+0xa7/0x370
[26887.389985]  balance_pgdat+0x332/0x6f0
[26887.389991]  kswapd+0xf0/0x190
[26887.389994]  ? balance_pgdat+0x6f0/0x6f0
[26887.389997]  kthread+0xee/0x120
[26887.389998]  ? kthread_complete_and_exit+0x20/0x20
[26887.390000]  ret_from_fork+0x2d/0x50
[26887.390003]  ? kthread_complete_and_exit+0x20/0x20
[26887.390004]  ret_from_fork_asm+0x11/0x20
[26887.390009]  </TASK>
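
One possible reading of the trace above, not confirmed anywhere in this
thread: with the patch applied, split_huge_page_to_list() holds the
lruvec lock across __split_huge_page(), and the trace shows
__split_huge_page() reaching folio_lruvec_lock_irqsave() through
__folio_put() and __page_cache_release(). If both paths refer to the
same lruvec, the CPU would be spinning on a lock it already owns with
interrupts disabled, which would be consistent with a hard lockup. The
sketch below models only that ordering in user space. The toy_* names
are hypothetical, and a trylock stands in for spin_lock() so that the
example reports the problem instead of hanging.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical non-recursive lock, standing in for lruvec->lru_lock. */
struct toy_lock {
	atomic_bool held;
};

/* Returns true if the lock was free and is now ours. */
static bool toy_trylock(struct toy_lock *l)
{
	/* atomic_exchange returns the previous value. */
	return !atomic_exchange(&l->held, true);
}

static void toy_unlock(struct toy_lock *l)
{
	atomic_store(&l->held, false);
}

/*
 * Stand-in for the final-put path, which takes the lock itself.
 * Returns false where a real spin_lock() would spin forever.
 */
static bool toy_release(struct toy_lock *l)
{
	if (!toy_trylock(l))
		return false;
	toy_unlock(l);
	return true;
}

/*
 * Stand-in for the split path. hold_across_put selects whether the
 * lock is still held when the final put runs.
 */
static bool toy_split(struct toy_lock *l, bool hold_across_put)
{
	bool got = toy_trylock(l);
	bool ok;

	assert(got);
	(void)got;

	if (!hold_across_put)
		toy_unlock(l);	/* drop the lock before the final put */

	ok = toy_release(l);

	if (hold_across_put)
		toy_unlock(l);

	return ok;
}
```

With hold_across_put set to false the release succeeds. With it set to
true the release fails, which is the point at which a non-recursive
spinlock would spin without ever returning.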



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-05-27  8:22                     ` Marcin Wanat
@ 2024-05-27  8:53                       ` Zhaoyang Huang
  2024-06-14  3:31                       ` Zhaoyang Huang
  1 sibling, 0 replies; 21+ messages in thread
From: Zhaoyang Huang @ 2024-05-27  8:53 UTC (permalink / raw)
  To: Marcin Wanat
  Cc: Dave Chinner, Andrew Morton, zhaoyang.huang, Alex Shi,
	Kirill A . Shutemov, Hugh Dickins, Baolin Wang, linux-mm,
	linux-kernel, steve.kang

On Mon, May 27, 2024 at 4:22 PM Marcin Wanat <private@marcinwanat.pl> wrote:
>
> On 22.05.2024 12:13, Marcin Wanat wrote:
> > On 22.05.2024 07:37, Zhaoyang Huang wrote:
> >> On Tue, May 21, 2024 at 11:47 PM Marcin Wanat <private@marcinwanat.pl>
> >> wrote:
> >>>
> >>> On 21.05.2024 03:00, Zhaoyang Huang wrote:
> >>>> On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang
> >>>> <huangzhaoyang@gmail.com> wrote:
> >>>>>
> >>>>> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat
> >>>>> <private@marcinwanat.pl> wrote:
> >>>>>>
> >>>>>> On 15.04.2024 03:50, Zhaoyang Huang wrote:
> >>>>>> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> >>>>>> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> >>>>>> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT
> >>>>>> affected.
> >>>>>> However, with long-term kernels 6.1.XX and 6.6.XX,
> >>>>>> (tested at least 10 different versions), this lockup always appears
> >>>>>> after 2-30 days, similar to the report in the original thread.
> >>>>>> The more load (for example, copying a lot of local files while
> >>>>>> serving 20Gbps traffic), the higher the chance that the bug will
> >>>>>> appear.
> >>>>>>
> >>>>>> I haven't been able to reproduce this during synthetic tests,
> >>>>>> but it always occurs in production on 6.1.X and 6.6.X within 2-30
> >>>>>> days.
> >>>>>> If anyone can provide a patch, I can test it on multiple machines
> >>>>>> over the next few days.
> >>>>> Could you please try this one, which can be applied to 6.6
> >>>>> directly? Thank you!
> >>>> URL: https://lore.kernel.org/linux-mm/20240412064353.133497-1-
> >>>> zhaoyang.huang@unisoc.com/
> >>>>
> >>>
> >>> Unfortunately, I am unable to cleanly apply this patch against the
> >>> latest 6.6.31
> >> Please try the one below, which works on my v6.6-based Android kernel.
> >> Thank you in advance for testing :D
> >>
> >> mm/huge_memory.c | 22 ++++++++++++++--------
> >>   1 file changed, 14 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >
> > I have compiled 6.6.31 with this patch and will test it on multiple
> > machines over the next 30 days. I will provide an update after 30 days
> > if everything is fine or sooner if any of the hosts experience the same
> > soft lockup again.
> >
>
> The first server running 6.6.31 with this patch hung today. The soft
> lockup has changed to a hard lockup:
>
> [26887.389623] watchdog: Watchdog detected hard LOCKUP on cpu 21
> [26887.389626] Modules linked in: nft_limit xt_limit xt_hashlimit
> ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_connlimit
> nf_conncount tls xt_set ip_set_hash_net ip_set xt_CT xt_conntrack
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables
> nfnetlink rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency
> intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> rapl intel_cstate ipmi_ssif irdma ext4 mbcache ice iTCO_wdt jbd2 mgag200
> intel_pmc_bxt iTCO_vendor_support ib_uverbs i2c_algo_bit acpi_ipmi
> intel_uncore mei_me drm_shmem_helper pcspkr ib_core i2c_i801 ipmi_si
> drm_kms_helper mei lpc_ich i2c_smbus ioatdma intel_pch_thermal
> ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter joydev tcp_bbr
> drm fuse xfs libcrc32c sd_mod t10_pi sg crct10dif_pclmul crc32_pclmul
> crc32c_intel ixgbe polyval_clmulni ahci polyval_generic libahci mdio
> i40e libata megaraid_sas dca ghash_clmulni_intel wmi
> [26887.389682] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G
>       W          6.6.31.el9 #3
> [26887.389685] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS
> V5.0.0.12 R1.22.0 for D3384-A1x                    06/04/2018
> [26887.389687] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389696] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0
> a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3
> 90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
> [26887.389698] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
> [26887.389700] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX:
> 0000000000000000
> [26887.389701] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff9ad6c6f67050
> [26887.389703] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09:
> 0000000000000067
> [26887.389704] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000046
> [26887.389705] R13: 0000000000000200 R14: 0000000000000000 R15:
> ffffe1138aa98000
> [26887.389707] FS:  0000000000000000(0000) GS:ffff9ade20340000(0000)
> knlGS:0000000000000000
> [26887.389708] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [26887.389710] CR2: 000000002912809b CR3: 000000064401e003 CR4:
> 00000000007706e0
> [26887.389711] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [26887.389712] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [26887.389713] PKRU: 55555554
> [26887.389714] Call Trace:
> [26887.389717]  <NMI>
> [26887.389720]  ? watchdog_hardlockup_check+0xac/0x150
> [26887.389725]  ? __perf_event_overflow+0x102/0x1d0
> [26887.389729]  ? handle_pmi_common+0x189/0x3e0
> [26887.389735]  ? set_pte_vaddr_p4d+0x4a/0x60
> [26887.389738]  ? flush_tlb_one_kernel+0xa/0x20
> [26887.389742]  ? native_set_fixmap+0x65/0x80
> [26887.389745]  ? ghes_copy_tofrom_phys+0x75/0x110
> [26887.389751]  ? __ghes_peek_estatus.isra.0+0x49/0xb0
> [26887.389755]  ? intel_pmu_handle_irq+0x10b/0x230
> [26887.389756]  ? perf_event_nmi_handler+0x28/0x50
> [26887.389759]  ? nmi_handle+0x58/0x150
> [26887.389764]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389768]  ? default_do_nmi+0x6b/0x170
> [26887.389770]  ? exc_nmi+0x12c/0x1a0
> [26887.389772]  ? end_repeat_nmi+0x16/0x1f
> [26887.389777]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389780]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389784]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389787]  </NMI>
> [26887.389788]  <TASK>
> [26887.389789]  __raw_spin_lock_irqsave+0x3d/0x50
> [26887.389793]  folio_lruvec_lock_irqsave+0x5e/0x90
> [26887.389798]  __page_cache_release+0x68/0x230
> [26887.389801]  ? remove_migration_ptes+0x5c/0x80
> [26887.389807]  __folio_put+0x24/0x60
> [26887.389808]  __split_huge_page+0x368/0x520
> [26887.389812]  split_huge_page_to_list+0x4b3/0x570
> [26887.389816]  deferred_split_scan+0x1c8/0x290
> [26887.389819]  do_shrink_slab+0x12f/0x2d0
> [26887.389824]  shrink_slab_memcg+0x133/0x1d0
> [26887.389829]  shrink_node_memcgs+0x18e/0x1d0
> [26887.389832]  shrink_node+0xa7/0x370
> [26887.389836]  balance_pgdat+0x332/0x6f0
> [26887.389842]  kswapd+0xf0/0x190
> [26887.389845]  ? balance_pgdat+0x6f0/0x6f0
> [26887.389848]  kthread+0xee/0x120
> [26887.389851]  ? kthread_complete_and_exit+0x20/0x20
> [26887.389853]  ret_from_fork+0x2d/0x50
> [26887.389857]  ? kthread_complete_and_exit+0x20/0x20
> [26887.389859]  ret_from_fork_asm+0x11/0x20
> [26887.389864]  </TASK>
> [26887.389865] Kernel panic - not syncing: Hard LOCKUP
> [26887.389867] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G
>       W          6.6.31.el9 #3
> [26887.389869] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS
> V5.0.0.12 R1.22.0 for D3384-A1x                    06/04/2018
> [26887.389870] Call Trace:
> [26887.389871]  <NMI>
> [26887.389872]  dump_stack_lvl+0x44/0x60
> [26887.389877]  panic+0x241/0x330
> [26887.389881]  nmi_panic+0x2f/0x40
> [26887.389883]  watchdog_hardlockup_check+0x119/0x150
> [26887.389886]  __perf_event_overflow+0x102/0x1d0
> [26887.389889]  handle_pmi_common+0x189/0x3e0
> [26887.389893]  ? set_pte_vaddr_p4d+0x4a/0x60
> [26887.389896]  ? flush_tlb_one_kernel+0xa/0x20
> [26887.389899]  ? native_set_fixmap+0x65/0x80
> [26887.389902]  ? ghes_copy_tofrom_phys+0x75/0x110
> [26887.389906]  ? __ghes_peek_estatus.isra.0+0x49/0xb0
> [26887.389909]  intel_pmu_handle_irq+0x10b/0x230
> [26887.389911]  perf_event_nmi_handler+0x28/0x50
> [26887.389913]  nmi_handle+0x58/0x150
> [26887.389916]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389920]  default_do_nmi+0x6b/0x170
> [26887.389922]  exc_nmi+0x12c/0x1a0
> [26887.389923]  end_repeat_nmi+0x16/0x1f
> [26887.389926] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389930] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0
> a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3
> 90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
> [26887.389931] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
> [26887.389933] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX:
> 0000000000000000
> [26887.389934] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff9ad6c6f67050
> [26887.389935] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09:
> 0000000000000067
> [26887.389936] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000046
> [26887.389937] R13: 0000000000000200 R14: 0000000000000000 R15:
> ffffe1138aa98000
> [26887.389940]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389943]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389946]  </NMI>
> [26887.389947]  <TASK>
> [26887.389947]  __raw_spin_lock_irqsave+0x3d/0x50
> [26887.389950]  folio_lruvec_lock_irqsave+0x5e/0x90
> [26887.389953]  __page_cache_release+0x68/0x230
> [26887.389955]  ? remove_migration_ptes+0x5c/0x80
> [26887.389958]  __folio_put+0x24/0x60
> [26887.389960]  __split_huge_page+0x368/0x520
> [26887.389963]  split_huge_page_to_list+0x4b3/0x570
> [26887.389967]  deferred_split_scan+0x1c8/0x290
> [26887.389971]  do_shrink_slab+0x12f/0x2d0
> [26887.389974]  shrink_slab_memcg+0x133/0x1d0
> [26887.389978]  shrink_node_memcgs+0x18e/0x1d0
> [26887.389982]  shrink_node+0xa7/0x370
> [26887.389985]  balance_pgdat+0x332/0x6f0
> [26887.389991]  kswapd+0xf0/0x190
> [26887.389994]  ? balance_pgdat+0x6f0/0x6f0
> [26887.389997]  kthread+0xee/0x120
> [26887.389998]  ? kthread_complete_and_exit+0x20/0x20
> [26887.390000]  ret_from_fork+0x2d/0x50
> [26887.390003]  ? kthread_complete_and_exit+0x20/0x20
> [26887.390004]  ret_from_fork_asm+0x11/0x20
> [26887.390009]  </TASK>
>
OK, thanks for the information. That is most likely caused by lock
contention. I will check the code and keep you posted.



* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-05-27  8:22                     ` Marcin Wanat
  2024-05-27  8:53                       ` Zhaoyang Huang
@ 2024-06-14  3:31                       ` Zhaoyang Huang
  1 sibling, 0 replies; 21+ messages in thread
From: Zhaoyang Huang @ 2024-06-14  3:31 UTC (permalink / raw)
  To: Marcin Wanat
  Cc: Dave Chinner, Andrew Morton, zhaoyang.huang, Alex Shi,
	Kirill A . Shutemov, Hugh Dickins, Baolin Wang, linux-mm,
	linux-kernel, steve.kang

On Mon, May 27, 2024 at 4:22 PM Marcin Wanat <private@marcinwanat.pl> wrote:
>
> On 22.05.2024 12:13, Marcin Wanat wrote:
> > On 22.05.2024 07:37, Zhaoyang Huang wrote:
> >> On Tue, May 21, 2024 at 11:47 PM Marcin Wanat <private@marcinwanat.pl>
> >> wrote:
> >>>
> >>> On 21.05.2024 03:00, Zhaoyang Huang wrote:
> >>>> On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang
> >>>> <huangzhaoyang@gmail.com> wrote:
> >>>>>
> >>>>> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat
> >>>>> <private@marcinwanat.pl> wrote:
> >>>>>>
> >>>>>> On 15.04.2024 03:50, Zhaoyang Huang wrote:
> >>>>>> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> >>>>>> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> >>>>>> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT
> >>>>>> affected.
> >>>>>> However, with long-term kernels 6.1.XX and 6.6.XX,
> >>>>>> (tested at least 10 different versions), this lockup always appears
> >>>>>> after 2-30 days, similar to the report in the original thread.
> >>>>>> The more load (for example, copying a lot of local files while
> >>>>>> serving 20Gbps traffic), the higher the chance that the bug will
> >>>>>> appear.
> >>>>>>
> >>>>>> I haven't been able to reproduce this during synthetic tests,
> >>>>>> but it always occurs in production on 6.1.X and 6.6.X within 2-30
> >>>>>> days.
> >>>>>> If anyone can provide a patch, I can test it on multiple machines
> >>>>>> over the next few days.
> >>>>> Could you please try this one which could be applied on 6.6
> >>>>> directly. Thank you!
> >>>> URL: https://lore.kernel.org/linux-mm/20240412064353.133497-1-
> >>>> zhaoyang.huang@unisoc.com/
> >>>>
> >>>
> >>> Unfortunately, I am unable to cleanly apply this patch against the
> >>> latest 6.6.31
> >> Please try below one which works on my v6.6 based android. Thank you
> >> for your test in advance :D
> >>
> >> mm/huge_memory.c | 22 ++++++++++++++--------
> >>   1 file changed, 14 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >
> > I have compiled 6.6.31 with this patch and will test it on multiple
> > machines over the next 30 days. I will provide an update after 30 days
> > if everything is fine or sooner if any of the hosts experience the same
> > soft lockup again.
> >
>
> First server with 6.6.31 and this patch hang today. Soft lockup changed
> to hard lockup:
>
> [26887.389623] watchdog: Watchdog detected hard LOCKUP on cpu 21
> [26887.389626] Modules linked in: nft_limit xt_limit xt_hashlimit
> ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_connlimit
> nf_conncount tls xt_set ip_set_hash_net ip_set xt_CT xt_conntrack
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables
> nfnetlink rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency
> intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> rapl intel_cstate ipmi_ssif irdma ext4 mbcache ice iTCO_wdt jbd2 mgag200
> intel_pmc_bxt iTCO_vendor_support ib_uverbs i2c_algo_bit acpi_ipmi
> intel_uncore mei_me drm_shmem_helper pcspkr ib_core i2c_i801 ipmi_si
> drm_kms_helper mei lpc_ich i2c_smbus ioatdma intel_pch_thermal
> ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter joydev tcp_bbr
> drm fuse xfs libcrc32c sd_mod t10_pi sg crct10dif_pclmul crc32_pclmul
> crc32c_intel ixgbe polyval_clmulni ahci polyval_generic libahci mdio
> i40e libata megaraid_sas dca ghash_clmulni_intel wmi
> [26887.389682] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G
>       W          6.6.31.el9 #3
> [26887.389685] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS
> V5.0.0.12 R1.22.0 for D3384-A1x                    06/04/2018
> [26887.389687] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389696] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0
> a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3
> 90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
> [26887.389698] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
> [26887.389700] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX:
> 0000000000000000
> [26887.389701] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff9ad6c6f67050
> [26887.389703] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09:
> 0000000000000067
> [26887.389704] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000046
> [26887.389705] R13: 0000000000000200 R14: 0000000000000000 R15:
> ffffe1138aa98000
> [26887.389707] FS:  0000000000000000(0000) GS:ffff9ade20340000(0000)
> knlGS:0000000000000000
> [26887.389708] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [26887.389710] CR2: 000000002912809b CR3: 000000064401e003 CR4:
> 00000000007706e0
> [26887.389711] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [26887.389712] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [26887.389713] PKRU: 55555554
> [26887.389714] Call Trace:
> [26887.389717]  <NMI>
> [26887.389720]  ? watchdog_hardlockup_check+0xac/0x150
> [26887.389725]  ? __perf_event_overflow+0x102/0x1d0
> [26887.389729]  ? handle_pmi_common+0x189/0x3e0
> [26887.389735]  ? set_pte_vaddr_p4d+0x4a/0x60
> [26887.389738]  ? flush_tlb_one_kernel+0xa/0x20
> [26887.389742]  ? native_set_fixmap+0x65/0x80
> [26887.389745]  ? ghes_copy_tofrom_phys+0x75/0x110
> [26887.389751]  ? __ghes_peek_estatus.isra.0+0x49/0xb0
> [26887.389755]  ? intel_pmu_handle_irq+0x10b/0x230
> [26887.389756]  ? perf_event_nmi_handler+0x28/0x50
> [26887.389759]  ? nmi_handle+0x58/0x150
> [26887.389764]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389768]  ? default_do_nmi+0x6b/0x170
> [26887.389770]  ? exc_nmi+0x12c/0x1a0
> [26887.389772]  ? end_repeat_nmi+0x16/0x1f
> [26887.389777]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389780]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389784]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389787]  </NMI>
> [26887.389788]  <TASK>
> [26887.389789]  __raw_spin_lock_irqsave+0x3d/0x50
> [26887.389793]  folio_lruvec_lock_irqsave+0x5e/0x90
> [26887.389798]  __page_cache_release+0x68/0x230
> [26887.389801]  ? remove_migration_ptes+0x5c/0x80
> [26887.389807]  __folio_put+0x24/0x60
> [26887.389808]  __split_huge_page+0x368/0x520
> [26887.389812]  split_huge_page_to_list+0x4b3/0x570
> [26887.389816]  deferred_split_scan+0x1c8/0x290
> [26887.389819]  do_shrink_slab+0x12f/0x2d0
> [26887.389824]  shrink_slab_memcg+0x133/0x1d0
> [26887.389829]  shrink_node_memcgs+0x18e/0x1d0
> [26887.389832]  shrink_node+0xa7/0x370
> [26887.389836]  balance_pgdat+0x332/0x6f0
> [26887.389842]  kswapd+0xf0/0x190
> [26887.389845]  ? balance_pgdat+0x6f0/0x6f0
> [26887.389848]  kthread+0xee/0x120
> [26887.389851]  ? kthread_complete_and_exit+0x20/0x20
> [26887.389853]  ret_from_fork+0x2d/0x50
> [26887.389857]  ? kthread_complete_and_exit+0x20/0x20
> [26887.389859]  ret_from_fork_asm+0x11/0x20
> [26887.389864]  </TASK>
> [26887.389865] Kernel panic - not syncing: Hard LOCKUP
> [26887.389867] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G
>       W          6.6.31.el9 #3
> [26887.389869] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS
> V5.0.0.12 R1.22.0 for D3384-A1x                    06/04/2018
> [26887.389870] Call Trace:
> [26887.389871]  <NMI>
> [26887.389872]  dump_stack_lvl+0x44/0x60
> [26887.389877]  panic+0x241/0x330
> [26887.389881]  nmi_panic+0x2f/0x40
> [26887.389883]  watchdog_hardlockup_check+0x119/0x150
> [26887.389886]  __perf_event_overflow+0x102/0x1d0
> [26887.389889]  handle_pmi_common+0x189/0x3e0
> [26887.389893]  ? set_pte_vaddr_p4d+0x4a/0x60
> [26887.389896]  ? flush_tlb_one_kernel+0xa/0x20
> [26887.389899]  ? native_set_fixmap+0x65/0x80
> [26887.389902]  ? ghes_copy_tofrom_phys+0x75/0x110
> [26887.389906]  ? __ghes_peek_estatus.isra.0+0x49/0xb0
> [26887.389909]  intel_pmu_handle_irq+0x10b/0x230
> [26887.389911]  perf_event_nmi_handler+0x28/0x50
> [26887.389913]  nmi_handle+0x58/0x150
> [26887.389916]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389920]  default_do_nmi+0x6b/0x170
> [26887.389922]  exc_nmi+0x12c/0x1a0
> [26887.389923]  end_repeat_nmi+0x16/0x1f
> [26887.389926] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389930] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0
> a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3
> 90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
> [26887.389931] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
> [26887.389933] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX:
> 0000000000000000
> [26887.389934] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff9ad6c6f67050
> [26887.389935] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09:
> 0000000000000067
> [26887.389936] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000046
> [26887.389937] R13: 0000000000000200 R14: 0000000000000000 R15:
> ffffe1138aa98000
> [26887.389940]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389943]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389946]  </NMI>
> [26887.389947]  <TASK>
> [26887.389947]  __raw_spin_lock_irqsave+0x3d/0x50
> [26887.389950]  folio_lruvec_lock_irqsave+0x5e/0x90
> [26887.389953]  __page_cache_release+0x68/0x230
> [26887.389955]  ? remove_migration_ptes+0x5c/0x80
> [26887.389958]  __folio_put+0x24/0x60
> [26887.389960]  __split_huge_page+0x368/0x520
> [26887.389963]  split_huge_page_to_list+0x4b3/0x570
> [26887.389967]  deferred_split_scan+0x1c8/0x290
> [26887.389971]  do_shrink_slab+0x12f/0x2d0
> [26887.389974]  shrink_slab_memcg+0x133/0x1d0
> [26887.389978]  shrink_node_memcgs+0x18e/0x1d0
> [26887.389982]  shrink_node+0xa7/0x370
> [26887.389985]  balance_pgdat+0x332/0x6f0
> [26887.389991]  kswapd+0xf0/0x190
> [26887.389994]  ? balance_pgdat+0x6f0/0x6f0
> [26887.389997]  kthread+0xee/0x120
> [26887.389998]  ? kthread_complete_and_exit+0x20/0x20
> [26887.390000]  ret_from_fork+0x2d/0x50
> [26887.390003]  ? kthread_complete_and_exit+0x20/0x20
> [26887.390004]  ret_from_fork_asm+0x11/0x20
> [26887.390009]  </TASK>
>
Hi Marcin. Sorry for the late reply. I think the hard lockup above is
caused by a recursive deadlock, as shown in [1], and has been fixed by
[2], which is in v6.8+. Is your regression test still running? Thanks
very much.

[1]
static void __split_huge_page(struct page *page, struct list_head *list,
                pgoff_t end, unsigned int new_order)
{
        /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
        lruvec = folio_lruvec_lock(folio);      // lruvec->lru_lock taken here (1st time)

        for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
                __split_huge_page_tail(folio, i, lruvec, list, new_order);
                /* Some pages can be beyond EOF: drop them from page cache */
                if (head[i].index >= end) {
                        folio_put(tail);
                        //  -> __page_cache_release
                        //    -> folio_lruvec_lock_irqsave   hangs on the 2nd attempt to take the lock

[2]
commit f1ee018baee9f4e724e08859c2559323be768be3
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Feb 27 17:42:42 2024 +0000

    mm: use __page_cache_release() in folios_put()

    Pass a pointer to the lruvec so we can take advantage of the
    folio_lruvec_relock_irqsave().  Adjust the calling convention of
    folio_lruvec_relock_irqsave() to suit and add a page_cache_release()
    wrapper.



* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-05-20 19:42         ` Marcin Wanat
  2024-05-21  0:58           ` Zhaoyang Huang
@ 2024-05-30  8:48           ` Yafang Shao
  2024-05-30  8:57             ` Zhaoyang Huang
  1 sibling, 1 reply; 21+ messages in thread
From: Yafang Shao @ 2024-05-30  8:48 UTC (permalink / raw)
  To: Marcin Wanat
  Cc: Zhaoyang Huang, Dave Chinner, Andrew Morton, zhaoyang.huang,
	Alex Shi, Kirill A . Shutemov, Hugh Dickins, Baolin Wang,
	linux-mm, linux-kernel, steve.kang

On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@marcinwanat.pl> wrote:
>
> On 15.04.2024 03:50, Zhaoyang Huang wrote:
> > On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> >>> loop Dave, since he has ever helped set up an reproducer in
> >>> https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
> >>> @Dave Chinner , I would like to ask for your kindly help on if you
> >>> can verify this patch on your environment if convenient. Thanks a lot.
> >>
> >> I don't have the test environment from 18 months ago available any
> >> more. Also, I haven't seen this problem since that specific test
> >> environment tripped over the issue. Hence I don't have any way of
> >> confirming that the problem is fixed, either, because first I'd
> >> have to reproduce it...
> > Thanks for the information. I noticed that you reported another soft
> > lockup which is related to xas_load since NOV.2023. This patch is
> > supposed to be helpful for this. With regard to the version timing,
> > this commit is actually a revert of <mm/thp: narrow lru locking>
> > b6769834aac1d467fa1c71277d15688efcbb4d76 which is merged before
> > v5.15.
> >
> > For saving your time, a brief description below. IMO, b6769834aa
> > introduce a potential stall between freeze the folio's refcnt and
> > store it back to 2, which have the xas_load->folio_try_get_rcu loops
> > as livelock if it stalls the lru_lock's holder.
> >
> > b6769834aa
> >     split_huge_page_to_list
> > -       spin_lock(lru_lock)
> >         xas_split(&xas, folio,order)
> >         folio_refcnt_freeze(folio, 1 + folio_nr_pages(folio0)
> > +       spin_lock(lru_lock)
> >         xas_store(&xas, offset++, head+i)
> >         page_ref_add(head, 2)
> >         spin_unlock(lru_lock)
> >
> > Sorry in advance if the above doesn't make sense, I am just a
> > developer who is also suffering from this bug and trying to fix it
> I am experiencing a similar error on dozens of hosts, with stack traces
> that are all similar:
>
> [627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s!
> [file_get:953301]
> [627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT
> xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat
> nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr
> intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
> isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
> intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl
> iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support
> dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core
> dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage
> i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf
> ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1
> sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel
> polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw
> drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel
> nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata
> scsi_transport_sas i2c_algo_bit wmi
> [627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded
> Tainted: G             L     6.6.30.el9 #2
> [627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS
> 2.21.2 02/19/2024
> [627163.727847] RIP: 0010:xas_descend+0x1b/0x70
> [627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6
> 0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6
> 08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
> [627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
> [627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX:
> 0000000000000020
> [627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI:
> ffffc90034a67990
> [627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09:
> 0000000000008820
> [627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12:
> ffffc90034a67a20
> [627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15:
> ffffc90034a67a18
> [627163.727870] FS:  00007fc5e503b740(0000) GS:ffff88bfefd80000(0000)
> knlGS:0000000000000000
> [627163.727871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4:
> 00000000007706e0
> [627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [627163.727878] PKRU: 55555554
> [627163.727879] Call Trace:
> [627163.727882]  <IRQ>
> [627163.727886]  ? watchdog_timer_fn+0x22a/0x2a0
> [627163.727892]  ? softlockup_fn+0x70/0x70
> [627163.727895]  ? __hrtimer_run_queues+0x10f/0x2a0
> [627163.727903]  ? hrtimer_interrupt+0x106/0x240
> [627163.727906]  ? __sysvec_apic_timer_interrupt+0x68/0x170
> [627163.727913]  ? sysvec_apic_timer_interrupt+0x9d/0xd0
> [627163.727917]  </IRQ>
> [627163.727918]  <TASK>
> [627163.727920]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [627163.727927]  ? xas_descend+0x1b/0x70
> [627163.727930]  xas_load+0x2c/0x40
> [627163.727933]  xas_find+0x161/0x1a0
> [627163.727937]  find_get_entries+0x77/0x1d0
> [627163.727944]  truncate_inode_pages_range+0x244/0x3f0
> [627163.727950]  truncate_pagecache+0x44/0x60
> [627163.727955]  xfs_setattr_size+0x168/0x490 [xfs]
> [627163.728074]  xfs_vn_setattr+0x78/0x140 [xfs]
> [627163.728153]  notify_change+0x34f/0x4f0
> [627163.728158]  ? _raw_spin_lock+0x13/0x30
> [627163.728165]  ? do_truncate+0x80/0xd0
> [627163.728169]  do_truncate+0x80/0xd0
> [627163.728172]  do_open+0x2ce/0x400
> [627163.728177]  path_openat+0x10d/0x280
> [627163.728181]  do_filp_open+0xb2/0x150
> [627163.728186]  ? check_heap_object+0x34/0x190
> [627163.728189]  ? __check_object_size.part.0+0x5a/0x130
> [627163.728194]  do_sys_openat2+0x92/0xc0
> [627163.728197]  __x64_sys_openat+0x53/0x90
> [627163.728200]  do_syscall_64+0x35/0x80
> [627163.728206]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
> [627163.728210] RIP: 0033:0x7fc5e493e7fb
> [627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18
> 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
> 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
> [627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000101
> [627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX:
> 00007fc5e493e7fb
> [627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI:
> 00000000ffffff9c
> [627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09:
> 0000000000000001
> [627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12:
> 0000000000000241
> [627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15:
> 0000000000000000
> [627163.728227]  </TASK>
>
> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
> However, with long-term kernels 6.1.XX and 6.6.XX,
> (tested at least 10 different versions), this lockup always appears
> after 2-30 days, similar to the report in the original thread.
> The more load (for example, copying a lot of local files while
> serving 20Gbps traffic), the higher the chance that the bug will appear.
>
> I haven't been able to reproduce this during synthetic tests,
> but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.

We encountered a similar issue several months ago. Some of our
production servers crashed within days after deploying the 6.1.y
stable kernel. The soft lockup info is as follows:

[282879.612238] watchdog: BUG: soft lockup - CPU#65 stuck for 101s!
[container-execu:1572375]
[282879.612513] Modules linked in: ebtable_filter ebtables xt_DSCP
iptable_mangle iptable_raw xt_CT cls_bpf sch_ingress raw_diag
unix_diag tcp_diag udp_diag inet_diag iptable_filter bpfilter
xt_conntrack nf_nat nf_conntrack_netlink nfnetlink nf_conntrack
nf_defrag_ipv6 nf_defrag_ipv4 bpf_preload binfmt_misc cuse fuse
overlay af_packet bonding intel_rapl_msr intel_rapl_common
64_edac kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul
polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3
aesni_intel crypto_simd cryptd rapl pcspkr vfat fat xfs mlx5_ib(O)
ib_uverbs(O) input_leds ib_core(O) sg ccp ptdma i2c_piix4 k10temp
acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq ip_tables
ext4 mbcache crc32c_intel jbd2 mlx5_core(O) mlxfw(O) pci_hyperv_intf
psample mlxdevm(O) mlx_compat(O) tls nvme ptp pps_core nvme_core
sd_mod t10_pi ahci libahci libata
[282879.612571] CPU: 65 PID: 1572375 Comm: container-execu Kdump:
loaded Tainted: G        W  O L     6.1.38-rc3 #rc3.pdd
[282879.612574] Hardware name: New H3C Technologies Co., Ltd. H3C
UniServer R4950 G5/RS45M2C9S, BIOS 5.30 06/30/2021
[282879.612576] RIP: 0010:xas_descend+0x18/0x80
[282879.612583] Code: b6 e8 ec de 05 00 cc cc cc cc cc cc cc cc cc cc
cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b
44 c6 08 <48> 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 48 3d fd 00
00 00
[282879.612586] RSP: 0018:ffffad700b247c40 EFLAGS: 00000202
[282879.612588] RAX: ffff91d247a75d8a RBX: fffffffffffffffe RCX:
0000000000000006
[282879.612589] RDX: 0000000000000026 RSI: ffff91d473cb7b30 RDI:
ffffad700b247c68
[282879.612591] RBP: ffffad700b247c48 R08: 0000000000000003 R09:
fffffffffffffffe
[282879.612592] R10: 0000000000001990 R11: 0000000000000003 R12:
ffffad700b247cf8
[282879.612593] R13: ffffad700b247d70 R14: ffffad700b247cf8 R15:
ffffdfcd2c778000
[282879.612594] FS:  00007f5f576fb740(0000) GS:ffff922df0840000(0000)
knlGS:0000000000000000
[282879.612596] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[282879.612597] CR2: 00007fe797100600 CR3: 0000002b2468e000 CR4:
0000000000350ee0
[282879.612599] Call Trace:
[282879.612601]  <IRQ>
[282879.612605]  ? show_regs.cold+0x1a/0x1f
[282879.612610]  ? watchdog_timer_fn+0x1c4/0x220
[282879.612614]  ? softlockup_fn+0x30/0x30
[282879.612616]  ? __hrtimer_run_queues+0xa2/0x2b0
[282879.612620]  ? hrtimer_interrupt+0x109/0x220
[282879.612622]  ? __sysvec_apic_timer_interrupt+0x5e/0x110
[282879.612625]  ? sysvec_apic_timer_interrupt+0x7b/0x90
[282879.612629]  </IRQ>
[282879.612630]  <TASK>
[282879.612631]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[282879.612640]  ? xas_descend+0x18/0x80
[282879.612641]  ? xas_load+0x35/0x40
[282879.612643]  xas_find+0x197/0x1d0
[282879.612645]  find_get_entries+0x6e/0x170
[282879.612649]  truncate_inode_pages_range+0x294/0x4c0
[282879.612655]  ? __xfs_trans_commit+0x13c/0x3e0 [xfs]
[282879.612787]  ? kvfree+0x2c/0x40
[282879.612791]  ? trace_hardirqs_off+0x36/0xf0
[282879.612795]  truncate_inode_pages_final+0x44/0x50
[282879.612798]  evict+0x177/0x190
[282879.612802]  iput.part.0+0x183/0x1e0
[282879.612804]  iput+0x1c/0x30
[282879.612806]  do_unlinkat+0x1c7/0x2c0
[282879.612810]  __x64_sys_unlinkat+0x38/0x70
[282879.612812]  do_syscall_64+0x38/0x90
[282879.612815]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[282879.612818] RIP: 0033:0x7f5f56cf120d
[282879.612827] Code: 69 5c 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e
0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 63 d2 48 63 ff b8 07 01 00
00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 32 5c 2d 00 f7 d8 64
89 02
[282879.612828] RSP: 002b:00007fff30375c48 EFLAGS: 00000206 ORIG_RAX:
0000000000000107
[282879.612830] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
00007f5f56cf120d
[282879.612831] RDX: 0000000000000000 RSI: 0000000001640403 RDI:
0000000000000003
[282879.612832] RBP: 0000000001640403 R08: 0000000000000000 R09:
0000000001640403
[282879.612833] R10: 0000000000000100 R11: 0000000000000206 R12:
0000000000000003
[282879.612834] R13: 000000000163c5c0 R14: 00007fff30375c80 R15:
0000000000000000
[282879.612836]  </TASK>


Unfortunately, we couldn't reproduce the issue on our test servers. We
worked around it by disabling CONFIG_XARRAY_MULTI. Since then, these
production servers have been running smoothly for several months.

> If anyone can provide a patch, I can test it on multiple machines
> over the next few days.
>


--
Regards
Yafang



* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-05-30  8:48           ` Yafang Shao
@ 2024-05-30  8:57             ` Zhaoyang Huang
  2024-05-30  9:24               ` Yafang Shao
  0 siblings, 1 reply; 21+ messages in thread
From: Zhaoyang Huang @ 2024-05-30  8:57 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Marcin Wanat, Dave Chinner, Andrew Morton, zhaoyang.huang,
	Alex Shi, Kirill A . Shutemov, Hugh Dickins, Baolin Wang,
	linux-mm, linux-kernel, steve.kang

On Thu, May 30, 2024 at 4:49 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@marcinwanat.pl> wrote:
> >
> > On 15.04.2024 03:50, Zhaoyang Huang wrote:
> > > On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> > > > > loop Dave, since he has ever helped set up an reproducer in
> > > > > https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
> > > > > @Dave Chinner , I would like to ask for your kindly help on if you
> > > > > can verify this patch on your environment if convenient. Thanks a lot.
> > > >
> > > > I don't have the test environment from 18 months ago available any
> > > > more. Also, I haven't seen this problem since that specific test
> > > > environment tripped over the issue. Hence I don't have any way of
> > > > confirming that the problem is fixed, either, because first I'd
> > > > have to reproduce it...
> > > Thanks for the information. I noticed that you reported another soft
> > > lockup which is related to xas_load since NOV.2023. This patch is
> > > supposed to be helpful for this. With regard to the version timing,
> > > this commit is actually a revert of <mm/thp: narrow lru locking>
> > > b6769834aac1d467fa1c71277d15688efcbb4d76 which is merged before
> > > v5.15.
> > >
> > > For saving your time, a brief description below. IMO, b6769834aa
> > > introduce a potential stall between freeze the folio's refcnt and
> > > store it back to 2, which have the xas_load->folio_try_get_rcu loops
> > > as livelock if it stalls the lru_lock's holder.
> > >
> > > b6769834aa
> > > split_huge_page_to_list
> > > -  spin_lock(lru_lock)
> > >    xas_split(&xas, folio,order)
> > >    folio_refcnt_freeze(folio, 1 + folio_nr_pages(folio0)
> > > +  spin_lock(lru_lock)
> > >    xas_store(&xas, offset++, head+i)
> > >    page_ref_add(head, 2)
> > >    spin_unlock(lru_lock)
> > >
> > > Sorry in advance if the above doesn't make sense, I am just a
> > > developer who is also suffering from this bug and trying to fix it
> > I am experiencing a similar error on dozens of hosts, with stack traces
> > that are all similar:
> >
> > [627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s!
> > [file_get:953301]
> > [627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT
> > xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat
> > nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr
> > intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
> > isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
> > intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl
> > iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support
> > dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core
> > dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage
> > i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf
> > ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1
> > sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel
> > polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw
> > drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel
> > nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata
> > scsi_transport_sas i2c_algo_bit wmi
> > [627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded
> > Tainted: G             L     6.6.30.el9 #2
> > [627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS
> > 2.21.2 02/19/2024
> > [627163.727847] RIP: 0010:xas_descend+0x1b/0x70
> > [627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6
> > 0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6
> > 08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
> > [627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
> > [627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX:
> > 0000000000000020
> > [627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI:
> > ffffc90034a67990
> > [627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09:
> > 0000000000008820
> > [627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12:
> > ffffc90034a67a20
> > [627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15:
> > ffffc90034a67a18
> > [627163.727870] FS:  00007fc5e503b740(0000) GS:ffff88bfefd80000(0000)
> > knlGS:0000000000000000
> > [627163.727871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4:
> > 00000000007706e0
> > [627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > 0000000000000400
> > [627163.727878] PKRU: 55555554
> > [627163.727879] Call Trace:
> > [627163.727882]  <IRQ>
> > [627163.727886]  ? watchdog_timer_fn+0x22a/0x2a0
> > [627163.727892]  ? softlockup_fn+0x70/0x70
> > [627163.727895]  ? __hrtimer_run_queues+0x10f/0x2a0
> > [627163.727903]  ? hrtimer_interrupt+0x106/0x240
> > [627163.727906]  ? __sysvec_apic_timer_interrupt+0x68/0x170
> > [627163.727913]  ? sysvec_apic_timer_interrupt+0x9d/0xd0
> > [627163.727917]  </IRQ>
> > [627163.727918]  <TASK>
> > [627163.727920]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > [627163.727927]  ? xas_descend+0x1b/0x70
> > [627163.727930]  xas_load+0x2c/0x40
> > [627163.727933]  xas_find+0x161/0x1a0
> > [627163.727937]  find_get_entries+0x77/0x1d0
> > [627163.727944]  truncate_inode_pages_range+0x244/0x3f0
> > [627163.727950]  truncate_pagecache+0x44/0x60
> > [627163.727955]  xfs_setattr_size+0x168/0x490 [xfs]
> > [627163.728074]  xfs_vn_setattr+0x78/0x140 [xfs]
> > [627163.728153]  notify_change+0x34f/0x4f0
> > [627163.728158]  ? _raw_spin_lock+0x13/0x30
> > [627163.728165]  ? do_truncate+0x80/0xd0
> > [627163.728169]  do_truncate+0x80/0xd0
> > [627163.728172]  do_open+0x2ce/0x400
> > [627163.728177]  path_openat+0x10d/0x280
> > [627163.728181]  do_filp_open+0xb2/0x150
> > [627163.728186]  ? check_heap_object+0x34/0x190
> > [627163.728189]  ? __check_object_size.part.0+0x5a/0x130
> > [627163.728194]  do_sys_openat2+0x92/0xc0
> > [627163.728197]  __x64_sys_openat+0x53/0x90
> > [627163.728200]  do_syscall_64+0x35/0x80
> > [627163.728206]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
> > [627163.728210] RIP: 0033:0x7fc5e493e7fb
> > [627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18
> > 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
> > 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
> > [627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000101
> > [627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX:
> > 00007fc5e493e7fb
> > [627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI:
> > 00000000ffffff9c
> > [627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09:
> > 0000000000000001
> > [627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12:
> > 0000000000000241
> > [627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15:
> > 0000000000000000
> > [627163.728227]  </TASK>
> >
> > I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> > and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> > kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
> > However, with long-term kernels 6.1.XX and 6.6.XX,
> > (tested at least 10 different versions), this lockup always appears
> > after 2-30 days, similar to the report in the original thread.
> > The more load (for example, copying a lot of local files while
> > serving 20Gbps traffic), the higher the chance that the bug will appear.
> >
> > I haven't been able to reproduce this during synthetic tests,
> > but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
>
> We encountered a similar issue several months ago. Some of our
> production servers crashed within days after deploying the 6.1.y
> stable kernel. The soft lock info as follows,
>
> [282879.612238] watchdog: BUG: soft lockup - CPU#65 stuck for 101s!
> [container-execu:1572375]
> [282879.612513] Modules linked in: ebtable_filter ebtables xt_DSCP
> iptable_mangle iptable_raw xt_CT cls_bpf sch_ingress raw_diag
> unix_diag tcp_diag udp_diag inet_diag iptable_filter bpfilter
> xt_conntrack nf_nat nf_conntrack_netlink nfnetlink nf_conntrack
> nf_defrag_ipv6 nf_defrag_ipv4 bpf_preload binfmt_misc cuse fuse
> overlay af_packet bonding intel_rapl_msr intel_rapl_common
> 64_edac kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul
> polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3
> aesni_intel crypto_simd cryptd rapl pcspkr vfat fat xfs mlx5_ib(O)
> ib_uverbs(O) input_leds ib_core(O) sg ccp ptdma i2c_piix4 k10temp
> acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq ip_tables
> ext4 mbcache crc32c_intel jbd2 mlx5_core(O) mlxfw(O) pci_hyperv_intf
> psample mlxdevm(O) mlx_compat(O) tls nvme ptp pps_core nvme_core
> sd_mod t10_pi ahci libahci libata
> [282879.612571] CPU: 65 PID: 1572375 Comm: container-execu Kdump:
> loaded Tainted: G        W  O L     6.1.38-rc3 #rc3.pdd
> [282879.612574] Hardware name: New H3C Technologies Co., Ltd. H3C
> UniServer R4950 G5/RS45M2C9S, BIOS 5.30 06/30/2021
> [282879.612576] RIP: 0010:xas_descend+0x18/0x80
> [282879.612583] Code: b6 e8 ec de 05 00 cc cc cc cc cc cc cc cc cc cc
> cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b
> 44 c6 08 <48> 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 48 3d fd 00
> 00 00
> [282879.612586] RSP: 0018:ffffad700b247c40 EFLAGS: 00000202
> [282879.612588] RAX: ffff91d247a75d8a RBX: fffffffffffffffe RCX:
> 0000000000000006
> [282879.612589] RDX: 0000000000000026 RSI: ffff91d473cb7b30 RDI:
> ffffad700b247c68
> [282879.612591] RBP: ffffad700b247c48 R08: 0000000000000003 R09:
> fffffffffffffffe
> [282879.612592] R10: 0000000000001990 R11: 0000000000000003 R12:
> ffffad700b247cf8
> [282879.612593] R13: ffffad700b247d70 R14: ffffad700b247cf8 R15:
> ffffdfcd2c778000
> [282879.612594] FS:  00007f5f576fb740(0000) GS:ffff922df0840000(0000)
> knlGS:0000000000000000
> [282879.612596] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [282879.612597] CR2: 00007fe797100600 CR3: 0000002b2468e000 CR4:
> 0000000000350ee0
> [282879.612599] Call Trace:
> [282879.612601]  <IRQ>
> [282879.612605]  ? show_regs.cold+0x1a/0x1f
> [282879.612610]  ? watchdog_timer_fn+0x1c4/0x220
> [282879.612614]  ? softlockup_fn+0x30/0x30
> [282879.612616]  ? __hrtimer_run_queues+0xa2/0x2b0
> [282879.612620]  ? hrtimer_interrupt+0x109/0x220
> [282879.612622]  ? __sysvec_apic_timer_interrupt+0x5e/0x110
> [282879.612625]  ? sysvec_apic_timer_interrupt+0x7b/0x90
> [282879.612629]  </IRQ>
> [282879.612630]  <TASK>
> [282879.612631]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> [282879.612640]  ? xas_descend+0x18/0x80
> [282879.612641]  ? xas_load+0x35/0x40
> [282879.612643]  xas_find+0x197/0x1d0
> [282879.612645]  find_get_entries+0x6e/0x170
> [282879.612649]  truncate_inode_pages_range+0x294/0x4c0
> [282879.612655]  ? __xfs_trans_commit+0x13c/0x3e0 [xfs]
> [282879.612787]  ? kvfree+0x2c/0x40
> [282879.612791]  ? trace_hardirqs_off+0x36/0xf0
> [282879.612795]  truncate_inode_pages_final+0x44/0x50
> [282879.612798]  evict+0x177/0x190
> [282879.612802]  iput.part.0+0x183/0x1e0
> [282879.612804]  iput+0x1c/0x30
> [282879.612806]  do_unlinkat+0x1c7/0x2c0
> [282879.612810]  __x64_sys_unlinkat+0x38/0x70
> [282879.612812]  do_syscall_64+0x38/0x90
> [282879.612815]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [282879.612818] RIP: 0033:0x7f5f56cf120d
> [282879.612827] Code: 69 5c 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e
> 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 63 d2 48 63 ff b8 07 01 00
> 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 32 5c 2d 00 f7 d8 64
> 89 02
> [282879.612828] RSP: 002b:00007fff30375c48 EFLAGS: 00000206 ORIG_RAX:
> 0000000000000107
> [282879.612830] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
> 00007f5f56cf120d
> [282879.612831] RDX: 0000000000000000 RSI: 0000000001640403 RDI:
> 0000000000000003
> [282879.612832] RBP: 0000000001640403 R08: 0000000000000000 R09:
> 0000000001640403
> [282879.612833] R10: 0000000000000100 R11: 0000000000000206 R12:
> 0000000000000003
> [282879.612834] R13: 000000000163c5c0 R14: 00007fff30375c80 R15:
> 0000000000000000
> [282879.612836]  </TASK>
>
>
> Unfortunately, we couldn't reproduce the issue on our test servers. We
> worked around it by disabling CONFIG_XARRAY_MULTI. Since then, these
> production servers have been running smoothly for several months.
>
> > If anyone can provide a patch, I can test it on multiple machines
> > over the next few days.
I would highly appreciate it if you could try the patch below, which
works on my v6.6-based Android. However, a hard lockup has been
reported in an ongoing regression test (not yet sure whether it is
caused by this patch). Thank you!

mm/huge_memory.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 064fbd90822b..5899906c326a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2498,7 +2498,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 {
  struct folio *folio = page_folio(page);
  struct page *head = &folio->page;
- struct lruvec *lruvec;
+ struct lruvec *lruvec = folio_lruvec(folio);
  struct address_space *swap_cache = NULL;
  unsigned long offset = 0;
  unsigned int nr = thp_nr_pages(head);
@@ -2513,9 +2513,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
  xa_lock(&swap_cache->i_pages);
  }

- /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
- lruvec = folio_lruvec_lock(folio);
-
  ClearPageHasHWPoisoned(head);

  for (i = nr - 1; i >= 1; i--) {
@@ -2541,9 +2538,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
  }

  ClearPageCompound(head);
- unlock_page_lruvec(lruvec);
- /* Caller disabled irqs, so they are still disabled here */
-
  split_page_owner(head, nr);

  /* See comment in __split_huge_page_tail() */
@@ -2560,7 +2554,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
  page_ref_add(head, 2);
  xa_unlock(&head->mapping->i_pages);
  }
- local_irq_enable();

  if (nr_dropped)
  shmem_uncharge(head->mapping->host, nr_dropped);
@@ -2631,6 +2624,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
  int extra_pins, ret;
  pgoff_t end;
  bool is_hzp;
+ struct lruvec *lruvec;

  VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
  VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
@@ -2714,6 +2708,14 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)

  /* block interrupt reentry in xa_lock and spinlock */
  local_irq_disable();
+
+ /*
+ * Take the lruvec lock before freezing the folio so that the folio
+ * cannot remain in the page cache with refcnt == 0, which could make
+ * find_get_entry livelock while iterating the xarray.
+ */
+ lruvec = folio_lruvec_lock(folio);
+
  if (mapping) {
  /*
  * Check if the folio is present in page cache.
@@ -2748,12 +2750,16 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
  }

  __split_huge_page(page, list, end);

> >
>
>
> --
> Regards
> Yafang
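
The race targeted by the patch above can be illustrated with a minimal
user-space sketch. It assumes C11 atomics as a stand-in for the kernel's
page reference helpers; struct folio_model and every model_* function are
hypothetical names invented for this illustration and are not kernel code.
The sketch only shows why a lockless lookup cannot make progress while a
folio it can still find has a reference count of zero: the try-get step
refuses a zero count, so the lookup retries until the splitter stores a
non-zero count again.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio: only the reference count is modelled. */
struct folio_model {
	atomic_int refcount;
};

/* Models folio_ref_freeze(): succeeds only if the count equals 'expected'. */
static bool model_ref_freeze(struct folio_model *f, int expected)
{
	return atomic_compare_exchange_strong(&f->refcount, &expected, 0);
}

/* Models the end of the split, where the count becomes non-zero again. */
static void model_ref_unfreeze(struct folio_model *f, int count)
{
	assert(count > 0);
	atomic_store(&f->refcount, count);
}

/* Models folio_try_get_rcu(): take a reference unless the count is zero. */
static bool model_try_get(struct folio_model *f)
{
	int old = atomic_load(&f->refcount);

	while (old != 0) {
		/* On failure 'old' is reloaded and the zero check repeats. */
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Models the retry loop of a lockless lookup. The kernel loop has no retry
 * budget; 'max_retries' exists only so that this sketch terminates.
 * Returns the number of failed attempts, or -1 if the budget ran out.
 */
static int model_lookup(struct folio_model *f, int max_retries)
{
	for (int failed = 0; failed <= max_retries; failed++) {
		if (model_try_get(f))
			return failed;
	}
	return -1;
}
```

In this model a lookup that runs between model_ref_freeze() and
model_ref_unfreeze() exhausts its budget; in the kernel there is no budget,
so the length of that window matters. The sketch says nothing about whether
holding lru_lock across the window is the right fix.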



* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-05-30  8:57             ` Zhaoyang Huang
@ 2024-05-30  9:24               ` Yafang Shao
  2024-05-31  6:17                 ` Zhaoyang Huang
  0 siblings, 1 reply; 21+ messages in thread
From: Yafang Shao @ 2024-05-30  9:24 UTC (permalink / raw)
  To: Zhaoyang Huang
  Cc: Marcin Wanat, Dave Chinner, Andrew Morton, zhaoyang.huang,
	Alex Shi, Kirill A . Shutemov, Hugh Dickins, Baolin Wang,
	linux-mm, linux-kernel, steve.kang

On Thu, May 30, 2024 at 4:57 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> On Thu, May 30, 2024 at 4:49 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@marcinwanat.pl> wrote:
> > >
> > > On 15.04.2024 03:50, Zhaoyang Huang wrote:
> > > > On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> > > > > > loop Dave, since he has ever helped set up an reproducer in
> > > > > > https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
> > > > > > @Dave Chinner , I would like to ask for your kindly help on if you
> > > > > > can verify this patch on your environment if convenient. Thanks a lot.
> > > > >
> > > > > I don't have the test environment from 18 months ago available any
> > > > > more. Also, I haven't seen this problem since that specific test
> > > > > environment tripped over the issue. Hence I don't have any way of
> > > > > confirming that the problem is fixed, either, because first I'd
> > > > > have to reproduce it...
> > > > Thanks for the information. I noticed that you reported another soft
> > > > lockup which is related to xas_load since NOV.2023. This patch is
> > > > supposed to be helpful for this. With regard to the version timing,
> > > > this commit is actually a revert of <mm/thp: narrow lru locking>
> > > > b6769834aac1d467fa1c71277d15688efcbb4d76 which is merged before
> > > > v5.15.
> > > >
> > > > For saving your time, a brief description below. IMO, b6769834aa
> > > > introduce a potential stall between freeze the folio's refcnt and
> > > > store it back to 2, which have the xas_load->folio_try_get_rcu loops
> > > > as livelock if it stalls the lru_lock's holder.
> > > >
> > > > b6769834aa
> > > > split_huge_page_to_list
> > > > -  spin_lock(lru_lock)
> > > >    xas_split(&xas, folio,order)
> > > >    folio_refcnt_freeze(folio, 1 + folio_nr_pages(folio0)
> > > > +  spin_lock(lru_lock)
> > > >    xas_store(&xas, offset++, head+i)
> > > >    page_ref_add(head, 2)
> > > >    spin_unlock(lru_lock)
> > > >
> > > > Sorry in advance if the above doesn't make sense, I am just a
> > > > developer who is also suffering from this bug and trying to fix it
> > > I am experiencing a similar error on dozens of hosts, with stack traces
> > > that are all similar:
> > >
> > > [627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s!
> > > [file_get:953301]
> > > [627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT
> > > xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat
> > > nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr
> > > intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
> > > isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
> > > intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl
> > > iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support
> > > dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core
> > > dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage
> > > i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf
> > > ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1
> > > sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel
> > > polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw
> > > drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel
> > > nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata
> > > scsi_transport_sas i2c_algo_bit wmi
> > > [627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded
> > > Tainted: G             L     6.6.30.el9 #2
> > > [627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS
> > > 2.21.2 02/19/2024
> > > [627163.727847] RIP: 0010:xas_descend+0x1b/0x70
> > > [627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6
> > > 0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6
> > > 08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
> > > [627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
> > > [627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX:
> > > 0000000000000020
> > > [627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI:
> > > ffffc90034a67990
> > > [627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09:
> > > 0000000000008820
> > > [627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12:
> > > ffffc90034a67a20
> > > [627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15:
> > > ffffc90034a67a18
> > > [627163.727870] FS:  00007fc5e503b740(0000) GS:ffff88bfefd80000(0000)
> > > knlGS:0000000000000000
> > > [627163.727871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4:
> > > 00000000007706e0
> > > [627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > 0000000000000000
> > > [627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > 0000000000000400
> > > [627163.727878] PKRU: 55555554
> > > [627163.727879] Call Trace:
> > > [627163.727882]  <IRQ>
> > > [627163.727886]  ? watchdog_timer_fn+0x22a/0x2a0
> > > [627163.727892]  ? softlockup_fn+0x70/0x70
> > > [627163.727895]  ? __hrtimer_run_queues+0x10f/0x2a0
> > > [627163.727903]  ? hrtimer_interrupt+0x106/0x240
> > > [627163.727906]  ? __sysvec_apic_timer_interrupt+0x68/0x170
> > > [627163.727913]  ? sysvec_apic_timer_interrupt+0x9d/0xd0
> > > [627163.727917]  </IRQ>
> > > [627163.727918]  <TASK>
> > > [627163.727920]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > [627163.727927]  ? xas_descend+0x1b/0x70
> > > [627163.727930]  xas_load+0x2c/0x40
> > > [627163.727933]  xas_find+0x161/0x1a0
> > > [627163.727937]  find_get_entries+0x77/0x1d0
> > > [627163.727944]  truncate_inode_pages_range+0x244/0x3f0
> > > [627163.727950]  truncate_pagecache+0x44/0x60
> > > [627163.727955]  xfs_setattr_size+0x168/0x490 [xfs]
> > > [627163.728074]  xfs_vn_setattr+0x78/0x140 [xfs]
> > > [627163.728153]  notify_change+0x34f/0x4f0
> > > [627163.728158]  ? _raw_spin_lock+0x13/0x30
> > > [627163.728165]  ? do_truncate+0x80/0xd0
> > > [627163.728169]  do_truncate+0x80/0xd0
> > > [627163.728172]  do_open+0x2ce/0x400
> > > [627163.728177]  path_openat+0x10d/0x280
> > > [627163.728181]  do_filp_open+0xb2/0x150
> > > [627163.728186]  ? check_heap_object+0x34/0x190
> > > [627163.728189]  ? __check_object_size.part.0+0x5a/0x130
> > > [627163.728194]  do_sys_openat2+0x92/0xc0
> > > [627163.728197]  __x64_sys_openat+0x53/0x90
> > > [627163.728200]  do_syscall_64+0x35/0x80
> > > [627163.728206]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
> > > [627163.728210] RIP: 0033:0x7fc5e493e7fb
> > > [627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18
> > > 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
> > > 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
> > > [627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX:
> > > 0000000000000101
> > > [627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX:
> > > 00007fc5e493e7fb
> > > [627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI:
> > > 00000000ffffff9c
> > > [627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09:
> > > 0000000000000001
> > > [627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12:
> > > 0000000000000241
> > > [627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15:
> > > 0000000000000000
> > > [627163.728227]  </TASK>
> > >
> > > I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> > > and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> > > kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
> > > However, with long-term kernels 6.1.XX and 6.6.XX,
> > > (tested at least 10 different versions), this lockup always appears
> > > after 2-30 days, similar to the report in the original thread.
> > > The more load (for example, copying a lot of local files while
> > > serving 20Gbps traffic), the higher the chance that the bug will appear.
> > >
> > > I haven't been able to reproduce this during synthetic tests,
> > > but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
> >
> > We encountered a similar issue several months ago. Some of our
> > production servers crashed within days after deploying the 6.1.y
> > stable kernel. The soft lock info as follows,
> >
> > [282879.612238] watchdog: BUG: soft lockup - CPU#65 stuck for 101s!
> > [container-execu:1572375]
> > [282879.612513] Modules linked in: ebtable_filter ebtables xt_DSCP
> > iptable_mangle iptable_raw xt_CT cls_bpf sch_ingress raw_diag
> > unix_diag tcp_diag udp_diag inet_diag iptable_filter bpfilter
> > xt_conntrack nf_nat nf_conntrack_netlink nfnetlink nf_conntrack
> > nf_defrag_ipv6 nf_defrag_ipv4 bpf_preload binfmt_misc cuse fuse
> > overlay af_packet bonding intel_rapl_msr intel_rapl_common
> > 64_edac kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul
> > polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3
> > aesni_intel crypto_simd cryptd rapl pcspkr vfat fat xfs mlx5_ib(O)
> > ib_uverbs(O) input_leds ib_core(O) sg ccp ptdma i2c_piix4 k10temp
> > acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq ip_tables
> > ext4 mbcache crc32c_intel jbd2 mlx5_core(O) mlxfw(O) pci_hyperv_intf
> > psample mlxdevm(O) mlx_compat(O) tls nvme ptp pps_core nvme_core
> > sd_mod t10_pi ahci libahci libata
> > [282879.612571] CPU: 65 PID: 1572375 Comm: container-execu Kdump:
> > loaded Tainted: G        W  O L     6.1.38-rc3 #rc3.pdd
> > [282879.612574] Hardware name: New H3C Technologies Co., Ltd. H3C
> > UniServer R4950 G5/RS45M2C9S, BIOS 5.30 06/30/2021
> > [282879.612576] RIP: 0010:xas_descend+0x18/0x80
> > [282879.612583] Code: b6 e8 ec de 05 00 cc cc cc cc cc cc cc cc cc cc
> > cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b
> > 44 c6 08 <48> 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 48 3d fd 00
> > 00 00
> > [282879.612586] RSP: 0018:ffffad700b247c40 EFLAGS: 00000202
> > [282879.612588] RAX: ffff91d247a75d8a RBX: fffffffffffffffe RCX:
> > 0000000000000006
> > [282879.612589] RDX: 0000000000000026 RSI: ffff91d473cb7b30 RDI:
> > ffffad700b247c68
> > [282879.612591] RBP: ffffad700b247c48 R08: 0000000000000003 R09:
> > fffffffffffffffe
> > [282879.612592] R10: 0000000000001990 R11: 0000000000000003 R12:
> > ffffad700b247cf8
> > [282879.612593] R13: ffffad700b247d70 R14: ffffad700b247cf8 R15:
> > ffffdfcd2c778000
> > [282879.612594] FS:  00007f5f576fb740(0000) GS:ffff922df0840000(0000)
> > knlGS:0000000000000000
> > [282879.612596] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [282879.612597] CR2: 00007fe797100600 CR3: 0000002b2468e000 CR4:
> > 0000000000350ee0
> > [282879.612599] Call Trace:
> > [282879.612601]  <IRQ>
> > [282879.612605]  ? show_regs.cold+0x1a/0x1f
> > [282879.612610]  ? watchdog_timer_fn+0x1c4/0x220
> > [282879.612614]  ? softlockup_fn+0x30/0x30
> > [282879.612616]  ? __hrtimer_run_queues+0xa2/0x2b0
> > [282879.612620]  ? hrtimer_interrupt+0x109/0x220
> > [282879.612622]  ? __sysvec_apic_timer_interrupt+0x5e/0x110
> > [282879.612625]  ? sysvec_apic_timer_interrupt+0x7b/0x90
> > [282879.612629]  </IRQ>
> > [282879.612630]  <TASK>
> > [282879.612631]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> > [282879.612640]  ? xas_descend+0x18/0x80
> > [282879.612641]  ? xas_load+0x35/0x40
> > [282879.612643]  xas_find+0x197/0x1d0
> > [282879.612645]  find_get_entries+0x6e/0x170
> > [282879.612649]  truncate_inode_pages_range+0x294/0x4c0
> > [282879.612655]  ? __xfs_trans_commit+0x13c/0x3e0 [xfs]
> > [282879.612787]  ? kvfree+0x2c/0x40
> > [282879.612791]  ? trace_hardirqs_off+0x36/0xf0
> > [282879.612795]  truncate_inode_pages_final+0x44/0x50
> > [282879.612798]  evict+0x177/0x190
> > [282879.612802]  iput.part.0+0x183/0x1e0
> > [282879.612804]  iput+0x1c/0x30
> > [282879.612806]  do_unlinkat+0x1c7/0x2c0
> > [282879.612810]  __x64_sys_unlinkat+0x38/0x70
> > [282879.612812]  do_syscall_64+0x38/0x90
> > [282879.612815]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > [282879.612818] RIP: 0033:0x7f5f56cf120d
> > [282879.612827] Code: 69 5c 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e
> > 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 63 d2 48 63 ff b8 07 01 00
> > 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 32 5c 2d 00 f7 d8 64
> > 89 02
> > [282879.612828] RSP: 002b:00007fff30375c48 EFLAGS: 00000206 ORIG_RAX:
> > 0000000000000107
> > [282879.612830] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
> > 00007f5f56cf120d
> > [282879.612831] RDX: 0000000000000000 RSI: 0000000001640403 RDI:
> > 0000000000000003
> > [282879.612832] RBP: 0000000001640403 R08: 0000000000000000 R09:
> > 0000000001640403
> > [282879.612833] R10: 0000000000000100 R11: 0000000000000206 R12:
> > 0000000000000003
> > [282879.612834] R13: 000000000163c5c0 R14: 00007fff30375c80 R15:
> > 0000000000000000
> > [282879.612836]  </TASK>
> >
> >
> > Unfortunately, we couldn't reproduce the issue on our test servers. We
> > worked around it by disabling CONFIG_XARRAY_MULTI. Since then, these
> > production servers have been running smoothly for several months.
> >
> > > If anyone can provide a patch, I can test it on multiple machines
> > > over the next few days.
> I would highly appreciate it if you could try the patch below, which
> works on my v6.6-based Android. However, a hard lockup has been
> reported in an ongoing regression test (not yet sure whether it is
> caused by this patch). Thank you!

I'm sorry to inform you that our users are unwilling to experiment
with these changes on our production servers again, and I am unable to
reproduce the issue on our test servers. I am reporting this issue to
highlight to the community that it is indeed a serious problem, and we
should consider it carefully.

>
> mm/huge_memory.c | 22 ++++++++++++++--------
>  1 file changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 064fbd90822b..5899906c326a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2498,7 +2498,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  {
>   struct folio *folio = page_folio(page);
>   struct page *head = &folio->page;
> - struct lruvec *lruvec;
> + struct lruvec *lruvec = folio_lruvec(folio);
>   struct address_space *swap_cache = NULL;
>   unsigned long offset = 0;
>   unsigned int nr = thp_nr_pages(head);
> @@ -2513,9 +2513,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>   xa_lock(&swap_cache->i_pages);
>   }
>
> - /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> - lruvec = folio_lruvec_lock(folio);
> -
>   ClearPageHasHWPoisoned(head);
>
>   for (i = nr - 1; i >= 1; i--) {
> @@ -2541,9 +2538,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>   }
>
>   ClearPageCompound(head);
> - unlock_page_lruvec(lruvec);
> - /* Caller disabled irqs, so they are still disabled here */
> -
>   split_page_owner(head, nr);
>
>   /* See comment in __split_huge_page_tail() */
> @@ -2560,7 +2554,6 @@ static void __split_huge_page(struct page *page,
> struct list_head *list,
>   page_ref_add(head, 2);
>   xa_unlock(&head->mapping->i_pages);
>   }
> - local_irq_enable();
>
>   if (nr_dropped)
>   shmem_uncharge(head->mapping->host, nr_dropped);
> @@ -2631,6 +2624,7 @@ int split_huge_page_to_list(struct page *page,
> struct list_head *list)
>   int extra_pins, ret;
>   pgoff_t end;
>   bool is_hzp;
> + struct lruvec *lruvec;
>
>   VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>   VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
> @@ -2714,6 +2708,14 @@ int split_huge_page_to_list(struct page *page,
> struct list_head *list)
>
>   /* block interrupt reentry in xa_lock and spinlock */
>   local_irq_disable();
> +
> + /*
> + * take lruvec's lock before freeze the folio to prevent the folio
> + * remains in the page cache with refcnt == 0, which could lead to
> + * find_get_entry enters livelock by iterating the xarray.
> + */
> + lruvec = folio_lruvec_lock(folio);
> +
>   if (mapping) {
>   /*
>   * Check if the folio is present in page cache.
> @@ -2748,12 +2750,16 @@ int split_huge_page_to_list(struct page *page,
> struct list_head *list)
>   }
>
>   __split_huge_page(page, list, end);
>
> > >
> >
> >
> > --
> > Regards
> > Yafang



-- 
Regards
Yafang



* Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2024-05-30  9:24               ` Yafang Shao
@ 2024-05-31  6:17                 ` Zhaoyang Huang
  0 siblings, 0 replies; 21+ messages in thread
From: Zhaoyang Huang @ 2024-05-31  6:17 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Marcin Wanat, Dave Chinner, Andrew Morton, zhaoyang.huang,
	Alex Shi, Kirill A . Shutemov, Hugh Dickins, Baolin Wang,
	linux-mm, linux-kernel, steve.kang

On Thu, May 30, 2024 at 5:24 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, May 30, 2024 at 4:57 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
> >
> > On Thu, May 30, 2024 at 4:49 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@marcinwanat.pl> wrote:
> > > >
> > > > On 15.04.2024 03:50, Zhaoyang Huang wrote:
> > > > > On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > >>
> > > > >> On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> > > > >>> loop Dave, since he has ever helped set up an reproducer in
> > > > >>> https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
> > > > >>> @Dave Chinner , I would like to ask for your kindly help on if
> > > > >>> you can verify this patch on your environment if convenient.
> > > > >>> Thanks a lot.
> > > > >>
> > > > >> I don't have the test environment from 18 months ago available any
> > > > >> more. Also, I haven't seen this problem since that specific test
> > > > >> environment tripped over the issue. Hence I don't have any way of
> > > > >> confirming that the problem is fixed, either, because first I'd
> > > > >> have to reproduce it...
> > > > > Thanks for the information. I noticed that you reported another soft
> > > > > lockup which is related to xas_load since NOV.2023. This patch is
> > > > > supposed to be helpful for this. With regard to the version timing,
> > > > > this commit is actually a revert of <mm/thp: narrow lru locking>
> > > > > b6769834aac1d467fa1c71277d15688efcbb4d76 which is merged before
> > > > > v5.15.
> > > > >
> > > > > For saving your time, a brief description below. IMO, b6769834aa
> > > > > introduce a potential stall between freeze the folio's refcnt and
> > > > > store it back to 2, which have the xas_load->folio_try_get_rcu loops
> > > > > as livelock if it stalls the lru_lock's holder.
> > > > >
> > > > > b6769834aa
> > > > > split_huge_page_to_list
> > > > > -   spin_lock(lru_lock)
> > > > >     xas_split(&xas, folio,order)
> > > > >     folio_refcnt_freeze(folio, 1 + folio_nr_pages(folio0)
> > > > > +   spin_lock(lru_lock)
> > > > >     xas_store(&xas, offset++, head+i)
> > > > >     page_ref_add(head, 2)
> > > > >     spin_unlock(lru_lock)
> > > > >
> > > > > Sorry in advance if the above doesn't make sense, I am just a
> > > > > developer who is also suffering from this bug and trying to fix it
> > > >
> > > > I am experiencing a similar error on dozens of hosts, with stack traces
> > > > that are all similar:
> > > >
> > > > [627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s!
> > > > [file_get:953301]
> > > > [627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT
> > > > xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat
> > > > nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr
> > > > intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
> > > > isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
> > > > intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl
> > > > iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support
> > > > dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core
> > > > dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage
> > > > i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf
> > > > ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1
> > > > sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel
> > > > polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw
> > > > drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel
> > > > nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata
> > > > scsi_transport_sas i2c_algo_bit wmi
> > > > [627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded
> > > > Tainted: G             L     6.6.30.el9 #2
> > > > [627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS
> > > > 2.21.2 02/19/2024
> > > > [627163.727847] RIP: 0010:xas_descend+0x1b/0x70
> > > > [627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6
> > > > 0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6
> > > > 08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
> > > > [627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
> > > > [627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX:
> > > > 0000000000000020
> > > > [627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI:
> > > > ffffc90034a67990
> > > > [627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09:
> > > > 0000000000008820
> > > > [627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12:
> > > > ffffc90034a67a20
> > > > [627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15:
> > > > ffffc90034a67a18
> > > > [627163.727870] FS:  00007fc5e503b740(0000) GS:ffff88bfefd80000(0000)
> > > > knlGS:0000000000000000
> > > > [627163.727871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4:
> > > > 00000000007706e0
> > > > [627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > > 0000000000000000
> > > > [627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > > 0000000000000400
> > > > [627163.727878] PKRU: 55555554
> > > > [627163.727879] Call Trace:
> > > > [627163.727882]  <IRQ>
> > > > [627163.727886]  ? watchdog_timer_fn+0x22a/0x2a0
> > > > [627163.727892]  ? softlockup_fn+0x70/0x70
> > > > [627163.727895]  ? __hrtimer_run_queues+0x10f/0x2a0
> > > > [627163.727903]  ? hrtimer_interrupt+0x106/0x240
> > > > [627163.727906]  ? __sysvec_apic_timer_interrupt+0x68/0x170
> > > > [627163.727913]  ? sysvec_apic_timer_interrupt+0x9d/0xd0
> > > > [627163.727917]  </IRQ>
> > > > [627163.727918]  <TASK>
> > > > [627163.727920]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > [627163.727927]  ? xas_descend+0x1b/0x70
> > > > [627163.727930]  xas_load+0x2c/0x40
> > > > [627163.727933]  xas_find+0x161/0x1a0
> > > > [627163.727937]  find_get_entries+0x77/0x1d0
> > > > [627163.727944]  truncate_inode_pages_range+0x244/0x3f0
> > > > [627163.727950]  truncate_pagecache+0x44/0x60
> > > > [627163.727955]  xfs_setattr_size+0x168/0x490 [xfs]
> > > > [627163.728074]  xfs_vn_setattr+0x78/0x140 [xfs]
> > > > [627163.728153]  notify_change+0x34f/0x4f0
> > > > [627163.728158]  ? _raw_spin_lock+0x13/0x30
> > > > [627163.728165]  ? do_truncate+0x80/0xd0
> > > > [627163.728169]  do_truncate+0x80/0xd0
> > > > [627163.728172]  do_open+0x2ce/0x400
> > > > [627163.728177]  path_openat+0x10d/0x280
> > > > [627163.728181]  do_filp_open+0xb2/0x150
> > > > [627163.728186]  ? check_heap_object+0x34/0x190
> > > > [627163.728189]  ? __check_object_size.part.0+0x5a/0x130
> > > > [627163.728194]  do_sys_openat2+0x92/0xc0
> > > > [627163.728197]  __x64_sys_openat+0x53/0x90
> > > > [627163.728200]  do_syscall_64+0x35/0x80
> > > > [627163.728206]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
> > > > [627163.728210] RIP: 0033:0x7fc5e493e7fb
> > > > [627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18
> > > > 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
> > > > 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
> > > > [627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX:
> > > > 0000000000000101
> > > > [627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX:
> > > > 00007fc5e493e7fb
> > > > [627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI:
> > > > 00000000ffffff9c
> > > > [627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09:
> > > > 0000000000000001
> > > > [627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12:
> > > > 0000000000000241
> > > > [627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15:
> > > > 0000000000000000
> > > > [627163.728227]  </TASK>
> > > >
> > > > I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> > > > and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> > > > kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
> > > > However, with long-term kernels 6.1.XX and 6.6.XX,
> > > > (tested at least 10 different versions), this lockup always appears
> > > > after 2-30 days, similar to the report in the original thread.
> > > > The more load (for example, copying a lot of local files while
> > > > serving 20Gbps traffic), the higher the chance that the bug will appear.
> > > >
> > > > I haven't been able to reproduce this during synthetic tests,
> > > > but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
> > >
> > > We encountered a similar issue several months ago. Some of our
> > > production servers crashed within days after deploying the 6.1.y
> > > stable kernel. The soft lock info as follows,
> > >
> > > [282879.612238] watchdog: BUG: soft lockup - CPU#65 stuck for 101s!
> > > [container-execu:1572375]
> > > [282879.612513] Modules linked in: ebtable_filter ebtables xt_DSCP
> > > iptable_mangle iptable_raw xt_CT cls_bpf sch_ingress raw_diag
> > > unix_diag tcp_diag udp_diag inet_diag iptable_filter bpfilter
> > > xt_conntrack nf_nat nf_conntrack_netlink nfnetlink nf_conntrack
> > > nf_defrag_ipv6 nf_defrag_ipv4 bpf_preload binfmt_misc cuse fuse
> > > overlay af_packet bonding intel_rapl_msr intel_rapl_common
> > > 64_edac kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul
> > > polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3
> > > aesni_intel crypto_simd cryptd rapl pcspkr vfat fat xfs mlx5_ib(O)
> > > ib_uverbs(O) input_leds ib_core(O) sg ccp ptdma i2c_piix4 k10temp
> > > acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq ip_tables
> > > ext4 mbcache crc32c_intel jbd2 mlx5_core(O) mlxfw(O) pci_hyperv_intf
> > > psample mlxdevm(O) mlx_compat(O) tls nvme ptp pps_core nvme_core
> > > sd_mod t10_pi ahci libahci libata
> > > [282879.612571] CPU: 65 PID: 1572375 Comm: container-execu Kdump:
> > > loaded Tainted: G        W  O L     6.1.38-rc3 #rc3.pdd
> > > [282879.612574] Hardware name: New H3C Technologies Co., Ltd. H3C
> > > UniServer R4950 G5/RS45M2C9S, BIOS 5.30 06/30/2021
> > > [282879.612576] RIP: 0010:xas_descend+0x18/0x80
> > > [282879.612583] Code: b6 e8 ec de 05 00 cc cc cc cc cc cc cc cc cc cc
> > > cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b
> > > 44 c6 08 <48> 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 48 3d fd 00
> > > 00 00
> > > [282879.612586] RSP: 0018:ffffad700b247c40 EFLAGS: 00000202
> > > [282879.612588] RAX: ffff91d247a75d8a RBX: fffffffffffffffe RCX:
> > > 0000000000000006
> > > [282879.612589] RDX: 0000000000000026 RSI: ffff91d473cb7b30 RDI:
> > > ffffad700b247c68
> > > [282879.612591] RBP: ffffad700b247c48 R08: 0000000000000003 R09:
> > > fffffffffffffffe
> > > [282879.612592] R10: 0000000000001990 R11: 0000000000000003 R12:
> > > ffffad700b247cf8
> > > [282879.612593] R13: ffffad700b247d70 R14: ffffad700b247cf8 R15:
> > > ffffdfcd2c778000
> > > [282879.612594] FS:  00007f5f576fb740(0000) GS:ffff922df0840000(0000)
> > > knlGS:0000000000000000
> > > [282879.612596] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [282879.612597] CR2: 00007fe797100600 CR3: 0000002b2468e000 CR4:
> > > 0000000000350ee0
> > > [282879.612599] Call Trace:
> > > [282879.612601]  <IRQ>
> > > [282879.612605]  ? show_regs.cold+0x1a/0x1f
> > > [282879.612610]  ? watchdog_timer_fn+0x1c4/0x220
> > > [282879.612614]  ? softlockup_fn+0x30/0x30
> > > [282879.612616]  ? __hrtimer_run_queues+0xa2/0x2b0
> > > [282879.612620]  ? hrtimer_interrupt+0x109/0x220
> > > [282879.612622]  ? __sysvec_apic_timer_interrupt+0x5e/0x110
> > > [282879.612625]  ? sysvec_apic_timer_interrupt+0x7b/0x90
> > > [282879.612629]  </IRQ>
> > > [282879.612630]  <TASK>
> > > [282879.612631]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> > > [282879.612640]  ? xas_descend+0x18/0x80
> > > [282879.612641]  ? xas_load+0x35/0x40
> > > [282879.612643]  xas_find+0x197/0x1d0
> > > [282879.612645]  find_get_entries+0x6e/0x170
> > > [282879.612649]  truncate_inode_pages_range+0x294/0x4c0
> > > [282879.612655]  ? __xfs_trans_commit+0x13c/0x3e0 [xfs]
> > > [282879.612787]  ? kvfree+0x2c/0x40
> > > [282879.612791]  ? trace_hardirqs_off+0x36/0xf0
> > > [282879.612795]  truncate_inode_pages_final+0x44/0x50
> > > [282879.612798]  evict+0x177/0x190
> > > [282879.612802]  iput.part.0+0x183/0x1e0
> > > [282879.612804]  iput+0x1c/0x30
> > > [282879.612806]  do_unlinkat+0x1c7/0x2c0
> > > [282879.612810]  __x64_sys_unlinkat+0x38/0x70
> > > [282879.612812]  do_syscall_64+0x38/0x90
> > > [282879.612815]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > [282879.612818] RIP: 0033:0x7f5f56cf120d
> > > [282879.612827] Code: 69 5c 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e
> > > 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 63 d2 48 63 ff b8 07 01 00
> > > 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 32 5c 2d 00 f7 d8 64
> > > 89 02
> > > [282879.612828] RSP: 002b:00007fff30375c48 EFLAGS: 00000206 ORIG_RAX:
> > > 0000000000000107
> > > [282879.612830] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
> > > 00007f5f56cf120d
> > > [282879.612831] RDX: 0000000000000000 RSI: 0000000001640403 RDI:
> > > 0000000000000003
> > > [282879.612832] RBP: 0000000001640403 R08: 0000000000000000 R09:
> > > 0000000001640403
> > > [282879.612833] R10: 0000000000000100 R11: 0000000000000206 R12:
> > > 0000000000000003
> > > [282879.612834] R13: 000000000163c5c0 R14: 00007fff30375c80 R15:
> > > 0000000000000000
> > > [282879.612836]  </TASK>
> > >
> > >
> > > Unfortunately, we couldn't reproduce the issue on our test servers. We
> > > worked around it by disabling CONFIG_XARRAY_MULTI. Since then, these
> > > production servers have been running smoothly for several months.
> > >
> > > > If anyone can provide a patch, I can test it on multiple machines
> > > > over the next few days.
> > It is highly appreciated that you could help to try below one which
> > works on my v6.6 based android. However, there is a hard lockup
> > reported on an ongoing regression test(not sure if caused by this
> > patch yet). Thank you!
>
> I'm sorry to inform you that our users are unwilling to experiment
> with these changes on our production servers again, and I am unable to
> reproduce the issue on our test servers. I am reporting this issue to
> highlight to the community that it is indeed a serious problem, and we
> should consider it carefully.
OK. I would like to suggest a possible reproduction scenario that came up
during my investigation: have multiple processes mmap and truncate the same
file simultaneously. Reserving a certain amount of CMA area via dts could
also help.
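
A minimal user-space sketch of that scenario follows. It is an unverified
illustration, not a confirmed reproducer: hammer_file(), run_stress(),
FILE_SIZE and MAX_WORKERS are names and values made up for this example,
and the CMA reservation via dts is outside what user space can do, so it
is left out. The shrink to a small odd page count assumes 4 KiB pages and
is only meant to make the new end of file land inside a large folio.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define FILE_SIZE   (4UL << 20)  /* 4 MiB; arbitrary size picked for this sketch */
#define MAX_WORKERS 64           /* arbitrary upper bound on worker processes */

/* One worker: repeatedly extend, map, and truncate the shared file. */
static int hammer_file(const char *path, int rounds)
{
	int fd = open(path, O_RDWR | O_CREAT, 0600);
	int ret = 0;

	if (fd < 0)
		return -1;

	for (int i = 0; i < rounds; i++) {
		void *p;

		/* Grow the file so there is a range to map and populate. */
		if (ftruncate(fd, FILE_SIZE) < 0) {
			ret = -1;
			break;
		}

		/*
		 * MAP_POPULATE asks the kernel to fault the pages in during
		 * mmap(), so this process never touches the mapping itself
		 * and cannot take SIGBUS if another worker truncates the
		 * file at the same time.
		 */
		p = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_POPULATE, fd, 0);
		if (p != MAP_FAILED)
			munmap(p, FILE_SIZE);

		/* Shrink to three pages first, then drop everything. */
		if (ftruncate(fd, 3 * 4096) < 0 || ftruncate(fd, 0) < 0) {
			ret = -1;
			break;
		}
	}
	close(fd);
	return ret;
}

/* Fork nproc workers on the same file; return how many exited cleanly. */
static int run_stress(const char *path, int nproc, int rounds)
{
	pid_t pids[MAX_WORKERS];
	int forked = 0, ok = 0;

	assert(nproc > 0 && nproc <= MAX_WORKERS);

	for (int i = 0; i < nproc; i++) {
		pid_t pid = fork();

		if (pid < 0)
			break;
		if (pid == 0)
			_exit(hammer_file(path, rounds) == 0 ? 0 : 1);
		pids[forked++] = pid;
	}

	for (int i = 0; i < forked; i++) {
		int status;

		if (waitpid(pids[i], &status, 0) == pids[i] &&
		    WIFEXITED(status) && WEXITSTATUS(status) == 0)
			ok++;
	}
	return ok;
}
```

A caller would pick a file on the filesystem under test and a worker count
close to the number of CPUs, for example run_stress(path, 8, 100000) with a
path of its own choosing, and leave it running under memory pressure.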
>
> >
> > mm/huge_memory.c | 22 ++++++++++++++--------
> >  1 file changed, 14 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 064fbd90822b..5899906c326a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2498,7 +2498,7 @@ static void __split_huge_page(struct page *page,
> > struct list_head *list,
> >  {
> >   struct folio *folio = page_folio(page);
> >   struct page *head = &folio->page;
> > - struct lruvec *lruvec;
> > + struct lruvec *lruvec = folio_lruvec(folio);
> >   struct address_space *swap_cache = NULL;
> >   unsigned long offset = 0;
> >   unsigned int nr = thp_nr_pages(head);
> > @@ -2513,9 +2513,6 @@ static void __split_huge_page(struct page *page,
> > struct list_head *list,
> >   xa_lock(&swap_cache->i_pages);
> >   }
> >
> > - /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> > - lruvec = folio_lruvec_lock(folio);
> > -
> >   ClearPageHasHWPoisoned(head);
> >
> >   for (i = nr - 1; i >= 1; i--) {
> > @@ -2541,9 +2538,6 @@ static void __split_huge_page(struct page *page,
> > struct list_head *list,
> >   }
> >
> >   ClearPageCompound(head);
> > - unlock_page_lruvec(lruvec);
> > - /* Caller disabled irqs, so they are still disabled here */
> > -
> >   split_page_owner(head, nr);
> >
> >   /* See comment in __split_huge_page_tail() */
> > @@ -2560,7 +2554,6 @@ static void __split_huge_page(struct page *page,
> > struct list_head *list,
> >   page_ref_add(head, 2);
> >   xa_unlock(&head->mapping->i_pages);
> >   }
> > - local_irq_enable();
> >
> >   if (nr_dropped)
> >   shmem_uncharge(head->mapping->host, nr_dropped);
> > @@ -2631,6 +2624,7 @@ int split_huge_page_to_list(struct page *page,
> > struct list_head *list)
> >   int extra_pins, ret;
> >   pgoff_t end;
> >   bool is_hzp;
> > + struct lruvec *lruvec;
> >
> >   VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> >   VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
> > @@ -2714,6 +2708,14 @@ int split_huge_page_to_list(struct page *page,
> > struct list_head *list)
> >
> >   /* block interrupt reentry in xa_lock and spinlock */
> >   local_irq_disable();
> > +
> > + /*
> > + * take lruvec's lock before freeze the folio to prevent the folio
> > + * remains in the page cache with refcnt == 0, which could lead to
> > + * find_get_entry enters livelock by iterating the xarray.
> > + */
> > + lruvec = folio_lruvec_lock(folio);
> > +
> >   if (mapping) {
> >   /*
> >   * Check if the folio is present in page cache.
> > @@ -2748,12 +2750,16 @@ int split_huge_page_to_list(struct page *page,
> > struct list_head *list)
> >   }
> >
> >   __split_huge_page(page, list, end);
> >
> > > >
> > >
> > >
> > > --
> > > Regards
> > > Yafang
>
>
>
> --
> Regards
> Yafang



* [BUG] soft lockup in filemap_get_read_batch
@ 2023-10-03 13:48 antal.nemes
  2024-04-16  9:31 ` [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration zhaoyang.huang
  0 siblings, 1 reply; 21+ messages in thread
From: antal.nemes @ 2023-10-03 13:48 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, Daniel Dao

Hi Matthew,

We have observed intermittent soft lockups on at least seven different hosts:
- six hosts ran 6.2.8.fc37-200
- one host ran 6.0.13.fc37-200

The list of affected hosts is growing.

Stack traces are all similar:

emerg kern kernel - - watchdog: BUG: soft lockup - CPU#7 stuck for 17117s! [postmaster:2238460]
warning kern kernel - - Modules linked in: target_core_user uio target_core_pscsi target_core_file target_core_iblock nbd loop nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver fscache netfs veth iscsi_tcp libiscsi_tcp libiscsi iscsi_target_mod target_core_mod scsi_transport_iscsi nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua bochs drm_vram_helper drm_ttm_helper ttm crct10dif_pclmul i2c_piix4 crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 virtio_balloon joydev pcspkr xfs crc32c_intel virtio_net serio_raw ata_generic net_failover failover virtio_scsi pata_acpi qemu_fw_cfg fuse [last unloaded: nbd]
warning kern kernel - - CPU: 7 PID: 2238460 Comm: postmaster Kdump: loaded Tainted: G             L     6.2.8-200.fc37.x86_64 #1
warning kern kernel - - Hardware name: Nutanix AHV, BIOS 1.11.0-2.el7 04/01/2014
warning kern kernel - - RIP: 0010:xas_descend+0x28/0x70
warning kern kernel - - Code: 90 90 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 <48> 3d fd 00 00 00 76 08 88 57 12 c3 cc cc cc cc 48 c1 e8 02 89 c2
warning kern kernel - - RSP: 0018:ffffab66c9f4bb98 EFLAGS: 00000246
warning kern kernel - - RAX: 00000000000000c2 RBX: ffffab66c9f4bbb8 RCX: 0000000000000002
warning kern kernel - - RDX: 0000000000000032 RSI: ffff89cd6c8cd6d0 RDI: ffffab66c9f4bbb8
warning kern kernel - - RBP: ffff89cd6c8cd6d0 R08: ffffab66c9f4be20 R09: 0000000000000000
warning kern kernel - - R10: 0000000000000001 R11: 0000000000000100 R12: 00000000000000b3
warning kern kernel - - R13: 00000000000000b2 R14: 00000000000000b2 R15: ffffab66c9f4be48
warning kern kernel - - FS:  00007ff1e8bfb540(0000) GS:ffff89d35fbc0000(0000) knlGS:0000000000000000
warning kern kernel - - CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
warning kern kernel - - CR2: 00007ff1e8af0768 CR3: 000000016fdde001 CR4: 00000000003706e0
warning kern kernel - - Call Trace:
warning kern kernel - -  <TASK>
warning kern kernel - -  xas_load+0x3d/0x50
warning kern kernel - -  filemap_get_read_batch+0x179/0x270
warning kern kernel - -  filemap_get_pages+0xa9/0x690
warning kern kernel - -  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
warning kern kernel - -  filemap_read+0xd2/0x340
warning kern kernel - -  ? filemap_read+0x32f/0x340
warning kern kernel - -  xfs_file_buffered_read+0x4f/0xd0 [xfs]
warning kern kernel - -  xfs_file_read_iter+0x70/0xe0 [xfs]
warning kern kernel - -  vfs_read+0x23c/0x310
warning kern kernel - -  ksys_read+0x6b/0xf0
warning kern kernel - -  do_syscall_64+0x5b/0x80
warning kern kernel - -  ? syscall_exit_to_user_mode+0x17/0x40
warning kern kernel - -  ? do_syscall_64+0x67/0x80
warning kern kernel - -  ? do_syscall_64+0x67/0x80
warning kern kernel - -  ? __irq_exit_rcu+0x3d/0x140
warning kern kernel - -  entry_SYSCALL_64_after_hwframe+0x72/0xdc
warning kern kernel - - RIP: 0033:0x7ff1e5b20b25
warning kern kernel - - Code: fe ff ff 50 48 8d 3d 0a c9 06 00 e8 25 ee 01 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 f5 4b 2a 00 8b 00 85 c0 75 0f 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 53 c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89
warning kern kernel - - RSP: 002b:00007ffe1a5d8d78 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
warning kern kernel - - RAX: ffffffffffffffda RBX: 00000000035345c0 RCX: 00007ff1e5b20b25
warning kern kernel - - RDX: 0000000000002000 RSI: 00007ff1dc9c3080 RDI: 0000000000000032
warning kern kernel - - RBP: 0000000000000000 R08: 0000000000000009 R09: 0000000000000000
warning kern kernel - - R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000002000
warning kern kernel - - R13: 00007ff1dc9c3080 R14: 0000000000000000 R15: 0000000001452148
warning kern kernel - -  </TASK>

The lockup is always reported from a postgres process, with all data and config on an XFS filesystem.
Because this blocks a postgres process, the lockup has a bunch of knock-on effects
(invalid page errors, hung or aborted transactions, tuple accumulation, etc).
All occurrences eventually required a reboot to remedy.

The issue coincided with our rollout of the 6.x kernel. Previously we ran Rocky
Linux 8 with 4.18.* (clone of RHEL8 kernel), so I recognize that this issue may
not be new (AFAICT, livelocks were sporadically reported since folio merge in 5.17).

The issue takes anywhere from 2 days to 30+ days since boot to materialize, and lockups
are reported for durations ranging from 1 min to 7 hours (the latter until the host was
manually rebooted). This is followed by a period of relatively high load averages
(~2*#cpus), but low CPU usage. Memory usage was < 70%, so it does not appear
to be a high-psi condition.

We are unable to reproduce the issue at will (i.e. by load/stress testing), but
the affected hosts have had multiple occurrences across reboots, so we should
be able to observe effects of any patches over a longer span.

From what I can tell, this appears to be similar to what was reported in
https://lore.kernel.org/linux-kernel/CA+wXwBS7YTHUmxGP3JrhcKMnYQJcd6=7HE+E1v-guk01L2K3Zw@mail.gmail.com/
and 
https://lore.kernel.org/linux-fsdevel/CA+wXwBRGab3UqbLqsr8xG=ZL2u9bgyDNNea4RGfTDjqB=J3geQ@mail.gmail.com/

> > We also have a deadlock reading a very specific file on this host. We managed to
> > do a kdump on this host and extracted out the state of the mapping.
>
> This is almost certainly a different bug, but also XArray related, so
> I'll keep looking at this one.

I am not sure if the deadlock that Daniel observed matches our stack trace. 
Assuming yes, has there been any follow-up on this?

We tried the patch from https://bugzilla.kernel.org/show_bug.cgi?id=216646#c31 , but the
soft lockup reoccurred with the same signature.

Is there anything we can do to further aid in troubleshooting? If this is a folio
lock issue, would it be possible to trace where the lock was taken?

Best regards,
Antal




* [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration
  2023-10-03 13:48 [BUG] soft lockup in filemap_get_read_batch antal.nemes
@ 2024-04-16  9:31 ` zhaoyang.huang
  0 siblings, 0 replies; 21+ messages in thread
From: zhaoyang.huang @ 2024-04-16  9:31 UTC (permalink / raw)
  To: antal.nemes; +Cc: dqminh, linux-fsdevel, linux-mm, steve.kang, huangzhaoyang

From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

The livelock in [1] has been reported multiple times since v5.15: a
zero-refcount folio is repeatedly found in the page cache by find_get_entry.
A possible timing sequence is proposed in [2]. In brief, the lockless
xarray lookup can be harmed by an illegal (zero-refcount) folio that
remains in slot[offset]. This commit protects the xa split steps
(folio_ref_freeze and __split_huge_page) under lruvec->lru_lock to
remove the race window.

[1]
[167789.800297] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[167726.780305] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-7): P155
[167726.780319] (detected by 3, t=17256977 jiffies, g=19883597, q=2397394)
[167726.780325] task:kswapd0         state:R  running task     stack:   24 pid:  155 ppid:     2 flags:0x00000008
[167789.800308] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-7): P155
[167789.800322] (detected by 3, t=17272732 jiffies, g=19883597, q=2397470)
[167789.800328] task:kswapd0         state:R  running task     stack:   24 pid:  155 ppid:     2 flags:0x00000008
[167789.800339] Call trace:
[167789.800342]  dump_backtrace.cfi_jt+0x0/0x8
[167789.800355]  show_stack+0x1c/0x2c
[167789.800363]  sched_show_task+0x1ac/0x27c
[167789.800370]  print_other_cpu_stall+0x314/0x4dc
[167789.800377]  check_cpu_stall+0x1c4/0x36c
[167789.800382]  rcu_sched_clock_irq+0xe8/0x388
[167789.800389]  update_process_times+0xa0/0xe0
[167789.800396]  tick_sched_timer+0x7c/0xd4
[167789.800404]  __run_hrtimer+0xd8/0x30c
[167789.800408]  hrtimer_interrupt+0x1e4/0x2d0
[167789.800414]  arch_timer_handler_phys+0x5c/0xa0
[167789.800423]  handle_percpu_devid_irq+0xbc/0x318
[167789.800430]  handle_domain_irq+0x7c/0xf0
[167789.800437]  gic_handle_irq+0x54/0x12c
[167789.800445]  call_on_irq_stack+0x40/0x70
[167789.800451]  do_interrupt_handler+0x44/0xa0
[167789.800457]  el1_interrupt+0x34/0x64
[167789.800464]  el1h_64_irq_handler+0x1c/0x2c
[167789.800470]  el1h_64_irq+0x7c/0x80
[167789.800474]  xas_find+0xb4/0x28c
[167789.800481]  find_get_entry+0x3c/0x178
[167789.800487]  find_lock_entries+0x98/0x2f8
[167789.800492]  __invalidate_mapping_pages.llvm.3657204692649320853+0xc8/0x224
[167789.800500]  invalidate_mapping_pages+0x18/0x28
[167789.800506]  inode_lru_isolate+0x140/0x2a4
[167789.800512]  __list_lru_walk_one+0xd8/0x204
[167789.800519]  list_lru_walk_one+0x64/0x90
[167789.800524]  prune_icache_sb+0x54/0xe0
[167789.800529]  super_cache_scan+0x160/0x1ec
[167789.800535]  do_shrink_slab+0x20c/0x5c0
[167789.800541]  shrink_slab+0xf0/0x20c
[167789.800546]  shrink_node_memcgs+0x98/0x320
[167789.800553]  shrink_node+0xe8/0x45c
[167789.800557]  balance_pgdat+0x464/0x814
[167789.800563]  kswapd+0xfc/0x23c
[167789.800567]  kthread+0x164/0x1c8
[167789.800573]  ret_from_fork+0x10/0x20

[2]
Thread_isolate:
1. alloc_contig_range->isolate_migratepages_block isolates a number of
pages to cc->migratepages via pfn
       (folio has refcount: 1 + n (alloc_pages, page_cache))

2. alloc_contig_range->migrate_pages->folio_ref_freeze(folio, 1 +
extra_pins) sets folio->refcnt to 0

3. alloc_contig_range->migrate_pages->xas_split splits the folio across
the slots, so that it occupies slot[offset] to slot[offset + sibs]

4. alloc_contig_range->migrate_pages->__split_huge_page->folio_lruvec_lock
does not acquire the lock, so the folio's refcnt is not set back to 2

5. Thread_kswapd enters the livelock through the chain below
      rcu_read_lock();
   retry:
        find_get_entry
            folio = xas_find
            if(!folio_try_get_rcu)
                xas_reset;
            goto retry;
      rcu_read_unlock();

5'. Thread_holdlock, the lruvec->lru_lock holder, could be stalled on
the same core as Thread_kswapd.
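
The sequence in [2] can be modelled in user space to show why the lookup
side makes no progress while the count stays at zero. The sketch below is
a toy model built on C11 atomics, not kernel code: struct toy_folio,
toy_try_get(), toy_freeze(), toy_unfreeze() and toy_lookup() are
hypothetical names, and max_retries exists only so that the model
terminates, whereas the loop in step 5 retries without a bound.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a folio: only the reference count matters here. */
struct toy_folio {
	atomic_int refcount;
};

/* Take a reference unless the count is zero (the folio is frozen). */
static bool toy_try_get(struct toy_folio *f)
{
	int old = atomic_load(&f->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Freeze: succeeds only if the count equals expected, then sets it to 0. */
static bool toy_freeze(struct toy_folio *f, int expected)
{
	return atomic_compare_exchange_strong(&f->refcount, &expected, 0);
}

/* Unfreeze: publish a non-zero count so lookups can succeed again. */
static void toy_unfreeze(struct toy_folio *f, int count)
{
	assert(count > 0);
	atomic_store(&f->refcount, count);
}

/*
 * Lookup loop modelled on step 5: retry for as long as the slot holds a
 * folio whose count is zero. Returns the number of failed attempts before
 * a reference was taken, or -1 if max_retries attempts all failed.
 */
static int toy_lookup(struct toy_folio *slot, int max_retries)
{
	for (int retries = 0; retries < max_retries; retries++) {
		if (toy_try_get(slot))
			return retries;
	}
	return -1;
}
```

In this model, a lookup that starts between toy_freeze() and
toy_unfreeze() can only finish once the freezing side runs again. If that
side is waiting for a lock whose holder cannot run because the lookup is
spinning on the same CPU, as in 5', neither side advances.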

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
 mm/huge_memory.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9859aa4f7553..418e8d03480a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2891,7 +2891,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 {
 	struct folio *folio = page_folio(page);
 	struct page *head = &folio->page;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = folio_lruvec(folio);
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
 	int i, nr_dropped = 0;
@@ -2908,8 +2908,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
-	lruvec = folio_lruvec_lock(folio);
 
 	ClearPageHasHWPoisoned(head);
 
@@ -2942,7 +2940,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 		folio_set_order(new_folio, new_order);
 	}
-	unlock_page_lruvec(lruvec);
 	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, order, new_order);
@@ -2961,7 +2958,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		folio_ref_add(folio, 1 + new_nr);
 		xa_unlock(&folio->mapping->i_pages);
 	}
-	local_irq_enable();
 
 	if (nr_dropped)
 		shmem_uncharge(folio->mapping->host, nr_dropped);
@@ -3048,6 +3044,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	int extra_pins, ret;
 	pgoff_t end;
 	bool is_hzp;
+	struct lruvec *lruvec;
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
@@ -3159,6 +3156,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 
 	/* block interrupt reentry in xa_lock and spinlock */
 	local_irq_disable();
+
+	/*
+	 * Take the lruvec lock before freezing the folio so that the folio
+	 * never remains in the page cache with refcnt == 0, which could
+	 * make find_get_entry() livelock while iterating the xarray.
+	 */
+	lruvec = folio_lruvec_lock(folio);
+
 	if (mapping) {
 		/*
 		 * Check if the folio is present in page cache.
@@ -3203,12 +3208,16 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		}
 
 		__split_huge_page(page, list, end, new_order);
+		unlock_page_lruvec(lruvec);
+		local_irq_enable();
 		ret = 0;
 	} else {
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:
 		if (mapping)
 			xas_unlock(&xas);
+
+		unlock_page_lruvec(lruvec);
 		local_irq_enable();
 		remap_page(folio, folio_nr_pages(folio));
 		ret = -EAGAIN;
-- 
2.25.1
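
The lock ordering this patch aims for can be sketched in userspace C as
follows. This is only an illustration under stated assumptions: struct
toy_state and toy_split_locked() are hypothetical names, a pthread mutex
stands in for lruvec->lru_lock, and a bare atomic counter stands in for
the folio refcount. The point shown is that the lock is taken before the
freeze and dropped on both the success and the failure path, so the
refcount is never left at zero while waiting for the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the lruvec lock plus one folio refcount. */
struct toy_state {
	pthread_mutex_t lru_lock;
	atomic_int refcount;
};

/*
 * Lock first, then freeze. Returns 0 on success and -1 on failure
 * (think -EAGAIN). The lock is released on both paths.
 */
int toy_split_locked(struct toy_state *s, int expected)
{
	int want = expected;
	int ret = -1;

	pthread_mutex_lock(&s->lru_lock);
	if (atomic_compare_exchange_strong(&s->refcount, &want, 0)) {
		/* Frozen window: the split work would happen here. */
		assert(atomic_load(&s->refcount) == 0);
		/* Unfreeze before dropping the lock. */
		atomic_store(&s->refcount, expected);
		ret = 0;
	}
	pthread_mutex_unlock(&s->lru_lock);
	return ret;
}
```

A caller passing the wrong expected count gets -1 back with the refcount
untouched and the lock released, mirroring the fail: path in the diff.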




end of thread, other threads:[~2024-06-14  3:31 UTC | newest]

Thread overview: 21+ messages
2024-04-12  6:43 [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration zhaoyang.huang
2024-04-12 12:24 ` Matthew Wilcox
2024-04-13  7:10   ` Zhaoyang Huang
2024-04-12 21:34 ` Andrew Morton
2024-04-13  2:01   ` Zhaoyang Huang
2024-04-15  0:09     ` Dave Chinner
2024-04-15  1:50       ` Zhaoyang Huang
2024-05-20 19:42         ` Marcin Wanat
2024-05-21  0:58           ` Zhaoyang Huang
2024-05-21  1:00             ` Zhaoyang Huang
2024-05-21 15:47               ` Marcin Wanat
2024-05-22  5:37                 ` Zhaoyang Huang
2024-05-22 10:13                   ` Marcin Wanat
2024-05-27  8:22                     ` Marcin Wanat
2024-05-27  8:53                       ` Zhaoyang Huang
2024-06-14  3:31                       ` Zhaoyang Huang
2024-05-30  8:48           ` Yafang Shao
2024-05-30  8:57             ` Zhaoyang Huang
2024-05-30  9:24               ` Yafang Shao
2024-05-31  6:17                 ` Zhaoyang Huang
  -- strict thread matches above, loose matches on Subject: below --
2023-10-03 13:48 [BUG] soft lockup in filemap_get_read_batch antal.nemes
2024-04-16  9:31 ` [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration zhaoyang.huang
