linux-mm.kvack.org archive mirror
From: Bharata B Rao <bharata@amd.com>
To: Yu Zhao <yuzhao@google.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, nikunj@amd.com,
	"Upadhyay, Neeraj" <Neeraj.Upadhyay@amd.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	willy@infradead.org, vbabka@suse.cz, kinseyho@google.com,
	Mel Gorman <mgorman@suse.de>
Subject: Re: Hard and soft lockups with FIO and LTP runs on a large system
Date: Mon, 8 Jul 2024 20:04:22 +0530	[thread overview]
Message-ID: <1998d479-eb1a-4bc8-a11e-59f8dd71aadb@amd.com> (raw)
In-Reply-To: <CAOUHufawNerxqLm7L9Yywp3HJFiYVrYO26ePUb1jH-qxNGWzyA@mail.gmail.com>

Hi Yu Zhao,

Thanks for your patches. See below...

On 07-Jul-24 4:12 AM, Yu Zhao wrote:
> Hi Bharata,
> 
> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote:
>>
<snip>
>> 
>> Some experiments tried
>> ======================
>> 1) When MGLRU was enabled, many soft lockups were observed; no hard
>> lockups were seen during a 48-hour run. Below is one such soft lockup.
> 
> This is not really an MGLRU issue -- can you please try one of the
> attached patches? It (truncate.patch) should help with or without
> MGLRU.

With truncate.patch and the default LRU scheme, a few hard lockups are seen.

First one is this:

watchdog: Watchdog detected hard LOCKUP on cpu 487
CPU: 487 PID: 11525 Comm: fio Not tainted 6.10.0-rc3 #27
RIP: 0010:native_queued_spin_lock_slowpath+0x81/0x300
Call Trace:
   <NMI>
   ? show_regs+0x69/0x80
   ? watchdog_hardlockup_check+0x1b4/0x3a0
<SNIP>
   ? native_queued_spin_lock_slowpath+0x81/0x300
   </NMI>
   <TASK>
   ? __pfx_folio_activate_fn+0x10/0x10
   _raw_spin_lock_irqsave+0x5b/0x70
   folio_lruvec_lock_irqsave+0x62/0x90
   folio_batch_move_lru+0x9d/0x160
   folio_activate+0x95/0xe0
   folio_mark_accessed+0x11f/0x160
   filemap_read+0x343/0x3d0
<SNIP>
   blkdev_read_iter+0x6f/0x140
   vfs_read+0x25b/0x340
   ksys_read+0x67/0xf0
   __x64_sys_read+0x19/0x20
   x64_sys_call+0x1771/0x20d0

This is the next one:

watchdog: Watchdog detected hard LOCKUP on cpu 219
CPU: 219 PID: 2584763 Comm: fs_racer_dir_cr Not tainted 6.10.0-rc3 #27
RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
Call Trace:
   <NMI>
   ? show_regs+0x69/0x80
   ? watchdog_hardlockup_check+0x1b4/0x3a0
<SNIP>
   ? native_queued_spin_lock_slowpath+0x2b4/0x300
   </NMI>
   <TASK>
   _raw_spin_lock_irqsave+0x5b/0x70
   folio_lruvec_lock_irqsave+0x62/0x90
   __page_cache_release+0x89/0x2f0
   folios_put_refs+0x92/0x230
   __folio_batch_release+0x74/0x90
   truncate_inode_pages_range+0x16f/0x520
   truncate_pagecache+0x49/0x70
   ext4_setattr+0x326/0xaa0
   notify_change+0x353/0x500
   do_truncate+0x83/0xe0
   path_openat+0xd9e/0x1090
   do_filp_open+0xaa/0x150
   do_sys_openat2+0x9b/0xd0
   __x64_sys_openat+0x55/0x90
   x64_sys_call+0xe55/0x20d0
   do_syscall_64+0x7e/0x130
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

When this happens, the all-CPU backtrace shows one CPU in
isolate_lru_folios().

> 
>> kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649]
>> kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G             L
>> 6.10.0-rc3-mglru-irqstrc #24
>> kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
>> kernel: Call Trace:
>> kernel:  <IRQ>
>> kernel:  ? show_regs+0x69/0x80
>> kernel:  ? watchdog_timer_fn+0x223/0x2b0
>> kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
>> <SNIP>
>> kernel:  </IRQ>
>> kernel:  <TASK>
>> kernel:  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
>> kernel:  ? native_queued_spin_lock_slowpath+0x2b4/0x300
>> kernel:  _raw_spin_lock+0x38/0x50
>> kernel:  clear_shadow_entry+0x3d/0x100
>> kernel:  ? __pfx_workingset_update_node+0x10/0x10
>> kernel:  mapping_try_invalidate+0x117/0x1d0
>> kernel:  invalidate_mapping_pages+0x10/0x20
>> kernel:  invalidate_bdev+0x3c/0x50
>> kernel:  blkdev_common_ioctl+0x5f7/0xa90
>> kernel:  blkdev_ioctl+0x109/0x270
>> kernel:  x64_sys_call+0x1215/0x20d0
>> kernel:  do_syscall_64+0x7e/0x130
>>
>> This is contending on the inode i_lock spinlock.
>>
>> The preemptirqsoff trace below shows preemption being disabled for more
>> than 10s, and the lock involved is the lruvec spinlock.
> 
> Also if you could try the other patch (mglru.patch) please. It should
> help reduce unnecessary rotations from deactivate_file_folio(), which
> in turn should reduce the contention on the LRU lock for MGLRU.

Testing with mglru.patch and MGLRU enabled is currently in progress.
I will report back with the results.

Regards,
Bharata.



Thread overview: 37+ messages
2024-07-03 15:11 Bharata B Rao
2024-07-06 22:42 ` Yu Zhao
2024-07-08 14:34   ` Bharata B Rao [this message]
2024-07-08 16:17     ` Yu Zhao
2024-07-09  4:30       ` Bharata B Rao
2024-07-09  5:58         ` Yu Zhao
2024-07-11  5:43           ` Bharata B Rao
2024-07-15  5:19             ` Bharata B Rao
2024-07-19 20:21               ` Yu Zhao
2024-07-20  7:57                 ` Mateusz Guzik
2024-07-22  4:17                   ` Bharata B Rao
2024-07-22  4:12                 ` Bharata B Rao
2024-07-25  9:59               ` zhaoyang.huang
2024-07-26  3:26                 ` Zhaoyang Huang
2024-07-29  4:49                   ` Bharata B Rao
2024-08-13 11:04           ` Usama Arif
2024-08-13 17:43             ` Yu Zhao
2024-07-17  9:37         ` Vlastimil Babka
2024-07-17 10:50           ` Bharata B Rao
2024-07-17 11:15             ` Hillf Danton
2024-07-18  9:02               ` Bharata B Rao
2024-07-10 12:03   ` Bharata B Rao
2024-07-10 12:24     ` Mateusz Guzik
2024-07-10 13:04       ` Mateusz Guzik
2024-07-15  5:22         ` Bharata B Rao
2024-07-15  6:48           ` Mateusz Guzik
2024-07-10 18:04     ` Yu Zhao
2024-07-17  9:42 ` Vlastimil Babka
2024-07-17 10:31   ` Bharata B Rao
2024-07-17 16:44     ` Karim Manaouil
2024-07-17 11:29   ` Mateusz Guzik
2024-07-18  9:00     ` Bharata B Rao
2024-07-18 12:11       ` Mateusz Guzik
2024-07-19  6:16         ` Bharata B Rao
2024-07-19  7:06           ` Yu Zhao
2024-07-19 14:26           ` Mateusz Guzik
2024-07-17 16:34   ` Karim Manaouil
