linux-mm.kvack.org archive mirror
From: Travis Downs <travis.downs@gmail.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	 Tejun Heo <tj@kernel.org>,
	Lai Jiangshan <jiangshanlai@gmail.com>
Subject: [BUG] mm: lru_add_drain_all() hangs indefinitely [6.17.0-1007-aws aarch64]
Date: Mon, 13 Apr 2026 16:42:20 -0400
Message-ID: <CAOBGo4wrHeqCe1=++N-nvZ2piPKk6zK9Ro+x_-cbj7YT3mWEMA@mail.gmail.com>

After migrating from 6.8 to 6.17 we observed that lru_add_drain_all()
hangs indefinitely on 6.17.0-1007-aws #7~24.04.1-Ubuntu (aarch64, EC2
m7gd.8xlarge) in production workloads. The hang reproduces reliably,
appearing on nearly all hosts that migrated to 6.17 within the first
hour.

The hang leaves the system in a semi-broken state: it does not lock
the machine up immediately, but slowly degrades it as various
userspace syscalls and kernel work run into the held mutex.
flush_work() blocks indefinitely while holding the mutex, and all
subsequent callers pile up on it.

This is a regression from 6.8.0-1050-aws (Ubuntu 22.04 HWE). The same
workload on the same instance type runs indefinitely without issue on
6.8.

Trigger (or not)
================

The first manifestation of the hang involves two callers of lru_add_drain_all().
One caller acquires the mutex, schedules per-CPU drain work, and blocks in
flush_work() -> wait_for_completion() because the drain work never
completes. The second caller blocks on the mutex. Both directions have
been observed:

  - khugepaged holds the mutex, stuck in flush_work(). FluentBit
    (flb-pipeline) blocks on the mutex via generic_fadvise:
      INFO: task flb-pipeline:18374 is blocked on a mutex likely owned
      by task khugepaged:220.

  - flb-pipeline holds the mutex, stuck in flush_work(). khugepaged
    blocks on the mutex:
      INFO: task khugepaged:220 is blocked on a mutex likely owned by
      task flb-pipeline:18416.

In both cases khugepaged was calling lru_add_drain_all() as part of
THP collapse, and FluentBit was calling fadvise(POSIX_FADV_DONTNEED)
-> generic_fadvise -> lru_add_drain_all(). As the two reports above
show, which process holds the mutex and which waits on it varies
between occurrences.

It is not clear that this is actually the origin event for the hang:
it seems more likely that the workqueues get into a bad state first,
and this is merely the first obvious manifestation, as the LRU drain
fails to complete and work starts piling up behind it.
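The serialization described above can be modeled in userspace. The
sketch below is a hypothetical Python analogue (not kernel code, all
names illustrative) of the __lru_add_drain_all() pattern: one mutex
guards the drain, work items are queued to per-CPU pools, and the
holder blocks until every item completes, so a single pool that never
dispatches its item wedges the mutex for every later caller:

```python
# Hypothetical userspace model of the __lru_add_drain_all() serialization;
# names and timeouts are illustrative, not taken from the kernel source.
import threading, queue

lock = threading.Lock()            # stands in for the LRU drain mutex

class Worker:                      # stands in for one per-CPU kworker pool
    def __init__(self, stuck=False):
        self.q = queue.Queue()
        self.stuck = stuck         # a "hung pool" never dispatches its item
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            done = self.q.get()
            if not self.stuck:
                done.set()         # work item completed

def drain_all(workers, timeout):
    """Queue a drain item on every worker, then flush each in turn.
    Returns False if any flush times out (the real kernel waits forever)."""
    with lock:                     # all callers serialize here
        events = [threading.Event() for _ in workers]
        for w, e in zip(workers, events):
            w.q.put(e)             # queue_work_on() analogue
        return all(e.wait(timeout) for e in events)  # flush_work() analogue

assert drain_all([Worker()], 1.0)   # healthy pool: drain completes
assert not drain_all([Worker(), Worker(stuck=True)], 0.5)  # stuck pool hangs it
# A second caller arriving now would block on `lock`, matching the
# "blocked on a mutex likely owned by" reports.
```

In the kernel the flush has no timeout, which is why the holder sits
in D state forever rather than failing.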

Kernel version
==============

  6.17.0-1007-aws #7~24.04.1-Ubuntu
  Architecture: aarch64 (ARM64, Graviton3, EC2 m7gd.8xlarge)
  Distribution: Ubuntu 24.04 LTS (Noble Numbat), HWE kernel

  Source tree used for decoding:
    git://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-aws/+git/noble
    commit 8d7dfbe07b0b ("UBUNTU: Ubuntu-aws-6.17-6.17.0-1007.7~24.04.1")
    tag: Ubuntu-aws-6.17-6.17.0-1007.7_24.04.1
    merge base with upstream: v6.17 (e5f0a698b34e "Linux 6.17")
    includes upstream stable patches through v6.17.9 (65723f3975a0)
    3687 commits on top of v6.17 (1193 UBUNTU packaging, 2494 code)

Stack traces below decoded with faddr2line against the matching
vmlinux with DWARF debug info (linux-image-unsigned-6.17.0-1007-aws-
dbgsym_6.17.0-1007.7~24.04.1_arm64.ddeb). XFS module symbols decoded
to source file only (no module debug symbols).

Dmesg evidence (host green-0, i-01fbf12d0b46e234d)
===================================================

After the initial hang we reproduced the issue to capture additional
information. First hung task report at Apr 09 22:01:23 UTC:

  INFO: task khugepaged:220 blocked for more than 122 seconds.
        Not tainted 6.17.0-1007-aws #7~24.04.1-Ubuntu
  task:khugepaged      state:D stack:0     pid:220
  Call trace:
   __switch_to+0xf0/0x178 (T)
   __schedule+0x2e0/0x790
   schedule+0x34/0xc0
   schedule_timeout+0x13c/0x150
   __wait_for_common+0xe4/0x2a8
   wait_for_completion+0x2c/0x60
   __flush_work+0x98/0x138                    # kernel/workqueue.c:4249
   flush_work+0x30/0x58                       # kernel/workqueue.c:4266
   __lru_add_drain_all+0x1bc/0x2e8            # mm/swap.c:881
   lru_add_drain_all+0x20/0x48                # mm/swap.c:891
   khugepaged+0xa8/0x2c8                      # mm/khugepaged.c:2623
   kthread+0xfc/0x110
   ret_from_fork+0x10/0x20

  INFO: task flb-pipeline:18374 blocked for more than 122 seconds.
        Not tainted 6.17.0-1007-aws #7~24.04.1-Ubuntu
  task:flb-pipeline    state:D stack:0     pid:18374
  Call trace:
   __switch_to+0xf0/0x178 (T)
   __schedule+0x2e0/0x790
   schedule+0x34/0xc0
   schedule_preempt_disabled+0x1c/0x40
   __mutex_lock.constprop.0+0x420/0xcb0       # kernel/locking/mutex.c:760
   __mutex_lock_slowpath+0x20/0x48
   mutex_lock+0x8c/0xc0
   __lru_add_drain_all+0x50/0x2e8             # mm/swap.c:843
   lru_add_drain_all+0x20/0x48                # mm/swap.c:891
   generic_fadvise+0x228/0x3b8                # mm/fadvise.c:168
   __arm64_sys_fadvise64_64+0xa8/0x138        # mm/fadvise.c:201
   invoke_syscall+0x74/0x128
   el0_svc_common.constprop.0+0x4c/0x140
   do_el0_svc+0x28/0x58
   el0_svc+0x40/0x160
   el0t_64_sync_handler+0xc0/0x108
   el0t_64_sync+0x1b8/0x1c0
  INFO: task flb-pipeline:18374 is blocked on a mutex likely owned by
  task khugepaged:220.

After resetting hung_task_warnings to -1 at Apr 10 20:28 UTC (22 hours
later), the hang was still present and growing:

  INFO: task khugepaged:220 blocked for more than 16588 seconds.
  INFO: task flb-pipeline:18374 blocked for more than 16588 seconds.
  INFO: task redpanda:19263 blocked for more than 122 seconds.
  INFO: task kworker/u128:6:763273 blocked for more than 25190 seconds.
  INFO: task kworker/u128:4:804606 blocked for more than 25190 seconds.
  INFO: task python3:1169237 blocked for more than 16588 seconds.

By this point the Redpanda process itself was blocked in close() on an
inotify fd (sysrq-w at 20:37:41):

  task:redpanda        state:D stack:0     pid:19263
  Call trace:
   __switch_to+0xf0/0x178 (T)
   __schedule+0x2e0/0x790
   schedule+0x34/0xc0
   schedule_timeout+0x13c/0x150
   __wait_for_common+0xe4/0x2a8
   wait_for_completion+0x2c/0x60
   __flush_work+0x98/0x138                    # kernel/workqueue.c:4249
   flush_delayed_work+0x4c/0xb0               # kernel/workqueue.c:4288
   fsnotify_wait_marks_destroyed+0x28/0x50    # fs/notify/mark.c:1008
   fsnotify_destroy_group+0x54/0x120          # fs/notify/group.c:84
   inotify_release+0x2c/0xb8                  # fs/notify/inotify/inotify_user.c:311
   __fput+0xe4/0x328                          # fs/file_table.c:469
   fput_close_sync+0x4c/0x138                 # fs/file_table.c:574
   __arm64_sys_close+0x44/0xa0                # fs/open.c:1574
   invoke_syscall+0x74/0x128

Two kworkers stuck in fsnotify teardown waiting on SRCU grace periods:

  task:kworker/u128:6  pid:763273 (blocked 25190s)
  Workqueue: events_unbound fsnotify_connector_destroy_workfn
  Call trace:
   synchronize_srcu+0x194/0x228               # kernel/rcu/srcutree.c:1528
   fsnotify_connector_destroy_workfn+0x5c/0xf0 # fs/notify/mark.c:323
   process_one_work+0x174/0x408               # kernel/workqueue.c:3241

  task:kworker/u128:4  pid:804606 (blocked 25190s)
  Workqueue: events_unbound fsnotify_mark_destroy_workfn
  Call trace:
   synchronize_srcu+0x194/0x228               # kernel/rcu/srcutree.c:1528
   fsnotify_mark_destroy_workfn+0x9c/0x188    # fs/notify/mark.c:998
   process_one_work+0x174/0x408               # kernel/workqueue.c:3241

Sysrq workqueue state (host green-0, Apr 10 20:43 UTC)
======================================================

Most detailed dump, captured ~22 hours into the hang.

The core anomaly -- lru_add_drain_per_cpu pending with idle workers:

  workqueue mm_percpu_wq: flags=0x8
    pwq 114: cpus=28 node=0 flags=0x0 nice=0 active=2 refcnt=4
      pending: lru_add_drain_per_cpu BAR(220), vmstat_update

The lru_add_drain_per_cpu barrier work item (queued by khugepaged, pid
220) is pending on mm_percpu_wq for CPU 28, with two workers on that
CPU:

  PID 1042104 (kworker/28:0-mm_percpu_wq): state I (idle)
    stack: worker_thread+0x220/0x4f0  # kernel/workqueue.c:3416

  PID 1158410 (kworker/28:1-mm_percpu_wq): state R
    stack: worker_thread+0x220/0x4f0  # same idle path despite state R

Both show idle-path stacks. The work item is visible, workers are
present on the correct CPU, yet the work is never dispatched.

The BAR(220) annotation indicates a barrier/flush work item. No prior
work item is shown in-flight on this pwq.

For comparison, mm_percpu_wq on CPU 29 looked normal:

  pwq 118: cpus=29 active=2  in-flight: 1149421:vmstat_update
                              pending: vmstat_update

Workers on CPUs with pending work show anomalous scheduler stats
compared to workers on normal CPUs. From the sysrq-t scheduler dump:

  Anomalous (CPUs with hung pools):
    kworker/28:0  pid=1042104  state=I  sum_exec=8450s   switches=1363275
    kworker/28:1  pid=1158410  state=R  sum_exec=3926s   switches=641601
    kworker/29:1  pid=1126545  state=I  sum_exec=14499s  switches=2250077
    kworker/29:2  pid=1149421  state=R  sum_exec=3750s   switches=578918
    kworker/7:2   pid=677335   state=I  sum_exec=3609s   switches=627256

  Normal (other CPUs, for comparison):
    kworker/27:1  pid=1226273  state=I  sum_exec=110s    switches=4906
    kworker/12:1  pid=1235753  state=I  sum_exec=248s    switches=14159
    kworker/11:1  pid=1225304  state=I  sum_exec=307s    switches=16239

Workers on the affected CPUs have 10-100x more CPU time and context
switches than normal workers, suggesting a hot wake/sleep cycle:
waking, failing to dispatch, sleeping immediately, repeat.
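That triage can be automated. The snippet below is a rough,
hypothetical helper (the worker numbers are copied from the dump
above; the 10x factor and the choice of the quietest worker as
baseline are our assumptions, not kernel heuristics):

```python
# Rough triage of the sysrq-t numbers above: flag kworkers whose accumulated
# runtime is far above the quietest peer -- the signature of the hot
# wake/sleep loop described in the text. The 10x factor is an assumption.
workers = {  # name: (sum_exec seconds, context switches), from the dump
    "kworker/28:0": (8450, 1363275), "kworker/28:1": (3926, 641601),
    "kworker/29:1": (14499, 2250077), "kworker/29:2": (3750, 578918),
    "kworker/7:2": (3609, 627256),   "kworker/27:1": (110, 4906),
    "kworker/12:1": (248, 14159),    "kworker/11:1": (307, 16239),
}

def anomalous(stats, factor=10):
    baseline = min(s for s, _ in stats.values())  # quietest worker as baseline
    return sorted(n for n, (s, _) in stats.items() if s > factor * baseline)

print(anomalous(workers))
# -> ['kworker/28:0', 'kworker/28:1', 'kworker/29:1', 'kworker/29:2', 'kworker/7:2']
```

All five flagged workers sit on the hung pools; none of the healthy
CPUs trip the threshold.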

DIO completion worker stuck in slab allocator:

  workqueue dio/nvme1n1: flags=0x8
    pwq 30: cpus=7   active=6  in-flight: 713604:iomap_dio_complete_work
                                pending: 5*iomap_dio_complete_work
    pwq 34: cpus=8   active=66 pending: 66*iomap_dio_complete_work
    pwq 54: cpus=13  active=2  pending: 2*iomap_dio_complete_work
    pwq 66: cpus=16  active=2  pending: 2*iomap_dio_complete_work
    pwq 114: cpus=28 active=2  pending: 2*iomap_dio_complete_work
    pwq 118: cpus=29 active=1  pending: iomap_dio_complete_work

78 iomap_dio_complete_work items piled up across 6 CPUs. The sole
in-flight worker (PID 713604, kworker/7:0+dio/nvme1n1) was stuck in
kmem_cache_alloc_noprof during an XFS transaction commit. /proc stack
sampled 3 times at 1s intervals, identical each time:

    kmem_cache_alloc_noprof+0x220/0x3c0       # mm/slub.c:4266
    xfs_rui_init+0xb8/0xc8 [xfs]
    xfs_rmap_update_create_intent+0x38/0xb8 [xfs]
    xfs_defer_create_intent+0x78/0xf8 [xfs]
    xfs_defer_create_intents+0x5c/0x118 [xfs]
    xfs_defer_finish_noroll+0x88/0x3d0 [xfs]
    xfs_trans_commit+0x88/0xd8 [xfs]
    xfs_iomap_write_unwritten+0xc0/0x350 [xfs]
    xfs_dio_write_end_io+0x1f4/0x298 [xfs]
    iomap_dio_complete+0x50/0x208
    iomap_dio_complete_work+0x30/0x68
    process_one_work+0x174/0x408              # kernel/workqueue.c:3241

  State R (running), Cpus_allowed_list: 7, stime=178, utime=0
  voluntary_ctxt_switches: 301273, nonvoluntary_ctxt_switches: 70

State R, pinned to CPU 7, 178 stime ticks over ~6.5 hours. Appears to
be yielding in the allocator slow path (301k voluntary switches) but
never completing.

The events workqueue was also backed up with io_uring/AIO completions:

  workqueue events:
    pwq 114: cpus=28 active=29 pending: 29*aio_poll_complete_work
    pwq 118: cpus=29 active=24 pending: aio_poll_complete_work,
                                        psi_avgs_work,
                                        aio_poll_complete_work,
                                        aio_fsync_work,
                                        20*aio_poll_complete_work

Hung pool summary:

  pool 30: cpus=7  hung=23642s workers=2 idle: 677335
  pool 54: cpus=13 hung=3116s  workers=2 idle: 788155
  pool 118: cpus=29 hung=17036s workers=2 idle: 1126545

All three hung pools: idle workers, pending work, work not dispatched.

In a separate reproduction (green-1) sysrq-t showed the same pattern:

  workqueue mm_percpu_wq: flags=0x8
    pwq 90: cpus=22 node=0 flags=0x0 nice=0 active=2 refcnt=4
      pending: vmstat_update, lru_add_drain_per_cpu BAR(220)

Same: lru_add_drain_per_cpu pending, no in-flight worker.

Reproducer
==========

Reproduced on two separate clusters. Conditions:

  1. Kernel 6.17.0-1007-aws on aarch64 (Graviton3, m7gd.8xlarge)
  2. THP set to "madvise", all other THP parameters default
  3. A process frequently calling fadvise(POSIX_FADV_DONTNEED) --
     FluentBit (flb-pipeline) does this naturally on log files
  4. Redpanda (Seastar framework) performing O_DIRECT writes to XFS
     on local NVMe (nvme1n1), generating DIO completions
  5. Kubernetes environment: FluentBit and Redpanda in separate pods,
     cgroups involved, possibly ongoing cgroup creation
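Condition 3 can be exercised without FluentBit. The loop below is a
minimal hypothetical stand-in (Linux-only; the path, sizes, and
iteration count are illustrative, and this is not the reproducer we
ran): each DONTNEED round trips through generic_fadvise() ->
lru_add_drain_all() in the kernel:

```python
# Minimal stand-in for the FluentBit behavior in condition 3 (hypothetical
# reproducer fragment): dirty page cache, sync it, then drop it with
# POSIX_FADV_DONTNEED, which reaches lru_add_drain_all() in the kernel.
import os, tempfile

def fadvise_dontneed_loop(path, iterations=100, chunk=1 << 20):
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
    try:
        for _ in range(iterations):
            os.write(fd, b"x" * chunk)  # dirty some page cache
            os.fsync(fd)                # clean the pages so DONTNEED can drop them
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    fadvise_dontneed_loop(os.path.join(d, "log"), iterations=5)
```

On its own this did not reproduce the hang for us (see the negative
reproducer below); the DIO and cgroup conditions appear to matter.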

The hang appeared on all 3 nodes within hours of switching to 6.17
(from 6.8). Other clusters running the same 6.17 ARM64 kernel with
only 4 vCPUs and modest load do not exhibit the issue.

The hang does not self-resolve. On host green-0, tasks were blocked
for days when we last captured state. The k8s control plane eventually
became unreachable due to cascading effects of the workqueue stalls.

A separate reproducer on the same HW with the following changes did
*not* reproduce the issue:

  1) No Kubernetes, just plain EC2 VMs (cgroup CPU parameters adjusted
     to match what k8s applies)
  2) No FluentBit, just stress processes calling madvise DONTNEED in
     random patterns, plus stress-ng doing mm stress

Open questions
==============

  1. Why do idle mm_percpu_wq workers on CPU 28 not pick up the
     pending lru_add_drain_per_cpu BAR(220) work item? No prior work
     item is shown in-flight on that pwq. Is there a race in the
     barrier mechanism, or is the workqueue state dump not showing the
     full picture?

  2. The same "idle workers + pending work + not executing" pattern
     appears on hung pools 30 (CPU 7), 54 (CPU 13), and 118 (CPU 29).
     Is there a common root cause preventing worker dispatch across
     multiple per-CPU pools?

  3. The DIO completion worker (pid 713604) stuck in
     kmem_cache_alloc_noprof inside an XFS transaction -- is this a
     consequence of the hang (reclaim can't proceed because LRU
     draining is stalled), or is it an independent issue?

  4. Is this ARM64-specific? We have only tested on aarch64.

Happy to capture additional diagnostics or test patches.

Thanks,
Travis Downs

