On 9/30/24 7:34 PM, Christian Theune wrote:
> Hi,
> 
> we’ve been running a number of VMs since last week on 6.11. We’ve encountered one hung task situation multiple times now that seems to be resolving itself after a bit of time, though. I do not see spinning CPU during this time.
> 
> The situation seems to be related to cgroups-based IO throttling / weighting so far:
> 
> Here are three examples of similar tracebacks where jobs that do perform a certain amount of IO (either given a weight or given an explicit limit like this:
> 
> IOWeight=10
> IOReadIOPSMax=/dev/vda 188
> IOWriteIOPSMax=/dev/vda 188
> 	
> Telemetry for the affected VM does not show that it actually reaches 188 IOPS (the load is mostly writing) but creates a kind of gaussian curve … 
> 
> The underlying storage and network was completely inconspicuous during the whole time.

Not disagreeing with Linus at all, but given that you've got IO
throttling too, we might really just be waiting.  It's hard to tell
because the hung task timeouts only give you information about one process.

I've attached a minimal version of a script we use here to show all the
D state processes, it might help explain things.  The only problem is
you have to actually ssh to the box and run it when you're stuck.

The idea is to print the stack trace of every D state process, and then
also print out how often each unique stack trace shows up.  When we're
deadlocked on something, there are normally a bunch of the same stack
(say waiting on writeback) and then one jerk sitting around in a
different stack who is causing all the trouble.

(I made some quick changes to make this smaller, so apologies if you get
silly errors)

Example output:

 sudo ./walker.py
15 rcu_tasks_trace_kthread D
[<0>] __wait_rcu_gp+0xab/0x120
[<0>] synchronize_rcu+0x46/0xd0
[<0>] rcu_tasks_wait_gp+0x86/0x2a0
[<0>] rcu_tasks_one_gp+0x300/0x430
[<0>] rcu_tasks_kthread+0x9a/0xb0
[<0>] kthread+0xad/0xe0
[<0>] ret_from_fork+0x1f/0x30

1440504 dd D
[<0>] folio_wait_bit_common+0x149/0x2d0
[<0>] filemap_read+0x7bd/0xd10
[<0>] blkdev_read_iter+0x5b/0x130
[<0>] __x64_sys_read+0x1ce/0x3f0
[<0>] do_syscall_64+0x3d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

-----
stack summary

1 hit:
[<0>] __wait_rcu_gp+0xab/0x120
[<0>] synchronize_rcu+0x46/0xd0
[<0>] rcu_tasks_wait_gp+0x86/0x2a0
[<0>] rcu_tasks_one_gp+0x300/0x430
[<0>] rcu_tasks_kthread+0x9a/0xb0
[<0>] kthread+0xad/0xe0
[<0>] ret_from_fork+0x1f/0x30

-----
[<0>] folio_wait_bit_common+0x149/0x2d0
[<0>] filemap_read+0x7bd/0xd10
[<0>] blkdev_read_iter+0x5b/0x130
[<0>] __x64_sys_read+0x1ce/0x3f0
[<0>] do_syscall_64+0x3d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0