From: Chris Mason <clm@meta.com>
To: Christian Theune <ct@flyingcircus.io>,
Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Chinner <david@fromorbit.com>,
Matthew Wilcox <willy@infradead.org>,
Jens Axboe <axboe@kernel.dk>,
linux-mm@kvack.org,
"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
Daniel Dao <dqminh@cloudflare.com>,
regressions@lists.linux.dev, regressions@leemhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)
Date: Mon, 30 Sep 2024 20:56:03 -0400 [thread overview]
Message-ID: <f8232f8b-06e0-4d1a-bee4-cfc2ac23194e@meta.com> (raw)
In-Reply-To: <02121707-E630-4E7E-837B-8F53B4C28721@flyingcircus.io>
[-- Attachment #1: Type: text/plain, Size: 2681 bytes --]
On 9/30/24 7:34 PM, Christian Theune wrote:
> Hi,
>
> we’ve been running a number of VMs since last week on 6.11. We’ve encountered one hung task situation multiple times now that seems to be resolving itself after a bit of time, though. I do not see spinning CPU during this time.
>
> The situation seems to be related to cgroups-based IO throttling / weighting so far:
>
> Here are three examples of similar tracebacks from jobs that perform a certain amount of IO (either given a weight or given an explicit limit like this):
>
> IOWeight=10
> IOReadIOPSMax=/dev/vda 188
> IOWriteIOPSMax=/dev/vda 188
>
> Telemetry for the affected VM does not show that it actually reaches 188 IOPS (the load is mostly writing) but creates a kind of gaussian curve …
>
> The underlying storage and network was completely inconspicuous during the whole time.
Not disagreeing with Linus at all, but given that you've got IO
throttling too, we might really just be waiting. It's hard to tell
from the hung task warnings alone, because each one only shows you a
single process.
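Since you've already got per-unit IO limits configured, it might be
worth checking whether the throttle is actually biting while things are
stalled. Something like the sketch below (untested, and the
system.slice path is just a guess, adjust it to wherever systemd put
your unit) dumps the cgroup v2 io.max, io.stat and io.pressure files
for one service:

#!/usr/bin/env python3
# rough sketch: dump the cgroup v2 IO files for one systemd unit so you
# can see the configured limit (io.max), the cumulative counters
# (io.stat) and the PSI stall numbers (io.pressure).
# Assumes cgroup2 is mounted at /sys/fs/cgroup and the unit lives
# directly under system.slice; adjust the path for your setup.
import os
import sys

unit = sys.argv[1] if len(sys.argv) > 1 else "example.service"  # placeholder name
cgroup = os.path.join("/sys/fs/cgroup/system.slice", unit)

for name in ("io.max", "io.stat", "io.pressure"):
    path = os.path.join(cgroup, name)
    try:
        data = open(path, 'r').read()
    except OSError as e:
        print("%s: %s" % (path, e))
        continue
    print("== %s ==" % name)
    print(data, end="")

io.stat is cumulative, so sample it twice a few seconds apart if you
want an IOPS rate to compare against the 188 limit; io.pressure should
at least tell you whether tasks in that cgroup are stalled on IO at all.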
I've attached a minimal version of a script we use here to show all the
D state processes; it might help explain things. The only problem is
that you have to actually ssh to the box and run it while you're stuck.
The idea is to print the stack trace of every D state process, and then
also print out how often each unique stack trace shows up. When we're
deadlocked on something, there are normally a bunch of the same stack
(say waiting on writeback) and then one jerk sitting around in a
different stack who is causing all the trouble.
(I made some quick changes to make this smaller, so apologies if you get
silly errors)
Example output:
sudo ./walker.py
15 rcu_tasks_trace_kthread D
[<0>] __wait_rcu_gp+0xab/0x120
[<0>] synchronize_rcu+0x46/0xd0
[<0>] rcu_tasks_wait_gp+0x86/0x2a0
[<0>] rcu_tasks_one_gp+0x300/0x430
[<0>] rcu_tasks_kthread+0x9a/0xb0
[<0>] kthread+0xad/0xe0
[<0>] ret_from_fork+0x1f/0x30
1440504 dd D
[<0>] folio_wait_bit_common+0x149/0x2d0
[<0>] filemap_read+0x7bd/0xd10
[<0>] blkdev_read_iter+0x5b/0x130
[<0>] __x64_sys_read+0x1ce/0x3f0
[<0>] do_syscall_64+0x3d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
-----
stack summary
1 hit:
[<0>] __wait_rcu_gp+0xab/0x120
[<0>] synchronize_rcu+0x46/0xd0
[<0>] rcu_tasks_wait_gp+0x86/0x2a0
[<0>] rcu_tasks_one_gp+0x300/0x430
[<0>] rcu_tasks_kthread+0x9a/0xb0
[<0>] kthread+0xad/0xe0
[<0>] ret_from_fork+0x1f/0x30
-----
[<0>] folio_wait_bit_common+0x149/0x2d0
[<0>] filemap_read+0x7bd/0xd10
[<0>] blkdev_read_iter+0x5b/0x130
[<0>] __x64_sys_read+0x1ce/0x3f0
[<0>] do_syscall_64+0x3d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
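If the output gets noisy, the script also takes a couple of filters
(they're just argparse options in the attachment): -c limits it to one
command name, -p to one pid, and -a dumps every task instead of only
the D state ones, e.g.

sudo ./walker.py -c dd
sudo ./walker.py -a -p 1440504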
[-- Attachment #2: walker.py.txt --]
[-- Type: text/plain, Size: 3020 bytes --]
#!/usr/bin/env python3
#
# this walks all the tasks on the system and prints out a stack trace
# of any tasks waiting in D state. If you pass -a, it will print out
# the stack of every task it finds.
#
# It also makes a histogram of the common stacks so you can see where
# more of the tasks are. Usually when we're deadlocked, we care about
# the least common stacks.
#
import sys
import os
import argparse
parser = argparse.ArgumentParser(description='Show kernel stacks')
parser.add_argument('-a', '--all_tasks', action='store_true', help='Dump all stacks')
parser.add_argument('-p', '--pid', type=str, help='Filter on pid')
parser.add_argument('-c', '--command', type=str, help='Filter on command name')
options = parser.parse_args()
stacks = {}
# parse the units from a number and normalize into KB
# (note: not actually called anywhere below in this trimmed-down copy)
def parse_number(s):
    try:
        words = s.split()
        unit = words[-1].lower()
        number = int(words[1])
        tag = words[0].lower().rstrip(':')
        # we store in kb
        if unit == "mb":
            number = number * 1024
        elif unit == "gb":
            number = number * 1024 * 1024
        elif unit == "tb":
            number = number * 1024 * 1024 * 1024
        return (tag, number)
    except:
        return (None, None)
# read /proc/pid/stack and add it to the hashes
def add_stack(path, pid, cmd, status):
    global stacks
    try:
        stack = open(os.path.join(path, "stack"), 'r').read()
    except:
        return
    if (status != "D" and not options.all_tasks):
        return
    print("%s %s %s" % (pid, cmd, status))
    print(stack)
    v = stacks.get(stack)
    if v:
        v += 1
    else:
        v = 1
    stacks[stack] = v

# worker to read all the files for one individual task
def run_one_task(path):
    try:
        stat = open(os.path.join(path, "stat"), 'r').read()
    except:
        return
    words = stat.split()
    pid, cmd, status = words[0:3]
    cmd = cmd.lstrip('(')
    cmd = cmd.rstrip(')')
    if options.command and options.command != cmd:
        return
    add_stack(path, pid, cmd, status)

def print_usage():
    sys.stderr.write("Usage: %s [-a]\n" % sys.argv[0])
    sys.exit(1)

# for a given pid in string form, read the files from proc
def run_pid(name):
    try:
        pid = int(name)
    except:
        return
    p = os.path.join("/proc", name, "task")
    if not os.path.exists(p):
        return
    try:
        for t in os.listdir(p):
            run_one_task(os.path.join(p, t))
    except:
        pass

if options.pid:
    run_pid(options.pid)
else:
    for name in os.listdir("/proc"):
        run_pid(name)

values = {}
for stack, count in stacks.items():
    l = values.setdefault(count, [])
    l.append(stack)

counts = list(values.keys())
counts.sort(reverse=True)

if counts:
    print("-----\nstack summary\n")

for x in counts:
    if x == 1:
        print("1 hit:")
    else:
        print("%d hits: " % x)
    l = values[x]
    for stack in l:
        print(stack)
    print("-----")