From: Lance Yang <ioworker0@gmail.com>
To: akpm@linux-foundation.org
Cc: dj456119@gmail.com, cunhuang@tencent.com, leonylgao@tencent.com,
j.granados@samsung.com, jsiddle@redhat.com,
kent.overstreet@linux.dev, 21cnbao@gmail.com,
ryan.roberts@arm.com, david@redhat.com, ziy@nvidia.com,
libang.li@antgroup.com, baolin.wang@linux.alibaba.com,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
joel.granados@kernel.org, linux@weissschuh.net,
Lance Yang <ioworker0@gmail.com>
Subject: [PATCH v2 0/2] add detect count for hung tasks
Date: Sun, 27 Oct 2024 20:07:45 +0800
Message-ID: <20241027120747.42833-1-ioworker0@gmail.com>
Hi all,
This patchset adds a counter, hung_task_detect_count, to track the number
of times hung tasks are detected.
IMHO, hung tasks are a critical metric. Currently, we detect them by
periodically parsing dmesg. However, this method isn't as user-friendly
as reading a counter.
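To illustrate the difference from parsing dmesg, here is a minimal sketch of
how a monitoring agent might read the counter. The path below is an
assumption: this cover letter does not state where the counter is exposed, and
/proc/sys/kernel/hung_task_detect_count is only inferred from the fact that
the series documents it in sysctl/kernel.rst. The helper names are
hypothetical.

```python
from pathlib import Path
from typing import Optional

# Assumed location, not confirmed by this cover letter: the series documents
# the counter in sysctl/kernel.rst, so it presumably shows up as a kernel.*
# sysctl under /proc/sys/kernel.
COUNTER_PATH = Path("/proc/sys/kernel/hung_task_detect_count")


def parse_count(text: str) -> int:
    """Parse the sysctl file contents (a decimal number plus a newline)."""
    value = int(text.strip())
    if value < 0:
        raise ValueError(f"unexpected negative count: {value}")
    return value


def read_detect_count(path: Path = COUNTER_PATH) -> Optional[int]:
    """Return the current count, or None if the kernel lacks the sysctl."""
    try:
        return parse_count(path.read_text())
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    count = read_detect_count()
    if count is None:
        print("hung_task_detect_count is not available on this kernel")
    else:
        print(f"hung tasks detected so far: {count}")
```

Returning None when the file is missing lets the same agent run on kernels
without this series applied.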
Sometimes, a short-lived issue with a NIC or hard drive can quickly drive
hung_task_warnings down to zero. Once the warnings are exhausted, we must
access the node directly to confirm that no more tasks are hanging and that
the system has recovered. After all, load average alone cannot provide a
clear picture.
Once this counter is in place, in a high-density deployment pattern, we
plan to set hung_task_timeout_secs to a lower number to improve stability,
even though this might result in false positives. And then we can set a
time-based threshold: if hung tasks last beyond this duration, we will
automatically migrate containers to other nodes. Based on past experience,
this approach could help avoid many production disruptions.
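The time-based threshold described above could be sketched roughly as
follows. This is an illustration only, not the tooling the author uses: the
class, its parameters (threshold_secs, quiet_secs), and the rule that a period
without new detections counts as recovery are all assumptions. It consumes
(timestamp, count) samples of hung_task_detect_count and reports when
detections have kept arriving for at least the threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HungTaskTracker:
    # Hypothetical knobs, not part of the patchset:
    threshold_secs: float  # how long hangs may persist before we act
    quiet_secs: float      # no new detections for this long means "recovered"
    last_count: Optional[int] = None
    streak_start: Optional[float] = None
    last_increase: Optional[float] = None

    def observe(self, now: float, count: int) -> bool:
        """Record one sample; return True if hangs persisted past the threshold."""
        if self.last_count is not None and count > self.last_count:
            # New detections since the previous sample: start or extend a streak.
            if self.streak_start is None:
                self.streak_start = now
            self.last_increase = now
        elif (self.last_increase is not None
              and now - self.last_increase >= self.quiet_secs):
            # No new detections for quiet_secs: treat the node as recovered.
            self.streak_start = None
            self.last_increase = None
        self.last_count = count
        return (self.streak_start is not None
                and now - self.streak_start >= self.threshold_secs)
```

A caller would sample the counter at an interval no shorter than the hung task
check period, and trigger container migration once observe() returns True.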
Moreover, just like other important events such as OOM that already have
counters, having a dedicated counter for hung tasks makes sense ;)
---
Changes since v1 [1]
====================
 - hung_task: add detect count for hung tasks
   - Update the changelog (per Andrew)
   - Find other folks to CC (per Andrew)
[1] https://lore.kernel.org/linux-mm/20241022114736.83285-1-ioworker0@gmail.com
Lance Yang (2):
  hung_task: add detect count for hung tasks
  hung_task: add docs for hung_task_detect_count
 Documentation/admin-guide/sysctl/kernel.rst |  9 +++++++++
 kernel/hung_task.c                          | 18 ++++++++++++++++++
 2 files changed, 27 insertions(+)
--
2.45.2