* [PATCH v2 0/2] add detect count for hung tasks
@ 2024-10-27 12:07 Lance Yang
2024-10-27 12:07 ` [PATCH v2 1/2] hung_task: " Lance Yang
2024-10-27 12:07 ` [PATCH v2 2/2] hung_task: add docs for hung_task_detect_count Lance Yang
0 siblings, 2 replies; 4+ messages in thread
From: Lance Yang @ 2024-10-27 12:07 UTC
To: akpm
Cc: dj456119, cunhuang, leonylgao, j.granados, jsiddle,
kent.overstreet, 21cnbao, ryan.roberts, david, ziy, libang.li,
baolin.wang, linux-kernel, linux-mm, joel.granados, linux,
Lance Yang
Hi all,
This patchset adds a counter, hung_task_detect_count, to track the number
of times hung tasks are detected.
IMHO, hung tasks are a critical metric. Currently, we detect them by
periodically parsing dmesg. However, this method isn't as user-friendly
as using a counter.
Sometimes, a short-lived issue with a NIC or hard drive can quickly decrease
hung_task_warnings to zero. Without warnings, we must directly access
the node to ensure that there are no more hung tasks and that the system
has recovered. After all, load average alone cannot provide a clear
picture.
Once this counter is in place, in a high-density deployment pattern, we
plan to set hung_task_timeout_secs to a lower number to improve stability,
even though this might result in false positives. And then we can set a
time-based threshold: if hung tasks last beyond this duration, we will
automatically migrate containers to other nodes. Based on past experience,
this approach could help avoid many production disruptions.
Moreover, just like other important events such as OOM that already have
counters, having a dedicated counter for hung tasks makes sense ;)
---
Changes since v1 [1]
====================
- hung_task: add detect count for hung tasks
- Update the changelog (per Andrew)
- Find other folks to CC (per Andrew)
[1] https://lore.kernel.org/linux-mm/20241022114736.83285-1-ioworker0@gmail.com
Lance Yang (2):
hung_task: add detect count for hung tasks
hung_task: add docs for hung_task_detect_count
Documentation/admin-guide/sysctl/kernel.rst | 9 +++++++++
kernel/hung_task.c | 18 ++++++++++++++++++
2 files changed, 27 insertions(+)
--
2.45.2
* [PATCH v2 1/2] hung_task: add detect count for hung tasks
From: Lance Yang @ 2024-10-27 12:07 UTC
To: akpm
Cc: dj456119, cunhuang, leonylgao, j.granados, jsiddle,
kent.overstreet, 21cnbao, ryan.roberts, david, ziy, libang.li,
baolin.wang, linux-kernel, linux-mm, joel.granados, linux,
Lance Yang, Mingzhe Yang
This commit adds a counter, hung_task_detect_count, to track the number of
times hung tasks are detected.
IMHO, hung tasks are a critical metric. Currently, we detect them by
periodically parsing dmesg. However, this method isn't as user-friendly as
using a counter.
Sometimes, a short-lived issue with a NIC or hard drive can quickly decrease
hung_task_warnings to zero. Without warnings, we must directly access
the node to ensure that there are no more hung tasks and that the system
has recovered. After all, load average alone cannot provide a clear
picture.
Once this counter is in place, in a high-density deployment pattern, we
plan to set hung_task_timeout_secs to a lower number to improve stability,
even though this might result in false positives. And then we can set a
time-based threshold: if hung tasks last beyond this duration, we will
automatically migrate containers to other nodes. Based on past experience,
this approach could help avoid many production disruptions.
Moreover, just like other important events such as OOM that already have
counters, having a dedicated counter for hung tasks makes sense.
Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
Signed-off-by: Lance Yang <ioworker0@gmail.com>
---
kernel/hung_task.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 959d99583d1c..229ff3d4e501 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -30,6 +30,11 @@
*/
static int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
+/*
+ * Total number of tasks detected as hung since boot:
+ */
+static unsigned long __read_mostly sysctl_hung_task_detect_count;
+
/*
* Limit number of tasks checked in a batch.
*
@@ -115,6 +120,12 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
if (time_is_after_jiffies(t->last_switch_time + timeout * HZ))
return;
+ /*
+ * This counter tracks the total number of tasks detected as hung
+ * since boot.
+ */
+ sysctl_hung_task_detect_count++;
+
trace_sched_process_hang(t);
if (sysctl_hung_task_panic) {
@@ -314,6 +325,13 @@ static struct ctl_table hung_task_sysctls[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_NEG_ONE,
},
+ {
+ .procname = "hung_task_detect_count",
+ .data = &sysctl_hung_task_detect_count,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0444,
+ .proc_handler = proc_dointvec,
+ },
};
static void __init hung_task_sysctl_init(void)
--
2.45.2
* [PATCH v2 2/2] hung_task: add docs for hung_task_detect_count
From: Lance Yang @ 2024-10-27 12:07 UTC
To: akpm
Cc: dj456119, cunhuang, leonylgao, j.granados, jsiddle,
kent.overstreet, 21cnbao, ryan.roberts, david, ziy, libang.li,
baolin.wang, linux-kernel, linux-mm, joel.granados, linux,
Lance Yang, Mingzhe Yang
This commit introduces documentation for hung_task_detect_count in
kernel.rst.
Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
Signed-off-by: Lance Yang <ioworker0@gmail.com>
---
Documentation/admin-guide/sysctl/kernel.rst | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index f8bc1630eba0..b2b36d0c3094 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -401,6 +401,15 @@ The upper bound on the number of tasks that are checked.
This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
+hung_task_detect_count
+======================
+
+Indicates the total number of tasks that have been detected as hung since
+the system boot.
+
+This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
+
+
hung_task_timeout_secs
======================
--
2.45.2
* Re: [PATCH v2 1/2] hung_task: add detect count for hung tasks
From: Lance Yang @ 2024-11-01 11:48 UTC
To: akpm
Cc: ioworker0, 21cnbao, baolin.wang, cunhuang, david, dj456119,
j.granados, joel.granados, jsiddle, kent.overstreet, leonylgao,
libang.li, linux-kernel, linux-mm, linux, mingzhe.yang,
ryan.roberts, ziy
Hi Andrew,
Sorry, I made a stupid mistake :(
sysctl_hung_task_detect_count is defined as an unsigned long, so we need
to use proc_doulongvec_minmax instead of proc_dointvec to handle it
correctly. Could you please fold the following changes into this patch?
---
kernel/hung_task.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 229ff3d4e501..c18717189f32 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -330,7 +330,7 @@ static struct ctl_table hung_task_sysctls[] = {
.data = &sysctl_hung_task_detect_count,
.maxlen = sizeof(unsigned long),
.mode = 0444,
- .proc_handler = proc_dointvec,
+ .proc_handler = proc_doulongvec_minmax,
},
};
--
Thanks,
Lance