* [PATCH 0/2] hung_task: add detect count for hung tasks
From: Lance Yang @ 2024-10-22 11:47 UTC
To: akpm
Cc: cunhuang, leonylgao, j.granados, jsiddle, kent.overstreet,
21cnbao, ryan.roberts, david, ziy, libang.li, baolin.wang,
linux-kernel, linux-mm, Lance Yang

Hi all,

This patchset adds a counter, hung_task_detect_count, to track the number of
times hung tasks are detected. This counter provides a straightforward way
to monitor hung task events without manually checking dmesg logs.

With this counter in place, system issues can be spotted quickly, allowing
admins to step in promptly before system load spikes occur, even if the
hung_task_warnings value has been decreased to 0 well before.

Recently, we encountered a situation where warnings about hung tasks were
buried in dmesg logs during load spikes. Introducing this counter could
have helped us detect such issues earlier and improve our analysis efficiency.

Lance Yang (2):
  hung_task: add detect count for hung tasks
  hung_task: add docs for hung_task_detect_count

 Documentation/admin-guide/sysctl/kernel.rst |  9 +++++++++
 kernel/hung_task.c                          | 18 ++++++++++++++++++
 2 files changed, 27 insertions(+)

--
2.45.2

* [PATCH 1/2] hung_task: add detect count for hung tasks
From: Lance Yang @ 2024-10-22 11:47 UTC
To: akpm
Cc: cunhuang, leonylgao, j.granados, jsiddle, kent.overstreet,
21cnbao, ryan.roberts, david, ziy, libang.li, baolin.wang,
linux-kernel, linux-mm, Lance Yang, Mingzhe Yang

This commit adds a counter, hung_task_detect_count, to track the number of
times hung tasks are detected. This counter provides a straightforward way
to monitor hung task events without manually checking dmesg logs.

With this counter in place, system issues can be spotted quickly, allowing
admins to step in promptly before system load spikes occur, even if the
hung_task_warnings value has been decreased to 0 well before.

Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
Signed-off-by: Lance Yang <ioworker0@gmail.com>
---
 kernel/hung_task.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 959d99583d1c..229ff3d4e501 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -30,6 +30,11 @@
  */
 static int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
 
+/*
+ * Total number of tasks detected as hung since boot:
+ */
+static unsigned long __read_mostly sysctl_hung_task_detect_count;
+
 /*
  * Limit number of tasks checked in a batch.
  *
@@ -115,6 +120,12 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 	if (time_is_after_jiffies(t->last_switch_time + timeout * HZ))
 		return;
 
+	/*
+	 * This counter tracks the total number of tasks detected as hung
+	 * since boot.
+	 */
+	sysctl_hung_task_detect_count++;
+
 	trace_sched_process_hang(t);
 
 	if (sysctl_hung_task_panic) {
@@ -314,6 +325,13 @@ static struct ctl_table hung_task_sysctls[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_NEG_ONE,
 	},
+	{
+		.procname	= "hung_task_detect_count",
+		.data		= &sysctl_hung_task_detect_count,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0444,
+		.proc_handler	= proc_doulongvec_minmax,
+	},
 };
 
 static void __init hung_task_sysctl_init(void)
--
2.45.2
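
For readers who want to consume the new file from userspace, a minimal sketch follows. It assumes the sysctl is exposed as /proc/sys/kernel/hung_task_detect_count (the path implied by the procname in the table entry) and that the file holds a single unsigned decimal value. The helper names parse_hung_count() and read_hung_count() are made up for illustration and are not part of the patch.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed location of the new sysctl; not stated verbatim in the patch. */
#define HUNG_COUNT_PATH "/proc/sys/kernel/hung_task_detect_count"

/*
 * Parse one unsigned decimal value such as "42\n".
 * Returns 0 on success and stores the value in *out, -1 on malformed input.
 */
int parse_hung_count(const char *text, unsigned long *out)
{
	char *end;
	unsigned long val;

	if (text[0] < '0' || text[0] > '9')
		return -1;	/* reject empty input, signs and leading spaces */

	errno = 0;
	val = strtoul(text, &end, 10);
	if (errno != 0)
		return -1;	/* value does not fit in unsigned long */
	if (*end != '\0' && *end != '\n')
		return -1;	/* trailing garbage after the number */

	*out = val;
	return 0;
}

/* Read and parse the counter from path (hypothetical helper); 0 or -1. */
int read_hung_count(const char *path, unsigned long *out)
{
	char buf[64];
	FILE *fp = fopen(path, "r");
	int ret = -1;

	if (!fp)
		return -1;	/* file missing, e.g. CONFIG_DETECT_HUNG_TASK off */
	if (fgets(buf, sizeof(buf), fp))
		ret = parse_hung_count(buf, out);
	fclose(fp);
	return ret;
}
```

A caller would pass HUNG_COUNT_PATH to read_hung_count() and treat a -1 return as "counter not available" rather than as zero detections.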

* [PATCH 2/2] hung_task: add docs for hung_task_detect_count
From: Lance Yang @ 2024-10-22 11:47 UTC
To: akpm
Cc: cunhuang, leonylgao, j.granados, jsiddle, kent.overstreet,
21cnbao, ryan.roberts, david, ziy, libang.li, baolin.wang,
linux-kernel, linux-mm, Lance Yang, Mingzhe Yang

This commit introduces documentation for hung_task_detect_count in
kernel.rst.

Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
Signed-off-by: Lance Yang <ioworker0@gmail.com>
---
 Documentation/admin-guide/sysctl/kernel.rst | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index f8bc1630eba0..b2b36d0c3094 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -401,6 +401,15 @@ The upper bound on the number of tasks that are checked.
 This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
 
 
+hung_task_detect_count
+======================
+
+Indicates the total number of tasks that have been detected as hung since
+the system boot.
+
+This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
+
+
 hung_task_timeout_secs
 ======================
 
--
2.45.2
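
Because the documented value is a running total since boot, a monitoring agent would typically care about the change between two samples rather than the absolute number. A minimal sketch of that arithmetic follows, assuming the agent stores the previous sample itself. hung_count_delta() is a hypothetical helper, and treating a decrease as a counter restart is an assumption of this sketch, not something the patch specifies.

```c
/*
 * Number of new hung-task detections between two samples of the counter.
 * If the current value is lower than the previous one, the counter is
 * assumed to have restarted from zero (for example after a reboot), so the
 * current value itself is reported as the delta.
 */
unsigned long hung_count_delta(unsigned long prev, unsigned long cur)
{
	if (cur < prev)
		return cur;
	return cur - prev;
}
```

A non-zero result on any poll means at least one task was detected as hung since the previous poll, regardless of what is left in dmesg.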

* Re: [PATCH 0/2] hung_task: add detect count for hung tasks
From: Andrew Morton @ 2024-10-24 2:05 UTC
To: Lance Yang
Cc: cunhuang, leonylgao, j.granados, jsiddle, kent.overstreet,
21cnbao, ryan.roberts, david, ziy, libang.li, baolin.wang,
linux-kernel, linux-mm

On Tue, 22 Oct 2024 19:47:34 +0800 Lance Yang <ioworker0@gmail.com> wrote:

> Hi all,
>
> This patchset adds a counter, hung_task_detect_count, to track the number of
> times hung tasks are detected. This counter provides a straightforward way
> to monitor hung task events without manually checking dmesg logs.
>
> With this counter in place, system issues can be spotted quickly, allowing
> admins to step in promptly before system load spikes occur, even if the
> hung_task_warnings value has been decreased to 0 well before.
>
> Recently, we encountered a situation where warnings about hung tasks were
> buried in dmesg logs during load spikes. Introducing this counter could
> have helped us detect such issues earlier and improve our analysis efficiency.
>

Isn't the answer to this problem "write a better parser"? I mean,
we're providing userspace with information which is already available.

* Re: [PATCH 0/2] hung_task: add detect count for hung tasks
From: Lance Yang @ 2024-10-24 3:28 UTC
To: Andrew Morton
Cc: cunhuang, leonylgao, j.granados, jsiddle, kent.overstreet,
21cnbao, ryan.roberts, david, ziy, libang.li, baolin.wang,
linux-kernel, linux-mm

Hi Andrew,

Thanks a lot for paying attention!

On Thu, Oct 24, 2024 at 10:05 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 22 Oct 2024 19:47:34 +0800 Lance Yang <ioworker0@gmail.com> wrote:
>
> > Hi all,
> >
> > This patchset adds a counter, hung_task_detect_count, to track the number of
> > times hung tasks are detected. This counter provides a straightforward way
> > to monitor hung task events without manually checking dmesg logs.
> >
> > With this counter in place, system issues can be spotted quickly, allowing
> > admins to step in promptly before system load spikes occur, even if the
> > hung_task_warnings value has been decreased to 0 well before.
> >
> > Recently, we encountered a situation where warnings about hung tasks were
> > buried in dmesg logs during load spikes. Introducing this counter could
> > have helped us detect such issues earlier and improve our analysis efficiency.
> >
>
> Isn't the answer to this problem "write a better parser"? I mean,

Yeah, I certainly agree that having a good parser is important, and I'm
working on that as well ;)

> we're providing userspace with information which is already available.

IMHO, there are two reasons why this counter remains valuable:

1) It allows us to easily detect hung tasks in time before load spikes occur,
using simple and common monitoring tools like Prometheus.

2) It ensures that we remain aware of hung tasks even when the
hung_task_warnings value has already been decreased to 0 well before.

Thanks again for your time!
Lance
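
The second point can be illustrated with a small userspace model. It is a simplified sketch, not the kernel code: it assumes that each detection first bumps the counter (as patch 1/2 does) and then produces a log report only while a warning budget remains, with a negative budget meaning unlimited reports. The struct and function names are invented for this example.

```c
#include <stdbool.h>

/* Simplified model of the detector state (names are illustrative only). */
struct hung_model {
	unsigned long detect_count;	/* models hung_task_detect_count */
	int warnings;			/* models hung_task_warnings; < 0 = unlimited */
	unsigned long reports;		/* how many reports reached the log */
};

/* Record one detection; returns true if a report would be logged. */
bool hung_model_detect(struct hung_model *m)
{
	m->detect_count++;		/* every detection is counted */

	if (m->warnings == 0)
		return false;		/* budget exhausted: nothing reaches dmesg */
	if (m->warnings > 0)
		m->warnings--;		/* consume one unit of a finite budget */
	m->reports++;
	return true;
}
```

With a budget of 2 and five detections, the model logs two reports while detect_count reaches 5, which is the gap the counter is meant to close.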

* Re: [PATCH 0/2] hung_task: add detect count for hung tasks
From: Andrew Morton @ 2024-10-24 4:28 UTC
To: Lance Yang
Cc: cunhuang, leonylgao, j.granados, jsiddle, kent.overstreet,
21cnbao, ryan.roberts, david, ziy, libang.li, baolin.wang,
linux-kernel, linux-mm

On Thu, 24 Oct 2024 11:28:01 +0800 Lance Yang <ioworker0@gmail.com> wrote:

> Hi Andrew,
>
> Thanks a lot for paying attention!
>
> On Thu, Oct 24, 2024 at 10:05 AM Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > On Tue, 22 Oct 2024 19:47:34 +0800 Lance Yang <ioworker0@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > This patchset adds a counter, hung_task_detect_count, to track the number of
> > > times hung tasks are detected. This counter provides a straightforward way
> > > to monitor hung task events without manually checking dmesg logs.
> > >
> > > With this counter in place, system issues can be spotted quickly, allowing
> > > admins to step in promptly before system load spikes occur, even if the
> > > hung_task_warnings value has been decreased to 0 well before.
> > >
> > > Recently, we encountered a situation where warnings about hung tasks were
> > > buried in dmesg logs during load spikes. Introducing this counter could
> > > have helped us detect such issues earlier and improve our analysis efficiency.
> > >
> >
> > Isn't the answer to this problem "write a better parser"? I mean,
>
> Yeah, I certainly agree that having a good parser is important, and I'm
> working on that as well ;)
>
> > we're providing userspace with information which is already available.
>
> IMHO, there are two reasons why this counter remains valuable:
>
> 1) It allows us to easily detect hung tasks in time before load spikes occur,
> using simple and common monitoring tools like Prometheus.

But the new sysctl_hung_task_detect_count counter gets incremented a
microsecond before the printk comes out. I don't understand the
difference.

> 2) It ensures that we remain aware of hung tasks even when the
> hung_task_warnings value has already been decreased to 0 well before.

That makes sense, I guess. But fleshing this out with a real
operational scenario would help persuade reviewers of the benefit of
this change.

So please describe the utility with full details - sell it to us!

* Re: [PATCH 0/2] hung_task: add detect count for hung tasks
From: Lance Yang @ 2024-10-24 8:48 UTC
To: Andrew Morton
Cc: cunhuang, leonylgao, j.granados, jsiddle, kent.overstreet,
21cnbao, ryan.roberts, david, ziy, libang.li, baolin.wang,
linux-kernel, linux-mm

On Thu, Oct 24, 2024 at 12:28 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 24 Oct 2024 11:28:01 +0800 Lance Yang <ioworker0@gmail.com> wrote:
>
> > Hi Andrew,
> >
> > Thanks a lot for paying attention!
> >
> > On Thu, Oct 24, 2024 at 10:05 AM Andrew Morton
> > <akpm@linux-foundation.org> wrote:
> > >
> > > On Tue, 22 Oct 2024 19:47:34 +0800 Lance Yang <ioworker0@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > This patchset adds a counter, hung_task_detect_count, to track the number of
> > > > times hung tasks are detected. This counter provides a straightforward way
> > > > to monitor hung task events without manually checking dmesg logs.
> > > >
> > > > With this counter in place, system issues can be spotted quickly, allowing
> > > > admins to step in promptly before system load spikes occur, even if the
> > > > hung_task_warnings value has been decreased to 0 well before.
> > > >
> > > > Recently, we encountered a situation where warnings about hung tasks were
> > > > buried in dmesg logs during load spikes. Introducing this counter could
> > > > have helped us detect such issues earlier and improve our analysis efficiency.
> > > >
> > >
> > > Isn't the answer to this problem "write a better parser"? I mean,
> >
> > Yeah, I certainly agree that having a good parser is important, and I'm
> > working on that as well ;)
> >
> > > we're providing userspace with information which is already available.
> >
> > IMHO, there are two reasons why this counter remains valuable:
> >
> > 1) It allows us to easily detect hung tasks in time before load spikes occur,
> > using simple and common monitoring tools like Prometheus.
>
> But the new sysctl_hung_task_detect_count counter gets incremented a
> microsecond before the printk comes out. I don't understand the
> difference.
>
> > 2) It ensures that we remain aware of hung tasks even when the
> > hung_task_warnings value has already been decreased to 0 well before.
>
> That makes sense, I guess. But fleshing this out with a real
> operational scenario would help persuade reviewers of the benefit of
> this change.
>
> So please describe the utility with full details - sell it to us!

Thanks, the suggestion is very helpful!

IMHO, hung tasks are a critical metric. Currently, we detect them by
periodically parsing dmesg. However, this method isn't as user-friendly
as using a counter.

Sometimes, a short-lived issue with the NIC or hard drive can quickly
decrease the hung_task_warnings to zero. Without warnings, we must
directly access the node to ensure that there are no more hung tasks and
that the system has recovered. After all, load alone cannot provide a
clear picture.

Once this counter is in place, in a high-density deployment pattern, we
plan to set hung_task_timeout_secs to a lower number to improve stability,
even though this might result in false positives. And then we can set a
time-based threshold: if hung tasks last beyond this duration, we will
automatically migrate containers to other nodes. Based on past experience,
this approach could help avoid many production disruptions.

Moreover, just like other important events such as OOM that already have
counters, having a dedicated counter for hung tasks makes sense ;)

Thanks,
Lance
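
The time-based threshold described in this reply could be sketched as follows. This is one possible reading, assuming an agent samples the counter periodically and interprets "hung tasks last beyond this duration" as "every consecutive sample has shown new detections for at least threshold_secs". All names are hypothetical, and the migration step itself is left to the caller.

```c
#include <stdbool.h>

struct hung_watch {
	unsigned long last_count;	/* counter value at the previous sample */
	long streak_start;		/* time (seconds) the current streak began */
	bool in_streak;			/* true while each sample shows new detections */
	long threshold_secs;		/* how long a streak may last before acting */
};

/*
 * Feed one sample taken at time now (seconds on a monotonic clock).
 * Returns true when new detections have been seen on consecutive samples
 * for at least threshold_secs, i.e. when the caller should migrate.
 */
bool hung_watch_sample(struct hung_watch *w, unsigned long count, long now)
{
	bool increased = count > w->last_count;

	w->last_count = count;
	if (!increased) {
		w->in_streak = false;	/* no new detections: treat as recovered */
		return false;
	}
	if (!w->in_streak) {
		w->in_streak = true;	/* first sample of a new streak */
		w->streak_start = now;
	}
	return now - w->streak_start >= w->threshold_secs;
}
```

A short-lived NIC or disk hiccup would bump the counter on one or two samples and then reset the streak, while a node that keeps detecting hung tasks past the threshold would be flagged for migration.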