[PATCH 0/2] hung_task: add detect count for hung tasks

* [PATCH 0/2] hung_task: add detect count for hung tasks
@ 2024-10-22 11:47 Lance Yang
  2024-10-22 11:47 ` [PATCH 1/2] " Lance Yang
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Lance Yang @ 2024-10-22 11:47 UTC (permalink / raw)
  To: akpm
  Cc: cunhuang, leonylgao, j.granados, jsiddle, kent.overstreet,
	21cnbao, ryan.roberts, david, ziy, libang.li, baolin.wang,
	linux-kernel, linux-mm, Lance Yang

Hi all,

This patchset adds a counter, hung_task_detect_count, to track the number of
times hung tasks are detected. This counter provides a straightforward way
to monitor hung task events without manually checking dmesg logs.

With this counter in place, system issues can be spotted quickly, allowing
admins to step in promptly before system load spikes occur, even if the
hung_task_warnings value has been decreased to 0 well before.

Recently, we encountered a situation where warnings about hung tasks were
buried in dmesg logs during load spikes. Introducing this counter could
have helped us detect such issues earlier and improve our analysis efficiency.

Lance Yang (2):
  hung_task: add detect count for hung tasks
  hung_task: add docs for hung_task_detect_count

 Documentation/admin-guide/sysctl/kernel.rst |  9 +++++++++
 kernel/hung_task.c                          | 18 ++++++++++++++++++
 2 files changed, 27 insertions(+)

-- 
2.45.2


* [PATCH 1/2] hung_task: add detect count for hung tasks
  2024-10-22 11:47 [PATCH 0/2] hung_task: add detect count for hung tasks Lance Yang
@ 2024-10-22 11:47 ` Lance Yang
  2024-10-22 11:47 ` [PATCH 2/2] hung_task: add docs for hung_task_detect_count Lance Yang
  2024-10-24  2:05 ` [PATCH 0/2] hung_task: add detect count for hung tasks Andrew Morton
  2 siblings, 0 replies; 7+ messages in thread
From: Lance Yang @ 2024-10-22 11:47 UTC (permalink / raw)
  To: akpm
  Cc: cunhuang, leonylgao, j.granados, jsiddle, kent.overstreet,
	21cnbao, ryan.roberts, david, ziy, libang.li, baolin.wang,
	linux-kernel, linux-mm, Lance Yang, Mingzhe Yang

This commit adds a counter, hung_task_detect_count, to track the number of
times hung tasks are detected. This counter provides a straightforward way
to monitor hung task events without manually checking dmesg logs.

With this counter in place, system issues can be spotted quickly, allowing
admins to step in promptly before system load spikes occur, even if the
hung_task_warnings value has been decreased to 0 well before.

Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
Signed-off-by: Lance Yang <ioworker0@gmail.com>
---
 kernel/hung_task.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 959d99583d1c..229ff3d4e501 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -30,6 +30,11 @@
  */
 static int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
 
+/*
+ * Total number of tasks detected as hung since boot:
+ */
+static unsigned long __read_mostly sysctl_hung_task_detect_count;
+
 /*
  * Limit number of tasks checked in a batch.
  *
@@ -115,6 +120,12 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 	if (time_is_after_jiffies(t->last_switch_time + timeout * HZ))
 		return;
 
+	/*
+	 * This counter tracks the total number of tasks detected as hung
+	 * since boot.
+	 */
+	sysctl_hung_task_detect_count++;
+
 	trace_sched_process_hang(t);
 
 	if (sysctl_hung_task_panic) {
@@ -314,6 +325,13 @@ static struct ctl_table hung_task_sysctls[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_NEG_ONE,
 	},
+	{
+		.procname	= "hung_task_detect_count",
+		.data		= &sysctl_hung_task_detect_count,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0444,
+		.proc_handler	= proc_doulongvec_minmax,
+	},
 };
 
 static void __init hung_task_sysctl_init(void)
-- 
2.45.2




* [PATCH 2/2] hung_task: add docs for hung_task_detect_count
  2024-10-22 11:47 [PATCH 0/2] hung_task: add detect count for hung tasks Lance Yang
  2024-10-22 11:47 ` [PATCH 1/2] " Lance Yang
@ 2024-10-22 11:47 ` Lance Yang
  2024-10-24  2:05 ` [PATCH 0/2] hung_task: add detect count for hung tasks Andrew Morton
  2 siblings, 0 replies; 7+ messages in thread
From: Lance Yang @ 2024-10-22 11:47 UTC (permalink / raw)
  To: akpm
  Cc: cunhuang, leonylgao, j.granados, jsiddle, kent.overstreet,
	21cnbao, ryan.roberts, david, ziy, libang.li, baolin.wang,
	linux-kernel, linux-mm, Lance Yang, Mingzhe Yang

This commit introduces documentation for hung_task_detect_count in
kernel.rst.

Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
Signed-off-by: Lance Yang <ioworker0@gmail.com>
---
 Documentation/admin-guide/sysctl/kernel.rst | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index f8bc1630eba0..b2b36d0c3094 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -401,6 +401,15 @@ The upper bound on the number of tasks that are checked.
 This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
 
 
+hung_task_detect_count
+======================
+
+Indicates the total number of tasks that have been detected as hung since
+system boot.
+
+This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
+
+
 hung_task_timeout_secs
 ======================
 
-- 
2.45.2




* Re: [PATCH 0/2] hung_task: add detect count for hung tasks
  2024-10-22 11:47 [PATCH 0/2] hung_task: add detect count for hung tasks Lance Yang
  2024-10-22 11:47 ` [PATCH 1/2] " Lance Yang
  2024-10-22 11:47 ` [PATCH 2/2] hung_task: add docs for hung_task_detect_count Lance Yang
@ 2024-10-24  2:05 ` Andrew Morton
  2024-10-24  3:28   ` Lance Yang
  2 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2024-10-24  2:05 UTC (permalink / raw)
  To: Lance Yang
  Cc: cunhuang, leonylgao, j.granados, jsiddle, kent.overstreet,
	21cnbao, ryan.roberts, david, ziy, libang.li, baolin.wang,
	linux-kernel, linux-mm

On Tue, 22 Oct 2024 19:47:34 +0800 Lance Yang <ioworker0@gmail.com> wrote:

> Hi all,
> 
> This patchset adds a counter, hung_task_detect_count, to track the number of
> times hung tasks are detected. This counter provides a straightforward way
> to monitor hung task events without manually checking dmesg logs.
> 
> With this counter in place, system issues can be spotted quickly, allowing
> admins to step in promptly before system load spikes occur, even if the
> hung_task_warnings value has been decreased to 0 well before.
> 
> Recently, we encountered a situation where warnings about hung tasks were
> buried in dmesg logs during load spikes. Introducing this counter could
> have helped us detect such issues earlier and improve our analysis efficiency.
> 

Isn't the answer to this problem "write a better parser"?  I mean,
we're providing userspace with information which is already available.




* Re: [PATCH 0/2] hung_task: add detect count for hung tasks
  2024-10-24  2:05 ` [PATCH 0/2] hung_task: add detect count for hung tasks Andrew Morton
@ 2024-10-24  3:28   ` Lance Yang
  2024-10-24  4:28     ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Lance Yang @ 2024-10-24  3:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cunhuang, leonylgao, j.granados, jsiddle, kent.overstreet,
	21cnbao, ryan.roberts, david, ziy, libang.li, baolin.wang,
	linux-kernel, linux-mm

Hi Andrew,

Thanks a lot for paying attention!

On Thu, Oct 24, 2024 at 10:05 AM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Tue, 22 Oct 2024 19:47:34 +0800 Lance Yang <ioworker0@gmail.com> wrote:
>
> > Hi all,
> >
> > This patchset adds a counter, hung_task_detect_count, to track the number of
> > times hung tasks are detected. This counter provides a straightforward way
> > to monitor hung task events without manually checking dmesg logs.
> >
> > With this counter in place, system issues can be spotted quickly, allowing
> > admins to step in promptly before system load spikes occur, even if the
> > hung_task_warnings value has been decreased to 0 well before.
> >
> > Recently, we encountered a situation where warnings about hung tasks were
> > buried in dmesg logs during load spikes. Introducing this counter could
> > have helped us detect such issues earlier and improve our analysis efficiency.
> >
>
> Isn't the answer to this problem "write a better parser"?  I mean,

Yeah, I certainly agree that having a good parser is important, and I'm
working on that as well ;)

> we're providing userspace with information which is already available.

IMHO, there are two reasons why this counter remains valuable:

1) It allows us to easily detect hung tasks in time before load spikes occur,
using simple and common monitoring tools like Prometheus.

2) It ensures that we remain aware of hung tasks even when the
hung_task_warnings value has already been decreased to 0 well before.
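To illustrate point 1, here is a minimal userspace sketch of how a monitoring
agent might read the counter. It is not part of the patch. The path is an
assumption derived from the procname in patch 1/2, and parse_count() and
read_count() are hypothetical helper names.

```c
#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed path, derived from the procname in patch 1/2. */
#define HUNG_TASK_DETECT_COUNT "/proc/sys/kernel/hung_task_detect_count"

/*
 * Parse the text of the sysctl file: one unsigned decimal number,
 * optionally followed by a newline. Returns 0 on success, -1 otherwise.
 */
static int parse_count(const char *text, unsigned long *out)
{
	char *end;
	unsigned long val;

	/* strtoul() would accept a leading sign or whitespace; reject both. */
	if (!isdigit((unsigned char)text[0]))
		return -1;

	errno = 0;
	val = strtoul(text, &end, 10);
	if (errno != 0)
		return -1;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -1;

	*out = val;
	return 0;
}

/* Read and parse the counter from path. Returns 0 on success, -1 on error. */
static int read_count(const char *path, unsigned long *out)
{
	char buf[32];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = parse_count(buf, out);
	fclose(f);
	return ret;
}
```

An agent would call read_count(HUNG_TASK_DETECT_COUNT, &n) on each scrape and
export n as a monotonically increasing counter. On kernels without this series
the file is absent and read_count() returns -1.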

Thanks again for your time!
Lance

>



* Re: [PATCH 0/2] hung_task: add detect count for hung tasks
  2024-10-24  3:28   ` Lance Yang
@ 2024-10-24  4:28     ` Andrew Morton
  2024-10-24  8:48       ` Lance Yang
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2024-10-24  4:28 UTC (permalink / raw)
  To: Lance Yang
  Cc: cunhuang, leonylgao, j.granados, jsiddle, kent.overstreet,
	21cnbao, ryan.roberts, david, ziy, libang.li, baolin.wang,
	linux-kernel, linux-mm

On Thu, 24 Oct 2024 11:28:01 +0800 Lance Yang <ioworker0@gmail.com> wrote:

> Hi Andrew,
> 
> Thanks a lot for paying attention!
> 
> On Thu, Oct 24, 2024 at 10:05 AM Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > On Tue, 22 Oct 2024 19:47:34 +0800 Lance Yang <ioworker0@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > This patchset adds a counter, hung_task_detect_count, to track the number of
> > > times hung tasks are detected. This counter provides a straightforward way
> > > to monitor hung task events without manually checking dmesg logs.
> > >
> > > With this counter in place, system issues can be spotted quickly, allowing
> > > admins to step in promptly before system load spikes occur, even if the
> > > hung_task_warnings value has been decreased to 0 well before.
> > >
> > > Recently, we encountered a situation where warnings about hung tasks were
> > > buried in dmesg logs during load spikes. Introducing this counter could
> > > have helped us detect such issues earlier and improve our analysis efficiency.
> > >
> >
> > Isn't the answer to this problem "write a better parser"?  I mean,
> 
> Yeah, I certainly agree that having a good parser is important, and I'm
> working on that as well ;)
> 
> > we're providing userspace with information which is already available.
> 
> IMHO, there are two reasons why this counter remains valuable:
> 
> 1) It allows us to easily detect hung tasks in time before load spikes occur,
> using simple and common monitoring tools like Prometheus.

But the new sysctl_hung_task_detect_count counter gets incremented a
microsecond before the printk comes out.  I don't understand the
difference.

> 2) It ensures that we remain aware of hung tasks even when the
> hung_task_warnings value has already been decreased to 0 well before.

That makes sense, I guess.  But fleshing this out with a real
operational scenario would help persuade reviewers of the benefit of
this change.

So please describe the utility with full details - sell it to us!



* Re: [PATCH 0/2] hung_task: add detect count for hung tasks
  2024-10-24  4:28     ` Andrew Morton
@ 2024-10-24  8:48       ` Lance Yang
  0 siblings, 0 replies; 7+ messages in thread
From: Lance Yang @ 2024-10-24  8:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cunhuang, leonylgao, j.granados, jsiddle, kent.overstreet,
	21cnbao, ryan.roberts, david, ziy, libang.li, baolin.wang,
	linux-kernel, linux-mm

On Thu, Oct 24, 2024 at 12:28 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Thu, 24 Oct 2024 11:28:01 +0800 Lance Yang <ioworker0@gmail.com> wrote:
>
> > Hi Andrew,
> >
> > Thanks a lot for paying attention!
> >
> > On Thu, Oct 24, 2024 at 10:05 AM Andrew Morton
> > <akpm@linux-foundation.org> wrote:
> > >
> > > On Tue, 22 Oct 2024 19:47:34 +0800 Lance Yang <ioworker0@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > This patchset adds a counter, hung_task_detect_count, to track the number of
> > > > times hung tasks are detected. This counter provides a straightforward way
> > > > to monitor hung task events without manually checking dmesg logs.
> > > >
> > > > With this counter in place, system issues can be spotted quickly, allowing
> > > > admins to step in promptly before system load spikes occur, even if the
> > > > hung_task_warnings value has been decreased to 0 well before.
> > > >
> > > > Recently, we encountered a situation where warnings about hung tasks were
> > > > buried in dmesg logs during load spikes. Introducing this counter could
> > > > have helped us detect such issues earlier and improve our analysis efficiency.
> > > >
> > >
> > > Isn't the answer to this problem "write a better parser"?  I mean,
> >
> > Yeah, I certainly agree that having a good parser is important, and I'm
> > working on that as well ;)
> >
> > > we're providing userspace with information which is already available.
> >
> > IMHO, there are two reasons why this counter remains valuable:
> >
> > 1) It allows us to easily detect hung tasks in time before load spikes occur,
> > using simple and common monitoring tools like Prometheus.
>
> But the new sysctl_hung_task_detect_count counter gets incremented a
> microsecond before the printk comes out.  I don't understand the
> difference.
>
> > 2) It ensures that we remain aware of hung tasks even when the
> > hung_task_warnings value has already been decreased to 0 well before.
>
> That makes sense, I guess.  But fleshing this out with a real
> operational scenario would help persuade reviewers of the benefit of
> this change.
>
> So please describe the utility with full details - sell it to us!

Thanks, the suggestion is very helpful!

IMHO, hung tasks are a critical metric. Currently, we detect them by
periodically parsing dmesg, which is less convenient than reading a
counter.

Sometimes, a short-lived issue with the NIC or hard drive can quickly
drive hung_task_warnings down to zero. Once no further warnings are
printed, we must log in to the node directly to confirm that there are
no more hung tasks and that the system has recovered. After all, load
alone cannot provide a clear picture.
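The difference between the warning budget and the counter can be shown with a
small userspace model. This is a simplified sketch and not the kernel code: it
ignores the panic path, and struct hung_task_model and model_detect() are
hypothetical names. It assumes the documented meaning of hung_task_warnings,
where -1 means an unlimited number of reports.

```c
#include <stdbool.h>

/* Simplified model of the state involved; not the kernel's own structures. */
struct hung_task_model {
	int warnings;		/* models hung_task_warnings; -1 means unlimited */
	unsigned long detected;	/* models hung_task_detect_count */
	unsigned long reported;	/* reports that would have reached dmesg */
};

/*
 * Record one hung-task detection. The counter always advances, but a
 * report is produced only while the warning budget is not exhausted.
 * Returns true if a warning would have been printed.
 */
static bool model_detect(struct hung_task_model *m)
{
	m->detected++;

	if (m->warnings == 0)
		return false;
	if (m->warnings > 0)
		m->warnings--;

	m->reported++;
	return true;
}
```

With a budget of 2 and five detections, the model ends with detected == 5 and
reported == 2: the last three hangs are visible only through the counter.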

Once this counter is in place, we plan to lower hung_task_timeout_secs in
high-density deployments to improve stability, even though this might
produce false positives. We can then set a time-based threshold: if hung
tasks persist beyond that duration, we will automatically migrate
containers to other nodes. Based on past experience, this approach could
help avoid many production disruptions.
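The time-based threshold could be sketched roughly as follows. This is only an
illustration under stated assumptions: struct hang_tracker and
tracker_update() are hypothetical names, the poller is assumed to sample no
more often than khungtaskd scans, and a task that stays hung is assumed to be
counted again on each scan, as the placement of the increment in patch 1/2
suggests.

```c
#include <stdbool.h>

/* Per-node state kept by a hypothetical polling agent. */
struct hang_tracker {
	unsigned long last_count;	/* counter value at the previous sample */
	long episode_start;		/* time (secs) the current episode began */
	bool in_episode;		/* counter grew at the previous sample */
	bool primed;			/* at least one sample has been seen */
};

/*
 * Feed one sample (counter value and timestamp in seconds). An episode is
 * a run of consecutive samples in which the counter kept growing. Returns
 * true once the current episode has lasted at least threshold_secs, which
 * is where the agent would start migrating containers.
 */
static bool tracker_update(struct hang_tracker *t, unsigned long count,
			   long now_secs, long threshold_secs)
{
	bool grew;

	if (!t->primed) {
		/* First sample only establishes the baseline. */
		t->primed = true;
		t->last_count = count;
		return false;
	}

	grew = count != t->last_count;
	t->last_count = count;

	if (!grew) {
		/* No new detections: the episode, if any, is over. */
		t->in_episode = false;
		return false;
	}

	if (!t->in_episode) {
		t->in_episode = true;
		t->episode_start = now_secs;
	}

	return now_secs - t->episode_start >= threshold_secs;
}
```

A short-lived hiccup that bumps the counter once or twice never reaches the
threshold, which is what makes a lower hung_task_timeout_secs tolerable
despite the false positives.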

Moreover, just like other important events such as OOM that already have
counters, having a dedicated counter for hung tasks makes sense ;)

Thanks,
Lance


