linux-mm.kvack.org archive mirror
From: Lance Yang <ioworker0@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: cunhuang@tencent.com, leonylgao@tencent.com,
	j.granados@samsung.com,  jsiddle@redhat.com,
	kent.overstreet@linux.dev, 21cnbao@gmail.com,
	 ryan.roberts@arm.com, david@redhat.com, ziy@nvidia.com,
	 libang.li@antgroup.com, baolin.wang@linux.alibaba.com,
	 linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 0/2] hung_task: add detect count for hung tasks
Date: Thu, 24 Oct 2024 16:48:45 +0800
Message-ID: <CAK1f24njxUdAc8GibSfrut78jQ4mH8Bno_=m8Pm8E49APnrhyw@mail.gmail.com>
In-Reply-To: <20241023212815.240844bdf83e4dc17b66b88c@linux-foundation.org>

On Thu, Oct 24, 2024 at 12:28 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Thu, 24 Oct 2024 11:28:01 +0800 Lance Yang <ioworker0@gmail.com> wrote:
>
> > Hi Andrew,
> >
> > Thanks a lot for paying attention!
> >
> > On Thu, Oct 24, 2024 at 10:05 AM Andrew Morton
> > <akpm@linux-foundation.org> wrote:
> > >
> > > On Tue, 22 Oct 2024 19:47:34 +0800 Lance Yang <ioworker0@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > This patchset adds a counter, hung_task_detect_count, to track the number of
> > > > times hung tasks are detected. This counter provides a straightforward way
> > > > to monitor hung task events without manually checking dmesg logs.
> > > >
> > > > With this counter in place, system issues can be spotted quickly, allowing
> > > > admins to step in promptly before system load spikes occur, even if the
> > > > hung_task_warnings value has been decreased to 0 well before.
> > > >
> > > > Recently, we encountered a situation where warnings about hung tasks were
> > > > buried in dmesg logs during load spikes. Introducing this counter could
> > > > have helped us detect such issues earlier and improve our analysis efficiency.
> > > >
> > >
> > > Isn't the answer to this problem "write a better parser"?  I mean,
> >
> > Yeah, I certainly agree that having a good parser is important, and I'm
> > working on that as well ;)
> >
> > > we're providing userspace with information which is already available.
> >
> > IMHO, there are two reasons why this counter remains valuable:
> >
> > 1) It allows us to easily detect hung tasks in time before load spikes occur,
> > using simple and common monitoring tools like Prometheus.
>
> But the new sysctl_hung_task_detect_count counter gets incremented a
> microsecond before the printk comes out.  I don't understand the
> difference.
>
> > 2) It ensures that we remain aware of hung tasks even when the
> > hung_task_warnings value has already been decreased to 0 well before.
>
> That makes sense, I guess.  But fleshing this out with a real
> operational scenario would help persuade reviewers of the benefit of
> this change.
>
> So please describe the utility with full details - sell it to us!

Thanks, the suggestion is very helpful!

IMHO, hung tasks are a critical metric. Currently, we detect them by
periodically parsing dmesg, which is less convenient than reading a
counter.

Sometimes a short-lived NIC or hard-drive issue can quickly drive
hung_task_warnings down to zero. Once the warnings stop, we have to log
in to the node directly to confirm that there are no more hung tasks
and that the system has recovered. After all, load alone cannot provide
a clear picture.
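
As a rough illustration of the monitoring side, here is a minimal sketch
of sampling the counter instead of parsing dmesg. The path
/proc/sys/kernel/hung_task_detect_count is an assumption based on where
the other hung_task sysctls live, and both function names are
hypothetical:

```python
from pathlib import Path

# Assumed location of the proposed counter; the patch under discussion
# defines the real sysctl name and path.
COUNTER_PATH = "/proc/sys/kernel/hung_task_detect_count"


def read_detect_count(path: str = COUNTER_PATH) -> int:
    """Return the cumulative number of hung-task detections."""
    return int(Path(path).read_text().strip())


def new_detections(previous: int, current: int) -> int:
    """Detections since the last sample.

    A lower value than before is treated as a counter restart
    (for example, the node rebooted between samples).
    """
    if current < previous:
        return current
    return current - previous
```

A non-zero delta between two samples means hung tasks were detected in
that interval, even if hung_task_warnings is already 0 and nothing new
shows up in dmesg.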

Once this counter is in place, we plan to lower hung_task_timeout_secs
in our high-density deployments to improve stability, even though this
may produce false positives. We can then set a time-based threshold: if
hung tasks persist beyond that duration, we will automatically migrate
containers to other nodes. Based on past experience, this approach
could help avoid many production disruptions.
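
To make the time-based threshold concrete, here is a minimal sketch of
the decision logic, fed with periodic samples of the counter. The class,
its fields, and the "quiet period" rule for ending an episode are all
hypothetical choices for illustration, not part of the patch:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HangTracker:
    """Tracks how long the detect counter has kept rising."""
    threshold_secs: float  # hypothetical: episode length that triggers migration
    quiet_secs: float      # hypothetical: time without new detections that ends an episode
    last_count: Optional[int] = None
    episode_start: Optional[float] = None
    last_increase: Optional[float] = None

    def update(self, count: int, now: float) -> bool:
        """Feed one sample; return True when migration should be triggered."""
        if self.last_count is not None and count > self.last_count:
            # New detections: open an episode if none is in progress.
            if self.episode_start is None:
                self.episode_start = now
            self.last_increase = now
        elif (self.last_increase is not None
              and now - self.last_increase >= self.quiet_secs):
            # Counter has been flat long enough: consider the node recovered.
            self.episode_start = None
            self.last_increase = None
        self.last_count = count
        return (self.episode_start is not None
                and now - self.episode_start >= self.threshold_secs)
```

With threshold_secs=120 and quiet_secs=60, samples that keep increasing
for two minutes return True, while a single blip followed by a flat
counter resets the episode and never triggers.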

Moreover, just like other important events such as OOM that already have
counters, having a dedicated counter for hung tasks makes sense ;)

Thanks,
Lance



Thread overview: 7+ messages
2024-10-22 11:47 Lance Yang
2024-10-22 11:47 ` [PATCH 1/2] " Lance Yang
2024-10-22 11:47 ` [PATCH 2/2] hung_task: add docs for hung_task_detect_count Lance Yang
2024-10-24  2:05 ` [PATCH 0/2] hung_task: add detect count for hung tasks Andrew Morton
2024-10-24  3:28   ` Lance Yang
2024-10-24  4:28     ` Andrew Morton
2024-10-24  8:48       ` Lance Yang [this message]
