From: Lance Yang
Date: Thu, 24 Oct 2024 16:48:45 +0800
Subject: Re: [PATCH 0/2] hung_task: add detect count for hung tasks
To: Andrew Morton
Cc: cunhuang@tencent.com, leonylgao@tencent.com, j.granados@samsung.com,
    jsiddle@redhat.com, kent.overstreet@linux.dev, 21cnbao@gmail.com,
    ryan.roberts@arm.com, david@redhat.com, ziy@nvidia.com,
    libang.li@antgroup.com, baolin.wang@linux.alibaba.com,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20241022114736.83285-1-ioworker0@gmail.com>
    <20241023190515.a80c77fe3fa895910d554888@linux-foundation.org>
    <20241023212815.240844bdf83e4dc17b66b88c@linux-foundation.org>
In-Reply-To: <20241023212815.240844bdf83e4dc17b66b88c@linux-foundation.org>

On Thu, Oct 24, 2024 at 12:28 PM Andrew Morton wrote:
>
> On Thu, 24 Oct 2024 11:28:01 +0800 Lance Yang wrote:
>
> > Hi Andrew,
> >
> > Thanks a lot for paying attention!
> >
> > On Thu, Oct 24, 2024 at 10:05 AM Andrew Morton wrote:
> > >
> > > On Tue, 22 Oct 2024 19:47:34 +0800 Lance Yang wrote:
> > >
> > > > Hi all,
> > > >
> > > > This patchset adds a counter, hung_task_detect_count, to track the number of
> > > > times hung tasks are detected. This counter provides a straightforward way
> > > > to monitor hung task events without manually checking dmesg logs.
> > > >
> > > > With this counter in place, system issues can be spotted quickly, allowing
> > > > admins to step in promptly before system load spikes occur, even if the
> > > > hung_task_warnings value has been decreased to 0 well before.
> > > >
> > > > Recently, we encountered a situation where warnings about hung tasks were
> > > > buried in dmesg logs during load spikes. Introducing this counter could
> > > > have helped us detect such issues earlier and improve our analysis efficiency.
> > > >
> > >
> > > Isn't the answer to this problem "write a better parser"? I mean,
> >
> > Yeah, I certainly agree that having a good parser is important, and I'm
> > working on that as well ;)
> >
> > > we're providing userspace with information which is already available.
> >
> > IHMO, there are two reasons why this counter remains valuable:
> >
> > 1) It allows us to easily detect hung tasks in time before load spikes occur,
> > using simple and common monitoring tools like Prometheus.
>
> But the new sysctl_hung_task_detect_count counter gets incremented a
> microsecond before the printk comes out. I don't understand the
> difference.
>
> > 2) It ensures that we remain aware of hung tasks even when the
> > hung_task_warnings value has already been decreased to 0 well before.
>
> That makes sense, I guess. But fleshing this out with a real
> operational scenario would help persuade reviewers of the benefit of
> this change.
>
> So please describe the utility with full details - sell it to us!

Thanks, the suggestion is very helpful!

IMHO, hung tasks are a critical metric. Currently, we detect them by
periodically parsing dmesg. However, this method isn't as user-friendly
as using a counter.

Sometimes, a short-lived issue with the NIC or hard drive can quickly
decrease hung_task_warnings to zero. Once the warnings are exhausted, we
have to log in to the node directly to make sure that there are no more
hung tasks and that the system has recovered. After all, load alone
cannot provide a clear picture.

Once this counter is in place, in a high-density deployment pattern, we
plan to set hung_task_timeout_secs to a lower value to improve stability,
even though this might result in false positives. We can then set a
time-based threshold: if hung tasks last beyond this duration, we will
automatically migrate containers to other nodes. Based on past
experience, this approach could help avoid many production disruptions.

Moreover, just like other important events such as OOM that already have
counters, having a dedicated counter for hung tasks makes sense ;)

Thanks,
Lance
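
To illustrate the time-based threshold described above, here is a
minimal userspace sketch in Python. It is only an assumption of how such
a monitor might look, not part of the patchset: the sysctl path
/proc/sys/kernel/hung_task_detect_count, the HungTaskWatcher class, and
the threshold_secs / quiet_secs parameters are all hypothetical names
chosen for illustration. quiet_secs should be longer than the interval
at which the kernel checks for hung tasks, otherwise an ongoing episode
would look like a recovery.

```python
from pathlib import Path
from typing import Optional

# Assumed location of the counter; the thread only names the sysctl itself.
COUNTER_PATH = Path("/proc/sys/kernel/hung_task_detect_count")


def read_detect_count(path: Path = COUNTER_PATH) -> int:
    """Read the cumulative number of hung-task detections."""
    return int(path.read_text().strip())


class HungTaskWatcher:
    """Decide when hung tasks have persisted long enough to act on.

    An "episode" starts when the counter first increases and ends when it
    has not increased for quiet_secs. observe() returns True once an
    episode has lasted at least threshold_secs.
    """

    def __init__(self, threshold_secs: float, quiet_secs: float) -> None:
        self.threshold_secs = threshold_secs
        self.quiet_secs = quiet_secs
        self.last_count: Optional[int] = None
        self.episode_start: Optional[float] = None
        self.last_increase: Optional[float] = None

    def observe(self, count: int, now: float) -> bool:
        """Feed one counter sample taken at time `now` (in seconds)."""
        # The first sample only establishes a baseline.
        if self.last_count is not None and count > self.last_count:
            if self.episode_start is None:
                self.episode_start = now
            self.last_increase = now
        self.last_count = count

        if self.episode_start is None:
            return False
        if now - self.last_increase >= self.quiet_secs:
            # No new detections for a while: treat the node as recovered.
            self.episode_start = None
            self.last_increase = None
            return False
        return now - self.episode_start >= self.threshold_secs
```

A monitoring agent could call read_detect_count() periodically, pass the
value and time.monotonic() to observe(), and start migrating containers
when it returns True. Because the decision depends only on the counter,
it keeps working after hung_task_warnings has dropped to zero.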