From: "HORIGUCHI NAOYA(堀口 直也)" <naoya.horiguchi@nec.com>
To: Jiaqi Yan <jiaqiyan@google.com>
Cc: "tony.luck@intel.com" <tony.luck@intel.com>,
"duenwen@google.com" <duenwen@google.com>,
"rientjes@google.com" <rientjes@google.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"shy828301@gmail.com" <shy828301@gmail.com>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"wangkefeng.wang@huawei.com" <wangkefeng.wang@huawei.com>
Subject: Re: [PATCH v1 1/3] mm: memory-failure: Add memory failure stats to sysfs
Date: Tue, 17 Jan 2023 09:02:09 +0000
Message-ID: <20230117090151.GA3428106@hori.linux.bs1.fc.nec.co.jp>
In-Reply-To: <20230116193902.1315236-2-jiaqiyan@google.com>
On Mon, Jan 16, 2023 at 07:39:00PM +0000, Jiaqi Yan wrote:
> Today kernel provides following memory error info to userspace, but each
> has its own disadvantage
> * HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
> not per NUMA node stats though
> * ras:memory_failure_event: only available after explicitly enabled
> * /dev/mcelog provides many useful info about the MCEs, but
> doesn't capture how memory_failure recovered memory MCEs
> * kernel logs: userspace needs to process log text
>
> Exposes per NUMA node memory error stats as sysfs entries:
>
> /sys/devices/system/node/node${X}/memory_failure/pages_poisoned
> /sys/devices/system/node/node${X}/memory_failure/pages_recovered
> /sys/devices/system/node/node${X}/memory_failure/pages_ignored
> /sys/devices/system/node/node${X}/memory_failure/pages_failed
> /sys/devices/system/node/node${X}/memory_failure/pages_delayed
>
> These counters describe how many raw pages are poisoned and after the
> attempted recoveries by the kernel, their resolutions: how many are
> recovered, ignored, failed, or delayed respectively.
>
> The following math holds for the statistics:
> * pages_poisoned = pages_recovered + pages_ignored + pages_failed +
> pages_delayed
> * pages_poisoned * PAGE_SIZE = /proc/meminfo/HardwareCorrupted
>
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
...
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index cd28a100d9e4..0a14b35a96da 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1110,6 +1110,31 @@ struct deferred_split {
> };
> #endif
>
> +#ifdef CONFIG_MEMORY_FAILURE
> +/*
> + * Per NUMA node memory failure handling statistics.
> + */
> +struct memory_failure_stats {
> + /*
> + * Number of pages poisoned.
> + * Cases not accounted: memory outside kernel control, offline page,
> + * arch-specific memory_failure (SGX), and hwpoison_filter()
> + * filtered error events.
> + */
Yes, this comment is important. It means the sum of the pages_poisoned counters
over all NUMA nodes can differ from the global counter shown in /proc/meminfo.
But this keeps the code simple, and the new stats are probably useful enough
even without covering the special cases. So I'm OK with this.
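A minimal userspace sketch (not part of the patch) of how a consumer might check the per-node identity from the commit message and measure the gap between the per-node sum and the global counter. The struct and function names are hypothetical. The values are assumed to come from the proposed node${X}/memory_failure/ files and from the HardwareCorrupted line of /proc/meminfo, which is printed in kB.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the five proposed per-node sysfs counters. */
struct node_mf_stats {
	uint64_t poisoned;
	uint64_t recovered;
	uint64_t ignored;
	uint64_t failed;
	uint64_t delayed;
};

/*
 * Per-node identity stated in the commit message:
 * pages_poisoned = pages_recovered + pages_ignored + pages_failed + pages_delayed
 */
static bool node_stats_consistent(const struct node_mf_stats *s)
{
	return s->poisoned ==
	       s->recovered + s->ignored + s->failed + s->delayed;
}

/* Sum pages_poisoned over all nodes and convert the result to kB. */
static uint64_t nodes_poisoned_kb(const struct node_mf_stats *nodes, size_t n,
				  uint64_t page_size)
{
	uint64_t pages = 0;

	for (size_t i = 0; i < n; i++)
		pages += nodes[i].poisoned;
	return pages * (page_size / 1024);
}

/*
 * Signed gap in kB between the global HardwareCorrupted value and the
 * per-node sum. A non-zero result is expected whenever one of the
 * unaccounted cases (offline page, SGX, hwpoison_filter(), ...) was hit.
 */
static int64_t poisoned_kb_gap(uint64_t hardware_corrupted_kb,
			       uint64_t per_node_kb)
{
	return (int64_t)hardware_corrupted_kb - (int64_t)per_node_kb;
}
```

A monitoring tool could treat a failed node_stats_consistent() check as a bug, but should treat a non-zero gap only as a hint that one of the unaccounted cases occurred.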
BTW, maybe "unpoison" could also be mentioned here?
Thanks,
Naoya Horiguchi