From: Shiyang Ruan <ruansy.fnst@fujitsu.com>
To: qemu-devel@nongnu.org, linux-cxl@vger.kernel.org,
    linux-edac@vger.kernel.org, linux-mm@kvack.org,
    dan.j.williams@intel.com, vishal.l.verma@intel.com,
    Jonathan.Cameron@huawei.com, alison.schofield@intel.com
Cc: bp@alien8.de, dave.jiang@intel.com, dave@stgolabs.net,
    ira.weiny@intel.com, james.morse@arm.com, linmiaohe@huawei.com,
    mchehab@kernel.org, nao.horiguchi@gmail.com, rric@kernel.org,
    tony.luck@intel.com, ruansy.fnst@fujitsu.com
Subject: [PATCH v4 0/2] cxl: add device reporting poison handler
Date: Thu, 8 Aug 2024 23:13:26 +0800
Message-ID: <20240808151328.707869-1-ruansy.fnst@fujitsu.com>

This patchset combines "cxl/core: introduce poison creation hanlding"
and "cxl: avoid duplicated report from MCE & device", which were
previously posted separately. Changes since the last version of each
patch:
P1: 1. since memory_failure() is called asynchronously here, set the
       flag to 0
    2. also handle the CXL_EVENT_TRANSACTION_SCAN_MEDIA type
P2: 1. use XArray instead of list_head
    2. add a guard() lock for CXL device iteration
P1&P2: rebase to v6.11-rc1

As we know, the CXL spec defines a POISON feature that a CXL memory
device uses to report that one of its pages has gone bad. There are
two major paths for this notification.

1. CPU handling error
   When a process accesses the broken page, the CXL device returns data
   with POISON. When the CPU consumes the POISON, it raises some kind
   of error notification.
   To be precise, how the CPU behaves when it consumes POISON is
   architecture dependent. As far as I understand, x86-64 raises a
   Machine Check Exception (MCE) via interrupt #18 in this case.

2. CXL device reporting error
   The CXL device detects the broken page by itself and sends a memory
   error signal to the kernel via one of two optional paths.
   2.a. FW-First
        The CXL device sends the error via VDM to the CXL Host, the CXL
        Host forwards it to System Firmware via interrupt, and finally
        the kernel handles the error.
   2.b. OS-First
        The CXL device sends the error directly to the kernel via
        MSI/MSI-X.

Note: Since I'm currently focusing on x86-64, I describe x86-64 only.

The following diagram shows the 2 major paths and the 2 optional
sub-paths above.
```
1.  MCE (interrupt #18, while CPU consuming POISON)
     -> do_machine_check()
       -> mce_log()
         -> notify chain (x86_mce_decoder_chain)
           -> memory_failure()

2.a FW-First (optional, CXL device proactively find&report)
     -> CXL device -> Firmware
       -> OS: ACPI->APEI->GHES->CPER -> CXL driver -> trace

2.b OS-First (optional, CXL device proactively find&report)
     -> CXL device -> MSI
       -> OS: CXL driver -> trace
```
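
To make the end points of the diagram concrete, here is a minimal
userspace C sketch. It is not kernel code and is not taken from the
patches; the enum, struct and function names are all hypothetical. The
only facts it encodes are the ones in the diagram: path 1 ends in
memory_failure(), while 2.a and 2.b end in a trace event.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three reporting paths in the diagram. */
enum poison_source {
	SRC_MCE,      /* 1.  CPU consumed POISON */
	SRC_FW_FIRST, /* 2.a device -> firmware -> GHES/CPER -> CXL driver */
	SRC_OS_FIRST, /* 2.b device -> MSI/MSI-X -> CXL driver */
};

/* Where a report ends up, according to the diagram. */
struct poison_outcome {
	bool reaches_memory_failure; /* page is actually handled */
	bool reaches_trace;          /* only a trace event is emitted */
};

/* Model of the diagram's end points; an assumption-level summary only. */
static struct poison_outcome diagram_outcome(enum poison_source src)
{
	struct poison_outcome out = { false, false };

	assert(src <= SRC_OS_FIRST);

	switch (src) {
	case SRC_MCE:
		out.reaches_memory_failure = true;
		break;
	case SRC_FW_FIRST:
	case SRC_OS_FIRST:
		out.reaches_trace = true;
		break;
	}
	return out;
}
```

In this model only SRC_MCE reaches memory_failure(); both device
reported paths stop at the trace event.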

For the "1. CPU handling error" path, the current code seems to work
fine. When I used the error injection feature of QEMU emulation, the
code path was reliably executed. So, as long as the CPU raises an MCE
when it consumes the POISON, this path has no problem.

So I'm working on the 2.a and 2.b paths, so that a POISON error
reported by the CXL device can be handled by the kernel. This path has
two advantages.

- Proactively find & report memory problems
  Even if a process has not read the data yet, the kernel/drivers can
  proactively prevent the process from using corrupted data. AFAIK, the
  current kernel only traces POISON error events from the
  FW-First/OS-First paths; it neither handles them nor notifies the
  processes that are using the POISON page, as the MCE path does. User
  space tools like rasdaemon read the trace and log it, but they do not
  handle the POISON page either. As a result, the user has to read the
  error log from rasdaemon, work out whether the POISON error comes
  from CXL memory or DDR memory, and find out which applications are
  affected. That is not easy work and cannot be done in time. Thus,
  I'd like to add a feature that does this work automatically and
  quickly: once the CXL device reports a POISON error (via
  FW-First/OS-First), the kernel handles it immediately, similar to the
  flow when an MCE is triggered. This is my first motivation.

- Architecture independent
  As mentioned above, the "1. CPU handling error" path is architecture
  dependent. This route, on the other hand, can be architecture
  independent code. If there is a CPU that has no feature similar to
  the MCE of x86-64, my work will be essential. (To be honest, I did
  not notice this advantage at first, but I think it is also
  important.)
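
As a rough userspace model of the behaviour this series is aiming for,
the sketch below assumes that every report is keyed by page frame
number and that a page already reported by one path is skipped when
another path reports it again. All names are hypothetical,
handle_poisoned_page() merely stands in for memory_failure(), and the
fixed-size array stands in for whatever structure the patches use (the
P2 changelog mentions XArray). It is not the implementation in the
patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 64 /* arbitrary bound for this sketch */

/* Hypothetical state: PFNs already reported, plus a handling counter. */
struct poison_state {
	uint64_t reported[MAX_TRACKED];
	size_t nr_reported;
	unsigned int handled_pages;
};

static bool already_reported(const struct poison_state *st, uint64_t pfn)
{
	for (size_t i = 0; i < st->nr_reported; i++)
		if (st->reported[i] == pfn)
			return true;
	return false;
}

/* Stand-in for memory_failure(); here it only counts invocations. */
static void handle_poisoned_page(struct poison_state *st, uint64_t pfn)
{
	(void)pfn;
	st->handled_pages++;
}

/*
 * Common entry point for MCE, FW-First and OS-First reports in this
 * model. Returns true if the page was handled by this call, false if
 * it was skipped as a duplicate.
 */
static bool report_poison(struct poison_state *st, uint64_t pfn)
{
	assert(st != NULL);

	if (already_reported(st, pfn))
		return false;

	/* Best effort: when the table is full, handle without tracking. */
	if (st->nr_reported < MAX_TRACKED)
		st->reported[st->nr_reported++] = pfn;

	handle_poisoned_page(st, pfn);
	return true;
}
```

With a zero-initialised struct poison_state, reporting the same PFN
twice (for example once from the device and once from an MCE) runs the
handler only once, while a different PFN is handled separately.
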
Shiyang Ruan (2):
cxl/core: introduce device reporting poison hanlding
cxl: avoid duplicated report from MCE & device
arch/x86/include/asm/mce.h | 1 +
drivers/cxl/core/mbox.c | 190 ++++++++++++++++++++++++++++++++++---
drivers/cxl/core/memdev.c | 6 +-
drivers/cxl/cxlmem.h | 11 ++-
drivers/cxl/pci.c | 4 +-
include/linux/cxl-event.h | 16 +++-
6 files changed, 207 insertions(+), 21 deletions(-)
--
2.34.1