From: Dan Williams <dan.j.williams@intel.com>
To: Ben Cheatham <benjamin.cheatham@amd.com>,
Dan Williams <dan.j.williams@intel.com>,
<linux-cxl@vger.kernel.org>, <linux-pci@vger.kernel.org>
Cc: <dave@stgolabs.net>, <jonathan.cameron@huawei.com>,
<dave.jiang@intel.com>, <alison.schofield@intel.com>,
<vishal.l.verma@intel.com>, <ira.weiny@intel.com>,
<bhelgaas@google.com>, <linux-mm@kvack.org>
Subject: Re: [RFC PATCH 0/6] Implement initial CXL Timeout & Isolation support
Date: Mon, 25 Mar 2024 08:54:06 -0700 [thread overview]
Message-ID: <66019e1ec2153_2690d29440@dwillia2-mobl3.amr.corp.intel.com.notmuch> (raw)
In-Reply-To: <1287adbe-364f-431c-b33f-9d73e7829b6c@amd.com>
Ben Cheatham wrote:
> Hey Dan,
>
> Sorry that I went radio silent on this, I've been thinking about the approach to take
> from here on out. More below!
>
> On 2/15/24 5:43 PM, Dan Williams wrote:
>
> [snip]
>
> >> The series also introduces a PCIe port bus driver dependency on the CXL
> >> core. I expect to be able to remove that when when my team submits
> >> patches for a future rework of the PCIe port bus driver.
> >
> > We have time to wait for that work to settle. Do not make the job harder
> > in the near term by adding one more dependency to unwind.
> >
>
> For sure, I'll withhold any work I do that touches the port bus driver
> until that's sent upstream. I'll send anything that doesn't touch
> the port driver, i.e. CXL region changes, as RFC as well.
>
> [snip]
>
> > So if someone says, "yes I can tolerate losing a root port at a time and
> > I can tolerate deploying my workloads with userspace memory management,
> > and this is preferable to a reboot", then maybe Linux should entertain
> > CXL Error Isolation. Until such an end use case gains clear uptake it
> > seems too early to worry about plumbing the notification mechanism.
>
> So in my mind the primary use case for this feature is in a
> server/data center environment. Losing a root port and the attached
> memory is definitely preferable to a reboot in this environment, so I
> think that isolation would still be useful even if it isn't
> plug-and-play.
That is not sufficient. This feature needs an implementation partner to
work through the caveats.
> I agree with your assessment of what has to happen before we can make
> this feature work. From what I understand these are the main
> requirements for making isolation work:
Requirement 1 is "Find customer".
> 1. The memory region can't be onlined as System RAM
> 2. The region needs to be mapped as device-dax
> 3. There need to be safeguards such that someone can't remap the
> region to System RAM with error isolation enabled (added by me)
No, this policy does not belong in the kernel.
> Considering these all have to do with mm, I think that's a good place
> to start. What I'm thinking of starting with is adding a module
> parameter (or config option) to enable isolation for CXL dax regions
> by default, as well as a sysfs option for toggling error isolation for
> the CXL region. When the module parameter is specified, the CXL region
> driver would create the region as a device-dax region by default
> instead of onlining the region as system RAM. The sysfs would allow
> toggling error isolation for CXL regions that are already device-dax.
No, definitely not going to invent module paramter policy for this. The
policy to not online dax regions is already available.
> That would cover requirements 1 & 2, but would still allow someone to
> reconfigure the region as a system RAM region with error isolation
> still enabled using sysfs/daxctl. To make sure the this doesn't
> happen, my plan is to have the CXL region driver automatically disable
> error isolation when the underlying memory region is going offline,
> then check to make sure the memory isn't onlined as System RAM before
> re-enabling isolation.
Why would you ever need to disable isolation? If it is present it at
least nominally allows software to keep running vs hang awaiting a
watchdog-timer reboot.
> So, with that in mind, isolation would most likely go from something
> that is enabled by default when compiled in to a feature for
> correctly-configured CXL regions that has to be enabled by the end
> user. If that is sounds good, here's what my roadmap would be going
> forward:
Again, this needs a customer to weigh the mitigations that the kernel
needs to carry.
> 1. Enable creating device-dax mapped CXL regions by default
Already exists.
> 2. Add support for checking a CXL region meets the requirements
> for enabling isolation (i.e. all devices in
> dax region(s) are device-dax)
Regions should not know or care if isolation is enabled, the
implications are identical to force unbinding the region driver.
> 3. Add back in the error isolation enablement and notification
> mechanisms in this patch series
Not until it is clear that this feature has deployment value.
next prev parent reply other threads:[~2024-03-25 15:54 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20240215194048.141411-1-Benjamin.Cheatham@amd.com>
2024-02-15 23:43 ` Dan Williams
2024-03-25 15:15 ` Ben Cheatham
2024-03-25 15:54 ` Dan Williams [this message]
2024-04-01 19:41 ` Ben Cheatham
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=66019e1ec2153_2690d29440@dwillia2-mobl3.amr.corp.intel.com.notmuch \
--to=dan.j.williams@intel.com \
--cc=alison.schofield@intel.com \
--cc=benjamin.cheatham@amd.com \
--cc=bhelgaas@google.com \
--cc=dave.jiang@intel.com \
--cc=dave@stgolabs.net \
--cc=ira.weiny@intel.com \
--cc=jonathan.cameron@huawei.com \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-pci@vger.kernel.org \
--cc=vishal.l.verma@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox