linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Ben Cheatham <benjamin.cheatham@amd.com>,
	Dan Williams <dan.j.williams@intel.com>,
	<linux-cxl@vger.kernel.org>, <linux-pci@vger.kernel.org>
Cc: <dave@stgolabs.net>, <jonathan.cameron@huawei.com>,
	<dave.jiang@intel.com>, <alison.schofield@intel.com>,
	<vishal.l.verma@intel.com>, <ira.weiny@intel.com>,
	<bhelgaas@google.com>, <linux-mm@kvack.org>
Subject: Re: [RFC PATCH 0/6] Implement initial CXL Timeout & Isolation support
Date: Mon, 25 Mar 2024 08:54:06 -0700	[thread overview]
Message-ID: <66019e1ec2153_2690d29440@dwillia2-mobl3.amr.corp.intel.com.notmuch> (raw)
In-Reply-To: <1287adbe-364f-431c-b33f-9d73e7829b6c@amd.com>

Ben Cheatham wrote:
> Hey Dan,
> 
> Sorry that I went radio silent on this, I've been thinking about the approach to take
> from here on out. More below!
> 
> On 2/15/24 5:43 PM, Dan Williams wrote:
> 
> [snip]
> 
> >> The series also introduces a PCIe port bus driver dependency on the CXL
> >> core. I expect to be able to remove that when when my team submits
> >> patches for a future rework of the PCIe port bus driver.
> > 
> > We have time to wait for that work to settle. Do not make the job harder
> > in the near term by adding one more dependency to unwind.
> > 
> 
> For sure, I'll withhold any work I do that touches the port bus driver
> until that's sent upstream. I'll send anything that doesn't touch
> the port driver, i.e. CXL region changes, as RFC as well.
> 
> [snip]
> 
> > So if someone says, "yes I can tolerate losing a root port at a time and
> > I can tolerate deploying my workloads with userspace memory management,
> > and this is preferable to a reboot", then maybe Linux should entertain
> > CXL Error Isolation. Until such an end use case gains clear uptake it
> > seems too early to worry about plumbing the notification mechanism.
> 
> So in my mind the primary use case for this feature is in a
> server/data center environment.  Losing a root port and the attached
> memory is definitely preferable to a reboot in this environment, so I
> think that isolation would still be useful even if it isn't
> plug-and-play.

That is not sufficient. This feature needs an implementation partner to
work through the caveats.

> I agree with your assessment of what has to happen before we can make
> this feature work. From what I understand these are the main
> requirements for making isolation work:

Requirement 1 is "Find customer".

> 	1. The memory region can't be onlined as System RAM
> 	2. The region needs to be mapped as device-dax
> 	3. There need to be safeguards such that someone can't remap the
> 	region to System RAM with error isolation enabled (added by me)

No, this policy does not belong in the kernel.

> Considering these all have to do with mm, I think that's a good place
> to start. What I'm thinking of starting with is adding a module
> parameter (or config option) to enable isolation for CXL dax regions
> by default, as well as a sysfs option for toggling error isolation for
> the CXL region. When the module parameter is specified, the CXL region
> driver would create the region as a device-dax region by default
> instead of onlining the region as system RAM.  The sysfs would allow
> toggling error isolation for CXL regions that are already device-dax.

No, definitely not going to invent module paramter policy for this. The
policy to not online dax regions is already available.

> That would cover requirements 1 & 2, but would still allow someone to
> reconfigure the region as a system RAM region with error isolation
> still enabled using sysfs/daxctl. To make sure the this doesn't
> happen, my plan is to have the CXL region driver automatically disable
> error isolation when the underlying memory region is going offline,
> then check to make sure the memory isn't onlined as System RAM before
> re-enabling isolation.

Why would you ever need to disable isolation? If it is present it at
least nominally allows software to keep running vs hang awaiting a
watchdog-timer reboot.

> So, with that in mind, isolation would most likely go from something
> that is enabled by default when compiled in to a feature for
> correctly-configured CXL regions that has to be enabled by the end
> user. If that is sounds good, here's what my roadmap would be going
> forward:

Again, this needs a customer to weigh the mitigations that the kernel
needs to carry.

> 	1. Enable creating device-dax mapped CXL regions by default

Already exists.

> 	2. Add support for checking a CXL region meets the requirements
> 	for enabling isolation (i.e. all devices in
> 	dax region(s) are device-dax)

Regions should not know or care if isolation is enabled, the
implications are identical to force unbinding the region driver.

> 	3. Add back in the error isolation enablement and notification
> 	mechanisms in this patch series

Not until it is clear that this feature has deployment value.


  reply	other threads:[~2024-03-25 15:54 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20240215194048.141411-1-Benjamin.Cheatham@amd.com>
2024-02-15 23:43 ` Dan Williams
2024-03-25 15:15   ` Ben Cheatham
2024-03-25 15:54     ` Dan Williams [this message]
2024-04-01 19:41       ` Ben Cheatham

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=66019e1ec2153_2690d29440@dwillia2-mobl3.amr.corp.intel.com.notmuch \
    --to=dan.j.williams@intel.com \
    --cc=alison.schofield@intel.com \
    --cc=benjamin.cheatham@amd.com \
    --cc=bhelgaas@google.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=ira.weiny@intel.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox