Re: Onlining CXL Type2 device coherent memory

From: David Hildenbrand <david@redhat.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Vikram Sethi <vsethi@nvidia.com>,
	"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
	"Natu, Mahesh" <mahesh.natu@intel.com>,
	"Rudoff, Andy" <andy.rudoff@intel.com>,
	Jeff Smith <JSMITH@nvidia.com>,
	Mark Hairgrove <mhairgrove@nvidia.com>,
	"jglisse@redhat.com" <jglisse@redhat.com>,
	Linux MM <linux-mm@kvack.org>,
	Linux ACPI <linux-acpi@vger.kernel.org>,
	Anshuman Khandual <anshuman.khandual@arm.com>
Subject: Re: Onlining CXL Type2 device coherent memory
Date: Mon, 2 Nov 2020 10:51:07 +0100
Message-ID: <958912b2-1436-378f-43d7-cbc5c8955ffd@redhat.com>
In-Reply-To: <CAPcyv4jX1tedjuU-vCSKgvhQeNFukyq9d0ddmsk7jAjWMX+iBQ@mail.gmail.com>

On 31.10.20 17:51, Dan Williams wrote:
> On Sat, Oct 31, 2020 at 3:21 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 30.10.20 21:37, Dan Williams wrote:
>>> On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@nvidia.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2 device
>>>> Coherent memory aka Host managed device memory (HDM) will work for type 2 CXL
>>>> devices which are available/plugged in at boot. A type 2 CXL device can be simply
>>>> thought of as an accelerator with coherent device memory, that also has a
>>>> CXL.cache to cache system memory.
>>>>
>>>> One could envision that BIOS/UEFI could expose the HDM in EFI memory map
>>>> as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at least
>>>> on some architectures (arm64) EFI conventional memory available at kernel boot
>>>> cannot be offlined, so this may not be suitable on all architectures.
>>>
>>> That seems an odd restriction. Add David, linux-mm, and linux-acpi as
>>> they might be interested / have comments on this restriction as well.
>>>
>>
>> I am missing some important details.
>>
>> a) What happens after offlining? Will the memory be remove_memory()'ed?
>> Will the device get physically unplugged?
>>
>> b) What's the general purpose of the memory and its intended usage when
>> *not* exposed as system RAM? What's the main point of treating it like
>> ordinary system RAM as default?
>>
>> Also, can you be sure that you can offline that memory? If it's
>> ZONE_NORMAL (as usually all system RAM in the initial map), there are no
>> such guarantees, especially once the system ran for long enough, but
>> also in other cases (e.g., shuffling), or if allocation policies change
>> in the future.
>>
>> So I *guess* you would already have to use kernel cmdline hacks like
>> "movablecore" to make it work. In that case, you can directly specify
>> what you *actually* want (which I am not sure yet I completely
>> understood) - e.g., something like "memmap=16G!16G" ... or something
>> similar.
>>
>> I consider offlining+removing *boot* memory without physically unplugging
>> it (e.g., a DIMM getting unplugged) to be an abuse of the memory hotunplug
>> infrastructure. It's a different thing when manually adding memory like
>> dax_kmem does via add_memory_driver_managed().
>>
>>
>> Now, back to your original question: arm64 does not support physically
>> unplugging DIMMs that were part of the initial map. If you'd reboot
>> after unplugging a DIMM, your system would crash. We prevent that by
>> disallowing offlining of boot memory - we could also try to handle it in
>> ACPI code. But again, most uses of offlining+removing boot memory are
>> abusing the memory hotunplug infrastructure and should rather be solved
>> cleanly via a different mechanism (firmware, kernel cmdline, ...).
>>
>> Just recently discussed in
>>
>> https://lkml.kernel.org/r/de8388df2fbc5a6a33aab95831ba7db4@codeaurora.org
>>
>>>> Further, the device driver associated with the type 2 device/accelerator may
>>>> want to save off a chunk of HDM for driver private use.
>>>> So it seems the more appropriate model may be something like dev dax model
>>>> where the device driver probe/open calls add_memory_driver_managed, and
>>>> the driver could choose how much of the HDM it wants to reserve and how
>>>> much to make generally available for application mmap/malloc.
>>>
>>> Sure, it can always be driver managed. The trick will be getting the
>>> platform firmware to agree to not map it by default, but I suspect
>>> you'll have a hard time convincing platform-firmware to take that
>>> stance. The BIOS does not know, and should not care what OS is booting
>>> when it produces the memory map. So I think CXL memory unplug after
>>> the fact is more realistic than trying to get the BIOS not to map it.
>>> So, to me it looks like arm64 needs to reconsider its unplug stance.
>>
>> My personal opinion is, if memory isn't just "ordinary system RAM", then
>> let the system know early that memory is special (as we do with
>> soft-reserved).
>>
>> Ideally, you could configure the firmware (e.g., via BIOS setup) on what
>> to do, that's the cleanest solution, but I can understand that's rather
>> hard to achieve.
> 
> Yes, my hope, which is about the most influence I can have on
> platform-firmware implementations, is that it marks CXL attached
> memory as soft-reserved by default and allows OS policy to decide where it
> goes. Barring that, for the configuration that Vikram mentioned, the
> only other way to get this differentiated / not-ordinary system-ram
> back to being driver managed would be to unplug it. The soft-reserved
> path is cleaner.

If we already need kernel cmdline parameters (movablecore), we can 
handle this differently via the cmdline. That sets expectations for 
people implementing the firmware - we shouldn't make their life too easy 
with such decisions.
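
To make the "memmap=16G!16G" example from above concrete, here is a
minimal Python sketch of how such an argument decomposes into a region.
It is only an illustration: parse_memmap() and the tables are
hypothetical names, not kernel code. The separator meanings are as I
understand them from the kernel-parameters documentation, and the
parser is simplified to decimal numbers with K/M/G suffixes (the kernel
accepts more forms).

```python
import re

# Size suffixes handled by this simplified sketch (assumption: the kernel
# accepts more forms, e.g. hex values, which are not handled here).
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

# Separator between size and start address in memmap=<size><sep><start>.
_KINDS = {
    "@": "usable RAM",
    "#": "ACPI data",
    "$": "reserved",
    "!": "protected (persistent-memory style)",
}

_MEMMAP_RE = re.compile(r"^memmap=(\d+)([KMG]?)([@#$!])(\d+)([KMG]?)$")


def parse_memmap(arg):
    """Split a memmap= argument into start, end (exclusive) and kind."""
    m = _MEMMAP_RE.match(arg)
    if m is None:
        raise ValueError("unsupported memmap argument: %r" % arg)
    size = int(m.group(1)) * _UNITS[m.group(2)]
    start = int(m.group(4)) * _UNITS[m.group(5)]
    return {"start": start, "end": start + size, "kind": _KINDS[m.group(3)]}


if __name__ == "__main__":
    region = parse_memmap("memmap=16G!16G")
    print(hex(region["start"]), hex(region["end"]), region["kind"])
```

For "memmap=16G!16G" this describes the 16 GiB range starting at the
16 GiB mark, i.e. memory the kernel would not treat as ordinary system
RAM from the start, which is the point of specifying it at boot instead
of offlining it later.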

The paragraph started with

"One could envision that BIOS/UEFI could expose the HDM in EFI memory
map ..." Let's not envision it, but instead suggest that people not do it ;)

-- 
Thanks,

David / dhildenb
