Subject: Re: Onlining CXL Type2 device coherent memory
From: David Hildenbrand
Organization: Red Hat GmbH
To: Dan Williams, Vikram Sethi
Cc: "linux-cxl@vger.kernel.org", "Natu, Mahesh", "Rudoff, Andy", Jeff Smith, Mark Hairgrove, "jglisse@redhat.com", Linux MM, Linux ACPI, Anshuman Khandual
Message-ID: <451b2571-c3e8-97d8-bfd0-f8054a1b75c5@redhat.com>
Date: Sat, 31 Oct 2020 11:21:26 +0100

On 30.10.20 21:37, Dan Williams wrote:
> On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi wrote:
>>
>> Hello,
>>
>> I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2 device
>> Coherent memory aka Host managed device memory (HDM) will work for type 2 CXL
>> devices which are available/plugged in at boot. A type 2 CXL device can be simply
>> thought of as an accelerator with coherent device memory, that also has a
>> CXL.cache to cache system memory.
>>
>> One could envision that BIOS/UEFI could expose the HDM in EFI memory map
>> as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at least
>> on some architectures (arm64) EFI conventional memory available at kernel boot
>> memory cannot be offlined, so this may not be suitable on all architectures.
>
> That seems an odd restriction. Add David, linux-mm, and linux-acpi as
> they might be interested / have comments on this restriction as well.
>

I am missing some important details.

a) What happens after offlining? Will the memory be remove_memory()'ed?
   Will the device get physically unplugged?

b) What's the general purpose of the memory and its intended usage when
   *not* exposed as system RAM? What's the main point of treating it as
   ordinary system RAM by default?

Also, can you be sure that you can offline that memory?
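To make the offlining question concrete: from user space, memory is offlined one memory block at a time, by writing "offline" to /sys/devices/system/memory/memoryN/state. Block N covers the physical range [N * block_size, (N + 1) * block_size), and the block size is reported in hex (without a "0x" prefix) in /sys/devices/system/memory/block_size_bytes. The C sketch below only computes which blocks and sysfs paths cover a given physical range. The helper names are hypothetical, it assumes a block-aligned range, and it never writes to sysfs. Whether each write then succeeds depends on the zone the block sits in and on what has been allocated from it.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse the contents of /sys/devices/system/memory/block_size_bytes,
 * which the kernel prints in hex without a "0x" prefix (e.g. "8000000\n"). */
static uint64_t parse_block_size(const char *text)
{
    return strtoull(text, NULL, 16);
}

/* Hypothetical helper: compute the first and last memory block ids that
 * cover the physical range [start, start + size). Block N spans
 * [N * block_size, (N + 1) * block_size). Returns 0 on success, -1 if the
 * range is empty, wraps around, or is not aligned to the block size. */
static int block_range(uint64_t start, uint64_t size, uint64_t block_size,
                       uint64_t *first, uint64_t *last)
{
    assert(block_size != 0);
    if (size == 0 || start + size < start)
        return -1;
    if (start % block_size != 0 || size % block_size != 0)
        return -1;
    *first = start / block_size;
    *last = (start + size) / block_size - 1;
    return 0;
}

/* Hypothetical helper: build the sysfs "state" path for memory block 'id'.
 * Writing "offline" to that file only *requests* offlining; the kernel can
 * refuse if it cannot migrate everything out of the block. */
static int block_state_path(uint64_t id, char *buf, size_t len)
{
    int n = snprintf(buf, len,
                     "/sys/devices/system/memory/memory%" PRIu64 "/state", id);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For example, with 128 MiB blocks (block_size_bytes reading 8000000), a 16 GiB range starting at 16 GiB maps to memory128 through memory255, and one "offline" write per block would be needed to take it out of use.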
If it's ZONE_NORMAL (as all system RAM in the initial map usually is),
there are no such guarantees, especially once the system has run for long
enough, but also in other cases (e.g., shuffling), or if allocation
policies change in the future.

So I *guess* you would already have to use kernel cmdline hacks like
"movablecore" to make it work. In that case, you can directly specify
what you *actually* want (which I am not yet sure I completely
understood), e.g., something like "memmap=16G!16G" ... or something
similar.

I consider offlining+removing *boot* memory without physically unplugging
it (e.g., a DIMM getting unplugged) an abuse of the memory hotunplug
infrastructure. It's a different thing when manually adding memory, as
dax_kmem does via add_memory_driver_managed().

Now, back to your original question: arm64 does not support physically
unplugging DIMMs that were part of the initial map. If you rebooted after
unplugging such a DIMM, your system would crash. We enforce that by
disallowing the offlining of boot memory; we could also try to handle it
in ACPI code. But again, most uses of offlining+removing boot memory are
abusing the memory hotunplug infrastructure and should rather be solved
cleanly via a different mechanism (firmware, kernel cmdline, ...).

This was just recently discussed in
https://lkml.kernel.org/r/de8388df2fbc5a6a33aab95831ba7db4@codeaurora.org

>> Further, the device driver associated with the type 2 device/accelerator may
>> want to save off a chunk of HDM for driver private use.
>> So it seems the more appropriate model may be something like dev dax model
>> where the device driver probe/open calls add_memory_driver_managed, and
>> the driver could choose how much of the HDM it wants to reserve and how
>> much to make generally available for application mmap/malloc.
>
> Sure, it can always be driver managed. The trick will be getting the
> platform firmware to agree to not map it by default, but I suspect
> you'll have a hard time convincing platform-firmware to take that
> stance. The BIOS does not know, and should not care what OS is booting
> when it produces the memory map. So I think CXL memory unplug after
> the fact is more realistic than trying to get the BIOS not to map it.
> So, to me it looks like arm64 needs to reconsider its unplug stance.

My personal opinion is: if memory isn't just "ordinary system RAM", then
let the system know early that the memory is special (as we do with
soft-reserved). Ideally, you could configure in the firmware (e.g., via
BIOS setup) what to do; that's the cleanest solution, but I understand
that's rather hard to achieve.

-- 
Thanks,

David / dhildenb
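As a rough illustration of the soft-reserved approach: UEFI 2.8 added the EFI_MEMORY_SP ("specific purpose") memory attribute. When Linux is built with CONFIG_EFI_SOFT_RESERVE and sees EfiConventionalMemory carrying that attribute, it does not treat the range as ordinary "System RAM" at boot but reports it as "Soft Reserved", leaving it for a driver (e.g., device-dax, possibly followed by dax_kmem) to claim; booting with efi=nosoftreserve restores the old behavior. The C sketch below models only that classification decision, loosely following the x86 EFI memory-map handling. The struct is a simplified stand-in for EFI_MEMORY_DESCRIPTOR, the enum and function names are hypothetical, and real memory maps contain many more types and attributes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values from the UEFI specification (also used by include/linux/efi.h). */
#define EFI_CONVENTIONAL_MEMORY 7
#define EFI_MEMORY_WB 0x0000000000000008ULL /* write-back cacheable */
#define EFI_MEMORY_SP 0x0000000000040000ULL /* "specific purpose", UEFI 2.8 */

/* Simplified, hypothetical stand-in for EFI_MEMORY_DESCRIPTOR. */
struct efi_md {
    uint32_t type;
    uint64_t phys_addr;
    uint64_t num_pages; /* in 4 KiB EFI pages */
    uint64_t attribute;
};

/* Hypothetical classification result. */
enum mem_class {
    MEM_SYSTEM_RAM,    /* onlined at boot like any other RAM */
    MEM_SOFT_RESERVED, /* left for a driver to claim later */
    MEM_RESERVED,      /* not usable as RAM */
    MEM_NOT_MODELED    /* other EFI memory types, ignored in this sketch */
};

/* Decide how a conventional-memory descriptor would be exposed.
 * 'soft_reserve' models whether soft reservation is enabled, i.e. the
 * kernel was not booted with efi=nosoftreserve. */
static enum mem_class classify(const struct efi_md *md, bool soft_reserve)
{
    if (md->type != EFI_CONVENTIONAL_MEMORY)
        return MEM_NOT_MODELED;
    if (soft_reserve && (md->attribute & EFI_MEMORY_SP))
        return MEM_SOFT_RESERVED;
    if (md->attribute & EFI_MEMORY_WB)
        return MEM_SYSTEM_RAM;
    return MEM_RESERVED;
}
```

With soft reservation enabled, a descriptor for an HDM range tagged EFI_MEMORY_WB | EFI_MEMORY_SP classifies as MEM_SOFT_RESERVED, while the same descriptor classifies as MEM_SYSTEM_RAM when soft reservation is disabled. The point of the model is that the "this memory is special" decision travels in the firmware table at boot, instead of being undone later through offlining.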