From: Dan Williams
Date: Mon, 2 Nov 2020 09:47:11 -0800
Subject: Re: Onlining CXL Type2 device coherent memory
To: Vikram Sethi
Cc: "linux-cxl@vger.kernel.org", "Natu, Mahesh", "Rudoff, Andy",
    Jeff Smith, Mark Hairgrove, "jglisse@redhat.com", David Hildenbrand,
    Linux MM, Linux ACPI, "will@kernel.org", "anshuman.khandual@arm.com",
    "catalin.marinas@arm.com", Ard Biesheuvel, Dave Hansen

On Fri, Oct 30, 2020 at 3:40 PM Vikram Sethi wrote:
>
> Hi Dan,
> > From: Dan Williams
> > On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi wrote:
> > >
> > > Hello,
> > >
> > > I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2 device
> > > Coherent memory aka Host managed device memory (HDM) will work for type 2 CXL
> > > devices which are available/plugged in at boot. A type 2 CXL device can be simply
> > > thought of as an accelerator with coherent device memory, that also has a
> > > CXL.cache to cache system memory.
> > >
> > > One could envision that BIOS/UEFI could expose the HDM in EFI memory map
> > > as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at least
> > > on some architectures (arm64) EFI conventional memory available at kernel boot
> > > memory cannot be offlined, so this may not be suitable on all architectures.
> >
> > That seems an odd restriction. Add David, linux-mm, and linux-acpi as
> > they might be interested / have comments on this restriction as well.
> >
> > > Further, the device driver associated with the type 2 device/accelerator may
> > > want to save off a chunk of HDM for driver private use.
> > > So it seems the more appropriate model may be something like dev dax model
> > > where the device driver probe/open calls add_memory_driver_managed, and
> > > the driver could choose how much of the HDM it wants to reserve and how
> > > much to make generally available for application mmap/malloc.
> >
> > Sure, it can always be driver managed. The trick will be getting the
> > platform firmware to agree to not map it by default, but I suspect
> > you'll have a hard time convincing platform-firmware to take that
> > stance. The BIOS does not know, and should not care what OS is booting
> > when it produces the memory map. So I think CXL memory unplug after
> > the fact is more realistic than trying to get the BIOS not to map it.
> > So, to me it looks like arm64 needs to reconsider its unplug stance.
>
> Agree. Cc Anshuman, Will, Catalin, Ard, in case I missed something in
> Anshuman's patches adding arm64 memory remove, or if any plans to remove
> the limitation.
>
> > > Another thing to think about is whether the kernel relies on UEFI having fully
> > > described NUMA proximity domains and end-end NUMA distances for HDM,
> > > or whether the kernel will provide some infrastructure to make use of the
> > > device-local affinity information provided by the device in the Coherent Device
> > > Attribute Table (CDAT) via a mailbox, and use that to add a new NUMA node ID
> > > for the HDM, and with the NUMA distances calculated by adding to the NUMA
> > > distance of the host bridge/Root port with the device local distance. At least
> > > that's how I think CDAT is supposed to work when kernel doesn't want to rely
> > > on BIOS tables.
> >
> > The kernel can supplement the NUMA configuration from CDAT, but not if
> > the memory is already enumerated in the EFI Memory Map and ACPI
> > SRAT/HMAT. At that point CDAT is a nop because the BIOS has precluded
> > the OS from consuming it.
>
> That makes sense.
>
> > > A similar question on NUMA node ID and distances for HDM arises for CXL hotplug.
> > > Will the kernel rely on CDAT, and create its own NUMA node ID and patch up
> > > distances, or will it rely on BIOS providing PXM domain reserved at boot in
> > > SRAT to be used later on hotplug?
> >
> > I don't expect the kernel to merge any CDAT data into the ACPI tables.
> > Instead the kernel will optionally use CDAT as an alternative method
> > to generate Linux NUMA topology independent of ACPI SRAT. Think of it
> > like Linux supporting both ACPI and Open Firmware NUMA descriptions at
> > the same time. CDAT is its own NUMA description domain unless BIOS has
> > blurred the lines and pre-incorporated it into SRAT/HMAT. That said I
> > think the CXL attached memory not described by EFI / ACPI is currently
> > the NULL set.
>
> What I meant by patch/merge was if on a dual socket system with distance 40
> between the sockets (not getting into HMAT vs SLIT description of latency),
> if you hotplugged in a CXL type2/3 device whose CDAT says device local 'distance'
> is 80, then the kernel is still merging that 80 in with the 40 to the remote socket
> to say 120 from remote socket CPU to this socket's CXL device i.e whether the
> 40 came from SLIT or HMAT, it is still merged into the data kernel had obtained
> from ACPI. I think you're saying the same thing in a different way:
> that the device local part is not being merged with anything ACPI provided for
> the device, example _SLI at time of hotplug (which I agree with).

Thankfully CDAT abandons the broken and gamed system of distance values
(i.e. firmware sometimes reverse engineering OS behavior) in favor of
nominal performance values like HMAT. With that in hand I think it
simplifies the kernel's responsibility to worry less about "distance"
values and instead identify whether the memory range is "Linux-local"
or "Linux-remote" and where to order it in the allocation fallback
lists.
As Dave implemented in his "migrate in lieu of discard" series [1],
find_next_best_node() establishes this ordering for memory tiering, so
the rough plan is to teach each CXL supporting arch how to incorporate
CDAT data into its find_next_best_node() implementation.

[1]: https://lore.kernel.org/linux-mm/20201007161736.ACC6E387@viggo.jf.intel.com/
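
To make the ordering idea above a bit more concrete, here is a rough
userspace sketch in Python. It is not kernel code and does not reflect
how find_next_best_node() is actually written. It assumes that each
memory target has a nominal latency made of a host-side part (as ACPI
HMAT might describe it) plus a device-local part (as CDAT might
describe it), that the two parts simply add, and that the fallback
order is ascending end-to-end latency. The names Target, end_to_end_ns
and fallback_order, and all of the numbers, are made up for
illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """One memory target as seen from a single initiator (hypothetical model)."""
    node_id: int
    host_path_ns: int     # initiator to DRAM or to the host bridge/root port
    device_local_ns: int  # device-internal part; 0 for plain DRAM


def end_to_end_ns(t: Target) -> int:
    # Assumption: the host-side and device-local latencies are additive.
    return t.host_path_ns + t.device_local_ns


def fallback_order(targets):
    # Lowest end-to-end latency first; node_id breaks ties deterministically.
    ranked = sorted(targets, key=lambda t: (end_to_end_ns(t), t.node_id))
    return [t.node_id for t in ranked]


# Made-up numbers, viewed from a CPU on socket 0.
targets = [
    Target(node_id=2, host_path_ns=100, device_local_ns=150),  # CXL HDM
    Target(node_id=0, host_path_ns=80, device_local_ns=0),     # socket 0 DRAM
    Target(node_id=1, host_path_ns=140, device_local_ns=0),    # socket 1 DRAM
]

print(fallback_order(targets))  # prints [0, 1, 2]
```

With these numbers the CXL memory lands last in the fallback list, which
is the kind of "where to order it" decision described above. A real
implementation would also have to handle bandwidth and per-initiator
views, which this sketch ignores.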