From: "Huang, Ying"
To: Gregory Price
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, dave.jiang@intel.com, Jonathan.Cameron@huawei.com, horenchuang@bytedance.com, linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org, dan.j.williams@intel.com, lenb@kernel.org, "Aneesh Kumar K.V"
Subject: Re: [PATCH] acpi/hmat,mm/memtier: always register hmat adist calculation callback
In-Reply-To: (Gregory Price's message of "Mon, 29 Jul 2024 23:18:16 -0400")
References: <20240726215548.10653-1-gourry@gourry.net> <87ttg91046.fsf@yhuang6-desk2.ccr.corp.intel.com> <877cd3u1go.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 31 Jul 2024 09:22:32 +0800
Message-ID: <87cymupd7r.fsf@yhuang6-desk2.ccr.corp.intel.com>

Gregory Price writes:

> On Tue, Jul 30, 2024 at 09:12:55AM +0800, Huang, Ying wrote:
>> Gregory Price writes:
>>
>> > On Mon, Jul 29, 2024 at 09:02:33AM +0800, Huang, Ying wrote:
>> >> Gregory Price writes:
>> >>
>> >> > In the event that hmat data is not available for the DRAM tier,
>> >> > or if it is invalid (bandwidth or latency is 0), we can still register
>> >> > a callback to calculate the abstract distance for non-cpu nodes
>> >> > and simply assign it a different tier manually.
>> >> >
>> >> > In the case where DRAM HMAT values are missing or not sane we
>> >> > manually assign adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE).
>> >> >
>> >> > If the HMAT data for the non-cpu tier is invalid (e.g. bw = 0), we
>> >> > cannot reasonable determine where to place the tier, so it will default
>> >> > to MEMTIER_ADISTANCE_DRAM (which is the existing behavior).
>> >>
>> >> Why do we need this? Do you have machines with broken HMAT table? Can
>> >> you ask the vendor to fix the HMAT table?
>> >>
>> >
>> > It's a little unclear from the ACPI specification whether HMAT is
>> > technically optional or not (given that the kernel handles missing HMAT
>> > gracefully, it certainly seems optional). In one scenario I have seen
>> > incorrect data, and in another scenario I have seen the HMAT omitted
>> > entirely. In another scenario I have seen the HMAT-SLLBI omitted while
>> > the CDAT is present.
>>
>> IIUC, HMAT is optional. Is it possible for you to ask the system vendor
>> to fix the broken HMAT table.
>>
>
> In this case we are (BW=0), but in the other cases, there is technically
> nothing broken. That's my concern.
>
>> > In all scenarios the result is the same: all nodes in the same tier.
>>
>> I don't think so, in drivers/dax/kmem.c, we will put memory devices
>> onlined by kmem.c in another tier by default.
>>
>
> This presumes driver configured devices, which is not always the case.
>
> kmem.c will set MEMTIER_DEFAULT_DAX_ADISTANCE
>
> but if BIOS/EFI has set up the node instead, you get the default of
> MEMTIER_ADISTANCE_DRAM if HMAT is not present or otherwise not sane.

The "efi_fake_mem=" kernel parameter can be used to add the "EFI_MEMORY_SP"
flag to a memory range, so that kmem.c can manage it.

> Not everyone is going to have the ability to get a platform vendor to
> fix a BIOS bug, and I've seen this in production.

So some vendor builds a machine with broken/missing HMAT/CDAT and wants
users to use CXL memory devices in it?
Has the vendor tested whether the CXL memory devices work?

>> > The HMAT is explicitly described as "A hint" in the ACPI spec.
>> >
>> > ACPI 5.2.28.1 HMAT Overview
>> >
>> > "The software is expected to use this information as a hint for
>> > optimization, or when the system has heterogeneous memory"
>> >
>> > If something is "a hint", then it should not be used prescriptively.
>> >
>> > Right now HMAT appears to be used prescriptively, this despite the fact
>> > that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
>> > the memory-tier code. So this patch simply realizes this intent when the
>> > hints are not very reasonable.
>>
>> If HMAT isn't available, it's hard to put memory devices to
>> appropriate memory tiers without other information.
>
> Not having a CPU is "other information". What tier a device belongs to
> is really arbitrary, "appropriate" is at best a codified opinion.
>
>> In commit
>> 992bf77591cb ("mm/demotion: add support for explicit memory tiers"),
>> Aneesh pointed out that it doesn't work for his system to put
>> non-CPU-nodes in lower tier.
>>
>
> This seems like a bug / something else incorrect. I will investigate.
>
>> Even if we want to use other information to put memory devices to memory
>> tiers, we can register another adist calculation callback instead of
>> reusing hmat callback.
>>
>
> I suppose during init, we could register a default adist callback with
> CPU/non-CPU checks if HMAT is not sane. I can look at that.
>
> It might also be worth having some kind of modal mechanism, like:
>
> echo "auto" > /sys/.../memory_tiering/mode     # Auto select mode
> echo "hmat" > /sys/.../memory_tiering/mode     # Use HMAT Info
> echo "simple" > /sys/.../memory_tiering/mode   # CPU vs non-CPU Node
> echo "topology" > /sys/.../memory_tiering/mode # More complex
>
> To abstract away the hardware complexities as best as possible.
>
> But the first step here would be creating two modes.
> HMAT-is-sane and CPU/Non-CPU seems reasonable to me but open to opinions.

IMHO, we should reduce user-configurable knobs unless we can prove they
are really necessary.

--
Best Regards,
Huang, Ying
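
For reference, the fallback described in the quoted patch description can be modelled in user space roughly as below. This is only a sketch under stated assumptions, not the patch itself: struct node_perf, perf_is_sane() and fallback_adist() are hypothetical names, the two MEMTIER_* values are illustrative stand-ins for the kernel definitions, and checking the non-CPU node before the DRAM data is an assumed ordering that the thread does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins only; the kernel's own definitions may differ. */
#define MEMTIER_CHUNK_SIZE     1024
#define MEMTIER_ADISTANCE_DRAM ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE / 2))

/* Assumption of this sketch: one chunk further away means the next tier. */
static_assert((MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE) / MEMTIER_CHUNK_SIZE ==
              MEMTIER_ADISTANCE_DRAM / MEMTIER_CHUNK_SIZE + 1,
              "adding one chunk must land in the next tier");

/* Hypothetical, simplified view of the HMAT numbers for one node. */
struct node_perf {
    unsigned int read_latency;
    unsigned int write_latency;
    unsigned int read_bandwidth;
    unsigned int write_bandwidth;
};

/* "Sane" as used in the thread: data is present and no field is 0. */
static bool perf_is_sane(const struct node_perf *p)
{
    return p && p->read_latency && p->write_latency &&
           p->read_bandwidth && p->write_bandwidth;
}

/*
 * Returns the abstract distance chosen by the fallback, or -1 when both
 * DRAM and node data are sane and the regular HMAT-based calculation
 * should be used instead.
 */
static int fallback_adist(const struct node_perf *dram,
                          const struct node_perf *node, bool node_has_cpu)
{
    if (node_has_cpu)
        return MEMTIER_ADISTANCE_DRAM;
    if (!perf_is_sane(node))
        return MEMTIER_ADISTANCE_DRAM;  /* existing behavior */
    if (!perf_is_sane(dram))
        return MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE;
    return -1;
}
```

Calling fallback_adist(NULL, &node, false) with sane node data models the "DRAM HMAT missing" case and puts the non-CPU node one chunk above DRAM, while a node with zero bandwidth stays at MEMTIER_ADISTANCE_DRAM.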