Date: Tue, 11 Mar 2025 00:42:49 -0400
From: Gregory Price
To: Yunjeong Mun
Cc: kernel_team@skhynix.com, Joshua Hahn, harry.yoo@oracle.com, ying.huang@linux.alibaba.com, gregkh@linuxfoundation.org, rakie.kim@sk.com, akpm@linux-foundation.org, rafael@kernel.org, lenb@kernel.org, dan.j.williams@intel.com, Jonathan.Cameron@huawei.com, dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org, linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org, kernel-team@meta.com, Honggyu Kim
Subject: Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
In-Reply-To: <20250311040252.425-1-yunjeong.mun@sk.com>

On Tue, Mar 11, 2025 at 01:02:07PM +0900, Yunjeong Mun wrote:

Forenote: Hi Andrew, please hold off on the auto-configuration patch for
now. The SK group has identified a hotplug issue we need to work out, and
we'll likely need to merge these two patch sets together. I really
appreciate your patience with this feature.

> Hi Gregory,
>
> In my understanding, the reason we are seeing 12 NUMA nodes is because
> it loops through node_states[N_POSSIBLE] and its value is 4095 (twelve ones)
> in the code [1] below:
>
... snip ...

Appreciated. So yes, this confirms what I thought was going on. There are
4 host bridges, 2 devices on each host bridge, and an extra CFMWS per
socket that is intended to interleave across the host bridges.

As you mention below, the code in acpi/numa/srat.c will create 1 NUMA
node per SRAT Memory Affinity Entry, and then also 1 NUMA node per CFMWS
that doesn't have a matching SRAT entry (with a known corner case for a
missing SRAT, which doesn't apply here).

So essentially, the system is advertising that it is possible to create
1 region per device and also 1 region that interleaves across each pair
of host bridges (I presume this is a dual-socket system?).

So, tl;dr: all these nodes are valid and this configuration is correct.

Weighted interleave presently works as intended, but once the
auto-configuration is included, there will be issues for your system
configuration. This means we should probably consider merging these as
a group.

During boot, the following will occur:

1) drivers/acpi/numa/srat.c marks 12 nodes as possible:
   0-1)  socket nodes
   2-3)  cross-host-bridge interleave nodes
   4-11) single-region nodes

2) drivers/cxl/* probes the various devices and creates a root decoder
   for each CXL Fixed Memory Window:
   decoder0.0 - decoder11.0 (or maybe decoder0.0 - decoder0.11)

3) During probe, auto-configuration of weighted interleave occurs as a
   result of this code being called with HMAT or CDAT data:

   void node_set_perf_attrs() {
       ...
       /* When setting CPU access coordinates, update mempolicy */
       if (access == ACCESS_COORDINATE_CPU) {
           if (mempolicy_set_node_perf(nid, coord)) {
               pr_info("failed to set mempolicy attrs for node %d\n", nid);
           }
       }
       ...
   }

Under the current system, since we calculate with N_POSSIBLE, all nodes
will be assigned weights (assuming HMAT or CDAT data is available for all
of them).

We actually have a few issues here:

1) If all nodes are included in the weighting reduction, we're
   over-representing a particular set of hardware. The interleave nodes
   and the individual device nodes describe the same devices, so counting
   both would over-represent the available bandwidth (relative to the CPU
   nodes).

2) As stated in this patch series, simply switching to N_MEMORY causes
   issues with hotplug: the bandwidth can be reported, but if memory
   hasn't been added yet, we'll end up with wrong weights because that
   node wasn't included in the calculation.

3) However, not exposing the nodes because N_MEMORY isn't set yet
   a) prevents pre-configuration before memory is onlined, and
   b) hides from the user the implications of hotplugging memory into a
      node (adding memory causes a re-weight and may affect an
      interleave-all configuration).

But I think it's reasonable to assume that anyone using weighted
interleave is *probably* not going to have nodes come and go. It just
seems like a corner case that isn't worth spending time supporting.

So, coming back around to the hotplug patch series, I do think it's
reasonable to hide nodes marked !N_MEMORY, but consider two issues:

1) In auto mode, we need to re-weight on hotplug to include only onlined
   nodes, because the reduction may be sensitive to changes in the
   available bandwidth. This behavior needs to be clearly documented.

2) We need to clearly define what the weight of a node will be in manual
   mode when a node goes (memory -> no memory -> memory):
   a) does it retain its old, manually set weight?
   b) does it revert to 1?

Sorry for the long email; I'm just working through all the implications.

I think the proposed hotplug patch is a requirement for the
auto-configuration patch set.

~Gregory
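Issue 1) and the auto-mode re-weight on hotplug can be made concrete with a small user-space model. This is a minimal sketch under stated assumptions, not kernel code. The reduction used here divides each bandwidth by the GCD of the participating nodes' bandwidths. It is an assumed stand-in and is not taken from the auto-configuration patch set. `nodes_in_mask()`, `reduce_weights()`, `weight_sum()`, `example_bw[]` and the bandwidth values are all hypothetical.

```c
#include <assert.h>

/* Hypothetical: 12 nodes, matching N_POSSIBLE == 0xFFF on the system above. */
#define NR_NODES 12

/* Count the nodes set in a mask such as node_states[N_POSSIBLE]. */
static int nodes_in_mask(unsigned long mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1) /* clear the lowest set bit */
		n++;
	return n;
}

static unsigned int gcd(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Assumed reduction, for illustration only: every node in @mask gets a
 * weight equal to its bandwidth divided by the GCD of the participating
 * bandwidths. Nodes outside @mask get weight 0.
 */
static void reduce_weights(const unsigned int bw[NR_NODES], unsigned long mask,
			   unsigned int weight[NR_NODES])
{
	unsigned int g = 0;
	int nid;

	for (nid = 0; nid < NR_NODES; nid++)
		if (mask & (1UL << nid))
			g = gcd(g, bw[nid]);
	assert(g > 0); /* at least one participating node has bandwidth */

	for (nid = 0; nid < NR_NODES; nid++)
		weight[nid] = (mask & (1UL << nid)) ? bw[nid] / g : 0;
}

static unsigned int weight_sum(const unsigned int weight[NR_NODES])
{
	unsigned int sum = 0;
	int nid;

	for (nid = 0; nid < NR_NODES; nid++)
		sum += weight[nid];
	return sum;
}

/*
 * Hypothetical bandwidths (arbitrary units): nodes 0-1 are sockets,
 * nodes 2-3 are cross-host-bridge interleave nodes spanning 4 devices
 * each, and nodes 4-11 are single-device nodes.
 */
static const unsigned int example_bw[NR_NODES] = {
	100, 100, 40, 40, 10, 10, 10, 10, 10, 10, 10, 10,
};
```

With these made-up numbers, reducing over all 12 possible nodes (mask 0xFFF) gives weights 10,10,4,4,1,...,1. That sends 16/36 (about 44%) of interleaved pages to CXL nodes. Reducing over only the socket and interleave nodes (mask 0xF) gives 5,5,2,2, or 4/14 (about 29%). That matches the modeled hardware, where CXL supplies 80 of 280 bandwidth units. Onlining node 2 on top of nodes 0-1 (mask 0x3 to 0x7) changes the socket weights from 1 to 5. This is the kind of user-visible re-weight described in 3b).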