From: Yunjeong Mun
To: Gregory Price
Cc: kernel_team@skhynix.com, Joshua Hahn, harry.yoo@oracle.com,
    ying.huang@linux.alibaba.com, gregkh@linuxfoundation.org,
    rakie.kim@sk.com, akpm@linux-foundation.org, rafael@kernel.org,
    lenb@kernel.org, dan.j.williams@intel.com, Jonathan.Cameron@huawei.com,
    dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org,
    linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
    linux-mm@kvack.org, kernel-team@meta.com, Honggyu Kim
Subject: Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
Date: Tue, 18 Mar 2025 17:02:38 +0900
Message-ID: <20250318080246.1058-1-yunjeong.mun@sk.com>

Hi Gregory,

I have one more question below.

On Tue, 11 Mar 2025 00:42:49 -0400 Gregory Price wrote:
> On Tue, Mar 11, 2025 at 01:02:07PM +0900, Yunjeong Mun wrote:
>
> forenote - Hi Andrew, please hold off on the auto-configuration patch
> for now, the sk group has identified a hotplug issue we need to work out
> and we'll likely need to merge these two patch sets together.  I really
> appreciate your patience with this feature.
>
> > Hi Gregory,
> >
> > In my understanding, the reason we are seeing 12 NUMA nodes is because
> > it loops through node_states[N_POSSIBLE] and its value is 4095 (twelve ones)
> > in the code [1] below:
> >
> ... snip ...
>
> Appreciated, so yes this confirms what i thought was going on.  There's
> 4 host bridges, 2 devices on each host bridge, and an extra CFMWS per
> socket that is intended to interleave across the host bridges.
>

Thanks for confirming. Honggyu represented it as a tree structure:

rootport/
├── socket0
│   ├── cross-host-bridge0 -> SRAT && CEDT (interleave on) --> NODE 2
│   │   ├── host-bridge0 -> CEDT
│   │   │   ├── cxl0 -> CEDT
│   │   │   └── cxl1 -> CEDT
│   │   └── host-bridge1 -> CEDT
│   │       ├── cxl2 -> CEDT
│   │       └── cxl3 -> CEDT
│   └── dram0 -> SRAT ---------------------------------------> NODE 0
└── socket1
    ├── cross-host-bridge1 -> SRAT && CEDT (interleave on) --> NODE 3
    │   ├── host-bridge2 -> CEDT
    │   │   ├── cxl4 -> CEDT
    │   │   └── cxl5 -> CEDT
    │   └── host-bridge3 -> CEDT
    │       ├── cxl6 -> CEDT
    │       └── cxl7 -> CEDT
    └── dram1 -> SRAT ---------------------------------------> NODE 1

> As you mention below, the code in acpi/numa/srat.c will create 1 NUMA
> node per SRAT Memory Affinity Entry - and then also 1 NUMA node per
> CFMWS that doesn't have a matching SRAT entry (with a known corner case
> for a missing SRAT which doesn't apply here).
>
> So essentially what the system is doing is marking that it's absolutely
> possible to create 1 region per device and also 1 region that
> interleaves across each pair of host bridges (I presume this is a
> dual socket system?).
>
> So, tl;dr: All these nodes are valid and this configuration is correct.

I am wondering whether specifying all 12 nodes as 'possible' is indeed correct.
The definition of 'possible' is:
 - 'Nodes that could possibly become online at some point'.
IMHO, it seems like only 4 nodes should be specified as 'possible'.

>
> Weighted interleave presently works fine as intended, but with the
> inclusion of the auto-configuration, there will be issues for your
> system configuration. This means we probably need to consider
> merging these as a group.
>
> During boot, the following will occur
>
> 1) drivers/acpi/numa/srat.c marks 12 nodes as possible
>    0-1) Socket nodes
>    2-3) Cross-host-bridge interleave nodes
>    4-11) single region nodes
>
> 2) drivers/cxl/* will probe the various devices and create
>    a root decoder for each CXL Fixed Memory Window
>    decoder0.0 - decoder11.0  (or maybe decoder0.0 - decoder0.11)
>
> 3) during probe auto-configuration of weighted interleave occurs as a
>    result of this code being called with hmat or cdat data:
>
>    void node_set_perf_attrs() {
>        ...
>        /* When setting CPU access coordinates, update mempolicy */
>        if (access == ACCESS_COORDINATE_CPU) {
>            if (mempolicy_set_node_perf(nid, coord)) {
>                pr_info("failed to set mempolicy attrs for node %d\n",
>                        nid);
>            }
>        }
>        ...
>    }
>
> under the current system, since we calculate with N_POSSIBLE, all nodes
> will be assigned weights (assuming HMAT or CDAT data is available for
> all of them).
>
> We actually have a few issues here
>
> 1) If all nodes are included in the weighting reduction, we're actually
>    over-representing a particular set of hardware.  The interleave node
>    and the individual device nodes would actually over-represent the
>    bandwidth available (comparative to the CPU nodes).
>
> 2) As stated on this patch line, just switching to N_MEMORY causes
>    issues with hotplug - where the bandwidth can be reported, but if
>    memory hasn't been added yet then we'll end up with wrong weights
>    because it wasn't included in the calculation.
>
> 3) However, not exposing the nodes because N_MEMORY isn't set yet
>    a) prevents pre-configuration before memory is onlined, and
>    b) hides the implications of hotplugging memory into a node from the
>       user (adding memory causes a re-weight and may affect an
>       interleave-all configuration).
>
> but - i think it's reasonable that anyone using weighted-interleave is
> *probably* not going to have nodes come and go.  It just seems like a
> corner case that isn't reasonable to spend time supporting.
>
> So coming back around to the hotplug patch line, I do think it's
> reasonable to hide nodes marked !N_MEMORY, but consider two issues:
>
> 1) In auto mode, we need to re-weight on hotplug to only include
>    onlined nodes.  This is because the reduction may be sensitive
>    to the available bandwidth changes.
>
>    This behavior needs to be clearly documented.
>
> 2) We need to clearly define what the weight of a node will be when
>    in manual mode and a node goes (memory -> no memory -> memory)
>    a) does it retain its old, manually set weight?
>    b) does it revert to 1?
>
> Sorry for the long email, just working through all the implications.
>
> I think the proposed hotplug patch is a requirement for the
> auto-configuration patch set.
>
> ~Gregory
>

Best regards,
Yunjeong