Message-ID: <9c0d8aa8-cac7-4679-aece-af88e8129345@sk.com>
Date: Fri, 7 Mar 2025 20:46:46 +0900
From: Honggyu Kim
To: Gregory Price
Cc: kernel_team@skhynix.com, Joshua Hahn, harry.yoo@oracle.com,
    ying.huang@linux.alibaba.com, gregkh@linuxfoundation.org,
    rakie.kim@sk.com, akpm@linux-foundation.org, rafael@kernel.org,
    lenb@kernel.org, dan.j.williams@intel.com,
    Jonathan.Cameron@huawei.com, dave.jiang@intel.com,
    horen.chuang@linux.dev, hannes@cmpxchg.org,
    linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
    linux-mm@kvack.org, kernel-team@meta.com, yunjeong.mun@sk.com
Subject: Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
References: <20250226213518.767670-1-joshua.hahnjy@gmail.com>
    <20250226213518.767670-2-joshua.hahnjy@gmail.com>

On 3/7/2025 2:32 AM, Gregory Price wrote:
> On Thu, Mar 06, 2025 at 09:39:26PM +0900, Honggyu Kim wrote:
>>
>> The memoryless nodes are printed as follows after those ACPI, SRAT,
>> Node N PXM M messages.
>>
>> [ 0.010927] Initmem setup node 0 [mem 0x0000000000001000-0x000000207effffff]
>> [ 0.010930] Initmem setup node 1 [mem 0x0000060f80000000-0x0000064f7fffffff]
>> [ 0.010992] Initmem setup node 2 as memoryless
>> [ 0.011055] Initmem setup node 3 as memoryless
>> [ 0.011115] Initmem setup node 4 as memoryless
>> [ 0.011177] Initmem setup node 5 as memoryless
>> [ 0.011238] Initmem setup node 6 as memoryless
>> [ 0.011299] Initmem setup node 7 as memoryless
>> [ 0.011361] Initmem setup node 8 as memoryless
>> [ 0.011422] Initmem setup node 9 as memoryless
>> [ 0.011484] Initmem setup node 10 as memoryless
>> [ 0.011544] Initmem setup node 11 as memoryless
>>
>> This is related why the 12 nodes at sysfs knobs are provided with the
>> current N_POSSIBLE loop.
>>
>
> This isn't actually why, this is another symptom.  This gets printed
> because someone is marking nodes 4-11 as possible and setup_nr_node_ids
> reports 12 total nodes
>
> void __init setup_nr_node_ids(void)
> {
>         unsigned int highest;
>
>         highest = find_last_bit(node_possible_map.bits, MAX_NUMNODES);
>         nr_node_ids = highest + 1;
> }
>
> Given your configuration data so far, we may have a bug somewhere (or
> i'm missing a configuration piece).

There may be a misunderstanding about this issue.  This isn't a problem
with NUMA detection for CXL memory; it is only a problem with the number
of "node" knobs created for weighted interleave.

'numactl -H' shows the correct number of nodes even without our fix.

  $ numactl -H
  available: 4 nodes (0-3)
  node 0 cpus: 0 1 2 3 ...
  node 0 size: 128504 MB
  node 0 free: 118563 MB
  node 1 cpus: 144 145 146 147 ...
  node 1 size: 257961 MB
  node 1 free: 242628 MB
  node 2 cpus:
  node 2 size: 393216 MB
  node 2 free: 393216 MB
  node 3 cpus:
  node 3 size: 524288 MB
  node 3 free: 524288 MB
  node distances:
  node   0   1   2   3
    0:  10  21  14  24
    1:  21  10  24  14
    2:  14  24  10  26
    3:  24  14  26  10

You can see more info below.

  $ cd /sys/devices/system/node

  $ ls -d node*
  node0  node1  node2  node3

  $ cat possible
  0-11

  $ cat online
  0-3

  $ cat has_memory
  0-3

  $ cat has_normal_memory
  0-1

  $ cat has_cpu
  0-1

>>> Basically I need to know:
>>> 1) Is each CXL device on a dedicated Host Bridge?
>>> 2) Is inter-host-bridge interleaving configured?
>>> 3) Is intra-host-bridge interleaving configured?
>>> 4) Do SRAT entries exist for all nodes?
>>
>> Are there some simple commands that I can get those info?
>>
>
> The content of the CEDT would be sufficient - that will show us the
> number of CXL host bridges.

Which command do we need to get this info specifically?  My output
doesn't show anything useful for that.

  $ acpidump -b
  $ iasl -d *
  $ cat cedt.dsl
      ...
  **** Unknown ACPI table signature [CEDT]

>
>>> 5) Why are there 12 nodes but only 10 sources? Are there additional
>>> devices left out of your diagram? Are there 2 CFMWS but and 8 Memory
>>> Affinity records - resulting in 10 nodes? This is strange.
>>
>> My blind guess is that there could be a logic node that combines 4ch of
>> CXL memory so there are 5 nodes per each socket.  Adding 2 nodes for
>> local CPU/DRAM makes 12 nodes in total.
>>
>
> The issue is that nodes have associated memory regions.  If there are
> multiple nodes with overlapping memory regions, that seems problematic.
>
> If there are "possible nodes" without memory and no real use case
> (because the memory is associated with the aggregate node) then those
> nodes probably shouldn't be reported as possible.
>
> the tl;dr here is we should figure out what is marking those nodes as
> possible.
>
>> Not sure about this part but our approach with hotplug_memory_notifier()
>> resolves this problem.  Rakie will submit an initial working patchset
>> soonish.
>
> This may just be a bandaid on the issue.  We should get our node
> configuration correct from the get-go.

Not sure about it.
This must be fixed ASAP because the current kernel is broken on this
issue, and the fix should go into the hotfix tree first.

You may think this is just a bandaid, but leaving it bleeding as-is is
not the right approach.

Our fix was posted a few hours ago.  Please have a look, then think
about the approach again.
https://lore.kernel.org/linux-mm/20250307063534.540-1-rakie.kim@sk.com

Thanks,
Honggyu

>
> ~Gregory
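
For illustration only, the gap between the "possible" and "has_memory"
masks reported above can be modelled in user space. This is a rough
sketch, not the kernel patch: parse_node_list() is a hypothetical
helper, the two input strings are copied from the sysfs output in this
mail, and "one knob per node in the mask" is an assumption taken from
the N_POSSIBLE loop discussed in this thread.

```python
def parse_node_list(text: str) -> set[int]:
    """Parse a sysfs node list such as "0-3" or "0,2-5" into node IDs."""
    nodes: set[int] = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue  # an empty file means an empty mask
        first, _, last = chunk.partition("-")
        # A single ID like "7" has no "-", so reuse it as the range end.
        nodes.update(range(int(first), int(last or first) + 1))
    return nodes


# Values copied from /sys/devices/system/node as shown in this mail.
possible = parse_node_list("0-11")
has_memory = parse_node_list("0-3")

# Assumption: a loop over possible nodes creates one knob per possible
# node, while a loop over nodes with memory creates only usable knobs.
knobs_possible_loop = sorted(possible)
knobs_memory_loop = sorted(possible & has_memory)
memoryless = sorted(possible - has_memory)

print(f"possible-node loop: {len(knobs_possible_loop)} knobs")
print(f"has-memory loop:    {len(knobs_memory_loop)} knobs")
print(f"memoryless nodes:   {memoryless}")
```

With the values above, this reports 12 knobs for the possible-node
loop, 4 knobs for the has-memory loop, and nodes 4-11 as
possible but without memory.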