Date: Thu, 6 Mar 2025 21:39:26 +0900
From: Honggyu Kim
To: Gregory Price
Cc: kernel_team@skhynix.com, Joshua Hahn, harry.yoo@oracle.com,
    ying.huang@linux.alibaba.com, gregkh@linuxfoundation.org, rakie.kim@sk.com,
    akpm@linux-foundation.org, rafael@kernel.org, lenb@kernel.org,
    dan.j.williams@intel.com, Jonathan.Cameron@huawei.com, dave.jiang@intel.com,
    horen.chuang@linux.dev, hannes@cmpxchg.org, linux-kernel@vger.kernel.org,
    linux-acpi@vger.kernel.org, linux-mm@kvack.org, kernel-team@meta.com,
    yunjeong.mun@sk.com
Subject: Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
References: <20250226213518.767670-1-joshua.hahnjy@gmail.com> <20250226213518.767670-2-joshua.hahnjy@gmail.com>

Hi Gregory,

On 3/5/2025 1:29 AM, Gregory Price wrote:
> On Thu, Feb 27, 2025 at 11:32:26AM +0900, Honggyu Kim wrote:
>> Actually, we're aware of this issue and currently trying to fix this.
>> In our system, we've attached 4ch of CXL memory for each socket as
>> follows.
>>
>>           node0             node1
>>         +-------+   UPI   +-------+
>>         | CPU 0 |-+-----+-| CPU 1 |
>>         +-------+         +-------+
>>         | DRAM0 |         | DRAM1 |
>>         +---+---+         +---+---+
>>             |                 |
>>         +---+---+         +---+---+
>>         | CXL 0 |         | CXL 4 |
>>         +---+---+         +---+---+
>>         | CXL 1 |         | CXL 5 |
>>         +---+---+         +---+---+
>>         | CXL 2 |         | CXL 6 |
>>         +---+---+         +---+---+
>>         | CXL 3 |         | CXL 7 |
>>         +---+---+         +---+---+
>>           node2             node3
>>
>> The 4ch of CXL memory are detected as a single NUMA node in each socket,
>> but it shows as follows with the current N_POSSIBLE loop.
>>
>> $ ls /sys/kernel/mm/mempolicy/weighted_interleave/
>> node0 node1 node2 node3 node4 node5
>> node6 node7 node8 node9 node10 node11
>
> This is insufficient information for me to assess the correctness of the
> configuration. Can you please show the contents of your CEDT/CFMWS and
> SRAT/Memory Affinity structures?
>
> mkdir acpi_data && cd acpi_data
> acpidump -b
> iasl -d *
> cat cedt.dsl     <- find all CFMWS entries
> cat srat.dsl     <- find all Memory Affinity entries

I'm not able to provide all the details, as srat.dsl has too much info.

  $ wc -l srat.dsl
  25229 srat.dsl

Instead, I can show you that there are 4 different proximity domains with
"Enabled : 1" in the following filtered output from srat.dsl.

  $ grep -E "Proximity Domain :|Enabled : " srat.dsl | cut -c 31- | sed 'N;s/\n//' | sort | uniq
  Enabled : 0 Enabled : 0
  Proximity Domain : 00000000 Enabled : 0
  Proximity Domain : 00000000 Enabled : 1
  Proximity Domain : 00000001 Enabled : 1
  Proximity Domain : 00000006 Enabled : 1
  Proximity Domain : 00000007 Enabled : 1

We don't actually need those complicated commands to check this, as dmesg
clearly prints the SRAT and node numbers as follows.

  [    0.009915] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
  [    0.009917] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x207fffffff]
  [    0.009919] ACPI: SRAT: Node 1 PXM 1 [mem 0x60f80000000-0x64f7fffffff]
  [    0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
  [    0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug

The memoryless nodes are printed as follows, right after those
"ACPI: SRAT: Node N PXM M" messages.

  [    0.010927] Initmem setup node 0 [mem 0x0000000000001000-0x000000207effffff]
  [    0.010930] Initmem setup node 1 [mem 0x0000060f80000000-0x0000064f7fffffff]
  [    0.010992] Initmem setup node 2 as memoryless
  [    0.011055] Initmem setup node 3 as memoryless
  [    0.011115] Initmem setup node 4 as memoryless
  [    0.011177] Initmem setup node 5 as memoryless
  [    0.011238] Initmem setup node 6 as memoryless
  [    0.011299] Initmem setup node 7 as memoryless
  [    0.011361] Initmem setup node 8 as memoryless
  [    0.011422] Initmem setup node 9 as memoryless
  [    0.011484] Initmem setup node 10 as memoryless
  [    0.011544] Initmem setup node 11 as memoryless

This is why the current N_POSSIBLE loop exposes 12 node knobs in sysfs.

>
> Basically I need to know:
> 1) Is each CXL device on a dedicated Host Bridge?
> 2) Is inter-host-bridge interleaving configured?
> 3) Is intra-host-bridge interleaving configured?
> 4) Do SRAT entries exist for all nodes?

Are there some simple commands I can use to get this information?

> 5) Why are there 12 nodes but only 10 sources? Are there additional
> devices left out of your diagram? Are there 2 CFMWS but and 8 Memory
> Affinity records - resulting in 10 nodes?

This is strange. My blind guess is that there could be a logical node that
combines the 4ch of CXL memory, so there are 5 nodes per socket. Adding
2 nodes for local CPU/DRAM makes 12 nodes in total.

>
> By default, Linux creates a node for each proximity domain ("PXM")
> detected in the SRAT Memory Affinity tables. If SRAT entries for a
> memory region described in a CFMWS is absent, it will also create an
> node for that CFMWS.
>
> Your reported configuration and results lead me to believe you have
> a combination of CFMWS/SRAT configurations that are unexpected.
>
> ~Gregory

I'm not sure about this part, but our approach with hotplug_memory_notifier()
resolves this problem. Rakie will submit an initial working patchset
soonish.

Thanks,
Honggyu
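
As a rough cross-check of the dmesg reading above, the two kinds of boot
messages can be tabulated mechanically. The following is only a minimal
sketch in Python: the function name summarize() and the SAMPLE text are
made up for illustration, and the regular expressions assume exactly the
"ACPI: SRAT: Node N PXM M [mem ...]" and "Initmem setup node N ..." formats
shown in this mail, which may differ on other kernels.

```python
import re

# Patterns assume the message formats quoted in this mail (an assumption,
# not a stable kernel interface).
SRAT_RE = re.compile(r"ACPI: SRAT: Node (\d+) PXM (\d+) \[mem [^\]]+\]( hotplug)?")
INITMEM_RE = re.compile(r"Initmem setup node (\d+) (?:\[mem [^\]]+\]|as (memoryless))")


def summarize(dmesg_text):
    """Group node ids by what the boot log says about them."""
    srat_pxm = {}        # node id -> PXM taken from the SRAT lines
    hotplug = set()      # nodes whose SRAT range is flagged "hotplug"
    with_memory = set()  # nodes initialised with a memory range
    memoryless = set()   # nodes initialised "as memoryless"

    for line in dmesg_text.splitlines():
        m = SRAT_RE.search(line)
        if m:
            node = int(m.group(1))
            srat_pxm[node] = int(m.group(2))
            if m.group(3):
                hotplug.add(node)
            continue
        m = INITMEM_RE.search(line)
        if m:
            node = int(m.group(1))
            if m.group(2):
                memoryless.add(node)
            else:
                with_memory.add(node)

    return {
        "srat_pxm": srat_pxm,
        "hotplug": sorted(hotplug),
        "with_memory": sorted(with_memory),
        "memoryless": sorted(memoryless),
        # Memoryless nodes that no SRAT line mentions at all.
        "memoryless_without_srat": sorted(memoryless - set(srat_pxm)),
    }


# Abridged from the log quoted in this mail.
SAMPLE = """\
[    0.009915] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[    0.009919] ACPI: SRAT: Node 1 PXM 1 [mem 0x60f80000000-0x64f7fffffff]
[    0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
[    0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug
[    0.010927] Initmem setup node 0 [mem 0x0000000000001000-0x000000207effffff]
[    0.010930] Initmem setup node 1 [mem 0x0000060f80000000-0x0000064f7fffffff]
[    0.010992] Initmem setup node 2 as memoryless
[    0.011055] Initmem setup node 3 as memoryless
[    0.011115] Initmem setup node 4 as memoryless
[    0.011544] Initmem setup node 11 as memoryless
"""

print(summarize(SAMPLE))
```

Run against the full log, the "memoryless_without_srat" entry would list
the nodes that show up in the N_POSSIBLE sysfs listing without any SRAT
line in dmesg, which is the part of question 4 that the log alone can
answer.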