From: Rakie Kim
To: Gregory Price
Cc: joshua.hahnjy@gmail.com, akpm@linux-foundation.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
    dan.j.williams@intel.com, ying.huang@linux.alibaba.com,
    kernel_team@skhynix.com, honggyu.kim@sk.com, yunjeong.mun@sk.com,
    Rakie Kim
Subject: Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
Date: Fri, 9 May 2025 11:30:26 +0900
Message-ID: <20250509023032.235-1-rakie.kim@sk.com>

On Thu, 8 May 2025 11:12:35 -0400 Gregory Price wrote:
> On Thu, May 08, 2025 at 03:30:36PM +0900, Rakie Kim wrote:
> > On Wed, 7 May 2025 12:38:18 -0400 Gregory Price wrote:
> >
> > The proposed design is completely optional and isolated: it retains the
> > existing flat weight model as-is and activates the source-aware behavior only
> > when 'multi' mode is enabled. The complexity is scoped entirely to users who
> > opt into this mode.
> >
>
> I get what you're going for, just expressing my experience around this
> issue specifically.

Thank you very much for your response. Your prior experience and insights
have been extremely helpful in refining how I think about this problem.

>
> The lack of enthusiasm for solving the cross-socket case, and thus
> reduction from a 2D array to a 1D array, was because reasoning about
> interleave w/ cross-socket interconnects is not really feasible with
> the NUMA abstraction. Cross-socket interconnects are "Invisible" but
> have real performance implications. Unless we have a way to:
>
> 1) Represent the topology, AND
> 2) A way to get performance about that topology
>
> It's not useful. So NUMA is an incomplete (if not wrong) tool for this.

Your comment gave me an opportunity to reconsider the purpose of the
feature I originally proposed. In fact, I had two different scenarios in
mind when outlining this direction.

Scenario 1: Adapt weighting based on the task's execution node
A task prefers only the DRAM and locally attached CXL memory of the socket
on which it is running, in order to avoid cross-socket access and optimize
bandwidth.
- A task running on CPU0 (node0) would prefer DRAM0 (w=3) and CXL0 (w=1)
- A task running on CPU1 (node1) would prefer DRAM1 (w=3) and CXL1 (w=1)

Scenario 2: Reflect relative memory access performance
The system adjusts weights based on expected bandwidth differences for
remote accesses. This relies on having access to interconnect performance
data, which NUMA currently does not expose.

As you rightly pointed out, Scenario 2 depends on being able to measure or
model the cost of cross-socket access, which is not available in the
current abstraction. I now realize that this case is less actionable and
needs further research before being pursued.

However, Scenario 1 does not depend on such information. Rather, it is a
locality-preserving optimization where we isolate memory access to each
socket's DRAM and CXL nodes. I believe this use case is implementable today
and worth considering independently from interconnect performance
awareness.

>
> Additionally - reacting to task migration is not a real issue. If
> you're deploying an allocation strategy, you probably don't want your
> task migrating away from the place where you just spent a bunch of time
> allocating based on some existing strategy. So the solution is: don't
> migrate, and if you do - don't use cross-socket interleave.

That's a fair point. I also agree that handling migration is not critical
at this stage, and I'm not actively focusing on that aspect in this
proposal.

>
> Maybe if we solve the first half of this we can take a look at the task
> migration piece again, but I wouldn't try to solve for migration.
>
> At the same time we were discussing this, we were also discussing how to
> do external task-mempolicy modifications - which seemed significantly
> more useful, but ultimately more complex and without sufficient
> interested parties / users.

I'd like to learn more about that thread. If you happen to have a pointer
to that discussion, it would be really helpful.
>
> ~Gregory
>

Thanks again for sharing your insights. I will follow up with a refined
proposal based on the localized socket-based routing model (Scenario 1) and
will set aside the parts that depend on topology performance measurement
for now, pending further research.

Rakie
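
To make Scenario 1 concrete, here is a minimal user-space sketch in C of
per-socket weighted interleave. It is not kernel code and not the proposed
implementation. The node numbering (0=DRAM0, 1=DRAM1, 2=CXL0, 3=CXL1), the
table socket_weights, and the helpers next_node() and fill_sequence() are
all hypothetical names chosen for illustration; only the weights (DRAM w=3,
CXL w=1) come from the scenario described above.

```c
#include <assert.h>

/* Hypothetical node layout: 0=DRAM0, 1=DRAM1, 2=CXL0, 3=CXL1. */
#define NR_NODES   4
#define NR_SOCKETS 2

/*
 * Assumed per-socket weight table for Scenario 1: each socket only
 * lists its own DRAM (w=3) and its locally attached CXL node (w=1).
 * A weight of 0 means "never allocate from this node".
 */
static const unsigned char socket_weights[NR_SOCKETS][NR_NODES] = {
	{ 3, 0, 1, 0 },	/* task running on socket 0 */
	{ 0, 3, 0, 1 },	/* task running on socket 1 */
};

/* Interleave cursor for one task. */
struct il_state {
	int node;		/* node currently being allocated from */
	unsigned char left;	/* allocations left before moving on */
};

/*
 * Pick the node for the next allocation of a task on 'socket'.
 * When the current node's budget is used up, advance round-robin to
 * the next node with a non-zero weight in that socket's row.
 * Returns -1 if the row contains no usable node.
 */
static int next_node(struct il_state *st, int socket)
{
	const unsigned char *w;
	int i;

	assert(socket >= 0 && socket < NR_SOCKETS);
	w = socket_weights[socket];

	if (st->left == 0) {
		for (i = 1; i <= NR_NODES; i++) {
			int cand = (st->node + i) % NR_NODES;

			if (w[cand]) {
				st->node = cand;
				st->left = w[cand];
				break;
			}
		}
		if (st->left == 0)
			return -1;
	}

	st->left--;
	return st->node;
}

/* Record the first 'n' allocation targets for a task on 'socket'. */
static void fill_sequence(int socket, int *out, int n)
{
	/* Start just before node 0 so the first search begins at node 0. */
	struct il_state st = { NR_NODES - 1, 0 };
	int i;

	for (i = 0; i < n; i++)
		out[i] = next_node(&st, socket);
}
```

With these assumed weights, fill_sequence(0, seq, 8) yields nodes
0,0,0,2,0,0,0,2 and fill_sequence(1, seq, 8) yields 1,1,1,3,1,1,1,3, so a
task never allocates from the other socket's DRAM or CXL node. The point of
the sketch is that the table is indexed by the task's socket only and needs
no interconnect performance data.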