From: rakie.kim@sk.com
To: gourry@gourry.net, joshua.hahnjy@gmail.com
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
    dan.j.williams@intel.com, ying.huang@linux.alibaba.com,
    kernel_team@skhynix.com, honggyu.kim@sk.com, yunjeong.mun@sk.com,
    rakie.kim@sk.com
Subject: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave
Date: Wed, 7 May 2025 18:35:16 +0900
Message-ID: <20250507093517.184-1-rakie.kim@sk.com>

Hi Gregory, Joshua,

I hope this message finds you well. I'm writing to discuss a feature I
believe would enhance the flexibility of the weighted interleave policy:
support for per-socket weighting in multi-socket systems.

---

While reviewing the early versions of the weighted interleave patches,
I noticed that a source-aware weighting structure was included in v1:

  https://lore.kernel.org/all/20231207002759.51418-1-gregory.price@memverge.com/

However, this structure was removed in a later version:

  https://lore.kernel.org/all/20231209065931.3458-1-gregory.price@memverge.com/

Unfortunately, I was unable to participate in the discussion at that
time, and I sincerely apologize for missing it.

From what I understand, there may have been valid reasons for removing
the source-relative design, including:

1. Increased complexity in mempolicy internals. Adding source awareness
   introduces challenges around dynamic nodemask changes, task policy
   sharing during fork(), mbind(), rebind(), etc.

2. A lack of concrete, motivating use cases. At that stage, it might
   have been more pragmatic to focus on a 1D flat weight array.

If there were additional reasons, I would be grateful to learn them.

That said, I would like to revisit this idea now, as I believe some
real-world NUMA configurations would benefit significantly from
reintroducing this capability.

---

The system I am testing includes multiple CPU sockets, each with local
DRAM and directly attached CXL memory.
Here's a simplified diagram:

          node0             node1
        +-------+   UPI   +-------+
        | CPU 0 |-+-----+-| CPU 1 |
        +-------+         +-------+
        | DRAM0 |         | DRAM1 |
        +---+---+         +---+---+
            |                 |
        +---+---+         +---+---+
        | CXL 0 |         | CXL 1 |
        +-------+         +-------+
          node2             node3

This type of system is becoming more common, and in my tests, I
encountered two scenarios where per-socket weighting would be highly
beneficial.

Let's assume the following NUMA bandwidth matrix (GB/s), where rows are
source (CPU) nodes and columns are target memory nodes:

             0     1     2     3
        0   300   150   100    50
        1   150   300    50   100

And flat weights:

        node0 = 3
        node1 = 3
        node2 = 1
        node3 = 1

---

Scenario 1: Adapt weighting based on the task's execution node

Many applications can achieve reasonable performance just by using the
CXL memory on their local socket. However, most workloads do not pin
tasks to a specific CPU node, and the current implementation does not
adjust weights based on where the task is running.

If per-source-node weighting were available, the following matrix could
be used:

             0     1     2     3
        0    3     0     1     0
        1    0     3     0     1

Which means:

1. A task running on CPU0 (node0) would prefer DRAM0 (w=3) and CXL0 (w=1)
2. A task running on CPU1 (node1) would prefer DRAM1 (w=3) and CXL1 (w=1)
3. A large, multithreaded task using both sockets should get both sets

This flexibility is currently not possible with a single flat weight
array.

---

Scenario 2: Reflect relative memory access performance

Remote memory access (e.g., from node0 to node3) incurs a real bandwidth
penalty. Ideally, weights should reflect this. For example:

Bandwidth-based matrix:

             0     1     2     3
        0    6     3     2     1
        1    3     6     1     2

Or DRAM + local CXL only:

             0     1     2     3
        0    6     0     2     1
        1    0     6     1     2

While scenario 1 is probably more common in practice, both can be
expressed within the same design if per-socket weights are supported.

---

Instead of removing the current sysfs interface or flat weight logic, I
propose introducing an optional "multi" mode for per-socket weights.
This would allow users to opt into source-aware behavior.
(The name 'multi' is just an example and should be changed to a more
appropriate name in the future.)

Draft sysfs layout:

  /sys/kernel/mm/mempolicy/weighted_interleave/
    +-- multi           (bool: enable per-socket mode)
    +-- node0           (flat weight for legacy/default mode)
    +-- node_groups/
        +-- node0_group/
        |   +-- node0   (weight of node0 when running on node0)
        |   +-- node1
        +-- node1_group/
            +-- node0
            +-- node1

- When `multi` is false (default), existing behavior applies
- When `multi` is true, the system will use per-task `task_numa_node()`
  to select a row in a 2D weight table

---

1. Compatibility: The proposal avoids breaking the current interface or
   behavior and remains backward-compatible.

2. Auto-tuning: Scenario 1 (local CXL + DRAM) likely works with minimal
   change. Scenario 2 (bandwidth-aware tuning) would require more
   development, and I would welcome Joshua's input on this.

3. Zero weights: Currently the minimum weight is 1. We may want to allow
   zero to fully support asymmetric exclusion.

---

Before beginning an implementation, I would like to validate this
direction with both of you:

- Does this approach fit with your current design intentions?
- Do you foresee problems with complexity, policy sharing, or interface?
- Is there a better alternative to express this idea?

If there's interest, I would be happy to send an RFC patch or prototype.

Thank you for your time and consideration.

Sincerely,
Rakie
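
P.S. To make the row-selection idea a bit more concrete, here is a
minimal user-space sketch in C. It is not kernel code and not a proposed
patch: `struct wi_state`, `wi_weight()` and `bw_row_to_weights()` are
hypothetical names used only for illustration, and the 2-source by
4-target shape simply mirrors the example system above. It shows two
things: falling back to the flat array when `multi` is off, and one
possible way (dividing a bandwidth row by its GCD) to arrive at the
bandwidth-based weights from Scenario 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_NODES 4	/* node0/1: CPU + DRAM, node2/3: CXL (example system) */
#define NR_SRC   2	/* only the CPU nodes can be a source row */

/* Hypothetical state; none of these names exist in mm/mempolicy.c. */
struct wi_state {
	bool multi;				/* per-socket mode enabled? */
	unsigned char flat[NR_NODES];		/* legacy 1D weights */
	unsigned char table[NR_SRC][NR_NODES];	/* per-source 2D weights */
};

static unsigned int gcd_u(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Scenario 2 helper (an assumption, not an existing algorithm in the
 * kernel): reduce one bandwidth row to small integer weights by dividing
 * every entry by the row's GCD. Assumes each result fits in a byte.
 */
static void bw_row_to_weights(const unsigned int *bw, unsigned char *w,
			      size_t n)
{
	unsigned int g = 0;
	size_t i;

	for (i = 0; i < n; i++)
		g = gcd_u(g, bw[i]);
	for (i = 0; i < n; i++)
		w[i] = g ? (unsigned char)(bw[i] / g) : 0;
}

/*
 * Weight of memory node `target` for a task currently running on `src`.
 * Falls back to the flat array when multi mode is off or `src` has no row.
 */
static unsigned char wi_weight(const struct wi_state *s, int src, int target)
{
	assert(target >= 0 && target < NR_NODES);

	if (!s->multi || src < 0 || src >= NR_SRC)
		return s->flat[target];
	return s->table[src][target];
}
```

With the Scenario 1 table loaded, a task on node1 would get weight 0 for
node2 (CXL 0) in multi mode, and the flat weight 1 for the same node once
`multi` is cleared. A weight of 0 in the table presumes the "zero weights"
change mentioned above.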