Message-ID: <4ddfa283-eb64-4032-880b-c19b07e407e1@sk.com>
Date: Fri, 13 Dec 2024 15:19:20 +0900
Cc: kernel_team@skhynix.com, "rafael@kernel.org", "lenb@kernel.org",
 "gregkh@linuxfoundation.org", "akpm@linux-foundation.org",
 김홍규(KIM HONGGYU) System SW, "ying.huang@linux.alibaba.com",
 김락기(KIM RAKIE) System SW, "dan.j.williams@intel.com",
 "Jonathan.Cameron@huawei.com", "dave.jiang@intel.com",
 "horen.chuang@linux.dev", "hannes@cmpxchg.org",
 "linux-kernel@vger.kernel.org", "linux-acpi@vger.kernel.org",
 "linux-mm@kvack.org", "kernel-team@meta.com"
Subject: Re: [External Mail] [RFC PATCH] mm/mempolicy: Weighted interleave auto-tuning
To: Joshua Hahn, "gourry@gourry.net"
References: <20241210215439.94819-1-joshua.hahnjy@gmail.com>
From: Hyeonggon Yoo
In-Reply-To: <20241210215439.94819-1-joshua.hahnjy@gmail.com>

On 2024-12-11 06:54 AM, Joshua Hahn wrote:
> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across NUMA nodes according to
> user-set ratios.
>
> Ideally, these weights should be proportional to their bandwidth, so
> that under bandwidth pressure, each node uses its maximal efficient
> bandwidth and prevents latency from increasing exponentially.
>
> At the same time, we want these weights to be as small as possible.
> Having ratios that involve large co-prime numbers like 7639:1345:7 leads
> to awkward and inefficient allocations, since the node with weight 7
> will remain mostly unused (and despite being proportional to bandwidth,
> will not aid in relieving the pressure present in the other two nodes).
>
> This patch introduces an auto-configuration for the interleave weights
> that aims to balance the two goals of setting node weights to be
> proportional to their bandwidths and keeping the weight values low.
> This balance is controlled by a value max_node_weight, which defines the
> maximum weight a single node can take.

Hi Joshua,

I am wondering how this is going to work for host memory + CXL memory
interleaving. I guess by "the ACPI table" you mean the ACPI HMAT or CXL
CDAT, neither of which provides the bandwidth of host memory. If this
feature does not consider the bandwidth of host memory, manual
configuration will be inevitable anyway.

> Large max_node_weights generally lead to increased weight-bandwidth
> proportionality, but can lead to underutilized nodes (think worst-case
> scenario, which is 1:max_node_weight). Lower max_node_weights reduce the
> effects of underutilized nodes, but may lead to improperly loaded
> distributions.
>
> This knob is exposed as a sysfs interface with a default value of 32.
> Weights are re-calculated once at boottime and then every time the knob
> is changed by the user, or when the ACPI table is updated.
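
To make the trade-off between proportionality and small weights concrete,
here is a minimal userspace sketch of one way bandwidths could be reduced
to capped weights. It is only an assumption about the general idea: the
names sketch_reduce_weights() and gcd_u64(), the round-to-nearest rule and
the "never below 1" floor are all hypothetical, and the patch's own
reduce_interleave_weights() is not shown in this reply and may work
differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Greatest common divisor; gcd_u64(0, x) == x. */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
	while (b) {
		uint64_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Hypothetical userspace sketch, NOT the patch's reduce_interleave_weights().
 * Assumes bw[i] * max_node_weight + max_bw / 2 fits in 64 bits for every node.
 */
static void sketch_reduce_weights(const uint64_t *bw, uint8_t *iw,
				  size_t nr_nodes, uint8_t max_node_weight)
{
	uint64_t max_bw = 0, g = 0;
	size_t i;

	assert(max_node_weight >= 1);

	/* Find the fastest node; it will receive max_node_weight. */
	for (i = 0; i < nr_nodes; i++)
		if (bw[i] > max_bw)
			max_bw = bw[i];

	for (i = 0; i < nr_nodes; i++) {
		uint64_t w = 0;

		/* Scale relative to the fastest node, rounding to nearest. */
		if (max_bw)
			w = (bw[i] * max_node_weight + max_bw / 2) / max_bw;
		/* Never hand out weight 0: a slow node still gets weight 1. */
		iw[i] = w ? (uint8_t)w : 1;
		g = gcd_u64(g, iw[i]);
	}

	/* Keep the ratio but make the numbers as small as possible. */
	for (i = 0; i < nr_nodes; i++)
		iw[i] /= (uint8_t)g;
}
```

Under these assumptions, the 7639:1345:7 example with a cap of 32 becomes
32:6:1, and a cap of 1 collapses every node to weight 1, i.e. plain
interleave.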
>
> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
>
> Signed-off-by: Joshua Hahn
> Signed-off-by: Gregory Price
> Co-Developed-by: Gregory Price
> ---
>  ...fs-kernel-mm-mempolicy-weighted-interleave |  24 +++
>  drivers/acpi/numa/hmat.c                      |   1 +
>  drivers/base/node.c                           |   7 +
>  include/linux/mempolicy.h                     |   4 +
>  mm/mempolicy.c                                | 195 ++++++++++++++++--
>  5 files changed, 211 insertions(+), 20 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> index 0b7972de04e9..2ef9a87ce878 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -23,3 +23,27 @@ Description:	Weight configuration interface for nodeN
>  		Writing an empty string or `0` will reset the weight to the
>  		system default. The system default may be set by the kernel
>  		or drivers at boot or during hotplug events.
> +
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/max_node_weight
> +Date:		December 2024
> +Contact:	Linux memory management mailing list
> +Description:	Weight limiting / scaling interface
> +
> +		The maximum interleave weight for a memory node. When it is
> +		updated, any previous changes to interleave weights (i.e. via
> +		the nodeN sysfs interfaces) are ignored, and new weights are
> +		calculated using ACPI-reported bandwidths and scaled.
> +

At first this paragraph sounded like "previously stored weights are
discarded after setting max_node_weight", but I think you mean
"Users can override the default values, but the default values are
calculated regardless of the values set by the user". Right?

> +		It is possible for weights to be greater than max_node_weight if
> +		the nodeN interfaces are directly modified to be greater.
> +
> +		Minimum weight: 1
> +		Default value: 32
> +		Maximum weight: 255
> +
> +		Writing an empty string will set the value to be the default
> +		(32). Writing a value outside the valid range will return
> +		EINVAL and will not re-trigger a weight scaling.
> +
> +		Setting max_node_weight to 1 is equivalent to unweighted
> +		interleave.
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index ee32a10e992c..f789280acdcb 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -109,6 +109,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include
> @@ -153,24 +154,116 @@ static unsigned int mempolicy_behavior;

[...snip...]

> +int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
> +{
> +	unsigned long *old_bw, *new_bw;
> +	unsigned long bw_val;
> +	u8 *old_iw, *new_iw;
> +
> +	/*
> +	 * Bandwidths above this limit causes rounding errors when reducing
> +	 * weights. This value is ~16 exabytes, which is unreasonable anyways.
> +	 */
> +	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
> +	if (bw_val > (U64_MAX / 10))
> +		return -EINVAL;
> +
> +	new_bw = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
> +	if (!new_bw)
> +		return -ENOMEM;
> +
> +	new_iw = kzalloc(nr_node_ids, GFP_KERNEL);

I think kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL); will be more readable.

> +	if (!new_iw) {
> +		kfree(new_bw);
> +		return -ENOMEM;
> +	}
> +
> +	mutex_lock(&default_iwt_lock);
> +	old_bw = node_bw_table;
> +	old_iw = rcu_dereference_protected(default_iw_table,
> +					   lockdep_is_held(&default_iwt_lock));
> +
> +	if (old_bw)
> +		memcpy(new_bw, old_bw, nr_node_ids*sizeof(unsigned long));
> +	new_bw[node] = bw_val;
> +	node_bw_table = new_bw;
> +
> +	reduce_interleave_weights(new_bw, new_iw);
> +	rcu_assign_pointer(default_iw_table, new_iw);
> +
> +	mutex_unlock(&default_iwt_lock);
> +	synchronize_rcu();
> +	kfree(old_bw);
> +	kfree(old_iw);
> +	return 0;
> +}
> +
>  /**
>   * numa_nearest_node - Find nearest node by state
>   * @node: Node id to start the search
> @@ -2001,7 +2094,7 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>  {
>  	nodemask_t nodemask;
>  	unsigned int target, nr_nodes;
> -	u8 *table;
> +	u8 *table, *defaults;
>  	unsigned int weight_total = 0;
>  	u8 weight;
>  	int nid;
> @@ -2012,11 +2105,12 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>
>  	rcu_read_lock();
>  	table = rcu_dereference(iw_table);
> +	defaults = rcu_dereference(iw_table);

Probably you intended rcu_dereference(default_iw_table)?

>  	/* calculate the total weight */
>  	for_each_node_mask(nid, nodemask) {
>  		/* detect system default usage */
> -		weight = table ? table[nid] : 1;
> -		weight = weight ? weight : 1;
> +		weight = table ? table[nid] : 0;
> +		weight = weight ? weight : (defaults ? defaults[nid] : 1);
>  		weight_total += weight;
>  	}
>
> @@ -2025,8 +2119,8 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>  	nid = first_node(nodemask);
>  	while (target) {
>  		/* detect system default usage */
> -		weight = table ? table[nid] : 1;
> -		weight = weight ? weight : 1;
> +		weight = table ? table[nid] : 0;
> +		weight = weight ? weight : (defaults ? defaults[nid] : 1);
>  		if (target < weight)
>  			break;
>  		target -= weight;
> @@ -2409,7 +2503,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>  	unsigned long nr_allocated = 0;
>  	unsigned long rounds;
>  	unsigned long node_pages, delta;
> -	u8 *table, *weights, weight;
> +	u8 *weights, weight;
>  	unsigned int weight_total = 0;
>  	unsigned long rem_pages = nr_pages;
>  	nodemask_t nodes;
> @@ -2458,16 +2552,8 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>  	if (!weights)
>  		return total_allocated;
>
> -	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> -	if (table)
> -		memcpy(weights, table, nr_node_ids);
> -	rcu_read_unlock();
> -
> -	/* calculate total, detect system default usage */
>  	for_each_node_mask(node, nodes) {
> -		if (!weights[node])
> -			weights[node] = 1;
> +		weights[node] = get_il_weight(node);
>  		weight_total += weights[node];
>  	}
>
> @@ -3396,6 +3482,7 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
>  }
>
>  static struct iw_node_attr **node_attrs;
> +static struct kobj_attribute *max_nw_attr;

Where is max_nw_attr initialized?

Best,
Hyeonggon

>  static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
>  				  struct kobject *parent)
> @@ -3413,6 +3500,10 @@ static void sysfs_wi_release(struct kobject *wi_kobj)
>
>  	for (i = 0; i < nr_node_ids; i++)
>  		sysfs_wi_node_release(node_attrs[i], wi_kobj);
> +
> +	sysfs_remove_file(wi_kobj, &max_nw_attr->attr);
> +	kfree(max_nw_attr->attr.name);
> +	kfree(max_nw_attr);
>  	kobject_put(wi_kobj);
>  }
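
For reference, the fallback order that the weighted_interleave_nid() hunk
seems to aim for (user-set weight first, then the auto-tuned default, then
1) can be modelled in userspace as below. This is a sketch under
assumptions, not kernel code: plain pointers stand in for the
RCU-protected iw_table and default_iw_table, nodes are simply numbered
0..nr_nodes-1 instead of coming from a nodemask, and both function names
are hypothetical. It also floors a default of 0 to 1, which the quoted
expression does not do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the per-node weight lookup.
 * table:    user-set weights, 0 meaning "use the system default" (may be NULL)
 * defaults: auto-tuned default weights (may be NULL)
 */
static uint8_t sketch_il_weight(const uint8_t *table, const uint8_t *defaults,
				size_t nid)
{
	uint8_t weight = table ? table[nid] : 0;

	if (weight)
		return weight;
	weight = defaults ? defaults[nid] : 0;
	/* Extra guard (not in the quoted hunk): never return weight 0. */
	return weight ? weight : 1;
}

/* Pick the node for interleave index ilx, walking nodes in weight order. */
static size_t sketch_weighted_interleave_nid(const uint8_t *table,
					     const uint8_t *defaults,
					     size_t nr_nodes, uint64_t ilx)
{
	unsigned int weight_total = 0;
	unsigned int target;
	size_t nid;

	assert(nr_nodes > 0);

	/* calculate the total weight */
	for (nid = 0; nid < nr_nodes; nid++)
		weight_total += sketch_il_weight(table, defaults, nid);

	/* target < weight_total, so the loop below always breaks in range */
	target = (unsigned int)(ilx % weight_total);
	for (nid = 0; nid < nr_nodes; nid++) {
		uint8_t weight = sketch_il_weight(table, defaults, nid);

		if (target < weight)
			break;
		target -= weight;
	}
	return nid;
}
```

With table = {0, 5, 0} and defaults = {3, 2, 1}, the effective weights are
3:5:1. If defaults pointed at the same array as table, as in the quoted
hunk, a node whose user weight is 0 would, as far as I can tell, end up
with weight 0 there instead of its default.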