Message-ID: <3682b9cf-213c-497d-ab81-f70e1a785716@sk.com>
Date: Fri, 20 Dec 2024 17:25:28 +0900
From: Hyeonggon Yoo
To: Joshua Hahn, "gourry@gourry.net"
Cc: kernel_team@skhynix.com, 42.hyeyoo@gmail.com, "rafael@kernel.org",
 "lenb@kernel.org", "gregkh@linuxfoundation.org",
 "akpm@linux-foundation.org", Honggyu Kim,
 "ying.huang@linux.alibaba.com", Rakie Kim, "dan.j.williams@intel.com",
 "Jonathan.Cameron@huawei.com", "dave.jiang@intel.com",
 "horen.chuang@linux.dev", "hannes@cmpxchg.org", "linux-mm@kvack.org",
 "linux-kernel@vger.kernel.org", "linux-acpi@vger.kernel.org",
 "kernel-team@meta.com"
Subject: Re: [External Mail] [RFC PATCH v2] Weighted interleave auto-tuning
In-Reply-To: <20241219191845.3506370-1-joshua.hahnjy@gmail.com>
References: <20241219191845.3506370-1-joshua.hahnjy@gmail.com>

On 2024-12-20 4:18 AM, Joshua Hahn wrote:
> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across NUMA nodes according to
> user-set ratios.
>
> Ideally, these weights should be proportional to their bandwidth, so
> that under bandwidth pressure, each node uses its maximal efficient
> bandwidth and prevents latency from increasing exponentially.
>
> At the same time, we want these weights to be as small as possible.
> Having ratios that involve large co-prime numbers like 7639:1345:7 leads
> to awkward and inefficient allocations, since the node with weight 7
> will remain mostly unused (and despite being proportional to bandwidth,
> will not aid in relieving the pressure present in the other two nodes).
>
> This patch introduces an auto-configuration for the interleave weights
> that aims to balance the two goals of setting node weights to be
> proportional to their bandwidths and keeping the weight values low.
> This balance is controlled by a value "weightiness", which defines the
> interleaving aggression. Higher values lead to less interleaving
> (255:1), while lower values lead to more interleaving (1:1).
>
> Large weightiness values generally lead to increased weight-bandwidth
> proportionality, but can lead to underutilized nodes (think worst-case
> scenario, which is 1:max_node_weight). Lower weightiness reduces the
> effects of underutilized nodes, but may lead to improperly loaded
> distributions.

s/max_node_weight/weightiness/

> This knob is exposed as a sysfs interface with a default value of 32.
> Weights are re-calculated once at boottime and then every time the knob
> is changed by the user, or when the ACPI table is updated.
>
> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
>
> Signed-off-by: Joshua Hahn
> Signed-off-by: Gregory Price
> Co-Developed-by: Gregory Price
>
> ---
> Changelog
>
> v2:
> - Name of the interface is changed from v1: "max_node_weight" --> "weightiness"
> - Default interleave weight table no longer exists. Rather, the
>   interleave weight table is initialized with the defaults, if bandwidth
>   information is available.
> - In addition, all sections that handle iw_table have been changed
>   to reference iw_table if it exists, otherwise defaulting to 1.
> - All instances of unsigned long are converted to uint64_t to guarantee
>   support for both 32-bit and 64-bit machines
> - sysfs initialization cleanup
> - Documentation has been rewritten to explicitly outline expected
>   behavior and expand on the interpretation of "weightiness".
> - kzalloc replaced with kcalloc for readability
> - Thank you Gregory and Hyeonggon for your review & feedback!
>
>  ...fs-kernel-mm-mempolicy-weighted-interleave |  36 ++++
>  drivers/acpi/numa/hmat.c                      |   1 +
>  drivers/base/node.c                           |   7 +
>  include/linux/mempolicy.h                     |   4 +
>  mm/mempolicy.c                                | 183 +++++++++++++++---
>  5 files changed, 209 insertions(+), 22 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> index 0b7972de04e9..edb2c1f4753f 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -23,3 +23,39 @@ Description:	Weight configuration interface for nodeN
>  		Writing an empty string or `0` will reset the weight to the
>  		system default. The system default may be set by the kernel
>  		or drivers at boot or during hotplug events.
> +
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/weightiness
> +Date:		December 2024
> +Contact:	Linux memory management mailing list
> +Description:	Weight limiting / scaling interface
> +
> +		"Weightiness": a measure of interleave aggression between
> +		memory nodes. Higher values lead to less interleaving (255:1),
> +		while lower values lead to more interleaving (1:1).

It might be better to explain here what low and high values of
weightiness imply, the way you described them in the changelog?

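To make the low/high behaviour concrete, here is a rough user-space sketch of how I read the scaling described above. It is not the patch's reduce_interleave_weights(): the helper name scale_weights(), the round-to-nearest rule, and the final GCD reduction are all my own assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Greatest common divisor; gcd_u(0, x) returns x. */
static unsigned int gcd_u(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Hypothetical helper (not from the patch): scale each node's bandwidth
 * into [1, weightiness] relative to the fastest node, rounding to the
 * nearest integer, then divide all weights by their common divisor.
 */
static void scale_weights(const uint64_t *bw, size_t n,
			  unsigned int weightiness, uint8_t *out)
{
	uint64_t max_bw = 0;
	unsigned int g = 0;
	size_t i;

	assert(weightiness >= 1 && weightiness <= 255);

	for (i = 0; i < n; i++)
		if (bw[i] > max_bw)
			max_bw = bw[i];

	for (i = 0; i < n; i++) {
		/* Without bandwidth data, every node falls back to weight 1. */
		uint64_t w = max_bw ?
			(bw[i] * weightiness + max_bw / 2) / max_bw : 1;

		if (w < 1)
			w = 1;	/* a node never gets weight 0 */
		out[i] = (uint8_t)w;
		g = gcd_u(g, out[i]);
	}

	/* Keep the ratio but use the smallest equivalent weights. */
	for (i = 0; i < n; i++)
		out[i] /= g;
}
```

With the 7639:1345:7 bandwidths from the changelog, this sketch gives 32:6:1 at weightiness 32, 255:45:1 at weightiness 255, and 1:1:1 at weightiness 1, i.e. higher values track bandwidth more closely while the slowest node is used less and less.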
> +		When this value is updated, all node weights are re-calculated
> +		to reflect the new weightiness. These re-calculated values
> +		overwrite all existing node weights, including those manually
> +		set by writing to the nodeN files.
> +
> +		Node weight re-calculation is performed by scaling down
> +		bandwidth values reported in the ACPI HMAT to the range
> +		[1, weightiness]. Note that re-calculation uses only the
> +		weightiness parameter and bandwidth values, and ignores all
> +		current node weights.
> +
> +		Minimum weight: 1
> +		Default value: 32
> +		Maximum weight: 255
> +
> +		Writing an empty string will set the value to be the default
> +		(32). Writing a value outside the valid range will return
> +		EINVAL and will not re-trigger a weight scaling.
> +
> +		If there is no bandwidth data in the ACPI HMAT, then this file
> +		will return ENODEV on an attempted write and perform no updates.
> +		Furthermore, if there is no bandwidth information available,
> +		all nodes' weights will default to 1.
> +
> +		Setting max_node_weight to 1 is equivalent to unweighted
> +		interleave.

s/max_node_weight/weightiness/

> @@ -3397,6 +3471,54 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
>
>  static struct iw_node_attr **node_attrs;
>
> +static ssize_t weightiness_show(struct kobject *kobj,
> +		struct kobj_attribute *attr, char *buf)
> +{
> +	return sysfs_emit(buf, "%d\n", weightiness);
> +}
> +
> +static ssize_t weightiness_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +	uint64_t *bw;
> +	u8 *old_iw, *new_iw;
> +	u8 new_weightiness;
> +
> +	if (count == 0 || sysfs_streq(buf, ""))
> +		new_weightiness = 32;
> +	else if (kstrtou8(buf, 0, &new_weightiness) || new_weightiness == 0)
> +		return -EINVAL;
> +
> +	new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
> +	if (!new_iw)
> +		return -ENOMEM;

Could you please use kcalloc here similar to mempolicy_set_node_perf()?

Otherwise the patch looks fine to me.
(I will add a review and test on the next revision.)

By the way, this might be out of scope, but let me ask for my own
learning.

We have a server with 2 sockets, each with local DRAM and CXL memory
attached (and thus 4 NUMA nodes). When accessing a remote socket's
memory (CXL or not), the bandwidth is limited by the interconnect's
bandwidth.

On this server, weighted interleaving should ideally be configured
within a socket (e.g. local NUMA node + local CXL node), because
weighted interleaving does not consider the bandwidth available when
memory is accessed from a remote socket.

So, the question is: on systems with multiple sockets (and CXL memory
attached to each socket), do you always assume the admin must bind
tasks to a specific socket for optimal performance, or is there any
plan to mitigate this problem without binding tasks to a socket?

> +
> +	mutex_lock(&iw_table_lock);
> +	bw = node_bw_table;
> +
> +	if (!bw) {
> +		mutex_unlock(&iw_table_lock);
> +		kfree(new_iw);
> +		return -ENODEV;
> +	}
> +
> +	weightiness = new_weightiness;
> +	old_iw = rcu_dereference_protected(iw_table,
> +					lockdep_is_held(&iw_table_lock));
> +
> +	reduce_interleave_weights(bw, new_iw);
> +	rcu_assign_pointer(iw_table, new_iw);
> +	mutex_unlock(&iw_table_lock);
> +
> +	synchronize_rcu();
> +	kfree(old_iw);
> +
> +	return count;
> +}
> +
> +static struct kobj_attribute wi_attr =
> +	__ATTR(weightiness, 0664, weightiness_show, weightiness_store);
> +
>  static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
>  				  struct kobject *parent)
>  {
> @@ -3413,6 +3535,7 @@ static void sysfs_wi_release(struct kobject *wi_kobj)
>
>  	for (i = 0; i < nr_node_ids; i++)
>  		sysfs_wi_node_release(node_attrs[i], wi_kobj);
> +
>  	kobject_put(wi_kobj);
>  }
>
> @@ -3454,6 +3577,15 @@ static int add_weight_node(int nid, struct kobject *wi_kobj)
>  	return 0;
>  }
>
> +static struct attribute *wi_default_attrs[] = {
> +	&wi_attr.attr,
> +	NULL
> +};
> +
> +static const struct attribute_group wi_attr_group = {
> +	.attrs = wi_default_attrs,
> +};
> +
>  static int add_weighted_interleave_group(struct kobject *root_kobj)
>  {
>  	struct kobject *wi_kobj;
> @@ -3470,6 +3602,13 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
>  		return err;
>  	}
>
> +	err = sysfs_create_group(wi_kobj, &wi_attr_group);
> +	if (err) {
> +		pr_err("failed to add sysfs [weightiness]\n");
> +		kobject_put(wi_kobj);
> +		return err;
> +	}
> +
>  	for_each_node_state(nid, N_POSSIBLE) {
>  		err = add_weight_node(nid, wi_kobj);
>  		if (err) {