From: Gregory Price
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, ying.huang@intel.com, hannes@cmpxchg.org,
 dan.j.williams@intel.com, dave.jiang@intel.com, Gregory Price
Subject: [RFC 1/1] mm/mempolicy: introduce system default interleave weights
Date: Tue, 20 Feb 2024 15:25:29 -0500
Message-Id: <20240220202529.2365-2-gregory.price@memverge.com>
X-Mailer: git-send-email 2.39.1
In-Reply-To: <20240220202529.2365-1-gregory.price@memverge.com>
References: <20240220202529.2365-1-gregory.price@memverge.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Startup and hotplug code may register HMAT data for memory devices.
Utilize this data to generate reasonable default weighted interleave
values.

Introduce `mempolicy_set_node_perf()`.
This function can be invoked from node and CXL code to have mempolicy
rebalance the system default interleave weights.

mempolicy_set_node_perf() caches each node's bandwidth (in this patch:
min(read_bw, write_bw)) and recalculates the weight associated with
each node. After weights are calculated, we use gcd() to reduce these
weights to the smallest values possible in an effort to interleave
more aggressively on smaller intervals.

For example, a 1-socket system with a CXL memory expander which exposes
224GB/s and 64GB/s of bandwidth respectively will end up with a weight
array of [7,2].

The downside of this approach is that some distributions may produce
large default values if they happen to include an unfortunate prime
number, or if any two values are co-prime.

Signed-off-by: Gregory Price
---
 drivers/acpi/numa/hmat.c  |   1 +
 drivers/base/node.c       |   7 +++
 include/linux/mempolicy.h |   4 ++
 mm/mempolicy.c            | 129 ++++++++++++++++++++++++++++++--------
 4 files changed, 116 insertions(+), 25 deletions(-)

diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index d6b85f0f6082..7935d387e001 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 1c05640461dd..30458df504b4 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -7,6 +7,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -214,6 +215,12 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
 			break;
 		}
 	}
+
+	/* When setting CPU access coordinates, update mempolicy */
+	if (access == ACCESS_COORDINATE_CPU) {
+		if (mempolicy_set_node_perf(nid, coord))
+			pr_info("failed to set node%d mempolicy attrs\n", nid);
+	}
 }
 
 /**
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 931b118336f4..d564e9e893ea 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -177,6 +178,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
 
 extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
 
+extern int mempolicy_set_node_perf(unsigned int node,
+				   struct access_coordinate *coords);
+
 #else
 
 struct mempolicy {};
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ba0b2b81bd08..0a82aa51e497 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -109,6 +109,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -139,31 +140,114 @@ static struct mempolicy default_policy = {
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
 /*
- * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
- * system-default value should be used. A NULL iw_table also denotes that
- * system-default values should be used. Until the system-default table
- * is implemented, the system-default is always 1.
+ * The interleave weight tables denote what weights should be used with
+ * the weighted interleave policy. There are two tables:
+ * - iw_table : the sysfs-set interleave weight table
+ * - default_iw_table : the system default interleave weight table.
  *
- * iw_table is RCU protected
+ * If the iw_table is NULL, default_iw_table values are used.
+ * If both tables are NULL, a minimum weight of 1 is always used.
+ * A value of 0 in the iw_table means the system default value will be used.
+ *
+ * iw_table and default_iw_table are RCU protected
+ * node_bw_table is protected by default_iwt_lock
+ *
+ * system startup and hotplug code may register node performance information
+ * via mempolicy_set_node_perf()
  */
+static unsigned long *node_bw_table;
+static u8 __rcu *default_iw_table;
+static DEFINE_MUTEX(default_iwt_lock);
+
 static u8 __rcu *iw_table;
 static DEFINE_MUTEX(iw_table_lock);
 
 static u8 get_il_weight(int node)
 {
-	u8 *table;
+	u8 *table, *default_table;
 	u8 weight;
 
 	rcu_read_lock();
 	table = rcu_dereference(iw_table);
-	/* if no iw_table, use system default */
-	weight = table ? table[node] : 1;
-	/* if value in iw_table is 0, use system default */
-	weight = weight ? weight : 1;
+	default_table = rcu_dereference(default_iw_table);
+	/* if no table pointers or value is 0, use system default or 1 */
+	weight = table ? table[node] : 0;
+	weight = weight ? weight : (default_table ? default_table[node] : 1);
 	rcu_read_unlock();
 	return weight;
 }
 
+int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
+{
+	unsigned long *old_bw, *new_bw;
+	unsigned long gcd_val;
+	u8 *old_iw, *new_iw;
+	uint64_t ttl_bw = 0;
+	int i;
+
+	new_bw = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
+	if (!new_bw)
+		return -ENOMEM;
+
+	new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
+	if (!new_iw) {
+		kfree(new_bw);
+		return -ENOMEM;
+	}
+
+	mutex_lock(&default_iwt_lock);
+	old_bw = node_bw_table;
+	old_iw = rcu_dereference_protected(default_iw_table,
+					   lockdep_is_held(&default_iwt_lock));
+
+	if (old_bw)
+		memcpy(new_bw, old_bw, nr_node_ids*sizeof(unsigned long));
+	new_bw[node] = min(coords->read_bandwidth, coords->write_bandwidth);
+
+	/* Now recalculate the bandwidth distribution given the new info */
+	for (i = 0; i < nr_node_ids; i++)
+		ttl_bw += new_bw[i];
+
+	/* If node is not set or has < 1% of total bw, use minimum value of 1 */
+	for (i = 0; i < nr_node_ids; i++) {
+		if (new_bw[i])
+			new_iw[i] = max((100 * new_bw[i] / ttl_bw), 1);
+		else
+			new_iw[i] = 1;
+	}
+	/*
+	 * Now attempt to aggressively reduce the interleave weights by GCD
+	 * We want smaller interleave intervals to have a better distribution
+	 * of memory, even on smaller memory regions. If weights are divisible
+	 * by each other, we can do some quick math to aggressively squash them.
+	 */
+reduce:
+	gcd_val = 0; /* gcd(0, x) == x, so the first set node seeds the value */
+	for (i = 0; i < nr_node_ids; i++) {
+		/* Skip nodes that haven't been set */
+		if (!new_bw[i])
+			continue;
+		gcd_val = gcd(gcd_val, new_iw[i]);
+		if (gcd_val == 1)
+			goto leave;
+	}
+	for (i = 0; i < nr_node_ids; i++) {
+		if (!new_bw[i])
+			continue;
+		new_iw[i] /= gcd_val;
+	}
+	/* repeat until we get a gcd of 1 */
+	goto reduce;
+leave:
+	node_bw_table = new_bw;
+	rcu_assign_pointer(default_iw_table, new_iw);
+	mutex_unlock(&default_iwt_lock);
+	synchronize_rcu();
+	kfree(old_bw);
+	kfree(old_iw);
+	return 0;
+}
+
 /**
  * numa_nearest_node - Find nearest node by state
  * @node: Node id to start the search
@@ -1983,7 +2067,7 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 {
 	nodemask_t nodemask;
 	unsigned int target, nr_nodes;
-	u8 *table;
+	u8 *table, *default_table;
 	unsigned int weight_total = 0;
 	u8 weight;
 	int nid;
@@ -1994,11 +2078,13 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 
 	rcu_read_lock();
 	table = rcu_dereference(iw_table);
+	default_table = rcu_dereference(default_iw_table);
 	/* calculate the total weight */
 	for_each_node_mask(nid, nodemask) {
 		/* detect system default usage */
-		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
+		weight = table ? table[nid] : 0;
+		weight = weight ? weight :
+			 (default_table ? default_table[nid] : 1);
 		weight_total += weight;
 	}
 
@@ -2007,8 +2093,9 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	nid = first_node(nodemask);
 	while (target) {
 		/* detect system default usage */
-		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
+		weight = table ? table[nid] : 0;
+		weight = weight ? weight :
+			 (default_table ? default_table[nid] : 1);
 		if (target < weight)
 			break;
 		target -= weight;
@@ -2391,7 +2478,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	unsigned long nr_allocated = 0;
 	unsigned long rounds;
 	unsigned long node_pages, delta;
-	u8 *table, *weights, weight;
+	u8 *weights, weight;
 	unsigned int weight_total = 0;
 	unsigned long rem_pages = nr_pages;
 	nodemask_t nodes;
@@ -2440,16 +2527,8 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	if (!weights)
 		return total_allocated;
 
-	rcu_read_lock();
-	table = rcu_dereference(iw_table);
-	if (table)
-		memcpy(weights, table, nr_node_ids);
-	rcu_read_unlock();
-
-	/* calculate total, detect system default usage */
 	for_each_node_mask(node, nodes) {
-		if (!weights[node])
-			weights[node] = 1;
+		weights[node] = get_il_weight(node);
 		weight_total += weights[node];
 	}
 
-- 
2.39.1
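
For readers who want to check the weight arithmetic from the commit
message outside the kernel, here is a minimal user-space sketch. It is
not part of the patch: sketch_default_weights() and gcd_u64() are
hypothetical names, it assumes a bandwidth of 0 means "node has not
reported", and it leaves out the locking and RCU handling entirely. It
only mirrors the two steps described above: each node's share of total
bandwidth as a floored percentage (minimum 1), then a GCD reduction
across the nodes that reported a bandwidth.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Euclid's algorithm; gcd_u64(0, x) == x, so 0 is a safe starting value. */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
	while (b) {
		uint64_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Hypothetical helper, for illustration only.
 * bw[i] == 0 is assumed to mean "node i has not reported a bandwidth".
 * Bandwidth values are assumed small enough that 100 * bw[i] cannot
 * overflow a uint64_t.
 */
static void sketch_default_weights(const uint64_t *bw, uint8_t *iw, size_t n)
{
	uint64_t total = 0, g = 0;
	size_t i;

	assert(bw != NULL && iw != NULL);

	for (i = 0; i < n; i++)
		total += bw[i];

	/* Share of total bandwidth as a floored percentage, minimum 1. */
	for (i = 0; i < n; i++) {
		uint64_t pct = bw[i] ? (100 * bw[i]) / total : 0;

		iw[i] = pct ? (uint8_t)pct : 1;
	}

	/* GCD across the nodes that reported a bandwidth. */
	for (i = 0; i < n; i++)
		if (bw[i])
			g = gcd_u64(g, iw[i]);

	/* One division pass is enough: afterwards the GCD is 1. */
	if (g > 1)
		for (i = 0; i < n; i++)
			if (bw[i])
				iw[i] /= g;
}
```

With bandwidths of 224 and 64, the floored percentages are 77 and 22,
their GCD is 11, and the reduced weights are [7,2], matching the example
in the commit message. The sketch divides once instead of looping,
because dividing every reporting node's weight by the common GCD always
leaves a GCD of 1.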