From: Honggyu Kim <honggyu.kim@sk.com>
To: "Huang, Ying" <ying.huang@linux.alibaba.com>,
Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: kernel_team@skhynix.com, gourry@gourry.net, hyeonggon.yoo@sk.com,
rafael@kernel.org, lenb@kernel.org, gregkh@linuxfoundation.org,
akpm@linux-foundation.org, rakie.kim@sk.com,
dan.j.williams@intel.com, Jonathan.Cameron@huawei.com,
dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-mm@kvack.org, kernel-team@meta.com
Subject: Re: [PATCH v3] Weighted interleave auto-tuning
Date: Tue, 21 Jan 2025 20:27:17 +0900 [thread overview]
Message-ID: <147aba6e-7e23-4ad1-9bd2-1ceac0f3d55b@sk.com> (raw)
In-Reply-To: <87ikq8h0w4.fsf@DESKTOP-5N7EMDA>
Hi Ying and Joshua,
On 1/21/2025 8:17 PM, Huang, Ying wrote:
> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
>
>> On machines with multiple memory nodes, interleaving page allocations
>> across nodes allows for better utilization of each node's bandwidth.
>> Previous work by Gregory Price [1] introduced weighted interleave, which
>> allowed for pages to be allocated across NUMA nodes according to
>> user-set ratios.
>>
>> Ideally, these weights should be proportional to their bandwidth, so
>> that under bandwidth pressure, each node uses its maximal efficient
>> bandwidth and prevents latency from increasing exponentially.
>>
>> At the same time, we want these weights to be as small as possible.
>> Having ratios that involve large co-prime numbers like 7639:1345:7 leads
>> to awkward and inefficient allocations, since the node with weight 7
>> will remain mostly unused (and despite being proportional to bandwidth,
>> will not aid in relieving the bandwidth pressure in the other two nodes).
>>
>> This patch introduces an auto-configuration mode for the interleave
>> weights that aims to balance the two goals of setting node weights to be
>> proportional to their bandwidths and keeping the weight values low.
>> In order to perform the weight re-scaling, we use an internal
>> "weightiness" value (fixed to 32) that defines interleave aggression.
>>
>> In this auto configuration mode, node weights are dynamically updated
>> every time there is a hotplug event that introduces new bandwidth.
>>
>> Users can also enter manual mode by writing "manual" to the new "mode"
>> sysfs interface. When a user enters manual mode, the system stops
>> dynamically updating any of the node weights, even during hotplug events
>> that can shift the optimal weight distribution. The system also enters
>> manual mode any time a user sets a node's weight by hand, using the
>> nodeN interface introduced in [1]. On the other hand, auto mode is
>> only entered by explicitly writing "auto" to the mode interface.
>>
>> There is one functional change that this patch makes to the existing
>> weighted_interleave ABI: previously, writing 0 directly to a nodeN
>> interface was said to reset the weight to the system default. Before
>> this patch, the default for all weights were 1, which meant that writing
>> 0 and 1 were functionally equivalent.
>>
>> This patch introduces "real" defaults, but we have decided to move away
>> from letting users use 0 as a "set to default" interface. Rather, users
>> who want to use system defaults should use "auto" mode. This patch seems
>> to be the appropriate place to make this change, since we would like to
>> remove this usage before users begin to rely on the feature in
>> userspace. Moreover, users will not be losing any functionality; they
>> can still write 1 into a node if they want a weight of 1. Thus, we
>> deprecate the "write zero to reset" feature in favor of returning an
>> error, the same way we would return an error when the user writes any
>> other invalid weight to the interface.
>>
>> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
>>
>> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
>> Co-developed-by: Gregory Price <gourry@gourry.net>
>> Signed-off-by: Gregory Price <gourry@gourry.net>
>> ---
>> Changelog
>> v3:
>> - Weightiness (max_node_weight) is now fixed to 32.
>> - Instead, the sysfs interface now exposes a "mode" parameter, which
>> can either be "auto" or "manual".
>> - Thank you Hyeonggon and Honggyu for the feedback.
>> - Documentation updated to reflect new sysfs interface, explicitly
>> specifies that 0 is invalid.
>> - Thank you Gregory and Ying for the discussion on how best to
>> handle the 0 case.
>> - Re-worked nodeN sysfs store to handle auto --> manual shifts
>> - mempolicy_set_node_perf internally handles the auto / manual
>> caes differently now. bw is always updated, iw updates depend on
>> what mode the user is in.
>> - Wordsmithing comments for clarity.
>> - Removed RFC tag.
>>
>> v2:
>> - Name of the interface is changed: "max_node_weight" --> "weightiness"
>> - Default interleave weight table no longer exists. Rather, the
>> interleave weight table is initialized with the defaults, if bandwidth
>> information is available.
>> - In addition, all sections that handle iw_table have been changed
>> to reference iw_table if it exists, otherwise defaulting to 1.
>> - All instances of unsigned long are converted to uint64_t to guarantee
>> support for both 32-bit and 64-bit machines
>> - sysfs initialization cleanup
>> - Documentation has been rewritten to explicitly outline expected
>> behavior and expand on the interpretation of "weightiness".
>> - kzalloc replaced with kcalloc for readability
>> - Thank you Gregory and Hyeonggon for your review & feedback!
>>
>> ...fs-kernel-mm-mempolicy-weighted-interleave | 30 ++-
>> drivers/acpi/numa/hmat.c | 1 +
>> drivers/base/node.c | 7 +
>> include/linux/mempolicy.h | 4 +
>> mm/mempolicy.c | 212 ++++++++++++++++--
>> 5 files changed, 227 insertions(+), 27 deletions(-)
>>
>> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>> index 0b7972de04e9..d30dc29c53ff 100644
>> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>> @@ -20,6 +20,30 @@ Description: Weight configuration interface for nodeN
>> Minimum weight: 1
>> Maximum weight: 255
>>
>> - Writing an empty string or `0` will reset the weight to the
>> - system default. The system default may be set by the kernel
>> - or drivers at boot or during hotplug events.
>> + Writing invalid values (i.e. any values not in [1,255],
>> + empty string, ...) will return -EINVAL.
>> +
>> +What: /sys/kernel/mm/mempolicy/weighted_interleave/mode
>> +Date: January 2025
>> +Contact: Linux memory management mailing list <linux-mm@kvack.org>
>> +Description: Auto-weighting configuration interface
>> +
>> + Configuration modes for weighted interleave. Can take one of
>> + two options: "manual" and "auto". Default is "auto".
>> +
>> + In auto mode, all node weights are re-calculated and overwritten
>> + (visible via the nodeN interfaces) whenever new bandwidth data
>> + is made available either during boot or hotplug events.
>> +
>> + In manual mode, node weights can only be updated by the user.
>> + If a node is hotplugged while the user is in manual mode,
>> + the node will have a default weight of 1.
>> +
>> + Modes can be changed by writing either "auto" or "manual" to the
>> + interface. All other strings will be ignored, and -EINVAL will
>> + be returned. If "auto" is written to the interface but the
>> + recalculation / updates fail at any point (-ENOMEM or -ENODEV)
>> + then the mode will remain in manual mode.
>> +
>> + Writing a new weight to a node directly via the nodeN interface
>> + will also automatically update the system to manual mode.
>
> IMHO, this interface is somewhat hard to be used. Users need to know
> which value is legal. So, this will become something like,
>
> $ cat mode
> auto [manual]
> $ echo auto > mode
> $ cat mode
> [auto] manual
This is exactly I internally proposed to Hyeonggon, but couldn't share
the idea directly here.
>
> Unless it's possible we will add more modes in the future, this is kind
> of overkill for me. How about something simpler as below?
>
> $ cat auto
> true
> $ echo 0 > auto
> $ cat auto
> false
That also makes sense, but I feel like somewhat vague what "auto" false
means. The "auto" might be better to be "use_hmat" instead and this
makes "use_hmat" false more meaningful. "use_hmat_weight" or
"use_hmat_info" might be another candidates.
Thanks,
Honggyu
>
>> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
>> index 80a3481c0470..cc94cba112dd 100644
>> --- a/drivers/acpi/numa/hmat.c
>> +++ b/drivers/acpi/numa/hmat.c
>> @@ -20,6 +20,7 @@
>> #include <linux/list_sort.h>
>> #include <linux/memregion.h>
>> #include <linux/memory.h>
>> +#include <linux/mempolicy.h>
>> #include <linux/mutex.h>
>> #include <linux/node.h>
>> #include <linux/sysfs.h>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index 0ea653fa3433..16e7a5a8ebe7 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -7,6 +7,7 @@
>> #include <linux/init.h>
>> #include <linux/mm.h>
>> #include <linux/memory.h>
>> +#include <linux/mempolicy.h>
>> #include <linux/vmstat.h>
>> #include <linux/notifier.h>
>> #include <linux/node.h>
>> @@ -214,6 +215,12 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
>> break;
>> }
>> }
>> +
>> + /* When setting CPU access coordinates, update mempolicy */
>> + if (access == ACCESS_COORDINATE_CPU) {
>> + if (mempolicy_set_node_perf(nid, coord))
>> + pr_info("failed to set node%d mempolicy attrs\n", nid);
>> + }
>> }
>> EXPORT_SYMBOL_GPL(node_set_perf_attrs);
>>
>> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
>> index ce9885e0178a..0fe96f3ab3ef 100644
>> --- a/include/linux/mempolicy.h
>> +++ b/include/linux/mempolicy.h
>> @@ -11,6 +11,7 @@
>> #include <linux/slab.h>
>> #include <linux/rbtree.h>
>> #include <linux/spinlock.h>
>> +#include <linux/node.h>
>> #include <linux/nodemask.h>
>> #include <linux/pagemap.h>
>> #include <uapi/linux/mempolicy.h>
>> @@ -178,6 +179,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
>>
>> extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
>>
>> +extern int mempolicy_set_node_perf(unsigned int node,
>> + struct access_coordinate *coords);
>> +
>> #else
>>
>> struct mempolicy {};
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index 04f35659717a..8777bd6229bc 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -109,6 +109,7 @@
>> #include <linux/mmu_notifier.h>
>> #include <linux/printk.h>
>> #include <linux/swapops.h>
>> +#include <linux/gcd.h>
>>
>> #include <asm/tlbflush.h>
>> #include <asm/tlb.h>
>> @@ -138,16 +139,18 @@ static struct mempolicy default_policy = {
>>
>> static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>>
>> +static uint64_t *node_bw_table;
>> +
>> /*
>> - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
>> - * system-default value should be used. A NULL iw_table also denotes that
>> - * system-default values should be used. Until the system-default table
>> - * is implemented, the system-default is always 1.
>> - *
>> + * iw_table is the interleave weight table.
>> + * If bandwiddth data is available and the user is in auto mode, the table
>> + * is populated with default values in [1,255].
>> * iw_table is RCU protected
>> */
>> static u8 __rcu *iw_table;
>> static DEFINE_MUTEX(iw_table_lock);
>> +static const int weightiness = 32;
>> +static bool weighted_interleave_auto = true;
>
> I still prefer to use 2 iw_table, one is for default, the other is for
> manual. The default one will be used if the manual one is NULL. Both
> are protected by RCU. The default one can be updated upon hotplug
> blindly. This makes the whole model easier to be understood IMHO.
>
> What do you think about that.
>
>> static u8 get_il_weight(int node)
>> {
>> @@ -156,14 +159,113 @@ static u8 get_il_weight(int node)
>>
>> rcu_read_lock();
>> table = rcu_dereference(iw_table);
>> - /* if no iw_table, use system default */
>> weight = table ? table[node] : 1;
>> - /* if value in iw_table is 0, use system default */
>> - weight = weight ? weight : 1;
>> rcu_read_unlock();
>> return weight;
>> }
>
> [snip]
>
> ---
> Best Regards,
> Huang, Ying
next prev parent reply other threads:[~2025-01-21 11:27 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-15 18:58 Joshua Hahn
2025-01-20 4:47 ` Hyeonggon Yoo
2025-01-21 21:24 ` Joshua Hahn
2025-01-21 11:17 ` Huang, Ying
2025-01-21 11:27 ` Honggyu Kim [this message]
2025-01-21 20:02 ` Gregory Price
2025-01-22 1:24 ` Huang, Ying
2025-02-05 5:34 ` Honggyu Kim
2025-01-21 19:56 ` Gregory Price
2025-01-22 1:37 ` Huang, Ying
2025-01-22 15:59 ` Joshua Hahn
2025-01-22 16:53 ` Gregory Price
2025-01-23 3:32 ` Huang, Ying
2025-01-24 5:58 ` Matthew Wilcox
2025-01-24 15:48 ` Gregory Price
2025-01-24 15:53 ` Gregory Price
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=147aba6e-7e23-4ad1-9bd2-1ceac0f3d55b@sk.com \
--to=honggyu.kim@sk.com \
--cc=Jonathan.Cameron@huawei.com \
--cc=akpm@linux-foundation.org \
--cc=dan.j.williams@intel.com \
--cc=dave.jiang@intel.com \
--cc=gourry@gourry.net \
--cc=gregkh@linuxfoundation.org \
--cc=hannes@cmpxchg.org \
--cc=horen.chuang@linux.dev \
--cc=hyeonggon.yoo@sk.com \
--cc=joshua.hahnjy@gmail.com \
--cc=kernel-team@meta.com \
--cc=kernel_team@skhynix.com \
--cc=lenb@kernel.org \
--cc=linux-acpi@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=rafael@kernel.org \
--cc=rakie.kim@sk.com \
--cc=ying.huang@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox