From: Joshua Hahn <joshua.hahnjy@gmail.com>
To: Yunjeong Mun <yunjeong.mun@sk.com>
Cc: honggyu.kim@sk.com, gregkh@linuxfoundation.org, rakie.kim@sk.com,
akpm@linux-foundation.org, rafael@kernel.org, lenb@kernel.org,
dan.j.williams@intel.com, Jonathan.Cameron@huawei.com,
dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-mm@kvack.org, kernel-team@meta.com,
kernel_team@skhynix.com
Subject: Re: [PATCH 1/2 v6] mm/mempolicy: Weighted Interleave Auto-tuning
Date: Fri, 28 Feb 2025 08:24:45 -0800
Message-ID: <20250228162447.3850305-1-joshua.hahnjy@gmail.com>
In-Reply-To: <20250228064016.1325-1-yunjeong.mun@sk.com>
On Fri, 28 Feb 2025 15:39:55 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
Hi Yunjeong, thank you for taking the time to review my work!
> Hi, Joshua.
>
> First of all I accidentally sent the wrong email a few hours ago.
> Please disregard it. Sorry for the confusion.
No worries at all!
> On Wed, 26 Feb 2025 13:35:17 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> [...snip...]
> >
> > +/*
> > + * Convert bandwidth values into weighted interleave weights.
> > + * Call with iw_table_lock.
> > + */
> > +static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
> > +{
> > + u64 sum_bw = 0;
> > + unsigned int cast_sum_bw, sum_iw = 0;
> > + unsigned int scaling_factor = 1, iw_gcd = 1;
> > + int nid;
> > +
> > + /* Recalculate the bandwidth distribution given the new info */
> > + for_each_node_state(nid, N_MEMORY)
> > + sum_bw += bw[nid];
> > +
> > + for (nid = 0; nid < nr_node_ids; nid++) {
> > + /* Set memoryless nodes' weights to 1 to prevent div/0 later */
> > + if (!node_state(nid, N_MEMORY)) {
> > + new_iw[nid] = 1;
> > + continue;
> > + }
> > +
> > + scaling_factor = 100 * bw[nid];
> > +
> > + /*
> > + * Try not to perform 64-bit division.
> > + * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
> > + * If sum_bw > scaling_factor, then bw[nid] is less than
> > + * 1% of the total bandwidth. Round up to 1%.
> > + */
> > + if (bw[nid] && sum_bw < scaling_factor) {
> > + cast_sum_bw = (unsigned int)sum_bw;
> > + new_iw[nid] = scaling_factor / cast_sum_bw;
> > + } else {
> > + new_iw[nid] = 1;
> > + }
> > + sum_iw += new_iw[nid];
> > + }
> > +
> > + /*
> > + * Scale each node's share of the total bandwidth from percentages
> > + * to whole numbers in the range [1, weightiness]
> > + */
> > + for_each_node_state(nid, N_MEMORY) {
> > + scaling_factor = weightiness * new_iw[nid];
> > + new_iw[nid] = max(scaling_factor / sum_iw, 1);
> > + if (nid == 0)
> > + iw_gcd = new_iw[0];
> > + iw_gcd = gcd(iw_gcd, new_iw[nid]);
> > + }
> > +
> > + /* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
> > + for_each_node_state(nid, N_MEMORY)
> > + new_iw[nid] /= iw_gcd;
> > +}
>
> In my understanding, new_iw[nid] values are scaled twice, first to 100 and then to a
> weightiness value of 32. I think this scaling can be done just once, directly
> to the weightiness value as follows:
Yes, you are correct. I want to provide a bit of context on how this
patch has changed over time: in the first few iterations of this patch,
"weightiness" was actually exposed as a sysfs interface that users could
tune to decide how aggressively the weights are scaled, trading off
between high values (better weight accuracy, but worse page allocation
distribution fairness) and small values (bigger rounding errors, but
better local fairness).
The reason why this matters is that we use a heuristic of "round any
weight that is less than 1% of the total weight sum up to 1%".
So if we have bandwidth ratios of 100 : 1000 : 3000 : 4000 : 6000,
we have a total of 14100. Then 100/14100 is only ~0.7%, and we would
want to round it up to 1% before moving on (since weights that are too
small don't end up helping). This problem only gets worse for machines
with more nodes, where it becomes possible for a node to have something
like 0.1% of the total bandwidth.
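Just to spell out the arithmetic, here is a rough userspace sketch (not
the kernel code, just the same integer math) of what the current first
pass does with those example bandwidths:

	/*
	 * Userspace-only illustration of the existing first pass:
	 * each node's share is expressed as a whole percentage, and
	 * anything under 1% is rounded up to 1.
	 */
	#include <stdio.h>

	int main(void)
	{
		unsigned int bw[] = { 100, 1000, 3000, 4000, 6000 };
		unsigned long long sum_bw = 0;
		unsigned int pct, i;

		for (i = 0; i < 5; i++)
			sum_bw += bw[i];		/* 14100 */

		for (i = 0; i < 5; i++) {
			unsigned long long sf = 100ULL * bw[i];

			/* 100 * 100 = 10000 < 14100, so ~0.7% becomes 1% */
			pct = sf > sum_bw ? sf / sum_bw : 1;
			printf("%u ", pct);		/* 1 7 21 28 42 */
		}
		printf("\n");
		return 0;
	}
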
When users could set weightiness to be up to 255, this was problematic,
because scenarios like 1:255:255:255:255... became possible, where we
allocate a single page from one node, then allocate 255 pages from each
of the remaining nr_node_ids - 1 nodes (which is, of course, not ideal).
However, with weightiness fixed at 32, maybe this heuristic makes less
sense, since the worst-case scenario looks like 1:32:32:32:32...
I think this proposed change makes a lot of sense. It does seem silly to
have to round twice, and without giving users the ability to set their
own weightiness value, rounding just once seems to be enough to prevent
the worst-case scenario. I will incorporate this into a v7.
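For what it's worth, here is a similarly rough userspace sketch of the
single pass you are proposing (scale each node's bandwidth share straight
into [1, weightiness], then reduce by the GCD). If I did the arithmetic
right, the example ratios above come out to 1:2:6:9:13:

	#include <stdio.h>

	/* Hand-rolled GCD, standing in for the kernel's gcd() helper */
	static unsigned int gcd(unsigned int a, unsigned int b)
	{
		while (b) {
			unsigned int t = a % b;

			a = b;
			b = t;
		}
		return a;
	}

	int main(void)
	{
		unsigned int bw[] = { 100, 1000, 3000, 4000, 6000 };
		unsigned int weightiness = 32, new_iw[5], iw_gcd = 0;
		unsigned long long sum_bw = 0;
		int i;

		for (i = 0; i < 5; i++)
			sum_bw += bw[i];

		for (i = 0; i < 5; i++) {
			unsigned long long sf =
				(unsigned long long)weightiness * bw[i];

			new_iw[i] = sf / sum_bw;
			if (!new_iw[i])
				new_iw[i] = 1;	/* worst case is now 1:32, not 1:255 */
			iw_gcd = iw_gcd ? gcd(iw_gcd, new_iw[i]) : new_iw[i];
		}

		for (i = 0; i < 5; i++)
			printf("%u ", new_iw[i] / iw_gcd);	/* 1 2 6 9 13 */
		printf("\n");
		return 0;
	}
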
I'm also going to wait a bit for more feedback to come in for this
version, so it may be a little while before I send v7 out :-)
Thanks again for your review and the proposed change. Have a great day!
Joshua
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 50cbb7c047fa..65a7e2baf161 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -176,47 +176,22 @@ static u8 get_il_weight(int node)
> static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
> {
> u64 sum_bw = 0;
> - unsigned int cast_sum_bw, sum_iw = 0;
> - unsigned int scaling_factor = 1, iw_gcd = 1;
> + unsigned int scaling_factor = 1, iw_gcd = 0;
> int nid;
>
> /* Recalculate the bandwidth distribution given the new info */
> for_each_node_state(nid, N_MEMORY)
> sum_bw += bw[nid];
>
> - for (nid = 0; nid < nr_node_ids; nid++) {
> [...snip...]
> - /*
> - * Try not to perform 64-bit division.
> - * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
> - * If sum_bw > scaling_factor, then bw[nid] is less than
> - * 1% of the total bandwidth. Round up to 1%.
> - */
> [...snip...]
> - sum_iw += new_iw[nid];
> - }
> -
>
> /*
> * Scale each node's share of the total bandwidth from percentages
> * to whole numbers in the range [1, weightiness]
> */
> for_each_node_state(nid, N_MEMORY) {
> - scaling_factor = weightiness * new_iw[nid];
> - new_iw[nid] = max(scaling_factor / sum_iw, 1);
> - if (nid == 0)
> - iw_gcd = new_iw[0];
> + scaling_factor = weightiness * bw[nid];
> + new_iw[nid] = max(scaling_factor / sum_bw, 1);
> + if (!iw_gcd)
> + iw_gcd = new_iw[nid];
> iw_gcd = gcd(iw_gcd, new_iw[nid]);
> }
>
> Please let me know how you think about this.
>
> Best regards,
> Yunjeong
Sent using hkml (https://github.com/sjp38/hackermail)