From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, corbet@lwn.net,
    akpm@linux-foundation.org, gregory.price@memverge.com, honggyu.kim@sk.com,
    rakie.kim@sk.com, hyeongtak.ji@sk.com, mhocko@kernel.org,
    vtavarespetr@micron.com, jgroves@micron.com, ravis.opensrc@micron.com,
    sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com,
    seungjun.ha@samsung.com, hannes@cmpxchg.org, dan.j.williams@intel.com,
    Srinivasulu Thanneeru
Subject: Re: [PATCH v4 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
In-Reply-To: <20240130182046.74278-4-gregory.price@memverge.com> (Gregory Price's message of "Tue, 30 Jan 2024 13:20:46 -0500")
References: <20240130182046.74278-1-gregory.price@memverge.com> <20240130182046.74278-4-gregory.price@memverge.com>
Date: Wed, 31 Jan 2024 14:43:12 +0800
Message-ID: <877cjqgfzz.fsf@yhuang6-desk2.ccr.corp.intel.com>

Gregory Price writes:

> When a system has multiple NUMA nodes and it becomes bandwidth hungry,
> using the current MPOL_INTERLEAVE could be a wise option.
>
> However, if those NUMA nodes consist of different types of memory such
> as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin
> based interleave policy does not optimally distribute data to make use
> of their different bandwidth characteristics.
>
> Instead, interleave is more effective when the allocation policy follows
> each NUMA node's bandwidth weight rather than a simple 1:1 distribution.
>
> This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
> enabling weighted interleave between NUMA nodes. Weighted interleave
> allows for proportional distribution of memory across multiple numa
> nodes, preferably apportioned to match the bandwidth of each node.
>
> For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
> with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate
> weight distribution is (2:1).
>
> Weights for each node can be assigned via the new sysfs extension:
> /sys/kernel/mm/mempolicy/weighted_interleave/
>
> For now, the default value of all nodes will be `1`, which matches
> the behavior of standard 1:1 round-robin interleave. An extension
> will be added in the future to allow default values to be registered
> at kernel and device bringup time.
>
> The policy allocates a number of pages equal to the set weights. For
> example, if the weights are (2,1), then 2 pages will be allocated on
> node0 for every 1 page allocated on node1.
>
> The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
> and mbind(2).
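(Not a review comment, just an illustration in case it helps readers of
the changelog: from userspace the new mode would be selected roughly as
in the untested sketch below. The fallback #define only mirrors the
position the new mode takes in the uapi enum in this patch.)

        #include <stdio.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #ifndef MPOL_WEIGHTED_INTERLEAVE
        #define MPOL_WEIGHTED_INTERLEAVE 6      /* MPOL_PREFERRED_MANY + 1 */
        #endif

        int main(void)
        {
                /* interleave this task's allocations across nodes 0 and 1 */
                unsigned long nodemask = (1UL << 0) | (1UL << 1);

                if (syscall(SYS_set_mempolicy, MPOL_WEIGHTED_INTERLEAVE,
                            &nodemask, sizeof(nodemask) * 8))
                        perror("set_mempolicy");

                /* with the sysfs weights for nodes 0 and 1 set to 2 and 1,
                 * subsequent allocations should follow a 2:1 pattern */
                return 0;
        }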
>
> Some high level notes about the pieces of weighted interleave:
>
> current->il_prev:
>     Default interleave uses this to track the last used node.
>     Weighted interleave uses this to track the *current* node, and
>     when weight reaches 0 it will be used to acquire the next node.
>
> current->il_weight:
>     The active weight of the current node (current->il_prev)
>     When this reaches 0, current->il_prev is set to the next node
>     and current->il_weight is set to the next weight.

I still think that my description of these two fields is easier to
understand: for weighted interleave, current->il_prev is the node that
the previous allocation was made from, and current->il_weight is the
remaining weight of current->il_prev after that allocation. But I will
not force you to use this; take it only if you think it is better.

> weighted_interleave_nodes:
>     Counts the number of allocations as they occur, and applies the
>     weight for the current node. When the weight reaches 0, switch
>     to the next node. Operates only on task->mempolicy.
>
> weighted_interleave_nid:
>     Gets the total weight of the nodemask as well as each individual
>     node weight, then calculates the node based on the given index.
>     Operates on VMA policies.
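If I read the implementation below correctly, a quick worked example of
the static (VMA) path: with nodes {0,1} and weights {2,1}, weight_total
is 3 and the interleave index is reduced modulo 3, so

        ilx % 3 == 0: target 0              -> node 0
        ilx % 3 == 1: target 1 < weight 2   -> node 0
        ilx % 3 == 2: target 2 >= weight 2  -> move past node 0 -> node 1

i.e. consecutive indices map to node0, node0, node1, node0, ... which is
the expected 2:1 pattern.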
>
> bulk_array_weighted_interleave:
>     Gets the total weight of the nodemask as well as each individual
>     node weight, then calculates the number of "interleave rounds" as
>     well as any delta ("partial round"). Calculates the number of
>     pages for each node and allocates them.
>
>     If a node was scheduled for interleave via interleave_nodes, the
>     current weight will be allocated first.
>
>     Operates only on the task->mempolicy.
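Again just as a sanity check on my reading of the code below: with nodes
{0,1}, weights {2,1}, no weight left over from a previous allocation,
and 10 pages requested (round-robin resuming at node 0):

        weight_total = 3, rounds = 10 / 3 = 3, delta = 10 % 3 = 1
        node 0: 2 * 3 + 1 (its share of the delta) = 7 pages
        node 1: 1 * 3                              = 3 pages
        resume state: node 0 with weight 2 - 1 = 1 remaining

which adds up to 10 pages, and the resume state matches the one page of
node 0's weight that has not been consumed yet.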
>
> One piece of complexity is the interaction between a recent refactor
> which split the logic to acquire the "ilx" (interleave index) of an
> allocation and the actual application of the interleave. If a call
> to alloc_pages_mpol() were made with a weighted-interleave policy and
> ilx set to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would
> operate on a VMA policy - violating the description above.
>
> An inspection of all callers of alloc_pages_mpol() shows that all
> external callers set ilx to `0`, an index value, or will call
> get_vma_policy() to acquire the ilx.
>
> For example, mm/shmem.c may call into alloc_pages_mpol. The call stacks
> all set (pgoff_t ilx) or end up in `get_vma_policy()`. This enforces
> the `weighted_interleave_nodes()` and `weighted_interleave_nid()`
> policy requirements (task/vma respectively).
>
> Suggested-by: Hasan Al Maruf
> Signed-off-by: Gregory Price
> Co-developed-by: Rakie Kim
> Signed-off-by: Rakie Kim
> Co-developed-by: Honggyu Kim
> Signed-off-by: Honggyu Kim
> Co-developed-by: Hyeongtak Ji
> Signed-off-by: Hyeongtak Ji
> Co-developed-by: Srinivasulu Thanneeru
> Signed-off-by: Srinivasulu Thanneeru
> Co-developed-by: Ravi Jonnalagadda
> Signed-off-by: Ravi Jonnalagadda
> ---
>  .../admin-guide/mm/numa_memory_policy.rst |   9 +
>  include/linux/sched.h                     |   1 +
>  include/uapi/linux/mempolicy.h            |   1 +
>  mm/mempolicy.c                            | 231 +++++++++++++++++-
>  4 files changed, 238 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index eca38fa81e0f..a70f20ce1ffb 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
>  	can fall back to all existing numa nodes. This is effectively
>  	MPOL_PREFERRED allowed for a mask rather than a single node.
>
> +MPOL_WEIGHTED_INTERLEAVE
> +	This mode operates the same as MPOL_INTERLEAVE, except that
> +	interleaving behavior is executed based on weights set in
> +	/sys/kernel/mm/mempolicy/weighted_interleave/
> +
> +	Weighted interleave allocates pages on nodes according to a
> +	weight. For example if nodes [0,1] are weighted [5,2], 5 pages
> +	will be allocated on node0 for every 2 pages allocated on node1.
> +
>  NUMA memory policy supports the following optional mode flags:
>
>  MPOL_F_STATIC_NODES
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ffe8f618ab86..b9ce285d8c9c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1259,6 +1259,7 @@ struct task_struct {
>  	/* Protected by alloc_lock: */
>  	struct mempolicy *mempolicy;
>  	short il_prev;
> +	u8 il_weight;
>  	short pref_node_fork;
>  #endif
>  #ifdef CONFIG_NUMA_BALANCING
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index a8963f7ef4c2..1f9bb10d1a47 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -23,6 +23,7 @@ enum {
>  	MPOL_INTERLEAVE,
>  	MPOL_LOCAL,
>  	MPOL_PREFERRED_MANY,
> +	MPOL_WEIGHTED_INTERLEAVE,
>  	MPOL_MAX,	/* always last member of enum */
>  };
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 3bdfaf03b660..7cd92f4ec0d7 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -19,6 +19,13 @@
>   * for anonymous memory. For process policy an process counter
>   * is used.
>   *
> + * weighted interleave
> + *                Allocate memory interleaved over a set of nodes based on
> + *                a set of weights (per-node), with normal fallback if it
> + *                fails. Otherwise operates the same as interleave.
> + *                Example: nodeset(0,1) & weights (2,1) - 2 pages allocated
> + *                on node 0 for every 1 page allocated on node 1.
> + *
>   * bind           Only allocate memory on a specific set of nodes,
>   *                no fallback.
>   * FIXME: memory is allocated starting with the first node
> @@ -441,6 +448,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
>  		.create = mpol_new_nodemask,
>  		.rebind = mpol_rebind_preferred,
>  	},
> +	[MPOL_WEIGHTED_INTERLEAVE] = {
> +		.create = mpol_new_nodemask,
> +		.rebind = mpol_rebind_nodemask,
> +	},
>  };
>
>  static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
> @@ -862,8 +873,11 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
>
>  	old = current->mempolicy;
>  	current->mempolicy = new;
> -	if (new && new->mode == MPOL_INTERLEAVE)
> +	if (new && (new->mode == MPOL_INTERLEAVE ||
> +		    new->mode == MPOL_WEIGHTED_INTERLEAVE)) {
>  		current->il_prev = MAX_NUMNODES-1;
> +		current->il_weight = 0;
> +	}
>  	task_unlock(current);
>  	mpol_put(old);
>  	ret = 0;
> @@ -888,6 +902,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
>  	case MPOL_INTERLEAVE:
>  	case MPOL_PREFERRED:
>  	case MPOL_PREFERRED_MANY:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		*nodes = pol->nodes;
>  		break;
>  	case MPOL_LOCAL:
> @@ -972,6 +987,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>  		} else if (pol == current->mempolicy &&
>  				pol->mode == MPOL_INTERLEAVE) {
>  			*policy = next_node_in(current->il_prev, pol->nodes);
> +		} else if (pol == current->mempolicy &&
> +				pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
> +			if (current->il_weight)
> +				*policy = current->il_prev;
> +			else
> +				*policy = next_node_in(current->il_prev,
> +						       pol->nodes);
>  		} else {
>  			err = -EINVAL;
>  			goto out;
> @@ -1336,7 +1358,8 @@ static long do_mbind(unsigned long start, unsigned long len,
>  		 * VMAs, the nodes will still be interleaved from the targeted
>  		 * nodemask, but one by one may be selected differently.
>  		 */
> -		if (new->mode == MPOL_INTERLEAVE) {
> +		if (new->mode == MPOL_INTERLEAVE ||
> +		    new->mode == MPOL_WEIGHTED_INTERLEAVE) {
>  			struct page *page;
>  			unsigned int order;
>  			unsigned long addr = -EFAULT;
> @@ -1784,7 +1807,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   * @vma: virtual memory area whose policy is sought
>   * @addr: address in @vma for shared policy lookup
>   * @order: 0, or appropriate huge_page_order for interleaving
> - * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE
> + * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE or
> + *       MPOL_WEIGHTED_INTERLEAVE
>   *
>   * Returns effective policy for a VMA at specified address.
>   * Falls back to current->mempolicy or system default policy, as necessary.
> @@ -1801,7 +1825,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  	pol = __get_vma_policy(vma, addr, ilx);
>  	if (!pol)
>  		pol = get_task_policy(current);
> -	if (pol->mode == MPOL_INTERLEAVE) {
> +	if (pol->mode == MPOL_INTERLEAVE ||
> +	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
>  		*ilx += vma->vm_pgoff >> order;
>  		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
>  	}
> @@ -1851,6 +1876,22 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
>  	return zone >= dynamic_policy_zone;
>  }
>
> +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> +{
> +	unsigned int node = current->il_prev;
> +
> +	if (!current->il_weight || !node_isset(node, policy->nodes)) {
> +		node = next_node_in(node, policy->nodes);
> +		/* can only happen if nodemask is being rebound */
> +		if (node == MAX_NUMNODES)
> +			return node;

I feel a little uneasy about reading policy->nodes here while it can be
written concurrently by a rebind. Would it be better to use a seqlock
to guarantee its consistency? It doesn't need to be a part of this
series, though.
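What I have in mind is something like the below -- only an untested
sketch, and "nodes_seq" would be a new seqlock_t in struct mempolicy
that the rebind code takes with write_seqlock():

        static unsigned int read_stable_policy_nodemask(struct mempolicy *pol,
                                                        nodemask_t *mask)
        {
                unsigned int seq;

                do {
                        seq = read_seqbegin(&pol->nodes_seq);
                        *mask = pol->nodes;
                } while (read_seqretry(&pol->nodes_seq, seq));

                return nodes_weight(*mask);
        }

Then weighted_interleave_nodes() could work on a stable local copy
instead of dereferencing policy->nodes directly.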
> +		current->il_prev = node;
> +		current->il_weight = get_il_weight(node);
> +	}
> +	current->il_weight--;
> +	return node;
> +}
> +
>  /* Do dynamic interleaving for a process */
>  static unsigned int interleave_nodes(struct mempolicy *policy)
>  {
> @@ -1885,6 +1926,9 @@ unsigned int mempolicy_slab_node(void)
>  	case MPOL_INTERLEAVE:
>  		return interleave_nodes(policy);
>
> +	case MPOL_WEIGHTED_INTERLEAVE:
> +		return weighted_interleave_nodes(policy);
> +
>  	case MPOL_BIND:
>  	case MPOL_PREFERRED_MANY:
>  	{
> @@ -1923,6 +1967,45 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
>  	return nodes_weight(*mask);
>  }
>
> +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> +{
> +	nodemask_t nodemask;
> +	unsigned int target, nr_nodes;
> +	u8 __rcu *table;
> +	unsigned int weight_total = 0;
> +	u8 weight;
> +	int nid;
> +
> +	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
> +	if (!nr_nodes)
> +		return numa_node_id();
> +
> +	rcu_read_lock();
> +	table = rcu_dereference(iw_table);
> +	/* calculate the total weight */
> +	for_each_node_mask(nid, nodemask) {
> +		/* detect system default usage */
> +		weight = table ? table[nid] : 1;
> +		weight = weight ? weight : 1;
> +		weight_total += weight;
> +	}
> +
> +	/* Calculate the node offset based on totals */
> +	target = ilx % weight_total;
> +	nid = first_node(nodemask);
> +	while (target) {
> +		/* detect system default usage */
> +		weight = table ? table[nid] : 1;
> +		weight = weight ? weight : 1;

This "detect system default usage" pattern is duplicated several times
in this patch. Can we define a small helper to remove the duplication,
something like:

        u8 __get_il_weight(u8 *table, int nid)
        {
                u8 weight;

                weight = table ? table[nid] : 1;
                return weight ? : 1;
        }

This could also be used in alloc_pages_bulk_array_weighted_interleave()
when copying from the global table to the local weights array. But this
isn't a big deal; I will leave it to you to decide.
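To illustrate that last point: with such a helper the bulk path could
drop the memcpy() and fill its local weights array in one pass, roughly
(untested):

        	rcu_read_lock();
        	table = rcu_dereference(iw_table);
        	for_each_node_mask(node, nodes) {
        		weights[node] = __get_il_weight(table, node);
        		weight_total += weights[node];
        	}
        	rcu_read_unlock();

That also avoids touching entries for nodes that are not in the mask.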
> +		if (target < weight)
> +			break;
> +		target -= weight;
> +		nid = next_node_in(nid, nodemask);
> +	}
> +	rcu_read_unlock();
> +	return nid;
> +}
> +
>  /*
>   * Do static interleaving for interleave index @ilx. Returns the ilx'th
>   * node in pol->nodes (starting from ilx=0), wrapping around if ilx
> @@ -1983,6 +2066,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
>  		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
>  			interleave_nodes(pol) : interleave_nid(pol, ilx);
>  		break;
> +	case MPOL_WEIGHTED_INTERLEAVE:
> +		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
> +			weighted_interleave_nodes(pol) :
> +			weighted_interleave_nid(pol, ilx);
> +		break;
>  	}
>
>  	return nodemask;
> @@ -2044,6 +2132,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
>  	case MPOL_PREFERRED_MANY:
>  	case MPOL_BIND:
>  	case MPOL_INTERLEAVE:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		*mask = mempolicy->nodes;
>  		break;
>
> @@ -2144,6 +2233,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>  	 * node in its nodemask, we allocate the standard way.
>  	 */
>  	if (pol->mode != MPOL_INTERLEAVE &&
> +	    pol->mode != MPOL_WEIGHTED_INTERLEAVE &&
>  	    (!nodemask || node_isset(nid, *nodemask))) {
>  		/*
>  		 * First, try to allocate THP only on local node, but
> @@ -2279,6 +2369,127 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
>  	return total_allocated;
>  }
>
> +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> +		struct mempolicy *pol, unsigned long nr_pages,
> +		struct page **page_array)
> +{
> +	struct task_struct *me = current;
> +	unsigned long total_allocated = 0;
> +	unsigned long nr_allocated = 0;
> +	unsigned long rounds;
> +	unsigned long node_pages, delta;
> +	u8 __rcu *table, *weights, weight;
> +	unsigned int weight_total = 0;
> +	unsigned long rem_pages = nr_pages;
> +	nodemask_t nodes;
> +	int nnodes, node, next_node;
> +	int resume_node = MAX_NUMNODES - 1;
> +	u8 resume_weight = 0;
> +	int prev_node;
> +	int i;
> +
> +	if (!nr_pages)
> +		return 0;
> +
> +	nnodes = read_once_policy_nodemask(pol, &nodes);
> +	if (!nnodes)
> +		return 0;
> +
> +	/* Continue allocating from most recent node and adjust the nr_pages */
> +	node = me->il_prev;
> +	weight = me->il_weight;
> +	if (weight && node_isset(node, nodes)) {
> +		node_pages = min(rem_pages, weight);
> +		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> +						  NULL, page_array);
> +		page_array += nr_allocated;
> +		total_allocated += nr_allocated;
> +		/* if that's all the pages, no need to interleave */
> +		if (rem_pages < weight) {
> +			/* stay on current node, adjust il_weight */
> +			me->il_weight -= rem_pages;
> +			return total_allocated;
> +		} else if (rem_pages == weight) {
> +			/* move to next node / weight */
> +			me->il_prev = next_node_in(node, nodes);
> +			me->il_weight = get_il_weight(next_node);
> +			return total_allocated;
> +		}
> +		/* Otherwise we adjust remaining pages, continue from there */
> +		rem_pages -= weight;
> +	}
> +	/* clear active weight in case of an allocation failure */
> +	me->il_weight = 0;
> +	prev_node = node;
> +
> +	/* create a local copy of node weights to operate on outside rcu */
> +	weights = kzalloc(nr_node_ids, GFP_KERNEL);
> +	if (!weights)
> +		return total_allocated;
> +
> +	rcu_read_lock();
> +	table = rcu_dereference(iw_table);
> +	if (table)
> +		memcpy(weights, table, nr_node_ids);
> +	rcu_read_unlock();
> +
> +	/* calculate total, detect system default usage */
> +	for_each_node_mask(node, nodes) {
> +		if (!weights[node])
> +			weights[node] = 1;
> +		weight_total += weights[node];
> +	}
> +
> +	/*
> +	 * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
> +	 * Track which node weighted interleave should resume from.
> +	 *
> +	 * if (rounds > 0) and (delta == 0), resume_node will always be
> +	 * the node following prev_node and its weight.
> +	 */
> +	rounds = rem_pages / weight_total;
> +	delta = rem_pages % weight_total;
> +	resume_node = next_node_in(prev_node, nodes);
> +	resume_weight = weights[resume_node];
> +	for (i = 0; i < nnodes; i++) {
> +		node = next_node_in(prev_node, nodes);
> +		weight = weights[node];
> +		node_pages = weight * rounds;
> +		/* If a delta exists, add this node's portion of the delta */
> +		if (delta > weight) {
> +			node_pages += weight;
> +			delta -= weight;
> +		} else if (delta) {
> +			node_pages += delta;
> +			/* delta may deplete on a boundary or w/ a remainder */
> +			if (delta == weight) {
> +				/* boundary: resume from next node/weight */
> +				resume_node = next_node_in(node, nodes);
> +				resume_weight = weights[resume_node];
> +			} else {
> +				/* remainder: resume this node w/ remainder */
> +				resume_node = node;
> +				resume_weight = weight - delta;
> +			}

If we are comfortable with leaving resume_weight == 0, the branch above
can be simplified to just:

        resume_node = node;
        resume_weight = weight - delta;

But this is a style issue again; I will leave it to you to decide.
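In context that would read like the below (untested). Leaving
resume_weight at 0 just means the next allocation advances to the next
node and loads its weight, which is what the removed "boundary" branch
did explicitly:

        		} else if (delta) {
        			node_pages += delta;
        			/* resume from this node; if delta == weight
        			 * the leftover weight is 0 and the next
        			 * allocation moves to the next node anyway */
        			resume_node = node;
        			resume_weight = weight - delta;
        			delta = 0;
        		}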
So, except for the issue you already pointed out, the whole series
looks good to me! Thanks! Feel free to add

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

to the whole series.

> +			delta = 0;
> +		}
> +		/* node_pages can be 0 if an allocation fails and rounds == 0 */
> +		if (!node_pages)
> +			break;
> +		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> +						  NULL, page_array);
> +		page_array += nr_allocated;
> +		total_allocated += nr_allocated;
> +		if (total_allocated == nr_pages)
> +			break;
> +		prev_node = node;
> +	}
> +	me->il_prev = resume_node;
> +	me->il_weight = resume_weight;
> +	kfree(weights);
> +	return total_allocated;
> +}
> +
>  static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
>  		struct mempolicy *pol, unsigned long nr_pages,
>  		struct page **page_array)
> @@ -2319,6 +2530,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
>  		return alloc_pages_bulk_array_interleave(gfp, pol,
>  							 nr_pages, page_array);
>
> +	if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
> +		return alloc_pages_bulk_array_weighted_interleave(
> +				gfp, pol, nr_pages, page_array);
> +
>  	if (pol->mode == MPOL_PREFERRED_MANY)
>  		return alloc_pages_bulk_array_preferred_many(gfp,
>  				numa_node_id(), pol, nr_pages, page_array);
> @@ -2394,6 +2609,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
>  	case MPOL_INTERLEAVE:
>  	case MPOL_PREFERRED:
>  	case MPOL_PREFERRED_MANY:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		return !!nodes_equal(a->nodes, b->nodes);
>  	case MPOL_LOCAL:
>  		return true;
> @@ -2530,6 +2746,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
>  		polnid = interleave_nid(pol, ilx);
>  		break;
>
> +	case MPOL_WEIGHTED_INTERLEAVE:
> +		polnid = weighted_interleave_nid(pol, ilx);
> +		break;
> +
>  	case MPOL_PREFERRED:
>  		if (node_isset(curnid, pol->nodes))
>  			goto out;
> @@ -2904,6 +3124,7 @@ static const char * const policy_modes[] =
>  	[MPOL_PREFERRED] = "prefer",
>  	[MPOL_BIND] = "bind",
>  	[MPOL_INTERLEAVE] = "interleave",
> +	[MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
>  	[MPOL_LOCAL] = "local",
>  	[MPOL_PREFERRED_MANY] = "prefer (many)",
>  };
> @@ -2963,6 +3184,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
>  		}
>  		break;
>  	case MPOL_INTERLEAVE:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		/*
>  		 * Default to online nodes with memory if no nodelist
>  		 */
> @@ -3073,6 +3295,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
>  	case MPOL_PREFERRED_MANY:
>  	case MPOL_BIND:
>  	case MPOL_INTERLEAVE:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		nodes = pol->nodes;
>  		break;
>  	default:

--
Best Regards,
Huang, Ying