From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id C2992C10DC3
	for <linux-mm@archiver.kernel.org>; Sat,  9 Dec 2023 06:59:42 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 37FFA6B0074; Sat,  9 Dec 2023 01:59:42 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 32E346B0075; Sat,  9 Dec 2023 01:59:42 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 1818E6B007D; Sat,  9 Dec 2023 01:59:42 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17])
	by kanga.kvack.org (Postfix) with ESMTP id 033556B0074
	for <linux-mm@kvack.org>; Sat,  9 Dec 2023 01:59:42 -0500 (EST)
Received: from smtpin11.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay04.hostedemail.com (Postfix) with ESMTP id D8D1A1A0296
	for <linux-mm@kvack.org>; Sat,  9 Dec 2023 06:59:41 +0000 (UTC)
X-FDA: 81546379362.11.7881E09
Received: from mail-yb1-f194.google.com (mail-yb1-f194.google.com [209.85.219.194])
	by imf05.hostedemail.com (Postfix) with ESMTP id 18281100009
	for <linux-mm@kvack.org>; Sat,  9 Dec 2023 06:59:39 +0000 (UTC)
Authentication-Results: imf05.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b=dOh7qQe0;
	dmarc=pass (policy=none) header.from=gmail.com;
	spf=pass (imf05.hostedemail.com: domain of gourry.memverge@gmail.com designates 209.85.219.194 as permitted sender) smtp.mailfrom=gourry.memverge@gmail.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1702105180;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=kJfHcOqh+1VXL6BeGt8uno6KhS8AZJyMuELbDvAkwCQ=;
	b=gG9U6hSLhc0yZqZ0h1dc/Y1n05xUHU13IIx2dxkXIryVN3AWJ2YkaHw3vEDQdGYkFGwy4a
	6CJLHp4JR+BQrPKldDhp8kx0wy52Sgv91qxxL9w4WvqAU9HKAiKnxuN5AGTzRnywFQK8vL
	3T2jeGSPhS1MKiW9ePohJ3UVl71j6z4=
ARC-Authentication-Results: i=1;
	imf05.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b=dOh7qQe0;
	dmarc=pass (policy=none) header.from=gmail.com;
	spf=pass (imf05.hostedemail.com: domain of gourry.memverge@gmail.com designates 209.85.219.194 as permitted sender) smtp.mailfrom=gourry.memverge@gmail.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1702105180; a=rsa-sha256;
	cv=none;
	b=QYy7WFsRkUz3F77FFSUUZaIpJuAgwlFHzJMw9bjNoySYWJ0Xx7oFVvx5QyigLb1Jq08q1b
	l0KUkxjUXj+h5f1Yjoyhn8oJ9S1A7PYZ5T1hPTWDWw18Mb4JofEswgIJuwR9iiLZoyBtiw
	26ZU/PIpUoEvG7tZN3U2pbi+gY95D2U=
Received: by mail-yb1-f194.google.com with SMTP id 3f1490d57ef6-db5e5647c24so3344417276.1
        for <linux-mm@kvack.org>; Fri, 08 Dec 2023 22:59:39 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1702105179; x=1702709979; darn=kvack.org;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:from:to:cc:subject:date
         :message-id:reply-to;
        bh=kJfHcOqh+1VXL6BeGt8uno6KhS8AZJyMuELbDvAkwCQ=;
        b=dOh7qQe0eDS0juLO3n8GSOeb/rysMAIcv2OxHV2IS9EdI98Yr24ELqp+yAZqiVfqkc
         dRa/1aUVQtQuTAd1E+N8QwOE0TAnjcTzN+QvbkdbM4LN+7qFVWql/9+8itpJ/82gskas
         Or7AVfPbiPuNkw1VGoRBIphZdvV5yWAGrSgivUe5ShQHzDk2Pb/0Z+NkkESMG5ItnR0Z
         CDqyilRJnyrPycZnJKOoXPkEU3IA8DL9qxPSH5wFY1Po2kyAJuMkUr+zywa4aBaiz/AV
         PyZTlYYkSW2/qC/hTFIT43Hgh2dectNqjFcTGu+bOmlaB508plJRDjgPuveyBPZ1eMIW
         3g9g==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1702105179; x=1702709979;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=kJfHcOqh+1VXL6BeGt8uno6KhS8AZJyMuELbDvAkwCQ=;
        b=p4oVvjJuAcfW9/n4ewjUS0T8gJXvDWnARlXjZemp0QApb6uN4FPBk8sllQD5owWipL
         I6dpgrEVlIC3miaGaRIxqKZnck3gYpCFTtGZLgaZY1T2HeSwlYQepVJwnoePSE7hR1xy
         2Q5K+hnvOGoyujI3JjKzKqPWUDrnoGr2k3WsM+5X29iMJBDdeVMn+iTznMsoIGJjmwk0
         gnH4i25zTdszvS4oY/PLhLp/MIr6Hrf2lHQQgx301NYTNXZreqqJ23LQdmFXcQkC+scj
         qfiDOrMbbjkHvByF3NFsx7WOF2nhkJAIQw6p3VZZ5I7GdK4aOaFX5EfAvJiuJE+d4GgU
         1OZA==
X-Gm-Message-State: AOJu0YyBhQRSiBDxzmI5X6Ky0cBpPME2vzftp3D1CMHiYHkIwYxiEsu4
	8IcpPtjbCH6bjxjPbI/aa5HfiXX+8H9T
X-Google-Smtp-Source: AGHT+IHem49DcoGcs2D+xaYZ/BXuVuRdWD2QqTeF3fRSLx7GEMQSsEB+52OPmkNaYCUrWg8SJLhQvQ==
X-Received: by 2002:a0d:e802:0:b0:5d7:1940:3f03 with SMTP id r2-20020a0de802000000b005d719403f03mr949706ywe.52.1702105178928;
        Fri, 08 Dec 2023 22:59:38 -0800 (PST)
Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. [173.79.56.208])
        by smtp.gmail.com with ESMTPSA id x8-20020a81b048000000b005df5d592244sm326530ywk.78.2023.12.08.22.59.37
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Fri, 08 Dec 2023 22:59:38 -0800 (PST)
From: Gregory Price <gourry.memverge@gmail.com>
X-Google-Original-From: Gregory Price <gregory.price@memverge.com>
To: linux-mm@kvack.org
Cc: linux-doc@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,
	linux-api@vger.kernel.org,
	linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org,
	arnd@arndb.de,
	tglx@linutronix.de,
	luto@kernel.org,
	mingo@redhat.com,
	bp@alien8.de,
	dave.hansen@linux.intel.com,
	x86@kernel.org,
	hpa@zytor.com,
	mhocko@kernel.org,
	tj@kernel.org,
	ying.huang@intel.com,
	gregory.price@memverge.com,
	corbet@lwn.net,
	rakie.kim@sk.com,
	hyeongtak.ji@sk.com,
	honggyu.kim@sk.com,
	vtavarespetr@micron.com,
	peterz@infradead.org,
	jgroves@micron.com,
	ravis.opensrc@micron.com,
	sthanneeru@micron.com,
	emirakhur@micron.com,
	Hasan.Maruf@amd.com,
	seungjun.ha@samsung.com,
	Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Subject: [PATCH v2 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
Date: Sat,  9 Dec 2023 01:59:22 -0500
Message-Id: <20231209065931.3458-3-gregory.price@memverge.com>
X-Mailer: git-send-email 2.39.1
In-Reply-To: <20231209065931.3458-1-gregory.price@memverge.com>
References: <20231209065931.3458-1-gregory.price@memverge.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Rspamd-Queue-Id: 18281100009
X-Rspam-User: 
X-Rspamd-Server: rspam02
X-Stat-Signature: ms3yiubuqbop15qn9cx9w48noaw4zt9a
X-HE-Tag: 1702105179-775298
X-HE-Meta: U2FsdGVkX18d+tk1BoRquHYDo9l+lwhDhw81BIHAuErF7aSzbJZIgaqe+zw9Fkn6/ObtIJJaRoUnMoUqi5TTNdoOJl40ypZkmp6s5ut2WSrKtPqqk6lV19gNL/3Wb3vG/D+ybxkXpUZUfQllVdKRTgK8cZCm2j1WAMyOGkpLuYYSWKriHp6/l/fNbP3OAa45p2AWqdv7HyLaSA4OytFx7nlxRFx/qLwdwW4A90oWaxs1pBX4s2+QZGb/e6CB0phqnGOe1Gko0ItNYVVMa/dpUK7uqsZnc4/HTBN1d10JRO/5uAWPdyjzVpc3KHiOejWDhWbsXe85Bb8/XueDwz5pUI72CVAtv40raPdJvt6RPFaUy+CmyZShQ5xSpmIuF1+r4Zl9Zj89MiYPrUt7xLd2Q480vqiD5fwRAGByMnkcvmf2lebVV/kSJGQSnl6ewO9J45rVLeI6v/c7e7uR//dUDfgDmF56rR0glVLh56bD4Rd+4AMqZNFL9howLxeVwPl8AwIM9JUkkik6ms+w2UYA7ZFFVa1/TAdgrv3bZ0VSjggYPzW5gAg99TCO2mA/rfz9sikwtoiVssT+uwxyAhavc2hCQyftvSWg/C578h7dZZyCtYw/zoonzYLEErZsMmS6xHbdpvcOX0tWnyItJPG1PhwZMlXl1x6+TvHRARZieEAkfB8aaBxzYMyISN5+HjcoqLWadN2kMxrs+bes5+bFch2vqj9NJy1zYs8GyFkqkmk2Dgf7ScUE2NmhIfQjWjQ2U2OHrXN6tJI//edf2fAt+gN64oQoPD9fREjxFa+6d7SDtlJUq4XCIvwDGcEZNw3/FoCRCXT4hwGGdsUi2BBouzVAj5najjM41kGY2XAD3UiDuS/YJTwAR8YSF0sR5eMhtaAkIclVcnyKSV1LYFF2nwSsqD9J5JCKHnEZNn450GUbUlyMMHQXCBzqsXGkvusH7db5lcDEP3nFNmXY180
 86YvZhVW
 WXpYtyf1t0/18q4fWoFPU4llklk1bLHjysJSDUkWyQeRa+289WaNoLSHkjCv/OxCezJHEQ7qjWrNGKAXvSiL/8hg7DCnMUdU1H6QfijERKfhgrswNx8nOuOFW6lIIwGogQZIAJQXOldqG09nci1b+GMUoJ5T35DYpYOHrTimvAOB6MxIKxEIoJEKR4PaBgL5un98sp5og4/9f9n2oMXjG2wEx1Pr5SpxUWXzm5g6Q2D+Tt3p5Y8CVIZbbYt08IrCl4lhgAFQXuMizqe5H1Ki/qA8zaD93h8s+pKYlAmhf7C04lufGX+ykAmMnCx6I7aJxa866dPdcEYYBJiGALyM6w8L1fb/cXIa0MJQJEKs8RtaAvYpz1WC0FLptBksDZHrbNu1oZV/DTtaxmXjLaSeS7haRSm1IdBd5pIE8vOiaELB9AX4JrDEQmK7Hiwqoa8I/O7bEkR+W8YS/EfY382CakFEICFDSewBsacog++PZqGaU19y1eaYm7gEm89O2Vug7l6+eqTxCUlWcgcbxo5zgV2SMbKLrgBcTukXQx8pvdZptwQKN5dkuWjJ+Vt3hCfSU3Qa360LetXIaGOZ1q4b9pvx35cWV7Mcl0MWkQOlMQNoYYGD8suOxAon7X+8rKncDKv1tzXoDMHN9EmapQ1WOipamLw/RMNNJtnSVk2qdevem0aeVWuumWr8HxvgK/fV3rG7naR3ie2waDNg=
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

From: Rakie Kim <rakie.kim@sk.com>

When a system has multiple NUMA nodes and it becomes bandwidth hungry,
the current MPOL_INTERLEAVE could be an wise option.

However, if those NUMA nodes consist of different types of memory such
as having local DRAM and CXL memory together, the current round-robin
based interleaving policy doesn't maximize the overall bandwidth because
of their different bandwidth characteristics.

Instead, the interleaving can be more efficient when the allocation
policy follows each NUMA nodes' bandwidth weight rather than having 1:1
round-robin allocation.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, which
enables weighted interleaving between NUMA nodes.  Weighted interleave
allows for a proportional distribution of memory across multiple numa
nodes, preferablly apportioned to match the bandwidth capacity of each
node from the perspective of the accessing node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with a relative bandwidth of (100GB/s, 50GB/s) respectively, the
appropriate weight distribution is (2:1).

Weights will be acquired from the global weight array exposed by the
sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/

The policy will then allocate the number of pages according to the
set weights.  For example, if the weights are (2,1), then 2 pages
will be allocated on node0 for every 1 page allocated on node1.

The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).

There are 3 integration points:

weighted_interleave_nodes:
    Counts the number of allocations as they occur, and applies the
    weight for the current node.  When the weight reaches 0, switch
    to the next node. Applied by `mempolicy_slab_node()` and
    `policy_nodemask()`

weighted_interleave_nid:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the node based on the given index.
    Applied by `policy_nodemask()` and `mpol_misplaced()`

bulk_array_weighted_interleave:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the number of "interleave rounds" as
    well as any delta ("partial round").  Calculates the number of
    pages for each node and allocates them.

    If a node was scheduled for interleave via interleave_nodes, the
    current weight (pol->cur_weight) will be allocated first, before
    the remaining bulk calculation is done. This simplifies the
    calculation at the cost of an additional allocation call.

One piece of complexity is the interaction between a recent refactor
which split the logic to acquire the "ilx" (interleave index) of an
allocation and the actually application of the interleave.  The
calculation of the `interleave index` is done by `get_vma_policy()`,
while the actual selection of the node will be later applied by the
relevant weighted_interleave function.

Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Co-developed-by: Gregory Price <gregory.price@memverge.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     |  11 +
 include/linux/mempolicy.h                     |   5 +
 include/uapi/linux/mempolicy.h                |   1 +
 mm/mempolicy.c                                | 197 +++++++++++++++++-
 4 files changed, 211 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index eca38fa81e0f..d2c8e712785b 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -250,6 +250,17 @@ MPOL_PREFERRED_MANY
 	can fall back to all existing numa nodes. This is effectively
 	MPOL_PREFERRED allowed for a mask rather than a single node.
 
+MPOL_WEIGHTED_INTERLEAVE
+	This mode operates the same as MPOL_INTERLEAVE, except that
+	interleaving behavior is executed based on weights set in
+	/sys/kernel/mm/mempolicy/weighted_interleave/
+
+	Weighted interleave allocations pages on nodes according to
+	their weight.  For example if nodes [0,1] are weighted [5,2]
+	respectively, 5 pages will be allocated on node0 for every
+	2 pages allocated on node1.  This can better distribute data
+	according to bandwidth on heterogeneous memory systems.
+
 NUMA memory policy supports the following optional mode flags:
 
 MPOL_F_STATIC_NODES
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 931b118336f4..ba09167e80f7 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -54,6 +54,11 @@ struct mempolicy {
 		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
 		nodemask_t user_nodemask;	/* nodemask passed by user */
 	} w;
+
+	/* Weighted interleave settings */
+	struct {
+		unsigned char cur_weight;
+	} wil;
 };
 
 /*
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index a8963f7ef4c2..1f9bb10d1a47 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -23,6 +23,7 @@ enum {
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
 	MPOL_PREFERRED_MANY,
+	MPOL_WEIGHTED_INTERLEAVE,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 28dfae195beb..b4d94646e6a2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -305,6 +305,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	policy->mode = mode;
 	policy->flags = flags;
 	policy->home_node = NUMA_NO_NODE;
+	policy->wil.cur_weight = 0;
 
 	return policy;
 }
@@ -417,6 +418,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 		.create = mpol_new_nodemask,
 		.rebind = mpol_rebind_preferred,
 	},
+	[MPOL_WEIGHTED_INTERLEAVE] = {
+		.create = mpol_new_nodemask,
+		.rebind = mpol_rebind_nodemask,
+	},
 };
 
 static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
@@ -838,7 +843,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 
 	old = current->mempolicy;
 	current->mempolicy = new;
-	if (new && new->mode == MPOL_INTERLEAVE)
+	if (new && (new->mode == MPOL_INTERLEAVE ||
+		    new->mode == MPOL_WEIGHTED_INTERLEAVE))
 		current->il_prev = MAX_NUMNODES-1;
 	task_unlock(current);
 	mpol_put(old);
@@ -864,6 +870,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		*nodes = pol->nodes;
 		break;
 	case MPOL_LOCAL:
@@ -948,6 +955,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 		} else if (pol == current->mempolicy &&
 				pol->mode == MPOL_INTERLEAVE) {
 			*policy = next_node_in(current->il_prev, pol->nodes);
+		} else if (pol == current->mempolicy &&
+				(pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
+			if (pol->wil.cur_weight)
+				*policy = current->il_prev;
+			else
+				*policy = next_node_in(current->il_prev,
+						       pol->nodes);
 		} else {
 			err = -EINVAL;
 			goto out;
@@ -1777,7 +1791,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 	pol = __get_vma_policy(vma, addr, ilx);
 	if (!pol)
 		pol = get_task_policy(current);
-	if (pol->mode == MPOL_INTERLEAVE) {
+	if (pol->mode == MPOL_INTERLEAVE ||
+	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
 		*ilx += vma->vm_pgoff >> order;
 		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
 	}
@@ -1827,6 +1842,24 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 	return zone >= dynamic_policy_zone;
 }
 
+static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
+{
+	unsigned int next;
+	struct task_struct *me = current;
+
+	next = next_node_in(me->il_prev, policy->nodes);
+	if (next == MAX_NUMNODES)
+		return next;
+
+	if (!policy->wil.cur_weight)
+		policy->wil.cur_weight = iw_table[next];
+
+	policy->wil.cur_weight--;
+	if (!policy->wil.cur_weight)
+		me->il_prev = next;
+	return next;
+}
+
 /* Do dynamic interleaving for a process */
 static unsigned int interleave_nodes(struct mempolicy *policy)
 {
@@ -1861,6 +1894,9 @@ unsigned int mempolicy_slab_node(void)
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
 
+	case MPOL_WEIGHTED_INTERLEAVE:
+		return weighted_interleave_nodes(policy);
+
 	case MPOL_BIND:
 	case MPOL_PREFERRED_MANY:
 	{
@@ -1885,6 +1921,41 @@ unsigned int mempolicy_slab_node(void)
 	}
 }
 
+static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
+{
+	nodemask_t nodemask = pol->nodes;
+	unsigned int target, weight_total = 0;
+	int nid;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned char weight;
+
+	barrier();
+
+	/* first ensure we have a valid nodemask */
+	nid = first_node(nodemask);
+	if (nid == MAX_NUMNODES)
+		return nid;
+
+	/* Then collect weights on stack and calculate totals */
+	for_each_node_mask(nid, nodemask) {
+		weight = iw_table[nid];
+		weight_total += weight;
+		weights[nid] = weight;
+	}
+
+	/* Finally, calculate the node offset based on totals */
+	target = (unsigned int)ilx % weight_total;
+	nid = first_node(nodemask);
+	while (target) {
+		weight = weights[nid];
+		if (target < weight)
+			break;
+		target -= weight;
+		nid = next_node_in(nid, nodemask);
+	}
+	return nid;
+}
+
 /*
  * Do static interleaving for interleave index @ilx.  Returns the ilx'th
  * node in pol->nodes (starting from ilx=0), wrapping around if ilx
@@ -1953,6 +2024,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
 		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
 			interleave_nodes(pol) : interleave_nid(pol, ilx);
 		break;
+	case MPOL_WEIGHTED_INTERLEAVE:
+		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
+			weighted_interleave_nodes(pol) :
+			weighted_interleave_nid(pol, ilx);
+		break;
 	}
 
 	return nodemask;
@@ -2014,6 +2090,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		*mask = mempolicy->nodes;
 		break;
 
@@ -2113,7 +2190,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 		 * If the policy is interleave or does not allow the current
 		 * node in its nodemask, we allocate the standard way.
 		 */
-		if (pol->mode != MPOL_INTERLEAVE &&
+		if ((pol->mode != MPOL_INTERLEAVE &&
+		    pol->mode != MPOL_WEIGHTED_INTERLEAVE) &&
 		    (!nodemask || node_isset(nid, *nodemask))) {
 			/*
 			 * First, try to allocate THP only on local node, but
@@ -2249,6 +2327,106 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
 	return total_allocated;
 }
 
+static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
+		struct mempolicy *pol, unsigned long nr_pages,
+		struct page **page_array)
+{
+	struct task_struct *me = current;
+	unsigned long total_allocated = 0;
+	unsigned long nr_allocated;
+	unsigned long rounds;
+	unsigned long node_pages, delta;
+	unsigned char weight;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned int weight_total;
+	unsigned long rem_pages = nr_pages;
+	nodemask_t nodes = pol->nodes;
+	int nnodes, node, prev_node;
+	int i;
+
+	/* Stabilize the nodemask on the stack */
+	barrier();
+
+	nnodes = nodes_weight(nodes);
+
+	/* Collect weights and save them on stack so they don't change */
+	for_each_node_mask(node, nodes) {
+		weight = iw_table[node];
+		weight_total += weight;
+		weights[node] = weight;
+	}
+
+	/* Continue allocating from most recent node and adjust the nr_pages */
+	if (pol->wil.cur_weight) {
+		node = next_node_in(me->il_prev, nodes);
+		node_pages = pol->wil.cur_weight;
+		if (node_pages > rem_pages)
+			node_pages = rem_pages;
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		/* if that's all the pages, no need to interleave */
+		if (rem_pages <= pol->wil.cur_weight) {
+			pol->wil.cur_weight -= rem_pages;
+			return total_allocated;
+		}
+		/* Otherwise we adjust nr_pages down, and continue from there */
+		rem_pages -= pol->wil.cur_weight;
+		pol->wil.cur_weight = 0;
+		prev_node = node;
+	}
+
+	/* Now we can continue allocating as if from 0 instead of an offset */
+	rounds = rem_pages / weight_total;
+	delta = rem_pages % weight_total;
+	for (i = 0; i < nnodes; i++) {
+		node = next_node_in(prev_node, nodes);
+		weight = weights[node];
+		node_pages = weight * rounds;
+		if (delta) {
+			if (delta > weight) {
+				node_pages += weight;
+				delta -= weight;
+			} else {
+				node_pages += delta;
+				delta = 0;
+			}
+		}
+		/* We may not make it all the way around */
+		if (!node_pages)
+			break;
+		/* If an over-allocation would occur, floor it */
+		if (node_pages + total_allocated > nr_pages) {
+			node_pages = nr_pages - total_allocated;
+			delta = 0;
+		}
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		prev_node = node;
+	}
+
+	/*
+	 * Finally, we need to update me->il_prev and pol->wil.cur_weight
+	 * if there were overflow pages, but not equivalent to the node
+	 * weight, set the cur_weight to node_weight - delta and the
+	 * me->il_prev to the previous node. Otherwise if it was perfect
+	 * we can simply set il_prev to node and cur_weight to 0
+	 */
+	if (node_pages) {
+		me->il_prev = prev_node;
+		node_pages %= weight;
+		pol->wil.cur_weight = weight - node_pages;
+	} else {
+		me->il_prev = node;
+		pol->wil.cur_weight = 0;
+	}
+
+	return total_allocated;
+}
+
 static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
 		struct mempolicy *pol, unsigned long nr_pages,
 		struct page **page_array)
@@ -2289,6 +2467,11 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
 		return alloc_pages_bulk_array_interleave(gfp, pol,
 							 nr_pages, page_array);
 
+	if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
+		return alloc_pages_bulk_array_weighted_interleave(gfp, pol,
+								  nr_pages,
+								  page_array);
+
 	if (pol->mode == MPOL_PREFERRED_MANY)
 		return alloc_pages_bulk_array_preferred_many(gfp,
 				numa_node_id(), pol, nr_pages, page_array);
@@ -2364,6 +2547,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		return !!nodes_equal(a->nodes, b->nodes);
 	case MPOL_LOCAL:
 		return true;
@@ -2500,6 +2684,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
 		polnid = interleave_nid(pol, ilx);
 		break;
 
+	case MPOL_WEIGHTED_INTERLEAVE:
+		polnid = weighted_interleave_nid(pol, ilx);
+		break;
+
 	case MPOL_PREFERRED:
 		if (node_isset(curnid, pol->nodes))
 			goto out;
@@ -2874,6 +3062,7 @@ static const char * const policy_modes[] =
 	[MPOL_PREFERRED]  = "prefer",
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
+	[MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
 	[MPOL_LOCAL]      = "local",
 	[MPOL_PREFERRED_MANY]  = "prefer (many)",
 };
@@ -2933,6 +3122,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		}
 		break;
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		/*
 		 * Default to online nodes with memory if no nodelist
 		 */
@@ -3043,6 +3233,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		nodes = pol->nodes;
 		break;
 	default:
-- 
2.39.1