From mboxrd@z Thu Jan 1 00:00:00 1970
From: Gregory Price <gourry.memverge@gmail.com>
X-Google-Original-From: Gregory Price <gregory.price@memverge.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
	akpm@linux-foundation.org, sthanneeru@micron.com,
	ying.huang@intel.com, gregory.price@memverge.com
Subject: [RFC PATCH v2 3/3] mm/mempolicy: modify interleave mempolicy to use memtier weights
Date: Mon,  9 Oct 2023 16:42:59 -0400
Message-Id: <20231009204259.875232-4-gregory.price@memverge.com>
X-Mailer: git-send-email 2.39.1
In-Reply-To: <20231009204259.875232-1-gregory.price@memverge.com>
References: <20231009204259.875232-1-gregory.price@memverge.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

The memory-tier subsystem implements interleave weighting for tiers
for the purpose of bandwidth optimization. Each tier may contain
multiple numa nodes, and each tier may have different weights in
relation to each compute node ("from node").

The mempolicy MPOL_INTERLEAVE utilizes the memory-tier subsystem
functions to implement weighted tiering. By default, since all tiers
default to a weight of 1, the original interleave behavior is
retained.

The mempolicy nodemask does not have to be inclusive of all nodes in
each respective memory tier, though this may lead to a more
complicated calculation in terms of how memory is distributed.

Examples

Weight settings:
  echo 0:4 > memory_tier4/interleave_weight
  echo 1:3 > memory_tier4/interleave_weight

  echo 0:2 > memory_tier22/interleave_weight
  echo 1:1 > memory_tier22/interleave_weight

Results:
  Tier 1: Nodes(0,1), Weights(4,3) <- from nodes(0,1) respectively
  Tier 2: Nodes(2,3), Weights(2,1) <- from nodes(0,1) respectively

Task A:
  cpunode: 0
  nodemask: [0,1]
  weights: [4]
  allocation result: [0,0,1,1, repeat]
  Notice how the weight is split between the nodes.

Task B:
  cpunode: 0
  nodemask: [0,2]
  weights: [4,2]
  allocation result: [0,0,0,0,2,2, repeat]
  Notice how the weights are not split; each node receives the entire
  weight of its respective tier.

Task C:
  cpunode: 1
  nodemask: [1,3]
  weights: [3,1]
  allocation result: [1,1,1,3, repeat]
  Notice how the weights differ based on cpunode.

Task D:
  cpunode: 0
  nodemask: [0,1,2]
  weights: [4,2]
  allocation result: [0,0,1,1,2,2]
  Notice how tier 1 splits its weight between nodes 0 and 1, while
  tier 2 applies its entire weight to node 2.

Task E:
  cpunode: 1
  nodemask: [0,1]
  weights: [3]
  allocation result: [0,0,1,1]
  Notice how the weight is rounded up to an effective 4, because
  weights are aligned to the number of nodes in the tier.
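To make the round-robin arithmetic concrete, here is a small userspace
sketch (illustrative only, not kernel code; the hard-coded nodes/weights
arrays stand in for the memtier lookups) that reproduces the Task B
pattern:

  /*
   * Illustrative userspace model of weighted interleave.  The weights
   * array is a stand-in for the memtier lookup; the values match
   * Task B above: nodemask [0,2], weights [4,2].
   */
  #include <stdio.h>

  int main(void)
  {
  	int nodes[] = { 0, 2 };
  	int weights[] = { 4, 2 };
  	int nnodes = 2;
  	int cur = 0;
  	int cur_weight = weights[0];

  	/* print the node chosen for each of the first 12 allocations */
  	for (int i = 0; i < 12; i++) {
  		printf("%d ", nodes[cur]);
  		if (--cur_weight == 0) {
  			/* weight exhausted: advance to the next node */
  			cur = (cur + 1) % nnodes;
  			cur_weight = weights[cur];
  		}
  	}
  	printf("\n"); /* prints: 0 0 0 0 2 2 0 0 0 0 2 2 */
  	return 0;
  }

The decrement-then-advance step mirrors what interleave_nodes() does
with pol->cur_weight in the patch below.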
Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 include/linux/mempolicy.h |   3 +
 mm/mempolicy.c            | 148 ++++++++++++++++++++++++++++++--------
 2 files changed, 122 insertions(+), 29 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index d232de7cdc56..ad57fdfdb57a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -48,6 +48,9 @@ struct mempolicy {
 	nodemask_t nodes;	/* interleave/bind/perfer */
 	int home_node;	/* Home node to use for MPOL_BIND and MPOL_PREFERRED_MANY */
 
+	/* weighted interleave settings */
+	unsigned char cur_weight;
+
 	union {
 		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
 		nodemask_t user_nodemask;	/* nodemask passed by user */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f1b00d6ac7ee..131e6e56b2de 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -102,6 +102,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/printk.h>
 #include <linux/swapops.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
@@ -300,6 +301,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	policy->mode = mode;
 	policy->flags = flags;
 	policy->home_node = NUMA_NO_NODE;
+	policy->cur_weight = 0;
 
 	return policy;
 }
@@ -334,6 +336,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
 		tmp = *nodes;
 
 	pol->nodes = tmp;
+	pol->cur_weight = 0;
 }
 
 static void mpol_rebind_preferred(struct mempolicy *pol,
@@ -881,8 +884,11 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 
 	old = current->mempolicy;
 	current->mempolicy = new;
-	if (new && new->mode == MPOL_INTERLEAVE)
+	if (new && new->mode == MPOL_INTERLEAVE) {
 		current->il_prev = MAX_NUMNODES-1;
+		new->cur_weight = 0;
+	}
+
 	task_unlock(current);
 	mpol_put(old);
 	ret = 0;
@@ -1901,12 +1907,23 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 /* Do dynamic interleaving for a process */
 static unsigned interleave_nodes(struct mempolicy *policy)
 {
-	unsigned next;
+	unsigned int next;
+	unsigned char next_weight;
 	struct task_struct *me = current;
 
 	next = next_node_in(me->il_prev, policy->nodes);
-	if (next < MAX_NUMNODES)
+	if (!policy->cur_weight) {
+		/* If the node is set, at least 1 allocation is required */
+		next_weight = memtier_get_node_weight(numa_node_id(), next,
+						      &policy->nodes);
+		policy->cur_weight = next_weight ? next_weight : 1;
+	}
+
+	policy->cur_weight--;
+	if (next < MAX_NUMNODES && !policy->cur_weight)
 		me->il_prev = next;
+
 	return next;
 }
 
@@ -1965,25 +1982,37 @@ unsigned int mempolicy_slab_node(void)
 static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
 {
 	nodemask_t nodemask = pol->nodes;
-	unsigned int target, nnodes;
-	int i;
+	unsigned int target, nnodes, il_weight;
+	unsigned char weight;
 	int nid;
+	int cur_node = numa_node_id();
+
 	/*
 	 * The barrier will stabilize the nodemask in a register or on
 	 * the stack so that it will stop changing under the code.
 	 *
 	 * Between first_node() and next_node(), pol->nodes could be changed
 	 * by other threads. So we put pol->nodes in a local stack.
+	 *
+	 * Additionally, place the cur_node on the stack in case of a migration.
 	 */
 	barrier();
 
 	nnodes = nodes_weight(nodemask);
 	if (!nnodes)
-		return numa_node_id();
-	target = (unsigned int)n % nnodes;
+		return cur_node;
+
+	il_weight = memtier_get_total_weight(cur_node, &nodemask);
+	target = (unsigned int)n % il_weight;
 	nid = first_node(nodemask);
-	for (i = 0; i < target; i++)
-		nid = next_node(nid, nodemask);
+	while (target) {
+		weight = memtier_get_node_weight(cur_node, nid, &nodemask);
+		if (target < weight)
+			break;
+		target -= weight;
+		nid = next_node_in(nid, nodemask);
+	}
+
 	return nid;
 }
 
@@ -2317,32 +2346,93 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
 		struct mempolicy *pol, unsigned long nr_pages,
 		struct page **page_array)
 {
-	int nodes;
-	unsigned long nr_pages_per_node;
-	int delta;
-	int i;
-	unsigned long nr_allocated;
+	struct task_struct *me = current;
 	unsigned long total_allocated = 0;
+	unsigned long nr_allocated;
+	unsigned long rounds;
+	unsigned long node_pages, delta;
+	unsigned char weight;
+	unsigned long il_weight;
+	unsigned long req_pages = nr_pages;
+	int nnodes, node, prev_node;
+	int cur_node = numa_node_id();
+	int i;
 
-	nodes = nodes_weight(pol->nodes);
-	nr_pages_per_node = nr_pages / nodes;
-	delta = nr_pages - nodes * nr_pages_per_node;
-
-	for (i = 0; i < nodes; i++) {
-		if (delta) {
-			nr_allocated = __alloc_pages_bulk(gfp,
-					interleave_nodes(pol), NULL,
-					nr_pages_per_node + 1, NULL,
-					page_array);
-			delta--;
-		} else {
-			nr_allocated = __alloc_pages_bulk(gfp,
-					interleave_nodes(pol), NULL,
-					nr_pages_per_node, NULL, page_array);
+	prev_node = me->il_prev;
+	nnodes = nodes_weight(pol->nodes);
+	/* Continue allocating from most recent node */
+	if (pol->cur_weight) {
+		node = next_node_in(prev_node, pol->nodes);
+		node_pages = pol->cur_weight;
+		if (node_pages > nr_pages)
+			node_pages = nr_pages;
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		/* if that's all the pages, no need to interleave */
+		if (req_pages <= pol->cur_weight) {
+			pol->cur_weight -= req_pages;
+			return total_allocated;
 		}
+		/* Otherwise we adjust req_pages down, and continue from there */
+		req_pages -= pol->cur_weight;
+		pol->cur_weight = 0;
+		prev_node = node;
+	}
+
+	/*
+	 * The memtier lock is not held during allocation; if weights change,
+	 * there may be edge cases (over/under-allocation) to handle.
+	 */
+try_again:
+	il_weight = memtier_get_total_weight(cur_node, &pol->nodes);
+	rounds = req_pages / il_weight;
+	delta = req_pages % il_weight;
+	for (i = 0; i < nnodes; i++) {
+		node = next_node_in(prev_node, pol->nodes);
+		weight = memtier_get_node_weight(cur_node, node,
+						 &pol->nodes);
+		node_pages = weight * rounds;
+		if (delta > weight) {
+			node_pages += weight;
+			delta -= weight;
+		} else if (delta) {
+			node_pages += delta;
+			delta = 0;
+		}
+		/* The number of requested pages may not hit every node */
+		if (!node_pages)
+			break;
+		/* If an over-allocation would occur, floor it */
+		if (node_pages + total_allocated > nr_pages) {
+			node_pages = nr_pages - total_allocated;
+			delta = 0;
+		}
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
 		page_array += nr_allocated;
 		total_allocated += nr_allocated;
+		prev_node = node;
+	}
+
+	/* If an under-allocation would occur, apply interleave again */
+	if (total_allocated != nr_pages)
+		goto try_again;
+
+	/*
+	 * Finally, we need to update me->il_prev and pol->cur_weight.
+	 * If there were overflow pages that did not consume the full node
+	 * weight, set cur_weight to the unconsumed portion of the weight
+	 * and me->il_prev to the previous node. Otherwise, if the fit was
+	 * exact, simply set il_prev to node and cur_weight to 0.
+	 */
+	if (node_pages) {
+		me->il_prev = prev_node;
+		node_pages %= weight;
+		pol->cur_weight = weight - node_pages;
+	} else {
+		me->il_prev = node;
+		pol->cur_weight = 0;
 	}
 
 	return total_allocated;
-- 
2.39.1
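
For reference, the rounds/delta arithmetic in
alloc_pages_bulk_array_interleave() can be checked with a small
userspace model. This is illustrative only: the weight array is a
hard-coded stand-in for the memtier lookups, and 17 pages over
weights [4,2] is an arbitrary example.

  /*
   * Illustrative userspace model of the bulk distribution math (not
   * kernel code).  Each node receives weight * rounds pages plus its
   * share of the remainder, consumed in weight order.
   */
  #include <stdio.h>

  int main(void)
  {
  	unsigned long weights[] = { 4, 2 };	/* stand-in node weights */
  	int nnodes = 2;
  	unsigned long nr_pages = 17;
  	unsigned long il_weight = 0, rounds, delta, total = 0;
  	int i;

  	for (i = 0; i < nnodes; i++)
  		il_weight += weights[i];
  	rounds = nr_pages / il_weight;	/* 17 / 6 = 2 full passes */
  	delta = nr_pages % il_weight;	/* 5 leftover pages */

  	for (i = 0; i < nnodes; i++) {
  		unsigned long node_pages = weights[i] * rounds;

  		if (delta > weights[i]) {
  			node_pages += weights[i];
  			delta -= weights[i];
  		} else if (delta) {
  			node_pages += delta;
  			delta = 0;
  		}
  		total += node_pages;
  		printf("node %d: %lu pages\n", i, node_pages);
  	}
  	printf("total: %lu of %lu\n", total, nr_pages);	/* 17 of 17 */
  	return 0;
  }

With these example values the first node receives 12 pages and the
second 5, which is the same split the patch's loop would produce
before any concurrent weight change.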