linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Gregory Price <gourry.memverge@gmail.com>
To: linux-mm@kvack.org, jgroves@micron.com, ravis.opensrc@micron.com,
	sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com
Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-api@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
	arnd@arndb.de, tglx@linutronix.de, luto@kernel.org,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, mhocko@kernel.org, tj@kernel.org,
	ying.huang@intel.com, gregory.price@memverge.com, corbet@lwn.net,
	rakie.kim@sk.com, hyeongtak.ji@sk.com, honggyu.kim@sk.com,
	vtavarespetr@micron.com, peterz@infradead.org
Subject: [RFC PATCH 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface
Date: Wed,  6 Dec 2023 19:27:49 -0500	[thread overview]
Message-ID: <20231207002759.51418-2-gregory.price@memverge.com> (raw)
In-Reply-To: <20231207002759.51418-1-gregory.price@memverge.com>

From: Rakie Kim <rakie.kim@sk.com>

This patch provides a way to set interleave weight information under
sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/node*/node*/weight

The sysfs structure is designed as follows.

  $ tree /sys/kernel/mm/mempolicy/
  /sys/kernel/mm/mempolicy/ [1]
  ├── cpu_nodes [2]
  ├── possible_nodes [3]
  └── weighted_interleave [4]
      ├── node0 [5]
      │   ├── node0 [6]
      │   │     └── weight [7]
      │   └── node1
      │         └── weight
      └── node1
          ├── node0
          │     └── weight
          └── node1
                └── weight

Each file above can be explained as follows.

[1] mm/mempolicy: configuration interface for mempolicy subsystem

[2] cpu_nodes: list of cpu nodes

    information interface which is used to describe which nodes
    may generate sub-folders under each policy interface. For example,
    the weighted_interleave policy generates a nodeN folder for each
    cpu node.

[3] possible_nodes: list of possible nodes

    informational interface which may be used across multiple memory
    policy configurations.  Lists the `possible` nodes for which
    configurations may be required.  A `possible` node is one which has
    been reserved by the kernel at boot, but may or may not be online.

    For example, the weighted_interleave policy generates a nodeN/nodeM
    folder for each cpu node and memory node combination [N,M].

[4] weighted_interleave/: config interface for weighted interleave policy

[5] weighted_interleave/nodeN/:  initiator node configurations

    Each CPU node receives its own weighting table, allowing for (src,dst)
    weighting to be accomplished, where src is the cpu node the task is
    running on, and dst is an index into the array of weights for that
    source node.

[6] weighted_interleave/nodeN/nodeM/:  memory node configurations

[7] weighted_interleave/nodeN/nodeM/weight: weight for [N,M]

    The weight table for nodeN which can be programmed to weight each
    target (nodeM) differently.  This is important for allowing re-weight
    to occur automatically on a task migration event, either via scheduler
    initiated migration or a cgroup.cpusets/mems_allowed policy change.

Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Gregory Price <gregory.price@memverge.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
---
 .../ABI/testing/sysfs-kernel-mm-mempolicy     |  33 +++
 ...fs-kernel-mm-mempolicy-weighted-interleave |  35 +++
 mm/mempolicy.c                                | 226 ++++++++++++++++++
 3 files changed, 294 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
new file mode 100644
index 000000000000..8dc1129d4ab1
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
@@ -0,0 +1,33 @@
+What:		/sys/kernel/mm/mempolicy/
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Interface for Mempolicy
+
+What:		/sys/kernel/mm/mempolicy/cpu_nodes
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	The numa nodes from which accesses can be generated
+
+		A cpu numa node is one which has at least 1 CPU. These nodes
+		are capable of generating accesses to memory numa nodes, and
+		will have an interleave weight table.
+
+		Example output:
+
+		=====         =================================================
+		"0,1"         nodes 0 and 1 have CPUs which may generate access
+		=====         =================================================
+
+What:		/sys/kernel/mm/mempolicy/possible_nodes
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	The numa nodes which are possible to come online
+
+		A possible numa node is one which has been reserved by the
+		system at boot, but may or may not be online at runtime.
+
+		Example output:
+
+		=========     ========================================
+		"0,1,2,3"     nodes 0-3 are possibly online or offline
+		=========     ========================================
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
new file mode 100644
index 000000000000..75554895ede3
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -0,0 +1,35 @@
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Configuration Interface for the Weighted Interleave policy
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/nodeN/
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Configuration interface for accesses initiated from nodeN
+
+		The directory to configure access initiator weights for nodeN.
+
+		Possible numa nodes which have not been marked as a CPU node
+		at boot will not have a nodeN directory made for them at boot.
+		Hotplug for CPU nodes is not supported.
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/nodeN/nodeM
+		/sys/kernel/mm/mempolicy/weighted_interleave/nodeN/nodeM/weight
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Configuration interface for target nodes accessed from nodeNN
+
+		The interleave weight for a memory node (M) from initiating
+		node (N). These weights are utilized by processes which have set
+		the mempolicy to MPOL_WEIGHTED_INTERLEAVE and have opted into
+		global weights by omitting a task-local weight array.
+
+		These weights only affect new allocations, and changes at runtime
+		will not cause migrations on already allocated pages.
+
+		If the weight of 0 is desired, the appropriate way to do this is
+		by removing the node from the weighted interleave nodemask.
+
+		Minimum weight: 1
+		Maximum weight: 255
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..ce332b5e7a03 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -131,6 +131,11 @@ static struct mempolicy default_policy = {
 
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
+struct interleave_weight_table {
+	unsigned char weights[MAX_NUMNODES];
+};
+static struct interleave_weight_table *iw_table;
+
 /**
  * numa_nearest_node - Find nearest node by state
  * @node: Node id to start the search
@@ -3067,3 +3072,224 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 		p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
 			       nodemask_pr_args(&nodes));
 }
+
+struct iw_node_info {
+	struct kobject kobj;
+	int src;
+	int dst;
+};
+
+static ssize_t node_weight_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	struct iw_node_info *node_info = container_of(kobj, struct iw_node_info,
+						      kobj);
+	return sysfs_emit(buf, "%d\n",
+			  iw_table[node_info->src].weights[node_info->dst]);
+}
+
+static ssize_t node_weight_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count)
+{
+	unsigned char weight = 0;
+	struct iw_node_info *node_info = NULL;
+
+	node_info = container_of(kobj, struct iw_node_info, kobj);
+
+	if (kstrtou8(buf, 0, &weight) || !weight)
+		return -EINVAL;
+
+	iw_table[node_info->src].weights[node_info->dst] = weight;
+
+	return count;
+}
+
+static struct kobj_attribute node_weight =
+	__ATTR(weight, 0664, node_weight_show, node_weight_store);
+
+static struct attribute *dst_node_attrs[] = {
+	&node_weight.attr,
+	NULL,
+};
+
+static struct attribute_group dst_node_attr_group = {
+	.attrs = dst_node_attrs,
+};
+
+static const struct attribute_group *dst_node_attr_groups[] = {
+	&dst_node_attr_group,
+	NULL,
+};
+
+static const struct kobj_type dst_node_kobj_ktype = {
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = dst_node_attr_groups,
+};
+
+static int add_dst_node(int src, int dst, struct kobject *src_kobj)
+{
+	struct iw_node_info *node_info = NULL;
+	int ret;
+
+	node_info = kzalloc(sizeof(struct iw_node_info), GFP_KERNEL);
+	if (!node_info)
+		return -ENOMEM;
+	node_info->src = src;
+	node_info->dst = dst;
+
+	kobject_init(&node_info->kobj, &dst_node_kobj_ktype);
+	ret = kobject_add(&node_info->kobj, src_kobj, "node%d", dst);
+	if (ret) {
+		pr_err("kobject_add error [%d-node%d]: %d", src, dst, ret);
+		kobject_put(&node_info->kobj);
+	}
+	return ret;
+}
+
+static int add_src_node(int src, struct kobject *root_kobj)
+{
+	int err, dst;
+	struct kobject *src_kobj;
+	char name[24];
+
+	snprintf(name, 24, "node%d", src);
+	src_kobj = kobject_create_and_add(name, root_kobj);
+	if (!src_kobj) {
+		pr_err("failed to create source node kobject\n");
+		return -ENOMEM;
+	}
+	for_each_node_state(dst, N_POSSIBLE) {
+		err = add_dst_node(src, dst, src_kobj);
+		if (err)
+			break;
+	}
+	if (err)
+		kobject_put(src_kobj);
+	return err;
+}
+
+static int add_weighted_interleave_group(struct kobject *root_kobj)
+{
+	struct kobject *wi_kobj;
+	int nid, err;
+
+	wi_kobj = kobject_create_and_add("weighted_interleave", root_kobj);
+	if (!wi_kobj) {
+		pr_err("failed to create node kobject\n");
+		return -ENOMEM;
+	}
+
+	for_each_node_state(nid, N_CPU) {
+		err = add_src_node(nid, wi_kobj);
+		if (err) {
+			pr_err("failed to add sysfs [node%d]\n", nid);
+			break;
+		}
+	}
+	if (err)
+		kobject_put(wi_kobj);
+	return 0;
+
+}
+
+static ssize_t cpu_nodes_show(struct kobject *kobj,
+			      struct kobj_attribute *attr, char *buf)
+{
+	int nid, next_nid;
+	int len = 0;
+
+	for_each_node_state(nid, N_CPU) {
+		len += sysfs_emit_at(buf, len, "%d", nid);
+		next_nid = next_node(nid, node_states[N_CPU]);
+		if (next_nid < MAX_NUMNODES)
+			len += sysfs_emit_at(buf, len, ",");
+	}
+	len += sysfs_emit_at(buf, len, "\n");
+
+	return len;
+}
+
+static ssize_t possible_nodes_show(struct kobject *kobj,
+				   struct kobj_attribute *attr, char *buf)
+{
+	int nid, next_nid;
+	int len = 0;
+
+	for_each_node_state(nid, N_POSSIBLE) {
+		len += sysfs_emit_at(buf, len, "%d", nid);
+		next_nid = next_node(nid, node_states[N_POSSIBLE]);
+		if (next_nid < MAX_NUMNODES)
+			len += sysfs_emit_at(buf, len, ",");
+	}
+	len += sysfs_emit_at(buf, len, "\n");
+
+	return len;
+}
+
+static struct kobj_attribute cpu_nodes_attr = __ATTR_RO(cpu_nodes);
+static struct kobj_attribute possible_nodes_attr = __ATTR_RO(possible_nodes);
+
+static struct attribute *mempolicy_attrs[] = {
+	&cpu_nodes_attr.attr,
+	&possible_nodes_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group mempolicy_attr_group = {
+	.attrs = mempolicy_attrs,
+	NULL,
+};
+
+static void mempolicy_kobj_release(struct kobject *kobj)
+{
+	kfree(kobj);
+	kfree(iw_table);
+}
+
+static const struct kobj_type mempolicy_kobj_ktype = {
+	.release = mempolicy_kobj_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+};
+
+static int __init mempolicy_sysfs_init(void)
+{
+	int err, nid;
+	int cpunodes = 0;
+	struct kobject *root_kobj;
+
+	for_each_node_state(nid, N_CPU)
+		cpunodes += 1;
+	iw_table = kmalloc_array(cpunodes, sizeof(*iw_table), GFP_KERNEL);
+	if (!iw_table) {
+		pr_err("failed to create interleave weight table\n");
+		err = -ENOMEM;
+		goto fail_obj;
+	}
+	memset(iw_table, 1, cpunodes * sizeof(*iw_table));
+
+	root_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL);
+	if (!root_kobj)
+		return -ENOMEM;
+
+	kobject_init(root_kobj, &mempolicy_kobj_ktype);
+	err = kobject_add(root_kobj, mm_kobj, "mempolicy");
+	if (err) {
+		pr_err("failed to add kobject to the system\n");
+		goto fail_obj;
+	}
+
+	err = sysfs_create_group(root_kobj, &mempolicy_attr_group);
+	if (err) {
+		pr_err("failed to register mempolicy group\n");
+		goto fail_obj;
+	}
+
+	err = add_weighted_interleave_group(root_kobj);
+fail_obj:
+	if (err)
+		kobject_put(root_kobj);
+	return err;
+
+}
+late_initcall(mempolicy_sysfs_init);
-- 
2.39.1



  reply	other threads:[~2023-12-07  0:28 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-07  0:27 [RFC PATCH 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
2023-12-07  0:27 ` Gregory Price [this message]
2023-12-07 21:56   ` [RFC PATCH 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface Davidlohr Bueso
2023-12-07 22:17     ` Davidlohr Bueso
2023-12-08  0:11     ` Gregory Price
2023-12-07  0:27 ` [RFC PATCH 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price
2023-12-07  0:27 ` [RFC PATCH 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse Gregory Price
2023-12-07  0:27 ` [RFC PATCH 04/11] mm/mempolicy: create struct mempolicy_args for creating new mempolicies Gregory Price
2023-12-07  0:27 ` [RFC PATCH 05/11] mm/mempolicy: refactor kernel_get_mempolicy for code re-use Gregory Price
2023-12-07  0:27 ` [RFC PATCH 06/11] mm/mempolicy: allow home_node to be set by mpol_new Gregory Price
2023-12-07  0:27 ` [RFC PATCH 07/11] mm/mempolicy: add userland mempolicy arg structure Gregory Price
2023-12-07  7:13   ` Arnd Bergmann
2023-12-07 14:58     ` Gregory Price
2023-12-07 15:43       ` Arnd Bergmann
2023-12-08  0:05         ` Gregory Price
2023-12-07  0:27 ` [RFC PATCH 08/11] mm/mempolicy: add set_mempolicy2 syscall Gregory Price
2023-12-07  0:27 ` [RFC PATCH 10/11] mm/mempolicy: add the mbind2 syscall Gregory Price
2023-12-07  0:27 ` [RFC PATCH 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave Gregory Price

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231207002759.51418-2-gregory.price@memverge.com \
    --to=gourry.memverge@gmail.com \
    --cc=Hasan.Maruf@amd.com \
    --cc=akpm@linux-foundation.org \
    --cc=arnd@arndb.de \
    --cc=bp@alien8.de \
    --cc=corbet@lwn.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=emirakhur@micron.com \
    --cc=gregory.price@memverge.com \
    --cc=honggyu.kim@sk.com \
    --cc=hpa@zytor.com \
    --cc=hyeongtak.ji@sk.com \
    --cc=jgroves@micron.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mhocko@kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rakie.kim@sk.com \
    --cc=ravis.opensrc@micron.com \
    --cc=sthanneeru@micron.com \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=vtavarespetr@micron.com \
    --cc=x86@kernel.org \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox