* [Patch v2 0/2] sched/numa, mm/numa: Soft Affinity via numa_preferred_nid.
@ 2025-05-02 18:59 chris hyser
  2025-05-02 18:59 ` [PATCH v2 1/2] sched/numa: Add ability to override task's numa_preferred_nid chris hyser
  2025-05-02 18:59 ` [PATCH v2 2/2] sched/numa: prctl to set/override " chris hyser
  0 siblings, 2 replies; 3+ messages in thread
From: chris hyser @ 2025-05-02 18:59 UTC (permalink / raw)
  To: Chris Hyser, Peter Zijlstra, Mel Gorman, Andrew Morton,
	Jonathan Corbet, linux-kernel, linux-mm

Soft Affinity (the value of hard affinity combined with graceful handling of
overload) has been around as a concept for years. The original implementation
was rejected:

https://lore.kernel.org/lkml/20190702172851.GA3436@hirez.programming.kicks-ass.net/

with an alternative, using numa_preferred_nid, suggested by Peter Zijlstra.

This is a simple implementation; most of the changes are associated with a
prctl() to set/get the value. It does not modify the scheduler's behavior but
simply exploits the existing NUMA balancing behavior.

The intent is to provide a mechanism whereby a knowledgeable user, system
administrator, or, importantly, a NUMA-aware application can force Auto NUMA
Balancing to prefer the "correct" node, for example for pinned memory such as
RDMA buffers, or other scenarios where heavily accessed memory ranges are
pinned and not subject to NUMA hint faults.

[PATCH v2 1/2] sched/numa: Add ability to override task's numa_preferred_nid.
 include/linux/sched.h |  1 +
 init/init_task.c      |  1 +
 kernel/sched/core.c   |  5 ++++-
 kernel/sched/debug.c  |  1 +
 kernel/sched/fair.c   | 15 +++++++++++++--
 5 files changed, 20 insertions(+), 3 deletions(-)

[PATCH v2 2/2] sched/numa: prctl to set/override task's numa_preferred_nid
 Documentation/scheduler/sched-preferred-node.rst | 67 ++++++++++++++++++++++++++++++++
 include/linux/sched.h                            |  9 +++++
 include/uapi/linux/prctl.h                       |  8 ++++
 kernel/sched/fair.c                              | 64 ++++++++++++++++++++++++++++++
 kernel/sys.c                                     |  5 +++
 tools/include/uapi/linux/prctl.h                 |  6 +++
 6 files changed, 159 insertions(+)



* [PATCH v2 1/2] sched/numa: Add ability to override task's numa_preferred_nid.
  2025-05-02 18:59 [Patch v2 0/2] sched/numa, mm/numa: Soft Affinity via numa_preferred_nid chris hyser
@ 2025-05-02 18:59 ` chris hyser
  2025-05-02 18:59 ` [PATCH v2 2/2] sched/numa: prctl to set/override " chris hyser
  1 sibling, 0 replies; 3+ messages in thread
From: chris hyser @ 2025-05-02 18:59 UTC (permalink / raw)
  To: Chris Hyser, Peter Zijlstra, Mel Gorman, Andrew Morton,
	Jonathan Corbet, linux-kernel, linux-mm

Allow directly setting, and subsequently overriding, a task's "Preferred Node
Affinity" by setting the task's numa_preferred_nid and relying on the
existing NUMA balancing infrastructure.

NUMA balancing introduced the notion of tracking a task's preferred memory
node and using it both to migrate/consolidate the physical pages accessed by
the task and to assist the scheduler in making NUMA-aware placement and
load-balancing decisions.

The existing mechanism for determining this, Auto NUMA Balancing, relies on
periodic removal of virtual mappings for blocks of a task's address space.
The resulting faults can indicate a task's preference for an accessed node.

This has two issues that this patch seeks to overcome:

- there is a trade-off between faulting overhead and the ability to detect
  dynamic access patterns. In cases where the task or user understands the
  NUMA sensitivities, this patch provides the benefits of setting a preferred
  node directly, either in conjunction with Auto NUMA Balancing's default
  parameters or with the NUMA balancing parameters adjusted to reduce the
  faulting rate (potentially to 0).

- memory pinned to nodes or to physical addresses, such as RDMA buffers,
  cannot be migrated and has thus far been excluded from scanning. Not taking
  those faults, however, can prevent Auto NUMA Balancing from reliably
  detecting a node preference, leaving the scheduler load balancer possibly
  operating with incorrect NUMA information.

The following results are from TPC-C runs on an Oracle Database. The system
was a 2-node AMD machine with a database instance running on each node using
local memory allocations. No tasks or memory were pinned.

There are four scenarios of interest:

- Auto NUMA Balancing OFF.
    base value

- Auto NUMA Balancing ON.
    1.2% - ANB ON better than ANB OFF.

- Use the prctl(), ANB ON, parameters set to prevent faulting.
    2.4% - prctl() better than ANB OFF.
    1.2% - prctl() better than ANB ON.

- Use the prctl(), ANB parameters normal.
    3.1% - prctl() and ANB ON better than ANB OFF.
    1.9% - prctl() and ANB ON better than just ANB ON.
    0.7% - prctl() and ANB ON better than prctl() and ANB ON/faulting off

The primary advantage of Preferred Node Affinity (PNA) with ANB on is that
the resulting NUMA hint faults are also used to periodically verify that a
task is on its preferred node, perhaps having been migrated away during load
balancing.

In benchmarks pinning large regions of heavily accessed memory, the advantage
of the prctl() over Auto NUMA Balancing alone is significantly higher.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
---
 include/linux/sched.h |  1 +
 init/init_task.c      |  1 +
 kernel/sched/core.c   |  5 ++++-
 kernel/sched/debug.c  |  1 +
 kernel/sched/fair.c   | 15 +++++++++++++--
 5 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac1982893..373046c82b35 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1350,6 +1350,7 @@ struct task_struct {
 	short				pref_node_fork;
 #endif
 #ifdef CONFIG_NUMA_BALANCING
+	int				numa_preferred_nid_force;
 	int				numa_scan_seq;
 	unsigned int			numa_scan_period;
 	unsigned int			numa_scan_period_max;
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..1921a87326db 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -184,6 +184,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.vtime.state	= VTIME_SYS,
 #endif
 #ifdef CONFIG_NUMA_BALANCING
+	.numa_preferred_nid_force = NUMA_NO_NODE,
 	.numa_preferred_nid = NUMA_NO_NODE,
 	.numa_group	= NULL,
 	.numa_faults	= NULL,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 79692f85643f..3488450ee16e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7980,7 +7980,10 @@ void sched_setnuma(struct task_struct *p, int nid)
 	if (running)
 		put_prev_task(rq, p);
 
-	p->numa_preferred_nid = nid;
+	if (unlikely(p->numa_preferred_nid_force != NUMA_NO_NODE))
+		p->numa_preferred_nid = p->numa_preferred_nid_force;
+	else
+		p->numa_preferred_nid = nid;
 
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 557246880a7e..a52ba5cf033c 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1158,6 +1158,7 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
 		P(mm->numa_scan_seq);
 
 	P(numa_pages_migrated);
+	P(numa_preferred_nid_force);
 	P(numa_preferred_nid);
 	P(total_numa_faults);
 	SEQ_printf(m, "current_node=%d, numa_group_id=%d\n",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eb5a2572b4f8..26781452c636 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2642,9 +2642,15 @@ static void numa_migrate_preferred(struct task_struct *p)
 	unsigned long interval = HZ;
 
 	/* This task has no NUMA fault statistics yet */
-	if (unlikely(p->numa_preferred_nid == NUMA_NO_NODE || !p->numa_faults))
+	if (unlikely(p->numa_preferred_nid == NUMA_NO_NODE))
 		return;
 
+	/* A forced PNID continues even without fault statistics */
+	if (p->numa_preferred_nid_force == NUMA_NO_NODE) {
+		if (unlikely(!p->numa_faults))
+			return;
+	}
+
 	/* Periodically retry migrating the task to the preferred node */
 	interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
 	p->numa_migrate_retry = jiffies + interval;
@@ -3578,6 +3584,7 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 
 	/* New address space, reset the preferred nid */
 	if (!(clone_flags & CLONE_VM)) {
+		p->numa_preferred_nid_force = NUMA_NO_NODE;
 		p->numa_preferred_nid = NUMA_NO_NODE;
 		return;
 	}
@@ -9303,7 +9310,11 @@ static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 	if (!static_branch_likely(&sched_numa_balancing))
 		return 0;
 
-	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+	/* A forced PNID continues even without fault statistics */
+	if (p->numa_preferred_nid_force == NUMA_NO_NODE && !p->numa_faults)
+		return 0;
+
+	if (!(env->sd->flags & SD_NUMA))
 		return 0;
 
 	src_nid = cpu_to_node(env->src_cpu);
-- 
2.43.5




* [PATCH v2 2/2] sched/numa: prctl to set/override task's numa_preferred_nid
  2025-05-02 18:59 [Patch v2 0/2] sched/numa, mm/numa: Soft Affinity via numa_preferred_nid chris hyser
  2025-05-02 18:59 ` [PATCH v2 1/2] sched/numa: Add ability to override task's numa_preferred_nid chris hyser
@ 2025-05-02 18:59 ` chris hyser
  1 sibling, 0 replies; 3+ messages in thread
From: chris hyser @ 2025-05-02 18:59 UTC (permalink / raw)
  To: Chris Hyser, Peter Zijlstra, Mel Gorman, Andrew Morton,
	Jonathan Corbet, linux-kernel, linux-mm

Add a simple prctl() interface for setting and reading a task's
numa_preferred_nid. Once set, this value overrides any value chosen by
Auto NUMA Balancing.

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
---
 .../scheduler/sched-preferred-node.rst        | 67 +++++++++++++++++++
 include/linux/sched.h                         |  9 +++
 include/uapi/linux/prctl.h                    |  8 +++
 kernel/sched/fair.c                           | 64 ++++++++++++++++++
 kernel/sys.c                                  |  5 ++
 tools/include/uapi/linux/prctl.h              |  6 ++
 6 files changed, 159 insertions(+)
 create mode 100644 Documentation/scheduler/sched-preferred-node.rst

diff --git a/Documentation/scheduler/sched-preferred-node.rst b/Documentation/scheduler/sched-preferred-node.rst
new file mode 100644
index 000000000000..753fd0b20993
--- /dev/null
+++ b/Documentation/scheduler/sched-preferred-node.rst
@@ -0,0 +1,67 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Prctl for Explicitly Setting Task's Preferred Node
+####################################################
+
+This feature is an addition to Auto NUMA Balancing. Auto NUMA balancing by
+default scans a task's address space, removing address translations so that
+subsequent faults can indicate the predominant node from which memory is being
+accessed. The task's numa_preferred_nid is then set to that node's ID.
+
+The numa_preferred_nid is used to both consolidate physical pages and assist the
+scheduler in making NUMA friendly load balancing decisions.
+
+While quite useful for some workloads, this has two issues that this prctl() can
+help solve:
+
+- There is a trade-off between faulting overhead and the ability to detect
+  dynamic access patterns. In cases where the task or user understands the NUMA
+  sensitivities, this prctl() provides the benefits of setting a preferred node
+  directly, either in conjunction with Auto NUMA Balancing's default parameters
+  or with the NUMA balancing parameters adjusted to reduce the faulting rate
+  (potentially to 0).
+
+- Memory pinned to nodes or to physical addresses, such as RDMA buffers, cannot
+  be migrated and has thus far been excluded from scanning. Not taking those
+  faults, however, can prevent Auto NUMA Balancing from reliably detecting a
+  node preference, leaving the scheduler load balancer possibly operating with
+  incorrect NUMA information.
+
+
+Usage
+*******
+
+    Note: Auto NUMA Balancing must be enabled for this setting to take effect.
+
+    #include <sys/prctl.h>
+
+    int prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4, unsigned long arg5);
+
+option:
+    ``PR_PREFERRED_NID``
+
+arg2:
+    Command for operation, must be one of:
+
+    - ``PR_PREFERRED_NID_GET`` -- get the forced preferred node ID for ``pid``.
+    - ``PR_PREFERRED_NID_SET`` -- set the forced preferred node ID for ``pid``.
+
+    Returns ERANGE for an illegal command.
+
+arg3:
+    ``pid`` of the task for which the operation applies. ``0`` implies current.
+
+    Returns ESRCH if ``pid`` is not found.
+
+arg4:
+    ``node_id`` for ``PR_PREFERRED_NID_SET``. Must satisfy
+    ``-1 <= node_id < num_possible_nodes()``; ``-1`` clears any forced preference.
+
+    Returns EINVAL for an illegal node ID.
+
+arg5:
+    userspace pointer to an integer for returning the Node ID from
+    ``PR_PREFERRED_NID_GET``. Should be 0 for all other commands.
+
+The caller must have ptrace access mode ``PTRACE_MODE_READ_REALCREDS`` to the
+target process to get or set its preferred node ID; otherwise EPERM is returned.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 373046c82b35..8054fd37acdc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2261,6 +2261,15 @@ static inline void sched_core_fork(struct task_struct *p) { }
 static inline int sched_core_idle_cpu(int cpu) { return idle_cpu(cpu); }
 #endif
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Change a task's numa_preferred_nid */
+int prctl_chg_pref_nid(unsigned long cmd, pid_t pid, int nid,
+		       unsigned long uaddr);
+#else
+static inline int prctl_chg_pref_nid(unsigned long cmd, pid_t pid, int nid,
+				     unsigned long uaddr) { return -ERANGE; }
+#endif
+
 extern void sched_set_stop_task(int cpu, struct task_struct *stop);
 
 #ifdef CONFIG_MEM_ALLOC_PROFILING
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 15c18ef4eb11..e8a47777aeb2 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -364,4 +364,12 @@ struct prctl_mm_map {
 # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
 # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
 
+/*
+ * Set or get a task's numa_preferred_nid
+ */
+#define PR_PREFERRED_NID		78
+# define PR_PREFERRED_NID_GET		0
+# define PR_PREFERRED_NID_SET		1
+# define PR_PREFERRED_NID_CMD_MAX	2
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 26781452c636..81f613f2b037 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -49,6 +49,7 @@
 #include <linux/ratelimit.h>
 #include <linux/task_work.h>
 #include <linux/rbtree_augmented.h>
+#include <linux/prctl.h>
 
 #include <asm/switch_to.h>
 
@@ -3670,6 +3671,69 @@ static void update_scan_period(struct task_struct *p, int new_cpu)
 	p->numa_scan_period = task_scan_start(p);
 }
 
+/*
+ * Enable setting task->numa_preferred_nid directly
+ */
+int prctl_chg_pref_nid(unsigned long cmd, pid_t pid, int nid,
+		       unsigned long uaddr)
+{
+	struct task_struct *task;
+	struct rq_flags rf;
+	struct rq *rq;
+	int err = 0;
+
+	if (cmd >= PR_PREFERRED_NID_CMD_MAX)
+		return -ERANGE;
+
+	rcu_read_lock();
+	if (pid == 0) {
+		task = current;
+	} else {
+		task = find_task_by_vpid((pid_t)pid);
+		if (!task) {
+			rcu_read_unlock();
+			return -ESRCH;
+		}
+	}
+	get_task_struct(task);
+	rcu_read_unlock();
+
+	/*
+	 * Check if this process has the right to modify the specified
+	 * process. Use the regular "ptrace_may_access()" checks.
+	 */
+	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	switch (cmd) {
+	case PR_PREFERRED_NID_GET:
+		if (uaddr & 0x3) {
+			err = -EINVAL;
+			goto out;
+		}
+		err = put_user(task->numa_preferred_nid_force,
+			       (int __user *)uaddr);
+		break;
+
+	case PR_PREFERRED_NID_SET:
+		if (!(-1 <= nid && nid < num_possible_nodes())) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		rq = task_rq_lock(task, &rf);
+		task->numa_preferred_nid_force = nid;
+		task_rq_unlock(rq, task, &rf);
+		sched_setnuma(task, nid);
+		break;
+	}
+
+out:
+	put_task_struct(task);
+	return err;
+}
 #else
 static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
diff --git a/kernel/sys.c b/kernel/sys.c
index c434968e9f5d..20629a3267b1 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2746,6 +2746,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_SCHED_CORE:
 		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
 		break;
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	case PR_PREFERRED_NID:
+		error = prctl_chg_pref_nid(arg2, arg3, arg4, arg5);
+		break;
 #endif
 	case PR_SET_MDWE:
 		error = prctl_set_mdwe(arg2, arg3, arg4, arg5);
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 35791791a879..937160e3a77a 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -328,4 +328,10 @@ struct prctl_mm_map {
 # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC	0x10 /* Clear the aspect on exec */
 # define PR_PPC_DEXCR_CTRL_MASK		0x1f
 
+/* Set or get a task's numa_preferred_nid
+ */
+#define PR_PREFERRED_NID		78
+# define PR_PREFERRED_NID_GET		0
+# define PR_PREFERRED_NID_SET		1
+# define PR_PREFERRED_NID_CMD_MAX	2
 #endif /* _LINUX_PRCTL_H */
-- 
2.43.5



