* [PATCH 11/19] kthread: Make sure kthread hasn't started while binding it
       [not found] <20240916224925.20540-1-frederic@kernel.org>
@ 2024-09-16 22:49 ` Frederic Weisbecker
  2024-09-16 22:49 ` [PATCH 12/19] kthread: Default affine kthread to its preferred NUMA node Frederic Weisbecker
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2024-09-16 22:49 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Kees Cook, Peter Zijlstra,
	Thomas Gleixner, Michal Hocko, Vlastimil Babka, linux-mm,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Boqun Feng,
	Zqiang, rcu

Make sure that, when kthread_bind[_mask]() is called, the kthread is still
sleeping in the schedule_preempt_disabled() call that precedes the
invocation of its handler. This provides a sanity check verifying that the
task is not blocked later at some arbitrary point within its function
handler, in which case it could be concurrently awakened, leaving the call
to do_set_cpus_allowed() without any effect until the next voluntary sleep.
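
As an illustration, here is a minimal usage sketch of the intended calling
convention (the thread function and kthread name are made up for this
example):

	/*
	 * Sketch only: binding must happen after kthread creation and
	 * before the first wake-up, while the kthread still sleeps in
	 * schedule_preempt_disabled(). Binding it any later would trigger
	 * the new WARN_ON_ONCE(kthread->started) check.
	 */
	struct task_struct *t;

	t = kthread_create(example_threadfn, NULL, "example_kthread");
	if (!IS_ERR(t)) {
		kthread_bind_mask(t, housekeeping_cpumask(HK_TYPE_KTHREAD));
		wake_up_process(t);
	}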

Rely on the wake-up ordering to ensure that the newly introduced "started"
field returns the expected value:

    TASK A                                   TASK B
    ------                                   ------
READ kthread->started
wake_up_process(B)
   rq_lock()
   ...
   rq_unlock() // RELEASE
                                           schedule()
                                              rq_lock() // ACQUIRE
                                              // schedule task B
                                              rq_unlock()
                                              WRITE kthread->started

Similarly, the write to kthread->started performed before subsequent
voluntary sleeps will be visible after the wait_task_inactive() call in
__kthread_bind_mask() returns, so that potential misuse of the API gets
reported.

Upcoming patches will make further use of this facility.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/kthread.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index f7be976ff88a..ecb719f54f7a 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -53,6 +53,7 @@ struct kthread_create_info
 struct kthread {
 	unsigned long flags;
 	unsigned int cpu;
+	int started;
 	int result;
 	int (*threadfn)(void *);
 	void *data;
@@ -382,6 +383,8 @@ static int kthread(void *_create)
 	schedule_preempt_disabled();
 	preempt_enable();
 
+	self->started = 1;
+
 	ret = -EINTR;
 	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
 		cgroup_kthread_ready();
@@ -540,7 +543,9 @@ static void __kthread_bind(struct task_struct *p, unsigned int cpu, unsigned int
 
 void kthread_bind_mask(struct task_struct *p, const struct cpumask *mask)
 {
+	struct kthread *kthread = to_kthread(p);
 	__kthread_bind_mask(p, mask, TASK_UNINTERRUPTIBLE);
+	WARN_ON_ONCE(kthread->started);
 }
 
 /**
@@ -554,7 +559,9 @@ void kthread_bind_mask(struct task_struct *p, const struct cpumask *mask)
  */
 void kthread_bind(struct task_struct *p, unsigned int cpu)
 {
+	struct kthread *kthread = to_kthread(p);
 	__kthread_bind(p, cpu, TASK_UNINTERRUPTIBLE);
+	WARN_ON_ONCE(kthread->started);
 }
 EXPORT_SYMBOL(kthread_bind);
 
-- 
2.46.0




* [PATCH 12/19] kthread: Default affine kthread to its preferred NUMA node
       [not found] <20240916224925.20540-1-frederic@kernel.org>
  2024-09-16 22:49 ` [PATCH 11/19] kthread: Make sure kthread hasn't started while binding it Frederic Weisbecker
@ 2024-09-16 22:49 ` Frederic Weisbecker
  2024-09-17  6:26   ` Michal Hocko
  2024-09-16 22:49 ` [PATCH 13/19] mm: Create/affine kcompactd to its preferred node Frederic Weisbecker
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Frederic Weisbecker @ 2024-09-16 22:49 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Kees Cook, Peter Zijlstra,
	Thomas Gleixner, Michal Hocko, Vlastimil Babka, linux-mm,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Boqun Feng,
	Zqiang, rcu

Kthreads attached to a preferred NUMA node for their task structure
allocation can also be assumed to run preferably within that same node.

A more precise affinity is usually specified by calling
kthread_create_on_cpu() or kthread_bind[_mask]() before the first wake-up.

For the others, a default affinity to the node is desired and sometimes
implemented with more or less success when it comes to dealing with hotplug
events and nohz_full / CPU Isolation interactions:

- kcompactd is affine to its node and handles hotplug but not CPU Isolation
- kswapd is affine to its node and ignores both hotplug and CPU Isolation
- A bunch of drivers create their kthreads on a specific node and
  don't take care of affining them any further.

Handle that default node affinity preference at the generic level
instead, provided a kthread is created on an actual node and doesn't
apply any specific affinity such as a given CPU or a custom cpumask to
bind to before its first wake-up.

This generic handling is aware of CPU hotplug events and CPU isolation
such that:

* When a housekeeping CPU goes up and is part of the node of a given
  kthread, it is added to its applied affinity set (and
  possibly the default last resort online housekeeping set is removed
  from the set).

* When a housekeeping CPU goes down while it was part of the node of a
  kthread, it is removed from the kthread's applied
  affinity. The last resort is to affine the kthread to all online
  housekeeping CPUs.
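
For illustration, here is a minimal sketch (the thread function, data and
names are made up for this example) of a creation path that now inherits
the node affinity by default:

	/*
	 * No kthread_bind*() call before the first wake-up: the generic
	 * code affines the kthread to the housekeeping CPUs of node nid,
	 * and the new CPU hotplug callbacks keep that affinity up to date.
	 */
	struct task_struct *t;

	t = kthread_create_on_node(example_threadfn, example_data, nid,
				   "example_worker/%d", nid);
	if (!IS_ERR(t))
		wake_up_process(t);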

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/cpuhotplug.h |   1 +
 kernel/kthread.c           | 120 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 120 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 9316c39260e0..89d852538b72 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -240,6 +240,7 @@ enum cpuhp_state {
 	CPUHP_AP_WORKQUEUE_ONLINE,
 	CPUHP_AP_RANDOM_ONLINE,
 	CPUHP_AP_RCUTREE_ONLINE,
+	CPUHP_AP_KTHREADS_ONLINE,
 	CPUHP_AP_BASE_CACHEINFO_ONLINE,
 	CPUHP_AP_ONLINE_DYN,
 	CPUHP_AP_ONLINE_DYN_END		= CPUHP_AP_ONLINE_DYN + 40,
diff --git a/kernel/kthread.c b/kernel/kthread.c
index ecb719f54f7a..eee5925e7725 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -35,6 +35,10 @@ static DEFINE_SPINLOCK(kthread_create_lock);
 static LIST_HEAD(kthread_create_list);
 struct task_struct *kthreadd_task;
 
+static struct cpumask kthread_online_mask;
+static LIST_HEAD(kthreads_hotplug);
+static DEFINE_MUTEX(kthreads_hotplug_lock);
+
 struct kthread_create_info
 {
 	/* Information passed to kthread() from kthreadd. */
@@ -53,6 +57,7 @@ struct kthread_create_info
 struct kthread {
 	unsigned long flags;
 	unsigned int cpu;
+	unsigned int node;
 	int started;
 	int result;
 	int (*threadfn)(void *);
@@ -64,6 +69,8 @@ struct kthread {
 #endif
 	/* To store the full name if task comm is truncated. */
 	char *full_name;
+	struct task_struct *task;
+	struct list_head hotplug_node;
 };
 
 enum KTHREAD_BITS {
@@ -122,8 +129,11 @@ bool set_kthread_struct(struct task_struct *p)
 
 	init_completion(&kthread->exited);
 	init_completion(&kthread->parked);
+	INIT_LIST_HEAD(&kthread->hotplug_node);
 	p->vfork_done = &kthread->exited;
 
+	kthread->task = p;
+	kthread->node = tsk_fork_get_node(current);
 	p->worker_private = kthread;
 	return true;
 }
@@ -314,6 +324,13 @@ void __noreturn kthread_exit(long result)
 {
 	struct kthread *kthread = to_kthread(current);
 	kthread->result = result;
+	if (!list_empty(&kthread->hotplug_node)) {
+		mutex_lock(&kthreads_hotplug_lock);
+		list_del(&kthread->hotplug_node);
+		/* Make sure the kthread never gets re-affined globally */
+		set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));
+		mutex_unlock(&kthreads_hotplug_lock);
+	}
 	do_exit(0);
 }
 EXPORT_SYMBOL(kthread_exit);
@@ -339,6 +356,45 @@ void __noreturn kthread_complete_and_exit(struct completion *comp, long code)
 }
 EXPORT_SYMBOL(kthread_complete_and_exit);
 
+static void kthread_fetch_affinity(struct kthread *k, struct cpumask *mask)
+{
+	if (k->node == NUMA_NO_NODE) {
+		cpumask_copy(mask, housekeeping_cpumask(HK_TYPE_KTHREAD));
+	} else {
+		/*
+		 * The node cpumask is racy when read from kthread() but:
+		 * - a racing CPU going down won't be present in kthread_online_mask
+		 * - a racing CPU going up will be handled by kthreads_online_cpu()
+		 */
+		cpumask_and(mask, cpumask_of_node(k->node), &kthread_online_mask);
+		cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_KTHREAD));
+		if (cpumask_empty(mask))
+			cpumask_copy(mask, housekeeping_cpumask(HK_TYPE_KTHREAD));
+	}
+}
+
+static int kthread_affine_node(void)
+{
+	struct kthread *kthread = to_kthread(current);
+	cpumask_var_t affinity;
+
+	WARN_ON_ONCE(kthread_is_per_cpu(current));
+
+	if (!zalloc_cpumask_var(&affinity, GFP_KERNEL))
+		return -ENOMEM;
+
+	mutex_lock(&kthreads_hotplug_lock);
+	WARN_ON_ONCE(!list_empty(&kthread->hotplug_node));
+	list_add_tail(&kthread->hotplug_node, &kthreads_hotplug);
+	kthread_fetch_affinity(kthread, affinity);
+	set_cpus_allowed_ptr(current, affinity);
+	mutex_unlock(&kthreads_hotplug_lock);
+
+	free_cpumask_var(affinity);
+
+	return 0;
+}
+
 static int kthread(void *_create)
 {
 	static const struct sched_param param = { .sched_priority = 0 };
@@ -369,7 +425,6 @@ static int kthread(void *_create)
 	 * back to default in case they have been changed.
 	 */
 	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
-	set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));
 
 	/* OK, tell user we're spawned, wait for stop or wakeup */
 	__set_current_state(TASK_UNINTERRUPTIBLE);
@@ -385,6 +440,9 @@ static int kthread(void *_create)
 
 	self->started = 1;
 
+	if (!(current->flags & PF_NO_SETAFFINITY))
+		kthread_affine_node();
+
 	ret = -EINTR;
 	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
 		cgroup_kthread_ready();
@@ -779,6 +837,66 @@ int kthreadd(void *unused)
 	return 0;
 }
 
+static int kthreads_hotplug_update(void)
+{
+	cpumask_var_t affinity;
+	struct kthread *k;
+	int err;
+
+	if (list_empty(&kthreads_hotplug))
+		return 0;
+
+	if (!zalloc_cpumask_var(&affinity, GFP_KERNEL))
+		return -ENOMEM;
+
+	err = 0;
+
+	list_for_each_entry(k, &kthreads_hotplug, hotplug_node) {
+		if (WARN_ON_ONCE((k->task->flags & PF_NO_SETAFFINITY) ||
+				 kthread_is_per_cpu(k->task))) {
+			err = -EINVAL;
+			continue;
+		}
+		kthread_fetch_affinity(k, affinity);
+		set_cpus_allowed_ptr(k->task, affinity);
+	}
+
+	free_cpumask_var(affinity);
+
+	return err;
+}
+
+static int kthreads_offline_cpu(unsigned int cpu)
+{
+	int ret = 0;
+
+	mutex_lock(&kthreads_hotplug_lock);
+	cpumask_clear_cpu(cpu, &kthread_online_mask);
+	ret = kthreads_hotplug_update();
+	mutex_unlock(&kthreads_hotplug_lock);
+
+	return ret;
+}
+
+static int kthreads_online_cpu(unsigned int cpu)
+{
+	int ret = 0;
+
+	mutex_lock(&kthreads_hotplug_lock);
+	cpumask_set_cpu(cpu, &kthread_online_mask);
+	ret = kthreads_hotplug_update();
+	mutex_unlock(&kthreads_hotplug_lock);
+
+	return ret;
+}
+
+static int kthreads_init(void)
+{
+	return cpuhp_setup_state(CPUHP_AP_KTHREADS_ONLINE, "kthreads:online",
+				kthreads_online_cpu, kthreads_offline_cpu);
+}
+early_initcall(kthreads_init);
+
 void __kthread_init_worker(struct kthread_worker *worker,
 				const char *name,
 				struct lock_class_key *key)
-- 
2.46.0




* [PATCH 13/19] mm: Create/affine kcompactd to its preferred node
       [not found] <20240916224925.20540-1-frederic@kernel.org>
  2024-09-16 22:49 ` [PATCH 11/19] kthread: Make sure kthread hasn't started while binding it Frederic Weisbecker
  2024-09-16 22:49 ` [PATCH 12/19] kthread: Default affine kthread to its preferred NUMA node Frederic Weisbecker
@ 2024-09-16 22:49 ` Frederic Weisbecker
  2024-09-17  6:04   ` Michal Hocko
  2024-09-16 22:49 ` [PATCH 14/19] mm: Create/affine kswapd " Frederic Weisbecker
  2024-09-16 22:49 ` [PATCH 15/19] kthread: Implement preferred affinity Frederic Weisbecker
  4 siblings, 1 reply; 16+ messages in thread
From: Frederic Weisbecker @ 2024-09-16 22:49 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Michal Hocko, Vlastimil Babka,
	Andrew Morton, linux-mm, Peter Zijlstra, Thomas Gleixner

Kcompactd is dedicated to a specific node. As such it wants to be
preferably affine to it, memory and CPUs-wise.

Use the proper kthread API to achieve that. As a bonus, the kthread core
takes care of CPU-hotplug events and CPU isolation on its behalf.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 mm/compaction.c | 43 +++----------------------------------------
 1 file changed, 3 insertions(+), 40 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index eb95e9b435d0..69742555f2e5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -3179,15 +3179,9 @@ void wakeup_kcompactd(pg_data_t *pgdat, int order, int highest_zoneidx)
 static int kcompactd(void *p)
 {
 	pg_data_t *pgdat = (pg_data_t *)p;
-	struct task_struct *tsk = current;
 	long default_timeout = msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC);
 	long timeout = default_timeout;
 
-	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
-
-	if (!cpumask_empty(cpumask))
-		set_cpus_allowed_ptr(tsk, cpumask);
-
 	set_freezable();
 
 	pgdat->kcompactd_max_order = 0;
@@ -3258,10 +3252,12 @@ void __meminit kcompactd_run(int nid)
 	if (pgdat->kcompactd)
 		return;
 
-	pgdat->kcompactd = kthread_run(kcompactd, pgdat, "kcompactd%d", nid);
+	pgdat->kcompactd = kthread_create_on_node(kcompactd, pgdat, nid, "kcompactd%d", nid);
 	if (IS_ERR(pgdat->kcompactd)) {
 		pr_err("Failed to start kcompactd on node %d\n", nid);
 		pgdat->kcompactd = NULL;
+	} else {
+		wake_up_process(pgdat->kcompactd);
 	}
 }
 
@@ -3279,30 +3275,6 @@ void __meminit kcompactd_stop(int nid)
 	}
 }
 
-/*
- * It's optimal to keep kcompactd on the same CPUs as their memory, but
- * not required for correctness. So if the last cpu in a node goes
- * away, we get changed to run anywhere: as the first one comes back,
- * restore their cpu bindings.
- */
-static int kcompactd_cpu_online(unsigned int cpu)
-{
-	int nid;
-
-	for_each_node_state(nid, N_MEMORY) {
-		pg_data_t *pgdat = NODE_DATA(nid);
-		const struct cpumask *mask;
-
-		mask = cpumask_of_node(pgdat->node_id);
-
-		if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
-			/* One of our CPUs online: restore mask */
-			if (pgdat->kcompactd)
-				set_cpus_allowed_ptr(pgdat->kcompactd, mask);
-	}
-	return 0;
-}
-
 static int proc_dointvec_minmax_warn_RT_change(const struct ctl_table *table,
 		int write, void *buffer, size_t *lenp, loff_t *ppos)
 {
@@ -3362,15 +3334,6 @@ static struct ctl_table vm_compaction[] = {
 static int __init kcompactd_init(void)
 {
 	int nid;
-	int ret;
-
-	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
-					"mm/compaction:online",
-					kcompactd_cpu_online, NULL);
-	if (ret < 0) {
-		pr_err("kcompactd: failed to register hotplug callbacks.\n");
-		return ret;
-	}
 
 	for_each_node_state(nid, N_MEMORY)
 		kcompactd_run(nid);
-- 
2.46.0




* [PATCH 14/19] mm: Create/affine kswapd to its preferred node
       [not found] <20240916224925.20540-1-frederic@kernel.org>
                   ` (2 preceding siblings ...)
  2024-09-16 22:49 ` [PATCH 13/19] mm: Create/affine kcompactd to its preferred node Frederic Weisbecker
@ 2024-09-16 22:49 ` Frederic Weisbecker
  2024-09-17  6:05   ` Michal Hocko
  2024-09-16 22:49 ` [PATCH 15/19] kthread: Implement preferred affinity Frederic Weisbecker
  4 siblings, 1 reply; 16+ messages in thread
From: Frederic Weisbecker @ 2024-09-16 22:49 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Michal Hocko, Vlastimil Babka, linux-mm,
	Andrew Morton, Peter Zijlstra, Thomas Gleixner

kswapd is dedicated to a specific node. As such it wants to be
preferably affine to it, memory and CPUs-wise.

Use the proper kthread API to achieve that. As a bonus, the kthread core
takes care of CPU-hotplug events and CPU isolation on its behalf.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 mm/vmscan.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd489c1af228..00a7f1e92447 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7139,10 +7139,6 @@ static int kswapd(void *p)
 	unsigned int highest_zoneidx = MAX_NR_ZONES - 1;
 	pg_data_t *pgdat = (pg_data_t *)p;
 	struct task_struct *tsk = current;
-	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
-
-	if (!cpumask_empty(cpumask))
-		set_cpus_allowed_ptr(tsk, cpumask);
 
 	/*
 	 * Tell the memory management that we're a "memory allocator",
@@ -7311,13 +7307,15 @@ void __meminit kswapd_run(int nid)
 
 	pgdat_kswapd_lock(pgdat);
 	if (!pgdat->kswapd) {
-		pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
+		pgdat->kswapd = kthread_create_on_node(kswapd, pgdat, nid, "kswapd%d", nid);
 		if (IS_ERR(pgdat->kswapd)) {
 			/* failure at boot is fatal */
 			pr_err("Failed to start kswapd on node %d,ret=%ld\n",
 				   nid, PTR_ERR(pgdat->kswapd));
 			BUG_ON(system_state < SYSTEM_RUNNING);
 			pgdat->kswapd = NULL;
+		} else {
+			wake_up_process(pgdat->kswapd);
 		}
 	}
 	pgdat_kswapd_unlock(pgdat);
-- 
2.46.0




* [PATCH 15/19] kthread: Implement preferred affinity
       [not found] <20240916224925.20540-1-frederic@kernel.org>
                   ` (3 preceding siblings ...)
  2024-09-16 22:49 ` [PATCH 14/19] mm: Create/affine kswapd " Frederic Weisbecker
@ 2024-09-16 22:49 ` Frederic Weisbecker
  4 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2024-09-16 22:49 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Kees Cook, Peter Zijlstra,
	Thomas Gleixner, Michal Hocko, Vlastimil Babka, linux-mm,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Boqun Feng,
	Zqiang, rcu

Affining kthreads follows one of four existing patterns:

1) Per-CPU kthreads must stay affine to a single CPU and never execute
   relevant code on any other CPU. This is currently handled by smpboot
   code which takes care of CPU-hotplug operations.

2) Kthreads that _have_ to be affine to a specific set of CPUs and can't
   run anywhere else. The affinity is set through kthread_bind_mask()
   and the subsystem takes care by itself to handle CPU-hotplug operations.

3) Kthreads that prefer to be affine to a specific NUMA node. That
   preferred affinity is applied by default when an actual node ID is
   passed on kthread creation, provided the kthread is not per-CPU and
   no call to kthread_bind_mask() has been issued before the first
   wake-up.

4) Similar to the previous point but kthreads have a preferred affinity
   different from a node. It is set manually like for any other task and
   CPU-hotplug is supposed to be handled by the relevant subsystem so
   that the task is properly reaffined whenever a given CPU from the
   preferred affinity comes up or goes down. Also care must be taken so
   that the preferred affinity doesn't cross housekeeping cpumask
   boundaries.

Provide a function to handle the last use case, mostly reusing the
current node default affinity infrastructure. kthread_affine_preferred()
is introduced, to be used just like kthread_bind_mask(), right after
kthread creation and before the first wake-up. The kthread is then
affined right away to the cpumask passed through the API if it contains
online housekeeping CPUs. Otherwise it is affined to all online
housekeeping CPUs as a last resort.
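
A minimal usage sketch (the thread function, kthread name and preferred
cpumask are made up for this example), mirroring the kthread_bind_mask()
calling convention:

	static struct cpumask example_pref_mask;	/* hypothetical preferred set */
	struct task_struct *t;

	cpumask_set_cpu(2, &example_pref_mask);
	cpumask_set_cpu(3, &example_pref_mask);

	t = kthread_create(example_threadfn, NULL, "example_kthread");
	if (!IS_ERR(t)) {
		/* Preferred, not mandatory: hotplug and isolation handled generically */
		kthread_affine_preferred(t, &example_pref_mask);
		wake_up_process(t);
	}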

As with node affinity, it is aware of CPU hotplug events such that:

* When a housekeeping CPU goes up and is part of the preferred affinity
  of a given kthread, it is added to its applied affinity set (and
  possibly the default last resort online housekeeping set is removed
  from the set).

* When a housekeeping CPU goes down while it was part of the preferred
  affinity of a kthread, it is removed from the kthread's applied
  affinity. The last resort is to affine the kthread to all online
  housekeeping CPUs.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/kthread.h |  1 +
 kernel/kthread.c        | 69 ++++++++++++++++++++++++++++++++++++-----
 2 files changed, 62 insertions(+), 8 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index b11f53c1ba2e..30209bdf83a2 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -85,6 +85,7 @@ kthread_run_on_cpu(int (*threadfn)(void *data), void *data,
 void free_kthread_struct(struct task_struct *k);
 void kthread_bind(struct task_struct *k, unsigned int cpu);
 void kthread_bind_mask(struct task_struct *k, const struct cpumask *mask);
+int kthread_affine_preferred(struct task_struct *p, const struct cpumask *mask);
 int kthread_stop(struct task_struct *k);
 int kthread_stop_put(struct task_struct *k);
 bool kthread_should_stop(void);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index eee5925e7725..e4ffc776928a 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -71,6 +71,7 @@ struct kthread {
 	char *full_name;
 	struct task_struct *task;
 	struct list_head hotplug_node;
+	struct cpumask *preferred_affinity;
 };
 
 enum KTHREAD_BITS {
@@ -330,6 +331,11 @@ void __noreturn kthread_exit(long result)
 		/* Make sure the kthread never gets re-affined globally */
 		set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));
 		mutex_unlock(&kthreads_hotplug_lock);
+
+		if (kthread->preferred_affinity) {
+			kfree(kthread->preferred_affinity);
+			kthread->preferred_affinity = NULL;
+		}
 	}
 	do_exit(0);
 }
@@ -358,19 +364,25 @@ EXPORT_SYMBOL(kthread_complete_and_exit);
 
 static void kthread_fetch_affinity(struct kthread *k, struct cpumask *mask)
 {
-	if (k->node == NUMA_NO_NODE) {
-		cpumask_copy(mask, housekeeping_cpumask(HK_TYPE_KTHREAD));
-	} else {
+	const struct cpumask *pref;
+
+	if (k->preferred_affinity) {
+		pref = k->preferred_affinity;
+	} else if (k->node != NUMA_NO_NODE) {
 		/*
 		 * The node cpumask is racy when read from kthread() but:
 		 * - a racing CPU going down won't be present in kthread_online_mask
 		 * - a racing CPU going up will be handled by kthreads_online_cpu()
 		 */
-		cpumask_and(mask, cpumask_of_node(k->node), &kthread_online_mask);
-		cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_KTHREAD));
-		if (cpumask_empty(mask))
-			cpumask_copy(mask, housekeeping_cpumask(HK_TYPE_KTHREAD));
+		pref = cpumask_of_node(k->node);
+	} else {
+		pref = housekeeping_cpumask(HK_TYPE_KTHREAD);
 	}
+
+	cpumask_and(mask, pref, &kthread_online_mask);
+	cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_KTHREAD));
+	if (cpumask_empty(mask))
+		cpumask_copy(mask, housekeeping_cpumask(HK_TYPE_KTHREAD));
 }
 
 static int kthread_affine_node(void)
@@ -440,7 +452,7 @@ static int kthread(void *_create)
 
 	self->started = 1;
 
-	if (!(current->flags & PF_NO_SETAFFINITY))
+	if (!(current->flags & PF_NO_SETAFFINITY) && !self->preferred_affinity)
 		kthread_affine_node();
 
 	ret = -EINTR;
@@ -837,6 +849,47 @@ int kthreadd(void *unused)
 	return 0;
 }
 
+int kthread_affine_preferred(struct task_struct *p, const struct cpumask *mask)
+{
+	struct kthread *kthread = to_kthread(p);
+	cpumask_var_t affinity;
+	unsigned long flags;
+	int ret = 0;
+
+	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE) || kthread->started) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+
+	WARN_ON_ONCE(kthread->preferred_affinity);
+
+	if (!zalloc_cpumask_var(&affinity, GFP_KERNEL))
+		return -ENOMEM;
+
+	kthread->preferred_affinity = kzalloc(sizeof(struct cpumask), GFP_KERNEL);
+	if (!kthread->preferred_affinity) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	mutex_lock(&kthreads_hotplug_lock);
+	cpumask_copy(kthread->preferred_affinity, mask);
+	WARN_ON_ONCE(!list_empty(&kthread->hotplug_node));
+	list_add_tail(&kthread->hotplug_node, &kthreads_hotplug);
+	kthread_fetch_affinity(kthread, affinity);
+
+	/* It's safe because the task is inactive. */
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	do_set_cpus_allowed(p, affinity);
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+
+	mutex_unlock(&kthreads_hotplug_lock);
+out:
+	free_cpumask_var(affinity);
+
+	return ret;
+}
+
 static int kthreads_hotplug_update(void)
 {
 	cpumask_var_t affinity;
-- 
2.46.0




* Re: [PATCH 13/19] mm: Create/affine kcompactd to its preferred node
  2024-09-16 22:49 ` [PATCH 13/19] mm: Create/affine kcompactd to its preferred node Frederic Weisbecker
@ 2024-09-17  6:04   ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2024-09-17  6:04 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Vlastimil Babka, Andrew Morton, linux-mm, Peter Zijlstra,
	Thomas Gleixner

On Tue 17-09-24 00:49:17, Frederic Weisbecker wrote:
> Kcompactd is dedicated to a specific node. As such it wants to be
> preferrably affine to it, memory and CPUs-wise.
> 
> Use the proper kthread API to achieve that. As a bonus it takes care of
> CPU-hotplug events and CPU-isolation on its behalf.
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>
Clear simplification, thanks!

> ---
>  mm/compaction.c | 43 +++----------------------------------------
>  1 file changed, 3 insertions(+), 40 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index eb95e9b435d0..69742555f2e5 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -3179,15 +3179,9 @@ void wakeup_kcompactd(pg_data_t *pgdat, int order, int highest_zoneidx)
>  static int kcompactd(void *p)
>  {
>  	pg_data_t *pgdat = (pg_data_t *)p;
> -	struct task_struct *tsk = current;
>  	long default_timeout = msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC);
>  	long timeout = default_timeout;
>  
> -	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> -
> -	if (!cpumask_empty(cpumask))
> -		set_cpus_allowed_ptr(tsk, cpumask);
> -
>  	set_freezable();
>  
>  	pgdat->kcompactd_max_order = 0;
> @@ -3258,10 +3252,12 @@ void __meminit kcompactd_run(int nid)
>  	if (pgdat->kcompactd)
>  		return;
>  
> -	pgdat->kcompactd = kthread_run(kcompactd, pgdat, "kcompactd%d", nid);
> +	pgdat->kcompactd = kthread_create_on_node(kcompactd, pgdat, nid, "kcompactd%d", nid);
>  	if (IS_ERR(pgdat->kcompactd)) {
>  		pr_err("Failed to start kcompactd on node %d\n", nid);
>  		pgdat->kcompactd = NULL;
> +	} else {
> +		wake_up_process(pgdat->kcompactd);
>  	}
>  }
>  
> @@ -3279,30 +3275,6 @@ void __meminit kcompactd_stop(int nid)
>  	}
>  }
>  
> -/*
> - * It's optimal to keep kcompactd on the same CPUs as their memory, but
> - * not required for correctness. So if the last cpu in a node goes
> - * away, we get changed to run anywhere: as the first one comes back,
> - * restore their cpu bindings.
> - */
> -static int kcompactd_cpu_online(unsigned int cpu)
> -{
> -	int nid;
> -
> -	for_each_node_state(nid, N_MEMORY) {
> -		pg_data_t *pgdat = NODE_DATA(nid);
> -		const struct cpumask *mask;
> -
> -		mask = cpumask_of_node(pgdat->node_id);
> -
> -		if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
> -			/* One of our CPUs online: restore mask */
> -			if (pgdat->kcompactd)
> -				set_cpus_allowed_ptr(pgdat->kcompactd, mask);
> -	}
> -	return 0;
> -}
> -
>  static int proc_dointvec_minmax_warn_RT_change(const struct ctl_table *table,
>  		int write, void *buffer, size_t *lenp, loff_t *ppos)
>  {
> @@ -3362,15 +3334,6 @@ static struct ctl_table vm_compaction[] = {
>  static int __init kcompactd_init(void)
>  {
>  	int nid;
> -	int ret;
> -
> -	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> -					"mm/compaction:online",
> -					kcompactd_cpu_online, NULL);
> -	if (ret < 0) {
> -		pr_err("kcompactd: failed to register hotplug callbacks.\n");
> -		return ret;
> -	}
>  
>  	for_each_node_state(nid, N_MEMORY)
>  		kcompactd_run(nid);
> -- 
> 2.46.0

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 14/19] mm: Create/affine kswapd to its preferred node
  2024-09-16 22:49 ` [PATCH 14/19] mm: Create/affine kswapd " Frederic Weisbecker
@ 2024-09-17  6:05   ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2024-09-17  6:05 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Vlastimil Babka, linux-mm, Andrew Morton, Peter Zijlstra,
	Thomas Gleixner

On Tue 17-09-24 00:49:18, Frederic Weisbecker wrote:
> kswapd is dedicated to a specific node. As such it wants to be
> preferrably affine to it, memory and CPUs-wise.
> 
> Use the proper kthread API to achieve that. As a bonus it takes care of
> CPU-hotplug events and CPU-isolation on its behalf.
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/vmscan.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bd489c1af228..00a7f1e92447 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -7139,10 +7139,6 @@ static int kswapd(void *p)
>  	unsigned int highest_zoneidx = MAX_NR_ZONES - 1;
>  	pg_data_t *pgdat = (pg_data_t *)p;
>  	struct task_struct *tsk = current;
> -	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> -
> -	if (!cpumask_empty(cpumask))
> -		set_cpus_allowed_ptr(tsk, cpumask);
>  
>  	/*
>  	 * Tell the memory management that we're a "memory allocator",
> @@ -7311,13 +7307,15 @@ void __meminit kswapd_run(int nid)
>  
>  	pgdat_kswapd_lock(pgdat);
>  	if (!pgdat->kswapd) {
> -		pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
> +		pgdat->kswapd = kthread_create_on_node(kswapd, pgdat, nid, "kswapd%d", nid);
>  		if (IS_ERR(pgdat->kswapd)) {
>  			/* failure at boot is fatal */
>  			pr_err("Failed to start kswapd on node %d,ret=%ld\n",
>  				   nid, PTR_ERR(pgdat->kswapd));
>  			BUG_ON(system_state < SYSTEM_RUNNING);
>  			pgdat->kswapd = NULL;
> +		} else {
> +			wake_up_process(pgdat->kswapd);
>  		}
>  	}
>  	pgdat_kswapd_unlock(pgdat);
> -- 
> 2.46.0

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 12/19] kthread: Default affine kthread to its preferred NUMA node
  2024-09-16 22:49 ` [PATCH 12/19] kthread: Default affine kthread to its preferred NUMA node Frederic Weisbecker
@ 2024-09-17  6:26   ` Michal Hocko
  2024-09-17  7:01     ` Vlastimil Babka
  2024-09-17 10:34     ` Frederic Weisbecker
  0 siblings, 2 replies; 16+ messages in thread
From: Michal Hocko @ 2024-09-17  6:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Kees Cook, Peter Zijlstra, Thomas Gleixner,
	Vlastimil Babka, linux-mm, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Boqun Feng, Zqiang, rcu

On Tue 17-09-24 00:49:16, Frederic Weisbecker wrote:
> Kthreads attached to a preferred NUMA node for their task structure
> allocation can also be assumed to run preferrably within that same node.
> 
> A more precise affinity is usually notified by calling
> kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.
> 
> For the others, a default affinity to the node is desired and sometimes
> implemented with more or less success when it comes to deal with hotplug
> events and nohz_full / CPU Isolation interactions:
> 
> - kcompactd is affine to its node and handles hotplug but not CPU Isolation
> - kswapd is affine to its node and ignores hotplug and CPU Isolation
> - A bunch of drivers create their kthreads on a specific node and
>   don't take care about affining further.
> 
> Handle that default node affinity preference at the generic level
> instead, provided a kthread is created on an actual node and doesn't
> apply any specific affinity such as a given CPU or a custom cpumask to
> bind to before its first wake-up.

Makes sense.

> This generic handling is aware of CPU hotplug events and CPU isolation
> such that:
> 
> * When a housekeeping CPU goes up and is part of the node of a given
>   kthread, it is added to its applied affinity set (and
>   possibly the default last resort online housekeeping set is removed
>   from the set).
> 
> * When a housekeeping CPU goes down while it was part of the node of a
>   kthread, it is removed from the kthread's applied
>   affinity. The last resort is to affine the kthread to all online
>   housekeeping CPUs.

But I am not really sure about this part. Sure it makes sense to set the
affinity to exclude isolated CPUs but why do we care about hotplug
events at all. Let's say we offline all cpus from a given node (or
that all but isolated cpus are offline - is this even a
realistic/reasonable use case?). Wouldn't the scheduler ignore the kthread's
affinity in such a case? In other words how is that different from
tasksetting a userspace task to a cpu that goes offline? We still do
allow such a task to run, right? We just do not care about affinity
anymore.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 12/19] kthread: Default affine kthread to its preferred NUMA node
  2024-09-17  6:26   ` Michal Hocko
@ 2024-09-17  7:01     ` Vlastimil Babka
  2024-09-17  7:05       ` Michal Hocko
  2024-09-17 10:34     ` Frederic Weisbecker
  1 sibling, 1 reply; 16+ messages in thread
From: Vlastimil Babka @ 2024-09-17  7:01 UTC (permalink / raw)
  To: Michal Hocko, Frederic Weisbecker
  Cc: LKML, Andrew Morton, Kees Cook, Peter Zijlstra, Thomas Gleixner,
	linux-mm, Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes,
	Boqun Feng, Zqiang, rcu

On 9/17/24 8:26 AM, Michal Hocko wrote:
> On Tue 17-09-24 00:49:16, Frederic Weisbecker wrote:
>> Kthreads attached to a preferred NUMA node for their task structure
>> allocation can also be assumed to run preferrably within that same node.
>>
>> A more precise affinity is usually notified by calling
>> kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.
>>
>> For the others, a default affinity to the node is desired and sometimes
>> implemented with more or less success when it comes to deal with hotplug
>> events and nohz_full / CPU Isolation interactions:
>>
>> - kcompactd is affine to its node and handles hotplug but not CPU Isolation
>> - kswapd is affine to its node and ignores hotplug and CPU Isolation
>> - A bunch of drivers create their kthreads on a specific node and
>>   don't take care about affining further.
>>
>> Handle that default node affinity preference at the generic level
>> instead, provided a kthread is created on an actual node and doesn't
>> apply any specific affinity such as a given CPU or a custom cpumask to
>> bind to before its first wake-up.
> 
> Makes sense.
> 
>> This generic handling is aware of CPU hotplug events and CPU isolation
>> such that:
>>
>> * When a housekeeping CPU goes up and is part of the node of a given
>>   kthread, it is added to its applied affinity set (and
>>   possibly the default last resort online housekeeping set is removed
>>   from the set).
>>
>> * When a housekeeping CPU goes down while it was part of the node of a
>>   kthread, it is removed from the kthread's applied
>>   affinity. The last resort is to affine the kthread to all online
>>   housekeeping CPUs.
> 
> But I am not really sure about this part. Sure it makes sense to set the
> affinity to exclude isolated CPUs but why do we care about hotplug
> events at all. Let's say we offline all cpus from a given node (or
> that all but isolated cpus are offline - is this even
> realistic/reasonable usecase?). Wouldn't scheduler ignore the kthread's
> affinity in such a case? In other words how is that different from
> tasksetting an userspace task to a cpu that goes offline? We still do
> allow such a task to run, right? We just do not care about affinity
> anymore.

AFAIU it handles better the situation where all housekeeping cpus from
the preferred node go down: then it affines to housekeeping cpus from any
node rather than any cpu including isolated ones.
Yes it's probably a scenario that's not recommended, but someone might
do it anyway...



* Re: [PATCH 12/19] kthread: Default affine kthread to its preferred NUMA node
  2024-09-17  7:01     ` Vlastimil Babka
@ 2024-09-17  7:05       ` Michal Hocko
  2024-09-17  7:14         ` Vlastimil Babka
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2024-09-17  7:05 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Frederic Weisbecker, LKML, Andrew Morton, Kees Cook,
	Peter Zijlstra, Thomas Gleixner, linux-mm, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Boqun Feng, Zqiang, rcu

On Tue 17-09-24 09:01:08, Vlastimil Babka wrote:
> On 9/17/24 8:26 AM, Michal Hocko wrote:
> > On Tue 17-09-24 00:49:16, Frederic Weisbecker wrote:
> >> Kthreads attached to a preferred NUMA node for their task structure
> >> allocation can also be assumed to run preferrably within that same node.
> >>
> >> A more precise affinity is usually notified by calling
> >> kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.
> >>
> >> For the others, a default affinity to the node is desired and sometimes
> >> implemented with more or less success when it comes to deal with hotplug
> >> events and nohz_full / CPU Isolation interactions:
> >>
> >> - kcompactd is affine to its node and handles hotplug but not CPU Isolation
> >> - kswapd is affine to its node and ignores hotplug and CPU Isolation
> >> - A bunch of drivers create their kthreads on a specific node and
> >>   don't take care about affining further.
> >>
> >> Handle that default node affinity preference at the generic level
> >> instead, provided a kthread is created on an actual node and doesn't
> >> apply any specific affinity such as a given CPU or a custom cpumask to
> >> bind to before its first wake-up.
> > 
> > Makes sense.
> > 
> >> This generic handling is aware of CPU hotplug events and CPU isolation
> >> such that:
> >>
> >> * When a housekeeping CPU goes up and is part of the node of a given
> >>   kthread, it is added to its applied affinity set (and
> >>   possibly the default last resort online housekeeping set is removed
> >>   from the set).
> >>
> >> * When a housekeeping CPU goes down while it was part of the node of a
> >>   kthread, it is removed from the kthread's applied
> >>   affinity. The last resort is to affine the kthread to all online
> >>   housekeeping CPUs.
> > 
> > But I am not really sure about this part. Sure it makes sense to set the
> > affinity to exclude isolated CPUs but why do we care about hotplug
> > events at all. Let's say we offline all cpus from a given node (or
> > that all but isolated cpus are offline - is this even
> > realistic/reasonable usecase?). Wouldn't scheduler ignore the kthread's
> > affinity in such a case? In other words how is that different from
> > tasksetting an userspace task to a cpu that goes offline? We still do
> > allow such a task to run, right? We just do not care about affinity
> > anymore.
> 
> AFAIU it handles better the situation where all houskeeping cpus from
> the preferred node go down, then it affines to houskeeping cpus from any
> node vs any cpu including isolated ones.

Doesn't that happen automagically? Or can it end up on a random
isolated cpu?

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 12/19] kthread: Default affine kthread to its preferred NUMA node
  2024-09-17  7:05       ` Michal Hocko
@ 2024-09-17  7:14         ` Vlastimil Babka
  0 siblings, 0 replies; 16+ messages in thread
From: Vlastimil Babka @ 2024-09-17  7:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Frederic Weisbecker, LKML, Andrew Morton, Kees Cook,
	Peter Zijlstra, Thomas Gleixner, linux-mm, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Boqun Feng, Zqiang, rcu

On 9/17/24 9:05 AM, Michal Hocko wrote:
> On Tue 17-09-24 09:01:08, Vlastimil Babka wrote:
>> On 9/17/24 8:26 AM, Michal Hocko wrote:
>>> On Tue 17-09-24 00:49:16, Frederic Weisbecker wrote:
>>>> Kthreads attached to a preferred NUMA node for their task structure
>>>> allocation can also be assumed to run preferrably within that same node.
>>>>
>>>> A more precise affinity is usually notified by calling
>>>> kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.
>>>>
>>>> For the others, a default affinity to the node is desired and sometimes
>>>> implemented with more or less success when it comes to deal with hotplug
>>>> events and nohz_full / CPU Isolation interactions:
>>>>
>>>> - kcompactd is affine to its node and handles hotplug but not CPU Isolation
>>>> - kswapd is affine to its node and ignores hotplug and CPU Isolation
>>>> - A bunch of drivers create their kthreads on a specific node and
>>>>   don't take care about affining further.
>>>>
>>>> Handle that default node affinity preference at the generic level
>>>> instead, provided a kthread is created on an actual node and doesn't
>>>> apply any specific affinity such as a given CPU or a custom cpumask to
>>>> bind to before its first wake-up.
>>>
>>> Makes sense.
>>>
>>>> This generic handling is aware of CPU hotplug events and CPU isolation
>>>> such that:
>>>>
>>>> * When a housekeeping CPU goes up and is part of the node of a given
>>>>   kthread, it is added to its applied affinity set (and
>>>>   possibly the default last resort online housekeeping set is removed
>>>>   from the set).
>>>>
>>>> * When a housekeeping CPU goes down while it was part of the node of a
>>>>   kthread, it is removed from the kthread's applied
>>>>   affinity. The last resort is to affine the kthread to all online
>>>>   housekeeping CPUs.
>>>
>>> But I am not really sure about this part. Sure it makes sense to set the
>>> affinity to exclude isolated CPUs but why do we care about hotplug
>>> events at all. Let's say we offline all cpus from a given node (or
>>> that all but isolated cpus are offline - is this even
>>> realistic/reasonable usecase?). Wouldn't scheduler ignore the kthread's
>>> affinity in such a case? In other words how is that different from
>>> tasksetting an userspace task to a cpu that goes offline? We still do
>>> allow such a task to run, right? We just do not care about affinity
>>> anymore.
>>
>> AFAIU it handles better the situation where all houskeeping cpus from
>> the preferred node go down, then it affines to houskeeping cpus from any
>> node vs any cpu including isolated ones.
> 
> Doesn't that happen automagically? Or can it end up on a random
> isolated cpu?

Good question, perhaps it can and there's no automagic, as I see code like:

+		/* Make sure the kthread never gets re-affined globally */
+		set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));
 



* Re: [PATCH 12/19] kthread: Default affine kthread to its preferred NUMA node
  2024-09-17  6:26   ` Michal Hocko
  2024-09-17  7:01     ` Vlastimil Babka
@ 2024-09-17 10:34     ` Frederic Weisbecker
  2024-09-17 11:07       ` Michal Hocko
  1 sibling, 1 reply; 16+ messages in thread
From: Frederic Weisbecker @ 2024-09-17 10:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, Andrew Morton, Kees Cook, Peter Zijlstra, Thomas Gleixner,
	Vlastimil Babka, linux-mm, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Boqun Feng, Zqiang, rcu

On Tue, Sep 17, 2024 at 08:26:49AM +0200, Michal Hocko wrote:
> On Tue 17-09-24 00:49:16, Frederic Weisbecker wrote:
> > Kthreads attached to a preferred NUMA node for their task structure
> > allocation can also be assumed to run preferrably within that same node.
> > 
> > A more precise affinity is usually notified by calling
> > kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.
> > 
> > For the others, a default affinity to the node is desired and sometimes
> > implemented with more or less success when it comes to deal with hotplug
> > events and nohz_full / CPU Isolation interactions:
> > 
> > - kcompactd is affine to its node and handles hotplug but not CPU Isolation
> > - kswapd is affine to its node and ignores hotplug and CPU Isolation
> > - A bunch of drivers create their kthreads on a specific node and
> >   don't take care about affining further.
> > 
> > Handle that default node affinity preference at the generic level
> > instead, provided a kthread is created on an actual node and doesn't
> > apply any specific affinity such as a given CPU or a custom cpumask to
> > bind to before its first wake-up.
> 
> Makes sense.
> 
> > This generic handling is aware of CPU hotplug events and CPU isolation
> > such that:
> > 
> > * When a housekeeping CPU goes up and is part of the node of a given
> >   kthread, it is added to its applied affinity set (and
> >   possibly the default last resort online housekeeping set is removed
> >   from the set).
> > 
> > * When a housekeeping CPU goes down while it was part of the node of a
> >   kthread, it is removed from the kthread's applied
> >   affinity. The last resort is to affine the kthread to all online
> >   housekeeping CPUs.
> 
> But I am not really sure about this part. Sure it makes sense to set the
> affinity to exclude isolated CPUs but why do we care about hotplug
> events at all. Let's say we offline all cpus from a given node (or
> that all but isolated cpus are offline - is this even
> realistic/reasonable usecase?). Wouldn't scheduler ignore the kthread's
> affinity in such a case? In other words how is that different from
> tasksetting an userspace task to a cpu that goes offline? We still do
> allow such a task to run, right? We just do not care about affinity
> anymore.

Suppose we have this artificial online set:

NODE 0 -> CPU 0
NODE 1 -> CPU 1
NODE 2 -> CPU 2

And we have nohz_full=1,2

So there is kswapd/2 that is affine to NODE 2 and thus CPU 2 for now.

Now CPU 2 goes offline. The scheduler migrates off all
tasks. select_fallback_rq() for kswapd/2 doesn't find a suitable CPU
to run on, so it affines kswapd/2 to all remaining online CPUs (CPU 0, CPU 1)
(see the "No more Mr. Nice Guy" comment).

But CPU 1 is nohz_full, so kswapd/2 could run on that isolated CPU. Unless we
handle things before, like this patchset does.
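
Sketching that scenario in terms of the kthread_fetch_affinity() logic from
patch 12 (the CPU and node numbers follow the artificial example above):

	/*
	 * CPU 2 went down, so kthreads_offline_cpu() cleared it from
	 * kthread_online_mask and kswapd/2's affinity gets recomputed:
	 */
	cpumask_and(mask, cpumask_of_node(2), &kthread_online_mask);	    /* -> empty */
	cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_KTHREAD));    /* -> empty */
	if (cpumask_empty(mask))					    /* last resort */
		cpumask_copy(mask, housekeeping_cpumask(HK_TYPE_KTHREAD)); /* -> CPU 0 */

whereas the select_fallback_rq() path alone would have left CPU 1
(nohz_full) in the allowed set.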

And note that adding isolcpus=domain,1,2 or setting 1,2 as an isolated
cpuset partition (like most isolated workloads should do) doesn't help
here. And I'm not sure this last resort scheduler code is the right place
to handle isolated cpumasks.

So it looks necessary, unless I am missing something else?

And that is just for reaffining on CPU down. CPU up needs the mirroring
treatment and must also handle new CPUs freshly added to a node.

Thanks.

> -- 
> Michal Hocko
> SUSE Labs
> 



* Re: [PATCH 12/19] kthread: Default affine kthread to its preferred NUMA node
  2024-09-17 10:34     ` Frederic Weisbecker
@ 2024-09-17 11:07       ` Michal Hocko
  2024-09-18  9:37         ` Frederic Weisbecker
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2024-09-17 11:07 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Kees Cook, Peter Zijlstra, Thomas Gleixner,
	Vlastimil Babka, linux-mm, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Boqun Feng, Zqiang, rcu

On Tue 17-09-24 12:34:51, Frederic Weisbecker wrote:
> Le Tue, Sep 17, 2024 at 08:26:49AM +0200, Michal Hocko a écrit :
> > On Tue 17-09-24 00:49:16, Frederic Weisbecker wrote:
> > > Kthreads attached to a preferred NUMA node for their task structure
> > > allocation can also be assumed to run preferrably within that same node.
> > > 
> > > A more precise affinity is usually notified by calling
> > > kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.
> > > 
> > > For the others, a default affinity to the node is desired and sometimes
> > > implemented with more or less success when it comes to deal with hotplug
> > > events and nohz_full / CPU Isolation interactions:
> > > 
> > > - kcompactd is affine to its node and handles hotplug but not CPU Isolation
> > > - kswapd is affine to its node and ignores hotplug and CPU Isolation
> > > - A bunch of drivers create their kthreads on a specific node and
> > >   don't take care about affining further.
> > > 
> > > Handle that default node affinity preference at the generic level
> > > instead, provided a kthread is created on an actual node and doesn't
> > > apply any specific affinity such as a given CPU or a custom cpumask to
> > > bind to before its first wake-up.
> > 
> > Makes sense.
> > 
> > > This generic handling is aware of CPU hotplug events and CPU isolation
> > > such that:
> > > 
> > > * When a housekeeping CPU goes up and is part of the node of a given
> > >   kthread, it is added to its applied affinity set (and
> > >   possibly the default last resort online housekeeping set is removed
> > >   from the set).
> > > 
> > > * When a housekeeping CPU goes down while it was part of the node of a
> > >   kthread, it is removed from the kthread's applied
> > >   affinity. The last resort is to affine the kthread to all online
> > >   housekeeping CPUs.
> > 
> > But I am not really sure about this part. Sure it makes sense to set the
> > affinity to exclude isolated CPUs but why do we care about hotplug
> > events at all. Let's say we offline all cpus from a given node (or
> > that all but isolated cpus are offline - is this even
> > realistic/reasonable usecase?). Wouldn't scheduler ignore the kthread's
> > affinity in such a case? In other words how is that different from
> > tasksetting an userspace task to a cpu that goes offline? We still do
> > allow such a task to run, right? We just do not care about affinity
> > anymore.
> 
> Suppose we have this artificial online set:
> 
> NODE 0 -> CPU 0
> NODE 1 -> CPU 1
> NODE 2 -> CPU 2
> 
> And we have nohz_full=1,2
> 
> So there is kswapd/2 that is affine to NODE 2 and thus CPU 2 for now.
> 
> Now CPU 2 goes offline. The scheduler migrates off all
> tasks. select_fallback_rq() for kswapd/2 doesn't find a suitable CPU
> to run to so it affines kswapd/2 to all remaining online CPUs (CPU 0, CPU 1)
> (see the "No more Mr. Nice Guy" comment).
> 
> But CPU 1 is nohz_full, so kswapd/2 could run on that isolated CPU. Unless we
> handle things before, like this patchset does.

But that is just as broken as before, no? CPU2 is isolated as well so it
doesn't really make much of a difference.

> And note that adding isolcpus=domain,1,2 or setting 1,2 as isolated
> cpuset partition (like most isolated workloads should do) is not helping
> here. And I'm not sure this last resort scheduler code is the right place
> to handle isolated cpumasks.

Well, we would have the same situation with userspace tasks, no? Say I
have taskset -p 2 (because I want binding to node2) and that CPU2 goes
offline. The task needs to be moved somewhere. And it would be the last
resort logic that does that, unless I am missing anything. Why should kernel
threads be any different?

> So it looks necessary, unless I am missing something else?

I am not objecting to the patch per se. I am just not sure this is really
needed. It is great to have kernel threads bound to non-isolated cpus by
default if they have node preferences. But as soon as somebody starts
offlining cpus excessively and makes the initial cpumask empty, then
select_fallback_rq sounds like the right thing to do.

Not my call though. I was just curious why this is needed and it seems
to me you are looking for some sort of correctness for broken setups.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 12/19] kthread: Default affine kthread to its preferred NUMA node
  2024-09-17 11:07       ` Michal Hocko
@ 2024-09-18  9:37         ` Frederic Weisbecker
  2024-09-18 11:17           ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Frederic Weisbecker @ 2024-09-18  9:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, Andrew Morton, Kees Cook, Peter Zijlstra, Thomas Gleixner,
	Vlastimil Babka, linux-mm, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Boqun Feng, Zqiang, rcu

On Tue, Sep 17, 2024 at 01:07:25PM +0200, Michal Hocko wrote:
> On Tue 17-09-24 12:34:51, Frederic Weisbecker wrote:
> > Le Tue, Sep 17, 2024 at 08:26:49AM +0200, Michal Hocko a écrit :
> > > On Tue 17-09-24 00:49:16, Frederic Weisbecker wrote:
> > > > Kthreads attached to a preferred NUMA node for their task structure
> > > > allocation can also be assumed to run preferrably within that same node.
> > > > 
> > > > A more precise affinity is usually notified by calling
> > > > kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.
> > > > 
> > > > For the others, a default affinity to the node is desired and sometimes
> > > > implemented with more or less success when it comes to deal with hotplug
> > > > events and nohz_full / CPU Isolation interactions:
> > > > 
> > > > - kcompactd is affine to its node and handles hotplug but not CPU Isolation
> > > > - kswapd is affine to its node and ignores hotplug and CPU Isolation
> > > > - A bunch of drivers create their kthreads on a specific node and
> > > >   don't take care about affining further.
> > > > 
> > > > Handle that default node affinity preference at the generic level
> > > > instead, provided a kthread is created on an actual node and doesn't
> > > > apply any specific affinity such as a given CPU or a custom cpumask to
> > > > bind to before its first wake-up.
> > > 
> > > Makes sense.
> > > 
> > > > This generic handling is aware of CPU hotplug events and CPU isolation
> > > > such that:
> > > > 
> > > > * When a housekeeping CPU goes up and is part of the node of a given
> > > >   kthread, it is added to its applied affinity set (and
> > > >   possibly the default last resort online housekeeping set is removed
> > > >   from the set).
> > > > 
> > > > * When a housekeeping CPU goes down while it was part of the node of a
> > > >   kthread, it is removed from the kthread's applied
> > > >   affinity. The last resort is to affine the kthread to all online
> > > >   housekeeping CPUs.
> > > 
> > > But I am not really sure about this part. Sure it makes sense to set the
> > > affinity to exclude isolated CPUs but why do we care about hotplug
> > > events at all. Let's say we offline all cpus from a given node (or
> > > that all but isolated cpus are offline - is this even
> > > realistic/reasonable usecase?). Wouldn't scheduler ignore the kthread's
> > > affinity in such a case? In other words how is that different from
> > > tasksetting an userspace task to a cpu that goes offline? We still do
> > > allow such a task to run, right? We just do not care about affinity
> > > anymore.
> > 
> > Suppose we have this artificial online set:
> > 
> > NODE 0 -> CPU 0
> > NODE 1 -> CPU 1
> > NODE 2 -> CPU 2
> > 
> > And we have nohz_full=1,2
> > 
> > So there is kswapd/2 that is affine to NODE 2 and thus CPU 2 for now.
> > 
> > Now CPU 2 goes offline. The scheduler migrates off all
> > tasks. select_fallback_rq() for kswapd/2 doesn't find a suitable CPU
> > to run to so it affines kswapd/2 to all remaining online CPUs (CPU 0, CPU 1)
> > (see the "No more Mr. Nice Guy" comment).
> > 
> > But CPU 1 is nohz_full, so kswapd/2 could run on that isolated CPU. Unless we
> > handle things before, like this patchset does.
> 
> But that is equally broken as before, no? CPU2 is isolated as well so it
> doesn't really make much of a difference.

Right. I should correct my example with nohz_full=1 only.

> 
> > And note that adding isolcpus=domain,1,2 or setting 1,2 as isolated
> > cpuset partition (like most isolated workloads should do) is not helping
> > here. And I'm not sure this last resort scheduler code is the right place
> > to handle isolated cpumasks.
> 
> Well, we would have the same situation with userspace tasks, no? Say I
> have taskset -p 2 (because I want bidning to node2) and that CPU2 goes
> offline. The task needs to be moved somewhere. And it would be last
> resort logic to do that unless I am missing anything. Why should kernel
> threads be any different?

Good point.

> 
> > So it looks necessary, unless I am missing something else?
> 
> I am not objecting to the patch per se. I am just not sure this is really
> needed. It is great to have kernel threads bound to non-isolated CPUs by
> default if they have node preferences. But as soon as somebody starts
> offlining CPUs excessively and makes the initial cpumask empty, then
> select_fallback_rq() sounds like the right thing to do.
> 
> Not my call though. I was just curious why this is needed, and it seems
> to me you are looking for some sort of correctness for broken setups.

It looks like it makes sense to explore that path. We still need the
cpu up probe to reaffine when a suitable target comes up. But it seems
the CPU down part can be handled by select_fallback_rq. I'll try that.

Thanks.

> -- 
> Michal Hocko
> SUSE Labs
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 12/19] kthread: Default affine kthread to its preferred NUMA node
  2024-09-18  9:37         ` Frederic Weisbecker
@ 2024-09-18 11:17           ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2024-09-18 11:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Andrew Morton, Kees Cook, Peter Zijlstra, Thomas Gleixner,
	Vlastimil Babka, linux-mm, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Boqun Feng, Zqiang, rcu

On Wed 18-09-24 11:37:42, Frederic Weisbecker wrote:
> On Tue, Sep 17, 2024 at 01:07:25PM +0200, Michal Hocko wrote:
[...]
> > I am not objecting to the patch per se. I am just not sure this is really
> > needed. It is great to have kernel threads bound to non-isolated CPUs by
> > default if they have node preferences. But as soon as somebody starts
> > offlining CPUs excessively and makes the initial cpumask empty, then
> > select_fallback_rq() sounds like the right thing to do.
> > 
> > Not my call though. I was just curious why this is needed, and it seems
> > to me you are looking for some sort of correctness for broken setups.
> 
> It looks like it makes sense to explore that path. We still need the
> cpu up probe to reaffine when a suitable target comes up. But it seems
> the CPU down part can be handled by select_fallback_rq. I'll try that.

Thanks! Btw, when you are looking at this, would it make sense to make
select_fallback_rq() more CPU-isolation aware as well? I mean using
housekeeping CPUs before falling back to task_cpu_possible_mask()?
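
Something along these lines is roughly what I mean - a sketch of the
ordering only, not a tested patch; the helper name kthread_fallback_cpu()
and its exact placement inside select_fallback_rq() are made up:

	/*
	 * Rough sketch: prefer an online housekeeping CPU as the last
	 * resort before widening to the full task_cpu_possible_mask().
	 */
	static int kthread_fallback_cpu(struct task_struct *p)
	{
		int cpu;

		/* Any online housekeeping CPU is good enough... */
		cpu = cpumask_any_and(housekeeping_cpumask(HK_TYPE_KTHREAD),
				      cpu_online_mask);
		if (cpu < nr_cpu_ids)
			return cpu;

		/* ...only then the current "No more Mr. Nice Guy" fallback. */
		return cpumask_any_and(task_cpu_possible_mask(p), cpu_online_mask);
	}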
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 12/19] kthread: Default affine kthread to its preferred NUMA node
       [not found] <20241211154035.75565-1-frederic@kernel.org>
@ 2024-12-11 15:40 ` Frederic Weisbecker
  0 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2024-12-11 15:40 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Andrew Morton, Kees Cook, Peter Zijlstra,
	Thomas Gleixner, Michal Hocko, Vlastimil Babka, linux-mm,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Boqun Feng,
	Uladzislau Rezki, Zqiang, rcu

Kthreads attached to a preferred NUMA node for their task structure
allocation can also be assumed to preferably run within that same node.

A more precise affinity is usually notified by calling
kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.

For the others, a default affinity to the node is desired and sometimes
implemented with more or less success when it comes to dealing with hotplug
events and nohz_full / CPU Isolation interactions:

- kcompactd is affine to its node and handles hotplug but not CPU Isolation
- kswapd is affine to its node and ignores hotplug and CPU Isolation
- A bunch of drivers create their kthreads on a specific node and
  don't take care of affining them any further.

Handle that default node affinity preference at the generic level
instead, provided the kthread is created on an actual node and no
specific affinity, such as a given CPU or a custom cpumask to bind to,
is applied before its first wake-up.
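
As an illustration (not part of this patch), a driver that creates its
per-node worker as below now inherits the node's housekeeping affinity
by default, without any explicit set_cpus_allowed_ptr() call of its own.
The thread function foo_thread and the "foo/%d" name are made up:

	/* Illustrative only: foo_thread and "foo/%d" are made-up names. */
	struct task_struct *t;

	t = kthread_create_on_node(foo_thread, NULL, nid, "foo/%d", nid);
	if (!IS_ERR(t))
		wake_up_process(t);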

This generic handling is aware of CPU hotplug events and CPU isolation
such that:

* When a housekeeping CPU goes up that is part of the node of a given
  kthread, the related task is re-affined to its own node if it was
  previously running on the default last-resort online housekeeping set
  from other nodes.

* When a housekeeping CPU goes down while it was part of the node of a
  kthread, the running task is migrated (or the sleeping task is woken
  up) automatically by the scheduler to other housekeepers within the
  same node or, as a last resort, to all housekeepers from other nodes.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/cpuhotplug.h |   1 +
 kernel/kthread.c           | 106 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 106 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index a04b73c40173..6cc5e484547c 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -240,6 +240,7 @@ enum cpuhp_state {
 	CPUHP_AP_WORKQUEUE_ONLINE,
 	CPUHP_AP_RANDOM_ONLINE,
 	CPUHP_AP_RCUTREE_ONLINE,
+	CPUHP_AP_KTHREADS_ONLINE,
 	CPUHP_AP_BASE_CACHEINFO_ONLINE,
 	CPUHP_AP_ONLINE_DYN,
 	CPUHP_AP_ONLINE_DYN_END		= CPUHP_AP_ONLINE_DYN + 40,
diff --git a/kernel/kthread.c b/kernel/kthread.c
index b6f9ce475a4f..3394ff024a5a 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -35,6 +35,9 @@ static DEFINE_SPINLOCK(kthread_create_lock);
 static LIST_HEAD(kthread_create_list);
 struct task_struct *kthreadd_task;
 
+static LIST_HEAD(kthreads_hotplug);
+static DEFINE_MUTEX(kthreads_hotplug_lock);
+
 struct kthread_create_info
 {
 	/* Information passed to kthread() from kthreadd. */
@@ -53,6 +56,7 @@ struct kthread_create_info
 struct kthread {
 	unsigned long flags;
 	unsigned int cpu;
+	unsigned int node;
 	int started;
 	int result;
 	int (*threadfn)(void *);
@@ -64,6 +68,8 @@ struct kthread {
 #endif
 	/* To store the full name if task comm is truncated. */
 	char *full_name;
+	struct task_struct *task;
+	struct list_head hotplug_node;
 };
 
 enum KTHREAD_BITS {
@@ -122,8 +128,11 @@ bool set_kthread_struct(struct task_struct *p)
 
 	init_completion(&kthread->exited);
 	init_completion(&kthread->parked);
+	INIT_LIST_HEAD(&kthread->hotplug_node);
 	p->vfork_done = &kthread->exited;
 
+	kthread->task = p;
+	kthread->node = tsk_fork_get_node(current);
 	p->worker_private = kthread;
 	return true;
 }
@@ -314,6 +323,11 @@ void __noreturn kthread_exit(long result)
 {
 	struct kthread *kthread = to_kthread(current);
 	kthread->result = result;
+	if (!list_empty(&kthread->hotplug_node)) {
+		mutex_lock(&kthreads_hotplug_lock);
+		list_del(&kthread->hotplug_node);
+		mutex_unlock(&kthreads_hotplug_lock);
+	}
 	do_exit(0);
 }
 EXPORT_SYMBOL(kthread_exit);
@@ -339,6 +353,48 @@ void __noreturn kthread_complete_and_exit(struct completion *comp, long code)
 }
 EXPORT_SYMBOL(kthread_complete_and_exit);
 
+static void kthread_fetch_affinity(struct kthread *kthread, struct cpumask *cpumask)
+{
+	cpumask_and(cpumask, cpumask_of_node(kthread->node),
+		    housekeeping_cpumask(HK_TYPE_KTHREAD));
+
+	if (cpumask_empty(cpumask))
+		cpumask_copy(cpumask, housekeeping_cpumask(HK_TYPE_KTHREAD));
+}
+
+static void kthread_affine_node(void)
+{
+	struct kthread *kthread = to_kthread(current);
+	cpumask_var_t affinity;
+
+	WARN_ON_ONCE(kthread_is_per_cpu(current));
+
+	if (kthread->node == NUMA_NO_NODE) {
+		housekeeping_affine(current, HK_TYPE_KTHREAD);
+	} else {
+		if (!zalloc_cpumask_var(&affinity, GFP_KERNEL)) {
+			WARN_ON_ONCE(1);
+			return;
+		}
+
+		mutex_lock(&kthreads_hotplug_lock);
+		WARN_ON_ONCE(!list_empty(&kthread->hotplug_node));
+		list_add_tail(&kthread->hotplug_node, &kthreads_hotplug);
+		/*
+		 * The node cpumask is racy when read from kthread() but:
+		 * - a racing CPU going down will either fail on the subsequent
+		 *   call to set_cpus_allowed_ptr() or be migrated to housekeepers
+		 *   afterwards by the scheduler.
+		 * - a racing CPU going up will be handled by kthreads_online_cpu()
+		 */
+		kthread_fetch_affinity(kthread, affinity);
+		set_cpus_allowed_ptr(current, affinity);
+		mutex_unlock(&kthreads_hotplug_lock);
+
+		free_cpumask_var(affinity);
+	}
+}
+
 static int kthread(void *_create)
 {
 	static const struct sched_param param = { .sched_priority = 0 };
@@ -369,7 +425,6 @@ static int kthread(void *_create)
 	 * back to default in case they have been changed.
 	 */
 	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
-	set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));
 
 	/* OK, tell user we're spawned, wait for stop or wakeup */
 	__set_current_state(TASK_UNINTERRUPTIBLE);
@@ -385,6 +440,9 @@ static int kthread(void *_create)
 
 	self->started = 1;
 
+	if (!(current->flags & PF_NO_SETAFFINITY))
+		kthread_affine_node();
+
 	ret = -EINTR;
 	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
 		cgroup_kthread_ready();
@@ -781,6 +839,52 @@ int kthreadd(void *unused)
 	return 0;
 }
 
+/*
+ * Re-affine kthreads according to their preferences
+ * and the newly online CPU. The CPU down part is handled
+ * by select_fallback_rq() which default re-affines to
+ * housekeepers in case the preferred affinity doesn't
+ * apply anymore.
+ */
+static int kthreads_online_cpu(unsigned int cpu)
+{
+	cpumask_var_t affinity;
+	struct kthread *k;
+	int ret;
+
+	guard(mutex)(&kthreads_hotplug_lock);
+
+	if (list_empty(&kthreads_hotplug))
+		return 0;
+
+	if (!zalloc_cpumask_var(&affinity, GFP_KERNEL))
+		return -ENOMEM;
+
+	ret = 0;
+
+	list_for_each_entry(k, &kthreads_hotplug, hotplug_node) {
+		if (WARN_ON_ONCE((k->task->flags & PF_NO_SETAFFINITY) ||
+				 kthread_is_per_cpu(k->task) ||
+				 k->node == NUMA_NO_NODE)) {
+			ret = -EINVAL;
+			continue;
+		}
+		kthread_fetch_affinity(k, affinity);
+		set_cpus_allowed_ptr(k->task, affinity);
+	}
+
+	free_cpumask_var(affinity);
+
+	return ret;
+}
+
+static int kthreads_init(void)
+{
+	return cpuhp_setup_state(CPUHP_AP_KTHREADS_ONLINE, "kthreads:online",
+				kthreads_online_cpu, NULL);
+}
+early_initcall(kthreads_init);
+
 void __kthread_init_worker(struct kthread_worker *worker,
 				const char *name,
 				struct lock_class_key *key)
-- 
2.46.0



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2024-12-11 15:41 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20240916224925.20540-1-frederic@kernel.org>
2024-09-16 22:49 ` [PATCH 11/19] kthread: Make sure kthread hasn't started while binding it Frederic Weisbecker
2024-09-16 22:49 ` [PATCH 12/19] kthread: Default affine kthread to its preferred NUMA node Frederic Weisbecker
2024-09-17  6:26   ` Michal Hocko
2024-09-17  7:01     ` Vlastimil Babka
2024-09-17  7:05       ` Michal Hocko
2024-09-17  7:14         ` Vlastimil Babka
2024-09-17 10:34     ` Frederic Weisbecker
2024-09-17 11:07       ` Michal Hocko
2024-09-18  9:37         ` Frederic Weisbecker
2024-09-18 11:17           ` Michal Hocko
2024-09-16 22:49 ` [PATCH 13/19] mm: Create/affine kcompactd to its preferred node Frederic Weisbecker
2024-09-17  6:04   ` Michal Hocko
2024-09-16 22:49 ` [PATCH 14/19] mm: Create/affine kswapd " Frederic Weisbecker
2024-09-17  6:05   ` Michal Hocko
2024-09-16 22:49 ` [PATCH 15/19] kthread: Implement preferred affinity Frederic Weisbecker
     [not found] <20241211154035.75565-1-frederic@kernel.org>
2024-12-11 15:40 ` [PATCH 12/19] kthread: Default affine kthread to its preferred NUMA node Frederic Weisbecker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox