* Subject: [PATCH RESEND 1/1] cpusets/sched_domain reconciliation
From: Cliff Wickman @ 2007-03-22 23:15 UTC (permalink / raw)
To: akpm; +Cc: linux-mm
Submission #2: This patch was diffed against 2.6.21-rc4
(first submission was against 2.6.20-rc6)
This patch reconciles cpusets and sched_domains that get out of sync
due to the disabling and re-enabling of CPUs.
Dinakar Guniguntala (IBM) is working on his own version of fixing this.
But as of this date that fix doesn't seem to be ready.
Here is an example of how the problem can occur:
system of CPUs 0-31
create cpuset /x 16-31
create cpuset /x/y 16-23
all cpu_exclusive
disable cpu 17
x is now 16,18-31
x/y is now 16,18-23
enable cpu 17
x and x/y are unchanged
to restore the cpusets:
echo 16-31 > /dev/cpuset/x
echo 16-23 > /dev/cpuset/x/y
At the first echo, update_cpu_domains() is called for cpuset x/.
The system is partitioned between:
   its parent, the root cpuset of 0-31, minus its children
     (x/ is 16-31): giving 0-15
   and x/ (16-31), minus its children (x/y/ is 16,18-23): giving 17,24-31
The sched_domains for the parent span, 0-15, are updated.
The sched_domains for the current span, 17,24-31, are updated.
But cpu 16 is left untouched.
As a result, cpu 17's sched_domain points to sched_group_phys[17], which
is the only sched_group_phys on 17's list; it points to itself.
But cpu 16's sched_domain points to sched_group_phys[16], which still
points to sched_group_phys[17].
When cpu 16 executes find_busiest_group() it hangs, looping forever over
the now non-circular sched_group list.
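To make the failure mode concrete, here is a small standalone sketch
(ordinary userspace C, not the kernel code itself) of the group-ring walk
that find_busiest_group() performs; once the ring from cpu 16 no longer
leads back to its starting group, the unguarded do/while never terminates:

#include <stdio.h>

/* Minimal stand-in for the kernel's sched_group ring. */
struct sched_group {
	struct sched_group *next;
	int id;
};

int main(void)
{
	struct sched_group g16 = { .next = NULL, .id = 16 };
	struct sched_group g17 = { .next = NULL, .id = 17 };
	struct sched_group *start, *group;
	int steps = 0;

	/* State after the partial rebuild described above: 17's group
	 * points to itself, but 16's group still points at 17, so the
	 * ring never returns to 16's starting group. */
	g16.next = &g17;
	g17.next = &g17;

	/* The balancing loops are essentially
	 *     do { ... } while (group != sd->groups);
	 * A step cap is added here only so this demo terminates. */
	start = &g16;
	group = start;
	do {
		steps++;
		group = group->next;
	} while (group != start && steps < 8);

	printf("stopped at group %d after %d steps (would spin forever)\n",
	       group->id, steps);
	return 0;
}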
The solution is to update the sched_domains for the cpuset whose cpus
were changed and, in addition, for all of its children:
update_cpu_domains() ends with a (recursive) call to itself for each
child.
The extra sched_domain reconstruction is overhead, but only at the
frequency of administrative change to the cpusets.
This patch also includes checks in find_busiest_group() and
find_idlest_group() that break out of their loops when a sched_group
points to itself. This is needed because other cpus may be load
balancing while the sched_domains are being reconstructed.
There seems to be no administrative work-around. In the example above
one cannot reverse the two echoes and set x/y before x/: a cpuset's cpus
must remain a subset of its parent's, so the write is rejected
(Permission denied).
Diffed against 2.6.21-rc4
Signed-off-by: Cliff Wickman <cpw@sgi.com>
---
kernel/cpuset.c | 11 +++++++++--
kernel/sched.c | 19 +++++++++++++++----
2 files changed, 24 insertions(+), 6 deletions(-)
Index: morton.070205/kernel/sched.c
===================================================================
--- morton.070205.orig/kernel/sched.c
+++ morton.070205/kernel/sched.c
@@ -1201,11 +1201,14 @@ static inline unsigned long cpu_avg_load
static struct sched_group *
find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
{
- struct sched_group *idlest = NULL, *this = NULL, *group = sd->groups;
+ struct sched_group *idlest = NULL, *this = sd->groups, *group = sd->groups;
+ struct sched_group *self, *prev;
unsigned long min_load = ULONG_MAX, this_load = 0;
int load_idx = sd->forkexec_idx;
int imbalance = 100 + (sd->imbalance_pct-100)/2;
+ prev = group;
+ self = group;
do {
unsigned long load, avg_load;
int local_group;
@@ -1241,8 +1244,10 @@ find_idlest_group(struct sched_domain *s
idlest = group;
}
nextgroup:
+ prev = self;
+ self = group;
group = group->next;
- } while (group != sd->groups);
+ } while (group != sd->groups && group != self && group != prev);
if (!idlest || 100*this_load < imbalance*min_load)
return NULL;
@@ -2259,7 +2264,8 @@ find_busiest_group(struct sched_domain *
unsigned long *imbalance, enum idle_type idle, int *sd_idle,
cpumask_t *cpus, int *balance)
{
- struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
+ struct sched_group *busiest = NULL, *this = sd->groups, *group = sd->groups;
+ struct sched_group *self, *prev;
unsigned long max_load, avg_load, total_load, this_load, total_pwr;
unsigned long max_pull;
unsigned long busiest_load_per_task, busiest_nr_running;
@@ -2282,6 +2288,8 @@ find_busiest_group(struct sched_domain *
else
load_idx = sd->idle_idx;
+ prev = group;
+ self = group;
do {
unsigned long load, group_capacity;
int local_group;
@@ -2410,8 +2418,11 @@ find_busiest_group(struct sched_domain *
}
group_next:
#endif
+ prev = self;
+ self = group;
group = group->next;
- } while (group != sd->groups);
+ /* careful, a printk here can cause a spinlock hang */
+ } while (group != sd->groups && group != self && group != prev);
if (!busiest || this_load >= max_load || busiest_nr_running == 0)
goto out_balanced;
Index: morton.070205/kernel/cpuset.c
===================================================================
--- morton.070205.orig/kernel/cpuset.c
+++ morton.070205/kernel/cpuset.c
@@ -765,6 +765,8 @@ static int validate_change(const struct
* lock_cpu_hotplug()/unlock_cpu_hotplug() pair.
* Must not be called holding callback_mutex, because we must
* not call lock_cpu_hotplug() while holding callback_mutex.
+ *
+ * Recursive, on depth of cpuset subtree.
*/
static void update_cpu_domains(struct cpuset *cur)
@@ -790,8 +792,8 @@ static void update_cpu_domains(struct cp
return;
cspan = CPU_MASK_NONE;
} else {
- if (cpus_empty(pspan))
- return;
+ /* parent may be empty, but update anyway */
+
cspan = cur->cpus_allowed;
/*
* Get all cpus from current cpuset's cpus_allowed not part
@@ -806,6 +808,11 @@ static void update_cpu_domains(struct cp
lock_cpu_hotplug();
partition_sched_domains(&pspan, &cspan);
unlock_cpu_hotplug();
+
+ /* walk all its children to make sure it's all consistent */
+ list_for_each_entry(c, &cur->children, sibling) {
+ update_cpu_domains(c);
+ }
}
/*
* Re: Subject: [PATCH RESEND 1/1] cpusets/sched_domain reconciliation
From: Nick Piggin @ 2007-03-23 1:53 UTC (permalink / raw)
To: Cliff Wickman; +Cc: akpm, linux-mm
Cliff Wickman wrote:
> Submission #2: This patch was diffed against 2.6.21-rc4
> (first submission was against 2.6.20-rc6)
>
>
> This patch reconciles cpusets and sched_domains that get out of sync
> due to disabling and re-enabling of cpu's.
>
> Dinakar Guniguntala (IBM) is working on his own version of fixing this.
> But as of this date that fix doesn't seem to be ready.
>
> Here is an example of how the problem can occur:
>
> system of cpu's 0-31
> create cpuset /x 16-31
> create cpuset /x/y 16-23
> all cpu_exclusive
>
> disable cpu 17
> x is now 16,18-31
> x/y is now 16,18-23
> enable cpu 17
> x and x/y are unchanged
>
> to restore the cpusets:
> echo 16-31 > /dev/cpuset/x
> echo 16-23 > /dev/cpuset/x/y
>
> At the first echo, update_cpu_domains() is called for cpuset x/.
>
> The system is partitioned between:
> its parent, the root cpuset of 0-31, minus its
> children (x/ is 16-31): 0-15
> and x/ (16-31), minus its children (x/y/ 16,18-23): 17,24-31
>
> The sched_domain's for parent 0-15 are updated.
> The sched_domain's for current 17,24-31 are updated.
>
> But 16 has been untouched.
> As a result, 17's SD points to sched_group_phys[17] which is the only
> sched_group_phys on 17's list. It points to itself.
> But 16's SD points to sched_group_phys[16], which still points to
> sched_group_phys[17].
> When cpu 16 executes find_busiest_group() it will hang on the non-
> circular sched_group list.
>
> This solution is to update the sched_domain's for the cpuset
> whose cpu's were changed and, in addition, all its children.
> The update_cpu_domains() will end with a (recursive) call to itself
> for each child.
I had a patch for doing "something" that I thought was right here,
and IIRC it didn't use any recursive call.
The problem was that Paul didn't think it followed cpus_exclusive
correctly, and I don't think we ever got to the point of giving it
a rigourous definition.
Can we start with getting some useful definition? My suggestion was
something like that if cpus_exclusive is set, then no other sets
except descendants and ancestors could have overlapping cpus. That
didn't go down well, for reasons I don't think I quite understood...
> The extra sched_domain reconstruction is overhead, but only at the
> frequency of administrative change to the cpusets.
>
> This patch also includes checks in find_busiest_group() and
> find_idlest_group() that break from their loops on a sched_group that
> points to itself. This is needed because other cpu's are going through
> load balancing while the sched_domains are being reconstructed.
This is not really allowed; to keep the locking simple you have to go
through the full detach and reattach.
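For reference, this is roughly what the existing partition_sched_domains()
does (paraphrased from the 2.6.21-era code, which also appears as removed
lines in the patch later in this thread): every affected cpu is detached
to the NULL domain first, and only afterwards are the new domains built
and attached, so the balancing code should never see a half-rebuilt
group ring.

/* Sketch of the existing flow, not a new implementation. */
int partition_sched_domains(cpumask_t *partition1, cpumask_t *partition2)
{
	cpumask_t change_map;
	int err = 0;

	cpus_and(*partition1, *partition1, cpu_online_map);
	cpus_and(*partition2, *partition2, cpu_online_map);
	cpus_or(change_map, *partition1, *partition2);

	/* Detach sched domains from all of the affected cpus ... */
	detach_destroy_domains(&change_map);

	/* ... then rebuild and reattach the new ones. */
	if (!cpus_empty(*partition1))
		err = build_sched_domains(partition1);
	if (!err && !cpus_empty(*partition2))
		err = build_sched_domains(partition2);

	return err;
}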
--
SUSE Labs, Novell Inc.
* Re: Subject: [PATCH RESEND 1/1] cpusets/sched_domain reconciliation
From: Paul Jackson @ 2007-03-23 3:47 UTC (permalink / raw)
To: Nick Piggin; +Cc: cpw, akpm, linux-mm
Nick wrote:
> My suggestion was
> something like that if cpus_exclusive is set, then no other sets
> except descendants and ancestors could have overlapping cpus.
That sure sounds right ... did I say different at some point?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: Subject: [PATCH RESEND 1/1] cpusets/sched_domain reconciliation
From: Paul Jackson @ 2007-03-23 3:50 UTC (permalink / raw)
To: Nick Piggin; +Cc: cpw, akpm, linux-mm
Nick also wrote:
> The problem was that Paul didn't think it followed cpus_exclusive
> correctly, and I don't think we ever got to the point of giving it
> a rigourous definition.
From Documentation/cpusets.txt:
- A cpuset may be marked exclusive, which ensures that no other
cpuset (except direct ancestors and descendents) may contain
any overlapping CPUs or Memory Nodes.
This seems like the same definition to me as you gave, and I just
agreed to in my previous post a few minutes ago. It seems rigourous
to me ;>.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: Subject: [PATCH RESEND 1/1] cpusets/sched_domain reconciliation
From: Nick Piggin @ 2007-03-23 3:58 UTC (permalink / raw)
To: Paul Jackson; +Cc: cpw, akpm, linux-mm
Paul Jackson wrote:
> Nick wrote:
>
>>My suggestion was
>>something like that if cpus_exclusive is set, then no other sets
>>except descendants and ancestors could have overlapping cpus.
>
>
> That sure sounds right ... did I say different at some point?
>
... I can't really remember, probably not. I just remember not
quite understanding what was going on when this last came up ;)
Ah, I think the issue was this: if cpus_exclusive is set in a
child, then you still wanted correct balancing over all CPUs in
the parent set. Thus we can never partition the system, because
you always come back to the root set which covers everything, and
that would be incompatible with any sched domains partitions.
What I *didn't* understand was why we have any sched-domains
code in cpusets at all, if it is incorrect according to the above
definition (I think you vetoed my subsequent patch to remove it).
All that aside, I think we can probably do without cpus_exclusive
entirely (for sched-domains), and automatically detect a correct
set of partitions. I remember leaving that as an exercise for the
reader ;) but I think I've got some renewed energy, so I might
try tackling it.
--
SUSE Labs, Novell Inc.
* Re: Subject: [PATCH RESEND 1/1] cpusets/sched_domain reconciliation
From: Nick Piggin @ 2007-03-23 3:59 UTC (permalink / raw)
To: Paul Jackson; +Cc: cpw, akpm, linux-mm
Paul Jackson wrote:
> Nick also wrote:
>
>>The problem was that Paul didn't think it followed cpus_exclusive
>>correctly, and I don't think we ever got to the point of giving it
>>a rigourous definition.
>
>
>>From Documentation/cpusets.txt:
>
> - A cpuset may be marked exclusive, which ensures that no other
> cpuset (except direct ancestors and descendents) may contain
> any overlapping CPUs or Memory Nodes.
>
> This seems like the same definition to me as you gave, and I just
> agreed to in my previous post a few minutes ago. It seems rigourous
> to me ;>.
Yeah, see my earlier reply. Naturally I was confused as to the
nature of my earlier confusion ;)
--
SUSE Labs, Novell Inc.
* Re: Subject: [PATCH RESEND 1/1] cpusets/sched_domain reconciliation
From: Nick Piggin @ 2007-03-23 5:16 UTC (permalink / raw)
Cc: Paul Jackson, cpw, akpm, linux-mm
[-- Attachment #1: Type: text/plain, Size: 976 bytes --]
Nick Piggin wrote:
> All that aside, I think we can probably do without cpus_exclusive
> entirely (for sched-domains), and automatically detect a correct
> set of partitions. I remember leaving that as an exercise for the
> reader ;) but I think I've got some renewed energy, so I might
> try tackling it.
OK, something like this patch should automatically carve up the
sched-domains into an optimal set of partitions based solely on
the state of tasks in the system.
The downsides are that it is going to be a very expensive operation,
and also that it would need to be called at task exit time in
order to never lose updates.
However, the same algorithm can be implemented using the cpusets
topology instead of cpus_allowed, and it will be much cheaper
(and cpusets already has a task exit hook).
Hmm, there will still be some problems with kernel threads like
pdflush in the root cpuset, preventing the partitioning from actually
being activated...
--
SUSE Labs, Novell Inc.
[-- Attachment #2: sched-domains-cpusets-fixes.patch --]
[-- Type: text/plain, Size: 8435 bytes --]
Index: linux-2.6/kernel/cpuset.c
===================================================================
--- linux-2.6.orig/kernel/cpuset.c 2007-02-27 20:14:11.000000000 +1100
+++ linux-2.6/kernel/cpuset.c 2007-03-23 16:02:41.000000000 +1100
@@ -754,61 +754,6 @@ static int validate_change(const struct
}
/*
- * For a given cpuset cur, partition the system as follows
- * a. All cpus in the parent cpuset's cpus_allowed that are not part of any
- * exclusive child cpusets
- * b. All cpus in the current cpuset's cpus_allowed that are not part of any
- * exclusive child cpusets
- * Build these two partitions by calling partition_sched_domains
- *
- * Call with manage_mutex held. May nest a call to the
- * lock_cpu_hotplug()/unlock_cpu_hotplug() pair.
- * Must not be called holding callback_mutex, because we must
- * not call lock_cpu_hotplug() while holding callback_mutex.
- */
-
-static void update_cpu_domains(struct cpuset *cur)
-{
- struct cpuset *c, *par = cur->parent;
- cpumask_t pspan, cspan;
-
- if (par == NULL || cpus_empty(cur->cpus_allowed))
- return;
-
- /*
- * Get all cpus from parent's cpus_allowed not part of exclusive
- * children
- */
- pspan = par->cpus_allowed;
- list_for_each_entry(c, &par->children, sibling) {
- if (is_cpu_exclusive(c))
- cpus_andnot(pspan, pspan, c->cpus_allowed);
- }
- if (!is_cpu_exclusive(cur)) {
- cpus_or(pspan, pspan, cur->cpus_allowed);
- if (cpus_equal(pspan, cur->cpus_allowed))
- return;
- cspan = CPU_MASK_NONE;
- } else {
- if (cpus_empty(pspan))
- return;
- cspan = cur->cpus_allowed;
- /*
- * Get all cpus from current cpuset's cpus_allowed not part
- * of exclusive children
- */
- list_for_each_entry(c, &cur->children, sibling) {
- if (is_cpu_exclusive(c))
- cpus_andnot(cspan, cspan, c->cpus_allowed);
- }
- }
-
- lock_cpu_hotplug();
- partition_sched_domains(&pspan, &cspan);
- unlock_cpu_hotplug();
-}
-
-/*
* Call with manage_mutex held. May take callback_mutex during call.
*/
@@ -835,8 +780,6 @@ static int update_cpumask(struct cpuset
mutex_lock(&callback_mutex);
cs->cpus_allowed = trialcs.cpus_allowed;
mutex_unlock(&callback_mutex);
- if (is_cpu_exclusive(cs) && !cpus_unchanged)
- update_cpu_domains(cs);
return 0;
}
@@ -1064,9 +1007,6 @@ static int update_flag(cpuset_flagbits_t
mutex_lock(&callback_mutex);
cs->flags = trialcs.flags;
mutex_unlock(&callback_mutex);
-
- if (cpu_exclusive_changed)
- update_cpu_domains(cs);
return 0;
}
@@ -1931,17 +1871,6 @@ static int cpuset_mkdir(struct inode *di
return cpuset_create(c_parent, dentry->d_name.name, mode | S_IFDIR);
}
-/*
- * Locking note on the strange update_flag() call below:
- *
- * If the cpuset being removed is marked cpu_exclusive, then simulate
- * turning cpu_exclusive off, which will call update_cpu_domains().
- * The lock_cpu_hotplug() call in update_cpu_domains() must not be
- * made while holding callback_mutex. Elsewhere the kernel nests
- * callback_mutex inside lock_cpu_hotplug() calls. So the reverse
- * nesting would risk an ABBA deadlock.
- */
-
static int cpuset_rmdir(struct inode *unused_dir, struct dentry *dentry)
{
struct cpuset *cs = dentry->d_fsdata;
@@ -1961,13 +1890,7 @@ static int cpuset_rmdir(struct inode *un
mutex_unlock(&manage_mutex);
return -EBUSY;
}
- if (is_cpu_exclusive(cs)) {
- int retval = update_flag(CS_CPU_EXCLUSIVE, cs, "0");
- if (retval < 0) {
- mutex_unlock(&manage_mutex);
- return retval;
- }
- }
+
parent = cs->parent;
mutex_lock(&callback_mutex);
set_bit(CS_REMOVED, &cs->flags);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c 2007-03-22 20:48:52.000000000 +1100
+++ linux-2.6/kernel/sched.c 2007-03-23 16:06:00.000000000 +1100
@@ -4600,6 +4600,8 @@ cpumask_t nohz_cpu_mask = CPU_MASK_NONE;
* 7) we wake up and the migration is done.
*/
+static void autopartition_sched_domains(void);
+
/*
* Change a given task's CPU affinity. Migrate the thread to a
* proper CPU and schedule it away if the CPU it's executing on
@@ -4623,6 +4625,7 @@ int set_cpus_allowed(struct task_struct
}
p->cpus_allowed = new_mask;
+
/* Can the task run on the task's current CPU? If so, we're done */
if (cpu_isset(task_cpu(p), new_mask))
goto out;
@@ -4637,6 +4640,7 @@ int set_cpus_allowed(struct task_struct
}
out:
task_rq_unlock(rq, &flags);
+ autopartition_sched_domains();
return ret;
}
@@ -6328,29 +6332,106 @@ static void detach_destroy_domains(const
/*
* Partition sched domains as specified by the cpumasks below.
- * This attaches all cpus from the cpumasks to the NULL domain,
+ * This attaches all cpus from the partition to the NULL domain,
* waits for a RCU quiescent period, recalculates sched
- * domain information and then attaches them back to the
- * correct sched domains
- * Call with hotplug lock held
+ * domain information and then attaches them back to their own
+ * isolated partition.
+ *
+ * Called with hotplug lock held
+ *
+ * Returns 0 on success.
*/
-int partition_sched_domains(cpumask_t *partition1, cpumask_t *partition2)
+int partition_sched_domains(const cpumask_t *partition)
{
- cpumask_t change_map;
- int err = 0;
+ cpumask_t cpu_offline_map;
- cpus_and(*partition1, *partition1, cpu_online_map);
- cpus_and(*partition2, *partition2, cpu_online_map);
- cpus_or(change_map, *partition1, *partition2);
+ if (cpus_intersects(*partition, cpu_isolated_map) &&
+ cpus_weight(*partition) != 1) {
+ WARN_ON(1);
+ return -EINVAL;
+ }
+
+ cpus_complement(cpu_offline_map, cpu_online_map);
+ if (cpus_intersects(*partition, cpu_offline_map)) {
+ WARN_ON(1);
+ return -EINVAL;
+ }
/* Detach sched domains from all of the affected cpus */
- detach_destroy_domains(&change_map);
- if (!cpus_empty(*partition1))
- err = build_sched_domains(partition1);
- if (!err && !cpus_empty(*partition2))
- err = build_sched_domains(partition2);
+ detach_destroy_domains(partition);
- return err;
+ return build_sched_domains(partition);
+}
+
+struct domain_partition {
+ struct list_head list;
+ cpumask_t cpumask;
+};
+
+static DEFINE_MUTEX(autopartition_mutex);
+static void autopartition_sched_domains(void)
+{
+ LIST_HEAD(cover);
+ cpumask_t span;
+ struct task_struct *p;
+ struct domain_partition *dp, *tmp;
+
+ mutex_lock(&autopartition_mutex);
+ cpus_clear(span);
+
+ /*
+ * Need to build the disjoint covering set of unions of overlapping
+ * task cpumasks. This gives us the best possible sched-domains
+ * partition.
+ */
+ /* XXX: note this would need to be called at task exit to always
+ * provide a perfect partition. This is probably going to be much
+ * easier if driven from cpusets.
+ */
+ read_lock(&tasklist_lock);
+ for_each_process(p) {
+
+ cpumask_t c = p->cpus_allowed;
+ if (!cpus_intersects(span, c)) {
+add_new_partition:
+ dp = kmalloc(sizeof(struct domain_partition), GFP_ATOMIC);
+ if (!dp)
+ panic("XXX: should preallocate these\n");
+ INIT_LIST_HEAD(&dp->list);
+ dp->cpumask = c;
+
+ list_add(&dp->list, &cover);
+ cpus_or(span, span, c);
+ } else {
+ cpumask_t newcov = c;
+ list_for_each_entry_safe(dp, tmp, &cover, list) {
+ if (cpus_intersects(c, dp->cpumask)) {
+ cpus_or(newcov, newcov, dp->cpumask);
+ list_del(&dp->list);
+ kfree(dp);
+ }
+ }
+ c = newcov;
+ goto add_new_partition;
+ }
+ }
+ read_unlock(&tasklist_lock);
+
+ detach_destroy_domains(&cpu_online_map);
+
+ cpus_clear(span);
+ list_for_each_entry_safe(dp, tmp, &cover, list) {
+ BUG_ON(cpus_intersects(span, dp->cpumask));
+ cpus_or(span, span, dp->cpumask);
+
+ build_sched_domains(&dp->cpumask);
+
+ list_del(&dp->list);
+ kfree(dp);
+ }
+ BUG_ON(!list_empty(&cover));
+
+ mutex_unlock(&autopartition_mutex);
}
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h 2007-03-22 20:45:18.000000000 +1100
+++ linux-2.6/include/linux/sched.h 2007-03-23 15:19:43.000000000 +1100
@@ -729,8 +729,7 @@ struct sched_domain {
#endif
};
-extern int partition_sched_domains(cpumask_t *partition1,
- cpumask_t *partition2);
+extern int partition_sched_domains(const cpumask_t *partition);
/*
* Maximum cache size the migration-costs auto-tuning code will
* Re: Subject: [PATCH RESEND 1/1] cpusets/sched_domain reconciliation
From: Paul Jackson @ 2007-03-23 23:55 UTC (permalink / raw)
To: Nick Piggin; +Cc: cpw, akpm, linux-mm
Nick wrote:
> Hmm, there will be still some problems with kernel thread like
> pdflush in the root cpuset, preventing the partitioning to be
> actually activated...
Yeah - that concerns me too, in addition to the performance
implications of considering sched domain partitions every time a task
exits or moves. Both seem to me to be serious problems with that
approach.
> However, the same algorithm can be implemented using the cpusets
> topology instead of cpus_allowed, and it will be much cheaper
> (and cpusets already has a task exit hook).
On the plus side, we wouldn't have to do this at task exit, if we
drove it off the cpuset cpus_allowed values. We would only need to
adjust the sched domains when the cpusets changed, which is a much
less frequent operation.
On the down side, it doesn't work ;)
The root cpuset includes all online cpus, so would prevent any
partitioning.
Simply excluding the root cpuset from the calculation isn't necessarily
sufficient either. Fairly often we have big systems divide, at the
level just below the root cpuset, into two cpusets: one small cpuset
for the classic Unix daemon and user login load, and one huge cpuset
managed by some such batch scheduler as PBS, LSF or SGE.
Partitioning the sched domains on a 512 CPU system into a 4 CPU
partition and a 508 CPU partition doesn't exactly help much.
The batch scheduler has knowledge, which it is usually quite willing to
leverage, as to which of the many child and grandchild cpusets it has
setup hold active jobs and need load balancing, and which don't need
balancing. I doubt that the kernel can reliably intuit this.
But I think we've been here before, you and I <grin>.
If I recall correctly, you've been inclined to an implicit interface
for defining sched domain partitions, and certainly I've been inclined
to an explicit interface.
I'm still inclined to an explicit interface. The folks who would
benefit from hard partitioning of the sched domains will have varied
and specific needs. If we try to get away with some implicit interface
whereby the kernel intuits where to put those partitions based on a
natural partition of task->cpus_allowed or task->cpuset->cpus_allowed,
then those users will just have to quite consciously learn the magic
incantations needed to blindly define the partitioning they require.
Implicit interfaces are fine if they usually get the right answer,
and only rarely does user code have to second guess them. They are
a pain in the backside when they have to be worked around.
The cpuset cpu_exclusive flag was the wrong hook for this explicit
interface, and I have had, for months now, a patch in Andrew's *-mm
tree, to disconnect the cpuset cpu_exclusive flag from any defining
role in sched domain partitions. I never should have agreed to that
(ab)use of the cpu_exclusive in the first place.
I've been asking Andrew to hold off on sending that patch to Linus,
until we can get our act together somehow on an alternative mechanism
for defining sched domain partitions, as at least the real time folks
are already depending on having such a capability. I'm willing to
impose on them a change in the API they use for this, but I am not
willing to remove that capability entirely from them. The real time
folks need some way to mark certain CPUs as not being subject to
any scheduler load balancing at all. This can also be done with a
kernel boot command option, but they really depend on being able to
dynamically (albeit infrequently) add or remove CPUs from the list of
"real-time" CPUs exempt from load balancing.
This explicit API should either be via the cpuset hierarchy, with a
new per-cpuset attribute, or else by a new and separate interface by
which user space can define a partition of the system's CPUs for the
scheduler (where a partition is a covering set of disjoint subsets,
and likely expressed as a list of cpumasks in this situation, one
cpumask for each element of the partition.)
If it is via a new per-cpuset flag, then that flag can either mark
those cpusets requiring load balancing, or those not requiring it.
If it is via cpumasks, then there would be a file or directory beneath
/proc, with one cpumask per line (or per file), such that the masks
were pairwise disjoint and such that their union equaled the set of
online cpus (not sure this last rule is necessary.)
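As a rough illustration only (a userspace sketch with made-up example
masks, not a proposed kernel interface), the rule being described --
pairwise disjoint masks whose union covers the online cpus -- amounts
to a check like this:

#include <stdio.h>

#define ONLINE_MASK 0xffUL	/* hypothetical system with 8 online cpus */

/* A proposed partition is valid if its masks are pairwise disjoint
 * and together cover every online cpu. */
static int is_valid_partition(const unsigned long *masks, int n)
{
	unsigned long seen = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (seen & masks[i])	/* overlap: not pairwise disjoint */
			return 0;
		seen |= masks[i];
	}
	return seen == ONLINE_MASK;	/* union must equal the online cpus */
}

int main(void)
{
	unsigned long parts[2] = { 0x0fUL, 0xf0UL };	/* cpus 0-3 and 4-7 */

	printf("valid: %d\n", is_valid_partition(parts, 2));
	return 0;
}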
I am inclined toward a cpuset flag here, and figure it can be done
with less kernel code this way, as we would just need one more
boolean flag bit. But I recognize that I'm biased here, and that
others might need to partition sched domains without setting up an
entire cpuset hierarchy.
Either way, this flag is a hint to the kernel, allowing it to
improve load balancing performance by avoiding considering certain
possibilities.
Since defining a new partitioning is not an atomic operation (user
code will have to update possibly several cpusets or masks first),
perhaps there should be a separate trigger, and we only recalculate
when that trigger is fired. This might make the kernel code easier,
in that it doesn't have to react to every change in the defining
flags or masks. And it surely would make the user code easier,
as it would not have to carefully sequence a restructuring of the
partitioning into mini-steps, each one of which was a valid partition.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401