* [PATCH v3 0/6] pseudo-interleaving for automatic NUMA balancing
@ 2014-01-20 19:21 riel
2014-01-20 19:21 ` [PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred riel
` (5 more replies)
0 siblings, 6 replies; 18+ messages in thread
From: riel @ 2014-01-20 19:21 UTC
To: linux-kernel; +Cc: linux-mm, peterz, mgorman, mingo, chegu_vinod
The current automatic NUMA balancing code base has issues with
workloads that do not fit on one NUMA node. Page migration is
slowed down, but memory distribution between the nodes where
the workload runs is essentially random, often resulting in a
suboptimal amount of memory bandwidth being available to the
workload.
In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload
This patch series identifies the NUMA nodes on which the workload
is actively running, and balances (somewhat lazily) the memory
between those nodes, satisfying the criteria above.
As usual, the series has had some performance testing, but it
could always benefit from more testing, on other systems.
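For anyone who wants to help test: a quick way to watch migration behaviour
while a workload runs is to sample the automatic NUMA balancing counters in
/proc/vmstat before and after a run. A minimal sketch (the counter names are
the ones exported by the existing NUMA balancing statistics; availability
depends on CONFIG_NUMA_BALANCING):

/* Sketch: dump the automatic NUMA balancing counters from /proc/vmstat. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		/* "numa_hint_faults" also matches numa_hint_faults_local */
		if (!strncmp(line, "numa_pte_updates", 16) ||
		    !strncmp(line, "numa_hint_faults", 16) ||
		    !strncmp(line, "numa_pages_migrated", 19))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}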
Changes since v2:
- dropped tracepoint (for now?)
- implement obvious improvements suggested by Peter
- use the scheduler maintained CPU use statistics, drop
the NUMA specific ones for now. We can add those later
if they turn out to be beneficial
Changes since v1:
- fix divide by zero found by Chegu Vinod
- improve comment, as suggested by Peter Zijlstra
- do stats calculations in task_numa_placement in local variables
Some performance numbers, with two 40-warehouse specjbb instances
on an 8 node system with 10 CPU cores per node, using a pre-cleanup
version of these patches, courtesy of Chegu Vinod:
numactl manual pinning
spec1.txt: throughput = 755900.20 SPECjbb2005 bops
spec2.txt: throughput = 754914.40 SPECjbb2005 bops
NO-pinning results (Automatic NUMA balancing, with patches)
spec1.txt: throughput = 706439.84 SPECjbb2005 bops
spec2.txt: throughput = 729347.75 SPECjbb2005 bops
NO-pinning results (Automatic NUMA balancing, without patches)
spec1.txt: throughput = 667988.47 SPECjbb2005 bops
spec2.txt: throughput = 638220.45 SPECjbb2005 bops
No Automatic NUMA and NO-pinning results
spec1.txt: throughput = 544120.97 SPECjbb2005 bops
spec2.txt: throughput = 453553.41 SPECjbb2005 bops
My own performance numbers are not as relevant, since I have been
running with a more hostile workload on purpose, and I have run
into a scheduler issue that caused the workload to run on only
two of the four NUMA nodes on my test system...
* [PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred
2014-01-20 19:21 [PATCH v3 0/6] pseudo-interleaving for automatic NUMA balancing riel
@ 2014-01-20 19:21 ` riel
2014-01-21 11:52 ` Mel Gorman
2014-01-20 19:21 ` [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered riel
` (4 subsequent siblings)
5 siblings, 1 reply; 18+ messages in thread
From: riel @ 2014-01-20 19:21 UTC
To: linux-kernel; +Cc: linux-mm, peterz, mgorman, mingo, chegu_vinod
From: Rik van Riel <riel@redhat.com>
Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.
Now that the second stage of the automatic numa balancing code
has stabilized, it is time to replace the simplistic migration
deferral code with something smarter.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
include/linux/sched.h | 1 -
kernel/sched/fair.c | 8 --------
kernel/sysctl.c | 7 -------
mm/mempolicy.c | 45 ---------------------------------------------
4 files changed, 61 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 68a0e84..97efba4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1469,7 +1469,6 @@ struct task_struct {
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
int numa_preferred_nid;
- int numa_migrate_deferred;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 867b0a4..41e2176 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -819,14 +819,6 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
unsigned int sysctl_numa_balancing_scan_delay = 1000;
-/*
- * After skipping a page migration on a shared page, skip N more numa page
- * migrations unconditionally. This reduces the number of NUMA migrations
- * in shared memory workloads, and has the effect of pulling tasks towards
- * where their memory lives, over pulling the memory towards the task.
- */
-unsigned int sysctl_numa_balancing_migrate_deferred = 16;
-
static unsigned int task_nr_scan_windows(struct task_struct *p)
{
unsigned long rss = 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 096db74..4d19492 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -384,13 +384,6 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
{
- .procname = "numa_balancing_migrate_deferred",
- .data = &sysctl_numa_balancing_migrate_deferred,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = proc_dointvec,
- },
- {
.procname = "numa_balancing",
.data = NULL, /* filled in by handler */
.maxlen = sizeof(unsigned int),
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36cb46c..052abac 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2301,35 +2301,6 @@ static void sp_free(struct sp_node *n)
kmem_cache_free(sn_cache, n);
}
-#ifdef CONFIG_NUMA_BALANCING
-static bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
-{
- /* Never defer a private fault */
- if (cpupid_match_pid(p, last_cpupid))
- return false;
-
- if (p->numa_migrate_deferred) {
- p->numa_migrate_deferred--;
- return true;
- }
- return false;
-}
-
-static inline void defer_numa_migrate(struct task_struct *p)
-{
- p->numa_migrate_deferred = sysctl_numa_balancing_migrate_deferred;
-}
-#else
-static inline bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
-{
- return false;
-}
-
-static inline void defer_numa_migrate(struct task_struct *p)
-{
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
/**
* mpol_misplaced - check whether current page node is valid in policy
*
@@ -2432,24 +2403,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
*/
last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
-
- /* See sysctl_numa_balancing_migrate_deferred comment */
- if (!cpupid_match_pid(current, last_cpupid))
- defer_numa_migrate(current);
-
goto out;
}
-
- /*
- * The quadratic filter above reduces extraneous migration
- * of shared pages somewhat. This code reduces it even more,
- * reducing the overhead of page migrations of shared pages.
- * This makes workloads with shared pages rely more on
- * "move task near its memory", and less on "move memory
- * towards its task", which is exactly what we want.
- */
- if (numa_migrate_deferred(current, last_cpupid))
- goto out;
}
if (curnid != polnid)
--
1.8.4.2
* [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered
2014-01-20 19:21 [PATCH v3 0/6] pseudo-interleaving for automatic NUMA balancing riel
2014-01-20 19:21 ` [PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred riel
@ 2014-01-20 19:21 ` riel
2014-01-21 12:21 ` Mel Gorman
2014-01-20 19:21 ` [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics riel
` (3 subsequent siblings)
5 siblings, 1 reply; 18+ messages in thread
From: riel @ 2014-01-20 19:21 UTC
To: linux-kernel; +Cc: linux-mm, peterz, mgorman, mingo, chegu_vinod
From: Rik van Riel <riel@redhat.com>
Track which nodes NUMA faults are triggered from, in other words
the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.
The next patches use this to build up a bitmap of which nodes a
workload is actively running on.
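As an illustration of the data layout, the per-task allocation is carved up
into four arrays of 2*nr_node_ids counters each (a shared and a private slot
per node). A minimal userspace sketch of that layout, assuming the existing
task_faults_idx(nid, priv) mapping of 2 * nid + priv and an example node count:

/* Sketch: layout of the shared numa_faults allocation after this patch. */
#include <stdio.h>
#include <stdlib.h>

static int nr_node_ids = 4;		/* example machine with 4 nodes */

static int task_faults_idx(int nid, int priv)
{
	return 2 * nid + priv;		/* one shared + one private slot per node */
}

int main(void)
{
	/* numa_faults, numa_faults_from and both buffers share one allocation */
	int size = 4 * nr_node_ids;
	unsigned long *numa_faults = calloc(2 * size, sizeof(*numa_faults));
	unsigned long *numa_faults_from, *numa_faults_buffer;
	unsigned long *numa_faults_from_buffer;

	if (!numa_faults)
		return 1;

	numa_faults_from = numa_faults + 2 * nr_node_ids;
	numa_faults_buffer = numa_faults + 4 * nr_node_ids;
	numa_faults_from_buffer = numa_faults + 6 * nr_node_ids;

	/* record a fault against memory on node 2, taken by a CPU on node 1 */
	numa_faults_buffer[task_faults_idx(2, 1)]++;
	numa_faults_from_buffer[task_faults_idx(1, 1)]++;

	/* task_numa_placement later folds the buffers into the long-term stats */
	numa_faults[task_faults_idx(2, 1)] +=
		numa_faults_buffer[task_faults_idx(2, 1)];
	numa_faults_from[task_faults_idx(1, 1)] +=
		numa_faults_from_buffer[task_faults_idx(1, 1)];

	printf("memory node 2 faults: %lu, cpu node 1 faults: %lu\n",
	       numa_faults[task_faults_idx(2, 1)],
	       numa_faults_from[task_faults_idx(1, 1)]);
	free(numa_faults);
	return 0;
}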
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
include/linux/sched.h | 10 ++++++++--
kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 31 insertions(+), 9 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 97efba4..a9f7f05 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1492,6 +1492,14 @@ struct task_struct {
unsigned long *numa_faults_buffer;
/*
+ * Track the nodes where faults are incurred. This is not very
+ * interesting on a per-task basis, but it help with smarter
+ * numa memory placement for groups of processes.
+ */
+ unsigned long *numa_faults_from;
+ unsigned long *numa_faults_from_buffer;
+
+ /*
* numa_faults_locality tracks if faults recorded during the last
* scan window were remote/local. The task scan period is adapted
* based on the locality of the faults with different weights
@@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
-
-extern unsigned int sysctl_numa_balancing_migrate_deferred;
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41e2176..1945ddc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,7 @@ struct numa_group {
struct rcu_head rcu;
unsigned long total_faults;
+ unsigned long *faults_from;
unsigned long faults[0];
};
@@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;
for (priv = 0; priv < 2; priv++) {
- long diff;
+ long diff, f_diff;
i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
+ f_diff = -p->numa_faults_from[i];
/* Decay existing window, copy faults since last scan */
p->numa_faults[i] >>= 1;
@@ -1383,12 +1385,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;
+ p->numa_faults_from[i] >>= 1;
+ p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+ p->numa_faults_from_buffer[i] = 0;
+
faults += p->numa_faults[i];
diff += p->numa_faults[i];
+ f_diff += p->numa_faults_from[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
p->numa_group->faults[i] += diff;
+ p->numa_group->faults_from[i] += f_diff;
p->numa_group->total_faults += diff;
group_faults += p->numa_group->faults[i];
}
@@ -1457,7 +1465,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
if (unlikely(!p->numa_group)) {
unsigned int size = sizeof(struct numa_group) +
- 2*nr_node_ids*sizeof(unsigned long);
+ 4*nr_node_ids*sizeof(unsigned long);
grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (!grp)
@@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
spin_lock_init(&grp->lock);
INIT_LIST_HEAD(&grp->task_list);
grp->gid = p->pid;
+ /* Second half of the array tracks where faults come from */
+ grp->faults_from = grp->faults + 2 * nr_node_ids;
- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];
grp->total_faults = p->total_numa_faults;
@@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
double_lock(&my_grp->lock, &grp->lock);
- for (i = 0; i < 2*nr_node_ids; i++) {
+ for (i = 0; i < 4*nr_node_ids; i++) {
my_grp->faults[i] -= p->numa_faults[i];
grp->faults[i] += p->numa_faults[i];
}
@@ -1558,7 +1568,7 @@ void task_numa_free(struct task_struct *p)
if (grp) {
spin_lock(&grp->lock);
- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] -= p->numa_faults[i];
grp->total_faults -= p->total_numa_faults;
@@ -1571,6 +1581,8 @@ void task_numa_free(struct task_struct *p)
p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+ p->numa_faults_from = NULL;
+ p->numa_faults_from_buffer = NULL;
kfree(numa_faults);
}
@@ -1581,6 +1593,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
{
struct task_struct *p = current;
bool migrated = flags & TNF_MIGRATED;
+ int this_node = task_node(current);
int priv;
if (!numabalancing_enabled)
@@ -1596,7 +1609,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
- int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+ int size = sizeof(*p->numa_faults) * 4 * nr_node_ids;
/* numa_faults and numa_faults_buffer share the allocation */
p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -1604,7 +1617,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
return;
BUG_ON(p->numa_faults_buffer);
- p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+ p->numa_faults_from = p->numa_faults + (2 * nr_node_ids);
+ p->numa_faults_buffer = p->numa_faults + (4 * nr_node_ids);
+ p->numa_faults_from_buffer = p->numa_faults + (6 * nr_node_ids);
p->total_numa_faults = 0;
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}
@@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
p->numa_pages_migrated += pages;
p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
+ p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
}
--
1.8.4.2
* [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics
2014-01-20 19:21 [PATCH v3 0/6] pseudo-interleaving for automatic NUMA balancing riel
2014-01-20 19:21 ` [PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred riel
2014-01-20 19:21 ` [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered riel
@ 2014-01-20 19:21 ` riel
2014-01-21 14:19 ` Mel Gorman
2014-01-20 19:21 ` [PATCH 4/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations riel
` (2 subsequent siblings)
5 siblings, 1 reply; 18+ messages in thread
From: riel @ 2014-01-20 19:21 UTC
To: linux-kernel; +Cc: linux-mm, peterz, mgorman, mingo, chegu_vinod
From: Rik van Riel <riel@redhat.com>
The faults_from statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.
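For reference, here is a stand-alone sketch of the add/clear hysteresis the
patch applies, using a plain bitmask in place of the nodemask and made-up
per-node faults_from totals:

/* Sketch: the 6/16 add / 3/16 clear hysteresis used for active_nodes. */
#include <stdio.h>

#define NR_NODES 4

int main(void)
{
	/* faults_from totals for one numa_group (example numbers) */
	unsigned long faults[NR_NODES] = { 800, 350, 120, 20 };
	unsigned long active = 0x1;	/* assume only node 0 was active so far */
	unsigned long max_faults = 0;
	int nid;

	for (nid = 0; nid < NR_NODES; nid++)
		if (faults[nid] > max_faults)
			max_faults = faults[nid];

	for (nid = 0; nid < NR_NODES; nid++) {
		if (!(active & (1UL << nid))) {
			/* add a node once it exceeds 6/16 of the maximum */
			if (faults[nid] > max_faults * 6 / 16)
				active |= 1UL << nid;
		} else if (faults[nid] < max_faults * 3 / 16) {
			/* only drop it again when it falls below 3/16 */
			active &= ~(1UL << nid);
		}
	}

	printf("active node mask: 0x%lx\n", active);	/* 0x3: nodes 0 and 1 */
	return 0;
}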
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1945ddc..ea8b2ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -885,6 +885,7 @@ struct numa_group {
struct list_head task_list;
struct rcu_head rcu;
+ nodemask_t active_nodes;
unsigned long total_faults;
unsigned long *faults_from;
unsigned long faults[0];
@@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
}
/*
+ * Find the nodes on which the workload is actively running. We do this by
+ * tracking the nodes from which NUMA hinting faults are triggered. This can
+ * be different from the set of nodes where the workload's memory is currently
+ * located.
+ *
+ * The bitmask is used to make smarter decisions on when to do NUMA page
+ * migrations, To prevent flip-flopping, and excessive page migrations, nodes
+ * are added when they cause over 6/16 of the maximum number of faults, but
+ * only removed when they drop below 3/16.
+ */
+static void update_numa_active_node_mask(struct task_struct *p)
+{
+ unsigned long faults, max_faults = 0;
+ struct numa_group *numa_group = p->numa_group;
+ int nid;
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+ numa_group->faults_from[task_faults_idx(nid, 1)];
+ if (faults > max_faults)
+ max_faults = faults;
+ }
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+ numa_group->faults_from[task_faults_idx(nid, 1)];
+ if (!node_isset(nid, numa_group->active_nodes)) {
+ if (faults > max_faults * 6 / 16)
+ node_set(nid, numa_group->active_nodes);
+ } else if (faults < max_faults * 3 / 16)
+ node_clear(nid, numa_group->active_nodes);
+ }
+}
+
+/*
* When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
* increments. The more local the fault statistics are, the higher the scan
* period will be for the next scan window. If local/remote ratio is below
@@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
update_task_scan_period(p, fault_types[0], fault_types[1]);
if (p->numa_group) {
+ update_numa_active_node_mask(p);
/*
* If the preferred task and group nids are different,
* iterate over the nodes again to find the best place.
@@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
/* Second half of the array tracks where faults come from */
grp->faults_from = grp->faults + 2 * nr_node_ids;
+ node_set(task_node(current), grp->active_nodes);
+
for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];
@@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
my_grp->nr_tasks--;
grp->nr_tasks++;
+ update_numa_active_node_mask(p);
+
spin_unlock(&my_grp->lock);
spin_unlock(&grp->lock);
--
1.8.4.2
* [PATCH 4/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations
2014-01-20 19:21 [PATCH v3 0/6] pseudo-interleaving for automatic NUMA balancing riel
` (2 preceding siblings ...)
2014-01-20 19:21 ` [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics riel
@ 2014-01-20 19:21 ` riel
2014-01-21 15:08 ` Mel Gorman
2014-01-20 19:21 ` [PATCH 5/6] numa,sched: normalize faults_from stats and weigh by CPU use riel
2014-01-20 19:21 ` [PATCH 6/6] numa,sched: do statistics calculation using local variables only riel
5 siblings, 1 reply; 18+ messages in thread
From: riel @ 2014-01-20 19:21 UTC
To: linux-kernel; +Cc: linux-mm, peterz, mgorman, mingo, chegu_vinod
From: Rik van Riel <riel@redhat.com>
Use the active_nodes nodemask to make smarter decisions on NUMA migrations.
In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload
This patch accomplishes that by implementing the following policy for
NUMA migrations:
1) always migrate on a private fault
2) never migrate to a node that is not in the set of active nodes
for the numa_group
3) always migrate from a node outside of the set of active nodes,
to a node that is in that set
4) within the set of active nodes in the numa_group, only migrate
from a node with more NUMA page faults, to a node with fewer
NUMA page faults, with a 25% margin to avoid ping-ponging
This results in most pages of a workload ending up on the actively
used nodes, with reduced ping-ponging of pages between those nodes.
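To make rule 4 above concrete, here is a small sketch of the within-active-set
check with made-up fault counts (the kernel uses the group fault statistics
for the source and destination nodes):

/* Sketch: only migrate towards a node with clearly fewer group faults,
 * using a 25% margin to avoid ping-ponging. */
#include <stdbool.h>
#include <stdio.h>

static bool migrate_within_active_set(unsigned long src_faults,
				      unsigned long dst_faults)
{
	/* migrate only if the destination has under 3/4 of the source's faults */
	return dst_faults < src_faults * 3 / 4;
}

int main(void)
{
	/* 400 faults on the source node: only migrate below 300 on the dest */
	printf("dst=280 -> %d\n", migrate_within_active_set(400, 280));	/* 1 */
	printf("dst=350 -> %d\n", migrate_within_active_set(400, 350));	/* 0 */
	return 0;
}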
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
include/linux/sched.h | 7 +++++++
kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++
mm/mempolicy.c | 3 +++
3 files changed, 47 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a9f7f05..0af6c1a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1602,6 +1602,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
+extern bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid);
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
@@ -1617,6 +1619,11 @@ static inline void set_numabalancing_state(bool enabled)
static inline void task_numa_free(struct task_struct *p)
{
}
+static inline bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid)
+{
+ return true;
+}
#endif
static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea8b2ae..ea873b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -948,6 +948,43 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
}
+bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid)
+{
+ struct numa_group *ng = p->numa_group;
+
+ /* Always allow migrate on private faults */
+ if (cpupid_match_pid(p, last_cpupid))
+ return true;
+
+ /* A shared fault, but p->numa_group has not been set up yet. */
+ if (!ng)
+ return true;
+
+ /*
+ * Do not migrate if the destination is not a node that
+ * is actively used by this numa group.
+ */
+ if (!node_isset(dst_nid, ng->active_nodes))
+ return false;
+
+ /*
+ * Source is a node that is not actively used by this
+ * numa group, while the destination is. Migrate.
+ */
+ if (!node_isset(src_nid, ng->active_nodes))
+ return true;
+
+ /*
+ * Both source and destination are nodes in active
+ * use by this numa group. Maximize memory bandwidth
+ * by migrating from more heavily used groups, to less
+ * heavily used ones, spreading the load around.
+ * Use a 1/4 hysteresis to avoid spurious page movement.
+ */
+ return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4);
+}
+
static unsigned long weighted_cpuload(const int cpu);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 052abac..050962b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2405,6 +2405,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
goto out;
}
+
+ if (!should_numa_migrate(current, last_cpupid, curnid, polnid))
+ goto out;
}
if (curnid != polnid)
--
1.8.4.2
* [PATCH 5/6] numa,sched: normalize faults_from stats and weigh by CPU use
2014-01-20 19:21 [PATCH v3 0/6] pseudo-interleaving for automatic NUMA balancing riel
` (3 preceding siblings ...)
2014-01-20 19:21 ` [PATCH 4/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations riel
@ 2014-01-20 19:21 ` riel
2014-01-21 15:56 ` Mel Gorman
2014-01-20 19:21 ` [PATCH 6/6] numa,sched: do statistics calculation using local variables only riel
5 siblings, 1 reply; 18+ messages in thread
From: riel @ 2014-01-20 19:21 UTC
To: linux-kernel; +Cc: linux-mm, peterz, mgorman, mingo, chegu_vinod
From: Rik van Riel <riel@redhat.com>
The tracepoint has made it abundantly clear that the naive
implementation of the faults_from code has issues.
Specifically, the garbage collector in some workloads will
access orders of magnitude more memory than the threads
that do all the active work. This resulted in the node with
the garbage collector being marked the only active node in
the group.
This issue is avoided if we weigh the statistics by CPU use
of each task in the numa group, instead of by how many faults
each thread has incurred.
To achieve this, we normalize the number of faults to the
fraction of faults that occurred on each node, and then
multiply that fraction by the fraction of CPU time the
task has used since the last time task_numa_placement was
invoked.
This way the nodes in the active node mask will be the ones
where the tasks from the numa group are most actively running,
and the influence of, e.g., the garbage collector and other
do-little threads is properly minimized.
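As a stand-alone illustration of the weighting, the numbers below are made up,
but they show how a task that faults heavily while using little CPU ends up
with a small contribution to faults_from:

/* Sketch: CPU-use weighting applied to the faults_from samples. */
#include <stdio.h>

static long f_weight(unsigned long long runtime, unsigned long long period,
		     unsigned long faults_from, unsigned long total_faults)
{
	/* fraction of the task's faults on this node, scaled by the fraction
	 * of CPU time the task used; 16384 keeps resolution in integer math */
	return (16384 * runtime * faults_from) / (total_faults * period + 1);
}

int main(void)
{
	/* busy worker: ~100% CPU, all 1000 of its faults on one node */
	printf("worker weight: %ld\n", f_weight(47000, 47000, 1000, 1001));
	/* garbage collector: ~5% CPU, all 8000 of its faults on one node */
	printf("gc weight:     %ld\n", f_weight(2500, 47000, 8000, 8001));
	return 0;
}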
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
kernel/sched/fair.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea873b6..203877d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1426,6 +1426,8 @@ static void task_numa_placement(struct task_struct *p)
int seq, nid, max_nid = -1, max_group_nid = -1;
unsigned long max_faults = 0, max_group_faults = 0;
unsigned long fault_types[2] = { 0, 0 };
+ unsigned long total_faults;
+ u64 runtime, period;
spinlock_t *group_lock = NULL;
seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -1434,6 +1436,11 @@ static void task_numa_placement(struct task_struct *p)
p->numa_scan_seq = seq;
p->numa_scan_period_max = task_scan_max(p);
+ total_faults = p->numa_faults_locality[0] +
+ p->numa_faults_locality[1] + 1;
+ runtime = p->se.avg.runnable_avg_sum;
+ period = p->se.avg.runnable_avg_period;
+
/* If the task is part of a group prevent parallel updates to group stats */
if (p->numa_group) {
group_lock = &p->numa_group->lock;
@@ -1446,7 +1453,7 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;
for (priv = 0; priv < 2; priv++) {
- long diff, f_diff;
+ long diff, f_diff, f_weight;
i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
@@ -1458,8 +1465,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;
+ /*
+ * Normalize the faults_from, so all tasks in a group
+ * count according to CPU use, instead of by the raw
+ * number of faults. Tasks with little runtime have
+ * little over-all impact on throughput, and thus their
+ * faults are less important.
+ */
+ f_weight = (16384 * runtime *
+ p->numa_faults_from_buffer[i]) /
+ (total_faults * period + 1);
p->numa_faults_from[i] >>= 1;
- p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+ p->numa_faults_from[i] += f_weight;
p->numa_faults_from_buffer[i] = 0;
faults += p->numa_faults[i];
--
1.8.4.2
* [PATCH 6/6] numa,sched: do statistics calculation using local variables only
2014-01-20 19:21 [PATCH v3 0/6] pseudo-interleaving for automatic NUMA balancing riel
` (4 preceding siblings ...)
2014-01-20 19:21 ` [PATCH 5/6] numa,sched: normalize faults_from stats and weigh by CPU use riel
@ 2014-01-20 19:21 ` riel
2014-01-21 16:15 ` Mel Gorman
5 siblings, 1 reply; 18+ messages in thread
From: riel @ 2014-01-20 19:21 UTC
To: linux-kernel; +Cc: linux-mm, peterz, mgorman, mingo, chegu_vinod
From: Rik van Riel <riel@redhat.com>
The current code in task_numa_placement calculates the difference
between the old and the new value, but also temporarily stores half
of the old value in the per-process variables.
The NUMA balancing code looks at those per-process variables, and
having other tasks temporarily see halved statistics could lead to
unwanted numa migrations. This can be avoided by doing all the math
in local variables.
This change also simplifies the code a little.
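A tiny stand-alone sketch of the difference, with made-up numbers:

/* Sketch: compute the decayed update in a local and apply it in one step,
 * so concurrent readers never see the intermediate halved value. */
#include <stdio.h>

int main(void)
{
	unsigned long numa_faults = 100;	/* value other tasks may read */
	unsigned long buffer = 40;		/* faults since the last scan */
	long diff;

	/* old scheme: numa_faults >>= 1; numa_faults += buffer;
	 * the transient value 50 was briefly visible to other tasks */

	/* new scheme: a single store, same end result */
	diff = (long)buffer - (long)(numa_faults / 2);	/* 40 - 50 = -10 */
	numa_faults += diff;				/* 100 - 10 = 90 */

	printf("numa_faults = %lu, diff = %ld\n", numa_faults, diff);
	return 0;
}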
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
kernel/sched/fair.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 203877d..ad30d14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1456,12 +1456,9 @@ static void task_numa_placement(struct task_struct *p)
long diff, f_diff, f_weight;
i = task_faults_idx(nid, priv);
- diff = -p->numa_faults[i];
- f_diff = -p->numa_faults_from[i];
/* Decay existing window, copy faults since last scan */
- p->numa_faults[i] >>= 1;
- p->numa_faults[i] += p->numa_faults_buffer[i];
+ diff = p->numa_faults_buffer[i] - p->numa_faults[i] / 2;
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;
@@ -1475,13 +1472,12 @@ static void task_numa_placement(struct task_struct *p)
f_weight = (16384 * runtime *
p->numa_faults_from_buffer[i]) /
(total_faults * period + 1);
- p->numa_faults_from[i] >>= 1;
- p->numa_faults_from[i] += f_weight;
+ f_diff = f_weight - p->numa_faults_from[i] / 2;
p->numa_faults_from_buffer[i] = 0;
+ p->numa_faults[i] += diff;
+ p->numa_faults_from[i] += f_diff;
faults += p->numa_faults[i];
- diff += p->numa_faults[i];
- f_diff += p->numa_faults_from[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
--
1.8.4.2
* Re: [PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred
2014-01-20 19:21 ` [PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred riel
@ 2014-01-21 11:52 ` Mel Gorman
0 siblings, 0 replies; 18+ messages in thread
From: Mel Gorman @ 2014-01-21 11:52 UTC
To: riel; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On Mon, Jan 20, 2014 at 02:21:02PM -0500, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
>
> Excessive migration of pages can hurt the performance of workloads
> that span multiple NUMA nodes. However, it turns out that the
> p->numa_migrate_deferred knob is a really big hammer, which does
> reduce migration rates, but does not actually help performance.
>
> Now that the second stage of the automatic numa balancing code
> has stabilized, it is time to replace the simplistic migration
> deferral code with something smarter.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Chegu Vinod <chegu_vinod@hp.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
When I added a tracepoint to track deferred migration I was surprised how
often it triggered for some workloads. I agree that we want to do something
better because it was a crutch albeit a necessary one at the time.
Note that the knob was not about performance as such, it was about avoiding
worst-case behaviour. We should keep an eye out for bugs that look like
excessive migration on workloads that are not converging. Reintroducing this
hammer would be a last resort for working around the problem.
Finally, the sysctl is documented in Documentation/sysctl/kernel.txt and
this patch should also remove it.
Functionally, the patch looks fine and it's time to reinvestigate whether
the deferral is still necessary, so assuming the documentation gets removed
as well:
Acked-by: Mel Gorman <mgorman@suse.de>
--
Mel Gorman
SUSE Labs
* Re: [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered
2014-01-20 19:21 ` [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered riel
@ 2014-01-21 12:21 ` Mel Gorman
2014-01-21 22:26 ` Rik van Riel
0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2014-01-21 12:21 UTC
To: riel; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On Mon, Jan 20, 2014 at 02:21:03PM -0500, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
>
> Track which nodes NUMA faults are triggered from, in other words
> the CPUs on which the NUMA faults happened. This uses a similar
> mechanism to what is used to track the memory involved in numa faults.
>
> The next patches use this to build up a bitmap of which nodes a
> workload is actively running on.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Chegu Vinod <chegu_vinod@hp.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> include/linux/sched.h | 10 ++++++++--
> kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
> 2 files changed, 31 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 97efba4..a9f7f05 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1492,6 +1492,14 @@ struct task_struct {
> unsigned long *numa_faults_buffer;
>
> /*
> + * Track the nodes where faults are incurred. This is not very
> + * interesting on a per-task basis, but it help with smarter
> + * numa memory placement for groups of processes.
> + */
> + unsigned long *numa_faults_from;
> + unsigned long *numa_faults_from_buffer;
> +
As an aside I wonder if we can derive any useful metric from this. One
potential sanity check would be the number of nodes that a task is incurring
faults on. It would be best if the highest number of faults were recorded
on the node the task is currently running on. After that we either want
to minimise the number of nodes trapping faults or interleave between
all available nodes to avoid applying too much memory pressure on any
one node. For interleaving to always be the best option we would have to
assume that all nodes are equal distance but that would be a reasonable
assumption to start with.
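A rough userspace sketch of that sanity metric, with made-up per-node fault
counts, just to show the shape of what could be reported:

/* Sketch: count the nodes a task faults on and check the busiest one. */
#include <stdio.h>

#define NR_NODES 4

int main(void)
{
	/* per-node hinting faults for one task (example numbers) */
	unsigned long faults[NR_NODES] = { 120, 40, 0, 15 };
	int task_node = 0;		/* node the task currently runs on */
	int nid, nodes_faulting = 0, busiest = 0;

	for (nid = 0; nid < NR_NODES; nid++) {
		if (faults[nid])
			nodes_faulting++;
		if (faults[nid] > faults[busiest])
			busiest = nid;
	}

	printf("faulting on %d nodes, busiest node %d, running on node %d (%s)\n",
	       nodes_faulting, busiest, task_node,
	       busiest == task_node ? "good" : "suspect");
	return 0;
}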
> + /*
> * numa_faults_locality tracks if faults recorded during the last
> * scan window were remote/local. The task scan period is adapted
> * based on the locality of the faults with different weights
> @@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
> extern pid_t task_numa_group_id(struct task_struct *p);
> extern void set_numabalancing_state(bool enabled);
> extern void task_numa_free(struct task_struct *p);
> -
> -extern unsigned int sysctl_numa_balancing_migrate_deferred;
> #else
> static inline void task_numa_fault(int last_node, int node, int pages,
> int flags)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 41e2176..1945ddc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -886,6 +886,7 @@ struct numa_group {
>
> struct rcu_head rcu;
> unsigned long total_faults;
> + unsigned long *faults_from;
> unsigned long faults[0];
> };
>
faults_from is not ambiguous but it does not tell us a lot of information
either. If I am reading this right then fundamentally this patch means we
are tracking two pieces of information
1. The node the data resided on at the time of the hinting fault (numa_faults)
2. The node the accessing task was residing on at the time of the fault (faults_from)
We should be able to have names that reflect that. How about
memory_faults_locality and cpu_faults_locality with a preparation patch
doing a simple rename for numa_faults and this patch adding
cpu_faults_locality?
It will be tough to be consistent about this, but the clearer we are about
making decisions based on task location vs data location, the happier we
will be in the long run.
> @@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p)
> int priv, i;
>
> for (priv = 0; priv < 2; priv++) {
> - long diff;
> + long diff, f_diff;
>
> i = task_faults_idx(nid, priv);
> diff = -p->numa_faults[i];
> + f_diff = -p->numa_faults_from[i];
>
> /* Decay existing window, copy faults since last scan */
> p->numa_faults[i] >>= 1;
> @@ -1383,12 +1385,18 @@ static void task_numa_placement(struct task_struct *p)
> fault_types[priv] += p->numa_faults_buffer[i];
> p->numa_faults_buffer[i] = 0;
>
> + p->numa_faults_from[i] >>= 1;
> + p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
> + p->numa_faults_from_buffer[i] = 0;
> +
> faults += p->numa_faults[i];
> diff += p->numa_faults[i];
> + f_diff += p->numa_faults_from[i];
> p->total_numa_faults += diff;
> if (p->numa_group) {
> /* safe because we can only change our own group */
> p->numa_group->faults[i] += diff;
> + p->numa_group->faults_from[i] += f_diff;
> p->numa_group->total_faults += diff;
> group_faults += p->numa_group->faults[i];
> }
> @@ -1457,7 +1465,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>
> if (unlikely(!p->numa_group)) {
> unsigned int size = sizeof(struct numa_group) +
> - 2*nr_node_ids*sizeof(unsigned long);
> + 4*nr_node_ids*sizeof(unsigned long);
>
Should we convert that magic number to a define? NR_NUMA_HINT_FAULT_STATS?
> grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
> if (!grp)
> @@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> spin_lock_init(&grp->lock);
> INIT_LIST_HEAD(&grp->task_list);
> grp->gid = p->pid;
> + /* Second half of the array tracks where faults come from */
> + grp->faults_from = grp->faults + 2 * nr_node_ids;
>
We have accessors when we overload arrays like this such as task_faults_idx
for example. We should have similar accessors for this in case those
offsets ever change.
> - for (i = 0; i < 2*nr_node_ids; i++)
> + for (i = 0; i < 4*nr_node_ids; i++)
> grp->faults[i] = p->numa_faults[i];
>
This is a little obscure now. Functionally it is copying both numa_faults and
numa_faults_from, but a casual reader will get confused. Minimally
it needs a comment explaining what is being copied here. Also, why did we
not use memcpy?
> grp->total_faults = p->total_numa_faults;
> @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>
> double_lock(&my_grp->lock, &grp->lock);
>
> - for (i = 0; i < 2*nr_node_ids; i++) {
> + for (i = 0; i < 4*nr_node_ids; i++) {
> my_grp->faults[i] -= p->numa_faults[i];
> grp->faults[i] += p->numa_faults[i];
> }
The same obscure trick is used throughout and I'm not sure how
maintainable that will be. Would it be better to be explicit about this?
/* NUMA hinting faults may be either shared or private faults */
#define NR_NUMA_HINT_FAULT_TYPES 2
/* Track shared and private faults
#define NR_NUMA_HINT_FAULT_STATS (NR_NUMA_HINT_FAULT_TYPES*2)
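With something like that, the sizing and the copy loops could read roughly as
below (a sketch of the suggestion only, not the actual patch):

/* Sketch: using the proposed defines instead of bare 2s and 4s. */
#include <stdio.h>
#include <stdlib.h>

/* NUMA hinting faults may be either shared or private faults */
#define NR_NUMA_HINT_FAULT_TYPES 2
/* Each fault is tracked both by memory node and by faulting CPU's node */
#define NR_NUMA_HINT_FAULT_STATS (NR_NUMA_HINT_FAULT_TYPES * 2)

int main(void)
{
	int nr_node_ids = 8;	/* example */
	int nr_stats = NR_NUMA_HINT_FAULT_STATS * nr_node_ids;
	unsigned long *task_faults = calloc(nr_stats, sizeof(*task_faults));
	unsigned long *group_faults = calloc(nr_stats, sizeof(*group_faults));
	int i;

	if (!task_faults || !group_faults)
		return 1;

	/* replaces the opaque "for (i = 0; i < 4*nr_node_ids; i++)" loops */
	for (i = 0; i < nr_stats; i++)
		group_faults[i] += task_faults[i];

	printf("%d fault counters per task\n", nr_stats);
	free(task_faults);
	free(group_faults);
	return 0;
}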
> @@ -1558,7 +1568,7 @@ void task_numa_free(struct task_struct *p)
>
> if (grp) {
> spin_lock(&grp->lock);
> - for (i = 0; i < 2*nr_node_ids; i++)
> + for (i = 0; i < 4*nr_node_ids; i++)
> grp->faults[i] -= p->numa_faults[i];
> grp->total_faults -= p->total_numa_faults;
>
> @@ -1571,6 +1581,8 @@ void task_numa_free(struct task_struct *p)
>
> p->numa_faults = NULL;
> p->numa_faults_buffer = NULL;
> + p->numa_faults_from = NULL;
> + p->numa_faults_from_buffer = NULL;
> kfree(numa_faults);
> }
>
> @@ -1581,6 +1593,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
> {
> struct task_struct *p = current;
> bool migrated = flags & TNF_MIGRATED;
> + int this_node = task_node(current);
> int priv;
>
> if (!numabalancing_enabled)
> @@ -1596,7 +1609,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
>
> /* Allocate buffer to track faults on a per-node basis */
> if (unlikely(!p->numa_faults)) {
> - int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
> + int size = sizeof(*p->numa_faults) * 4 * nr_node_ids;
>
> /* numa_faults and numa_faults_buffer share the allocation */
> p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
> @@ -1604,7 +1617,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
> return;
>
> BUG_ON(p->numa_faults_buffer);
> - p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
> + p->numa_faults_from = p->numa_faults + (2 * nr_node_ids);
> + p->numa_faults_buffer = p->numa_faults + (4 * nr_node_ids);
> + p->numa_faults_from_buffer = p->numa_faults + (6 * nr_node_ids);
> p->total_numa_faults = 0;
> memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
> }
> @@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
> p->numa_pages_migrated += pages;
>
> p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
> + p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
> p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
this_node and node is similarly ambiguous in terms of name. Rename of
data_node and cpu_node would have been clearer.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics
2014-01-20 19:21 ` [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics riel
@ 2014-01-21 14:19 ` Mel Gorman
2014-01-21 15:09 ` Rik van Riel
0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2014-01-21 14:19 UTC
To: riel; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On Mon, Jan 20, 2014 at 02:21:04PM -0500, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
>
> The faults_from statistics are used to maintain an active_nodes nodemask
> per numa_group. This allows us to be smarter about when to do numa migrations.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Chegu Vinod <chegu_vinod@hp.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 41 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1945ddc..ea8b2ae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -885,6 +885,7 @@ struct numa_group {
> struct list_head task_list;
>
> struct rcu_head rcu;
> + nodemask_t active_nodes;
> unsigned long total_faults;
> unsigned long *faults_from;
> unsigned long faults[0];
It's not a concern for now but in the land of unicorns and ponies we'll
relook at the size of some of these structures and see what can be
optimised.
Similar to my comment on faults_from I think we could potentially evaluate
the fitness of the automatic NUMA balancing feature by looking at the
weight of the active_nodes for a numa_group. If
bitmask_weight(active_nodes) == nr_online_nodes
for all numa_groups in the system then I think it would be an indication
that the algorithm has collapsed.
It's not a comment on the patch itself. We could just do with more
metrics that help analyse this thing when debugging problems.
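Something as simple as the check below (userspace sketch, a bitmask standing
in for the nodemask), run over all numa_groups, would probably be enough:

/* Sketch of the suggested collapse indicator for one numa_group. */
#include <stdio.h>

int main(void)
{
	unsigned long active_nodes = 0xf;	/* example group mask */
	int nr_online_nodes = 4;		/* example machine */

	if (__builtin_popcountl(active_nodes) == nr_online_nodes)
		printf("group active on every node: possible collapse\n");
	return 0;
}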
> @@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
> }
>
> /*
> + * Find the nodes on which the workload is actively running. We do this by
hmm, it's not the workload though, it's a single NUMA group and a workload
may consist of multiple NUMA groups. For example, in an ideal world and
a JVM-based workload the application threads and the GC threads would be
in different NUMA groups.
The signature is even more misleading because it implies that
the function is concerned with tasks. Pass in p->numa_group instead.
> + * tracking the nodes from which NUMA hinting faults are triggered. This can
> + * be different from the set of nodes where the workload's memory is currently
> + * located.
> + *
> + * The bitmask is used to make smarter decisions on when to do NUMA page
> + * migrations, To prevent flip-flopping, and excessive page migrations, nodes
> + * are added when they cause over 6/16 of the maximum number of faults, but
> + * only removed when they drop below 3/16.
> + */
Looking at the values, I'm guessing you did it this way to use shifts
instead of divides. That's fine, but how did you arrive at those values?
Experimentally, or did they just feel reasonable?
> +static void update_numa_active_node_mask(struct task_struct *p)
> +{
> + unsigned long faults, max_faults = 0;
> + struct numa_group *numa_group = p->numa_group;
> + int nid;
> +
> + for_each_online_node(nid) {
> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> + numa_group->faults_from[task_faults_idx(nid, 1)];
task_faults() implements a helper for p->numa_faults equivalent of this.
Just as with the other renaming, it would not hurt to rename task_faults()
to something like task_faults_memory() and add a task_faults_cpu() for
this. The objective again is to be clear about whether we care about CPU
or memory locality information.
> + if (faults > max_faults)
> + max_faults = faults;
> + }
> +
> + for_each_online_node(nid) {
> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> + numa_group->faults_from[task_faults_idx(nid, 1)];
group_faults would need similar adjustment.
> + if (!node_isset(nid, numa_group->active_nodes)) {
> + if (faults > max_faults * 6 / 16)
> + node_set(nid, numa_group->active_nodes);
> + } else if (faults < max_faults * 3 / 16)
> + node_clear(nid, numa_group->active_nodes);
> + }
> +}
> +
I think there is a subtle problem here
/*
* Be mindful that this is subject to sampling error. As we only have
* data on hinting faults active_nodes may miss a heavily referenced
* node due to the references being to a small number of pages. If
* there is a large linear scanner in the same numa group as a
* task operating on a small amount of memory then the latter task
* may be ignored.
*/
I have no suggestion on how to handle this because we're vulnerable to
sampling errors in a number of places but it does not hurt to be reminded
of that in a few places.
> +/*
> * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
> * increments. The more local the fault statistics are, the higher the scan
> * period will be for the next scan window. If local/remote ratio is below
> @@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
> update_task_scan_period(p, fault_types[0], fault_types[1]);
>
> if (p->numa_group) {
> + update_numa_active_node_mask(p);
We are updating that thing once per scan window, that's fine. There is
potentially a wee issue though. If all the tasks in the group are threads
then they share p->mm->numa_scan_seq and only one task does the update
per scan window. If they are different processes then we could be updating
more frequently than necessary.
Functionally it'll be fine, but at a higher cost than necessary. I do not have a
better suggestion right now as superficially a numa_scan_seq per numa_group
would not be a good fit.
If we think of nothing better and the issue is real then we can at least
stick a comment there for future reference.
> /*
> * If the preferred task and group nids are different,
> * iterate over the nodes again to find the best place.
> @@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> /* Second half of the array tracks where faults come from */
> grp->faults_from = grp->faults + 2 * nr_node_ids;
>
> + node_set(task_node(current), grp->active_nodes);
> +
> for (i = 0; i < 4*nr_node_ids; i++)
> grp->faults[i] = p->numa_faults[i];
>
> @@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> my_grp->nr_tasks--;
> grp->nr_tasks++;
>
> + update_numa_active_node_mask(p);
> +
This may be subtle enough to deserve a comment
/* Tasks have joined/left groups and the active_mask is no longer valid */
If we left a group, we update our new group. Is the old group now out of
date and in need of updating too? If so, then we should update both and
only update the old group if it still has tasks in it.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 4/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations
2014-01-20 19:21 ` [PATCH 4/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations riel
@ 2014-01-21 15:08 ` Mel Gorman
0 siblings, 0 replies; 18+ messages in thread
From: Mel Gorman @ 2014-01-21 15:08 UTC
To: riel; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On Mon, Jan 20, 2014 at 02:21:05PM -0500, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
>
> Use the active_nodes nodemask to make smarter decisions on NUMA migrations.
>
> In order to maximize performance of workloads that do not fit in one NUMA
> node, we want to satisfy the following criteria:
> 1) keep private memory local to each thread
> 2) avoid excessive NUMA migration of pages
> 3) distribute shared memory across the active nodes, to
> maximize memory bandwidth available to the workload
>
> This patch accomplishes that by implementing the following policy for
> NUMA migrations:
> 1) always migrate on a private fault
Makes sense
> 2) never migrate to a node that is not in the set of active nodes
> for the numa_group
This will work out in every case *except* where we miss an active node
because the task running there is faulting a very small number of pages.
Worth recording that in case we ever see a bug that could be explained
by it.
> 3) always migrate from a node outside of the set of active nodes,
> to a node that is in that set
Clever
A *potential* consequence of this is that we may see large amounts of
migration traffic if we ever implement something that causes tasks to
enter/leave numa groups frequently.
> 4) within the set of active nodes in the numa_group, only migrate
> from a node with more NUMA page faults, to a node with fewer
> NUMA page faults, with a 25% margin to avoid ping-ponging
>
Of the four, this is the highest risk again because we might miss tasks
in an active node due to them accessing a small number of pages.
Not suggesting you change the policy at this point, we should just keep
an eye out for it. It could be argued that a task accessing a small amount
of memory on a large NUMA machine is not a task we care about anyway :/
> This results in most pages of a workload ending up on the actively
> used nodes, with reduced ping-ponging of pages between those nodes.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Chegu Vinod <chegu_vinod@hp.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> include/linux/sched.h | 7 +++++++
> kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++
> mm/mempolicy.c | 3 +++
> 3 files changed, 47 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a9f7f05..0af6c1a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1602,6 +1602,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
> extern pid_t task_numa_group_id(struct task_struct *p);
> extern void set_numabalancing_state(bool enabled);
> extern void task_numa_free(struct task_struct *p);
> +extern bool should_numa_migrate(struct task_struct *p, int last_cpupid,
> + int src_nid, int dst_nid);
> #else
> static inline void task_numa_fault(int last_node, int node, int pages,
> int flags)
> @@ -1617,6 +1619,11 @@ static inline void set_numabalancing_state(bool enabled)
> static inline void task_numa_free(struct task_struct *p)
> {
> }
> +static inline bool should_numa_migrate(struct task_struct *p, int last_cpupid,
> + int src_nid, int dst_nid)
> +{
> + return true;
> +}
> #endif
>
> static inline struct pid *task_pid(struct task_struct *task)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ea8b2ae..ea873b6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -948,6 +948,43 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
> return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
> }
>
> +bool should_numa_migrate(struct task_struct *p, int last_cpupid,
> + int src_nid, int dst_nid)
> +{
In light of the memory/data distinction, how about
should_numa_migrate_memory?
> + struct numa_group *ng = p->numa_group;
> +
> + /* Always allow migrate on private faults */
> + if (cpupid_match_pid(p, last_cpupid))
> + return true;
> +
We now have the two-stage filter detection in mpol_misplaced and the rest
of the migration decision logic here. Keep them in the same place? It
might necessitate passing in the page being faulted as well but then the
return value will be clearer
/*
* This function returns true if the @page is misplaced and should be
* migrated.
*/
It may need a name change as well if you decide to move everything into
this function including the call to page_cpupid_xchg_last
> + /* A shared fault, but p->numa_group has not been set up yet. */
> + if (!ng)
> + return true;
> +
> + /*
> + * Do not migrate if the destination is not a node that
> + * is actively used by this numa group.
> + */
> + if (!node_isset(dst_nid, ng->active_nodes))
> + return false;
> +
If I'm right about the sampling error potentially missing tasks accessing a
small number of pages then a reminder about the sampling error would not hurt
> + /*
> + * Source is a node that is not actively used by this
> + * numa group, while the destination is. Migrate.
> + */
> + if (!node_isset(src_nid, ng->active_nodes))
> + return true;
> +
> + /*
> + * Both source and destination are nodes in active
> + * use by this numa group. Maximize memory bandwidth
> + * by migrating from more heavily used groups, to less
> + * heavily used ones, spreading the load around.
> + * Use a 1/4 hysteresis to avoid spurious page movement.
> + */
> + return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4);
> +}
I worried initially about how this would interact with the scheduler
placement which is concerned with the number of faults per node. I think
it's ok though because it should flatten out and the interleaved nodes
should not look like good scheduling candidates. Something to keep in
mind in the future.
I do not see why this is a 1/4 hysteresis though. It looks more like a
threshold based on the number of faults than anything to do with
hysteresis.
Finally, as a micro-optimisation, something like this is approximately the
same as three-quarters but avoids the divide. The approximation will always
be a greater value, but the difference is marginal:
src_group_faults = group_faults(p, src_nid);
src_group_faults -= src_group_faults >> 2;
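A quick brute-force check of that claim (the shift form only ever rounds up,
by at most one):

/* x - (x >> 2) vs x * 3 / 4: the shift form is never smaller, and the
 * difference is at most 1. */
#include <assert.h>
#include <stdio.h>

int main(void)
{
	unsigned long x;

	for (x = 0; x < 1000000; x++) {
		unsigned long a = x - (x >> 2);
		unsigned long b = x * 3 / 4;

		assert(a >= b && a - b <= 1);
	}

	printf("x=10: shift form %lu, divide form %lu\n",
	       10UL - (10UL >> 2), 10UL * 3 / 4);	/* 8 vs 7 */
	return 0;
}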
> +
> static unsigned long weighted_cpuload(const int cpu);
> static unsigned long source_load(int cpu, int type);
> static unsigned long target_load(int cpu, int type);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 052abac..050962b 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2405,6 +2405,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
> if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
> goto out;
> }
> +
> + if (!should_numa_migrate(current, last_cpupid, curnid, polnid))
> + goto out;
> }
>
> if (curnid != polnid)
> --
> 1.8.4.2
>
--
Mel Gorman
SUSE Labs
* Re: [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics
2014-01-21 14:19 ` Mel Gorman
@ 2014-01-21 15:09 ` Rik van Riel
2014-01-21 15:41 ` Mel Gorman
0 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2014-01-21 15:09 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On 01/21/2014 09:19 AM, Mel Gorman wrote:
> On Mon, Jan 20, 2014 at 02:21:04PM -0500, riel@redhat.com wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1945ddc..ea8b2ae 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -885,6 +885,7 @@ struct numa_group {
>> struct list_head task_list;
>>
>> struct rcu_head rcu;
>> + nodemask_t active_nodes;
>> unsigned long total_faults;
>> unsigned long *faults_from;
>> unsigned long faults[0];
>
> It's not a concern for now but in the land of unicorns and ponies we'll
> relook at the size of some of these structures and see what can be
> optimised.
Unsigned int should be enough for systems with less than 8TB
of memory per node :)
> Similar to my comment on faults_from I think we could potentially evaluate
> the fitness of the automatic NUMA balancing feature by looking at the
> weight of the active_nodes for a numa_group. If
> bitmask_weight(active_nodes) == nr_online_nodes
> for all numa_groups in the system then I think it would be an indication
> that the algorithm has collapsed.
If the system runs one very large workload, I would expect the
scheduler to spread that workload across all nodes.
In that situation, it is perfectly legitimate for all nodes
to end up being marked as active nodes, and for the system
to try to distribute the workload's memory somewhat evenly
between them.
> It's not a comment on the patch itself. We could just do with more
> metrics that help analyse this thing when debugging problems.
>
>> @@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
>> }
>>
>> /*
>> + * Find the nodes on which the workload is actively running. We do this by
>
> hmm, it's not the workload though, it's a single NUMA group and a workload
> may consist of multiple NUMA groups. For example, in an ideal world and
> a JVM-based workload the application threads and the GC threads would be
> in different NUMA groups.
Why should they be in a different numa group?
The rest of the series contains patches to make sure they
should be just fine together in the same group...
> The signature is even more misleading because the signature implies that
> the function is concerned with tasks. Pass in p->numa_group
Will do.
>> + * tracking the nodes from which NUMA hinting faults are triggered. This can
>> + * be different from the set of nodes where the workload's memory is currently
>> + * located.
>> + *
>> + * The bitmask is used to make smarter decisions on when to do NUMA page
>> + * migrations, To prevent flip-flopping, and excessive page migrations, nodes
>> + * are added when they cause over 6/16 of the maximum number of faults, but
>> + * only removed when they drop below 3/16.
>> + */
>
> Looking at the values, I'm guessing you did it this way to use shifts
> instead of divides. That's fine, but how did you arrive at those values?
> Experimentally or just felt reasonable?
Experimentally I got to 20% and 40%. Peter suggested I change it
to 3/16 and 6/16, which appear to give identical performance.
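(For the record: 3/16 = 18.75% and 6/16 = 37.5%, so the power-of-two
denominators stay close to the experimental 20%/40% cut-offs while letting
the compiler turn the divides into shifts.)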
>> +static void update_numa_active_node_mask(struct task_struct *p)
>> +{
>> + unsigned long faults, max_faults = 0;
>> + struct numa_group *numa_group = p->numa_group;
>> + int nid;
>> +
>> + for_each_online_node(nid) {
>> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
>> + numa_group->faults_from[task_faults_idx(nid, 1)];
>
> task_faults() implements a helper for p->numa_faults equivalent of this.
> Just as with the other renaming, it would not hurt to rename task_faults()
> to something like task_faults_memory() and add a task_faults_cpu() for
> this. The objective again is to be clear about whether we care about CPU
> or memory locality information.
Will do.
>> + if (faults > max_faults)
>> + max_faults = faults;
>> + }
>> +
>> + for_each_online_node(nid) {
>> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
>> + numa_group->faults_from[task_faults_idx(nid, 1)];
>
> group_faults would need similar adjustment.
>
>> + if (!node_isset(nid, numa_group->active_nodes)) {
>> + if (faults > max_faults * 6 / 16)
>> + node_set(nid, numa_group->active_nodes);
>> + } else if (faults < max_faults * 3 / 16)
>> + node_clear(nid, numa_group->active_nodes);
>> + }
>> +}
>> +
>
> I think there is a subtle problem here
Can you be more specific about what problem you think the hysteresis
could be causing?
> /*
> * Be mindful that this is subject to sampling error. As we only have
> * data on hinting faults active_nodes may miss a heavily referenced
> * node due to the references being to a small number of pages. If
> * there is a large linear scanner in the same numa group as a
> * task operating on a small amount of memory then the latter task
> * may be ignored.
> */
>
> I have no suggestion on how to handle this
Since the numa_faults_cpu statistics are all about driving
memory-follows-cpu, there actually is a decent way to handle
it. See patch 5 :)
>> +/*
>> * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
>> * increments. The more local the fault statistics are, the higher the scan
>> * period will be for the next scan window. If local/remote ratio is below
>> @@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
>> update_task_scan_period(p, fault_types[0], fault_types[1]);
>>
>> if (p->numa_group) {
>> + update_numa_active_node_mask(p);
>
> We are updating that thing once per scan window, that's fine. There is
> potentially a wee issue though. If all the tasks in the group are threads
> then they share p->mm->numa_scan_seq and only one task does the update
> per scan window. If they are different processes then we could be updating
> more frequently than necessary.
>
> Functionally it'll be fine but higher cost than necessary. I do not have a
> better suggestion right now as superficially a numa_scan_seq per numa_group
> would not be a good fit.
I suspect this cost will be small anyway, compared to the costs
incurred in both the earlier part of task_numa_placement, and
in the code where we may look for a better place to migrate the
task to.
This just iterates over memory we have already touched before
(likely to still be cached), and does some cheap comparisons.
>> /*
>> * If the preferred task and group nids are different,
>> * iterate over the nodes again to find the best place.
>> @@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>> /* Second half of the array tracks where faults come from */
>> grp->faults_from = grp->faults + 2 * nr_node_ids;
>>
>> + node_set(task_node(current), grp->active_nodes);
>> +
>> for (i = 0; i < 4*nr_node_ids; i++)
>> grp->faults[i] = p->numa_faults[i];
>>
>> @@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>> my_grp->nr_tasks--;
>> grp->nr_tasks++;
>>
>> + update_numa_active_node_mask(p);
>> +
>
> This may be subtle enough to deserve a comment
>
> /* Tasks have joined/left groups and the active_mask is no longer valid */
I have added a comment.
> If we left a group, we update our new group. Is the old group now out of
> date and in need of updating too?
The entire old group will join the new group, and the old group
is freed.
--
All rights reversed
* Re: [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics
2014-01-21 15:09 ` Rik van Riel
@ 2014-01-21 15:41 ` Mel Gorman
0 siblings, 0 replies; 18+ messages in thread
From: Mel Gorman @ 2014-01-21 15:41 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On Tue, Jan 21, 2014 at 10:09:14AM -0500, Rik van Riel wrote:
> On 01/21/2014 09:19 AM, Mel Gorman wrote:
> > On Mon, Jan 20, 2014 at 02:21:04PM -0500, riel@redhat.com wrote:
>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 1945ddc..ea8b2ae 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -885,6 +885,7 @@ struct numa_group {
> >> struct list_head task_list;
> >>
> >> struct rcu_head rcu;
> >> + nodemask_t active_nodes;
> >> unsigned long total_faults;
> >> unsigned long *faults_from;
> >> unsigned long faults[0];
> >
> > It's not a concern for now but in the land of unicorns and ponies we'll
> > relook at the size of some of these structures and see what can be
> > optimised.
>
> Unsigned int should be enough for systems with less than 8TB
> of memory per node :)
>
Is it not bigger than that?
typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;
so it depends on the value of NODES_SHIFT? Anyway, not worth getting
into a twist over.
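(FWIW, the common CONFIG_NODES_SHIFT=6 makes that bitmap a single 64-bit
word, while a MAXSMP-style NODES_SHIFT=10 bumps it to 1024 bits, i.e. 128
bytes.)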
> > Similar to my comment on faults_from I think we could potentially evaluate
> > the fitness of the automatic NUMA balancing feature by looking at the
> > weight of the active_nodes for a numa_group. If
> > bitmask_weight(active_nodes) == nr_online_nodes
> > for all numa_groups in the system then I think it would be an indication
> > that the algorithm has collapsed.
>
> If the system runs one very large workload, I would expect the
> scheduler to spread that workload across all nodes.
>
> In that situation, it is perfectly legitimate for all nodes
> to end up being marked as active nodes, and for the system
> to try to distribute the workload's memory somewhat evenly
> between them.
>
In the specific case where the workload is not partitioned and really is
accessing all of memory then sure, it'll be spread throughout the
system. However, if we are looking at a case like multiple JVMs sized to
fit within nodes then the metric would hold.
> > It's not a comment on the patch itself. We could just do with more
> > metrics that help analyse this thing when debugging problems.
> >
> >> @@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
> >> }
> >>
> >> /*
> >> + * Find the nodes on which the workload is actively running. We do this by
> >
> > hmm, it's not the workload though, it's a single NUMA group and a workload
> > may consist of multiple NUMA groups. For example, in an ideal world and
> > a JVM-based workload the application threads and the GC threads would be
> > in different NUMA groups.
>
> Why should they be in a different numa group?
>
It would be ideal for them to be in different groups, so that the hinting
faults incurred by the garbage collector (a linear scan of the address space)
do not affect scheduling and placement decisions based on the numa
group's fault statistics.
> The rest of the series contains patches to make sure they
> should be just fine together in the same group...
>
> > The signature is even more misleading because the signature implies that
> > the function is concerned with tasks. Pass in p->numa_group
>
> Will do.
>
> >> + * tracking the nodes from which NUMA hinting faults are triggered. This can
> >> + * be different from the set of nodes where the workload's memory is currently
> >> + * located.
> >> + *
> >> + * The bitmask is used to make smarter decisions on when to do NUMA page
> >> + * migrations, To prevent flip-flopping, and excessive page migrations, nodes
> >> + * are added when they cause over 6/16 of the maximum number of faults, but
> >> + * only removed when they drop below 3/16.
> >> + */
> >
> > Looking at the values, I'm guessing you did it this way to use shifts
> > instead of divides. That's fine, but how did you arrive at those values?
> > Experimentally or just felt reasonable?
>
> Experimentally I got to 20% and 40%. Peter suggested I change it
> to 3/16 and 6/16, which appear to give identical performance.
>
Cool
> >> +static void update_numa_active_node_mask(struct task_struct *p)
> >> +{
> >> + unsigned long faults, max_faults = 0;
> >> + struct numa_group *numa_group = p->numa_group;
> >> + int nid;
> >> +
> >> + for_each_online_node(nid) {
> >> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> >> + numa_group->faults_from[task_faults_idx(nid, 1)];
> >
> > task_faults() implements a helper for p->numa_faults equivalent of this.
> > Just as with the other renaming, it would not hurt to rename task_faults()
> > to something like task_faults_memory() and add a task_faults_cpu() for
> > this. The objective again is to be clear about whether we care about CPU
> > or memory locality information.
>
> Will do.
>
> >> + if (faults > max_faults)
> >> + max_faults = faults;
> >> + }
> >> +
> >> + for_each_online_node(nid) {
> >> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> >> + numa_group->faults_from[task_faults_idx(nid, 1)];
> >
> > group_faults would need similar adjustment.
> >
> >> + if (!node_isset(nid, numa_group->active_nodes)) {
> >> + if (faults > max_faults * 6 / 16)
> >> + node_set(nid, numa_group->active_nodes);
> >> + } else if (faults < max_faults * 3 / 16)
> >> + node_clear(nid, numa_group->active_nodes);
> >> + }
> >> +}
> >> +
> >
> > I think there is a subtle problem here
>
> Can you be more specific about what problem you think the hysteresis
> could be causing?
>
Let's say
Thread A: Most important thread for performance, accesses small amounts
of memory during each scan window. Let's say it's doing calculations
over a large cache-aware structure of some description.
Thread B: Big stupid linear scanner accessing all of memory for whatever
reason.
Thread B will incur more NUMA hinting faults because it is accessing
idle memory that is unused by Thread A. The fault stats and placement
decisions are then skewed in favour of Thread B because Thread A did not
trap enough hinting faults.
It's a theoretical problem.
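To put rough numbers on it: if Thread B traps, say, 8000 hinting faults
from node 1 in a scan window while Thread A traps 300 from node 0, then
node 0 sits below 4% of max_faults and never clears the 6/16 threshold,
so active_nodes misses the node that matters most for performance.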
> > /*
> > * Be mindful that this is subject to sampling error. As we only have
> > * data on hinting faults active_nodes may miss a heavily referenced
> > * node due to the references being to a small number of pages. If
> > * there is a large linear scanner in the same numa group as a
> > * task operating on a small amount of memory then the latter task
> > * may be ignored.
> > */
> >
> > I have no suggestion on how to handle this
>
> Since the numa_faults_cpu statistics are all about driving
> memory-follows-cpu, there actually is a decent way to handle
> it. See patch 5 :)
>
> >> +/*
> >> * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
> >> * increments. The more local the fault statistics are, the higher the scan
> >> * period will be for the next scan window. If local/remote ratio is below
> >> @@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
> >> update_task_scan_period(p, fault_types[0], fault_types[1]);
> >>
> >> if (p->numa_group) {
> >> + update_numa_active_node_mask(p);
> >
> > We are updating that thing once per scan window, that's fine. There is
> > potentially a wee issue though. If all the tasks in the group are threads
> > then they share p->mm->numa_scan_seq and only one task does the update
> > per scan window. If they are different processes then we could be updating
> > more frequently than necessary.
> >
> > Functionally it'll be fine but higher cost than necessary. I do not have a
> > better suggestion right now as superficially a numa_scan_seq per numa_group
> > would not be a good fit.
>
> I suspect this cost will be small anyway, compared to the costs
> incurred in both the earlier part of task_numa_placement, and
> in the code where we may look for a better place to migrate the
> task to.
>
> This just iterates over memory we have already touched before
> (likely to still be cached), and does some cheap comparisons.
>
Fair enough. It'll show up in profiles if it's a problem anyway.
> >> /*
> >> * If the preferred task and group nids are different,
> >> * iterate over the nodes again to find the best place.
> >> @@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> >> /* Second half of the array tracks where faults come from */
> >> grp->faults_from = grp->faults + 2 * nr_node_ids;
> >>
> >> + node_set(task_node(current), grp->active_nodes);
> >> +
> >> for (i = 0; i < 4*nr_node_ids; i++)
> >> grp->faults[i] = p->numa_faults[i];
> >>
> >> @@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> >> my_grp->nr_tasks--;
> >> grp->nr_tasks++;
> >>
> >> + update_numa_active_node_mask(p);
> >> +
> >
> > This may be subtle enough to deserve a comment
> >
> > /* Tasks have joined/left groups and the active_mask is no longer valid */
>
> I have added a comment.
>
> > If we left a group, we update our new group. Is the old group now out of
> > date and in need of updating too?
>
> The entire old group will join the new group, and the old group
> is freed.
>
We reference count the old group so that it only gets freed when the
last task leaves it. If the old group was guaranteed to be destroyed
there would be no need to do stuff like
list_move(&p->numa_entry, &grp->task_list);
my_grp->total_faults -= p->total_numa_faults;
my_grp->nr_tasks--;
All that reads as "a single task is moving group" and not the entire
old group joins the new group. I expected that the old group was only
guaranteed to be destroyed in the case where we had just allocated it
because p->numa_group was NULL when task_numa_group was called.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 5/6] numa,sched: normalize faults_from stats and weigh by CPU use
2014-01-20 19:21 ` [PATCH 5/6] numa,sched: normalize faults_from stats and weigh by CPU use riel
@ 2014-01-21 15:56 ` Mel Gorman
2014-01-21 21:05 ` Rik van Riel
0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2014-01-21 15:56 UTC (permalink / raw)
To: riel; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On Mon, Jan 20, 2014 at 02:21:06PM -0500, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
>
> The tracepoint has made it abundantly clear that the naive
> implementation of the faults_from code has issues.
>
> Specifically, the garbage collector in some workloads will
> access orders of magnitudes more memory than the threads
> that do all the active work. This resulted in the node with
> the garbage collector being marked the only active node in
> the group.
>
Maybe I should have read this patch before getting into a twist about the
earlier patches in the series and the treatment of active_mask :(. On the
plus side, even without reading the code I can still see the motivation
for this paragraph.
> This issue is avoided if we weigh the statistics by CPU use
> of each task in the numa group, instead of by how many faults
> each thread has occurred.
>
Bah, yes. Because in my earlier review I was worried about the faults
being missed. If the fault stats are scaled by the CPU statistics then it
is a very rough proxy measure for how heavily a particular node is being
referenced by a process.
> To achieve this, we normalize the number of faults to the
> fraction of faults that occurred on each node, and then
> multiply that fraction by the fraction of CPU time the
> task has used since the last time task_numa_placement was
> invoked.
>
> This way the nodes in the active node mask will be the ones
> where the tasks from the numa group are most actively running,
> and the influence of eg. the garbage collector and other
> do-little threads is properly minimized.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Chegu Vinod <chegu_vinod@hp.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> kernel/sched/fair.c | 21 +++++++++++++++++++--
> 1 file changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ea873b6..203877d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1426,6 +1426,8 @@ static void task_numa_placement(struct task_struct *p)
> int seq, nid, max_nid = -1, max_group_nid = -1;
> unsigned long max_faults = 0, max_group_faults = 0;
> unsigned long fault_types[2] = { 0, 0 };
> + unsigned long total_faults;
> + u64 runtime, period;
> spinlock_t *group_lock = NULL;
>
> seq = ACCESS_ONCE(p->mm->numa_scan_seq);
> @@ -1434,6 +1436,11 @@ static void task_numa_placement(struct task_struct *p)
> p->numa_scan_seq = seq;
> p->numa_scan_period_max = task_scan_max(p);
>
> + total_faults = p->numa_faults_locality[0] +
> + p->numa_faults_locality[1] + 1;
Depending on how you reacted to the review of other patches this may or
may not have a helper now.
> + runtime = p->se.avg.runnable_avg_sum;
> + period = p->se.avg.runnable_avg_period;
> +
Ok, IIRC these stats are a decaying average based on recent history,
so heavy activity followed by long periods of idle will not skew
the stats.
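(As a rough sanity check on that: the per-entity averages decay by y every
~1ms with y^32 = 1/2, so runnable time from more than ~100ms ago keeps less
than an eighth of its weight, i.e. runtime/period really does track recent
CPU use.)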
> /* If the task is part of a group prevent parallel updates to group stats */
> if (p->numa_group) {
> group_lock = &p->numa_group->lock;
> @@ -1446,7 +1453,7 @@ static void task_numa_placement(struct task_struct *p)
> int priv, i;
>
> for (priv = 0; priv < 2; priv++) {
> - long diff, f_diff;
> + long diff, f_diff, f_weight;
>
> i = task_faults_idx(nid, priv);
> diff = -p->numa_faults[i];
> @@ -1458,8 +1465,18 @@ static void task_numa_placement(struct task_struct *p)
> fault_types[priv] += p->numa_faults_buffer[i];
> p->numa_faults_buffer[i] = 0;
>
> + /*
> + * Normalize the faults_from, so all tasks in a group
> + * count according to CPU use, instead of by the raw
> + * number of faults. Tasks with little runtime have
> + * little over-all impact on throughput, and thus their
> + * faults are less important.
> + */
> + f_weight = (16384 * runtime *
> + p->numa_faults_from_buffer[i]) /
> + (total_faults * period + 1);
Why 16384? It looks like a scaling factor to deal with integer approximations
but I'm not 100% sure and I do not see how you arrived at that value.
> p->numa_faults_from[i] >>= 1;
> - p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
> + p->numa_faults_from[i] += f_weight;
> p->numa_faults_from_buffer[i] = 0;
>
numa_faults_from needs a big comment that it's no longer about the
number of faults in it. It's the sum of the group's faults, weighted
by the CPU use of each task.
> faults += p->numa_faults[i];
> --
> 1.8.4.2
>
--
Mel Gorman
SUSE Labs
* Re: [PATCH 6/6] numa,sched: do statistics calculation using local variables only
2014-01-20 19:21 ` [PATCH 6/6] numa,sched: do statistics calculation using local variables only riel
@ 2014-01-21 16:15 ` Mel Gorman
0 siblings, 0 replies; 18+ messages in thread
From: Mel Gorman @ 2014-01-21 16:15 UTC (permalink / raw)
To: riel; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On Mon, Jan 20, 2014 at 02:21:07PM -0500, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
>
> The current code in task_numa_placement calculates the difference
> between the old and the new value, but also temporarily stores half
> of the old value in the per-process variables.
>
> The NUMA balancing code looks at those per-process variables, and
> having other tasks temporarily see halved statistics could lead to
> unwanted numa migrations. This can be avoided by doing all the math
> in local variables.
>
> This change also simplifies the code a little.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Chegu Vinod <chegu_vinod@hp.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
--
Mel Gorman
SUSE Labs
* Re: [PATCH 5/6] numa,sched: normalize faults_from stats and weigh by CPU use
2014-01-21 15:56 ` Mel Gorman
@ 2014-01-21 21:05 ` Rik van Riel
0 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2014-01-21 21:05 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On 01/21/2014 10:56 AM, Mel Gorman wrote:
> On Mon, Jan 20, 2014 at 02:21:06PM -0500, riel@redhat.com wrote:
>> @@ -1434,6 +1436,11 @@ static void task_numa_placement(struct task_struct *p)
>> p->numa_scan_seq = seq;
>> p->numa_scan_period_max = task_scan_max(p);
>>
>> + total_faults = p->numa_faults_locality[0] +
>> + p->numa_faults_locality[1] + 1;
>
> Depending on how you reacted to the review of other patches this may or
> may not have a helper now.
This is a faults "buffer", zeroed quickly after we take these
faults, so we should probably not tempt others by having a helper
function to get these numbers...
>> + runtime = p->se.avg.runnable_avg_sum;
>> + period = p->se.avg.runnable_avg_period;
>> +
>
> Ok, IIRC these stats are a decaying average based on recent history,
> so heavy activity followed by long periods of idle will not skew
> the stats.
Turns out that using a longer time statistic results in a 1% performance
gain, so expect this code to change again in the next version :)
>> @@ -1458,8 +1465,18 @@ static void task_numa_placement(struct task_struct *p)
>> fault_types[priv] += p->numa_faults_buffer[i];
>> p->numa_faults_buffer[i] = 0;
>>
>> + /*
>> + * Normalize the faults_from, so all tasks in a group
>> + * count according to CPU use, instead of by the raw
>> + * number of faults. Tasks with little runtime have
>> + * little over-all impact on throughput, and thus their
>> + * faults are less important.
>> + */
>> + f_weight = (16384 * runtime *
>> + p->numa_faults_from_buffer[i]) /
>> + (total_faults * period + 1);
>
> Why 16384? It looks like a scaling factor to deal with integer approximations
> but I'm not 100% sure and I do not see how you arrived at that value.
Indeed, it is simply a fixed point math scaling factor.
I used 1024 before, but that is kind of a small number when we could
be dealing with a node that has 20% of the accesses, and a task that
used 10% CPU time.
Having the numbers a little larger could help, and certainly should
not hurt, as long as we keep the number small enough to avoid overflows.
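As a worked example with those numbers (10% CPU use, 20% of the faults from
one node): f_weight comes out at roughly 16384 * 0.10 * 0.20 ~= 327, whereas
a 1024 scale factor would leave only ~20, which loses resolution quickly once
the >>= 1 decay is applied each scan window.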
>> p->numa_faults_from[i] >>= 1;
>> - p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
>> + p->numa_faults_from[i] += f_weight;
>> p->numa_faults_from_buffer[i] = 0;
>>
>
> numa_faults_from needs a big comment that it's no longer about the
> number of faults in it. It's the sum of the group's faults, weighted
> by the CPU use of each task.
Agreed.
--
All rights reversed
* Re: [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered
2014-01-21 12:21 ` Mel Gorman
@ 2014-01-21 22:26 ` Rik van Riel
2014-01-24 14:14 ` Mel Gorman
0 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2014-01-21 22:26 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On 01/21/2014 07:21 AM, Mel Gorman wrote:
> On Mon, Jan 20, 2014 at 02:21:03PM -0500, riel@redhat.com wrote:
>> +++ b/include/linux/sched.h
>> @@ -1492,6 +1492,14 @@ struct task_struct {
>> unsigned long *numa_faults_buffer;
>>
>> /*
>> + * Track the nodes where faults are incurred. This is not very
>> + * interesting on a per-task basis, but it help with smarter
>> + * numa memory placement for groups of processes.
>> + */
>> + unsigned long *numa_faults_from;
>> + unsigned long *numa_faults_from_buffer;
>> +
>
> As an aside I wonder if we can derive any useful metric from this
It may provide for a better way to tune the numa scan interval
than the current code, since the "local vs remote" ratio is not
going to provide us much useful info when dealing with a workload
that is spread across multiple numa nodes.
>> grp->total_faults = p->total_numa_faults;
>> @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>>
>> double_lock(&my_grp->lock, &grp->lock);
>>
>> - for (i = 0; i < 2*nr_node_ids; i++) {
>> + for (i = 0; i < 4*nr_node_ids; i++) {
>> my_grp->faults[i] -= p->numa_faults[i];
>> grp->faults[i] += p->numa_faults[i];
>> }
>
> The same obscure trick is used throughout and I'm not sure how
> maintainable that will be. Would it be better to be explicit about this?
I have made a cleanup patch for this, using the defines you
suggested.
>> @@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
>> p->numa_pages_migrated += pages;
>>
>> p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
>> + p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
>> p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
>
> this_node and node are similarly ambiguous in terms of name. Renaming
> them to data_node and cpu_node would have been clearer.
I added a patch in the next version of the series.
Don't want to make the series too large, though :)
--
All rights reversed
* Re: [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered
2014-01-21 22:26 ` Rik van Riel
@ 2014-01-24 14:14 ` Mel Gorman
0 siblings, 0 replies; 18+ messages in thread
From: Mel Gorman @ 2014-01-24 14:14 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On Tue, Jan 21, 2014 at 05:26:39PM -0500, Rik van Riel wrote:
> On 01/21/2014 07:21 AM, Mel Gorman wrote:
> > On Mon, Jan 20, 2014 at 02:21:03PM -0500, riel@redhat.com wrote:
>
> >> +++ b/include/linux/sched.h
> >> @@ -1492,6 +1492,14 @@ struct task_struct {
> >> unsigned long *numa_faults_buffer;
> >>
> >> /*
> >> + * Track the nodes where faults are incurred. This is not very
> >> + * interesting on a per-task basis, but it help with smarter
> >> + * numa memory placement for groups of processes.
> >> + */
> >> + unsigned long *numa_faults_from;
> >> + unsigned long *numa_faults_from_buffer;
> >> +
> >
> > As an aside I wonder if we can derive any useful metric from this
>
> It may provide for a better way to tune the numa scan interval
> than the current code, since the "local vs remote" ratio is not
> going to provide us much useful info when dealing with a workload
> that is spread across multiple numa nodes.
>
Agreed. Local vs Remote handles the easier cases, particularly where the
workload has been configured so that parts of it fit within NUMA nodes
(e.g. multiple JVMs, multiple virtual machines etc), but it's nowhere near
as useful for large single-image workloads spanning the full machine.
I think in this New World Order, for single-instance workloads we would
instead take the balance across all remote nodes into account: if we have
decided to interleave and the remote active nodes are roughly evenly
used, slow the scan rate.
It's not something for this series just yet but I have observed a higher
system CPU usage as a result of this series. It's still far lower than
the overhead we had in the past but this is one potential idea that would
allow us to reduce the system overhead again in the future.
> >> grp->total_faults = p->total_numa_faults;
> >> @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> >>
> >> double_lock(&my_grp->lock, &grp->lock);
> >>
> >> - for (i = 0; i < 2*nr_node_ids; i++) {
> >> + for (i = 0; i < 4*nr_node_ids; i++) {
> >> my_grp->faults[i] -= p->numa_faults[i];
> >> grp->faults[i] += p->numa_faults[i];
> >> }
> >
> > The same obscure trick is used throughout and I'm not sure how
> > maintainable that will be. Would it be better to be explicit about this?
>
> I have made a cleanup patch for this, using the defines you
> suggested.
>
> >> @@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
> >> p->numa_pages_migrated += pages;
> >>
> >> p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
> >> + p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
> >> p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
> >
> > this_node and node are similarly ambiguous in terms of name. Renaming
> > them to data_node and cpu_node would have been clearer.
>
> I added a patch in the next version of the series.
>
> Don't want to make the series too large, though :)
>
Understood, it's a bit of a mouthful already.
--
Mel Gorman
SUSE Labs
end of thread, other threads:[~2014-01-24 14:14 UTC | newest]
Thread overview: 18+ messages
2014-01-20 19:21 [PATCH v3 0/6] pseudo-interleaving for automatic NUMA balancing riel
2014-01-20 19:21 ` [PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred riel
2014-01-21 11:52 ` Mel Gorman
2014-01-20 19:21 ` [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered riel
2014-01-21 12:21 ` Mel Gorman
2014-01-21 22:26 ` Rik van Riel
2014-01-24 14:14 ` Mel Gorman
2014-01-20 19:21 ` [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics riel
2014-01-21 14:19 ` Mel Gorman
2014-01-21 15:09 ` Rik van Riel
2014-01-21 15:41 ` Mel Gorman
2014-01-20 19:21 ` [PATCH 4/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations riel
2014-01-21 15:08 ` Mel Gorman
2014-01-20 19:21 ` [PATCH 5/6] numa,sched: normalize faults_from stats and weigh by CPU use riel
2014-01-21 15:56 ` Mel Gorman
2014-01-21 21:05 ` Rik van Riel
2014-01-20 19:21 ` [PATCH 6/6] numa,sched: do statistics calculation using local variables only riel
2014-01-21 16:15 ` Mel Gorman