From: Mel Gorman <mgorman@suse.de>
To: riel@redhat.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
peterz@infradead.org, mingo@redhat.com, chegu_vinod@hp.com
Subject: Re: [PATCH 5/6] numa,sched: normalize faults_from stats and weigh by CPU use
Date: Tue, 21 Jan 2014 15:56:52 +0000
Message-ID: <20140121155652.GL4963@suse.de>
In-Reply-To: <1390245667-24193-6-git-send-email-riel@redhat.com>
On Mon, Jan 20, 2014 at 02:21:06PM -0500, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
>
> The tracepoint has made it abundantly clear that the naive
> implementation of the faults_from code has issues.
>
> Specifically, the garbage collector in some workloads will
> access orders of magnitude more memory than the threads
> that do all the active work. This resulted in the node with
> the garbage collector being marked the only active node in
> the group.
>
Maybe I should have read this patch before getting into a twist about the
earlier patches in the series and the treatment of active_mask :(. On the
plus side, even without reading the code I can still see the motivation
for this paragraph.
> This issue is avoided if we weigh the statistics by CPU use
> of each task in the numa group, instead of by how many faults
> each thread has incurred.
>
Bah, yes. In my earlier review I was worried about those faults being
missed. If the fault stats are scaled by the CPU statistics then the
result is a very rough proxy measure for how heavily a particular node is
being referenced by a process.
> To achieve this, we normalize the number of faults to the
> fraction of faults that occurred on each node, and then
> multiply that fraction by the fraction of CPU time the
> task has used since the last time task_numa_placement was
> invoked.
>
> This way the nodes in the active node mask will be the ones
> where the tasks from the numa group are most actively running,
> and the influence of eg. the garbage collector and other
> do-little threads is properly minimized.
>
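Spelling the weighting out for my own benefit (a sketch of my reading of
the patch; the standalone helper and its name are mine, not the code):

	/* Hypothetical helper, illustrative only */
	static unsigned long weigh_node_faults(unsigned long node_faults,
					       unsigned long total_faults,
					       u64 runtime, u64 period)
	{
		/*
		 * node_faults / total_faults is the task's share of
		 * faults from this node and runtime / period is its
		 * recent CPU use; 16384 is a fixed-point scale and the
		 * +1 avoids dividing by zero for an idle task.
		 */
		return (16384 * runtime * node_faults) /
		       (total_faults * period + 1);
	}

So a thread that does most of the faulting but very little of the running
contributes almost nothing to a node's weight, which is the point.
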
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Chegu Vinod <chegu_vinod@hp.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> kernel/sched/fair.c | 21 +++++++++++++++++++--
> 1 file changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ea873b6..203877d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1426,6 +1426,8 @@ static void task_numa_placement(struct task_struct *p)
> int seq, nid, max_nid = -1, max_group_nid = -1;
> unsigned long max_faults = 0, max_group_faults = 0;
> unsigned long fault_types[2] = { 0, 0 };
> + unsigned long total_faults;
> + u64 runtime, period;
> spinlock_t *group_lock = NULL;
>
> seq = ACCESS_ONCE(p->mm->numa_scan_seq);
> @@ -1434,6 +1436,11 @@ static void task_numa_placement(struct task_struct *p)
> p->numa_scan_seq = seq;
> p->numa_scan_period_max = task_scan_max(p);
>
> + total_faults = p->numa_faults_locality[0] +
> + p->numa_faults_locality[1] + 1;
Depending on how you reacted to the review of other patches this may or
may not have a helper now.
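Something like this is what I had in mind, for what it's worth (purely
illustrative, name and placement are my own):

	static inline unsigned long task_faults_total(struct task_struct *p)
	{
		/* +1 so callers can divide by the result safely */
		return p->numa_faults_locality[0] +
		       p->numa_faults_locality[1] + 1;
	}
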
> + runtime = p->se.avg.runnable_avg_sum;
> + period = p->se.avg.runnable_avg_period;
> +
Ok, IIRC these stats are a decaying average of recent history, so heavy
activity followed by long periods of idle will not skew the stats.
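My (possibly fuzzy) recollection of how that tracking works: runnable time
is accumulated in roughly 1ms segments and older segments are decayed
geometrically, something like

	sum = seg_0 + y * seg_1 + y^2 * seg_2 + ...	where y^32 ~= 0.5

so a burst of activity followed by a long idle period fades out of
runtime/period within a few hundred milliseconds.
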
> /* If the task is part of a group prevent parallel updates to group stats */
> if (p->numa_group) {
> group_lock = &p->numa_group->lock;
> @@ -1446,7 +1453,7 @@ static void task_numa_placement(struct task_struct *p)
> int priv, i;
>
> for (priv = 0; priv < 2; priv++) {
> - long diff, f_diff;
> + long diff, f_diff, f_weight;
>
> i = task_faults_idx(nid, priv);
> diff = -p->numa_faults[i];
> @@ -1458,8 +1465,18 @@ static void task_numa_placement(struct task_struct *p)
> fault_types[priv] += p->numa_faults_buffer[i];
> p->numa_faults_buffer[i] = 0;
>
> + /*
> + * Normalize the faults_from, so all tasks in a group
> + * count according to CPU use, instead of by the raw
> + * number of faults. Tasks with little runtime have
> + * little over-all impact on throughput, and thus their
> + * faults are less important.
> + */
> + f_weight = (16384 * runtime *
> + p->numa_faults_from_buffer[i]) /
> + (total_faults * period + 1);
Why 16384? It looks like a scaling factor to deal with integer approximations
but I'm not 100% sure and I do not see how you arrived at that value.
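If it really is just 2^14 used as a fixed-point scale, so the product of
two fractions survives integer arithmetic, then a define with a comment
would help, something like (my guess at the intent, not a request for
these exact names):

	/* Fixed-point scale for CPU-weighted fault fractions (guess) */
	#define NUMA_FAULT_WEIGHT_SHIFT	14
	#define NUMA_FAULT_WEIGHT_SCALE	(1UL << NUMA_FAULT_WEIGHT_SHIFT)

As a sanity check of my reading: a task spending half its time on the CPU
with half of its faults from a given node would get a weight of roughly
16384 * 0.5 * 0.5 = 4096 for that node.
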
> p->numa_faults_from[i] >>= 1;
> - p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
> + p->numa_faults_from[i] += f_weight;
> p->numa_faults_from_buffer[i] = 0;
>
numa_faults_from needs a big comment making it clear that it no longer
holds a raw number of faults. It's now the sum of the faults measured for
the group, weighted by CPU use.
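Something along these lines, perhaps (wording is mine, adjust to taste):

	/*
	 * Suggested comment only: numa_faults_from[] no longer holds a
	 * raw fault count.  Each entry is a decaying sum of per-scan
	 * weights, where a weight is the task's share of faults from
	 * that node scaled by its recent CPU use, so threads that
	 * fault a lot but run very little barely register.
	 */
	unsigned long *numa_faults_from;
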
> faults += p->numa_faults[i];
> --
> 1.8.4.2
>
--
Mel Gorman
SUSE Labs