From: Mel Gorman <mgorman@suse.de>
To: riel@redhat.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
peterz@infradead.org, mingo@redhat.com, chegu_vinod@hp.com
Subject: Re: [PATCH 5/6] numa,sched: normalize faults_from stats and weigh by CPU use
Date: Tue, 21 Jan 2014 15:56:52 +0000
Message-ID: <20140121155652.GL4963@suse.de>
In-Reply-To: <1390245667-24193-6-git-send-email-riel@redhat.com>
On Mon, Jan 20, 2014 at 02:21:06PM -0500, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
>
> The tracepoint has made it abundantly clear that the naive
> implementation of the faults_from code has issues.
>
> Specifically, the garbage collector in some workloads will
> access orders of magnitude more memory than the threads
> that do all the active work. This resulted in the node with
> the garbage collector being marked the only active node in
> the group.
>
Maybe I should have read this patch before getting into a twist about the
earlier patches in the series and the treatment of active_mask :(. On the
plus side, even without reading the code I can still see the motivation
for this paragraph.
> This issue is avoided if we weigh the statistics by CPU use
> of each task in the numa group, instead of by how many faults
> each thread has incurred.
>
Bah, yes. In my earlier review I was worried about those faults being
missed. If the fault stats are scaled by the CPU statistics then the
result is a very rough proxy measure for how heavily a particular node is
being referenced by a process.
> To achieve this, we normalize the number of faults to the
> fraction of faults that occurred on each node, and then
> multiply that fraction by the fraction of CPU time the
> task has used since the last time task_numa_placement was
> invoked.
>
> This way the nodes in the active node mask will be the ones
> where the tasks from the numa group are most actively running,
> and the influence of eg. the garbage collector and other
> do-little threads is properly minimized.
>
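Spelling the weighting out for my own benefit (a sketch of my reading of
the patch; the standalone helper and its name are mine, not the code):

	/* Hypothetical helper, illustrative only */
	static unsigned long weigh_node_faults(unsigned long node_faults,
					       unsigned long total_faults,
					       u64 runtime, u64 period)
	{
		/*
		 * node_faults / total_faults is the task's share of
		 * faults from this node and runtime / period is its
		 * recent CPU use; 16384 is a fixed-point scale and the
		 * +1 avoids dividing by zero for an idle task.
		 */
		return (16384 * runtime * node_faults) /
		       (total_faults * period + 1);
	}

So a thread that does most of the faulting but very little of the running
contributes almost nothing to a node's weight, which is the point.
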
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Chegu Vinod <chegu_vinod@hp.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> kernel/sched/fair.c | 21 +++++++++++++++++++--
> 1 file changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ea873b6..203877d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1426,6 +1426,8 @@ static void task_numa_placement(struct task_struct *p)
> int seq, nid, max_nid = -1, max_group_nid = -1;
> unsigned long max_faults = 0, max_group_faults = 0;
> unsigned long fault_types[2] = { 0, 0 };
> + unsigned long total_faults;
> + u64 runtime, period;
> spinlock_t *group_lock = NULL;
>
> seq = ACCESS_ONCE(p->mm->numa_scan_seq);
> @@ -1434,6 +1436,11 @@ static void task_numa_placement(struct task_struct *p)
> p->numa_scan_seq = seq;
> p->numa_scan_period_max = task_scan_max(p);
>
> + total_faults = p->numa_faults_locality[0] +
> + p->numa_faults_locality[1] + 1;
Depending on how you reacted to the review of other patches this may or
may not have a helper now.
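Something like this is what I had in mind, for what it's worth (purely
illustrative, name and placement are my own):

	static inline unsigned long task_faults_total(struct task_struct *p)
	{
		/* +1 so callers can divide by the result safely */
		return p->numa_faults_locality[0] +
		       p->numa_faults_locality[1] + 1;
	}
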
> + runtime = p->se.avg.runnable_avg_sum;
> + period = p->se.avg.runnable_avg_period;
> +
Ok, IIRC these stats are a decaying average of recent history, so heavy
activity followed by long periods of idle will not skew the stats.
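My (possibly fuzzy) recollection of how that tracking works: runnable time
is accumulated in roughly 1ms segments and older segments are decayed
geometrically, something like

	sum = seg_0 + y * seg_1 + y^2 * seg_2 + ...	where y^32 ~= 0.5

so a burst of activity followed by a long idle period fades out of
runtime/period within a few hundred milliseconds.
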
> /* If the task is part of a group prevent parallel updates to group stats */
> if (p->numa_group) {
> group_lock = &p->numa_group->lock;
> @@ -1446,7 +1453,7 @@ static void task_numa_placement(struct task_struct *p)
> int priv, i;
>
> for (priv = 0; priv < 2; priv++) {
> - long diff, f_diff;
> + long diff, f_diff, f_weight;
>
> i = task_faults_idx(nid, priv);
> diff = -p->numa_faults[i];
> @@ -1458,8 +1465,18 @@ static void task_numa_placement(struct task_struct *p)
> fault_types[priv] += p->numa_faults_buffer[i];
> p->numa_faults_buffer[i] = 0;
>
> + /*
> + * Normalize the faults_from, so all tasks in a group
> + * count according to CPU use, instead of by the raw
> + * number of faults. Tasks with little runtime have
> + * little over-all impact on throughput, and thus their
> + * faults are less important.
> + */
> + f_weight = (16384 * runtime *
> + p->numa_faults_from_buffer[i]) /
> + (total_faults * period + 1);
Why 16384? It looks like a scaling factor to deal with integer approximations
but I'm not 100% sure and I do not see how you arrived at that value.
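If it really is just 2^14 used as a fixed-point scale, so the product of
two fractions survives integer arithmetic, then a define with a comment
would help, something like (my guess at the intent, not a request for
these exact names):

	/* Fixed-point scale for CPU-weighted fault fractions (guess) */
	#define NUMA_FAULT_WEIGHT_SHIFT	14
	#define NUMA_FAULT_WEIGHT_SCALE	(1UL << NUMA_FAULT_WEIGHT_SHIFT)

As a sanity check of my reading: a task spending half its time on the CPU
with half of its faults from a given node would get a weight of roughly
16384 * 0.5 * 0.5 = 4096 for that node.
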
> p->numa_faults_from[i] >>= 1;
> - p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
> + p->numa_faults_from[i] += f_weight;
> p->numa_faults_from_buffer[i] = 0;
>
numa_faults_from needs a big comment making it clear that it no longer
holds a raw number of faults. It's now the sum of the faults measured for
the group, weighted by CPU use.
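Something along these lines, perhaps (wording is mine, adjust to taste):

	/*
	 * Suggested comment only: numa_faults_from[] no longer holds a
	 * raw fault count.  Each entry is a decaying sum of per-scan
	 * weights, where a weight is the task's share of faults from
	 * that node scaled by its recent CPU use, so threads that
	 * fault a lot but run very little barely register.
	 */
	unsigned long *numa_faults_from;
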
> faults += p->numa_faults[i];
> --
> 1.8.4.2
>
--
Mel Gorman
SUSE Labs