Re: [PATCH 0/18] Basic scheduler support for automatic NUMA balancing V5

From: Peter Zijlstra <peterz@infradead.org>
To: Mel Gorman <mgorman@suse.de>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Ingo Molnar <mingo@kernel.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 0/18] Basic scheduler support for automatic NUMA balancing V5
Date: Wed, 31 Jul 2013 12:48:14 +0200
Message-ID: <20130731104814.GA3008@twins.programming.kicks-ass.net>
In-Reply-To: <20130731103052.GR2296@suse.de>

On Wed, Jul 31, 2013 at 11:30:52AM +0100, Mel Gorman wrote:
> I'm not sure I understand your point. The scan rate is decreased again if
> the page is found to be properly placed in the future. It's in the next
> hunk you modify although the periodically reset comment is now out of date.

Yeah, it's because of the next hunk. I figured that if we don't lower it,
we shouldn't raise it either.

> > @@ -1167,10 +1171,20 @@ void task_numa_fault(int last_nidpid, in
> >  	/*
> >  	 * If pages are properly placed (did not migrate) then scan slower.
> >  	 * This is reset periodically in case of phase changes
> > -	 */
> > -        if (!migrated)
> > +	 *
> > +	 * APZ: it seems to me that one can get a ton of !migrated faults;
> > +	 * consider the scenario where two threads fight over a shared memory
> > +	 * segment. We'll win half the faults, half of that will be local, half
> > +	 * of that will be remote. This means we'll see 1/4-th of the total
> > +	 * memory being !migrated. Using a fixed increment will completely
> > +	 * flatten the scan speed for a sufficiently large workload. Another
> > +	 * scenario is due to that migration rate limit.
> > +	 *
> > +        if (!migrated) {
> >  		p->numa_scan_period = min(p->numa_scan_period_max,
> >  			p->numa_scan_period + jiffies_to_msecs(10));
> > +	}
> > +	 */
> 
> FWIW, I'm also not happy with how the scan rate is reduced but did not
> come up with a better alternative that was not fragile or depended on
> gathering too much state. Granted, I also have not been treating it as a
> high priority problem.

Right, so what Ingo did is have the scan rate depend on the convergence.
What exactly did you dislike about that?

We could define the convergence as all the faults inside the interleave
mask vs the total faults, and then run at: min + (1 - c)*(max-min).
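As a rough sketch of that formula in integer arithmetic (the function names
and the permille scaling are made up for illustration and are not existing
kernel API; I'm also assuming "run at" means the scan rate, so a fully
converged task scans at the minimum rate):

```c
/*
 * Sketch only: hypothetical helpers, not existing kernel functions.
 *
 * Convergence c is expressed in permille (0..1000) to avoid floating
 * point: faults inside the interleave mask divided by total faults.
 */
unsigned int numa_convergence_permille(unsigned long faults_in_mask,
				       unsigned long faults_total)
{
	if (faults_total == 0)
		return 0;	/* no data yet: treat as not converged */
	if (faults_in_mask > faults_total)
		faults_in_mask = faults_total;
	/* widen before multiplying so large fault counts cannot overflow */
	return (unsigned int)(((unsigned long long)faults_in_mask * 1000) /
			      faults_total);
}

/*
 * rate = min + (1 - c) * (max - min)
 *
 * c == 1000 (fully converged) gives min; c == 0 gives max.
 * Assumes min <= max and c_permille <= 1000.
 */
unsigned int numa_scan_rate(unsigned int min, unsigned int max,
			    unsigned int c_permille)
{
	unsigned long long span = max - min;

	return min + (unsigned int)(((1000 - c_permille) * span) / 1000);
}
```

With min = 10 and max = 110, a task with half its faults inside the mask
(c = 500) would run at 60, and the value moves smoothly toward 10 as the
placement converges, rather than stepping by a fixed increment per fault.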

> > +#if 0
> >  	/*
> >  	 * We do not care about task placement until a task runs on a node
> >  	 * other than the first one used by the address space. This is
> >  	 * largely because migrations are driven by what CPU the task
> >  	 * is running on. If it's never scheduled on another node, it'll
> >  	 * not migrate so why bother trapping the fault.
> > +	 *
> > +	 * APZ: seems like a bad idea for pure shared memory workloads.
> >  	 */
> >  	if (mm->first_nid == NUMA_PTE_SCAN_INIT)
> >  		mm->first_nid = numa_node_id();
> 
> At some point in the past scan starts were based on waiting a fixed interval
> but that seemed like a hack designed to get around hurting kernel compile
> benchmarks. I'll give it more thought and see can I think of a better
> alternative that is based on an event but not this event.

Ah, well the reasoning on that was that all this NUMA business is
'expensive' so we'd better only bother with tasks that persist long
enough for it to pay off.

In that regard it makes perfect sense to wait a fixed amount of runtime
before we start scanning.

So it was not a pure hack to make kbuild work again.. that it did was
good though.

> > @@ -1254,9 +1272,14 @@ void task_numa_work(struct callback_head
> >  	 * Do not set pte_numa if the current running node is rate-limited.
> >  	 * This loses statistics on the fault but if we are unwilling to
> >  	 * migrate to this node, it is less likely we can do useful work
> > -	 */
> > +	 *
> > +	 * APZ: seems like a bad idea; even if this node can't migrate anymore
> > +	 * other nodes might and we want up-to-date information to do balance
> > +	 * decisions.
> > +	 *
> >  	if (migrate_ratelimited(numa_node_id()))
> >  		return;
> > +	 */
> >  
> 
> Ingo also disliked this but I wanted to avoid a situation where the
> workload suffered because of a corner case where the interconnect was
> filled with migration traffic.

Right, but you already rate limit the actual migrations, this should
leave enough bandwidth to allow the non-migrating scanning.

I think it's important we keep up-to-date information if we're going to
do placement based on it.

On that rate-limit, this looks to be a hard-coded number unrelated to
the actual hardware. I think we should at the very least make it a
configurable number and preferably scale the number with the SLIT info.
Or alternatively actually measure the node to node bandwidth.
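One possible shape for the SLIT scaling, purely as a sketch: the helper name,
the units and the inverse-proportional rule are all assumptions on my part,
nothing like this exists in the series. The only fixed point is that the ACPI
SLIT normalises the local distance to 10.

```c
/* ACPI SLIT normalises the distance from a node to itself to 10. */
#define SLIT_LOCAL_DISTANCE	10

/*
 * Hypothetical helper: shrink a configurable base rate limit in
 * inverse proportion to the SLIT distance between two nodes, so that
 * a node pair reported as twice as far apart gets half the budget.
 * The units of base_limit (e.g. MB/s) are left to the caller.
 */
unsigned int scaled_migrate_ratelimit(unsigned int base_limit,
				      unsigned int distance)
{
	/* guard against bogus firmware tables reporting less than local */
	if (distance < SLIT_LOCAL_DISTANCE)
		distance = SLIT_LOCAL_DISTANCE;

	return (unsigned int)(((unsigned long long)base_limit *
			       SLIT_LOCAL_DISTANCE) / distance);
}
```

So a base limit of 1280 stays 1280 at distance 10 and drops to 640 at
distance 20. Measuring the real node to node bandwidth would replace the
distance input entirely.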
