linux-mm.kvack.org archive mirror
From: Buddy Lumpkin <buddy.lumpkin@oracle.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	hannes@cmpxchg.org, riel@surriel.com, mgorman@suse.de,
	akpm@linux-foundation.org
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
Date: Wed, 4 Apr 2018 03:07:01 -0700	[thread overview]
Message-ID: <2D4C5B98-6B19-4430-AFA0-83C9D72DB86C@oracle.com> (raw)
In-Reply-To: <20180403211253.GC30145@bombadil.infradead.org>


> On Apr 3, 2018, at 2:12 PM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
>>> Yes, very much this.  If you have a single-threaded workload which is
>>> using the entirety of memory and would like to use even more, then it
>>> makes sense to use as many CPUs as necessary getting memory out of its
>>> way.  If you have N CPUs and N-1 threads happily occupying themselves in
>>> their own reasonably-sized working sets with one monster process trying
>>> to use as much RAM as possible, then I'd be pretty unimpressed to see
>>> the N-1 well-behaved threads preempted by kswapd.
>> 
>> The default value provides one kswapd thread per NUMA node, the same
>> it was without the patch. Also, I would point out that just because you devote
>> more threads to kswapd, doesn’t mean they are busy. If multiple kswapd threads
>> are busy, they are almost certainly doing work that would have resulted in
>> direct reclaims, which are often substantially more expensive than a couple
>> extra context switches due to preemption.
> 
> [...]
> 
>> In my previous response to Michal Hocko, I described
>> how I think we could scale watermarks in response to direct reclaims, and
>> launch more kswapd threads when kswapd peaks at 100% CPU usage.
> 
> I think you're missing my point about the workload ... kswapd isn't
> "nice", so it will compete with the N-1 threads which are chugging along
> at 100% CPU inside their working sets.  In this scenario, we _don't_
> want to kick off kswapd at all; we want the monster thread to clean up
> its own mess.  If we have idle CPUs, then yes, absolutely, lets have
> them clean up for the monster, but otherwise, I want my N-1 threads
> doing their own thing.

For the scenario you describe above, I have my own opinions, but I would rather not
speculate on what happens. Tomorrow I will try to simulate this situation and I'll
report back with the results. I think this actually makes a case for accepting the patch
as-is for now. Please hear me out on this:

You mentioned being concerned that an admin will do the wrong thing with this
tunable. I worked in the System Administrator/System Engineering job families for
many years, and even though I transitioned to spending most of my time on
performance and kernel work, I still maintain an active role in System Engineering
related projects, hiring, and mentoring.

The kswapd_threads tunable defaults to one, which matches the current behavior.
I think there are plenty of sysctls that are more confusing than this one.
If you want to make a comparison, I would say that Transparent Hugepages is one
of the best examples of a feature that has confused System Administrators. I am sure
it works a lot better today, but it has a history of really sharp edges, and it has been
shipping enabled by default for a long time in the OS distributions I am familiar with.
I am hopeful that it works better in later kernels, as I think we need more features
like it: specifically, features that bring high performance to naive third-party apps
that do not make use of advanced features like hugetlbfs, spoke, direct IO, or clumsy
interfaces like posix_fadvise. But until they are absolutely polished, I wish these kinds
of features would not be turned on by default. This includes kswapd_threads.

More reasons why implementing this tunable makes sense for now:
- A feature like this is a lot easier to reason about after it has been used in the field
  for a while. This includes any attempt to auto-tune it.
- We need an answer for this problem today. There are already single NVMe drives
  capable of 10GB/s, and systems larger than the one I used for testing.
- In the scenario you describe above, an admin would have no reason to touch
  this sysctl.
- As I mentioned before, I honestly thought a lot of tuning would be necessary
  after implementing this, but so far that hasn't been the case. It works pretty well.


> 
> Maybe we should renice kswapd anyway ... thoughts?  We don't seem to have
> had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
> and discovered it was a bad idea?
> 

