Re: page_waitqueue() considered harmful

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Mel Gorman <mgorman@techsingularity.net>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Johannes Weiner <hannes@cmpxchg.org>, Jan Kara <jack@suse.cz>,
	Peter Zijlstra <peterz@infradead.org>,
	Rik van Riel <riel@redhat.com>, linux-mm <linux-mm@kvack.org>
Subject: Re: page_waitqueue() considered harmful
Date: Tue, 27 Sep 2016 09:31:04 +0100	[thread overview]
Message-ID: <20160927083104.GC2838@techsingularity.net> (raw)
In-Reply-To: <CA+55aFwVSXZPONk2OEyxcP-aAQU7-aJsF3OFXVi8Z5vA11v_-Q@mail.gmail.com>


Hi Linus,

On Mon, Sep 26, 2016 at 01:58:00PM -0700, Linus Torvalds wrote:
> So I've been doing some profiling of the git "make test" load, which
> is interesting because it's just a lot of small scripts, and it shows
> our fork/execve/exit costs.
> 
> Some of the top offenders are pretty understandable: we have long had
> unmap_page_range() show up at the top on these loads, because the
> zap_pte_range() function ends up touching each page as it unmaps
> things (it does it to check whether it's an anonymous page, but then
> also for the page map count update etc), and that's a disaster from a
> cache standpoint. That single function is something between 3-4% of
> CPU time, and the one instruction that accesses "struct page" the
> first time is a large portion of that. Yes, a single instruction is
> blamed for about 1% of all CPU time on a fork/exec/exit workload.
> 

It was found at one point that the fault-around made these costs worse as
there were simply more pages to tear down. However, this only applied to
fork/exit microbenchmarks.  Matt Fleming prototyped an unreleased patch
that tried to be clever about this but the cost was never worthwhile. A
plain revert helped a microbenchmark but hurt workloads like the git
testsuite which was shell intensive.

It got filed under "we're not fixing a fork/exit microbenchmark at the
expense of "real" workloads like git checkout and git testsuite".

> <SNIP>
>
> #5 and #6 on my profile are user space (_int_malloc in glibc, and
> do_lookup_x in the loader - I think user space should probably start
> thinking more about doing static libraries for the really core basic
> things, but whatever. Not a kernel issue.
> 

Recent problems have been fixed with _int_malloc in glibc, particularly as it
applies to threads but no fix springs to mind that might impact "make test".

> #7 is in the kernel again. And that one struck me as really odd. It's
> "unlock_page()", while #9 is __wake_up_bit(). WTF? There's no IO in
> this load, it's all cached, why do we use 3% of the time (1.7% and
> 1.4% respectively) on unlocking a page. And why can't I see the
> locking part?
> 
> It turns out that I *can* see the locking part, but it's pretty cheap.
> It's inside of filemap_map_pages(), which does a trylock, and it shows
> up as about 1/6th of the cost of that function. Still, it's much less
> than the unlocking side. Why is unlocking so expensive?
> 
> Yeah, unlocking is expensive because of the nasty __wake_up_bit()
> code. In fact, even inside "unlock_page()" itself, most of the costs
> aren't even the atomic bit clearing (like you'd expect), it's the
> inlined part of wake_up_bit(). Which does some really nasty crud.
> 
> Why is the page_waitqueue() handling so expensive? Let me count the ways:
> 

page_waitqueue() has been a hazard for years. I think the last attempt to
fix it was back in 2014 http://www.spinics.net/lists/linux-mm/msg73207.html

The patch is heavily derived from work by Nick Piggin who noticed the years
before that. I think that was the last version I posted and the changelog
includes profile data. I don't have an exact reference why it was rejected
but a consistent piece of feedback was that it was very complex for the
level of impact it had.

>  (a) stupid code generation interacting really badly with microarchitecture
> 
>      We clear the bit in page->flags with a byte-sized "lock andb",
> and then page_waitqueue() looks up the zone by reading the full value
> of "page->flags" immediately afterwards, which causes bad bad behavior
> with the whole read-after-partial-write thing. Nasty.
> 

Other than some code churn, it should be possible to lookup the waitqueue
before clearing the bit if the patch that side-steps the lookup entirely
is still distasteful.

>  (b) It's cache miss heaven. It takes a cache miss on three different
> things:looking up the zone 'wait_table', then looking up the hash
> queue there, and finally (inside __wake_up_bit) looking up the wait
> queue itself (which will effectively always be NULL).
> 
> Now, (a) could be fairly easy to fix several ways (the byte-size
> operation on an "unsigned long" field have caused problems before, and
> may just be a mistake), but (a) wouldn't even be a problem if we
> didn't have the complexity of (b) and having to look up the zone and
> everything. So realistically, it's actually (b) that is the primary
> problem, and indirectly causes (a) to happen too.
> 

The patch in question side-steps the (b) part.

> Is there really any reason for that incredible indirection? Do we
> really want to make the page_waitqueue() be a per-zone thing at all?
> Especially since all those wait-queues won't even be *used* unless
> there is actual IO going on and people are really getting into
> contention on the page lock.. Why isn't the page_waitqueue() just one
> statically sized array?
> 

About all I can think of is NUMA locality. Doing it per-zone at the time
would be convenient. It could be per-node but that's no better from a
complexity point of view. If it was a static array, the key could be related
to the node the page is on but that's still looking into the page flags.

> Also, if those bitlock ops had a different bit that showed contention,
> we could actually skip *all* of this, and just see that "oh, nobody is
> waiting on this page anyway, so there's no point in looking up those
> wait queues".

That's very close to what the 2014 patch does. However, it's 64-bit only
due to the use of page flags. Even though it's out of date, can you take
a look? If it doesn't trigger hate, I can forward port it unless you do
that yourself for the purposes of testing.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2016-09-27  8:31 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-26 20:58 Linus Torvalds
2016-09-26 21:23 ` Rik van Riel
2016-09-26 21:30   ` Linus Torvalds
2016-09-26 23:11   ` Kirill A. Shutemov
2016-09-27  1:01     ` Rik van Riel
2016-09-27  7:30 ` Peter Zijlstra
2016-09-27  8:54   ` Mel Gorman
2016-09-27  9:11     ` Kirill A. Shutemov
2016-09-27  9:42       ` Mel Gorman
2016-09-27  9:52       ` Minchan Kim
2016-09-27 12:11         ` Kirill A. Shutemov
2016-09-29  8:01     ` Peter Zijlstra
2016-09-29 12:55       ` Nicholas Piggin
2016-09-29 13:16         ` Peter Zijlstra
2016-09-29 13:54           ` Nicholas Piggin
2016-09-29 15:05         ` Rik van Riel
2016-09-27  8:03 ` Jan Kara
2016-09-27  8:31 ` Mel Gorman [this message]
2016-09-27 14:34   ` Peter Zijlstra
2016-09-27 15:08     ` Nicholas Piggin
2016-09-27 16:31     ` Linus Torvalds
2016-09-27 16:49       ` Peter Zijlstra
2016-09-28 10:45     ` Mel Gorman
2016-09-28 11:11       ` Peter Zijlstra
2016-09-28 16:10         ` Linus Torvalds
2016-09-29 13:08           ` Peter Zijlstra
2016-10-03 10:47             ` Mel Gorman
2016-09-27 14:53   ` Nicholas Piggin
2016-09-27 15:17     ` Nicholas Piggin
2016-09-27 16:52     ` Peter Zijlstra
2016-09-27 17:06       ` Nicholas Piggin
2016-09-28  7:05         ` Peter Zijlstra
2016-09-28 11:05           ` Paul E. McKenney
2016-09-28 11:16             ` Peter Zijlstra
2016-09-28 12:58               ` Paul E. McKenney
2016-09-29  1:31           ` Nicholas Piggin
2016-09-29  2:12             ` Paul E. McKenney
2016-09-29  6:21             ` Peter Zijlstra
2016-09-29  6:42               ` Nicholas Piggin
2016-09-29  7:14                 ` Peter Zijlstra
2016-09-29  7:43                   ` Peter Zijlstra
2016-09-28  7:40     ` Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160927083104.GC2838@techsingularity.net \
    --to=mgorman@techsingularity.net \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=jack@suse.cz \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-mm@kvack.org \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox