From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Nick Piggin <npiggin@suse.de>
Cc: Linux Memory Management List <linux-mm@kvack.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Joachim Deguara <joachim.deguara@amd.com>,
Christoph Lameter <clameter@sgi.com>, Mel Gorman <mel@csn.ul.ie>,
Eric Whitney <eric.whitney@hp.com>
Subject: Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache
Date: Mon, 13 Aug 2007 10:05:01 -0400 [thread overview]
Message-ID: <1187013901.5592.24.camel@localhost> (raw)
In-Reply-To: <20070813074351.GA15609@wotan.suse.de>
On Mon, 2007-08-13 at 09:43 +0200, Nick Piggin wrote:
> On Fri, Aug 10, 2007 at 05:08:18PM -0400, Lee Schermerhorn wrote:
> > On Wed, 2007-08-08 at 16:25 -0400, Lee Schermerhorn wrote:
> > > On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> > > > Hi,
> > > >
> > > > Just got a bit of time to take another look at the replicated pagecache
> > > > patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
> > > > gives me more confidence in the locking now; the new ->fault API makes
> > > > MAP_SHARED write faults much more efficient; and a few bugs were found
> > > > and fixed.
> > > >
> > > > More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
> > > > tests...
> > > >
<snip>
>
> Hi Lee,
>
> Am sick with the flu for the past few days, so I haven't done much more
> work here, but I'll just add some (not very useful) comments....
>
> The get_page_from_freelist hang is quite strange. It would be zone->lock,
> which shouldn't have too much contention...
>
> Replication may be putting more stress on some locks. It will cause more
> tlb flushing that can not be batched well, which could cause the call_lock
> to get hotter. Then i_mmap_lock is held over tlb flushing, so it will
> inherit the latency from call_lock. (If this is the case, we could
> potentially extend the tlb flushing API slightly to cope better with
> unmapping of pages from multiple mm's, but that comes way down the track
> when/if replication proves itself!).
>
> tlb flushing AFAIKS should not do the IPI unless it is deadling with a
> multithreaded mm... does usex use threads?
Yes. Apparently, there are some tests, perhaps some of the /usr/bin
apps that get run repeatedly, that are multi-threaded. This job mix
caught a number of races in my auto-migration patches when
multi-threaded tasks race in the page fault paths.
More below...
>
>
> > I should note that I was trying to unmap all mappings to the file backed pages
> > on internode task migration, instead of just the current task's pte's. However,
> > I was only attempting this on pages with mapcount <= 4. So, I don't think I
> > was looping trying to unmap pages with mapcounts of several 10s--such as I see
> > on some page cache pages in my traces.
>
> Replication teardown would still have to unmap all... but that shouldn't
> particularly be any worse than, say, page reclaim (except I guess that it
> could occur more often).
>
>
> > Today, after rebasing to 23-rc2-mm2, I added a patch to unmap only the current
> > task's ptes for ALL !anon pages, regardless of mapcount. I've started the test
> > again and will let it run over the weekend--or as long as it stays up, which
> > ever is shorter :-).
>
> Ah, so it does eventually die? Any hints of why?
No, doesn't die--as in panic. I was just commenting that I'd leave it
running ... However [:-(], it DID hang again. The test window said
that the tests ran for 62h:28m before the screen stopped updating. In
another window, I was running a script to snap the replication and #
file pages vmstats, along with a timestamp, every 10 minutes. That
stopped reporting stats at about 7:30am on Saturday--about 14h:30m into
the test. It still wrote the timestamps [date command] until around 7am
this morning [Monday]--or ~62 hours into test.
So, I do have ~14 hours of replication stats that I can send you or plot
up...
Re: the hang: again, console was scrolling soft lockups continuously.
Checking the messages file, I see hangs in copy_process(),
smp_call_function [as in prev test], vma_link [from mmap], ...
I also see a number of NaT ["not a thing"] consumptions--ia64 specific
error, probably invalid pointer deref--in swapin_readahead, which my
patches hack. These might be the cause of the fork/mmap hangs.
Didn't see that in the 8-9Aug runs, so it might be a result of continued
operation after other hangs/problems; or a botch in the rebase to
rc2-mm2. In any case, I have some work to do there...
>
> >
> > I put a tarball with the rebased series in the Replication directory linked
> > above, in case you're interested. I haven't added the patch description for
> > the new patch yet, but it's pretty simple. Maybe even correct.
> >
> > ----
> >
> > Unrelated to the lockups [I think]:
> >
> > I forgot to look before I rebooted, but earlier the previous evening, I checked
> > the vmstats and at that point [~1.5 hours into the test] we had done ~4.88 million
> > replications and ~4.8 million "zaps" [collapse of replicated page]. That's around
> > 98% zaps. Do we need some filter in the fault path to reduce the "thrashing"--if
> > that's what I'm seeing.
>
> Yep. The current replication patch is very much only infrastructure at
> this stage (and is good for stress testing). I feel sure that heuristics
> and perhaps tunables would be needed to make the most of it.
Yeah. I have some ideas to try...
At the end of the 14.5 hours when it stopped dumping vmstats, we were at
~95% zaps.
>
>
> > A while back I took a look at the Virtual Iron page replication patch. They had
> > set VM_DENY_WRITE when mapping shared executable segments, and only replicated pages
> > in those VMAs. Maybe 'DENY_WRITE isn't exactly what we want. Possibly set another
> > flag for shared executables, if we can detect them, and any shared mapping that has
> > no writable mappings ?
>
> mapping_writably_mapped would be a good one to try. That may be too
> broad in some corner cases where we do want occasionally-written files
> or even parts of files to be replicated, but if we were ever to enable
> CONFIG_REPLICATION by default, I imagine mapping_writably_mapped would
> be the default heuristic.
>
> Still, I appreciate the testing of the "thrashing" case, because with
> the mapping_writably_mapped heuristic, it is likely that bugs could
> remain lurking even in production workloads on huge systems (because
> they will hardly ever get unreplicated).
>
>
> > I'll try to remember to check the replication statistics after the currently
> > running test. If the system stays up, that is. A quick look < 10 minutes into
> > the test shows that zaps are now ~84% of replications. Also, ~47k replicated pages
> > out of ~287K file pages.
>
> Yeah I guess it can be a little misleading: as time approaches infinity,
> zaps will probably approach replications. But that doesn't tell you how
> long a replica stayed around and usefully fed CPUs with local memory...
May be able to capture that info with a more invasive patch -- e.g., add
a timestamp to the page struct. I'll think about it.
And, I'll keep you posted. Not sure how much time I'll be able to
dedicate to this patch stream. Got a few others I need to get back
to...
Later,
Lee
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2007-08-13 14:05 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-07-27 8:42 Nick Piggin
2007-07-27 14:30 ` Lee Schermerhorn
2007-07-30 3:16 ` Nick Piggin
2007-07-30 16:29 ` Lee Schermerhorn
2007-08-08 20:25 ` Lee Schermerhorn
2007-08-10 21:08 ` Lee Schermerhorn
2007-08-13 7:43 ` Nick Piggin
2007-08-13 14:05 ` Lee Schermerhorn [this message]
2007-08-14 2:08 ` Nick Piggin
2007-09-11 20:52 ` Update: [Automatic] NUMA replicated pagecache on 2.6.23-rc4-mm1 Lee Schermerhorn
2007-09-12 1:52 ` Balbir Singh
2007-09-12 13:48 ` Lee Schermerhorn
2007-09-12 14:08 ` Balbir Singh
2007-09-12 15:09 ` Kernel Panic - 2.6.23-rc4-mm1 ia64 - was Re: Update: [Automatic] NUMA replicated pagecache Lee Schermerhorn
2007-09-12 15:41 ` Andy Whitcroft
2007-09-12 17:04 ` Lee Schermerhorn
2007-09-12 19:46 ` [PATCH] " Lee Schermerhorn
2007-09-12 21:23 ` Balbir Singh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1187013901.5592.24.camel@localhost \
--to=lee.schermerhorn@hp.com \
--cc=clameter@sgi.com \
--cc=eric.whitney@hp.com \
--cc=joachim.deguara@amd.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mel@csn.ul.ie \
--cc=npiggin@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox