From: Nick Piggin <npiggin@suse.de>
To: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Linux Memory Management List <linux-mm@kvack.org>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
linux-arch@vger.kernel.org, rmk@arm.linux.org.uk,
James.Bottomley@HansenPartnership.com
Subject: Re: [patch] mm: fix PageUptodate memory ordering bug
Date: Sun, 23 Dec 2007 07:54:46 +0100
Message-ID: <20071223065446.GB29288@wotan.suse.de>
In-Reply-To: <Pine.LNX.4.64.0712221152370.7460@blonde.wat.veritas.com>

On Sat, Dec 22, 2007 at 12:14:42PM +0000, Hugh Dickins wrote:
> On Sat, 22 Dec 2007, Andrew Morton wrote:
> > On Tue, 18 Dec 2007 02:26:32 +0100 Nick Piggin <npiggin@suse.de> wrote:
> >
> > > After running SetPageUptodate, preceding stores to the page contents to
> > > actually bring it uptodate may not be ordered with the store to set the page
> > > uptodate.
> > >
> > > Therefore, another CPU which checks PageUptodate is true, then reads the
> > > page contents can get stale data.
> > >
> > > Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
> > > PageUptodate.
> > >
> > > Many places that test PageUptodate, do so with the page locked, and this
> > > would be enough to ensure memory ordering in those places if SetPageUptodate
> > > were only called while the page is locked. Unfortunately that is not always
> > > the case for some filesystems, but it could be an idea for the future.
> > >
> > > One thing I like about it is that it brings the handling of anonymous page
> > > uptodateness in line with that of file backed page management, by marking anon
> > > pages as uptodate when they _are_ uptodate, rather than when our implementation
> > > requires that they be marked as such.
>
> Nick, you're welcome to make that a separate, less controversial patch,
> to send in ahead. Though I think the last time this came around, I hit
> one of your BUGs in testing shmem.c swapout or swapin or swapoff:
> something missing there that I've lost the record of - please do
> try testing that, maybe it's already fixed this time around.

I've given it some hours in your patented swapping kbuild-on-ext2-on-loop-on-tmpfs
stress testing (including swapoff). I haven't seen a problem yet (except the tmpfs
swapin deadlock, which I've been patching out).

But if you see anything, please let me know...

> > > #ifdef CONFIG_S390
> > > + page_clear_dirty(page);
> > > +#endif
> > > +}
>
> That's an odd little extract, since page_clear_dirty only does anything
> on s390.

Ah yeah, we could just get rid of the ifdef. Although I don't mind it too much,
as it kind of helps the reader match the other ifdef there...

> > For an overall 0.5% increase in the i386 size of several core mm files. If
> > you don't blow us up on the spot, you'll slowly bleed us to death.
> >
> > Can it be improved?
>
> I do wish it could be.
>
> I never find the time to give it the thought it needs; and any criticism
> I make is probably unjust, probably patiently answered by Nick on a
> previous round.
>
> I'm never convinced that SetPageUptodate is the right place for
> this: what's wrong with doing it in those page copying functions?
> Or flush_dcache_page?

There are various places we _could_ do it, but I think the PG_uptodate macros
are logically the best place, without being too intrusive.

Let me explain. Normally I think the convention would be to open-code the
barriers in the callers (i.e. an smp_wmb between memset(); and
SetPageUptodate();, and an smp_rmb between if (PageUptodate()) and the reads
from the page).

However, I think that would require auditing quite a bit of code (including
filesystems). So I think having the barriers in these macros is pretty
reasonable, and requires less thinking from others.
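
To make the pairing concrete, here is a minimal userspace sketch of the
pattern, written with C11 fences rather than the kernel primitives. It is not
the patch itself: struct demo_page, demo_fill_and_set_uptodate() and
demo_read_if_uptodate() are hypothetical names made up for illustration, and
the release/acquire fences only stand in for smp_wmb/smp_rmb (they are, if
anything, somewhat stronger).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096

/* Hypothetical stand-in for a page and its flag; not a kernel type. */
struct demo_page {
	unsigned char data[DEMO_PAGE_SIZE];
	atomic_bool uptodate;	/* plays the role of PG_uptodate */
};

/* Writer side: bring the page uptodate, then publish that fact. */
static void demo_fill_and_set_uptodate(struct demo_page *p, unsigned char byte)
{
	assert(p != NULL);
	memset(p->data, byte, sizeof(p->data));
	/* Stand-in for smp_wmb: data stores are ordered before the flag store. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&p->uptodate, true, memory_order_relaxed);
}

/* Reader side: copy data out only if the page is seen as uptodate. */
static bool demo_read_if_uptodate(struct demo_page *p, unsigned char *out,
				  size_t len)
{
	assert(p != NULL && out != NULL && len <= sizeof(p->data));
	if (!atomic_load_explicit(&p->uptodate, memory_order_relaxed))
		return false;
	/* Stand-in for smp_rmb: data loads are ordered after the flag load. */
	atomic_thread_fence(memory_order_acquire);
	memcpy(out, p->data, len);
	return true;
}
```

With both fences in place, a reader that observes the flag as set also
observes the stores that filled the page. Dropping either fence reintroduces
the stale-data window described in the patch changelog. Hiding the fences
inside the two helpers, rather than in every caller, mirrors the argument for
putting the barriers in the macros.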

Why don't I like doing it in the page copying functions? Just because they have
more, and more varied, uses. I can't think of any reason to prefer doing it in
the page copying functions, and I can think of some reasons against.

flush_dcache_page? Well, this bug really is a problem of ordering the stores to
the page and the store to the page flags against the loads from the same; it
has nothing to do with cache aliasing. So putting the smp_wmb in
flush_dcache_page leaves you without a natural complementary place to put the
smp_rmb. Although it could be done, I think it would be more tangled than
having the ordering done in the macros. We also only need to order the
*initial* stores which bring the page uptodate, rather than every store, as
would be the case with flush_dcache_page.

> Don't we need different kinds of barrier
> according to how the data got into the page (by DMA or not)?

I had thought about this very issue (my previous patch had an XXX: help...).
Without actually knowing what the underlying architecture does,
I "concluded" that it should be done somewhere down at the block layer. I
think it would be silly for the block layer to signal completion if the
results are still incoherent with the CPU cache... but if the experts have
a different opinion, then this needs to be solved with another call anyway
(not in the page uptodate macros, and it's not exactly a memory ordering issue).

E.g. direct IO reads would need the same DMA cache synchronisation before they
complete to userspace, and this is completely independent of PG_uptodate...

> Doesn't that enter territory discussed down the years between
> James Bottomley and Russell King? Worth CC'ing them the original?

... but since you bring this up again, I think that would be worthwhile. In
the interest of maintaining this thread I'll just link the original:
http://marc.info/?l=linux-mm&m=119794127303483&w=2

The question is this:

  Must read from net/disk/etc into page P.
  Device DMAs into P, signals completion
  CPU0: handles completion, store to ram to mark P uptodate
  CPU0/1: load from ram sees P uptodate, load from P must only see uptodate data

Are we guaranteed to get uptodate data from above the block layer, or do we
need to do anything special?