From: Andrew Morton <akpm@osdl.org>
To: Andrea Arcangeli <andrea@novell.com>
Cc: shaggy@austin.ibm.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
Date: Thu, 21 Oct 2004 19:03:20 -0700 [thread overview]
Message-ID: <20041021190320.02dccda7.akpm@osdl.org> (raw)
In-Reply-To: <20041022012211.GD14325@dualathlon.random>
Andrea Arcangeli <andrea@novell.com> wrote:
>
> On Fri, Oct 22, 2004 at 02:30:04AM +0200, Andrea Arcangeli wrote:
> > If you want to shootdown ptes before clearing the bitflag, that's fine
>
> small correction s/before/after/. doing the pte shootdown before
> clearing the uptodate bitflag, would still not guarantee to read
> uptodate data after the invalidate (a minor page fault could still
> happen between the shootdown and the clear_bit; while after clearing the
> uptodate bit a major fault hitting the disk and refreshing the pagecache
> contents will be guaranteed - modulo bhs, well at least nfs is sure ok
> in that respect ;).
Yeah. I think the only 100% reliable way to do this is:
lock_page()
unmap_mapping_range(page)
ClearPageUptodate(page);
invalidate(page); /* Try to free the thing */
unlock_page(page);
(Can do it for a whole range of pages if we always agree to lock pages in
file-index-ascending order).
but hrm, we don't even have the locking there to prevent do_no_page() from
reinstantiating the pte immediately after the unmap_mapping_range().
So what to do?
- The patch you originally sent has a race window which can be made
nonfatal by removing the BUGcheck in mpage_writepage().
- NFS should probably be taught to use unmap_mapping_range() regardless
of what we do, so the existing-mmaps-hold-stale-data problem gets fixed
up.
- invalidate_inode_pages2() isn't successfully invalidating the page if
it has buffers.
This is the biggest problem, because the pages can trivially have
buffers - just write()ing to them will attach buffers, if they're ext2-
or ext3-backed.
It means that a direct-IO write to a section of a file which is mmapped
causes pagecache and disk to get out of sync. Question is: do we care,
or do we say "don't do that"? Nobody seems to have noticed thus far and
it's a pretty dopey thing to be doing.
If we _do_ want to fix it, it seems the simplest approach would be to
nuke the pte's in invalidate_inode_pages2(). And we'd need some "oh we
raced - try again" loop in there to handle the "do_no_page()
reinstantiated a pte" race.
Fun.
Something like this? Slow as a dog, but possibly correct ;)
void invalidate_inode_pages2(struct address_space *mapping)
{
struct pagevec pvec;
pgoff_t next = 0;
int i;
pagevec_init(&pvec, 0);
while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
lock_page(page);
if (page->mapping == mapping) { /* truncate race? */
wait_on_page_writeback(page);
next = page->index + 1;
while (page_mapped(page)) {
unmap_mapping_range(mapping,
page->index << PAGE_CACHE_SHIFT,
page->index << PAGE_CACHE_SHIFT+1,
0);
clear_page_dirty(page);
}
invalidate_complete_page(mapping, page);
}
unlock_page(page);
}
pagevec_release(&pvec);
cond_resched();
}
}
The unmapping from pagetables means that invalidate_complete_page() will
generally successfully remove the page from pagecache. That mostly fixes
the direct-write-of-mmapped-data coherency problem: the pages simply aren't
in pagecache any more so we'll surely redo physical I/O.
But it's not 100% reliable because ->invalidatepage isn't 100% reliable and
it really sucks that we're offering behaviour which only works most of the
time :(
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
next prev parent reply other threads:[~2004-10-22 2:03 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2004-10-21 21:15 Dave Kleikamp
2004-10-21 21:45 ` Andrew Morton
2004-10-21 22:36 ` Andrea Arcangeli
2004-10-21 23:02 ` Andrew Morton
2004-10-21 23:20 ` Andrea Arcangeli
2004-10-21 23:42 ` Andrew Morton
2004-10-22 0:15 ` Andrew Morton
2004-10-22 0:41 ` Andrea Arcangeli
2004-10-22 2:51 ` Rik van Riel
2004-10-22 16:19 ` Andrea Arcangeli
2004-10-22 0:30 ` Andrea Arcangeli
2004-10-22 1:22 ` Andrea Arcangeli
2004-10-22 2:03 ` Andrew Morton [this message]
2004-10-22 16:17 ` Andrea Arcangeli
2004-10-22 17:04 ` Andrea Arcangeli
2004-10-22 23:24 ` Andrew Morton
2004-10-25 13:58 ` Dave Kleikamp
2004-10-26 0:35 ` Andrea Arcangeli
2004-11-09 14:15 ` Dave Kleikamp
2004-11-09 14:46 ` Andrea Arcangeli
2004-11-09 19:51 ` Andrew Morton
2004-11-09 19:46 ` Andrew Morton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20041021190320.02dccda7.akpm@osdl.org \
--to=akpm@osdl.org \
--cc=andrea@novell.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=shaggy@austin.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox