* [S390] page_mkclean data corruption.
@ 2007-04-04 16:37 Martin Schwidefsky
2007-04-04 17:10 ` Linus Torvalds
0 siblings, 1 reply; 3+ messages in thread
From: Martin Schwidefsky @ 2007-04-04 16:37 UTC (permalink / raw)
To: torvalds, gregkh; +Cc: linux-kernel, linux-s390, linux-mm
Linus, Greg,
the attached patch fixes a data corruption problem that has been
introduced with the page_mkclean/clear_page_dirty_for_io change
(the "Yes, Virginia, this is indeed insane." problem :-/)
In essence the fact that clear_page_dirty_for_io is called for
not-uptodate pages causes data corruption for architectures that use
page_test_and_clear_dirty (which is s390 only).
This should go into 2.6.21-rc and 2.6.20-stable.
Please pull message from git390 will follow.
blue skies,
Martin.
--
Subject: [S390] page_mkclean data corruption.
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
The git commit c2fda5fed81eea077363b285b66eafce20dfd45a which
added the page_test_and_clear_dirty call to page_mkclean and the
git commit 7658cc289288b8ae7dd2c2224549a048431222b3 which fixes
the "nasty and subtle race in shared mmap'ed page writeback"
problem in clear_page_dirty_for_io cause data corruption on s390.
The effect of the two changes is that for every call to
clear_page_dirty_for_io a page_test_and_clear_dirty is done. If
the per page dirty bit is set set_page_dirty is called. Strangly
clear_page_dirty_for_io is called for not-uptodate pages, e.g.
over this call-chain:
[<000000000007c0f2>] clear_page_dirty_for_io+0x12a/0x130
[<000000000007c494>] generic_writepages+0x258/0x3e0
[<000000000007c692>] do_writepages+0x76/0x7c
[<00000000000c7a26>] __writeback_single_inode+0xba/0x3e4
[<00000000000c831a>] sync_sb_inodes+0x23e/0x398
[<00000000000c8802>] writeback_inodes+0x12e/0x140
[<000000000007b9ee>] wb_kupdate+0xd2/0x178
[<000000000007cca2>] pdflush+0x162/0x23c
The bad news now is that page_test_and_clear_dirty might claim
that a not-uptodate page is dirty since SetPageUptodate which
resets the per page dirty bit has not yet been called. The page
writeback that follows clobbers the data on disk.
The simplest solution to this problem is to move the call to
page_test_and_clear_dirty under the "if (page_mapped(page))".
If a file backed page is mapped it is uptodate.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
mm/rmap.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff -urpN linux-2.6/mm/rmap.c linux-2.6-patched/mm/rmap.c
--- linux-2.6/mm/rmap.c 2007-04-04 14:23:22.000000000 +0200
+++ linux-2.6-patched/mm/rmap.c 2007-04-04 14:23:35.000000000 +0200
@@ -498,9 +498,9 @@ int page_mkclean(struct page *page)
struct address_space *mapping = page_mapping(page);
if (mapping)
ret = page_mkclean_file(mapping, page);
+ if (page_test_and_clear_dirty(page))
+ ret = 1;
}
- if (page_test_and_clear_dirty(page))
- ret = 1;
return ret;
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [S390] page_mkclean data corruption.
2007-04-04 16:37 [S390] page_mkclean data corruption Martin Schwidefsky
@ 2007-04-04 17:10 ` Linus Torvalds
2007-04-04 17:35 ` Martin Schwidefsky
0 siblings, 1 reply; 3+ messages in thread
From: Linus Torvalds @ 2007-04-04 17:10 UTC (permalink / raw)
To: Martin Schwidefsky; +Cc: gregkh, linux-kernel, linux-s390, linux-mm
On Wed, 4 Apr 2007, Martin Schwidefsky wrote:
>
> the attached patch fixes a data corruption problem that has been
> introduced with the page_mkclean/clear_page_dirty_for_io change
> (the "Yes, Virginia, this is indeed insane." problem :-/)
Ok. I'm a bit worried about something like this, this late in the release
cycle, but since I guess page_test_and_clear_dirty() is always 0 for any
architecture but S390, I guess there are no possible downsides except for
that architecture.
So I'll apply it, but:
> The effect of the two changes is that for every call to
> clear_page_dirty_for_io a page_test_and_clear_dirty is done. If
> the per page dirty bit is set set_page_dirty is called. Strangly
> clear_page_dirty_for_io is called for not-uptodate pages, e.g.
> over this call-chain:
>
> [<000000000007c0f2>] clear_page_dirty_for_io+0x12a/0x130
> [<000000000007c494>] generic_writepages+0x258/0x3e0
> [<000000000007c692>] do_writepages+0x76/0x7c
> [<00000000000c7a26>] __writeback_single_inode+0xba/0x3e4
> [<00000000000c831a>] sync_sb_inodes+0x23e/0x398
> [<00000000000c8802>] writeback_inodes+0x12e/0x140
> [<000000000007b9ee>] wb_kupdate+0xd2/0x178
> [<000000000007cca2>] pdflush+0x162/0x23c
>
> The bad news now is that page_test_and_clear_dirty might claim
> that a not-uptodate page is dirty since SetPageUptodate which
> resets the per page dirty bit has not yet been called. The page
> writeback that follows clobbers the data on disk.
Wouldn't it be best if S390 tried to avoid this by clearing the dirty bit
whenever a new page is allocated?
This is a very subtle and very surprising problem with the whole
"page_test_and_clear_dirty()" thing - where a new page can be marked dirty
for no obvious reason.
If S390 marked it clean at *allocation* time instead of at
SetPageUptodate() time, that would also mean that the whole strange
special case for S390 in SetPageUptodate() would go away.
Hmm? Or is marking things clean so expensive that you generally don't want
to do it in the allocation path?
Anyway, I'll apply the patch, since for 2.6.21 this is clearly the
simplest solution, but
(a) I think it might be ugly
and
(b) are you sure that it doesn't introduce a new bug on S390, where some
page has been *removed* from the mappings, and should still trigger
the "page_test_and_clear_dirty()" test, but now, because it's done
inside the "if (page_mapped())" case, we miss it?
That said, in many ways, moving the whole "page_test_and_clear_dirty()"
thing inside the "page_mapped()" thing does seem to make conceptual sense
(since the only way it would become dirty in that way is if it's mapped),
so I don't mind the patch, I just worry about (b) a bit, and if we got rid
of the strange special code in S390 to SetPageUptodate() that would also
be nice.
Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [S390] page_mkclean data corruption.
2007-04-04 17:10 ` Linus Torvalds
@ 2007-04-04 17:35 ` Martin Schwidefsky
0 siblings, 0 replies; 3+ messages in thread
From: Martin Schwidefsky @ 2007-04-04 17:35 UTC (permalink / raw)
To: Linus Torvalds; +Cc: gregkh, linux-kernel, linux-s390, linux-mm
On Wed, 2007-04-04 at 10:10 -0700, Linus Torvalds wrote:
> Ok. I'm a bit worried about something like this, this late in the release
> cycle, but since I guess page_test_and_clear_dirty() is always 0 for any
> architecture but S390, I guess there are no possible downsides except for
> that architecture.
Yes, the change can only affect s390 since for all other architectures
page_test_and_clear_dirty is a nop.
> So I'll apply it, but:
>
> > The effect of the two changes is that for every call to
> > clear_page_dirty_for_io a page_test_and_clear_dirty is done. If
> > the per page dirty bit is set set_page_dirty is called. Strangly
> > clear_page_dirty_for_io is called for not-uptodate pages, e.g.
> > over this call-chain:
> >
> > [<000000000007c0f2>] clear_page_dirty_for_io+0x12a/0x130
> > [<000000000007c494>] generic_writepages+0x258/0x3e0
> > [<000000000007c692>] do_writepages+0x76/0x7c
> > [<00000000000c7a26>] __writeback_single_inode+0xba/0x3e4
> > [<00000000000c831a>] sync_sb_inodes+0x23e/0x398
> > [<00000000000c8802>] writeback_inodes+0x12e/0x140
> > [<000000000007b9ee>] wb_kupdate+0xd2/0x178
> > [<000000000007cca2>] pdflush+0x162/0x23c
> >
> > The bad news now is that page_test_and_clear_dirty might claim
> > that a not-uptodate page is dirty since SetPageUptodate which
> > resets the per page dirty bit has not yet been called. The page
> > writeback that follows clobbers the data on disk.
>
> Wouldn't it be best if S390 tried to avoid this by clearing the dirty bit
> whenever a new page is allocated?
We would love to but we cannot. The point is that I/O makes a page
dirty. We could clear the dirty bit on allocation time but the page-in
operation would make it dirty again and we'd have to make it clean AGAIN
in SetPageUptodate. The iske + sske instructions are in the range of
several 100 cycles, so they are quite expensive.
> Anyway, I'll apply the patch, since for 2.6.21 this is clearly the
> simplest solution, but
> (a) I think it might be ugly
> and
> (b) are you sure that it doesn't introduce a new bug on S390, where some
> page has been *removed* from the mappings, and should still trigger
> the "page_test_and_clear_dirty()" test, but now, because it's done
> inside the "if (page_mapped())" case, we miss it?
No, I'm very sure that this won't be the case. The per page dirty bit on
s390 is used as a replacement for the per pte dirty bits. We check a
single time after all the pte operations instead of doing it for every
pte. As long as there is a page_test_and_clear_dirty after the last pte
related to a page has been modified in page_mkclean or removed in
page_remove_rmap we are fine.
--
blue skies, IBM Deutschland Entwicklung GmbH
Martin Vorsitzender des Aufsichtsrats: Johann Weihen
Geschaftsfuhrung: Herbert Kircher
Martin Schwidefsky Sitz der Gesellschaft: Boblingen
Linux on zSeries Registergericht: Amtsgericht Stuttgart,
Development HRB 243294
"Reality continues to ruin my life." - Calvin.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2007-04-04 17:35 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-04 16:37 [S390] page_mkclean data corruption Martin Schwidefsky
2007-04-04 17:10 ` Linus Torvalds
2007-04-04 17:35 ` Martin Schwidefsky
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox