[patch] mm: bug in set_page_dirty

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [patch] mm: bug in set_page_dirty_buffers
@ 2006-10-10  2:36 Nick Piggin
  2006-10-10  3:06 ` Linus Torvalds
  2006-10-10  7:42 ` patch mm-bug-in-set_page_dirty_buffers.patch queued to -stable tree gregkh
  0 siblings, 2 replies; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  2:36 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds, Peter Zijlstra,
	Linux Memory Management List
  Cc: Greg KH

This was triggered, but not the fault of, the dirty page accounting
patches. Suitable for -stable as well, after it goes upstream.

Unable to handle kernel NULL pointer dereference at virtual address 0000004c
EIP is at _spin_lock+0x12/0x66
Call Trace:
 [<401766e7>] __set_page_dirty_buffers+0x15/0xc0
 [<401401e7>] set_page_dirty+0x2c/0x51
 [<40140db2>] set_page_dirty_balance+0xb/0x3b
 [<40145d29>] __do_fault+0x1d8/0x279
 [<40147059>] __handle_mm_fault+0x125/0x951
 [<401133f1>] do_page_fault+0x440/0x59f
 [<4034d0c1>] error_code+0x39/0x40
 [<08048a33>] 0x8048a33
 =======================

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -701,7 +701,10 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  */
 int __set_page_dirty_buffers(struct page *page)
 {
-	struct address_space * const mapping = page->mapping;
+	struct address_space * const mapping = page_mapping(page);
+
+	if (unlikely(!mapping))
+		return !TestSetPageDirty(page);
 
 	spin_lock(&mapping->private_lock);
 	if (page_has_buffers(page)) {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  2:36 [patch] mm: bug in set_page_dirty_buffers Nick Piggin
@ 2006-10-10  3:06 ` Linus Torvalds
  2006-10-10  3:19   ` Nick Piggin
  2006-10-10  3:20   ` Andrew Morton
  2006-10-10  7:42 ` patch mm-bug-in-set_page_dirty_buffers.patch queued to -stable tree gregkh
  1 sibling, 2 replies; 36+ messages in thread
From: Linus Torvalds @ 2006-10-10  3:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, 10 Oct 2006, Nick Piggin wrote:
>
> This was triggered, but not the fault of, the dirty page accounting
> patches. Suitable for -stable as well, after it goes upstream.

Applied. However, I wonder what protects "page_mapping()" here? I don't 
think we hold the page lock anywhere, so "page->mapping" can change at any 
time, no?

The worry might be that the mapping is truncated, page->mapping is set to 
NULL (but after we cached the old value in "mapping"), and then the 
"struct address_space" is released, so that when we do

	spin_lock(&mapping->private_lock);

we'd be accessing a stale pointer..

Hmm. I guess the mapping cannot become stale at least in _this_ case, 
since the page is mapped into the addess space and the mapping is thus 
pinned by the vma for normal file mappings.

But what happens for other cases where that isn't the situation, and the 
page is related to some other address space (swap, remap_file_pages, 
whatever..)?

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  3:06 ` Linus Torvalds
@ 2006-10-10  3:19   ` Nick Piggin
  2006-10-10  3:20   ` Andrew Morton
  1 sibling, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  3:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Peter Zijlstra, Linux Memory Management List, Greg KH

On Mon, Oct 09, 2006 at 08:06:05PM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 10 Oct 2006, Nick Piggin wrote:
> >
> > This was triggered, but not the fault of, the dirty page accounting
> > patches. Suitable for -stable as well, after it goes upstream.
> 
> Applied. However, I wonder what protects "page_mapping()" here? I don't 
> think we hold the page lock anywhere, so "page->mapping" can change at any 
> time, no?
> 
> The worry might be that the mapping is truncated, page->mapping is set to 
> NULL (but after we cached the old value in "mapping"), and then the 
> "struct address_space" is released, so that when we do
> 
> 	spin_lock(&mapping->private_lock);
> 
> we'd be accessing a stale pointer..
> 
> Hmm. I guess the mapping cannot become stale at least in _this_ case, 
> since the page is mapped into the addess space and the mapping is thus 
> pinned by the vma for normal file mappings.
> 
> But what happens for other cases where that isn't the situation, and the 
> page is related to some other address space (swap, remap_file_pages, 
> whatever..)?

We require that set_page_dirty only be called when it has the mapping
pinned. There are a few places that can't do this, so they have
set_page_dirty_lock which pins the mapping before calling down.

Aside, as you might remember a few months ago, we "discovered" that
lock_page needs the mapping pinned too, which gave rise to the lovely
lock_page_nosync special case!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  3:06 ` Linus Torvalds
  2006-10-10  3:19   ` Nick Piggin
@ 2006-10-10  3:20   ` Andrew Morton
  2006-10-10  3:34     ` Nick Piggin
  2006-10-10  3:37     ` Andrew Morton
  1 sibling, 2 replies; 36+ messages in thread
From: Andrew Morton @ 2006-10-10  3:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Peter Zijlstra, Linux Memory Management List, Greg KH

On Mon, 9 Oct 2006 20:06:05 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> On Tue, 10 Oct 2006, Nick Piggin wrote:
> >
> > This was triggered, but not the fault of, the dirty page accounting
> > patches. Suitable for -stable as well, after it goes upstream.
> 
> Applied. However, I wonder what protects "page_mapping()" here?

Nothing.  And I don't understand the (unchangelogged) switch from
page->mapping to page_mapping().

> I don't 
> think we hold the page lock anywhere, so "page->mapping" can change at any 
> time, no?

Yes.  The patch makes the race window a bit smaller.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  3:20   ` Andrew Morton
@ 2006-10-10  3:34     ` Nick Piggin
  2006-10-10  3:50       ` Andrew Morton
  2006-10-10  3:37     ` Andrew Morton
  1 sibling, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  3:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Mon, Oct 09, 2006 at 08:20:39PM -0700, Andrew Morton wrote:
> On Mon, 9 Oct 2006 20:06:05 -0700 (PDT)
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
> > On Tue, 10 Oct 2006, Nick Piggin wrote:
> > >
> > > This was triggered, but not the fault of, the dirty page accounting
> > > patches. Suitable for -stable as well, after it goes upstream.
> > 
> > Applied. However, I wonder what protects "page_mapping()" here?
> 
> Nothing.  And I don't understand the (unchangelogged) switch from
> page->mapping to page_mapping().

I did the switch because that is that its callers and
the other spd functions are using to find the mapping.

> > I don't 
> > think we hold the page lock anywhere, so "page->mapping" can change at any 
> > time, no?
> 
> Yes.  The patch makes the race window a bit smaller.

It fixes the problem. The mapping is already pinned at this point,
the problem is that page_mapping is still free to go NULL at any
time, and __set_page_dirty_buffers wasn't checking for that.

If there is another race, then it must be because the buffer code
cannot cope with dirty buffers against a truncated page. It is
kind of spaghetti, though. What stops set_page_dirty_buffers from
racing with block_invalidatepage, for example?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  3:34     ` Nick Piggin
@ 2006-10-10  3:50       ` Andrew Morton
  2006-10-10  3:58         ` Nick Piggin
  2006-10-10  4:11         ` Nick Piggin
  0 siblings, 2 replies; 36+ messages in thread
From: Andrew Morton @ 2006-10-10  3:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, 10 Oct 2006 05:34:12 +0200
Nick Piggin <npiggin@suse.de> wrote:

> On Mon, Oct 09, 2006 at 08:20:39PM -0700, Andrew Morton wrote:
> > On Mon, 9 Oct 2006 20:06:05 -0700 (PDT)
> > Linus Torvalds <torvalds@osdl.org> wrote:
> > 
> > > On Tue, 10 Oct 2006, Nick Piggin wrote:
> > > >
> > > > This was triggered, but not the fault of, the dirty page accounting
> > > > patches. Suitable for -stable as well, after it goes upstream.
> > > 
> > > Applied. However, I wonder what protects "page_mapping()" here?
> > 
> > Nothing.  And I don't understand the (unchangelogged) switch from
> > page->mapping to page_mapping().
> 
> I did the switch because that is that its callers and
> the other spd functions are using to find the mapping.

Maybe they're wrong?

> > > I don't 
> > > think we hold the page lock anywhere, so "page->mapping" can change at any 
> > > time, no?
> > 
> > Yes.  The patch makes the race window a bit smaller.
> 
> It fixes the problem. The mapping is already pinned at this point,

Needs comment.

> the problem is that page_mapping is still free to go NULL at any
> time, and __set_page_dirty_buffers wasn't checking for that.
> 
> If there is another race, then it must be because the buffer code
> cannot cope with dirty buffers against a truncated page. It is
> kind of spaghetti, though. What stops set_page_dirty_buffers from
> racing with block_invalidatepage, for example?

Nothing that I can think of.  We keep on adding calls to set_page_dirty()
against unlocked pages.  It would have been better to fix that one case in
the pte-unmapping path rather than adding heaps more.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  3:50       ` Andrew Morton
@ 2006-10-10  3:58         ` Nick Piggin
  2006-10-10  4:14           ` Andrew Morton
  2006-10-10  4:11         ` Nick Piggin
  1 sibling, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  3:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Mon, Oct 09, 2006 at 08:50:30PM -0700, Andrew Morton wrote:
> On Tue, 10 Oct 2006 05:34:12 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> > > page->mapping to page_mapping().
> > 
> > I did the switch because that is that its callers and
> > the other spd functions are using to find the mapping.
> 
> Maybe they're wrong?

Well swapcache pages and anonymous pages can have set_page_dirty run
against them. Granted in the current setup, those won't reach
set_page_dirty_buffers, so maybe it isn't needed.

> > the problem is that page_mapping is still free to go NULL at any
> > time, and __set_page_dirty_buffers wasn't checking for that.
> > 
> > If there is another race, then it must be because the buffer code
> > cannot cope with dirty buffers against a truncated page. It is
> > kind of spaghetti, though. What stops set_page_dirty_buffers from
> > racing with block_invalidatepage, for example?
> 
> Nothing that I can think of.  We keep on adding calls to set_page_dirty()
> against unlocked pages.  It would have been better to fix that one case in
> the pte-unmapping path rather than adding heaps more.

I haven't been adding any, so I would love to ;) I think there are
a whole lot more problems than just the unmapping path, though. Direct
IO comes to mind.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  3:58         ` Nick Piggin
@ 2006-10-10  4:14           ` Andrew Morton
  2006-10-10  4:21             ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2006-10-10  4:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, 10 Oct 2006 05:58:51 +0200
Nick Piggin <npiggin@suse.de> wrote:

> On Mon, Oct 09, 2006 at 08:50:30PM -0700, Andrew Morton wrote:
> > On Tue, 10 Oct 2006 05:34:12 +0200
> > Nick Piggin <npiggin@suse.de> wrote:
> > > > page->mapping to page_mapping().
> > > 
> > > I did the switch because that is that its callers and
> > > the other spd functions are using to find the mapping.
> > 
> > Maybe they're wrong?
> 
> Well swapcache pages and anonymous pages can have set_page_dirty run
> against them. Granted in the current setup, those won't reach
> set_page_dirty_buffers, so maybe it isn't needed.

Well it wasn't needed before.

> > > the problem is that page_mapping is still free to go NULL at any
> > > time, and __set_page_dirty_buffers wasn't checking for that.
> > > 
> > > If there is another race, then it must be because the buffer code
> > > cannot cope with dirty buffers against a truncated page. It is
> > > kind of spaghetti, though. What stops set_page_dirty_buffers from
> > > racing with block_invalidatepage, for example?
> > 
> > Nothing that I can think of.  We keep on adding calls to set_page_dirty()
> > against unlocked pages.  It would have been better to fix that one case in
> > the pte-unmapping path rather than adding heaps more.
> 
> I haven't been adding any, so I would love to ;)

Can we convert set_page_dirty_balance() to call set_page_dirty_lock()?

And make set_page_dirty_lock() return if the page is already dirty?

> I think there are
> a whole lot more problems than just the unmapping path, though. Direct
> IO comes to mind.

Why?  direct-io locks the pages while invalidating them, and while marking
them dirty.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  4:14           ` Andrew Morton
@ 2006-10-10  4:21             ` Nick Piggin
  2006-10-10  4:38               ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  4:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Mon, Oct 09, 2006 at 09:14:04PM -0700, Andrew Morton wrote:
> Can we convert set_page_dirty_balance() to call set_page_dirty_lock()?

I think so. You can't in zap_pte_range though because you're under
spinlocks. Same with try_to_unmap_{one|cluster}, and page_remove_rmap.

> And make set_page_dirty_lock() return if the page is already dirty?
> 
> > I think there are
> > a whole lot more problems than just the unmapping path, though. Direct
> > IO comes to mind.
> 
> Why?  direct-io locks the pages while invalidating them, and while marking
> them dirty.

Uh, mistaken about dio. Point still stands.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  4:21             ` Nick Piggin
@ 2006-10-10  4:38               ` Andrew Morton
  2006-10-10  4:47                 ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2006-10-10  4:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, 10 Oct 2006 06:21:44 +0200
Nick Piggin <npiggin@suse.de> wrote:

> On Mon, Oct 09, 2006 at 09:14:04PM -0700, Andrew Morton wrote:
> > Can we convert set_page_dirty_balance() to call set_page_dirty_lock()?
> 
> I think so. You can't in zap_pte_range though because you're under
> spinlocks.

There we're screwed.

> Same with try_to_unmap_{one|cluster}, and page_remove_rmap.

There we can trylock all the pages and bale if any fail.


> > And make set_page_dirty_lock() return if the page is already dirty?
> > 
> > > I think there are
> > > a whole lot more problems than just the unmapping path, though. Direct
> > > IO comes to mind.
> > 
> > Why?  direct-io locks the pages while invalidating them, and while marking
> > them dirty.
> 
> Uh, mistaken about dio. Point still stands.

But where?  locking the page is the preferred way to solve this stuff. 
(Well, locking the buffers might work, but isn't needed, and locking the
page covers other stuff)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  4:38               ` Andrew Morton
@ 2006-10-10  4:47                 ` Nick Piggin
  2006-10-10  5:01                   ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  4:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Mon, Oct 09, 2006 at 09:38:06PM -0700, Andrew Morton wrote:
> On Tue, 10 Oct 2006 06:21:44 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Mon, Oct 09, 2006 at 09:14:04PM -0700, Andrew Morton wrote:
> > > Can we convert set_page_dirty_balance() to call set_page_dirty_lock()?
> > 
> > I think so. You can't in zap_pte_range though because you're under
> > spinlocks.
> 
> There we're screwed.
> 
> > Same with try_to_unmap_{one|cluster}, and page_remove_rmap.
> 
> There we can trylock all the pages and bale if any fail.

Hmm, try_to_unmap is OK because the page is already locked. page_remove_rmap
isn't allowed to fail.

> > > And make set_page_dirty_lock() return if the page is already dirty?
> > > 
> > > > I think there are
> > > > a whole lot more problems than just the unmapping path, though. Direct
> > > > IO comes to mind.
> > > 
> > > Why?  direct-io locks the pages while invalidating them, and while marking
> > > them dirty.
> > 
> > Uh, mistaken about dio. Point still stands.
> 
> But where?  locking the page is the preferred way to solve this stuff. 
> (Well, locking the buffers might work, but isn't needed, and locking the
> page covers other stuff)

AFAIKS, it is just fs/buffer.c that is racy. Why can't it use
mapping->private_lock or the buffer bit spinlock?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  4:47                 ` Nick Piggin
@ 2006-10-10  5:01                   ` Andrew Morton
  2006-10-10  5:22                     ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2006-10-10  5:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, 10 Oct 2006 06:47:45 +0200
Nick Piggin <npiggin@suse.de> wrote:

> On Mon, Oct 09, 2006 at 09:38:06PM -0700, Andrew Morton wrote:
> > On Tue, 10 Oct 2006 06:21:44 +0200
> > Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > On Mon, Oct 09, 2006 at 09:14:04PM -0700, Andrew Morton wrote:
> > > > Can we convert set_page_dirty_balance() to call set_page_dirty_lock()?
> > > 
> > > I think so. You can't in zap_pte_range though because you're under
> > > spinlocks.
> > 
> > There we're screwed.
> > 
> > > Same with try_to_unmap_{one|cluster}, and page_remove_rmap.
> > 
> > There we can trylock all the pages and bale if any fail.
> 
> Hmm, try_to_unmap is OK because the page is already locked. page_remove_rmap
> isn't allowed to fail.

I was talking about try_to_unmap_cluster().

> > > > And make set_page_dirty_lock() return if the page is already dirty?
> > > > 
> > > > > I think there are
> > > > > a whole lot more problems than just the unmapping path, though. Direct
> > > > > IO comes to mind.
> > > > 
> > > > Why?  direct-io locks the pages while invalidating them, and while marking
> > > > them dirty.
> > > 
> > > Uh, mistaken about dio. Point still stands.
> > 
> > But where?  locking the page is the preferred way to solve this stuff. 
> > (Well, locking the buffers might work, but isn't needed, and locking the
> > page covers other stuff)
> 
> AFAIKS, it is just fs/buffer.c that is racy.

Need to review all ->set_page_dirty, ->writepage, ->invalidatepage, ->etc
implementations before we can say that.

> Why can't it use
> mapping->private_lock or the buffer bit spinlock?

block_invalidatepage() wants to do lock_buffer().

It can probably be made to work.  But a sane interface is "when dinking
with page internals, lock the page".

Which leaves us with zap_pte_range().  Perhaps one could reuse
mmu_gather.pages[]: when it's full, drop locks, dirty pages (if needed),
retake locks, resume.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  5:01                   ` Andrew Morton
@ 2006-10-10  5:22                     ` Nick Piggin
  2006-10-10  5:29                       ` Andrew Morton
  2006-10-10  6:48                       ` Peter Zijlstra
  0 siblings, 2 replies; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  5:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Mon, Oct 09, 2006 at 10:01:27PM -0700, Andrew Morton wrote:
> On Tue, 10 Oct 2006 06:47:45 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> > > There we can trylock all the pages and bale if any fail.
> > 
> > Hmm, try_to_unmap is OK because the page is already locked. page_remove_rmap
> > isn't allowed to fail.
> 
> I was talking about try_to_unmap_cluster().

But page_remove_rmap's many callers are still screwed. Take do_wp_page,
for example.

> > > But where?  locking the page is the preferred way to solve this stuff. 
> > > (Well, locking the buffers might work, but isn't needed, and locking the
> > > page covers other stuff)
> > 
> > AFAIKS, it is just fs/buffer.c that is racy.
> 
> Need to review all ->set_page_dirty, ->writepage, ->invalidatepage, ->etc
> implementations before we can say that.
> 
> > Why can't it use
> > mapping->private_lock or the buffer bit spinlock?
> 
> block_invalidatepage() wants to do lock_buffer().
> 
> It can probably be made to work.  But a sane interface is "when dinking
> with page internals, lock the page".

I disagree because it will lead to horrible hacks because many callers
can't sleep. If anything I would much prefer an innermost-spinlock in
page->flags that specifically excludes truncate. Actually tree_lock can
do that now, provided we pin mapping in all callers to set_page_dirty
(which we should do).

Then the locking protocol is up to fs/buffer.c. You could set a bit in
the buffer "BH_Invalidated" in truncate before clearing dirty, and test
for that bit in set_page_dirty_buffers?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  5:22                     ` Nick Piggin
@ 2006-10-10  5:29                       ` Andrew Morton
  2006-10-10  5:48                         ` Nick Piggin
  2006-10-10  6:48                       ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2006-10-10  5:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, 10 Oct 2006 07:22:48 +0200
Nick Piggin <npiggin@suse.de> wrote:

> On Mon, Oct 09, 2006 at 10:01:27PM -0700, Andrew Morton wrote:
> > On Tue, 10 Oct 2006 06:47:45 +0200
> > Nick Piggin <npiggin@suse.de> wrote:
> > > > There we can trylock all the pages and bale if any fail.
> > > 
> > > Hmm, try_to_unmap is OK because the page is already locked. page_remove_rmap
> > > isn't allowed to fail.
> > 
> > I was talking about try_to_unmap_cluster().
> 
> But page_remove_rmap's many callers are still screwed. Take do_wp_page,
> for example.
> 
> > > > But where?  locking the page is the preferred way to solve this stuff. 
> > > > (Well, locking the buffers might work, but isn't needed, and locking the
> > > > page covers other stuff)
> > > 
> > > AFAIKS, it is just fs/buffer.c that is racy.
> > 
> > Need to review all ->set_page_dirty, ->writepage, ->invalidatepage, ->etc
> > implementations before we can say that.

    ^^^ this

> > > Why can't it use
> > > mapping->private_lock or the buffer bit spinlock?
> > 
> > block_invalidatepage() wants to do lock_buffer().
> > 
> > It can probably be made to work.  But a sane interface is "when dinking
> > with page internals, lock the page".
> 
> I disagree because it will lead to horrible hacks because many callers
> can't sleep. If anything I would much prefer an innermost-spinlock in
> page->flags that specifically excludes truncate. Actually tree_lock can
> do that now, provided we pin mapping in all callers to set_page_dirty
> (which we should do).
> 
> Then the locking protocol is up to fs/buffer.c. You could set a bit in
> the buffer "BH_Invalidated" in truncate before clearing dirty, and test
> for that bit in set_page_dirty_buffers?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  5:29                       ` Andrew Morton
@ 2006-10-10  5:48                         ` Nick Piggin
  2006-10-10  6:08                           ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  5:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Mon, Oct 09, 2006 at 10:29:05PM -0700, Andrew Morton wrote:
> On Tue, 10 Oct 2006 07:22:48 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > > > AFAIKS, it is just fs/buffer.c that is racy.
> > > 
> > > Need to review all ->set_page_dirty, ->writepage, ->invalidatepage, ->etc
> > > implementations before we can say that.
> 
>     ^^^ this

->writepage is called under page lock.

->invalidatepage is called under page lock.

I think ->spd is the only one to worry about.

__set_page_dirty_nobuffers does use the tree_lock to ensure it hasn't been
truncated. However it doe leave orphaned clean buffers which vmscan cannot
reclaim. So probably we should not dirty it *until* we have verified it
is still part of a mapping.

Similarly for __set_page_dirty_buffers.

Comments in ext3 look like it has spd under control.
Reiserfs looks similar, but it does have the unchecked
page->mapping problem that I fixed for spd_buffers.

Am I missing something?

> > I disagree because it will lead to horrible hacks because many callers
> > can't sleep. If anything I would much prefer an innermost-spinlock in
> > page->flags that specifically excludes truncate. Actually tree_lock can
> > do that now, provided we pin mapping in all callers to set_page_dirty
> > (which we should do).
> > 
> > Then the locking protocol is up to fs/buffer.c. You could set a bit in
> > the buffer "BH_Invalidated" in truncate before clearing dirty, and test
> > for that bit in set_page_dirty_buffers?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  5:48                         ` Nick Piggin
@ 2006-10-10  6:08                           ` Andrew Morton
  2006-10-10  6:19                             ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2006-10-10  6:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, 10 Oct 2006 07:48:33 +0200
Nick Piggin <npiggin@suse.de> wrote:

> On Mon, Oct 09, 2006 at 10:29:05PM -0700, Andrew Morton wrote:
> > On Tue, 10 Oct 2006 07:22:48 +0200
> > Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > > > AFAIKS, it is just fs/buffer.c that is racy.
> > > > 
> > > > Need to review all ->set_page_dirty, ->writepage, ->invalidatepage, ->etc
> > > > implementations before we can say that.
> > 
> >     ^^^ this
> 
> ->writepage is called under page lock.
> 
> ->invalidatepage is called under page lock.
> 
> I think ->spd is the only one to worry about.
> 
> __set_page_dirty_nobuffers does use the tree_lock to ensure it hasn't been
> truncated. However it doe leave orphaned clean buffers which vmscan cannot
> reclaim. So probably we should not dirty it *until* we have verified it
> is still part of a mapping.
> 
> Similarly for __set_page_dirty_buffers.
> 
> Comments in ext3 look like it has spd under control.
> Reiserfs looks similar, but it does have the unchecked
> page->mapping problem that I fixed for spd_buffers.
> 
> Am I missing something?

Well it's a matter of reviewing all codepaths in the kernel which
manipulate internal page state and see if they're racy against their
->set_page_dirty().  All because of zap_pte_range().

The page lock protects internal page state.  It'd be better to fix
zap_pte_range().

How about we trylock the page and if that fails, back out and drop locks
and lock the page then dirty it and then resume the zap?  Negligible
overhead, would be nice and simple apart from that i_mmap_lock thing.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  6:08                           ` Andrew Morton
@ 2006-10-10  6:19                             ` Nick Piggin
  2006-10-10  6:27                               ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  6:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Mon, Oct 09, 2006 at 11:08:32PM -0700, Andrew Morton wrote:
> On Tue, 10 Oct 2006 07:48:33 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> > 
> > Am I missing something?
> 
> Well it's a matter of reviewing all codepaths in the kernel which
> manipulate internal page state and see if they're racy against their
> ->set_page_dirty().  All because of zap_pte_range().

And page_remove_rmap.

> The page lock protects internal page state.  It'd be better to fix
> zap_pte_range().
> 
> How about we trylock the page and if that fails, back out and drop locks
> and lock the page then dirty it and then resume the zap?  Negligible
> overhead, would be nice and simple apart from that i_mmap_lock thing.

What about page_remove_rmap?

I don't see why. This has been the documented behaviour for ages, and
it seems to be made fairly clear in comments around mm/ and filesystems.
Considering the only nontrivial ->spds are those which set PageChecked
as well, I don't see why there is much to audit (other than fs/buffer.c).

Not that I think it would be a bad idea for filesystems writers to audit
carefully against truncate, because that's been screwed up in the VM for
so long...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  6:19                             ` Nick Piggin
@ 2006-10-10  6:27                               ` Andrew Morton
  2006-10-10  6:39                                 ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2006-10-10  6:27 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, 10 Oct 2006 08:19:58 +0200
Nick Piggin <npiggin@suse.de> wrote:

> On Mon, Oct 09, 2006 at 11:08:32PM -0700, Andrew Morton wrote:
> > On Tue, 10 Oct 2006 07:48:33 +0200
> > Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > Am I missing something?
> > 
> > Well it's a matter of reviewing all codepaths in the kernel which
> > manipulate internal page state and see if they're racy against their
> > ->set_page_dirty().  All because of zap_pte_range().
> 
> And page_remove_rmap.

page_remove_rmap()'s call to set_page_dirty().  Same thing.

> > The page lock protects internal page state.  It'd be better to fix
> > zap_pte_range().
> > 
> > How about we trylock the page and if that fails, back out and drop locks
> > and lock the page then dirty it and then resume the zap?  Negligible
> > overhead, would be nice and simple apart from that i_mmap_lock thing.
> 
> What about page_remove_rmap?
> 
> I don't see why. This has been the documented behaviour for ages, and
> it seems to be made fairly clear in comments around mm/ and filesystems.
> Considering the only nontrivial ->spds are those which set PageChecked
> as well, I don't see why there is much to audit (other than fs/buffer.c).

Which approach is a good design?

> Not that I think it would be a bad idea for filesystems writers to audit
> carefully against truncate,

Good luck with that.  It needs to be done for them.

> because that's been screwed up in the VM for
> so long...

What has?  Please be specific.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  6:27                               ` Andrew Morton
@ 2006-10-10  6:39                                 ` Nick Piggin
  2006-10-10  6:52                                   ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Mon, Oct 09, 2006 at 11:27:14PM -0700, Andrew Morton wrote:
> On Tue, 10 Oct 2006 08:19:58 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Mon, Oct 09, 2006 at 11:08:32PM -0700, Andrew Morton wrote:
> > > On Tue, 10 Oct 2006 07:48:33 +0200
> > > Nick Piggin <npiggin@suse.de> wrote:
> > > > 
> > > > Am I missing something?
> > > 
> > > Well it's a matter of reviewing all codepaths in the kernel which
> > > manipulate internal page state and see if they're racy against their
> > > ->set_page_dirty().  All because of zap_pte_range().
> > 
> > And page_remove_rmap.
> 
> page_remove_rmap()'s call to set_page_dirty().  Same thing.

But harder to fix.

> > > The page lock protects internal page state.  It'd be better to fix
> > > zap_pte_range().
> > > 
> > > How about we trylock the page and if that fails, back out and drop locks
> > > and lock the page then dirty it and then resume the zap?  Negligible
> > > overhead, would be nice and simple apart from that i_mmap_lock thing.
> > 
> > What about page_remove_rmap?
> > 
> > I don't see why. This has been the documented behaviour for ages, and
> > it seems to be made fairly clear in comments around mm/ and filesystems.
> > Considering the only nontrivial ->spds are those which set PageChecked
> > as well, I don't see why there is much to audit (other than fs/buffer.c).
> 
> Which approach is a good design?

As far as the memory manager goes, not requiring sleeping, outermost locks
to be taken deep is a good design (and trylocks, backing out and retrying,
etc).

I don't know so much about the filesystem side of it, but nothing has changed
with regard to set_page_dirty in the mm for a long time, so I'd prefer not
to put these hacks in because we're worried filesystem code has screwed up
and don't want to audit it.

> 
> > Not that I think it would be a bad idea for filesystems writers to audit
> > carefully against truncate,
> 
> Good luck with that.  It needs to be done for them.

I agree, and having set_page_dirty callers hold the page lock wouldn't
help with that.

As far as set_page_dirty races goes, I am having a bit of a look at that,
but it would still require filesystems people to have a look.

> > because that's been screwed up in the VM for
> > so long...
> 
> What has?  Please be specific.

truncate/invalidate races. It still is.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  6:39                                 ` Nick Piggin
@ 2006-10-10  6:52                                   ` Nick Piggin
  2006-10-10  7:06                                     ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  6:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, Oct 10, 2006 at 08:39:00AM +0200, Nick Piggin wrote:
> As far as set_page_dirty races goes, I am having a bit of a look at that,
> but it would still require filesystems people to have a look.

I'm thinking something along the lines of this (untested) patch.

Some internal filesystem path might still be racy against
set_page_dirty_buffers, but that might equally be a path that
doesn't hold the page lock.

--
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -701,10 +701,11 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  */
 int __set_page_dirty_buffers(struct page *page)
 {
+	int ret;
 	struct address_space * const mapping = page_mapping(page);
 
 	if (unlikely(!mapping))
-		return !TestSetPageDirty(page);
+		return 0;
 
 	spin_lock(&mapping->private_lock);
 	if (page_has_buffers(page)) {
@@ -712,26 +713,26 @@ int __set_page_dirty_buffers(struct page
 		struct buffer_head *bh = head;
 
 		do {
-			set_buffer_dirty(bh);
+			if (!buffer_invalid(bh))
+				set_buffer_dirty(bh);
 			bh = bh->b_this_page;
 		} while (bh != head);
 	}
 	spin_unlock(&mapping->private_lock);
 
-	if (!TestSetPageDirty(page)) {
-		write_lock_irq(&mapping->tree_lock);
-		if (page->mapping) {	/* Race with truncate? */
-			if (mapping_cap_account_dirty(mapping))
-				__inc_zone_page_state(page, NR_FILE_DIRTY);
-			radix_tree_tag_set(&mapping->page_tree,
-						page_index(page),
-						PAGECACHE_TAG_DIRTY);
-		}
-		write_unlock_irq(&mapping->tree_lock);
-		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-		return 1;
+	ret = 0;
+	write_lock_irq(&mapping->tree_lock);
+	if (page->mapping) {	/* Race with truncate? */
+		if (mapping_cap_account_dirty(mapping))
+			__inc_zone_page_state(page, NR_FILE_DIRTY);
+		radix_tree_tag_set(&mapping->page_tree,
+					page_index(page), PAGECACHE_TAG_DIRTY);
+		ret = !TestSetPageDirty(page);
 	}
-	return 0;
+	write_unlock_irq(&mapping->tree_lock);
+	if (ret)
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+	return ret;
 }
 EXPORT_SYMBOL(__set_page_dirty_buffers);
 
@@ -1407,7 +1408,6 @@ EXPORT_SYMBOL(set_bh_page);
 static void discard_buffer(struct buffer_head * bh)
 {
 	lock_buffer(bh);
-	clear_buffer_dirty(bh);
 	bh->b_bdev = NULL;
 	clear_buffer_mapped(bh);
 	clear_buffer_req(bh);
@@ -1433,15 +1433,15 @@ static void discard_buffer(struct buffer
  */
 void block_invalidatepage(struct page *page, unsigned long offset)
 {
-	struct address_space *mapping;
+	struct address_space *mapping = page->mapping;
 	struct buffer_head *head, *bh, *next;
-	unsigned int curr_off = 0;
+	unsigned int curr_off;
 
 	BUG_ON(!PageLocked(page));
-	spin_lock(&mapping->private_lock);
 	if (!page_has_buffers(page))
 		goto out;
 
+	curr_off = 0;
 	head = page_buffers(page);
 	bh = head;
 	do {
@@ -1456,6 +1456,23 @@ void block_invalidatepage(struct page *p
 		curr_off = next_off;
 		bh = next;
 	} while (bh != head);
+
+	/* strip the dirty bits and protect against concurrent set_page_dirty */
+	spin_lock(&mapping->private_lock);
+	curr_off = 0;
+	head = page_buffers(page);
+	bh = head;
+	do {
+		unsigned int next_off = curr_off + bh->b_size;
+		next = bh->b_this_page;
+
+		if (offset <= curr_off) {
+			clear_buffer_dirty(bh);
+			set_buffer_invalid(bh);
+		}
+		curr_off = next_off;
+		bh = next;
+	} while (bh != head);
 	spin_unlock(&mapping->private_lock);
 
 	/*
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -18,6 +18,7 @@
 
 enum bh_state_bits {
 	BH_Uptodate,	/* Contains valid data */
+	BH_Invalid,	/* Has been truncated/invalidated */
 	BH_Dirty,	/* Is dirty */
 	BH_Lock,	/* Is locked */
 	BH_Req,		/* Has been submitted for I/O */
@@ -109,6 +110,7 @@ static inline int test_clear_buffer_##na
  * do something in addition to setting a b_state bit.
  */
 BUFFER_FNS(Uptodate, uptodate)
+BUFFER_FNS(Invalid, invalid)
 BUFFER_FNS(Dirty, dirty)
 TAS_BUFFER_FNS(Dirty, dirty)
 BUFFER_FNS(Lock, locked)
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -757,30 +757,36 @@ EXPORT_SYMBOL(write_one_page);
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
-	if (!TestSetPageDirty(page)) {
-		struct address_space *mapping = page_mapping(page);
+	struct address_space *mapping;
+
+	if (PageDirty(page))
+		return 0;
+
+	mapping = page_mapping(page);
+	if (mapping) { /* Race with truncate? */
+		int ret;
 		struct address_space *mapping2;
 
-		if (mapping) {
-			write_lock_irq(&mapping->tree_lock);
-			mapping2 = page_mapping(page);
-			if (mapping2) { /* Race with truncate? */
-				BUG_ON(mapping2 != mapping);
-				if (mapping_cap_account_dirty(mapping))
-					__inc_zone_page_state(page,
-								NR_FILE_DIRTY);
-				radix_tree_tag_set(&mapping->page_tree,
+		ret = 0;
+		write_lock_irq(&mapping->tree_lock);
+		mapping2 = page_mapping(page);
+		if (mapping2 && !TestSetPageDirty(page)) {
+			BUG_ON(mapping2 != mapping);
+			if (mapping_cap_account_dirty(mapping))
+				__inc_zone_page_state(page, NR_FILE_DIRTY);
+			radix_tree_tag_set(&mapping->page_tree,
 					page_index(page), PAGECACHE_TAG_DIRTY);
-			}
-			write_unlock_irq(&mapping->tree_lock);
-			if (mapping->host) {
-				/* !PageAnon && !swapper_space */
-				__mark_inode_dirty(mapping->host,
-							I_DIRTY_PAGES);
-			}
+			ret = 1;
 		}
-		return 1;
+		write_unlock_irq(&mapping->tree_lock);
+		if (ret) {
+			/* !PageAnon && !swapper_space */
+			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+		}
+		return ret;
 	}
+
+	/* Don't bother dirtying truncated pages */
 	return 0;
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -341,6 +341,14 @@ static pageout_t pageout(struct page *pa
 				return PAGE_CLEAN;
 			}
 		}
+		/*
+		 * Truncate/invalidate clears dirty, and it shouldn't get dirty
+		 * again (unless SetPageDirty is used instead of set_page_dirty,
+		 * so this will have some false positives)
+		 */
+		if (unlikely(PageDirty(page)))
+			printk("%s: dirty orphaned page\n", __FUNCTION__);
+
 		return PAGE_KEEP;
 	}
 	if (mapping->a_ops->writepage == NULL)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  6:52                                   ` Nick Piggin
@ 2006-10-10  7:06                                     ` Andrew Morton
  2006-10-10  7:21                                       ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2006-10-10  7:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, 10 Oct 2006 08:52:17 +0200
Nick Piggin <npiggin@suse.de> wrote:

> On Tue, Oct 10, 2006 at 08:39:00AM +0200, Nick Piggin wrote:
> > As far as set_page_dirty races goes, I am having a bit of a look at that,
> > but it would still require filesystems people to have a look.
> 
> I'm thinking something along the lines of this (untested) patch.

ho hum.

>  void block_invalidatepage(struct page *page, unsigned long offset)
>  {
> -	struct address_space *mapping;
> +	struct address_space *mapping = page->mapping;
>  	struct buffer_head *head, *bh, *next;
> -	unsigned int curr_off = 0;
> +	unsigned int curr_off;
>  
>  	BUG_ON(!PageLocked(page));
> -	spin_lock(&mapping->private_lock);

block_invalidatepage() doesn't take ->private_lock.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  7:06                                     ` Andrew Morton
@ 2006-10-10  7:21                                       ` Nick Piggin
  2006-10-10  8:07                                         ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  7:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, Oct 10, 2006 at 12:06:52AM -0700, Andrew Morton wrote:
> On Tue, 10 Oct 2006 08:52:17 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Tue, Oct 10, 2006 at 08:39:00AM +0200, Nick Piggin wrote:
> > > As far as set_page_dirty races goes, I am having a bit of a look at that,
> > > but it would still require filesystems people to have a look.
> > 
> > I'm thinking something along the lines of this (untested) patch.
> 
> ho hum.
> 
> >  void block_invalidatepage(struct page *page, unsigned long offset)
> >  {
> > -	struct address_space *mapping;
> > +	struct address_space *mapping = page->mapping;
> >  	struct buffer_head *head, *bh, *next;
> > -	unsigned int curr_off = 0;
> > +	unsigned int curr_off;
> >  
> >  	BUG_ON(!PageLocked(page));
> > -	spin_lock(&mapping->private_lock);
> 
> block_invalidatepage() doesn't take ->private_lock.

Err, sorry, quilt leakage. This one should be better.
--

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -701,10 +701,11 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  */
 int __set_page_dirty_buffers(struct page *page)
 {
+	int ret;
 	struct address_space * const mapping = page_mapping(page);
 
 	if (unlikely(!mapping))
-		return !TestSetPageDirty(page);
+		return 0;
 
 	spin_lock(&mapping->private_lock);
 	if (page_has_buffers(page)) {
@@ -712,26 +713,26 @@ int __set_page_dirty_buffers(struct page
 		struct buffer_head *bh = head;
 
 		do {
-			set_buffer_dirty(bh);
+			if (!buffer_invalid(bh))
+				set_buffer_dirty(bh);
 			bh = bh->b_this_page;
 		} while (bh != head);
 	}
 	spin_unlock(&mapping->private_lock);
 
-	if (!TestSetPageDirty(page)) {
-		write_lock_irq(&mapping->tree_lock);
-		if (page->mapping) {	/* Race with truncate? */
-			if (mapping_cap_account_dirty(mapping))
-				__inc_zone_page_state(page, NR_FILE_DIRTY);
-			radix_tree_tag_set(&mapping->page_tree,
-						page_index(page),
-						PAGECACHE_TAG_DIRTY);
-		}
-		write_unlock_irq(&mapping->tree_lock);
-		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-		return 1;
+	ret = 0;
+	write_lock_irq(&mapping->tree_lock);
+	if (page->mapping) {	/* Race with truncate? */
+		if (mapping_cap_account_dirty(mapping))
+			__inc_zone_page_state(page, NR_FILE_DIRTY);
+		radix_tree_tag_set(&mapping->page_tree,
+					page_index(page), PAGECACHE_TAG_DIRTY);
+		ret = !TestSetPageDirty(page);
 	}
-	return 0;
+	write_unlock_irq(&mapping->tree_lock);
+	if (ret)
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+	return ret;
 }
 EXPORT_SYMBOL(__set_page_dirty_buffers);
 
@@ -1407,7 +1408,6 @@ EXPORT_SYMBOL(set_bh_page);
 static void discard_buffer(struct buffer_head * bh)
 {
 	lock_buffer(bh);
-	clear_buffer_dirty(bh);
 	bh->b_bdev = NULL;
 	clear_buffer_mapped(bh);
 	clear_buffer_req(bh);
@@ -1433,13 +1433,15 @@ static void discard_buffer(struct buffer
  */
 void block_invalidatepage(struct page *page, unsigned long offset)
 {
+ 	struct address_space *mapping = page->mapping;
 	struct buffer_head *head, *bh, *next;
-	unsigned int curr_off = 0;
+	unsigned int curr_off;
 
 	BUG_ON(!PageLocked(page));
 	if (!page_has_buffers(page))
 		goto out;
 
+	curr_off = 0;
 	head = page_buffers(page);
 	bh = head;
 	do {
@@ -1455,6 +1457,24 @@ void block_invalidatepage(struct page *p
 		bh = next;
 	} while (bh != head);
 
+	/* strip the dirty bits and protect against concurrent set_page_dirty */
+	spin_lock(&mapping->private_lock);
+	curr_off = 0;
+	head = page_buffers(page);
+	bh = head;
+	do {
+		unsigned int next_off = curr_off + bh->b_size;
+		next = bh->b_this_page;
+
+		if (offset <= curr_off) {
+			clear_buffer_dirty(bh);
+			set_buffer_invalid(bh);
+		}
+		curr_off = next_off;
+		bh = next;
+	} while (bh != head);
+	spin_unlock(&mapping->private_lock);
+
 	/*
 	 * We release buffers only if the entire page is being invalidated.
 	 * The get_block cached value has been unconditionally invalidated,
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -18,6 +18,7 @@
 
 enum bh_state_bits {
 	BH_Uptodate,	/* Contains valid data */
+	BH_Invalid,	/* Has been truncated/invalidated */
 	BH_Dirty,	/* Is dirty */
 	BH_Lock,	/* Is locked */
 	BH_Req,		/* Has been submitted for I/O */
@@ -109,6 +110,7 @@ static inline int test_clear_buffer_##na
  * do something in addition to setting a b_state bit.
  */
 BUFFER_FNS(Uptodate, uptodate)
+BUFFER_FNS(Invalid, invalid)
 BUFFER_FNS(Dirty, dirty)
 TAS_BUFFER_FNS(Dirty, dirty)
 BUFFER_FNS(Lock, locked)
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -757,30 +757,36 @@ EXPORT_SYMBOL(write_one_page);
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
-	if (!TestSetPageDirty(page)) {
-		struct address_space *mapping = page_mapping(page);
+	struct address_space *mapping;
+
+	if (PageDirty(page))
+		return 0;
+
+	mapping = page_mapping(page);
+	if (mapping) { /* Race with truncate? */
+		int ret;
 		struct address_space *mapping2;
 
-		if (mapping) {
-			write_lock_irq(&mapping->tree_lock);
-			mapping2 = page_mapping(page);
-			if (mapping2) { /* Race with truncate? */
-				BUG_ON(mapping2 != mapping);
-				if (mapping_cap_account_dirty(mapping))
-					__inc_zone_page_state(page,
-								NR_FILE_DIRTY);
-				radix_tree_tag_set(&mapping->page_tree,
+		ret = 0;
+		write_lock_irq(&mapping->tree_lock);
+		mapping2 = page_mapping(page);
+		if (mapping2 && !TestSetPageDirty(page)) {
+			BUG_ON(mapping2 != mapping);
+			if (mapping_cap_account_dirty(mapping))
+				__inc_zone_page_state(page, NR_FILE_DIRTY);
+			radix_tree_tag_set(&mapping->page_tree,
 					page_index(page), PAGECACHE_TAG_DIRTY);
-			}
-			write_unlock_irq(&mapping->tree_lock);
-			if (mapping->host) {
-				/* !PageAnon && !swapper_space */
-				__mark_inode_dirty(mapping->host,
-							I_DIRTY_PAGES);
-			}
+			ret = 1;
 		}
-		return 1;
+		write_unlock_irq(&mapping->tree_lock);
+		if (ret) {
+			/* !PageAnon && !swapper_space */
+			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+		}
+		return ret;
 	}
+
+	/* Don't bother dirtying truncated pages */
 	return 0;
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -341,6 +341,14 @@ static pageout_t pageout(struct page *pa
 				return PAGE_CLEAN;
 			}
 		}
+		/*
+		 * Truncate/invalidate clears dirty, and it shouldn't get dirty
+		 * again (unless SetPageDirty is used instead of set_page_dirty,
+		 * so this will have some false positives)
+		 */
+		if (unlikely(PageDirty(page)))
+			printk("%s: dirty orphaned page\n", __FUNCTION__);
+
 		return PAGE_KEEP;
 	}
 	if (mapping->a_ops->writepage == NULL)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  7:21                                       ` Nick Piggin
@ 2006-10-10  8:07                                         ` Andrew Morton
  2006-10-10  8:18                                           ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2006-10-10  8:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, 10 Oct 2006 09:21:29 +0200
Nick Piggin <npiggin@suse.de> wrote:

>  void block_invalidatepage(struct page *page, unsigned long offset)
>  {
> + 	struct address_space *mapping = page->mapping;
>  	struct buffer_head *head, *bh, *next;
> -	unsigned int curr_off = 0;
> +	unsigned int curr_off;
>  
>  	BUG_ON(!PageLocked(page));
>  	if (!page_has_buffers(page))
>  		goto out;
>  
> +	curr_off = 0;
>  	head = page_buffers(page);
>  	bh = head;
>  	do {
> @@ -1455,6 +1457,24 @@ void block_invalidatepage(struct page *p
>  		bh = next;
>  	} while (bh != head);
>  
> +	/* strip the dirty bits and protect against concurrent set_page_dirty */
> +	spin_lock(&mapping->private_lock);
> +	curr_off = 0;
> +	head = page_buffers(page);
> +	bh = head;
> +	do {
> +		unsigned int next_off = curr_off + bh->b_size;
> +		next = bh->b_this_page;
> +
> +		if (offset <= curr_off) {
> +			clear_buffer_dirty(bh);
> +			set_buffer_invalid(bh);
> +		}
> +		curr_off = next_off;
> +		bh = next;
> +	} while (bh != head);
> +	spin_unlock(&mapping->private_lock);

If the buffer's redirtied after discard_buffer() got at it, we've got some
nasty problems in there.

Are you sure this race can happen?  Nobody's allowed to have a page mapped
while it's undergoing truncation (vmtruncate()).

There might be a problem with the final blocks in the page outside i_size. 
iirc what happens here is that the bh outside i_size _is_ marked dirty, but
writepage() will notice that it's outside i_size and will just mark it
clean again without doing IO.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  8:07                                         ` Andrew Morton
@ 2006-10-10  8:18                                           ` Nick Piggin
  2006-10-10  8:41                                             ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  8:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, Oct 10, 2006 at 01:07:42AM -0700, Andrew Morton wrote:
> On Tue, 10 Oct 2006 09:21:29 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> 
> >  void block_invalidatepage(struct page *page, unsigned long offset)
> >  {
> > + 	struct address_space *mapping = page->mapping;
> >  	struct buffer_head *head, *bh, *next;
> > -	unsigned int curr_off = 0;
> > +	unsigned int curr_off;
> >  
> >  	BUG_ON(!PageLocked(page));
> >  	if (!page_has_buffers(page))
> >  		goto out;
> >  
> > +	curr_off = 0;
> >  	head = page_buffers(page);
> >  	bh = head;
> >  	do {
> > @@ -1455,6 +1457,24 @@ void block_invalidatepage(struct page *p
> >  		bh = next;
> >  	} while (bh != head);
> >  
> > +	/* strip the dirty bits and protect against concurrent set_page_dirty */
> > +	spin_lock(&mapping->private_lock);
> > +	curr_off = 0;
> > +	head = page_buffers(page);
> > +	bh = head;
> > +	do {
> > +		unsigned int next_off = curr_off + bh->b_size;
> > +		next = bh->b_this_page;
> > +
> > +		if (offset <= curr_off) {
> > +			clear_buffer_dirty(bh);
> > +			set_buffer_invalid(bh);
> > +		}
> > +		curr_off = next_off;
> > +		bh = next;
> > +	} while (bh != head);
> > +	spin_unlock(&mapping->private_lock);
> 
> If the buffer's redirtied after discard_buffer() got at it, we've got some
> nasty problems in there.
> 
> Are you sure this race can happen?  Nobody's allowed to have a page mapped
> while it's undergoing truncation (vmtruncate()).

Well, not technically. I think it can happen with nonlinear pages now,
but that is a bug in truncate and I have some patches to fix them.

But anyone who has done a get_user_pages, AFAIKS, can later run a
set_page_dirty on the pages. Nasty problem, but I think we are OK to
just ignore the dirty bits in this case, because the truncate might
well have happened _after_ the get_user_pages guy set the page dirty
anyway.

> There might be a problem with the final blocks in the page outside i_size. 
> iirc what happens here is that the bh outside i_size _is_ marked dirty, but
> writepage() will notice that it's outside i_size and will just mark it
> clean again without doing IO.

Didn't think of that. How does it get to writepage though, if it has
lost its mapping?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  8:18                                           ` Nick Piggin
@ 2006-10-10  8:41                                             ` Andrew Morton
  2006-10-10  8:49                                               ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2006-10-10  8:41 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, 10 Oct 2006 10:18:20 +0200
Nick Piggin <npiggin@suse.de> wrote:

> > If the buffer's redirtied after discard_buffer() got at it, we've got some
> > nasty problems in there.
> > 
> > Are you sure this race can happen?  Nobody's allowed to have a page mapped
> > while it's undergoing truncation (vmtruncate()).
> 
> Well, not technically. I think it can happen with nonlinear pages now,

How?

> but that is a bug in truncate and I have some patches to fix them.
> 
> But anyone who has done a get_user_pages, AFAIKS, can later run a
> set_page_dirty on the pages.

Most (all?) callers are (and should be) using set_page_dirty_lock().

> Nasty problem, but I think we are OK to
> just ignore the dirty bits in this case, because the truncate might
> well have happened _after_ the get_user_pages guy set the page dirty
> anyway.
> 
> > There might be a problem with the final blocks in the page outside i_size. 
> > iirc what happens here is that the bh outside i_size _is_ marked dirty, but
> > writepage() will notice that it's outside i_size and will just mark it
> > clean again without doing IO.
> 
> Didn't think of that. How does it get to writepage though, if it has
> lost its mapping?

It was only partially truncated - it's still on the address_space.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  8:41                                             ` Andrew Morton
@ 2006-10-10  8:49                                               ` Nick Piggin
  2006-10-10  9:07                                                 ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, Oct 10, 2006 at 01:41:14AM -0700, Andrew Morton wrote:
> On Tue, 10 Oct 2006 10:18:20 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > > If the buffer's redirtied after discard_buffer() got at it, we've got some
> > > nasty problems in there.
> > > 
> > > Are you sure this race can happen?  Nobody's allowed to have a page mapped
> > > while it's undergoing truncation (vmtruncate()).
> > 
> > Well, not technically. I think it can happen with nonlinear pages now,
> 
> How?

It is the pagefault vs invalidate race. It happens with truncate with
nonlinear mappings because they don't have the truncate_count logic.

It can also happen with normal pagecache vs invalidate_inode_pages2.

> > but that is a bug in truncate and I have some patches to fix them.
> > 
> > But anyone who has done a get_user_pages, AFAIKS, can later run a
> > set_page_dirty on the pages.
> 
> Most (all?) callers are (and should be) using set_page_dirty_lock().

They don't, and I haven't checked but I doubt it is because they
always have the page locked.

> > Nasty problem, but I think we are OK to
> > just ignore the dirty bits in this case, because the truncate might
> > well have happened _after_ the get_user_pages guy set the page dirty
> > anyway.
> > 
> > > There might be a problem with the final blocks in the page outside i_size. 
> > > iirc what happens here is that the bh outside i_size _is_ marked dirty, but
> > > writepage() will notice that it's outside i_size and will just mark it
> > > clean again without doing IO.
> > 
> > Didn't think of that. How does it get to writepage though, if it has
> > lost its mapping?
> 
> It was only partially truncated - it's still on the address_space.

Oh OK fine.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  8:49                                               ` Nick Piggin
@ 2006-10-10  9:07                                                 ` Andrew Morton
  2006-10-10  9:23                                                   ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2006-10-10  9:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, 10 Oct 2006 10:49:31 +0200
Nick Piggin <npiggin@suse.de> wrote:

> > > but that is a bug in truncate and I have some patches to fix them.
> > > 
> > > But anyone who has done a get_user_pages, AFAIKS, can later run a
> > > set_page_dirty on the pages.
> > 
> > Most (all?) callers are (and should be) using set_page_dirty_lock().
> 
> They don't, and I haven't checked but I doubt it is because they
> always have the page locked.

I don't understand that.  If they're using get_user_pages() then they're probably _not_
locking the page, and they should be using set_page_dirty_lock().

If they _are_ locking the page, and running set_page_dirty() inside that
lock_page() then fine.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  9:07                                                 ` Andrew Morton
@ 2006-10-10  9:23                                                   ` Nick Piggin
  0 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  9:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Tue, Oct 10, 2006 at 02:07:26AM -0700, Andrew Morton wrote:
> On Tue, 10 Oct 2006 10:49:31 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > > > but that is a bug in truncate and I have some patches to fix them.
> > > > 
> > > > But anyone who has done a get_user_pages, AFAIKS, can later run a
> > > > set_page_dirty on the pages.
> > > 
> > > Most (all?) callers are (and should be) using set_page_dirty_lock().
> > 
> > They don't, and I haven't checked but I doubt it is because they
> > always have the page locked.
> 
> I don't understand that.  If they're using get_user_pages() then they're probably _not_
> locking the page, and they should be using set_page_dirty_lock().
> 
> If they _are_ locking the page, and running set_page_dirty() inside that
> lock_page() then fine.

The requirement (commented fairly clearly on top of set_page_dirty_lock)
is for the mapping to be pinned, whether by page lock or some other means.

I'm not saying you're wrong because I haven't gone through everything,
have you?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  5:22                     ` Nick Piggin
  2006-10-10  5:29                       ` Andrew Morton
@ 2006-10-10  6:48                       ` Peter Zijlstra
  2006-10-10  6:59                         ` Nick Piggin
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2006-10-10  6:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, Greg KH

On Tue, 2006-10-10 at 07:22 +0200, Nick Piggin wrote:
> On Mon, Oct 09, 2006 at 10:01:27PM -0700, Andrew Morton wrote:
> > On Tue, 10 Oct 2006 06:47:45 +0200
> > Nick Piggin <npiggin@suse.de> wrote:
> > > > There we can trylock all the pages and bale if any fail.
> > > 
> > > Hmm, try_to_unmap is OK because the page is already locked. page_remove_rmap
> > > isn't allowed to fail.
> > 
> > I was talking about try_to_unmap_cluster().
> 
> But page_remove_rmap's many callers are still screwed. Take do_wp_page,
> for example.

> > It can probably be made to work.  But a sane interface is "when dinking
> > with page internals, lock the page".
> 
> I disagree because it will lead to horrible hacks because many callers
> can't sleep. If anything I would much prefer an innermost-spinlock in
> page->flags that specifically excludes truncate. Actually tree_lock can
> do that now, provided we pin mapping in all callers to set_page_dirty
> (which we should do).

Yeah, but we're hard working to eradicate tree lock; I have ran into
this problem before; that is, zap_pte_range and co. not being able to
lock the page. I'd really like to see that fixed.

In my current concurrent pagecache patches I've abused your PG_nonewrefs
and made it this page internal (bit)spinlock, but it just doesn't look
nice to have both this lock and PG_locked.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  6:48                       ` Peter Zijlstra
@ 2006-10-10  6:59                         ` Nick Piggin
  2006-10-10  7:11                           ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  6:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, Greg KH

On Tue, Oct 10, 2006 at 08:48:56AM +0200, Peter Zijlstra wrote:
> On Tue, 2006-10-10 at 07:22 +0200, Nick Piggin wrote:
> > 
> > I disagree because it will lead to horrible hacks because many callers
> > can't sleep. If anything I would much prefer an innermost-spinlock in
> > page->flags that specifically excludes truncate. Actually tree_lock can
> > do that now, provided we pin mapping in all callers to set_page_dirty
> > (which we should do).
> 
> Yeah, but we're hard working to eradicate tree lock; I have ran into

Well yeah, but until then the tree_lock works.

> this problem before; that is, zap_pte_range and co. not being able to
> lock the page. I'd really like to see that fixed.

What's your problem with zap_pte_range?

> In my current concurrent pagecache patches I've abused your PG_nonewrefs
> and made it this page internal (bit)spinlock, but it just doesn't look
> nice to have both this lock and PG_locked.

Regardless of whether or not they spin, PG_locked is an outermost, and
set_page_dirty is called innermost. I don't see why we'd particularly
want to mush them together now, just because we're worried a filesystem
writer wrote buggy code.

It is all well and good to tell me I'm wrong unless I audit all
filesystems, but the fact is that I'm not a filesystem expert, and if
this is what it has come to then either the process has failed, or
we have a large number of filesystems to cull from the tree as
unmaintained.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  6:59                         ` Nick Piggin
@ 2006-10-10  7:11                           ` Peter Zijlstra
  2006-10-10  7:30                             ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2006-10-10  7:11 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, Greg KH

On Tue, 2006-10-10 at 08:59 +0200, Nick Piggin wrote:
> On Tue, Oct 10, 2006 at 08:48:56AM +0200, Peter Zijlstra wrote:
> > On Tue, 2006-10-10 at 07:22 +0200, Nick Piggin wrote:
> > > 
> > > I disagree because it will lead to horrible hacks because many callers
> > > can't sleep. If anything I would much prefer an innermost-spinlock in
> > > page->flags that specifically excludes truncate. Actually tree_lock can
> > > do that now, provided we pin mapping in all callers to set_page_dirty
> > > (which we should do).
> > 
> > Yeah, but we're hard working to eradicate tree lock; I have ran into
> 
> Well yeah, but until then the tree_lock works.

True, and your latest patch looks nice, will have to go over it in more
than a cursory glance though.

> > this problem before; that is, zap_pte_range and co. not being able to
> > lock the page. I'd really like to see that fixed.
> 
> What's your problem with zap_pte_range?

well, not only zap_pte_range, also page_remove_rmap and
try_to_unmap_cluster etc.. Basically all those who fiddle with the page
without holding the page lock.

Because with concurrent pagecache, there is no tree lock anymore to
protect/pin the whole mapping, I need to go pin individual pages.
Perhaps having an inner page bit-spinlock (PG_pin) isn't a bad thing,
I'd just raised the issue to see if it would be doable/a-good-thing to
try and merge these two.

> > In my current concurrent pagecache patches I've abused your PG_nonewrefs
> > and made it this page internal (bit)spinlock, but it just doesn't look
> > nice to have both this lock and PG_locked.
> 
> Regardless of whether or not they spin, PG_locked is an outermost, and
> set_page_dirty is called innermost. 

Yes, indeed, and changing it is going to be hard and painful, hence I'll
stick with PG_pin for now.

> I don't see why we'd particularly
> want to mush them together now, just because we're worried a filesystem
> writer wrote buggy code.
> 
> It is all well and good to tell me I'm wrong unless I audit all
> filesystems, but the fact is that I'm not a filesystem expert, and if
> this is what it has come to then either the process has failed, or
> we have a large number of filesystems to cull from the tree as
> unmaintained.

With you on that, no/little filesystems knowledge here either.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  7:11                           ` Peter Zijlstra
@ 2006-10-10  7:30                             ` Nick Piggin
  0 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  7:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, Greg KH

On Tue, Oct 10, 2006 at 09:11:03AM +0200, Peter Zijlstra wrote:
> On Tue, 2006-10-10 at 08:59 +0200, Nick Piggin wrote:
> > 
> > What's your problem with zap_pte_range?
> 
> well, not only zap_pte_range, also page_remove_rmap and
> try_to_unmap_cluster etc.. Basically all those who fiddle with the page
> without holding the page lock.

Fiddle with the page -- as in setting it dirty? Because there is a
huge amount of other stuff that fiddles with the page without taking
the lock ;)

> Because with concurrent pagecache, there is no tree lock anymore to
> protect/pin the whole mapping, I need to go pin individual pages.
> Perhaps having an inner page bit-spinlock (PG_pin) isn't a bad thing,
> I'd just raised the issue to see if it would be doable/a-good-thing to
> try and merge these two.

It does seem attractive, but at the moment I can't see how it would
be done. One of the main problems is that truncate / invalidate need
to hold the page locked while traversing the rmaps (which require
i_mmap_lock, ptl locks, etc).

There are other things too, like the swapcache, which uses PG_locked
as an outer lock.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  3:50       ` Andrew Morton
  2006-10-10  3:58         ` Nick Piggin
@ 2006-10-10  4:11         ` Nick Piggin
  1 sibling, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  4:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Mon, Oct 09, 2006 at 08:50:30PM -0700, Andrew Morton wrote:
> On Tue, 10 Oct 2006 05:34:12 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > the problem is that page_mapping is still free to go NULL at any
> > time, and __set_page_dirty_buffers wasn't checking for that.
> > 
> > If there is another race, then it must be because the buffer code
> > cannot cope with dirty buffers against a truncated page. It is
> > kind of spaghetti, though. What stops set_page_dirty_buffers from
> > racing with block_invalidatepage, for example?
> 
> Nothing that I can think of.

block_invalidatepage
 discard_buffer
                         set_page_dirty_buffers
  try_to_release_page
   try_to_free_buffers
    drop_buffers [fails because buffer is dirty]

Hmm...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  3:20   ` Andrew Morton
  2006-10-10  3:34     ` Nick Piggin
@ 2006-10-10  3:37     ` Andrew Morton
  2006-10-10  3:42       ` Nick Piggin
  1 sibling, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2006-10-10  3:37 UTC (permalink / raw)
  To: Linus Torvalds, Nick Piggin, Peter Zijlstra,
	Linux Memory Management List, Greg KH

On Mon, 9 Oct 2006 20:20:39 -0700
Andrew Morton <akpm@osdl.org> wrote:

> On Mon, 9 Oct 2006 20:06:05 -0700 (PDT)
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
> > On Tue, 10 Oct 2006, Nick Piggin wrote:
> > >
> > > This was triggered, but not the fault of, the dirty page accounting
> > > patches. Suitable for -stable as well, after it goes upstream.
> > 
> > Applied. However, I wonder what protects "page_mapping()" here?
> 
> Nothing.  And I don't understand the (unchangelogged) switch from
> page->mapping to page_mapping().
> 
> > I don't 
> > think we hold the page lock anywhere, so "page->mapping" can change at any 
> > time, no?
> 
> Yes.  The patch makes the race window a bit smaller.

OK, the address_space is protected from reclaim here by virtue of the
caller's ref on vma->vm_file (needs a comment).

The page_mapping() change remains a mystery.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] mm: bug in set_page_dirty_buffers
  2006-10-10  3:37     ` Andrew Morton
@ 2006-10-10  3:42       ` Nick Piggin
  0 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2006-10-10  3:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List, Greg KH

On Mon, Oct 09, 2006 at 08:37:18PM -0700, Andrew Morton wrote:
> On Mon, 9 Oct 2006 20:20:39 -0700
> Andrew Morton <akpm@osdl.org> wrote:
> 
> > On Mon, 9 Oct 2006 20:06:05 -0700 (PDT)
> > Linus Torvalds <torvalds@osdl.org> wrote:
> > 
> > > On Tue, 10 Oct 2006, Nick Piggin wrote:
> > > >
> > > > This was triggered, but not the fault of, the dirty page accounting
> > > > patches. Suitable for -stable as well, after it goes upstream.
> > > 
> > > Applied. However, I wonder what protects "page_mapping()" here?
> > 
> > Nothing.  And I don't understand the (unchangelogged) switch from
> > page->mapping to page_mapping().
> > 
> > > I don't 
> > > think we hold the page lock anywhere, so "page->mapping" can change at any 
> > > time, no?
> > 
> > Yes.  The patch makes the race window a bit smaller.
> 
> OK, the address_space is protected from reclaim here by virtue of the
> caller's ref on vma->vm_file (needs a comment).

All callers are required to pin the the inode though. The comment is
on top of set_page_dirty_lock.

I guess you mean a comment in do_no_page? but I thought that was obvious:
zap_pte_range, access_process_vm, etc have been doing this forever.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* patch mm-bug-in-set_page_dirty_buffers.patch queued to -stable tree
  2006-10-10  2:36 [patch] mm: bug in set_page_dirty_buffers Nick Piggin
  2006-10-10  3:06 ` Linus Torvalds
@ 2006-10-10  7:42 ` gregkh
  1 sibling, 0 replies; 36+ messages in thread
From: gregkh @ 2006-10-10  7:42 UTC (permalink / raw)
  To: npiggin, a.p.zijlstra, akpm, gregkh, linux-mm, torvalds; +Cc: stable

This is a note to let we have just queued up the patch titled

     Subject: mm: bug in set_page_dirty_buffers

to the -stable tree.  Its filename is

     mm-bug-in-set_page_dirty_buffers.patch

A git repo of this tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/stable/stable-queue/

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2006-10-10  9:23 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-10-10  2:36 [patch] mm: bug in set_page_dirty_buffers Nick Piggin
2006-10-10  3:06 ` Linus Torvalds
2006-10-10  3:19   ` Nick Piggin
2006-10-10  3:20   ` Andrew Morton
2006-10-10  3:34     ` Nick Piggin
2006-10-10  3:50       ` Andrew Morton
2006-10-10  3:58         ` Nick Piggin
2006-10-10  4:14           ` Andrew Morton
2006-10-10  4:21             ` Nick Piggin
2006-10-10  4:38               ` Andrew Morton
2006-10-10  4:47                 ` Nick Piggin
2006-10-10  5:01                   ` Andrew Morton
2006-10-10  5:22                     ` Nick Piggin
2006-10-10  5:29                       ` Andrew Morton
2006-10-10  5:48                         ` Nick Piggin
2006-10-10  6:08                           ` Andrew Morton
2006-10-10  6:19                             ` Nick Piggin
2006-10-10  6:27                               ` Andrew Morton
2006-10-10  6:39                                 ` Nick Piggin
2006-10-10  6:52                                   ` Nick Piggin
2006-10-10  7:06                                     ` Andrew Morton
2006-10-10  7:21                                       ` Nick Piggin
2006-10-10  8:07                                         ` Andrew Morton
2006-10-10  8:18                                           ` Nick Piggin
2006-10-10  8:41                                             ` Andrew Morton
2006-10-10  8:49                                               ` Nick Piggin
2006-10-10  9:07                                                 ` Andrew Morton
2006-10-10  9:23                                                   ` Nick Piggin
2006-10-10  6:48                       ` Peter Zijlstra
2006-10-10  6:59                         ` Nick Piggin
2006-10-10  7:11                           ` Peter Zijlstra
2006-10-10  7:30                             ` Nick Piggin
2006-10-10  4:11         ` Nick Piggin
2006-10-10  3:37     ` Andrew Morton
2006-10-10  3:42       ` Nick Piggin
2006-10-10  7:42 ` patch mm-bug-in-set_page_dirty_buffers.patch queued to -stable tree gregkh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox