tracking dirty pages patches

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* tracking dirty pages patches
@ 2006-05-22 19:31 Hugh Dickins
  2006-05-22 20:29 ` Andrew Morton
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Hugh Dickins @ 2006-05-22 19:31 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andrew Morton, Linus Torvalds, David Howells, linux-mm

Belated observations on your "tracking dirty pages" patches.

page_wrprotect is a nice use of rmap, but I see a couple of problems.
One is in the lock ordering (there's info on mm lock ordering at the
top of filemap.c, but I find the list at the top of rmap.c easier).

set_page_dirty has always (awkwardly) been liable to be called from
very low in the hierarchy; whereas you're assuming clear_page_dirty
is called from much higher up.  And in most cases there's no problem
(please cross-check to confirm that); but try_to_free_buffers in fs/
buffer.c calls it while holding mapping->private_lock - page_wrprotect
called from test_clear_page_dirty then violates the order.

If we're lucky and that is indeed the only violation, maybe Andrew
can recommend a change to try_to_free_buffers to avoid it: I have
no appreciation of the issues at that end myself.

The other worries are in page_wrprotect_one's block
	entry = pte_mkclean(pte_wrprotect(*pte));
	ptep_establish(vma, address, pte, entry);
	update_mmu_cache(vma, address, entry);
	lazy_mmu_prot_update(entry);
ptep_establish, update_mmu_cache and lazy_mmu_prot_update are tricky
arch-dependent functions which have hitherto only been used on the
current task mm, whereas you're now using them from (perhaps) another.

Well, no, I'm wrong: ptrace's get_user_pages has been using them
from another process; but that's not so common a case as to reassure
me there won't be issues on some architectures there.

Quite likely ptep_establish and update_mmu_cache are okay for use in
that way (needs careful checking of arches), at least they take a vma
argument from which the mm can be found.  Whereas lazy_mmu_prot_update
looks likely to be wrong, but only does something on ia64: you need
to consult ia64 mm gurus to check what's needed there.  Maybe it'll
just be a suboptimal issue (but more important now than in ptrace
to make it optimal).

Is there a problem with page_wrprotect on VM_LOCKED vmas?  I'm not
sure: usually VM_LOCKED guarantees no faulting, you abandon that.

Like others, I don't care for "VM_SharedWritable": you followed the
VM_ReadHint macros, but this isn't a read hint, and those are weird.

Personally, I much prefer the explicit
	((vma->vm_flags & (VM_SHARED|VM_WRITE)) == (VM_SHARED|VM_WRITE))
which is the usual style for vm_flags tests throughout mm (except for
the hugetlb test designed to melt away without HUGETLB).  But I may be
in a minority on that, Linus did put an is_cow_mapping() in memory.c
recently, so maybe you'd follow that and say is_shared_writable().

There's a clash and overlap between your "tracking dirty pages" patches
and David Howell's "add notification of page becoming writable" patch.
The merge of the two in 2.6.17-rc4-mm1 was wrong: your handle_pte_fault
change meant it never reached David's page_mkwrite call in do_wp_page.

Andrew has resolved that in -mm2 and -mm3 by dropping David's patches
for the umpteenth time; but whatever happens to the rest of FS-Cache,
I suspect the page_mkwrite patch will return (Anton wants it for NTFS,
probably others will want it, but none are relying upon it yet).

Please take a look at that patch (David reposted it on linux-kernel
last Friday, as 08/14 of FS-Cache try #10): I went over it with him
many months ago, and it fills in at least one gap you're missing...

mprotect: we all forget mprotect, but mprotect(,,PROT_READ)
followed by mprotect(,,PROT_WRITE) will give write permission to all
those ptes you've carefully taken write permission from.  In the
page_mkwrite patch, we found that was most easily fixed by using
the !VM_SHARED vm_page_prot in place of the VM_SHARED one.  I
expect you can simplify your patch a little by doing the same.

msync: I rather think that with your changes, if they're to stay,
then all the page table walking code can be removed from msync -
since it already skipped vmas which were not VM_SHARED, and there's
nothing to gain from syncing the !mapping_cap_account_dirty ones.
I think MS_ASYNC becomes a no-op, and sys_msync so small it won't
deserve its own msync.c (madvise.c wouldn't be a bad place for it).
Or am I missing something?

I'm not convinced that optimize-follow_pages is a worthwhile optimization
(in some cases you're adding an atomic inc and dec), and it's irrelevant
to your tracking of dirty pages, but I don't feel strongly about it.
Except, if it stays then it needs fixing: the flags 0 case is doing
a put_page without having done a get_page.

Though currently it seems only some powerpc #ifdef __DEBUG code is using
follow_pages in that way: since that's not the common case, I think you'd
best just remove the "if (flags & (FOLL_GET | FOLL_TOUCH))" condition
from before the get_page.

(Why does follow_pages set_page_dirty at all?  I _think_ it's in case
the get_user_pages caller forgets to set_page_dirty when releasing.
But that's not how we usually write kernel code, to hide mistakes most
of the time, and your mods may change the balance there.  Andrew will
remember better whether that set_page_dirty has stronger justification.)

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tracking dirty pages patches
  2006-05-22 19:31 tracking dirty pages patches Hugh Dickins
@ 2006-05-22 20:29 ` Andrew Morton
  2006-05-23  8:17   ` Nick Piggin
  2006-05-23 14:55   ` Hugh Dickins
  2006-05-23 16:24 ` Christoph Lameter
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 20+ messages in thread
From: Andrew Morton @ 2006-05-22 20:29 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: a.p.zijlstra, torvalds, dhowells, linux-mm

Hugh Dickins <hugh@veritas.com> wrote:
>
> Belated observations on your "tracking dirty pages" patches.

Thanks, Hugh.

> page_wrprotect is a nice use of rmap, but I see a couple of problems.
> One is in the lock ordering (there's info on mm lock ordering at the
> top of filemap.c, but I find the list at the top of rmap.c easier).
> 
> set_page_dirty has always (awkwardly) been liable to be called from
> very low in the hierarchy; whereas you're assuming clear_page_dirty
> is called from much higher up.  And in most cases there's no problem
> (please cross-check to confirm that); but try_to_free_buffers in fs/
> buffer.c calls it while holding mapping->private_lock - page_wrprotect
> called from test_clear_page_dirty then violates the order.
> 
> If we're lucky and that is indeed the only violation, maybe Andrew
> can recommend a change to try_to_free_buffers to avoid it: I have
> no appreciation of the issues at that end myself.

I had troubles with that as well - tree_lock is a very "inner" lock.  So I
moved test_clear_page_dirty()'s call to page_wrprotect() to be outside
tree_lock.

But I don't think you were referring to that - I am unable to evaluate your
expression "the order".

The running of page_wrprotect_file() inside private_lock is a worry, yes. 
We can move the clear_page_dirty() call in try_to_free_buffers() to be
outside private_lock.

But I don't know which particular ranking violation you've identified.

> ...
>
> (Why does follow_pages set_page_dirty at all?  I _think_ it's in case
> the get_user_pages caller forgets to set_page_dirty when releasing.
> But that's not how we usually write kernel code, to hide mistakes most
> of the time,

Yes, that would be bad.

> and your mods may change the balance there.  Andrew will
> remember better whether that set_page_dirty has stronger justification.)

It was added by the below, which nobody was terribly happy with at the
time.  (Took me 5-10 minutes to hunt this down.  Insert rote comment about
comments).




Date: Mon, 19 Jan 2004 18:43:46 +0000
From: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
To: bk-commits-head@vger.kernel.org
Subject: [PATCH] s390: endless loop in follow_page.


ChangeSet 1.1490.3.215, 2004/01/19 10:43:46-08:00, akpm@osdl.org

	[PATCH] s390: endless loop in follow_page.
	
	From: Martin Schwidefsky <schwidefsky@de.ibm.com>
	
	Fix endless loop in get_user_pages() on s390.  It happens only on s/390
	because pte_dirty always returns 0.  For all other architectures this is an
	optimization.
	
	In the case of "write && !pte_dirty(pte)" follow_page() returns NULL.  On all
	architectures except s390 handle_pte_fault() will then create a pte with
	pte_dirty(pte)==1 because write_access==1.  In the following, second call to
	follow_page() all is fine.  With the physical dirty bit patch pte_dirty() is
	always 0 for s/390 because the dirty bit doesn't live in the pte.


# This patch includes the following deltas:
#	           ChangeSet	1.1490.3.214 -> 1.1490.3.215
#	         mm/memory.c	1.145   -> 1.146  
#

 memory.c |   21 +++++++++++++--------
 1 files changed, 13 insertions(+), 8 deletions(-)


diff -Nru a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c	Mon Jan 19 15:47:24 2004
+++ b/mm/memory.c	Mon Jan 19 15:47:24 2004
@@ -651,14 +651,19 @@
 	pte = *ptep;
 	pte_unmap(ptep);
 	if (pte_present(pte)) {
-		if (!write || (pte_write(pte) && pte_dirty(pte))) {
-			pfn = pte_pfn(pte);
-			if (pfn_valid(pfn)) {
-				struct page *page = pfn_to_page(pfn);
-
-				mark_page_accessed(page);
-				return page;
-			}
+		if (write && !pte_write(pte))
+			goto out;
+		if (write && !pte_dirty(pte)) {
+			struct page *page = pte_page(pte);
+			if (!PageDirty(page))
+				set_page_dirty(page);
+		}
+		pfn = pte_pfn(pte);
+		if (pfn_valid(pfn)) {
+			struct page *page = pfn_to_page(pfn);
+			
+			mark_page_accessed(page);
+			return page;
 		}
 	}
 
-
To unsubscribe from this list: send the line "unsubscribe bk-commits-head" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tracking dirty pages patches
  2006-05-22 20:29 ` Andrew Morton
@ 2006-05-23  8:17   ` Nick Piggin
  2006-05-23 14:55   ` Hugh Dickins
  1 sibling, 0 replies; 20+ messages in thread
From: Nick Piggin @ 2006-05-23  8:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, a.p.zijlstra, torvalds, dhowells, linux-mm

Andrew Morton wrote:

>>and your mods may change the balance there.  Andrew will
>>remember better whether that set_page_dirty has stronger justification.)
> 
> 
> It was added by the below, which nobody was terribly happy with at the
> time.  (Took me 5-10 minutes to hunt this down.  Insert rote comment about
> comments).

Hmm, I couldn't find any discussion on lkml or linux-mm about it.

I wonder why it wasn't simply changed so as to return the page even if
the pte was not marked dirty.

> 
> 
> 
> Date: Mon, 19 Jan 2004 18:43:46 +0000
> From: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
> To: bk-commits-head@vger.kernel.org
> Subject: [PATCH] s390: endless loop in follow_page.
> 
> 
> ChangeSet 1.1490.3.215, 2004/01/19 10:43:46-08:00, akpm@osdl.org
> 
> 	[PATCH] s390: endless loop in follow_page.
> 	
> 	From: Martin Schwidefsky <schwidefsky@de.ibm.com>
> 	
> 	Fix endless loop in get_user_pages() on s390.  It happens only on s/390
> 	because pte_dirty always returns 0.  For all other architectures this is an
> 	optimization.
> 	
> 	In the case of "write && !pte_dirty(pte)" follow_page() returns NULL.  On all
> 	architectures except s390 handle_pte_fault() will then create a pte with
> 	pte_dirty(pte)==1 because write_access==1.  In the following, second call to
> 	follow_page() all is fine.  With the physical dirty bit patch pte_dirty() is
> 	always 0 for s/390 because the dirty bit doesn't live in the pte.
> 
> 
> # This patch includes the following deltas:
> #	           ChangeSet	1.1490.3.214 -> 1.1490.3.215
> #	         mm/memory.c	1.145   -> 1.146  
> #
> 
>  memory.c |   21 +++++++++++++--------
>  1 files changed, 13 insertions(+), 8 deletions(-)
> 
> 
> diff -Nru a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c	Mon Jan 19 15:47:24 2004
> +++ b/mm/memory.c	Mon Jan 19 15:47:24 2004
> @@ -651,14 +651,19 @@
>  	pte = *ptep;
>  	pte_unmap(ptep);
>  	if (pte_present(pte)) {
> -		if (!write || (pte_write(pte) && pte_dirty(pte))) {
> -			pfn = pte_pfn(pte);
> -			if (pfn_valid(pfn)) {
> -				struct page *page = pfn_to_page(pfn);
> -
> -				mark_page_accessed(page);
> -				return page;
> -			}
> +		if (write && !pte_write(pte))
> +			goto out;
> +		if (write && !pte_dirty(pte)) {
> +			struct page *page = pte_page(pte);
> +			if (!PageDirty(page))
> +				set_page_dirty(page);
> +		}
> +		pfn = pte_pfn(pte);
> +		if (pfn_valid(pfn)) {
> +			struct page *page = pfn_to_page(pfn);
> +			
> +			mark_page_accessed(page);
> +			return page;
>  		}
>  	}
>  

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tracking dirty pages patches
  2006-05-22 20:29 ` Andrew Morton
  2006-05-23  8:17   ` Nick Piggin
@ 2006-05-23 14:55   ` Hugh Dickins
  1 sibling, 0 replies; 20+ messages in thread
From: Hugh Dickins @ 2006-05-23 14:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Andrea Arcangeli, a.p.zijlstra, torvalds, dhowells,
	linux-mm

On Mon, 22 May 2006, Andrew Morton wrote:
> Hugh Dickins <hugh@veritas.com> wrote:
> > 
> > set_page_dirty has always (awkwardly) been liable to be called from
> > very low in the hierarchy; whereas you're assuming clear_page_dirty
> > is called from much higher up.  And in most cases there's no problem
> > (please cross-check to confirm that); but try_to_free_buffers in fs/
> > buffer.c calls it while holding mapping->private_lock - page_wrprotect
> > called from test_clear_page_dirty then violates the order.
> > 
> > If we're lucky and that is indeed the only violation, maybe Andrew
> > can recommend a change to try_to_free_buffers to avoid it: I have
> > no appreciation of the issues at that end myself.
> 
> I had troubles with that as well - tree_lock is a very "inner" lock.  So I
> moved test_clear_page_dirty()'s call to page_wrprotect() to be outside
> tree_lock.

Ah.  I've only been looking at versions after that good change.

> But I don't think you were referring to that

Correct.

>  - I am unable to evaluate your expression "the order".

Sorry.  The order shown in rmap.c goes:

 * mm->mmap_sem
 *   page->flags PG_locked (lock_page)
 *     mapping->i_mmap_lock
 *       anon_vma->lock
 *         mm->page_table_lock or pte_lock
 *           zone->lru_lock (in mark_page_accessed, isolate_lru_page)
 *           swap_lock (in swap_duplicate, swap_info_get)
 *             mmlist_lock (in mmput, drain_mmlist and others)
 *             mapping->private_lock (in __set_page_dirty_buffers)
 *             inode_lock (in set_page_dirty's __mark_inode_dirty)
 *               sb_lock (within inode_lock in fs/fs-writeback.c)
 *               mapping->tree_lock (widely used, in set_page_dirty,
 *                         in arch-dependent flush_dcache_mmap_lock,
 *                         within inode_lock in __sync_single_inode)

> The running of page_wrprotect_file() inside private_lock is a worry, yes. 
> We can move the clear_page_dirty() call in try_to_free_buffers() to be
> outside private_lock.

That would be great (for Peter) if you know it to be safe.

> But I don't know which particular ranking violation you've identified.

page_wrprotect and callees take i_mmap_lock and pte_lock,
neither of which should be taken while private_lock is held.

> > ...
> >
> > (Why does follow_page set_page_dirty at all?  I _think_ it's in case
> > the get_user_pages caller forgets to set_page_dirty when releasing.
> > But that's not how we usually write kernel code, to hide mistakes most
> > of the time,
> 
> Yes, that would be bad.

It would, but you've shown I was wrong in thinking that.

> > and your mods may change the balance there.  Andrew will
> > remember better whether that set_page_dirty has stronger justification.)
> 
> It was added by the below, which nobody was terribly happy with at the
> time.  (Took me 5-10 minutes to hunt this down.  Insert rote comment about
> comments).

Thanks so much for hunting it down, I should have done so myself.

I can well understand s390 looping on its always 0 pte_dirty test,
but very much agree with Nick: the real question then becomes,
what was the point of ever testing for pte_dirty there?
and why don't we just remove the whole
		if ((flags & FOLL_WRITE) &&
		    !pte_dirty(pte) && !PageDirty(page))
			set_page_dirty(page);
from follow_page?

That pte_dirty test first came in 2.4.4-final from
 - Andrea Arkangeli: raw-io fixes
I've just spent some frustrating hours thinking I'd be able to
explain why it was necessary in the old 2.4 context, but actually
I can't.  Vague memory of modifications vanishing under pressure
because of race when pte_dirty was not set; yet the page count is
raised without dropping page_table_lock, and mark_dirty_kiobuf was
introduced in the same patch, to set dirty before releasing.  I've
CC'ed Andrea, but it's a little (a lot!) unfair to expect him to
explain it now.

I do believe that dirty check now should be redundant.  But I
said "should be" not "is" because I still haven't fixed drivers/scsi
sg.c and st.c to use set_page_dirty_lock in place of SetPageDirty.

Last time I was preparing that patch, I got distracted by the over-
whelming feeling that it should be easier to do at interrupt time
(as sg.c needs), but failed to find a satisfying way.  I'd better
revisit that before trying to cut set_page_dirty from follow_page.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tracking dirty pages patches
  2006-05-22 19:31 tracking dirty pages patches Hugh Dickins
  2006-05-22 20:29 ` Andrew Morton
@ 2006-05-23 16:24 ` Christoph Lameter
  2006-05-23 19:21   ` Hugh Dickins
  2006-05-23 16:41 ` David Howells
  2006-05-23 23:07 ` Peter Zijlstra
  3 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2006-05-23 16:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Andrew Morton, Linus Torvalds, David Howells, linux-mm

On Mon, 22 May 2006, Hugh Dickins wrote:

> The other worries are in page_wrprotect_one's block
> 	entry = pte_mkclean(pte_wrprotect(*pte));
> 	ptep_establish(vma, address, pte, entry);
> 	update_mmu_cache(vma, address, entry);
> 	lazy_mmu_prot_update(entry);
> ptep_establish, update_mmu_cache and lazy_mmu_prot_update are tricky
> arch-dependent functions which have hitherto only been used on the
> current task mm, whereas you're now using them from (perhaps) another.

Page migration is also doing that in the version slated for 2.6.18 
in Andrew's tree.
 
> Well, no, I'm wrong: ptrace's get_user_pages has been using them
> from another process; but that's not so common a case as to reassure
> me there won't be issues on some architectures there.

> Quite likely ptep_establish and update_mmu_cache are okay for use in
> that way (needs careful checking of arches), at least they take a vma
> argument from which the mm can be found.  Whereas lazy_mmu_prot_update
> looks likely to be wrong, but only does something on ia64: you need
> to consult ia64 mm gurus to check what's needed there.  Maybe it'll
> just be a suboptimal issue (but more important now than in ptrace
> to make it optimal).

On ia64 lazy_mmu_prot_update deals with the aliasing issues between the 
icache and the dcache. For an executable page we need to flush the icache.

> Is there a problem with page_wrprotect on VM_LOCKED vmas?  I'm not
> sure: usually VM_LOCKED guarantees no faulting, you abandon that.

mlock guarantees that the page is not swapped out. We already modify
the dirty bit and the protections on the VMLOCKED ptes via mprotect.

> (Why does follow_pages set_page_dirty at all?  I _think_ it's in case
> the get_user_pages caller forgets to set_page_dirty when releasing.
> But that's not how we usually write kernel code, to hide mistakes most
> of the time, and your mods may change the balance there.  Andrew will
> remember better whether that set_page_dirty has stronger justification.)

follow_page() transfers the dirty bit from the pte to the page.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tracking dirty pages patches
  2006-05-23 16:24 ` Christoph Lameter
@ 2006-05-23 19:21   ` Hugh Dickins
  2006-05-23 19:31     ` Christoph Lameter
  0 siblings, 1 reply; 20+ messages in thread
From: Hugh Dickins @ 2006-05-23 19:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Andrew Morton, Linus Torvalds, David Howells, linux-mm

On Tue, 23 May 2006, Christoph Lameter wrote:
> On Mon, 22 May 2006, Hugh Dickins wrote:
> 
> > The other worries are in page_wrprotect_one's block
> > 	entry = pte_mkclean(pte_wrprotect(*pte));
> > 	ptep_establish(vma, address, pte, entry);
> > 	update_mmu_cache(vma, address, entry);
> > 	lazy_mmu_prot_update(entry);
> > ptep_establish, update_mmu_cache and lazy_mmu_prot_update are tricky
> > arch-dependent functions which have hitherto only been used on the
> > current task mm, whereas you're now using them from (perhaps) another.
> 
> Page migration is also doing that in the version slated for 2.6.18 
> in Andrew's tree.

Ah, yes, that's good support for Peter's use of update_mmu_cache, thanks.

> > Well, no, I'm wrong: ptrace's get_user_pages has been using them
> > from another process; but that's not so common a case as to reassure
> > me there won't be issues on some architectures there.
> 
> > Quite likely ptep_establish and update_mmu_cache are okay for use in
> > that way (needs careful checking of arches), at least they take a vma
> > argument from which the mm can be found.  Whereas lazy_mmu_prot_update
> > looks likely to be wrong, but only does something on ia64: you need
> > to consult ia64 mm gurus to check what's needed there.  Maybe it'll
> > just be a suboptimal issue (but more important now than in ptrace
> > to make it optimal).
> 
> On ia64 lazy_mmu_prot_update deals with the aliasing issues between the 
> icache and the dcache. For an executable page we need to flush the icache.

And looking more closely, I now see it operates on the underlying struct
page and its kernel page_address(), nothing to do with userspace mm.

Okay, but it's pointless for Peter to call it from page_wrprotect_one
(which is making no change to executability), isn't that so?

> > Is there a problem with page_wrprotect on VM_LOCKED vmas?  I'm not
> > sure: usually VM_LOCKED guarantees no faulting, you abandon that.
> 
> mlock guarantees that the page is not swapped out. We already modify
> the dirty bit and the protections on the VMLOCKED ptes via mprotect.

You're right, silly of me not to look it up: yes, "memory-resident" is
the critical issue, so VM_LOCKED presents no problem to Peter's patch.

> > (Why does follow_pages set_page_dirty at all?  I _think_ it's in case
> > the get_user_pages caller forgets to set_page_dirty when releasing.
> > But that's not how we usually write kernel code, to hide mistakes most
> > of the time, and your mods may change the balance there.  Andrew will
> > remember better whether that set_page_dirty has stronger justification.)
> 
> follow_page() transfers the dirty bit from the pte to the page.

No, that's not what it's doing (if pte_dirty it does no such thing).

But thanks, you've cleared up several issues for me here.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tracking dirty pages patches
  2006-05-23 19:21   ` Hugh Dickins
@ 2006-05-23 19:31     ` Christoph Lameter
  2006-05-23 20:34       ` Hugh Dickins
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2006-05-23 19:31 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Andrew Morton, Linus Torvalds, David Howells, linux-mm

On Tue, 23 May 2006, Hugh Dickins wrote:

> > On ia64 lazy_mmu_prot_update deals with the aliasing issues between the 
> > icache and the dcache. For an executable page we need to flush the icache.
> 
> And looking more closely, I now see it operates on the underlying struct
> page and its kernel page_address(), nothing to do with userspace mm.
> 
> Okay, but it's pointless for Peter to call it from page_wrprotect_one
> (which is making no change to executability), isn't that so?

That is true for ia64. However, the name "lazy_mmu_prot_update" suggests
that the intended scope is to cover protection updates in general. 
And we definitely change the protections of the page.

Maybe we could rename lazy_mmu_prot_update? What does icache/dcache 
aliasing have to do with page protection?

> You're right, silly of me not to look it up: yes, "memory-resident" is
> the critical issue, so VM_LOCKED presents no problem to Peter's patch.

Page migration currently also assumes that VM_LOCKED means do not move the 
page. At some point we may want to have a separate flag that guarantees
that a page should not be moved. This would enable the moving of VM_LOCKED 
pages.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tracking dirty pages patches
  2006-05-23 19:31     ` Christoph Lameter
@ 2006-05-23 20:34       ` Hugh Dickins
  2006-05-23 21:16         ` Christoph Lameter
                           ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Hugh Dickins @ 2006-05-23 20:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Andrew Morton, Linus Torvalds, David Howells,
	Rohit Seth, linux-mm

On Tue, 23 May 2006, Christoph Lameter wrote:
> On Tue, 23 May 2006, Hugh Dickins wrote:
> 
> > > On ia64 lazy_mmu_prot_update deals with the aliasing issues between the 
> > > icache and the dcache. For an executable page we need to flush the icache.
> > 
> > And looking more closely, I now see it operates on the underlying struct
> > page and its kernel page_address(), nothing to do with userspace mm.
> > 
> > Okay, but it's pointless for Peter to call it from page_wrprotect_one
> > (which is making no change to executability), isn't that so?
> 
> That is true for ia64. However, the name "lazy_mmu_prot_update" suggests
> that the intended scope is to cover protection updates in general. 
> And we definitely change the protections of the page.

True, and I now see Documentation/cachetlb.txt documents it that way.
Yet nothing but ia64 has any use for it.

> Maybe we could rename lazy_mmu_prot_update? What does icache/dcache 
> aliasing have to do with page protection?

I'd strongly agree with you that it should be renamed: for a start,
why does it say "lazy"?  That's an architectural implementation detail.

Except that, instead of agreeing it should be renamed, I say it should
be deleted entirely.  It seems to represent that ia64 has an empty
update_mmu_cache, and someone decided to add a new interface instead
of giving ia64 that work to do in its update_mmu_cache.

That someone being Rohit, CC'ed.

I can make no sense of it from its callsites.  It seems to be called
immediately after any update_mmu_cache, unless the source file is
called mm/fremap.c, in which case it's left out.

> > You're right, silly of me not to look it up: yes, "memory-resident" is
> > the critical issue, so VM_LOCKED presents no problem to Peter's patch.
> 
> Page migration currently also assumes that VM_LOCKED means do not move the 
> page. At some point we may want to have a separate flag that guarantees
> that a page should not be moved. This would enable the moving of VM_LOCKED 
> pages.

Oh yes, I'd noticed that subject going by, and meant to speak up
sometime.  I feel pretty strongly, and have so declared in the past,
that VM_LOCKED should _not_ guarantee that the same physical page is
used forever: get_user_pages is what's used to pin a physical page
for that effect.  I remember Arjan sharing this opinion.

You might discover a problem or two in letting page migration go that
way, I'm not saying there cannot be a problem; but I'd much rather
you try without adding a new flag unless it's proved necessary.
And I know Linus prefers not to go overboard with extra flags.

You mentioned in one of the mails that went past that you'd seen
drivers enforcing VM_LOCKED in vm_flags: aren't those just drivers
copying other drivers which did so, but achieving nothing thereby,
to be cleaned up in due course?  (The pages aren't even on LRU.)

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tracking dirty pages patches
  2006-05-23 20:34       ` Hugh Dickins
@ 2006-05-23 21:16         ` Christoph Lameter
  2006-05-23 21:17         ` Chen, Kenneth W
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: Christoph Lameter @ 2006-05-23 21:16 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Andrew Morton, Linus Torvalds, David Howells,
	Rohit Seth, linux-mm

On Tue, 23 May 2006, Hugh Dickins wrote:

> > Page migration currently also assumes that VM_LOCKED means do not move the 
> > page. At some point we may want to have a separate flag that guarantees
> > that a page should not be moved. This would enable the moving of VM_LOCKED 
> > pages.
> 
> Oh yes, I'd noticed that subject going by, and meant to speak up
> sometime.  I feel pretty strongly, and have so declared in the past,
> that VM_LOCKED should _not_ guarantee that the same physical page is
> used forever: get_user_pages is what's used to pin a physical page
> for that effect.  I remember Arjan sharing this opinion.
> 
> You might discover a problem or two in letting page migration go that
> way, I'm not saying there cannot be a problem; but I'd much rather
> you try without adding a new flag unless it's proved necessary.
> And I know Linus prefers not to go overboard with extra flags.

Ok. I thought that there would be a requirement to have such a flag 
instead of VM_LOCKED.
 
> You mentioned in one of the mails that went past that you'd seen
> drivers enforcing VM_LOCKED in vm_flags: aren't those just drivers
> copying other drivers which did so, but achieving nothing thereby,
> to be cleaned up in due course?  (The pages aren't even on LRU.)

Could be. I think Kame looked at the drivers. The memory hotplug people 
are mostly interested in moving VM_LOCKED pages. I would like to support 
them in that but we have currently no need to move mlocked pages.

Pages that are not on the LRU cannot be moved by page migration. So maybe 
that kind of condition is sufficient to pin memory.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: tracking dirty pages patches
  2006-05-23 20:34       ` Hugh Dickins
  2006-05-23 21:16         ` Christoph Lameter
@ 2006-05-23 21:17         ` Chen, Kenneth W
  2006-05-23 21:40           ` update_mmu_cache vs. lazy_mmu_prot_update Christoph Lameter
  2006-05-23 22:28         ` remove VM_LOCKED before remap_pfn_range and drop VM_SHM Christoph Lameter
  2006-05-24  2:25         ` tracking dirty pages patches Arjan van de Ven
  3 siblings, 1 reply; 20+ messages in thread
From: Chen, Kenneth W @ 2006-05-23 21:17 UTC (permalink / raw)
  To: 'Hugh Dickins', Christoph Lameter
  Cc: Peter Zijlstra, Andrew Morton, Linus Torvalds, David Howells,
	Rohit Seth, linux-mm

Hugh Dickins wrote on Tuesday, May 23, 2006 1:34 PM
> On Tue, 23 May 2006, Christoph Lameter wrote:
> > 
> > That is true for ia64. However, the name "lazy_mmu_prot_update" suggests
> > that the intended scope is to cover protection updates in general. 
> > And we definitely change the protections of the page.
> 
> True, and I now see Documentation/cachetlb.txt documents it that way.
> Yet nothing but ia64 has any use for it.
> 
> > Maybe we could rename lazy_mmu_prot_update? What does icache/dcache 
> > aliasing have to do with page protection?
> 
> I'd strongly agree with you that it should be renamed: for a start,
> why does it say "lazy"?  That's an architectural implementation detail.
> 
> Except that, instead of agreeing it should be renamed, I say it should
> be deleted entirely.  It seems to represent that ia64 has an empty
> update_mmu_cache, and someone decided to add a new interface instead
> of giving ia64 that work to do in its update_mmu_cache.

My memory recollects that it was done just like what you suggested:
overloading update_mmu_cache for ia64, but it was vetoed by several mm
experts.  And as a result a new function was introduced.

- Ken

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* update_mmu_cache vs. lazy_mmu_prot_update
  2006-05-23 21:17         ` Chen, Kenneth W
@ 2006-05-23 21:40           ` Christoph Lameter
  2006-05-24 14:12             ` Hugh Dickins
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2006-05-23 21:40 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Hugh Dickins',
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Howells,
	Rohit Seth, linux-mm, agl, linux-ia64

On Tue, 23 May 2006, Chen, Kenneth W wrote:

> > Except that, instead of agreeing it should be renamed, I say it should
> > be deleted entirely.  It seems to represent that ia64 has an empty
> > update_mmu_cache, and someone decided to add a new interface instead
> > of giving ia64 that work to do in its update_mmu_cache.
> 
> My memory recollects that it was done just like what you suggested:
> overloading update_mmu_cache for ia64, but it was vetoed by several mm
> experts.  And as a result a new function was introduced.

lazy_mmu_prot_update is always called after update_mmu_cache except
when we change permissions (hugetlb_change_protection() and 
change_pte_range()). 

So if we conflate those two then arches may have to be updated to avoid 
flushing the mmu if we only modified protections.

I think update_mmu_cache() should be dropped in page_wrprotect_one() in 
order to be consistent scheme. And avoiding mmu flushes will increase the 
performance of page_wrprotect_one.. lazy_mmu_prot_update must be there 
since we are changing permissions.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: update_mmu_cache vs. lazy_mmu_prot_update
  2006-05-23 21:40           ` update_mmu_cache vs. lazy_mmu_prot_update Christoph Lameter
@ 2006-05-24 14:12             ` Hugh Dickins
  0 siblings, 0 replies; 20+ messages in thread
From: Hugh Dickins @ 2006-05-24 14:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chen, Kenneth W, Peter Zijlstra, Andrew Morton, Linus Torvalds,
	David Howells, Rohit Seth, linux-mm, agl, linux-ia64

On Tue, 23 May 2006, Christoph Lameter wrote:
> On Tue, 23 May 2006, Chen, Kenneth W wrote:
> 
> > My memory recollects that it was done just like what you suggested:
> > overloading update_mmu_cache for ia64, but it was vetoed by several mm
> > experts.  And as a result a new function was introduced.
> 
> lazy_mmu_prot_update is always called after update_mmu_cache except
> when we change permissions (hugetlb_change_protection() and 
> change_pte_range()). 
> 
> So if we conflate those two then arches may have to be updated to avoid 
> flushing the mmu if we only modified protections.

Ah, I missed those two lone usages of lazy_mmu_prot_update, thanks.
That makes sense, and fits with Ken's recollection: to have added
update_mmu_cache in those two places would have slowed down the
other architectures.

> I think update_mmu_cache() should be dropped in page_wrprotect_one() in 
> order to be consistent scheme. And avoiding mmu flushes will increase the 
> performance of page_wrprotect_one.. lazy_mmu_prot_update must be there 
> since we are changing permissions.

Agreed.

I'd still like to rename lazy_mmu_prot_update, and refactor it, but
that can be a later unrelated cleanup.  What makes sense to me is to
call it update_mmu_cache_prot, and #define the ia64 update_mmu_cache
to that: so we can unclutter common code from most of the
lazy_mmu_prot_update lines, leaving just those two significant
instances of update_mmu_cache_prot that you highlight.

And of the two instances of update_mmu_cache in mm/fremap.c:
it seems to me that the first, in install_page, ought to have a
lazy_mmu_prot_update (and will get it automatically by the #define
I suggest); whereas the second, in install_file_pte, ought not to
have an update_mmu_cache since it's installing a !present entry.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* remove VM_LOCKED before remap_pfn_range and drop VM_SHM
  2006-05-23 20:34       ` Hugh Dickins
  2006-05-23 21:16         ` Christoph Lameter
  2006-05-23 21:17         ` Chen, Kenneth W
@ 2006-05-23 22:28         ` Christoph Lameter
  2006-05-24 14:57           ` Hugh Dickins
  2006-05-24  2:25         ` tracking dirty pages patches Arjan van de Ven
  3 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2006-05-23 22:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Andrew Morton, Linus Torvalds, David Howells,
	Rohit Seth, linux-mm, linux-kernel

On Tue, 23 May 2006, Hugh Dickins wrote:

> Oh yes, I'd noticed that subject going by, and meant to speak up
> sometime.  I feel pretty strongly, and have so declared in the past,
> that VM_LOCKED should _not_ guarantee that the same physical page is
> used forever: get_user_pages is what's used to pin a physical page
> for that effect.  I remember Arjan sharing this opinion.
> 
> You might discover a problem or two in letting page migration go that
> way, I'm not saying there cannot be a problem; but I'd much rather
> you try without adding a new flag unless it's proved necessary.
> And I know Linus prefers not to go overboard with extra flags.
> 
> You mentioned in one of the mails that went past that you'd seen
> drivers enforcing VM_LOCKED in vm_flags: aren't those just drivers
> copying other drivers which did so, but achieving nothing thereby,
> to be cleaned up in due course?  (The pages aren't even on LRU.)

Ok that seems to be the case. Also VM_SHM falls out once we deal with 
this:


Remove VM_LOCKED before remap_pfn range from device drivers and get rid of 
VM_SHM.

remap_pfn_range() already sets VM_IO. There is no need to set VM_SHM since
it does nothing. VM_LOCKED is of no use since the remap_pfn_range does
not place pages on the LRU. The pages are therefore never subject to
swap anyways. Remove all the vm_flags settings before calling
remap_pfn_range.

After removing all the vm_flag settings no use of VM_SHM is left.
Drop it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.17-rc4-mm3/drivers/video/igafb.c
===================================================================
--- linux-2.6.17-rc4-mm3.orig/drivers/video/igafb.c	2006-05-11 16:31:53.000000000 -0700
+++ linux-2.6.17-rc4-mm3/drivers/video/igafb.c	2006-05-23 15:10:31.101159381 -0700
@@ -232,9 +232,6 @@ static int igafb_mmap(struct fb_info *in
 
 	size = vma->vm_end - vma->vm_start;
 
-	/* To stop the swapper from even considering these pages. */
-	vma->vm_flags |= (VM_SHM | VM_LOCKED);
-
 	/* Each page, see which map applies */
 	for (page = 0; page < size; ) {
 		map_size = 0;
Index: linux-2.6.17-rc4-mm3/drivers/sbus/char/flash.c
===================================================================
--- linux-2.6.17-rc4-mm3.orig/drivers/sbus/char/flash.c	2006-05-11 16:31:53.000000000 -0700
+++ linux-2.6.17-rc4-mm3/drivers/sbus/char/flash.c	2006-05-23 15:10:31.102135883 -0700
@@ -71,7 +71,6 @@ flash_mmap(struct file *file, struct vm_
 	if (vma->vm_end - (vma->vm_start + (vma->vm_pgoff << PAGE_SHIFT)) > size)
 		size = vma->vm_end - (vma->vm_start + (vma->vm_pgoff << PAGE_SHIFT));
 
-	vma->vm_flags |= (VM_SHM | VM_LOCKED);
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 
 	if (io_remap_pfn_range(vma, vma->vm_start, addr, size, vma->vm_page_prot))
Index: linux-2.6.17-rc4-mm3/drivers/char/mmtimer.c
===================================================================
--- linux-2.6.17-rc4-mm3.orig/drivers/char/mmtimer.c	2006-05-11 16:31:53.000000000 -0700
+++ linux-2.6.17-rc4-mm3/drivers/char/mmtimer.c	2006-05-23 15:10:31.103112385 -0700
@@ -329,7 +329,6 @@ static int mmtimer_mmap(struct file *fil
 	if (PAGE_SIZE > (1 << 16))
 		return -ENOSYS;
 
-	vma->vm_flags |= (VM_IO | VM_SHM | VM_LOCKED );
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 
 	mmtimer_addr = __pa(RTC_COUNTER_ADDR);
Index: linux-2.6.17-rc4-mm3/arch/xtensa/kernel/pci.c
===================================================================
--- linux-2.6.17-rc4-mm3.orig/arch/xtensa/kernel/pci.c	2006-05-23 15:10:16.963363024 -0700
+++ linux-2.6.17-rc4-mm3/arch/xtensa/kernel/pci.c	2006-05-23 15:10:34.631214211 -0700
@@ -350,17 +350,6 @@ __pci_mmap_make_offset(struct pci_dev *d
 }
 
 /*
- * Set vm_flags of VMA, as appropriate for this architecture, for a pci device
- * mapping.
- */
-static __inline__ void
-__pci_mmap_set_flags(struct pci_dev *dev, struct vm_area_struct *vma,
-		     enum pci_mmap_state mmap_state)
-{
-	vma->vm_flags |= VM_SHM | VM_LOCKED | VM_IO;
-}
-
-/*
  * Set vm_page_prot of VMA, as appropriate for this architecture, for a pci
  * device mapping.
  */
@@ -399,7 +388,6 @@ int pci_mmap_page_range(struct pci_dev *
 	if (ret < 0)
 		return ret;
 
-	__pci_mmap_set_flags(dev, vma, mmap_state);
 	__pci_mmap_set_pgprot(dev, vma, mmap_state, write_combine);
 
 	ret = io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
Index: linux-2.6.17-rc4-mm3/arch/powerpc/kernel/pci_64.c
===================================================================
--- linux-2.6.17-rc4-mm3.orig/arch/powerpc/kernel/pci_64.c	2006-05-23 15:10:16.675294926 -0700
+++ linux-2.6.17-rc4-mm3/arch/powerpc/kernel/pci_64.c	2006-05-23 15:10:34.376347182 -0700
@@ -875,7 +875,6 @@ int pci_mmap_page_range(struct pci_dev *
 		return -EINVAL;
 
 	vma->vm_pgoff = offset >> PAGE_SHIFT;
-	vma->vm_flags |= VM_SHM | VM_LOCKED | VM_IO;
 	vma->vm_page_prot = __pci_mmap_set_pgprot(dev, rp,
 						  vma->vm_page_prot,
 						  mmap_state, write_combine);
Index: linux-2.6.17-rc4-mm3/arch/i386/pci/i386.c
===================================================================
--- linux-2.6.17-rc4-mm3.orig/arch/i386/pci/i386.c	2006-05-23 15:10:16.427263411 -0700
+++ linux-2.6.17-rc4-mm3/arch/i386/pci/i386.c	2006-05-23 15:10:34.373417675 -0700
@@ -276,8 +276,6 @@ int pci_mmap_page_range(struct pci_dev *
 	/* Leave vm_pgoff as-is, the PCI space address is the physical
 	 * address on this platform.
 	 */
-	vma->vm_flags |= (VM_SHM | VM_LOCKED | VM_IO);
-
 	prot = pgprot_val(vma->vm_page_prot);
 	if (boot_cpu_data.x86 > 3)
 		prot |= _PAGE_PCD | _PAGE_PWT;
Index: linux-2.6.17-rc4-mm3/arch/ppc/kernel/pci.c
===================================================================
--- linux-2.6.17-rc4-mm3.orig/arch/ppc/kernel/pci.c	2006-05-23 15:10:16.782710149 -0700
+++ linux-2.6.17-rc4-mm3/arch/ppc/kernel/pci.c	2006-05-23 15:10:34.379276688 -0700
@@ -1040,7 +1040,6 @@ int pci_mmap_page_range(struct pci_dev *
 		return -EINVAL;
 
 	vma->vm_pgoff = offset >> PAGE_SHIFT;
-	vma->vm_flags |= VM_SHM | VM_LOCKED | VM_IO;
 	vma->vm_page_prot = __pci_mmap_set_pgprot(dev, rp,
 						  vma->vm_page_prot,
 						  mmap_state, write_combine);
Index: linux-2.6.17-rc4-mm3/arch/powerpc/kernel/proc_ppc64.c
===================================================================
--- linux-2.6.17-rc4-mm3.orig/arch/powerpc/kernel/proc_ppc64.c	2006-05-23 15:10:16.678224432 -0700
+++ linux-2.6.17-rc4-mm3/arch/powerpc/kernel/proc_ppc64.c	2006-05-23 15:10:34.377323684 -0700
@@ -115,8 +115,6 @@ static int page_map_mmap( struct file *f
 {
 	struct proc_dir_entry *dp = PDE(file->f_dentry->d_inode);
 
-	vma->vm_flags |= VM_SHM | VM_LOCKED;
-
 	if ((vma->vm_end - vma->vm_start) > dp->size)
 		return -EINVAL;
 
Index: linux-2.6.17-rc4-mm3/drivers/sbus/char/vfc_dev.c
===================================================================
--- linux-2.6.17-rc4-mm3.orig/drivers/sbus/char/vfc_dev.c	2006-05-23 15:08:28.409538512 -0700
+++ linux-2.6.17-rc4-mm3/drivers/sbus/char/vfc_dev.c	2006-05-23 15:10:34.634143717 -0700
@@ -623,7 +623,7 @@ static int vfc_mmap(struct file *file, s
 		map_size = sizeof(struct vfc_regs);
 
 	vma->vm_flags |=
-		(VM_SHM | VM_LOCKED | VM_IO | VM_MAYREAD | VM_MAYWRITE | VM_MAYSHARE);
+		(VM_MAYREAD | VM_MAYWRITE | VM_MAYSHARE);
 	map_offset = (unsigned int) (long)dev->phys_regs;
 	ret = io_remap_pfn_range(vma, vma->vm_start,
 				  MK_IOSPACE_PFN(dev->which_io,
Index: linux-2.6.17-rc4-mm3/arch/powerpc/kernel/pci_32.c
===================================================================
--- linux-2.6.17-rc4-mm3.orig/arch/powerpc/kernel/pci_32.c	2006-05-23 15:10:16.674318424 -0700
+++ linux-2.6.17-rc4-mm3/arch/powerpc/kernel/pci_32.c	2006-05-23 15:10:34.375370679 -0700
@@ -1654,7 +1654,6 @@ int pci_mmap_page_range(struct pci_dev *
 		return -EINVAL;
 
 	vma->vm_pgoff = offset >> PAGE_SHIFT;
-	vma->vm_flags |= VM_SHM | VM_LOCKED | VM_IO;
 	vma->vm_page_prot = __pci_mmap_set_pgprot(dev, rp,
 						  vma->vm_page_prot,
 						  mmap_state, write_combine);
Index: linux-2.6.17-rc4-mm3/arch/cris/arch-v32/drivers/pci/bios.c
===================================================================
--- linux-2.6.17-rc4-mm3.orig/arch/cris/arch-v32/drivers/pci/bios.c	2006-05-23 15:10:16.278835103 -0700
+++ linux-2.6.17-rc4-mm3/arch/cris/arch-v32/drivers/pci/bios.c	2006-05-23 15:10:34.372441173 -0700
@@ -27,8 +27,6 @@ int pci_mmap_page_range(struct pci_dev *
 	/* Leave vm_pgoff as-is, the PCI space address is the physical
 	 * address on this platform.
 	 */
-	vma->vm_flags |= (VM_SHM | VM_LOCKED | VM_IO);
-
 	prot = pgprot_val(vma->vm_page_prot);
 	vma->vm_page_prot = __pgprot(prot);
 
Index: linux-2.6.17-rc4-mm3/arch/arm/kernel/bios32.c
===================================================================
--- linux-2.6.17-rc4-mm3.orig/arch/arm/kernel/bios32.c	2006-05-23 15:10:16.265164074 -0700
+++ linux-2.6.17-rc4-mm3/arch/arm/kernel/bios32.c	2006-05-23 15:10:34.369511667 -0700
@@ -702,7 +702,6 @@ int pci_mmap_page_range(struct pci_dev *
 	/*
 	 * Mark this as IO
 	 */
-	vma->vm_flags |= VM_SHM | VM_LOCKED | VM_IO;
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 
 	if (remap_pfn_range(vma, vma->vm_start, phys,
Index: linux-2.6.17-rc4-mm3/arch/ia64/pci/pci.c
===================================================================
--- linux-2.6.17-rc4-mm3.orig/arch/ia64/pci/pci.c	2006-05-23 15:10:16.497571557 -0700
+++ linux-2.6.17-rc4-mm3/arch/ia64/pci/pci.c	2006-05-23 15:20:27.401143385 -0700
@@ -602,8 +602,6 @@ pci_mmap_page_range (struct pci_dev *dev
 	 * Leave vm_pgoff as-is, the PCI space address is the physical
 	 * address on this platform.
 	 */
-	vma->vm_flags |= (VM_SHM | VM_RESERVED | VM_IO);
-
 	if (write_combine && efi_range_is_wc(vma->vm_start,
 					     vma->vm_end - vma->vm_start))
 		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
@@ -666,7 +664,6 @@ pci_mmap_legacy_page_range(struct pci_bu
 
 	vma->vm_pgoff += (unsigned long)addr >> PAGE_SHIFT;
 	vma->vm_page_prot = prot;
-	vma->vm_flags |= (VM_SHM | VM_RESERVED | VM_IO);
 
 	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
 			    size, vma->vm_page_prot))
Index: linux-2.6.17-rc4-mm3/include/linux/mm.h
===================================================================
--- linux-2.6.17-rc4-mm3.orig/include/linux/mm.h	2006-05-23 15:10:21.641784240 -0700
+++ linux-2.6.17-rc4-mm3/include/linux/mm.h	2006-05-23 15:14:41.099348991 -0700
@@ -145,7 +145,6 @@ extern unsigned int kobjsize(const void 
 
 #define VM_GROWSDOWN	0x00000100	/* general info on the segment */
 #define VM_GROWSUP	0x00000200
-#define VM_SHM		0x00000000	/* Means nothing: delete it later */
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: remove VM_LOCKED before remap_pfn_range and drop VM_SHM
  2006-05-23 22:28         ` remove VM_LOCKED before remap_pfn_range and drop VM_SHM Christoph Lameter
@ 2006-05-24 14:57           ` Hugh Dickins
  0 siblings, 0 replies; 20+ messages in thread
From: Hugh Dickins @ 2006-05-24 14:57 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Andrew Morton, Linus Torvalds, David Howells,
	Rohit Seth, linux-mm, linux-kernel

On Tue, 23 May 2006, Christoph Lameter wrote:
> 
> Remove VM_LOCKED before remap_pfn range from device drivers and get rid of 
> VM_SHM.

Okay, that is rather a nice cleanup.  I've held off from doing it,
and have discouraged one or two others from doing it, because there's
a number of other things to be checked thereabouts (witness the way
vfc_dev.c is or'ing flags it has no business to change: but you've
rightly preserved that existing behaviour for now, however bad it may
be); and there's VM_RESERVED (or most of its or'ings) to be removed too.

But what you have looks nice, and no way does it prevent further
cleanup later; though I've not wanted to bother maintainers repeatedly.

Of course, you don't need this patch in order to proceed with migrating
VM_LOCKED areas, because this patch is no more than a cleanup of
irrelevance.  Well, somewhat worse than irrelevance: when a driver
unilaterally sets VM_LOCKED on a vma, then mm->locked_vm goes wild
when the vma is unmapped: doesn't matter at exit, but bad if before.

> remap_pfn_range() already sets VM_IO. There is no need to set VM_SHM since
> it does nothing. VM_LOCKED is of no use since the remap_pfn_range does
> not place pages on the LRU. The pages are therefore never subject to
> swap anyways. Remove all the vm_flags settings before calling
> remap_pfn_range.
> 
> After removing all the vm_flag settings no use of VM_SHM is left.
> Drop it.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>

Acked-by: Hugh Dickins <hugh@veritas.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tracking dirty pages patches
  2006-05-23 20:34       ` Hugh Dickins
                           ` (2 preceding siblings ...)
  2006-05-23 22:28         ` remove VM_LOCKED before remap_pfn_range and drop VM_SHM Christoph Lameter
@ 2006-05-24  2:25         ` Arjan van de Ven
  2006-05-24 15:10           ` Hugh Dickins
  3 siblings, 1 reply; 20+ messages in thread
From: Arjan van de Ven @ 2006-05-24  2:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, Rohit Seth, David Howells, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Christoph Lameter

On Tue, 2006-05-23 at 21:34 +0100, Hugh Dickins wrote:
> > 
> > Page migration currently also assumes that VM_LOCKED means do not move the 
> > page. At some point we may want to have a separate flag that guarantees
> > that a page should not be moved. This would enable the moving of VM_LOCKED 
> > pages.
> 
> Oh yes, I'd noticed that subject going by, and meant to speak up
> sometime.  I feel pretty strongly, and have so declared in the past,
> that VM_LOCKED should _not_ guarantee that the same physical page is
> used forever: get_user_pages is what's used to pin a physical page
> for that effect.  I remember Arjan sharing this opinion.

correct. 


> You mentioned in one of the mails that went past that you'd seen
> drivers enforcing VM_LOCKED in vm_flags: aren't those just drivers
> copying other drivers which did so, but achieving nothing thereby,
> to be cleaned up in due course?  (The pages aren't even on LRU.)


I would like to know which, because in general this is a security hole:
Any driver that depends on locked meaning "doesn't move" can be fooled
by the user into becoming unlocked... (by virtue of having another
thread do an munlock on the memory). As such no kernel driver should 
depend on this, and as far as I know, no kernel driver actually does.
(early infiniband drivers used to, but they fixed that well before
things got merged to use the get_user_pages API, exactly for this
reason)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tracking dirty pages patches
  2006-05-24  2:25         ` tracking dirty pages patches Arjan van de Ven
@ 2006-05-24 15:10           ` Hugh Dickins
  2006-05-25  2:26             ` Arjan van de Ven
  0 siblings, 1 reply; 20+ messages in thread
From: Hugh Dickins @ 2006-05-24 15:10 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-mm, Rohit Seth, David Howells, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Christoph Lameter

On Wed, 24 May 2006, Arjan van de Ven wrote:
> On Tue, 2006-05-23 at 21:34 +0100, Hugh Dickins wrote:
> 
> > You mentioned in one of the mails that went past that you'd seen
> > drivers enforcing VM_LOCKED in vm_flags: aren't those just drivers
> > copying other drivers which did so, but achieving nothing thereby,
> > to be cleaned up in due course?  (The pages aren't even on LRU.)
> 
> I would like to know which, because in general this is a security hole:
> Any driver that depends on locked meaning "doesn't move" can be fooled
> by the user into becoming unlocked... (by virtue of having another
> thread do an munlock on the memory). As such no kernel driver should 
> depend on this, and as far as I know, no kernel driver actually does.

You'll have seen the list in Christoph's patch.  But they're all
remap_pfn_range users, largely copied one from another, and their
pages won't become freeable even if the user munlocks.

However, that munlocking will lower locked_vm when it shouldn't
touch it.  I suppose the ingenious might mmap and munmap such a
driver in order to lock another mapping beyond RLIMIT_MEMLOCK.
Perhaps that raises the priority of Christoph's patch?

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tracking dirty pages patches
  2006-05-24 15:10           ` Hugh Dickins
@ 2006-05-25  2:26             ` Arjan van de Ven
  0 siblings, 0 replies; 20+ messages in thread
From: Arjan van de Ven @ 2006-05-25  2:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, Rohit Seth, David Howells, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Christoph Lameter

On Wed, 2006-05-24 at 16:10 +0100, Hugh Dickins wrote:
> On Wed, 24 May 2006, Arjan van de Ven wrote:
> > On Tue, 2006-05-23 at 21:34 +0100, Hugh Dickins wrote:
> > 
> > > You mentioned in one of the mails that went past that you'd seen
> > > drivers enforcing VM_LOCKED in vm_flags: aren't those just drivers
> > > copying other drivers which did so, but achieving nothing thereby,
> > > to be cleaned up in due course?  (The pages aren't even on LRU.)
> > 
> > I would like to know which, because in general this is a security hole:
> > Any driver that depends on locked meaning "doesn't move" can be fooled
> > by the user into becoming unlocked... (by virtue of having another
> > thread do an munlock on the memory). As such no kernel driver should 
> > depend on this, and as far as I know, no kernel driver actually does.
> 
> You'll have seen the list in Christoph's patch.  But they're all
> remap_pfn_range users, largely copied one from another, and their
> pages won't become freeable even if the user munlocks.

that's not "real memory" though so not too relevant for the scenario I
had in mind...
> 
> However, that munlocking will lower locked_vm when it shouldn't
> touch it.  I suppose the ingenious might mmap and munmap such a
> driver in order to lock another mapping beyond RLIMIT_MEMLOCK.
> Perhaps that raises the priority of Christoph's patch?

... but yes that also makes it a security issue, although a bit less
severe I suppose


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tracking dirty pages patches
  2006-05-22 19:31 tracking dirty pages patches Hugh Dickins
  2006-05-22 20:29 ` Andrew Morton
  2006-05-23 16:24 ` Christoph Lameter
@ 2006-05-23 16:41 ` David Howells
  2006-05-23 23:07 ` Peter Zijlstra
  3 siblings, 0 replies; 20+ messages in thread
From: David Howells @ 2006-05-23 16:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Andrew Morton, Linus Torvalds, David Howells, linux-mm

Hugh Dickins <hugh@veritas.com> wrote:

> Andrew has resolved that in -mm2 and -mm3 by dropping David's patches
> for the umpteenth time; but whatever happens to the rest of FS-Cache,
> I suspect the page_mkwrite patch will return (Anton wants it for NTFS,
> probably others will want it, but none are relying upon it yet).

At least four other groups have stated an interest in it or a desire to use
the functionality it provides: FUSE, OCFS2, NTFS and JFFS2.  Also, things like
EXT3 really ought to use it to deal with the case of shared-writable mmap
encountering ENOSPC before we permit the page to be dirtied.  I presume Hugh
wants it for Veritas also.

I've attached a copy of the patch for reference.

David

---
The attached patch adds a new VMA operation to notify a filesystem or other
driver about the MMU generating a fault because userspace attempted to write
to a page mapped through a read-only PTE.

This facility permits the filesystem or driver to:

 (*) Implement storage allocation/reservation on attempted write, and so to
     deal with problems such as ENOSPC more gracefully (perhaps by generating
     SIGBUS).

 (*) Delay making the page writable until the contents have been written to a
     backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
     It permits the filesystem to have some guarantee about the state of the
     cache.

 (*) Account and limit number of dirty pages. This is one piece of the puzzle
     needed to make shared writable mapping work safely in FUSE.

Signed-Off-By: David Howells <dhowells@redhat.com>
---

 include/linux/mm.h |    4 ++
 mm/memory.c        |   99 +++++++++++++++++++++++++++++++++++++++-------------
 mm/mmap.c          |   12 +++++-
 mm/mprotect.c      |   11 +++++-
 4 files changed, 98 insertions(+), 28 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1154684..cd3c2cf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -200,6 +200,10 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
+
+	/* notification that a previously read-only page is about to become
+	 * writable, if an error is returned it will cause a SIGBUS */
+	int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 0ec7bc6..6c6891e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1445,25 +1445,59 @@ static int do_wp_page(struct mm_struct *
 {
 	struct page *old_page, *new_page;
 	pte_t entry;
-	int ret = VM_FAULT_MINOR;
+	int reuse, ret = VM_FAULT_MINOR;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page)
 		goto gotten;
 
-	if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
-		int reuse = can_share_swap_page(old_page);
-		unlock_page(old_page);
-		if (reuse) {
-			flush_cache_page(vma, address, pte_pfn(orig_pte));
-			entry = pte_mkyoung(orig_pte);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			ptep_set_access_flags(vma, address, page_table, entry, 1);
-			update_mmu_cache(vma, address, entry);
-			lazy_mmu_prot_update(entry);
-			ret |= VM_FAULT_WRITE;
-			goto unlock;
+	if (unlikely(vma->vm_flags & VM_SHARED)) {
+		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
+			/*
+			 * Notify the address space that the page is about to
+			 * become writable so that it can prohibit this or wait
+			 * for the page to get into an appropriate state.
+			 *
+			 * We do this without the lock held, so that it can
+			 * sleep if it needs to.
+			 */
+			page_cache_get(old_page);
+			pte_unmap_unlock(page_table, ptl);
+
+			if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
+				goto unwritable_page;
+
+			page_cache_release(old_page);
+
+			/*
+			 * Since we dropped the lock we need to revalidate
+			 * the PTE as someone else may have changed it.  If
+			 * they did, we just return, as we can count on the
+			 * MMU to tell us if they didn't also make it writable.
+			 */
+			page_table = pte_offset_map_lock(mm, pmd, address,
+							 &ptl);
+			if (!pte_same(*page_table, orig_pte))
+				goto unlock;
 		}
+
+		reuse = 1;
+	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
+		reuse = can_share_swap_page(old_page);
+		unlock_page(old_page);
+	} else {
+		reuse = 0;
+	}
+
+	if (reuse) {
+		flush_cache_page(vma, address, pte_pfn(orig_pte));
+		entry = pte_mkyoung(orig_pte);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		ptep_set_access_flags(vma, address, page_table, entry, 1);
+		update_mmu_cache(vma, address, entry);
+		lazy_mmu_prot_update(entry);
+		ret |= VM_FAULT_WRITE;
+		goto unlock;
 	}
 
 	/*
@@ -1523,6 +1557,10 @@ oom:
 	if (old_page)
 		page_cache_release(old_page);
 	return VM_FAULT_OOM;
+
+unwritable_page:
+	page_cache_release(old_page);
+	return VM_FAULT_SIGBUS;
 }
 
 /*
@@ -2074,18 +2112,31 @@ retry:
 	/*
 	 * Should we do an early C-O-W break?
 	 */
-	if (write_access && !(vma->vm_flags & VM_SHARED)) {
-		struct page *page;
+	if (write_access) {
+		if (!(vma->vm_flags & VM_SHARED)) {
+			struct page *page;
 
-		if (unlikely(anon_vma_prepare(vma)))
-			goto oom;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-		if (!page)
-			goto oom;
-		copy_user_highpage(page, new_page, address);
-		page_cache_release(new_page);
-		new_page = page;
-		anon = 1;
+			if (unlikely(anon_vma_prepare(vma)))
+				goto oom;
+			page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+			if (!page)
+				goto oom;
+			copy_user_highpage(page, new_page, address);
+			page_cache_release(new_page);
+			new_page = page;
+			anon = 1;
+
+		} else {
+			/* if the page will be shareable, see if the backing
+			 * address space wants to know that the page is about
+			 * to become writable */
+			if (vma->vm_ops->page_mkwrite &&
+			    vma->vm_ops->page_mkwrite(vma, new_page) < 0
+			    ) {
+				page_cache_release(new_page);
+				return VM_FAULT_SIGBUS;
+			}
+		}
 	}
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
diff --git a/mm/mmap.c b/mm/mmap.c
index e6ee123..6446c61 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1065,7 +1065,8 @@ munmap_back:
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
 	vma->vm_flags = vm_flags;
-	vma->vm_page_prot = protection_map[vm_flags & 0x0f];
+	vma->vm_page_prot = protection_map[vm_flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma->vm_pgoff = pgoff;
 
 	if (file) {
@@ -1089,6 +1090,12 @@ munmap_back:
 			goto free_vma;
 	}
 
+	/* Don't make the VMA automatically writable if it's shared, but the
+	 * backer wishes to know when pages are first written to */
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		vma->vm_page_prot =
+			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
+
 	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
 	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
 	 * that memory reservation must be checked; but that reservation
@@ -1921,7 +1928,8 @@ unsigned long do_brk(unsigned long addr,
 	vma->vm_end = addr + len;
 	vma->vm_pgoff = pgoff;
 	vma->vm_flags = flags;
-	vma->vm_page_prot = protection_map[flags & 0x0f];
+	vma->vm_page_prot = protection_map[flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 4c14d42..2697abd 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -106,6 +106,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	unsigned long oldflags = vma->vm_flags;
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	unsigned int mask;
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
@@ -132,8 +133,6 @@ mprotect_fixup(struct vm_area_struct *vm
 		}
 	}
 
-	newprot = protection_map[newflags & 0xf];
-
 	/*
 	 * First try to merge with previous and/or next vma.
 	 */
@@ -160,6 +159,14 @@ mprotect_fixup(struct vm_area_struct *vm
 	}
 
 success:
+	/* Don't make the VMA automatically writable if it's shared, but the
+	 * backer wishes to know when pages are first written to */
+	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		mask &= ~VM_SHARED;
+
+	newprot = protection_map[newflags & mask];
+
 	/*
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tracking dirty pages patches
  2006-05-22 19:31 tracking dirty pages patches Hugh Dickins
                   ` (2 preceding siblings ...)
  2006-05-23 16:41 ` David Howells
@ 2006-05-23 23:07 ` Peter Zijlstra
  2006-05-24 14:20   ` Hugh Dickins
  3 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2006-05-23 23:07 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, David Howells, linux-mm,
	Christoph Lameter

On Mon, 2006-05-22 at 20:31 +0100, Hugh Dickins wrote:
> Belated observations on your "tracking dirty pages" patches.

Thanks for the thorough examination, I always suspected there was
something I'd overlooked, this being my first foray into these parts of
the code.

> page_wrprotect is a nice use of rmap, but I see a couple of problems.
> One is in the lock ordering (there's info on mm lock ordering at the
> top of filemap.c, but I find the list at the top of rmap.c easier).
> 
> set_page_dirty has always (awkwardly) been liable to be called from
> very low in the hierarchy; whereas you're assuming clear_page_dirty
> is called from much higher up.  And in most cases there's no problem
> (please cross-check to confirm that); but try_to_free_buffers in fs/
> buffer.c calls it while holding mapping->private_lock - page_wrprotect
> called from test_clear_page_dirty then violates the order.
> 
> If we're lucky and that is indeed the only violation, maybe Andrew
> can recommend a change to try_to_free_buffers to avoid it: I have
> no appreciation of the issues at that end myself.

Not really familiar with the code myself either, but from some
inspection it seems safe to do so. ->private_lock seems to serialise
access to the page buffers, not the page state.

Will be in the next version.

> The other worries are in page_wrprotect_one's block
> 	entry = pte_mkclean(pte_wrprotect(*pte));
> 	ptep_establish(vma, address, pte, entry);
> 	update_mmu_cache(vma, address, entry);
> 	lazy_mmu_prot_update(entry);
> ptep_establish, update_mmu_cache and lazy_mmu_prot_update are tricky
> arch-dependent functions which have hitherto only been used on the
> current task mm, whereas you're now using them from (perhaps) another.
> 
> Well, no, I'm wrong: ptrace's get_user_pages has been using them
> from another process; but that's not so common a case as to reassure
> me there won't be issues on some architectures there.

Christoph Lameter has cleared up the waters here, thanks!

> Quite likely ptep_establish and update_mmu_cache are okay for use in
> that way (needs careful checking of arches), at least they take a vma
> argument from which the mm can be found.  Whereas lazy_mmu_prot_update
> looks likely to be wrong, but only does something on ia64: you need
> to consult ia64 mm gurus to check what's needed there.  Maybe it'll
> just be a suboptimal issue (but more important now than in ptrace
> to make it optimal).

Not much here, except to say: thanks for the discussions, they were very
educative.

> Is there a problem with page_wrprotect on VM_LOCKED vmas?  I'm not
> sure: usually VM_LOCKED guarantees no faulting, you abandon that.

Also cleared up by Christoph, he is my hero for today ;-)

> Like others, I don't care for "VM_SharedWritable": you followed the
> VM_ReadHint macros, but this isn't a read hint, and those are weird.
> 
> Personally, I much prefer the explicit
> 	((vma->vm_flags & (VM_SHARED|VM_WRITE)) == (VM_SHARED|VM_WRITE))
> which is the usual style for vm_flags tests throughout mm (except for
> the hugetlb test designed to melt away without HUGETLB).  But I may be
> in a minority on that, Linus did put an is_cow_mapping() in memory.c
> recently, so maybe you'd follow that and say is_shared_writable().

OK, done, is_shared_writable() is it.

> There's a clash and overlap between your "tracking dirty pages" patches
> and David Howell's "add notification of page becoming writable" patch.
> The merge of the two in 2.6.17-rc4-mm1 was wrong: your handle_pte_fault
> change meant it never reached David's page_mkwrite call in do_wp_page.

...

> Please take a look at that patch (David reposted it on linux-kernel
> last Friday, as 08/14 of FS-Cache try #10): I went over it with him
> many months ago, and it fills in at least one gap you're missing...
> 
> mprotect: we all forget mprotect, but mprotect(,,PROT_READ)
> followed by mprotect(,,PROT_WRITE) will give write permission to all
> those ptes you've carefully taken write permission from.  In the
> page_mkwrite patch, we found that was most easily fixed by using
> the !VM_SHARED vm_page_prot in place of the VM_SHARED one.  I
> expect you can simplify your patch a little by doing the same.

Whee, this took me a while to understand, but I think I've got it.
If I do what you propose to do, would there be any remaining users of
the MAP_SHARED part of protection_map left?

I shall try this approach tomorrow if I find some time.

> msync: I rather think that with your changes, if they're to stay,
> then all the page table walking code can be removed from msync -
> since it already skipped vmas which were not VM_SHARED, and there's
> nothing to gain from syncing the !mapping_cap_account_dirty ones.
> I think MS_ASYNC becomes a no-op, and sys_msync so small it won't
> deserve its own msync.c (madvise.c wouldn't be a bad place for it).
> Or am I missing something?

I vaguely remember some discussions on the semantics of these things.
I'll reread and examine the code.

> I'm not convinced that optimize-follow_pages is a worthwhile optimization
> (in some cases you're adding an atomic inc and dec), and it's irrelevant
> to your tracking of dirty pages, but I don't feel strongly about it.
> Except, if it stays then it needs fixing: the flags 0 case is doing
> a put_page without having done a get_page.

Not sure on the benefit either, I just did it to educate myself on the
subject (and blotched it on my way). Christoph kindly fixed the
offending condition.

I guess this patch could really do with some numbers if found that the
set_page_dirty() is needed at all.

> Though currently it seems only some powerpc #ifdef __DEBUG code is using
> follow_pages in that way: since that's not the common case, I think you'd
> best just remove the "if (flags & (FOLL_GET | FOLL_TOUCH))" condition
> from before the get_page.
> 
> (Why does follow_pages set_page_dirty at all?  I _think_ it's in case
> the get_user_pages caller forgets to set_page_dirty when releasing.
> But that's not how we usually write kernel code, to hide mistakes most
> of the time, and your mods may change the balance there.  Andrew will
> remember better whether that set_page_dirty has stronger justification.)
> 
> Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tracking dirty pages patches
  2006-05-23 23:07 ` Peter Zijlstra
@ 2006-05-24 14:20   ` Hugh Dickins
  0 siblings, 0 replies; 20+ messages in thread
From: Hugh Dickins @ 2006-05-24 14:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Linus Torvalds, David Howells, linux-mm,
	Christoph Lameter

On Wed, 24 May 2006, Peter Zijlstra wrote:
> On Mon, 2006-05-22 at 20:31 +0100, Hugh Dickins wrote:
> 
> > I'm not convinced that optimize-follow_pages is a worthwhile optimization
> > (in some cases you're adding an atomic inc and dec), and it's irrelevant
> > to your tracking of dirty pages, but I don't feel strongly about it.
> > Except, if it stays then it needs fixing: the flags 0 case is doing
> > a put_page without having done a get_page.
> 
> Not sure on the benefit either, I just did it to educate myself on the
> subject (and blotched it on my way). Christoph kindly fixed the
> offending condition.
> 
> I guess this patch could really do with some numbers if found that the
> set_page_dirty() is needed at all.

Just drop that patch from the set.  It's a distraction from the rest,
and I believe we'll optimize it much better by removing those tests
and their set_page_dirty (but not immediately).

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2006-05-25  2:26 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-05-22 19:31 tracking dirty pages patches Hugh Dickins
2006-05-22 20:29 ` Andrew Morton
2006-05-23  8:17   ` Nick Piggin
2006-05-23 14:55   ` Hugh Dickins
2006-05-23 16:24 ` Christoph Lameter
2006-05-23 19:21   ` Hugh Dickins
2006-05-23 19:31     ` Christoph Lameter
2006-05-23 20:34       ` Hugh Dickins
2006-05-23 21:16         ` Christoph Lameter
2006-05-23 21:17         ` Chen, Kenneth W
2006-05-23 21:40           ` update_mmu_cache vs. lazy_mmu_prot_update Christoph Lameter
2006-05-24 14:12             ` Hugh Dickins
2006-05-23 22:28         ` remove VM_LOCKED before remap_pfn_range and drop VM_SHM Christoph Lameter
2006-05-24 14:57           ` Hugh Dickins
2006-05-24  2:25         ` tracking dirty pages patches Arjan van de Ven
2006-05-24 15:10           ` Hugh Dickins
2006-05-25  2:26             ` Arjan van de Ven
2006-05-23 16:41 ` David Howells
2006-05-23 23:07 ` Peter Zijlstra
2006-05-24 14:20   ` Hugh Dickins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox