* get_user_pages() with write=1 and force=1 gets read-only pages. @ 2005-07-30 20:53 Robin Holt 2005-07-30 22:13 ` Hugh Dickins 0 siblings, 1 reply; 72+ messages in thread
From: Robin Holt @ 2005-07-30 20:53 UTC (permalink / raw)
To: linux-mm

I am chasing a bug which I think I understand, but would like some confirmation.

I believe I have two processes calling get_user_pages at approximately the same time. One is calling with write=0, the other with write=1 and force=1. The vma has vm_ops->nopage set to filemap_nopage.

Both faulters get to the point in do_no_page of being ready to insert the pte. The first one to get the mm->page_table_lock must be the reader. The readable pte gets inserted and results in the writer detecting the pte and returning VM_FAULT_MINOR.

Upon return, the writer then does 'lookup_write = write && !force;' and then calls follow_page without having the write flag set.

Am I on the right track with this? Is the correct fix to not just pass in the write flag untouched?

I believe the change was made by Roland McGrath, but I don't see an email address for him.

Thanks, Robin Holt

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 72+ messages in thread
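The downgrade Robin describes can be sketched as a userspace model (illustrative only, not kernel code; the struct and function names here are invented): after the racing read fault returns VM_FAULT_MINOR, get_user_pages recomputes lookup_write = write && !force, so a *forced* write lookup no longer insists on a writable pte.

```c
#include <stdbool.h>

/* Hypothetical model of a pte as far as this lookup cares. */
struct pte_model { bool present; bool writable; };

/* follow_page analogue: succeeds if the pte satisfies the lookup. */
bool follow_page_model(struct pte_model p, bool need_write)
{
    return p.present && (!need_write || p.writable);
}

/* The post-fault lookup described in the report. */
bool gup_accepts(struct pte_model pte_after_fault, bool write, bool force)
{
    bool lookup_write = write && !force;    /* the downgrade in question */
    return follow_page_model(pte_after_fault, lookup_write);
}

/* Convenience wrapper taking plain ints, for quick checks. */
int gup_accepts_vals(int present, int writable, int write, int force)
{
    struct pte_model p = { present != 0, writable != 0 };
    return gup_accepts(p, write != 0, force != 0);
}
```

With force=1 the recomputed lookup accepts the read-only pte the racing reader installed, which is exactly the reported bug; with force=0 the same pte is rejected and the fault retried.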
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-30 20:53 get_user_pages() with write=1 and force=1 gets read-only pages Robin Holt @ 2005-07-30 22:13 ` Hugh Dickins 2005-07-31 1:52 ` Nick Piggin 0 siblings, 1 reply; 72+ messages in thread From: Hugh Dickins @ 2005-07-30 22:13 UTC (permalink / raw) To: Robin Holt; +Cc: Roland McGrath, linux-mm On Sat, 30 Jul 2005, Robin Holt wrote: > I am chasing a bug which I think I understand, but would like some > confirmation. > > I believe I have two processes calling get_user_pages at approximately > the same time. One is calling with write=0. The other with write=1 > and force=1. The vma has the vm_ops->nopage set to filemap_nopage. > > Both faulters get to the point in do_no_page of being ready to insert > the pte. The first one to get the mm->page_table_lock must be the reader. > The readable pte gets inserted and results in the writer detecting the > pte and returning VM_FAULT_MINOR. > > Upon return, the writer the does 'lookup_write = write && !force;' > and then calls follow_page without having the write flag set. > > Am I on the right track with this? I do believe you are. Twice I've inserted fault code to cope with that "surely no longer have a shared page we shouldn't write" assumption, but I think you've just demonstrated that it's inherently unsafe. Certainly goes against the traditional grain of fault handlers, which can just try again when in doubt - as in the pte_same checks you've observed. > Is the correct fix to not just pass in the write flag untouched? I don't understand you there. Suspect you're confusing me with that "not", which perhaps expresses hesitancy, but shouldn't be there? But the correct fix would not be to pass in the write flag untouched: it's trying to avoid an endless loop of finding the pte not writable when ptrace is modifying a page which the user is currently protected against writing to (setting a breakpoint in readonly text, perhaps?). 
get_user_pages is hard! I don't know the right answer offhand, but thank you for posing a good question.

> I believe the change was made by Roland McGrath, but I don't see an
> email address for him.

I've CC'ed roland@redhat.com

Hugh
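Hugh's endless-loop concern can be made concrete with a bounded userspace sketch (hypothetical names; the real loop has no iteration cap, which is the point): if get_user_pages simply kept demanding a writable pte, then for a page the user is protected against writing, the fault handler can never satisfy it.

```c
#include <stdbool.h>

/* Model: each iteration runs the fault handler, then retries the
 * lookup.  On a read-only vma (ptrace poking read-only text) the pte
 * never becomes writable.  Returns the iteration count on success,
 * or -1 where the real, uncapped loop would spin forever. */
int lookup_write_iterations(bool pte_can_become_writable, int max_tries)
{
    for (int i = 1; i <= max_tries; i++) {
        /* fault handler runs here; a read-only vma stays read-only */
        if (pte_can_become_writable)
            return i;       /* follow_page(write=1) now succeeds */
    }
    return -1;              /* would livelock in the real code */
}
```

This is why the write flag cannot simply be passed through untouched: the force case exists precisely so the loop can terminate on mappings the user cannot write.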
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-30 22:13 ` Hugh Dickins @ 2005-07-31 1:52 ` Nick Piggin 2005-07-31 10:52 ` Robin Holt 0 siblings, 1 reply; 72+ messages in thread From: Nick Piggin @ 2005-07-31 1:52 UTC (permalink / raw) To: Hugh Dickins; +Cc: Robin Holt, Roland McGrath, linux-mm [-- Attachment #1: Type: text/plain, Size: 261 bytes --] Hugh Dickins wrote: > get_user_pages is hard! I don't know the right answer offhand, > but thank you for posing a good question. > Detect the racing fault perhaps, and retry until we're sure that a write fault has gone through? -- SUSE Labs, Novell Inc. [-- Attachment #2: mm-gup-fix.patch --] [-- Type: text/plain, Size: 3774 bytes --] Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h 2005-07-28 19:04:34.000000000 +1000 +++ linux-2.6/include/linux/mm.h 2005-07-31 11:40:24.000000000 +1000 @@ -625,6 +625,7 @@ static inline int page_mapped(struct pag * Used to decide whether a process gets delivered SIGBUS or * just gets major/minor fault counters bumped up. */ +#define VM_FAULT_RACE (-2) #define VM_FAULT_OOM (-1) #define VM_FAULT_SIGBUS 0 #define VM_FAULT_MINOR 1 Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c 2005-07-28 19:04:37.000000000 +1000 +++ linux-2.6/mm/memory.c 2005-07-31 11:49:35.000000000 +1000 @@ -964,6 +964,14 @@ int get_user_pages(struct task_struct *t return i ? i : -EFAULT; case VM_FAULT_OOM: return i ? i : -ENOMEM; + case VM_FAULT_RACE: + /* + * Someone else got there first. + * Must retry before we can assume + * that we have actually performed + * the write fault (below). 
+ */ + continue; default: BUG(); } @@ -1224,6 +1232,7 @@ static int do_wp_page(struct mm_struct * struct page *old_page, *new_page; unsigned long pfn = pte_pfn(pte); pte_t entry; + int ret; if (unlikely(!pfn_valid(pfn))) { /* @@ -1280,7 +1289,9 @@ static int do_wp_page(struct mm_struct * */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); + ret = VM_FAULT_RACE; if (likely(pte_same(*page_table, pte))) { + ret = VM_FAULT_MINOR; if (PageAnon(old_page)) dec_mm_counter(mm, anon_rss); if (PageReserved(old_page)) @@ -1299,7 +1310,7 @@ static int do_wp_page(struct mm_struct * page_cache_release(new_page); page_cache_release(old_page); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return ret; no_new_page: page_cache_release(old_page); @@ -1654,7 +1665,7 @@ static int do_swap_page(struct mm_struct if (likely(pte_same(*page_table, orig_pte))) ret = VM_FAULT_OOM; else - ret = VM_FAULT_MINOR; + ret = VM_FAULT_RACE; pte_unmap(page_table); spin_unlock(&mm->page_table_lock); goto out; @@ -1676,7 +1687,7 @@ static int do_swap_page(struct mm_struct spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); if (unlikely(!pte_same(*page_table, orig_pte))) { - ret = VM_FAULT_MINOR; + ret = VM_FAULT_RACE; goto out_nomap; } @@ -1737,6 +1748,7 @@ do_anonymous_page(struct mm_struct *mm, { pte_t entry; struct page * page = ZERO_PAGE(addr); + int ret = VM_FAULT_MINOR; /* Read-only mapping of ZERO_PAGE. 
*/ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); @@ -1760,6 +1772,7 @@ do_anonymous_page(struct mm_struct *mm, pte_unmap(page_table); page_cache_release(page); spin_unlock(&mm->page_table_lock); + ret = VM_FAULT_RACE; goto out; } inc_mm_counter(mm, rss); @@ -1779,7 +1792,7 @@ do_anonymous_page(struct mm_struct *mm, lazy_mmu_prot_update(entry); spin_unlock(&mm->page_table_lock); out: - return VM_FAULT_MINOR; + return ret; no_mem: return VM_FAULT_OOM; } @@ -1897,6 +1910,7 @@ retry: pte_unmap(page_table); page_cache_release(new_page); spin_unlock(&mm->page_table_lock); + ret = VM_FAULT_RACE; goto out; } Index: linux-2.6/arch/i386/mm/fault.c =================================================================== --- linux-2.6.orig/arch/i386/mm/fault.c 2005-07-28 19:03:48.000000000 +1000 +++ linux-2.6/arch/i386/mm/fault.c 2005-07-31 11:47:48.000000000 +1000 @@ -351,6 +351,8 @@ good_area: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } ^ permalink raw reply [flat|nested] 72+ messages in thread
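A userspace model of the retry the patch introduces (assumed names, not the kernel's actual types): handle_mm_fault reports VM_FAULT_RACE when the pte_same check fails under the page_table_lock, meaning someone else installed a pte first, and get_user_pages loops and faults again rather than assuming its own write fault completed.

```c
/* Mirror of the new return-code space from the patch. */
enum vmf { VMF_RACE = -2, VMF_OOM = -1, VMF_SIGBUS = 0, VMF_MINOR = 1, VMF_MAJOR = 2 };

/* Fault model: the first 'races_left' attempts lose the pte_same
 * check and report VMF_RACE; after that the fault completes. */
enum vmf fault_model(int *races_left)
{
    if (*races_left > 0) {
        (*races_left)--;
        return VMF_RACE;
    }
    return VMF_MINOR;
}

/* gup loop model: how many faults until one actually completed. */
int gup_faults_until_done(int races)
{
    int attempts = 0;
    enum vmf r;
    do {
        attempts++;
        r = fault_model(&races);
    } while (r == VMF_RACE);    /* the 'continue' in the patch */
    return attempts;
}
```

Because each racing fault makes forward progress for *someone*, the retry sequence is finite even though this caller may iterate more than once.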
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-31 1:52 ` Nick Piggin @ 2005-07-31 10:52 ` Robin Holt 2005-07-31 11:07 ` Nick Piggin 0 siblings, 1 reply; 72+ messages in thread
From: Robin Holt @ 2005-07-31 10:52 UTC (permalink / raw)
To: Nick Piggin; +Cc: Hugh Dickins, Robin Holt, Roland McGrath, linux-mm

Should there be a check to ensure we don't return VM_FAULT_RACE when the pte which was inserted is exactly the same one we would have inserted? Could we generalize that more to the point of only returning VM_FAULT_RACE when write access was requested but the racing pte was not writable?

Most of the test cases I have thrown at this have had the writer faulting first, which did not result in problems. I would hate to slow things down if not necessary. I am unaware of more issues than the one I have been tripping over.

Thanks, Robin Holt
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-31 10:52 ` Robin Holt @ 2005-07-31 11:07 ` Nick Piggin 2005-07-31 11:30 ` Robin Holt 0 siblings, 1 reply; 72+ messages in thread From: Nick Piggin @ 2005-07-31 11:07 UTC (permalink / raw) To: Robin Holt; +Cc: Hugh Dickins, Roland McGrath, linux-mm Robin Holt wrote: > Should there be a check to ensure we don't return VM_FAULT_RACE when the > pte which was inserted is exactly the same one we would have inserted? That would slow down the do_xxx_fault fastpaths, though. Considering VM_FAULT_RACE will only make any difference to get_user_pages (ie. not the page fault fastpath), and only then in rare cases of a racing fault on the same pte, I don't think the extra test would be worthwhile. > Could we generalize that more to the point of only returning VM_FAULT_RACE > when write access was requested but the racing pte was not writable? > I guess get_user_pages could be changed to retry on VM_FAULT_RACE only if it is attempting write access... is that worthwhile? I guess so... > Most of the test cases I have thrown at this have gotten the writer > faulting first which did not result in problems. I would hate to slow > things down if not necessary. I am unaware of more issues than the one > I have been tripping. > I think the VM_FAULT_RACE patch as-is should be fairly unintrusive to the page fault fastpaths. I think weighing down get_user_pages is preferable to putting logic in the general fault path - though I don't think there should be too much overhead introduced even there... Do you think the patch (or at least, the idea) looks like a likely solution to your problem? Obviously the !i386 architecture specific parts still need to be filled in... Thanks, Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-31 11:07 ` Nick Piggin @ 2005-07-31 11:30 ` Robin Holt 2005-07-31 11:39 ` Robin Holt 2005-07-31 12:09 ` Robin Holt 0 siblings, 2 replies; 72+ messages in thread From: Robin Holt @ 2005-07-31 11:30 UTC (permalink / raw) To: Nick Piggin; +Cc: Robin Holt, Hugh Dickins, Roland McGrath, linux-mm On Sun, Jul 31, 2005 at 09:07:24PM +1000, Nick Piggin wrote: > Robin Holt wrote: > >Should there be a check to ensure we don't return VM_FAULT_RACE when the > >pte which was inserted is exactly the same one we would have inserted? > > That would slow down the do_xxx_fault fastpaths, though. > > Considering VM_FAULT_RACE will only make any difference to get_user_pages > (ie. not the page fault fastpath), and only then in rare cases of a racing > fault on the same pte, I don't think the extra test would be worthwhile. > > >Could we generalize that more to the point of only returning VM_FAULT_RACE > >when write access was requested but the racing pte was not writable? > > > > I guess get_user_pages could be changed to retry on VM_FAULT_RACE only if > it is attempting write access... is that worthwhile? I guess so... > > >Most of the test cases I have thrown at this have gotten the writer > >faulting first which did not result in problems. I would hate to slow > >things down if not necessary. I am unaware of more issues than the one > >I have been tripping. > > > > I think the VM_FAULT_RACE patch as-is should be fairly unintrusive to the > page fault fastpaths. I think weighing down get_user_pages is preferable to > putting logic in the general fault path - though I don't think there should > be too much overhead introduced even there... > > Do you think the patch (or at least, the idea) looks like a likely solution > to your problem? Obviously the !i386 architecture specific parts still need > to be filled in... The patch works for me. 
What I was thinking didn't seem that much heavier than what is already being done. I guess a patch against your patch might be a better illustration: This is on top of your patch: Index: linux/mm/memory.c =================================================================== --- linux.orig/mm/memory.c 2005-07-31 05:39:24.161826311 -0500 +++ linux/mm/memory.c 2005-07-31 06:26:33.687274327 -0500 @@ -1768,17 +1768,17 @@ do_anonymous_page(struct mm_struct *mm, spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, addr); + entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, + vma->vm_page_prot)), vma); if (!pte_none(*page_table)) { + if (!pte_same(*page_table, entry)) + ret = VM_FAULT_RACE; pte_unmap(page_table); page_cache_release(page); spin_unlock(&mm->page_table_lock); - ret = VM_FAULT_RACE; goto out; } inc_mm_counter(mm, rss); - entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, - vma->vm_page_prot)), - vma); lru_cache_add_active(page); SetPageReferenced(page); page_add_anon_rmap(page, vma, addr); @@ -1879,6 +1879,10 @@ retry: } page_table = pte_offset_map(pmd, address); + entry = mk_pte(new_page, vma->vm_page_prot); + if (write_access) + entry = maybe_mkwrite(pte_mkdirty(entry), vma); + /* * This silly early PAGE_DIRTY setting removes a race * due to the bad i386 page protection. But it's valid @@ -1895,9 +1899,6 @@ retry: inc_mm_counter(mm, rss); flush_icache_page(vma, new_page); - entry = mk_pte(new_page, vma->vm_page_prot); - if (write_access) - entry = maybe_mkwrite(pte_mkdirty(entry), vma); set_pte_at(mm, address, page_table, entry); if (anon) { lru_cache_add_active(new_page); @@ -1906,11 +1907,12 @@ retry: page_add_file_rmap(new_page); pte_unmap(page_table); } else { + if (!pte_same(*page_table, entry)) + ret=VM_FAULT_RACE; /* One of our sibling threads was faster, back out. 
*/ pte_unmap(page_table); page_cache_release(new_page); spin_unlock(&mm->page_table_lock); - ret = VM_FAULT_RACE; goto out; }

In both cases, we have read the value from the pte immediately before this, so all the processor infrastructure is already in place and the second read should be extremely quick. In truth, the compiler should eliminate the second load, but I cannot guarantee that.

What do you think?

Robin
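The refinement Robin is proposing — compute the would-be pte first, and only report a race when the entry that beat us actually differs from ours — can be modeled in userspace like this (hypothetical types and names):

```c
#include <stdbool.h>

/* Minimal stand-in for the fields pte_same would compare. */
struct pte_model { unsigned long pfn; bool writable; };

bool pte_same_model(struct pte_model a, struct pte_model b)
{
    return a.pfn == b.pfn && a.writable == b.writable;
}

/* Returns 1 for VM_FAULT_MINOR, -2 for VM_FAULT_RACE: if a sibling
 * installed exactly the entry we would have, the fault effectively
 * completed and no retry is needed; a differing entry (e.g. a
 * read-only one) forces the caller to retry. */
int backout_result(struct pte_model installed_by_racer, struct pte_model ours)
{
    if (pte_same_model(installed_by_racer, ours))
        return 1;
    return -2;
}

/* Convenience wrapper taking plain ints, for quick checks. */
int backout_result_vals(unsigned long racer_pfn, int racer_w,
                        unsigned long our_pfn, int our_w)
{
    struct pte_model a = { racer_pfn, racer_w != 0 };
    struct pte_model b = { our_pfn, our_w != 0 };
    return backout_result(a, b);
}
```

The design trade-off under discussion is whether this extra comparison in the fault backout path pays for the retries it avoids in get_user_pages.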
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-31 11:30 ` Robin Holt @ 2005-07-31 11:39 ` Robin Holt 2005-07-31 12:09 ` Robin Holt 1 sibling, 0 replies; 72+ messages in thread From: Robin Holt @ 2005-07-31 11:39 UTC (permalink / raw) To: Robin Holt; +Cc: Nick Piggin, Hugh Dickins, Roland McGrath, linux-mm After actually thinking, I realize the do_anonymous_page path changes are pointless. Please ignore those. Thanks, Robin > Index: linux/mm/memory.c > =================================================================== > --- linux.orig/mm/memory.c 2005-07-31 05:39:24.161826311 -0500 > +++ linux/mm/memory.c 2005-07-31 06:26:33.687274327 -0500 > @@ -1768,17 +1768,17 @@ do_anonymous_page(struct mm_struct *mm, > spin_lock(&mm->page_table_lock); > page_table = pte_offset_map(pmd, addr); > > + entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, > + vma->vm_page_prot)), vma); > if (!pte_none(*page_table)) { > + if (!pte_same(*page_table, entry)) > + ret = VM_FAULT_RACE; > pte_unmap(page_table); > page_cache_release(page); > spin_unlock(&mm->page_table_lock); > - ret = VM_FAULT_RACE; > goto out; > } > inc_mm_counter(mm, rss); > - entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, > - vma->vm_page_prot)), > - vma); > lru_cache_add_active(page); > SetPageReferenced(page); > page_add_anon_rmap(page, vma, addr); > @@ -1879,6 +1879,10 @@ retry: > } > page_table = pte_offset_map(pmd, address); > > + entry = mk_pte(new_page, vma->vm_page_prot); > + if (write_access) > + entry = maybe_mkwrite(pte_mkdirty(entry), vma); > + > /* > * This silly early PAGE_DIRTY setting removes a race > * due to the bad i386 page protection. But it's valid -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-31 11:30 ` Robin Holt 2005-07-31 11:39 ` Robin Holt @ 2005-07-31 12:09 ` Robin Holt 2005-07-31 22:27 ` Nick Piggin 1 sibling, 1 reply; 72+ messages in thread
From: Robin Holt @ 2005-07-31 12:09 UTC (permalink / raw)
To: Robin Holt; +Cc: Nick Piggin, Hugh Dickins, Roland McGrath, linux-mm

Just for good measure, I added some counters to do_no_page and tweaked my earlier patch a little. I made the check look more like:

	atomic64_inc(&do_no_page_collisions);
	if (write_access && !pte_write(*page_table)) {
		ret = VM_FAULT_RACE;
		atomic64_inc(&do_no_page_asked_write_got_read);
	}

After running the system for a while, I looked at the counters: of the 1162 collisions I had, write_got_read only incremented 4 times.

I am running a customer test program. I don't really know what it does, but I believe it is dealing with a large preinitialized data block; they DIO-write data from a file into various areas of the block at the same time as other threads read through the pages looking for work to do. During a normal run, without their application, I have not seen do_no_page_asked_write_got_read increment in more than 1/2 hour of run time.

So far, I think the case for setting VM_FAULT_RACE only when there is a conflict over writability seems strong.

Thanks, Robin
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-31 12:09 ` Robin Holt @ 2005-07-31 22:27 ` Nick Piggin 2005-08-01 3:22 ` Roland McGrath 0 siblings, 1 reply; 72+ messages in thread
From: Nick Piggin @ 2005-07-31 22:27 UTC (permalink / raw)
To: Robin Holt; +Cc: Hugh Dickins, Roland McGrath, linux-mm

[-- Attachment #1: Type: text/plain, Size: 430 bytes --]

Robin Holt wrote:
> So far, I think the case for setting VM_FAULT_RACE only when there
> is a conflict for writable seems strong.

But that's 1162 collisions out of how many faults? And only on the get_user_pages faults does the return value really make a difference.

How about the following? This should catch some cases. You still miss the case where you're racing with another writer though.

-- SUSE Labs, Novell Inc.

[-- Attachment #2: mm-gup-write-fix.patch --]
[-- Type: text/plain, Size: 469 bytes --]

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2005-07-31 11:49:35.000000000 +1000
+++ linux-2.6/mm/memory.c	2005-08-01 08:21:21.000000000 +1000
@@ -971,7 +971,9 @@ int get_user_pages(struct task_struct *t
 				 * that we have actually performed
 				 * the write fault (below).
 				 */
-				continue;
+				if (write)
+					continue;
+				break;
 			default:
 				BUG();
 			}
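The effect of the two-liner — on VM_FAULT_RACE, retry only when write access was wanted — can be sketched as a userspace model (made-up names; the return values -2 and 1 stand in for VM_FAULT_RACE and VM_FAULT_MINOR):

```c
/* Model of the gup loop after the tweak: a racing *read* fault that
 * installed some valid pte is good enough for a read lookup, but a
 * write lookup must fault again until its own fault went through.
 * 'racing_faults' is the number of attempts that lose the race. */
int gup_attempts(int write, int racing_faults)
{
    int attempts = 0;
    for (;;) {
        attempts++;
        /* each racing fault makes handle_mm_fault report VM_FAULT_RACE */
        int r = (racing_faults > 0) ? -2 : 1;
        if (racing_faults > 0)
            racing_faults--;
        if (r == -2 && write)
            continue;   /* must be sure our write fault completed */
        break;          /* a reader can use whatever pte the racer installed */
    }
    return attempts;
}
```

This keeps the extra iterations confined to the rare write-vs-racer case, which is the only one where the stale pte is actually dangerous.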
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-31 22:27 ` Nick Piggin @ 2005-08-01 3:22 ` Roland McGrath 2005-08-01 8:21 ` [patch 2.6.13-rc4] fix get_user_pages bug Nick Piggin 0 siblings, 1 reply; 72+ messages in thread
From: Roland McGrath @ 2005-08-01 3:22 UTC (permalink / raw)
To: Nick Piggin; +Cc: Robin Holt, Hugh Dickins, linux-mm

The basic style of this fix seems appropriate to me. I really don't think it matters if get_user_pages does extra iterations of the lookup or fault path in the race situations. The unnecessary ones will always be short anyway.

Thanks, Roland
* [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 3:22 ` Roland McGrath @ 2005-08-01 8:21 ` Nick Piggin 2005-08-01 9:19 ` Ingo Molnar ` (2 more replies) 0 siblings, 3 replies; 72+ messages in thread From: Nick Piggin @ 2005-08-01 8:21 UTC (permalink / raw) To: Robin Holt, Andrew Morton, Linus Torvalds Cc: Roland McGrath, Hugh Dickins, linux-mm, linux-kernel [-- Attachment #1: Type: text/plain, Size: 500 bytes --] Hi, Not sure if this should be fixed for 2.6.13. It can result in pagecache corruption: so I guess that answers my own question. This was tested by Robin and appears to solve the problem. Roland had a quick look and thought the basic idea was sound. I'd like to get a couple more acks before going forward, and in particular Robin was contemplating possible efficiency improvements (although efficiency can wait on correctness). Feedback please, anyone. Thanks, Nick -- SUSE Labs, Novell Inc. [-- Attachment #2: mm-gup-fix.patch --] [-- Type: text/plain, Size: 8794 bytes --] When get_user_pages for write access races with another process in the page fault handler that has established the pte for read access, handle_mm_fault in get_user_pages will return VM_FAULT_MINOR even if it hasn't made the page correctly writeable (for example, broken COW). Thus the assumption that get_user_pages has a writeable page at the mapping after handle_mm_fault returns is incorrect. Fix this by reporting a raced (uncompleted) fault and retrying the lookup and fault in get_user_pages before making the assumption that we have a writeable page. Great work by Robin Holt <holt@sgi.com> to debug the problem. 
Signed-off-by: Nick Piggin <npiggin@suse.de> Index: linux-2.6/arch/i386/mm/fault.c =================================================================== --- linux-2.6.orig/arch/i386/mm/fault.c +++ linux-2.6/arch/i386/mm/fault.c @@ -351,6 +351,8 @@ good_area: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -625,6 +625,7 @@ static inline int page_mapped(struct pag * Used to decide whether a process gets delivered SIGBUS or * just gets major/minor fault counters bumped up. */ +#define VM_FAULT_RACE (-2) #define VM_FAULT_OOM (-1) #define VM_FAULT_SIGBUS 0 #define VM_FAULT_MINOR 1 Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -969,6 +969,16 @@ int get_user_pages(struct task_struct *t return i ? i : -EFAULT; case VM_FAULT_OOM: return i ? i : -ENOMEM; + case VM_FAULT_RACE: + /* + * Someone else got there first. + * Must retry before we can assume + * that we have actually performed + * the write fault (below). 
+ */ + if (write) + continue; + break; default: BUG(); } @@ -1229,6 +1239,7 @@ static int do_wp_page(struct mm_struct * struct page *old_page, *new_page; unsigned long pfn = pte_pfn(pte); pte_t entry; + int ret; if (unlikely(!pfn_valid(pfn))) { /* @@ -1285,7 +1296,9 @@ static int do_wp_page(struct mm_struct * */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); + ret = VM_FAULT_RACE; if (likely(pte_same(*page_table, pte))) { + ret = VM_FAULT_MINOR; if (PageAnon(old_page)) dec_mm_counter(mm, anon_rss); if (PageReserved(old_page)) @@ -1304,7 +1317,7 @@ static int do_wp_page(struct mm_struct * page_cache_release(new_page); page_cache_release(old_page); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return ret; no_new_page: page_cache_release(old_page); @@ -1659,7 +1672,7 @@ static int do_swap_page(struct mm_struct if (likely(pte_same(*page_table, orig_pte))) ret = VM_FAULT_OOM; else - ret = VM_FAULT_MINOR; + ret = VM_FAULT_RACE; pte_unmap(page_table); spin_unlock(&mm->page_table_lock); goto out; @@ -1681,7 +1694,7 @@ static int do_swap_page(struct mm_struct spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); if (unlikely(!pte_same(*page_table, orig_pte))) { - ret = VM_FAULT_MINOR; + ret = VM_FAULT_RACE; goto out_nomap; } @@ -1742,6 +1755,7 @@ do_anonymous_page(struct mm_struct *mm, { pte_t entry; struct page * page = ZERO_PAGE(addr); + int ret = VM_FAULT_MINOR; /* Read-only mapping of ZERO_PAGE. 
*/ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); @@ -1765,6 +1779,7 @@ do_anonymous_page(struct mm_struct *mm, pte_unmap(page_table); page_cache_release(page); spin_unlock(&mm->page_table_lock); + ret = VM_FAULT_RACE; goto out; } inc_mm_counter(mm, rss); @@ -1784,7 +1799,7 @@ do_anonymous_page(struct mm_struct *mm, lazy_mmu_prot_update(entry); spin_unlock(&mm->page_table_lock); out: - return VM_FAULT_MINOR; + return ret; no_mem: return VM_FAULT_OOM; } @@ -1902,6 +1917,7 @@ retry: pte_unmap(page_table); page_cache_release(new_page); spin_unlock(&mm->page_table_lock); + ret = VM_FAULT_RACE; goto out; } Index: linux-2.6/arch/alpha/mm/fault.c =================================================================== --- linux-2.6.orig/arch/alpha/mm/fault.c +++ linux-2.6/arch/alpha/mm/fault.c @@ -162,6 +162,8 @@ do_page_fault(unsigned long address, uns goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/arm/mm/fault.c =================================================================== --- linux-2.6.orig/arch/arm/mm/fault.c +++ linux-2.6/arch/arm/mm/fault.c @@ -195,6 +195,7 @@ survive: case VM_FAULT_MINOR: tsk->min_flt++; case VM_FAULT_SIGBUS: + case VM_FAULT_RACE: return fault; } Index: linux-2.6/arch/ia64/mm/fault.c =================================================================== --- linux-2.6.orig/arch/ia64/mm/fault.c +++ linux-2.6/arch/ia64/mm/fault.c @@ -164,6 +164,8 @@ ia64_do_page_fault (unsigned long addres goto bad_area; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/m32r/mm/fault.c =================================================================== --- linux-2.6.orig/arch/m32r/mm/fault.c +++ linux-2.6/arch/m32r/mm/fault.c @@ -234,6 +234,8 @@ survive: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/mips/mm/fault.c 
=================================================================== --- linux-2.6.orig/arch/mips/mm/fault.c +++ linux-2.6/arch/mips/mm/fault.c @@ -109,6 +109,8 @@ survive: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/ppc/mm/fault.c =================================================================== --- linux-2.6.orig/arch/ppc/mm/fault.c +++ linux-2.6/arch/ppc/mm/fault.c @@ -259,6 +259,8 @@ good_area: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/ppc64/mm/fault.c =================================================================== --- linux-2.6.orig/arch/ppc64/mm/fault.c +++ linux-2.6/arch/ppc64/mm/fault.c @@ -234,6 +234,8 @@ good_area: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/s390/mm/fault.c =================================================================== --- linux-2.6.orig/arch/s390/mm/fault.c +++ linux-2.6/arch/s390/mm/fault.c @@ -260,6 +260,8 @@ survive: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/sh/mm/fault.c =================================================================== --- linux-2.6.orig/arch/sh/mm/fault.c +++ linux-2.6/arch/sh/mm/fault.c @@ -101,6 +101,8 @@ survive: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/sparc/mm/fault.c =================================================================== --- linux-2.6.orig/arch/sparc/mm/fault.c +++ linux-2.6/arch/sparc/mm/fault.c @@ -302,8 +302,8 @@ good_area: current->maj_flt++; break; case VM_FAULT_MINOR: - default: current->min_flt++; + case VM_FAULT_RACE: break; } up_read(&mm->mmap_sem); Index: linux-2.6/arch/sparc64/mm/fault.c =================================================================== --- 
linux-2.6.orig/arch/sparc64/mm/fault.c +++ linux-2.6/arch/sparc64/mm/fault.c @@ -454,6 +454,8 @@ good_area: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/um/kernel/trap_kern.c =================================================================== --- linux-2.6.orig/arch/um/kernel/trap_kern.c +++ linux-2.6/arch/um/kernel/trap_kern.c @@ -76,6 +76,8 @@ int handle_page_fault(unsigned long addr case VM_FAULT_OOM: err = -ENOMEM; goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/xtensa/mm/fault.c =================================================================== --- linux-2.6.orig/arch/xtensa/mm/fault.c +++ linux-2.6/arch/xtensa/mm/fault.c @@ -113,6 +113,8 @@ survive: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } ^ permalink raw reply [flat|nested] 72+ messages in thread
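The arch-side hunks above all make the same change; as a userspace sketch (assumed names), VM_FAULT_RACE becomes a survivable outcome that bumps no fault counter and raises no signal — the racing fault was handled by someone else, so the handler just returns and lets the access retry.

```c
enum vmf { VMF_RACE = -2, VMF_OOM = -1, VMF_SIGBUS = 0, VMF_MINOR = 1, VMF_MAJOR = 2 };

struct task_counts { int min_flt; int maj_flt; };

/* Model of the arch fault-return switch.  Returns 0 on a survivable
 * fault, negative for the SIGBUS/OOM paths. */
int fault_dispatch(enum vmf r, struct task_counts *t)
{
    switch (r) {
    case VMF_MINOR:  t->min_flt++; break;
    case VMF_MAJOR:  t->maj_flt++; break;
    case VMF_RACE:   break;             /* no counter, no signal: just retry */
    case VMF_SIGBUS: return -1;
    case VMF_OOM:    return -2;
    }
    return 0;
}

/* Helper for quick checks: minor-fault count after one dispatch. */
int min_flt_after(enum vmf r)
{
    struct task_counts t = { 0, 0 };
    fault_dispatch(r, &t);
    return t.min_flt;
}
```

Without these hunks, a VM_FAULT_RACE return would fall into the `default: BUG();` arm of each architecture's switch, which is why every fault handler needs the new case.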
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 8:21 ` [patch 2.6.13-rc4] fix get_user_pages bug Nick Piggin @ 2005-08-01 9:19 ` Ingo Molnar 2005-08-01 9:27 ` Nick Piggin 2005-08-01 15:42 ` Linus Torvalds 2005-08-01 20:03 ` Hugh Dickins 2 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2005-08-01 9:19 UTC (permalink / raw)
To: Nick Piggin
Cc: Robin Holt, Andrew Morton, Linus Torvalds, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Hi,
>
> Not sure if this should be fixed for 2.6.13. It can result in
> pagecache corruption: so I guess that answers my own question.
>
> This was tested by Robin and appears to solve the problem. Roland had
> a quick look and thought the basic idea was sound. I'd like to get a
> couple more acks before going forward, and in particular Robin was
> contemplating possible efficiency improvements (although efficiency
> can wait on correctness).
>
> Feedback please, anyone.

It looks good to me, but wouldn't it be simpler (in terms of patch and architecture impact) to always retry the follow_page() in get_user_pages() in case of a minor fault? The sequence of minor faults must always be finite, so it's a safe solution, and it should solve the race too. The extra overhead of an additional follow_page() is small.

Especially with 2.6.13 being close, this looks like the safest bet; any improvement to this (i.e. your patch) can then be done in 2.6.14.

	Ingo

When get_user_pages for write access races with another process in the page fault handler that has established the pte for read access, handle_mm_fault in get_user_pages will return VM_FAULT_MINOR even if it hasn't made the page correctly writeable (for example, broken COW). Thus the assumption that get_user_pages has a writeable page at the mapping after handle_mm_fault returns is incorrect. Fix this by retrying the lookup and fault in get_user_pages before making the assumption that we have a writeable page.
Great work by Robin Holt <holt@sgi.com> to debug the problem.

Originally-from: Nick Piggin <npiggin@suse.de>

Simplified the patch into a two-liner:

Signed-off-by: Ingo Molnar <mingo@elte.hu>

 mm/memory.c |    7 +++++++
 1 files changed, 7 insertions(+)

Index: linux-sched-curr/mm/memory.c
===================================================================
--- linux-sched-curr.orig/mm/memory.c
+++ linux-sched-curr/mm/memory.c
@@ -961,6 +961,13 @@ int get_user_pages(struct task_struct *t
 		switch (handle_mm_fault(mm,vma,start,write)) {
 		case VM_FAULT_MINOR:
 			tsk->min_flt++;
+			/*
+			 * We might have raced with a readonly
+			 * pagefault, retry to make sure we
+			 * got write access:
+			 */
+			if (write)
+				continue;
 			break;
 		case VM_FAULT_MAJOR:
 			tsk->maj_flt++;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01  9:19 ` Ingo Molnar
@ 2005-08-01  9:27 ` Nick Piggin
  2005-08-01 10:15 ` Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread

From: Nick Piggin @ 2005-08-01 9:27 UTC
To: Ingo Molnar
Cc: Robin Holt, Andrew Morton, Linus Torvalds, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > Feedback please, anyone.
>
> it looks good to me, but wouldn't it be simpler (in terms of patch and
> architecture impact) to always retry the follow_page() in
> get_user_pages(), in case of a minor fault? The sequence of minor faults

I believe this can break some things. Hugh posted an example in his
recent post to linux-mm (ptrace setting a breakpoint in read-only
text). I think?

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01  9:27 ` Nick Piggin
@ 2005-08-01 10:15 ` Ingo Molnar
  2005-08-01 10:57 ` Nick Piggin
  0 siblings, 1 reply; 72+ messages in thread

From: Ingo Molnar @ 2005-08-01 10:15 UTC
To: Nick Piggin
Cc: Robin Holt, Andrew Morton, Linus Torvalds, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Ingo Molnar wrote:
> > * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > Feedback please, anyone.
> >
> > it looks good to me, but wouldn't it be simpler (in terms of patch and
> > architecture impact) to always retry the follow_page() in
> > get_user_pages(), in case of a minor fault? The sequence of minor faults
>
> I believe this can break some things. Hugh posted an example in his
> recent post to linux-mm (ptrace setting a breakpoint in read-only
> text). I think?

Hugh's posting said:

 "it's trying to avoid an endless loop of finding the pte not writable
  when ptrace is modifying a page which the user is currently protected
  against writing to (setting a breakpoint in readonly text, perhaps?)"

i'm wondering, why should that case generate an infinite fault? The
first write access should copy the shared-library page into a private
page and map it into the task's MM, writable. If this make-writable
operation races with a read access then we return a minor fault and the
page is still readonly, but retrying the write should then break up the
COW protection and generate a writable page, and a subsequent
follow_page() success. If the page cannot be made writable, shouldn't
the vma flags reflect this fact by not having the VM_MAYWRITE flag, and
hence get_user_pages() should have returned with -EFAULT earlier?

in other words, can a named MAP_PRIVATE vma with VM_MAYWRITE set ever
be non-COW-break-able and thus have the potential to induce an infinite
loop?

	Ingo
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 10:15 ` Ingo Molnar
@ 2005-08-01 10:57 ` Nick Piggin
  2005-08-01 19:43 ` Hugh Dickins
  0 siblings, 1 reply; 72+ messages in thread

From: Nick Piggin @ 2005-08-01 10:57 UTC
To: Ingo Molnar
Cc: Robin Holt, Andrew Morton, Linus Torvalds, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel

Ingo Molnar wrote:
>
> Hugh's posting said:
>
> "it's trying to avoid an endless loop of finding the pte not writable
> when ptrace is modifying a page which the user is currently protected
> against writing to (setting a breakpoint in readonly text, perhaps?)"
>
> i'm wondering, why should that case generate an infinite fault? The
> first write access should copy the shared-library page into a private
> page and map it into the task's MM, writable. If this make-writable

It will be mapped readonly.

> operation races with a read access then we return a minor fault and the
> page is still readonly, but retrying the write should then break up the
> COW protection and generate a writable page, and a subsequent
> follow_page() success. If the page cannot be made writable, shouldn't the
> vma flags reflect this fact by not having the VM_MAYWRITE flag, and
> hence get_user_pages() should have returned with -EFAULT earlier?

If it cannot be written to, then yes. If it can be written to
but is mapped readonly then you have the problem.

Aside, that brings up an interesting question - why should readonly
mappings of writeable files (with VM_MAYWRITE set) disallow ptrace
write access while readonly mappings of readonly files do not? Or am I
horribly confused?

--
SUSE Labs, Novell Inc.
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 10:57 ` Nick Piggin
@ 2005-08-01 19:43 ` Hugh Dickins
  2005-08-01 20:08 ` Linus Torvalds
  0 siblings, 1 reply; 72+ messages in thread

From: Hugh Dickins @ 2005-08-01 19:43 UTC
To: Nick Piggin
Cc: Ingo Molnar, Robin Holt, Andrew Morton, Linus Torvalds, Roland McGrath, linux-mm, linux-kernel

On Mon, 1 Aug 2005, Nick Piggin wrote:
> Ingo Molnar wrote:
> > Hugh's posting said:
> >
> > "it's trying to avoid an endless loop of finding the pte not writable
> > when ptrace is modifying a page which the user is currently protected
> > against writing to (setting a breakpoint in readonly text, perhaps?)"
> >
> > i'm wondering, why should that case generate an infinite fault? The first
> > write access should copy the shared-library page into a private page and
> > map it into the task's MM, writable. If this make-writable
>
> It will be mapped readonly.

Yes. Sorry to leave you so long pondering over my words!

> > operation races with a read access then we return a minor fault and the
> > page is still readonly, but retrying the write should then break up the
> > COW protection and generate a writable page, and a subsequent
> > follow_page() success. If the page cannot be made writable, shouldn't the
> > vma flags reflect this fact by not having the VM_MAYWRITE flag, and hence
> > get_user_pages() should have returned with -EFAULT earlier?
>
> If it cannot be written to, then yes. If it can be written to
> but is mapped readonly then you have the problem.

Yes. The problem case is the one where "maybe_mkwrite" finds !VM_WRITE
and so doesn't set writable (but as Linus has observed, dirty is set).

I'm no expert on that case, was just guessing its use; I think Robin
determined that Roland is the expert on it. But what's very clear is
that it's intentional, allowing ptrace to write where the user cannot
currently write.

> Aside, that brings up an interesting question - why should readonly
> mappings of writeable files (with VM_MAYWRITE set) disallow ptrace
> write access while readonly mappings of readonly files not? Or am I
> horribly confused?

Either you or I. You'll have to spell that out to me in more detail,
I don't see it that way.

What I do see and am slightly disturbed by, is that do_wp_page on a
shared maywrite but not currently writable area will go the break-cow
route in mainline, and has done so forever; whereas my page_mkwrite
fixes in -mm inadvertently change that to go the reuse route. I think
my inadvertent change is correct, and the current behaviour (putting
anonymous pages into a shared vma) is surprising, and may have bad
consequences (would at least get the overcommit accounting wrong).

Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 19:43 ` Hugh Dickins
@ 2005-08-01 20:08 ` Linus Torvalds
  2005-08-01 21:06 ` Hugh Dickins
  0 siblings, 1 reply; 72+ messages in thread

From: Linus Torvalds @ 2005-08-01 20:08 UTC
To: Hugh Dickins
Cc: Nick Piggin, Ingo Molnar, Robin Holt, Andrew Morton, Roland McGrath, linux-mm, linux-kernel, Martin Schwidefsky

On Mon, 1 Aug 2005, Hugh Dickins wrote:
>
> > Aside, that brings up an interesting question - why should readonly
> > mappings of writeable files (with VM_MAYWRITE set) disallow ptrace
> > write access while readonly mappings of readonly files not? Or am I
> > horribly confused?
>
> Either you or I. You'll have to spell that out to me in more detail,
> I don't see it that way.

We have always just done a COW if it's read-only - even if it's shared.

The point being that if a process did a read-only mapping, and a
tracer wants to modify memory, the tracer is always allowed to do so, but
it's _not_ going to write anything back to the filesystem. Writing
something back to an executable just because the user happened to mmap it
with MAP_SHARED (but read-only) _and_ the user had the right to write to
that fd is _not_ ok.

So VM_MAYWRITE is totally immaterial. We _will_not_write_ (and must not
do so) to the backing store through ptrace unless it was literally a
writable mapping (in which case VM_WRITE will be set, and the page table
should be marked writable in the first place).

So we have two choices:

 - not allow the write at all in ptrace (which I think we did at some
   point)

   This ends up being really inconvenient, and people seem to really
   expect to be able to write to readonly areas in debuggers. And doing
   "MAP_SHARED, PROT_READ" seems to be a common thing (Linux has
   supported that pretty much since day #1 when mmap was supported - long
   before writable shared mappings were supported, Linux accepted
   MAP_SHARED + PROT_READ not just because we could, but because Unix
   apps do use it).

or

 - turn a shared read-only page into a private page on ptrace write

   This is what we've been doing. It's strange, and it _does_ change
   semantics (it's not shared any more, so the debugger writing to it
   means that now you don't see changes to that file by others), so it's
   clearly not "correct" either, but it's certainly a million times
   better than writing out breakpoints to shared files..

At some point (for the longest time), when a debugger was used to modify
a read-only page, we also made it writable to the user, which was much
easier from a VM standpoint. Now we have this "maybe_mkwrite()" thing,
which is part of the reason for this particular problem.

Using the dirty flag for a "page is _really_ writable" is admittedly kind
of hacky, but it does have the advantage of working even when the -real-
write bit isn't set due to "maybe_mkwrite()". If it forces the s390
people to add some more hacks for their strange VM, so be it..

[ Btw, on a totally unrelated note: anybody who is a git user and looks
  for when this maybe_mkwrite() thing happened, just doing

	git-whatchanged -p -Smaybe_mkwrite mm/memory.c

  in the bkcvs conversion pinpoints it immediately. Very useful git trick
  in case you ever have that kind of question. ]

I added Martin Schwidefsky to the Cc: explicitly, so that he can ping
whoever in the s390 team needs to figure out what the right thing is for
s390 and the dirty bit semantic change. Thanks for pointing it out.

		Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 20:08 ` Linus Torvalds
@ 2005-08-01 21:06 ` Hugh Dickins
  2005-08-01 21:51 ` Linus Torvalds
  0 siblings, 1 reply; 72+ messages in thread

From: Hugh Dickins @ 2005-08-01 21:06 UTC
To: Linus Torvalds
Cc: Nick Piggin, Ingo Molnar, Robin Holt, Andrew Morton, Roland McGrath, linux-mm, linux-kernel, Martin Schwidefsky

On Mon, 1 Aug 2005, Linus Torvalds wrote:
> On Mon, 1 Aug 2005, Hugh Dickins wrote:
> >
> > > Aside, that brings up an interesting question - why should readonly
> > > mappings of writeable files (with VM_MAYWRITE set) disallow ptrace
> > > write access while readonly mappings of readonly files not? Or am I
> > > horribly confused?
> >
> > Either you or I. You'll have to spell that out to me in more detail,
> > I don't see it that way.
>
> We have always just done a COW if it's read-only - even if it's shared.
>
> The point being that if a process did a read-only mapping, and a
> tracer wants to modify memory, the tracer is always allowed to do so, but
> it's _not_ going to write anything back to the filesystem. Writing
> something back to an executable just because the user happened to mmap it
> with MAP_SHARED (but read-only) _and_ the user had the right to write to
> that fd is _not_ ok.

I'll need to think that through, but not right now. It's a surprise
to me, and it's likely to surprise the current kernel too. I'd prefer
to say that if the executable was mapped shared from a writable fd,
then the tracer will write back to it; but you're clearly against that.
We may need to split the vma to handle your semantics correctly.

> Using the dirty flag for a "page is _really_ writable" is admittedly kind
> of hacky, but it does have the advantage of working even when the -real-
> write bit isn't set due to "maybe_mkwrite()". If it forces the s390 people
> to add some more hacks for their strange VM, so be it..

I'll concentrate on this for now. s390 used to have the same semantics
as everyone else; they made a conscious choice to diverge, and keep the
dirty bit in a separate array indexed by struct page, with a
page_test_and_clear_dirty macro they use instead of clearing pte_dirty.
It works all right, and I don't think it will prevent us communicating
between maybe_mkwrite and get_user_pages by a pte flag that would be
the usual pte_dirty on every other architecture. Just need to be
careful to handle the right one right.

Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 21:06 ` Hugh Dickins
@ 2005-08-01 21:51 ` Linus Torvalds
  2005-08-01 22:01 ` Linus Torvalds
  0 siblings, 1 reply; 72+ messages in thread

From: Linus Torvalds @ 2005-08-01 21:51 UTC
To: Hugh Dickins
Cc: Nick Piggin, Ingo Molnar, Robin Holt, Andrew Morton, Roland McGrath, linux-mm, linux-kernel, Martin Schwidefsky

On Mon, 1 Aug 2005, Hugh Dickins wrote:
> >
> > We have always just done a COW if it's read-only - even if it's shared.
> >
> > The point being that if a process did a read-only mapping, and a
> > tracer wants to modify memory, the tracer is always allowed to do so, but
> > it's _not_ going to write anything back to the filesystem. Writing
> > something back to an executable just because the user happened to mmap it
> > with MAP_SHARED (but read-only) _and_ the user had the right to write to
> > that fd is _not_ ok.
>
> I'll need to think that through, but not right now. It's a surprise
> to me, and it's likely to surprise the current kernel too.

Well, even if you did the write-back if VM_MAYWRITE is set, you'd still
have the case of having MAP_SHARED, PROT_READ _without_ VM_MAYWRITE being
set, and I'd expect that to actually be the common one (since you'd
normally use O_RDONLY to open a fd that you only want to map for reading).

And as mentioned, MAP_SHARED+PROT_READ does actually happen in real life.
Just do a google search on "MAP_SHARED PROT_READ -PROT_WRITE" and you'll
get tons of hits. For good reason too - because MAP_PRIVATE isn't actually
coherent on several old UNIXes.

So you'd still have to convert such a case to a COW mapping, so it's not
like you can avoid it.

Of course, if VM_MAYWRITE is not set, you could just convert it silently
to a MAP_PRIVATE at the VM level (that's literally what we used to do,
back when we didn't support writable shared mappings at all, all those
years ago), so at least now the COW behaviour would match the vma_flags.

> I'd prefer to say that if the executable was mapped shared from a writable fd,
> then the tracer will write back to it; but you're clearly against that.

Absolutely. I can just see somebody mapping an executable MAP_SHARED and
PROT_READ, and something as simple as doing a breakpoint while debugging
causing system-wide trouble. I really don't think that's acceptable.

And I'm not making it up - add PROT_EXEC to the google search above, and
watch it being done exactly that way. Several of the hits mention shared
libraries too. I strongly suspect that almost all cases will be opened
with O_RDONLY, but still..

		Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 21:51 ` Linus Torvalds
@ 2005-08-01 22:01 ` Linus Torvalds
  2005-08-02 12:01 ` Martin Schwidefsky
  0 siblings, 1 reply; 72+ messages in thread

From: Linus Torvalds @ 2005-08-01 22:01 UTC
To: Hugh Dickins
Cc: Nick Piggin, Ingo Molnar, Robin Holt, Andrew Morton, Roland McGrath, linux-mm, linux-kernel, Martin Schwidefsky

On Mon, 1 Aug 2005, Linus Torvalds wrote:
>
> Of course, if VM_MAYWRITE is not set, you could just convert it silently
> to a MAP_PRIVATE at the VM level (that's literally what we used to do,
> back when we didn't support writable shared mappings at all, all those
> years ago), so at least now the COW behaviour would match the vma_flags.

Heh. I just checked. We still do exactly that:

	if (!(file->f_mode & FMODE_WRITE))
		vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

some code never dies ;)

However, we still set the VM_MAYSHARE bit, and that's the one that
mm/rmap.c checks for some reason. I don't see quite why - VM_MAYSHARE
doesn't actually ever do anything else than make sure that we try to
allocate a mremap() mapping in a cache-coherent space, I think (ie it's a
total no-op on any sane architecture, and as far as rmap is concerned on
all of them).

		Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 22:01 ` Linus Torvalds
@ 2005-08-02 12:01 ` Martin Schwidefsky
  2005-08-02 12:26 ` Hugh Dickins
  2005-08-02 15:30 ` Linus Torvalds
  1 sibling, 2 replies; 72+ messages in thread

From: Martin Schwidefsky @ 2005-08-02 12:01 UTC
To: Linus Torvalds
Cc: Andrew Morton, Robin Holt, Hugh Dickins, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath

> > Any chance you can change the __follow_page test to account for
> > writeable clean ptes? Something like
> >
> > 	if (write && !pte_dirty(pte) && !pte_write(pte))
> > 		goto out;
> >
> > And then you would re-add the set_page_dirty logic further on.
>
> Hmm.. That should be possible. I wanted to do the simplest possible code
> sequence, but yeah, I guess there's nothing wrong with allowing the code
> to dirty the page.
>
> Somebody want to send me a proper patch? Also, I haven't actually heard
> from whoever actually noticed the problem in the first place (Robin?)
> whether the fix does fix it. It "obviously does", but testing is always
> good ;)

Why do we require the !pte_dirty(pte) check? I don't get it. If a
writeable clean pte is just fine then why do we check the dirty bit at
all? Doesn't pte_dirty() imply pte_write()?

With the additional !pte_write(pte) check (and if I haven't overlooked
something which is not unlikely) s390 should work fine even without the
software-dirty bit hack.

blue skies,
  Martin

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-02 12:01 ` Martin Schwidefsky
@ 2005-08-02 12:26 ` Hugh Dickins
  2005-08-02 12:28 ` Nick Piggin
  2005-08-02 15:19 ` Martin Schwidefsky
  1 sibling, 2 replies; 72+ messages in thread

From: Hugh Dickins @ 2005-08-02 12:26 UTC
To: Martin Schwidefsky
Cc: Linus Torvalds, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath

On Tue, 2 Aug 2005, Martin Schwidefsky wrote:
>
> Why do we require the !pte_dirty(pte) check? I don't get it. If a writeable
> clean pte is just fine then why do we check the dirty bit at all? Doesn't
> pte_dirty() imply pte_write()?

Not quite. This is all about the peculiar ptrace case, which sets "force"
to get_user_pages, and ends up handled by the little maybe_mkwrite function:
we sometimes allow ptrace to modify the page while the user does not
have write access to it via the pte.

Robin discovered a race which proves it's unsafe for get_user_pages to
reset its lookup_write flag (another stage in this peculiar path) after
a single try; Nick proposed a patch which adds another VM_ return code
which each arch would need to handle; Linus looked for something simpler
and hit upon checking pte_dirty rather than pte_write (and removing
the then unnecessary lookup_write flag).

Linus' changes are in the 2.6.13-rc5 mm/memory.c,
but that leaves s390 broken at present.

> With the additional !pte_write(pte) check (and if I haven't overlooked
> something which is not unlikely) s390 should work fine even without the
> software-dirty bit hack.

I agree the pte_write check ought to go back in next to the pte_dirty
check, and that will leave s390 handling most uses of get_user_pages
correctly, but still failing to handle the peculiar case of ptrace
modifying a page to which the user does not currently have write access
(e.g. setting a breakpoint in readonly text).

Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-02 12:26 ` Hugh Dickins
@ 2005-08-02 12:28 ` Nick Piggin
  0 siblings, 0 replies; 72+ messages in thread

From: Nick Piggin @ 2005-08-02 12:28 UTC
To: Hugh Dickins
Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath

[-- Attachment #1: Type: text/plain, Size: 683 bytes --]

Hugh Dickins wrote:
> On Tue, 2 Aug 2005, Martin Schwidefsky wrote:
> > With the additional !pte_write(pte) check (and if I haven't overlooked
> > something which is not unlikely) s390 should work fine even without the
> > software-dirty bit hack.
>
> I agree the pte_write check ought to go back in next to the pte_dirty
> check, and that will leave s390 handling most uses of get_user_pages
> correctly, but still failing to handle the peculiar case of strace
> modifying a page to which the user does not currently have write access
> (e.g. setting a breakpoint in readonly text).

Oh, here is the patch I sent Linus and forgot to CC everyone else.

--
SUSE Labs, Novell Inc.

[-- Attachment #2: mm-opt-gup.patch --]
[-- Type: text/plain, Size: 1025 bytes --]

Allow __follow_page to succeed when encountering a clean, writeable pte.
Requires reintroduction of the direct page dirtying. Means
get_user_pages doesn't have to drop the page_table_lock and enter the
page fault handler for every clean, writeable pte it encounters (when
being called for write).

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -811,15 +811,18 @@ static struct page *__follow_page(struct
 	pte = *ptep;
 	pte_unmap(ptep);
 	if (pte_present(pte)) {
-		if (write && !pte_dirty(pte))
+		if (write && !pte_write(pte) && !pte_dirty(pte))
 			goto out;
 		if (read && !pte_read(pte))
 			goto out;
 		pfn = pte_pfn(pte);
 		if (pfn_valid(pfn)) {
 			page = pfn_to_page(pfn);
-			if (accessed)
+			if (accessed) {
+				if (write && !pte_dirty(pte) && !PageDirty(page))
+					set_page_dirty(page);
 				mark_page_accessed(page);
+			}
 			return page;
 		}
 	}
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-02 12:26 ` Hugh Dickins
  2005-08-02 12:28 ` Nick Piggin
@ 2005-08-02 15:19 ` Martin Schwidefsky
  1 sibling, 0 replies; 72+ messages in thread

From: Martin Schwidefsky @ 2005-08-02 15:19 UTC
To: Hugh Dickins
Cc: Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath, Linus Torvalds

Hugh Dickins <hugh@veritas.com> wrote on 08/02/2005 02:26:09 PM:
> On Tue, 2 Aug 2005, Martin Schwidefsky wrote:
> >
> > Why do we require the !pte_dirty(pte) check? I don't get it. If a writeable
> > clean pte is just fine then why do we check the dirty bit at all? Doesn't
> > pte_dirty() imply pte_write()?
>
> Not quite. This is all about the peculiar ptrace case, which sets "force"
> to get_user_pages, and ends up handled by the little maybe_mkwrite function:
> we sometimes allow ptrace to modify the page while the user does not have
> write access to it via the pte.

I slowly begin to understand. You want to do a copy on write of the page
you want to change with ptrace, but without making the pte a writable
pte. To indicate the fact that the page has been COWed, the pte dirty
bit is misused.

> Robin discovered a race which proves it's unsafe for get_user_pages to
> reset its lookup_write flag (another stage in this peculiar path) after
> a single try, Nick proposed a patch which adds another VM_ return code
> which each arch would need to handle, Linus looked for something simpler
> and hit upon checking pte_dirty rather than pte_write (and removing
> the then unnecessary lookup_write flag).

Yes, I've seen it on lkml.

> Linus' changes are in the 2.6.13-rc5 mm/memory.c,
> but that leaves s390 broken at present.

That is currently my problem which I'm trying to solve. With Nick's
latest patch, we are more or less back to the previous situation in
regard to the checks in __follow_page as far as s390 is concerned.
pte_write() is 0 for write access via access_process_vm on a read-only
vma. pte_dirty() is always 0, but since get_user_pages doesn't fall
back on lookup_write = 0 anymore, we have an endless loop again. Four
letter words ..

> > With the additional !pte_write(pte) check (and if I haven't overlooked
> > something which is not unlikely) s390 should work fine even without the
> > software-dirty bit hack.
>
> I agree the pte_write check ought to go back in next to the pte_dirty
> check, and that will leave s390 handling most uses of get_user_pages
> correctly, but still failing to handle the peculiar case of strace
> modifying a page to which the user does not currently have write access
> (e.g. setting a breakpoint in readonly text).

The straight forward fix is to introduce a software dirty bit that can
get misused for this special case. Not that I like it, but it seems the
simplest solution.

blue skies,
  Martin

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-02 12:01 ` Martin Schwidefsky
  2005-08-02 12:26 ` Hugh Dickins
@ 2005-08-02 15:30 ` Linus Torvalds
  2005-08-02 16:03 ` Hugh Dickins
  2005-08-02 16:44 ` Martin Schwidefsky
  1 sibling, 2 replies; 72+ messages in thread

From: Linus Torvalds @ 2005-08-02 15:30 UTC
To: Martin Schwidefsky
Cc: Andrew Morton, Robin Holt, Hugh Dickins, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath

On Tue, 2 Aug 2005, Martin Schwidefsky wrote:
>
> Why do we require the !pte_dirty(pte) check? I don't get it. If a writeable
> clean pte is just fine then why do we check the dirty bit at all? Doesn't
> pte_dirty() imply pte_write()?

A _non_writable and clean pte is _also_ fine sometimes. But only if we
have broken COW and marked it dirty.

> With the additional !pte_write(pte) check (and if I haven't overlooked
> something which is not unlikely) s390 should work fine even without the
> software-dirty bit hack.

No it won't. It will just loop forever in a tight loop if somebody tries
to put a breakpoint on a read-only location.

On the other hand, this being s390, maybe nobody cares?

		Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-02 15:30 ` Linus Torvalds
@ 2005-08-02 16:03 ` Hugh Dickins
  2005-08-02 16:25 ` Linus Torvalds
  1 sibling, 1 reply; 72+ messages in thread

From: Hugh Dickins @ 2005-08-02 16:03 UTC
To: Linus Torvalds
Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath

On Tue, 2 Aug 2005, Linus Torvalds wrote:
>
> On the other hand, this being s390, maybe nobody cares?

You have a cruel streak.

But have I just realized a non-s390 problem with your pte_dirty
technique? The ptep_set_wrprotect in fork's copy_one_pte.

That's specifically write-protecting the pte to force COW, but leaving
the dirty bit: so now get_user_pages will skip COW-ing it (in all write
cases, not just the peculiar ptrace force one). We really do need to
COW those in get_user_pages, don't we? And although we could handle it
by clearing dirty and doing a set_page_dirty, it's a path we don't want
to clutter further.

(Yes, there's another issue of a fork occurring while the pages are
already under get_user_pages, which is a significant issue for
InfiniBand; but I see that as a separate kind of race, which we can
reasonably disregard in this discussion - though it will need proper
attention, perhaps through Michael Tsirkin's PROT_DONTCOPY patch.)

The simple answer to Robin's pagetable update race is to say that
anyone using ptrace should be prepared for a pt race ;)

Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 16:03 ` Hugh Dickins @ 2005-08-02 16:25 ` Linus Torvalds 2005-08-02 17:02 ` Linus Torvalds 2005-08-02 17:21 ` Hugh Dickins 0 siblings, 2 replies; 72+ messages in thread From: Linus Torvalds @ 2005-08-02 16:25 UTC (permalink / raw) To: Hugh Dickins Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath On Tue, 2 Aug 2005, Hugh Dickins wrote: > > But have I just realized a non-s390 problem with your pte_dirty > technique? The ptep_set_wrprotect in fork's copy_one_pte. > > That's specifically write-protecting the pte to force COW, but leaving > the dirty bit: so now get_user_pages will skip COW-ing it (in all write > cases, not just the peculiar ptrace force one). Damn, you're right. We could obviously move the dirty bit from the page tables to the "struct page" in fork() (that may have other advantages: we're scanning the dang thing anyway, after all) to avoid that special case, but yes, that's nasty. One of the reasons I _liked_ the pte_dirty() test is that there's the reverse case: a mapping that used to be writable, and got dirtied (and COW'ed as necessary), and then was mprotected back, and the new test would happily write to it _without_ having to do any extra work. Which in that case is correct. But yeah, fork() does something special. In fact, that brings up another race altogether: a thread that does a fork() at the same time as get_user_pages() will have the exact same issues. Even with the old code. Simply because we test the permissions on the page long before we actually do the real access (ie it may be dirty and writable when we get it, but by the time the write happens, it might have become COW-shared). Now, that's probably not worth worrying about, but it's kind of interesting. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 16:25 ` Linus Torvalds @ 2005-08-02 17:02 ` Linus Torvalds 2005-08-02 17:27 ` Hugh Dickins 2005-08-02 17:21 ` Hugh Dickins 1 sibling, 1 reply; 72+ messages in thread From: Linus Torvalds @ 2005-08-02 17:02 UTC (permalink / raw) To: Hugh Dickins Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath On Tue, 2 Aug 2005, Linus Torvalds wrote: > > In fact, that brings up another race altogether: a thread that does a > fork() at the same time [...] You don't even need that, actually. There's another race by which the write could have gotten lost both with the new code _and_ the old code. Since we will have dropped the page table lock when calling handle_mm_fault() (which will just re-get the lock and then drop it again) _and_ since we don't actually mark the page dirty if it was writable, it's entirely possible that the VM scanner comes in and just drops the page from the page tables. Now, that doesn't sound so bad, but what we have then is a page that is marked dirty in the "struct page", but hasn't been actually dirtied yet. It could get written out and marked clean (can anybody say "preemptible kernel"?) before we ever actually do the write to the page. The thing is, we should always set the dirty bit either atomically with the access (normal "CPU sets the dirty bit on write") _or_ we should set it after the write (having kept a reference to the page). Or does anybody see anything that protects us here? Now, I don't think we can fix that race (which is probably pretty much impossible to hit in practice) in the 2.6.13 timeframe. Maybe I'll have to just accept the horrid "VM_FAULT_RACE" patch. I don't much like it, but.. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 17:02 ` Linus Torvalds @ 2005-08-02 17:27 ` Hugh Dickins 0 siblings, 0 replies; 72+ messages in thread From: Hugh Dickins @ 2005-08-02 17:27 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath On Tue, 2 Aug 2005, Linus Torvalds wrote: > > Since we will have dropped the page table lock when calling > handle_mm_fault() (which will just re-get the lock and then drop it > again) _and_ since we don't actually mark the page dirty if it was > writable, it's entirely possible that the VM scanner comes in and just > drops the page from the page tables. > > Now, that doesn't sound so bad, but what we have then is a page that is > marked dirty in the "struct page", but hasn't been actually dirtied yet. > It could get written out and marked clean (can anybody say "preemptible > kernel"?) before we ever actually do the write to the page. > > The thing is, we should always set the dirty bit either atomically with > the access (normal "CPU sets the dirty bit on write") _or_ we should set > it after the write (having kept a reference to the page). > > Or does anybody see anything that protects us here? > > Now, I don't think we can fix that race (which is probably pretty much > impossible to hit in practice) in the 2.6.13 timeframe. I believe this particular race has been recognized since day one of get_user_pages, and we've always demanded that the caller must do a SetPageDirty (I should probably say set_page_dirty) before freeing the pages held for writing. Which is why I was a bit puzzled to see that prior set_page_dirty in __follow_page, which Andrew identified as for s390. > Maybe I'll have to just accept the horrid "VM_FAULT_RACE" patch. I don't > much like it, but.. I've not yet reached a conclusion on that, need to think more about doing mkclean in copy_one_pte. 
Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 16:25 ` Linus Torvalds 2005-08-02 17:02 ` Linus Torvalds @ 2005-08-02 17:21 ` Hugh Dickins 2005-08-02 18:47 ` Linus Torvalds 1 sibling, 1 reply; 72+ messages in thread From: Hugh Dickins @ 2005-08-02 17:21 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath On Tue, 2 Aug 2005, Linus Torvalds wrote: > On Tue, 2 Aug 2005, Hugh Dickins wrote: > > > > But have I just realized a non-s390 problem with your pte_dirty > > technique? The ptep_set_wrprotect in fork's copy_one_pte. > > > > That's specifically write-protecting the pte to force COW, but leaving > > the dirty bit: so now get_user_pages will skip COW-ing it (in all write > > cases, not just the peculiar ptrace force one). > > Damn, you're right. We could obviously move the dirty bit from the page > tables to the "struct page" in fork() (that may have other advantages: > we're scanning the dang thing anyway, after all) to avoid that special > case, but yes, that's nasty. It might not be so bad. It's going to access the struct page anyway. And clearing dirty from parent and child at fork time could save two set_page_dirtys at exit time. But I'm not sure that we could batch the dirty bit clearing into one TLB flush like we do the write protection. > In fact, that brings up another race altogether: a thread that does a > fork() at the same time as get_user_pages() will have the exact same > issues. Even with the old code. Simply because we test the permissions on > the page long before we actually do the real access (ie it may be dirty > and writable when we get it, but by the time the write happens, it might > have become COW-shared). > > Now, that's probably not worth worrying about, but it's kind of > interesting.
Not worth worrying about in this context: it's one aspect of the InfiniBand (RDMA) issue I was referring to, to be addressed another time. Or is it even possible? We do require the caller of get_user_pages to down_read(&mm->mmap_sem), and fork parent has down_write(&mm->mmap_sem). Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 17:21 ` Hugh Dickins @ 2005-08-02 18:47 ` Linus Torvalds 2005-08-02 19:20 ` Hugh Dickins 0 siblings, 1 reply; 72+ messages in thread From: Linus Torvalds @ 2005-08-02 18:47 UTC (permalink / raw) To: Hugh Dickins Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath On Tue, 2 Aug 2005, Hugh Dickins wrote: > > It might not be so bad. It's going to access the struct page anyway. > And clearing dirty from parent and child at fork time could save two > set_page_dirtys at exit time. But I'm not sure that we could batch the > dirty bit clearing into one TLB flush like we do the write protection. Yes, good point. If the thing is still marked dirty in the TLB, some other thread might be writing to the page after we've cleared dirty but before we've flushed the TLB - causing the new dirty bit to be lost. I think. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 18:47 ` Linus Torvalds @ 2005-08-02 19:20 ` Hugh Dickins 2005-08-02 19:54 ` Linus Torvalds 0 siblings, 1 reply; 72+ messages in thread From: Hugh Dickins @ 2005-08-02 19:20 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath On Tue, 2 Aug 2005, Linus Torvalds wrote: > On Tue, 2 Aug 2005, Hugh Dickins wrote: > > > > It might not be so bad. It's going to access the struct page anyway. > > And clearing dirty from parent and child at fork time could save two > > set_page_dirtys at exit time. But I'm not sure that we could batch the > > dirty bit clearing into one TLB flush like we do the write protection. > > Yes, good point. If the thing is still marked dirty in the TLB, some other > thread might be writing to the page after we've cleared dirty but before > we've flushed the TLB - causing the new dirty bit to be lost. I think. Would that matter? Yes, if vmscan sneaked in at some point while page_table_lock is dropped, and wrote away the page with the earlier data. But I was worrying about the reverse case, that we clear dirty, then another thread sets it again before we emerge from copy_page_range, so it gets left behind granting get_user_pages write permission. Hmm, that implies that the other thread doesn't yet see wrprotect (because we've not yet flushed TLB), which probably implies it would still see dirty set: and so not set it again, so not a possible case. But that's a precarious, processor-dependent argument: I don't think it's safe to rely on, and your reverse case is already a problem. I don't believe there's a safe efficient way we could batch clearing dirty there. We could make a second pass of the whole mm after the TLB flush has asserted the wrprotects; but that won't win friends.
I'm thinking of reverting to the old __follow_page, setting write_access -1 in the get_user_pages special case (to avoid change to all the arches, in some of which write_access is a boolean, in some a bitflag, but in none -1), and in that write_access -1 case passing back the special code to say do_wp_page has done its full job. Combining your and Nick's and Andrew's ideas, and letting Martin off the hook. Or would that disgust you too much? (We could give -1 a pretty name ;) Working on it now, but my brain in an even lower power state than ever. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 19:20 ` Hugh Dickins @ 2005-08-02 19:54 ` Linus Torvalds 2005-08-02 20:55 ` Hugh Dickins 0 siblings, 1 reply; 72+ messages in thread From: Linus Torvalds @ 2005-08-02 19:54 UTC (permalink / raw) To: Hugh Dickins Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath On Tue, 2 Aug 2005, Hugh Dickins wrote: > > > > Yes, good point. If the thing is still marked dirty in the TLB, some other > > thread might be writing to the page after we've cleared dirty but before > > we've flushed the TLB - causing the new dirty bit to be lost. I think. > > Would that matter? Yes, if vmscan sneaked in at some point while > page_table_lock is dropped, and wrote away the page with the earlier data. Right. > But I was worrying about the reverse case, that we clear dirty, then > another thread sets it again before we emerge from copy_page_range, > so it gets left behind granting get_user_pages write permission. Hmm.. At least x86 won't do that - the dirty bits are updated with an atomic read-modify-write sequence that only sets the dirty bit. We won't get a writable page somehow. But the lost dirty bit is nasty. > I don't believe there's a safe efficient way we could batch clearing > dirty there. Well, there is one really cheap one: look at how many users the VM has. The thing is, fork() from a threaded environment is simply not done, so we could easily have a "slow and careful mode" for the thread case, and nobody would probably ever care. Whether it's worth it, I dunno. It might actually speed up the fork/exit cases, so it might be worth looking at for that reason.
> I'm thinking of reverting to the old __follow_page, setting write_access > -1 in the get_user_pages special case (to avoid change to all the arches, > in some of which write_access is a boolean, in some a bitflag, but in > none -1), and in that write_access -1 case passing back the special > code to say do_wp_page has done its full job. Combining your and > Nick's and Andrew's ideas, and letting Martin off the hook. > Or would that disgust you too much? (We could give -1 a pretty name ;) Go for it, I think whatever we do won't be wonderfully pretty. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 19:54 ` Linus Torvalds @ 2005-08-02 20:55 ` Hugh Dickins 2005-08-03 10:24 ` Nick Piggin 2005-08-03 10:24 ` Martin Schwidefsky 0 siblings, 2 replies; 72+ messages in thread From: Hugh Dickins @ 2005-08-02 20:55 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath On Tue, 2 Aug 2005, Linus Torvalds wrote: > > Go for it, I think whatever we do won't be wonderfully pretty. Here we are: get_user_pages quite untested, let alone the racy case, but I think it should work. Please all hack it around as you see fit, I'll check mail when I get home, but won't be very responsive... Checking pte_dirty instead of pte_write in __follow_page is problematic for s390, and for copy_one_pte which leaves dirty when clearing write. So revert __follow_page to check pte_write as before, and let do_wp_page pass back a special code VM_FAULT_WRITE to say it has done its full job (whereas VM_FAULT_MINOR when it backs out on race): once get_user_pages receives this value, it no longer requires pte_write in __follow_page. But most callers of handle_mm_fault, in the various architectures, have switch statements which do not expect this new case. To avoid changing them all in a hurry, only pass back VM_FAULT_WRITE when write_access arg says VM_FAULT_WRITE_EXPECTED - chosen as -1 since some arches pass write_access as a boolean, some as a bitflag, but none as -1. Yes, we do have a call to do_wp_page from do_swap_page, but no need to change that: in rare case it's needed, another do_wp_page will follow. 
Signed-off-by: Hugh Dickins <hugh@veritas.com> --- 2.6.13-rc5/include/linux/mm.h 2005-08-02 12:07:14.000000000 +0100 +++ linux/include/linux/mm.h 2005-08-02 21:14:58.000000000 +0100 @@ -629,6 +629,9 @@ static inline int page_mapped(struct pag #define VM_FAULT_SIGBUS 0 #define VM_FAULT_MINOR 1 #define VM_FAULT_MAJOR 2 +#define VM_FAULT_WRITE 3 /* special case for get_user_pages */ + +#define VM_FAULT_WRITE_EXPECTED (-1) /* only for get_user_pages */ #define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK) --- 2.6.13-rc5/mm/memory.c 2005-08-02 12:07:23.000000000 +0100 +++ linux/mm/memory.c 2005-08-02 21:14:26.000000000 +0100 @@ -811,15 +811,18 @@ static struct page *__follow_page(struct pte = *ptep; pte_unmap(ptep); if (pte_present(pte)) { - if (write && !pte_dirty(pte)) + if (write && !pte_write(pte)) goto out; if (read && !pte_read(pte)) goto out; pfn = pte_pfn(pte); if (pfn_valid(pfn)) { page = pfn_to_page(pfn); - if (accessed) + if (accessed) { + if (write && !pte_dirty(pte) &&!PageDirty(page)) + set_page_dirty(page); mark_page_accessed(page); + } return page; } } @@ -941,10 +944,11 @@ int get_user_pages(struct task_struct *t } spin_lock(&mm->page_table_lock); do { + int write_access = write? VM_FAULT_WRITE_EXPECTED: 0; struct page *page; cond_resched_lock(&mm->page_table_lock); - while (!(page = follow_page(mm, start, write))) { + while (!(page = follow_page(mm, start, write_access))) { /* * Shortcut for anonymous pages. 
We don't want * to force the creation of pages tables for @@ -957,7 +961,16 @@ int get_user_pages(struct task_struct *t break; } spin_unlock(&mm->page_table_lock); - switch (handle_mm_fault(mm,vma,start,write)) { + switch (handle_mm_fault(mm, vma, start, + write_access)) { + case VM_FAULT_WRITE: + /* + * do_wp_page has broken COW when + * necessary, even if maybe_mkwrite + * decided not to set pte_write + */ + write_access = 0; + /* FALLTHRU */ case VM_FAULT_MINOR: tsk->min_flt++; break; @@ -1220,6 +1233,7 @@ static int do_wp_page(struct mm_struct * struct page *old_page, *new_page; unsigned long pfn = pte_pfn(pte); pte_t entry; + int ret; if (unlikely(!pfn_valid(pfn))) { /* @@ -1247,7 +1261,7 @@ static int do_wp_page(struct mm_struct * lazy_mmu_prot_update(entry); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return VM_FAULT_WRITE; } } pte_unmap(page_table); @@ -1274,6 +1288,7 @@ static int do_wp_page(struct mm_struct * /* * Re-check the pte - we dropped the lock */ + ret = VM_FAULT_MINOR; spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); if (likely(pte_same(*page_table, pte))) { @@ -1290,12 +1305,13 @@ static int do_wp_page(struct mm_struct * /* Free the old page.. 
*/ new_page = old_page; + ret = VM_FAULT_WRITE; } pte_unmap(page_table); page_cache_release(new_page); page_cache_release(old_page); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return ret; no_new_page: page_cache_release(old_page); @@ -1985,9 +2001,13 @@ static inline int handle_pte_fault(struc } if (write_access) { - if (!pte_write(entry)) - return do_wp_page(mm, vma, address, pte, pmd, entry); - + if (!pte_write(entry)) { + int ret = do_wp_page(mm, vma, address, pte, pmd, entry); + if (ret == VM_FAULT_WRITE && + write_access != VM_FAULT_WRITE_EXPECTED) + ret = VM_FAULT_MINOR; + return ret; + } entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 20:55 ` Hugh Dickins @ 2005-08-03 10:24 ` Nick Piggin 2005-08-03 11:47 ` Hugh Dickins 2005-08-03 16:12 ` Linus Torvalds 2005-08-03 10:24 ` Martin Schwidefsky 1 sibling, 2 replies; 72+ messages in thread From: Nick Piggin @ 2005-08-03 10:24 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath [-- Attachment #1: Type: text/plain, Size: 687 bytes --] Hugh Dickins wrote: > On Tue, 2 Aug 2005, Linus Torvalds wrote: > >>Go for it, I think whatever we do won't be wonderfully pretty. > > > Here we are: get_user_pages quite untested, let alone the racy case, > but I think it should work. Please all hack it around as you see fit, > I'll check mail when I get home, but won't be very responsive... > Seems OK to me. I don't know why you think handle_mm_fault can't be inline, but if it can be, then I have a modification attached that removes the condition - any good? Oh, it gets rid of the -1 for VM_FAULT_OOM. Doesn't seem like there is a good reason for it, but might that break out of tree drivers? -- SUSE Labs, Novell Inc. [-- Attachment #2: mm-gup-hugh.patch --] [-- Type: text/plain, Size: 6183 bytes --] Checking pte_dirty instead of pte_write in __follow_page is problematic for s390, and for copy_one_pte which leaves dirty when clearing write. So revert __follow_page to check pte_write as before, and let do_wp_page pass back a special code VM_FAULT_WRITE to say it has done its full job (whereas VM_FAULT_MINOR when it backs out on race): once get_user_pages receives this value, it no longer requires pte_write in __follow_page. But most callers of handle_mm_fault, in the various architectures, have switch statements which do not expect this new case. 
To avoid changing them all in a hurry, only pass back VM_FAULT_WRITE when write_access arg says VM_FAULT_WRITE_EXPECTED - chosen as -1 since some arches pass write_access as a boolean, some as a bitflag, but none as -1. Yes, we do have a call to do_wp_page from do_swap_page, but no need to change that: in rare case it's needed, another do_wp_page will follow. Signed-off-by: Hugh Dickins <hugh@veritas.com> Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -625,10 +625,16 @@ static inline int page_mapped(struct pag * Used to decide whether a process gets delivered SIGBUS or * just gets major/minor fault counters bumped up. */ -#define VM_FAULT_OOM (-1) -#define VM_FAULT_SIGBUS 0 -#define VM_FAULT_MINOR 1 -#define VM_FAULT_MAJOR 2 +#define VM_FAULT_OOM 0x00 +#define VM_FAULT_SIGBUS 0x01 +#define VM_FAULT_MINOR 0x02 +#define VM_FAULT_MAJOR 0x03 + +/* + * Special case for get_user_pages. + * Must be in a distinct bit from the above VM_FAULT_ flags. 
+ */ +#define VM_FAULT_WRITE 0x10 #define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK) @@ -704,7 +710,13 @@ extern pte_t *FASTCALL(pte_alloc_kernel( extern pte_t *FASTCALL(pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)); extern int install_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot); extern int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long pgoff, pgprot_t prot); -extern int handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access); +extern int __handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access); + +static inline int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access) +{ + return __handle_mm_fault(mm, vma, address, write_access) & (~VM_FAULT_WRITE); +} + extern int make_pages_present(unsigned long addr, unsigned long end); extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write); void install_arg_page(struct vm_area_struct *, struct page *, unsigned long); Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -811,15 +811,18 @@ static struct page *__follow_page(struct pte = *ptep; pte_unmap(ptep); if (pte_present(pte)) { - if (write && !pte_dirty(pte)) + if (write && !pte_write(pte)) goto out; if (read && !pte_read(pte)) goto out; pfn = pte_pfn(pte); if (pfn_valid(pfn)) { page = pfn_to_page(pfn); - if (accessed) + if (accessed) { + if (write && !pte_dirty(pte) &&!PageDirty(page)) + set_page_dirty(page); mark_page_accessed(page); + } return page; } } @@ -941,10 +944,11 @@ int get_user_pages(struct task_struct *t } spin_lock(&mm->page_table_lock); do { + int write_access = write; struct page *page; 
cond_resched_lock(&mm->page_table_lock); - while (!(page = follow_page(mm, start, write))) { + while (!(page = follow_page(mm, start, write_access))) { /* * Shortcut for anonymous pages. We don't want * to force the creation of pages tables for @@ -957,7 +961,16 @@ int get_user_pages(struct task_struct *t break; } spin_unlock(&mm->page_table_lock); - switch (handle_mm_fault(mm,vma,start,write)) { + switch (__handle_mm_fault(mm, vma, start, + write_access)) { + case VM_FAULT_WRITE: + /* + * do_wp_page has broken COW when + * necessary, even if maybe_mkwrite + * decided not to set pte_write + */ + write_access = 0; + /* FALLTHRU */ case VM_FAULT_MINOR: tsk->min_flt++; break; @@ -1220,6 +1233,7 @@ static int do_wp_page(struct mm_struct * struct page *old_page, *new_page; unsigned long pfn = pte_pfn(pte); pte_t entry; + int ret; if (unlikely(!pfn_valid(pfn))) { /* @@ -1247,7 +1261,7 @@ static int do_wp_page(struct mm_struct * lazy_mmu_prot_update(entry); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return VM_FAULT_MINOR|VM_FAULT_WRITE; } } pte_unmap(page_table); @@ -1274,6 +1288,7 @@ static int do_wp_page(struct mm_struct * /* * Re-check the pte - we dropped the lock */ + ret = VM_FAULT_MINOR; spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); if (likely(pte_same(*page_table, pte))) { @@ -1290,12 +1305,13 @@ static int do_wp_page(struct mm_struct * /* Free the old page.. 
*/ new_page = old_page; + ret |= VM_FAULT_WRITE; } pte_unmap(page_table); page_cache_release(new_page); page_cache_release(old_page); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return ret; no_new_page: page_cache_release(old_page); @@ -1987,7 +2003,6 @@ static inline int handle_pte_fault(struc if (write_access) { if (!pte_write(entry)) return do_wp_page(mm, vma, address, pte, pmd, entry); - entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); @@ -2002,7 +2017,7 @@ static inline int handle_pte_fault(struc /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma, +int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma, unsigned long address, int write_access) { pgd_t *pgd; ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 10:24 ` Nick Piggin @ 2005-08-03 11:47 ` Hugh Dickins 2005-08-03 12:13 ` Nick Piggin 2005-08-03 16:12 ` Linus Torvalds 1 sibling, 1 reply; 72+ messages in thread From: Hugh Dickins @ 2005-08-03 11:47 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath On Wed, 3 Aug 2005, Nick Piggin wrote: > Hugh Dickins wrote: > > > > Here we are: get_user_pages quite untested, let alone the racy case, > > but I think it should work. Please all hack it around as you see fit, > > I'll check mail when I get home, but won't be very responsive... > > Seems OK to me. I don't know why you think handle_mm_fault can't > be inline, but if it can be, then I have a modification attached > that removes the condition - any good? Stupidity was the reason I thought handle_mm_fault couldn't be inline: I was picturing it static inline within mm/memory.c, failed to make the great intellectual leap you've achieved by moving it to include/linux/mm.h. > Oh, it gets rid of the -1 for VM_FAULT_OOM. Doesn't seem like there > is a good reason for it, but might that break out of tree drivers? No, I don't think it would break anything: it's just an historic oddity, used to be -1 for failure, and only got given a name recently, I think when wli added the proper major/minor counting. Your version of the patch looks less hacky to me (not requiring VM_FAULT_WRITE_EXPECTED arg), though we could perfectly well remove that at leisure by adding VM_FAULT_WRITE case into all the arches in 2.6.14 (which might be preferable to leaving the __inline obscurity?). I don't mind either way, but since you've not yet found an actual error in mine, I'd prefer you to make yours a tidyup patch on top, Signed-off-by your own good self, and let Linus decide whether he wants to apply yours on top or not. 
Or perhaps the decision rests for the moment with Robin, whether he gets his customer to test yours or mine - whichever is tested is the one which should go in. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 11:47 ` Hugh Dickins @ 2005-08-03 12:13 ` Nick Piggin 0 siblings, 0 replies; 72+ messages in thread From: Nick Piggin @ 2005-08-03 12:13 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath Hugh Dickins wrote: > Stupidity was the reason I thought handle_mm_fault couldn't be inline: > I was picturing it static inline within mm/memory.c, failed to make the > great intellectual leap you've achieved by moving it to include/linux/mm.h. > Well it was one of my finer moments, so don't be too hard on yourself. > > No, I don't think it would break anything: it's just an historic oddity, > used to be -1 for failure, and only got given a name recently, I think > when wli added the proper major/minor counting. > > Your version of the patch looks less hacky to me (not requiring > VM_FAULT_WRITE_EXPECTED arg), though we could perfectly well remove > that at leisure by adding VM_FAULT_WRITE case into all the arches in > 2.6.14 (which might be preferable to leaving the __inline obscurity?). > Well depends on what they want I suppose. Does it even make sense to expose VM_FAULT_WRITE to arch code if it will just fall through to VM_FAULT_MINOR? With my earlier VM_FAULT_RACE thing, you can squint and say that's a different case to VM_FAULT_MINOR - and accordingly not increment the minor fault count. After all, minor faults *seem* to track count of modifications made to the pte entry minus major faults, rather than the number of times the hardware traps. I say this because we increment the number in get_user_pages too. But... > I don't mind either way, but since you've not yet found an actual > error in mine, I'd prefer you to make yours a tidyup patch on top, > Signed-off-by your own good self, and let Linus decide whether he > wants to apply yours on top or not.
Or perhaps the decision rests > for the moment with Robin, whether he gets his customer to test > yours or mine - whichever is tested is the one which should go in. > I agree that for the moment we probably just want something that works. I'd be just as happy to go with your patch as is for now. Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com
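Nick's question above — whether arch code should ever see VM_FAULT_WRITE — is what the inline-wrapper arrangement answers: only core mm calls __handle_mm_fault and sees the bit, while the handle_mm_fault that the arches call masks it off. A minimal userspace sketch of that split (function names model the ones in the thread; the constant values are illustrative, not copied from any particular mm.h):

```c
#include <assert.h>

#define VM_FAULT_MINOR 0x01
#define VM_FAULT_WRITE 0x04  /* flag bit OR'd into the classic codes */

/* Model of the core entry point: may return MINOR with the WRITE bit
 * OR'd in, meaning do_wp_page broke COW on behalf of the caller. */
int __handle_mm_fault_model(int write_access)
{
    int ret = VM_FAULT_MINOR;
    if (write_access)
        ret |= VM_FAULT_WRITE;
    return ret;
}

/* Model of what the arch fault handlers call: the flag never escapes
 * to them, so their existing switch statements keep seeing only the
 * classic codes. */
static inline int handle_mm_fault_model(int write_access)
{
    return __handle_mm_fault_model(write_access) & ~VM_FAULT_WRITE;
}
```

Only get_user_pages needs the extra information, so hiding the bit behind the wrapper avoids touching every architecture in one go.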
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 10:24 ` Nick Piggin 2005-08-03 11:47 ` Hugh Dickins @ 2005-08-03 16:12 ` Linus Torvalds 2005-08-03 16:39 ` Linus Torvalds ` (3 more replies) 1 sibling, 4 replies; 72+ messages in thread From: Linus Torvalds @ 2005-08-03 16:12 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath On Wed, 3 Aug 2005, Nick Piggin wrote: > > Oh, it gets rid of the -1 for VM_FAULT_OOM. Doesn't seem like there > is a good reason for it, but might that break out of tree drivers? Ok, I applied this because it was reasonably pretty and I liked the approach. It seems buggy, though, since it was using "switch ()" to test the bits (wrongly, afaik), and I'm going to apply the appended on top of it. Holler quickly if you disagreee.. Linus ---- diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -949,6 +949,8 @@ int get_user_pages(struct task_struct *t cond_resched_lock(&mm->page_table_lock); while (!(page = follow_page(mm, start, write_access))) { + int ret; + /* * Shortcut for anonymous pages. We don't want * to force the creation of pages tables for @@ -961,16 +963,18 @@ int get_user_pages(struct task_struct *t break; } spin_unlock(&mm->page_table_lock); - switch (__handle_mm_fault(mm, vma, start, - write_access)) { - case VM_FAULT_WRITE: - /* - * do_wp_page has broken COW when - * necessary, even if maybe_mkwrite - * decided not to set pte_write - */ + ret = __handle_mm_fault(mm, vma, start, write_access); + + /* + * The VM_FAULT_WRITE bit tells us that do_wp_page has + * broken COW when necessary, even if maybe_mkwrite + * decided not to set pte_write. We can thus safely do + * subsequent page lookups as if they were reads. 
+ */ + if (ret & VM_FAULT_WRITE) write_access = 0; - /* FALLTHRU */ + + switch (ret & ~VM_FAULT_WRITE) { case VM_FAULT_MINOR: tsk->min_flt++; break;
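Linus's fix-up above can be illustrated with a standalone sketch (userspace C; the constant values are illustrative and VM_FAULT_WRITE is modelled as a flag bit, per the patch's comment): a plain switch on the raw return value would miss VM_FAULT_MINOR | VM_FAULT_WRITE entirely, since that combined value matches no case label, so the bit must be peeled off before dispatching.

```c
#include <assert.h>

#define VM_FAULT_SIGBUS 0x00
#define VM_FAULT_MINOR  0x01
#define VM_FAULT_MAJOR  0x02
#define VM_FAULT_OOM    0x03
#define VM_FAULT_WRITE  0x04  /* flag bit: do_wp_page has broken COW */

/* Dispatch the way the fixed get_user_pages() does: strip the WRITE
 * bit first, then switch on the remaining classic code.  Returns 0 on
 * success, -1 for SIGBUS, -2 for OOM. */
int account_fault(int ret, int *write_access, int *min_flt, int *maj_flt)
{
    if (ret & VM_FAULT_WRITE)
        *write_access = 0;   /* subsequent lookups may safely be reads */

    switch (ret & ~VM_FAULT_WRITE) {
    case VM_FAULT_MINOR:  (*min_flt)++; return 0;
    case VM_FAULT_MAJOR:  (*maj_flt)++; return 0;
    case VM_FAULT_SIGBUS: return -1;
    default:              return -2;   /* OOM */
    }
}
```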
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 16:12 ` Linus Torvalds @ 2005-08-03 16:39 ` Linus Torvalds 2005-08-03 16:42 ` Linus Torvalds 2005-08-03 17:12 ` Hugh Dickins ` (2 subsequent siblings) 3 siblings, 1 reply; 72+ messages in thread From: Linus Torvalds @ 2005-08-03 16:39 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath On Wed, 3 Aug 2005, Linus Torvalds wrote: > > Ok, I applied this because it was reasonably pretty and I liked the > approach. It seems buggy, though, since it was using "switch ()" to test > the bits (wrongly, afaik), and I'm going to apply the appended on top of > it. Holler quickly if you disagree.. Another problem I just thought of. What about PROT_NONE pages? Afaik, they have _exactly_ the same issue on the read side as a PROT_READ page has on the write side: pte_read() may fail on them. What makes us not get into an infinite loop there? Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 16:39 ` Linus Torvalds @ 2005-08-03 16:42 ` Linus Torvalds 0 siblings, 0 replies; 72+ messages in thread From: Linus Torvalds @ 2005-08-03 16:42 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath On Wed, 3 Aug 2005, Linus Torvalds wrote: > > What makes us not get into an infinite loop there? Ahh, never mind, I didn't notice that we never set the "read" thing at all, so ptrace will never care if it's readable or not. No problem. Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 16:12 ` Linus Torvalds 2005-08-03 16:39 ` Linus Torvalds @ 2005-08-03 17:12 ` Hugh Dickins 2005-08-03 23:03 ` Nick Piggin 2005-08-04 14:14 ` Alexander Nyberg 3 siblings, 0 replies; 72+ messages in thread From: Hugh Dickins @ 2005-08-03 17:12 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath On Wed, 3 Aug 2005, Linus Torvalds wrote: > > Ok, I applied this because it was reasonably pretty and I liked the > approach. It seems buggy, though, since it was using "switch ()" to test > the bits (wrongly, afaik), and I'm going to apply the appended on top of > it. Holler quickly if you disagree.. Looks right to me. Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 16:12 ` Linus Torvalds 2005-08-03 16:39 ` Linus Torvalds 2005-08-03 17:12 ` Hugh Dickins @ 2005-08-03 23:03 ` Nick Piggin 2005-08-04 14:14 ` Alexander Nyberg 3 siblings, 0 replies; 72+ messages in thread From: Nick Piggin @ 2005-08-03 23:03 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath Linus Torvalds wrote: > > On Wed, 3 Aug 2005, Nick Piggin wrote: > >>Oh, it gets rid of the -1 for VM_FAULT_OOM. Doesn't seem like there >>is a good reason for it, but might that break out of tree drivers? > > > Ok, I applied this because it was reasonably pretty and I liked the > approach. It seems buggy, though, since it was using "switch ()" to test > the bits (wrongly, afaik), Oops, thanks. > and I'm going to apply the appended on top of > it. Holler quickly if you disagree.. > No that looks fine. Should really be credited to Hugh... well I guess everyone had some input into it though (Andrew, Hugh, you, me). It probably doesn't matter too much. Thanks everyone. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 16:12 ` Linus Torvalds ` (2 preceding siblings ...) 2005-08-03 23:03 ` Nick Piggin @ 2005-08-04 14:14 ` Alexander Nyberg 2005-08-04 14:30 ` Nick Piggin 3 siblings, 1 reply; 72+ messages in thread From: Alexander Nyberg @ 2005-08-04 14:14 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath, Andi Kleen On Wed, Aug 03, 2005 at 09:12:37AM -0700 Linus Torvalds wrote: > > > On Wed, 3 Aug 2005, Nick Piggin wrote: > > > > Oh, it gets rid of the -1 for VM_FAULT_OOM. Doesn't seem like there > > is a good reason for it, but might that break out of tree drivers? > > Ok, I applied this because it was reasonably pretty and I liked the > approach. It seems buggy, though, since it was using "switch ()" to test > the bits (wrongly, afaik), and I'm going to apply the appended on top of > it. Holler quickly if you disagreee.. > x86_64 had hardcoded the VM_ numbers so it broke down when the numbers were changed. Signed-off-by: Alexander Nyberg <alexn@telia.com> Index: linux-2.6/arch/x86_64/mm/fault.c =================================================================== --- linux-2.6.orig/arch/x86_64/mm/fault.c 2005-07-31 18:10:20.000000000 +0200 +++ linux-2.6/arch/x86_64/mm/fault.c 2005-08-04 16:04:59.000000000 +0200 @@ -439,13 +439,13 @@ * the fault. */ switch (handle_mm_fault(mm, vma, address, write)) { - case 1: + case VM_FAULT_MINOR: tsk->min_flt++; break; - case 2: + case VM_FAULT_MAJOR: tsk->maj_flt++; break; - case 0: + case VM_FAULT_SIGBUS: goto do_sigbus; default: goto out_of_memory; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-04 14:14 ` Alexander Nyberg @ 2005-08-04 14:30 ` Nick Piggin 2005-08-04 15:00 ` Alexander Nyberg 2005-08-04 16:29 ` Russell King 0 siblings, 2 replies; 72+ messages in thread From: Nick Piggin @ 2005-08-04 14:30 UTC (permalink / raw) To: Alexander Nyberg Cc: Linus Torvalds, Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath, Andi Kleen Alexander Nyberg wrote: > On Wed, Aug 03, 2005 at 09:12:37AM -0700 Linus Torvalds wrote: > > >> >>Ok, I applied this because it was reasonably pretty and I liked the >>approach. It seems buggy, though, since it was using "switch ()" to test >>the bits (wrongly, afaik), and I'm going to apply the appended on top of >>it. Holler quickly if you disagree.. >> > > > x86_64 had hardcoded the VM_ numbers so it broke down when the numbers > were changed. > Ugh, sorry I should have audited this but I really wasn't expecting it (famous last words). Hasn't been a good week for me. parisc, cris, m68k, frv, sh64, arm26 are also broken. Would you mind resending a patch that fixes them all? Thanks, Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-04 14:30 ` Nick Piggin @ 2005-08-04 15:00 ` Alexander Nyberg 2005-08-04 15:35 ` Hugh Dickins 2005-08-04 15:36 ` Linus Torvalds 2005-08-04 16:29 ` Russell King 1 sibling, 2 replies; 72+ messages in thread From: Alexander Nyberg @ 2005-08-04 15:00 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath, Andi Kleen > > > >x86_64 had hardcoded the VM_ numbers so it broke down when the numbers > >were changed. > > > > Ugh, sorry I should have audited this but I really wasn't expecting > it (famous last words). Hasn't been a good week for me. Hardcoding is evil so it's good it gets cleaned up anyway. > parisc, cris, m68k, frv, sh64, arm26 are also broken. > Would you mind resending a patch that fixes them all? > Remove the hardcoding in return value checking of handle_mm_fault() Signed-off-by: Alexander Nyberg <alexn@telia.com> arm26/mm/fault.c | 6 +++--- cris/mm/fault.c | 6 +++--- frv/mm/fault.c | 6 +++--- m68k/mm/fault.c | 6 +++--- parisc/mm/fault.c | 6 +++--- sh64/mm/fault.c | 6 +++--- x86_64/mm/fault.c | 6 +++--- 7 files changed, 21 insertions(+), 21 deletions(-) Index: linux-2.6/arch/x86_64/mm/fault.c =================================================================== --- linux-2.6.orig/arch/x86_64/mm/fault.c 2005-07-31 18:10:20.000000000 +0200 +++ linux-2.6/arch/x86_64/mm/fault.c 2005-08-04 16:04:59.000000000 +0200 @@ -439,13 +439,13 @@ * the fault. 
*/ switch (handle_mm_fault(mm, vma, address, write)) { - case 1: + case VM_FAULT_MINOR: tsk->min_flt++; break; - case 2: + case VM_FAULT_MAJOR: tsk->maj_flt++; break; - case 0: + case VM_FAULT_SIGBUS: goto do_sigbus; default: goto out_of_memory; Index: linux-2.6/arch/cris/mm/fault.c =================================================================== --- linux-2.6.orig/arch/cris/mm/fault.c 2005-07-31 18:10:02.000000000 +0200 +++ linux-2.6/arch/cris/mm/fault.c 2005-08-04 16:40:56.000000000 +0200 @@ -284,13 +284,13 @@ */ switch (handle_mm_fault(mm, vma, address, writeaccess & 1)) { - case 1: + case VM_FAULT_MINOR: tsk->min_flt++; break; - case 2: + case VM_FAULT_MAJOR: tsk->maj_flt++; break; - case 0: + case VM_FAULT_SIGBUS: goto do_sigbus; default: goto out_of_memory; Index: linux-2.6/arch/m68k/mm/fault.c =================================================================== --- linux-2.6.orig/arch/m68k/mm/fault.c 2005-07-31 18:10:05.000000000 +0200 +++ linux-2.6/arch/m68k/mm/fault.c 2005-08-04 16:42:05.000000000 +0200 @@ -160,13 +160,13 @@ printk("handle_mm_fault returns %d\n",fault); #endif switch (fault) { - case 1: + case VM_FAULT_MINOR: current->min_flt++; break; - case 2: + case VM_FAULT_MAJOR: current->maj_flt++; break; - case 0: + case VM_FAULT_SIGBUS: goto bus_err; default: goto out_of_memory; Index: linux-2.6/arch/parisc/mm/fault.c =================================================================== --- linux-2.6.orig/arch/parisc/mm/fault.c 2005-07-31 18:10:11.000000000 +0200 +++ linux-2.6/arch/parisc/mm/fault.c 2005-08-04 16:41:18.000000000 +0200 @@ -178,13 +178,13 @@ */ switch (handle_mm_fault(mm, vma, address, (acc_type & VM_WRITE) != 0)) { - case 1: + case VM_FAULT_MINOR: ++current->min_flt; break; - case 2: + case VM_FAULT_MAJOR: ++current->maj_flt; break; - case 0: + case VM_FAULT_SIGBUS: /* * We ran out of memory, or some other thing happened * to us that made us unable to handle the page fault Index: linux-2.6/arch/arm26/mm/fault.c 
=================================================================== --- linux-2.6.orig/arch/arm26/mm/fault.c 2005-07-31 18:10:00.000000000 +0200 +++ linux-2.6/arch/arm26/mm/fault.c 2005-08-04 16:46:18.000000000 +0200 @@ -176,12 +176,12 @@ * Handle the "normal" cases first - successful and sigbus */ switch (fault) { - case 2: + case VM_FAULT_MAJOR: tsk->maj_flt++; return fault; - case 1: + case VM_FAULT_MINOR: tsk->min_flt++; - case 0: + case VM_FAULT_SIGBUS: return fault; } Index: linux-2.6/arch/frv/mm/fault.c =================================================================== --- linux-2.6.orig/arch/frv/mm/fault.c 2005-07-31 18:10:03.000000000 +0200 +++ linux-2.6/arch/frv/mm/fault.c 2005-08-04 16:44:02.000000000 +0200 @@ -163,13 +163,13 @@ * the fault. */ switch (handle_mm_fault(mm, vma, ear0, write)) { - case 1: + case VM_FAULT_MINOR: current->min_flt++; break; - case 2: + case VM_FAULT_MAJOR: current->maj_flt++; break; - case 0: + case VM_FAULT_SIGBUS: goto do_sigbus; default: goto out_of_memory; Index: linux-2.6/arch/sh64/mm/fault.c =================================================================== --- linux-2.6.orig/arch/sh64/mm/fault.c 2005-07-31 18:10:16.000000000 +0200 +++ linux-2.6/arch/sh64/mm/fault.c 2005-08-04 16:44:58.000000000 +0200 @@ -223,13 +223,13 @@ */ survive: switch (handle_mm_fault(mm, vma, address, writeaccess)) { - case 1: + case VM_FAULT_MINOR: tsk->min_flt++; break; - case 2: + case VM_FAULT_MAJOR: tsk->maj_flt++; break; - case 0: + case VM_FAULT_SIGBUS: goto do_sigbus; default: goto out_of_memory; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-04 15:00 ` Alexander Nyberg @ 2005-08-04 15:35 ` Hugh Dickins 2005-08-04 16:32 ` Russell King 2005-08-04 15:36 ` Linus Torvalds 1 sibling, 1 reply; 72+ messages in thread From: Hugh Dickins @ 2005-08-04 15:35 UTC (permalink / raw) To: Alexander Nyberg, Linus Torvalds Cc: Nick Piggin, Russell King, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath, Andi Kleen On Thu, 4 Aug 2005, Alexander Nyberg wrote: > > Hardcoding is evil so it's good it gets cleaned up anyway. > > > parisc, cris, m68k, frv, sh64, arm26 are also broken. > > Would you mind resending a patch that fixes them all? > > Remove the hardcoding in return value checking of handle_mm_fault() Your patch looks right to me, and bless you for catching this. But it does get into changing lots of arches, which we were trying to avoid at this moment. Well, that's up to Linus. And it does miss arm, the only arch which actually needs changing right now, if we simply restore the original values which Nick shifted - although arm references the VM_FAULT_ codes in some places, it also uses "> 0". arm26 looks at first as if it needs changing too, but a closer look shows it's remapping the faults and is okay - agreed? I suggest for now the patch below, which does need to be applied for the arm case, and makes applying your good cleanup less urgent. Restore VM_FAULT_SIGBUS, VM_FAULT_MINOR and VM_FAULT_MAJOR to their original values, so that arches which have them hardcoded will still work before they're cleaned up. And correct arm to use the VM_FAULT_ codes throughout, not assuming MINOR and MAJOR are the only ones > 0. 
Signed-off-by: Hugh Dickins <hugh@veritas.com> --- 2.6.13-rc5-git2/arch/arm/mm/fault.c 2005-08-02 12:06:28.000000000 +0100 +++ linux/arch/arm/mm/fault.c 2005-08-04 16:06:57.000000000 +0100 @@ -240,8 +240,11 @@ do_page_fault(unsigned long addr, unsign /* * Handle the "normal" case first */ - if (fault > 0) + switch (fault) { + case VM_FAULT_MINOR: + case VM_FAULT_MAJOR: return 0; + } /* * If we are in kernel mode at this point, we @@ -261,7 +264,7 @@ do_page_fault(unsigned long addr, unsign do_exit(SIGKILL); return 0; - case 0: + case VM_FAULT_SIGBUS: /* * We had some memory, but were unable to * successfully fix up this page fault. --- 2.6.13-rc5-git2/include/linux/mm.h 2005-08-04 15:20:20.000000000 +0100 +++ linux/include/linux/mm.h 2005-08-04 15:52:34.000000000 +0100 @@ -625,10 +625,10 @@ static inline int page_mapped(struct pag * Used to decide whether a process gets delivered SIGBUS or * just gets major/minor fault counters bumped up. */ -#define VM_FAULT_OOM 0x00 -#define VM_FAULT_SIGBUS 0x01 -#define VM_FAULT_MINOR 0x02 -#define VM_FAULT_MAJOR 0x03 +#define VM_FAULT_SIGBUS 0x00 +#define VM_FAULT_MINOR 0x01 +#define VM_FAULT_MAJOR 0x02 +#define VM_FAULT_OOM 0x03 /* * Special case for get_user_pages. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
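The compatibility constraint Hugh's patch relies on — the classic codes keeping their old numeric values until every arch is converted away from hardcoding — could be pinned down at compile time. A hypothetical safeguard, not part of the posted patch (shown with C11 _Static_assert, which 2.6-era kernels did not have; they would have used a BUILD_BUG_ON-style trick instead):

```c
#include <assert.h>

/* Values as restored by Hugh's patch to include/linux/mm.h. */
#define VM_FAULT_SIGBUS 0x00
#define VM_FAULT_MINOR  0x01
#define VM_FAULT_MAJOR  0x02
#define VM_FAULT_OOM    0x03

/* Arches that still hardcode 0/1/2 keep working only while these hold;
 * moving OOM to the end is safe because no arch hardcoded it (it used
 * to be the "default:" / -1 case). */
_Static_assert(VM_FAULT_SIGBUS == 0, "hardcoded arches expect SIGBUS == 0");
_Static_assert(VM_FAULT_MINOR == 1, "hardcoded arches expect MINOR == 1");
_Static_assert(VM_FAULT_MAJOR == 2, "hardcoded arches expect MAJOR == 2");
```

Such a check would have turned Alexander's runtime breakage into a build failure the moment the constants were shifted.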
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-04 15:35 ` Hugh Dickins @ 2005-08-04 16:32 ` Russell King 0 siblings, 0 replies; 72+ messages in thread From: Russell King @ 2005-08-04 16:32 UTC (permalink / raw) To: Hugh Dickins Cc: Alexander Nyberg, Linus Torvalds, Nick Piggin, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath, Andi Kleen On Thu, Aug 04, 2005 at 04:35:06PM +0100, Hugh Dickins wrote: > And it does miss arm, the only arch which actually needs changing > right now, if we simply restore the original values which Nick shifted > - although arm references the VM_FAULT_ codes in some places, it also > uses "> 0". arm26 looks at first as if it needs changing too, but > a closer look shows it's remapping the faults and is okay - agreed? Your patch doesn't look right. Firstly, I'd rather stay away from switch() if at all possible - past experience has shown that it generates inherently poor code on ARM. Whether that's still true or not I've no idea, but I don't particularly want to find out at the moment. > Restore VM_FAULT_SIGBUS, VM_FAULT_MINOR and VM_FAULT_MAJOR to their > original values, so that arches which have them hardcoded will still > work before they're cleaned up. And correct arm to use the VM_FAULT_ > codes throughout, not assuming MINOR and MAJOR are the only ones > 0. And the above rules this out. As I say, I fixed ARM this morning, so changing these constants will break it again. Let's just wait for things to stabilise instead of trying to race with architecture maintainers... -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: 2.6 Serial core
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-04 15:00 ` Alexander Nyberg 2005-08-04 15:35 ` Hugh Dickins @ 2005-08-04 15:36 ` Linus Torvalds 1 sibling, 0 replies; 72+ messages in thread From: Linus Torvalds @ 2005-08-04 15:36 UTC (permalink / raw) To: Alexander Nyberg Cc: Nick Piggin, Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath, Andi Kleen On Thu, 4 Aug 2005, Alexander Nyberg wrote: > > Hardcoding is evil so it's good it gets cleaned up anyway. Yes. > > parisc, cris, m68k, frv, sh64, arm26 are also broken. Would you mind > > resending a patch that fixes them all? > > Remove the hardcoding in return value checking of handle_mm_fault() I only saw this one when I had already done it myself. Your arm26 conversion was only partial, btw. Notice how it returns the fault number (with a few extensions of its own) and processes it further? I think I got that right. Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-04 14:30 ` Nick Piggin 2005-08-04 15:00 ` Alexander Nyberg @ 2005-08-04 16:29 ` Russell King 1 sibling, 0 replies; 72+ messages in thread From: Russell King @ 2005-08-04 16:29 UTC (permalink / raw) To: Nick Piggin Cc: Alexander Nyberg, Linus Torvalds, Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath, Andi Kleen On Fri, Aug 05, 2005 at 12:30:07AM +1000, Nick Piggin wrote: > Alexander Nyberg wrote: > > On Wed, Aug 03, 2005 at 09:12:37AM -0700 Linus Torvalds wrote: > > > > > >> > >>Ok, I applied this because it was reasonably pretty and I liked the > >>approach. It seems buggy, though, since it was using "switch ()" to test > >>the bits (wrongly, afaik), and I'm going to apply the appended on top of > >>it. Holler quickly if you disagree.. > >> > > > > > > x86_64 had hardcoded the VM_ numbers so it broke down when the numbers > > were changed. > > > > Ugh, sorry I should have audited this but I really wasn't expecting > it (famous last words). Hasn't been a good week for me. > > parisc, cris, m68k, frv, sh64, arm26 are also broken. > Would you mind resending a patch that fixes them all? ARM as well - fix is pending Linus pulling my tree... -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: 2.6 Serial core
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 20:55 ` Hugh Dickins 2005-08-03 10:24 ` Nick Piggin @ 2005-08-03 10:24 ` Martin Schwidefsky 2005-08-03 11:57 ` Hugh Dickins 1 sibling, 1 reply; 72+ messages in thread From: Martin Schwidefsky @ 2005-08-03 10:24 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath, Linus Torvalds Hugh Dickins <hugh@veritas.com> wrote on 08/02/2005 10:55:31 PM: > > Go for it, I think whatever we do won't be wonderfully pretty. > > Here we are: get_user_pages quite untested, let alone the racy case, > but I think it should work. Please all hack it around as you see fit, > I'll check mail when I get home, but won't be very responsive... Ahh, just tested it and everything seems to work (even for s390).. I'm happy :-) blue skies, Martin Martin Schwidefsky Linux for zSeries Development & Services IBM Deutschland Entwicklung GmbH
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 10:24 ` Martin Schwidefsky @ 2005-08-03 11:57 ` Hugh Dickins 0 siblings, 0 replies; 72+ messages in thread From: Hugh Dickins @ 2005-08-03 11:57 UTC (permalink / raw) To: Martin Schwidefsky Cc: Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath, Linus Torvalds On Wed, 3 Aug 2005, Martin Schwidefsky wrote: > Hugh Dickins <hugh@veritas.com> wrote on 08/02/2005 10:55:31 PM: > > > > Here we are: get_user_pages quite untested, let alone the racy case, > > Ahh, just tested it and everything seems to work (even for s390).. > I'm happy :-) Thanks for testing, Martin. Your happiness is my bliss. Whether we go with Nick's mods on mine or not, I think you can now safely assume we've given up demanding a sudden change at the s390 end. Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 15:30 ` Linus Torvalds 2005-08-02 16:03 ` Hugh Dickins @ 2005-08-02 16:44 ` Martin Schwidefsky 1 sibling, 0 replies; 72+ messages in thread From: Martin Schwidefsky @ 2005-08-02 16:44 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Robin Holt, Hugh Dickins, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath Linus Torvalds <torvalds@osdl.org> wrote on 08/02/2005 05:30:37 PM: > > With the additional !pte_write(pte) check (and if I haven't overlooked > > something which is not unlikely) s390 should work fine even without the > > software-dirty bit hack. > > No it won't. It will just loop forever in a tight loop if somebody tries > to put a breakpoint on a read-only location. Yes, I have realized that as well now after staring at the code a little bit longer. That maybe_mkwrite is really tricky. > On the other hand, this being s390, maybe nobody cares? Some will care. At least I do. I've tested the latest git with gdb and it will indeed loop forever if I try to write to a read-only vma. blue skies, Martin Martin Schwidefsky Linux for zSeries Development & Services IBM Deutschland Entwicklung GmbH
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 8:21 ` [patch 2.6.13-rc4] fix get_user_pages bug Nick Piggin 2005-08-01 9:19 ` Ingo Molnar @ 2005-08-01 15:42 ` Linus Torvalds 2005-08-01 18:18 ` Linus Torvalds ` (3 more replies) 2005-08-01 20:03 ` Hugh Dickins 2 siblings, 4 replies; 72+ messages in thread From: Linus Torvalds @ 2005-08-01 15:42 UTC (permalink / raw) To: Nick Piggin Cc: Robin Holt, Andrew Morton, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel On Mon, 1 Aug 2005, Nick Piggin wrote: > > Not sure if this should be fixed for 2.6.13. It can result in > pagecache corruption: so I guess that answers my own question. Hell no. This patch is clearly untested and must _not_ be applied: + case VM_FAULT_RACE: + /* + * Someone else got there first. + * Must retry before we can assume + * that we have actually performed + * the write fault (below). + */ + if (write) + continue; + break; that "continue" will continue without the spinlock held, and now do follow_page() will run without page_table_lock, _and_ it will release the spinlock once more afterwards, so if somebody else is racing on this, we might remove the spinlock for them too. Don't do it. Instead, I'd suggest changing the logic for "lookup_write". Make it require that the page table entry is _dirty_ (not writable), and then remove the line that says: lookup_write = write && !force; and you're now done. A successful mm fault for write _should_ always have marked the PTE dirty (and yes, part of testing this would be to verify that this is true - but since architectures that don't have HW dirty bits depend on this anyway, I'm pretty sure it _is_ true). Ie something like the below (which is totally untested, obviously, but I think conceptually is a lot more correct, and obviously a lot simpler). 
Linus ---- diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -811,18 +811,15 @@ static struct page *__follow_page(struct pte = *ptep; pte_unmap(ptep); if (pte_present(pte)) { - if (write && !pte_write(pte)) + if (write && !pte_dirty(pte)) goto out; if (read && !pte_read(pte)) goto out; pfn = pte_pfn(pte); if (pfn_valid(pfn)) { page = pfn_to_page(pfn); - if (accessed) { - if (write && !pte_dirty(pte) &&!PageDirty(page)) - set_page_dirty(page); + if (accessed) mark_page_accessed(page); - } return page; } } @@ -972,14 +969,6 @@ int get_user_pages(struct task_struct *t default: BUG(); } - /* - * Now that we have performed a write fault - * and surely no longer have a shared page we - * shouldn't write, we shouldn't ignore an - * unwritable page in the page table if - * we are forcing write access. - */ - lookup_write = write && !force; spin_lock(&mm->page_table_lock); } if (pages) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
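The invariant Linus is leaning on in that patch can be modelled in a few lines (a userspace sketch with made-up pte flag bits; in the kernel these are the arch's pte_present/pte_write/pte_dirty tests): a completed write fault always dirties the pte, even when maybe_mkwrite() leaves it read-only, so requiring the dirty bit instead of the write bit lets the forced-write lookup succeed without looping.

```c
#include <assert.h>

#define PTE_PRESENT 0x1
#define PTE_WRITE   0x2
#define PTE_DIRTY   0x4

/* Model of a completed write fault: do_wp_page breaks COW and dirties
 * the pte, but may leave it !write when the vma is read-only (the
 * ptrace force-write case). */
unsigned write_fault(unsigned pte, int vma_writable)
{
    pte |= PTE_PRESENT | PTE_DIRTY;
    if (vma_writable)
        pte |= PTE_WRITE;
    return pte;
}

/* Linus's proposed __follow_page() test for a write lookup: dirty,
 * not writable. */
int follow_ok_for_write(unsigned pte)
{
    return (pte & PTE_PRESENT) && (pte & PTE_DIRTY);
}
```

With the old pte_write test, the ptrace case would fault, find the pte still not writable, and fault again forever; the dirty-bit test terminates after a single fault. (The flaws Hugh raises later — fork leaving ptes dirty but write-protected, and s390's dirty handling — are about the converse direction: a dirty pte is not always one whose COW has been broken.)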
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 15:42 ` Linus Torvalds @ 2005-08-01 18:18 ` Linus Torvalds 2005-08-03 8:24 ` Robin Holt 0 siblings, 1 reply; 72+ messages in thread From: Linus Torvalds @ 2005-08-01 18:18 UTC (permalink / raw) To: Nick Piggin Cc: Robin Holt, Andrew Morton, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel On Mon, 1 Aug 2005, Linus Torvalds wrote: > > Ie something like the below (which is totally untested, obviously, but I > think conceptually is a lot more correct, and obviously a lot simpler). I've tested it, and thought more about it, and I can't see any fault with the approach. In fact, I like it more. So it's checked in now (in a further simplified way, since the thing made "lookup_write" always be the same as just "write"). Can somebody who saw the problem in the first place please verify? Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 18:18 ` Linus Torvalds @ 2005-08-03 8:24 ` Robin Holt 2005-08-03 11:31 ` Hugh Dickins 0 siblings, 1 reply; 72+ messages in thread From: Robin Holt @ 2005-08-03 8:24 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Robin Holt, Andrew Morton, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel On Mon, Aug 01, 2005 at 11:18:42AM -0700, Linus Torvalds wrote: > On Mon, 1 Aug 2005, Linus Torvalds wrote: > > > > Ie something like the below (which is totally untested, obviously, but I > > think conceptually is a lot more correct, and obviously a lot simpler). > > I've tested it, and thought more about it, and I can't see any fault with > the approach. In fact, I like it more. So it's checked in now (in a > further simplified way, since the thing made "lookup_write" always be the > same as just "write"). > > Can somebody who saw the problem in the first place please verify? Unfortunately, I cannot get the user test to run against anything but the SLES9 SP2 kernel. I took the commit 4ceb5db9757aaeadcf8fbbf97d76bd42aa4df0d6 and applied that diff to the SUSE kernel. It does fix the problem the customer reported. Thanks, Robin Holt
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 8:24 ` Robin Holt @ 2005-08-03 11:31 ` Hugh Dickins 2005-08-04 11:48 ` Robin Holt 0 siblings, 1 reply; 72+ messages in thread From: Hugh Dickins @ 2005-08-03 11:31 UTC (permalink / raw) To: Robin Holt Cc: Linus Torvalds, Nick Piggin, Andrew Morton, Roland McGrath, linux-mm, linux-kernel On Wed, 3 Aug 2005, Robin Holt wrote: > On Mon, Aug 01, 2005 at 11:18:42AM -0700, Linus Torvalds wrote: > > > > Can somebody who saw the problem in the first place please verify? > > Unfortunately, I can not get the user test to run against anything but the > SLES9 SP2 kernel. I took the commit 4ceb5db9757aaeadcf8fbbf97d76bd42aa4df0d6 > and applied that diff to the SUSE kernel. It does fix the problem the > customer reported. Thanks for getting that tried, Robin, but I'm afraid you hadn't got far enough down your mailbox. There's a couple of flaws with Linus' patch above, one with respect to fork and one with respect to s390, so we're now intending to apply my patch below on top of Linus' (I see Nick has now come up with some refinement to mine, test his if you prefer it). Sorry to mess you back and forth, but 4ceb...f0d6 itself won't do, so we do rather need your customer to retest with the patch below too. Thanks, Hugh Checking pte_dirty instead of pte_write in __follow_page is problematic for s390, and for copy_one_pte which leaves dirty when clearing write. So revert __follow_page to check pte_write as before, and let do_wp_page pass back a special code VM_FAULT_WRITE to say it has done its full job (whereas VM_FAULT_MINOR when it backs out on race): once get_user_pages receives this value, it no longer requires pte_write in __follow_page. But most callers of handle_mm_fault, in the various architectures, have switch statements which do not expect this new case. 
To avoid changing them all in a hurry, only pass back VM_FAULT_WRITE when write_access arg says VM_FAULT_WRITE_EXPECTED - chosen as -1 since some arches pass write_access as a boolean, some as a bitflag, but none as -1. Yes, we do have a call to do_wp_page from do_swap_page, but no need to change that: in rare case it's needed, another do_wp_page will follow. Signed-off-by: Hugh Dickins <hugh@veritas.com> --- 2.6.13-rc5/include/linux/mm.h 2005-08-02 12:07:14.000000000 +0100 +++ linux/include/linux/mm.h 2005-08-02 21:14:58.000000000 +0100 @@ -629,6 +629,9 @@ static inline int page_mapped(struct pag #define VM_FAULT_SIGBUS 0 #define VM_FAULT_MINOR 1 #define VM_FAULT_MAJOR 2 +#define VM_FAULT_WRITE 3 /* special case for get_user_pages */ + +#define VM_FAULT_WRITE_EXPECTED (-1) /* only for get_user_pages */ #define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK) --- 2.6.13-rc5/mm/memory.c 2005-08-02 12:07:23.000000000 +0100 +++ linux/mm/memory.c 2005-08-02 21:14:26.000000000 +0100 @@ -811,15 +811,18 @@ static struct page *__follow_page(struct pte = *ptep; pte_unmap(ptep); if (pte_present(pte)) { - if (write && !pte_dirty(pte)) + if (write && !pte_write(pte)) goto out; if (read && !pte_read(pte)) goto out; pfn = pte_pfn(pte); if (pfn_valid(pfn)) { page = pfn_to_page(pfn); - if (accessed) + if (accessed) { + if (write && !pte_dirty(pte) &&!PageDirty(page)) + set_page_dirty(page); mark_page_accessed(page); + } return page; } } @@ -941,10 +944,11 @@ int get_user_pages(struct task_struct *t } spin_lock(&mm->page_table_lock); do { + int write_access = write? VM_FAULT_WRITE_EXPECTED: 0; struct page *page; cond_resched_lock(&mm->page_table_lock); - while (!(page = follow_page(mm, start, write))) { + while (!(page = follow_page(mm, start, write_access))) { /* * Shortcut for anonymous pages. 
We don't want * to force the creation of pages tables for @@ -957,7 +961,16 @@ int get_user_pages(struct task_struct *t break; } spin_unlock(&mm->page_table_lock); - switch (handle_mm_fault(mm,vma,start,write)) { + switch (handle_mm_fault(mm, vma, start, + write_access)) { + case VM_FAULT_WRITE: + /* + * do_wp_page has broken COW when + * necessary, even if maybe_mkwrite + * decided not to set pte_write + */ + write_access = 0; + /* FALLTHRU */ case VM_FAULT_MINOR: tsk->min_flt++; break; @@ -1220,6 +1233,7 @@ static int do_wp_page(struct mm_struct * struct page *old_page, *new_page; unsigned long pfn = pte_pfn(pte); pte_t entry; + int ret; if (unlikely(!pfn_valid(pfn))) { /* @@ -1247,7 +1261,7 @@ static int do_wp_page(struct mm_struct * lazy_mmu_prot_update(entry); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return VM_FAULT_WRITE; } } pte_unmap(page_table); @@ -1274,6 +1288,7 @@ static int do_wp_page(struct mm_struct * /* * Re-check the pte - we dropped the lock */ + ret = VM_FAULT_MINOR; spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); if (likely(pte_same(*page_table, pte))) { @@ -1290,12 +1305,13 @@ static int do_wp_page(struct mm_struct * /* Free the old page.. 
*/ new_page = old_page; + ret = VM_FAULT_WRITE; } pte_unmap(page_table); page_cache_release(new_page); page_cache_release(old_page); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return ret; no_new_page: page_cache_release(old_page); @@ -1985,9 +2001,13 @@ static inline int handle_pte_fault(struc } if (write_access) { - if (!pte_write(entry)) - return do_wp_page(mm, vma, address, pte, pmd, entry); - + if (!pte_write(entry)) { + int ret = do_wp_page(mm, vma, address, pte, pmd, entry); + if (ret == VM_FAULT_WRITE && + write_access != VM_FAULT_WRITE_EXPECTED) + ret = VM_FAULT_MINOR; + return ret; + } entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry);
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 11:31 ` Hugh Dickins @ 2005-08-04 11:48 ` Robin Holt 2005-08-04 13:04 ` Hugh Dickins 0 siblings, 1 reply; 72+ messages in thread From: Robin Holt @ 2005-08-04 11:48 UTC (permalink / raw) To: Hugh Dickins Cc: Robin Holt, Linus Torvalds, Nick Piggin, Andrew Morton, Roland McGrath, linux-mm, linux-kernel On Wed, Aug 03, 2005 at 12:31:34PM +0100, Hugh Dickins wrote: > On Wed, 3 Aug 2005, Robin Holt wrote: > > On Mon, Aug 01, 2005 at 11:18:42AM -0700, Linus Torvalds wrote: > > > > > > Can somebody who saw the problem in the first place please verify? OK. I took the three commits: 4ceb5db9757aaeadcf8fbbf97d76bd42aa4df0d6 f33ea7f404e592e4563b12101b7a4d17da6558d7 a68d2ebc1581a3aec57bd032651e013fa609f530 I backported them to the SuSE SLES9SP2 kernel. I will add them at the end so you can tell me if I messed things up. I was then able to run the customer test application multiple times without issue. Before the fix, we had never achieved three consecutive runs that did not trip the fault. After the change, it has been in excess of thirty. I would say this has fixed the problem. Did I miss anything which needs to be tested?
Thanks, Robin Holt ----- Patches against SLES9SP2 7.191 kernel ----- Adapted for SLES9 SP2 kernel from X-Git-Url: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4ceb5db9757aaeadcf8fbbf97d76bd42aa4df0d6 Index: linux/mm/memory.c =================================================================== --- linux.orig/mm/memory.c 2005-08-03 16:18:31.553626601 -0500 +++ linux/mm/memory.c 2005-08-03 16:24:08.905247901 -0500 @@ -674,7 +674,7 @@ follow_page(struct mm_struct *mm, unsign pte = *ptep; pte_unmap(ptep); if (pte_present(pte)) { - if (write && !pte_write(pte)) + if (write && !pte_dirty(pte)) goto out; if (write && !pte_dirty(pte)) { struct page *page = pte_page(pte); @@ -814,8 +814,7 @@ int get_user_pages(struct task_struct *t spin_lock(&mm->page_table_lock); do { struct page *map; - int lookup_write = write; - while (!(map = follow_page(mm, start, lookup_write))) { + while (!(map = follow_page(mm, start, write))) { /* * Shortcut for anonymous pages. We don't want * to force the creation of pages tables for @@ -823,8 +822,7 @@ int get_user_pages(struct task_struct *t * nobody touched so far. This is important * for doing a core dump for these mappings. */ - if (!lookup_write && - untouched_anonymous_page(mm,vma,start)) { + if (!write && untouched_anonymous_page(mm,vma,start)) { map = ZERO_PAGE(start); break; } @@ -843,14 +841,6 @@ int get_user_pages(struct task_struct *t default: BUG(); } - /* - * Now that we have performed a write fault - * and surely no longer have a shared page we - * shouldn't write, we shouldn't ignore an - * unwritable page in the page table if - * we are forcing write access. 
- */ - lookup_write = write && !force; spin_lock(&mm->page_table_lock); } if (pages) { Adapted for SLES9 SP2 kernel from X-Git-Url: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f33ea7f404e592e4563b12101b7a4d17da6558d7 Index: linux/include/linux/mm.h =================================================================== --- linux.orig/include/linux/mm.h 2005-08-03 16:22:55.905794491 -0500 +++ linux/include/linux/mm.h 2005-08-03 16:24:48.123938110 -0500 @@ -651,10 +651,16 @@ static inline int page_mapped(struct pag * Used to decide whether a process gets delivered SIGBUS or * just gets major/minor fault counters bumped up. */ -#define VM_FAULT_OOM (-1) -#define VM_FAULT_SIGBUS 0 -#define VM_FAULT_MINOR 1 -#define VM_FAULT_MAJOR 2 +#define VM_FAULT_OOM 0x00 +#define VM_FAULT_SIGBUS 0x01 +#define VM_FAULT_MINOR 0x02 +#define VM_FAULT_MAJOR 0x03 + +/* + * Special case for get_user_pages. + * Must be in a distinct bit from the above VM_FAULT_ flags. + */ +#define VM_FAULT_WRITE 0x10 #define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK) @@ -704,7 +710,13 @@ extern pte_t *FASTCALL(pte_alloc_kernel( extern pte_t *FASTCALL(pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)); extern int install_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot); extern int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long pgoff, pgprot_t prot); -extern int handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access); +extern int __handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access); + +static inline int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access) +{ + return __handle_mm_fault(mm, vma, address, write_access) & (~VM_FAULT_WRITE); +} + extern void 
make_pages_present(unsigned long addr, unsigned long end); extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write); void put_dirty_page(struct task_struct *tsk, struct page *page, Index: linux/mm/memory.c =================================================================== --- linux.orig/mm/memory.c 2005-08-03 16:24:08.905247901 -0500 +++ linux/mm/memory.c 2005-08-03 16:29:01.901967868 -0500 @@ -674,7 +674,7 @@ follow_page(struct mm_struct *mm, unsign pte = *ptep; pte_unmap(ptep); if (pte_present(pte)) { - if (write && !pte_dirty(pte)) + if (write && !pte_write(pte)) goto out; if (write && !pte_dirty(pte)) { struct page *page = pte_page(pte); @@ -684,8 +684,9 @@ follow_page(struct mm_struct *mm, unsign pfn = pte_pfn(pte); if (pfn_valid(pfn)) { struct page *page = pfn_to_page(pfn); - - mark_page_accessed(page); + if (write && !pte_dirty(pte) &&!PageDirty(page)) + set_page_dirty(page); + mark_page_accessed(page); return page; } } @@ -813,8 +814,9 @@ int get_user_pages(struct task_struct *t } spin_lock(&mm->page_table_lock); do { + int write_access = write; struct page *map; - while (!(map = follow_page(mm, start, write))) { + while (!(map = follow_page(mm, start, write_access))) { /* * Shortcut for anonymous pages. 
We don't want * to force the creation of pages tables for @@ -827,7 +829,16 @@ int get_user_pages(struct task_struct *t break; } spin_unlock(&mm->page_table_lock); - switch (handle_mm_fault(mm,vma,start,write)) { + switch (__handle_mm_fault(mm, vma, start, + write_access)) { + case VM_FAULT_WRITE: + /* + * do_wp_page has broken COW when + * necessary, even if maybe_mkwrite + * decided not to set pte_write + */ + write_access = 0; + /* FALLTHRU */ case VM_FAULT_MINOR: tsk->min_flt++; break; @@ -1086,6 +1097,7 @@ static int do_wp_page(struct mm_struct * struct page *old_page, *new_page; unsigned long pfn = pte_pfn(pte); pte_t entry; + int ret; if (unlikely(!pfn_valid(pfn))) { /* @@ -1113,7 +1125,7 @@ static int do_wp_page(struct mm_struct * lazy_mmu_prot_update(entry); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return VM_FAULT_MINOR|VM_FAULT_WRITE; } } pte_unmap(page_table); @@ -1135,6 +1147,7 @@ static int do_wp_page(struct mm_struct * /* * Re-check the pte - we dropped the lock */ + ret = VM_FAULT_MINOR; spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); if (likely(pte_same(*page_table, pte))) { @@ -1153,12 +1166,13 @@ static int do_wp_page(struct mm_struct * /* Free the old page.. 
*/ new_page = old_page; + ret |= VM_FAULT_WRITE; } pte_unmap(page_table); page_cache_release(new_page); page_cache_release(old_page); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return ret; no_new_page: page_cache_release(old_page); @@ -1753,7 +1767,6 @@ static inline int handle_pte_fault(struc if (write_access) { if (!pte_write(entry)) return do_wp_page(mm, vma, address, pte, pmd, entry); - entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); @@ -1777,7 +1790,7 @@ int __attribute__((weak)) arch_hugetlb_f /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma, +int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma, unsigned long address, int write_access) { pgd_t *pgd; Adapted for SLES9 SP2 kernel from X-Git-Url: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a68d2ebc1581a3aec57bd032651e013fa609f530 Index: linux/mm/memory.c =================================================================== --- linux.orig/mm/memory.c 2005-08-03 16:29:13.789434124 -0500 +++ linux/mm/memory.c 2005-08-03 16:29:13.838257298 -0500 @@ -817,6 +817,8 @@ int get_user_pages(struct task_struct *t int write_access = write; struct page *map; while (!(map = follow_page(mm, start, write_access))) { + int ret; + /* * Shortcut for anonymous pages. We don't want * to force the creation of pages tables for @@ -829,16 +831,18 @@ int get_user_pages(struct task_struct *t break; } spin_unlock(&mm->page_table_lock); - switch (__handle_mm_fault(mm, vma, start, - write_access)) { - case VM_FAULT_WRITE: - /* - * do_wp_page has broken COW when - * necessary, even if maybe_mkwrite - * decided not to set pte_write - */ + ret = __handle_mm_fault(mm, vma, start, write_access); + + /* + * The VM_FAULT_WRITE bit tells us that do_wp_page has + * broken COW when necessary, even if maybe_mkwrite + * decided not to set pte_write. 
We can thus safely do + * subsequent page lookups as if they were reads. + */ + if (ret & VM_FAULT_WRITE) write_access = 0; - /* FALLTHRU */ + + switch (ret & ~VM_FAULT_WRITE) { case VM_FAULT_MINOR: tsk->min_flt++; break; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-04 11:48 ` Robin Holt @ 2005-08-04 13:04 ` Hugh Dickins 0 siblings, 0 replies; 72+ messages in thread From: Hugh Dickins @ 2005-08-04 13:04 UTC (permalink / raw) To: Robin Holt Cc: Linus Torvalds, Nick Piggin, Andrew Morton, Roland McGrath, linux-mm, linux-kernel On Thu, 4 Aug 2005, Robin Holt wrote: > On Wed, Aug 03, 2005 at 12:31:34PM +0100, Hugh Dickins wrote: > > On Wed, 3 Aug 2005, Robin Holt wrote: > > > On Mon, Aug 01, 2005 at 11:18:42AM -0700, Linus Torvalds wrote: > > > > > > > > Can somebody who saw the problem in the first place please verify? > > OK. I took the three commits: > 4ceb5db9757aaeadcf8fbbf97d76bd42aa4df0d6 > f33ea7f404e592e4563b12101b7a4d17da6558d7 > a68d2ebc1581a3aec57bd032651e013fa609f530 > > I backported them to the SuSE SLES9SP2 kernel. I will add them at > the end so you can tell me if I messed things up. I was then able > to run the customer test application multiple times without issue. > Before the fix, we had never achieved three consecutive runs that did > not trip the fault. After the change, it has been in excess of thirty. > I would say this has fixed the problem. Did I miss anything which > needs to be tested? Great, thanks for the testing, the set of patches you tested is correct. (I think there is one minor anomaly in your patch backport, which has no effect on the validity of your testing: you've probably ended up with two separate calls to set_page_dirty in your follow_page, because that moved between 2.6.5 and 2.6.13. It doesn't matter, but you might want to tidy that up one way or the other if you're passing your patch on further.) Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 15:42 ` Linus Torvalds 2005-08-01 18:18 ` Linus Torvalds @ 2005-08-01 19:29 ` Hugh Dickins 2005-08-01 19:48 ` Linus Torvalds 2005-08-01 19:57 ` Andrew Morton 2005-08-02 0:14 ` Nick Piggin 2005-08-02 1:27 ` Nick Piggin 3 siblings, 2 replies; 72+ messages in thread From: Hugh Dickins @ 2005-08-01 19:29 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Robin Holt, Andrew Morton, Roland McGrath, Martin Schwidefsky, linux-mm, linux-kernel On Mon, 1 Aug 2005, Linus Torvalds wrote: > > that "continue" will continue without the spinlock held, and now do Yes, I was at last about to reply on that point and others. I'll make those comments in a separate mail to Nick and all. > Instead, I'd suggest changing the logic for "lookup_write". Make it > require that the page table entry is _dirty_ (not writable), and then Attractive, I very much wanted to do that rather than change all the arches, but I think s390 rules it out: its pte_mkdirty does nothing, its pte_dirty just says no. Whether your patch suits all other uses of (__)follow_page I've not investigated (and I don't see how you can go without the set_page_dirty if it was necessary before); but at present see no alternative to something like Nick's patch, though I'd much prefer smaller. Or should we change s390 to set a flag in the pte just for this purpose? Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 19:29 ` Hugh Dickins @ 2005-08-01 19:48 ` Linus Torvalds 2005-08-02 8:07 ` Martin Schwidefsky 2005-08-01 19:57 ` Andrew Morton 1 sibling, 1 reply; 72+ messages in thread From: Linus Torvalds @ 2005-08-01 19:48 UTC (permalink / raw) To: Hugh Dickins Cc: Nick Piggin, Robin Holt, Andrew Morton, Roland McGrath, Martin Schwidefsky, linux-mm, linux-kernel On Mon, 1 Aug 2005, Hugh Dickins wrote: > > Attractive, I very much wanted to do that rather than change all the > arches, but I think s390 rules it out: its pte_mkdirty does nothing, > its pte_dirty just says no. How does s390 work at all? > Or should we change s390 to set a flag in the pte just for this purpose? If the choice is between a broken and ugly implementation for everybody else, then hell yes. Even if it's a purely sw bit that nothing else actually cares about.. I hope they have an extra bit around somewhere. Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 19:48 ` Linus Torvalds @ 2005-08-02 8:07 ` Martin Schwidefsky 0 siblings, 0 replies; 72+ messages in thread From: Martin Schwidefsky @ 2005-08-02 8:07 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Robin Holt, Hugh Dickins, linux-kernel, linux-mm, Nick Piggin, Roland McGrath Linus Torvalds <torvalds@osdl.org> wrote on 08/01/2005 09:48:40 PM: > > Attractive, I very much wanted to do that rather than change all the > > arches, but I think s390 rules it out: its pte_mkdirty does nothing, > > its pte_dirty just says no. > > How does s390 work at all? The big difference between s390 and your standard architecture is that s390 keeps the dirty and reference bits in the storage key. That is per physical page and not per mapping. The primitive pte_dirty() just doesn't make any sense for s390. A pte never contains any information about dirty/reference state of a page. The "page" itself contains it, you access the information with some instructions (sske, iske & rrbe) which get the page frame address as parameter. > > Or should we change s390 to set a flag in the pte just for this purpose? > > If the choice is between a broken and ugly implementation for everybody > else, then hell yes. Even if it's a purely sw bit that nothing else > actually cares about.. I hope they have an extra bit around somewhere. Urg, depending on the pte type there are no bits available. For valid ptes there are some bits we could use but it wouldn't be nice. blue skies, Martin Martin Schwidefsky Linux for zSeries Development & Services IBM Deutschland Entwicklung GmbH
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 19:29 ` Hugh Dickins 2005-08-01 19:48 ` Linus Torvalds @ 2005-08-01 19:57 ` Andrew Morton 2005-08-01 20:16 ` Linus Torvalds 1 sibling, 1 reply; 72+ messages in thread From: Andrew Morton @ 2005-08-01 19:57 UTC (permalink / raw) To: Hugh Dickins Cc: torvalds, nickpiggin, holt, roland, schwidefsky, linux-mm, linux-kernel Hugh Dickins <hugh@veritas.com> wrote: > > On Mon, 1 Aug 2005, Linus Torvalds wrote: > > > > that "continue" will continue without the spinlock held, and now do > > Yes, I was at last about to reply on that point and others. > I'll make those comments in a separate mail to Nick and all. > > > Instead, I'd suggest changing the logic for "lookup_write". Make it > > require that the page table entry is _dirty_ (not writable), and then > > Attractive, I very much wanted to do that rather than change all the > arches, but I think s390 rules it out: its pte_mkdirty does nothing, > its pte_dirty just says no. > > Whether your patch suits all other uses of (__)follow_page I've not > investigated (and I don't see how you can go without the set_page_dirty > if it was necessary before); That was introduced 19 months ago by the s390 guys (see patch below). I don't really see why Martin decided to mark the page software-dirty at that stage. It's a nice thing to do from the VM dirty-memory accounting POV, but I don't see that it's essential. > but at present see no alternative to > something like Nick's patch, though I'd much prefer smaller. > > Or should we change s390 to set a flag in the pte just for this purpose? That would be a good approach IMO, if possible. From: Martin Schwidefsky <schwidefsky@de.ibm.com> Fix endless loop in get_user_pages() on s390. It happens only on s/390 because pte_dirty always returns 0. For all other architectures this is an optimization. In the case of "write && !pte_dirty(pte)" follow_page() returns NULL. 
On all architectures except s390 handle_pte_fault() will then create a pte with pte_dirty(pte)==1 because write_access==1. In the following, second call to follow_page() all is fine. With the physical dirty bit patch pte_dirty() is always 0 for s/390 because the dirty bit doesn't live in the pte. --- mm/memory.c | 21 +++++++++++++-------- 1 files changed, 13 insertions(+), 8 deletions(-) diff -puN mm/memory.c~s390-16-follow_page-lockup-fix mm/memory.c --- 25/mm/memory.c~s390-16-follow_page-lockup-fix 2004-01-18 22:36:00.000000000 -0800 +++ 25-akpm/mm/memory.c 2004-01-18 22:36:00.000000000 -0800 @@ -651,14 +651,19 @@ follow_page(struct mm_struct *mm, unsign pte = *ptep; pte_unmap(ptep); if (pte_present(pte)) { - if (!write || (pte_write(pte) && pte_dirty(pte))) { - pfn = pte_pfn(pte); - if (pfn_valid(pfn)) { - struct page *page = pfn_to_page(pfn); - - mark_page_accessed(page); - return page; - } + if (write && !pte_write(pte)) + goto out; + if (write && !pte_dirty(pte)) { + struct page *page = pte_page(pte); + if (!PageDirty(page)) + set_page_dirty(page); + } + pfn = pte_pfn(pte); + if (pfn_valid(pfn)) { + struct page *page = pfn_to_page(pfn); + + mark_page_accessed(page); + return page; } } _
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 19:57 ` Andrew Morton @ 2005-08-01 20:16 ` Linus Torvalds 0 siblings, 0 replies; 72+ messages in thread From: Linus Torvalds @ 2005-08-01 20:16 UTC (permalink / raw) To: Andrew Morton Cc: Hugh Dickins, nickpiggin, holt, roland, schwidefsky, linux-mm, linux-kernel On Mon, 1 Aug 2005, Andrew Morton wrote: > > That was introduced 19 months ago by the s390 guys (see patch below). This really is a very broken patch, btw. > + if (write && !pte_write(pte)) > + goto out; > + if (write && !pte_dirty(pte)) { > + struct page *page = pte_page(pte); > + if (!PageDirty(page)) > + set_page_dirty(page); > + } > + pfn = pte_pfn(pte); > + if (pfn_valid(pfn)) { > + struct page *page = pfn_to_page(pfn); > + > + mark_page_accessed(page); > + return page; Note how it doesn't do any "pfn_valid()" stuff for the dirty bit setting, so it will set random bits in memory if the pte points to some IO page. Maybe that doesn't happen on s390, but.. Anyway, if the s390 people just have a sw-writable bit in their page table layout, I bet they can fix their problem by just having a "sw dirty" bit, and then make "pte_mkdirty()" set that bit. Nobody else will care, but ptrace will then just work correctly for them too. Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 15:42 ` Linus Torvalds 2005-08-01 18:18 ` Linus Torvalds 2005-08-01 19:29 ` Hugh Dickins @ 2005-08-02 0:14 ` Nick Piggin 2005-08-02 1:27 ` Nick Piggin 3 siblings, 0 replies; 72+ messages in thread From: Nick Piggin @ 2005-08-02 0:14 UTC (permalink / raw) To: Linus Torvalds Cc: Robin Holt, Andrew Morton, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel Linus Torvalds wrote: > >On Mon, 1 Aug 2005, Nick Piggin wrote: > >>Not sure if this should be fixed for 2.6.13. It can result in >>pagecache corruption: so I guess that answers my own question. >> > >Hell no. > >This patch is clearly untested and must _not_ be applied: > > Yes, I meant that the problem should be fixed, not that the patch should be applied straight away. I wanted to get discussion going ASAP. Looks like it worked :) I'll catch up on it now.
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 15:42 ` Linus Torvalds
  ` (2 preceding siblings ...)
  2005-08-02  0:14 ` Nick Piggin
@ 2005-08-02  1:27 ` Nick Piggin
  2005-08-02  3:45 ` Linus Torvalds
  3 siblings, 1 reply; 72+ messages in thread
From: Nick Piggin @ 2005-08-02  1:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Robin Holt, Andrew Morton, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel

Linus Torvalds wrote:
>
> Instead, I'd suggest changing the logic for "lookup_write". Make it
> require that the page table entry is _dirty_ (not writable), and then
> remove the line that says:
>
>	lookup_write = write && !force;
>
> and you're now done. A successful mm fault for write _should_ always have
> marked the PTE dirty (and yes, part of testing this would be to verify
> that this is true - but since architectures that don't have HW dirty
> bits depend on this anyway, I'm pretty sure it _is_ true).
>
> Ie something like the below (which is totally untested, obviously, but I
> think conceptually is a lot more correct, and obviously a lot simpler).

Surely this introduces integrity problems when `force` is not set?
Security holes? Perhaps not, but I wouldn't guarantee it.

However: I like your idea. And getting rid of the lookup_write logic is
a good thing. I don't much like that it changes the semantics of
follow_page for write on a readonly pte, and that is where your problem
is introduced.

I think to go down this route you'd at least need a follow_page check
that is distinct from 'write'. 'writeable', maybe. Then, having a
'writeable' flag lets you neatly "comment" your idea of what might
constitute a writeable pte, as opposed to the current code which
basically looks like black magic to a reader, and gives no indication
of how it satisfies the get_user_pages requirements.

A minor issue: I don't much like the proliferation of __follow_page
flags either. Why not make __follow_page take a bitmask, and have it
used directly by get_user_pages, which would also allow removal of the
'write' argument from follow_page.

I would cook you some patches, but I'm not in front of the source tree.
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-02  1:27 ` Nick Piggin
@ 2005-08-02  3:45 ` Linus Torvalds
  2005-08-02  4:25 ` Nick Piggin
  0 siblings, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2005-08-02  3:45 UTC (permalink / raw)
To: Nick Piggin
Cc: Robin Holt, Andrew Morton, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel

On Tue, 2 Aug 2005, Nick Piggin wrote:
>
> Surely this introduces integrity problems when `force` is not set?

"force" changes how we test the vma->vm_flags, and that was always its
meaning from a security standpoint (and that hasn't changed).

The old code had this "lookup_write = write && !force;" thing because
there it used "force" to _clear_ the write bit test, and that was what
caused the race in the first place - next time around we would accept a
non-writable page, even if it hadn't actually gotten COW'ed.

So no, the patch doesn't introduce integrity problems by ignoring "force".
Quite the reverse - it _removes_ the integrity problems by ignoring it
there. That's kind of the whole point.

		Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-02  3:45 ` Linus Torvalds
@ 2005-08-02  4:25 ` Nick Piggin
  2005-08-02  4:35 ` Linus Torvalds
  0 siblings, 1 reply; 72+ messages in thread
From: Nick Piggin @ 2005-08-02  4:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: Robin Holt, Andrew Morton, Roland McGrath, Hugh Dickins, linux-mm, lkml

On Mon, 2005-08-01 at 20:45 -0700, Linus Torvalds wrote:
>
> On Tue, 2 Aug 2005, Nick Piggin wrote:
> >
> > Surely this introduces integrity problems when `force` is not set?
>
> "force" changes how we test the vma->vm_flags, that was always the
> meaning from a security standpoint (and that hasn't changed).

Of course, this test catches the problem I had in mind.

> The old code had this "lookup_write = write && !force;" thing because
> there it used "force" to _clear_ the write bit test, and that was what
> caused the race in the first place - next time around we would accept a
> non-writable page, even if it hadn't actually gotten COW'ed.
>
> So no, the patch doesn't introduce integrity problems by ignoring "force".
> Quite the reverse - it _removes_ the integrity problems by ignoring it
> there. That's kind of the whole point.

OK, I'm convinced.

One last thing - your fix might have a non-trivial overhead in terms of
spinlocks and simply entering the high-level page fault handler when
dealing with clean, writeable ptes for write. Any chance you can change
the __follow_page test to account for writeable clean ptes? Something
like:

	if (write && !pte_dirty(pte) && !pte_write(pte))
		goto out;

And then you would re-add the set_page_dirty logic further on.

Not that I know what Robin's customer is doing exactly, but it seems
like something you can optimise easily enough.

Nick

--
SUSE Labs, Novell Inc.
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-02  4:25 ` Nick Piggin
@ 2005-08-02  4:35 ` Linus Torvalds
  0 siblings, 0 replies; 72+ messages in thread
From: Linus Torvalds @ 2005-08-02  4:35 UTC (permalink / raw)
To: Nick Piggin
Cc: Robin Holt, Andrew Morton, Roland McGrath, Hugh Dickins, linux-mm, lkml

On Tue, 2 Aug 2005, Nick Piggin wrote:
>
> Any chance you can change the __follow_page test to account for
> writeable clean ptes? Something like
>
>	if (write && !pte_dirty(pte) && !pte_write(pte))
>		goto out;
>
> And then you would re-add the set_page_dirty logic further on.

Hmm.. That should be possible. I wanted to do the simplest possible code
sequence, but yeah, I guess there's nothing wrong with allowing the code
to dirty the page.

Somebody want to send me a proper patch? Also, I haven't actually heard
from whoever actually noticed the problem in the first place (Robin?)
whether the fix does fix it. It "obviously does", but testing is always
good ;)

		Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01  8:21 ` [patch 2.6.13-rc4] fix get_user_pages bug Nick Piggin
  2005-08-01  9:19 ` Ingo Molnar
  2005-08-01 15:42 ` Linus Torvalds
@ 2005-08-01 20:03 ` Hugh Dickins
  2005-08-01 20:12 ` Andrew Morton
  2 siblings, 1 reply; 72+ messages in thread
From: Hugh Dickins @ 2005-08-01 20:03 UTC (permalink / raw)
To: Nick Piggin
Cc: Robin Holt, Andrew Morton, Linus Torvalds, Ingo Molnar, Roland McGrath, linux-mm, linux-kernel

On Mon, 1 Aug 2005, Nick Piggin wrote:
>
> This was tested by Robin and appears to solve the problem. Roland
> had a quick look and thought the basic idea was sound. I'd like to
> get a couple more acks before going forward, and in particular
> Robin was contemplating possible efficiency improvements (although
> efficiency can wait on correctness).

I'd much prefer a solution that doesn't invade all the arches, but I
don't think Linus' pte_dirty method will work on s390 (unless we change
s390 in a less obvious way than your patch), so it looks like we do
need something like yours. Comments:

There are currently 21 architectures, but so far your patch only
updates 14 of them?

Personally (rather like Robin) I'd have preferred a return code more
directed to the issue at hand than your VM_FAULT_RACE. Not for
efficiency reasons - I think you're right that's not a real issue - but
more to document the peculiar case we're addressing. Perhaps a code
that says do_wp_page has gone all the way through, even though it
hasn't set the writable bit. That would require less change in
mm/memory.c, but I've no strong reason to argue that you change your
approach in that way. Perhaps others prefer the race case singled out,
to gather statistics on it.

Assuming we stick with your VM_FAULT_RACE...

Could we define VM_FAULT_RACE as 3 rather than -2? I think there's some
historical reason why VM_FAULT_OOM is -1, but I see no cause to extend
the range in that strange direction.

VM_FAULT_RACE is a particular subcase of VM_FAULT_MINOR: throughout the
arches I think they should just be adjacent cases of the switch
statement, both doing the min_flt++.

Your continue in get_user_pages skips page_table_lock, as Linus noted.

The do_wp_page call from do_swap_page needs to be adjusted, to return
VM_FAULT_RACE if that's what it returned.

If VM_FAULT_RACE is really recording races, then the bottom return
value from handle_pte_fault ought to be VM_FAULT_RACE rather than
VM_FAULT_MINOR.

Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 20:03 ` Hugh Dickins
@ 2005-08-01 20:12 ` Andrew Morton
  2005-08-01 20:26 ` Linus Torvalds
  2005-08-01 20:51 ` Hugh Dickins
  0 siblings, 2 replies; 72+ messages in thread
From: Andrew Morton @ 2005-08-01 20:12 UTC (permalink / raw)
To: Hugh Dickins
Cc: nickpiggin, holt, torvalds, mingo, roland, linux-mm, linux-kernel

Hugh Dickins <hugh@veritas.com> wrote:
>
> There are currently 21 architectures,
> but so far your patch only updates 14 of them?

We could just do:

	static inline int handle_mm_fault(...)
	{
		int ret = __handle_mm_fault(...);

		if (unlikely(ret == VM_FAULT_RACE))
			ret = VM_FAULT_MINOR;
		return ret;
	}

because VM_FAULT_RACE is some internal private thing.

It does add another test-n-branch to the pagefault path though.
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 20:12 ` Andrew Morton
@ 2005-08-01 20:26 ` Linus Torvalds
  2005-08-01 20:51 ` Hugh Dickins
  1 sibling, 0 replies; 72+ messages in thread
From: Linus Torvalds @ 2005-08-01 20:26 UTC (permalink / raw)
To: Andrew Morton
Cc: Hugh Dickins, nickpiggin, holt, mingo, roland, linux-mm, linux-kernel

On Mon, 1 Aug 2005, Andrew Morton wrote:
>
> We could just do:
>
>	static inline int handle_mm_fault(...)
>	{
>		int ret = __handle_mm_fault(...);
>
>		if (unlikely(ret == VM_FAULT_RACE))
>			ret = VM_FAULT_MINOR;

The reason I really dislike this whole VM_FAULT_RACE thing is that
there's literally just one user that cares, and that user is such a
special case anyway that we're _much_ better off fixing it in that user
instead.

The dirty bit thing is truly trivial, and is a generic VM feature. The
fact that s390 does strange things is immaterial: I bet that s390 can
be fixed much more easily than the suggested VM_FAULT_RACE patch, and
quite frankly, bringing it semantically closer to the rest of the
architectures is a _good_ thing regardless.

		Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 20:12 ` Andrew Morton
  2005-08-01 20:26 ` Linus Torvalds
@ 2005-08-01 20:51 ` Hugh Dickins
  1 sibling, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2005-08-01 20:51 UTC (permalink / raw)
To: Andrew Morton
Cc: nickpiggin, holt, torvalds, mingo, roland, linux-mm, linux-kernel

On Mon, 1 Aug 2005, Andrew Morton wrote:
>
> static inline int handle_mm_fault(...)
> {
>	int ret = __handle_mm_fault(...);
>
>	if (unlikely(ret == VM_FAULT_RACE))
>		ret = VM_FAULT_MINOR;
>	return ret;
> }
>
> because VM_FAULT_RACE is some internal private thing.
>
> It does add another test-n-branch to the pagefault path though.

Good idea, at least to avoid changing all arches at this moment; though
I don't think handle_mm_fault itself can be static inline.

But let's set this VM_FAULT_RACE approach aside for now: I think we're
agreed that the pte_dirty-with-mods-to-s390 route is more attractive,
so I'll now try to find fault with that approach.

Hugh
end of thread, other threads:[~2005-08-04 16:32 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-07-30 20:53 get_user_pages() with write=1 and force=1 gets read-only pages Robin Holt
2005-07-30 22:13 ` Hugh Dickins
2005-07-31  1:52 ` Nick Piggin
2005-07-31 10:52 ` Robin Holt
2005-07-31 11:07 ` Nick Piggin
2005-07-31 11:30 ` Robin Holt
2005-07-31 11:39 ` Robin Holt
2005-07-31 12:09 ` Robin Holt
2005-07-31 22:27 ` Nick Piggin
2005-08-01  3:22 ` Roland McGrath
2005-08-01  8:21 ` [patch 2.6.13-rc4] fix get_user_pages bug Nick Piggin
2005-08-01  9:19 ` Ingo Molnar
2005-08-01  9:27 ` Nick Piggin
2005-08-01 10:15 ` Ingo Molnar
2005-08-01 10:57 ` Nick Piggin
2005-08-01 19:43 ` Hugh Dickins
2005-08-01 20:08 ` Linus Torvalds
2005-08-01 21:06 ` Hugh Dickins
2005-08-01 21:51 ` Linus Torvalds
2005-08-01 22:01 ` Linus Torvalds
2005-08-02 12:01 ` Martin Schwidefsky
2005-08-02 12:26 ` Hugh Dickins
2005-08-02 12:28 ` Nick Piggin
2005-08-02 15:19 ` Martin Schwidefsky
2005-08-02 15:30 ` Linus Torvalds
2005-08-02 16:03 ` Hugh Dickins
2005-08-02 16:25 ` Linus Torvalds
2005-08-02 17:02 ` Linus Torvalds
2005-08-02 17:27 ` Hugh Dickins
2005-08-02 17:21 ` Hugh Dickins
2005-08-02 18:47 ` Linus Torvalds
2005-08-02 19:20 ` Hugh Dickins
2005-08-02 19:54 ` Linus Torvalds
2005-08-02 20:55 ` Hugh Dickins
2005-08-03 10:24 ` Nick Piggin
2005-08-03 11:47 ` Hugh Dickins
2005-08-03 12:13 ` Nick Piggin
2005-08-03 16:12 ` Linus Torvalds
2005-08-03 16:39 ` Linus Torvalds
2005-08-03 16:42 ` Linus Torvalds
2005-08-03 17:12 ` Hugh Dickins
2005-08-03 23:03 ` Nick Piggin
2005-08-04 14:14 ` Alexander Nyberg
2005-08-04 14:30 ` Nick Piggin
2005-08-04 15:00 ` Alexander Nyberg
2005-08-04 15:35 ` Hugh Dickins
2005-08-04 16:32 ` Russell King
2005-08-04 15:36 ` Linus Torvalds
2005-08-04 16:29 ` Russell King
2005-08-03 10:24 ` Martin Schwidefsky
2005-08-03 11:57 ` Hugh Dickins
2005-08-02 16:44 ` Martin Schwidefsky
2005-08-01 15:42 ` Linus Torvalds
2005-08-01 18:18 ` Linus Torvalds
2005-08-03  8:24 ` Robin Holt
2005-08-03 11:31 ` Hugh Dickins
2005-08-04 11:48 ` Robin Holt
2005-08-04 13:04 ` Hugh Dickins
2005-08-01 19:29 ` Hugh Dickins
2005-08-01 19:48 ` Linus Torvalds
2005-08-02  8:07 ` Martin Schwidefsky
2005-08-01 19:57 ` Andrew Morton
2005-08-01 20:16 ` Linus Torvalds
2005-08-02  0:14 ` Nick Piggin
2005-08-02  1:27 ` Nick Piggin
2005-08-02  3:45 ` Linus Torvalds
2005-08-02  4:25 ` Nick Piggin
2005-08-02  4:35 ` Linus Torvalds
2005-08-01 20:03 ` Hugh Dickins
2005-08-01 20:12 ` Andrew Morton
2005-08-01 20:26 ` Linus Torvalds
2005-08-01 20:51 ` Hugh Dickins