* [RFC PATCH for -mm 0/5] mlock return value rework
From: KOSAKI Motohiro @ 2008-08-11  7:01 UTC
To: linux-mm, Andrew Morton, Lee Schermerhorn, Rik van Riel; +Cc: kosaki.motohiro
patch against: 2.6.27-rc1-mm1
Halesh Sadashiv reported a regression in the mlock() return values in
2.6.23.17:

	http://marc.info/?l=linux-kernel&m=121749977015397&w=2

That is already fixed, but the test case doesn't pass on the current -mm
tree, because the split-LRU patches introduce another regression.

So, this series reworks the mlock() return value behavior.
Lee-san, could you please review these patches?
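
For context, the kind of check Halesh's test performs can be sketched in a
few lines of userspace C. This is an illustrative reproducer written for
this summary, not the original test program: per SUSv3, mlock() on an
unmapped range must fail with ENOMEM, not EAGAIN and not a bare -1.

	#include <errno.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		long page = sysconf(_SC_PAGESIZE);
		/* Map a page, then unmap it, so the range is known invalid. */
		void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		munmap(p, page);

		if (mlock(p, page) == -1)
			printf("mlock: %s (expect ENOMEM, not EAGAIN)\n",
			       strerror(errno));
		return 0;
	}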
* [RFC PATCH for -mm 1/5] mlock() fix return values for mainline

From: KOSAKI Motohiro @ 2008-08-11  7:04 UTC
To: linux-mm, Andrew Morton, Lee Schermerhorn, Rik van Riel; +Cc: kosaki.motohiro

The following patch is the same as
http://marc.info/?l=linux-kernel&m=121750892930775&w=2 and is already in
Linus's tree.

However, it is not merged into 2.6.27-rc1-mm1, so please apply it first.

-----------------------------------------------

---
 mm/memory.c |   16 +++++++++++++---
 mm/mlock.c  |    2 --
 2 files changed, 13 insertions(+), 5 deletions(-)

Index: b/mm/memory.c
===================================================================
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2814,16 +2814,26 @@ int make_pages_present(unsigned long add

 	vma = find_vma(current->mm, addr);
 	if (!vma)
-		return -1;
+		return -ENOMEM;
 	write = (vma->vm_flags & VM_WRITE) != 0;
 	BUG_ON(addr >= end);
 	BUG_ON(end > vma->vm_end);
 	len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE;
 	ret = get_user_pages(current, current->mm, addr,
 			len, write, 0, NULL, NULL);
-	if (ret < 0)
+	if (ret < 0) {
+		/*
+		   SUS require strange return value to mlock
+		    - invalid addr generate to ENOMEM.
+		    - out of memory should generate EAGAIN.
+		 */
+		if (ret == -EFAULT)
+			ret = -ENOMEM;
+		else if (ret == -ENOMEM)
+			ret = -EAGAIN;
 		return ret;
-	return ret == len ? 0 : -1;
+	}
+	return ret == len ? 0 : -ENOMEM;
 }

 #if !defined(__HAVE_ARCH_GATE_AREA)
Index: b/mm/mlock.c
===================================================================
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -425,8 +425,6 @@ success:

 out:
 	*prev = vma;
-	if (ret == -ENOMEM)
-		ret = -EAGAIN;
 	return ret;
 }
* Re: [RFC PATCH for -mm 1/5] mlock() fix return values for mainline

From: Lee Schermerhorn @ 2008-08-12 20:39 UTC
To: KOSAKI Motohiro; +Cc: linux-mm, Andrew Morton, Rik van Riel

On Mon, 2008-08-11 at 16:04 +0900, KOSAKI Motohiro wrote:
> The following patch is the same as
> http://marc.info/?l=linux-kernel&m=121750892930775&w=2 and is already
> in Linus's tree.
>
> However, it is not merged into 2.6.27-rc1-mm1, so please apply it
> first.

Kosaki-san:

make_pages_present() is called from other places than mlock[_fixup()].
However, I guess it's OK to put mlock()-specific behavior in
make_pages_present(), as all other callers [currently] ignore the return
value.  Is that your thinking?

Lee
* Re: [RFC PATCH for -mm 1/5] mlock() fix return values for mainline

From: KOSAKI Motohiro @ 2008-08-13  8:03 UTC
To: Lee Schermerhorn; +Cc: kosaki.motohiro, linux-mm, Andrew Morton, Rik van Riel

> make_pages_present() is called from other places than mlock[_fixup()].
> However, I guess it's OK to put mlock()-specific behavior in
> make_pages_present(), as all other callers [currently] ignore the
> return value.  Is that your thinking?

Yup, the others ignore it.
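
For readers following the errno logic above: the translation the 1/5 patch
applies can be restated as a standalone helper. This is an illustrative
sketch only -- the function name is hypothetical, and the real patch
open-codes the logic inside make_pages_present().

	#include <errno.h>	/* in-kernel this would be <linux/errno.h> */

	/*
	 * Hypothetical helper restating the patch's mapping:
	 * get_user_pages() reports a bad address as -EFAULT and an
	 * allocation failure as -ENOMEM, but SUSv3 wants mlock() to
	 * return ENOMEM for an invalid address and EAGAIN when the
	 * pages cannot be locked.
	 */
	static int mlock_posix_error(int gup_error)
	{
		switch (gup_error) {
		case -EFAULT:
			return -ENOMEM;	/* invalid address */
		case -ENOMEM:
			return -EAGAIN;	/* out of memory; caller may retry */
		default:
			return gup_error;	/* pass anything else through */
		}
	}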
* [RFC PATCH for -mm 2/5] related function comment fixes (optional)

From: KOSAKI Motohiro @ 2008-08-11  7:05 UTC
To: linux-mm, Andrew Morton, Lee Schermerhorn, Rik van Riel; +Cc: kosaki.motohiro

Currently, __mlock_vma_pages_range() has several wrong comments:
 - they say nothing about the mlock parameter
 - they say the write lock is required, but that is not true

The following patch fixes them.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

---
 mm/mlock.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

Index: b/mm/mlock.c
===================================================================
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -144,11 +144,18 @@ static void munlock_vma_page(struct page
 }

 /*
- * mlock a range of pages in the vma.
+ * mlock/munlock a range of pages in the vma.
  *
- * This takes care of making the pages present too.
+ * If @mlock==1, this takes care of making the pages present too.
  *
- * vma->vm_mm->mmap_sem must be held for write.
+ * @vma:   target vma
+ * @start: start address
+ * @end:   end address
+ * @mlock: 0 indicate munlock, otherwise mlock.
+ *
+ * return 0 if successed, otherwse return negative value.
+ *
+ * vma->vm_mm->mmap_sem must be held for read.
  */
 static int __mlock_vma_pages_range(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end,
* Re: [RFC PATCH for -mm 2/5] related function comment fixes (optional)

From: Lee Schermerhorn @ 2008-08-12 19:02 UTC
To: KOSAKI Motohiro; +Cc: linux-mm, Andrew Morton, Rik van Riel

On Mon, 2008-08-11 at 16:05 +0900, KOSAKI Motohiro wrote:
> Currently, __mlock_vma_pages_range() has several wrong comments:
>  - they say nothing about the mlock parameter
>  - they say the write lock is required, but that is not true
>
> The following patch fixes them.
[...]
> + * @vma:   target vma
> + * @start: start address
> + * @end:   end address
> + * @mlock: 0 indicate munlock, otherwise mlock.
> + *
> + * return 0 if successed, otherwse return negative value.

How about:

	return 0 on success, [negative] error number on error.

Or something like that.

> + *
> + * vma->vm_mm->mmap_sem must be held for read.
> + */
>  static int __mlock_vma_pages_range(struct vm_area_struct *vma,
>  			unsigned long start, unsigned long end,

Otherwise,

Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
* Re: [RFC PATCH for -mm 2/5] related function comment fixes (optional)

From: KOSAKI Motohiro @ 2008-08-13  8:37 UTC
To: Lee Schermerhorn; +Cc: kosaki.motohiro, linux-mm, Andrew Morton, Rik van Riel

> > + * @mlock: 0 indicate munlock, otherwise mlock.
> > + *
> > + * return 0 if successed, otherwse return negative value.
>
> How about:
>
>	return 0 on success, [negative] error number on error.

OK.  I'll fix it in the next post.

Thanks for the careful review!
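
If one applied Lee's suggested wording, a final form of the comment might
read as follows. This is an editorial sketch in conventional kernel-doc
style -- an assumption, not text that appears anywhere in the thread:

	/**
	 * __mlock_vma_pages_range() - mlock/munlock a range of pages in the vma
	 * @vma:   target vma
	 * @start: start address
	 * @end:   end address
	 * @mlock: 0 indicates munlock, otherwise mlock
	 *
	 * If @mlock == 1, this takes care of making the pages present too.
	 *
	 * Returns 0 on success, negative error number on error.
	 *
	 * vma->vm_mm->mmap_sem must be held for read.
	 */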
* [RFC PATCH for -mm 3/5] kill unnecessary locked_vm adjustment

From: KOSAKI Motohiro @ 2008-08-11  7:06 UTC
To: linux-mm, Andrew Morton, Lee Schermerhorn, Rik van Riel; +Cc: kosaki.motohiro

Now, __mlock_vma_pages_range() never returns a positive value, so the
locked_vm adjustment code is unnecessary.

Related comments are fixed as well.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

---
 mm/mlock.c |   18 +++++-------------
 mm/mmap.c  |   10 +++++-----
 2 files changed, 10 insertions(+), 18 deletions(-)

Index: b/mm/mlock.c
===================================================================
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -276,7 +276,7 @@ int mlock_vma_pages_range(struct vm_area
 		unsigned long start, unsigned long end)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	int nr_pages = (end - start) / PAGE_SIZE;
+	int error = 0;
 	BUG_ON(!(vma->vm_flags & VM_LOCKED));

 	/*
@@ -289,8 +289,7 @@ int mlock_vma_pages_range(struct vm_area
 			is_vm_hugetlb_page(vma) ||
 			vma == get_gate_vma(current))) {
 		downgrade_write(&mm->mmap_sem);
-		nr_pages = __mlock_vma_pages_range(vma, start, end, 1);
-
+		error = __mlock_vma_pages_range(vma, start, end, 1);
 		up_read(&mm->mmap_sem);
 		/* vma can change or disappear */
 		down_write(&mm->mmap_sem);
@@ -298,22 +297,19 @@ int mlock_vma_pages_range(struct vm_area
 		/* non-NULL vma must contain @start, but need to check @end */
 		if (!vma || end > vma->vm_end)
 			return -EAGAIN;
-		return nr_pages;
+		return error;
 	}

 	/*
 	 * User mapped kernel pages or huge pages:
 	 * make these pages present to populate the ptes, but
-	 * fall thru' to reset VM_LOCKED--no need to unlock, and
-	 * return nr_pages so these don't get counted against task's
-	 * locked limit.  huge pages are already counted against
-	 * locked vm limit.
+	 * fall thru' to reset VM_LOCKED--no need to unlock.
 	 */
 	make_pages_present(start, end);

 no_mlock:
 	vma->vm_flags &= ~VM_LOCKED;	/* and don't come back! */
-	return nr_pages;		/* pages NOT mlocked */
+	return error;			/* pages NOT mlocked */
 }


@@ -402,10 +398,6 @@ success:
 		downgrade_write(&mm->mmap_sem);

 		ret = __mlock_vma_pages_range(vma, start, end, 1);
-		if (ret > 0) {
-			mm->locked_vm -= ret;
-			ret = 0;
-		}
 		/*
 		 * Need to reacquire mmap sem in write mode, as our callers
 		 * expect this.  We have no support for atomically upgrading
Index: b/mm/mmap.c
===================================================================
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1229,10 +1229,10 @@ out:
 		/*
 		 * makes pages present; downgrades, drops, reacquires mmap_sem
 		 */
-		int nr_pages = mlock_vma_pages_range(vma, addr, addr + len);
-		if (nr_pages < 0)
-			return nr_pages;	/* vma gone! */
-		mm->locked_vm += (len >> PAGE_SHIFT) - nr_pages;
+		int error = mlock_vma_pages_range(vma, addr, addr + len);
+		if (error < 0)
+			return error;	/* vma gone! */
+		mm->locked_vm += (len >> PAGE_SHIFT);
 	} else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
 		make_pages_present(addr, addr + len);
 	return addr;
@@ -2087,7 +2087,7 @@ out:
 	if (flags & VM_LOCKED) {
 		int nr_pages = mlock_vma_pages_range(vma, addr, addr + len);
 		if (nr_pages >= 0)
-			mm->locked_vm += (len >> PAGE_SHIFT) - nr_pages;
+			mm->locked_vm += (len >> PAGE_SHIFT);
 	}
 	return addr;
 undo_charge:
* Re: [RFC PATCH for -mm 3/5] kill unnecessary locked_vm adjustment

From: Lee Schermerhorn @ 2008-08-12 19:55 UTC
To: KOSAKI Motohiro; +Cc: linux-mm, Andrew Morton, Rik van Riel

On Mon, 2008-08-11 at 16:06 +0900, KOSAKI Motohiro wrote:
> Now, __mlock_vma_pages_range() never returns a positive value, so the
> locked_vm adjustment code is unnecessary.

True, __mlock_vma_pages_range() does not return a positive value.  [It
didn't before this patch series, right?]  However, you are now counting
mlocked hugetlb pages and user mapped kernel pages against locked_vm--at
least in the mmap(MAP_LOCKED) path--even tho' we don't actually mlock().
Note that mlock[all]() will still avoid counting these pages in
mlock_fixup(), as I think it should.

Huge shm pages are already counted against user->locked_shm.  This patch
counts them against mm->locked_vm as well, if one mlock()s them.  But
locked_vm and locked_shm are compared to the memlock rlimit
independently, so we won't be double counting the huge pages against
either limit.  However, mlock()ed [not SHMLOCKed] hugetlb pages will now
be counted against the locked_vm limit and will reduce the amount of
non-shm memory that the task can lock [maybe not such a bad thing?].
Also, mlock()ed hugetlb pages will be included in the /proc/<pid>/status
"VmLck" element, even tho' they're not really mlocked and they don't
show up in the /proc/meminfo "Mlocked" count.

Similarly, mlock()ing a vm range backed by kernel pages--e.g.,
VM_RESERVED|VM_DONTEXPAND vmas--will show up in the VmLck status
element, but won't actually be mlocked nor counted in the Mlocked
meminfo field.  They will be counted against the task's locked vm limit.

So, I don't know whether to Ack or Nack this.  I guess it's no further
from reality than the current code.  But, I don't think you need this
one.  The code already differentiates between negative values as error
codes and non-negative values as an adjustment to locked_vm, so you
should be able to meet the standards-mandated error returns without this
patch.

Still thinking about this...

Lee
* Re: [RFC PATCH for -mm 3/5] kill unnecessary locked_vm adjustment

From: KOSAKI Motohiro @ 2008-08-13  9:37 UTC
To: Lee Schermerhorn; +Cc: kosaki.motohiro, linux-mm, Andrew Morton, Rik van Riel

Hi

> > Now, __mlock_vma_pages_range() never returns a positive value, so
> > the locked_vm adjustment code is unnecessary.
>
> True, __mlock_vma_pages_range() does not return a positive value.  [It
> didn't before this patch series, right?]

True.

> However, you are now counting mlocked hugetlb pages and user mapped
> kernel pages against locked_vm--at least in the mmap(MAP_LOCKED)
> path--even tho' we don't actually mlock().
[...]
> So, I don't know whether to Ack or Nack this.  I guess it's no further
> from reality than the current code.  But, I don't think you need this
> one.  The code already differentiates between negative values as error
> codes and non-negative values as an adjustment to locked_vm, so you
> should be able to meet the standards-mandated error returns without
> this patch.
>
> Still thinking about this...

I think...

In general, nobody wants a regression, and locked_vm has existed for a
long time.  So we shouldn't change the locked_vm behavior, even though
it is very strange.

In the Linus tree, locked_vm counts the amount of VM_LOCKED vma range,
not populated pages nor the number of physical pages of a locked vma.

Yes, the current Linus-tree locked_vm code has some strange behavior,
but if we want to change it, we should split that out from the split-LRU
patch series, IMHO.

So, I hope to remove the locked_vm adjustment code.  Am I missing the
point?
* Re: [RFC PATCH for -mm 3/5] kill unnecessary locked_vm adjustment

From: Lee Schermerhorn @ 2008-08-15 13:54 UTC
To: KOSAKI Motohiro; +Cc: linux-mm, Andrew Morton, Rik van Riel

On Wed, 2008-08-13 at 18:37 +0900, KOSAKI Motohiro wrote:
> In general, nobody wants a regression, and locked_vm has existed for a
> long time.  So we shouldn't change the locked_vm behavior, even though
> it is very strange.
[...]
> So, I hope to remove the locked_vm adjustment code.  Am I missing the
> point?

OK.

I'd like to drop this patch and keep the locked_vm adjustment.  It
should still handle the error conditions, as a negative 'adjustment'
from mlock_vma_pages_range() is handled correctly as an error code.

Now, why keep the adjustment?

First, I agree that locked_vm tracks the amount of locked Virtual
Memory, as the name indicates.  IMO, this is not a useful metric.  I
think what we want to track IS the amount of physical memory locked down
by the tasks.  However, this is difficult.  Consider that today, if one
has called, say, mlockall(MCL_FUTURE) and then you mmap() some large
segment into your address space more than once, you'll get charged for
the locked_vm each time you mmap() the same segment.  One can easily
exceed the rlimit, even when the actual amount of locked pages is less
than the limit.  But, as I say, this is difficult to fix efficiently and
I don't want to try to do that in the context of these patches.

Now, taking the view that locked_vm counts VM_LOCKED vmas, note that we
do not set VM_LOCKED for those vmas where it does not make sense to
mlock the pages with SetPageMlocked().  That is, the "special" vmas that
have the following flags: VM_IO, VM_PFNMAP, VM_RESERVED, VM_DONTEXPAND,
and VM_HUGETLB -- all of the vmas that we filter out in mlock_fixup().

Now, for vmas with VM_IO or VM_PFNMAP, we don't even attempt to
"make_pages_present()" because get_user_pages() will error out
immediately.  For the other types, we do make_pages_present() to avoid
future minor faults, but we don't set PageMlocked().  Because we don't
want to have to deal with these vmas during munmap() [including exit
processing] nor duplicate the vma filtering in that path, we don't set
the VM_LOCKED flag for these vmas.  Since the vma is not VM_LOCKED, we
don't want to count the memory as locked_vm.

For mlock[all](), we do the locked_vm accounting in mlock_fixup().  We
just don't count these vmas.  See the comments "don't set VM_LOCKED,
don't count" in the code.  However, mmap() does the locked_vm accounting
before calling mlock_vma_pages_range() [which replaces calls to
make_pages_present()].  To make mmap(MAP_LOCKED) behavior consistent
with mlock[all]() behavior, we return a "locked_vm" adjustment from
mlock_vma_pages_range().  For the filtered, special vmas, this is always
the size of the vma, which mmap() has already added to locked_vm.

So, I'd like to keep the adjustment returned by mlock_vma_pages_range().
However, as you note, the internal function __mlock_vma_pages_range()
always returns non-positive values [in earlier versions of this series
it could return a positive, non-zero value, but I decided that was not
appropriate].  It can, however, return an error code with your changes.
If we keep the locked_vm adjustment, as I propose here, then we need to
pass any return value from __mlock_vma_pages_range() via the "nr_pages"
variable, which can contain an error code [< 0] as well as a locked_vm
adjustment [>= 0].

IMO, we still have some issues in locked_vm accounting that need to be
addressed.  E.g., when you attach a hugetlb shm segment, it is
automatically counted as locked_shm [tracked independently of locked_vm,
in the user struct instead of the mm -- go figure!].  However, when we
mmap() a hugetlbfs file, we don't automatically add this to locked_vm.
Maybe we should, but not as part of this patch series, I think.

Does this make sense?

Maybe I need to explain the rationale better in the unevictable_lru
documentation.  I do discuss the behavior [locked_vm adjustment], but
maybe not enough rationale.

Lee
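
A concrete illustration of the double charging Lee describes above -- an
assumed scenario written for this summary, not code from the thread; the
file name is hypothetical and error handling is kept minimal:

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 512UL << 20;		/* a 512MB segment */
		int fd = open("big_segment", O_RDONLY);	/* hypothetical file */

		if (fd < 0)
			return 1;
		mlockall(MCL_CURRENT | MCL_FUTURE);	/* lock future mappings */

		/*
		 * Each mmap() of the same pages charges len >> PAGE_SHIFT
		 * to mm->locked_vm, so after two mappings locked_vm claims
		 * 1GB is locked even though both mappings share the same
		 * 512MB of physical memory -- and RLIMIT_MEMLOCK is checked
		 * against that inflated count.
		 */
		void *a = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
		void *b = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

		return (a == MAP_FAILED || b == MAP_FAILED) ? 1 : 0;
	}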
* Re: [RFC PATCH for -mm 3/5] kill unnecessary locked_vm adjustment

From: KOSAKI Motohiro @ 2008-08-18  9:23 UTC
To: Lee Schermerhorn; +Cc: kosaki.motohiro, linux-mm, Andrew Morton, Rik van Riel

Hi Lee-san,

Sorry for the late response.

I understand your intention.  I agree with almost all of your
explanation, but I have a few points of disagreement.

> I'd like to drop this patch and keep the locked_vm adjustment.  It
> should still handle the error conditions, as a negative 'adjustment'
> from mlock_vma_pages_range() is handled correctly as an error code.
>
> Now, why keep the adjustment?
>
> First, I agree that locked_vm tracks the amount of locked Virtual
> Memory, as the name indicates.  IMO, this is not a useful metric.  I
> think what we want to track IS the amount of physical memory locked
> down by the tasks.

Hmmm..  Sorry, disagreed.  Various applications assume the current
behavior.

So, I would like to add a "locked physical memory" metric, not replace
the existing one.

> However, this is difficult.  Consider that today, if one has called,
> say, mlockall(MCL_FUTURE) and then you mmap() some large segment into
> your address space more than once, you'll get charged for the
> locked_vm each time you mmap() the same segment.  One can easily
> exceed the rlimit, even when the actual amount of locked pages is less
> than the limit.  But, as I say, this is difficult to fix efficiently
> and I don't want to try to do that in the context of these patches.

Agreed.  This is mmap()'s bug.

> Now, taking the view that locked_vm counts VM_LOCKED vmas, note that
> we do not set VM_LOCKED for those vmas where it does not make sense to
> mlock the pages with SetPageMlocked().
[...]
> To make mmap(MAP_LOCKED) behavior consistent with mlock[all]()
> behavior, we return a "locked_vm" adjustment from
> mlock_vma_pages_range().  For the filtered, special vmas, this is
> always the size of the vma, which mmap() has already added to
> locked_vm.

Agreed.  That is definitely a bug (and it isn't introduced by us).

But the current implementation is slightly bad.
mlock_vma_pages_range()'s prototype is:

	int mlock_vma_pages_range(struct vm_area_struct *vma,
			unsigned long start, unsigned long end)

==> the return type is int.

But sys_mlock()'s prototype is:

	asmlinkage long sys_mlock(unsigned long start, size_t len)

==> the len argument's type is size_t.

So, this adjustment code can overflow easily.

Maybe mlock_vma_pages_range()'s prototype should be changed as below:

	int mlock_vma_pages_range(struct vm_area_struct *vma,
			unsigned long start, unsigned long end,
			unsigned long *locked_pages)

But I think this should be done as another patch, because this bug isn't
introduced by us; it is a separate change.

So, I would prefer the following order of changes:

 1. remove the current adjustment code from the split-LRU patch series
 2. re-introduce it as another patch (of course, the overflow problem
    should be fixed)

> So, I'd like to keep the adjustment returned by
> mlock_vma_pages_range().
[...]
> IMO, we still have some issues in locked_vm accounting that need to be
> addressed.  E.g., when you attach a hugetlb shm segment, it is
> automatically counted as locked_shm [tracked independently of
> locked_vm, in the user struct instead of the mm -- go figure!].
> However, when we mmap() a hugetlbfs file, we don't automatically add
> this to locked_vm.  Maybe we should, but not as part of this patch
> series, I think.

Agreed.

> Does this make sense?
>
> Maybe I need to explain the rationale better in the unevictable_lru
> documentation.  I do discuss the behavior [locked_vm adjustment], but
> maybe not enough rationale.

Could you tell me your feeling about my concern?
* Re: [RFC PATCH for -mm 3/5] kill unnecessary locked_vm adjustment

From: Lee Schermerhorn @ 2008-08-18 20:56 UTC
To: KOSAKI Motohiro; +Cc: linux-mm, Andrew Morton, Rik van Riel

On Mon, 2008-08-18 at 18:23 +0900, KOSAKI Motohiro wrote:
> Hi Lee-san,
>
> Sorry for the late response.
>
> I understand your intention.  I agree with almost all of your
> explanation, but I have a few points of disagreement.
[...]
> Could you tell me your feeling about my concern?

OK, here's what I propose:  I'll regenerate the patches against the
current mmotm, such that they apply after the related patches.  After
the patch "mmap-handle-mlocked-pages-during-map-remap-unmap.patch",
we'll have the following three patches:

1) back out the locked_vm changes from this patch, so that we can apply
   them as a separate patch.  This #1 patch can [should] be folded into
   the "mmap-handle-mlocked-pages..." patch.  This is essentially your
   patch 3/5, reworked to fit into this location.

2) a new patch to handle the locked_vm accounting during mmap() to match
   mlock() behavior and the state of the VM_LOCKED flags.  Essentially,
   this will restore the current behavior, with a separate patch
   description.  I'll also change the nr_pages type to long to alleviate
   your concern about overflow.  [But note nr_pages != nr_bytes.]

3) your patch 4/5, to fix the error return of mlock_fixup() when the
   next vma changes during mlock() with the downgraded semaphore.

Next, I think that the Posix mlock() error return issue is separate from
the unevictable mlocked page handling, altho' it touches the same code.
So, I propose two more patches to apply atop 27-rc*-mmotm, as follows:

1) revert the change to make_pages_present() in memory.c and mlock.c
   that went into mainline to address the mlock() Posix error return.
   make_pages_present() is not just for mlock().  Altho' all other
   callers ignore the return value, we have a "better" place to put the
   error code translation in mlock.c now.

2) patch mlock.c in __mlock_vma_pages_range() [both versions] to
   translate the error return from get_user_pages()/
   make_pages_present() to the proper Posix value.  This patch will only
   affect mlock() error returns.

I have these patches cooking now, but will not have a chance to post
them today.  I'll send them out tomorrow.

Regards,
Lee
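
To make the overflow concern raised above concrete, here is a small
userspace arithmetic sketch, written for this summary (illustrative only;
it assumes 4KB pages and a 64-bit size_t, and the int conversion behavior
is what common ABIs do, not something the C standard guarantees):

	#include <limits.h>
	#include <stdio.h>

	int main(void)
	{
		/* A single mlock() call may span more than INT_MAX pages,
		 * since sys_mlock() takes its length as a size_t. */
		size_t len = ((size_t)INT_MAX + 1) * 4096;	/* an 8TB range */
		size_t nr_pages = len >> 12;			/* 4KB pages */

		/* Truncating the page count to int, as an int-returning
		 * prototype would, typically wraps to a negative value: */
		int truncated = (int)nr_pages;

		printf("nr_pages = %zu, as int = %d\n", nr_pages, truncated);
		return 0;
	}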
* [RFC PATCH for -mm 4/5] fix mlock return value at munmap race

From: KOSAKI Motohiro @ 2008-08-11  7:07 UTC
To: linux-mm, Andrew Morton, Lee Schermerhorn, Rik van Riel; +Cc: kosaki.motohiro

Currently, we call downgrade_write(&mm->mmap_sem) at the beginning of
mlock(); it increases mlock() scalability.

But if mlock() and munmap() race, we can find the vma gone.  In that
case, the kernel should return ENOMEM, because mlock() after munmap()
returns ENOMEM.  (In addition, EAGAIN means "please try again", but
calling mlock() again would just fail again.)

This problem is a theoretical issue.  I can't reproduce the vma
disappearing on my box, but it is better to fix it anyway.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

---
 mm/mlock.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: b/mm/mlock.c
===================================================================
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -296,7 +296,7 @@ int mlock_vma_pages_range(struct vm_area
 		vma = find_vma(mm, start);
 		/* non-NULL vma must contain @start, but need to check @end */
 		if (!vma || end > vma->vm_end)
-			return -EAGAIN;
+			return -ENOMEM;
 		return error;
 	}

@@ -410,7 +410,7 @@ success:
 		*prev = find_vma(mm, start);
 		/* non-NULL *prev must contain @start, but need to check @end */
 		if (!(*prev) || end > (*prev)->vm_end)
-			ret = -EAGAIN;
+			ret = -ENOMEM;
 	} else {
 		/*
 		 * TODO: for unlocking, pages will already be resident, so
* Re: [RFC PATCH for -mm 4/5] fix mlock return value at munmap race

From: Lee Schermerhorn @ 2008-08-12 20:19 UTC
To: KOSAKI Motohiro; +Cc: linux-mm, Andrew Morton, Rik van Riel

On Mon, 2008-08-11 at 16:07 +0900, KOSAKI Motohiro wrote:
> Currently, we call downgrade_write(&mm->mmap_sem) at the beginning of
> mlock(); it increases mlock() scalability.
>
> But if mlock() and munmap() race, we can find the vma gone.  In that
> case, the kernel should return ENOMEM, because mlock() after munmap()
> returns ENOMEM.  (In addition, EAGAIN means "please try again", but
> calling mlock() again would just fail again.)
>
> This problem is a theoretical issue.  I can't reproduce the vma
> disappearing on my box, but it is better to fix it anyway.

OK.

Reviewed-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
* [RFC PATCH for -mm 5/5] fix mlock return value for mm

From: KOSAKI Motohiro @ 2008-08-11  7:08 UTC
To: linux-mm, Andrew Morton, Lee Schermerhorn, Rik van Riel; +Cc: kosaki.motohiro

Currently, __mlock_vma_pages_range() ignores the return value of
__get_user_pages().  We shouldn't do that.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

---
 mm/mlock.c |   25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

Index: b/mm/mlock.c
===================================================================
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -165,8 +165,9 @@ static int __mlock_vma_pages_range(struc
 	unsigned long addr = start;
 	struct page *pages[16]; /* 16 gives a reasonable batch */
 	int nr_pages = (end - start) / PAGE_SIZE;
-	int ret;
+	int ret = 0;
 	int gup_flags = 0;
+	int ret2 = 0;

 	VM_BUG_ON(start & ~PAGE_MASK);
 	VM_BUG_ON(end & ~PAGE_MASK);
@@ -249,9 +250,23 @@ static int __mlock_vma_pages_range(struc
 		}
 	}

+	/*
+	   SUSv3 require following return value to mlock
+	    - invalid addr generate to ENOMEM.
+	    - out of memory generate EAGAIN.
+	 */
+	if (ret < 0) {
+		if (ret == -EFAULT)
+			ret2 = -ENOMEM;
+		else if (ret == -ENOMEM)
+			ret2 = -EAGAIN;
+		else
+			ret2 = ret;
+	}
+
 	lru_add_drain_all();	/* to update stats */

-	return 0;	/* count entire vma as locked_vm */
+	return ret2;	/* count entire vma as locked_vm */
 }

 #else /* CONFIG_UNEVICTABLE_LRU */
@@ -263,9 +278,11 @@ static int __mlock_vma_pages_range(struc
 			unsigned long start, unsigned long end,
 			int mlock)
 {
+	int ret = 0;
+
 	if (mlock && (vma->vm_flags & VM_LOCKED))
-		make_pages_present(start, end);
-	return 0;
+		ret = make_pages_present(start, end);
+	return ret;
 }
 #endif /* CONFIG_UNEVICTABLE_LRU */
* Re: [RFC PATCH for -mm 5/5] fix mlock return value for mm

From: KOSAKI Motohiro @ 2008-08-11  7:43 UTC
To: linux-mm, Andrew Morton, Lee Schermerhorn, Rik van Riel; +Cc: kosaki.motohiro

> Currently, __mlock_vma_pages_range() ignores the return value of
> __get_user_pages().  We shouldn't do that.

Oops, sorry.  I sent an older version; I'm resending it.

I definitely should learn the correct operation of quilt ;)

--------------------------------------------------------------
Currently, __mlock_vma_pages_range() ignores the return value of
__get_user_pages().  We shouldn't do that.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

---
 mm/mlock.c |   32 ++++++++++++++++++++++++--------
 1 file changed, 24 insertions(+), 8 deletions(-)

Index: b/mm/mlock.c
===================================================================
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -165,8 +165,9 @@ static int __mlock_vma_pages_range(struc
 	unsigned long addr = start;
 	struct page *pages[16]; /* 16 gives a reasonable batch */
 	int nr_pages = (end - start) / PAGE_SIZE;
-	int ret;
+	int ret = 0;
 	int gup_flags = 0;
+	int ret2 = 0;

 	VM_BUG_ON(start & ~PAGE_MASK);
 	VM_BUG_ON(end & ~PAGE_MASK);
@@ -249,9 +250,23 @@ static int __mlock_vma_pages_range(struc
 		}
 	}

+	/*
+	   SUSv3 require following return value to mlock
+	    - invalid addr generate to ENOMEM.
+	    - out of memory generate EAGAIN.
+	 */
+	if (ret < 0) {
+		if (ret == -EFAULT)
+			ret2 = -ENOMEM;
+		else if (ret == -ENOMEM)
+			ret2 = -EAGAIN;
+		else
+			ret2 = ret;
+	}
+
 	lru_add_drain_all();	/* to update stats */

-	return 0;	/* count entire vma as locked_vm */
+	return ret2;	/* count entire vma as locked_vm */
 }

 #else /* CONFIG_UNEVICTABLE_LRU */
@@ -263,9 +278,11 @@ static int __mlock_vma_pages_range(struc
 			unsigned long start, unsigned long end,
 			int mlock)
 {
+	int ret = 0;
+
 	if (mlock && (vma->vm_flags & VM_LOCKED))
-		make_pages_present(start, end);
-	return 0;
+		ret = make_pages_present(start, end);
+	return ret;
 }
 #endif /* CONFIG_UNEVICTABLE_LRU */

@@ -276,7 +293,6 @@ int mlock_vma_pages_range(struct vm_area
 		unsigned long start, unsigned long end)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	int error = 0;
 	BUG_ON(!(vma->vm_flags & VM_LOCKED));

 	/*
@@ -289,7 +305,7 @@ int mlock_vma_pages_range(struct vm_area
 			is_vm_hugetlb_page(vma) ||
 			vma == get_gate_vma(current))) {
 		downgrade_write(&mm->mmap_sem);
-		error = __mlock_vma_pages_range(vma, start, end, 1);
+		__mlock_vma_pages_range(vma, start, end, 1);
 		up_read(&mm->mmap_sem);
 		/* vma can change or disappear */
 		down_write(&mm->mmap_sem);
@@ -297,7 +313,7 @@ int mlock_vma_pages_range(struct vm_area
 		/* non-NULL vma must contain @start, but need to check @end */
 		if (!vma || end > vma->vm_end)
 			return -ENOMEM;
-		return error;
+		return 0;
 	}

 	/*
@@ -309,7 +325,7 @@ int mlock_vma_pages_range(struct vm_area

 no_mlock:
 	vma->vm_flags &= ~VM_LOCKED;	/* and don't come back! */
-	return error;	/* pages NOT mlocked */
+	return 0;	/* pages NOT mlocked */
 }
* Re: [RFC PATCH for -mm 5/5] fix mlock return value for mm

From: Lee Schermerhorn @ 2008-08-12 20:30 UTC
To: KOSAKI Motohiro; +Cc: linux-mm, Andrew Morton, Rik van Riel

On Mon, 2008-08-11 at 16:43 +0900, KOSAKI Motohiro wrote:
> > Currently, __mlock_vma_pages_range() ignores the return value of
> > __get_user_pages().  We shouldn't do that.
>
> Oops, sorry.  I sent an older version; I'm resending it.
>
> I definitely should learn the correct operation of quilt ;)
>
> --------------------------------------------------------------
> Currently, __mlock_vma_pages_range() ignores the return value of
> __get_user_pages().  We shouldn't do that.

Could you explain, in comments or patch description, why, after patching
__mlock_vma_pages_range() to return mlock()-appropriate values for
__get_user_pages() failures, you then ignore the return value of
__mlock_vma_pages_range() in mlock_vma_pages_range() [last 4 hunks]?  Is
it because mlock_vma_pages_range() is called from mmap(), mremap(),
etc., and not from mlock()?

Lee
* Re: [RFC PATCH for -mm 5/5] fix mlock return value for mm

From: KOSAKI Motohiro @ 2008-08-13  8:36 UTC
To: Lee Schermerhorn; +Cc: kosaki.motohiro, linux-mm, Andrew Morton, Rik van Riel

Hi

> Could you explain, in comments or patch description, why, after
> patching __mlock_vma_pages_range() to return mlock()-appropriate
> values for __get_user_pages() failures, you then ignore the return
> value of __mlock_vma_pages_range() in mlock_vma_pages_range() [last 4
> hunks]?  Is it because mlock_vma_pages_range() is called from mmap(),
> mremap(), etc., and not from mlock()?

Ah, OK.  I agree that my patch description was too short.

In the Linus-tree code, make_pages_present() is called from seven
places:

 - sys_remap_file_pages
 - mlock_fixup
 - mmap_region
 - find_extend_vma
 - do_brk
 - move_vma
 - do_mremap

and only mlock_fixup treats its return value.  In other words, the
Linus-tree policy is:

 - mmap, brk and mremap ignore errors from page population
 - mlock treats errors from page population

On the other hand, __mlock_vma_pages_range() is called from seven
places:

 - sys_remap_file_pages (via mlock_vma_pages_range)
 - mmap_region          (via mlock_vma_pages_range)
 - find_extend_vma      (via mlock_vma_pages_range)
 - do_brk               (via mlock_vma_pages_range)
 - move_vma             (via mlock_vma_pages_range)
 - do_mremap            (via mlock_vma_pages_range)
 - mlock_fixup

This is not a coincidence: __mlock_vma_pages_range() is intended as an
unevictable-LRU-aware make_pages_present().  So mlock_fixup() should not
ignore get_user_pages() errors from __mlock_vma_pages_range(), but the
other callers should ignore them.