linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] include private data mappings in RLIMIT_DATA limit
@ 2007-07-10  0:43 Herbert van den Bergh
  2007-07-10  0:54 ` Dave McCracken
  0 siblings, 1 reply; 9+ messages in thread
From: Herbert van den Bergh @ 2007-07-10  0:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm; +Cc: akpm, Dave McCracken, Chris Mason

This patch changes how the kernel limits memory allocations that are
controlled by setrlimit() using the RLIMIT_DATA resource.  Currently the
kernel can limit the size of a process data, bss, and heap, but does not
limit the amount of process private memory that can be allocated with
the mmap() call.  Since malloc() calls mmap() to allocate process memory
once brk() calls can no longer grow the heap, process private memory
allocations cannot be limited with RLIMIT_DATA.

With this patch, not only memory in the data segment of a process, but
also private data mappings, both file-based and anonymous, are counted
toward the RLIMIT_DATA resource limit.  Executable mappings, such as
text segments of shared objects, are not counted toward the private data
limit.  The result is that malloc() will fail once the combined size of
the data segment and private data mappings reaches this limit.

This brings the Linux behavior in line with what is documented in the
POSIX man page for setrlimit(3p).

Signed-off-by: Herbert van den Bergh (herbert.van.den.bergh@oracle.com)
Signed-off-by: Dave McCracken (dave.mccracken@oracle.com)

--- linux-2.6.22/fs/proc/task_mmu.c.orig	2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22/fs/proc/task_mmu.c	2007-07-09 14:41:50.000000000 -0700
@@ -31,7 +31,7 @@ char *task_mem(struct mm_struct *mm, cha
 	if (hiwater_rss < mm->hiwater_rss)
 		hiwater_rss = mm->hiwater_rss;
 
-	data = mm->total_vm - mm->shared_vm - mm->stack_vm;
+	data = mm->total_vm - mm->shared_vm - mm->stack_vm - mm->exec_vm;
 	text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
 	lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
 	buffer += sprintf(buffer,
--- linux-2.6.22/mm/mprotect.c.orig	2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22/mm/mprotect.c	2007-07-09 14:41:50.000000000 -0700
@@ -162,6 +162,22 @@ mprotect_fixup(struct vm_area_struct *vm
 		}
 	}
 
+	/* 
+	 * Check against private data size limit. When execute permission
+	 * is removed, the mapping is counted toward the private data size.
+	 */
+	if (!(newflags & VM_EXEC) && (vma->vm_flags & VM_EXEC)) {
+		unsigned long rlim;
+		rlim = current->signal->rlim[RLIMIT_DATA].rlim_cur;
+		if (rlim < RLIM_INFINITY) {
+			unsigned long priv_vm;
+			priv_vm = mm->total_vm - mm->shared_vm - mm->exec_vm;
+			priv_vm <<= PAGE_SHIFT;
+			if (priv_vm + (end - start) > rlim)
+				return -ENOMEM;
+		}
+	}
+
 	/*
 	 * First try to merge with previous and/or next vma.
 	 */
--- linux-2.6.22/mm/mremap.c.orig	2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22/mm/mremap.c	2007-07-09 14:41:50.000000000 -0700
@@ -343,6 +343,23 @@ unsigned long do_mremap(unsigned long ad
 		goto out;
 	}
 
+	/*
+	 * Check against private data size limit. Count memory allocations 
+	 * which are not shared and not executable code as private data.
+	 */
+	if (!(vma->vm_flags & (VM_EXEC|VM_SHARED))) {
+		unsigned long rlim;
+		rlim = current->signal->rlim[RLIMIT_DATA].rlim_cur;
+		if (rlim < RLIM_INFINITY) {
+			unsigned long priv_vm;
+			struct mm_struct *mm = current->mm;
+			priv_vm = mm->total_vm - mm->shared_vm - mm->exec_vm;
+			priv_vm <<= PAGE_SHIFT;
+			if (priv_vm + (new_len - old_len) > rlim)
+				goto out;
+		}
+	}
+
 	if (vma->vm_flags & VM_ACCOUNT) {
 		charged = (new_len - old_len) >> PAGE_SHIFT;
 		if (security_vm_enough_memory(charged))
--- linux-2.6.22/mm/mmap.c.orig	2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22/mm/mmap.c	2007-07-09 14:41:50.000000000 -0700
@@ -245,16 +245,6 @@ asmlinkage unsigned long sys_brk(unsigne
 	if (brk < mm->end_code)
 		goto out;
 
-	/*
-	 * Check against rlimit here. If this check is done later after the test
-	 * of oldbrk with newbrk then it can escape the test and let the data
-	 * segment grow beyond its set limit the in case where the limit is
-	 * not page aligned -Ram Gupta
-	 */
-	rlim = current->signal->rlim[RLIMIT_DATA].rlim_cur;
-	if (rlim < RLIM_INFINITY && brk - mm->start_data > rlim)
-		goto out;
-
 	newbrk = PAGE_ALIGN(brk);
 	oldbrk = PAGE_ALIGN(mm->brk);
 	if (oldbrk == newbrk)
@@ -267,6 +257,19 @@ asmlinkage unsigned long sys_brk(unsigne
 		goto out;
 	}
 
+	/*
+	 * Check against private data size limit. Count memory allocations 
+	 * which are not shared and not executable code as private data.
+	 */
+	rlim = current->signal->rlim[RLIMIT_DATA].rlim_cur;
+	if (rlim < RLIM_INFINITY) {
+		unsigned long priv_vm;
+		priv_vm = mm->total_vm - mm->shared_vm - mm->exec_vm;
+		priv_vm <<= PAGE_SHIFT;
+		if (priv_vm + (newbrk-oldbrk) > rlim)
+			goto out;
+	}
+
 	/* Check against existing mmap mappings. */
 	if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE))
 		goto out;
@@ -875,7 +878,8 @@ void vm_stat_account(struct mm_struct *m
 		= VM_STACK_FLAGS & (VM_GROWSUP|VM_GROWSDOWN);
 
 	if (file) {
-		mm->shared_vm += pages;
+		if (flags & VM_SHARED)
+			mm->shared_vm += pages;
 		if ((flags & (VM_EXEC|VM_WRITE)) == VM_EXEC)
 			mm->exec_vm += pages;
 	} else if (flags & stack_flags)
@@ -1041,6 +1045,21 @@ munmap_back:
 	if (!may_expand_vm(mm, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
+	/*
+	 * Check against private data size limit. Count memory allocations 
+	 * which are not shared and not executable code as private data.
+	 */
+	if (!(vm_flags & (VM_SHARED|VM_EXEC))) {
+		unsigned long rlim, priv_vm;
+		rlim = current->signal->rlim[RLIMIT_DATA].rlim_cur;
+		if (rlim < RLIM_INFINITY) {
+			priv_vm = mm->total_vm - mm->shared_vm - mm->exec_vm;
+			priv_vm <<= PAGE_SHIFT;
+			if (priv_vm + len > rlim)
+				return -ENOMEM;
+		}
+	}
+
 	if (accountable && (!(flags & MAP_NORESERVE) ||
 			    sysctl_overcommit_memory == OVERCOMMIT_NEVER)) {
 		if (vm_flags & VM_SHARED) {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] include private data mappings in RLIMIT_DATA limit
  2007-07-10  0:43 [PATCH] include private data mappings in RLIMIT_DATA limit Herbert van den Bergh
@ 2007-07-10  0:54 ` Dave McCracken
  2007-07-10 16:58   ` Hugh Dickins
  0 siblings, 1 reply; 9+ messages in thread
From: Dave McCracken @ 2007-07-10  0:54 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm

On Monday 09 July 2007, Herbert van den Bergh wrote:
> With this patch, not only memory in the data segment of a process, but
> also private data mappings, both file-based and anonymous, are counted
> toward the RLIMIT_DATA resource limit.  Executable mappings, such as
> text segments of shared objects, are not counted toward the private data
> limit.  The result is that malloc() will fail once the combined size of
> the data segment and private data mappings reaches this limit.
>
> This brings the Linux behavior in line with what is documented in the
> POSIX man page for setrlimit(3p).

I believe this patch is a simple and obvious fix to a hole introduced when 
libc malloc() began using mmap() instead of brk().  We took away the ability 
to control how much data space processes could soak up.  This patch returns 
that control to the user.

Dave McCracken

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] include private data mappings in RLIMIT_DATA limit
  2007-07-10  0:54 ` Dave McCracken
@ 2007-07-10 16:58   ` Hugh Dickins
  2007-07-10 17:19     ` Dave McCracken
  0 siblings, 1 reply; 9+ messages in thread
From: Hugh Dickins @ 2007-07-10 16:58 UTC (permalink / raw)
  To: Dave McCracken; +Cc: herbert.van.den.bergh, akpm, linux-kernel, linux-mm

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2309 bytes --]

On Mon, 9 Jul 2007, Dave McCracken wrote:
> On Monday 09 July 2007, Herbert van den Bergh wrote:
> > With this patch, not only memory in the data segment of a process, but
> > also private data mappings, both file-based and anonymous, are counted
> > toward the RLIMIT_DATA resource limit.  Executable mappings, such as
> > text segments of shared objects, are not counted toward the private data
> > limit.  The result is that malloc() will fail once the combined size of
> > the data segment and private data mappings reaches this limit.
> >
> > This brings the Linux behavior in line with what is documented in the
> > POSIX man page for setrlimit(3p).

Which says malloc() can fail from it, but conspicuously not that mmap()
can fail from it: unlike the RLIMIT_AS case.  Would we be better off?

> 
> I believe this patch is a simple and obvious fix to a hole introduced when 
> libc malloc() began using mmap() instead of brk().

But didn't libc start doing that many years ago?  Wouldn't that have
been the time for such a patch rather than now: when it can only break
apps that are currently working?

> We took away the ability 
> to control how much data space processes could soak up.  This patch returns 
> that control to the user.

I remember thinking that the idea of data ulimit had become obsolete
(just preserved for compatibility) back when mmap() got invented and
used for dynamic libraries.  I think that's when they brought in
RLIMIT_AS, something which could make sense in the mmap() world.

This patch does give it more meaning.  But if we are prepared to
take the risk of breaking things in this way (I think not but don't
mind being corrected), it would be more accurate to take writability
into account, and use that quantity (sadly not stored in mm_struct,
would have to be added) which we do security_vm_enough_memory() upon,
which totals up into /proc/meminfo's Committed_AS.

That change to /proc/PID/status VmData: 
-	data = mm->total_vm - mm->shared_vm - mm->stack_vm;
+	data = mm->total_vm - mm->shared_vm - mm->stack_vm - mm->exec_vm;
looks plausible, but isn't exec_vm already counted as shared_vm,
so now being doubly subtracted?  Besides which, we wouldn't want
to change those numbers again without consulting Albert.

Hugh

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] include private data mappings in RLIMIT_DATA limit
  2007-07-10 16:58   ` Hugh Dickins
@ 2007-07-10 17:19     ` Dave McCracken
  2007-07-10 18:06       ` Hugh Dickins
  0 siblings, 1 reply; 9+ messages in thread
From: Dave McCracken @ 2007-07-10 17:19 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: herbert.van.den.bergh, akpm, linux-kernel, linux-mm

On Tuesday 10 July 2007, Hugh Dickins wrote:
> On Mon, 9 Jul 2007, Dave McCracken wrote:
> > On Monday 09 July 2007, Herbert van den Bergh wrote:
> > > With this patch, not only memory in the data segment of a process, but
> > > also private data mappings, both file-based and anonymous, are counted
> > > toward the RLIMIT_DATA resource limit.  Executable mappings, such as
> > > text segments of shared objects, are not counted toward the private
> > > data limit.  The result is that malloc() will fail once the combined
> > > size of the data segment and private data mappings reaches this limit.
> > >
> > > This brings the Linux behavior in line with what is documented in the
> > > POSIX man page for setrlimit(3p).
>
> Which says malloc() can fail from it, but conspicuously not that mmap()
> can fail from it: unlike the RLIMIT_AS case.  Would we be better off?

True.  But keep in mind that when POSIX was written mmap() was new and shiny 
and pretty much only used for shared mappings, and definitely not used by 
malloc().

> > I believe this patch is a simple and obvious fix to a hole introduced
> > when libc malloc() began using mmap() instead of brk().
>
> But didn't libc start doing that many years ago?  Wouldn't that have
> been the time for such a patch rather than now: when it can only break
> apps that are currently working?

Yes, probably.  But it got missed.

> > We took away the ability
> > to control how much data space processes could soak up.  This patch
> > returns that control to the user.
>
> I remember thinking that the idea of data ulimit had become obsolete
> (just preserved for compatibility) back when mmap() got invented and
> used for dynamic libraries.  I think that's when they brought in
> RLIMIT_AS, something which could make sense in the mmap() world.
>
> This patch does give it more meaning.  But if we are prepared to
> take the risk of breaking things in this way (I think not but don't
> mind being corrected), it would be more accurate to take writability
> into account, and use that quantity (sadly not stored in mm_struct,
> would have to be added) which we do security_vm_enough_memory() upon,
> which totals up into /proc/meminfo's Committed_AS.

Given that RLIMIT_DATA is pretty much meaningless in current kernels, I would 
put forward the argument that this change is extremely unlikely to break 
anything because no one is currently setting it to anything other than 
unlimited.  Adding this feature would give administrators another tool, a way 
to control the private data size of a process without restricting its ability 
to attach to large shared mappings.

> That change to /proc/PID/status VmData:
> -	data = mm->total_vm - mm->shared_vm - mm->stack_vm;
> +	data = mm->total_vm - mm->shared_vm - mm->stack_vm - mm->exec_vm;
> looks plausible, but isn't exec_vm already counted as shared_vm,
> so now being doubly subtracted?  Besides which, we wouldn't want
> to change those numbers again without consulting Albert.

As I recall, this was added after Herbert discovered that exec_vm is not 
counted as shared_vm.  It's actually mapped as private/readonly.

> Hugh


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] include private data mappings in RLIMIT_DATA limit
  2007-07-10 17:19     ` Dave McCracken
@ 2007-07-10 18:06       ` Hugh Dickins
  2007-07-10 19:12         ` Dave McCracken
  0 siblings, 1 reply; 9+ messages in thread
From: Hugh Dickins @ 2007-07-10 18:06 UTC (permalink / raw)
  To: Dave McCracken; +Cc: herbert.van.den.bergh, akpm, linux-kernel, linux-mm

On Tue, 10 Jul 2007, Dave McCracken wrote:
> On Tuesday 10 July 2007, Hugh Dickins wrote:
> > > >
> > > > This brings the Linux behavior in line with what is documented in the
> > > > POSIX man page for setrlimit(3p).
> >
> > Which says malloc() can fail from it, but conspicuously not that mmap()
> > can fail from it: unlike the RLIMIT_AS case.  Would we be better off?
> 
> True.  But keep in mind that when POSIX was written mmap() was new and shiny 
> and pretty much only used for shared mappings, and definitely not used by 
> malloc().

Well, my bookmark is to SUSv3, which I think is equivalent these days?
And that specifically says malloc() or mmap() in the RLIMIT_AS case,
but only malloc() in the RLIMIT_DATA case.  We're wrong either way.

> Given that RLIMIT_DATA is pretty much meaningless in current kernels, I would 
> put forward the argument that this change is extremely unlikely to break 
> anything because no one is currently setting it to anything other than 
> unlimited.  Adding this feature would give administrators another tool, a way 
> to control the private data size of a process without restricting its ability 
> to attach to large shared mappings.

That may be a good argument (though "extremely unlikely to break"s
have a nasty habit of biting).  I'd still say that the contribution
to Committed_AS is more appropriate and more useful here.

> > That change to /proc/PID/status VmData:
> > -	data = mm->total_vm - mm->shared_vm - mm->stack_vm;
> > +	data = mm->total_vm - mm->shared_vm - mm->stack_vm - mm->exec_vm;
> > looks plausible, but isn't exec_vm already counted as shared_vm,
> > so now being doubly subtracted?  Besides which, we wouldn't want
> > to change those numbers again without consulting Albert.
> 
> As I recall, this was added after Herbert discovered that exec_vm is not 
> counted as shared_vm.  It's actually mapped as private/readonly.

Mapped private readonly yes, but vm_stat_account() says
	if (file) {
		mm->shared_vm += pages;
		if ((flags & (VM_EXEC|VM_WRITE)) == VM_EXEC)
			mm->exec_vm += pages;

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] include private data mappings in RLIMIT_DATA limit
  2007-07-10 18:06       ` Hugh Dickins
@ 2007-07-10 19:12         ` Dave McCracken
  2007-07-10 19:42           ` Hugh Dickins
  0 siblings, 1 reply; 9+ messages in thread
From: Dave McCracken @ 2007-07-10 19:12 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: herbert.van.den.bergh, akpm, linux-kernel, linux-mm

On Tuesday 10 July 2007, Hugh Dickins wrote:
> On Tue, 10 Jul 2007, Dave McCracken wrote:
> > Given that RLIMIT_DATA is pretty much meaningless in current kernels, I
> > would put forward the argument that this change is extremely unlikely to
> > break anything because no one is currently setting it to anything other
> > than unlimited.  Adding this feature would give administrators another
> > tool, a way to control the private data size of a process without
> > restricting its ability to attach to large shared mappings.
>
> That may be a good argument (though "extremely unlikely to break"s
> have a nasty habit of biting).  I'd still say that the contribution
> to Committed_AS is more appropriate and more useful here.

You may be right... I suppose everything will bite someone somewhere with a 
sufficiently large user base.

As for whether Committed_AS is more appropriate, I'll have to defer to Herbert 
on this one.  He stated that RLIMIT_DATA no longer does what it was intended 
to do, and offered a fix for it, and I agreed with him.  I do believe his 
patch does a reasonable approximation of the original intent of RLIMIT_DATA, 
but I didn't delve into the actual intended use of it once it's fixed.

> > > That change to /proc/PID/status VmData:
> > > -	data = mm->total_vm - mm->shared_vm - mm->stack_vm;
> > > +	data = mm->total_vm - mm->shared_vm - mm->stack_vm - mm->exec_vm;
> > > looks plausible, but isn't exec_vm already counted as shared_vm,
> > > so now being doubly subtracted?  Besides which, we wouldn't want
> > > to change those numbers again without consulting Albert.
> >
> > As I recall, this was added after Herbert discovered that exec_vm is not
> > counted as shared_vm.  It's actually mapped as private/readonly.
>
> Mapped private readonly yes, but vm_stat_account() says
> 	if (file) {
> 		mm->shared_vm += pages;
> 		if ((flags & (VM_EXEC|VM_WRITE)) == VM_EXEC)
> 			mm->exec_vm += pages;

In that code shared_vm includes everything that's mmap()ed, including private 
mappings.  But if you look at Herbert's patch he has the following change:

        if (file) {
-               mm->shared_vm += pages;
+               if (flags & VM_SHARED)
+                       mm->shared_vm += pages;
                if ((flags & (VM_EXEC|VM_WRITE)) == VM_EXEC)
                        mm->exec_vm += pages;

This means that shared_vm now is truly only memory that's mapped VM_SHARED and 
does not include VM_EXEC memory.  That necessitates the separate subtraction 
of exec_vm in the data calculations.

Dave McCracken

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] include private data mappings in RLIMIT_DATA limit
  2007-07-10 19:12         ` Dave McCracken
@ 2007-07-10 19:42           ` Hugh Dickins
  2007-07-10 20:09             ` Herbert van den Bergh
  0 siblings, 1 reply; 9+ messages in thread
From: Hugh Dickins @ 2007-07-10 19:42 UTC (permalink / raw)
  To: Dave McCracken; +Cc: herbert.van.den.bergh, akpm, linux-kernel, linux-mm

On Tue, 10 Jul 2007, Dave McCracken wrote:
> On Tuesday 10 July 2007, Hugh Dickins wrote:
> >
> > Mapped private readonly yes, but vm_stat_account() says
> > 	if (file) {
> > 		mm->shared_vm += pages;
> > 		if ((flags & (VM_EXEC|VM_WRITE)) == VM_EXEC)
> > 			mm->exec_vm += pages;
> 
> In that code shared_vm includes everything that's mmap()ed, including private 
> mappings.  But if you look at Herbert's patch he has the following change:
> 
>         if (file) {
> -               mm->shared_vm += pages;
> +               if (flags & VM_SHARED)
> +                       mm->shared_vm += pages;
>                 if ((flags & (VM_EXEC|VM_WRITE)) == VM_EXEC)
>                         mm->exec_vm += pages;
> 
> This means that shared_vm now is truly only memory that's mapped VM_SHARED and 
> does not include VM_EXEC memory.  That necessitates the separate subtraction 
> of exec_vm in the data calculations.

Ah, I just noticed at the beginning of the patch, and didn't look for
a balancing change - thanks.  I'd strongly recommend that he not mess
around with these numbers, unless there's _very_ good reason: they're
not ideal, nothing ever will be, changing them around just causes pain.
shared_vm may not be a full description of what it counts, but it'll
do until you've a better name (readonly mappings share with the file
even when they're private).

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] include private data mappings in RLIMIT_DATA limit
  2007-07-10 19:42           ` Hugh Dickins
@ 2007-07-10 20:09             ` Herbert van den Bergh
  2007-07-13 19:18               ` Hugh Dickins
  0 siblings, 1 reply; 9+ messages in thread
From: Herbert van den Bergh @ 2007-07-10 20:09 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Dave McCracken, akpm, linux-kernel, linux-mm


Hugh Dickins wrote:
> On Tue, 10 Jul 2007, Dave McCracken wrote:
>> On Tuesday 10 July 2007, Hugh Dickins wrote:
>>> Mapped private readonly yes, but vm_stat_account() says
>>> 	if (file) {
>>> 		mm->shared_vm += pages;
>>> 		if ((flags & (VM_EXEC|VM_WRITE)) == VM_EXEC)
>>> 			mm->exec_vm += pages;
>> In that code shared_vm includes everything that's mmap()ed, including private 
>> mappings.  But if you look at Herbert's patch he has the following change:
>>
>>         if (file) {
>> -               mm->shared_vm += pages;
>> +               if (flags & VM_SHARED)
>> +                       mm->shared_vm += pages;
>>                 if ((flags & (VM_EXEC|VM_WRITE)) == VM_EXEC)
>>                         mm->exec_vm += pages;
>>
>> This means that shared_vm now is truly only memory that's mapped VM_SHARED and 
>> does not include VM_EXEC memory.  That necessitates the separate subtraction 
>> of exec_vm in the data calculations.
> 
> Ah, I just noticed at the beginning of the patch, and didn't look for
> a balancing change - thanks.  I'd strongly recommend that he not mess
> around with these numbers, unless there's _very_ good reason: they're
> not ideal, nothing ever will be, changing them around just causes pain.
> shared_vm may not be a full description of what it counts, but it'll
> do until you've a better name (readonly mappings share with the file
> even when they're private).

The result of counting only VM_SHARED file mappings in shared_vm is that
it no longer includes MAP_PRIVATE file mappings.  These file mappings are
shared until they are written to, at which point they become private.
The vm doesn't currently track this change from shared to private,
so either way the accounting is off by a bit.  But the intention of the
private mapping is to not share the page with other processes, so I think
the right thing to do is to charge the process with this memory usage as
private memory.  I am continuing to make the assumption that the code
already made that all VM_EXEC mappings are also shared.  But one thing
that has changed is that when a mapping is no longer VM_EXEC, it is also
no longer counted as shared, but as private, and charged to the data size.

This has generally little effect on /proc/pid/stats VmData.

One of the motivations for this change was to make it easier for a
system administrator to manage processes that could allocate large
amounts of private memory, either through malloc() or through mmap(),
possibly due to programming errors.  As Dave also pointed out, if these
processes need to be able to attach to large shared memory segments, such
as a database buffer cache, then attempting to control the memory usage of
these processes with RLIMIT_AS becomes very tricky.  The sysadmin may set
the RLIMIT_AS to the current buffer cache shared memory segment size of
say 8GB plus 200M for process private allocations, but the next day the
DBA decides to increase the shared memory segment to 10GB, and wonders
why his processes won't start.  Now, with RLIMIT_DATA being effective again,
all that is needed is to set RLIMIT_DATA to a reasonable value for the 
application.

I don't anticipate that a lot of people need to be concerned about the 
effect of this resource limit change.  By default RLIMIT_DATA is set to
unlimited.  For those who do want to set it, they will need to understand 
what it does, just like any other resource limit.

Perhaps an update of the man page is in order as well.

Thanks,
Herbert.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] include private data mappings in RLIMIT_DATA limit
  2007-07-10 20:09             ` Herbert van den Bergh
@ 2007-07-13 19:18               ` Hugh Dickins
  0 siblings, 0 replies; 9+ messages in thread
From: Hugh Dickins @ 2007-07-13 19:18 UTC (permalink / raw)
  To: Herbert van den Bergh
  Cc: Dave McCracken, Albert Cahalan, akpm, linux-kernel, linux-mm

On Tue, 10 Jul 2007, Herbert van den Bergh wrote:
> Hugh Dickins wrote:
> > On Tue, 10 Jul 2007, Dave McCracken wrote:
> >> On Tuesday 10 July 2007, Hugh Dickins wrote:
> >>> Mapped private readonly yes, but vm_stat_account() says
> >>> 	if (file) {
> >>> 		mm->shared_vm += pages;
> >>> 		if ((flags & (VM_EXEC|VM_WRITE)) == VM_EXEC)
> >>> 			mm->exec_vm += pages;
> >> In that code shared_vm includes everything that's mmap()ed, including private 
> >> mappings.  But if you look at Herbert's patch he has the following change:
> >>
> >>         if (file) {
> >> -               mm->shared_vm += pages;
> >> +               if (flags & VM_SHARED)
> >> +                       mm->shared_vm += pages;
> >>                 if ((flags & (VM_EXEC|VM_WRITE)) == VM_EXEC)
> >>                         mm->exec_vm += pages;
> >>
> >> This means that shared_vm now is truly only memory that's mapped VM_SHARED and 
> >> does not include VM_EXEC memory.  That necessitates the separate subtraction 
> >> of exec_vm in the data calculations.

Actually, that wasn't quite right, was it?  Though it's normally the
case that a VM_EXEC mapping is not VM_SHARED, that cannot be relied upon.

> The result of counting only VM_SHARED file mappings in shared_vm is that
> it no longer includes MAP_PRIVATE file mappings.  These file mappings are
> shared until they are written to, at which point they become private.
> The vm doesn't currently track this change from shared to private,

Well, it doesn't track the change at that instant, but VM_ACCOUNT does
track those private mappings which may be written to (or have been
written to earlier).  So it's the size of those mappings I believe
you'd best be counting for your DATA checks: the size that we keep
checking in calls to security_vm_enough_memory() also.

When I "cat /proc/self/maps" I see a lot of little readonly mappings:
you'll be counting those towards RLIMIT_DATA, whereas VM_ACCOUNT won't.

> so either way the accounting is off by a bit.  But the intention of the
> private mapping is to not share the page with other processes, so I think
> the right thing to do is to charge the process with this memory usage as
> private memory.  I am continuing to make the assumption that the code
> already made that all VM_EXEC mappings are also shared.  But one thing

I don't recall such an assumption; and I haven't worked out how your
version of vm_stat_account() can still increment exec_vm depending on
VM_EXEC only within its "if (file)" block, yet your VM_EXEC tests when
checking RLIM_DATA not consider file at all.

> that has changed is that when a mapping is no longer VM_EXEC, it is also
> no longer counted as shared, but as private, and charged to the data size.
> 
> This has generally little effect on /proc/pid/stats VmData.

Your instincts are good: you want to put existing fields of mm_struct
to better use without adding more clutter; and you want RLIMIT_DATA
to be a limit on the number shown as "VmData" in /proc/pid/status.
I applaud both choices.  But two pieces of history make me uneasy.

shared_vm and exec_vm, which you're depending upon in your tests,
were hacked up (no disrespect to wli) for 2.6.9, to give plausible
numbers in those /proc/pid/status fields, without the horribly
intensive and cache-destroying page scan which was used to calculate
them before.  They've served their purpose well, but the kernel has
never used them for decisions before - our main concern has been
that the numbers shown don't ever get to wrap negative.

Whereas userspace may have grown used to the values shown there,
and be alarmed by changes: Albert Cahalan (procps maintainer) has
protested in the past at the way we change these things around:
I don't think we can unilaterally change them again without his
consent.

> 
> One of the motivations for this change was to make it easier for a
> system administrator to manage processes that could allocate large
> amounts of private memory, either through malloc() or through mmap(),
> possibly due to programming errors.  As Dave also pointed out, if these
> processes need to be able to attach to large shared memory segments, such
> as a database buffer cache, then attempting to control the memory usage of
> these processes with RLIMIT_AS becomes very tricky.  The sysadmin may set
> the RLIMIT_AS to the current buffer cache shared memory segment size of
> say 8GB plus 200M for process private allocations, but the next day the
> DBA decides to increase the shared memory segment to 10GB, and wonders
> why his processes won't start.  Now, with RLIMIT_DATA being effective again,
> all that is needed is to set RLIMIT_DATA to a reasonable value for the 
> application.

Fair enough.

> 
> I don't anticipate that a lot of people need to be concerned about the 
> effect of this resource limit change.  By default RLIMIT_DATA is set to
> unlimited.  For those who do want to set it, they will need to understand 
> what it does, just like any other resource limit.

Yes, I think Dave persuaded me that the existing RLIMIT_DATA
is too old-fashioned to be useful to people, and that therefore
few will be inconvenienced by this change - though it will need
documenting in "release notes", and we need to make this possibility
of breakage very clear to Andrew and to Linus, who may veto it.

> 
> Perhaps an update of the man page is in order as well.

Surely (though it's going to be hard to explain it exactly
in a few words, whichever metric is used).

I don't want to keep on going back and forth in discussion.  I think
I need to prepare a patch using that vm_enough_memory charge as the
RLIMIT_DATA target, but showing old and your and my VmData in
/proc/pid/status, to see what those numbers look like across a
range of processes.  We could always display an additional VmAcct
or some such, but it would be a shame not to have it named VmData,
and it would be tiresome explaining the difference to people ever after.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2007-07-13 19:18 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-07-10  0:43 [PATCH] include private data mappings in RLIMIT_DATA limit Herbert van den Bergh
2007-07-10  0:54 ` Dave McCracken
2007-07-10 16:58   ` Hugh Dickins
2007-07-10 17:19     ` Dave McCracken
2007-07-10 18:06       ` Hugh Dickins
2007-07-10 19:12         ` Dave McCracken
2007-07-10 19:42           ` Hugh Dickins
2007-07-10 20:09             ` Herbert van den Bergh
2007-07-13 19:18               ` Hugh Dickins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox