Re: [PATCH] munmap() don't check sysctl_max_mapcount

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
To: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: kosaki.motohiro@jp.fujitsu.com,
	LKML <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH] munmap() don't check sysctl_max_mapcount
Date: Mon, 12 Oct 2009 19:13:41 +0900 (JST)	[thread overview]
Message-ID: <20091012184654.E4D0.A69D9226@jp.fujitsu.com> (raw)
In-Reply-To: <Pine.LNX.4.64.0910091007010.17240@sister.anvils>

[-- Attachment #1: Type: text/plain, Size: 3326 bytes --]

Hi Hugh

> > Hi everyone,
> > 
> > Is this good idea?
> 
> Sorry, I overlooked this earlier.
> 
> Correct me if I'm wrong, but from the look of your patch,
> I believe anyone could increase their mapcount arbitrarily far beyond
> sysctl_max_map_count, just by taking little bites out of a large mmap.
> 
> In which case there's not much point in having sysctl_max_map_count
> at all.  Perhaps there isn't much point to it anyway, and the answer
> is just to raise it to where it catches runaways but interferes with
> nobody else?
> 
> If you change your patch so that do_munmap() cannot increase the final
> number vmas beyond sysctl_max_map_count, that would seem reasonable.
> But would that satisfy your testcase?  And does the testcase really
> matter in practice?  It doesn't seem to have upset anyone before.

Very thank you for payed attention to my patch. Yes, this is real issue.
my customer suffer from it.

May I explain why you haven't seen this issue? this issue is architecture
independent problem. however register stack architecture (e.g. ia64, sparc)
dramatically increase the possibility of the heppen this issue.

Why? the stack of the typical architecture have one PROT_READ|PROT_WRITE area and one
PROT_NONE area (it is called as guard page).
if the process multiple thread, stack layout is

fig1)
  +--------+--------+-------+-------+--------------
  | thr-0  | thr-0  | thr-1 | thr-1 | 
  | guard  | stack  | guard | stack | ......
  +--------+--------+-------+-------+-------------

Thus, stack freeing didn't make vma splitting. in the other hand,
the stack of the register stack architecture have two PROT_READ|PROT_WRITE
area and one PROT_NONE area.
then, stack layout is

fig2)
  +-----------+--------+------------------+-------+------------------+-----------
  | thr-0     | thr-0  | thr-0 stack      | thr-1 | thr-1 stack      | thr-2 
  | reg-stack | guard  | + thr-1 regstack | guard | + thr-2 regstack | guard
  +-----------+--------+------------------+-------+------------------+-----------

Then, stack unmapping may cause vma splitting.

However, non-register stack architecture don't free from this issue.
if the program use malloc(), it can makes fig2 layout. please run attached
test program (test_max_mapcount_for_86.c), it should addressed the problem.


And, I doubt I haven't catch your mention. May I ask some question?
Honestly I don't think max_map_count is important knob. it is strange
mutant of limit of virtual address space in the process.
At very long time ago (probably the stone age), linux doesn't have
vma rb_tree handling, then many vma directly cause find_vma slow down.
However current linux have good scalability. it can handle many vma issue.
So, Why do you think max_mapcount sould be strictly keeped?

Honestly, I doubt nobody suffer from removing sysctl_max_mapcount.


And yes, stack unmapping have exceptional charactatics. the guard zone
gurantee it never raise map_count. 
So, I think the attached patch (0001-Don-t...) is the same as you talked about, right?
I can accept it. I haven't test it on ia64. however, at least it works
well on x86.


BUT, I still think kernel souldn't refuse any resource deallocation.
otherwise, people discourage proper resource deallocation and encourage
brutal intentional memory leak programming style. What do you think?


[-- Attachment #2: test_max_mapcount_for_86.c --]
[-- Type: application/octet-stream, Size: 1333 bytes --]

  #include<stdio.h>
  #include<stdlib.h>
  #include<string.h>
  #include<pthread.h>
  #include<errno.h>
  #include<unistd.h>
 
  #define THREAD_NUM 30000
  #define MAL_SIZE (8*1024*1024)
 
 void *wait_thread(void *args)
 {
 	void *addr;
 
 	addr = malloc(MAL_SIZE);

#if 0
 	if(addr)
 		memset(addr, 1, MAL_SIZE);
#endif
 	sleep(10);
 
 	return NULL;
 }
 
 void *wait_thread2(void *args)
 {
 	sleep(60);
 
 	return NULL;
 }
 
 int main(int argc, char *argv[])
 {
 	int i;
 	pthread_t thread[THREAD_NUM], th;
 	int ret, count = 0;
 	pthread_attr_t attr;
 
 	ret = pthread_attr_init(&attr);
 	if(ret) {
 		perror("pthread_attr_init");
 	}
 
 	ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
 	if(ret) {
 		perror("pthread_attr_setdetachstate");
 	}
 
 	for (i = 0; i < THREAD_NUM; i++) {
 		ret = pthread_create(&th, &attr, wait_thread, NULL);
 		if(ret) {
 			fprintf(stderr, "[%d] ", count);
 			perror("pthread_create");
 		} else {
 			printf("[%d] create OK.\n", count);
 		}
 		count++;
 
 		ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
 		if(ret) {
 			fprintf(stderr, "[%d] ", count);
 			perror("pthread_create");
 		} else {
 			printf("[%d] create OK.\n", count);
 		}
 		count++;
 	}
 
 	sleep(3600);
 	return 0;
 }

[-- Attachment #3: 0001-Don-t-make-ENOMEM-temporary-mapcount-exceeding-in-mu.patch --]
[-- Type: application/octet-stream, Size: 5413 bytes --]

From c77fd11f579dc497b692501fca8a12ad8a943a20 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Mon, 12 Oct 2009 18:34:20 +0900
Subject: [PATCH] Don't make ENOMEM temporary mapcount exceeding in munmap()

On ia64, the following test program exit abnormally, because
glibc thread library called abort().

 ========================================================
 (gdb) bt
 #0  0xa000000000010620 in __kernel_syscall_via_break ()
 #1  0x20000000003208e0 in raise () from /lib/libc.so.6.1
 #2  0x2000000000324090 in abort () from /lib/libc.so.6.1
 #3  0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
 #4  0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
 #5  0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
 ========================================================

The fact is, glibc call munmap() when thread exitng time for freeing stack, and
it assume munlock() never fail. However, munmap() often make vma splitting
and it with many mapcount make -ENOMEM.

Oh well, that's crazy, because stack unmapping never increase mapcount.
The maxcount exceeding is only temporary. internal temporary exceeding
shouldn't make ENOMEM.

This patch does it.

 test_max_mapcount.c
 ==================================================================
  #include<stdio.h>
  #include<stdlib.h>
  #include<string.h>
  #include<pthread.h>
  #include<errno.h>
  #include<unistd.h>

  #define THREAD_NUM 30000
  #define MAL_SIZE (8*1024*1024)

 void *wait_thread(void *args)
 {
 	void *addr;

 	addr = malloc(MAL_SIZE);
 	sleep(10);

 	return NULL;
 }

 void *wait_thread2(void *args)
 {
 	sleep(60);

 	return NULL;
 }

 int main(int argc, char *argv[])
 {
 	int i;
 	pthread_t thread[THREAD_NUM], th;
 	int ret, count = 0;
 	pthread_attr_t attr;

 	ret = pthread_attr_init(&attr);
 	if(ret) {
 		perror("pthread_attr_init");
 	}

 	ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
 	if(ret) {
 		perror("pthread_attr_setdetachstate");
 	}

 	for (i = 0; i < THREAD_NUM; i++) {
 		ret = pthread_create(&th, &attr, wait_thread, NULL);
 		if(ret) {
 			fprintf(stderr, "[%d] ", count);
 			perror("pthread_create");
 		} else {
 			printf("[%d] create OK.\n", count);
 		}
 		count++;

 		ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
 		if(ret) {
 			fprintf(stderr, "[%d] ", count);
 			perror("pthread_create");
 		} else {
 			printf("[%d] create OK.\n", count);
 		}
 		count++;
 	}

 	sleep(3600);
 	return 0;
 }
 ==================================================================

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/mmap.c |   38 ++++++++++++++++++++++++++++++--------
 1 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 865830d..41ab576 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1830,10 +1830,10 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 /*
- * Split a vma into two pieces at address 'addr', a new vma is allocated
- * either for the first part or the tail.
+ * __split_vma() bypasses sysctl_max_map_count checking.  We use this on the
+ * munmap path where it doesn't make sense to fail.
  */
-int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
+static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	      unsigned long addr, int new_below)
 {
 	struct mempolicy *pol;
@@ -1843,9 +1843,6 @@ int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 					~(huge_page_mask(hstate_vma(vma)))))
 		return -EINVAL;
 
-	if (mm->map_count >= sysctl_max_map_count)
-		return -ENOMEM;
-
 	new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 	if (!new)
 		return -ENOMEM;
@@ -1885,6 +1882,19 @@ int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	return 0;
 }
 
+/*
+ * Split a vma into two pieces at address 'addr', a new vma is allocated
+ * either for the first part or the tail.
+ */
+int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
+	      unsigned long addr, int new_below)
+{
+	if (mm->map_count >= sysctl_max_map_count)
+		return -ENOMEM;
+
+	return __split_vma(mm, vma, addr, new_below);
+}
+
 /* Munmap is split into 2 main parts -- this part which finds
  * what needs doing, and the areas themselves, which do the
  * work.  This now handles partial unmappings.
@@ -1920,7 +1930,19 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 	 * places tmp vma above, and higher split_vma places tmp vma below.
 	 */
 	if (start > vma->vm_start) {
-		int error = split_vma(mm, vma, start, 0);
+		int error;
+
+		/*
+		 * If vma need to split both head and tail of the vma,
+		 * mapcount can exceed max_mapcount. Otherwise, the
+		 * exceeding is tmporary. later detach_vmas_to_be_unmapped()
+		 * decrement mapcount less than max_mapcount. we don't need
+		 * to care it.
+		 */
+		if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
+			return -ENOMEM;
+
+		error = __split_vma(mm, vma, start, 0);
 		if (error)
 			return error;
 		prev = vma;
@@ -1929,7 +1951,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 	/* Does it split the last one? */
 	last = find_vma(mm, end);
 	if (last && end > last->vm_start) {
-		int error = split_vma(mm, last, end, 1);
+		int error = __split_vma(mm, last, end, 1);
 		if (error)
 			return error;
 	}
-- 
1.6.2.5

next prev parent reply	other threads:[~2009-10-12 10:13 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-10-02  9:07 KOSAKI Motohiro
2009-10-09  9:28 ` Hugh Dickins
2009-10-12 10:13   ` KOSAKI Motohiro [this message]
2009-10-12 15:04     ` Hugh Dickins
2009-10-13  0:53       ` KAMEZAWA Hiroyuki
2009-10-13  2:50       ` KOSAKI Motohiro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20091012184654.E4D0.A69D9226@jp.fujitsu.com \
    --to=kosaki.motohiro@jp.fujitsu.com \
    --cc=akpm@linux-foundation.org \
    --cc=hugh.dickins@tiscali.co.uk \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox