* MAP_NOZERO revisited

From: Arun Sharma <asharma@fb.com>
Date: 2012-01-05 0:37 UTC
To: linux-mm
Cc: Davide Libenzi, Johannes Weiner, Balbir Singh

A few years ago, Davide posted patches to address clear_page() showing
up high in kernel profiles:

http://thread.gmane.org/gmane.linux.kernel/548928

With malloc implementations that try to conserve RSS by madvising away
unused dirty (i.e. already faulted-in) pages, we pay a high cost in
clear_page() when such a page is needed again by the same process.

Now that we have memcgs with their own LRU lists, I was thinking of a
MAP_NOZERO implementation that avoids zeroing a page if it is coming
from the same memcg.

This will probably need an extra PCG_* flag maintaining state about
whether the page was moved between memcgs since its last use.

Security implications: this is not as good as the UID-based checks in
Davide's implementation, so it should probably be opt-in instead of
enabled by default.

Comments?

 -Arun
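The allocator pattern Arun describes, trimming RSS with
madvise(MADV_DONTNEED) and then paying for clear_page() on every
refault, looks roughly like the following userspace sketch (a minimal
illustration only; the arena size and all names are made up for this
example):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define ARENA_SIZE (4 * 1024 * 1024)

int main(void)
{
	/* A malloc implementation typically carves user allocations
	 * out of a large anonymous mapping like this one. */
	char *arena = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (arena == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Use the memory: every page touched here becomes dirty and
	 * is counted in RSS. */
	memset(arena, 0xAB, ARENA_SIZE);

	/* On free(), conserve RSS by letting the kernel drop the
	 * backing pages; the mapping itself stays valid. */
	if (madvise(arena, ARENA_SIZE, MADV_DONTNEED) == -1)
		perror("madvise");

	/* A later allocation reuses the same range: each touch now
	 * takes a minor fault, and the kernel zeroes the fresh page
	 * in clear_page(). That zeroing is the cost at issue. */
	memset(arena, 0xCD, ARENA_SIZE);

	munmap(arena, ARENA_SIZE);
	return 0;
}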
* Re: MAP_NOZERO revisited

From: KAMEZAWA Hiroyuki
Date: 2012-01-05 7:23 UTC
To: Arun Sharma
Cc: linux-mm, Davide Libenzi, Johannes Weiner, Balbir Singh

On Wed, 4 Jan 2012 16:37:13 -0800 Arun Sharma <asharma@fb.com> wrote:

> Now that we have memcgs with their own LRU lists, I was thinking of a
> MAP_NOZERO implementation that avoids zeroing a page if it is coming
> from the same memcg.
>
> This will probably need an extra PCG_* flag maintaining state about
> whether the page was moved between memcgs since its last use.

When pages are freed, they go back to the global page allocator; memcg
has no hooks into page alloc/free. We on the memcg side are also trying
to reduce the size of page_cgroup and remove page_cgroup->flags, with
the eventual goal of integrating it into struct page. So I don't like
this idea very much; please find another way.

> Security implications: this is not as good as the UID-based checks in
> Davide's implementation, so it should probably be opt-in instead of
> enabled by default.

I think you need another page allocator, as hugetlb.c has, and need to
maintain a 'page pool'.

Thanks,
-Kame
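Kame's hugetlb-style 'page pool' suggestion amounts to recycling pages
within a single owner, so zeroing is only required when a page crosses
an ownership boundary. A minimal userspace model of that idea (all
names here are hypothetical; nothing mirrors actual kernel code):

#include <stdlib.h>

struct pooled_page {
	struct pooled_page *next;
	void *mem;
};

struct page_pool {
	int owner_id;			/* stands in for a memcg identity */
	struct pooled_page *free_list;
};

/* Return a page to its owner's pool instead of a global allocator.
 * No zeroing is needed because it never changes owner. */
static void pool_free(struct page_pool *pool, struct pooled_page *pg)
{
	pg->next = pool->free_list;
	pool->free_list = pg;
}

/* Allocate on behalf of 'owner_id'. A page recycled from the matching
 * pool may keep its stale contents; anything from the shared fallback
 * path must be zeroed before a different owner can see it. */
static void *pool_alloc(struct page_pool *pool, int owner_id, size_t sz)
{
	if (owner_id == pool->owner_id && pool->free_list) {
		struct pooled_page *pg = pool->free_list;
		pool->free_list = pg->next;
		return pg->mem;		/* stale contents are acceptable */
	}
	return calloc(1, sz);		/* shared path: handed out zeroed */
}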
* MAP_UNINITIALIZED (Was Re: MAP_NOZERO revisited)

From: Arun Sharma <asharma@fb.com>
Date: 2012-01-11 18:50 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm, Davide Libenzi, Johannes Weiner, Balbir Singh

On Thu, Jan 05, 2012 at 04:23:11PM +0900, KAMEZAWA Hiroyuki wrote:
> When pages are freed, they go back to the global page allocator; memcg
> has no hooks into page alloc/free.

I missed this part. Thanks for reminding me.

> We on the memcg side are also trying to reduce the size of page_cgroup
> and remove page_cgroup->flags, with the eventual goal of integrating
> it into struct page. So I don't like this idea very much; please find
> another way.

Thinking about it a bit more, it may be possible to implement this
without page_cgroup->flags, using
mm_match_cgroup(current->mm, page->mem_cgroup).

> I think you need another page allocator, as hugetlb.c has, and need to
> maintain a 'page pool'.

That sounds like a bigger change. All I need is a way of computing "was
this page previously mapped into the current cgroup?" without affecting
allocator performance. I'm thinking this more relaxed check is
sufficient for many real-world use cases.

I also realized that I could use MAP_UNINITIALIZED for this purpose.
Attached is a completely insecure patch, which may be interesting for
embedded use cases on CPUs with an MMU. Yeah, the VM_SAO hack is ugly;
any better suggestions?

 -Arun

commit 37b83f3fb77a177a2f81ebb8aeaec28c2a46e503
Author: Arun Sharma <asharma@fb.com>
Date:   Tue Jan 10 17:02:46 2012 -0800

    mm: Enable MAP_UNINITIALIZED for archs with mmu

    This enables malloc optimizations where we might
    madvise(..,MADV_DONTNEED) a page only to fault it
    back at a different virtual address.

    Signed-off-by: Arun Sharma <asharma@fb.com>

diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
index 787abbb..71e079f 100644
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -19,11 +19,7 @@
 #define MAP_TYPE	0x0f		/* Mask for type of mapping */
 #define MAP_FIXED	0x10		/* Interpret addr exactly */
 #define MAP_ANONYMOUS	0x20		/* don't use a file */
-#ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
-# define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be uninitialized */
-#else
-# define MAP_UNINITIALIZED 0x0		/* Don't support this flag */
-#endif
+#define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be uninitialized */
 
 #define MS_ASYNC	1		/* sync memory asynchronously */
 #define MS_INVALIDATE	2		/* invalidate the caches */
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 3a93f73..04d838e 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -156,6 +156,11 @@ __alloc_zeroed_user_highpage(gfp_t movableflags,
 	struct page *page = alloc_page_vma(GFP_HIGHUSER | movableflags,
 			vma, vaddr);
 
+#ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
+	if (!vma->vm_file && vma->vm_flags & VM_UNINITIALIZED)
+		return page;
+#endif
+
 	if (page)
 		clear_user_highpage(page, vaddr);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4baadd1..6345c57 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -118,6 +118,8 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
 #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
+#define VM_UNINITIALIZED VM_SAO		/* Steal a powerpc bit for now, since
+					   we're out of bits for 32 bit archs */
 
 /* Bits set in the VMA until the stack is in its final location */
 #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 51647b4..f7d4f60 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -88,6 +88,7 @@ calc_vm_flag_bits(unsigned long flags)
 	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
 	       _calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
 	       _calc_vm_trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE) |
+	       _calc_vm_trans(flags, MAP_UNINITIALIZED, VM_UNINITIALIZED) |
 	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    );
 }
 #endif /* __KERNEL__ */
diff --git a/init/Kconfig b/init/Kconfig
index 43298f9..428e047 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1259,7 +1259,7 @@ endchoice
 
 config MMAP_ALLOW_UNINITIALIZED
 	bool "Allow mmapped anonymous memory to be uninitialized"
-	depends on EXPERT && !MMU
+	depends on EXPERT
 	default n
 	help
 	  Normally, and according to the Linux spec, anonymous memory obtained
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c3fdbcb..e6dd642 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1868,6 +1868,12 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		put_mems_allowed();
 		return page;
 	}
+
+#ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
+	if (!vma->vm_file && vma->vm_flags & VM_UNINITIALIZED)
+		gfp &= ~__GFP_ZERO;
+#endif
+
 	/*
 	 * fast path: default or task policy
 	 */
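For context, the relaxed ownership check Arun floated earlier
(mm_match_cgroup() against the page's previous memcg) would
conceptually gate the zeroing decision in the same
__alloc_zeroed_user_highpage() path the patch touches. The sketch below
is an assumption, not part of the posted patch: page_memcg() is a
hypothetical helper here, and a kernel of this era would have to
recover the memcg through the page_cgroup lookup instead.

/* Hedged sketch: skip clear_user_highpage() only when the caller
 * opted in AND the page last belonged to this task's memcg, so no
 * data can leak across cgroup boundaries. */
static inline struct page *
__alloc_zeroed_user_highpage(gfp_t movableflags,
			     struct vm_area_struct *vma,
			     unsigned long vaddr)
{
	struct page *page = alloc_page_vma(GFP_HIGHUSER | movableflags,
					   vma, vaddr);

	if (page && (vma->vm_flags & VM_UNINITIALIZED) &&
	    mm_match_cgroup(current->mm, page_memcg(page)))
		return page;

	if (page)
		clear_user_highpage(page, vaddr);
	return page;
}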
* Re: MAP_UNINITIALIZED (Was Re: MAP_NOZERO revisited)

From: Balbir Singh
Date: 2012-01-12 5:10 UTC
To: Arun Sharma
Cc: KAMEZAWA Hiroyuki, linux-mm, Davide Libenzi, Johannes Weiner

> -#ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
> -# define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be uninitialized */
> -#else
> -# define MAP_UNINITIALIZED 0x0		/* Don't support this flag */
> -#endif
> +#define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be uninitialized */

You now define MAP_UNINITIALIZED unconditionally: are you referring to
not zeroing out pages before handing them down? Is this safe even
between threads?

> +#define VM_UNINITIALIZED VM_SAO		/* Steal a powerpc bit for now, since
> +					   we're out of bits for 32 bit archs */

Without proper checks for whether it can be re-used?

Balbir
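To make Balbir's safety question concrete: within a single process,
threads already share the whole address space, so handing back one of
the process's own recycled pages unzeroed reveals nothing new to a
sibling thread. The exposure is across processes (or cgroups). The
probe below is hypothetical in that it needs a kernel carrying the
patch above; on a stock kernel, anonymous mappings always read back as
zeroes.

#include <stdio.h>
#include <sys/mman.h>

#define MAP_UNINITIALIZED 0x4000000	/* value from the patch above */
#define LEN (16 * 1024 * 1024)

int main(void)
{
	size_t i, nonzero = 0;
	unsigned char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS |
				MAP_UNINITIALIZED, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	for (i = 0; i < LEN; i++)
		nonzero += (p[i] != 0);

	/* Any nonzero byte came from a previous user of the page:
	 * exactly the cross-process leak being asked about. */
	printf("nonzero bytes: %zu\n", nonzero);
	return 0;
}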
* Re: MAP_UNINITIALIZED (Was Re: MAP_NOZERO revisited)

From: Arun Sharma <asharma@fb.com>
Date: 2012-01-12 18:16 UTC
To: Balbir Singh
Cc: KAMEZAWA Hiroyuki, linux-mm, Davide Libenzi, Johannes Weiner

On 1/11/12 9:10 PM, Balbir Singh wrote:
> You now define MAP_UNINITIALIZED unconditionally: are you referring to
> not zeroing out pages before handing them down? Is this safe even
> between threads?

If that behavior doesn't work for an app, the app shouldn't be asking
for it via an mmap flag. Only calloc() specifies that the returned
memory will be zeroed; there is no such guarantee for malloc().

> Without proper checks for whether it can be re-used?

Yeah, this is a complete hack. I'm trying to convince people that the
idea is viable before asking for a vm_flags bit.

Microbenchmark data:

# time -p ./test2
real 7.60
user 0.78
sys 6.81

# time -p ./test2 xx
real 4.40
user 0.78
sys 3.62

# cat test2.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define MMAP_SIZE	(20 * 1024 * 1024)
#define PAGE_SIZE	4096
#define MAP_UNINITIALIZED 0x4000000

int main(int argc, char *argv[])
{
	char *p, *end;
	void *addr;
	int flag = 0;
	int i;

	/* Any command line argument enables MAP_UNINITIALIZED. */
	if (argc > 1)
		flag = MAP_UNINITIALIZED;

	addr = mmap(NULL, MMAP_SIZE, PROT_READ | PROT_WRITE,
		    flag | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		exit(-1);
	}
	end = (char *) addr + MMAP_SIZE;

	for (i = 0; i < 1000; i++) {
		/* Discard all the pages, then fault each one back in. */
		if (madvise(addr, MMAP_SIZE, MADV_DONTNEED) == -1)
			perror("madvise");
		for (p = (char *) addr; p < end; p += PAGE_SIZE)
			*p = 0xAB;
	}
	return 0;
}

 -Arun
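For anyone reproducing the numbers: the benchmark builds with a stock
toolchain (the invocation below is assumed rather than taken from the
thread), and the second run only behaves differently on a kernel with
the patch applied. Note that the entire saving is system time, 6.81s
down to 3.62s, roughly 47%, consistent with clear_page() dominating the
refault path.

gcc -O2 -o test2 test2.c
time -p ./test2		# default: pages zeroed on each refault
time -p ./test2 xx	# any argument requests MAP_UNINITIALIZED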
Thread overview: 5 messages
  2012-01-05  0:37  MAP_NOZERO revisited (Arun Sharma)
  2012-01-05  7:23  ` Re: MAP_NOZERO revisited (KAMEZAWA Hiroyuki)
  2012-01-11 18:50  ` MAP_UNINITIALIZED (Was Re: MAP_NOZERO revisited) (Arun Sharma)
  2012-01-12  5:10  ` Re: MAP_UNINITIALIZED (Balbir Singh)
  2012-01-12 18:16  ` Re: MAP_UNINITIALIZED (Arun Sharma)