* [rfc][patch 1/2] mm: dont account ZERO_PAGE
@ 2007-03-29 7:58 Nick Piggin
2007-03-29 7:58 ` [rfc][patch 2/2] mips: reinstate move_pte Nick Piggin
2007-03-29 13:10 ` [rfc][patch 1/2] mm: dont account ZERO_PAGE Hugh Dickins
0 siblings, 2 replies; 49+ messages in thread
From: Nick Piggin @ 2007-03-29 7:58 UTC (permalink / raw)
To: Andrew Morton, Hugh Dickins, Linus Torvalds,
Linux Memory Management List
Cc: tee, holt
Special-case the ZERO_PAGE to prevent it from being accounted like a normal
mapped page. This is not illogical or unclean, because the ZERO_PAGE is
heavily special cased through the page fault path.

This requires Carsten Otte's filemap_xip patch, as well as restoring the
move_pte function for MIPS which was removed after I noticed it didn't
handle the ZERO_PAGE accounting correctly (which is not an issue after
this patch).

A test-case which took over 2 hours to complete on a 1024 core Altix
takes around 2 seconds afterward.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -479,7 +479,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	pte = pte_mkold(pte);
 
 	page = vm_normal_page(vma, addr, pte);
-	if (page) {
+	if (likely(page && page != ZERO_PAGE(addr))) {
 		get_page(page);
 		page_dup_rmap(page);
 		rss[!!PageAnon(page)]++;
@@ -665,7 +665,7 @@ static unsigned long zap_pte_range(struc
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
 			tlb_remove_tlb_entry(tlb, pte, addr);
-			if (unlikely(!page))
+			if (unlikely(!page || page == ZERO_PAGE(addr)))
 				continue;
 			if (unlikely(details) && details->nonlinear_vma
 			    && linear_page_index(details->nonlinear_vma,
@@ -1125,9 +1125,6 @@ static int zeromap_pte_range(struct mm_s
 			pte++;
 			break;
 		}
-		page_cache_get(page);
-		page_add_file_rmap(page);
-		inc_mm_counter(mm, file_rss);
 		set_pte_at(mm, addr, pte, zero_pte);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
@@ -1629,7 +1626,7 @@ gotten:
 	 */
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (likely(pte_same(*page_table, orig_pte))) {
-		if (old_page) {
+		if (likely(old_page && old_page != ZERO_PAGE(address))) {
 			page_remove_rmap(old_page, vma);
 			if (!PageAnon(old_page)) {
 				dec_mm_counter(mm, file_rss);
@@ -1659,7 +1656,7 @@ gotten:
 	}
 	if (new_page)
 		page_cache_release(new_page);
-	if (old_page)
+	if (old_page && old_page != ZERO_PAGE(address))
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
@@ -2152,15 +2149,12 @@ static int do_anonymous_page(struct mm_s
 	} else {
 		/* Map the ZERO_PAGE - vm_page_prot is readonly */
 		page = ZERO_PAGE(address);
-		page_cache_get(page);
 		entry = mk_pte(page, vma->vm_page_prot);
 
 		ptl = pte_lockptr(mm, pmd);
 		spin_lock(ptl);
 		if (!pte_none(*page_table))
-			goto release;
-		inc_mm_counter(mm, file_rss);
-		page_add_file_rmap(page);
+			goto unlock;
 	}
 
 	set_pte_at(mm, address, page_table, entry);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [rfc][patch 2/2] mips: reinstate move_pte
2007-03-29 7:58 [rfc][patch 1/2] mm: dont account ZERO_PAGE Nick Piggin
@ 2007-03-29 7:58 ` Nick Piggin
2007-03-29 17:49 ` Linus Torvalds
2007-03-29 13:10 ` [rfc][patch 1/2] mm: dont account ZERO_PAGE Hugh Dickins
1 sibling, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2007-03-29 7:58 UTC (permalink / raw)
To: Andrew Morton, Hugh Dickins, Linus Torvalds,
Linux Memory Management List
Cc: tee, holt

Restore move_pte for MIPS, so that any given virtual address vaddr that maps
a ZERO_PAGE will map ZERO_PAGE(vaddr).

This has a circular dependency on the previous patch, which normally means
they belong in the same patch, but I thought this case is clearer if split out.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/include/asm-mips/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-mips/pgtable.h
+++ linux-2.6/include/asm-mips/pgtable.h
@@ -69,6 +69,16 @@ extern unsigned long zero_page_mask;
 #define ZERO_PAGE(vaddr) \
 	(virt_to_page((void *)(empty_zero_page + (((unsigned long)(vaddr)) & zero_page_mask))))
 
+#define __HAVE_ARCH_MOVE_PTE
+#define move_pte(pte, prot, old_addr, new_addr) \
+({ \
+	pte_t newpte = (pte); \
+	if (pte_present(pte) && \
+	    pte_pfn(pte) == page_to_pfn(ZERO_PAGE(old_addr))) \
+		newpte = mk_pte(ZERO_PAGE(new_addr), (prot)); \
+	newpte; \
+})
+
 extern void paging_init(void);
 
 /*
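The colouring that move_pte has to preserve can be modelled in userspace. The sketch below is only an illustration, not kernel code: the page size (4 KiB), the number of zero pages (8), and every toy_* name are assumptions made for the example, and a "pte" is reduced to a bare page frame number in which frames 0 to 7 stand for the zero pages.

```c
#include <assert.h>

/* Assumptions for illustration only: 4 KiB pages and 8 zero pages (order 3). */
#define TOY_PAGE_SHIFT		12
#define TOY_PAGE_SIZE		(1UL << TOY_PAGE_SHIFT)
#define TOY_ZERO_PAGE_ORDER	3
#define TOY_ZERO_PAGE_MASK \
	(((TOY_PAGE_SIZE << TOY_ZERO_PAGE_ORDER) - 1) & ~(TOY_PAGE_SIZE - 1))

/* First frame number that is not one of the zero pages in this toy model. */
#define TOY_FIRST_NORMAL_PFN	(1UL << TOY_ZERO_PAGE_ORDER)

/* Which zero page a virtual address selects, in the spirit of ZERO_PAGE(vaddr). */
static unsigned long toy_zero_pfn(unsigned long vaddr)
{
	unsigned long pfn = (vaddr & TOY_ZERO_PAGE_MASK) >> TOY_PAGE_SHIFT;

	assert(pfn < TOY_FIRST_NORMAL_PFN);	/* always one of the zero pages */
	return pfn;
}

/*
 * Model of move_pte(): a mapping of the zero page for old_addr is rewritten
 * to the zero page for new_addr; any other mapping is returned unchanged.
 */
static unsigned long toy_move_pte(unsigned long pfn, unsigned long old_addr,
				  unsigned long new_addr)
{
	if (pfn == toy_zero_pfn(old_addr))
		return toy_zero_pfn(new_addr);
	return pfn;
}
```

With these assumed constants, address 0x3000 selects zero page 3 and address 0x12000 selects zero page 2, so moving a zero-page mapping between those two addresses changes the frame, while an ordinary frame passes through untouched.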
* Re: [rfc][patch 2/2] mips: reinstate move_pte
2007-03-29 7:58 ` [rfc][patch 2/2] mips: reinstate move_pte Nick Piggin
@ 2007-03-29 17:49 ` Linus Torvalds
0 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2007-03-29 17:49 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Hugh Dickins, Linux Memory Management List, tee, holt

On Thu, 29 Mar 2007, Nick Piggin wrote:
>
> Restore move_pte for MIPS, so that any given virtual address vaddr that maps
> a ZERO_PAGE will map ZERO_PAGE(vaddr).

Why does this matter? Why do we even care about the page counts? I thought
we long since agreed that reserved pages don't need to have page counts.

Linus
* Re: [rfc][patch 1/2] mm: dont account ZERO_PAGE
2007-03-29 7:58 [rfc][patch 1/2] mm: dont account ZERO_PAGE Nick Piggin
2007-03-29 7:58 ` [rfc][patch 2/2] mips: reinstate move_pte Nick Piggin
@ 2007-03-29 13:10 ` Hugh Dickins
2007-03-30 1:46 ` Nick Piggin
2007-03-30 2:40 ` Nick Piggin
1 sibling, 2 replies; 49+ messages in thread
From: Hugh Dickins @ 2007-03-29 13:10 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, tee, holt

On Thu, 29 Mar 2007, Nick Piggin wrote:
>
> Special-case the ZERO_PAGE to prevent it from being accounted like a normal
> mapped page. This is not illogical or unclean, because the ZERO_PAGE is
> heavily special cased through the page fault path.

Thou dost protest too much! By "heavily special cased through the page
fault path" you mean do_wp_page() uses a pre-zeroed page when it spots
it, instead of copying its data. That's rather a different case.

Look, I don't have any vehement objection to exempting the ZERO_PAGE
from accounting: if you remember before, I just suggested it was of
questionable value to exempt it, and the exemption should be made a
separate patch.

But this patch is not complete, is it? For example, fremap.c's
zap_pte? I haven't checked further. I was going to suggest you
should make ZERO_PAGEs fail vm_normal_page, but I guess do_wp_page
wouldn't behave very well then ;) Tucking the tests away in some
vm_normal_page-like function might make them more acceptable.

> A test-case which took over 2 hours to complete on a 1024 core Altix
> takes around 2 seconds afterward.

Oh, it's easy to devise a test-case of that kind, but does it matter
in real life? I admit that what most people run on their 1024-core
Altices will be significantly different from what I checked on my
laptop back then, but in my case use of the ZERO_PAGE didn't look
common enough to make special cases for.

You put forward a pagecache replication patch a few weeks ago.
That's what I expected to happen to the ZERO_PAGE, once NUMA folks
complained of the accounting. Isn't that a better way to go?

Or is there some important app on the Altix which uses the
ZERO_PAGE so very much, that its interesting data remains shared
between nodes forever, and it's only its struct page cacheline
bouncing dirtily from one to another that slows things down?

Hugh
* Re: [rfc][patch 1/2] mm: dont account ZERO_PAGE
2007-03-29 13:10 ` [rfc][patch 1/2] mm: dont account ZERO_PAGE Hugh Dickins
@ 2007-03-30 1:46 ` Nick Piggin
2007-03-30 2:59 ` Robin Holt
2007-03-30 2:40 ` Nick Piggin
1 sibling, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2007-03-30 1:46 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, tee, holt

On Thu, Mar 29, 2007 at 02:10:55PM +0100, Hugh Dickins wrote:
> On Thu, 29 Mar 2007, Nick Piggin wrote:
> >
> > Special-case the ZERO_PAGE to prevent it from being accounted like a normal
> > mapped page. This is not illogical or unclean, because the ZERO_PAGE is
> > heavily special cased through the page fault path.
>
> Thou dost protest too much! By "heavily special cased through the page
> fault path" you mean do_wp_page() uses a pre-zeroed page when it spots
> it, instead of copying its data. That's rather a different case.

That, and the use of the zero page _at all_ in the do_anonymous_page
and zeromap, and I guess our anti-wrapping hacks in the page allocator...
it is just done for a little optimisation, so I figure it wouldn't hurt
to optimise a bit more ;)

> Look, I don't have any vehement objection to exempting the ZERO_PAGE
> from accounting: if you remember before, I just suggested it was of
> questionable value to exempt it, and the exemption should be made a
> separate patch.
>
> But this patch is not complete, is it? For example, fremap.c's
> zap_pte? I haven't checked further. I was going to suggest you
> should make ZERO_PAGEs fail vm_normal_page, but I guess do_wp_page
> wouldn't behave very well then ;) Tucking the tests away in some
> vm_normal_page-like function might make them more acceptable.

Yeah I was going to do that, but noted the do_wp_page thingy. I don't
know... it might be better though... vm_refcounted_page()?

> > A test-case which took over 2 hours to complete on a 1024 core Altix
> > takes around 2 seconds afterward.
>
> Oh, it's easy to devise a test-case of that kind, but does it matter
> in real life? I admit that what most people run on their 1024-core
> Altices will be significantly different from what I checked on my
> laptop back then, but in my case use of the ZERO_PAGE didn't look
> common enough to make special cases for.

Yeah I don't have access to the box, but it was a constructed test
of some kind. However this is basically a dead box situation... on
smaller systems we could still see performance improvements.

And the other thing is I'd like to be able to get rid of the wrapping
tests from the page allocator and PageReserved from the kernel entirely
at some point.

> You put forward a pagecache replication patch a few weeks ago.
> That's what I expected to happen to the ZERO_PAGE, once NUMA folks
> complained of the accounting. Isn't that a better way to go?

Not sure how much remote memory access the ZERO_PAGE itself causes. It
is obviously readonly data, and itaniums have pretty big caches, so it
is more important to get rid of the bouncing cachelines. Per node
ZERO_PAGE could be a good idea, however you can still have all pages
come from a single node (eg. a forking server)...

> Or is there some important app on the Altix which uses the
> ZERO_PAGE so very much, that its interesting data remains shared
> between nodes forever, and it's only its struct page cacheline
> bouncing dirtily from one to another that slows things down?

Can't answer that. I think they are worried about this being hit in
the field.

Does the ZERO_PAGE help _any_ real workloads? It will cost an extra
fault any time you are not content with its interesting data. I don't
know why any performance critical app would read huge swaths of zeroes,
but there is probably a reason for it...
* Re: [rfc][patch 1/2] mm: dont account ZERO_PAGE
2007-03-30 1:46 ` Nick Piggin
@ 2007-03-30 2:59 ` Robin Holt
2007-03-30 3:09 ` Nick Piggin
0 siblings, 1 reply; 49+ messages in thread
From: Robin Holt @ 2007-03-30 2:59 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, Andrew Morton, Linus Torvalds,
Linux Memory Management List, tee, holt

On Fri, Mar 30, 2007 at 03:46:34AM +0200, Nick Piggin wrote:
> > Oh, it's easy to devise a test-case of that kind, but does it matter
> > in real life? I admit that what most people run on their 1024-core
> > Altices will be significantly different from what I checked on my
> > laptop back then, but in my case use of the ZERO_PAGE didn't look
> > common enough to make special cases for.
>
> Yeah I don't have access to the box, but it was a constructed test
> of some kind. However this is basically a dead box situation... on
> smaller systems we could still see performance improvements.

It was not a constructed test. It was a test application which started
up and read one word from each page to fill the page tables (not sure
why that was done), then forked a process for each cpu. At that point,
it was supposed to start doing computation using data from an NFS
accessible file. Unfortunately, the file was not found so the application
exited and the machine locked up for hours. Of course, they assumed
something had gone wrong with the system and repeated the test with the
same result.

Thanks,
Robin
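The access pattern Robin describes (read one word from every page of untouched anonymous memory, then fork) can be approximated in userspace. The following is a minimal sketch under stated assumptions, not the application from the report: the function name touch_then_fork, the page and child counts, and the use of mmap() rather than whatever allocator the real program used are all choices made for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Read one word from every page of a fresh anonymous mapping, then fork
 * nchildren processes that exit immediately. Returns the sum of the words
 * read (0 for untouched anonymous memory), or -1 if the mapping fails.
 */
static long touch_then_fork(size_t npages, int nchildren)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *buf;
	size_t len, i;
	long sum = 0;
	int forked = 0;

	assert(npages > 0 && nchildren >= 0 && page_size > 0);
	len = npages * (size_t)page_size;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* Read faults only: no page is ever written. */
	for (i = 0; i < npages; i++)
		sum += *(volatile long *)(buf + i * (size_t)page_size);

	/* Each child inherits the read-faulted mappings and exits at once. */
	for (; forked < nchildren; forked++) {
		pid_t pid = fork();

		if (pid < 0)
			break;
		if (pid == 0)
			_exit(0);
	}
	while (forked-- > 0)
		wait(NULL);

	munmap(buf, len);
	return sum;
}
```

On the kernel under discussion, every one of those read faults, the fork, and each child's exit would touch the reference and map counts of the same shared page, which is the contention the thread is about. The sketch only reproduces the shape of the workload; it makes no claim about timings.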
* Re: [rfc][patch 1/2] mm: dont account ZERO_PAGE
2007-03-30 2:59 ` Robin Holt
@ 2007-03-30 3:09 ` Nick Piggin
2007-03-30 9:23 ` Robin Holt
0 siblings, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2007-03-30 3:09 UTC (permalink / raw)
To: Robin Holt
Cc: Hugh Dickins, Andrew Morton, Linus Torvalds,
Linux Memory Management List, tee

On Thu, Mar 29, 2007 at 09:59:37PM -0500, Robin Holt wrote:
> On Fri, Mar 30, 2007 at 03:46:34AM +0200, Nick Piggin wrote:
> > > Oh, it's easy to devise a test-case of that kind, but does it matter
> > > in real life? I admit that what most people run on their 1024-core
> > > Altices will be significantly different from what I checked on my
> > > laptop back then, but in my case use of the ZERO_PAGE didn't look
> > > common enough to make special cases for.
> >
> > Yeah I don't have access to the box, but it was a constructed test
> > of some kind. However this is basically a dead box situation... on
> > smaller systems we could still see performance improvements.
>
> It was not a constructed test. It was a test application which started
> up and read one word from each page to fill the page tables (not sure
> why that was done), then forked a process for each cpu. At that point,
> it was supposed to start doing computation using data from an NFS
> accessible file. Unfortunately, the file was not found so the application
> exited and the machine locked up for hours.

Sorry, my mistake.

Thanks for the clarification: this sounds like something that will not
be helped by per-node ZERO_PAGEs either.

So not typical, but something that we'd rather not fall over with.

I guess large ranges of zero pages could be quite common in startup
of HPC codes operating on large matrices.

Thanks,
Nick
* Re: [rfc][patch 1/2] mm: dont account ZERO_PAGE
2007-03-30 3:09 ` Nick Piggin
@ 2007-03-30 9:23 ` Robin Holt
0 siblings, 0 replies; 49+ messages in thread
From: Robin Holt @ 2007-03-30 9:23 UTC (permalink / raw)
To: Nick Piggin
Cc: Robin Holt, Hugh Dickins, Andrew Morton, Linus Torvalds,
Linux Memory Management List, tee

On Fri, Mar 30, 2007 at 05:09:12AM +0200, Nick Piggin wrote:
> > up and read one word from each page to fill the page tables (not sure
> > why that was done), then forked a process for each cpu. At that point,
>
> So not typical, but something that we'd rather not fall over with.

I agree

> I guess large ranges of zero pages could be quite common in startup
> of HPC codes operating on large matrices.

The "not sure why that was done" was referring to this being exactly
the opposite of what a typical HPC application does. Those tend to
locate themselves on the node which will use an address range and then
write-touch each of the pages.

Thanks,
Robin
* Re: [rfc][patch 1/2] mm: dont account ZERO_PAGE
2007-03-29 13:10 ` [rfc][patch 1/2] mm: dont account ZERO_PAGE Hugh Dickins
2007-03-30 1:46 ` Nick Piggin
@ 2007-03-30 2:40 ` Nick Piggin
2007-04-04 3:37 ` [rfc] no ZERO_PAGE? Nick Piggin
1 sibling, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2007-03-30 2:40 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, tee, holt

On Thu, Mar 29, 2007 at 02:10:55PM +0100, Hugh Dickins wrote:
>
> But this patch is not complete, is it? For example, fremap.c's
> zap_pte? I haven't checked further. I was going to suggest you

Ah yes, nonlinear... thanks I missed that.

Well it would make life easier if we got rid of ZERO_PAGE completely,
which I definitely wouldn't complain about ;) It is much more likely
to cause noticeable performance loss in other areas though, so it is
not really a candidate for SLES at the moment.

But I would like to get something for mainline that everyone likes
whether that is vm_refcounted_page (which I just implemented and it
doesn't make things much cleaner, but I'll go with it); per-node
ZERO_PAGE; or whatever.
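The vm_refcounted_page idea, hiding the "is this the ZERO_PAGE?" test inside one helper, can be illustrated with a toy userspace model. This is a sketch only: the implementation Nick mentions is not shown in the thread, so the struct, the helper bodies, and all toy_* names below are hypothetical stand-ins rather than his code or any kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for struct page and the shared zero page; not kernel types. */
struct toy_page {
	int refcount;
	int mapcount;
};

static struct toy_page toy_zero_page;

/* Return the page if it should be refcounted and accounted, else NULL. */
static struct toy_page *toy_refcounted_page(struct toy_page *page)
{
	if (page == NULL || page == &toy_zero_page)
		return NULL;
	return page;
}

/* Map a page: take a reference and bump rss unless the page is exempt. */
static void toy_map_page(struct toy_page *page, long *rss)
{
	assert(rss != NULL);
	page = toy_refcounted_page(page);
	if (page == NULL)
		return;
	page->refcount++;
	page->mapcount++;
	(*rss)++;
}

/* Unmap a page: the mirror image of toy_map_page(). */
static void toy_unmap_page(struct toy_page *page, long *rss)
{
	assert(rss != NULL);
	page = toy_refcounted_page(page);
	if (page == NULL)
		return;
	page->mapcount--;
	page->refcount--;
	(*rss)--;
}
```

The point of the design is that callers never compare against the zero page themselves; mapping or unmapping it simply leaves its counters and the rss total untouched, so its cacheline is never dirtied.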
* [rfc] no ZERO_PAGE?
2007-03-30 2:40 ` Nick Piggin
@ 2007-04-04 3:37 ` Nick Piggin
2007-04-04 9:45 ` Hugh Dickins
2007-04-04 15:35 ` Linus Torvalds
0 siblings, 2 replies; 49+ messages in thread
From: Nick Piggin @ 2007-04-04 3:37 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, tee,
holt, Andrea Arcangeli, Linux Kernel Mailing List

On Fri, Mar 30, 2007 at 04:40:48AM +0200, Nick Piggin wrote:
>
> Well it would make life easier if we got rid of ZERO_PAGE completely,
> which I definitely wouldn't complain about ;)

So, what bad things (apart from my bugs in untested code) happen
if we do this? We can actually go further, and probably remove the
ZERO_PAGE completely (just need an extra get_user_pages flag or
something for the core dumping issue).

Shall I do a more complete patchset and ask Andrew to give it a
run in -mm?

--

ZERO_PAGE for anonymous pages seems to only be designed to help stupid
programs, so remove it. This solves issues with ZERO_PAGE refcounting and
NUMA un-awareness.

(Actually, not quite. We should also remove all the zeromap stuff that also
seems to not do much except help stupid programs).

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1613,16 +1613,10 @@ gotten:
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	if (old_page == ZERO_PAGE(address)) {
-		new_page = alloc_zeroed_user_highpage(vma, address);
-		if (!new_page)
-			goto oom;
-	} else {
-		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-		if (!new_page)
-			goto oom;
-		cow_user_page(new_page, old_page, address, vma);
-	}
+	new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+	if (!new_page)
+		goto oom;
+	cow_user_page(new_page, old_page, address, vma);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2130,52 +2124,33 @@ static int do_anonymous_page(struct mm_s
 	spinlock_t *ptl;
 	pte_t entry;
 
-	if (write_access) {
-		/* Allocate our own private page. */
-		pte_unmap(page_table);
+	/* Allocate our own private page. */
+	pte_unmap(page_table);
 
-		if (unlikely(anon_vma_prepare(vma)))
-			goto oom;
-		page = alloc_zeroed_user_highpage(vma, address);
-		if (!page)
-			goto oom;
+	if (unlikely(anon_vma_prepare(vma)))
+		return VM_FAULT_OOM;
+	page = alloc_zeroed_user_highpage(vma, address);
+	if (!page)
+		return VM_FAULT_OOM;
 
-		entry = mk_pte(page, vma->vm_page_prot);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	entry = mk_pte(page, vma->vm_page_prot);
+	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 
-		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-		if (!pte_none(*page_table))
-			goto release;
+	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (likely(pte_none(*page_table))) {
 		inc_mm_counter(mm, anon_rss);
 		lru_cache_add_active(page);
 		page_add_new_anon_rmap(page, vma, address);
-	} else {
-		/* Map the ZERO_PAGE - vm_page_prot is readonly */
-		page = ZERO_PAGE(address);
-		page_cache_get(page);
-		entry = mk_pte(page, vma->vm_page_prot);
-
-		ptl = pte_lockptr(mm, pmd);
-		spin_lock(ptl);
-		if (!pte_none(*page_table))
-			goto release;
-		inc_mm_counter(mm, file_rss);
-		page_add_file_rmap(page);
-	}
-
-	set_pte_at(mm, address, page_table, entry);
+		set_pte_at(mm, address, page_table, entry);
 
-	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, address, entry);
-	lazy_mmu_prot_update(entry);
-unlock:
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, address, entry);
+		lazy_mmu_prot_update(entry);
+	} else
+		page_cache_release(page);
 	pte_unmap_unlock(page_table, ptl);
+
 	return VM_FAULT_MINOR;
-release:
-	page_cache_release(page);
-	goto unlock;
-oom:
-	return VM_FAULT_OOM;
 }
 
 /*
* Re: [rfc] no ZERO_PAGE?
2007-04-04 3:37 ` [rfc] no ZERO_PAGE? Nick Piggin
@ 2007-04-04 9:45 ` Hugh Dickins
2007-04-04 10:24 ` Nick Piggin
2007-04-04 15:35 ` Linus Torvalds
1 sibling, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2007-04-04 9:45 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, tee,
holt, Andrea Arcangeli, Linux Kernel Mailing List

On Wed, 4 Apr 2007, Nick Piggin wrote:
> On Fri, Mar 30, 2007 at 04:40:48AM +0200, Nick Piggin wrote:
> >
> > Well it would make life easier if we got rid of ZERO_PAGE completely,
> > which I definitely wouldn't complain about ;)

Yes, I love this approach too.

>
> So, what bad things (apart from my bugs in untested code) happen
> if we do this? We can actually go further, and probably remove the
> ZERO_PAGE completely (just need an extra get_user_pages flag or
> something for the core dumping issue).

Some things will go faster (no longer needing a separate COW fault
on the read-protected ZERO_PAGE), some things will go slower and use
more memory. The open question is whether anyone will notice those
regressions: I'm hoping they won't, I'm afraid they will. And though
we'll see each as a program doing "something stupid", as in the Altix
case Robin showed to drive us here, we cannot just ignore it.

>
> Shall I do a more complete patchset and ask Andrew to give it a
> run in -mm?

I'd like you to: I didn't study the fragment below, it's really all
uses of the ZERO_PAGE that I'd like to see go, then we see who shouts.

It's quite likely that the patch would have to be reverted: don't
bother to remove the allocations of ZERO_PAGE in each architecture
at this stage, too much nuisance going back and forth on those.

Leave ZERO_PAGE as configurable, default off for testing, buried
somewhere like under EMBEDDED? It's much more attractive just to
remove the old code, and reintroduce it if there's a demand; but
leaving it under config would make it easy to restore, and if
there's trouble with removing ZERO_PAGE, we might later choose
to disable it at the high end but enable it at the low. What
would you prefer?

Hugh
* Re: [rfc] no ZERO_PAGE?
2007-04-04 9:45 ` Hugh Dickins
@ 2007-04-04 10:24 ` Nick Piggin
2007-04-04 12:27 ` Andrea Arcangeli
2007-04-04 12:45 ` Hugh Dickins
0 siblings, 2 replies; 49+ messages in thread
From: Nick Piggin @ 2007-04-04 10:24 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, tee,
holt, Andrea Arcangeli, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 10:45:39AM +0100, Hugh Dickins wrote:
> On Wed, 4 Apr 2007, Nick Piggin wrote:
> > On Fri, Mar 30, 2007 at 04:40:48AM +0200, Nick Piggin wrote:
> > >
> > > Well it would make life easier if we got rid of ZERO_PAGE completely,
> > > which I definitely wouldn't complain about ;)
>
> Yes, I love this approach too.
>
> >
> > So, what bad things (apart from my bugs in untested code) happen
> > if we do this? We can actually go further, and probably remove the
> > ZERO_PAGE completely (just need an extra get_user_pages flag or
> > something for the core dumping issue).
>
> Some things will go faster (no longer needing a separate COW fault
> on the read-protected ZERO_PAGE), some things will go slower and use
> more memory. The open question is whether anyone will notice those
> regressions: I'm hoping they won't, I'm afraid they will. And though
> we'll see each as a program doing "something stupid", as in the Altix
> case Robin showed to drive us here, we cannot just ignore it.

Sure. Agreed.

> > Shall I do a more complete patchset and ask Andrew to give it a
> > run in -mm?
>
> I'd like you to: I didn't study the fragment below, it's really all
> uses of the ZERO_PAGE that I'd like to see go, then we see who shouts.

Yeah, they are basically pretty trivial to remove. I'll do a more
complete patch now that I know you like the approach.

> It's quite likely that the patch would have to be reverted: don't
> bother to remove the allocations of ZERO_PAGE in each architecture
> at this stage, too much nuisance going back and forth on those.

OK.

> Leave ZERO_PAGE as configurable, default off for testing, buried
> somewhere like under EMBEDDED? It's much more attractive just to
> remove the old code, and reintroduce it if there's a demand; but
> leaving it under config would make it easy to restore, and if
> there's trouble with removing ZERO_PAGE, we might later choose
> to disable it at the high end but enable it at the low. What
> would you prefer?

Ooh, the one with more '-' signs in the diff ;)

No, you have a point, but if we have to ask people to recompile
with CONFIG_ZERO_PAGE, then it isn't much harder to ask them to
apply a patch first.

But for a potential mainline merge, maybe starting with a CONFIG
option is a good idea -- defaulting to off, and we could start by
turning it on just in -rc kernels for a few releases, to get a bit
more confidence?
* Re: [rfc] no ZERO_PAGE?
2007-04-04 10:24 ` Nick Piggin
@ 2007-04-04 12:27 ` Andrea Arcangeli
2007-04-04 13:55 ` Dan Aloni
2007-04-04 12:45 ` Hugh Dickins
1 sibling, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 12:27 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, Andrew Morton, Linus Torvalds,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 12:24:07PM +0200, Nick Piggin wrote:
> But for a potential mainline merge, maybe starting with a CONFIG
> option is a good idea -- defaulting to off, and we could start by
> turning it on just in -rc kernels for a few releases, to get a bit
> more confidence?

The only reason to do that is if there are many stupid apps pretending
to get meaningful information from pages that cannot contain any
information. The zero page in the anon page fault has been there
forever so...

Anyway I also like this approach as I immediately suggested it after
reading about the zero page scalability patches ;).
* Re: [rfc] no ZERO_PAGE?
2007-04-04 12:27 ` Andrea Arcangeli
@ 2007-04-04 13:55 ` Dan Aloni
2007-04-04 14:14 ` Andrea Arcangeli
0 siblings, 1 reply; 49+ messages in thread
From: Dan Aloni @ 2007-04-04 13:55 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Hugh Dickins, Andrew Morton, Linus Torvalds,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 02:27:01PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 04, 2007 at 12:24:07PM +0200, Nick Piggin wrote:
> > But for a potential mainline merge, maybe starting with a CONFIG
> > option is a good idea -- defaulting to off, and we could start by
> > turning it on just in -rc kernels for a few releases, to get a bit
> > more confidence?
>
> The only reason to do that is if there are many stupid apps pretending
> to get meaningful information from pages that cannot contain any
> information. The zero page in the anon page fault has been there
> forever so...

There might be a lot of applications like that, and I'm not sure that
_all_ of them can be considered 'stupid' as you say.

How about applications that perform mmap() and R/W random-access on
large *sparse* files? (e.g. a scientific app that uses a large sparse
file as a big database look-up table). As I see it, these apps would
need to keep track of what's sparse and what's not...

--
Dan Aloni
XIV LTD, http://www.xivstorage.com
da-x (at) monatomic.org, dan (at) xiv.co.il
* Re: [rfc] no ZERO_PAGE?
2007-04-04 13:55 ` Dan Aloni
@ 2007-04-04 14:14 ` Andrea Arcangeli
2007-04-04 14:44 ` Dan Aloni
0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 14:14 UTC (permalink / raw)
To: Dan Aloni
Cc: Nick Piggin, Hugh Dickins, Andrew Morton, Linus Torvalds,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 04:55:32PM +0300, Dan Aloni wrote:
> How about applications that perform mmap() and R/W random-access on
> large *sparse* files? (e.g. a scientific app that uses a large sparse
> file as a big database look-up table). As I see it, these apps would
> need to keep track of what's sparse and what's not...

That's not anonymous memory if those are read page faults on
_files_. I'm only talking about anonymous memory and
do_anonymous_page, i.e. no file data at all. In more clear words, the
only thing we're discussing here is char = malloc(1); *char.

Your example _already_ allocates zeroed pagecache instead of the zero
page, so your example (random access over sparse files with mmap, be
it MAP_PRIVATE or MAP_SHARED, no difference for reads) has never had
anything to do with the zero page. If anything, we could optimize your
example to _start_ using the ZERO_PAGE for the first time ever: it
would make more sense to map it where the lowlevel fs
finds holes. ZERO_PAGE in do_anonymous_page instead doesn't make much
sense to me, but it has always been there as far as I can
remember. The thing is that it never hurt until the huge systems
with nightmare cacheline bouncing reported heavy stalls on some
testcase, which makes it look like a DoS because of the ZERO_PAGE,
hence now that it hurts I guess it can go.

^ permalink raw reply [flat|nested] 49+ messages in thread
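The case Andrea singles out, a read fault on anonymous memory that has
never been written, can be reproduced from user space. What follows is
only a minimal sketch, assuming Linux/POSIX mmap(); the function name
count_nonzero_after_read_faults is made up for illustration and does not
come from the thread. It shows why such reads carry no information: every
page reads back as zero, whether the kernel maps the ZERO_PAGE or hands
out a freshly zeroed page.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Read-touch the first byte of every page of a fresh anonymous mapping
 * without ever writing to it. Returns how many of those bytes were
 * non-zero (expected: 0), or -1 if mmap() fails.
 */
long count_nonzero_after_read_faults(size_t npages)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * pagesz;
    volatile unsigned char *p;
    long nonzero = 0;
    size_t i;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    for (i = 0; i < npages; i++)
        if (p[i * pagesz] != 0) /* read access only: a read fault */
            nonzero++;

    munmap((void *)p, len);
    return nonzero;
}
```

Calling count_nonzero_after_read_faults(64) from a small main() is
enough to exercise the do_anonymous_page read-fault path 64 times.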
* Re: [rfc] no ZERO_PAGE?
2007-04-04 14:14 ` Andrea Arcangeli
@ 2007-04-04 14:44 ` Dan Aloni
2007-04-04 15:03 ` Hugh Dickins
2007-04-04 15:27 ` Andrea Arcangeli
0 siblings, 2 replies; 49+ messages in thread
From: Dan Aloni @ 2007-04-04 14:44 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Hugh Dickins, Andrew Morton, Linus Torvalds,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 04:14:57PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 04, 2007 at 04:55:32PM +0300, Dan Aloni wrote:
> > How about applications that perform mmap() and R/W random-access on
> > large *sparse* files? (e.g. a scientific app that uses a large sparse
> > file as a big database look-up table). As I see it, these apps would
> > need to keep track of what's sparse and what's not...
>
> That's not anonymous memory if those are read page faults on
> _files_. I'm only talking about anonymous memory and
> do_anonymous_page, i.e. no file data at all. In more clear words, the
> only thing we're discussing here is char = malloc(1); *char.
>
> Your example _already_ allocates zeroed pagecache instead of the zero
> page, so your example (random access over sparse files with mmap, be
> it MAP_PRIVATE or MAP_SHARED no difference for reads) has never had
> anything to do with the zero page. If something we could optimize your
> example to _start_ using for the first time ever the ZERO_PAGE, it
> would make more sense to use it to be mapped where the lowlevel fs
> finds holes. ZERO_PAGE in do_anonymous_page instead doesn't make much
> sense to me, but it has always been there as far as I can
> remember. The thing is that it never hurted until the huge systems
> with nightmare cacheline bouncing reported heavy stalls on some
> testcase, which make it look like a DoS because of the ZERO_PAGE,
> hence now that it hurts I guess it can go.

Oh, right. Thanks for clarifying. I should have figured it out before I
sent that mail.

To refine that example, you could replace the file with a large anonymous
memory pool and a lot of swap space committed to it. In that case - with
no ZERO_PAGE, would the kernel needlessly swap-out the zeroed pages?
Perhaps it's an example too far-fetched to be worth considering...

--
Dan Aloni
XIV LTD, http://www.xivstorage.com
da-x (at) monatomic.org, dan (at) xiv.co.il

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 14:44 ` Dan Aloni
@ 2007-04-04 15:03 ` Hugh Dickins
2007-04-04 15:34 ` Andrea Arcangeli
2007-04-04 15:27 ` Andrea Arcangeli
1 sibling, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2007-04-04 15:03 UTC (permalink / raw)
To: Dan Aloni
Cc: Andrea Arcangeli, Nick Piggin, Andrew Morton, Linus Torvalds,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, 4 Apr 2007, Dan Aloni wrote:
>
> To refine that example, you could replace the file with a large anonymous
> memory pool and a lot of swap space committed to it. In that case - with
> no ZERO_PAGE, would the kernel needlessly swap-out the zeroed pages?
> Perhaps it's an example too far-fetched to worth considering...

Nice point, not far-fetched, though I don't know whether it's worth
worrying about or not. Yes, as things stand, the kernel will
needlessly write them out to swap: because we're in the habit of
marking a writable pte as dirty, partly to save the processor (how
i386-centric am I being?) from having to do that work just after,
partly because of some race too ancient for me to know anything
about - do_no_page (though not the function in question here) says:

 * This silly early PAGE_DIRTY setting removes a race
 * due to the bad i386 page protection. But it's valid
 * for other architectures too.

Maybe Nick will decide not to mark the readfaults as dirty.

Hugh

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:03 ` Hugh Dickins
@ 2007-04-04 15:34 ` Andrea Arcangeli
2007-04-04 15:41 ` Hugh Dickins
0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 15:34 UTC (permalink / raw)
To: Hugh Dickins
Cc: Dan Aloni, Nick Piggin, Andrew Morton, Linus Torvalds,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 04:03:15PM +0100, Hugh Dickins wrote:
> Maybe Nick will decide not to mark the readfaults as dirty.

I don't like to mark the pte readonly and clean, we'd be still
optimizing for the current ZERO_PAGE users and even for those it would
generate an unnecessary page fault if they later write to it. If any
legitimate ZERO_PAGE user really exists, then we should keep mapping
the ZERO_PAGE into it and fix the scalability issue associated with
it, instead of allocating a new page in readonly mode.

Marking anonymous pages readonly and clean so they can be collected
w/o swapping is still desirable for glibc through madvise (madvise
would later need to be called again before starting to use the
collectable anon pages to store information into them), but that's an
entirely different topic ;)

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:34 ` Andrea Arcangeli
@ 2007-04-04 15:41 ` Hugh Dickins
2007-04-04 16:07 ` Andrea Arcangeli
2007-04-04 16:14 ` Linus Torvalds
0 siblings, 2 replies; 49+ messages in thread
From: Hugh Dickins @ 2007-04-04 15:41 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Dan Aloni, Nick Piggin, Andrew Morton, Linus Torvalds,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, 4 Apr 2007, Andrea Arcangeli wrote:
> On Wed, Apr 04, 2007 at 04:03:15PM +0100, Hugh Dickins wrote:
> > Maybe Nick will decide not to mark the readfaults as dirty.
>
> I don't like to mark the pte readonly and clean,

Nor I: I meant that anonymous readfault should
(perhaps) mark the pte writable but clean.

Hugh

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:41 ` Hugh Dickins
@ 2007-04-04 16:07 ` Andrea Arcangeli
2007-04-04 16:14 ` Linus Torvalds
1 sibling, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 16:07 UTC (permalink / raw)
To: Hugh Dickins
Cc: Dan Aloni, Nick Piggin, Andrew Morton, Linus Torvalds,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 04:41:46PM +0100, Hugh Dickins wrote:
> Nor I: I meant that anonymous readfault should
> (perhaps) mark the pte writable but clean.

Sorry, I assumed when you said clean you implied readonly... Though
we'd need to differentiate the archs where the dirty bit is not set by
the hardware. Overall I'm unsure it's worth it. Currently the VM
definitely wouldn't cope with a writeable and clean anonymous page, so
we'd need to change shrink_page_list and try_to_unmap_anon to make it
work. Likely it won't be measurable, so it may be a nice feature to
have from a theoretical point of view, but in practice I doubt it
matters.

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:41 ` Hugh Dickins
2007-04-04 16:07 ` Andrea Arcangeli
@ 2007-04-04 16:14 ` Linus Torvalds
1 sibling, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2007-04-04 16:14 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrea Arcangeli, Dan Aloni, Nick Piggin, Andrew Morton,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, 4 Apr 2007, Hugh Dickins wrote:
> On Wed, 4 Apr 2007, Andrea Arcangeli wrote:
> > On Wed, Apr 04, 2007 at 04:03:15PM +0100, Hugh Dickins wrote:
> > > Maybe Nick will decide not to mark the readfaults as dirty.
> >
> > I don't like to mark the pte readonly and clean,
>
> Nor I: I meant that anonymous readfault should
> (perhaps) mark the pte writable but clean.

Maybe. On the other hand, marking it dirty is going to be almost as
expensive as taking the whole page fault again. The dirty bit is in
software on a lot of architectures, and even on x86 where it's in hw, all
microarchitectures basically consider it a micro-trap, and some of them
(*cough*P4*cough*) are really bad at it.

So I'd actually rather just mark it dirty too, because that way there is a
real potential performance upside to go with the real potential
performance downside, and we can hope that it all comes out even in the
end ;)

Linus

PS. Yes, I wrote the benchmark. On at least some versions of the P4, just
setting the dirty bit took 1500 cycles.. No sw-visible traps, just a *lot*
of cycles to clean out the pipeline entirely, do a micro-trap, and
continue. Of course, the P4 sucks at these things, but the point is that
it can be as expensive to do it "in hardware" as doing it in software if
the hardware is mis-designed..

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 14:44 ` Dan Aloni
2007-04-04 15:03 ` Hugh Dickins
@ 2007-04-04 15:27 ` Andrea Arcangeli
2007-04-04 16:15 ` Dan Aloni
1 sibling, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 15:27 UTC (permalink / raw)
To: Dan Aloni
Cc: Nick Piggin, Hugh Dickins, Andrew Morton, Linus Torvalds,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 05:44:21PM +0300, Dan Aloni wrote:
> To refine that example, you could replace the file with a large anonymous
> memory pool and a lot of swap space committed to it. In that case - with
> no ZERO_PAGE, would the kernel needlessly swap-out the zeroed pages?

Swapout or ram is the same in this context. The point is that it will
take 4k either in ram or swap, let's talk about virtual memory without
differentiating between ram or swap.

> Perhaps it's an example too far-fetched to worth considering...

Even if you would read the sparse file into a malloced space (more
commonly that would be tmpfs) using the read syscall, those anon (or
tmpfs) pages would be _written_ first, which isn't the case we're
discussing here.

You don't know what is on disk, so reading from disk (regardless of
what you read, holes, zeros or anything) provides useful information,
but you know what is in ram after an anon mmap: just zeros, reading
them can't provide useful information to any software.

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:27 ` Andrea Arcangeli
@ 2007-04-04 16:15 ` Dan Aloni
2007-04-04 16:48 ` Andrea Arcangeli
0 siblings, 1 reply; 49+ messages in thread
From: Dan Aloni @ 2007-04-04 16:15 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Hugh Dickins, Andrew Morton, Linus Torvalds,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 05:27:17PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 04, 2007 at 05:44:21PM +0300, Dan Aloni wrote:
> > To refine that example, you could replace the file with a large anonymous
> > memory pool and a lot of swap space committed to it. In that case - with
> > no ZERO_PAGE, would the kernel needlessly swap-out the zeroed pages?
>
> Swapout or ram is the same in this context. The point is that it will
> take 4k either in ram or swap, let's talk about virtual memory without
> differentiating between ram or swap.

The main difference is that disk-backed swap can create I/O pressure which
would slow down the swap-outs that are not of zeroed pages (and other I/Os
on that disk for that matter). For purely-RAM virtual memory the latency
incurred from managing newly allocated and zeroed pages is negligible
compared to the latencies you get from reading/flushing those pages to
disk if you add swap to the picture.

> > Perhaps it's an example too far-fetched to worth considering...
>
> Even if you would read the sparsed file to a malloced space (more
> commonly that would be tmpfs) using the read syscall, those anon (or
> tmpfs) pages would be _written_ first, which isn't the case we're
> discussing here.
>
> You don't know what is on disk, so reading from disk (regardless of
> what you read, holes, zeros or anything) provides useful information,
> but you know what is in ram after an anon mmap: just zeros, reading
> them can't provide useful information to any software.

I agree.

The swap I/O case still holds, though: swapping-in the zeroed pages that
got swapped-out might incur unwanted overhead.

--
Dan Aloni
XIV LTD, http://www.xivstorage.com
da-x (at) monatomic.org, dan (at) xiv.co.il

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 16:15 ` Dan Aloni
@ 2007-04-04 16:48 ` Andrea Arcangeli
0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 16:48 UTC (permalink / raw)
To: Dan Aloni; +Cc: Linux Memory Management List, Linux Kernel Mailing List

Hi Dan,

On Wed, Apr 04, 2007 at 07:15:15PM +0300, Dan Aloni wrote:
> The main difference is that disk-backed swap can create I/O pressure which
> would slow down the swap-outs that are not of zeroed pages (and other I/Os
> on that disk for that matter). For purely-RAM virtual memory the latency
> incured from managing newly allocated and zeroed pages is neglegible
> compared to the latencies you get from reading/flushing those pages to
> disk if you add swap to the picture.

Sorry but you're telling me the obvious... clearly you're right, swap
is slower, ram is faster. As a corollary, on a 64bit system you could
always throw money at ram and _guarantee_ that those anon read page
faults never hit swap.

That's not the point. If 4G more of virtual memory are allocated in
the address space of a task because of this kernel change, it's the
same problem if those 4G are later allocated in swap or in ram
depending on the runtime environment of the kernel.

The problem is that 4G more will be allocated, it doesn't matter
_where_. The user with an 8G system will not be slowed down much, the
user with a 128M system will thrash beyond repair, but it's the same
problem for both. Whether the new memory will go into ram or swap is
irrelevant because it's an unknown variable that depends on the amount
of ram and swap and on what else is running (in fact there will be a
third guy with even less luck that will go out of memory and crash
after hitting an oom killer bug ;), it's the same problem in all three
cases.

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 10:24 ` Nick Piggin
2007-04-04 12:27 ` Andrea Arcangeli
@ 2007-04-04 12:45 ` Hugh Dickins
2007-04-04 13:05 ` Andrea Arcangeli
1 sibling, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2007-04-04 12:45 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, tee,
holt, Andrea Arcangeli, Linux Kernel Mailing List

On Wed, 4 Apr 2007, Nick Piggin wrote:
>
> No, you have a point, but if we have to ask people to recompile
> with CONFIG_ZERO_PAGE, then it isn't much harder to ask them to
> apply a patch first.
>
> But for a potential mainline merge, maybe starting with a CONFIG
> option is a good idea -- defaulting to off, and we could start by
> turning it on just in -rc kernels for a few releases, to get a bit
> more confidence?

I'm confused. CONFIG_ZERO_PAGE off is where we'd like to end up: how
would turning CONFIG_ZERO_PAGE on in -rc kernels help us to get there?

Hugh

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 12:45 ` Hugh Dickins
@ 2007-04-04 13:05 ` Andrea Arcangeli
2007-04-04 13:32 ` Hugh Dickins
0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 13:05 UTC (permalink / raw)
To: Hugh Dickins
Cc: Nick Piggin, Andrew Morton, Linus Torvalds,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 01:45:06PM +0100, Hugh Dickins wrote:
> I'm confused. CONFIG_ZERO_PAGE off is where we'd like to end up: how
> would turning CONFIG_ZERO_PAGE on in -rc kernels help us to get there?

He most certainly meant on by default.

I think if we do this, we also need a zeropage counter in the vm stats
so that we'll get a measure of the waste and it'll be possible to
identify apps to optimize/fix.

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 13:05 ` Andrea Arcangeli
@ 2007-04-04 13:32 ` Hugh Dickins
2007-04-04 13:40 ` Andrea Arcangeli
0 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2007-04-04 13:32 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Andrew Morton, Linus Torvalds,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, 4 Apr 2007, Andrea Arcangeli wrote:
> On Wed, Apr 04, 2007 at 01:45:06PM +0100, Hugh Dickins wrote:
> > I'm confused. CONFIG_ZERO_PAGE off is where we'd like to end up: how
> > would turning CONFIG_ZERO_PAGE on in -rc kernels help us to get there?
>
> He most certainly meant on by default.

Okay, I thought it more diplomatic to label myself as the confused one ;)

>
> I think if we do this, we also need a zeropage counter in the vm stats
> so that we'll get a measure of the waste and it'll be possible to
> identify apps to optimize/fix.

That's a little unfortunate, since we'd then have to lose the win from
this change, that we issue a writable zeroed page (when VM_WRITE) in
do_anonymous_page, even when it's a read fault, saving subsequent fault.

Wouldn't we? Or am I confused ;?

Hugh

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 13:32 ` Hugh Dickins
@ 2007-04-04 13:40 ` Andrea Arcangeli
0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 13:40 UTC (permalink / raw)
To: Hugh Dickins
Cc: Nick Piggin, Andrew Morton, Linus Torvalds,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 02:32:03PM +0100, Hugh Dickins wrote:
> That's a little unfortunate, since we'd then have to lose the win from
> this change, that we issue a writable zeroed page (when VM_WRITE) in
> do_anonymous_page, even when it's a read fault, saving subsequent fault.

Hmm no, that win would remain (and that win would only apply to the
class of apps that we intend to hurt by removing the zero-page
anyway). I think it's enough to increase a per-cpu counter in
do_anonymous_page if it's a read fault, and nothing else. We don't
need to keep track of the exact number of ZERO_PAGEs in the VM. Ideally
nothing should increase my counter, hence your "exact" counter would
always be zero too when everything is ok.

The only real win we'll lose with the counter is the removal of the
slow-path branch in do_anonymous_page, but I guess I'm more
comfortable being able to detect if something very inefficient ever
runs on my system.

^ permalink raw reply [flat|nested] 49+ messages in thread
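The counter proposed above lives in the kernel, but the waste it is
meant to expose can be approximated from user space. The following is a
rough sketch only, assuming Linux with /proc mounted; resident_pages and
rss_growth_after_read_faults are hypothetical helper names, not anything
from the patch. It read-faults a fresh anonymous mapping and reports how
much the resident set grew. The figure in /proc/self/statm is
approximate, and whether pages backed by the ZERO_PAGE are counted
depends on the kernel version, so treat the result as an indication
rather than a measurement.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Resident pages of the calling process (second field of statm), or -1. */
long resident_pages(void)
{
    long size, resident;
    FILE *f = fopen("/proc/self/statm", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld %ld", &size, &resident) != 2)
        resident = -1;
    fclose(f);
    return resident;
}

/*
 * Read-fault npages of fresh anonymous memory and store the change in
 * resident pages in *growth. Returns 0 on success, -1 on any failure
 * (including the unexpected case of reading a non-zero byte).
 */
int rss_growth_after_read_faults(size_t npages, long *growth)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * pagesz;
    volatile unsigned char *p;
    unsigned long sum = 0;
    long before, after;
    size_t i;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    before = resident_pages();
    for (i = 0; i < npages; i++)
        sum += p[i * pagesz]; /* read access only */
    after = resident_pages();

    munmap((void *)p, len);
    if (before < 0 || after < 0 || sum != 0)
        return -1;
    *growth = after - before;
    return 0;
}
```

A growth close to npages would suggest that each read fault was given
its own page; a growth near zero would suggest a shared zero page (or a
kernel that does not account it).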
* Re: [rfc] no ZERO_PAGE?
2007-04-04 3:37 ` [rfc] no ZERO_PAGE? Nick Piggin
2007-04-04 9:45 ` Hugh Dickins
@ 2007-04-04 15:35 ` Linus Torvalds
2007-04-04 15:48 ` Andrea Arcangeli
` (5 more replies)
1 sibling, 6 replies; 49+ messages in thread
From: Linus Torvalds @ 2007-04-04 15:35 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, Andrew Morton, Linux Memory Management List, tee,
holt, Andrea Arcangeli, Linux Kernel Mailing List

On Wed, 4 Apr 2007, Nick Piggin wrote:
>
> Shall I do a more complete patchset and ask Andrew to give it a
> run in -mm?

Do this trivial one first. See how it fares.

Although I don't know how much -mm will do for it. There is certainly not
going to be any correctness problems, afaik, just *performance* problems.
Does anybody do any performance testing on -mm?

That said, talking about correctness/performance problems:

> + page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> + if (likely(!pte_none(*page_table))) {
>     inc_mm_counter(mm, anon_rss);
>     lru_cache_add_active(page);
>     page_add_new_anon_rmap(page, vma, address);

Isn't that test the wrong way around?

Shouldn't it be

    if (likely(pte_none(*page_table))) {

without any logical negation? Was this patch tested?

Anyway, I'm not against this, but I can see somebody actually *wanting*
the ZERO page in some cases. I've used the fact for TLB testing, for
example, by just doing a big malloc(), and knowing that the kernel will
re-use the ZERO_PAGE so that I don't get any cache effects (well, at least
not any *physical* cache effects. Virtually indexed cached will still show
effects of it, of course, but I haven't cared).

That's an example of an app that actually cares about the page allocation
(or, in this case, the lack there-of). Not an important one, but maybe
there are important ones that care?

Linus

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:35 ` Linus Torvalds
@ 2007-04-04 15:48 ` Andrea Arcangeli
2007-04-04 16:09 ` Linus Torvalds
` (2 more replies)
2007-04-04 16:32 ` Eric Dumazet
` (4 subsequent siblings)
5 siblings, 3 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 15:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 08:35:30AM -0700, Linus Torvalds wrote:
> Anyway, I'm not against this, but I can see somebody actually *wanting*
> the ZERO page in some cases. I've used the fact for TLB testing, for
> example, by just doing a big malloc(), and knowing that the kernel will
> re-use the ZERO_PAGE so that I don't get any cache effects (well, at least
> not any *physical* cache effects. Virtually indexed cached will still show
> effects of it, of course, but I haven't cared).

Ok, those cases wanting the same zero page, could be fairly easily
converted to an mmap over /dev/zero (without having to run 4k large
mmap syscalls or nonlinear).

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:48 ` Andrea Arcangeli
@ 2007-04-04 16:09 ` Linus Torvalds
2007-04-04 16:23 ` Andrea Arcangeli
2007-04-04 16:10 ` Hugh Dickins
2007-04-04 22:07 ` Valdis.Kletnieks
2 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2007-04-04 16:09 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, 4 Apr 2007, Andrea Arcangeli wrote:
>
> Ok, those cases wanting the same zero page, could be fairly easily
> converted to an mmap over /dev/zero (without having to run 4k large
> mmap syscalls or nonlinear).

You're missing the point. What if it's something like oracle that has been
tuned for Linux using this? Or even an open-source app that is just used
by big places and they see performance problems but it's not obvious *why*.

We "know" why, because we're discussing this point. But two months from
now, when some random company complains to SuSE/RH/whatever that their app
runs 5% slower or uses 200% more swap, who is going to realize what caused
it?

THAT is the problem with patches like this. I'm not against it, but you
can't just dismiss it with "we can fix the app". We *cannot* fix the app
if we don't even realize what caused the problem..

Linus

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 16:09 ` Linus Torvalds
@ 2007-04-04 16:23 ` Andrea Arcangeli
0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 16:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 09:09:28AM -0700, Linus Torvalds wrote:
> You're missing the point. What if it's something like oracle that has been
> tuned for Linux using this? Or even an open-source app that is just used
> by big places and they see performace problems but it's not obvious *why*.
>
> We "know" why, because we're discussing this point. But two months from
> now, when some random company complains to SuSE/RH/whatever that their app
> runs 5% slower or uses 200% more swap, who is going to realize what caused
> it?

No, I'm not missing the point. I was the first to say here that such
code has been there forever, and in turn I'm worried about apps
depending on it for all the wrong reasons. I even went as far as
asking for a counter to keep the waste from going unnoticed, and last
but not least, that's why I'm not discussing this as an internal suse
fix for the scalability issue, but only as a mainline patch for -mm.

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:48 ` Andrea Arcangeli
2007-04-04 16:09 ` Linus Torvalds
@ 2007-04-04 16:10 ` Hugh Dickins
2007-04-04 16:31 ` Andrea Arcangeli
2007-04-04 22:07 ` Valdis.Kletnieks
2 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2007-04-04 16:10 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Nick Piggin, Andrew Morton,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, 4 Apr 2007, Andrea Arcangeli wrote:
> On Wed, Apr 04, 2007 at 08:35:30AM -0700, Linus Torvalds wrote:
> > Anyway, I'm not against this, but I can see somebody actually *wanting*
> > the ZERO page in some cases. I've used the fact for TLB testing, for
> > example, by just doing a big malloc(), and knowing that the kernel will
> > re-use the ZERO_PAGE so that I don't get any cache effects (well, at least
> > not any *physical* cache effects. Virtually indexed cached will still show
> > effects of it, of course, but I haven't cared).
>
> Ok, those cases wanting the same zero page, could be fairly easily
> converted to an mmap over /dev/zero

No, MAP_SHARED mmap of /dev/zero uses shmem, which allocates distinct
pages for this (because in general tmpfs doesn't know if a readonly
file will be written to later on), and MAP_PRIVATE mmap of /dev/zero
uses the zeromap stuff which we were hoping to eliminate too
(though not in Nick's initial patch).

Looks like a job for /dev/same_page_over_and_over_again.

> (without having to run 4k large mmap syscalls or nonlinear).

You scared me, I made no sense of that at first: ah yes,
repeatedly mmap'ing the same page can be done those ways.

Hugh

^ permalink raw reply [flat|nested] 49+ messages in thread
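For reference, the MAP_PRIVATE mapping of /dev/zero discussed here looks
like this from user space. This is a sketch under the assumption of a
Linux system with /dev/zero available; map_dev_zero_private and
check_dev_zero_private are hypothetical names used only for
illustration. Which physical page backs the untouched part of the
mapping is a kernel implementation detail that user space cannot
observe directly; the code only shows the visible semantics: reads
return zeros, and a write affects just the page written.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map len bytes of /dev/zero privately; returns MAP_FAILED on error. */
void *map_dev_zero_private(size_t len)
{
    void *p;
    int fd = open("/dev/zero", O_RDONLY);

    if (fd < 0)
        return MAP_FAILED;
    /* A private writable mapping only needs read access to the file. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after close() */
    return p;
}

/*
 * Returns 0 if a two-page mapping reads as zeros and a write to page 0
 * leaves page 1 untouched, -1 otherwise.
 */
int check_dev_zero_private(void)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = map_dev_zero_private(2 * pagesz);
    int ok;

    if (p == MAP_FAILED)
        return -1;

    ok = (p[0] == 0 && p[pagesz] == 0); /* read faults */
    p[0] = 0x55;                        /* write fault on page 0 only */
    ok = ok && p[0] == 0x55 && p[pagesz] == 0;

    munmap(p, 2 * pagesz);
    return ok ? 0 : -1;
}
```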
* Re: [rfc] no ZERO_PAGE?
2007-04-04 16:10 ` Hugh Dickins
@ 2007-04-04 16:31 ` Andrea Arcangeli
0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 16:31 UTC (permalink / raw)
To: Hugh Dickins
Cc: Linus Torvalds, Nick Piggin, Andrew Morton,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 05:10:37PM +0100, Hugh Dickins wrote:
> file will be written to later on), and MAP_PRIVATE mmap of /dev/zero

Obviously I meant MAP_PRIVATE of /dev/zero, since it's the only one
backed by the zero page.

> uses the zeromap stuff which we were hoping to eliminate too
> (though not in Nick's initial patch).

I didn't realize you wanted to eliminate it too.

> Looks like a job for /dev/same_page_over_and_over_again.
>
> > (without having to run 4k large mmap syscalls or nonlinear).
>
> You scared me, I made no sense of that at first: ah yes,
> repeatedly mmap'ing the same page can be done those ways.

Yep, which is probably why we don't need the
/dev/same_page_over_and_over_again for that.

Overall the worry about the TLB benchmarking apps being broken in
their measurements sounds very minor compared to the risk of wasting
tons of ram and going out of memory. If there was no risk of bad
breakage we wouldn't need to discuss this.

^ permalink raw reply [flat|nested] 49+ messages in thread
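The "4k large mmap syscalls" approach mentioned above, mapping one and
the same page at many consecutive addresses, can be sketched as follows.
This is an illustration only, assuming Linux/POSIX and a writable
temporary directory; it uses an ordinary one-page temporary file rather
than the kernel's ZERO_PAGE, and map_same_page_repeatedly and
check_same_page are hypothetical names. The cost the thread alludes to
is visible in the loop: one mmap() call, and one VMA, per page.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS, pwrite, ftruncate */
#endif
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Back npages of contiguous address space with file page 0 of fd,
 * using one page-sized MAP_FIXED mmap() per page.
 * Returns the base address, or MAP_FAILED.
 */
void *map_same_page_repeatedly(int fd, size_t npages)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *base;
    size_t i;

    /* Reserve the range first so MAP_FIXED cannot clobber other mappings. */
    base = mmap(NULL, npages * pagesz, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return MAP_FAILED;

    for (i = 0; i < npages; i++) {
        void *p = mmap(base + i * pagesz, pagesz, PROT_READ,
                       MAP_SHARED | MAP_FIXED, fd, 0);
        if (p == MAP_FAILED) {
            munmap(base, npages * pagesz);
            return MAP_FAILED;
        }
    }
    return base;
}

/* Returns 0 if one write to the file is visible at every address. */
int check_same_page(void)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = 16, i;
    unsigned char byte = 0x5a;
    unsigned char *base;
    FILE *f = tmpfile();
    int fd, ok = 1;

    if (!f)
        return -1;
    fd = fileno(f);
    if (ftruncate(fd, (off_t)pagesz) != 0) {
        fclose(f);
        return -1;
    }
    base = map_same_page_repeatedly(fd, npages);
    if (base == MAP_FAILED) {
        fclose(f);
        return -1;
    }
    if (pwrite(fd, &byte, 1, 0) != 1)
        ok = 0;
    for (i = 0; i < npages; i++)
        if (base[i * pagesz] != byte)
            ok = 0;

    munmap(base, npages * pagesz);
    fclose(f);
    return ok ? 0 : -1;
}
```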
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:48 ` Andrea Arcangeli
2007-04-04 16:09 ` Linus Torvalds
2007-04-04 16:10 ` Hugh Dickins
@ 2007-04-04 22:07 ` Valdis.Kletnieks
2 siblings, 0 replies; 49+ messages in thread
From: Valdis.Kletnieks @ 2007-04-04 22:07 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 283 bytes --]

On Wed, 04 Apr 2007 17:48:39 +0200, Andrea Arcangeli said:
> Ok, those cases wanting the same zero page, could be fairly easily
> converted to an mmap over /dev/zero (without having to run 4k large
> mmap syscalls or nonlinear).

"D'oh!" -- H. Simpson.

Ignore my previous note. :)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:35 ` Linus Torvalds
2007-04-04 15:48 ` Andrea Arcangeli
@ 2007-04-04 16:32 ` Eric Dumazet
2007-04-04 17:02 ` Linus Torvalds
2007-04-04 19:15 ` Andrew Morton
` (3 subsequent siblings)
5 siblings, 1 reply; 49+ messages in thread
From: Eric Dumazet @ 2007-04-04 16:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Andrea Arcangeli,
Linux Kernel Mailing List

On Wed, 4 Apr 2007 08:35:30 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Anyway, I'm not against this, but I can see somebody actually *wanting*
> the ZERO page in some cases. I've used the fact for TLB testing, for
> example, by just doing a big malloc(), and knowing that the kernel will
> re-use the ZERO_PAGE so that I don't get any cache effects (well, at least
> not any *physical* cache effects. Virtually indexed cached will still show
> effects of it, of course, but I haven't cared).
>
> That's an example of an app that actually cares about the page allocation
> (or, in this case, the lack there-of). Not an important one, but maybe
> there are important ones that care?

I don't know if this small prog is of any interest.

But results on an Intel Pentium-M are interesting, in particular 2) & 3).

If a page is first allocated as page_zero then cow to a full rw page,
this is more expensive (2660 cycles instead of 2300).

Is there an app somewhere that depends on 2) being ultra-fast but then
future write accesses *slow* ???

$ ./page_bench >RES; cat RES
1) pagefault to bring a rw page:
Poke (addr=0x804c000): 2360 cycles
1) pagefault to bring a rw page:
Poke (addr=0x804d000): 2368 cycles
1) pagefault to bring a rw page:
Poke (addr=0x804e000): 2120 cycles
2) pagefault to bring a zero page, readonly
Peek(addr=0x804f000): ->0 891 cycles
3) pagefault to make this page rw
Poke (addr=0x804f000): 2660 cycles
1) pagefault to bring a rw page:
Poke (addr=0x8050000): 2099 cycles
1) pagefault to bring a rw page:
Poke (addr=0x8051000): 2062 cycles
4) memset 4096 bytes to 0x55:
Poke_full (addr=0x804f000, len=4096): 2719 cycles
5) fill the whole table
Poke_full (addr=0x804c000, len=4194304): 6563661 cycles
6) fill again whole table (no more faults, but cpu cache too small)
Poke_full (addr=0x804c000, len=4194304): 5188925 cycles
7.1) faulting a mmap zone, read access
Peek(addr=0xb7f8a000): ->0 40453 cycles
8.1) faulting a mmap zone, write access
Poke (addr=0xb7f89000): 10599 cycles
7.2) faulting a mmap zone, read access
Peek(addr=0xb7f88000): ->0 8167 cycles
8.3) faulting a mmap zone, write access
Poke (addr=0xb7f87000): 5701 cycles

$ cat page_bench.c
# include <errno.h>
# include <stdlib.h>
# include <unistd.h>
# include <fcntl.h>
# include <stdio.h>
# include <sys/time.h>
# include <time.h>
# include <sys/mman.h>
# include <string.h>

/* Read the CPU timestamp counter into val (x86 only). */
#ifdef __x86_64
#define rdtscll(val) do { \
	unsigned int __a,__d; \
	asm volatile("rdtsc" : "=a" (__a), "=d" (__d)); \
	(val) = ((unsigned long)__a) | (((unsigned long)__d)<<32); \
} while(0)
#elif __i386
#define rdtscll(val) \
	__asm__ __volatile__("rdtsc" : "=A" (val))
#endif

int var;
int *addr1, *addr2, *addr3, *addr4;

/* Create nb one-page mappings; remember the first four addresses. */
void map_many_vmas(unsigned int nb)
{
	size_t sz = getpagesize();
	int ui;

	for (ui = 0 ; ui < nb ; ui++) {
		void *p = mmap(NULL, sz,
			(ui == 0) ? PROT_READ : PROT_READ|PROT_WRITE,
			(ui & 1) ? MAP_PRIVATE|MAP_ANONYMOUS : MAP_ANONYMOUS|MAP_SHARED,
			-1, 0);
		if (p == (void *)-1) {
			fprintf(stderr, "Only %u mappings could be set\n", ui);
			break;
		}
		if (!addr1)
			addr1 = (int *)p;
		else if (!addr2)
			addr2 = (int *)p;
		else if (!addr3)
			addr3 = (int *)p;
		else if (!addr4)
			addr4 = (int *)p;
	}
}

/* Dump /proc/self/maps to stderr. */
void show_maps()
{
	char buffer[4096];
	int fd, lu;

	fd = open("/proc/self/maps", 0);
	if (fd != -1) {
		while ((lu = read(fd, buffer, sizeof(buffer))) > 0)
			write(2, buffer, lu);
		close(fd);
	}
}

/* Time a single int store (write fault if the page is not yet writable). */
void poke_int(void *addr, int val)
{
	unsigned long long start, end;
	long delta;

	rdtscll(start);
	*(int *)addr = val;
	rdtscll(end);
	delta = (end - start);
	printf("Poke (addr=%p): %ld cycles\n", addr, delta);
}

/* Time a memset over len bytes. */
void poke_full(void *addr, int val, int len)
{
	unsigned long long start, end;
	long delta;

	rdtscll(start);
	memset(addr, val, len);
	rdtscll(end);
	delta = (end - start);
	printf("Poke_full (addr=%p, len=%d): %ld cycles\n", addr, len, delta);
}

/* Time a single int load (read fault if the page is not yet mapped). */
int peek_int(void *addr)
{
	unsigned long long start, end;
	long delta;
	int val;

	rdtscll(start);
	val = *(int *)addr;
	rdtscll(end);
	delta = (end - start);
	printf("Peek(addr=%p): ->%d %ld cycles\n", addr, val, delta);
	return val;
}

/* 4 MB of page-aligned BSS. */
int big_table[1024*1024] __attribute__((aligned(4096)));

void usage(int code)
{
	fprintf(stderr, "Usage : page_bench [-m mappings]\n");
	exit(code);
}

int main(int argc, char *argv[])
{
	unsigned int nb_mappings = 200;
	int c;

	while ((c = getopt(argc, argv, "Vm:")) != EOF) {
		if (c == 'm')
			nb_mappings = atoi(optarg);
		else if (c == 'V')
			usage(0);
	}
	if (nb_mappings < 4)
		nb_mappings = 4;
	map_many_vmas(nb_mappings);
	// show_maps();

	printf("1) pagefault to bring a rw page:\n") ;
	poke_int(&big_table[0], 10);
	printf("1) pagefault to bring a rw page:\n") ;
	poke_int(&big_table[1024], 10);
	printf("1) pagefault to bring a rw page:\n") ;
	poke_int(&big_table[2048], 10);

	printf("2) pagefault to bring a zero page, readonly\n");
	peek_int(&big_table[3*1024]);
	printf("3) pagefault to make this page rw\n");
	poke_int(&big_table[3*1024], 10);

	printf("1) pagefault to bring a rw page:\n") ;
	poke_int(&big_table[4*1024], 10);
	printf("1) pagefault to bring a rw page:\n") ;
	poke_int(&big_table[5*1024], 10);

	printf("4) memset 4096 bytes to 0x55:\n");
	poke_full(&big_table[3*1024], 0x55, 4096);

	printf("5) fill the whole table\n");
	poke_full(big_table, 1, sizeof(big_table));

	printf("6) fill again whole table (no more faults, but cpu cache too small)\n");
	poke_full(big_table, 1, sizeof(big_table));

	printf("7.1) faulting a mmap zone, read access\n");
	peek_int(addr1);
	printf("8.1) faulting a mmap zone, write access\n");
	poke_int(addr2, 10);
	printf("7.2) faulting a mmap zone, read access\n");
	peek_int(addr3);
	printf("8.3) faulting a mmap zone, write access\n");
	poke_int(addr4, 10);
	return 0;
}
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 16:32 ` Eric Dumazet
@ 2007-04-04 17:02 ` Linus Torvalds
0 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2007-04-04 17:02 UTC (permalink / raw)
To: Eric Dumazet
Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Andrea Arcangeli,
Linux Kernel Mailing List

On Wed, 4 Apr 2007, Eric Dumazet wrote:
>
> But results on an Intel Pentium-M are interesting, in particular 2) & 3)
>
> If a page is first allocated as page_zero then cow to a full rw page, this is more expensive.
> (2660 cycles instead of 2300)

Yes, you have an extra TLB flush there at a minimum (if the page didn't
exist at all before, you don't have to flush).

That said, the big cost tends to be the clearing of the page. Which is
why the "bring in zero page" is so much faster than anything else - it's
the only case that doesn't need to clear the page.

So you should basically think of your numbers like this:

 - roughly 900 cycles is the cost of the page fault and all the
   "basic software" side in the kernel

 - roughly 1400 cycles to actually do the "memset" to clear the page
   (and no, that's *not* the cost of memory accesses per se - it's very
   likely already in the L2 cache or similar, we just need to clear it
   and if it wasn't marked exclusive need to do a bus cycle to
   invalidate it on any other CPU's).

with small variation depending on what the state was before of the cache
in particular (for example, the TLB flush cost, but also: when you do

> 4) memset 4096 bytes to 0x55:
> Poke_full (addr=0x804f000, len=4096): 2719 cycles

This only adds ~600 cycles to memset the same 4kB that cost ~1400 cycles
before, but that's *probably* largely because it was now already dirty in
the L2 and possibly the L1, so it's quite possible that this is really
just a cache effect, because now it's entirely exclusive in the caches so
you don't need to do any probing on the bus at all).

Also note: in the end, page faults are usually fairly unusual. You do
them once, and then use the page a lot after that. That's not *always*
true, of course. Some malloc()/free() patterns of big areas that are not
used for long will easily cause constant mmap/munmap, and a lot of page
faults.

The worst effect of page faults tends to be for short-lived stuff.
Notably things like "system()" that executes a shell just to execute
something else. Almost *everything* in that path is basically "use once,
then throw away", and page fault latency is interesting.

So this is one case where it might be interesting to look at what lmbench
reports for the "fork/exit", "fork/exec" and "shell exec" numbers before
and after.

Linus
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:35 ` Linus Torvalds
2007-04-04 15:48 ` Andrea Arcangeli
2007-04-04 16:32 ` Eric Dumazet
@ 2007-04-04 19:15 ` Andrew Morton
2007-04-04 20:11 ` David Miller
` (2 subsequent siblings)
5 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2007-04-04 19:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Hugh Dickins, Linux Memory Management List, tee, holt,
Andrea Arcangeli, Linux Kernel Mailing List

On Wed, 4 Apr 2007 08:35:30 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Does anybody do any performance testing on -mm?

http://test.kernel.org/perf/index.html has pretty graphs of lots of kernel
versions for a few benchmarks. I'm not aware of any other organised effort
along those lines.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:35 ` Linus Torvalds
` (2 preceding siblings ...)
2007-04-04 19:15 ` Andrew Morton
@ 2007-04-04 20:11 ` David Miller
2007-04-04 20:50 ` Andrew Morton
` (2 more replies)
2007-04-04 22:05 ` Valdis.Kletnieks
2007-04-05 4:47 ` Nick Piggin
5 siblings, 3 replies; 49+ messages in thread
From: David Miller @ 2007-04-04 20:11 UTC (permalink / raw)
To: torvalds; +Cc: npiggin, hugh, akpm, linux-mm, tee, holt, andrea, linux-kernel

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 4 Apr 2007 08:35:30 -0700 (PDT)

> Anyway, I'm not against this, but I can see somebody actually *wanting*
> the ZERO page in some cases. I've used the fact for TLB testing, for
> example, by just doing a big malloc(), and knowing that the kernel will
> re-use the ZERO_PAGE so that I don't get any cache effects (well, at least
> not any *physical* cache effects. Virtually indexed cached will still show
> effects of it, of course, but I haven't cared).
>
> That's an example of an app that actually cares about the page allocation
> (or, in this case, the lack there-of). Not an important one, but maybe
> there are important ones that care?

If we're going to consider this seriously, there is a case I know of.
Look at flush_dcache_page()'s test for ZERO_PAGE() on sparc64, there
is an instructive comment:

	/* Do not bother with the expensive D-cache flush if it
	 * is merely the zero page. The 'bigcore' testcase in GDB
	 * causes this case to run millions of times.
	 */
	if (page == ZERO_PAGE(0))
		return;

basically what the GDB test case does is mmap() an enormous anonymous
area, not touch it, then dump core.

As I understand the patch being considered to remove ZERO_PAGE(), this
kind of core dump will cause a lot of pages to be allocated, probably
eating up a lot of system time as well as memory.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 20:11 ` David Miller
@ 2007-04-04 20:50 ` Andrew Morton
2007-04-05 2:03 ` Nick Piggin
2007-04-05 5:23 ` Andrea Arcangeli
2 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2007-04-04 20:50 UTC (permalink / raw)
To: David Miller
Cc: torvalds, npiggin, hugh, linux-mm, tee, holt, andrea, linux-kernel

On Wed, 04 Apr 2007 13:11:11 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> As I understand the patch being considered to remove ZERO_PAGE(), this
> kind of core dump will cause a lot of pages to be allocated, probably
> eating up a lot of system time as well as memory.

Point.

Also, what effect will the proposed changes have upon rss reporting,
and upon the numbers in /proc/pid/[s]maps?
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 20:11 ` David Miller
2007-04-04 20:50 ` Andrew Morton
@ 2007-04-05 2:03 ` Nick Piggin
2007-04-05 5:23 ` Andrea Arcangeli
2 siblings, 0 replies; 49+ messages in thread
From: Nick Piggin @ 2007-04-05 2:03 UTC (permalink / raw)
To: David Miller
Cc: torvalds, hugh, akpm, linux-mm, tee, holt, andrea, linux-kernel

On Wed, Apr 04, 2007 at 01:11:11PM -0700, David Miller wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Wed, 4 Apr 2007 08:35:30 -0700 (PDT)
>
> > Anyway, I'm not against this, but I can see somebody actually *wanting*
> > the ZERO page in some cases. I've used the fact for TLB testing, for
> > example, by just doing a big malloc(), and knowing that the kernel will
> > re-use the ZERO_PAGE so that I don't get any cache effects (well, at least
> > not any *physical* cache effects. Virtually indexed cached will still show
> > effects of it, of course, but I haven't cared).
> >
> > That's an example of an app that actually cares about the page allocation
> > (or, in this case, the lack there-of). Not an important one, but maybe
> > there are important ones that care?
>
> If we're going to consider this seriously, there is a case I know of.
> Look at flush_dcache_page()'s test for ZERO_PAGE() on sparc64, there
> is an instructive comment:
>
> 	/* Do not bother with the expensive D-cache flush if it
> 	 * is merely the zero page. The 'bigcore' testcase in GDB
> 	 * causes this case to run millions of times.
> 	 */
> 	if (page == ZERO_PAGE(0))
> 		return;
>
> basically what the GDB test case does is mmap() an enormous anonymous
> area, not touch it, then dump core.
>
> As I understand the patch being considered to remove ZERO_PAGE(), this
> kind of core dump will cause a lot of pages to be allocated, probably
> eating up a lot of system time as well as memory.

Yeah. Well it is trivial to leave ZERO_PAGE in get_user_pages, however
in the longer run it would be nice to get rid of ZERO_PAGE completely
so we need an alternative.

I've been working on a patch for core dumping that can detect unfaulted
anonymous memory and skip it without doing the ZERO_PAGE comparison.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 20:11 ` David Miller
2007-04-04 20:50 ` Andrew Morton
2007-04-05 2:03 ` Nick Piggin
@ 2007-04-05 5:23 ` Andrea Arcangeli
2 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-05 5:23 UTC (permalink / raw)
To: David Miller
Cc: torvalds, npiggin, hugh, akpm, linux-mm, tee, holt, linux-kernel

On Wed, Apr 04, 2007 at 01:11:11PM -0700, David S. Miller wrote:
> If we're going to consider this seriously, there is a case I know of.
> Look at flush_dcache_page()'s test for ZERO_PAGE() on sparc64, there
> is an instructive comment:
>
> 	/* Do not bother with the expensive D-cache flush if it
> 	 * is merely the zero page. The 'bigcore' testcase in GDB
> 	 * causes this case to run millions of times.
> 	 */
> 	if (page == ZERO_PAGE(0))
> 		return;
>
> basically what the GDB test case does is mmap() an enormous anonymous
> area, not touch it, then dump core.
>
> As I understand the patch being considered to remove ZERO_PAGE(), this
> kind of core dump will cause a lot of pages to be allocated, probably
> eating up a lot of system time as well as memory.

Well, if we leave the zero page in because there may be too many apps
to optimize, we still have to fix the zero page handling. Current code
is far from ideal. Currently the zero page scales worse than
no-zero-page. At the very least all the page count/mapcount
increase/decrease at every map-in/zap must be dropped from memory.c,
otherwise two totally unrelated gdb running at the same time (or gdb
at the same time as fortran, or two unrelated fortran apps) will badly
thrash over the zero page reference counting.

Besides the backwards compatibility argument with gdb or similar apps
I doubt the zero page is a really worthwhile optimization and I guess
we'd be better off if it never existed.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:35 ` Linus Torvalds
` (3 preceding siblings ...)
2007-04-04 20:11 ` David Miller
@ 2007-04-04 22:05 ` Valdis.Kletnieks
2007-04-05 0:27 ` Linus Torvalds
2007-04-05 4:47 ` Nick Piggin
5 siblings, 1 reply; 49+ messages in thread
From: Valdis.Kletnieks @ 2007-04-04 22:05 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Andrea Arcangeli,
Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1137 bytes --]

On Wed, 04 Apr 2007 08:35:30 PDT, Linus Torvalds said:

> Although I don't know how much -mm will do for it. There is certainly not
> going to be any correctness problems, afaik, just *performance* problems.
> Does anybody do any performance testing on -mm?

I have to admit I don't do anything more definite than "wow, this goes oink"...

> That's an example of an app that actually cares about the page allocation
> (or, in this case, the lack there-of). Not an important one, but maybe
> there are important ones that care?

I'd not be surprised if there's sparse-matrix code out there that wants to
malloc a *huge* array (like a 1025x1025 array of numbers) that then only
actually *writes* to several hundred locations, and relies on the fact that
all the untouched pages read back all-zeros. Of course, said code is probably
buggy because it doesn't zero the whole thing because you don't usually know
if some other function already scribbled on that heap page.

This would probably be more interesting if we had a userspace API for
"Give me a metric buttload of zero page frames" that malloc() and friends
could leverage.....

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 22:05 ` Valdis.Kletnieks
@ 2007-04-05 0:27 ` Linus Torvalds
2007-04-05 1:25 ` Valdis.Kletnieks
2007-04-05 2:30 ` Nick Piggin
0 siblings, 2 replies; 49+ messages in thread
From: Linus Torvalds @ 2007-04-05 0:27 UTC (permalink / raw)
To: Valdis.Kletnieks
Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Andrea Arcangeli,
Linux Kernel Mailing List

On Wed, 4 Apr 2007, Valdis.Kletnieks@vt.edu wrote:
>
> I'd not be surprised if there's sparse-matrix code out there that wants to
> malloc a *huge* array (like a 1025x1025 array of numbers) that then only
> actually *writes* to several hundred locations, and relies on the fact that
> all the untouched pages read back all-zeros.

Good point. In fact, it doesn't need to be a malloc() - I remember people
doing this with Fortran programs and just having an absolutely incredibly
big BSS (with traditional Fortran, dynamic memory allocations are just not
done).

> Of course, said code is probably buggy because it doesn't zero the whole
> thing because you don't usually know if some other function already
> scribbled on that heap page.

Sure you do. If glibc used mmap() or brk(), it *knows* the new data is
zero. So if you use calloc(), for example, it's entirely possible that
a good libc wouldn't waste time zeroing it.

The same is true of BSS. You never clear the BSS with a memset, you just
know it starts out zeroed.

Linus
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-05 0:27 ` Linus Torvalds
@ 2007-04-05 1:25 ` Valdis.Kletnieks
2007-04-05 2:30 ` Nick Piggin
1 sibling, 0 replies; 49+ messages in thread
From: Valdis.Kletnieks @ 2007-04-05 1:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Andrea Arcangeli,
Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 903 bytes --]

On Wed, 04 Apr 2007 17:27:31 PDT, Linus Torvalds said:

> Sure you do. If glibc used mmap() or brk(), it *knows* the new data is
> zero. So if you use calloc(), for example, it's entirely possible that
> a good libc wouldn't waste time zeroing it.

Right. However, the *user* code usually has no idea about the previous
history - so if it uses malloc(), it should be doing something like:

	ptr = malloc(my_size*sizeof(whatever));
	memset(ptr, 0, my_size*sizeof(whatever));

So malloc does something clever to guarantee that it's zero, and then
userspace undoes the cleverness because it has no easy way to *know* that
cleverness happened.

Admittedly, calloc() *can* get away with being clever. I know we have some
glibc experts lurking here - any of them want to comment on how smart calloc()
actually is, or how smart it can become without needing major changes to the
rest of the malloc() and friends?

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-05 0:27 ` Linus Torvalds
2007-04-05 1:25 ` Valdis.Kletnieks
@ 2007-04-05 2:30 ` Nick Piggin
2007-04-05 5:37 ` William Lee Irwin III
1 sibling, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2007-04-05 2:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Valdis.Kletnieks, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Andrea Arcangeli,
Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 05:27:31PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 4 Apr 2007, Valdis.Kletnieks@vt.edu wrote:
> >
> > I'd not be surprised if there's sparse-matrix code out there that wants to
> > malloc a *huge* array (like a 1025x1025 array of numbers) that then only
> > actually *writes* to several hundred locations, and relies on the fact that
> > all the untouched pages read back all-zeros.
>
> Good point. In fact, it doesn't need to be a malloc() - I remember people
> doing this with Fortran programs and just having an absolutely incredibly
> big BSS (with traditional Fortran, dynamic memory allocations are just not
> done).

Sparse matrices are one thing I worry about. I don't know enough about
HPC code to know whether they will be a problem. I know there exist
data structures to optimise sparse matrix storage...
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-05 2:30 ` Nick Piggin
@ 2007-04-05 5:37 ` William Lee Irwin III
2007-04-05 17:23 ` Valdis.Kletnieks
0 siblings, 1 reply; 49+ messages in thread
From: William Lee Irwin III @ 2007-04-05 5:37 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Valdis.Kletnieks, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Andrea Arcangeli,
Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 05:27:31PM -0700, Linus Torvalds wrote:
>> Good point. In fact, it doesn't need to be a malloc() - I remember people
>> doing this with Fortran programs and just having an absolutely incredibly
>> big BSS (with traditional Fortran, dynamic memory allocations are just not
>> done).

On Thu, Apr 05, 2007 at 04:30:26AM +0200, Nick Piggin wrote:
> Sparse matrices are one thing I worry about. I don't know enough about
> HPC code to know whether they will be a problem. I know there exist
> data structures to optimise sparse matrix storage...

\begin{admission-against-interest}

Sparse matrix code goes to extreme lengths to avoid ever looking at
substantial numbers of zero floating point matrix and vector entries.
In extreme cases, hashing and various sorts of heavyweight data
structures are used to represent highly irregular structures. At various
times the matrix is not even explicitly formed. Most typical are cases
like band diagonal matrices where storage is allocated only for the
nonzero diagonals. The entire purpose of sparse algorithms is to avoid
examining or even allocating zeros.

The actual phenomenon of concern here is dense matrix code with sparse
matrix inputs. The matrices will typically not be vast but may span 1MB
or so of RAM (1024x1024 is 1M*sizeof(double), and various dense matrix
algorithms target ca. 300x300). Most of the time this will arise from
the use of dense matrix code as black box solvers called as a library
by programs not terribly concerned about efficiency until something
gets explosively inefficient (and maybe not even then), or otherwise
numerically naive programs. This, however, is arguably the majority of
the usage cases by end-user invocations, so beware, though not too much.

I'd be more concerned about large hashtables sparsely used for the
purposes of adjacency detection and other cases where large time vs.
space tradeoffs are made for probabilistic reasons involving collisions.

\end{admission-against-interest}

-- wli
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-05 5:37 ` William Lee Irwin III
@ 2007-04-05 17:23 ` Valdis.Kletnieks
0 siblings, 0 replies; 49+ messages in thread
From: Valdis.Kletnieks @ 2007-04-05 17:23 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Nick Piggin, Linus Torvalds, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Andrea Arcangeli,
Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1580 bytes --]

On Wed, 04 Apr 2007 22:37:29 PDT, William Lee Irwin III said:

> The actual phenomenon of concern here is dense matrix code with sparse
> matrix inputs. The matrices will typically not be vast but may span 1MB
> or so of RAM (1024x1024 is 1M*sizeof(double), and various dense matrix
> algorithms target ca. 300x300). Most of the time this will arise from
> the use of dense matrix code as black box solvers called as a library
> by programs not terribly concerned about efficiency until something
> gets explosively inefficient (and maybe not even then), or otherwise
> numerically naive programs. This, however, is arguably the majority of
> the usage cases by end-user invocations, so beware, though not too much.

Amen, brother! :)

At least in my environment, the vast majority of matrix code is actually run
by graduate students under the direction of whatever professor is the Principal
Investigator on the grant. As a rule, you can expect the grad student to know
about rounding errors and convergence issues and similar program *correctness*
factors. But it's the rare one that has much interest in program *efficiency*.
If it takes 2 days to run, that's 2 days they can go get another few pages of
thesis written while they wait. :)

The code that gets on our SystemX (a top-50 supercomputer still) is usually
well-tweaked for efficiency. However, that's just one system - there's on the
order of several hundred smaller compute clusters and boxen and SGI-en on
campus where "protect the system from cargo-cult programming by grad students"
is a valid kernel goal. ;)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:35 ` Linus Torvalds
` (4 preceding siblings ...)
2007-04-04 22:05 ` Valdis.Kletnieks
@ 2007-04-05 4:47 ` Nick Piggin
5 siblings, 0 replies; 49+ messages in thread
From: Nick Piggin @ 2007-04-05 4:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Andrew Morton, Linux Memory Management List, tee, holt,
Andrea Arcangeli, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 08:35:30AM -0700, Linus Torvalds wrote:
>
>
> On Wed, 4 Apr 2007, Nick Piggin wrote:
> >
> > Shall I do a more complete patchset and ask Andrew to give it a
> > run in -mm?
>
> Do this trivial one first. See how it fares.

OK.

> Although I don't know how much -mm will do for it. There is certainly not
> going to be any correctness problems, afaik, just *performance* problems.
> Does anybody do any performance testing on -mm?
>
> That said, talking about correctness/performance problems:
>
> > +	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> > +	if (likely(!pte_none(*page_table))) {
> >  		inc_mm_counter(mm, anon_rss);
> >  		lru_cache_add_active(page);
> >  		page_add_new_anon_rmap(page, vma, address);
>
> Isn't that test the wrong way around?
>
> Shouldn't it be
>
> 	if (likely(pte_none(*page_table))) {
>
> without any logical negation? Was this patch tested?

Yeah, untested of course. I'm having problems booting my normal test box,
so the main point of the patch was to generate some discussion (which
worked! ;)).

Thanks,
Nick
^ permalink raw reply [flat|nested] 49+ messages in thread
end of thread, other threads:[~2007-04-05 17:23 UTC | newest]
Thread overview: 49+ messages
2007-03-29 7:58 [rfc][patch 1/2] mm: dont account ZERO_PAGE Nick Piggin
2007-03-29 7:58 ` [rfc][patch 2/2] mips: reinstate move_pte Nick Piggin
2007-03-29 17:49 ` Linus Torvalds
2007-03-29 13:10 ` [rfc][patch 1/2] mm: dont account ZERO_PAGE Hugh Dickins
2007-03-30 1:46 ` Nick Piggin
2007-03-30 2:59 ` Robin Holt
2007-03-30 3:09 ` Nick Piggin
2007-03-30 9:23 ` Robin Holt
2007-03-30 2:40 ` Nick Piggin
2007-04-04 3:37 ` [rfc] no ZERO_PAGE? Nick Piggin
2007-04-04 9:45 ` Hugh Dickins
2007-04-04 10:24 ` Nick Piggin
2007-04-04 12:27 ` Andrea Arcangeli
2007-04-04 13:55 ` Dan Aloni
2007-04-04 14:14 ` Andrea Arcangeli
2007-04-04 14:44 ` Dan Aloni
2007-04-04 15:03 ` Hugh Dickins
2007-04-04 15:34 ` Andrea Arcangeli
2007-04-04 15:41 ` Hugh Dickins
2007-04-04 16:07 ` Andrea Arcangeli
2007-04-04 16:14 ` Linus Torvalds
2007-04-04 15:27 ` Andrea Arcangeli
2007-04-04 16:15 ` Dan Aloni
2007-04-04 16:48 ` Andrea Arcangeli
2007-04-04 12:45 ` Hugh Dickins
2007-04-04 13:05 ` Andrea Arcangeli
2007-04-04 13:32 ` Hugh Dickins
2007-04-04 13:40 ` Andrea Arcangeli
2007-04-04 15:35 ` Linus Torvalds
2007-04-04 15:48 ` Andrea Arcangeli
2007-04-04 16:09 ` Linus Torvalds
2007-04-04 16:23 ` Andrea Arcangeli
2007-04-04 16:10 ` Hugh Dickins
2007-04-04 16:31 ` Andrea Arcangeli
2007-04-04 22:07 ` Valdis.Kletnieks
2007-04-04 16:32 ` Eric Dumazet
2007-04-04 17:02 ` Linus Torvalds
2007-04-04 19:15 ` Andrew Morton
2007-04-04 20:11 ` David Miller
2007-04-04 20:50 ` Andrew Morton
2007-04-05 2:03 ` Nick Piggin
2007-04-05 5:23 ` Andrea Arcangeli
2007-04-04 22:05 ` Valdis.Kletnieks
2007-04-05 0:27 ` Linus Torvalds
2007-04-05 1:25 ` Valdis.Kletnieks
2007-04-05 2:30 ` Nick Piggin
2007-04-05 5:37 ` William Lee Irwin III
2007-04-05 17:23 ` Valdis.Kletnieks
2007-04-05 4:47 ` Nick Piggin