[rfc][patch 1/2] mm: dont account ZERO

linux-mm.kvack.org archive mirror

* [rfc][patch 1/2] mm: dont account ZERO_PAGE
@ 2007-03-29  7:58 Nick Piggin
  2007-03-29  7:58 ` [rfc][patch 2/2] mips: reinstate move_pte Nick Piggin
  2007-03-29 13:10 ` [rfc][patch 1/2] mm: dont account ZERO_PAGE Hugh Dickins
  0 siblings, 2 replies; 49+ messages in thread
From: Nick Piggin @ 2007-03-29  7:58 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins, Linus Torvalds,
	Linux Memory Management List
  Cc: tee, holt

Special-case the ZERO_PAGE to prevent it from being accounted like a normal
mapped page. This is not illogical or unclean, because the ZERO_PAGE is
heavily special cased through the page fault path.

This requires Carsten Otte's filemap_xip patch, as well as restoring the
move_pte function for MIPS which was removed after I noticed it didn't
handle the ZERO_PAGE accounting correctly (which is not an issue after
this patch).

A test-case which took over 2 hours to complete on a 1024 core Altix
takes around 2 seconds afterward.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -479,7 +479,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	pte = pte_mkold(pte);
 
 	page = vm_normal_page(vma, addr, pte);
-	if (page) {
+	if (likely(page && page != ZERO_PAGE(addr))) {
 		get_page(page);
 		page_dup_rmap(page);
 		rss[!!PageAnon(page)]++;
@@ -665,7 +665,7 @@ static unsigned long zap_pte_range(struc
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
 			tlb_remove_tlb_entry(tlb, pte, addr);
-			if (unlikely(!page))
+			if (unlikely(!page || page == ZERO_PAGE(addr)))
 				continue;
 			if (unlikely(details) && details->nonlinear_vma
 			    && linear_page_index(details->nonlinear_vma,
@@ -1125,9 +1125,6 @@ static int zeromap_pte_range(struct mm_s
 			pte++;
 			break;
 		}
-		page_cache_get(page);
-		page_add_file_rmap(page);
-		inc_mm_counter(mm, file_rss);
 		set_pte_at(mm, addr, pte, zero_pte);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
@@ -1629,7 +1626,7 @@ gotten:
 	 */
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (likely(pte_same(*page_table, orig_pte))) {
-		if (old_page) {
+		if (likely(old_page && old_page != ZERO_PAGE(address))) {
 			page_remove_rmap(old_page, vma);
 			if (!PageAnon(old_page)) {
 				dec_mm_counter(mm, file_rss);
@@ -1659,7 +1656,7 @@ gotten:
 	}
 	if (new_page)
 		page_cache_release(new_page);
-	if (old_page)
+	if (old_page && old_page != ZERO_PAGE(address))
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
@@ -2152,15 +2149,12 @@ static int do_anonymous_page(struct mm_s
 	} else {
 		/* Map the ZERO_PAGE - vm_page_prot is readonly */
 		page = ZERO_PAGE(address);
-		page_cache_get(page);
 		entry = mk_pte(page, vma->vm_page_prot);
 
 		ptl = pte_lockptr(mm, pmd);
 		spin_lock(ptl);
 		if (!pte_none(*page_table))
-			goto release;
-		inc_mm_counter(mm, file_rss);
-		page_add_file_rmap(page);
+			goto unlock;
 	}
 
 	set_pte_at(mm, address, page_table, entry);
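
[Editor's sketch, not part of the patch.] A minimal userspace model of the accounting change made by the copy_one_pte and zap_pte_range hunks above. The struct page and struct mm_counters below are toy stand-ins, and model_map() / model_unmap() are hypothetical names chosen for illustration; the real kernel paths do far more. The sketch only tries to show that, with the extra check, mapping and unmapping the shared zero page never writes to its counters, while ordinary pages stay balanced.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct page: only the two counters
 * that the patch stops touching for the zero page. */
struct page {
	long count;     /* models the page reference count */
	long mapcount;  /* models the rmap map count */
};

/* Toy stand-in for the per-mm rss counters. */
struct mm_counters {
	long rss;
};

static struct page zero_page;   /* models ZERO_PAGE(addr) */

/* Models the patched copy_one_pte(): account only real pages. */
static void model_map(struct mm_counters *mm, struct page *page)
{
	if (page != NULL && page != &zero_page) {
		page->count++;      /* get_page() */
		page->mapcount++;   /* page_dup_rmap() */
		mm->rss++;
	}
}

/* Models the patched zap_pte_range(): skip the zero page entirely. */
static void model_unmap(struct mm_counters *mm, struct page *page)
{
	if (page == NULL || page == &zero_page)
		return;
	page->mapcount--;   /* page_remove_rmap() */
	page->count--;      /* page_cache_release() */
	mm->rss--;
}
```

In this model, any number of model_map(&mm, &zero_page) calls leaves zero_page.count, zero_page.mapcount and mm.rss unchanged, so no shared counter is written on that path.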

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [rfc][patch 2/2] mips: reinstate move_pte
  2007-03-29  7:58 [rfc][patch 1/2] mm: dont account ZERO_PAGE Nick Piggin
@ 2007-03-29  7:58 ` Nick Piggin
  2007-03-29 17:49   ` Linus Torvalds
  2007-03-29 13:10 ` [rfc][patch 1/2] mm: dont account ZERO_PAGE Hugh Dickins
  1 sibling, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2007-03-29  7:58 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins, Linus Torvalds,
	Linux Memory Management List
  Cc: tee, holt

Restore move_pte for MIPS, so that any given virtual address vaddr that maps
a ZERO_PAGE will map ZERO_PAGE(vaddr).

This has a circular dependency on the previous patch, which normally means
they belong in the same patch, but I thought this case is clearer if split
out.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/include/asm-mips/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-mips/pgtable.h
+++ linux-2.6/include/asm-mips/pgtable.h
@@ -69,6 +69,16 @@ extern unsigned long zero_page_mask;
 #define ZERO_PAGE(vaddr) \
 	(virt_to_page((void *)(empty_zero_page + (((unsigned long)(vaddr)) & zero_page_mask))))
 
+#define __HAVE_ARCH_MOVE_PTE
+#define move_pte(pte, prot, old_addr, new_addr)				\
+({									\
+	pte_t newpte = (pte);						\
+	if (pte_present(pte) && 					\
+		pte_pfn(pte) == page_to_pfn(ZERO_PAGE(old_addr)))	\
+		newpte = mk_pte(ZERO_PAGE(new_addr), (prot));		\
+	newpte;								\
+})
+
 extern void paging_init(void);
 
 /*
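
[Editor's sketch, not part of the patch.] A userspace model of what the move_pte macro above is meant to preserve: a zero-page mapping at vaddr always points at ZERO_PAGE(vaddr), where the page is picked by the address bits selected by zero_page_mask. The page size, the number of zero pages and the base pfn below are assumptions picked for the example, and the model_* names are hypothetical; the model works on bare pfns rather than real ptes.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT    12u                     /* assumed 4 KiB pages */
#define MODEL_PAGE_SIZE     (1ul << MODEL_PAGE_SHIFT)
#define MODEL_ZERO_PAGES    8ul                     /* assumed number of zero pages */
#define MODEL_ZERO_BASE_PFN 0x100ul                 /* arbitrary pfn of the first one */

/* Mask selecting the address bits that choose a zero page,
 * mirroring the role of zero_page_mask. */
static const unsigned long model_zero_page_mask =
	(MODEL_ZERO_PAGES * MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);

/* Models page_to_pfn(ZERO_PAGE(vaddr)). */
static unsigned long model_zero_pfn(unsigned long vaddr)
{
	return MODEL_ZERO_BASE_PFN +
	       ((vaddr & model_zero_page_mask) >> MODEL_PAGE_SHIFT);
}

/* Models move_pte() on a bare pfn: if the old mapping was the zero
 * page for old_addr, switch to the zero page for new_addr; any other
 * mapping is moved unchanged. */
static unsigned long model_move_pfn(unsigned long pfn,
				    unsigned long old_addr,
				    unsigned long new_addr)
{
	if (pfn == model_zero_pfn(old_addr))
		return model_zero_pfn(new_addr);
	return pfn;
}
```

With these assumed constants, moving a zero-page mapping from 0x3000 to 0x5000 changes its pfn from 0x103 to 0x105, so the invariant "vaddr maps ZERO_PAGE(vaddr)" still holds after the move.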


* Re: [rfc][patch 1/2] mm: dont account ZERO_PAGE
  2007-03-29  7:58 [rfc][patch 1/2] mm: dont account ZERO_PAGE Nick Piggin
  2007-03-29  7:58 ` [rfc][patch 2/2] mips: reinstate move_pte Nick Piggin
@ 2007-03-29 13:10 ` Hugh Dickins
  2007-03-30  1:46   ` Nick Piggin
  2007-03-30  2:40   ` Nick Piggin
  1 sibling, 2 replies; 49+ messages in thread
From: Hugh Dickins @ 2007-03-29 13:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, tee, holt

On Thu, 29 Mar 2007, Nick Piggin wrote:
> 
> Special-case the ZERO_PAGE to prevent it from being accounted like a normal
> mapped page. This is not illogical or unclean, because the ZERO_PAGE is
> heavily special cased through the page fault path.

Thou dost protest too much!  By "heavily special cased through the page
fault path" you mean do_wp_page() uses a pre-zeroed page when it spots
it, instead of copying its data.  That's rather a different case.

Look, I don't have any vehement objection to exempting the ZERO_PAGE
from accounting: if you remember before, I just suggested it was of
questionable value to exempt it, and the exemption should be made a
separate patch.

But this patch is not complete, is it?  For example, fremap.c's
zap_pte?  I haven't checked further.  I was going to suggest you
should make ZERO_PAGEs fail vm_normal_page, but I guess do_wp_page
wouldn't behave very well then ;)  Tucking the tests away in some
vm_normal_page-like function might make them more acceptable.

> A test-case which took over 2 hours to complete on a 1024 core Altix
> takes around 2 seconds afterward.

Oh, it's easy to devise a test-case of that kind, but does it matter
in real life?  I admit that what most people run on their 1024-core
Altices will be significantly different from what I checked on my
laptop back then, but in my case use of the ZERO_PAGE didn't look
common enough to make special cases for.

You put forward a pagecache replication patch a few weeks ago.
That's what I expected to happen to the ZERO_PAGE, once NUMA folks
complained of the accounting.  Isn't that a better way to go?

Or is there some important app on the Altix which uses the
ZERO_PAGE so very much, that its interesting data remains shared
between nodes forever, and it's only its struct page cacheline
bouncing dirtily from one to another that slows things down?

Hugh


* Re: [rfc][patch 2/2] mips: reinstate move_pte
  2007-03-29  7:58 ` [rfc][patch 2/2] mips: reinstate move_pte Nick Piggin
@ 2007-03-29 17:49   ` Linus Torvalds
  0 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2007-03-29 17:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Hugh Dickins, Linux Memory Management List, tee, holt


On Thu, 29 Mar 2007, Nick Piggin wrote:
> 
> Restore move_pte for MIPS, so that any given virtual address vaddr that maps
> a ZERO_PAGE will map ZERO_PAGE(vaddr).

Why does this matter? Why do we even care about the page counts? I thought 
we long since agreed that reserved pages don't need to have page counts.

		Linus


* Re: [rfc][patch 1/2] mm: dont account ZERO_PAGE
  2007-03-29 13:10 ` [rfc][patch 1/2] mm: dont account ZERO_PAGE Hugh Dickins
@ 2007-03-30  1:46   ` Nick Piggin
  2007-03-30  2:59     ` Robin Holt
  2007-03-30  2:40   ` Nick Piggin
  1 sibling, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2007-03-30  1:46 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, tee, holt

On Thu, Mar 29, 2007 at 02:10:55PM +0100, Hugh Dickins wrote:
> On Thu, 29 Mar 2007, Nick Piggin wrote:
> > 
> > Special-case the ZERO_PAGE to prevent it from being accounted like a normal
> > mapped page. This is not illogical or unclean, because the ZERO_PAGE is
> > heavily special cased through the page fault path.
> 
> Thou dost protest too much!  By "heavily special cased through the page
> fault path" you mean do_wp_page() uses a pre-zeroed page when it spots
> it, instead of copying its data.  That's rather a different case.

That, and the use of the zero page _at all_ in the do_anonymous_page
and zeromap, and I guess our anti-wrapping hacks in the page allocator...
it is just done for a little optimisation, so I figure it wouldn't hurt
to optimise a bit more ;)

> Look, I don't have any vehement objection to exempting the ZERO_PAGE
> from accounting: if you remember before, I just suggested it was of
> questionable value to exempt it, and the exemption should be made a
> separate patch.
> 
> But this patch is not complete, is it?  For example, fremap.c's
> zap_pte?  I haven't checked further.  I was going to suggest you
> should make ZERO_PAGEs fail vm_normal_page, but I guess do_wp_page
> wouldn't behave very well then ;)  Tucking the tests away in some
> vm_normal_page-like function might make them more acceptable.

Yeah I was going to do that, but noted the do_wp_page thingy. I don't
know... it might be better though... vm_refcounted_page()?

> > A test-case which took over 2 hours to complete on a 1024 core Altix
> > takes around 2 seconds afterward.
> 
> Oh, it's easy to devise a test-case of that kind, but does it matter
> in real life?  I admit that what most people run on their 1024-core
> Altices will be significantly different from what I checked on my
> laptop back then, but in my case use of the ZERO_PAGE didn't look
> common enough to make special cases for.

Yeah I don't have access to the box, but it was a constructed test
of some kind. However this is basically a dead box situation... on
smaller systems we could still see performance improvements.

And the other thing is I'd like to be able to get rid of the wrapping
tests from the page allocator and PageReserved from the kernel entirely
at some point.

> You put forward a pagecache replication patch a few weeks ago.
> That's what I expected to happen to the ZERO_PAGE, once NUMA folks
> complained of the accounting.  Isn't that a better way to go?

Not sure how much remote memory access the ZERO_PAGE itself causes.
It is obviously readonly data, and itaniums have pretty big caches,
so it is more important to get rid of the bouncing cachelines.

Per node ZERO_PAGE could be a good idea, however you can still have
all pages come from a single node (eg. a forking server)...

> Or is there some important app on the Altix which uses the
> ZERO_PAGE so very much, that its interesting data remains shared
> between nodes forever, and it's only its struct page cacheline
> bouncing dirtily from one to another that slows things down?

Can't answer that. I think they are worried about this being hit in
the field.

Does the ZERO_PAGE help _any_ real workloads? It will cost an extra
fault any time you are not content with its interesting data. I
don't know why any performance critical app would read huge swaths
of zeroes, but there is probably a reason for it...


* Re: [rfc][patch 1/2] mm: dont account ZERO_PAGE
  2007-03-29 13:10 ` [rfc][patch 1/2] mm: dont account ZERO_PAGE Hugh Dickins
  2007-03-30  1:46   ` Nick Piggin
@ 2007-03-30  2:40   ` Nick Piggin
  2007-04-04  3:37     ` [rfc] no ZERO_PAGE? Nick Piggin
  1 sibling, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2007-03-30  2:40 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, tee, holt

On Thu, Mar 29, 2007 at 02:10:55PM +0100, Hugh Dickins wrote:
> 
> But this patch is not complete, is it?  For example, fremap.c's
> zap_pte?  I haven't checked further.  I was going to suggest you

Ah yes, nonlinear... thanks I missed that.

Well it would make life easier if we got rid of ZERO_PAGE completely,
which I definitely wouldn't complain about ;) It is much more likely
to cause noticeable performance loss in other areas though, so it is
not really a candidate for SLES at the moment.

But I would like to get something for mainline that everyone likes
whether that is vm_refcounted_page (which I just implemented and it
doesn't make things much cleaner, but I'll go with it); per-node
ZERO_PAGE; or whatever.
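
[Editor's sketch, not part of the original message.] The thread does not show the vm_refcounted_page implementation mentioned above, so the following is only a guess at its general shape, written as a userspace model. The idea, as discussed, is to tuck the ZERO_PAGE test into a vm_normal_page-like helper for the accounting paths, while a caller such as do_wp_page keeps using the plain lookup because it still needs to see the zero page. All model_* names and the toy struct page are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct page; the contents do not matter here. */
struct page {
	int id;
};

static struct page zero_page;   /* models ZERO_PAGE(addr) */

/* Stand-in for vm_normal_page(): in this model it simply returns the
 * page the pte points at, or NULL for a special mapping. */
static struct page *model_vm_normal_page(struct page *pte_target)
{
	return pte_target;
}

/* Hypothetical helper in the spirit of vm_refcounted_page(): like
 * vm_normal_page(), but also returns NULL for the zero page, so that
 * callers doing refcount/rmap/rss accounting skip it with a single
 * NULL test. */
static struct page *model_vm_refcounted_page(struct page *pte_target)
{
	struct page *page = model_vm_normal_page(pte_target);

	if (page == &zero_page)
		return NULL;
	return page;
}
```

Accounting paths would test the result of the second helper for NULL; a copy-on-write path would call the first one and compare against the zero page itself.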


* Re: [rfc][patch 1/2] mm: dont account ZERO_PAGE
  2007-03-30  1:46   ` Nick Piggin
@ 2007-03-30  2:59     ` Robin Holt
  2007-03-30  3:09       ` Nick Piggin
  0 siblings, 1 reply; 49+ messages in thread
From: Robin Holt @ 2007-03-30  2:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Andrew Morton, Linus Torvalds,
	Linux Memory Management List, tee, holt

On Fri, Mar 30, 2007 at 03:46:34AM +0200, Nick Piggin wrote:
> > Oh, it's easy to devise a test-case of that kind, but does it matter
> > in real life?  I admit that what most people run on their 1024-core
> > Altices will be significantly different from what I checked on my
> > laptop back then, but in my case use of the ZERO_PAGE didn't look
> > common enough to make special cases for.
> 
> Yeah I don't have access to the box, but it was a constructed test
> of some kind. However this is basically a dead box situation... on
> smaller systems we could still see performance improvements.

It was not a constructed test.  It was a test application which started
up and read one word from each page to fill the page tables (not sure
why that was done), then forked a process for each cpu.  At that point,
it was supposed to start doing computation using data from an NFS accessible
file.  Unfortunately, the file was not found so the application exited
and the machine locked up for hours.

Of course, they assumed something had gone wrong with the system and
repeated the test with the same result.

Thanks,
Robin
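
[Editor's sketch, not part of the original message.] A small userspace function that reproduces only the first step of the application described above: reading one word from each page of a fresh anonymous mapping. The fork-per-cpu step and the exit are deliberately left out, and the function name read_touch_pages is made up for this example. On a kernel with the do_anonymous_page code shown earlier in the thread, each such read fault would map the ZERO_PAGE.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Read one word from each page of a fresh anonymous private mapping.
 * Returns the number of non-zero words seen, or -1 if mmap() fails. */
static long read_touch_pages(size_t npages)
{
	long page_size = sysconf(_SC_PAGESIZE);
	size_t len = npages * (size_t)page_size;
	volatile unsigned long *word;
	long nonzero = 0;
	char *base;

	base = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	for (size_t i = 0; i < npages; i++) {
		/* A read access to an untouched page: a read fault. */
		word = (volatile unsigned long *)(base + i * (size_t)page_size);
		if (*word != 0)
			nonzero++;
	}

	munmap(base, len);
	return nonzero;
}
```

Every word read this way is zero, so the function returns 0 for any mapping that succeeds; the cost being discussed in the thread is in the kernel's handling of these faults and of the later teardown, not in the data.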


* Re: [rfc][patch 1/2] mm: dont account ZERO_PAGE
  2007-03-30  2:59     ` Robin Holt
@ 2007-03-30  3:09       ` Nick Piggin
  2007-03-30  9:23         ` Robin Holt
  0 siblings, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2007-03-30  3:09 UTC (permalink / raw)
  To: Robin Holt
  Cc: Hugh Dickins, Andrew Morton, Linus Torvalds,
	Linux Memory Management List, tee

On Thu, Mar 29, 2007 at 09:59:37PM -0500, Robin Holt wrote:
> On Fri, Mar 30, 2007 at 03:46:34AM +0200, Nick Piggin wrote:
> > > Oh, it's easy to devise a test-case of that kind, but does it matter
> > > in real life?  I admit that what most people run on their 1024-core
> > > Altices will be significantly different from what I checked on my
> > > laptop back then, but in my case use of the ZERO_PAGE didn't look
> > > common enough to make special cases for.
> > 
> > Yeah I don't have access to the box, but it was a constructed test
> > of some kind. However this is basically a dead box situation... on
> > smaller systems we could still see performance improvements.
> 
> It was not a constructed test.  It was a test application which started
> up and read one word from each page to fill the page tables (not sure
> why that was done), then forked a process for each cpu.  At that point,
> it was supposed to start doing computation using data from an NFS accessible
> file.  Unfortunately, the file was not found so the application exited
> and the machine locked up for hours.

Sorry, my mistake. Thanks for the clarification: this sounds like
something that will not be helped by per-node ZERO_PAGEs either.

So not typical, but something that we'd rather not fall over with.
I guess large ranges of zero pages could be quite common in startup
of HPC codes operating on large matrices.

Thanks,
Nick


* Re: [rfc][patch 1/2] mm: dont account ZERO_PAGE
  2007-03-30  3:09       ` Nick Piggin
@ 2007-03-30  9:23         ` Robin Holt
  0 siblings, 0 replies; 49+ messages in thread
From: Robin Holt @ 2007-03-30  9:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Robin Holt, Hugh Dickins, Andrew Morton, Linus Torvalds,
	Linux Memory Management List, tee

On Fri, Mar 30, 2007 at 05:09:12AM +0200, Nick Piggin wrote:
> > up and read one word from each page to fill the page tables (not sure
> > why that was done), then forked a process for each cpu.  At that point,
>
> So not typical, but something that we'd rather not fall over with.

I agree

> I guess large ranges of zero pages could be quite common in startup
> of HPC codes operating on large matrices.

The "not sure why that was done" was referring to this being exactly the
opposite of what a typical HPC application does.  Those tend to locate
themselves on the node which will use an address range and then write
touch each of the pages.

Thanks,
Robin


* [rfc] no ZERO_PAGE?
  2007-03-30  2:40   ` Nick Piggin
@ 2007-04-04  3:37     ` Nick Piggin
  2007-04-04  9:45       ` Hugh Dickins
  2007-04-04 15:35       ` Linus Torvalds
  0 siblings, 2 replies; 49+ messages in thread
From: Nick Piggin @ 2007-04-04  3:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, tee,
	holt, Andrea Arcangeli, Linux Kernel Mailing List

On Fri, Mar 30, 2007 at 04:40:48AM +0200, Nick Piggin wrote:
> 
> Well it would make life easier if we got rid of ZERO_PAGE completely,
> which I definitely wouldn't complain about ;)

So, what bad things (apart from my bugs in untested code) happen
if we do this? We can actually go further, and probably remove the
ZERO_PAGE completely (just need an extra get_user_pages flag or
something for the core dumping issue).

Shall I do a more complete patchset and ask Andrew to give it a
run in -mm?

--

ZERO_PAGE for anonymous pages seems to only be designed to help stupid
programs, so remove it. This solves issues with ZERO_PAGE refcounting
and NUMA un-awareness.

(Actually, not quite. We should also remove all the zeromap stuff that
also seems to not do much except help stupid programs).

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1613,16 +1613,10 @@ gotten:
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	if (old_page == ZERO_PAGE(address)) {
-		new_page = alloc_zeroed_user_highpage(vma, address);
-		if (!new_page)
-			goto oom;
-	} else {
-		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-		if (!new_page)
-			goto oom;
-		cow_user_page(new_page, old_page, address, vma);
-	}
+	new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+	if (!new_page)
+		goto oom;
+	cow_user_page(new_page, old_page, address, vma);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2130,52 +2124,33 @@ static int do_anonymous_page(struct mm_s
 	spinlock_t *ptl;
 	pte_t entry;
 
-	if (write_access) {
-		/* Allocate our own private page. */
-		pte_unmap(page_table);
+	/* Allocate our own private page. */
+	pte_unmap(page_table);
 
-		if (unlikely(anon_vma_prepare(vma)))
-			goto oom;
-		page = alloc_zeroed_user_highpage(vma, address);
-		if (!page)
-			goto oom;
+	if (unlikely(anon_vma_prepare(vma)))
+		return VM_FAULT_OOM;
+	page = alloc_zeroed_user_highpage(vma, address);
+	if (!page)
+		return VM_FAULT_OOM;
 
-		entry = mk_pte(page, vma->vm_page_prot);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	entry = mk_pte(page, vma->vm_page_prot);
+	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 
-		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-		if (!pte_none(*page_table))
-			goto release;
+	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (likely(pte_none(*page_table))) {
 		inc_mm_counter(mm, anon_rss);
 		lru_cache_add_active(page);
 		page_add_new_anon_rmap(page, vma, address);
-	} else {
-		/* Map the ZERO_PAGE - vm_page_prot is readonly */
-		page = ZERO_PAGE(address);
-		page_cache_get(page);
-		entry = mk_pte(page, vma->vm_page_prot);
-
-		ptl = pte_lockptr(mm, pmd);
-		spin_lock(ptl);
-		if (!pte_none(*page_table))
-			goto release;
-		inc_mm_counter(mm, file_rss);
-		page_add_file_rmap(page);
-	}
-
-	set_pte_at(mm, address, page_table, entry);
+		set_pte_at(mm, address, page_table, entry);
 
-	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, address, entry);
-	lazy_mmu_prot_update(entry);
-unlock:
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, address, entry);
+		lazy_mmu_prot_update(entry);
+	} else
+		page_cache_release(page);
 	pte_unmap_unlock(page_table, ptl);
+
 	return VM_FAULT_MINOR;
-release:
-	page_cache_release(page);
-	goto unlock;
-oom:
-	return VM_FAULT_OOM;
 }
 
 /*
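
[Editor's sketch, not part of the patch.] A userspace model of the pattern the rewritten do_anonymous_page follows: allocate the zeroed page without holding the page table lock, then take the lock and re-check the entry, installing the page only if the entry is still empty and releasing it otherwise. Judging by the code being replaced (which jumped to "release" when the pte was no longer none), that is the intended behaviour. The struct slot, the use of a pthread mutex for the lock and calloc() for the page, and the name model_anonymous_fault are all assumptions made for the model.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct slot {
	pthread_mutex_t lock;   /* models the page table lock (ptl) */
	void *page;             /* NULL models pte_none() */
};

/* Returns 1 if our page was installed, 0 if another fault got there
 * first (our page is freed), -1 on allocation failure. */
static int model_anonymous_fault(struct slot *s, size_t page_size)
{
	/* Allocate a zeroed page before taking the lock. */
	void *page = calloc(1, page_size);
	int installed = 0;

	if (page == NULL)
		return -1;              /* models VM_FAULT_OOM */

	pthread_mutex_lock(&s->lock);
	if (s->page == NULL) {          /* entry still empty: install */
		s->page = page;
		installed = 1;
	}
	pthread_mutex_unlock(&s->lock);

	if (!installed)
		free(page);             /* lost the race: drop our page */
	return installed;
}
```

A second fault on the same slot finds it populated, frees its own page and leaves the first page in place, which is the outcome the re-check exists to guarantee.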


* Re: [rfc] no ZERO_PAGE?
  2007-04-04  3:37     ` [rfc] no ZERO_PAGE? Nick Piggin
@ 2007-04-04  9:45       ` Hugh Dickins
  2007-04-04 10:24         ` Nick Piggin
  2007-04-04 15:35       ` Linus Torvalds
  1 sibling, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2007-04-04  9:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, tee,
	holt, Andrea Arcangeli, Linux Kernel Mailing List

On Wed, 4 Apr 2007, Nick Piggin wrote:
> On Fri, Mar 30, 2007 at 04:40:48AM +0200, Nick Piggin wrote:
> > 
> > Well it would make life easier if we got rid of ZERO_PAGE completely,
> > which I definitely wouldn't complain about ;)

Yes, I love this approach too.

> 
> So, what bad things (apart from my bugs in untested code) happen
> if we do this? We can actually go further, and probably remove the
> ZERO_PAGE completely (just need an extra get_user_pages flag or
> something for the core dumping issue).

Some things will go faster (no longer needing a separate COW fault
on the read-protected ZERO_PAGE), some things will go slower and use
more memory.  The open question is whether anyone will notice those
regressions: I'm hoping they won't, I'm afraid they will.  And though
we'll see each as a program doing "something stupid", as in the Altix
case Robin showed to drive us here, we cannot just ignore it.

> 
> Shall I do a more complete patchset and ask Andrew to give it a
> run in -mm?

I'd like you to: I didn't study the fragment below, it's really all
uses of the ZERO_PAGE that I'd like to see go, then we see who shouts.

It's quite likely that the patch would have to be reverted: don't
bother to remove the allocations of ZERO_PAGE in each architecture
at this stage, too much nuisance going back and forth on those.

Leave ZERO_PAGE as configurable, default off for testing, buried
somewhere like under EMBEDDED?  It's much more attractive just to
remove the old code, and reintroduce it if there's a demand; but
leaving it under config would make it easy to restore, and if
there's trouble with removing ZERO_PAGE, we might later choose
to disable it at the high end but enable it at the low.  What
would you prefer?

Hugh
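
[Editor's sketch, not part of the original message.] The trade-off described above can be put into a toy cost model that counts page faults and allocated pages for anonymous memory, with and without the ZERO_PAGE. The three workload categories and the struct and function names are assumptions for illustration; the model ignores everything else (fault latency, cache effects, reclaim), so it is arithmetic, not a benchmark.

```c
#include <assert.h>

/* Number of anonymous pages in each access pattern. */
struct workload {
	unsigned long read_only;        /* only ever read */
	unsigned long read_then_write;  /* read first, written later */
	unsigned long write_first;      /* first access is a write */
};

struct cost {
	unsigned long faults;   /* page faults taken */
	unsigned long pages;    /* private pages allocated */
};

/* With the ZERO_PAGE: a read fault maps the shared zero page for free,
 * but a later write needs a second (COW) fault. */
static struct cost cost_with_zero_page(struct workload w)
{
	struct cost c;

	c.faults = w.read_only + 2 * w.read_then_write + w.write_first;
	c.pages = w.read_then_write + w.write_first;
	return c;
}

/* Without the ZERO_PAGE: every first touch allocates a private zeroed
 * page, so there is exactly one fault and one page per touched page. */
static struct cost cost_without_zero_page(struct workload w)
{
	struct cost c;

	c.faults = w.read_only + w.read_then_write + w.write_first;
	c.pages = c.faults;
	return c;
}
```

For a workload of 100 read-only, 10 read-then-write and 50 write-first pages, the model gives 170 faults and 60 pages with the ZERO_PAGE, against 160 faults and 160 pages without it: fewer faults, more memory, as the message says.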


* Re: [rfc] no ZERO_PAGE?
  2007-04-04  9:45       ` Hugh Dickins
@ 2007-04-04 10:24         ` Nick Piggin
  2007-04-04 12:27           ` Andrea Arcangeli
  2007-04-04 12:45           ` Hugh Dickins
  0 siblings, 2 replies; 49+ messages in thread
From: Nick Piggin @ 2007-04-04 10:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, tee,
	holt, Andrea Arcangeli, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 10:45:39AM +0100, Hugh Dickins wrote:
> On Wed, 4 Apr 2007, Nick Piggin wrote:
> > On Fri, Mar 30, 2007 at 04:40:48AM +0200, Nick Piggin wrote:
> > > 
> > > Well it would make life easier if we got rid of ZERO_PAGE completely,
> > > which I definitely wouldn't complain about ;)
> 
> Yes, I love this approach too.
> 
> > 
> > So, what bad things (apart from my bugs in untested code) happen
> > if we do this? We can actually go further, and probably remove the
> > ZERO_PAGE completely (just need an extra get_user_pages flag or
> > something for the core dumping issue).
> 
> Some things will go faster (no longer needing a separate COW fault
> on the read-protected ZERO_PAGE), some things will go slower and use
> more memory.  The open question is whether anyone will notice those
> regressions: I'm hoping they won't, I'm afraid they will.  And though
> we'll see each as a program doing "something stupid", as in the Altix
> case Robin showed to drive us here, we cannot just ignore it.

Sure. Agreed.

> > Shall I do a more complete patchset and ask Andrew to give it a
> > run in -mm?
> 
> I'd like you to: I didn't study the fragment below, it's really all
> uses of the ZERO_PAGE that I'd like to see go, then we see who shouts.

Yeah, they are basically pretty trivial to remove. I'll do a more
complete patch now that I know you like the approach.

> It's quite likely that the patch would have to be reverted: don't
> bother to remove the allocations of ZERO_PAGE in each architecture
> at this stage, too much nuisance going back and forth on those.

OK.

> Leave ZERO_PAGE as configurable, default off for testing, buried
> somewhere like under EMBEDDED?  It's much more attractive just to
> remove the old code, and reintroduce it if there's a demand; but
> leaving it under config would make it easy to restore, and if
> there's trouble with removing ZERO_PAGE, we might later choose
> to disable it at the high end but enable it at the low.  What
> would you prefer?

Ooh, the one with more '-' signs in the diff ;)

No, you have a point, but if we have to ask people to recompile 
with CONFIG_ZERO_PAGE, then it isn't much harder to ask them to
apply a patch first.

But for a potential mainline merge, maybe starting with a CONFIG
option is a good idea -- defaulting to off, and we could start by
turning it on just in -rc kernels for a few releases, to get a bit
more confidence?
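
[Editor's sketch, not part of the original message.] One way a build-time switch like the CONFIG_ZERO_PAGE option floated above could select the read-fault policy, shown as a userspace model. No such option is defined anywhere in this thread's patches, so the macro, the enum and the function name are all hypothetical; the sketch only shows the shape of the conditional.

```c
#include <assert.h>

enum fault_action {
	MAP_SHARED_ZERO_PAGE,           /* old behaviour for read faults */
	ALLOC_PRIVATE_ZEROED_PAGE       /* behaviour for everything else */
};

/* Decide what an anonymous page fault should do.  CONFIG_ZERO_PAGE is
 * a hypothetical build option: when it is not defined, read faults
 * are treated exactly like write faults. */
static enum fault_action anon_fault_action(int write_access)
{
#ifdef CONFIG_ZERO_PAGE
	if (!write_access)
		return MAP_SHARED_ZERO_PAGE;
#else
	(void)write_access;             /* unused when the option is off */
#endif
	return ALLOC_PRIVATE_ZEROED_PAGE;
}
```

Built without the macro, both read and write faults return ALLOC_PRIVATE_ZEROED_PAGE; building with -DCONFIG_ZERO_PAGE would restore the shared zero page for read faults only.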


* Re: [rfc] no ZERO_PAGE?
  2007-04-04 10:24         ` Nick Piggin
@ 2007-04-04 12:27           ` Andrea Arcangeli
  2007-04-04 13:55             ` Dan Aloni
  2007-04-04 12:45           ` Hugh Dickins
  1 sibling, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 12:27 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Andrew Morton, Linus Torvalds,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 12:24:07PM +0200, Nick Piggin wrote:
> But for a potential mainline merge, maybe starting with a CONFIG
> option is a good idea -- defaulting to off, and we could start by
> turning it on just in -rc kernels for a few releases, to get a bit
> more confidence?

The only reason to do that is if there are many stupid apps pretending
to get meaningful information from pages that cannot contain any
information. The zero page in the anon page fault has been there
forever so...

Anyway I also like this approach as I immediately suggested it after
reading about the zero page scalability patches ;).


* Re: [rfc] no ZERO_PAGE?
  2007-04-04 10:24         ` Nick Piggin
  2007-04-04 12:27           ` Andrea Arcangeli
@ 2007-04-04 12:45           ` Hugh Dickins
  2007-04-04 13:05             ` Andrea Arcangeli
  1 sibling, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2007-04-04 12:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linus Torvalds, Linux Memory Management List, tee,
	holt, Andrea Arcangeli, Linux Kernel Mailing List

On Wed, 4 Apr 2007, Nick Piggin wrote:
> 
> No, you have a point, but if we have to ask people to recompile 
> with CONFIG_ZERO_PAGE, then it isn't much harder to ask them to
> apply a patch first.
> 
> But for a potential mainline merge, maybe starting with a CONFIG
> option is a good idea -- defaulting to off, and we could start by
> turning it on just in -rc kernels for a few releases, to get a bit
> more confidence?

I'm confused.  CONFIG_ZERO_PAGE off is where we'd like to end up: how
would turning CONFIG_ZERO_PAGE on in -rc kernels help us to get there?

Hugh


* Re: [rfc] no ZERO_PAGE?
  2007-04-04 12:45           ` Hugh Dickins
@ 2007-04-04 13:05             ` Andrea Arcangeli
  2007-04-04 13:32               ` Hugh Dickins
  0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 13:05 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, Andrew Morton, Linus Torvalds,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 01:45:06PM +0100, Hugh Dickins wrote:
> I'm confused.  CONFIG_ZERO_PAGE off is where we'd like to end up: how
> would turning CONFIG_ZERO_PAGE on in -rc kernels help us to get there?

He most certainly meant on by default.

I think if we do this, we also need a zeropage counter in the vm stats
so that we'll get a measure of the waste and it'll be possible to
identify apps to optimize/fix.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 13:05             ` Andrea Arcangeli
@ 2007-04-04 13:32               ` Hugh Dickins
  2007-04-04 13:40                 ` Andrea Arcangeli
  0 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2007-04-04 13:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Andrew Morton, Linus Torvalds,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, 4 Apr 2007, Andrea Arcangeli wrote:
> On Wed, Apr 04, 2007 at 01:45:06PM +0100, Hugh Dickins wrote:
> > I'm confused.  CONFIG_ZERO_PAGE off is where we'd like to end up: how
> > would turning CONFIG_ZERO_PAGE on in -rc kernels help us to get there?
> 
> He most certainly meant on by default.

Okay, I thought it more diplomatic to label myself as the confused one ;)

> 
> I think if we do this, we also need a zeropage counter in the vm stats
> so that we'll get a measure of the waste and it'll be possible to
> identify apps to optimize/fix.

That's a little unfortunate, since we'd then have to lose the win from
this change, that we issue a writable zeroed page (when VM_WRITE) in
do_anonymous_page, even when it's a read fault, saving a subsequent fault.

Wouldn't we?  Or am I confused ;?

Hugh


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 13:32               ` Hugh Dickins
@ 2007-04-04 13:40                 ` Andrea Arcangeli
  0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 13:40 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, Andrew Morton, Linus Torvalds,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 02:32:03PM +0100, Hugh Dickins wrote:
> That's a little unfortunate, since we'd then have to lose the win from
> this change, that we issue a writable zeroed page (when VM_WRITE) in
> do_anonymous_page, even when it's a read fault, saving a subsequent fault.

Hmm no, that win would remain (and that win would only apply to the
class of apps that we intend to hurt by removing the zero-page
anyway). I think it's enough to increase a per-cpu counter in
do_anonymous_page if it's a read fault, and nothing else. We don't
need to keep track of the exact number of ZERO_PAGEs in the
VM. Ideally nothing should increase my counter, hence your "exact"
counter would always be zero too when everything is ok.

The only real win we'll lose with the counter is the removal of the
slow-path branch in do_anonymous_page, but I guess I'm more
comfortable being able to detect if something very inefficient ever
runs on my system.
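
As a rough illustration of the counter idea above (a userspace sketch, not
kernel code and not taken from any patch in this thread): each CPU bumps its
own cache-line-aligned slot on a read fault, and a reader sums the slots only
when the statistic is wanted. NR_SLOTS, CACHELINE, count_read_fault() and
read_fault_total() are hypothetical names chosen for the example; a kernel
version would presumably use the kernel's own per-cpu facilities instead.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS  8    /* hypothetical number of CPUs */
#define CACHELINE 64   /* assumed cache line size in bytes */

/* One counter per CPU, aligned so that no two slots share a cache line. */
struct fault_slot {
    _Alignas(CACHELINE) unsigned long count;
};

struct fault_slot read_fault_slots[NR_SLOTS];

/* Called from the (modelled) anonymous fault path on CPU `cpu`. */
void count_read_fault(unsigned int cpu, int write_access)
{
    assert(cpu < NR_SLOTS);
    if (!write_access)
        read_fault_slots[cpu].count++;   /* touches only this CPU's line */
}

/* Slow path: sum all slots when someone asks for the statistic. */
unsigned long read_fault_total(void)
{
    unsigned long sum = 0;

    for (size_t i = 0; i < NR_SLOTS; i++)
        sum += read_fault_slots[i].count;
    return sum;
}
```

The total is only approximate while faults are in flight, which matches the
point made above that an exact count of zero pages is not needed.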


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 12:27           ` Andrea Arcangeli
@ 2007-04-04 13:55             ` Dan Aloni
  2007-04-04 14:14               ` Andrea Arcangeli
  0 siblings, 1 reply; 49+ messages in thread
From: Dan Aloni @ 2007-04-04 13:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Hugh Dickins, Andrew Morton, Linus Torvalds,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 02:27:01PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 04, 2007 at 12:24:07PM +0200, Nick Piggin wrote:
> > But for a potential mainline merge, maybe starting with a CONFIG
> > option is a good idea -- defaulting to off, and we could start by
> > turning it on just in -rc kernels for a few releases, to get a bit
> > more confidence?
> 
> The only reason to do that is if there are many stupid apps pretending
> to get meaningful information from pages that cannot contain any
> information. The zero page in the anon page fault has been there
> forever so...

There might be a lot of applications like that, and I'm not sure that 
_all_ of them can be considered 'stupid' as you say.

How about applications that perform mmap() and R/W random-access on 
large *sparse* files? (e.g. a scientific app that uses a large sparse 
file as a big database look-up table). As I see it, these apps would
need to keep track of what's sparse and what's not...

-- 
Dan Aloni
XIV LTD, http://www.xivstorage.com
da-x (at) monatomic.org, dan (at) xiv.co.il


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 13:55             ` Dan Aloni
@ 2007-04-04 14:14               ` Andrea Arcangeli
  2007-04-04 14:44                 ` Dan Aloni
  0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 14:14 UTC (permalink / raw)
  To: Dan Aloni
  Cc: Nick Piggin, Hugh Dickins, Andrew Morton, Linus Torvalds,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 04:55:32PM +0300, Dan Aloni wrote:
> How about applications that perform mmap() and R/W random-access on 
> large *sparse* files? (e.g. a scientific app that uses a large sparse 
> file as a big database look-up table). As I see it, these apps would
> need to keep track of what's sparse and what's not...

That's not anonymous memory if those are read page faults on
_files_. I'm only talking about anonymous memory and
do_anonymous_page, i.e. no file data at all. In clearer words, the
only thing we're discussing here is char *p = malloc(1); *p.

Your example _already_ allocates zeroed pagecache instead of the zero
page, so your example (random access over sparse files with mmap, be
it MAP_PRIVATE or MAP_SHARED, no difference for reads) has never had
anything to do with the zero page. If anything, we could optimize your
example to _start_ using the ZERO_PAGE for the first time ever: it
would make more sense to map it where the lowlevel fs
finds holes. ZERO_PAGE in do_anonymous_page instead doesn't make much
sense to me, but it has always been there as far as I can
remember. The thing is that it never hurt until the huge systems
with nightmare cacheline bouncing reported heavy stalls on some
testcase, which make it look like a DoS because of the ZERO_PAGE,
hence now that it hurts I guess it can go.
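
To make the access pattern under discussion concrete, here is a small
userspace sketch (an illustration only, not code from the thread): it maps
private anonymous memory and reads one byte from every page without ever
writing, so each first touch is a read fault on anonymous memory. The
function name read_touch_anon() is made up for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `npages` of private anonymous memory and read one byte from each
 * page without writing to any of them. Returns the sum of the bytes read
 * (0 for fresh anonymous memory), or -1 if the mapping fails.
 */
long read_touch_anon(size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len;
    void *m;
    volatile unsigned char *p;
    long sum = 0;

    assert(page > 0);
    len = npages * (size_t)page;

    m = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;

    p = m;
    for (size_t i = 0; i < npages; i++)
        sum += p[i * (size_t)page];   /* read access only, never a write */

    munmap(m, len);
    return sum;
}
```

Whether those read faults map a shared zero page or allocate a fresh zeroed
page per fault is exactly what the proposed change would alter; the program
sees zeros either way.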


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 14:14               ` Andrea Arcangeli
@ 2007-04-04 14:44                 ` Dan Aloni
  2007-04-04 15:03                   ` Hugh Dickins
  2007-04-04 15:27                   ` Andrea Arcangeli
  0 siblings, 2 replies; 49+ messages in thread
From: Dan Aloni @ 2007-04-04 14:44 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Hugh Dickins, Andrew Morton, Linus Torvalds,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 04:14:57PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 04, 2007 at 04:55:32PM +0300, Dan Aloni wrote:
> > How about applications that perform mmap() and R/W random-access on 
> > large *sparse* files? (e.g. a scientific app that uses a large sparse 
> > file as a big database look-up table). As I see it, these apps would
> > need to keep track of what's sparse and what's not...
> 
> That's not anonymous memory if those are read page faults on
> _files_. I'm only talking about anonymous memory and
> do_anonymous_page, i.e. no file data at all. In clearer words, the
> only thing we're discussing here is char *p = malloc(1); *p.
>
> Your example _already_ allocates zeroed pagecache instead of the zero
> page, so your example (random access over sparse files with mmap, be
> it MAP_PRIVATE or MAP_SHARED, no difference for reads) has never had
> anything to do with the zero page. If anything, we could optimize your
> example to _start_ using the ZERO_PAGE for the first time ever: it
> would make more sense to map it where the lowlevel fs
> finds holes. ZERO_PAGE in do_anonymous_page instead doesn't make much
> sense to me, but it has always been there as far as I can
> remember. The thing is that it never hurt until the huge systems
> with nightmare cacheline bouncing reported heavy stalls on some
> testcase, which make it look like a DoS because of the ZERO_PAGE,
> hence now that it hurts I guess it can go.

Oh, right. Thanks for clarifying. I should have figured it out before 
I sent that mail.

To refine that example, you could replace the file with a large anonymous 
memory pool and a lot of swap space committed to it. In that case - with 
no ZERO_PAGE, would the kernel needlessly swap-out the zeroed pages? 
Perhaps it's an example too far-fetched to be worth considering...

-- 
Dan Aloni
XIV LTD, http://www.xivstorage.com
da-x (at) monatomic.org, dan (at) xiv.co.il


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 14:44                 ` Dan Aloni
@ 2007-04-04 15:03                   ` Hugh Dickins
  2007-04-04 15:34                     ` Andrea Arcangeli
  2007-04-04 15:27                   ` Andrea Arcangeli
  1 sibling, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2007-04-04 15:03 UTC (permalink / raw)
  To: Dan Aloni
  Cc: Andrea Arcangeli, Nick Piggin, Andrew Morton, Linus Torvalds,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, 4 Apr 2007, Dan Aloni wrote:
> 
> To refine that example, you could replace the file with a large anonymous 
> memory pool and a lot of swap space committed to it. In that case - with 
> no ZERO_PAGE, would the kernel needlessly swap-out the zeroed pages? 
> Perhaps it's an example too far-fetched to be worth considering...

Nice point, not far-fetched, though I don't know whether it's worth
worrying about or not.  Yes, as things stand, the kernel will
needlessly write them out to swap: because we're in the habit of
marking a writable pte as dirty, partly to save the processor (how
i386-centric am I being?) from having to do that work just after,
partly because of some race too ancient for me to know anything
about - do_no_page (though not the function in question here) says:

	 * This silly early PAGE_DIRTY setting removes a race
	 * due to the bad i386 page protection. But it's valid
	 * for other architectures too.

Maybe Nick will decide not to mark the readfaults as dirty.

Hugh


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 14:44                 ` Dan Aloni
  2007-04-04 15:03                   ` Hugh Dickins
@ 2007-04-04 15:27                   ` Andrea Arcangeli
  2007-04-04 16:15                     ` Dan Aloni
  1 sibling, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 15:27 UTC (permalink / raw)
  To: Dan Aloni
  Cc: Nick Piggin, Hugh Dickins, Andrew Morton, Linus Torvalds,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 05:44:21PM +0300, Dan Aloni wrote:
> To refine that example, you could replace the file with a large anonymous 
> memory pool and a lot of swap space committed to it. In that case - with 
> no ZERO_PAGE, would the kernel needlessly swap-out the zeroed pages? 

Swapout or ram is the same in this context. The point is that it will
take 4k either in ram or swap, let's talk about virtual memory without
differentiating between ram or swap.

> Perhaps it's an example too far-fetched to be worth considering...

Even if you would read the sparse file to a malloced space (more
commonly that would be tmpfs) using the read syscall, those anon (or
tmpfs) pages would be _written_ first, which isn't the case we're
discussing here.

You don't know what is on disk, so reading from disk (regardless of
what you read, holes, zeros or anything) provides useful information,
but you know what is in ram after an anon mmap: just zeros, reading
them can't provide useful information to any software.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 15:03                   ` Hugh Dickins
@ 2007-04-04 15:34                     ` Andrea Arcangeli
  2007-04-04 15:41                       ` Hugh Dickins
  0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 15:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Dan Aloni, Nick Piggin, Andrew Morton, Linus Torvalds,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 04:03:15PM +0100, Hugh Dickins wrote:
> Maybe Nick will decide not to mark the readfaults as dirty.

I don't like to mark the pte readonly and clean, we'd be still
optimizing for the current ZERO_PAGE users and even for those it would
generate an unnecessary page fault if they later write to it. If any
legitimate ZERO_PAGE user really exists, then we should keep mapping
the ZERO_PAGE into it and fix the scalability issue associated with
it, instead of allocating a new page in readonly mode.

Marking anonymous pages readonly and clean so they can be collected
w/o swapping is still desirable for glibc through madvise (madvise
would later need to be called again before starting to use the
collectable anon pages to store information into them), but that's
an entirely different topic ;)


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04  3:37     ` [rfc] no ZERO_PAGE? Nick Piggin
  2007-04-04  9:45       ` Hugh Dickins
@ 2007-04-04 15:35       ` Linus Torvalds
  2007-04-04 15:48         ` Andrea Arcangeli
                           ` (5 more replies)
  1 sibling, 6 replies; 49+ messages in thread
From: Linus Torvalds @ 2007-04-04 15:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Andrew Morton, Linux Memory Management List, tee,
	holt, Andrea Arcangeli, Linux Kernel Mailing List

On Wed, 4 Apr 2007, Nick Piggin wrote:
> 
> Shall I do a more complete patchset and ask Andrew to give it a
> run in -mm?

Do this trivial one first. See how it fares.

Although I don't know how much -mm will do for it. There are certainly not 
going to be any correctness problems, afaik, just *performance* problems. 
Does anybody do any performance testing on -mm?

That said, talking about correctness/performance problems:

> +	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> +	if (likely(!pte_none(*page_table))) {
>  		inc_mm_counter(mm, anon_rss);
>  		lru_cache_add_active(page);
>  		page_add_new_anon_rmap(page, vma, address);

Isn't that test the wrong way around?

Shouldn't it be

	if (likely(pte_none(*page_table))) {

without any logical negation? Was this patch tested?

Anyway, I'm not against this, but I can see somebody actually *wanting* 
the ZERO page in some cases. I've used the fact for TLB testing, for 
example, by just doing a big malloc(), and knowing that the kernel will 
re-use the ZERO_PAGE so that I don't get any cache effects (well, at least 
not any *physical* cache effects. Virtually indexed caches will still show 
effects of it, of course, but I haven't cared).

That's an example of an app that actually cares about the page allocation 
(or, in this case, the lack thereof). Not an important one, but maybe 
there are important ones that care?

			Linus
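
The point about the inverted pte_none() test can be modelled in a few lines
of plain C (a sketch under stated assumptions, not the kernel code path): the
new page is installed and accounted only if the slot is still empty once the
lock is held, otherwise another faulting thread won the race and the caller
must back out. struct pte_slot and install_if_none() are hypothetical names,
and the lock itself is assumed to be held by the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for one page-table entry plus the rss counter it feeds. */
struct pte_slot {
    unsigned long entry;   /* 0 plays the role of pte_none() */
    unsigned long rss;     /* pages accounted through this slot */
};

/*
 * Assumed to be called with the (modelled) page-table lock held.
 * Installs `new_entry` and bumps rss only if nothing is mapped yet.
 * Returns false when the slot was already populated, in which case the
 * caller should release the page it prepared.
 */
bool install_if_none(struct pte_slot *slot, unsigned long new_entry)
{
    assert(new_entry != 0);

    if (slot->entry != 0)      /* someone else populated it first */
        return false;

    slot->entry = new_entry;   /* common case: still empty */
    slot->rss++;
    return true;
}
```

With the test negated, as in the quoted hunk, the accounting would run only
in the case where the slot had already been filled by someone else.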


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 15:34                     ` Andrea Arcangeli
@ 2007-04-04 15:41                       ` Hugh Dickins
  2007-04-04 16:07                         ` Andrea Arcangeli
  2007-04-04 16:14                         ` Linus Torvalds
  0 siblings, 2 replies; 49+ messages in thread
From: Hugh Dickins @ 2007-04-04 15:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Dan Aloni, Nick Piggin, Andrew Morton, Linus Torvalds,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, 4 Apr 2007, Andrea Arcangeli wrote:
> On Wed, Apr 04, 2007 at 04:03:15PM +0100, Hugh Dickins wrote:
> > Maybe Nick will decide not to mark the readfaults as dirty.
> 
> I don't like to mark the pte readonly and clean,

Nor I: I meant that anonymous readfault should
(perhaps) mark the pte writable but clean.

Hugh


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 15:35       ` Linus Torvalds
@ 2007-04-04 15:48         ` Andrea Arcangeli
  2007-04-04 16:09           ` Linus Torvalds
                             ` (2 more replies)
  2007-04-04 16:32         ` Eric Dumazet
                           ` (4 subsequent siblings)
  5 siblings, 3 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 15:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 08:35:30AM -0700, Linus Torvalds wrote:
> Anyway, I'm not against this, but I can see somebody actually *wanting* 
> the ZERO page in some cases. I've used the fact for TLB testing, for 
> example, by just doing a big malloc(), and knowing that the kernel will 
> re-use the ZERO_PAGE so that I don't get any cache effects (well, at least 
> not any *physical* cache effects. Virtually indexed caches will still show 
> effects of it, of course, but I haven't cared).

Ok, those cases wanting the same zero page, could be fairly easily
converted to an mmap over /dev/zero (without having to run 4k large
mmap syscalls or nonlinear).


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 15:41                       ` Hugh Dickins
@ 2007-04-04 16:07                         ` Andrea Arcangeli
  2007-04-04 16:14                         ` Linus Torvalds
  1 sibling, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 16:07 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Dan Aloni, Nick Piggin, Andrew Morton, Linus Torvalds,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 04:41:46PM +0100, Hugh Dickins wrote:
> Nor I: I meant that anonymous readfault should
> (perhaps) mark the pte writable but clean.

Sorry I assumed when you said clean you implied readonly... Though
we'd need to differentiate the archs where the dirty bit is not set by
the hardware. Overall I'm unsure it's worth it. Currently the VM
definitely wouldn't cope with a writeable and clean anonymous page, so
we'd need to change shrink_page_list and try_to_unmap_anon to make it
work. Likely it won't be measurable, so it may be a nice feature to
have from a theoretical point of view, in practice I doubt it matters.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 15:48         ` Andrea Arcangeli
@ 2007-04-04 16:09           ` Linus Torvalds
  2007-04-04 16:23             ` Andrea Arcangeli
  2007-04-04 16:10           ` Hugh Dickins
  2007-04-04 22:07           ` Valdis.Kletnieks
  2 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2007-04-04 16:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, 4 Apr 2007, Andrea Arcangeli wrote:
> 
> Ok, those cases wanting the same zero page, could be fairly easily
> converted to an mmap over /dev/zero (without having to run 4k large
> mmap syscalls or nonlinear).

You're missing the point. What if it's something like oracle that has been 
tuned for Linux using this? Or even an open-source app that is just used 
by big places and they see performance problems but it's not obvious *why*.

We "know" why, because we're discussing this point. But two months from 
now, when some random company complains to SuSE/RH/whatever that their app 
runs 5% slower or uses 200% more swap, who is going to realize what caused 
it?

THAT is the problem with patches like this. I'm not against it, but you 
can't just dismiss it with "we can fix the app". We *cannot* fix the app 
if we don't even realize what caused the problem..

		Linus


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 15:48         ` Andrea Arcangeli
  2007-04-04 16:09           ` Linus Torvalds
@ 2007-04-04 16:10           ` Hugh Dickins
  2007-04-04 16:31             ` Andrea Arcangeli
  2007-04-04 22:07           ` Valdis.Kletnieks
  2 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2007-04-04 16:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Nick Piggin, Andrew Morton,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, 4 Apr 2007, Andrea Arcangeli wrote:
> On Wed, Apr 04, 2007 at 08:35:30AM -0700, Linus Torvalds wrote:
> > Anyway, I'm not against this, but I can see somebody actually *wanting* 
> > the ZERO page in some cases. I've used the fact for TLB testing, for 
> > example, by just doing a big malloc(), and knowing that the kernel will 
> > re-use the ZERO_PAGE so that I don't get any cache effects (well, at least 
> > not any *physical* cache effects. Virtually indexed caches will still show 
> > effects of it, of course, but I haven't cared).
> 
> Ok, those cases wanting the same zero page, could be fairly easily
> converted to an mmap over /dev/zero

No, MAP_SHARED mmap of /dev/zero uses shmem, which allocates distinct
pages for this (because in general tmpfs doesn't know if a readonly
file will be written to later on), and MAP_PRIVATE mmap of /dev/zero
uses the zeromap stuff which we were hoping to eliminate too
(though not in Nick's initial patch).

Looks like a job for /dev/same_page_over_and_over_again.

> (without having to run 4k large mmap syscalls or nonlinear).

You scared me, I made no sense of that at first: ah yes,
repeatedly mmap'ing the same page can be done those ways.

Hugh
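
For readers wondering what "repeatedly mmap'ing the same page" looks like in
practice, here is a userspace sketch of the one-mmap-per-4k approach (an
illustration only; it uses an unlinked temporary file rather than /dev/zero,
since /dev/zero does not give a single shared page as explained above). The
function name map_same_page() is made up for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Reserve `npages` pages of address space, then map page 0 of a temporary
 * file over every page-sized slot with MAP_FIXED. All slots end up backed
 * by the same physical page. Returns the base address, or NULL on failure.
 */
unsigned char *map_same_page(size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len;
    unsigned char *base;
    FILE *f = tmpfile();

    assert(page > 0);
    if (f == NULL)
        return NULL;
    len = npages * (size_t)page;

    if (ftruncate(fileno(f), page) != 0) {
        fclose(f);
        return NULL;
    }

    /* Reserve a contiguous range first so MAP_FIXED cannot clobber anything. */
    base = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        fclose(f);
        return NULL;
    }

    for (size_t i = 0; i < npages; i++) {
        void *m = mmap(base + i * (size_t)page, (size_t)page,
                       PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
                       fileno(f), 0);
        if (m == MAP_FAILED) {
            munmap(base, len);
            fclose(f);
            return NULL;
        }
    }

    fclose(f);   /* the mappings keep the file's page alive */
    return base;
}
```

Each slot is a separate mapping, so large ranges cost one mmap call and one
vma per page, which is the overhead the parenthetical above alludes to.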


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 15:41                       ` Hugh Dickins
  2007-04-04 16:07                         ` Andrea Arcangeli
@ 2007-04-04 16:14                         ` Linus Torvalds
  1 sibling, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2007-04-04 16:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Dan Aloni, Nick Piggin, Andrew Morton,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, 4 Apr 2007, Hugh Dickins wrote:

> On Wed, 4 Apr 2007, Andrea Arcangeli wrote:
> > On Wed, Apr 04, 2007 at 04:03:15PM +0100, Hugh Dickins wrote:
> > > Maybe Nick will decide not to mark the readfaults as dirty.
> > 
> > I don't like to mark the pte readonly and clean,
> 
> Nor I: I meant that anonymous readfault should
> (perhaps) mark the pte writable but clean.

Maybe. On the other hand, marking it dirty is going to be almost as 
expensive as taking the whole page fault again. The dirty bit is in 
software on a lot of architectures, and even on x86 where it's in hw, all 
microarchitectures basically consider it a micro-trap, and some of them 
(*cough*P4*cough*) are really bad at it.

So I'd actually rather just mark it dirty too, because that way there is a 
real potential performance upside to go with the real potential 
performance downside, and we can hope that it all comes out even in the 
end ;)

			Linus

PS. Yes, I wrote the benchmark. On at least some versions of the P4, just 
setting the dirty bit took 1500 cycles.. No sw-visible traps, just a *lot* 
of cycles to clean out the pipeline entirely, do a micro-trap, and 
continue. Of course, the P4 sucks at these things, but the point is that 
it can be as expensive to do it "in hardware" as doing it in software if 
the hardware is mis-designed..


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 15:27                   ` Andrea Arcangeli
@ 2007-04-04 16:15                     ` Dan Aloni
  2007-04-04 16:48                       ` Andrea Arcangeli
  0 siblings, 1 reply; 49+ messages in thread
From: Dan Aloni @ 2007-04-04 16:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Hugh Dickins, Andrew Morton, Linus Torvalds,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 05:27:17PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 04, 2007 at 05:44:21PM +0300, Dan Aloni wrote:
> > To refine that example, you could replace the file with a large anonymous 
> > memory pool and a lot of swap space committed to it. In that case - with 
> > no ZERO_PAGE, would the kernel needlessly swap-out the zeroed pages? 
> 
> Swapout or ram is the same in this context. The point is that it will
> take 4k either in ram or swap, let's talk about virtual memory without
> differentiating between ram or swap.

The main difference is that disk-backed swap can create I/O pressure which
would slow down the swap-outs that are not of zeroed pages (and other I/Os
on that disk for that matter). For purely-RAM virtual memory the latency 
incurred from managing newly allocated and zeroed pages is negligible 
compared to the latencies you get from reading/flushing those pages to 
disk if you add swap to the picture.
 
> > Perhaps it's an example too far-fetched to be worth considering...
> 
> Even if you would read the sparse file to a malloced space (more
> commonly that would be tmpfs) using the read syscall, those anon (or
> tmpfs) pages would be _written_ first, which isn't the case we're
> discussing here.
> 
> You don't know what is on disk, so reading from disk (regardless of
> what you read, holes, zeros or anything) provides useful information,
> but you know what is in ram after an anon mmap: just zeros, reading
> them can't provide useful information to any software.

I agree. The swap I/O case still holds, though: swapping-in the zeroed
pages that got swapped-out might incur unwanted overhead.

-- 
Dan Aloni
XIV LTD, http://www.xivstorage.com
da-x (at) monatomic.org, dan (at) xiv.co.il


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 16:09           ` Linus Torvalds
@ 2007-04-04 16:23             ` Andrea Arcangeli
  0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 16:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 09:09:28AM -0700, Linus Torvalds wrote:
> You're missing the point. What if it's something like oracle that has been 
> tuned for Linux using this? Or even an open-source app that is just used 
> by big places and they see performance problems but it's not obvious *why*.
> 
> We "know" why, because we're discussing this point. But two months from 
> now, when some random company complains to SuSE/RH/whatever that their app 
> runs 5% slower or uses 200% more swap, who is going to realize what caused 
> it?

No, I'm not missing the point, I was the first to say here that such
code has been there forever and in turn I'm worried about apps
depending on it for all the wrong reasons, I even went as far as
asking for a counter to keep the waste from going unnoticed, and last but not
least that's why I'm not discussing this as an internal suse fix for
the scalability issue, but only as a mainline patch for -mm.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 16:10           ` Hugh Dickins
@ 2007-04-04 16:31             ` Andrea Arcangeli
  0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 16:31 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Nick Piggin, Andrew Morton,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 05:10:37PM +0100, Hugh Dickins wrote:
> file will be written to later on), and MAP_PRIVATE mmap of /dev/zero

Obviously I meant MAP_PRIVATE of /dev/zero, since it's the only one
backed by the zero page.

> uses the zeromap stuff which we were hoping to eliminate too
> (though not in Nick's initial patch).

I didn't realize you wanted to eliminate it too.

> Looks like a job for /dev/same_page_over_and_over_again.
> 
> > (without having to run 4k large mmap syscalls or nonlinear).
> 
> You scared me, I made no sense of that at first: ah yes,
> repeatedly mmap'ing the same page can be done those ways.

Yep, which is probably why we don't need the
/dev/same_page_over_and_over_again for that.

Overall the worry about TLB benchmarking apps being broken in their
measurements sounds very minor compared to the risk of wasting tons of
ram and going out of memory. If there was no risk of bad breakage we
wouldn't need to discuss this.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 15:35       ` Linus Torvalds
  2007-04-04 15:48         ` Andrea Arcangeli
@ 2007-04-04 16:32         ` Eric Dumazet
  2007-04-04 17:02           ` Linus Torvalds
  2007-04-04 19:15         ` Andrew Morton
                           ` (3 subsequent siblings)
  5 siblings, 1 reply; 49+ messages in thread
From: Eric Dumazet @ 2007-04-04 16:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
	Linux Memory Management List, tee, holt, Andrea Arcangeli,
	Linux Kernel Mailing List

On Wed, 4 Apr 2007 08:35:30 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Anyway, I'm not against this, but I can see somebody actually *wanting* 
> the ZERO page in some cases. I've used the fact for TLB testing, for 
> example, by just doing a big malloc(), and knowing that the kernel will 
> re-use the ZERO_PAGE so that I don't get any cache effects (well, at least 
> not any *physical* cache effects. Virtually indexed caches will still show 
> effects of it, of course, but I haven't cared).
> 
> That's an example of an app that actually cares about the page allocation 
> (or, in this case, the lack there-of). Not an important one, but maybe 
> there are important ones that care?

I don't know if this small program is of any interest:

But results on an Intel Pentium-M are interesting, in particular 2) & 3)

If a page is first mapped as the zero page and then COWed to a full rw page, this is more expensive
(2660 cycles instead of 2300).

Is there an app somewhere that depends on 2) being ultra-fast but can live with later write accesses being *slow*?

$ ./page_bench >RES; cat RES
1) pagefault to bring a rw page:
Poke (addr=0x804c000): 2360 cycles
1) pagefault to bring a rw page:
Poke (addr=0x804d000): 2368 cycles
1) pagefault to bring a rw page:
Poke (addr=0x804e000): 2120 cycles
2) pagefault to bring a zero page, readonly
Peek(addr=0x804f000): ->0 891 cycles
3) pagefault to make this page rw
Poke (addr=0x804f000): 2660 cycles
1) pagefault to bring a rw page:
Poke (addr=0x8050000): 2099 cycles
1) pagefault to bring a rw page:
Poke (addr=0x8051000): 2062 cycles
4) memset 4096 bytes to 0x55:
Poke_full (addr=0x804f000, len=4096): 2719 cycles
5) fill the whole table
Poke_full (addr=0x804c000, len=4194304): 6563661 cycles
6) fill again whole table (no more faults, but cpu cache too small)
Poke_full (addr=0x804c000, len=4194304): 5188925 cycles
7.1) faulting a mmap zone, read access
Peek(addr=0xb7f8a000): ->0 40453 cycles
8.1) faulting a mmap zone, write access
Poke (addr=0xb7f89000): 10599 cycles
7.2) faulting a mmap zone, read access
Peek(addr=0xb7f88000): ->0 8167 cycles
8.3) faulting a mmap zone, write access
Poke (addr=0xb7f87000): 5701 cycles


$ cat page_bench.c

# include <errno.h>
# include <stdlib.h>
# include <unistd.h>
# include <fcntl.h>
# include <stdio.h>
# include <sys/time.h>
# include <time.h>
# include <sys/mman.h>
# include <string.h>

#ifdef __x86_64

#define rdtscll(val) do { \
     unsigned int __a,__d; \
     asm volatile("rdtsc" : "=a" (__a), "=d" (__d)); \
     (val) = ((unsigned long)__a) | (((unsigned long)__d)<<32); \
} while(0)

#elif  __i386

#define rdtscll(val) \
     __asm__ __volatile__("rdtsc" : "=A" (val))

#endif

int var;



int *addr1, *addr2, *addr3, *addr4;

void map_many_vmas(unsigned int nb)
{
size_t sz = getpagesize();
unsigned int ui;
for (ui = 0 ; ui < nb ; ui++) {
	void *p = mmap(NULL, sz,
			(ui == 0) ? PROT_READ : PROT_READ|PROT_WRITE,
			(ui & 1) ? MAP_PRIVATE|MAP_ANONYMOUS : MAP_ANONYMOUS|MAP_SHARED, -1, 0);
	if (p == MAP_FAILED) {
		fprintf(stderr, "Only %u mappings could be set\n", ui);
		break;
		}
	if (!addr1) addr1 = (int *)p;
	else if (!addr2) addr2 = (int *)p;
	else if (!addr3) addr3 = (int *)p;
	else if (!addr4) addr4 = (int *)p;
	}
}

void show_maps()
{
char buffer[4096];
int fd, lu;

fd = open("/proc/self/maps", 0);
if (fd != -1) {
	while ((lu = read(fd, buffer, sizeof(buffer))) > 0)
		write(2, buffer, lu);
	close(fd);
	}
}

void poke_int(void *addr, int val)
{
unsigned long long start, end;
long delta;
	rdtscll(start);
	*(int *)addr = val;
	rdtscll(end);
	delta = (end - start);
	printf("Poke (addr=%p): %ld cycles\n", addr, delta);
}

void poke_full(void *addr, int val, int len)
{
unsigned long long start, end;
long delta;
	rdtscll(start);
	memset(addr, val, len);
	rdtscll(end);
	delta = (end - start);
	printf("Poke_full (addr=%p, len=%d): %ld cycles\n", addr, len, delta);
}

int  peek_int(void *addr)
{
unsigned long long start, end;
long delta;
int val;
	rdtscll(start);
	val = *(int *)addr;
	rdtscll(end);
	delta = (end - start);
	printf("Peek(addr=%p): ->%d %ld cycles\n", addr, val, delta);
	return val;
}

int big_table[1024*1024] __attribute__((aligned(4096)));

void usage(int code)
{
fprintf(stderr, "Usage : page_bench [-m mappings]\n");
exit(code);
}

int main(int argc, char *argv[])
{
	unsigned int nb_mappings = 200;
	int c;

	while ((c = getopt(argc, argv, "Vm:")) != EOF) {
		if (c == 'm')
			nb_mappings = atoi(optarg);
		else if (c == 'V')
			usage(0);
	}
	if (nb_mappings < 4)
		nb_mappings = 4;
	map_many_vmas(nb_mappings);
//	show_maps();
	printf("1) pagefault to bring a rw page:\n") ;
		poke_int(&big_table[0], 10);
	printf("1) pagefault to bring a rw page:\n") ;
		poke_int(&big_table[1024], 10);
	printf("1) pagefault to bring a rw page:\n") ;
		poke_int(&big_table[2048], 10);
	printf("2) pagefault to bring a zero page, readonly\n");
		peek_int(&big_table[3*1024]);
	printf("3) pagefault to make this page rw\n");
		poke_int(&big_table[3*1024], 10);

	printf("1) pagefault to bring a rw page:\n") ;
	poke_int(&big_table[4*1024], 10);
	printf("1) pagefault to bring a rw page:\n") ;
	poke_int(&big_table[5*1024], 10);

	printf("4) memset 4096 bytes to 0x55:\n");
	poke_full(&big_table[3*1024], 0x55, 4096);

	printf("5) fill the whole table\n");
	poke_full(big_table, 1, sizeof(big_table));
	printf("6) fill again whole table (no more faults, but cpu cache too small)\n");
	poke_full(big_table, 1, sizeof(big_table));

	printf("7.1) faulting a mmap zone, read access\n");
	peek_int(addr1);

	printf("8.1) faulting a mmap zone, write access\n");
	poke_int(addr2, 10);
	printf("7.2) faulting a mmap zone, read access\n");
	peek_int(addr3);
	printf("8.3) faulting a mmap zone, write access\n");
	poke_int(addr4, 10);

	return 0;
}



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 16:15                     ` Dan Aloni
@ 2007-04-04 16:48                       ` Andrea Arcangeli
  0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 16:48 UTC (permalink / raw)
  To: Dan Aloni; +Cc: Linux Memory Management List, Linux Kernel Mailing List

Hi Dan,

On Wed, Apr 04, 2007 at 07:15:15PM +0300, Dan Aloni wrote:
> The main difference is that disk-backed swap can create I/O pressure which
> would slow down the swap-outs that are not of zeroed pages (and other I/Os
> on that disk for that matter). For purely-RAM virtual memory the latency 
> incurred from managing newly allocated and zeroed pages is negligible 
> compared to the latencies you get from reading/flushing those pages to 
> disk if you add swap to the picture.

Sorry but you're telling me the obvious... clearly you're right, swap
is slower, ram is faster. As a corollary on a 64bit system you could
always throw money at ram and _guarantee_ that those anon read page
faults never hit swap. That's not the point.

If 4G more of virtual memory are allocated in the address space of a
task because of this kernel change, it's the same problem whether those
4G later end up in swap or in ram, which depends on the runtime
environment of the kernel. The problem is that 4G more will be
allocated; it doesn't matter _where_. The user with an 8G system will
not be slowed down much, the user with a 128M system will thrash beyond
repair, but it's the same problem for both. Whether the new memory goes
into ram or swap is irrelevant because it's an unknown variable that
depends on the amount of ram and swap and on what else is running
(in fact there will be a third guy with even less luck who will go out
of memory and crash after hitting an oom killer bug ;). It's the same
problem in all three cases.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 16:32         ` Eric Dumazet
@ 2007-04-04 17:02           ` Linus Torvalds
  0 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2007-04-04 17:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
	Linux Memory Management List, tee, holt, Andrea Arcangeli,
	Linux Kernel Mailing List

On Wed, 4 Apr 2007, Eric Dumazet wrote:
> 
> But results on an Intel Pentium-M are interesting, in particular 2) & 3)
> 
> If a page is first allocated as page_zero then cow to a full rw page, this is more expensive.
> (2660 cycles instead of 2300)

Yes, you have an extra TLB flush there at a minimum (if the page didn't 
exist at all before, you don't have to flush).

That said, the big cost tends to be the clearing of the page. Which is why 
the "bring in zero page" is so much faster than anything else - it's the 
only case that doesn't need to clear the page.

So you should basically think of your numbers like this:
 - roughly 900 cycles is the cost of the page fault and all the 
   "basic software" side in the kernel
 - roughly 1400 cycles to actually do the "memset" to clear the page (and 
   no, that's *not* the cost of memory accesses per se - it's very likely 
   already in the L2 cache or similar, we just need to clear it and if 
   it wasn't marked exclusive need to do a bus cycle to invalidate it on 
   any other CPUs).

with small variation depending on the prior state, of the cache in
particular (for example, the TLB flush cost, but also: when you do

> 4) memset 4096 bytes to 0x55:
> Poke_full (addr=0x804f000, len=4096): 2719 cycles

This only adds ~600 cycles to memset the same 4kB that cost ~1400 cycles 
before, but that's *probably* largely because it was now already dirty in 
the L2 and possibly the L1, so it's quite possible that this is really 
just a cache effect, because now it's entirely exclusive in the caches so 
you don't need to do any probing on the bus at all).

Also note: in the end, page faults are usually fairly unusual. You do them 
once, and then use the page a lot after that. That's not *always* true, of 
course. Some malloc()/free() patterns of big areas that are not used for 
long will easily cause constant mmap/munmap, and a lot of page faults.

The worst effect of page faults tends to be for short-lived stuff. Notably 
things like "system()" that executes a shell just to execute something 
else. Almost *everything* in that path is basically "use once, then throw 
away", and page fault latency is interesting.

So this is one case where it might be interesting to look at what lmbench 
reports for the "fork/exit", "fork/exec" and "shell exec" numbers before 
and after. 

			Linus


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 15:35       ` Linus Torvalds
  2007-04-04 15:48         ` Andrea Arcangeli
  2007-04-04 16:32         ` Eric Dumazet
@ 2007-04-04 19:15         ` Andrew Morton
  2007-04-04 20:11         ` David Miller, Linus Torvalds
                           ` (2 subsequent siblings)
  5 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2007-04-04 19:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Hugh Dickins, Linux Memory Management List, tee,
	holt, Andrea Arcangeli, Linux Kernel Mailing List

On Wed, 4 Apr 2007 08:35:30 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Does anybody do any performance testing on -mm?

http://test.kernel.org/perf/index.html has pretty graphs of lots of kernel versions
for a few benchmarks.  I'm not aware of any other organised effort along those
lines.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 15:35       ` Linus Torvalds
                           ` (2 preceding siblings ...)
  2007-04-04 19:15         ` Andrew Morton
@ 2007-04-04 20:11         ` David Miller, Linus Torvalds
  2007-04-04 20:50           ` Andrew Morton
                             ` (2 more replies)
  2007-04-04 22:05         ` Valdis.Kletnieks
  2007-04-05  4:47         ` Nick Piggin
  5 siblings, 3 replies; 49+ messages in thread
From: David Miller, Linus Torvalds @ 2007-04-04 20:11 UTC (permalink / raw)
  To: torvalds; +Cc: npiggin, hugh, akpm, linux-mm, tee, holt, andrea, linux-kernel

> Anyway, I'm not against this, but I can see somebody actually *wanting* 
> the ZERO page in some cases. I've used the fact for TLB testing, for 
> example, by just doing a big malloc(), and knowing that the kernel will 
> re-use the ZERO_PAGE so that I don't get any cache effects (well, at least 
> not any *physical* cache effects. Virtually indexed caches will still show 
> effects of it, of course, but I haven't cared).
> 
> That's an example of an app that actually cares about the page allocation 
> (or, in this case, the lack there-of). Not an important one, but maybe 
> there are important ones that care?

If we're going to consider this seriously, there is a case I know of.
Look at flush_dcache_page()'s test for ZERO_PAGE() on sparc64, there
is an instructive comment:

	/* Do not bother with the expensive D-cache flush if it
	 * is merely the zero page.  The 'bigcore' testcase in GDB
	 * causes this case to run millions of times.
	 */
	if (page == ZERO_PAGE(0))
		return;

basically what the GDB test case does is mmap() an enormous anonymous
area, not touch it, then dump core.

As I understand the patch being considered to remove ZERO_PAGE(), this
kind of core dump will cause a lot of pages to be allocated, probably
eating up a lot of system time as well as memory.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 20:11         ` David Miller, Linus Torvalds
@ 2007-04-04 20:50           ` Andrew Morton
  2007-04-05  2:03           ` Nick Piggin
  2007-04-05  5:23           ` Andrea Arcangeli
  2 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2007-04-04 20:50 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, npiggin, hugh, linux-mm, tee, holt, andrea, linux-kernel

On Wed, 04 Apr 2007 13:11:11 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> As I understand the patch being considered to remove ZERO_PAGE(), this
> kind of core dump will cause a lot of pages to be allocated, probably
> eating up a lot of system time as well as memory.

Point.

Also, what effect will the proposed changes have upon rss reporting,
and upon the numbers in /proc/pid/[s]maps?


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 15:35       ` Linus Torvalds
                           ` (3 preceding siblings ...)
  2007-04-04 20:11         ` David Miller, Linus Torvalds
@ 2007-04-04 22:05         ` Valdis.Kletnieks
  2007-04-05  0:27           ` Linus Torvalds
  2007-04-05  4:47         ` Nick Piggin
  5 siblings, 1 reply; 49+ messages in thread
From: Valdis.Kletnieks @ 2007-04-04 22:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
	Linux Memory Management List, tee, holt, Andrea Arcangeli,
	Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1137 bytes --]

On Wed, 04 Apr 2007 08:35:30 PDT, Linus Torvalds said:

> Although I don't know how much -mm will do for it. There is certainly not 
> going to be any correctness problems, afaik, just *performance* problems. 
> Does anybody do any performance testing on -mm?

I have to admit I don't do anything more definite than "wow, this goes oink"...

> That's an example of an app that actually cares about the page allocation 
> (or, in this case, the lack there-of). Not an important one, but maybe 
> there are important ones that care?

I'd not be surprised if there's sparse-matrix code out there that wants to
malloc a *huge* array (like a 1025x1025 array of numbers) that then only
actually *writes* to several hundred locations, and relies on the fact that
all the untouched pages read back all-zeros.  Of course, said code is probably
buggy because it doesn't zero the whole thing because you don't usually know
if some other function already scribbled on that heap page.

This would probably be more interesting if we had a userspace API for
"Give me a metric buttload of zero page frames" that malloc() and friends
could leverage.....

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 15:48         ` Andrea Arcangeli
  2007-04-04 16:09           ` Linus Torvalds
  2007-04-04 16:10           ` Hugh Dickins
@ 2007-04-04 22:07           ` Valdis.Kletnieks
  2 siblings, 0 replies; 49+ messages in thread
From: Valdis.Kletnieks @ 2007-04-04 22:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Nick Piggin, Hugh Dickins, Andrew Morton,
	Linux Memory Management List, tee, holt,
	Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 283 bytes --]

On Wed, 04 Apr 2007 17:48:39 +0200, Andrea Arcangeli said:

> Ok, those cases wanting the same zero page, could be fairly easily
> converted to an mmap over /dev/zero (without having to run 4k large
> mmap syscalls or nonlinear).

"D'oh!" -- H. Simpson.

Ignore my previous note. :)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 22:05         ` Valdis.Kletnieks
@ 2007-04-05  0:27           ` Linus Torvalds
  2007-04-05  1:25             ` Valdis.Kletnieks
  2007-04-05  2:30             ` Nick Piggin
  0 siblings, 2 replies; 49+ messages in thread
From: Linus Torvalds @ 2007-04-05  0:27 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
	Linux Memory Management List, tee, holt, Andrea Arcangeli,
	Linux Kernel Mailing List


On Wed, 4 Apr 2007, Valdis.Kletnieks@vt.edu wrote:
> 
> I'd not be surprised if there's sparse-matrix code out there that wants to
> malloc a *huge* array (like a 1025x1025 array of numbers) that then only
> actually *writes* to several hundred locations, and relies on the fact that
> all the untouched pages read back all-zeros.

Good point. In fact, it doesn't need to be a malloc() - I remember people 
doing this with Fortran programs and just having an absolutely incredibly 
big BSS (with traditional Fortran, dynamic memory allocations are just not 
done).

> Of course, said code is probably buggy because it doesn't zero the whole 
> thing because you don't usually know if some other function already 
> scribbled on that heap page.

Sure you do. If glibc used mmap() or brk(), it *knows* the new data is 
zero. So if you use calloc(), for example, it's entirely possible that 
a good libc wouldn't waste time zeroing it.

The same is true of BSS. You never clear the BSS with a memset, you just 
know it starts out zeroed.

		Linus


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-05  0:27           ` Linus Torvalds
@ 2007-04-05  1:25             ` Valdis.Kletnieks
  2007-04-05  2:30             ` Nick Piggin
  1 sibling, 0 replies; 49+ messages in thread
From: Valdis.Kletnieks @ 2007-04-05  1:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
	Linux Memory Management List, tee, holt, Andrea Arcangeli,
	Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 903 bytes --]

On Wed, 04 Apr 2007 17:27:31 PDT, Linus Torvalds said:

> Sure you do. If glibc used mmap() or brk(), it *knows* the new data is 
> zero. So if you use calloc(), for example, it's entirely possible that 
> a good libc wouldn't waste time zeroing it.

Right.  However, the *user* code usually has no idea about the previous
history - so if it uses malloc(), it should be doing something like:

	ptr = malloc(my_size*sizeof(whatever));
	memset(ptr, 0, my_size*sizeof(whatever));

So malloc does something clever to guarantee that it's zero, and then userspace
undoes the cleverness because it has no easy way to *know* that cleverness
happened.

Admittedly, calloc() *can* get away with being clever.  I know we have some
glibc experts lurking here - any of them want to comment on how smart calloc()
actually is, or how smart it can become without needing major changes to the
rest of malloc() and friends?

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 20:11         ` David Miller, Linus Torvalds
  2007-04-04 20:50           ` Andrew Morton
@ 2007-04-05  2:03           ` Nick Piggin
  2007-04-05  5:23           ` Andrea Arcangeli
  2 siblings, 0 replies; 49+ messages in thread
From: Nick Piggin @ 2007-04-05  2:03 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, hugh, akpm, linux-mm, tee, holt, andrea, linux-kernel

On Wed, Apr 04, 2007 at 01:11:11PM -0700, David Miller wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Wed, 4 Apr 2007 08:35:30 -0700 (PDT)
> 
> > Anyway, I'm not against this, but I can see somebody actually *wanting* 
> > the ZERO page in some cases. I've used the fact for TLB testing, for 
> > example, by just doing a big malloc(), and knowing that the kernel will 
> > re-use the ZERO_PAGE so that I don't get any cache effects (well, at least 
> > not any *physical* cache effects. Virtually indexed caches will still show 
> > effects of it, of course, but I haven't cared).
> > 
> > That's an example of an app that actually cares about the page allocation 
> > (or, in this case, the lack there-of). Not an important one, but maybe 
> > there are important ones that care?
> 
> If we're going to consider this seriously, there is a case I know of.
> Look at flush_dcache_page()'s test for ZERO_PAGE() on sparc64, there
> is an instructive comment:
> 
> 	/* Do not bother with the expensive D-cache flush if it
> 	 * is merely the zero page.  The 'bigcore' testcase in GDB
> 	 * causes this case to run millions of times.
> 	 */
> 	if (page == ZERO_PAGE(0))
> 		return;
> 
> basically what the GDB test case does is mmap() an enormous anonymous
> area, not touch it, then dump core.
> 
> As I understand the patch being considered to remove ZERO_PAGE(), this
> kind of core dump will cause a lot of pages to be allocated, probably
> eating up a lot of system time as well as memory.

Yeah. Well it is trivial to leave ZERO_PAGE in get_user_pages, however
in the longer run it would be nice to get rid of ZERO_PAGE completely
so we need an alternative.

I've been working on a patch for core dumping that can detect unfaulted
anonymous memory and skip it without doing the ZERO_PAGE comparison.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-05  0:27           ` Linus Torvalds
  2007-04-05  1:25             ` Valdis.Kletnieks
@ 2007-04-05  2:30             ` Nick Piggin
  2007-04-05  5:37               ` William Lee Irwin III
  1 sibling, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2007-04-05  2:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Valdis.Kletnieks, Hugh Dickins, Andrew Morton,
	Linux Memory Management List, tee, holt, Andrea Arcangeli,
	Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 05:27:31PM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 4 Apr 2007, Valdis.Kletnieks@vt.edu wrote:
> > 
> > I'd not be surprised if there's sparse-matrix code out there that wants to
> > malloc a *huge* array (like a 1025x1025 array of numbers) that then only
> > actually *writes* to several hundred locations, and relies on the fact that
> > all the untouched pages read back all-zeros.
> 
> Good point. In fact, it doesn't need to be a malloc() - I remember people 
> doing this with Fortran programs and just having an absolutely incredibly 
> big BSS (with traditional Fortran, dynamic memory allocations are just not 
> done).

Sparse matrices are one thing I worry about. I don't know enough about
HPC code to know whether they will be a problem. I know there exist
data structures to optimise sparse matrix storage...


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 15:35       ` Linus Torvalds
                           ` (4 preceding siblings ...)
  2007-04-04 22:05         ` Valdis.Kletnieks
@ 2007-04-05  4:47         ` Nick Piggin
  5 siblings, 0 replies; 49+ messages in thread
From: Nick Piggin @ 2007-04-05  4:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, Linux Memory Management List, tee,
	holt, Andrea Arcangeli, Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 08:35:30AM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 4 Apr 2007, Nick Piggin wrote:
> > 
> > Shall I do a more complete patchset and ask Andrew to give it a
> > run in -mm?
> 
> Do this trivial one first. See how it fares.

OK.

> Although I don't know how much -mm will do for it. There is certainly not 
> going to be any correctness problems, afaik, just *performance* problems. 
> Does anybody do any performance testing on -mm?
> 
> That said, talking about correctness/performance problems:
> 
> > +	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> > +	if (likely(!pte_none(*page_table))) {
> >  		inc_mm_counter(mm, anon_rss);
> >  		lru_cache_add_active(page);
> >  		page_add_new_anon_rmap(page, vma, address);
> 
> Isn't that test the wrong way around?
> 
> Shouldn't it be
> 
> 	if (likely(pte_none(*page_table))) {
> 
> without any logical negation? Was this patch tested?

Yeah, untested of course. I'm having problems booting my normal test box,
so the main point of the patch was to generate some discussion (which
worked! ;)).

Thanks,
Nick


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [rfc] no ZERO_PAGE?
  2007-04-04 20:11         ` David Miller, Linus Torvalds
  2007-04-04 20:50           ` Andrew Morton
  2007-04-05  2:03           ` Nick Piggin
@ 2007-04-05  5:23           ` Andrea Arcangeli
  2 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2007-04-05  5:23 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, npiggin, hugh, akpm, linux-mm, tee, holt, linux-kernel

On Wed, Apr 04, 2007 at 01:11:11PM -0700, David S. Miller wrote:
> If we're going to consider this seriously, there is a case I know of.
> Look at flush_dcache_page()'s test for ZERO_PAGE() on sparc64, there
> is an instructive comment:
> 
> 	/* Do not bother with the expensive D-cache flush if it
> 	 * is merely the zero page.  The 'bigcore' testcase in GDB
> 	 * causes this case to run millions of times.
> 	 */
> 	if (page == ZERO_PAGE(0))
> 		return;
> 
> basically what the GDB test case does is mmap() an enormous anonymous
> area, not touch it, then dump core.
> 
> As I understand the patch being considered to remove ZERO_PAGE(), this
> kind of core dump will cause a lot of pages to be allocated, probably
> eating up a lot of system time as well as memory.

Well, if we leave the zero page in because there may be too many apps
to optimize, we still have to fix the zero page handling. Current code
is far from ideal. Currently the zero page scales worse than
no-zero-page. At the very least, all the page count/mapcount
increases/decreases at every map-in/zap must be dropped from memory.c,
otherwise two totally unrelated gdbs running at the same time (or gdb
running at the same time as fortran, or two unrelated fortran apps)
will badly thrash on the zero page reference counting.

Apart from the backwards compatibility argument with gdb or similar
apps, I doubt the zero page is a really worthwhile optimization, and I
guess we'd be better off if it had never existed.


* Re: [rfc] no ZERO_PAGE?
  2007-04-05  2:30             ` Nick Piggin
@ 2007-04-05  5:37               ` William Lee Irwin III
  2007-04-05 17:23                 ` Valdis.Kletnieks
  0 siblings, 1 reply; 49+ messages in thread
From: William Lee Irwin III @ 2007-04-05  5:37 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Valdis.Kletnieks, Hugh Dickins, Andrew Morton,
	Linux Memory Management List, tee, holt, Andrea Arcangeli,
	Linux Kernel Mailing List

On Wed, Apr 04, 2007 at 05:27:31PM -0700, Linus Torvalds wrote:
>> Good point. In fact, it doesn't need to be a malloc() - I remember people 
>> doing this with Fortran programs and just having an absolutely incredibly 
>> big BSS (with traditional Fortran, dynamic memory allocations are just not 
>> done).

On Thu, Apr 05, 2007 at 04:30:26AM +0200, Nick Piggin wrote:
> Sparse matrices are one thing I worry about. I don't know enough about
> HPC code to know whether they will be a problem. I know there exist
> data structures to optimise sparse matrix storage...

\begin{admission-against-interest}

Sparse matrix code goes to extreme lengths to avoid ever looking at
substantial numbers of zero floating point matrix and vector entries.
In extreme cases, hashing and various sorts of heavyweight data
structures are used to represent highly irregular structures. At various
times the matrix is not even explicitly formed. Most typical are cases
like band diagonal matrices where storage is allocated only for the
nonzero diagonals. The entire purpose of sparse algorithms is to avoid
examining or even allocating zeros.
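
As an illustration of the band diagonal case, a tridiagonal matrix can be
kept as three length-n arrays rather than n*n entries, so the zeros are
never allocated at all. The following is a minimal sketch; the function
name tridiag_matvec and the array layout are assumptions made for the
example, not taken from any particular library.

```c
#include <stddef.h>

/*
 * Hypothetical helper: y = A * x for an n x n tridiagonal matrix A stored
 * as three arrays of length n instead of n * n entries.
 *   lower[i] holds A[i][i-1] (lower[0] is unused)
 *   diag[i]  holds A[i][i]
 *   upper[i] holds A[i][i+1] (upper[n-1] is unused)
 */
void tridiag_matvec(size_t n, const double *lower, const double *diag,
		    const double *upper, const double *x, double *y)
{
	size_t i;

	for (i = 0; i < n; i++) {
		double sum = diag[i] * x[i];

		if (i > 0)
			sum += lower[i] * x[i - 1];
		if (i + 1 < n)
			sum += upper[i] * x[i + 1];
		y[i] = sum;
	}
}
```

Storage here grows as 3n rather than n^2, and no zero entry is ever read.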

The actual phenomenon of concern here is dense matrix code with sparse
matrix inputs. The matrices will typically not be vast but may span 1MB
or so of RAM (1024x1024 is 1M*sizeof(double), and various dense matrix
algorithms target ca. 300x300). Most of the time this will arise from
the use of dense matrix code as black box solvers called as a library
by programs not terribly concerned about efficiency until something
gets explosively inefficient (and maybe not even then), or otherwise
numerically naive programs. This, however, is arguably the majority of
the usage cases by end-user invocations, so beware, though not too much.
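
The sizes mentioned above can be checked with a one-line calculation. This
is a minimal sketch assuming an 8-byte double (true on common platforms but
not guaranteed by the C standard); the helper name dense_matrix_bytes is
hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: bytes needed for a dense n x n matrix of doubles. */
size_t dense_matrix_bytes(size_t n)
{
	return n * n * sizeof(double);
}
```

Under that assumption, dense_matrix_bytes(300) is 720000 bytes, which is
the "1MB or so" range, while dense_matrix_bytes(1024) is 8388608 bytes
(8MB, i.e. 1M entries times sizeof(double)).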

I'd be more concerned about large hashtables sparsely used for the
purposes of adjacency detection and other cases where large time vs.
space tradeoffs are made for probabilistic reasons involving
collisions.

\end{admission-against-interest}

-- wli


* Re: [rfc] no ZERO_PAGE?
  2007-04-05  5:37               ` William Lee Irwin III
@ 2007-04-05 17:23                 ` Valdis.Kletnieks
  0 siblings, 0 replies; 49+ messages in thread
From: Valdis.Kletnieks @ 2007-04-05 17:23 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Nick Piggin, Linus Torvalds, Hugh Dickins, Andrew Morton,
	Linux Memory Management List, tee, holt, Andrea Arcangeli,
	Linux Kernel Mailing List

On Wed, 04 Apr 2007 22:37:29 PDT, William Lee Irwin III said:

> The actual phenomenon of concern here is dense matrix code with sparse
> matrix inputs. The matrices will typically not be vast but may span 1MB
> or so of RAM (1024x1024 is 1M*sizeof(double), and various dense matrix
> algorithms target ca. 300x300). Most of the time this will arise from
> the use of dense matrix code as black box solvers called as a library
> by programs not terribly concerned about efficiency until something
> gets explosively inefficient (and maybe not even then), or otherwise
> numerically naive programs. This, however, is arguably the majority of
> the usage cases by end-user invocations, so beware, though not too much.

Amen, brother! :)

At least in my environment, the vast majority of matrix code is actually run by
graduate students under the direction of whatever professor is the Principal
Investigator on the grant. As a rule, you can expect the grad student to know
about rounding errors and convergence issues and similar program *correctness*
factors.  But it's the rare one that has much interest in program *efficiency*.
If it takes 2 days to run, that's 2 days they can go get another few pages of
thesis written while they wait. :)

The code that gets on our SystemX (a top-50 supercomputer still) is usually
well-tweaked for efficiency.  However, that's just one system - there's on the
order of several hundred smaller compute clusters and boxen and SGI-en on
campus where "protect the system from cargo-cult programming by grad students"
is a valid kernel goal. ;)


end of thread

Thread overview: 49+ messages
2007-03-29  7:58 [rfc][patch 1/2] mm: dont account ZERO_PAGE Nick Piggin
2007-03-29  7:58 ` [rfc][patch 2/2] mips: reinstate move_pte Nick Piggin
2007-03-29 17:49   ` Linus Torvalds
2007-03-29 13:10 ` [rfc][patch 1/2] mm: dont account ZERO_PAGE Hugh Dickins
2007-03-30  1:46   ` Nick Piggin
2007-03-30  2:59     ` Robin Holt
2007-03-30  3:09       ` Nick Piggin
2007-03-30  9:23         ` Robin Holt
2007-03-30  2:40   ` Nick Piggin
2007-04-04  3:37     ` [rfc] no ZERO_PAGE? Nick Piggin
2007-04-04  9:45       ` Hugh Dickins
2007-04-04 10:24         ` Nick Piggin
2007-04-04 12:27           ` Andrea Arcangeli
2007-04-04 13:55             ` Dan Aloni
2007-04-04 14:14               ` Andrea Arcangeli
2007-04-04 14:44                 ` Dan Aloni
2007-04-04 15:03                   ` Hugh Dickins
2007-04-04 15:34                     ` Andrea Arcangeli
2007-04-04 15:41                       ` Hugh Dickins
2007-04-04 16:07                         ` Andrea Arcangeli
2007-04-04 16:14                         ` Linus Torvalds
2007-04-04 15:27                   ` Andrea Arcangeli
2007-04-04 16:15                     ` Dan Aloni
2007-04-04 16:48                       ` Andrea Arcangeli
2007-04-04 12:45           ` Hugh Dickins
2007-04-04 13:05             ` Andrea Arcangeli
2007-04-04 13:32               ` Hugh Dickins
2007-04-04 13:40                 ` Andrea Arcangeli
2007-04-04 15:35       ` Linus Torvalds
2007-04-04 15:48         ` Andrea Arcangeli
2007-04-04 16:09           ` Linus Torvalds
2007-04-04 16:23             ` Andrea Arcangeli
2007-04-04 16:10           ` Hugh Dickins
2007-04-04 16:31             ` Andrea Arcangeli
2007-04-04 22:07           ` Valdis.Kletnieks
2007-04-04 16:32         ` Eric Dumazet
2007-04-04 17:02           ` Linus Torvalds
2007-04-04 19:15         ` Andrew Morton
2007-04-04 20:11         ` David Miller, Linus Torvalds
2007-04-04 20:50           ` Andrew Morton
2007-04-05  2:03           ` Nick Piggin
2007-04-05  5:23           ` Andrea Arcangeli
2007-04-04 22:05         ` Valdis.Kletnieks
2007-04-05  0:27           ` Linus Torvalds
2007-04-05  1:25             ` Valdis.Kletnieks
2007-04-05  2:30             ` Nick Piggin
2007-04-05  5:37               ` William Lee Irwin III
2007-04-05 17:23                 ` Valdis.Kletnieks
2007-04-05  4:47         ` Nick Piggin
