[PATCH] ZERO_PAGE again v5.

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] ZERO_PAGE again v5.
@ 2009-08-05 10:16 KAMEZAWA Hiroyuki
  2009-08-05 10:17 ` [PATCH 1/2] ZERO_PAGE config KAMEZAWA Hiroyuki
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-05 10:16 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, npiggin, hugh.dickins, akpm, torvalds


Updated from v4 as
  - avoid to add new arguments to vm_normal_page().
    vm_normal_page() always returns NULL if ZERO_PAGE is found.
  - follow_page() directly handles pte_special and ANON_ZERO_PAGE.

Then, amount of changes are reduced. Thanks for advices.

Concerns pointed out:
  - Does use_zero_page() cover all cases ?
    I think yes..
  - All get_user_pages() callers, which may find ZERO_PAGE is safe ?
    need tests.
  - All follow_pages() callers, which may find ZERO_PAGE is safe ?
    I think yes.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/2] ZERO_PAGE config
  2009-08-05 10:16 [PATCH] ZERO_PAGE again v5 KAMEZAWA Hiroyuki
@ 2009-08-05 10:17 ` KAMEZAWA Hiroyuki
  2009-08-05 10:19 ` [PATCH 2/2] ZERO_PAGE based on pte_special KAMEZAWA Hiroyuki
  2009-08-09 17:28 ` [PATCH] ZERO_PAGE again v5 Hugh Dickins
  2 siblings, 0 replies; 10+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-05 10:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, npiggin, hugh.dickins, akpm, torvalds

no changes from v4.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Kconfig for using ZERO_PAGE or not. Using ZERO_PAGE or not is depends on
 - arch has pte_special() or not.
 - arch allows to use ZERO_PAGE or not.

In this patch, generic-config for /mm and arch-specific config for x86
is added. Other archs ?

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 arch/x86/Kconfig |    3 +++
 mm/Kconfig       |   18 ++++++++++++++++++
 2 files changed, 21 insertions(+)

Index: mmotm-2.6.31-Aug4/mm/Kconfig
===================================================================
--- mmotm-2.6.31-Aug4.orig/mm/Kconfig
+++ mmotm-2.6.31-Aug4/mm/Kconfig
@@ -225,6 +225,24 @@ config KSM
 	  saving memory until one or another app needs to modify the content.
 	  Recommended for use with KVM, or with other duplicative applications.
 
+config SUPPORT_ANON_ZERO_PAGE
+	bool "Use anon zero page"
+	default y if ARCH_SUPPORT_ANON_ZERO_PAGE
+	help
+	  In anonymous private mapping (MAP_ANONYMOUS and /dev/zero), a read
+	  page fault will allocate a new zero-cleared page. If the first page
+	  fault is write, allocating a new page is necessary. But if it is
+	  read, we can use ZERO_PAGE until a write comes. If you set this to y,
+	  the kernel use ZERO_PAGE and delays allocating new memory in private
+	  anon mapping until the first write. If applications use large mmap
+	  and most of accesses are read, this reduces memory usage and cache
+	  usage to some extent. To support this, your architecture should have
+	  _PAGE_SPECIAL bit in pte. And this will be no help to cpu cache if
+	  the arch's cache is virtually tagged.
+	  To developper:
+	  This ZERO_PAGE changes behavior of follow_page(). please check
+	  usage of follow_page() in your arch before supporting this.
+
 config DEFAULT_MMAP_MIN_ADDR
         int "Low address space to protect from user allocation"
         default 4096
Index: mmotm-2.6.31-Aug4/arch/x86/Kconfig
===================================================================
--- mmotm-2.6.31-Aug4.orig/arch/x86/Kconfig
+++ mmotm-2.6.31-Aug4/arch/x86/Kconfig
@@ -158,6 +158,9 @@ config ARCH_HIBERNATION_POSSIBLE
 config ARCH_SUSPEND_POSSIBLE
 	def_bool y
 
+config ARCH_SUPPORT_ANON_ZERO_PAGE
+	def_bool y
+
 config ZONE_DMA32
 	bool
 	default X86_64

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 2/2] ZERO_PAGE based on pte_special
  2009-08-05 10:16 [PATCH] ZERO_PAGE again v5 KAMEZAWA Hiroyuki
  2009-08-05 10:17 ` [PATCH 1/2] ZERO_PAGE config KAMEZAWA Hiroyuki
@ 2009-08-05 10:19 ` KAMEZAWA Hiroyuki
  2009-08-09 17:28 ` [PATCH] ZERO_PAGE again v5 Hugh Dickins
  2 siblings, 0 replies; 10+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-05 10:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, npiggin, hugh.dickins, akpm, torvalds

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

updated against: mmotm-Jul-30-2009.

ZERO_PAGE for anonymous private mapping is useful when an application
requires large continuous memory but write sparsely or some other usages.
It was removed in 2.6.24 but this patch tries to re-add it.
(Because there are some use cases..)

In past, ZERO_PAGE was removed because heavy cache line contention in
ZERO_PAGE's refcounting, this version of ZERO_PAGE avoid to refcnt it.
Then, implementation is changed as following.

  - Use of ZERO_PAGE is limited to PRIVATE mapping. Then, VM_MAYSHARE is
    checked as VM_SHARED.

  - pte_special(), _PAGE_SPECIAL bit in pte is used for indicating ZERO_PAGE.

  - follow_page() eats FOLL_XXX flag. If FOLL_NOZERO is set,
    NULL is returned even if ZERO_PAGE is found.

  - vm_normal_page() returns NULL if ZERO_PAGE is found.

  - __get_user_pages() eats one more flag as GUP_FLAGS_IGNORE_ZERO. If set,
    __get_user_page() returns NULL even if ZERO_PAGE is found.


Changelog v4.1 -> v5
 - removed new arguments to vm_normal_page()
 - follow_page() handles ZERO_PAGE directly.

Changelog v4 -> v4.1
 - removed nexted "if" in get_user_pages() for readability

Changelog v3->v4
 - FOLL_NOZERO is directly passed to vm_normal_page()

Changelog v2->v3
 - totally renewed.
 - use pte_special()
 - added new argument to vm_normal_page().
 - MAYSHARE is checked.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/mm.h |    1 
 mm/internal.h      |    1 
 mm/memory.c        |  123 +++++++++++++++++++++++++++++++++++++++++++----------
 mm/migrate.c       |    6 +-
 mm/mlock.c         |    2 
 5 files changed, 108 insertions(+), 25 deletions(-)

Index: mmotm-2.6.31-Aug4/mm/memory.c
===================================================================
--- mmotm-2.6.31-Aug4.orig/mm/memory.c
+++ mmotm-2.6.31-Aug4/mm/memory.c
@@ -444,6 +444,27 @@ static inline int is_cow_mapping(unsigne
 }
 
 /*
+ * Can we use ZERO_PAGE at fault ? or Can we do the FOLL_ANON optimization ?
+ */
+static inline int use_zero_page(struct vm_area_struct *vma)
+{
+	/*
+	 * We don't want to optimize FOLL_ANON for make_pages_present()
+	 * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
+	 * we want to get the page from the page tables to make sure
+	 * that we serialize and update with any other user of that
+	 * mapping. At doing page fault, VM_MAYSHARE should be also check for
+	 * avoiding possible changes to VM_SHARED.
+	 */
+	if (vma->vm_flags & (VM_LOCKED | VM_SHARED | VM_MAYSHARE))
+		return 0;
+	/*
+	 * And if we have a fault routine, it's not an anonymous region.
+	 */
+	return !vma->vm_ops || !vma->vm_ops->fault;
+}
+
+/*
  * vm_normal_page -- This function gets the "struct page" associated with a pte.
  *
  * "Special" mappings do not wish to be associated with a "struct page" (either
@@ -490,6 +511,12 @@ static inline int is_cow_mapping(unsigne
 #else
 # define HAVE_PTE_SPECIAL 0
 #endif
+
+#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
+# define HAVE_ANON_ZERO 1
+#else
+# define HAVE_ANON_ZERO 0
+#endif
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 				pte_t pte)
 {
@@ -498,8 +525,16 @@ struct page *vm_normal_page(struct vm_ar
 	if (HAVE_PTE_SPECIAL) {
 		if (likely(!pte_special(pte)))
 			goto check_pfn;
-		if (!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
-			print_bad_pte(vma, addr, pte, NULL);
+
+		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+			return NULL;
+		/*
+		 * ZERO PAGE ? If vma is shared or has page fault handler,
+		 * Using ZERO PAGE is bug.
+		 */
+		if (HAVE_ANON_ZERO && use_zero_page(vma))
+			return NULL;
+		print_bad_pte(vma, addr, pte, NULL);
 		return NULL;
 	}
 
@@ -1143,7 +1178,16 @@ struct page *follow_page(struct vm_area_
 		goto no_page;
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;
-	page = vm_normal_page(vma, address, pte);
+
+	if (HAVE_ANON_ZERO && pte_special(pte) && use_zero_page(vma)) {
+		/* This page is ZERO_PAGE */
+		if (flags & FOLL_NOZERO)
+			page = NULL;
+		else
+			page = ZERO_PAGE(0);
+	} else
+		page = vm_normal_page(vma, address, pte);
+
 	if (unlikely(!page))
 		goto bad_page;
 
@@ -1188,23 +1232,6 @@ no_page_table:
 	return page;
 }
 
-/* Can we do the FOLL_ANON optimization? */
-static inline int use_zero_page(struct vm_area_struct *vma)
-{
-	/*
-	 * We don't want to optimize FOLL_ANON for make_pages_present()
-	 * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
-	 * we want to get the page from the page tables to make sure
-	 * that we serialize and update with any other user of that
-	 * mapping.
-	 */
-	if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
-		return 0;
-	/*
-	 * And if we have a fault routine, it's not an anonymous region.
-	 */
-	return !vma->vm_ops || !vma->vm_ops->fault;
-}
 
 
 
@@ -1218,6 +1245,7 @@ int __get_user_pages(struct task_struct 
 	int force = !!(flags & GUP_FLAGS_FORCE);
 	int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
 	int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);
+	int ignore_zero = !!(flags & GUP_FLAGS_IGNORE_ZERO);
 
 	if (nr_pages <= 0)
 		return 0;
@@ -1261,7 +1289,11 @@ int __get_user_pages(struct task_struct 
 				return i ? : -EFAULT;
 			}
 			if (pages) {
-				struct page *page = vm_normal_page(gate_vma, start, *pte);
+				struct page *page;
+				/*
+				 * this is not anon vma...don't care zero page.
+				 */
+				page = vm_normal_page(gate_vma, start, *pte);
 				pages[i] = page;
 				if (page)
 					get_page(page);
@@ -1291,6 +1323,8 @@ int __get_user_pages(struct task_struct 
 			foll_flags |= FOLL_GET;
 		if (!write && use_zero_page(vma))
 			foll_flags |= FOLL_ANON;
+		else
+			ignore_zero = 0;
 
 		do {
 			struct page *page;
@@ -1309,9 +1343,17 @@ int __get_user_pages(struct task_struct 
 			if (write)
 				foll_flags |= FOLL_WRITE;
 
+			if (ignore_zero)
+				foll_flags |= FOLL_NOZERO;
+
 			cond_resched();
 			while (!(page = follow_page(vma, start, foll_flags))) {
 				int ret;
+				/*
+				 * When we ignore zero pages, no more ops to do.
+				 */
+				if (ignore_zero)
+					break;
 
 				ret = handle_mm_fault(mm, vma, start,
 					(foll_flags & FOLL_WRITE) ?
@@ -2617,6 +2659,41 @@ out_page:
 	return ret;
 }
 
+#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
+static bool do_anon_zeromap(struct mm_struct *mm, struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmd)
+{
+	spinlock_t *ptl;
+	pte_t entry;
+	pte_t *page_table;
+	bool ret = false;
+
+	if (!use_zero_page(vma))
+		return ret;
+	/*
+	 * We use _PAGE_SPECIAL bit in pte to indicate this page is ZERO PAGE.
+	 */
+	entry = pte_mkspecial(mk_pte(ZERO_PAGE(0), vma->vm_page_prot));
+	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_none(*page_table))
+		goto out_unlock;
+	set_pte_at(mm, address, page_table, entry);
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(vma, address, entry);
+	ret = true;
+out_unlock:
+	pte_unmap_unlock(page_table, ptl);
+	return ret;
+}
+#else
+static bool do_anon_zeromap(struct mm_struct *mm, struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmd)
+{
+	/* We don't use ZERO PAGE */
+	return false;
+}
+#endif /* CONFIG_SUPPORT_ANON_ZERO_PAGE */
+
 /*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -2633,6 +2710,10 @@ static int do_anonymous_page(struct mm_s
 	/* Allocate our own private page. */
 	pte_unmap(page_table);
 
+	if (unlikely(!(flags & FAULT_FLAG_WRITE)))
+		if (do_anon_zeromap(mm, vma, address, pmd))
+			return 0;
+
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
 	page = alloc_zeroed_user_highpage_movable(vma, address);
Index: mmotm-2.6.31-Aug4/include/linux/mm.h
===================================================================
--- mmotm-2.6.31-Aug4.orig/include/linux/mm.h
+++ mmotm-2.6.31-Aug4/include/linux/mm.h
@@ -1246,6 +1246,7 @@ struct page *follow_page(struct vm_area_
 #define FOLL_TOUCH	0x02	/* mark page accessed */
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_ANON	0x08	/* give ZERO_PAGE if no pgtable */
+#define FOLL_NOZERO	0x10	/* returns NULL if ZERO_PAGE is found */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
Index: mmotm-2.6.31-Aug4/mm/internal.h
===================================================================
--- mmotm-2.6.31-Aug4.orig/mm/internal.h
+++ mmotm-2.6.31-Aug4/mm/internal.h
@@ -254,6 +254,7 @@ static inline void mminit_validate_memmo
 #define GUP_FLAGS_FORCE                  0x2
 #define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
 #define GUP_FLAGS_IGNORE_SIGKILL         0x8
+#define GUP_FLAGS_IGNORE_ZERO         	 0x10
 
 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int len, int flags,
Index: mmotm-2.6.31-Aug4/mm/migrate.c
===================================================================
--- mmotm-2.6.31-Aug4.orig/mm/migrate.c
+++ mmotm-2.6.31-Aug4/mm/migrate.c
@@ -850,7 +850,7 @@ static int do_move_page_to_node_array(st
 		if (!vma || !vma_migratable(vma))
 			goto set_status;
 
-		page = follow_page(vma, pp->addr, FOLL_GET);
+		page = follow_page(vma, pp->addr, FOLL_GET | FOLL_NOZERO);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
@@ -1007,14 +1007,14 @@ static void do_pages_stat_array(struct m
 		if (!vma)
 			goto set_status;
 
-		page = follow_page(vma, addr, 0);
+		page = follow_page(vma, addr, FOLL_NOZERO);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
 			goto set_status;
 
 		err = -ENOENT;
-		/* Use PageReserved to check for zero page */
+		/* if zero page, page is NULL. */
 		if (!page || PageReserved(page))
 			goto set_status;
 
Index: mmotm-2.6.31-Aug4/mm/mlock.c
===================================================================
--- mmotm-2.6.31-Aug4.orig/mm/mlock.c
+++ mmotm-2.6.31-Aug4/mm/mlock.c
@@ -162,7 +162,7 @@ static long __mlock_vma_pages_range(stru
 	struct page *pages[16]; /* 16 gives a reasonable batch */
 	int nr_pages = (end - start) / PAGE_SIZE;
 	int ret = 0;
-	int gup_flags = 0;
+	int gup_flags = GUP_FLAGS_IGNORE_ZERO;
 
 	VM_BUG_ON(start & ~PAGE_MASK);
 	VM_BUG_ON(end   & ~PAGE_MASK);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] ZERO_PAGE again v5.
  2009-08-05 10:16 [PATCH] ZERO_PAGE again v5 KAMEZAWA Hiroyuki
  2009-08-05 10:17 ` [PATCH 1/2] ZERO_PAGE config KAMEZAWA Hiroyuki
  2009-08-05 10:19 ` [PATCH 2/2] ZERO_PAGE based on pte_special KAMEZAWA Hiroyuki
@ 2009-08-09 17:28 ` Hugh Dickins
  2009-08-10  0:14   ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 10+ messages in thread
From: Hugh Dickins @ 2009-08-09 17:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, Nick Piggin, Andrew Morton, Linus Torvalds

On Wed, 5 Aug 2009, KAMEZAWA Hiroyuki wrote:
> Updated from v4 as
>   - avoid to add new arguments to vm_normal_page().
>     vm_normal_page() always returns NULL if ZERO_PAGE is found.
>   - follow_page() directly handles pte_special and ANON_ZERO_PAGE.
> 
> Then, amount of changes are reduced. Thanks for advices.
> 
> Concerns pointed out:
>   - Does use_zero_page() cover all cases ?
>     I think yes..
>   - All get_user_pages() callers, which may find ZERO_PAGE is safe ?
>     need tests.
>   - All follow_pages() callers, which may find ZERO_PAGE is safe ?
>     I think yes.

Sorry, KAMEZAWA-san, I'm afraid this is still some way off being right.

Certainly the extent of the v5 patch is much more to my taste than v4
was, thank you.

Something that's missing, which we can get away with but probably
need to reinstate, is the shortcut when COWing: not to copy the
ZERO_PAGE, but just do a memset.

But just try mlock'ing a private readonly anon area into which you've
faulted a zero page, and the "BUG: Bad page map" message tells us
it's quite wrong to be trying use_zero_page() there.

Actually, I don't understand ignore_zero at all: it's used solely by
the mlock case, yet its effect seems to be precisely not to fault in
pages if they're missing - I wonder if you've got in a muddle between
the two very different awkward cases, mlocking and coredumps of
sparsely populated areas.

And I don't at all like the way you flush_dcache_page(page) on a
page which may now be NULL: okay, you're only encouraging x86 to
say Yes to the Kconfig option, but that's a landmine for the first
arch with a real flush_dcache_page(page) which says Yes to it.

Actually, the Kconfig stuff seems silly to me (who's going to know
how to choose on or off?): the only architecture which wanted more
than one ZERO_PAGE was MIPS, and it doesn't __HAVE_ARCH_PTE_SPECIAL
yet, so I think I'm going to drop all the Kconfig end of it.

Because I hate reviewing things and trying to direct other people
by remote control: what usually happens is I send them off in some
direction which, once I try to do it myself, turns out to have been
the wrong direction.  I do need to try to do this myself, instead of
standing on the sidelines criticizing.

In fairness, I think Linus himself was a little confused when he
separated off use_zero_page(): I think we've all got confused around
there (as we noticed a month or so ago when discussing its hugetlb
equivalent), and I need to think it through again at last.

I'll get on to it now.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] ZERO_PAGE again v5.
  2009-08-09 17:28 ` [PATCH] ZERO_PAGE again v5 Hugh Dickins
@ 2009-08-10  0:14   ` KAMEZAWA Hiroyuki
  2009-08-10 11:24     ` Hugh Dickins
  0 siblings, 1 reply; 10+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-10  0:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, linux-kernel, Nick Piggin, Andrew Morton, Linus Torvalds

On Sun, 9 Aug 2009 18:28:48 +0100 (BST)
Hugh Dickins <hugh.dickins@tiscali.co.uk> wrote:

> On Wed, 5 Aug 2009, KAMEZAWA Hiroyuki wrote:
> > Updated from v4 as
> >   - avoid to add new arguments to vm_normal_page().
> >     vm_normal_page() always returns NULL if ZERO_PAGE is found.
> >   - follow_page() directly handles pte_special and ANON_ZERO_PAGE.
> > 
> > Then, amount of changes are reduced. Thanks for advices.
> > 
> > Concerns pointed out:
> >   - Does use_zero_page() cover all cases ?
> >     I think yes..
> >   - All get_user_pages() callers, which may find ZERO_PAGE is safe ?
> >     need tests.
> >   - All follow_pages() callers, which may find ZERO_PAGE is safe ?
> >     I think yes.
> 
> Sorry, KAMEZAWA-san, I'm afraid this is still some way off being right.
> 
> Certainly the extent of the v5 patch is much more to my taste than v4
> was, thank you.
> 
At first, thank you for review.

> Something that's missing, which we can get away with but probably
> need to reinstate, is the shortcut when COWing: not to copy the
> ZERO_PAGE, but just do a memset.
> 
> But just try mlock'ing a private readonly anon area into which you've
> faulted a zero page, and the "BUG: Bad page map" message tells us
> it's quite wrong to be trying use_zero_page() there.
> 
> Actually, I don't understand ignore_zero at all: it's used solely by
> the mlock case, yet its effect seems to be precisely not to fault in
> pages if they're missing - I wonder if you've got in a muddle between
> the two very different awkward cases, mlocking and coredumps of
> sparsely populated areas.
> 
Ah, then, you say mlock() should allocate 'real' page if zero page
is mapped. Right ?

"How to handle mlock" is a concern for me, too. But I selected this
to allow the same behavior to old kernels.

> And I don't at all like the way you flush_dcache_page(page) on a
> page which may now be NULL: okay, you're only encouraging x86 to
> say Yes to the Kconfig option, but that's a landmine for the first
> arch with a real flush_dcache_page(page) which says Yes to it.
> 
do_wp_page()
	-> cow_user_page()
		-> (src is NULL)
Ah....ok, it's bug. I added ....Sorry, I didn't see this in older version
and missed this.

> Actually, the Kconfig stuff seems silly to me (who's going to know
> how to choose on or off?): the only architecture which wanted more
> than one ZERO_PAGE was MIPS, and it doesn't __HAVE_ARCH_PTE_SPECIAL
> yet, so I think I'm going to drop all the Kconfig end of it.
> 
ok, I have no strong demands on it.

> Because I hate reviewing things and trying to direct other people
> by remote control: what usually happens is I send them off in some
> direction which, once I try to do it myself, turns out to have been
> the wrong direction.  I do need to try to do this myself, instead of
> standing on the sidelines criticizing.
> 
> In fairness, I think Linus himself was a little confused when he
> separated off use_zero_page(): I think we've all got confused around
> there (as we noticed a month or so ago when discussing its hugetlb
> equivalent), and I need to think it through again at last.
> 
> I'll get on to it now.
> 

Thank you for comments. I'll go to a trip until Aug/17, programming-camp,
I'll be able to consider this patch and the whole things aroung paging in calm
enviroment. I'll try to restart from scratch.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] ZERO_PAGE again v5.
  2009-08-10  0:14   ` KAMEZAWA Hiroyuki
@ 2009-08-10 11:24     ` Hugh Dickins
  0 siblings, 0 replies; 10+ messages in thread
From: Hugh Dickins @ 2009-08-10 11:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, Nick Piggin, Andrew Morton, Linus Torvalds

On Mon, 10 Aug 2009, KAMEZAWA Hiroyuki wrote:
> On Sun, 9 Aug 2009 18:28:48 +0100 (BST)
> Hugh Dickins <hugh.dickins@tiscali.co.uk> wrote:
> > 
> > Actually, I don't understand ignore_zero at all: it's used solely by
> > the mlock case, yet its effect seems to be precisely not to fault in
> > pages if they're missing - I wonder if you've got in a muddle between
> > the two very different awkward cases, mlocking and coredumps of
> > sparsely populated areas.
> > 
> Ah, then, you say mlock() should allocate 'real' page if zero page
> is mapped. Right ?

No.

(That would be a possibility if it gets us out of some difficulty
with the newer mlocking code, but it's not something we ever did or
wanted to do in the past.)

What I was saying in that paragraph was that (it appears to me that)
in your patch only __mlock_vma_pages_range sets GUP_FLAGS_IGNORE_ZERO,
that __get_user_pages sets ignore_zero according to that flag, some
conditions may clear it, but then it goes on to say
			while (!(page = follow_page(vma, start, foll_flags))) {
				/*
				 * When we ignore zero pages, no more ops to do.
				 */
				if (ignore_zero)
					break;
which means that when ignore_zero is set and follow_page returns NULL,
we emerge from the loop with NULL page, don't we?  Whereas when mlocking,
we want to fault in any pages which were not already there.

Or am I just reading this all wrong?

> > And I don't at all like the way you flush_dcache_page(page) on a
> > page which may now be NULL: okay, you're only encouraging x86 to
> > say Yes to the Kconfig option, but that's a landmine for the first
> > arch with a real flush_dcache_page(page) which says Yes to it.
> > 
> do_wp_page()
> 	-> cow_user_page()
> 		-> (src is NULL)
> Ah....ok, it's bug. I added ....Sorry, I didn't see this in older version
> and missed this.

That's an entirely different issue, and I don't see that it's a bug,
just the inefficiency I mentioned elsewhere that we'd be better off
doing a memset than trying to memcpy the ZERO_PAGE.

What I was saying in that paragraph is that when you break from the
loop in __get_user_pages with ignore_zero and NULL page, you reach
			if (IS_ERR(page))
				return i ? i : PTR_ERR(page);
			if (pages) {
				pages[i] = page;

				flush_anon_page(vma, page, start);
				flush_dcache_page(page);
			}
which inserts a NULL into pages[i] (which may be okay if the other
end is prepared for it, as I think __mlock_vma_pages_range is),
then passes a NULL page to flush_anon_page and flush_dcache_page.

I looked up one of the non-empty implementations of flush_dcache_page
and saw it testing a bit in page->flags, assuming (very reasonably!)
that the page pointer is not NULL.  Oops.

> > Because I hate reviewing things and trying to direct other people
> > by remote control: what usually happens is I send them off in some
> > direction which, once I try to do it myself, turns out to have been
> > the wrong direction.  I do need to try to do this myself, instead of
> > standing on the sidelines criticizing.
> > 
> > In fairness, I think Linus himself was a little confused when he
> > separated off use_zero_page(): I think we've all got confused around
> > there (as we noticed a month or so ago when discussing its hugetlb
> > equivalent), and I need to think it through again at last.
> > 
> > I'll get on to it now.
> > 
> 
> Thank you for comments. I'll go to a trip until Aug/17, programming-camp,

Sorry for not getting to all this sooner, yes I'd seen your warning in
another mail that you'd be away, but I just didn't get here fast enough.

> I'll be able to consider this patch and the whole things aroung paging in calm
> enviroment. I'll try to restart from scratch.

What I'm saying above is that I'd much prefer to try doing the patch
myself and have you review that, than us to keep on going back and
forth with different versions like this.

I am not confident that you've grasped all the issues, and I am sure
that there's a least one issue which I have not grasped: maybe it'll
end up irrelevant, but I do need to understand Linus's Fix ZERO_PAGE
breakage with vmware, 672ca28e300c17bf8d792a2a7a8631193e580c74

But I'll discuss that separately, probably offlist; or, if I'm lucky,
just composing the mail will bring me to answer my own question!

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 0/2] ZERO PAGE again v3.
@ 2009-07-09  3:24 KAMEZAWA Hiroyuki
  2009-07-16  9:01 ` [PATCH 0/2] ZERO PAGE again v4 KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 10+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-09  3:24 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, npiggin, hugh.dickins, avi, akpm, torvalds, aarcange

After v2 discussion, I felt that "Go" sign can be given if implemetaion is neat
and tiny and overhead seems very small. Here is v3.

In this version,

 - use pte_special() in vm_normal_page()
   All ZERO_PAGE check will go down to vm_normal_page() and check is done here.
   Some new flags in follow_page() and get_user_pages().

 - per arch use-zero-page config is added.
   IIUC, archs which have _PAGE_SPECIAL is only x86, powerpc, s390.
   Because this patch make use of pte_special() check, config to use zero page
   is added and you can turn it off if necessary.
   I this patch, only x86 is turned on which I can test.

Any comments are welcome. 

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 0/2] ZERO PAGE again v4.
  2009-07-09  3:24 [PATCH 0/2] ZERO PAGE again v3 KAMEZAWA Hiroyuki
@ 2009-07-16  9:01 ` KAMEZAWA Hiroyuki
  2009-07-16  9:04   ` [PATCH 2/2] ZERO PAGE based on pte_special KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 10+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-16  9:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, npiggin, hugh.dickins, avi, akpm,
	torvalds, aarcange


Rebased onto  mm-of-the-moment snapshot 2009-07-15-20-57.
And modifeied to make vm_normal_page() eat FOLL_NOZERO, directly.

Any comments ?

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 2/2] ZERO PAGE based on pte_special
  2009-07-16  9:01 ` [PATCH 0/2] ZERO PAGE again v4 KAMEZAWA Hiroyuki
@ 2009-07-16  9:04   ` KAMEZAWA Hiroyuki
  2009-07-16 12:00     ` Minchan Kim
  0 siblings, 1 reply; 10+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-16  9:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, npiggin, hugh.dickins, avi, akpm,
	torvalds, aarcange

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

ZERO_PAGE for anonymous private mapping is useful when an application
requires large continuous memory but write sparsely or some other usages.
It was removed in 2.6.24 but this patch tries to re-add it.
(Because there are some use cases..)

In past, ZERO_PAGE was removed because heavy cache line contention in
ZERO_PAGE's refcounting, this version of ZERO_PAGE avoid to refcnt it.
Then, implementation is changed as following.

  - Use of ZERO_PAGE is limited to PRIVATE mapping. Then, VM_MAYSHARE is
    checked as VM_SHARED.

  - pte_special(), _PAGE_SPECIAL bit in pte is used for indicating ZERO_PAGE.

  - vm_normal_page() eats FOLL_XXX flag. If FOLL_NOZERO is set,
    NULL is returned even if ZERO_PAGE is found.

  - __get_user_pages() eats one more flag as GUP_FLAGS_IGNORE_ZERO. If set,
    __get_user_page() returns NULL even if ZERO_PAGE is found.

Note:
  - no changes to get_user_pages(). ZERO_PAGE can be returned when
    vma is ANONYMOUS && PRIVATE and the access is READ.

Changelog v3->v4
 - FOLL_NOZERO is directly passed to vm_normal_page()

Changelog v2->v3
 - totally renewed.
 - use pte_special()
 - added new argument to vm_normal_page().
 - MAYSHARE is checked.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/proc/task_mmu.c |    8 +--
 include/linux/mm.h |    3 -
 mm/fremap.c        |    2 
 mm/internal.h      |    1 
 mm/memory.c        |  136 +++++++++++++++++++++++++++++++++++++++++------------
 mm/mempolicy.c     |    8 +--
 mm/migrate.c       |    6 +-
 mm/mlock.c         |    2 
 mm/rmap.c          |    6 +-
 9 files changed, 128 insertions(+), 44 deletions(-)

Index: mmotm-2.6.31-Jul15/mm/memory.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/memory.c
+++ mmotm-2.6.31-Jul15/mm/memory.c
@@ -442,6 +442,27 @@ static inline int is_cow_mapping(unsigne
 }
 
 /*
+ * Can we use ZERO_PAGE at fault ? or Can we do the FOLL_ANON optimization ?
+ */
+static inline int use_zero_page(struct vm_area_struct *vma)
+{
+	/*
+	 * We don't want to optimize FOLL_ANON for make_pages_present()
+	 * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
+	 * we want to get the page from the page tables to make sure
+	 * that we serialize and update with any other user of that
+	 * mapping. At doing page fault, VM_MAYSHARE should be also check for
+	 * avoiding possible changes to VM_SHARED.
+	 */
+	if (vma->vm_flags & (VM_LOCKED | VM_SHARED | VM_MAYSHARE))
+		return 0;
+	/*
+	 * And if we have a fault routine, it's not an anonymous region.
+	 */
+	return !vma->vm_ops || !vma->vm_ops->fault;
+}
+
+/*
  * vm_normal_page -- This function gets the "struct page" associated with a pte.
  *
  * "Special" mappings do not wish to be associated with a "struct page" (either
@@ -488,16 +509,33 @@ static inline int is_cow_mapping(unsigne
 #else
 # define HAVE_PTE_SPECIAL 0
 #endif
+
+#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
+# define HAVE_ANON_ZERO 1
+#else
+# define HAVE_ANON_ZERO 0
+#endif
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-				pte_t pte)
+			    pte_t pte, unsigned int flags)
 {
 	unsigned long pfn = pte_pfn(pte);
 
 	if (HAVE_PTE_SPECIAL) {
 		if (likely(!pte_special(pte)))
 			goto check_pfn;
-		if (!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
-			print_bad_pte(vma, addr, pte, NULL);
+
+		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+			return NULL;
+		/*
+		 * ZERO PAGE ? If vma is shared or has page fault handler,
+		 * Using ZERO PAGE is bug.
+		 */
+		if (HAVE_ANON_ZERO && use_zero_page(vma)) {
+			if (flags & FOLL_NOZERO)
+				return NULL;
+			return ZERO_PAGE(0);
+		}
+		print_bad_pte(vma, addr, pte, NULL);
 		return NULL;
 	}
 
@@ -591,8 +629,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	if (vm_flags & VM_SHARED)
 		pte = pte_mkclean(pte);
 	pte = pte_mkold(pte);
-
-	page = vm_normal_page(vma, addr, pte);
+	page = vm_normal_page(vma, addr, pte, FOLL_NOZERO);
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page, vma, addr);
@@ -783,7 +820,7 @@ static unsigned long zap_pte_range(struc
 		if (pte_present(ptent)) {
 			struct page *page;
 
-			page = vm_normal_page(vma, addr, ptent);
+			page = vm_normal_page(vma, addr, ptent, FOLL_NOZERO);
 			if (unlikely(details) && page) {
 				/*
 				 * unmap_shared_mapping_pages() wants to
@@ -1141,7 +1178,7 @@ struct page *follow_page(struct vm_area_
 		goto no_page;
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;
-	page = vm_normal_page(vma, address, pte);
+	page = vm_normal_page(vma, address, pte, flags);
 	if (unlikely(!page))
 		goto bad_page;
 
@@ -1186,23 +1223,6 @@ no_page_table:
 	return page;
 }
 
-/* Can we do the FOLL_ANON optimization? */
-static inline int use_zero_page(struct vm_area_struct *vma)
-{
-	/*
-	 * We don't want to optimize FOLL_ANON for make_pages_present()
-	 * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
-	 * we want to get the page from the page tables to make sure
-	 * that we serialize and update with any other user of that
-	 * mapping.
-	 */
-	if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
-		return 0;
-	/*
-	 * And if we have a fault routine, it's not an anonymous region.
-	 */
-	return !vma->vm_ops || !vma->vm_ops->fault;
-}
 
 
 
@@ -1216,6 +1236,7 @@ int __get_user_pages(struct task_struct 
 	int force = !!(flags & GUP_FLAGS_FORCE);
 	int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
 	int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);
+	int ignore_zero = !!(flags & GUP_FLAGS_IGNORE_ZERO);
 
 	if (nr_pages <= 0)
 		return 0;
@@ -1259,7 +1280,12 @@ int __get_user_pages(struct task_struct 
 				return i ? : -EFAULT;
 			}
 			if (pages) {
-				struct page *page = vm_normal_page(gate_vma, start, *pte);
+				struct page *page;
+				/*
+				 * this is not anon vma...don't haddle zero page
+				 * related flags.
+				 */
+				page = vm_normal_page(gate_vma, start, *pte, 0);
 				pages[i] = page;
 				if (page)
 					get_page(page);
@@ -1287,8 +1313,13 @@ int __get_user_pages(struct task_struct 
 		foll_flags = FOLL_TOUCH;
 		if (pages)
 			foll_flags |= FOLL_GET;
-		if (!write && use_zero_page(vma))
-			foll_flags |= FOLL_ANON;
+		if (!write) {
+			if (use_zero_page(vma))
+				foll_flags |= FOLL_ANON;
+			else
+				ignore_zero = 0;
+		} else
+			ignore_zero = 0;
 
 		do {
 			struct page *page;
@@ -1307,9 +1338,17 @@ int __get_user_pages(struct task_struct 
 			if (write)
 				foll_flags |= FOLL_WRITE;
 
+			if (ignore_zero)
+				foll_flags |= FOLL_NOZERO;
+
 			cond_resched();
 			while (!(page = follow_page(vma, start, foll_flags))) {
 				int ret;
+				/*
+				 * When we ignore zero pages, no more ops to do.
+				 */
+				if (ignore_zero)
+					break;
 
 				ret = handle_mm_fault(mm, vma, start,
 					(foll_flags & FOLL_WRITE) ?
@@ -1953,10 +1992,10 @@ static int do_wp_page(struct mm_struct *
 	int page_mkwrite = 0;
 	struct page *dirty_page = NULL;
 
-	old_page = vm_normal_page(vma, address, orig_pte);
+	old_page = vm_normal_page(vma, address, orig_pte, FOLL_NOZERO);
 	if (!old_page) {
 		/*
-		 * VM_MIXEDMAP !pfn_valid() case
+		 * VM_MIXEDMAP !pfn_valid() case or ZERO_PAGE cases.
 		 *
 		 * We should not cow pages in a shared writeable mapping.
 		 * Just mark the pages writable as we can't do any dirty
@@ -2610,6 +2649,41 @@ out_page:
 	return ret;
 }
 
+#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
+static bool do_anon_zeromap(struct mm_struct *mm, struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmd)
+{
+	spinlock_t *ptl;
+	pte_t entry;
+	pte_t *page_table;
+	bool ret = false;
+
+	if (!use_zero_page(vma))
+		return ret;
+	/*
+	 * We use _PAGE_SPECIAL bit in pte to indicate this page is ZERO PAGE.
+	 */
+	entry = pte_mkspecial(mk_pte(ZERO_PAGE(0), vma->vm_page_prot));
+	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_none(*page_table))
+		goto out_unlock;
+	set_pte_at(mm, address, page_table, entry);
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(vma, address, entry);
+	ret = true;
+out_unlock:
+	pte_unmap_unlock(page_table, ptl);
+	return ret;
+}
+#else
+static bool do_anon_zeromap(struct mm_struct *mm, struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmd)
+{
+	/* We don't use ZERO PAGE */
+	return false;
+}
+#endif /* CONFIG_SUPPORT_ANON_ZERO_PAGE */
+
 /*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -2626,6 +2700,10 @@ static int do_anonymous_page(struct mm_s
 	/* Allocate our own private page. */
 	pte_unmap(page_table);
 
+	if (unlikely(!(flags & FAULT_FLAG_WRITE)))
+		if (do_anon_zeromap(mm, vma, address, pmd))
+			return 0;
+
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
 	page = alloc_zeroed_user_highpage_movable(vma, address);
Index: mmotm-2.6.31-Jul15/mm/fremap.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/fremap.c
+++ mmotm-2.6.31-Jul15/mm/fremap.c
@@ -33,7 +33,7 @@ static void zap_pte(struct mm_struct *mm
 
 		flush_cache_page(vma, addr, pte_pfn(pte));
 		pte = ptep_clear_flush(vma, addr, ptep);
-		page = vm_normal_page(vma, addr, pte);
+		page = vm_normal_page(vma, addr, pte, FOLL_NOZERO);
 		if (page) {
 			if (pte_dirty(pte))
 				set_page_dirty(page);
Index: mmotm-2.6.31-Jul15/mm/mempolicy.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/mempolicy.c
+++ mmotm-2.6.31-Jul15/mm/mempolicy.c
@@ -404,13 +404,13 @@ static int check_pte_range(struct vm_are
 
 		if (!pte_present(*pte))
 			continue;
-		page = vm_normal_page(vma, addr, *pte);
+		/* we avoid zero page here */
+		page = vm_normal_page(vma, addr, *pte, FOLL_NOZERO);
 		if (!page)
 			continue;
 		/*
-		 * The check for PageReserved here is important to avoid
-		 * handling zero pages and other pages that may have been
-		 * marked special by the system.
+		 * The check for PageReserved here is imortant to avoid pages
+		 * that may have been marked special by the system.
 		 *
 		 * If the PageReserved would not be checked here then f.e.
 		 * the location of the zero page could have an influence
Index: mmotm-2.6.31-Jul15/mm/rmap.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/rmap.c
+++ mmotm-2.6.31-Jul15/mm/rmap.c
@@ -946,7 +946,11 @@ static int try_to_unmap_cluster(unsigned
 	for (; address < end; pte++, address += PAGE_SIZE) {
 		if (!pte_present(*pte))
 			continue;
-		page = vm_normal_page(vma, address, *pte);
+		/*
+		 * Because we comes from try_to_unmap_file(), we'll never see
+		 * ZERO_PAGE or ANON.
+		 */
+		page = vm_normal_page(vma, address, *pte, FOLL_NOZERO);
 		BUG_ON(!page || PageAnon(page));
 
 		if (locked_vma) {
Index: mmotm-2.6.31-Jul15/include/linux/mm.h
===================================================================
--- mmotm-2.6.31-Jul15.orig/include/linux/mm.h
+++ mmotm-2.6.31-Jul15/include/linux/mm.h
@@ -753,7 +753,7 @@ struct zap_details {
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-		pte_t pte);
+		pte_t pte, unsigned int flags);
 
 int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size);
@@ -1245,6 +1245,7 @@ struct page *follow_page(struct vm_area_
 #define FOLL_TOUCH	0x02	/* mark page accessed */
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_ANON	0x08	/* give ZERO_PAGE if no pgtable */
+#define FOLL_NOZERO	0x10	/* returns NULL if ZERO_PAGE is found */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
Index: mmotm-2.6.31-Jul15/mm/internal.h
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/internal.h
+++ mmotm-2.6.31-Jul15/mm/internal.h
@@ -254,6 +254,7 @@ static inline void mminit_validate_memmo
 #define GUP_FLAGS_FORCE                  0x2
 #define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
 #define GUP_FLAGS_IGNORE_SIGKILL         0x8
+#define GUP_FLAGS_IGNORE_ZERO         	 0x10
 
 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int len, int flags,
Index: mmotm-2.6.31-Jul15/mm/migrate.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/migrate.c
+++ mmotm-2.6.31-Jul15/mm/migrate.c
@@ -850,7 +850,7 @@ static int do_move_page_to_node_array(st
 		if (!vma || !vma_migratable(vma))
 			goto set_status;
 
-		page = follow_page(vma, pp->addr, FOLL_GET);
+		page = follow_page(vma, pp->addr, FOLL_GET | FOLL_NOZERO);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
@@ -1007,14 +1007,14 @@ static void do_pages_stat_array(struct m
 		if (!vma)
 			goto set_status;
 
-		page = follow_page(vma, addr, 0);
+		page = follow_page(vma, addr, FOLL_NOZERO);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
 			goto set_status;
 
 		err = -ENOENT;
-		/* Use PageReserved to check for zero page */
+		/* if zero page, page is NULL. */
 		if (!page || PageReserved(page))
 			goto set_status;
 
Index: mmotm-2.6.31-Jul15/mm/mlock.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/mlock.c
+++ mmotm-2.6.31-Jul15/mm/mlock.c
@@ -162,7 +162,7 @@ static long __mlock_vma_pages_range(stru
 	struct page *pages[16]; /* 16 gives a reasonable batch */
 	int nr_pages = (end - start) / PAGE_SIZE;
 	int ret = 0;
-	int gup_flags = 0;
+	int gup_flags = GUP_FLAGS_IGNORE_ZERO;
 
 	VM_BUG_ON(start & ~PAGE_MASK);
 	VM_BUG_ON(end   & ~PAGE_MASK);
Index: mmotm-2.6.31-Jul15/fs/proc/task_mmu.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/fs/proc/task_mmu.c
+++ mmotm-2.6.31-Jul15/fs/proc/task_mmu.c
@@ -361,8 +361,8 @@ static int smaps_pte_range(pmd_t *pmd, u
 			continue;
 
 		mss->resident += PAGE_SIZE;
-
-		page = vm_normal_page(vma, addr, ptent);
+		/* we ignore zero page */
+		page = vm_normal_page(vma, addr, ptent, FOLL_NOZERO);
 		if (!page)
 			continue;
 
@@ -469,8 +469,8 @@ static int clear_refs_pte_range(pmd_t *p
 		ptent = *pte;
 		if (!pte_present(ptent))
 			continue;
-
-		page = vm_normal_page(vma, addr, ptent);
+		/* we ignore zero page */
+		page = vm_normal_page(vma, addr, ptent, FOLL_NOZERO);
 		if (!page)
 			continue;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] ZERO PAGE based on pte_special
  2009-07-16  9:04   ` [PATCH 2/2] ZERO PAGE based on pte_special KAMEZAWA Hiroyuki
@ 2009-07-16 12:00     ` Minchan Kim
  2009-07-16 13:02       ` KAMEZAWA Hiroyuki
  2009-07-17  0:38       ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 10+ messages in thread
From: Minchan Kim @ 2009-07-16 12:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, npiggin, hugh.dickins, avi, akpm,
	torvalds, aarcange


Hi, Kame. 

On Thu, 16 Jul 2009 18:04:24 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> ZERO_PAGE for anonymous private mapping is useful when an application
> requires large continuous memory but write sparsely or some other usages.
> It was removed in 2.6.24 but this patch tries to re-add it.
> (Because there are some use cases..)
> 
> In past, ZERO_PAGE was removed because heavy cache line contention in
> ZERO_PAGE's refcounting, this version of ZERO_PAGE avoid to refcnt it.
> Then, implementation is changed as following.
> 
>   - Use of ZERO_PAGE is limited to PRIVATE mapping. Then, VM_MAYSHARE is
>     checked as VM_SHARED.
> 
>   - pte_special(), _PAGE_SPECIAL bit in pte is used for indicating ZERO_PAGE.
> 
>   - vm_normal_page() eats FOLL_XXX flag. If FOLL_NOZERO is set,
>     NULL is returned even if ZERO_PAGE is found.
> 
>   - __get_user_pages() eats one more flag as GUP_FLAGS_IGNORE_ZERO. If set,
>     __get_user_page() returns NULL even if ZERO_PAGE is found.
> 
> Note:
>   - no changes to get_user_pages(). ZERO_PAGE can be returned when
>     vma is ANONYMOUS && PRIVATE and the access is READ.
> 
> Changelog v3->v4
>  - FOLL_NOZERO is directly passed to vm_normal_page()
> 
> Changelog v2->v3
>  - totally renewed.
>  - use pte_special()
>  - added new argument to vm_normal_page().
>  - MAYSHARE is checked.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  fs/proc/task_mmu.c |    8 +--
>  include/linux/mm.h |    3 -
>  mm/fremap.c        |    2 
>  mm/internal.h      |    1 
>  mm/memory.c        |  136 +++++++++++++++++++++++++++++++++++++++++------------
>  mm/mempolicy.c     |    8 +--
>  mm/migrate.c       |    6 +-
>  mm/mlock.c         |    2 
>  mm/rmap.c          |    6 +-
>  9 files changed, 128 insertions(+), 44 deletions(-)
> 
> Index: mmotm-2.6.31-Jul15/mm/memory.c
> ===================================================================
> --- mmotm-2.6.31-Jul15.orig/mm/memory.c
> +++ mmotm-2.6.31-Jul15/mm/memory.c
> @@ -442,6 +442,27 @@ static inline int is_cow_mapping(unsigne
>  }
>  
>  /*
> + * Can we use ZERO_PAGE at fault ? or Can we do the FOLL_ANON optimization ?
> + */
> +static inline int use_zero_page(struct vm_area_struct *vma)
> +{
> +	/*
> +	 * We don't want to optimize FOLL_ANON for make_pages_present()
> +	 * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
> +	 * we want to get the page from the page tables to make sure
> +	 * that we serialize and update with any other user of that
> +	 * mapping. At doing page fault, VM_MAYSHARE should be also check for
> +	 * avoiding possible changes to VM_SHARED.
> +	 */
> +	if (vma->vm_flags & (VM_LOCKED | VM_SHARED | VM_MAYSHARE))
> +		return 0;
> +	/*
> +	 * And if we have a fault routine, it's not an anonymous region.
> +	 */
> +	return !vma->vm_ops || !vma->vm_ops->fault;
> +}
> +
> +/*
>   * vm_normal_page -- This function gets the "struct page" associated with a pte.
>   *
>   * "Special" mappings do not wish to be associated with a "struct page" (either
> @@ -488,16 +509,33 @@ static inline int is_cow_mapping(unsigne
>  #else
>  # define HAVE_PTE_SPECIAL 0
>  #endif
> +
> +#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
> +# define HAVE_ANON_ZERO 1
> +#else
> +# define HAVE_ANON_ZERO 0
> +#endif
>  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> -				pte_t pte)
> +			    pte_t pte, unsigned int flags)
>  {
>  	unsigned long pfn = pte_pfn(pte);
>  
>  	if (HAVE_PTE_SPECIAL) {
>  		if (likely(!pte_special(pte)))
>  			goto check_pfn;
> -		if (!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
> -			print_bad_pte(vma, addr, pte, NULL);
> +
> +		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> +			return NULL;
> +		/*
> +		 * ZERO PAGE ? If vma is shared or has page fault handler,
> +		 * Using ZERO PAGE is bug.
> +		 */
> +		if (HAVE_ANON_ZERO && use_zero_page(vma)) {
> +			if (flags & FOLL_NOZERO)
> +				return NULL;
> +			return ZERO_PAGE(0);
> +		}
> +		print_bad_pte(vma, addr, pte, NULL);
>  		return NULL;
>  	}
>  
> @@ -591,8 +629,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
>  	if (vm_flags & VM_SHARED)
>  		pte = pte_mkclean(pte);
>  	pte = pte_mkold(pte);
> -
> -	page = vm_normal_page(vma, addr, pte);
> +	page = vm_normal_page(vma, addr, pte, FOLL_NOZERO);
>  	if (page) {
>  		get_page(page);
>  		page_dup_rmap(page, vma, addr);
> @@ -783,7 +820,7 @@ static unsigned long zap_pte_range(struc
>  		if (pte_present(ptent)) {
>  			struct page *page;
>  
> -			page = vm_normal_page(vma, addr, ptent);
> +			page = vm_normal_page(vma, addr, ptent, FOLL_NOZERO);
>  			if (unlikely(details) && page) {
>  				/*
>  				 * unmap_shared_mapping_pages() wants to
> @@ -1141,7 +1178,7 @@ struct page *follow_page(struct vm_area_
>  		goto no_page;
>  	if ((flags & FOLL_WRITE) && !pte_write(pte))
>  		goto unlock;
> -	page = vm_normal_page(vma, address, pte);
> +	page = vm_normal_page(vma, address, pte, flags);
>  	if (unlikely(!page))
>  		goto bad_page;
>  
> @@ -1186,23 +1223,6 @@ no_page_table:
>  	return page;
>  }
>  
> -/* Can we do the FOLL_ANON optimization? */
> -static inline int use_zero_page(struct vm_area_struct *vma)
> -{
> -	/*
> -	 * We don't want to optimize FOLL_ANON for make_pages_present()
> -	 * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
> -	 * we want to get the page from the page tables to make sure
> -	 * that we serialize and update with any other user of that
> -	 * mapping.
> -	 */
> -	if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
> -		return 0;
> -	/*
> -	 * And if we have a fault routine, it's not an anonymous region.
> -	 */
> -	return !vma->vm_ops || !vma->vm_ops->fault;
> -}
>  
>  
>  
> @@ -1216,6 +1236,7 @@ int __get_user_pages(struct task_struct 
>  	int force = !!(flags & GUP_FLAGS_FORCE);
>  	int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
>  	int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);
> +	int ignore_zero = !!(flags & GUP_FLAGS_IGNORE_ZERO);
>  
>  	if (nr_pages <= 0)
>  		return 0;
> @@ -1259,7 +1280,12 @@ int __get_user_pages(struct task_struct 
>  				return i ? : -EFAULT;
>  			}
>  			if (pages) {
> -				struct page *page = vm_normal_page(gate_vma, start, *pte);
> +				struct page *page;
> +				/*
> +				 * this is not anon vma...don't haddle zero page
> +				 * related flags.
> +				 */
> +				page = vm_normal_page(gate_vma, start, *pte, 0);
>  				pages[i] = page;
>  				if (page)
>  					get_page(page);
> @@ -1287,8 +1313,13 @@ int __get_user_pages(struct task_struct 
>  		foll_flags = FOLL_TOUCH;
>  		if (pages)
>  			foll_flags |= FOLL_GET;
> -		if (!write && use_zero_page(vma))
> -			foll_flags |= FOLL_ANON;
> +		if (!write) {
> +			if (use_zero_page(vma))
> +				foll_flags |= FOLL_ANON;
> +			else
> +				ignore_zero = 0;
> +		} else
> +			ignore_zero = 0;

Hmm. nested condition is not good for redabililty. 

How about this ?
if (!write && use_zero_page(vma))
	foll_flags |= FOLL_ANON;
else
	ignore_zero = 0;



-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] ZERO PAGE based on pte_special
  2009-07-16 12:00     ` Minchan Kim
@ 2009-07-16 13:02       ` KAMEZAWA Hiroyuki
  2009-07-17  0:38       ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 10+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-16 13:02 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, npiggin, hugh.dickins,
	avi, akpm, torvalds, aarcange

Minchan Kim wrote:
>
> Hi, Kame.
>
> On Thu, 16 Jul 2009 18:04:24 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>> ZERO_PAGE for anonymous private mapping is useful when an application
>> requires large continuous memory but write sparsely or some other
>> usages.
>> It was removed in 2.6.24 but this patch tries to re-add it.
>> (Because there are some use cases..)
>>
>> In past, ZERO_PAGE was removed because heavy cache line contention in
>> ZERO_PAGE's refcounting, this version of ZERO_PAGE avoid to refcnt it.
>> Then, implementation is changed as following.
>>
>>   - Use of ZERO_PAGE is limited to PRIVATE mapping. Then, VM_MAYSHARE is
>>     checked as VM_SHARED.
>>
>>   - pte_special(), _PAGE_SPECIAL bit in pte is used for indicating
>> ZERO_PAGE.
>>
>>   - vm_normal_page() eats FOLL_XXX flag. If FOLL_NOZERO is set,
>>     NULL is returned even if ZERO_PAGE is found.
>>
>>   - __get_user_pages() eats one more flag as GUP_FLAGS_IGNORE_ZERO. If
>> set,
>>     __get_user_page() returns NULL even if ZERO_PAGE is found.
>>
>> Note:
>>   - no changes to get_user_pages(). ZERO_PAGE can be returned when
>>     vma is ANONYMOUS && PRIVATE and the access is READ.
>>
>> Changelog v3->v4
>>  - FOLL_NOZERO is directly passed to vm_normal_page()
>>
>> Changelog v2->v3
>>  - totally renewed.
>>  - use pte_special()
>>  - added new argument to vm_normal_page().
>>  - MAYSHARE is checked.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>>  fs/proc/task_mmu.c |    8 +--
>>  include/linux/mm.h |    3 -
>>  mm/fremap.c        |    2
>>  mm/internal.h      |    1
>>  mm/memory.c        |  136
>> +++++++++++++++++++++++++++++++++++++++++------------
>>  mm/mempolicy.c     |    8 +--
>>  mm/migrate.c       |    6 +-
>>  mm/mlock.c         |    2
>>  mm/rmap.c          |    6 +-
>>  9 files changed, 128 insertions(+), 44 deletions(-)
>>
>> Index: mmotm-2.6.31-Jul15/mm/memory.c
>> ===================================================================
>> --- mmotm-2.6.31-Jul15.orig/mm/memory.c
>> +++ mmotm-2.6.31-Jul15/mm/memory.c
>> @@ -442,6 +442,27 @@ static inline int is_cow_mapping(unsigne
>>  }
>>
>>  /*
>> + * Can we use ZERO_PAGE at fault ? or Can we do the FOLL_ANON
>> optimization ?
>> + */
>> +static inline int use_zero_page(struct vm_area_struct *vma)
>> +{
>> +	/*
>> +	 * We don't want to optimize FOLL_ANON for make_pages_present()
>> +	 * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
>> +	 * we want to get the page from the page tables to make sure
>> +	 * that we serialize and update with any other user of that
>> +	 * mapping. At doing page fault, VM_MAYSHARE should be also check for
>> +	 * avoiding possible changes to VM_SHARED.
>> +	 */
>> +	if (vma->vm_flags & (VM_LOCKED | VM_SHARED | VM_MAYSHARE))
>> +		return 0;
>> +	/*
>> +	 * And if we have a fault routine, it's not an anonymous region.
>> +	 */
>> +	return !vma->vm_ops || !vma->vm_ops->fault;
>> +}
>> +
>> +/*
>>   * vm_normal_page -- This function gets the "struct page" associated
>> with a pte.
>>   *
>>   * "Special" mappings do not wish to be associated with a "struct page"
>> (either
>> @@ -488,16 +509,33 @@ static inline int is_cow_mapping(unsigne
>>  #else
>>  # define HAVE_PTE_SPECIAL 0
>>  #endif
>> +
>> +#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
>> +# define HAVE_ANON_ZERO 1
>> +#else
>> +# define HAVE_ANON_ZERO 0
>> +#endif
>>  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long
>> addr,
>> -				pte_t pte)
>> +			    pte_t pte, unsigned int flags)
>>  {
>>  	unsigned long pfn = pte_pfn(pte);
>>
>>  	if (HAVE_PTE_SPECIAL) {
>>  		if (likely(!pte_special(pte)))
>>  			goto check_pfn;
>> -		if (!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
>> -			print_bad_pte(vma, addr, pte, NULL);
>> +
>> +		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
>> +			return NULL;
>> +		/*
>> +		 * ZERO PAGE ? If vma is shared or has page fault handler,
>> +		 * Using ZERO PAGE is bug.
>> +		 */
>> +		if (HAVE_ANON_ZERO && use_zero_page(vma)) {
>> +			if (flags & FOLL_NOZERO)
>> +				return NULL;
>> +			return ZERO_PAGE(0);
>> +		}
>> +		print_bad_pte(vma, addr, pte, NULL);
>>  		return NULL;
>>  	}
>>
>> @@ -591,8 +629,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
>>  	if (vm_flags & VM_SHARED)
>>  		pte = pte_mkclean(pte);
>>  	pte = pte_mkold(pte);
>> -
>> -	page = vm_normal_page(vma, addr, pte);
>> +	page = vm_normal_page(vma, addr, pte, FOLL_NOZERO);
>>  	if (page) {
>>  		get_page(page);
>>  		page_dup_rmap(page, vma, addr);
>> @@ -783,7 +820,7 @@ static unsigned long zap_pte_range(struc
>>  		if (pte_present(ptent)) {
>>  			struct page *page;
>>
>> -			page = vm_normal_page(vma, addr, ptent);
>> +			page = vm_normal_page(vma, addr, ptent, FOLL_NOZERO);
>>  			if (unlikely(details) && page) {
>>  				/*
>>  				 * unmap_shared_mapping_pages() wants to
>> @@ -1141,7 +1178,7 @@ struct page *follow_page(struct vm_area_
>>  		goto no_page;
>>  	if ((flags & FOLL_WRITE) && !pte_write(pte))
>>  		goto unlock;
>> -	page = vm_normal_page(vma, address, pte);
>> +	page = vm_normal_page(vma, address, pte, flags);
>>  	if (unlikely(!page))
>>  		goto bad_page;
>>
>> @@ -1186,23 +1223,6 @@ no_page_table:
>>  	return page;
>>  }
>>
>> -/* Can we do the FOLL_ANON optimization? */
>> -static inline int use_zero_page(struct vm_area_struct *vma)
>> -{
>> -	/*
>> -	 * We don't want to optimize FOLL_ANON for make_pages_present()
>> -	 * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
>> -	 * we want to get the page from the page tables to make sure
>> -	 * that we serialize and update with any other user of that
>> -	 * mapping.
>> -	 */
>> -	if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
>> -		return 0;
>> -	/*
>> -	 * And if we have a fault routine, it's not an anonymous region.
>> -	 */
>> -	return !vma->vm_ops || !vma->vm_ops->fault;
>> -}
>>
>>
>>
>> @@ -1216,6 +1236,7 @@ int __get_user_pages(struct task_struct
>>  	int force = !!(flags & GUP_FLAGS_FORCE);
>>  	int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
>>  	int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);
>> +	int ignore_zero = !!(flags & GUP_FLAGS_IGNORE_ZERO);
>>
>>  	if (nr_pages <= 0)
>>  		return 0;
>> @@ -1259,7 +1280,12 @@ int __get_user_pages(struct task_struct
>>  				return i ? : -EFAULT;
>>  			}
>>  			if (pages) {
>> -				struct page *page = vm_normal_page(gate_vma, start, *pte);
>> +				struct page *page;
>> +				/*
>> +				 * this is not anon vma...don't haddle zero page
>> +				 * related flags.
>> +				 */
>> +				page = vm_normal_page(gate_vma, start, *pte, 0);
>>  				pages[i] = page;
>>  				if (page)
>>  					get_page(page);
>> @@ -1287,8 +1313,13 @@ int __get_user_pages(struct task_struct
>>  		foll_flags = FOLL_TOUCH;
>>  		if (pages)
>>  			foll_flags |= FOLL_GET;
>> -		if (!write && use_zero_page(vma))
>> -			foll_flags |= FOLL_ANON;
>> +		if (!write) {
>> +			if (use_zero_page(vma))
>> +				foll_flags |= FOLL_ANON;
>> +			else
>> +				ignore_zero = 0;
>> +		} else
>> +			ignore_zero = 0;
>
> Hmm. nested condition is not good for redabililty.
>
> How about this ?
> if (!write && use_zero_page(vma))
> 	foll_flags |= FOLL_ANON;
> else
> 	ignore_zero = 0;
>
Ah, yes. yours seems better.
I'll post updated one, later.

Thank you
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] ZERO PAGE based on pte_special
  2009-07-16 12:00     ` Minchan Kim
  2009-07-16 13:02       ` KAMEZAWA Hiroyuki
@ 2009-07-17  0:38       ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 10+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-17  0:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, npiggin, hugh.dickins, avi, akpm,
	torvalds, aarcange

On Thu, 16 Jul 2009 21:00:28 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:
> Hmm. nested condition is not good for redabililty. 
> 
> How about this ?
> if (!write && use_zero_page(vma))
> 	foll_flags |= FOLL_ANON;
> else
> 	ignore_zero = 0;
> 
Ok, here is v4.1
Thanks,
-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

ZERO_PAGE for anonymous private mapping is useful when an application
requires large continuous memory but write sparsely or some other usages.
It was removed in 2.6.24 but this patch tries to re-add it.
(Because there are some use cases..)

In past, ZERO_PAGE was removed because heavy cache line contention in
ZERO_PAGE's refcounting, this version of ZERO_PAGE avoid to refcnt it.
Then, implementation is changed as following.

  - Use of ZERO_PAGE is limited to PRIVATE mapping. Then, VM_MAYSHARE is
    checked as VM_SHARED.

  - pte_special(), _PAGE_SPECIAL bit in pte is used for indicating ZERO_PAGE.

  - vm_normal_page() eats FOLL_XXX flag. If FOLL_NOZERO is set,
    NULL is returned even if ZERO_PAGE is found.

  - __get_user_pages() eats one more flag as GUP_FLAGS_IGNORE_ZERO. If set,
    __get_user_page() returns NULL even if ZERO_PAGE is found.

Note:
  - no changes to get_user_pages(). ZERO_PAGE can be returned when
    vma is ANONYMOUS && PRIVATE and the access is READ.

Changelog v4 -> v4.1
 - removed nexted "if" in get_user_pages() for readability

Changelog v3->v4
 - FOLL_NOZERO is directly passed to vm_normal_page()

Changelog v2->v3
 - totally renewed.
 - use pte_special()
 - added new argument to vm_normal_page().
 - MAYSHARE is checked.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/proc/task_mmu.c |    8 +--
 include/linux/mm.h |    3 -
 mm/fremap.c        |    2 
 mm/internal.h      |    1 
 mm/memory.c        |  129 +++++++++++++++++++++++++++++++++++++++++------------
 mm/mempolicy.c     |    8 +--
 mm/migrate.c       |    6 +-
 mm/mlock.c         |    2 
 mm/rmap.c          |    6 ++
 9 files changed, 123 insertions(+), 42 deletions(-)

Index: mmotm-2.6.31-Jul15/mm/memory.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/memory.c
+++ mmotm-2.6.31-Jul15/mm/memory.c
@@ -442,6 +442,27 @@ static inline int is_cow_mapping(unsigne
 }
 
 /*
+ * Can we use ZERO_PAGE at fault ? or Can we do the FOLL_ANON optimization ?
+ */
+static inline int use_zero_page(struct vm_area_struct *vma)
+{
+	/*
+	 * We don't want to optimize FOLL_ANON for make_pages_present()
+	 * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
+	 * we want to get the page from the page tables to make sure
+	 * that we serialize and update with any other user of that
+	 * mapping. At doing page fault, VM_MAYSHARE should be also check for
+	 * avoiding possible changes to VM_SHARED.
+	 */
+	if (vma->vm_flags & (VM_LOCKED | VM_SHARED | VM_MAYSHARE))
+		return 0;
+	/*
+	 * And if we have a fault routine, it's not an anonymous region.
+	 */
+	return !vma->vm_ops || !vma->vm_ops->fault;
+}
+
+/*
  * vm_normal_page -- This function gets the "struct page" associated with a pte.
  *
  * "Special" mappings do not wish to be associated with a "struct page" (either
@@ -488,16 +509,33 @@ static inline int is_cow_mapping(unsigne
 #else
 # define HAVE_PTE_SPECIAL 0
 #endif
+
+#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
+# define HAVE_ANON_ZERO 1
+#else
+# define HAVE_ANON_ZERO 0
+#endif
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-				pte_t pte)
+			    pte_t pte, unsigned int flags)
 {
 	unsigned long pfn = pte_pfn(pte);
 
 	if (HAVE_PTE_SPECIAL) {
 		if (likely(!pte_special(pte)))
 			goto check_pfn;
-		if (!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
-			print_bad_pte(vma, addr, pte, NULL);
+
+		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+			return NULL;
+		/*
+		 * ZERO PAGE ? If vma is shared or has page fault handler,
+		 * Using ZERO PAGE is bug.
+		 */
+		if (HAVE_ANON_ZERO && use_zero_page(vma)) {
+			if (flags & FOLL_NOZERO)
+				return NULL;
+			return ZERO_PAGE(0);
+		}
+		print_bad_pte(vma, addr, pte, NULL);
 		return NULL;
 	}
 
@@ -591,8 +629,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	if (vm_flags & VM_SHARED)
 		pte = pte_mkclean(pte);
 	pte = pte_mkold(pte);
-
-	page = vm_normal_page(vma, addr, pte);
+	page = vm_normal_page(vma, addr, pte, FOLL_NOZERO);
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page, vma, addr);
@@ -783,7 +820,7 @@ static unsigned long zap_pte_range(struc
 		if (pte_present(ptent)) {
 			struct page *page;
 
-			page = vm_normal_page(vma, addr, ptent);
+			page = vm_normal_page(vma, addr, ptent, FOLL_NOZERO);
 			if (unlikely(details) && page) {
 				/*
 				 * unmap_shared_mapping_pages() wants to
@@ -1141,7 +1178,7 @@ struct page *follow_page(struct vm_area_
 		goto no_page;
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;
-	page = vm_normal_page(vma, address, pte);
+	page = vm_normal_page(vma, address, pte, flags);
 	if (unlikely(!page))
 		goto bad_page;
 
@@ -1186,23 +1223,6 @@ no_page_table:
 	return page;
 }
 
-/* Can we do the FOLL_ANON optimization? */
-static inline int use_zero_page(struct vm_area_struct *vma)
-{
-	/*
-	 * We don't want to optimize FOLL_ANON for make_pages_present()
-	 * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
-	 * we want to get the page from the page tables to make sure
-	 * that we serialize and update with any other user of that
-	 * mapping.
-	 */
-	if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
-		return 0;
-	/*
-	 * And if we have a fault routine, it's not an anonymous region.
-	 */
-	return !vma->vm_ops || !vma->vm_ops->fault;
-}
 
 
 
@@ -1216,6 +1236,7 @@ int __get_user_pages(struct task_struct 
 	int force = !!(flags & GUP_FLAGS_FORCE);
 	int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
 	int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);
+	int ignore_zero = !!(flags & GUP_FLAGS_IGNORE_ZERO);
 
 	if (nr_pages <= 0)
 		return 0;
@@ -1259,7 +1280,12 @@ int __get_user_pages(struct task_struct 
 				return i ? : -EFAULT;
 			}
 			if (pages) {
-				struct page *page = vm_normal_page(gate_vma, start, *pte);
+				struct page *page;
+				/*
+				 * this is not anon vma...don't haddle zero page
+				 * related flags.
+				 */
+				page = vm_normal_page(gate_vma, start, *pte, 0);
 				pages[i] = page;
 				if (page)
 					get_page(page);
@@ -1289,6 +1315,8 @@ int __get_user_pages(struct task_struct 
 			foll_flags |= FOLL_GET;
 		if (!write && use_zero_page(vma))
 			foll_flags |= FOLL_ANON;
+		else
+			ignore_zero = 0;
 
 		do {
 			struct page *page;
@@ -1307,9 +1335,17 @@ int __get_user_pages(struct task_struct 
 			if (write)
 				foll_flags |= FOLL_WRITE;
 
+			if (ignore_zero)
+				foll_flags |= FOLL_NOZERO;
+
 			cond_resched();
 			while (!(page = follow_page(vma, start, foll_flags))) {
 				int ret;
+				/*
+				 * When we ignore zero pages, no more ops to do.
+				 */
+				if (ignore_zero)
+					break;
 
 				ret = handle_mm_fault(mm, vma, start,
 					(foll_flags & FOLL_WRITE) ?
@@ -1953,10 +1989,10 @@ static int do_wp_page(struct mm_struct *
 	int page_mkwrite = 0;
 	struct page *dirty_page = NULL;
 
-	old_page = vm_normal_page(vma, address, orig_pte);
+	old_page = vm_normal_page(vma, address, orig_pte, FOLL_NOZERO);
 	if (!old_page) {
 		/*
-		 * VM_MIXEDMAP !pfn_valid() case
+		 * VM_MIXEDMAP !pfn_valid() case or ZERO_PAGE cases.
 		 *
 		 * We should not cow pages in a shared writeable mapping.
 		 * Just mark the pages writable as we can't do any dirty
@@ -2610,6 +2646,41 @@ out_page:
 	return ret;
 }
 
+#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
+static bool do_anon_zeromap(struct mm_struct *mm, struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmd)
+{
+	spinlock_t *ptl;
+	pte_t entry;
+	pte_t *page_table;
+	bool ret = false;
+
+	if (!use_zero_page(vma))
+		return ret;
+	/*
+	 * We use _PAGE_SPECIAL bit in pte to indicate this page is ZERO PAGE.
+	 */
+	entry = pte_mkspecial(mk_pte(ZERO_PAGE(0), vma->vm_page_prot));
+	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_none(*page_table))
+		goto out_unlock;
+	set_pte_at(mm, address, page_table, entry);
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(vma, address, entry);
+	ret = true;
+out_unlock:
+	pte_unmap_unlock(page_table, ptl);
+	return ret;
+}
+#else
+static bool do_anon_zeromap(struct mm_struct *mm, struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmd)
+{
+	/* We don't use ZERO PAGE */
+	return false;
+}
+#endif /* CONFIG_SUPPORT_ANON_ZERO_PAGE */
+
 /*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -2626,6 +2697,10 @@ static int do_anonymous_page(struct mm_s
 	/* Allocate our own private page. */
 	pte_unmap(page_table);
 
+	if (unlikely(!(flags & FAULT_FLAG_WRITE)))
+		if (do_anon_zeromap(mm, vma, address, pmd))
+			return 0;
+
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
 	page = alloc_zeroed_user_highpage_movable(vma, address);
Index: mmotm-2.6.31-Jul15/mm/fremap.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/fremap.c
+++ mmotm-2.6.31-Jul15/mm/fremap.c
@@ -33,7 +33,7 @@ static void zap_pte(struct mm_struct *mm
 
 		flush_cache_page(vma, addr, pte_pfn(pte));
 		pte = ptep_clear_flush(vma, addr, ptep);
-		page = vm_normal_page(vma, addr, pte);
+		page = vm_normal_page(vma, addr, pte, FOLL_NOZERO);
 		if (page) {
 			if (pte_dirty(pte))
 				set_page_dirty(page);
Index: mmotm-2.6.31-Jul15/mm/mempolicy.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/mempolicy.c
+++ mmotm-2.6.31-Jul15/mm/mempolicy.c
@@ -404,13 +404,13 @@ static int check_pte_range(struct vm_are
 
 		if (!pte_present(*pte))
 			continue;
-		page = vm_normal_page(vma, addr, *pte);
+		/* we avoid zero page here */
+		page = vm_normal_page(vma, addr, *pte, FOLL_NOZERO);
 		if (!page)
 			continue;
 		/*
-		 * The check for PageReserved here is important to avoid
-		 * handling zero pages and other pages that may have been
-		 * marked special by the system.
+		 * The check for PageReserved here is imortant to avoid pages
+		 * that may have been marked special by the system.
 		 *
 		 * If the PageReserved would not be checked here then f.e.
 		 * the location of the zero page could have an influence
Index: mmotm-2.6.31-Jul15/mm/rmap.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/rmap.c
+++ mmotm-2.6.31-Jul15/mm/rmap.c
@@ -946,7 +946,11 @@ static int try_to_unmap_cluster(unsigned
 	for (; address < end; pte++, address += PAGE_SIZE) {
 		if (!pte_present(*pte))
 			continue;
-		page = vm_normal_page(vma, address, *pte);
+		/*
+		 * Because we comes from try_to_unmap_file(), we'll never see
+		 * ZERO_PAGE or ANON.
+		 */
+		page = vm_normal_page(vma, address, *pte, FOLL_NOZERO);
 		BUG_ON(!page || PageAnon(page));
 
 		if (locked_vma) {
Index: mmotm-2.6.31-Jul15/include/linux/mm.h
===================================================================
--- mmotm-2.6.31-Jul15.orig/include/linux/mm.h
+++ mmotm-2.6.31-Jul15/include/linux/mm.h
@@ -753,7 +753,7 @@ struct zap_details {
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-		pte_t pte);
+		pte_t pte, unsigned int flags);
 
 int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size);
@@ -1245,6 +1245,7 @@ struct page *follow_page(struct vm_area_
 #define FOLL_TOUCH	0x02	/* mark page accessed */
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_ANON	0x08	/* give ZERO_PAGE if no pgtable */
+#define FOLL_NOZERO	0x10	/* returns NULL if ZERO_PAGE is found */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
Index: mmotm-2.6.31-Jul15/mm/internal.h
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/internal.h
+++ mmotm-2.6.31-Jul15/mm/internal.h
@@ -254,6 +254,7 @@ static inline void mminit_validate_memmo
 #define GUP_FLAGS_FORCE                  0x2
 #define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
 #define GUP_FLAGS_IGNORE_SIGKILL         0x8
+#define GUP_FLAGS_IGNORE_ZERO         	 0x10
 
 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int len, int flags,
Index: mmotm-2.6.31-Jul15/mm/migrate.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/migrate.c
+++ mmotm-2.6.31-Jul15/mm/migrate.c
@@ -850,7 +850,7 @@ static int do_move_page_to_node_array(st
 		if (!vma || !vma_migratable(vma))
 			goto set_status;
 
-		page = follow_page(vma, pp->addr, FOLL_GET);
+		page = follow_page(vma, pp->addr, FOLL_GET | FOLL_NOZERO);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
@@ -1007,14 +1007,14 @@ static void do_pages_stat_array(struct m
 		if (!vma)
 			goto set_status;
 
-		page = follow_page(vma, addr, 0);
+		page = follow_page(vma, addr, FOLL_NOZERO);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
 			goto set_status;
 
 		err = -ENOENT;
-		/* Use PageReserved to check for zero page */
+		/* if zero page, page is NULL. */
 		if (!page || PageReserved(page))
 			goto set_status;
 
Index: mmotm-2.6.31-Jul15/mm/mlock.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/mm/mlock.c
+++ mmotm-2.6.31-Jul15/mm/mlock.c
@@ -162,7 +162,7 @@ static long __mlock_vma_pages_range(stru
 	struct page *pages[16]; /* 16 gives a reasonable batch */
 	int nr_pages = (end - start) / PAGE_SIZE;
 	int ret = 0;
-	int gup_flags = 0;
+	int gup_flags = GUP_FLAGS_IGNORE_ZERO;
 
 	VM_BUG_ON(start & ~PAGE_MASK);
 	VM_BUG_ON(end   & ~PAGE_MASK);
Index: mmotm-2.6.31-Jul15/fs/proc/task_mmu.c
===================================================================
--- mmotm-2.6.31-Jul15.orig/fs/proc/task_mmu.c
+++ mmotm-2.6.31-Jul15/fs/proc/task_mmu.c
@@ -361,8 +361,8 @@ static int smaps_pte_range(pmd_t *pmd, u
 			continue;
 
 		mss->resident += PAGE_SIZE;
-
-		page = vm_normal_page(vma, addr, ptent);
+		/* we ignore zero page */
+		page = vm_normal_page(vma, addr, ptent, FOLL_NOZERO);
 		if (!page)
 			continue;
 
@@ -469,8 +469,8 @@ static int clear_refs_pte_range(pmd_t *p
 		ptent = *pte;
 		if (!pte_present(ptent))
 			continue;
-
-		page = vm_normal_page(vma, addr, ptent);
+		/* we ignore zero page */
+		page = vm_normal_page(vma, addr, ptent, FOLL_NOZERO);
 		if (!page)
 			continue;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2009-08-10 11:24 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-05 10:16 [PATCH] ZERO_PAGE again v5 KAMEZAWA Hiroyuki
2009-08-05 10:17 ` [PATCH 1/2] ZERO_PAGE config KAMEZAWA Hiroyuki
2009-08-05 10:19 ` [PATCH 2/2] ZERO_PAGE based on pte_special KAMEZAWA Hiroyuki
2009-08-09 17:28 ` [PATCH] ZERO_PAGE again v5 Hugh Dickins
2009-08-10  0:14   ` KAMEZAWA Hiroyuki
2009-08-10 11:24     ` Hugh Dickins
  -- strict thread matches above, loose matches on Subject: below --
2009-07-09  3:24 [PATCH 0/2] ZERO PAGE again v3 KAMEZAWA Hiroyuki
2009-07-16  9:01 ` [PATCH 0/2] ZERO PAGE again v4 KAMEZAWA Hiroyuki
2009-07-16  9:04   ` [PATCH 2/2] ZERO PAGE based on pte_special KAMEZAWA Hiroyuki
2009-07-16 12:00     ` Minchan Kim
2009-07-16 13:02       ` KAMEZAWA Hiroyuki
2009-07-17  0:38       ` KAMEZAWA Hiroyuki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox