* [PATCH 2/2] ZERO_PAGE based on pte_special
2009-08-05 10:16 [PATCH] ZERO_PAGE again v5 KAMEZAWA Hiroyuki
2009-08-05 10:17 ` [PATCH 1/2] ZERO_PAGE config KAMEZAWA Hiroyuki
@ 2009-08-05 10:19 ` KAMEZAWA Hiroyuki
2009-08-09 17:28 ` [PATCH] ZERO_PAGE again v5 Hugh Dickins
2 siblings, 0 replies; 6+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-05 10:19 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, npiggin, hugh.dickins, akpm, torvalds
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
updated against: mmotm-Jul-30-2009.
ZERO_PAGE for anonymous private mappings is useful when an application
requires a large contiguous area of memory but writes to it only sparsely,
among other use cases. It was removed in 2.6.24, but this patch re-adds it
because such use cases still exist.
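As an illustration only (not part of the patch), here is a hypothetical
userspace sketch of the access pattern this targets: a large anonymous
private mapping that is written only sparsely, so read faults over the
untouched area can map ZERO_PAGE instead of allocating zeroed pages.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;	/* large anonymous private area */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Sparse writes: only a few pages get real page frames. */
	for (size_t off = 0; off < len; off += 64UL << 20)
		buf[off] = 1;
	/* Reads elsewhere fault in ZERO_PAGE mappings with this patch. */
	long sum = 0;
	for (size_t off = 0; off < len; off += 4096)
		sum += buf[off];
	printf("sum=%ld\n", sum);
	munmap(buf, len);
	return 0;
}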
ZERO_PAGE was originally removed because of heavy cache-line contention on
its refcount; this version avoids taking a refcount on it at all.
The implementation changes as follows.
- Use of ZERO_PAGE is limited to PRIVATE mappings, so VM_MAYSHARE is
checked in addition to VM_SHARED.
- pte_special(), i.e. the _PAGE_SPECIAL bit in the pte, is used to indicate
ZERO_PAGE (a condensed sketch follows this list).
- follow_page() takes a new FOLL_NOZERO flag. If it is set, NULL is
returned even when ZERO_PAGE is found.
- vm_normal_page() returns NULL if ZERO_PAGE is found.
- __get_user_pages() takes one more flag, GUP_FLAGS_IGNORE_ZERO. If it is
set, __get_user_pages() returns NULL even when ZERO_PAGE is found.
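Before the diff, a condensed sketch (mine, assembled from the memory.c
hunks below; not compilable on its own) of how the encoding works: the
read fault installs a special pte pointing at ZERO_PAGE, and page-table
walkers recognize it via pte_special().

/* At anonymous read fault (see do_anon_zeromap() below): */
entry = pte_mkspecial(mk_pte(ZERO_PAGE(0), vma->vm_page_prot));
set_pte_at(mm, address, page_table, entry);

/* Later, follow_page() decodes it roughly like this: */
if (pte_special(pte) && use_zero_page(vma))
	page = (flags & FOLL_NOZERO) ? NULL : ZERO_PAGE(0);
else
	page = vm_normal_page(vma, address, pte);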
Changelog v4.1 -> v5
- removed new arguments to vm_normal_page()
- follow_page() handles ZERO_PAGE directly.
Changelog v4 -> v4.1
- removed nested "if" in get_user_pages() for readability
Changelog v3->v4
- FOLL_NOZERO is directly passed to vm_normal_page()
Changelog v2->v3
- totally renewed.
- use pte_special()
- added new argument to vm_normal_page().
- MAYSHARE is checked.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/mm.h | 1
mm/internal.h | 1
mm/memory.c | 123 +++++++++++++++++++++++++++++++++++++++++++----------
mm/migrate.c | 6 +-
mm/mlock.c | 2
5 files changed, 108 insertions(+), 25 deletions(-)
Index: mmotm-2.6.31-Aug4/mm/memory.c
===================================================================
--- mmotm-2.6.31-Aug4.orig/mm/memory.c
+++ mmotm-2.6.31-Aug4/mm/memory.c
@@ -444,6 +444,27 @@ static inline int is_cow_mapping(unsigne
}
/*
+ * Can we use ZERO_PAGE at fault time, or can we do the FOLL_ANON optimization?
+ */
+static inline int use_zero_page(struct vm_area_struct *vma)
+{
+ /*
+ * We don't want to optimize FOLL_ANON for make_pages_present()
+ * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
+ * we want to get the page from the page tables to make sure
+ * that we serialize and update with any other user of that
+ * mapping. At page fault time, VM_MAYSHARE should also be checked to
+ * guard against possible later changes to VM_SHARED.
+ */
+ if (vma->vm_flags & (VM_LOCKED | VM_SHARED | VM_MAYSHARE))
+ return 0;
+ /*
+ * And if we have a fault routine, it's not an anonymous region.
+ */
+ return !vma->vm_ops || !vma->vm_ops->fault;
+}
+
+/*
* vm_normal_page -- This function gets the "struct page" associated with a pte.
*
* "Special" mappings do not wish to be associated with a "struct page" (either
@@ -490,6 +511,12 @@ static inline int is_cow_mapping(unsigne
#else
# define HAVE_PTE_SPECIAL 0
#endif
+
+#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
+# define HAVE_ANON_ZERO 1
+#else
+# define HAVE_ANON_ZERO 0
+#endif
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte)
{
@@ -498,8 +525,16 @@ struct page *vm_normal_page(struct vm_ar
if (HAVE_PTE_SPECIAL) {
if (likely(!pte_special(pte)))
goto check_pfn;
- if (!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
- print_bad_pte(vma, addr, pte, NULL);
+
+ if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+ return NULL;
+ /*
+ * ZERO_PAGE? If the vma is shared or has a fault handler,
+ * using ZERO_PAGE here is a bug.
+ */
+ if (HAVE_ANON_ZERO && use_zero_page(vma))
+ return NULL;
+ print_bad_pte(vma, addr, pte, NULL);
return NULL;
}
@@ -1143,7 +1178,16 @@ struct page *follow_page(struct vm_area_
goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte))
goto unlock;
- page = vm_normal_page(vma, address, pte);
+
+ if (HAVE_ANON_ZERO && pte_special(pte) && use_zero_page(vma)) {
+ /* This page is ZERO_PAGE */
+ if (flags & FOLL_NOZERO)
+ page = NULL;
+ else
+ page = ZERO_PAGE(0);
+ } else
+ page = vm_normal_page(vma, address, pte);
+
if (unlikely(!page))
goto bad_page;
@@ -1188,23 +1232,6 @@ no_page_table:
return page;
}
-/* Can we do the FOLL_ANON optimization? */
-static inline int use_zero_page(struct vm_area_struct *vma)
-{
- /*
- * We don't want to optimize FOLL_ANON for make_pages_present()
- * when it tries to page in a VM_LOCKED region. As to VM_SHARED,
- * we want to get the page from the page tables to make sure
- * that we serialize and update with any other user of that
- * mapping.
- */
- if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
- return 0;
- /*
- * And if we have a fault routine, it's not an anonymous region.
- */
- return !vma->vm_ops || !vma->vm_ops->fault;
-}
@@ -1218,6 +1245,7 @@ int __get_user_pages(struct task_struct
int force = !!(flags & GUP_FLAGS_FORCE);
int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);
+ int ignore_zero = !!(flags & GUP_FLAGS_IGNORE_ZERO);
if (nr_pages <= 0)
return 0;
@@ -1261,7 +1289,11 @@ int __get_user_pages(struct task_struct
return i ? : -EFAULT;
}
if (pages) {
- struct page *page = vm_normal_page(gate_vma, start, *pte);
+ struct page *page;
+ /*
+ * This is not an anonymous vma, so there is no zero page to care about.
+ */
+ page = vm_normal_page(gate_vma, start, *pte);
pages[i] = page;
if (page)
get_page(page);
@@ -1291,6 +1323,8 @@ int __get_user_pages(struct task_struct
foll_flags |= FOLL_GET;
if (!write && use_zero_page(vma))
foll_flags |= FOLL_ANON;
+ else
+ ignore_zero = 0;
do {
struct page *page;
@@ -1309,9 +1343,17 @@ int __get_user_pages(struct task_struct
if (write)
foll_flags |= FOLL_WRITE;
+ if (ignore_zero)
+ foll_flags |= FOLL_NOZERO;
+
cond_resched();
while (!(page = follow_page(vma, start, foll_flags))) {
int ret;
+ /*
+ * When we ignore zero pages, there is nothing more to do here.
+ */
+ if (ignore_zero)
+ break;
ret = handle_mm_fault(mm, vma, start,
(foll_flags & FOLL_WRITE) ?
@@ -2617,6 +2659,41 @@ out_page:
return ret;
}
+#ifdef CONFIG_SUPPORT_ANON_ZERO_PAGE
+static bool do_anon_zeromap(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd)
+{
+ spinlock_t *ptl;
+ pte_t entry;
+ pte_t *page_table;
+ bool ret = false;
+
+ if (!use_zero_page(vma))
+ return ret;
+ /*
+ * We use the _PAGE_SPECIAL bit in the pte to indicate this page is ZERO_PAGE.
+ */
+ entry = pte_mkspecial(mk_pte(ZERO_PAGE(0), vma->vm_page_prot));
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_none(*page_table))
+ goto out_unlock;
+ set_pte_at(mm, address, page_table, entry);
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, address, entry);
+ ret = true;
+out_unlock:
+ pte_unmap_unlock(page_table, ptl);
+ return ret;
+}
+#else
+static bool do_anon_zeromap(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd)
+{
+ /* We do not use ZERO_PAGE */
+ return false;
+}
+#endif /* CONFIG_SUPPORT_ANON_ZERO_PAGE */
+
/*
* We enter with non-exclusive mmap_sem (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -2633,6 +2710,10 @@ static int do_anonymous_page(struct mm_s
/* Allocate our own private page. */
pte_unmap(page_table);
+ if (unlikely(!(flags & FAULT_FLAG_WRITE)))
+ if (do_anon_zeromap(mm, vma, address, pmd))
+ return 0;
+
if (unlikely(anon_vma_prepare(vma)))
goto oom;
page = alloc_zeroed_user_highpage_movable(vma, address);
Index: mmotm-2.6.31-Aug4/include/linux/mm.h
===================================================================
--- mmotm-2.6.31-Aug4.orig/include/linux/mm.h
+++ mmotm-2.6.31-Aug4/include/linux/mm.h
@@ -1246,6 +1246,7 @@ struct page *follow_page(struct vm_area_
#define FOLL_TOUCH 0x02 /* mark page accessed */
#define FOLL_GET 0x04 /* do get_page on page */
#define FOLL_ANON 0x08 /* give ZERO_PAGE if no pgtable */
+#define FOLL_NOZERO 0x10 /* returns NULL if ZERO_PAGE is found */
typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
Index: mmotm-2.6.31-Aug4/mm/internal.h
===================================================================
--- mmotm-2.6.31-Aug4.orig/mm/internal.h
+++ mmotm-2.6.31-Aug4/mm/internal.h
@@ -254,6 +254,7 @@ static inline void mminit_validate_memmo
#define GUP_FLAGS_FORCE 0x2
#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
#define GUP_FLAGS_IGNORE_SIGKILL 0x8
+#define GUP_FLAGS_IGNORE_ZERO 0x10
int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int flags,
Index: mmotm-2.6.31-Aug4/mm/migrate.c
===================================================================
--- mmotm-2.6.31-Aug4.orig/mm/migrate.c
+++ mmotm-2.6.31-Aug4/mm/migrate.c
@@ -850,7 +850,7 @@ static int do_move_page_to_node_array(st
if (!vma || !vma_migratable(vma))
goto set_status;
- page = follow_page(vma, pp->addr, FOLL_GET);
+ page = follow_page(vma, pp->addr, FOLL_GET | FOLL_NOZERO);
err = PTR_ERR(page);
if (IS_ERR(page))
@@ -1007,14 +1007,14 @@ static void do_pages_stat_array(struct m
if (!vma)
goto set_status;
- page = follow_page(vma, addr, 0);
+ page = follow_page(vma, addr, FOLL_NOZERO);
err = PTR_ERR(page);
if (IS_ERR(page))
goto set_status;
err = -ENOENT;
- /* Use PageReserved to check for zero page */
+ /* With FOLL_NOZERO, page is NULL for the zero page. */
if (!page || PageReserved(page))
goto set_status;
Index: mmotm-2.6.31-Aug4/mm/mlock.c
===================================================================
--- mmotm-2.6.31-Aug4.orig/mm/mlock.c
+++ mmotm-2.6.31-Aug4/mm/mlock.c
@@ -162,7 +162,7 @@ static long __mlock_vma_pages_range(stru
struct page *pages[16]; /* 16 gives a reasonable batch */
int nr_pages = (end - start) / PAGE_SIZE;
int ret = 0;
- int gup_flags = 0;
+ int gup_flags = GUP_FLAGS_IGNORE_ZERO;
VM_BUG_ON(start & ~PAGE_MASK);
VM_BUG_ON(end & ~PAGE_MASK);
--
* Re: [PATCH] ZERO_PAGE again v5.
2009-08-05 10:16 [PATCH] ZERO_PAGE again v5 KAMEZAWA Hiroyuki
2009-08-05 10:17 ` [PATCH 1/2] ZERO_PAGE config KAMEZAWA Hiroyuki
2009-08-05 10:19 ` [PATCH 2/2] ZERO_PAGE based on pte_special KAMEZAWA Hiroyuki
@ 2009-08-09 17:28 ` Hugh Dickins
2009-08-10 0:14 ` KAMEZAWA Hiroyuki
2 siblings, 1 reply; 6+ messages in thread
From: Hugh Dickins @ 2009-08-09 17:28 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, Nick Piggin, Andrew Morton, Linus Torvalds
On Wed, 5 Aug 2009, KAMEZAWA Hiroyuki wrote:
> Updated from v4 as follows:
> - avoid adding new arguments to vm_normal_page();
>   vm_normal_page() always returns NULL if ZERO_PAGE is found.
> - follow_page() directly handles pte_special and ANON_ZERO_PAGE.
>
> This reduces the amount of change. Thanks for the advice.
>
> Concerns pointed out:
> - Does use_zero_page() cover all cases?
>   I think yes.
> - Are all get_user_pages() callers that may find ZERO_PAGE safe?
>   Needs testing.
> - Are all follow_page() callers that may find ZERO_PAGE safe?
>   I think yes.
Sorry, KAMEZAWA-san, I'm afraid this is still some way off being right.
Certainly the extent of the v5 patch is much more to my taste than v4
was, thank you.
Something that's missing, which we can get away with but probably
need to reinstate, is the shortcut when COWing: not to copy the
ZERO_PAGE, but just do a memset.
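For reference, the shortcut Hugh means looked roughly like this in the
pre-2.6.24 do_wp_page() (my rough sketch, reusing its old_page/new_page/oom
locals; not a hunk from this patch): when the page being COWed is the zero
page, allocate an already-zeroed page instead of copying.

if (old_page == ZERO_PAGE(0)) {
	/* Nothing worth copying: a freshly zeroed page is equivalent. */
	new_page = alloc_zeroed_user_highpage_movable(vma, address);
	if (!new_page)
		goto oom;
} else {
	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
	if (!new_page)
		goto oom;
	cow_user_page(new_page, old_page, address, vma);
}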
But just try mlock'ing a private readonly anon area into which you've
faulted a zero page, and the "BUG: Bad page map" message tells us
it's quite wrong to be trying use_zero_page() there.
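A minimal userspace reproduction of that scenario might be (my sketch,
assuming this patch is applied):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 20;
	/* Private, read-only anonymous mapping. */
	volatile char *p = mmap(NULL, len, PROT_READ,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/*
	 * The read fault installs a special zero-page pte while the vma
	 * is not yet VM_LOCKED.
	 */
	char c = p[0];
	/*
	 * mlock() then sets VM_LOCKED, so use_zero_page() now fails for
	 * this vma and vm_normal_page() trips over the special pte:
	 * "BUG: Bad page map", as described above.
	 */
	if (mlock((void *)p, len))
		perror("mlock");
	return c;
}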
Actually, I don't understand ignore_zero at all: it's used solely by
the mlock case, yet its effect seems to be precisely not to fault in
pages if they're missing - I wonder if you've got in a muddle between
the two very different awkward cases, mlocking and coredumps of
sparsely populated areas.
And I don't at all like the way you flush_dcache_page(page) on a
page which may now be NULL: okay, you're only encouraging x86 to
say Yes to the Kconfig option, but that's a landmine for the first
arch with a real flush_dcache_page(page) which says Yes to it.
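The call in question is in the per-page handling of __get_user_pages(),
outside the posted hunks; roughly (my sketch, details may differ from the
actual mmotm code):

while (!(page = follow_page(vma, start, foll_flags))) {
	if (ignore_zero)
		break;		/* page stays NULL */
	/* ... otherwise fault the page in and retry ... */
}
if (IS_ERR(page))
	return i ? i : PTR_ERR(page);
if (pages) {
	pages[i] = page;
	flush_anon_page(vma, page, start);
	/* no-op on x86, but oopses where this really touches the page */
	flush_dcache_page(page);
}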
Actually, the Kconfig stuff seems silly to me (who's going to know
how to choose on or off?): the only architecture which wanted more
than one ZERO_PAGE was MIPS, and it doesn't __HAVE_ARCH_PTE_SPECIAL
yet, so I think I'm going to drop all the Kconfig end of it.
Because I hate reviewing things and trying to direct other people
by remote control: what usually happens is I send them off in some
direction which, once I try to do it myself, turns out to have been
the wrong direction. I do need to try to do this myself, instead of
standing on the sidelines criticizing.
In fairness, I think Linus himself was a little confused when he
separated off use_zero_page(): I think we've all got confused around
there (as we noticed a month or so ago when discussing its hugetlb
equivalent), and I need to think it through again at last.
I'll get on to it now.
Hugh