From: Oren Laadan <orenl@cs.columbia.edu>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
linux-api@vger.kernel.org, Serge Hallyn <serue@us.ibm.com>,
Ingo Molnar <mingo@elte.hu>,
containers@lists.linux-foundation.org,
Oren Laadan <orenl@cs.columbia.edu>
Subject: [C/R v20][PATCH 41/96] Introduce FOLL_DIRTY to follow_page() for "dirty" pages
Date: Wed, 17 Mar 2010 12:08:29 -0400
Message-ID: <1268842164-5590-42-git-send-email-orenl@cs.columbia.edu>
In-Reply-To: <1268842164-5590-41-git-send-email-orenl@cs.columbia.edu>
This is a preparatory patch necessary for checkpoint/restart (next
two patches) of memory to work correctly.
The patch introduces a new FOLL_DIRTY flag which tells follow_page()
to return -EFAULT also for not-present file-backed pages.
In 2.6.32, follow_page() changed its behavior due to this commit:
mm: FOLL_DUMP replace FOLL_ANON
8e4b9a60718970bbc02dfd3abd0b956ab65af231
Also introduce __get_dirty_page(), which returns a page only if it is
"dirty", that is, only if it has been modified before, and otherwise
returns NULL. It uses FOLL_DUMP | FOLL_DIRTY and converts the error
value -EFAULT to NULL, telling the caller that the page in question is
clean.
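As a rough user-space illustration of that conversion (not the kernel
code itself): the helpers below re-create the kernel's error-pointer
convention under lowercase names, and dirty_or_null() is a hypothetical
name for the step that turns -EFAULT into NULL while passing valid
pages and other errors through unchanged.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Largest errno that is encoded in a pointer, as in the kernel. */
#define MAX_ERRNO 4095

/* User-space stand-ins for the kernel's ERR_PTR/PTR_ERR/IS_ERR. */
void *err_ptr(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

int is_err(const void *ptr)
{
	/* Error values occupy the top MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Hypothetical helper: map a lookup result to what the caller sees.
 * -EFAULT means "clean page" and becomes NULL; any other error
 * pointer, or a valid page pointer, is returned as is.
 */
void *dirty_or_null(void *lookup)
{
	if (ptr_err(lookup) == -EFAULT)
		return NULL;
	return lookup;
}
```

With this, a caller only needs to distinguish three outcomes: NULL
(clean, nothing to save), an error pointer (abort), or a page.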
(This also optimizes checkpoint in the next patch: previously, if a
file-backed page was not present, we would first fault it in (read it
from disk) and only then detect that it was untouched. Now we detect
that the page is clean earlier, without needing to fault it in.)
To see why it's needed, consider these scenarios:
1. Task maps a file beyond its limit and never touches those extra
pages (if it did, it would get EFAULT/bus error).
2. Task maps a file and writes the last page, then the file gets
truncated (by at least a page). A subsequent access to the page will
cause a bus error (VM_FAULT_SIGBUS).
3. If the file size is extended back (using truncate) and the task
accesses that page, then the task will get a fresh page (losing the
data it had written to that address before).
[Before kernel 2.6.32, that page would become anonymous once it was
dirtied, such that accesses in case #2 are valid, and in case #3 the
task would see the old page regardless of the file contents.]
--CHECKPOINT: before, we used the FOLL_ANON flag to tell follow_page()
to return the zero-page for case #1. For case #2, the actual page was
returned. Without this patch, in kernel 2.6.32, FOLL_DUMP would make
follow_page() return NULL and then the fault handler would have
returned VM_FAULT_SIGBUS in case #1 (and, depending on arch, in
case #2 too), and checkpoint would fail.
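For readers who want the decision in one place, here is a toy
user-space model of the not-present paths of follow_page() as changed
by this patch. It is a sketch only: enum pte_state, enum lookup and
model_follow_page() are made-up names, the present-page path is reduced
to a single case, and only the two flag values are taken from the mm.h
hunk below.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values copied from the mm.h hunk in this patch. */
#define FOLL_DUMP  0x08
#define FOLL_DIRTY 0x20

/* Hypothetical, simplified pte states; these are not kernel types. */
enum pte_state {
	PTE_NONE,	/* empty pte, or no page table at all */
	PTE_FILE,	/* not present, file pte (pte_file() is true) */
	PTE_SWAP,	/* not present, swapped out */
	PTE_PRESENT,	/* present in memory */
};

enum lookup { LOOKUP_PAGE, LOOKUP_NULL, LOOKUP_EFAULT };

enum lookup model_follow_page(enum pte_state pte, bool has_vm_ops,
			      bool has_fault, unsigned int flags)
{
	if (pte == PTE_PRESENT)
		return LOOKUP_PAGE;

	/* "no_page" path: a pte exists but the page is not present */
	if (pte != PTE_NONE) {
		if ((flags & FOLL_DIRTY) && pte == PTE_FILE)
			return LOOKUP_EFAULT;	/* file-backed: clean */
		return LOOKUP_NULL;	/* e.g. swapped: fault it in */
	}

	/* "no_page_table" path */
	assert(pte == PTE_NONE);
	if ((flags & FOLL_DUMP) && (!has_vm_ops || !has_fault))
		return LOOKUP_EFAULT;	/* hole without a fault handler */
	if ((flags & FOLL_DIRTY) && has_vm_ops)
		return LOOKUP_EFAULT;	/* untouched file page: clean */
	return LOOKUP_NULL;
}
```

In this model, an untouched file-backed page yields LOOKUP_NULL with
FOLL_DUMP alone (so the caller would go on to fault it in, which is
where the bus error comes from), and LOOKUP_EFAULT once FOLL_DIRTY is
added.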
--RESTART: case #1 works, because mmap() works as before, and those
pages that were never touched will not be restored either, they will
remain untouched. The same holds for case#2 (as of kernel 2.6.32),
because at checkpoint it would decide that the page is clean and not
save the contents, and therefore it will not try to restore the
contents at restart. This is consistent with the expected behavior
after restart: if the file remains as is, subsequent accesses will
trigger a bus error, and if the file is extended, then the user will
observe a fresh page.
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
include/linux/mm.h | 2 +
mm/memory.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 96 insertions(+), 1 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 48d67ee..a93f4dc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -843,6 +843,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
struct page *get_dump_page(unsigned long addr);
+struct page *__get_dirty_page(struct vm_area_struct *vma, unsigned long addr);
extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
extern void do_invalidatepage(struct page *page, unsigned long offset);
@@ -1262,6 +1263,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
#define FOLL_GET 0x04 /* do get_page on page */
#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
+#define FOLL_DIRTY 0x20 /* give error on non-present file mapped */
typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/mm/memory.c b/mm/memory.c
index 09e4b1b..005dd55 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1226,8 +1226,17 @@ bad_page:
no_page:
pte_unmap_unlock(ptep, ptl);
- if (!pte_none(pte))
+ if (!pte_none(pte)) {
+ /*
+ * When checkpointing we only care about dirty pages.
+ * If a file-backed page is missing, then return an
+ * error to tell __get_dirty_page() that it's clean,
+ * so it won't try to demand page it into memory.
+ */
+ if ((flags & FOLL_DIRTY) && pte_file(pte))
+ page = ERR_PTR(-EFAULT);
return page;
+ }
no_page_table:
/*
@@ -1241,6 +1250,16 @@ no_page_table:
if ((flags & FOLL_DUMP) &&
(!vma->vm_ops || !vma->vm_ops->fault))
return ERR_PTR(-EFAULT);
+
+ /*
+ * When checkpointing we only care about dirty pages. If there
+ * is no page table for a non-anonymous page, we return an
+ * error to tell __get_dirty_page() that the page is clean, so
+ * it won't allocate page tables and the page unnecessarily.
+ */
+ if ((flags & FOLL_DIRTY) && vma->vm_ops)
+ return ERR_PTR(-EFAULT);
+
return page;
}
@@ -1498,6 +1517,80 @@ pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
return NULL;
}
+/**
+ * __get_dirty_page - return page pointer for dirty user page
+ * @vma - target vma
+ * @addr - page address
+ *
+ * Looks up the page that correspond to the address in the vma, and
+ * return the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL or error.
+ *
+ * Should only be called for private vma.
+ * Must be called with mmap_sem held for read or write.
+ */
+struct page *__get_dirty_page(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct page *page;
+
+ BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+ /*
+ * FOLL_DUMP tells follow_page() to return -EFAULT for either
+ * non-present anonymous pages, or memory "holes".
+ * FOLL_DIRTY tells follow_page() to return -EFAULT also for
+ * non-present file-mapped pages.
+ * Otherwise, follow_page() returns the page, or NULL if the
+ * page is swapped out.
+ */
+
+ cond_resched();
+ while (!(page = follow_page(vma, addr,
+ FOLL_GET | FOLL_DUMP | FOLL_DIRTY))) {
+ int ret;
+
+ /* the page is swapped out - bring it in (optimize ?) */
+ ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+ if (ret & VM_FAULT_ERROR) {
+ if (ret & VM_FAULT_OOM)
+ return ERR_PTR(-ENOMEM);
+ else if (ret & VM_FAULT_SIGBUS)
+ return ERR_PTR(-EFAULT);
+ else
+ BUG();
+ break;
+ }
+ cond_resched();
+ }
+
+ /* -EFAULT means that the page is clean (see above) */
+ if (PTR_ERR(page) == -EFAULT)
+ return NULL;
+ else if (IS_ERR(page))
+ return page;
+
+ /*
+ * Only care about dirty pages: either anonymous non-zero pages,
+ * or file-backed COW (copy-on-write) pages that were modified.
+ * A clean COW page is not interesting because its contents are
+ * identical to the backing file; ignore such pages.
+ * A file-backed broken COW is identified by its page_mapping()
+ * being unset (NULL) because the page will no longer be mapped
+ * to the original file after having been modified.
+ */
+ if (is_zero_pfn(page_to_pfn(page))) {
+ /* this is the zero page: ignore */
+ page_cache_release(page);
+ page = NULL;
+ } else if (vma->vm_file && (page_mapping(page) != NULL)) {
+ /* file backed clean cow: ignore */
+ page_cache_release(page);
+ page = NULL;
+ }
+
+ return page;
+}
+
/*
* This is the old fallback for page remapping.
*
--
1.6.3.3