linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
To: Nick Piggin <nickpiggin@yahoo.com.au>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>
Cc: kosaki.motohiro@jp.fujitsu.com,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Ingo Molnar <mingo@elte.hu>, Nick Piggin <npiggin@novell.com>,
	Hugh Dickins <hugh@veritas.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	linux-mm@kvack.org
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
Date: Sun, 22 Mar 2009 21:23:56 +0900 (JST)	[thread overview]
Message-ID: <20090322205249.6801.A69D9226@jp.fujitsu.com> (raw)
In-Reply-To: <20090318105735.BD17.A69D9226@jp.fujitsu.com>

Hi

following patch is my v2 approach.
it survive Andrea's three dio test-case.

Linus suggested to change add_to_swap() and shrink_page_list() stuff
for avoid false cow in do_wp_page() when page become to swapcache.

I think it's good idea. but it's a bit radical. so I think it's for development
tree tackle.

Then, I decide to use Nick's early decow in 
get_user_pages() and RO mapped page don't use gup_fast.

yeah, my approach is extream brutal way and big hammer. but I think 
it don't have performance issue in real world.

why?

Practically, we can assume following two thing.

(1) the buffer of passed write(2) syscall argument is RW mapped
    page or COWed RO page.

if anybody write following code, my path cause performance degression.

   buf = mmap()
   memset(buf, 0x11, len);
   mprotect(buf, len, PROT_READ)
   fd = open(O_DIRECT)
   write(fd, buf, len)

but it's very artifactical code. nobody want this.
ok, we can ignore this.

(2) DirectIO user process isn't short lived process.

early decow only decrease short lived process performaqnce. 
because long lived process do decowing anyway before exec(2).

and, All DB application is definitely long lived process.
then early decow don't cause degression.


TODO
  - implement down_write_killable().
    (but it isn't important thing because this is rare case issue.)
  - implement non x86 portion.


Am I missing any thing?


Note: this is still RFC. not intent submission.

--
 arch/x86/mm/gup.c         |   22 ++++++++++++++--------
 fs/direct-io.c            |   11 +++++++++++
 include/linux/init_task.h |    1 +
 include/linux/mm.h        |    9 +++++++++
 include/linux/mm_types.h  |    6 ++++++
 kernel/fork.c             |    3 +++
 mm/internal.h             |   10 ----------
 mm/memory.c               |   17 ++++++++++++++++-
 mm/util.c                 |    8 ++++++--
 9 files changed, 66 insertions(+), 21 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index be54176..02e479b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -74,8 +74,10 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 	pte_t *ptep;
 
 	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
+
+	/* Maybe the read only pte is cow mapped page. (or not maybe)
+	   So, falling back to get_user_pages() is better */
+	mask |= _PAGE_RW;
 
 	ptep = pte_offset_map(&pmd, addr);
 	do {
@@ -114,8 +116,7 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 	int refs;
 
 	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
+	mask |= _PAGE_RW;
 	if ((pte_flags(pte) & mask) != mask)
 		return 0;
 	/* hugepages are never "special" */
@@ -171,8 +172,7 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
 	int refs;
 
 	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
+	mask |= _PAGE_RW;
 	if ((pte_flags(pte) & mask) != mask)
 		return 0;
 	/* hugepages are never "special" */
@@ -272,6 +272,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 
 	{
 		int ret;
+		int gup_flags;
 
 slow:
 		local_irq_enable();
@@ -280,9 +281,14 @@ slow_irqon:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
+		gup_flags = GUP_FLAGS_PINNING_PAGE;
+		if (write)
+			gup_flags |= GUP_FLAGS_WRITE;
+
 		down_read(&mm->mmap_sem);
-		ret = get_user_pages(current, mm, start,
-			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
+		ret = __get_user_pages(current, mm, start,
+				       (end - start) >> PAGE_SHIFT, gup_flags,
+				       pages, NULL);
 		up_read(&mm->mmap_sem);
 
 		/* Have to be a bit careful with return values */
diff --git a/fs/direct-io.c b/fs/direct-io.c
index b6d4390..4f46720 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -131,6 +131,9 @@ struct dio {
 	int is_async;			/* is IO async ? */
 	int io_error;			/* IO error in completion path */
 	ssize_t result;                 /* IO result */
+
+	/* fork exclusive stuff */
+	struct mm_struct *mm;
 };
 
 /*
@@ -243,6 +246,9 @@ static int dio_complete(struct dio *dio, loff_t offset, int ret)
 	if (dio->lock_type == DIO_LOCKING)
 		/* lockdep: non-owner release */
 		up_read_non_owner(&dio->inode->i_alloc_sem);
+	up_read_non_owner(&dio->mm->mm_pinned_sem);
+	mmdrop(dio->mm);
+	dio->mm = NULL;
 
 	if (ret == 0)
 		ret = dio->page_errors;
@@ -942,6 +948,7 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
 	ssize_t ret = 0;
 	ssize_t ret2;
 	size_t bytes;
+	struct mm_struct *mm;
 
 	dio->inode = inode;
 	dio->rw = rw;
@@ -960,6 +967,10 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
 	spin_lock_init(&dio->bio_lock);
 	dio->refcount = 1;
 
+	mm = dio->mm = current->mm;
+	atomic_inc(&mm->mm_count);
+	down_read_non_owner(&mm->mm_pinned_sem);
+
 	/*
 	 * In case of non-aligned buffers, we may need 2 more
 	 * pages since we need to zero out first and last block.
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index e752d97..3bc134a 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -37,6 +37,7 @@ extern struct fs_struct init_fs;
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(name.page_table_lock),	\
 	.mmlist		= LIST_HEAD_INIT(name.mmlist),		\
 	.cpu_vm_mask	= CPU_MASK_ALL,				\
+	.mm_pinned_sem	= __RWSEM_INITIALIZER(name.mm_pinned_sem), \
 }
 
 #define INIT_SIGNALS(sig) {						\
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065cdf8..dcc6ccc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -823,6 +823,15 @@ static inline int handle_mm_fault(struct mm_struct *mm,
 extern int make_pages_present(unsigned long addr, unsigned long end);
 extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
 
+#define GUP_FLAGS_WRITE				0x01
+#define GUP_FLAGS_FORCE				0x02
+#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS	0x04
+#define GUP_FLAGS_IGNORE_SIGKILL		0x08
+#define GUP_FLAGS_PINNING_PAGE			0x10
+
+int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		     unsigned long start, int len, int flags,
+		     struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
 		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d84feb7..27089d9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -274,6 +274,12 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+
+	/*
+	 * if there are on-flight directio or similar pinning action,
+	 * COW cause memory corruption. the sem protect it by preventing fork.
+	 */
+	struct rw_semaphore mm_pinned_sem;
 };
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 4854c2c..ded7caf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -266,6 +266,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	unsigned long charge;
 	struct mempolicy *pol;
 
+	down_write(&oldmm->mm_pinned_sem);
 	down_write(&oldmm->mmap_sem);
 	flush_cache_dup_mm(oldmm);
 	/*
@@ -368,6 +369,7 @@ out:
 	up_write(&mm->mmap_sem);
 	flush_tlb_mm(oldmm);
 	up_write(&oldmm->mmap_sem);
+	up_write(&oldmm->mm_pinned_sem);
 	return retval;
 fail_nomem_policy:
 	kmem_cache_free(vm_area_cachep, tmp);
@@ -431,6 +433,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->cached_hole_size = ~0UL;
 	mm_init_owner(mm, p);
+	init_rwsem(&mm->mm_pinned_sem);
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
diff --git a/mm/internal.h b/mm/internal.h
index 478223b..04f25d2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -272,14 +272,4 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
 {
 }
 #endif /* CONFIG_SPARSEMEM */
-
-#define GUP_FLAGS_WRITE                  0x1
-#define GUP_FLAGS_FORCE                  0x2
-#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
-#define GUP_FLAGS_IGNORE_SIGKILL         0x8
-
-int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		     unsigned long start, int len, int flags,
-		     struct page **pages, struct vm_area_struct **vmas);
-
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index baa999e..b00e3e9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1211,6 +1211,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	int force = !!(flags & GUP_FLAGS_FORCE);
 	int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
 	int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);
+	int decow = 0;
 
 	if (len <= 0)
 		return 0;
@@ -1279,6 +1280,20 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			continue;
 		}
 
+		/*
+		 * Except in special cases where the caller will not read to or
+		 * write from these pages, we must break COW for any pages
+		 * returned from get_user_pages, so that our caller does not
+		 * subsequently end up with the pages of a parent or child
+		 * process after a COW takes place.
+		 */
+		if (flags & GUP_FLAGS_PINNING_PAGE) {
+			if (!pages)
+				return -EINVAL;
+			if (is_cow_mapping(vma->vm_flags))
+				decow = 1;
+		}
+
 		foll_flags = FOLL_TOUCH;
 		if (pages)
 			foll_flags |= FOLL_GET;
@@ -1299,7 +1314,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 					fatal_signal_pending(current)))
 				return i ? i : -ERESTARTSYS;
 
-			if (write)
+			if (write || decow)
 				foll_flags |= FOLL_WRITE;
 
 			cond_resched();
diff --git a/mm/util.c b/mm/util.c
index 37eaccd..a80d5d3 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -197,10 +197,14 @@ int __attribute__((weak)) get_user_pages_fast(unsigned long start,
 {
 	struct mm_struct *mm = current->mm;
 	int ret;
+	int gup_flags = GUP_FLAGS_PINNING_PAGE;
+
+	if (write)
+		gup_flags |= GUP_FLAGS_WRITE;
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, start, nr_pages,
-					write, 0, pages, NULL);
+	ret = __get_user_pages(current, mm, start, nr_pages,
+			       gup_flags, pages, NULL);
 	up_read(&mm->mmap_sem);
 
 	return ret;



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2009-03-22 11:42 UTC|newest]

Thread overview: 83+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20090311170611.GA2079@elte.hu>
2009-03-11 17:33 ` Linus Torvalds
2009-03-11 17:41   ` Ingo Molnar
2009-03-11 17:58     ` Linus Torvalds
2009-03-11 18:37       ` Andrea Arcangeli
2009-03-11 18:46         ` Linus Torvalds
2009-03-11 19:01           ` Linus Torvalds
2009-03-11 19:59             ` Andrea Arcangeli
2009-03-11 20:19               ` Linus Torvalds
2009-03-11 20:33                 ` Linus Torvalds
2009-03-11 20:55                   ` Andrea Arcangeli
2009-03-11 21:28                     ` Linus Torvalds
2009-03-11 21:57                       ` Andrea Arcangeli
2009-03-11 22:06                         ` Linus Torvalds
2009-03-11 22:07                           ` Linus Torvalds
2009-03-11 22:22                           ` Davide Libenzi
2009-03-11 22:32                             ` Linus Torvalds
2009-03-14  5:07                   ` Benjamin Herrenschmidt
2009-03-11 20:48                 ` Andrea Arcangeli
2009-03-14  5:06                 ` Benjamin Herrenschmidt
2009-03-14  5:20                   ` Nick Piggin
2009-03-16 16:01                     ` KOSAKI Motohiro
2009-03-16 16:23                       ` Nick Piggin
2009-03-16 16:32                         ` Linus Torvalds
2009-03-16 16:50                           ` Nick Piggin
2009-03-16 17:02                             ` Linus Torvalds
2009-03-16 17:19                               ` Nick Piggin
2009-03-16 17:42                                 ` Linus Torvalds
2009-03-16 18:02                                   ` Nick Piggin
2009-03-16 18:05                                     ` Nick Piggin
2009-03-16 18:17                                       ` Linus Torvalds
2009-03-16 18:33                                         ` Nick Piggin
2009-03-16 19:22                                           ` Linus Torvalds
2009-03-17  5:44                                             ` Nick Piggin
2009-03-16 18:14                                     ` Linus Torvalds
2009-03-16 18:29                                       ` Nick Piggin
2009-03-16 19:17                                         ` Linus Torvalds
2009-03-17  5:42                                           ` Nick Piggin
2009-03-17  5:58                                             ` Nick Piggin
2009-03-16 18:37                                       ` Andrea Arcangeli
2009-03-16 18:28                                   ` Andrea Arcangeli
2009-03-16 23:59                             ` KAMEZAWA Hiroyuki
2009-03-18  2:04                         ` KOSAKI Motohiro
2009-03-22 12:23                           ` KOSAKI Motohiro [this message]
2009-03-23  0:13                             ` KOSAKI Motohiro
2009-03-23 16:29                               ` Ingo Molnar
2009-03-23 16:46                                 ` Linus Torvalds
2009-03-24  5:08                                   ` KOSAKI Motohiro
2009-03-24 13:43                             ` Nick Piggin
2009-03-24 17:56                               ` Linus Torvalds
2009-03-30 10:52                               ` KOSAKI Motohiro
     [not found]                                 ` <200904022307.12043.nickpiggin@yahoo.com.au>
2009-04-03  3:49                                   ` Nick Piggin
2009-03-17  0:44                       ` Linus Torvalds
2009-03-17  0:56                         ` KAMEZAWA Hiroyuki
2009-03-17 12:19                         ` Andrea Arcangeli
2009-03-17 16:43                           ` Linus Torvalds
2009-03-17 17:01                             ` Linus Torvalds
2009-03-17 17:10                               ` Andrea Arcangeli
2009-03-17 17:43                                 ` Linus Torvalds
2009-03-17 18:09                                   ` Linus Torvalds
2009-03-17 18:19                                     ` Linus Torvalds
2009-03-17 18:46                                       ` Andrea Arcangeli
2009-03-17 19:03                                         ` Linus Torvalds
2009-03-17 19:35                                           ` Andrea Arcangeli
2009-03-17 19:55                                             ` Linus Torvalds
2009-03-11 19:06           ` Andrea Arcangeli
2009-03-12  5:36           ` Nick Piggin
2009-03-12 16:23             ` Nick Piggin
2009-03-12 17:00               ` Andrea Arcangeli
2009-03-12 17:20                 ` Nick Piggin
2009-03-12 17:23                   ` Nick Piggin
2009-03-12 18:06                   ` Andrea Arcangeli
2009-03-12 18:58                     ` Andrea Arcangeli
2009-03-13 16:09                     ` Nick Piggin
2009-03-13 19:34                       ` Andrea Arcangeli
2009-03-14  4:59                         ` Nick Piggin
2009-03-16 13:56                           ` Andrea Arcangeli
2009-03-16 16:01                             ` Nick Piggin
2009-03-14  4:46                       ` Nick Piggin
2009-03-14  5:06                         ` Nick Piggin
2009-03-11 18:53     ` Andrea Arcangeli
2009-03-11 18:22   ` Andrea Arcangeli
2009-03-11 19:06     ` Ingo Molnar
2009-03-11 19:15       ` Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090322205249.6801.A69D9226@jp.fujitsu.com \
    --to=kosaki.motohiro@jp.fujitsu.com \
    --cc=aarcange@redhat.com \
    --cc=benh@kernel.crashing.org \
    --cc=hugh@veritas.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-mm@kvack.org \
    --cc=mingo@elte.hu \
    --cc=nickpiggin@yahoo.com.au \
    --cc=npiggin@novell.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox