linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
To: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: kosaki.motohiro@jp.fujitsu.com,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Ingo Molnar <mingo@elte.hu>, Nick Piggin <npiggin@novell.com>,
	Hugh Dickins <hugh@veritas.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	linux-mm@kvack.org
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
Date: Tue, 17 Mar 2009 01:01:42 +0900 (JST)	[thread overview]
Message-ID: <20090316223612.4B2A.A69D9226@jp.fujitsu.com> (raw)
In-Reply-To: <200903141620.45052.nickpiggin@yahoo.com.au>

Hi

> > IB folks so far have been avoiding the fork() trap thanks to
> > madvise(MADV_DONTFORK) afaik. And it all goes generally well when the
> > whole application knows what it's doing and just plain avoids fork.
> >
> > -But- things get nasty if for some reason, the user of gup is somewhere
> > deep in some kind of library that an application uses without knowing,
> > while forking here or there to run shell scripts or other helpers.
> >
> > I've seen it :-)
> >
> > So if a solution can be found that doesn't uglify the whole thing beyond
> > recognition, it's probably worth it.
> 
> AFAIKS, the approach I've posted is probably the simplest (and maybe only
> way) to really fix it. It's not too ugly.

May I join this discussion?

if we only need concern to O_DIRECT, below patch is enough.

Yes, my patch isn't realy solusion.
Andrea already pointed out that it's not O_DIRECT issue, it's gup vs fork issue.
*and* my patch is crazy slow :)

So, my point is, I merely oppose easily decision to give up fixing.

Currently, I agree we don't have easily fixinig way.
but I believe we can solve this problem completely in the nealy future 
because LKML folks are very cool guys.

Thus, I don't hope to append the "BUGS" section of the O_DIRECT man page.
Also I don't hope that I says "Oh, Solaris can solve your requirement,
AIX can, FreeBSD can, but Linux can't".
it beat my proud of linux developer a bit ;)

andorea's patch seems a bit complex than your. but I think it can
improve later.
but the man page change can't undo.


In addition, May I talk about my gup-fast concern?
AFAIK, the worth of gup-fast is not removing one atomic operation.
not grabbing mmap_sem is essetial.

it because:
  - block layer and i/o driver also have several lock.
    then, DirectIO take many atomic operations anyway.
    one atomic operation cost is not so expensive.
  - but mmap_sem is one of most easy contented lock in linux.
    because
    - almost modern DB software have multi threading.
    - glibc malloc/free can cause mmap, munmap, mprotect syscall.
      its syscall grab down_write(&mmap_sem).
    - page fault also grab down_read(&mmap_sem).
    - anyway, userland application can't avoid malloc() and pagefault.

However, I haven't seen anyone try to munmap() to direct-io region.
So, it imply mmap_sem can split out fine grainy.
(or, Can we remove it completely? iirc PerterZ tryed it about two month ago)

after that, we can grab mmap_sem without performace degression and 
many mmap_sem avoiding effort can be removed.

perhaps, I talk funny thing. gup-fast was introduced for solving DB2 problem.
but I don't have any DB2 development experience.

Am I over-optimistic?



> You can't easily fix it at write-time by COWing in the right direction like
> Linus suggested because at that point you may have multiple get_user_pages
> (for read) from the parent and child on the page, so there is no way to COW
> it in the right direction.
> 
> You could do something crazy like allowing only one get_user_pages read on a
> wp page, and recording which direction to send it if it does get COWed. But
> at that point you've got something that's far uglier in the core code and
> more complex than what I posted.





---
 fs/direct-io.c            |    2 ++
 include/linux/init_task.h |    1 +
 include/linux/mm_types.h  |    3 +++
 kernel/fork.c             |    3 +++
 4 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index b6d4390..8f9a810 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1206,8 +1206,10 @@ __blockdev_direct_IO(int rw, struct kiocb
*iocb, struct inode *inode,
 	dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
 		(end > i_size_read(inode)));

+	down_read(&current->mm->directio_sem);
 	retval = direct_io_worker(rw, iocb, inode, iov, offset,
 				nr_segs, blkbits, get_block, end_io, dio);
+	up_read(&current->mm->directio_sem);

 	/*
 	 * In case of error extending write may have instantiated a few
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index e752d97..68e02b9 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -37,6 +37,7 @@ extern struct fs_struct init_fs;
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(name.page_table_lock),	\
 	.mmlist		= LIST_HEAD_INIT(name.mmlist),		\
 	.cpu_vm_mask	= CPU_MASK_ALL,				\
+	.directio_sem	= __RWSEM_INITIALIZER(name.directio_sem), \
 }

 #define INIT_SIGNALS(sig) {						\
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d84feb7..39ba4e6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -274,6 +274,9 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+
+	/* if there are on-flight directio, we can't fork. */
+	struct rw_semaphore directio_sem;
 };

 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 4854c2c..bbe9fa7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -266,6 +266,7 @@ static int dup_mmap(struct mm_struct *mm, struct
mm_struct *oldmm)
 	unsigned long charge;
 	struct mempolicy *pol;

+	down_write(&oldmm->directio_sem);
 	down_write(&oldmm->mmap_sem);
 	flush_cache_dup_mm(oldmm);
 	/*
@@ -368,6 +369,7 @@ out:
 	up_write(&mm->mmap_sem);
 	flush_tlb_mm(oldmm);
 	up_write(&oldmm->mmap_sem);
+	up_write(&oldmm->directio_sem);
 	return retval;
 fail_nomem_policy:
 	kmem_cache_free(vm_area_cachep, tmp);
@@ -431,6 +433,7 @@ static struct mm_struct * mm_init(struct mm_struct
* mm, struct task_struct *p)
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->cached_hole_size = ~0UL;
 	mm_init_owner(mm, p);
+	init_rwsem(&mm->directio_sem);

 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
-- 
1.6.0.6



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2009-03-16 16:03 UTC|newest]

Thread overview: 83+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20090311170611.GA2079@elte.hu>
2009-03-11 17:33 ` Linus Torvalds
2009-03-11 17:41   ` Ingo Molnar
2009-03-11 17:58     ` Linus Torvalds
2009-03-11 18:37       ` Andrea Arcangeli
2009-03-11 18:46         ` Linus Torvalds
2009-03-11 19:01           ` Linus Torvalds
2009-03-11 19:59             ` Andrea Arcangeli
2009-03-11 20:19               ` Linus Torvalds
2009-03-11 20:33                 ` Linus Torvalds
2009-03-11 20:55                   ` Andrea Arcangeli
2009-03-11 21:28                     ` Linus Torvalds
2009-03-11 21:57                       ` Andrea Arcangeli
2009-03-11 22:06                         ` Linus Torvalds
2009-03-11 22:07                           ` Linus Torvalds
2009-03-11 22:22                           ` Davide Libenzi
2009-03-11 22:32                             ` Linus Torvalds
2009-03-14  5:07                   ` Benjamin Herrenschmidt
2009-03-11 20:48                 ` Andrea Arcangeli
2009-03-14  5:06                 ` Benjamin Herrenschmidt
2009-03-14  5:20                   ` Nick Piggin
2009-03-16 16:01                     ` KOSAKI Motohiro [this message]
2009-03-16 16:23                       ` Nick Piggin
2009-03-16 16:32                         ` Linus Torvalds
2009-03-16 16:50                           ` Nick Piggin
2009-03-16 17:02                             ` Linus Torvalds
2009-03-16 17:19                               ` Nick Piggin
2009-03-16 17:42                                 ` Linus Torvalds
2009-03-16 18:02                                   ` Nick Piggin
2009-03-16 18:05                                     ` Nick Piggin
2009-03-16 18:17                                       ` Linus Torvalds
2009-03-16 18:33                                         ` Nick Piggin
2009-03-16 19:22                                           ` Linus Torvalds
2009-03-17  5:44                                             ` Nick Piggin
2009-03-16 18:14                                     ` Linus Torvalds
2009-03-16 18:29                                       ` Nick Piggin
2009-03-16 19:17                                         ` Linus Torvalds
2009-03-17  5:42                                           ` Nick Piggin
2009-03-17  5:58                                             ` Nick Piggin
2009-03-16 18:37                                       ` Andrea Arcangeli
2009-03-16 18:28                                   ` Andrea Arcangeli
2009-03-16 23:59                             ` KAMEZAWA Hiroyuki
2009-03-18  2:04                         ` KOSAKI Motohiro
2009-03-22 12:23                           ` KOSAKI Motohiro
2009-03-23  0:13                             ` KOSAKI Motohiro
2009-03-23 16:29                               ` Ingo Molnar
2009-03-23 16:46                                 ` Linus Torvalds
2009-03-24  5:08                                   ` KOSAKI Motohiro
2009-03-24 13:43                             ` Nick Piggin
2009-03-24 17:56                               ` Linus Torvalds
2009-03-30 10:52                               ` KOSAKI Motohiro
     [not found]                                 ` <200904022307.12043.nickpiggin@yahoo.com.au>
2009-04-03  3:49                                   ` Nick Piggin
2009-03-17  0:44                       ` Linus Torvalds
2009-03-17  0:56                         ` KAMEZAWA Hiroyuki
2009-03-17 12:19                         ` Andrea Arcangeli
2009-03-17 16:43                           ` Linus Torvalds
2009-03-17 17:01                             ` Linus Torvalds
2009-03-17 17:10                               ` Andrea Arcangeli
2009-03-17 17:43                                 ` Linus Torvalds
2009-03-17 18:09                                   ` Linus Torvalds
2009-03-17 18:19                                     ` Linus Torvalds
2009-03-17 18:46                                       ` Andrea Arcangeli
2009-03-17 19:03                                         ` Linus Torvalds
2009-03-17 19:35                                           ` Andrea Arcangeli
2009-03-17 19:55                                             ` Linus Torvalds
2009-03-11 19:06           ` Andrea Arcangeli
2009-03-12  5:36           ` Nick Piggin
2009-03-12 16:23             ` Nick Piggin
2009-03-12 17:00               ` Andrea Arcangeli
2009-03-12 17:20                 ` Nick Piggin
2009-03-12 17:23                   ` Nick Piggin
2009-03-12 18:06                   ` Andrea Arcangeli
2009-03-12 18:58                     ` Andrea Arcangeli
2009-03-13 16:09                     ` Nick Piggin
2009-03-13 19:34                       ` Andrea Arcangeli
2009-03-14  4:59                         ` Nick Piggin
2009-03-16 13:56                           ` Andrea Arcangeli
2009-03-16 16:01                             ` Nick Piggin
2009-03-14  4:46                       ` Nick Piggin
2009-03-14  5:06                         ` Nick Piggin
2009-03-11 18:53     ` Andrea Arcangeli
2009-03-11 18:22   ` Andrea Arcangeli
2009-03-11 19:06     ` Ingo Molnar
2009-03-11 19:15       ` Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090316223612.4B2A.A69D9226@jp.fujitsu.com \
    --to=kosaki.motohiro@jp.fujitsu.com \
    --cc=aarcange@redhat.com \
    --cc=benh@kernel.crashing.org \
    --cc=hugh@veritas.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-mm@kvack.org \
    --cc=mingo@elte.hu \
    --cc=nickpiggin@yahoo.com.au \
    --cc=npiggin@novell.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox