From: Nick Piggin <nickpiggin@yahoo.com.au>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Ingo Molnar <mingo@elte.hu>, Nick Piggin <npiggin@novell.com>,
Hugh Dickins <hugh@veritas.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
linux-mm@kvack.org
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
Date: Tue, 17 Mar 2009 03:23:45 +1100 [thread overview]
Message-ID: <200903170323.45917.nickpiggin@yahoo.com.au> (raw)
In-Reply-To: <20090316223612.4B2A.A69D9226@jp.fujitsu.com>
On Tuesday 17 March 2009 03:01:42 KOSAKI Motohiro wrote:
> Hi
>
> > AFAIKS, the approach I've posted is probably the simplest (and maybe only
> > way) to really fix it. It's not too ugly.
>
> May I join this discussion?
Of course :)
> if we only need concern to O_DIRECT, below patch is enough.
>
> Yes, my patch isn't realy solusion.
> Andrea already pointed out that it's not O_DIRECT issue, it's gup vs fork
> issue. *and* my patch is crazy slow :)
Well, it's an interesting question. I'd say it probably is more than
just O_DIRECT. vmsplice too, for example (which I think is much harder
to fix this way because the pages are retired by the other end of
the pipe, so I don't think you can hold a lock across it).
For other device drivers, one could argue that they are "special" and
require special knowledge and apps to use MADV_DONTFORK... Ben didn't
like that so much, and also some other users of get_user_pages might
come up.
But your patch is interesting. I don't think it is crazy slow... well
it might be a bit slow in the case that a threaded app doing a lot of
direct IO or an app doing async IO forks. But how common is that?
I would be slightly more worried about the common cacheline touched
to take the read lock for multithreaded direct IO, but I'm not sure
how much that will hurt DB2.
> So, my point is, I merely oppose easily decision to give up fixing.
>
> Currently, I agree we don't have easily fixinig way.
> but I believe we can solve this problem completely in the nealy future
> because LKML folks are very cool guys.
>
> Thus, I don't hope to append the "BUGS" section of the O_DIRECT man page.
> Also I don't hope that I says "Oh, Solaris can solve your requirement,
> AIX can, FreeBSD can, but Linux can't".
> it beat my proud of linux developer a bit ;)
>
> andorea's patch seems a bit complex than your. but I think it can
> improve later.
> but the man page change can't undo.
>
>
> In addition, May I talk about my gup-fast concern?
> AFAIK, the worth of gup-fast is not removing one atomic operation.
> not grabbing mmap_sem is essetial.
Yes, mmap_sem is the big thing. But straight line speed is important
too.
[...]
> ---
> fs/direct-io.c | 2 ++
> include/linux/init_task.h | 1 +
> include/linux/mm_types.h | 3 +++
> kernel/fork.c | 3 +++
> 4 files changed, 9 insertions(+), 0 deletions(-)
It is an interesting patch. Thanks for throwing it into the discussion.
I do prefer to close the race up for all cases if we decide to do
anything at all about it, ie. all or nothing. But maybe others disagree.
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index b6d4390..8f9a810 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1206,8 +1206,10 @@ __blockdev_direct_IO(int rw, struct kiocb
> *iocb, struct inode *inode,
> dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
> (end > i_size_read(inode)));
>
> + down_read(¤t->mm->directio_sem);
> retval = direct_io_worker(rw, iocb, inode, iov, offset,
> nr_segs, blkbits, get_block, end_io, dio);
> + up_read(¤t->mm->directio_sem);
>
> /*
> * In case of error extending write may have instantiated a few
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index e752d97..68e02b9 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -37,6 +37,7 @@ extern struct fs_struct init_fs;
> .page_table_lock = __SPIN_LOCK_UNLOCKED(name.page_table_lock), \
> .mmlist = LIST_HEAD_INIT(name.mmlist), \
> .cpu_vm_mask = CPU_MASK_ALL, \
> + .directio_sem = __RWSEM_INITIALIZER(name.directio_sem), \
> }
>
> #define INIT_SIGNALS(sig) { \
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index d84feb7..39ba4e6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -274,6 +274,9 @@ struct mm_struct {
> #ifdef CONFIG_MMU_NOTIFIER
> struct mmu_notifier_mm *mmu_notifier_mm;
> #endif
> +
> + /* if there are on-flight directio, we can't fork. */
> + struct rw_semaphore directio_sem;
> };
>
> /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 4854c2c..bbe9fa7 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -266,6 +266,7 @@ static int dup_mmap(struct mm_struct *mm, struct
> mm_struct *oldmm)
> unsigned long charge;
> struct mempolicy *pol;
>
> + down_write(&oldmm->directio_sem);
> down_write(&oldmm->mmap_sem);
> flush_cache_dup_mm(oldmm);
> /*
> @@ -368,6 +369,7 @@ out:
> up_write(&mm->mmap_sem);
> flush_tlb_mm(oldmm);
> up_write(&oldmm->mmap_sem);
> + up_write(&oldmm->directio_sem);
> return retval;
> fail_nomem_policy:
> kmem_cache_free(vm_area_cachep, tmp);
> @@ -431,6 +433,7 @@ static struct mm_struct * mm_init(struct mm_struct
> * mm, struct task_struct *p)
> mm->free_area_cache = TASK_UNMAPPED_BASE;
> mm->cached_hole_size = ~0UL;
> mm_init_owner(mm, p);
> + init_rwsem(&mm->directio_sem);
>
> if (likely(!mm_alloc_pgd(mm))) {
> mm->def_flags = 0;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2009-03-16 16:23 UTC|newest]
Thread overview: 83+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20090311170611.GA2079@elte.hu>
2009-03-11 17:33 ` Linus Torvalds
2009-03-11 17:41 ` Ingo Molnar
2009-03-11 17:58 ` Linus Torvalds
2009-03-11 18:37 ` Andrea Arcangeli
2009-03-11 18:46 ` Linus Torvalds
2009-03-11 19:01 ` Linus Torvalds
2009-03-11 19:59 ` Andrea Arcangeli
2009-03-11 20:19 ` Linus Torvalds
2009-03-11 20:33 ` Linus Torvalds
2009-03-11 20:55 ` Andrea Arcangeli
2009-03-11 21:28 ` Linus Torvalds
2009-03-11 21:57 ` Andrea Arcangeli
2009-03-11 22:06 ` Linus Torvalds
2009-03-11 22:07 ` Linus Torvalds
2009-03-11 22:22 ` Davide Libenzi
2009-03-11 22:32 ` Linus Torvalds
2009-03-14 5:07 ` Benjamin Herrenschmidt
2009-03-11 20:48 ` Andrea Arcangeli
2009-03-14 5:06 ` Benjamin Herrenschmidt
2009-03-14 5:20 ` Nick Piggin
2009-03-16 16:01 ` KOSAKI Motohiro
2009-03-16 16:23 ` Nick Piggin [this message]
2009-03-16 16:32 ` Linus Torvalds
2009-03-16 16:50 ` Nick Piggin
2009-03-16 17:02 ` Linus Torvalds
2009-03-16 17:19 ` Nick Piggin
2009-03-16 17:42 ` Linus Torvalds
2009-03-16 18:02 ` Nick Piggin
2009-03-16 18:05 ` Nick Piggin
2009-03-16 18:17 ` Linus Torvalds
2009-03-16 18:33 ` Nick Piggin
2009-03-16 19:22 ` Linus Torvalds
2009-03-17 5:44 ` Nick Piggin
2009-03-16 18:14 ` Linus Torvalds
2009-03-16 18:29 ` Nick Piggin
2009-03-16 19:17 ` Linus Torvalds
2009-03-17 5:42 ` Nick Piggin
2009-03-17 5:58 ` Nick Piggin
2009-03-16 18:37 ` Andrea Arcangeli
2009-03-16 18:28 ` Andrea Arcangeli
2009-03-16 23:59 ` KAMEZAWA Hiroyuki
2009-03-18 2:04 ` KOSAKI Motohiro
2009-03-22 12:23 ` KOSAKI Motohiro
2009-03-23 0:13 ` KOSAKI Motohiro
2009-03-23 16:29 ` Ingo Molnar
2009-03-23 16:46 ` Linus Torvalds
2009-03-24 5:08 ` KOSAKI Motohiro
2009-03-24 13:43 ` Nick Piggin
2009-03-24 17:56 ` Linus Torvalds
2009-03-30 10:52 ` KOSAKI Motohiro
[not found] ` <200904022307.12043.nickpiggin@yahoo.com.au>
2009-04-03 3:49 ` Nick Piggin
2009-03-17 0:44 ` Linus Torvalds
2009-03-17 0:56 ` KAMEZAWA Hiroyuki
2009-03-17 12:19 ` Andrea Arcangeli
2009-03-17 16:43 ` Linus Torvalds
2009-03-17 17:01 ` Linus Torvalds
2009-03-17 17:10 ` Andrea Arcangeli
2009-03-17 17:43 ` Linus Torvalds
2009-03-17 18:09 ` Linus Torvalds
2009-03-17 18:19 ` Linus Torvalds
2009-03-17 18:46 ` Andrea Arcangeli
2009-03-17 19:03 ` Linus Torvalds
2009-03-17 19:35 ` Andrea Arcangeli
2009-03-17 19:55 ` Linus Torvalds
2009-03-11 19:06 ` Andrea Arcangeli
2009-03-12 5:36 ` Nick Piggin
2009-03-12 16:23 ` Nick Piggin
2009-03-12 17:00 ` Andrea Arcangeli
2009-03-12 17:20 ` Nick Piggin
2009-03-12 17:23 ` Nick Piggin
2009-03-12 18:06 ` Andrea Arcangeli
2009-03-12 18:58 ` Andrea Arcangeli
2009-03-13 16:09 ` Nick Piggin
2009-03-13 19:34 ` Andrea Arcangeli
2009-03-14 4:59 ` Nick Piggin
2009-03-16 13:56 ` Andrea Arcangeli
2009-03-16 16:01 ` Nick Piggin
2009-03-14 4:46 ` Nick Piggin
2009-03-14 5:06 ` Nick Piggin
2009-03-11 18:53 ` Andrea Arcangeli
2009-03-11 18:22 ` Andrea Arcangeli
2009-03-11 19:06 ` Ingo Molnar
2009-03-11 19:15 ` Andrea Arcangeli
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=200903170323.45917.nickpiggin@yahoo.com.au \
--to=nickpiggin@yahoo.com.au \
--cc=aarcange@redhat.com \
--cc=benh@kernel.crashing.org \
--cc=hugh@veritas.com \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=kosaki.motohiro@jp.fujitsu.com \
--cc=linux-mm@kvack.org \
--cc=mingo@elte.hu \
--cc=npiggin@novell.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox