linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Nick Piggin <nickpiggin@yahoo.com.au>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Ingo Molnar <mingo@elte.hu>, Nick Piggin <npiggin@novell.com>,
	Hugh Dickins <hugh@veritas.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	linux-mm@kvack.org
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
Date: Tue, 17 Mar 2009 03:23:45 +1100	[thread overview]
Message-ID: <200903170323.45917.nickpiggin@yahoo.com.au> (raw)
In-Reply-To: <20090316223612.4B2A.A69D9226@jp.fujitsu.com>

On Tuesday 17 March 2009 03:01:42 KOSAKI Motohiro wrote:
> Hi
>

> > AFAIKS, the approach I've posted is probably the simplest (and maybe only
> > way) to really fix it. It's not too ugly.
>
> May I join this discussion?

Of course :)


> if we only need concern to O_DIRECT, below patch is enough.
>
> Yes, my patch isn't realy solusion.
> Andrea already pointed out that it's not O_DIRECT issue, it's gup vs fork
> issue. *and* my patch is crazy slow :)

Well, it's an interesting question. I'd say it probably is more than
just O_DIRECT. vmsplice too, for example (which I think is much harder
to fix this way because the pages are retired by the other end of
the pipe, so I don't think you can hold a lock across it).

For other device drivers, one could argue that they are "special" and
require special knowledge and apps to use MADV_DONTFORK... Ben didn't
like that so much, and also some other users of get_user_pages might
come up.

But your patch is interesting. I don't think it is crazy slow... well
it might be a bit slow in the case that a threaded app doing a lot of
direct IO or an app doing async IO forks. But how common is that?

I would be slightly more worried about the common cacheline touched
to take the read lock for multithreaded direct IO, but I'm not sure
how much that will hurt DB2.


> So, my point is, I merely oppose easily decision to give up fixing.
>
> Currently, I agree we don't have easily fixinig way.
> but I believe we can solve this problem completely in the nealy future
> because LKML folks are very cool guys.
>
> Thus, I don't hope to append the "BUGS" section of the O_DIRECT man page.
> Also I don't hope that I says "Oh, Solaris can solve your requirement,
> AIX can, FreeBSD can, but Linux can't".
> it beat my proud of linux developer a bit ;)
>
> andorea's patch seems a bit complex than your. but I think it can
> improve later.
> but the man page change can't undo.
>
>
> In addition, May I talk about my gup-fast concern?
> AFAIK, the worth of gup-fast is not removing one atomic operation.
> not grabbing mmap_sem is essetial.

Yes, mmap_sem is the big thing. But straight line speed is important
too.

[...]

> ---
>  fs/direct-io.c            |    2 ++
>  include/linux/init_task.h |    1 +
>  include/linux/mm_types.h  |    3 +++
>  kernel/fork.c             |    3 +++
>  4 files changed, 9 insertions(+), 0 deletions(-)

It is an interesting patch. Thanks for throwing it into the discussion.
I do prefer to close the race up for all cases if we decide to do
anything at all about it, ie. all or nothing. But maybe others disagree.


> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index b6d4390..8f9a810 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1206,8 +1206,10 @@ __blockdev_direct_IO(int rw, struct kiocb
> *iocb, struct inode *inode,
>  	dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
>  		(end > i_size_read(inode)));
>
> +	down_read(&current->mm->directio_sem);
>  	retval = direct_io_worker(rw, iocb, inode, iov, offset,
>  				nr_segs, blkbits, get_block, end_io, dio);
> +	up_read(&current->mm->directio_sem);
>
>  	/*
>  	 * In case of error extending write may have instantiated a few
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index e752d97..68e02b9 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -37,6 +37,7 @@ extern struct fs_struct init_fs;
>  	.page_table_lock =  __SPIN_LOCK_UNLOCKED(name.page_table_lock),	\
>  	.mmlist		= LIST_HEAD_INIT(name.mmlist),		\
>  	.cpu_vm_mask	= CPU_MASK_ALL,				\
> +	.directio_sem	= __RWSEM_INITIALIZER(name.directio_sem), \
>  }
>
>  #define INIT_SIGNALS(sig) {						\
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index d84feb7..39ba4e6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -274,6 +274,9 @@ struct mm_struct {
>  #ifdef CONFIG_MMU_NOTIFIER
>  	struct mmu_notifier_mm *mmu_notifier_mm;
>  #endif
> +
> +	/* if there are on-flight directio, we can't fork. */
> +	struct rw_semaphore directio_sem;
>  };
>
>  /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 4854c2c..bbe9fa7 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -266,6 +266,7 @@ static int dup_mmap(struct mm_struct *mm, struct
> mm_struct *oldmm)
>  	unsigned long charge;
>  	struct mempolicy *pol;
>
> +	down_write(&oldmm->directio_sem);
>  	down_write(&oldmm->mmap_sem);
>  	flush_cache_dup_mm(oldmm);
>  	/*
> @@ -368,6 +369,7 @@ out:
>  	up_write(&mm->mmap_sem);
>  	flush_tlb_mm(oldmm);
>  	up_write(&oldmm->mmap_sem);
> +	up_write(&oldmm->directio_sem);
>  	return retval;
>  fail_nomem_policy:
>  	kmem_cache_free(vm_area_cachep, tmp);
> @@ -431,6 +433,7 @@ static struct mm_struct * mm_init(struct mm_struct
> * mm, struct task_struct *p)
>  	mm->free_area_cache = TASK_UNMAPPED_BASE;
>  	mm->cached_hole_size = ~0UL;
>  	mm_init_owner(mm, p);
> +	init_rwsem(&mm->directio_sem);
>
>  	if (likely(!mm_alloc_pgd(mm))) {
>  		mm->def_flags = 0;


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2009-03-16 16:23 UTC|newest]

Thread overview: 83+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20090311170611.GA2079@elte.hu>
2009-03-11 17:33 ` Linus Torvalds
2009-03-11 17:41   ` Ingo Molnar
2009-03-11 17:58     ` Linus Torvalds
2009-03-11 18:37       ` Andrea Arcangeli
2009-03-11 18:46         ` Linus Torvalds
2009-03-11 19:01           ` Linus Torvalds
2009-03-11 19:59             ` Andrea Arcangeli
2009-03-11 20:19               ` Linus Torvalds
2009-03-11 20:33                 ` Linus Torvalds
2009-03-11 20:55                   ` Andrea Arcangeli
2009-03-11 21:28                     ` Linus Torvalds
2009-03-11 21:57                       ` Andrea Arcangeli
2009-03-11 22:06                         ` Linus Torvalds
2009-03-11 22:07                           ` Linus Torvalds
2009-03-11 22:22                           ` Davide Libenzi
2009-03-11 22:32                             ` Linus Torvalds
2009-03-14  5:07                   ` Benjamin Herrenschmidt
2009-03-11 20:48                 ` Andrea Arcangeli
2009-03-14  5:06                 ` Benjamin Herrenschmidt
2009-03-14  5:20                   ` Nick Piggin
2009-03-16 16:01                     ` KOSAKI Motohiro
2009-03-16 16:23                       ` Nick Piggin [this message]
2009-03-16 16:32                         ` Linus Torvalds
2009-03-16 16:50                           ` Nick Piggin
2009-03-16 17:02                             ` Linus Torvalds
2009-03-16 17:19                               ` Nick Piggin
2009-03-16 17:42                                 ` Linus Torvalds
2009-03-16 18:02                                   ` Nick Piggin
2009-03-16 18:05                                     ` Nick Piggin
2009-03-16 18:17                                       ` Linus Torvalds
2009-03-16 18:33                                         ` Nick Piggin
2009-03-16 19:22                                           ` Linus Torvalds
2009-03-17  5:44                                             ` Nick Piggin
2009-03-16 18:14                                     ` Linus Torvalds
2009-03-16 18:29                                       ` Nick Piggin
2009-03-16 19:17                                         ` Linus Torvalds
2009-03-17  5:42                                           ` Nick Piggin
2009-03-17  5:58                                             ` Nick Piggin
2009-03-16 18:37                                       ` Andrea Arcangeli
2009-03-16 18:28                                   ` Andrea Arcangeli
2009-03-16 23:59                             ` KAMEZAWA Hiroyuki
2009-03-18  2:04                         ` KOSAKI Motohiro
2009-03-22 12:23                           ` KOSAKI Motohiro
2009-03-23  0:13                             ` KOSAKI Motohiro
2009-03-23 16:29                               ` Ingo Molnar
2009-03-23 16:46                                 ` Linus Torvalds
2009-03-24  5:08                                   ` KOSAKI Motohiro
2009-03-24 13:43                             ` Nick Piggin
2009-03-24 17:56                               ` Linus Torvalds
2009-03-30 10:52                               ` KOSAKI Motohiro
     [not found]                                 ` <200904022307.12043.nickpiggin@yahoo.com.au>
2009-04-03  3:49                                   ` Nick Piggin
2009-03-17  0:44                       ` Linus Torvalds
2009-03-17  0:56                         ` KAMEZAWA Hiroyuki
2009-03-17 12:19                         ` Andrea Arcangeli
2009-03-17 16:43                           ` Linus Torvalds
2009-03-17 17:01                             ` Linus Torvalds
2009-03-17 17:10                               ` Andrea Arcangeli
2009-03-17 17:43                                 ` Linus Torvalds
2009-03-17 18:09                                   ` Linus Torvalds
2009-03-17 18:19                                     ` Linus Torvalds
2009-03-17 18:46                                       ` Andrea Arcangeli
2009-03-17 19:03                                         ` Linus Torvalds
2009-03-17 19:35                                           ` Andrea Arcangeli
2009-03-17 19:55                                             ` Linus Torvalds
2009-03-11 19:06           ` Andrea Arcangeli
2009-03-12  5:36           ` Nick Piggin
2009-03-12 16:23             ` Nick Piggin
2009-03-12 17:00               ` Andrea Arcangeli
2009-03-12 17:20                 ` Nick Piggin
2009-03-12 17:23                   ` Nick Piggin
2009-03-12 18:06                   ` Andrea Arcangeli
2009-03-12 18:58                     ` Andrea Arcangeli
2009-03-13 16:09                     ` Nick Piggin
2009-03-13 19:34                       ` Andrea Arcangeli
2009-03-14  4:59                         ` Nick Piggin
2009-03-16 13:56                           ` Andrea Arcangeli
2009-03-16 16:01                             ` Nick Piggin
2009-03-14  4:46                       ` Nick Piggin
2009-03-14  5:06                         ` Nick Piggin
2009-03-11 18:53     ` Andrea Arcangeli
2009-03-11 18:22   ` Andrea Arcangeli
2009-03-11 19:06     ` Ingo Molnar
2009-03-11 19:15       ` Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200903170323.45917.nickpiggin@yahoo.com.au \
    --to=nickpiggin@yahoo.com.au \
    --cc=aarcange@redhat.com \
    --cc=benh@kernel.crashing.org \
    --cc=hugh@veritas.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-mm@kvack.org \
    --cc=mingo@elte.hu \
    --cc=npiggin@novell.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox