From: Hugh Dickins <hughd@google.com>
To: Jan Kara <jack@suse.cz>
Cc: David Herrmann <dh.herrmann@gmail.com>,
Hugh Dickins <hughd@google.com>,
Tony Battersby <tonyb@cybernetics.com>,
Al Viro <viro@zeniv.linux.org.uk>,
Michael Kerrisk <mtk.manpages@gmail.com>,
Ryan Lortie <desrt@desrt.ca>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm <linux-mm@kvack.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
linux-kernel <linux-kernel@vger.kernel.org>,
Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
Greg Kroah-Hartman <greg@kroah.com>,
John Stultz <john.stultz@linaro.org>,
Kristian Hogsberg <krh@bitplanet.net>,
Lennart Poettering <lennart@poettering.net>,
Daniel Mack <zonque@gmail.com>, Kay Sievers <kay@vrfy.org>
Subject: Re: [PATCH v2 0/3] File Sealing & memfd_create()
Date: Mon, 19 May 2014 15:11:04 -0700 (PDT) [thread overview]
Message-ID: <alpine.LSU.2.11.1405191403280.969@eggly.anvils> (raw)
In-Reply-To: <20140519160942.GD3427@quack.suse.cz>
On Mon, 19 May 2014, Jan Kara wrote:
> On Mon 19-05-14 13:44:25, David Herrmann wrote:
> > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins <hughd@google.com> wrote:
> > > The aspect which really worries me is this: the maintenance burden.
> > > This approach would add some peculiar new code, introducing a rare
> > > special case: which we might get right today, but will very easily
> > > forget tomorrow when making some other changes to mm. If we compile
> > > a list of danger areas in mm, this would surely belong on that list.
> >
> > I tried doing the page-replacement in the last 4 days, but honestly,
> > it's far more complex than I thought. So if no-one more experienced
To be honest, I'm quite glad to hear that: it is still a solution worth
considering, but I'd rather continue the search for a better solution.
> > with mm/ comes up with a simple implementation, I'll have to delay
> > this for some more weeks.
> >
> > However, I still wonder why we try to fix this as part of this
> > patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
> > amount of time. Same is true for network block-devices, NFS, iscsi,
> > maybe loop-devices, ... This means, _any_ once mapped page can be
> > written to after an arbitrary delay. This can break any feature that
> > makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
> > sealing, ..).
We need to fix it together with your sealing patchset, because your
patchset is all about introducing a new kind of guarantee: a guarantee
which this async i/o issue makes impossible to give, as things stand.
Exasperating for you, I understand; but that's how it is.
A new feature may make new demands on the infrastructure.
I can imagine existing problems, but (I may be out of touch) I have
not heard of them as problems in practice. Certainly they would not
be recent regressions: mm-page versus fs-file has worked in this way
for as long as I've known them (pages released independently of
unmapping the file, with the understanding that i/o might still
be in progress, so care taken not to free the pages too soon).
> >
> > Shouldn't we try to fix the _cause_ of this?
Nobody is against fixing the cause: we are all looking for the
simplest way of doing so,
> >
> > Isn't there a simple way to lock/mark/.. affected vmas in
> > get_user_pages(_fast)() and release them once done? We could increase
> > i_mmap_writable on all affected address_space and decrease it on
> > release. This would at least prevent sealing and could be check on
> > other operations, too (like setting S_IMMUTABLE).
> > This should be as easy as checking page_mapping(page) != NULL and then
> > adjusting ->i_mmap_writable in
> > get_writable_user_pages/put_writable_user_pages, right?
> Doing this would be quite a bit of work. Currently references returned by
> get_user_pages() are page references like any other and thus are released
> by put_page() or similar. Now you would make them special and they need
> special releasing and there are lots of places in kernel where
> get_user_pages() is used that would need changing.
Lots of places that would need changing, yes; but we have often
wondered in the past whether there should be a put_user_pages().
Though I'm not sure that it would actually solve anything...
>
> Another aspect is that it could have performance implications - if there
> are several processes using get_user_pages[_fast]() on a file, they would
> start contending on modifying i_mmap_writeable.
Doing extra vma work in get_user_pages() wouldn't be so bad. But doing
any vma work in get_user_pages_fast() would upset almost all its users:
get_user_pages_fast() is a fast-path which expressly avoids the vmas,
and hates additional cachelines being added to its machinations.
If sealing had appeared before get_user_pages_fast(), maybe we wouldn't
have let get_user_pages_fast() in; but now it's the other way around.
I would be more interested in attacking from the get_user_pages() and
get_user_pages_fast() end, if I could convince myself that they do
actually delimit the problem; maybe they do, but I'm not yet convinced.
>
> One somewhat crazy idea I have is that maybe we could delay unmapping of a
> page if this was last VMA referencing it until all extra page references of
> pages in there are dropped. That would make i_mmap_writeable reliable for
> you and it would also close those races with remount. Hugh, do you think
> this might be viable?
It is definitely worth pursuing further, but I'm not very hopeful on it.
In a world of free page flags and free struct page fields, maybe. (And
I don't see sealing as a feature sensibly restricted to 64-bit only.)
I think we would have to set a page flag, maybe bump a count, for every
leftover page that raises i_mmap_writable; and lower it (potentially from
interrupt context) at put_page() time. Easy to make i_mmap_writable an
atomic rather than guarded by i_mmap_mutex, but we still need to
synchronize on it falling to 0.
And how would we recognize the relevant, decrementing, put_page()?
page_count divided into "read_"count and write_count? Ugh!
I also have a strong instinct against adding delays into munmap+exit;
though that mainly comes from the urge to free memory, and here we are
only delaying until a page becomes freeable, so maybe I should abandon
that bias in this case.
I did start thinking in this direction last week, but stuck somewhere
and retreated, I forget on what issue. At this moment I'm not really
in that zone, but anxious to complete my promised responses to David's
patches, which I almost but not quite completed last night.
Hugh
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2014-05-19 22:12 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-04-15 18:38 David Herrmann
2014-04-15 18:38 ` [PATCH v2 1/3] shm: add sealing API David Herrmann
2014-05-20 2:16 ` Hugh Dickins
2014-05-23 16:37 ` David Herrmann
2014-06-02 10:30 ` Hugh Dickins
2014-04-15 18:38 ` [PATCH v2 2/3] shm: add memfd_create() syscall David Herrmann
2014-05-20 2:20 ` Hugh Dickins
2014-05-23 16:57 ` David Herrmann
2014-06-02 10:59 ` Hugh Dickins
2014-06-02 17:50 ` Andy Lutomirski
2014-06-13 10:42 ` David Herrmann
2014-05-21 10:50 ` Konstantin Khlebnikov
2014-04-15 18:38 ` [PATCH v2 3/3] selftests: add memfd_create() + sealing tests David Herrmann
2014-05-20 2:22 ` Hugh Dickins
2014-05-23 17:06 ` David Herrmann
2014-05-14 5:09 ` [PATCH v2 0/3] File Sealing & memfd_create() Hugh Dickins
2014-05-14 16:15 ` Tony Battersby
2014-05-14 22:35 ` Hugh Dickins
2014-05-19 11:44 ` David Herrmann
2014-05-19 16:09 ` Jan Kara
2014-05-19 22:11 ` Hugh Dickins [this message]
2014-05-26 11:44 ` David Herrmann
2014-05-31 4:44 ` Hugh Dickins
2014-06-02 4:42 ` Minchan Kim
2014-06-02 9:14 ` Jan Kara
2014-06-02 16:04 ` David Herrmann
2014-06-03 8:31 ` Peter Zijlstra
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.LSU.2.11.1405191403280.969@eggly.anvils \
--to=hughd@google.com \
--cc=akpm@linux-foundation.org \
--cc=desrt@desrt.ca \
--cc=dh.herrmann@gmail.com \
--cc=greg@kroah.com \
--cc=hannes@cmpxchg.org \
--cc=jack@suse.cz \
--cc=john.stultz@linaro.org \
--cc=kay@vrfy.org \
--cc=krh@bitplanet.net \
--cc=lennart@poettering.net \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mtk.manpages@gmail.com \
--cc=tj@kernel.org \
--cc=tonyb@cybernetics.com \
--cc=torvalds@linux-foundation.org \
--cc=viro@zeniv.linux.org.uk \
--cc=zonque@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox