Re: RFC: reviving mlock isolation dead code

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Hugh Dickins <hughd@google.com>
To: Michel Lespinasse <walken@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Nick Piggin <npiggin@kernel.dk>,
	Arjan van de Ven <arjan@infradead.org>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: RFC: reviving mlock isolation dead code
Date: Tue, 16 Nov 2010 15:28:45 -0800 (PST)	[thread overview]
Message-ID: <alpine.LSU.2.00.1011161444200.16422@tigran.mtv.corp.google.com> (raw)
In-Reply-To: <AANLkTin+16yDxGrRfbqw9OPnDDV8OgXr_nbZnXJEHK9w@mail.gmail.com>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4671 bytes --]

On Mon, 15 Nov 2010, Michel Lespinasse wrote:
> On Mon, Nov 15, 2010 at 5:44 PM, Hugh Dickins <hughd@google.com> wrote:
> > On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
> >> Michel Lespinasse <walken@google.com> wrote:
> >> > ...
> >> > The other mlock related issue I have is that it marks pages as dirty
> >> > (if they are in a writable VMA), and causes writeback to work on them,
> >> > even though the pages have not actually been modified. This looks like
> >> > it would be solvable with a new get_user_pages flag for mlock use
> >> > (breaking cow etc, but not writing to the pages just yet).
> >>
> >> To be honest, I haven't understand why current code does so. I dislike it too. but
> >> I'm not sure such change is safe or not. I hope another developer comment you ;-)
> >
> > It's been that way for years, and the primary purpose is to do the COWs
> > in advance, so we won't need to allocate new pages later to the locked
> > area: the pages that may be needed are already locked down.
> 
> Thanks Hugh for posting your comments. I was aware of Suleiman's
> proposal to always do a READ mode get_user_pages years ago, and I
> could see that we'd need a new flag instead so we can break COW
> without dirtying pages, but I hadn't thought about other issues.
> 
> > That justifies it for the private mapping case, but what of shared maps?
> > There the justification is that the underlying file might be sparse, and
> > we want to allocate blocks upfront for the locked area.
> >
> > Do we?  I dislike it also, as you both do.  It seems crazy to mark a
> > vast number of pages as dirty when they're not.
> >
> > It makes sense to mark pte_dirty when we have a real write fault to a
> > page, to save the mmu from making that pagetable transaction immediately
> > after; but it does not make sense when the write (if any) may come
> > minutes later - we'll just do a pointless write and clear dirty meanwhile.
> 
> If we just mlocked the page but did not made it writable (or mark it
> dirty) yet, would we be allowed to skip the page_mkwrite method call ?

Yes, indeed you should skip it in that case.

> 
> I believe this would be legal:

Yes, I agree that it would be legal.

> 
> - If/when an actual write comes later on, we'll run through
> do_wp_page() again, and reuse the old page, making it writable and
> dirty from then on. Since this is a shared mapping, we won't have to
> allocate a new page at a that time, so this preserves the mlock
> semantic of having all necessary pages preallocated.
> 
> - If we skip page_mkwrite(), we can't guarantee that the filesystem
> will have a free block to allocate, but is this actually part of the
> mlock() semantics ? I think not, given that only a few filesystems
> implement page_mkwrite() in the first place. ext4 does, but ext2/3
> does not, for example. So while skipping page_mkwrite() would prevent
> data blocks from being pre-allocated, I don't really see it as
> breaking mlock() ?

Yes, allocating the blocks is not actually part of mlock() semantics.

And a few years ago, there was no ->page_mkwrite(), and the ->nopage()
interface didn't tell the filesystem whether it was read or write fault
(and mlocking a writable vma certainly didn't do synchronous writes back
to disk before the mlock returned success or failure).

It's all a matter of QoS: is it acceptable to make the change, that
a write fault to an mlocked area of a sparse file might now generate
SIGBUS, on a few filesystems which have recently been guaranteeing not?

Personally, I believe that's more acceptable than doing a huge rush of
(almost always) pointless writes at the time of mlock().  But I can
see that others may disagree.

> 
> > If it does work out, I think you'd need to be passing the flag down to
> > follow_page too: I have a patch or patches to merge the FOLL_flags with
> > the FAULT_FLAGs - Linus wanted that a year ago, and I recently met a
> > need for it with shmem - I'd better accelerate sending those in.
> 
> The follow_page change is simpler, it might even be sufficient to not
> pass in the FOLL_TOUCH flag I think.

Yes, in fact, is anything required beyond Peter's original simple patch?

There are some tweaks that could be added.  A FAULT_FLAG to let filesystem
know that we're mlocking a writable area, so it could be careful about it?
only useful if some filesystem uses it!  A check on vma_wants_writenotify()
or something like it, so mlock does set pte_write if it's okay e.g. tmpfs?
Second order things, probably don't matter.

Added Ccs of those most likely to agree or disagree with us.

Hugh

next prev parent reply	other threads:[~2010-11-16 23:29 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-10-30 10:16 Michel Lespinasse
2010-10-30 12:48 ` Michel Lespinasse
2010-11-01  7:05 ` KOSAKI Motohiro
2010-11-09  4:34   ` KOSAKI Motohiro
2010-11-10 12:21     ` Michel Lespinasse
2010-11-14  5:07       ` KOSAKI Motohiro
2010-11-16  1:44         ` Hugh Dickins
2010-11-16  6:50           ` Michel Lespinasse
2010-11-16 23:28             ` Hugh Dickins [this message]
2010-11-18 11:16               ` Michel Lespinasse

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LSU.2.00.1011161444200.16422@tigran.mtv.corp.google.com \
    --to=hughd@google.com \
    --cc=arjan@infradead.org \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-mm@kvack.org \
    --cc=npiggin@kernel.dk \
    --cc=peterz@infradead.org \
    --cc=walken@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox