Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Andreas Gruenbacher <agruenba@redhat.com>
To: Steven Whitehouse <swhiteho@redhat.com>
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>,
	"Andreas Grünbacher" <andreas.gruenbacher@gmail.com>,
	"Konstantin Khlebnikov" <khlebnikov@yandex-team.ru>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Linux-MM <linux-mm@kvack.org>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	"Alexander Viro" <viro@zeniv.linux.org.uk>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"cluster-devel@redhat.com" <cluster-devel@redhat.com>,
	"Ronnie Sahlberg" <lsahlber@redhat.com>,
	"Steve French" <sfrench@samba.org>,
	"Bob Peterson" <rpeterso@redhat.com>
Subject: Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
Date: Wed, 27 Nov 2019 17:29:36 +0100	[thread overview]
Message-ID: <CAHc6FU4Mx_=qMYOBc0VYdn-paFXKVffq=k72LCJgTWONv9chng@mail.gmail.com> (raw)
In-Reply-To: <cdd48a4d-42a4-dd15-2701-e08e26fef17f@redhat.com>

On Wed, Nov 27, 2019 at 4:42 PM Steven Whitehouse <swhiteho@redhat.com> wrote:
> Hi,
>
> On 25/11/2019 17:05, Linus Torvalds wrote:
> > On Mon, Nov 25, 2019 at 2:53 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
> >> Linus, is that roughly what you were thinking of?
> > So the concept looks ok, but I don't really like the new flags as they
> > seem to be gfs2-specific rather than generic.
> >
> > That said, I don't _gate_ them either, since they aren't in any
> > critical code sequence, and it's not like they do anything really odd.
> >
> > I still think the whole gfs2 approach is broken. You're magically ok
> > with using stale data from the cache just because it's cached, even if
> > another client might have truncated the file or something.
>
> If another node tries to truncate the file, that will require an
> exclusive glock, and in turn that means the all the other nodes will
> have to drop their glock(s) shared or exclusive. That process
> invalidates the page cache on those nodes, such that any further
> requests on those nodes will find the cache empty and have to call into
> the filesystem.
>
> If a page is truncated on another node, then when the local node gives
> up its glock, after any copying (e.g. for read) has completed then the
> truncate will take place. The local node will then have to reread any
> data relating to new pages or return an error in case the next page to
> be read has vanished due to the truncate. It is a pretty small window,
> and the advantage is that in cases where the page is in cache, we can
> directly use the cached page without having to call into the filesystem
> at all. So it is page atomic in that sense.
>
> The overall aim here is to avoid taking (potentially slow) cluster locks
> when at all possible, yet at the same time deliver close to local fs
> semantics whenever we can. You can think of GFS2's glock concept (at
> least as far as the inodes we are discussing here) as providing a
> combination of (page) cache control and cluster (dlm) locking.
>
> >
> > So you're ok with saying "the file used to be X bytes in size, so
> > we'll just give you this data because we trust that the X is correct".
> >
> > But you're not ok to say "oh, the file used to be X bytes in size, but
> > we don't want to give you a short read because it might not be correct
> > any more".
> >
> > See the disconnect? You trust the size in one situation, but not in another one.
>
> Well we are not trusting the size at all... the original algorithm
> worked entirely off "is this page in cache and uptodate?" and for
> exactly the reason that we know the size in the inode might be out of
> date, if we are not currently holding a glock in either shared or
> exclusive mode. We also know that if there is a page in cache and
> uptodate then we must be holding the glock too.
>
>
> >
> > I also don't really see that you *need* the new flag at all. Since
> > you're doing to do a speculative read and then a real read anyway, and
> > since the only thing that you seem to care about is the file size
> > (because the *data* you will trust if it is cached), then why don't
> > you just use the *existing* generic read, and *IFF* you get a
> > truncated return value, then you go and double-check that the file
> > hasn't changed in size?
> >
> > See what I'm saying? I think gfs2 is being very inconsistent in when
> > it trusts the file size, and I don't see that you even need the new
> > behavior that patch gives, because you might as well just use the
> > existing code (just move the i_size check earlier, and then teach gfs2
> > to double-check the "I didn't get as much as I expected" case).

We can identify short reads, but we won't get information about
readahead back from generic_file_read_iter or filemap_fault. We could
try to work around this with filesystem specific flags for ->readpage
and ->readpages, but that would break down with multiple concurrent
readers in addition to being a real mess. I'm currently out of better
ideas that avoid duplicating the generic code.

> >                   Linus
>
> I'll leave the finer details to Andreas here, since it is his patch, and
> hopefully we can figure out a good path forward. We are perhaps also a
> bit reluctant to go off and (nearly) duplicate code that is already in
> the core vfs library functions, since that often leads to things getting
> out of sync (our implementation of ->writepages is one case where that
> happened in the past) and missing important bug fixes/features in some
> cases. Hopefully though we can iterate on this a bit and come up with
> something which will resolve all the issues,
>
> Steve.

Thanks,
Andreas

next prev parent reply	other threads:[~2019-11-27 16:29 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-28  9:59 Konstantin Khlebnikov
2019-10-28 12:39 ` Linus Torvalds
2019-10-28 12:42 ` Kirill A. Shutemov
2019-10-28 12:47   ` Linus Torvalds
2019-10-28 12:57     ` Kirill A. Shutemov
2019-10-29 14:25       ` Konstantin Khlebnikov
2019-10-29 16:52         ` Linus Torvalds
2019-10-30  6:50           ` Kirill A. Shutemov
2019-10-30  7:02             ` Linus Torvalds
2019-10-30 10:34           ` Steven Whitehouse
2019-10-30 10:54             ` Linus Torvalds
2019-10-31 11:40               ` Steven Whitehouse
2019-11-22 23:59                 ` Andreas Grünbacher
2019-11-25 10:52                   ` Steven Whitehouse
2019-11-25 17:05                     ` Linus Torvalds
2019-11-27 15:41                       ` Steven Whitehouse
2019-11-27 16:29                         ` Andreas Gruenbacher [this message]
2019-11-27 17:29                         ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHc6FU4Mx_=qMYOBc0VYdn-paFXKVffq=k72LCJgTWONv9chng@mail.gmail.com' \
    --to=agruenba@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andreas.gruenbacher@gmail.com \
    --cc=cluster-devel@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=khlebnikov@yandex-team.ru \
    --cc=kirill@shutemov.name \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lsahlber@redhat.com \
    --cc=rpeterso@redhat.com \
    --cc=sfrench@samba.org \
    --cc=swhiteho@redhat.com \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox