From: Andreas Gruenbacher <agruenba@redhat.com>
To: Steven Whitehouse <swhiteho@redhat.com>
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>,
"Andreas Grünbacher" <andreas.gruenbacher@gmail.com>,
"Konstantin Khlebnikov" <khlebnikov@yandex-team.ru>,
"Kirill A. Shutemov" <kirill@shutemov.name>,
Linux-MM <linux-mm@kvack.org>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
"Alexander Viro" <viro@zeniv.linux.org.uk>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"cluster-devel@redhat.com" <cluster-devel@redhat.com>,
"Ronnie Sahlberg" <lsahlber@redhat.com>,
"Steve French" <sfrench@samba.org>,
"Bob Peterson" <rpeterso@redhat.com>
Subject: Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
Date: Wed, 27 Nov 2019 17:29:36 +0100 [thread overview]
Message-ID: <CAHc6FU4Mx_=qMYOBc0VYdn-paFXKVffq=k72LCJgTWONv9chng@mail.gmail.com> (raw)
In-Reply-To: <cdd48a4d-42a4-dd15-2701-e08e26fef17f@redhat.com>
On Wed, Nov 27, 2019 at 4:42 PM Steven Whitehouse <swhiteho@redhat.com> wrote:
> Hi,
>
> On 25/11/2019 17:05, Linus Torvalds wrote:
> > On Mon, Nov 25, 2019 at 2:53 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
> >> Linus, is that roughly what you were thinking of?
> > So the concept looks ok, but I don't really like the new flags as they
> > seem to be gfs2-specific rather than generic.
> >
> > That said, I don't _gate_ them either, since they aren't in any
> > critical code sequence, and it's not like they do anything really odd.
> >
> > I still think the whole gfs2 approach is broken. You're magically ok
> > with using stale data from the cache just because it's cached, even if
> > another client might have truncated the file or something.
>
> If another node tries to truncate the file, that will require an
> exclusive glock, and in turn that means the all the other nodes will
> have to drop their glock(s) shared or exclusive. That process
> invalidates the page cache on those nodes, such that any further
> requests on those nodes will find the cache empty and have to call into
> the filesystem.
>
> If a page is truncated on another node, then when the local node gives
> up its glock, after any copying (e.g. for read) has completed then the
> truncate will take place. The local node will then have to reread any
> data relating to new pages or return an error in case the next page to
> be read has vanished due to the truncate. It is a pretty small window,
> and the advantage is that in cases where the page is in cache, we can
> directly use the cached page without having to call into the filesystem
> at all. So it is page atomic in that sense.
>
> The overall aim here is to avoid taking (potentially slow) cluster locks
> when at all possible, yet at the same time deliver close to local fs
> semantics whenever we can. You can think of GFS2's glock concept (at
> least as far as the inodes we are discussing here) as providing a
> combination of (page) cache control and cluster (dlm) locking.
>
> >
> > So you're ok with saying "the file used to be X bytes in size, so
> > we'll just give you this data because we trust that the X is correct".
> >
> > But you're not ok to say "oh, the file used to be X bytes in size, but
> > we don't want to give you a short read because it might not be correct
> > any more".
> >
> > See the disconnect? You trust the size in one situation, but not in another one.
>
> Well we are not trusting the size at all... the original algorithm
> worked entirely off "is this page in cache and uptodate?" and for
> exactly the reason that we know the size in the inode might be out of
> date, if we are not currently holding a glock in either shared or
> exclusive mode. We also know that if there is a page in cache and
> uptodate then we must be holding the glock too.
>
>
> >
> > I also don't really see that you *need* the new flag at all. Since
> > you're doing to do a speculative read and then a real read anyway, and
> > since the only thing that you seem to care about is the file size
> > (because the *data* you will trust if it is cached), then why don't
> > you just use the *existing* generic read, and *IFF* you get a
> > truncated return value, then you go and double-check that the file
> > hasn't changed in size?
> >
> > See what I'm saying? I think gfs2 is being very inconsistent in when
> > it trusts the file size, and I don't see that you even need the new
> > behavior that patch gives, because you might as well just use the
> > existing code (just move the i_size check earlier, and then teach gfs2
> > to double-check the "I didn't get as much as I expected" case).
We can identify short reads, but we won't get information about
readahead back from generic_file_read_iter or filemap_fault. We could
try to work around this with filesystem specific flags for ->readpage
and ->readpages, but that would break down with multiple concurrent
readers in addition to being a real mess. I'm currently out of better
ideas that avoid duplicating the generic code.
> > Linus
>
> I'll leave the finer details to Andreas here, since it is his patch, and
> hopefully we can figure out a good path forward. We are perhaps also a
> bit reluctant to go off and (nearly) duplicate code that is already in
> the core vfs library functions, since that often leads to things getting
> out of sync (our implementation of ->writepages is one case where that
> happened in the past) and missing important bug fixes/features in some
> cases. Hopefully though we can iterate on this a bit and come up with
> something which will resolve all the issues,
>
> Steve.
Thanks,
Andreas
next prev parent reply other threads:[~2019-11-27 16:29 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-10-28 9:59 Konstantin Khlebnikov
2019-10-28 12:39 ` Linus Torvalds
2019-10-28 12:42 ` Kirill A. Shutemov
2019-10-28 12:47 ` Linus Torvalds
2019-10-28 12:57 ` Kirill A. Shutemov
2019-10-29 14:25 ` Konstantin Khlebnikov
2019-10-29 16:52 ` Linus Torvalds
2019-10-30 6:50 ` Kirill A. Shutemov
2019-10-30 7:02 ` Linus Torvalds
2019-10-30 10:34 ` Steven Whitehouse
2019-10-30 10:54 ` Linus Torvalds
2019-10-31 11:40 ` Steven Whitehouse
2019-11-22 23:59 ` Andreas Grünbacher
2019-11-25 10:52 ` Steven Whitehouse
2019-11-25 17:05 ` Linus Torvalds
2019-11-27 15:41 ` Steven Whitehouse
2019-11-27 16:29 ` Andreas Gruenbacher [this message]
2019-11-27 17:29 ` Linus Torvalds
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAHc6FU4Mx_=qMYOBc0VYdn-paFXKVffq=k72LCJgTWONv9chng@mail.gmail.com' \
--to=agruenba@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=andreas.gruenbacher@gmail.com \
--cc=cluster-devel@redhat.com \
--cc=hannes@cmpxchg.org \
--cc=khlebnikov@yandex-team.ru \
--cc=kirill@shutemov.name \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lsahlber@redhat.com \
--cc=rpeterso@redhat.com \
--cc=sfrench@samba.org \
--cc=swhiteho@redhat.com \
--cc=torvalds@linux-foundation.org \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox