From: Chuck Lever <chuck.lever@oracle.com>
To: NeilBrown <neilb@suse.de>
Cc: Chuck Lever <cel@kernel.org>, David Howells <dhowells@redhat.com>,
Jeff Layton <jlayton@kernel.org>, Hugh Dickins <hughd@google.com>,
Jens Axboe <axboe@kernel.dk>,
Matthew Wilcox <willy@infradead.org>,
linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [PATCH] nfsd: Fix reading via splice
Date: Sun, 30 Jul 2023 11:29:24 -0400 [thread overview]
Message-ID: <ZMaB1BwBfNko1ZoE@tissot.1015granger.net> (raw)
In-Reply-To: <169058849828.32308.14965537137761913794@noble.neil.brown.name>
On Sat, Jul 29, 2023 at 09:54:58AM +1000, NeilBrown wrote:
> On Fri, 28 Jul 2023, Chuck Lever wrote:
> > From: David Howells <dhowells@redhat.com>
> >
> > nfsd_splice_actor() has a clause in its loop that chops up a compound page
> > into individual pages such that if the same page is seen twice in a row, it
> > is discarded the second time. This is a problem with the advent of
> > shmem_splice_read() as that inserts zero_pages into the pipe in lieu of
> > pages that aren't present in the pagecache.
> >
> > Fix this by assuming that the last page is being extended only if the
> > currently stored length + starting offset is not currently on a page
> > boundary.
> >
> > This can be tested by NFS-exporting a tmpfs filesystem on the test machine
> > and truncating it to more than a page in size (eg. truncate -s 8192) and
> > then reading it by NFS. The first page will be all zeros, but thereafter
> > garbage will be read.
> >
> > Note: I wonder if we can ever get a situation now where we get a splice
> > that gives us contiguous parts of a page in separate actor calls. As NFSD
> > can only be splicing from a file (I think), there are only three sources of
> > the page: copy_splice_read(), shmem_splice_read() and file_splice_read().
> > The first allocates pages for the data it reads, so the problem cannot
> > occur; the second should never see a partial page; and the third waits for
> > each page to become available before we're allowed to read from it.
> >
> > Fixes: bd194b187115 ("shmem: Implement splice-read")
> > Reported-by: Chuck Lever <chuck.lever@oracle.com>
> > Signed-off-by: David Howells <dhowells@redhat.com>
> > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > cc: Hugh Dickins <hughd@google.com>
> > cc: Jens Axboe <axboe@kernel.dk>
> > cc: Matthew Wilcox <willy@infradead.org>
> > cc: linux-nfs@vger.kernel.org
> > cc: linux-fsdevel@vger.kernel.org
> > cc: linux-mm@kvack.org
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> > fs/nfsd/vfs.c | 9 ++++++---
> > 1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 59b7d60ae33e..ee3bbaa79478 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -956,10 +956,13 @@ nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
> > last_page = page + (offset + sd->len - 1) / PAGE_SIZE;
> > for (page += offset / PAGE_SIZE; page <= last_page; page++) {
> > /*
> > - * Skip page replacement when extending the contents
> > - * of the current page.
> > + * Skip page replacement when extending the contents of the
> > + * current page. But note that we may get two zero_pages in a
> > + * row from shmem.
> > */
> > - if (page == *(rqstp->rq_next_page - 1))
> > + if (page == *(rqstp->rq_next_page - 1) &&
> > + offset_in_page(rqstp->rq_res.page_base +
> > + rqstp->rq_res.page_len))
>
> This seems fragile in that it makes assumptions about the pages being
> sent and their alignment.
> Given that it was broken by the splice-read change, that confirms it is
> fragile. Maybe we could make the code a bit more explicit about what is
> expected.
Indeed, this code is brittle. This is not even the only time the
actor has been broken in the past four or five kernel releases.
IMO the problem is that there is no API contract or documentation
for splice actors. And as far as I am aware, only a few other
examples are in use to learn from.
> Also, I don't think this test can ever be relevant after the first time
> through the loop. So I think it would be clearest to have the
> interesting case outside the loop.
>
> page += offset / PAGE_SIZE;
> if (rqstp->rq_res.pages_len > 0) {
> /* appending to page list - check alignment */
> if (offset % PAGE_SIZE != (rqstp->rq_res.page_base +
> rqstp-.rq_res.page_len) % PAGE_SIZE)
> return -EIO;
> if (offset % PAGE_SIZE != 0) {
> /* continuing previous page */
> if (page != rqstp->rq_next_page[-1])
> return -EIO;
> page += 1;
> }
> } else
> /* Starting new page list */
> rqstp->rq_res.page_base = offset % PAGE_SIZE;
>
> for ( ; page <= last_page ; page++)
> if (unlikely(!svc_rqst_replace_page(rqstp, page)))
> return -EIO;
>
> rqstp->rq_res.page_len += sd->len;
> return sd->len;
>
>
> Also, the name "svc_rqst_replace_page" doesn't give any hint that the
> next_page pointer is advanced. Maybe svc_rqst_add_page() ??? Not great
> I admit.
All reasonable suggestions.
However, I'm getting ready to replace the splice read code with...
je ne ce pas.
- There are reports that splice read doesn't perform well
- It's a brittle piece of engineering, as observed
- The "zero copy" read path will need to support folios, hopefully
sooner rather than later
- We want the server's read path to use iomap when that is more
broadly available in local filesystems
- This fix is destined for 6.5-rc, which limits the amount of
clean up and optimization we should be doing
I'd like to apply David's fix as-is, unless it's truly broken or
someone has a better quick solution.
> > continue;
> > if (unlikely(!svc_rqst_replace_page(rqstp, page)))
> > return -EIO;
> >
> >
> >
>
--
Chuck Lever
next prev parent reply other threads:[~2023-07-30 15:30 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-28 12:32 Chuck Lever
2023-07-28 23:54 ` NeilBrown
2023-07-30 15:29 ` Chuck Lever [this message]
2023-07-30 16:50 ` Hugh Dickins
2023-07-30 16:55 ` Chuck Lever
2023-07-30 22:02 ` NeilBrown
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZMaB1BwBfNko1ZoE@tissot.1015granger.net \
--to=chuck.lever@oracle.com \
--cc=axboe@kernel.dk \
--cc=cel@kernel.org \
--cc=dhowells@redhat.com \
--cc=hughd@google.com \
--cc=jlayton@kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-nfs@vger.kernel.org \
--cc=neilb@suse.de \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox