From: Linus Torvalds <torvalds@linux-foundation.org>
To: Kiryl Shutsemau <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>,
Luis Chamberlain <mcgrof@kernel.org>,
Linux-MM <linux-mm@kvack.org>,
linux-fsdevel@vger.kernel.org
Subject: Re: Optimizing small reads
Date: Mon, 13 Oct 2025 09:19:47 -0700
Message-ID: <CAHk-=wgrZL7pLPW9GjUagoGOoOeDAVnyGJCn+6J5x-9+Dtbx-A@mail.gmail.com>
In-Reply-To: <dz7pcqi5ytmb35r6kojuetdipjp7xdjlnyzcu5qb6d4cdo6vq5@3b62gfzcxszo>
On Mon, 13 Oct 2025 at 08:39, Kiryl Shutsemau <kirill@shutemov.name> wrote:
>
> And, for archiving purposes, here is the last version of the patch that
> supports large blocks.
>
> Do you think it makes sense to submit unsafe_copy_to_user() optimization
> as a standalone thingy?
Without a user, I'm not convinced. I think right now there are zero
cases of unsafe_copy_to_user() that are large enough for this to
matter. It looks like we have exactly two cases: the readdir() case I
knew about (because I did that one), and one other user in put_cmsg(),
which was introduced to networking with the commit message
"Calling two copy_to_user() for very small regions has very high overhead",
so that one is very small too.
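(For reference, both of those call sites are the usual
user_access_begin()/user_access_end() pattern. Very roughly, and with
made-up names, a user looks like this:

        if (!user_access_begin(udst, len))
                return -EFAULT;
        unsafe_copy_to_user(udst, ksrc, len, efault);
        user_access_end();
        return 0;
efault:
        user_access_end();
        return -EFAULT;

and in both cases 'len' is small.)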
HOWEVER.
What I like in this patch is how you split up filemap_read() itself
further, and I think that part might be worth it, except I think you
did the "noinline" parts wrong:
> +static bool noinline filemap_read_fast(struct kiocb *iocb, struct iov_iter *iter,
...
> +static ssize_t filemap_read_slow(struct kiocb *iocb, struct iov_iter *iter,
...
> +ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
> + ssize_t already_read)
> +{
> + struct inode *inode = iocb->ki_filp->f_mapping->host;
> + union {
> + struct folio_batch fbatch;
> + __DECLARE_FLEX_ARRAY(char, buffer);
> + //char __buffer[4096];
> + } area __uninitialized;
> +
> + if (unlikely(iocb->ki_pos < 0))
> + return -EINVAL;
> + if (unlikely(iocb->ki_pos >= inode->i_sb->s_maxbytes))
> + return 0;
> + if (unlikely(!iov_iter_count(iter)))
> + return 0;
> +
> + iov_iter_truncate(iter, inode->i_sb->s_maxbytes - iocb->ki_pos);
> +
> + if (filemap_read_fast(iocb, iter, area.buffer, sizeof(area), &already_read))
> + return already_read;
> +
> + return filemap_read_slow(iocb, iter, &area.fbatch, already_read);
> +}
> EXPORT_SYMBOL_GPL(filemap_read);
Look, the reason you did this was presumably for stack usage when you
have a big 4k allocation, but the part you *really* needed to protect
was filemap_read_slow() that then has much deeper call chains.
So filemap_read_slow() should *not* ever see the big 4k stack slot
that it never needs or uses, and that eats into our fairly small
kernel stack.
BUT with this organization, what you could have done (see the sketch below) is:
- don't share the buffers between filemap_read_slow/filemap_read_fast
at all, so the 'union' trick with a flex array etc would go away
- both filemap_read_slow and filemap_read_fast would be 'noinline' so
that they don't share a stack frame
- filemap_read_fast could have the on-stack byte buffer - I think 4k
is still excessive for on-stack, but it could certainly be larger than
256 bytes
- filemap_read_slow would have just the 'struct folio_batch' thing.
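IOW, structurally something like this (just a rough sketch: the bodies
are elided, the buffer arguments are gone, and the 1k size is picked
out of thin air):

static noinline bool filemap_read_fast(struct kiocb *iocb,
                struct iov_iter *iter, ssize_t *already_read)
{
        /* Private to this frame; the slow path never pays for it */
        char buffer[1024];

        /* ... try the small-read fast case through 'buffer' ... */
        return false;
}

static noinline ssize_t filemap_read_slow(struct kiocb *iocb,
                struct iov_iter *iter, ssize_t already_read)
{
        /* Private to this frame; no byte buffer stacked on top of it */
        struct folio_batch fbatch;

        /* ... the existing batched read loop using 'fbatch' ... */
        return already_read;
}

ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
                ssize_t already_read)
{
        /* ... the ki_pos checks and iov_iter_truncate() as before ... */

        if (filemap_read_fast(iocb, iter, &already_read))
                return already_read;
        return filemap_read_slow(iocb, iter, already_read);
}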
Anyway, I think *that* part of this patch might actually be worth it
independently of anything else.
Of course, that re-organization does add an extra level of function
calls just to avoid the excessive stack usage, and CPU errata can make
'ret' instructions expensive, but whatever.
So I don't know. I like how it splits up those two cases more clearly
and makes the baseline 'filemap_read()' much simpler and very
straightforward, and I like how it also splits up the stack usage and
would make that union trick unnecessary.
But it's not a big deal either way.
Linus