From: Luis Chamberlain <mcgrof@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>,
Linux-MM <linux-mm@kvack.org>,
Swarna Prabhu <s.prabhu@samsung.com>,
Pankaj Raghav <p.raghav@samsung.com>,
Devasena Inupakutika <devasena.i@samsung.com>
Subject: Re: Optimizing small reads
Date: Thu, 2 Oct 2025 20:32:42 -0700 [thread overview]
Message-ID: <aN9D2i4qOroaktSc@bombadil.infradead.org> (raw)
In-Reply-To: <CAHk-=wj00-nGmXEkxY=-=Z_qP6kiGUziSFvxHJ9N-cLWry5zpA@mail.gmail.com>
On Thu, Oct 02, 2025 at 07:18:35PM -0700, Linus Torvalds wrote:
> Willy,
> I've literally been running this patch in my tree now for 18 months,
> and have never seen any issues. However back when I posted it
> originally, Luis reported that it caused failures on xfstest
>
> * generic/095
> * generic/741
>
> although I still have no idea how that could happen. I've looked at
> the patch occasionally over the months I've been carrying it, trying
> to figure out how it could possibly matter, and have never figured it
> out.
>
> I'm not planning on moving it to my mainline tree now either, but I
> decided I might as well at least repost it to see if somebody else has
> any interest or comments on it.
Indeed, we do. I had asked one of our new hires to verify the above two
test failures, as it seems like a good candidate for someone new to jump
in and analyze.
> The impetus for this patch is
> obviously from your posting back in early 2024 about some real user
> that did a lot of really small reads. It still sounds like a very odd
> load to me, but apparently there was a good real-life reason for it.
It's actually not odd. I've been trying to dig into "what workload could
this be", and there are a few answers. The top one on my radar is the read
amplification incurred by vector DBs with billion-scale, low-latency
requirements, due to re-ranking caused by PQ compression. Since PQ is
lossy, even if you only want the top k candidates you need to re-rank, and
k_candidates is much larger than k. The issue is that each candidate may
lie in a different disk LBA: the PQ-compressed vectors are kept in GPU
memory while their raw vectors live on SSDs.
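To make the read pattern concrete, here is a rough illustrative sketch of
the re-rank step (the dimension, candidate count and offset table are
made-up assumptions, not taken from any particular engine):

/*
 * Illustrative only: re-rank PQ candidates by fetching their raw
 * vectors from an SSD-resident file, one small pread() per candidate.
 */
#include <stdio.h>
#include <unistd.h>

#define DIM		128				/* assumed float32 dimension */
#define VEC_BYTES	(DIM * sizeof(float))		/* 512-byte raw vector */

static void rerank(int fd, const off_t *vec_off, int k_candidates)
{
	float vec[DIM];

	for (int i = 0; i < k_candidates; i++) {
		/*
		 * Each candidate typically lives in a different LBA, so
		 * the page cache sees a burst of tiny scattered reads.
		 */
		if (pread(fd, vec, VEC_BYTES, vec_off[i]) != (ssize_t)VEC_BYTES)
			perror("pread");
		/* ... exact distance against the query goes here ... */
	}
}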
I found a clever algorithm called FusionANNs which shows how you can
optimize placement so that vectors which are likely related to each other
end up near each other on disk [0], proving the software sucks and could
be improved. However I'm not aware of an open source version of this
algorithm, so we're looking to write one and test its impact.
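Not FusionANNs itself, but the placement idea we want to prototype is
roughly the following toy (the cluster assignment is an assumption here;
real placement would come from the index structure):

/*
 * Toy sketch of the placement idea only, not FusionANNs: order vectors
 * by an assumed cluster id so candidates that tend to be re-ranked
 * together become neighbours on disk.
 */
#include <stdlib.h>

struct vec_meta {
	unsigned int id;	/* original vector id */
	unsigned int cluster;	/* assumed cluster / IVF list */
};

static int by_cluster(const void *a, const void *b)
{
	const struct vec_meta *va = a, *vb = b;

	if (va->cluster != vb->cluster)
		return va->cluster < vb->cluster ? -1 : 1;
	return 0;
}

static void layout_by_cluster(struct vec_meta *v, size_t n)
{
	qsort(v, n, sizeof(*v), by_cluster);
	/*
	 * Raw vectors would then be appended to the data file in this
	 * order, so re-ranking one cluster becomes one larger,
	 * mostly-sequential read instead of many scattered ones.
	 */
}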
> I still think this patch actually looks quite nice - which surprised
> me when I wrote it originally. I started out writing it as a "let's
> see what this hacky thing results in", but it didn't turn out very
> hacky at all.
Yes, it's quite a beautiful hack.
> And it looks ridiculously good on some strange small-read benchmarks,
> although I say that purely from memory, since I've long since lost the
> code that tested this. Now it's been "tested" purely by virtue of
> basically being something I've been running on my own machine for a
> long time.
>
> Anyway, feel free to ignore it. I can keep carrying this patch in my
> local tree forever or until it actually causes more conflicts than I
> feel comfortable keeping around. But so far in the last 18+ months it
> has never caused any real pain (I have my own tree that contains a few
> random patches for other reasons anyway, although lately this has
> actually been the biggest of that little lot).
Trust me that this is being looked at :)
We had some other ideas too... but those are far out from anything ready.
However, since there was no *rush*, I figured it's a nice candidate effort
for getting someone new to hack on the kernel. So give us a bit of time
and we will surely justify its merit with a clear workload, and with a
thorough dig into why it's failing on the above tests.
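As a first, crude microbenchmark (before we have the vector-DB
reproducer), something along these lines is what we have in mind; the
file name, read size and iteration count are placeholders, sized so
every read stays within one page and the lockless path is eligible:

/* Crude sketch: tiny cached reads that never cross a page boundary. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[64];
	int fd = open("testfile", O_RDONLY);	/* assumed >= 4MiB, cached */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (long i = 0; i < 10000000; i++) {
		/* 64 bytes at a page-aligned offset: never page-crossing */
		off_t off = (i % 1024) * 4096;

		if (pread(fd, buf, sizeof(buf), off) < 0) {
			perror("pread");
			break;
		}
	}
	close(fd);
	return 0;
}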
Luis
>
> Linus
> From bf11d657e1f9a010ae0253feb27c2471b897d25d Mon Sep 17 00:00:00 2001
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Mon, 26 Feb 2024 15:18:44 -0800
> Subject: [PATCH] mm/filemap: do small reads in RCU-mode read without refcounts
>
> Hackety hack hack concept from report by Willy.
>
> Mommy, I'm scared.
>
> Link: https://lore.kernel.org/all/Zduto30LUEqIHg4h@casper.infradead.org/
> Not-yet-signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
> mm/filemap.c | 137 +++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 127 insertions(+), 10 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a52dd38d2b4a..f15bc1108585 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2677,6 +2677,96 @@ static void filemap_end_dropbehind_read(struct folio *folio)
> }
> }
>
> +/*
> + * I can't be bothered to care about HIGHMEM for the fast read case
> + */
> +#ifdef CONFIG_HIGHMEM
> +#define filemap_fast_read(mapping, pos, buffer, size) 0
> +#else
> +
> +/*
> + * Called under RCU with size limited to the file size and one page
> + */
> +static inline unsigned long filemap_folio_copy_rcu(struct address_space *mapping, loff_t pos, char *buffer, size_t size)
> +{
> + struct inode *inode;
> + loff_t file_size;
> + XA_STATE(xas, &mapping->i_pages, pos >> PAGE_SHIFT);
> + struct folio *folio;
> + size_t offset;
> +
> + /* Limit it to the file size */
> + inode = mapping->host;
> + file_size = i_size_read(inode);
> + if (unlikely(pos >= file_size))
> + return 0;
> + if (size > file_size - pos)
> + size = file_size - pos;
> +
> + xas_reset(&xas);
> + folio = xas_load(&xas);
> + if (xas_retry(&xas, folio))
> + return 0;
> +
> + if (!folio || xa_is_value(folio))
> + return 0;
> +
> + if (!folio_test_uptodate(folio))
> + return 0;
> +
> + /* No fast-case if we are supposed to start readahead */
> + if (folio_test_readahead(folio))
> + return 0;
> + /* .. or mark it accessed */
> + if (!folio_test_referenced(folio))
> + return 0;
> +
> + /* Do the data copy */
> + offset = pos & (folio_size(folio) - 1);
> + memcpy(buffer, folio_address(folio) + offset, size);
> +
> + /*
> + * After we've copied the data from the folio,
> + * do some final sanity checks.
> + */
> + smp_rmb();
> +
> + if (unlikely(folio != xas_reload(&xas)))
> + return 0;
> +
> + /*
> + * This is just a heuristic: somebody could still truncate and then
> + * write to extend it to the same size..
> + */
> + if (file_size != inode->i_size)
> + return 0;
> +
> + return size;
> +}
> +
> +/*
> + * Iff we can complete the read completely in one atomic go under RCU,
> + * do so here. Otherwise return 0 (no partial reads, please - this is
> + * purely for the trivial fast case).
> + */
> +static inline unsigned long filemap_fast_read(struct address_space *mapping, loff_t pos, char *buffer, size_t size)
> +{
> + unsigned long pgoff;
> +
> + /* Don't even try for page-crossers */
> + pgoff = pos & ~PAGE_MASK;
> + if (pgoff + size > PAGE_SIZE)
> + return 0;
> +
> + /* Let's see if we can just do the read under RCU */
> + rcu_read_lock();
> + size = filemap_folio_copy_rcu(mapping, pos, buffer, size);
> + rcu_read_unlock();
> +
> + return size;
> +}
> +#endif /* !HIGHMEM */
> +
> /**
> * filemap_read - Read data from the page cache.
> * @iocb: The iocb to read.
> @@ -2697,7 +2787,10 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
> struct file_ra_state *ra = &filp->f_ra;
> struct address_space *mapping = filp->f_mapping;
> struct inode *inode = mapping->host;
> - struct folio_batch fbatch;
> + union {
> + struct folio_batch fbatch;
> + __DECLARE_FLEX_ARRAY(char, buffer);
> + } area __uninitialized;
> int i, error = 0;
> bool writably_mapped;
> loff_t isize, end_offset;
> @@ -2711,7 +2804,31 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
> return 0;
>
> iov_iter_truncate(iter, inode->i_sb->s_maxbytes - iocb->ki_pos);
> - folio_batch_init(&fbatch);
> +
> + /*
> + * Try a quick lockless read into the 'area' union. Note that
> + * this union is intentionally marked "__uninitialized", because
> + * any compiler initialization would be pointless since this
> + * can fill it with garbage.
> + */
> + if (iov_iter_count(iter) <= sizeof(area)) {
> + unsigned long count = iov_iter_count(iter);
> +
> + count = filemap_fast_read(mapping, iocb->ki_pos, area.buffer, count);
> + if (count) {
> + size_t copied = copy_to_iter(area.buffer, count, iter);
> + if (unlikely(!copied))
> + return already_read ? already_read : -EFAULT;
> + ra->prev_pos = iocb->ki_pos += copied;
> + file_accessed(filp);
> + return copied + already_read;
> + }
> + }
> +
> + /*
> + * This actually properly initializes the fbatch for the slow case
> + */
> + folio_batch_init(&area.fbatch);
>
> do {
> cond_resched();
> @@ -2727,7 +2844,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
> if (unlikely(iocb->ki_pos >= i_size_read(inode)))
> break;
>
> - error = filemap_get_pages(iocb, iter->count, &fbatch, false);
> + error = filemap_get_pages(iocb, iter->count, &area.fbatch, false);
> if (error < 0)
> break;
>
> @@ -2755,11 +2872,11 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
> * mark it as accessed the first time.
> */
> if (!pos_same_folio(iocb->ki_pos, last_pos - 1,
> - fbatch.folios[0]))
> - folio_mark_accessed(fbatch.folios[0]);
> + area.fbatch.folios[0]))
> + folio_mark_accessed(area.fbatch.folios[0]);
>
> - for (i = 0; i < folio_batch_count(&fbatch); i++) {
> - struct folio *folio = fbatch.folios[i];
> + for (i = 0; i < folio_batch_count(&area.fbatch); i++) {
> + struct folio *folio = area.fbatch.folios[i];
> size_t fsize = folio_size(folio);
> size_t offset = iocb->ki_pos & (fsize - 1);
> size_t bytes = min_t(loff_t, end_offset - iocb->ki_pos,
> @@ -2790,13 +2907,13 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
> }
> }
> put_folios:
> - for (i = 0; i < folio_batch_count(&fbatch); i++) {
> - struct folio *folio = fbatch.folios[i];
> + for (i = 0; i < folio_batch_count(&area.fbatch); i++) {
> + struct folio *folio = area.fbatch.folios[i];
>
> filemap_end_dropbehind_read(folio);
> folio_put(folio);
> }
> - folio_batch_init(&fbatch);
> + folio_batch_init(&area.fbatch);
> } while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
>
> file_accessed(filp);
> --
> 2.51.0.419.gf70362ddf4
>