From: Jan Kara <jack@suse.cz>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Jan Kara <jack@suse.cz>,
Andrew Morton <akpm@linux-foundation.org>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>,
David Hildenbrand <david@redhat.com>,
Dave Chinner <david@fromorbit.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
Kalesh Singh <kaleshsingh@google.com>, Zi Yan <ziy@nvidia.com>,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [RFC PATCH v4 2/5] mm/readahead: Terminate async readahead on natural boundary
Date: Tue, 6 May 2025 13:29:13 +0200
Message-ID: <mq7vno6v7mrrquya4kogseej4fasfyq574ersgdxdhateho7md@bvmy6y4ccgyz>
In-Reply-To: <cbfad787-fcb8-43ce-8fd9-e9495116534d@arm.com>
On Tue 06-05-25 10:28:11, Ryan Roberts wrote:
> On 05/05/2025 10:37, Jan Kara wrote:
> > On Mon 05-05-25 11:13:26, Jan Kara wrote:
> >> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
> >>> Previously, asynchronous readahead would read ra_pages (usually 128K)
> >>> directly after the end of the synchronous readahead, and given that the
> >>> synchronous readahead portion had no alignment guarantees (beyond page
> >>> boundaries), it is possible (and likely) that the end of the initial 128K
> >>> region would not fall on a natural boundary for the folio size being
> >>> used. Therefore smaller folios were used to align down to the required
> >>> boundary, both at the end of the previous readahead block and at the
> >>> start of the new one.
> >>>
> >>> In the worst cases, this can result in never properly ramping up the
> >>> folio size, and instead getting stuck oscillating between order-0, -1
> >>> and -2 folios. The next readahead will try to use folios whose order is
> >>> +2 bigger than the folio that had the readahead marker. But because of
> >>> the alignment requirements, that folio (the first one in the readahead
> >>> block) can end up being order-0 in some cases.
> >>>
> >>> There will be 2 modifications to solve this issue:
> >>>
> >>> 1) Calculate the readahead size so the end is aligned to a folio
> >>> boundary. This prevents needing to allocate small folios to align
> >>> down at the end of the window and fixes the oscillation problem.
> >>>
> >>> 2) Remember the "preferred folio order" in the ra state instead of
> >>> inferring it from the folio with the readahead marker. This solves
> >>> the slow ramp up problem (discussed in a subsequent patch).
> >>>
> >>> This patch addresses (1) only. A subsequent patch will address (2).
> >>>
> >>> Worked example:
> >>>
> >>> The following shows the previous pathological behaviour when the initial
> >>> synchronous readahead is unaligned. We start reading at page 17 in the
> >>> file and read sequentially from there. I'm showing a dump of the pages
> >>> in the page cache just after we read the first page of the folio with
> >>> the readahead marker.
> >
> > <snip>
> >
> >> Looks good. When I was reading this code some time ago, I also felt we
> >> should rather do some rounding instead of creating small folios so thanks
> >> for working on this. Feel free to add:
> >>
> >> Reviewed-by: Jan Kara <jack@suse.cz>
> >
> > But now I've also remembered why what you do here isn't an obvious win.
> > There are storage devices (mostly RAID arrays) where the optimum read size
> > isn't a power of 2. Think for example of a RAID-0 device composed of three
> > disks. It will have max_pages something like 384 (512k * 3). Suppose we are
> > on x86 and max_order is 9. Then previously (if we were lucky with
> > alignment) we were alternating between order 7 and order 8 pages in the
> > page cache and did optimally sized IOs of 1536k.
>
> Sorry I'm struggling to follow some of this, perhaps my superficial
> understanding of all the readahead subtleties is starting to show...
>
> How is the 384 figure provided? I'd guess that comes from bdi->io_pages, and
> bdi->ra_pages would remain the usual 32 (128K)?
Sorry, I was probably too brief in my previous message :)
bdi->ra_pages is actually set based on the optimal IO size reported by the
hardware (see blk_apply_bdi_limits() and how its callers fill in
lim->io_opt). The 128K you mention is just a last-resort default used when
the hardware doesn't report one. And some storage devices do report an
optimal IO size that is not a power of two.
Also note that bdi->ra_pages can be tuned in sysfs and a lot of users
actually do this (usually from their udev rules). We don't have to perform
well when some odd value gets set, but you definitely cannot assume
bdi->ra_pages is 128K :).
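To make the non-power-of-two case concrete, here is a small userspace sketch of the RAID-0 arithmetic from the example above (the 512K chunk size, three-disk layout, and 4K page size are taken from this thread; the kernel's actual derivation lives in blk_apply_bdi_limits()):

```python
PAGE_SIZE = 4096  # 4K pages, as on x86

# RAID-0 across three disks with a 512K chunk: the optimal read
# covers one full stripe, i.e. one chunk per member disk.
chunk = 512 * 1024
ndisks = 3
io_opt = chunk * ndisks           # 1536K
max_pages = io_opt // PAGE_SIZE   # the bdi->ra_pages analogue

print(max_pages)                          # 384
print(max_pages & (max_pages - 1) == 0)   # False: not a power of two
```

So a perfectly reasonable device ends up with a readahead window of 384 pages, which no single power-of-two folio order can tile exactly.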
> In which case, for mmap, won't
> we continue to be limited by ra_pages and will never get beyond order-5? (for
> mmap req_size is always set to ra_pages IIRC, so ractl_max_pages() always just
> returns ra_pages). Or perhaps ra_pages is set to 384 somewhere, but I'm not
> spotting it in the code...
>
> I guess you are also implicitly teaching me something about how the block layer
> works here too... if there are 2 read requests for an order-7 and order-8, then
> the block layer will merge those into a single read (up to the 384-page
> optimal size?)
Correct. In fact the readahead code already performs this merging when
submitting the IO.
> but if there are 2 reads of order-8 then it won't merge because it would be
> bigger than the optimal size and it won't split the second one at the optimal
> size either? Have I inferred that correctly?
With the code as you modify it, you would round ra->size down from 384 to
256 and submit only one 1MB-sized IO (with one order-8 page). And this will
cause a regression in read throughput for such devices because they no
longer get buffers large enough to run at full speed.
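The shrinkage is easy to see with a toy model of the kernel's round_down() (a sketch: start, order, and the 384-page window follow the example in this thread):

```python
def round_down(x, align):
    # mirrors the kernel's round_down(); align must be a power of two
    return x & ~(align - 1)

max_pages = 384          # the 1536K window from the RAID-0 example
order = 8                # max folio order being used on x86
start = 0
size = max_pages
end = start + size

# Aligning the window end to the folio order alone shrinks the IO:
aligned_end = round_down(end, 1 << order)
print(aligned_end - start)   # 256 pages: a 1MB IO instead of 1536K
```

A full third of the optimal IO size is lost on every readahead round.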
> > Now you will allocate all
> > folios of order 8 (nice) but reads will be just 1024k and you'll see
> > noticeable drop in read throughput (not nice). Note that this is not just a
> > theoretical example but a real case we have hit when doing performance
> > testing of servers and for which I was tweaking readahead code in the past.
> >
> > So I think we need to tweak this logic a bit. Perhaps we should round_down
> > end to the minimum alignment dictated by 'order' and maxpages? Like:
> >
> > 1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)
>
> Sorry I'm staring at this and struggling to understand the "PAGE_SHIFT -
> 1" part?
My bad. It should have been:
1 << min(order, ffs(max_pages) - 1)
> I think what you are suggesting is that the patch becomes something like
> this:
>
> ---8<---
> + end = ra->start + ra->size;
> + aligned_end = round_down(end, 1UL << min(order, ilog2(max_pages)));
Not quite. ilog2() returns the position of the most significant set bit,
but we really want to align to the least significant set bit. So when
max_pages is 384, we want to align to at most order-7 (aligning the end
more than that makes no sense when you want to do IOs 384 pages large).
That's why I'm using ffs() and not ilog2().
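The difference is easiest to see on 384 itself; here is a quick userspace model of the two C helpers (Python has neither, so both are reimplemented by hand):

```python
def ilog2(x):
    # index of the most significant set bit, like the kernel's ilog2()
    return x.bit_length() - 1

def ffs(x):
    # 1-based index of the least significant set bit, like C's ffs()
    return (x & -x).bit_length()

max_pages = 384  # 0b1_1000_0000

print(ilog2(max_pages))     # 8: would permit order-8 (1MB) alignment
print(ffs(max_pages) - 1)   # 7: cap alignment at order-7 (512K)
```

With ffs(), the alignment granularity never exceeds the largest power of two that evenly divides max_pages.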
> + if (aligned_end > ra->start)
> + ra->size -= end - aligned_end;
> + ra->async_size = ra->size;
> ---8<---
>
> So if max_pages=384, then aligned_end will be aligned down to a maximum
> of the previous 1MB boundary?
No, it needs to be aligned only to the previous 512K boundary because we
want to do IOs 3*512K large.
Hope things are a bit clearer now :)
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR