From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Jan Kara <jack@suse.cz>
Cc: Kalesh Singh <kaleshsingh@google.com>,
	lsf-pc@lists.linux-foundation.org,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	David Hildenbrand <david@redhat.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Juan Yescas <jyescas@google.com>,
	android-mm <android-mm@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>, Michal Hocko <mhocko@suse.com>
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior
Date: Mon, 24 Feb 2025 16:52:24 +0000	[thread overview]
Message-ID: <82fbe53b-98c4-4e55-9eeb-5a013596c4c6@lucifer.local> (raw)
In-Reply-To: <ivnv2crd3et76p2nx7oszuqhzzah756oecn5yuykzqfkqzoygw@yvnlkhjjssoz>

On Mon, Feb 24, 2025 at 05:31:16PM +0100, Jan Kara wrote:
> On Mon 24-02-25 14:21:37, Lorenzo Stoakes wrote:
> > On Mon, Feb 24, 2025 at 03:14:04PM +0100, Jan Kara wrote:
> > > Hello!
> > >
> > > On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote:
> > > > Problem Statement
> > > > ===============
> > > >
> > > > Readahead can result in unnecessary page cache pollution for mapped
> > > > regions that are never accessed. Current mechanisms to disable
> > > > readahead lack granularity and rather operate at the file or VMA
> > > > level. This proposal seeks to initiate discussion at LSFMM to explore
> > > > potential solutions for optimizing page cache/readahead behavior.
> > > >
> > > >
> > > > Background
> > > > =========
> > > >
> > > > The read-ahead heuristics on file-backed memory mappings can
> > > > inadvertently populate the page cache with pages corresponding to
> > > > regions that user-space processes are known never to access, e.g. ELF
> > > > LOAD segment padding regions. While these pages are ultimately
> > > > reclaimable, their presence precipitates unnecessary I/O operations,
> > > > particularly when a substantial quantity of such regions exists.
> > > >
> > > > Although the underlying file can be made sparse in these regions to
> > > > mitigate I/O, readahead will still allocate discrete zero pages when
> > > > populating the page cache within these ranges. These pages, while
> > > > subject to reclaim, introduce additional churn to the LRU. This
> > > > reclaim overhead is further exacerbated in filesystems that support
> > > > "fault-around" semantics, that can populate the surrounding pages’
> > > > PTEs if found present in the page cache.
> > > >
> > > > While the memory impact may be negligible for large files containing a
> > > > limited number of sparse regions, it becomes appreciable for many
> > > > small mappings characterized by numerous holes. This scenario can
> > > > arise from efforts to minimize vm_area_struct slab memory footprint.
> > >
> > > OK, I agree the behavior you describe exists. But do you have some
> > > real-world numbers showing its extent? I'm not looking for some artificial
> > > numbers - sure bad cases can be constructed - but how big a practical
> > > problem is this? If you can show that the average Android phone has 10% of
> > > these useless pages in memory then that's one thing and we should be
> > > looking for some general solution. If it is more like 0.1%, then why bother?
> > >
> > > > Limitations of Existing Mechanisms
> > > > ===========================
> > > >
> > > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > > entire file, rather than specific sub-regions. The offset and length
> > > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > > POSIX_FADV_DONTNEED [2] cases.
> > > >
> > > > madvise(..., MADV_RANDOM, ...): Similarly, this applies to the entire
> > > > VMA, rather than specific sub-regions. [3]
> > > >
> > > > Guard Regions: While guard regions for file-backed VMAs circumvent
> > > > fault-around concerns, the fundamental issue of unnecessary page cache
> > > > population persists. [4]
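(For reference, both of the interfaces above operate at whole-file or
whole-VMA granularity - a minimal illustrative snippet, with hypothetical
fd/mapping arguments and error handling trimmed:)

#include <fcntl.h>
#include <sys/mman.h>

/* fd is an open file descriptor; map/map_len describe an existing mapping
 * of that file (both hypothetical here). */
void disable_readahead(int fd, void *map, size_t map_len)
{
	/* Hints random access for the WHOLE file - readahead is dropped
	 * everywhere, not just in the padding/hole sub-ranges. */
	posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

	/* Likewise applies to the WHOLE VMA; calling it on a sub-range
	 * would split the VMA, which is exactly what we want to avoid. */
	madvise(map, map_len, MADV_RANDOM);
}

Note also that applying MADV_RANDOM to only a sub-range would split the VMA,
which runs straight into the vm_area_struct footprint concern mentioned in the
background above.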
> > >
> > > Somewhere else in the thread you complain about readahead extending past
> > > the VMA. That's relatively easy to avoid at least for readahead triggered
> > > from filemap_fault() (i.e., do_async_mmap_readahead() and
> > > do_sync_mmap_readahead()). I agree we could do that and that seems like a
> > > relatively uncontroversial change. Note that if someone accesses the file
> > > through a standard read(2) or write(2) syscall or through a different
> > > memory mapping, the limits won't apply, but such combinations of access
> > > are not that common anyway.
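(As a rough illustration of the clamping described above - a hypothetical
helper, not a tested patch; field names follow struct vm_area_struct, and
do_sync_mmap_readahead()/do_async_mmap_readahead() would be the natural
callers:)

#include <linux/mm.h>

/*
 * Hypothetical helper: trim a proposed readahead window of @nr_pages
 * starting at page-cache index @index so it does not extend past the
 * faulting VMA.
 */
static unsigned long ra_clamp_to_vma(struct vm_area_struct *vma,
				     pgoff_t index, unsigned long nr_pages)
{
	/* First page-cache index past the end of this VMA. */
	pgoff_t vma_end = vma->vm_pgoff +
			  ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT);

	if (index >= vma_end)
		return 0;
	if (nr_pages > vma_end - index)
		nr_pages = vma_end - index;
	return nr_pages;
}

Only the window computation changes; as noted, a read(2)/write(2) or another
mapping of the same file would still read ahead past the boundary.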
> >
> > Hm, I'm not so sure - map ELF files with different mprotect() calls, or
> > mprotect() different portions of a file, and suddenly you lose all the
> > readahead for the rest even though you're reading sequentially?
>
> Well, you wouldn't lose all readahead for the rest. Readahead just won't
> preread data underlying the next VMA, so yes, you get a cache miss and have
> to wait for a page to get loaded into the cache when transitioning to the
> next VMA, but once you get there, you'll have readahead running at full
> speed again.

I'm aware of how readahead works (I _believe_ there's currently a
pre-release of a book with a very extensive section on readahead written by
somebody :P).

I've also been looking at it for file-backed guard regions recently, which is
why I've been commenting here specifically - it's been on my mind lately -
and Kalesh's interest in this also stems from a guard region 'scenario'
(hence my cc).

Anyway, perhaps I didn't phrase this well - my concern is whether this might
impact performance in real-world scenarios, such as one where a region is
mapped then mprotect()'d or mmap()'d in parts, resulting in _separate VMAs_ of
the same file being read in sequential order.

From Kalesh's LPC talk, unless I misinterpreted what he said, this is
precisely what he's doing? I mean we'd not be talking here about mmap()
behaviour with readahead otherwise.

Granted, perhaps you'd only _ever_ be reading sequentially within a
specific VMA's boundaries, rather than going from one to another (excluding
PROT_NONE guards obviously) and that's very possible, if that's what you
mean.

But otherwise, surely this is a thing? And might we therefore be imposing
unnecessary cache misses?

Which is why I suggest...

>
> So yes, sequential read of a memory mapping of a file fragmented into many
> VMAs will be somewhat slower. My impression is such use is rare (sequential
> readers tend to use read(2) rather than mmap) but I could be wrong.
>
> > What about shared libraries with r/o parts and exec parts?
> >
> > I think we'd really need to do some pretty careful checking to ensure this
> > wouldn't break some real world use cases esp. if we really do mostly
> > readahead data from page cache.
>
> So I'm not sure whether you are conflating two things here, because the above
> sentence doesn't make sense to me :). Readahead is the mechanism that
> brings data from the underlying filesystem into the page cache. Fault-around
> is the mechanism that maps into the page tables pages already present in the
> page cache even though they were not necessarily requested by the page fault.
> By "do mostly readahead data from page cache" are you speaking about
> fault-around? That currently does not cross VMA boundaries anyway, as far as
> I'm reading do_fault_around()...

...that we test this and see how it behaves :) Which is literally all I
am saying in the above. Ideally with representative workloads.

I mean, I think this shouldn't be a controversial point, right? Perhaps
again I didn't communicate this well, but this is all I mean here.

BTW, I understand the difference between readahead and fault-around - you can
run git blame on do_fault_around() if you have doubts about that ;)

And yes, fault-around is constrained to the VMA (and actually avoids
crossing PTE boundaries).
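
(A simplified model of those two constraints, for illustration only - this is
not the kernel's actual do_fault_around() code:)

#include <linux/mm.h>

/*
 * The window mapped around a fault is clipped to the faulting VMA and to
 * the PTE table (one PMD's worth of address space) containing the faulting
 * address, so it never crosses either boundary.
 */
static void clamp_fault_around_window(struct vm_area_struct *vma,
				      unsigned long addr,
				      unsigned long *start, unsigned long *end)
{
	unsigned long pte_table_start = addr & PMD_MASK;
	unsigned long pte_table_end = pte_table_start + PMD_SIZE;

	/* Stay inside the VMA... */
	if (*start < vma->vm_start)
		*start = vma->vm_start;
	if (*end > vma->vm_end)
		*end = vma->vm_end;

	/* ...and inside the PTE table covering @addr. */
	if (*start < pte_table_start)
		*start = pte_table_start;
	if (*end > pte_table_end)
		*end = pte_table_end;
}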

>
> > > Regarding controlling readahead for various portions of the file - I'm
> > > skeptical. In my opinion it would require too much bookkeeping on the kernel
> > > side for such a niche use case (but maybe your numbers will show it isn't
> > > as niche as I think :)). I can imagine you could just completely
> > > turn off kernel readahead for the file and do your special readahead from
> > > userspace - I think you could use either userfaultfd for triggering it or
> > > the new fanotify FAN_PREACCESS events.
> >
> > I'm opposed to anything that'll proliferate VMAs (and from what Kalesh
> > says, he is too!). I don't really see how we could avoid having to do that
> > for this kind of case, but I may be missing something...
>
> I don't see why we would need to be increasing the number of VMAs here at all.
> With FAN_PREACCESS you get a notification with the file & offset when it's
> accessed, and you can issue readahead(2) calls based on that however you like.
> Similarly, you can ask for userfaults for the whole mapped range and handle
> those. Now, thinking more about this, this approach has the downside that
> you cannot implement async readahead with it (once a PTE is mapped to some
> page, it won't trigger notifications either with FAN_PREACCESS or with
> UFFD). But with UFFD you could at least trigger readahead on minor faults.
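
(A minimal userspace sketch of that approach - the notification plumbing via
fanotify pre-access events or userfaultfd is deliberately elided, since its
exact shape is the open question, and range_is_wanted() plus the window size
are hypothetical:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical application policy: a real implementation would consult its
 * own map of padding/hole regions; here everything is "wanted".
 */
static int range_is_wanted(off_t offset, size_t len)
{
	(void)offset;
	(void)len;
	return 1;
}

/* Called when userspace learns that (fd, offset) is about to be accessed. */
static void handle_access_notification(int fd, off_t offset)
{
	const size_t window = 256 * 1024;	/* arbitrary window size */

	if (!range_is_wanted(offset, window))
		return;		/* leave holes/padding uncached */

	if (readahead(fd, offset, window) != 0)
		perror("readahead");
}

int main(int argc, char **argv)
{
	int fd;

	if (argc < 2)
		return 1;

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Pretend we were just notified of an access at offset 0. */
	handle_access_notification(fd, 0);
	close(fd);
	return 0;
}

The async-readahead limitation mentioned above still applies: once a PTE is
populated, no further notification arrives to drive the next window.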

Yeah, we're talking past each other on this - sorry, I missed your point about
fanotify there!

uffd is probably not reasonably workable given the overhead, I would have
thought.

I am really not familiar with how fanotify works, so if you can find a
solution this way - cool, awesome :)

I'm just saying that if we need to somehow retain state about regions which
should have adjusted readahead behaviour at a VMA level, I can't see how
this could be done without VMA fragmentation, and I'd rather we avoided that.

If we can avoid it - great!

>
> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

