linux-mm.kvack.org archive mirror
* [LSF/MM/BPF TOPIC] Predictive readahead of dentries
@ 2025-01-14  3:38 Shyam Prasad N
  2025-01-14 12:39 ` [Lsf-pc] " Jan Kara
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Shyam Prasad N @ 2025-01-14  3:38 UTC (permalink / raw)
  To: lsf-pc, linux-fsdevel, linux-mm, brauner, Matthew Wilcox,
	David Howells, Jeff Layton, Steve French, trondmy
  Cc: Shyam Prasad N

The Linux kernel does buffered reads and writes using the page cache
layer, where the filesystem reads and writes are offloaded to the
VM/MM layer. The VM layer does a predictive readahead of data by
optionally asking the filesystem to read more data asynchronously than
what was requested.

The VFS layer maintains a dentry cache which gets populated during
access of dentries (either during readdir/getdents or during lookup).
The dentries within a directory actually form the address space for
the directory, which is read sequentially during getdents. For network
filesystems, the dentries are also looked up during revalidate.

During sequential getdents, it makes sense to perform a readahead
similar to file reads. Even for revalidations and dentry lookups,
there can be some heuristics that can be maintained to know if the
lookups within the directory are sequential in nature. With this, the
dentry cache can be pre-populated for a directory, even before the
dentries are accessed, thereby boosting the performance. This could
give even more benefits for network filesystems by avoiding costly
round trips to the server.

The NFS client already does a simplistic form of this readahead by
maintaining an address space for the directory inode and storing the
dentry records returned by the server in this space. However, this
dentry access mechanism is so generic that I feel it could be a
part of the VFS/VM layer, similar to buffered reads of a file. Also,
the VFS layer is better equipped to store heuristics about dentry access
patterns.
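
To make the analogy with file readahead concrete, here is a minimal
userspace sketch, not kernel code. It assumes a directory can be modelled
as a list of entries indexed by position; the class name DirReadahead, the
doubling policy, and the max_window cap are all hypothetical choices made
for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class DirReadahead:
    """Toy model: grow a readahead window while directory reads stay sequential."""
    entries: list                 # stand-in for the directory contents on the backend
    max_window: int = 32          # hypothetical cap, not a kernel constant
    window: int = 0               # entries to prefetch beyond the current request
    next_expected: int = 0        # position a sequential reader would ask for next
    cache: dict = field(default_factory=dict)
    backend_reads: int = 0        # number of trips to the backend

    def read(self, pos, count):
        if pos == self.next_expected:
            # Sequential access: double the window (starting at `count`), up to the cap.
            self.window = min(max(self.window * 2, count), self.max_window)
        else:
            # Non-sequential access: stop prefetching.
            self.window = 0
        end = min(pos + count, len(self.entries))
        if any(i not in self.cache for i in range(pos, end)):
            # Cache miss: one backend trip fetches the request plus the window.
            self.backend_reads += 1
            fetch_end = min(end + self.window, len(self.entries))
            for i in range(pos, fetch_end):
                self.cache[i] = self.entries[i]
        self.next_expected = end
        return [self.cache[i] for i in range(pos, end)]
```

Reading 100 entries four at a time would take 25 backend trips without
any readahead; with this model most of those reads are served from the
cache, while a single read at a random position fetches only what was
asked for.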

-- 
Regards,
Shyam



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Predictive readahead of dentries
  2025-01-14  3:38 [LSF/MM/BPF TOPIC] Predictive readahead of dentries Shyam Prasad N
@ 2025-01-14 12:39 ` Jan Kara
  2025-01-15  9:52   ` Shyam Prasad N
  2025-01-14 13:24 ` Amir Goldstein
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2025-01-14 12:39 UTC (permalink / raw)
  To: Shyam Prasad N
  Cc: lsf-pc, linux-fsdevel, linux-mm, brauner, Matthew Wilcox,
	David Howells, Jeff Layton, Steve French, trondmy,
	Shyam Prasad N

Hello!

On Tue 14-01-25 09:08:38, Shyam Prasad N wrote:
> The Linux kernel does buffered reads and writes using the page cache
> layer, where the filesystem reads and writes are offloaded to the
> VM/MM layer. The VM layer does a predictive readahead of data by
> optionally asking the filesystem to read more data asynchronously than
> what was requested.
> 
> The VFS layer maintains a dentry cache which gets populated during
> access of dentries (either during readdir/getdents or during lookup).
> This dentries within a directory actually forms the address space for
> the directory, which is read sequentially during getdents. For network
> filesystems, the dentries are also looked up during revalidate.
> 
> During sequential getdents, it makes sense to perform a readahead
> similar to file reads. Even for revalidations and dentry lookups,
> there can be some heuristics that can be maintained to know if the
> lookups within the directory are sequential in nature. With this, the
> dentry cache can be pre-populated for a directory, even before the
> dentries are accessed, thereby boosting the performance. This could
> give even more benefits for network filesystems by avoiding costly
> round trips to the server.
> 
> NFS client already does a simplistic form of this readahead by
> maintaining an address space for the directory inode and storing the
> dentry records returned by the server in this space. However, this
> dentry access mechanism is so generic that I feel that this can be a
> part of the VFS/VM layer, similar to buffered reads of a file. Also,
> VFS layer is better equipped to store heuristics about dentry access
> patterns.

Interesting idea. Note that individual filesystems actually do directory
readahead on their own. They just don't readahead 'struct dentry' but
rather issue readahead for metadata blocks to get into cache which is what
takes most time. Readahead makes the most sense for readdir() (or
getdents() as you call it) calls where the filesystem driver has all the
information it needs (unlike VFS) for performing efficient readahead. So
here I'm not sure there's much need for a change.

I'm not against some form of readahead for ->lookup calls but we'd have to
very carefully design the heuristics for detecting some kind of pattern of
->lookup calls so that we know which entry is going to be the next one
looked up and evaluate whether it is actually an overall win or not. So
for this the discussion would need a more concrete proposal to be useful I
think.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [LSF/MM/BPF TOPIC] Predictive readahead of dentries
  2025-01-14  3:38 [LSF/MM/BPF TOPIC] Predictive readahead of dentries Shyam Prasad N
  2025-01-14 12:39 ` [Lsf-pc] " Jan Kara
@ 2025-01-14 13:24 ` Amir Goldstein
  2025-01-14 14:12   ` Benjamin Coddington
                     ` (2 more replies)
  2025-01-14 15:59 ` James Bottomley
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 14+ messages in thread
From: Amir Goldstein @ 2025-01-14 13:24 UTC (permalink / raw)
  To: Shyam Prasad N
  Cc: lsf-pc, linux-fsdevel, linux-mm, brauner, Matthew Wilcox,
	David Howells, Jeff Layton, Steve French, trondmy,
	Shyam Prasad N

On Tue, Jan 14, 2025 at 4:38 AM Shyam Prasad N <nspmangalore@gmail.com> wrote:
>
> The Linux kernel does buffered reads and writes using the page cache
> layer, where the filesystem reads and writes are offloaded to the
> VM/MM layer. The VM layer does a predictive readahead of data by
> optionally asking the filesystem to read more data asynchronously than
> what was requested.
>
> The VFS layer maintains a dentry cache which gets populated during
> access of dentries (either during readdir/getdents or during lookup).
> This dentries within a directory actually forms the address space for
> the directory, which is read sequentially during getdents. For network
> filesystems, the dentries are also looked up during revalidate.
>
> During sequential getdents, it makes sense to perform a readahead
> similar to file reads. Even for revalidations and dentry lookups,
> there can be some heuristics that can be maintained to know if the
> lookups within the directory are sequential in nature. With this, the
> dentry cache can be pre-populated for a directory, even before the
> dentries are accessed, thereby boosting the performance. This could
> give even more benefits for network filesystems by avoiding costly
> round trips to the server.
>

I believe you are referring to READDIRPLUS, which is quite common
for network protocols and also supported by FUSE.

Unlike network protocols, FUSE decides by server configuration and
heuristics whether to "fuse_use_readdirplus" - specifically, in readdirplus_auto
mode, FUSE starts with readdirplus, but if nothing calls lookup on the
directory inode by the time of the next getdents call, it stops using readdirplus.
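
As a rough illustration of the readdirplus_auto behaviour described above,
here is a toy state machine. It models only this description and is not
taken from the FUSE source; the class and method names are hypothetical,
and re-enabling readdirplus when a lookup shows up again is an assumption
of the sketch.

```python
class ReaddirplusAuto:
    """Toy model of the description above, not the FUSE implementation."""

    def __init__(self):
        self.first_call = True     # the first getdents always uses readdirplus
        self.lookup_seen = False   # was any entry looked up since the last getdents?

    def on_lookup(self):
        # Something looked up an entry of this directory.
        self.lookup_seen = True

    def on_getdents(self):
        """Return which request this getdents call would send."""
        use_plus = self.first_call or self.lookup_seen
        self.first_call = False
        self.lookup_seen = False   # the hint is consumed by this call
        return "READDIRPLUS" if use_plus else "READDIR"
```

A plain "ls" pattern (getdents, getdents, ...) drops to READDIR on the
second call, while an "ls -l" pattern (getdents, lookups, getdents, ...)
keeps getting READDIRPLUS.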

I personally ran into this problem: I would like the application, which
knows whether it is doing "ls" or "ls -l", to control whether a specific
getdents() uses FUSE readdirplus, because in situations where "ls -l"
is not needed that can avoid a lot of unneeded IO.

I do not know if implementing readdirplus (i.e. populating inode and dentry)
makes sense for disk filesystems, but if we do it at the VFS level, there has to
be an API to control or at least opt out of readdirplus, like with readahead.
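
For readers less familiar with the two patterns, here is a small userspace
sketch of what "ls" and "ls -l" amount to. The helper names are made up
for this example. The first helper needs only names; the second also
stats every entry, which is the case where readdirplus pays off.

```python
import os
import tempfile


def list_names(path):
    """Like plain `ls`: only the entry names are needed."""
    return sorted(os.listdir(path))


def list_with_sizes(path):
    """Like `ls -l`: every entry is also stat()ed after the directory read."""
    with os.scandir(path) as it:
        return sorted((e.name, e.stat(follow_symlinks=False).st_size) for e in it)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for name, data in (("a", b"x"), ("b", b"yy")):
            with open(os.path.join(d, name), "wb") as f:
                f.write(data)
        print(list_names(d))
        print(list_with_sizes(d))
```

From the kernel side both start with the same getdents() call; the
difference only shows up later, when the stat calls do or do not arrive.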

Thanks,
Amir.



* Re: [LSF/MM/BPF TOPIC] Predictive readahead of dentries
  2025-01-14 13:24 ` Amir Goldstein
@ 2025-01-14 14:12   ` Benjamin Coddington
  2025-01-14 15:01     ` Paulo Alcantara
  2025-01-15 11:27   ` Shyam Prasad N
  2025-01-20 21:26   ` Benjamin Coddington
  2 siblings, 1 reply; 14+ messages in thread
From: Benjamin Coddington @ 2025-01-14 14:12 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Shyam Prasad N, lsf-pc, linux-fsdevel, linux-mm, brauner,
	Matthew Wilcox, David Howells, Jeff Layton, Steve French,
	trondmy, Shyam Prasad N

On 14 Jan 2025, at 8:24, Amir Goldstein wrote:

> On Tue, Jan 14, 2025 at 4:38 AM Shyam Prasad N <nspmangalore@gmail.com> wrote:
>>
>> The Linux kernel does buffered reads and writes using the page cache
>> layer, where the filesystem reads and writes are offloaded to the
>> VM/MM layer. The VM layer does a predictive readahead of data by
>> optionally asking the filesystem to read more data asynchronously than
>> what was requested.
>>
>> The VFS layer maintains a dentry cache which gets populated during
>> access of dentries (either during readdir/getdents or during lookup).
>> This dentries within a directory actually forms the address space for
>> the directory, which is read sequentially during getdents. For network
>> filesystems, the dentries are also looked up during revalidate.
>>
>> During sequential getdents, it makes sense to perform a readahead
>> similar to file reads. Even for revalidations and dentry lookups,
>> there can be some heuristics that can be maintained to know if the
>> lookups within the directory are sequential in nature. With this, the
>> dentry cache can be pre-populated for a directory, even before the
>> dentries are accessed, thereby boosting the performance. This could
>> give even more benefits for network filesystems by avoiding costly
>> round trips to the server.
>>
>
> I believe you are referring to READDIRPLUS, which is quite common
> for network protocols and also supported by FUSE.
>
> Unlike network protocols, FUSE decides by server configuration and
> heuristics whether to "fuse_use_readdirplus" - specifically in readdirplus_auto
> mode, FUSE starts with readdirplus, but if nothing calls lookup on the
> directory inode by the time the next getdents call, it stops with readdirplus.
>
> I personally ran into the problem that I would like to control from the
> application, which knows if it is doing "ls" or "ls -l" whether a specific
> getdents() will use FUSE readdirplus or not, because in some situations
> where "ls -l" is not needed that can avoid a lot of unneeded IO.

Indeed, we often have folks wanting dramatically different behavior from
getdents() in NFS, and every time we've tried to improve our heuristics
someone else shouts "regression"!

We can tune the NFS heuristic per-mount, but it often makes the wrong
choice. As you say, letting the application make the call would be ideal.
POSIX_FADV_ ?

Ben




* Re: [LSF/MM/BPF TOPIC] Predictive readahead of dentries
  2025-01-14 14:12   ` Benjamin Coddington
@ 2025-01-14 15:01     ` Paulo Alcantara
  2025-01-15 14:30       ` Shyam Prasad N
  0 siblings, 1 reply; 14+ messages in thread
From: Paulo Alcantara @ 2025-01-14 15:01 UTC (permalink / raw)
  To: Benjamin Coddington, Amir Goldstein
  Cc: Shyam Prasad N, lsf-pc, linux-fsdevel, linux-mm, brauner,
	Matthew Wilcox, David Howells, Jeff Layton, Steve French,
	trondmy, Shyam Prasad N

Benjamin Coddington <bcodding@redhat.com> writes:

> On 14 Jan 2025, at 8:24, Amir Goldstein wrote:
>
>> On Tue, Jan 14, 2025 at 4:38 AM Shyam Prasad N <nspmangalore@gmail.com> wrote:
>>>
>>> The Linux kernel does buffered reads and writes using the page cache
>>> layer, where the filesystem reads and writes are offloaded to the
>>> VM/MM layer. The VM layer does a predictive readahead of data by
>>> optionally asking the filesystem to read more data asynchronously than
>>> what was requested.
>>>
>>> The VFS layer maintains a dentry cache which gets populated during
>>> access of dentries (either during readdir/getdents or during lookup).
>>> This dentries within a directory actually forms the address space for
>>> the directory, which is read sequentially during getdents. For network
>>> filesystems, the dentries are also looked up during revalidate.
>>>
>>> During sequential getdents, it makes sense to perform a readahead
>>> similar to file reads. Even for revalidations and dentry lookups,
>>> there can be some heuristics that can be maintained to know if the
>>> lookups within the directory are sequential in nature. With this, the
>>> dentry cache can be pre-populated for a directory, even before the
>>> dentries are accessed, thereby boosting the performance. This could
>>> give even more benefits for network filesystems by avoiding costly
>>> round trips to the server.
>>>
>>
>> I believe you are referring to READDIRPLUS, which is quite common
>> for network protocols and also supported by FUSE.
>>
>> Unlike network protocols, FUSE decides by server configuration and
>> heuristics whether to "fuse_use_readdirplus" - specifically in readdirplus_auto
>> mode, FUSE starts with readdirplus, but if nothing calls lookup on the
>> directory inode by the time the next getdents call, it stops with readdirplus.
>>
>> I personally ran into the problem that I would like to control from the
>> application, which knows if it is doing "ls" or "ls -l" whether a specific
>> getdents() will use FUSE readdirplus or not, because in some situations
>> where "ls -l" is not needed that can avoid a lot of unneeded IO.
>
> Indeed, we often have folks wanting dramatically different behavior from
> getdents() in NFS, and every time we've tried to improve our heuristics
> someone else shouts "regression"!

In CIFS, we already preload the dcache with the result of
SMB2_QUERY_DIRECTORY, and I believe NFS does the same thing.

Shyam, what's the problem with the current approach?



* Re: [LSF/MM/BPF TOPIC] Predictive readahead of dentries
  2025-01-14  3:38 [LSF/MM/BPF TOPIC] Predictive readahead of dentries Shyam Prasad N
  2025-01-14 12:39 ` [Lsf-pc] " Jan Kara
  2025-01-14 13:24 ` Amir Goldstein
@ 2025-01-14 15:59 ` James Bottomley
  2025-01-16  4:50 ` Al Viro
  2025-01-16  5:31 ` Christoph Hellwig
  4 siblings, 0 replies; 14+ messages in thread
From: James Bottomley @ 2025-01-14 15:59 UTC (permalink / raw)
  To: Shyam Prasad N, lsf-pc, linux-fsdevel, linux-mm, brauner,
	Matthew Wilcox, David Howells, Jeff Layton, Steve French,
	trondmy
  Cc: Shyam Prasad N

On Tue, 2025-01-14 at 09:08 +0530, Shyam Prasad N wrote:
> The Linux kernel does buffered reads and writes using the page cache
> layer, where the filesystem reads and writes are offloaded to the
> VM/MM layer. The VM layer does a predictive readahead of data by
> optionally asking the filesystem to read more data asynchronously
> than what was requested.
> 
> The VFS layer maintains a dentry cache which gets populated during
> access of dentries (either during readdir/getdents or during lookup).
> This dentries within a directory actually forms the address space for
> the directory, which is read sequentially during getdents. For
> network filesystems, the dentries are also looked up during
> revalidate.
> 
> During sequential getdents, it makes sense to perform a readahead
> similar to file reads. Even for revalidations and dentry lookups,
> there can be some heuristics that can be maintained to know if the
> lookups within the directory are sequential in nature. With this, the
> dentry cache can be pre-populated for a directory, even before the
> dentries are accessed, thereby boosting the performance. This could
> give even more benefits for network filesystems by avoiding costly
> round trips to the server.

If your theory were correct, especially the bit about using the dentry
cache to retain the readahead information, wouldn't a precursor
actually be populating the dentry cache on iterate_dir() which is the
engine for both the readdir() and getdents() syscalls?  It strikes me
the reason we don't do dentry population here is partly because the
lookup() on each name would slow everything down (iterate_dir is very
locking light weight because it needs to be fast) and partly because
whatever is doing the directory read may only be interested in a single
name.  The only userspace operation you can guarantee is going to do a
lookup() for every name is ls -l, but that doesn't seem to be a good
one to optimize for.

Regards,

James




* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Predictive readahead of dentries
  2025-01-14 12:39 ` [Lsf-pc] " Jan Kara
@ 2025-01-15  9:52   ` Shyam Prasad N
  0 siblings, 0 replies; 14+ messages in thread
From: Shyam Prasad N @ 2025-01-15  9:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: lsf-pc, linux-fsdevel, linux-mm, brauner, Matthew Wilcox,
	David Howells, Jeff Layton, Steve French, trondmy,
	Shyam Prasad N

Hi Jan,
Thanks for the review.

On Tue, Jan 14, 2025 at 6:09 PM Jan Kara <jack@suse.cz> wrote:
>
> Hello!
>
> On Tue 14-01-25 09:08:38, Shyam Prasad N wrote:
> > The Linux kernel does buffered reads and writes using the page cache
> > layer, where the filesystem reads and writes are offloaded to the
> > VM/MM layer. The VM layer does a predictive readahead of data by
> > optionally asking the filesystem to read more data asynchronously than
> > what was requested.
> >
> > The VFS layer maintains a dentry cache which gets populated during
> > access of dentries (either during readdir/getdents or during lookup).
> > This dentries within a directory actually forms the address space for
> > the directory, which is read sequentially during getdents. For network
> > filesystems, the dentries are also looked up during revalidate.
> >
> > During sequential getdents, it makes sense to perform a readahead
> > similar to file reads. Even for revalidations and dentry lookups,
> > there can be some heuristics that can be maintained to know if the
> > lookups within the directory are sequential in nature. With this, the
> > dentry cache can be pre-populated for a directory, even before the
> > dentries are accessed, thereby boosting the performance. This could
> > give even more benefits for network filesystems by avoiding costly
> > round trips to the server.
> >
> > NFS client already does a simplistic form of this readahead by
> > maintaining an address space for the directory inode and storing the
> > dentry records returned by the server in this space. However, this
> > dentry access mechanism is so generic that I feel that this can be a
> > part of the VFS/VM layer, similar to buffered reads of a file. Also,
> > VFS layer is better equipped to store heuristics about dentry access
> > patterns.
>
> Interesting idea. Note that individual filesystems actually do directory
> readahead on their own. They just don't readahead 'struct dentry' but
> rather issue readahead for metadata blocks to get into cache which is what
> takes most time. Readahead makes the most sense for readdir() (or
> getdents() as you call it) calls where the filesystem driver has all the
> information it needs (unlike VFS) for performing efficient readahead. So
> here I'm not sure there's much need for a change.

I agree that the filesystem driver can do this.
But the logic for "advising" how many dentries to readahead may be
something that depends on the workload rather than the filesystem
itself.
Most of the practical use cases would readdir the entire directory.
But there could be use cases where a partial directory could be read
too.

>
> I'm not against some form of readahead for ->lookup calls but we'd have to
> very carefully design the heuristics for detecting some kind of pattern of
> ->lookup calls so that we know which entry is going to be the next one
> looked up and evaluate whether it is actually an overall win or not. So
> for this the discussion would need a more concrete proposal to be useful I
> think.

Acked.
Simplistically, the whole directory could be read once the number of
dentry revalidations or lookups that missed the cache but were then
successfully loaded from the backend exceeds a certain threshold (I can
see how this threshold could be filesystem-specific). There could be
other, more sophisticated implementations.

Let me think this through (and read the other comments) and
see if I can refine it further.

>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

-- 
Regards,
Shyam



* Re: [LSF/MM/BPF TOPIC] Predictive readahead of dentries
  2025-01-14 13:24 ` Amir Goldstein
  2025-01-14 14:12   ` Benjamin Coddington
@ 2025-01-15 11:27   ` Shyam Prasad N
  2025-01-15 14:21     ` Amir Goldstein
  2025-01-20 21:26   ` Benjamin Coddington
  2 siblings, 1 reply; 14+ messages in thread
From: Shyam Prasad N @ 2025-01-15 11:27 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: lsf-pc, linux-fsdevel, linux-mm, brauner, Matthew Wilcox,
	David Howells, Jeff Layton, Steve French, trondmy,
	Shyam Prasad N

On Tue, Jan 14, 2025 at 6:55 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, Jan 14, 2025 at 4:38 AM Shyam Prasad N <nspmangalore@gmail.com> wrote:
> >
> > The Linux kernel does buffered reads and writes using the page cache
> > layer, where the filesystem reads and writes are offloaded to the
> > VM/MM layer. The VM layer does a predictive readahead of data by
> > optionally asking the filesystem to read more data asynchronously than
> > what was requested.
> >
> > The VFS layer maintains a dentry cache which gets populated during
> > access of dentries (either during readdir/getdents or during lookup).
> > This dentries within a directory actually forms the address space for
> > the directory, which is read sequentially during getdents. For network
> > filesystems, the dentries are also looked up during revalidate.
> >
> > During sequential getdents, it makes sense to perform a readahead
> > similar to file reads. Even for revalidations and dentry lookups,
> > there can be some heuristics that can be maintained to know if the
> > lookups within the directory are sequential in nature. With this, the
> > dentry cache can be pre-populated for a directory, even before the
> > dentries are accessed, thereby boosting the performance. This could
> > give even more benefits for network filesystems by avoiding costly
> > round trips to the server.
> >
>
> I believe you are referring to READDIRPLUS, which is quite common
> for network protocols and also supported by FUSE.
This discussion is not entirely about readdirplus, but readdirplus is
definitely a part of it.
I'm suggesting doing the next set of readdir() calls in advance, so
that the data needed to serve them is already in the cache.
I'm also suggesting artificially doing a readdir to avoid sequential
revalidation of each dentry, or a readdirplus to avoid a stat of each
inode corresponding to these dentries.
>
> Unlike network protocols, FUSE decides by server configuration and
> heuristics whether to "fuse_use_readdirplus" - specifically in readdirplus_auto
> mode, FUSE starts with readdirplus, but if nothing calls lookup on the
> directory inode by the time the next getdents call, it stops with readdirplus.
>
> I personally ran into the problem that I would like to control from the
> application, which knows if it is doing "ls" or "ls -l" whether a specific
> getdents() will use FUSE readdirplus or not, because in some situations
> where "ls -l" is not needed that can avoid a lot of unneeded IO.
>
> I do not know if implementing readdirplus (i.e. populate inode and dentry)
> makes sense for disk filesystems, but if we do it in VFS level, there has to
> be at an API to control or at least opt-out of readdirplus, like with readahead.
That would be a great knob to have for network filesystems. We have to
rely on heuristics today to predict which of these patterns the
workload is using.

>
> Thanks,
> Amir.


-- 
Regards,
Shyam



* Re: [LSF/MM/BPF TOPIC] Predictive readahead of dentries
  2025-01-15 11:27   ` Shyam Prasad N
@ 2025-01-15 14:21     ` Amir Goldstein
  0 siblings, 0 replies; 14+ messages in thread
From: Amir Goldstein @ 2025-01-15 14:21 UTC (permalink / raw)
  To: Shyam Prasad N
  Cc: lsf-pc, linux-fsdevel, linux-mm, brauner, Matthew Wilcox,
	David Howells, Jeff Layton, Steve French, trondmy,
	Shyam Prasad N

On Wed, Jan 15, 2025 at 12:27 PM Shyam Prasad N <nspmangalore@gmail.com> wrote:
>
> On Tue, Jan 14, 2025 at 6:55 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Tue, Jan 14, 2025 at 4:38 AM Shyam Prasad N <nspmangalore@gmail.com> wrote:
> > >
> > > The Linux kernel does buffered reads and writes using the page cache
> > > layer, where the filesystem reads and writes are offloaded to the
> > > VM/MM layer. The VM layer does a predictive readahead of data by
> > > optionally asking the filesystem to read more data asynchronously than
> > > what was requested.
> > >
> > > The VFS layer maintains a dentry cache which gets populated during
> > > access of dentries (either during readdir/getdents or during lookup).
> > > This dentries within a directory actually forms the address space for
> > > the directory, which is read sequentially during getdents. For network
> > > filesystems, the dentries are also looked up during revalidate.
> > >
> > > During sequential getdents, it makes sense to perform a readahead
> > > similar to file reads. Even for revalidations and dentry lookups,
> > > there can be some heuristics that can be maintained to know if the
> > > lookups within the directory are sequential in nature. With this, the
> > > dentry cache can be pre-populated for a directory, even before the
> > > dentries are accessed, thereby boosting the performance. This could
> > > give even more benefits for network filesystems by avoiding costly
> > > round trips to the server.
> > >
> >
> > I believe you are referring to READDIRPLUS, which is quite common
> > for network protocols and also supported by FUSE.
> This discussion is not completely about readdirplus, but definitely is
> a part of it.
> I'm suggesting doing the next set of readdir() calls in advance, so
> that the data needed to serve those are already in the cache.
> I'm also suggesting artificially doing a readdir to avoid sequential
> revalidation of each dentry; or a readdirplus to avoid stat of each
> inode corresponding to these dentries

Well, if readdirplus is implemented, then "readaheadplus" could be
implemented by async io_uring readdirplus commands. Right?
The io_uring command would have to know how to chain the following
readdirplus commands with the offset returned from the previous
readdirplus response, but that should be doable, I think?

> >
> > Unlike network protocols, FUSE decides by server configuration and
> > heuristics whether to "fuse_use_readdirplus" - specifically in readdirplus_auto
> > mode, FUSE starts with readdirplus, but if nothing calls lookup on the
> > directory inode by the time the next getdents call, it stops with readdirplus.
> >
> > I personally ran into the problem that I would like to control from the
> > application, which knows if it is doing "ls" or "ls -l" whether a specific
> > getdents() will use FUSE readdirplus or not, because in some situations
> > where "ls -l" is not needed that can avoid a lot of unneeded IO.
> >
> > I do not know if implementing readdirplus (i.e. populate inode and dentry)
> > makes sense for disk filesystems, but if we do it in VFS level, there has to
> > be at an API to control or at least opt-out of readdirplus, like with readahead.
> That would be a great knob to have for network filesystems. We have to
> rely on heuristics today to predict which of these patterns the
> workload is using.
>

It seems like the demand has existed for a long time.
The man page for posix_fadvise(2) says:
"Programs can use posix_fadvise() to announce an intention to access file data
 in a specific pattern in the future, thus allowing the kernel to perform
 appropriate optimizations."

I do not read this as limited to non-directory files, and indeed fadvise() can
be called on directories, but others could argue that this is an API abuse.

Mind sending a patch for POSIX_FADV_{NO,}READDIRPLUS?
Make sure it fails with -ENOTDIR on non-dir and be ready to face the
inevitable bikeshedding ;)
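
For reference, this is roughly what calling fadvise on a directory looks
like from userspace today. POSIX_FADV_READDIRPLUS does not exist yet, so
the sketch uses the existing POSIX_FADV_SEQUENTIAL advice only. The helper
name is made up, and whether the call succeeds on a directory may depend
on the filesystem, so the errno is returned rather than assumed.

```python
import os
import tempfile


def advise_sequential(dir_path):
    """Call posix_fadvise() on a directory fd; return 0, or the errno on failure."""
    fd = os.open(dir_path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # Offset 0 and length 0 mean "the whole file" for posix_fadvise().
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        return 0
    except OSError as e:
        return e.errno
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(advise_sequential(d))
```

A new advice value would presumably be passed the same way, on an fd
opened with O_DIRECTORY.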

Thanks,
Amir.



* Re: [LSF/MM/BPF TOPIC] Predictive readahead of dentries
  2025-01-14 15:01     ` Paulo Alcantara
@ 2025-01-15 14:30       ` Shyam Prasad N
  2025-01-15 14:47         ` Paulo Alcantara
  0 siblings, 1 reply; 14+ messages in thread
From: Shyam Prasad N @ 2025-01-15 14:30 UTC (permalink / raw)
  To: Paulo Alcantara
  Cc: Benjamin Coddington, Amir Goldstein, lsf-pc, linux-fsdevel,
	linux-mm, brauner, Matthew Wilcox, David Howells, Jeff Layton,
	Steve French, trondmy, Shyam Prasad N

Hi Paulo,

On Tue, Jan 14, 2025 at 8:31 PM Paulo Alcantara <pc@manguebit.com> wrote:
>
> Benjamin Coddington <bcodding@redhat.com> writes:
>
> > On 14 Jan 2025, at 8:24, Amir Goldstein wrote:
> >
> >> On Tue, Jan 14, 2025 at 4:38 AM Shyam Prasad N <nspmangalore@gmail.com> wrote:
> >>>
> >>> The Linux kernel does buffered reads and writes using the page cache
> >>> layer, where the filesystem reads and writes are offloaded to the
> >>> VM/MM layer. The VM layer does a predictive readahead of data by
> >>> optionally asking the filesystem to read more data asynchronously than
> >>> what was requested.
> >>>
> >>> The VFS layer maintains a dentry cache which gets populated during
> >>> access of dentries (either during readdir/getdents or during lookup).
> >>> This dentries within a directory actually forms the address space for
> >>> the directory, which is read sequentially during getdents. For network
> >>> filesystems, the dentries are also looked up during revalidate.
> >>>
> >>> During sequential getdents, it makes sense to perform a readahead
> >>> similar to file reads. Even for revalidations and dentry lookups,
> >>> there can be some heuristics that can be maintained to know if the
> >>> lookups within the directory are sequential in nature. With this, the
> >>> dentry cache can be pre-populated for a directory, even before the
> >>> dentries are accessed, thereby boosting the performance. This could
> >>> give even more benefits for network filesystems by avoiding costly
> >>> round trips to the server.
> >>>
> >>
> >> I believe you are referring to READDIRPLUS, which is quite common
> >> for network protocols and also supported by FUSE.
> >>
> >> Unlike network protocols, FUSE decides by server configuration and
> >> heuristics whether to "fuse_use_readdirplus" - specifically in readdirplus_auto
> >> mode, FUSE starts with readdirplus, but if nothing calls lookup on the
> >> directory inode by the time the next getdents call, it stops with readdirplus.
> >>
> >> I personally ran into the problem that I would like to control from the
> >> application, which knows if it is doing "ls" or "ls -l" whether a specific
> >> getdents() will use FUSE readdirplus or not, because in some situations
> >> where "ls -l" is not needed that can avoid a lot of unneeded IO.
> >
> > Indeed, we often have folks wanting dramatically different behavior from
> > getdents() in NFS, and every time we've tried to improve our heuristics
> > someone else shouts "regression"!
>
> In CIFS, we already preload the dcache with the result of
> SMB2_QUERY_DIRECTORY, and I believe NFS does the same thing.
>
> Shyam, what's the problem with the current approach?

We do load the dentry cache with the results of QueryDirectory. What I'm
proposing here is a readahead that is issued even before the application
does its next readdir, i.e. the data needed to emit dentries is already
in the cache before readdir is called. That should speed up overall
directory reads.
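
A rough userspace analogy of that idea, sketched in Python rather than as
kernel code: a worker thread keeps reading directory entries into a bounded
buffer while the consumer is still busy with earlier ones. The names
readahead_listdir and depth are hypothetical, made up for illustration;
this is not an existing VFS interface and not a proposed implementation.

```python
import os
import queue
import threading

_DONE = object()  # sentinel marking the end of the directory stream


def readahead_listdir(path, depth=64):
    """Yield entry names from `path` while a worker thread reads ahead.

    `depth` bounds how many entries may sit in the buffer, loosely
    analogous to a readahead window (hypothetical parameter).
    """
    buf = queue.Queue(maxsize=depth)

    def producer():
        try:
            with os.scandir(path) as it:
                for entry in it:
                    buf.put(entry.name)  # blocks once the window is full
            buf.put(_DONE)
        except OSError as exc:
            buf.put(exc)  # hand the error over to the consumer

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()
        if item is _DONE:
            return
        if isinstance(item, OSError):
            raise item
        yield item
```

The consumer only blocks when the buffer is empty, which is the property
the proposal is after; how large the window should be, and when to start
filling it, is exactly the heuristic under discussion.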

-- 
Regards,
Shyam



* Re: [LSF/MM/BPF TOPIC] Predictive readahead of dentries
  2025-01-15 14:30       ` Shyam Prasad N
@ 2025-01-15 14:47         ` Paulo Alcantara
  0 siblings, 0 replies; 14+ messages in thread
From: Paulo Alcantara @ 2025-01-15 14:47 UTC (permalink / raw)
  To: Shyam Prasad N
  Cc: Benjamin Coddington, Amir Goldstein, lsf-pc, linux-fsdevel,
	linux-mm, brauner, Matthew Wilcox, David Howells, Jeff Layton,
	Steve French, trondmy, Shyam Prasad N

Shyam Prasad N <nspmangalore@gmail.com> writes:

> We load the dentry cache with results of QueryDirectory. But what I'm
> proposing here is a read ahead, even before the next readdir is done
> by the application. i.e. the idea is that the data necessary to emit
> dentries is already in the cache before it is even called. That should
> speed up the overall directory reads.

Thanks for the explanation.

We'd need to be careful: in CIFS, doing these readdirs in advance could
trigger several automounts (DFS links), which hurts especially on slow
connections and when failover happens while mounting them.



* Re: [LSF/MM/BPF TOPIC] Predictive readahead of dentries
  2025-01-14  3:38 [LSF/MM/BPF TOPIC] Predictive readahead of dentries Shyam Prasad N
                   ` (2 preceding siblings ...)
  2025-01-14 15:59 ` James Bottomley
@ 2025-01-16  4:50 ` Al Viro
  2025-01-16  5:31 ` Christoph Hellwig
  4 siblings, 0 replies; 14+ messages in thread
From: Al Viro @ 2025-01-16  4:50 UTC (permalink / raw)
  To: Shyam Prasad N
  Cc: lsf-pc, linux-fsdevel, linux-mm, brauner, Matthew Wilcox,
	David Howells, Jeff Layton, Steve French, trondmy,
	Shyam Prasad N

On Tue, Jan 14, 2025 at 09:08:38AM +0530, Shyam Prasad N wrote:

> The VFS layer maintains a dentry cache which gets populated during
> access of dentries (either during readdir/getdents or during lookup).
> The dentries within a directory actually form the address space for
> the directory, which is read sequentially during getdents. For network
> filesystems, the dentries are also looked up during revalidate.
> 
> During sequential getdents, it makes sense to perform a readahead
> similar to file reads. Even for revalidations and dentry lookups,
> there can be some heuristics that can be maintained to know if the
> lookups within the directory are sequential in nature. With this, the
> dentry cache can be pre-populated for a directory, even before the
> dentries are accessed, thereby boosting the performance. This could
> give even more benefits for network filesystems by avoiding costly
> round trips to the server.
> 
> NFS client already does a simplistic form of this readahead by
> maintaining an address space for the directory inode and storing the
> dentry records returned by the server in this space. However, this
> dentry access mechanism is so generic that I feel that this can be a
> part of the VFS/VM layer, similar to buffered reads of a file. Also,
> VFS layer is better equipped to store heuristics about dentry access
> patterns.

You do realize that for local filesystems it'll actually hurt anything
that does *not* stat() or open() everything it runs across, right?

Directories do not contain inode metadata; on lookup you do want
that - for a given object.  So you need to get the on-disk inode read,
so that the in-core inode can be set up.  Adding that on readdir for
every directory entry you run across can be thoroughly unpleasant.

It should be up to the filesystem.  It's not just the access pattern.
Imagine the joy of doing that on e.g. NFSv2; would you agree that
"I'd have to send a bleeding GETATTR for every entry in READDIR
response" is an important detail when deciding whether we want
to do dcache prepopulation?

Ideas regarding better infrastructure that filesystems could use would
be interesting, but the decision whether to use that or not in any
given case belongs in the filesystem itself, *not* in upper layers.
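
The cost difference is visible even from userspace. A minimal sketch in
Python (the helper names names_only and names_and_sizes are hypothetical):
listing names needs only the directory itself, while anything resembling
"ls -l" has to stat each entry, which is the per-object inode read
described above.

```python
import os


def names_only(path):
    """What a plain `ls` needs: entry names straight from the directory."""
    with os.scandir(path) as it:
        return sorted(entry.name for entry in it)


def names_and_sizes(path):
    """What `ls -l` needs: one stat per entry, so each inode must be read."""
    with os.scandir(path) as it:
        return sorted(
            (entry.name, entry.stat(follow_symlinks=False).st_size)
            for entry in it
        )
```

A caller that only ever uses names_only() gains nothing from having every
inode pulled in behind its back, and on a large directory pays for it.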



* Re: [LSF/MM/BPF TOPIC] Predictive readahead of dentries
  2025-01-14  3:38 [LSF/MM/BPF TOPIC] Predictive readahead of dentries Shyam Prasad N
                   ` (3 preceding siblings ...)
  2025-01-16  4:50 ` Al Viro
@ 2025-01-16  5:31 ` Christoph Hellwig
  4 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-01-16  5:31 UTC (permalink / raw)
  To: Shyam Prasad N
  Cc: lsf-pc, linux-fsdevel, linux-mm, brauner, Matthew Wilcox,
	David Howells, Jeff Layton, Steve French, trondmy,
	Shyam Prasad N

Why don't you implement a prototype and post the results?

Weirdly enough, every year people come out of the woods with more or
(usually) less interesting, totally handwavy ideas just before LSF/MM and
spam the lists with their philosophy.

Put your effort where your mouth is and give it a try and if it's useful
send patches.



* Re: [LSF/MM/BPF TOPIC] Predictive readahead of dentries
  2025-01-14 13:24 ` Amir Goldstein
  2025-01-14 14:12   ` Benjamin Coddington
  2025-01-15 11:27   ` Shyam Prasad N
@ 2025-01-20 21:26   ` Benjamin Coddington
  2 siblings, 0 replies; 14+ messages in thread
From: Benjamin Coddington @ 2025-01-20 21:26 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Shyam Prasad N, lsf-pc, linux-fsdevel, linux-mm, brauner,
	Matthew Wilcox, David Howells, Jeff Layton, Steve French,
	trondmy, Shyam Prasad N

On 14 Jan 2025, at 8:24, Amir Goldstein wrote:

> On Tue, Jan 14, 2025 at 4:38 AM Shyam Prasad N <nspmangalore@gmail.com> wrote:
>>
>> The Linux kernel does buffered reads and writes using the page cache
>> layer, where the filesystem reads and writes are offloaded to the
>> VM/MM layer. The VM layer does a predictive readahead of data by
>> optionally asking the filesystem to read more data asynchronously than
>> what was requested.
>>
>> The VFS layer maintains a dentry cache which gets populated during
>> access of dentries (either during readdir/getdents or during lookup).
>> The dentries within a directory actually form the address space for
>> the directory, which is read sequentially during getdents. For network
>> filesystems, the dentries are also looked up during revalidate.
>>
>> During sequential getdents, it makes sense to perform a readahead
>> similar to file reads. Even for revalidations and dentry lookups,
>> there can be some heuristics that can be maintained to know if the
>> lookups within the directory are sequential in nature. With this, the
>> dentry cache can be pre-populated for a directory, even before the
>> dentries are accessed, thereby boosting the performance. This could
>> give even more benefits for network filesystems by avoiding costly
>> round trips to the server.
>>
>
> I believe you are referring to READDIRPLUS, which is quite common
> for network protocols and also supported by FUSE.
>
> Unlike network protocols, FUSE decides by server configuration and
> heuristics whether to "fuse_use_readdirplus" - specifically in readdirplus_auto
> mode, FUSE starts with readdirplus, but if nothing calls lookup on the
> directory inode by the time of the next getdents call, it stops with readdirplus.
>
> I personally ran into the problem that I would like to control from the
> application, which knows if it is doing "ls" or "ls -l" whether a specific
> getdents() will use FUSE readdirplus or not, because in some situations
> where "ls -l" is not needed that can avoid a lot of unneeded IO.

Indeed, we often have folks wanting dramatically different behavior from
getdents() in NFS, and every time we've tried to improve our heuristics
someone else shouts "regression"!

We can tune the NFS heuristic per-mount, but it often makes the wrong
choice.  As you say, letting the application make the call would be ideal.
POSIX_FADV_ ?
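
To make the question concrete, a minimal sketch in Python of what such a
per-open hint could look like from the application side. posix_fadvise()
and POSIX_FADV_SEQUENTIAL are real, but using them on a directory fd as a
readdirplus hint is purely an assumption for illustration: POSIX defines
no directory-specific advice, and nothing in this thread says that NFS or
FUSE looks at such a hint. hint_sequential_dir is a hypothetical helper.

```python
import os


def hint_sequential_dir(path):
    """Open `path` as a directory and advise sequential access on the fd.

    Returns 0 if the kernel accepted the advice, an errno value if it
    rejected it, or None where os.posix_fadvise is unavailable.
    """
    if not hasattr(os, "posix_fadvise"):
        return None
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # offset=0 and len=0 mean "the whole file" for posix_fadvise.
        # ASSUMPTION: a filesystem would treat this as a directory hint.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)
```

An application doing "ls -l" could call this before its first getdents(),
and one doing a plain "ls" would simply not call it; the open question is
which advice values a filesystem would be expected to honour.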

Ben





Thread overview: 14+ messages
2025-01-14  3:38 [LSF/MM/BPF TOPIC] Predictive readahead of dentries Shyam Prasad N
2025-01-14 12:39 ` [Lsf-pc] " Jan Kara
2025-01-15  9:52   ` Shyam Prasad N
2025-01-14 13:24 ` Amir Goldstein
2025-01-14 14:12   ` Benjamin Coddington
2025-01-14 15:01     ` Paulo Alcantara
2025-01-15 14:30       ` Shyam Prasad N
2025-01-15 14:47         ` Paulo Alcantara
2025-01-15 11:27   ` Shyam Prasad N
2025-01-15 14:21     ` Amir Goldstein
2025-01-20 21:26   ` Benjamin Coddington
2025-01-14 15:59 ` James Bottomley
2025-01-16  4:50 ` Al Viro
2025-01-16  5:31 ` Christoph Hellwig
