* [PATCH] mm/readahead: Skip fully overlapped range
@ 2025-09-23 3:59 Aubrey Li
2025-09-23 3:49 ` Andrew Morton
0 siblings, 1 reply; 9+ messages in thread
From: Aubrey Li @ 2025-09-23 3:59 UTC (permalink / raw)
To: Matthew Wilcox, Andrew Morton, Nanhai Zou, Gang Deng, Tianyou Li,
Vinicius Gomes, Tim Chen, Chen Yu
Cc: linux-fsdevel, linux-mm, linux-kernel, Aubrey Li
RocksDB sequential read benchmark under high concurrency shows severe
lock contention. Multiple threads may issue readahead on the same file
simultaneously, which leads to heavy contention on the xas spinlock in
filemap_add_folio(). Perf profiling indicates 30%~60% of CPU time spent
there.
To mitigate this issue, a readahead request will be skipped if its
range is fully covered by an ongoing readahead. This avoids redundant
work and significantly reduces lock contention. In one-second sampling,
contention on xas spinlock dropped from 138,314 times to 2,144 times,
resulting in a large performance improvement in the benchmark.
                             w/o patch     w/ patch
RocksDB-readseq (ops/sec)
(32-threads)                      1.2M         2.4M
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vinicius Gomes <vinicius.gomes@intel.com>
Cc: Tianyou Li <tianyou.li@intel.com>
Cc: Chen Yu <yu.c.chen@intel.com>
Suggested-by: Nanhai Zou <nanhai.zou@intel.com>
Tested-by: Gang Deng <gang.deng@intel.com>
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
---
mm/readahead.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 20d36d6b055e..57ae1a137730 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -337,7 +337,7 @@ void force_page_cache_ra(struct readahead_control *ractl,
 	struct address_space *mapping = ractl->mapping;
 	struct file_ra_state *ra = ractl->ra;
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
-	unsigned long max_pages;
+	unsigned long max_pages, index;
 
 	if (unlikely(!mapping->a_ops->read_folio && !mapping->a_ops->readahead))
 		return;
@@ -348,6 +348,19 @@ void force_page_cache_ra(struct readahead_control *ractl,
 	 */
 	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
 	nr_to_read = min_t(unsigned long, nr_to_read, max_pages);
+
+	index = readahead_index(ractl);
+	/*
+	 * Skip this readahead if the requested range is fully covered
+	 * by the ongoing readahead range. This typically occurs in
+	 * concurrent scenarios.
+	 */
+	if (index >= ra->start && index + nr_to_read <= ra->start + ra->size)
+		return;
+
+	ra->start = index;
+	ra->size = nr_to_read;
+
 	while (nr_to_read) {
 		unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_SIZE;
 
@@ -357,6 +370,10 @@ void force_page_cache_ra(struct readahead_control *ractl,
 
 		nr_to_read -= this_chunk;
 	}
+
+	/* Reset readahead state to allow the next readahead */
+	ra->start = 0;
+	ra->size = 0;
 }
 
 /*
--
2.43.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm/readahead: Skip fully overlapped range
2025-09-23 3:59 [PATCH] mm/readahead: Skip fully overlapped range Aubrey Li
@ 2025-09-23 3:49 ` Andrew Morton
2025-09-23 5:11 ` Aubrey Li
0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2025-09-23 3:49 UTC (permalink / raw)
To: Aubrey Li
Cc: Matthew Wilcox, Nanhai Zou, Gang Deng, Tianyou Li,
Vinicius Gomes, Tim Chen, Chen Yu, linux-fsdevel, linux-mm,
linux-kernel, Jan Kara, Roman Gushchin
On Tue, 23 Sep 2025 11:59:46 +0800 Aubrey Li <aubrey.li@linux.intel.com> wrote:
> RocksDB sequential read benchmark under high concurrency shows severe
> lock contention. Multiple threads may issue readahead on the same file
> simultaneously, which leads to heavy contention on the xas spinlock in
> filemap_add_folio(). Perf profiling indicates 30%~60% of CPU time spent
> there.
>
> To mitigate this issue, a readahead request will be skipped if its
> range is fully covered by an ongoing readahead. This avoids redundant
> work and significantly reduces lock contention. In one-second sampling,
> contention on xas spinlock dropped from 138,314 times to 2,144 times,
> resulting in a large performance improvement in the benchmark.
>
>                              w/o patch     w/ patch
> RocksDB-readseq (ops/sec)
> (32-threads)                      1.2M         2.4M
On which kernel version? In recent times we've made a few readahead
changes to address issues with high concurrency and a quick retest on
mm.git's current mm-stable branch would be interesting please.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm/readahead: Skip fully overlapped range
2025-09-23 3:49 ` Andrew Morton
@ 2025-09-23 5:11 ` Aubrey Li
2025-09-23 9:57 ` Jan Kara
0 siblings, 1 reply; 9+ messages in thread
From: Aubrey Li @ 2025-09-23 5:11 UTC (permalink / raw)
To: Andrew Morton
Cc: Matthew Wilcox, Nanhai Zou, Gang Deng, Tianyou Li,
Vinicius Gomes, Tim Chen, Chen Yu, linux-fsdevel, linux-mm,
linux-kernel, Jan Kara, Roman Gushchin
On 9/23/25 11:49, Andrew Morton wrote:
> On Tue, 23 Sep 2025 11:59:46 +0800 Aubrey Li <aubrey.li@linux.intel.com> wrote:
>
>> RocksDB sequential read benchmark under high concurrency shows severe
>> lock contention. Multiple threads may issue readahead on the same file
>> simultaneously, which leads to heavy contention on the xas spinlock in
>> filemap_add_folio(). Perf profiling indicates 30%~60% of CPU time spent
>> there.
>>
>> To mitigate this issue, a readahead request will be skipped if its
>> range is fully covered by an ongoing readahead. This avoids redundant
>> work and significantly reduces lock contention. In one-second sampling,
>> contention on xas spinlock dropped from 138,314 times to 2,144 times,
>> resulting in a large performance improvement in the benchmark.
>>
>>                              w/o patch     w/ patch
>> RocksDB-readseq (ops/sec)
>> (32-threads)                      1.2M         2.4M
>
> On which kernel version? In recent times we've made a few readahead
> changes to address issues with high concurrency and a quick retest on
> mm.git's current mm-stable branch would be interesting please.
>
I'm on v6.16.7. Thanks Andrew for the information, let me check with mm.git.
Thanks,
-Aubrey
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm/readahead: Skip fully overlapped range
2025-09-23 5:11 ` Aubrey Li
@ 2025-09-23 9:57 ` Jan Kara
2025-09-24 0:27 ` Aubrey Li
2025-09-30 5:35 ` Aubrey Li
0 siblings, 2 replies; 9+ messages in thread
From: Jan Kara @ 2025-09-23 9:57 UTC (permalink / raw)
To: Aubrey Li
Cc: Andrew Morton, Matthew Wilcox, Nanhai Zou, Gang Deng, Tianyou Li,
Vinicius Gomes, Tim Chen, Chen Yu, linux-fsdevel, linux-mm,
linux-kernel, Jan Kara, Roman Gushchin
On Tue 23-09-25 13:11:37, Aubrey Li wrote:
> On 9/23/25 11:49, Andrew Morton wrote:
> > On Tue, 23 Sep 2025 11:59:46 +0800 Aubrey Li <aubrey.li@linux.intel.com> wrote:
> >
> >> RocksDB sequential read benchmark under high concurrency shows severe
> >> lock contention. Multiple threads may issue readahead on the same file
> >> simultaneously, which leads to heavy contention on the xas spinlock in
> >> filemap_add_folio(). Perf profiling indicates 30%~60% of CPU time spent
> >> there.
> >>
> >> To mitigate this issue, a readahead request will be skipped if its
> >> range is fully covered by an ongoing readahead. This avoids redundant
> >> work and significantly reduces lock contention. In one-second sampling,
> >> contention on xas spinlock dropped from 138,314 times to 2,144 times,
> >> resulting in a large performance improvement in the benchmark.
> >>
> >>                              w/o patch     w/ patch
> >> RocksDB-readseq (ops/sec)
> >> (32-threads)                      1.2M         2.4M
> >
> > On which kernel version? In recent times we've made a few readahead
> > changes to address issues with high concurrency and a quick retest on
> > mm.git's current mm-stable branch would be interesting please.
>
> I'm on v6.16.7. Thanks Andrew for the information, let me check with mm.git.
I don't expect much of a change for this load but getting test result with
mm.git as a confirmation would be nice. Also, based on the fact that the
patch you propose helps, this looks like there are many threads sharing one
struct file which race to read the same content. That is actually rather
problematic for current readahead code because there's *no synchronization*
on updating the file's readahead state. So threads can race and corrupt the
state in interesting ways under one another's hands. On rare occasions I've
observed this with a heavy NFS workload where the NFS server is
multithreaded. Since the practical outcome is "just" reduced read
throughput / reading too much, it was never high enough on my priority list
to fix properly (I do have some preliminary patch for that laying around
but there are some open questions that require deeper thinking - like how
to handle a situation where one thread does readahead, the filesystem requests
some alignment of the request size after the fact, so we'd like to update
readahead state but another thread has modified the shared readahead state
in the meantime). But if we're going to work on improving the behavior of
readahead for multiple threads sharing readahead state, fixing the code so
that readahead state is at least consistent is IMO the first necessary
step. And then we can pile more complex logic on top of that.
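To illustrate the kind of race I mean, here's a simplified sketch (not code
from any particular kernel version; index_A/size_A etc. are just
placeholders) of two threads sharing one struct file and updating its
file_ra_state with plain stores:

	Thread A				Thread B
	ra->start = index_A;
						ra->start = index_B;
						ra->size  = size_B;
	ra->size  = size_A;

The file ends up with B's start paired with A's size, i.e. a readahead
window that neither thread actually computed.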
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm/readahead: Skip fully overlapped range
2025-09-23 9:57 ` Jan Kara
@ 2025-09-24 0:27 ` Aubrey Li
2025-09-30 5:35 ` Aubrey Li
1 sibling, 0 replies; 9+ messages in thread
From: Aubrey Li @ 2025-09-24 0:27 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Matthew Wilcox, Nanhai Zou, Gang Deng, Tianyou Li,
Vinicius Gomes, Tim Chen, Chen Yu, linux-fsdevel, linux-mm,
linux-kernel, Roman Gushchin
On 9/23/25 17:57, Jan Kara wrote:
> On Tue 23-09-25 13:11:37, Aubrey Li wrote:
>> On 9/23/25 11:49, Andrew Morton wrote:
>>> On Tue, 23 Sep 2025 11:59:46 +0800 Aubrey Li <aubrey.li@linux.intel.com> wrote:
>>>
>>>> RocksDB sequential read benchmark under high concurrency shows severe
>>>> lock contention. Multiple threads may issue readahead on the same file
>>>> simultaneously, which leads to heavy contention on the xas spinlock in
>>>> filemap_add_folio(). Perf profiling indicates 30%~60% of CPU time spent
>>>> there.
>>>>
>>>> To mitigate this issue, a readahead request will be skipped if its
>>>> range is fully covered by an ongoing readahead. This avoids redundant
>>>> work and significantly reduces lock contention. In one-second sampling,
>>>> contention on xas spinlock dropped from 138,314 times to 2,144 times,
>>>> resulting in a large performance improvement in the benchmark.
>>>>
>>>>                              w/o patch     w/ patch
>>>> RocksDB-readseq (ops/sec)
>>>> (32-threads)                      1.2M         2.4M
>>>
>>> On which kernel version? In recent times we've made a few readahead
>>> changes to address issues with high concurrency and a quick retest on
>>> mm.git's current mm-stable branch would be interesting please.
>>
>> I'm on v6.16.7. Thanks Andrew for the information, let me check with mm.git.
>
> I don't expect much of a change for this load but getting test result with
> mm.git as a confirmation would be nice.
Yes, the hotspot remains on mm.git:mm-stable branch.
    - 88.68% clone3
       - 88.68% start_thread
          - 88.68% reader_thread
             - 88.27% syscall
                  entry_SYSCALL_64_after_hwframe
                  do_syscall_64
                  ksys_readahead
                  generic_fadvise
                  force_page_cache_ra
                  page_cache_ra_unbounded
                  filemap_add_folio
                  __filemap_add_folio
                  _raw_spin_lock_irq
                - do_raw_spin_lock
                     native_queued_spin_lock_slowpath
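The readahead side of the reader threads boils down to something like the
sketch below (a simplified reproducer, not the actual benchmark code; the
file name, thread count and iteration count are made up):

#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>

#define NR_THREADS	32
#define RA_LEN		(2UL << 20)	/* 2MB per request */

static int fd;

/* Every thread issues readahead(2) on the same fd over the same range,
 * so they all pile into force_page_cache_ra()/filemap_add_folio(). */
static void *reader_thread(void *arg)
{
	long i;

	(void)arg;
	for (i = 0; i < 100000; i++)
		syscall(SYS_readahead, fd, 0L, RA_LEN);
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	int i;

	fd = open("testfile", O_RDONLY);	/* hypothetical test file */
	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, reader_thread, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);
	close(fd);
	return 0;
}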
> Also, based on the fact that the
> patch you propose helps, this looks like there are many threads sharing one
> struct file which race to read the same content. That is actually rather
> problematic for current readahead code because there's *no synchronization*
> on updating file's readhead state. So threads can race and corrupt the
> state in interesting ways under one another's hands. On rare occasions I've
> observed this with heavy NFS workload where the NFS server is
> multithreaded. Since the practical outcome is "just" reduced read
> throughput / reading too much, it was never high enough on my priority list
> to fix properly (I do have some preliminary patch for that laying around
> but there are some open questions that require deeper thinking - like how
> to handle a situation where one threads does readahead, filesystem requests
> some alignment of the request size after the fact, so we'd like to update
> readahead state but another thread has modified the shared readahead state
> in the mean time). But if we're going to work on improving behavior of
> readahead for multiple threads sharing readahead state, fixing the code so
> that readahead state is at least consistent is IMO the first necessary
> step. And then we can pile more complex logic on top of that.
This makes sense. I actually had a version using atomic operations to update
ra in my patch, but I found that ra is also updated in other paths without
synchronization, so I dropped the atomic operations before sending the patch.
Let me check what I can do for this.
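The idea was roughly along these lines (just an illustrative sketch, not the
exact code I had; old_start is a local variable):

	old_start = READ_ONCE(ra->start);
	if (index >= old_start &&
	    index + nr_to_read <= old_start + READ_ONCE(ra->size))
		return;
	/* only the thread that wins the cmpxchg updates the window */
	if (cmpxchg(&ra->start, old_start, index) != old_start)
		return;
	WRITE_ONCE(ra->size, nr_to_read);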
Have you put your preliminary patch somewhere?
Thanks,
-Aubrey
>
> Honza
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm/readahead: Skip fully overlapped range
2025-09-23 9:57 ` Jan Kara
2025-09-24 0:27 ` Aubrey Li
@ 2025-09-30 5:35 ` Aubrey Li
2025-10-11 22:20 ` Andrew Morton
1 sibling, 1 reply; 9+ messages in thread
From: Aubrey Li @ 2025-09-30 5:35 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Matthew Wilcox, Nanhai Zou, Gang Deng, Tianyou Li,
Vinicius Gomes, Tim Chen, Chen Yu, linux-fsdevel, linux-mm,
linux-kernel, Roman Gushchin
On 9/23/25 17:57, Jan Kara wrote:
> On Tue 23-09-25 13:11:37, Aubrey Li wrote:
>> On 9/23/25 11:49, Andrew Morton wrote:
>>> On Tue, 23 Sep 2025 11:59:46 +0800 Aubrey Li <aubrey.li@linux.intel.com> wrote:
>>>
>>>> RocksDB sequential read benchmark under high concurrency shows severe
>>>> lock contention. Multiple threads may issue readahead on the same file
>>>> simultaneously, which leads to heavy contention on the xas spinlock in
>>>> filemap_add_folio(). Perf profiling indicates 30%~60% of CPU time spent
>>>> there.
>>>>
>>>> To mitigate this issue, a readahead request will be skipped if its
>>>> range is fully covered by an ongoing readahead. This avoids redundant
>>>> work and significantly reduces lock contention. In one-second sampling,
>>>> contention on xas spinlock dropped from 138,314 times to 2,144 times,
>>>> resulting in a large performance improvement in the benchmark.
>>>>
>>>>                              w/o patch     w/ patch
>>>> RocksDB-readseq (ops/sec)
>>>> (32-threads)                      1.2M         2.4M
>>>
>>> On which kernel version? In recent times we've made a few readahead
>>> changes to address issues with high concurrency and a quick retest on
>>> mm.git's current mm-stable branch would be interesting please.
>>
>> I'm on v6.16.7. Thanks Andrew for the information, let me check with mm.git.
>
> I don't expect much of a change for this load but getting test result with
> mm.git as a confirmation would be nice. Also, based on the fact that the
> patch you propose helps, this looks like there are many threads sharing one
> struct file which race to read the same content. That is actually rather
> problematic for current readahead code because there's *no synchronization*
> on updating file's readhead state. So threads can race and corrupt the
> state in interesting ways under one another's hands. On rare occasions I've
> observed this with heavy NFS workload where the NFS server is
> multithreaded. Since the practical outcome is "just" reduced read
> throughput / reading too much, it was never high enough on my priority list
> to fix properly (I do have some preliminary patch for that laying around
> but there are some open questions that require deeper thinking - like how
> to handle a situation where one threads does readahead, filesystem requests
> some alignment of the request size after the fact, so we'd like to update
> readahead state but another thread has modified the shared readahead state
> in the mean time). But if we're going to work on improving behavior of
> readahead for multiple threads sharing readahead state, fixing the code so
> that readahead state is at least consistent is IMO the first necessary
> step. And then we can pile more complex logic on top of that.
>
If I understand this article correctly, especially the following passage:
- https://lwn.net/Articles/888715/
"""
A core idea in readahead is to take a risk and read more than was requested.
If that risk brings rewards and the extra data is accessed, then that
justifies a further risk of reading even more data that hasn't been requested.
When performing a single sequential read through a file, the details of past
behavior can easily be stored in the struct file_ra_state. However if an
application reads from two, three, or more, sections of the file and
interleaves these sequential reads, then file_ra_state cannot keep track
of all that state. Instead we rely on the content already in the page cache.
Specifically we have a flag, PG_readahead, which can be set on a page.
That name should be read in the past tense: the page was read ahead. A risk
was taken when reading that page so, if it pays off and the page is accessed,
then that is justification for taking another risk and reading some more.
"""
file_ra_state is considered a performance hint, not a critical correctness
field. The race conditions on the file's readahead state don't affect the
correctness of file I/O, because the page cache mechanisms ensure data
consistency later on; they won't cause wrong data to be read. I think that's
why we don't lock file_ra_state today, to avoid performance penalties on this
hot path.
That said, this patch doesn't make things worse; it does take a risk, but it
brings the rewards of RocksDB's readseq benchmark.
Thanks,
-Aubrey
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm/readahead: Skip fully overlapped range
2025-09-30 5:35 ` Aubrey Li
@ 2025-10-11 22:20 ` Andrew Morton
2025-10-16 16:21 ` Jan Kara
0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2025-10-11 22:20 UTC (permalink / raw)
To: Aubrey Li
Cc: Jan Kara, Matthew Wilcox, Nanhai Zou, Gang Deng, Tianyou Li,
Vinicius Gomes, Tim Chen, Chen Yu, linux-fsdevel, linux-mm,
linux-kernel, Roman Gushchin
On Tue, 30 Sep 2025 13:35:43 +0800 Aubrey Li <aubrey.li@linux.intel.com> wrote:
> file_ra_state is considered a performance hint, not a critical correctness
> field. The race conditions on file's readahead state don't affect the
> correctness of file I/O because later the page cache mechanisms ensure data
> consistency, it won't cause wrong data to be read. I think that's why we do
> not lock file_ra_state today, to avoid performance penalties on this hot path.
>
> That said, this patch didn't make things worse, and it does take a risk but
> brings the rewards of RocksDB's readseq benchmark.
So if I may summarize:
- you've identified and addressed an issue with concurrent readahead
against an fd
- Jan points out that we don't properly handle concurrent access to a
file's ra_state. This is somewhat offtopic, but we should address
this sometime anyway. Then we can address the RocksDB issue later.
Alternatively, we could fix this issue right now and let the
concurrency fixes come later. Not as pretty, but it's practical.
Another practicality: improving a benchmark is nice, but do we have any
reasons to believe that this change will improve any real-world
workload? If so, which and by how much?
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm/readahead: Skip fully overlapped range
2025-10-11 22:20 ` Andrew Morton
@ 2025-10-16 16:21 ` Jan Kara
2025-11-07 10:28 ` Aubrey Li
0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2025-10-16 16:21 UTC (permalink / raw)
To: Andrew Morton
Cc: Aubrey Li, Jan Kara, Matthew Wilcox, Nanhai Zou, Gang Deng,
Tianyou Li, Vinicius Gomes, Tim Chen, Chen Yu, linux-fsdevel,
linux-mm, linux-kernel, Roman Gushchin
Sorry for not replying earlier. I wanted to make up my mind about this and
other stuff kept preempting me...
On Sat 11-10-25 15:20:42, Andrew Morton wrote:
> On Tue, 30 Sep 2025 13:35:43 +0800 Aubrey Li <aubrey.li@linux.intel.com> wrote:
>
> > file_ra_state is considered a performance hint, not a critical correctness
> > field. The race conditions on file's readahead state don't affect the
> > correctness of file I/O because later the page cache mechanisms ensure data
> > consistency, it won't cause wrong data to be read. I think that's why we do
> > not lock file_ra_state today, to avoid performance penalties on this hot path.
> >
> > That said, this patch didn't make things worse, and it does take a risk but
> > brings the rewards of RocksDB's readseq benchmark.
>
> So if I may summarize:
>
> - you've identifed and addressed an issue with concurrent readahead
> against an fd
Right, but let me also note that the patch modifies only
force_page_cache_ra(), which is a pretty peculiar function. It's used in two
places:
1) When page_cache_sync_ra() decides it isn't worth doing a proper
readahead and just wants to read that one request.
2) From POSIX_FADV_WILLNEED - I suppose this is Aubrey's case.
As such, it seems to be fixing mostly a "don't do it when it hurts" kind of
load from the benchmark rather than a widely used practical case, since I'm
not sure many programs call POSIX_FADV_WILLNEED from many threads in parallel
for the same range.
> - Jan points out that we don't properly handle concurrent access to a
> file's ra_state. This is somewhat offtopic, but we should address
> this sometime anyway. Then we can address the RocksDB issue later.
The problem I had with the patch is that it adds more racy updates & checks
for the shared ra state, so it's kind of difficult to say whether some
workload won't now clobber the ra state more often, resulting in poor
readahead behavior. Also, as I looked into the patch now, another objection I
have is that force_page_cache_ra() previously didn't touch the ra state at
all; it just read the requested pages. After the patch,
force_page_cache_ra() will destroy the readahead state completely. This is
definitely something we don't want to do.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm/readahead: Skip fully overlapped range
2025-10-16 16:21 ` Jan Kara
@ 2025-11-07 10:28 ` Aubrey Li
0 siblings, 0 replies; 9+ messages in thread
From: Aubrey Li @ 2025-11-07 10:28 UTC (permalink / raw)
To: Jan Kara, Andrew Morton
Cc: Matthew Wilcox, Nanhai Zou, Gang Deng, Tianyou Li,
Vinicius Gomes, Tim Chen, Chen Yu, linux-fsdevel, linux-mm,
linux-kernel, Roman Gushchin
Really sorry for the late reply, too. Thunderbird collapsed this thread but
didn't highlight it as unread, so I thought no one had responded. :(
On 10/17/25 12:21 AM, Jan Kara wrote:
> Sorry for not replying earlier. I wanted make up my mind about this and
> other stuff was keeping preempting me...
>
> On Sat 11-10-25 15:20:42, Andrew Morton wrote:
>> On Tue, 30 Sep 2025 13:35:43 +0800 Aubrey Li <aubrey.li@linux.intel.com> wrote:
>>
>>> file_ra_state is considered a performance hint, not a critical correctness
>>> field. The race conditions on file's readahead state don't affect the
>>> correctness of file I/O because later the page cache mechanisms ensure data
>>> consistency, it won't cause wrong data to be read. I think that's why we do
>>> not lock file_ra_state today, to avoid performance penalties on this hot path.
>>>
>>> That said, this patch didn't make things worse, and it does take a risk but
>>> brings the rewards of RocksDB's readseq benchmark.
>>
>> So if I may summarize:
>>
>> - you've identifed and addressed an issue with concurrent readahead
>> against an fd
>
> Right but let me also note that the patch modifies only
> force_page_cache_ra() which is a pretty peculiar function. It's used at two
> places:
> 1) When page_cache_sync_ra() decides it isn't worth to do a proper
> readahead and just wants to read that one one.
>
> 2) From POSIX_FADV_WILLNEED - I suppose this is Aubrey's case.
>
> As such it seems to be fixing mostly a "don't do it when it hurts" kind of
> load from the benchmark than a widely used practical case since I'm not
> sure many programs call POSIX_FADV_WILLNEED from many threads in parallel
> for the same range.
>
>> - Jan points out that we don't properly handle concurrent access to a
>> file's ra_state. This is somewhat offtopic, but we should address
>> this sometime anyway. Then we can address the RocksDB issue later.
>>
>> Another practicality: improving a benchmark is nice, but do we have any
>> reasons to believe that this change will improve any real-world
>> workload? If so, which and by how much?
I only have RocksDB on my side, but this isn't just a lab case - it's a real
one. The issue was reported by a customer, who uses this case to stress-test
the system under high-concurrency data workloads, so it could have business
impact.
>
> The problem I had with the patch is that it adds more racy updates & checks
> for the shared ra state so it's kind of difficult to say whether some
> workload will not now more often clobber the ra state resulting in poor
> readahead behavior. Also as I looked into the patch now another objection I
> have is that force_page_cache_ra() previously didn't touch the ra state at
> all, it just read the requested pages. After the patch
> force_page_cache_ra() will destroy the readahead state completely. This is
> definitely something we don't want to do.
This is also something I was worried about, so I added two trace points at
the entry and exit of force_page_cache_ra(), and I got all ZEROs:
test-9858 [018] ..... 554.352691: force_page_cache_ra: force_page_cache_ra entry: ra->start = 0, ra->size = 0
test-9858 [018] ..... 554.352695: force_page_cache_ra: force_page_cache_ra exit: ra->start = 0, ra->size = 0
test-9855 [009] ..... 554.352701: force_page_cache_ra: force_page_cache_ra entry: ra->start = 0, ra->size = 0
test-9855 [009] ..... 554.352705: force_page_cache_ra: force_page_cache_ra exit: ra->start = 0, ra->size = 0
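The trace points were essentially two trace_printk() calls, roughly like
this (a sketch; the debug diff itself isn't included here):

	/* at the top of force_page_cache_ra(), ra == ractl->ra */
	trace_printk("force_page_cache_ra entry: ra->start = %lu, ra->size = %u\n",
		     ra->start, ra->size);

	/* ... existing readahead logic ... */

	/* just before returning */
	trace_printk("force_page_cache_ra exit: ra->start = %lu, ra->size = %u\n",
		     ra->start, ra->size);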
I think for this code path, my patch doesn't break anything. Do we have any
other code paths I can check?
Anyway, thanks to Andrew and Jan for the detailed feedback and discussion. If
we later plan to make file_ra_state concurrency-safe first, I'd be happy to
help test or rebase this optimization on top of that work.
Thanks,
-Aubrey
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2025-11-07 10:30 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-23 3:59 [PATCH] mm/readahead: Skip fully overlapped range Aubrey Li
2025-09-23 3:49 ` Andrew Morton
2025-09-23 5:11 ` Aubrey Li
2025-09-23 9:57 ` Jan Kara
2025-09-24 0:27 ` Aubrey Li
2025-09-30 5:35 ` Aubrey Li
2025-10-11 22:20 ` Andrew Morton
2025-10-16 16:21 ` Jan Kara
2025-11-07 10:28 ` Aubrey Li