Re: [RFC PATCH v0] mm/vmscan: Add readahead LRU to improve readahead file page reclamation efficiency

From: Lei Liu <liulei.rjpt@vivo.com>
To: Yuanchu Xie <yuanchu@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Wei Xu <weixugc@google.com>, David Hildenbrand <david@redhat.com>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Zi Yan <ziy@nvidia.com>, Matthew Brost <matthew.brost@intel.com>,
	Joshua Hahn <joshua.hahnjy@gmail.com>,
	Rakie Kim <rakie.kim@sk.com>, Byungchul Park <byungchul@sk.com>,
	Gregory Price <gourry@gourry.net>,
	Ying Huang <ying.huang@linux.alibaba.com>,
	Alistair Popple <apopple@nvidia.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Brendan Jackman <jackmanb@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Kanchana P Sridhar <kanchana.p.sridhar@intel.com>,
	Johannes Thumshirn <johannes.thumshirn@wdc.com>,
	Yosry Ahmed <yosry.ahmed@linux.dev>,
	Nico Pache <npache@redhat.com>, Harry Yoo <harry.yoo@oracle.com>,
	Yu Zhao <yuzhao@google.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Usama Arif <usamaarif642@gmail.com>,
	Chen Yu <yu.c.chen@intel.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Nhat Pham <nphamcs@gmail.com>, Hao Jia <jiahao1@lixiang.com>,
	"Kirill A. Shutemov" <kas@kernel.org>,
	Barry Song <baohua@kernel.org>, Ingo Molnar <mingo@kernel.org>,
	Jens Axboe <axboe@kernel.dk>, Petr Mladek <pmladek@suse.com>,
	Jaewon Kim <jaewon31.kim@samsung.com>,
	"open list:PROC FILESYSTEM" <linux-kernel@vger.kernel.org>,
	"open list:PROC FILESYSTEM" <linux-fsdevel@vger.kernel.org>,
	"open list:MEMORY MANAGEMENT - MGLRU (MULTI-GEN LRU)"
	<linux-mm@kvack.org>,
	"open list:TRACING" <linux-trace-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH v0] mm/vmscan: Add readahead LRU to improve readahead file page reclamation efficiency
Date: Wed, 17 Sep 2025 17:45:45 +0800
Message-ID: <d31a3880-dc7c-4224-b248-085941431abc@vivo.com>
In-Reply-To: <CAJj2-QHy3rTSPpE5uyu4gW9dWe1E5Q28P_N-VX2Uo+xBFauxdw@mail.gmail.com>


On 2025/9/17 0:33, Yuanchu Xie wrote:
> On Tue, Sep 16, 2025 at 2:22 AM Lei Liu <liulei.rjpt@vivo.com> wrote:
>> ...
>>
>> 2. Solution Proposal
>> Introduce a Readahead LRU to track pages brought in via readahead. During
>> memory reclamation, prioritize scanning this LRU to reclaim pages that
>> have not been accessed recently. For pages in the Readahead LRU that are
>> accessed, move them back to the inactive_file LRU to await subsequent
>> reclamation.
> I'm unsure this is the right solution though, given all users would
> have this readahead LRU on and we don't have performance numbers
> besides application startup here.
> My impression is that readahead behavior is highly dependent on the
> hardware, the workload, and the desired behavior, so making the
> readahead{-adjacent} behavior more amenable to tuning seems like the
> right direction.
>
> Maybe relevant discussions: https://lwn.net/Articles/897786/
>
> I only skimmed the code but noticed a few things:
>
>> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
>> index a458f1e112fd..4f3f031134fd 100644
>> --- a/fs/proc/meminfo.c
>> +++ b/fs/proc/meminfo.c
>> @@ -71,6 +71,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>>          show_val_kb(m, "Inactive(anon): ", pages[LRU_INACTIVE_ANON]);
>>          show_val_kb(m, "Active(file):   ", pages[LRU_ACTIVE_FILE]);
>>          show_val_kb(m, "Inactive(file): ", pages[LRU_INACTIVE_FILE]);
>> +       show_val_kb(m, "ReadAhead(file):",
> I notice both readahead and read ahead in this patch. Stick to the
> conventional one (readahead).
>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index 8d3fa3a91ce4..57dac828aa4f 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -127,6 +127,7 @@ enum pageflags {
>>   #ifdef CONFIG_ARCH_USES_PG_ARCH_3
>>          PG_arch_3,
>>   #endif
>> +       PG_readahead_lru,
> More pageflags...
>
>> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
>> index aa441f593e9a..2dbc1701e838 100644
>> --- a/include/trace/events/mmflags.h
>> +++ b/include/trace/events/mmflags.h
>> @@ -159,7 +159,8 @@ TRACE_DEFINE_ENUM(___GFP_LAST_BIT);
>>          DEF_PAGEFLAG_NAME(reclaim),                                     \
>>          DEF_PAGEFLAG_NAME(swapbacked),                                  \
>>          DEF_PAGEFLAG_NAME(unevictable),                                 \
>> -       DEF_PAGEFLAG_NAME(dropbehind)                                   \
>> +       DEF_PAGEFLAG_NAME(dropbehind),                                  \
>> +       DEF_PAGEFLAG_NAME(readahead_lru)                                \
>>   IF_HAVE_PG_MLOCK(mlocked)                                              \
>>   IF_HAVE_PG_HWPOISON(hwpoison)                                          \
>>   IF_HAVE_PG_IDLE(idle)                                                  \
>> @@ -309,6 +310,7 @@ IF_HAVE_VM_DROPPABLE(VM_DROPPABLE,  "droppable"     )               \
>>                  EM (LRU_ACTIVE_ANON, "active_anon") \
>>                  EM (LRU_INACTIVE_FILE, "inactive_file") \
>>                  EM (LRU_ACTIVE_FILE, "active_file") \
>> +               EM(LRU_READ_AHEAD_FILE, "readahead_file") \
> Likewise, inconsistent naming.
>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 9e5ef39ce73a..0feab4d89d47 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -760,6 +760,8 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
>>                  folio_set_workingset(newfolio);
>>          if (folio_test_checked(folio))
>>                  folio_set_checked(newfolio);
>> +       if (folio_test_readahead_lru(folio))
>> +               folio_set_readahead_lru(folio);
> newfolio
Understood. I'll set the flag on newfolio in the next revision.
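
As a rough illustration of that fix (a userspace toy model, not the kernel
code; toy_folio, TOY_READAHEAD_LRU and toy_migrate_flags are made-up names),
the flag should be tested on the source folio and set on the destination:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the PG_readahead_lru bit. */
#define TOY_READAHEAD_LRU (1UL << 0)

/* Hypothetical stand-in for struct folio; only the flags word is modelled. */
struct toy_folio {
	unsigned long flags;
};

/* Copy the readahead-LRU bit from the old folio to the new one. */
static void toy_migrate_flags(struct toy_folio *newfolio,
			      const struct toy_folio *folio)
{
	assert(newfolio != NULL && folio != NULL);

	/* Test the source, set the destination (not the source again). */
	if (folio->flags & TOY_READAHEAD_LRU)
		newfolio->flags |= TOY_READAHEAD_LRU;
}
```

In the actual patch this would presumably mean passing newfolio to
folio_set_readahead_lru() in folio_migrate_flags().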
>
>>   /*
>> @@ -5800,6 +5837,87 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
>>
>>   #endif /* CONFIG_LRU_GEN */
>>
>> +static unsigned long shrink_read_ahead_list(unsigned long nr_to_scan,
>> +                                           unsigned long nr_to_reclaim,
>> +                                           struct lruvec *lruvec,
>> +                                           struct scan_control *sc)
>> +{
>> +       LIST_HEAD(l_hold);
>> +       LIST_HEAD(l_reclaim);
>> +       LIST_HEAD(l_inactive);
>> +       unsigned long nr_scanned = 0;
>> +       unsigned long nr_taken = 0;
>> +       unsigned long nr_reclaimed = 0;
>> +       unsigned long vm_flags;
>> +       enum vm_event_item item;
>> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>> +       struct reclaim_stat stat = { 0 };
>> +
>> +       lru_add_drain();
>> +
>> +       spin_lock_irq(&lruvec->lru_lock);
>> +       nr_taken = isolate_lru_folios(nr_to_scan, lruvec, &l_hold, &nr_scanned,
>> +                                     sc, LRU_READ_AHEAD_FILE);
>> +
>> +       __count_vm_events(PGSCAN_READAHEAD_FILE, nr_scanned);
>> +       __mod_node_page_state(pgdat, NR_ISOLATED_FILE, nr_taken);
>> +       item = PGSCAN_KSWAPD + reclaimer_offset(sc);
>> +       if (!cgroup_reclaim(sc))
>> +               __count_vm_events(item, nr_scanned);
>> +       count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
>> +       __count_vm_events(PGSCAN_FILE, nr_scanned);
>> +       spin_unlock_irq(&lruvec->lru_lock);
>> +
>> +       if (nr_taken == 0)
>> +               return 0;
>> +
>> +       while (!list_empty(&l_hold)) {
>> +               struct folio *folio;
>> +
>> +               cond_resched();
>> +               folio = lru_to_folio(&l_hold);
>> +               list_del(&folio->lru);
>> +               folio_clear_readahead_lru(folio);
>> +
>> +               if (folio_referenced(folio, 0, sc->target_mem_cgroup, &vm_flags)) {
>> +                       list_add(&folio->lru, &l_inactive);
>> +                       continue;
>> +               }
>> +               folio_clear_active(folio);
>> +               list_add(&folio->lru, &l_reclaim);
>> +       }
>> +
>> +       nr_reclaimed = shrink_folio_list(&l_reclaim, pgdat, sc, &stat, true,
>> +                                        lruvec_memcg(lruvec));
>> +
>> +       list_splice(&l_reclaim, &l_inactive);
>> +
>> +       spin_lock_irq(&lruvec->lru_lock);
>> +       move_folios_to_lru(lruvec, &l_inactive);
>> +       __mod_node_page_state(pgdat, NR_ISOLATED_FILE, -nr_taken);
>> +
>> +       __count_vm_events(PGSTEAL_READAHEAD_FILE, nr_reclaimed);
>> +       item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
>> +       if (!cgroup_reclaim(sc))
>> +               __count_vm_events(item, nr_reclaimed);
>> +       count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
>> +       __count_vm_events(PGSTEAL_FILE, nr_reclaimed);
>> +       spin_unlock_irq(&lruvec->lru_lock);
> I see the idea is that readahead pages should be scanned before the
> rest of inactive file. I wonder if this is achievable without adding
> another LRU.
>
>
> Thanks,
> Yuanchu

Hi Yuanchu,

Thank you for your valuable feedback!

1. We initially considered keeping readahead pages on the existing
inactive/active LRUs rather than adding a dedicated LRU. However, that
approach can make reclaiming readahead pages inefficient.

Reason: when the inactive LRU is scanned, readahead pages are interleaved
with non-readahead pages (e.g., shared or recently accessed pages). The
reference checks on those non-readahead pages add significant overhead and
slow down the scanning and reclaim of readahead pages.
Isolating readahead pages on a dedicated readahead LRU lets reclaim target
them directly, which significantly speeds up scanning and reclaim.

2. That said, this approach does raise valid concerns. As you rightly
pointed out, enabling this LRU for all users may not fit everyone's needs,
since not every scenario requires it.

3. For now, this is a preliminary solution. The main goal of this RFC is
to highlight the problem of excessive readahead overhead and to gather
ideas from the community for better alternatives.

For future iterations, we are actively exploring solutions that do not add
a new LRU.
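
To make the argument in point 1 concrete, here is a toy userspace model
(not kernel code and not a measurement; toy_page and toy_scan_cost are
hypothetical names). It counts how many pages a scan has to examine before
it finds a given number of unreferenced readahead pages, treating each
examined page as one reference check:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page descriptor for this model only. */
struct toy_page {
	int from_readahead;	/* brought in by readahead */
	int referenced;		/* accessed since it was read in */
};

/*
 * Walk the list from the start until `want` unreferenced readahead pages
 * have been found or the list ends. Returns the number of pages examined;
 * in this model each examined page costs one reference check.
 */
static size_t toy_scan_cost(const struct toy_page *list, size_t len,
			    size_t want)
{
	size_t examined = 0;
	size_t found = 0;

	assert(list != NULL || len == 0);

	for (size_t i = 0; i < len && found < want; i++) {
		examined++;
		if (list[i].from_readahead && !list[i].referenced)
			found++;
	}
	return examined;
}
```

In this model, a mixed list laid out as [other, readahead, other, other,
readahead] needs 5 checks to find 2 readahead candidates, while a dedicated
list holding only those 2 pages needs 2. Real costs depend on the workload
and on how expensive folio_referenced() is for the pages involved.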

Best regards,
Lei Liu

Thread overview: 4+ messages
2025-09-16  7:22 Lei Liu
2025-09-16 16:33 ` Yuanchu Xie
2025-09-17  9:45   ` Lei Liu [this message]
2025-09-23  8:03   ` zhongjinji
