From: Michal Hocko <mhocko@suse.com>
To: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, akpm@linux-foundation.org,
axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
hannes@cmpxchg.org, david@kernel.org, zhengqi.arch@bytedance.com,
shakeel.butt@linux.dev, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
surenb@google.com, ziy@nvidia.com, matthew.brost@intel.com,
rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
ying.huang@linux.alibaba.com, apopple@nvidia.com,
bingjiao@google.com, jonathan.cameron@huawei.com,
pratyush.brahma@oss.qualcomm.com
Subject: Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Date: Mon, 2 Feb 2026 14:11:10 +0100 [thread overview]
Message-ID: <aYCiboGiXO2lQC0W@tiehlicka> (raw)
In-Reply-To: <CAC5umyhqbW_qXaApO8OGg1wo706GfVPuak5JwdBfBgS751Ka5Q@mail.gmail.com>
On Thu 29-01-26 09:40:17, Akinobu Mita wrote:
> 2026年1月28日(水) 7:00 Joshua Hahn <joshua.hahnjy@gmail.com>:
> >
> > > > > Therefore, it appears that the behavior of get_swappiness() is important
> > > > > in this issue.
> > > >
> > > > This is quite mysterious.
> > > >
> > > > Especially because get_swappiness() is an MGLRU exclusive function, I find
> > > > it quite strange that the issue you mention above occurs regardless of whether
> > > > MGLRU is enabled or disabled. With MGLRU disabled, did you see the same hangs
> > > > as before? Were these hangs similarly fixed by modifying the callsite in
> > > > get_swappiness?
> > >
> > > Good point.
> > > When MGLRU is disabled, changing only the behavior of can_demote()
> > > called by get_swappiness() did not solve the problem.
> > >
> > > Instead, the problem was avoided by changing only the behavior of
> > > can_demote() called by can_reclaim_anon_page(), without changing the
> > > behavior of can_demote() called from other places.
> > >
> > > > On a separate note, I feel a bit uncomfortable for making this the default
> > > > setting, regardless of whether there is swap space or not. Just as it is
> > > > easy to create a degenerate scenario where all memory is unreclaimable
> > > > and the system starts going into (wasteful) reclaim on the lower tiers,
> > > > it is equally easy to create a scenario where all memory is very easily
> > > > reclaimable (say, clean pagecache) and we OOM without making any attempt to
> > > > free up memory on the lower tiers.
> > > >
> > > > Reality is likely somewhere in between. And from my perspective, as long as
> > > > we have some amount of easily reclaimable memory, I don't think immediately
> > > > OOMing will be helpful for the system (and even if none of the memory is
> > > > easily reclaimable, we should still try doing something before killing).
> > > >
> > > > > > > The reason for this issue is that memory allocations do not directly
> > > > > > > trigger the oom-killer, assuming that if the target node has an underlying
> > > > > > > memory tier, it can always be reclaimed by demotion.
> > > >
> > > > This patch enforces that the opposite of this assumption is true; that even
> > > > if a target node has an underlying memory tier, it can never be reclaimed by
> > > > demotion.
> > > >
> > > > Certainly for systems with swap and some compression methods (z{ram, swap}),
> > > > this new enforcement could be harmful to the system. What do you think?
> > >
> > > Thank you for the detailed explanation.
> > >
> > > I understand the concern regarding the current patch, which only
> > > checks the free memory of the demotion target node.
> > > I will explore a solution.
> >
> > Hello Akinobu, I hope you had a great weekend!
> >
> > I noticed something that I thought was worth flagging. It seems like the
> > primary addition of this patch, which is to check for zone_watermark_ok
> > across the zones, is already a part of should_reclaim_retry():
> >
> > /*
> > * Keep reclaiming pages while there is a chance this will lead
> > * somewhere. If none of the target zones can satisfy our allocation
> > * request even if all reclaimable pages are considered then we are
> > * screwed and have to go OOM.
> > */
> > for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
> > ac->highest_zoneidx, ac->nodemask) {
> >
> > [...snip...]
> >
> > /*
> > * Would the allocation succeed if we reclaimed all
> > * reclaimable pages?
> > */
> > wmark = __zone_watermark_ok(zone, order, min_wmark,
> > ac->highest_zoneidx, alloc_flags, available);
> >
> > if (wmark) {
> > ret = true;
> > break;
> > }
> > }
> >
> > ... which is called in __alloc_pages_slowpath. I wonder why we don't already
> > hit this. It seems to do the same thing your patch is doing?
>
> I checked the number of calls and the time spent for several functions
> called by __alloc_pages_slowpath(), and found that time is spent in
> __alloc_pages_direct_reclaim() before reaching the first should_reclaim_retry().
>
> After a few minutes have passed and the debug code that automatically
> resets numa_demotion_enabled to false is executed, it appears that
> __alloc_pages_direct_reclaim() immediately exits.
First of all is this MGLRU or traditional reclaim? Or both?
Then another thing I've noticed only now. There seems to be a layering
discrepancy (for traditional LRU reclaim) when get_scan_count which
controls the to-be-reclaimed lrus always relies on can_reclaim_anon_pages
while down the reclaim path shrink_folio_list tries to be more clever
and avoid demotion if it turns out to be inefficient.
I wouldn't be surprised if get_scan_count predominantly (or even
exclusively) scanned anon LRUs only while increasing the reclaim
priority (so essentially just checked all anon pages on the LRU list)
before concluding that it makes no sense. This can take quite some time
and in the worst case you could be recycling couple of page cache pages
remaining on the list to make small but sufficient progress to loop
around.
So I think the first step is to make the demotion behavior consistent.
If demotion fails then it would probably makes sense to set sc->no_demotion
so that get_scan_count can learn from the reclaim feedback that
anonymous pages are not a good reclaim target in this situation. But the
whole reclaim path needs a careful review I am afraid.
--
Michal Hocko
SUSE Labs
next prev parent reply other threads:[~2026-02-02 13:11 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-01-13 8:14 [PATCH v4 0/3] mm: fix oom-killer not being invoked when demotion is enabled Akinobu Mita
2026-01-13 8:14 ` [PATCH v4 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes Akinobu Mita
2026-01-13 9:30 ` Pratyush Brahma
2026-01-13 8:14 ` [PATCH v4 2/3] mm: numa_emu: add document for NUMA emulation Akinobu Mita
2026-01-13 9:32 ` Pratyush Brahma
2026-01-13 8:14 ` [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier Akinobu Mita
2026-01-13 13:40 ` Michal Hocko
2026-01-14 12:51 ` Akinobu Mita
2026-01-14 13:40 ` Michal Hocko
2026-01-14 17:49 ` Gregory Price
2026-01-15 0:40 ` Akinobu Mita
2026-01-22 0:32 ` Akinobu Mita
2026-01-22 16:38 ` Gregory Price
2026-01-26 1:57 ` Akinobu Mita
2026-01-27 21:21 ` Gregory Price
2026-01-29 0:51 ` Akinobu Mita
2026-01-29 2:48 ` Gregory Price
2026-01-22 18:34 ` Joshua Hahn
2026-01-26 2:01 ` Akinobu Mita
2026-01-27 22:00 ` Joshua Hahn
2026-01-29 0:40 ` Akinobu Mita
2026-02-02 13:11 ` Michal Hocko [this message]
2026-02-02 13:15 ` Michal Hocko
2026-02-04 2:07 ` Akinobu Mita
2026-02-04 9:25 ` Michal Hocko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aYCiboGiXO2lQC0W@tiehlicka \
--to=mhocko@suse.com \
--cc=Liam.Howlett@oracle.com \
--cc=akinobu.mita@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=apopple@nvidia.com \
--cc=axelrasmussen@google.com \
--cc=bingjiao@google.com \
--cc=byungchul@sk.com \
--cc=david@kernel.org \
--cc=gourry@gourry.net \
--cc=hannes@cmpxchg.org \
--cc=jonathan.cameron@huawei.com \
--cc=joshua.hahnjy@gmail.com \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=matthew.brost@intel.com \
--cc=pratyush.brahma@oss.qualcomm.com \
--cc=rakie.kim@sk.com \
--cc=rppt@kernel.org \
--cc=shakeel.butt@linux.dev \
--cc=surenb@google.com \
--cc=vbabka@suse.cz \
--cc=weixugc@google.com \
--cc=ying.huang@linux.alibaba.com \
--cc=yuanchu@google.com \
--cc=zhengqi.arch@bytedance.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox