Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier

From: Akinobu Mita <akinobu.mita@gmail.com>
To: Gregory Price <gourry@gourry.net>
Cc: Michal Hocko <mhocko@suse.com>,
	linux-cxl@vger.kernel.org,  linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, akpm@linux-foundation.org,
	 axelrasmussen@google.com, yuanchu@google.com,
	weixugc@google.com,  hannes@cmpxchg.org, david@kernel.org,
	zhengqi.arch@bytedance.com,  shakeel.butt@linux.dev,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	 vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	ziy@nvidia.com,  matthew.brost@intel.com,
	joshua.hahnjy@gmail.com, rakie.kim@sk.com,  byungchul@sk.com,
	ying.huang@linux.alibaba.com, apopple@nvidia.com,
	 bingjiao@google.com, jonathan.cameron@huawei.com,
	 pratyush.brahma@oss.qualcomm.com
Subject: Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Date: Thu, 29 Jan 2026 09:51:44 +0900
Message-ID: <CAC5umyi6oW9qZZH75Owitojd+wgTGrz6uHsEFXxzo8aWF7FoKA@mail.gmail.com>
In-Reply-To: <aXksUiwYGwad5JvC@gourry-fedora-PF4VCD3F>

On Wed, Jan 28, 2026 at 6:21 Gregory Price <gourry@gourry.net> wrote:
>
> On Mon, Jan 26, 2026 at 10:57:11AM +0900, Akinobu Mita wrote:
> > >
> > > Doesn't this suggest what I mentioned earlier?  If you don't demote when
> > > the target node is full, then you're removing a memory pressure signal
> > > from the lower node and reclaim won't ever clean up the lower node to
> > > make room for future demotions.
> >
> > Thank you for your analysis.
> > Now I finally understand the concerns (though I'll need to learn more
> > to find a solution...)
> >
>
> Apologies - sorry for the multiple threads, i accidentally replied on v3
>
> It's taken me a while to untangle this, but what looks like it might
> be happening is that demote_folios is actually stealing all the
> potential candidates for swap, leaving reclaim with no forward progress
> and no OOM signal.
>
> 1) demotion is already not a reclaim signal, so forgive my prior
>    comments, i missed the masking of ~__GFP_RECLAIM
>
> 2) it appears we spend most of the time building the demotion list, but
>    then just abandon the list without having made progress later when
>    the demotion allocation target fails (w/ __THISNODE you don't get
>    OOM on allocation failure, we just continue)
>
> 3) i don't see hugetlb pages causing the GFP_RECLAIM override bug being
>    an issue in reclaim, because the page->lru is used for something else
>    in hugetlb pages (i.e. we shouldn't see hugetlb pages here)
>
> 4) skipping the entire demotion pass will shunt all this pressure to
>    swap instead (do_demote_pass = false -> so we swap instead).
>
>
> The risk here is that the OOM situation is temporary and some amount of
> memory from toptier gets shunted to swap while kswapd on other tiers
> makes progress.  This is effectively LRU inversion.
>
> Why swappiness affects behavior is likely because it changes how
> aggressively your lower-tier gets reclaimed, and therefore reduces the
> upper tier demotion failures until swap is already pressured.
>
> I'm not sure there's a best-option here, we may need additional input to
> determine what the least-worst option is.  Causing LRU inversion when
> all the nodes are pressured but swap is available is not preferable.

Would it be better if can_demote() returned false only after confirming
that there is no free swap space at all and that neither the demotion
target node nor any of the nodes below it has enough free memory?

can_demote()
{
        ...
        /* If demotion node isn't in the cgroup's mems_allowed, fall back */
        if (mem_cgroup_node_allowed(memcg, demotion_nid)) {
                /* Swap is still available: keep the existing behavior */
                if (get_nr_swap_pages() > 0)
                        return true;
                /*
                 * No swap left: allow demotion only if some node on the
                 * demotion chain is above its min watermark.
                 */
                do {
                        int z;
                        struct zone *zone;
                        struct pglist_data *pgdat = NODE_DATA(demotion_nid);

                        for_each_managed_zone_pgdat(zone, pgdat, z, MAX_NR_ZONES - 1) {
                                if (zone_watermark_ok(zone, 0, min_wmark_pages(zone),
                                                      ZONE_MOVABLE, 0))
                                        return true;
                        }
                        demotion_nid = next_demotion_node(demotion_nid);
                } while (demotion_nid != NUMA_NO_NODE);
        }
        return false;
}
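
To make the intended decision easier to review outside the kernel, here is
a minimal userspace sketch of the same logic. It is only a model under
stated assumptions: struct model_node, MODEL_NO_NODE and model_can_demote()
are hypothetical names, each node is collapsed to a single free-page count
and a single min watermark (no zones, no lowmem reserve), and the memcg
mems_allowed check is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NO_NODE (-1)	/* hypothetical stand-in for NUMA_NO_NODE */

/* Hypothetical, simplified per-node state; not the kernel's pglist_data. */
struct model_node {
	unsigned long free_pages;	/* free pages on this node */
	unsigned long min_wmark;	/* stand-in for the min watermark */
	int next_demotion_nid;		/* next lower node, or MODEL_NO_NODE */
};

/*
 * Model of the proposed check: demotion stays allowed while swap has room,
 * or while any node on the demotion chain is above its min watermark.
 */
static bool model_can_demote(const struct model_node *nodes, size_t nr_nodes,
			     int demotion_nid, long nr_swap_pages)
{
	size_t steps;

	if (demotion_nid == MODEL_NO_NODE)
		return false;

	/* Swap is still available: keep the current behavior. */
	if (nr_swap_pages > 0)
		return true;

	/* Walk the chain; the step bound guards against a cyclic chain. */
	for (steps = 0; demotion_nid != MODEL_NO_NODE && steps < nr_nodes; steps++) {
		const struct model_node *node;

		assert(demotion_nid >= 0 && (size_t)demotion_nid < nr_nodes);
		node = &nodes[demotion_nid];
		if (node->free_pages > node->min_wmark)
			return true;
		demotion_nid = node->next_demotion_nid;
	}
	return false;
}
```

In this model, demotion is refused only when swap is exhausted and every
node reachable through the demotion chain is at or below its watermark.
That is the case where I would expect reclaim to stop building a demotion
list it cannot use.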
