linux-mm.kvack.org archive mirror
From: Barry Song <21cnbao@gmail.com>
To: lenohou@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	 Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
	 Jialing Wang <wjl.linux@gmail.com>,
	Yafang Shao <laoar.shao@gmail.com>, Yu Zhao <yuzhao@google.com>,
	 Kairui Song <ryncsn@gmail.com>, Bingfang Guo <bfguo@icloud.com>,
	linux-mm@kvack.org,  linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5] mm/mglru: fix cgroup OOM during MGLRU state switching
Date: Fri, 20 Mar 2026 04:49:21 +0800	[thread overview]
Message-ID: <CAGsJ_4yY6q_kP74LtKpD=FEWPBH811Ch3EA=v6dxgYfyNtmZyA@mail.gmail.com> (raw)
In-Reply-To: <20260319-b4-switch-mglru-v2-v5-1-8898491e5f17@gmail.com>

On Thu, Mar 19, 2026 at 11:40 AM Leno Hou via B4 Relay
<devnull+lenohou.gmail.com@kernel.org> wrote:
>
> From: Leno Hou <lenohou@gmail.com>
>
> When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> condition exists between the state switching and the memory reclaim path.
> This can lead to unexpected cgroup OOM kills, even when plenty of
> reclaimable memory is available.
>
> Problem Description
> ===================
> The issue arises from a "reclaim vacuum" during the transition.
>
> 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
>    false before the pages are drained from MGLRU lists back to traditional
>    LRU lists.
> 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
>    and skip the MGLRU path.
> 3. However, these pages might not have reached the traditional LRU lists
>    yet, or the changes are not yet visible to all CPUs due to a lack
>    of synchronization.
> 4. get_scan_count() subsequently finds traditional LRU lists empty,
>    concludes there is no reclaimable memory, and triggers an OOM kill.
>
> A similar race can occur during enablement, where the reclaimer sees the
> new state but the MGLRU lists haven't been populated via fill_evictable()
> yet.
>
> Solution
> ========
> Introduce a 'switching' state (`lru_switch`) to bridge the transition.
> When transitioning, the system enters this intermediate state where
> the reclaimer is forced to attempt both MGLRU and traditional reclaim
> paths sequentially. This ensures that folios remain visible to at least
> one reclaim mechanism until the transition is fully materialized across
> all CPUs.
>
> Changes
> =======
> v5:
>  - Rename lru_gen_draining to lru_gen_switching; lru_drain_core to
>    lru_switch
>  - Add more documentation for folio_referenced_one
>  - Keep folio_check_references unchanged
>
> v4:
>  - Address Sashiko.dev's AI code-review comments
>  - Remove the patch that maintained workingset refault context across
>    MGLRU state transitions
>  - Remove the folio_lru_gen(folio) != -1 check introduced in the v2 patch
>
> v3:
>  - Rebase onto mm-new branch for queue testing
>  - Don't look around while draining
>  - Address Barry Song's review comments
>
> v2:
> - Replace the 'draining' flag with a static branch `lru_drain_core` to
>   track the transition state.
> - Ensure all LRU helpers correctly identify page state by checking
>   folio_lru_gen(folio) != -1 instead of relying solely on global flags.
> - Maintain workingset refault context across MGLRU state transitions
> - Fix build error when CONFIG_LRU_GEN is disabled.
>
> v1:
> - Use smp_store_release() and smp_load_acquire() to ensure the visibility
>   of 'enabled' and 'draining' flags across CPUs.
> - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
>   is in the 'draining' state, the reclaimer will attempt to scan MGLRU
>   lists first, and then fall through to traditional LRU lists instead
>   of returning early. This ensures that folios are visible to at least
>   one reclaim path at any given time.
>
> Race & Mitigation
> =================
> A race window exists between checking the 'draining' state and performing
> the actual list operations. For instance, a reclaimer might observe the
> draining state as false just before it changes, leading to a suboptimal
> reclaim path decision.
>
> However, this impact is effectively mitigated by the kernel's reclaim
> retry mechanism (e.g., in do_try_to_free_pages). If a reclaimer pass fails
> to find eligible folios due to a state transition race, subsequent retries
> in the loop will observe the updated state and correctly direct the scan
> to the appropriate LRU lists. This ensures the transient inconsistency
> does not escalate into a terminal OOM kill.
>
> This reduces the race window that previously triggered OOMs under high
> memory pressure.
>
> This fix has been verified on v7.0.0-rc1; dynamic toggling of MGLRU
> functions correctly without triggering unexpected OOM kills.
>
> To: Andrew Morton <akpm@linux-foundation.org>
> To: Axel Rasmussen <axelrasmussen@google.com>
> To: Yuanchu Xie <yuanchu@google.com>
> To: Wei Xu <weixugc@google.com>
> To: Barry Song <21cnbao@gmail.com>
> To: Jialing Wang <wjl.linux@gmail.com>
> To: Yafang Shao <laoar.shao@gmail.com>
> To: Yu Zhao <yuzhao@google.com>
> To: Kairui Song <ryncsn@gmail.com>
> To: Bingfang Guo <bfguo@icloud.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Leno Hou <lenohou@gmail.com>
> ---
> When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> condition exists between the state switching and the memory reclaim path.
> This can lead to unexpected cgroup OOM kills, even when plenty of
> reclaimable memory is available.
>
> Problem Description
> ===================
> The issue arises from a "reclaim vacuum" during the transition.
>
> 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
>    false before the pages are drained from MGLRU lists back to traditional
>    LRU lists.
> 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
>    and skip the MGLRU path.
> 3. However, these pages might not have reached the traditional LRU lists
>    yet, or the changes are not yet visible to all CPUs due to a lack
>    of synchronization.
> 4. get_scan_count() subsequently finds traditional LRU lists empty,
>    concludes there is no reclaimable memory, and triggers an OOM kill.
>
> A similar race can occur during enablement, where the reclaimer sees the
> new state but the MGLRU lists haven't been populated via fill_evictable()
> yet.
>
> Solution
> ========
> Introduce a 'switching' state (`lru_switch`) to bridge the transition.
> When transitioning, the system enters this intermediate state where
> the reclaimer is forced to attempt both MGLRU and traditional reclaim
> paths sequentially. This ensures that folios remain visible to at least
> one reclaim mechanism until the transition is fully materialized across
> all CPUs.
>
> Changes
> =======
> v5:
>  - Rename lru_gen_draining to lru_gen_switching; lru_drain_core to
>    lru_switch
>  - Add more documentation for folio_referenced_one
>  - Keep folio_check_references unchanged
>
> v4:
>  - Address Sashiko.dev's AI code-review comments
>  - Remove the patch that maintained workingset refault context across
>    MGLRU state transitions
>  - Remove the folio_lru_gen(folio) != -1 check introduced in the v2 patch
>
> v3:
>  - Rebase onto mm-new branch for queue testing
>  - Don't look around while draining
>  - Address Barry Song's review comments
>
> v2:
> - Replace the 'draining' flag with a static branch `lru_drain_core` to
>   track the transition state.
> - Ensure all LRU helpers correctly identify page state by checking
>   folio_lru_gen(folio) != -1 instead of relying solely on global flags.
> - Maintain workingset refault context across MGLRU state transitions
> - Fix build error when CONFIG_LRU_GEN is disabled.
>
> v1:
> - Use smp_store_release() and smp_load_acquire() to ensure the visibility
>   of 'enabled' and 'draining' flags across CPUs.
> - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
>   is in the 'draining' state, the reclaimer will attempt to scan MGLRU
>   lists first, and then fall through to traditional LRU lists instead
>   of returning early. This ensures that folios are visible to at least
>   one reclaim path at any given time.
>
> Race & Mitigation
> =================
> A race window exists between checking the 'draining' state and performing
> the actual list operations. For instance, a reclaimer might observe the
> draining state as false just before it changes, leading to a suboptimal
> reclaim path decision.
>
> However, this impact is effectively mitigated by the kernel's reclaim
> retry mechanism (e.g., in do_try_to_free_pages()). If a reclaim pass fails
> to find eligible folios due to a state transition race, subsequent retries
> in the loop will observe the updated state and correctly direct the scan
> to the appropriate LRU lists. This ensures the transient inconsistency
> does not escalate into a terminal OOM kill.
>
> This reduces the race window that previously triggered OOMs under high
> memory pressure.
>
> This fix has been verified on v7.0.0-rc1; dynamic toggling of MGLRU
> functions correctly without triggering unexpected OOM kills.
>
> Reproduction
> ============
>
> The issue was consistently reproduced on v6.1.157 and v6.18.3 using a
> high-pressure memory cgroup (v1) environment.
>
> Reproduction steps:
> 1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
>    and 8GB active anonymous memory.
> 2. Toggle MGLRU state while performing new memory allocations to force
>    direct reclaim.
>
> Reproduction script
> ===================
>
> ```bash
> #!/bin/bash
>
> MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
> CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
>
> # Flip the MGLRU state. The write is backgrounded so it races with the
> # allocations triggered below.
> switch_mglru() {
>     local orig_val
>     orig_val=$(cat "$MGLRU_FILE")
>     if [[ "$orig_val" != "0x0000" ]]; then
>         echo n > "$MGLRU_FILE" &
>     else
>         echo y > "$MGLRU_FILE" &
>     fi
> }
>
> mkdir -p "$CGROUP_PATH"
> echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
> echo $$ > "$CGROUP_PATH/cgroup.procs"
>
> dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
> dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
>
> stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
> sleep 5
>
> switch_mglru
> stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || \
> echo "OOM Triggered"
>
> grep oom_kill "$CGROUP_PATH/memory.oom_control"
> ```
> ---
> Changes in v5:
> - Rename lru_gen_draining to lru_gen_switching; lru_drain_core to
>    lru_switch
> - Add more documentation for folio_referenced_one
> - Keep folio_check_references unchanged
> - Link to v4: https://lore.kernel.org/r/20260318-b4-switch-mglru-v2-v4-1-1b927c93659d@gmail.com
>
> Changes in v4:
> - Address Sashiko.dev's AI code-review comments
>   Link: https://sashiko.dev/#/patchset/20260316-b4-switch-mglru-v2-v3-0-c846ce9a2321%40gmail.com
> - Remove the patch that maintained workingset refault context across
>   MGLRU state transitions
> - Remove the folio_lru_gen(folio) != -1 check introduced in the v2 patch
> - Link to v3: https://lore.kernel.org/r/20260316-b4-switch-mglru-v2-v3-0-c846ce9a2321@gmail.com
> ---

A bit odd: I've seen the v5, v4, and earlier changelog entries repeated
in this message, at least three times?

I'm starting to suspect my eyes are broken.

I guess we have a changelog duplication issue here?
Otherwise,
Reviewed-by: Barry Song <baohua@kernel.org>

Thanks
Barry



Thread overview: 4+ messages
2026-03-18 16:30 Leno Hou via B4 Relay
2026-03-19  9:08 ` Yafang Shao
2026-03-19 20:49 ` Barry Song [this message]
2026-03-19 21:04   ` Axel Rasmussen
