From: Johannes Weiner <hannes@cmpxchg.org>
To: Kairui Song <ryncsn@gmail.com>
Cc: Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org>,
	linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Zi Yan <ziy@nvidia.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Barry Song <baohua@kernel.org>, Hugh Dickins <hughd@google.com>,
	Chris Li <chrisl@kernel.org>,
	Kemeng Shi <shikemeng@huaweicloud.com>,
	Nhat Pham <nphamcs@gmail.com>, Baoquan He <bhe@redhat.com>,
	Yosry Ahmed <yosry.ahmed@linux.dev>,
	Youngjun Park <youngjun.park@lge.com>,
	Chengming Zhou <chengming.zhou@linux.dev>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: Re: [PATCH RFC 08/15] mm, swap: store and check memcg info in the swap table
Date: Tue, 24 Feb 2026 10:58:37 -0500
Message-ID: <aZ3KrfD_6vfxjRcs@cmpxchg.org>
In-Reply-To: <CAMgjq7Aq5ckraKtNtet8+1ANuqnitFsXxefbDJQZpBxNmaW7Cg@mail.gmail.com>

On Tue, Feb 24, 2026 at 04:34:00PM +0800, Kairui Song wrote:
> On Tue, Feb 24, 2026 at 12:46 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Fri, Feb 20, 2026 at 07:42:09AM +0800, Kairui Song via B4 Relay wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > To prepare for merging the swap_cgroup_ctrl into the swap table, store
> > > the memcg info in the swap table on swapout.
> > >
> > > This is done by using the existing shadow format.
> > >
> > > Note this also changes the refault counting at the nearest online memcg
> > > level:
> > >
> > > Unlike file folios, anon folios are mostly exclusive to one mem cgroup,
> > > and each cgroup is likely to have different characteristics.
> >
> > This is not correct.
> >
> > As much as I like the idea of storing the swap_cgroup association
> > inside the shadow entry, the refault evaluation needs to happen at the
> > level that drove eviction.
> >
> > Consider a workload that is split into cgroups purely for accounting,
> > not for setting different limits:
> >
> > workload (limit domain)
> > `- component A
> > `- component B
> >
> > This means the two components must compete freely, and it must behave
> > as if there is only one LRU. When pages get reclaimed in a round-robin
> > fashion, both A and B get aged at the same pace. Likewise, when pages
> > in A refault, they must challenge the *combined* workingset of both A
> > and B, not just the local pages.
> >
> > Otherwise, you risk retaining stale workingset in one subgroup while
> > the other one is thrashing. This breaks userspace expectations.
> >
> 
> Hi Johannes, thanks for pointing this out.
> 
> I'm just not sure how much of a real problem this is. The refault
> challenge change was made in commit b910718a948a which was before anon
> shadow was introduced. And shadows could get reclaimed, especially
> when under pressure (and we could be doing that again by reclaiming
> full_clusters with swap tables). And MGLRU simply ignores the
> target_memcg here yet it performs surprisingly well with multiple
> memcg setups. And I did find a comment in workingset.c saying the
> kernel used to activate all pages, which is also fine. And that commit
> also mentioned the active list shrinking, but anon active list gets
> shrinked just fine without refault feedback in shrink_lruvec under
> can_age_anon_pages.

                    *if inactive anon is empty, as part of the second
                     chance logic

Please try to understand *why* this code is the way it is before
throwing it all out. It was driven by real production problems. The
fact that some workloads don't care is not proof that many others
won't be hurt if you break this.

Anon refault detection was added for that reason: Once you have swap,
you facilitate anon workingsets that exceed memory capacity. At that
point, cache replacement strategies apply. Scan resistance matters.

With fast modern compression and flash swap, the anon set alone can be
larger than memory capacity. Everything that
6a3ed2123a78de22a9e2b2855068a8d89f8e14f4 says about file cache starts
applying to anonymous pages: you don't want to throw out the hot anon
workingset just because somebody is doing a one-off burst scan through
a larger set of cold, swapped out pages.
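
To illustrate the scan resistance point, here is a toy userspace model
of a refault distance check. It is only a sketch, not the code in
mm/workingset.c: the names toy_lruvec, toy_evict and
toy_refault_is_workingset are made up for this example, and the real
implementation packs the eviction clock into the shadow entry and
handles details this model ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model, not a kernel structure. */
struct toy_lruvec {
	uint64_t nonresident_age;  /* clock: number of evictions so far */
	uint64_t workingset_pages; /* resident pages a refault competes with */
};

/* On eviction, return the current clock value as the "shadow" and tick. */
static uint64_t toy_evict(struct toy_lruvec *lv)
{
	assert(lv != NULL);
	return lv->nonresident_age++;
}

/*
 * On refault, the distance is how many evictions happened while the
 * page was gone. The page counts as workingset only if it would have
 * stayed resident had the resident set been that much larger.
 */
static bool toy_refault_is_workingset(const struct toy_lruvec *lv,
				      uint64_t shadow)
{
	assert(lv != NULL);
	uint64_t distance = lv->nonresident_age - shadow;

	return distance <= lv->workingset_pages;
}
```

In this model, a page that comes back after a few dozen evictions is
treated as hot, while a page that comes back after a long one-off scan
has a distance larger than the resident set and does not displace it.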

Like I said in the LSFMM thread, there is no difference between anon
and file. There didn't use to be historically. The LRU lists were
split mechanically because noswap systems became common (lots of RAM +
rotational drives = sad swap) and there was no point in scanning/aging
anonymous memory if there is no swap space.

But no reasonable argument has been put forth why anon should be aged
completely differently than file when you DO have swap.

There is more explanation of the reasoning behind the cgroup behavior
in the cover letter portion of 53138cea7f398d2cdd0fa22adeec7e16093e1ebd.
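
As a rough illustration of the cgroup point, the toy sketch below
compares one refault distance against the resident set of a single
subgroup and against the resident set of the limit domain that drove
eviction. Everything here is an assumption made for the example: the
toy_cgroup struct, the toy_activate helper and the page counts are
hypothetical and do not correspond to kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model of a cgroup in a hierarchy. */
struct toy_cgroup {
	const struct toy_cgroup *parent;
	uint64_t resident_pages; /* for a parent: sum over its subtree */
};

/*
 * Decide whether a refaulting page is activated when the refault
 * distance is compared against the resident set at eval_level.
 */
static bool toy_activate(const struct toy_cgroup *eval_level,
			 uint64_t distance)
{
	assert(eval_level != NULL);
	return distance <= eval_level->resident_pages;
}
```

With a workload of 1010 resident pages split into A (10 pages) and B
(1000 pages), a refault in A with distance 200 fails the check against
A alone but passes against the workload. Evaluating locally would leave
A thrashing while B keeps its pages, stale or not; evaluating at the
level that drove eviction lets A's refaults challenge the combined set.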

