From: Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>, Zi Yan <ziy@nvidia.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Barry Song <baohua@kernel.org>, Hugh Dickins <hughd@google.com>,
Chris Li <chrisl@kernel.org>,
Kemeng Shi <shikemeng@huaweicloud.com>,
Nhat Pham <nphamcs@gmail.com>, Baoquan He <bhe@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Youngjun Park <youngjun.park@lge.com>,
Chengming Zhou <chengming.zhou@linux.dev>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Qi Zheng <zhengqi.arch@bytedance.com>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
Kairui Song <kasong@tencent.com>, Yosry Ahmed <yosry@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>, Dev Jain <dev.jain@arm.com>,
Lance Yang <lance.yang@linux.dev>,
Michal Hocko <mhocko@suse.com>, Michal Hocko <mhocko@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Axel Rasmussen <axelrasmussen@google.com>,
Lorenzo Stoakes <ljs@kernel.org>, Yosry Ahmed <yosry@kernel.org>
Subject: [PATCH v3 08/12] mm, swap: delay and unify memcg lookup and charging for swapin
Date: Tue, 21 Apr 2026 14:16:52 +0800 [thread overview]
Message-ID: <20260421-swap-table-p4-v3-8-2f23759a76bc@tencent.com> (raw)
In-Reply-To: <20260421-swap-table-p4-v3-0-2f23759a76bc@tencent.com>
From: Kairui Song <kasong@tencent.com>
Instead of checking the cgroup private ID during page table walk in
swap_pte_batch(), move the memcg lookup into __swap_cache_add_check()
under the cluster lock.
The first pre-alloc check is speculative and skips the memcg check since
the post-alloc stable check ensures all slots covered by the folio
belong to the same memcg. It is very rare for contiguous and aligned
entries across a contiguous region of a page table of the same process
or shmem mapping to belong to different memcgs.
This also prepares for recording the memcg info in the cluster's table.
Also make the order check and fallback more compact.
There should be no user-observable behavior change.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/memcontrol.h | 6 +++---
mm/internal.h | 10 +---------
mm/memcontrol.c | 10 ++++------
mm/swap_state.c | 28 +++++++++++++++++++---------
4 files changed, 27 insertions(+), 27 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7d08128de1fd..a013f37f24aa 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -646,8 +646,8 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
int mem_cgroup_charge_hugetlb(struct folio* folio, gfp_t gfp);
-int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
- gfp_t gfp, swp_entry_t entry);
+int mem_cgroup_swapin_charge_folio(struct folio *folio, unsigned short id,
+ struct mm_struct *mm, gfp_t gfp);
void __mem_cgroup_uncharge(struct folio *folio);
@@ -1137,7 +1137,7 @@ static inline int mem_cgroup_charge_hugetlb(struct folio* folio, gfp_t gfp)
}
static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
- struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
+ unsigned short id, struct mm_struct *mm, gfp_t gfp)
{
return 0;
}
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..9d2fec696bd6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -451,24 +451,16 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
{
pte_t expected_pte = pte_next_swp_offset(pte);
const pte_t *end_ptep = start_ptep + max_nr;
- const softleaf_t entry = softleaf_from_pte(pte);
pte_t *ptep = start_ptep + 1;
- unsigned short cgroup_id;
VM_WARN_ON(max_nr < 1);
- VM_WARN_ON(!softleaf_is_swap(entry));
+ VM_WARN_ON(!softleaf_is_swap(softleaf_from_pte(pte)));
- cgroup_id = lookup_swap_cgroup_id(entry);
while (ptep < end_ptep) {
- softleaf_t entry;
-
pte = ptep_get(ptep);
if (!pte_same(pte, expected_pte))
break;
- entry = softleaf_from_pte(pte);
- if (lookup_swap_cgroup_id(entry) != cgroup_id)
- break;
expected_pte = pte_next_swp_offset(expected_pte);
ptep++;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c7df30ca5aa7..641706fa47bf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5062,27 +5062,25 @@ int mem_cgroup_charge_hugetlb(struct folio *folio, gfp_t gfp)
/**
* mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
- * @folio: folio to charge.
+ * @folio: the folio to charge
+ * @id: memory cgroup id
* @mm: mm context of the victim
* @gfp: reclaim mode
- * @entry: swap entry for which the folio is allocated
*
* This function charges a folio allocated for swapin. Please call this before
* adding the folio to the swapcache.
*
* Returns 0 on success. Otherwise, an error code is returned.
*/
-int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
- gfp_t gfp, swp_entry_t entry)
+int mem_cgroup_swapin_charge_folio(struct folio *folio, unsigned short id,
+ struct mm_struct *mm, gfp_t gfp)
{
struct mem_cgroup *memcg;
- unsigned short id;
int ret;
if (mem_cgroup_disabled())
return 0;
- id = lookup_swap_cgroup_id(entry);
rcu_read_lock();
memcg = mem_cgroup_from_private_id(id);
if (!memcg || !css_tryget_online(&memcg->css))
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 12b290d43e45..86d517a33a55 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -142,16 +142,20 @@ void *swap_cache_get_shadow(swp_entry_t entry)
* @ci: The locked swap cluster
* @targ_entry: The target swap entry to check, will be rounded down by @nr
* @nr: Number of slots to check, must be a power of 2
- * @shadowp: Returns the shadow value if one exists in the range.
+ * @shadowp: Returns the shadow value if one exists in the range
+ * @memcg_id: Returns the memory cgroup id, NULL to ignore cgroup check
*
* Check if all slots covered by given range have a swap count >= 1.
- * Retrieves the shadow if there is one.
+ * Retrieves the shadow if there is one. If @memcg_id is not NULL, also
+ * checks if all slots belong to the same cgroup and return the cgroup
+ * private id.
*
* Context: Caller must lock the cluster.
*/
static int __swap_cache_add_check(struct swap_cluster_info *ci,
swp_entry_t targ_entry,
- unsigned long nr, void **shadowp)
+ unsigned long nr, void **shadowp,
+ unsigned short *memcg_id)
{
unsigned int ci_off, ci_end;
unsigned long old_tb;
@@ -169,19 +173,24 @@ static int __swap_cache_add_check(struct swap_cluster_info *ci,
return -EEXIST;
if (!__swp_tb_get_count(old_tb))
return -ENOENT;
- if (swp_tb_is_shadow(old_tb) && shadowp)
+ if (shadowp && swp_tb_is_shadow(old_tb))
*shadowp = swp_tb_to_shadow(old_tb);
+ if (memcg_id)
+ *memcg_id = lookup_swap_cgroup_id(targ_entry);
if (nr == 1)
return 0;
+ targ_entry.val = round_down(targ_entry.val, nr);
ci_off = round_down(ci_off, nr);
ci_end = ci_off + nr;
do {
old_tb = __swap_table_get(ci, ci_off);
if (unlikely(swp_tb_is_folio(old_tb) ||
- !__swp_tb_get_count(old_tb)))
+ !__swp_tb_get_count(old_tb) ||
+ (memcg_id && *memcg_id != lookup_swap_cgroup_id(targ_entry))))
return -EBUSY;
+ targ_entry.val++;
} while (++ci_off < ci_end);
return 0;
@@ -397,6 +406,7 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
swp_entry_t entry;
struct folio *folio;
void *shadow = NULL;
+ unsigned short memcg_id;
unsigned long address, nr_pages = 1 << order;
struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
@@ -404,7 +414,7 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
/* Check if the slot and range are available, skip allocation if not */
spin_lock(&ci->lock);
- err = __swap_cache_add_check(ci, targ_entry, nr_pages, NULL);
+ err = __swap_cache_add_check(ci, targ_entry, nr_pages, NULL, NULL);
spin_unlock(&ci->lock);
if (unlikely(err))
return ERR_PTR(err);
@@ -427,7 +437,7 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
/* Double check the range is still not in conflict */
spin_lock(&ci->lock);
- err = __swap_cache_add_check(ci, targ_entry, nr_pages, &shadow);
+ err = __swap_cache_add_check(ci, targ_entry, nr_pages, &shadow, &memcg_id);
if (unlikely(err)) {
spin_unlock(&ci->lock);
folio_put(folio);
@@ -439,8 +449,8 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
__swap_cache_do_add_folio(ci, folio, entry);
spin_unlock(&ci->lock);
- if (mem_cgroup_swapin_charge_folio(folio, vmf ? vmf->vma->vm_mm : NULL,
- gfp, entry)) {
+ if (mem_cgroup_swapin_charge_folio(folio, memcg_id,
+ vmf ? vmf->vma->vm_mm : NULL, gfp)) {
spin_lock(&ci->lock);
__swap_cache_do_del_folio(ci, folio, entry, shadow);
spin_unlock(&ci->lock);
--
2.53.0
next prev parent reply other threads:[~2026-04-21 6:17 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-21 6:16 [PATCH v3 00/12] mm, swap: swap table phase IV: unify allocation and reduce static metadata Kairui Song via B4 Relay
2026-04-21 6:16 ` [PATCH v3 01/12] mm, swap: simplify swap cache allocation helper Kairui Song via B4 Relay
2026-04-21 6:16 ` [PATCH v3 02/12] mm, swap: move common swap cache operations into standalone helpers Kairui Song via B4 Relay
2026-04-21 6:16 ` [PATCH v3 03/12] mm/huge_memory: move THP gfp limit helper into header Kairui Song via B4 Relay
2026-04-21 6:16 ` [PATCH v3 04/12] mm, swap: add support for stable large allocation in swap cache directly Kairui Song via B4 Relay
2026-04-21 6:16 ` [PATCH v3 05/12] mm, swap: unify large folio allocation Kairui Song via B4 Relay
2026-04-21 6:16 ` [PATCH v3 06/12] mm/memcg, swap: tidy up cgroup v1 memsw swap helpers Kairui Song via B4 Relay
2026-04-21 6:16 ` [PATCH v3 07/12] mm, swap: support flexible batch freeing of slots in different memcgs Kairui Song via B4 Relay
2026-04-21 6:16 ` Kairui Song via B4 Relay [this message]
2026-04-21 6:16 ` [PATCH v3 09/12] mm, swap: consolidate cluster allocation helpers Kairui Song via B4 Relay
2026-04-21 6:16 ` [PATCH v3 10/12] mm/memcg, swap: store cgroup id in cluster table directly Kairui Song via B4 Relay
2026-04-21 6:16 ` [PATCH v3 11/12] mm/memcg: remove no longer used swap cgroup array Kairui Song via B4 Relay
2026-04-21 6:16 ` [PATCH v3 12/12] mm, swap: merge zeromap into swap table Kairui Song via B4 Relay
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260421-swap-table-p4-v3-8-2f23759a76bc@tencent.com \
--to=devnull+kasong.tencent.com@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=axelrasmussen@google.com \
--cc=baohua@kernel.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=bhe@redhat.com \
--cc=cgroups@vger.kernel.org \
--cc=chengming.zhou@linux.dev \
--cc=chrisl@kernel.org \
--cc=david@kernel.org \
--cc=dev.jain@arm.com \
--cc=hannes@cmpxchg.org \
--cc=hughd@google.com \
--cc=kasong@tencent.com \
--cc=lance.yang@linux.dev \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=mhocko@kernel.org \
--cc=mhocko@suse.com \
--cc=muchun.song@linux.dev \
--cc=nphamcs@gmail.com \
--cc=roman.gushchin@linux.dev \
--cc=shakeel.butt@linux.dev \
--cc=shikemeng@huaweicloud.com \
--cc=surenb@google.com \
--cc=yosry@kernel.org \
--cc=youngjun.park@lge.com \
--cc=zhengqi.arch@bytedance.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox