* [PATCH v3 0/4] mm/swap_cgroup: remove global swap cgroup lock
@ 2024-12-18 11:46 Kairui Song
  2024-12-18 11:46 ` [PATCH v3 1/4] mm, memcontrol: avoid duplicated memcg enable check Kairui Song
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Kairui Song @ 2024-12-18 11:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Hugh Dickins, Huang, Ying, Yosry Ahmed,
	Roman Gushchin, Shakeel Butt, Johannes Weiner, Barry Song,
	Michal Hocko, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

This series removes the global swap cgroup lock. The critical section
of this lock is very short, but it's still a bottleneck for massively
parallel swap workloads.

Up to 10% performance gain for a tmpfs kernel build test on a 48c96t
system under memory pressure, and no regression for other cases.

V2: https://lore.kernel.org/linux-mm/20241210092805.87281-1-ryncsn@gmail.com/
Updates since V2:
- Micro optimization for bit operations in patch 3 [Chris Li]
- Improve BUILD_BUG_ON to cover potential arch corner cases [Chris Li]
- Introduce patch 4 to make the swap_cgroup tracking code more
   robust [Chris Li]

V1: https://lore.kernel.org/linux-mm/20241202184154.19321-1-ryncsn@gmail.com/
Updates since V1:
- Collect Reviewed-by and Acked-by tags.
- Use bit shifts instead of a mixed usage of short and atomic for
  emulating a 2-byte xchg [Chris Li]
- Merge patch 3 into patch 4 for simplicity [Roman Gushchin]
- Drop the call of mem_cgroup_disabled() in patch 1 instead; also fix
  a bot-reported build error [Yosry Ahmed]
- Wrap the accesses of the atomic_t map with helpers, so the emulation
  can be dropped in favor of a native 2-byte xchg once available.

Kairui Song (4):
  mm, memcontrol: avoid duplicated memcg enable check
  mm/swap_cgroup: remove swap_cgroup_cmpxchg
  mm/swap_cgroup: remove global swap cgroup lock
  mm/swap_cgroup: decouple swap cgroup recording and clearing

 include/linux/swap_cgroup.h |  14 ++--
 mm/memcontrol.c             |  15 ++--
 mm/swap_cgroup.c            | 148 +++++++++++++++++++-----------------
 3 files changed, 93 insertions(+), 84 deletions(-)

-- 
2.47.1




* [PATCH v3 1/4] mm, memcontrol: avoid duplicated memcg enable check
  2024-12-18 11:46 [PATCH v3 0/4] mm/swap_cgroup: remove global swap cgroup lock Kairui Song
@ 2024-12-18 11:46 ` Kairui Song
  2024-12-22 13:33   ` Huang, Ying
  2024-12-18 11:46 ` [PATCH v3 2/4] mm/swap_cgroup: remove swap_cgroup_cmpxchg Kairui Song
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Kairui Song @ 2024-12-18 11:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Hugh Dickins, Huang, Ying, Yosry Ahmed,
	Roman Gushchin, Shakeel Butt, Johannes Weiner, Barry Song,
	Michal Hocko, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

mem_cgroup_uncharge_swap() includes a mem_cgroup_disabled() check,
so the caller doesn't need to check that.
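
For reference, the callee's wrapper looks roughly like the sketch
below (a paraphrase of the include/linux/swap.h helper, not a
verbatim copy):

/* Paraphrased sketch: the disabled case is already handled in the callee. */
static inline void mem_cgroup_uncharge_swap(swp_entry_t entry,
					    unsigned int nr_pages)
{
	if (mem_cgroup_disabled())
		return;
	__mem_cgroup_uncharge_swap(entry, nr_pages);
}

So an extra mem_cgroup_disabled() check at the call site only adds a
redundant branch.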

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Chris Li <chrisl@kernel.org>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..79900a486ed1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4609,7 +4609,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 	 * correspond 1:1 to page and swap slot lifetimes: we charge the
 	 * page to memory here, and uncharge swap when the slot is freed.
 	 */
-	if (!mem_cgroup_disabled() && do_memsw_account()) {
+	if (do_memsw_account()) {
 		/*
 		 * The swap entry might not get freed for a long time,
 		 * let's not wait for it.  The page already received a
-- 
2.47.1




* [PATCH v3 2/4] mm/swap_cgroup: remove swap_cgroup_cmpxchg
  2024-12-18 11:46 [PATCH v3 0/4] mm/swap_cgroup: remove global swap cgroup lock Kairui Song
  2024-12-18 11:46 ` [PATCH v3 1/4] mm, memcontrol: avoid duplicated memcg enable check Kairui Song
@ 2024-12-18 11:46 ` Kairui Song
  2024-12-18 11:46 ` [PATCH v3 3/4] mm/swap_cgroup: remove global swap cgroup lock Kairui Song
  2024-12-18 11:46 ` [PATCH v3 4/4] mm/swap_cgroup: decouple swap cgroup recording and clearing Kairui Song
  3 siblings, 0 replies; 8+ messages in thread
From: Kairui Song @ 2024-12-18 11:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Hugh Dickins, Huang, Ying, Yosry Ahmed,
	Roman Gushchin, Shakeel Butt, Johannes Weiner, Barry Song,
	Michal Hocko, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

swap_cgroup_cmpxchg() has been unused since commit 6b611388b626
("memcg-v1: remove charge move code").

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Chris Li <chrisl@kernel.org>
---
 include/linux/swap_cgroup.h |  2 --
 mm/swap_cgroup.c            | 29 -----------------------------
 2 files changed, 31 deletions(-)

diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index ae73a87775b3..d521ad1c4164 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -6,8 +6,6 @@
 
 #if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP)
 
-extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
-					unsigned short old, unsigned short new);
 extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
 					 unsigned int nr_ents);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index f63d1aa072a1..1770b076f6b7 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -45,35 +45,6 @@ static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
 	return &ctrl->map[offset];
 }
 
-/**
- * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
- * @ent: swap entry to be cmpxchged
- * @old: old id
- * @new: new id
- *
- * Returns old id at success, 0 at failure.
- * (There is no mem_cgroup using 0 as its id)
- */
-unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
-					unsigned short old, unsigned short new)
-{
-	struct swap_cgroup_ctrl *ctrl;
-	struct swap_cgroup *sc;
-	unsigned long flags;
-	unsigned short retval;
-
-	sc = lookup_swap_cgroup(ent, &ctrl);
-
-	spin_lock_irqsave(&ctrl->lock, flags);
-	retval = sc->id;
-	if (retval == old)
-		sc->id = new;
-	else
-		retval = 0;
-	spin_unlock_irqrestore(&ctrl->lock, flags);
-	return retval;
-}
-
 /**
  * swap_cgroup_record - record mem_cgroup for a set of swap entries
  * @ent: the first swap entry to be recorded into
-- 
2.47.1




* [PATCH v3 3/4] mm/swap_cgroup: remove global swap cgroup lock
  2024-12-18 11:46 [PATCH v3 0/4] mm/swap_cgroup: remove global swap cgroup lock Kairui Song
  2024-12-18 11:46 ` [PATCH v3 1/4] mm, memcontrol: avoid duplicated memcg enable check Kairui Song
  2024-12-18 11:46 ` [PATCH v3 2/4] mm/swap_cgroup: remove swap_cgroup_cmpxchg Kairui Song
@ 2024-12-18 11:46 ` Kairui Song
  2024-12-18 11:46 ` [PATCH v3 4/4] mm/swap_cgroup: decouple swap cgroup recording and clearing Kairui Song
  3 siblings, 0 replies; 8+ messages in thread
From: Kairui Song @ 2024-12-18 11:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Hugh Dickins, Huang, Ying, Yosry Ahmed,
	Roman Gushchin, Shakeel Butt, Johannes Weiner, Barry Song,
	Michal Hocko, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

commit e9e58a4ec3b1 ("memcg: avoid use cmpxchg in swap cgroup maintainance")
replaced the cmpxchg/xchg with a global irq spinlock because some archs
don't support a 2-byte cmpxchg/xchg. Clearly this won't scale well.

And as commented in swap_cgroup.c, this lock is not needed for map
synchronization.

Emulating a 2-byte xchg with an atomic cmpxchg isn't hard, so implement
it to get rid of this lock. Introduce two helpers for doing so; they
can easily be dropped once a generic 2-byte xchg is supported.
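
The idea in isolation looks like the following (a minimal standalone
sketch using C11 atomics instead of the kernel's atomic_t API; the
names here are made up for illustration):

#include <stdatomic.h>

/* Emulate a 16-bit xchg on top of a 32-bit compare-and-swap loop. */
static unsigned short xchg16_emulated(_Atomic unsigned int *word,
				      unsigned int idx, /* 0 or 1 */
				      unsigned short new_id)
{
	unsigned int shift = idx * 16;
	unsigned int old = atomic_load(word);
	unsigned int new;

	do {
		/* Replace only the 16-bit half selected by idx. */
		new = (old & ~(0xffffu << shift)) |
		      ((unsigned int)new_id << shift);
		/* On failure, old is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak(word, &old, new));

	return (old >> shift) & 0xffff;
}

The helpers in the patch below do the same thing with atomic_read()
and atomic_try_cmpxchg().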

Testing using a 64G brd device, building the kernel with make -j96 in
a 1.5G memory cgroup using 4k folios, showed the improvement below
(6 test runs):

Before this series:
Sys time: 10782.29 (stdev 42.353886)
Real time: 171.49 (stdev 0.595541)

After this commit:
Sys time: 9617.23 (stdev 37.764062), -10.81%
Real time: 159.65 (stdev 0.587388), -6.90%

With 64k folios and 2G memcg:
Before this series:
Sys time: 8176.94 (stdev 26.414712)
Real time: 141.98 (stdev 0.797382)

After this commit:
Sys time: 7358.98 (stdev 54.927593), -10.00%
Real time: 134.07 (stdev 0.757463), -5.57%

Sequential swapout of 8G 64k zero folios with madvise (24 test runs):
Before this series:
5461409.12 us (stdev 183957.827084)

After this commit:
5420447.26 us (stdev 196419.240317)

Sequential swapin of 8G 4k zero folios (24 test runs):
Before this series:
19736958.916667 us (stdev 189027.246676)

After this commit:
19662182.629630 us (stdev 172717.640614)

Performance is better or at least not worse for all tests above.

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/swap_cgroup.c | 77 ++++++++++++++++++++++++++++++------------------
 1 file changed, 49 insertions(+), 28 deletions(-)

diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index 1770b076f6b7..cf0445cb35ed 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -7,19 +7,20 @@
 
 static DEFINE_MUTEX(swap_cgroup_mutex);
 
+/* Pack two cgroup ids (shorts) of two entries into one swap_cgroup (atomic_t) */
+#define ID_PER_SC (sizeof(struct swap_cgroup) / sizeof(unsigned short))
+#define ID_SHIFT (BITS_PER_TYPE(unsigned short))
+#define ID_MASK (BIT(ID_SHIFT) - 1)
 struct swap_cgroup {
-	unsigned short		id;
+	atomic_t ids;
 };
 
 struct swap_cgroup_ctrl {
 	struct swap_cgroup *map;
-	spinlock_t	lock;
 };
 
 static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
 
-#define SC_PER_PAGE	(PAGE_SIZE/sizeof(struct swap_cgroup))
-
 /*
  * SwapCgroup implements "lookup" and "exchange" operations.
  * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
@@ -30,19 +31,35 @@ static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
  *    SwapCache(and its swp_entry) is under lock.
  *  - When called via swap_free(), there is no user of this entry and no race.
  * Then, we don't need lock around "exchange".
- *
- * TODO: we can push these buffers out to HIGHMEM.
  */
-static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
-					struct swap_cgroup_ctrl **ctrlp)
+static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map,
+					      pgoff_t offset)
 {
-	pgoff_t offset = swp_offset(ent);
-	struct swap_cgroup_ctrl *ctrl;
+	unsigned int shift = (offset % ID_PER_SC) * ID_SHIFT;
+	unsigned int old_ids = atomic_read(&map[offset / ID_PER_SC].ids);
 
-	ctrl = &swap_cgroup_ctrl[swp_type(ent)];
-	if (ctrlp)
-		*ctrlp = ctrl;
-	return &ctrl->map[offset];
+	BUILD_BUG_ON(!is_power_of_2(ID_PER_SC));
+	BUILD_BUG_ON(sizeof(struct swap_cgroup) != sizeof(atomic_t));
+
+	return (old_ids >> shift) & ID_MASK;
+}
+
+static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
+					    pgoff_t offset,
+					    unsigned short new_id)
+{
+	unsigned short old_id;
+	struct swap_cgroup *sc = &map[offset / ID_PER_SC];
+	unsigned int shift = (offset % ID_PER_SC) * ID_SHIFT;
+	unsigned int new_ids, old_ids = atomic_read(&sc->ids);
+
+	do {
+		old_id = (old_ids >> shift) & ID_MASK;
+		new_ids = (old_ids & ~(ID_MASK << shift));
+		new_ids |= ((unsigned int)new_id) << shift;
+	} while (!atomic_try_cmpxchg(&sc->ids, &old_ids, new_ids));
+
+	return old_id;
 }
 
 /**
@@ -58,21 +75,19 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
 				  unsigned int nr_ents)
 {
 	struct swap_cgroup_ctrl *ctrl;
-	struct swap_cgroup *sc;
-	unsigned short old;
-	unsigned long flags;
 	pgoff_t offset = swp_offset(ent);
 	pgoff_t end = offset + nr_ents;
+	unsigned short old, iter;
+	struct swap_cgroup *map;
 
-	sc = lookup_swap_cgroup(ent, &ctrl);
+	ctrl = &swap_cgroup_ctrl[swp_type(ent)];
+	map = ctrl->map;
 
-	spin_lock_irqsave(&ctrl->lock, flags);
-	old = sc->id;
-	for (; offset < end; offset++, sc++) {
-		VM_BUG_ON(sc->id != old);
-		sc->id = id;
-	}
-	spin_unlock_irqrestore(&ctrl->lock, flags);
+	old = __swap_cgroup_id_lookup(map, offset);
+	do {
+		iter = __swap_cgroup_id_xchg(map, offset, id);
+		VM_BUG_ON(iter != old);
+	} while (++offset != end);
 
 	return old;
 }
@@ -85,9 +100,13 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
  */
 unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
 {
+	struct swap_cgroup_ctrl *ctrl;
+
 	if (mem_cgroup_disabled())
 		return 0;
-	return lookup_swap_cgroup(ent, NULL)->id;
+
+	ctrl = &swap_cgroup_ctrl[swp_type(ent)];
+	return __swap_cgroup_id_lookup(ctrl->map, swp_offset(ent));
 }
 
 int swap_cgroup_swapon(int type, unsigned long max_pages)
@@ -98,14 +117,16 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
 	if (mem_cgroup_disabled())
 		return 0;
 
-	map = vcalloc(max_pages, sizeof(struct swap_cgroup));
+	BUILD_BUG_ON(sizeof(unsigned short) * ID_PER_SC !=
+		     sizeof(struct swap_cgroup));
+	map = vcalloc(DIV_ROUND_UP(max_pages, ID_PER_SC),
+		      sizeof(struct swap_cgroup));
 	if (!map)
 		goto nomem;
 
 	ctrl = &swap_cgroup_ctrl[type];
 	mutex_lock(&swap_cgroup_mutex);
 	ctrl->map = map;
-	spin_lock_init(&ctrl->lock);
 	mutex_unlock(&swap_cgroup_mutex);
 
 	return 0;
-- 
2.47.1




* [PATCH v3 4/4] mm/swap_cgroup: decouple swap cgroup recording and clearing
  2024-12-18 11:46 [PATCH v3 0/4] mm/swap_cgroup: remove global swap cgroup lock Kairui Song
                   ` (2 preceding siblings ...)
  2024-12-18 11:46 ` [PATCH v3 3/4] mm/swap_cgroup: remove global swap cgroup lock Kairui Song
@ 2024-12-18 11:46 ` Kairui Song
  3 siblings, 0 replies; 8+ messages in thread
From: Kairui Song @ 2024-12-18 11:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Hugh Dickins, Huang, Ying, Yosry Ahmed,
	Roman Gushchin, Shakeel Butt, Johannes Weiner, Barry Song,
	Michal Hocko, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

The current implementation of swap cgroup tracking is a bit complex
and fragile:

On the charging path, swap_cgroup_record() always records an actual
memcg id, and it depends on the caller to make sure all entries passed
in belong to one single folio. As folios are always charged or
uncharged as a whole, and always charged and uncharged in order,
swap_cgroup doesn't need an extra lock.

On the uncharging path, swap_cgroup_record() always sets the record to
zero. These entries won't be charged again until uncharging is done,
so there is no extra lock needed either. Worth noting that swap cgroup
clearing may happen without a folio involved, e.g. an exiting process
will zap its page tables without swapping in.

The xchg/cmpxchg provides atomicity and barriers to ensure there is no
tearing or synchronization issue with these swap cgroup records.

This works but is quite error-prone. Things can be made much clearer
and more robust by decoupling recording and clearing into two helpers:
recording takes the actual folio being charged as an argument, clearing
always sets the record to zero, and the debug sanity checks are refined
to better reflect this usage.
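
Condensed from the diff below, the two call sites then look roughly
like this (surrounding context and error handling trimmed):

	/* Swap out (charge): the folio is at hand, record its memcg. */
	swap_cgroup_record(folio, entry);

	/* Slot freeing / cgroup v1 swap in (uncharge): no folio needed. */
	id = swap_cgroup_clear(entry, nr_pages);
	rcu_read_lock();
	memcg = mem_cgroup_from_id(id);
	if (memcg) {
		/* ...uncharge swap counters and put the id references... */
	}
	rcu_read_unlock();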

Benchmarks even showed a very slight improvement, as this saves some
extra arguments and lookups:

make -j96 with defconfig on tmpfs in 1.5G memory cgroup using 4k folios:
Before: sys 9617.23 (stdev 37.764062)
After : sys 9541.54 (stdev 42.973976)

make -j96 with defconfig on tmpfs in 2G memory cgroup using 64k folios:
Before: sys 7358.98 (stdev 54.927593)
After : sys 7337.82 (stdev 39.398956)

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap_cgroup.h | 12 ++++---
 mm/memcontrol.c             | 13 +++-----
 mm/swap_cgroup.c            | 66 +++++++++++++++++++++++--------------
 3 files changed, 55 insertions(+), 36 deletions(-)

diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index d521ad1c4164..b5ec038069da 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -6,8 +6,8 @@
 
 #if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP)
 
-extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
-					 unsigned int nr_ents);
+extern void swap_cgroup_record(struct folio *folio, swp_entry_t ent);
+extern unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
@@ -15,8 +15,12 @@ extern void swap_cgroup_swapoff(int type);
 #else
 
 static inline
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
-				  unsigned int nr_ents)
+void swap_cgroup_record(struct folio *folio, swp_entry_t ent)
+{
+}
+
+static inline
+unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents)
 {
 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 79900a486ed1..ca1ea84b4bce 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4973,7 +4973,6 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
 {
 	struct mem_cgroup *memcg, *swap_memcg;
 	unsigned int nr_entries;
-	unsigned short oldid;
 
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
@@ -5000,11 +4999,10 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
 	/* Get references for the tail pages, too */
 	if (nr_entries > 1)
 		mem_cgroup_id_get_many(swap_memcg, nr_entries - 1);
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg),
-				   nr_entries);
-	VM_BUG_ON_FOLIO(oldid, folio);
 	mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
 
+	swap_cgroup_record(folio, entry);
+
 	folio_unqueue_deferred_split(folio);
 	folio->memcg_data = 0;
 
@@ -5035,7 +5033,6 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
 	unsigned int nr_pages = folio_nr_pages(folio);
 	struct page_counter *counter;
 	struct mem_cgroup *memcg;
-	unsigned short oldid;
 
 	if (do_memsw_account())
 		return 0;
@@ -5064,10 +5061,10 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
 	/* Get references for the tail pages, too */
 	if (nr_pages > 1)
 		mem_cgroup_id_get_many(memcg, nr_pages - 1);
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg), nr_pages);
-	VM_BUG_ON_FOLIO(oldid, folio);
 	mod_memcg_state(memcg, MEMCG_SWAP, nr_pages);
 
+	swap_cgroup_record(folio, entry);
+
 	return 0;
 }
 
@@ -5081,7 +5078,7 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 	struct mem_cgroup *memcg;
 	unsigned short id;
 
-	id = swap_cgroup_record(entry, 0, nr_pages);
+	id = swap_cgroup_clear(entry, nr_pages);
 	rcu_read_lock();
 	memcg = mem_cgroup_from_id(id);
 	if (memcg) {
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index cf0445cb35ed..be39078f255b 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -21,17 +21,6 @@ struct swap_cgroup_ctrl {
 
 static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
 
-/*
- * SwapCgroup implements "lookup" and "exchange" operations.
- * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
- * against SwapCache. At swap_free(), this is accessed directly from swap.
- *
- * This means,
- *  - we have no race in "exchange" when we're accessed via SwapCache because
- *    SwapCache(and its swp_entry) is under lock.
- *  - When called via swap_free(), there is no user of this entry and no race.
- * Then, we don't need lock around "exchange".
- */
 static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map,
 					      pgoff_t offset)
 {
@@ -63,29 +52,58 @@ static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
 }
 
 /**
- * swap_cgroup_record - record mem_cgroup for a set of swap entries
+ * swap_cgroup_record - record mem_cgroup for a set of swap entries.
+ * These entries must belong to one single folio, and that folio
+ * must be in the process of being charged for swap (swap out);
+ * these entries must not have been charged before.
+ *
+ * @folio: the folio that the swap entry belongs to
+ * @ent: the first swap entry to be recorded
+ */
+void swap_cgroup_record(struct folio *folio, swp_entry_t ent)
+{
+	unsigned int nr_ents = folio_nr_pages(folio);
+	struct swap_cgroup *map;
+	pgoff_t offset, end;
+	unsigned short old;
+
+	offset = swp_offset(ent);
+	end = offset + nr_ents;
+	map = swap_cgroup_ctrl[swp_type(ent)].map;
+
+	do {
+		old = __swap_cgroup_id_xchg(map, offset,
+					    mem_cgroup_id(folio_memcg(folio)));
+		VM_BUG_ON(old);
+	} while (++offset != end);
+}
+
+/**
+ * swap_cgroup_clear - clear mem_cgroup for a set of swap entries.
+ * These entries must be in the process of being uncharged from swap.
+ * They either belong to one single folio in the swap cache (swap in
+ * for cgroup v1), or no longer have any users (slot freeing).
+ *
  * @ent: the first swap entry to be recorded into
- * @id: mem_cgroup to be recorded
  * @nr_ents: number of swap entries to be recorded
  *
- * Returns old value at success, 0 at failure.
- * (Of course, old value can be 0.)
+ * Returns the existing old value.
  */
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
-				  unsigned int nr_ents)
+unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents)
 {
-	struct swap_cgroup_ctrl *ctrl;
 	pgoff_t offset = swp_offset(ent);
 	pgoff_t end = offset + nr_ents;
-	unsigned short old, iter;
 	struct swap_cgroup *map;
+	unsigned short old, iter = 0;
 
-	ctrl = &swap_cgroup_ctrl[swp_type(ent)];
-	map = ctrl->map;
+	offset = swp_offset(ent);
+	end = offset + nr_ents;
+	map = swap_cgroup_ctrl[swp_type(ent)].map;
 
-	old = __swap_cgroup_id_lookup(map, offset);
 	do {
-		iter = __swap_cgroup_id_xchg(map, offset, id);
+		old = __swap_cgroup_id_xchg(map, offset, 0);
+		if (!iter)
+			iter = old;
 		VM_BUG_ON(iter != old);
 	} while (++offset != end);
 
@@ -119,7 +137,7 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
 
 	BUILD_BUG_ON(sizeof(unsigned short) * ID_PER_SC !=
 		     sizeof(struct swap_cgroup));
-	map = vcalloc(DIV_ROUND_UP(max_pages, ID_PER_SC),
+	map = vzalloc(DIV_ROUND_UP(max_pages, ID_PER_SC) *
 		      sizeof(struct swap_cgroup));
 	if (!map)
 		goto nomem;
-- 
2.47.1




* Re: [PATCH v3 1/4] mm, memcontrol: avoid duplicated memcg enable check
  2024-12-18 11:46 ` [PATCH v3 1/4] mm, memcontrol: avoid duplicated memcg enable check Kairui Song
@ 2024-12-22 13:33   ` Huang, Ying
  2024-12-22 14:51     ` Kairui Song
  0 siblings, 1 reply; 8+ messages in thread
From: Huang, Ying @ 2024-12-22 13:33 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, Chris Li, Hugh Dickins,
	Yosry Ahmed, Roman Gushchin, Shakeel Butt, Johannes Weiner,
	Barry Song, Michal Hocko, linux-kernel

Hi, Kairui,

Sorry for jumping in so late.

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> mem_cgroup_uncharge_swap() includes a mem_cgroup_disabled() check,
> so the caller doesn't need to check that.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: Chris Li <chrisl@kernel.org>
> ---
>  mm/memcontrol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..79900a486ed1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4609,7 +4609,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
>  	 * correspond 1:1 to page and swap slot lifetimes: we charge the
>  	 * page to memory here, and uncharge swap when the slot is freed.
>  	 */
> -	if (!mem_cgroup_disabled() && do_memsw_account()) {
> +	if (do_memsw_account()) {
>  		/*
>  		 * The swap entry might not get freed for a long time,
>  		 * let's not wait for it.  The page already received a

I took a look at memcontrol.c; it appears that almost all extern
functions check mem_cgroup_disabled() as the first step.  So I guess
that this is a convention of memcontrol.c?  And the benefit of the
change is minimal.  In contrast, if someone makes more changes to
mem_cgroup_swapin_uncharge_swap() in the future, he may forget to add
this back.  So, it may be unnecessary to make the change?

---
Best Regards,
Huang, Ying



* Re: [PATCH v3 1/4] mm, memcontrol: avoid duplicated memcg enable check
  2024-12-22 13:33   ` Huang, Ying
@ 2024-12-22 14:51     ` Kairui Song
  2024-12-27  2:03       ` Huang, Ying
  0 siblings, 1 reply; 8+ messages in thread
From: Kairui Song @ 2024-12-22 14:51 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Yosry Ahmed,
	Roman Gushchin, Shakeel Butt, Johannes Weiner, Barry Song,
	Michal Hocko, linux-kernel

On Sun, Dec 22, 2024 at 9:33 PM Huang, Ying
<ying.huang@linux.alibaba.com> wrote:
>
> Hi, Kairui,

Hi Ying,

>
> Sorry for jumping in so late.
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > mem_cgroup_uncharge_swap() includes a mem_cgroup_disabled() check,
> > so the caller doesn't need to check that.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> > Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
> > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Acked-by: Chris Li <chrisl@kernel.org>
> > ---
> >  mm/memcontrol.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 7b3503d12aaf..79900a486ed1 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4609,7 +4609,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
> >        * correspond 1:1 to page and swap slot lifetimes: we charge the
> >        * page to memory here, and uncharge swap when the slot is freed.
> >        */
> > -     if (!mem_cgroup_disabled() && do_memsw_account()) {
> > +     if (do_memsw_account()) {
> >               /*
> >                * The swap entry might not get freed for a long time,
> >                * let's not wait for it.  The page already received a
>
> I took a look at memcontrol.c; it appears that almost all extern
> functions check mem_cgroup_disabled() as the first step.

Hmm, just checked memcontrol.c and I saw quite a few extern functions
not doing that, so I think that's not a convention.

> that this is a convention of memcontrol.c?  And the benefit of the
> change is minimal.  In contrast, if someone makes more changes to
> mem_cgroup_swapin_uncharge_swap() in the future, he may forget to add
> this back.  So, it may be unnecessary to make the change?

This change is minimal indeed, it only helps to remove a few unneeded
nops, still a gain though.

I think mem_cgroup_swapin_uncharge_swap should fade away in the future,
it's only for Cgroup V1, and it's a really simple function, just a
wrapper for mem_cgroup_uncharge_swap, so I think this is not a
problem?

If you are concerned about this, this patch can be dropped from this
series, rest of the patches still work the same.



>
> ---
> Best Regards,
> Huang, Ying



* Re: [PATCH v3 1/4] mm, memcontrol: avoid duplicated memcg enable check
  2024-12-22 14:51     ` Kairui Song
@ 2024-12-27  2:03       ` Huang, Ying
  0 siblings, 0 replies; 8+ messages in thread
From: Huang, Ying @ 2024-12-27  2:03 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Yosry Ahmed,
	Roman Gushchin, Shakeel Butt, Johannes Weiner, Barry Song,
	Michal Hocko, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> On Sun, Dec 22, 2024 at 9:33 PM Huang, Ying
> <ying.huang@linux.alibaba.com> wrote:
>>
>> Hi, Kairui,
>
> Hi Ying,
>
>>
>> Sorry for jumping in so late.
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > From: Kairui Song <kasong@tencent.com>
>> >
>> > mem_cgroup_uncharge_swap() includes a mem_cgroup_disabled() check,
>> > so the caller doesn't need to check that.
>> >
>> > Signed-off-by: Kairui Song <kasong@tencent.com>
>> > Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
>> > Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
>> > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>> > Acked-by: Chris Li <chrisl@kernel.org>
>> > ---
>> >  mm/memcontrol.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> > index 7b3503d12aaf..79900a486ed1 100644
>> > --- a/mm/memcontrol.c
>> > +++ b/mm/memcontrol.c
>> > @@ -4609,7 +4609,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
>> >        * correspond 1:1 to page and swap slot lifetimes: we charge the
>> >        * page to memory here, and uncharge swap when the slot is freed.
>> >        */
>> > -     if (!mem_cgroup_disabled() && do_memsw_account()) {
>> > +     if (do_memsw_account()) {
>> >               /*
>> >                * The swap entry might not get freed for a long time,
>> >                * let's not wait for it.  The page already received a
>>
>> I took a look at memcontrol.c; it appears that almost all extern
>> functions check mem_cgroup_disabled() as the first step.
>
> Hmm, just checked memcontrol.c and I saw quite a few extern functions
> not doing that, so I think that's not a convention.

I still think that it's a good idea to check whether memcg is disabled
in the outermost interfaces instead of having the check buried in some
internal functions.

>> that this is a convention of memcontrol.c?  And the benefit of the
>> change is minimal.  In contrast, if someone makes more changes to
>> mem_cgroup_swapin_uncharge_swap() in the future, he may forget to add
>> this back.  So, it may be unnecessary to make the change?
>
> This change is minimal indeed, it only helps to remove a few unneeded
> nops, still a gain though.

The benefit is minimal too.

> I think mem_cgroup_swapin_uncharge_swap should fade away in the future,

Good.  Then we don't need to optimize it either.  Just let it fade away.

> it's only for Cgroup V1, and it's a really simple function, just a
> wrapper for mem_cgroup_uncharge_swap, so I think this is not a
> problem?
>
> If you are concerned about this, this patch can be dropped from this
> series, rest of the patches still work the same.

Just my 2 cents.

---
Best Regards,
Huang, Ying


