* [PATCH v2 0/3] mm/swap_cgroup: remove global swap cgroup lock
@ 2024-12-10 9:28 Kairui Song
2024-12-10 9:28 ` [PATCH v2 1/3] mm, memcontrol: avoid duplicated memcg enable check Kairui Song
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Kairui Song @ 2024-12-10 9:28 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Chris Li, Hugh Dickins, Huang, Ying, Yosry Ahmed,
Roman Gushchin, Shakeel Butt, Johannes Weiner, Barry Song,
Michal Hocko, linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
This series removes the global swap cgroup lock. The critical section of
this lock is very short, but it is still a bottleneck for massively
parallel swap workloads.
Up to 10% performance gain was observed for a tmpfs kernel build test on a
48c96t system, with no regression in other cases:
Testing using a 64G brd device and building the kernel with make -j96 in a
1.5G memory cgroup using 4k folios showed the improvement below (10 test runs):
Before this series:
Sys time: 10809.46 (stdev 80.831491)
Real time: 171.41 (stdev 1.239894)
After this series:
Sys time: 9621.26 (stdev 34.620000), -10.42%
Real time: 160.00 (stdev 0.497814), -6.57%
With 64k folios and 2G memcg:
Before this series:
Sys time: 8231.99 (stdev 30.030994)
Real time: 143.57 (stdev 0.577394)
After this series:
Sys time: 7403.47 (stdev 6.270000), -10.06%
Real time: 135.18 (stdev 0.605000), -5.84%
Sequential swapout of 8G 64k zero folios (24 test runs):
Before this series:
5461409.12 us (stdev 183957.827084)
After this series:
5420447.26 us (stdev 196419.240317)
Sequential swapin of 8G 4k zero folios (24 test runs):
Before this series:
19736958.916667 us (stdev 189027.246676)
After this series:
19662182.629630 us (stdev 172717.640614)
V1: https://lore.kernel.org/linux-mm/20241202184154.19321-1-ryncsn@gmail.com/
Updates:
- Collected Review and Ack tags.
- Use a bit shift instead of a mixed usage of short and atomic for
  emulating the 2-byte xchg [Chris Li]
- Merge patch 3 into patch 4 for simplicity [Roman Gushchin]
- Drop the call of mem_cgroup_disabled() in patch 1 instead, also fixing a
  bot-reported build error [Yosry Ahmed]
- Wrap the access of the atomic_t map with helpers properly, so the
  emulation can be dropped in favor of a native 2-byte xchg once available.
Kairui Song (3):
mm, memcontrol: avoid duplicated memcg enable check
mm/swap_cgroup: remove swap_cgroup_cmpxchg
mm, swap_cgroup: remove global swap cgroup lock
include/linux/swap_cgroup.h | 2 -
mm/memcontrol.c | 2 +-
mm/swap_cgroup.c | 96 ++++++++++++++++---------------------
3 files changed, 43 insertions(+), 57 deletions(-)
--
2.47.1
* [PATCH v2 1/3] mm, memcontrol: avoid duplicated memcg enable check
2024-12-10 9:28 [PATCH v2 0/3] mm/swap_cgroup: remove global swap cgroup lock Kairui Song
@ 2024-12-10 9:28 ` Kairui Song
2024-12-10 9:28 ` [PATCH v2 2/3] mm/swap_cgroup: remove swap_cgroup_cmpxchg Kairui Song
2024-12-10 9:28 ` [PATCH v2 3/3] mm, swap_cgroup: remove global swap cgroup lock Kairui Song
2 siblings, 0 replies; 8+ messages in thread
From: Kairui Song @ 2024-12-10 9:28 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Chris Li, Hugh Dickins, Huang, Ying, Yosry Ahmed,
Roman Gushchin, Shakeel Butt, Johannes Weiner, Barry Song,
Michal Hocko, linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
mem_cgroup_uncharge_swap() includes a mem_cgroup_disabled() check,
so the caller doesn't need to check that.
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Chris Li <chrisl@kernel.org>
---
mm/memcontrol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..79900a486ed1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4609,7 +4609,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
* correspond 1:1 to page and swap slot lifetimes: we charge the
* page to memory here, and uncharge swap when the slot is freed.
*/
- if (!mem_cgroup_disabled() && do_memsw_account()) {
+ if (do_memsw_account()) {
/*
* The swap entry might not get freed for a long time,
* let's not wait for it. The page already received a
--
2.47.1
* [PATCH v2 2/3] mm/swap_cgroup: remove swap_cgroup_cmpxchg
2024-12-10 9:28 [PATCH v2 0/3] mm/swap_cgroup: remove global swap cgroup lock Kairui Song
2024-12-10 9:28 ` [PATCH v2 1/3] mm, memcontrol: avoid duplicated memcg enable check Kairui Song
@ 2024-12-10 9:28 ` Kairui Song
2024-12-10 9:28 ` [PATCH v2 3/3] mm, swap_cgroup: remove global swap cgroup lock Kairui Song
2 siblings, 0 replies; 8+ messages in thread
From: Kairui Song @ 2024-12-10 9:28 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Chris Li, Hugh Dickins, Huang, Ying, Yosry Ahmed,
Roman Gushchin, Shakeel Butt, Johannes Weiner, Barry Song,
Michal Hocko, linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
This function is never used after commit 6b611388b626
("memcg-v1: remove charge move code").
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Chris Li <chrisl@kernel.org>
---
include/linux/swap_cgroup.h | 2 --
mm/swap_cgroup.c | 29 -----------------------------
2 files changed, 31 deletions(-)
diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index ae73a87775b3..d521ad1c4164 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -6,8 +6,6 @@
#if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP)
-extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
- unsigned short old, unsigned short new);
extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
unsigned int nr_ents);
extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index f63d1aa072a1..1770b076f6b7 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -45,35 +45,6 @@ static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
return &ctrl->map[offset];
}
-/**
- * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
- * @ent: swap entry to be cmpxchged
- * @old: old id
- * @new: new id
- *
- * Returns old id at success, 0 at failure.
- * (There is no mem_cgroup using 0 as its id)
- */
-unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
- unsigned short old, unsigned short new)
-{
- struct swap_cgroup_ctrl *ctrl;
- struct swap_cgroup *sc;
- unsigned long flags;
- unsigned short retval;
-
- sc = lookup_swap_cgroup(ent, &ctrl);
-
- spin_lock_irqsave(&ctrl->lock, flags);
- retval = sc->id;
- if (retval == old)
- sc->id = new;
- else
- retval = 0;
- spin_unlock_irqrestore(&ctrl->lock, flags);
- return retval;
-}
-
/**
* swap_cgroup_record - record mem_cgroup for a set of swap entries
* @ent: the first swap entry to be recorded into
--
2.47.1
* [PATCH v2 3/3] mm, swap_cgroup: remove global swap cgroup lock
2024-12-10 9:28 [PATCH v2 0/3] mm/swap_cgroup: remove global swap cgroup lock Kairui Song
2024-12-10 9:28 ` [PATCH v2 1/3] mm, memcontrol: avoid duplicated memcg enable check Kairui Song
2024-12-10 9:28 ` [PATCH v2 2/3] mm/swap_cgroup: remove swap_cgroup_cmpxchg Kairui Song
@ 2024-12-10 9:28 ` Kairui Song
2024-12-11 1:19 ` Roman Gushchin
2024-12-14 16:07 ` Chris Li
2 siblings, 2 replies; 8+ messages in thread
From: Kairui Song @ 2024-12-10 9:28 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Chris Li, Hugh Dickins, Huang, Ying, Yosry Ahmed,
Roman Gushchin, Shakeel Butt, Johannes Weiner, Barry Song,
Michal Hocko, linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
commit e9e58a4ec3b1 ("memcg: avoid use cmpxchg in swap cgroup maintainance")
replaced the cmpxchg/xchg with a global irq spinlock because some archs
don't support 2-byte cmpxchg/xchg. Clearly this won't scale well.
And as commented in swap_cgroup.c, this lock is not needed for map
synchronization.
Emulating a 2-byte xchg with atomic cmpxchg isn't hard, so implement it
to get rid of this lock. Two helpers are introduced for doing so, and
they can easily be dropped once a generic 2-byte xchg is supported.
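The emulation boils down to a read-modify-write loop on the containing
atomic_t. A simplified sketch of the idea (names are illustrative; the
actual helpers in the diff below also map a swap offset to its slot and
shift):

    /* Illustrative sketch: exchange one 16-bit id inside an atomic_t. */
    static unsigned short id_xchg(atomic_t *word, unsigned int shift,
                                  unsigned short new)
    {
            unsigned int new_ids, old_ids = atomic_read(word);

            do {
                    /* Replace only the 16 bits selected by shift. */
                    new_ids = old_ids & ~(0xffffu << shift);
                    new_ids |= (unsigned int)new << shift;
            } while (!atomic_try_cmpxchg(word, &old_ids, new_ids));

            return (old_ids >> shift) & 0xffff;
    }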
Testing using a 64G brd device and building the kernel with make -j96 in a
1.5G memory cgroup using 4k folios showed the improvement below (10 test runs):
Before this series:
Sys time: 10809.46 (stdev 80.831491)
Real time: 171.41 (stdev 1.239894)
After this commit:
Sys time: 9621.26 (stdev 34.620000), -10.42%
Real time: 160.00 (stdev 0.497814), -6.57%
With 64k folios and 2G memcg:
Before this series:
Sys time: 8231.99 (stdev 30.030994)
Real time: 143.57 (stdev 0.577394)
After this commit:
Sys time: 7403.47 (stdev 6.270000), -10.06%
Real time: 135.18 (stdev 0.605000), -5.84%
Sequential swapout of 8G 64k zero folios with madvise (24 test runs):
Before this series:
5461409.12 us (stdev 183957.827084)
After this commit:
5420447.26 us (stdev 196419.240317)
Sequential swapin of 8G 4k zero folios (24 test runs):
Before this series:
19736958.916667 us (stdev 189027.246676)
After this commit:
19662182.629630 us (stdev 172717.640614)
Performance is better or at least not worse for all tests above.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap_cgroup.c | 73 +++++++++++++++++++++++++++++-------------------
1 file changed, 45 insertions(+), 28 deletions(-)
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index 1770b076f6b7..a0a8547dc85d 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -7,19 +7,20 @@
static DEFINE_MUTEX(swap_cgroup_mutex);
+/* Pack two cgroup id (short) of two entries in one swap_cgroup (atomic_t) */
+#define ID_PER_SC (sizeof(atomic_t) / sizeof(unsigned short))
+#define ID_SHIFT (BITS_PER_TYPE(unsigned short))
+#define ID_MASK (BIT(ID_SHIFT) - 1)
struct swap_cgroup {
- unsigned short id;
+ atomic_t ids;
};
struct swap_cgroup_ctrl {
struct swap_cgroup *map;
- spinlock_t lock;
};
static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
-#define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
-
/*
* SwapCgroup implements "lookup" and "exchange" operations.
* In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
@@ -30,19 +31,32 @@ static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
* SwapCache(and its swp_entry) is under lock.
* - When called via swap_free(), there is no user of this entry and no race.
* Then, we don't need lock around "exchange".
- *
- * TODO: we can push these buffers out to HIGHMEM.
*/
-static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
- struct swap_cgroup_ctrl **ctrlp)
+static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map,
+ pgoff_t offset)
{
- pgoff_t offset = swp_offset(ent);
- struct swap_cgroup_ctrl *ctrl;
+ unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;
+ unsigned int old_ids = atomic_read(&map[offset / ID_PER_SC].ids);
- ctrl = &swap_cgroup_ctrl[swp_type(ent)];
- if (ctrlp)
- *ctrlp = ctrl;
- return &ctrl->map[offset];
+ return (old_ids & (ID_MASK << shift)) >> shift;
+}
+
+static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
+ pgoff_t offset,
+ unsigned short new_id)
+{
+ unsigned short old_id;
+ unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;
+ struct swap_cgroup *sc = &map[offset / ID_PER_SC];
+ unsigned int new_ids, old_ids = atomic_read(&sc->ids);
+
+ do {
+ old_id = (old_ids & (ID_MASK << shift)) >> shift;
+ new_ids = (old_ids & ~(ID_MASK << shift));
+ new_ids |= ((unsigned int)new_id) << shift;
+ } while (!atomic_try_cmpxchg(&sc->ids, &old_ids, new_ids));
+
+ return old_id;
}
/**
@@ -58,21 +72,19 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
unsigned int nr_ents)
{
struct swap_cgroup_ctrl *ctrl;
- struct swap_cgroup *sc;
- unsigned short old;
- unsigned long flags;
pgoff_t offset = swp_offset(ent);
pgoff_t end = offset + nr_ents;
+ unsigned short old, iter;
+ struct swap_cgroup *map;
- sc = lookup_swap_cgroup(ent, &ctrl);
+ ctrl = &swap_cgroup_ctrl[swp_type(ent)];
+ map = ctrl->map;
- spin_lock_irqsave(&ctrl->lock, flags);
- old = sc->id;
- for (; offset < end; offset++, sc++) {
- VM_BUG_ON(sc->id != old);
- sc->id = id;
- }
- spin_unlock_irqrestore(&ctrl->lock, flags);
+ old = __swap_cgroup_id_lookup(map, offset);
+ do {
+ iter = __swap_cgroup_id_xchg(map, offset, id);
+ VM_BUG_ON(iter != old);
+ } while (++offset != end);
return old;
}
@@ -85,9 +97,13 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
*/
unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
{
+ struct swap_cgroup_ctrl *ctrl;
+
if (mem_cgroup_disabled())
return 0;
- return lookup_swap_cgroup(ent, NULL)->id;
+
+ ctrl = &swap_cgroup_ctrl[swp_type(ent)];
+ return __swap_cgroup_id_lookup(ctrl->map, swp_offset(ent));
}
int swap_cgroup_swapon(int type, unsigned long max_pages)
@@ -98,14 +114,15 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
if (mem_cgroup_disabled())
return 0;
- map = vcalloc(max_pages, sizeof(struct swap_cgroup));
+ BUILD_BUG_ON(!ID_PER_SC);
+ map = vcalloc(DIV_ROUND_UP(max_pages, ID_PER_SC),
+ sizeof(struct swap_cgroup));
if (!map)
goto nomem;
ctrl = &swap_cgroup_ctrl[type];
mutex_lock(&swap_cgroup_mutex);
ctrl->map = map;
- spin_lock_init(&ctrl->lock);
mutex_unlock(&swap_cgroup_mutex);
return 0;
--
2.47.1
* Re: [PATCH v2 3/3] mm, swap_cgroup: remove global swap cgroup lock
2024-12-10 9:28 ` [PATCH v2 3/3] mm, swap_cgroup: remove global swap cgroup lock Kairui Song
@ 2024-12-11 1:19 ` Roman Gushchin
2024-12-14 16:07 ` Chris Li
1 sibling, 0 replies; 8+ messages in thread
From: Roman Gushchin @ 2024-12-11 1:19 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Huang, Ying,
Yosry Ahmed, Shakeel Butt, Johannes Weiner, Barry Song,
Michal Hocko, linux-kernel
On Tue, Dec 10, 2024 at 05:28:05PM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> commit e9e58a4ec3b1 ("memcg: avoid use cmpxchg in swap cgroup maintainance")
> replaced the cmpxchg/xchg with a global irq spinlock because some archs
> doesn't support 2 bytes cmpxchg/xchg. Clearly this won't scale well.
>
> And as commented in swap_cgroup.c, this lock is not needed for map
> synchronization.
>
> Emulation of 2 bytes xchg with atomic cmpxchg isn't hard, so implement
> it to get rid of this lock. Introduced two helpers for doing so and they
> can be easily dropped if a generic 2 byte xchg is support.
>
> Testing using 64G brd and build with build kernel with make -j96 in 1.5G
> memory cgroup using 4k folios showed below improvement (10 test run):
>
> Before this series:
> Sys time: 10809.46 (stdev 80.831491)
> Real time: 171.41 (stdev 1.239894)
>
> After this commit:
> Sys time: 9621.26 (stdev 34.620000), -10.42%
> Real time: 160.00 (stdev 0.497814), -6.57%
>
> With 64k folios and 2G memcg:
> Before this series:
> Sys time: 8231.99 (stdev 30.030994)
> Real time: 143.57 (stdev 0.577394)
>
> After this commit:
> Sys time: 7403.47 (stdev 6.270000), -10.06%
> Real time: 135.18 (stdev 0.605000), -5.84%
>
> Sequential swapout of 8G 64k zero folios with madvise (24 test run):
> Before this series:
> 5461409.12 us (stdev 183957.827084)
>
> After this commit:
> 5420447.26 us (stdev 196419.240317)
>
> Sequential swapin of 8G 4k zero folios (24 test run):
> Before this series:
> 19736958.916667 us (stdev 189027.246676)
>
> After this commit:
> 19662182.629630 us (stdev 172717.640614)
>
> Performance is better or at least not worse for all tests above.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Thanks!
* Re: [PATCH v2 3/3] mm, swap_cgroup: remove global swap cgroup lock
2024-12-10 9:28 ` [PATCH v2 3/3] mm, swap_cgroup: remove global swap cgroup lock Kairui Song
2024-12-11 1:19 ` Roman Gushchin
@ 2024-12-14 16:07 ` Chris Li
2024-12-14 19:48 ` Kairui Song
1 sibling, 1 reply; 8+ messages in thread
From: Chris Li @ 2024-12-14 16:07 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Hugh Dickins, Huang, Ying, Yosry Ahmed,
Roman Gushchin, Shakeel Butt, Johannes Weiner, Barry Song,
Michal Hocko, linux-kernel
On Tue, Dec 10, 2024 at 1:29 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> commit e9e58a4ec3b1 ("memcg: avoid use cmpxchg in swap cgroup maintainance")
> replaced the cmpxchg/xchg with a global irq spinlock because some archs
> doesn't support 2 bytes cmpxchg/xchg. Clearly this won't scale well.
>
> And as commented in swap_cgroup.c, this lock is not needed for map
> synchronization.
>
> Emulation of 2 bytes xchg with atomic cmpxchg isn't hard, so implement
> it to get rid of this lock. Introduced two helpers for doing so and they
> can be easily dropped if a generic 2 byte xchg is support.
>
> Testing using 64G brd and build with build kernel with make -j96 in 1.5G
> memory cgroup using 4k folios showed below improvement (10 test run):
>
> Before this series:
> Sys time: 10809.46 (stdev 80.831491)
> Real time: 171.41 (stdev 1.239894)
>
> After this commit:
> Sys time: 9621.26 (stdev 34.620000), -10.42%
> Real time: 160.00 (stdev 0.497814), -6.57%
>
> With 64k folios and 2G memcg:
> Before this series:
> Sys time: 8231.99 (stdev 30.030994)
> Real time: 143.57 (stdev 0.577394)
>
> After this commit:
> Sys time: 7403.47 (stdev 6.270000), -10.06%
> Real time: 135.18 (stdev 0.605000), -5.84%
>
> Sequential swapout of 8G 64k zero folios with madvise (24 test run):
> Before this series:
> 5461409.12 us (stdev 183957.827084)
>
> After this commit:
> 5420447.26 us (stdev 196419.240317)
>
> Sequential swapin of 8G 4k zero folios (24 test run):
> Before this series:
> 19736958.916667 us (stdev 189027.246676)
>
> After this commit:
> 19662182.629630 us (stdev 172717.640614)
>
> Performance is better or at least not worse for all tests above.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/swap_cgroup.c | 73 +++++++++++++++++++++++++++++-------------------
> 1 file changed, 45 insertions(+), 28 deletions(-)
>
> diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
> index 1770b076f6b7..a0a8547dc85d 100644
> --- a/mm/swap_cgroup.c
> +++ b/mm/swap_cgroup.c
> @@ -7,19 +7,20 @@
>
> static DEFINE_MUTEX(swap_cgroup_mutex);
>
> +/* Pack two cgroup id (short) of two entries in one swap_cgroup (atomic_t) */
Might not be two shorts if atomic_t is more than 4 bytes. The
assumption here is that short is 2 bytes and atomic_t is 4 bytes; it
is hard to conclude that is the case for all architectures.
> +#define ID_PER_SC (sizeof(atomic_t) / sizeof(unsigned short))
You should use "sizeof(struct swap_cgroup) / sizeof(unsigned short)",
or get rid of struct swap_cgroup and directly use atomic_t.
> +#define ID_SHIFT (BITS_PER_TYPE(unsigned short))
> +#define ID_MASK (BIT(ID_SHIFT) - 1)
> struct swap_cgroup {
> - unsigned short id;
> + atomic_t ids;
You use struct swap_cgroup and atomic_t, which assumes no padding is
added to the struct. You might want to build an assert on
sizeof(atomic_t) == sizeof(struct swap_cgroup).
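For example, something along these lines at file scope (just a sketch):

    static_assert(sizeof(struct swap_cgroup) == sizeof(atomic_t));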
> };
>
> struct swap_cgroup_ctrl {
> struct swap_cgroup *map;
> - spinlock_t lock;
> };
>
> static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
>
> -#define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
> -
> /*
> * SwapCgroup implements "lookup" and "exchange" operations.
> * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
> @@ -30,19 +31,32 @@ static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
> * SwapCache(and its swp_entry) is under lock.
> * - When called via swap_free(), there is no user of this entry and no race.
> * Then, we don't need lock around "exchange".
> - *
> - * TODO: we can push these buffers out to HIGHMEM.
> */
> -static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
> - struct swap_cgroup_ctrl **ctrlp)
> +static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map,
> + pgoff_t offset)
> {
> - pgoff_t offset = swp_offset(ent);
> - struct swap_cgroup_ctrl *ctrl;
> + unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;
Might not want to assume that ID_PER_SC is two. If some architecture's
atomic_t is 64 bits, then this code will break.
> + unsigned int old_ids = atomic_read(&map[offset / ID_PER_SC].ids);
This assumes sizeof(unsigned int) == sizeof(atomic_t). Again, some
strange architecture might break it. Better to use the unsigned
version of atomic_t.
>
> - ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> - if (ctrlp)
> - *ctrlp = ctrl;
> - return &ctrl->map[offset];
> + return (old_ids & (ID_MASK << shift)) >> shift;
Can be simplified as (old_ids >> shift) & ID_MASK. You might want to
double check that.
> +}
> +
> +static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
> + pgoff_t offset,
> + unsigned short new_id)
> +{
> + unsigned short old_id;
> + unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;
Same here, it assumes ID_PER_SC is 2.
> + struct swap_cgroup *sc = &map[offset / ID_PER_SC];
> + unsigned int new_ids, old_ids = atomic_read(&sc->ids);
Again it assumes sizeof(unsigned int) == sizeof(atomic_t).
> +
> + do {
> + old_id = (old_ids & (ID_MASK << shift)) >> shift;
Can be simplified:
old_id = (old_ids >> shift) & ID_MASK;
> + new_ids = (old_ids & ~(ID_MASK << shift));
> + new_ids |= ((unsigned int)new_id) << shift;
new_ids |= (atomic_t) new_id << shift;
> + } while (!atomic_try_cmpxchg(&sc->ids, &old_ids, new_ids));
> +
> + return old_id;
> }
>
> /**
> @@ -58,21 +72,19 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
> unsigned int nr_ents)
> {
> struct swap_cgroup_ctrl *ctrl;
> - struct swap_cgroup *sc;
> - unsigned short old;
> - unsigned long flags;
> pgoff_t offset = swp_offset(ent);
> pgoff_t end = offset + nr_ents;
> + unsigned short old, iter;
> + struct swap_cgroup *map;
>
> - sc = lookup_swap_cgroup(ent, &ctrl);
> + ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> + map = ctrl->map;
>
> - spin_lock_irqsave(&ctrl->lock, flags);
> - old = sc->id;
> - for (; offset < end; offset++, sc++) {
> - VM_BUG_ON(sc->id != old);
> - sc->id = id;
> - }
> - spin_unlock_irqrestore(&ctrl->lock, flags);
The above will always assign all nr_ents swap entries atomically.
> + old = __swap_cgroup_id_lookup(map, offset);
> + do {
> + iter = __swap_cgroup_id_xchg(map, offset, id);
> + VM_BUG_ON(iter != old);
> + } while (++offset != end);
Here it is possible that some of the nr_ents entries can be changed
while the offset is still in the loop. Might want to examine whether
the caller can trigger that or not. We want to make sure it is safe to
do so when removing the spin lock; the nr_ents entries might not be
updated to the same value if two callers race.
>
> return old;
> }
> @@ -85,9 +97,13 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
> */
> unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
> {
> + struct swap_cgroup_ctrl *ctrl;
> +
> if (mem_cgroup_disabled())
> return 0;
> - return lookup_swap_cgroup(ent, NULL)->id;
> +
> + ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> + return __swap_cgroup_id_lookup(ctrl->map, swp_offset(ent));
> }
>
> int swap_cgroup_swapon(int type, unsigned long max_pages)
> @@ -98,14 +114,15 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
> if (mem_cgroup_disabled())
> return 0;
>
> - map = vcalloc(max_pages, sizeof(struct swap_cgroup));
> + BUILD_BUG_ON(!ID_PER_SC);
It is simpler just to assert on: sizeof(atomic_t) >= sizeof(unsigned short).
I think that is what it does here.
You might also want to assert: !(sizeof(atomic_t) % sizeof(unsigned short))
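E.g., in swap_cgroup_swapon(), as a sketch:

    BUILD_BUG_ON(sizeof(atomic_t) < sizeof(unsigned short));
    BUILD_BUG_ON(sizeof(atomic_t) % sizeof(unsigned short));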
Chris
> + map = vcalloc(DIV_ROUND_UP(max_pages, ID_PER_SC),
> + sizeof(struct swap_cgroup));
> if (!map)
> goto nomem;
>
> ctrl = &swap_cgroup_ctrl[type];
> mutex_lock(&swap_cgroup_mutex);
> ctrl->map = map;
> - spin_lock_init(&ctrl->lock);
> mutex_unlock(&swap_cgroup_mutex);
>
> return 0;
> --
> 2.47.1
>
* Re: [PATCH v2 3/3] mm, swap_cgroup: remove global swap cgroup lock
2024-12-14 16:07 ` Chris Li
@ 2024-12-14 19:48 ` Kairui Song
2024-12-15 15:04 ` Chris Li
0 siblings, 1 reply; 8+ messages in thread
From: Kairui Song @ 2024-12-14 19:48 UTC (permalink / raw)
To: Chris Li
Cc: linux-mm, Andrew Morton, Hugh Dickins, Huang, Ying, Yosry Ahmed,
Roman Gushchin, Shakeel Butt, Johannes Weiner, Barry Song,
Michal Hocko, linux-kernel
On Sun, Dec 15, 2024 at 12:07 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Dec 10, 2024 at 1:29 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > commit e9e58a4ec3b1 ("memcg: avoid use cmpxchg in swap cgroup maintainance")
> > replaced the cmpxchg/xchg with a global irq spinlock because some archs
> > doesn't support 2 bytes cmpxchg/xchg. Clearly this won't scale well.
> >
> > And as commented in swap_cgroup.c, this lock is not needed for map
> > synchronization.
> >
> > Emulation of 2 bytes xchg with atomic cmpxchg isn't hard, so implement
> > it to get rid of this lock. Introduced two helpers for doing so and they
> > can be easily dropped if a generic 2 byte xchg is support.
> >
> > Testing using 64G brd and build with build kernel with make -j96 in 1.5G
> > memory cgroup using 4k folios showed below improvement (10 test run):
> >
> > Before this series:
> > Sys time: 10809.46 (stdev 80.831491)
> > Real time: 171.41 (stdev 1.239894)
> >
> > After this commit:
> > Sys time: 9621.26 (stdev 34.620000), -10.42%
> > Real time: 160.00 (stdev 0.497814), -6.57%
> >
> > With 64k folios and 2G memcg:
> > Before this series:
> > Sys time: 8231.99 (stdev 30.030994)
> > Real time: 143.57 (stdev 0.577394)
> >
> > After this commit:
> > Sys time: 7403.47 (stdev 6.270000), -10.06%
> > Real time: 135.18 (stdev 0.605000), -5.84%
> >
> > Sequential swapout of 8G 64k zero folios with madvise (24 test run):
> > Before this series:
> > 5461409.12 us (stdev 183957.827084)
> >
> > After this commit:
> > 5420447.26 us (stdev 196419.240317)
> >
> > Sequential swapin of 8G 4k zero folios (24 test run):
> > Before this series:
> > 19736958.916667 us (stdev 189027.246676)
> >
> > After this commit:
> > 19662182.629630 us (stdev 172717.640614)
> >
> > Performance is better or at least not worse for all tests above.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/swap_cgroup.c | 73 +++++++++++++++++++++++++++++-------------------
> > 1 file changed, 45 insertions(+), 28 deletions(-)
> >
> > diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
> > index 1770b076f6b7..a0a8547dc85d 100644
> > --- a/mm/swap_cgroup.c
> > +++ b/mm/swap_cgroup.c
> > @@ -7,19 +7,20 @@
> >
> > static DEFINE_MUTEX(swap_cgroup_mutex);
> >
> > +/* Pack two cgroup id (short) of two entries in one swap_cgroup (atomic_t) */
>
> Might not be two short if the atomic_t is more than 4 bytes. The
> assumption here is that short is 2 bytes and atomic_t is 4 bytes. It
> is hard to conclude that is the case for all architecture.
>
> > +#define ID_PER_SC (sizeof(atomic_t) / sizeof(unsigned short))
>
> You should use "sizeof(struct swap_cgroup) / sizeof(unsigned short)",
> or get rid of struct swap_cgroup and directly use atomic_t.
>
> > +#define ID_SHIFT (BITS_PER_TYPE(unsigned short))
> > +#define ID_MASK (BIT(ID_SHIFT) - 1)
> > struct swap_cgroup {
> > - unsigned short id;
> > + atomic_t ids;
>
> You use struct swap_cgroup and atomic_t which assumes no padding added
> to the struct. You might want to build an assert on sizeof(atomic_t)
> == sizeof(struct swap_cgroup).
Good idea, asserting that struct swap_cgroup is the same size as an
atomic_t ensures no unexpected behaviour.
>
> > };
> >
> > struct swap_cgroup_ctrl {
> > struct swap_cgroup *map;
> > - spinlock_t lock;
> > };
> >
> > static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
> >
> > -#define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
> > -
> > /*
> > * SwapCgroup implements "lookup" and "exchange" operations.
> > * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
> > @@ -30,19 +31,32 @@ static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
> > * SwapCache(and its swp_entry) is under lock.
> > * - When called via swap_free(), there is no user of this entry and no race.
> > * Then, we don't need lock around "exchange".
> > - *
> > - * TODO: we can push these buffers out to HIGHMEM.
> > */
> > -static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
> > - struct swap_cgroup_ctrl **ctrlp)
> > +static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map,
> > + pgoff_t offset)
> > {
> > - pgoff_t offset = swp_offset(ent);
> > - struct swap_cgroup_ctrl *ctrl;
> > + unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;
>
> Might not want to assume the ID_PER_SC is two. If some architecture
> atomic_t is 64 bits then that code will break.
Good idea. atomic_t is defined as an int, and I'm not sure whether any
strange arch would change the size, but more robust code is always
better.
I can change this to (offset % ID_PER_SC) instead; the generated machine
code should still be the same for most archs.
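Something like this, as a sketch (the packing order within the word
changes, but lookup and xchg only need to agree with each other):

    unsigned int shift = (offset % ID_PER_SC) * ID_SHIFT;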
>
> > + unsigned int old_ids = atomic_read(&map[offset / ID_PER_SC].ids);
>
> Here assume sizeof(unsigned int) == sizeof(atomic_t). Again,some
> strange architecture might break it. Better use unsigned version of
> aotmic_t;
>
>
> >
> > - ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> > - if (ctrlp)
> > - *ctrlp = ctrl;
> > - return &ctrl->map[offset];
> > + return (old_ids & (ID_MASK << shift)) >> shift;
>
> Can be simplified as (old_ids >> shift) & ID_MASK. You might want to
> double check that.
>
> > +}
> > +
> > +static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
> > + pgoff_t offset,
> > + unsigned short new_id)
> > +{
> > + unsigned short old_id;
> > + unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;
>
> Same here, it assumes ID_PER_SC is 2.
>
> > + struct swap_cgroup *sc = &map[offset / ID_PER_SC];
> > + unsigned int new_ids, old_ids = atomic_read(&sc->ids);
>
> Again it assumes sizeof(unsigned int) == sizeof(atomic_t).
I think this should be OK? The documentation says "atomic_t, atomic_long_t
and atomic64_t use int, long and s64 respectively".
I could wrap this with some helper, but I think it's unnecessary.
>
> > +
> > + do {
> > + old_id = (old_ids & (ID_MASK << shift)) >> shift;
> Can be simplify:
> old_id = (old_ids >> shift) & ID_MASK;
>
> > + new_ids = (old_ids & ~(ID_MASK << shift));
> > + new_ids |= ((unsigned int)new_id) << shift;
>
> new_ids |= (atomic_t) new_id << shift;
atomic_t doesn't work with bit operations; it must be an arithmetic
type, so I think this has to stay as it is.
>
> > + } while (!atomic_try_cmpxchg(&sc->ids, &old_ids, new_ids));
> > +
> > + return old_id;
> > }
> >
> > /**
> > @@ -58,21 +72,19 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
> > unsigned int nr_ents)
> > {
> > struct swap_cgroup_ctrl *ctrl;
> > - struct swap_cgroup *sc;
> > - unsigned short old;
> > - unsigned long flags;
> > pgoff_t offset = swp_offset(ent);
> > pgoff_t end = offset + nr_ents;
> > + unsigned short old, iter;
> > + struct swap_cgroup *map;
> >
> > - sc = lookup_swap_cgroup(ent, &ctrl);
> > + ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> > + map = ctrl->map;
> >
> > - spin_lock_irqsave(&ctrl->lock, flags);
> > - old = sc->id;
> > - for (; offset < end; offset++, sc++) {
> > - VM_BUG_ON(sc->id != old);
> > - sc->id = id;
> > - }
> > - spin_unlock_irqrestore(&ctrl->lock, flags);
>
> The above will always assign nr_ents of swap entry atomically.
>
> > + old = __swap_cgroup_id_lookup(map, offset);
> > + do {
> > + iter = __swap_cgroup_id_xchg(map, offset, id);
> > + VM_BUG_ON(iter != old);
> > + } while (++offset != end);
>
> Here it is possible that some of the nr_ents can be changed while the
> offset is still in the loop. Might want to examine if the caller can
> trigger that or not. We want to make sure it is safe to do so, when
> removing the spin lock, the nr_ents might not update to the same value
> if two callers race it.
Right, the problem is explained in the comments at the beginning of
this file; I can update that with more details:
Entries are always charged / uncharged as part of one folio, or
uncharged with no folio involved (no more users, and other folios
can't use these entries until freeing is done).
Since a folio is always charged or uncharged as a whole, charge /
uncharge never happens concurrently, and the xchg here implies a full
barrier, the set of entries will also always be used as a whole.
So yes, this function does imply the caller must always pass in swap
entries belonging to one single folio, or entries that have no users.
It's quite fragile indeed; I can make the caller pass in the folio as
an argument to clarify this, with some more WARN or BUG_ON checks.
> >
> > return old;
> > }
> > @@ -85,9 +97,13 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
> > */
> > unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
> > {
> > + struct swap_cgroup_ctrl *ctrl;
> > +
> > if (mem_cgroup_disabled())
> > return 0;
> > - return lookup_swap_cgroup(ent, NULL)->id;
> > +
> > + ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> > + return __swap_cgroup_id_lookup(ctrl->map, swp_offset(ent));
> > }
> >
> > int swap_cgroup_swapon(int type, unsigned long max_pages)
> > @@ -98,14 +114,15 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
> > if (mem_cgroup_disabled())
> > return 0;
> >
> > - map = vcalloc(max_pages, sizeof(struct swap_cgroup));
> > + BUILD_BUG_ON(!ID_PER_SC);
>
> It is simpler just to assert on: sizeof(atomic_t) >= sizeof(unsigned short).
> I think that is what it does here.
>
> You might also want to assert: !(sizeof(atomic_t) % sizeof(unsigned short))
Good idea.
>
> Chris
>
> > + map = vcalloc(DIV_ROUND_UP(max_pages, ID_PER_SC),
> > + sizeof(struct swap_cgroup));
> > if (!map)
> > goto nomem;
> >
> > ctrl = &swap_cgroup_ctrl[type];
> > mutex_lock(&swap_cgroup_mutex);
> > ctrl->map = map;
> > - spin_lock_init(&ctrl->lock);
> > mutex_unlock(&swap_cgroup_mutex);
> >
> > return 0;
> > --
> > 2.47.1
> >
>
* Re: [PATCH v2 3/3] mm, swap_cgroup: remove global swap cgroup lock
2024-12-14 19:48 ` Kairui Song
@ 2024-12-15 15:04 ` Chris Li
0 siblings, 0 replies; 8+ messages in thread
From: Chris Li @ 2024-12-15 15:04 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Hugh Dickins, Huang, Ying, Yosry Ahmed,
Roman Gushchin, Shakeel Butt, Johannes Weiner, Barry Song,
Michal Hocko, linux-kernel
On Sat, Dec 14, 2024 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sun, Dec 15, 2024 at 12:07 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Tue, Dec 10, 2024 at 1:29 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > commit e9e58a4ec3b1 ("memcg: avoid use cmpxchg in swap cgroup maintainance")
> > > replaced the cmpxchg/xchg with a global irq spinlock because some archs
> > > doesn't support 2 bytes cmpxchg/xchg. Clearly this won't scale well.
> > >
> > > And as commented in swap_cgroup.c, this lock is not needed for map
> > > synchronization.
> > >
> > > Emulation of 2 bytes xchg with atomic cmpxchg isn't hard, so implement
> > > it to get rid of this lock. Introduced two helpers for doing so and they
> > > can be easily dropped if a generic 2 byte xchg is support.
> > >
> > > Testing using 64G brd and build with build kernel with make -j96 in 1.5G
> > > memory cgroup using 4k folios showed below improvement (10 test run):
> > >
> > > Before this series:
> > > Sys time: 10809.46 (stdev 80.831491)
> > > Real time: 171.41 (stdev 1.239894)
> > >
> > > After this commit:
> > > Sys time: 9621.26 (stdev 34.620000), -10.42%
> > > Real time: 160.00 (stdev 0.497814), -6.57%
> > >
> > > With 64k folios and 2G memcg:
> > > Before this series:
> > > Sys time: 8231.99 (stdev 30.030994)
> > > Real time: 143.57 (stdev 0.577394)
> > >
> > > After this commit:
> > > Sys time: 7403.47 (stdev 6.270000), -10.06%
> > > Real time: 135.18 (stdev 0.605000), -5.84%
> > >
> > > Sequential swapout of 8G 64k zero folios with madvise (24 test run):
> > > Before this series:
> > > 5461409.12 us (stdev 183957.827084)
> > >
> > > After this commit:
> > > 5420447.26 us (stdev 196419.240317)
> > >
> > > Sequential swapin of 8G 4k zero folios (24 test run):
> > > Before this series:
> > > 19736958.916667 us (stdev 189027.246676)
> > >
> > > After this commit:
> > > 19662182.629630 us (stdev 172717.640614)
> > >
> > > Performance is better or at least not worse for all tests above.
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > > mm/swap_cgroup.c | 73 +++++++++++++++++++++++++++++-------------------
> > > 1 file changed, 45 insertions(+), 28 deletions(-)
> > >
> > > diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
> > > index 1770b076f6b7..a0a8547dc85d 100644
> > > --- a/mm/swap_cgroup.c
> > > +++ b/mm/swap_cgroup.c
> > > @@ -7,19 +7,20 @@
> > >
> > > static DEFINE_MUTEX(swap_cgroup_mutex);
> > >
> > > +/* Pack two cgroup id (short) of two entries in one swap_cgroup (atomic_t) */
> >
> > Might not be two short if the atomic_t is more than 4 bytes. The
> > assumption here is that short is 2 bytes and atomic_t is 4 bytes. It
> > is hard to conclude that is the case for all architecture.
> >
> > > +#define ID_PER_SC (sizeof(atomic_t) / sizeof(unsigned short))
> >
> > You should use "sizeof(struct swap_cgroup) / sizeof(unsigned short)",
> > or get rid of struct swap_cgroup and directly use atomic_t.
> >
> > > +#define ID_SHIFT (BITS_PER_TYPE(unsigned short))
> > > +#define ID_MASK (BIT(ID_SHIFT) - 1)
> > > struct swap_cgroup {
> > > - unsigned short id;
> > > + atomic_t ids;
> >
> > You use struct swap_cgroup and atomic_t which assumes no padding added
> > to the struct. You might want to build an assert on sizeof(atomic_t)
> > == sizeof(struct swap_cgroup).
>
> Good idea, asserting struct swap_croup is an atomic_t ensures no
> unexpected behaviour.
>
> >
> > > };
> > >
> > > struct swap_cgroup_ctrl {
> > > struct swap_cgroup *map;
> > > - spinlock_t lock;
> > > };
> > >
> > > static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
> > >
> > > -#define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
> > > -
> > > /*
> > > * SwapCgroup implements "lookup" and "exchange" operations.
> > > * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
> > > @@ -30,19 +31,32 @@ static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
> > > * SwapCache(and its swp_entry) is under lock.
> > > * - When called via swap_free(), there is no user of this entry and no race.
> > > * Then, we don't need lock around "exchange".
> > > - *
> > > - * TODO: we can push these buffers out to HIGHMEM.
> > > */
> > > -static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
> > > - struct swap_cgroup_ctrl **ctrlp)
> > > +static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map,
> > > + pgoff_t offset)
> > > {
> > > - pgoff_t offset = swp_offset(ent);
> > > - struct swap_cgroup_ctrl *ctrl;
> > > + unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;
> >
> > Might not want to assume the ID_PER_SC is two. If some architecture
> > atomic_t is 64 bits then that code will break.
>
> Good idea, atomic_t is by defining an int, not sure if there is any
> strange archs will change the size though, more robust code is always
> better.
>
> Can change this to (offset % ID_PER_SC) instead, the generated machine
> code should be still the same for most archs.
>
> >
> > > + unsigned int old_ids = atomic_read(&map[offset / ID_PER_SC].ids);
> >
> > Here assume sizeof(unsigned int) == sizeof(atomic_t). Again,some
> > strange architecture might break it. Better use unsigned version of
> > aotmic_t;
> >
> >
> > >
> > > - ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> > > - if (ctrlp)
> > > - *ctrlp = ctrl;
> > > - return &ctrl->map[offset];
> > > + return (old_ids & (ID_MASK << shift)) >> shift;
> >
> > Can be simplified as (old_ids >> shift) & ID_MASK. You might want to
> > double check that.
> >
> > > +}
> > > +
> > > +static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
> > > + pgoff_t offset,
> > > + unsigned short new_id)
> > > +{
> > > + unsigned short old_id;
> > > + unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;
> >
> > Same here, it assumes ID_PER_SC is 2.
> >
> > > + struct swap_cgroup *sc = &map[offset / ID_PER_SC];
> > > + unsigned int new_ids, old_ids = atomic_read(&sc->ids);
> >
> > Again it assumes sizeof(unsigned int) == sizeof(atomic_t).
>
> I think this should be OK? The document says "atomic_t, atomic_long_t
> and atomic64_t use int, long and s64 respectively".
>
> Could change this with some wrapper but I think it's unnecessary.
Ack.
>
> >
> > > +
> > > + do {
> > > + old_id = (old_ids & (ID_MASK << shift)) >> shift;
> > Can be simplify:
> > old_id = (old_ids >> shift) & ID_MASK;
> >
> > > + new_ids = (old_ids & ~(ID_MASK << shift));
> > > + new_ids |= ((unsigned int)new_id) << shift;
> >
> > new_ids |= (atomic_t) new_id << shift;
>
> atomic_t doesn't work with bit operations, it must be an arithmetic
> type, so here I think it has to stay like this.
Ack
Chris