From: Chris Li <chrisl@kernel.org>
To: Kairui Song <kasong@tencent.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Hugh Dickins <hughd@google.com>,
"Huang, Ying" <ying.huang@linux.alibaba.com>,
Yosry Ahmed <yosryahmed@google.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Johannes Weiner <hannes@cmpxchg.org>,
Barry Song <baohua@kernel.org>, Michal Hocko <mhocko@kernel.org>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 3/3] mm, swap_cgroup: remove global swap cgroup lock
Date: Sat, 14 Dec 2024 08:07:08 -0800
Message-ID: <CACePvbUoije1wgy3jPambP9-rbYs_Yq1Pajnv3U1MDOxFGU2fg@mail.gmail.com>
In-Reply-To: <20241210092805.87281-4-ryncsn@gmail.com>
On Tue, Dec 10, 2024 at 1:29 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> commit e9e58a4ec3b1 ("memcg: avoid use cmpxchg in swap cgroup maintainance")
> replaced the cmpxchg/xchg with a global irq spinlock because some archs
> doesn't support 2 bytes cmpxchg/xchg. Clearly this won't scale well.
>
> And as commented in swap_cgroup.c, this lock is not needed for map
> synchronization.
>
> Emulation of 2 bytes xchg with atomic cmpxchg isn't hard, so implement
> it to get rid of this lock. Introduced two helpers for doing so and they
> can be easily dropped if a generic 2 byte xchg is support.
>
> Testing using 64G brd and build with build kernel with make -j96 in 1.5G
> memory cgroup using 4k folios showed below improvement (10 test run):
>
> Before this series:
> Sys time: 10809.46 (stdev 80.831491)
> Real time: 171.41 (stdev 1.239894)
>
> After this commit:
> Sys time: 9621.26 (stdev 34.620000), -10.42%
> Real time: 160.00 (stdev 0.497814), -6.57%
>
> With 64k folios and 2G memcg:
> Before this series:
> Sys time: 8231.99 (stdev 30.030994)
> Real time: 143.57 (stdev 0.577394)
>
> After this commit:
> Sys time: 7403.47 (stdev 6.270000), -10.06%
> Real time: 135.18 (stdev 0.605000), -5.84%
>
> Sequential swapout of 8G 64k zero folios with madvise (24 test run):
> Before this series:
> 5461409.12 us (stdev 183957.827084)
>
> After this commit:
> 5420447.26 us (stdev 196419.240317)
>
> Sequential swapin of 8G 4k zero folios (24 test run):
> Before this series:
> 19736958.916667 us (stdev 189027.246676)
>
> After this commit:
> 19662182.629630 us (stdev 172717.640614)
>
> Performance is better or at least not worse for all tests above.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/swap_cgroup.c | 73 +++++++++++++++++++++++++++++-------------------
> 1 file changed, 45 insertions(+), 28 deletions(-)
>
> diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
> index 1770b076f6b7..a0a8547dc85d 100644
> --- a/mm/swap_cgroup.c
> +++ b/mm/swap_cgroup.c
> @@ -7,19 +7,20 @@
>
> static DEFINE_MUTEX(swap_cgroup_mutex);
>
> +/* Pack two cgroup id (short) of two entries in one swap_cgroup (atomic_t) */
This might not be two shorts if atomic_t is larger than 4 bytes. The
comment assumes that short is 2 bytes and atomic_t is 4 bytes, and it
is hard to be sure that holds on every architecture.
> +#define ID_PER_SC (sizeof(atomic_t) / sizeof(unsigned short))
You should use "sizeof(struct swap_cgroup) / sizeof(unsigned short)",
or get rid of struct swap_cgroup and directly use atomic_t.
> +#define ID_SHIFT (BITS_PER_TYPE(unsigned short))
> +#define ID_MASK (BIT(ID_SHIFT) - 1)
> struct swap_cgroup {
> - unsigned short id;
> + atomic_t ids;
The code uses struct swap_cgroup and atomic_t interchangeably, which
assumes no padding is added to the struct. You might want to add a
build-time assert that sizeof(atomic_t) == sizeof(struct swap_cgroup).
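
A minimal user-space sketch of such an assert, for illustration only. The
atomic_t typedef below is a stand-in I am assuming for the example (a
struct wrapping one int), and ids_per_sc() is a hypothetical helper; in
the kernel the check would more likely be a BUILD_BUG_ON() next to the
one the patch already adds in swap_cgroup_swapon().

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>

/* Stand-in for the kernel's atomic_t; assumed to wrap a single int. */
typedef struct { int counter; } atomic_t;

struct swap_cgroup {
	atomic_t ids;
};

/* Fails to compile if the wrapper struct ever picks up padding. */
static_assert(sizeof(struct swap_cgroup) == sizeof(atomic_t),
	      "struct swap_cgroup must be exactly one atomic_t");

/* Hypothetical helper: ids per map element, derived from the struct. */
size_t ids_per_sc(void)
{
	return sizeof(struct swap_cgroup) / sizeof(unsigned short);
}
```

Deriving the count from the struct keeps the map indexing and the
allocation size tied to the same type.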
> };
>
> struct swap_cgroup_ctrl {
> struct swap_cgroup *map;
> - spinlock_t lock;
> };
>
> static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
>
> -#define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
> -
> /*
> * SwapCgroup implements "lookup" and "exchange" operations.
> * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
> @@ -30,19 +31,32 @@ static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
> * SwapCache(and its swp_entry) is under lock.
> * - When called via swap_free(), there is no user of this entry and no race.
> * Then, we don't need lock around "exchange".
> - *
> - * TODO: we can push these buffers out to HIGHMEM.
> */
> -static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
> - struct swap_cgroup_ctrl **ctrlp)
> +static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map,
> + pgoff_t offset)
> {
> - pgoff_t offset = swp_offset(ent);
> - struct swap_cgroup_ctrl *ctrl;
> + unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;
You might not want to assume that ID_PER_SC is two. If atomic_t is
64 bits on some architecture, this code will break.
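
One way to avoid hard-coding two slots is to derive the shift from
ID_PER_SC. The following is a user-space sketch under assumptions: ids_t
is a hypothetical stand-in for the packed word (it must be wider than
unsigned short), and the slot order (slot 0 in the low bits) is my choice
for the example and differs from the patch, which puts even offsets in
the upper half.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

typedef unsigned int ids_t;  /* hypothetical stand-in for the packed word */

#define ID_SHIFT  (sizeof(unsigned short) * CHAR_BIT)
#define ID_PER_SC (sizeof(ids_t) / sizeof(unsigned short))
#define ID_MASK   ((((ids_t)1) << ID_SHIFT) - 1)

/* Bit position of the id for this offset; valid for any ID_PER_SC >= 1. */
unsigned int id_shift_for(unsigned long offset)
{
	return (unsigned int)((offset % ID_PER_SC) * ID_SHIFT);
}

/* Read the id stored for offset from an array of packed words. */
unsigned short id_lookup(const ids_t *map, unsigned long offset)
{
	ids_t ids;

	assert(map != NULL);
	ids = map[offset / ID_PER_SC];
	return (unsigned short)((ids >> id_shift_for(offset)) & ID_MASK);
}
```

With this form, widening ids_t to a 64-bit type changes ID_PER_SC to 4
and the shift computation follows without further edits.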
> + unsigned int old_ids = atomic_read(&map[offset / ID_PER_SC].ids);
This assumes sizeof(unsigned int) == sizeof(atomic_t). Again, some
strange architecture might break it. Better to use an unsigned version
of atomic_t.
>
> - ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> - if (ctrlp)
> - *ctrlp = ctrl;
> - return &ctrl->map[offset];
> + return (old_ids & (ID_MASK << shift)) >> shift;
Can be simplified as (old_ids >> shift) & ID_MASK. You might want to
double check that.
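
A quick user-space way to double check the two forms, as a sketch. It
assumes a 16-bit unsigned short and a 32-bit unsigned int, so that
ID_MASK << shift does not overflow for shifts 0 and 16; the function
names are made up for the example.

```c
#include <assert.h>
#include <limits.h>

#define ID_SHIFT (sizeof(unsigned short) * CHAR_BIT)
#define ID_MASK  ((1u << ID_SHIFT) - 1)

/* Form used in the patch: mask in place, then shift down. */
unsigned int extract_mask_first(unsigned int ids, unsigned int shift)
{
	return (ids & (ID_MASK << shift)) >> shift;
}

/* Suggested form: shift down, then mask. */
unsigned int extract_shift_first(unsigned int ids, unsigned int shift)
{
	return (ids >> shift) & ID_MASK;
}

/* Compare both forms over a spread of sample words, at both shifts. */
void check_extract_forms(void)
{
	unsigned int i;

	for (i = 0; i < (1u << 20); i++) {
		unsigned int ids = i * 2654435761u; /* scatter the bits */

		assert(extract_mask_first(ids, 0) ==
		       extract_shift_first(ids, 0));
		assert(extract_mask_first(ids, ID_SHIFT) ==
		       extract_shift_first(ids, ID_SHIFT));
	}
}
```

The forms agree because (ID_MASK << shift) >> shift is ID_MASK again
whenever the left shift does not push mask bits out of the word.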
> +}
> +
> +static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
> + pgoff_t offset,
> + unsigned short new_id)
> +{
> + unsigned short old_id;
> + unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;
Same here, it assumes ID_PER_SC is 2.
> + struct swap_cgroup *sc = &map[offset / ID_PER_SC];
> + unsigned int new_ids, old_ids = atomic_read(&sc->ids);
Again it assumes sizeof(unsigned int) == sizeof(atomic_t).
> +
> + do {
> + old_id = (old_ids & (ID_MASK << shift)) >> shift;
Can be simplified:
old_id = (old_ids >> shift) & ID_MASK;
> + new_ids = (old_ids & ~(ID_MASK << shift));
> + new_ids |= ((unsigned int)new_id) << shift;
new_ids |= (atomic_t) new_id << shift;
> + } while (!atomic_try_cmpxchg(&sc->ids, &old_ids, new_ids));
> +
> + return old_id;
> }
>
> /**
> @@ -58,21 +72,19 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
> unsigned int nr_ents)
> {
> struct swap_cgroup_ctrl *ctrl;
> - struct swap_cgroup *sc;
> - unsigned short old;
> - unsigned long flags;
> pgoff_t offset = swp_offset(ent);
> pgoff_t end = offset + nr_ents;
> + unsigned short old, iter;
> + struct swap_cgroup *map;
>
> - sc = lookup_swap_cgroup(ent, &ctrl);
> + ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> + map = ctrl->map;
>
> - spin_lock_irqsave(&ctrl->lock, flags);
> - old = sc->id;
> - for (; offset < end; offset++, sc++) {
> - VM_BUG_ON(sc->id != old);
> - sc->id = id;
> - }
> - spin_unlock_irqrestore(&ctrl->lock, flags);
The removed code above always assigned all nr_ents swap entries
atomically, under the lock.
> + old = __swap_cgroup_id_lookup(map, offset);
> + do {
> + iter = __swap_cgroup_id_xchg(map, offset, id);
> + VM_BUG_ON(iter != old);
> + } while (++offset != end);
Here it is possible for some of the nr_ents entries to be changed by
someone else while the loop is still running. Might want to examine
whether a caller can trigger that. Once the spinlock is removed, we
want to make sure this is safe: if two callers race, the nr_ents
entries might not all end up with the same value.
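
To make the concern concrete, here is a single-threaded user-space sketch
that replays one hypothetical interleaving of two writers. It uses C11
atomics in place of the kernel's atomic_try_cmpxchg(), assumes 16-bit ids
packed into a 32-bit word, and puts slot 0 in the low bits (unlike the
patch). Whether real callers can interleave like this is exactly the open
question; the sketch only shows what the result would look like.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define ID_SHIFT  16u
#define ID_MASK   0xffffu
#define ID_PER_SC 2u

/* Emulated 2-byte xchg: each entry is atomic, a range of them is not. */
uint16_t id_xchg(_Atomic uint32_t *map, unsigned long offset, uint16_t new_id)
{
	unsigned int shift = (unsigned int)(offset % ID_PER_SC) * ID_SHIFT;
	_Atomic uint32_t *word = &map[offset / ID_PER_SC];
	uint32_t old_ids = atomic_load(word);
	uint32_t new_ids;

	do {
		new_ids = (old_ids & ~((uint32_t)ID_MASK << shift)) |
			  ((uint32_t)new_id << shift);
	} while (!atomic_compare_exchange_weak(word, &old_ids, new_ids));

	/* On success old_ids still holds the word that was replaced. */
	return (uint16_t)((old_ids >> shift) & ID_MASK);
}

uint16_t id_lookup(_Atomic uint32_t *map, unsigned long offset)
{
	unsigned int shift = (unsigned int)(offset % ID_PER_SC) * ID_SHIFT;

	return (uint16_t)((atomic_load(&map[offset / ID_PER_SC]) >> shift) &
			  ID_MASK);
}

/* Writer A records id 1 on entries 0..3, writer B records id 2 on 0..3. */
void simulate_interleaved_record(uint16_t out[4])
{
	_Atomic uint32_t map[2];
	unsigned long i;

	atomic_init(&map[0], 0);
	atomic_init(&map[1], 0);

	for (i = 0; i < 2; i++)		/* A updates the first half ... */
		(void)id_xchg(map, i, 1);
	for (i = 0; i < 4; i++)		/* ... B runs its whole loop ... */
		(void)id_xchg(map, i, 2);
	assert(id_lookup(map, 3) == 2);
	for (i = 2; i < 4; i++)		/* ... then A finishes. */
		(void)id_xchg(map, i, 1);

	for (i = 0; i < 4; i++)
		out[i] = id_lookup(map, i);
}
```

In this replay the range ends up with mixed ids (2, 2, 1, 1), and writer
A's third exchange returns 2 rather than the 0 it read at the start,
which is the condition the VM_BUG_ON(iter != old) in the new loop checks.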
>
> return old;
> }
> @@ -85,9 +97,13 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
> */
> unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
> {
> + struct swap_cgroup_ctrl *ctrl;
> +
> if (mem_cgroup_disabled())
> return 0;
> - return lookup_swap_cgroup(ent, NULL)->id;
> +
> + ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> + return __swap_cgroup_id_lookup(ctrl->map, swp_offset(ent));
> }
>
> int swap_cgroup_swapon(int type, unsigned long max_pages)
> @@ -98,14 +114,15 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
> if (mem_cgroup_disabled())
> return 0;
>
> - map = vcalloc(max_pages, sizeof(struct swap_cgroup));
> + BUILD_BUG_ON(!ID_PER_SC);
It is simpler just to assert on: sizeof(atomic_t) >= sizeof(unsigned short).
I think that is what it does here.
You might also want to assert: !(sizeof(atomic_t) % sizeof(unsigned short))
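
As a user-space sketch of those two asserts: atomic_t below is again an
assumed stand-in for the kernel type, and map_elems() is a hypothetical
helper that mirrors the DIV_ROUND_UP(max_pages, ID_PER_SC) sizing in the
hunk. In the kernel these would be BUILD_BUG_ON() conditions.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>

/* Stand-in for the kernel's atomic_t; assumed to wrap a single int. */
typedef struct { int counter; } atomic_t;

/* At least one id must fit in a packed word ... */
static_assert(sizeof(atomic_t) >= sizeof(unsigned short),
	      "atomic_t is too small to hold a cgroup id");
/* ... and ids must tile the word exactly, with no leftover bytes. */
static_assert(sizeof(atomic_t) % sizeof(unsigned short) == 0,
	      "atomic_t size is not a multiple of the id size");

/* Number of packed words needed to cover max_pages ids, rounded up. */
size_t map_elems(size_t max_pages)
{
	size_t per = sizeof(atomic_t) / sizeof(unsigned short);

	return (max_pages + per - 1) / per;
}
```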
Chris
> + map = vcalloc(DIV_ROUND_UP(max_pages, ID_PER_SC),
> + sizeof(struct swap_cgroup));
> if (!map)
> goto nomem;
>
> ctrl = &swap_cgroup_ctrl[type];
> mutex_lock(&swap_cgroup_mutex);
> ctrl->map = map;
> - spin_lock_init(&ctrl->lock);
> mutex_unlock(&swap_cgroup_mutex);
>
> return 0;
> --
> 2.47.1
>