From: Kairui Song <ryncsn@gmail.com>
Date: Tue, 10 Dec 2024 15:05:31 +0800
Subject: Re: [PATCH 4/4] mm, swap_cgroup: remove global swap cgroup lock
To: Chris Li
Cc: linux-mm@kvack.org, Andrew Morton, Hugh Dickins, "Huang, Ying",
 Yosry Ahmed, Roman Gushchin, Shakeel Butt, Johannes Weiner, Barry Song,
 Michal Hocko, linux-kernel@vger.kernel.org
References: <20241202184154.19321-1-ryncsn@gmail.com>
 <20241202184154.19321-5-ryncsn@gmail.com>
On Thu, Dec 5, 2024 at 3:40 AM Chris Li wrote:
>
> On Mon, Dec 2, 2024 at 10:42 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <ryncsn@gmail.com>
> >
> > commit e9e58a4ec3b1 ("memcg: avoid use cmpxchg in swap cgroup maintainance")
> > replaced the cmpxchg/xchg with a global irq spinlock because some archs
> > don't support 2-byte cmpxchg/xchg. Clearly this won't scale well.
> >
> > And as commented in swap_cgroup.c, this lock is not needed for map
> > synchronization.
> >
> > Emulating 2-byte cmpxchg/xchg with atomics isn't hard, so implement
> > it to get rid of this lock.
> >
> > Testing with a 64G brd device, building the kernel with make -j96 in a
> > 1.5G memory cgroup using 4k folios, showed the improvement below
> > (10 test runs):
> >
> > Before this series:
> > Sys time: 10730.08 (stdev 49.030728)
> > Real time: 171.03 (stdev 0.850355)
> >
> > After this commit:
> > Sys time: 9612.24 (stdev 66.310789), -10.42%
> > Real time: 159.78 (stdev 0.577193), -6.57%
> >
> > With 64k folios and a 2G memcg:
> > Before this series:
> > Sys time: 7626.77 (stdev 43.545517)
> > Real time: 136.22 (stdev 1.265544)
> >
> > After this commit:
> > Sys time: 6936.03 (stdev 39.996280), -9.06%
> > Real time: 129.65 (stdev 0.880039), -4.82%
> >
> > Sequential swapout of 8G 4k zero folios (24 test runs):
> > Before this series:
> > 5461409.12 us (stdev 183957.827084)
> >
> > After this commit:
> > 5420447.26 us (stdev 196419.240317)
> >
> > Sequential swapin of 8G 4k zero folios (24 test runs):
> > Before this series:
> > 19736958.916667 us (stdev 189027.246676)
> >
> > After this commit:
> > 19662182.629630 us (stdev 172717.640614)
> >
> > Performance is better, or at least not worse, for all tests above.
> >
> > Signed-off-by: Kairui Song <ryncsn@gmail.com>
> > ---
> >  mm/swap_cgroup.c | 56 +++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 41 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
> > index a76afdc3666a..028f5e6be3f0 100644
> > --- a/mm/swap_cgroup.c
> > +++ b/mm/swap_cgroup.c
> > @@ -5,6 +5,15 @@
> >
> >  #include <linux/swapops.h> /* depends on mm.h include */
> >
> > +#define ID_PER_UNIT (sizeof(atomic_t) / sizeof(unsigned short))
>
> You might want to add a compile-time assert that (sizeof(atomic_t)
> % sizeof(unsigned short)) is zero.

Could not hurt.

> > +struct swap_cgroup_unit {
> > +        union {
> > +                int raw;
> > +                atomic_t val;
> > +                unsigned short __id[ID_PER_UNIT];
> > +        };
> > +};
>
> I suggest just getting rid of this complicated struct/union and using
> bit shifts and masks to get the u16 out of the atomic_t.

Good suggestion.
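Something like this, I think (an untested sketch only, against an
atomic_t map as you suggest; __swap_cgroup_id_xchg is a placeholder
name, not what this patch uses, and BITS_PER_TYPE / BUILD_BUG_ON are
the generic helpers from <linux/bits.h> and <linux/build_bug.h>):

static unsigned short __swap_cgroup_id_xchg(atomic_t *map, pgoff_t offset,
                                            unsigned short new_id)
{
        atomic_t *unit = &map[offset / ID_PER_UNIT];
        unsigned int shift = (offset % ID_PER_UNIT) *
                             BITS_PER_TYPE(unsigned short);
        int new, old = atomic_read(unit);

        /* The compile-time check suggested above. */
        BUILD_BUG_ON(sizeof(atomic_t) % sizeof(unsigned short));

        do {
                /* Replace only our 16 bits, keep the neighbour's ID intact. */
                unsigned int tmp = (unsigned int)old;

                tmp &= ~(0xffffu << shift);
                tmp |= (unsigned int)new_id << shift;
                new = (int)tmp;
        } while (!atomic_try_cmpxchg(unit, &old, new));

        return ((unsigned int)old >> shift) & 0xffff;
}

All shifts happen on the loaded value in registers, so no loads and
stores through a temporary, and no dependence on how the u16s are laid
out in memory.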
> > +
> >  static DEFINE_MUTEX(swap_cgroup_mutex);
> >
> >  struct swap_cgroup {
> > @@ -12,8 +21,10 @@ struct swap_cgroup {
> >  };
> >
> >  struct swap_cgroup_ctrl {
> > -        unsigned short *map;
> > -        spinlock_t lock;
> > +        union {
> > +                struct swap_cgroup_unit *units;
> > +                unsigned short *map;
>
> You really shouldn't access the map as an "unsigned short" array;
> therefore, I suggest changing the array pointer to "atomic_t".
>
> > +        };
> >  };
> >
> >  static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
> > @@ -31,6 +42,24 @@ static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
> >   *
> >   * TODO: we can push these buffers out to HIGHMEM.
> >   */
> > +static unsigned short __swap_cgroup_xchg(void *map,
> > +                                         pgoff_t offset,
> > +                                         unsigned int new_id)
> > +{
> > +        unsigned int old_id;
> > +        struct swap_cgroup_unit *units = map;
> > +        struct swap_cgroup_unit *unit = &units[offset / ID_PER_UNIT];
> > +        struct swap_cgroup_unit new, old = { .raw = atomic_read(&unit->val) };
> > +
> > +        do {
> > +                new.raw = old.raw;
> > +                old_id = old.__id[offset % ID_PER_UNIT];
> > +                new.__id[offset % ID_PER_UNIT] = new_id;
> > +        } while (!atomic_try_cmpxchg(&unit->val, &old.raw, new.raw));
>
> I suggest just calculating the atomic_t offset (offset / ID_PER_UNIT)
> and getting the address of the atomic_t, then using the mask and shift
> to construct the new atomic_t value. It is likely to generate better
> code: you don't want the compiler emitting memory loads and stores to
> construct the temporary new value. I haven't checked the generated
> machine code, but I suspect the compiler is not smart enough to turn
> those into register shifts here, which is what you really want.
>
> > +
> > +        return old_id;
> > +}
> > +
> >  /**
> >   * swap_cgroup_record - record mem_cgroup for a set of swap entries
> >   * @ent: the first swap entry to be recorded into
> > @@ -44,22 +73,19 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
> >                                    unsigned int nr_ents)
> >  {
> >          struct swap_cgroup_ctrl *ctrl;
> > -        unsigned short *map;
> > -        unsigned short old;
> > -        unsigned long flags;
> >          pgoff_t offset = swp_offset(ent);
> >          pgoff_t end = offset + nr_ents;
> > +        unsigned short old, iter;
> > +        unsigned short *map;
>
> Make it an atomic_t pointer here as well.
>
> >
> >          ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> >          map = ctrl->map;
> >
> > -        spin_lock_irqsave(&ctrl->lock, flags);
> > -        old = map[offset];
> > +        old = READ_ONCE(map[offset]);
>
> Ah, you shouldn't read the u16 directly. That runs into the endianness
> problem of how the u16 is arranged inside the atomic_t. You should do
> an atomic read and then shift the bits out, so you don't have the
> endian problem. It is a bad idea to mix atomic updates with reads from
> the middle of the atomic location.

Good suggestion. Converting the whole map into atomic_t and doing the
access/xchg with bit shifts is also OK; mixing atomic with other types
may indeed lead to misuse.

> Chris
>
> >          do {
> > -                VM_BUG_ON(map[offset] != old);
> > -                map[offset] = id;
> > +                iter = __swap_cgroup_xchg(map, offset, id);
> > +                VM_BUG_ON(iter != old);
> >          } while (++offset != end);
> > -        spin_unlock_irqrestore(&ctrl->lock, flags);
> >
> >          return old;
> >  }
> > @@ -85,20 +111,20 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
> >
> >  int swap_cgroup_swapon(int type, unsigned long max_pages)
> >  {
> > -        void *map;
> > +        struct swap_cgroup_unit *units;
> >          struct swap_cgroup_ctrl *ctrl;
> >
> >          if (mem_cgroup_disabled())
> >                  return 0;
> >
> > -        map = vzalloc(max_pages * sizeof(unsigned short));
> > -        if (!map)
> > +        units = vzalloc(DIV_ROUND_UP(max_pages, ID_PER_UNIT) *
> > +                        sizeof(struct swap_cgroup_unit));
> > +        if (!units)
> >                  goto nomem;
> >
> >          ctrl = &swap_cgroup_ctrl[type];
> >          mutex_lock(&swap_cgroup_mutex);
> > -        ctrl->map = map;
> > -        spin_lock_init(&ctrl->lock);
> > +        ctrl->units = units;
> >          mutex_unlock(&swap_cgroup_mutex);
> >
> >          return 0;
> > --
> > 2.47.0
> >
>
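PS: to spell out the endian point on the read side -- with the whole
map converted to atomic_t, a lookup is just an atomic read plus a value
shift (again an untested sketch; __swap_cgroup_id_lookup is a
placeholder name, not what this patch uses):

static unsigned short __swap_cgroup_id_lookup(atomic_t *map, pgoff_t offset)
{
        unsigned int shift = (offset % ID_PER_UNIT) *
                             BITS_PER_TYPE(unsigned short);
        /* Read the whole unit atomically, then pick our 16 bits out. */
        unsigned int ids = atomic_read(&map[offset / ID_PER_UNIT]);

        return (ids >> shift) & 0xffff;
}

The shift happens on the loaded value rather than through a u16 pointer
into the middle of the atomic_t, so it behaves the same on little- and
big-endian machines.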