From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id AC484CD4F30
	for <linux-mm@archiver.kernel.org>; Wed,  4 Sep 2024 21:10:47 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 118366B02AC; Wed,  4 Sep 2024 17:10:47 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 0C9836B02AE; Wed,  4 Sep 2024 17:10:47 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id E84466B02B0; Wed,  4 Sep 2024 17:10:46 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15])
	by kanga.kvack.org (Postfix) with ESMTP id C6AA36B02AC
	for <linux-mm@kvack.org>; Wed,  4 Sep 2024 17:10:46 -0400 (EDT)
Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay03.hostedemail.com (Postfix) with ESMTP id 3A0CBA0576
	for <linux-mm@kvack.org>; Wed,  4 Sep 2024 21:10:46 +0000 (UTC)
X-FDA: 82528300092.26.AF9AA6A
Received: from mail-lf1-f46.google.com (mail-lf1-f46.google.com [209.85.167.46])
	by imf18.hostedemail.com (Postfix) with ESMTP id 4944D1C0019
	for <linux-mm@kvack.org>; Wed,  4 Sep 2024 21:10:44 +0000 (UTC)
Authentication-Results: imf18.hostedemail.com;
	dkim=pass header.d=google.com header.s=20230601 header.b=0a9RsRuP;
	spf=pass (imf18.hostedemail.com: domain of yosryahmed@google.com designates 209.85.167.46 as permitted sender) smtp.mailfrom=yosryahmed@google.com;
	dmarc=pass (policy=reject) header.from=google.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1725484168;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=XP8HEjcvcsSjCqyyJiUYtMj75FT90mtxfvTOuGNKzpc=;
	b=IdV/yuL7KZI1kTLj38t9iRAxBWY9KmBTAIq4tghPasEQDUZh0ZJVASLzexcQMwKq2JjSG4
	cnHUK5Ddz2zHqwzZodjBliTU2hWi/guB0zzcebVWV6UlT4Pqv0MsrUo3cHAkPSpQKO6xod
	o1odeGwMcBd4ygBc93EM8FYB04l5MCA=
ARC-Authentication-Results: i=1;
	imf18.hostedemail.com;
	dkim=pass header.d=google.com header.s=20230601 header.b=0a9RsRuP;
	spf=pass (imf18.hostedemail.com: domain of yosryahmed@google.com designates 209.85.167.46 as permitted sender) smtp.mailfrom=yosryahmed@google.com;
	dmarc=pass (policy=reject) header.from=google.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1725484168; a=rsa-sha256;
	cv=none;
	b=mdmu3m/UjWD+JKx5DjQzB7RJKBlBjz7llj8PapUK5Bvfh9IXaTAbJN1NKJtiUKmEC/Mory
	9bCW4aBCtSHjKqNAgpUq9aJng9HP2mYRlsySzYxJs91FRS8vfw1R8ktjM6800AtwKa4b0z
	avC5ikYyywpOS6YmU5FFdlZ68oAA9cg=
Received: by mail-lf1-f46.google.com with SMTP id 2adb3069b0e04-5333b2fbedaso12381959e87.0
        for <linux-mm@kvack.org>; Wed, 04 Sep 2024 14:10:43 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20230601; t=1725484242; x=1726089042; darn=kvack.org;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=XP8HEjcvcsSjCqyyJiUYtMj75FT90mtxfvTOuGNKzpc=;
        b=0a9RsRuPOC/Eb42HMaWcD/65Js3XIZWAD/MMQbJfwGOep3DOBZLDP1COHGTNCYtlzE
         pCTMr4ug0QJT30nPd6zq6i5Cb417PTOFurW7eB8z1lzslPD+4zj6EdPpKik5m2rdm8qb
         /XnM1b4/dNOq2ouRwDureoncd33Reya/vdeB6CXPlFTov/bICin0UiXNZCFNYNNduW+4
         R2eBjMaf2fI6397o7lEDp5S/Ht5rgKUsZlTPocFwQ2yDi4yguMxHIMEJ8e+mZmAeoj4z
         vc3YYzXOpvRMYdH8r0MmK8ur6DxGPA+pBWPSC+/v9+RG1hdeegMQJ3kE3/b7iiSR6cLN
         dOEQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1725484242; x=1726089042;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=XP8HEjcvcsSjCqyyJiUYtMj75FT90mtxfvTOuGNKzpc=;
        b=SVVyOikC1bYTmCb2c79OyfyYBqhrkjBC8GvS+FSJrjsHm6+6fhuioouqsLpIF2fV+c
         CH6UIb8idGGxregouo38tN1HWOMaTrJbtL1/oBXpQAHXxUqnkVR9/xgJX/lTWEyF/Mgl
         JUniUeg/p+o6Kat/eIYR9IRjlC7SYNiynw6DxJ7Bi0NgUcpSbCQqxoLznKxj+C+AH9PO
         l97NCO4oiNZNv+hDcQzKosHfpw2HOpcsfqbE1PAFu/CpVCSSiPYJN2zxym7tLTXNaslG
         fFhZYDNnhtL5O8xabrUJcnz7E0cIIluOgAz203A2mSxFARIw7evIRr4se1nZkIxVpA5d
         7Z5A==
X-Forwarded-Encrypted: i=1; AJvYcCV2b2qHhjJCMPVjxWadhw6C3qHQooNH4zKZpk+VhAnOFsPvL+2u/pA+VlDmpkoUitgbHVaeEeae6w==@kvack.org
X-Gm-Message-State: AOJu0Yx97AtPQ/tBe2oPErtYXGvcu/wQgBaUX9VwesZy+UjQg5HQ889U
	HfvYDJoCPGk5mZlgPNGv9RifM5RUeeLolBIEWkQ7esQ5ps+J8maTX381MLo7znD34QOpPLKLnYp
	A0XFhW3EjL/eh0OUCwDGQqwe4PuLPGp0A85fl
X-Google-Smtp-Source: AGHT+IHWolUXkuWL//rhfaUvkN/zThvDPlsWG88X63zP37akORnLDoixS2Ft+XAO3y3JFu8SwbH6T6Q9UATa2h6DvsQ=
X-Received: by 2002:a05:6512:1286:b0:533:e4d:3374 with SMTP id
 2adb3069b0e04-53546ba8fdamr12344003e87.57.1725484241758; Wed, 04 Sep 2024
 14:10:41 -0700 (PDT)
MIME-Version: 1.0
References: <172547884995.206112.808619042206173396.stgit@firesoul>
In-Reply-To: <172547884995.206112.808619042206173396.stgit@firesoul>
From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 4 Sep 2024 14:10:03 -0700
Message-ID: <CAJD7tkak0yZNh+ZQ0FRJhmHPmC5YmccV4Cs+ZOk9DCp4s1ECCA@mail.gmail.com>
Subject: Re: [PATCH V10] cgroup/rstat: Avoid flushing if there is an ongoing
 root flush
To: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: tj@kernel.org, cgroups@vger.kernel.org, shakeel.butt@linux.dev, 
	hannes@cmpxchg.org, lizefan.x@bytedance.com, longman@redhat.com, 
	kernel-team@cloudflare.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Stat-Signature: gffkk5orxw9g1ecam69jc8e1o5s83567
X-Rspam-User: 
X-Rspamd-Queue-Id: 4944D1C0019
X-Rspamd-Server: rspam02
X-HE-Tag: 1725484244-519704
X-HE-Meta: U2FsdGVkX1+9PiF5zA+7TBxA8WaWGAEToOOO7+sEiFl6e6jokkW/GY0wvlp9VWcOjlFR7lEagHVo1OVc/wrlVvtLVsnNQPJAMjPeNAOu96zwKD1wUIKwJd+hpBLz2nbNGymMG2NfvkGI9CQ9ZzOs89Hz+umhGU/DFqgZdkop1oenXpP2GIn7zrVKd758j84C8w7p1EzT2xbDQBGjMEy4ogW3jm8i2Ub8rkrxnp51QZxH+p3CyIgqIQFBB1J+NrKXXO2bTrCcSQT/7Kv2GzUV/TnyKPthqpDmqUDMVVnMHaF27/kZeiCkQy2IK/vsxTceSeGdSp7YqRMsb6af/SmbZiYwhDzK1oGJmH116Q4YJxZF96GQf39Uc7+14b1iq/M8VMU0DoIYNqfSTCZ86SwnsSDDYoXp0jmqFU5FKIhcxeK1XlQgM61LG+l/eY88RWf5rzrVorQRq+mLyW/4Fbx9AAdOA0nJG0kbmQAD3va42tJeBaVzWb7As+t+4+9NQo52OE9/E2HsDPDuh9MdQyAGIhH8bm7fmQNMa9ZIRD14AFnebjW4POEYzigTpUJtopSBALzaklWWBtFGWmMRBavOaQC4KXCYKjjShbhEoWEOVJ5BRCroaRjb7rDM2IPNEPKrY2df88g+I+1WB4mDOQx53coBAMBjJY3iXzxX3gZ/O/b3z3cHV044DgOMFD7Fa/N4rf66qHBmSMpwu/n2XHfFKfXun5pGquh/5d6ZvlQ2C18VoLqHZKNgRLAFFPFhQNddcl2ywj8G3QderhC7L7jvMGLwhhPxDj8ZbmMmqZ4yn4dH66PHQAdp5bXtYsa3xVgzOBkOPhcMzoVsdnH+Yk3aNpfzhQpvFJ+VYDqenLUZDIJAm5MzpS6XOob8nJ1Bh+n6Fe8jfYm9vW1khoHia9rR6HilOA+nN/bEZREFQRf57D/pGRyB4Wz+dU4oYdvq+aYwNFZSbqVtQ1HaSmluPzJ
 BS2twIoZ
 DmHoYVAlHZfEDrJxChKOyZHRBSOyG40H+7DH9SioS6AWLq0mUgIDxCHHBNA3A3vOP9sk01weEVz37VKxIhbjITDHAdbGtUrRPgU5iYZQ/wPqbw2EDysrsUVuD7OHrhodGPrQKbhB11m79ENyp0HbtxsFVLrwwynPi8UGCGkUAXnK4jRK9uNKSYKLWEOubwU+HjVSfD2nq8uLhMO0lR2j37k3UZiArGqg1k3Ws
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000006, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Wed, Sep 4, 2024 at 12:41=E2=80=AFPM Jesper Dangaard Brouer <hawk@kernel=
.org> wrote:
>
> This patch reintroduces and generalizes the "stats_flush_ongoing" concept=
 to
> avoid redundant flushes if there is an ongoing flush at cgroup root level=
,
> addressing production lock contention issues on the global cgroup rstat l=
ock.
>
> At Cloudflare, we observed significant performance degradation due to
> lock contention on the rstat lock, primarily caused by kswapd. The
> specific mem_cgroup_flush_stats() call inlined in shrink_node, which
> takes the rstat lock, is particularly problematic.
>
> On our 12 NUMA node machines, each with a kswapd kthread per NUMA node, w=
e
> noted severe lock contention on the rstat lock, causing 12 CPUs to waste
> cycles spinning every time kswapd runs. Fleet-wide stats (/proc/N/schedst=
at)
> for kthreads revealed that we are burning an average of 20,000 CPU cores
> fleet-wide on kswapd, primarily due to spinning on the rstat lock.
>
> Here's a brief overview of the issue:
> - __alloc_pages_slowpath calls wake_all_kswapds, causing all kswapdN thre=
ads
>   to wake up simultaneously.
> - The kswapd thread invokes shrink_node (via balance_pgdat), triggering t=
he
>   cgroup rstat flush operation as part of its work.
> - balance_pgdat() has a NULL value in target_mem_cgroup, causing
>   mem_cgroup_flush_stats() to flush with root_mem_cgroup.
>
> The kernel previously addressed this with a "stats_flush_ongoing" concept=
,
> which was removed in commit 7d7ef0a4686a ("mm: memcg: restore subtree sta=
ts
> flushing"). This patch reintroduces and generalizes the concept to apply =
to
> all users of cgroup rstat, not just memcg.
>
> In this patch only a root cgroup can become the ongoing flusher, as this =
solves
> the production issue. Letting other levels becoming ongoing flusher cause=
 root
> cgroup to contend on the lock again.
>
> Some in-kernel users of the cgroup_rstat_flush() API depend on waiting fo=
r the
> flush to complete before continuing. This patch introduce the call
> cgroup_rstat_flush_relaxed() and use it in those cases that can live with
> slightly inaccurate flushes.
>
> This change significantly reduces lock contention, especially in
> environments with multiple NUMA nodes, thereby improving overall system
> performance.
>
> Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing").
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>

I am honestly not happy with this patch, see below.

> ---
>

Please include a brief delta vs the previous version here to save
reviewers the duplicated effort of figuring it out.

> V9: https://lore.kernel.org/all/172245504313.3147408.12138439169548255896=
.stgit@firesoul/
> V8: https://lore.kernel.org/all/172139415725.3084888.13770938453137383953=
.stgit@firesoul
> V7: https://lore.kernel.org/all/172070450139.2992819.13210624094367257881=
.stgit@firesoul
> V6: https://lore.kernel.org/all/172052399087.2357901.4955042377343593447.=
stgit@firesoul/
> V5: https://lore.kernel.org/all/171956951930.1897969.8709279863947931285.=
stgit@firesoul/
> V4: https://lore.kernel.org/all/171952312320.1810550.13209360603489797077=
.stgit@firesoul/
> V3: https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.=
stgit@firesoul/
> V2: https://lore.kernel.org/all/171923011608.1500238.3591002573732683639.=
stgit@firesoul/
> V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519=
.stgit@firesoul/
> RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252=
.stgit@firesoul/
[..]
> @@ -299,6 +301,67 @@ static inline void __cgroup_rstat_unlock(struct cgro=
up *cgrp, int cpu_in_loop)
>         spin_unlock_irq(&cgroup_rstat_lock);
>  }
>
> +static inline bool cgroup_is_root(struct cgroup *cgrp)
> +{
> +       return cgroup_parent(cgrp) =3D=3D NULL;
> +}
> +
> +/**
> + * cgroup_rstat_trylock_flusher - Trylock that checks for on ongoing flu=
sher
> + * @cgrp: target cgroup
> + *
> + * Function return value follow trylock semantics. Returning true when l=
ock is
> + * obtained. Returning false when not locked and it detected flushing ca=
n be
> + * skipped as another ongoing flusher is taking care of the flush.
> + *
> + * For callers that depend on flush completing before returning a strict=
 option
> + * is provided.
> + */
> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp, bool stric=
t)
> +{
> +       struct cgroup *ongoing;
> +
> +       if (strict)
> +               goto lock;
> +
> +       /*
> +        * Check if ongoing flusher is already taking care of this.  Desc=
endant
> +        * check is necessary due to cgroup v1 supporting multiple root's=
.
> +        */
> +       ongoing =3D READ_ONCE(cgrp_rstat_ongoing_flusher);
> +       if (ongoing && cgroup_is_descendant(cgrp, ongoing))
> +               return false;

Why did we drop the agreed upon method of waiting until the flushers
are done? This is now a much more intrusive patch which makes all
flushers skip if a root is currently flushing. This causes
user-visible problems and is something that I worked hard to fix. I
thought we got good results with waiting for the ongoing flusher as
long as it is a root? What changed?

You also never addressed my concern here about 'ongoing' while we are
accessing it, and never responded to my question in v8 about expanding
this to support non-root cgroups once we shift to a mutex.

I don't appreciate the silent yet drastic change made in this version
and without addressing concerns raised in previous versions. Please
let me know if I missed something.

> +
> +       /* Grab right to be ongoing flusher */
> +       if (!ongoing && cgroup_is_root(cgrp)) {
> +               struct cgroup *old;
> +
> +               old =3D cmpxchg(&cgrp_rstat_ongoing_flusher, NULL, cgrp);
> +               if (old) {
> +                       /* Lost race for being ongoing flusher */
> +                       if (cgroup_is_descendant(cgrp, old))
> +                               return false;
> +               }
> +               /* Due to lock yield combined with strict mode record ID =
*/
> +               WRITE_ONCE(cgrp_rstat_ongoing_flusher_ID, current);

I am not sure I understand why we need this, do you mind elaborating?